<a href="https://colab.research.google.com/github/Pink-Raccoon/mldm_hs22/blob/main/labs/L01_Data-Processing_LAB_ASSIGNMENT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import sklearn
import pandas as pd

In [None]:
RANDOM_SEED = 0x0

# TASK 1 (2 Points): 

We work with the "Wine Recognition" dataset. You can read more about this dataset at [https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset).

The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators.
The data is loaded below and split into `data` and `target`. `data` is a `DataFrame` that contains the results of the chemical analysis, while `target` contains an integer representing the wine cultivator.

In [None]:
from sklearn.datasets import load_wine
(data, target) = load_wine(return_X_y=True, as_frame=True)

In [None]:
data

In [None]:
target

Next, the data is split into training data and testing data.
The training data is used to train the model, while the testing data is used to evaluate the model on data it was not trained on. You will learn later in the course why this is necessary.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)


In the following, we define functions to classify the data. We use a [Decision Tree Classifier](https://scikit-learn.org/stable/modules/tree.html#tree) and a [Support Vector Classifier](https://scikit-learn.org/stable/modules/svm.html#svm-classification). You will learn later in the course how these classifiers work.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_classifier(clf, X_train, y_train, X_test, y_test):
  clf.fit(X_train, y_train)
  y_test_predicted = clf.predict(X_test)
  return accuracy_score(y_test, y_test_predicted)


def run_decision_tree(X_train, y_train, X_test, y_test):
  clf = DecisionTreeClassifier(random_state=0)
  accuracy = run_classifier(clf, X_train, y_train, X_test, y_test)
  print("The accuracy of the Decision Tree classifier is", accuracy)

def run_svc(X_train, y_train, X_test, y_test):
  clf = SVC(random_state=0)
  accuracy = run_classifier(clf, X_train, y_train, X_test, y_test)
  print("The accuracy of the Support Vector classifier is", accuracy)


### Task 1a: Classify the data

Classify the data by calling the two functions `run_decision_tree` and `run_svc`.
Which classifier works better (i.e. achieves the higher accuracy)?

In [None]:
run_decision_tree(X_train, y_train, X_test, y_test)

In [None]:
run_svc(X_train, y_train, X_test, y_test)

In [None]:
# The Decision Tree classifier is more accurate, with 96%

### Task 1b: Normalize the data with mean and standard deviation

Normalize the training and testing data using the following formula:

$$X_{normalized} = \frac{X-\mu_X}{\sigma_X}$$

Calculate the mean and standard deviation __on the training data__ only (also when you normalize the testing dataset).

`Pandas` provides built-in functions to calculate the average and the standard deviation. For example, `X_train.mean()` returns the average value per feature in the training dataset while `X_train.std()` returns the standard deviation per feature.
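
As a minimal sketch of the formula on made-up numbers (the `toy_train` and `toy_test` frames and the `height` column are hypothetical and not part of the wine data), the snippet below normalizes both splits with statistics taken from the training split only. Note that `std()` in pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the formula.
toy_train = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0]})
toy_test = pd.DataFrame({"height": [165.0, 190.0]})

# Mean and standard deviation are computed on the training split only.
mu = toy_train.mean()
sigma = toy_train.std()  # sample standard deviation (ddof=1) by default

# The same training statistics are applied to both splits.
toy_train_norm = (toy_train - mu) / sigma
toy_test_norm = (toy_test - mu) / sigma

print(toy_train_norm["height"].mean())  # approximately 0
print(toy_train_norm["height"].std())   # approximately 1
print(toy_test_norm["height"].tolist())
```

The normalized training column has mean 0 and standard deviation 1 (up to rounding). The testing column generally does not, because it is shifted and scaled with the training statistics.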

In [None]:
mean_standard_normalization_train = (X_train - X_train.mean())/X_train.std()
mean_standard_normalization_test = (X_test - X_train.mean())/X_train.std()


In [None]:
run_decision_tree(mean_standard_normalization_train, y_train, mean_standard_normalization_test, y_test)

Call the two classification functions again with the normalized data and report the changes in accuracy. What do you notice?

In [None]:
run_svc(mean_standard_normalization_train, y_train, mean_standard_normalization_test, y_test)

In [None]:
# Now the Support Vector classifier is more accurate, with 98%

### Task 1c: Repeat Task 1b with min-max Normalization

Repeat Task 1b but use the following formula to normalize the data:

$$X_{normalized} = \frac{X-X_{min}}{X_{max} - X_{min}}$$

Again, calculate the minimum and maximum __on the training data__ only (also when you normalize the testing dataset) and use the built-in functions `X_train.min()` and `X_train.max()`.
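
A minimal sketch of min-max normalization on made-up numbers (the `toy_train` and `toy_test` frames and the `height` column are hypothetical) may help to see what the formula does. The minimum and the range are taken from the training split and reused for the testing split.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the formula.
toy_train = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0]})
toy_test = pd.DataFrame({"height": [165.0, 190.0]})

# Minimum and range (max - min) are computed on the training split only.
col_min = toy_train.min()
col_range = toy_train.max() - toy_train.min()

toy_train_scaled = (toy_train - col_min) / col_range
toy_test_scaled = (toy_test - col_min) / col_range

print(toy_train_scaled["height"].tolist())
print(toy_test_scaled["height"].tolist())
```

The training values end up in the interval from 0 to 1. Testing values that lie outside the training range (here 190) are mapped outside that interval.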

In [None]:
# Min-max normalization; minimum and range come from the training data only
train_min = X_train.min()
train_range = X_train.max() - X_train.min()
normalized_min_max_train = (X_train - train_min)/train_range
normalized_min_max_test = (X_test - train_min)/train_range
run_svc(normalized_min_max_train, y_train, normalized_min_max_test, y_test)
run_decision_tree(normalized_min_max_train, y_train, normalized_min_max_test, y_test)

In [None]:
#Same as in 1b, support vector classifier is more accurate with 98%

## 📢 **HAND-IN** 📢: Report on Moodle whether you solved this task.

---
# TASK 2 (2 Points): 

In Task 1 we clearly saw that normalization improves the result for Support Vector Classifiers but not for Decision Trees. You will learn later in the course why Decision Trees don't need normalization.

However, to better understand the influence of normalization, we will plot the data with and without normalization.


In [None]:
import seaborn as sns
sns.set_theme(style="ticks")
from sklearn.datasets import load_wine
(data, target) = load_wine(return_X_y=True, as_frame=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)

### Task 2a: Plot the unnormalized data

For simplicity, we consider only the columns `alcohol` and `malic_acid` from the training dataset.

Create a [Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) from the data with the attribute `alcohol` on the `x`-axis and `malic_acid` on the `y`-axis.

Plot the un-normalized data `X_train` as well as the two normalized versions from Task 1 in the same plot and describe what happens.

__Hint:__ To visualize the data distribution in the same plot just call `sns.scatterplot` three times within the same code-cell.

In [None]:

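# Pass x and y as keyword arguments; the labels distinguish the three versions in the legend
sns.scatterplot(x=X_train.alcohol, y=X_train.malic_acid, label="unnormalized")
sns.scatterplot(x=normalized_min_max_train.alcohol, y=normalized_min_max_train.malic_acid, label="min-max")
sns.scatterplot(x=mean_standard_normalization_train.alcohol, y=mean_standard_normalization_train.malic_acid, label="mean/std")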




We will now have a closer look at the data. For the un-normalized data as well as for the two normalized versions of the data, calculate the following for the column `alcohol`:

- The average value: `avg(alcohol)`
- The standard deviation: `std(alcohol)`
- The minimum value: `min(alcohol)`
- The maximum value: `max(alcohol)`
- The range, obtained by subtracting the minimum from the maximum: `max(alcohol) - min(alcohol)`

Compare the properties of the un-normalized data with the normalized data. What do you notice?
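
One compact way to collect these properties side by side is sketched below. The `summarize` helper and the `toy` series are hypothetical names introduced only for illustration; with the lab data, the `alcohol` column of each of the three training frames would be passed instead.

```python
import pandas as pd

def summarize(column: pd.Series) -> pd.Series:
    """Return mean, std, min, max and range of one column."""
    stats = column.agg(["mean", "std", "min", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats

# Hypothetical toy values standing in for the alcohol column.
toy = pd.Series([11.0, 12.0, 13.0, 14.0], name="alcohol")

summary = pd.DataFrame({
    "unnormalized": summarize(toy),
    "mean/std": summarize((toy - toy.mean()) / toy.std()),
    "min-max": summarize((toy - toy.min()) / (toy.max() - toy.min())),
})
print(summary)
```

Putting the three versions into one table makes it easier to compare which properties each normalization fixes and which it leaves free.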

In [None]:
mean_alc = X_train.alcohol.mean()
print("mean_alc",mean_alc)
mean_norm_stand = mean_standard_normalization_train.alcohol.mean()
print("mean_norm_stand",mean_norm_stand)
mean_min_norm = normalized_min_max_train.alcohol.mean()
print("mean_min_norm",mean_min_norm)




In [None]:
std_alc = X_train.alcohol.std()
print("std_alc",std_alc)
std_norm_stand = mean_standard_normalization_train.alcohol.std()
print("std_norm_stand",std_norm_stand)
std_min_norm = normalized_min_max_train.alcohol.std()
print("std_min_norm",std_min_norm)




In [None]:

min_alc = X_train.alcohol.min()
print("min_alc",min_alc)
min_norm_stand = mean_standard_normalization_train.alcohol.min()
print("min_norm_stand",min_norm_stand)
min_min_norm = normalized_min_max_train.alcohol.min()
print("min_min_norm",min_min_norm)





In [None]:
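max_alc = X_train.alcohol.max()
print("max_alc",max_alc)
max_norm_stand = mean_standard_normalization_train.alcohol.max()
print("max_norm_stand",max_norm_stand)
max_min_norm = normalized_min_max_train.alcohol.max()
print("max_min_norm",max_min_norm)

# Range (max - min) for all three versions of the data
print("range_alc", max_alc - min_alc)
print("range_norm_stand", max_norm_stand - min_norm_stand)
print("range_min_norm", max_min_norm - min_min_norm)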

## 📢 **HAND-IN** 📢: Report on Moodle whether you solved this task.

---

# TASK 3 (6 Points): Binning



The following list consists of the age of several people: 
```python
[13, 15, 16, 18, 19, 20, 20, 21, 22, 22, 25, 25, 26, 26, 30, 33, 34, 35, 35, 35, 36, 37, 40, 42, 46, 53, 70]
```

### Task 3a: Equal-Width Binning
Apply binning to the dataset using 3 equal-width bins. Smooth the data using the mean of the bins.

Tips:
1. Calculate the size of the bins
2. Assign each value to the corresponding bin
3. Calculate the mean per bin
4. Replace each value by the mean of its bin

__Solve this exercise by hand without using Python__
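
Once you have a result by hand, a short script can serve as a sanity check of the procedure. The sketch below follows the four tips on made-up values (not the age list). The function name `equal_width_smooth` is hypothetical, and the sketch assumes left-closed intervals with the maximum value placed in the last bin; other boundary conventions are possible.

```python
def equal_width_smooth(values, n_bins):
    """Smooth values by the mean of their equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # 1. size of each bin

    # 2. bin index per value; the maximum is clamped into the last bin
    indices = [min(int((v - lo) / width), n_bins - 1) for v in values]

    # 3. mean per (non-empty) bin
    means = {}
    for b in set(indices):
        members = [v for v, i in zip(values, indices) if i == b]
        means[b] = sum(members) / len(members)

    # 4. replace each value by the mean of its bin
    return [means[i] for i in indices]

toy_values = [2, 4, 5, 11, 12, 20]  # hypothetical example data
smoothed = equal_width_smooth(toy_values, 3)
print(smoothed)
```

Here the bin width is (20 - 2) / 3 = 6, so the bins cover 2 to 8, 8 to 14, and 14 to 20, and they hold three, two, and one value respectively.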

❗ TODO ❗

### Task 3b: Equal-Depth Binning

Apply binning to the dataset using 3 equal-depth bins. Smooth the data using the mean of the bins. Explain the steps of your approach and give the final result.

Tips:
1. Calculate the number of elements per bin
2. Assign each value to the corresponding bin
3. Calculate the mean per bin
4. Replace each value by the mean of its bin

__Please solve this exercise by hand without using Python__ 
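
As with equal-width binning, a small script can be used to check the procedure after solving the task by hand. The sketch below uses made-up values (not the age list), the function name `equal_depth_smooth` is hypothetical, and it assumes that the number of values is divisible by the number of bins.

```python
def equal_depth_smooth(values, n_bins):
    """Smooth sorted values by the mean of their equal-depth bin."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    depth = len(ordered) // n_bins  # 1. number of elements per bin

    smoothed = []
    for start in range(0, len(ordered), depth):
        members = ordered[start:start + depth]   # 2. values of this bin
        mean = sum(members) / len(members)       # 3. mean of this bin
        smoothed.extend([mean] * len(members))   # 4. replace values by the mean
    return smoothed

toy_values = [2, 4, 5, 11, 12, 20]  # hypothetical example data
smoothed_depth = equal_depth_smooth(toy_values, 3)
print(smoothed_depth)
```

With six values and three bins, every bin holds exactly two values, regardless of how far apart those values are.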


❗ TODO ❗

## 📢 **HAND-IN** 📢: Describe on Moodle the results of Task 3:

* Copy the results of Task 3a and Task 3b to Moodle
* Describe the differences between task 3a and task 3b
* Describe situations when binning should be used and give a concrete example. Are there also circumstances in which binning should not be applied?