![crack](https://www.deltawines.eu/assets/files/shutterstock-532006042-72-1.1920x0.jpg)

# Wine Grading

Let's practice Decision Trees & Random Forest on a super cool dataset. We'll be trying to predict the quality of a given wine! 

Your goal will be to:

1. Preprocess the data
2. Create a classification algorithm

Happy Coding!

## Step 1 - Import Data 🤹‍♀️

- Import the usual libraries

In [None]:
import pandas as pd
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt 

from sklearn.tree import plot_tree

- Import `Wine_grading.csv` and visualize the dataset

In [None]:
df = pd.read_csv("./assets/ML/Wine_grading.csv")
df.head()

- Remove the `Unnamed: 0` column from the dataset

In [None]:
#df = df.drop(columns=["Unnamed: 0"], axis=1)
df = df.drop(columns=["Unnamed: 0"])
df.head()

## Step 2 - EDA 📊

- Visualize `alcohol` and `Grade`

In [None]:
sns.catplot(data = df, x="Grade", y="alcohol", kind="bar")

- Visualize `magnesium` and `Grade`

In [None]:
sns.catplot(data = df, x="Grade", y="magnesium", kind="bar")

- Visualize `color_intensity` and `Grade`

In [None]:
sns.catplot(data = df, x="Grade", y="color_intensity", kind="bar")

* Show your dataset's main statistics

In [None]:
df.describe(include="all")

- Let's take a look at missing values

In [None]:
df.isna().sum() 

In [None]:
df.isnull().any()

## Step 3 - Preprocessing 🍳

- Split your dataset into features $X$ and target $y$

In [None]:
y = df.loc[:,"Grade"]                                      

# features_list = df.columns[:-1]
# X = df.loc[:,features_list]
X = df.drop(columns=["Grade"])                             


- Split your data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0, 
                                                    stratify = y) 
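The `stratify = y` argument keeps the class proportions of `Grade` roughly the same in the train and test sets. A minimal sketch of that effect, assuming made-up labels (`X_toy` and `y_toy` are hypothetical and are not the wine data):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, three imbalanced classes (50 / 30 / 20)
X_toy = [[i] for i in range(100)]
y_toy = [0] * 50 + [1] * 30 + [2] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratification, both splits keep the 50 / 30 / 20 class ratio
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", sorted(train_counts.items()))
print("test :", sorted(test_counts.items()))
```

Without `stratify`, a small test set can end up with very few samples of a rare class, which makes the test scores harder to interpret.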

- Apply all the required preprocessing to the train set

In [None]:
print(X_train.head())                                               # print first 5 rows (X_train is still a DataFrame at this point)

In [None]:

numeric_features = list(range(13)) 
numeric_transformer = StandardScaler()

feature_encoder = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
        ]
    )
X_train = feature_encoder.fit_transform(X_train)
print(X_train[0:5,0:3])                                               # print first 5 rows and first 3 columns (slicing instead of iloc since X_train is now a numpy array)
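`StandardScaler` centers each column on its mean and divides it by its (population) standard deviation. The sketch below reproduces that arithmetic by hand on one made-up feature, to show why the scaler is fitted on the train set only and then reused on the test set. The helper names `fit_scaler` and `transform` are hypothetical, not scikit-learn functions.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Made-up values for a single feature (not taken from Wine_grading.csv)
train_alcohol = [12.0, 13.0, 14.0, 13.0]
test_alcohol = [13.5, 12.5]

mu, sigma = fit_scaler(train_alcohol)              # statistics come from the train set only
train_scaled = transform(train_alcohol, mu, sigma)
test_scaled = transform(test_alcohol, mu, sigma)   # the test set reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled train values have mean 0 and standard deviation 1. The test values generally do not, which is expected: they are measured on the scale learned from the train set.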

## Step 4 - Build your model 🏋️‍♂️

<!-- - Create your Logistic Regression model -->
- Create your Decision Tree Classifier

In [None]:
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)                  # This step is the actual training
y_train_pred = classifier.predict(X_train)

- Evaluate it (don't forget to preprocess X_test)

In [None]:
X_test = feature_encoder.transform(X_test)
y_test_pred = classifier.predict(X_test)

- Look at your model scores on train and test

- What can you say about it?

- Create the confusion matrix with `ConfusionMatrixDisplay` (the older `plot_confusion_matrix` helper has been removed from recent scikit-learn versions)

In [None]:
# Plot confusion matrix on train set
cm = confusion_matrix(y_train, y_train_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
cm_display.ax_.set_title("Confusion matrix on train set") 
plt.show() 

In [None]:
# https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
# Precision : Out of all the predicted positives, what percentage is truly positive (matters e.g. for spam filtering).
# Recall    : Out of all the actual positives, what percentage is predicted positive (matters e.g. for medicine, credit card fraud).
# F1        : Harmonic mean of precision and recall

# ! average='micro' ==> https://stackoverflow.com/questions/52269187/facing-valueerror-target-is-multiclass-but-average-binary
# print(f"Precision TP/(TP+FP) - Left col                  : {precision_score(y_train, y_train_pred, average='micro'):.3f}" )
# print(f"Recall TP/(TP+FN)  - Bottom line                 : {recall_score(y_train, y_train_pred, average='micro'):.3f}" )
# print(f"F1 2/(1/Prec + 1/Rec)                            : {f1_score(y_train, y_train_pred, average='micro'):.3f}" )
# print(f"Accuracy (TP+TN)/(TP+FN+TN+FP) - Diag over total : {accuracy_score(y_train, y_train_pred):.3f}" )
print(f"Accuracy on train set                            : {classifier.score(X_train, y_train):.3f}")

In [None]:
# Plot confusion matrix on test set
cm = confusion_matrix(y_test, y_test_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
cm_display.ax_.set_title("Confusion matrix on test set ") 
plt.show() 

In [None]:
# https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
# Precision : Out of all the predicted positives, what percentage is truly positive (matters e.g. for spam filtering).
# Recall    : Out of all the actual positives, what percentage is predicted positive (matters e.g. for medicine, credit card fraud).
# F1        : Harmonic mean of precision and recall
# print(f"Precision TP/(TP+FP) - Left col                  : {precision_score(y_test, y_test_pred, average='micro'):.3f}" )
# print(f"Recall TP/(TP+FN)  - Bottom line                 : {recall_score(y_test, y_test_pred, average='micro'):.3f}" )
# print(f"F1 2/(1/Prec + 1/Rec)                            : {f1_score(y_test, y_test_pred, average='micro'):.3f}" )
# print(f"Accuracy (TP+TN)/(TP+FN+TN+FP) - Diag over total : {accuracy_score(y_test, y_test_pred):.3f}" )
print(f"Accuracy on test set                            : {classifier.score(X_test, y_test):.3f}")
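The commented-out lines above use `average='micro'`. For a single-label multiclass problem like this one, micro-averaged precision, recall and F1 all equal plain accuracy, so per-class values (for instance `average=None` in scikit-learn) tend to be more informative. A minimal sketch of the arithmetic, assuming a made-up 3-class confusion matrix with true classes in rows and predicted classes in columns:

```python
# Made-up 3-class confusion matrix: rows = true class, columns = predicted class
cm = [
    [10, 2, 0],
    [1, 12, 3],
    [0, 1, 7],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

true_pos = [cm[k][k] for k in range(n_classes)]
# Column sums: everything predicted as class k (TP + FP)
predicted = [sum(cm[i][k] for i in range(n_classes)) for k in range(n_classes)]
# Row sums: everything that truly is class k (TP + FN)
actual = [sum(cm[k]) for k in range(n_classes)]

precision = [tp / p for tp, p in zip(true_pos, predicted)]
recall = [tp / a for tp, a in zip(true_pos, actual)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

accuracy = sum(true_pos) / total
micro_precision = sum(true_pos) / sum(predicted)
micro_recall = sum(true_pos) / sum(actual)

for k in range(n_classes):
    print(f"class {k}: precision={precision[k]:.3f} recall={recall[k]:.3f} f1={f1[k]:.3f}")
print(f"accuracy={accuracy:.3f} micro precision={micro_precision:.3f} micro recall={micro_recall:.3f}")
```

Every sample appears exactly once in a row and once in a column, so the micro-averaged denominators both equal the total number of samples. That is why the three summary numbers coincide.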

- Create a dataframe with feature importances

* Our model is overfitting. Let's try to play with parameters. Using [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier) try to: 
    * Increase `min_samples_split`
    * Play around with other parameters if you want to better optimize your model! 🔧

In [None]:
classifier = DecisionTreeClassifier(min_samples_split=80, class_weight="balanced" )
classifier.fit(X_train, y_train) 
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

# Plot confusion matrix on train set
cm = confusion_matrix(y_train, y_train_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
cm_display.ax_.set_title("Confusion matrix on train set") 
plt.show() 
print(f"Accuracy on train set                            : {classifier.score(X_train, y_train):.3f}")

# Plot confusion matrix on test set
cm = confusion_matrix(y_test, y_test_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
cm_display.ax_.set_title("Confusion matrix on test set ") 
plt.show() 
print(f"Accuracy on test set                            : {classifier.score(X_test, y_test):.3f}")
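To see how `min_samples_split` trades train accuracy against tree size, it can help to sweep a few values and print the train score, test score and depth side by side. This is only a sketch: it assumes scikit-learn's built-in `load_wine` data as a stand-in for `Wine_grading.csv`, and the values 2, 20 and 80 are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's built-in wine dataset (an assumption, not Wine_grading.csv)
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

scores = {}
for min_split in (2, 20, 80):
    tree = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy, depth of the fitted tree)
    scores[min_split] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te), tree.get_depth())

for min_split, (train_acc, test_acc, depth) in scores.items():
    print(f"min_samples_split={min_split:>2}  depth={depth}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between train and test accuracy suggests overfitting. If both scores drop together as `min_samples_split` grows, the tree has probably become too simple.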

## Bonus 1 - Feature Importance 🏄‍♂️

* Try to visualize feature importance of your decision tree

In [None]:
# Tree-based models expose impurity-based importances, not regression coefficients
feature_importance = pd.DataFrame({
    "feature_names": X.columns,
    "importance": classifier.feature_importances_
}).sort_values(by="importance", ascending=False)
print(feature_importance)
_ = feature_importance.plot(kind="bar", x="feature_names", figsize=(16*.65, 9*.65))
_ = plt.xticks(rotation=0)

## Bonus 2 - Try a Random Forest 🏄‍♂️

* Do you think a Random Forest can do better? 

In [None]:
# classifier = RandomForestClassifier(n_estimators = 30)
classifier = RandomForestClassifier(min_samples_split=80, class_weight="balanced" )

classifier.fit(X_train, y_train) 
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

# Plot confusion matrix on train set
cm = confusion_matrix(y_train, y_train_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
cm_display.ax_.set_title("Confusion matrix on train set") 
plt.show() 
print(f"Accuracy on train set                            : {classifier.score(X_train, y_train):.3f}")

# Plot confusion matrix on test set
cm = confusion_matrix(y_test, y_test_pred, labels=classifier.classes_)
cm_display = ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
cm_display.ax_.set_title("Confusion matrix on test set ") 
plt.show() 
print(f"Accuracy on test set                            : {classifier.score(X_test, y_test):.3f}")

## Bonus 3 [For the coding warriors] - Visualize your decision tree 🏄‍♂️

* Did you know that you can visualize an actual decision tree? 
    * Check out this [documentation](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html#sphx-glr-auto-examples-tree-plot-iris-dtc-py) and try to do it with your Decision Tree
    * Careful, `plot_tree` doesn't work directly on a Random Forest; you would have to plot one of its individual trees 🙏

In [None]:
classifier = DecisionTreeClassifier(min_samples_split=80, class_weight="balanced" )
classifier.fit(X_train, y_train) 

fig, ax = plt.subplots(figsize=(30, 30)) # Resize figure
plot_tree(classifier, filled = True, feature_names = list(X.columns), proportion = True, ax = ax)
plt.show()
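When a 30 x 30 figure is hard to read, a text dump of the rules is a possible alternative. The sketch below uses `sklearn.tree.export_text`, and also shows that a single tree of a fitted forest can be inspected through its `estimators_` attribute. It assumes scikit-learn's built-in `load_wine` data as a stand-in for `Wine_grading.csv`, and `max_depth=2` is chosen only to keep the output short.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: scikit-learn's built-in wine dataset (an assumption, not Wine_grading.csv)
wine = load_wine()
X_demo, y_demo = wine.data, wine.target
names = list(wine.feature_names)

# Shallow tree so that the printed rules stay short
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
tree_rules = export_text(tree, feature_names=names)
print(tree_rules)

# A forest cannot be drawn as a whole, but each fitted tree in estimators_ can
forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0).fit(X_demo, y_demo)
first_tree_rules = export_text(forest.estimators_[0], feature_names=names)
print(first_tree_rules)
```

The same `forest.estimators_[0]` object can also be passed to `plot_tree` if a graphical view of one tree of the forest is preferred.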