# Wine Quality Prediction - Part 3 - Binary Classification

![](https://cdn.pixabay.com/photo/2016/03/09/11/53/wine-glasses-1246240_1280.jpg)

This notebook is part of a trilogy in which I approach the wine quality dataset in several different ways:

+ [Part 1: Supervised Learning - Regression](https://www.kaggle.com/sgtsteiner/red-wine-quality-regression)
+ [Part 2: Supervised Learning - Multiclass Classification](https://www.kaggle.com/sgtsteiner/red-wine-quality-multiclass-classification)
+ Part 3: Supervised Learning - Binary Classification

In [the first part of this analysis](https://www.kaggle.com/sgtsteiner/red-wine-quality-regression) we approached the problem as supervised learning - regression. The resulting model was not satisfactory. [In a second analysis](https://www.kaggle.com/sgtsteiner/red-wine-quality-multiclass-classification), we approached the problem as supervised learning - multiclass classification. Because the data was so unbalanced (the quality scores were concentrated on 5 and 6), the performance of our best model (RandomForest) was not very good. In this third and final part, we will treat the problem as **supervised learning - binary classification**.

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV, cross_val_score 
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.linear_model import SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn import metrics, utils

import xgboost as xgb

import pandas_profiling

%matplotlib inline

seed = 42

import warnings
warnings.filterwarnings('ignore')

# Get the data

In [None]:
red = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

# Explore the data

In [None]:
red.head()

In [None]:
red.profile_report()

We are not going to delve into data exploration, as we already did so in the [first part of this analysis](https://www.kaggle.com/sgtsteiner/red-wine-quality-regression). The only preprocessing we will perform is to convert the target variable `quality` into a categorical variable indicating whether the score is good or not:

+ `bad`: scores less than or equal to 5.
+ `good`: scores greater than or equal to 6.

In [None]:
bins = [2, 5.5, 8]
labels = ["bad", "good"]
red['quality_cat'] = pd.cut(red['quality'], bins=bins, labels=labels)
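As a sanity check on the binning, here is a minimal pure-Python sketch of the rule that `pd.cut` applies with its default `right=True`: each bin is open on the left and closed on the right, so the edges `[2, 5.5, 8]` yield the intervals `(2, 5.5]` and `(5.5, 8]`. The helper name `label_quality` is hypothetical and only mirrors the cell above; it is not part of pandas.

```python
from bisect import bisect_left

BINS = [2, 5.5, 8]          # same edges as the cell above
LABELS = ["bad", "good"]    # same labels as the cell above


def label_quality(score, bins=BINS, labels=LABELS):
    """Mimic right-closed binning: bins[i] < score <= bins[i + 1]."""
    idx = bisect_left(bins, score)  # index of the first edge that is >= score
    if idx == 0 or idx == len(bins):
        return None  # outside every interval (pd.cut would give NaN)
    return labels[idx - 1]


for score in range(2, 10):
    print(score, label_quality(score))
```

Quality scores in this dataset run from 3 to 8, so every wine falls into one of the two bins. A score of exactly 2 would fall outside them, because the left edge is open.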

In [None]:
red["quality_cat"].value_counts()

In [None]:
print("Percentage of quality scores")
red["quality_cat"].value_counts(normalize=True)*100

In [None]:
red["quality_cat"].value_counts().plot.pie(autopct='%1.2f%%');

# Select and train models

The goal of this phase is to quickly train many unrefined models of different kinds (Random Forest, AdaBoost, Extra Tree, etc.) using their default parameters. The idea is to get a quick overview of which models are most promising: we measure and compare the performance of all of them, then select the best.

Create the predictor dataset and the dataset with the target variable:

In [None]:
predict_columns = red.columns[:-2]
predict_columns

In [None]:
X = red[predict_columns]
y = red["quality_cat"]

Create the training and test datasets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42, 
                                                    test_size=0.2)

## Baseline

First, we are going to train a dummy classifier that we will use as a baseline with which to compare.

In [None]:
clf_dummy = DummyClassifier(strategy="uniform", random_state=seed) # Random prediction
clf_dummy.fit(X_train, y_train)

In [None]:
cross_val_score(clf_dummy, X_train, y_train, cv=3, 
                scoring="accuracy", n_jobs=-1).mean()

As we can see, a classifier that predicts randomly obtains an accuracy of 53%.
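The expected accuracy of uniform guessing can be worked out directly: whatever the class balance, each row is matched with probability 1/k for k classes. The sketch below uses an assumed 53/47 class split (an approximation, not an exact count from the data) and is only an illustration. A single seeded run, like the one above, can land a few points away from this expectation.

```python
# Assumed class shares (approximate, for illustration only).
class_share = {"good": 0.53, "bad": 0.47}

k = len(class_share)

# A uniform guesser picks each class with probability 1/k, independently of
# the true label, so P(correct) = sum over classes of share * (1/k) = 1/k.
expected_accuracy = sum(share * (1 / k) for share in class_share.values())

print(f"Expected accuracy of uniform guessing: {expected_accuracy:.2f}")
```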

In [None]:
# Always predict the most frequent class
clf_dummy = DummyClassifier(strategy="most_frequent", random_state=seed) 
clf_dummy.fit(X_train, y_train)

In [None]:
cross_val_score(clf_dummy, X_train, y_train, cv=3, scoring="accuracy", n_jobs=-1).mean()

A classifier that always predicts the most frequent class (in our case the `good` quality) also obtains an accuracy of 53%. We are going to take the prediction of this dummy classifier as our baseline.

In [None]:
preds = cross_val_predict(clf_dummy, X_train, y_train, cv=3, n_jobs=-1)

In [None]:
conf_mx = metrics.confusion_matrix(y_train, preds)
conf_mx

In [None]:
pd.crosstab(y_train, preds, rownames=['Actual'], colnames=['Predicted'])

In [None]:
fig = plt.figure(figsize=(6,5))
ax = sns.heatmap(conf_mx, annot=True, fmt="d", 
                 xticklabels=clf_dummy.classes_,
                 yticklabels=clf_dummy.classes_,)

In [None]:
accuracy_base = metrics.accuracy_score(y_train, preds)
precision_base = metrics.precision_score(y_train, preds, 
                                         average='weighted', 
                                         zero_division=0)
recall_base = metrics.recall_score(y_train, preds, 
                                   average='weighted')
f1_base = metrics.f1_score(y_train, preds, 
                           average='weighted')
print(f"Accuracy: {accuracy_base}")
print(f"Precision: {precision_base}")
print(f"Recall: {recall_base}")
print(f"f1: {f1_base}")

In [None]:
print(metrics.classification_report(y_train, preds, zero_division=0))

Our dummy classifier is correct only 28% of the time (precision) and detects 53% of the actual scores (recall); both figures are support-weighted averages over the two classes. It is often convenient to combine precision and recall into a single metric called the F1 score, particularly when we need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (also known as sensitivity). While the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, a classifier only scores high on F1 if both recall and precision are high. In our case, F1 = 0.37. We will take these metrics as our initial baseline.
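The baseline figures can be reproduced by hand, which makes it clear where the 28% comes from. The sketch below assumes the majority class `good` makes up about 53% of the training labels (a rounded figure taken from the accuracy above) and applies the same support-weighted averaging that `average='weighted'` uses. With `zero_division=0`, the never-predicted class contributes a precision of 0.

```python
# Assumed class shares in y_train (rounded; for illustration only).
share = {"good": 0.53, "bad": 0.47}

# A "most_frequent" classifier predicts "good" for every row.
per_class = {
    # Every prediction is "good": precision equals the share of true "good",
    # and every true "good" is found, so recall is 1.
    "good": {"precision": share["good"], "recall": 1.0},
    # "bad" is never predicted: precision is undefined (set to 0), recall is 0.
    "bad": {"precision": 0.0, "recall": 0.0},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Support-weighted averages across the two classes.
weighted_precision = sum(share[c] * m["precision"] for c, m in per_class.items())
weighted_recall = sum(share[c] * m["recall"] for c, m in per_class.items())
weighted_f1 = sum(share[c] * f1(m["precision"], m["recall"])
                  for c, m in per_class.items())

print(f"precision={weighted_precision:.2f} "
      f"recall={weighted_recall:.2f} f1={weighted_f1:.2f}")
```

Under this assumption the weighted precision is simply the majority share squared, which is why it sits so far below the recall.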

## Shortlist Promising Models

In [None]:
def evaluate_model(estimator, X_train, y_train, cv=5, verbose=True):
    """Print and return cross validation of model
    """
    scoring = {"accuracy": "accuracy",
               "precision": "precision_weighted",
               "recall": "recall_weighted",
               "f1": "f1_weighted"}
    scores = cross_validate(estimator, X_train, y_train, cv=cv, scoring=scoring)
    
    accuracy, accuracy_std = scores['test_accuracy'].mean(), \
                                scores['test_accuracy'].std()
    
    precision, precision_std = scores['test_precision'].mean(), \
                                scores['test_precision'].std()
    
    recall, recall_std = scores['test_recall'].mean(), \
                                scores['test_recall'].std()
    
    f1, f1_std = scores['test_f1'].mean(), scores['test_f1'].std()

    
    result = {
        "Accuracy": accuracy,
        "Accuracy std": accuracy_std,
        "Precision": precision,
        "Precision std": precision_std,
        "Recall": recall,
        "Recall std": recall_std,
        "f1": f1,
        "f1 std": f1_std,
    }
    
    if verbose:
        print(f"Accuracy: {accuracy} - (std: {accuracy_std})")
        print(f"Precision: {precision} - (std: {precision_std})")
        print(f"Recall: {recall} - (std: {recall_std})")
        print(f"f1: {f1} - (std: {f1_std})")

    return result

In [None]:
models = [GaussianNB(), KNeighborsClassifier(), RandomForestClassifier(random_state=seed),
          DecisionTreeClassifier(random_state=seed), ExtraTreeClassifier(random_state=seed), 
          AdaBoostClassifier(random_state=seed), GradientBoostingClassifier(random_state=seed), 
          xgb.XGBClassifier()]

model_names = ["Naive Bayes Gaussian", "K Neighbors Classifier", "Random Forest",
               "Decision Tree", "Extra Tree", "Ada Boost", 
               "Gradient Boosting", "XGBoost"]

In [None]:
accuracy = []
precision = []
recall = []
f1 = []

for step, (name, model) in enumerate(zip(model_names, models), start=1):
    print(f"Step {step} of {len(models)}")
    print(f"...running {name}")
    
    clf_scores = evaluate_model(model, X_train, y_train)
    
    accuracy.append(clf_scores["Accuracy"])
    precision.append(clf_scores["Precision"])
    recall.append(clf_scores["Recall"])
    f1.append(clf_scores["f1"])

Let's see the performance of each of them:

In [None]:
df_result = pd.DataFrame({"Model": model_names,
                          "accuracy": accuracy,
                          "precision": precision,
                          "recall": recall,
                          "f1": f1})
df_result.sort_values(by="f1", ascending=False)

We are going to visualize the comparison of the different models / metrics:

In [None]:
metrics_list = ["f1", "accuracy", "precision", "recall"]

for metric in metrics_list:
    df_result.sort_values(by=metric).plot.barh("Model", metric)
    plt.title(f"Model by {metric}")
    plt.show()

The best-performing model is Random Forest. Let's examine its results in a little more detail:

In [None]:
clf_rf = RandomForestClassifier(random_state=seed)
preds = cross_val_predict(clf_rf, X_train, y_train, cv=5, n_jobs=-1)

In [None]:
clf_rf.get_params()

In [None]:
pd.crosstab(y_train, preds, rownames=['Actual'], colnames=['Predicted'])

In [None]:
print(metrics.classification_report(y_train, preds, zero_division=0))

Our model is correct 81% of the time (precision) and detects 81% of the actual scores (recall). The F1 score is 0.81. This is a significant improvement over our baseline (remember, precision = 28%, recall = 53%, and F1 = 0.37).

# Fine-Tune

We are going to tune the hyperparameters to see if any improvement is achieved.

In [None]:
param_grid = [
    {"n_estimators": range(20, 200, 20),
     "bootstrap": [True, False],
     "criterion": ["gini", "entropy"],
     "max_depth": [2, 4, 6, 8, 10, 12, 14, None],
     # "auto" was an alias of "sqrt" for classifiers and has been removed
     # from recent scikit-learn releases, so it is left out here.
     "max_features": ["sqrt", "log2"],
     "min_samples_split": [2, 5, 10],
     "min_samples_leaf": [1, 2, 4],
     }
]

clf_rf = RandomForestClassifier(random_state=seed, n_jobs=-1)

## Initial fine-tune with Randomized Search

First we do a quick randomized sweep:

In [None]:
clf_random = RandomizedSearchCV(clf_rf, param_grid, n_iter = 200, cv = 5, 
                                scoring="f1_weighted", verbose=2, 
                                random_state=seed, n_jobs = -1)
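To see why a randomized sweep comes first, it helps to count the candidates. The sketch below assumes a grid shaped like the one above, with two distinct `max_features` options (in older scikit-learn releases `"auto"` meant the same as `"sqrt"` for forest classifiers). It compares the number of model fits for an exhaustive search against `n_iter=200` random draws, both with 5-fold cross-validation.

```python
from math import prod

# Number of options per hyperparameter, mirroring the grid above
# (assumption: "auto" and "sqrt" are counted once).
options = {
    "n_estimators": len(range(20, 200, 20)),
    "bootstrap": 2,
    "criterion": 2,
    "max_depth": 8,
    "max_features": 2,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
}

cv_folds = 5
n_iter = 200

n_candidates = prod(options.values())      # every combination in the grid
exhaustive_fits = n_candidates * cv_folds  # what a full grid search would cost
randomized_fits = n_iter * cv_folds        # what the randomized sweep costs

print(f"Grid candidates: {n_candidates}")
print(f"Exhaustive search fits: {exhaustive_fits}")
print(f"Randomized search fits: {randomized_fits}")
```

The randomized sweep samples only a small fraction of the combinations, which makes it a cheap first pass before a narrower grid search.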

In [None]:
clf_random.fit(X_train, y_train)

In [None]:
clf_random.best_params_

In [None]:
preds = cross_val_predict(clf_random.best_estimator_, 
                          X_train, y_train, 
                          cv=5, n_jobs=-1)
print(metrics.classification_report(y_train, preds, zero_division=0))

## Final fine-tune with GridSearch

In [None]:
param_grid = [
    {"n_estimators": range(20, 80, 10),
     "bootstrap": [True, False],
     "criterion": ["gini", "entropy"],
     "max_depth": [2, 4, 6, 8, 10, 12, 14, None],
     # "auto" was an alias of "sqrt" for classifiers and has been removed
     # from recent scikit-learn releases, so it is left out here.
     "max_features": ["sqrt", "log2"],
     "min_samples_split": [2, 5, 10],
     "min_samples_leaf": [1, 2, 4],
     }
]

clf_rf = RandomForestClassifier(random_state=seed, n_jobs=-1)

In [None]:
grid_search = GridSearchCV(clf_rf, param_grid, cv=5,
                           scoring="f1_weighted", verbose=2, n_jobs=-1)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
final_model = grid_search.best_estimator_
preds = cross_val_predict(final_model, X_train, y_train, cv=5, n_jobs=-1)

In [None]:
pd.crosstab(y_train, preds, rownames=['Actual'], colnames=['Predicted'])

In [None]:
print(metrics.classification_report(y_train, preds))

After tuning the hyperparameters, we get a very slight improvement over the defaults. The model is correct 82% of the time (precision) and detects 82% of the actual scores (recall). The F1 score is 0.82, which significantly improves on our baseline (remember, precision = 28%, recall = 53%, and F1 = 0.37).

Finally, let's see how it performs on the test set:

In [None]:
y_pred = final_model.predict(X_test)

In [None]:
pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])

In [None]:
print(metrics.classification_report(y_test, y_pred, zero_division=0))

On the test set, the model is correct 78% of the time (precision) and detects 78% of the actual scores (recall). The F1 score is 0.78, which still significantly improves on our baseline (remember, precision = 28%, recall = 53%, and F1 = 0.37).
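When comparing the test score with the cross-validated one, it is worth remembering how small the test set is. The rough sketch below assumes the 1,599-row red wine file and the 20% split used above, and treats each test prediction as an independent Bernoulli trial (a simplification). Under those assumptions, the standard error of a 78% accuracy is about two percentage points.

```python
from math import ceil, sqrt

n_rows = 1599      # rows in winequality-red.csv (assumed unchanged)
test_size = 0.2    # same fraction as in train_test_split above
accuracy = 0.78    # test accuracy reported in this notebook (rounded)

# scikit-learn rounds a fractional test_size up to whole samples.
n_test = ceil(n_rows * test_size)

# Binomial standard error and a normal-approximation 95% interval.
standard_error = sqrt(accuracy * (1 - accuracy) / n_test)
low = accuracy - 1.96 * standard_error
high = accuracy + 1.96 * standard_error

print(f"n_test={n_test}, SE={standard_error:.3f}, "
      f"approx. 95% interval=({low:.3f}, {high:.3f})")
```

With an interval of roughly this width, the gap between the cross-validated and test figures should be read with some caution.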

In [None]:
conf_mx = metrics.confusion_matrix(y_test, y_pred)

In [None]:
fig = plt.figure(figsize=(8,8))
ax = sns.heatmap(conf_mx, annot=True, fmt="d", 
                 xticklabels=final_model.classes_,
                 yticklabels=final_model.classes_,)

# Feature importances

In [None]:
feature_importances = final_model.feature_importances_
feature_importances

In [None]:
sorted(zip(feature_importances, X_test.columns), reverse=True)

In [None]:
feature_imp = pd.Series(feature_importances, index=X_train.columns).sort_values(ascending=False)
feature_imp.plot(kind='bar')
plt.title('Feature Importances');

# Feature Selection

We are going to use RFECV (recursive feature elimination with cross-validation) to determine how many features are worth keeping.

In [None]:
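# These two names are not imported at the top of the notebook
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(final_model, step=1, cv=StratifiedKFold())
selector = selector.fit(X_train, y_train)
pd.DataFrame({"Feature": predict_columns, "Support": selector.support_})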

In [None]:
plt.figure()
plt.xlabel("No. of features selected")
plt.ylabel("Cross validation scores")
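# Note: grid_scores_ was removed in scikit-learn 1.2; on newer versions use
# selector.cv_results_["mean_test_score"] instead.
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)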
plt.show()

In [None]:
selector.grid_scores_

The conclusion is that all the variables are important for the model, since the maximum score is obtained with the 10 selected features.

# Conclusions

This analysis has treated the task as a supervised learning binary classification problem. Our starting baseline, obtained from a classifier that always predicts the most frequent class, is the following:

+ Precision: **28%**
+ Recall: **53%**
+ Accuracy: **53%**
+ f1: **0.37**

After training various models, the one that has provided the best results is RandomForest. After fine-tuning the hyperparameters we obtain the following metrics:

+ Precision: **82%**
+ Recall: **82%**
+ Accuracy: **82%**
+ f1: **0.82**

The performance in the test set is as follows:

+ Precision: **78%**
+ Recall: **78%**
+ Accuracy: **78%**
+ f1: **0.78**

All predictor variables are relevant to the model. The three that most affect prediction are the following:

+ alcohol
+ sulphates
+ volatile acidity