By: David Plehn and Anika Achary

Link to UCI data repository where data was acquired:
https://archive.ics.uci.edu/dataset/186/wine+quality

# Introduction

Problem:
We will be creating a multinomial logistic regession model to predict wine quality (red and white variants of the Portuguese "Vinho Verde" wine) based on physicochemical tests.

# Requirements

Python Modules:
*   Pandas
*   Numpy
*   matplotlib
*   (%matplotlib inline to ensure it is properly displayed)
*   sklearn
*   seaborn
*   scipy
*   warnings

In [10]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

Before we uploaded the dataset, we had to split the dataset by delimeter in excel so we could read it by columns.

In [12]:
df_red = pd.read_csv("C:/Users/david/OneDrive/Desktop/Intro To ML/Final_Project/winequality-red.csv")
df_red.head()

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/david/OneDrive/Desktop/Intro To ML/Final_Project/winequality-red.csv'

In [None]:
df_white = pd.read_csv("C:/Users/david/OneDrive/Desktop/Intro To ML/Final_Project/winequality-white.csv")
df_white.head()

In [None]:
print(df_white.shape)
print(df_red.shape)

In [None]:
print(df_white.info())
print("-----------------------------------------")
print(df_red.info())

In [None]:
print(df_white.isnull().sum())
print("-------------------")
print(df_red.isnull().sum())

## Histograms on DataFrame features

### Histograms for df_red's features

In [None]:
df_red.hist(figsize=(20, 20))

### Histograms for df_white's features

In [None]:
df_white.hist(figsize=(20, 20))

## Pairplots on DataFrames

### Pairplot for df_white

In [None]:
import warnings 
warnings.filterwarnings('ignore')

sns.pairplot(df_white)
plt.show()

### Pairplot for df_red

In [None]:
sns.pairplot(df_red)
plt.show()

## Correlation and heatmaps for DateFrames

### Correlation and heatmap for the df_red dataset

In [None]:
matrix_red = np.triu(df_red.corr())
ax_red = sns.heatmap(df_red.corr(), annot = True, square=True, \
            linewidths=1, linecolor='black') #, mask=matrix)
bottom_red, top_red = ax_red.get_ylim()
ax_red.set_ylim(bottom_red + 0.5, top_red - 0.5)
df_red.corr()

### Correlation and heatmap for the df_white dataset

In [None]:
matrix_white = np.triu(df_white.corr())
ax_white = sns.heatmap(df_white.corr(), annot = True, square=True, \
            linewidths=1, linecolor='black') #, mask=matrix)
bottom_white, top_white = ax_white.get_ylim()
ax_white.set_ylim(bottom_white + 0.5, top_white - 0.5)
df_white.corr()

### Multinomial Logistic Regression

#### The lbfgs solver can account for multinomial loss, so we dont have to alter it

In [None]:
from sklearn.linear_model import LogisticRegression

reg_red = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter = 1000)
reg_white = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter = 1000)
print(reg_red)
print(reg_white)

### K^2 test for Normality Test

In [None]:
from scipy.stats import kstest

normal_features_red = []
non_normal_features_red = []

for feature in df_red:
    stat, p = kstest(df_red[feature], 'norm', args=(df_red[feature].mean(), df_red[feature].std()))
    if p > 0.05:
        normal_features_red.append(feature)
    else:
        non_normal_features_red.append(feature)

normal_features_white = []
non_normal_features_white = []

for feature in df_white:
    stat, p = kstest(df_white[feature], 'norm', args=(df_white[feature].mean(), df_white[feature].std()))
    if p > 0.05:
        normal_features_white.append(feature)
    else:
        non_normal_features_white.append(feature)

print("Normally Distributed Features for df_red:", normal_features_red)
print("Non-Normally Distributed Features for df_red:", non_normal_features_red)
print("---------------------------------------------------------------------------------------")
print("Normally Distributed Features for df_white:", normal_features_white)
print("Non-Normally Distributed Features for df_white:", non_normal_features_white)

### QuantileTransformer on Non-Normally Distributed Features

In [None]:
from sklearn.preprocessing import QuantileTransformer

quantile_transformer = QuantileTransformer(output_distribution='uniform')
df_red_independent = quantile_transformer.fit_transform(df_red[['fixed acidity', 'volatile acidity', 'chlorides', 'total sulfur dioxide', 'density', 'alcohol']])
df_white_independent = quantile_transformer.fit_transform(df_white[['fixed acidity', 'volatile acidity', 'chlorides', 'total sulfur dioxide', 'density', 'alcohol']])

# after scaling
df_red_independent[0:5]
df_white_independent[0:5]

### Training the models and predicting their values and probabilities

In [None]:
x_red_train, x_red_test, y_red_train, y_red_test = train_test_split(df_red_independent, df_red["quality"], test_size=0.2, random_state=4)
reg_red.fit(x_red_train, y_red_train)

print(reg_red.coef_) 
print(reg_red.intercept_)

In [None]:
x_white_train, x_white_test, y_white_train, y_white_test = train_test_split(df_white_independent, df_white["quality"], test_size=0.2, random_state=4)
reg_white.fit(x_white_train, y_white_train)

print(reg_white.coef_) 
print(reg_white.intercept_)

### Predicting Probabilities

In [None]:
yhat_red = reg_red.predict(x_red_test) 
y_red_score = reg_red.predict_proba(x_red_test)

yhat_white = reg_white.predict(x_white_test) 
y_white_score = reg_white.predict_proba(x_white_test)

### Testing Model Accuracy

In [None]:
from sklearn.metrics import accuracy_score
print("Base rate accuracy for red is: %0.2f" %(accuracy_score(y_red_test, yhat_red)))
print("Base rate accuracy for white is: %0.2f" %(accuracy_score(y_white_test, yhat_white)))

### Transforming the test data for use in multinomial logistic regression

In [None]:
from sklearn.preprocessing import LabelBinarizer

label_binarizer_red = LabelBinarizer().fit(y_red_train)
y_red_onehot_test = label_binarizer.transform(y_red_test)

label_binarizer_white = LabelBinarizer().fit(y_white_train)
y_white_onehot_test = label_binarizer_white.transform(y_white_test)

target_names = df_red["quality"].unique()
n_classes = len(target_names)

In [None]:
print(target_names)
print(n_classes)

## Computing Micro and Macro averages

### Computing the Micro-averaged One-vs-Rest ROC AUC score for red wine

In [None]:
from sklearn.metrics import auc, roc_curve

# store the fpr, tpr, and roc_auc for all averaging strategies
fpr_red, tpr_red, roc_auc_red = dict(), dict(), dict()
# Compute micro-average ROC curve and ROC area
fpr_red["micro"], tpr_red["micro"], _ = roc_curve(y_red_onehot_test.ravel(), y_red_score.ravel())
roc_auc_red["micro"] = auc(fpr_red["micro"], tpr_red["micro"])

print(f"Micro-averaged One-vs-Rest ROC AUC score for red wine:\n{roc_auc_red['micro']:.2f}")

### Computing the Macro-averaged One-vs-Rest ROC AUC score for red wine

In [None]:
for i in range(n_classes):
    fpr_red[i], tpr_red[i], _ = roc_curve(y_red_onehot_test[:, i], y_red_score[:, i])
    roc_auc_red[i] = auc(fpr_red[i], tpr_red[i])

fpr_grid = np.linspace(0.0, 1.0, 1000)

# Interpolate all ROC curves at these points
mean_tpr = np.zeros_like(fpr_grid)

for i in range(n_classes):
    mean_tpr += np.interp(fpr_grid, fpr_red[i], tpr_red[i])  # linear interpolation

# Average it and compute AUC
mean_tpr /= n_classes

fpr_red["macro"] = fpr_grid
tpr_red["macro"] = mean_tpr
roc_auc_red["macro"] = auc(fpr_red["macro"], tpr_red["macro"])

print(f"Macro-averaged One-vs-Rest ROC AUC score for red wine:\n{roc_auc_red['macro']:.2f}")

### Computing the Micro-averaged One-vs-Rest ROC AUC score for white wine

In [None]:
# store the fpr, tpr, and roc_auc for all averaging strategies
fpr_white, tpr_white, roc_auc_white = dict(), dict(), dict()
# Compute micro-average ROC curve and ROC area
fpr_white["micro"], tpr_white["micro"], _ = roc_curve(y_white_onehot_test.ravel(), y_white_score.ravel())
roc_auc_white["micro"] = auc(fpr_white["micro"], tpr_white["micro"])

print(f"Micro-averaged One-vs-Rest ROC AUC score for white wine:\n{roc_auc_white['micro']:.2f}")

### Computing the Macro-averaged One-vs-Rest ROC AUC score for white wine

In [None]:
for i in range(n_classes):
    fpr_white[i], tpr_white[i], _ = roc_curve(y_red_onehot_test[:, i], y_score[:, i])
    roc_auc_white[i] = auc(fpr_white[i], tpr_white[i])

fpr_grid = np.linspace(0.0, 1.0, 1000)

# Interpolate all ROC curves at these points
mean_tpr = np.zeros_like(fpr_grid)

for i in range(n_classes):
    mean_tpr += np.interp(fpr_grid, fpr_white[i], tpr_white[i])  # linear interpolation

# Average it and compute AUC
mean_tpr /= n_classes

fpr_white["macro"] = fpr_grid
tpr_white["macro"] = mean_tpr
roc_auc_white["macro"] = auc(fpr_white["macro"], tpr_white["macro"])

print(f"Macro-averaged One-vs-Rest ROC AUC score for white wine:\n{roc_auc_white['macro']:.2f}")

## Displaying ROC AUC's for DataFrames

### Displaying ROC AUC's for each class in df_red

In [None]:
import matplotlib
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(6, 6))

plt.plot(
    fpr_red["micro"],
    tpr_red["micro"],
    label=f"micro-average ROC curve (AUC = {roc_auc_red['micro']:.2f})",
    color="deeppink",
    linestyle=":",
    linewidth=4,
)

plt.plot(
    fpr_red["macro"],
    tpr_red["macro"],
    label=f"macro-average ROC curve (AUC = {roc_auc_red['macro']:.2f})",
    color="navy",
    linestyle=":",
    linewidth=4,
)

colors = ["red", "aqua", "darkorange", "cornflowerblue", "green", "lightgreen"]

for class_id, color in zip(range(n_classes), colors):
    RocCurveDisplay.from_predictions(
        y_red_onehot_test[:, class_id],
        y_red_score[:, class_id],
        name=f"ROC curve for quality {target_names[class_id]}",
        color=color,
        ax=ax
    )

_ = ax.set(
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
    title="Red Wine Quality (0-10)\nto One-vs-Rest multiclass",
)

### Displaying ROC AUC's for each class in df_white

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

plt.plot(
    fpr_red["micro"],
    tpr_red["micro"],
    label=f"micro-average ROC curve (AUC = {roc_auc_red['micro']:.2f})",
    color="deeppink",
    linestyle=":",
    linewidth=4,
)

plt.plot(
    fpr_red["macro"],
    tpr_red["macro"],
    label=f"macro-average ROC curve (AUC = {roc_auc_red['macro']:.2f})",
    color="navy",
    linestyle=":",
    linewidth=4,
)

for class_id, color in zip(range(n_classes), colors):
    RocCurveDisplay.from_predictions(
        y_white_onehot_test[:, class_id],
        y_white_score[:, class_id],
        name=f"ROC curve for quality {target_names[class_id]}",
        color=color,
        ax=ax,
    )

_ = ax.set(
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
    title="White Wine Quality (0-10)\nto One-vs-Rest multiclass",
)

## Confusion Matricies for DataFrames

### Confusion Matricies for df_red

In [None]:
from sklearn.metrics import multilabel_confusion_matrix, ConfusionMatrixDisplay

confusion_matricies_red = multilabel_confusion_matrix(y_red_test, yhat_red, labels=target_names)

f, axes = plt.subplots(2, 3, figsize=(25, 15))
axes = axes.ravel()
for i in range(n_classes):
    disp = ConfusionMatrixDisplay(confusion_matricies_red[i],
                                  display_labels=[0, i])
    disp.plot(ax=axes[i], values_format='.4g')
    disp.ax_.set_title(f'Quality {target_names[i]}')
    if i<10:
        disp.ax_.set_xlabel('')
    if i%5!=0:
        disp.ax_.set_ylabel('')
    disp.im_.colorbar.remove()

plt.subplots_adjust(wspace=0.10, hspace=0.10)
f.colorbar(disp.im_, ax=axes)
plt.show()

### Confusion Matricies for df_white

In [None]:
confusion_matricies_white = multilabel_confusion_matrix(y_white_test, yhat_white, labels=target_names)

f, axes = plt.subplots(2, 3, figsize=(25, 15))
axes = axes.ravel()
for i in range(n_classes):
    disp = ConfusionMatrixDisplay(confusion_matricies_white[i],
                                  display_labels=[0, i])
    disp.plot(ax=axes[i], values_format='.4g')
    disp.ax_.set_title(f'Quality {target_names[i]}')
    if i<10:
        disp.ax_.set_xlabel('')
    if i%5!=0:
        disp.ax_.set_ylabel('')
    disp.im_.colorbar.remove()

plt.subplots_adjust(wspace=0.10, hspace=0.10)
f.colorbar(disp.im_, ax=axes)
plt.show()

# Conclusion

Problem:
We will be creating a multinomial logistic regession model to predict wine quality (red and white variants of the Portuguese "Vinho Verde" wine) based on physicochemical tests.

We wrote a program that split the wine dataset into two smaller sets, each for red wine and white wine. We analyzed the data to find which transformers would best apply to the data. Since the data wasn't normally distributed, we tried applying several different transformers - such has PowerTransformer, MaxAbsScaler, StandardScaler, and MinMaxScaler - to see which one would best normalize the data. We discovered that the QuantileTransformer worked the best. We needed to apply OneHot encoding to the y_test data for both red and white data, through LabelEncoders that were fit to their corrosponding y_train data. Once we completed this, we made 2 seperate multinomial logistic regression models (one for each dataset) that told us the relationship between the quality of red and white wine to their physiochemical tests. Based on the results from our model, the quality of red wine has a closer association to its physiochemical tests than white wine does. For the red wine dataset the AUC scores for qualities 5, 6, 7, 4, 8, 3 were 0.77, 0.66, 0.78, 0.56, 0.90, and 0.92. For the white wine dataset the AUC scores for qualities 5, 6, 7, 4, 8, 3 were 0.19, 0.71, 0.76, 0.60, 0.76, and 0.76. The red wine datasets higher AUC scores show that the model had better accuracy predicting its quality as opposed to the white wine. 