In [101]:
%matplotlib notebook

In [102]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [103]:
df = pd.read_csv("winequality-red.csv")

In [104]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


There are no Null values in the dataset

In [106]:
df.plot(kind = 'density',subplots = True, sharex =False, sharey = False, fontsize = 0.25,
        layout = (6,2),figsize=(8,12))         

<IPython.core.display.Javascript object>

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f01def51d90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01def09df0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f01deec0430>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01deee9af0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f01dee9f2b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01dee47970>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f01dee47a60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01dee7c280>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f01deddb130>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01ded838b0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f01dedaf0d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f01ded647f0>]],
      dtype=object)

In [107]:
df["quality"].unique()

array([5, 6, 7, 4, 8, 3])

Observations:

1.Most of the features follow normal distribution, except wine quality

2.All the features will require Normalization + scaling

3.Wine quality can be either predicted as multi-class classification (4,5,...) or divide the wine in 2 or 3 classes: Good, Average, Bad

### Correlation between variables

In [108]:
plt.figure().set_size_inches(10,8)
sns.heatmap(df.corr(), annot = True)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f0214ad2310>

Many of the Features have moderate collinearity: around 0.67 and -0.68.
Should we drop them??

In [109]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

## Fixed Acidity

In [110]:
plt.figure()
sns.barplot(y = df["fixed acidity"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01df716af0>

In [111]:
plt.figure()
sns.boxplot(y = df["fixed acidity"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01deaa7d90>

Though the range of the fixed acidity differs, median pretty much remains the same


## volatile acidity

In [112]:
plt.figure()
sns.barplot(y = df["volatile acidity"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01dea766d0>

In [113]:
plt.figure()
sns.boxplot(y = df["volatile acidity"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01dea004f0>

Here we see a much clear pattern, Volatile acidity and wine quality are negatively correlated

## 'citric acid'

In [114]:
plt.figure()
sns.barplot(y = df["citric acid"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01dea9ec70>

In [115]:
plt.figure()
sns.boxplot(y = df["citric acid"], x = df["quality"])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01dea9ecd0>

The median show a certain pattern, but the range of the bars are large and we cannot get a clear picture of wine quality with just the citric acid quantity

'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],

## Residual Sugar

In [116]:
plt.figure()
sns.barplot(y = df["residual sugar"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["residual sugar"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de7fbe20>

The medians are uniform, no clear relation


## Chlorides

In [117]:
plt.figure()
sns.barplot(y = df["chlorides"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["chlorides"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de77b880>

Amount of chloride decreases as quality increases, but there are multiple outliers in the dataset, so do we have to clean them?

In [118]:
'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'

('free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality')

## Free sulfur dioxide

In [119]:
plt.figure()
sns.barplot(y = df["free sulfur dioxide"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["free sulfur dioxide"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de5e6c10>

## total sulfur dioxide

In [120]:
plt.figure()
sns.barplot(y = df["total sulfur dioxide"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["total sulfur dioxide"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de646190>

## Density

In [121]:
plt.figure()
sns.barplot(y = df["density"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["density"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de4f7c40>

## pH

In [122]:
plt.figure()
sns.barplot(y = df["pH"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["pH"], x = df["quality"])

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de33bd30>

## Alcohol

In [123]:
plt.figure()
sns.barplot(y = df["alcohol"], x = df["quality"])
plt.figure()
sns.boxplot(y = df["alcohol"], x = df["quality"])

  plt.figure()


<IPython.core.display.Javascript object>

  plt.figure()


<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x7f01de21ee50>

# Data preprocessing

In [124]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [125]:
X = df.drop("quality", axis = 1)
y = df["quality"]

In [126]:
def log_tran(row):
    row = np.log(row + 1)
    return row
X_norm = X.apply(log_tran)
X_norm

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,2.128232,0.530628,0.000000,1.064711,0.073250,2.484907,3.555348,0.692047,1.506297,0.444686,2.341806
1,2.174752,0.631272,0.000000,1.280934,0.093490,3.258097,4.219508,0.691546,1.435085,0.518794,2.379546
2,2.174752,0.565314,0.039221,1.193922,0.088011,2.772589,4.007333,0.691646,1.449269,0.500775,2.379546
3,2.501436,0.246860,0.444686,1.064711,0.072321,2.890372,4.110874,0.692147,1.425515,0.457425,2.379546
4,2.128232,0.530628,0.000000,1.064711,0.073250,2.484907,3.555348,0.692047,1.506297,0.444686,2.341806
...,...,...,...,...,...,...,...,...,...,...,...
1594,1.974081,0.470004,0.076961,1.098612,0.086178,3.496508,3.806662,0.690594,1.492904,0.457425,2.442347
1595,1.931521,0.438255,0.095310,1.163151,0.060154,3.688879,3.951244,0.690704,1.508512,0.565314,2.501436
1596,1.987874,0.412110,0.122218,1.193922,0.073250,3.401197,3.713572,0.691015,1.486140,0.559616,2.484907
1597,1.931521,0.497740,0.113329,1.098612,0.072321,3.496508,3.806662,0.690880,1.519513,0.536493,2.415914


In [127]:
X_norm.plot(kind = 'density',subplots = True, sharex =False, sharey = False, fontsize = 0.25,
        layout = (6,2),figsize=(8,12))

  fig = plt.figure(**fig_kw)


<IPython.core.display.Javascript object>

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f01df6126a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0215baa0a0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f02158220d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0214849ca0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f020c27e040>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f020c35f640>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f020c531310>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f021472d2b0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f02158e8a30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0214aed4f0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0214433400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f02144a0790>]],
      dtype=object)

## Train_test_split

In [128]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_norm, y)

## Scaling the data

In [129]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Model Building

## Model 1: Decision Tree

In [130]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier().fit(X_train_scaled, y_train)
y_predict = dt_clf.predict(X_test_scaled)
accuracy_score(y_test, y_predict)

0.4875

## Model 2: Knn

In [131]:
from sklearn.neighbors import KNeighborsClassifier
max_acc = 0
k = 0
for i in range(1,50,2):
    knn_clf = KNeighborsClassifier(n_neighbors = i).fit(X_train_scaled,y_train)
    y_predict = knn_clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_predict)
    if accuracy>max_acc:
        max_acc = accuracy
        k = i
    print("Accuracy score of k = {} is {}".format(i, accuracy))
print("Max accuracy score is {} for k = {}".format(max_acc, k))

Accuracy score of k = 1 is 0.6375
Accuracy score of k = 3 is 0.575
Accuracy score of k = 5 is 0.555
Accuracy score of k = 7 is 0.5725
Accuracy score of k = 9 is 0.5825
Accuracy score of k = 11 is 0.59
Accuracy score of k = 13 is 0.5875
Accuracy score of k = 15 is 0.5925
Accuracy score of k = 17 is 0.5975
Accuracy score of k = 19 is 0.5875
Accuracy score of k = 21 is 0.5875
Accuracy score of k = 23 is 0.595
Accuracy score of k = 25 is 0.5875
Accuracy score of k = 27 is 0.585
Accuracy score of k = 29 is 0.6025
Accuracy score of k = 31 is 0.595
Accuracy score of k = 33 is 0.605
Accuracy score of k = 35 is 0.6075
Accuracy score of k = 37 is 0.605
Accuracy score of k = 39 is 0.5875
Accuracy score of k = 41 is 0.5775
Accuracy score of k = 43 is 0.57
Accuracy score of k = 45 is 0.585
Accuracy score of k = 47 is 0.5925
Accuracy score of k = 49 is 0.595
Max accuracy score is 0.6375 for k = 1


## Model 3: SVC

In [132]:
from sklearn.svm import SVC
svc_clf = SVC(kernel = 'linear').fit(X_train_scaled,y_train)
y_predict = svc_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.6025


## Model 4: Logistic regression

In [133]:
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(max_iter = 2000).fit(X_train_scaled,y_train)
y_predict = lr_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.62


## Model 5: Random Forest

In [134]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier().fit(X_train_scaled, y_train)
y_predict = rf_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.6525


# Tuning Parameters

In [135]:
from sklearn.model_selection import GridSearchCV 
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['liblinear', 'sag']}

lr_clf = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_lr_clf = lr_clf.fit(X_train_scaled,y_train)
print('Best Score: ' + str(best_lr_clf.best_score_))
print('Best Parameters: ' + str(best_lr_clf.best_params_))

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    0.6s


Best Score: 0.5896443514644352
Best Parameters: {'C': 1.623776739188721, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'sag'}


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:    4.4s finished


In [136]:
svc = SVC(probability = False)
param_grid = tuned_parameters = [{'kernel': ['rbf'], 'gamma': [.1,.5,1,2,5,10],
                                  'C': [.1, 1, 10, 100, 1000]},
                                 {'kernel': ['linear'], 'C': [.1, 1, 10, 100, 1000]},
                                 ]
svc_clf = GridSearchCV(svc, param_grid = param_grid, cv = 3, verbose = True, n_jobs = -1)
best_svc_clf = svc_clf.fit(X_train_scaled,y_train)
print('Best Score: ' + str(best_svc_clf.best_score_))
print('Best Parameters: ' + str(best_svc_clf.best_params_))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 3 folds for each of 35 candidates, totalling 105 fits


[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done  94 out of 105 | elapsed:    0.8s remaining:    0.1s


Best Score: 0.6280263157894737
Best Parameters: {'C': 10, 'gamma': 5, 'kernel': 'rbf'}


[Parallel(n_jobs=-1)]: Done 105 out of 105 | elapsed:    1.8s finished


We are not getting good results for our models.
Maybe there are too many classes and not enough features, lets see if we can divide them and then train the models

In [137]:
def wine_qlty(row):
    if row<5:
        row = 1
    elif (row>4 and row<7):
        row = 2
    else:
        row = 3
    return row
y_train = y_train.apply(wine_qlty)
y_test = y_test.apply(wine_qlty)

Here, we have:
1 = Poor quality
2 = Average quality
3 = Good quality

## Decision Tree

In [139]:
dt_clf = DecisionTreeClassifier().fit(X_train_scaled, y_train)
y_predict = dt_clf.predict(X_test_scaled)
accuracy_score(y_test, y_predict)

0.685

## KNN

In [143]:
from sklearn.neighbors import KNeighborsClassifier
max_acc = 0
k = 0
for i in range(1,50,2):
    knn_clf = KNeighborsClassifier(n_neighbors = i).fit(X_train_scaled,y_train)
    y_predict = knn_clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_predict)
    if accuracy>max_acc:
        max_acc = accuracy
        k = i
    print("Accuracy score of k = {} is {}".format(i, accuracy))
print("Max accuracy score is {} for k = {}".format(max_acc, k))

Accuracy score of k = 1 is 0.8475
Accuracy score of k = 3 is 0.86
Accuracy score of k = 5 is 0.8425
Accuracy score of k = 7 is 0.835
Accuracy score of k = 9 is 0.8425
Accuracy score of k = 11 is 0.8475
Accuracy score of k = 13 is 0.8375
Accuracy score of k = 15 is 0.8425
Accuracy score of k = 17 is 0.84
Accuracy score of k = 19 is 0.8425
Accuracy score of k = 21 is 0.8375
Accuracy score of k = 23 is 0.83
Accuracy score of k = 25 is 0.8325
Accuracy score of k = 27 is 0.8375
Accuracy score of k = 29 is 0.84
Accuracy score of k = 31 is 0.84
Accuracy score of k = 33 is 0.845
Accuracy score of k = 35 is 0.8475
Accuracy score of k = 37 is 0.845
Accuracy score of k = 39 is 0.845
Accuracy score of k = 41 is 0.8425
Accuracy score of k = 43 is 0.835
Accuracy score of k = 45 is 0.8375
Accuracy score of k = 47 is 0.84
Accuracy score of k = 49 is 0.8425
Max accuracy score is 0.86 for k = 3


## SVC

In [142]:
from sklearn.svm import SVC
svc_clf = SVC(kernel = 'linear').fit(X_train_scaled,y_train)
y_predict = svc_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.8275


## Random Forest

In [140]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier().fit(X_train_scaled, y_train)
y_predict = rf_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.8475


## Logistic Regression

In [141]:
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(max_iter = 2000).fit(X_train_scaled,y_train)
y_predict = lr_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.845


# Tuning Parameters

In [144]:
from sklearn.model_selection import GridSearchCV 
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['liblinear', 'sag']}

lr_clf = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_lr_clf = lr_clf.fit(X_train_scaled,y_train)
print('Best Score: ' + str(best_lr_clf.best_score_))
print('Best Parameters: ' + str(best_lr_clf.best_params_))

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 tasks      | elapsed:    0.1s


Best Score: 0.8448570432357044
Best Parameters: {'C': 78.47599703514607, 'max_iter': 2000, 'penalty': 'l1', 'solver': 'liblinear'}


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:    1.1s finished


In [145]:
y_predict = best_lr_clf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.8375


In [146]:
svc = SVC(probability = False)
param_grid = tuned_parameters = [{'kernel': ['rbf'], 'gamma': [.1,.5,1,2,5,10],
                                  'C': [.1, 1, 10, 100, 1000]},
                                 {'kernel': ['linear'], 'C': [.1, 1, 10, 100, 1000]},
                                 ]
svc_clf = GridSearchCV(svc, param_grid = param_grid, cv = 3, verbose = True, n_jobs = -1)
best_svc_clf = svc_clf.fit(X_train,y_train)
print('Best Score: ' + str(best_svc_clf.best_score_))
print('Best Parameters: ' + str(best_svc_clf.best_params_))

Fitting 3 folds for each of 35 candidates, totalling 105 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:    0.4s


Best Score: 0.8448663324979115
Best Parameters: {'C': 1, 'gamma': 10, 'kernel': 'rbf'}


[Parallel(n_jobs=-1)]: Done 105 out of 105 | elapsed:    2.2s finished


In [147]:
y_predict = best_svc_clf.predict(X_test)
accuracy_score(y_test, y_predict)

0.8575

Thus we get much better score by having only 3 classes to predict wine quality.

Here we can choose SVC classifier or KNN classifier with k=3