# Red Wine Quality Prediction

## Problem Statement:

The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

This dataset can be viewed as a classification task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.

## Attribute Information

Input variables (based on physicochemical tests):

1 - Fixed Acidity

2 - Volatile Acidity

3 - Citric Acid

4 - Residual Sugar

5 - Chlorides

6 - Free Sulfur dioxide

7 - Total Sulfur dioxide

8 - Density

9 - pH

10 - Sulphates

11 - Alcohol

Output variable (based on sensory data):

12 - Quality (score between 0 and 10)

An interesting exercise is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. classifying wines scored 7 or higher as 'good/1' and the remainder as 'not good/0'.
This allows you to practice hyperparameter tuning on e.g. decision tree algorithms while looking at the ROC curve and the AUC value.

You need to build a classification model.

Download Files:
https://github.com/dsrscientist/DSData/blob/master/winequality-red.csv

--------------------------------------------------------------------------------------------------------------------------

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('winequality-red.csv')

In [None]:
print("Rows, columns: " + str(df.shape))
df.head()

In [None]:
df.info()

--------------------------------------------------------------------------------------------------------------------------

### Metadata about features

1. Alcohol: the amount of alcohol in the wine
2. Volatile acidity: the amount of acetic acid in the wine; high levels lead to an unpleasant vinegar taste
3. Sulphates: a wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant
4. Citric acid: acts as a preservative and increases acidity (small quantities add freshness and flavor to wines)
5. Total sulfur dioxide: the amount of free plus bound forms of SO2
6. Density: sweeter wines have a higher density
7. Chlorides: the amount of salt in the wine
8. Fixed acidity: non-volatile acids that do not evaporate readily
9. pH: the level of acidity
10. Free sulfur dioxide: prevents microbial growth and the oxidation of wine
11. Residual sugar: the amount of sugar remaining after fermentation stops. The key is a good balance between sweetness and sourness (wines with more than 45 g/L are considered sweet)
12. Quality (target): for the purpose of this project, I converted the output to a binary label where each wine is either “good quality” (a score of 7 or higher) or not (a score below 7).

-------------------------------------------------------------------------------------------------------------------------

## Objectives

1. Build ML models with different classification algorithms and select the one with the highest accuracy for predicting whether a wine is good or not good.
2. Use EDA to determine which features are most indicative of a good quality wine.

--------------------------------------------------------------------------------------------------------------------------

# Statistical Summary

In [None]:
df.describe()

#### Comment -
1. For most features the mean is greater than the median, which suggests right-skewed distributions.
2. The minimum value of citric acid is zero. We need to check whether this is valid data or a data error.
3. There is a large gap between the 75th percentile and the maximum for residual sugar, free sulfur dioxide and total sulfur dioxide.
4. Given the spread of the data (standard deviation relative to the mean) and the gap between the 3rd quartile and the maximum, outliers are very likely present.

--------------------------------------------------------------------------------------------------------------------------

#### Mean feature values as per different quality grade

In [None]:
# Mean of each listed feature per quality grade ('mean' replaces the deprecated np.mean callable)
feature_cols = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                'chlorides', 'free sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
means = pd.pivot_table(data=df, index='quality', values=feature_cols, aggfunc='mean')
means

#### Comment -
#### Based on the mean values per quality grade
1. Good quality wines (grades 7 and 8) possess higher amounts of alcohol, citric acid, fixed acidity and sulphates.
2. Good quality wines (grades 7 and 8) possess lower chlorides, pH and volatile acidity.
3. Good quality wines (grades 7 and 8) possess a moderate amount of free sulfur dioxide, in the range of 14-16.
4. Density and residual sugar are not deciding factors in determining the quality of wine.

--------------------------------------------------------------------------------------------------------------------------

#### Create a classification version of the target variable
We create two classes based on the quality grade of the red wine:
1. Class 1 - Good quality red wine: quality grade of 7 or higher
2. Class 0 - Low quality red wine: quality grade below 7

In [None]:
df['class'] = (df['quality'] >= 7).astype(int)  # 1 = good (grade 7+), 0 = not good

In [None]:
df['class'].value_counts()

In [None]:
df1=df.drop(columns='quality')

---------------------------------------------------------------------------------------------------------------------------

#### Mean feature values based on class

In [None]:
# Mean of each listed feature per class ('mean' replaces the deprecated np.mean callable)
feature_cols = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                'chlorides', 'free sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
means = pd.pivot_table(data=df, index='class', values=feature_cols, aggfunc='mean')
means

In [None]:
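class_counts = df['class'].value_counts()
fig, ax = plt.subplots()
# value_counts() sorts by frequency, so take the labels from its index
# instead of hard-coding them (class 0 is the larger slice)
ax.pie(class_counts, labels=class_counts.index.astype(str), radius=1, autopct='%1.2f%%', shadow=True)
plt.show()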

---------------------------------------------------------------------------------------------------------------------------

#### Checking null value or missing data

In [None]:
sns.heatmap(df.isnull(), cmap='hsv')

In [None]:
missing_values = df.isnull().sum().sort_values(ascending = False)
percentage_missing_values =(missing_values/len(df))*100
print(pd.concat([missing_values, percentage_missing_values], axis =1, keys =['Missing Values', '% Missing data']))

# Exploratory Data Analysis

In [None]:
plt.figure(figsize =(10, 7))
sns.countplot(x='quality', data=df)  # pass the column by keyword; recent seaborn versions require it

In [None]:
df['quality'].value_counts()

In [None]:
quality_counts = df['quality'].value_counts()
fig, ax = plt.subplots(figsize=(10, 10))
# Labels come from the index so each slice is always named after its own grade
ax.pie(quality_counts, labels=quality_counts.index.astype(str), autopct='%1.1f%%', shadow=True)
plt.show()

#### Comment -
1. The majority of wine samples are of quality level 5 or 6.
2. The dataset has only 217 wine samples with a higher quality grade (7 or 8).

----------------------------------------------------------------------------------------------------------------

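Because only 217 of the 1599 wines are graded 7 or higher, plain accuracy has a high floor. As a rough illustration (the counts are the ones quoted in this notebook; the variable names exist only for this sketch), a model that always predicts "not good" already scores about 86%:

```python
# Counts quoted in this notebook: 1599 wines in total, 217 of them graded 7 or 8.
n_total = 1599
n_good = 217

# A trivial model that always predicts class 0 ("not good") is correct
# for every wine that is not good.
baseline_accuracy = (n_total - n_good) / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Accuracy figures are easier to judge against this baseline, which is one reason to also look at the confusion matrix, the ROC curve and the AUC.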
#### Distribution of features :

In [None]:
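plt.figure(figsize=(20,25), facecolor='white')
# One histogram with a KDE overlay per column, capped at the 12 panels of the 4x3 grid.
# histplot replaces the deprecated distplot.
for plotnumber, column in enumerate(df.columns[:12], start=1):
    ax = plt.subplot(4, 3, plotnumber)
    sns.histplot(df[column], kde=True, stat='density', color='r')
    plt.xlabel(column, fontsize=20)
plt.show()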

#### Comment - 

The feature distributions are skewed.

---------------------------------------------------------------------------------------------------------------------------

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber =1
for column in df:
    if plotnumber <=12:
        ax = plt.subplot(4,3,plotnumber)
        plt.bar(df['quality'], df[column], color='b') 
        plt.xlabel('quality',fontsize=20)
        plt.ylabel(column, fontsize =20)
    plotnumber+=1
plt.tight_layout()
plt.show()

#### Comment -
##### Based on quality
1. Good quality wines (grades 7 and 8) possess higher amounts of alcohol and fixed acidity.
2. Good quality wines (grades 7 and 8) possess lower pH and volatile acidity.
3. Good quality wines (grades 7 and 8) possess a moderate amount of free sulfur dioxide, in the range of 14-16.
4. Density and residual sugar are not deciding factors in determining the quality of wine.
5. Low quality wines possess a lower amount of total sulfur dioxide.
6. Higher volatile acidity lowers the quality of wine.

--------------------------------------------------------------------------------------------------------------------------

In [None]:
plt.figure(figsize=(12,16), facecolor='white')
plotnumber =1
for column in df:
    if plotnumber <=12:
        ax = plt.subplot(4,3,plotnumber)
        sns.barplot(x=df['class'], y=df[column])  # keyword arguments; positional x/y are rejected by recent seaborn
        plt.xlabel('class',fontsize=20)
        plt.ylabel(column, fontsize =20)
    plotnumber+=1
plt.tight_layout()
plt.show()
# class 1 - good quality
# class 0 - low quality

#### Comment - 
1. Wine quality increases with alcohol, sulphates, residual sugar, citric acid and fixed acidity.
2. Wine quality decreases with increases in total sulfur dioxide, chlorides, volatile acidity and free sulfur dioxide.

In [None]:
plt.figure(figsize=(20,50), facecolor='white')
plotnumber =1
for column in df:
    if plotnumber <=12:
        ax = plt.subplot(6,2,plotnumber)
        sns.boxplot(x=df['quality'], y=df[column])  # keyword arguments; positional x/y are rejected by recent seaborn
        plt.xlabel('quality',fontsize=20)
        plt.ylabel(column, fontsize =20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
Grp_c=df.groupby('class')
C_1=Grp_c.get_group(1)
C_2=Grp_c.get_group(0)

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber =1
for column in C_1:
    if plotnumber <=12:
        ax = plt.subplot(4,3,plotnumber)
        sns.kdeplot(C_1[column], color='b')
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()

In [None]:
sns.pairplot(data=df1, hue='class')  # pairplot draws its own legend for `hue`
plt.show()

---------------------------------------------------------------------------------------------------------------------------

# Feature selection

## Outlier Detection Based on IQR

In [None]:
df2 =df1.copy()
Q1 =df2.quantile(0.25)
Q3= df2.quantile(0.75)
IQR = Q3-Q1
print(IQR)

In [None]:
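# Apply the IQR rule to the feature columns only: the binary `class` column has
# Q1 = Q3 = 0, so including it would flag every class-1 wine as an outlier.
features = df2.columns.drop('class')
lower = Q1[features] - 1.5 * IQR[features]
upper = Q3[features] + 1.5 * IQR[features]
outlier_mask = ((df2[features] < lower) | (df2[features] > upper)).any(axis=1)
df_new = df2[~outlier_mask]
print(df_new.shape)

#### Data Loss

In [None]:
# Derive the loss from the actual row counts instead of hard-coding them
data_loss = (len(df2) - len(df_new)) / len(df2) * 100
print("\033[1m"+'Percentage Data Loss :'+"\033[0m", data_loss, '%')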

The IQR method discards a significant share of the data, so the Z-score method is tried next.

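A column-wise IQR rule needs some care when the frame also holds a binary column. The sketch below uses a small made-up frame (not the wine data; all names are illustrative) to show that a minority class covering fewer than 25% of the rows gets Q1 = Q3 = 0, so every minority row falls outside the fences. Restricting the rule to the feature columns is one way around this.

```python
import pandas as pd

# Synthetic stand-in: one numeric feature and a binary label
# where only 2 of 10 rows belong to the minority class.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.4, 9.5, 9.8, 10.0, 10.2, 10.5, 11.0, 12.5, 13.0],
    "class":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# For the binary column both quartiles are 0, so its fences collapse to [0, 0].
flagged = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))
minority_flagged = flagged.loc[toy["class"] == 1, "class"]
print("IQR of class column:", iqr["class"])
print("Minority rows flagged by the class column:", int(minority_flagged.sum()))

# Looking only at the feature columns keeps the label out of the rule.
feature_cols = ["alcohol"]
feature_outliers = flagged[feature_cols].any(axis=1)
print("Rows flagged by features only:", int(feature_outliers.sum()))
```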
## Removing Outliers using Z score Method

In [None]:
from scipy.stats import zscore
df3=df1.copy()
z_score = zscore(df3)
z_score_abs = np.abs(z_score)
df_new = df3[(z_score_abs < 3).all(axis=1)].copy()  # copy so later column assignments do not write to a slice
df_new.shape

#### Data Loss

In [None]:
print("\033[1m"+'Percentage Data Loss :'+"\033[0m",((1599-1458)/1599)*100,'%')

--------------------------------------------------------------------------------------------------------------------------

## Skewness detection and transformation

In [None]:
df_new.skew()

##### Several features are highly skewed, so they need to be transformed

#### Transforming positive (right) skewed data using the Box-Cox transformation

In [None]:
from scipy.stats import boxcox

In [None]:
df_new['fixed acidity']=boxcox(df_new['fixed acidity'],0)
df_new['residual sugar']=boxcox(df_new['residual sugar'],-1)
df_new['chlorides']=boxcox(df_new['chlorides'],-0.5)
df_new['free sulfur dioxide']=boxcox(df_new['free sulfur dioxide'],0)
df_new['total sulfur dioxide']=boxcox(df_new['total sulfur dioxide'],0)
df_new['sulphates']=boxcox(df_new['sulphates'],0)
df_new['alcohol']=boxcox(df_new['alcohol'],-0.5)

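For reference, when a fixed lambda is passed, the Box-Cox transform reduces to a simple formula: the natural log for lambda = 0, and (x^lambda - 1) / lambda otherwise. The sketch below reimplements that formula with NumPy on made-up values; `boxcox_fixed` is a hypothetical helper written for illustration, not part of SciPy. It shows that lambda = 0 is a plain log transform and lambda = -1 maps x to 1 - 1/x.

```python
import numpy as np

def boxcox_fixed(x, lmbda):
    """Box-Cox transform for a fixed lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

sample = np.array([1.0, 2.0, 4.0, 8.0])         # made-up positive values
log_version = boxcox_fixed(sample, 0)           # same as np.log(sample)
reciprocal_version = boxcox_fixed(sample, -1)   # same as 1 - 1/sample
print(log_version)
print(reciprocal_version)
```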
In [None]:
df_new.skew()

--------------------------------------------------------------------------------------------------------------------------

### Correlation

In [None]:
df_new.corr()

In [None]:
plt.figure(figsize =(12,10))
sns.heatmap(df_new.corr(), annot= True ,cmap='Spectral')
plt.tight_layout()
plt.show()

#### Visualizing correlation of feature columns with label column.

In [None]:
plt.figure(figsize = (12,6))
df_new.corr()['class'].drop(['class']).plot(kind='bar',color = 'c')
plt.xlabel('Features',fontsize=15)
plt.ylabel('Correlation with class',fontsize=15)
plt.title('Correlation of features with class',fontsize = 18)
plt.show()

#### Checking Multicollinearity between features using variance_inflation_factor

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
df_new2 = df_new.drop(columns='class')  # VIF is defined for the predictors, so leave the target out

In [None]:
vif=pd.DataFrame()
vif['vif'] = [variance_inflation_factor(df_new2.values,i) for i in range(df_new2.shape[1])]
vif['Features']= df_new2.columns
vif

#### pH and density contribute little to the label and show high multicollinearity, so we drop both.

In [None]:
df_new2= df_new2.drop(['density','pH'], axis=1)

In [None]:
vif=pd.DataFrame()
vif['vif'] = [variance_inflation_factor(df_new2.values,i) for i in range(df_new2.shape[1])]
vif['Features']= df_new2.columns
vif

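As a rough cross-check on how to read VIF values, the centered VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features with an intercept. These values equal the diagonal of the inverse correlation matrix. The sketch below uses synthetic data (not the wine features) and only NumPy. Note that `variance_inflation_factor` regresses on the columns exactly as given, so without an added constant column its numbers can be larger than these centered values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)   # nearly a copy of `a`, so strongly collinear
features = np.column_stack([a, b, c])

# Centered VIFs: diagonal of the inverse of the correlation matrix.
corr = np.corrcoef(features, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
for name, value in zip(["a", "b", "c"], vif):
    print(f"VIF({name}) = {value:.2f}")
```

In this toy example the independent column `b` stays close to 1, while `a` and `c` inflate each other.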
#### Multicollinearity remains high, so instead of dropping more columns we scale the full feature set (`df_new`, including density and pH) and apply PCA for dimensionality reduction.

### Standard Scaling

In [None]:
X= df_new.drop(columns=['class'])
Y= df_new['class']

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X_scale = scaler.fit_transform(X)

In [None]:
X_scale

## PCA 

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
#plot the graph to find the principal components
x_pca = pca.fit_transform(X_scale)
plt.figure(figsize=(10,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()

#### Comment -
###### As per the graph, 8 principal components account for 90% of the variation in the data, so we pick the first 8 components for our prediction.

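The component count can also be read off numerically rather than from the plot, whose x-axis starts at 0 (so position 7 corresponds to 8 components). The sketch below uses hypothetical explained-variance ratios, not the ones from this dataset, and picks the smallest number of components whose cumulative share reaches 90%. scikit-learn's `PCA` also accepts a fraction such as `n_components=0.90` for the same purpose.

```python
import numpy as np

# Hypothetical explained-variance ratios, in decreasing order as PCA reports them.
explained_variance_ratio = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])
cumulative = np.cumsum(explained_variance_ratio)

# Index of the first cumulative value >= 0.90, converted to a component count.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% variance:", n_components)
```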
In [None]:
pca_new = PCA(n_components=8)
x_new = pca_new.fit_transform(X_scale)
print(x_new)

# Machine Learning Model Building

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report,f1_score

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(x_new, Y, random_state=42, test_size=.3)
print('Training feature matrix size:',X_train.shape)
print('Training target vector size:',Y_train.shape)
print('Test feature matrix size:',X_test.shape)
print('Test target vector size:',Y_test.shape)

### Finding best Random state

In [None]:
# Try a range of seeds and keep the one whose split gives the highest test accuracy
maxAccu=0
maxRS=0
for i in range(1,250):
    X_train,X_test,Y_train,Y_test = train_test_split(x_new,Y,test_size = 0.3, random_state=i)
    log_reg=LogisticRegression()
    log_reg.fit(X_train,Y_train)
    y_pred=log_reg.predict(X_test)
    acc=accuracy_score(Y_test,y_pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print('Best accuracy is', maxAccu ,'on Random_state', maxRS)

---------------------------------------------------------------------------------------------------------------------------

## Logistic Regression

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(x_new, Y, random_state=133, test_size=.33)
log_reg=LogisticRegression()
log_reg.fit(X_train,Y_train)
y_pred=log_reg.predict(X_test)
print('\033[1m'+'Logistic Regression Evaluation'+'\033[0m')
print('\n')
print('\033[1m'+'Accuracy Score of Logistic Regression :'+'\033[0m', accuracy_score(Y_test, y_pred))
print('\n')
print('\033[1m'+'Confusion matrix of Logistic Regression :'+'\033[0m \n',confusion_matrix(Y_test, y_pred))
print('\n')
print('\033[1m'+'Classification Report of Logistic Regression'+'\033[0m \n',classification_report(Y_test, y_pred))

--------------------------------------------------------------------------------------------------------------------------

### Finding Optimal value of n_neighbors for KNN

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error
rmse_val = []  # RMSE on the test set for each k
for K in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=K)
    model.fit(X_train, Y_train)     # fit the model
    y_pred = model.predict(X_test)  # predict on the test set
    error = sqrt(mean_squared_error(Y_test, y_pred))  # RMSE of the 0/1 predictions
    rmse_val.append(error)
    print('RMSE value for k =', K, 'is:', error)

In [None]:
# Plot RMSE against k, indexing from 1 so the x-axis shows the actual k values
curve = pd.Series(rmse_val, index=range(1, len(rmse_val) + 1))  # elbow curve
curve.plot()

#### Comment- 
At k = 12 the RMSE is approximately 0.299 and rises as k increases further, so k = 12 is a reasonable choice in this case.

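For 0/1 labels, the RMSE used above is just the square root of the misclassification rate, so minimizing it is the same as maximizing accuracy. An RMSE of about 0.299 therefore corresponds to an accuracy of roughly 0.91. A small sketch with made-up labels illustrates the identity:

```python
import numpy as np

# Made-up true labels and predictions (2 of 8 are wrong).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
accuracy = np.mean(y_true == y_pred)

# For binary labels each squared error is 0 or 1, so rmse**2 is the error rate.
print("RMSE:", rmse)
print("1 - RMSE^2:", 1 - rmse ** 2, "accuracy:", accuracy)
```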
---------------------------------------------------------------------------------------------------------------------------

In [None]:
model=[
        SVC(),
        GaussianNB(),
        DecisionTreeClassifier(),
        KNeighborsClassifier(n_neighbors = 12),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        GradientBoostingClassifier(),
        BaggingClassifier()]

for m in model:
    m.fit(X_train,Y_train)
    y_pred=m.predict(X_test)
    print('\033[1m'+'Classification ML Algorithm Evaluation Metrics',m,'is' +'\033[0m')
    print('\n')
    print('\033[1m'+'Accuracy Score :'+'\033[0m\n', accuracy_score(Y_test, y_pred))
    print('\n')
    print('\033[1m'+'Confusion matrix :'+'\033[0m \n',confusion_matrix(Y_test, y_pred))
    print('\n')
    print('\033[1m'+'Classification Report :'+'\033[0m \n',classification_report(Y_test, y_pred))
    print('\n')
    print('============================================================================================================')

##### RandomForestClassifier() gives the highest accuracy, so we investigate the models further with cross-validation

# Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
model=[
        SVC(),
        GaussianNB(),
        DecisionTreeClassifier(),
        KNeighborsClassifier(n_neighbors = 12),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        GradientBoostingClassifier(),
        BaggingClassifier()]

for m in model:
    score = cross_val_score(m, x_new, Y, cv=5)  # use the same scaled PCA features the models were trained on
    print('\n')
    print('\033[1m'+'Cross Validation Score', m, ':'+'\033[0m\n')
    print("Score :" ,score)
    print("Mean Score :",score.mean())
    print("Std deviation :",score.std())
    print('\n')
    print('============================================================================================================')

#### Random Forest Classifier gives the highest accuracy, so we apply hyperparameter tuning to the Random Forest model

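In this notebook the scaler and PCA are fitted on the full dataset before any splitting, so every validation fold has indirectly influenced the preprocessing. One common way to avoid that, sketched here on synthetic data generated with `make_classification` (an assumption for illustration, not the wine data), is to wrap the preprocessing and the model in a pipeline so that each fold refits them on its own training part.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data (11 features, about 14% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6,
    weights=[0.86, 0.14], random_state=0,
)

# Scaling and PCA are refit on the training part of every fold.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# Stratified folds keep the class ratio similar in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The same pipeline object could be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `randomforestclassifier__max_depth`).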
# Hyperparameter Tuning: GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
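# 'auto' is no longer accepted for max_features in recent scikit-learn releases;
# for classifiers it meant the same as 'sqrt', so it is dropped from the grid
parameter = {'n_estimators':[30,50,60],'max_depth': [10,20,40,60,80],
             'criterion':['gini','entropy'],'max_features':["sqrt","log2"]}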


In [None]:
GCV = GridSearchCV(RandomForestClassifier(),parameter,cv=5,n_jobs = -1)
GCV.fit(X_train,Y_train)

In [None]:
GCV.best_params_

# Final Model

In [None]:
Final_mod = RandomForestClassifier(criterion='entropy',n_estimators= 50, max_depth=20 ,max_features='sqrt')
Final_mod.fit(X_train,Y_train)
y_pred=Final_mod.predict(X_test)
print('\033[1m'+'Accuracy Score :'+'\033[0m\n', accuracy_score(Y_test, y_pred))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

y_pred_prob = Final_mod.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(Y_test,y_pred_prob)
plt.plot([0,1],[0,1], 'k--')
plt.plot(fpr, tpr, label='Random Forest Classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
# Score the AUC on the predicted probabilities, the same scores used for the ROC curve
auc_score = roc_auc_score(Y_test, y_pred_prob)
print('\033[1m'+'Auc Score :'+'\033[0m\n',auc_score)

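The AUC depends on whether it is computed from predicted probabilities or from hard 0/1 predictions, because thresholding throws away the ranking within each predicted class. The sketch below uses made-up scores and a hypothetical helper, `pairwise_auc`, which computes the AUC as the share of (positive, negative) pairs that are ranked correctly, with ties counted as half. This is the pairwise-ranking definition of ROC AUC.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Illustrative helper only.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])   # made-up predicted probabilities
hard = (proba >= 0.5).astype(int)                  # what thresholding at 0.5 would return

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard)
print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

In this toy case the hard labels give a lower AUC than the probabilities, so the two numbers should not be compared with each other.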
## Saving model

In [None]:
import joblib
joblib.dump(Final_mod,'Red_Wine_Quality_Final.pkl')