<table align="center" width=100%>
    <tr>
        <td width="35%">
            <img src="https://www.cnet.com/a/img/TQz8Ib-VK2VLerOOig8ky841Fgs=/940x0/2019/06/27/347b8f9c-65a4-448b-b641-73a2becfbf83/vegan-wine-club.jpg">
        </td>
        <td>
            <div align="center">
                <font color="#7F0542" size="7">
                    <b>Wine Quality Prediction
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

# Problem Statement 🍷🍾

**Predicting the quality of wine from physicochemical parameters such as alcohol, acidity, density, pH, etc.**

# Data Dictionary

**Input variables (based on physicochemical tests):** 

**1 - fixed acidity** : Amount of Tartaric acid found

**2 - volatile acidity** : Amount of Acetic acid found

**3 - citric acid** : Amount of Citric acid found

**4 - residual sugar** : Amount of sugar left post fermentation

**5 - chlorides** : Amount of salts present in wine

**6 - free sulfur dioxide** : Amount of Sulfur Dioxide present in free form

**7 - total sulfur dioxide** : Amount of Sulfur Dioxide present in wine

**8 - density** : Density of wine

**9 - pH** : Indicate the pH value of wine ranging from 0 to 14

**10 - sulphates** : Amount of Potassium Sulphate in wine

**11 - alcohol** : Alcohol content in wine

**Output variable (based on sensory data):**

**12 - quality (score between 0 and 10)** : Indicates the quality of the wine on a scale from 0 to 10, where the higher the value, the better the wine

## Table of Contents

1. **[Import Libraries](#import_lib)**
2. **[Set Options](#set_options)**
3. **[Read Data](#Read_Data)**
4. **[Exploratory Data Analysis](#data_preparation)**
    - 4.1 - [Preparing the Dataset](#Data_Preparing)
        - 4.1.1 - [Data Dimension](#Data_Shape)
        - 4.1.2 - [Data Types](#Data_Types)
        - 4.1.3 - [Missing Values](#Missing_Values)
        - 4.1.4 - [Duplicate Data](#duplicate)
        - 4.1.5 - [Indexing](#indexing)
        - 4.1.6 - [Final Dataset](#final_dataset)
    - 4.2 - [Understanding the Dataset](#Data_Understanding)
        - 4.2.1 - [Summary Statistics](#Summary_Statistics)
        - 4.2.2 - [Correlation](#correlation)
        - 4.2.3 - [Analyze Categorical Variables](#analyze_cat_var)
        - 4.2.4 - [Analyze Target Variable](#analyze_tar_var)
        - 4.2.5 - [Analyze Relationship Between Target and Independent Variables](#analyze_tar_ind_var)
        - 4.2.6 - [Feature Engineering](#feature_eng)
5. **[Data Pre-Processing](#data_pre)**
    - 5.1 - [Outliers](#out)
        - 5.1.1 - [Discovery of Outliers](#dis_out)
        - 5.1.2 - [Removal of Outliers](#rem_out)
        - 5.1.3 - [Rechecking of Correlation](#rec_cor)
    - 5.2 - [Categorical Encoding](#cat_enc)
    - 5.3 - [Feature Scaling](#fea_sca)
    - 5.4 - [Train-Test Split](#split)
6. **[Logistic Regression](#log_reg)**
7. **[Naive Bayes Algorithm](#nai_bay)**
8. **[K Nearest Neighbors (KNN)](#knn)**
9. **[Decision Tree for Classification](#dec_tre)**
10. **[Random Forest](#ran_for)**
11. **[AdaBoost](#ada)**
12. **[Gradient Boosting](#gra_boo)**
13. **[Extreme Gradient Boosting (XGB)](#xgb)**
14. **[Stack Generalisation](#stack)**
15. **[Displaying Score Summary](#dis_sco)**
16. **[Feature Importance](#fea_imp)**
17. **[Conclusion](#conclu)**
18. **[Deployment](#deploy)**
19. **[References](#Refer)**

# 1. Import Libraries <a id='import_lib'></a>

In [None]:
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# import subpackage of Matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# import 'Seaborn' 
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None
 
# to display the float values upto 6 decimal places     
pd.options.display.float_format = '{:.6f}'.format

# import train-test split 
from sklearn.model_selection import train_test_split

# import various functions from statsmodels
import statsmodels
import statsmodels.api as sm

# import StandardScaler to perform scaling
from sklearn.preprocessing import StandardScaler 

# import various functions from sklearn 
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score 

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV 

# import function to perform feature selection
from sklearn.feature_selection import RFE

from sklearn.preprocessing import MinMaxScaler
import scipy
from scipy.stats import shapiro
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
from sklearn.tree import plot_tree
from sklearn.naive_bayes import GaussianNB

# 2. Set Options <a id='set_options'></a>

In [None]:
# display all columns of the dataframe
pd.options.display.max_columns = None
# display all rows of the dataframe
pd.options.display.max_rows = None
# return an output value upto 6 decimals
pd.options.display.float_format = '{:.6f}'.format

# 3. Read Data <a id='Read_Data'></a>

In [None]:
# load the csv file
# store the data in 'df_wine'
df_wine = pd.read_csv('../input/d/nischithasai/wine-quality/winequalityN.csv')

# display the first ten observations using head()
df_wine.head(10).style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
df_wine.info()

# 4. Exploratory Data Analysis <a id='data_preparation'></a>

## 4.1 Preparing the Dataset <a id='Data_Preparing'></a>

### 4.1.1 Data Dimensions <a id='Data_Shape'></a>

In [None]:
df_wine.shape

In this dataset I have 6497 records across 13 features

### 4.1.2 Data Types <a id='Data_Types'></a>

In [None]:
df_wine.dtypes

In this dataset I have **1 object, 11 float and 1 int columns**.
But according to our metadata, the column **quality** should be of object datatype.

In [None]:
df_wine['quality']=df_wine['quality'].astype('object')

In [None]:
df_wine.dtypes

After converting the datatype of **quality**, our dataset contains **2 object columns and 11 float columns**

### 4.1.3 Missing Values <a id='Missing_Values'></a>

In [None]:
missing_value = pd.DataFrame({
    'Missing Value': df_wine.isnull().sum(),
    'Percentage': (df_wine.isnull().sum() / len(df_wine))*100
})

In [None]:
missing_value.sort_values(by='Percentage', ascending=False).style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

**Visualising missing values using Heatmap**

In [None]:
# set the figure size
plt.figure(figsize=(25,15))

# plot heatmap to check null values
# isnull(): returns 'True' for a missing value
# cbar: specifies whether to draw a colorbar; draws the colorbar for 'True' 
sns.heatmap(df_wine.isnull(), cbar=False)


# display the plot
plt.show()

I have **7 columns with a few missing values** present in the dataset

**Missing Values Replacement**

In [None]:
df=df_wine[['fixed_acidity','pH','volatile_acidity','sulphates','citric_acid','chlorides','residual_sugar']]

In [None]:
for column in enumerate(df):
    plt.figure(figsize=(30,5))
    #sns.set_theme(style="darkgrid",palette='deep')
    sns.boxplot(x=column[1], data=  df,color='red')
    plt.xlabel(column[1],fontsize=18)
    plt.show()

Since there are many outliers in each column with missing values, I am replacing the null values with the median, which is less sensitive to extreme values than the mean.
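
To illustrate the reasoning above, here is a minimal sketch on made-up numbers (the toy series is hypothetical and not taken from the wine data). It shows how a single extreme value drags the mean upwards while the median stays close to the bulk of the data:

```python
import numpy as np
import pandas as pd

# hypothetical toy column with one missing value and one extreme outlier
toy = pd.Series([3.1, 3.2, 3.3, np.nan, 3.4, 15.0])

mean_fill = toy.mean()      # pulled upwards by the outlier
median_fill = toy.median()  # stays near the bulk of the data

# fill the missing entry with the median, as done for the wine columns
toy_filled = toy.fillna(median_fill)
print('mean:', round(mean_fill, 2), 'median:', round(median_fill, 2))
```

Filling with the mean here would place the imputed value well above every ordinary observation, which is why the median is the more conservative choice for skewed columns.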

In [None]:
for column in df.columns:
    df_wine[column]= df_wine[column].fillna(df_wine[column].median())

In [None]:
df_wine.isna().sum()

**Visualising missing values using Heatmap**

In [None]:
# set the figure size
plt.figure(figsize=(10,8))

# plot heatmap to check null values
# isnull(): returns 'True' for a missing value
# cbar: specifies whether to draw a colorbar; draws the colorbar for 'True' 
sns.heatmap(df_wine.isnull(), cbar=False)

# display the plot
plt.show()

Now there are **no missing values** present in the dataset

### 4.1.4 Duplicate Data <a id='duplicate'></a>

In [None]:
duplicate = df_wine.duplicated().sum()
print('There are {} duplicated rows in the data'.format(duplicate))

**Getting rid of duplicate data**

In [None]:
df_wine.drop_duplicates(inplace=True)

**Checking for duplicate data after removal of duplicates**

In [None]:
duplicate = df_wine.duplicated().sum()
print('There are {} duplicated rows in the data'.format(duplicate))

### 4.1.5 Indexing <a id='indexing'></a>

In [None]:
df_wine.shape

There are **5329 records** after dropping duplicates

In [None]:
df_wine.tail(5).style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

**The last index values still run up to 6496, but I have only 5329 records, so the index needs to be reset**

In [None]:
df_wine.reset_index(inplace=True,drop=True)

In [None]:
df_wine.tail().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

### 4.1.6 Final Dataset <a id='final_dataset'></a>

In [None]:
df_wine.shape

In [None]:
df_wine.head().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

The final dataset has **5329 records and 13 features with no missing or duplicate values**

## 4.2 Understanding the Dataset <a id='Data_Understanding'></a>

### 4.2.1 Summary Statistics <a id='Summary_Statistics'></a>

**Numeric Variables**

In [None]:
df_wine.describe(include=np.number).style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

From the above table, I can infer:

    1. The minimum pH value is 2.7 and the maximum is 4.01, so all the wines are acidic in nature
    
    2. The alcohol content of the wines ranges from 8 to 15 with an average of 10.5
    
    3. free_sulfur_dioxide is below 41 for 75% of the wines, which is about 4 times lower than the corresponding value for total_sulfur_dioxide
    
    4. The maximum fixed_acidity is 15.9 while 75% of the values are below 7.7, which suggests possible outliers

**Categorical Variables**

In [None]:
df_wine.describe(include = object).style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

From the above table, I can infer:

    1. There are two types of wine, with white occurring more frequently
    
    2. There are seven unique values for the quality of wine, with 6 occurring the most

### 4.2.2 Correlation <a id='correlation'></a>

In [None]:
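# correlation is only defined for numeric columns, so exclude the object columns explicitly
corr_matrix=df_wine.select_dtypes(include=np.number).corr()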
corr_matrix.style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
plt.figure(figsize=(11,9))
dropSelf = np.zeros_like(corr_matrix)
dropSelf[np.triu_indices_from(dropSelf)] = True

sns.heatmap(corr_matrix, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)

sns.set(font_scale=1.5)

**Inferences:**
1. free_sulfur_dioxide is highly positively correlated with total_sulfur_dioxide

2. density is moderately positively correlated with fixed_acidity and residual_sugar, whilst moderately negatively correlated with alcohol

3. Some features, such as citric_acid, free_sulfur_dioxide, sulphates and pH, are only weakly correlated with each other
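
Inferences like these can also be read off programmatically instead of from the heatmap. The following is a minimal sketch on synthetic data (the columns `a`, `b` and `c` are hypothetical stand-ins for the wine features) that ranks every feature pair by the absolute value of its correlation:

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the numeric wine columns
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of 'a'
    'c': rng.normal(size=200),                     # unrelated noise
})

corr = toy.corr()

# visit each pair once (strict upper triangle of the matrix)
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
]

# strongest relationships first, regardless of sign
pairs.sort(key=lambda t: abs(t[2]), reverse=True)
for left, right, value in pairs:
    print(left, right, round(value, 3))
```

Applied to the wine correlation matrix, the same loop would be expected to list the free/total sulfur dioxide pair near the top.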

### 4.2.3 Analyse Categorical Variables <a id='analyze_cat_var'></a>

In [None]:
data = df_wine.groupby('type')['quality'].count()
fig, ax = plt.subplots(figsize=[10,6])
labels = ['red','white']
ax = plt.pie(x=data, autopct="%.1f%%", explode=[0.05]*2, labels=labels, colors=['darkred','white'],
             wedgeprops={"edgecolor":"black"},pctdistance=0.5)
plt.show()

From the above pie chart, I can infer:

     About 75% of the records are white wine while the remainder are red wine

### 4.2.4 Analyse Target Variable <a id='analyze_tar_var'></a>

In [None]:
quality_mapping = { 3 : "Low",4 : "Low",5: "Low",6 : "High",7: "High",8 : "High",9 : "High"}
df_wine["quality"] =  df_wine["quality"].map(quality_mapping)

In [None]:
df_target = df_wine['quality'].copy()

# class counts, sorted from most to least frequent
counts = df_target.value_counts()

# fix the bar order so that the percentage labels line up with the bars
sns.countplot(x = df_target, order = counts.index)

# use below code to print the values in the graph
# 'x' and 'y' gives position of the text
# 's' is the text 
plt.text(x = -0.05, y = counts.iloc[0]+1, s = str(round(counts.iloc[0]*100/len(df_target),2)) + '%')
plt.text(x = 0.95, y = counts.iloc[1]+1, s = str(round(counts.iloc[1]*100/len(df_target),2)) + '%')

# add plot and axes labels
# set text size using 'fontsize'
plt.yticks([0,1000,2000,3000,4000])
plt.title('Count Plot for Target Variable (wine_quality)', fontsize = 15)
plt.xlabel('Target Variable', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.tight_layout()
# to show the plot
plt.show()

From the above graph, I can infer:

     The majority of the wine is high quality, and the two classes of the target column are reasonably balanced.

### 4.2.5 Analyse Relationship between Target and Independent Variables <a id='analyze_tar_ind_var'></a>

In [None]:
fig, ax = plt.subplots(figsize=(12,4))
pd.options.display.float_format = '{:,.2f}'.format

bar_chart = df_wine.groupby(['type','quality'])['quality'].count().unstack('type')
bar_chart= (bar_chart.T/bar_chart.T.sum()).T
ax = bar_chart.plot(kind='bar', stacked=True, color=['r','w'], edgecolor='black', ax=ax)

labels = []
for j in bar_chart.columns:
    for i in bar_chart.index:
          label = str('{0:.2%}'.format(bar_chart.loc[i][j]))
          labels.append(label)

patches = ax.patches

for label, rect in zip(labels, patches):
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        ax.text(x + width/2., y + height/2., label, ha='center', va='center', color='black')

ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=0)
ax.set_yticklabels(labels='')
ax.set_ylabel('% of records')
plt.legend(bbox_to_anchor = (1, 1.01), edgecolor='black')
plt.show()

From the above graph:

    I can infer that white wine makes up the majority of records in both the High and Low quality groups.

In [None]:
sns.boxplot(data=df_wine, x="quality", y ="alcohol").set(title='Quality v/s Alcohol')

From the above boxplot, I can infer:

     Wines with a higher alcohol content tend to receive higher ratings than wines with a lower alcohol content.

In [None]:
def KdeAndBox(at1,at2):
    plt.figure(figsize=(14,9))
    plt.subplot(2,2,1)
    # 'fill=True' replaces the deprecated 'shade=True' argument of kdeplot
    sns.kdeplot(df_wine.loc[df_wine["quality"]=="Low"][at1],fill=True)
    sns.kdeplot(df_wine.loc[df_wine["quality"]=="High"][at1],fill=True)

    plt.legend(["Low","High"])
    plt.title(at1.upper(),fontsize=15)
    plt.subplot(2,2,2)
    sns.kdeplot(df_wine.loc[df_wine["quality"]=="Low"][at2],fill=True)
    sns.kdeplot(df_wine.loc[df_wine["quality"]=="High"][at2],fill=True)
    plt.legend(["Low","High"])
    plt.title(at2.upper(),fontsize=15)
    plt.subplot(2,2,3)
    sns.violinplot(data=df_wine,y=at1,x="quality")
    plt.subplot(2,2,4)
    sns.violinplot(data=df_wine,y=at2,x="quality")
    plt.show()

In [None]:
KdeAndBox("fixed_acidity","volatile_acidity")

From the above KDE plot, I can infer:
    1. The distributions of both volatile_acidity and fixed_acidity appear highly positively skewed for Low and High quality wine alike.
    2. For fixed_acidity, values between 6 and 7.5 have the highest probability density irrespective of quality, and there is not much difference between the densities for Low and High.
    3. For volatile_acidity, values between 0 and 0.5 have the highest probability density irrespective of quality, but the density for High quality wine is greater than that for Low quality.

In [None]:
KdeAndBox("citric_acid","alcohol")

From the above KDE plot, I can infer:
    1. The distribution of citric_acid is close to normal when the quality is Low, whilst it is moderately positively skewed when the quality is High
    2. The distribution of alcohol is highly positively skewed when the quality is Low, whilst it is very close to normal when the quality is High
    3. For citric_acid, values between 0 and 0.5 have the highest probability density irrespective of quality, and there is a lot of difference between the densities for Low and High.
    4. For alcohol, values between 8 and 10 have the highest probability density for Low quality wine, while High quality wine peaks between 10 and 12

In [None]:
KdeAndBox("chlorides","density")

From the above KDE plot, I can infer:
    1. The distribution of chlorides is extremely positively skewed irrespective of the quality
    2. The distribution of density seems to be negatively skewed when the quality is Low, whilst it seems to be positively skewed when the quality is High
    3. For chlorides, values between 0 and 0.1 have the highest probability density irrespective of quality, and there is a lot of difference between the densities for Low and High.
    4. For density, values between 0.99 and 1 have the highest probability density irrespective of the quality

In [None]:
KdeAndBox("total_sulfur_dioxide","free_sulfur_dioxide")

From the above KDE plot, I can infer:
    1. The distribution of total_sulfur_dioxide seems to be normal for both quality types
    2. The distribution of free_sulfur_dioxide seems to be highly positively skewed for both quality types
    3. For total_sulfur_dioxide, values between 50 and 150 have the highest probability density when the quality is High, whilst for Low quality wine the density seems to be spread evenly across 50-200
    4. For free_sulfur_dioxide, values between 0 and 50 have the highest probability density irrespective of the quality

In [None]:
KdeAndBox("pH","sulphates")

From the above KDE plot, I can infer:
    1. The distribution of pH seems to be normal for both quality types
    2. The distribution of sulphates seems to be highly positively skewed for both quality types
    3. For pH, values between 3 and 3.5 have the highest probability density for both quality types
    4. For sulphates, values around 0.5 have the highest probability density irrespective of the quality

In [None]:
plt.figure(figsize=(14,4.5))
plt.subplot(1,2,1)
# 'fill=True' replaces the deprecated 'shade=True' argument of kdeplot
sns.kdeplot(df_wine.loc[df_wine["quality"]=="Low"]["residual_sugar"],fill=True)
sns.kdeplot(df_wine.loc[df_wine["quality"]=="High"]["residual_sugar"],fill=True)

plt.legend(["Low","High"])
plt.title("residual sugar".upper(),fontsize=15)
plt.subplot(1,2,2)
sns.violinplot(data=df_wine,y="residual_sugar",x="quality")
plt.show()


From the above KDE plot, I can infer:
    1. The distribution of residual_sugar seems to be highly positively skewed for both quality types
    2. For residual_sugar, values between 0 and 5 have the highest probability density for both quality types

In [None]:
# Seaborn pairplot
sns_plot = sns.pairplot(df_wine,corner=True,hue='type',palette='dark:salmon_r',height=4.0)
plt.show()

### 4.2.6 Feature Engineering <a id='feature_eng'></a>

**Sulfur dioxide ratio**

Since free sulfur dioxide is the unbound part of total sulfur dioxide, I will calculate the ratio of these two features. This ratio has a higher correlation with quality than either of the individual features.

In [None]:
df_wine['sulfur_dioxide_ratio'] = df_wine['free_sulfur_dioxide']/df_wine['total_sulfur_dioxide']

In [None]:
df_wine.drop(['free_sulfur_dioxide'],axis=1,inplace=True)

In [None]:
df_wine.head().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

# 5. Data Preprocessing <a id='data_pre'></a>

## 5.1 Outliers <a id='out'></a>

### 5.1.1 Discovery of Outliers<a id='dis_out'></a>

In [None]:
df_num_features=df_wine.select_dtypes(include=np.number)

**Identifying outliers using IQR**

In [None]:
Q1 = df_num_features.quantile(0.25)
Q3 = df_num_features.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
outlier = pd.DataFrame((df_num_features < (Q1 - 1.5 * IQR)) | (df_num_features > (Q3 + 1.5 * IQR)))

In [None]:
for i in outlier.columns:
    print('Total number of Outliers in column {} are {}'.format(i, (len(outlier[outlier[i] == True][i]))))
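
The IQR rule used above can be sketched on a small made-up column (the toy values are hypothetical). A value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile:

```python
import pandas as pd

# hypothetical toy column: nine ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (toy < lower_fence) | (toy > upper_fence)

print('fences:', lower_fence, upper_fence)
print('outliers:', toy[is_outlier].tolist())
```

Because the fences are built from quartiles rather than from the mean and standard deviation, the extreme value itself has little influence on where the fences fall.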

**Visualizing outliers using Boxplots**

In [None]:
for column in enumerate(df_num_features):
    plt.figure(figsize=(30,5))
    sns.set_theme(style="darkgrid")
    sns.boxplot(x=column[1], data=  df_num_features,color='red')
    plt.xlabel(column[1],fontsize=18)
    plt.show()

### 5.1.2 Removal of Outliers<a id='rem_out'></a>

**Checking the normality of numeric features**

In [None]:
import scipy
from scipy.stats import shapiro
stat, p_value = shapiro(df_num_features)

# print the test statistic and corresponding p-value 
print('Test statistic:', stat)
print('P-Value:', p_value)

Since the numeric features are not normally distributed, I am removing the outliers using the IQR method

In [None]:
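# compare only the numeric columns for which the quartiles were computed
num_cols = df_wine[Q1.index]
df_wine = df_wine[~((num_cols < (Q1 - 1.5 * IQR)) | (num_cols > (Q3 + 1.5 * IQR))).any(axis=1)]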
df_wine.shape

In [None]:
df_wine.reset_index(inplace=True,drop=True)

In [None]:
df_wine.tail().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

### 5.1.3 Re-checking Correlation<a id='rec_cor'></a>

In [None]:
data_num_features = df_wine.select_dtypes(include=np.number)
# print the names of the numeric variables 
print('The numerical columns in the dataset are: ',data_num_features.columns)

In [None]:
corr =  data_num_features.corr()

# print the correlation matrix
corr.style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
plt.figure(figsize=(11,9))
# build the mask from the re-computed matrix 'corr', not the earlier 'corr_matrix'
dropSelf = np.zeros_like(corr)
dropSelf[np.triu_indices_from(dropSelf)] = True

sns.heatmap(corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)

sns.set(font_scale=1.5)

The correlation is rechecked after treating the outliers. There has been a slight change in the correlations between the numeric features

## 5.2 Categorical Encoding<a id='cat_enc'></a>

In [None]:
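# encode wine type as one binary column (1 = white, 0 = red);
# assigning the two-column output of pd.get_dummies() to a single column is ambiguous
df_wine['type'] = (df_wine['type'] == 'white').astype(int)
quality_mapping = {"Low":0, "High":1}
df_wine["quality"] =  df_wine["quality"].map(quality_mapping)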
df_wine.head().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
df_wine.dtypes

## 5.3 Feature Scaling<a id='fea_sca'></a>

In [None]:
df_num_features=df_wine.drop(['type','quality'],axis=1)

**Checking normality for numerical columns**

In [None]:
for col in df_num_features.columns:
    print("Column ", col, " :", shapiro(df_num_features[col]))

Since none of the numerical features are normally distributed (p-value < 0.05), I will perform Min-Max normalisation to scale the data
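
As a reminder of what Min-Max normalisation does, here is a minimal sketch on a hypothetical toy feature. Each value is mapped to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1:

```python
import numpy as np

# hypothetical toy feature values
x = np.array([8.0, 9.5, 11.0, 14.0])

# min-max scaling: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```

`MinMaxScaler` with its default `feature_range` of (0, 1) applies this same transformation to each column separately.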

In [None]:
mms = MinMaxScaler()
mmsfit = mms.fit(df_num_features)
dfxz = pd.DataFrame(mms.fit_transform(df_num_features), columns = df_num_features.columns)

In [None]:
dfxz.head().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
df_cat=df_wine[['type','quality']]

In [None]:
dfxz = pd.concat([dfxz, df_cat], axis = 1)
dfxz.head().style.set_properties(**{'background-color':'black','color':'white','border-color':'red'})

In [None]:
dfxz.isna().sum()

## 5.4 Train-Test Split<a id="split"></a>

Before applying various classification techniques to predict the quality of the wine, let us split the dataset into train and test sets.

In [None]:
X=dfxz.drop('quality',axis=1)
y=dfxz['quality']

In [None]:
# add a constant column to the dataframe
# the 'Logit' method in the Statsmodels library does not include an intercept by default
# the intercept can be added to the set of independent variables using 'add_constant()'
X = sm.add_constant(X)

# split data into train subset and test subset
# set 'random_state' to generate the same split each time you run the code 
# 'test_size' sets the proportion of data to be included in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.3)

# check the dimensions of the train & test subset using 'shape'
# print dimension of train set
print('X_train', X_train.shape)
print('y_train', y_train.shape)

# print dimension of test set
print('X_test', X_test.shape)
print('y_test', y_test.shape)

#### Creating a generalized function to create a dataframe containing the scores for the models.

In [None]:
# create an empty dataframe to store the scores for various algorithms
score_card1 = pd.DataFrame(columns=['Probability Cutoff', 'AUC Score', 'Precision Score', 'Recall Score',
                                       'Accuracy Score', 'Kappa Score', 'f1-score'])

# append the result table for all performance scores
# performance measures considered for model comparision are 'AUC Score', 'Precision Score', 'Recall Score','Accuracy Score',
# 'Kappa Score', and 'f1-score'
# compile the required information in a user defined function 
def update_score_card1(model, cutoff):
    
    # let 'y_pred_prob' be the predicted probabilities for the test set
    # 'features' is the list of columns selected by backward elimination in Section 6
    y_pred_prob = model.predict(X_test[features])

    # convert probabilities to 0 and 1 using the cut-off
    y_pred = [ 0 if x < cutoff else 1 for x in y_pred_prob]
    
    # assign 'score_card1' as global variable
    global score_card1

    # collect the scores for this cut-off in a one-row dataframe
    new_row = pd.DataFrame([{'Probability Cutoff': cutoff,
                             'AUC Score' : metrics.roc_auc_score(y_test, y_pred),
                             'Precision Score': metrics.precision_score(y_test, y_pred),
                             'Recall Score': metrics.recall_score(y_test, y_pred),
                             'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                             'Kappa Score':metrics.cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred)}])

    # add the row to 'score_card1'; DataFrame.append() was removed in pandas 2.0, so use pd.concat()
    # 'ignore_index = True' does not consider the index labels
    score_card1 = pd.concat([score_card1, new_row], ignore_index = True)

# 6. Logistic Regression<a id="log_reg"></a>

In [None]:
logreg = sm.Logit(y_train, X_train).fit()

# print the summary of the model
print(logreg.summary())

**Interpretation:** The `Pseudo R-squ.` obtained from the above model summary is **0.2312**  which is also the value of `McFadden's R-squared`. This value can be obtained from the formula:

<p style='text-indent:25em'> <strong> McFadden's R-squared = $ 1 - \frac{\text{Log-Likelihood}}{\text{LL-Null}} $</strong> </p>

Where,<br>
Log-Likelihood: It is the maximum value of the log-likelihood function<br>
LL-Null: It is the maximum value of the log-likelihood function for the model containing only the intercept 
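
The formula can be worked through by hand on a small example. The sketch below uses hypothetical labels and hypothetical predicted probabilities (they are not output from the wine model, and they are not a true maximum-likelihood fit) to show how the two log-likelihoods combine:

```python
import numpy as np

# hypothetical labels and predicted probabilities from some fitted model
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
p_model = np.array([0.2, 0.3, 0.8, 0.7, 0.6, 0.4, 0.9, 0.5])

def log_likelihood(y_true, p):
    # Bernoulli log-likelihood of the labels under probabilities p
    return np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

ll_model = log_likelihood(y, p_model)

# the intercept-only model predicts the base rate for every row
ll_null = log_likelihood(y, np.full(len(y), y.mean()))

mcfadden_r2 = 1 - ll_model / ll_null
print(round(mcfadden_r2, 4))
```

For a statsmodels `Logit` fit, the same quantities are exposed on the results object as `llf`, `llnull` and `prsquared`.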

The LLR p-value is less than 0.05, which implies that the model is significant.

Even though the model is significant, there are a few features which are insignificant (P-value > 0.05)

**Backward Elimination Model**

To obtain the most significant features related to the target variable, I perform the backward elimination process below:

In [None]:
sfs_backward=sfs(estimator=LogisticRegression(),k_features='best',forward=False,verbose=0,scoring='accuracy')
sfs_model=sfs_backward.fit(X_train,y_train)
features=list(sfs_model.k_feature_names_)
print("The best features obtained from elimination process:",features)

Now, building a logistic regression model using the features obtained from the above elimination process

In [None]:
logreg_backward = sm.Logit(y_train, X_train[features]).fit()

# print the summary of the model
print(logreg_backward.summary())

**Interpretation:** The `Pseudo R-squ.` obtained from the above model summary is **0.2310**  
The LLR p-value is less than 0.05, which implies that the model is significant.

**Identifying the Best Cut-off Value**

Now, let us consider a list of values as cut-off to calculate the different performance measures.

In [None]:
# consider a list of values for cut-off
cutoff = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# use the for loop to compute performance measures for each value of the cut-off
# call update_score_card1() to update the score card for each cut-off
# pass the model and cut-off value to the function
for value in cutoff:
    update_score_card1(logreg_backward, value)

In [None]:
# print the score card 
print('Score Card for Logistic regression:')

# sort the dataframe based on the probability cut-off values ascending order
# 'reset_index' resets the index of the dataframe
# 'drop = True' drops the previous index
score_card1 = score_card1.sort_values('Probability Cutoff').reset_index(drop = True)

# color the cell in the column 'Accuracy Score' having the maximum value
# 'style.highlight_max' assigns color to the maximum value
# pass specified color to the parameter, 'color'
# pass the data to limit the color assignment to the parameter, 'subset' 
score_card1.style.highlight_max(color = 'red', subset = ['Accuracy Score'])

**Interpretation:** The above dataframe shows the performance metrics at each probability cut-off, with the highest accuracy highlighted.
The optimal probability cut-off for further analysis is chosen by considering the **Accuracy Score**.
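
The cut-off selection can be sketched in isolation. The example below uses hypothetical labels and probabilities (not the wine predictions) and applies the same convention as `update_score_card1`, where a probability below the cut-off becomes 0 and anything else becomes 1:

```python
import numpy as np

# hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.15, 0.35, 0.55, 0.65, 0.85, 0.45, 0.75, 0.25])

accuracy_by_cutoff = {}
for cutoff in [0.3, 0.5, 0.7]:
    # label a row 1 when its probability reaches the cut-off
    y_hat = (y_prob >= cutoff).astype(int)
    accuracy_by_cutoff[cutoff] = float(np.mean(y_hat == y_true))

# pick the cut-off with the highest accuracy
best_cutoff = max(accuracy_by_cutoff, key=accuracy_by_cutoff.get)
print(accuracy_by_cutoff)
print('best cut-off:', best_cutoff)
```

Accuracy is a reasonable selection criterion here because the two classes are fairly balanced; with a heavily imbalanced target, a metric such as the f1-score or Kappa would usually be a safer guide.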

**Predictions on the train set.**

In [None]:
# let 'y_pred_prob1' be the predicted values of y
y_pred_prob1 = logreg_backward.predict(X_train[features])

# print the y_pred_prob1
y_pred_prob1.head()

I set the cut-off to 0.5, i.e. if 'y_pred_prob1' is less than 0.5, the prediction is 0; otherwise it is 1.

In [None]:
# convert probabilities to 0 and 1 using 'if_else'
y_pred1 = [ 0 if x < 0.5 else 1 for x in y_pred_prob1]
y_pred1[:10]

**Predictions on the test set.**

In [None]:
# let 'y_pred_prob' be the predicted values of y
y_pred_prob = logreg_backward.predict(X_test[features])

# print the y_pred_prob
y_pred_prob.head()

In [None]:
# convert probabilities to 0 and 1 using 'if_else'
y_pred = [ 0 if x < 0.5 else 1 for x in y_pred_prob]
y_pred[:10]

**Confusion Matrix**

In [None]:
# create a confusion matrix
# pass the actual and predicted target values to the confusion_matrix()
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
# pass the matrix as 'data'
# pass the required column names to the parameter, 'columns'
# pass the required row names to the parameter, 'index'
conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# plot a heatmap to visualize the confusion matrix
# 'annot' prints the value of each grid 
# 'fmt = d' returns the integer value in each grid
# 'cmap' assigns color to each grid
# as I do not require different colors for each grid in the heatmap,
# use 'ListedColormap' to assign the specified color to the grid
# 'cbar = False' will not return the color bar to the right side of the heatmap
# 'linewidths' assigns the width to the line that divides each grid
# 'annot_kws = {'size':25})' assigns the font size of the annotated text 
sns.heatmap(conf_matrix, annot= True, fmt = 'd', cmap ='Reds', cbar = False, linewidths = 0.1, annot_kws = {'size':25})

# set the font size of x-axis ticks using 'fontsize'
plt.xticks(fontsize = 20)

# set the font size of y-axis ticks using 'fontsize'
plt.yticks(fontsize = 20)

# display the plot
plt.show()

**Train Report**

In [None]:
print(classification_report(y_train, y_pred1))

**Interpretation:** From the above output, I can see that the model has 76% accuracy on the training set.

**Test Report**

In [None]:
print(classification_report(y_test, y_pred))

**Interpretation:** From the above output, I can see that the model is 74% accurate on the test set.

From the above classification reports, I can infer that there is only a small difference between the train and test reports (76% vs. 74% accuracy).
Hence I conclude that the model is slightly overfitted.
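
The figures in a classification report follow directly from the four cells of a binary confusion matrix. The sketch below shows that arithmetic in plain Python; `metrics_from_confusion` is a hypothetical helper and the counts are made up, not taken from this notebook. For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays the cells out as `[[tn, fp], [fn, tp]]`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# made-up counts, for illustration only
example_metrics = metrics_from_confusion(tn=50, fp=10, fn=20, tp=20)
print(example_metrics)
```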

**ROC Curve**

In [None]:
# the roc_curve() returns the values for false positive rate, true positive rate and threshold
# pass the actual target values and predicted probabilities to the function
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# plot the ROC curve
plt.plot(fpr, tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the straight line showing worst prediction for the model
plt.plot([0, 1], [0, 1],'r--')

# add plot and axes labels
# set text size using 'fontsize'
plt.title('ROC curve ', fontsize = 15)
plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

# add the AUC score to the plot
# 'x' and 'y' give the position of the text
# 's' is the text; it must be a single string, so the label and the score are joined
# use round() to round off the AUC score to 4 decimal places
plt.text(x = 0.02, y = 0.9, s = 'AUC Score: ' + str(round(metrics.roc_auc_score(y_test, y_pred_prob),4)))
                               
# plot the grid
plt.grid(True)

**Interpretation:** The red dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).<br>
From the above plot, I can see that our classifier (logistic regression with features obtained from the Backward elimination method) is away from the dotted line, with an AUC score of **0.7957**.
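
One way to read the AUC score: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. `auc_by_pairs` is a hypothetical helper written for illustration (it is not how `roc_auc_score` is implemented), and the labels and scores are made up.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(pos) * len(neg))


# made-up labels and scores, for illustration only
example_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

A value of 0.5 corresponds to the dotted diagonal (random ranking) and 1.0 to a perfect ranking.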

**Score Card**

In [None]:
#defining a score card
score_card=pd.DataFrame(columns=['Model_Name','Accuracy(Train)','Accuracy(Test)','Diff_b/w_train&test(Acc)','AUC_Score','Avg(Acc)'])

In [None]:
# Predicting Cross Validation Score
cv_lr = cross_val_score(estimator = LogisticRegression() , X = X_train[features], y = y_train, cv = 10,scoring='accuracy')

In [None]:
# DataFrame.append() was removed in pandas 2.0, so the new row is added with pd.concat()
lr_scores = pd.DataFrame([{'Model_Name': 'Logistic Regression',
                           'Accuracy(Train)': metrics.accuracy_score(y_train, y_pred1),
                           'Accuracy(Test)': metrics.accuracy_score(y_test, y_pred),
                           'Diff_b/w_train&test(Acc)': abs(metrics.accuracy_score(y_train, y_pred1)-metrics.accuracy_score(y_test, y_pred)),
                           'AUC_Score': metrics.roc_auc_score(y_test, y_pred_prob),
                           'Avg(Acc)': cv_lr.mean()}])
score_card = pd.concat([score_card, lr_scores], ignore_index=True)
score_card

**Some Pre-defined functions**

#### A generalized function to calculate the performance metrics for the train set.

In [None]:
# create a generalized function to calculate the metrics values for train set
def get_train_report(model):
    
    # for training set:
    # train_pred: prediction made by the model on the train dataset 'X_train'
    # y_train: actual values of the target variable for the train dataset

    # predict the output of the target variable from the train data 
    train_pred = model.predict(X_train)

    # return the performance measures on train set
    return(classification_report(y_train, train_pred))

#### A generalized function to calculate the performance metrics for the test set.

In [None]:
# create a generalized function to calculate the performance metrics values for test set
def get_test_report(model):
    
    # for test set:
    # test_pred: prediction made by the model on the test dataset 'X_test'
    # y_test: actual values of the target variable for the test dataset

    # predict the output of the target variable from the test data 
    test_pred = model.predict(X_test)

    # return the classification report for test data
    return(classification_report(y_test, test_pred))

#### Function to plot the confusion matrix.

In [None]:
# define a function to plot a confusion matrix for the model
def plot_confusion_matrix(model):
    
    # predict the target values using X_test
    y_pred = model.predict(X_test)
    
    # create a confusion matrix
    # pass the actual and predicted target values to the confusion_matrix()
    cm = confusion_matrix(y_test, y_pred)

    # label the confusion matrix  
    # pass the matrix as 'data'
    # pass the required column names to the parameter, 'columns'
    # pass the required row names to the parameter, 'index'
    conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

    # plot a heatmap to visualize the confusion matrix
    # 'annot' prints the value of each grid 
    # 'fmt = d' returns the integer value in each grid
    # 'cmap' assigns color to each grid
    # here the built-in 'Reds' colormap is used, so each grid is shaded by its count
    # 'cbar = False' will not return the color bar to the right side of the heatmap
    # 'linewidths' assigns the width to the line that divides each grid
    # 'annot_kws = {'size':25})' assigns the font size of the annotated text 
    sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = 'Reds', cbar = False, 
                linewidths = 0.1, annot_kws = {'size':25})

    # set the font size of x-axis ticks using 'fontsize'
    plt.xticks(fontsize = 20)

    # set the font size of y-axis ticks using 'fontsize'
    plt.yticks(fontsize = 20)

    # display the plot
    plt.show()

#### Function to plot the ROC curve.

In [None]:
# define a function to plot the ROC curve and print the ROC-AUC score
def plot_roc(model):
    
    # predict the probability of target variable using X_test
    # consider the probability of positive class by subsetting with '[:,1]'
    y_pred_prob = model.predict_proba(X_test)[:,1]
    
    # the roc_curve() returns the values for false positive rate, true positive rate and threshold
    # pass the actual target values and predicted probabilities to the function
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

    # plot the ROC curve
    plt.plot(fpr, tpr)

    # set limits for x and y axes
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])

    # plot the straight line showing worst prediction for the model
    plt.plot([0, 1], [0, 1],'r--')

    # add plot and axes labels
    # set text size using 'fontsize'
    plt.title('ROC curve ', fontsize = 15)
    plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
    plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

    # add the AUC score to the plot
    # 'x' and 'y' give the position of the text
    # 's' is the text; it must be a single string, so the label and the score are joined
    # use round() to round off the AUC score to 4 decimal places
    plt.text(x = 0.02, y = 0.9, s = 'AUC Score: ' + str(round(metrics.roc_auc_score(y_test, y_pred_prob),4)))

    # plot the grid
    plt.grid(True)

#### Function for predicting Cross Validation Score

In [None]:
# compute the mean accuracy over 10-fold cross validation on the train set
def cross_valid_score(obj):
    cv = cross_val_score(estimator = obj , X = X_train, y = y_train, cv = 10,scoring='accuracy')
    return cv.mean()
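
To make the idea of `cv = 10` concrete, the sketch below splits row indices into folds in plain Python. It is a simplification under stated assumptions: `k_fold_indices` is a hypothetical helper that builds contiguous, unshuffled folds, whereas scikit-learn uses stratified folds when an integer `cv` is passed with a classifier. The sample count of 23 is arbitrary.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    # spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=23, n_folds=10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), val_idx)
```

The model is fitted on each `train_idx` and scored on the matching `val_idx`; the mean of those scores is what the function above returns.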

#### Function to update the score card

In [None]:
def update_score_card(model_name,model):
    # append the metrics of a fitted model to the global score card
    global score_card
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # probability of the positive class, used for the AUC score
    y_pred_prob = model.predict_proba(X_test)[:,1]
    train_acc = metrics.accuracy_score(y_train, train_pred)
    test_acc = metrics.accuracy_score(y_test, test_pred)

    # DataFrame.append() was removed in pandas 2.0, so the new row is added with pd.concat()
    new_row = pd.DataFrame([{'Model_Name': model_name,
                             'Accuracy(Train)': train_acc,
                             'Accuracy(Test)': test_acc,
                             'Diff_b/w_train&test(Acc)': abs(train_acc - test_acc),
                             'AUC_Score': metrics.roc_auc_score(y_test, y_pred_prob),
                             'Avg(Acc)': cross_valid_score(model)}])
    score_card = pd.concat([score_card, new_row], ignore_index=True)
    return score_card

# 7. Naive Bayes Algorithm<a id="nai_bay"></a>

#### Building a naive bayes model on a training dataset.

In [None]:
# instantiate the 'GaussianNB'
gnb = GaussianNB()

# fit the model using fit() on train data
gnb_model = gnb.fit(X_train, y_train)
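
For intuition, Gaussian naive Bayes estimates a prior plus a per-feature mean and variance for each class, then picks the class with the highest log-posterior. The sketch below shows that idea in plain Python under stated assumptions: `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical helpers, the variance floor of `1e-9` is an arbitrary illustrative value, and the toy rows are made up. It is not the `GaussianNB` implementation.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Estimate (prior, means, variances) for every class."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        n = len(cls_rows)
        means = [sum(col) / n for col in zip(*cls_rows)]
        # small constant keeps every variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*cls_rows), means)]
        model[cls] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    best_cls, best_score = None, -math.inf
    for cls, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            # log of the Gaussian density of feature value x
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# made-up two-feature rows, for illustration only
toy_rows = [[9.0, 0.3], [9.5, 0.4], [12.0, 0.7], [12.5, 0.6]]
toy_labels = [0, 0, 1, 1]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
print(predict_gaussian_nb(toy_model, [9.2, 0.35]))
```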

**Confusion matrix**

In [None]:
plot_confusion_matrix(gnb_model)

**Train Report**

In [None]:
train_report = get_train_report(gnb_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 72% accuracy on the training set.

**Test Report**

In [None]:
test_report = get_test_report(gnb_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 69% accurate on the test set.

From the above classification reports, I can infer that there is only a small difference between the train and test reports.
Hence I conclude that the model is slightly overfitted.

**ROC curve**

In [None]:
# call the function to plot the ROC curve
# pass the  gaussian naive bayes model to the function
plot_roc(gnb_model)

**Interpretation:**
From the above plot, I can see that our classifier (Gaussian Naive Bayes) is away from the dotted line, with an AUC score of **0.7455**.

#### Score Card

In [None]:
update_score_card('Naive Bayes',gnb_model)

# 8. K Nearest Neighbors (KNN)<a id="knn"></a>

**Finding  Optimal Value of K (using GridSearchCV)**

In [None]:
# create a dictionary with hyperparameters and their values
# n_neighbors: number of neighbors to consider
# usually, I consider odd values of 'n_neighbors' to avoid ties between classes among the nearest points
# pass the different distance metrics to the parameter, 'metric' (metric names are lowercase)
tuned_paramaters = {'n_neighbors': np.arange(1, 25, 2),
                   'metric': ['hamming','euclidean','manhattan','chebyshev']}
 
# instantiate the 'KNeighborsClassifier' 
knn_classification = KNeighborsClassifier()

# use GridSearchCV() to find the optimal value of the hyperparameters
# estimator: pass the knn model
# param_grid: pass the dictionary 'tuned_paramaters'
# cv: number of folds in k-fold i.e. here cv = 5
# scoring: pass the scoring parameter 'accuracy'
knn_grid = GridSearchCV(estimator = knn_classification, 
                        param_grid = tuned_paramaters, 
                        cv = 5, 
                        scoring = 'accuracy')

# fit the model on X_train and y_train using fit()
knn_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for KNN Classifier: ', knn_grid.best_params_, '\n')
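
The four metrics in the grid measure "closeness" in different ways. The sketch below writes them out in plain Python for two equal-length feature vectors; the function names mirror the metric names but are hypothetical helpers, and the two vectors are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Fraction of coordinates that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# made-up vectors, for illustration only
point_a = [1.0, 2.0, 3.0]
point_b = [4.0, 6.0, 3.0]
for metric in (euclidean, manhattan, chebyshev, hamming):
    print(metric.__name__, metric(point_a, point_b))
```

Hamming distance only checks whether coordinates are equal, so it is rarely informative for continuous features such as the ones in this dataset.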

#### Building a knn model on a training dataset using the above best parameters.

In [None]:
# instantiate the 'KNeighborsClassifier'
# n_neighbors: number of neighbors to consider
# default metric is minkowski, and with p=2 it is equivalent to the euclidean metric
knn_classification = KNeighborsClassifier(n_neighbors = 17, p = 2)

# fit the model using fit() on train data
knn_model = knn_classification.fit(X_train, y_train)

#### Line plot to see the accuracy rate and error rate for each value of K using euclidean distance as the metric of the KNN model

In [None]:
# consider an empty list to store accuracy rate
accuracy_rate = []

# use for loop to build a knn model for each K
for i in np.arange(1,25,2):
    
    # setup a knn classifier with k neighbors
    # use the 'euclidean' metric 
    knn = KNeighborsClassifier(i, metric = 'euclidean')
   
    # fit the model using 'cross_val_score'
    # pass the knn model as 'estimator'
    # use 5-fold cross validation
    score = cross_val_score(knn, X_train, y_train, cv = 5)
    
    # calculate the mean score
    score = score.mean()
    
    # compute accuracy rate 
    accuracy_rate.append(score)

# plot the accuracy_rate for different values of K 
plt.plot(range(1,25,2), accuracy_rate,color ='blue',linestyle ='dashed', marker ='o',markerfacecolor ='red', markersize = 10)

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Accuracy Rate', fontsize = 15)
plt.xlabel('K', fontsize = 15)
plt.ylabel('Accuracy Rate', fontsize = 15)

# set the x-axis labels
plt.xticks(np.arange(1, 25, step = 2))

# plot a vertical line across the maximum accuracy rate
plt.axvline(x = 17, color = 'red')

# display the plot
plt.show()

In [None]:
# consider an empty list to store error rate
error_rate = []

# use for loop to build a knn model for each K
for i in np.arange(1,25,2):
    
    # setup a knn classifier with k neighbors
    # use the 'euclidean' metric 
    knn = KNeighborsClassifier(i, metric = 'euclidean')
   
    # fit the model using 'cross_val_score'
    # pass the knn model as 'estimator'
    # use 5-fold cross validation
    score = cross_val_score(knn, X_train, y_train, cv = 5)
    
    # calculate the mean score
    score = score.mean()
    
    # compute error rate 
    error_rate.append(1 - score)

# plot the error_rate for different values of K 
plt.plot(range(1,25,2), error_rate,color ='blue',linestyle ='dashed', marker ='o',markerfacecolor ='red', markersize = 10)

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Error Rate', fontsize = 15)
plt.xlabel('K', fontsize = 15)
plt.ylabel('Error Rate', fontsize = 15)

# set the x-axis labels
plt.xticks(np.arange(1, 25, step = 2))

# plot a vertical line across the minimum error rate
plt.axvline(x = 17, color = 'red')

# display the plot
plt.show()

**Interpretation:** I can see that the optimal value of K (= 17) obtained from GridSearchCV() results in the lowest error rate and the highest accuracy rate.
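
Instead of hard-coding the position of the vertical line, the best K can be read off the stored scores. A minimal sketch, assuming a list of candidate K values and a matching list of mean accuracies; `best_k` is a hypothetical helper and the scores below are made up, not the notebook's results.

```python
def best_k(ks, accuracies):
    """Return the K with the highest accuracy (the first one listed on ties)."""
    best_index = max(range(len(ks)), key=lambda i: accuracies[i])
    return ks[best_index]


# made-up candidate values and scores, for illustration only
example_ks = [1, 3, 5, 7]
example_accuracies = [0.60, 0.70, 0.75, 0.72]
print(best_k(example_ks, example_accuracies))
```

The returned value could then be passed to `plt.axvline(x = ...)`.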

**Confusion matrix**

In [None]:
plot_confusion_matrix(knn_model)

**Train Report**

In [None]:
# compute the performance measures on train data
# call the function 'get_train_report'
# pass the knn model to the function
train_report = get_train_report(knn_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 79% accuracy on the training set.

**Test report**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the knn model to the function
test_report = get_test_report(knn_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 74% accurate on the test set.

**Interpretation:** From the above classification reports, I can infer that there is a small difference between the train and test reports.
Hence I conclude that the model is slightly overfitted.

**ROC curve**

In [None]:
# call the function to plot the ROC curve
# pass the knn model to the function
plot_roc(knn_model)

**Interpretation:** 
From the above plot, I can see that our classifier (knn_model with n_neighbors = 17) is away from the dotted line; with the AUC score **0.7933**.

#### Score Card

In [None]:
update_score_card('KNeighbors Classifier',knn_model)

# 9. Decision Tree for Classification<a id="dec_tre"></a>

**Finding Hyperparameters using GridSearchCV (Decision Tree)**

In [None]:
tuned_paramaters = [{'criterion': ['entropy', 'gini'], 
                     'max_depth': [2,4,6,8,10],
                     'max_features': ["sqrt", "log2"],
                     'min_samples_split': [2,4,6,8,10],
                     'min_samples_leaf': [2,4,6,8,10],
                     'max_leaf_nodes': [2,4,6,8,10]}]
 
# instantiate the 'DecisionTreeClassifier' 
# pass the 'random_state' to obtain the same samples for each time you run the code
decision_tree_classification = DecisionTreeClassifier(random_state = 10)

# use GridSearchCV() to find the optimal value of the hyperparameters
# estimator: pass the decision tree classifier model
# param_grid: pass the list 'tuned_paramaters'
# cv: number of folds in k-fold i.e. here cv = 5
tree_grid = GridSearchCV(estimator = decision_tree_classification, 
                         param_grid = tuned_paramaters, 
                         cv = 5)

# fit the model on X_train and y_train using fit()
tree_grid_model = tree_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for decision tree classifier: ', tree_grid_model.best_params_, '\n')
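
The `criterion` options in the grid are two ways of scoring how mixed the class labels in a node are. The sketch below computes both in plain Python; `gini` and `entropy` here are hypothetical helpers written for illustration and the label lists are made up.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


# made-up node contents, for illustration only
mixed_node = [0, 0, 1, 1]
pure_node = [1, 1, 1]
print(gini(mixed_node), entropy(mixed_node))
print(gini(pure_node))
```

Both measures are 0 for a pure node and largest for an evenly mixed one; the tree picks the split that reduces the chosen measure the most.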

#### Building the model using the above obtained tuned hyperparameters.

In [None]:
decision_tree = DecisionTreeClassifier(criterion = tree_grid_model.best_params_.get('criterion'),
                                  max_depth = tree_grid_model.best_params_.get('max_depth'),
                                  max_features = tree_grid_model.best_params_.get('max_features'),
                                  max_leaf_nodes = tree_grid_model.best_params_.get('max_leaf_nodes'),
                                  min_samples_leaf = tree_grid_model.best_params_.get('min_samples_leaf'),
                                  min_samples_split = tree_grid_model.best_params_.get('min_samples_split'),
                                  random_state = 10)

# use fit() to fit the model on the train set
decision_tree = decision_tree.fit(X_train, y_train)

**Decision tree with tuned hyperparameters**

In [None]:
plt.figure(figsize=(60,30))
plot_tree(decision_tree, filled=True);

**Confusion matrix**

In [None]:
plot_confusion_matrix(decision_tree)

**Train report**

In [None]:
# compute the performance measures on train data
# call the function 'get_train_report'
# pass the decision tree model to the function
train_report = get_train_report(decision_tree)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 73% accuracy on the training set.

**Test report**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the decision tree model to the function
test_report = get_test_report(decision_tree)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 71% accurate on the test set.

From the above classification reports, I can infer that there is a difference between the train and test reports.
Hence I conclude that the model is slightly overfitted.

**ROC Curve**

In [None]:
# call the function to plot the ROC curve
# pass the decision tree model to the function
plot_roc(decision_tree)

**Interpretation:** 
From the above plot, I can see that our classifier (Decision tree) is away from the dotted line; with the AUC score **0.7415**.

#### Score Card

In [None]:
update_score_card('Decision Tree Classifier',decision_tree)

To decrease the overfitting and improve the performance of the Decision tree model, I next apply some Bagging and Boosting techniques.
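
As a rough illustration of bagging: each base model is trained on a bootstrap sample (rows drawn with replacement) and the final label is a majority vote. The sketch below shows only those two building blocks in plain Python; `bootstrap_sample` and `majority_vote` are hypothetical helpers, and the rows and votes are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most frequent predicted label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(10)                 # fixed seed so the draw is repeatable
toy_rows = ['row_a', 'row_b', 'row_c', 'row_d', 'row_e']
sample = bootstrap_sample(toy_rows, rng)
print(sample)

# made-up votes from three base models for one observation
print(majority_vote([1, 0, 1]))
```

Boosting differs in that the base models are built one after another, each focusing on the errors of the previous ones.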

# 10. Random Forest<a id="ran_for"></a>

**Tuning the Hyperparameters using GridSearchCV**

In [None]:
tuned_paramaters = [{'criterion': ['entropy', 'gini'],
                     'n_estimators': [ 30, 50, 70],
                     'max_depth': [10,15,20],
                     'max_features': ["sqrt", "log2"],
                     'min_samples_split': [2,6,10],
                     'min_samples_leaf': [2,6,10],
                     'max_leaf_nodes': [2,6,10]}]
 
# instantiate the 'RandomForestClassifier' 
# pass the 'random_state' to obtain the same samples for each time you run the code
random_forest_classification = RandomForestClassifier(random_state = 10)

# use GridSearchCV() to find the optimal value of the hyperparameters
# estimator: pass the random forest classifier model
# param_grid: pass the list 'tuned_paramaters'
# cv: number of folds in k-fold i.e. here cv = 5
rf_grid = GridSearchCV(estimator = random_forest_classification, 
                       param_grid = tuned_paramaters, 
                       cv = 5)

# use fit() to fit the model on the train set
rf_grid_model = rf_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for random forest classifier: ', rf_grid_model.best_params_, '\n')
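
Grid search fits one model per parameter combination per fold, so this grid is more expensive than it may look. A quick count, assuming the grid and `cv = 5` shown above (`rf_grid_sketch` is just a copy of that grid used for counting):

```python
import math

# copy of the random forest grid above, used only to count combinations
rf_grid_sketch = {'criterion': ['entropy', 'gini'],
                  'n_estimators': [30, 50, 70],
                  'max_depth': [10, 15, 20],
                  'max_features': ['sqrt', 'log2'],
                  'min_samples_split': [2, 6, 10],
                  'min_samples_leaf': [2, 6, 10],
                  'max_leaf_nodes': [2, 6, 10]}

n_folds = 5
n_combinations = math.prod(len(values) for values in rf_grid_sketch.values())
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

If the run time becomes a problem, trimming the grid or switching to a randomized search are common options.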

#### Building the model using the tuned hyperparameters obtained above.

In [None]:
# instantiate the 'RandomForestClassifier'
# 'best_params_' returns the dictionary containing best parameter values and parameter name  
# 'get()' returns the value of specified parameter
# pass the 'random_state' to obtain the same samples for each time you run the code
random_forest = RandomForestClassifier(criterion = rf_grid_model.best_params_.get('criterion'),
                                   n_estimators=rf_grid_model.best_params_.get('n_estimators'),
                                  max_depth = rf_grid_model.best_params_.get('max_depth'),
                                  max_features = rf_grid_model.best_params_.get('max_features'),
                                  max_leaf_nodes = rf_grid_model.best_params_.get('max_leaf_nodes'),
                                  min_samples_leaf = rf_grid_model.best_params_.get('min_samples_leaf'),
                                  min_samples_split = rf_grid_model.best_params_.get('min_samples_split'),
                                  random_state = 10)

# use fit() to fit the model on the train set
random_forest = random_forest.fit(X_train, y_train)

**Confusion matrix**

In [None]:
plot_confusion_matrix(random_forest)

**Train report**

In [None]:
# compute the performance measures on train data
# call the function 'get_train_report'
# pass the Random Forest model to the function
train_report = get_train_report(random_forest)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 77% accuracy on the training set.

**Test report**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the Random Forest model to the function
test_report = get_test_report(random_forest)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 74% accurate on the test set.

**ROC Curve**

In [None]:
# call the function to plot the ROC curve
# pass the random forest model to the function
plot_roc(random_forest)

**Interpretation:** 
From the above plot, I can see that our classifier (random forest) is away from the dotted line; with the AUC score **0.7932**.

#### Score Card

In [None]:
update_score_card('Random Forest Classifier',random_forest)

# 11. AdaBoost<a id="ada"></a>

**Tune the Hyperparameters (GridSearchCV)**

In [None]:
tuning_parameters = {'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],'n_estimators' : [10,20,30,40,50]}
ada_model = AdaBoostClassifier()

# use GridSearchCV() to find the optimal value of the hyperparameters
# estimator: pass the AdaBoost classifier model
# param_grid: pass the dictionary 'tuning_parameters'
# cv: number of folds in k-fold i.e. here cv = 3
# scoring: pass a measure to evaluate the model on test set
ada_grid = GridSearchCV(estimator = ada_model, param_grid = tuning_parameters, cv = 3, scoring = 'roc_auc')

# fit the model on X_train and y_train using fit()
ada_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for ADABoost classifier: ', ada_grid.best_params_, '\n')

**Building an Adaboost model on a training dataset.**

In [None]:
# instantiate the 'AdaBoostClassifier'
# n_estimators: number of estimators at which boosting is terminated
# pass the 'random_state' to obtain the same results for each code implementation
ada_model = AdaBoostClassifier(n_estimators = 40, random_state = 10,learning_rate= 0.4)

# fit the model using fit() on train data
ada_model.fit(X_train, y_train)
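
To show what "boosting" does to the training rows, the sketch below performs one round of the textbook discrete AdaBoost update in plain Python: the weak learner's weight is computed from its weighted error, then misclassified rows are up-weighted. This is an assumption-laden illustration, not scikit-learn's exact algorithm; `adaboost_round` is a hypothetical helper, the weights are assumed to sum to 1, and the error must lie strictly between 0 and 1.

```python
import math


def adaboost_round(weights, is_correct):
    """Return (alpha, new_weights) after one textbook AdaBoost round."""
    # weighted error of the weak learner
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # learner weight: larger when the error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # shrink weights of correct rows, grow weights of misclassified rows
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# five rows with equal weight; the last one is misclassified (made-up case)
start_weights = [0.2] * 5
alpha, new_weights = adaboost_round(start_weights,
                                    [True, True, True, True, False])
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified row carries half of the total weight, so the next weak learner is pushed to get it right.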

**Confusion Matrix**

In [None]:
plot_confusion_matrix(ada_model)

**Train Report**

In [None]:
# compute the performance measures on train data
# call the function 'get_train_report'
# pass the adaboost model to the function
train_report = get_train_report(ada_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 78% accuracy on the training set.

**Test Report**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the adaboost model to the function
test_report = get_test_report(ada_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 75% accurate on the test set.

**ROC Curve**

In [None]:
# call the function to plot the ROC curve
# pass the adaboost model to the function
plot_roc(ada_model)

**Interpretation:** 
From the above plot, I can see that our classifier (ADA Boost) is away from the dotted line; with the AUC score **0.7961**.

#### Score Card

In [None]:
update_score_card('Ada Boost Classifier',ada_model)

# 12. Gradient Boosting<a id="gra_boo"></a>

**Tune the Hyperparameters (GridSearchCV)**

In [None]:
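# create a dictionary with hyperparameters and their candidate values
tuning_parameters = {'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],'n_estimators':[30, 50, 70, 90],
                     'max_depth': [2,6,10],'min_samples_split': [2,6,10],
                     'min_samples_leaf': [2,6,10]}

# instantiate the 'GradientBoostingClassifier'
gboost_model = GradientBoostingClassifier()

# use GridSearchCV() with 3-fold cross validation and 'roc_auc' as the scoring measure
gb_grid = GridSearchCV(estimator = gboost_model, param_grid = tuning_parameters, cv = 3, scoring = 'roc_auc')
gb_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for GBoost classifier: ', gb_grid.best_params_, '\n')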

#### Building a Gradient boost model on a training dataset

In [None]:
gboost_model = GradientBoostingClassifier(n_estimators = 70, random_state = 10,learning_rate=0.2,
                                          min_samples_leaf=10,min_samples_split=2,max_depth= 2)
gboost_model.fit(X_train, y_train)
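
Conceptually, each boosting stage fits a small tree to the errors of the current ensemble and adds a scaled-down version of its output. The sketch below shows that bookkeeping for the log-loss in plain Python. It is a simplification under stated assumptions: `log_loss_residuals` and `boosted_scores` are hypothetical helpers, and the "tree outputs" are simply set equal to the residuals, whereas a real implementation fits trees and adjusts the leaf values.

```python
import math


def sigmoid(z):
    """Convert a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss_residuals(y_true, raw_scores):
    """Negative gradient of the log-loss: actual label minus predicted probability."""
    return [y - sigmoid(f) for y, f in zip(y_true, raw_scores)]


def boosted_scores(raw_scores, tree_outputs, learning_rate):
    """Add the new stage's output, shrunk by the learning rate."""
    return [f + learning_rate * h for f, h in zip(raw_scores, tree_outputs)]


# made-up labels; all raw scores start at 0, i.e. probability 0.5
toy_y = [1, 0, 1]
toy_scores = [0.0, 0.0, 0.0]
residuals = log_loss_residuals(toy_y, toy_scores)
next_scores = boosted_scores(toy_scores, residuals, learning_rate=0.2)
print(residuals, next_scores)
```

A smaller `learning_rate` makes each stage contribute less, which is why it is usually tuned together with `n_estimators`.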

**Confusion Matrix**

In [None]:
plot_confusion_matrix(gboost_model)

**Train report**

In [None]:
train_report = get_train_report(gboost_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 80% accuracy on the training set.

**Test Report**

In [None]:
test_report = get_test_report(gboost_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 76% accurate on the test set.

**ROC Curve**

In [None]:
plot_roc(gboost_model)

**Interpretation:** 

From the above plot, I can see that our classifier (Gradient Boosting model) is away from the dotted line, with an AUC score of **0.8136**.

**Score Card**

In [None]:
update_score_card('Gradient Boosting Classifier',gboost_model)

# 13. Extreme Gradient Boosting (XGB)<a id="xgb"></a>

**Tuning the Hyperparameters (GridSearchCV)**

In [None]:
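# create a dictionary with hyperparameters and their candidate values
tuning_parameters = {'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                     'max_depth': [1,3,5,7,9],
                     'gamma': [0, 1, 2, 3, 4]}

# instantiate the 'XGBClassifier'
xgb_model = XGBClassifier()

# use GridSearchCV() with 3-fold cross validation and 'roc_auc' as the scoring measure
xgb_grid = GridSearchCV(estimator = xgb_model, param_grid = tuning_parameters, cv = 3, scoring = 'roc_auc',verbose=0)
xgb_grid.fit(X_train, y_train)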

# get the best parameters
print('Best parameters for XGBoost classifier: ', xgb_grid.best_params_, '\n')

#### Building an Extreme Gradient boost model on a training dataset

In [None]:
xgb_model = XGBClassifier(learning_rate=0.1, gamma = 1,max_depth=3,verbosity=0)

# fit the model using fit() on train data
xgb_model.fit(X_train, y_train)

**Confusion Matrix**

In [None]:
plot_confusion_matrix(xgb_model)

**Train Report**

In [None]:
train_report = get_train_report(xgb_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 82% accuracy on the training set.

**Test Report**

In [None]:
test_report = get_test_report(xgb_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 76% accurate on the test set.

**ROC Curve**

In [None]:
plot_roc(xgb_model)

**Interpretation:** 

From the above plot, I can see that our classifier (Extreme Gradient Boosting model) is away from the dotted line, with an AUC score of **0.8129**.

#### Score Card

In [None]:
update_score_card('Extreme Gradient Boosting Classifier',xgb_model)

# 14. Stack Generalization<a id="stack"></a>

In [None]:
# consider the various algorithms as base learners
base_learners = [('rf_model', RandomForestClassifier(criterion = 'gini', max_depth = 10, max_features = 'sqrt', 
                                                     max_leaf_nodes = 10, min_samples_leaf = 10, min_samples_split = 2, 
                                                     n_estimators = 50, random_state = 10)),
                 ('KNN_model', KNeighborsClassifier(n_neighbors = 17, metric = 'euclidean')),
                 ('NB_model', GaussianNB()),
                 ('Decision_tree',DecisionTreeClassifier(criterion = 'entropy', max_depth= 6, max_features= 'sqrt',
                                                         max_leaf_nodes= 10, min_samples_leaf= 10, min_samples_split= 2,random_state = 10))]

# initialize stacking classifier 
# pass the base learners to the parameter, 'estimators'
# pass the XGB Classifier model as the 'final_estimator'/ meta model
stack_model = StackingClassifier(estimators = base_learners, final_estimator = XGBClassifier())

# fit the model on train dataset
stack_model.fit(X_train, y_train)
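
In stacking, the predictions of the base learners become the input features of the meta model. The sketch below shows only that reshaping step in plain Python; `stack_meta_features` is a hypothetical helper and the probabilities are made up. `StackingClassifier` additionally uses cross-validated predictions of the base learners when training the final estimator, which this sketch does not reproduce.

```python
def stack_meta_features(base_predictions):
    """Turn {model_name: [probability per sample]} into one row per sample."""
    names = sorted(base_predictions)
    n_samples = len(base_predictions[names[0]])
    rows = [[base_predictions[name][i] for name in names]
            for i in range(n_samples)]
    return names, rows


# made-up positive-class probabilities for two samples, for illustration only
toy_predictions = {'rf': [0.8, 0.3],
                   'knn': [0.7, 0.4],
                   'nb': [0.6, 0.2]}
meta_names, meta_rows = stack_meta_features(toy_predictions)
print(meta_names)
print(meta_rows)
```

Each row of `meta_rows` is what the meta model would see for one observation.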

**Confusion Matrix**

In [None]:
plot_confusion_matrix(stack_model)

**Train Report**

In [None]:
train_report = get_train_report(stack_model)

# print the performance measures
print(train_report)

**Interpretation:** From the above output, I can see that the model has 77% accuracy on the training set.

**Test Report**

In [None]:
test_report = get_test_report(stack_model)

# print the performance measures
print(test_report)

**Interpretation:** From the above output, I can see that the model is 72% accurate on the test set.

**ROC Curve**

In [None]:
plot_roc(stack_model)

**Interpretation:** 

From the above plot, I can see that our classifier (Stack Generalized model) is away from the dotted line, with an AUC score of **0.7574**.

#### Score Card

In [None]:
update_score_card('Stack Generalization',stack_model)

# 15. Displaying score summary<a id="dis_sco"></a>

In [None]:
score_card = score_card.sort_values('Diff_b/w_train&test(Acc)').reset_index(drop = True)

score_card.style.highlight_min(color = 'red', subset = ['Diff_b/w_train&test(Acc)'])

**Interpretation:** I can see that the Decision tree classifier has the lowest difference between train accuracy and test accuracy. Hence, I conclude that the Decision tree classifier is the `Best_Model`.

# 16. Feature Importance<a id="fea_imp"></a>

In [None]:
# create a dataframe that stores the feature names and their importance
# 'feature_importances_' returns, for each feature, the normalized total reduction of the split criterion it brings
important_features = pd.DataFrame({'Features': X_train.columns, 
                                   'Importance': decision_tree.feature_importances_})

# sort the dataframe in the descending order according to the feature importance
important_features = important_features.sort_values('Importance', ascending = False)

# create a barplot to visualize the features based on their importance
sns.barplot(x = 'Importance', y = 'Features', data = important_features)

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Feature Importance', fontsize = 15)
plt.xlabel('Importance', fontsize = 15)
plt.ylabel('Features', fontsize = 15)

# display the plot
plt.show()

**Interpretation:** The above bar plot shows that, of all the features, `chlorides` is the most important.

# 17. Conclusion<a id="conclu"></a>

**Of all the models built, I see that the Decision tree classifier has been the most effective, with the least overfitting (the smallest gap between train and test accuracy).**

**Some of the features which contribute most to the prediction of quality are chlorides, density, citric acid, volatile_acidity and sulfur_dioxide_ratio.**

**The results can be used by wine manufacturers to improve the quality of wine in the future and by consumers for wine selection.**

**I can hereby conclude that I have successfully built a model that can predict quality of wine.**

# 18. Deployment<a id="deploy"></a>

https://winesupreme.herokuapp.com/

# 19. References<a id="Refer"></a>

https://www.ijsr.net/archive/v9i7/SR20718002904.pdf

https://ijcat.com/archieve/volume8/issue9/ijcatr08091010.pdf

https://broncoscholar.library.cpp.edu/bitstream/handle/10211.3/216015/NelsonGregory_Thesis2020.pdf?sequence=3

http://cs229.stanford.edu/proj2015/245_report.pdf

https://scihub.se/https://www.sciencedirect.com/science/article/pii/S1877050917328053

<table align="center" width=100%>
    <tr>
        <td width="50%">
            <img src="https://media4.giphy.com/media/QMkPpxPDYY0fu/200.gif">
        </td>
        <td>
            <div align="center">
                <font color="#1A040C " size=24px>
                    <b>Thank You.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>