# Wine Quality Prediction and Clustering

# Problem statement

1) Use the available wine quality data to train a model that predicts wine quality.

2) Cluster the wines in the dataset based on their content.

Let's begin by loading the required packages.

In [None]:
#load libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch
import sklearn.metrics
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer

from IPython.core.interactiveshell import InteractiveShell         #to display multiple outputs in same cell
InteractiveShell.ast_node_interactivity = "all"


Let's load the data and explore it.

In [None]:
data = pd.read_csv("../input/wine-quality/winequality.csv")
data

The output above shows that our dataset contains 6497 rows and 14 columns, which are described as follows:

1) fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)

2) volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3) citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

4) residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

5) chlorides: the amount of salt in the wine

6) free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7) total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8) density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9) pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale

10) sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11) alcohol: the percent alcohol content of the wine

12) quality: output variable (based on sensory data, score between 0 and 10)

13) good: whether the wine is classed as good or not

14) color: the colour of the wine (either red or white)


# Exploratory data analysis

In [None]:
#identify datatypes of features
data.dtypes

In [None]:
#statistical analysis
data.describe()

In [None]:
#find number of wines that are classified 'good' and 'bad'
data.good.value_counts()

The dataset is imbalanced: only around 20 percent of the wines are classified as 'good'.
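With so few 'good' wines, plain accuracy can look respectable for a model that has learned nothing. The sketch below uses made-up labels (20 'good' out of 100, chosen only to mirror the rough proportion mentioned above, not the real counts) and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 1 = good, 0 = not good (20% positives, illustrative only)
labels = [1] * 20 + [0] * 80

# A "model" that always predicts the majority class
predictions = [0] * len(labels)

# Share of predictions that match the label
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Share of the truly good wines that were found
true_pos = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
false_neg = sum(p == 0 and t == 1 for p, t in zip(predictions, labels))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}, recall on 'good' = {recall:.2f}")
```

The accuracy is high while no good wine is ever identified, which is why a class-aware metric such as the F1 score is worth reporting alongside accuracy on this kind of data.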

# Data Visualizations

In [None]:
#plotting histogram of all features
data.hist(figsize=(15,20))

The above plots give us insight into how the different features are distributed.

In [None]:
#regression plot of chlorides vs quality
f,ax=plt.subplots(figsize=(10,10))
sns.regplot(x='chlorides',y='quality',data=data)
plt.title('regression plot of chlorides and quality')

Observing the above regression plot, the quality of the wine tends to decrease as its chloride content increases.

In [None]:
#regression plot of alcohol vs quality
f,ax=plt.subplots(figsize=(10,10))
sns.regplot(x='alcohol',y='quality',data=data)
plt.title('regression plot of alcohol and quality')

Observing the above regression plot, the quality of the wine tends to increase with its alcohol content.
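A regression plot shows the direction of a relationship; the fitted slope and the Pearson correlation put numbers on it. The sketch below computes both by hand on made-up values (they are not taken from the dataset), and the helper name `slope_and_r` is only illustrative.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return the least-squares slope and the Pearson correlation of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Made-up values, for illustration only
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
quality = [5, 5, 6, 6, 7]

slope, r = slope_and_r(alcohol, quality)
print(f"slope = {slope:.2f}, r = {r:.2f}")
```

A positive slope corresponds to the upward line in the alcohol plot and a negative one to the downward line in the chlorides plot. On the real data, `data['alcohol'].corr(data['quality'])` should give the same kind of correlation figure.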

In [None]:
#countplot of quality based on goodness feature
sns.catplot(x='quality',data=data,height=5,aspect=3,hue='good',kind='count')
plt.title('countplot of quality')

Based on the above observation, wines that are rated 7 or above are categorised as good. The target 'good' is therefore fully determined by 'quality', so a model that is given 'quality' as a feature has nothing left to learn. To keep the prediction task meaningful, let's remove the 'quality' column from the dataset before modelling.
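To see why keeping 'quality' as a feature would make the task trivial, here is a small sketch on made-up scores (not rows from the dataset). The function `threshold_rule` is a hypothetical one-line "model" that looks only at the quality score.

```python
# Made-up quality scores, for illustration only
quality_scores = [3, 5, 6, 7, 8, 6, 7, 9, 4, 5]

# The labelling rule read off the count plot: 7 or above counts as good
good = [int(q >= 7) for q in quality_scores]

def threshold_rule(q):
    """A one-line 'model' that only looks at the quality score."""
    return int(q >= 7)

predicted = [threshold_rule(q) for q in quality_scores]
accuracy = sum(p == t for p, t in zip(predicted, good)) / len(good)
print(accuracy)
```

Any classifier that sees 'quality' can rediscover this rule, so a perfect score obtained that way says nothing about how well the chemical measurements predict a good wine.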

# Correlation matrix and heatmap


In [None]:
#correlation matrix/heatmap
corr = data.corr(numeric_only=True)   #skip the text column 'color' (needs pandas 1.5+)
corr
f,ax=plt.subplots(figsize=(10,10))
sns.heatmap(corr)
plt.title('heat map')

Observing the heatmap, the 'quality' column is highly correlated with the 'good' column, as the count plot already showed. The correlation between 'free sulfur dioxide' and 'total sulfur dioxide' is also high, so it makes sense to remove one of them to reduce redundancy. Since 'free sulfur dioxide' is part of 'total sulfur dioxide', let's remove 'free sulfur dioxide'.
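Reading redundant pairs off a heatmap by eye gets harder as the number of features grows. As a rough sketch, the helper below (the name `correlated_pairs` and the 0.7 cut-off are assumptions made for illustration, not part of the original analysis) lists every pair of numeric columns whose absolute correlation exceeds a threshold, using a small made-up frame.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.7):
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: 'b' is roughly twice 'a', 'c' is unrelated to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1],
    "c": [5, 1, 4, 2, 6, 3],
})
result = correlated_pairs(toy)
print(result)
```

Calling `correlated_pairs(data)` on the wine data would be one way to confirm the pairs singled out from the heatmap.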

In [None]:
#remove unwanted columns
df=data.drop(['quality','free sulfur dioxide'],axis=1)

# check missing values

In [None]:
data.isnull().sum()

Luckily, we don't have any missing values.

# Apply one hot encoding to convert categorical columns


In [None]:
#one hot encoding (applied to df so the columns dropped above stay dropped)
df=pd.get_dummies(df)
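For reference, this is what `pd.get_dummies` does to a text column. The toy frame below is made up; only the column name 'color' and its two values mirror the dataset.

```python
import pandas as pd

# Made-up rows; only the 'color' values mirror the real data
toy = pd.DataFrame({
    "alcohol": [9.4, 12.8, 10.5],
    "color": ["red", "white", "red"],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

With only two colours the two indicator columns carry the same information, so passing `drop_first=True` to `get_dummies` is an option if one of them should be left out.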

# Separating independent and dependent variables

In [None]:
#creating x (features) and y (target)
x=df.drop('good',axis=1)
y=df['good']   #1-D Series, the shape scikit-learn expects for a target

# Create test and train data

In [None]:
#split into train and test data
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=100)

# Build model with Random forest

In [None]:
# Perform random forest with grid search to optimize model
rfc=RandomForestClassifier(random_state=42)
param_grid = { 
    'n_estimators': [200,300,400],
    'max_features': ['sqrt', 'log2'],   #'auto' was an alias of 'sqrt' and was removed in scikit-learn 1.3
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}
random = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5).fit(x_train, y_train)
random.predict(x_train)
random.predict(x_test)

# confusion matrix, accuracy and f1 score

In [None]:
print('train model metrics')
confusion_matrix(y_train,random.predict(x_train))
print('accuracy-',round(random.score(x_train,y_train) * 100, 2))
print('f1 score-',f1_score(y_train,random.predict(x_train)))
print(" ")
print('test model metrics')
confusion_matrix(y_test,random.predict(x_test))
print('accuracy-',round(random.score(x_test,y_test) * 100, 2))
print('f1 score-',f1_score(y_test,random.predict(x_test)))


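There you go! If the random forest shows perfect predictions on both the train and test sets, be suspicious rather than pleased: 'good' is derived directly from 'quality', so a perfect score usually means 'quality' is still among the features in x and the model has simply learned the 7-or-above rule.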
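To make the reported metrics concrete, the sketch below derives accuracy, precision, recall and F1 from a 2x2 confusion matrix laid out the way `confusion_matrix` returns it (rows are true labels, columns are predicted labels). The counts are hypothetical and are not results from this notebook.

```python
def scores_from_confusion(matrix):
    """Compute accuracy, precision, recall and F1 for the positive class.

    `matrix` follows scikit-learn's layout: [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the wines predicted good, how many are good
    recall = tp / (tp + fn)      # of the good wines, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only
example = [[150, 10], [20, 20]]
accuracy, precision, recall, f1 = scores_from_confusion(example)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case the accuracy is 0.85 while the F1 score is much lower, which shows how the two can diverge when the positive class is rare.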

# Plot ROC curve and calculate AUC


Before we can calculate the AUC and plot the ROC curve, we need the model's outputs as probabilities rather than class labels. The 'predict_proba' function provides them.
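One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on four made-up scores; the function name `auc_from_scores` is only illustrative.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, scores)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75. `roc_auc_score` computes the same quantity as the area under the ROC curve, without the quadratic loop.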

In [None]:
#predict probabilities
rf_prob=random.predict_proba(x_train)
rf_prob=rf_prob[:,1]
false_positive_rate1, true_positive_rate1, threshold1 = roc_curve(y_train, rf_prob)
print('auc_score for random forest(train): ', roc_auc_score(y_train, rf_prob))

# Plot ROC curves
plt.subplots(1, figsize=(5,5))
plt.title('Receiver Operating Characteristic(train) - random forest')
plt.plot(false_positive_rate1, true_positive_rate1)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


rf_prob_test=random.predict_proba(x_test)
rf_prob_test=rf_prob_test[:,1]
false_positive_rate2, true_positive_rate2, threshold2 = roc_curve(y_test, rf_prob_test)
print('auc_score for random forest(test): ', roc_auc_score(y_test, rf_prob_test))

# Plot ROC curves
plt.subplots(1, figsize=(5,5))
plt.title('Receiver Operating Characteristic(test) - random forest')
plt.plot(false_positive_rate2, true_positive_rate2)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()



# Clustering model (Unsupervised learning)

Let's move on to the second part of our problem, i.e. building a clustering model.

Since we have already explored our dataset with visualizations, let's skip that part. However, a new model may call for a different choice of features, so let's plot the correlation matrix again to check.

In [None]:
#correlation matrix/heatmap
corr = data.corr(numeric_only=True)   #skip the text column 'color' (needs pandas 1.5+)
corr
f,ax=plt.subplots(figsize=(10,10))
sns.heatmap(corr)
plt.title('heat map')

As in the first part, 'quality' is left out because 'good' is derived from it, and 'free sulfur dioxide' is left out because it is highly correlated with, and part of, 'total sulfur dioxide'. In addition, the 'color' column is removed from the clustering features so that we can check afterwards whether the clusters that form line up with the colour of the wine.

# one-hot encoding

In [None]:
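#one hot encoding to deal with categorical data, then drop the columns ruled out above
x_onehot=pd.get_dummies(x)
drop_cols=['quality','free sulfur dioxide']+[c for c in x_onehot.columns if c.startswith('color')]
x_onehot=x_onehot.drop(columns=drop_cols,errors='ignore')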

# Data normalisation

In [None]:
#standardise every feature to zero mean and unit variance so they all share the same scale
x_scale = StandardScaler().fit_transform(x_onehot)
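Standardisation matters here because K-Means relies on Euclidean distances, so a feature measured in large units would otherwise dominate. The sketch below reproduces the per-column calculation by hand on made-up values: subtract the mean, then divide by the population standard deviation, which is the convention `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

# Made-up alcohol percentages, for illustration only
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
scaled = standardize(alcohol)
print([round(v, 3) for v in scaled])
```

After scaling, every feature has mean 0 and standard deviation 1, so each contributes on an equal footing to the distances used by the clustering algorithms.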

# K-Means Clustering Model



Before building a clustering model, the most important parameter to decide is the number of clusters. Let's explore some methods for choosing it with K-Means.


# Elbow method to identify number of clusters.

In [None]:
#elbow method with inertia to find n clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x_scale)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Usually, the number of clusters is chosen at the point of the elbow plot where the inertia starts to level off. In this case the elbow is not clear (it could be anywhere between 2 and 4), so let's use the silhouette score to identify the number of clusters instead.
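The quantity plotted in an elbow chart is the within-cluster sum of squares (WCSS): the squared distance from each point to the centroid of its cluster, added up over all points. The sketch below computes it by hand for four made-up points; the function name `wcss` is only illustrative.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's members
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(point, centroid)) for point in members
        )
    return total

# Four made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss(points, [0, 0, 0, 0])
two_clusters = wcss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

Going from one cluster to two cuts the WCSS sharply in this toy case. The value always keeps falling as clusters are added, which is why the elbow method looks for where the drop flattens rather than for a minimum.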

In [None]:
#check silhouette score
# Instantiate a scikit-learn K-Means model
model = KMeans(random_state=0)

# Instantiate the KElbowVisualizer with the number of clusters and the metric 
visualizer = KElbowVisualizer(model, k=(2,10), metric='silhouette', timings=False)

# Fit the data and visualize
visualizer.fit(x_scale)    
visualizer.show()   #poof() was renamed to show() in Yellowbrick 1.0

The above graph shows that the highest silhouette score is obtained with 2 clusters, so let's apply the K-Means algorithm with k = 2.
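For each point, the silhouette coefficient compares a, the mean distance to the other members of its own cluster, with b, the mean distance to the members of the nearest other cluster, as (b - a) / max(a, b). The sketch below averages it over four made-up points; it assumes Euclidean distance and at least two clusters, and the function name `mean_silhouette` is only illustrative.

```python
from math import dist

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance to the other members of the point's own cluster
        same = [
            dist(p, q)
            for j, (q, lab) in enumerate(zip(points, labels))
            if lab == own and j != i
        ]
        if not same:                 # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # Smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Four made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sensible = mean_silhouette(points, [0, 0, 1, 1])
scrambled = mean_silhouette(points, [0, 1, 0, 1])
print(round(sensible, 3), round(scrambled, 3))
```

The sensible grouping scores close to 1 and the scrambled one scores below 0. Unlike inertia, the score does not improve automatically as clusters are added, so its peak can be used to choose k.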

# Build model

In [None]:
#applying kmeans algorithm
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(x_scale)
plt.scatter(x_scale[:,0],x_scale[:,1],c=pred_y,cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()
#calculating davies bouldin score
sklearn.metrics.davies_bouldin_score(x_scale,pred_y)

The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme: how well the clustering has been done is judged using quantities and features inherent to the dataset, without reference to any external labels. Lower values indicate more compact, better separated clusters.
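As a rough illustration of what the index measures, the sketch below computes it by hand. For each cluster it takes the worst ratio of combined within-cluster scatter to the distance between centroids, then averages over clusters. It assumes Euclidean distance and uses made-up points; the function name `davies_bouldin` is only illustrative.

```python
from math import dist

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst scatter-to-separation ratio."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster and the mean distance of its members to that centroid
    centroids = {
        lab: tuple(sum(coords) / len(members) for coords in zip(*members))
        for lab, members in clusters.items()
    }
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)

# Two made-up layouts: clusters far apart versus clusters close together
far_apart = davies_bouldin([(0, 0), (0, 2), (10, 0), (10, 2)], [0, 0, 1, 1])
close = davies_bouldin([(0, 0), (0, 2), (3, 0), (3, 2)], [0, 0, 1, 1])
print(round(far_apart, 3), round(close, 3))
```

The well-separated layout gets the smaller value, consistent with reading lower scores as better when comparing the K-Means and agglomerative results.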

# Comparisons to check similarity

In [None]:
#comparisons using mean
x_kmeans=x.copy()
x_kmeans['labels']=pred_y
x_kmeans.groupby('labels').mean()
data.groupby('color').mean()

The two tables above show that the clusters have statistics broadly similar to those of the original dataframe grouped by the 'color' feature. The similar mean values of features like 'fixed acidity', 'alcohol' and 'density' suggest that the wines are grouped largely according to their colour.
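Comparing group means is an indirect check. A more direct one is to count how many wines land in a cluster dominated by their own colour, often called purity. The sketch below does this on made-up labels; the function name `cluster_purity` is only illustrative.

```python
from collections import Counter

def cluster_purity(cluster_labels, classes):
    """Fraction of points whose class is the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, classes):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    # In each cluster, the majority class counts as correctly grouped
    matched = sum(max(counts.values()) for counts in by_cluster.values())
    return matched / len(classes)

# Made-up cluster assignments and colours, for illustration only
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
colors = ["red", "red", "red", "white", "white", "white", "white", "white"]
purity = cluster_purity(clusters, colors)
print(purity)
```

On the real data, `pd.crosstab(pred_y, data['color'])` would give the underlying counts, and a purity near 1 would back up the claim that the clusters follow colour.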

# Agglomerative Clustering

Let's move on to another type of clustering algorithm, agglomerative clustering, which is a form of hierarchical clustering. Just as K-Means has the elbow method, hierarchical clustering has the dendrogram plot to help identify the number of clusters.

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. 

In [None]:
#hierarchical clustering-plotting dendrogram
dendrogram = sch.dendrogram(sch.linkage(x_scale, method='ward'))

The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together: the lower the join, the more similar the objects, and the heights give the order in which clusters were merged. From the dendrogram above, 2 is a reasonable choice for the number of clusters.

In [None]:
#applying agglomerative clustering algorithm
model = AgglomerativeClustering(n_clusters=2, linkage='ward')   #ward linkage always uses Euclidean distance; the 'affinity' argument was removed in scikit-learn 1.4
model.fit_predict(x_scale)
labels = model.labels_
#plotting clusters on scatter plot
plt.figure(figsize=(10, 7))
plt.scatter(x_scale[labels==0, 0], x_scale[labels==0, 1], s=50, marker='o', color='red')
plt.scatter(x_scale[labels==1, 0], x_scale[labels==1, 1], s=50, marker='o', color='blue')
sklearn.metrics.davies_bouldin_score(x_scale,labels)

# Comparisons to check similarities

In [None]:
x_hrcl=x.copy()
x_hrcl['labels']=labels
x_hrcl.groupby('labels').mean()
data.groupby('color').mean()


Again, the tables suggest that our clustering model has done a reasonably good job of grouping the wines by colour.

# Thank you