### Problem Statement
This is a multiclass classification problem: predict wine quality, scored on a 0-10 scale.

#### Importing required Libraries

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix,f1_score 
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler 

#### Loading the Data Set

In [2]:
df=pd.read_csv("C:\\Users\\rupan\\OneDrive - stu.aud.ac.in\\Desktop\\winequality_red.csv")
df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |


The goal is to predict the quality of each wine from its features.

#### Checking shape of data

In [3]:
df.shape

(1599, 12)

#### Checking missing values

In [4]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

#### Statistical Summary

In [5]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


#### Segregate X and y

In [8]:
x=df.drop(columns="quality")
y=df["quality"]

#### Feature scaling

In [10]:
sc=StandardScaler()
x_scaled=sc.fit_transform(x)
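
As a rough illustration of what `StandardScaler` does to each column, the sketch below standardizes the first five `fixed acidity` values from the data preview by hand. It uses only the standard library, and the helper name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), matching StandardScaler
    return [(v - mu) / sigma for v in values]

# First five "fixed acidity" values shown in the data preview
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4]
z_scores = standardize(fixed_acidity)
print([round(z, 3) for z in z_scores])
```

After scaling, each column has mean 0 and unit variance, so features measured on large scales (such as total sulfur dioxide) do not dominate the distance computations inside the SVM.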

#### Checking class imbalance

In [11]:
df["quality"].value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

- The output above shows that the target variable `quality` is multiclass, with the following 6 classes:

    Class Categories
    - 5
    - 6
    - 7
    - 4
    - 8
    - 3

- The classes are imbalanced: qualities 5 and 6 account for most rows, while 3 and 8 are rare.
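
To put numbers on the imbalance, a small sketch can turn the counts printed above into class shares and a majority-to-minority ratio. The counts are copied from the `value_counts()` output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in sorted(shares.items()):
    print(f"quality {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```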

#### Handling the imbalance by oversampling with SMOTE (requires installing imblearn)

In [12]:
#pip install imblearn

#### Oversampling

In [25]:
smote=SMOTE()
x_smote,y_smote=smote.fit_resample(x_scaled,y)

In [26]:
from collections import Counter
Counter(y_smote)

Counter({5: 681, 6: 681, 7: 681, 4: 681, 8: 681, 3: 681})

- The classes are now balanced, with 681 samples each.
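
The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random position on the segment between a minority-class sample and one of its same-class neighbours. The snippet below is a simplified illustration, not imblearn's implementation (which also performs the nearest-neighbour search); the function name and the two points are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Interpolate a synthetic sample between two minority-class points."""
    lam = rng.random()  # random position in [0, 1) along the segment
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(0)
point = [0.2, 1.0]      # hypothetical scaled minority-class sample
neighbor = [0.6, 0.0]   # hypothetical same-class nearest neighbour
synthetic = smote_like_sample(point, neighbor, rng)
print(synthetic)

# Every class is raised to the majority count: 6 classes x 681 rows each
n_after = 6 * 681
print(n_after)
```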

#### Checking the shape of the data after oversampling

In [27]:
print(x_smote.shape)
print(y_smote.shape)

(4086, 11)
(4086,)


#### Splitting the SMOTE (oversampled) data into training and test sets

In [28]:
x_train,x_test,y_train,y_test=train_test_split(x_smote,y_smote,test_size=0.25,random_state=42)
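
Because the split here happens after scaling and oversampling, synthetic points interpolated from a sample can end up in the test set while that sample sits in the training set, which may make test accuracy look better than it would on unseen wines. One common alternative ordering is sketched below on synthetic stand-in data (not the wine dataset): split first, then fit the scaler on the training rows only. All names ending in `_demo`, `_tr` and `_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))       # synthetic stand-in features
y_demo = np.repeat([0, 1], [160, 40])    # synthetic imbalanced labels

# 1. Split the raw data first; stratify keeps class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# 3. Any oversampling (for example SMOTE) would then be applied to
#    (X_tr_scaled, y_tr) only, leaving the test rows untouched
print(X_tr_scaled.shape, X_te_scaled.shape)
```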

#### Getting the shape of the training and test data

In [29]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(3064, 11)
(3064,)
(1022, 11)
(1022,)
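
The shapes above follow from `test_size=0.25`: the test count is the oversampled row count times 0.25, rounded up, and the remainder goes to training. A quick check of that arithmetic:

```python
import math

n_samples = 4086   # rows after oversampling
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # 1021.5 is rounded up
n_train = n_samples - n_test                # the remainder goes to training
print(n_train, n_test)
```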


#### SVC

In [30]:
svc=SVC()
svc.fit(x_train,y_train)

SVC()

#### Prediction

In [31]:
y_pred=svc.predict(x_test)
y_pred

array([6, 5, 7, ..., 5, 6, 4], dtype=int64)

#### Accuracy on Test

In [32]:
acc=accuracy_score(y_test,y_pred)
acc

0.7514677103718199
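
For reference, accuracy is simply the fraction of test rows whose predicted label equals the true label. The sketch below computes it by hand on a few hypothetical labels; the helper `accuracy` and the demo lists are illustrative, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels, only to show the calculation
y_true_demo = [5, 6, 7, 5, 4, 6, 3, 8]
y_pred_demo = [5, 6, 6, 5, 4, 5, 3, 8]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)

# The score reported above is consistent with 768 correct out of 1022 test rows
print(768 / 1022)
```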

#### Accuracy on training

In [37]:
y_pred_train=svc.predict(x_train)

In [38]:
accuracy_score(y_train,y_pred_train)

0.799934725848564

- No overfitting: training accuracy (0.80) is close to test accuracy (0.75).

#### SVC with a linear kernel

In [39]:
svc=SVC(kernel='linear')
svc.fit(x_train,y_train)

SVC(kernel='linear')
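
The two models differ only in the kernel: `SVC()` defaults to the RBF kernel, while `kernel='linear'` uses a plain dot product. A minimal sketch of both similarity functions, written with the standard library only (the function names and the two points are illustrative):

```python
import math

def linear_kernel(a, b):
    """Dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma):
    """exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

a = [1.0, 0.0]
b = [0.0, 1.0]
print(linear_kernel(a, b))
for gamma in (0.1, 1, 10):
    # Larger gamma makes similarity fall off faster with distance
    print(gamma, rbf_kernel(a, b, gamma))
```

A linear kernel can only produce flat decision boundaries in the scaled feature space, whereas the RBF kernel can bend them, with `gamma` controlling how local each training point's influence is.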

In [40]:
y_pred=svc.predict(x_test)
y_pred

array([4, 5, 7, ..., 5, 6, 4], dtype=int64)

#### Accuracy on Test

In [41]:
acc=accuracy_score(y_test,y_pred)
acc

0.6056751467710372

#### Accuracy on training

In [42]:
y_pred_train=svc.predict(x_train)

In [43]:
accuracy_score(y_train,y_pred_train)

0.6370757180156658

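- Underfitting: both training (0.64) and test (0.61) accuracy are low, which suggests the linear kernel is too simple for this data.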

#### Hyperparameter Tuning

#### Trying combinations of hyperparameters with GridSearchCV

In [44]:
param_grid={'gamma':[0.1,1,10,20,30,40],'C':[1,0.5,0.1,1.5,2,2.5]}

In [49]:
grid = GridSearchCV(SVC(),param_grid,verbose=3,n_jobs =-1) 

In [50]:
grid.fit(x_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(estimator=SVC(), n_jobs=-1,
             param_grid={'C': [1, 0.5, 0.1, 1.5, 2, 2.5],
                         'gamma': [0.1, 1, 10, 20, 30, 40]},
             verbose=3)
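
The "36 candidates, totalling 180 fits" message can be reproduced by enumerating the grid: every `gamma` is paired with every `C`, and each pair is fitted once per cross-validation fold. The sketch below uses only `itertools`; `param_grid_demo` is a copy of the grid above, and 5 folds is the default reported in the log.

```python
from itertools import product

param_grid_demo = {
    'gamma': [0.1, 1, 10, 20, 30, 40],
    'C': [1, 0.5, 0.1, 1.5, 2, 2.5],
}

# Cartesian product of all parameter values, one dict per candidate
candidates = [
    dict(zip(param_grid_demo, combo))
    for combo in product(*param_grid_demo.values())
]

n_folds = 5  # number of folds reported in the fit log
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```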

In [47]:
grid.best_params_

{'C': 2.5, 'gamma': 1}


In [51]:
grid.best_score_  ## mean cross-validated accuracy of the best candidate (not training accuracy)

0.8825006130782928
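
`best_score_` is the mean score over the validation folds for the best parameter combination, so it estimates held-out performance and is not accuracy on the data the model was fitted on. The sketch below checks this on synthetic stand-in data (not the wine dataset); the small grid and the names `search`, `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, not the wine dataset
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

search = GridSearchCV(SVC(), {'C': [0.5, 1, 2], 'gamma': [0.1, 1]}, cv=5)
search.fit(X_demo, y_demo)

# best_score_ is the mean validation-fold score of the best candidate
mean_scores = search.cv_results_['mean_test_score']
print(search.best_score_, mean_scores[search.best_index_])
```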

#### Now, fit the model using the tuned values of C and gamma

In [52]:
# Note: this cell uses C=2.0, although grid.best_params_ above reported C=2.5
model_new=SVC(C=2.0, gamma=1)
model_new.fit(x_train,y_train)


SVC(C=2.0, gamma=1)

In [53]:
accuracy_score(y_test,model_new.predict(x_test))

0.8669275929549902
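
To avoid copying tuned values by hand, the parameters found by the search can be passed straight into a new model, or the already refitted `best_estimator_` can be used directly. A minimal sketch on synthetic stand-in data (not the wine dataset; the small grid and all variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, not the wine dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

search = GridSearchCV(SVC(), {'C': [0.5, 1, 2], 'gamma': [0.1, 1]}, cv=5)
search.fit(X_tr, y_tr)

# Option 1: build a new model from the reported best parameters
tuned = SVC(**search.best_params_).fit(X_tr, y_tr)

# Option 2: reuse the model GridSearchCV already refit on the training split
refit_model = search.best_estimator_

print(tuned.score(X_te, y_te), refit_model.score(X_te, y_te))
```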

- Tuning the hyperparameters C and gamma increased test accuracy from 0.75 (default SVC) to 0.87. Other parameters can be tuned in the same way.