# Wine Quality Prediction Using Support Vector Machine

## Data Source

The white wine dataset has twelve variables: eleven physicochemical features and the target, quality.

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
12. quality
   


## Import Library

In [1]:
import pandas as pd

## Import Data

In [3]:
data=pd.read_csv('https://github.com/YBI-Foundation/Dataset/raw/main/WhiteWineQuality.csv',delimiter=';')
data.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


## Get Information of Dataset

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


## Get Summary Statistics

In [6]:
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 4898.0 | 6.854788 | 0.843868 | 3.8 | 6.3 | 6.8 | 7.3 | 14.2 |
| volatile acidity | 4898.0 | 0.278241 | 0.100795 | 0.08 | 0.21 | 0.26 | 0.32 | 1.1 |
| citric acid | 4898.0 | 0.334192 | 0.12102 | 0.0 | 0.27 | 0.32 | 0.39 | 1.66 |
| residual sugar | 4898.0 | 6.391415 | 5.072058 | 0.6 | 1.7 | 5.2 | 9.9 | 65.8 |
| chlorides | 4898.0 | 0.045772 | 0.021848 | 0.009 | 0.036 | 0.043 | 0.05 | 0.346 |
| free sulfur dioxide | 4898.0 | 35.308085 | 17.007137 | 2.0 | 23.0 | 34.0 | 46.0 | 289.0 |
| total sulfur dioxide | 4898.0 | 138.360657 | 42.498065 | 9.0 | 108.0 | 134.0 | 167.0 | 440.0 |
| density | 4898.0 | 0.994027 | 0.002991 | 0.98711 | 0.991723 | 0.99374 | 0.9961 | 1.03898 |
| pH | 4898.0 | 3.188267 | 0.151001 | 2.72 | 3.09 | 3.18 | 3.28 | 3.82 |
| sulphates | 4898.0 | 0.489847 | 0.114126 | 0.22 | 0.41 | 0.47 | 0.55 | 1.08 |


## Get Column Names

In [9]:
data.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

## Get Shape

In [10]:
data.shape

(4898, 12)

## Get Unique Values of Target Variable

In [11]:
data.quality.value_counts()

6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64
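
The counts above are heavily skewed toward the middle scores. As a rough illustration, the sketch below rebuilds the counts by hand (copied from the output above, not reloaded from the CSV) and converts them to percentages; the variable names `counts` and `share` are only illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({6: 2198, 5: 1457, 7: 880, 8: 175, 4: 163, 3: 20, 9: 5}, name="quality")

# Share of each quality score as a percentage of all wines.
share = (counts / counts.sum() * 100).round(2)
print(share.sort_index())
```

Scores 3 and 9 together account for well under 1% of the rows, which is worth keeping in mind when reading the per-class metrics later on.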

In [12]:
data.groupby('quality').mean()

| quality | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 7.6 | 0.33325 | 0.336 | 6.3925 | 0.0543 | 53.325 | 170.6 | 0.994884 | 3.1875 | 0.4745 | 10.345 |
| 4 | 7.129448 | 0.381227 | 0.304233 | 4.628221 | 0.050098 | 23.358896 | 125.279141 | 0.994277 | 3.182883 | 0.476135 | 10.152454 |
| 5 | 6.933974 | 0.302011 | 0.337653 | 7.334969 | 0.051546 | 36.432052 | 150.904598 | 0.995263 | 3.168833 | 0.482203 | 9.80884 |
| 6 | 6.837671 | 0.260564 | 0.338025 | 6.441606 | 0.045217 | 35.650591 | 137.047316 | 0.993961 | 3.188599 | 0.491106 | 10.575372 |
| 7 | 6.734716 | 0.262767 | 0.325625 | 5.186477 | 0.038191 | 34.125568 | 125.114773 | 0.992452 | 3.213898 | 0.503102 | 11.367936 |
| 8 | 6.657143 | 0.2774 | 0.326514 | 5.671429 | 0.038314 | 36.72 | 126.165714 | 0.992236 | 3.218686 | 0.486229 | 11.636 |
| 9 | 7.42 | 0.298 | 0.386 | 4.12 | 0.0274 | 33.4 | 116.0 | 0.99146 | 3.308 | 0.466 | 12.18 |


## Define y (Target Variable) and X (Independent Variables)

In [13]:
y=data.iloc[:,-1]
X=data.iloc[:,:-1]

In [14]:
y.shape,X.shape

((4898,), (4898, 11))

## Standardize X

In [15]:
from sklearn.preprocessing import StandardScaler

sc=StandardScaler()

In [16]:
X=sc.fit_transform(X)

In [17]:
X


array([[ 1.72096961e-01, -8.17699008e-02,  2.13280202e-01, ...,
        -1.24692128e+00, -3.49184257e-01, -1.39315246e+00],
       [-6.57501128e-01,  2.15895632e-01,  4.80011213e-02, ...,
         7.40028640e-01,  1.34184656e-03, -8.24275678e-01],
       [ 1.47575110e+00,  1.74519434e-02,  5.43838363e-01, ...,
         4.75101984e-01, -4.36815783e-01, -3.36667007e-01],
       ...,
       [-4.20473102e-01, -3.79435433e-01, -1.19159198e+00, ...,
        -1.31315295e+00, -2.61552731e-01, -9.05543789e-01],
       [-1.60561323e+00,  1.16673788e-01, -2.82557040e-01, ...,
         1.00495530e+00, -9.62604939e-01,  1.85757201e+00],
       [-1.01304317e+00, -6.77100966e-01,  3.78559282e-01, ...,
         4.75101984e-01, -1.48839409e+00,  1.04489089e+00]])
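
To make the transformation concrete, the sketch below reproduces what `StandardScaler` computes on a tiny made-up matrix: each column has its mean subtracted and is divided by its population standard deviation. The array `X_demo` is illustrative and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix (3 samples, 2 features); values are illustrative only.
X_demo = np.array([[7.0, 0.27], [6.3, 0.30], [8.1, 0.28]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled = StandardScaler().fit_transform(X_demo)

# Both routes should agree up to floating-point error.
print(np.allclose(manual, scaled))
```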

## Train Test Split

In [21]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=232,stratify=y)

In [22]:
X_train.shape,y_train.shape

((3428, 11), (3428,))
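
Because X was standardized before the split, the test rows contributed to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data rather than the wine data, is to split first and fit the scaler on the training rows only. All names (`X_raw`, `y_demo`, `scaler`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic stand-in for unscaled features
y_demo = rng.integers(0, 2, size=200)                  # synthetic binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_s.shape, X_te_s.shape)
```

With this ordering the test set stays unseen until evaluation, so the reported scores are a slightly more honest estimate of performance on new wines.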

## Train Model

In [23]:
from sklearn.svm import SVC

model=SVC()

In [24]:
model.fit(X_train,y_train)
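
`SVC()` is created here without arguments, so the model relies on the library defaults. Assuming scikit-learn 0.22 or later, those can be inspected as in this small sketch (the name `model_demo` is illustrative):

```python
from sklearn.svm import SVC

model_demo = SVC()
params = model_demo.get_params()

# Kernel type, regularization strength, and kernel coefficient setting.
print(params["kernel"], params["C"], params["gamma"])
```

An RBF kernel is sensitive to feature scale, which is one reason the features are standardized before fitting.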

## Get Model Prediction

In [25]:
y_pred=model.predict(X_test)

## Model Evaluation

In [26]:
from sklearn.metrics import confusion_matrix,classification_report

print(confusion_matrix(y_test,y_pred))

[[  0   0   3   3   0   0   0]
 [  0   1  31  17   0   0   0]
 [  0   0 244 192   1   0   0]
 [  0   0 126 510  24   0   0]
 [  0   0   6 195  63   0   0]
 [  0   0   1  34  18   0   0]
 [  0   0   0   1   0   0   0]]
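
The matrix is easier to read once it is reduced to per-class numbers. The sketch below copies the matrix from the output above (rows are the true scores 3 to 9, columns the predicted scores) and derives support, recall, and overall accuracy with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true quality 3..9, columns: predicted).
cm = np.array([
    [0, 0,   3,   3,  0, 0, 0],
    [0, 1,  31,  17,  0, 0, 0],
    [0, 0, 244, 192,  1, 0, 0],
    [0, 0, 126, 510, 24, 0, 0],
    [0, 0,   6, 195, 63, 0, 0],
    [0, 0,   1,  34, 18, 0, 0],
    [0, 0,   0,   1,  0, 0, 0],
])

support = cm.sum(axis=1)            # number of true samples per class
recall = np.diag(cm) / support      # fraction of each class predicted correctly
accuracy = np.trace(cm) / cm.sum()  # correct predictions over all test samples

for label, r in zip(range(3, 10), recall):
    print(label, round(float(r), 2))
print("accuracy", round(float(accuracy), 2))
```

Almost all predictions fall into scores 5, 6, and 7, so the rare scores contribute nothing to the diagonal.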


In [27]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         6
           4       1.00      0.02      0.04        49
           5       0.59      0.56      0.58       437
           6       0.54      0.77      0.63       660
           7       0.59      0.24      0.34       264
           8       0.00      0.00      0.00        53
           9       0.00      0.00      0.00         1

    accuracy                           0.56      1470
   macro avg       0.39      0.23      0.23      1470
weighted avg       0.56      0.56      0.52      1470



scikit-learn also emits an UndefinedMetricWarning for this report: classes 3, 8, and 9 are never predicted, so their precision is undefined and is reported as 0.00.
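
When a class is never predicted, its precision has a zero denominator. One way to make that explicit and silence the warning is the `zero_division` argument of `classification_report`; the sketch below uses tiny made-up labels rather than the wine predictions.

```python
from sklearn.metrics import classification_report

# Tiny made-up labels: class 2 occurs in y_true_demo but is never predicted.
y_true_demo = [0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 0, 1]

# zero_division=0 reports the undefined precision as 0.0 without a warning.
report = classification_report(
    y_true_demo, y_pred_demo, zero_division=0, output_dict=True
)
print(report["2"]["precision"])
```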


## Rerun Model with a New Binary Class

In [28]:
y=data['quality'].apply(lambda y_value : 1 if y_value>=6 else 0)
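
The lambda maps scores of 6 and above to 1 and everything else to 0. An equivalent vectorized form, shown here on a short illustrative series instead of the full column, compares the whole column at once:

```python
import pandas as pd

# A few quality scores spanning the range seen in the dataset (illustrative sample).
quality = pd.Series([3, 4, 5, 6, 7, 8, 9], name="quality")

# The comparison yields booleans; astype(int) maps True/False to 1/0.
y_binary = (quality >= 6).astype(int)

# Row-wise version, as used in the notebook cell above.
y_lambda = quality.apply(lambda q: 1 if q >= 6 else 0)

print(y_binary.tolist())
```

Applied to the value counts listed earlier, this threshold gives 3258 wines in class 1 and 1640 in class 0.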

## Train Test Split

In [29]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2323,stratify=y)

## Model Training

In [30]:
from sklearn.svm import SVC

model=SVC()

In [31]:
model.fit(X_train,y_train)

## Get Model Prediction

In [32]:
y_pred=model.predict(X_test)

## Model Evaluation

In [33]:
from sklearn.metrics import confusion_matrix,classification_report

print(confusion_matrix(y_test,y_pred))

[[288 204]
 [118 860]]
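
The four cells of this matrix are enough to work out the usual metrics by hand. The sketch below copies them from the output above (rows are true classes 0 and 1, columns are predictions) and also computes a majority-class baseline for comparison.

```python
# Cells copied from the confusion matrix above.
tn, fp = 288, 204   # true class 0: predicted 0, predicted 1
fn, tp = 118, 860   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

precision_1 = tp / (tp + fp)   # of wines predicted as class 1, the share that really are
recall_1 = tp / (tp + fn)      # of true class 1 wines, the share that were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / total

# Accuracy of always predicting the majority class (class 1).
baseline = (tp + fn) / total

print(round(precision_1, 2), round(recall_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
print(round(accuracy, 2), round(baseline, 2))
```

The model's accuracy sits above the majority-class baseline, but recall for class 0 is noticeably weaker than for class 1.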


In [34]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.71      0.59      0.64       492
           1       0.81      0.88      0.84       978

    accuracy                           0.78      1470
   macro avg       0.76      0.73      0.74      1470
weighted avg       0.78      0.78      0.78      1470



## Future Prediction

Let's select a random sample from the existing dataset and treat it as a new value. Steps to follow:

1. Extract a random row using the sample function
2. Separate X and y
3. Standardize X
4. Predict


In [35]:
data_new=data.sample(1)

In [36]:
data_new

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3547 | 7.3 | 0.2 | 0.29 | 19.9 | 0.039 | 69.0 | 237.0 | 1.00037 | 3.1 | 0.48 | 9.2 | 6 |


In [37]:
data_new.shape

(1, 12)

In [38]:
X_new=data_new.drop(['quality'],axis=1)

In [39]:
X_new=sc.transform(X_new)  # reuse the scaler fitted on the full X; refitting on a single row would zero out every feature
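
Scaling a single new row needs the mean and standard deviation learned earlier, so the already-fitted scaler's `transform` method is the call to use; refitting on one row makes the row its own mean and maps every feature to zero. A minimal sketch with made-up numbers (`X_fit` and `x_row` are illustrative, not dataset rows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training matrix and a single new row; values are illustrative only.
X_fit = np.array([[7.0, 0.27], [6.3, 0.30], [8.1, 0.28], [7.2, 0.23]])
x_row = np.array([[7.3, 0.20]])

scaler = StandardScaler().fit(X_fit)

reused = scaler.transform(x_row)               # uses the mean/std learned from X_fit
refit = StandardScaler().fit_transform(x_row)  # mean equals the row itself, so all zeros

print(reused)
print(refit)
```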

In [40]:
y_pred_new=model.predict(X_new)

In [41]:
y_pred_new

array([1])
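
For future predictions it can be convenient to bundle scaling and the classifier so that new rows always pass through the same fitted scaler. The sketch below shows one way to do that with scikit-learn's `make_pipeline`, using synthetic data in place of the wine features. The names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in data: two features, label is 1 when their sum is positive.
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo.sum(axis=1) > 0).astype(int)

# The pipeline fits the scaler and the SVC together on the training data.
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_demo, y_demo)

# A new, unscaled row is standardized with the stored statistics before prediction.
x_new = np.array([[2.0, 2.0]])
pred = pipe.predict(x_new)
print(pred)
```

With this setup there is no separate scaling step to remember at prediction time, since `pipe.predict` applies the stored transformation itself.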