# Wine Quality Prediction through SVM and LR
- Data Exploration and Preprocessing

    - Explore the dataset thoroughly and provide a summary of your observations.
    - Perform necessary preprocessing steps:
        * Preparing feature values to be used by your models.
        * Optionally, data augmentation techniques.
        * Splitting the data into training and test sets appropriately.

- SVM and LR Implementation
    - Implement both SVM and LR from scratch. Evaluate and compare their performance.
    - Clearly define the two models and describe your implementation, also listing their hyperparameters if any.
    - Train the two models using an appropriate performance metric.
    - Demonstrate proper hyperparameter tuning, and evaluate at least one of your models using accuracy estimates via 5-fold cross-validation.

- Kernel Methods
    - Extend the above models to a kernelized form by adopting non-linear kernels.
    - Clearly describe how the kernelization happens and its consequences for both predictions and performance.
    - Comment on how the kernelized models compare with respect to the standard ones.

- Evaluation and Analysis
    - Evaluate your model performance using suitable metrics such as accuracy, precision, recall, and F1-score.
    - Provide appropriate visualizations of the performance of each model (loss and accuracy).
    - When reasonable, conduct an analysis of misclassified examples to understand potential model limitations.
    - Discuss the presence or absence of overfitting and underfitting at any point.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Render plots inline in the notebook
%matplotlib inline

!git clone https://github.com/Abudo-S/WineQualityPrediction.git

Cloning into 'WineQualityPrediction'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 2), reused 11 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (15/15), 99.20 KiB | 898.00 KiB/s, done.
Resolving deltas: 100% (2/2), done.


# General observations:
- The red-wine and white-wine datasets share the same features and the same target column, "quality".
- All feature values are numeric, with no NaN values.
- Both datasets contain some duplicated instances.
- To develop a single model for both red and white wines, we combine the two datasets and introduce a new feature 'wine_type', set to 1 for red wine instances and 0 for white wine instances.
Introducing a feature for the wine type lets the learning model capture relationships between the wine type and the other features.
Combining both datasets also increases the size of the training set, so the learning model sees more records, which reduces the risk of overfitting.
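
The observations above (numeric columns, no NaN values, some duplicated rows) can be checked with a few pandas calls. The following is a minimal sketch, not the notebook's own code: the helper name `summarize_frame` and the toy frame are assumptions made for illustration, and the toy values are not taken from the dataset.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that repeat an earlier row exactly
        "duplicated_rows": int(df.duplicated().sum()),
        # Columns whose dtype is not numeric
        "non_numeric_columns": list(df.select_dtypes(exclude="number").columns),
    }

# Toy frame standing in for a wine table (illustrative values only)
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 9.4], "quality": [5, 5, 5]})
summary = summarize_frame(toy)
print(summary)
```

The same helper could be applied to `red_wine_quality` and `white_wine_quality` once they are loaded.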

In [39]:
red_wine_quality = pd.read_csv("/content/WineQualityPrediction/wine+quality/winequality-red.csv", sep=';')
red_wine_quality.info()

white_wine_quality = pd.read_csv("/content/WineQualityPrediction/wine+quality/winequality-white.csv", sep=';')
white_wine_quality.info()
# Open question: should both datasets be merged into a single one, introducing a new field wine_type = 'Red' / 'White'?

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column        

Introduce feature "wine_type" in red_wine_quality and white_wine_quality datasets

In [40]:
red_wine_quality['wine_type'] = 1
white_wine_quality['wine_type'] = 0

Merge red_wine_quality and white_wine_quality datasets

In [41]:
wine_quality = pd.concat([red_wine_quality, white_wine_quality], ignore_index=True)
wine_quality.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine_type
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [42]:
y = wine_quality['quality'].values
print(f'Wine unique values: {np.unique(y, return_counts=True)}')

Wine unique values: (array([3, 4, 5, 6, 7, 8, 9]), array([  30,  216, 2138, 2836, 1079,  193,    5]))


# Observation on data duplication
Duplicated instances can cause data leakage: after the dataset is split into training and test sets, identical records may end up in both sets, so the test score is measured partly on records the model has already seen. Duplicates also give extra weight to the repeated records during training, which can make the model too specialized to them and lead to overfitting.
Furthermore, redundant data adds unnecessary processing cost.
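
To make the leakage concern concrete, the sketch below counts how many test rows have an identical copy in the training set. It is an illustration under assumptions: `count_leaked_rows` is a hypothetical helper, and the two toy frames are made up rather than drawn from the wine data.

```python
import pandas as pd

def count_leaked_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows that also appear, value for value, in the training set."""
    # Deduplicate the training side so each test row matches at most once
    train_unique = train.drop_duplicates()
    merged = test.merge(
        train_unique, how="left", on=list(test.columns), indicator=True
    )
    # "_merge" is "both" for rows found in train, "left_only" otherwise
    return int((merged["_merge"] == "both").sum())

# Toy split (illustrative values only): the first test row repeats a training row
train = pd.DataFrame({"alcohol": [9.4, 9.8, 10.0], "quality": [5, 5, 6]})
test = pd.DataFrame({"alcohol": [9.4, 11.0], "quality": [5, 7]})
leaked = count_leaked_rows(train, test)
print(f"Test rows also present in train: {leaked}")
```

If duplicates are dropped from the full dataset before splitting, this count is zero by construction.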

Remove duplicates

In [45]:
#wine_quality.info()  # initial entries: 6497
wine_quality = wine_quality.drop_duplicates(keep='first')
wine_quality.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5320 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         5320 non-null   float64
 1   volatile acidity      5320 non-null   float64
 2   citric acid           5320 non-null   float64
 3   residual sugar        5320 non-null   float64
 4   chlorides             5320 non-null   float64
 5   free sulfur dioxide   5320 non-null   float64
 6   total sulfur dioxide  5320 non-null   float64
 7   density               5320 non-null   float64
 8   pH                    5320 non-null   float64
 9   sulphates             5320 non-null   float64
 10  alcohol               5320 non-null   float64
 11  quality               5320 non-null   int64  
 12  wine_type             5320 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 581.9 KB


# Observation on feature scaling
Some features (e.g. residual sugar, free sulfur dioxide, total sulfur dioxide) have a wide gap between their minimum and maximum values, so the features need to be standardized to a common scale.
This matters especially for models such as SVM that look for an optimal hyperplane maximizing the margin between classes: computing the margin and positioning the hyperplane both rely on distances between data points in the feature space, so features with large ranges would otherwise dominate those distances.
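
One common way to put features on a common scale is z-score standardization, with the mean and standard deviation estimated on the training set only and then reused for the test set. The sketch below assumes that approach; the function names `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical, and the notebook may scale its features differently.

```python
import numpy as np

def fit_standardizer(X_train: np.ndarray):
    """Estimate per-feature mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant features have std 0; replace with 1 to avoid division by zero
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Scale features with statistics that were computed on the training data."""
    return (X - mean) / std

# Toy data (illustrative only): the second feature has a much larger range
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)
print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training statistics for the test set keeps information about the test distribution out of the preprocessing step.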

In [31]:
minMax_feature_values = zip(wine_quality.columns, wine_quality.min(), wine_quality.max())
print(f'Minimum/Maximum values for each feature: {list(minMax_feature_values)}')


Minimum/Maximum values for each feature: [('fixed acidity', 3.8, 15.9), ('volatile acidity', 0.08, 1.58), ('citric acid', 0.0, 1.66), ('residual sugar', 0.6, 65.8), ('chlorides', 0.009, 0.611), ('free sulfur dioxide', 1.0, 289.0), ('total sulfur dioxide', 6.0, 440.0), ('density', 0.98711, 1.03898), ('pH', 2.72, 4.01), ('sulphates', 0.22, 2.0), ('alcohol', 8.0, 14.9), ('quality', 3.0, 9.0), ('wine_type', 0.0, 1.0)]
