# Data Science on Wine Quality

Get the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

## Import data

In [1]:
import pandas as pd

red = pd.read_csv("winequality-red.csv", sep=";")
white = pd.read_csv("winequality-white.csv", sep=";")

red

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [2]:
white

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


In [3]:
red["type"] = "red"
white["type"] = "white"

df = pd.concat([red, white])
df

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


## Prepare dataset

We will do **classification** here: predict the wine `type` (red or white) from the other columns. You could also do regression (predict any of the numerical features).

**Question**: Is this a balanced dataset?

In [4]:
df["type"].value_counts()

white    4898
red      1599
Name: type, dtype: int64

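General methods to get a balanced dataset:

- Undersampling: sample a subset of the majority class(es)
- Oversampling: resample the minority class(es) with replacement. This duplicates rows, so it is often considered less safe than undersampling when enough data is available
- Data synthesis: generate new artificial samples for the minority class(es)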
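The first two methods can be sketched with plain pandas. This is a minimal illustration on a made-up toy frame (the `feature` and `label` columns and the `toy` name are assumptions, not part of the wine data):

```python
import pandas as pd

# Toy imbalanced frame; column names are made up for this illustration.
toy = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 7 + ["b"] * 3,
})

counts = toy["label"].value_counts()
n_min, n_max = int(counts.min()), int(counts.max())

# Undersampling: draw n_min rows (without replacement) from every class.
under = pd.concat(
    [group.sample(n=n_min, random_state=0) for _, group in toy.groupby("label")]
)

# Oversampling: draw n_max rows with replacement from every class,
# so rows of the minority class get duplicated.
over = pd.concat(
    [group.sample(n=n_max, replace=True, random_state=0) for _, group in toy.groupby("label")]
)

print(under["label"].value_counts().to_dict())
print(over["label"].value_counts().to_dict())
```

After undersampling every class has as many rows as the smallest class; after oversampling every class has as many rows as the largest one.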

In [5]:
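# Undersample: keep all red wines and draw an equally sized random subset of white wines.
# (Pass random_state=... to sample() if the draw should be reproducible.)
df_balanced = pd.concat([red, white.sample(len(red))])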
df_balanced

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4177,6.4,0.25,0.33,1.7,0.037,35.0,113.0,0.99164,3.23,0.66,10.6,6,white
4601,6.9,0.23,0.35,6.9,0.030,45.0,116.0,0.99244,2.80,0.54,11.0,6,white
369,7.1,0.39,0.35,12.5,0.044,26.0,72.0,0.99410,3.17,0.29,11.6,5,white
4395,6.6,0.24,0.22,12.3,0.051,35.0,146.0,0.99676,3.10,0.67,9.4,5,white


In [6]:
df_balanced["type"].value_counts()

red      1599
white    1599
Name: type, dtype: int64

Now the dataset is balanced.

Extract features (input, X) and target (output, y). `type` is the last column, so every column before it is a feature.

In [7]:
df_X = df_balanced.iloc[:, :-1]
df_y = df_balanced.iloc[:, -1]
df_y

0         red
1         red
2         red
3         red
4         red
        ...  
4177    white
4601    white
369     white
4395    white
324     white
Name: type, Length: 3198, dtype: object

Train-test split

- Train-validation-test split
  - Train: the data the model is fitted on
  - Validation: used to tune hyperparameters (e.g. the strength of L2 regularization) and to check whether the model generalizes beyond the training data
  - Test (optional): in research, used to report final results; in production/industry, used to check whether the model achieves acceptable performance
  - In some cases you may want to design the test set deliberately instead of sampling it at random, e.g. so that it contains specific hard examples
- How much to split?
  - Common ratio: 80-20 (train-test)
  - Validation/test sets must be representative of the data distribution you want to test on
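A three-way split can be obtained by calling `train_test_split` twice. The sketch below uses synthetic stand-in data and a 60-20-20 ratio, both of which are assumptions chosen for illustration; `stratify` keeps the class proportions the same in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, two balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# First hold out 20% of the data as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# ...then split the remaining 80% into train (60% of the total)
# and validation (20% of the total, i.e. 25% of the remainder).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```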

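`random_state`: set it to a fixed value so that the shuffle, and therefore the split, is the same on every run. This is important for reproducibility.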

In [8]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(df_X, df_y, train_size=0.8, random_state=0)
train_X

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4195,7.1,0.450,0.24,2.7,0.040,24.0,87.0,0.98862,2.94,0.38,13.4,8
1581,6.2,0.560,0.09,1.7,0.053,24.0,32.0,0.99402,3.54,0.60,11.3,5
529,9.9,0.630,0.24,2.4,0.077,6.0,33.0,0.99740,3.09,0.57,9.4,5
3458,5.8,0.320,0.20,2.6,0.027,17.0,123.0,0.98936,3.36,0.78,13.9,7
752,6.7,0.200,0.42,14.0,0.038,83.0,160.0,0.99870,3.16,0.50,9.4,6
...,...,...,...,...,...,...,...,...,...,...,...,...
763,9.3,0.655,0.26,2.0,0.096,5.0,35.0,0.99738,3.25,0.42,9.6,5
835,7.6,0.665,0.10,1.5,0.066,27.0,55.0,0.99655,3.39,0.51,9.3,5
4750,6.0,0.140,0.37,1.2,0.032,63.0,148.0,0.99185,3.32,0.44,11.2,5
3598,5.8,0.200,0.24,1.4,0.033,65.0,169.0,0.99043,3.59,0.56,12.3,7


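(Optional) Save your dataset to disk. Pass a file path, otherwise `to_csv()` only returns the CSV text as a string. The filename below is just an example.

df.to_csv("wine.csv", index=False)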

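**Scale the dataset** so that features are within the same range. Some algorithms, especially distance-based or gradient-based ones such as SVM and logistic regression, work better when features are on comparable scales.

Two common scalers

- Standard scaler: subtract the mean and divide by the standard deviation, so each feature has mean 0 and variance 1 (this does not make the feature Gaussian)
- Min-max scaler: rescale each feature linearly to [0, 1]

scikit-learn interface
- `.fit()`: learn the statistics (e.g. mean and standard deviation) from the data
- `.transform()`: apply the learned statistics to data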
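To make the two scalers and the `fit`/`transform` split concrete, here is a small worked example on a single made-up feature (the arrays `train` and `new` are illustrative, not wine data). Note that `StandardScaler` uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # one feature, three samples
new = np.array([[4.0]])                  # a later sample outside the training range

std = StandardScaler().fit(train)  # learns mean_ and scale_ (standard deviation)
mm = MinMaxScaler().fit(train)     # learns data_min_ and data_max_

print(std.mean_, std.scale_)         # mean 2.0, std sqrt(2/3) ~ 0.816
print(std.transform(train).ravel())  # approximately [-1.225, 0, 1.225]
print(mm.transform(train).ravel())   # [0.  0.5 1. ]

# transform() reuses the statistics learned in fit(), so a new value
# can land outside [0, 1] with the min-max scaler.
print(mm.transform(new).ravel())     # [1.5]
```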

In [26]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
scaler

StandardScaler()

In [27]:
scaler.fit(train_X)

StandardScaler()

In [31]:
train_X_normalized = scaler.transform(train_X)
train_X_normalized

array([[-1.27076173,  1.52297764, -1.46053155, ..., -0.78276301,
        -0.21747326, -0.90110917],
       [-0.88541308,  0.82135894, -1.27975543, ...,  0.16591702,
         0.74287303, -0.90110917],
       [ 1.49090356,  1.18516123, -0.37587486, ..., -0.01196049,
        -0.91590692, -0.90110917],
       ...,
       [-0.50006444, -1.04962425,  1.73317981, ..., -1.13851802,
        -1.17781955,  0.28278259],
       [-0.30739011,  0.19769788,  1.91395592, ..., -0.664178  ,
        -1.52703638,  0.28278259],
       [-0.75696353, -0.89370899, -0.19509875, ...,  0.34379452,
         0.3936562 ,  0.28278259]])

## Build ML model

Select a model. Since we are doing classification, here are some classifiers we can use:

- Logistic regression
- SVM
- Random forest
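One rough way to choose among the candidates above is to compare their cross-validated accuracy. The sketch below does this on synthetic data from `make_classification`, which is only a stand-in for the wine features; the hyperparameters shown are assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the wine features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    # Scaling is bundled with the models that are sensitive to feature scale.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.3f}")
```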

In [52]:
from sklearn.svm import SVC

model = SVC(kernel="rbf")

In [53]:
model.fit(train_X_normalized, train_y)

SVC()

In [54]:
preds = model.predict(train_X_normalized)
preds

array(['white', 'red', 'red', ..., 'white', 'white', 'white'],
      dtype=object)

Evaluate the model on the train set. NOTE: this only shows whether the model can learn anything at all; it says nothing about how well the model generalizes.

In [55]:
(preds == train_y).sum() / len(train_y)

0.9945269741985927

Evaluate the model on the test set. NOTE: this is to see if the model can **generalize** to unseen data.

You **have to** use train set statistics to normalize the data here (basically just reuse the `scaler` object fitted above). Some reasons for this:

- Normalizing with test set statistics can be considered **data leakage**, because information from the test data leaks into your predictions (the data is no longer **unseen**)
- You should be able to make a prediction **even with 1 sample**. A single sample has no meaningful mean or standard deviation of its own, so you can't normalize it without statistics learned beforehand

In [57]:
test_X_normalized = scaler.transform(test_X)

In [59]:
preds = model.predict(test_X_normalized)
(preds == test_y).sum() / len(test_y)

0.9875
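One way to make the rule about train set statistics hard to break is to bundle the scaler and the model into a scikit-learn `Pipeline`. This is a sketch on synthetic data, not the notebook's own approach: `fit` learns the scaling statistics from the training data only, and `predict` reuses them, even for a single sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# fit() scales with statistics computed from X_train only, then trains the SVM.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipe.fit(X_train, y_train)

# The scaler inside the pipeline holds the train set mean, not the test set mean.
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))  # True

# Predicting a single sample works because the stored statistics are reused.
single_pred = pipe.predict(X_test[:1])
print(single_pred.shape)  # (1,)

test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```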

Evaluation (for classification)

- Binary classification: confusion matrix, precision, recall and F1 score
- Multi-class classification: macro and micro averages of the above metrics
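These metrics are available in `sklearn.metrics`. The example below uses a small hand-written set of labels (made up for illustration, not model output) so the numbers can be checked by hand, with `red` treated as the positive class.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels for illustration: 3 red and 5 white wines.
y_true = ["red", "red", "red", "white", "white", "white", "white", "white"]
y_pred = ["red", "red", "white", "white", "white", "white", "white", "red"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["red", "white"])
print(cm)  # [[2 1]
           #  [1 4]]

# Binary metrics with "red" as the positive class: TP=2, FP=1, FN=1.
precision = precision_score(y_true, y_pred, pos_label="red")  # 2/3
recall = recall_score(y_true, y_pred, pos_label="red")        # 2/3
f1 = f1_score(y_true, y_pred, pos_label="red")                # 2/3

# Macro average: unweighted mean of the per-class F1 scores (2/3 and 4/5).
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro average: computed from pooled counts; equals accuracy here (6/8).
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(precision, recall, f1, macro_f1, micro_f1)
```

The macro average treats every class equally, while the micro average is dominated by the larger classes, so the two diverge on imbalanced data.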

Examine the errors

- Which samples are misclassified? Inspecting them helps explain why the model fails to predict them
- Can the number of input features be reduced? Look at feature weights, or manually remove features one at a time and compare the score
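Both checks can be sketched in a few lines. The example below runs on synthetic data with made-up feature names `f0` to `f4`, and the helper `fit_and_score` is hypothetical. It collects the misclassified test rows, then retrains with each feature left out in turn and records the change in test accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with made-up feature names.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr, name="label")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

def fit_and_score(columns):
    """Hypothetical helper: train on the given columns, return model and test accuracy."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train[columns], y_train)
    return model, model.score(X_test[columns], y_test)

all_cols = list(X.columns)
model, baseline = fit_and_score(all_cols)

# 1) Which test samples are misclassified?
preds = model.predict(X_test)
wrong = preds != y_test.to_numpy()
errors = X_test[wrong].assign(true=y_test[wrong], predicted=preds[wrong])
print(errors.head())

# 2) Leave one feature out at a time and compare against the baseline.
ablation = {}
for col in all_cols:
    kept = [c for c in all_cols if c != col]
    _, score = fit_and_score(kept)
    ablation[col] = score - baseline  # close to 0 or positive: feature may be removable
print(ablation)
```

A feature whose removal barely changes the score is a candidate for dropping, though on a single small test set such differences are noisy.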