This notebook illustrates the effect of several normalization and discretization techniques, using the white wine quality dataset as an example. These techniques are common preprocessing steps applied to a dataset before training machine learning algorithms on it.

In [1]:
import pandas as pd

# Read the white wine data from the semicolon-delimited CSV file
data_white = pd.read_csv('data/full/winequality-white.csv', delimiter=';')

data_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


### Normalization

Features whose values span very different ranges can degrade the performance of many algorithms: in distance-based or gradient-based methods, a feature with a large numeric range (such as total sulfur dioxide) can dominate one with a small range (such as chlorides). Normalization puts the features on a comparable scale to avoid this.

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

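# Rescale every column to the [0, 1] range (this includes the `quality` column)
scaler = MinMaxScaler()
normalized_data_white = scaler.fit_transform(data_white)
# fit_transform returns a NumPy array, so restore the column names
data_white_scaled = pd.DataFrame(normalized_data_white, columns=data_white.columns)
data_white_scaled.head()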

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0.307692,0.186275,0.216867,0.308282,0.106825,0.149826,0.37355,0.267785,0.254545,0.267442,0.129032,0.5
1,0.240385,0.215686,0.204819,0.015337,0.118694,0.041812,0.285383,0.132832,0.527273,0.313953,0.241935,0.5
2,0.413462,0.196078,0.240964,0.096626,0.121662,0.097561,0.204176,0.154039,0.490909,0.255814,0.33871,0.5
3,0.326923,0.147059,0.192771,0.121166,0.145401,0.156794,0.410673,0.163678,0.427273,0.209302,0.306452,0.5
4,0.326923,0.147059,0.192771,0.121166,0.145401,0.156794,0.410673,0.163678,0.427273,0.209302,0.306452,0.5


A widely used normalization technique is Min-Max scaling, which linearly rescales each feature to the range 0 to 1 (the default range). It is implemented here with the MinMaxScaler from the sklearn library. Because the transformation is linear, MinMaxScaler preserves the shape of each feature's original distribution, which is useful when the relative spacing of values within a feature and the interpretability of the data are priorities. It is sensitive to outliers, however, since the minimum and maximum of each feature define its scale. The scaler also provides an inverse_transform method that recovers the original data. The output above shows the result: all values now lie within the 0 to 1 range.
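To make the Min-Max formula and the inverse transform concrete, here is a minimal sketch. It uses a small hypothetical array (`X_toy`, not taken from the wine data) and checks that MinMaxScaler matches the manual computation `(x - min) / (max - min)` applied per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (not from the wine dataset): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 500.0]])

mm_scaler = MinMaxScaler()
X_mm = mm_scaler.fit_transform(X_toy)

# Manual Min-Max scaling, computed column by column
X_mm_manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# inverse_transform maps the scaled values back to the original units
X_mm_restored = mm_scaler.inverse_transform(X_mm)

print(X_mm)
print(np.allclose(X_mm, X_mm_manual), np.allclose(X_mm_restored, X_toy))
```

Both checks should agree, since the scaler stores the per-column minimum and range learned during fitting and reuses them for the inverse mapping.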

In [7]:

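# Standardize every column to mean 0 and standard deviation 1 (this includes `quality`)
scaler = StandardScaler()
normalized_data_white_zscore = scaler.fit_transform(data_white)
# fit_transform returns a NumPy array, so restore the column names
data_white_zscore_scaled = pd.DataFrame(normalized_data_white_zscore, columns=data_white.columns)
data_white_zscore_scaled.head()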



Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0.172097,-0.08177,0.21328,2.821349,-0.035355,0.569932,0.744565,2.331512,-1.246921,-0.349184,-1.393152,0.13787
1,-0.657501,0.215896,0.048001,-0.944765,0.147747,-1.253019,-0.149685,-0.009154,0.740029,0.001342,-0.824276,0.13787
2,1.475751,0.017452,0.543838,0.100282,0.193523,-0.312141,-0.973336,0.358665,0.475102,-0.436816,-0.336667,0.13787
3,0.409125,-0.478657,-0.117278,0.415768,0.559727,0.687541,1.121091,0.525855,0.01148,-0.787342,-0.499203,0.13787
4,0.409125,-0.478657,-0.117278,0.415768,0.559727,0.687541,1.121091,0.525855,0.01148,-0.787342,-0.499203,0.13787


Another common technique is Z-score normalization (standardization), which rescales each feature to have a mean of 0 and a standard deviation of 1. This benefits algorithms that are sensitive to feature scale, such as k-nearest neighbors, support vector machines, PCA, and models trained with gradient descent. Here, the StandardScaler from the sklearn library was used. Standardization does not require the features to be Gaussian, and it does not bound values to a fixed range; it only shifts and rescales each feature, so the shape of its distribution is unchanged. The output above shows that the values are now centered around 0, with each value expressing how many standard deviations an observation lies above or below its feature mean.
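The following minimal sketch checks the standardization on a small hypothetical array (`X_demo`, not taken from the wine data). It compares StandardScaler with the manual formula `(x - mean) / std`, using NumPy's default population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not from the wine dataset): two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [4.0, 500.0]])

z_scaler = StandardScaler()
X_z = z_scaler.fit_transform(X_demo)

# Manual z-score per column; np.std defaults to ddof=0 (population standard deviation)
X_z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# After standardization each column has mean ~0 and standard deviation ~1
col_means = X_z.mean(axis=0)
col_stds = X_z.std(axis=0)

print(np.allclose(X_z, X_z_manual))
print(col_means, col_stds)
```

The same check can be run on `data_white_zscore_scaled` with `.mean()` and `.std(ddof=0)`; note that pandas defaults to `ddof=1`, so the reported standard deviations would otherwise differ slightly from 1.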