## Homework 2: Data Normalization and Data Point Distance Calculation

This homework uses the Wine Quality Data Set to practice normalizing data and computing some basic distances between data points. Python 3.7.1 was used for this purpose.

The winequality-red.csv and winequality-white.csv files were downloaded [1], and the file winequality.names [2] containing the data set documentation was used to understand its contents and to identify the column labels.

The pandas [3], scikit-learn [4] and Seaborn [5] documentation was used as a guide to solve the problems presented in this homework. Courses previously taken at DataCamp [6] were also helpful in solving the homework.

## Libraries used
- Pandas was imported to handle the Dataframes and to make the statistics calculations (mean, standard deviation, variance, skewness and mode).

- Scikit-learn is an open-source machine learning library for Python. Its preprocessing module was used to do the normalization and standardization.

- Seaborn was used to build the plots.

In [1]:
##--Importing necessary libraries
import pandas as pd 
from sklearn import preprocessing
from scipy.spatial import distance
import numpy as np
import seaborn as sns

## Loading the File
The dataset was inspected: although it is stored as a comma-separated values (csv) file, the values are actually separated by semicolons (;). The files contain a header row with the column names.
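
To illustrate why the separator matters, the following sketch parses a tiny semicolon-separated excerpt twice: once with the pandas default (comma) and once with `sep=";"`, which is an alias of the `delimiter` parameter. The inline `sample` string is a hypothetical stand-in for the downloaded file and holds only three of the twelve columns.

```python
import io
import pandas as pd

# Tiny stand-in for the semicolon-separated file (first two red wine rows).
sample = "fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n"

# With the default comma separator, every line is read as a single field.
wrong = pd.read_csv(io.StringIO(sample))

# With the semicolon separator, the three columns are split correctly.
right = pd.read_csv(io.StringIO(sample), header=0, sep=";")

print(wrong.shape)   # one column only
print(right.shape)   # three columns
print(right.dtypes)  # floats for the measurements, integer for quality
```

With the wrong separator pandas raises no error, it simply returns a one-column frame, so checking the shape right after loading is a cheap sanity test.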

Each file was read directly from the dataset URL and loaded as a pandas Dataframe, with the header and delimiter parameters set to 0 and ";" respectively. The first rows can be seen in the table below.

In [2]:
##--Reading the files

#Red from URL
file_red='http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

#Read the red file
df_red = pd.read_csv(file_red, header = 0, delimiter = ";")

print(df_red.info())
df_red.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None


row,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
#White from URL
file_white='http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

#Read the white file
df_white = pd.read_csv(file_white, header = 0, delimiter = ";")

print(df_white.info())
df_white.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None


row,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


The datasets describe red and white samples of the Portuguese "Vinho Verde" wine. According to the dataset description, each row holds the results of objective physicochemical tests on one wine, together with a quality score that is the median of at least 3 sensory evaluations made by wine experts. The features have varying ranges, and the quality score ranges from 0 (very bad) to 10 (very excellent).

The .info() method returns a summary of the data set. It shows that the objective tests make up 11 attribute columns stored as floats, and that the sensory evaluation (quality) is stored as an integer. The red wine dataset comprises 1599 observations and the white wine dataset 4898. None of the 12 columns has missing values.
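
The same facts can also be checked programmatically instead of being read off the .info() printout. A minimal sketch, assuming a hypothetical three-row stand-in `df` rather than the real files:

```python
import pandas as pd

# Hypothetical stand-in with three of the twelve columns (first three red wine rows).
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 5],
})

# Number of observations and columns.
n_rows, n_cols = df.shape

# Count of missing values per column; all zeros means nothing is missing.
missing_per_column = df.isna().sum()
has_missing = bool(missing_per_column.any())

print(n_rows, n_cols)
print(missing_per_column)
print(df.dtypes)
print(has_missing)
```

Applied to df_red or df_white, the same calls would report the row counts and the absence of missing values described above.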

## Standardization and Normalization
Standardization and normalization are processes in which the data is rescaled in order to fit a predetermined distribution or range. This process is useful for comparing measurements that have different scales or units and it is also required for many machine learning algorithms [7].

Many types of normalization exist, but this work focuses on three:
1. Min max normalization
2. Z-score normalization
3. Mean subtraction normalization

The normalizations will be performed on the red wine data set first and then on the white wine data set.
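
For reference, the three transformations listed above can be written for a single column $x$ as follows. The symbols are assumptions introduced here: $\mu$ is the column mean and $\sigma$ its standard deviation. "Mean subtraction" is assumed to mean centering only, without rescaling.

```latex
\begin{aligned}
x'_{\text{min-max}} &= \frac{x - \min(x)}{\max(x) - \min(x)} \\[4pt]
z &= \frac{x - \mu}{\sigma} \\[4pt]
x'_{\text{centered}} &= x - \mu
\end{aligned}
```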

### Min max scaling
The min max scaling (also called “normalization”) aims to rescale the data to the range from 0 to 1.

This normalization can be done using the function minmax = preprocessing.MinMaxScaler().fit_transform(df), as shown below. This function is documented in [4].

In [4]:
#Slice the first ten rows of each dataset
df_r_10 = df_red.iloc[:10,:]
df_w_10 = df_white.iloc[:10,:]

#Cast every column to float (quality is int64); this costs an extra copy but avoids a warning from the scaler
df_r_10 = df_r_10.astype(float)

#Min-Max normalization

minmax_r = preprocessing.MinMaxScaler().fit_transform(df_r_10)

minmax_r = pd.DataFrame(minmax_r, columns=df_r_10.columns)
minmax_r

row,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0.025641,0.7,0.0,0.142857,0.333333,0.125,0.190476,0.941176,1.0,0.294118,0.0,0.0
1,0.128205,1.0,0.0,0.285714,1.0,1.0,0.583333,0.647059,0.114286,0.647059,0.363636,0.0
2,0.128205,0.8,0.071429,0.22449,0.818182,0.375,0.428571,0.705882,0.285714,0.558824,0.363636,0.0
3,1.0,0.0,1.0,0.142857,0.30303,0.5,0.5,1.0,0.0,0.352941,0.363636,0.5
4,0.025641,0.7,0.0,0.142857,0.333333,0.125,0.190476,0.941176,1.0,0.294118,0.0,0.0
5,0.025641,0.633333,0.0,0.122449,0.30303,0.25,0.261905,0.941176,1.0,0.294118,0.0,0.0
6,0.153846,0.533333,0.107143,0.081633,0.121212,0.375,0.488095,0.529412,0.4,0.0,0.0,0.0
7,0.0,0.616667,0.0,0.0,0.0,0.375,0.035714,0.0,0.657143,0.029412,0.545455,1.0
8,0.128205,0.5,0.035714,0.163265,0.242424,0.0,0.0,0.647059,0.571429,0.323529,0.090909,1.0
9,0.051282,0.366667,0.642857,1.0,0.181818,0.5,1.0,0.941176,0.542857,1.0,1.0,0.0


In [5]:
print("Min-Max Dataframe description")
pd.DataFrame(minmax_r.describe(percentiles=[0.5])).round(3)

Min-Max Dataframe description


statistic,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.167,0.585,0.186,0.231,0.364,0.362,0.368,0.729,0.557,0.379,0.273,0.25
std,0.298,0.268,0.347,0.281,0.309,0.279,0.298,0.305,0.366,0.295,0.326,0.425
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.09,0.625,0.018,0.143,0.303,0.375,0.345,0.824,0.557,0.309,0.227,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The min max normalization has scaled the data column by column to the interval from 0 to 1. The Dataframe’s description confirms that the minimum and maximum of every column are 0 and 1, respectively. Every column still has a different mean and standard deviation, which will not be the case with the following normalization.
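
The column-wise behaviour described above can be reproduced by hand. The following is a minimal sketch, assuming a small frame `df` built from three columns of the first four red wine rows shown earlier. It applies (x - min) / (max - min) to each column and compares the result with MinMaxScaler.

```python
import pandas as pd
from sklearn import preprocessing

# Three columns of the first four red wine rows (illustrative subset).
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
})

# Column-wise min-max scaling written out by hand.
manual = (df - df.min()) / (df.max() - df.min())

# The same scaling via scikit-learn, wrapped back into a DataFrame.
scaled = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(df), columns=df.columns
)

print(manual.round(3))
# True when both approaches agree up to floating point error.
print(bool(((manual - scaled).abs() < 1e-12).all().all()))
```

Because the minimum and maximum are taken per column, the scaled values depend on which rows are included: fitting on ten rows, as done here, gives different numbers than fitting on the full dataset.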

### Z-score normalization
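Also called standardization, it rescales each column to have a mean of 0 and a standard deviation of 1. It shifts and scales the values but does not change the shape of their distribution, so the result is normally distributed only if the original data was. This normalization can be done using the function z = preprocessing.StandardScaler().fit_transform(df), as shown below.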
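
For reference, the z-score can also be computed by hand as (x - mean) / std. The sketch below assumes a small illustrative frame `df` with two columns of the first four red wine rows, and uses the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Two columns of the first four red wine rows (illustrative subset).
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
manual = (df - df.mean()) / df.std(ddof=0)

# The same transformation via scikit-learn.
scaled = pd.DataFrame(
    preprocessing.StandardScaler().fit_transform(df), columns=df.columns
)

print(manual.round(3))
# True when both approaches agree up to floating point error.
print(np.allclose(manual, scaled))
```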

In [6]:
#z-score normalization
z = preprocessing.StandardScaler().fit_transform(df_r_10)
z = pd.DataFrame(z, columns=df_r_10.columns)
z

row,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,-0.498662,0.451753,-0.563489,-0.329398,-0.103362,-0.896665,-0.626576,0.7312,1.276466,-0.305196,-0.88083,-0.620174
1,-0.135999,1.630239,-0.563489,0.206831,2.170608,2.406839,0.761143,-0.284356,-1.276466,0.957683,0.29361,-0.620174
2,-0.135999,0.844582,-0.346763,-0.022981,1.550434,0.047193,0.214466,-0.081244,-0.78235,0.641963,0.29361,-0.620174
3,2.946642,-2.298048,2.470683,-0.329398,-0.206725,0.519122,0.466778,0.934311,-1.605877,-0.094716,0.29361,0.620174
4,-0.498662,0.451753,-0.563489,-0.329398,-0.103362,-0.896665,-0.626576,0.7312,1.276466,-0.305196,-0.88083,-0.620174
5,-0.498662,0.189867,-0.563489,-0.406002,-0.206725,-0.424736,-0.374264,0.7312,1.276466,-0.305196,-0.88083,-0.620174
6,-0.045333,-0.202961,-0.238399,-0.559211,-0.826898,0.047193,0.424726,-0.690578,-0.45294,-1.357594,-0.88083,-0.620174
7,-0.589328,0.124396,-0.563489,-0.865627,-1.240347,0.047193,-1.173253,-2.518578,0.288234,-1.252354,0.88083,1.860521
8,-0.135999,-0.333904,-0.455126,-0.252794,-0.413449,-1.368595,-1.299409,-0.284356,0.041176,-0.199956,-0.58722,1.860521
9,-0.407997,-0.857676,1.38705,2.887978,-0.620174,0.519122,2.232966,0.7312,-0.041176,2.220561,2.348881,-0.620174


In [7]:
print("Z-score Dataframe description")
pd.DataFrame(z.describe(percentiles=[0.5])).round(2)

Z-score Dataframe description


statistic,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0
std,1.05,1.05,1.05,1.05,1.05,1.05,1.05,1.05,1.05,1.05,1.05,1.05
min,-0.59,-2.3,-0.56,-0.87,-1.24,-1.37,-1.3,-2.52,-1.61,-1.36,-0.88,-0.62
50%,-0.27,0.16,-0.51,-0.33,-0.21,0.05,-0.08,0.32,0.0,-0.25,-0.15,-0.62
max,2.95,1.63,2.47,2.89,2.17,2.41,2.23,0.93,1.28,2.22,2.35,1.86
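
The description reports a standard deviation of 1.05 rather than exactly 1. This is consistent with StandardScaler dividing by the population standard deviation (ddof=0), while pandas .describe() reports the sample standard deviation (ddof=1). For n = 10 rows the two differ by a factor of sqrt(n / (n - 1)), which is about 1.054. A minimal sketch with an arbitrary ten-value series `x` (not taken from the wine data):

```python
import numpy as np
import pandas as pd

n = 10
# Arbitrary ten-value series; any non-constant data gives the same ratio.
x = pd.Series(np.arange(n, dtype=float))

# Standardize with the population standard deviation, as StandardScaler does.
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)        # population std of the standardized values
sample_std = z.std()           # pandas default (ddof=1), as used by describe()
ratio = np.sqrt(n / (n - 1))   # expected factor between the two

print(round(pop_std, 2), round(sample_std, 2), round(ratio, 2))
```

The gap shrinks as the number of rows grows, so on the full datasets the reported standard deviation would be much closer to 1.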
