# Normalization

Normalization is a data preprocessing technique that rescales features to a common range, typically between 0 and 1, so that features with large numeric ranges do not dominate those with small ones when a Machine Learning algorithm processes them.

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/python/normalization.png" width="1200">

In [1]:
import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df

Unnamed: 0,A,B,C
0,1,100,a
1,2,300,b
2,3,500,c


In [2]:
# Min-max normalization: subtract the column minimum from each value, then divide by the range (max - min)

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

In [3]:
normalize(df[['A','B']])

Unnamed: 0,A,B
0,0.0,0.0
1,0.5,0.5
2,1.0,1.0
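
The `normalize` function above divides by `max - min`, which is zero for a constant column and would produce `NaN` values. The sketch below shows one possible guard, assuming constant columns should simply map to 0.0. The name `normalize_safe`, the column `K`, and that convention are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each column to [0, 1]; constant columns become 0.0 (assumed convention)."""
    result = df.astype(float)  # astype returns a copy, so the input is left untouched
    for name in df.columns:
        col_min = df[name].min()
        col_range = df[name].max() - col_min
        if col_range == 0:
            # Constant column: avoid dividing by zero
            result[name] = 0.0
        else:
            result[name] = (df[name] - col_min) / col_range
    return result

# Hypothetical demo data: 'K' is constant
demo = pd.DataFrame({'A': [1, 2, 3], 'K': [7, 7, 7]})
scaled = normalize_safe(demo)
print(scaled)
```
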


In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
A,3.0,2.0,1.0,1.0,1.5,2.0,2.5,3.0
B,3.0,300.0,200.0,100.0,200.0,300.0,400.0,500.0


In [4]:
df.describe()

Unnamed: 0,A,B
count,3.0,3.0
mean,2.0,300.0
std,1.0,200.0
min,1.0,100.0
25%,1.5,200.0
50%,2.0,300.0
75%,2.5,400.0
max,3.0,500.0


In [5]:
import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })

# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns, index=df.index)
df_scaled

[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]


Unnamed: 0,A,B
0,0.0,0.0
1,0.5,0.5
2,1.0,1.0


In [6]:
df_scaled.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
A,3.0,0.5,0.5,0.0,0.25,0.5,0.75,1.0
B,3.0,0.5,0.5,0.0,0.25,0.5,0.75,1.0
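
A fitted `MinMaxScaler` stores the per-column minimum and maximum it saw during `fit`, so the same scaling can be applied to rows that arrive later. The short sketch below assumes a hypothetical new observation, `new_rows`, which is not in the original notebook. Values outside the fitted range land outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500]})
scaler = MinMaxScaler().fit(train)

# Hypothetical unseen data, scaled with the min/max learned from `train`
new_rows = pd.DataFrame({'A': [4], 'B': [200]})
new_scaled = scaler.transform(new_rows)

print(scaler.data_min_, scaler.data_max_)  # per-column min and max seen during fit
print(new_scaled)
```

Here `A = 4` is above the fitted maximum of 3, so its scaled value exceeds 1.
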


## Manual Standardization

Standardization is another scaling method, in which values are centered on the mean and rescaled to unit standard deviation. As a result, each feature ends up with a mean of zero and a standard deviation of one.
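
In formula form, for a value $x$ of a feature with mean $\mu$ and standard deviation $\sigma$ (the symbols are notation chosen here, not taken from the original notebook), the standardized value is commonly written as:

$$
z = \frac{x - \mu}{\sigma}
$$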

In [8]:
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df

Unnamed: 0,A,B,C
0,1,100,a
1,2,300,b
2,3,500,c


In [None]:
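# Standardize the numeric columns A and B --> mean=0, std=1 (pandas .std() uses the sample std, ddof=1)
# Selecting by label replaces the integer columns with the float results
df[['A', 'B']] = df[['A', 'B']].apply(lambda x: (x - x.mean()) / x.std(), axis=0)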
df

Unnamed: 0,A,B,C
0,-1.0,-1.0,a
1,0.0,0.0,b
2,1.0,1.0,c


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
A,3.0,0.0,1.0,-1.0,-0.5,0.0,0.5,1.0
B,3.0,0.0,1.0,-1.0,-0.5,0.0,0.5,1.0


## Standardization with Sklearn

In [7]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
# Scale only the numeric columns; selecting by label replaces the integer columns with floats
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
df

Unnamed: 0,A,B,C
0,-1.224745,-1.224745,a
1,0.0,0.0,b
2,1.224745,1.224745,c


In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
A,3.0,0.0,1.224745,-1.224745,-0.612372,0.0,0.612372,1.224745
B,3.0,0.0,1.224745,-1.224745,-0.612372,0.0,0.612372,1.224745
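
The manual version gives ±1.0 while `StandardScaler` gives ±1.224745 because the two use different standard deviations. The pandas `Series.std()` method defaults to the sample estimate (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). `describe()` also reports the sample standard deviation, which is why the scaled columns show a `std` of 1.224745 rather than 1. A small sketch that reproduces both results with pandas alone:

```python
import pandas as pd

a = pd.Series([1, 2, 3], dtype=float)

# Sample standard deviation (ddof=1), as in the manual cell
z_sample = (a - a.mean()) / a.std(ddof=1)

# Population standard deviation (ddof=0), matching StandardScaler's output
z_population = (a - a.mean()) / a.std(ddof=0)

print(z_sample.tolist())
print(z_population.round(6).tolist())
```

Both results are valid standardizations. They differ only by a constant factor, which shrinks as the number of rows grows.
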


In [10]:
pd.DataFrame(scaler.inverse_transform(df.iloc[:,0:-1] ), columns = ['A','B'])

Unnamed: 0,A,B
0,1.0,100.0
1,2.0,300.0
2,3.0,500.0
