# Normalization for numerical features
> An example of preprocessing, namely scaling numerical variables

- toc: true
- badges: false
- comments: true
- author: Cécile Gallioz
- categories: [sklearn]

# Preparation

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

## The target classes

In [3]:
target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()

Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64

In [4]:
target.value_counts(normalize=True)

Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64

## Features and train/test split

In [5]:
data = myDataFrame.drop(columns=target_column)
data.columns

Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')

In [6]:
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]

In [9]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric,
    target,
    # random_state=42,  # uncomment for a reproducible split; without it the figures below change on every run
    test_size=0.25)

In [10]:
data_train.describe()

| | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|
| count | 256.0 | 256.0 |
| mean | 43.614844 | 17.176953 |
| std | 5.394101 | 2.001934 |
| min | 32.1 | 13.1 |
| 25% | 38.8 | 15.575 |
| 50% | 43.55 | 17.3 |
| 75% | 48.4 | 18.725 |
| max | 55.8 | 21.5 |


# Normalization

In [15]:
scaler = StandardScaler()
data_train_scaled = scaler.fit_transform(data_train)
data_train_scaled = pd.DataFrame(data_train_scaled,
                                 columns=data_train.columns)
data_train_scaled.describe()

| | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|
| count | 256.0 | 256.0 |
| mean | -1.013079e-15 | -2.983724e-16 |
| std | 1.001959 | 1.001959 |
| min | -2.138892 | -2.040496 |
| 25% | -0.8943613 | -0.8017701 |
| 50% | -0.01204478 | 0.06158439 |
| 75% | 0.8888468 | 0.7747903 |
| max | 2.263403 | 2.163665 |
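
The held-out data can be scaled with the statistics learned on the training set, so that no information from the test set leaks into the preprocessing. The sketch below illustrates this under stated assumptions: the `demo` frame is synthetic (made-up values standing in for the penguin measurements), and the `demo_*` names are hypothetical so they do not clash with the variables used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin measurements (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Culmen Length (mm)": rng.normal(44.0, 5.0, size=342),
    "Culmen Depth (mm)": rng.normal(17.0, 2.0, size=342),
})

demo_train, demo_test = train_test_split(demo, test_size=0.25, random_state=0)

# Learn the mean and standard deviation on the training set only.
demo_scaler = StandardScaler().fit(demo_train)

# Reuse those training statistics on the test set: transform, not fit_transform.
demo_test_scaled = pd.DataFrame(
    demo_scaler.transform(demo_test),
    columns=demo_test.columns,
    index=demo_test.index,
)

print(demo_test_scaled.describe())
```

Because the test set is scaled with the training statistics, its mean and standard deviation will generally be close to, but not exactly, 0 and 1.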


# Conclusion

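`StandardScaler` shifts and scales each feature independently so that, on the data it was fitted on, every feature has zero mean and unit standard deviation. The means reported above are of order 1e-15, which is zero up to floating-point error. The standard deviations read 1.001959 rather than exactly 1 because `describe()` reports the sample standard deviation (dividing by n - 1), whereas `StandardScaler` uses the population standard deviation (dividing by n); with n = 256 the ratio is sqrt(256/255) ≈ 1.001959.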
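
The shift-and-scale operation can also be checked by hand. The following is a minimal sketch, assuming synthetic data in place of the penguin measurements (the `demo` frame, its values, and the `demo_scaler` name are made up for illustration). It compares the scaler's output with a manual z-score and looks at the standard deviation that pandas reports afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Culmen Length (mm)": rng.normal(44.0, 5.0, size=256),
    "Culmen Depth (mm)": rng.normal(17.0, 2.0, size=256),
})

demo_scaler = StandardScaler().fit(demo)
scaled = demo_scaler.transform(demo)  # NumPy array by default

# Manual z-score using the population standard deviation (ddof=0).
manual = (demo - demo.mean()) / demo.std(ddof=0)
same_result = np.allclose(scaled, manual.to_numpy())

# pandas uses ddof=1 by default, so the scaled columns report
# sqrt(n / (n - 1)) instead of exactly 1.
n = len(demo)
sample_std = pd.DataFrame(scaled, columns=demo.columns).std()
expected_std = np.sqrt(n / (n - 1))

print(same_result)
print(sample_std)
print(expected_std)
```

The fitted statistics are available as `demo_scaler.mean_` and `demo_scaler.scale_`, which makes it possible to inspect what was learned for each column.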