# Preprocessing Numerical Features Using the scikit-learn `StandardScaler` Transformer
Although scikit-learn provides a range of preprocessing algorithms for transforming data before training a model, we use `StandardScaler` here because we want to standardize the data.

## Background on Preprocessing Features
A closer look at many machine learning algorithms shows that they make assumptions about the distribution of the features in a dataset. Hence, it is good practice to normalize or scale the features so that these assumptions hold more closely.

## Why should we scale features?
1. Models that rely on the distance between pairs of samples, for instance k-nearest neighbors, should be trained on normalized features so that each feature contributes roughly equally to the distance computations.
2. Many models, such as logistic regression, use a numerical solver (based on gradient descent) to find their optimal parameters. These solvers converge faster when the features are scaled.

In summary, whether or not a machine learning model requires feature scaling depends on the model family. Linear models such as logistic regression generally benefit from feature scaling, whereas other model families such as decision trees do not need it but are not harmed by it either.
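
To make the first point concrete, here is a minimal sketch on three made-up samples (the values and the helper `age_share` are hypothetical, not taken from the census data). It measures how much of the squared Euclidean distance between two samples comes from the age feature, before and after standardization.

```python
import numpy as np

# Hypothetical samples: column 0 is an age in years,
# column 1 is a capital gain in dollars.
X = np.array([
    [25.0, 0.0],
    [60.0, 100.0],
    [26.0, 5000.0],
])


def age_share(a, b):
    """Fraction of the squared Euclidean distance due to the age column."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[0] / squared_diff.sum())


# Standardize each column by hand: subtract its mean, divide by its std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Samples 0 and 1 differ by 35 years but only 100 dollars.
raw_share = age_share(X[0], X[1])
scaled_share = age_share(X_scaled[0], X_scaled[1])

print(f"age share of squared distance, raw:    {raw_share:.3f}")
print(f"age share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw values the dollar column dominates the distance even though the two samples differ mostly in age; after standardization the age difference carries most of the distance.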

### Data Preparation
We first load the full dataset and then drop the categorical features so that we can focus on the numerical features in this notebook.

In [4]:
import pandas as pd 

adult_census = pd.read_csv("adult_census.csv")

In [5]:
adult_census.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [6]:
# Separate the target from the data features.
target_name = 'class'

target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

In [12]:
numeric_columns = [
    "age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numeric_columns]

In [13]:
data_numeric.head()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |


In [14]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42)

### Model fitting and preprocessing
Now, we will standardize our dataset and train a new logistic regression on the scaled version of the data.

In [15]:
data_train.describe()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| count | 36631.0 | 36631.0 | 36631.0 | 36631.0 |
| mean | 38.642352 | 1087.077721 | 89.665311 | 40.431247 |
| std | 13.725748 | 7522.692939 | 407.110175 | 12.423952 |
| min | 17.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 99999.0 | 4356.0 | 99.0 |


In [17]:
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

In [18]:
# Here, we use the scikit-learn StandardScaler. This transformer shifts and scales each feature individually so that they all have zero mean and unit standard deviation.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data_train)

# The fit method of a transformer is similar to the fit method of a predictor, except that the former receives a single argument (the data matrix) while the latter receives two (the data matrix and the target).

# Note also that fit computes the mean and standard deviation of each feature and stores them as NumPy arrays. These statistics are the model states here. Computing the mean and standard deviation is specific to StandardScaler; other transformers compute different statistics and store them as model states in the same fashion.


In [19]:
# Let's inspect the computed means and standard deviations.
scaler.mean_

array([  38.64235211, 1087.07772106,   89.6653108 ,   40.43124676])

In [20]:
scaler.scale_

array([  13.72556083, 7522.59025606,  407.10461772,   12.42378265])

You may have noticed a scikit-learn convention in the two code statements above: if an attribute is learned from the data, its name ends with an underscore `_`, as in `mean_` and `scale_` for `StandardScaler`.
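
The sketch below illustrates this convention on a small made-up matrix (not the census data): the learned attributes only exist once `fit` has been called, and they can be recomputed by hand with NumPy. It also suggests why `scale_` above differs slightly from the `std` row of `describe()`: `StandardScaler` divides by the number of samples, whereas pandas uses the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix (4 samples, 2 features), used only for illustration.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = StandardScaler()
# Learned attributes do not exist before fit is called.
has_mean_before_fit = hasattr(scaler, "mean_")

scaler.fit(X)
has_mean_after_fit = hasattr(scaler, "mean_")

# Recompute the statistics by hand with NumPy.
manual_mean = X.mean(axis=0)
population_std = X.std(axis=0)        # ddof=0: divides by n
sample_std = X.std(axis=0, ddof=1)    # ddof=1: divides by n - 1

matches_mean = np.allclose(scaler.mean_, manual_mean)
matches_population = np.allclose(scaler.scale_, population_std)
matches_sample = np.allclose(scaler.scale_, sample_std)

print("mean_ exists before / after fit:", has_mean_before_fit, has_mean_after_fit)
print("mean_ matches the NumPy mean:", matches_mean)
print("scale_ matches the ddof=0 std:", matches_population)
print("scale_ matches the ddof=1 std:", matches_sample)
```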

Scaling is performed per feature in the data matrix: for each feature, we subtract its mean and divide by its standard deviation. Now that the scaler has been fitted, we can transform the data by calling the `transform` method.

In [22]:
data_train_scaled = scaler.transform(data_train)
data_train_scaled
# The transform method of a transformer is similar to the predict method of a predictor. It applies a predefined transformation function that uses the model states and the input data to output a transformed version of the input data.

array([[ 0.17177061, -0.14450843,  5.71188483, -2.28845333],
       [ 0.02605707, -0.14450843, -0.22025127, -0.27618374],
       [-0.33822677, -0.14450843, -0.22025127,  0.77019645],
       ...,
       [-0.77536738, -0.14450843, -0.22025127, -0.03471139],
       [ 0.53605445, -0.14450843, -0.22025127, -0.03471139],
       [ 1.48319243, -0.14450843, -0.22025127, -2.69090725]])

In [23]:
# The fit_transform method is a shorthand for calling fit and then transform in succession.
data_train_scaled = scaler.fit_transform(data_train)
data_train_scaled

array([[ 0.17177061, -0.14450843,  5.71188483, -2.28845333],
       [ 0.02605707, -0.14450843, -0.22025127, -0.27618374],
       [-0.33822677, -0.14450843, -0.22025127,  0.77019645],
       ...,
       [-0.77536738, -0.14450843, -0.22025127, -0.03471139],
       [ 0.53605445, -0.14450843, -0.22025127, -0.03471139],
       [ 1.48319243, -0.14450843, -0.22025127, -2.69090725]])
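
As a sanity check, the following sketch uses synthetic data (random numbers, not the census features) to compare three routes that should give the same result: `fit` followed by `transform`, the `fit_transform` shorthand, and the formula applied by hand with the stored `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Route 1: fit, then transform.
scaler = StandardScaler().fit(X)
two_steps = scaler.transform(X)

# Route 2: the fit_transform shorthand on a fresh scaler.
one_step = StandardScaler().fit_transform(X)

# Route 3: apply the transformation function by hand using the model states.
by_hand = (X - scaler.mean_) / scaler.scale_

print("fit + transform == fit_transform:", np.allclose(two_steps, one_step))
print("fit + transform == manual formula:", np.allclose(two_steps, by_hand))

# Each scaled feature ends up with (numerically) zero mean and unit std.
print("column means close to 0:", np.allclose(one_step.mean(axis=0), 0.0))
print("column stds close to 1:", np.allclose(one_step.std(axis=0), 1.0))
```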

### Chaining Operations Together with a scikit-learn Pipeline
The scikit-learn helper function `make_pipeline` creates a `Pipeline`. It takes as arguments the successive transformations to perform, followed by the classifier or regressor model.

In [25]:
import time
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
model

In [26]:
# We can check the name of each step of our model.
model.named_steps

{'standardscaler': StandardScaler(),
 'logisticregression': LogisticRegression()}

In [27]:
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

In [29]:
predicted_target = model.predict(data_test)
predicted_target[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
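
To see what the pipeline does when `predict` is called, here is a minimal sketch on a synthetic classification problem (generated with `make_classification`, so the data and the variable names are assumptions, not the census setup). It reproduces the pipeline's predictions by calling the fitted steps one after the other: the test data is scaled with the statistics learned on the training data, then passed to the classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Predictions from the pipeline in a single call.
pipe_pred = pipe.predict(X_test)

# The same predictions, obtained by calling the fitted steps manually.
fitted_scaler = pipe.named_steps["standardscaler"]
fitted_classifier = pipe.named_steps["logisticregression"]
X_test_scaled = fitted_scaler.transform(X_test)  # uses training statistics
manual_pred = fitted_classifier.predict(X_test_scaled)

print("same predictions:", np.array_equal(pipe_pred, manual_pred))
```

The scaler is never refit on the test data here; the pipeline takes care of reusing the training mean and standard deviation.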

In [30]:
# We can check the computational and statistical performance of our predictive pipeline.
model_name = model.__class__.__name__

score = model.score(data_test, target_test)

print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model[-1].n_iter_[0]} iterations")

The accuracy using a Pipeline is 0.807 with a fitting time of 0.093 seconds in 12 iterations


We can compare this predictive pipeline with a logistic regression model trained without scaling the features.

In [31]:
model = LogisticRegression()
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

In [32]:
model_name = model.__class__.__name__
score = model.score(data_test, target_test)
print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model.n_iter_[0]} iterations")

The accuracy using a LogisticRegression is 0.807 with a fitting time of 0.153 seconds in 59 iterations


We observe that scaling the data before training the logistic regression model is beneficial in terms of computational performance. The statistical performance is the same for both models (an accuracy of 0.807), since both converge, but there is a stark difference in the number of iterations (12 instead of 59) and in the fitting time (0.093 seconds instead of 0.153 seconds).

In [1]:
## Model evaluation using cross-validation
