#### Preprocessing
##### Scaling (Normalization or Standardization)
##### Binarize (Make data binary based on threshold)

Scaling / Normalization / Standardization
(Applicable to continuous columns)
Algos like Neural Networks, PCA and KNN work best when all variables are on the same scale: 0 to 1, or a given min and a given max value

3 Types of Scaling:
    
    1. 0 - 1 Scaling or Min Max Scaling: Transform values into the range 0 to 1

        Example: range of Age 0 - 85; Fare 0 - 800
        Convert both into the range 0 - 1 as (Xi - min)/(max - min)
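    2. Z score scaling or standard scaling

        Converts values to mean 0 and standard deviation 1 as (Xi - mean)/std
        Works best when the data is roughly normally distributed; most values then fall between -3 and +3
        The data should also be free of outliers. Outliers fall outside -3 and +3 (Fare reaches 9.26 in the standard scaling output further down)
        The output therefore has no fixed min and max, which can be a problem for algos that expect one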

    3. Robust Scaling - for skewed distributions that are not close to normal

        Applicable when data has many outliers - fat tails or skew
        Calculate as (Xi - Q2)/(Q3 - Q1) where Q1, Q2 (the median) and Q3 are the quartiles
        Only the quartiles are used to scale values, so outliers do not affect the center or the scale
        The impact of outliers on scaling is reduced
        
When a feature has very small variance, Z-score and Robust Scaling divide by a value close to zero (the standard deviation or the IQR), which can be unstable. Prefer 0-1 Scaling in that case
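The three scaling types listed above can be sketched by hand with NumPy. This is a minimal illustration on made-up values, not the Titanic data; the robust version centers on the median (Q2), which is what sklearn's `RobustScaler` does by default.

```python
import numpy as np

# Toy fares with one large outlier (made-up values, not from the Titanic data)
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 500.0])

# 1. Min max scaling: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z score scaling: (x - mean) / std, using the population std (ddof=0)
z_score = (x - x.mean()) / x.std()

# 3. Robust scaling: (x - median) / (Q3 - Q1)
q1, q2, q3 = np.percentile(x, [25, 50, 75])
robust = (x - q2) / (q3 - q1)

print(min_max.round(3))
print(z_score.round(3))
print(robust.round(3))
```

Only the min max result is guaranteed to stay inside a fixed range; the other two are unbounded.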

In [45]:
import os
import pandas as pd
from sklearn import preprocessing
import numpy as np
from sklearn.impute import SimpleImputer

os.chdir('/Users/suma/Documents/01 Data Science/Titanic Problem/')

In [63]:
df_train = pd.read_csv('titanic_train.csv')
df_test = pd.read_csv('titanic_test.csv')

#### Append Train and Test data (for preprocessing)

In [64]:
frames = [df_train, df_test]
df = pd.concat(frames, axis = 0, sort = False)
#df.info()

In [23]:
scalable_features = ['Age', 'Fare']

#### Scaling
##### Standard Scaling or Z Score Scaling

Parts of a learner's objective function, such as the l1 and l2 regularizers in linear models (logistic regression falls in this category) and the RBF kernel in SVM, assume that all features are centered around zero and have variance of the same order.

In [25]:
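scaler1 = preprocessing.StandardScaler()
scaler1.fit(df[scalable_features])  # NaN values are ignored during fit
print(scaler1.scale_)  # scale_ is the sqrt of var_
print(scaler1.var_)
# Overwrites Age and Fare in place, so re-running this cell fits on already scaled data
# (the [1. 1.] printed below is consistent with such a re-run)
df[scalable_features] = scaler1.transform(df[scalable_features])
df[scalable_features].describe()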

[1. 1.]
[1. 1.]


| | Age | Fare |
|---|---|---|
| count | 1046.0 | 1308.0 |
| mean | 1.921132e-17 | 1.294411e-17 |
| std | 1.000478 | 1.000382 |
| min | -2.062328 | -0.6435292 |
| 25% | -0.6164631 | -0.4909206 |
| 50% | -0.1305747 | -0.3641609 |
| 75% | 0.6329641 | -0.03905147 |
| max | 3.478882 | 9.25868 |
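The `std` row above is 1.000478 rather than exactly 1. A likely explanation is that standard scaling divides by the population standard deviation (ddof=0), while pandas `describe()` reports the sample standard deviation (ddof=1); the two differ by a factor of sqrt(n / (n - 1)). The sketch below checks this on synthetic data of the same length as the non-missing Age column. The distribution parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1046  # count of non-missing Age values in the describe() output
x = rng.normal(loc=30.0, scale=14.0, size=n)  # synthetic stand-in for Age

z = (x - x.mean()) / x.std(ddof=0)  # z score scaling with the population std

pop_std = z.std(ddof=0)          # 1.0 by construction
sample_std = z.std(ddof=1)       # the convention pandas describe() uses
expected = np.sqrt(n / (n - 1))  # ratio between the two conventions

print(pop_std, sample_std, expected)
```

For n = 1046 the ratio rounds to 1.000478, matching the Age column; the same calculation with n = 1308 gives the 1.000382 shown for Fare.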


##### Min Max Scaling or 0-1 Scaling

Commonly used for KNN, Neural Networks etc., where features should share a bounded range such as 0 to 1

In [26]:
scaler2 = preprocessing.MinMaxScaler()
scaler2.fit(df[scalable_features])
# df was already standard scaled above, so these are the min and max z scores
print(scaler2.data_min_)
print(scaler2.data_max_)
df[scalable_features] = scaler2.transform(df[scalable_features])
df[scalable_features].describe()

[-2.06232797 -0.6435292 ]
[3.47888164 9.25867993]


| | Age | Fare |
|---|---|---|
| count | 1046.0 | 1308.0 |
| mean | 0.37218 | 0.064988 |
| std | 0.180552 | 0.101026 |
| min | 0.0 | 0.0 |
| 25% | 0.260929 | 0.015412 |
| 50% | 0.348616 | 0.028213 |
| 75% | 0.486409 | 0.061045 |
| max | 1.0 | 1.0 |
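Min max scaling is not limited to 0 to 1; any given min and max can be the target. Below is a minimal NumPy sketch of that idea. The helper `min_max_scale` and the sample ages are hypothetical, introduced only for illustration. In sklearn the same effect is available through the `feature_range` parameter of `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Scale x linearly so that its min maps to lo and its max maps to hi."""
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())  # first into the range 0 to 1
    return unit * (hi - lo) + lo                # then stretch and shift

ages = np.array([2.0, 18.0, 35.0, 60.0, 80.0])  # made-up values
print(min_max_scale(ages))                      # range 0 to 1
print(min_max_scale(ages, lo=-1.0, hi=1.0))     # range -1 to 1
```

This sketch does not handle a constant column, where max - min is zero.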


In [None]:
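from sklearn.preprocessing import normalize
# normalize() is unit vector scaling, not min max scaling:
# with axis=0 each column is divided by its L2 norm
# normalize() rejects NaN, so fill missing values first (the median is one option)
filled = df[scalable_features].fillna(df[scalable_features].median())
df[scalable_features] = normalize(filled, axis=0)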
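To make clear what unit vector scaling with `axis = 0` does, here is a small NumPy sketch on made-up values. Each column is divided by its own L2 norm, so the column ends up with length 1; unlike min max scaling, the smallest value is not mapped to 0.

```python
import numpy as np

X = np.array([[3.0, 1.0],
              [4.0, 2.0],
              [0.0, 2.0]])  # made-up values: 3 rows, 2 columns

col_norms = np.sqrt((X ** 2).sum(axis=0))  # L2 norm of each column
unit_cols = X / col_norms                  # each column now has length 1

print(col_norms)
print(unit_cols)
```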

##### Robust Scaling

In [28]:
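# Robust Scaling
scaler3 = preprocessing.RobustScaler()
scaler3.fit(df[scalable_features])
print(scaler3.center_)  # median of each column
print(scaler3.scale_)   # IQR (Q3 - Q1) of each column
# Overwrites Age and Fare in place; a re-run on already robust scaled data
# prints center_ [0. 0.] and scale_ [1. 1.]
df[scalable_features] = scaler3.transform(df[scalable_features])
df[scalable_features].describe()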

[0. 0.]
[1. 1.]


| | Age | Fare |
|---|---|---|
| count | 1046.0 | 1308.0 |
| mean | 0.104508 | 0.805899 |
| std | 0.80075 | 2.213877 |
| min | -1.546111 | -0.61825 |
| 25% | -0.388889 | -0.280523 |
| 50% | 0.0 | 0.0 |
| 75% | 0.611111 | 0.719477 |
| max | 2.888889 | 21.295639 |
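The claim that robust scaling reduces the impact of outliers can be checked on a toy example. The sketch below uses made-up values and two hypothetical helpers, `z_scale` and `robust_scale`. It adds one extreme value to a small dataset and measures how far the scaled value of an ordinary point moves under each method.

```python
import numpy as np

def z_scale(x):
    # (x - mean) / population std
    return (x - x.mean()) / x.std()

def robust_scale(x):
    # (x - median) / (Q3 - Q1)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (x - q2) / (q3 - q1)

base = np.arange(1.0, 22.0)             # 1, 2, ..., 21 (made-up values)
with_outlier = np.append(base, 1000.0)  # same data plus one extreme value

# How far does the scaled value of the first point (1.0) move?
z_shift = abs(z_scale(with_outlier)[0] - z_scale(base)[0])
robust_shift = abs(robust_scale(with_outlier)[0] - robust_scale(base)[0])
print(z_shift, robust_shift)
```

In this example the z score of the first point moves by more than one standard deviation, while its robust scaled value barely changes.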


##### Binarize

In [66]:
# Binarizer expects a 2D matrix or df as input
# Reload the input data before running the code below, since the scaled data no longer matches the original threshold values

In [65]:
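# Fill the missing Fare value (SimpleImputer uses the column mean by default)
cont_imputer = SimpleImputer()
df['Fare'] = cont_imputer.fit_transform(df[['Fare']]).ravel()

# Values above the threshold become 1, the rest 0; fit and transform on the same 2D DataFrame
binarizer = preprocessing.Binarizer(threshold=50.0).fit(df[['Fare']])
df['binaryFare'] = binarizer.transform(df[['Fare']]).ravel()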

In [69]:
print(df[df['Fare'] > 50].binaryFare.unique())
print(df[df['Fare'] < 50].binaryFare.unique())

[1.]
[0.]
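The two checks above cover fares above and below 50 but not a fare of exactly 50. sklearn's `Binarizer` maps values strictly greater than the threshold to 1 and everything else, including the threshold itself, to 0. A minimal NumPy sketch of the same rule, on made-up fares:

```python
import numpy as np

fares = np.array([7.25, 49.99, 50.0, 50.01, 512.33])  # made-up values
threshold = 50.0

# Strictly greater than the threshold -> 1.0, otherwise 0.0
binary = (fares > threshold).astype(float)
print(binary)  # [0. 0. 0. 1. 1.]
```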


In [42]:
x = df.Fare.values
print(x.shape)
x = x.reshape(-1,1)
print(x.shape)

x = df.Fare.values
print(x.shape)
x = x.reshape(1,-1)
print(x.shape)

(1309,)
(1309, 1)
(1309,)
(1, 1309)


In [54]:
df[['Fare']].shape

(1309, 1)
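The shapes above matter because sklearn treats rows as samples and columns as features. A short sketch on made-up values shows the difference: with `reshape(-1, 1)` a per-column statistic is computed over all values, while with `reshape(1, -1)` every value becomes its own feature.

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0])  # made-up values

as_column = x.reshape(-1, 1)  # 3 samples, 1 feature
as_row = x.reshape(1, -1)     # 1 sample, 3 features

# Per-column mean, the kind of statistic a scaler computes during fit
col_mean = as_column.mean(axis=0)  # one mean over all three values
row_mean = as_row.mean(axis=0)     # three "means", each of a single value

print(col_mean.shape, row_mean.shape)
```

This is why a single column such as Fare is passed as `df[['Fare']]` or `reshape(-1, 1)`.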