# Feature Scaling and Transformation Pipelines
### One of the most important transformations you need to apply to your data is feature scaling.
<br> **With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales**<br>_This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required._

<br> **There are two common ways to get all attributes to have the same scale**:

- **Min-max scaling**, also known as normalization:
    - Values are shifted and rescaled so that they end up ranging from 0 to 1.
    - We do this by subtracting the min value and dividing by the max minus the min.
    - Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason.
- **Standardization**:
    - First it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.
    - Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1).
    - Standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected.
    - Scikit-Learn provides a transformer called StandardScaler for standardization.
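
To make the two formulas above concrete, here is a minimal sketch that computes both scalings by hand with NumPy and compares them with the Scikit-Learn transformers. The array `x` is a made-up toy column, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature column (made-up values, not the housing data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-max scaling by hand: (x - min) / (max - min)
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization by hand: (x - mean) / std, using the population std (ddof=0)
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformations with Scikit-Learn
sk_minmax = MinMaxScaler().fit_transform(x)
sk_std = StandardScaler().fit_transform(x)

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_std, sk_std))
```

Both comparisons should agree, because StandardScaler divides by the population standard deviation, which is what `x.std(axis=0)` computes by default.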
        
### !!!! As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
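
A minimal sketch of this rule, assuming a synthetic dataset and a plain `train_test_split` (both chosen only for illustration): the scaler learns its mean and standard deviation from the training set, and the test set is transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics, no refit

print(scaler.mean_)                # per-column means of X_train
print(X_test_scaled.mean(axis=0))  # close to 0, but not exactly 0
```

The test set is not expected to end up with an exactly zero mean; that small mismatch is normal and is the price of not leaking test information into the scaler.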

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv(r"C:\Users\georg\Desktop\Machine Learning\notebooks_detailed\datasets\housing\housing.csv") 

housing["income_category"] =pd.cut(housing["median_income"],bins=[0,1.5,3.0,4.5,6.,np.inf],labels=[1,2,3,4,5])
split_indices = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split_indices.split(housing,housing["income_category"]): 
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index] 
# standard by now

for set_ in (strat_train_set, strat_test_set):  ## we are dropping the new attribute
    set_.drop("income_category", axis=1, inplace=True)  
    

housing = strat_train_set.drop("median_house_value", axis=1)  # predictors: drop the labels (drop() returns a new copy)
housing_labels = strat_train_set["median_house_value"].copy()  # separate the target column

housing_num = housing.drop("ocean_proximity", axis=1)  # numerical attributes
housing_cat = housing[["ocean_proximity"]]  # categorical attributes


# This is a very condensed form of the previous notebooks; don't worry about the details.

### So let's see how normalization works
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html


In [13]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [23]:
normalization = MinMaxScaler()  # instantiate the scaler
normalized_feature = normalization.fit_transform(housing_num[["total_rooms"]])  # fit on, then transform, the total_rooms column
normalized_feature

array([[0.03973139],
       [0.01711858],
       [0.04949891],
       ...,
       [0.12334029],
       [0.0497024 ],
       [0.07857252]])

### Normalization is a good choice when you know that your data does not follow a Gaussian distribution. It is useful for algorithms that do not assume any particular distribution of the data, such as K-Nearest Neighbors and neural networks.
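
The feature_range hyperparameter mentioned earlier can be sketched as follows. The three values are made up to span roughly the range of total_rooms, and the target range (-1, 1) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values spanning roughly the range of total_rooms
x = np.array([[6.0], [1000.0], [39320.0]])

# Rescale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
x_scaled = scaler.fit_transform(x)

print(x_scaled.min(), x_scaled.max())
```

The smallest value maps to the lower bound and the largest to the upper bound; everything else is placed linearly in between.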

### Standardization now
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [24]:
standardization = StandardScaler()
standard_feature = standardization.fit_transform(housing_num[["population"]])
standard_feature

array([[-0.63621141],
       [-0.99833135],
       [-0.43363936],
       ...,
       [ 0.60790363],
       [-0.05717804],
       [-0.13515931]])

### Standardization can be helpful when the data follows a Gaussian distribution, although this is not a requirement. Also, unlike normalization, standardization does not have a bounding range, so outliers are not squeezed into a fixed interval and they distort the other scaled values much less than they do with min-max scaling.
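
The outlier example from the beginning of the notebook (a median income of 100 entered by mistake) can be reproduced with a small sketch. The incomes are synthetic, evenly spaced values between 0.5 and 15, which is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic "median income" values between 0.5 and 15, plus one erroneous 100
incomes = np.linspace(0.5, 15.0, 1000).reshape(-1, 1)
with_outlier = np.vstack([incomes, [[100.0]]])

mm = MinMaxScaler().fit_transform(with_outlier)
st = StandardScaler().fit_transform(with_outlier)

# Spread of the regular values (everything except the outlier) after scaling
mm_spread = mm[:-1].max() - mm[:-1].min()
st_spread = st[:-1].max() - st[:-1].min()

# Spread of the same values when the outlier is absent
mm_clean = 1.0  # min-max always maps the full range to [0, 1]
st_clean_values = StandardScaler().fit_transform(incomes)
st_clean = st_clean_values.max() - st_clean_values.min()

print(f"min-max spread:      {mm_spread:.3f} (was {mm_clean:.3f})")
print(f"standardized spread: {st_spread:.3f} (was {st_clean:.3f})")
```

With min-max scaling the regular values are squeezed into a small fraction of the 0–1 range, while the standardized values keep most of their original spread. Standardization is still influenced by the outlier (the mean and standard deviation shift), just far less.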

## As you can see, there are many data transformation steps that need to be executed in the right order.
## Scikit-Learn provides the Pipeline class to help with such sequences of transformations.

**Pipeline** is just a utility that helps you sequence different transformations of the original dataset (finding a set of features, generating new features, selecting only some good features, etc.) before applying a final estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

#### !!! We will need some imports and code from the previous notebooks !!!

In [34]:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
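
To see what the two base classes contribute, here is a minimal sketch with a smaller, hypothetical transformer (`RatioAdder` is not part of the notebook or of Scikit-Learn). TransformerMixin supplies `fit_transform()`, and BaseEstimator supplies `get_params()` and `set_params()`, which is why the constructor avoids `*args` and `**kwargs`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column num_ix divided by column den_ix."""
    def __init__(self, num_ix=0, den_ix=1):
        self.num_ix = num_ix
        self.den_ix = den_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]  # append the new column on the right

X = np.array([[10.0, 2.0],
              [9.0, 3.0]])

adder = RatioAdder(num_ix=0, den_ix=1)
X_new = adder.fit_transform(X)  # provided by TransformerMixin
params = adder.get_params()     # provided by BaseEstimator

print(X_new)
print(params)
```

The same pattern applies to CombinedAttributesAdder: its `add_bedrooms_per_room` argument becomes a hyperparameter that can be read and changed through `get_params()` and `set_params()`.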



In [32]:
# The Pipeline constructor takes a list of (name, estimator) pairs defining a sequence of steps.
# The names can be anything you like (as long as they are unique and don't contain double
# underscores "__"); they will come in handy later for hyperparameter tuning.
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_numerical_transformed = numerical_pipeline.fit_transform(housing_num)
housing_numerical_transformed


array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])
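
The step names are what make individual steps addressable. The sketch below uses a shortened, two-step pipeline on a made-up array (the custom attribute adder is left out to keep it self-contained) and shows how a fitted step can be inspected through `named_steps` and how a hyperparameter can be changed with the `<step name>__<parameter>` syntax.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Shortened pipeline for illustration (no custom transformer)
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Made-up data with one missing value in the second column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])
X_t = pipe.fit_transform(X)

# Inspect a fitted step by name: the medians learned by the imputer
medians = pipe.named_steps["imputer"].statistics_.copy()
print(medians)

# Change a hyperparameter of a step using <step name>__<parameter>
pipe.set_params(imputer__strategy="mean")
print(pipe.get_params()["imputer__strategy"])
```

This double-underscore addressing is the reason step names themselves must not contain `__`.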

### We can do this for the categorical data too, but it would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column.
### Scikit-Learn provides the ColumnTransformer for this purpose
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

In [37]:
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)  # list of numerical column names
cat_attribs = ["ocean_proximity"]  # list of categorical column names

# The constructor requires a list of tuples, where each tuple contains a name, a transformer
# and a list of names (or indices) of the columns that the transformer should be applied to.
full_pipeline = ColumnTransformer([
    ("num", numerical_pipeline, num_attribs),  # name: any unique string | transformer: numerical_pipeline defined earlier | columns: num_attribs
    ("cat", OneHotEncoder(), cat_attribs),     # one-hot encode the categorical column
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

!!! An important parameter of ColumnTransformer is **remainder**, which controls what happens to the columns that are *not* listed in any of the transformers. It can be {‘drop’, ‘passthrough’} or an estimator.
<br> drop (the default) - the unlisted columns are dropped from the output
<br> passthrough - the unlisted columns are left untouched and appended to the output
<br> !!! It is also worth mentioning that ColumnTransformer automatically concatenates the outputs of its individual transformers !!!
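
A minimal sketch of the **remainder** parameter, using a made-up three-column DataFrame in which only column `a` is assigned to a transformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data: only column "a" is listed in a transformer
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": [7.0, 8.0, 9.0],
})

# Default remainder="drop": unlisted columns "b" and "c" are discarded
ct_drop = ColumnTransformer([("scale", StandardScaler(), ["a"])])
out_drop = ct_drop.fit_transform(df)

# remainder="passthrough": "b" and "c" are appended unchanged after the transformed column
ct_pass = ColumnTransformer([("scale", StandardScaler(), ["a"])],
                            remainder="passthrough")
out_pass = ct_pass.fit_transform(df)

print(out_drop.shape)
print(out_pass.shape)
```

In the housing example every column is listed in either `num_attribs` or `cat_attribs`, so the default setting drops nothing there.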