# Data Processing with Scikit-Learn

This notebook is based on the process shown in "*Hands-On Machine Learning with Scikit-Learn and Keras*", published by O'Reilly; the code examples can be found in the second chapter of the book.

In [8]:
# This cell imports all the packages used in this pipeline.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import os
import urllib.request  # the request submodule must be imported explicitly
import tarfile
# Scikit-Learn and Keras functions are imported in the cells where they are used.

Recap:  ([reference](https://github.com/BPalhano/Machine-Learning/blob/main/Chapter.2/Basic%20Dataset%20Analysis.ipynb))

In [14]:
# Copying the dataset access method from the previous notebook; see the link in the
# Markdown cell above this one.

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    
def load_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

fetch_housing_data()
data = load_data()

data.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


And let's generate the ``strat_train_set`` again.

In [13]:
# Bucket the median income into categories; everything above category 5 is merged into 5.
data["income_cat"] = np.ceil(data["median_income"] / 1.5)
data["income_cat"] = data["income_cat"].where(data["income_cat"] < 5, 5.0)

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
    
# Use a name that does not shadow the built-in set type.
for subset in (strat_train_set, strat_test_set):
    subset.drop(["income_cat"], axis=1, inplace=True)
    
strat_train_set.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16354.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | -119.575635 | 35.639314 | 28.653404 | 2622.539789 | 534.914639 | 1419.687379 | 497.01181 | 3.875884 | 207005.322372 |
| std | 2.001828 | 2.137963 | 12.574819 | 2138.41708 | 412.665649 | 1115.663036 | 375.696156 | 1.904931 | 115701.29725 |
| min | -124.35 | 32.54 | 1.0 | 6.0 | 2.0 | 3.0 | 2.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.94 | 18.0 | 1443.0 | 295.0 | 784.0 | 279.0 | 2.56695 | 119800.0 |
| 50% | -118.51 | 34.26 | 29.0 | 2119.0 | 433.0 | 1164.0 | 408.0 | 3.54155 | 179500.0 |
| 75% | -118.01 | 37.72 | 37.0 | 3141.0 | 644.0 | 1719.0 | 602.0 | 4.745325 | 263900.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6210.0 | 35682.0 | 5358.0 | 15.0001 | 500001.0 |


Then, let's copy this variable to build our data processing pipeline! When copying the training set, we drop the target column (``median_house_value``) from the features and keep it separately as the labels.

In [19]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

housing_labels.describe()

count     16512.000000
mean     207005.322372
std      115701.297250
min       14999.000000
25%      119800.000000
50%      179500.000000
75%      263900.000000
max      500001.000000
Name: median_house_value, dtype: float64

Let's start the data cleaning. The ``total_bedrooms`` feature has some missing values. We have three approaches to handle this:

1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.)

+ (Igor's comment: We have a fourth approach: synthesize the missing data using generative neural networks to interpolate or extrapolate the dataset!)

We can accomplish these easily using DataFrame's ``dropna()``, ``drop()`` and ``fillna()`` methods:

In [21]:
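housing.dropna(subset=["total_bedrooms"])  # first approach (returns a new DataFrame)

housing.drop("total_bedrooms", axis=1)  # second approach (returns a new DataFrame)

median = housing["total_bedrooms"].median()  # third approach
# Assign the result back instead of relying on a chained inplace fill.
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)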

# the fourth approach deserves a notebook only for it.
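To make the effect of the first three approaches easier to see, here is a minimal sketch on a tiny made-up DataFrame. The `toy` data and the `option1`, `option2`, `option3` names are illustrative assumptions, not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in "total_bedrooms".
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})

# Approach 1: drop the rows (districts) that have a missing value.
option1 = toy.dropna(subset=["total_bedrooms"])

# Approach 2: drop the whole attribute (column).
option2 = toy.drop("total_bedrooms", axis=1)

# Approach 3: fill the gaps with the median of the observed values.
median = toy["total_bedrooms"].median()
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

# Each call returns a new DataFrame; `toy` itself is left unchanged.
print(option1.shape, option2.shape, option3.shape)
```

Note that with the third approach the median should be computed on the training set only, and the same value reused later for the test set and for new data.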


Another approach is to use the ``SimpleImputer`` class from the ``Scikit-Learn`` package:

In [23]:
# Importing the class:
from sklearn.impute import SimpleImputer

# Then, create a SimpleImputer instance, specifying that you want to replace
# each attribute's missing values with the median of that attribute:

imputer = SimpleImputer(strategy="median")


Since the median can only be computed on numerical attributes, you need to create a copy of the data without the text attribute ``ocean_proximity``.

In [24]:
housing_num = housing.drop("ocean_proximity", axis=1)

Now, we can fit the imputer instance to the training data using the fit() method:

In [26]:
imputer.fit(housing_num)

# Let's see the imputer.statistics_:

imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])
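As a sanity check, the statistics learned by a median imputer should match the medians computed directly with pandas. The sketch below shows this on a small made-up DataFrame; `toy_num` and `toy_imputer` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one gap per column.
toy_num = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# Both lines should print the same per-column medians.
print(toy_imputer.statistics_)
print(toy_num.median().values)
```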

Now, we can use this "trained" imputer to transform the training set by replacing missing values with the learned medians:

In [29]:
X = imputer.transform(housing_num)

# This X variable is a plain NumPy array; let's put it back into a pandas DataFrame object:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

housing_tr.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | -119.575635 | 35.639314 | 28.653404 | 2622.539789 | 533.939438 | 1419.687379 | 497.01181 | 3.875884 |
| std | 2.001828 | 2.137963 | 12.574819 | 2138.41708 | 410.80626 | 1115.663036 | 375.696156 | 1.904931 |
| min | -124.35 | 32.54 | 1.0 | 6.0 | 2.0 | 3.0 | 2.0 | 0.4999 |
| 25% | -121.8 | 33.94 | 18.0 | 1443.0 | 296.0 | 784.0 | 279.0 | 2.56695 |
| 50% | -118.51 | 34.26 | 29.0 | 2119.0 | 433.0 | 1164.0 | 408.0 | 3.54155 |
| 75% | -118.01 | 37.72 | 37.0 | 3141.0 | 641.0 | 1719.0 | 602.0 | 4.745325 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6210.0 | 35682.0 | 5358.0 | 15.0001 |


Scikit-Learn's API follows a few main design principles:

* Consistency:
  ***All objects share a consistent and simple interface:***
     + Estimators:
     
         - Any object that can estimate some parameters based on a dataset is called an *estimator* (e.g. an *imputer* is an estimator). The estimation itself is performed by the *fit( )* method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).
         
     + Transformers:
     
         - Some estimators (such as an imputer) can also transform a dataset; these are called *transformers*. Once again, the API is simple: the transformation is performed by the *transform( )* method, which takes the dataset to transform and returns the transformed dataset. The transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called *fit_transform( )* that is equivalent to calling *fit( )* and then *transform( )* (but sometimes *fit_transform( )* is optimized and runs much faster).
         
     + Predictors:
     
         - Some estimators, given a dataset, are capable of making predictions; they are called *predictors*. One example is the **LinearRegression** model of Scikit-Learn. A predictor has a *predict( )* method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a *score( )* method that measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms).

* Inspection:
    - ***All the estimator's hyperparameters are accessible directly via public instance variables (e.g. imputer.strategy), and all the estimator's learned parameters are accessible via public instance variables with an underscore suffix (e.g. imputer.statistics_)***.
    
* Nonproliferation of classes:
    - ***Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers***.

* Composition:
    - ***Existing building blocks are reused as much as possible. For example, it is easy to create a *Pipeline* estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.***
    
* Sensible defaults:
    - ***Scikit-Learn provides reasonable default values for most parameters, making it easy to quickly create a baseline working system.***
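The principles above can be sketched in a few lines. This is only an illustration on made-up arrays: `X_demo`, `y_demo`, `demo_imputer` and `demo_model` are hypothetical names, and the mean strategy is chosen arbitrarily.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature with a missing value, plus labels.
X_demo = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator and transformer: fit() learns, transform() applies.
demo_imputer = SimpleImputer(strategy="mean")  # hyperparameter set in the constructor
demo_imputer.fit(X_demo)
X_filled = demo_imputer.transform(X_demo)

# fit_transform() is the one-call equivalent of fit() followed by transform().
X_filled_again = SimpleImputer(strategy="mean").fit_transform(X_demo)

# Inspection: hyperparameters are plain attributes, learned parameters end with "_".
print(demo_imputer.strategy)
print(demo_imputer.statistics_)

# Predictor: fit() takes the dataset and the labels, then predict() and score().
demo_model = LinearRegression()
demo_model.fit(X_filled, y_demo)
predictions = demo_model.predict(np.array([[5.0]]))
r2 = demo_model.score(X_filled, y_demo)
print(predictions, r2)
```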

Let's take a look at text attributes. In our housing dataset we have only one: ``ocean_proximity``.

In [32]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

| | ocean_proximity |
|---|---|
| 12655 | INLAND |
| 15502 | NEAR OCEAN |
| 2908 | INLAND |
| 14053 | NEAR OCEAN |
| 20496 | <1H OCEAN |
| 1481 | NEAR BAY |
| 18125 | <1H OCEAN |
| 5830 | <1H OCEAN |
| 17989 | <1H OCEAN |
| 4861 | <1H OCEAN |


It's not arbitrary text: there is a limited number of possible values, each of which represents a category. Let's convert these categories from text to numbers using Scikit-Learn's ``OrdinalEncoder``.

In [37]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

# show the first 10 values:
print(type(housing_cat_encoded), '\n\n')
housing_cat_encoded[:10]

<class 'numpy.ndarray'> 




array([[1.],
       [4.],
       [1.],
       [4.],
       [0.],
       [3.],
       [0.],
       [0.],
       [0.],
       [0.]])

The encoder automatically builds the list of categories present in ``housing_cat``, available through its ``categories_`` instance variable:


In [38]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g. for ordered categories such as *bad*, *average*, *good*, and *excellent*), but it is not the case for ``ocean_proximity``, whose categories have no natural order.
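For categories that do have a natural order, the encoder can be told to respect it through its `categories` parameter. The sketch below uses a made-up `ratings` column (not part of the housing data) and compares the explicit ordering with the default, which sorts the categories alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories, listed out of order on purpose.
ratings = pd.DataFrame({"rating": ["good", "bad", "excellent", "average"]})

# Explicit order: bad < average < good < excellent.
ordered_encoder = OrdinalEncoder(
    categories=[["bad", "average", "good", "excellent"]]
)
ordered = ordered_encoder.fit_transform(ratings)

# Default behaviour: categories are sorted alphabetically, which loses the order.
alphabetical = OrdinalEncoder().fit_transform(ratings)

print(ordered.ravel())       # [2. 0. 3. 1.]
print(alphabetical.ravel())  # [3. 1. 2. 0.]
```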

Another way to organize this information is to use the ``OneHotEncoder``, which creates one binary attribute per category: for each district, the attribute matching its category (say "<1H OCEAN") is 1 (hot), and all the others are 0 (cold). These new attributes are sometimes called *dummy attributes*.

In [39]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

# Inspecting the variable:
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

Here we get a SciPy sparse matrix. We can convert it to a NumPy array by calling its ``toarray( )`` method.

In [40]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

Let's see the encoder's ``categories_`` instance variable:


In [41]:
cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
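On a handful of rows it is easier to see how each category maps to one column. The following is a small sketch on made-up values; `toy_cat`, `toy_encoder` and `dense` are illustrative names only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct categories.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

toy_encoder = OneHotEncoder()
toy_1hot = toy_encoder.fit_transform(toy_cat)  # SciPy sparse matrix by default
dense = toy_1hot.toarray()                     # dense NumPy array

# Column i of `dense` corresponds to category i in categories_[0].
print(toy_encoder.categories_[0])
print(dense)
```

Each row contains exactly one 1, in the column of its category.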

### Custom Transformers

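Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. 
If you add *BaseEstimator* as a base class, you get the *get_params( )* and *set_params( )* methods, which are useful for automatic hyperparameter tuning; adding *TransformerMixin* as a base class gives you *fit_transform( )* for free.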


In [54]:
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of the attributes in the NumPy array passed to transform().
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # Append the extra ratio requested by the hyperparameter.
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
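To show what the two base classes contribute, here is a minimal sketch with a deliberately tiny transformer. `RatioAdder` and its `add_ratio` hyperparameter are hypothetical, invented for this illustration only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 / column 1 when enabled."""
    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X

X_small = np.array([[4.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

# get_params() comes from BaseEstimator.
print(adder.get_params())  # {'add_ratio': True}

# fit_transform() comes from TransformerMixin.
with_ratio = adder.fit_transform(X_small)

# set_params() is what tuning tools use to switch hyperparameters.
adder.set_params(add_ratio=False)
without_ratio = adder.fit_transform(X_small)

print(with_ratio.shape, without_ratio.shape)
```

For these inherited methods to work, the constructor should list its hyperparameters explicitly rather than using `*args` or `**kwargs`.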

### Feature Scaling

Machine Learning algorithms generally don't perform well when the input numerical attributes have very different scales. That is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: *min-max scaling* and *standardization*.

#### Min-max scaling (normalization)

Values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called ``MinMaxScaler`` for this. 
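In other words, each value x becomes (x - min) / (max - min). The sketch below applies that formula by hand to a made-up column and compares it with ``MinMaxScaler``; the `rooms` array is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column with a wide range.
rooms = np.array([[6.0], [1000.0], [39320.0]])

# Manual min-max scaling, column by column.
col_min = rooms.min(axis=0)
col_max = rooms.max(axis=0)
manual = (rooms - col_min) / (col_max - col_min)

# Same computation with the Scikit-Learn transformer.
scaled = MinMaxScaler().fit_transform(rooms)

print(manual.ravel())
print(scaled.ravel())
```

``MinMaxScaler`` also has a `feature_range` hyperparameter for cases where a range other than 0 to 1 is wanted.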

#### Standardization
Standardization first subtracts the mean value (so standardized values always have zero mean), and then divides by the standard deviation so that the resulting distribution has unit variance. Scikit-Learn provides a transformer called ``StandardScaler`` for this.


Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g. neural networks often expect an input value ranging from 0 to 1).

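 - Standardization is much less affected by outliers than normalization (min-max scaling): a single extreme value can squeeze all the other normalized values into a small part of the 0 to 1 range.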
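A rough way to look at both scalers side by side is to run them on a small made-up column that contains one extreme value. The `incomes` array below is an assumption chosen for illustration, not data from the housing set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical column: four ordinary values and one outlier (100.0).
incomes = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(incomes)
normalized = MinMaxScaler().fit_transform(incomes)

# Standardized values have zero mean and unit standard deviation,
# but they are not confined to a fixed range.
print(standardized.ravel())

# Min-max scaling maps the outlier to 1 and pushes the rest close to 0.
print(normalized.ravel())
```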

### Transformation Pipelines

Data preparation involves many transformation steps that need to be executed in the right order. Scikit-Learn provides the ``Pipeline`` class to help with such sequences of transformations.

In [55]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

The ``Pipeline`` constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e. they must have a *fit_transform( )* method).
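Because only the intermediate steps have to be transformers, the last step can be a predictor. The sketch below is an illustration on made-up data (`X_toy`, `y_toy` and the step names are assumptions): it chains an imputer, a scaler and a linear regression, then looks up a fitted step by name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: y = 2 * x, with one missing x value.
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # transformer
    ("scaler", StandardScaler()),                   # transformer
    ("regressor", LinearRegression()),              # final estimator
])

# fit() runs fit_transform() on each transformer in order, then fits the last step.
toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)

# Fitted steps remain accessible by the names given in the constructor.
fitted_imputer = toy_pipeline.named_steps["imputer"]
print(fitted_imputer.statistics_)
print(toy_predictions)
```

The pipeline exposes the same methods as its final estimator, which is why `predict()` is available here while `num_pipeline`, which ends with a scaler, exposes `transform()` instead.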

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. 

In [62]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])

housing_prepared = full_pipeline.fit_transform(housing)

Now we have a complete preprocessing pipeline!
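To check what such a combined transformer produces, here is a self-contained sketch on a tiny made-up DataFrame. The column names and the `toy_*` objects are illustrative assumptions; the structure mirrors the numerical pipeline plus one-hot encoding used above, without the custom attribute adder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data: two numerical columns, one categorical.
toy = pd.DataFrame({
    "rooms": [10.0, 20.0, np.nan, 40.0],
    "income": [1.5, 3.0, 4.5, 6.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

toy_prepared = toy_full_pipeline.fit_transform(toy)

# The result may be sparse or dense depending on its density; normalise to dense.
if hasattr(toy_prepared, "toarray"):
    toy_prepared = toy_prepared.toarray()

# 2 scaled numerical columns + 3 one-hot columns = 5 columns.
print(toy_prepared.shape)
```

The output columns follow the order of the transformers in the list: numerical columns first, then one column per category.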