---
# Data Science and Artificial Intelligence Practicum
## Module 5. Machine Learning
---

**Original Notebook -> https://jovian.ai/anvarnarz/05-ml-02-ml-preparation**

## 5.2 - Data Preparation

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

**Data Science methodology (CRISP-DM):**
<img src="https://i.imgur.com/dzZnnYi.png" alt="CRISP-DM" width="800"/>

### Preparing Data for Machine Learning

In [2]:
URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
df = pd.read_csv(URL)
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Split the data into a *train set* and a *test set*.
1. Simple (random, non-stratified) train & test set.

In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

2. Balanced (stratified) train & test set.

In [4]:
df['income_cat'] = pd.cut(df['median_income'], bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=[1,2,3,4,5])

from sklearn.model_selection import StratifiedShuffleSplit

stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional row indices, so select the rows with iloc
for train_index, test_index in stratified_split.split(df, df['income_cat']):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

strat_train_set.drop('income_cat', axis=1, inplace=True)
strat_test_set.drop('income_cat', axis=1, inplace=True)
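
To see what stratification buys us, the sketch below compares the income-category shares of a stratified test set and a purely random one against the full dataset. It uses synthetic, skewed incomes as a stand-in for the housing data (an assumption made so the example runs offline); the names `demo`, `category_shares`, and `comparison` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for median_income (assumption: right-skewed incomes)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
demo["income_cat"] = pd.cut(
    demo["median_income"],
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratified split on the income category
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["income_cat"]))
strat_test = demo.iloc[test_idx]

# Plain random split for comparison
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)


def category_shares(frame):
    """Fraction of rows that fall into each income category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": category_shares(demo),
    "stratified": category_shares(strat_test),
    "random": category_shares(random_test),
})
comparison["strat_error"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_error"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison.round(4))
```

The stratified column should track the overall shares very closely, while the random split is free to drift, especially for the rarer categories.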

In [5]:
# Separate the label column (median_house_value) from the features.
housing = strat_train_set.drop('median_house_value', axis=1)
housing_labels = strat_train_set['median_house_value'].copy()

![DataFrame-label.png](https://i.imgur.com/lO0yL15.png)

### Data Cleaning

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   longitude           20640 non-null  float64 
 1   latitude            20640 non-null  float64 
 2   housing_median_age  20640 non-null  float64 
 3   total_rooms         20640 non-null  float64 
 4   total_bedrooms      20433 non-null  float64 
 5   population          20640 non-null  float64 
 6   households          20640 non-null  float64 
 7   median_income       20640 non-null  float64 
 8   median_house_value  20640 non-null  float64 
 9   ocean_proximity     20640 non-null  object  
 10  income_cat          20640 non-null  category
dtypes: category(1), float64(9), object(1)
memory usage: 1.6+ MB


The `total_bedrooms` column has missing (`NaN`) values: only 20433 of its 20640 entries are non-null.

There are three ways to handle missing values:
1. Drop the observations (rows) with `NaN` values
2. Drop the entire column
3. Fill the `NaN` values (e.g. with `0`, the *mean*, the *median*, etc.)

In [7]:
# Option 1. Drop rows with NaN values
housing.dropna(subset=['total_bedrooms'])

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.8750,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.70,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN
...,...,...,...,...,...,...,...,...,...
15174,-117.07,33.03,14.0,6665.0,1231.0,2026.0,1001.0,5.0900,<1H OCEAN
12661,-121.42,38.51,15.0,7901.0,1422.0,4769.0,1418.0,2.8139,INLAND
19263,-122.72,38.44,48.0,707.0,166.0,458.0,172.0,3.1797,<1H OCEAN
19140,-122.70,38.31,14.0,3155.0,580.0,1208.0,501.0,4.1964,<1H OCEAN


In [8]:
# Option 2. Drop the entire column
housing.drop('total_bedrooms', axis=1)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,667.0,300.0,2.8750,INLAND
14053,-117.13,32.75,24.0,1877.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.70,34.28,27.0,3536.0,1837.0,580.0,4.4964,<1H OCEAN
...,...,...,...,...,...,...,...,...
15174,-117.07,33.03,14.0,6665.0,2026.0,1001.0,5.0900,<1H OCEAN
12661,-121.42,38.51,15.0,7901.0,4769.0,1418.0,2.8139,INLAND
19263,-122.72,38.44,48.0,707.0,458.0,172.0,3.1797,<1H OCEAN
19140,-122.70,38.31,14.0,3155.0,1208.0,501.0,4.1964,<1H OCEAN


In [9]:
# Option 3. Fill NaN values (here with the median)
median = housing['total_bedrooms'].median()
housing['total_bedrooms'].fillna(median)

12655     797.0
15502     855.0
2908      310.0
14053     519.0
20496     646.0
          ...  
15174    1231.0
12661    1422.0
19263     166.0
19140     580.0
19773     222.0
Name: total_bedrooms, Length: 16512, dtype: float64

Rather than repeating this for each column by hand, it is convenient to automate the step with a tool that can fill `NaN` values in any column.

scikit-learn provides the `SimpleImputer` class for this.

**`sklearn.impute.SimpleImputer`** - Univariate imputer for completing missing values with simple strategies. Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

In [10]:
# import SimpleImputer class
from sklearn.impute import SimpleImputer
# create a new object from SimpleImputer class and define strategy for filling NaN values with median values
imputer = SimpleImputer(strategy='median')

In [11]:
imputer.fit(housing)

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'INLAND'

**To avoid an error, we need to separate the numerical columns:**

In [12]:
housing_num = housing.drop('ocean_proximity', axis=1)
imputer.fit(housing_num)

The calculated median values are stored in the `statistics_` attribute:

In [13]:
imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

We use the `transform()` method to replace the `NaN` values in the dataset with the learned medians:

In [14]:
X = imputer.transform(housing_num)

In [15]:
type(X)

numpy.ndarray

Since `transform()` returns a NumPy array, we convert it back to a DataFrame, keeping the original column names and index:

In [16]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
housing_tr.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964
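
One reason to prefer a fitted imputer over a one-off `fillna` is that the medians learned from the training rows can be reused on other data, such as the test set. The following is a minimal sketch of that idea on a tiny hand-made frame; `train_num`, `test_num`, and `test_filled` are hypothetical names, and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative frames (not the real housing data)
train_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})
test_num = pd.DataFrame({
    "total_rooms": [1627.0, 919.0],
    "total_bedrooms": [np.nan, 213.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_num)  # medians are learned from the training rows only

# Reuse the training medians on the test rows
test_filled = pd.DataFrame(
    imputer.transform(test_num),
    columns=test_num.columns,
    index=test_num.index,
)
print(imputer.statistics_)
print(test_filled)
```

Here the missing `total_bedrooms` value in `test_num` is filled with the training median, so no information from the test rows leaks into the statistic.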


### Transforming text columns in dataset

The `ocean_proximity` column in the dataset is a text column.

In [17]:
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

Unnamed: 0,ocean_proximity
12655,INLAND
15502,NEAR OCEAN
2908,INLAND
14053,NEAR OCEAN
20496,<1H OCEAN
1481,NEAR BAY
18125,<1H OCEAN
5830,<1H OCEAN
17989,<1H OCEAN
4861,<1H OCEAN


Most machine learning algorithms work only with numeric values, so we need to convert this column to numbers as well.

There are two ways to do this:
1. Replace each text value with a number.

For this we use `OrdinalEncoder` from *scikit-learn*.

**`sklearn.preprocessing.OrdinalEncoder`** - Encode categorical features as an integer array.

In [18]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[1.],
       [4.],
       [1.],
       [4.],
       [0.],
       [3.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [19]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

2. One-hot encoding with `OneHotEncoder`: each unique value becomes a separate column, with 1 placed in the column that matches the row's value and 0 in the rest.

**`sklearn.preprocessing.OneHotEncoder`** - Encode categorical features as a one-hot numeric array.

In [20]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
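
A fitted `OneHotEncoder` remembers the categories it saw, which matters when new data contains a value that was absent during fitting. The sketch below, on a small hand-made frame, uses the `handle_unknown="ignore"` option so that an unseen category is encoded as an all-zero row instead of raising an error; `train_cat` and `new_cat` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR OCEAN", "<1H OCEAN", "NEAR BAY"]}
)
# "ISLAND" does not occur in train_cat
new_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)

# transform() returns a sparse matrix; toarray() makes it dense for display
encoded_new = encoder.transform(new_cat).toarray()
print(encoder.categories_[0])
print(encoded_new)
```

The first row has a single 1 in the `INLAND` column, while the `ISLAND` row is all zeros because that category was never seen during `fit`.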

### Transformer

The main objects in scikit-learn are (one class can implement multiple interfaces):
- **Estimator:** The base object, implements a fit method to learn from data;
- **Predictor:** For supervised learning, or some unsupervised problems;
- **Transformer:** For filtering or modifying the data, in a supervised or unsupervised way;
- **Model:** A model that can give a goodness of fit measure or a likelihood of unseen data.

© [Different objects in scikit-learn](https://scikit-learn.org/stable/developers/develop.html#different-objects)

Let's implement a transformer ourselves: one that automatically adds the columns `rooms_per_household` and `population_per_household` to the given dataset, plus an optional `bedrooms_per_room` column.

For this, we create a new class inheriting from the `BaseEstimator` and `TransformerMixin` classes in sklearn and add the `fit()` and `transform()` methods to our class:

In [29]:
from sklearn.base import BaseEstimator, TransformerMixin

# indices of columns that we need
rooms_idx, bedrooms_idx, population_idx, households_idx = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self  # nothing to learn: this transformer is stateless
    
    def transform(self, X):
        rooms_per_household = X[:, rooms_idx] / X[:, households_idx]
        population_per_household = X[:, population_idx] / X[:, households_idx]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_idx] / X[:, rooms_idx]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

In [30]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs[0, :]

array([-121.46, 38.52, 29.0, 3873.0, 797.0, 2237.0, 706.0, 2.1736,
       'INLAND', 5.485835694050992, 3.168555240793201,
       0.20578363026077975], dtype=object)

In [31]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs[0, :]

array([-121.46, 38.52, 29.0, 3873.0, 797.0, 2237.0, 706.0, 2.1736,
       'INLAND', 5.485835694050992, 3.168555240793201], dtype=object)
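
The outputs above have `dtype=object` because `housing.values` still contains the text column. As a minimal sketch, the same idea can be applied to a purely numeric array, which also shows what the two base classes contribute: `TransformerMixin` supplies `fit_transform()` and `BaseEstimator` supplies `get_params()`. `RatioAdder` is a hypothetical, condensed stand-in for `CombinedAttributesAdder`, and `X_num` holds the numeric values of the first two rows shown earlier.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions, matching the numeric column order used in the notebook
ROOMS_IDX, BEDROOMS_IDX, POPULATION_IDX, HOUSEHOLDS_IDX = 3, 4, 5, 6


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical condensed variant of CombinedAttributesAdder."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        extra = [
            X[:, ROOMS_IDX] / X[:, HOUSEHOLDS_IDX],       # rooms_per_household
            X[:, POPULATION_IDX] / X[:, HOUSEHOLDS_IDX],  # population_per_household
        ]
        if self.add_bedrooms_per_room:
            extra.append(X[:, BEDROOMS_IDX] / X[:, ROOMS_IDX])  # bedrooms_per_room
        return np.column_stack([X, *extra])


# Numeric columns only (first two rows of the training set shown above)
X_num = np.array([
    [-121.46, 38.52, 29.0, 3873.0, 797.0, 2237.0, 706.0, 2.1736],
    [-117.23, 33.09, 7.0, 5320.0, 855.0, 2015.0, 768.0, 6.3373],
])

adder = RatioAdder(add_bedrooms_per_room=True)
X_extra = adder.fit_transform(X_num)  # fit_transform comes from TransformerMixin
print(adder.get_params())             # get_params comes from BaseEstimator
print(X_extra.dtype, X_extra.shape)
print(X_extra[0, -3:])
```

With numeric input the result stays a float array, which is what the scalers in the next section expect.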

### Standardization and Normalization

#### Min-max scaling

![Min–max normalization.png](https://arshpreetsingh.files.wordpress.com/2017/03/normal.png)

**`sklearn.preprocessing.MinMaxScaler`** - Transform features by scaling each feature to a given range.

In [35]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
min_max_scaler.fit_transform(housing_num)

array([[0.28784861, 0.63549416, 0.54901961, ..., 0.06261386, 0.13144137,
        0.11542599],
       [0.70916335, 0.05844846, 0.11764706, ..., 0.05639172, 0.14301718,
        0.40257376],
       [0.52888446, 0.30074389, 0.84313725, ..., 0.01861039, 0.05563854,
        0.16379774],
       ...,
       [0.1623506 , 0.62699256, 0.92156863, ..., 0.0127526 , 0.0317401 ,
        0.18481124],
       [0.16434263, 0.61317747, 0.25490196, ..., 0.03377337, 0.09316654,
        0.25492752],
       [0.22011952, 0.78958555, 0.50980392, ..., 0.01743322, 0.03640777,
        0.18151474]])
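
As a quick check of what `MinMaxScaler` computes with its default range of 0 to 1, the sketch below applies the min-max formula, `(x - min) / (max - min)` per column, by hand with NumPy and compares the result. The small array `X_demo` is illustrative, not the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two illustrative columns (housing_median_age, median_income)
X_demo = np.array([
    [29.0, 2.1736],
    [7.0, 6.3373],
    [44.0, 2.8750],
    [24.0, 2.2264],
])

# Min-max formula applied column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
minmax_manual = (X_demo - col_min) / (col_max - col_min)

minmax_sklearn = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(minmax_manual, minmax_sklearn))
print(minmax_sklearn)
```

Each column now runs from exactly 0 (its minimum) to exactly 1 (its maximum).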

#### Standard Scaler

![StandardScaler.png](https://cdn-images-1.medium.com/max/370/1*Nlgc_wq2b-VfdawWX9MLWA.png)

**`sklearn.preprocessing.StandardScaler`** - Standardize features by removing the mean and scaling to unit variance.

In [36]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit_transform(housing_num)

array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.73260236,
         0.55628602, -0.8936472 ],
       [ 1.17178212, -1.19243966, -1.72201763, ...,  0.53361152,
         0.72131799,  1.292168  ],
       [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.67467519,
        -0.52440722, -0.52543365],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.86201341,
        -0.86511838, -0.36547546],
       [-1.56080303,  1.2492109 , -1.1653327 , ..., -0.18974707,
         0.01061579,  0.16826095],
       [-1.28105026,  2.02567448, -0.13148926, ..., -0.71232211,
        -0.79857323, -0.390569  ]])
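
Similarly, standardization can be reproduced by hand: subtract each column's mean and divide by its standard deviation. The sketch below uses the population standard deviation (`ddof=0`, NumPy's default), which is what `StandardScaler` uses; `X_demo` is again a small illustrative array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative columns (housing_median_age, median_income)
X_demo = np.array([
    [29.0, 2.1736],
    [7.0, 6.3373],
    [44.0, 2.8750],
    [24.0, 2.2264],
])

# z = (x - mean) / std, computed per column with the population std (ddof=0)
std_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

std_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(std_manual, std_sklearn))
print(std_sklearn.mean(axis=0), std_sklearn.std(axis=0))
```

Unlike min-max scaling, the result is not bounded to a fixed range; each column simply ends up with mean 0 and unit variance.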

In [38]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN


**`pandas.get_dummies`** - Convert categorical variable into dummy/indicator variables.

In [42]:
housing_onehot = pd.get_dummies(housing['ocean_proximity'])
housing_onehot

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,0,1,0,0,0
15502,0,0,0,0,1
2908,0,1,0,0,0
14053,0,0,0,0,1
20496,1,0,0,0,0
...,...,...,...,...,...
15174,1,0,0,0,0
12661,0,1,0,0,0
19263,1,0,0,0,0
19140,1,0,0,0,0
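
One caveat with `pd.get_dummies` is that it derives the columns from whatever values are present, so a new batch of data with fewer categories produces fewer columns. The sketch below shows one possible way to realign such a batch with `reindex`; `train_prox` and `new_prox` are hypothetical names. Passing `dtype=int` keeps the 0/1 output shown above, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

train_prox = pd.Series(
    ["INLAND", "NEAR OCEAN", "<1H OCEAN", "NEAR BAY"], name="ocean_proximity"
)
# A new batch that happens to contain only two of the categories
new_prox = pd.Series(["INLAND", "<1H OCEAN"], name="ocean_proximity")

train_dummies = pd.get_dummies(train_prox, dtype=int)
new_dummies = pd.get_dummies(new_prox, dtype=int)
print(list(new_dummies.columns))  # only the categories present in new_prox

# Align the new batch to the training columns, filling absent ones with 0
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(new_aligned)
```

A fitted `OneHotEncoder` avoids this manual alignment because it stores the category list in `categories_`.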


In [43]:
housing_num.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964
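
With the numeric columns and the one-hot columns prepared separately, a plausible next step (not shown in the notebook) is to join them into a single feature table. The sketch below does this with `pd.concat` on a tiny hand-made frame; `demo`, `scaled_num`, `onehot`, and `prepared` are hypothetical names, and the choice of `StandardScaler` here is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame with the notebook's index labels
demo = pd.DataFrame(
    {
        "total_rooms": [3873.0, 5320.0, 1618.0],
        "median_income": [2.1736, 6.3373, 2.8750],
        "ocean_proximity": ["INLAND", "NEAR OCEAN", "INLAND"],
    },
    index=[12655, 15502, 2908],
)
num_cols = ["total_rooms", "median_income"]

# Scale the numeric columns, keeping column names and the row index
scaled_num = pd.DataFrame(
    StandardScaler().fit_transform(demo[num_cols]),
    columns=num_cols,
    index=demo.index,
)

# One-hot encode the text column (dtype=int keeps 0/1 values)
onehot = pd.get_dummies(demo["ocean_proximity"], dtype=int)

# Join side by side; rows are matched on the shared index
prepared = pd.concat([scaled_num, onehot], axis=1)
print(prepared)
```

Keeping the original index on both pieces is what lets `pd.concat` line the rows up correctly.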
