<a href="https://colab.research.google.com/github/DulanDias/HandsOnMachineLearning/blob/master/Practical_3_%5BLab_Sheet%5D_Cleaning_up_the_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CSC 319 1.5 Machine Learning 1**
## **Practical 3 - Cleaning up the Data**

In this section, we will continue to work with the California Housing Prices Dataset. The following are the main steps we will go through in this section:
1. Data Cleaning
2. Handling Text and Categorical Attributes
3. Feature Scaling
4. Transformation Pipelines


From the previous practical:

In [2]:
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings

import pandas as pd
housing = pd.read_csv('https://raw.githubusercontent.com/gurupratap-matharu/machine-learning-regression/master/dataset/housing.csv')

def split_train_test(data, test_ratio):
  shuffled_indices = np.random.permutation(len(data)) 
  test_set_size = int(len(data) * test_ratio) 
  test_indices = shuffled_indices[:test_set_size] 
  train_indices = shuffled_indices[test_set_size:]
  return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)

train_set.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16341.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | -119.58177 | 35.644064 | 28.601017 | 2639.296754 | 538.76458 | 1424.323825 | 500.289184 | 3.874488 | 207332.326853 |
| std | 2.006344 | 2.141206 | 12.601003 | 2182.258031 | 422.925332 | 1110.586108 | 382.783614 | 1.900277 | 115886.889856 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.81 | 33.93 | 18.0 | 1448.0 | 296.0 | 786.0 | 280.0 | 2.565075 | 119400.0 |
| 50% | -118.515 | 34.26 | 29.0 | 2137.0 | 436.0 | 1169.0 | 410.0 | 3.5313 | 179750.0 |
| 75% | -118.01 | 37.72 | 37.0 | 3146.0 | 647.0 | 1724.0 | 606.0 | 4.7639 | 265900.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 28566.0 | 6082.0 | 15.0001 | 500001.0 |


### **1. Data Cleaning**

Most Machine Learning algorithms cannot work with missing features, so let’s create a few functions to take care of them. 

We noticed earlier that the total_bedrooms attribute has some missing values, so let's fix this. To do this, we have three options:
1. Get rid of the corresponding districts
2. Get rid of the whole attribute
3. Set the missing values to some value (zero, the mean, the median, etc.)

We can accomplish these easily using Pandas DataFrame’s dropna(), drop(), and fillna() methods:

In [0]:
# option 1 
# YOUR CODE COMES HERE

In [0]:
# option 2
# YOUR CODE COMES HERE

In [0]:
# option 3 
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE
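
As a point of comparison for your own answers, the three options above can be sketched on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only; in the practical you would apply the same calls to `train_set`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the districts (rows) where total_bedrooms is missing.
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole total_bedrooms attribute (column).
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of the column.
median = toy["total_bedrooms"].median()
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

print(option3)
```

If you choose option 3, keep the computed median: the same value is needed later to fill missing values in the test set and in new data.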

Alternatively, Scikit-Learn provides a useful class for handling missing values: SimpleImputer.

In [0]:
# YOUR CODE COMES HERE
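
A minimal sketch of creating the imputer, assuming we want to replace missing values with the median of each attribute (the variable name `imputer` is our own choice):

```python
from sklearn.impute import SimpleImputer

# Create an imputer that replaces missing values with the column median.
imputer = SimpleImputer(strategy="median")
print(imputer.strategy)
```

Other strategies accepted by SimpleImputer include "mean", "most_frequent" and "constant".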

Since the median can only be computed on numerical attributes, we need to create a copy of the dataset without the categorical attribute ocean_proximity.

In [0]:
# YOUR CODE COMES HERE
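
One possible way to build the numerical-only copy is sketched below on a made-up frame. The names `toy`, `toy_num` and `toy_num_alt` are hypothetical; the two approaches shown should give the same columns here.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one text attribute (values are made up for illustration).
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "median_income": [8.3, 8.3, 7.2],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Approach A: drop the text attribute by name (returns a new DataFrame).
toy_num = toy.drop("ocean_proximity", axis=1)

# Approach B: keep only the numerical columns, whatever their names are.
toy_num_alt = toy.select_dtypes(include="number")

print(list(toy_num.columns))
```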

Now we can fit the imputer instance to the training data using its fit() method.

In [0]:
# YOUR CODE COMES HERE

The imputer simply computes the median of each numerical attribute and stores the results in its statistics_ instance variable.

Even though only total_bedrooms had missing values, we cannot be sure that this will still be the case in a production scenario. Therefore, it is safer to fit the imputer to all numerical attributes.

In [0]:
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE
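
The following sketch fits a median imputer on a made-up numerical frame and checks that statistics_ matches the medians computed by Pandas. The frame `toy_num` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value (made up for illustration).
toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)  # learns one median per column, ignoring NaN

learned = imputer.statistics_        # medians stored by the imputer
expected = toy_num.median().values   # medians computed directly by Pandas

print(learned)
print(expected)
```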

Now we can use this imputer, which has learned the median of each numerical attribute, to transform the training set by replacing the missing values with the learned medians.

In [0]:
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE
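
As a rough sketch, transform() returns a plain NumPy array by default, so it is common to wrap the result back into a DataFrame. The names `toy_num`, `X` and `toy_tr` are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value (made up for illustration).
toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)

# Replace missing values with the learned medians (NumPy array by default).
X = imputer.transform(toy_num)

# Put the result back into a DataFrame with the original labels.
toy_tr = pd.DataFrame(X, columns=toy_num.columns, index=toy_num.index)

print(toy_tr)
```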

### **2. Handling Text and Categorical Attributes**

Earlier we left out the categorical attribute ocean_proximity because it is a text attribute, so we cannot compute its median.

In [0]:
# YOUR CODE COMES HERE
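
A quick way to inspect a text attribute is sketched below on made-up data. Selecting with double brackets keeps a two-dimensional DataFrame, which is the shape Scikit-Learn encoders expect; the names `toy` and `toy_cat` are our own.

```python
import pandas as pd

# Toy stand-in for the ocean_proximity column (values made up for illustration).
toy = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN",
                        "INLAND", "<1H OCEAN", "INLAND"],
})

# Double brackets return a DataFrame (2D) rather than a Series (1D).
toy_cat = toy[["ocean_proximity"]]

# Count how many districts fall in each category.
counts = toy_cat["ocean_proximity"].value_counts()

print(counts)
```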

#### **Ordinal Encoding**

Most Machine Learning algorithms prefer to work with numbers. Therefore, let's convert the text categories to numbers.

In [0]:
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE
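
A minimal sketch of ordinal encoding with Scikit-Learn's OrdinalEncoder, using a made-up column (the names `toy_cat` and `toy_cat_encoded` are hypothetical). The encoder sorts the categories and assigns each one an integer code, stored as a float.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column (values made up for illustration).
toy_cat = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN",
                        "INLAND", "<1H OCEAN", "INLAND"],
})

ordinal_encoder = OrdinalEncoder()
toy_cat_encoded = ordinal_encoder.fit_transform(toy_cat)

# One code per row, and the learned list of categories per attribute.
print(toy_cat_encoded.ravel())
print(ordinal_encoder.categories_)
```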

One issue with this type of encoding is that Machine Learning algorithms will assume that two nearby values are more similar than two distant values.

Categorical data is of two types. Categorical data with an intrinsic ordering among its values is called **Ordinal type**. Categorical data without any intrinsic ordering among its values is called **Nominal type**.

Some examples of Ordinal Categorical data are:
*   Low, Medium, High.
*   Agree, Neutral, Disagree.
*   Unhappy, Happy, Very Happy.
*   Young, Old.

Some examples of Nominal Categorical data are:
*   Colombo, New York, New Delhi, New Jersey, England.
*   Pen, Pencil, Eraser.
*   Lion, Monkey, Zebra, Peacock, Elephant.

Therefore, ocean_proximity can be identified as Nominal categorical data, and hence Ordinal Encoding is not suitable for our purpose.


#### **One Hot Encoding**

To fix this issue, a common solution is to create one binary attribute per category: 
one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called **one-hot encoding**, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).

Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors.

In [0]:
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE

In [0]:
# YOUR CODE COMES HERE
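
The sketch below applies OneHotEncoder to a made-up column (the names `toy_cat`, `toy_1hot` and `dense` are our own). By default the encoder returns a SciPy sparse matrix, which stores only the non-zero entries; toarray() converts it to a dense NumPy array for inspection.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (values made up for illustration).
toy_cat = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN", "INLAND"],
})

cat_encoder = OneHotEncoder()
toy_1hot = cat_encoder.fit_transform(toy_cat)  # sparse matrix by default

# Convert to a dense array: one column per category, one 1 per row.
dense = toy_1hot.toarray()

print(cat_encoder.categories_)
print(dense)
```

The sparse representation matters when there are thousands of categories, since a dense array would be filled almost entirely with zeros.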

##### **Embedding**

If a categorical attribute has a large number of categories, then one-hot encoding will result in a very large number of features.

In this case, we might want to consider replacing the categorical attribute with a related and useful numerical attribute. For example, we can replace ocean_proximity with a numerical attribute such as the distance to the ocean.

Alternatively, instead of replacing the whole attribute, you can replace each category with a learnable low-dimensional vector called an **embedding**.

### **3. Feature Scaling**

Machine Learning algorithms generally do not perform well when the input numerical attributes have very different scales.

For example, in the California Housing Prices Dataset, we observed that the total number of rooms ranges from 2 to 39,320, while the median incomes only range from about 0.5 to 15.

**Note:** Scaling the target value is not required.

There are two common methods to get all numerical attributes within the same scale, that is, **min-max scaling** (a.k.a. normalization) and **standardization**.
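
As a hedged illustration of the two methods, the sketch below applies Scikit-Learn's MinMaxScaler and StandardScaler to a small made-up array and compares the results with the formulas computed by hand. The array `X` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two attributes on very different scales (made up for illustration).
X = np.array([[2.0, 0.5],
              [1000.0, 3.5],
              [39320.0, 15.0]])

# Min-max scaling: (x - min) / (max - min), so each column ends up in [0, 1].
minmax = MinMaxScaler().fit_transform(X)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, so each column has mean 0 and unit variance.
standard = StandardScaler().fit_transform(X)
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(minmax)
print(standard)
```

As with the imputer, the scalers should be fitted on the training set only and then used to transform the test set.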