### Goal of the Script
Feature Engineering: Feature Encoding (converting categorical features into numerical features)

### Four approaches were covered in the lecture:
1. Ordinal Encoding
2. One-hot Encoding
3. Label Encoding
4. Feature Hashing

### Environment Setup for performing Feature Encoding on Categorical Features

In [112]:
# Importing the libraries
import pandas as pd

# Libraries to train the model
from sklearn import tree
from sklearn.model_selection import train_test_split

# Library for Evaluating the model
from sklearn.metrics import mean_absolute_error

In [113]:
# Importing the dataset: Dataset is the Melbourne House Pricing Dataset
# It is a Regression problem where the goal is to predict the house price based on features of the house
data = pd.read_csv('melbourne_data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [114]:
# Preparing the dataset to solve the Regression Problem

# Prepare the Dependent feature, i.e., Price of the house as the Dependent feature
y = data.Price

# All features other than Price are treated as Independent features

# Note:
# Both numerical and categorical independent features are kept here;
# the categorical ones are encoded later in the notebook
X = data.drop(['Price'], axis=1)

In [115]:
# Get the information about the features present in the dataset
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18396 entries, 0 to 18395
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     18396 non-null  int64  
 1   Suburb         18396 non-null  object 
 2   Address        18396 non-null  object 
 3   Rooms          18396 non-null  int64  
 4   Type           18396 non-null  object 
 5   Method         18396 non-null  object 
 6   SellerG        18396 non-null  object 
 7   Date           18396 non-null  object 
 8   Distance       18395 non-null  float64
 9   Postcode       18395 non-null  float64
 10  Bedroom2       14927 non-null  float64
 11  Bathroom       14925 non-null  float64
 12  Car            14820 non-null  float64
 13  Landsize       13603 non-null  float64
 14  BuildingArea   7762 non-null   float64
 15  YearBuilt      8958 non-null   float64
 16  CouncilArea    12233 non-null  object 
 17  Lattitude      15064 non-null  float64
 18  Longti

We can observe that the "Date" column is stored with the object dtype rather than a datetime dtype.

In [116]:
# The "Date" column has the object dtype, so convert it to a datetime dtype
# Assumption: dates are written day/month/year (e.g. 3/12/2016), hence dayfirst=True
X['Date'] = pd.to_datetime(X['Date'], dayfirst=True)
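
A date such as 3/12/2016 is ambiguous, so it helps to check how pandas reads it. The sketch below uses three date strings copied from the rows shown earlier and assumes they are written day/month/year; the variable names are illustrative only.

```python
import pandas as pd

# Date strings copied from the first rows of the dataset preview
# (assumed here to be day/month/year)
dates = pd.Series(['3/12/2016', '4/02/2016', '4/03/2017'])

# dayfirst=True tells pandas to read the first number as the day
parsed = pd.to_datetime(dates, dayfirst=True)

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

If the months printed do not match what you expect from the source data, the day/month assumption is wrong and `dayfirst` should be dropped.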

### Handle the missing values for Numerical and Categorical Features

1. Handle Missing Values for Categorical features

In [118]:
# Get list of categorical features
categorical_features = (X.dtypes == 'object')
object_cols = list(categorical_features[categorical_features].index)

print("Categorical features:")
print(object_cols)

Categorical features:
['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'CouncilArea', 'Regionname']


In [119]:
# Create a dataframe for the categorical features
categorical_dataset = X[object_cols]

In [120]:
# Get categorical features with missing values
cols_with_missing = [col for col in categorical_dataset.columns
                     if categorical_dataset[col].isnull().any()]

cols_with_missing

['CouncilArea', 'Regionname']

Two features, "CouncilArea" and "Regionname", have missing values. Therefore, we apply an imputation approach to fill in the missing values.

In [121]:
# Imputation approach on the independent features, i.e., replacing every missing value
# in a feature with that feature's most frequent value
import numpy as np
from sklearn.impute import SimpleImputer

univariate_imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
imputed_categorical_features = pd.DataFrame(univariate_imputer.fit_transform(categorical_dataset))
imputed_categorical_features.columns = categorical_dataset.columns
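
To see what the `most_frequent` strategy does, here is a minimal sketch on a hypothetical four-row frame (the values are made up for illustration and are not taken from the dataset). It also passes `columns` and `index` when rebuilding the DataFrame, which is one way to keep the original labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'CouncilArea': ['Yarra', 'Yarra', np.nan, 'Moreland'],
    'Regionname': ['Northern Metropolitan', np.nan,
                   'Northern Metropolitan', 'Western Metropolitan'],
})

imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)

# fit_transform returns a plain array, so restore column names and index
toy_imputed = pd.DataFrame(imputer.fit_transform(toy),
                           columns=toy.columns, index=toy.index)

print(toy_imputed)
```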

2. Handle Missing Values for Numerical Features using Extended Imputation

In [122]:
# Get list of numerical features (every non-object column, which also includes the datetime "Date" column)
numeric_features = (X.dtypes != 'object')
numeric_cols = list(numeric_features[numeric_features].index)

print("Numeric features:")
print(numeric_cols)

Numeric features:
['Unnamed: 0', 'Rooms', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude', 'Propertycount']


In [123]:
# Create a dataframe for the numeric features
numeric_dataset = X[numeric_cols]

In [124]:
# Get numerical features with missing values
cols_with_missing = [col for col in numeric_dataset.columns
                     if numeric_dataset[col].isnull().any()]

cols_with_missing

['Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [125]:
# Make new columns indicating what will be imputed
# Work on an explicit copy so the assignment does not raise pandas' SettingWithCopyWarning
numeric_dataset = numeric_dataset.copy()
for col in cols_with_missing:
    numeric_dataset[col + '_was_missing'] = numeric_dataset[col].isnull()


In [126]:
# Drop the datetime "Date" column: mean imputation only applies to numeric data
numeric_dataset = numeric_dataset.drop(['Date'], axis = 1)

In [127]:
# Imputation approach on the independent features, i.e., replacing every missing value
# in a feature with that feature's mean
import numpy as np
from sklearn.impute import SimpleImputer

univariate_imputer_mean = SimpleImputer(strategy='mean', missing_values=np.nan)
imputed_numerical_features = pd.DataFrame(univariate_imputer_mean.fit_transform(numeric_dataset))
imputed_numerical_features.columns = numeric_dataset.columns
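
As an alternative sketch of extended imputation, `SimpleImputer` can append the missing-value indicators itself through its `add_indicator` parameter. The toy frame and the `_was_missing` naming below are assumptions chosen to mirror the cells above, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: two columns with gaps, one complete column
toy = pd.DataFrame({
    'Car': [1.0, np.nan, 2.0, 0.0],
    'Landsize': [202.0, 156.0, np.nan, 94.0],
    'Rooms': [2.0, 2.0, 3.0, 3.0],
})

imputer = SimpleImputer(strategy='mean', add_indicator=True)
values = imputer.fit_transform(toy)

# Indicator columns are appended only for features that had missing values
missing_cols = toy.columns[imputer.indicator_.features_]
columns = list(toy.columns) + [c + '_was_missing' for c in missing_cols]
toy_imputed = pd.DataFrame(values, columns=columns, index=toy.index)

print(toy_imputed)
```

The indicator columns come back as 0.0/1.0 floats because the imputer returns a single numeric array.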

In [128]:
X_imputed = pd.concat([imputed_categorical_features, imputed_numerical_features], axis=1, ignore_index=False)

In [129]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### Label Encoding


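LabelEncoder encodes each distinct value of a feature as an integer between 0 and n_classes-1, where n_classes is the number of distinct values present in the feature. If a value repeats, it receives the same integer as before. Note that scikit-learn documents LabelEncoder as a transformer for target values (y); it is applied to the input features here for illustration.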
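
A minimal sketch of this mapping on a short, hypothetical list of sale-method codes. The integers assigned here depend only on the values in this toy list, so they will differ from those in the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of sale-method codes with one repeated value
methods = ['S', 'S', 'SP', 'PI', 'VB', 'S']

le = LabelEncoder()
codes = le.fit_transform(methods)

print(list(le.classes_))   # distinct values, in sorted order
print(codes.tolist())      # integer assigned to each entry

# The mapping is reversible
print(le.inverse_transform(codes).tolist())
```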



In [130]:
# Import Libraries
import pandas as pd

# Library to perform Label Encoding
from sklearn.preprocessing import LabelEncoder

In [131]:
# Get list of categorical features
categorical_features = (X_imputed.dtypes == 'object')
object_cols = list(categorical_features[categorical_features].index)

print("Categorical features:")
print(object_cols)

Categorical features:
['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'CouncilArea', 'Regionname']


In [132]:
# The dataset contains seven categorical features,
# so we perform Label Encoding on each of them
le = LabelEncoder()
for i in object_cols:
    X_imputed[i] = le.fit_transform(X_imputed[i])

X_imputed

Unnamed: 0.1,Suburb,Address,Type,Method,SellerG,CouncilArea,Regionname,Unnamed: 0,Rooms,Distance,...,Postcode_was_missing,Bedroom2_was_missing,Bathroom_was_missing,Car_was_missing,Landsize_was_missing,BuildingArea_was_missing,YearBuilt_was_missing,Lattitude_was_missing,Longtitude_was_missing,Propertycount_was_missing
0,0,17365,0,1,29,31,2,1.0,2.0,2.5,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,0,8233,0,1,29,31,2,2.0,2.0,2.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,13470,0,3,29,31,2,4.0,3.0,2.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,12439,0,0,29,31,2,5.0,3.0,2.5,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0,14496,0,4,178,31,2,6.0,4.0,2.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18391,320,17042,1,3,100,23,6,23540.0,2.0,6.8,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
18392,320,18046,0,0,254,23,6,23541.0,4.0,6.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18393,326,4665,0,1,38,23,2,23544.0,4.0,12.7,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0
18394,328,14735,0,3,273,23,6,23545.0,4.0,6.3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can observe that the categorical columns are label encoded: each category is replaced by an integer index.

Depending on the data, label encoding introduces a new problem. Suppose, for example, that we have encoded a set of country names as integers. The underlying data is still categorical, and there is no inherent order among the categories.
Because the column now holds different numbers, the model may interpret the values as being ordered, 0 < 1 < 2.
It may then derive a spurious relationship, such as the population increasing with the country number, which need not hold in other data or in the prediction set. To overcome this problem, we use One-Hot Encoding.

In [133]:
X_train_label, X_valid_label, y_train_label, y_valid_label = train_test_split(X_imputed, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

In [134]:
print("MAE after Label Encoding:")
print(score_dataset(X_train_label, X_valid_label, y_train_label, y_valid_label))

MAE after Label Encoding:
179618.08135054348


### Ordinal Encoding


An ordinal encoding maps each unique label to an integer value. This type of encoding is really only appropriate when there is a known ordering among the categories. Where such an ordering exists for a variable in the dataset, it should ideally be harnessed when preparing the data. By default, OrdinalEncoder assigns integers in the sorted order of the category values; an explicit order can be supplied through its categories parameter.
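
When a feature does have a known order, that order can be handed to the encoder. The sketch below uses a hypothetical `Condition` feature (not a column of the Melbourne dataset) and the `categories` parameter of `OrdinalEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a natural order: low < medium < high
toy = pd.DataFrame({'Condition': ['medium', 'low', 'high', 'low']})

# One list of ordered categories per encoded column
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
toy['Condition_encoded'] = encoder.fit_transform(toy[['Condition']]).ravel()

print(toy)
```

Without the `categories` argument the encoder would sort the labels alphabetically (high, low, medium), which would not reflect the intended order.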


In [135]:
X_imputed_ordinal = pd.concat([imputed_categorical_features, imputed_numerical_features], axis=1, ignore_index=False)

# Import Libraries
import pandas as pd

# Library to perform Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

# Get list of categorical features
categorical_features = (X_imputed_ordinal.dtypes == 'object')
object_cols = list(categorical_features[categorical_features].index)

# The dataset contains seven categorical features,
# so we perform Ordinal Encoding on each of them
ordinal_encoder = OrdinalEncoder()
X_imputed_ordinal[object_cols] = ordinal_encoder.fit_transform(X_imputed_ordinal[object_cols])

X_train_ordinal, X_valid_ordinal, y_train_ordinal, y_valid_ordinal = train_test_split(X_imputed_ordinal, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

print("MAE after Ordinal Encoding:")
print(score_dataset(X_train_ordinal, X_valid_ordinal, y_train_ordinal, y_valid_ordinal))

MAE after Ordinal Encoding:
179618.08135054348
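
The MAE here is identical to the one obtained with Label Encoding. A likely explanation (an interpretation, not something the notebook states) is that both encoders, with default settings, assign integers in the sorted order of the category values, so the model receives the same encoded features. A small check on hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({'Type': ['h', 'u', 't', 'h'],
                    'Method': ['S', 'SP', 'PI', 'S']})

# LabelEncoder works on one column at a time
label_codes = pd.DataFrame(
    {col: LabelEncoder().fit_transform(toy[col]) for col in toy.columns})

# OrdinalEncoder handles all columns at once
ordinal_codes = OrdinalEncoder().fit_transform(toy)

same = np.array_equal(label_codes.to_numpy(), ordinal_codes)
print(same)
```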


### One Hot Encoding 
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. 
There are a number of parameters that can be used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data. Setting sparse=False ensures that the encoded columns are returned as a dense numpy array instead of a sparse matrix (in scikit-learn 1.2 and later this parameter is named sparse_output).

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded.

In the code cell below, object_cols is a list of the column names with categorical data, so X_imputed_one_hot[object_cols] contains all of the categorical data. Note that the encoder is fitted on the full dataset before the train/validation split, so every category is already known to the encoder when the validation rows are encoded.
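
The effect of `handle_unknown='ignore'` can be sketched on hypothetical training and validation frames, where the validation data contains a category the encoder never saw. The example keeps the default sparse output and converts it with `.toarray()` so it does not depend on the name of the sparse-related parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: 't' appears only in the validation data
train = pd.DataFrame({'Type': ['h', 'u', 'h']})
valid = pd.DataFrame({'Type': ['u', 't']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# Default output is a sparse matrix; convert to a dense array for display
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_)
print(encoded_valid)
```

The unseen category is encoded as a row of zeros rather than raising an error.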

In [136]:
X_imputed_one_hot = pd.concat([imputed_categorical_features, imputed_numerical_features], axis=1, ignore_index=False)

# Import Libraries
import pandas as pd

# Library to perform One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

# Get list of categorical features
categorical_features = (X_imputed_one_hot.dtypes == 'object')
object_cols = list(categorical_features[categorical_features].index)

# The dataset contains seven categorical features,
# so we perform One-Hot Encoding on each of them
# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use sparse=False on older versions
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(X_imputed_one_hot[object_cols]))
OH_cols.index = X_imputed_one_hot.index

# Remove categorical columns (will replace with one-hot encoding)
num_X = X_imputed_one_hot.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X = pd.concat([num_X, OH_cols], axis=1)

# Recent scikit-learn versions require all column names to be strings
OH_X.columns = OH_X.columns.astype(str)

X_train_OH, X_valid_OH, y_train_OH, y_valid_OH = train_test_split(OH_X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

print("MAE after One-Hot Encoding:")
print(score_dataset(X_train_OH, X_valid_OH, y_train_OH, y_valid_OH))

MAE after One-Hot Encoding:
171575.36725815217


### Feature Hashing

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

FeatureHasher can be used to perform feature hashing of categorical features: each feature name is hashed to a column index in a fixed-size output, so no dictionary of categories has to be stored.
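
A minimal sketch of how this might look with scikit-learn's `FeatureHasher`. The toy rows, the "column=value" token format, and the choice of `n_features=8` are assumptions made for illustration; a real dataset would normally use a much larger number of output columns to limit collisions.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy rows; the first and third are identical on purpose
toy = pd.DataFrame({
    'Suburb': ['Abbotsford', 'Richmond', 'Abbotsford'],
    'Type': ['h', 'u', 'h'],
})

# Build one list of "column=value" tokens per row
tokens = [[f'Suburb={s}', f'Type={t}']
          for s, t in zip(toy['Suburb'], toy['Type'])]

# Hash the tokens of each row into a fixed number of columns
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform(tokens).toarray()

print(hashed.shape)
print(hashed)
```

Hashing is deterministic, so identical rows always produce identical encodings, and the output width stays fixed no matter how many distinct categories appear.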

### In-class assignment

House Price Prediction - https://www.kaggle.com/c/home-data-for-ml-course/data

Using this data, figure out which features contain missing data, and apply label and one-hot encoding to the categorical data.

Use the train.csv file.
