# More topics on Melbourne House Prices

This notebeook is a continuation of the notebook `house-pricing`. The idea is to go deep in new topics in ML.


## MISSING VALUES

Missing values are the most common data issue that you will be able to find in almost every data set. The best thing is to be prepared to face this challenge. 

Most of the machine learning libraries (like sci-kit learn) cannot deal with missing values. These raise an error if a model is built using data with missing values. To avoid these, there are some strategies.
 
**1. Drop columns with missing values**

This is the mos simple option but not the most recommended as it has to be clear that the column that is removed does not add anything to the model.

**2. Filling the missing values (Imputation)**

**Filling** the missing values with some other value. An example of a filling value is the _mean_ but in case you have [skewed](https://en.wikipedia.org/wiki/Skewness) data you might want to use the median, or even the mode. 

This is probably the best method that can be used. This depends on that the value that is added really makes sense as a value for the numerical values that are in the range of the column. 

**3. Extension of filling**

Filling is the standard approach, and it usually works well. There might be disadvantages as the value that is used to fill in the missing might not be completely accurate, or the rows with the missig values can be unique in some other way. In that case, the model would make better prediction considering which values were originlly missing. For that purpose, for each column with a missing value, a new column is added `"original_name"_was_missing` that has as values `True`if the value in the `original_name` was missing and `False` otherwise. This trick can help in some cases but is not always the case.

Now, let us test those three in the following.

In [126]:
# import all the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('data/train.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         34857 non-null  object 
 1   Address        34857 non-null  object 
 2   Rooms          34857 non-null  int64  
 3   Type           34857 non-null  object 
 4   Price          27247 non-null  float64
 5   Method         34857 non-null  object 
 6   SellerG        34857 non-null  object 
 7   Date           34857 non-null  object 
 8   Distance       34856 non-null  float64
 9   Postcode       34856 non-null  float64
 10  Bedroom2       26640 non-null  float64
 11  Bathroom       26631 non-null  float64
 12  Car            26129 non-null  float64
 13  Landsize       23047 non-null  float64
 14  BuildingArea   13742 non-null  float64
 15  YearBuilt      15551 non-null  float64
 16  CouncilArea    34854 non-null  object 
 17  Lattitude      26881 non-null  float64
 18  Longti

In [129]:
#select the main variable
y = df.SalePrice
y = y.fillna(y.median())

X = df.drop(['SalePrice'], axis = 1)

#For simplicity select only the numerical variables
X = X.select_dtypes(exclude = ['object'])

# Divide the dataset
X_train, X_valid, y_train, y_valid= train_test_split(X, y, 
                                                     train_size = 0.8,
                                                     test_size = 0.2)

#function to compare the different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators =10, random_state = 100)
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    return mean_absolute_error(y_valid, pred)


In [130]:
# Case 1. drop the columns that contain missing values
cols_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_missing, axis =1)
reduced_X_valid = X_valid.drop(cols_missing, axis = 1)

print('MAE for case1: Dropping columns with missng values : {}'.format(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid)))

MAE for case1: Dropping columns with missng values : 343301.41088616254


In [131]:
#Case 2: Imputer
from sklearn.impute import SimpleImputer

## Imputation
imputer = SimpleImputer()

imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid))

## gave back the original names to the columns

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print('MAE case 2: Filling/Imputation is:{}'.format(score_dataset(imputed_X_train,imputed_X_valid,y_train, y_valid)))

MAE case 2: Filling/Imputation is:254327.7195829716


There is already a considerable reduction in the MAE in between the first two approaches.

In [132]:
#Case three, Extension of filling
# make copies of the X sets
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

#make new column indicating what will be imputed
for col in cols_missing:
    X_train_plus[col+'_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col+'_was_missing'] = X_valid_plus[col].isnull()

#Generate the imputer and change the values
imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(imputer.transform(X_valid_plus))

# rename the columns of the imputer with the previous names
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print('MAE case 3: Imputer extended: {}'.format(score_dataset(imputed_X_train_plus,imputed_X_valid_plus,y_train, y_valid)))


MAE case 3: Imputer extended: 254276.4703580419


As seen in the last result, adding that extra layer of sophistication is not reflected in the overall model performance. 

## Categorical Data

A categorical variable takes only a limited number of values. Think for example surveys in which you are asked your mood and the possible answers are 'happy','normal', 'sad'. These are categories. 

How to use the data? 

**1. Drop Categorical variables:**
This is again the easiest approach. Remove those columns that contain categorical values.

**2. Ordinal Encoding:**
Assign a unique value to each of the categories. In this approach, it is assumed that there is an ordering of the variables. In the case, 'happy'>'normal'>'sad'
The assumtption makes sense in this example but this might not be the case all the time. These variables as referred as 'ordinal variables'

_Be carefull in this approach as both the training and the validation dataset should have the same collection of categories_

**3. One-hot Encoding:**
One-hot encoding creates new columns indicating the presence (or absence) of each of the possible values in the original data. Each category is represented by a column that has the following form. For variable X and cateogry i, the variable $X_i$ takes the follwing values for the observation j
$$ X_i =\begin{cases}1 &\text{if }X(j) = i\\ 0& \text{otherwise} \end{cases}$$

Therefore, at the end, the new vectors contain the information that it was contained in the original categorical variable. 

In this approach, there is no order assumed, therefore this is a good approach when there is not a clear order in the categories. These are 'nominal variables'. One drawback from this approach is that it does not perform well if the categorical variable has too many categories where too many can be as small as 15 categories.

In [133]:
# import the necessary libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble  import RandomForestRegressor

## Read the data set as before we were not considering categorical 
# variables
df_2 = pd.read_csv('data/train.csv')
# extract the y value
y = df_2.SalePrice 
y = y.fillna(y.median())

X = df_2.drop('SalePrice', axis = 1)

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X,y,
                                                                train_size=0.8,
                                                                test_size= 0.2,
                                                                random_state = 1) 

# For this part of the exersice, the data from the first approach
# to missing values is going to be used. 

cols_missing = [col for col in X_train_full.columns if X_train_full[col].isna().any()]

X_train_full = X_train_full.drop(columns = cols_missing, axis = 1)
X_valid_full = X_valid_full.drop(columns = cols_missing, axis = 1)

# check the variables that are categorical with low cardinality 
# and also the ones with numerical values

# Categorical values:_
low_cardinality = [name for name in X_train_full.columns 
                   if X_train_full[name].dtype == 'object' and X_train_full[name].nunique()<=10]

#Numerical values
numerical = [name for name in X_train_full.columns if X_train_full[name].dtype in ['int64', 'float64']]

# Columns to study
my_cols = low_cardinality + numerical

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


In [134]:
X_train.head()


Unnamed: 0,Type,Method,Rooms
23491,h,S,3
5998,h,S,4
14381,h,S,4
1202,h,PI,3
16775,h,PI,4


In [135]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

To measure the different approaches, the function `score_dataset` defined above will be used again. This is again defined here for completeness.

In [136]:
## define the function score_dataset

def socre_dataset(X_train, X_valid, y_train, y_valid ):
    '''
    Function to socre the data set based on the approach that was given 
    '''
    
    model = RandomForestRegressor(n_estimators=100,
                                  random_state=100)
    model.fit(X_train, y_train)
    
    #MAke the predictions
    predictions = (X_valid)
    
    error = mean_absolute_error(y_valid, predictions)
    return round(error,4)
    

In [137]:
# Case 1: Drop categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables): {}".format(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid)))

MAE from Approach 1 (Drop categorical variables): 348410.8782100967


With an `OrdinalEncoder` from `sklearn`, a random integer is assign to each unique value in the categorical variable. This is a common practice that is simpler than giving customized labels. However, it is expected that the performance boosts if the better-informed labels for the ordinal variables are given.


In [None]:
def category_checker(df1, df2, cat_columns = None):
    '''
    Function to check if the columns in both dataframes have
    the same categories in the cat_columns. If not, return which 
    columns have different classifications.
    '''

In [138]:
# Case 2: Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder


#Make sure that both data sets have the same categories.


# Make copy of the training a validation sets
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Delete the values categories that are not in the training set.
for cat_col in object_cols:
    categories = label_X_train

# apply the ordinal encoder to each of the categorical variables
ordinal_encoder = OrdinalEncoder()

label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])

label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])


print("MAE from Approach 2 (Ordinal Encoding):{}".format(score_dataset(label_X_train, label_X_valid, y_train, y_valid))) 

MAE from Approach 2 (Ordinal Encoding):321528.2349797436


In [84]:
# Case 3 

(array(['CompShg', 'WdShngl', 'WdShake', 'Tar&Grv', 'Metal'], dtype=object),
 array(['CompShg', 'WdShngl', 'Tar&Grv', 'ClyTile', 'WdShake', 'Membran',
        'Roll'], dtype=object))

In [93]:
label_X_valid

1