# More topics on Melbourne House Prices

This notebeook is a continuation of the notebook `house-pricing`. The idea is to go deep in new topics in ML.


## MISSING VALUES

Missing values are the most common data issue that you will be able to find in almost every data set. The best thing is to be prepared to face this challenge. 

Most of the machine learning libraries (like sci-kit learn) cannot deal with missing values. These raise an error if a model is built using data with missing values. To avoid these, there are some strategies.
 
**1. Drop columns with missing values**

This is the mos simple option but not the most recommended as it has to be clear that the column that is removed does not add anything to the model.

**2. Filling the missing values (Imputation)**

**Filling** the missing values with some other value. An example of a filling value is the _mean_ but in case you have [skewed](https://en.wikipedia.org/wiki/Skewness) data you might want to use the median, or even the mode. 

This is probably the best method that can be used. This depends on that the value that is added really makes sense as a value for the numerical values that are in the range of the column. 

**3. Extension of filling**

Filling is the standard approach, and it usually works well. There might be disadvantages as the value that is used to fill in the missing might not be completely accurate, or the rows with the missig values can be unique in some other way. In that case, the model would make better prediction considering which values were originlly missing. For that purpose, for each column with a missing value, a new column is added `"original_name"_was_missing` that has as values `True`if the value in the `original_name` was missing and `False` otherwise. This trick can help in some cases but is not always the case.

Now, let us test those three in the following.

In [18]:
# import all the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('data/train.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [19]:
#select the main variable
y = df.SalePrice
y = y.fillna(y.median())

X = df.drop(['SalePrice'], axis = 1)

#For simplicity select only the numerical variables
X = X.select_dtypes(exclude = ['object'])

# Divide the dataset
X_train, X_valid, y_train, y_valid= train_test_split(X, y, 
                                                     train_size = 0.8,
                                                     test_size = 0.2)

#function to compare the different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators =10, random_state = 100)
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    return mean_absolute_error(y_valid, pred)


In [20]:
# Case 1. drop the columns that contain missing values
cols_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_missing, axis =1)
reduced_X_valid = X_valid.drop(cols_missing, axis = 1)

print('MAE for case1: Dropping columns with missng values : {}'.format(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid)))

MAE for case1: Dropping columns with missng values : 17540.179109589044


In [21]:
#Case 2: Imputer
from sklearn.impute import SimpleImputer

## Imputation
imputer = SimpleImputer()

imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid))

## gave back the original names to the columns

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print('MAE case 2: Filling/Imputation is:{}'.format(score_dataset(imputed_X_train,imputed_X_valid,y_train, y_valid)))

MAE case 2: Filling/Imputation is:17616.610958904108


There is already a considerable reduction in the MAE in between the first two approaches.

In [22]:
#Case three, Extension of filling
# make copies of the X sets
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

#make new column indicating what will be imputed
for col in cols_missing:
    X_train_plus[col+'_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col+'_was_missing'] = X_valid_plus[col].isnull()

#Generate the imputer and change the values
imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(imputer.transform(X_valid_plus))

# rename the columns of the imputer with the previous names
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print('MAE case 3: Imputer extended: {}'.format(score_dataset(imputed_X_train_plus,imputed_X_valid_plus,y_train, y_valid)))


MAE case 3: Imputer extended: 18006.705479452055


As seen in the last result, adding that extra layer of sophistication is not reflected in the overall model performance. 

## Categorical Data

A categorical variable takes only a limited number of values. Think for example surveys in which you are asked your mood and the possible answers are 'happy','normal', 'sad'. These are categories. 

How to use the data? 

**1. Drop Categorical variables:**
This is again the easiest approach. Remove those columns that contain categorical values.

**2. Ordinal Encoding:**
Assign a unique value to each of the categories. In this approach, it is assumed that there is an ordering of the variables. In the case, 'happy'>'normal'>'sad'
The assumtption makes sense in this example but this might not be the case all the time. These variables as referred as 'ordinal variables'

_Be carefull in this approach as both the training and the validation dataset should have the same collection of categories_

**3. One-hot Encoding:**
One-hot encoding creates new columns indicating the presence (or absence) of each of the possible values in the original data. Each category is represented by a column that has the following form. For variable X and cateogry i, the variable $X_i$ takes the follwing values for the observation j
$$ X_i =\begin{cases}1 &\text{if }X(j) = i\\ 0& \text{otherwise} \end{cases}$$

Therefore, at the end, the new vectors contain the information that it was contained in the original categorical variable. 

In this approach, there is no order assumed, therefore this is a good approach when there is not a clear order in the categories. These are 'nominal variables'. One drawback from this approach is that it does not perform well if the categorical variable has too many categories where too many can be as small as 15 categories.

In [23]:
# import the necessary libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble  import RandomForestRegressor

## Read the data set as before we were not considering categorical 
# variables
df_2 = pd.read_csv('data/train.csv')
# extract the y value
y = df_2.SalePrice 
y = y.fillna(y.median())

X = df_2.drop('SalePrice', axis = 1)

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X,y,
                                                                train_size=0.8,
                                                                test_size= 0.2,
                                                                random_state = 1) 

# For this part of the exersice, the data from the first approach
# to missing values is going to be used. 

cols_missing = [col for col in X_train_full.columns if X_train_full[col].isna().any()]

X_train_full = X_train_full.drop(columns = cols_missing, axis = 1)
X_valid_full = X_valid_full.drop(columns = cols_missing, axis = 1)

# check the variables that are categorical with low cardinality 
# and also the ones with numerical values

# Categorical values:_
low_cardinality = [name for name in X_train_full.columns 
                   if X_train_full[name].dtype == 'object' and X_train_full[name].nunique()<=10]

#Numerical values
numerical = [name for name in X_train_full.columns if X_train_full[name].dtype in ['int64', 'float64']]

# Columns to study
my_cols = low_cardinality + numerical

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


In [24]:
X_train.head()


Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Condition1,Condition2,BldgType,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
921,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Feedr,Norm,Duplex,...,0,0,70,0,0,0,0,0,9,2008
520,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,2fmCon,...,0,220,114,210,0,0,0,0,8,2008
401,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,...,400,0,0,0,0,0,0,0,7,2006
280,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,...,575,0,84,0,196,0,0,0,1,2007
1401,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,...,398,100,75,0,0,0,0,0,4,2008


In [25]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

To measure the different approaches, the function `score_dataset` defined above will be used again. This is again defined here for completeness.

In [32]:
## define the function score_dataset

def socre_dataset(X_train, X_valid, y_train, y_valid ):
    '''
    Function to socre the data set based on the approach that was given 
    '''
    
    model = RandomForestRegressor(n_estimators=100,
                                  random_state=100)
    model.fit(X_train, y_train)
    
    #MAke the predictions
    predictions = (X_valid) 
    
    error = mean_absolute_error(y_valid, predictions)
    return round(error,4)
    

In [33]:
# Case 1: Drop categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables): {}".format(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid)))

MAE from Approach 1 (Drop categorical variables): 17185.62705479452


With an `OrdinalEncoder` from `sklearn`, a random integer is assign to each unique value in the categorical variable. This is a common practice that is simpler than giving customized labels. However, it is expected that the performance boosts if the better-informed labels for the ordinal variables are given.


In [34]:
def category_checker(df1, df2, cat_columns = None):
    '''
    Function to check if the columns in both dataframes have
    the same categories in the cat_columns. If not, return which 
    columns have different categories in both dataframes. This is 
    important for the Ordinal encoding and the one-hot encoding
    as they are trained with one dataframe and the other one will be 
    fitted based on the previous results. If there is new information 
    in the second, then it will raise an error.
    '''
    cols_to_drop = []
    for col in cat_columns:
        categories_df1 = set(df1[col].unique())
        categories_df2 = set(df2[col].unique())
        if len(categories_df1 ^ categories_df2)>0:
            cols_to_drop = cols_to_drop + [col]
    return cols_to_drop



In [35]:
# Case 2: Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder


#Make sure that both data sets have the same categories.
not_same_categories = category_checker(X_train,X_valid, object_cols)

# Make copy of the training a validation sets
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# remove the categorical variables result from category_checker()
# First from the train and validation data
label_X_train = label_X_train.drop(not_same_categories, axis =1)
label_X_valid = label_X_valid.drop(not_same_categories, axis =1)

#From the columns that are going to be checked
object_cols_reduced = list(set(object_cols) - set(not_same_categories))


# apply the ordinal encoder to each of the categorical variables
ordinal_encoder = OrdinalEncoder()

label_X_train[object_cols_reduced] = ordinal_encoder.fit_transform(X_train[object_cols_reduced])

label_X_valid[object_cols_reduced] = ordinal_encoder.transform(X_valid[object_cols_reduced])


print("MAE from Approach 2 (Ordinal Encoding):{}".format(score_dataset(label_X_train, label_X_valid, y_train, y_valid))) 

MAE from Approach 2 (Ordinal Encoding):17312.79075342466


In [38]:
# Case 3 One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

## Apply one hot encoder to each columns with categorical data
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

oh_cols_train = pd.DataFrame(oh_encoder.fit_transform(X_train[object_cols]))
oh_cols_valid = pd.DataFrame(oh_encoder.transform(X_valid[object_cols]))

## one-hot encoding removed index: put it back
oh_cols_train.index = X_train.index
oh_cols_valid.index = X_valid.index

# Remove the categorical columns (those were replaced with the one-hot-encoding)
num_X_train = X_train.drop(object_cols, axis = 1)
num_X_valid = X_valid.drop(object_cols, axis = 1)

# Add the one-hot- encoding columns to the numerical ones
oh_X_train = pd.concat([num_X_train, oh_cols_train], axis =1)
oh_X_valid = pd.concat([num_X_valid, oh_cols_valid], axis =1)

print('MAE from Approach 3 (One-hot Encoding)')
print(score_dataset(oh_X_train, oh_X_valid, y_train, y_valid))


MAE from Approach 3 (One-hot Encoding)
16874.49383561644




### Which is the winning approach? 

Dropping the categorical variables usually is the one that perform the worst and the best is the one-hot encoding, but it varies in a case-by-case basis

### Conclusion

There are always categorical values all around the world and it is good to know how to handle it.