# Introduction

Handle missing values, non-numeric values, data leakage, and more.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
x_full = pd.read_csv('dataset/home_data_train.csv')
x_test_full = pd.read_csv('dataset/home_data_test.csv')

x_full.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
x_test_full.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [3]:
# target and features (predictors)

y = x_full['SalePrice']

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

x = x_full[features].copy()

x_test = x_test_full[features].copy()

In [4]:
# Break off validation set from training data
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=1, train_size=0.8, test_size=0.2)

In [5]:
x_train.head()

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
921,8777,1900,1272,928,2,4,9
520,10800,1900,694,600,2,3,7
401,8767,2005,1310,0,2,3,6
280,11287,1989,1175,807,2,3,7
1401,7415,2004,864,729,2,3,8


In [6]:
from sklearn.ensemble import RandomForestRegressor

In [7]:
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
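# 'mae' was renamed to 'absolute_error' in scikit-learn 1.0 (the old name was removed in 1.2)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)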
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

To select the best model out of the five, we define a function score_model() below. This function returns the mean absolute error (MAE) from the validation set. Recall that the best model will obtain the lowest MAE.

In [21]:
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=x_train, X_v=x_val, y_t=y_train, y_v=y_val):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

dic_score = {}

for i in range(0, len(models)):
    mae = score_model(models[i])
    dic_score[i] = mae
    print("Model %d MAE: %d" % (i+1, mae))
    
best_model_with_min_mae = min(dic_score, key=dic_score.get)
best_model_with_min_mae

Model 1 MAE: 22074
Model 2 MAE: 21979
Model 3 MAE: 22457
Model 4 MAE: 22508
Model 5 MAE: 22439


1
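As a reminder of what `score_model()` returns, here is a minimal sketch that computes the MAE by hand and checks it against scikit-learn. The three prices are made up for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted sale prices (illustrative values only)
y_true = np.array([200_000, 150_000, 300_000])
y_pred = np.array([210_000, 140_000, 330_000])

# MAE = mean of |true - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both numbers should agree; a lower value means the predictions are closer to the true prices on average.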

# Step 1: Evaluate several models

Use the above results to fill in the line below. Which model is the best model? Your answer should be one of model_1, model_2, model_3, model_4, or model_5.

In [22]:
# Fill in the best model
best_model = models[best_model_with_min_mae]

# Step 2: Generate test predictions

Great. You know how to evaluate what makes an accurate model. Now it's time to go through the modeling process and make predictions. In the line below, create a Random Forest model with the variable name `my_model`.

In [23]:
my_model = RandomForestRegressor(n_estimators=100, random_state = 0)

Run the next code cell without changes. The code fits the model to the training and validation data, and then generates test predictions that are saved to a CSV file. These test predictions can be submitted directly to the competition!

In [24]:
# Fit the model to the training data
my_model.fit(x, y)

# Generate test predictions
preds_test = my_model.predict(x_test)

# Save predictions in format used for competition scoring
# x_test keeps the default 0-based index here, so take the Id column from x_test_full
output = pd.DataFrame({'Id': x_test_full['Id'],
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

# Missing Values

Missing values happen. Be prepared for this common challenge in real datasets.

There are many ways data can end up with missing values. For example:

- A 2 bedroom house won't include a value for the size of a third bedroom.
- A survey respondent may choose not to share his income.

Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
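Before choosing a strategy, it helps to see where the gaps are. The sketch below counts missing entries per column with pandas; the table and its column names are hypothetical and only mirror the two examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap in 'income', one in 'bedroom3_size'
toy = pd.DataFrame({
    'rooms': [2, 3, 4],
    'income': [50_000.0, np.nan, 72_000.0],
    'bedroom3_size': [np.nan, 12.0, 15.0],
})

# Count missing entries per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the table
total_missing = int(missing_per_column.sum())
print(total_missing)
```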

# Three Approaches

1) A Simple Option: Drop Columns with Missing Values:
    - The simplest option is to drop columns with missing values. Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
    
<img src="https://i.imgur.com/Sax80za.png" width='80%' align='center'/>
    
2) A Better Option: Imputation
    - Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
    
<img src="https://i.imgur.com/4BpnlPA.png" width='80%' align='center'/>
    
3) An Extension To Imputation
    - Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries. In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
    
    
<img src="https://i.imgur.com/UWOyg4a.png" width='80%' align='center'/>
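The three approaches can be sketched on a tiny table with plain pandas. The data is hypothetical, and `bath_was_missing` is just an illustrative column name that follows the `_was_missing` convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in the 'bath' column
toy = pd.DataFrame({'bed': [1.0, 2.0, 3.0, 1.0],
                    'bath': [1.0, np.nan, 2.0, np.nan]})

# Approach 1: drop every column that contains a missing value
dropped = toy.dropna(axis=1)

# Approach 2: fill each gap with the mean of its column
imputed = toy.fillna(toy.mean())

# Approach 3: impute, and also record where the gaps were
extended = imputed.copy()
extended['bath_was_missing'] = toy['bath'].isnull()

print(dropped.columns.tolist())
print(imputed['bath'].tolist())
print(extended['bath_was_missing'].tolist())
```

Approach 1 leaves only `bed`, while approaches 2 and 3 keep `bath` with its gaps filled by the column mean.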

In [None]:
melb = pd.read_csv('dataset/melb_data.csv')

melb.columns

# getting the target
y = melb['Price']
# To keep things simple, we'll use only numerical predictors
x = melb.drop('Price', axis=1)
x = x.select_dtypes(exclude=['object'])

# x.select_dtypes? 

## Return a subset of the DataFrame's columns based on the column dtypes.

Parameters
----------

include, exclude : scalar or list-like

    A selection of dtypes or strings to be included/excluded. At least
    one of these parameters must be supplied.
    

In [25]:
x.head()

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,3,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,4,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


In [26]:
y.head()

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64

In [27]:
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=0, train_size=0.8, test_size=0.2)

## Define Function to Measure Quality of Each Approach

We define a function score_dataset() to compare different approaches to dealing with missing values. This function reports the mean absolute error (MAE) from a random forest model.

In [28]:
def score_dataset(x_train, x_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(x_train, y_train)
    preds = model.predict(x_val)
    return mean_absolute_error(y_val, preds)

## Score from Approach 1 (Drop Columns with Missing Values)

Since we are working with both training and validation sets, we are careful to drop the same columns in both DataFrames.

In [29]:
# Get names of columns with missing values

cols_with_missing = [col for col in x_train.columns
                     if x_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = x_train.drop(cols_with_missing, axis=1)
reduced_X_valid = x_val.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_val))

MAE from Approach 1 (Drop columns with missing values):
183550.22137772635


## Score from Approach 2 (Imputation)

Next, we use SimpleImputer to replace missing values with the mean value along each column.

Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation, for instance), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.

In [30]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()

In [16]:
my_imputer.fit_transform? # return an array

In [31]:
# Imputation

imputed_X_train = pd.DataFrame(my_imputer.fit_transform(x_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(x_val))

# Imputation removed column names; put them back
imputed_X_train.columns = x_train.columns
imputed_X_valid.columns = x_val.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_val))

MAE from Approach 2 (Imputation):
178166.46269899711


SimpleImputer is a scikit-learn class for handling missing data in a dataset. It replaces missing values (NaN by default) with a value computed from each column or with a fixed constant.

The main parameters of SimpleImputer() are:
     - missing_values : The placeholder that marks a missing entry. The default is NaN.
     - strategy : How the replacement value is chosen. It can be 'mean' (the default), 'median', 'most_frequent' or 'constant'.
     - fill_value : The constant used to replace missing entries when strategy is 'constant'.
     
 The mean or median is taken along each column of the matrix.
 
 Methods: fit, transform and fit_transform
 
 1. fit(): computes the fill value for each column (for example the column mean) and stores it on the imputer, in the statistics_ attribute.
 
 2. transform(): uses those stored values to fill the missing entries of a given dataset.
 
 3. fit_transform(): runs fit() and then transform() on the same dataset.
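The difference between fitting and transforming is easiest to see on a small array. In this sketch (the `train` and `valid` arrays are made up), the imputer learns its fill values from `train` only and reuses them on `valid`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0]])
valid = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)              # learns one mean per column, from train only
print(imputer.statistics_)      # the per-column fill values stored by fit()

# transform() fills valid with the training means
filled_valid = imputer.transform(valid)
print(filled_valid)
```

This is why the notebook calls `fit_transform` on the training data but only `transform` on the validation data: the validation set is filled with statistics learned from the training set.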

In [60]:
# example from geeksforgeeks

import numpy as np

# Imputer object using the mean strategy and
# missing_values type for imputation
imputer = SimpleImputer(missing_values = np.nan,
                        strategy ='mean')
 
data = [[12, np.nan, 34], [10, 32, np.nan],
        [np.nan, 11, 20]]
 
print("Original Data : \n", data)
# Fitting the data to the imputer object
imputer = imputer.fit(data)
 
# Imputing the data    
data = imputer.transform(data)
 
print("Imputed Data : \n", data)

Original Data : 
 [[12, nan, 34], [10, 32, nan], [nan, 11, 20]]
Imputed Data : 
 [[12.  21.5 34. ]
 [10.  32.  27. ]
 [11.  11.  20. ]]


## Score from Approach 3 (An Extension to Imputation)

Next, we impute the missing values, while also keeping track of which values were imputed.

In [61]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = x_train.copy()
X_valid_plus = x_val.copy()

In [62]:
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull() 
    # it's going to be true for cols that are missing and false otherwise
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

In [63]:
X_valid_plus.head()

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount,Car_was_missing,BuildingArea_was_missing,YearBuilt_was_missing
8505,4,8.0,3016.0,4.0,2.0,2.0,450.0,190.0,1910.0,-37.861,144.8985,6380.0,False,False,False
5523,2,6.6,3011.0,2.0,1.0,0.0,172.0,81.0,1900.0,-37.81,144.8896,2417.0,False,False,False
12852,3,10.5,3020.0,3.0,1.0,1.0,581.0,,,-37.7674,144.82421,4217.0,False,True,True
4818,3,4.5,3181.0,2.0,2.0,1.0,128.0,134.0,2000.0,-37.8526,145.0071,7717.0,False,False,False
12812,3,8.5,3044.0,3.0,2.0,2.0,480.0,,,-37.72523,144.94567,7485.0,False,True,True


In [64]:
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

In [65]:
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

In [66]:
print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_val))

MAE from Approach 3 (An Extension to Imputation):
178927.503183954
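scikit-learn can also build the indicator columns itself through the `add_indicator` option of `SimpleImputer`. The sketch below uses a hypothetical two-column table, not the Melbourne data, and is an alternative to the manual `_was_missing` loop rather than what this notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only 'Car' has a gap
toy = pd.DataFrame({'Car': [1.0, np.nan, 2.0],
                    'Rooms': [2.0, 3.0, 4.0]})

imputer = SimpleImputer(strategy='mean', add_indicator=True)
result = imputer.fit_transform(toy)

# Output columns: Car, Rooms, then one 0/1 indicator for 'Car',
# the only column that had missing values during fit
print(result)
```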


So, why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, where three columns contain missing data. For each column, less than half of the entries are missing. Thus, dropping the columns removes a lot of useful information, and so it makes sense that imputation would perform better.

In [68]:
# Shape of training data (num_rows, num_columns)
print(x_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (x_train.isnull().sum())

print(missing_val_count_by_column[missing_val_count_by_column > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


As is common, imputing missing values (in Approach 2 and Approach 3) yielded better results, relative to when we simply dropped columns with missing values (in Approach 1).

# Exercise

In [34]:
import numpy as np
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
students = [[85, 'M', 'verygood'],
           [95, 'F', 'excellent'],
           [75, None,'good'],
           [np.nan, 'M', 'average'],
           [70, 'M', 'good'],
           [np.nan, None, 'verygood'],
           [92, 'F', 'verygood'],
           [98, 'M', 'excellent']]

students = pd.DataFrame(students, columns=['marks', 'gender', 'result'])

students.head()

Unnamed: 0,marks,gender,result
0,85.0,M,verygood
1,95.0,F,excellent
2,75.0,,good
3,,M,average
4,70.0,M,good


Two columns (features) have missing values and need to be imputed: the numerical column marks and the categorical column gender. In the code below, an instance of SimpleImputer is created with the "mean" strategy, which only applies to numerical data. Missing values are represented by NaN.

- The SimpleImputer class is imported from the sklearn.impute package.
- SimpleImputer takes two arguments here: missing_values and strategy.
- The fit_transform method is called on the SimpleImputer instance to impute the missing values.

In [38]:
from sklearn.impute import SimpleImputer

In [69]:
# Missing values are represented by NaN, so that is what we pass as
# missing_values. If they were empty strings, we would pass '' instead.

# instantiate the simple imputer
my_imputer_examples = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform expects 2D input: use array.reshape(-1, 1) if your data has a single feature
# or array.reshape(1, -1) if it contains a single sample.
# ravel() flattens the (n, 1) result back to 1D before it is stored in the column
students.marks = my_imputer_examples.fit_transform(students.marks.values.reshape(-1, 1)).ravel()

In [71]:
students_transformed = pd.DataFrame(students, columns=students.columns)
students_transformed

Unnamed: 0,marks,gender,result
0,85.0,M,verygood
1,95.0,F,excellent
2,75.0,M,good
3,85.833333,M,average
4,70.0,M,good
5,85.833333,M,verygood
6,92.0,F,verygood
7,98.0,M,excellent
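A mean is undefined for a categorical column such as `gender`, so it needs a different fill value. One simple option, sketched here with plain pandas as an assumption about how it could be done, is to use the most frequent category.

```python
import pandas as pd

# Same 'gender' values as in the students table
gender = pd.Series(['M', 'F', None, 'M', 'M', None, 'F', 'M'], name='gender')

# mode() ignores missing entries and returns the most frequent value(s)
most_frequent = gender.mode()[0]
gender_filled = gender.fillna(most_frequent)

print(most_frequent)
print(gender_filled.tolist())
```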


In [67]:
# other strategies are
# Imputing with mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Imputing with median value
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Imputing with most frequent / mode value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Imputing with constant value; The command below replaces the missing
# value with constant value such as 80
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=80)
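To see how the four strategies differ, the following sketch applies each one to the same single-feature column and records the value that replaces the one gap. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column with one gap (row 3)
marks = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    filled[strategy] = float(imputer.fit_transform(marks)[3, 0])

constant_imputer = SimpleImputer(strategy='constant', fill_value=80)
filled['constant'] = float(constant_imputer.fit_transform(marks)[3, 0])

print(filled)
```

The outlier 10.0 pulls the mean up to 3.75, while the median and the most frequent value both stay at 2.0.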

In [72]:
x_full = pd.read_csv('dataset/home_data_train.csv')
x_test = pd.read_csv('dataset/home_data_test.csv')

In [82]:
x_full.describe()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [83]:
x_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [85]:
# Remove rows with missing target, separate target from predictors
# subset names the column to check: rows with a null value in that column are dropped.
x_full.dropna(axis=0, subset=['SalePrice'], inplace=True)

In [89]:
x_full.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [78]:
x_full.set_index('Id', inplace=True, drop=True);

In [79]:
x_test.set_index('Id', inplace=True, drop=True);

In [80]:
y_target = x_full['SalePrice']

In [81]:
x_features = x_full.drop('SalePrice', axis=1)

In [93]:
# keep only the numerical columns; leave the categorical ones aside for now
x_features = x_features.select_dtypes(exclude=['object'])

In [94]:
x_features.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,548,0,61,0,0,0,0,0,2,2008
2,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,460,298,0,0,0,0,0,0,5,2007
3,60,68.0,11250,7,5,2001,2002,162.0,486,0,...,608,0,42,0,0,0,0,0,9,2008
4,70,60.0,9550,7,5,1915,1970,0.0,216,0,...,642,0,35,272,0,0,0,0,2,2006
5,60,84.0,14260,8,5,2000,2000,350.0,655,0,...,836,192,84,0,0,0,0,0,12,2008


In [96]:
x_test = x_test.select_dtypes(exclude=['object'])

In [97]:
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 1461 to 2919
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1459 non-null   int64  
 1   LotFrontage    1232 non-null   float64
 2   LotArea        1459 non-null   int64  
 3   OverallQual    1459 non-null   int64  
 4   OverallCond    1459 non-null   int64  
 5   YearBuilt      1459 non-null   int64  
 6   YearRemodAdd   1459 non-null   int64  
 7   MasVnrArea     1444 non-null   float64
 8   BsmtFinSF1     1458 non-null   float64
 9   BsmtFinSF2     1458 non-null   float64
 10  BsmtUnfSF      1458 non-null   float64
 11  TotalBsmtSF    1458 non-null   float64
 12  1stFlrSF       1459 non-null   int64  
 13  2ndFlrSF       1459 non-null   int64  
 14  LowQualFinSF   1459 non-null   int64  
 15  GrLivArea      1459 non-null   int64  
 16  BsmtFullBath   1457 non-null   float64
 17  BsmtHalfBath   1457 non-null   float64
 18  FullB

In [98]:
x_train, x_val, y_train, y_val = train_test_split(x_features, y_target, random_state=0, test_size=0.2)

In [99]:
x_train.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
619,20,90.0,11694,9,5,2007,2007,452.0,48,0,...,774,0,108,0,0,260,0,0,7,2007
871,20,60.0,6600,5,5,1962,1962,0.0,0,0,...,308,0,0,0,0,0,0,0,8,2009
93,30,80.0,13360,5,7,1921,2006,0.0,713,0,...,432,0,0,44,0,0,0,0,8,2009
818,20,,13265,8,5,2002,2002,148.0,1218,0,...,857,150,59,0,0,0,0,0,7,2008
303,20,118.0,13704,7,5,2001,2002,150.0,0,0,...,843,468,81,0,0,0,0,0,1,2006


You can already see a few missing values in the first several rows.  In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.

# Step 1: Preliminary investigation

Run the code cell below without changes.

In [100]:
# Shape of training data (num_rows, num_columns)
print(x_train.shape)

# Names of the columns that have at least one missing value
num_col_missing = [col for col in x_train if x_train[col].isnull().sum() > 0]

(1168, 36)


In [101]:
num_col_missing

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [109]:
x_train[num_col_missing].isnull().sum()

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64

In [106]:
# Shape of training data (num_rows, num_columns)
print(x_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (x_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


### Part A

Use the above output to answer the questions below.

In [114]:
# Fill in the line below: How many rows are in the training data?
num_rows = x_train.shape[0]
print("Number of rows {}".format(num_rows))
# Fill in the line below: How many columns in the training data
# have missing values?
num_cols_with_missing = len(missing_val_count_by_column[missing_val_count_by_column > 0])
print("Number of columns with missing values {}".format(num_cols_with_missing))

# Fill in the line below: How many missing entries are contained in 
# all of the training data?
tot_missing = missing_val_count_by_column[missing_val_count_by_column > 0].sum()
print("Missing values: {}".format(tot_missing))

Number of rows 1168
Number of columns with missing values 3
Missing values: 276


In [107]:
len(num_col_missing)

3

In [110]:
x_train[num_col_missing].isnull().sum().sum()

276

### Part B
Considering your answers above, what do you think is likely the best approach to dealing with the missing values?

**My answer is to perform imputation because only a small amount of data is missing. In MasVnrArea, for example, only 6 values are NaN. Dropping whole columns would probably be the wrong decision because it would throw away too much information.**

Since there are relatively few missing entries in the data (the column with the greatest percentage of missing values is missing less than 20% of its entries), we can expect that dropping columns is unlikely to yield good results. This is because we'd be throwing away a lot of valuable data, and so imputation will likely perform better.
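The "less than 20%" figure can be checked directly from the counts printed earlier (212, 6 and 58 missing entries out of 1168 training rows). A minimal sketch:

```python
import pandas as pd

n_rows = 1168  # rows in the training split, from x_train.shape
missing_counts = pd.Series({'LotFrontage': 212,
                            'MasVnrArea': 6,
                            'GarageYrBlt': 58})

# Share of missing entries per column, in percent
percent_missing = 100 * missing_counts / n_rows
print(percent_missing.round(2))
```

On the real DataFrame, `x_train.isnull().mean() * 100` should give the same percentages without typing the counts by hand.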



To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

In [116]:
def score_dataset(x_train, x_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(x_train, y_train)
    return mean_absolute_error(y_val, model.predict(x_val))

# Step 2: Drop columns with missing values

In this step, you'll preprocess the data in `x_train` and `x_val` to remove columns with missing values. Set the preprocessed DataFrames to `reduced_X_train` and `reduced_X_valid`, respectively.

In [126]:
# Fill in the line below: get names of columns with missing values
cols_with_missing_values = [col for col in x_train if x_train[col].isnull().any()] # Your code here
cols_with_missing_values

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [127]:
# Fill in the lines below: drop columns in training and validation data
reduced_X_train = x_train.drop(cols_with_missing_values, axis=1)
reduced_X_valid = x_val.drop(cols_with_missing_values, axis=1)

In [128]:
reduced_X_train.shape

(1168, 33)

In [129]:
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_val))

MAE (Drop columns with missing values):
17837.82570776256


# Step 3: Imputation

### Part A

Use the next code cell to impute missing values with the mean value along each column. Set the preprocessed DataFrames to `imputed_X_train` and `imputed_X_valid`. Make sure that the column names match those in `x_train` and `x_val`.

In [138]:
from sklearn.impute import SimpleImputer

# Fill in the lines below: imputation
my_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputed_X_train = pd.DataFrame(my_imputer.fit_transform(x_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(x_val))

# Fill in the lines below: imputation removed column names; put them back
imputed_X_train.columns = x_train.columns
imputed_X_valid.columns = x_val.columns

In [139]:
imputed_X_train.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,20.0,90.0,11694.0,9.0,5.0,2007.0,2007.0,452.0,48.0,0.0,...,774.0,0.0,108.0,0.0,0.0,260.0,0.0,0.0,7.0,2007.0
1,20.0,60.0,6600.0,5.0,5.0,1962.0,1962.0,0.0,0.0,0.0,...,308.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
2,30.0,80.0,13360.0,5.0,7.0,1921.0,2006.0,0.0,713.0,0.0,...,432.0,0.0,0.0,44.0,0.0,0.0,0.0,0.0,8.0,2009.0
3,20.0,69.614017,13265.0,8.0,5.0,2002.0,2002.0,148.0,1218.0,0.0,...,857.0,150.0,59.0,0.0,0.0,0.0,0.0,0.0,7.0,2008.0
4,20.0,118.0,13704.0,7.0,5.0,2001.0,2002.0,150.0,0.0,0.0,...,843.0,468.0,81.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0


In [140]:
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_val))

MAE (Imputation):
18062.894611872147


### Part B

Compare the MAE from each approach.  Does anything surprise you about the results?  Why do you think one approach performed better than the other?

**Surprisingly, the drop approach performed slightly better than the imputation approach (an MAE of about 17838 versus 18063). I don't know exactly why this happened. Maybe the imputed values simply did not add useful information.**

Given that there are so few missing values in the dataset, we'd expect imputation to perform better than dropping columns entirely. However, we see that dropping columns performs slightly better! While this can probably partially be attributed to noise in the dataset, another potential explanation is that the imputation method is not a great match to this dataset. That is, maybe instead of filling in the mean value, it makes more sense to set every missing value to a value of 0, to fill in the most frequently encountered value, or to use some other method. For instance, consider the GarageYrBlt column (which indicates the year that the garage was built). It's likely that in some cases, a missing value could indicate a house that does not have a garage. Does it make more sense to fill in the median value along each column in this case? Or could we get better results by filling in the minimum value along each column? It's not quite clear what's best in this case, but perhaps we can rule out some options immediately - for instance, setting missing values in this column to 0 is likely to yield horrible results!
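One way to explore this question is to score several imputation strategies with the same model and compare their MAEs. The sketch below does that on synthetic stand-in data, not the housing dataset, so the column names, sizes and noise level are all assumptions and the resulting numbers say nothing about the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical 'area' and 'year' features)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({'area': rng.uniform(50, 200, n),
                  'year': rng.integers(1950, 2010, n).astype(float)})
y = 1000 * X['area'] + 500 * (X['year'] - 1950) + rng.normal(0, 5000, n)

# Knock out 40 'year' values to simulate missing entries
X.loc[rng.choice(n, size=40, replace=False), 'year'] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'constant_0': SimpleImputer(strategy='constant', fill_value=0),
}

scores = {}
for name, imputer in imputers.items():
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    # Fit the imputer on the training split only, then reuse it on validation
    model.fit(imputer.fit_transform(X_tr), y_tr)
    scores[name] = mean_absolute_error(y_va, model.predict(imputer.transform(X_va)))

for name, mae in scores.items():
    print(f"{name}: {mae:.0f}")
```

Swapping in `x_train`, `x_val`, `y_train` and `y_val` would run the same comparison on the housing data.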

# Step 4: Generate test predictions

In this final step, you'll use any approach of your choosing to deal with missing values.  Once you've preprocessed the training and validation features, you'll train and evaluate a random forest model.  Then, you'll preprocess the test data before generating predictions that can be submitted to the competition!
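The workflow described above can be sketched end to end as follows. The tiny tables are hand-made stand-ins for the real training and test features, and `train_X`, `train_y` and `test_X` are hypothetical names. The key point is that the imputer is fitted on the training data and only applied to the test data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the training and test features, indexed by Id
train_X = pd.DataFrame({'LotArea': [8450.0, 9600.0, 11250.0, 9550.0],
                        'LotFrontage': [65.0, 80.0, np.nan, 60.0]},
                       index=pd.Index([1, 2, 3, 4], name='Id'))
train_y = pd.Series([208500, 181500, 223500, 140000], index=train_X.index)
test_X = pd.DataFrame({'LotArea': [11622.0, 14267.0],
                       'LotFrontage': [80.0, np.nan]},
                      index=pd.Index([1461, 1462], name='Id'))

# Fit the imputer on training data only, then reuse it on the test data
imputer = SimpleImputer(strategy='median')
final_train = pd.DataFrame(imputer.fit_transform(train_X),
                           columns=train_X.columns, index=train_X.index)
final_test = pd.DataFrame(imputer.transform(test_X),
                          columns=test_X.columns, index=test_X.index)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_train, train_y)
preds_test = model.predict(final_test)

# Submission layout: one row per test Id
output = pd.DataFrame({'Id': final_test.index, 'SalePrice': preds_test})
print(output)
```

Passing `index=` when rebuilding the DataFrames keeps the `Id` labels, which `fit_transform` would otherwise discard along with the column names.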

### Part A

Use the next code cell to preprocess the training and validation data. Set the preprocessed DataFrames to `final_X_train` and `final_X_valid`. **You can use any approach of your choosing here!** In order for this step to be marked as correct, you need only ensure:
- the preprocessed DataFrames have the same number of columns,
- the preprocessed DataFrames have no missing values, 
- `final_X_train` and `y_train` have the same number of rows, and
- `final_X_valid` and `y_val` have the same number of rows.
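
These conditions can be checked before training. The sketch below uses a hypothetical helper, `check_preprocessed`, which is not part of the exercise, and made-up toy frames to exercise it.

```python
import numpy as np
import pandas as pd

def check_preprocessed(X_train, X_valid, y_train, y_valid):
    """Hypothetical helper: True if all four conditions hold."""
    same_columns = X_train.shape[1] == X_valid.shape[1]
    no_missing = not (X_train.isnull().values.any() or X_valid.isnull().values.any())
    rows_match = len(X_train) == len(y_train) and len(X_valid) == len(y_valid)
    return bool(same_columns and no_missing and rows_match)

# Made-up toy frames, only to exercise the helper
good_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
good_valid = pd.DataFrame({"a": [7.0], "b": [8.0]})
bad_valid = pd.DataFrame({"a": [np.nan], "b": [8.0]})  # still has a missing value
targets_train = pd.Series([10, 20, 30])
targets_valid = pd.Series([40])

print(check_preprocessed(good_train, good_valid, targets_train, targets_valid))
print(check_preprocessed(good_train, bad_valid, targets_train, targets_valid))
```

In this notebook the call would presumably be `check_preprocessed(final_X_train, final_X_valid, y_train, y_val)`.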

In [155]:
# Preprocessed training and validation features
# Fill each missing value with its column's median (pandas `fillna` would be an alternative)
my_imputer = SimpleImputer(strategy='median')
final_X_train = pd.DataFrame(my_imputer.fit_transform(x_train), columns=x_train.columns)
final_X_valid = pd.DataFrame(my_imputer.transform(x_val), columns=x_val.columns)

In [156]:
final_X_train.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,20.0,90.0,11694.0,9.0,5.0,2007.0,2007.0,452.0,48.0,0.0,...,774.0,0.0,108.0,0.0,0.0,260.0,0.0,0.0,7.0,2007.0
1,20.0,60.0,6600.0,5.0,5.0,1962.0,1962.0,0.0,0.0,0.0,...,308.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
2,30.0,80.0,13360.0,5.0,7.0,1921.0,2006.0,0.0,713.0,0.0,...,432.0,0.0,0.0,44.0,0.0,0.0,0.0,0.0,8.0,2009.0
3,20.0,69.0,13265.0,8.0,5.0,2002.0,2002.0,148.0,1218.0,0.0,...,857.0,150.0,59.0,0.0,0.0,0.0,0.0,0.0,7.0,2008.0
4,20.0,118.0,13704.0,7.0,5.0,2001.0,2002.0,150.0,0.0,0.0,...,843.0,468.0,81.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0


In [157]:
final_X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1168 non-null   float64
 1   LotFrontage    1168 non-null   float64
 2   LotArea        1168 non-null   float64
 3   OverallQual    1168 non-null   float64
 4   OverallCond    1168 non-null   float64
 5   YearBuilt      1168 non-null   float64
 6   YearRemodAdd   1168 non-null   float64
 7   MasVnrArea     1168 non-null   float64
 8   BsmtFinSF1     1168 non-null   float64
 9   BsmtFinSF2     1168 non-null   float64
 10  BsmtUnfSF      1168 non-null   float64
 11  TotalBsmtSF    1168 non-null   float64
 12  1stFlrSF       1168 non-null   float64
 13  2ndFlrSF       1168 non-null   float64
 14  LowQualFinSF   1168 non-null   float64
 15  GrLivArea      1168 non-null   float64
 16  BsmtFullBath   1168 non-null   float64
 17  BsmtHalfBath   1168 non-null   float64
 18  FullBath

Run the next code cell to train and evaluate a random forest model. (Note that we don't use the score_dataset() function above, because we will soon use the trained model to generate test predictions!)

In [158]:
# Define and fit model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

# Get validation predictions and MAE
preds_valid = model.predict(final_X_valid)
print("MAE (Your approach):")
print(mean_absolute_error(y_val, preds_valid))

MAE (Your approach):
17791.59899543379
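
For reference, MAE is the mean of the absolute differences between predicted and true values. The sketch below uses made-up prices, not values from this dataset, to compare a manual computation with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Mean of the absolute errors, computed by hand
manual_mae = np.mean(np.abs(y_true - y_pred))
# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```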


### Part B

Use the next code cell to preprocess your test data.  Make sure that you use a method that agrees with how you preprocessed the training and validation data, and set the preprocessed test features to `final_X_test`.

Then, use the preprocessed test features and the trained model to generate test predictions in `preds_test`.

In order for this step to be marked correct, you need only ensure:
- the preprocessed test DataFrame has no missing values, and
- `final_X_test` has the same number of rows as `x_test`.
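
One way to keep the test preprocessing in agreement with the training preprocessing is to fit the imputer on the training data only and then call `transform` on the test data. The sketch below uses made-up toy frames whose medians differ on purpose. It is an illustration of that pattern, not the exercise's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy data: the train and test medians differ on purpose
toy_train = pd.DataFrame({"LotFrontage": [60.0, 70.0, 80.0, np.nan]})
toy_test = pd.DataFrame({"LotFrontage": [100.0, np.nan, 120.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)  # statistics are learned from the training data only

toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test),  # transform only; no second fit on test data
    columns=toy_test.columns,
    index=toy_test.index,
)
print(toy_test_filled)
```

The missing test entry is filled with the training median (70.0) rather than the test median (110.0), so the model sees test features prepared the same way as the data it was trained on.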

In [160]:
# Preprocess test data with the imputer already fitted on the training data
# (transform only; refitting on the test set would impute with different medians)
final_X_test = pd.DataFrame(my_imputer.transform(x_test), columns=x_test.columns)

# Get test predictions
preds_test = model.predict(final_X_test)

In [162]:
preds_test[:5]

array([125985.5 , 154599.5 , 180070.24, 183544.5 , 197549.92])

In [165]:
pd.DataFrame({'Id' : x_test.index, 'SalePrice' : preds_test}).to_csv('submission.csv', index=False)