**Introduction**

The following dataset shows a history of house sales in Ames, Iowa.

We want to be able to predict a house price based on the information in the given dataset.

In [76]:
#invite friends for the Kaggle party
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 

Know your data - exploratory data analysis!

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

#starting with EDA
train.head()

In [None]:
train.describe()

In [None]:
train.shape

In [None]:
train.columns

There are a lot of columns to work with, let's check which do we need.

We'll begin by inspecting which columns correlate best with 'SalePrice'.

In [None]:
#SalePrice correlation matrix
correlations = train.corr()
cols = correlations.nlargest(10, 'SalePrice').index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show() #display heatmap

#take the 5 columns in which the correlation is highest.
correlations = correlations["SalePrice"].sort_values(ascending=False)
features = correlations.index[1:6]

**Handling Missing Data:**

For a dataset of this size, a lot of missing values can be found. In order to effectively train our model we build, we must first deal with the missing values. There are missing values for both numerical and categorical data. 

For numerical imputing, we would typically fill the missing values with a measure like median, mean, or mode.
For categorical imputing, we'll fill the missing values with the most common term that appeared from the entire column (one of many techniques). 

We can see in the data description file, that for some categories, NaN means something.
This means that if a value is NaN, the house might not have that certain attribute, which will affect the price of the house.
We will deal with it by filling the null cell with "None".

In [None]:
train_null = pd.isnull(train).sum() #number of null values for each column in the train set
test_null = pd.isnull(test).sum() #number of null values for each column in the test set

null = pd.concat([train_null, test_null], axis=1, keys=["Train", "Test"], sort='True')

In [None]:
null_many = null[null.sum(axis=1) > 200]  #many missing values
null_few = null[(null.sum(axis=1) > 0) & (null.sum(axis=1) < 200)]  #few much missing values

In [None]:
null_many

For example, we can see that there are a lot of missing values in the 'Alley' column. A quick look at the description will show us that NaN in the 'Alley' column stands for no alley access.

In [None]:
#more can be found on the description data file provided

null_has_meaning = ["Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature"]

In [None]:
#change the null value to "None" where null means something
for null_value in null_has_meaning:
    train[null_value].fillna("None", inplace=True)
    test[null_value].fillna("None", inplace=True)

Dealing with the "real" NaN values:


In [None]:
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")

We made some changes to null values, so let's update our dataframes:

In [None]:
train_null = pd.isnull(train).sum() #number of null values for each column in the train set
test_null = pd.isnull(test).sum() #number of null values for each column in the test set

null = pd.concat([train_null, test_null], axis=1, keys=["Train", "Test"], sort='True')

In [None]:
null_many = null[null.sum(axis=1) > 200]  #many missing values
null_few = null[(null.sum(axis=1) > 0) & (null.sum(axis=1) < 200)]  #few much missing values

In [None]:
null_many

It seems like 'LotFrontage' has too many null values and it is a numerical value so it may be better to drop it.

In [None]:
train.drop("LotFrontage", axis=1, inplace=True)
test.drop("LotFrontage", axis=1, inplace=True)

In [None]:
null_few

GarageYrBlt, MasVnrArea, and MasVnrType all have a decent amount of missing values. MasVnrType is categorical so we can replace the missing values with "None", as we did before. We'll fill the others with median.

In [None]:
train["GarageYrBlt"].fillna(train["GarageYrBlt"].median(), inplace=True)
test["GarageYrBlt"].fillna(test["GarageYrBlt"].median(), inplace=True)
train["MasVnrArea"].fillna(train["MasVnrArea"].median(), inplace=True)
test["MasVnrArea"].fillna(test["MasVnrArea"].median(), inplace=True)
train["MasVnrType"].fillna("None", inplace=True)
test["MasVnrType"].fillna("None", inplace=True)

We took care of the features with a lot of missing values, now we'll take care of the ones with few missing values.

In [None]:
#split to numerical (type = int, float) and categorical (type = object) featues:

#train set
types_train = train.dtypes #type of each feature in data: int, float, object
num_train = types_train[(types_train == int) | (types_train == float)] 
cat_train = types_train[types_train == object] 

#test set
types_test = test.dtypes
num_test = types_test[(types_test == int) | (types_test == float)]
cat_test = types_test[types_test == object]

Numerical Imputing

In [None]:
sns.distplot(train['SalePrice'])

We can see that our data is skewed right, so we'll impute with median.

In [None]:
#lists are easier to work with, so we'll convert num_train and num_test.
numerical_values_train = list(num_train.index)
numerical_values_test = list(num_test.index)

Those are all of the numerical features in our data:

In [None]:
print (numerical_values_train)

In [None]:
#create a list of all features with missing values
missing_num = []

for feature in numerical_values_train:
    if feature in list(null_few.index):
        missing_num.append(feature)

Those are all of the numerical features with missing data

In [None]:
print (missing_num)

In [None]:
#impute
for feature in missing_num:
    train[feature].fillna(train[feature].median(), inplace=True)
    test[feature].fillna(test[feature].median(), inplace=True)

Categorical Imputing

Those are non-numerical features so we can't use a technique like median value on them. Instead we'll impute with the most common term that appears in the entire list.

In [None]:
categorical_values_train = list(cat_train.index)
categorical_values_test = list(cat_test.index)

All of the categorical features:

In [None]:
print(categorical_values_train)

In [None]:
#create a list of all features with missing values
missing_cat = []

for feature in categorical_values_train:
    if feature in list(null_few.index):
        missing_cat.append(feature)

Those are all of the categorical features with missing data

In [None]:
print(missing_cat)

In [None]:
def most_common_term(lst):
    lst = list(lst)
    return max(set(lst), key=lst.count)
#most_common_term finds the most common term in a series

most_common = ["Electrical", "Exterior1st", "Exterior2nd", "Functional", "KitchenQual", "MSZoning", "SaleType", "Utilities", "MasVnrType"]

counter = 0
for i in missing_cat:
    most_common[counter] = most_common_term(train[i])
    counter += 1

Those are the categorical features with missing values

In [None]:
most_common_dictionary = {missing_cat[x]: [most_common[x]] for x in range(len(most_common))}
most_common_dictionary

In [None]:
#replace null values with most common term
counter = 0
for feature in missing_cat:  
    train[feature].fillna(most_common[counter], inplace=True)
    test[feature].fillna(most_common[counter], inplace=True)
    counter += 1

We took care of both the numerical features and the categorical featues, if all worked according to our plan we shouldn't have any null values left. Since we are being thorough, we will check if all is well.

In [None]:
#updating the null values series
train_null = pd.isnull(train).sum() #number of null values for each column in the train set
test_null = pd.isnull(test).sum() #number of null values for each column in the test set

null = pd.concat([train_null, test_null], axis=1, keys=["Train", "Test"], sort='True')
null[null.sum(axis=1) > 0] #all features with 1 or more null values

An empty table, we did it!

**Feature Engineering**

We have dealt with all of the missing values, now it's time for the next step of our data preprocessing - feature engineering!
We need to create feature vectors in order to get the data ready for our model as training data. To do so, we will have to convert the categorical values into representative numbers.

As we saw earlier, out data is skewed right, so we'll use log transformation on it.

In [None]:
#before log transforamtion
sns.distplot(train['SalePrice'])

In [None]:
train["TransformedPrice"] = np.log(train["SalePrice"])

#after log transformation
sns.distplot(np.log(train["SalePrice"]))

Our target feature SalePrice used to be very skewed, but thanks to the logarithm transformation it is no more.
Now we can see that it is more normally distributed, which works better with machine learning models.

Now we'll look on the catergorical data that needs to be transformed

In [None]:
categorical_values_train = list(cat_train.index)
categorical_values_test = list(cat_test.index)

print(categorical_values_train)

In [None]:
#convert categorical values into representative numbers
#train set
for feature in categorical_values_train:
    feature_set = set(train[feature]) #unique values for the feature
    for cat_val in feature_set:
        feature_list = list(feature_set)
        train.loc[train[feature] == cat_val, feature] = feature_list.index(cat_val)
        
#test set
for feature in categorical_values_test:
    feature_set2 = set(test[feature]) #unique values for the feature
    for cat_val in feature_set2:
        feature_list2 = list(feature_set2)
        test.loc[test[feature] == cat_val, feature] = feature_list2.index(cat_val)

In [None]:
train.head()

In [None]:
test.head()

Looks like we have changed all the categorical strings into a representative number. Now we can move on to the next step.


**Creating, Training, Evaluating, Validating, and Testing ML Models**

We've finished the preprocessing part. Now we know and understand our data much better!
We can start to build and test different models for regression to predit the sale price of each house.
We'll import the models, train them and evaluate them. We'll use the R^2 score and the RMSE to  evaluate our model performance.
We will also use cross validation to optimize our model hyperparameters.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

**Defining Training/Test Sets**

We drop the Id and SalePrice columns for the training set since those are not involved in predicting the Sale Price of a house. The SalePrice column will become our training target. Remember how we transformed SalePrice to make the distribution more normal? Well we can apply that here and make TransformedPrice the target instead of SalePrice. This will improve model performance and yield a much smaller RMSE because of the scale.

In [None]:
X_train = train.drop(["Id", "SalePrice", "TransformedPrice"], axis=1).values
y_train = train["TransformedPrice"].values
X_test = test.drop("Id", axis=1).values

**Splitting into Validation**

It is always good to split our training data again into validation sets. This will help us evaluate our model performance as well as avoid overfitting our model.

In [None]:
from sklearn.model_selection import train_test_split #to create validation data set

X_training, X_valid, y_training, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=0) #X_valid and y_valid are the validation sets

**Linear Regression Model**

In [None]:
linreg = LinearRegression()
parameters_lin = {"fit_intercept" : [True, False], "normalize" : [True, False], "copy_X" : [True, False]}
grid_linreg = GridSearchCV(linreg, parameters_lin, verbose=1 , scoring = "r2")
grid_linreg.fit(X_training, y_training)

print("Best Linear Regression Model: " + str(grid_linreg.best_estimator_))
print("Best Score: " + str(grid_linreg.best_score_))

In [None]:
linreg = grid_linreg.best_estimator_
linreg.fit(X_training, y_training)
lin_pred = linreg.predict(X_valid)
r2_lin = r2_score(y_valid, lin_pred)
rmse_lin = np.sqrt(mean_squared_error(y_valid, lin_pred))
print("R^2 Score: " + str(r2_lin))
print("RMSE Score: " + str(rmse_lin))

In [None]:
scores_lin = cross_val_score(linreg, X_training, y_training, cv=10, scoring="r2")
print("Cross Validation Score: " + str(np.mean(scores_lin)))

**Decision Tree Regressor Model**

In [None]:
dtr = DecisionTreeRegressor()
parameters_dtr = {"criterion" : ["mse", "friedman_mse", "mae"], "splitter" : ["best", "random"], "min_samples_split" : [2, 3, 5, 10], 
                  "max_features" : ["auto", "log2"]}
grid_dtr = GridSearchCV(dtr, parameters_dtr, verbose=1, scoring="r2")
grid_dtr.fit(X_training, y_training)

print("Best DecisionTreeRegressor Model: " + str(grid_dtr.best_estimator_))
print("Best Score: " + str(grid_dtr.best_score_))

In [None]:
dtr = grid_dtr.best_estimator_
dtr.fit(X_training, y_training)
dtr_pred = dtr.predict(X_valid)
r2_dtr = r2_score(y_valid, dtr_pred)
rmse_dtr = np.sqrt(mean_squared_error(y_valid, dtr_pred))
print("R^2 Score: " + str(r2_dtr))
print("RMSE Score: " + str(rmse_dtr))

In [None]:
scores_dtr = cross_val_score(dtr, X_training, y_training, cv=10, scoring="r2")
print("Cross Validation Score: " + str(np.mean(scores_dtr)))

**Random Forest Regressor**

In [None]:
rf = RandomForestRegressor()
paremeters_rf = {"n_estimators" : [5, 10, 15, 20], "criterion" : ["mse" , "mae"], "min_samples_split" : [2, 3, 5, 10], 
                 "max_features" : ["auto", "log2"]}
grid_rf = GridSearchCV(rf, paremeters_rf, verbose=1, scoring="r2")
grid_rf.fit(X_training, y_training)

print("Best RandomForestRegressor Model: " + str(grid_rf.best_estimator_))
print("Best Score: " + str(grid_rf.best_score_))

In [None]:
rf = grid_rf.best_estimator_
rf.fit(X_training, y_training)
rf_pred = rf.predict(X_valid)
r2_rf = r2_score(y_valid, rf_pred)
rmse_rf = np.sqrt(mean_squared_error(y_valid, rf_pred))
print("R^2 Score: " + str(r2_rf))
print("RMSE Score: " + str(rmse_rf))

In [None]:
scores_rf = cross_val_score(rf, X_training, y_training, cv=10, scoring="r2")
print("Cross Validation Score: " + str(np.mean(scores_rf)))

**Evaluation Our Models**

We have built and trained a few different regression models, now we'll compare them to see which one is best and should be used to predict on the test test.

In [None]:
model_performances = pd.DataFrame({
    "Model" : ["Linear Regression", "Decision Tree Regressor", "Random Forest Regressor"],
    "Best Score" : [grid_linreg.best_score_, grid_dtr.best_score_, grid_rf.best_score_],
    "R Squared" : [str(r2_lin)[0:5], str(r2_dtr)[0:5], str(r2_rf)[0:5]],
    "RMSE" : [str(rmse_lin)[0:8], str(rmse_dtr)[0:8], str(rmse_rf)[0:8]]
})

model_performances.round(4)

print("Sorted by Best Score:")
model_performances.sort_values(by="Best Score", ascending=False)

In [None]:
print("Sorted by R Squared:")
model_performances.sort_values(by="R Squared", ascending=False)

In [None]:
print("Sorted by RMSE:")
model_performances.sort_values(by="RMSE", ascending=True)

The RMSEs are small because of the log transformation we performed. So even a 0.1 RMSE may be significant in this case.

I chose to use Random Forest Regressor because it ranked first on 2 of our 3 measurements.
It has a low RMSE and a high R^2.

In [None]:
rf.fit(X_train, y_train)

In [None]:
final_predictions = np.exp(rf.predict(X_test))

In [None]:
results = pd.DataFrame({
        "Id": test["Id"],
        "SalePrice": final_predictions
    })

print(results.shape)

In [75]:
results.head(10)

Unnamed: 0,Id,SalePrice
0,1461,125971.779476
1,1462,157017.488253
2,1463,168426.399809
3,1464,208868.713918
4,1465,176834.384424
5,1466,196824.640202
6,1467,169242.19174
7,1468,175351.270525
8,1469,186593.337109
9,1470,140957.78049
