# House Prices - Advanced Regression Techniques

<img src="banner.png" style="width:1500px"/>

KAGGLE-X MENTORSHIP

My name is Odutayo Odufuwa, and I am mentored by Nafisa Lawal. I am embarking on this project to learn machine learning techniques such as feature engineering, data preprocessing, model evaluation, and model deployment.

The goal of the project is to predict sales prices of houses based on various house features.

<a id="cont"></a>

### Table of Contents

<a href=#zero>0. Overview</a>

<a href=#one>1. Dataset Description</a>

<a href=#two>2. Importing Packages</a>

<a href=#three>3. Loading Data</a>

<a href=#four>4. Exploratory Data Analysis (EDA)</a>

<a href=#five>5. Data Preprocessing/Engineering</a>

<a href=#six>6. Model Building</a>

<a href=#seven>7. Model Evaluation/Performance</a>

<a href=#eight>8. Model Deployment</a>

 <a id="zero"></a>
### 0. Overview
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Overview ⚡ |
| :--------------------------- |
| In this section, the project is described. |

---

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. This dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

 <a id="one"></a>
### 1. Dataset Description
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Dataset Description ⚡ |
| :--------------------------- |
| In this section, the features of the dataset are explored. |

---

In [52]:
with open("data_description.txt", "r") as file:
    lines = file.readlines()

# field lines look like "Name: description"; lines that start with a digit are skipped
columns = []
for line in lines:
    if ":" in line and not line.strip()[0].isdigit():
        columns.append(line[:line.find(":")])

dataset_fields = columns


In [54]:
print(f"The first 5 fields in the dataset are: {dataset_fields[:5]}")

The first 5 fields in the dataset are: ['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street']


See the "data_description.txt" file for more details.

 <a id="two"></a>
### 2. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section, the packages and libraries necessary for analysis are imported. |

---

In [63]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

 <a id="three"></a>
### 3. Loading Data
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading Data ⚡ |
| :--------------------------- |
| In this section, the `train` and `test` data are loaded. |

---

In [56]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [57]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [58]:
df_test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [64]:
print(f"The train data consists of {df_train.shape[0]} rows and {df_train.shape[1]} columns")
print(f"The test data consists of {df_test.shape[0]} rows and {df_test.shape[1]} columns")

The train data consists of 1460 rows and 81 columns
The test data consists of 1459 rows and 80 columns


 <a id="four"></a>
### 4. Exploratory Data Analysis
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory Data Analysis ⚡ |
| :--------------------------- |
| In this section, an in-depth analysis of all the variables in the DataFrame is performed. |

---

In [79]:
print(f"The dataset consists of fields with the following datatypes: {', '.join(str(dtype) for dtype in df_train.dtypes.unique())}")

The dataset consists of fields with the following datatypes: int64, object, float64


In [86]:
numerical_columns = [col for col in df_train.columns if df_train[col].dtype == "int64" or df_train[col].dtype == "float64"]
categorical_columns = [col for col in df_train.columns if df_train[col].dtype == "object"]

print(f"In the train dataset, there are {len(numerical_columns)} numerical columns. There are also {len(categorical_columns)} categorical columns.")

In the train dataset, there are 38 numerical columns. There are also 43 categorical columns.


In [94]:
print("See all numerical columns in the train dataset below\n")
for i in range(0, len(numerical_columns), 10):
    print(numerical_columns[i:i + 10])

See all numerical columns in the train dataset below

['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1']
['BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath']
['HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF']
['EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']


In [95]:
print("See all categorical columns in the train dataset below\n")
for i in range(0, len(categorical_columns), 10):
    print(categorical_columns[i:i + 10])

See all categorical columns in the train dataset below

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1']
['Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond']
['Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical']
['KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence']
['MiscFeature', 'SaleType', 'SaleCondition']


In [99]:
null_counts = df_train.isnull().sum()
null_counts_df = pd.DataFrame({'Column Names': null_counts.index, 'Null Count': null_counts.values})
null_counts_df = null_counts_df[null_counts_df["Null Count"] > 0] # returns columns with null values

null_counts_df

Unnamed: 0,Column Names,Null Count
3,LotFrontage,259
6,Alley,1369
25,MasVnrType,8
26,MasVnrArea,8
30,BsmtQual,37
31,BsmtCond,37
32,BsmtExposure,38
33,BsmtFinType1,37
35,BsmtFinType2,38
42,Electrical,1


In [102]:
print(f"The above output confirms that out of {len(df_train.columns)} columns in the df_train dataset, there are {len(null_counts_df)} columns with null values")

The above output confirms that out of 81 columns in the df_train dataset, there are 19 columns with null values
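Raw null counts are easier to compare when expressed as a share of the rows. The sketch below is a minimal illustration on a hypothetical toy frame (the `toy` name and its values are made up, not taken from `df_train`); the same two lines could be applied to `df_train` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() gives the fraction of missing rows per column.
null_share = toy.isnull().mean().mul(100).round(1)

# Keep only columns with missing values, most affected first.
null_share = null_share[null_share > 0].sort_values(ascending=False)
print(null_share)
```

Columns where most rows are missing are candidates for dropping or for a dedicated "missing" category rather than plain imputation.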


In [103]:
# look at data statistics
df_train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [146]:
# create empty lists for the respective columns to be added
high_negative_skew = []
moderate_negative_skew = []
fairly_symmetrical = []
moderate_positive_skew = []
high_positive_skew = []


def skew(df):  # this evaluates the skewness of each numerical variable
    # numeric_only=True skips the object (categorical) columns
    skewness = df.skew(numeric_only=True).to_dict()
    for k, v in skewness.items():
        # the chained upper bounds leave no gap at exactly -1, -0.5, 0.5 or 1
        if v < -1:
            high_negative_skew.append(k)
        elif v < -0.5:
            moderate_negative_skew.append(k)
        elif v <= 0.5:
            fairly_symmetrical.append(k)
        elif v <= 1:
            moderate_positive_skew.append(k)
        else:
            high_positive_skew.append(k)
    print(f"The following columns are highly negative skewed - {high_negative_skew}.\nThe following columns are moderately negative skewed - {moderate_negative_skew}.\nThe following columns are fairly symmetrical - {fairly_symmetrical}.\nThe following columns are moderately positive skewed - {moderate_positive_skew}.\nThe following columns are highly positive skewed - {high_positive_skew}.\n")

skew(df_train)

The following columns are highly negative skewed - [].
The following columns are moderately negative skewed - ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'].
The following columns are fairly symmetrical - ['Id', 'OverallQual', 'FullBath', 'BedroomAbvGr', 'GarageCars', 'GarageArea', 'MoSold', 'YrSold'].
The following columns are moderately positive skewed - ['OverallCond', 'BsmtUnfSF', '2ndFlrSF', 'BsmtFullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces'].
The following columns are highly positive skewed - ['MSSubClass', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'TotalBsmtSF', '1stFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtHalfBath', 'KitchenAbvGr', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice'].
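`SalePrice` is among the highly positive skewed columns, and the competition scores submissions on the RMSE between the logarithm of the predicted and observed prices, so a log transform of the target is a common option. The sketch below is only an illustration on synthetic data: the `prices` series is drawn from a log-normal distribution as an assumption, not from `df_train`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (an assumption for illustration only).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Skewness before and after a log(1 + x) transform.
skew_before = prices.skew()
skew_after = np.log1p(prices).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p: {skew_after:.2f}")
```

If the target is transformed this way for training, predictions need to be mapped back with `np.expm1` before they are reported as prices.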





In [149]:
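contains_high_outliers = []
no_outliers = []

def kurtosis(df):
    # pandas reports excess kurtosis (a normal distribution scores 0), so the
    # cut-off of 3 flags only strongly heavy-tailed columns; a low score hints
    # at few extreme values but does not prove that there are none
    # numeric_only=True skips the object (categorical) columns
    kurt = df.kurtosis(numeric_only=True).to_dict()
    for k, v in kurt.items():
        if v < 3:
            no_outliers.append(k)
        else:
            contains_high_outliers.append(k)
    print(f"The following columns contain a large number of outliers - {contains_high_outliers}.\nThe following columns contain no outliers - {no_outliers}.")

kurtosis(df_train)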

The following columns contain a large number of outliers - ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'TotalBsmtSF', '1stFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtHalfBath', 'KitchenAbvGr', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice'].
The following columns contain no outliers - ['Id', 'MSSubClass', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtUnfSF', '2ndFlrSF', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'MoSold', 'YrSold'].
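Kurtosis describes tail weight rather than counting outliers, so a direct count can be a useful cross-check. The sketch below uses the 1.5 × IQR rule as one possible choice; the helper `count_iqr_outliers` and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Toy values for illustration only; not taken from df_train.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694, 5642],
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 9],
})

counts = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(counts)
```

On the real data, the same helper could be applied to each entry of `numerical_columns` to see how many rows each column would flag.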



In [None]:
# plot relevant feature interactions

In [None]:
# evaluate correlation
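One way this step might be approached is to rank the numerical features by their Pearson correlation with `SalePrice`. The sketch below is an assumption about how the cell could be filled in, run on a small toy frame with illustrative values rather than on `df_train`.

```python
import pandas as pd

# Toy frame for illustration only; values are not taken from df_train.
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Pearson correlation of every numerical column with the target,
# dropping the target's correlation with itself.
corr_with_target = (
    toy.select_dtypes("number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value does not rule out a useful non-linear relationship.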

In [None]:
# have a look at feature distributions

 <a id="five"></a>
### 5. Data Preprocessing/Engineering
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data Preprocessing/Engineering ⚡ |
| :--------------------------- |
| In this section, the dataset is cleaned and, where the EDA phase suggests it, new features are created. |

---

In [None]:
# remove missing values/ features
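Not every null in this dataset is a true gap: "data_description.txt" lists `NA` for `Alley` as "No alley access", for example. The sketch below shows one possible treatment under that reading, filling such categorical nulls with an explicit label and numerical nulls with the column median. The toy frame, the `"None"` label, and the choice of the median are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values are not taken from df_train.
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

cleaned = toy.copy()

# A null here means the feature is absent, so use an explicit category.
cleaned["Alley"] = cleaned["Alley"].fillna("None")

# A null here is a genuinely unknown measurement; the median is one simple choice.
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())

print(cleaned)
```

Any statistic used for filling, such as the median, would ideally be computed on the train data only and then reused for the test data.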

In [None]:
# create new features

In [None]:
# engineer existing features

 <a id="six"></a>
### 6. Model Building
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model Building ⚡ |
| :--------------------------- |
| In this section, one or more regression models that can accurately predict house prices are created. |

---

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models
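The competition scores submissions on the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric follows; the function name `rmse_log` and the example prices are hypothetical.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up observed and predicted prices for illustration.
observed = [208500, 181500, 223500]
predicted = [200000, 190000, 230000]
score = rmse_log(observed, predicted)
print(f"RMSE on log prices: {score:.4f}")
```

Because the error is taken on logs, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.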

 <a id="seven"></a>
### 7. Model Evaluation/Performance
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model Evaluation/Performance ⚡ |
| :--------------------------- |
| In this section, the relative performance of the trained ML models on the dataset is compared, with comments on which model is best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

 <a id="eight"></a>
### 8. Model Deployment
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model Deployment ⚡ |
| :--------------------------- |
| In this section, the model is deployed. |

---