# Predicting House Prices with Advanced Regression Techniques

<p><a name="sections"></a></p>

## Sections

- <a href="#Understanding">Data Understanding</a><br>
    - <a href="#Meanings">Feature Meanings</a><br>
    - <a href="#Categorized">Features Categorized</a><br>
    - <a href="#Distributions">Feature Distributions</a><br>
    - <a href="#Missing">Missing Values</a><br>
    - <a href="#Correlation">Correlation Among Numerical Features</a><br>
    - <a href="#Outliers">Outliers</a><br>
- <a href="#Preparation">Data Preparation</a><br>
    - <a href="#Engineering">Feature Engineering</a><br>
        - <a href="#Drop">Drop Columns</a><br>
        - <a href="#H_Outliers">Handle Outliers</a><br>
        - <a href="#H_Missing">Handle Missing Values</a><br>
        - <a href="#Impute">Imputation</a><br>        
        - <a href="#Encode">Encoding</a><br>  
        - <a href="#Normalize">Normalization</a><br>  
        - <a href="#Selection">Feature Selection</a><br>          
- <a href="#Modeling">Data Modeling</a><br>

Useful Links:
- https://www.kaggle.com/c/house-prices-advanced-regression-techniques
- http://jse.amstat.org/v19n3/decock.pdf
- http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

<p><a name="Understanding"></a></p>

## Data Understanding: Exploratory Data Analysis

Import Packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Show every row and column when displaying DataFrames.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Load the Ames, Iowa Housing Data Set:

In [None]:
housing=pd.read_csv('data/train.csv')

In [None]:
housing.head()

In [None]:
housing.shape

In [None]:
housing.info()

<p><a name="Meanings"></a></p>

### Feature Meanings
- Id: Observation number
- MSSubClass: Identifies the type of dwelling involved in the sale
- MSZoning: Identifies the general zoning classification of the sale
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access to property
- Alley: Type of alley access to property
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to various conditions
- Condition2: Proximity to various conditions (if more than one is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Rates the overall material and finish of the house
- OverallCond: Rates the overall condition of the house
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Evaluates the quality of the material on the exterior 
- ExterCond: Evaluates the present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Evaluates the height of the basement
- BsmtCond: Evaluates the general condition of the basement
- BsmtExposure: Refers to walkout or garden level walls
- BsmtFinType1: Rating of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Rating of basement finished area (if multiple types)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
- KitchenAbvGr: Kitchens above grade
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality (Assume typical unless deductions are warranted)
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Value of miscellaneous feature
- MoSold: Month Sold (MM)
- YrSold: Year Sold (YYYY)
- SaleType: Type of sale
- SaleCondition: Condition of sale
- SalePrice: Sale price 



<p><a name="Categorized"></a></p>

### Features Categorized
- **20 Continuous Variables**
    - LotFrontage: Linear feet of street connected to property
    - LotArea: Lot size in square feet
    - MasVnrArea: Masonry veneer area in square feet
    - BsmtFinSF1: Type 1 finished square feet
    - BsmtFinSF2: Type 2 finished square feet
    - BsmtUnfSF: Unfinished square feet of basement area
    - TotalBsmtSF: Total square feet of basement area
    - 1stFlrSF: First Floor square feet
    - 2ndFlrSF: Second floor square feet
    - LowQualFinSF: Low quality finished square feet (all floors)
    - GrLivArea: Above grade (ground) living area square feet
    - GarageArea: Size of garage in square feet
    - WoodDeckSF: Wood deck area in square feet
    - OpenPorchSF: Open porch area in square feet
    - EnclosedPorch: Enclosed porch area in square feet
    - 3SsnPorch: Three season porch area in square feet
    - ScreenPorch: Screen porch area in square feet
    - PoolArea: Pool area in square feet
    - MiscVal: Value of miscellaneous feature
    - SalePrice: Sale price 
    

- **15 Discrete Variables**
    - Id: Observation number
    - YearBuilt: Original construction date
    - YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
    - BsmtFullBath: Basement full bathrooms
    - BsmtHalfBath: Basement half bathrooms
    - FullBath: Full bathrooms above grade
    - HalfBath: Half baths above grade
    - BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
    - KitchenAbvGr: Kitchens above grade
    - TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
    - Fireplaces: Number of fireplaces
    - GarageYrBlt: Year garage was built
    - GarageCars: Size of garage in car capacity
    - MoSold: Month Sold (MM)
    - YrSold: Year Sold (YYYY)


- **46 Categorical Variables**
    - 23 Nominal Variables:
        - MSSubClass: Identifies the type of dwelling involved in the sale.
        - MSZoning: Identifies the general zoning classification of the sale.
        - Street: Type of road access to property
        - Alley: Type of alley access to property
        - LandContour: Flatness of the property
        - LotConfig: Lot configuration
        - Neighborhood: Physical locations within Ames city limits
        - Condition1: Proximity to various conditions
        - Condition2: Proximity to various conditions (if more than one is present)
        - BldgType: Type of dwelling
        - HouseStyle: Style of dwelling
        - RoofStyle: Type of roof
        - RoofMatl: Roof material
        - Exterior1st: Exterior covering on house
        - Exterior2nd: Exterior covering on house (if more than one material)
        - MasVnrType: Masonry veneer type
        - Foundation: Type of foundation
        - Heating: Type of heating
        - CentralAir: Central air conditioning
        - GarageType: Garage location
        - MiscFeature: Miscellaneous feature not covered in other categories
        - SaleType: Type of sale
        - SaleCondition: Condition of sale
        
    - 23 Ordinal Variables:
        - LotShape: General shape of property
        - Utilities: Type of utilities available
        - LandSlope: Slope of property
        - OverallQual: Rates the overall material and finish of the house
        - OverallCond: Rates the overall condition of the house
        - ExterQual: Evaluates the quality of the material on the exterior 
        - ExterCond: Evaluates the present condition of the material on the exterior
        - BsmtQual: Evaluates the height of the basement
        - BsmtCond: Evaluates the general condition of the basement
        - BsmtExposure: Refers to walkout or garden level walls
        - BsmtFinType1: Rating of basement finished area
        - BsmtFinType2: Rating of basement finished area (if multiple types)
        - HeatingQC: Heating quality and condition
        - Electrical: Electrical system
        - KitchenQual: Kitchen quality
        - Functional: Home functionality (Assume typical unless deductions are warranted)
        - FireplaceQu: Fireplace quality
        - GarageFinish: Interior finish of the garage
        - GarageQual: Garage quality
        - GarageCond: Garage condition
        - PavedDrive: Paved driveway
        - PoolQC: Pool quality
        - Fence: Fence quality




- **_I should drop the Id column since it's redundant._**

In [None]:
housing.Id

<p><a name="Distributions"></a></p>

### Feature Distributions

Distribution of Numerical Features:

In [None]:
housing.hist(bins=50, figsize=(20,15))

Distribution of Continuous Features:

In [None]:
continuous = ['LotFrontage',
'LotArea',
'MasVnrArea',
'BsmtFinSF1',
'BsmtFinSF2',
'BsmtUnfSF',
'TotalBsmtSF',
'1stFlrSF',
'2ndFlrSF',
'LowQualFinSF',
'GrLivArea',
'GarageArea',
'WoodDeckSF',
'OpenPorchSF',
'EnclosedPorch',
'3SsnPorch',
'ScreenPorch',
'PoolArea',
'MiscVal',
'SalePrice']

In [None]:
housing[continuous].hist(bins=50, figsize=(20,15))

In [None]:
housing['PoolArea'].value_counts()

In [None]:
housing['LowQualFinSF'].value_counts()

- **_Many of the continuous features are heavily right-skewed. Perhaps I should normalize them._** 
- **_Perhaps I can drop features such as 'PoolArea' and 'LowQualFinSF', which are zero for almost every house and seem to add little to the data._** 
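One way to act on the skewness observation is to measure it and compare it before and after a log transform. The sketch below uses a synthetic right-skewed sample rather than the housing data; the column name `LotArea` is reused only for illustration, and `np.log1p` is one common choice, not necessarily the transform applied later in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a column such as LotArea.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'LotArea': rng.lognormal(mean=9.0, sigma=0.8, size=1000)})

skew_before = demo['LotArea'].skew()            # sample skewness of the raw values
skew_after = np.log1p(demo['LotArea']).skew()   # skewness after log(1 + x)

print(f'skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}')
```

A skewness near zero after the transform suggests the feature is closer to symmetric, which tends to suit linear models better.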

Descriptive Statistics of Numerical Features:

In [None]:
housing.describe()

Count distribution of Categorical Features:

In [None]:
# np.object was removed from NumPy; the string 'object' selects the same columns.
housing.select_dtypes(include=['object']).columns.tolist()

In [None]:
categorical = ['MSSubClass',
'MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SaleCondition']

In [None]:
for feature in categorical:
    display(housing[feature].value_counts().to_frame())
    print(f'There are {housing[feature].unique().size} unique values (NaN included) in {feature}.')

- **_Street and Utilities, to name a couple, are dominated by a single value and seem useless._**
- **_PoolQC seems to be missing most of its data..._**
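The "dominated by a single value" impression can be quantified as the share of rows taken by each column's most frequent value. This is a minimal sketch on a toy frame; the 95% cutoff is an arbitrary assumption, not a value taken from this analysis.

```python
import pandas as pd

# Toy frame: 'Street' is nearly constant, 'Neighborhood' is not.
demo = pd.DataFrame({
    'Street': ['Pave'] * 99 + ['Grvl'],
    'Neighborhood': ['NAmes', 'CollgCr', 'OldTown', 'Edwards'] * 25,
})

# Share of rows taken by the most frequent value in each column (NaN counts as a value).
top_share = demo.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Hypothetical cutoff: flag columns where one value covers at least 95% of rows.
near_constant = top_share[top_share >= 0.95].index.tolist()
print(top_share)
print(near_constant)
```

Running the same two statements on `housing[categorical]` would give a ranked list of candidates to drop.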

In [None]:
housing['PoolQC']

<p><a name="Missing"></a></p>

### Missing Values

In [None]:
housing.columns[housing.isnull().any()].size

In [None]:
missing_cols = housing.columns[housing.isnull().any()].tolist()
missing_cols

In [None]:
missing = 1 - housing.count()/len(housing) 
missing

Fraction of missing values per feature (multiply by 100 for a percentage):

In [None]:
missing = missing[missing > 0]
missing

In [None]:
missing.sort_values(ascending=False)

In [None]:
missing.sort_values().plot.bar(color=np.random.rand(len(missing),3))

- **_I may have to get rid of some features due to the preponderance of missing values, and find suitable replacements for the missing values in other features._**
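To decide which features to drop and which to fill, it can help to see the missing count and the missing fraction side by side. The helper below is a sketch; `missing_summary` is a hypothetical name, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and fraction of missing values for columns that have any."""
    summary = pd.DataFrame({
        'n_missing': df.isnull().sum(),
        'fraction': df.isnull().mean(),
    })
    return summary[summary['n_missing'] > 0].sort_values('fraction', ascending=False)

# Toy data; the values are made up for illustration.
demo = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
})
summary = missing_summary(demo)
print(summary)
```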


Relationship Between Missing Values and SalePrice:

In [None]:
housing[missing_cols]

In [None]:
# Note: Median is used below since that measure of central tendency is resistant to outliers.
# Flag each entry of the columns with missing data: 1 = missing, 0 = present.
housing1 = housing[missing_cols].isnull().astype(int)
housing1 = pd.concat([housing1, housing[['SalePrice']]], axis=1)
housing1

In [None]:
for feature in missing_cols:
    housing1.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(f'Missingness in {feature}')
    plt.ylabel('SalePrice')
    plt.show()

**_Due to the relationship between the missing values and SalePrice, it would prove useful to replace them with meaningful values instead of just dropping the columns._**
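If missingness itself carries signal, one option is to record it as an explicit 0/1 indicator column before filling the gaps, so the information survives the replacement. The sketch below is an assumption about how that could look; the helper name and the `_was_missing` suffix are hypothetical, and the prices are made up.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, columns):
    """Add a 0/1 '<column>_was_missing' flag for each listed column."""
    out = df.copy()
    for col in columns:
        out[f'{col}_was_missing'] = out[col].isnull().astype(int)
    return out

demo = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd', np.nan],
    'SalePrice': [200000, 100000, 150000, 110000],
})
flagged = add_missing_indicators(demo, ['GarageType'])

# Median SalePrice for rows with (1) and without (0) a missing GarageType.
medians = flagged.groupby('GarageType_was_missing')['SalePrice'].median()
print(medians)
```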

<p><a name="Correlation"></a></p>

### Correlation Among the Numerical Features 

In [None]:
numerical = housing.select_dtypes(include=[np.number]).columns.tolist()
numerical

In [None]:
correlation = housing[numerical].corr()
correlation

In [None]:
correlation['SalePrice'].sort_values(ascending=False)

In [None]:
f,ax = plt.subplots(figsize=(25,25))
sns.heatmap(correlation, vmax=.8, linewidths=0.01, linecolor='white', square=True)

**_There is high multicollinearity._**

The features whose correlation with SalePrice is greater than 0.5 (the list starts with SalePrice itself):

In [None]:
n=11
features = correlation['SalePrice'].sort_values(ascending=False).index.tolist()[:n]
features

In [None]:
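# rowvar=False treats each column (feature) as a variable; the default would correlate rows.
np.corrcoef(housing[features].values, rowvar=False)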

In [None]:
correlation.loc[features,features]

In [None]:
f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(correlation.loc[features,features], vmax=0.8, annot=True, linewidths=0.01, linecolor='white', square=True)

**_Multicollinearity can be seen more plainly. For instance, GarageCars and GarageArea have a Pearson correlation coefficient of 0.88. Since SalePrice and GarageCars have a higher correlation coefficient than SalePrice and GarageArea, it'd make sense to get rid of GarageArea._**
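Rather than reading such pairs off the heatmap, they can be listed directly from the correlation matrix. This is a sketch on synthetic data in which `GarageArea` is constructed to track `GarageCars`; the helper name and the 0.8 threshold are assumptions.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.8):
    """List (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic data: GarageArea is built to track GarageCars closely.
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=500)
demo = pd.DataFrame({
    'GarageCars': cars,
    'GarageArea': cars * 250 + rng.normal(0, 40, size=500),
    'YrSold': rng.integers(2006, 2011, size=500),
})
pairs = correlated_pairs(demo.corr())
print(pairs)
```

On the real data, `correlated_pairs(correlation)` would apply the same check to the matrix computed above.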

<p><a name="Outliers"></a></p>

### Outliers

Documentation: 

"There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students."

In [None]:
plt.scatter(housing['GrLivArea'],housing['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()

In [None]:
housing['GrLivArea'].nlargest(5)

In [None]:
f,ax = plt.subplots(figsize=(10,5))
plt.boxplot(housing['GrLivArea'], vert=False)
plt.yticks(ticks=[])
plt.title('GrLivArea')
plt.show()

I see 4 values above 4,000 here. Let's see what other outliers may exist in the data:

In [None]:
for feature in continuous:
    f,ax = plt.subplots(figsize=(12,3))
    sns.boxplot(x=housing[feature])
    plt.yticks(ticks=[])
    plt.title(feature)
    plt.show()

**_Outliers exist in the data. They'll have to be dealt with._**
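The box plots flag points beyond 1.5 times the interquartile range from the quartiles, which is the default whisker rule. The same rule can be applied numerically to count flagged values per feature. The sketch below uses made-up values; `iqr_outlier_counts` is a hypothetical helper.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Toy values; 5642 plays the role of an unusually large house.
demo = pd.DataFrame({'GrLivArea': [1200, 1350, 1500, 1600, 1710, 1800, 2000, 5642]})
counts = iqr_outlier_counts(demo)
print(counts)
```

Calling `iqr_outlier_counts(housing[continuous])` would give one count per continuous feature.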

<p><a name="Preparation"></a></p>

## Data Preparation

<p><a name="Engineering"></a></p>

### Feature Engineering

In [None]:
train,test=pd.read_csv('data/train.csv'),pd.read_csv('data/test.csv')

<p><a name="Drop"></a></p>

#### Drop Columns

In [None]:
housing.head()

In [None]:
housing.drop(columns='Id').columns

In [None]:
def drop_columns(dataset):
    data = dataset.copy()
    return data.drop(columns='Id')

In [None]:
housingD = drop_columns(housing)
housingD.columns

In [None]:
train,test=drop_columns(train),drop_columns(test)

_Note: Multicollinearity generally does **not** hurt the predictive accuracy of a model, including regression models, as long as new data share the same correlation structure. It does, however, make coefficient estimates and feature importance scores unstable. Since the objective is to predict house prices, and not to determine the most important features, I won't attempt to mitigate multicollinearity._
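The note above can be checked on synthetic data. In this sketch, `x2` is nearly a copy of `x1`, and an ordinary least squares fit is scored on held-out rows with and without the redundant feature. All names and numbers here are illustrative assumptions; nothing is taken from the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)       # nearly a copy of x1
y = 3.0 * x1 + rng.normal(size=n)

def holdout_rmse(X, y, n_train=500):
    """Fit ordinary least squares on the first n_train rows; score on the rest."""
    X = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    residuals = y[n_train:] - X[n_train:] @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

rmse_both = holdout_rmse(np.column_stack([x1, x2]), y)
rmse_one = holdout_rmse(x1.reshape(-1, 1), y)
print(f'both features: {rmse_both:.3f}, x1 only: {rmse_one:.3f}')
```

The two errors should come out close to each other, even though the individual coefficients on `x1` and `x2` are poorly determined when both are included.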

<p><a name="H_Outliers"></a></p>

#### Handle Outliers

In [None]:
housingD.shape

In [None]:
housingO = housingD.copy()

In [None]:
housingO[housingO['GrLivArea'] > 4000].shape

In [None]:
housingO = housingO[housingO['GrLivArea'] < 4000]
housingO.shape

In [None]:
housingO[housingO['LotFrontage'] > 200].shape

In [None]:
housingO = housingO[(housingO['LotFrontage'].isna()) | (housingO['LotFrontage'] < 200)]
housingO.shape

In [None]:
housingO[housingO['LotArea'] > 100000].shape

In [None]:
housingO = housingO[housingO['LotArea'] < 100000] 
housingO.shape

In [None]:
def handle_outliers(dataset):
    data = dataset.copy()
    condition1 = data['GrLivArea'] < 4000
    condition2 = (data['LotFrontage'].isna()) | (data['LotFrontage'] < 200)
    condition3 = data['LotArea'] < 100000
    return data[condition1 & condition2 & condition3]

In [None]:
handle_outliers(housingD).shape

In [None]:
# Only filter the training rows; every test row still needs a prediction.
train = handle_outliers(train)

<p><a name="H_Missing"></a></p>

#### Handle Missing Values

In [None]:
missing.sort_values(ascending=False)

According to the documentation, **NA** in the following features means "not present" or "no access." Let's give the **NA**s a suitable value so that they are not interpreted as _missing_.

    PoolQC          
    MiscFeature     
    Alley           
    Fence           
    FireplaceQu     
    GarageYrBlt     
    GarageType      
    GarageFinish    
    GarageQual      
    GarageCond      
    BsmtFinType2    
    BsmtExposure    
    BsmtFinType1    
    BsmtCond        
    BsmtQual        
    MasVnrArea      
    MasVnrType      

In [None]:
housingM = housingO.copy()
housingM.head()

In [None]:
def replace_na_cat(dataset,features):
    data = dataset.copy()
    data[features] = data[features].fillna('NotApplicable')
    return data

In [None]:
features = ['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

housingM=replace_na_cat(housingM,features)
housingM.head(10)

In [None]:
train,test=replace_na_cat(train,features),replace_na_cat(test,features)

Note: Since there are no 0 (zero) values in the 'LotFrontage' feature, it is possible that the missing values, although not specified in the documentation, should be zero to reflect that the street is not directly connected to the property. The lack of **NA** values in the 'Street' feature does not affect my hypothesis, since one can still have road access without the road being directly connected to one's property. So, let's set the **NA** values in that column to zero, along with those in the other numerical features.
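The premise of this hypothesis (no recorded zeros, some recorded **NA**s) is easy to verify before filling. The sketch below runs the check on a made-up column; on the real data the same three expressions would be applied to `housing['LotFrontage']`.

```python
import numpy as np
import pandas as pd

# Toy column; the values are made up for illustration.
demo = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0, np.nan, 21.0]})

n_zero = int((demo['LotFrontage'] == 0).sum())     # rows recorded as exactly 0
n_missing = int(demo['LotFrontage'].isna().sum())  # rows recorded as NA
smallest = demo['LotFrontage'].min()               # min() skips NA values

print(n_zero, n_missing, smallest)
```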

In [None]:
def replace_na_num(dataset,features):
    data = dataset.copy()
    data[features] = data[features].fillna(0)
    return data

In [None]:
features = ['MasVnrArea', 'GarageYrBlt', 'LotFrontage']

housingM=replace_na_num(housingM,features)
housingM.head(10)

In [None]:
train,test=replace_na_num(train,features),replace_na_num(test,features)

In [None]:
missing = 1 - housingM.count()/len(housingM) 
missing = missing[missing > 0]
missing.sort_values(ascending=False)

In [None]:
housingM[housingM['Electrical'].isna()]

In [None]:
housingM['SalePrice'].median()

In [None]:
# From above:
housing1.groupby('Electrical')['SalePrice'].median().plot.bar()
plt.title('Missingness in Electrical')
plt.ylabel('SalePrice')
plt.show()

Only one feature with missing values remains, it has a single missing value, and the SalePrice of that house is fairly close to the median SalePrice, so let's just drop that row.

In [None]:
housingM.shape

In [None]:
housingM[~housingM['Electrical'].isna()].shape


In [None]:
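def remove_missing(dataset,features):
    # Drop the rows that have a missing value in any of the given column(s).
    subset = [features] if isinstance(features, str) else list(features)
    return dataset.dropna(subset=subset)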

In [None]:
features = 'Electrical'
housingM = remove_missing(housingM,features)
housingM.shape

In [None]:
housingM

In [None]:
train,test=remove_missing(train,features),remove_missing(test,features)

<p><a name="Impute"></a></p>

#### Imputation
Not applicable; all of the missing values are accounted for.

In [None]:
# # pip install impyute
# from impyute.imputation.cs import mice
# imputed_numerics = mice(housingFE[numerical])
# imputed_numerics.columns = housingFE[numerical].columns
# imputed_numerics.head() 


<p><a name="Encode"></a></p>

#### Encoding
I need to encode the following features:

In [None]:
housingE = housingM.copy()

In [None]:
housingM

In [None]:
# List the text-valued columns that still need to be encoded.
housingE.select_dtypes(include=['object']).columns.tolist()
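A common way to encode these two kinds of categorical features is to map ordinal labels to ranks and to one-hot encode nominal ones. The sketch below is one possible approach, not necessarily the one this notebook ends up using. The quality labels follow the data documentation; mapping the `'NotApplicable'` placeholder to 0 is an assumption of the sketch.

```python
import pandas as pd

# Quality scale from the data documentation, with the 'NotApplicable'
# placeholder used above mapped to 0 (an assumption of this sketch).
quality_scale = {'NotApplicable': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

demo = pd.DataFrame({
    'KitchenQual': ['Gd', 'TA', 'Ex'],
    'FireplaceQu': ['NotApplicable', 'TA', 'Gd'],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

encoded = demo.copy()
# Ordinal features: replace each label with its rank.
for col in ['KitchenQual', 'FireplaceQu']:
    encoded[col] = encoded[col].map(quality_scale)

# Nominal features: one 0/1 column per category.
encoded = pd.get_dummies(encoded, columns=['Street'], dtype=int)
print(encoded)
```

Ordinal features with a different label set (for example `BsmtExposure` or `Functional`) would each need their own mapping.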