<a href="https://colab.research.google.com/github/K-Erath/Dataquest/blob/master/16_Predicting_House_Sale_Prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting House Sale Prices
In this project we will explore ways to improve models, build intuition for model based learning, explore linear regression, understand different approaches to model fitting, and use techniques for cleaning transforming and selecting features. 

We will work with the city of Ames, Iowa from 2006 to 2010.You can read more about how the data was collected [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). We will become more familiar with the columns by reading the data [documentation](https://s3.amazonaws.com/dq-content/307/data_description.txt). This will help us determine what data transformations (if any) are necessary. Succeeding in predictive modeling is highly dependent on the quality of features the model has. Libraries like scikit-learn have made it quick and easy to simply try and tweak many different models, but cleaning, selecting, and transforming features are still more of an art that requires a bit of human ingenuity.

We will start by setting up a pipeline of functions that will let us quickly iterate on different models. 

![img_01](content/drive/MyDrive/Training/Dataquest - Data Scientist in Python/Step 7 Machine Learning Fundamentals/04 Linear Regression for Machine Learning/GuidedProject_PredictingHouseSalePrices/img_01.png)


[Solution notebook](https://github.com/dataquestio/solutions/blob/master/Mission240Solutions.ipynb)

[Community](https://community.dataquest.io/tags/c/share/47/240)

## Import Modules and Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle
from math import sqrt
import datetime
from sklearn.model_selection import cross_val_score, KFold
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [2]:
!python --version

Python 3.7.15


In [3]:
pd.__version__

'1.3.5'

In [4]:
from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [5]:
%cd gdrive/MyDrive/Training/Dataquest-DataScientistPython/Step-7-Machine-Learning-Fundamentals/04-Linear-Regression-for-Machine-Learning/GuidedProject_PredictingHouseSalePrices

/content/gdrive/MyDrive/Training/Dataquest-DataScientistPython/Step-7-Machine-Learning-Fundamentals/04-Linear-Regression-for-Machine-Learning/GuidedProject_PredictingHouseSalePrices


In [6]:
csv = "AmesHousing.tsv"

ames_df = pd.read_csv(csv, delimiter="\t")
ames_df.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [7]:
ames_df.shape

(2930, 82)

## First Draft Model
These initial functions will be updated throughout the project as we improve the model.

In [8]:
def transform_features(df):
    '''for now return a copy of the dataframe'''
    transformed_df = df.copy()
    return transformed_df

def select_features(df, high_corr_cols=[]):
    '''for now return Gr Liv Area and SalePrice columns from the train data frame.'''
    if high_corr_cols==[]:
        selected_df = df.copy()
    else:
        selected_df = df[high_corr_cols].copy()
    return selected_df

def train_and_test(df):
    '''
    * Selects the first 1460 rows from from data and assign to train.
    * Selects the remaining rows from data and assign to test.
    * Trains a model using all numerical columns except the SalePrice column (the target column) from the data frame returned from select_features().
    * Tests the model on the test set and returns the RMSE value.
    '''
    # get numeric columns and print warning about non-numeric columns
    num_cols = df.select_dtypes(include=np.number).columns.tolist()
    non_num_cols = [col for col in df.columns if col not in num_cols]
    if len(non_num_cols) > 0:
        print(f'WARNING: Non-numeric columns will not be included in the model: {non_num_cols}')

    # shuffling a pandas dataframe with sklearn
    shuffled_df = shuffle(df, random_state=1)

    # split dataframe into train and test
    train_df = shuffled_df.iloc[:1460]
    test_df = shuffled_df.iloc[1460:]
    target_col = 'SalePrice'
    features = [col for col in num_cols if col != target_col]

    lr = LinearRegression()
    x = train_df[features]
    y = train_df[target_col]
    reg = lr.fit(X=x, y=y)
    y_pred_train = reg.predict(x)
    y_pred_test = reg.predict(test_df[features])

    rmse_train = sqrt(mean_squared_error(
        y_true=y,
        y_pred=y_pred_train
    ))

    rmse_test = sqrt(mean_squared_error(
        y_true=test_df[target_col],
        y_pred=y_pred_test
    ))
    print(f'Test RMSE: {rmse_test}, Train RMSE: {rmse_train}')
    return [rmse_test, rmse_train]

In [9]:
transformed_df = transform_features(ames_df)

In [10]:
# only use two columns for now, if we use all columns we will run into errors because of null values
filtered_df = select_features(transformed_df, high_corr_cols=['Gr Liv Area', 'SalePrice'])
filtered_df.shape

(2930, 2)

In [11]:
rmse = train_and_test(filtered_df)

Test RMSE: 55943.34407431997, Train RMSE: 57086.31613271551


In [12]:
results = {'original': rmse}
results

{'original': [55943.34407431997, 57086.31613271551]}

# Feature Engineering
Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. 

#### Remove features that we don't want to use in the model:
* Drop any column from the data frame with more than 25% missing values.
* Remove columns that are not relevant to machine learning.
* Remove any columns that leak information about the sale (e.g. year the sale happened). 

#### Transorm features into the proper format:
* For any column missing 5% or less of its values, fill null values witth he most frequent value.
* Numerical to categorical
* Text to categorical
* Categorical to numerical
* Text to numerical
* Scaling numerical
* Create new features by combining other features.

##### **Notes on Data leakage**
##### Definition
* Any feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction, is a feature that can introduce leakage to your model.
* When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict.

##### Problem
* You may be creating overly optimistic models that are practically useless and cannot be used in production.
* The effect is overfitting your training data and having an overly optimistic evaluation of your models performance on unseen data.

##### How to tell?  
* An easy way to know you have data leakage is if you are achieving performance that seems a little too good to be true.

##### How to minimize?  
1. Perform data preparation within your cross validation folds.
2. Hold back a validation dataset for final sanity check of your developed models.

[source](https://machinelearningmastery.com/data-leakage-machine-learning/)


### Drop columns that are not relevant to machine learning or that leak information about the sale.

In [13]:
# drop columns that are not useful for machine learning
drop_cols = ['Order', 'PID']
# print(f'Dropping columns not useful for machine learning: {drop_cols}')
transformed_df = transformed_df.drop(columns=drop_cols)

In [14]:
# drop columns that leak information
drop_cols = ['Mo Sold', 'Sale Type', 'Sale Condition']
# print(f'Dropping columns that leak information: {drop_cols}')
transformed_df = transformed_df.drop(columns=drop_cols)

### Deal with null values

In [15]:
# drop columns where more than 25% of values are missing
# total number of rows
len_df = transformed_df.shape[0]
# count of missing values in each column
missing_count = transformed_df.isna().sum()
# 25% of rows
twentyfive_percent = int(len_df*0.25)
# columns to drop
drop_cols = missing_count[missing_count > twentyfive_percent].index
# drop columns where the missing count is greater than 25%
transformed_df = transformed_df.drop(columns=drop_cols)

In [16]:
# For columns that have 5% or less missing values, we will fill in missing values with the most frequent value
# 5% of values
five_percent = int(len_df*0.05)
# filter series by missing less than 5%
missing_0_to_5 = missing_count[(missing_count < five_percent) & (missing_count > 0)]
# columns to fill with mode
cols_to_fill = missing_0_to_5.index
# fill na values with mode
transformed_df[cols_to_fill] = transformed_df[cols_to_fill].fillna(transformed_df.mode().iloc[0])

In [17]:
# find columns that still have missing values
missing_count = transformed_df.isna().sum()
missing_percent = transformed_df.isna().sum() / len(transformed_df)
columns = ['missing_count','missing_percent']
missing_df = pd.concat([missing_count,missing_percent], axis=1, keys=columns)
missing_df = missing_df[missing_df['missing_percent'] > 0].sort_values('missing_percent', ascending=False)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_df)

               missing_count  missing_percent
Lot Frontage             490         0.167235
Garage Yr Blt            159         0.054266
Garage Finish            159         0.054266
Garage Qual              159         0.054266
Garage Cond              159         0.054266
Garage Type              157         0.053584


In [18]:
filtered_df = select_features(transformed_df, high_corr_cols=[col for col in transformed_df.columns if col not in missing_df.index])
rmse = train_and_test(filtered_df)

Test RMSE: 31309.33622848826, Train RMSE: 34532.95844367829


In [19]:
# get list of garage columns
garage_cols = [col for col in transformed_df.columns if 'garage' in col.lower()]
garage_cols

['Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Area',
 'Garage Qual',
 'Garage Cond']

In [20]:
garage_bool = transformed_df['Garage Cars'] == 0
no_garage_bool = transformed_df['Garage Cars'] != 0
# if there is a garage, fill na values with the mode
transformed_df.loc[garage_bool, garage_cols] = transformed_df.loc[garage_bool, garage_cols].fillna(transformed_df.mode().iloc[0])
# if there is not a garage, for Garage Yr Blt fill with Yr Sold
transformed_df.loc[no_garage_bool, 'Garage Yr Blt'] = transformed_df.loc[no_garage_bool, 'Garage Yr Blt'].fillna(transformed_df['Yr Sold'])
# if there is not a garage for text columns, fill na values with 'NA'
transformed_df.loc[no_garage_bool, garage_cols] = transformed_df.loc[no_garage_bool, garage_cols].fillna('NA')

In [21]:
filtered_df = select_features(transformed_df, high_corr_cols=[col for col in transformed_df.columns if col not in 'Lot Frontage'])
rmse = train_and_test(filtered_df)

Test RMSE: 31137.44397514447, Train RMSE: 34375.42169380369


A Google search revealed that lot frontage was important for determining sale price, and that it can vary a lot by neighborhood. 

In [22]:
# # get statistics using groupby() & describe()
# describe_df = transformed_df.groupby(['Neighborhood'])['Lot Frontage'].describe()
# # look at how standard deviation varies by neighborhood, compared to the dataset
# fig, ax = plt.subplots(figsize=(15,7))
# describe_df[['std']].plot(kind='bar', ax=ax)
# ax.axhline(transformed_df['Lot Frontage'].std(), color='green', lw=2)
# ax.legend(['Standard Deviation', 'Lot Frontage'])

Generally, the standard deviation is lower for the neighborhoods than for the entire dataset. The lower variability in the neighborhoods vs. the whole dataset is a good sign that filling na values with the mean for the neighborhood will be better than using the mean for the entire dataset.

In [23]:
# # boxplot of lot frontage by neighborhood
# fig, ax = plt.subplots(figsize=(25,12))
# transformed_df.boxplot(by ='Neighborhood', column =['Lot Frontage'], grid = False, ax=ax)
# ax.axhline(transformed_df['Lot Frontage'].mean(), color ='orange', label='Average')
# ax.legend()

Unfortunately, we can see from the box plot that there are still a lot of outliers within the neighborhoods.

There are two neighborhoods with no values for lot frontage. We will fill these with the mean for the dataset.

In [24]:
# calculate null values in lot frontage
# get dataframe of mean values for each neighborhood
mean_df = transformed_df.groupby('Neighborhood')['Lot Frontage'].mean()
# get names of neighborhoods where the mean is null
nbhd_null = mean_df[mean_df.isna()].index
# if the neighborhood has no values for lot frontage, fill the values in with the mean for the dataset
transformed_df.loc[transformed_df['Neighborhood'].isin(nbhd_null), 'Lot Frontage'] = transformed_df['Lot Frontage'].mean()

In [25]:
transformed_df['Lot Frontage'] = transformed_df.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda grp: grp.fillna(np.mean(grp))
)

In [26]:
# find columns that still have missing values
missing_count = transformed_df.isna().sum()
missing_percent = transformed_df.isna().sum() / len(transformed_df)
columns = ['missing_count','missing_percent']
missing_df = pd.concat([missing_count,missing_percent], axis=1, keys=columns)
missing_df = missing_df[missing_df['missing_percent'] > 0].sort_values('missing_percent', ascending=False)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_df)

Empty DataFrame
Columns: [missing_count, missing_percent]
Index: []


Now that we have dealt with null values we can test the model with all numeric columns

In [27]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 31459.519644043823, Train RMSE: 34272.84697547075


Fixing the null values in lot frontage increased the RMSE, so it might be better to drop that column. 

In [28]:
results['resolve null values']=rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075]}

### Create new features

In [29]:
# create new feature that better captures the age of the home
transformed_df['years_since_built'] = transformed_df['Yr Sold'] - transformed_df['Year Built']
# drop rows with negative values since those are not possible
drop_idx = transformed_df['years_since_built'][transformed_df['years_since_built']<0].index.tolist()
transformed_df.drop(index=drop_idx, inplace=True)
# drop column no longer needed
transformed_df = transformed_df.drop(columns='Year Built')

In [30]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 30243.710194330593, Train RMSE: 33020.51202798555


In [31]:
# create new feature that better captures the remodel date
transformed_df['years_since_remod'] = transformed_df['Yr Sold'] - transformed_df['Year Remod/Add']
# drop rows with negative values since those are not possible
drop_idx = transformed_df['years_since_remod'][transformed_df['years_since_remod']<0].index.tolist()
transformed_df.drop(index=drop_idx, inplace=True)
# drop column no longer needed
transformed_df = transformed_df.drop(columns='Year Remod/Add')

In [32]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 28615.586316187793, Train RMSE: 32758.431019892592


In [33]:
# create new feature that better captures when the garage was built
transformed_df['years_since_garage_built'] = transformed_df['Yr Sold'] - transformed_df['Garage Yr Blt']
# drop rows with negative values since those are not possible
drop_idx = transformed_df['years_since_garage_built'][transformed_df['years_since_garage_built']<0].index.tolist()
transformed_df.drop(index=drop_idx, inplace=True)
# drop column no longer needed
transformed_df = transformed_df.drop(columns=['Garage Yr Blt','Yr Sold'])

In [34]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 27896.577355463498, Train RMSE: 33328.97298133154


In [35]:
# combine two columns that represent number of bathrooms
transformed_df['Bathrooms'] = transformed_df['Full Bath'] + transformed_df['Half Bath']
# drop column no longer needed
transformed_df = transformed_df.drop(columns=['Half Bath','Full Bath'])

In [36]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 27887.9378446402, Train RMSE: 33343.907741871146


In [37]:
# combine porch/deck sq ft columns
porch_cols = ['Wood Deck SF','Open Porch SF','Enclosed Porch','3Ssn Porch','Screen Porch']
transformed_df['PorchDeck_SF'] = transformed_df[porch_cols].sum(axis=1)
# drop original columns
transformed_df.drop(columns=porch_cols, inplace=True)

In [38]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 27947.03139306928, Train RMSE: 33425.29432231938


Combining the porch/deck square footage also increased RMSE.

In [39]:
results['new features']=rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938]}

### Transform nominal numerical to text
MS SubClass is numerical but it represents categories.

In [40]:
# ms subclass is numerical discrete, but it should be categorical nominal
# later we will convert the numerical codes to text descriptions
# this dictionary maps the codes to descriptions
ms_subclass_dict = {
    20: '1-STORY 1946 & NEWER ALL STYLES',
    30: '1-STORY 1945 & OLDER1-STORY 1945 & OLDER',
    40: '1-STORY W/FINISHED ATTIC ALL AGES',
    45: '1-1/2 STORY - UNFINISHED ALL AGES',
    50: '1-1/2 STORY FINISHED ALL AGES',
    60: '2-STORY 1946 & NEWER',
    70: '2-STORY 1945 & OLDER',
    75: '2-1/2 STORY ALL AGES',
    80: 'SPLIT OR MULTI-LEVEL',
    85: 'SPLIT FOYER',
    90: 'DUPLEX - ALL STYLES AND AGES',
    120: '1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
    150: '1-1/2 STORY PUD - ALL AGES',
    160: '2-STORY PUD - 1946 & NEWER',
    180: 'PUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
    190: '2 FAMILY CONVERSION - ALL STYLES AND AGES'
 }

In [41]:
# get descriptions instead of codes
transformed_df['MS SubClass'] = transformed_df['MS SubClass'].replace(ms_subclass_dict)

filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 28316.71501624453, Train RMSE: 33909.57028239677


We wouldn't expect this to drop the RMSE right away becuase text columns get dropped. It might help later after we transform text columns.

### Scaling Numerical Columns
Our method for scaling was chosen after doing some [research](https://www.mage.ai/blog/scaling-numerical-data). We will use min-max normalization on our discrete data and z-score standardization on our continuous data.


The following variables were defined by looking at the data documentation and determining the data type for each field. As new features were created they were added to these lists.

In [42]:
# discrete data columns will be scaled using min-max normalization
discrete_cols = ['Overall Qual','Overall Cond','years_since_remod','years_since_garage_built','years_since_built','Bsmt Full Bath','Bsmt Half Bath','Full Bath','Half Bath','Bedroom AbvGr','Kitchen AbvGr','TotRms AbvGrd','Fireplaces','Garage Cars','Misc Val','Bathrooms','Bsmt Rating','Bsmt Bath']

# continuous data columns will be scaled using z-score standardization
continuous_cols = ['Lot Frontage','Lot Area','Mas Vnr Area','BsmtFin SF 1','BsmtFin SF 2','Bsmt Unf SF','Total Bsmt SF','1st Flr SF','2nd Flr SF','Low Qual Fin SF','Gr Liv Area','Garage Area','Wood Deck SF','Open Porch SF','Enclosed Porch','3Ssn Porch','Screen Porch','Pool Area','Bsmt Fin SQ','PorchDeck_SF']

# this dictionary will be used for converting ordinal text columns to numeric values
ordinal_dict = {
    'Bsmt Cond': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Bsmt Exposure': {'Av': 3, 'Gd': 4, 'Mn': 2, 'No': 1, 'NA': 0},
    'Bsmt Qual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'BsmtFin Type 1': {'ALQ': 5, 'BLQ': 4, 'GLQ': 6, 'LwQ': 2, 'Rec': 3, 'Unf': 1, 'NA': 0},
    'BsmtFin Type 2': {'ALQ': 5, 'BLQ': 4, 'GLQ': 6, 'LwQ': 2, 'Rec': 3, 'Unf': 1, 'NA': 0},
    'Exter Cond': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Exter Qual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Garage Cond': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Garage Qual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Heating QC': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Kitchen Qual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'Po': 1, 'TA': 3, 'NA': 0},
    'Land Slope': {'Gtl': 1, 'Mod': 2, 'Sev': 3},
    'Paved Drive': {'N': 1, 'P': 2, 'Y': 3}
 }

In [43]:
zscore_cols = continuous_cols
# Scale numeric columns
if len(zscore_cols) > 0:
    zscore_cols = [col for col in zscore_cols if col in transformed_df.columns]
    print(f'Scaling numerical columns using z-score standardization: {zscore_cols}')
    zscore_df = transformed_df[zscore_cols]
    # scale data with z-score standardization (normal distribution)
    # (x - xmean / xstd.dev)
    standardized_df = (zscore_df - zscore_df.mean()) / (zscore_df.std())
    df1 = transformed_df.drop(columns=zscore_cols)
    df2 = standardized_df
    transformed_df = pd.concat([df1, df2], axis=1)

Scaling numerical columns using z-score standardization: ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Pool Area', 'PorchDeck_SF']


In [44]:
rmse

[28316.71501624453, 33909.57028239677]

In [45]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 28316.71501624457, Train RMSE: 33909.57028239677



Scaling numerical columns using z-score standardization: ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Pool Area', 'PorchDeck_SF']

Scaling continuous data columns with z-score standardization increased RMSE by a small amount.

28316.71501624453 - 28316.71501624457 = -4.001776687800884e-11

In [46]:
minmax_cols = discrete_cols
scale_ordinal = False

if len(minmax_cols) > 0:
    minmax_cols = [col for col in minmax_cols if col in transformed_df.columns]
    print(f'Scaling numerical columns using min-max normalization: {minmax_cols}')
    if scale_ordinal == True:
        cols = minmax_cols + [k for k in ordinal_dict if k in transformed_df.columns]
        print(f'Scaling ordinal and minmax cols: {cols}')
        minmax_df = transformed_df[cols]
    else:
        minmax_df = transformed_df[minmax_cols]
    # scale discrete numbers with normalization 
    # (x-xmin) / (x-xmax) 
    normalized_df = (minmax_df - minmax_df.min()) / (minmax_df.max() - minmax_df.min())
    df1 = transformed_df.drop(columns=minmax_cols)
    df2 = normalized_df
    transformed_df = pd.concat([df1, df2], axis=1)

Scaling numerical columns using min-max normalization: ['Overall Qual', 'Overall Cond', 'years_since_remod', 'years_since_garage_built', 'years_since_built', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Misc Val', 'Bathrooms']


In [47]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 29854.66373554057, Train RMSE: 34942.641059269015


Scaling numerical columns using min-max normalization: ['Overall Qual', 'Overall Cond', 'years_since_remod', 'years_since_garage_built', 'years_since_built', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Misc Val', 'Bathrooms']

Scaling discrete numerical columns using min-max normalization increased the RMSE by 1537.9487192960005 (29854.66373554057- 28316.71501624457).  

### Update the transform features function. 
* We will drop columns that are not relevant to machine learning or leak data about the sale price
* We will deal with null values by dropping columns or filling in the null values
* We will create new feature that better represent the data than the current features
* We will transform MS Subclass to text since it represents categorical data
* None of our attempts to scale numerical columns improved the model. We will include these options as parameters in case we want to do more testing later. 
* We will also include parameters for combining the porch square footage and filling in null values for lot frontage.

In [48]:
def transform_features(df, lot_frontage_nbhd=True, combine_porch_sf=True, zscore_cols=[], minmax_cols=[]):
    '''
    Feature Engineering
    * Remove features we do not want to use in the model, based on the number 
    of missing values or data leakage. 
    * Transform features into the proper format (filling in missing values). 
    * Create new features by combining other features.
    '''
    print(f'Begin Transform Features with {df.shape[1]} columns and {df.shape[0]} rows')

    # drop columns that are not useful for machine learning
    drop_cols = ['Order', 'PID']
    # print(f'Dropping columns not useful for machine learning: {drop_cols}')
    transformed_df = df.drop(columns=drop_cols)

    # drop columns that leak information
    drop_cols = ['Mo Sold', 'Sale Type', 'Sale Condition']
    # print(f'Dropping columns that leak information: {drop_cols}')
    transformed_df = transformed_df.drop(columns=drop_cols)

    ## deal with null values by dropping columns or filling in

    # drop columns where more than 25% of values are missing
    # total number of rows
    len_df = transformed_df.shape[0]
    # count of missing values in each column
    missing_count = transformed_df.isna().sum()
    # 25% of rows
    twentyfive_percent = int(len_df*0.25)
    # columns to drop
    drop_cols = missing_count[missing_count > twentyfive_percent].index
    # drop columns where the missing count is greater than 25%
    transformed_df = transformed_df.drop(columns=drop_cols)

    # For columns that have 5% or less missing values, we will fill in missing values with the most frequent value
    # 5% of values
    five_percent = int(len_df*0.05)
    # filter series by missing less than 5%
    missing_0_to_5 = missing_count[(missing_count < five_percent) & (missing_count > 0)]
    # columns to fill with mode
    cols_to_fill = missing_0_to_5.index
    # fill na values with mode
    transformed_df[cols_to_fill] = transformed_df[cols_to_fill].fillna(transformed_df.mode().iloc[0])

    # get list of garage columns
    garage_cols = [col for col in transformed_df.columns if 'garage' in col.lower()]
    garage_bool = transformed_df['Garage Cars'] == 0
    no_garage_bool = transformed_df['Garage Cars'] != 0
    # if there is a garage, fill na values with the mode
    transformed_df.loc[garage_bool, garage_cols] = transformed_df.loc[garage_bool, garage_cols].fillna(transformed_df.mode().iloc[0])
    # if there is not a garage, for Garage Yr Blt fill with Yr Sold
    transformed_df.loc[no_garage_bool, 'Garage Yr Blt'] = transformed_df.loc[no_garage_bool, 'Garage Yr Blt'].fillna(transformed_df['Yr Sold'])
    # if there is not a garage for text columns, fill na values with 'NA'
    transformed_df.loc[no_garage_bool, garage_cols] = transformed_df.loc[no_garage_bool, garage_cols].fillna('NA')

    # calculate null values in lot frontage
    if lot_frontage_nbhd == True:
        # if true, calculate the eman for the neighborhood and use that to fill null values
        # get dataframe of mean values for each neighborhood
        mean_df = transformed_df.groupby('Neighborhood')['Lot Frontage'].mean()
        # get names of neighborhoods where the mean is null
        nbhd_null = mean_df[mean_df.isna()].index
        # if the neighborhood has no values for lot frontage, fill the values in with the mean for the dataset
        transformed_df.loc[transformed_df['Neighborhood'].isin(nbhd_null), 'Lot Frontage'] = transformed_df['Lot Frontage'].mean()
        def nbhd_mean(nbhd):
            '''function returns mean lot frontage for neighborhood
            '''
            nbhd_mean = transformed_df[transformed_df['Neighborhood'] == nbhd]['Lot Frontage'].mean()
            return nbhd_mean
        # fill in remaining null lot frontage values with mean for neighborhood
        transformed_df['Lot Frontage'] = transformed_df['Lot Frontage']
        transformed_df.loc[transformed_df['Lot Frontage'].isna(), 'Lot Frontage'] = transformed_df.loc[transformed_df['Lot Frontage'].isna(), 'Neighborhood'].apply(nbhd_mean)
    else:
        # if false, drop the column
        transformed_df.drop(columns='Lot Frontage', inplace=True)

    # create new features

    # create new feature that better captures the age of the home
    transformed_df['years_since_built'] = transformed_df['Yr Sold'] - transformed_df['Year Built']
    # drop rows with negative values since those are not possible
    drop_idx = transformed_df['years_since_built'][transformed_df['years_since_built']<0].index.tolist()
    transformed_df.drop(index=drop_idx, inplace=True)
    # drop column no longer needed
    transformed_df = transformed_df.drop(columns='Year Built')

    # create new feature that better captures the remodel date
    transformed_df['years_since_remod'] = transformed_df['Yr Sold'] - transformed_df['Year Remod/Add']
    # drop rows with negative values since those are not possible
    drop_idx = transformed_df['years_since_remod'][transformed_df['years_since_remod']<0].index.tolist()
    transformed_df.drop(index=drop_idx, inplace=True)
    # drop column no longer needed
    transformed_df = transformed_df.drop(columns='Year Remod/Add')

    # create new feature that better captures when the garage was built
    transformed_df['years_since_garage_built'] = transformed_df['Yr Sold'] - transformed_df['Garage Yr Blt']
    # drop rows with negative values since those are not possible
    drop_idx = transformed_df['years_since_garage_built'][transformed_df['years_since_garage_built']<0].index.tolist()
    transformed_df.drop(index=drop_idx, inplace=True)
    # drop column no longer needed
    transformed_df = transformed_df.drop(columns=['Garage Yr Blt','Yr Sold'])

    # combine two columns that represent number of bathrooms
    transformed_df['Bathrooms'] = transformed_df['Full Bath'] + transformed_df['Half Bath']
    # drop column no longer needed
    transformed_df = transformed_df.drop(columns=['Half Bath','Full Bath'])

    # combine porch/deck sq ft columns
    if combine_porch_sf == True:
        porch_cols = ['Wood Deck SF','Open Porch SF','Enclosed Porch','3Ssn Porch','Screen Porch']
        transformed_df['PorchDeck_SF'] = transformed_df[porch_cols].sum(axis=1)
        # drop original columns
        transformed_df.drop(columns=porch_cols, inplace=True)

    # get descriptions instead of codes
    transformed_df['MS SubClass'] = transformed_df['MS SubClass'].replace(ms_subclass_dict)

    # Scale numeric columns
    if len(zscore_cols) > 0:
        zscore_cols = [col for col in zscore_cols if col in transformed_df.columns]
        print(f'Scaling numerical columns using z-score standardization: {zscore_cols}')
        zscore_df = transformed_df[zscore_cols]
        # scale data with z-score standardization (normal distribution)
        # (x - xmean / xstd.dev)
        standardized_df = (zscore_df - zscore_df.mean()) / (zscore_df.std())
        df1 = transformed_df.drop(columns=zscore_cols)
        df2 = standardized_df
        transformed_df = pd.concat([df1, df2], axis=1)
    if len(minmax_cols) > 0:
        minmax_cols = [col for col in minmax_cols if col in transformed_df.columns]
        print(f'Scaling numerical columns using min-max normalization: {minmax_cols}')
        minmax_df = transformed_df[minmax_cols]
        # scale discrete numbers with normalization 
        # (x-xmin) / (x-xmax) 
        normalized_df = (minmax_df - minmax_df.min()) / (minmax_df.max() - minmax_df.min())
        df1 = transformed_df.drop(columns=minmax_cols)
        df2 = normalized_df
        transformed_df = pd.concat([df1, df2], axis=1)

    print(f'End Transform Features with {transformed_df.shape[1]} columns and {transformed_df.shape[0]} rows')
    return transformed_df

In [49]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=True, combine_porch_sf=True, zscore_cols=continuous_cols, minmax_cols=discrete_cols)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
Scaling numerical columns using z-score standardization: ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Pool Area', 'PorchDeck_SF']
Scaling numerical columns using min-max normalization: ['Overall Qual', 'Overall Cond', 'years_since_remod', 'years_since_garage_built', 'years_since_built', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Misc Val', 'Bathrooms']
End Transform Features with 66 columns and 2926 rows
Test RMSE: 29854.66373554057, Train RMSE: 34942.641059269015


In [50]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=True, combine_porch_sf=True)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 66 columns and 2926 rows
Test RMSE: 28316.71501624453, Train RMSE: 33909.57028239677


In [51]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False, zscore_cols=continuous_cols, minmax_cols=discrete_cols)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
Scaling numerical columns using z-score standardization: ['Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area']
Scaling numerical columns using min-max normalization: ['Overall Qual', 'Overall Cond', 'years_since_remod', 'years_since_garage_built', 'years_since_built', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Misc Val', 'Bathrooms']
End Transform Features with 69 columns and 2926 rows
Test RMSE: 39908.851259838666, Train RMSE: 43205.17405838527


In [52]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Test RMSE: 28217.333098304378, Train RMSE: 33835.459860906936


In [53]:
results['updated transform_features() function']=rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936]}

# Feature Selection

### Feature selection based correlations of numerical columns
We created a correlation heat map but I am going to leave it out because it is maxing out the CPU on my Chromebook.

In [54]:
# create correlation matrix
corr_df = pd.concat([transformed_df, ames_df[['SalePrice']]], axis=1).corr()
corr_df.head()

# # heatmap of correlation matrix
# fig, ax = plt.subplots(figsize=(10,10))
# sns.heatmap(corr_df, ax=ax, annot=True)

Unnamed: 0,Lot Area,Overall Qual,Overall Cond,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,...,3Ssn Porch,Screen Porch,Pool Area,Misc Val,SalePrice,years_since_built,years_since_remod,years_since_garage_built,Bathrooms,SalePrice.1
Lot Area,1.0,0.090563,-0.033529,0.114944,0.180222,0.084433,0.020976,0.241501,0.321461,0.030322,...,0.016611,0.056114,0.094417,0.038006,0.267663,-0.02098,-0.019942,0.027973,0.105132,0.267663
Overall Qual,0.090563,1.0,-0.093706,0.41939,0.27822,-0.040428,0.268457,0.545554,0.474637,0.240763,...,0.018591,0.042559,0.030677,-0.02765,0.801116,-0.596142,-0.570011,-0.46999,0.525266,0.801116
Overall Cond,-0.033529,-0.093706,1.0,-0.132183,-0.050091,0.040798,-0.136624,-0.174688,-0.1576,0.00654,...,0.043789,0.043868,-0.016834,0.047052,-0.101377,0.369166,-0.046581,0.307162,-0.201844,-0.101377
Mas Vnr Area,0.114944,0.41939,-0.132183,1.0,0.284386,-0.014197,0.087732,0.378874,0.376504,0.119542,...,0.01467,0.068315,0.005131,-0.022904,0.507323,-0.306479,-0.191092,-0.210351,0.290209,0.507323
BsmtFin SF 1,0.180222,0.27822,-0.050091,0.284386,1.0,-0.053626,-0.488008,0.522778,0.439639,-0.167701,...,0.051656,0.09837,0.085539,0.015759,0.439262,-0.278013,-0.148368,-0.155036,0.042632,0.439262


In [55]:
# drop columns that are highly correlated to columns other than sale price
# drop_cols = ['Garage Area','TotRms AbvGrd'] # increased RMSE
drop_cols = ['Garage Area']
transformed_df = transformed_df.drop(columns=drop_cols, errors='ignore')

In [56]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Test RMSE: 28215.893633155334, Train RMSE: 33835.46431280156


Garage Area is highly correlated with Garage Cars and TotRms AbvGrd is highly correlated with GrLiving Area. Dropping Garage Cars and TotRms AbvGrd increased RMSE:  
28058.556035136608 - 27947.03139306928 = 111.52464206732839

Only dropping Garage Area reduced RMSE.

In [57]:
# if highly correlated columns were provided, skip this step
high_corr_cols=[]
if high_corr_cols==[]:
    # create correlation matrix
    corr_df = transformed_df.corr()
    # calculate absolute correlation coefficients
    abs_corr_coeffs = corr_df['SalePrice'].abs().sort_values()
    # get columns that have a high correlation coefficients
    corr_thresh = 0.0
    high_corr_cols = abs_corr_coeffs[abs_corr_coeffs > corr_thresh].index
# keep highly correlated
transformed_df = transformed_df[high_corr_cols]

In [58]:
high_corr_cols

Index(['BsmtFin SF 2', 'Misc Val', '3Ssn Porch', 'Bsmt Half Bath',
       'Low Qual Fin SF', 'Pool Area', 'Overall Cond', 'Screen Porch',
       'Kitchen AbvGr', 'Enclosed Porch', 'Bedroom AbvGr', 'Bsmt Unf SF',
       'Lot Area', '2nd Flr SF', 'Bsmt Full Bath', 'Open Porch SF',
       'Wood Deck SF', 'BsmtFin SF 1', 'years_since_garage_built',
       'Fireplaces', 'TotRms AbvGrd', 'Mas Vnr Area', 'years_since_remod',
       'Bathrooms', 'years_since_built', '1st Flr SF', 'Total Bsmt SF',
       'Garage Cars', 'Gr Liv Area', 'Overall Qual', 'SalePrice'],
      dtype='object')

In [59]:
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df) 

Test RMSE: 28215.893633155567, Train RMSE: 33835.46431280156


0.8  
'Overall Qual'  
Test RMSE: 46236.32346664704, Train RMSE: 49443.87470242076  

0.6  
'1st Flr SF', 'Garage Area', 'Total Bsmt SF', 'Garage Cars', 'Gr Liv Area', 'Overall Qual'  
Test RMSE: 34110.88679971664, Train RMSE: 37738.79474281697  

0.4  
'BsmtFin SF 1', 'years_since_garage_built', 'Fireplaces',
       'TotRms AbvGrd', 'Mas Vnr Area', 'years_since_remod', 'Bathrooms',
       'years_since_built', '1st Flr SF', 'Total Bsmt SF', 'Garage Cars',
       'Gr Liv Area', 'Overall Qual'  
Test RMSE: 30298.367837063786, Train RMSE: 35089.03029383569  

0.2  
'Lot Area', '2nd Flr SF', 'Bsmt Full Bath', 'Open Porch SF',
       'Wood Deck SF', 'BsmtFin SF 1', 'years_since_garage_built',
       'Fireplaces', 'TotRms AbvGrd', 'Mas Vnr Area', 'years_since_remod',
       'Bathrooms', 'years_since_built', '1st Flr SF', 'Total Bsmt SF',
       'Garage Cars', 'Gr Liv Area', 'Overall Qual'    
Test RMSE: 29731.310264257252, Train RMSE: 34774.37630203527  

0.1  
'Overall Cond', 'Screen Porch', 'Kitchen AbvGr', 'Enclosed Porch',
       'Bedroom AbvGr', 'Bsmt Unf SF', 'Lot Area', '2nd Flr SF',
       'Bsmt Full Bath', 'Open Porch SF', 'Wood Deck SF', 'BsmtFin SF 1',
       'years_since_garage_built', 'Fireplaces', 'TotRms AbvGrd',
       'Mas Vnr Area', 'years_since_remod', 'Bathrooms', 'years_since_built',
       '1st Flr SF', 'Total Bsmt SF', 'Garage Cars', 'Gr Liv Area',
       'Overall Qual'  
Test RMSE: 28257.60600135309, Train RMSE: 33925.887835433285  

0  
'BsmtFin SF 2', 'Misc Val', '3Ssn Porch', 'Bsmt Half Bath',
       'Low Qual Fin SF', 'Pool Area', 'MS SubClass', 'Overall Cond',
       'Screen Porch', 'Kitchen AbvGr', 'Enclosed Porch', 'Bedroom AbvGr',
       'Bsmt Unf SF', 'Lot Area', '2nd Flr SF', 'Bsmt Full Bath',
       'Open Porch SF', 'Wood Deck SF', 'BsmtFin SF 1',
       'years_since_garage_built', 'Fireplaces', 'TotRms AbvGrd',
       'Mas Vnr Area', 'years_since_remod', 'Bathrooms', 'years_since_built',
       '1st Flr SF', 'Total Bsmt SF', 'Garage Cars', 'Gr Liv Area',
       'Overall Qual'   
Test RMSE: 27660.449988989607, Train RMSE: 33423.505865954816  


Filtering by highly correlated columns did not reduce RMSE.


### Transforming Text to Numerical
Text columns will be transformed to categorical, then dummy columns will be created so that all the data is in numerical format.

In [60]:
# finding highly correlated columns removed text colums, so we have to re-create transformed_df
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Test RMSE: 28217.333098304378, Train RMSE: 33835.459860906936


#### Test transforming ordinal text columns to numerical

In [61]:
# list all text columns
all_text_cols = transformed_df.select_dtypes(include=['object']).columns.to_list()
len(all_text_cols)

37

In [62]:
test_ordinal_df = transformed_df.copy()
# convert ordinal text columns to numeric
transformed_ordinal = []
for col, mapper in ordinal_dict.items():
    if col in test_ordinal_df.columns:
        test_ordinal_df[col] = test_ordinal_df[col].map(mapper)
        transformed_ordinal.append(col)

In [63]:
filtered_df = select_features(test_ordinal_df)
rmse = train_and_test(filtered_df) 

Test RMSE: 26799.049498739518, Train RMSE: 31993.13153788107


Including ordinal columns by converting them to numeric decreased RMSE.

In [64]:
# transformed_df = test_ordinal_df.copy()

In [65]:
# list all text columns
all_text_cols = transformed_df.select_dtypes(include=['object']).columns.to_list()
len(all_text_cols)

37

In [66]:
# create dataframe of text columns
text_df = transformed_df[all_text_cols].copy()

When we create dummy columns, that will create a new column for each unique values, so we will drop text columns that have too many unique values.

In [67]:
# set arbitrary threshold and drop columns that have more than that many unique values
unique_val_thresh=30
drop_cols = text_df.nunique()[text_df.nunique() > unique_val_thresh].index
text_df = text_df.drop(columns=drop_cols)
print(f'Removed {len(drop_cols)} columns because they contained {unique_val_thresh} or more unique values: {drop_cols}')

Removed 0 columns because they contained 30 or more unique values: Index([], dtype='object')


We will try to remove columns with low variance by dropping columns where one value accounts for more than 95% (or some other threshold) of values.

In [68]:
# the describe df index 'freq' tells us how many times the most frequent value occurs
# get most the common value's frequency for each text column
# divide by length of dataframe to convert to percentage
value_frequency = text_df.describe().loc['freq'].sort_values() / len(text_df)
# columns where most frequent value accounts for more than some percentage of values
single_value_thresh=0.95
drop_cols = value_frequency[value_frequency > 0.95].index
print(f'Dropping columns if one value represents more than {single_value_thresh*100}% of the data: {drop_cols}')
text_df = text_df.drop(columns=drop_cols)

# get dummies for text columns
dummy_df = pd.get_dummies(text_df)

Dropping columns if one value represents more than 95.0% of the data: Index(['Land Slope', 'Garage Cond', 'Heating', 'Roof Matl', 'Condition 2',
       'Street', 'Utilities'],
      dtype='object')


In [69]:
dummy_df.shape

(2926, 215)

In [70]:
df1 = transformed_df.drop(columns=all_text_cols)
df2 = dummy_df
test_df = pd.concat([df1, df2], axis=1)
filtered_df = select_features(test_df)
rmse = train_and_test(filtered_df)

Test RMSE: 25172.61071035756, Train RMSE: 23874.39327541863


In [71]:
results['dummy columns'] = rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863]}

Not converting text to ordinal:  
Test RMSE: 25172.61071035756, Train RMSE: 23874.39327541863

Converting text to ordinal first, before creating dummy columns:  
RMSE: 26516.470316924082, Train RMSE: 25127.3365618802  

Even though converting text to ordinal improved the RMSE, it was better to convert all columns to categorical and use dummy columns.

We will remove dummy columns with low variance.

In [72]:
# create dataframe of dummy column variance
var_dummy_cols = []
for col in dummy_df.columns:
    var_dummy_cols.append(dummy_df[col].value_counts(normalize=True))
dummy_var_df = pd.DataFrame(var_dummy_cols)
# set a threshold
# if False values exceed this threshold, the column will be dropped
dummy_var_thresh = 0.99
drop_dummy_cols = dummy_var_df[dummy_var_df[0]>dummy_var_thresh].index
dummy_df.drop(columns=drop_dummy_cols, inplace=True)

In [73]:
dummy_df.shape

(2926, 149)

In [74]:
df1 = transformed_df.drop(columns=all_text_cols)
df2 = dummy_df
test_df = pd.concat([df1, df2], axis=1)
filtered_df = select_features(test_df)
rmse = train_and_test(filtered_df)

Test RMSE: 24645.647930544466, Train RMSE: 24768.03660552109


In [75]:
results['dummy columns reduced variance'] = rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863],
 'dummy columns reduced variance': [24645.647930544466, 24768.03660552109]}

Using 99% threshold with dummy cols
Test RMSE: 24949.63068566496, Train RMSE: 24740.132336733237

Find dummy columns that have high correlation with sale price.

In [76]:
# create correlation matrix
corr_df = pd.concat([dummy_df, ames_df[['SalePrice']]], axis=1).corr()

In [77]:
corr_df.shape

(150, 150)

In [78]:
# calculate absolute correlation coefficients
abs_corr_coeffs = corr_df['SalePrice'].abs().sort_values()
# get columns that have a high correlation coefficients
corr_thresh = 0.2
high_corr_cols = [col for col in abs_corr_coeffs[abs_corr_coeffs > corr_thresh].index if col != 'SalePrice']
high_corr_df = pd.concat([dummy_df[high_corr_cols], ames_df[['SalePrice']]], axis=1).corr()

In [79]:
high_corr_df.shape

(46, 46)

In [80]:
df1 = transformed_df.drop(columns=all_text_cols)
df2 = dummy_df[high_corr_cols]
test_df = pd.concat([df1, df2], axis=1)
filtered_df = select_features(test_df)
rmse = train_and_test(filtered_df)

Test RMSE: 23877.208641396446, Train RMSE: 28152.671086980783


In [81]:
results['dummy columns high correlation'] = rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863],
 'dummy columns reduced variance': [24645.647930544466, 24768.03660552109],
 'dummy columns high correlation': [23877.208641396446, 28152.671086980783]}

In [82]:
high_corr_cols.sort()
high_corr_cols

['Bsmt Exposure_Gd',
 'Bsmt Exposure_No',
 'Bsmt Qual_Ex',
 'Bsmt Qual_Gd',
 'Bsmt Qual_TA',
 'BsmtFin Type 1_GLQ',
 'Central Air_N',
 'Central Air_Y',
 'Electrical_SBrkr',
 'Exter Qual_Ex',
 'Exter Qual_Gd',
 'Exter Qual_TA',
 'Exterior 1st_VinylSd',
 'Exterior 2nd_VinylSd',
 'Foundation_BrkTil',
 'Foundation_CBlock',
 'Foundation_PConc',
 'Garage Finish_Fin',
 'Garage Finish_Unf',
 'Garage Type_Attchd',
 'Garage Type_BuiltIn',
 'Garage Type_Detchd',
 'Heating QC_Ex',
 'Heating QC_TA',
 'House Style_2Story',
 'Kitchen Qual_Ex',
 'Kitchen Qual_Gd',
 'Kitchen Qual_TA',
 'Lot Shape_IR1',
 'Lot Shape_Reg',
 'MS SubClass_1-STORY 1945 & OLDER1-STORY 1945 & OLDER',
 'MS SubClass_2-STORY 1946 & NEWER',
 'MS Zoning_RL',
 'MS Zoning_RM',
 'Mas Vnr Type_BrkFace',
 'Mas Vnr Type_None',
 'Mas Vnr Type_Stone',
 'Neighborhood_NoRidge',
 'Neighborhood_NridgHt',
 'Neighborhood_OldTown',
 'Neighborhood_StoneBr',
 'Paved Drive_N',
 'Paved Drive_Y',
 'Roof Style_Gable',
 'Roof Style_Hip']

Drop the boolean columns, we only need to keep the True columns.

In [83]:
high_corr_cols.remove('Central Air_N')
high_corr_cols.remove('Paved Drive_N')

In [84]:
df1 = transformed_df.drop(columns=all_text_cols)
df2 = dummy_df[high_corr_cols]
test_df = pd.concat([df1, df2], axis=1)
filtered_df = select_features(test_df)
rmse = train_and_test(filtered_df)

Test RMSE: 23876.835511422658, Train RMSE: 28152.707588296424


In [85]:
# update high correlation value
results['dummy columns high correlation'] = rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863],
 'dummy columns reduced variance': [24645.647930544466, 24768.03660552109],
 'dummy columns high correlation': [23876.835511422658, 28152.707588296424]}

In [86]:
# update transformed df by dropping the original text columns and adding in the dummy columns
df1 = transformed_df.drop(columns=all_text_cols)
df2 = dummy_df[high_corr_cols]
transformed_df = pd.concat([df1, df2], axis=1)

### Update select_features() function
* Selecting highly correlated numeric features was not beneficial, but we will make a parameter in case we want to do more testing later.
* We wil not transform ordinal values to numeric before converting text columns to categorical since that did not improve the model.
* We will 
    * Transform text columns to categorical and then create dummy columns.
    * Remove dummy columns with low variance.
    * Select dummy columns with high correlation with sale price and create a parameter so we can adjust this value.

In [87]:
def select_features(df, num_corr_thresh=0.0, dummy_corr_thresh=0.2):
    '''Feature Selection
    '''
    print(f'Begin Feature Selection with {len(df.columns)} columns')

    # drop columns that are highly correlated to columns other than sale price
    # drop_cols = ['Garage Area','TotRms AbvGrd'] # increased RMSE
    drop_cols = ['Garage Area']
    selected_df = df.drop(columns=drop_cols, errors='ignore')

    # only keep highly correlated numeric columns
    if num_corr_thresh > 0:
        # create correlation matrix
        corr_df = selected_df.corr()
        # calculate absolute correlation coefficients
        abs_corr_coeffs = corr_df['SalePrice'].abs().sort_values()
        # get columns that have a high correlation coefficients
        high_corr_cols = abs_corr_coeffs[abs_corr_coeffs > num_corr_thresh].index
        # keep highly correlated
        high_corr_num_df = selected_df[high_corr_cols]

    # list all text columns
    all_text_cols = selected_df.select_dtypes(include=['object']).columns.to_list()
    # create dataframe of text columns
    text_df = selected_df[all_text_cols].copy()
    # set arbitrary threshold and drop columns that have more than that many unique values
    unique_val_thresh=30
    drop_cols = text_df.nunique()[text_df.nunique() > unique_val_thresh].index
    text_df = text_df.drop(columns=drop_cols)
    print(f'Removed {len(drop_cols)} columns because they contained {unique_val_thresh} or more unique values: {drop_cols}')
    # the describe df index 'freq' tells us how many times the most frequent value occurs
    # get most the common value's frequency for each text column
    # divide by length of dataframe to convert to percentage
    value_frequency = text_df.describe().loc['freq'].sort_values() / len(text_df)
    # columns where most frequent value accounts for more than some percentage of values
    single_value_thresh=0.95
    drop_cols = value_frequency[value_frequency > 0.95].index
    print(f'Dropping columns if one value represents more than {single_value_thresh*100}% of the data: {drop_cols}')
    text_df = text_df.drop(columns=drop_cols)
    # get dummies for text columns
    dummy_df = pd.get_dummies(text_df)
    # create dataframe of dummy column variance
    var_dummy_cols = []
    for col in dummy_df.columns:
        var_dummy_cols.append(dummy_df[col].value_counts(normalize=True))
    dummy_var_df = pd.DataFrame(var_dummy_cols)
    # set a threshold
    # if False values exceed this threshold, the column will be dropped
    dummy_var_thresh = 0.99
    drop_dummy_cols = dummy_var_df[dummy_var_df[0]>dummy_var_thresh].index
    dummy_df.drop(columns=drop_dummy_cols, inplace=True)
    # create correlation matrix
    corr_df = pd.concat([dummy_df, ames_df[['SalePrice']]], axis=1).corr()  
    # calculate absolute correlation coefficients
    abs_corr_coeffs = corr_df['SalePrice'].abs().sort_values()
    # get columns that have a high correlation coefficients
    high_corr_cols = [col for col in abs_corr_coeffs[abs_corr_coeffs > dummy_corr_thresh].index if col != 'SalePrice']
    high_corr_df = pd.concat([dummy_df[high_corr_cols], ames_df[['SalePrice']]], axis=1).corr()
    dummy_df = dummy_df[high_corr_cols]
    # Drop the boolean columns, we only need to keep the True columns.
    dummy_df.drop(columns=['Central Air_N','Paved Drive_N'], errors='ignore')

    # update transformed df
    if num_corr_thresh > 0:
        # only include highly correlated numeric columns
        df1 = high_corr_num_df
    else:
        # drop the original text columns from selected df
        df1 = selected_df.drop(columns=all_text_cols)
    # combine numeric and transformed text features
    df2 = dummy_df[high_corr_cols]
    selected_df = pd.concat([df1, df2], axis=1)

    return selected_df 

In [88]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df, num_corr_thresh=0.0, dummy_corr_thresh=0.2)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Begin Feature Selection with 69 columns
Removed 0 columns because they contained 30 or more unique values: Index([], dtype='object')
Dropping columns if one value represents more than 95.0% of the data: Index(['Land Slope', 'Garage Cond', 'Heating', 'Roof Matl', 'Condition 2',
       'Street', 'Utilities'],
      dtype='object')
Test RMSE: 23881.453611208774, Train RMSE: 28152.827208334926


In [89]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df, num_corr_thresh=0.0, dummy_corr_thresh=0.0)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Begin Feature Selection with 69 columns
Removed 0 columns because they contained 30 or more unique values: Index([], dtype='object')
Dropping columns if one value represents more than 95.0% of the data: Index(['Land Slope', 'Garage Cond', 'Heating', 'Roof Matl', 'Condition 2',
       'Street', 'Utilities'],
      dtype='object')
Test RMSE: 24646.24777029301, Train RMSE: 24770.17622003498


In [90]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df, num_corr_thresh=0.0, dummy_corr_thresh=0.1)
rmse = train_and_test(filtered_df)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Begin Feature Selection with 69 columns
Removed 0 columns because they contained 30 or more unique values: Index([], dtype='object')
Dropping columns if one value represents more than 95.0% of the data: Index(['Land Slope', 'Garage Cond', 'Heating', 'Roof Matl', 'Condition 2',
       'Street', 'Utilities'],
      dtype='object')
Test RMSE: 23707.856487351604, Train RMSE: 26312.897911700937


In [91]:
results['updated select_features() function']=rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863],
 'dummy columns reduced variance': [24645.647930544466, 24768.03660552109],
 'dummy columns high correlation': [23876.835511422658, 28152.707588296424],
 'updated select_features() function': [23707.856487351604,
  26312.897911700937]}

# Train And Test
Now for the final part of the pipeline, training and testing. When iterating on different features, using simple validation is a good idea. Let's add a parameter named k that controls the type of cross validation that occurs.

* The optional k parameter should accept integer values, with a default value of 0.
* When k equals 0, perform **holdout validation** (what we already implemented):
    * Select the first 1460 rows and assign to train.
    * Select the remaining rows and assign to test.
    * Train on train and test on test.
    * Compute the RMSE and return.

* When k equals 1, perform **simple cross validation**:
    * Shuffle the ordering of the rows in the data frame.
    * Select the first 1460 rows and assign to fold_one.
    * Select the remaining rows and assign to fold_two.
    * Train on fold_one and test on fold_two.
    * Train on fold_two and test on fold_one.
    * Compute the average RMSE and return.

* When k is greater than 0, implement **k-fold cross validation** using k folds:
    * Perform k-fold cross validation using k folds.
    * Calculate the average RMSE value and return this value.

In [92]:
def get_rmse(train_df, test_df, train_features, test_features, target_col):
    lr = LinearRegression()
    x = train_df[train_features]
    y = train_df[target_col]
    reg = lr.fit(X=x, y=y)
    y_pred_train = reg.predict(x)
    y_pred_test = reg.predict(test_df[test_features])

    rmse_train = sqrt(mean_squared_error(
        y_true=y,
        y_pred=y_pred_train
    ))

    rmse_test = sqrt(mean_squared_error(
        y_true=test_df[target_col],
        y_pred=y_pred_test
    ))

    return [rmse_test, rmse_train]

def train_and_test(df, k=0):
    '''
    * Selects the first 1460 rows from from data and assign to train.
    * Selects the remaining rows from data and assign to test.
    * Trains a model using all numerical columns except the SalePrice column (the target column) from the data frame returned from select_features().
    * Tests the model on the test set and returns the RMSE value.
    '''
    target_col = 'SalePrice'
    # print('Train:')
    # train_df = transform_features(train_df)
    # train_df = select_features(train_df)
    # train_features = [col for col in train_df.columns if col != target_col]
    # print('')

    # print('Test:')
    # test_df = transform_features(test_df)
    # test_df = select_features(test_df)
    # test_features = [col for col in test_df.columns if col != target_col]
    # print('')

    selected_features = [col for col in df.columns if col != target_col]
    train_features = selected_features
    test_features = selected_features

    # when k = 0, perform holdout validation
    if k == 0:
        # shuffling a pandas dataframe with sklearn
        shuffled_df = shuffle(df, random_state=1)
        train_df = shuffled_df.iloc[:1460]
        test_df = shuffled_df.iloc[1460:]
        # Randomize *all* rows (frac=1) from `df` and return
        # shuffled_df = df.sample(frac=1, )
        # train_df = df.iloc[:1460]
        # test_df = df.iloc[1460:]

        rmse_test, rmse_train = get_rmse(train_df, test_df, train_features, test_features, target_col)

        print('Holdout validation')
        print(f'Test RMSE: {rmse_test}, Train RMSE: {rmse_train}')
        return [rmse_test, rmse_train]

    # if k = 1, perform simple cross validation (also called holdout)
    elif k == 1:
        # Shuffle the ordering of the rows in the data frame.
        shuffled_df = df.iloc[np.random.permutation(len(df))]
        # Select the first 1460 rows and assign to fold_one.
        fold1_df = shuffled_df.iloc[:1460]
        # Select the remaining rows and assign to fold_two.
        fold2_df = shuffled_df.iloc[1460:]
        # Train on fold_one and test on fold_two.
        rmse_test1, rmse_train1 = get_rmse(train_df=fold1_df, test_df=fold2_df, train_features=train_features, test_features=test_features, target_col=target_col)
        # Train on fold_two and test on fold_one.
        rmse_test2, rmse_train2 = get_rmse(train_df=fold2_df, test_df=fold1_df, train_features=train_features, test_features=test_features, target_col=target_col)
        # Compute the average RMSE and return.
        avg_train = np.mean([rmse_train1, rmse_train2])
        avg_test = np.mean([rmse_test1, rmse_test2])
        print('Simple cross validation/holdout validation')
        print(f'Avg Test RMSE: {avg_test}, Avg Train RMSE: {avg_train}')
        return [avg_test, avg_train]

    # if k > 1, perform k-fold cross validation using k folds
    else:
        # Perform k-fold cross validation using k folds.
        kf = KFold(n_splits=k, shuffle=True, random_state=1)
        # Calculate the average RMSE value and return this value.
        # lr = LinearRegression()
        # rmses = cross_val_score(estimator=lr, X=df[selected_features], y=df[target_col], scoring='neg_root_mean_squared_error', cv=kf)
        test_rmses = []
        train_rmses = []
        for train_index, test_index, in kf.split(df):
            train_df = df.iloc[train_index]
            test_df = df.iloc[test_index]
            rmse_test, rmse_train = get_rmse(train_df, test_df, train_features, test_features, target_col)
            test_rmses.append(rmse_test)
            train_rmses.append(rmse_train)
        avg_train = np.mean(train_rmses)
        avg_test = np.mean(test_rmses)  
        print('K-fold cross validation')
        print(f'Avg Test RMSE: {avg_test}, Avg Train RMSE: {avg_train}')
        return [avg_test, avg_train]


In [93]:
transformed_df = transform_features(ames_df, lot_frontage_nbhd=False, combine_porch_sf=False)
filtered_df = select_features(transformed_df, num_corr_thresh=0.0, dummy_corr_thresh=0.1)
rmse = train_and_test(filtered_df, k=0)

Begin Transform Features with 82 columns and 2930 rows
End Transform Features with 69 columns and 2926 rows
Begin Feature Selection with 69 columns
Removed 0 columns because they contained 30 or more unique values: Index([], dtype='object')
Dropping columns if one value represents more than 95.0% of the data: Index(['Land Slope', 'Garage Cond', 'Heating', 'Roof Matl', 'Condition 2',
       'Street', 'Utilities'],
      dtype='object')
Holdout validation
Test RMSE: 23707.856487351604, Train RMSE: 26312.897911700937


In [94]:
rmse = train_and_test(filtered_df, k=1)

Simple cross validation/holdout validation
Avg Test RMSE: 27740.101294570308, Avg Train RMSE: 22837.6513111386


In [95]:
rmse = train_and_test(filtered_df, k=10)

K-fold cross validation
Avg Test RMSE: 24908.35614107272, Avg Train RMSE: 23961.31645258876


In [96]:
results['final']=rmse
results

{'original': [55943.34407431997, 57086.31613271551],
 'resolve null values': [31459.519644043823, 34272.84697547075],
 'new features': [27947.03139306928, 33425.29432231938],
 'updated transform_features() function': [28217.333098304378,
  33835.459860906936],
 'dummy columns': [25172.61071035756, 23874.39327541863],
 'dummy columns reduced variance': [24645.647930544466, 24768.03660552109],
 'dummy columns high correlation': [23876.835511422658, 28152.707588296424],
 'updated select_features() function': [23707.856487351604,
  26312.897911700937],
 'final': [24908.35614107272, 23961.31645258876]}

In [98]:
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['Test', 'Train'])
results_df

Unnamed: 0,Test,Train
original,55943.344074,57086.316133
resolve null values,31459.519644,34272.846975
new features,27947.031393,33425.294322
updated transform_features() function,28217.333098,33835.459861
dummy columns,25172.61071,23874.393275
dummy columns reduced variance,24645.647931,24768.036606
dummy columns high correlation,23876.835511,28152.707588
updated select_features() function,23707.856487,26312.897912
final,24908.356141,23961.316453
