# Predicting House Sale Prices (Linear Regression)

In this project I will work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. The goal of this project is to go through the entire data science workflow of data gathering, data processing, feature engineering, feature selection, modeling, optimizing, and analyzing the results.

In [1]:
import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from pandas_profiling import ProfileReport

from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import KFold

In [2]:
df = pd.read_csv('/Users/miesner.jacob/Desktop/DataQuest/datasets/AmesHousing.txt', delimiter = '\t')

In [3]:
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area','SalePrice']]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    features = numeric_train.columns.drop("SalePrice")
    linear_reg = linear_model.LinearRegression()
    linear_reg.fit(train[features], train["SalePrice"])
    predictions = linear_reg.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

In [4]:
# testing functions
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_1 = train_and_test(filtered_df)

'${:,.2f}'.format(rmse_1)

'$57,088.25'

# Feature Engineering

I am going to expand the transform_features function to include:

* removing features that we don't want to use in the model, just based on the number of missing values or data leakage
* transforming features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
* creating new features by combining other features

In [5]:
# profile = ProfileReport(df, title='Pandas Profiling Report')
# profile

In [6]:
#3 create new features by combining other features
years_sold = df['Yr Sold'] - df['Year Built']
years_sold[years_sold < 0]
years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
years_since_remod[years_since_remod < 0]
df['Years Before Sale'] = years_sold
df['Years Since Remod'] = years_since_remod
df = df.drop([1702, 2180, 2181], axis=0)
df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)

In [7]:
#1 Removing columns with more than 5% missing values and ones that either leak information about saleprice or arent useful
pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.95].index
df.drop(missing_features, axis=1, inplace=True)

missing = df.isnull().sum()
missing = missing[missing > 0]
missing = missing.index
df[missing]=df[missing].fillna(df[missing].mode().iloc[0])

df = df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)

In [8]:
#2 transforming features into the proper format
cols = ['Bsmt Full Bath', 'Bsmt Half Bath', 'Half Bath', 'Kitchen AbvGr']
df[cols] = df[cols].astype('int')
df.dtypes

MS SubClass            int64
MS Zoning             object
Lot Frontage         float64
Lot Area               int64
Street                object
Alley                 object
Lot Shape             object
Land Contour          object
Utilities             object
Lot Config            object
Land Slope            object
Neighborhood          object
Condition 1           object
Condition 2           object
Bldg Type             object
House Style           object
Overall Qual           int64
Overall Cond           int64
Roof Style            object
Roof Matl             object
Exterior 1st          object
Exterior 2nd          object
Mas Vnr Type          object
Mas Vnr Area         float64
Exter Qual            object
Exter Cond            object
Foundation            object
Bsmt Qual             object
Bsmt Cond             object
Bsmt Exposure         object
                      ...   
Bsmt Full Bath         int64
Bsmt Half Bath         int64
Full Bath              int64
Half Bath     

Now lets update our transform_features function and see if it improved our model performance

In [9]:
df = pd.read_csv('/Users/miesner.jacob/Desktop/DataQuest/datasets/AmesHousing.txt', delimiter = '\t')

In [10]:
def transform_features(df):
    years_sold = df['Yr Sold'] - df['Year Built']
    years_sold[years_sold < 0]
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    years_since_remod[years_since_remod < 0]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)
    
    pct_null = df.isnull().sum() / len(df)
    missing_features = pct_null[pct_null > 0.95].index
    df.drop(missing_features, axis=1, inplace=True)

    missing = df.isnull().sum()
    missing = missing[missing > 0]
    missing = missing.index
    df[missing]=df[missing].fillna(df[missing].mode().iloc[0])

    df = df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    cols = ['Bsmt Full Bath', 'Bsmt Half Bath', 'Half Bath', 'Kitchen AbvGr']
    df[cols] = df[cols].astype('int')
    return df

def select_features(df):
    return df[['Gr Liv Area','SalePrice']]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    features = numeric_train.columns.drop("SalePrice")
    linear_reg = linear_model.LinearRegression()
    linear_reg.fit(train[features], train["SalePrice"])
    predictions = linear_reg.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

In [11]:
# testing functions
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_2 = train_and_test(filtered_df)

pct = abs(((rmse_2/rmse_1) - 1) *100)
'Feature Transformation improved the models rmse by {pct} from {rmse_1} to {rmse_2}'.format(pct = '{:,.2f}%'.format(pct), rmse_1 = '${:,.2f}'.format(rmse_1), rmse_2 = '${:,.2f}'.format(rmse_2))

'Feature Transformation improved the models rmse by 3.18% from $57,088.25 to $55,275.37'

# Feature Selection

In performing feature selection I will look to see which columns correlate well with the target feature, which columns need to be converted to categorical types for creation of dummy variables, and which categorical columns have too many unique values.

In [12]:
abs(df.select_dtypes(include=['int', 'float']).drop(columns=['SalePrice']).corrwith(df['SalePrice'])).sort_values()

BsmtFin SF 2         0.005891
Misc Val             0.015691
Yr Sold              0.030569
Order                0.031408
3Ssn Porch           0.032225
Mo Sold              0.035259
Bsmt Half Bath       0.035835
Low Qual Fin SF      0.037660
Pool Area            0.068403
MS SubClass          0.085092
Overall Cond         0.101697
Screen Porch         0.112151
Kitchen AbvGr        0.119814
Enclosed Porch       0.128787
Bedroom AbvGr        0.143913
Bsmt Unf SF          0.182855
PID                  0.246521
Lot Area             0.266549
2nd Flr SF           0.269373
Bsmt Full Bath       0.276050
Half Bath            0.285056
Open Porch SF        0.312951
Wood Deck SF         0.327143
Lot Frontage         0.357318
BsmtFin SF 1         0.432914
Fireplaces           0.474558
TotRms AbvGrd        0.495474
Mas Vnr Area         0.508285
Garage Yr Blt        0.526965
Year Remod/Add       0.532974
Years Since Remod    0.534940
Full Bath            0.545604
Year Built           0.558426
Years Befo

In [13]:
#removing numerical columns that have a correlation of less than +/-20%
numerical_df = transform_df.select_dtypes(include=['int', 'float'])
numerical_df = abs(numerical_df.drop(columns=['SalePrice']).corrwith(df['SalePrice']))
numerical_df_drop = numerical_df[numerical_df < .20]
df = df.drop(numerical_df_drop.index, axis = 1)

In [14]:
#getting rid od categorical columns with over 10 unique values
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    
transform_cat_cols = []
for col in nominal_features:
    if col in df.columns:
        transform_cat_cols.append(col)
        
col_unique_counts = df[transform_cat_cols].nunique()
cols_drop = col_unique_counts[col_unique_counts >= 10]
df = df.drop(cols_drop.index, axis = 1)

In [15]:
#tranforming all string colums to categorical variables and then creating dummy variables
categorical_cols = df.select_dtypes(include=['object'])

for col in categorical_cols:
    df[col] = df[col].astype('category') 

df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(categorical_cols,axis=1)

Now lets update our select_features function and see if it improved our model performance

In [16]:
df = pd.read_csv('/Users/miesner.jacob/Desktop/DataQuest/datasets/AmesHousing.txt', delimiter = '\t')

In [17]:
def transform_features(df):
    years_sold = df['Yr Sold'] - df['Year Built']
    years_sold[years_sold < 0]
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    years_since_remod[years_since_remod < 0]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)
    
    pct_null = df.isnull().sum() / len(df)
    missing_features = pct_null[pct_null > 0.95].index
    df.drop(missing_features, axis=1, inplace=True)

    missing = df.isnull().sum()
    missing = missing[missing > 0]
    missing = missing.index
    df[missing]=df[missing].fillna(df[missing].mode().iloc[0])

    df = df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    cols = ['Bsmt Full Bath', 'Bsmt Half Bath', 'Half Bath', 'Kitchen AbvGr']
    df[cols] = df[cols].astype('int')
    return df

def select_features(df):
    numerical_df = transform_df.select_dtypes(include=['int', 'float'])
    numerical_df = abs(numerical_df.drop(columns=['SalePrice']).corrwith(df['SalePrice']))
    numerical_df_drop = numerical_df[numerical_df < .20]
    df = df.drop(numerical_df_drop.index, axis = 1)
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                        "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                        "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                        "Misc Feature", "Sale Type", "Sale Condition"]

    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)

    col_unique_counts = df[transform_cat_cols].nunique()
    cols_drop = col_unique_counts[col_unique_counts >= 10]
    df = df.drop(cols_drop.index, axis = 1)
    categorical_cols = df.select_dtypes(include=['object'])

    for col in categorical_cols:
        df[col] = df[col].astype('category') 

    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(categorical_cols,axis=1)

    return df

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    features = numeric_train.columns.drop("SalePrice")
    linear_reg = linear_model.LinearRegression()
    linear_reg.fit(train[features], train["SalePrice"])
    predictions = linear_reg.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

In [18]:
# testing functions
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_3 = train_and_test(filtered_df)

pct = abs(((rmse_3/rmse_2) - 1) *100)
'Feature Selection improved the models rmse by an additional {pct} from {rmse_2} to {rmse_3}'.format(pct = '{:,.2f}%'.format(pct), rmse_2 = '${:,.2f}'.format(rmse_2), rmse_3 = '${:,.2f}'.format(rmse_3))

'Feature Selection improved the models rmse by an additional 40.59% from $55,275.37 to $32,837.73'

# Adding Cross Validation to Model

In [19]:
df = pd.read_csv('/Users/miesner.jacob/Desktop/DataQuest/datasets/AmesHousing.txt', delimiter = '\t')

In [20]:
def transform_features(df):
    years_sold = df['Yr Sold'] - df['Year Built']
    years_sold[years_sold < 0]
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    years_since_remod[years_since_remod < 0]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)
    
    pct_null = df.isnull().sum() / len(df)
    missing_features = pct_null[pct_null > 0.95].index
    df.drop(missing_features, axis=1, inplace=True)

    missing = df.isnull().sum()
    missing = missing[missing > 0]
    missing = missing.index
    df[missing]=df[missing].fillna(df[missing].mode().iloc[0])

    df = df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    cols = ['Bsmt Full Bath', 'Bsmt Half Bath', 'Half Bath', 'Kitchen AbvGr']
    df[cols] = df[cols].astype('int')
    return df

def select_features(df):
    numerical_df = transform_df.select_dtypes(include=['int', 'float'])
    numerical_df = abs(numerical_df.drop(columns=['SalePrice']).corrwith(df['SalePrice']))
    numerical_df_drop = numerical_df[numerical_df < .20]
    df = df.drop(numerical_df_drop.index, axis = 1)
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                        "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                        "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                        "Misc Feature", "Sale Type", "Sale Condition"]

    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)

    col_unique_counts = df[transform_cat_cols].nunique()
    cols_drop = col_unique_counts[col_unique_counts >= 10]
    df = df.drop(cols_drop.index, axis = 1)
    categorical_cols = df.select_dtypes(include=['object'])

    for col in categorical_cols:
        df[col] = df[col].astype('category') 

    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(categorical_cols,axis=1)

    return df

def train_and_test(df, k = 0):
    numeric_df = df.select_dtypes(include=['integer', 'float'])
    features = df.columns.drop("SalePrice")
    linear_reg = linear_model.LinearRegression()
    
    if k == 0:
        train = df[:1460]
        test = df[1460:]

        numeric_train = train.select_dtypes(include=['integer', 'float'])
        numeric_test = test.select_dtypes(include=['integer', 'float'])
        linear_reg.fit(train[features], train["SalePrice"])
        predictions = linear_reg.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)
    elif k == 1:
        shuffled_df = df.sample(frac=1)
        train = df[:1460]
        test = df[1460:]

        linear_reg.fit(train[features], train["SalePrice"])
        predictions_one = linear_reg.predict(test[features])        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = mse_one ** (1/2)
        
        linear_reg.fit(test[features], test["SalePrice"])
        predictions_two = linear_reg.predict(train[features])        
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = mse_two ** (1/2)
        
        rmse = np.mean([rmse_one, rmse_two])
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index, in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            linear_reg.fit(train[features], train["SalePrice"])
            predictions = linear_reg.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = mse ** (1/2)
            rmse_values.append(rmse)
        rmse = np.mean(rmse_values)
        
    return rmse

In [21]:
# 0 folds
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_0_fold = train_and_test(filtered_df, k=0)

'${:,.2f}'.format(rmse_0_fold)

'$32,837.73'

In [22]:
# 1 folds
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_1_fold = train_and_test(filtered_df, k=1)

'${:,.2f}'.format(rmse_1_fold)

'$29,123.07'

In [23]:
# 2 folds
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_2_fold = train_and_test(filtered_df, k=2)

'${:,.2f}'.format(rmse_2_fold)

'$28,359.65'

In [24]:
# 3 folds
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_3_fold = train_and_test(filtered_df, k=3)

'${:,.2f}'.format(rmse_3_fold)

'$27,855.37'

In [25]:
# 10 folds
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse_10_fold = train_and_test(filtered_df, k=10)

'${:,.2f}'.format(rmse_10_fold)

'$26,979.97'

We see that the RMSE gets lower with more k-folds. Cross-validation does not improve accuracy but it helps give a better depiction of how accurate your model truley is.

This project shows an end to end data science workflow on Linear Regression (Ordinary Least Squares). This project really highlights the importance of feature engineering and feature selection. These are crucial steps in the data science workflow that cannot be neglected because they make sure that your models are given the best input possible.