# Predicting House Sale Prices

In this project, we'll work with linear regression. We'll use  housing data for the city of Ames, Iowa, United States from 2006 to 2010. You can read about the different columns in the data <a href="https://s3.amazonaws.com/dq-content/307/data_description.txt">here</a>.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

ames_housing = pd.read_csv('AmesHousing.tsv', sep='\t')
ames_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

Let's write some basic functions that we will modify during the project execution. At this stage, we will only work with column "Gr Liv Area". Our target column is "SalePrice".

In [2]:
from sklearn.metrics import mean_squared_error
def transform_features(data):
    return data

def select_features(df):
    return df[['Gr Liv Area']], df['SalePrice']

def train_and_test(data):
    test = transform_features(data)
    
    half = int(len(data)/2)
    train = data[:half]
    test = data[half:]

    num_train, train_target = select_features(train)
    num_test, test_targer = select_features(test)

    lr = LinearRegression()
    print(num_train.shape)
    print(train_target.shape)
    lr.fit(num_train, train_target)
    test_predict = lr.predict(num_test)
    
    test_mse = mean_squared_error(test_predict, test_targer)
    return np.sqrt(test_mse)

rmse = train_and_test(ames_housing)
rmse

(1465, 1)
(1465,)


57120.50729008638

So we have a starting point. Let's now try to improve the metric by removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns:

In [3]:
def select_features(df):
    price = df['SalePrice']
    df = df.drop(['SalePrice'], axis=1)
    return df.select_dtypes(include=['int','float']), price

def transform_features(data):
 
    cutoff_25 = len(data)/4
    missing_values = data.isnull().sum()
    missing_under_25pers = missing_values[(missing_values > 0) & 
                                (missing_values < cutoff_25)]

    missing_above_25pers = missing_values[(missing_values > cutoff_25)].index

    # delete all columns with more than 25% of missing data
    data = data.drop(missing_above_25pers, 1) 

    # delete all rows with missing data in target row
    data = data.dropna(subset=['SalePrice'], axis=0)

    # Missing under 25% numeric values we fill by mean values 
    data_under_25 = data[missing_under_25pers.index].select_dtypes(include=['float'])
    miss_num = data[missing_under_25pers.index].select_dtypes(include=['float']).columns
    data[miss_num] = data[miss_num].fillna(data[miss_num].mean())

    # Remove the remaining text columns with missing values
    data = data.dropna(axis='columns')
    return data

def train_and_test(data):
    transform_data = transform_features(data)
    
    half = int(len(transform_data)/2)
    train = transform_data[:half]
    test = transform_data[half:]

    num_train, train_target = select_features(train)

    num_test, test_targer = select_features(test)
    lr = LinearRegression()

    lr.fit(num_train, train_target)
    test_predict = lr.predict(num_test)
    
    test_mse = mean_squared_error(test_predict, test_targer)
    return np.sqrt(test_mse)

rmse = train_and_test(ames_housing)
rmse



72126.5749001712

Not very impressive. Let's analise closer numerical columns:

In [4]:
transform_df = transform_features(ames_housing)
numerical_df = transform_df.select_dtypes(include=['int', 'float'])
numerical_df.head(5)

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
0,1,526301100,20,141.0,31770,6,5,1960,1960,112.0,...,210,62,0,0,0,0,0,5,2010,215000
1,2,526350040,20,80.0,11622,5,6,1961,1961,0.0,...,140,0,0,0,120,0,0,6,2010,105000
2,3,526351010,20,81.0,14267,6,6,1958,1958,108.0,...,393,36,0,0,0,0,12500,6,2010,172000
3,4,526353030,20,93.0,11160,7,5,1968,1968,0.0,...,0,0,0,0,0,0,0,4,2010,244000
4,5,527105010,60,74.0,13830,5,5,1997,1998,0.0,...,212,34,0,0,0,0,0,3,2010,189900


In [5]:
# analyse the age of the house when it was sold

transform_df['Years Before Sale'] = transform_df['Yr Sold'] - transform_df['Year Remod/Add']

# let's remove all columns that do not correlate with the price of the house
transform_df = transform_df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)

# We analyze the links between the columns and the price
numerical_df = transform_df.select_dtypes(include=['int', 'float'])
saleprice_corr_coef = numerical_df.corr()['SalePrice'].abs().sort_values(ascending=False)
print(saleprice_corr_coef)

SalePrice            1.000000
Overall Qual         0.799262
Gr Liv Area          0.706780
Garage Cars          0.647861
Garage Area          0.640385
Total Bsmt SF        0.632105
1st Flr SF           0.621676
Full Bath            0.545604
Years Before Sale    0.534940
Garage Yr Blt        0.510684
Mas Vnr Area         0.505784
TotRms AbvGrd        0.495474
Fireplaces           0.474558
BsmtFin SF 1         0.432794
Lot Frontage         0.340751
Wood Deck SF         0.327143
Open Porch SF        0.312951
Half Bath            0.285056
Bsmt Full Bath       0.275894
2nd Flr SF           0.269373
Lot Area             0.266549
Bsmt Unf SF          0.182805
Bedroom AbvGr        0.143913
Enclosed Porch       0.128787
Kitchen AbvGr        0.119814
Screen Porch         0.112151
Overall Cond         0.101697
MS SubClass          0.085092
Pool Area            0.068403
Low Qual Fin SF      0.037660
Bsmt Half Bath       0.035815
3Ssn Porch           0.032225
Yr Sold              0.030569
Misc Val  

In [6]:
# remove columns where correlation is below 0.25.
corr_under_025 = saleprice_corr_coef[saleprice_corr_coef < 0.25].index
transform_df = transform_df.drop(corr_under_025, axis=1)
transform_df

Unnamed: 0,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,...,Functional,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Paved Drive,Wood Deck SF,Open Porch SF,SalePrice,Years Before Sale
0,RL,141.00000,31770,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,...,Typ,2,1960.000000,2.0,528.0,P,210,62,215000,50
1,RH,80.00000,11622,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,...,Typ,0,1961.000000,1.0,730.0,Y,140,0,105000,49
2,RL,81.00000,14267,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,...,Typ,0,1958.000000,1.0,312.0,Y,393,36,172000,52
3,RL,93.00000,11160,Pave,Reg,Lvl,AllPub,Corner,Gtl,NAmes,...,Typ,2,1968.000000,2.0,522.0,Y,0,0,244000,42
4,RL,74.00000,13830,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,...,Typ,1,1997.000000,2.0,482.0,Y,212,34,189900,12
5,RL,78.00000,9978,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,...,Typ,1,1998.000000,2.0,470.0,Y,360,36,195500,12
6,RL,41.00000,4920,Pave,Reg,Lvl,AllPub,Inside,Gtl,StoneBr,...,Typ,0,2001.000000,2.0,582.0,Y,0,0,213500,9
7,RL,43.00000,5005,Pave,IR1,HLS,AllPub,Inside,Gtl,StoneBr,...,Typ,0,1992.000000,2.0,506.0,Y,0,82,191500,18
8,RL,39.00000,5389,Pave,IR1,Lvl,AllPub,Inside,Gtl,StoneBr,...,Typ,1,1995.000000,2.0,608.0,Y,237,152,236500,14
9,RL,60.00000,7500,Pave,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,...,Typ,1,1999.000000,2.0,442.0,Y,140,60,189000,11


 Let's put it all together:

In [7]:
def transform_features(data):

    cutoff_25 = len(data)/4
    missing_values = data.isnull().sum()
    missing_under_25pers = missing_values[(missing_values > 0) & 
                                (missing_values < cutoff_25)]

    missing_above_25pers = missing_values[(missing_values > cutoff_25)].index

    # delete all columns with more than 25% of missing data
    data = data.drop(missing_above_25pers, 1) 

    # delete all rows with missing data in the target row
    data = data.dropna(subset=['SalePrice'], axis=0)

    # we fill missing under 25% numeric values by mean values 
    data_under_25 = data[missing_under_25pers.index].select_dtypes(include=['float'])
    miss_num = data[missing_under_25pers.index].select_dtypes(include=['float']).columns
    data[miss_num] = data[miss_num].fillna(data[miss_num].mean())

    # Remove the remaining text columns with missing values
    data = data.dropna(axis='columns')
    
    # analyse the age of the house when it was sold

    data['Years Before Sale'] = data['Yr Sold'] - data['Year Remod/Add']

    # let's remove all columns that do not correlate with the price of the house
    data = data.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)

    # We analyze the links between the columns and the price
    numerical_df = data.select_dtypes(include=['int', 'float'])
    saleprice_corr_coef = numerical_df.corr()['SalePrice'].abs().sort_values()

    # remove columns where correlation is below 0.35.
    corr_under_025 = saleprice_corr_coef[saleprice_corr_coef < 0.35].index
    data = data.drop(corr_under_025, axis=1)
    
    return data

rmse = train_and_test(ames_housing)
rmse


40766.49251628936

Analyzing the documentation, we find out that therest of text columns can be of the 'category' type. 

In [8]:

text_df = transform_df.select_dtypes(include=['object'])
text_df = text_df.columns
text_df



Index(['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
       'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Paved Drive'],
      dtype='object')

Let's convert them to the 'dummies'. 

In [9]:
def select_features(df):
    
    for col in text_df:
        df[col] = df[col].astype('category')
        col_dummies = pd.get_dummies(df[col])
        df = pd.concat([df, col_dummies], axis=1).drop([col], axis=1)
    
    price = df['SalePrice']
    df = df.drop(['SalePrice'], axis=1)        
    return df, price


def transform_features(data):

    cutoff_25 = len(data)/4
    missing_values = data.isnull().sum()
    
    # find all columns with less than 25% missing values
    missing_under_25pers = missing_values[(missing_values > 0) & 
                                (missing_values < cutoff_25)]

    missing_above_25pers = missing_values[(missing_values > cutoff_25)].index

    # delete all columns with more than 25% of missing data
    data = data.drop(missing_above_25pers, 1) 

    # delete all rows with missing data in target row
    data = data.dropna(subset=['SalePrice'], axis=0)

    # Missing under 25% numeric values we fill by mean values 
    data_under_25 = data[missing_under_25pers.index].select_dtypes(include=['float'])
    miss_num = data[missing_under_25pers.index].select_dtypes(include=['float']).columns
    data[miss_num] = data[miss_num].fillna(data[miss_num].mean())

    # Remove the remaining text columns with missing values
    data = data.dropna(axis='columns')
    
    # analyse the age of the house when it was sold

    data['Years Before Sale'] = data['Yr Sold'] - data['Year Remod/Add']

    # let's remove all columns that do not correlate with the price of the house
    data = data.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)

    # We analyze the links between the columns and the price
    numerical_df = data.select_dtypes(include=['int', 'float'])
    saleprice_corr_coef = numerical_df.corr()['SalePrice'].abs().sort_values()

    # remove columns where correlation is below 0.35.
    corr_under_025 = saleprice_corr_coef[saleprice_corr_coef < 0.35].index
    data = data.drop(corr_under_025, axis=1)

    return data


def train_and_test(data):
    transform_data = transform_features(data)
    # let's randomise data
    
    transform_data.sample(frac=1, random_state=1)
    data, target = select_features(transform_data)
    
    half = int(len(transform_data)/2)
    
    train_data = data[:half]
    test_data = data[half:]   
    
    train_target = target[:half]
    test_target = target[half:]
    
    lr = LinearRegression()
    lr.fit(train_data, train_target)
    test_predict = lr.predict(test_data)
    
    test_mse = mean_squared_error(test_predict, test_target)
    return np.sqrt(test_mse)




rmse = train_and_test(ames_housing)
rmse

35713.34109737731

Great job! Now let's try to get the maximum accuracy using K-Folds cross-validator.

In [10]:
def select_features(df):
    
    for col in text_df:
        df[col] = df[col].astype('category')
        col_dummies = pd.get_dummies(df[col])
        df = pd.concat([df, col_dummies], axis=1).drop([col], axis=1)
    return df


def train_and_test(data, k):
    transform_data = transform_features(data)
    clean_data = select_features(transform_data)
    
    target = clean_data['SalePrice']
    clean_data = clean_data.drop(['SalePrice'], axis=1) 
    
    kf = KFold(n_splits=k, shuffle=True)
    rmse_values = []
    
    for train_index, test_index, in kf.split(clean_data):
        train = clean_data.iloc[train_index]
        test = clean_data.iloc[test_index]
        
        train_tg = target.iloc[train_index]
        test_tg = target.iloc[test_index]
        
        lr = LinearRegression()
        lr.fit(train, train_tg)
        predictions = lr.predict(test)
        
        mse = mean_squared_error(test_tg, predictions)
        rmse = np.sqrt(mse)
        rmse_values.append(rmse)
    print(rmse_values)
    avg_rmse = np.mean(rmse_values)
    return avg_rmse

rmse = train_and_test(ames_housing, k=4)

rmse.mean()

[31374.402380949636, 25006.755576604417, 38503.16790472058, 27061.942985748537]


30486.567212005793

We got a result that is almost 2 times better than the original one using cleaning, transforming, and selecting features metods.