## Predicting House Sale Prices

**Introduction**

This project aims to predict house sale prices. The dataset contains housing data for the city of Ames, Iowa, United States, from 2006 to 2010.

The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values.

Descriptions of the columns are available in the [data documentation](https://s3.amazonaws.com/dq-content/307/data_description.txt).


In [1]:
# libraries and classes
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_csv('AmesHousing.tsv', sep='\t', header=0)
df.head(3)

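| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |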


In [3]:
def transform_features(data):
    # placeholder for now: the split below is disabled and the data is returned unchanged
#     edge = int(0.7 * len(data))
#     train_set = data[:edge]
#     test_set = data[edge:]
    
    return data
    

In [4]:
def select_features(data):
    features = ['Gr Liv Area', 'SalePrice']
    
    return data[features]

In [6]:
def train_and_test(data):
    # holdout validation: first 1460 rows for training, the rest for testing
    train = data[:1460]
    test = data[1460:]

    # numeric columns only
    numeric_train = train.select_dtypes(exclude='object')
    numeric_test = test.select_dtypes(exclude='object')

    # features and target
    features = numeric_train.columns.drop('SalePrice')
    target = 'SalePrice'

    # instantiate and fit the regression model
    # note: LinearRegression raises on NaN, so missing values must be handled first
    lr_model = LinearRegression()
    lr_model.fit(numeric_train[features], numeric_train[target])

    # root mean squared error on the test set
    prediction = lr_model.predict(numeric_test[features])
    rmse = np.sqrt(mean_squared_error(numeric_test[target], prediction))

    return rmse

In [8]:
# train_and_test(df)
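
The holdout approach used in `train_and_test` can be sanity-checked on synthetic data, where the size of the error is roughly known in advance. This is only a sketch: the data is generated rather than taken from the Ames dataset, and the slope, intercept, noise level, and the helper name `holdout_rmse` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption): price is linear in living area plus Gaussian noise.
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=200)
price = 100 * area + 20000 + rng.normal(0, 5000, size=200)
synthetic = pd.DataFrame({'Gr Liv Area': area, 'SalePrice': price})

def holdout_rmse(data, split):
    """Fit on the first `split` rows and return the RMSE on the remaining rows."""
    train, test = data[:split], data[split:]
    features = train.columns.drop('SalePrice')
    model = LinearRegression()
    model.fit(train[features], train['SalePrice'])
    predictions = model.predict(test[features])
    return np.sqrt(mean_squared_error(test['SalePrice'], predictions))

rmse = holdout_rmse(synthetic, split=100)
print(round(rmse))
```

Because the noise has a standard deviation of 5000, an RMSE in that neighbourhood suggests the pipeline is wired correctly; a much larger value would point to a bug rather than a weak model.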

#### Feature Engineering

The purpose of this step is to:
- Remove features that we don't want to use in the model, just based on the number of missing values or data leakage
- Transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
- Create new features by combining other features
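
The three steps above can be sketched on a small toy frame. The column names mimic the dataset, but the values are invented, and the 5% cutoff, the helper name `engineer`, and the derived `Age At Sale` column are assumptions for illustration, not the final choices of this project.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    'Lot Area': [8450, 9600, 11250, 9550],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'Year Built': [2003, 1976, 2001, 1915],
    'Yr Sold': [2008, 2007, 2008, 2006],
    'MS SubClass': [60, 20, 60, 70],
    'SalePrice': [208500, 181500, 223500, 140000],
})

def engineer(data, max_missing=0.05):
    # 1. Drop columns whose share of missing values exceeds the cutoff.
    out = data.loc[:, data.isna().mean() <= max_missing].copy()
    # 2. Treat a numerically coded category as categorical.
    out['MS SubClass'] = out['MS SubClass'].astype('category')
    # 3. Combine two existing columns into a new feature.
    out['Age At Sale'] = out['Yr Sold'] - out['Year Built']
    return out

engineered = engineer(toy)
print(engineered.columns.tolist())
```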


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

In [21]:
# Inspecting the dataframe
na_cols = df.columns[df.isna().sum() / len(df) > 0.05]
NA_df = df[na_cols]

print('Columns with more than 5% missing values and their data types')
print('')
print(NA_df.info())

Columns with more than 5% missing values and their data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 11 columns):
Lot Frontage     2440 non-null float64
Alley            198 non-null object
Fireplace Qu     1508 non-null object
Garage Type      2773 non-null object
Garage Yr Blt    2771 non-null float64
Garage Finish    2771 non-null object
Garage Qual      2771 non-null object
Garage Cond      2771 non-null object
Pool QC          13 non-null object
Fence            572 non-null object
Misc Feature     106 non-null object
dtypes: float64(2), object(9)
memory usage: 251.9+ KB
None


In [22]:
# df with numeric columns only
NA_df.select_dtypes(exclude = 'object')

| | Lot Frontage | Garage Yr Blt |
|---|---|---|
| 0 | 141.0 | 1960.0 |
| 1 | 80.0 | 1961.0 |
| 2 | 81.0 | 1958.0 |
| 3 | 93.0 | 1968.0 |
| 4 | 74.0 | 1997.0 |
| 5 | 78.0 | 1998.0 |
| 6 | 41.0 | 2001.0 |
| 7 | 43.0 | 1992.0 |
| 8 | 39.0 | 1995.0 |
| 9 | 60.0 | 1999.0 |


In [None]:
def transform_features(data):
    df1 = data.copy()

    # keep only columns whose share of missing values is at most 5%
    df1 = df1[df1.columns[df1.isna().sum() / len(df1) <= 0.05]]

    return df1

In [12]:
df.columns[df.isna().sum()/len(df) > 0.25]

Index(['Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature'], dtype='object')
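
The choice of cutoff changes which columns get flagged, as the 5% and 25% checks in the cells above suggest. A minimal sketch of that comparison follows, on a toy frame with invented values. The helper name `flagged` is hypothetical, and `isna().mean()` is used as a shorthand for `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    'Lot Frontage': [141.0, 80.0, np.nan, 93.0, 74.0, 78.0, 41.0, 43.0, 39.0, 60.0],
    'Pool QC': [np.nan] * 9 + ['Ex'],
    'Lot Area': [31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5389, 7500],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

def flagged(share, threshold):
    """Return the columns whose missing share is above the threshold."""
    return share[share > threshold].index.tolist()

print(flagged(missing_share, 0.05))  # ['Lot Frontage', 'Pool QC']
print(flagged(missing_share, 0.25))  # ['Pool QC']
```

A column such as `Lot Frontage` in this toy frame, with 10% missing, is dropped under the stricter cutoff but survives the looser one, where it would need its gaps filled instead.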