# Predicting House Sale Prices

In this project we will build a linear regression model that predicts house sale prices, with a focus on feature engineering. We will analyze housing data for the city of Ames, Iowa, U.S.A., from 2006 to 2010. Detailed information about the columns in the data can be found at the following link: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.

## Reading the data

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
pd.options.display.max_columns = 999

In [155]:
data = pd.read_csv(r"Data\AmesHousing.txt",delimiter="\t")

In [156]:
data.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


We are going to create three functions that we will expand throughout this project:
- **transform_features** - cleaning the data
- **select_features** - selecting the most relevant features
- **train_and_test** - training and testing a linear regression model

In [157]:
def transform_features(df):
    return df

In [158]:
def select_features(df):
    return df[["Gr Liv Area","SalePrice"]]

In [159]:
def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    #We will use numeric columns and drop our target column.
    numeric_train = train.select_dtypes(include=["integer","float"])
    features = numeric_train.columns.drop("SalePrice")
    
    #Let's train the model and make predictions.
    lr = linear_model.LinearRegression()
    lr.fit(train[features],train["SalePrice"])
    predictions = lr.predict(test[features])
    
    #Finally we will calculate the metrics that will help us to evaluate the model.
    mse = mean_squared_error(predictions, test["SalePrice"])
    rmse = np.sqrt(mse)
    
    return rmse

transform_data = transform_features(data)
filtered_data = select_features(transform_data)
rmse = train_and_test(filtered_data)

rmse

57088.251612639091
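
To make the metric concrete, here is a minimal sketch of how RMSE is computed by hand. The prices and predictions are invented for illustration and are not taken from the Ames data.

```python
import numpy as np

# Hypothetical sale prices and predictions, invented for illustration only.
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# RMSE: square the errors, average them, then take the square root.
errors = predicted - actual
rmse_example = np.sqrt(np.mean(errors ** 2))
print(rmse_example)
```

Because the errors are squared before averaging, the single 30,000 miss contributes far more to the result than the two 10,000 misses. RMSE is expressed in the same unit as the target, here dollars.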

## Feature Engineering

We will handle missing values in our data according to the following rules:
- For all columns, we will drop any column with 5% or more missing values
- For text columns, we will drop any column with 1 or more missing values
- For numerical columns, we will fill missing values with the column mean
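
As a rough sketch of how these three rules interact, the snippet below applies them to a four-row toy frame. The column names and values are hypothetical, and the threshold is raised to 50% because a 5% threshold on four rows would drop every column with a single gap.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names are hypothetical.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "text_with_gap": ["a", None, "b", "c"],           # text, one gap
    "number_with_gap": [1.0, np.nan, 3.0, 5.0],       # numeric, one gap
    "complete": [10, 20, 30, 40],
})

# Rule 1: drop columns at or above the missing-value threshold.
# The project uses 0.05; 0.5 is an assumption that suits this tiny frame.
threshold = 0.5 * len(toy)
missing = toy.isnull().sum()
toy = toy[missing[missing < threshold].index]

# Rule 2: drop non-numeric (text) columns that still have any gap.
text_missing = toy.select_dtypes(exclude=["number"]).isnull().sum()
toy = toy.drop(text_missing[text_missing > 0].index, axis=1)

# Rule 3: fill the remaining numeric gaps with each column's mean.
numeric_cols = toy.select_dtypes(include=["number"]).columns
toy = toy.fillna(toy[numeric_cols].mean())
print(toy)
```

Passing a Series of means to `fillna` fills each column with its own mean, which is why one call can handle several numeric columns at once.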

### For all columns we will drop any with 5% or more missing values

In [160]:
#Let's see how many missing values we have.
missing_values = data.isnull().sum()
missing_values

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       490
Lot Area             0
Street               0
Alley             2732
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        23
Mas Vnr Area        23
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual

In [161]:
#We will drop columns in which 5% or more of the values are missing. The variable below holds that threshold as a number of rows.
ratio = 0.05*data.shape[0]
ratio

146.5

In [162]:
clean_data_cols = missing_values[(missing_values < ratio)].index
data = data[clean_data_cols]

In [163]:
#All columns with more than 146.5 missing values should now be removed.
data.isnull().sum()

Order               0
PID                 0
MS SubClass         0
MS Zoning           0
Lot Area            0
Street              0
Lot Shape           0
Land Contour        0
Utilities           0
Lot Config          0
Land Slope          0
Neighborhood        0
Condition 1         0
Condition 2         0
Bldg Type           0
House Style         0
Overall Qual        0
Overall Cond        0
Year Built          0
Year Remod/Add      0
Roof Style          0
Roof Matl           0
Exterior 1st        0
Exterior 2nd        0
Mas Vnr Type       23
Mas Vnr Area       23
Exter Qual          0
Exter Cond          0
Foundation          0
Bsmt Qual          80
                   ..
Electrical          1
1st Flr SF          0
2nd Flr SF          0
Low Qual Fin SF     0
Gr Liv Area         0
Bsmt Full Bath      2
Bsmt Half Bath      2
Full Bath           0
Half Bath           0
Bedroom AbvGr       0
Kitchen AbvGr       0
Kitchen Qual        0
TotRms AbvGrd       0
Functional          0
Fireplaces

### For text columns we will drop any with 1 or more missing values

In [164]:
text_cols = data.select_dtypes(include=["object"]).isnull().sum()
text_cols = text_cols[(text_cols>0)].index
data = data.drop(text_cols, axis=1)
data.select_dtypes(include=["object"]).isnull().sum()

MS Zoning         0
Street            0
Lot Shape         0
Land Contour      0
Utilities         0
Lot Config        0
Land Slope        0
Neighborhood      0
Condition 1       0
Condition 2       0
Bldg Type         0
House Style       0
Roof Style        0
Roof Matl         0
Exterior 1st      0
Exterior 2nd      0
Exter Qual        0
Exter Cond        0
Foundation        0
Heating           0
Heating QC        0
Central Air       0
Kitchen Qual      0
Functional        0
Paved Drive       0
Sale Type         0
Sale Condition    0
dtype: int64

### For numerical columns we will fill missing values with mean value

In [165]:
num_missing_cols = data.select_dtypes(include=["integer", "float"]).isnull().sum()
num_missing_cols = num_missing_cols[(num_missing_cols>0)].index
num_missing_cols

Index(['Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
       'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars',
       'Garage Area'],
      dtype='object')

In [166]:
#We will fill missing values with mean value.
data = data.fillna(data[num_missing_cols].mean())

In [167]:
data.select_dtypes(include=["integer", "float"]).isnull().sum()

Order              0
PID                0
MS SubClass        0
Lot Area           0
Overall Qual       0
Overall Cond       0
Year Built         0
Year Remod/Add     0
Mas Vnr Area       0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Total Bsmt SF      0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Bsmt Full Bath     0
Bsmt Half Bath     0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Garage Cars        0
Garage Area        0
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
Misc Val           0
Mo Sold            0
Yr Sold            0
SalePrice          0
dtype: int64

### What new features can we create that better capture the information in some of the existing features?

In [168]:
#We will create a new feature: the number of years between the year the house was built and the year it was sold.
years_sold = data['Yr Sold'] - data['Year Built']
years_sold[years_sold<0]

2180   -1
dtype: int64

In [169]:
#We will create a new feature: the number of years between the year the house was remodeled and the year it was sold.
years_remod = data['Yr Sold'] - data['Year Remod/Add']
years_remod[years_remod<0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [170]:
#Let's create new columns with the values we just calculated. We will also remove the rows with negative values.
data['Years before Sold'] = years_sold
data['Years since Remod'] = years_remod

data = data.drop([1702,2180,2181], axis=0)
data[data['Years since Remod']<0]

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,Years before Sold,Years since Remod
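
The row labels 1702, 2180 and 2181 are specific to this file. As an alternative sketch (not what the notebook does), rows with negative values could be removed with a boolean mask, which does not depend on hardcoded labels. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration; the second house is "sold" before it is built.
toy = pd.DataFrame({
    "Yr Sold": [2010, 2007, 2009],
    "Year Built": [1960, 2008, 2000],
    "Year Remod/Add": [1960, 2009, 2005],
})
toy["Years before Sold"] = toy["Yr Sold"] - toy["Year Built"]
toy["Years since Remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only the rows where both derived features are non-negative.
valid = (toy["Years before Sold"] >= 0) & (toy["Years since Remod"] >= 0)
toy = toy[valid]
print(toy)
```

A mask like this would keep working if the file were re-sorted or extended, at the cost of silently dropping any new inconsistent rows instead of making them visible.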


### Now we are going to drop columns that:

- aren't useful for ML
- leak data about the final sale

In [171]:
#First we will drop columns that are not useful for ML.
data = data.drop(["PID", "Order"], axis=1)

#Now let's drop columns that leak data about the final sale.
data = data.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)

Let's update the first function (transform_features()) with the changes we made above.

In [222]:
def transform_features(df):
    missing_values = df.isnull().sum()
    clean_data_cols = missing_values[(missing_values < 0.05*df.shape[0])].index
    df = df[clean_data_cols]
    
    text_cols = df.select_dtypes(include=["object"]).isnull().sum()
    text_cols = text_cols[(text_cols>0)].index
    df = df.drop(text_cols, axis=1)

    num_missing_cols = df.select_dtypes(include=["integer", "float"]).isnull().sum()
    num_missing_cols = num_missing_cols[(num_missing_cols>0)].index

    df = df.fillna(df[num_missing_cols].mean())

    years_sold = df['Yr Sold'] - df['Year Built']
    years_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years before Sold'] = years_sold
    df['Years since Remod'] = years_remod

    df = df.drop([1702,2180,2181], axis=0)

    #First we will drop columns that are not useful for ML.
    df = df.drop(["PID", "Order"], axis=1)
    #Now let's drop columns that leak data about the final sale.
    df = df.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    
    return df

def select_features(df):
    return df[["Gr Liv Area","SalePrice"]]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=["integer","float"])
    features = numeric_train.columns.drop("SalePrice")
    
    lr = linear_model.LinearRegression()
    lr.fit(train[features],train["SalePrice"])
    predictions = lr.predict(test[features])
    
    mse = mean_squared_error(predictions, test["SalePrice"])
    rmse = np.sqrt(mse)
    
    return rmse

data = pd.read_csv(r"Data\AmesHousing.txt",delimiter="\t")
transform_data = transform_features(data)
filtered_data = select_features(transform_data)
rmse = train_and_test(filtered_data)

rmse

55275.367312413066

## Feature Selection

We will select the most relevant numerical features by checking how strongly each one correlates with the target column, SalePrice. We will also convert the remaining text columns to the categorical type and create dummy columns from them (each category value will be represented by its own column).

In [223]:
num_data = transform_data.select_dtypes(include=['integer', 'float'])
num_data.head()

Unnamed: 0,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,SalePrice,Years before Sold,Years since Remod
0,20,31770,6,5,1960,1960,112.0,639.0,0.0,441.0,1080.0,1656,0,0,1656,1.0,0.0,1,0,3,1,7,2,2.0,528.0,210,62,0,0,0,0,0,215000,50,50
1,20,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1.0,730.0,140,0,0,0,120,0,0,105000,49,49
2,20,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1.0,312.0,393,36,0,0,0,0,12500,172000,52,52
3,20,11160,7,5,1968,1968,0.0,1065.0,0.0,1045.0,2110.0,2110,0,0,2110,1.0,0.0,2,1,3,1,8,2,2.0,522.0,0,0,0,0,0,0,0,244000,42,42
4,60,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,2.0,482.0,212,34,0,0,0,0,0,189900,13,12


In [224]:
num_data_corr = num_data.corr()["SalePrice"].abs().sort_values(ascending=False)
num_data_corr

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648411
Total Bsmt SF        0.643601
Garage Area          0.641675
1st Flr SF           0.635185
Years before Sold    0.558979
Year Built           0.558490
Full Bath            0.546118
Years since Remod    0.534985
Year Remod/Add       0.533007
Mas Vnr Area         0.510611
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.438928
Wood Deck SF         0.328183
Open Porch SF        0.316262
Half Bath            0.284871
Bsmt Full Bath       0.276329
2nd Flr SF           0.269601
Lot Area             0.267520
Bsmt Unf SF          0.182248
Bedroom AbvGr        0.143916
Enclosed Porch       0.128685
Kitchen AbvGr        0.119760
Screen Porch         0.112280
Overall Cond         0.101540
MS SubClass          0.085128
Pool Area            0.068438
Low Qual Fin SF      0.037629
Bsmt Half Bath       0.035874
3Ssn Porch           0.032268
Misc Val  

In [225]:
#We will keep only the columns whose absolute correlation with SalePrice is at least 0.4.
transform_data = transform_data.drop(num_data_corr[num_data_corr<0.4].index, axis=1)
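
The same selection step can be sketched on a small toy frame. The column names `area` and `noise` and all values are invented for illustration; only the 0.4 cutoff is taken from the cell above.

```python
import pandas as pd

# Toy data, invented for illustration: "area" tracks the price, "noise" does not.
toy = pd.DataFrame({
    "area": [50, 80, 100, 120, 150],
    "noise": [3, 1, 3, 1, 3],
    "SalePrice": [100, 160, 210, 230, 300],
})

# Absolute correlation of every numeric column with the target.
corr = toy.corr()["SalePrice"].abs().sort_values(ascending=False)

# Drop the columns below the cutoff; SalePrice correlates perfectly with itself, so it stays.
weak = corr[corr < 0.4].index
selected = toy.drop(weak, axis=1)
print(selected.columns.tolist())
```

Taking the absolute value matters: a feature with a strong negative correlation is just as informative for a linear model as one with a strong positive correlation.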

In [226]:
transform_data.head()

Unnamed: 0,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,Gr Liv Area,Full Bath,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,SalePrice,Years before Sold,Years since Remod
0,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,1960,1960,Hip,CompShg,BrkFace,Plywood,112.0,TA,TA,CBlock,639.0,1080.0,GasA,Fa,Y,1656,1656,1,TA,7,Typ,2,2.0,528.0,P,215000,50,50
1,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,1961,1961,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,CBlock,468.0,882.0,GasA,TA,Y,896,896,1,TA,5,Typ,0,1.0,730.0,Y,105000,49,49
2,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,108.0,TA,TA,CBlock,923.0,1329.0,GasA,TA,Y,1329,1329,1,Gd,6,Typ,0,1.0,312.0,Y,172000,52,52
3,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,1968,1968,Hip,CompShg,BrkFace,BrkFace,0.0,Gd,TA,CBlock,1065.0,2110.0,GasA,Ex,Y,2110,2110,2,Ex,8,Typ,2,2.0,522.0,Y,244000,42,42
4,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,PConc,791.0,928.0,GasA,Gd,Y,928,1629,2,TA,6,Typ,1,2.0,482.0,Y,189900,13,12


All of the columns marked as nominal in the documentation are candidates for conversion to the categorical type.

In [227]:
#We will create a list of column names from documentation that were marked as nominal.
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

In [228]:
#We will keep only the nominal features that were not removed from our data during cleaning.
transform_cat_col = []
for feature in nominal_features:
    if feature in transform_data.columns:
        transform_cat_col.append(feature)
transform_cat_col

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Foundation',
 'Heating',
 'Central Air']

In [229]:
#Now we are going to check how many unique values each of the above columns has.
unique_counts = {}
for col in transform_cat_col:
    value = len(transform_data[col].value_counts())
    unique_counts[col] = value

In [230]:
unique_counts

{'Bldg Type': 5,
 'Central Air': 2,
 'Condition 1': 9,
 'Condition 2': 8,
 'Exterior 1st': 16,
 'Exterior 2nd': 17,
 'Foundation': 6,
 'Heating': 6,
 'House Style': 8,
 'Land Contour': 4,
 'Lot Config': 5,
 'MS Zoning': 7,
 'Neighborhood': 28,
 'Roof Matl': 8,
 'Roof Style': 6,
 'Street': 2}

In [231]:
#We will drop columns with more than 10 unique values.
drop_col = []
for key, value in unique_counts.items():
    if  value > 10:
        drop_col.append(key)
drop_col

['Neighborhood', 'Exterior 1st', 'Exterior 2nd']
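
The counting and filtering loops above can also be written more compactly with `DataFrame.nunique()`. This is only a sketch on invented toy values, with the cutoff lowered to 2 so that the effect is visible on four rows.

```python
import pandas as pd

# Toy categorical data; the values are illustrative, not a sample of the real file.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["NAmes", "Gilbert", "StoneBr", "NWAmes"],
})

# The project uses a cutoff of 10; 2 is an assumption chosen for this tiny frame.
uniq_threshold = 2

# Number of distinct non-missing values per column.
unique_counts = toy.nunique()

# Columns with too many categories would create too many dummy columns.
drop_col = unique_counts[unique_counts > uniq_threshold].index.tolist()
print(drop_col)
```

Like `value_counts()`, `nunique()` ignores missing values by default, so both approaches count the same categories.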

In [232]:
transform_data = transform_data.drop(drop_col, axis=1)
transform_data.head()

Unnamed: 0,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Year Built,Year Remod/Add,Roof Style,Roof Matl,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,Gr Liv Area,Full Bath,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,SalePrice,Years before Sold,Years since Remod
0,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,1960,1960,Hip,CompShg,112.0,TA,TA,CBlock,639.0,1080.0,GasA,Fa,Y,1656,1656,1,TA,7,Typ,2,2.0,528.0,P,215000,50,50
1,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,Feedr,Norm,1Fam,1Story,5,1961,1961,Gable,CompShg,0.0,TA,TA,CBlock,468.0,882.0,GasA,TA,Y,896,896,1,TA,5,Typ,0,1.0,730.0,Y,105000,49,49
2,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,1958,1958,Hip,CompShg,108.0,TA,TA,CBlock,923.0,1329.0,GasA,TA,Y,1329,1329,1,Gd,6,Typ,0,1.0,312.0,Y,172000,52,52
3,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,7,1968,1968,Hip,CompShg,0.0,Gd,TA,CBlock,1065.0,2110.0,GasA,Ex,Y,2110,2110,2,Ex,8,Typ,2,2.0,522.0,Y,244000,42,42
4,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,2Story,5,1997,1998,Gable,CompShg,0.0,TA,TA,PConc,791.0,928.0,GasA,Gd,Y,928,1629,2,TA,6,Typ,1,2.0,482.0,Y,189900,13,12


In [245]:
#Let's change the type of remaining text columns to categorical.
text_col = transform_data.select_dtypes(include=["object"])
for col in text_col:
    transform_data[col] = transform_data[col].astype("category")

In [246]:
#Let's create dummy columns from our categorical columns.
transform_data = pd.concat([
    transform_data,
    pd.get_dummies(transform_data.select_dtypes(include=["category"]))
], axis=1)
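
A minimal sketch of what `pd.get_dummies` produces, using an invented three-row frame. The `dtype=int` argument is an addition that the notebook does not use: in recent pandas versions the dummy columns default to booleans, which a later `select_dtypes(include=["integer", "float"])` would skip, so requesting integers is one way to keep them among the numeric features.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Central Air": ["Y", "N", "Y"], "Full Bath": [1, 2, 1]})
toy["Central Air"] = toy["Central Air"].astype("category")

# One 0/1 column per category value; dtype=int is an assumption, see the note above.
dummies = pd.get_dummies(toy.select_dtypes(include=["category"]), dtype=int)

# Append the dummy columns to the original frame, as the notebook does.
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

Each dummy column is named `<original column>_<value>`, which matches the `Central Air_N` and `Central Air_Y` columns visible in the notebook output.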

In [247]:
transform_data.head()

Unnamed: 0,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Year Built,Year Remod/Add,Roof Style,Roof Matl,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,Gr Liv Area,Full Bath,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,SalePrice,Years before Sold,Years since Remod,MS Zoning_A (agr),MS Zoning_C (all),MS Zoning_FV,MS Zoning_I (all),MS Zoning_RH,MS Zoning_RL,MS Zoning_RM,Street_Grvl,Street_Pave,Lot Shape_IR1,Lot Shape_IR2,Lot Shape_IR3,Lot Shape_Reg,Land Contour_Bnk,Land Contour_HLS,Land Contour_Low,Land Contour_Lvl,Utilities_AllPub,Utilities_NoSeWa,Utilities_NoSewr,Lot Config_Corner,Lot Config_CulDSac,Lot Config_FR2,Lot Config_FR3,Lot Config_Inside,Land Slope_Gtl,Land Slope_Mod,Land Slope_Sev,Condition 1_Artery,Condition 1_Feedr,Condition 1_Norm,Condition 1_PosA,Condition 1_PosN,Condition 1_RRAe,Condition 1_RRAn,Condition 1_RRNe,Condition 1_RRNn,Condition 2_Artery,Condition 2_Feedr,Condition 2_Norm,Condition 2_PosA,Condition 2_PosN,Condition 2_RRAe,Condition 2_RRAn,Condition 2_RRNn,Bldg Type_1Fam,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl,Roof Style_Flat,Roof Style_Gable,Roof Style_Gambrel,Roof Style_Hip,Roof Style_Mansard,Roof Style_Shed,Roof Matl_ClyTile,Roof Matl_CompShg,Roof Matl_Membran,Roof Matl_Metal,Roof Matl_Roll,Roof Matl_Tar&Grv,Roof Matl_WdShake,Roof Matl_WdShngl,Exter Qual_Ex,Exter Qual_Fa,Exter Qual_Gd,Exter Qual_TA,Exter Cond_Ex,Exter Cond_Fa,Exter Cond_Gd,Exter Cond_Po,Exter Cond_TA,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,Heating QC_Ex,Heating QC_Fa,Heating QC_Gd,Heating QC_Po,Heating QC_TA,Central Air_N,Central Air_Y,Kitchen Qual_Ex,Kitchen Qual_Fa,Kitchen Qual_Gd,Kitchen Qual_Po,Kitchen Qual_TA,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sal,Functional_Sev,Functional_Typ,Paved Drive_N,Paved Drive_P,Paved Drive_Y
0,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,1960,1960,Hip,CompShg,112.0,TA,TA,CBlock,639.0,1080.0,GasA,Fa,Y,1656,1656,1,TA,7,Typ,2,2.0,528.0,P,215000,50,50,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0
1,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,Feedr,Norm,1Fam,1Story,5,1961,1961,Gable,CompShg,0.0,TA,TA,CBlock,468.0,882.0,GasA,TA,Y,896,896,1,TA,5,Typ,0,1.0,730.0,Y,105000,49,49,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
2,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,1958,1958,Hip,CompShg,108.0,TA,TA,CBlock,923.0,1329.0,GasA,TA,Y,1329,1329,1,Gd,6,Typ,0,1.0,312.0,Y,172000,52,52,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1
3,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,7,1968,1968,Hip,CompShg,0.0,Gd,TA,CBlock,1065.0,2110.0,GasA,Ex,Y,2110,2110,2,Ex,8,Typ,2,2.0,522.0,Y,244000,42,42,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,2Story,5,1997,1998,Gable,CompShg,0.0,TA,TA,PConc,791.0,928.0,GasA,Gd,Y,928,1629,2,TA,6,Typ,1,2.0,482.0,Y,189900,13,12,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1


- Let's update the second function (select_features()) with the changes we made above.
- Also for the last function (train_and_test()) we will add k-fold cross validation.
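
Before reading the updated functions, it may help to see how `KFold` partitions the rows. The sketch below uses eight placeholder rows instead of the housing data, and fixes `random_state` (which the notebook does not do) so that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight placeholder rows standing in for the housing data.
rows = np.arange(8)

# random_state is an assumption added here for repeatability.
kf = KFold(n_splits=4, shuffle=True, random_state=1)

test_rows_seen = []
for train_index, test_index in kf.split(rows):
    # Within a fold, no row is used for both training and testing.
    assert len(set(train_index) & set(test_index)) == 0
    test_rows_seen.extend(test_index.tolist())
    print("train:", train_index, "test:", test_index)

# Across all folds, every row is tested exactly once.
print(sorted(test_rows_seen))
```

Because every row ends up in a test fold exactly once, the average of the per-fold RMSE values depends less on one lucky or unlucky split than a single holdout score does. Without a fixed `random_state`, the folds and therefore the RMSE values change from run to run.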

In [282]:
def transform_features(df):
    missing_values = df.isnull().sum()
    clean_data_cols = missing_values[(missing_values < 0.05*df.shape[0])].index
    df = df[clean_data_cols]
    
    text_cols = df.select_dtypes(include=["object"]).isnull().sum()
    text_cols = text_cols[(text_cols>0)].index
    df = df.drop(text_cols, axis=1)

    num_missing_cols = df.select_dtypes(include=["integer", "float"]).isnull().sum()
    num_missing_cols = num_missing_cols[(num_missing_cols>0)].index
    df = df.fillna(df[num_missing_cols].mean())

    
    years_sold = df['Yr Sold'] - df['Year Built']
    years_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years before Sold'] = years_sold
    df['Years since Remod'] = years_remod

    df = df.drop([1702,2180,2181], axis=0)

    #First we will drop columns that are not useful for ML
    df = df.drop(["PID", "Order"], axis=1)
    #Now let's drop columns that leak data about the final sale
    df = df.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold", "Year Built", "Year Remod/Add"], axis=1)

    return df

def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    
    num_data = df.select_dtypes(include=['integer', 'float'])
    num_data_corr = num_data.corr()["SalePrice"].abs().sort_values(ascending=False)
    df = df.drop(num_data_corr[num_data_corr < coeff_threshold].index, axis=1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_col = []
    for feature in nominal_features:
        if feature in df.columns:
            transform_cat_col.append(feature)
    
    unique_counts = {}
    for col in transform_cat_col:
        value = len(df[col].value_counts())
        unique_counts[col] = value
    
    drop_col = []
    for key, value in unique_counts.items():
        if value > uniq_threshold:
            drop_col.append(key)
        
    df = df.drop(drop_col, axis=1)
    
    text_col = df.select_dtypes(include=["object"])
    
    for col in text_col:
        df[col] = df[col].astype("category")
    df = pd.concat([df,pd.get_dummies(df.select_dtypes(include=["category"]))], axis=1)
    
    return df

def train_and_test(df, k):
    
    numeric_df = df.select_dtypes(include=['integer', 'float'])
    features = numeric_df.columns.drop("SalePrice")
    lr = linear_model.LinearRegression()
    
    if k==0:
        train = df[:1460]
        test = df[1460:]
    
        lr.fit(train[features],train["SalePrice"])
        predictions = lr.predict(test[features])
    
        mse = mean_squared_error(predictions, test["SalePrice"])
        rmse = np.sqrt(mse)
        
        return rmse
    
    elif k==1:
        
        df = df.sample(frac=1)
        fold_one = df[:1460]
        fold_two = df[1460:]
        
        #Train on fold_one and test on fold_two.
        lr.fit(fold_one[features],fold_one["SalePrice"])
        predictions_one = lr.predict(fold_two[features])
    
        mse_one = mean_squared_error(predictions_one, fold_two["SalePrice"])
        rmse_one = np.sqrt(mse_one)
        
        #Train on fold_two and test on fold_one.
        lr.fit(fold_two[features],fold_two["SalePrice"])
        predictions_two = lr.predict(fold_one[features])
    
        mse_two = mean_squared_error(predictions_two, fold_one["SalePrice"])
        rmse_two = np.sqrt(mse_two)
        
        #Let's compute the mean of both RMSEs.
        print(rmse_one, rmse_two)
        avg_rmse = np.mean([rmse_one, rmse_two])
        return avg_rmse
    
    else:
        
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
        
            lr.fit(train[features],train["SalePrice"])
            predictions = lr.predict(test[features])
    
            mse = mean_squared_error(predictions, test["SalePrice"])
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

data = pd.read_csv(r"Data\AmesHousing.txt",delimiter="\t")
transform_data = transform_features(data)
filtered_data = select_features(transform_data)
rmse = train_and_test(filtered_data, k=4)

rmse

[23830.606264250062, 37919.891644098658, 24602.080904885406, 27275.593267459448]


28407.043020173394

After feature engineering, the final RMSE (about 28,407) is roughly half of the first one we calculated (about 57,088). Because we are using cross-validation, we can also be more confident that the model is not overfitting to one particular train/test split.