## Linear Regression: Predicting House Sale Prices

![alt text](https://www.pewresearch.org/wp-content/uploads/2021/08/FT_21.08.17_BigHousesSmallHouses_feature.jpg)

We will work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. Our aim is to predict house sale prices using a machine learning linear regression model. You can read more about why the data was collected [here](https://doi.org/10.1080/10691898.2011.11889627). You can also read about the different columns in the data [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

### 1. Initial exploration of data

In [1]:
#Importing all necessary tools and setting options
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

%matplotlib inline
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)

Summary of the data set:

In [2]:
houses = pd.read_csv('AmesHousing.tsv', delimiter='\t')

missing_values = houses.isnull().sum()*100/len(houses)

print(f'\033[1mNumber of houses:\033[0m   {houses.shape[0]:,}\n'
      f'\033[1mNumber of features:\033[0m {houses.shape[1]}\n\n'
      f'\033[1mMissing values by column, in %:\033[0m\n'
      f'{missing_values[missing_values > 0].sort_values(ascending=False).round(2)}\n\n'
      f'\033[1mColumn names:\033[0m\n'
      f'{houses.columns}')
houses.head()

Number of houses:   2,930
Number of features: 82

Missing values by column, in %:
Pool QC           99.56
Misc Feature      96.38
Alley             93.24
Fence             80.48
Fireplace Qu      48.53
Lot Frontage      16.72
Garage Cond        5.43
Garage Qual        5.43
Garage Finish      5.43
Garage Yr Blt      5.43
Garage Type        5.36
Bsmt Exposure      2.83
BsmtFin Type 2     2.76
BsmtFin Type 1     2.73
Bsmt Qual          2.73
Bsmt Cond          2.73
Mas Vnr Area       0.78
Mas Vnr Type       0.78
Bsmt Half Bath     0.07
Bsmt Full Bath     0.07
Total Bsmt SF      0.03
Bsmt Unf SF        0.03
Garage Cars        0.03
Garage Area        0.03
BsmtFin SF 2       0.03
BsmtFin SF 1       0.03
Electrical         0.03
dtype: float64

Column names:
Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,...,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,...,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,...,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,...,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,...,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,...,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


### 2. Feature Engineering
2.1 Handle missing values:
- All columns:
 - Drop any with 5% or more missing values
- Remaining columns with missing values:
 - Categorical/discrete columns: fill in with the most common value (mode) in that column
 - Numerical continuous columns: fill in with the mean of that column
 
2.2 Creating new features

2.3 Dropping features that:
- aren't useful for machine learning,
- leak data about the final sale.

2.4 Create functions to automate all steps


##### 2.1 Handle missing values
Drop any column with 5% or more missing values for now.

In [3]:
print(f'{missing_values[missing_values > 5].sort_values(ascending=False).round(2)}\n\n')

Pool QC          99.56
Misc Feature     96.38
Alley            93.24
Fence            80.48
Fireplace Qu     48.53
Lot Frontage     16.72
Garage Yr Blt     5.43
Garage Finish     5.43
Garage Qual       5.43
Garage Cond       5.43
Garage Type       5.36
dtype: float64




In [4]:
#Copy the selection so that later assignments modify an independent DataFrame
houses_few_missing = houses[missing_values[missing_values < 5].index].copy()

houses_few_missing.shape

(2930, 71)
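The column filter above can be wrapped in a small reusable helper. The following is a minimal sketch on toy data: the function name `drop_sparse_columns`, its `max_missing_pct` parameter, and the values in `toy` are assumptions made for illustration and are not part of the Ames data set.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not taken from AmesHousing.tsv)
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],  # 80% missing
    "Garage Cars": [2.0, 2.0, np.nan, 3.0, 3.0],        # 20% missing
})

def drop_sparse_columns(df, max_missing_pct=5.0):
    """Return a copy of df without columns whose missing share is >= max_missing_pct."""
    missing_pct = df.isnull().mean() * 100
    keep = missing_pct[missing_pct < max_missing_pct].index
    return df[keep].copy()

# With a 50% cut-off only 'Alley' is removed
reduced = drop_sparse_columns(toy, max_missing_pct=50.0)
print(list(reduced.columns))
```

Returning a copy keeps the result independent of the original frame, so later column assignments on it do not raise pandas' copy-of-a-slice warning.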

For the remaining features with missing values (less than 5%), let's fill in the missing values using the most frequent value (mode) of the corresponding feature for categorical/discrete features and the mean for numerical continuous features.

In [5]:
print(f'{missing_values[(missing_values <= 5)&(missing_values > 0)].sort_values(ascending=False).round(2)}')

Bsmt Exposure     2.83
BsmtFin Type 2    2.76
Bsmt Qual         2.73
Bsmt Cond         2.73
BsmtFin Type 1    2.73
Mas Vnr Type      0.78
Mas Vnr Area      0.78
Bsmt Full Bath    0.07
Bsmt Half Bath    0.07
BsmtFin SF 1      0.03
BsmtFin SF 2      0.03
Bsmt Unf SF       0.03
Total Bsmt SF     0.03
Electrical        0.03
Garage Cars       0.03
Garage Area       0.03
dtype: float64


In [6]:
col_low5 = missing_values[(missing_values <= 5)&(missing_values > 0)].index
houses_few_missing[col_low5].head(5)

Unnamed: 0,Mas Vnr Type,Mas Vnr Area,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Electrical,Bsmt Full Bath,Bsmt Half Bath,Garage Cars,Garage Area
0,Stone,112.0,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,SBrkr,1.0,0.0,2.0,528.0
1,,0.0,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,SBrkr,0.0,0.0,1.0,730.0
2,BrkFace,108.0,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,SBrkr,0.0,0.0,1.0,312.0
3,,0.0,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,SBrkr,1.0,0.0,2.0,522.0
4,,0.0,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,SBrkr,0.0,0.0,2.0,482.0


We will divide the columns into continuous and categorical/discrete:

In [7]:
num_cont = ['Mas Vnr Area', 'BsmtFin SF 1', 
       'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Garage Area']
cat_dis = ['Mas Vnr Type','Bsmt Qual', 'Bsmt Cond',
       'Bsmt Exposure', 'BsmtFin Type 1','BsmtFin Type 2','Electrical',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars']

In [8]:
for col in num_cont:
    #Assign the result back instead of calling fillna(inplace=True) on a selected column
    houses_few_missing[col] = houses_few_missing[col].fillna(houses_few_missing[col].mean())
    
houses_few_missing[num_cont].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Mas Vnr Area   2930 non-null   float64
 1   BsmtFin SF 1   2930 non-null   float64
 2   BsmtFin SF 2   2930 non-null   float64
 3   Bsmt Unf SF    2930 non-null   float64
 4   Total Bsmt SF  2930 non-null   float64
 5   Garage Area    2930 non-null   float64
dtypes: float64(6)
memory usage: 137.5 KB




In [9]:
for col in cat_dis:
    #Assign the result back instead of calling fillna(inplace=True) on a selected column
    houses_few_missing[col] = houses_few_missing[col].fillna(houses_few_missing[col].mode()[0])
    
houses_few_missing[cat_dis].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Mas Vnr Type    2930 non-null   object 
 1   Bsmt Qual       2930 non-null   object 
 2   Bsmt Cond       2930 non-null   object 
 3   Bsmt Exposure   2930 non-null   object 
 4   BsmtFin Type 1  2930 non-null   object 
 5   BsmtFin Type 2  2930 non-null   object 
 6   Electrical      2930 non-null   object 
 7   Bsmt Full Bath  2930 non-null   float64
 8   Bsmt Half Bath  2930 non-null   float64
 9   Garage Cars     2930 non-null   float64
dtypes: float64(3), object(7)
memory usage: 229.0+ KB


There are no more missing values in the dataset:

In [10]:
houses_no_missing = houses_few_missing
houses_no_missing.isnull().sum().value_counts()

0    71
dtype: int64
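The two fill strategies used in this subsection (mean for continuous columns, mode for categorical/discrete ones) can be sketched on a tiny frame. The helper name `impute` and the toy values are hypothetical; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "Garage Area": [500.0, np.nan, 700.0, 600.0],  # numerical continuous
    "Bsmt Qual": ["TA", "Gd", None, "TA"],         # categorical
})

def impute(df, mean_cols, mode_cols):
    """Fill mean_cols with the column mean and mode_cols with the column mode."""
    out = df.copy()  # leave the caller's frame untouched
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in mode_cols:
        # mode() returns a Series; its first entry is the most frequent value
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

filled = impute(toy, mean_cols=["Garage Area"], mode_cols=["Bsmt Qual"])
print(filled)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` acting on a temporary selection.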

##### 2.2 Creating new features

What new features can we create that better capture the information in some of the existing features?
The features Yr Sold, Year Remod/Add, and Year Built are not very informative on their own. It is more useful to know how many years passed between the construction (or remodeling) of each house and its sale.

In [11]:
years_sold = houses_no_missing['Yr Sold'] - houses_no_missing['Year Built']
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [12]:
years_since_remod = houses_no_missing['Yr Sold'] - houses_no_missing['Year Remod/Add']
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64
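Negative values mean that the recorded sale year precedes the build or remodel year, so those rows are inconsistent. As an alternative to listing row labels by hand, such rows can be filtered with a boolean mask. The sketch below uses a toy frame with made-up years; the names `valid` and `clean` are introduced here for illustration.

```python
import pandas as pd

# Toy data with made-up years; the second row is inconsistent on purpose
toy = pd.DataFrame({
    "Yr Sold": [2010, 2007, 2008],
    "Year Built": [1960, 2008, 2000],
    "Year Remod/Add": [1975, 2009, 2000],
})

toy["Years Before Sale"] = toy["Yr Sold"] - toy["Year Built"]
toy["Years Since Remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only rows where both derived features are non-negative
valid = (toy["Years Before Sale"] >= 0) & (toy["Years Since Remod"] >= 0)
clean = toy[valid]
print(clean[["Years Before Sale", "Years Since Remod"]])
```

A mask of this kind adapts automatically if the input rows change, whereas hard-coded index labels only fit one specific file.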

In [13]:
## Create new columns
houses_no_missing['Years Before Sale'] = years_sold
houses_no_missing['Years Since Remod'] = years_since_remod

## Drop rows with a negative value in either of these new features
houses_no_missing = houses_no_missing.drop([1702, 2180, 2181], axis=0)

## No longer need original year columns
houses_no_missing = houses_no_missing.drop(["Year Built", "Year Remod/Add"], axis = 1)



##### 2.3 Dropping features

Let's remove the features that:
- aren't useful for machine learning (Order, PID),
- leak information about the sale (Mo Sold, Sale Type, Sale Condition, Yr Sold).

In [14]:
#Drop columns that aren't useful for ML
houses_no_missing = houses_no_missing.drop(["PID", "Order"], axis=1)

#Drop columns that leak info about the final sale
houses_no_missing = houses_no_missing.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)

##### 2.4 Create functions to automate all steps

In [15]:
#Transforming features

def transform_features(df):
    #Removing columns with 5% or more missing data
    df_copy = df.copy()
    #Use the length of the frame passed in rather than the global houses
    missing_values = df_copy.isnull().sum()*100/len(df_copy)
    #Copy the selection so that the assignments below act on an independent DataFrame
    df_copy_2 = df_copy[missing_values[missing_values < 5].index].copy()
    
    #Splitting remaining columns with missing values to numerical continuous and categorical/discrete
    num_cont = ['Mas Vnr Area', 'BsmtFin SF 1', 
       'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Garage Area']
    cat_dis = ['Mas Vnr Type','Bsmt Qual', 'Bsmt Cond',
       'Bsmt Exposure', 'BsmtFin Type 1','BsmtFin Type 2','Electrical',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars']
    
    #Filling in the missing values using the mode for categorical/discrete features 
    #and mean for numerical continuous features.
    for col in num_cont:
        df_copy_2[col] = df_copy_2[col].fillna(df_copy_2[col].mean())
    for col in cat_dis:
        df_copy_2[col] = df_copy_2[col].fillna(df_copy_2[col].mode()[0])
    
    #Create new columns
    years_sold = df_copy_2['Yr Sold'] - df_copy_2['Year Built']
    years_since_remod = df_copy_2['Yr Sold'] - df_copy_2['Year Remod/Add']
    df_copy_2['Years Before Sale'] = years_sold
    df_copy_2['Years Since Remod'] = years_since_remod
    
    #Drop rows with a negative value in either of these new features
    #(rows 1702, 2180 and 2181 in this data set)
    df3 = df_copy_2[(years_sold >= 0) & (years_since_remod >= 0)]
    
    #No longer need original year columns
    df3 = df3.drop(["Year Built", "Year Remod/Add"], axis = 1)
    
    #Drop columns that aren't useful for ML
    df3 = df3.drop(["PID", "Order"], axis=1)

    #Drop columns that leak info about the final sale
    df3 = df3.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1) 
    
    return df3

### 3. Feature Selection

Now, we're going to select the features that we'll use for further machine learning modeling.

#### 3.1. Numerical features
First, we will compute the correlation of each numerical feature with the target, SalePrice.

In [16]:
num_corr = houses_no_missing.select_dtypes(['int64', 'float64'])
num_corr.head(5)

Unnamed: 0,MS SubClass,Lot Area,Overall Qual,Overall Cond,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,SalePrice,Years Before Sale,Years Since Remod
0,20,31770,6,5,112.0,639.0,0.0,441.0,1080.0,1656,0,0,1656,1.0,0.0,1,0,3,1,7,2,2.0,528.0,210,62,0,0,0,0,0,215000,50,50
1,20,11622,5,6,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1.0,730.0,140,0,0,0,120,0,0,105000,49,49
2,20,14267,6,6,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1.0,312.0,393,36,0,0,0,0,12500,172000,52,52
3,20,11160,7,5,0.0,1065.0,0.0,1045.0,2110.0,2110,0,0,2110,1.0,0.0,2,1,3,1,8,2,2.0,522.0,0,0,0,0,0,0,0,244000,42,42
4,60,13830,5,5,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,2.0,482.0,212,34,0,0,0,0,0,189900,13,12


In [17]:
corr = num_corr.corr()['SalePrice'].abs().sort_values(ascending=False)
corr

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648361
Total Bsmt SF        0.643601
Garage Area          0.641675
1st Flr SF           0.635185
Years Before Sale    0.558979
Full Bath            0.546118
Years Since Remod    0.534985
Mas Vnr Area         0.510611
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.438928
Wood Deck SF         0.328183
Open Porch SF        0.316262
Half Bath            0.284871
Bsmt Full Bath       0.276258
2nd Flr SF           0.269601
Lot Area             0.267520
Bsmt Unf SF          0.182248
Bedroom AbvGr        0.143916
Enclosed Porch       0.128685
Kitchen AbvGr        0.119760
Screen Porch         0.112280
Overall Cond         0.101540
MS SubClass          0.085128
Pool Area            0.068438
Low Qual Fin SF      0.037629
Bsmt Half Bath       0.035875
3Ssn Porch           0.032268
Misc Val             0.019273
BsmtFin SF 2         0.006000
Name: Sale

Let's keep only the features with an absolute correlation coefficient of at least 0.4. This cut-off is tentative and can be reconsidered when testing different models.

In [18]:
#We keep only features whose absolute correlation with 'SalePrice' is at least 0.4
houses_no_missing = houses_no_missing.drop(corr[corr < 0.4].index, axis=1)
houses_no_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 46 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   MS Zoning          2927 non-null   object 
 1   Street             2927 non-null   object 
 2   Lot Shape          2927 non-null   object 
 3   Land Contour       2927 non-null   object 
 4   Utilities          2927 non-null   object 
 5   Lot Config         2927 non-null   object 
 6   Land Slope         2927 non-null   object 
 7   Neighborhood       2927 non-null   object 
 8   Condition 1        2927 non-null   object 
 9   Condition 2        2927 non-null   object 
 10  Bldg Type          2927 non-null   object 
 11  House Style        2927 non-null   object 
 12  Overall Qual       2927 non-null   int64  
 13  Roof Style         2927 non-null   object 
 14  Roof Matl          2927 non-null   object 
 15  Exterior 1st       2927 non-null   object 
 16  Exterior 2nd       2927 

There are no columns with only one unique value:

In [19]:
one_val_col = houses_no_missing.nunique()[houses_no_missing.nunique() == 1].index
one_val_col

Index([], dtype='object')
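The correlation filter in this subsection works on absolute values, so strongly negative correlations are kept as well. A minimal sketch on toy data illustrates this; the helper name `weakly_correlated` and all numbers are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "SalePrice":         [100, 150, 200, 250, 300],
    "Gr Liv Area":       [900, 1200, 1500, 2100, 2400],  # rises with price
    "Years Before Sale": [50, 40, 30, 20, 10],           # falls with price
    "Misc Val":          [5, 0, 5, 0, 5],                # unrelated to price
})

def weakly_correlated(df, target="SalePrice", threshold=0.4):
    """Return the columns whose absolute correlation with target is below threshold."""
    corr = df.corr()[target].abs()
    return list(corr[corr < threshold].index)

to_drop = weakly_correlated(toy)
print(to_drop)
selected = toy.drop(columns=to_drop)
```

Here 'Years Before Sale' has a correlation of -1 with the target and is retained, which is the behaviour that `.abs()` is meant to provide.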

#### 3.2. Categorical features

Which categorical columns should we keep?

First, let's check which columns are categorical according to the documentation:

In [20]:
#Create a list of column names from documentation that are *meant* to be categorical
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

- Which columns are currently numerical but need to be encoded as categorical instead?
- If a categorical column has hundreds of unique values (or categories), should we keep it? When we dummy code this column, hundreds of columns will need to be added back to the data frame.

Now let's check which of them are still in our data set and how many unique values they have.

In [21]:
#Which categorical columns have we still carried with us?
transform_cat_cols = []
for col in nominal_features:
    if col in houses_no_missing.columns:
        transform_cat_cols.append(col)

#How many unique values in each categorical column?
uniqueness_counts = houses_no_missing[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
uniqueness_counts

Central Air      2
Street           2
Land Contour     4
Lot Config       5
Mas Vnr Type     5
Bldg Type        5
Foundation       6
Roof Style       6
Heating          6
MS Zoning        7
Roof Matl        8
Condition 2      8
House Style      8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

We will remove the features with more than 10 unique values.

In [22]:
# Arbitrary cutoff of 10 unique values (worth experimenting)
drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index
houses_no_missing = houses_no_missing.drop(drop_nonuniq_cols, axis=1)
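The reason for a cardinality cut-off is that each distinct category becomes its own dummy column. The sketch below shows one way to flag high-cardinality columns; the helper name `high_cardinality`, the `max_unique` parameter, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy data with made-up rows
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
    "Neighborhood": ["NAmes", "Gilbert", "StoneBr", "NWAmes", "Somerst", "BrDale"],
})

def high_cardinality(df, columns, max_unique=10):
    """Return the columns that have more than max_unique distinct values."""
    counts = df[columns].nunique()
    return list(counts[counts > max_unique].index)

# With a cut-off of 3, only 'Neighborhood' (6 distinct values) is flagged
flagged = high_cardinality(toy, ["Street", "Neighborhood"], max_unique=3)
print(flagged)
```

`DataFrame.nunique()` ignores missing values by default, matching the `len(col.value_counts())` count used above.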

Let's convert all text columns to dummy columns.

In [23]:
# Select just the remaining text columns and convert to categorical
text_cols = houses_no_missing.select_dtypes(include=['object']).columns
for col in text_cols:
    houses_no_missing[col] = houses_no_missing[col].astype('category')
houses_no_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 43 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   MS Zoning          2927 non-null   category
 1   Street             2927 non-null   category
 2   Lot Shape          2927 non-null   category
 3   Land Contour       2927 non-null   category
 4   Utilities          2927 non-null   category
 5   Lot Config         2927 non-null   category
 6   Land Slope         2927 non-null   category
 7   Condition 1        2927 non-null   category
 8   Condition 2        2927 non-null   category
 9   Bldg Type          2927 non-null   category
 10  House Style        2927 non-null   category
 11  Overall Qual       2927 non-null   int64   
 12  Roof Style         2927 non-null   category
 13  Roof Matl          2927 non-null   category
 14  Mas Vnr Type       2927 non-null   category
 15  Mas Vnr Area       2927 non-null   float64 
 16  Exter 

In [24]:
# Create dummy columns and add back to the dataframe!
houses_no_missing = pd.concat([
    houses_no_missing, 
    pd.get_dummies(houses_no_missing.select_dtypes(include=['category']))
], axis=1).drop(text_cols,axis=1)
houses_no_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Columns: 166 entries, Overall Qual to Paved Drive_Y
dtypes: float64(5), int64(9), uint8(152)
memory usage: 777.5 KB
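To make the dummy-coding step concrete, here is a small sketch on toy data. It uses the `columns` argument of `pd.get_dummies`, which encodes the listed columns and drops the originals in one call; this is a slightly different route from the `concat` plus `drop` used above, and the toy values are made up.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "Overall Qual": [6, 5, 7],
    "Central Air": ["Y", "N", "Y"],
})

# One indicator column per category; the original text column is removed
encoded = pd.get_dummies(toy, columns=["Central Air"])
print(list(encoded.columns))
```

The dtype of the indicator columns depends on the pandas version: the output above shows `uint8`, while pandas 2.0 and later return `bool` by default. Either works as input to a linear regression.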


In [25]:
houses_no_missing['SalePrice']

0       215000
1       105000
2       172000
3       244000
4       189900
5       195500
6       213500
7       191500
8       236500
9       189000
10      175900
11      185000
12      180400
13      171500
14      212000
15      538000
16      164000
17      394432
18      141000
19      210000
20      190000
21      170000
22      216000
23      149000
24      149900
25      142000
26      126000
27      115000
28      184000
29       96000
30      105500
31       88000
32      127500
33      149900
34      120000
35      146000
36      376162
37      306000
38      395192
39      290941
40      220000
41      275000
42      259000
43      214000
44      611657
45      224000
46      500000
47      320000
48      319900
49      205000
50      175500
51      199500
52      160000
53      192000
54      184500
55      216500
56      185088
57      180000
58      222500
59      333168
60      355000
61      260400
62      325000
63      290000
64      221000
65      410000
66      22

#### 3.3. Now we will combine all the steps in a function.

In [26]:
#Selecting features
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    
    #Selecting columns whose absolute correlation with SalePrice is at least coeff_threshold
    df_num = df.select_dtypes(['int64', 'float64'])
    correlation = df_num.corr()['SalePrice'].abs().sort_values()
    df = df.drop(correlation[correlation < coeff_threshold].index, axis=1)
    
    #Create a list of column names from documentation that are *meant* to be categorical
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    #Which categorical columns have we still carried with us? We'll test these 
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)

    #How many unique values in each categorical column?
    uniqueness_counts = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
    drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > uniq_threshold].index
    df = df.drop(drop_nonuniq_cols, axis=1)
    
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        df[col] = df[col].astype('category')
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))
        ], axis=1).drop(text_cols,axis=1)
    
    return df

### 4. Train and Test

Now that we have cleaned the features and selected the most relevant ones, let's create a train_and_test() function that fits a linear regression and evaluates it with different cross-validation k-values.
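For reference, the error metric computed in every branch of the function is the root mean squared error (RMSE), and the cross-validation branches return its average over the folds. The symbols below are notation introduced here, not taken from the notebook: $y_i$ is the actual sale price of test house $i$, $\hat{y}_i$ the predicted price, $n$ the number of test houses, and $k$ the number of folds.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{k}\sum_{j=1}^{k}\mathrm{RMSE}_j
$$

Because RMSE is expressed in the same unit as the target, the values reported below can be read directly as dollars.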

In [27]:
# Training and testing the model

def train_and_test(df2, k=0):
    df = df2.copy()
    np.random.seed(1)
    df = df.loc[np.random.permutation(df.index)]
    features = df.columns.drop("SalePrice")
  
    
    if k==0:
        train = df.iloc[0:1460]
        test = df.iloc[1460:]
        reg = LinearRegression()
        reg.fit(train[features], train["SalePrice"])
        predictions = reg.predict(test[features])
    
        #Calculating the model error
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)
        return rmse
    
    if k==1:
        train = df.iloc[0:1460]
        test = df.iloc[1460:]
        reg = LinearRegression()
        reg.fit(train[features], train["SalePrice"])
        predictions_one = reg.predict(test[features])        
        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        reg.fit(test[features], test["SalePrice"])
        predictions_two = reg.predict(train[features])        
       
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        avg_rmse = np.mean([rmse_one, rmse_two])
        print(rmse_one)
        print(rmse_two)
        return avg_rmse
    
    if k>1:
        reg = LinearRegression()
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            reg.fit(train[features], train["SalePrice"])
            predictions = reg.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

transformed_df = transform_features(houses)
selected_df = select_features(transformed_df)

#Holdout-validation
model_results = train_and_test(selected_df)
model_results



26429.835039622954

In [28]:
#Simple cross-validation
model_results = train_and_test(selected_df, 1)
model_results

26429.835039622954
30873.542529071743


28651.68878434735

In [29]:
#k-fold cross-validation
model_results = train_and_test(selected_df, 4)
model_results

[36042.78440481057, 26069.05551944564, 25044.60677969058, 25795.265223777027]


28237.927981930952
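To show what the k-fold branch does under the hood, the following is a NumPy-only sketch: it shuffles the row indices, splits them into k folds, fits ordinary least squares on the remaining folds, and averages the per-fold RMSE. It is not the scikit-learn code path used above; the function name `kfold_rmse` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def kfold_rmse(X, y, k=4, seed=1):
    """Average RMSE of an ordinary least-squares fit over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    rmses = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X1[test_idx] @ coef
        rmses.append(float(np.sqrt(np.mean(residuals ** 2))))
    return float(np.mean(rmses)), rmses

# Synthetic data: price = 50,000 + 100 * area + noise (made-up numbers)
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
price = 50_000 + 100 * area + rng.normal(0, 10_000, size=200)

avg_rmse, fold_rmses = kfold_rmse(area.reshape(-1, 1), price, k=4)
print(avg_rmse, fold_rmses)
```

Every row is used for testing exactly once, which is why the averaged score is usually a steadier estimate than a single holdout split.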

### Conclusion

In this project, we cleaned, wrangled, and transformed the housing data for the city of Ames, Iowa, United States (2006-2010), and then used it to predict house sale prices with a linear regression model.

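We created a pipeline of three functions (transform_features(), select_features(), and train_and_test()) to perform all the necessary manipulations. In addition, we applied three approaches to validation: holdout validation, simple cross-validation, and k-fold cross-validation. The resulting RMSE values were about 26,430 for holdout validation, 28,652 for the simple cross-validation average, and 28,238 for the 4-fold average.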