<h1><center><font color='blue'> Machine Learning - Predict a Home's Sale Price </font> </center></h1>  

***by Susan Fisher***

For this project, a linear regression model will be built to predict a home's sale price.  Building the model will include feature engineering, feature selection, and training the model.  The validation method and number of K-folds will be selected.  The data is taken from the city of Ames, Iowa from 2006 to 2010.  

Information on the dataset can be found here: https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627  
The data dictionary is located here:  https://s3.amazonaws.com/dq-content/307/data_description.txt

The target for the model will be a house's sale price, and the model's features will be all numeric columns.  

**Import Python libraries and read in the data file as 'df' dataframe.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error

# Displays all columns of the dataframe
pd.options.display.max_columns = 150

In [2]:
df = pd.read_csv('C:/Users/Name/Documents/PythonScripts/DataSets/AmesHousing.tsv',
            delimiter='\t')

In [3]:
# View few rows of the data
df.head(5)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


## <font color='blue'> Initial build of Functions </font>
    
This section is an initial build of the functions that will engineer features, select features, and train the model.  In the following sections of this project, the functions will be updated.

There will be 3 functions:
- transform_features():  engineers features
- select_features():  select features
- train_and_test():  selects a model type and validation method, and trains and tests the model

In [4]:
target = 'SalePrice'
# features = ['Gr Liv Area']

**FUNCTION - transform_features():  engineers features**  
Returns a transformed dataframe

In [5]:
def transform_features(df):
    return df

**FUNCTION - select_features():  select features**  

Returns dataframe columns, 'Gr Liv Area' and 'SalePrice'.

In [6]:
def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

**FUNCTION - train_and_test()**  

- Initial function creates a Holdout validation method
- Creates a training and testing datasets
- Sets a Linear Regression model
- Trains the model on numeric features (columns)
- Returns RMSE

In [7]:
def train_and_test(df):    
    # Dataframe: select all numeric columns except Target column 
    df_numeric = df.select_dtypes(include=['integer', 'float'])
    
    # Define target
    target = 'SalePrice'
    # Define features, which are all numeric columns - target column
    features = df_numeric.columns.drop(target)
    
    # Split dataframe in half and assign to TRAIN & TEST dataset
    half_df = round(len(df)*0.5)
    train = df[:half_df]
    test = df[half_df:]
        
    # Train a Linear Regression model on TEST set using numeric_cols
    lr = LinearRegression()
    lr.fit(train[features], train[target])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test[target], predictions)
    rmse = mse**0.5
    return rmse

In [8]:
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

57120.50729008638

## <font color='blue'> Feature Engineering </font>  

Feature engineering will include dealing with missing values, creating new features, and removing features (columns).  Lastly, the transform_features() function will be updated with all the engineering.

1. Missing values:  
    a. All columns: drop columns that are >= 5% missing values  
    b. Numerical columns: fill in missing values with the most common value (mode)  
    c. Text columns: drop columns with >= 1 missing values  
2. Create new features by adding or subtracting columns e.g. years might make more sense if it is subtracted from a current year
3. Remove features (columns):  
    a. That are not useful for the model  
    b. That leak info about the sale i.e. information about the target that would not have been known at the time of prediction
4. Lastly, update the transform_features() function with 1-3.

**1) Feature Engineering:  Missing Values**

In [9]:
# Number of columns with missing values before dropping any columns
df.isnull().sum().value_counts().count()

15

In [10]:
# All Columns: drop columns with missing values >= 5%
null_percent = (df.isnull().sum()) / (len(df))
null_five = (null_percent[null_percent >= 0.05]).index
df = df.drop(null_five, axis=1)

In [11]:
# Number of columns with missing values after dropping columns
df.isnull().sum().value_counts().count()

7

In [12]:
# Numeric columns: replace missing values with the most common value in the column
numeric_cols = df.select_dtypes(include=['float', 'int'])
df[numeric_cols.columns] = numeric_cols.fillna(numeric_cols.mode().iloc[0])

In [13]:
# After dropping columns: number of columns with missing values, and column names
print(df.isnull().sum().value_counts().count())
df.columns[df.isnull().any()]

6


Index(['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical'],
      dtype='object')

In [14]:
# For each column, view the number of missing values and the percent of missing values
null_list = []
cols_null = ['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical']

for col in cols_null:
    null_count = df[col].isnull().sum()
    null_percent = round(100 * null_count/(len(df)), 1)
    null_list.append([col, null_count, null_percent])
null_df

NameError: name 'null_df' is not defined

The columns that contain missing values are likely not significantly impact a home's sale price, so they will be dropped.  As a reminder, the data dictionary can be found here:  https://s3.amazonaws.com/dq-content/307/data_description.txt

Although, 'Electrical' column only has one missing value, the column is not thought to significantly impact a home's sale price, so the entire column will be dropped (vs dropping the row).

In [None]:
# Text columns: drop columns with missing values >= 1
object_cols = df.select_dtypes(include=['object'])
null_object_col = object_cols.isnull().sum()
isnull_object_col = null_object_col[null_object_col > 0]

df.drop(isnull_object_col.index, axis=1, inplace=True)

In [None]:
# Verify there are no missing values in the dataframe
df.isnull().sum().value_counts()

**2) Feature Engineering:  Create New Features**  

Create new features by adding or subtracting columns.

1. Create 'Years Before Sale' column: by subtracting 'Year Built' from 'Yr Sold'
2. Create 'Years Since Remod' column: by taking the difference of 'Yr Sold' and 'Year Remod/Add'

In [None]:
# Create New Feature: 'Years Before Sale'
df['Years Before Sale'] = df['Yr Sold'] - df['Year Built']

In [None]:
# Create New Feature: 'Years Since Remod'
df['Years Since Remod'] = df['Yr Sold'] - df['Year Remod/Add']

In [None]:
# New Features: verify no negative values
print(df['Years Before Sale'][df['Years Before Sale']<0])
print(df['Years Since Remod'][df['Years Since Remod']<0])

In [None]:
# New Features: drop rows with negative values

# Number of rows before dropping rows
print(df.shape)

# Drop rows with negative values
df.drop([1702, 2180, 2181], axis=0, inplace=True)

# Verify rows have been dropped
print(df.shape)

**3) Feature Engineering:  Remove Features**  

1. Remove features that are not useful to the model: 'PID', and 'Order'
2. Remone features that leak data about the final sale e.g. year the sale occurred: 'Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'

In [None]:
# Remove columns that are not useful to the model: 'PID' and 'Order'
df.drop(['PID', 'Order'], axis=1, inplace=True)

In [None]:
# Remove columns that leak data about the final sale: 'Mo Sold', 'Sale Condition',
# 'Sale Type', 'Yr Sold'
df.drop(['Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1, inplace=True)

In [None]:
# Verify columns have been dropped
df.shape

**4) Feature Engineering:  update transform_features() function**

In [None]:
def transform_features(df):
    # All Columns: drop columns with missing values >= 5%
    null_percent = (df.isnull().sum()) / (len(df))
    null_five = (null_percent[null_percent >= 0.05]).index
    df = df.drop(null_five, axis=1)

    # Numeric columns: replace missing values with the most common value
    # in the column.
    numeric_cols = df.select_dtypes(include=['float', 'int'])
    df[numeric_cols.columns] = numeric_cols.fillna(numeric_cols.mode().iloc[0])

    # Text columns: drop columns with missing values >= 1
    object_cols = df.select_dtypes(include=['object'])
    null_object_col = object_cols.isnull().sum()
    isnull_object_col = null_object_col[null_object_col > 0]
    df.drop(isnull_object_col.index, axis=1, inplace=True)

    # Create New Features
    df['Years Before Sale'] = df['Yr Sold'] - df['Year Built']
    df['Years Since Remod'] = df['Yr Sold'] - df['Year Remod/Add']

    # Drop negative values
    df.drop([1702, 2180, 2181], axis=0, inplace=True)

    # Remove columns that are not useful to the model: 'PID' and 'Order'
    df.drop(['PID', 'Order'], axis=1, inplace=True)
    # Remove columns that leak data about the final sale: 'Mo Sold',
    # 'Sale Condition', 'Sale Type', 'Yr Sold'
    df.drop(['Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1, inplace=True)
    
    return df

In [None]:
# Reset dataframe, df, by reading the file in again.

df = pd.read_csv('C:/Users/Name/Documents/PythonScripts/DataSets/AmesHousing.tsv',
            delimiter='\t')
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

## <font color='blue'> Feature Selection </font>  

This section will check for collinarity or when features are highly correlated, and convert columns to categorical type.  

When features are highly correlated then information can be duplicated or there is the potential of information overload.  

Nominal columns such as zip code, jersey number, gender, etc. can be converted to categorical type.  The number of unique values in a column must be limited, otherwise, the final dataframe could have hundreds of columns. 

Feature Selection:
1. Generate a correlation heatmap matrix to check data for collinearity
2. Retain columns with a correlation coefficient above a certain limit
3. Convert object columns to categorical data type 
4. Update select_features() function to include the feature selections 1-3

**1) Feature Selection:  correlation heatmap matrix of the numerical features in the training dataset**

In [None]:
# Split dataframe in half and assign to TRAIN & TEST dataset
half_df = round(len(transform_df)*0.5)
train = transform_df[:half_df]
test = transform_df[half_df:]

In [None]:
# transform_df dataset: select numeric features/columns
df_numeric = transform_df.select_dtypes(include=['integer', 'float'])

# Find correlations between target, 'SalePrice', and numeric columns
corr_coeffs = df_numeric.corr()['SalePrice'].abs().sort_values()

In [None]:
# Heatmap of numeric correlations, where correlations >= 0.5
plt.figure(figsize=(10, 8))
corr_fifty = corr_coeffs[corr_coeffs < 0.5]
corr_matrix = train[corr_fifty.index].corr().abs()

sns.heatmap(corr_matrix,cmap='YlGnBu')

In [None]:
corr_one = train[['TotRms AbvGrd', 'Bedroom AbvGr', '2nd Flr SF']].corr()
print(corr_one, '\n')

corr_two = train[['Bsmt Full Bath', 'BsmtFin SF 1']].corr()
print(corr_two)

According to the correlation heatmap matrix, the most highly correlated features are:
- Bedroom AbvGr and TotRms AbvGrd:  correlation coefficient=0.66
- Bsmt Full Bath and BsmtFin SF 1 (finished square feet):  correlation coefficient=0.65
- 2nd Flr SF (square feet) and TotRms AbvGrd:  correlation coefficient=0.57

It is understandable that the features would be well correlated, though the data is not duplicated.  Also, the correlation coefficients are less than 0.7, or not highly correlated.  So the columns will be retained

**2) Feature Selection:  retain columns with a correlation coefficient above 0.5**

In [None]:
# Only retain colums with correlations >= 0.5 with 'SalePrice', drop other columns

# transform_df size before dropping columns
print(transform_df.shape)

# Drop columns with correlation < 0.5 with 'SalePrice'
transform_df = transform_df.drop(corr_coeffs[corr_coeffs < 0.5].index, axis=1)

# Verify columns were dropped
print(transform_df.shape)

**3) Feature Selection:  convert columns to categorical data type**  

Convert features or columns to categorical data type:
- Create a list of nominal column names and filter the dataframe by nominal columns  
- Filter the dataframe with nominal columns that have < 10 unique values  
- Convert object type columns to categorical type  

In [None]:
# From the data documentation, create a list of nominal columns
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", 
                    "Land Contour", "Lot Config", "Neighborhood", "Condition 1",
                    "Condition 2", "Bldg Type", "House Style", "Roof Style",
                    "Roof Matl", "Exterior 1st", "Exterior 2nd", "Mas Vnr Type",
                    "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

In [None]:
# The transform_df dataframe cannot be filtered with nominal_features list;
# Python returns an error if the list contains column names that are not in transform_df.
# So create a list of column names that are in both nominal features list and in transform_df dataframe.

nominal_cols_df = []
for col in nominal_features:
    if col in transform_df.columns:
        nominal_cols_df.append(col)

print(len(nominal_features))
print(len(nominal_cols_df))

In [None]:
# Only keep nominal_cols_df features with < 10 unique values

## Create list of the count of unique values of each feature in nominal_cols_df
unique_counts = []
for col in nominal_cols_df:
    count = transform_df[col].value_counts().count()
    unique_counts.append(count)

s_unique_counts = pd.Series(unique_counts, index=nominal_cols_df)

## Filter unique counts > 10, which will be used to drop columns from the dataframe
drop_nonuniq_cols = s_unique_counts[s_unique_counts > 10].index

## In transform_df, drop nominal_cols_df with > 10 unique values
print(transform_df.shape)
transform_df = transform_df.drop(drop_nonuniq_cols, axis=1)
print(transform_df.shape)

In [None]:
# Convert remaining object columns to categorical type
obj_cols = transform_df.select_dtypes(include=['object'])
for col in obj_cols:
    transform_df[col] = transform_df[col].astype('category')

transform_df.dtypes

In [None]:
# Create dummy columns and add back to the dataframe, and Drop the obj_cols
transform_df = pd.concat([transform_df,
                         pd.get_dummies(transform_df.select_dtypes(include=['category']))
                         ],
                        axis=1).drop(obj_cols, axis=1)

print(transform_df.head(2))

**4) Feature Selection:  update select_features() function**

In [None]:
def select_features(df, coeff_threshold=0.5, uniq_threshold=10):   
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", 
                    "Land Contour", "Lot Config", "Neighborhood", "Condition 1",
                    "Condition 2", "Bldg Type", "House Style", "Roof Style",
                    "Roof Matl", "Exterior 1st", "Exterior 2nd", "Mas Vnr Type",
                    "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

    # Create a new list of column names that are in nominal features list
    # and in df
    nominal_cols_df = []
    for col in nominal_features:
        if col in df.columns:
            nominal_cols_df.append(col)

    # Only keep nominal_cols_df features with < 10 unique values
    ## Create list of the count of unique values of each features in nominal_cols_df
    unique_counts = []
    for col in nominal_cols_df:
        count = df[col].value_counts().count()
        unique_counts.append(count)
    s_unique_counts = pd.Series(unique_counts, index=nominal_cols_df)
    ## Filter unique counts > 10, which will be used to drop columns from the dataframe
    drop_nonuniq_cols = s_unique_counts[s_unique_counts > 10].index
    ## In transform_df, drop nominal_cols_df with > 10 unique values
    df = df.drop(drop_nonuniq_cols, axis=1)

    # Convert remaining object columns to categorical type
    obj_cols = df.select_dtypes(include=['object'])
    for col in obj_cols:
        df[col] = df[col].astype('category')

    # Create dummy columns and add back to the dataframe, and Drop the obj_cols
    df = pd.concat([df,
                    pd.get_dummies(df.select_dtypes(include=['category']))
                    ],
                   axis=1).drop(obj_cols, axis=1)
    
    return df

## <font color='blue'> Train and Test </font>  

This section will update the train_and_test() function.  The function selects a model type and validation method, and trains and tests the model.  The dataframe will be split into training and testing datasets.  Let v be the type of validation, and k is the number of folds or the number of times to split the dataframe into training and testing datasets.  The train_and_test() function returns RMSE.  

The function will contain three types of validation methods:
- for v=0, apply Holdout validation:  split data into 50:50 Training:Testing datasets (the existing function applies Holdout)  
- for v=1 apply Simple Cross Validation:  randomize dataset, split data into 50:50 Training:Testing datasets; and Train model on each set of data.
- for v>1, apply K-fold Cross Validation:  randomize dataset, and split data into k number of folds.  Train model on k - 1 folds.  Test model on kth fold.  Repeat until each k-fold has been the TEST set. 

**Train and Test: update train_and_test() function**  

All final functions.

In [None]:
def transform_features(df):
    # All Columns: drop columns with missing values >= 5%
    null_percent = (df.isnull().sum()) / (len(df))
    null_five = (null_percent[null_percent >= 0.05]).index
    df = df.drop(null_five, axis=1)

    # Numeric columns: replace missing values with the most common value
    # in the column.
    numeric_cols = df.select_dtypes(include=['float', 'int'])
    df[numeric_cols.columns] = numeric_cols.fillna(numeric_cols.mode().iloc[0])

    # Text columns: drop columns with missing values >= 1
    object_cols = df.select_dtypes(include=['object'])
    null_object_col = object_cols.isnull().sum()
    isnull_object_col = null_object_col[null_object_col > 0]
    df.drop(isnull_object_col.index, axis=1, inplace=True)

    # Create New Features
    df['Years Before Sale'] = df['Yr Sold'] - df['Year Built']
    df['Years Since Remod'] = df['Yr Sold'] - df['Year Remod/Add']

    # Drop negative values
    df.drop([1702, 2180, 2181], axis=0, inplace=True)

    # Remove columns that are not useful to the model: 'PID' and 'Order'
    df.drop(['PID', 'Order'], axis=1, inplace=True)
    # Remove columns that leak data about the final sale: 'Mo Sold',
    # 'Sale Condition', 'Sale Type', 'Yr Sold'
    df.drop(['Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1, inplace=True)
    
    return df

In [None]:
def select_features(df, coeff_threshold=0.5, uniq_threshold=10):   
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", 
                    "Land Contour", "Lot Config", "Neighborhood", "Condition 1",
                    "Condition 2", "Bldg Type", "House Style", "Roof Style",
                    "Roof Matl", "Exterior 1st", "Exterior 2nd", "Mas Vnr Type",
                    "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

    # Create a new list of column names that are in nominal features list
    # and in df
    nominal_cols_df = []
    for col in nominal_features:
        if col in df.columns:
            nominal_cols_df.append(col)

    # Only keep nominal_cols_df features with < 10 unique values
    ## Create list of the count of unique values of each features in nominal_cols_df
    unique_counts = []
    for col in nominal_cols_df:
        count = df[col].value_counts().count()
        unique_counts.append(count)
    s_unique_counts = pd.Series(unique_counts, index=nominal_cols_df)
    ## Filter unique counts > 10, which will be used to drop columns from the dataframe
    drop_nonuniq_cols = s_unique_counts[s_unique_counts > 10].index
    ## In transform_df, drop nominal_cols_df with > 10 unique values
    df = df.drop(drop_nonuniq_cols, axis=1)

    # Convert remaining object columns to categorical type
    obj_cols = df.select_dtypes(include=['object'])
    for col in obj_cols:
        df[col] = df[col].astype('category')

    # Create dummy columns and add back to the dataframe, and Drop the obj_cols
    df = pd.concat([df,
                    pd.get_dummies(df.select_dtypes(include=['category']))
                    ],
                   axis=1).drop(obj_cols, axis=1)
    
    return df

In [None]:
def train_and_test(df, v, k):  
    # Dataframe: select all numeric columns except Target column 
    df_numeric = df.select_dtypes(include=['integer', 'float'])    
    # Define Target
    target = 'SalePrice'
    # Define features, which are all numeric columns - target column
    features = df_numeric.columns.drop(target)
    
    # Instantiate an empty Linear Regression model 
    lr = LinearRegression()
        
    # Train a Linear Regression model on TEST set using numeric_cols
    # v=0: apply Holdout validation; 50:50 split of data Train:Test
    if v == 0:         
        # 50:50 split of data into TRAIN & TEST dataset
        half_df = round(len(df)*0.5)
        train = df[:half_df]
        test = df[half_df:]
    
        # Fit to TRAIN data, & Make predictions on the model using TEST data
        lr.fit(train[features], train[target])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test[target], predictions)
        rmse = mse**0.5
        return rmse

    # v=1:apply Simple Cross Validation; 
    if v == 1: 
        # Randomize data, and 50:50 split of data into TRAIN & TEST dataset
        # Data must be split perfectly in half, else function does Not work
        df = df.sample(frac=1, random_state=1)
        if (len(df) % 2) != 0:
            drop_index = np.random.choice(df.index, 1, replace=False)
            df.drop(drop_index, axis=0, inplace=True) 
        half_df = round(len(df)*0.5)
        train = df[:half_df]
        test = df[half_df:]
        
        # Train on 'train', Test on 'test'
        # Fit or train model on TRAIN data; & make predictions or test model with TEST data
        lr.fit(train[features], train[target])
        predictions_one = lr.predict(test[features])
        mse_one = mean_squared_error(test[target], predictions_one)
        rmse_one = mse_one**0.5
    
        # Fit/train model on TEST data; & make predictions on Model with TRAIN data
        lr.fit(test[features], test[target])
        predictions_two = lr.predict(train[features])
        mse_two = mean_squared_error(train[target], predictions_one)
        rmse_two = mse_two**0.5      
    
        # Return average RMSE
        avg_rmse = np.mean([rmse_one, rmse_two])
        return avg_rmse

    # v>1: apply K-fold cross validation; 
    else: 
        # Split data into k number of folds
        kf = KFold(n_splits=k, shuffle=True, random_state=1)
        # K-fold Cross Validation
        mse_values = cross_val_score(lr,
                       df[features],
                       df[target],
                       scoring='neg_mean_squared_error',
                       cv=kf)
        # cross_val_score() returns negative MSE values, so convert to positive values
        abs_mse_values = abs(mse_values)
        rmse_values = abs_mse_values**0.5
        avg_rmse = np.mean(rmse_values)     # OR: avg_rmse = rmse_values.mean()     
        return avg_rmse

In [None]:
df = pd.read_csv('C:/Users/Name/Documents/PythonScripts/DataSets/AmesHousing.tsv',
            delimiter='\t')
transform_df = transform_features(df)
filtered_df = select_features(transform_df)

**Select validation method (v) and number of K-folds (k)**  

In [None]:
df = filtered_df

kfold = 2
# Holdout Validation, v=0:
rmse_zero = round(train_and_test(df, v=0, k=kfold))
print("For holdout validation (v=0), and kfolds=2:  ", rmse_zero, '\n')

# Simple Cross Validation, v=1:
rmse_one = round(train_and_test(df, v=1, k=kfold))
print("For simple cross validation (v=1), and kfolds=2:  ", rmse_one, '\n', '\n')

# K-fold Cross Validation, v>1:
kfold_cross_validation = 2
rmse_two_dict = dict()
for k in range(2,8):
    rmse_two = train_and_test(df, kfold_cross_validation, k)  
    rmse_two = round(rmse_two)
    rmse_two_dict[k] = rmse_two

# Convert rmse_dict dictionary to dataframe
rmse_two_df = pd.DataFrame(rmse_two_dict.items(),
                       columns=['k-folds', 'rmse'])
print("For k-fold cross validation (v>1), and varying k:  ")
rmse_two_df

Selecting the validation method and number of K-folds based on the Root Mean Square Error, RMSE.  
1. For **Holdout validation**, the data is split into two sets, so K-folds is 2.  This method returned a high RMSE value of **57,116**. 
2. For **Simple Cross validation**, the data is also split into two sets, so K-folds is 2.  This method returned the highest RMSE value of **69,695**.  
3. For **K-Fold Cross validation**, K-folds can vary.  Since this method returned the lowest RMSE values, then **this method is selected**.  With increasing K-folds from 2 to 6, the RMSE decreased from **29,216 to 28,016**.  On this trend, then there is a temptation to use a maximum number of K-folds.  However, considering both bias and variance error, there is a bias-variance tradeoff; i.e. as the bias error decreases, then the variance error increase, and vice versa.  As a rule, typically a **k=3 is selected**.

## <font color='blue'> CONCLUSION </font>  

The goal of this project was to create a model to predict home prices for Ames, Iowa.  A linear regression model was selected, then features were engineered, features were selected, the model was trained and tested, and the type of validation and the number of K-folds were selected. 

Feature engineering included dealing with missing values, creating new features, and removing features that are not useful or leak information about the home's sale price.  Feature selection consisted of creating a correlation heatmap, and converting select features to categorical.  

The K-fold cross validation method yields the lowest RMSE and a typical value of number of K-folds of 3 was selected.