# Project: The Linear Regression

### Introduction

In this project we'll work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. You can read about the different columns in the data [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).


### Import libraries, load and explore data

In [215]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, KFold

In [216]:
df = pd.read_csv('https://bit.ly/3boZCX4)',sep= "\t")

In [217]:
df.shape

(2930, 82)

In [218]:
df.head(3)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000


## 2. Feature Engineering

Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. We'll update `transform_features()` so that any column from the data frame with more than 25% (or another cutoff value) missing values is dropped. For numerical columns that contain less than 5% missing values, let's fill in the missing values using the most popular value for that column. We'll also need to remove any columns that leak information about the sale (e.g. like the year the sale happened). In general, the goal of this function is to:

* remove features that we don't want to use in the model, just based on the number of missing values or data leakage.
* transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc).
* create new features by combining other features.

Next, we need to get more familiar with the remaining columns by reading the data documentation for each column, determining what transformations are necessary (if any), and more. Succeeding in predictive modeling (and competitions like Kaggle) is highly dependent on the quality of features the model has. Libraries like scikit-learn have made it quick and easy to simply try and tweak many different models, but cleaning, selecting, and transforming features are still more of an art that requires a bit of human ingenuity.

In [219]:
def transform_features(df_):
  nulls = df_.isnull().sum() / len(df_)

  # drop columns with more than 25% null values
  more_than_25pct_nulls = nulls[nulls > 0.25].index
  df_.drop(more_than_25pct_nulls, axis=1, inplace=True)

  ''' 
      handle nulls in numeric columns
      replace nulls with mode for numeric columns with less than 5% of null values
      replace nulls with zero for numeric columns with less 5% or more of null values
  '''
  numeric_cols = df_.select_dtypes(include = ['integer', 'float']).columns
  numeric_nulls = df_[numeric_cols].isnull().sum() / len(df)
  less_than_5pct_nulls = numeric_nulls[numeric_nulls < 0.05].index
  more_than_5pct_nulls = numeric_nulls[numeric_nulls >= 0.05].index

  # get mode for numeric columns with less than 5% of null values
  mode = df_[less_than_5pct_nulls].mode().iloc[0]

  # replace nulls with mode for numeric columns with less than 5% of null values
  df_[less_than_5pct_nulls] = df_[less_than_5pct_nulls].fillna(value=mode)

  # replace nulls with zero for numeric columns with less 5% or more of null values
  df_[more_than_5pct_nulls] = df_[more_than_5pct_nulls].fillna(value=0)

  ''' handle nulls in nominal columns '''
  nominal_cols = df_.select_dtypes(exclude = ['integer', 'float']).columns
  df_[nominal_cols] = df_[nominal_cols].fillna(value='Unknown')

  ''' create new features'''
  years_until_remod = df_['Year Built'] - df['Year Remod/Add'] # will be zero if remodel date  is the same as construction date
  df_['Years Until Remod'] = years_until_remod

  # drop any columns that leak information about the sale, that we don't want to use in the model
  df_.drop(['PID', 'Order', 'Year Built', 'Year Remod/Add'], axis = 1)

  return df_

## 3. Feature Selection

Now that we have cleaned and transformed a lot of the features in the data set, it's time to move on to feature selection for numerical features. We will update the logic for the select_features() function. This function should take in the new, modified train and test data frames that were returned from `transform_features()`.

In [220]:
def select_features(df_):
  # generate a correlation heatmap matrix of the numerical features
  corr_coeffs = df_.select_dtypes(include = ['integer', 'float']).corr()['SalePrice'].abs().sort_values(ascending = False)

  # drop columns with correlation coefficient < 0.3
  df_ = df_.drop(corr_coeffs[corr_coeffs < 0.3].index, axis=1)

  nominal_cols = df_.select_dtypes(exclude = ['integer', 'float']).columns

  # drop columns with more than 10 unique values
  unique_values_count = df_[nominal_cols].apply(lambda col: len(col.value_counts()))
  cols_to_drop = unique_values_count[unique_values_count > 10].index
  cols_to_keep = unique_values_count[unique_values_count <= 10].index
  df_ = df_.drop(cols_to_drop, axis=1)

  # One hot encoding
  df_ = pd.get_dummies(df_, columns=cols_to_keep, drop_first=True)

  return df_

## 4. Build and Train

Now for the final part of the pipeline, training and testing. When iterating on different features, using simple validation is a good idea. Let's add a parameter named `k` that controls the type of cross validation that occurs.

When `k` equals to `0`:

* Select the first 1460 rows and assign to train
* Select the remaining rows and assign to test
* Train on train and test on test
* Compute and return the RMSE

When `k` equals to `1`, perform simple cross validation:
* Shuffle the ordering of the rows in the data frame
* Select the first 1460 rows and assign to fold_one
* Select the remaining rows and assign to fold_two
* Train on fold_one and test on fold_two
* Train on fold_two and test on fold_one
* Compute and return the average RMSE

When `k` is greater than `1`, implement k-fold cross validation using `k` folds and return the average `RMSE` 

In [221]:
def train_and_test(passed_df, k=0):

  target = 'SalePrice'

  if k == 0:
    training_set = passed_df[:1460]
    testing_set = passed_df[1460:]
    features = training_set.columns.drop('SalePrice')

    lr = LinearRegression()
    lr.fit(training_set[features], training_set[target])
    predictions = lr.predict(testing_set[features])

    return np.sqrt(mean_squared_error(testing_set[target], predictions))

  if k == 1:
    lr = LinearRegression()
    shuffled_df = passed_df.sample(frac=1, )
    
    fold_one = shuffled_df[:1460]
    fold_one_x = fold_one.drop(['SalePrice'], axis=1)
    fold_one_y = fold_one['SalePrice']
    fold_one_x_train, fold_one_x_test, fold_one_y_train, fold_one_y_test = train_test_split(fold_one_x, fold_one_y, test_size = 0.3, random_state = 0)

    lr.fit(fold_one_x_train, fold_one_y_train)
    fold_one_prediction = lr.predict(fold_one_x_test)

    fold_one_rmse = np.sqrt(mean_squared_error(fold_one_y_test, fold_one_prediction))

    fold_two = shuffled_df[1460:]
    fold_two_x = fold_two.drop(['SalePrice'], axis=1)
    fold_two_y = fold_two['SalePrice']
    fold_two_x_train, fold_two_x_test, fold_two_y_train, fold_two_y_test = train_test_split(fold_two_x, fold_two_y, test_size = 0.3, random_state = 0)

    lr.fit(fold_two_x_train, fold_two_y_train)
    fold_two_prediction = lr.predict(fold_two_x_test)

    fold_two_rmse = np.sqrt(mean_squared_error(fold_two_y_test, fold_two_prediction))

    avg_rmse = (fold_one_rmse + fold_two_rmse) / 2 

    return avg_rmse

  if k > 1:
    lr = LinearRegression()
    kfold = KFold(n_splits=k, shuffle=True)
    rmse_list = []

    for train_index, test_index, in kfold.split(passed_df):
        features = passed_df.columns.drop(target)
        train = passed_df.iloc[train_index]
        test = passed_df.iloc[test_index]

        lr.fit(train[features], train[target])
        predictions = lr.predict(test[features])

        rmse = np.sqrt(mean_squared_error(test[target], predictions))
        rmse_list.append(rmse)

    return np.mean(rmse_list)

## 5. Test

In [222]:
transformed_df = transform_features(df)
final_df = select_features(transformed_df)

In [223]:
train_and_test(final_df, k=0)

38594.42344493608

In [224]:
train_and_test(final_df, k=1)

32550.035468419246

In [227]:
train_and_test(final_df, k=3)

35069.19689767552

## 5. Next Steps

Potenial next steps that we can take:

1. Continue iteration on feature engineering:
* Research some other approaches to feature engineering online around housing data.
* Visit the Kaggle kernels [page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels) page for this dataset to see approaches others took.

2. Improve your feature selection:
* Research ways of doing feature selection better with categorical columns.