# Project 2 - Ames Housing Data

## Feature Engineering and Model Selection

*Author: Grace Campbell*


> - Previous Notebook: [Data Cleaning and Exploratory Data Analysis](Project-2-Data-Cleaning-EDA.ipynb)
> - Next Notebook: [Model Optimization](Project-2-Model-Optimization.ipynb)

___

In this notebook I will be selecting feature variables to use in my final model with `LassoCV`.

I want to use all features in the dataset, perform a Lasso regularization on the data, and look at which variables the Lasso chose to keep in the model as being the most predictive. After engineering and choosing my feature variables, I will continue on to my final model.

In [154]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

import pandas as pd
import numpy as np
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv('./datasets/train_cleaned.csv', keep_default_na=False).drop('Unnamed: 0', axis=1)
df_test = pd.read_csv('./datasets/test_cleaned.csv', keep_default_na=False).drop(['Unnamed: 0', 'Id', 'PID'], axis=1)

I want to use Lasso regularization for this model because there are 81 features, not including dummy columns or feature transformations. I want to bring some of the coefficients for these variables down to 0 if they do not have much predictive value.
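
To illustrate how the Lasso pushes uninformative coefficients to exactly 0, here is a minimal sketch on synthetic data (not the Ames data). The data, the choice of `alpha=0.1`, and names such as `X_demo` and `true_coef` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 20 features, only the first 3 affect the target
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Scale first so the L1 penalty treats every feature equally
X_scaled = StandardScaler().fit_transform(X_demo)
lasso = Lasso(alpha=0.1).fit(X_scaled, y_demo)

n_zero = int(np.sum(lasso.coef_ == 0))
n_kept = n_features - n_zero
print(f"kept {n_kept} features, zeroed out {n_zero}")
```

The features whose coefficients survive are the ones the Lasso treats as predictive, which is the same idea I rely on for selecting variables in this notebook.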

#### Steps:
1. Perform a log transformation on `y` because its distribution is positively skewed
2. One-hot encode the categorical variables in the dataset so they can be included in the model
3. Transform all explanatory variables with `PolynomialFeatures` to better capture the explanatory variables' relationships with the target and with each other
4. Scale the variables with `StandardScaler`
5. Fit the data to a `Lasso` regularization
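
As a rough illustration of step 1, the sketch below compares skewness before and after a log transformation. It uses simulated, log-normally distributed "prices" (an assumption, not the actual `SalePrice` column) and shows that `np.exp` undoes the transformation when predictions need to be returned to dollars.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed prices (hypothetical values, not the Ames data)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=2000))

# Log transform, as in step 1
log_prices = np.log(prices)

skew_raw = prices.skew()      # clearly positive for right-skewed data
skew_log = log_prices.skew()  # much closer to 0 after the transform

# np.exp inverts np.log, so predictions can be converted back to prices
recovered = np.exp(log_prices)

print(f"skew before: {skew_raw:.2f}, skew after: {skew_log:.2f}")
```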

In [109]:
# Creating dummies for both datasets
df = pd.get_dummies(df, drop_first=True)
df_test = pd.get_dummies(df_test, drop_first=True)

# Making sure both datasets share the same dummy columns
train_features = [col for col in df.columns if col in df_test.columns]
test_features = [col for col in df_test.columns if col in df.columns]

# Defining X and y for training set
X = df[[col for col in train_features if col not in ('Misc Val', 'Pool Area', 'Pool QC')]] # <-- see below for explanation
y = df['SalePrice']

# Log transforming y
y_log = np.log(y)

# Defining X for test set, using the same columns (in the same order) as X
test_X = df_test[X.columns]

# Train-test-splitting
X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=42)

# Creating a pipeline 
pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('scaler', StandardScaler()),
    ('lcv', LassoCV(alphas=[0.0023636363]))  # alphas must be array-like, so the single value is wrapped in a list
])

> **Note**:
I ran this model with all columns in the beginning, and two of the highest (negative) coefficients were interactions with `Misc Val` and `Pool Area`, both of which have high proportions of values that equal 0 in the dataframe. I looked at the proportions of 0-values in both columns, which were almost 100% each. This means that when an interaction term is made with one of these columns, almost all of the values will also be 0, regardless of what the values are for the other variable of the interaction term. Therefore, I removed these columns, along with `Pool QC` (the proportion of 0s is exactly the same as in `Pool Area`), in my matrix of features.
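
The check described in the note can be sketched as follows. The small dataframe `toy` and its values are hypothetical stand-ins for the real columns; only the technique (share of zeros per column, and its effect on an interaction term) mirrors what I did.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
toy = pd.DataFrame({
    'Misc Val': [0, 0, 0, 0, 500],
    'Pool Area': [0, 0, 0, 0, 0],
    'Gr Liv Area': [1200, 1500, 900, 2000, 1750],
})

# Proportion of values equal to 0 in each column
zero_share = (toy == 0).mean()

# An interaction with a mostly-zero column is itself mostly zero
interaction = toy['Misc Val'] * toy['Gr Liv Area']
interaction_zero_share = (interaction == 0).mean()

print(zero_share)
print(interaction_zero_share)
```

On the real data, the same `(df == 0).mean()` expression applied to `Misc Val` and `Pool Area` gives the proportions mentioned in the note.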

In [156]:
pipe.named_steps['lcv'].alpha_

0.0023636363636363638

In [110]:
# Fitting the model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lcv', LassoCV(alphas=array([0.0005 , 0.00055, ..., 0.00495, 0.005  ]), copy_X=True,
    cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100,
    n_jobs=1, normalize=False, positive=False, precompute='auto',
    random_state=None, selection='cyclic', tol=0.0001, verbose=False))])

In [146]:
# Scoring the model
pipe.score(X_test, y_test)

0.8207500403055616

This model has an $R^2$ score of 0.82 on the held-out split, which means that ~82% of the variation in log sale price is explained by this model relative to the null model (one that always predicts the mean).
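
For reference, `pipe.score` on a regression pipeline returns $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{tot}$ is the squared error of the null model. The sketch below computes this by hand and checks it against `sklearn.metrics.r2_score`; the arrays `y_true` and `y_pred` are made-up log prices used only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up log prices and predictions (hypothetical values)
y_true = np.array([11.8, 12.1, 12.4, 11.9, 12.6])
y_pred = np.array([11.9, 12.0, 12.3, 12.1, 12.5])

# Squared error of the model
ss_res = np.sum((y_true - y_pred) ** 2)
# Squared error of the null model, which always predicts the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```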

Now I will look at the top 30 coefficients that the Lasso deemed the most predictive:

In [155]:
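coef_df = pd.DataFrame({
    # On scikit-learn >= 1.0 use get_feature_names_out (get_feature_names was removed in 1.2)
    'Coefficient': pipe.named_steps['poly'].get_feature_names(X.columns),
    'Value': pipe.named_steps['lcv'].coef_
})
# Rank by absolute value so that large negative coefficients are included
coef_df['abs'] = coef_df['Value'].abs()
coef_df.sort_values(by='abs', ascending=False).drop('abs', axis=1).head(30)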

| | Coefficient | Value |
|---|---|---|
| 1416 | Overall Cond Gr Liv Area | 0.06314 |
| 10730 | Neighborhood_Edwards Exterior 1st_CemntBd | -0.049106 |
| 12482 | Condition 1_Feedr Exterior 1st_Stucco | -0.048912 |
| 884 | Utilities Year Built | 0.03688 |
| 882 | Utilities Overall Qual | 0.033297 |
| 1567 | Year Built Year Remod/Add | 0.032837 |
| 4615 | Gr Liv Area Functional | 0.032139 |
| 1227 | Overall Qual Year Built | 0.023096 |
| 9467 | Land Contour_Low Foundation_Slab | -0.021828 |
| 3846 | Heating QC Gr Liv Area | 0.021581 |


The first thing I notice is that every variable in this list is an interaction term. Another thing I notice is that, as I suspected, the interaction between `Year Built` and `Year Remod/Add` has a relatively high predictive value.

It does not surprise me that the interaction between `Overall Cond` and `Gr Liv Area` has the highest coefficient. It makes sense that the overall condition of a house and its total above-ground living area in square feet would together have a large effect on the price of that house.

For the sake of memory and time, I do not want to include every categorical dummy variable in my final model, so I will use these coefficients to choose which ones to include. I definitely want to include the `Neighborhood` column, because the neighborhood in which a house is located strongly affects its price. I also notice many coefficients involving `Exterior 1st`, `Condition 1`, `Foundation`, and `Sale Type`, so I will one-hot encode and include these variables in my final model.
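
One way to make this selection less of an eyeball exercise is to count how many nonzero coefficients involve each categorical variable. The sketch below is only an illustration: it rebuilds a tiny `coef_df` from a few rows of the table above, and the `prefixes` list assumes the dummy columns follow the default `pd.get_dummies` naming (`<column>_<value>`), with `Sale Type_` being my assumed prefix for that column.

```python
import pandas as pd

# A few rows copied from the coefficient table above (not the full output)
coef_df = pd.DataFrame({
    'Coefficient': [
        'Overall Cond Gr Liv Area',
        'Neighborhood_Edwards Exterior 1st_CemntBd',
        'Condition 1_Feedr Exterior 1st_Stucco',
        'Utilities Year Built',
        'Land Contour_Low Foundation_Slab',
    ],
    'Value': [0.06314, -0.049106, -0.048912, 0.03688, -0.021828],
})

# Assumed dummy-column prefixes for the categorical variables of interest
prefixes = ['Neighborhood_', 'Exterior 1st_', 'Condition 1_', 'Foundation_', 'Sale Type_']

# Keep only the terms the Lasso did not shrink to 0
nonzero = coef_df[coef_df['Value'] != 0]

# Count the surviving terms that involve each categorical variable
counts = {
    p: int(nonzero['Coefficient'].str.contains(p, regex=False).sum())
    for p in prefixes
}
print(counts)
```

Run against the full `coef_df`, the same counting would show which categorical variables appear most often among the surviving terms.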

#### Click [here](Project-2-Ames-Housing-Data.ipynb) to view the next notebook, where I run my final model.

____