
# Intermediate Linear Regression Practice

## Use a Linear Regression model to get the lowest RMSE possible on the following dataset:

[Dataset Folder](https://github.com/ryanleeallred/datasets/tree/master/Ames%20Housing%20Data)

[Raw CSV](https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv)

## Your model must include (at least):
- A log-transformed y variable
- Two polynomial features
- One interaction feature
- 10 other engineered features

What is the lowest Root-Mean-Squared Error that you are able to obtain? Share your best RMSEs in Slack!
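Because the target is log-transformed, an RMSE computed on it is in log-price units rather than dollars. The sketch below shows the difference. The `rmse` helper and the predicted prices are made up for illustration; only the three sale prices come from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length arrays (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

prices = np.array([208500, 181500, 223500])     # first three SalePrice values
predicted = np.array([200000, 190000, 230000])  # made-up predictions

# RMSE on the log scale (what a model fit on log(SalePrice) reports)
log_rmse = rmse(np.log(prices), np.log(predicted))

# RMSE on the original dollar scale
dollar_rmse = rmse(prices, predicted)

print('log RMSE:', log_rmse)
print('dollar RMSE:', dollar_rmse)
```

A small log-scale RMSE can be read, roughly, as a typical relative error, which is why it looks tiny next to the dollar figure.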

Notes:

Some features of this dataset may need cleaning. Linear Regression will only accept numeric values and will not accept missing (NaN) values.

Note: There may not be a clear candidate for an interaction term in this dataset. Include one anyway; it's good practice for predictive-modeling feature engineering in general.
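As a rough sketch of what the required feature types look like when built by hand: the frame below is a tiny stand-in for the real data, and the choice of `GrLivArea` and `OverallQual` as inputs is an assumption for illustration, not a recommendation from the assignment.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Ames data; values are illustrative only
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'GrLivArea': [1710, 1262, 1786, 1717],
    'OverallQual': [7, 6, 7, 7],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Log-transformed y variable
toy['log_SalePrice'] = np.log(toy['SalePrice'])

# Two polynomial features (squares of existing numeric columns)
toy['GrLivArea_sq'] = toy['GrLivArea'] ** 2
toy['OverallQual_sq'] = toy['OverallQual'] ** 2

# One interaction feature (product of two different columns)
toy['Qual_x_Area'] = toy['OverallQual'] * toy['GrLivArea']

print(toy.head())
```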

In [0]:
# imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse

In [111]:
# data import
df_raw = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv', index_col=0)
print (df_raw.shape)
df_raw.head()

(1460, 80)


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


Data cleaning time!

In [0]:
# single out target variable
target = 'SalePrice'
y = np.log(df_raw[target])
X = df_raw.drop([target], axis=1).copy()

In [113]:
# handling null values

# for some variables, null seems reasonable to represent the feature not
# being present for the property (e.g. no fence or pool)
# we will fill these ones with the string 'None' and OHE
explainable_nulls = ['Alley', 'MasVnrType', 'BsmtQual',
                    'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                    'BsmtFinType2', 'FireplaceQu', 'GarageType',
                    'GarageYrBlt', 'GarageFinish', 'GarageQual',
                    'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

X[explainable_nulls] = X[explainable_nulls].copy().fillna(value='None')

# fill these with zeroes
numeric_nulls = ['LotFrontage', 'MasVnrArea']

X[numeric_nulls] = X[numeric_nulls].fillna(0)

# fill one null in Electrical with 'Mix'
X['Electrical'] = X['Electrical'].fillna('Mix')

X.isnull().sum().sum()

0

In [114]:
# encoding categorical variables
categorical = ['MSSubClass', 'MSZoning', 'Street', 'Alley',
               'LotShape', 'LandContour', 'Utilities',
               'LotConfig', 'LandSlope', 'Neighborhood', 
               'Condition1', 'Condition2', 'BldgType',
               'HouseStyle', 'RoofStyle', 'RoofMatl',
               'Exterior1st', 'Exterior2nd', 'ExterQual',
               'ExterCond', 'Foundation', 'BsmtQual',
               'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
               'BsmtFinType2', 'Heating', 'HeatingQC',
               'CentralAir', 'Electrical', 'KitchenQual',
               'Functional', 'FireplaceQu', 'GarageType',
               'GarageFinish', 'GarageQual', 'GarageCond',
               'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
               'MiscVal', 'MoSold', 'SaleType', 'SaleCondition',
               'MasVnrType']

# append log of numeric variables
for col in X.columns:
  if col not in categorical:
    if X[col].dtype != 'object':
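      # `>- 0` parsed as `> -0`, i.e. `> 0`; written explicitly here.
      # Only strictly positive columns are logged, so the result is always finite.
      if X[col].min() > 0:
        X['log_' + col] = np.log(X[col])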
      
# one hot encode all categorical variables
X_processed = pd.get_dummies(X, prefix_sep="__",
                              columns=categorical)

obj_df = X_processed.select_dtypes(include=['object']).copy()
obj_df.head()

Id,GarageYrBlt
1,2003
2,1976
3,2001
4,1998
5,2000


In [115]:
# GarageYrBlt is still object dtype because its nulls were filled with 'None' above;
# it should be encoded as an integer instead
# TODO later; this is annoying so I'm just dropping it
X_processed.drop('GarageYrBlt', axis=1, inplace=True)
X_values = X_processed.values
type(X_values)

numpy.ndarray
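One possible way to keep `GarageYrBlt` instead of dropping it, sketched on a made-up four-row frame. This is not what the notebook does: the `HasGarage` indicator and the choice to fill missing years with 0 are assumptions for illustration.

```python
import pandas as pd

# Made-up sample mimicking the column after nulls were filled with 'None'
garage = pd.DataFrame(
    {'GarageYrBlt': [2003.0, 'None', 1976.0, 'None']},
    index=pd.Index([1, 2, 3, 4], name='Id'),
)

# Coerce to numeric; anything unparseable (the 'None' strings) becomes NaN
years = pd.to_numeric(garage['GarageYrBlt'], errors='coerce')

# Hypothetical indicator so the model can tell "no garage" from a real year
garage['HasGarage'] = years.notna().astype(int)

# Fill the missing years with 0 and store the column as integers
garage['GarageYrBlt'] = years.fillna(0).astype(int)

print(garage)
```

With the indicator in place, the 0 fill acts as a placeholder rather than a real year, though a linear model will still see it as a number.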

Data enhancement / feature engineering time!

In [0]:
# creating polynomial features
# this will create squared features AND interaction features, don't know how
# to label this though
poly2 = PolynomialFeatures(2)
X_poly = poly2.fit_transform(X_values)

In [0]:
# splitting data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                   test_size=0.20,
                                                   random_state=100)

Time for regression!

In [118]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [119]:
print('Training R^2', lr.score(X_train, y_train))
print('Train RMSE', mse(y_train, lr.predict(X_train))**0.5)

Training R^2 1.0
Train RMSE 1.6511303565385025e-10


It fits the training data perfectly, but I suspect it is overfitting given all the features introduced.
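A perfect training fit is what ordinary least squares tends to produce whenever there are more features than rows, even when the features carry no signal. The sketch below demonstrates this on pure random noise; the sizes (30 rows, 100 columns) are arbitrary assumptions and have nothing to do with the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_rows, n_cols = 30, 100  # more features than rows

# Features and target are independent noise: there is nothing to learn
X_noise = rng.normal(size=(n_rows, n_cols))
y_noise = rng.normal(size=n_rows)

model = LinearRegression().fit(X_noise, y_noise)
train_r2 = model.score(X_noise, y_noise)

# Fresh noise drawn the same way plays the role of a test set
X_new = rng.normal(size=(n_rows, n_cols))
y_new = rng.normal(size=n_rows)
new_r2 = model.score(X_new, y_new)

print('train R^2:', train_r2)
print('new-data R^2:', new_r2)
```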

Let's see how it does on the test set.

In [120]:
print('Test R^2', lr.score(X_test, y_test))
print('Test RMSE', mse(y_test, lr.predict(X_test))**0.5)

Test R^2 0.27253573359022254
Test RMSE 0.3538945106096992


Yep, not so good lol. The model explains only about 27% of the variation in the test data. We'll have to introduce some penalties to curb these out-of-control features.

In [121]:
# Ridge regression
ridge = Ridge(alpha=2000000000.0) # alpha has to be pretty large because of our feature count
ridge.fit(X_train, y_train)

# model scoring
print('Training R^2', ridge.score(X_train, y_train))
print('Test R^2', ridge.score(X_test, y_test))
print('Train RMSE', mse(y_train, ridge.predict(X_train))**0.5)
print('Test RMSE', mse(y_test, ridge.predict(X_test))**0.5)

Training R^2 0.9546971425019742
Test R^2 0.8704294536288258
Train RMSE 0.08414048012861412
Test RMSE 0.14935547197081409
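The notebook does not say how this alpha was chosen. One alternative to picking it by hand is to cross-validate over a grid using the training data only, which keeps the test set out of the tuning loop. A sketch with scikit-learn's `RidgeCV` on synthetic data; the grid and the data are assumptions, and a grid suited to the polynomial Ames features would need to reach much larger values.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic regression problem: 200 rows, 20 features, known linear signal
X_demo = rng.normal(size=(200, 20))
true_coef = rng.normal(size=20)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Candidate penalties on a log-spaced grid (assumed range)
alphas = np.logspace(-3, 3, 13)

# 5-fold cross-validation picks the alpha with the best average score
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_demo, y_demo)
best_alpha = ridge_cv.alpha_

print('selected alpha:', best_alpha)
print('R^2 on the fitting data:', ridge_cv.score(X_demo, y_demo))
```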


# Stretch Goals

- Write a blog post explaining one of today's topics.
- Find a new regression dataset from the UCI machine learning repository and use it to test out your new modeling skillz.
  - [UCI Machine Learning Repository - Regression Datasets](https://)
- Make a list for yourself of common feature engineering techniques. Browse Kaggle kernels to learn more methods.
- Start studying for tomorrow's topic: Gradient Descent
- Try to make the ultimate model with this dataset: clean as many features as possible, engineer the most sensible features you can, and see how accurate a prediction you can make.
- Learn about the "Dummy Variable Trap" and how it applies to linear regression modeling.
- Learn about using linear regression to model time series data.
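For the "Dummy Variable Trap" item, a minimal sketch of the idea: when every category gets its own one-hot column, the columns sum to 1 in each row and are therefore collinear with the intercept. Dropping one level removes that redundancy. The three-row frame is made up for illustration.

```python
import pandas as pd

# Made-up sample of a two-level categorical column
streets = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

# Full one-hot encoding: one column per category
full = pd.get_dummies(streets, columns=['Street'], prefix_sep='__', dtype=int)

# Dropping the first level leaves k - 1 columns for k categories
reduced = pd.get_dummies(streets, columns=['Street'], prefix_sep='__',
                         drop_first=True, dtype=int)

# Each row of the full encoding sums to 1, duplicating the intercept
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

With a penalized model such as Ridge the redundancy is less harmful, but it still makes individual coefficients harder to interpret.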