# Stretch Goals

- Write a blog post explaining one of today's topics.
- Find a new regression dataset from the UCI machine learning repository and use it to test out your new modeling skillz.
  - [UCI Machine Learning Repository - Regression Datasets](https://)
- Make a list for yourself of common feature engineering techniques. Browse Kaggle kernels to learn more methods.
- Start studying for tomorrow's topic: Gradient Descent
- Try to build the best model you can with this dataset: clean as many features as possible, engineer the most sensible features you can, and see how accurate a prediction you can make.
- Learn about the "Dummy Variable Trap" and how it applies to linear regression modeling.
- Learn about using linear regression to model time series data.
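
As a starting point for the "Dummy Variable Trap" item above, here is a minimal sketch of why it matters. The toy frame is made up for illustration (only the `Street` column name and its `Pave`/`Grvl` values come from the Ames data); it shows that a full set of dummy columns plus an intercept is linearly dependent, and that dropping one category removes the redundancy.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; not rows from the real dataset.
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave", "Grvl"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy["Street"], dtype=float)

# With an intercept column, the dummy columns sum to the intercept,
# so the design matrix has more columns than its rank.
X_full = np.column_stack([np.ones(len(toy)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one category (the "reference" level) removes the redundancy.
reduced = pd.get_dummies(toy["Street"], drop_first=True, dtype=float)
X_reduced = np.column_stack([np.ones(len(toy)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

When the rank is lower than the number of columns, the ordinary least squares coefficients are not uniquely determined, which is the practical symptom of the trap.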

# Intermediate Linear Regression Practice

## Use a Linear Regression model to get the lowest RMSE possible on the following dataset:

[Dataset Folder](https://github.com/ryanleeallred/datasets/tree/master/Ames%20Housing%20Data)

[Raw CSV](https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv)

## Your model must include (at least):
- A log-transformed y variable
- Two polynomial features
- One interaction feature
- 10 other engineered features
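
The first three requirements can be sketched as follows. This is only an illustration: the five rows are the `LotFrontage`, `LotArea`, and `SalePrice` values visible in the `head()` output of this notebook, and the choice of which columns to square or multiply is an assumption, not a recommendation. The new column names (`log_SalePrice`, `LotArea_sq`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output; the real work uses the full frame.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Log-transformed y variable.
df["log_SalePrice"] = np.log(df["SalePrice"])

# Two polynomial (squared) features.
df["LotArea_sq"] = df["LotArea"] ** 2
df["LotFrontage_sq"] = df["LotFrontage"] ** 2

# One interaction feature: the product of two existing features.
df["Frontage_x_Area"] = df["LotFrontage"] * df["LotArea"]

print(df.head())
```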

What is the lowest Root-Mean-Squared Error that you are able to obtain? Share your best RMSEs in Slack!
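
One way to compute the RMSE when the target is log-transformed is sketched below. The data here is synthetic (generated with NumPy, not the Ames data), and the split sizes and seed are arbitrary assumptions. The point is that an RMSE measured on the log scale and one measured after converting predictions back with `np.exp` are different numbers, so say which one you are reporting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Ames data): one feature, exponential target.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = np.exp(11 + 0.0004 * X[:, 0] + rng.normal(0, 0.1, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the log of the target.
model = LinearRegression().fit(X_train, np.log(y_train))
log_pred = model.predict(X_test)

# RMSE on the log scale, and RMSE in the target's original units.
rmse_log = np.sqrt(mean_squared_error(np.log(y_test), log_pred))
rmse_original = np.sqrt(mean_squared_error(y_test, np.exp(log_pred)))

print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"RMSE (original units): {rmse_original:,.0f}")
```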

Notes:

There may be some data cleaning that you need to do on some features of this dataset. Linear Regression will only accept numeric values and will not accept missing values (NaNs) or strings.

Note: There may not be a clear candidate for an interaction term in this dataset. Include one anyway; it is sometimes good practice in feature engineering for predictive modeling.

In [62]:
# Incantations
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing

In [51]:
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv'
ames = pd.read_csv(url)
print(ames.shape)
ames.head()

(1460, 81)


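| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |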


In [58]:
# First, which of the columns have NaNs?
null_sums = ames.isnull().sum()
has_nulls = set(null_sums[null_sums > 0].index)
numeric_columns = set(ames.select_dtypes('number').columns)
categorical_columns = set(ames.select_dtypes(exclude='number').columns)

numeric_with_nulls = has_nulls & numeric_columns
categorical_with_nulls = has_nulls & categorical_columns

print(f'Categorical_with_nulls:\n{categorical_with_nulls}')
print(f'\nNumeric_with_nulls:\n{numeric_with_nulls}')

Categorical_with_nulls:
{'GarageQual', 'Electrical', 'MiscFeature', 'BsmtExposure', 'GarageCond', 'PoolQC', 'MasVnrType', 'FireplaceQu', 'Alley', 'BsmtFinType2', 'Fence', 'GarageType', 'GarageFinish', 'BsmtCond', 'BsmtQual', 'BsmtFinType1'}

Numeric_with_nulls:
{'MasVnrArea', 'GarageYrBlt', 'LotFrontage'}


In [65]:
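# I'm going to fill NaNs in the numerical columns: the mean year for
# GarageYrBlt, and 0 for LotFrontage and MasVnrArea.
# Assigning the result back avoids the chained-assignment warnings that
# column-level inplace=True triggers in recent pandas versions.
ames['GarageYrBlt'] = ames['GarageYrBlt'].fillna(ames['GarageYrBlt'].mean())
ames['LotFrontage'] = ames['LotFrontage'].fillna(0)
ames['MasVnrArea'] = ames['MasVnrArea'].fillna(0)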

# In the categorical columns, I'll replace NaNs with a string
# value that won't be interpreted as a NaN, so that it will
# be classified as its own category

ames.fillna("wombat", inplace=True)

In [66]:
enc = preprocessing.OrdinalEncoder()

In [74]:
ames_encoded = enc.fit_transform(ames[list(categorical_columns)])
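
Because `categorical_columns` is a set, `list(categorical_columns)` has no guaranteed order, so it can be hard to tell which entry of `enc.categories_` belongs to which column. The sketch below shows one possible way to keep that mapping explicit, using a sorted column list and a small toy frame (made-up rows that reuse the `Street`, `Alley`, and `LotArea` column names). The names `cat_cols`, `toy_numeric`, and `category_lookup` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame for illustration; "wombat" stands in for a filled NaN as above.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "Alley": ["wombat", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# A sorted list gives a stable, reproducible column order.
cat_cols = sorted(toy.select_dtypes(exclude="number").columns)

enc = OrdinalEncoder()
encoded = pd.DataFrame(
    enc.fit_transform(toy[cat_cols]),
    columns=cat_cols,
    index=toy.index,
)

# Swap the string columns for their encoded versions.
toy_numeric = toy.drop(columns=cat_cols).join(encoded)

# Map each column name to the categories the encoder learned for it.
category_lookup = dict(zip(cat_cols, enc.categories_))

print(toy_numeric)
print(category_lookup)
```

Keep in mind that ordinal codes impose an order on the categories, which a linear model will treat as meaningful; whether that is appropriate depends on the feature.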

In [75]:
enc.categories_

[array(['Floor', 'GasA', 'GasW', 'Grav', 'OthW', 'Wall'], dtype=object),
 array(['N', 'Y'], dtype=object),
 array(['Grvl', 'Pave'], dtype=object),
 array(['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE'], dtype=object),
 array(['Grvl', 'Pave', 'wombat'], dtype=object),
 array(['Ex', 'Fa', 'Gd', 'Po', 'TA'], dtype=object),
 array(['GdPrv', 'GdWo', 'MnPrv', 'MnWw', 'wombat'], dtype=object),
 array(['2Types', 'Attchd', 'Basment', 'BuiltIn', 'CarPort', 'Detchd',
        'wombat'], dtype=object),
 array(['ALQ', 'BLQ', 'GLQ', 'LwQ', 'Rec', 'Unf', 'wombat'], dtype=object),
 array(['Ex', 'Fa', 'Gd', 'TA'], dtype=object),
 array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object),
 array(['Bnk', 'HLS', 'Low', 'Lvl'], dtype=object),
 array(['Artery', 'Feedr', 'Norm', 'PosA', 'PosN', 'RRAe', 'RRAn', 'RRNn'],
       dtype=object),
 array(['Ex', 'Fa', 'Gd', 'Po', 'TA', 'wombat'], dtype=object),
 array(['Ex', 'Fa', 'Gd', 'Po', 'TA', 'wombat'], dtype=object),
 array(['ClyTile', 'CompShg', 'Membran', 'Me