# Assignment 9 - Basic Machine Learning Example
### Let's Play with a Decision Tree 
Let's try to predict housing prices in Iowa using a test and train dataset provided by Kaggle.

Source & images for the notebook: **kaggle.com/learn/intro-to-machine-learning.**

#### Your Challenge - predicting sales price
We'll explore a dataset for houses in Iowa that has a variety of different attributes (called "features" in the ML world). 

Your goal is to estimate what a reasonable range for a sale price might look like.

# Grade: 100%

### Assignment
1. Explore the Iowa housing data (EDA) using Pandas methods
2. Pick the features you think will best predict price, as measured by MAE (mean absolute error)
3. Split the data (test, train, validate)
4. Train your model & predict your values
5. Validate your predictions
6. Run a random forest to compare against the decision tree

### Decision Trees
There are a variety of models to use with machine learning. First up, we'll use a tool called a decision tree, which has "nodes" and "leaves". Basically, it splits the data at key points (nodes) based on the value of an attribute, then keeps splitting further down each branch until it reaches an end point (a leaf) that holds the prediction.

A good visualization here:
<img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1545934190/1_r5ikdb.png">
Source: https://www.datacamp.com/community/tutorials/decision-tree-classification-python
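
To make the node/leaf idea concrete, here is a minimal sketch that fits a tiny tree and prints its structure as text. The six houses, their prices, and the names `toy_X`, `toy_y`, and `toy_tree` are made up for illustration and are not part of the Iowa dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: [lot_area, year_built] for six made-up houses
toy_X = np.array([
    [5000, 1950],
    [6000, 1960],
    [7000, 1995],
    [8000, 2000],
    [12000, 2005],
    [15000, 2010],
])
toy_y = np.array([100000, 110000, 180000, 190000, 260000, 280000])

# max_depth=2 keeps the tree small enough to read
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_tree.fit(toy_X, toy_y)

# Lines containing "<=" or ">" are decision nodes; lines containing "value:" are leaves
tree_text = export_text(toy_tree, feature_names=["lot_area", "year_built"])
print(tree_text)
print("Number of leaves:", toy_tree.get_n_leaves())
```

Each indented level in the printout is one step further down a branch, and every branch ends in a leaf holding a predicted price.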

However, at some point, the leaves can hold so few houses that the predictions match the training values almost perfectly. This is called "overfitting": the model fits the quirks of the training data and then predicts poorly on new data.

On the other side, if we split the data into just a few leaves, the model fails to capture important distinctions or patterns in the data. It will perform poorly even on the training data, and this is called "underfitting".
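
As a rough sketch of both effects, the snippet below grows trees with very few, a moderate number, and very many leaves on synthetic data, then compares the error on the training rows with the error on held-out rows. The data (price as a noisy function of size) and every `demo_` name are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): price depends on size plus noise
rng = np.random.default_rng(0)
size = rng.uniform(500, 4000, 400).reshape(-1, 1)
price = 100 * size.ravel() + rng.normal(0, 20000, 400)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    size, price, random_state=0
)

demo_results = {}
for leaves in [2, 20, 300]:
    demo_model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    demo_model.fit(demo_train_X, demo_train_y)
    train_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))
    val_mae = mean_absolute_error(demo_val_y, demo_model.predict(demo_val_X))
    demo_results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:>3}  train MAE={train_mae:>8.0f}  validation MAE={val_mae:>8.0f}")
```

With only 2 leaves the tree should do badly on both sets (underfitting). With 300 leaves it can memorize the training rows, so the training error collapses while the validation error does not follow it down (overfitting).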

<img src="https://i.imgur.com/2q85n9s.png">

Knowing how to fine tune all this is where the art of machine learning comes in using your data and what you have at hand (called "features" or "feature engineering"). The science of evaluating your model comes in with using statistical analysis. There's always a tension here.

Image and Content Source: https://www.kaggle.com/dansbecker/underfitting-and-overfitting

In this example, we'll use a decision tree to predict house prices in Iowa (the dataset we'll be using). You can think of the tree splitting the housing data in the following way:

<img src="https://i.imgur.com/R3ywQsR.png">
Source: https://www.kaggle.com/learn/intro-to-machine-learning

In [9]:
# Path of the file to read of Iowa housing data set
iowa_file_path = '/Users/liamhettinger/Documents/Python/July/09_Numpy/train.csv'

In [10]:
# Importing our standard libraries
import pandas as pd
import numpy as np

# Loading our machine learning library from Scikit Learn
from sklearn.tree import DecisionTreeRegressor

# Loading mean absolute error
from sklearn.metrics import mean_absolute_error

# Loading test/train/split library
from sklearn.model_selection import train_test_split

# Setting our columns to max so we can see everything 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# read & store our data in a DataFrame called home_data
home_data = pd.read_csv(iowa_file_path)
home_data.head(5)
#home_data.describe()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### Exploring our dataset (EDA)

In [11]:
# Let's explore what our columns include (ideally, we'd have a data dictionary here)

home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [12]:
# Let's take a quick peek to see if we have any missing values in our columns
# This is critical for picking our features to make sure we don't include holes

home_data.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFin
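
One possible way to turn those missing-value counts into a shortlist of "safe" columns is sketched below. The three-row frame `toy_homes` is a hypothetical stand-in for `home_data`, with a few values borrowed from the rows shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for home_data
toy_homes = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

# Count missing values per column, then keep only the columns with none
missing_counts = toy_homes.isnull().sum()
complete_columns = missing_counts[missing_counts == 0].index.tolist()
print(complete_columns)  # ['LotArea', 'SalePrice']
```

The same two lines should work on `home_data` itself to list every column that has no holes.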

### Selecting our Prediction Target
We are interested in predicting a sales price - Column name 'SalePrice'

In [13]:
# Let's set our prediction variable or target variable "y" for SalePrice
# Note this is in lower case y - standard practice for ML

y = home_data.SalePrice

### Selecting our Features
These are the parts of our dataset we feel could be useful for predicting price. We'll go with some standard guesses here from how real estate is valued

In [16]:
# Picking standard features for our "X" variable

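# Let's start with a custom list of columns that could help us predict sale price based on what's in our dataset
# Note: 'YearBuilt' is listed twice, so X ends up with a duplicate column (see YearBuilt.1 in the summary below)
home_data_features = ['LotArea', 'YearBuilt','OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd']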

# Passing these features into our X variable (notice the capital X)
X = home_data[home_data_features]

# Some quick summary statistics
X.describe()

Unnamed: 0,LotArea,YearBuilt,OverallQual,OverallCond,YearBuilt.1,YearRemodAdd
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,10516.828082,1971.267808,6.099315,5.575342,1971.267808,1984.865753
std,9981.264932,30.202904,1.382997,1.112799,30.202904,20.645407
min,1300.0,1872.0,1.0,1.0,1872.0,1950.0
25%,7553.5,1954.0,5.0,5.0,1954.0,1967.0
50%,9478.5,1973.0,6.0,5.0,1973.0,1994.0
75%,11601.5,2000.0,7.0,6.0,2000.0,2004.0
max,215245.0,2010.0,10.0,9.0,2010.0,2010.0


### Picking a Model and Fitting It
At this point, we'll use a decision tree regressor from the Scikit Learn machine learning library we loaded above. Next, we'll fit our data to this model using the variables we defined for features (X) and our prediction target (y).

In [17]:
# Calling in our preferred method of a Decision Tree
HomePrice = DecisionTreeRegressor(random_state = 0) # =0 ensures same results in each run

# Passing in our feature and predicted variables as we defined above
HomePrice.fit(X,y)

DecisionTreeRegressor(random_state=0)

In [19]:
# Let's make some predictions using our features defined as "X"
# Returns a Numpy array

HomePrice.predict(X)

array([196750., 181500., 223500., ..., 266500., 142125., 147500.])

### Model Validation
How accurate is our model for predicting future prices? There are a ton of ways to assess this, but in this quick example we'll use Mean Absolute Error (MAE). Our prediction error is basically: error = actual - prediction. For example, if a house costs 500,000 and we predicted 400,000, our error is 100,000. With MAE, we take the absolute value of each error, which converts it to a positive value, and then average those absolute errors.
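
Here is the same calculation done by hand on three made-up houses (the prices are illustrative, not from the dataset), using plain Python so each step is visible:

```python
# Made-up actual and predicted prices for three houses
actual = [500_000, 300_000, 250_000]
predicted = [400_000, 320_000, 250_000]

# error = actual - prediction for each house
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value so over- and under-predictions don't cancel out
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)
print(errors)  # [100000, -20000, 0]
print(mae)     # 40000.0
```

`mean_absolute_error` from Scikit Learn performs this same averaging of absolute errors.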

In [20]:
# Setting a variable to hold our predicted values
predicted_home_prices = HomePrice.predict(X)

# Using sklearn method on our target and predicted values
print(mean_absolute_error(y, predicted_home_prices))

#Print your prediction:


307.43105022831054


#### WARNING! 
The above example has a critical flaw: we used the same data to build the model and to evaluate it. This gives us false confidence in the accuracy of our predictions because the model has seen this data before. When we apply it to data it has never seen, we can easily get highly inaccurate predictions.

The power of ML is to train a model on a dataset, tune it, and then let it make predictions on new datasets.

To do this, we need to split out some of our data as "validation data".
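
As a small sketch of what such a split looks like, the snippet below splits stand-in arrays with the same number of rows as our housing data (1460). The `fake_` and `f_` names are placeholders for illustration. When no `test_size` is given, `train_test_split` holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the housing data (1460)
fake_X = np.arange(1460).reshape(-1, 1)
fake_y = np.arange(1460)

# With no test_size given, 25% of the rows are held out for validation
f_train_X, f_val_X, f_train_y, f_val_y = train_test_split(
    fake_X, fake_y, random_state=0
)
print(len(f_train_X), len(f_val_X))  # 1095 365

# No row lands in both sets
print(set(f_train_y) & set(f_val_y))  # set()
```

The model never sees the 365 held-out rows during fitting, which is what makes them a fair check.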

## Splitting Data for Better Accuracy

In [22]:
# First, we've split our data into training and validation data - for BOTH features and target (X, y)
# Note: the split is based on a random number generator. Again, we pass in a numeric value to
# the random_state argument which guarantees we get the same split every time we
# run this script - key for consistency

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
Mymodel = DecisionTreeRegressor(random_state=0)

# Fit model
Mymodel.fit(train_X, train_y)

# get predicted prices on validation data
Mypredictions = Mymodel.predict(val_X)
print(mean_absolute_error(val_y, Mypredictions))

33638.55342465753


#### What This Means
So, with our in-sample data, you likely had an MAE in the low hundreds of dollars (about 307 above). Out of sample, it jumped to a much bigger number (about 33,639 above). Hint: it should be a huuuge difference!

Note: our average home value in our dataset is 180921. 
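
To put the two errors in perspective, this quick sketch expresses each MAE as a share of that average home value. The numbers are copied from the outputs above; the variable names are just placeholders.

```python
mean_price = 180921                    # average home value quoted above
in_sample_mae = 307.43105022831054     # MAE when scoring on the training data
out_of_sample_mae = 33638.55342465753  # MAE when scoring on the validation data

in_sample_share = in_sample_mae / mean_price
out_of_sample_share = out_of_sample_mae / mean_price

print(f"In-sample error:     {in_sample_share:.1%}")      # 0.2%
print(f"Out-of-sample error: {out_of_sample_share:.1%}")  # 18.6%
```

An error of roughly 19% of a typical sale price is a far more honest picture than the 0.2% the in-sample score suggested.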

## Tuning our Model for Over/Under Fitting
Here, we'll reuse our training and validation split and compare MAE scores to see how many leaves we should allow to avoid over- or underfitting our model. Note: source for the code and inspiration here: https://www.kaggle.com/dansbecker/underfitting-and-overfitting

<img src="https://i.imgur.com/2q85n9s.png">


Image and Content Source: https://www.kaggle.com/dansbecker/underfitting-and-overfitting


In [23]:
# Let's write a custom function for MAE (source: Kaggle.com) 

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y): # passing in max_leaf_nodes plus our four data splits
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0) # Capping how many leaves the tree may grow
    model.fit(train_X, train_y) # Same as above for fitting to our training data
    preds_val = model.predict(val_X) # Predict on our validation data
    mae = mean_absolute_error(val_y, preds_val) # Comparing actual validation prices to our predictions
    return mae


In [24]:
# Setting up various leaves to test for under/over fitting

for max_leaf_nodes in [5, 50, 125, 250, 500, 5000]: # Iterating through a list of various leaf counts
    iowa_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y) #passing our work from above for each leaf
    print("Max leaf nodes: %d \t\t MAE: %d" %(max_leaf_nodes, iowa_mae))

Max leaf nodes: 5 		 MAE: 32224
Max leaf nodes: 50 		 MAE: 28104
Max leaf nodes: 125 		 MAE: 29099
Max leaf nodes: 250 		 MAE: 31797
Max leaf nodes: 500 		 MAE: 33824
Max leaf nodes: 5000 		 MAE: 34051


#### What This Means
Here we see that our MAE drops to 28,104 at 50 leaves, which gives us better accuracy than our original "out of sample" calculation of 33,639. However, past that point it heads the other way: by 250 leaves the MAE is back up to 31,797 as the tree starts to overfit.
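
Rather than eyeballing the printout, one simple option is to store the scores in a dictionary and let Python pick the leaf count with the lowest MAE. The values below are copied from the loop output above; `mae_by_leaves` and `best_leaf_count` are names chosen for this sketch.

```python
# MAE values copied from the loop output above
mae_by_leaves = {5: 32224, 50: 28104, 125: 29099, 250: 31797, 500: 33824, 5000: 34051}

# Pick the leaf count (the dictionary key) with the lowest validation MAE
best_leaf_count = min(mae_by_leaves, key=mae_by_leaves.get)
print(best_leaf_count)                 # 50
print(mae_by_leaves[best_leaf_count])  # 28104
```

In the notebook itself, the dictionary could be built by calling `get_mae` inside a dict comprehension instead of typing the numbers in.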

## Bonus - Random Forest
As we saw with a decision tree, there are some obvious tradeoffs. Lots of leaves can have you overfitting the model because each prediction comes from just a few houses at the leaf. But with just a few leaves, it predicts poorly because it's not accounting for the unique characteristics in the data.

To get around this, we'll use the random forest method, which builds many trees, makes a prediction with each one, and then averages everything together. Source: Wikipedia
<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png">
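
The averaging step can be checked directly. In this sketch, a small forest is fitted on synthetic data (made up for illustration, as are the `rf_` names), each tree is asked for its own predictions, and the hand-computed average is compared with what the forest returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
rf_X = rng.uniform(0, 10, size=(200, 3))
rf_y = rf_X[:, 0] * 3 + rf_X[:, 1] ** 2 + rng.normal(0, 1, 200)

small_forest = RandomForestRegressor(n_estimators=10, random_state=0)
small_forest.fit(rf_X, rf_y)

# Ask each individual tree for its predictions on the first five rows
per_tree_preds = np.array(
    [tree.predict(rf_X[:5]) for tree in small_forest.estimators_]
)

# Average across the trees by hand
manual_average = per_tree_preds.mean(axis=0)

# The forest's own prediction should match the hand-computed average
matches = np.allclose(manual_average, small_forest.predict(rf_X[:5]))
print(matches)
```

Because each tree is trained on a different random sample of the rows, their individual mistakes tend to differ, and averaging smooths them out.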

#### Our Dataset Splits
Quick refresher on our variables that we split up above
* train_X - our training data comprised of the features we picked from our housing dataset: 'LotArea', 'YearBuilt', 'OverallQual', 'OverallCond', 'YearRemodAdd'

* val_X - our validation dataset that we use to validate the accuracy of our predictions based on the feature set

* train_y - our prediction target (sales price) for the training rows, which we train our model against

* val_y - the actual sales prices we validate our predictions against


In [25]:
# Importing the Random Forest model from Scikit Learn
from sklearn.ensemble import RandomForestRegressor

# Setting a variable to hold our model and setting random state to the same number for consistency
forest_model = RandomForestRegressor(random_state=0)

# Fitting our model on the training data as defined above
forest_model.fit(train_X, train_y)

# Predicting on our validation dataset
iowa_preds = forest_model.predict(val_X)
print("Our MAE is now:")
print("$",mean_absolute_error(val_y, iowa_preds))

Our MAE is now:
$ 26480.24085283757


## Reflection
In a few sentences what did you learn with this notebook and the dataset? How did your feature selection impact your MAE? What additional datasets do you wish you had to better predict?

Finally, what was the hardest part of this assignment and what did you find most interesting?

In [None]:
#This assignment was really interesting to look through how the decision tree works.
#Would be helpful to see the decision tree visually.
#The hardest part was understanding the overall picture.
#The part I found most interesting is thinking of other applications that could be used.