# Basic Machine Learning Example
Let's try to predict housing prices in Iowa using the training and test datasets provided by Kaggle.

Source: kaggle.com/learn/intro-to-machine-learning.

This falls under what we call "supervised learning", where we help the algorithm learn from relationships in previous data to predict future values. Regression and classification are examples of supervised learning techniques.

### Our Workflow
1. Explore our data (EDA) using Pandas methods
2. Pick a model
3. Split the data (test, train, validate)
4. Train our model & predict our values
5. Validate our predictions

### Decision Trees
There are a variety of models to use with machine learning. First up, we'll use a tool called a decision tree, which has "nodes" and "leaves". Basically, it splits the data at key points (nodes) based on an attribute, then keeps splitting further and further down each branch. The groups at the bottom of the tree, where the predictions are made, are the leaves.

<img src="https://i.imgur.com/R3ywQsR.png">
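To make the node and leaf vocabulary concrete, here is a minimal sketch that fits a deliberately tiny tree and prints its structure. The data is made up for illustration (the `size` and `age` columns are hypothetical, not columns from the Iowa dataset).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (made up for illustration): price depends on size and age
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 200)
age = rng.uniform(0, 100, 200)
price = 100 * size - 500 * age + rng.normal(0, 10_000, 200)
X_toy = np.column_stack([size, age])

# A deliberately small tree so the printed structure stays readable
toy_tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
toy_tree.fit(X_toy, price)

# Lines with a condition (e.g. "size <= ...") are splits at a node;
# lines starting with "value" are leaves holding the predicted price
print(export_text(toy_tree, feature_names=["size", "age"]))
print("Leaves:", toy_tree.get_n_leaves(), "Depth:", toy_tree.get_depth())
```

Every house that ends up in the same leaf gets the same predicted price, which is the average price of the training houses in that leaf.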

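However, if we keep splitting, each leaf can end up holding so few houses that the predictions match the actual values in the training data almost perfectly. This is called "overfitting": the model fits the training data so closely that its predictions for new data become unreliable.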

On the other side, if we split into just a few leaves, the model fails to capture important distinctions or patterns in the data. It performs poorly even on the training data, which is called "underfitting".

<img src="https://i.imgur.com/2q85n9s.png">
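The tradeoff can be seen by comparing error on the training data with error on held-back data. This is a rough sketch on synthetic data (the `X_syn` and `y_syn` arrays are invented for illustration), not the Iowa dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a wavy signal plus noise (made up for illustration)
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) * 50 + rng.normal(0, 10, 400)

tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=0)

results = {}
for leaves in [2, 20, None]:  # None means no limit on the number of leaves
    m = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, m.predict(tr_X))
    val_mae = mean_absolute_error(va_y, m.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"max_leaf_nodes={leaves}: train MAE {train_mae:.1f}, validation MAE {val_mae:.1f}")
```

A very small tree has a high error on both sets (underfitting). The unlimited tree memorises the training rows, so its training error drops to zero while its validation error does not follow (overfitting).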

Knowing how to fine-tune all this is where the art of machine learning comes in: working with your data and choosing what to feed the model (the "features"; shaping them is called "feature engineering"). The science comes in when you evaluate your model using statistical analysis.

We'll use a decision tree to try predicting house prices in Iowa (the dataset we'll be using). You can think of this in the following way

Image and Content Source: https://www.kaggle.com/dansbecker/underfitting-and-overfitting

In [1]:
# Importing our standard libraries
import pandas as pd
import numpy as np

# Loading our machine learning library
from sklearn.tree import DecisionTreeRegressor

# Loading mean absolute error
from sklearn.metrics import mean_absolute_error

# Loading test/train/split library
from sklearn.model_selection import train_test_split

# Setting our columns to max so we can see everything 
pd.set_option('display.max_columns', None)

# Path of the file to read of Iowa housing data set
iowa_file_path = '/Users/jacobcook/DataScience/MSBA-Python/09_Numpy/train.csv'

# read & store our data in a DataFrame called home_data
home_data = pd.read_csv(iowa_file_path)

home_data.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


### Exploring our dataset (EDA)

In [2]:
# Let's explore what our columns include (ideally, we'd have a data dictionary here)
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [3]:
# Let's take a quick peek to see if we have any missing values in our columns
# This is critical for picking our features

home_data.columns[home_data.isnull().any()]


Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')
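Knowing which columns have gaps is a start; counting the gaps per column helps decide which features are safe to use. A small sketch, using a tiny made-up frame (`demo`) as a stand-in for `home_data`:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for home_data
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Pave"],
})

# Count missing values per column, keep only columns with gaps, worst first
missing_counts = demo.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

The same two lines should work on `home_data` directly. Note that none of the seven features chosen below appear in the list of columns with missing values, which is why no imputation is needed in this notebook.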

### Selecting our Prediction Target
We are interested in predicting the sale price, which is the column named 'SalePrice'.

In [4]:
# Let's set our prediction variable "y" for SalePrice

y = home_data.SalePrice

### Selecting our Features
These are the columns in our dataset we feel could be useful for predicting price. We'll go with some standard guesses here based on how real estate is valued.

In [5]:
# Picking standard features for our "X" variable

# Let's start with a custom list of columns that could help us predict sale price based on what's in our dataset
home_data_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Passing this into our X variable for features
X = home_data[home_data_features]

# Some quick summary statistics
X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


### Picking a Model and Fitting It
At this point, we'll use a decision tree regressor from the scikit-learn machine learning library we loaded above. Next, we'll fit our data to this model using the variables we defined for features (X) and our prediction target (y).

In [6]:
iowa_model = DecisionTreeRegressor(random_state = 1) # =1 ensures same results in each run

iowa_model.fit(X,y)

DecisionTreeRegressor(random_state=1)

In [7]:
# Let's make some predictions

iowa_model.predict(X)

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])

In [8]:
print('Predicting sales for following 5 houses')
print(X.head(5))
print("The predictions are")
print(iowa_model.predict(X.head(5)))

Predicting sales for following 5 houses
   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  
The predictions are
[208500. 181500. 223500. 140000. 250000.]


In [9]:
home_data.head(5)

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### Model Validation
How accurate is our model for predicting future prices? There are a ton of ways to assess this, but in this quick example we'll use Mean Absolute Error (MAE). Our prediction error is basically: error = actual - prediction. For example, if a house costs 500,000 and we predicted 400,000, our error is 100,000. With MAE we take the absolute value of each error, which converts all errors to positive values, and then average them.
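As a quick worked example of the formula, the sketch below computes MAE by hand for three made-up houses and checks it against scikit-learn's `mean_absolute_error`. The prices are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices for three houses
actual = np.array([500_000, 200_000, 300_000])
predicted = np.array([400_000, 250_000, 300_000])

errors = actual - predicted             # [100000, -50000, 0]
manual_mae = np.mean(np.abs(errors))    # (100000 + 50000 + 0) / 3
print(manual_mae)                       # 50000.0

# scikit-learn gives the same number
print(mean_absolute_error(actual, predicted))
```

Without the absolute value, the +100,000 and -50,000 errors would partly cancel and make the model look better than it is.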

In [10]:
# Setting a variable to hold our predicted values
predicted_home_prices = iowa_model.predict(X)

# Using sklearn method on our target and predicted values
mean_absolute_error(y, predicted_home_prices)

62.35433789954339

#### WARNING! 
The above example has a critical flaw: we used the same data both to build the model and to evaluate it. This means our predictions look very accurate because the model has seen this data before. But when we apply it to data it has never seen, we can easily get highly inaccurate predictions.

The whole point with ML is to train a model, tune it, and then let it make predictions on new datasets.

To do this, we need to hold back some of our data as "validation data".

## Splitting Data for Better Accuracy

In [11]:
# First, we split our data into training and validation data - for BOTH features and target (X, y)
# Note: the split is based on a random number generator. Passing a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script - key for consistency

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model (no random_state is set here, so the exact MAE can vary slightly between runs)
iowa_model = DecisionTreeRegressor()

# Fit model
iowa_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

32360.03287671233
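The call to `train_test_split` above does not pass a test size, so it falls back on scikit-learn's default of holding out 25% of the rows. The sketch below uses placeholder arrays with the same row count as our dataset (1460) to show the resulting sizes and that a fixed `random_state` reproduces the same split; `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as home_data (1460)
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

a_train, a_val, b_train, b_val = train_test_split(X_demo, y_demo, random_state=0)
print(a_train.shape, a_val.shape)  # (1095, 2) (365, 2)

# The same random_state produces the same split again
again_train, again_val, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(a_val, again_val))  # True
```

So the model above was trained on 1,095 houses and evaluated on the remaining 365.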


#### What This Means
So, when we used our in-sample data, we had an MAE of 62.35 dollars. Out of sample, it went to 32,360 dollars. A HUuuuge difference!

Note: the average home value in our dataset is 180,921 dollars, so the error on new data is roughly 18%. However, we can fine-tune this to get better results in a variety of ways.
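The percentage is just the validation MAE divided by the mean sale price. A small sketch of that arithmetic, with both numbers copied from the outputs shown earlier in this notebook:

```python
mean_price = 180921.19589     # mean SalePrice from home_data.describe()
val_mae = 32360.03287671233   # validation MAE printed by the cell above

# Error relative to the typical house price
relative_error = val_mae / mean_price
print(f"{relative_error:.1%}")  # about 17.9%
```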

## Tuning our model
Here, we'll reuse our training and validation split and compare MAE scores to see how many leaves we should allow to avoid over- or underfitting our model. Note: source for the code and inspiration here: https://www.kaggle.com/dansbecker/underfitting-and-overfitting

In [14]:
# Let's write a custom function for MAE (source: Kaggle.com) 

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


In [19]:
for max_leaf_nodes in [5, 50, 250, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t MAE: %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 MAE: 35190
Max leaf nodes: 50 		 MAE: 27825
Max leaf nodes: 250 		 MAE: 31738
Max leaf nodes: 500 		 MAE: 32662
Max leaf nodes: 5000 		 MAE: 33382


#### What This Means
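Here we see that our MAE drops to 27,825 at 50 leaves, which gives us better accuracy than the out-of-sample result of 32,360 from the unrestricted tree. However, by 250 leaves it heads the other way as the model starts to overfit. At the other end, the MAE of 35,190 with only 5 leaves shows underfitting.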
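Rather than reading the best value off the printout, the comparison can be stored and the winner picked programmatically. This is a sketch on synthetic data so that it runs on its own: `get_mae` is repeated from above, while `X_syn`, `y_syn` and the `s_` prefixed splits are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Same helper as in the notebook: fit a size-limited tree, score it on validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic stand-in data (not the Iowa dataset)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 2))
y_syn = 20 * X_syn[:, 0] + 5 * X_syn[:, 1] ** 2 + rng.normal(0, 15, 500)
s_train_X, s_val_X, s_train_y, s_val_y = train_test_split(X_syn, y_syn, random_state=0)

# Score every candidate size, then pick the one with the lowest validation MAE
candidates = [5, 50, 250, 500, 5000]
scores = {n: get_mae(n, s_train_X, s_val_X, s_train_y, s_val_y) for n in candidates}
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_size)
```

With the notebook's own `train_X`, `val_X`, `train_y` and `val_y`, the same dictionary approach should select 50, matching the printout above.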

## Bonus - Random Forest
As we saw with a decision tree, there are some obvious tradeoffs. Lots of leaves can have you overfitting the model because each prediction comes from just a few houses at the leaf. But with just a few leaves, it predicts poorly because it's not accounting for the unique characteristics in the data.

To get around this, we'll use the random forest method, which builds many trees, gets a prediction from each, and then averages everything together. Source: Wikipedia
<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png">
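The averaging can be checked directly: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and the mean of their predictions should match the forest's own prediction. A sketch on synthetic data (the arrays and the choice of 25 trees are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the Iowa dataset)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = X_syn @ np.array([10.0, -5.0, 2.0]) + rng.normal(0, 3, 300)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_syn, y_syn)

# Ask every tree for its own prediction, then average across trees
new_rows = X_syn[:5]
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction is the average of its trees' predictions
print(np.allclose(manual_average, forest.predict(new_rows)))
```

Each tree is trained on a different random sample of the rows, so individual trees overfit in different ways and averaging them smooths those errors out.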

#### Our Dataset Splits
Quick refresher on our variables that we split up above
* train_X - our training data comprised of features from our housing dataset: 'LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd'

* val_X - our validation dataset that we use to validate the accuracy of our predictions based on the feature set

* train_y - our prediction target (sale price) for the training rows, which we train our model against

* val_y - the actual sale prices we validate our predictions against


In [22]:
# Importing the Random Forest model from Scikit Learn
from sklearn.ensemble import RandomForestRegressor

# Setting a variable to hold our model and setting random state to a fixed number for consistency
forest_model = RandomForestRegressor(random_state=0)

# Fitting our model on the training data as defined above
forest_model.fit(train_X, train_y)

# Predicting on our validation dataset
iowa_preds = forest_model.predict(val_X)
print("Our MAE is now %d" %mean_absolute_error(val_y, iowa_preds))

23093.063676581867


## Conclusion
Here's what the notebook tells us:
1. Using the Pandas.describe method we see a standard deviation of \\$79K for sales price
2. Using a Decision Tree and testing "out of sample" we get this down to \\$32,360 for Mean Absolute Error
3. Next, we fiddled with the number of leaves, using the same training/validation split, to get our MAE down to \\$27,825
4. Finally, we used Random Forest to take an average of a bunch of trees and got this down to \\$23,093

Note: Standard deviation and MAE are related but measure different things. The standard deviation measures how far sale prices spread around their mean, using squared differences. MAE measures how far our predictions land from the actual prices, using absolute differences. The standard deviation is a useful baseline because it is essentially the root-mean-square error we would get by always predicting the mean price. For the related comparison between mean absolute deviation and standard deviation, see: https://stats.stackexchange.com/questions/81986/mean-absolute-deviation-vs-standard-deviation
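To see how the two measures connect, consider a "model" that always predicts the mean price. The sketch below uses five made-up prices; note that `np.std` computes the population standard deviation by default, whereas Pandas `describe` reports the sample version, so the two differ slightly on real data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Five made-up sale prices
prices = np.array([100_000, 150_000, 200_000, 250_000, 300_000], dtype=float)
baseline = np.full_like(prices, prices.mean())  # always predict the mean (200,000)

# Population standard deviation equals the root-mean-square error of the mean baseline
std = prices.std()
rmse_baseline = np.sqrt(np.mean((prices - baseline) ** 2))

# Mean absolute deviation equals the MAE of the mean baseline
mae_baseline = mean_absolute_error(prices, baseline)

print(std, rmse_baseline, mae_baseline)
```

Here the baseline's MAE is 60,000 while the standard deviation is about 70,711, so comparing a model's MAE against the standard deviation is a rough yardstick rather than a like-for-like comparison.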