# Kaggle Learn - Machine Learning Micro Course

## Basic Data Exploration

The first step in ML is familiarizing ourselves with the data. 

In [1]:
import pandas as pd
# DecisionTreeRegressor Model
from sklearn.tree import DecisionTreeRegressor
# MAE
from sklearn.metrics import mean_absolute_error
# Splitting up data - train_X, test_X, train_y, test_y
from sklearn.model_selection import train_test_split
# RandomForestRegressor Model
from sklearn.ensemble import RandomForestRegressor

The most important part of the Pandas library is the DataFrame. 
A DataFrame holds the type of data we might think of as a table.

We'll look at data about home prices in Melbourne, Australia.

In [2]:
# save filepath to variable for easy access
melbourne_file_path = '.../files/melb_data.csv'

# read the data and store in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

# print a summary of the data in Melbourne data
melbourne_data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


## Interpreting Data Description

The results show 8 numbers for each numeric column of our original dataset.

<ul>
    <li> The first number, the <b> count </b>, shows how many rows have non-missing values. Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. </li>
    <li> The second value is the <b> mean </b>, which is the average. </li>
    <li> <b> std </b> is the standard deviation, which measures how numerically spread out the values are. </li>
    <li> To interpret the <b> min, 25%, 50%, 75% and max </b> values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the <b> 25% </b> value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the <b> max </b> is the largest number. </li>
</ul>
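
As a small illustration of these summary statistics, the sketch below calls describe() on a made-up Series (the values are hypothetical, not taken from the Melbourne data) that contains one missing entry. It is only meant to show that the count ignores the missing value and that the 50% row is the median.

```python
import pandas as pd

# Hypothetical toy column: four recorded values and one missing value
toy_column = pd.Series([1.0, 2.0, 3.0, 4.0, None])

summary = toy_column.describe()

# count only includes the non-missing values
print(summary["count"])  # 4.0
# mean of the four recorded values
print(summary["mean"])   # 2.5
# smallest and largest recorded values
print(summary["min"], summary["max"])  # 1.0 4.0
# the 50th percentile is the median
print(summary["50%"])    # 2.5
```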

In [3]:
melbourne_data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## Selecting Data for Modeling

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the <b> columns </b> property of the DataFrame.

In [4]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [5]:
# The Melbourne data has some missing values (some houses 
# for which variables weren't recorded.)

# isnull() checks whether each value is missing - returns a boolean: False if the value is non-null and True if the value is null
melbourne_data.isnull()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [6]:
# sum() adds up the missing values in each column
# isnull().sum() basically gives a per-column count of missing values
melbourne_data.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [7]:
# The Melbourne data has some missing values 
# (some houses for which some variables weren't recorded.)

# dropna(axis=0) drops every row that contains at least one
# missing value (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

In [8]:
melbourne_data.isnull().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
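
To make the effect of these calls easier to see, here is a minimal sketch on a hypothetical three-row table (the column names mirror the Melbourne data, but the values are made up). It counts the missing values per column and then keeps only the rows that are complete.

```python
import pandas as pd

# Hypothetical toy table with a few missing entries (None becomes NaN)
toy_homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 142.0],
    "YearBuilt": [1900.0, None, None],
})

# Count the missing values in each column
missing_per_column = toy_homes.isnull().sum()
print(missing_per_column)

# Keep only the rows in which every column has a value
complete_rows = toy_homes.dropna(axis=0)
print(complete_rows)
```

Only the first row survives here, because the other two each lack at least one value. Dropping rows is simple but costly: in the Melbourne data it reduces the row count from 13,580 to 6,196, as the describe() outputs before and after show.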

## Selecting The Prediction Target

We can pull out a variable with <b> dot-notation </b>. This single column is stored in a <b> Series </b>, which is broadly like a DataFrame with only a single column of data.

We'll use the dot notation to select the column we want to predict, which is called the <b> prediction target </b>. 

By convention, the prediction target is called <b> y </b>.
So the code we need to save the house prices in Melbourne data is:

In [9]:
y = melbourne_data.Price
y.describe()

count    6.196000e+03
mean     1.068828e+06
std      6.751564e+05
min      1.310000e+05
25%      6.200000e+05
50%      8.800000e+05
75%      1.325000e+06
max      9.000000e+06
Name: Price, dtype: float64

In [10]:
y.head()

1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64
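
For illustration, the sketch below uses a hypothetical two-column table to show that dot-notation and bracket-notation select the same column, and that the result is a Series. Bracket-notation is the safer choice when a column name contains spaces or clashes with a DataFrame attribute name.

```python
import pandas as pd

# Hypothetical toy table, not the Melbourne data
toy_homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

price_by_dot = toy_homes.Price          # dot-notation
price_by_bracket = toy_homes["Price"]   # bracket-notation

print(type(price_by_dot).__name__)            # Series
print(price_by_dot.equals(price_by_bracket))  # True
```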

## Choosing "Features"

The columns that are inputted into our model (and later used to make predictions) are called "features". 
In our case, those would be the columns used to determine the home price. 

Sometimes, we will use all columns except the target as features. Other times, we'll be better off with fewer features.

In [11]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 
                      'Lattitude', 'Longtitude']

By convention, this data is called <b> X </b>.

In [12]:
X = melbourne_data[melbourne_features]

# Review the data we'll be using to predict house prices using
# the describe() method and the head() method
X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [13]:
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


## Building the model

<ul>
    <li> <b> Define: </b> What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too. </li>
    <li> <b> Fit: </b> Capture patterns from provided data. This is the heart of modeling. </li>
    <li> <b> Predict: </b> Just what it sounds like </li>
    <li> <b> Evaluate: </b> Determine how accurate the model's predictions are. </li>
</ul>    
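
The four steps can be sketched end to end on a tiny, hypothetical dataset (the toy_ names and values below are made up for illustration). This is a minimal sketch of the workflow, not the Melbourne model itself.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature column and one target per row
toy_X = [[1], [2], [3], [4]]
toy_y = [10.0, 20.0, 30.0, 40.0]

# Define: choose the model type and its parameters
toy_model = DecisionTreeRegressor(random_state=1)

# Fit: capture patterns from the provided data
toy_model.fit(toy_X, toy_y)

# Predict: ask the fitted model for target values
toy_predictions = toy_model.predict(toy_X)

# Evaluate: compare the predictions with the known targets
toy_mae = mean_absolute_error(toy_y, toy_predictions)
print(toy_predictions)
print(toy_mae)
```

Because the tree is unrestricted and every row has a distinct feature value, it reproduces the training targets exactly and the error is zero. That is a property of scoring on the rows used for fitting, not evidence of a good model.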

In [14]:
# Define model.
# Specify a number for random_state to ensure the same results in each run.
melbourne_model = DecisionTreeRegressor(random_state = 1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

Many machine learning models allow some randomness in model training.
Specifying a number for random_state ensures we get the same results in each run. This is considered good practice. We can use any number, and model quality won't depend meaningfully on exactly what value we choose.

We want to make predictions for new houses coming on the market rather than the houses we already have prices for.
But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [15]:
print("Making predictions for the following 5 houses: ")
print(X.head())
print("The predictions are ")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses: 
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are 
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Model Validation

In most applications, the relevant measure of model quality is predictive accuracy. 

People usually make a huge mistake when measuring predictive accuracy. They make predictions with their <em> training data </em> and compare those predictions to the target values in the <em> training data </em>. 

We first need to summarize the model quality in an understandable way. If we compare predicted and actual home values for 10,000 houses, we'll likely find a mix of good and bad predictions. Moreover, looking through a list of 10,000 predicted and actual values would be pointless. 
Therefore, we need to summarize it into a single metric. 

There are many metrics for summarizing model quality, but we'll start with <b> Mean Absolute Error (MAE) </b>. 

<code> error = actual - predicted </code> 

So if a house costs 150,000 and we predict it would cost 100,000, the error is 50,000. 

<ul>
    <li> With the MAE metric, we take the absolute value of each error. This converts each error to a positive number.</li>
    <li> We then take the average of those absolute errors. This is our measure of model quality. </li>
</ul>
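
The two steps above can be worked through by hand. The sketch below uses three hypothetical houses (the first matches the 150,000 versus 100,000 example) and plain Python rather than scikit-learn, so each step of the arithmetic is visible.

```python
# Hypothetical actual and predicted prices for three houses
actual = [150000, 200000, 120000]
predicted = [100000, 210000, 120000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value so every error is non-negative
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 0]
print(mae)     # 20000.0
```

Without the absolute value, the positive and negative errors would partly cancel and understate how far off the predictions are. The resulting MAE reads directly as a statement about typical error size: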

"On average, our predictions are off by about X."

In [16]:
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

## The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it.

This is a poor way to measure model quality, and here is why.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data we used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data. 

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used. 

Since models' practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. 

The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. 
This data is called <b> validation data or test data </b>. 

The scikit-learn library has a function <code> train_test_split </code> to break up the data into two pieces. 
We'll use some of that data as training data to fit the model, and the other data as validation data to calculate <code> mean_absolute_error </code>.
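
As a quick sketch of how the split behaves, the example below runs train_test_split on hypothetical toy lists (the toy_ names and values are made up). It shows the default hold-out size, the effect of random_state, and that features and targets stay paired.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows of one feature and 8 matching targets
toy_X = [[i] for i in range(8)]
toy_y = [10 * i for i in range(8)]

# With no test_size given, scikit-learn holds out 25% of the rows
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, random_state=1)
print(len(toy_train_X), len(toy_test_X))  # 6 2

# Repeating the call with the same random_state gives the same split
_, toy_test_X_again, _, _ = train_test_split(toy_X, toy_y, random_state=1)
print(toy_test_X == toy_test_X_again)  # True

# Each held-out target still belongs to its held-out feature row
print(toy_test_y == [10 * row[0] for row in toy_test_X])  # True
```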

In [17]:
# split data into training and validation data, for both 
# features and target

# The split is based on a random number generator. 

# Supplying a numeric value to the random_state argument
# guarantees we get the same split every time we run the 
# script.

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 1)

# Define Model
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit Model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data (Test data)
test_predictions = melbourne_model.predict(test_X)
print(mean_absolute_error(test_y, test_predictions))


251688.7630729503


## The Difference

The mean absolute error for the in-sample data was about 1,116 dollars. 
Out-of-sample it is more than 250,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes.

## Underfitting and Overfitting

### Experimenting with Different Models

Now that we have a reliable way to measure model accuracy, we can experiment with alternative models and see which gives the best predictions. 

We can see in scikit-learn's <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html'> documentation </a> that the decision tree model has many options.

The most important options determine the tree's depth. A tree's depth is a measure of how many splits it makes before coming to a prediction. 

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree had only 1 split, it divides the data into 2 groups. If each group splits again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. 
If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves. 
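
The doubling argument can be checked with a few lines of arithmetic. The sketch below assumes every group splits in two at every level, which real trees need not do, and the training-set size of 4,647 houses is an assumption (75% of the 6,196 complete rows).

```python
# Number of leaves if every group splits in two at each level
leaves_at_depth = {depth: 2 ** depth for depth in range(1, 11)}

for depth in (1, 2, 3, 10):
    print(depth, leaves_at_depth[depth])
# 1 2
# 2 4
# 3 8
# 10 1024

# Assumed training-set size, used only to illustrate houses per leaf
assumed_training_rows = 4647
print(assumed_training_rows / leaves_at_depth[10])  # about 4.5
```

With only a handful of houses behind each leaf, every prediction rests on very little data.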

<hr> 

#### Overfitting
When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses). 

This phenomenon is called <b> overfitting </b>, where a model matches the training data almost perfectly, but does poorly in validation and other new data. 

#### Underfitting
On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups. 

As an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). 

When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called <b> underfitting </b>. 

<hr>

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes.
But the <em> max_leaf_nodes </em> argument provides a very sensible way to control overfitting vs underfitting. 
The more leaves we allow the model to make, the more we move from the underfitting area to the overfitting area. 

We use a utility function to help compare MAE scores from different values for max_leaf_nodes: 

In [18]:
def get_mae(max_leaf_nodes, train_X, test_X, train_y, test_y):
    # Fit a tree limited to max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(test_X)
    mae = mean_absolute_error(test_y, preds_val)
    return mae

# Compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 25, 50, 100, 250, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, test_X, train_y, test_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 Mean Absolute Error: 369673
Max leaf nodes: 25 		 Mean Absolute Error: 283377
Max leaf nodes: 50 		 Mean Absolute Error: 266644
Max leaf nodes: 100 		 Mean Absolute Error: 256533
Max leaf nodes: 250 		 Mean Absolute Error: 242217
Max leaf nodes: 500 		 Mean Absolute Error: 244780
Max leaf nodes: 5000 		 Mean Absolute Error: 258379


Of the options listed, 250 is the best number of leaves: it gives the lowest validation MAE (about 242,000). 
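
Rather than reading the best value off the printed list, the choice can be made in code. The sketch below copies the MAE values from the output above into a dictionary (the name mae_by_leaf_count is made up for this example) and picks the leaf count with the smallest error.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
mae_by_leaf_count = {
    5: 369673,
    25: 283377,
    50: 266644,
    100: 256533,
    250: 242217,
    500: 244780,
    5000: 258379,
}

# Pick the key whose MAE is smallest
best_leaf_count = min(mae_by_leaf_count, key=mae_by_leaf_count.get)
print(best_leaf_count)  # 250
```

The error falls as leaves are added up to 250 and rises again beyond that, which is the underfitting-to-overfitting transition described earlier.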

## Conclusion

Models can suffer from either: 

<ul>
    <li> <b> Overfitting: </b> capturing spurious patterns that won't recur in the future, leading to less accurate predictions </li>
    <li> <b> Underfitting: </b> failing to capture relevant patterns, again leading to less accurate predictions </li>
</ul>

We use <b> validation/test data </b>, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one. 

In [19]:
# Final model based on the number of nodes
final_model = DecisionTreeRegressor(random_state= 1, max_leaf_nodes = 250)

# Fit the final model
final_model.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=250, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [20]:
final_predictions = final_model.predict(test_X)
print(mean_absolute_error(test_y, final_predictions))

242217.1817972604


## Random Forest

Decision trees leave us with a difficult decision.
A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. 
A shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data. 

Many models have clever ideas that can lead to better performance. We'll look at the <b> random forest </b> as an example. 

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. 
If we keep modeling, we can learn more models with even better performance, but many of those are sensitive to getting the right parameters. 
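
The averaging can be verified directly. The sketch below fits a small forest on hypothetical synthetic data (the toy_ names, the data, and n_estimators=10 are all arbitrary choices for illustration) and checks that the mean of the individual trees' predictions matches the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one feature with a noisy linear target
rng = np.random.RandomState(1)
toy_X = rng.uniform(0, 10, size=(50, 1))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(0, 1, size=50)

# A small forest; n_estimators=10 is an arbitrary choice for this sketch
toy_forest = RandomForestRegressor(n_estimators=10, random_state=1)
toy_forest.fit(toy_X, toy_y)

# Predictions from each individual tree, one row per tree
tree_predictions = np.array(
    [tree.predict(toy_X) for tree in toy_forest.estimators_])

# Averaging across trees reproduces the forest's own prediction
averaged = tree_predictions.mean(axis=0)
print(np.allclose(averaged, toy_forest.predict(toy_X)))  # True
```

Each tree is trained on a different random resample of the rows, so their individual errors differ and tend to partly cancel when averaged.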

In [21]:
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_X, train_y)

melb_preds = forest_model.predict(test_X)

print(mean_absolute_error(test_y, melb_preds))

198635.2808478588




## Conclusion

There is likely room for further improvement, but this is a big improvement over the best decision tree error of about 242,000. 
There are parameters which allow us to change the performance of the Random Forest, much as we changed the maximum number of leaf nodes of the single decision tree. 
But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning. 