# Introduction to machine learning

## Regression Model 

### Selecting data and choosing features

In [1]:
import pandas as pd

melbourne_file_path = '../data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

# drop rows for which we have "not available" fields
melbourne_data = melbourne_data.dropna(axis=0)

# we retrieve the variable we want to predict
y = melbourne_data.Price


# choose some features
melbourne_features = ['Rooms','Bathroom','Landsize','Lattitude','Longtitude']


# X holds the input features, one row per house
X = melbourne_data[melbourne_features]
# X.describe()
X.head()

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
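
The `dropna(axis=0)` step above discards every row that has at least one missing field, so it is worth checking how much data it removes. The following is a minimal sketch on a made-up four-row table; the `toy` frame and its values are assumptions for illustration, not rows from `melb_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [156.0, np.nan, 120.0, 245.0],
    "Price": [1035000.0, 1465000.0, np.nan, 1876000.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that contains at least one missing value
clean = toy.dropna(axis=0)
print(len(toy), "rows before,", len(clean), "rows after")
```

Running the same two checks on `melbourne_data` before and after `dropna` would show how many houses are lost to missing fields.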


### Building the model

In [6]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

### Use model for prediction

In [24]:
print("The first 5 houses in the data:")
print(X.head())
print("The predictions for all houses are")
predictions = melbourne_model.predict(X)
print(predictions)

# print(y)

The first 5 houses in the data:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions for all houses are
[1035000. 1465000. 1600000. ...  385000.  560000. 2450000.]


## Model validation

In [8]:
from sklearn.metrics import mean_absolute_error

# Measure the error between the model's predictions and the real prices.
# Note: this is an "in-sample" score, computed on the same data the model was
# trained on, so it is overly optimistic. We should hold some data out of the
# model-building process and use it to test the model's accuracy.
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
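
To illustrate why an in-sample score is misleading, here is a small sketch on synthetic data (an assumption for illustration; the names `X_demo`, `y_demo` and the noise model are not part of the notebook). An unrestricted decision tree can memorise its training rows, so its in-sample error is essentially zero even though its error on held-out rows is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (not the Melbourne data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=300)

# Hold out part of the data before fitting
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

demo_model = DecisionTreeRegressor(random_state=0)
demo_model.fit(X_tr, y_tr)

# Error on the rows the tree was trained on vs. rows it has never seen
in_sample_mae = mean_absolute_error(y_tr, demo_model.predict(X_tr))
held_out_mae = mean_absolute_error(y_val, demo_model.predict(X_val))
print("In-sample MAE:", in_sample_mae)
print("Held-out MAE: ", held_out_mae)
```

The gap between the two numbers is what the held-out evaluation is meant to expose.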

In [43]:
from sklearn.model_selection import train_test_split

# split data into training and validation (testing) data, for both features and target
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(test_X)
# evaluate error of predictions
MAE = mean_absolute_error(test_y, val_predictions)

print("Mean absolute error:", MAE)
print("Mean house price in the testing data:", test_y.mean())

Mean absolute error: 272327.02130406717
Mean house price in the testing data: 1091516.642995481
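
The `train_test_split` call above passes neither `test_size` nor `train_size`, in which case scikit-learn holds out 25% of the rows. The sketch below shows this on 100 dummy rows (hypothetical data, chosen so that the split sizes are easy to read), along with an explicit 80/20 split for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows with 2 features each; y_demo[i] identifies row i
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default split: 75% training rows, 25% test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Explicit split: hold out 20% of the rows instead
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr80.shape, X_te20.shape)
```

Features and targets are shuffled together, so each row of `X_tr` still lines up with its entry in `y_tr`.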


### Conclusion on our model

Our model has a mean absolute error of about $272,000, while the houses in the testing data cost about $1.09 million on average. 
This basic prediction system will need either:
<ul>
    <li>a better choice of features</li>
    <li>a better model to fit our data</li>
</ul>
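
To put the error in proportion, it helps to express the MAE as a fraction of the mean price. The short calculation below simply reuses the two numbers printed by the validation cell; the variable names are illustrative.

```python
# Values copied from the validation output above
mae = 272327.02130406717
mean_price = 1091516.642995481

# Error relative to the typical house price
relative_error = mae / mean_price
print(f"MAE is {relative_error:.1%} of the mean price")  # MAE is 24.9% of the mean price
```

In other words, a typical prediction is off by roughly a quarter of the average house price.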


## Underfitting and overfitting 

<ul>
<li>Overfitting: the model matches the training data almost perfectly, but does poorly on validation data and other new data</li>
<li>Underfitting: the model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data</li>
</ul>

!["overfitting"](images/overfitting.png){50x50}


In [3]:
import pandas as pd
    
# Load data
melbourne_file_path = 'data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

In [4]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# The more leaves we allow the tree to have, the further we move from the
# underfitting area of the graph above towards the overfitting area.

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [9]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  255575
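
The results above can be turned into a choice of tree size by picking the candidate with the lowest validation MAE. The sketch below hard-codes the four values printed by the loop; the dictionary name `mae_by_leaves` is illustrative.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
mae_by_leaves = {5: 347380, 50: 258171, 500: 243495, 5000: 255575}

# Pick the key (number of leaves) whose MAE is smallest
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)
print(best_leaves)  # 500
```

In the notebook itself, the same dictionary could be built with `get_mae` instead of being typed in by hand. The error falls from 5 to 500 leaves (less underfitting) and rises again at 5000 (overfitting), so 500 is the best of the four candidates tried.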


## General conclusion


<ul>
<li>Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions</li>
<li>Underfitting: failing to capture relevant patterns, again leading to less accurate predictions</li>
</ul>


### Decision trees

A deep tree with lots of leaves will overfit because each prediction comes from the historical data of only the few houses at its leaf. A shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
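
The link between tree size and the number of training rows behind each prediction can be checked directly. This is a rough sketch on synthetic data (an assumption for illustration, not the Melbourne houses): it compares an unrestricted tree with one capped at 4 leaves, using scikit-learn's `get_n_leaves()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 200 rows, a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=200)

# An unrestricted tree keeps splitting; the capped tree stops at 4 leaves
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X_demo, y_demo)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    n_leaves = tree.get_n_leaves()
    # Average number of training rows that each leaf's prediction is based on
    print(name, "leaves:", n_leaves, "avg rows per leaf:", len(X_demo) / n_leaves)
```

The deep tree ends up with very few rows per leaf, while the shallow tree lumps many different rows into each of its leaves.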


### A better model: random forest

A random forest uses many trees and makes a prediction by averaging the predictions of its component trees.

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))


191669.7536453626
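
The averaging behaviour described above can be verified by hand. The following sketch uses synthetic data and a small forest of 20 trees (both assumptions for illustration): it collects the prediction of each fitted tree from `estimators_` and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 200 rows, a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_demo, y_demo)

# Predict 5 rows with every individual tree: result has shape (20, 5)
X_new = X_demo[:5]
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])

# Average over the trees and compare with the forest's prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_new)))
```

Because each tree is trained on a different bootstrap sample of the rows, their individual errors tend to partly cancel out in the average.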
