# <font color='purple'>My First Machine Learning Model </font>


#### This notebook is part of my first study on Machine Learning from the ML Study Jam.
ML Study Jam is a four-week Machine Learning Bootcamp offered by Google for students who want to develop their ML and AI skills.

The dataset used in this project was downloaded from Kaggle and saved as a CSV file. The dataset is available at: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot?select=melb_data.csv

In [1]:
import pandas as pd

In [2]:
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv("melb_data.csv")

## Data Exploration

In [3]:
# print a summary of the data in Melbourne data
melbourne_data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


## Selecting Data for Modeling:
Our dataset has too many variables to keep track of at once, or even to print out nicely.
We'll start by picking a few variables using our intuition. To choose variables/columns, we first need to see a list of all columns in the dataset.

In [4]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values (some houses for which some variables weren't recorded).
For now, we will take the simplest option and drop every house that has at least one missing value.

In [5]:
# Drop missing values

melbourne_data = melbourne_data.dropna(axis=0)
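
Before dropping rows, it can help to check how many values are missing in each column. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the Melbourne data) to show what `isnull().sum()` and `dropna(axis=0)` do.

```python
import numpy as np
import pandas as pd

# Toy data standing in for melbourne_data (values are made up for illustration)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
toy_clean = toy.dropna(axis=0)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

In this notebook the same operation is costly: the row count goes from 13,580 in the first summary to 6,196 in the later summary of X, so more than half of the houses are discarded.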

### Selecting The Prediction Target:
You can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.
We'll use the dot notation to select the column we want to predict, which is called the prediction target. 
By convention, the prediction target is called y.

In [6]:
# Create the prediction target

y = melbourne_data.Price
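
As a small illustration of the dot notation, the sketch below selects a column from a made-up two-row DataFrame (`toy` is an assumption for illustration) both with dot notation and with bracket notation. The two forms return the same Series; bracket notation also works for column names that contain spaces or clash with DataFrame attributes.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Price": [1035000.0, 1465000.0], "Rooms": [2, 3]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation, works for any column name

print(type(y_dot).__name__)      # the selected column is a pandas Series
print(y_dot.equals(y_bracket))   # both notations select the same data
```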

## Choosing "Features":
The columns that are inputted into our model (and later used to make predictions) are called "features". 
In our case, those would be the columns used to determine the home price. By convention, this data is called X.

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

In [8]:
# Review the data we'll be using to predict house prices using the describe method and the head method.

X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [9]:
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


## Building Our Model
Use the scikit-learn library to create the models.
The steps to building and using a model are:
- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Use the fitted model to generate predictions for given houses.
- Evaluate: Determine how accurate the model's predictions are.
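
The four steps above can be sketched end to end on synthetic data. Everything in this sketch (`X_toy`, `y_toy`, the linear relationship used to generate the target, and `max_leaf_nodes=20`) is an assumption chosen for illustration, not part of the Melbourne analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two features and a noisy target (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 1, size=200)

# Define: choose the model type and its parameters
model = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0)

# Fit: capture patterns from the provided data
model.fit(X_toy, y_toy)

# Predict: generate predictions for the first five rows
preds = model.predict(X_toy[:5])

# Evaluate: compare the predictions with the known targets
mae = mean_absolute_error(y_toy[:5], preds)
print(preds)
print(mae)
```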

In [10]:
# Defining a decision tree model with scikit-learn and fitting it with the features and target variable.

from sklearn.tree import DecisionTreeRegressor

In [11]:
# Define model. Specify a number for random_state to ensure same results each run

melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model

melbourne_model.fit(X, y)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
We now have a fitted model that we can use to make predictions.
In practice, we'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.
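
To illustrate the point about random_state, the sketch below fits two trees with the same seed on the same synthetic data and checks that they give identical predictions for unseen points. The data and variable names (`X_toy`, `y_toy`, `X_new`) are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data and a few new points (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 3))
y_toy = X_toy.sum(axis=1) + rng.normal(0, 1, size=100)
X_new = rng.uniform(0, 10, size=(20, 3))

# Two models defined with the same random_state and fit on the same data
tree_a = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
tree_b = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# With a fixed seed, both trees produce the same predictions
same = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new))
print(same)
```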

In [12]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Model Validation
For summarizing our model quality, we'll start with one called Mean Absolute Error (also called MAE).
The prediction error for each house is:  error=actual−predicted
With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. 
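
As a worked example of the definition above, the sketch below computes the MAE by hand for three made-up prices and compares the result with scikit-learn's `mean_absolute_error`. The arrays `actual` and `predicted` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values (illustration only)
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# error = actual - predicted, one value per house: -10, 10, -30
errors = actual - predicted

# Take absolute values (10, 10, 30) and average them: 50 / 3
mae_manual = np.mean(np.abs(errors))

# The library function should give the same number
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_manual, mae_sklearn)
```

Because the absolute value is taken before averaging, an overestimate of 10 and an underestimate of 10 do not cancel out.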


In [13]:
# Calculate the MAE metric:

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

## The Problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. This is a problem because a model can pick up patterns that appear in the training data only by coincidence. Because such a pattern was derived from the training data, the model will appear accurate on the training data, but if the pattern doesn't hold when the model sees new data, the model will be very inaccurate when used in practice.

Because the models' practical value comes from making predictions on new data, we measure performance on data that was not used to build the model. The most straightforward way to do this is to exclude some data from the model building process and then use it to test the model's accuracy on data that has not been seen before. This data is called **validation data**.

In [14]:
# Use the function train_test_split to break up the data into two pieces. 
# Use some of that data as training data to fit the model, and the other data as validation data to calculate mean_absolute_error.

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

# Define model (random_state is not set here, so the score can vary slightly between runs)
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

254449.80761781795
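
The gap between in-sample and validation scores can be reproduced on synthetic data. In the sketch below, the data and the names ending in `_toy` are assumptions for illustration. With its default settings, `train_test_split` holds out 25% of the rows, and an unrestricted tree fits its training rows almost perfectly while doing worse on the held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a noisy target (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(400, 2))
y_toy = 5 * X_toy[:, 0] + rng.normal(0, 3, size=400)

# Default split: 75% training rows, 25% validation rows
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=1
)
print(len(train_X_toy), len(val_X_toy))

# Fit on the training rows only
tree = DecisionTreeRegressor(random_state=1).fit(train_X_toy, train_y_toy)

# In-sample score versus validation score
train_mae = mean_absolute_error(train_y_toy, tree.predict(train_X_toy))
val_mae = mean_absolute_error(val_y_toy, tree.predict(val_X_toy))
print(train_mae, val_mae)
```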


Our mean absolute error for the in-sample data was about 1,116 dollars. Out-of-sample it is more than 254,000 dollars.
This is the difference between a model that is almost exactly right and one that is unusable for most practical purposes.
Let's try to improve this model, for example by experimenting with better features or different model types.

## Underfitting and Overfitting
#### Fine-tune the model for better performance.
To better control the tree depth, we will use the max_leaf_nodes argument.
First, we can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [15]:
# Import mean_absolute_error and DecisionTreeRegressor from scikit-learn

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Define the function get_mae

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [16]:
# Compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  369673
Max leaf nodes: 50  		 Mean Absolute Error:  266644
Max leaf nodes: 500  		 Mean Absolute Error:  243613
Max leaf nodes: 5000  		 Mean Absolute Error:  256147


Of the options listed, 500 is the optimal number of leaves.
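
Rather than reading the best value off the printed output, it can also be selected programmatically. The sketch below is one way to do that on synthetic data; `get_mae_toy`, `candidates`, and the `_toy` variables are hypothetical names introduced for illustration, and the same pattern could be applied with the notebook's get_mae function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(400, 2))
y_toy = 5 * X_toy[:, 0] + rng.normal(0, 3, size=400)
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=1
)

def get_mae_toy(max_leaf_nodes):
    """Return the validation MAE of a tree with the given max_leaf_nodes."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X_toy, train_y_toy)
    return mean_absolute_error(val_y_toy, model.predict(val_X_toy))

# Score every candidate and keep the one with the lowest validation MAE
candidates = [5, 50, 500]
scores = {n: get_mae_toy(n) for n in candidates}
best_size = min(scores, key=scores.get)
print(scores)
print(best_size)
```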


In [17]:
# Create a variable to store the best value of max_leaf_nodes
best_tree_size = 500

# Build the final model with the best tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model on all of the data
final_model.fit(X, y)

In [18]:
# Calculate the MAE metric to check the new model:

predicted_home_prices = final_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

126642.40540214001

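This value of approximately 126,000 dollars is an in-sample score, because final_model was fit on all of X and y and then evaluated on the same rows, so it cannot be compared directly with the validation error of more than 254,000 dollars from the first tree. The comparable validation MAE for a tree with 500 leaf nodes is about 243,600 dollars, which is a modest improvement but still a very high value.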
Therefore, instead of the **decision tree**, we will use a new model to try to improve our performance, called **random forest**.

## Trying the random forest model

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [19]:
# Import RandomForestRegressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# Create the model using the variables already created: train_X, val_X, train_y, val_y

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

190414.59149025998
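
To illustrate the averaging described above, the sketch below fits a small forest on synthetic data and checks that its prediction matches the mean of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute. The data, the `_toy` names, and `n_estimators=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a few new points (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 5 * X_toy[:, 0] + rng.normal(0, 3, size=200)
X_new = rng.uniform(0, 10, size=(5, 2))

# A small forest of 10 trees
forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_toy, y_toy)

# Prediction of the forest as a whole
forest_preds = forest.predict(X_new)

# Average of the predictions of each component tree
tree_preds = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(forest_preds, manual_average))
```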


#### Conclusion: 
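With a validation MAE of about 190,000 dollars, the random forest performed better than the first decision tree (about 254,000 dollars) and also better than the tuned tree with 500 leaf nodes (about 244,000 dollars on the same validation set). The 126,000 dollars computed earlier for the tuned tree is an in-sample score, so it is not a fair point of comparison. Let's now try to improve the random forest further.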

### Defining some different random forest models

Let's define some different random forest models and select the best option for our model.

In [20]:
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

To select the best model out of the five, we define a function score_model() below. This function returns the MAE from the validation set. The best model will obtain the lowest MAE.

In [21]:
# Define the function for comparing different models
def score_model(model, X_t=train_X, X_v=val_X, y_t=train_y, y_v=val_y):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

# Get the MAE for each model
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))

Model 1 MAE: 192759
Model 2 MAE: 192287
Model 3 MAE: 186379
Model 4 MAE: 197957
Model 5 MAE: 235402


In [22]:
# Define a model using the best option (model 3)
my_model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)


In [23]:
# Fit the model to all of the data (X and y)
my_model.fit(X, y)

In [24]:
# Calculate the MAE metric to check the new model:
predicted_home_prices = my_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

69719.13740477728

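The in-sample MAE is now approximately 69,700 dollars. Because my_model was fit on X and y and then evaluated on the same rows, this number is optimistic. The validation MAE for this configuration (model 3) was about 186,000 dollars, which is still a clear improvement over the 254,000 dollars of the first decision tree.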

In [25]:
# Generate predictions (on X, the same data used for fitting, since no separate test set is loaded)
preds_test = my_model.predict(X)

In [26]:
print(preds_test)

[1070945. 1370250. 1477716. ...  408886.  596850. 2746940.]
