<a href="https://colab.research.google.com/github/Keoni808/KaggleCourses/blob/main/intro_to_Machine_Learning/Model_Validation/Model_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# Mount your Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!cp -rT '/content/drive/MyDrive/Colab Notebooks/KaggleCourses/Intro to Machine Learning/Model Validation/MV_DATA' .
# !cp '/content/drive/MyDrive/Colab Notebooks/KaggleCourses/Geospatial Analysis/Coordinate Reference Systems/purple_martin.csv' .

In [6]:
import pandas as pd

# 1 - What is Model Validation

The number one thing to look for when validating models is accuracy. Is the model predicting what actually happens?

A common and serious mistake is to estimate a model's accuracy by training it on "training data" and then validating it on that same "training data".

Summarizing model quality in an understandable way is essential. A single metric is useful, especially when handling large data sets.

**Mean Absolute Error (MAE)**

error = actual - predicted

For MAE,
- Take the error of each prediction
- Take the absolute value of each error, so that every error is positive
- Then take the mean of those absolute errors
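
The three steps above can be sketched by hand and checked against scikit-learn. The prices below are made up for illustration and are not taken from the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices, for illustration only
actual = np.array([200_000, 150_000, 320_000, 410_000])
predicted = np.array([210_000, 140_000, 300_000, 450_000])

# Step 1: error of each prediction (actual - predicted)
errors = actual - predicted

# Step 2: turn each error into a positive number
abs_errors = np.abs(errors)

# Step 3: take the mean of the absolute errors
mae_by_hand = abs_errors.mean()

# scikit-learn's helper should agree with the manual calculation
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)
```

Here the absolute errors are 10,000, 10,000, 20,000 and 40,000, so both calculations give an MAE of 20,000.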

In [7]:
# Model we will be observing

# Load data
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop every row that has a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

In [8]:
# calculating the mean absolute error
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

# 2 - The Problem with "In-Sample" Scores

**"In-sample" score**

A score calculated on the same "sample" that was used to train the model.

- This is bad practice because a model can pick up patterns that appear in the training sample but do not hold in general. Those patterns look accurate in-sample, yet the model gives faulty predictions when they are absent from new data.
- The goal is to achieve accuracy with new data that the model has yet to see.

**MODEL VALUE**

The models that we create only have value if they make accurate predictions on new data.
- To check this, it is important to measure performance on data that was not used to train the model.

**VALIDATION DATA**

- Split the available data into training data and validation data, and fit the model on the training data only. This way, the validation data will be new to the model.
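
As a rough sketch of what holding out validation data means, the split can be done by hand on a toy table. The column names, values, and the one-quarter hold-out size are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy table standing in for a real dataset (values are made up)
data = pd.DataFrame({
    "rooms": [2, 3, 4, 3, 5, 2, 4, 3],
    "price": [300, 420, 560, 410, 700, 310, 540, 430],
})

rng = np.random.default_rng(0)           # fixed seed so the split is repeatable
shuffled = rng.permutation(len(data))    # row positions in random order

n_val = len(data) // 4                   # hold out a quarter of the rows
val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]

train_data = data.iloc[train_idx]        # rows used to fit the model
val_data = data.iloc[val_idx]            # rows kept away from the model while fitting

print(len(train_data), len(val_data))
```

No row appears in both pieces, so any score computed on `val_data` reflects data the model has not seen.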

# 3 - Coding It

In [9]:
# train_test_split

# This function splits the data into two pieces:
# 1. one for training
# 2. one for validation

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

263822.8373143964
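
A small check of how `train_test_split` behaves, using made-up arrays rather than the Melbourne data. It illustrates that repeating the call with the same `random_state` reproduces the same split. When no size is given, scikit-learn holds out about a quarter of the rows by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state
a_train, a_val, _, _ = train_test_split(X_demo, y_demo, random_state=0)
b_train, b_val, _, _ = train_test_split(X_demo, y_demo, random_state=0)

print(a_train.shape, a_val.shape)       # sizes of the two pieces
print(np.array_equal(a_val, b_val))     # True when both calls gave the same split
```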


When validated on the same data used for training, our model's MAE was only about 435.

On new data that the model had not seen, the MAE is about 264,000.

There are many ways to improve this. The main places to look are:
1. finding better features
2. using different models
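
One hedged way to act on the second point is to fit several candidate models on the same training split and compare their validation MAE. The sketch below uses synthetic data, not the Melbourne data, and `RandomForestRegressor` is only an assumed example of a "different model". The notebook does not say which alternatives to try.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy function of three made-up features
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 3))
y_syn = 3 * X_syn[:, 0] + X_syn[:, 1] ** 2 + rng.normal(0, 5, size=500)

# One shared split so every candidate is scored on the same unseen rows
Xs_train, Xs_val, ys_train, ys_val = train_test_split(X_syn, y_syn, random_state=0)

# Candidate models (the random forest is an illustrative choice)
candidates = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

val_scores = {}
for name, model in candidates.items():
    model.fit(Xs_train, ys_train)
    val_scores[name] = mean_absolute_error(ys_val, model.predict(Xs_val))
    print(f"{name}: validation MAE = {val_scores[name]:.2f}")
```

Whichever candidate has the lower validation MAE is the better choice on this data. The same loop could be used to compare different feature lists instead of different models.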

# EXERCISE EXAMPLE QUESTIONS

In [10]:
# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


In [11]:
# 1. Split Your Data
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

In [12]:
# 2. Specify and Fit the Model

iowa_model = DecisionTreeRegressor(random_state=1)

iowa_model.fit(train_X, train_y)

In [13]:
# 3. Make Predictions with Validation data

val_predictions = iowa_model.predict(val_X)

print(val_predictions[:5])
print(val_y.head().tolist())

[186500. 184000. 130000.  92000. 164500.]
[231500, 179500, 122000, 84500, 142000]


In [17]:
# 4. Calculate the Mean Absolute Error in Validation Data

val_mae = mean_absolute_error(val_y, val_predictions)
print(val_mae)

29652.931506849316
