Kaggle - Validate Your First ML Model

*  We are going to start off by [validating a model on the Melbourne housing dataset](https://www.kaggle.com/code/dansbecker/model-validation/tutorial) with this notebook
*  We are going to use the [pandas library](https://pandas.pydata.org/docs/) (version 2.1.4, dated Dec 08, 2023)
*  We are also going to use the [scikit-learn library](https://scikit-learn.org/stable/) (version 1.3.2, dated October 2023)

We can drag and drop the data files (CSV files) that we want to work with from our local drive into the Google Colab Files pane (folder icon on the left side of the Colab screen)
1.   Download the [Kaggle Melbourne Housing Data](https://www.kaggle.com/code/dansbecker/model-validation/data) to your desktop
2.   Click on the folder icon, roughly halfway down the left side of the Colab screen
3.   Drag and drop the melb_data.csv file into the folder to upload it to Google Colab from your desktop
4.   You will need to repeat this upload every time you start a new session, because Colab does not keep uploaded files between sessions

Our script does the following:

*   Imports the pandas library
*   Loads the data
*   Filters out rows with missing data
*   Identifies the target of the analysis and the features that impact it
*   Imports the decision tree regressor from the scikit-learn library
*   Defines a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model
*   Fits the model to the data
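As a small illustration of the "filters out rows with missing data" step, the sketch below uses a made-up toy table (not the real melb_data.csv) to show what `dropna(axis=0)` removes. The column values and the variable names `toy`, `all_complete`, and `price_known` are assumptions for illustration only. It also shows `dropna(subset=[...])`, which is one way to drop rows only when a specific column such as Price is missing.

```python
import pandas as pd

# Hypothetical toy table standing in for melb_data.csv (values are made up)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, None, 900000.0, 750000.0],
    "BuildingArea": [80.0, 120.0, None, 110.0],
})

# dropna(axis=0) drops every row that has a missing value in ANY column
all_complete = toy.dropna(axis=0)

# dropna(subset=[...]) drops a row only when one of the listed columns is missing
price_known = toy.dropna(subset=["Price"])

print(len(toy), len(all_complete), len(price_known))  # prints: 4 2 3
```

On the toy table, filtering on every column keeps 2 of 4 rows, while filtering on Price alone keeps 3. The notebook uses the first form, so any row with a gap in any column is discarded.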



In [1]:
# Import the pandas library
import pandas as pd

# Load data
melbourne_data = pd.read_csv('melb_data.csv')

# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
# ('Lattitude' and 'Longtitude' are spelled this way in the CSV column names)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms',
                      'Bathroom',
                      'Landsize',
                      'BuildingArea',
                      'YearBuilt',
                      'Lattitude',
                      'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Import the decision tree regressor from scikit-learn
from sklearn.tree import DecisionTreeRegressor

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(X, y)

Let's take a look at Mean Absolute Error (MAE)

*  In general, error = actual - predicted
*  MAE = the sum of the absolute errors divided by the sample size - [wiki](https://en.wikipedia.org/wiki/Mean_absolute_error)
*  Our script calculates the house price MAE in $, using the same data the model was fit on
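To make the definition concrete, here is a minimal sketch that computes MAE by hand in plain Python. The four prices are made-up numbers chosen for easy arithmetic, and the names `actual`, `predicted`, and `mae` are illustrative, not part of the notebook.

```python
# Made-up actual and predicted house prices (illustrative values only)
actual = [300000.0, 450000.0, 500000.0, 800000.0]
predicted = [310000.0, 430000.0, 500000.0, 750000.0]

# Error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so over- and under-predictions do not cancel out
absolute_errors = [abs(e) for e in errors]

# MAE = sum of absolute errors divided by the sample size
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # prints: [-10000.0, 20000.0, 0.0, 50000.0]
print(mae)     # prints: 20000.0
```

An MAE of 20000.0 here means the predictions are off by $20,000 on average. `mean_absolute_error` from scikit-learn performs the same calculation.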

In [2]:
from sklearn.metrics import mean_absolute_error

# Predict on the same data the model was fit on (in-sample predictions)
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

Now we will look at why it is good to split a dataset into two parts:
*  Training portion (the model is fit on this data, so it has already seen it)
*  Validation portion (data the model has not seen, so we can see how the model performs in the wild)

We want a model that performs well on data it has not seen, not one that has simply memorized its training data
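The split itself can be sketched on a tiny made-up array, which makes it easy to see what `train_test_split` returns. The arrays and the `_demo` names are assumptions for illustration. With no size argument, scikit-learn holds out 25% of the rows for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows with 2 features each, and one target value per row
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Default split: 75% training, 25% validation; random_state fixes the shuffle
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

print(len(train_X_demo), len(val_X_demo))  # prints: 15 5

# No row appears in both portions
overlap = set(train_y_demo.tolist()) & set(val_y_demo.tolist())
print(overlap)  # prints: set()
```

Because the two portions share no rows, any score computed on the validation portion reflects data the model never saw during fitting.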

Our script does the following:

*   Imports the scikit-learn function train_test_split to break up the data into two pieces
*   Splits the dataset into two parts: training and validation datasets, for both the features and the target
*   Defines a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model
*   Fits the model on the training data only
*   Predicts prices for the validation data
*   Calculates the MAE (Mean Absolute Error) on the validation data


In [3]:
# Import function train_test_split to break up the data into two pieces
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

258405.54422207875


Our model is overfitting: it does far better on the data it was trained on than on data it has not seen

*   Training data (in-sample) MAE is $434.72

*   Validation data MAE is $258,405.54

We are going to have to refine the model
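The same pattern can be reproduced on synthetic data, which suggests the gap comes from how an unconstrained decision tree behaves and not from anything special about the housing data. The sketch below is an assumption-laden illustration: the data is randomly generated, the `_syn` names are hypothetical, and the numbers it prints are not comparable to the Melbourne results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 3))
y_syn = 50000 * X_syn[:, 0] + rng.normal(0, 40000, size=400)

train_X_syn, val_X_syn, train_y_syn, val_y_syn = train_test_split(
    X_syn, y_syn, random_state=0
)

# A tree with no depth limit can give every training row its own leaf
tree = DecisionTreeRegressor(random_state=0)
tree.fit(train_X_syn, train_y_syn)

# Score the same model on data it has seen and on data it has not
train_mae = mean_absolute_error(train_y_syn, tree.predict(train_X_syn))
val_mae = mean_absolute_error(val_y_syn, tree.predict(val_X_syn))

print("training MAE:  ", train_mae)
print("validation MAE:", val_mae)
```

The training MAE comes out at essentially zero because the tree has memorized the noise in the training rows, while the validation MAE is much larger. The validation figure is the one to watch when refining the model.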