Kaggle - Model Underfitting and Overfitting

* We are going to use the [Melbourne dataset](https://www.kaggle.com/code/dansbecker/model-validation/tutorial) to examine model underfitting and overfitting in this notebook
* We are working through the [Kaggle Tutorial on underfitting and overfitting](https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting)
* We are going to use the [pandas library](https://pandas.pydata.org/docs/) (version 2.1.4, dated Dec 08, 2023)
* We are also going to use the [scikit-learn library](https://scikit-learn.org/stable/) (version 1.3.2, dated October 2023)


We can drag and drop the data files (CSV files) we want to work with from our local drive into the Google Colab file panel (folder icon on the left side of the Colab screen)
1.   Download the [Kaggle Melbourne Housing Data](https://www.kaggle.com/code/dansbecker/model-validation/data) to your desktop
2.   Click on the folder icon on the left side of the Colab screen, about halfway down
3.   Drag and drop the melb_data.csv file from your desktop into the file panel to upload it to Google Colab
4.   You will need to repeat this upload every time you use the notebook, because Colab does not keep uploaded files between sessions
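
Because the upload has to be repeated each session, it can help to check that the file is present before loading it. The following is a minimal sketch, not part of the Kaggle tutorial; `require_file` is a hypothetical helper name introduced here for illustration.

```python
from pathlib import Path


def require_file(path):
    """Return the path if the file exists, otherwise raise a clear error.

    Hypothetical helper for illustration; not part of pandas or Colab.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found. Upload it through the Colab file panel first."
        )
    return p


# Report the situation instead of failing later inside pd.read_csv
try:
    require_file("melb_data.csv")
    print("melb_data.csv is available")
except FileNotFoundError as err:
    print(err)
```

If the message reports a missing file, repeat the upload steps above and rerun the cell.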

Our first script does the following:

*   Imports the pandas library
*   Loads the data into a pandas DataFrame

In [1]:
# Import Pandas Library
import pandas as pd

# Load data
melbourne_data = pd.read_csv('melb_data.csv')

Our next script does the following:

*   Filters out dataset rows with missing values
*   Identifies the target of the analysis and the features used to predict it
*   Imports the [scikit-learn `train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
*   Splits data into training and validation data, for both features and target



In [2]:
# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms',
                      'Bathroom',
                      'Landsize',
                      'BuildingArea',
                      'YearBuilt',
                      'Lattitude',
                      'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Import train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
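
To see what `dropna(axis=0)` and `train_test_split` do without loading the full file, here is a small sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration only; they are not taken from melb_data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 of which contain a missing value
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2, 5, 3, 4, 2, 3],
    "Landsize": [150.0, np.nan, 400.0, 320.0, 120.0,
                 610.0, 280.0, 450.0, 200.0, 350.0],
    "Price": [500_000, 650_000, 900_000, np.nan, 450_000,
              1_200_000, 700_000, 950_000, 480_000, 720_000],
})

# dropna(axis=0) removes every row with a missing value in ANY column
complete = toy.dropna(axis=0)
print(len(toy), "->", len(complete))  # 10 -> 8

X_toy = complete[["Rooms", "Landsize"]]
y_toy = complete["Price"]

# With no test_size given, train_test_split holds out 25% of the rows
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)
print(len(tr_X), len(va_X))  # 6 2
```

Fixing `random_state` makes the split reproducible, so rerunning the notebook assigns the same rows to training and validation.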

Now we will look at why it's good to split a dataset into two parts:
*  Training portion (the model is fitted on this data, so it has already seen it)
*  Validation portion (data the model has not seen, which shows how the model performs in the wild)

We want the model's error to be measured impartially, on data it was not fitted to, so that the score reflects how it will do on new data
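
The difference between the two portions can be sketched on synthetic data. Everything in this example (the data, the noise level, the variable names) is an assumption chosen for illustration and is unrelated to the Melbourne file. A tree with no size limit is scored on the rows it was fitted to and on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy linear signal
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# A tree with no size limit can memorize the training rows
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
val_mae = mean_absolute_error(y_va, tree.predict(X_va))
print(f"Training MAE:   {train_mae:.3f}")
print(f"Validation MAE: {val_mae:.3f}")
```

The training score looks nearly perfect because the tree has memorized those rows, while the validation score is noticeably worse. Only the validation score says anything about new data.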

Keeping this in mind, our next script accomplishes the following:
*  Imports scikit-learn's `mean_absolute_error` and `DecisionTreeRegressor`
*  Defines a function to help compare MAE (mean absolute error) scores from different values for max_leaf_nodes

It's helpful to review [scikit-learn's DecisionTreeRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

In [4]:
# Import the error metric and the model class from scikit-learn
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Function to help compare MAE scores from different values for max_leaf_nodes
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

Our next script:
* Uses a for-loop to compare the validation MAE (lower is better) of models built with different values for max_leaf_nodes.

In [6]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  255575


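Of the options listed, 500 is the optimal number of leaves to use in the model, because it gives the lowest validation MAE (243,495). With 5 or 50 leaves the error is higher because the tree is too simple, and with 5000 leaves the error rises again, which is a sign of overfitting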
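
The same choice can be made in code instead of by reading the printout. This is a minimal sketch that copies the MAE values printed above into a dictionary; the names `scores` and `best_tree_size` are introduced here for illustration.

```python
# Validation MAE per max_leaf_nodes, copied from the output printed above
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 255575}

# min() walks the keys and ranks them by their MAE,
# so it returns the tree size with the lowest error
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # 500
```

In the notebook itself, the dictionary could be built by calling `get_mae` for each candidate value rather than typing the numbers in by hand.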

Here's the takeaway: Models can suffer from either:

* Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
* Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.