# Mini Project: Build a Machine Learning Model

## Predict Total Fare on the NYC Taxi Dataset

Welcome to the NYC Taxi Fare Prediction project! In this Colab, we will continue using the NYC Taxi Dataset to predict the fare amount for taxi rides using a subset of available features. We will go through three main stages: building a baseline model, creating a full model, and performing hyperparameter tuning to enhance our predictions.

Now that you've completed exploratory data analysis on this dataset, you should have a good understanding of the feature space.

## Project Objectives

The primary objectives of this project are as follows:

1. **Baseline Model**: We will start by building a simple baseline model to establish a benchmark for our predictions. This model will serve as a starting point to compare the performance of our subsequent models.

2. **Full Model**: Next, we will develop a more comprehensive model that leverages machine learning techniques to improve prediction accuracy. We will use Scikit-Learn's model pipeline to build a framework that enables rapid experimentation.

3. **Hyperparameter Tuning**: Lastly, we will optimize our full model by fine-tuning its hyperparameters. By systematically adjusting the parameters that control model behavior, we aim to achieve the best possible performance for our prediction task.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:

  1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.
  2. Print the first 5 rows of data.
  3. Drop any rows of data that contain NULL values.
  4. Create a new feature, 'trip_duration' that captures the duration of the trip in minutes.
  5. Create a variable named 'target_variable' to store the name of the thing we're trying to predict, 'total_amount'.
  6. Create a list called 'feature_cols' containing the feature names that we'll be using to predict our target variable. The list should contain 'VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', and 'trip_duration'.

In [2]:
# Load the dataset into a pandas DataFrame (from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
taxi_tripdata_df = pd.read_parquet ('yellow_tripdata_2022-01.parquet')

In [3]:
# Display the first few rows of the dataset
print (taxi_tripdata_df.head())

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-01-01 00:35:40   2022-01-01 00:53:29              2.0   
1         1  2022-01-01 00:33:43   2022-01-01 00:42:07              1.0   
2         2  2022-01-01 00:53:21   2022-01-01 01:02:19              1.0   
3         2  2022-01-01 00:25:21   2022-01-01 00:35:23              1.0   
4         2  2022-01-01 00:36:48   2022-01-01 01:14:20              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           3.80         1.0                  N           142           236   
1           2.10         1.0                  N           236            42   
2           0.97         1.0                  N           166           166   
3           1.09         1.0                  N           114            68   
4           4.30         1.0                  N            68           163   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [4]:
# Drop rows with missing values.
taxi_tripdata_full_rows = taxi_tripdata_df.dropna (how = "any")

In [6]:
# Create new feature, 'trip_duration'.
# Same method as the EDA mini project I did earlier: https://github.com/TrollRider-Kristian/Springboard-AI-Mini-Projects/blob/main/Student_MLE_MiniProject_EDA.ipynb
# Bear in mind that pandas may raise a SettingWithCopyWarning for this project.
# See this StackOverflow question: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
# The warning is harmless here because we only add a new column to the filtered DataFrame and never rely on the original DataFrame being updated.
trip_duration_column = taxi_tripdata_full_rows.loc[:, 'tpep_dropoff_datetime'] - taxi_tripdata_full_rows.loc[:, 'tpep_pickup_datetime']
taxi_tripdata_full_rows.loc[:, 'trip_duration'] = trip_duration_column.dt.total_seconds() / 60
# Print first few rows with new column
print (taxi_tripdata_full_rows.head())

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-01-01 00:35:40   2022-01-01 00:53:29              2.0   
1         1  2022-01-01 00:33:43   2022-01-01 00:42:07              1.0   
2         2  2022-01-01 00:53:21   2022-01-01 01:02:19              1.0   
3         2  2022-01-01 00:25:21   2022-01-01 00:35:23              1.0   
4         2  2022-01-01 00:36:48   2022-01-01 01:14:20              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           3.80         1.0                  N           142           236   
1           2.10         1.0                  N           236            42   
2           0.97         1.0                  N           166           166   
3           1.09         1.0                  N           114            68   
4           4.30         1.0                  N            68           163   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [7]:
# "Create a variable named 'target_variable' to store the name of the thing we're trying to predict, 'total_amount'."
target_variable = 'total_amount'

In [8]:
# Create a list called feature_cols to store column names
# trip_distance and trip_duration are continuous features while the other four are categorical
feature_cols = ['VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', 'trip_duration']

Splitting a dataset into training and test sets is a crucial step in machine learning model development. It allows us to evaluate the performance and generalization ability of our models accurately. The training set is used to train the model, while the test set serves as an independent sample for evaluating its performance.

1. **Model Training**: The training set is used to fit the model, allowing it to learn the underlying patterns and relationships between the features and the target variable. By exposing the model to a diverse range of examples in the training set, it can capture the underlying structure of the data.

2. **Model Evaluation**: The test set, which is independent of the training set, is crucial for evaluating how well the trained model generalizes to unseen data. It provides an unbiased assessment of the model's performance on new instances. By measuring the model's accuracy, precision, recall, or other evaluation metrics on the test set, we can estimate how well the model will perform on unseen data.

3. **Preventing Overfitting**: Overfitting occurs when a model learns the training data's noise and idiosyncrasies instead of the underlying patterns. By evaluating the model on the test set, we can identify if the model is overfitting. If the model performs significantly worse on the test set compared to the training set, it indicates overfitting. In such cases, we might need to adjust the model, feature selection, or regularization techniques to improve generalization.

4. **Hyperparameter Tuning**: Splitting the dataset allows us to perform hyperparameter tuning on the model. Hyperparameters are configuration settings that control the learning process, such as learning rate, regularization strength, or the number of hidden layers in a neural network. By using a validation set (often created from a portion of the training set), we can iteratively adjust the hyperparameters and select the best combination that maximizes the model's performance on the validation set. The final evaluation on the test set provides an unbiased estimate of the model's performance.

By splitting the dataset into training and test sets, we can ensure that our models are both well-trained and accurately evaluated. This separation helps us understand how the model will perform on new, unseen data, which is critical for assessing its effectiveness and making informed decisions about its deployment.
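
The validation set mentioned in point 4 can be carved out with a second call to `train_test_split`. The sketch below uses synthetic data and illustrative split ratios (60/20/20), which are assumptions and not values prescribed by this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # synthetic features
y = X @ np.array([1.0, -2.0, 0.5])    # synthetic target

# First hold out the test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=3)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Hyperparameters would be compared on the validation rows, and the test rows would be touched only once, for the final estimate.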

Here is your task:

  1. Use Scikit-Learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the data into training and test sets. Don't forget to set the random state.

In [9]:
# Split dataset into training and test sets
features_train, features_test, target_train, target_test = train_test_split (taxi_tripdata_full_rows.loc[:, feature_cols],\
  taxi_tripdata_full_rows.loc[:, target_variable],\
  random_state = 3\
  # Later in this project, we are tasked to apply one-hot encoding to our training and test datasets AFTER
  # we split the data here.  This will create an issue where categorical values will be present in the test set
  # but absent in the training set.  One workaround for this is to stratify select feature columns:
  # https://datascience.stackexchange.com/questions/102294/dealing-with-extra-categories-in-test-set
  # KRISTIAN_NOTE - HOWEVER, stratifying categorical feature columns gives an error message for this dataset.
  # I will leave the code commented out for reference purposes:
  # stratify = taxi_tripdata_full_rows.loc [:, ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']]
)
print (f'Size of Features Training Set: {features_train.shape}')
print (f'Size of Features Test Set: {features_test.shape}')

Size of Features Training Set: (1794321, 6)
Size of Features Test Set: (598107, 6)


The importance of a baseline model, even if it uses a simple strategy like always predicting the mean, cannot be overstated. Here's why a baseline model is valuable:

1. **Performance Comparison**: A baseline model serves as a reference point for evaluating the performance of more sophisticated models. By establishing a simple yet reasonable baseline, we can determine whether our advanced models offer any significant improvement over this basic approach. It helps us set realistic expectations and gauge the effectiveness of our efforts.

2. **Model Complexity**: A baseline model provides insight into the complexity required to solve the prediction task. If a simple strategy like predicting the mean performs reasonably well, it suggests that the problem might not necessitate complex modeling techniques. Conversely, if the baseline model performs poorly, it indicates the presence of more intricate patterns that need to be captured by more sophisticated models.

3. **Minimum Performance Requirement**: A baseline model can establish a minimum performance requirement for a predictive task. If we cannot outperform the baseline, it suggests that our models have failed to capture even the most fundamental relationships within the data. In such cases, we may need to revisit our data preprocessing steps, feature engineering techniques, or consider other external factors affecting the task.

4. **Identifying Data Issues**: A baseline model can help identify potential issues within the dataset. If the baseline model performs poorly, it may indicate problems like missing values, outliers, or data inconsistencies. These issues can be further investigated and resolved to improve the overall model performance.

While a baseline model like always predicting the mean may not offer the highest prediction accuracy, its importance lies in its role as a starting point for model development and evaluation. It provides a solid foundation for comparing and assessing the performance of more complex models, ensuring that any improvements made are meaningful and significant.
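
A constant-prediction baseline can also be expressed with Scikit-Learn's `DummyRegressor`, which keeps the baseline interchangeable with later models. This is a minimal sketch on synthetic, right-skewed values (made up for illustration, not the taxi fares) that compares the mean and median strategies.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic, right-skewed "fares"; these are not drawn from the taxi data.
y_train = rng.gamma(shape=2.0, scale=9.0, size=5000)
y_test = rng.gamma(shape=2.0, scale=9.0, size=2000)
# DummyRegressor ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

for strategy in ("mean", "median"):
    baseline = DummyRegressor(strategy=strategy).fit(X_train, y_train)
    mae = mean_absolute_error(y_test, baseline.predict(X_test))
    print(f"{strategy}: MAE = {mae:.2f}")
```

Because the median is the constant that minimizes absolute error on the data it is computed from, a median baseline is often slightly harder to beat than a mean baseline when the metric is mean absolute error.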

Here is your task:

  1. Create a model that always predicts the mean total fare of the training dataset. Use Scikit-Learn's [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) to evaluate this model. Is it any good?

In [10]:
# Create a baseline for mean absolute error of total amount
# For my baseline, I'm just going to always predict the mean, rounded to 2 decimal places.
baseline_model = target_train.mean().round (2)
print (f'mean: {baseline_model}')
baseline_predictions = [baseline_model] * len (target_test)
baseline_error = mean_absolute_error (target_test, baseline_predictions)
print (f'mean absolute error of baseline: {baseline_error}')

mean: 18.85
mean absolute error of baseline: 9.763421745607387


The mean total_amount in the training set for January 2022 is \$18.85.  Always predicting this value yields a mean absolute error of \$9.76, which is over half that amount.  The model is poor but just realistic enough to serve as a baseline.

With a baseline metric in place, we can try to build a machine learning model. Obviously, if the model can't beat the baseline then there are some major issues to be resolved.

It's always a good idea to start with a simple machine learning model, like linear regression, and build upon it if necessary.

Here are your tasks:

  1. Use Scikit-Learn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to preprocess the categorical and continuous features independently. Apply the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to the continuous columns and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the categorical columns.

  One-hot encoding is a popular technique used to represent categorical variables numerically in machine learning models. It transforms categorical features into a binary vector representation, where each category is represented by a binary column. Here's an explanation of one-hot encoding:

  When working with categorical variables, such as colors (e.g., red, blue, green) or vehicle types (e.g., car, truck, motorcycle), machine learning algorithms often require numerical inputs. However, directly assigning numerical values to categories can introduce unintended relationships or orderings between them. For example, assigning the values 0, 1, and 2 to the categories red, blue, and green may imply a sequential relationship, which is not desired.

  One-hot encoding solves this problem by creating new binary columns, equal to the number of unique categories in the original feature. Each binary column represents a specific category and takes a value of 1 if the data point belongs to that category, and 0 otherwise. This encoding ensures that no implicit ordering or relationship exists between the categories.

  2. Integrate the preprocessor in the previous step with Scikit-Learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

  3. Train the pipeline on the training data.

  4. Evaluate the model using mean absolute error as a metric on the test data. Does the model beat the baseline?
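
The one-hot encoding described in task 1 can be seen on the color example from the text. This is a small sketch with a hypothetical `colors` frame; the sparse result is converted to a dense array only so it can be printed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column taken from the color example above.
colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # categories are stored in sorted order: blue, green, red
print(encoded)                 # one binary column per category, exactly one 1 per row
```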


In [11]:
# Use Scikit-Learn's ColumnTransformer to preprocess the categorical and
# continuous features independently.
all_columns_preparer = ColumnTransformer ([\
  # KRISTIAN_NOTE - For this one-hot encoder, I will set handle_unknown to 'ignore'.  Other datasets will require a different treatment.
  # Please see my reasoning for this below this code snippet.
  ('one_hot_categorical_columns', OneHotEncoder (handle_unknown = 'ignore'), ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']),
  ('scale_continuous_columns', StandardScaler (with_std=True), ['trip_distance', 'trip_duration'])
])

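# Print shape of data for testing purposes.  Note that fit_transform refits the encoder on each set separately,
# which is why the two column counts below differ.  The pipeline later fits on the training data only.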
print (f'Size of Column-Transformed Training Set: {all_columns_preparer.fit_transform (features_train, target_train).shape}')
print (f'Size of Column-Transformed Test Set: {all_columns_preparer.fit_transform (features_test, target_test).shape}')

Size of Column-Transformed Training Set: (1794321, 524)
Size of Column-Transformed Test Set: (598107, 510)


Out of the box, a ColumnTransformer does not account for categories that appear in the test set but not in the training set.  If we try to fit the data right away, we get this message: "UserWarning: Found unknown categories in columns [2] during transform."  The message also does not point to the specific offending data.

The OneHotEncoder object supports a parameter called 'handle_unknown', which allows us to change the error to a warning and treat the unknown rows of test data as 0's in all the categorical columns, but that is not appropriate for every scenario.

Other options (besides stratifying categorical feature columns for the train-test split) that come to mind are:

1. Encode the offending categories in the test set as all 0's.  This can easily be done with Scikit-Learn's OneHotEncoder by setting the 'handle_unknown' argument to 'ignore':
https://stackoverflow.com/questions/57946006/one-hot-encoding-train-with-values-not-present-on-test
2. Manually remove the offending rows from the dataset.
3. Move one copy of each unaccounted category from the test set to the training set.  This is a variation of the response to this question on Quora, which suggests to "deliberately sample so that the same categories are prevalent in both".
https://www.quora.com/Whats-the-best-thing-to-do-about-a-categorical-variable-that-generates-a-lot-more-features-in-the-test-set-then-the-training-set-when-one-hot-encoded
4. Retry the train/test split with a different random state that doesn't have the same problem: https://www.kaggle.com/discussions/getting-started/50008
5. Recombine the test and train sets with a designated categorical column, where a '0' means the data belongs to the training set and a '1' means the data belongs to the test set.  Then, do One-Hot-Encoding on the entire data:
https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f
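
The differing column counts printed earlier (524 versus 510) are what one would expect when the transformer is refit on each set separately. The sketch below, on hypothetical miniature frames (`toy_train`, `toy_test`), fits the transformer on the training rows only and reuses it on the test rows, so both matrices share the same columns and the unseen zone is encoded as all zeros. `sparse_threshold=0` is set only to force a dense result that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature sets; zone 27 appears only in the test set.
toy_train = pd.DataFrame({"PULocationID": [142, 236, 166, 142],
                          "trip_distance": [3.8, 2.1, 0.97, 1.09]})
toy_test = pd.DataFrame({"PULocationID": [236, 27],
                         "trip_distance": [4.3, 19.6]})

preparer = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["PULocationID"]),
        ("scale", StandardScaler(), ["trip_distance"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preparer.fit_transform(toy_train)  # learn categories and scaling from train only
test_matrix = preparer.transform(toy_test)        # reuse them; do not refit on the test set

print(train_matrix.shape, test_matrix.shape)  # same number of columns in both
print(test_matrix[1, :3])                     # unseen zone 27 -> all zeros
```

A `Pipeline` does the same thing implicitly: `fit` calls `fit_transform` on the training data, and `predict` calls only `transform` on the data it is given.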


First, we take a peek at the offending test data:

In [12]:
test_categories_absent_in_training_set = []
for col in ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']:
  # Compare the training and test sets one column at a time for each of the categories.
  # Sets cannot have duplicates.  Treat both columns as a set and take the set difference.
  # to see which categories for a given column are present only in the test set.
  test_categories_absent_in_training_set.append ({col: set (features_test.loc [:, col]) - set (features_train.loc [:, col])})

# Pickup Location ID's 27 and 245 are unaccounted.  No other columns have unaccounted data.
print (test_categories_absent_in_training_set)

test_rows_with_unaccounted_categories = features_test.loc [(features_test['PULocationID'] == 27) | (features_test['PULocationID'] == 245)]
print ('----------------------------------------------------------')
print ('Table Rows of PULocation IDs present in the test set only:')
print ('----------------------------------------------------------')
print (test_rows_with_unaccounted_categories)

[{'VendorID': set()}, {'payment_type': set()}, {'PULocationID': {27, 245}}, {'DOLocationID': set()}]
----------------------------------------------------------
Table Rows of PULocation IDs present in the test set only:
----------------------------------------------------------
        VendorID  trip_distance  payment_type  PULocationID  DOLocationID  \
420613         2           0.71             2           245           245   
672619         2           5.08             1           245           214   
291171         1          19.60             1            27           186   

        trip_duration  
420613       7.766667  
672619      17.150000  
291171      97.983333  


It appears only 3 out of 598,107 test rows are left unaccounted.  The dataset covers taxi trips for January 2022 only, so it is not going to grow or update over time; taxi trips taken in February 2022 are published as their own separate dataset.  Therefore, I believe the impact of the 3 offending rows in the test set on the overall model will be insignificant, and they are worth ignoring by our One-Hot Encoder.  The best solution will, of course, vary from dataset to dataset.

For this dataset, let us proceed with option 1 in my above reasoning.

In [13]:
# Create a pipeline object containing the column transformations and regression
# model.
linear_regression_pipeline = Pipeline ([('prepare_all_columns', all_columns_preparer), ('linear_regression', LinearRegression())])

In [14]:
# Fit the pipeline on the training data.
linear_regression_pipeline.fit (features_train, target_train)

In [15]:
# Make predictions on the test data.
# As discussed above, just ignore the 3 unaccounted rows of the test data.
linear_regression_error = mean_absolute_error (target_test, linear_regression_pipeline.predict (features_test))
print (f'mean absolute error of linear regression pipeline: {linear_regression_error}')

mean absolute error of linear regression pipeline: 3.668963770855991


With the power of linear regression, our mean absolute error has decreased to $3.67, which significantly beats the baseline.

Random Forest Regression and Linear Regression are two commonly used regression algorithms, each with its own advantages and suitability for different scenarios. Random Forest Regression offers several advantages over Linear Regression, including:

1. **Non-linearity**: Random Forest Regressor is capable of capturing non-linear relationships between features and the target variable. In contrast, Linear Regression assumes a linear relationship between the features and the target. When faced with non-linear relationships or complex feature interactions, Random Forest Regressor can provide more accurate predictions.

2. **Robustness to Outliers**: Random Forest Regressor is generally more robust to outliers compared to Linear Regression. Outliers can disproportionately impact the coefficients and predictions of Linear Regression models. However, as an ensemble of decision trees, Random Forest Regressor can mitigate the effect of outliers by averaging predictions from multiple trees.

3. **Feature Importance**: Random Forest Regressor provides a measure of feature importance, which helps identify the most influential features for making predictions. This information is useful for feature selection, understanding the underlying relationships in the data, and gaining insights into the problem domain. Unlike Linear Regression, which provides coefficient values indicating the direction and magnitude of relationships, Random Forest Regressor explicitly highlights feature importance.

4. **Handling of Categorical Variables**: Decision trees split on thresholds rather than fitting coefficients, so a Random Forest Regressor can be trained on integer-coded categorical variables without one-hot encoding, which is convenient when working with mixed data types. Note that Scikit-Learn's implementation treats such codes as ordered numbers rather than as true categories. In contrast, Linear Regression generally requires categorical variables to be encoded or transformed before use.

5. **Handling of High-Dimensional Data**: Random Forest Regressor can handle datasets with a large number of features (high dimensionality) by automatically selecting subsets of features during the construction of individual decision trees. This reduces the risk of overfitting, which is a concern with Linear Regression when dealing with high-dimensional data.

6. **Resistance to Multicollinearity**: Random Forest Regressor is less affected by multicollinearity, which occurs when predictor variables are highly correlated. In Linear Regression, highly correlated features can lead to unstable coefficient estimates, making it challenging to interpret the individual effects of each feature. Random Forest Regressor, as an ensemble approach, is less impacted by multicollinearity because each tree is built independently.
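
Points 1 and 3 can be illustrated on synthetic data where the target depends on one feature through a squared term. This is a sketch under that assumption; the data and variable names are invented for illustration and say nothing about the taxi dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
# The target depends non-linearly on feature 0 and not at all on feature 1.
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=20, random_state=3).fit(X_train, y_train)

linear_mae = mean_absolute_error(y_test, linear.predict(X_test))
forest_mae = mean_absolute_error(y_test, forest.predict(X_test))
print(f"linear MAE: {linear_mae:.3f}, forest MAE: {forest_mae:.3f}")

# Impurity-based importances sum to 1; feature 0 should dominate here.
print(forest.feature_importances_)
```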

Here are your tasks:

  1. Build a Random Forest Regressor model using Scikit-Learn's [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and train it on the train data.

  2. Evaluate the performance of the model on the test data using mean absolute error as a metric. Mess around with various input parameter configurations to see how they affect the model. Can you beat the performance of the linear regression model?

In [19]:
# Build random forest regressor model
# Remember, Random Forest Regressors use 100 estimators by default, but that crashed the RAM on the default Google Compute Engine.
# I want my code to perform on the default computing engine.  No "but it works on my machine" or similar excuses allowed here.
# 50 estimators accomplishes the intended mission.
random_forest_regressor_model = RandomForestRegressor (n_estimators = 50)

In [20]:
# Fit the forest and make predictions on the test data.  No Pipeline is used here: the forest is trained on the raw feature
# columns, so the categorical IDs are treated as plain integers, and tree-based models do not need feature scaling.
random_forest_regressor_model.fit (features_train, target_train)
random_forest_error = mean_absolute_error (target_test, random_forest_regressor_model.predict (features_test))
print (f'mean absolute error of random forest regression with 50 trees: {random_forest_error}')

mean absolute error of random forest regression with 50 trees: 2.214422636066241


With 50 estimators, the mean absolute error is approximately $2.21. This is already significantly better than the linear regression model above.

I'd like to mess around with the 'criterion' parameter of this model in a future project.  The reason I'm not using it here is that it would render this model incomparable to the linear regression and baseline models above because we use mean absolute error for both of them.

Let's cut down the number of estimators from 50 to 10 for performance and see if we can still beat the linear regression model.  Let's also constrain 'max_depth' to 3 levels because shallower trees are less likely to overfit the data and faster to build.  We'll also consider at most 2 features at each split ('max_features' = 2) instead of the default of 1.0, which means all features.

In [21]:
random_forest_10 = RandomForestRegressor (n_estimators = 10, max_depth = 3, max_features = 2)
random_forest_10.fit (features_train, target_train)
error_10 = mean_absolute_error (target_test, random_forest_10.predict (features_test))
print (f'mean absolute error of random forest regression with 10 trees: {error_10}')

mean absolute error of random forest regression with 10 trees: 4.355286894997856


This code ran in only 7 seconds, but it performs worse than both the 50-estimator forest and the linear regression model above.  Let's increase the estimators back to 50, but keep the trees small to see how it compares.

In [22]:
small_trees_50 = RandomForestRegressor (n_estimators = 50, max_depth = 3, max_features = 2)
small_trees_50.fit (features_train, target_train)
error_50 = mean_absolute_error (target_test, small_trees_50.predict (features_test))
print (f'mean absolute error of random forest regression with 50 trees of max depth 3: {error_50}')

mean absolute error of random forest regression with 50 trees of max depth 3: 4.024090767944009


Unfortunately, even with 50 estimators, explicitly stating 'max_depth' and 'max_features' as different from the default doesn't seem to do us any favors.

Let's see how the random forest pans out with all the defaults ('max_depth' = None and 'max_features' = 1.0) except that we have only 10 trees instead of 50 or 100.

In [23]:
random_forest_default_10 = RandomForestRegressor (n_estimators = 10)
random_forest_default_10.fit (features_train, target_train)
error_default_10 = mean_absolute_error (target_test, random_forest_default_10.predict (features_test))
print (f'mean absolute error of random forest regression with all defaults but only 10 trees: {error_default_10}')

mean absolute error of random forest regression with all defaults but only 10 trees: 2.2857094552969848


The mean absolute error for only 10 trees is $2.29, only slightly worse than the default random forest with 50 trees.  Even with just a few estimators, a Random Forest Regressor beats Linear Regression.

Hyperparameter tuning plays a critical role in machine learning model development. It involves selecting the optimal values for the hyperparameters, which are configuration settings that control the behavior of the learning algorithm. Here's why hyperparameter tuning is so important in ML:

1. **Optimizing Model Performance**: The choice of hyperparameters can significantly impact the model's performance. By fine-tuning the hyperparameters, we can improve the model's accuracy, precision, recall, or other performance metrics. It helps to extract the maximum predictive power from the chosen algorithm and ensures that the model is well-suited to the specific problem at hand.

2. **Avoiding Overfitting and Underfitting**: Hyperparameter tuning helps strike a balance between overfitting and underfitting.

3. **Exploring Model Complexity**: Hyperparameter tuning enables us to explore the complexity of the model. For instance, in algorithms like decision trees or neural networks, we can adjust the number of layers, the number of neurons, or the maximum depth of the tree. By systematically modifying these hyperparameters, we can understand how different levels of complexity impact the model's performance and find the right balance between simplicity and complexity.

Note that there are multiple approaches to hyperparameter tuning.

While grid search is the easiest to understand and implement, Bayesian search has several advantages over grid search for hyperparameter tuning:

1. **Efficiency**: Bayesian search is generally more efficient than grid search. Grid search explores all possible combinations of hyperparameter values, which can be computationally expensive and time-consuming, especially when dealing with a large number of hyperparameters or a wide range of values. Bayesian search, on the other hand, intelligently selects the next hyperparameter configuration to evaluate based on the results of previous evaluations. It focuses on areas of the hyperparameter space that are more likely to yield better performance, reducing the number of evaluations needed.

2. **Flexibility**: Bayesian search is flexible in handling continuous and discrete hyperparameters. It can handle both types of hyperparameters naturally and effectively. In contrast, grid search is more suitable for discrete hyperparameters but may struggle with continuous ones, as it requires discretization or defining a finite set of values to search over.

3. **Adaptive Search**: Bayesian search adapts its search strategy based on the results of previous evaluations. It maintains a probability distribution over the hyperparameter space, updating it with each evaluation. This allows it to dynamically allocate more evaluations to promising regions and explore unexplored areas. In contrast, grid search follows a fixed and predefined search grid, regardless of the results of previous evaluations.

4. **Better Convergence**: Bayesian search has the potential to converge to the optimal hyperparameter configuration more quickly.
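
Bayesian search is usually provided by third-party libraries rather than by Scikit-Learn itself, so no Bayesian example is given here. As a stand-in, the sketch below uses Scikit-Learn's `RandomizedSearchCV`, which is not Bayesian and does not adapt to earlier results, but does share one of the advantages above: it samples a fixed budget of configurations from ranges instead of enumerating a grid. The data is synthetic and the ranges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))   # synthetic features
y = X[:, 0] ** 2 + X[:, 1]              # synthetic target

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=3),
    param_distributions={
        "n_estimators": randint(5, 30),       # integers sampled from [5, 30)
        "max_depth": randint(2, 10),
        "min_samples_split": randint(2, 50),
    },
    n_iter=5,                              # evaluate only 5 sampled configurations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=3,
)
search.fit(X, y)
print(search.best_params_)
```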

Here are your tasks:

  1. Perform a grid-search on a Random Forest Regressor model. Only search the space for the parameters 'n_estimators', 'max_depth', and 'min_samples_split'. Note, this can take some time to run. Make sure you set reasonable boundaries for the search space. Use Scikit-Learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) method.

  2. After you've identified the best parameters, train a random forest regression model using these parameters on the full training data.

  3. Evaluate the model from the previous step using the test data. How does your model perform?
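
Before launching a grid search, it can help to count how many model fits it implies. The sketch below does this with `ParameterGrid` for the two grids used in this notebook, assuming `GridSearchCV`'s default of 5 cross-validation folds.

```python
from sklearn.model_selection import ParameterGrid

large_grid = {"n_estimators": [10, 20, 30, 40, 50],
              "max_depth": [2, 4, 8, 16],
              "min_samples_split": [2, 10, 50]}
small_grid = {"n_estimators": [10, 30, 50],
              "max_depth": [4, 8],
              "min_samples_split": [2, 100]}

folds = 5  # GridSearchCV's default number of cross-validation folds

for name, grid in (("large", large_grid), ("small", small_grid)):
    candidates = len(ParameterGrid(grid))
    print(f"{name}: {candidates} candidates, {candidates * folds} fits plus one final refit")
```

Each candidate is trained once per fold, so the larger grid requires 300 forest fits on roughly 1.4 million rows each, while the smaller grid requires 60.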

In [24]:
# Define the hyperparameters to tune.
# Given our above experiment with Random Forest Models, we can use Grid Search to systematically find the optimal number of
# trees between 10 and 50.  We can also test our hypothesis as to whether giving our random forest a higher 'max_depth' will
# yield a more accurate model.  Also consider a few values for 'min_samples_split', the minimum number of training samples
# a tree node must contain before it can be split into two children.
# KRISTIAN_NOTE - I tried a grid of 60 parameter combinations and the code was not done even after 5 hours and 48 minutes.
# random_forest_params = {'n_estimators': [10, 20, 30, 40, 50], 'max_depth': [2, 4, 8, 16], 'min_samples_split': [2, 10, 50]}
# Cutting down to 12 combinations:
random_forest_params = {'n_estimators': [10, 30, 50], 'max_depth': [4, 8], 'min_samples_split': [2, 100]}
grid_search_forest = GridSearchCV (\
  estimator = RandomForestRegressor(),\
  param_grid = random_forest_params\
)

In [25]:
# Perform grid search to find the best hyperparameters. This could take a while.
grid_search_forest.fit (features_train, target_train)

In [26]:
# Get the best model and its parameters.
best_forest = grid_search_forest.best_params_
print (best_forest)
best_max_depth = best_forest['max_depth']
best_min_samples_split = best_forest['min_samples_split']
best_n_estimators = best_forest['n_estimators']

{'max_depth': 8, 'min_samples_split': 2, 'n_estimators': 50}


In [27]:
# Fit the best regressor on the training data.
best_forest_model = RandomForestRegressor (\
  n_estimators = best_n_estimators,\
  min_samples_split = best_min_samples_split,\
  max_depth = best_max_depth,\
)
best_forest_model.fit (features_train, target_train)

In [28]:
# Make predictions on the test data
best_random_forest_error = mean_absolute_error (target_test, best_forest_model.predict (features_test))
print (f"mean absolute error of best random forest model from grid search: {best_random_forest_error}")

mean absolute error of best random forest model from grid search: 2.584159483860037
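
The tuned forest's error of about \$2.58 is higher than the \$2.21 of the unconstrained 50-tree forest, which is consistent with the grid capping 'max_depth' at 8. Two further details may be worth checking. First, `GridSearchCV` without a `scoring` argument ranks regressors by their default score (R^2), not by mean absolute error. Second, with the default `refit=True`, the search already retrains the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`. The sketch below shows both on synthetic data; the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))   # synthetic features
y = X[:, 0] ** 2 + X[:, 1]              # synthetic target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=3),
    param_grid={"n_estimators": [10, 30], "max_depth": [4, 8]},
    scoring="neg_mean_absolute_error",  # select by MAE rather than the default R^2
    cv=3,
)
search.fit(X_train, y_train)

# best_estimator_ has already been refit on the full training data.
best_model = search.best_estimator_
test_mae = mean_absolute_error(y_test, best_model.predict(X_test))
print(search.best_params_, round(test_mae, 3))
```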
