# **05. Model Training and Evaluation**
*This notebook will focus on training the machine learning model (e.g., RandomForestRegressor) and evaluating its performance using metrics like RMSLE.*

## Objectives

* Train a `RandomForestRegressor` on the bulldozer sales data and evaluate how well it predicts `SalePrice`.

## Inputs

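* The processed dataset `data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet` and the raw files `Train.csv`, `Valid.csv`, `ValidSolution.csv`, and `TrainAndValid.csv` in `data/raw/bluebook-for-bulldozers/`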

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Execution Timestamp

Purpose: This code block prints a timestamp recording when the notebook was last run
- Helps monitor when the analysis was last performed
- Shows which run produced the outputs below
- Useful for debugging and version control

In [2]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-19 12:25:15.931248


# Project Directory Structure and Working Directory

**Purpose: This section documents the project organization and sets the working directory**
- Describes a standardized project structure for data science workflows
- Documents the purpose of each directory for team collaboration
- Gets the current working directory for file path management

## Key Components:
1. `data/` stores all datasets (raw, processed, interim)
2. `src/` contains all source code (data preparation, models, utilities)
3. `notebooks/` holds Jupyter notebooks for experimentation
4. `results/` stores output files and visualizations

## Project Root Structure

- **`data/`** - Where all your datasets live
    - `raw/` - Original, untouched data
    - `processed/` - Cleaned and prepared data
    - `interim/` - Temporary data files
- **`src/`** - Your source code
    - `data_prep/` - Code for preparing data
    - `models/` - Your ML models
    - `utils/` - Helper functions
- **`notebooks/`** - Jupyter notebooks for experiments
- **`results/`** - Model outputs and visualizations

## Setting Up Working Directory
This code block sets up the working environment by:
- Changing to the project directory where our code and data files are located
- Verifying the current working directory to ensure we're in the right place

In [3]:
import os

# Move to the desired directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2')

# Get the current directory to verify the change
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2'

## Set Working Directory to Project Root
**Purpose: Changes the current working directory to the parent directory**
- Gets the folder one level above the current one
- Makes sure all file locations work correctly throughout the project
- Keeps files and folders organized in a clean way

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


## Get Current Working Directory
**Purpose: Sets the working directory explicitly, then retrieves and stores its path**
- Changes to the folder that contains the project
- Saves this location in a variable called `current_dir` so we can use it later
- Helps us find and work with files in the right place

In [5]:
import os

# Change the current working directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository')

# Get the current working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository'
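
The cells above rely on absolute paths that only exist on one machine. As a hedged alternative, the sketch below locates the project root by walking upwards until it finds a `data/` folder, then builds file paths relative to it. The helper names `find_project_root` and `processed_file`, and the choice of `data` as the marker folder, are assumptions made for this illustration and are not part of the notebook. A temporary folder stands in for the real project so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    Hypothetical helper for this sketch; the notebook itself uses os.chdir.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


def processed_file(root: Path, filename: str) -> Path:
    """Build a path inside data/processed/ without hard-coding the machine."""
    return root / "data" / "processed" / filename


# Demonstrate on a throwaway folder that mimics the project layout
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "demo_project"
    (project / "data" / "processed").mkdir(parents=True)
    (project / "notebooks").mkdir()

    found = find_project_root(project / "notebooks")
    relative = processed_file(found, "example.parquet").relative_to(found)
    print(relative.as_posix())  # data/processed/example.parquet
```

In a notebook, calling `find_project_root(Path.cwd())` would play the role of the `os.chdir` calls, with the advantage that the same cell works on another computer as long as the folder layout matches.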

---

# **Import Essential Data Science Libraries and Check Versions**

**Purpose: This code block imports fundamental Python libraries for data analysis and visualization**
- `pandas`: For data manipulation and analysis
- `numpy`: For numerical computations
- `matplotlib`: For creating visualizations and plots

**The version checks help ensure:**
- *Code compatibility across different environments*
- *Reproducibility of analysis*
- *Easy debugging of version-specific issues*


In [6]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0


# **Importing and Displaying the Processed Bulldozer Dataset**

This code serves three main purposes:

- Imports pandas for data manipulation
- Loads our preprocessed bulldozer dataset from a Parquet file that contains cleaned data with properly encoded categorical values and filled missing values
- Displays the first few rows of the data to verify successful loading

---

In [7]:
import pandas as pd

# Define the file path
file_path = "C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet"

# Load the Parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Display the first few rows of the DataFrame
print(df.head())

   SalesID  SalePrice  MachineID  ModelID  datasource  auctioneerID  YearMade  \
0  1139246    66000.0     999089     3157         121           3.0      2004   
1  1139248    57000.0     117657       77         121           3.0      1996   
2  1139249    10000.0     434808     7009         121           3.0      2001   
3  1139251    38500.0    1026470      332         121           3.0      2001   
4  1139253    11000.0    1057373    17311         121           3.0      2007   

   MachineHoursCurrentMeter  UsageBand  fiModelDesc  ...  \
0                      68.0          2          963  ...   
1                    4640.0          2         1745  ...   
2                    2838.0          1          336  ...   
3                    3486.0          1         3716  ...   
4                     722.0          3         4261  ...   

   Undercarriage_Pad_Width_is_missing  Stick_Length_is_missing  \
0                                   1                        1   
1                   

## Loading the Preprocessed Bulldozer Dataset

This code reads our previously processed bulldozer dataset from a Parquet file. The dataset contains:

- Cleaned and properly formatted data
- Encoded categorical values
- Filled missing values

In [8]:
# Read in preprocessed dataset
df_tmp = pd.read_parquet(path="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet",
                        engine="auto")


### Check for Missing Values

This code checks if there are any missing values in our data. It:

- Calculates the total number of missing values across all columns using pandas' isna() and sum() functions
- Provides informative feedback based on the result:
    - If no missing values are found (total = 0), confirms we can proceed with model building
    - If missing values exist, suggests reviewing our data preprocessing steps

In [9]:
# Check total number of missing values
total_missing_values = df_tmp.isna().sum().sum()

if total_missing_values == 0:
    print(f"[INFO] Total missing values: {total_missing_values} - Great! Let's build a model!")
else:
    print(f"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?")

[INFO] Total missing values: 0 - Great! Let's build a model!


---

# Training Our Machine Learning Model

Now that we've cleaned up our data and made sure everything is in the right format, we're ready to create our price prediction model!

### Starting Small

We'll use a special type of model called a Random Forest that's good at learning patterns from data. Since we have a lot of data (over 400,000 rows), we'll first test our approach on a smaller sample of about 1,000 rows.

### Why Start Small?

Think of it like testing a recipe - it's better to try it with smaller portions first to make sure everything works before making a huge batch. This way, we can quickly fix any problems without wasting time.
 

### Setting Up Our Data

We'll organize our data into two parts:

- `Features (X)`: All the information about the bulldozers
- `Target (y)`: The actual sale prices we want to predict

### Measuring Performance

We'll use the `%%time` cell magic to see how long our model takes to learn. This helps us plan for when we use the full dataset.

## Initialize and Train Random Forest Model

This code prepares and runs our model that will predict bulldozer prices.

- Imports the RandomForestRegressor class from scikit-learn's ensemble module
- Creates a model instance that utilizes all available CPU cores (n_jobs=-1)
- Trains the model using:
    - `Features (X)`: All columns except SalePrice
    - `Target (y)`: The SalePrice column we want to predict

In [10]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)
model.fit(X=df_tmp.drop("SalePrice", axis=1), 
        y=df_tmp.SalePrice)

### What is `n_jobs=-1`?

Setting `n_jobs=-1` in a `RandomForestRegressor` tells it to use all available CPU cores on your computer for training.

#### How it Works

Think of it like this: If your computer has 4 CPU cores, it's like having 4 workers that can process data simultaneously. When you set n_jobs=-1, you're telling the program "Use all available workers (cores)" instead of just one.

#### Benefits

- Faster training times since work is distributed across all cores
- Maximum utilization of your computer's processing power

#### Potential Drawbacks

- May slow down other programs on your computer since all cores are being used
- Could cause system stability issues on computers with limited resources

If you want to be more conservative with resource usage, you can set n_jobs to a specific number (like n_jobs=2 to use just two cores).
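
One way to pick a conservative value is to derive it from the number of cores the machine reports. The sketch below is a minimal illustration of that idea; `conservative_n_jobs` and its `reserve` parameter are hypothetical names introduced here, not something the notebook or scikit-learn provides.

```python
import os


def conservative_n_jobs(reserve: int = 1) -> int:
    """Return a worker count that leaves `reserve` cores free (never below 1)."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)


n_jobs = conservative_n_jobs(reserve=1)
print(f"Detected cores: {os.cpu_count()}, suggested n_jobs: {n_jobs}")
# The value could then be passed as RandomForestRegressor(n_jobs=n_jobs).
```

scikit-learn also accepts negative values below -1 for the same purpose: `n_jobs=-2` uses all cores but one.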

## Training a Sample Model for Initial Testing

This code shows how we test our Random Forest model using a small portion of our bulldozer dataset to make sure everything works correctly.

- Takes a random sample of 1,000 records from our full dataset for quick testing
- Creates a Random Forest model that uses all available CPU cores for efficient processing
- Splits the data into two parts: what we want to use to make predictions (X) and what we want to predict (y)
- Trains the model on this sample data to predict bulldozer prices

In [11]:
%%time

# Sample 1000 samples with random state 42 for reproducibility
df_tmp_sample_1k = df_tmp.sample(n=1000, random_state=42)

# Instantiate a model
model = RandomForestRegressor(n_jobs=-1) # use -1 to utilise all available processors

# Create features and labels
X_sample_1k = df_tmp_sample_1k.drop("SalePrice", axis=1) # use all columns except SalePrice as X values
y_sample_1k = df_tmp_sample_1k["SalePrice"] # use SalePrice as y values (target variable)

# Fit the model to the sample data
model.fit(X=X_sample_1k, 
          y=y_sample_1k) 

CPU times: total: 6.09 s
Wall time: 4.76 s


### Model Training Time Analysis

This output shows how long it took to train our Random Forest model on the 1,000 sample records:

- **CPU times** (total processing time across all CPU cores): 6.09 seconds
- **Wall time** (actual elapsed time): 4.76 seconds

CPU time exceeds wall time because the work was spread across multiple CPU cores, so the cores' combined processing time is larger than the elapsed time.
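
To put a number on that comparison, the sketch below divides CPU time by wall time, using the figures printed by the `%%time` output of the 1,000-row training cell (6.09 s CPU, 4.76 s wall). The ratio is only a rough indicator of how many cores were busy on average, and `parallel_speedup` is a hypothetical helper name used for this illustration. The second half shows how the same two clocks can be read with the standard library outside a notebook.

```python
import time


def parallel_speedup(cpu_seconds: float, wall_seconds: float) -> float:
    """Ratio of CPU time to wall time; values above 1 suggest several cores were busy."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


# Figures reported by the %%time output of the 1,000-row training cell
ratio = parallel_speedup(cpu_seconds=6.09, wall_seconds=4.76)
print(f"CPU/wall ratio: {ratio:.2f}")  # about 1.28

# Outside a notebook, the same two clocks can be read directly
wall_start, cpu_start = time.perf_counter(), time.process_time()
sum(i * i for i in range(100_000))  # stand-in workload
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"wall: {wall_elapsed:.4f}s, cpu: {cpu_elapsed:.4f}s")
```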

## Evaluate Model Performance

This code checks how closely our Random Forest model's predictions match the actual bulldozer prices.

- Uses the `score()` method, which for a regressor returns the coefficient of determination (R²) of the predictions against the actual prices.
- Tests the model on the same 1,000 sample records we used for training.
- Prints the score along with the sample size for easy reference.

In [12]:
# Evaluate the model
model_sample_1k_score = model.score(X=X_sample_1k,
                                    y=y_sample_1k)

print(f"[INFO] Model score on {len(df_tmp_sample_1k)} samples: {model_sample_1k_score}")

[INFO] Model score on 1000 samples: 0.9546332512859753


### Model Performance Results

The **Random Forest model** achieved an R² score of about `0.95` when predicting bulldozer prices on the sample. Here are the key points:

- An R² of 0.95 means the model explains about 95% of the variation in sale prices in this sample. It is not a percentage of correct answers.
- Testing was conducted on a limited sample:
    - Only 1,000 bulldozers were used.
    - All available CPU power was utilized for efficient processing.
- Practical implications:
    - The model was scored on the same rows it was trained on, so this figure is likely optimistic compared with performance on unseen data.
    - Results may vary when applied to the complete dataset.
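
For a regressor, `score()` returns the coefficient of determination (R²) rather than a share of correct answers. The sketch below computes R² by hand in plain Python to show what the number measures. The prices are made up for illustration, and `r_squared` is a hypothetical helper, not scikit-learn's implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # error of predicting the mean
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Made-up sale prices, for illustration only
actual = [10000.0, 20000.0, 30000.0, 40000.0]
perfect = list(actual)
always_mean = [25000.0] * 4

print(r_squared(actual, perfect))      # 1.0
print(r_squared(actual, always_mean))  # 0.0
```

A score of 1.0 means the predictions match exactly, and 0.0 means the model does no better than always predicting the average price.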

## Training the Full Random Forest Model

This code runs our Random Forest model using all our bulldozer data to make better price predictions.

- Measures the training time using a special timer command `%%time`
- Uses all available computer processing power to run the model faster
- Gets the data ready by:
    - Taking out the price information (SalePrice) from the main dataset
    - Keeping the price information separate to use as our target values
- Uses all available bulldozer data to help the model learn how to predict prices accurately

In [13]:
%%time

# Instantiate model
model = RandomForestRegressor(n_jobs=-1) # note: this could take quite a while depending on your machine

# Create features and labels with entire dataset
X_all = df_tmp.drop("SalePrice", axis=1)
y_all = df_tmp["SalePrice"]

# Fit the model
model.fit(X=X_all, 
        y=y_all)

CPU times: total: 35min 10s
Wall time: 7min 37s


## Evaluating Model Performance on Full Dataset

This code block evaluates how well our Random Forest model fits the complete bulldozer dataset. It:

- Uses the `score()` method to compute R² for the predictions
- Calculates performance using both features (`X_all`) and actual prices (`y_all`)
- Prints the final score along with the total number of samples used

This evaluation shows how closely the model fits the entire dataset, rather than just the small sample we tested earlier. Because the model is scored on the same data it was trained on, the result does not tell us how well it predicts prices for bulldozers it has not seen.

In [14]:
# Evaluate the model
model_sample_all_score = model.score(X=X_all,
                                     y=y_all)

print(f"[INFO] Model score on {len(df_tmp)} samples: {model_sample_all_score}")

[INFO] Model score on 412698 samples: 0.9875847209558163
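
The notebook's introduction names RMSLE (root mean squared log error) as an evaluation metric, but the cells above only report the `score()` value. As a hedged illustration of what RMSLE measures, here is a plain-Python sketch. The prices are made up, and `rmsle` is a hypothetical helper written for this example.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) so zero prices are allowed."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if min(min(y_true), min(y_pred)) < 0:
        raise ValueError("this sketch rejects negative values")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up prices: each prediction is off by the same absolute amount (5,000)
cheap = rmsle([10000.0], [15000.0])
expensive = rmsle([100000.0], [105000.0])
print(f"cheap machine: {cheap:.3f}, expensive machine: {expensive:.3f}")
```

Because the error is taken on a log scale, the same 5,000 miss counts for more on the cheaper machine. RMSLE therefore penalizes relative error rather than absolute error.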


---

# **Understanding Data Splitting: A Simple Guide**

When working with machine learning projects, it's crucial to split your data properly. Here's why and how we do it:

### What is Data Splitting?

Think of data splitting like dividing a recipe book into three parts:

- `Training data`: The recipes you practice with
- `Validation data`: The recipes you test yourself on
- `Test data`: The final exam recipes

### Why Time Matters

For projects involving time-based predictions (like our bulldozer price predictions), we need to be extra careful with how we split the data. Random splitting won't work because it mixes up the timeline: the model would be trained on sales that happened after the ones it is evaluated on.

### How We Split the Data

In this project, we organize our data by dates:

- `Training data`: Everything through the end of 2011
- `Validation data`: January 1 to April 30, 2012
- `Testing data`: May 1 to November 2012

This approach ensures we're training our model on past data to predict future prices, just like how we'd use historical prices to guess future ones in the real world.
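
The date-based rule can be sketched as a small function that labels each sale from its date alone. This is a plain-Python illustration, assuming the training window ends on 31 December 2011 and the validation window on 30 April 2012. The constant names, `assign_split`, and the sample dates are made up for this example. The same comparisons would apply to a pandas datetime column.

```python
from datetime import date

# Assumed cut-off dates, based on the split described above
TRAIN_END = date(2011, 12, 31)
VALID_END = date(2012, 4, 30)


def assign_split(sale_date: date) -> str:
    """Label a sale as train, valid, or test purely from its date."""
    if sale_date <= TRAIN_END:
        return "train"
    if sale_date <= VALID_END:
        return "valid"
    return "test"


# Made-up sale dates, for illustration only
sales = [date(2009, 7, 23), date(2011, 12, 31), date(2012, 2, 12), date(2012, 6, 1)]
for sale in sales:
    print(sale.isoformat(), "->", assign_split(sale))
```

Every training sale precedes every validation sale, so the model is never evaluated on a period it has already seen.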

In [15]:
# Import train samples (making sure to parse dates and then sort by them)
train_df = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

# Import validation samples (making sure to parse dates and then sort by them)
valid_df = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv
valid_solution = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/ValidSolution.csv")

# Map valid_solution to valid_df
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])

# Make sure valid_df is sorted by saledate still
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# How many samples are in each DataFrame?
print(f"[INFO] Number of samples in training DataFrame: {len(train_df)}")
print(f"[INFO] Number of samples in validation DataFrame: {len(valid_df)}")

[INFO] Number of samples in training DataFrame: 401125
[INFO] Number of samples in validation DataFrame: 11573


## Loading and Processing Training/Validation Data

This code prepares our bulldozer dataset for machine learning. Here's what it does in simple terms:

- Loads the training data from a file, parses the sale dates, and sorts the rows by date
- Loads the validation data the same way
- Adds the actual bulldozer prices (from `ValidSolution.csv`) to the validation data
- Makes sure all the data is arranged by date, which lets us train on earlier sales and validate on later ones
- Shows us how many bulldozers we have in our training and validation sets

## Examining `Training Data` Sample

This code displays a random sample of 5 rows from our training dataset. This helps us:

- Verify the data is loaded correctly.
- Understand what features we're working with.
- Spot any potential issues in the data structure.

In [16]:
# Let's check out the training DataFrame
train_df.sample(5)

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
162936,1585087,46000,1484446,14308,132,2.0,1999,,,2007-03-22,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
351514,2414765,42500,1709283,16668,136,1.0,2003,2736.0,Low,2008-02-19,...,,,,,,,,,,
361168,2499289,9000,1925694,3170,149,1.0,1989,8359.0,Medium,2010-09-10,...,,,,,,,,,,
399114,6304415,27000,1912526,23743,149,99.0,1996,,,2011-07-20,...,27 inch,None or Unspecified,Manual,None or Unspecified,Triple,,,,,
61818,1326818,15000,54491,1526,132,10.0,1981,,,2004-03-10,...,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,,


## Examining `Validation Data` Sample

This code displays a random sample of 5 rows from our validation dataset to:

- Verify the validation data is structured correctly.
- Compare validation data features with training data.
- Check for any inconsistencies between datasets.

In [17]:
# And how about the validation DataFrame?
valid_df.sample(5)

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,...,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,SalePrice
2057,4286355,2312251,3841,172,1,1997,0.0,,2012-02-12,960F,...,,,,,,,,Standard,Conventional,52000.0
1921,6310576,1853463,9566,149,1,1995,,,2012-02-08,873,...,,,,,,,,,,8500.0
2383,4262904,2305729,3858,172,1,2000,9616.0,Low,2012-02-12,966G,...,,,,,,,,Standard,Conventional,100000.0
5026,1224816,1011258,8022,121,3,1000,2564.0,Low,2012-02-24,270,...,,,,,,,,,,10250.0
2802,6270501,1065735,3380,149,1,1976,,,2012-02-13,16G,...,,,,,,,,,,57500.0


# Creating Time-Based Features from Sale Dates

Here's what this code does with sale dates to help predict bulldozer prices better:

- **Why we need it:**
    - Helps spot patterns in when bulldozers sell for more or less money.
    - Makes it easier to see how prices change over time.
- **What it does:**
    - Breaks down each sale date into useful parts (year, month, and day).
    - Adds helpful details like which day of the week and what time of year it was.
    - Removes the original date to keep things simple.

In [18]:
# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters derived from the date column
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop the original date column (use the parameter, not a hard-coded name)
    df.drop(date_column, axis=1, inplace=True)

    return df

train_df = add_datetime_features_to_df(df=train_df)
valid_df = add_datetime_features_to_df(df=valid_df)

## Viewing Time-Based Features

This code displays a random sample of 5 rows showing our new time features:

- Includes year, month, day, day of week, and day of year.
- Helps verify:
    - Date breakdowns are correct.
    - Temporal data structure is proper.

In [19]:
# Display the last 5 columns (the recently added datetime breakdowns)
train_df.iloc[:, -5:].sample(5)

Unnamed: 0,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
350641,2009,5,6,2,126
355681,2010,6,24,3,175
161843,2000,9,8,4,252
227581,2001,2,17,5,48
101287,1994,4,30,5,120
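
To check a row of this output by hand, the same five values can be derived from a single date with the standard library. The sketch below is a plain-Python illustration in which `date_parts` is a hypothetical helper. It relies on `date.weekday()` counting from Monday as 0, the same convention as the `dayofweek` values shown above.

```python
from datetime import date


def date_parts(sale_date: date) -> dict:
    """Split one date into the five features used in this notebook."""
    return {
        "saleYear": sale_date.year,
        "saleMonth": sale_date.month,
        "saleDay": sale_date.day,
        "saleDayofweek": sale_date.weekday(),            # Monday=0 ... Sunday=6
        "saleDayofyear": sale_date.timetuple().tm_yday,  # 1 January = 1
    }


# 6 May 2009 corresponds to the first row of the sample output above
parts = date_parts(date(2009, 5, 6))
print(parts)
# {'saleYear': 2009, 'saleMonth': 5, 'saleDay': 6, 'saleDayofweek': 2, 'saleDayofyear': 126}
```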


---

# **Trying To Fit A Model On Our `Training Data`**
Let's start by testing our model right away with our data. This approach helps us understand what we're working with quickly.

When we test the model, one of two things will happen:

- If it works: We'll look at the results to see how well it did
- If it doesn't work: We'll learn what we need to fix in our data

To get started, we'll:

- Split our data into two parts:
    - The features (X): All the bulldozer information except the price
    - The target (y): Just the price we want to predict

Then we'll use a special tool called RandomForestRegressor to make price predictions using our training data.

In [20]:
# Split training data into features and labels
# X_train = train_df.drop("SalePrice", axis=1)
# y_train = train_df["SalePrice"]

# Split validation data into features and labels
# X_valid = valid_df.drop("SalePrice", axis=1)
# y_valid = valid_df["SalePrice"]

# Create a model
# model = RandomForestRegressor(n_jobs=-1)

# Fit a model to the training data only
# model.fit(X=X_train,
#          y=y_train)

## Handling Data Type Mismatch in Model Training

We've hit a small roadblock in our model training. Our program is having trouble understanding some of the data because it's expecting numbers but found text values (like the word 'Medium') instead.

Here's what's happening:

- Our model can only work with numbers.
- We found text values in our data (like 'Medium').
- This happened because we loaded our raw data file (`Train.csv`) instead of using our processed version.

The good news is this is a common issue and we know exactly how to fix it!
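
The failing cell is commented out above, so the error itself is not shown. As a rough illustration of the underlying problem, the sketch below shows that Python cannot turn a label such as `'Medium'` into a number, and then lists which columns of a small made-up table contain text. `non_numeric_columns` and the sample rows are assumptions for this example only.

```python
def non_numeric_columns(rows):
    """Return the column names whose non-missing values are not all numbers."""
    offenders = []
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        if any(not isinstance(value, (int, float)) for value in values):
            offenders.append(column)
    return offenders


# A text label cannot be converted to a number
try:
    float("Medium")
except ValueError as err:
    print(f"Conversion failed: {err}")

# Made-up rows shaped like the raw data (None marks a missing value)
rows = [
    {"YearMade": 2003, "MachineHoursCurrentMeter": 2736.0, "UsageBand": "Low"},
    {"YearMade": 1989, "MachineHoursCurrentMeter": 8359.0, "UsageBand": "Medium"},
    {"YearMade": 1996, "MachineHoursCurrentMeter": None, "UsageBand": None},
]
print(non_numeric_columns(rows))  # ['UsageBand']
```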

---

# **Encoding categorical features as numbers using Scikit-Learn**

## Identifying Numerical and Categorical Features

Our code looks at our data and sorts it into two simple groups:

- **Numbers**: Things we can count or measure (like prices and years).
- **Categories**: Words or labels that describe things (like model names or condition types).

This sorting is important because:

- Our computer needs to handle numbers and words differently.
- It helps us prepare our data the right way.
- It makes sure our computer can understand and use all our information.

In [21]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df_tmp = pd.read_csv("data/raw/bluebook-for-bulldozers/TrainAndValid.csv",
                     low_memory=False,
                     parse_dates=["saledate"])

# Feature engineering: Add datetime parameters and drop original saledate column
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayofweek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayofyear"] = df_tmp.saledate.dt.dayofyear
df_tmp.drop("saledate", axis=1, inplace=True)

# Convert object type columns to category
for label, content in df_tmp.items():
    if pd.api.types.is_object_dtype(content):
        df_tmp[label] = df_tmp[label].astype("category")

# Split data into training and testing sets
# (used here only to list the feature types; the next cell re-splits the data without shuffling)
X = df_tmp.drop("SalePrice", axis=1)
y = df_tmp.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Adjust test_size and random_state as needed

# Define numerical and categorical features
numerical_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]

print(f"[INFO] Numeric features: {numerical_features}")
print(f"[INFO] Categorical features: {categorical_features}")


[INFO] Numeric features: ['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleDay', 'saleDayofweek', 'saleDayofyear']
[INFO] Categorical features: ['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls', 'Differential_Type', 'Steering_Controls']


## Converting Categorical Features to Numbers

Our code takes information that's written as text and turns it into numbers so our computer can work with it. 
**Here's why we need to do this:**

- Computers can only understand and work with numbers.
- Some of our information, like equipment models and conditions, is written as text.
- We need to change this text into numbers, but keep its original meaning.

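**The code does these main things:**

- Uses scikit-learn's `OrdinalEncoder` to change text into numbers.
- Converts values to strings first, so missing entries become the text `'nan'` and receive their own code.
- Handles any new or unexpected text in the validation data by marking it as missing (`NaN`).
- Makes these changes to both our training and validation data.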

In [22]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import numpy as np

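# Split data into training and validation sets
# (shuffle=False keeps the rows in their existing order; this is a time-based split only if df_tmp is sorted by sale date)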
X_train, X_valid, y_train, y_valid = train_test_split(
    df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"], test_size=0.2, shuffle=False
)

# Get a list of categorical features
categorical_features = X_train.select_dtypes(include=['category', object]).columns.tolist()

# 1. Create an ordinal encoder 
ordinal_encoder = OrdinalEncoder(categories="auto",
                                 handle_unknown="use_encoded_value",
                                 unknown_value=np.nan,
                                 encoded_missing_value=np.nan) 

# 2. Fit and transform the categorical columns of X_train
X_train_preprocessed = X_train.copy() 
X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str))

# 3. Transform the categorical columns of X_valid 
X_valid_preprocessed = X_valid.copy()
X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str))
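
To make the encoding step concrete, here is a plain-Python mimic of the behaviour configured above: categories are learned from the training values in sorted order, each value is replaced by its position, and a value never seen in training becomes `NaN`. This is a simplified sketch, not scikit-learn's implementation. `fit_categories`, `encode`, and the sample values are made up for illustration.

```python
import math


def fit_categories(values):
    """Learn a sorted list of the distinct values seen in training."""
    return sorted(set(values))


def encode(values, categories):
    """Map each value to its position in `categories`; unseen values become NaN."""
    lookup = {category: float(code) for code, category in enumerate(categories)}
    return [lookup.get(value, math.nan) for value in values]


# Training values after str conversion: the missing entry has become the text "nan"
train_usage = ["Low", "High", "Medium", "nan", "Low"]
categories = fit_categories(train_usage)
print(categories)                       # ['High', 'Low', 'Medium', 'nan']
print(encode(train_usage, categories))  # [1.0, 0.0, 2.0, 3.0, 1.0]

# A label that never appeared in training is encoded as NaN
print(encode(["Low", "Very High"], categories))  # [1.0, nan]
```

Note that the codes follow alphabetical order, so `'High'` (0) comes before `'Low'` (1). The numbers identify categories but do not express a ranking.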

## Display First Few Rows of Training Data

This code displays the first 5 rows of our training dataset (`X_train`) to:

- Verify our data is structured correctly
- Preview the features we'll use for training
- Ensure all columns are present and properly formatted


In [23]:
X_train.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
0,1139246,999089,3157,121,3.0,2004,68.0,Low,521D,521,...,,,,Standard,Conventional,2006,11,16,3,320
1,1139248,117657,77,121,3.0,1996,4640.0,Low,950FII,950,...,,,,Standard,Conventional,2004,3,26,4,86
2,1139249,434808,7009,121,3.0,2001,2838.0,High,226,226,...,,,,,,2004,2,26,3,57
3,1139251,1026470,332,121,3.0,2001,3486.0,High,PC120-6E,PC120,...,,,,,,2011,5,19,3,139
4,1139253,1057373,17311,121,3.0,2007,722.0,Medium,S175,S175,...,,,,,,2009,7,23,3,204


## Display Preprocessed Training Data

This code displays the first 5 rows of our preprocessed training dataset to:

- Verify our categorical features have been successfully converted to numerical values
- Confirm the preprocessing steps were applied correctly
- Check the data format is now suitable for model training

In [24]:
X_train_preprocessed.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
0,1139246,999089,3157,121,3.0,2004,68.0,1.0,868.0,290.0,...,2.0,10.0,7.0,3.0,1.0,2006,11,16,3,320
1,1139248,117657,77,121,3.0,1996,4640.0,1.0,1577.0,507.0,...,2.0,10.0,7.0,3.0,1.0,2004,3,26,4,86
2,1139249,434808,7009,121,3.0,2001,2838.0,0.0,299.0,108.0,...,2.0,10.0,7.0,4.0,5.0,2004,2,26,3,57
3,1139251,1026470,332,121,3.0,2001,3486.0,0.0,3213.0,1295.0,...,2.0,10.0,7.0,4.0,5.0,2011,5,19,3,139
4,1139253,1057373,17311,121,3.0,2007,722.0,2.0,3633.0,1430.0,...,2.0,10.0,7.0,4.0,5.0,2009,7,23,3,204


## Finding Missing Data in Categories

This code helps us find where data is missing in our categories. It does these simple things:

- Looks at just the category columns in our training data
- Counts how many empty spots we have in each column
- Lists them in order, showing which columns have the most missing data first
- Shows us the top 10 problem areas

In [25]:
X_train[categorical_features].isna().sum().sort_values(ascending=False)[:10]

Tip_Control          308533
Enclosure_Type       308533
Engine_Horsepower    308533
Blade_Extension      308533
Blade_Width          308533
Pushblock            308533
Scarifier            308523
Grouser_Tracks       296496
Hydraulics_Flow      296496
Coupler_System       296411
dtype: int64

## Examining Category Encodings

This code shows us the categories learned for the first three categorical columns, helping us:

- Check if our text labels were properly turned into numbers.
- Look at how each label was matched with its number (a label's position in the array is its code).
- See that labels are stored in sorted order, so the codes do not reflect any real-world ranking (for example, `'High'` is 0 and `'Low'` is 1).

In [26]:
# Let's inspect the first three categories
ordinal_encoder.categories_[:3]

[array(['High', 'Low', 'Medium', np.str_('nan')], dtype=object),
 array(['100C', '104', '1066', ..., 'ZX800', 'ZX800LC', 'ZX850H'],
       shape=(4312,), dtype=object),
 array(['10', '100', '104', ..., 'ZX80', 'ZX800', 'ZX850'],
       shape=(1824,), dtype=object)]

## Creating Category-to-Number Mapping Dictionary

This code helps us keep track of how we change text labels into numbers by creating an easy-to-use reference list. It works like a translator:

- Creates a dictionary that maps each category column to its numerical values.
- Preserves the relationship between original text values and their numeric codes.
- Helps us translate numbers back to their original categories when needed.

In [27]:
# Create a dictionary of dictionaries mapping column names and their variables to their numerical encoding
column_to_category_mapping = {}

for column_name, category_values in zip(categorical_features, ordinal_encoder.categories_):
    int_to_category = {i: category for i, category in enumerate(category_values)}
    column_to_category_mapping[column_name] = int_to_category

# Inspect an example column name to category mapping
column_to_category_mapping["UsageBand"]

{0: 'High', 1: 'Low', 2: 'Medium', 3: np.str_('nan')}
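
Going the other way (from a label to its code) only needs the dictionary flipped. A small sketch using the `UsageBand` mapping printed above, with the `np.str_('nan')` entry written as a plain string for simplicity:

```python
# Integer-to-category mapping for UsageBand, copied from the output above
int_to_category = {0: "High", 1: "Low", 2: "Medium", 3: "nan"}

# Flip keys and values to look up the code for a given label
category_to_int = {category: code for code, category in int_to_category.items()}

print(category_to_int["Medium"])
```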

## Converting Numbers Back to Categories

This code takes our number data and changes it back to the original words and labels so we can check if everything was converted correctly.

- Creates a copy of our preprocessed data to preserve the original.
- Uses `inverse_transform` to convert numbers back to their original categories.
- Organizes the data into a readable DataFrame format.
- Shows a random sample of 5 rows to verify the conversion worked correctly.

In [28]:
# Create a copy of the preprocessed DataFrame
X_train_unprocessed = X_train_preprocessed[categorical_features].copy()

# This will return an array of the original untransformed data
X_train_unprocessed = ordinal_encoder.inverse_transform(X_train_unprocessed)

# Turn back into a DataFrame for viewing pleasure
X_train_unprocessed_df = pd.DataFrame(X_train_unprocessed, columns=categorical_features)

# Check out a sample
X_train_unprocessed_df.sample(5)

Unnamed: 0,UsageBand,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
310903,,311B,311,B,,,Small,"Hydraulic Excavator, Track - 11.0 to 12.0 Metr...",Ohio,TEX,...,24 inch,None or Unspecified,None or Unspecified,Yes,Triple,,,,,
91967,,140G,140,G,,,,Motorgrader - 145.0 to 170.0 Horsepower,Tennessee,MG,...,,,,,,,,,,
228650,,K907C,K907,C,,,Large / Medium,"Hydraulic Excavator, Track - 19.0 to 21.0 Metr...",New York,TEX,...,32 inch,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
47121,,580SUPER L,580,SUPER L,,,,Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...,Texas,BL,...,,,,,,,,,,
91359,Low,D3C,D3,C,,,,"Track Type Tractor, Dozer - 20.0 to 75.0 Horse...",Florida,TTT,...,,,,,,None or Unspecified,PAT,None or Unspecified,,


# **Fitting A Model To Our Preprocessed Training Data** 

### Getting Our Model Ready

Now that we've organized our data, we can start training our computer to make predictions. 

## Training Our Random Forest Model

This code gets our bulldozer price prediction system ready to work. It:

- Uses `%%time` to measure how long the training takes.
- Creates a `RandomForestRegressor` that trains on all available CPU cores (`n_jobs=-1`).
- Trains the model using our prepared features (`X_train_preprocessed`) and target values (`y_train`).

In [29]:
%%time

# Instantiate a Random Forest Regression model
model = RandomForestRegressor(n_jobs=-1)

# Fit the model to the preprocessed training data
model.fit(X=X_train_preprocessed,
          y=y_train)

CPU times: total: 23min 45s
Wall time: 4min 46s


## Evaluating Model Performance on Validation Set

This code checks how well our model works by testing it on data it hasn't seen before (the validation set). We can:

- See how long the testing takes using the `%%time` command.
- Get an R² score that shows how closely our predictions match the real prices. The best possible score is 1.0, and a model that does worse than always guessing the average price scores below 0.
- Compare models: the closer the score is to 1, the better the model is at predicting bulldozer prices.

In [30]:
%%time

# Check model performance on the validation set
# model.score(X=X_valid,
#            y=y_valid)

CPU times: total: 0 ns
Wall time: 0 ns


### Error Explanation: Unable to Process Text Data
Scoring the model on the raw `X_valid` fails, which is why the call above is commented out:

- The model is trying to work with text data (specifically the word 'Medium') but it can only handle numerical values
- This is like trying to do math with words instead of numbers - it just doesn't work
- The validation data needs to be preprocessed (converted to numbers) in the same way as the training data before the model can use it
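
A minimal sketch of that idea on toy data, assuming an `OrdinalEncoder` like the one used for the training set (the `handle_unknown` settings here are an assumption and may differ from the notebook's earlier cells). The encoder is fitted on training rows only and then reused, unchanged, on validation rows:

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column data; the real notebook encodes many categorical columns
toy_train = [["High"], ["Low"], ["Medium"]]
toy_valid = [["Medium"], ["Very High"]]  # "Very High" never appears in training

# Assumed settings: unseen categories get the placeholder code -1
toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_encoder.fit(toy_train)  # learn the label-to-number mapping from training data only

# Reuse the fitted encoder so validation labels get the same codes as training labels
toy_valid_encoded = toy_encoder.transform(toy_valid)
print(toy_valid_encoded)
```

Fitting a second encoder on the validation data instead would risk assigning different numbers to the same labels.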

In [31]:
%%time

# Check model performance on the validation set
model.score(X=X_valid_preprocessed,
            y=y_valid)

CPU times: total: 11.3 s
Wall time: 7.44 s


0.7827384322007986
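
For a regressor, scikit-learn's `score()` returns the coefficient of determination, R². A hand-rolled sketch of the formula on made-up numbers (the helper name `r_squared` is hypothetical) shows why 1.0 is the ceiling and why the score can drop below 0:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

prices = [10_000, 20_000, 30_000]  # invented sale prices

perfect = r_squared(prices, [10_000, 20_000, 30_000])    # every prediction exact
mean_only = r_squared(prices, [20_000, 20_000, 20_000])  # always guess the average
reversed_ = r_squared(prices, [30_000, 20_000, 10_000])  # worse than the average
print(perfect, mean_only, reversed_)
```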

### Model Performance Results

Our bulldozer price prediction model shows:

- An R² score of 0.783 (78.3%) on the validation set.
- In other words, it explains roughly 78% of the variation in validation-set sale prices.
- This gives us a baseline to compare later models against.

#### Why This Score Matters

- The 78.3% score comes from new, unseen data.
- While an earlier test showed a higher score of 98.75%, that score was less reliable since it came from data the model had already learned from.
- **The `78.3%` score is a more accurate representation of how the model will likely perform in real-world predictions.**

## Checking Training Set Performance

This code shows us how well the model learned from its training data:

- Uses %%time to track how long the evaluation takes
- Applies model.score() to compare predicted prices against actual training data prices
- Helps us understand if the model learned the training patterns effectively

In [32]:
%%time

# Check model performance on the training set
model.score(X=X_train_preprocessed,
            y=y_train)

CPU times: total: 30.7 s
Wall time: 16.3 s


0.9872175284152153

### Model Performance Results

Here is how the model scored on the data it was trained on:

##### Training Set Performance

- **The model achieved an R² score of `98.72%` on the training data.**
- This high score shows the model learned the patterns in the training data.
- Some drop from training to validation is expected, but a gap this large (98.72% versus 78.3%) can be a sign that the model is partly memorizing the training data (overfitting).

# **Building an evaluation function**

To ensure our machine learning model performs well, we need a way to measure its accuracy. We'll create an evaluation function that helps us:

- Compare predicted prices against actual prices
- Track model performance consistently across different tests
- Use standard regression metrics, including the one the Kaggle competition is scored on

### Key Evaluation Metrics

We'll use two main metrics to evaluate our model:

### 1. Root Mean Squared Log Error (RMSLE)

This is the official metric used in the Kaggle Bulldozer competition. It measures relative errors rather than absolute ones, meaning:

- A $100 error on a $1,000 prediction (10%) is considered worse than
- A $100 error on a $10,000 prediction (1%)

### 2. Mean Absolute Error (MAE)

This gives us a different perspective by measuring absolute differences between predictions and actual values.
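
To make the contrast concrete, here is a small sketch with hand-written helpers (`rmsle` and `mae` are illustrative names; the notebook itself uses scikit-learn's versions). It assumes the usual RMSLE definition based on `log(1 + value)`. The same $100 miss gives the same MAE but a much larger RMSLE on the cheaper machine:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error using log(1 + x), as in the usual definition."""
    squared_log_errors = [(math.log1p(p) - math.log1p(t)) ** 2
                          for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

def mae(y_true, y_pred):
    """Mean absolute error in the same units as the target (dollars)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# A $100 over-prediction on a $1,000 sale versus on a $10,000 sale
cheap_true, cheap_pred = [1_000], [1_100]
pricey_true, pricey_pred = [10_000], [10_100]

print(mae(cheap_true, cheap_pred), mae(pricey_true, pricey_pred))      # identical
print(rmsle(cheap_true, cheap_pred), rmsle(pricey_true, pricey_pred))  # cheap is larger
```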

### Implementation Details

Our evaluation function will calculate:

- MAE (Mean Absolute Error) - lower is better
- RMSLE (Root Mean Squared Log Error) - lower is better
- R² Score (Coefficient of Determination) - higher is better

We'll use scikit-learn's built-in metrics and the model's predict() method to generate these scores.

In [33]:
# Create evaluation function (the competition uses Root Mean Square Log Error)
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error

# Create function to evaluate our model
def show_scores(model, 
                train_features=X_train_preprocessed,
                train_labels=y_train,
                valid_features=X_valid_preprocessed,
                valid_labels=y_valid):
    
    # Make predictions on train and validation features
    train_preds = model.predict(X=train_features)
    val_preds = model.predict(X=valid_features)

    # Create a scores dictionary of different evaluation metrics
    scores = {"Training MAE": mean_absolute_error(y_true=train_labels, 
                                                  y_pred=train_preds),
              "Valid MAE": mean_absolute_error(y_true=valid_labels, 
                                               y_pred=val_preds),
              "Training RMSLE": root_mean_squared_log_error(y_true=train_labels, 
                                                            y_pred=train_preds),
              "Valid RMSLE": root_mean_squared_log_error(y_true=valid_labels, 
                                                         y_pred=val_preds),
              "Training R^2": model.score(X=train_features, 
                                          y=train_labels),
              "Valid R^2": model.score(X=valid_features, 
                                       y=valid_labels)}
    return scores

## Evaluating Model Performance

This code section serves to test our evaluation function and display comprehensive model performance metrics. It:

- Calls our custom show_scores() function with our trained model.
- Stores the results in model_scores variable.
- Displays various metrics including MAE, RMSLE, and R² scores for both training and validation data.

In [34]:
# Try our model scoring function out
model_scores = show_scores(model=model)
model_scores

{'Training MAE': 1601.8474971680223,
 'Valid MAE': 7350.989092116953,
 'Training RMSLE': 0.08517829499259015,
 'Valid RMSLE': 0.34490646690183985,
 'Training R^2': 0.9872175284152153,
 'Valid R^2': 0.7827384322007986}

### Model Performance Results

##### Training Data Results (How well it learned)

- Price predictions were off by about $1,600 on average (MAE).
- The R² score was 98.7% on the data it trained with.
- Very small RMSLE of 0.085 (closer to 0 is better).

##### Real-World Performance (New Data)

- Price predictions were off by about $7,350 on average (MAE).
- The R² score was 78.3% on new data.
- Higher RMSLE of 0.345 (expected for new data).

##### What This Means

- The model learned its training data very well (R² of 98.7%).
- When faced with new data, it's still quite good (R² of 78.3%).
- This difference is normal - models usually perform better on data they've seen before.
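
One quick way to size that difference is to divide each validation error by its training counterpart. A sketch using the numbers printed above (the ratio is just an illustrative diagnostic, not a standard metric):

```python
# Scores copied from the show_scores() output above
model_scores = {"Training MAE": 1601.8474971680223,
                "Valid MAE": 7350.989092116953,
                "Training RMSLE": 0.08517829499259015,
                "Valid RMSLE": 0.34490646690183985}

# How many times larger each validation error is than its training error
mae_ratio = model_scores["Valid MAE"] / model_scores["Training MAE"]
rmsle_ratio = model_scores["Valid RMSLE"] / model_scores["Training RMSLE"]
print(f"MAE ratio: {mae_ratio:.2f}, RMSLE ratio: {rmsle_ratio:.2f}")
```

Ratios well above 1 mean the model fits its training data much more closely than new data.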

# **Tuning Our Model's Hyperparameters**
## Optimizing Model Training Speed

When working with large datasets, model training and hyperparameter tuning can be time-consuming. Here's how we'll speed up our experiments:

- Challenge: Training on our full dataset (~400,000 rows) is slow. The single fit above took almost 5 minutes of wall time, and tuning means repeating that many times.
- Solution: We'll use a smaller sample of the training data for initial hyperparameter tuning.

We can achieve this by using the `max_samples` parameter in RandomForestRegressor:

- It controls how many samples each decision tree sees during training.
- Setting it to 10,000 means each tree is built from only 10,000 randomly drawn rows instead of all ~400,000.
- Each tree then sees 40x less data, which makes training much faster, though we expect slightly lower accuracy.

This approach allows us to quickly test different model configurations before training on the full dataset.

## Optimizing Model Training with Sample Size Control

This code makes the training process faster by using less data. It works like this:

- Creates a RandomForestRegressor with controlled sample size (10,000 samples)
- Uses parallel processing (n_jobs=-1) to utilize all available CPU cores
- Significantly reduces training time while maintaining reasonable model performance

In [35]:
%%time

# Change max samples in RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, # this is the default
                              n_jobs=-1,
                              max_samples=10000) # each estimator sees max_samples (the default is to see all available samples)

# Cutting down the max number of samples each tree can see improves training time
model.fit(X_train_preprocessed, 
          y_train)

CPU times: total: 1min 10s
Wall time: 15.8 s
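
Comparing the two recorded wall times gives a rough sense of the gain on this machine. A back-of-the-envelope sketch (timings copied from the `%%time` outputs in this notebook; they will vary between runs and hardware):

```python
# Wall times from the %%time outputs, converted to seconds
full_fit_seconds = 4 * 60 + 46   # default model: 4min 46s
sampled_fit_seconds = 15.8       # max_samples=10000 model: 15.8 s

# Reduction in rows per tree versus observed reduction in wall time
rows_per_tree_reduction = 400_000 / 10_000   # ~400,000 rows is approximate
observed_speedup = full_fit_seconds / sampled_fit_seconds
print(f"{rows_per_tree_reduction:.0f}x fewer rows per tree, "
      f"about {observed_speedup:.0f}x faster wall time")
```

The observed speedup is smaller than the reduction in rows per tree, which is plausible because not every part of training scales directly with the number of rows.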


## Evaluating the Reduced Sample Model

Here's what this part of our code does:

- Tests how well our model works after using less training data.
- Uses our scoring tool to check how accurate the model is.
- Saves the results so we can compare them with other versions of the model.

In [37]:
# Get evaluation metrics from reduced sample model
base_model_scores = show_scores(model=model)
base_model_scores

{'Training MAE': 5420.02941167562,
 'Valid MAE': 7834.046851587109,
 'Training RMSLE': 0.25300524358849397,
 'Valid RMSLE': 0.3544106589676751,
 'Training R^2': 0.8669862488033313,
 'Valid R^2': 0.7704107636353107}

### Model Performance Analysis

These metrics show how well our model performs in predicting bulldozer prices. The results include:

- **Mean Absolute Error (MAE)**: Shows average prediction error in dollars
    - Training: $5,420 off on average
    - Validation: $7,834 off on average
- **Root Mean Squared Log Error (RMSLE)**: Measures relative prediction accuracy
    - Training: 0.253 (closer to 0 is better)
    - Validation: 0.354 (expected higher for new data)
- **R² Score**: Share of the variation in sale price that the model explains (1.0 is perfect)
    - Training: 0.867
    - Validation: 0.770

---

# **Hyperparameter Tuning With RandomizedSearchCV**
Instead of manually adjusting model parameters one by one, we'll use automated hyperparameter tuning to find the best settings for our RandomForestRegressor. This process helps us:

- Find optimal model settings automatically.
- Save time compared to manual tuning.
- Improve model accuracy systematically.

We'll use RandomizedSearchCV to test different combinations of parameters like n_estimators, max_depth, and min_samples_split. Our approach involves:

- Starting with a wide range of parameter values.
- Using random sampling to test different combinations.
- Narrowing down to the most promising settings.
- Finally using GridSearchCV for precise optimization.

The process works in three main steps:

1. Define parameter options for our model (keeping max_samples=10000 for speed).
2. Set up RandomizedSearchCV with specific iterations and cross-validation folds.
3. Train the model to automatically find the best parameter combination.

# Hyperparameter Tuning with RandomizedSearchCV

This code helps our model learn better by automatically testing different settings. Instead of changing settings by hand, it tries many different combinations on its own to find what works best.

The code performs three key steps:

- Creates a parameter grid (rf_grid) that defines the possible values for each model parameter like n_estimators, max_depth, and others.
- Sets up RandomizedSearchCV to efficiently sample from these parameter combinations, using 20 iterations and 3-fold cross-validation.
- Fits the model to find the best parameter combination that maximizes performance.

In [38]:
%%time

from sklearn.model_selection import RandomizedSearchCV

# 1. Define a dictionary with different values for RandomForestRegressor hyperparameters
# See documentation for potential different values - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
rf_grid = {"n_estimators": np.arange(10, 200, 10),
           "max_depth": [None, 10, 20],
           "min_samples_split": np.arange(2, 10, 1), # min_samples_split must be an int in the range [2, inf) or a float in the range (0.0, 1.0]
           "min_samples_leaf": np.arange(1, 10, 1),
           "max_features": [0.5, 1.0, "sqrt"], # Note: "max_features='auto'" is equivalent to "max_features=1.0", as of Scikit-Learn version 1.1
           "max_samples": [10000]}

# 2. Setup instance of RandomizedSearchCV to explore different parameters 
rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(), # can pass new model instance directly, all settings will be taken from the rf_grid
                              param_distributions=rf_grid,
                              n_iter=20,
                            #   scoring="neg_root_mean_squared_log_error", # want to optimize for RMSLE, though sometimes optimizing for the default metric (R^2) can lead to just as good results all round
                              cv=3,
                              verbose=3) # control how much output gets produced, higher number = more output

# 3. Fit the model using a series of different hyperparameter values
rs_model.fit(X=X_train_preprocessed, 
             y=y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=20;, score=0.666 total time=   4.9s
[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=20;, score=0.631 total time=   2.7s
[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=20;, score=0.663 total time=   2.1s
[CV 1/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=7, n_estimators=190;, score=0.681 total time=  14.6s
[CV 2/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=7, n_estimators=190;, score=0.645 total time=  13.7s
[CV 3/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=7, n_estimators=190;, score=0.670 total time=  14.8s


### RandomizedSearchCV Model Training Results

This output shows how we tested different settings to find the best version of our Random Forest model. We used a tool called **`RandomizedSearchCV`** to try out different combinations and find what works best.

- Tests 20 different parameter combinations across 3 cross-validation folds (60 total fits)
- Evaluates various hyperparameters including max_depth, max_features, min_samples_leaf, min_samples_split, and n_estimators
- Records model performance scores and execution times for each combination
- Total execution time was approximately 26 minutes

The purpose is to automatically find the optimal model configuration that maximizes performance by testing different parameter combinations.
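
For a sense of scale, the grid defined above can be counted directly. This sketch mirrors `rf_grid` with plain Python ranges (`range(10, 200, 10)` holds the same values as `np.arange(10, 200, 10)`) and shows how small a slice 20 random draws cover:

```python
import math

# Same candidate values as rf_grid, written with built-in ranges
grid_sizes = {
    "n_estimators": len(range(10, 200, 10)),
    "max_depth": len([None, 10, 20]),
    "min_samples_split": len(range(2, 10)),
    "min_samples_leaf": len(range(1, 10)),
    "max_features": len([0.5, 1.0, "sqrt"]),
    "max_samples": len([10000]),
}

total_combinations = math.prod(grid_sizes.values())
n_iter, cv_folds = 20, 3
print(f"{total_combinations} possible combinations")
print(f"{n_iter} sampled ({n_iter / total_combinations:.2%}), {n_iter * cv_folds} fits")
```

Because only a small share of the grid is sampled, a different random draw could surface a different "best" combination.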

## Finding Optimal Model Parameters

This code retrieves the best hyperparameters found during our RandomizedSearchCV process. These parameters achieved the highest average cross-validation score (R² by default, since the `scoring` argument was left commented out), giving us the best configuration among the 20 combinations tried.

In [40]:
# Find the best parameters from RandomizedSearchCV
rs_model.best_params_

{'n_estimators': np.int64(90),
 'min_samples_split': np.int64(5),
 'min_samples_leaf': np.int64(2),
 'max_samples': 10000,
 'max_features': 0.5,
 'max_depth': None}

### Best Model Parameters Found Through RandomizedSearchCV

Below are the optimal hyperparameters discovered during our model tuning process. These parameters represent the best configuration that maximizes our Random Forest model's performance:

The optimal configuration includes:

- 90 trees in the forest
- Minimum of 5 samples required to split an internal node
- Minimum of 2 samples per leaf node
- 10,000 samples per tree limit
- Half of the features considered at each split
- Trees allowed to grow to full depth

These settings achieve a balance between model complexity and performance, helping prevent overfitting while maintaining good predictive power.

## Evaluating Model Performance

This code block evaluates our optimized Random Forest model's performance:

- Uses a custom `show_scores()` function.
- Calculates various performance metrics.
- Shows how well the model performs with optimized parameters.
- Helps assess the effectiveness of our hyperparameter tuning.

In [41]:
# Evaluate the RandomizedSearch model
rs_model_scores = show_scores(rs_model)
rs_model_scores

{'Training MAE': 5644.116728951262,
 'Valid MAE': 7609.7001903197815,
 'Training RMSLE': 0.26015691522970624,
 'Valid RMSLE': 0.34543252484518705,
 'Training R^2': 0.8567867001736508,
 'Valid R^2': 0.7845786478293352}

### Model Performance Metrics

These numbers tell us how accurate our machine learning model is at predicting bulldozer prices. We tested it in two ways:

- **Mean Absolute Error (MAE)**: Shows average prediction error in dollars - lower is better.
- **Root Mean Square Logarithmic Error (RMSLE)**: Measures relative prediction error - lower is better.
- **R-squared (R²)**: Indicates how well the model fits the data, with 1.0 being perfect - higher is better.

Our model performs well at predicting bulldozer prices. It scores 0.86 (86%) when tested on training data and 0.78 (78%) on new data, showing it can make reliable predictions even on bulldozers it hasn't seen before.

---

# **Training A Model With The Best Hyperparameters**

### Training the Final Model with Optimized Parameters

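We tested different settings for our model to find the best way to predict bulldozer prices. The settings below come from a longer search than the 20-iteration run shown above: it tried 100 different combinations (`n_iter=100`) and took about 2 hours. That is why they differ slightly from the `best_params_` output above (for example, `min_samples_leaf=1` here versus `2` there).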

The code shows:

- The final selected hyperparameters for model optimization.
- How to implement these parameters in a new model instance.
- Important considerations about model training time and computational resources.

We've adjusted all the settings in our model to work as effectively as possible:

- `n_estimators=90`
- `max_depth=None`
- `min_samples_leaf=1`
- `min_samples_split=5`
- `max_features=0.5`
- `n_jobs=-1`
- `max_samples=None`

**Note:** This search (`n_iter=100`) took more than 2 hours on my Acer laptop.

We will now create a new model using the best settings we found, and we'll change the `max_samples` setting back to its default value.
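
One way to carry search results into a fresh model is to unpack the parameter dictionary. The sketch below starts from the `best_params_` printed by the 20-iteration search above (written as plain Python values), resets `max_samples`, and adds `n_jobs`. Note that this dictionary has `min_samples_leaf=2`, whereas the settings listed here use `1`.

```python
from sklearn.ensemble import RandomForestRegressor

# best_params_ from the RandomizedSearchCV output above, as plain Python values
best_params = {"n_estimators": 90, "min_samples_split": 5, "min_samples_leaf": 2,
               "max_samples": 10000, "max_features": 0.5, "max_depth": None}

# Override max_samples so every tree can draw from the full training set,
# and add n_jobs=-1 to train on all CPU cores
final_params = {**best_params, "max_samples": None, "n_jobs": -1}

# Unpack the dictionary into keyword arguments (the model is not fitted here)
sketch_model = RandomForestRegressor(**final_params)
print(sketch_model.get_params()["max_samples"])
```

Building the model from the dictionary avoids retyping values by hand, which is where small mismatches tend to creep in.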

## Training the Optimized Random Forest Model

Here we create and train our final machine learning model. We're using settings that we found work best after testing many different combinations:

- Uses 90 decision trees (n_estimators).
- Allows trees to grow to their full depth (max_depth=None).
- Sets minimum requirements for node splitting and leaf size.
- Uses 50% of features for best results (max_features=0.5).
- Utilizes all available CPU cores for faster training (n_jobs=-1).

In [42]:
%%time

# Create a model with best found hyperparameters 
ideal_model = RandomForestRegressor(n_estimators=90,
                                    max_depth=None,
                                    min_samples_leaf=1,
                                    min_samples_split=5,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None)

# Fit a model to the preprocessed data
ideal_model.fit(X=X_train_preprocessed, 
                y=y_train)

CPU times: total: 10min 23s
Wall time: 2min 23s


### Training an Optimized Random Forest Model for Bulldozer Price Prediction

This code uses a special type of machine learning model (called Random Forest) to predict bulldozer prices. We've adjusted the model's settings to work better, including:

- 90 decision trees for robust predictions.
- Unconstrained tree depth for maximum model flexibility.
- Half of the features considered at each split (`max_features=0.5`).
- Parallel processing across all CPU cores for efficient training.

Training took about 2 minutes 23 seconds of wall time. The 10 minutes 23 seconds of CPU time is the total work added up across all CPU cores, which shows how much processing the model needed to learn from all the data.

## Evaluating Model Performance Metrics

This code checks how well our model performs at predicting bulldozer prices:

- It measures how long the code takes to run.
- It uses a special testing tool we created to check the model's accuracy.
- It shows us how close our predictions are to the real prices using different measurement methods.

In [43]:
%%time

# Evaluate ideal model
ideal_model_scores = show_scores(model=ideal_model)
ideal_model_scores

CPU times: total: 49.2 s
Wall time: 21 s


{'Training MAE': 1960.256625575743,
 'Valid MAE': 6656.226670602618,
 'Training RMSLE': 0.10177165206734948,
 'Valid RMSLE': 0.3163943104409574,
 'Training R^2': 0.9810241241949494,
 'Valid R^2': 0.8343130842035419}

### Model Performance Metrics Analysis

We tested how good our computer model is at guessing bulldozer prices. We checked its performance in two ways: first with data it learned from, and then with new data it hadn't seen before. This helps us know if the model can make reliable price predictions.

- **Mean Absolute Error (MAE)**: Shows our predictions are off by about $1,960 on training data and $6,656 on validation data.
- **Root Mean Square Logarithmic Error (RMSLE)**: 0.10 for training and 0.32 for validation (lower is better), an improvement on the 0.345 validation RMSLE of the first model.
- **R-squared (R²)**: 0.98 on training data and 0.83 on validation data. This is the best validation score so far, although the gap to the training score shows the model still fits familiar data more closely than new cases.

### Creating a Faster Model with Reduced Complexity

This code creates a more efficient version of our Random Forest model by reducing the number of trees (estimators) from 90 to 45. The goal is to speed up training and prediction times while maintaining reasonable accuracy. We keep all other parameters the same, so the training work is roughly halved.

In [44]:
%%time

# Halve the number of estimators
fast_model = RandomForestRegressor(n_estimators=45,
                                   max_depth=None,
                                   min_samples_leaf=1,
                                   min_samples_split=5,
                                   max_features=0.5,
                                   n_jobs=-1,
                                   max_samples=None)

# Fit the faster model to the data
fast_model.fit(X=X_train_preprocessed, 
               y=y_train)

CPU times: total: 4min 55s
Wall time: 1min 2s


### Faster Random Forest Model: Training Summary

Key features of the implementation:

- Uses 45 decision trees instead of 90 to reduce computational load.
- Maintains other optimal parameters like max_depth, min_samples_leaf, and max_features.
- Takes approximately 1 minute and 2 seconds of wall time to execute.

## Evaluating the Fast Model's Performance

This code block measures and displays the performance metrics of our simplified Random Forest model:

- Uses the Jupyter (IPython) magic command `%%time` to track execution duration.
- Calls our custom `show_scores()` function to calculate key performance indicators.
- Allows us to compare accuracy between the full and simplified models.

In [46]:
%%time

# Get results from the fast model
fast_model_scores = show_scores(model=fast_model)
fast_model_scores

CPU times: total: 24.4 s
Wall time: 8.98 s


{'Training MAE': 1996.230820253853,
 'Valid MAE': 6742.00246707134,
 'Training RMSLE': 0.10350416034126005,
 'Valid RMSLE': 0.32084838098914825,
 'Training R^2': 0.9802527775314333,
 'Valid R^2': 0.8299498693975467}

### Fast Model Performance Results Analysis

Our faster version of the model worked well and saved time. Training took about 1 minute of wall time instead of almost 2.5 minutes, and scoring took about 9 seconds of wall time (24 seconds of CPU time) instead of 21 seconds. Even though we made it simpler by using fewer decision trees, it's still very good at predicting prices:
- **Training accuracy (R²)** = `98%`
- **Validation accuracy (R²)** = `83%`
- **Mean Absolute Error**   
  -  for *training* = `$1,996`
  -  for *validation* = `$6,742` 

These results indicate the faster model achieves comparable accuracy to the full model while requiring less computational resources.
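
Putting the validation RMSLE of every model from this notebook side by side makes the comparison easier to check. A small sketch with the values copied from the outputs above (the labels are informal names chosen here):

```python
# Validation RMSLE per model, copied from the show_scores() outputs in this notebook
valid_rmsle = {
    "default (all samples)": 0.34490646690183985,
    "max_samples=10000": 0.3544106589676751,
    "randomized search": 0.34543252484518705,
    "ideal (90 trees)": 0.3163943104409574,
    "fast (45 trees)": 0.32084838098914825,
}

# Lower RMSLE is better, so sort ascending
for name, score in sorted(valid_rmsle.items(), key=lambda item: item[1]):
    print(f"{name:<22} {score:.4f}")

best_name = min(valid_rmsle, key=valid_rmsle.get)
fast_penalty = valid_rmsle["fast (45 trees)"] / valid_rmsle["ideal (90 trees)"] - 1
print(f"Best: {best_name}; fast model RMSLE is {fast_penalty:.1%} higher")
```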

---