# **05. Model Training and Evaluation**
*This notebook will focus on training the machine learning model (e.g., RandomForestRegressor) and evaluating its performance using metrics like RMSLE.*

## Objectives

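* Train a `RandomForestRegressor` on the processed bulldozer data, first on a 1,000-row sample and then on the full dataset
* Load the raw training and validation sets, add date features, and encode categorical columns so a model can be fitted on time-ordered data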

## Inputs

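* `data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet`
* `data/raw/bluebook-for-bulldozers/Train.csv`, `Valid.csv`, and `ValidSolution.csv`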

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Execution Timestamp

Purpose: This code block adds a timestamp to track notebook execution
- Helps monitor when analysis was last performed
- Supports reproducibility by recording when the results were produced
- Useful for debugging and version control

In [1]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-17 22:03:42.852379


# Project Directory Structure and Working Directory

**Purpose: This section documents the project organization and sets the working directory**
- Describes a standardized project structure for data science workflows
- Documents the purpose of each directory for team collaboration
- Gets current working directory for file path management

## Key Components:
1. `data/` stores all datasets (raw, processed, interim)
2. `src/` contains all source code (data preparation, models, utilities)
3. `notebooks/` holds Jupyter notebooks for experimentation
4. `results/` stores output files and visualizations

## Project Root Structure

- **`data/`** - Where all your datasets live
    - `raw/` - Original, untouched data
    - `processed/` - Cleaned and prepared data
    - `interim/` - Temporary data files
- **`src/`** - Your source code
    - `data_prep/` - Code for preparing data
    - `models/` - Your ML models
    - `utils/` - Helper functions
- **`notebooks/`** - Jupyter notebooks for experiments
- **`results/`** - Model outputs and visualizations

## Setting Up Working Directory
This code block sets up the working environment by:
- Changing to the project directory where our code and data files are located
- Verifying the current working directory to ensure we're in the right place

In [2]:
import os

# Move to the desired directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2')

# Get the current directory to verify the change
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2'

## Set Working Directory to Project Root
**Purpose: Changes the current working directory to the parent directory**
- Gets the folder one level above the current one
- Makes sure all file locations work correctly throughout the project
- Keeps files and folders organized in a clean way

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


## Get Current Working Directory
**Purpose: Sets the working directory explicitly, then retrieves and stores its path**
- Changes to the folder that contains the project repository
- Saves this location in a variable called `current_dir` so we can use it later
- Helps us find and work with files in the right place

In [4]:
import os

# Change the current working directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository')

# Get the current working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository'
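
Hard-coded absolute paths like the ones above only work on one machine. As a hedged alternative, the sketch below locates a project root by walking upward from a starting folder until it finds a marker directory. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration; they are not part of this project's code.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return `start` or its nearest ancestor that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demonstrate on a throwaway directory tree that mimics the layout described above
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "data" / "processed").mkdir(parents=True)
    (project / "notebooks").mkdir()

    # Start from the notebooks folder, as a notebook would
    root = find_project_root(project / "notebooks")
    found_expected_root = root == project.resolve()
    processed_dir_exists = (root / "data" / "processed").is_dir()

print(found_expected_root, processed_dir_exists)
```

With a helper like this, file paths can be built as `root / "data" / "processed" / "<file name>"` instead of repeating a user-specific prefix.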

---

# **Import Essential Data Science Libraries and Check Versions**

**Purpose: This code block imports fundamental Python libraries for data analysis and visualization**
- `pandas`: For data manipulation and analysis
- `numpy`: For numerical computations
- `matplotlib`: For creating visualizations and plots

**The version checks help ensure:**
- *Code compatibility across different environments*
- *Reproducibility of analysis*
- *Easy debugging of version-specific issues*


In [5]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0


# **Importing and Displaying the Processed Bulldozer Dataset**

This code serves three main purposes:

- Imports pandas for data manipulation
- Loads our preprocessed bulldozer dataset from a Parquet file that contains cleaned data with properly encoded categorical values and filled missing values
- Displays the first few rows of the data to verify successful loading

---

In [6]:
import pandas as pd

# Define the file path
file_path = "C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet"

# Load the Parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Display the first few rows of the DataFrame
print(df.head())

   SalesID  SalePrice  MachineID  ModelID  datasource  auctioneerID  YearMade  \
0  1139246    66000.0     999089     3157         121           3.0      2004   
1  1139248    57000.0     117657       77         121           3.0      1996   
2  1139249    10000.0     434808     7009         121           3.0      2001   
3  1139251    38500.0    1026470      332         121           3.0      2001   
4  1139253    11000.0    1057373    17311         121           3.0      2007   

   MachineHoursCurrentMeter  UsageBand  fiModelDesc  ...  \
0                      68.0          2          963  ...   
1                    4640.0          2         1745  ...   
2                    2838.0          1          336  ...   
3                    3486.0          1         3716  ...   
4                     722.0          3         4261  ...   

   Undercarriage_Pad_Width_is_missing  Stick_Length_is_missing  \
0                                   1                        1   
1                   

## Loading the Preprocessed Bulldozer Dataset

This code reads our previously processed bulldozer dataset from a Parquet file. The dataset contains:

- Cleaned and properly formatted data
- Encoded categorical values
- Filled missing values

In [7]:
# Read in preprocessed dataset
df_tmp = pd.read_parquet(path="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet",
                        engine="auto")


### Check for Missing Values

This code checks if there are any missing values in our data. It:

- Calculates the total number of missing values across all columns using pandas' isna() and sum() functions
- Provides informative feedback based on the result:
    - If no missing values are found (total = 0), confirms we can proceed with model building
    - If missing values exist, suggests reviewing our data preprocessing steps

In [8]:
# Check total number of missing values
total_missing_values = df_tmp.isna().sum().sum()

if total_missing_values == 0:
    print(f"[INFO] Total missing values: {total_missing_values} - Great! Let's build a model!")
else:
    print(f"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?")

[INFO] Total missing values: 0 - Great! Let's build a model!


---

# Training Our Machine Learning Model

Now that we've cleaned up our data and made sure everything is in the right format, we're ready to create our price prediction model!

### Starting Small

We'll use a special type of model called a Random Forest that's good at learning patterns from data. Since we have a lot of data (over 400,000 rows), we'll first test our approach on a smaller sample of about 1,000 rows.

### Why Start Small?

Think of it like testing a recipe - it's better to try it with smaller portions first to make sure everything works before making a huge batch. This way, we can quickly fix any problems without wasting time.
 

### Setting Up Our Data

We'll organize our data into two parts:

- Features (`X`): All the information about the bulldozers
- Target (`y`): The actual sale prices we want to predict

### Measuring Performance

We'll use the Jupyter cell magic `%%time` to see how long our model takes to learn. This helps us plan for when we use the full dataset.
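
`%%time` is an IPython cell magic, so it only works inside Jupyter or IPython. Outside a notebook, a similar measurement can be sketched with the standard library. The `busy_work` function below is a hypothetical stand-in for model training, used only so the example runs on its own.

```python
import time


def busy_work(n: int) -> int:
    """Hypothetical stand-in for a training step: sum of squares below n."""
    return sum(i * i for i in range(n))


wall_start = time.perf_counter()  # elapsed ("wall") time
cpu_start = time.process_time()   # CPU time used by this process

result = busy_work(200_000)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"CPU time: {cpu_elapsed:.3f} s")
print(f"Wall time: {wall_elapsed:.3f} s")
```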

## Initialize and Train Random Forest Model

This code prepares and runs our model that will predict bulldozer prices.

- Imports the `RandomForestRegressor` class from scikit-learn's ensemble module
- Creates a model instance that utilizes all available CPU cores (`n_jobs=-1`)
- Trains the model using:
    - Features (`X`): All columns except `SalePrice`
    - Target (`y`): The `SalePrice` column we want to predict

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)
model.fit(X=df_tmp.drop("SalePrice", axis=1), 
        y=df_tmp.SalePrice)

### What is `n_jobs=-1`?

Setting `n_jobs=-1` in a `RandomForestRegressor` tells the model to use all available CPU cores on your computer for training.

#### How it Works

Think of it like this: If your computer has 4 CPU cores, it's like having 4 workers that can process data simultaneously. When you set n_jobs=-1, you're telling the program "Use all available workers (cores)" instead of just one.

#### Benefits

- Faster training times since work is distributed across all cores
- Maximum utilization of your computer's processing power

#### Potential Drawbacks

- May slow down other programs on your computer since all cores are being used
- Can increase memory use, which may be a problem on computers with limited resources

If you want to be more conservative with resource usage, you can set `n_jobs` to a specific number (like `n_jobs=2` to use just two cores).
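
One way to pick a conservative value is to count the available cores and leave one free. This is a small sketch rather than a recommendation; the "leave one core free" rule and the variable names are assumptions for illustration.

```python
import os

from sklearn.ensemble import RandomForestRegressor

# os.cpu_count() can return None on some platforms, so fall back to 1
available_cores = os.cpu_count() or 1

# Leave one core free for the rest of the system, but never go below 1
conservative_n_jobs = max(1, available_cores - 1)

model_conservative = RandomForestRegressor(n_jobs=conservative_n_jobs)
print(f"Detected cores: {available_cores}, using n_jobs={conservative_n_jobs}")
```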

## Training a Sample Model for Initial Testing

This code shows how we test our Random Forest model using a small portion of our bulldozer dataset to make sure everything works correctly.

- Takes a random sample of 1,000 records from our full dataset for quick testing
- Creates a Random Forest model that uses all available CPU cores for efficient processing
- Splits the data into two parts: what we want to use to make predictions (X) and what we want to predict (y)
- Trains the model on this sample data to predict bulldozer prices

In [12]:
%%time

# Sample 1000 samples with random state 42 for reproducibility
df_tmp_sample_1k = df_tmp.sample(n=1000, random_state=42)

# Instantiate a model
model = RandomForestRegressor(n_jobs=-1) # use -1 to utilise all available processors

# Create features and labels
X_sample_1k = df_tmp_sample_1k.drop("SalePrice", axis=1) # use all columns except SalePrice as X values
y_sample_1k = df_tmp_sample_1k["SalePrice"] # use SalePrice as y values (target variable)

# Fit the model to the sample data
model.fit(X=X_sample_1k, 
          y=y_sample_1k) 

CPU times: total: 3.03 s
Wall time: 1.5 s


### Model Training Time Analysis

This output shows how long it took to train our Random Forest model on the 1,000 sample records:

- **CPU times** (total processing time across all CPU cores): 3.03 seconds
- **Wall time** (actual elapsed time): 1.5 seconds

The difference between CPU and wall time indicates effective parallel processing across multiple CPU cores.
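
As a rough reading of these two numbers, dividing CPU time by wall time estimates how many cores were busy on average while the cell ran. The calculation below uses the values from the output above; it is an approximation, since the cell also did some work (such as sampling) that is not parallel.

```python
cpu_seconds = 3.03   # "CPU times: total" reported above
wall_seconds = 1.5   # "Wall time" reported above

# Average number of cores kept busy while the cell ran
average_busy_cores = cpu_seconds / wall_seconds
print(f"Average busy cores: {average_busy_cores:.2f}")  # prints 2.02
```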

## Evaluate Model Performance

This code checks how well our Random Forest model fits the bulldozer prices in the sample.

- Uses the `score()` method, which for a regressor returns the coefficient of determination (R²) between the model's predictions and the actual bulldozer prices.
- Tests the model on the same 1,000 sample records we used for training.
- Prints the score along with the sample size for easy reference.

In [13]:
# Evaluate the model
model_sample_1k_score = model.score(X=X_sample_1k,
                                    y=y_sample_1k)

print(f"[INFO] Model score on {len(df_tmp_sample_1k)} samples: {model_sample_1k_score}")

[INFO] Model score on 1000 samples: 0.9574835015845709


### Model Performance Results

The **Random Forest model** achieved a score of about `0.96` on the sample. Here are the key points:

- For regressors, `score()` returns the coefficient of determination (R²), not a classification accuracy. A value of 0.96 means the model explains about 96% of the variance in sale prices for these rows.
- The score is optimistic because of how it was measured:
    - Only 1,000 bulldozers were used.
    - The model was scored on the same rows it was trained on, so it had already seen every answer.
- Practical implications:
    - This run confirms the pipeline works end to end; it is not yet evidence of reliable price estimates.
    - Scores on unseen data and on the complete dataset will typically be lower.
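
To make the meaning of this score concrete, the sketch below trains a small forest on synthetic data and compares the score on the rows used for fitting with the score on held-out rows. It also computes RMSLE, the metric named at the top of this notebook. The data is randomly generated and all variable names are assumptions for illustration; none of it comes from the bulldozer dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, r2_score

# Synthetic, strictly positive "prices" (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(600, 3))
y = 2000 + 50 * X[:, 0] + rng.normal(0, 100, size=600)

# Hold out the last 200 rows so the model never sees them during fitting
X_fit, y_fit = X[:400], y[:400]
X_holdout, y_holdout = X[400:], y[400:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
toy_model.fit(X_fit, y_fit)

fit_preds = toy_model.predict(X_fit)
holdout_preds = toy_model.predict(X_holdout)

# For regressors, score() returns R^2, the same quantity r2_score computes
r2_fit = r2_score(y_fit, fit_preds)
r2_holdout = r2_score(y_holdout, holdout_preds)
score_matches_r2 = bool(np.isclose(toy_model.score(X_fit, y_fit), r2_fit))

# RMSLE: square root of the mean squared log error
rmsle_holdout = float(np.sqrt(mean_squared_log_error(y_holdout, holdout_preds)))

print(f"R^2 on fitted rows:   {r2_fit:.3f}")
print(f"R^2 on held-out rows: {r2_holdout:.3f}")
print(f"RMSLE on held-out rows: {rmsle_holdout:.4f}")
```

The gap between the two R² values is the reason a score measured on training rows should not be read as expected real-world performance.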

## Training the Full Random Forest Model

This code runs our Random Forest model using all our bulldozer data to make better price predictions.

- Measures the training time using the cell magic `%%time`
- Uses all available computer processing power to run the model faster
- Gets the data ready by:
    - Taking out the price information (SalePrice) from the main dataset
    - Keeping the price information separate to use as our target values
- Uses all available bulldozer data to help the model learn how to predict prices accurately

In [15]:
%%time

# Instantiate model
model = RandomForestRegressor(n_jobs=-1) # note: this could take quite a while depending on your machine

# Create features and labels with entire dataset
X_all = df_tmp.drop("SalePrice", axis=1)
y_all = df_tmp["SalePrice"]

# Fit the model
model.fit(X=X_all, 
        y=y_all)

CPU times: total: 32min 31s
Wall time: 6min 18s


---

# **Understanding Data Splitting: A Simple Guide**

When working with machine learning projects, it's crucial to split your data properly. Here's why and how we do it:

### What is Data Splitting?

Think of data splitting like dividing a recipe book into three parts:

- `Training data`: The recipes you practice with
- `Validation data`: The recipes you test yourself on
- `Test data`: The final exam recipes

### Why Time Matters

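For projects involving time-based predictions (like our bulldozer price predictions), we need to be extra careful with how we split the data. A random split is a poor fit here because it lets the model train on sales that happened after the ones it is evaluated on, which makes validation scores look better than real future performance would be.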

### How We Split the Data

In this project, we organize our data by dates:

- `Training data`: Everything through the end of 2011
- `Validation data`: January 1 to April 30, 2012
- `Testing data`: May 1 to November 2012

This approach ensures we're training our model on past data to predict future prices, just like how we'd use historical prices to guess future ones in the real world.
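
In this project the split arrives as separate files, but the same date boundaries can be applied to a single table. The sketch below shows the idea on a tiny made-up DataFrame; the rows and the names `train_part`, `valid_part`, and `test_part` are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up sales table (illustrative values only)
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2010-06-15", "2011-12-31", "2012-01-01", "2012-04-30", "2012-05-01"]
    ),
    "SalePrice": [20000, 35000, 41000, 18000, 52000],
})

valid_start = pd.Timestamp("2012-01-01")
valid_end = pd.Timestamp("2012-04-30")

# Past data for training, the next block of time for validation, the rest for testing
train_part = sales[sales["saledate"] < valid_start]
valid_part = sales[(sales["saledate"] >= valid_start) & (sales["saledate"] <= valid_end)]
test_part = sales[sales["saledate"] > valid_end]

print(len(train_part), len(valid_part), len(test_part))  # prints 2 2 1
```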

In [16]:
# Import train samples (making sure to parse dates and then sort by them)
train_df = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

# Import validation samples (making sure to parse dates and then sort by them)
valid_df = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv
valid_solution = pd.read_csv(filepath_or_buffer="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/raw/bluebook-for-bulldozers/ValidSolution.csv")

# Map valid_solution to valid_df
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])

# Make sure valid_df is sorted by saledate still
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# How many samples are in each DataFrame?
print(f"[INFO] Number of samples in training DataFrame: {len(train_df)}")
print(f"[INFO] Number of samples in validation DataFrame: {len(valid_df)}")

[INFO] Number of samples in training DataFrame: 401125
[INFO] Number of samples in validation DataFrame: 11573


## Loading and Processing Training/Validation Data

This code prepares our bulldozer dataset for machine learning. Here's what it does in simple terms:

- Loads the training data from `Train.csv`, parses the sale dates, and sorts the rows by date
- Loads the validation data from `Valid.csv` the same way
- Adds the actual bulldozer prices from `ValidSolution.csv` to the validation data by matching on `SalesID`
- Makes sure all the data is arranged by date, which keeps the time order intact for training and evaluation
- Shows us how many bulldozers we have in our training and validation sets
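
The step that attaches prices to the validation rows relies on `Series.map` with a lookup indexed by `SalesID`. A toy version with made-up rows shows the mechanics; the DataFrame names and values here are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for Valid.csv (features) and ValidSolution.csv (labels)
toy_valid = pd.DataFrame({"SalesID": [101, 102, 103], "YearMade": [1998, 2005, 2001]})
toy_solution = pd.DataFrame({"SalesID": [103, 101, 102],
                             "SalePrice": [40000.0, 10500.0, 60000.0]})

# Build a SalesID -> SalePrice lookup, then map it onto the feature rows
price_lookup = toy_solution.set_index("SalesID")["SalePrice"]
toy_valid["SalePrice"] = toy_valid["SalesID"].map(price_lookup)

print(toy_valid)
```

Any `SalesID` missing from the lookup ends up as `NaN`, so checking `toy_valid["SalePrice"].isna().sum()` after mapping is a cheap sanity check.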

## Examining `Training Data` Sample

This code displays a random sample of 10 rows from our training dataset. This helps us:

- Verify the data is loaded correctly.
- Understand what features we're working with.
- Spot any potential issues in the data structure.

In [18]:
# Let's check out the training DataFrame
train_df.sample(10)

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
111899,1447051,40000,1378263,3362,132,1.0,1985,,,2003-04-16,...,,,,,,,,,,
359776,2467019,15000,1543152,12524,136,1.0,2004,0.0,,2009-09-16,...,None or Unspecified,None or Unspecified,Hydraulic,None or Unspecified,Double,,,,,
62807,1329117,11500,1103720,4089,132,4.0,1984,,,1993-02-01,...,,,,,,None or Unspecified,PAT,None or Unspecified,,
32405,1265316,7500,1411471,3112,132,1.0,1987,,,2001-03-14,...,,,,,,,,,,
279323,1853112,7000,1343839,18666,132,1.0,1998,,,2005-12-05,...,,,,,,,,,,
65241,1335692,93000,1374894,3879,132,1.0,1998,,,2002-05-14,...,,,,,,,,,Standard,Conventional
24877,1251992,19000,1315458,6788,132,1.0,1985,,,1991-03-14,...,,,,,,,,,,
82098,1380667,37500,1182381,4128,132,1.0,1986,,,2001-09-19,...,,,,,,None or Unspecified,Semi U,Differential Steer,,
310930,2270070,28000,1344636,3463,136,17.0,1997,0.0,,2007-10-09,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
318776,2296597,62000,1495536,1269,136,27.0,2003,0.0,,2009-07-23,...,None or Unspecified,None or Unspecified,None or Unspecified,Yes,Double,,,,,


## Examining `Validation Data` Sample

This code displays a random sample of 10 rows from our validation dataset to:

- Verify the validation data is structured correctly.
- Compare validation data features with training data.
- Check for any inconsistencies between datasets.

In [19]:
# And how about the validation DataFrame?
valid_df.sample(10)

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,...,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,SalePrice
1959,1224101,1057212,1946,121,3,1000,0.0,,2012-02-09,850BLT,...,,,,,None or Unspecified,None or Unspecified,None or Unspecified,,,10500.0
1999,4315882,2271989,14310,172,1,2005,8339.0,High,2012-02-12,330CLC,...,None or Unspecified,Hydraulic,None or Unspecified,Triple,,,,,,60000.0
4018,6285098,1938634,14315,149,1,2006,,,2012-02-13,350DLC,...,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,92500.0
7898,4316281,2273736,4674,172,1,1991,8631.0,Low,2012-03-26,490D,...,None or Unspecified,Hydraulic,None or Unspecified,Double,,,,,,11000.0
4375,4249081,2266795,6798,172,1,2006,7482.0,Medium,2012-02-17,621D,...,,,,,,,,Standard,Conventional,51000.0
4514,1225004,1016346,33398,121,3,1000,12084.0,Low,2012-02-21,BM5350,...,,,,,,,,Standard,Conventional,11000.0
11485,6262390,194797,23931,149,1,1999,,,2012-04-27,140HNA,...,,,,,,,,,,97000.0
9260,6285796,1857815,22072,149,1,2006,,,2012-03-29,317,...,,,,,,,,,,11500.0
11496,6304085,1858697,12006,149,1,1998,,,2012-04-27,WA450-3L,...,,,,,,,,Standard,Conventional,40000.0
2771,6269532,869634,4334,149,1,1998,,,2012-02-13,IT28G,...,,,,,,,,Standard,Conventional,37000.0


# Creating Time-Based Features from Sale Dates

Here's what this code does with sale dates to help predict bulldozer prices better:

- **Why we need it:**
    - Helps spot patterns in when bulldozers sell for more or less money.
    - Makes it easier to see how prices change over time.
- **What it does:**
    - Breaks down each sale date into useful parts (year, month, and day).
    - Adds helpful details like which day of the week and what time of year it was.
    - Removes the original date to keep things simple.

In [20]:
# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters derived from the date column
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop the original date column (use the parameter, not a hard-coded name)
    df.drop(date_column, axis=1, inplace=True)

    return df

train_df = add_datetime_features_to_df(df=train_df)
valid_df = add_datetime_features_to_df(df=valid_df)

## Viewing Time-Based Features

This code displays a random sample of 5 rows from the five new time columns (year, month, day, day of week, and day of year). It helps verify that:

- The date breakdowns are correct.
- The new columns have the expected structure.

In [21]:
# Display the last 5 columns (the recently added datetime breakdowns)
train_df.iloc[:, -5:].sample(5)

Unnamed: 0,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
167277,2009,6,26,4,177
287179,2007,11,14,2,318
386647,2011,3,29,1,88
308920,2007,8,9,3,221
202531,2007,12,7,4,341
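
A single date can be checked by hand to confirm what each `.dt` accessor returns. The sketch below uses 2009-06-26, which is the date implied by the first row of the sample output above (`saleYear=2009`, `saleMonth=6`, `saleDay=26`); the DataFrame name `toy_dates` is an assumption for illustration.

```python
import pandas as pd

# One sale date matching the first row of the sample output above
toy_dates = pd.DataFrame({"saledate": pd.to_datetime(["2009-06-26"])})

toy_dates["saleYear"] = toy_dates["saledate"].dt.year
toy_dates["saleMonth"] = toy_dates["saledate"].dt.month
toy_dates["saleDay"] = toy_dates["saledate"].dt.day
toy_dates["saleDayofweek"] = toy_dates["saledate"].dt.dayofweek  # Monday=0, so Friday=4
toy_dates["saleDayofyear"] = toy_dates["saledate"].dt.dayofyear

first_row = toy_dates.drop(columns="saledate").iloc[0].tolist()
print(first_row)  # [2009, 6, 26, 4, 177]
```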


---

# **Trying To Fit A Model On Our `Training Data`**
Let's start by testing our model right away with our data. This approach helps us understand what we're working with quickly.

When we test the model, one of two things will happen:

- If it works: We'll look at the results to see how well it did
- If it doesn't work: We'll learn what we need to fix in our data

To get started, we'll:

- Split our data into two parts:
    - The features (X): All the bulldozer information except the price
    - The target (y): Just the price we want to predict

Then we'll use a special tool called RandomForestRegressor to make price predictions using our training data.

In [None]:
# Split training data into features and labels
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]

# Split validation data into features and labels
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Create a model
model = RandomForestRegressor(n_jobs=-1)

# Fit a model to the training data only
# (this raises the ValueError shown below because the raw data still contains text columns)
model.fit(X=X_train,
          y=y_train)

ValueError: could not convert string to float: 'Medium'

## Handling Data Type Mismatch in Model Training

We've hit a small roadblock in our model training. Our program is having trouble understanding some of the data because it's expecting numbers but found text values (like the word 'Medium') instead.

Here's what's happening:

- Our model can only work with numbers.
- We found text values in our data (like 'Medium').
- This happened because we loaded our raw data file (`Train.csv`) instead of using our processed version.

The good news is this is a common issue and we know exactly how to fix it!
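
The failure is easy to reproduce on a tiny table, which can help when checking whether a fix works before rerunning on the full data. The rows below are made up, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up rows: one numeric column and one text column
toy_X = pd.DataFrame({
    "YearMade": [1996, 2001, 2004, 2007],
    "UsageBand": ["Low", "Medium", "High", "Medium"],
})
toy_y = pd.Series([57000.0, 10000.0, 66000.0, 11000.0])

fit_error = None
try:
    # The text column cannot be converted to floats, so fitting fails
    RandomForestRegressor(n_estimators=5, random_state=0).fit(toy_X, toy_y)
except ValueError as error:
    fit_error = error

print(f"Fitting failed: {fit_error}")
```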

---

# **Encoding categorical features as numbers using Scikit-Learn**

## Identifying Numerical and Categorical Features

Our code looks at our data and sorts it into two simple groups:

- **Numbers**: Things we can count or measure (like prices and years).
- **Categories**: Words or labels that describe things (like model names or condition types).

This sorting is important because:

- Our computer needs to handle numbers and words differently.
- It helps us prepare our data the right way.
- It makes sure our computer can understand and use all our information.

In [23]:
# Define numerical and categorical features
numerical_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]

print(f"[INFO] Numeric features: {numerical_features}")
print(f"[INFO] Categorical features: {categorical_features[:10]}...")

[INFO] Numeric features: ['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleDay', 'saleDayofweek', 'saleDayofyear']
[INFO] Categorical features: ['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup']...
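
An alternative way to make the same kind of split is `DataFrame.select_dtypes`. This is a sketch on a made-up table, not a replacement for the cell above; the column values and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with two numeric and two text columns
toy_features = pd.DataFrame({
    "YearMade": [1996, 2001],
    "MachineHoursCurrentMeter": [4640.0, 2838.0],
    "UsageBand": ["Low", "Medium"],
    "state": ["Texas", "Ohio"],
})

numeric_columns = toy_features.select_dtypes(include="number").columns.tolist()
non_numeric_columns = toy_features.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)      # ['YearMade', 'MachineHoursCurrentMeter']
print(non_numeric_columns)  # ['UsageBand', 'state']
```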


## Converting Categorical Features to Numbers

Our code takes information that's written as text and turns it into numbers so our computer can work with it. 
**Here's why we need to do this:**

- Scikit-learn models such as `RandomForestRegressor` expect numeric input.
- Some of our information, like equipment models and conditions, is written as text.
- We need to change this text into numbers, but keep its original meaning.

**The code does these main things:**

- Uses scikit-learn's `OrdinalEncoder` to change text into numbers.
- Handles any new or unexpected text by marking it as missing.
- Learns the categories from the training data and applies the same mapping to the validation data.
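
A toy example makes the "new or unexpected text" behaviour visible. The encoder below is fitted on three made-up categories and then asked to transform a value it has never seen. The data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"UsageBand": ["High", "Low", "Medium", "Low"]})
# "Extreme" never appears in toy_train
toy_valid = pd.DataFrame({"UsageBand": ["Low", "Extreme"]})

toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                             unknown_value=np.nan)
toy_encoder.fit(toy_train)

encoded_valid = toy_encoder.transform(toy_valid)

print(toy_encoder.categories_[0])  # categories learned from the training data
print(encoded_valid.ravel())       # "Low" gets its code, "Extreme" becomes nan
```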

In [24]:
from sklearn.preprocessing import OrdinalEncoder

# 1. Create an ordinal encoder (turns category items into a numeric representation)
ordinal_encoder = OrdinalEncoder(categories="auto",
                                 handle_unknown="use_encoded_value",
                                 unknown_value=np.nan, # encode categories not seen during fit as np.nan
                                 encoded_missing_value=np.nan) # keep missing values as np.nan

# 2. Fit and transform the categorical columns of X_train
# Make copies of the original DataFrames so we can keep the original values intact and view them later
X_train_preprocessed = X_train.copy()
# OrdinalEncoder expects all values in a column to be the same type (e.g. string or numeric only)
# Note: .astype(str) also turns missing values into the string "nan", which is then encoded as its own category
X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str))

# 3. Transform the categorical columns of X_valid
X_valid_preprocessed = X_valid.copy()
# Only use `transform` on the validation data so it reuses the categories learned from X_train
X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str))

## Display First Few Rows of Training Data

This code displays the first 5 rows of our training dataset (`X_train`) to:

- Verify our data is structured correctly
- Preview the features we'll use for training
- Ensure all columns are present and properly formatted


In [25]:
X_train.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
205615,1646770,1126363,8434,132,18.0,1974,,,TD20,TD20,...,None or Unspecified,Straight,None or Unspecified,,,1989,1,17,1,17
92803,1404019,1169900,7110,132,99.0,1986,,,416,416,...,,,,,,1989,1,31,1,31
98346,1415646,1262088,3357,132,99.0,1975,,,12G,12,...,,,,,,1989,1,31,1,31
169297,1596358,1433229,8247,132,99.0,1978,,,644,644,...,,,,Standard,Conventional,1989,1,31,1,31
274835,1821514,1194089,10150,132,99.0,1980,,,A66,A66,...,,,,Standard,Conventional,1989,1,31,1,31


## Display Preprocessed Training Data

This code displays the first 5 rows of our preprocessed training dataset to:

- Verify our categorical features have been successfully converted to numerical values
- Confirm the preprocessing steps were applied correctly
- Check the data format is now suitable for model training

In [27]:
X_train_preprocessed.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleYear,saleMonth,saleDay,saleDayofweek,saleDayofyear
205615,1646770,1126363,8434,132,18.0,1974,,3.0,4536.0,1734.0,...,0.0,7.0,5.0,4.0,5.0,1989,1,17,1,17
92803,1404019,1169900,7110,132,99.0,1986,,3.0,734.0,242.0,...,2.0,10.0,7.0,4.0,5.0,1989,1,31,1,31
98346,1415646,1262088,3357,132,99.0,1975,,3.0,81.0,18.0,...,2.0,10.0,7.0,4.0,5.0,1989,1,31,1,31
169297,1596358,1433229,8247,132,99.0,1978,,3.0,1157.0,348.0,...,2.0,10.0,7.0,3.0,1.0,1989,1,31,1,31
274835,1821514,1194089,10150,132,99.0,1980,,3.0,1799.0,556.0,...,2.0,10.0,7.0,3.0,1.0,1989,1,31,1,31


## Finding Missing Data in Categories

This code helps us find where data is missing in our categories. It does these simple things:

- Looks at just the category columns in our training data
- Counts how many empty spots we have in each column
- Lists them in order, showing which columns have the most missing data first
- Shows us the 25 columns with the most missing data

In [31]:
X_train[categorical_features].isna().sum().sort_values(ascending=False)[:25]

Tip_Control          375906
Enclosure_Type       375906
Engine_Horsepower    375906
Blade_Extension      375906
Blade_Width          375906
Pushblock            375906
Scarifier            375895
Grouser_Tracks       357763
Hydraulics_Flow      357763
Coupler_System       357667
fiModelSeries        344217
Steering_Controls    331756
Differential_Type    331714
UsageBand            331486
fiModelDescriptor    329206
Backhoe_Mounting     322453
Turbocharged         321991
Stick                321991
Pad_Type             321991
Blade_Type           321292
Travel_Controls      321291
Tire_Size            306407
Track_Type           301972
Grouser_Type         301972
Pattern_Changer      301907
dtype: int64

## Checking Missing Values After Preprocessing

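This code examines our preprocessed training data to identify any remaining missing values after our categorical encoding step. Here's what it does:

- Takes our preprocessed training data and looks only at categorical columns.
- Counts missing values (NaN) in each categorical feature.
- Sorts features by number of missing values (most to least).
- Shows the top 25 features with missing values.

Every count is 0 because the encoding step converted the columns with `.astype(str)`, which turned each missing value into the string `'nan'` and encoded it as an ordinary category. The missing values have been relabelled rather than filled in.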

In [30]:
X_train_preprocessed[categorical_features].isna().sum().sort_values(ascending=False)[:25]

UsageBand             0
fiModelDesc           0
fiBaseModel           0
fiSecondaryDesc       0
fiModelSeries         0
fiModelDescriptor     0
ProductSize           0
fiProductClassDesc    0
state                 0
ProductGroup          0
ProductGroupDesc      0
Drive_System          0
Enclosure             0
Forks                 0
Pad_Type              0
Ride_Control          0
Stick                 0
Transmission          0
Turbocharged          0
Blade_Extension       0
Blade_Width           0
Enclosure_Type        0
Engine_Horsepower     0
Hydraulics            0
Pushblock             0
dtype: int64

## Inspecting Encoded Categories

This code displays the first five categories that our ordinal encoder has processed. This helps us:

- Verify that our categorical data was properly encoded into numerical values.
- Check the mapping between original categories and their numeric representations.
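- See the order the encoder assigned: with `categories="auto"` the categories are sorted alphabetically, so the codes do not reflect a natural ranking (for example, `'High'` becomes 0, `'Low'` 1, and `'Medium'` 2).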

In [None]:
# Let's inspect the first five categories
ordinal_encoder.categories_[:5]

[array(['High', 'Low', 'Medium', 'nan'], dtype=object),
 array(['100C', '104', '1066', ..., 'ZX800LC', 'ZX80LCK', 'ZX850H'],
       shape=(4999,), dtype=object),
 array(['10', '100', '104', ..., 'ZX80', 'ZX800', 'ZX850'],
       shape=(1950,), dtype=object)]
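
The first array shows that the default ordering is alphabetical (`'High'`, `'Low'`, `'Medium'`). If a natural ranking is wanted for a column, `OrdinalEncoder` accepts an explicit category list. The sketch below compares the two on made-up values; whether the ordering matters for a tree-based model is a separate question, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_usage = pd.DataFrame({"UsageBand": ["High", "Low", "Medium", "Low"]})

# Default: categories are sorted alphabetically (High=0, Low=1, Medium=2)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(toy_usage).ravel().tolist()

# Explicit order: Low=0, Medium=1, High=2
ordered_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordered_encoder.fit_transform(toy_usage).ravel().tolist()

print(default_codes)  # [0.0, 1.0, 2.0, 1.0]
print(ordered_codes)  # [2.0, 0.0, 1.0, 0.0]
```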