# **05. Model Training and Evaluation**
*This notebook will focus on training the machine learning model (e.g., RandomForestRegressor) and evaluating its performance using metrics like RMSLE.*
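
For reference, RMSLE (root mean squared logarithmic error) compares the logarithms of predicted and actual values, so it reflects relative rather than absolute differences in price. The block below is a minimal standard-library sketch of the metric; the function name `rmsle` and the toy values are illustrative and are not taken from the notebook's own code.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    # log1p(x) computes log(1 + x), which keeps zero values well defined
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Toy values for illustration only (not real model predictions)
actual_prices = [66000.0, 57000.0, 10000.0]
predicted_prices = [60000.0, 59000.0, 12000.0]
print(f"RMSLE: {rmsle(actual_prices, predicted_prices):.4f}")
```

A perfect prediction gives an RMSLE of 0, and larger values indicate larger relative errors.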

## Objectives

* Train a machine learning model (e.g., RandomForestRegressor) on the processed bulldozer dataset and evaluate its performance using metrics like RMSLE

## Inputs

* The preprocessed dataset `data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet`

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Execution Timestamp

Purpose: This code block adds a timestamp to track notebook execution
- Helps monitor when analysis was last performed
- Supports traceability when comparing results across runs
- Useful for debugging and version control

In [11]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-16 01:44:41.891693
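
If the notebook is run on machines in different time zones, a timezone-aware timestamp can be easier to compare across runs. The following is a minimal sketch using only the standard library; the helper name `run_timestamp` is illustrative and is not part of the original notebook.

```python
import datetime


def run_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string, to the second."""
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    return now_utc.isoformat(timespec="seconds")


print(f"Notebook last run (end-to-end, UTC): {run_timestamp()}")
```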


# Project Directory Structure and Working Directory

**Purpose: This code block establishes and explains the project organization**
- Creates a standardized project structure for data science workflows
- Documents the purpose of each directory for team collaboration
- Gets current working directory for file path management

## Key Components:
1. `data/` stores all datasets (raw, processed, interim)
2. `src/` contains all source code (data preparation, models, utilities)
3. `notebooks/` holds Jupyter notebooks for experimentation
4. `results/` stores output files and visualizations

## Project Root Structure

- **`data/`** - Where all your datasets live
    - `raw/` - Original, untouched data
    - `processed/` - Cleaned and prepared data
    - `interim/` - Temporary data files
- **`src/`** - Your source code
    - `data_prep/` - Code for preparing data
    - `models/` - Your ML models
    - `utils/` - Helper functions
- **`notebooks/`** - Jupyter notebooks for experiments
- **`results/`** - Model outputs and visualizations

## Setting Up Working Directory
This code block sets up the working environment by:
- Changing to the project directory where our code and data files are located
- Verifying the current working directory to ensure we're in the right place

In [12]:
import os

# Move to the project directory (this absolute path is specific to the author's machine)
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2')

# Get the current directory to verify the change
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2'

## Set Working Directory to Project Root
**Purpose: Changes the current working directory to the parent directory**
- Gets the folder one level above the current one
- Makes sure all file locations work correctly throughout the project
- Keeps files and folders organized in a clean way

In [13]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


## Get Current Working Directory
**Purpose: Sets the working directory explicitly, then retrieves and stores its path**
- Changes to the folder that contains the project (the same location reached in the previous step)
- Saves this location in a variable called `current_dir` so we can use it later
- Helps us find and work with files in the right place

In [14]:
import os

# Change the current working directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository')

# Get the current working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository'
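
The cells above rely on absolute paths that only exist on one machine. As a possible alternative, the sketch below walks upward from a starting folder until it finds one that contains the expected project sub-folders. The function name `find_project_root` and the choice of `data` and `notebooks` as marker folders are assumptions based on the directory layout described earlier, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, markers=("data", "notebooks")) -> Path:
    """Walk upward from `start` until a folder containing all `markers` is found.

    The marker names are an assumption based on the project layout
    described in this notebook; adjust them to match the real project.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    raise FileNotFoundError(f"No folder containing {markers} found above {start}")


# Example usage (commented out because the result depends on where the notebook runs):
# import os
# project_root = find_project_root(Path.cwd())
# os.chdir(project_root)
```

With a helper like this, later file paths could be built relative to the returned folder instead of being hard-coded.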

---

# **Import Essential Data Science Libraries and Check Versions**

**Purpose: This code block imports fundamental Python libraries for data analysis and visualization**
- `pandas`: For data manipulation and analysis
- `numpy`: For numerical computations
- `matplotlib`: For creating visualizations and plots

**The version checks help ensure:**
- *Code compatibility across different environments*
- *Reproducibility of analysis*
- *Easy debugging of version-specific issues*


In [15]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0
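
Printed versions can also be checked programmatically. The sketch below compares installed package versions against a set of minimums using the standard library's `importlib.metadata`. The helper names and the minimum versions are hypothetical, since the notebook does not state which versions it requires. Versions are compared as integer tuples so that, for example, `3.10` sorts after `3.8`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '2.2.3' into (2, 2, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def find_outdated(minimums: dict) -> dict:
    """Return {package: installed version, or None if missing} for packages below the minimum."""
    outdated = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            outdated[name] = None  # package is not installed at all
            continue
        if parse_version(installed) < parse_version(minimum):
            outdated[name] = installed
    return outdated


# Hypothetical minimums for illustration; the notebook does not specify required versions.
MINIMUM_VERSIONS = {"pandas": "2.0", "numpy": "1.26", "matplotlib": "3.8"}
print(find_outdated(MINIMUM_VERSIONS))
```

An empty dictionary means every listed package meets its minimum.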


# **Importing and Displaying the Processed Bulldozer Dataset**

This code serves three main purposes:

- Imports pandas for data manipulation
- Loads our preprocessed bulldozer dataset from a Parquet file that contains cleaned data with properly encoded categorical values and filled missing values
- Displays the first few rows of the data to verify successful loading

---

In [16]:
import pandas as pd

# Define the file path
file_path = "C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet"

# Load the Parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Display the first few rows of the DataFrame
print(df.head())

   SalesID  SalePrice  MachineID  ModelID  datasource  auctioneerID  YearMade  \
0  1139246    66000.0     999089     3157         121           3.0      2004   
1  1139248    57000.0     117657       77         121           3.0      1996   
2  1139249    10000.0     434808     7009         121           3.0      2001   
3  1139251    38500.0    1026470      332         121           3.0      2001   
4  1139253    11000.0    1057373    17311         121           3.0      2007   

   MachineHoursCurrentMeter  UsageBand  fiModelDesc  ...  \
0                      68.0          2          963  ...   
1                    4640.0          2         1745  ...   
2                    2838.0          1          336  ...   
3                    3486.0          1         3716  ...   
4                     722.0          3         4261  ...   

   Undercarriage_Pad_Width_is_missing  Stick_Length_is_missing  \
0                                   1                        1   
1                   

## Loading the Preprocessed Bulldozer Dataset

This code reads the same processed Parquet file into a second DataFrame, `df_tmp`, which the following cells use. The dataset contains:

- Cleaned and properly formatted data
- Encoded categorical values
- Filled missing values

In [None]:
# Read in preprocessed dataset
df_tmp = pd.read_parquet(path="C:/Users/blign/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2/data/processed/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet",
                        engine="auto")


### Check for Missing Values

This code checks if there are any missing values in our data. It:

- Calculates the total number of missing values across all columns using pandas' `isna()` and `sum()` functions
- Provides informative feedback based on the result:
    - If no missing values are found (total = 0), confirms we can proceed with model building
    - If missing values exist, suggests reviewing our data preprocessing steps

In [20]:
# Check total number of missing values
total_missing_values = df_tmp.isna().sum().sum()

if total_missing_values == 0:
    print(f"[INFO] Total missing values: {total_missing_values} - Great! Let's build a model!")
else:
    print(f"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?")

[INFO] Total missing values: 0 - Great! Let's build a model!
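
If the total is ever non-zero, a per-column breakdown shows where to look. The following is a minimal sketch, assuming pandas is installed. The helper name `missing_value_report` is illustrative, and the small DataFrame is toy data that only borrows column names from the dataset.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Return the missing-value count per column, keeping only columns with gaps."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy data for illustration only (not rows from the real dataset)
example = pd.DataFrame(
    {
        "YearMade": [2004, 1996, None],
        "UsageBand": [2, None, None],
        "SalePrice": [66000.0, 57000.0, 10000.0],
    }
)
report = missing_value_report(example)
print(report)
```

Applied to `df_tmp`, the same function would return an empty Series when no values are missing.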
