# Notebook 3: Preprocessing & Modeling

## 1. Objective

This notebook marks the transition from analysis (EDA) to prediction (Machine Learning).

Our work in Notebooks 1 and 2 was crucial. We now have:
1.  A clean, high-quality dataset: `rockfall_synthetic_data.csv`.
2.  A deep understanding of our data, including its "imbalanced" nature and strongest predictors.

The purpose of *this* notebook is to use that knowledge to build and select the best possible machine learning model. Our plan is based on our EDA findings:

1.  **Preprocessing:** We will prepare the data for modeling using standard tools: a `StandardScaler` (because our features have different scales) and a `LabelEncoder` (to convert the text target into integers; it assigns codes in alphabetical order, so "Critical" becomes 0 and "Medium" becomes 3).
2.  **Handling Imbalance:** Our biggest challenge (found in Notebook 2) is that "Critical" events are rare (1.7%). Where a model supports it, we will set `class_weight='balanced'`, which weights each class inversely to its frequency. Errors on rare classes then cost more during training, so the model cannot simply ignore them.
3.  **Model Training:** We will train a "ladder" of models, from a simple baseline (`LogisticRegression`) to a powerful, modern standard (`XGBClassifier`).
4.  **Model Evaluation:** We will *not* use "Accuracy" as our success metric. We will use a **Classification Report**. Our primary goal is to find the model with the highest **Recall** on the "Critical" class, as this is the most important prediction to get right.
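
As a rough illustration of what `class_weight='balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to a small made-up label array; the class counts are hypothetical and not taken from our dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: class 0 is rare, class 2 is common.
y_toy = np.array([0] * 2 + [1] * 8 + [2] * 90)
classes = np.unique(y_toy)

# scikit-learn's "balanced" heuristic
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same numbers computed by hand: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.2f}")
```

The rarest class receives the largest weight, which is why a weighted model is penalized more for missing it.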

In [1]:
import pandas as pd
import numpy as np
import os
import pickle

# --- 1. Preprocessing Tools ---
# train_test_split: To split our data into a 'training' set and a 'testing' set.
# StandardScaler: To rescale all our number features so they have a similar scale.
# LabelEncoder: To turn our text target into integer codes (alphabetical: 'Critical' -> 0, ..., 'Medium' -> 3).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# --- 2. Model Algorithms ---
# These are the different models we will train and compare.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# --- 3. Evaluation Tools ---
# classification_report: Our main tool. It shows Precision, Recall, and F1-score.
from sklearn.metrics import classification_report

# --- 4. Define File Paths ---
BASE_DIR = '..'
DATA_DIR = os.path.join(BASE_DIR, 'data')
DATA_FILE = os.path.join(DATA_DIR, 'rockfall_synthetic_data.csv')

# This is where we will save our trained models
MODELS_DIR = os.path.join(BASE_DIR, 'models')
os.makedirs(MODELS_DIR, exist_ok=True)


# --- 5. Load the Data ---
try:
    df = pd.read_csv(DATA_FILE)
    print(f"Successfully loaded '{os.path.basename(DATA_FILE)}'.")
    print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
except FileNotFoundError:
    print(f"--- ERROR ---")
    print(f"The file '{os.path.basename(DATA_FILE)}' was not found at: {DATA_FILE}")
    print("Please make sure Notebook 1 was run successfully and the file was saved.")
except Exception as e:
    print(f"An error occurred: {e}")

# --- 6. Initial Data Inspection ---
if 'df' in locals():
    print("\n--- Data Head (First 5 Rows) ---")
    print(df.head())

Successfully loaded 'rockfall_synthetic_data.csv'.
Dataset has 20000 rows and 6 columns.

--- Data Head (First 5 Rows) ---
   rainfall_mm_past_24h  seismic_activity  joint_water_pressure_kPa  \
0             16.351962          1.042005                 41.542877   
1              4.133069          1.410756                 34.360071   
2              7.764729          1.554489                 38.339998   
3             14.857486          1.517141                 50.282894   
4              0.000000          0.941456                 31.325325   

   vibration_level  displacement_mm risk_level  
0         0.206142         6.991127        Low  
1         0.349532         8.686621        Low  
2         0.384975        10.007807        Low  
3         0.351243        13.029807     Medium  
4         0.266100         8.382537        Low  


In [2]:
# ---
# ### 2.1. Define Features (X) and Target (y)
# ---

# 'X' (features) is our 5 numerical columns.
X = df.drop('risk_level', axis=1)

# 'y' (target) is the one column we are trying to predict.
y = df['risk_level']

print(f"X (features) shape: {X.shape}")
print(f"y (target) shape: {y.shape}")


# ---
# ### 2.2. Encode the Target Variable (y)
# ---
# We need to convert the text labels into integers so the models can use them.
# LabelEncoder sorts labels alphabetically, so the codes are
# Critical=0, High=1, Low=2, Medium=3 (not ordered by severity).

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Let's see what the new classes are.
# le.classes_ is in alphabetical order, which matches the encoded integers,
# so we keep it (as a plain list via .tolist()) for labeling our reports.
target_names = le.classes_.tolist()
target_map = {index: label for index, label in enumerate(target_names)}

print("\n--- Target Variable Encoding ---")
print(f"Text labels: {le.classes_}")
print(f"Encoded numbers: {le.transform(le.classes_)}")
print(f"Mapping: {target_map}")


# ---
# ### 2.3. Split Data into Training and Testing Sets
# ---
# This is a critical step. We will train our model on 80% of the data
# and test it on 20% of "unseen" data.
# 'stratify=y_encoded' is crucial for our imbalanced dataset.
# It ensures that our 'Critical' samples (1.7%) are split 80/20
# and don't all end up in just one set.

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y_encoded, 
    test_size=0.2,    # 20% of data for testing
    random_state=42,  # Ensures our split is reproducible
    stratify=y_encoded # CRITICAL for imbalanced data
)

print("\n--- Data Splitting ---")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


# ---
# ### 2.4. Scale the Numerical Features (X)
# ---
# Our features have different scales (e.g., rainfall 0-22 vs. vibration 0-1.5).
# StandardScaler rescales them all to have a mean of 0 and std of 1.
# We 'fit' the scaler ONLY on the training data to prevent data leakage.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n--- Feature Scaling ---")
print("Features have been scaled using StandardScaler.")
print(f"X_train_scaled mean (should be ~0): {X_train_scaled.mean():.2f}")
print(f"X_test_scaled mean (should be similar): {X_test_scaled.mean():.2f}")


# ---
# ### 2.5. Save Preprocessing Objects
# ---
# We must save the 'LabelEncoder' (le) and 'StandardScaler' (scaler)
# so our final web app (app.py) can use the exact same transformations.

le_path = os.path.join(MODELS_DIR, 'label_encoder.pkl')
scaler_path = os.path.join(MODELS_DIR, 'scaler.pkl')

with open(le_path, 'wb') as f:
    pickle.dump(le, f)
    
with open(scaler_path, 'wb') as f:
    pickle.dump(scaler, f)

print("\n--- Preprocessing Complete ---")
print(f"LabelEncoder saved to {le_path}")
print(f"StandardScaler saved to {scaler_path}")

X (features) shape: (20000, 5)
y (target) shape: (20000,)

--- Target Variable Encoding ---
Text labels: ['Critical' 'High' 'Low' 'Medium']
Encoded numbers: [0 1 2 3]
Mapping: {0: 'Critical', 1: 'High', 2: 'Low', 3: 'Medium'}

--- Data Splitting ---
X_train shape: (16000, 5)
X_test shape: (4000, 5)
y_train shape: (16000,)
y_test shape: (4000,)

--- Feature Scaling ---
Features have been scaled using StandardScaler.
X_train_scaled mean (should be ~0): -0.00
X_test_scaled mean (should be similar): 0.01

--- Preprocessing Complete ---
LabelEncoder saved to ..\models\label_encoder.pkl
StandardScaler saved to ..\models\scaler.pkl
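
The encoding output above shows that `LabelEncoder` assigns codes alphabetically rather than by severity. The sketch below reproduces that behaviour on a few labels and contrasts it with an explicit severity-ordered mapping. The names `severity_order` and `severity_map` are hypothetical; this notebook keeps the `LabelEncoder` codes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Low", "Medium", "High", "Critical", "Low"])

# LabelEncoder sorts the unique labels, so codes follow alphabetical order.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)
print(le_demo.classes_.tolist())  # ['Critical', 'High', 'Low', 'Medium']
print(codes.tolist())             # [2, 3, 1, 0, 2]

# Hypothetical alternative: an explicit mapping ordered by severity.
severity_order = ["Low", "Medium", "High", "Critical"]
severity_map = {name: i for i, name in enumerate(severity_order)}
ordinal_codes = [severity_map[name] for name in labels]
print(ordinal_codes)              # [0, 1, 2, 3, 0]
```

For the classifiers used here the integer codes act only as class identifiers, so the alphabetical order does not affect training. It matters mainly when reading reports or decoding predictions.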

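To see what `stratify` buys us, the sketch below splits a made-up dataset in which one class covers only 2% of the rows and then compares the rare-class share on each side of the split. The toy arrays are hypothetical stand-ins for `X` and `y_encoded`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 2% belong to the rare class 0.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.array([0] * 20 + [1] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both shares should sit close to the overall 2%.
train_share = (y_tr == 0).mean()
test_share = (y_te == 0).mean()
print(f"rare-class share in train: {train_share:.3f}")
print(f"rare-class share in test:  {test_share:.3f}")
```

Without `stratify`, the test set could by chance contain very few rare samples, which would make the recall estimate for that class unreliable.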

In [3]:
# ---
# ### 3.1. Create a Model "Evaluation" Function
# ---
#
# We will be training 4 different models. To compare them fairly,
# we need to do the exact same evaluation for each one.
#
# A "helper function" is the best way to do this. This function will:
# 1. Take a trained model as input.
# 2. Make predictions on our "unseen" test data (X_test_scaled).
# 3. Print a full Classification Report (our main success metric).

def evaluate_model(model, model_name):
    """
    Takes a trained model and a name, makes predictions on the
    test set, and prints a classification report.
    """
    print(f"--- Evaluating: {model_name} ---")
    
    # 1. Make predictions on the test data
    y_pred = model.predict(X_test_scaled)
    
    # 2. Print the classification report
    # We use the 'target_names' list built in step 2.2 to show text labels
    report = classification_report(
        y_test, 
        y_pred, 
        target_names=target_names
    )
    print(report)

# ---
# This cell just defines the function. 
# No output will be shown, but the function is now in memory.
# ---
print("Helper function 'evaluate_model' is defined and ready to use.")

Helper function 'evaluate_model' is defined and ready to use.
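
Since recall on the "Critical" class is our deciding metric, it helps to see how that number is derived. The sketch below computes it by hand from a confusion matrix and checks it against `recall_score`. The label arrays are made up for illustration, with 0 standing for "Critical" as in our encoding.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels (0 = 'Critical', matching the alphabetical encoding).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 0, 0, 2, 1, 0, 2, 2, 2, 3])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])

# Recall for class 0 = correctly flagged class-0 samples / all true class-0 samples
tp = cm[0, 0]
fn = cm[0, :].sum() - tp
manual_recall = tp / (tp + fn)

# The same figure from scikit-learn; average=None returns one value per label.
sk_recall = recall_score(y_true, y_pred, labels=[0], average=None)[0]

print(f"Critical recall: {manual_recall:.2f}")
```

Here three of the four true class-0 samples are caught, so recall is 0.75. The false alarm in the second row lowers precision but leaves recall unchanged.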


In [6]:
# ---
# ### 3.2. Model 1: Logistic Regression (Baseline)
# ---
#
# We will start with a simple, fast model called Logistic Regression.
# This will be our "baseline". A good, complex model (like XGBoost)
# should be able to easily beat this score.

print("--- Training Model 1: Logistic Regression ---")

# 1. Initialize the model
# We set 'class_weight="balanced"' to handle our imbalanced dataset.
# This tells the model to "pay more attention" to the rare 'Critical' class.
# We set 'max_iter=1000' to give it more time to find a good solution.
log_reg = LogisticRegression(
    class_weight='balanced', 
    max_iter=1000, 
    random_state=42
)

# 2. Train the model
# We 'fit' the model on our scaled training data
log_reg.fit(X_train_scaled, y_train)

print("Model training complete.")

# 3. Evaluate the model
# We use the 'evaluate_model' helper defined in step 3.1
evaluate_model(log_reg, "Logistic Regression")

--- Training Model 1: Logistic Regression ---
Model training complete.
--- Evaluating: Logistic Regression ---
              precision    recall  f1-score   support

    Critical       0.62      0.87      0.72        69
        High       0.71      0.83      0.77       290
         Low       0.99      0.98      0.99      2019
      Medium       0.97      0.93      0.95      1622

    accuracy                           0.95      4000
   macro avg       0.82      0.90      0.86      4000
weighted avg       0.96      0.95      0.95      4000
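
To make the "Critical" row of the report above more concrete, the sketch below converts the rounded precision and recall back into approximate sample counts. Because the report rounds to two decimals, treat the results as back-of-the-envelope estimates rather than exact counts.

```python
# Figures copied from the 'Critical' row of the Logistic Regression report above.
support = 69        # true 'Critical' samples in the test set
recall = 0.87       # rounded to two decimals in the report
precision = 0.62    # rounded to two decimals in the report

# recall = TP / support, so TP is approximately recall * support
tp = round(recall * support)
fn = support - tp

# precision = TP / (TP + FP), so predicted positives are approximately TP / precision
predicted_critical = round(tp / precision)
fp = predicted_critical - tp

print(f"caught: ~{tp}, missed: ~{fn}, false alarms: ~{fp}")
```

In other words, the baseline catches roughly 60 of the 69 critical events and raises a few dozen false alarms while doing so.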



In [7]:
# ---
# ### 3.3. Model 2: K-Nearest Neighbors (KNN)
# ---
#
# KNN is a simple but powerful model. It works by "polling" the
# "k" (e.g., 5) closest data points from the training set.
#
# NOTE: This model does *not* have a 'class_weight' parameter.
# It will be interesting to see how it handles our imbalanced data
# without that special instruction.

print("--- Training Model 2: K-Nearest Neighbors ---")

# 1. Initialize the model
# 'n_neighbors=7' is an untuned starting value (scikit-learn's default
# is 5). We could "tune" this number later.
knn = KNeighborsClassifier(n_neighbors=7)

# 2. Train the model
# We 'fit' the model on our scaled training data
knn.fit(X_train_scaled, y_train)

print("Model training complete.")

# 3. Evaluate the model
evaluate_model(knn, "K-Nearest Neighbors (k=7)")

--- Training Model 2: K-Nearest Neighbors ---
Model training complete.
--- Evaluating: K-Nearest Neighbors (k=7) ---
              precision    recall  f1-score   support

    Critical       0.86      0.80      0.83        69
        High       0.91      0.88      0.89       290
         Low       0.98      0.98      0.98      2019
      Medium       0.95      0.97      0.96      1622

    accuracy                           0.96      4000
   macro avg       0.92      0.90      0.91      4000
weighted avg       0.96      0.96      0.96      4000
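
The value of `k` was not tuned above. One possible way to tune it is a grid search that scores each candidate by recall on the rare class. This is a minimal sketch, not part of the original workflow: the data comes from `make_classification` as a hypothetical stand-in, and `rare_recall` is an illustrative scorer name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 3 classes, class 0 is the rare one.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.1, 0.3, 0.6], random_state=0,
)

# Score each candidate by recall on class 0 only.
rare_recall = make_scorer(recall_score, labels=[0], average="macro")

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7, 11]},
    scoring=rare_recall,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

best_k = search.best_params_["knn__n_neighbors"]
print(f"best k: {best_k}, mean rare-class recall: {search.best_score_:.2f}")
```

Applied to our data, the search would be fit on the unscaled `X_train` and `y_train` only, so the test set stays untouched until the final evaluation.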



In [8]:
# ---
# ### 3.4. Model 3: Random Forest Classifier
# ---
#
# This is our first "ensemble" model. It is more flexible than our
# first two models: it builds many "decision trees" (100 here)
# and combines their votes.

print("--- Training Model 3: Random Forest Classifier ---")

# 1. Initialize the model
# We again set 'class_weight="balanced"' to handle our imbalanced data.
# 'n_estimators=100' means it will build 100 decision trees.
rf = RandomForestClassifier(
    n_estimators=100, 
    class_weight='balanced', 
    random_state=42,
    n_jobs=-1  # Uses all available CPU cores to speed up training
)

# 2. Train the model
# This may take a few seconds, as it's a more complex model.
rf.fit(X_train_scaled, y_train)

print("Model training complete.")

# 3. Evaluate the model
evaluate_model(rf, "Random Forest")

--- Training Model 3: Random Forest Classifier ---
Model training complete.
--- Evaluating: Random Forest ---
              precision    recall  f1-score   support

    Critical       1.00      1.00      1.00        69
        High       1.00      1.00      1.00       290
         Low       1.00      1.00      1.00      2019
      Medium       1.00      1.00      1.00      1622

    accuracy                           1.00      4000
   macro avg       1.00      1.00      1.00      4000
weighted avg       1.00      1.00      1.00      4000
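
A perfect report on a single split is worth double-checking. One common way to do that is stratified cross-validation, which repeats the train/test exercise on several different splits. The sketch below shows the mechanics on hypothetical stand-in data; in this notebook the inputs would be `X` and `y_encoded`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data; in the notebook this would be X and y_encoded.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.1, 0.3, 0.6], random_state=0,
)

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# Five stratified folds; 'recall_macro' averages recall over all classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="recall_macro")

print("per-fold macro recall:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

If the scores stay near 1.0 on every fold, the result is stable across splits of this dataset. That still says nothing about data generated by a different process.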



In [9]:
# ---
# ### 3.5. Model 4: XGBClassifier (Advanced)
# ---
#
# This is our most advanced model. XGBoost (eXtreme Gradient Boosting)
# is a widely used gradient-boosting library with a strong track record on tabular data.
#
# Like Random Forest, it's an "ensemble", but it builds trees 
# sequentially, with each new tree learning from the "mistakes" of the last one.

print("--- Training Model 4: XGBClassifier ---")

# 1. Initialize the model
# NOTE: XGBClassifier has no 'class_weight' parameter, and 'scale_pos_weight'
# only applies to binary problems. For multi-class imbalance the usual route
# is to pass per-sample weights to fit(). For a simple comparison, we will
# first try it without special handling and see how it performs.
#
# We set the 'multi:softmax' objective explicitly
# because we have more than 2 classes.
xgb = XGBClassifier(
    objective='multi:softmax',  # Use 'multi:softmax' for multi-class problems
    num_class=4,                # We have 4 classes
    random_state=42,
    n_jobs=-1
)

# 2. Train the model
# This may also take a few seconds.
xgb.fit(X_train_scaled, y_train)

print("Model training complete.")

# 3. Evaluate the model
evaluate_model(xgb, "XGBClassifier")

--- Training Model 4: XGBClassifier ---
Model training complete.
--- Evaluating: XGBClassifier ---
              precision    recall  f1-score   support

    Critical       0.97      0.99      0.98        69
        High       0.99      0.98      0.98       290
         Low       1.00      1.00      1.00      2019
      Medium       1.00      1.00      1.00      1622

    accuracy                           1.00      4000
   macro avg       0.99      0.99      0.99      4000
weighted avg       1.00      1.00      1.00      4000
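
The XGBoost model above was trained without any imbalance handling. If we wanted the equivalent of `class_weight='balanced'`, one option is to compute per-row weights with scikit-learn and pass them to `fit()` through its `sample_weight` argument. The sketch below shows only the weight computation on made-up labels; the `fit()` call is left as a comment, so treat it as an outline rather than a tested change to this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy labels standing in for y_train (class 0 is rare).
y_toy = np.array([0] * 2 + [1] * 8 + [2] * 40 + [3] * 50)

# One weight per training row; rows from rare classes get larger weights.
weights = compute_sample_weight(class_weight="balanced", y=y_toy)

print(f"weight of a class-0 row: {weights[0]:.2f}")
print(f"weight of a class-3 row: {weights[-1]:.2f}")

# In the notebook, the weights would be passed at fit time, for example:
# xgb.fit(X_train_scaled, y_train,
#         sample_weight=compute_sample_weight("balanced", y_train))
```

Whether weighting helps here is an open question, since the unweighted model already reaches 0.99 recall on "Critical" in the report above.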



In [10]:
# ---
# ### 4.1. Model Comparison & Analysis
# ---

print("--- Model 'Tournament' Results ---")
print("Based on 'Critical' class Recall (our most important metric):\n")
print("1. Logistic Regression: 87%")
print("2. K-Nearest Neighbors: 80%")
print("3. Random Forest:         100% (Perfect; needs scrutiny)")
print("4. XGBClassifier:         99%")

print("\n--- Analysis ---")
print("A perfect test-set score for Random Forest is unusual and needs scrutiny.")
print("It is not overfitting in the usual sense, since the test data was held out;")
print("the model has likely recovered the exact rules of our synthetic data.")
print("Neither score shows how these models will behave on new, 'messy' real data.")

print("\n--- NEW PLAN ---")
print("We will save BOTH top models for a final head-to-head comparison")
print("in Notebook 4: Model Interpretation.")


# ---
# ### 4.2. Save Both Top Models
# ---

# 'rf' is the variable holding our Random Forest model from step 3.4
# 'xgb' is the variable holding our XGBoost model from step 3.5
rf_model_path = os.path.join(MODELS_DIR, 'rf_model.pkl')
xgb_model_path = os.path.join(MODELS_DIR, 'xgb_model.pkl')

try:
    # Save Random Forest
    with open(rf_model_path, 'wb') as f:
        pickle.dump(rf, f)
    
    # Save XGBoost
    with open(xgb_model_path, 'wb') as f:
        pickle.dump(xgb, f)
    
    print("\n--- SUCCESS! ---")
    print(f"Random Forest model saved to: {rf_model_path}")
    print(f"XGBoost model saved to: {xgb_model_path}")

except Exception as e:
    print("--- ERROR ---")
    print(f"An error occurred while saving the models: {e}")

--- Model 'Tournament' Results ---
Based on 'Critical' class Recall (our most important metric):

1. Logistic Regression: 87%
2. K-Nearest Neighbors: 80%
3. Random Forest:         100% (Perfect; needs scrutiny)
4. XGBClassifier:         99%

--- Analysis ---
A perfect test-set score for Random Forest is unusual and needs scrutiny.
It is not overfitting in the usual sense, since the test data was held out;
the model has likely recovered the exact rules of our synthetic data.
Neither score shows how these models will behave on new, 'messy' real data.

--- NEW PLAN ---
We will save BOTH top models for a final head-to-head comparison
in Notebook 4: Model Interpretation.

--- SUCCESS! ---
Random Forest model saved to: ..\models\rf_model.pkl
XGBoost model saved to: ..\models\xgb_model.pkl
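
The saved artifacts are only useful if they are applied in the same order at inference time: scale, predict, then decode. The sketch below walks through that round trip with small in-memory stand-ins; the toy data and the `*_demo` and `*_loaded` names are hypothetical, and `pickle.dumps` and `pickle.loads` replace the file I/O.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy training data with two features and two text labels.
X_toy = np.array([[0.0, 1.0], [1.0, 3.0], [8.0, 9.0], [9.0, 12.0]])
y_text = np.array(["Low", "Low", "High", "High"])

le_demo = LabelEncoder().fit(y_text)
scaler_demo = StandardScaler().fit(X_toy)
model_demo = LogisticRegression().fit(
    scaler_demo.transform(X_toy), le_demo.transform(y_text)
)

# Serialize and restore in memory; the notebook writes the same kind of bytes to .pkl files.
blob = pickle.dumps((le_demo, scaler_demo, model_demo))
le_loaded, scaler_loaded, model_loaded = pickle.loads(blob)

# Inference must repeat the training-time steps: scale, predict, decode.
new_row = np.array([[8.5, 10.0]])
code = model_loaded.predict(scaler_loaded.transform(new_row))
label = le_loaded.inverse_transform(code)[0]
print(label)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that created them.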
