# Part 2: Model Training and Prediction

This notebook focuses on the machine learning workflow:
1.  **Load Processed Data**: Load the cleaned data from the previous step.
2.  **Feature Preparation**: One-hot encode categorical variables.
3.  **Train-Test Split**: Split the data for model training and validation.
4.  **Model Training**: Train a Gradient Boosting Regressor model.
5.  **Model Evaluation**: Evaluate the model using Mean Absolute Error (MAE).
6.  **Prediction**: Demonstrate how to use the trained model to make predictions on new data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

## 1. Load Processed Data

In [2]:
df = pd.read_csv('../data/processed/merged_data.csv')
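
The later cells assume that certain columns exist in the loaded file. A quick check right after loading can catch a mismatched or stale CSV early. The sketch below is illustrative only: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, the column list is taken from the drop, encode, and target steps in this notebook, and the toy frame stands in for `merged_data.csv`.

```python
import pandas as pd

# Columns referenced by later cells (drop list, 'Operation', and the target).
REQUIRED_COLUMNS = [
    'ID', 'Debut', 'Fin', 'Machine', 'Outil', 'Parcelle',
    'Operation', 'Consommation (L)',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Toy stand-in for the real file; the values are made up.
toy = pd.DataFrame({'ID': [1], 'Operation': ['Semis'], 'Consommation (L)': [0.5]})

print(missing_columns(toy))
```

On the real data, the same call would be `missing_columns(df)`; an empty list means every expected column is present.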

## 2. Feature Preparation

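We select the features (`X`) and the target (`Y`), which is the fuel consumption in liters (`Consommation (L)`). Identifier and other non-feature columns (`ID`, `Debut`, `Fin`, `Machine`, `Outil`, `Parcelle`) are dropped. The categorical `Operation` column is one-hot encoded so the model can use it; `drop_first=True` omits one category, which then serves as the reference level.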

In [3]:
# Drop identifier columns and other non-feature columns
df_model = df.drop(columns=['ID', 'Debut', 'Fin', 'Machine', 'Outil', 'Parcelle'])

# One-hot encode the 'Operation' column
df_model = pd.get_dummies(df_model, columns=['Operation'], drop_first=True)

# Define features (X) and target (Y)
X = df_model.drop('Consommation (L)', axis=1)
Y = df_model['Consommation (L)']

print("Features (X) shape:", X.shape)
print("Target (Y) shape:", Y.shape)
display(X.head())

Features (X) shape: (500, 15)
Target (Y) shape: (500,)


|   | Puissance | Largeur | Duree_mn | Distance_km | Vitesse_moy_kmph | Vitesse_med_kmph | Vitesse_max_kmph | Accélération_moy_kmph2 | Accélération_max_kmph2 | Surface_ha | Perimetre_km | Complexite | Operation_Semis | Operation_Traitements phytosanitaires | Operation_Travaux du sol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 134 | 2.4 | 118.6 | 44.158868 | 6.790227 | 8.0 | 10 | 0.0 | 4800.0 | 20.171059 | 1.805188 | 15.0 | False | False | True |
| 1 | 134 | 2.4 | 190.4 | 39.695623 | 6.111286 | 7.0 | 11 | 0.0 | 4800.0 | 20.171059 | 1.805188 | 15.0 | False | False | True |
| 2 | 134 | 2.4 | 298.0 | 55.914311 | 6.207984 | 8.0 | 13 | 0.0 | 6000.0 | 20.171059 | 1.805188 | 15.0 | False | False | True |
| 3 | 134 | 2.4 | 46.2 | 16.810207 | 6.861771 | 8.0 | 13 | -10.38961 | 7800.0 | 17.314159 | 1.928989 | 14.0 | False | False | True |
| 4 | 134 | 2.4 | 54.8 | 19.743015 | 7.530055 | 8.0 | 10 | 1.094891 | 5400.0 | 10.491037 | 1.474957 | 7.0 | False | False | True |
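
To make the effect of `drop_first=True` concrete, here is a minimal sketch on toy data. The `'Récolte'` category and the numeric values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data; 'Récolte' is a made-up category used only for illustration.
toy = pd.DataFrame({
    'Duree_mn': [10.0, 20.0, 30.0],
    'Operation': ['Récolte', 'Semis', 'Travaux du sol'],
})

# Categories are sorted, and drop_first=True removes the first one ('Récolte').
encoded = pd.get_dummies(toy, columns=['Operation'], drop_first=True)

print(list(encoded.columns))
# ['Duree_mn', 'Operation_Semis', 'Operation_Travaux du sol']
```

A row belonging to the dropped category is represented by all indicator columns being false, so no information is lost.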


## 3. Train-Test Split

We hold out 20% of the rows (`test_size=0.2`, i.e. 100 of the 500) as a validation set and train on the remaining 80%. Fixing `random_state=42` makes the split identical on every run, so the results are reproducible.

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=42)
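
The split sizes and the reproducibility claim can be checked on synthetic arrays of the same length. This is a sketch with made-up data; the `_demo` and `_again` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 500 rows, matching the dataset's row count.
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.arange(500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_va.shape)  # (400, 2) (100, 2)

# Repeating the call with the same random_state yields the same rows.
_, _, _, y_va_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_va, y_va_again))  # True
```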

## 4. Model Training

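We use a `GradientBoostingRegressor`, an ensemble that fits decision trees sequentially, with each new tree correcting the errors of the trees before it. We set `random_state=42` again for reproducibility; this matters here because `max_features='sqrt'` makes each split consider a random subset of the features.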

In [5]:
gbr = GradientBoostingRegressor(
    n_estimators=100,
    min_samples_split=3,
    max_features='sqrt',
    max_depth=10,
    criterion='squared_error',
    random_state=42
)

gbr.fit(X_train, y_train)
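
Once a gradient boosting model is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the predictions. The sketch below uses synthetic data with made-up column names (`signal`, `noise`) and default hyperparameters, so it illustrates the mechanics rather than this notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic features: the target depends on 'signal' only.
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the model above, the equivalent would be `pd.Series(gbr.feature_importances_, index=X_train.columns)`.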

## 5. Model Evaluation

We evaluate the model on the validation set using the Mean Absolute Error (MAE): the average absolute difference between predicted and actual consumption, expressed in liters.

In [6]:
y_pred = gbr.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)

print(f"Mean Absolute Error on Validation Set: {mae:.4f} Liters")

Mean Absolute Error on Validation Set: 0.5438 Liters
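
An MAE value is easier to interpret next to a naive baseline. The sketch below first computes MAE by hand on a tiny made-up example, then scores a `DummyRegressor` that always predicts the training median. All arrays here are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# MAE by hand: mean of |actual - predicted|.
y_true = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0])
mae_manual = np.mean(np.abs(y_true - y_hat))      # (0.5 + 0.0 + 1.0) / 3
mae_sklearn = mean_absolute_error(y_true, y_hat)
print(mae_manual, mae_sklearn)  # 0.5 0.5

# Baseline: always predict the median of the training targets (2.5 here).
y_train_demo = np.array([1.0, 2.0, 3.0, 10.0])
y_val_demo = np.array([2.0, 4.0])
baseline = DummyRegressor(strategy='median')
# DummyRegressor ignores the features, so placeholder zeros are enough.
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
baseline_pred = baseline.predict(np.zeros((len(y_val_demo), 1)))
baseline_mae = mean_absolute_error(y_val_demo, baseline_pred)
print(baseline_mae)  # 1.0
```

Applied to this notebook, the baseline would be fitted on `X_train, y_train` and scored on `X_val, y_val`; a model MAE well below the baseline MAE indicates that the features carry real signal.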


## 6. Training on Full Dataset and Predicting on Test Data

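In practice, the final step is to predict on a separate test dataset. We do not have one here, so we re-train the model on the *entire* dataset and reuse the validation rows as a stand-in to demonstrate the prediction process. Because those rows are now part of the training data, the predictions below illustrate the mechanics only and say nothing about accuracy on unseen data.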

In [7]:
# Re-train the model on the full dataset
print("Training final model on the full dataset...")
final_model = GradientBoostingRegressor(
    n_estimators=100,
    min_samples_split=3,
    max_features='sqrt',
    max_depth=10,
    criterion='squared_error',
    random_state=42
)
final_model.fit(X, Y)
print("Final model training complete.")

Training final model on the full dataset...
Final model training complete.


In [8]:
# This cell shows how to make predictions on a new, unseen test set.
# Real test data must first be loaded and processed with the *exact same steps* as the training data.
# Here the validation set stands in for new data. final_model was trained on these rows,
# so the predictions demonstrate the mechanics only, not performance on unseen data.

X_test_example = X_val.copy()

# Predict fuel consumption
test_predictions = final_model.predict(X_test_example)

# Create a DataFrame with the predictions
df_predictions = pd.DataFrame({'Predicted_Consumption_L': test_predictions})

print("Example predictions on new data:")
display(df_predictions.head())

# Optionally, save the predictions to a CSV file
df_predictions.to_csv('../predictions.csv', index=False)

Example predictions on new data:


|   | Predicted_Consumption_L |
|---|---|
| 0 | 0.971848 |
| 1 | 0.063713 |
| 2 | 0.251638 |
| 3 | 0.100444 |
| 4 | 0.14399 |
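
Processing real test data with the same steps has one common pitfall: if the new file does not contain every `Operation` category, `pd.get_dummies` produces fewer columns than the model was trained on. A minimal sketch of one way to handle this is to encode the new data and then reindex it to the training columns. The toy frames and the `'Récolte'` category are made up for illustration.

```python
import pandas as pd

# Toy training data, encoded as in the feature-preparation step.
train_raw = pd.DataFrame({
    'Duree_mn': [10.0, 20.0, 30.0],
    'Operation': ['Récolte', 'Semis', 'Travaux du sol'],
})
X_train_demo = pd.get_dummies(train_raw, columns=['Operation'], drop_first=True)

# New data containing a single category.
new_raw = pd.DataFrame({'Duree_mn': [15.0], 'Operation': ['Semis']})

# Encode without drop_first: with one category present, drop_first would
# remove the only indicator column.
new_encoded = pd.get_dummies(new_raw, columns=['Operation'])

# Align to the training columns: missing indicators become False,
# and any column unknown to the model is discarded.
X_new = new_encoded.reindex(columns=X_train_demo.columns, fill_value=False)

print(list(X_new.columns))
# ['Duree_mn', 'Operation_Semis', 'Operation_Travaux du sol']
```

With the real model, the aligned frame would be built against `X.columns` and passed to `final_model.predict`.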
