# Profit Prediction Model – Regression
This notebook builds machine learning models to predict Profit_per_kg using Mandi–Retail arbitrage data.

## 1. Import required libraries

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import joblib

## 2. Load final processed dataset

We use the `final_arbitrage_dataset.csv` created in the data preparation step.

In [14]:
df = pd.read_csv("../data/processed/final_arbitrage_dataset.csv", parse_dates=["Date"])
df.head()

| | State | Market | Commodity | Date | Min_Price | Max_Price | Modal_Price | City | Retail_Price | Distance_km | Transport_rate_per_km | Transport_cost_per_kg | Price_Spread | Profit_per_kg | Profit_per_ton | Profit_per_truck | Opportunity_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Delhi | Lasalgaon | Chilli | 2024-01-09 | 25 | 45 | 35.0 | Bangalore | 66 | 407 | 28 | 11.396 | 31.0 | 19.604 | 19604.0 | 98020.0 | 100.0 |
| 1 | Delhi | Lasalgaon | Chilli | 2024-01-09 | 25 | 45 | 35.0 | Bangalore | 66 | 509 | 19 | 9.671 | 31.0 | 21.329 | 21329.0 | 106645.0 | 100.0 |
| 2 | Delhi | Lasalgaon | Chilli | 2024-01-09 | 25 | 45 | 35.0 | Bangalore | 66 | 23 | 32 | 0.736 | 31.0 | 30.264 | 30264.0 | 151320.0 | 100.0 |
| 3 | Delhi | Lasalgaon | Chilli | 2024-01-09 | 25 | 45 | 35.0 | Bangalore | 66 | 55 | 35 | 1.925 | 31.0 | 29.075 | 29075.0 | 145375.0 | 100.0 |
| 4 | Delhi | Lasalgaon | Chilli | 2024-01-09 | 25 | 45 | 35.0 | Bangalore | 66 | 128 | 23 | 2.944 | 31.0 | 28.056 | 28056.0 | 140280.0 | 100.0 |


## 3. Quick sanity check (shape & columns)

In [15]:
df.shape, df.columns

((5993, 17),
 Index(['State', 'Market', 'Commodity', 'Date', 'Min_Price', 'Max_Price',
        'Modal_Price', 'City', 'Retail_Price', 'Distance_km',
        'Transport_rate_per_km', 'Transport_cost_per_kg', 'Price_Spread',
        'Profit_per_kg', 'Profit_per_ton', 'Profit_per_truck',
        'Opportunity_Score'],
       dtype='object'))

## 4. Select features (X) and target (y)

We will predict **Profit_per_kg** using numerical features:

- Modal_Price
- Retail_Price
- Distance_km
- Transport_rate_per_km
- Transport_cost_per_kg
- Price_Spread

In [16]:
features = [
    "Modal_Price",
    "Retail_Price",
    "Distance_km",
    "Transport_rate_per_km",
    "Transport_cost_per_kg",
    "Price_Spread",
]

target = "Profit_per_kg"

X = df[features]
y = df[target]

X.head(), y.head()

(   Modal_Price  Retail_Price  Distance_km  Transport_rate_per_km  \
 0         35.0            66          407                     28   
 1         35.0            66          509                     19   
 2         35.0            66           23                     32   
 3         35.0            66           55                     35   
 4         35.0            66          128                     23   
 
    Transport_cost_per_kg  Price_Spread  
 0                 11.396          31.0  
 1                  9.671          31.0  
 2                  0.736          31.0  
 3                  1.925          31.0  
 4                  2.944          31.0  ,
 0    19.604
 1    21.329
 2    30.264
 3    29.075
 4    28.056
 Name: Profit_per_kg, dtype: float64)

## 5. Train–test split

We split the data into:
- 80% training
- 20% testing

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

((4794, 6), (1199, 6))
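
The shapes above follow from how the split sizes are computed. As a minimal sketch, assuming a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to training (which matches the output here), the arithmetic is:

```python
import math

n_samples = 5993   # rows in df, from df.shape in section 3
test_size = 0.2

# Assumption: the test count is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because `random_state=42` is fixed, the same rows land in each subset on every run.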

## 6. Model 1 – Linear Regression

We first fit a simple **Linear Regression** model as a baseline.

In [18]:
lr = LinearRegression()
lr.fit(X_train, y_train)

lr_pred = lr.predict(X_test)

## 7. Evaluate Linear Regression

We evaluate using:
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- R² (Coefficient of determination)
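
For reference, the three metrics can be written out directly. The following is a small NumPy sketch of the standard definitions; the function name `regression_metrics` and the toy arrays are illustrative and not part of this notebook, which uses the scikit-learn functions instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) using the standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # root of the mean squared error

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    return mae, rmse, r2

# Toy values for illustration only.
mae, rmse, r2 = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

MAE and RMSE are in the same units as the target (profit per kg), while R² is unitless.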

In [19]:
lr_mae = mean_absolute_error(y_test, lr_pred)
lr_mse = mean_squared_error(y_test, lr_pred)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, lr_pred)

print("Linear Regression Performance:")
print(f"MAE  : {lr_mae:.4f}")
print(f"RMSE : {lr_rmse:.4f}")
print(f"R²   : {lr_r2:.4f}")

Linear Regression Performance:
MAE  : 0.0000
RMSE : 0.0000
R²   : 1.0000
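
A perfect score on held-out data usually means the target can be computed exactly from the features. The sketch below checks this on the five rows printed in section 2. The relation `Profit_per_kg = Price_Spread - Transport_cost_per_kg` is inferred from those rows and is an assumption about how the data preparation step built the column, not something this notebook states.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head() in section 2 (relevant columns only).
sample = pd.DataFrame({
    "Price_Spread": [31.0, 31.0, 31.0, 31.0, 31.0],
    "Transport_cost_per_kg": [11.396, 9.671, 0.736, 1.925, 2.944],
    "Profit_per_kg": [19.604, 21.329, 30.264, 29.075, 28.056],
})

# Assumed relation, inferred from the sample rows.
reconstructed = sample["Price_Spread"] - sample["Transport_cost_per_kg"]
is_exact = np.allclose(reconstructed, sample["Profit_per_kg"])
print("Target reproduced from two features:", is_exact)
```

Running the same check on the full `df` would show whether the relation holds for every row. If it does, the model is recovering an identity already present in the features rather than forecasting an unknown profit.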


## 8. Model 2 – Random Forest Regressor

Next we train a **RandomForestRegressor**, an ensemble of decision trees that can capture non-linear relationships.

In [20]:
rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

## 9. Evaluate Random Forest Regressor

In [21]:
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest Performance:")
print(f"MAE  : {rf_mae:.4f}")
print(f"RMSE : {rf_rmse:.4f}")
print(f"R²   : {rf_r2:.4f}")

Random Forest Performance:
MAE  : 0.3743
RMSE : 0.5503
R²   : 0.9997


## 10. Model comparison

We compare Linear Regression vs Random Forest on the same test set.

In [22]:
print("Linear Regression  → R²:", round(lr_r2, 4), "| RMSE:", round(lr_rmse, 4))
print("Random Forest      → R²:", round(rf_r2, 4), "| RMSE:", round(rf_rmse, 4))

Linear Regression  → R²: 1.0 | RMSE: 0.0
Random Forest      → R²: 0.9997 | RMSE: 0.5503
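
To make the comparison explicit, the reported numbers can be ranked programmatically. This is a small sketch using the metrics printed above; the `results` dictionary is illustrative and not a variable in this notebook.

```python
# Test-set metrics copied from the output above.
results = {
    "Linear Regression": {"r2": 1.0, "rmse": 0.0},
    "Random Forest": {"r2": 0.9997, "rmse": 0.5503},
}

# Rank by lowest RMSE; ties would be broken by insertion order.
best_name = min(results, key=lambda name: results[name]["rmse"])
print("Lowest RMSE:", best_name)
```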


## 11. Save model for deployment

We save the **Random Forest model** as a `.pkl` file in the `models/` folder (`../models/` relative to this notebook).
The Streamlit dashboard will load this file later.

In [23]:
import os
os.makedirs("../models", exist_ok=True)

model_path = "../models/profit_model.pkl"
joblib.dump(rf, model_path)

model_path

'../models/profit_model.pkl'
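
Since the dashboard will load this file later, it can help to confirm that a saved model round-trips correctly. The sketch below trains a small stand-in forest on the five sample rows from section 2 and writes it to a temporary directory, so it does not touch `../models/profit_model.pkl`. The names `tiny_rf`, `X_small`, `y_small`, and `tmp_path` are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: [Distance_km, Transport_rate_per_km] -> Profit_per_kg,
# taken from the five rows shown in section 2.
X_small = [[407, 28], [509, 19], [23, 32], [55, 35], [128, 23]]
y_small = [19.604, 21.329, 30.264, 29.075, 28.056]

tiny_rf = RandomForestRegressor(n_estimators=10, random_state=42)
tiny_rf.fit(X_small, y_small)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "profit_model.pkl")
    joblib.dump(tiny_rf, tmp_path)
    loaded = joblib.load(tmp_path)

# The reloaded model should give identical predictions.
same = bool((loaded.predict(X_small) == tiny_rf.predict(X_small)).all())
print("Round-trip predictions match:", same)
```

In the dashboard, the real model expects the six columns listed in `features`, in the same order used for training.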

## 12. Modeling Summary

- Target variable: **Profit_per_kg**
- Features used:
  - Modal_Price
  - Retail_Price
  - Distance_km
  - Transport_rate_per_km
  - Transport_cost_per_kg
  - Price_Spread
- Models trained:
  - Linear Regression (baseline)
  - Random Forest Regressor (non-linear)

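### Final Selected Model
- **Random Forest Regressor** (the model saved in section 11)
- Test-set results:
  - Linear Regression: R² 1.0000, RMSE 0.0000
  - Random Forest: R² 0.9997, RMSE 0.5503
- Note:
  - On this test set Linear Regression, not Random Forest, has the higher R² and the lower error.
  - In the sample rows in section 2, Profit_per_kg equals Price_Spread minus Transport_cost_per_kg, so a linear model can reproduce the target exactly from these features.
- Saved as: `models/profit_model.pkl`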

This model will be used in the dashboard to **predict expected profit per kg** for a given route and show recommended profitable trades.
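
To predict for a given route, the dashboard has to assemble the same six features the model was trained on. A hedged sketch of that step follows. The helper `build_feature_row` is hypothetical, and the two derived formulas (`Price_Spread = Retail_Price - Modal_Price` and `Transport_cost_per_kg = Distance_km * Transport_rate_per_km / 1000`) are inferred from the sample rows in section 2, not confirmed by the data preparation code.

```python
FEATURES = [
    "Modal_Price",
    "Retail_Price",
    "Distance_km",
    "Transport_rate_per_km",
    "Transport_cost_per_kg",
    "Price_Spread",
]

def build_feature_row(modal_price, retail_price, distance_km, rate_per_km):
    """Hypothetical helper: derive the six model inputs for one route."""
    row = {
        "Modal_Price": modal_price,
        "Retail_Price": retail_price,
        "Distance_km": distance_km,
        "Transport_rate_per_km": rate_per_km,
        # Assumed formulas, inferred from the sample rows.
        "Transport_cost_per_kg": distance_km * rate_per_km / 1000,
        "Price_Spread": retail_price - modal_price,
    }
    # Return the values in training column order.
    return [row[name] for name in FEATURES]

# Inputs taken from the first sample row in section 2.
print(build_feature_row(35.0, 66, 407, 28))
```

The resulting list can be wrapped in a one-row DataFrame with `FEATURES` as its columns and passed to the loaded model's `predict` method.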