# Resale and GST Predictor - Model Training Notebook

## 📌 Introduction
This project aims to **predict resale prices of used vehicles** and **adjust new vehicle prices according to revised GST slabs**.  

- **Used Cars & Bikes** → Machine Learning regression models.  
- **New Cars & Bikes** → Rule-based GST adjustment (handled in `app.py`).  

Datasets used:  
- `used_cars.csv`  
- `used_bikes.csv`  

The trained models are saved as `.pkl` files and later loaded in the Flask app (`app.py`) for real-time predictions.
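The rule-based GST adjustment lives in `app.py` and is not shown in this notebook. As a rough sketch only, one common way to re-price under a revised slab is to strip the old GST from the tax-inclusive price and apply the new rate. The function name `adjust_price_for_gst` and the 28% and 18% rates are illustrative assumptions, not the values or logic used by the app, and the sketch ignores any cess or other levies.

```python
def adjust_price_for_gst(price_incl_old_gst, old_rate, new_rate):
    """Re-price a tax-inclusive amount under a different GST rate.

    Hypothetical helper: rates are fractions (0.28 means 28%).
    """
    if old_rate < 0 or new_rate < 0:
        raise ValueError("GST rates must be non-negative")
    base_price = price_incl_old_gst / (1 + old_rate)  # remove the old tax
    return base_price * (1 + new_rate)                # apply the new tax


# Placeholder numbers for illustration only.
new_price = adjust_price_for_gst(128000, old_rate=0.28, new_rate=0.18)
print(round(new_price, 2))
```

Keeping the rates as arguments rather than constants would let the same function serve several slabs, for example one per vehicle category.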

In [3]:
# 📦 Import Libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
import datetime

In [2]:
# ⚙️ Helper function to normalize column names
# (redefined with year-to-age handling in the Load & Preprocess cell below)
def normalize_and_map(df, mapping):
    df = df.rename(columns={col: col.strip().lower().replace(" ", "_") for col in df.columns})
    df = df.rename(columns=mapping)
    return df

## 📂 Load & Preprocess Data
We load both datasets, clean the column names, and create the features used for model training:
- Convert `year` → `age` (current year minus `year`)
- One-hot encode categorical features (`fuel_type`, plus `transmission` for cars)
- Keep only numeric + encoded features
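The steps listed above can be tried on a few rows before running them on the real files. This is a minimal sketch on made-up data; the `toy` frame and its values are assumptions, not rows from `used_cars.csv` or `used_bikes.csv`.

```python
import datetime
import pandas as pd

# Toy rows with made-up values; the real data comes from the CSV files.
toy = pd.DataFrame({
    "year": [2015, 2020, 2018],
    "km_driven": [60000, 15000, 32000],
    "fuel_type": ["Petrol", "Diesel", "Petrol"],
})

# year -> age, relative to the year in which the notebook is run
toy["age"] = datetime.datetime.now().year - toy["year"]
toy = toy.drop(columns=["year"])

# One-hot encode; drop_first=True drops the first category in sorted
# order ("Diesel" here), which becomes the implicit baseline.
features = pd.get_dummies(toy, columns=["fuel_type"], drop_first=True)
print(list(features.columns))
```

Because `age` is computed from the current year, retraining in a later year shifts every age value by the same amount, so the model and the app should use the same reference year.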

In [5]:
# --- Load & Preprocess Data ---
import pandas as pd
import datetime

# --- Utility: Normalize and map column names ---
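def normalize_and_map(df, mapping=None):
    """Return a copy of df with mapped, snake_case columns and 'year' replaced by 'age'."""
    # Work on a copy so the caller's DataFrame is never modified in place
    df = df.copy()

    # Rename columns based on mapping (keys must match the raw CSV headers)
    if mapping:
        df = df.rename(columns=mapping)

    # Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # If 'year' exists, calculate 'age' relative to the current calendar year
    if "year" in df.columns:
        current_year = datetime.datetime.now().year
        df["age"] = current_year - df["year"]
        df = df.drop(columns=["year"])

    return df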

# --- Load datasets (CSV files should be in the same folder as this notebook) ---
used_car_df = pd.read_csv("used_cars.csv")
used_bike_df = pd.read_csv("used_bikes.csv")

# --- Define column mappings if needed ---
car_mapping = {"Selling_Price": "resale_price"}  # handle different naming
bike_mapping = {"Selling_Price": "resale_price"}

# --- Normalize datasets ---
used_car_df = normalize_and_map(used_car_df, car_mapping)
used_bike_df = normalize_and_map(used_bike_df, bike_mapping)

# --- Check data ---
print("Used Car Dataset:")
display(used_car_df.head())

print("Used Bike Dataset:")
display(used_bike_df.head())


Used Car Dataset:


| Unnamed: 0 | model_name | km_driven | fuel_type | transmission | ownership | engine_cc | seats | resale_price | age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Amaze 1.2 VX i-VTEC | 87150 | Petrol | Manual | 1 | 1198.0 | 5.0 | 505000 | 8 |
| 1 | Swift DZire VDI | 75000 | Diesel | Manual | 2 | 1248.0 | 5.0 | 450000 | 11 |
| 2 | i10 Magna 1.2 Kappa2 | 67000 | Petrol | Manual | 1 | 1197.0 | 5.0 | 220000 | 14 |
| 3 | Glanza G | 37500 | Petrol | Manual | 1 | 1197.0 | 5.0 | 799000 | 6 |
| 4 | Innova 2.4 VX 7 STR [2016-2020] | 69000 | Diesel | Manual | 1 | 2393.0 | 7.0 | 1950000 | 7 |


Used Bike Dataset:


| Unnamed: 0 | model_name | km_driven | ownership | age | fuel_type | engine_cc | resale_price |
|---|---|---|---|---|---|---|---|
| 0 | TVS Star City Plus Dual Tone 110cc | 17654 | 1 | 3 | Petrol | 110 | 35000 |
| 1 | Royal Enfield Classic 350cc | 11000 | 1 | 4 | Petrol | 350 | 119900 |
| 2 | Triumph Daytona 675R | 110 | 1 | 8 | Petrol | 675 | 600000 |
| 3 | TVS Apache RTR 180cc | 16329 | 1 | 4 | Petrol | 180 | 65000 |
| 4 | Yamaha FZ S V 2.0 150cc-Ltd. Edition | 10000 | 1 | 3 | Petrol | 150 | 80000 |


## 🏍️ Train Bike Price Model
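We use **RandomForestRegressor** to predict used bike prices from `km_driven`, `age`, `engine_cc`, and one-hot encoded `fuel_type`, holding out 20% of the rows for evaluation.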

In [6]:
# --- Feature Selection, Train & Save Bike Model ---
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

# --- Create 'models' folder if not exists ---
os.makedirs("models", exist_ok=True)

# --- Features & Target ---
X_bike = pd.get_dummies(
    used_bike_df[['km_driven','age','fuel_type','engine_cc']],
    columns=['fuel_type'], drop_first=True
)
y_bike = used_bike_df['resale_price']

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X_bike, y_bike, test_size=0.2, random_state=42)

# --- Train Model ---
bike_model = RandomForestRegressor(random_state=42)
bike_model.fit(X_train, y_train)

# --- Evaluate ---
bike_preds = bike_model.predict(X_test)
print("Bike Model R²:", r2_score(y_test, bike_preds))
print("Bike Model MAE:", mean_absolute_error(y_test, bike_preds))

# --- Save Model & Columns ---
joblib.dump(bike_model, "models/bike_model.pkl")
joblib.dump(list(X_bike.columns), "models/bike_columns.pkl")

print("✅ Bike model and columns saved successfully!")


Bike Model R²: 0.9302423406231619
Bike Model MAE: 4300.1325849728255
✅ Bike model and columns saved successfully!
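To help read the two metrics printed above: MAE is the average absolute gap between predicted and actual price, in the same units as `resale_price`, while R² is the share of price variance the model explains (1.0 is a perfect fit). The sketch below computes both by hand on made-up numbers; the helper names and sample values are illustrative assumptions, and the notebook itself relies on `sklearn.metrics`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices and predictions, not taken from the test split.
y_true_demo = [35000, 119900, 65000, 80000]
y_pred_demo = [40000, 110000, 70000, 75000]

mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)
print("MAE:", mae_demo)
print("R²:", round(r2_demo, 4))
```

Because MAE is expressed in price units, it is only comparable between the bike and car models relative to the typical price in each dataset.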


## 🚗 Train Car Price Model
We use the same pipeline for used car prices, adding `transmission` as a second one-hot encoded feature.

In [7]:
# --- Feature Selection, Train & Save Car Model ---
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

# --- Create 'models' folder if not exists ---
os.makedirs("models", exist_ok=True)

# --- Features & Target ---
X_car = pd.get_dummies(
    used_car_df[['km_driven','age','fuel_type','engine_cc','transmission']],
    columns=['fuel_type','transmission'], drop_first=True
)
y_car = used_car_df['resale_price']

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X_car, y_car, test_size=0.2, random_state=42)

# --- Train Model ---
car_model = RandomForestRegressor(random_state=42)
car_model.fit(X_train, y_train)

# --- Evaluate ---
car_preds = car_model.predict(X_test)
print("Car Model R²:", r2_score(y_test, car_preds))
print("Car Model MAE:", mean_absolute_error(y_test, car_preds))

# --- Save Model & Columns ---
joblib.dump(car_model, "models/car_model.pkl")
joblib.dump(list(X_car.columns), "models/car_columns.pkl")

print("✅ Car model and columns saved successfully!")


Car Model R²: 0.7565624253345127
Car Model MAE: 470798.32228039764
✅ Car model and columns saved successfully!


In [8]:
# --- Test Loading Saved Models ---
import joblib
import pandas as pd

# Load bike model and columns
bike_model = joblib.load("models/bike_model.pkl")
bike_columns = joblib.load("models/bike_columns.pkl")

# Load car model and columns
car_model = joblib.load("models/car_model.pkl")
car_columns = joblib.load("models/car_columns.pkl")

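# Quick test input (dummy): every one-hot column stays 0, which corresponds to
# the baseline category that drop_first=True removed during training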
test_bike = pd.DataFrame([{col: 0 for col in bike_columns}])
test_bike.loc[0, ['km_driven','age','engine_cc']] = [10000, 3, 150]
print("Sample Bike Prediction:", bike_model.predict(test_bike)[0])

test_car = pd.DataFrame([{col: 0 for col in car_columns}])
test_car.loc[0, ['km_driven','age','engine_cc']] = [40000, 5, 1200]
print("Sample Car Prediction:", car_model.predict(test_car)[0])


Sample Bike Prediction: 97803.71013431012
Sample Car Prediction: 724779.74
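The smoke test above fills the one-hot columns with zeros by hand. For real inputs, the saved column list can be used to align an encoded record with what the model was trained on. The sketch below shows one possible way to do that with `DataFrame.reindex`; the `build_model_input` helper and the contents of `saved_columns` are assumptions for illustration, since the actual dummy column names depend on the categories present in the training data.

```python
import pandas as pd

# Hypothetical column list; in the notebook it would come from
# joblib.load("models/car_columns.pkl").
saved_columns = ["km_driven", "age", "engine_cc",
                 "fuel_type_Petrol", "transmission_Manual"]


def build_model_input(raw, saved_columns):
    """Encode one raw record and align it to the training columns."""
    row = pd.DataFrame([raw])
    encoded = pd.get_dummies(row)  # encodes only the string columns
    # Add missing dummy columns as 0 and drop any the model never saw.
    return encoded.reindex(columns=saved_columns, fill_value=0)


raw_car = {"km_driven": 40000, "age": 5, "engine_cc": 1200,
           "fuel_type": "Petrol", "transmission": "Automatic"}
model_input = build_model_input(raw_car, saved_columns)
print(model_input.to_dict(orient="records")[0])
```

One caveat of this approach is that a category the model never saw is silently treated as the baseline, so an app may want to validate inputs before encoding them.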


## 📊 Model Summary
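- **Bike Model** and **Car Model** are both `RandomForestRegressor` models with default hyperparameters and `random_state=42`, evaluated on a 20% hold-out split.  
- Bike model: R² ≈ 0.93, MAE ≈ 4,300. Car model: R² ≈ 0.76, MAE ≈ 470,800 (MAE in the datasets' price units), so the car model is noticeably weaker.  
- Models and their input column lists are saved in the `models/` folder.  
- Flask app (`app.py`) loads these models for predictions.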

## 🔮 Future Work
- Add more features (brand, location, condition).  
- Try advanced models (XGBoost, CatBoost).  
- Deploy with Docker/Cloud for scalability.  
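When trying other model families, cross-validation gives a steadier comparison than a single train/test split. The sketch below is only an illustration of that workflow: it uses synthetic data from `make_regression` and scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost or CatBoost, neither of which is used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data; replace with X_car / y_car or X_bike / y_bike.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Mean R² over 5 folds for each candidate model
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: mean R² = {results[name]:.3f}")
```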