# PriceTrack: Unlocking Bike Market Insights

PriceTrack is a data science project designed to predict the valuation of used bikes based on key input parameters.
Leveraging a Multiple Linear Regression model, it provides data-driven insights to help sellers make informed decisions.

## Developing Regression Model

This step involves building and training a regression model, encoding categorical features, and evaluating performance using metrics such as R², MAE, MSE, and RMSE to check prediction accuracy.

- For our project, we use Multiple Linear Regression because the target (`price`) is a continuous variable.

- To handle categorical data, we use one-hot encoding.

Importing all the required modules and functions

In [111]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Loading the cleaned data

In [112]:
df = pd.read_csv("Cleaned_Bike_Data.csv")
df

|       | model                                | price  | city      | kms_driven | owner        | age | power | brand         | owner_encoded |
|-------|--------------------------------------|--------|-----------|------------|--------------|-----|-------|---------------|---------------|
| 0     | TVS Star City Plus Dual Tone 110cc   | 35000  | Ahmedabad | 17654      | First Owner  | 3   | 110   | TVS           | 1             |
| 1     | Royal Enfield Classic 350cc          | 119900 | Delhi     | 11000      | First Owner  | 4   | 350   | Royal Enfield | 1             |
| 2     | TVS Apache RTR 180cc                 | 65000  | Bangalore | 16329      | First Owner  | 4   | 180   | TVS           | 1             |
| 3     | Yamaha FZ S V 2.0 150cc-Ltd. Edition | 80000  | Bangalore | 10000      | First Owner  | 3   | 150   | Yamaha        | 1             |
| 4     | Yamaha FZs 150cc                     | 53499  | Delhi     | 25000      | First Owner  | 6   | 150   | Yamaha        | 1             |
| ...   | ...                                  | ...    | ...       | ...        | ...          | ... | ...   | ...           | ...           |
| 30205 | Bajaj Avenger 220cc                  | 41000  | Delhi     | 20245      | Second Owner | 11  | 220   | Bajaj         | 2             |
| 30206 | Hero Passion Pro 100cc               | 39000  | Delhi     | 22000      | First Owner  | 4   | 100   | Hero          | 1             |
| 30207 | TVS Apache RTR 180cc                 | 30000  | Karnal    | 6639       | First Owner  | 9   | 180   | TVS           | 1             |
| 30208 | Bajaj Avenger Street 220             | 60000  | Delhi     | 20373      | First Owner  | 6   | 220   | Bajaj         | 1             |


### 🏍️ Defining Features and Target Variable

- **`X` (Features)**: The independent variables used for prediction:
  - `age`: Age of the bike in years (numerical).
  - `power`: Engine capacity in cc (numerical).
  - `brand`: Brand of the bike (categorical).
  - `owner_encoded`: Encoded representation of ownership status (ordinal categorical).
  - `city`: City where the bike is listed (categorical).
  - `kms_driven`: Total distance the bike has been ridden (numerical).

- **`y` (Target Variable)**:  
  - `price`: The dependent variable representing the bike's resale price.

This selection enables the model to learn how mechanical, ownership, location, and usage factors influence a bike's resale value.

In [113]:
X = df[["age", "power", "brand", "owner_encoded", "city", "kms_driven"]]
y = df["price"]

### Train-Test Split

- The dataset is split into **training** and **testing** sets to evaluate model performance.
- **`train_test_split(X, y, test_size=0.2, random_state=0)`**:
  - **80%** of the data is used for training (`X_train`, `y_train`).
  - **20%** of the data is reserved for testing (`X_test`, `y_test`).
  - `random_state=0` ensures reproducibility by generating the same split every time.

This helps in assessing how well the model generalizes to unseen data.


In [114]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
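
To illustrate what a fixed `random_state` provides, the sketch below splits a small made-up frame twice with the same seed and checks that both splits select the same rows. The frame `toy` and the variable names are assumptions for illustration only; they are not part of the bike dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the bike dataset (10 rows)
toy = pd.DataFrame({"age": range(10), "price": range(100, 0, -10)})

# Two independent splits that share the same seed
a_train, a_test = train_test_split(toy, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(a_train), len(a_test))           # 8 2
print(a_test.index.equals(b_test.index))   # True: same rows held out both times
```

Changing the seed (or omitting it) would generally hold out different rows, so reported test metrics would shift slightly from run to run.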

### 🔧 Data Preprocessing Pipeline – Used Bike Dataset

The `preprocessor` is a `ColumnTransformer` that applies suitable transformations to different feature types:

- **One-Hot Encoding (`OneHotEncoder`)**: Applied to categorical variables (`brand`, `city`) to convert them into numeric indicator columns. With `handle_unknown="ignore"`, a category not seen during fitting is encoded as all zeros instead of raising an error.
- **Feature Scaling (`StandardScaler`)**: Applied to numerical features (`age`, `power`, `kms_driven`) to standardize their values (zero mean and unit variance), which puts their coefficients on a comparable scale and improves numerical stability.
- **Pass-Through**: The `owner_encoded` feature (already numeric and ordinal) is passed without transformation.

This pipeline ensures consistent preprocessing across diverse data types, making the data well-prepared for regression modeling.

In [115]:
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["brand", "city"]),
        ("scaler", StandardScaler(), ["age", "power", "kms_driven"]),
    ],
    remainder="passthrough",  # Keeps 'owner_encoded'
    force_int_remainder_cols=False,  # 👈 Enables future behavior now
)

preprocessor

### Model Pipeline

A **`Pipeline`** is used to streamline preprocessing and model training in a single workflow:

- **`preprocessor`**: Applies transformations to the input data (One-Hot Encoding & Standard Scaling).
- **`LinearRegression()`**: The regression model that learns the relationship between the features and the target variable.

This approach ensures that preprocessing steps are consistently applied during both training and prediction, improving efficiency and reducing the risk of data leakage.

In [116]:
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Train Model
model.fit(X_train, y_train)
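
One way to see the leakage point: when the whole pipeline is passed to `cross_val_score`, the encoder and scaler are re-fitted on each training fold only, so no statistics from the held-out fold reach the model. The following is a minimal sketch on synthetic data; the frame `toy`, the made-up price formula, and the name `toy_model` are assumptions, not results from the bike dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features loosely shaped like the bike data
toy = pd.DataFrame({
    "brand": rng.choice(["TVS", "Hero", "Bajaj"], size=n),
    "age": rng.integers(1, 15, size=n),
    "power": rng.choice([100, 150, 220, 350], size=n),
})
# Made-up linear price with noise, purely for illustration
toy_price = 20000 + 250 * toy["power"] - 3000 * toy["age"] + rng.normal(0, 2000, size=n)

toy_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        ("scaler", StandardScaler(), ["age", "power"]),
    ])),
    ("regressor", LinearRegression()),
])

# Each of the 5 folds fits preprocessing and regression on its training part only
scores = cross_val_score(toy_model, toy, toy_price, cv=5, scoring="r2")
print(scores.round(3))
```

Scaling or encoding the full dataset before splitting would instead let test-fold statistics influence the training features.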

### Model Evaluation Metrics

- **Mean Absolute Error (MAE)**: Measures the average absolute difference between actual and predicted prices. Lower values indicate better accuracy.
- **Mean Squared Error (MSE)**: Averages the squared differences between actual and predicted prices, so larger errors are weighted more heavily; this makes it more sensitive to outliers than MAE.
- **Root Mean Squared Error (RMSE)**: Square root of MSE, providing an interpretable error measure in the same unit as price.
- **R² Score**: Indicates how well the model explains the variance in price; closer to 1 means a better fit.

In [117]:
# Predictions
y_pred = model.predict(X_test)

# Model Evaluation
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display Metrics
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")

MAE: 6008.00
MSE: 127910307.94
RMSE: 11309.74
R2 Score: 0.94
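
To make the four metrics concrete, here is a small worked example that computes them by hand with NumPy. The four actual and predicted prices are made up for illustration and are unrelated to the results above.

```python
import numpy as np

# Hypothetical actual and predicted prices (in rupees)
y_true = np.array([50000.0, 80000.0, 30000.0, 120000.0])
y_hat = np.array([52000.0, 75000.0, 33000.0, 110000.0])

errors = y_true - y_hat                      # [-2000, 5000, -3000, 10000]

mae_manual = np.mean(np.abs(errors))         # mean of absolute errors
mse_manual = np.mean(errors ** 2)            # mean of squared errors
rmse_manual = np.sqrt(mse_manual)            # back in rupees

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: {mae_manual:.2f}")      # MAE: 5000.00
print(f"MSE: {mse_manual:.2f}")      # MSE: 34500000.00
print(f"RMSE: {rmse_manual:.2f}")    # RMSE: 5873.67
print(f"R2 Score: {r2_manual:.2f}")  # R2 Score: 0.97
```

The single 10,000-rupee miss pulls RMSE above MAE, which is the sensitivity to large errors described in the list above.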


In [118]:
# Checking the coefficients

# feature_names = preprocessor.get_feature_names_out()
# coef_df = pd.DataFrame(model.named_steps["regressor"].coef_, index=feature_names, columns=["Coefficient"])
# coef_df

## Integration with statsmodels

In [119]:
import statsmodels.api as sm

In [120]:
# Encode categorical variables
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_overall_encoded = pd.get_dummies(X, drop_first=True)

# Align the columns of test and overall with train
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)
X_overall_encoded = X_overall_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Add constant (intercept)
X_train_encoded = sm.add_constant(X_train_encoded)
X_test_encoded = sm.add_constant(X_test_encoded)
X_overall_encoded = sm.add_constant(X_overall_encoded)

# Force float dtype
X_train_encoded = X_train_encoded.astype(float)
X_test_encoded = X_test_encoded.astype(float)
X_overall_encoded = X_overall_encoded.astype(float)

y_train = y_train.astype(float)
y_test = y_test.astype(float)
y_overall = y.astype(float)

# Fit the model
model_sm = sm.OLS(y_train, X_train_encoded).fit()

# Predictions
y_train_pred_sm = model_sm.predict(X_train_encoded)
y_test_pred_sm = model_sm.predict(X_test_encoded)
y_overall_pred_sm = model_sm.predict(X_overall_encoded)

results = {
    "r2": [
        model_sm.rsquared,
        r2_score(y_test, y_test_pred_sm),
        r2_score(y_overall, y_overall_pred_sm),
    ],
    "mae": [
        mean_absolute_error(y_train, y_train_pred_sm),
        mean_absolute_error(y_test, y_test_pred_sm),
        mean_absolute_error(y_overall, y_overall_pred_sm),
    ],
    "mse": [
        mean_squared_error(y_train, y_train_pred_sm),
        mean_squared_error(y_test, y_test_pred_sm),
        mean_squared_error(y_overall, y_overall_pred_sm),
    ],
}

results["rmse"] = np.sqrt(results["mse"])
results["r2_percent"] = [f"{val*100:.0f}%" for val in results["r2"]]

# Print Metrics
result_df = pd.DataFrame(
    {
        "Percentage Accuracy": results["r2_percent"],
        "R² Score": results["r2"],
        "MAE": results["mae"],
        "MSE": results["mse"],
        "RMSE": results["rmse"]
    },
    index=["Training", "Testing", "Overall"],
)
# result_df[["R² Score","MAE", "MSE", "RMSE"]].agg(lambda s: ['%.2f'%val for val in s]) # Formatting
result_df

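|          | Percentage Accuracy | R² Score | MAE         | MSE         | RMSE         |
|----------|---------------------|----------|-------------|-------------|--------------|
| Training | 94%                 | 0.937599 | 5795.129236 | 119673900.0 | 10939.555979 |
| Testing  | 94%                 | 0.935468 | 5999.333048 | 128032500.0 | 11315.144086 |
| Overall  | 94%                 | 0.937165 | 5835.969998 | 121345600.0 | 11015.698123 |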


### statsmodels Summary on Training Data

In [121]:
model_sm.summary()

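| Statistic         | Value            | Statistic           | Value     |
|-------------------|------------------|---------------------|-----------|
| Dep. Variable:    | price            | R-squared:          | 0.938     |
| Model:            | OLS              | Adj. R-squared:     | 0.937     |
| Method:           | Least Squares    | F-statistic:        | 883.8     |
| Date:             | Wed, 14 May 2025 | Prob (F-statistic): | 0.0       |
| Time:             | 20:18:54         | Log-Likelihood:     | -259060.0 |
| No. Observations: | 24168            | AIC:                | 518900.0  |
| Df Residuals:     | 23763            | BIC:                | 522200.0  |
| Df Model:         | 404              |                     |           |
| Covariance Type:  | nonrobust        |                     |           |

|                       | coef       | std err  | t        | P>\|t\| | [0.025    | 0.975]    |
|-----------------------|------------|----------|----------|---------|-----------|-----------|
| const                 | 1.897e+05  | 6560.006 | 28.920   | 0.000   | 1.77e+05  | 2.03e+05  |
| age                   | -3675.1228 | 33.515   | -109.656 | 0.000   | -3740.815 | -3609.431 |
| power                 | 285.0975   | 1.762    | 161.763  | 0.000   | 281.643   | 288.552   |
| owner_encoded         | -7998.5706 | 310.064  | -25.796  | 0.000   | -8606.317 | -7390.825 |
| kms_driven            | -0.3341    | 0.007    | -51.133  | 0.000   | -0.347    | -0.321    |
| brand_Bajaj           | -1.541e+05 | 5523.782 | -27.893  | 0.000   | -1.65e+05 | -1.43e+05 |
| brand_Benelli         | -2.809e+04 | 5859.745 | -4.794   | 0.000   | -3.96e+04 | -1.66e+04 |
| brand_Harley-Davidson | -6.107e+04 | 6588.614 | -9.270   | 0.000   | -7.4e+04  | -4.82e+04 |
| brand_Hero            | -1.504e+05 | 5533.684 | -27.172  | 0.000   | -1.61e+05 | -1.4e+05  |

| Statistic      | Value     | Statistic         | Value      |
|----------------|-----------|-------------------|------------|
| Omnibus:       | 11944.347 | Durbin-Watson:    | 2.002      |
| Prob(Omnibus): | 0.0       | Jarque-Bera (JB): | 782961.632 |
| Skew:          | 1.559     | Prob(JB):         | 0.0        |
| Kurtosis:      | 30.709    | Cond. No.         | 29400000.0 |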
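
A few of the summary figures can be cross-checked with simple arithmetic. The sketch below recomputes the residual degrees of freedom and the adjusted R² from the reported values, and reads two coefficients as marginal effects. All inputs are copied from the tables above; the variable names are illustrative, and the marginal-effect reading assumes the other features are held fixed.

```python
# Values copied from the results table and the OLS summary above
r2_train = 0.937599   # training R² Score
n_obs = 24168         # No. Observations
df_model = 404        # Df Model (number of predictors excluding the constant)

# Residual degrees of freedom: observations minus predictors minus the intercept
df_resid = n_obs - df_model - 1
print(df_resid)                 # 23763

# Adjusted R² penalizes R² for the number of predictors
adj_r2 = 1 - (1 - r2_train) * (n_obs - 1) / df_resid
print(round(adj_r2, 3))         # 0.937

# Coefficients are in rupees per unit of the unscaled feature
coef_age = -3675.1228           # per additional year of age
coef_kms = -0.3341              # per additional kilometre

print(round(coef_age * 1))      # -3675: one more year of age
print(round(coef_kms * 10_000)) # -3341: 10,000 more kilometres
```

With 404 predictors against roughly 24,000 observations, the adjustment is small, which is why R² and adjusted R² differ only in the third decimal place.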


### 📊 Model Testing: Manual Test Cases for Used Bikes

To validate model behavior, we manually test different scenarios based on bike attributes.

| **Brand**       | **City**     | **KMs Driven** | **Age (Years)** | **Power (cc)** | **Owner (Encoded)** | **Description**                              | **Expected Outcome**                                |
|-----------------|--------------|----------------|------------------|----------------|---------------------|----------------------------------------------|------------------------------------------------------|
| Harley Davidson | Bangalore    | 5,000          | 1                | 750            | 1                   | New bike with low mileage and Luxury brand   | High price due to new condition and premium brand    |
| Royal Enfield   | Pune         | 70,000         | 5                | 350            | 1                   | Mid-aged popular cruiser                     | Moderate to high price due to brand and engine size  |
| Hero            | Patna        | 12,000         | 10               | 100            | 2                   | Old budget bike, second owner                | Low price due to age, low power, and second owner    |
| KTM             | Mumbai       | 3,700          | 2                | 390            | 1                   | Premium sports bike with low mileage         | High price due to premium branding and condition     |
| Bajaj           | Ahmedabad    | 2,300          | 3                | 150            | 3                   | Mid-age bike, 3rd owner                      | Lower price due to ownership count despite low kms   |

### **Key Trends Expected**
- **Newer bikes with low mileage** → **Higher resale price**  
- **Older bikes or bikes with more previous owners** → **Lower resale price**  
- **Luxury or performance brands (e.g., Harley Davidson, KTM)** → **High resale value**  
- **Commuter brands (e.g., Hero, Bajaj)** → **Budget resale range**  


In [122]:
# Define test cases as a list of dictionaries
test_cases = [
    {
        "brand": "Harley Davidson",
        "city": "Bangalore",
        "kms_driven": 5000,
        "age": 1,
        "power": 750,
        "owner_encoded": 1,
        "Description": "New bike with low mileage and Luxury brand",
    },
    {
        "brand": "Royal Enfield",
        "city": "Pune",
        "kms_driven": 70000,
        "age": 5,
        "power": 350,
        "owner_encoded": 1,
        "Description": "Mid-aged popular cruiser",
    },
    {
        "brand": "Hero",
        "city": "Patna",
        "kms_driven": 12000,
        "age": 10,
        "power": 100,
        "owner_encoded": 2,
        "Description": "Old budget bike, second owner",
    },
    {
        "brand": "KTM",
        "city": "Mumbai",
        "kms_driven": 3700,
        "age": 2,
        "power": 390,
        "owner_encoded": 1,
        "Description": "Premium sports bike with low mileage",
    },
    {
        "brand": "Bajaj",
        "city": "Ahmedabad",
        "kms_driven": 2300,
        "age": 3,
        "power": 150,
        "owner_encoded": 3,
        "Description": "Mid-age bike, 3rd owner",
    },
]

# Convert test cases to DataFrame
test_df = pd.DataFrame(test_cases)

# Predict prices for test cases
predicted_prices = model.predict(test_df.drop(columns=["Description"]))

# Add predicted prices to DataFrame
test_df["Predicted Price (₹)"] = [f"₹{price:,.0f}" for price in predicted_prices]

# Display the results
test_df

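|   | brand           | city      | kms_driven | age | power | owner_encoded | Description                                | Predicted Price (₹) |
|---|-----------------|-----------|------------|-----|-------|---------------|--------------------------------------------|---------------------|
| 0 | Harley Davidson | Bangalore | 5000       | 1   | 750   | 1             | New bike with low mileage and Luxury brand | ₹291,293            |
| 1 | Royal Enfield   | Pune      | 70000      | 5   | 350   | 1             | Mid-aged popular cruiser                   | ₹98,715             |
| 2 | Hero            | Patna     | 12000      | 10  | 100   | 2             | Old budget bike, second owner              | ₹23,687             |
| 3 | KTM             | Mumbai    | 3700       | 2   | 390   | 1             | Premium sports bike with low mileage       | ₹197,877            |
| 4 | Bajaj           | Ahmedabad | 2300       | 3   | 150   | 3             | Mid-age bike, 3rd owner                    | ₹51,094             |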
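
One caveat when reading these predictions: the OLS summary lists the brand as `brand_Harley-Davidson` (hyphenated), while the first test case uses `Harley Davidson`. If the training data contains only the hyphenated spelling, `handle_unknown="ignore"` would encode this test case with all brand columns set to zero, so the prediction would not reflect the brand. The sketch below shows one way to check for such mismatches; the brand list is a small subset taken from the summary, and the names `train_brands` and `candidates` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A subset of brand spellings as they appear in the OLS summary (illustrative)
train_brands = pd.DataFrame({"brand": ["Bajaj", "Benelli", "Harley-Davidson", "Hero"]})
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_brands)

# Categories the encoder learned for the first (and only) column
known_brands = set(encoder.categories_[0])

# Brand values we intend to predict on
candidates = pd.DataFrame({"brand": ["Harley Davidson", "Harley-Davidson", "Hero"]})
unseen = [b for b in candidates["brand"] if b not in known_brands]
print(unseen)  # ['Harley Davidson']

# An unseen value is encoded as all zeros; a known value has exactly one 1
encoded = encoder.transform(candidates).toarray()
print(encoded[0].sum(), encoded[1].sum())  # 0.0 1.0
```

For the fitted pipeline in this notebook, the learned categories should be available from `model.named_steps["preprocessor"].named_transformers_["onehot"].categories_`, which allows the same check against the real training data before trusting a manual test case.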
