# PriceTrack: Unlocking Car Market Insights

PriceTrack is a data science project that predicts the value of second-hand cars from a few key input parameters.
Using a linear regression model, it provides data-driven insights to help buyers and sellers make informed decisions.

## Developing Regression Model

Importing all the required modules and functions

In [1836]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Loading the cleaned data

In [1837]:
df = pd.read_csv("cleaned_data.csv")
df

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type,Age,Annual_Km_Driven
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,45000,Petrol,13,3461.54
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,140160,Diesel,14,10011.44
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,28000,Petrol,6,4666.67
3,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,36000,Diesel,6,6000.00
4,Ford Figo,Ford,2012,175000,41000,Diesel,8,5125.00
...,...,...,...,...,...,...,...,...
811,Maruti Suzuki Ritz VXI ABS,Maruti,2011,270000,50000,Petrol,9,5555.56
812,Tata Indica V2 DLE BS III,Tata,2009,110000,30000,Diesel,11,2727.27
813,Toyota Corolla Altis,Toyota,2009,300000,132000,Petrol,11,12000.00
814,Tata Zest XM Diesel,Tata,2018,260000,27000,Diesel,2,13500.00


### Feature Engineering: Age & Kilometers Driven Transformation  

- **`Age` Calculation:**  
  The car's age is computed as:  
  $$
  \text{Age} = \text{Max Year} - \text{Car's Year} + 1
  $$
  - Newer cars have a **lower age value**, while older cars have a **higher age value** to reflect depreciation.

- **`kms_driven` Inversion:**  
  The **`kms_driven`** column is transformed using:  
  $$
  \text{kms\_driven} = \text{Max kms\_driven} - \text{kms\_driven}
  $$  
  - Higher mileage cars get **lower values** (indicating more wear).  
  - Lower mileage cars get **higher values** (indicating better condition).  

- The original **`year`** column is removed, as it's no longer needed after transformation.  

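These transformations express both features in terms of **price depreciation**: the price is expected to fall as `Age` rises and to rise with the inverted `kms_driven`. For a linear model, the inversion only flips the sign of the `kms_driven` coefficient; it does not change the quality of the fit. 🚗📉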


In [1838]:
useful_df = df.copy()
useful_df["Age"] = df["year"].max() - df["year"] + 1  # Assume max year as base year
useful_df = useful_df.drop(columns=["year"])  # Drop original year column
useful_df["kms_driven"] = df["kms_driven"].max() - df["kms_driven"] # Invert kms_driven

# # Replace less frequent companies with "Other"
# top_companies = useful_df["company"].value_counts().nlargest(4).index  # Top 4 companies
# useful_df["company"] = useful_df["company"].apply(lambda x: x if x in top_companies else "Other")

useful_df

Unnamed: 0,name,company,Price,kms_driven,fuel_type,Age,Annual_Km_Driven
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,80000,155000,Petrol,13,3461.54
1,Mahindra Jeep CL550 MDI,Mahindra,425000,59840,Diesel,14,10011.44
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,325000,172000,Petrol,6,4666.67
3,Ford EcoSport Titanium 1.5L TDCi,Ford,575000,164000,Diesel,6,6000.00
4,Ford Figo,Ford,175000,159000,Diesel,8,5125.00
...,...,...,...,...,...,...,...
811,Maruti Suzuki Ritz VXI ABS,Maruti,270000,150000,Petrol,9,5555.56
812,Tata Indica V2 DLE BS III,Tata,110000,170000,Diesel,11,2727.27
813,Toyota Corolla Altis,Toyota,300000,68000,Petrol,11,12000.00
814,Tata Zest XM Diesel,Tata,260000,173000,Diesel,2,13500.00
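
Because the model is fitted on the derived `Age` and the inverted `kms_driven`, any new input has to go through the same formulas, using the reference maxima taken from the fitting data. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `to_model_features` and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (values are illustrative only).
toy = pd.DataFrame({
    "year": [2007, 2014, 2018],
    "kms_driven": [45000, 28000, 27000],
})

# Reference values captured once from the data the model is fitted on.
MAX_YEAR = int(toy["year"].max())
MAX_KMS = int(toy["kms_driven"].max())

def to_model_features(year: int, kms_driven: int) -> dict:
    """Hypothetical helper: apply the Age and inverted-kms formulas to one raw input."""
    return {
        "Age": MAX_YEAR - year + 1,            # Age = Max Year - Car's Year + 1
        "kms_driven": MAX_KMS - kms_driven,    # inverted distance
    }

features = to_model_features(year=2014, kms_driven=28000)
print(features)
```

Keeping the two maxima next to the model avoids recomputing them from whatever data happens to be at hand at prediction time.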


### Defining Features and Target Variable

- **`X` (Features)**: The independent variables used for prediction:
  - `company`: The brand of the car (categorical).
  - `fuel_type`: The type of fuel the car uses (categorical).
  - `kms_driven`: The inverted distance driven, i.e. the maximum `kms_driven` in the data minus the car's actual distance (numerical).
  - `Age`: The age of the car, derived from the manufacturing year (numerical).

- **`y` (Target Variable)**:  
  - `Price`: The dependent variable representing the car's selling price.

This selection ensures that the model learns the relationship between these key factors and the car's valuation.


In [1839]:
X = useful_df[["company", "fuel_type", "kms_driven", "Age"]]
y = useful_df["Price"]

### Train-Test Split

- The dataset is split into **training** and **testing** sets to evaluate model performance.
- **`train_test_split(X, y, test_size=0.2, random_state=974)`**:
  - **80%** of the data is used for training (`X_train`, `y_train`).
  - **20%** of the data is reserved for testing (`X_test`, `y_test`).
  - `random_state=974` ensures reproducibility by generating the same split every time.

This helps in assessing how well the model generalizes to unseen data.
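
The effect of `random_state` can be checked on a small example. This sketch uses toy arrays (an assumption, not the notebook's data) and shows that two calls with the same seed return the same 80/20 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with a single feature (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# Two independent calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=974
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=974
)

print(len(X_train_a), len(X_test_a))   # sizes of the two parts
same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)                      # whether both calls picked the same test rows
```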


In [1840]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=974)

### Data Preprocessing Pipeline

The `preprocessor` is a `ColumnTransformer` that applies different transformations to categorical and numerical features:

- **One-Hot Encoding (`OneHotEncoder`)**: Converts categorical variables (`company`, `fuel_type`) into numerical format while ignoring unknown categories.
- **Feature Scaling (`StandardScaler`)**: Standardizes numerical features (`kms_driven`, `Age`) so that each has a mean of 0 and a standard deviation of 1. For ordinary least squares this does not change the predictions, but it puts the coefficients on a comparable scale.

This preprocessing step ensures that both categorical and numerical data are properly formatted before feeding them into the regression model.


In [1841]:
preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ["company", "fuel_type"]),
    ("scaler", StandardScaler(), ["kms_driven", "Age"])
])
preprocessor

### Model Pipeline

A **`Pipeline`** is used to streamline preprocessing and model training in a single workflow:

- **`preprocessor`**: Applies transformations to the input data (One-Hot Encoding & Standard Scaling).
- **`LinearRegression()`**: The regression model that learns the relationship between the features and the target variable.

This approach ensures that preprocessing steps are consistently applied during both training and prediction, improving efficiency and reducing the risk of data leakage.


In [1842]:
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

### Training & Predictions

- **`model.fit(X_train, y_train)`**: Trains the Linear Regression model using the training dataset.  
- **`y_pred = model.predict(X_test)`**: Predicts car prices on the test dataset for evaluation.

In [1843]:
# Train Model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

### Model Evaluation Metrics

- **Mean Absolute Error (MAE)**: Measures the average absolute difference between actual and predicted prices. Lower values indicate better accuracy.
- **Mean Squared Error (MSE)**: Similar to MAE but gives higher weight to larger errors, making it more sensitive to outliers.
- **Root Mean Squared Error (RMSE)**: Square root of MSE, providing an interpretable error measure in the same unit as price.
- **R² Score**: Indicates how well the model explains the variance in price; closer to 1 means a better fit.
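
The four metrics above can be computed together with scikit-learn. This is a minimal sketch: the helper name `regression_metrics` and the toy price lists are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred) -> dict:
    """Hypothetical helper: return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),   # same unit as the price
        "R2": r2_score(y_true, y_pred),
    }

# Toy actual vs. predicted prices (illustrative only).
actual = [100000, 200000, 300000, 400000]
predicted = [110000, 190000, 320000, 380000]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

In the notebook, the same helper could be called as `regression_metrics(y_test, y_pred)` to score the pipeline's predictions on the held-out data.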

### Integration with statsmodels

In [1844]:
import statsmodels.api as sm

In [1845]:
# Encode categorical variables
# Note: each subset is encoded independently, so the dummy columns can differ between subsets
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_overall_encoded = pd.get_dummies(X, drop_first=True)

# Add constant (intercept)
X_train_encoded = sm.add_constant(X_train_encoded)
X_test_encoded = sm.add_constant(X_test_encoded)
X_overall_encoded = sm.add_constant(X_overall_encoded)

# Force float dtype
X_train_encoded = X_train_encoded.astype(float)
X_test_encoded = X_test_encoded.astype(float)
X_overall_encoded = X_overall_encoded.astype(float)

y_train = y_train.astype(float)
y_test = y_test.astype(float)
y_overall = y.astype(float)

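# Fit a separate OLS model on each subset.
# Note: every metric below is therefore an in-sample fit on that subset;
# the "Testing" row is not an out-of-sample evaluation of the training model.
model_sm_train = sm.OLS(y_train, X_train_encoded).fit()
model_sm_test = sm.OLS(y_test, X_test_encoded).fit()
model_sm_overall = sm.OLS(y_overall, X_overall_encoded).fit()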

results = {
    "r2": [
        model_sm_train.rsquared,
        model_sm_test.rsquared,
        model_sm_overall.rsquared,
    ],
    "mae": [
        mean_absolute_error(y_train, model_sm_train.predict(X_train_encoded)),
        mean_absolute_error(y_test, model_sm_test.predict(X_test_encoded)),
        mean_absolute_error(y_overall, model_sm_overall.predict(X_overall_encoded)),
    ],
    "mse": [
        mean_squared_error(y_train, model_sm_train.predict(X_train_encoded)),
        mean_squared_error(y_test, model_sm_test.predict(X_test_encoded)),
        mean_squared_error(y_overall, model_sm_overall.predict(X_overall_encoded)),
    ],
}

results["rmse"] = np.sqrt(results["mse"])
results["r2_percent"] = [f"{val*100:.0f}%" for val in results["r2"]]

# Build the metrics table ("Percentage Accuracy" is simply R² expressed as a percentage)
result_df = pd.DataFrame(
    {
        "Percentage Accuracy": results["r2_percent"],
        "R² Score": results["r2"],
        "MAE": results["mae"],
        "MSE": results["mse"],
        "RMSE": results["rmse"]
    },
    index=["Training", "Testing", "Overall"],
)
# result_df[["R² Score","MAE", "MSE", "RMSE"]].agg(lambda s: ['%.2f'%val for val in s]) # Formatting
result_df

Unnamed: 0,Percentage Accuracy,R² Score,MAE,MSE,RMSE
Training,66%,0.66,142021.89,49433755204.81,222337.03
Testing,89%,0.89,92702.07,16776587929.88,129524.47
Overall,70%,0.7,134038.47,43875518512.07,209464.84
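
To score a model fitted on the training data against the test data, the test design matrix needs the same columns as the training one. The sketch below illustrates one way to align them with `reindex`; the toy frames are assumptions for illustration and this is not the procedure used to produce the table above.

```python
import pandas as pd

# Toy subsets with different brands (illustrative only).
train_toy = pd.DataFrame({"company": ["Audi", "Ford", "Tata"], "Age": [2, 5, 8]})
test_toy = pd.DataFrame({"company": ["Ford", "Honda"], "Age": [3, 6]})

# Training encoding drops the first category ("Audi") as the baseline.
train_enc = pd.get_dummies(train_toy, drop_first=True).astype(float)

# Encode the test rows without dropping anything, then keep exactly the
# training columns: missing columns are filled with 0, extra ones are discarded.
test_enc = pd.get_dummies(test_toy).astype(float)
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(list(train_enc.columns))
print(test_aligned)
```

With the columns aligned (and the constant added the same way), a model fitted on the training matrix can call `predict` on the aligned test matrix, which gives an out-of-sample estimate. A brand that never appears in training, such as "Honda" here, falls back to the baseline category.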


### statsmodels Summary on Testing Data

In [1846]:
model_sm_test.summary()

Dep. Variable:,Price,R-squared:,0.888
Model:,OLS,Adj. R-squared:,0.871
Method:,Least Squares,F-statistic:,51.04
Date:,"Sun, 27 Apr 2025",Prob (F-statistic):,3.89e-56
Time:,11:07:01,Log-Likelihood:,-2163.3
No. Observations:,164,AIC:,4373.0
Df Residuals:,141,BIC:,4444.0
Df Model:,22,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.596e+06,1.71e+05,9.361,0.000,1.26e+06,1.93e+06
kms_driven,0.5743,0.496,1.158,0.249,-0.406,1.554
Age,-2.649e+04,3436.536,-7.708,0.000,-3.33e+04,-1.97e+04
company_BMW,-4.746e+05,1.72e+05,-2.753,0.007,-8.15e+05,-1.34e+05
company_Chevrolet,-1.243e+06,1.47e+05,-8.449,0.000,-1.53e+06,-9.52e+05
company_Datsun,-1.274e+06,2.01e+05,-6.340,0.000,-1.67e+06,-8.77e+05
company_Fiat,-1.201e+06,2.01e+05,-5.977,0.000,-1.6e+06,-8.04e+05
company_Force,-9.966e+05,1.98e+05,-5.037,0.000,-1.39e+06,-6.05e+05
company_Ford,-1.104e+06,1.52e+05,-7.284,0.000,-1.4e+06,-8.05e+05

Omnibus:,16.365,Durbin-Watson:,2.18
Prob(Omnibus):,0.0,Jarque-Bera (JB):,26.561
Skew:,0.524,Prob(JB):,1.71e-06
Kurtosis:,4.67,Cond. No.,9160000.0


Checking the coefficients

In [1847]:
feature_names = preprocessor.get_feature_names_out()
coef_df = pd.DataFrame(model.named_steps["regressor"].coef_, index=feature_names, columns=["Coefficient"])
coef_df

Unnamed: 0,Coefficient
onehot__company_Audi,659896.39
onehot__company_BMW,349068.27
onehot__company_Chevrolet,-586558.27
onehot__company_Datsun,-598696.2
onehot__company_Fiat,-552235.98
onehot__company_Force,-333259.5
onehot__company_Ford,-309693.05
onehot__company_Hindustan,-74642.29
onehot__company_Honda,-389965.36
onehot__company_Hyundai,-408402.18


### Model Testing: Manual Test Cases  

To sanity-check the behavior of the trained regression model, we run manual test cases with different car attributes and check whether the predictions follow the expected trends.

| **Company** | **Fuel Type** | **Kilometers Driven** | **Age (Years)** | **Description** | **Expected Outcome** |
|------------|-------------|-----------------|-------------|----------------|----------------|
| Toyota     | Petrol      | 1,000           | 1           | New Car with Low Mileage | A high predicted price, as the car is new with low mileage. |
| Honda      | Diesel      | 200,000         | 10          | Old Car with High Mileage | A significantly lower predicted price, as the car has high mileage and is very old. |
| BMW        | Petrol      | 20,000          | 2           | Luxury Car with Low Mileage | A relatively high predicted price, as luxury cars tend to retain value better. |
| Hyundai    | Petrol      | 50,000          | 5           | Mid-range Car with Moderate Usage | A moderate price, considering the car is neither too old nor too new. |
| Maruti     | Petrol      | 150,000         | 8           | Economy Car with High Mileage | A lower predicted price due to high mileage and moderate age. |

### **Key Trends Expected**  
- **Newer cars with lower mileage** should have a **higher price**.  
- **Older cars with high mileage** should have a **lower price**.  
- **Luxury brands (BMW, Jaguar, Mercedes, etc.)** should have a **higher predicted price** compared to economy brands.  
- **Diesel cars** might have a **slightly higher price** than petrol cars due to fuel efficiency.  

These test cases help verify that the model follows logical pricing patterns based on mileage, age, and brand value. 🚗💨
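
Trend expectations like these can also be checked programmatically. The sketch below is an illustration only: `toy_predict` is a hypothetical stand-in for a fitted model, and `is_non_increasing` is a hypothetical helper that tests whether prices never rise along a sequence.

```python
from typing import Sequence

def is_non_increasing(values: Sequence[float]) -> bool:
    """Return True if each value is less than or equal to the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))

def toy_predict(age: int, kms: float) -> float:
    """Hypothetical stand-in predictor: price falls with age and distance driven."""
    return 1_000_000 - 50_000 * age - 1.5 * kms

# Vary the age while holding the distance fixed, then check the trend.
ages = [1, 3, 5, 8, 10]
prices_by_age = [toy_predict(age, kms=50_000) for age in ages]
trend_ok = is_non_increasing(prices_by_age)
print(trend_ok)
```

Replacing `toy_predict` with a call to the trained pipeline would turn the first two expected trends into automated checks.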


In [1848]:
# Define test cases as a list of dictionaries
test_cases = [
    { "company": "Toyota", "fuel_type": "Petrol", "kms_driven": 1000, "Age": 1, "Description": "New Car with Low Mileage",},
    { "company": "Honda", "fuel_type": "Diesel", "kms_driven": 200000, "Age": 10, "Description": "Old Car with High Mileage",},
    { "company": "BMW", "fuel_type": "Petrol", "kms_driven": 20000, "Age": 2, "Description": "Luxury Car with Low Mileage",},
    { "company": "Hyundai", "fuel_type": "Petrol", "kms_driven": 50000, "Age": 5, "Description": "Mid-range Car with Moderate Usage"},
    { "company": "Maruti", "fuel_type": "Petrol", "kms_driven": 150000, "Age": 8, "Description": "Economy Car with High Mileage",}
]

# Convert test cases to DataFrame
test_df = pd.DataFrame(test_cases)

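# Predict prices for test cases
# Note: kms_driven is passed here as raw distance, whereas the model was trained on the inverted values
predicted_prices = model.predict(test_df.drop(columns=["Description"]))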

# Add predicted prices to DataFrame
test_df["Predicted Price (₹)"] = [f"₹{price:,.0f}" for price in predicted_prices]

# Display the results in a structured table
test_df

Unnamed: 0,company,fuel_type,kms_driven,Age,Description,Predicted Price (₹)
0,Toyota,Petrol,1000,1,New Car with Low Mileage,"₹744,464"
1,Honda,Diesel,200000,10,Old Car with High Mileage,"₹372,288"
2,BMW,Petrol,20000,2,Luxury Car with Low Mileage,"₹1,255,218"
3,Hyundai,Petrol,50000,5,Mid-range Car with Moderate Usage,"₹400,385"
4,Maruti,Petrol,150000,8,Economy Car with High Mileage,"₹241,171"


In [1849]:
# # Convert test cases to DataFrame
# test_df = pd.DataFrame(test_cases)

# # Drop 'Description' column and encode categorical variables in the test data
# test_cases_encoded = pd.get_dummies(test_df.drop(columns=["Description"]), drop_first=True)

# # Add constant (intercept) to the encoded test data
# test_cases_encoded = sm.add_constant(test_cases_encoded)

# # Ensure the test data has the same columns as the training data (X_overall_encoded)
# test_cases_encoded = test_cases_encoded.reindex(columns=X_overall_encoded.columns, fill_value=0)

# # Predict prices for test cases using the overall model
# predicted_prices = model_sm_train.predict(test_cases_encoded)
# # predicted_prices = model_sm_test.predict(test_cases_encoded)
# # predicted_prices = model_sm_overall.predict(test_cases_encoded)

# # Add predicted prices to the test DataFrame
# test_df["Predicted Price (₹)"] = [f"₹{price:,.0f}" for price in predicted_prices]

# # Display the results in a structured table
# test_df
