
Tosa9/CodeAlpha_CarPricePrediction

Car Price Prediction with Machine Learning

CodeAlpha Data Science Internship — Task 3

Intern: Omokhoa Oshose Tosayoname
Intern ID: CA/DF1/71570
Duration: 20th May 2026 – 20th June 2026


Overview

This project builds and evaluates multiple machine learning regression models to predict the selling price of used cars based on features such as present showroom price, kilometres driven, manufacturing year, fuel type, seller type, transmission, and number of previous owners. The dataset contains 301 used car listings spanning model years 2003 to 2018.

Business Question: What factors most strongly determine the resale value of a used car, and how accurately can we predict selling price from these features?


Project Pipeline

Data Loading --> EDA & Visualisation --> Feature Engineering & Preprocessing
    --> Model Training --> Evaluation & Comparison --> Business Insights

Project Structure

CodeAlpha_CarPricePrediction/
├── data/
│   ├── car_data.csv                      # Raw dataset
│   └── *.png                             # All generated visualisations
├── notebooks/
│   └── car_price_prediction.ipynb        # Main notebook (fully executed)
├── requirements.txt
└── README.md

Dataset

| Feature | Type | Description |
|---|---|---|
| Car_Name | Categorical | Model name of the car |
| Year | Numeric | Year of manufacture |
| Selling_Price | Numeric | Price at which the car was sold (Lakhs INR); target |
| Present_Price | Numeric | Current ex-showroom price (Lakhs INR) |
| Driven_kms | Numeric | Total kilometres driven |
| Fuel_Type | Categorical | Petrol / Diesel / CNG |
| Selling_type | Categorical | Dealer / Individual |
| Transmission | Categorical | Manual / Automatic |
| Owner | Numeric | Number of previous owners |

Exploratory Data Analysis

Target Variable: Selling Price Distribution

[Figure: Price Distribution]

The selling price is right-skewed, with most cars priced under 10 Lakhs INR. Log transformation brings it closer to a normal distribution.
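
The effect of the log transformation on skew can be checked numerically. The sketch below is plain Python and uses hypothetical prices (not rows from car_data.csv); the helper name `skewness` and the choice of `log1p` are assumptions for illustration, not taken from the notebook.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical selling prices in Lakhs INR (illustrative only).
prices = [0.35, 0.6, 1.1, 2.25, 2.85, 3.35, 4.6, 4.75,
          5.25, 6.0, 7.25, 9.25, 14.5, 23.0, 35.0]

# log1p(x) = log(1 + x), which stays finite for prices close to zero.
log_prices = [math.log1p(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, so a large drop after the transform supports modelling the log of the price.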


Categorical Feature Distributions

[Figure: Categorical Distributions]

Petrol and Manual transmission cars dominate the dataset. Dealer listings outnumber Individual ones.


Selling Price by Category

[Figure: Price by Category]

Diesel cars command a higher median resale price than Petrol. Automatic transmission cars sell at a premium, and Dealer-listed cars are priced higher than Individual listings.


Depreciation Analysis: Car Age vs Price

[Figure: Age vs Price]

Selling price declines with car age. The depreciation is steepest in the first 5 years, stabilising for older models.


Numeric Features vs Selling Price

[Figure: Numeric vs Price]

Present Price has the strongest positive correlation with Selling Price. Kilometres driven shows a weak negative relationship.


Top 15 Car Models by Average Selling Price

[Figure: Top Brands]

Premium and luxury models lead in average resale value, with significant spread across the dataset.


Listings Volume and Average Price by Year

[Figure: Yearly Listings]

Listings peak around 2015–2017 model years. Newer cars command higher average prices as expected.


Depreciation Curve by Fuel Type

[Figure: Depreciation Curve]

All points below the diagonal represent depreciated vehicles. Diesel cars tend to retain value better relative to their showroom price.


Feature Correlation Matrix

[Figure: Correlation Heatmap]

Present Price has the highest correlation with Selling Price. Car Age and Driven_kms are negatively correlated with price.


Feature Engineering

Engineered features added to improve model performance:

| Feature | Formula | Rationale |
|---|---|---|
| Car_Age | 2026 - Year | More interpretable than Year |
| KM_per_Year | Driven_kms / (Car_Age + 1) | Usage intensity per year |
| Age_x_KM | Car_Age × Driven_kms | Joint depreciation proxy |
| Price_per_KM | Present_Price / (Driven_kms + 1) | Value retention per km |
| Brand_Avg_Price | Mean Selling_Price per brand | Brand prestige encoding |
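
The formulas above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the rows and car names are hypothetical, `brand_of` assumes the brand is the first word of Car_Name, and the brand averages are computed on training rows only (with a global-mean fallback for unseen brands) so that the target does not leak into test features.

```python
from collections import defaultdict

REFERENCE_YEAR = 2026  # from the Car_Age formula in the table above


def brand_of(car_name):
    # Assumption: the brand is the first word of Car_Name.
    return car_name.split()[0].lower()


def brand_average_prices(train_rows):
    """Mean Selling_Price per brand, using training rows only."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in train_rows:
        entry = totals[brand_of(row["Car_Name"])]
        entry[0] += row["Selling_Price"]
        entry[1] += 1
    return {brand: total / count for brand, (total, count) in totals.items()}


def engineer(row, brand_avg, fallback):
    """Return the five engineered features for one listing."""
    car_age = REFERENCE_YEAR - row["Year"]
    return {
        "Car_Age": car_age,
        "KM_per_Year": row["Driven_kms"] / (car_age + 1),
        "Age_x_KM": car_age * row["Driven_kms"],
        "Price_per_KM": row["Present_Price"] / (row["Driven_kms"] + 1),
        # Brands never seen in training fall back to the global training mean.
        "Brand_Avg_Price": brand_avg.get(brand_of(row["Car_Name"]), fallback),
    }


# Hypothetical listings (prices in Lakhs INR).
train_rows = [
    {"Car_Name": "alpha hatch", "Year": 2014, "Selling_Price": 3.0,
     "Present_Price": 5.5, "Driven_kms": 26000},
    {"Car_Name": "alpha sedan", "Year": 2016, "Selling_Price": 5.0,
     "Present_Price": 8.0, "Driven_kms": 40000},
    {"Car_Name": "beta suv", "Year": 2017, "Selling_Price": 12.0,
     "Present_Price": 18.0, "Driven_kms": 15000},
]
unseen_row = {"Car_Name": "gamma coupe", "Year": 2015,
              "Present_Price": 10.0, "Driven_kms": 32999}

brand_avg = brand_average_prices(train_rows)
fallback = sum(r["Selling_Price"] for r in train_rows) / len(train_rows)

first = engineer(train_rows[0], brand_avg, fallback)
unseen = engineer(unseen_row, brand_avg, fallback)
print(first)
print(unseen)
```

With pandas the same formulas reduce to one-line column expressions; the dictionary version is used here only to keep the sketch free of dependencies.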

Models Trained and Compared

| Model | Features Used |
|---|---|
| Linear Regression | All 11 features |
| Ridge Regression | All 11 features |
| Lasso Regression | All 11 features |
| Random Forest | All 11 features |
| Gradient Boosting | All 11 features |
| XGBoost | All 11 features |
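
A comparison loop over these model families might look like the sketch below. It assumes scikit-learn and NumPy are installed, uses synthetic stand-in data rather than car_data.csv, and leaves out XGBoost to keep dependencies small. The hyperparameters and the two synthetic features are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price decays non-linearly with age.
rng = np.random.default_rng(42)
n = 300
present_price = rng.uniform(1, 30, n)
car_age = rng.integers(1, 16, n)
X = np.column_stack([present_price, car_age])
y = present_price * np.exp(-0.08 * car_age) + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.01),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit every model on the same split and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "R2": float(r2_score(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
    }

# Print models from best to worst R2.
for name, scores in sorted(results.items(), key=lambda kv: -kv[1]["R2"]):
    print(f"{name:20s} R2={scores['R2']:.4f} "
          f"RMSE={scores['RMSE']:.4f} MAE={scores['MAE']:.4f}")
```

Using one fixed train/test split for every model keeps the scores comparable across rows of the results table.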

Results

| Model | R² | RMSE | MAE |
|---|---|---|---|
| Gradient Boosting | 0.9697 | 0.8355 | 0.5060 |
| XGBoost | 0.9654 | 0.8925 | 0.5522 |
| Random Forest | 0.9472 | 1.1030 | 0.6678 |
| Linear Regression | 0.8493 | 1.8629 | 1.1484 |
| Lasso Regression | 0.8488 | 1.8663 | 1.1401 |
| Ridge Regression | 0.8468 | 1.8785 | 1.1540 |

Best model: Gradient Boosting (R² = 0.9697)
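
For reference, the three reported metrics can be computed from their standard definitions as in this plain-Python sketch. The function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE and MAE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                      # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Hypothetical values (Lakhs INR).
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE penalises large errors more heavily than MAE, which is why RMSE is the larger of the two in every row of the table.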


Model Performance Comparison

[Figure: Model Comparison]


Actual vs Predicted Selling Price

[Figure: Actual vs Predicted]

Predictions from the best models cluster tightly around the perfect-prediction line, with minimal scatter at higher price points.


Residual Analysis

[Figure: Residual Plots]

Residuals are scattered randomly around zero, which is consistent with a good fit and no systematic bias.
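
Two simple numeric checks can back up a visual reading of a residual plot: the mean residual should be close to zero, and residuals should not correlate with the predictions. The sketch below is plain Python with hypothetical values; `residual_summary` is an illustrative helper, not part of the notebook.

```python
import math
import statistics


def residual_summary(actual, predicted):
    """Mean residual and Pearson correlation of residuals with predictions."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean_res = statistics.fmean(residuals)
    mean_pred = statistics.fmean(predicted)
    cov = sum((p - mean_pred) * (r - mean_res)
              for p, r in zip(predicted, residuals))
    var_p = sum((p - mean_pred) ** 2 for p in predicted)
    var_r = sum((r - mean_res) ** 2 for r in residuals)
    corr = cov / math.sqrt(var_p * var_r) if var_p > 0 and var_r > 0 else 0.0
    return {"mean_residual": mean_res, "corr_with_prediction": corr}


# Hypothetical prices (Lakhs INR).
actual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
well_behaved = [1.2, 1.8, 3.1, 3.9, 5.2, 5.8]   # errors alternate in sign
compressed = [1.5, 2.3, 3.0, 3.7, 4.4, 5.1]     # under-predicts expensive cars

good = residual_summary(actual, well_behaved)
biased = residual_summary(actual, compressed)
print(good)
print(biased)
```

A strong correlation, as in the second case, signals that the error grows with price even when the plot looks roughly centred.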


Random Forest Feature Importances

[Figure: Feature Importance]


XGBoost Feature Importances

[Figure: XGBoost Feature Importance]

Both ensemble models agree: Present Price and Brand_Avg_Price are the dominant predictors.


Prediction Confidence Band

[Figure: Confidence Band]

The 95% confidence band shows the model's prediction range across the test set, sorted by actual price.
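
The README does not say how the band is computed. One simple way to build an approximate 95% band, shown here purely as an assumption, is to take each prediction plus or minus 1.96 standard deviations of held-out residuals, which treats the residuals as roughly normal with constant spread. The helper `residual_band` and the values are hypothetical.

```python
import statistics


def residual_band(holdout_actual, holdout_predicted, new_predictions, z=1.96):
    """Return (lower, upper) bounds around each new prediction."""
    residuals = [a - p for a, p in zip(holdout_actual, holdout_predicted)]
    spread = statistics.stdev(residuals)  # sample standard deviation
    return [(p - z * spread, p + z * spread) for p in new_predictions]


# Hypothetical held-out prices and predictions (Lakhs INR).
holdout_actual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
holdout_predicted = [1.2, 1.8, 3.1, 3.9, 5.2, 5.8]

band = residual_band(holdout_actual, holdout_predicted, [4.0, 10.0])
lower, upper = band[0]
print(f"prediction 4.0 -> [{lower:.2f}, {upper:.2f}]")
```

Because the width is constant, this kind of band will be too narrow for expensive cars if errors grow with price; quantile-based intervals are a common alternative in that case.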


Key Business Findings

  • Present showroom price is the strongest single predictor of resale value across all models.
  • Diesel cars retain value better than Petrol; Automatic transmission cars command a premium.
  • First-owner cars (Owner = 0) retain significantly more value than second or third-owner vehicles.
  • Car age is a key depreciation driver; steepest drop occurs within the first 5 years.
  • Brand prestige (encoded via brand average price) is highly informative for tree-based models.
  • Ensemble models (Gradient Boosting, XGBoost, Random Forest) clearly outperform the linear models on this dataset (R² of 0.95 to 0.97 versus about 0.85), suggesting non-linear relationships between features and price.

How to Run

  1. Clone the repository:

    git clone https://github.com/Tosa9/CodeAlpha_CarPricePrediction.git
    cd CodeAlpha_CarPricePrediction

  2. Install dependencies:

    pip install -r requirements.txt

  3. Launch the notebook:

    jupyter notebook notebooks/car_price_prediction.ipynb

Dataset Source

Car Price Prediction Dataset — Kaggle



About

Used car selling price prediction using Linear, Ridge, Lasso, Random Forest, Gradient Boosting, and XGBoost regression. Best model: Gradient Boosting at R²=0.970. Includes depreciation analysis, feature engineering, brand encoding, and 16 embedded visualisations. CodeAlpha Data Science Internship — Task 3.
