This repository contains the implementation and analysis of a Car Price Prediction project. The project aims to predict car prices using various data analysis, transformation, and machine learning techniques.
- NumPy: For numerical computations.
- Pandas: For data manipulation and analysis.
- Matplotlib: For data visualization.
- SciPy: For statistical analysis.
- Seaborn: For enhanced visualizations.
- Scikit-learn (`sklearn`): For building and evaluating machine learning models.
- Data loaded into the `bpd` dataframe.
- Column headers were added based on index numbers.
- Dataset saved as CSV for future use.
- Explored features and their data types.
- Generated statistical summaries using `describe(include="all")`.
- Used `info()` to inspect non-null values and data types.
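The loading and inspection steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the three sample rows and the five column names are hypothetical stand-ins for the real header-less file and its full header list.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the raw, header-less data file.
raw = io.StringIO(
    "3,?,alfa-romero,gas,13495\n"
    "1,?,alfa-romero,gas,16500\n"
    "2,164,audi,gas,13950\n"
)

# Hypothetical column names; the project's real header list is longer.
headers = ["symboling", "normalized-losses", "make", "fuel-type", "price"]

# header=None stops pandas from treating the first data row as column names.
bpd = pd.read_csv(raw, header=None, names=headers)

# Serialize the labelled frame as CSV; index=False leaves out the row index.
# Passing a file path as the first argument would write to disk instead.
csv_text = bpd.to_csv(index=False)

# Summary statistics for numeric and non-numeric columns alike.
summary = bpd.describe(include="all")
print(summary)

# info() prints to stdout by default; a buffer captures the text instead.
info_buffer = io.StringIO()
bpd.info(buf=info_buffer)
info_text = info_buffer.getvalue()
print(info_text)
```

Note that `normalized-losses` is read as text here because of the `?` placeholders, which is the kind of issue `info()` helps surface.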
Data cleaning steps:
- Identified missing data.
- Handled missing data.
- Corrected data formats.
- Converted quantitative features to appropriate metrics using mathematical techniques.
- Normalized numerical features.
- Used binning for categorizing numerical variables.
- Applied one-hot encoding to convert categorical variables into numerical ones.
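A minimal sketch of the cleaning and transformation steps listed above, on a small made-up frame. Several choices here are assumptions, since the README does not state them: `?` as the missing-value marker, mean imputation, dropping rows without a price, the mpg to L/100km conversion (using the approximate factor 235), scaling by the column maximum, and three equal-width horsepower bins.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names and values are illustrative only.
bpd = pd.DataFrame({
    "normalized-losses": ["164", "?", "158", "192", "150"],
    "horsepower": ["48", "102", "160", "160", "262"],
    "city-mpg": [47, 24, 19, 19, 13],
    "length": [141.1, 176.6, 192.7, 188.8, 191.7],
    "fuel-type": ["gas", "diesel", "gas", "gas", "gas"],
    "price": ["5151", "13950", "?", "17450", "36000"],
})

# Identify missing data: turn the assumed "?" marker into NaN and count it.
bpd = bpd.replace("?", np.nan)
missing_counts = bpd.isnull().sum()

# Handle missing data: impute a numeric feature with its mean (assumed strategy).
losses = bpd["normalized-losses"].astype("float")
bpd["normalized-losses"] = losses.fillna(losses.mean())

# Rows without the target value cannot be used for training, so drop them.
bpd = bpd.dropna(subset=["price"]).reset_index(drop=True)

# Correct data format: numeric columns that were read as text.
bpd["horsepower"] = bpd["horsepower"].astype("float")
bpd["price"] = bpd["price"].astype("float")

# Unit conversion: miles per gallon to litres per 100 km (approximately 235 / mpg).
bpd["city-L/100km"] = 235 / bpd["city-mpg"]

# Normalization by simple feature scaling: divide by the column maximum.
bpd["length"] = bpd["length"] / bpd["length"].max()

# Binning: three equal-width horsepower ranges.
edges = np.linspace(bpd["horsepower"].min(), bpd["horsepower"].max(), 4)
bpd["horsepower-binned"] = pd.cut(
    bpd["horsepower"], bins=edges,
    labels=["Low", "Medium", "High"], include_lowest=True,
)

# One-hot encoding: one indicator column per category.
dummies = pd.get_dummies(bpd["fuel-type"], prefix="fuel-type")
bpd = pd.concat([bpd.drop(columns="fuel-type"), dummies], axis=1)
print(bpd)
```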
- Continuous Numerical Variables Analysis: Regression plots to assess linear relationships.
- Categorical Variables Analysis: Used box plots, `value_counts`, grouping, and pivot tables.
- Descriptive Statistical Analysis: Heatmaps, correlation, causation, and ANOVA analysis.
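The non-graphical parts of this analysis might look like the sketch below. The nine rows are invented for illustration, and the choice of `engine-size` and `drive-wheels` as example features is an assumption; the plots (regression plots, box plots, heatmaps) are omitted here.

```python
import pandas as pd
from scipy import stats

# Invented sample with one categorical and one continuous feature.
bpd = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "engine-size": [90, 98, 109, 130, 152, 209, 108, 110, 97],
    "price": [6500, 7900, 9900, 16500, 21000, 36000, 9200, 11200, 7600],
})

# Categorical variable: how many cars fall in each category.
counts = bpd["drive-wheels"].value_counts()

# Grouping: average price per category.
mean_price = bpd.groupby("drive-wheels")["price"].mean()

# Continuous variable: Pearson correlation with price, plus its p-value.
pearson_r, p_value = stats.pearsonr(bpd["engine-size"], bpd["price"])

# Correlation matrix, the usual input for a heatmap.
corr_matrix = bpd[["engine-size", "price"]].corr()

# One-way ANOVA: do mean prices differ between drive-wheel groups?
groups = [group["price"].to_numpy() for _, group in bpd.groupby("drive-wheels")]
f_stat, anova_p = stats.f_oneway(*groups)

print(counts, mean_price, corr_matrix, sep="\n\n")
print(f"Pearson r = {pearson_r:.3f} (p = {p_value:.4f})")
print(f"ANOVA F = {f_stat:.2f} (p = {anova_p:.4f})")
```

A large F statistic with a small p-value suggests the categorical feature separates prices well; a correlation close to 1 or -1 suggests a strong linear relationship, though neither establishes causation.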
Key variables for price prediction:
- Continuous Numerical Variables: Length, Width, Curb-weight, Engine-size, Horsepower, City-mpg, Highway-mpg, Wheel-base, Bore.
- Categorical Variables: Drive-wheels.
- Simple Linear Regression: One independent variable.
- Multiple Linear Regression (MLR): Multiple independent variables.
- Polynomial Regression: Non-linear relationships handled via polynomial transformations.
- Pipelines: Simplified data preprocessing and scaling using `Pipeline` and `StandardScaler`.
- Used regression and residual plots for model visualization.
- Evaluated models using R² and Mean Squared Error (MSE) metrics.
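As a rough sketch of how the three model types and the two metrics fit together, the example below trains each model on synthetic data and scores it in-sample. The data-generating formula, the feature names, and the degree-2 polynomial are all assumptions made for illustration, so the printed numbers are unrelated to the project's reported results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): price depends on two features, one non-linearly.
rng = np.random.default_rng(0)
n = 200
engine_size = rng.uniform(60, 300, n)
curb_weight = rng.uniform(1500, 4000, n)
price = (100 * engine_size + 4 * curb_weight
         + 0.2 * engine_size ** 2 + rng.normal(0, 500, n))

X_single = engine_size.reshape(-1, 1)               # one independent variable
X_multi = np.column_stack([engine_size, curb_weight])  # several variables

# Pipeline: scale, expand to polynomial terms, then fit a linear model.
polynomial_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

models = {
    "simple": (LinearRegression(), X_single),
    "multiple": (LinearRegression(), X_multi),
    "polynomial pipeline": (polynomial_pipeline, X_multi),
}

scores = {}
for name, (model, X) in models.items():
    model.fit(X, price)
    predicted = model.predict(X)
    # R² closer to 1 and a smaller MSE both indicate a better fit.
    scores[name] = (r2_score(price, predicted),
                    mean_squared_error(price, predicted))
    print(f"{name}: R² = {scores[name][0]:.4f}, MSE = {scores[name][1]:.3e}")
```

Scores computed on the training data tend to be optimistic, which is why held-out evaluation matters.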
Performance Metrics:
- Simple Linear Regression:
  - R²: 0.6418
  - MSE: 2.25 × 10⁷
- Multiple Linear Regression:
  - R²: 0.8119
  - MSE: 1.2 × 10⁷
- Polynomial Regression:
  - R²: 0.6754
  - MSE: 2.04 × 10⁷
Conclusion: MLR provided the best results due to its ability to account for multiple variables.
- Predicted outcomes for the test dataset using regression models.
- Compared training and testing R² scores.
- Applied `cross_val_score` to address limited test data issues.
- Techniques Used:
  - Polynomial Features
  - Ridge Regression
  - Hyperparameter Tuning (tuning the `alpha` parameter with Grid Search).
- Optimized Result: Achieved an R² score of 0.84 for the test dataset after optimization.
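The evaluation and refinement workflow above could be sketched as follows. The synthetic data, the 80/20 split, the four folds, and the `alpha` grid are assumptions chosen for illustration and are not taken from the notebooks, so the resulting scores will differ from the 0.84 reported above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data (assumption), with one quadratic effect.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(150, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, 150)

# Hold out a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Polynomial features feeding a ridge (L2-regularized) regression.
pipe = Pipeline([
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Cross-validation reuses the training data as several train/validation folds,
# which gives a steadier estimate when little data is available.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=4)
print(f"Cross-validated R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Grid search: try each alpha with cross-validation and keep the best one.
alphas = [0.001, 0.1, 1, 10, 100, 1000]
search = GridSearchCV(pipe, {"ridge__alpha": alphas}, cv=4)
search.fit(X_train, y_train)
best_alpha = search.best_params_["ridge__alpha"]

# Compare training and test R² to check for overfitting.
train_r2 = search.score(X_train, y_train)
test_r2 = search.score(X_test, y_test)
print(f"Best alpha: {best_alpha}, train R²: {train_r2:.3f}, test R²: {test_r2:.3f}")
```

A test R² far below the training R² would point to overfitting, in which case a larger `alpha` or a lower polynomial degree is worth trying.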
Through the systematic application of machine learning models and evaluation techniques, we identified that the Multiple Linear Regression model offers the best predictive power for car price estimation. The use of model refinement and hyperparameter tuning further improved prediction accuracy.
- auto.csv/: Contains raw and processed datasets.
- OLD CAR PRICE DATASET ANALYSIS/: Jupyter notebooks with detailed analysis and visualization of the dataset.
- MODEL DEVELOPMENT AND EVALUATION/: Jupyter notebooks covering model development, analysis, visualization, and refinement.
This project provides a robust framework for car price prediction using exploratory data analysis and machine learning techniques. The repository can be extended for other regression problems with similar workflows.
For contributions or feedback, feel free to raise an issue or submit a pull request! 🚗📊