- Exploratory Data Analysis (EDA): Investigate data integrity, feature distributions, correlations, and outliers to guide modeling decisions.
- Data Preprocessing & Feature Engineering: Automatic handling of missing values, categorical encoding, numerical scaling, and derivation of new features such as car age.
- Model Development & Hyperparameter Tuning: Baseline linear regression plus tree-based regressors (Random Forest, Gradient Boosting) with cross-validated grid search.
- Model Evaluation: Performance metrics (MAE, RMSE, R²), residual analysis, and feature-importance visualizations.
- Model Serialization: A single, versioned `sklearn.pipeline.Pipeline` (including preprocessing) saved via `joblib` for reproducible inference.
- REST API: Lightweight FastAPI/Flask service exposing a `/predict` endpoint for real-time price estimates.
- Docker Support: Dockerfile for containerized deployment on platforms such as Heroku, AWS ECS, or Azure Web Apps.
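The preprocessing and feature-engineering steps listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual `data_prep.py`: the column names (`year`, `mileage`, `engine_size`, `make`, `fuel_type`), the helpers `add_car_age` and `build_pipeline`, and the imputation strategies are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; adjust to the real dataset schema.
NUMERIC = ["car_age", "mileage", "engine_size"]
CATEGORICAL = ["make", "fuel_type"]


def add_car_age(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Return a copy with car_age derived from an assumed `year` column."""
    out = df.copy()
    out["car_age"] = reference_year - out["year"]
    return out


def build_pipeline(model=None) -> Pipeline:
    """Preprocessing plus estimator in one object, so it can be saved as a unit."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing numbers
        ("scale", StandardScaler()),                   # zero mean, unit variance
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # unseen categories -> all zeros
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])  # columns not listed (e.g. `year`) are dropped by default
    return Pipeline([
        ("preprocess", preprocess),
        ("model", model if model is not None else LinearRegression()),
    ])
```

Keeping preprocessing inside the pipeline means the same transformations are applied at training and inference time, which is the point of serializing a single object.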
/
├── data/              # Raw and processed datasets
├── images/            # Static assets (figures, banners)
├── notebooks/         # Exploratory Data Analysis (EDA) notebooks
├── src/               # Application code
│   ├── data_prep.py   # Data cleaning & feature engineering
│   ├── train.py       # Model training & hyperparameter tuning
│   ├── evaluate.py    # Evaluation metrics & plots
│   └── app.py         # Flask web application
├── requirements.txt   # Python dependencies
└── README.md          # This file
We train regression models (Linear Regression, Random Forest, Gradient Boosting) on a used-car dataset to learn how features like age, mileage, engine size, make/model and fuel type influence resale price. After evaluating performance (MAE, RMSE, R²), the best model is serialized and served via a lightweight Flask API.
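To illustrate the serving step, here is a hedged sketch of a Flask `/predict` endpoint. It is not the repository's `app.py`: the factory name `create_app`, the JSON field names, and the `predicted_price` response key are assumptions. The model is passed in as any object with a `predict` method that accepts a DataFrame.

```python
import pandas as pd
from flask import Flask, jsonify, request


def create_app(model) -> Flask:
    """Build a Flask app around any object exposing predict(DataFrame)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # silent=True returns None instead of raising on a non-JSON body
        payload = request.get_json(silent=True)
        if not isinstance(payload, dict):
            return jsonify(error="Expected a JSON object of car features"), 400
        features = pd.DataFrame([payload])  # one row per request
        price = float(model.predict(features)[0])
        return jsonify(predicted_price=price)

    return app
```

In practice the model argument would likely be the serialized pipeline, loaded once at startup with `joblib.load(...)`, so that each request only pays for inference.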
- Clone the repository and navigate into the project folder.
- Create a Python virtual environment and activate it.
- Install all required packages from `requirements.txt`.
Open the eda.ipynb notebook under notebooks/.
- Perform data quality checks (missing values, duplicates, outliers).
- Visualize distributions for numeric and categorical features.
- Generate correlation matrices and scatter plots (e.g. price vs. mileage, price vs. age) to inform feature engineering.
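The checks above can be sketched with pandas as follows. This is an illustration rather than the notebook's actual code: the helper names `quality_report` and `price_correlations`, the `price` target column, and the 1.5 × IQR outlier rule are assumptions.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, duplicate rows, and IQR outliers per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR outside the quartiles.
    outliers = (numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)).sum()
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "outliers": outliers.to_dict(),
    }


def price_correlations(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of every numeric feature with the target, sorted ascending."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values()
```

Strongly negative entries in `price_correlations` (mileage and age are the usual suspects) are candidates for closer inspection with scatter plots.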
- Prepare and clean your processed dataset (under data/processed/).
- Run the training script in src/train.py to fit your regression pipelines and perform hyperparameter tuning.
- Run the evaluation script in src/evaluate.py to compute MAE, RMSE and R² on hold-out data and save residual plots.
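As a rough picture of what the training and evaluation steps involve, the sketch below tunes a Random Forest with cross-validated grid search and scores it on hold-out data. The function name `tune_and_evaluate`, the parameter grid, the 80/20 split, and the MAE scoring choice are assumptions for illustration; the actual `train.py` and `evaluate.py` may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_and_evaluate(X, y, seed: int = 0) -> dict:
    """Grid-search a Random Forest on a training split, then score the hold-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid={"n_estimators": [20, 50], "max_depth": [None, 5]},  # illustrative grid
        scoring="neg_mean_absolute_error",  # sklearn maximizes, hence the negated MAE
        cv=3,
    )
    search.fit(X_train, y_train)  # refits the best configuration on all training data
    pred = search.predict(X_test)
    return {
        "best_params": search.best_params_,
        "mae": float(mean_absolute_error(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
        "residuals": np.asarray(y_test) - pred,  # input for residual plots
    }
```

The hold-out split is never seen during the grid search, so the reported metrics are not biased by the tuning itself.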
Outputs:
- `models/car_price_pipeline.joblib`: serialized sklearn `Pipeline`
- `reports/metrics.json`: performance metrics
- Residual plots in the `reports/` folder
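A minimal sketch of how these artifacts might be written, assuming a hypothetical helper `save_artifacts` and a `root` argument that are not part of the repository. Only the two file paths are taken from the list above.

```python
import json
from pathlib import Path

import joblib


def save_artifacts(pipeline, metrics: dict, root: str = "."):
    """Persist the fitted pipeline and its metrics under models/ and reports/."""
    model_path = Path(root) / "models" / "car_price_pipeline.joblib"
    metrics_path = Path(root) / "reports" / "metrics.json"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipeline, model_path)  # preprocessing and model in one file
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return model_path, metrics_path
```

The pipeline can later be restored with `joblib.load(model_path)`. Loading generally expects a scikit-learn version compatible with the one used for saving, so pinning it in `requirements.txt` is a sensible precaution.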
