
Fuel efficiency (mpg) analysis and prediction using linear regression and GAM models on the StatLib Auto dataset (1983). Includes preprocessing, correlation, subset selection, and model evaluation in R.


SergeyFilipov/AutoMPG-Analysis


AutoMPG-Analysis

Statistical analysis and fuel efficiency prediction (mpg) using linear regression and GAM models on the classic Auto dataset from StatLib (1983).

Dataset

The analysis is based on the Auto.csv dataset, available as part of the resources for the book:

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R, 2nd Edition. Springer-Verlag, New York. Available at: https://www.statlearning.com

Dataset Source

This dataset comes from the StatLib library, maintained by Carnegie Mellon University, and was used in the 1983 ASA Exposition.

You can download the .csv file directly from the Resources section of the ISLR book website: https://www.statlearning.com

Dataset Description

Observations: 392 vehicles
Features: 9 variables

| Variable | Description |
|---|---|
| mpg | Miles per gallon (fuel efficiency) |
| cylinders | Number of cylinders (4–8) |
| displacement | Engine displacement (cubic inches) |
| horsepower | Engine horsepower |
| weight | Vehicle weight (lbs) |
| acceleration | Time to accelerate from 0 to 60 mph (seconds) |
| year | Model year (e.g. 70 = 1970) |
| origin | Car origin: 1 = US, 2 = Europe, 3 = Japan |
| name | Vehicle name (e.g. "chevrolet chevelle malibu") |

Note: The original dataset included 397 entries. The ISLR2 version removes 5 rows due to missing horsepower values. In this project, we handle missing values manually during data preprocessing.

1. Data Import & Library Setup

Description: We begin by loading all necessary R libraries for data wrangling, visualization, and modeling. The dataset is imported from .csv, with "?" entries read as missing values.

Key steps:

Load core packages: dplyr, ggplot2, mgcv, leaps, etc.

Import Auto.csv and flag "?" as NA.

Output: A raw but structured data frame ready for cleaning and transformation.
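The import step can be sketched as follows. This is a minimal illustration, not the repository's script: the three-row file written to a temporary path is a hypothetical stand-in for Auto.csv, and its values are made up purely to show how `na.strings` behaves.

```r
# Hypothetical stand-in for Auto.csv (illustrative values, not real records)
csv_lines <- c(
  "mpg,cylinders,horsepower,name",
  "18,8,130,car a",
  "25,4,?,car b",
  "21,6,?,car c"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# na.strings = "?" turns the placeholder into NA, so horsepower
# is parsed as a numeric column instead of character
auto_raw <- read.csv(path, na.strings = "?", stringsAsFactors = FALSE)

str(auto_raw)
colSums(is.na(auto_raw))  # number of missing values per column
```

Without `na.strings`, the "?" entries would force the whole horsepower column to be read as text, which is why the flag is set at import time rather than patched afterwards.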

2. Data Inspection & Cleaning

Description: We inspect variable classes, recast types where needed, and handle missing values in horsepower using group-wise mean imputation. We also identify and visualize outliers.

Key steps:

Fix data types (e.g., origin → factor, horsepower → numeric)

Impute missing horsepower with the mean of cars sharing the same cylinders and origin

Generate boxplot of horsepower after cleaning

Insight: Missing values are filled from mechanically similar cars (same cylinder count and origin). Some outliers remain after cleaning, and the boxplot makes them visible.
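Group-wise mean imputation can be sketched in base R as below. The toy data frame and its values are assumptions for illustration only; the project itself may use dplyr for the same step.

```r
# Toy data with the same column names as Auto (values are illustrative)
toy <- data.frame(
  cylinders  = c(4, 4, 4, 8, 8, 8),
  origin     = factor(c(3, 3, 3, 1, 1, 1)),
  horsepower = c(70, NA, 90, 150, 170, NA)
)

# One grouping key per (cylinders, origin) combination
key <- paste(toy$cylinders, toy$origin)

# Mean horsepower within each group, ignoring missing values
group_mean <- ave(toy$horsepower, key,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their group mean
toy$horsepower <- ifelse(is.na(toy$horsepower), group_mean, toy$horsepower)

print(toy)
boxplot(toy$horsepower, main = "horsepower after imputation")
```

If a group contained only missing values, its mean would be NaN, so a real pipeline would likely need a fallback such as the overall mean.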

3. Exploratory Analysis of MPG

Description: We explore the distribution of mpg and compare fuel efficiency by region of origin (US, Europe, Japan). The dataset is then split into training and test subsets for modeling.

Key steps:

Boxplot & summary of mpg

Identify mpg outliers

Train-test split (75/25)

Visualize mpg by car origin

Insight: Japanese cars generally achieve higher mpg, as expected. Time to let the models speak.
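A 75/25 split can be sketched like this. R's built-in `mtcars` data is used here as a stand-in for the Auto data (an assumption, so the example runs without downloading anything), and the seed value is arbitrary.

```r
# mtcars is a stand-in for the Auto data frame
set.seed(123)  # arbitrary seed, only for reproducibility

n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train <- mtcars[train_idx, ]   # 75% of rows
test  <- mtcars[-train_idx, ]  # remaining 25%

# Outliers by the usual 1.5 * IQR boxplot rule
mpg_outliers <- boxplot.stats(mtcars$mpg)$out

summary(mtcars$mpg)
c(train = nrow(train), test = nrow(test))
```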

4. Correlation & Variable Relationships

Description: We calculate a correlation matrix of all numeric variables and visualize it using a color gradient. A custom upper-triangle scatter matrix shows nuanced relationships between variables.

Visualizations:

Heatmap of correlation coefficients

Colored scatterplots by model year (e.g., mpg vs. weight, horsepower, etc.)

Insight: Weight and displacement are highly correlated with each other, and both drag mpg down. Newer model years pull it back up.
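The correlation step might look like the sketch below. It again uses `mtcars` as a stand-in (its columns `wt`, `disp`, `hp` play the roles of weight, displacement, and horsepower), and points are colored by cylinder count because `mtcars` has no model-year column.

```r
# Stand-in numeric variables: wt ~ weight, disp ~ displacement, hp ~ horsepower
num_vars <- mtcars[, c("mpg", "wt", "disp", "hp")]

# Pearson correlation matrix
cor_mat <- cor(num_vars, use = "complete.obs")
print(round(cor_mat, 2))

# Upper-triangle scatter matrix, colored by a grouping variable
pairs(num_vars,
      lower.panel = NULL,
      col = as.integer(factor(mtcars$cyl)),
      pch = 19)
```

In this stand-in data the same pattern appears as in the insight above: weight and displacement move together, and both are negatively correlated with mpg.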

5. Baseline Linear Regression Modeling

Description: A multiple linear regression model is trained using weight, horsepower, year, and other predictors. Predictions are made on the test set and compared to actual values.

Key outputs:

Predictions with 95% prediction intervals

Plot: actual vs. predicted mpg

Summary of residuals and model stats

Insight: Not perfect, but decent. Linear regression gives a baseline idea of how physical specs impact efficiency.
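A baseline fit with 95% prediction intervals can be sketched as follows. The predictors `wt` and `hp` from `mtcars` are stand-ins; the project's actual formula is not shown in this README, so the one below is an assumption.

```r
set.seed(1)  # arbitrary seed
idx   <- sample(nrow(mtcars), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Multiple linear regression on the training rows
fit <- lm(mpg ~ wt + hp, data = train)
summary(fit)  # residual summary and model statistics

# Point predictions plus 95% prediction intervals for unseen cars
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
results <- data.frame(actual = test$mpg, pred)  # columns: actual, fit, lwr, upr

# Actual vs. predicted; points on the dashed line are perfect predictions
plot(results$actual, results$fit,
     xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, lty = 2)
```

Note that `interval = "prediction"` covers the uncertainty of a single new observation, so it is wider than a confidence interval for the mean response.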

6. Model Performance Evaluation

Description: We compute standard performance metrics (MSE, RMSE, MAE, R²) for both training and test sets to evaluate model generalization.

Metrics calculated:

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE)

R-squared (R²)

Insight: Some overfitting is visible: the model performs better on training data than on unseen test data.
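The four metrics can be computed with a small helper such as the one below. The function name `regression_metrics` is hypothetical, and R² is computed here as 1 - SSE/SST against the mean of the supplied actual values, which is one common convention for test-set R².

```r
# Hypothetical helper: returns MSE, RMSE, MAE and R-squared
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    MSE  = mse,
    RMSE = sqrt(mse),
    MAE  = mean(abs(err)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

# Small worked example
m <- regression_metrics(actual    = c(1, 2, 3, 4),
                        predicted = c(1.5, 2, 2.5, 4))
print(m)
```

Calling the helper once with training data and once with test data gives the side-by-side comparison used to judge generalization.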

7. Polynomial & Interaction Modeling

Description: We improve the linear model by introducing polynomial terms and interaction effects. This allows us to capture non-linear dependencies and mixed-variable impacts.

Upgrades include:

poly(weight, 2), poly(displacement, 2)

Interactions: horsepower * year, weight * cylinders

Insight: The enhanced model reduces error and boosts R², but complexity increases. It's a trade-off: more flexibility, less interpretability.
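The sketch below shows how polynomial and interaction terms enter an R formula. It uses `mtcars` as a stand-in, so the exact terms differ from the project's (there is no year column, and `hp * cyl` is an assumed interaction chosen for illustration).

```r
# Baseline: main effects only
baseline <- lm(mpg ~ wt + disp + hp + cyl, data = mtcars)

# Enhanced: quadratic terms plus an interaction.
# hp * cyl expands to hp + cyl + hp:cyl
enhanced <- lm(mpg ~ poly(wt, 2) + poly(disp, 2) + hp * cyl, data = mtcars)

r2 <- c(baseline = summary(baseline)$r.squared,
        enhanced = summary(enhanced)$r.squared)
adj_r2 <- c(baseline = summary(baseline)$adj.r.squared,
            enhanced = summary(enhanced)$adj.r.squared)

print(round(rbind(r2, adj_r2), 3))

# F-test for whether the extra terms improve the fit
anova(baseline, enhanced)
```

Because the baseline is nested in the enhanced model, in-sample R² cannot decrease; adjusted R² and test-set error are the fairer checks on whether the added complexity pays off.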

8. Feature Selection via Subset Regression

Description: We apply exhaustive subset selection to find the optimal number of predictors using regsubsets(). Models are evaluated by Adjusted R².

Visualization:

Line plot of Adjusted R² vs. number of predictors

Insight: A 4–5 variable model often offers the best balance between simplicity and performance. More isn't always better.
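Subset selection with `leaps::regsubsets()` can be sketched as follows, assuming the leaps package is installed. The data is again the `mtcars` stand-in, so the best model size found here says nothing about the Auto data.

```r
library(leaps)

# Exhaustive search (the default method) over all 10 mtcars predictors
subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

# Adjusted R-squared of the best model at each size
adj_r2    <- summary(subsets)$adjr2
best_size <- which.max(adj_r2)

# Line plot of adjusted R-squared vs. number of predictors
plot(seq_along(adj_r2), adj_r2, type = "b",
     xlab = "Number of predictors", ylab = "Adjusted R-squared")
points(best_size, adj_r2[best_size], pch = 19)

# Coefficients of the best model by adjusted R-squared
coef(subsets, best_size)
```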

9. Nonlinear Trends with LOESS

Description: Before jumping into GAMs, we visualize the true shape of mpg relationships using LOESS smoothing. Each numeric predictor is plotted against mpg to capture subtle curves.

Key features:

Smooth curves highlight nonlinear patterns

Visual cue for feature transformations

Insight: mpg rises nonlinearly with year, drops sharply with weight. Time to call in the GAMs.
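A LOESS curve for a single predictor can be sketched in base R as below. `mtcars$wt` stands in for weight, and the span of 0.75 is simply the `loess()` default, not a value taken from the project.

```r
# Local regression of mpg on weight (mtcars as stand-in)
lo <- loess(mpg ~ wt, data = mtcars, span = 0.75)

# Evaluate the smooth on a grid inside the observed range
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
grid$mpg_smooth <- predict(lo, newdata = grid)

# Scatterplot with the LOESS curve overlaid
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
lines(grid$wt, grid$mpg_smooth, lwd = 2)
```

Repeating this for each numeric predictor gives a quick visual check of which relationships are curved enough to warrant a smooth term.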

10. Generalized Additive Modeling (GAM)

Description: Using mgcv::gam(), we fit a flexible nonlinear model in which each predictor gets its own smooth term, s(...), to capture curved trends.

Evaluation:

Actual vs. predicted mpg plots

Performance metrics (MSE, RMSE, R², MAE) for both train and test sets

Insight: GAM delivers strong, interpretable performance. It handles curvature naturally and provides the best generalization of all tested models.
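A GAM fit with `mgcv::gam()` might look like the sketch below. The predictors, the basis size `k = 5`, and the REML smoothing method are assumptions chosen so the example runs on the small `mtcars` stand-in; they are not the project's settings.

```r
library(mgcv)

set.seed(1)  # arbitrary seed
idx   <- sample(nrow(mtcars), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# One smooth term per predictor; k kept small because the data is tiny
gam_fit <- gam(mpg ~ s(wt, k = 5) + s(hp, k = 5),
               data = train, method = "REML")
summary(gam_fit)

# Test-set predictions and error
pred_test <- as.numeric(predict(gam_fit, newdata = test))
rmse_test <- sqrt(mean((test$mpg - pred_test)^2))
rmse_test

# Actual vs. predicted on the test set
plot(test$mpg, pred_test, xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, lty = 2)
```

Calling `plot(gam_fit, pages = 1)` afterwards shows the estimated smooth for each predictor, which is where the interpretability of the additive structure comes from.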

Final Takeaway: This project shows how regression models, from linear baselines to nonlinear GAMs, can effectively decode the hidden patterns behind fuel efficiency. Combining visual exploration, feature selection, and flexible modeling makes for an insightful and powerful analysis.
