AutoMPG-Analysis

Statistical analysis and fuel-efficiency prediction (mpg) using linear regression and GAM models on the classic Auto dataset from StatLib (1983).
Dataset

The analysis is based on the Auto.csv dataset, available as part of the resources for the book:

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R, 2nd Edition. Springer, New York. Available at: https://www.statlearning.com

Dataset Source

This dataset comes from the StatLib library, maintained by Carnegie Mellon University, and was originally featured in the 1983 ASA Exposition. You can download the .csv file directly from the Resources section of the ISLR book website: https://www.statlearning.com

Dataset Description

Observations: 392 vehicles
Features: 9 variables

| Variable | Description |
| --- | --- |
| mpg | Miles per gallon (fuel efficiency) |
| cylinders | Number of cylinders (3–8) |
| displacement | Engine displacement (cubic inches) |
| horsepower | Engine horsepower |
| weight | Vehicle weight (lbs) |
| acceleration | Time to accelerate from 0 to 60 mph (seconds) |
| year | Model year (e.g. 70 = 1970) |
| origin | Car origin: 1 = US, 2 = Europe, 3 = Japan |
| name | Vehicle name (e.g. "chevrolet chevelle malibu") |

Note: The original dataset included 397 entries; the ISLR2 version removes 5 rows with missing horsepower values. In this project, missing values are handled manually during data preprocessing.
1. Data Import & Library Setup

Description: We begin by loading all necessary R libraries for data wrangling, visualization, and modeling. The dataset is imported from .csv with missing values handled explicitly.

Key steps:
Load core packages: dplyr, ggplot2, mgcv, leaps, etc.
Import Auto.csv and flag "?" as NA.
Output: A raw but structured data frame, ready for cleaning and transformation (see the sketch below).
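A minimal import sketch, assuming Auto.csv sits in the working directory (adjust the path as needed):

```r
# Core packages for wrangling, visualization, and modeling
library(dplyr)    # data manipulation
library(ggplot2)  # plots
library(mgcv)     # GAMs (step 10)
library(leaps)    # subset selection (step 8)

# Read the raw file; "?" marks missing horsepower values
auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = FALSE)
str(auto)  # inspect variable classes before cleaning
```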
2. Data Inspection & Cleaning

Description: We inspect variable classes, recast types where needed, and handle missing values in horsepower using group-wise mean imputation. We also identify and visualize outliers.

Key steps:

Fix data types (e.g., origin → factor, horsepower → numeric)
Impute missing horsepower by cylinders & origin
Generate boxplot of horsepower after cleaning
Insight: Missing values are filled in based on mechanical similarity. Outliers still sneak through, but we catch them (imputation sketch below).
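A sketch of the cleaning step, assuming the `auto` data frame from the import above; the grouping choice (cylinders and origin) follows the description:

```r
# Recast types: origin is a category, horsepower must be numeric
auto <- auto %>%
  mutate(
    origin     = factor(origin, levels = 1:3,
                        labels = c("US", "Europe", "Japan")),
    horsepower = as.numeric(horsepower)
  )

# Group-wise mean imputation: fill missing horsepower with the
# average of mechanically similar cars (same cylinders and origin)
auto <- auto %>%
  group_by(cylinders, origin) %>%
  mutate(horsepower = ifelse(is.na(horsepower),
                             mean(horsepower, na.rm = TRUE),
                             horsepower)) %>%
  ungroup()

# Visual check for remaining outliers
boxplot(auto$horsepower, main = "Horsepower after imputation")
```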
3. Exploratory Analysis of MPG

Description: We explore the distribution of mpg and compare fuel efficiency by region of origin (US, Europe, Japan). The dataset is then split into training and test subsets for modeling.

Key steps:
Boxplot & summary of mpg
Identify mpg outliers
Train-test split (75/25)
Visualize mpg by car origin
Insight: Japanese cars generally achieve higher mpg, as expected. Time to let the models speak (split and plot sketched below).
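A sketch of the 75/25 split and the origin comparison; the seed value is arbitrary, chosen here only for reproducibility:

```r
set.seed(42)  # arbitrary seed for a reproducible split
train_idx <- sample(seq_len(nrow(auto)), size = floor(0.75 * nrow(auto)))
train <- auto[train_idx, ]
test  <- auto[-train_idx, ]

# Fuel efficiency by region of origin
ggplot(auto, aes(x = origin, y = mpg, fill = origin)) +
  geom_boxplot() +
  labs(title = "MPG by car origin", y = "Miles per gallon")
```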
4. Correlation & Variable Relationships

Description: We calculate a correlation matrix of all numeric variables and visualize it using a color gradient. A custom upper-triangle scatter matrix shows nuanced relationships between variables.

Visualizations:
Heatmap of correlation coefficients
Colored scatterplots by model year (e.g., mpg vs. weight, horsepower, etc.)
Insight: Weight and displacement are highly correlated, and both drag mpg down; year brings salvation (heatmap sketch below).
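A minimal heatmap sketch using only the packages already loaded (a dedicated package such as corrplot would work equally well); the scatter matrix is omitted here:

```r
# Correlation matrix over the numeric columns only
cor_mat <- cor(select(auto, where(is.numeric)))

# Reshape to long format and draw a gradient-filled tile plot
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "corr")

ggplot(cor_long, aes(x = var1, y = var2, fill = corr)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```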
5. Baseline Linear Regression Modeling

Description: A multiple linear regression model is trained using weight, horsepower, year, and more. Predictions are made on the test set and compared to actual values.

Key outputs:
Predictions with 95% prediction intervals
Plot: actual vs. predicted mpg
Summary of residuals and model stats
Insight: Not perfect, but decent. Linear regression gives a baseline idea of how physical specs affect efficiency (see the sketch below).
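A baseline sketch; the exact predictor set here is illustrative and may differ from the project's final choice:

```r
# Baseline multiple linear regression on the training set
lm_fit <- lm(mpg ~ weight + horsepower + year + acceleration + origin,
             data = train)
summary(lm_fit)

# Test-set predictions with 95% prediction intervals
pred <- predict(lm_fit, newdata = test,
                interval = "prediction", level = 0.95)

# Actual vs. predicted mpg, with the perfect-prediction line
plot(test$mpg, pred[, "fit"],
     xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, col = "red")
```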
6. Model Performance Evaluation

Description: We compute standard performance metrics (MSE, RMSE, MAE, R²) for both training and test sets to evaluate model generalization.

Metrics calculated:
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R-squared (R²)

Insight: Some overfitting is visible; the model performs better on training data than on unseen test data (metric helper sketched below).
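A small helper, `eval_metrics()` (a hypothetical name, not from the project), makes the train/test comparison explicit:

```r
# Illustrative helper: MSE, RMSE, MAE, and R-squared for predictions
eval_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  mse   <- mean(resid^2)
  c(MSE  = mse,
    RMSE = sqrt(mse),
    MAE  = mean(abs(resid)),
    R2   = 1 - sum(resid^2) / sum((actual - mean(actual))^2))
}

eval_metrics(train$mpg, predict(lm_fit, newdata = train))  # training fit
eval_metrics(test$mpg,  predict(lm_fit, newdata = test))   # generalization
```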
7. Polynomial & Interaction Modeling

Description: We improve the linear model by introducing polynomial terms and interaction effects. This allows us to capture non-linear dependencies and mixed-variable effects.

Upgrades include:
poly(weight, 2), poly(displacement, 2)
Interactions: horsepower * year, weight * cylinders
Insight: The enhanced model reduces error and boosts R², but complexity increases. It's a trade-off: more flexibility, less interpretability (see the sketch below).
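A sketch of the upgraded formula; the weight interaction is written as `weight:cylinders` so that `poly(weight, 2)` keeps the main effect (a full `weight * cylinders` would duplicate the linear weight column and make the design rank-deficient):

```r
# Quadratic terms for curvature, interactions for mixed effects
poly_fit <- lm(mpg ~ poly(weight, 2) + poly(displacement, 2) +
                     horsepower * year + cylinders + weight:cylinders,
               data = train)
summary(poly_fit)

# Compare against the baseline, reusing the helper from step 6
eval_metrics(test$mpg, predict(poly_fit, newdata = test))
```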
8. Feature Selection via Subset Regression

Description: We apply exhaustive subset selection to find the optimal number of predictors using regsubsets(). Models are evaluated by Adjusted R².

Visualization:

Line plot of Adjusted R² vs. number of predictors
Insight: A 4–5 variable model often offers the best balance between simplicity and performance; more isn't always better (search sketched below).
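A sketch of the exhaustive search; `nvmax = 8` covers every candidate predictor (origin expands into two dummy variables):

```r
# Exhaustive best-subset search over all candidate predictors
subset_fit <- regsubsets(mpg ~ cylinders + displacement + horsepower +
                               weight + acceleration + year + origin,
                         data = train, nvmax = 8)
subset_sum <- summary(subset_fit)

# Adjusted R-squared as a function of model size
plot(subset_sum$adjr2, type = "b",
     xlab = "Number of predictors", ylab = "Adjusted R-squared")
which.max(subset_sum$adjr2)  # size of the best-scoring model
```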
9. Nonlinear Trends with LOESS

Description: Before jumping into GAMs, we visualize the true shape of the mpg relationships using LOESS smoothing. Each numeric predictor is plotted against mpg to capture subtle curves.

Key features:
Smooth curves highlight nonlinear patterns
Visual cue for feature transformations
Insight: mpg rises nonlinearly with year and drops sharply with weight. Time to call in the GAMs (LOESS sketch below).
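A sketch for one predictor; the same pattern repeats for each numeric variable:

```r
# LOESS smooth of mpg against weight; swap in other predictors as needed
ggplot(auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "mpg vs. weight with LOESS smoothing")
```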
10. Generalized Additive Modeling (GAM)

Description: Using mgcv::gam(), we fit a flexible nonlinear model that lets each variable tell its own story. Spline terms (s(...)) are used to capture smooth trends.

Evaluation:
Actual vs. predicted mpg plots
Performance metrics (MSE, RMSE, R², MAE) for both train and test sets

Insight: GAM delivers strong, interpretable performance. It handles curvature naturally and provides the best generalization of all tested models (see the sketch below).
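A GAM sketch with smooth terms for the continuous predictors; the exact set of `s(...)` terms is an assumption, not necessarily the project's final specification:

```r
# Flexible additive model: each s() term gets its own smooth curve
gam_fit <- gam(mpg ~ s(weight) + s(horsepower) + s(displacement) +
                     s(acceleration) + s(year) + origin,
               data = train)
summary(gam_fit)
plot(gam_fit, pages = 1)  # one panel per smooth term

# Train/test comparison, reusing the helper from step 6
eval_metrics(train$mpg, as.numeric(predict(gam_fit, newdata = train)))
eval_metrics(test$mpg,  as.numeric(predict(gam_fit, newdata = test)))
```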
Final Takeaway: This project shows how regression models, from linear baselines to nonlinear GAMs, can effectively decode the hidden patterns behind fuel efficiency. Combining visual exploration, feature selection, and flexible modeling makes for an insightful and powerful analysis.