<a href="https://colab.research.google.com/github/Shambhaviadhikari/PythonClass/blob/main/Auto_MPG_Regression_(Fall_2024)_G37903602.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Instructions






Perform a linear regression using the Auto MPG dataset (loaded for you in the setup section).

Incorporate the following aspects:

  1. Data
     + Load the data.
     + Explore the data, including distributions, correlation, etc. Make plots.
     + Check for null values. Handle null values by dropping or imputing.
     + Choose target and feature(s).
     + Encode features as necessary (ordinal vs one-hot).
     + Scale / normalize features as necessary.
     + Split into train and test sets (specifically 80/20 split). Remember to use a random seed to ensure your results are reproducible.
  2. Model
    + Use a `LinearRegression` model from sklearn.
    + Train the model using the training data.
    + Inspect artifacts from the training process:
      + Print the model's coefficients and intercept (i.e. line of best fit).
      + Inspect the coefficients by wrapping them in a pandas Series and labeling them with their corresponding feature names, then sort them in descending order.
      + Interpret the coefficients - which features contribute most to our model's predictive ability? Write your answer in a text cell.
  3. Evaluation
    + Make predictions for the test set.
    + Evaluate the results using sklearn regression metrics, specifically the r-squared score and mean squared error (MSE). Calculate the Root Mean Squared Error (RMSE) as well, based on the MSE. Interpret the results - how well did the model do? Answer in a text cell.




## Setup

In [None]:
from warnings import filterwarnings
filterwarnings("ignore")

In [None]:
%%capture
!pip install ucimlrepo

## Data Loading


### Auto MPG Dataset

https://archive.ics.uci.edu/dataset/9/auto+mpg

The Auto MPG dataset provides information about automobile fuel efficiency, in terms of miles per gallon (MPG).

We'll be using the version of this dataset hosted by UCI. They have a great repository of machine learning datasets, plus a new website and a Python package (`ucimlrepo`) we can use to load the data easily:


In [None]:
from ucimlrepo import fetch_ucirepo

repo = fetch_ucirepo(id=9)
print(type(repo)) # assuming this is dictionary-like

<class 'ucimlrepo.dotdict.dotdict'>


In [None]:
repo.keys()

dict_keys(['data', 'metadata', 'variables'])

The repo object has three keys: data, metadata (the dataset description), and variables (a data dictionary of sorts).

In [None]:
repo.metadata

{'uci_id': 9,
 'name': 'Auto MPG',
 'repository_url': 'https://archive.ics.uci.edu/dataset/9/auto+mpg',
 'data_url': 'https://archive.ics.uci.edu/static/public/9/data.csv',
 'abstract': 'Revised from CMU StatLib library, data concerns city-cycle fuel consumption',
 'area': 'Other',
 'tasks': ['Regression'],
 'characteristics': ['Multivariate'],
 'num_instances': 398,
 'num_features': 7,
 'feature_types': ['Real', 'Categorical', 'Integer'],
 'demographics': [],
 'target_col': ['mpg'],
 'index_col': ['car_name'],
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 1993,
 'last_updated': 'Thu Aug 10 2023',
 'dataset_doi': '10.24432/C5859H',
 'creators': ['R. Quinlan'],
 'intro_paper': None,
 'additional_info': {'summary': 'This dataset is a slightly modified version of the dataset provided in the StatLib library.  In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had 

In [None]:
repo.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,displacement,Feature,Continuous,,,,no
1,mpg,Target,Continuous,,,,no
2,cylinders,Feature,Integer,,,,no
3,horsepower,Feature,Continuous,,,,yes
4,weight,Feature,Continuous,,,,no
5,acceleration,Feature,Continuous,,,,no
6,model_year,Feature,Integer,,,,no
7,origin,Feature,Integer,,,,no
8,car_name,ID,Categorical,,,,no


We see the target is "mpg" and there are a number of features: some continuous, some integer-valued. The "origin" feature is stored as an integer but is really categorical, so we'll need to investigate further and decide how to encode it.

We see there are some missing values in the "horsepower" column. We'll need to handle them later.
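
The two usual options for the missing "horsepower" values, dropping rows or imputing, can be sketched on a small made-up frame. The values below are illustrative and are not rows from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing horsepower value (illustrative values only)
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 90.0],
    "mpg": [18.0, 25.0, 16.0, 30.0],
})

# Option 1: drop every row that contains a null
dropped = toy.dropna()

# Option 2: fill nulls with the column median
imputed = toy.fillna({"horsepower": toy["horsepower"].median()})

print(len(dropped))                    # rows left after dropping
print(imputed["horsepower"].tolist())  # nulls replaced by the median
```

Dropping is the simpler choice when few rows are affected; imputing keeps every row at the cost of inventing a value.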

Finally, here is our dataset:

In [None]:
# repo.data.keys()

In [None]:
# print(type(repo.data.features))
# print(type(repo.data.targets))
# print(type(repo.data.ids))

In [None]:
df = repo.data.original
df.head()

Unnamed: 0,car_name,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,mpg
0,"chevrolet,chevelle,malibu",8,307.0,130.0,3504,12.0,70,1,18.0
1,"buick,skylark,320",8,350.0,165.0,3693,11.5,70,1,15.0
2,"plymouth,satellite",8,318.0,150.0,3436,11.0,70,1,18.0
3,"amc,rebel,sst",8,304.0,150.0,3433,12.0,70,1,16.0
4,"ford,torino",8,302.0,140.0,3449,10.5,70,1,17.0


What's the "origin" feature about?

The mapping can be inferred from the car names, or found in various [internet](https://rstudio-pubs-static.s3.amazonaws.com/516461_09a0ec8250df45c4bb362c97ad7fd965.html) [resources](https://www.kaggle.com/code/asokraju/auto-mpg-dataset), which give the following: (1: USA, 2: Europe, 3: Asia).

In [None]:
df[df["origin"] == 1]["car_name"] # north american cars
df[df["origin"] == 2]["car_name"] # european cars
df[df["origin"] == 3]["car_name"] # japanese / asian cars

Unnamed: 0,car_name
14,"toyota,corona,mark,ii"
18,"datsun,pl510"
29,"datsun,pl510"
31,"toyota,corona"
53,"toyota,corolla,1200"
...,...
382,"toyota,corolla"
383,"honda,civic"
384,"honda,civic,(auto)"
385,"datsun,310,gx"


In [None]:
ORIGINS_MAP = {1: "usa", 2: "europe", 3: "asia"}


# todo: map and one-hot encode the origin
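
One way to complete the todo above, as a sketch: map the integer codes to labels with `ORIGINS_MAP`, then one-hot encode with `pandas.get_dummies`. The toy frame is a stand-in for `df`, and its rows are made up:

```python
import pandas as pd

ORIGINS_MAP = {1: "usa", 2: "europe", 3: "asia"}

# Toy stand-in for the real dataframe (illustrative rows only)
toy = pd.DataFrame({"origin": [1, 3, 2, 1]})

# Map integer codes to readable labels
toy["origin"] = toy["origin"].map(ORIGINS_MAP)

# One-hot encode; dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(toy["origin"], prefix="origin", dtype=int)
print(dummies.columns.tolist())
```

One-hot encoding suits "origin" because the regions have no natural order. When the model also fits an intercept, passing `drop_first=True` to `get_dummies` is a common way to avoid a redundant column.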

## Solution

### Data Exploration and Preprocessing

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pandas import Series
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


In [None]:
# Display the first few rows
df.head()

Unnamed: 0,car_name,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,mpg
0,"chevrolet,chevelle,malibu",8,307.0,130.0,3504,12.0,70,1,18.0
1,"buick,skylark,320",8,350.0,165.0,3693,11.5,70,1,15.0
2,"plymouth,satellite",8,318.0,150.0,3436,11.0,70,1,18.0
3,"amc,rebel,sst",8,304.0,150.0,3433,12.0,70,1,16.0
4,"ford,torino",8,302.0,140.0,3449,10.5,70,1,17.0


In [None]:
df.describe()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,mpg
count,398.0,398.0,392.0,398.0,398.0,398.0,398.0,398.0
mean,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005,1.572864,23.514573
std,1.701004,104.269838,38.49116,846.841774,2.757689,3.697627,0.802055,7.815984
min,3.0,68.0,46.0,1613.0,8.0,70.0,1.0,9.0
25%,4.0,104.25,75.0,2223.75,13.825,73.0,1.0,17.5
50%,4.0,148.5,93.5,2803.5,15.5,76.0,1.0,23.0
75%,8.0,262.0,126.0,3608.0,17.175,79.0,2.0,29.0
max,8.0,455.0,230.0,5140.0,24.8,82.0,3.0,46.6


In [None]:
# Preview every row except those with the maximum horsepower (result is not stored)
df[df['horsepower'] != df['horsepower'].max()]

Unnamed: 0,car_name,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,mpg
0,"chevrolet,chevelle,malibu",8,307.0,130.0,3504,12.0,70,1,18.0
1,"buick,skylark,320",8,350.0,165.0,3693,11.5,70,1,15.0
2,"plymouth,satellite",8,318.0,150.0,3436,11.0,70,1,18.0
3,"amc,rebel,sst",8,304.0,150.0,3433,12.0,70,1,16.0
4,"ford,torino",8,302.0,140.0,3449,10.5,70,1,17.0
...,...,...,...,...,...,...,...,...,...
393,"ford,mustang,gl",4,140.0,86.0,2790,15.6,82,1,27.0
394,"vw,pickup",4,97.0,52.0,2130,24.6,82,2,44.0
395,"dodge,rampage",4,135.0,84.0,2295,11.6,82,1,32.0
396,"ford,ranger",4,120.0,79.0,2625,18.6,82,1,28.0


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define the x and y variables
x_vars = ["displacement", "weight", "acceleration"]
y_var = "mpg"

# Set up a subplot grid
fig = make_subplots(rows=1, cols=len(x_vars), shared_yaxes=True,
                    subplot_titles=[f"{y_var} vs {x}" for x in x_vars])

# Add scatter plots for each x variable against y
for i, x_var in enumerate(x_vars):
    fig.add_trace(
        go.Scatter(x=df[x_var], y=df[y_var], mode='markers', marker=dict(color='blue')),
        row=1, col=i+1
    )

# Update layout to include titles and spacing
fig.update_layout(
    height=400, width=900,
    title_text="Pairplot of mpg vs. displacement, weight, and acceleration",
    showlegend=False
)

# Update axis labels
for i, x_var in enumerate(x_vars):
    fig.update_xaxes(title_text=x_var, row=1, col=i+1)
fig.update_yaxes(title_text=y_var, row=1, col=1)

# Show plot
fig.show()


In [None]:
# Check for null values
print("Null values in each column:\n", df.isnull().sum())

Null values in each column:
 car_name        0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
mpg             0
dtype: int64


In [None]:
# Handle missing values by dropping rows with any nulls (this copy feeds the exploration plots below)
data = df.dropna()

In [None]:
# Scatter plot matrix (pair plot equivalent)
fig = px.scatter_matrix(data)
fig.update_layout(title="Scatter Matrix of Features", width=800, height=800)
fig.show()

In [None]:
# Identify non-numeric columns
non_numeric_columns = data.select_dtypes(exclude="number").columns
print("Non-numeric columns:", non_numeric_columns)

# Drop non-numeric columns from data for correlation matrix
data_numeric = data.drop(columns=non_numeric_columns)

# Correlation matrix
corr_matrix = data_numeric.corr()

# Create heatmap with correlation values
fig = go.Figure(data=go.Heatmap(
        z=corr_matrix.values,
        x=corr_matrix.columns,
        y=corr_matrix.columns,
        colorscale='Peach',
        text=corr_matrix.values.round(2),  # Round correlation values to 2 decimals
        texttemplate="%{text}",  # Display values on the heatmap
        hovertemplate="Correlation: %{z:.2f}"  # Hover format
    ))

fig.update_layout(
    title="Correlation Heatmap",
    xaxis_title="Features",
    yaxis_title="Features"
)
fig.show()


Non-numeric columns: Index(['car_name'], dtype='object')


In [None]:
# Select features and target (first pass; X and y are rebuilt after encoding and scaling below)
features = ['cylinders', 'displacement', 'horsepower', 'weight', 'model_year']
target = 'mpg'
X = data[features]
y = data[target]

In [None]:
# Keep only the car name and weight columns
df2 = df[["car_name", "weight"]]
df2.head()

Unnamed: 0,car_name,weight
0,"chevrolet,chevelle,malibu",3504
1,"buick,skylark,320",3693
2,"plymouth,satellite",3436
3,"amc,rebel,sst",3433
4,"ford,torino",3449


In [None]:
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])

# Calculate Q1 (25th percentile) and Q3 (75th percentile) for each numeric column
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# Flag values more than 1.5 * IQR below Q1 or above Q3 as outliers
outliers_iqr = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR)))

# Check rows with outliers
outlier_rows = df[outliers_iqr.any(axis=1)]  # Get all rows where any column has an outlier

# Print the rows with outliers
print("Rows with outliers using IQR method:")
print(outlier_rows)


Rows with outliers using IQR method:
                         car_name  cylinders  displacement  horsepower  \
6                chevrolet,impala          8         454.0       220.0   
7               plymouth,fury,iii          8         440.0       215.0   
8                pontiac,catalina          8         455.0       225.0   
9              amc,ambassador,dpl          8         390.0       190.0   
11             plymouth,'cuda,340          8         340.0       160.0   
13        buick,estate,wagon,(sw)          8         455.0       225.0   
25                      ford,f250          8         360.0       215.0   
27                     dodge,d200          8         318.0       210.0   
59              volkswagen,type,3          4          97.0        54.0   
67                mercury,marquis          8         429.0       208.0   
94   chrysler,new,yorker,brougham          8         440.0       215.0   
95       buick,electra,225,custom          8         455.0       225.0   
1

In [None]:
# Drop rows with outliers
df_cleaned = df.drop(outlier_rows.index)

# Reset the index after dropping rows
df_cleaned.reset_index(drop=True, inplace=True)

# Print the cleaned DataFrame
print("DataFrame after dropping outliers:")
print(df_cleaned)



DataFrame after dropping outliers:
                      car_name  cylinders  displacement  horsepower  weight  \
0    chevrolet,chevelle,malibu          8         307.0       130.0    3504   
1            buick,skylark,320          8         350.0       165.0    3693   
2           plymouth,satellite          8         318.0       150.0    3436   
3                amc,rebel,sst          8         304.0       150.0    3433   
4                  ford,torino          8         302.0       140.0    3449   
..                         ...        ...           ...         ...     ...   
376           chevrolet,camaro          4         151.0        90.0    2950   
377            ford,mustang,gl          4         140.0        86.0    2790   
378              dodge,rampage          4         135.0        84.0    2295   
379                ford,ranger          4         120.0        79.0    2625   
380                 chevy,s-10          4         119.0        82.0    2720   

     acceleratio

In [None]:
from pandas import get_dummies as one_hot_encode

# One-hot encoding the 'origin' column
x_encoded = one_hot_encode(df_cleaned['origin'])

# Convert the encoded columns to integer type
x_encoded = x_encoded.astype("int")

# Rename the columns for clarity
x_encoded.columns = [f"origin_{origin}" for origin in x_encoded.columns]

# Merge the encoded columns back into the original DataFrame
df_merge = df_cleaned.merge(x_encoded, left_index=True, right_index=True)

# Optional: Drop the original 'origin' column if no longer needed
df_merge.drop('origin', axis=1, inplace=True)

# Print the updated DataFrame
print(df_merge.head())


                    car_name  cylinders  displacement  horsepower  weight  \
0  chevrolet,chevelle,malibu          8         307.0       130.0    3504   
1          buick,skylark,320          8         350.0       165.0    3693   
2         plymouth,satellite          8         318.0       150.0    3436   
3              amc,rebel,sst          8         304.0       150.0    3433   
4                ford,torino          8         302.0       140.0    3449   

   acceleration  model_year   mpg  origin_1  origin_2  origin_3  
0          12.0          70  18.0         1         0         0  
1          11.5          70  15.0         1         0         0  
2          11.0          70  18.0         1         0         0  
3          12.0          70  16.0         1         0         0  
4          10.5          70  17.0         1         0         0  


In [None]:

# Select only numerical features for scaling, excluding 'mpg' and 'car_name'
numerical_features = df_merge.select_dtypes(include=np.number).columns.difference(['mpg', 'car_name'])

# Create a copy of the dataframe to avoid modifying the original data
x_scaled = df_merge.copy()

# Log transform 'displacement', 'horsepower', and 'weight' columns
log_columns = ['displacement', 'horsepower', 'weight']
x_scaled[log_columns] = np.log1p(x_scaled[log_columns])

# Standard scaling for all numerical features (excluding 'mpg' and 'car_name')
scaler = StandardScaler()
x_scaled[numerical_features] = scaler.fit_transform(x_scaled[numerical_features])

# Prepare features (X) and target (y)
X = x_scaled[numerical_features]
y = df_merge['mpg']

In [None]:
# Split the data into an 80/20 train-test split with random state 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
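
The scaling cell above fits `StandardScaler` on every row before the split, so test-set statistics influence the training features. A minimal sketch of a leakage-free alternative is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. It uses synthetic data rather than the Auto MPG frame, and the `_demo` and `Xd_`/`yd_` names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first (80/20), then fit scaler + model on the training rows only
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)

print(Xd_train.shape, Xd_test.shape)
print(round(pipe.score(Xd_test, yd_test), 3))  # R-squared on held-out rows
```

With a pipeline, `pipe.predict` and `pipe.score` reapply the training-set scaling to new rows automatically.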


### Model Training

In [None]:

# Check for missing values and impute if necessary
if X_train.isnull().values.any() or X_test.isnull().values.any():
    imputer = SimpleImputer(strategy="mean")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)
    X_train = pd.DataFrame(X_train, columns=numerical_features)
    X_test = pd.DataFrame(X_test, columns=numerical_features)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Label coefficients with feature names and sort in descending order (by signed value)
coefficients = Series(model.coef_, index=X_train.columns)
coefficients_sorted = coefficients.sort_values(ascending=False)
print("Coefficients sorted by importance:\n", coefficients_sorted)

# Predict on the training and test data
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)


Coefficients sorted by importance:
 model_year      2.944908
cylinders       1.084229
origin_2        0.277039
origin_3        0.184763
origin_1       -0.369983
acceleration   -0.560528
displacement   -1.120405
horsepower     -1.743103
weight         -3.839397
dtype: float64


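Because every feature was standardized before fitting, the coefficient magnitudes are directly comparable. By absolute value, "weight" (-3.84) has the largest influence, followed by "model_year" (2.94), "horsepower" (-1.74), "displacement" (-1.12), and "cylinders" (1.08). Heavier and more powerful cars are predicted to have lower mpg, while newer model years are predicted to have higher mpg. Note that "weight", "horsepower", and "displacement" were log-transformed before scaling, so their coefficients apply to the scaled log values. The positive sign on "cylinders" probably reflects its strong correlation with displacement and weight rather than a real benefit of more cylinders, and the three "origin" coefficients are small by comparison.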
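
Ranking standardized coefficients by absolute value is one reasonable way to compare feature influence. A short sketch, reusing the coefficient values printed above (rounded to two decimals):

```python
import pandas as pd

# Coefficient values copied from the output above (rounded)
coefs = pd.Series({
    "model_year": 2.94, "cylinders": 1.08, "origin_2": 0.28,
    "origin_3": 0.18, "origin_1": -0.37, "acceleration": -0.56,
    "displacement": -1.12, "horsepower": -1.74, "weight": -3.84,
})

# Rank by magnitude while keeping the sign visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.head(3))
```

Sorting by signed value, as the training cell does, puts the strongest negative feature last even when it has the largest magnitude.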

### Model Evaluation

In [None]:
# Evaluate the model on the training set
r2_train = r2_score(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = np.sqrt(mse_train)

# Evaluate the model on the test set
r2_test = r2_score(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)

# Print model evaluation metrics
print("\nTest Set Evaluation Metrics:")
print(f"R-squared score: {r2_test}")
print(f"Mean Squared Error (MSE): {mse_test}")
print(f"Root Mean Squared Error (RMSE): {rmse_test}")



Test Set Evaluation Metrics:
R-squared score: 0.8624178298716563
Mean Squared Error (MSE): 7.586000545036907
Root Mean Squared Error (RMSE): 2.754269512055221


The model performed well on the test set:

- **R-squared score**: 0.86, meaning the model explains about 86% of the variance in test-set mpg, which suggests a strong fit.
- **Mean Squared Error (MSE)**: 7.59, the average squared difference between predicted and actual values, in squared mpg units.
- **Root Mean Squared Error (RMSE)**: 2.75, meaning predictions are typically off by about 2.75 mpg. That is roughly a third of the standard deviation of mpg in the full dataset (about 7.8).
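
The relationship between these metrics can be checked by hand. A small sketch using only NumPy, with made-up values rather than the notebook's predictions:

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)   # mean squared error
rmse = np.sqrt(mse)             # same units as the target

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

RMSE is just the square root of MSE, which is why it can be read in the same units as the target.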
