The "mpg" dataset (short for "miles per gallon") contains information about various car models and their characteristics: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and fuel efficiency in miles per gallon (mpg).

Here's a brief explanation of each column:

mpg: Miles per gallon, the fuel efficiency of the car and the prediction target.
cylinders: Number of cylinders in the engine.
displacement: Engine displacement, the total cylinder volume swept by all pistons of the engine.
horsepower: Engine power, measured in horsepower (hp).
weight: Weight of the car, measured in pounds.
acceleration: Time in seconds for the car to accelerate from 0 to 60 miles per hour (mph).
model_year: Model year of the car (two digits, e.g. 70 for 1970).
origin: Origin of the car, a categorical variable (usa, europe, japan in the data loaded below; some copies of the dataset code it as 1: USA, 2: Europe, 3: Japan).
name: Name of the car model.

This dataset is commonly used for regression tasks, where the goal is to predict the fuel efficiency (mpg) of a car from its other characteristics.

In [2]:
#1 mile is about 1.6 km and 1 US gallon is about 3.8 litres
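The conversion noted above can be written as a small helper. This is a minimal sketch that assumes US liquid gallons (the dataset describes US-market cars); the function names are made up for illustration and are not part of the notebook.

```python
# Exact conversion constants (international mile, US liquid gallon).
KM_PER_MILE = 1.609344
LITRES_PER_US_GALLON = 3.785411784


def mpg_to_km_per_litre(mpg: float) -> float:
    """Convert miles per US gallon to kilometres per litre."""
    return mpg * KM_PER_MILE / LITRES_PER_US_GALLON


def mpg_to_litres_per_100km(mpg: float) -> float:
    """Convert miles per US gallon to litres per 100 km (lower is better)."""
    return 100.0 / mpg_to_km_per_litre(mpg)


# 18 mpg is the first value in the dataset.
print(round(mpg_to_km_per_litre(18.0), 2))      # about 7.65 km per litre
print(round(mpg_to_litres_per_100km(18.0), 2))  # about 13.07 litres per 100 km
```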

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings('ignore')

In [10]:
df = sns.load_dataset('mpg')

In [12]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [14]:
df.drop("name", axis=1, inplace = True)

In [16]:
df

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,18.0,8,307.0,130.0,3504,12.0,70,usa
1,15.0,8,350.0,165.0,3693,11.5,70,usa
2,18.0,8,318.0,150.0,3436,11.0,70,usa
3,16.0,8,304.0,150.0,3433,12.0,70,usa
4,17.0,8,302.0,140.0,3449,10.5,70,usa
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa
394,44.0,4,97.0,52.0,2130,24.6,82,europe
395,32.0,4,135.0,84.0,2295,11.6,82,usa
396,28.0,4,120.0,79.0,2625,18.6,82,usa


In [18]:
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [20]:
df['horsepower'].median()

93.5

In [22]:
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

In [24]:
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
dtypes: float64(4), int64(3), object(1)
memory usage: 25.0+ KB


In [28]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
dtype: object

In [32]:
df['origin'].value_counts()

origin
usa       249
japan      79
europe     70
Name: count, dtype: int64

In [38]:
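#Convert the string column to integer codes
#Note: the codes impose an arbitrary order (usa < japan < europe) on a nominal variable,
#and a linear model will treat that order as meaningful

df['origin'] = df['origin'].map({"usa": 1, "japan": 2, "europe": 3})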
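Because origin is a nominal category, integer codes make a linear model assume that europe (3) is "three times" usa (1). One common alternative is one-hot encoding. The sketch below uses a toy frame with made-up rows rather than the notebook's df, and it is an alternative to the mapping above, not what the rest of this notebook uses.

```python
import pandas as pd

# Toy rows for illustration only (values are hypothetical).
toy = pd.DataFrame({
    "weight": [3504, 2372, 1835, 3693],
    "origin": ["usa", "japan", "europe", "usa"],
})

# One indicator column per category; drop_first removes one level
# (here "europe", the first alphabetically) to avoid a redundant column.
dummies = pd.get_dummies(toy["origin"], prefix="origin", drop_first=True, dtype=int)
encoded = pd.concat([toy.drop(columns="origin"), dummies], axis=1)

print(encoded)
```

With this encoding each origin gets its own coefficient, measured relative to the dropped level.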

In [40]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin            int64
dtype: object

In [42]:
#map() already produced int64 (see dtypes above), so this cast is a harmless no-op
df['origin'] = df['origin'].astype(int)

In [44]:
#separate X and y
X = df.drop('mpg', axis=1)
y = df['mpg']

In [46]:
X

,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,8,307.0,130.0,3504,12.0,70,1
1,8,350.0,165.0,3693,11.5,70,1
2,8,318.0,150.0,3436,11.0,70,1
3,8,304.0,150.0,3433,12.0,70,1
4,8,302.0,140.0,3449,10.5,70,1
...,...,...,...,...,...,...,...
393,4,140.0,86.0,2790,15.6,82,1
394,4,97.0,52.0,2130,24.6,82,3
395,4,135.0,84.0,2295,11.6,82,1
396,4,120.0,79.0,2625,18.6,82,1


In [48]:
y

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

In [50]:
#train test split
from sklearn.model_selection import train_test_split

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)

In [56]:
X_train.shape

(278, 7)

In [58]:
X_test.shape

(120, 7)

In [60]:
#multiple linear regression model
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()

In [62]:
regression_model

In [64]:
regression_model.fit(X_train, y_train)

In [66]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficients for {col_name} is {regression_model.coef_[i]}")

The coefficients for cylinders is -0.317614230279928
The coefficients for displacement is 0.026237482599078973
The coefficients for horsepower is -0.01827076491312446
The coefficients for weight is -0.007487750398361911
The coefficients for acceleration is 0.0504067346197149
The coefficients for model_year is 0.8470951427061373
The coefficients for origin is 1.519095838797505


In [68]:
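#Observations
#Most coefficients are small in magnitude, so a slight change in one independent
#variable makes little difference to the prediction
#Models with small coefficients are sometimes called smoother models
#Coefficient size also depends on each feature's units (e.g. weight is in pounds)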
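A coefficient's size depends on the units of its feature, so a tiny weight coefficient does not by itself mean weight matters little. The sketch below uses synthetic data (not the mpg dataset) and a hypothetical helper `ols_coefs` to show that expressing weight in thousands of pounds multiplies its coefficient by 1000 while leaving the fit unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features loosely shaped like the notebook's columns (assumed values).
weight_lb = rng.uniform(1600, 5000, size=n)
year = rng.integers(70, 83, size=n).astype(float)
mpg = 40.0 - 0.007 * weight_lb + 0.7 * (year - 70) + rng.normal(0, 1.0, size=n)


def ols_coefs(X, y):
    """Ordinary least squares with an intercept; returns the slopes only."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]


coef_lb = ols_coefs(np.column_stack([weight_lb, year]), mpg)
coef_klb = ols_coefs(np.column_stack([weight_lb / 1000.0, year]), mpg)

print("weight in lb:       ", coef_lb)
print("weight in 1000s lb: ", coef_klb)
```

Comparing coefficients across features is therefore more meaningful after the features have been put on a common scale.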

In [73]:
from sklearn.metrics import r2_score
y_pred_linear = regression_model.predict(X_test)
y_pred_linear

array([21.16196121, 27.89684387, 20.45045592, 27.12361164, 24.36117063,
       15.87763934, 29.93157794, 34.02155729, 17.08992155, 10.56782304,
       30.53231377, 16.48854992, 22.4061424 , 27.76978226, 36.0209892 ,
       23.79725872, 10.82747269, 20.27707855,  8.86935273, 32.48801009,
       25.36507567, 32.75235387, 20.95486868, 24.54530695, 25.77582154,
       30.20140405, 32.01102103, 31.96692512, 15.25929349, 30.41225966,
       27.50427715, 10.93370544, 21.42816438, 28.08300976, 25.03368839,
       13.67199264, 26.67769394,  9.04050101, 32.03270673, 23.97429191,
       24.18855895, 24.60440771, 21.16368861, 34.53665774, 26.31981331,
       22.23170907, 21.0865992 , 11.65432984, 27.9398198 , 18.98058597,
       23.69821181, 26.86564242, 17.04794305, 12.03955477, 28.70876897,
       24.26227131, 10.20293895, 13.03594704, 29.96910853, 35.35029687,
       37.01162788, 35.38558158, 18.04991116, 27.90304164, 20.67174751,
       33.83899858, 27.02537633, 26.73184442, 29.93216787, 12.33

In [75]:
r2_linear = r2_score(y_test, y_pred_linear)
print(f"The R square of linear regression {r2_linear}")

The R square of linear regression 0.8348001123742285
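For reference, R square is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; `r2_manual` is a hypothetical helper, and the predictions are made-up numbers, not output of the model above.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R square = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [18.0, 15.0, 18.0, 16.0, 17.0]  # first five mpg values of the dataset
y_pred = [17.5, 15.5, 18.5, 16.0, 16.5]  # hypothetical predictions

print(r2_manual(y_true, y_pred))
```

A model that always predicts the mean of y scores 0, and a perfect model scores 1.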


In [77]:
#Ridge regression adds an L2 penalty, lambda * sum(m**2), on the coefficients m to the loss
#In the scikit-learn implementation lambda is called alpha
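scikit-learn's Ridge minimises the squared error plus alpha times the sum of squared coefficients, leaving the intercept unpenalised. For small problems this has a closed-form solution. The sketch below is a minimal NumPy version on synthetic data, with a hypothetical helper `ridge_closed_form`; it is meant to show how the coefficients shrink as alpha grows, not to replace the library.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw - b||^2 + alpha * ||w||^2 (intercept b not penalised)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring removes the intercept
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

norms = []
for alpha in (0.0, 0.1, 10.0, 1000.0):
    coef, _ = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(f"alpha={alpha:<7} coef={np.round(coef, 4)}")
```

With alpha = 0 this reduces to ordinary least squares, and a value as small as 0.1 changes the coefficients only marginally.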

In [81]:
from sklearn.linear_model import Ridge
ridge_regression_model = Ridge(alpha = 0.1)
ridge_regression_model

In [83]:
ridge_regression_model.fit(X_train, y_train)

In [85]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficients for {col_name} is {ridge_regression_model.coef_[i]}")

The coefficients for cylinders is -0.3170032101006763
The coefficients for displacement is 0.0262132497579838
The coefficients for horsepower is -0.018263252481448618
The coefficients for weight is -0.007487326050213165
The coefficients for acceleration is 0.05036896947443228
The coefficients for model_year is 0.8470062938903182
The coefficients for origin is 1.5174528285653937


In [87]:
#Compared with the linear model above, every ridge coefficient is slightly smaller in magnitude: the ridge penalty shrinks them
#Ridge regression evaluation
y_pred_ridge = ridge_regression_model.predict(X_test)
y_pred_ridge

array([21.16378735, 27.8941097 , 20.45036614, 27.12217036, 24.36109009,
       15.87711185, 29.93054628, 34.02120736, 17.09026838, 10.56805319,
       30.53281923, 16.489648  , 22.40592464, 27.7681332 , 36.02074147,
       23.7957794 , 10.82621758, 20.27829877,  8.8680337 , 32.48787284,
       25.36501617, 32.75070173, 20.95612105, 24.54598666, 25.77621005,
       30.19962567, 32.01045453, 31.96692629, 15.26028283, 30.41239844,
       27.50524838, 10.93205307, 21.42671881, 28.0824173 , 25.03437165,
       13.67199703, 26.67941277,  9.03877322, 32.03328076, 23.97528883,
       24.1906245 , 24.60537896, 21.16380327, 34.53454371, 26.32131073,
       22.23199374, 21.08640039, 11.65445956, 27.9400072 , 18.98072282,
       23.69898037, 26.86615157, 17.04747419, 12.03951882, 28.7104206 ,
       24.26286485, 10.20166741, 13.03586308, 29.96934575, 35.34989798,
       37.00929434, 35.38515626, 18.05035079, 27.90434574, 20.67072077,
       33.83711964, 27.02688961, 26.73230117, 29.93275866, 12.33

In [89]:
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"The R square of ridge regression {r2_ridge}")

The R square of ridge regression 0.8348084889168356


In [91]:
#With alpha = 0.1 the ridge coefficients and R square barely differ from plain linear regression
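One possible reason for the near-identical results is that alpha = 0.1 is tiny relative to the scale of the unstandardised features. A common approach, sketched below on synthetic data, is to standardise the features and let cross-validation choose alpha. The alpha grid and the data here are assumptions for illustration; the notebook itself does not do this.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed values).
rng = np.random.default_rng(1)
n = 300
weight = rng.uniform(1600, 5000, n)
accel = rng.uniform(8, 25, n)
X = np.column_stack([weight, accel])
y = 45 - 0.007 * weight + 0.1 * accel + rng.normal(0, 2, n)

# Candidate penalty strengths on a log scale.
alphas = np.logspace(-3, 3, 13)

# Scaling inside the pipeline keeps the penalty comparable across features.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)

best_alpha = model.named_steps["ridgecv"].alpha_
print("selected alpha:", best_alpha)
print("R square on the training data:", model.score(X, y))
```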

In [97]:
#Lasso regression
from sklearn.linear_model import Lasso
lasso_regression_model = Lasso(alpha = 0.5)
lasso_regression_model.fit(X_train, y_train)

In [107]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficients for {col_name} is {lasso_regression_model.coef_[i]}")

The coefficients for cylinders is -0.0
The coefficients for displacement is 0.0062081988883003845
The coefficients for horsepower is -0.011058382987169572
The coefficients for weight is -0.00698267316802309
The coefficients for acceleration is 0.0
The coefficients for model_year is 0.744654952003819
The coefficients for origin is 0.0


In [101]:
#Three coefficients (cylinders, acceleration, origin) are exactly 0: Lasso can act as a feature selector
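Why Lasso produces exact zeros while Ridge does not can be seen in the simplified case of a single standardised feature. There the lasso update is a soft-thresholding step, which coordinate-descent solvers apply one coefficient at a time, whereas ridge only rescales. The sketch below uses hypothetical helper names and made-up input values.

```python
import numpy as np


def soft_threshold(z, t):
    """Lasso-style shrinkage: move z toward 0 by t, and clip to 0 if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def ridge_shrink(z, lam):
    """Ridge-style shrinkage: rescale z, which never reaches exactly 0."""
    return z / (1.0 + lam)


# Hypothetical unpenalised coefficients.
z = np.array([-3.0, -0.4, 0.0, 0.3, 2.0])

lasso_like = soft_threshold(z, 0.5)
ridge_like = ridge_shrink(z, 0.5)

print("soft-thresholded:", lasso_like)
print("ridge-rescaled:  ", ridge_like)
```

A negative value that is thresholded away prints as -0.0, which may be why a zeroed coefficient can show up as -0.0 in the output above; it is still zero.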

In [103]:
y_pred_lasso = lasso_regression_model.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"The R square of lasso regression {r2_lasso}")

The R square of lasso regression 0.8277934716635555


In [105]:
#Elastic Net regression: combines the L1 (lasso) and L2 (ridge) penalties; l1_ratio sets the mix
from sklearn.linear_model import ElasticNet
elastic_net_model = ElasticNet(alpha = 1, l1_ratio = 0.5)
elastic_net_model.fit(X_train, y_train)
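As documented for scikit-learn's ElasticNet, the penalty added to the loss is alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The sketch below evaluates that penalty for a made-up coefficient vector; `elastic_net_penalty` is a hypothetical helper, not a library function.

```python
import numpy as np


def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term of the elastic net objective as parameterised by scikit-learn."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))   # lasso part
    l2 = np.sum(coef ** 2)      # ridge part
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2


coef = np.array([0.0, 0.5, -1.0])  # hypothetical coefficients

print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5))  # mixed penalty
print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=1.0))  # pure L1, as in Lasso
print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.0))  # pure L2
```

Setting l1_ratio closer to 1 favours sparse solutions, while values closer to 0 behave more like ridge.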

In [109]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficients for {col_name} is {elastic_net_model.coef_[i]}")

The coefficients for cylinders is -0.0
The coefficients for displacement is 0.005888869953667572
The coefficients for horsepower is -0.0124038749335701
The coefficients for weight is -0.006934550516257633
The coefficients for acceleration is 0.0
The coefficients for model_year is 0.7133150744603874
The coefficients for origin is 0.0


In [111]:
# Predict on the test set
y_pred_elastic_net = elastic_net_model.predict(X_test)

# Calculate evaluation metrics
r2_elastic_net = r2_score(y_test, y_pred_elastic_net)


print(f"R-squared score for Elastic Net Regression: {r2_elastic_net}")

R-squared score for Elastic Net Regression: 0.8284840073256803
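The four test-set scores reported above can be put side by side. The values below are copied from the outputs in this notebook; the snippet only ranks them.

```python
# R square scores copied from the outputs above.
scores = {
    "linear": 0.8348001123742285,
    "ridge (alpha=0.1)": 0.8348084889168356,
    "lasso (alpha=0.5)": 0.8277934716635555,
    "elastic net (alpha=1, l1_ratio=0.5)": 0.8284840073256803,
}

# Sort from best to worst.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, r2 in ranked:
    print(f"{name:<38} {r2:.4f}")

# Gap between the best and the worst model.
spread = ranked[0][1] - ranked[-1][1]
print(f"spread: {spread:.4f}")
```

All four models land within about 0.007 of each other on this split, so Lasso and Elastic Net give up very little accuracy while using only four of the seven features.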
