# Borrowing from W3 EDA session - Linear Regression on Auto-MPG data
## Skip ahead to the Regression Analysis section, but first run the cells below that are needed before machine learning model development.
## These include the cells that import libraries, read the data, remove inconsistencies, etc.
https://www.kaggle.com/code/devanshbesain/exploration-and-analysis-auto-mpg

First, follow all the data preprocessing and EDA steps below:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

We have imported the packages and libraries used for the initial exploration of the data. This part of the notebook focuses on exploration and visualization with the pandas and seaborn packages.

Let us load the data to explore for hidden treasures

In [2]:
data = pd.read_csv('Week 4/datasets/auto-mpg.csv',index_col='car name')


Let's have a look at data

In [3]:
print(data.head())
print(data.index)
print(data.columns)

                            mpg  cylinders  displacement horsepower  weight  \
car name                                                                      
chevrolet chevelle malibu  18.0          8         307.0        130    3504   
buick skylark 320          15.0          8         350.0        165    3693   
plymouth satellite         18.0          8         318.0        150    3436   
amc rebel sst              16.0          8         304.0        150    3433   
ford torino                17.0          8         302.0        140    3449   

                           acceleration  model year  origin  
car name                                                     
chevrolet chevelle malibu          12.0          70       1  
buick skylark 320                  11.5          70       1  
plymouth satellite                 11.0          70       1  
amc rebel sst                      12.0          70       1  
ford torino                        10.5          70       1  
Index(['chev

In [4]:
data.shape

(398, 8)

In [7]:
data.isnull().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
model year      False
origin          False
dtype: bool

No column reports null values, so at first glance nothing seems to be missing.

In [6]:
#data.dtypes
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 398 entries, chevrolet chevelle malibu to chevy s-10
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
dtypes: float64(3), int64(4), object(1)
memory usage: 28.0+ KB


Interestingly, horsepower is stored as an object and not a float, even though the values we saw above were clearly numbers. Before converting the column with astype(), let's look at the unique elements of horsepower to check for discrepancies.

In [8]:
data.horsepower.unique()

array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
       '129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
       '102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
       '103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
       '64', '74', '116', '82'], dtype=object)

When we print all the unique values in horsepower, we find a '?' that was used as a placeholder for missing values. Let's remove these entries.
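As an aside, a placeholder like this can also be declared when the file is read, so that pandas treats it as missing from the start. The sketch below assumes a tiny inline CSV with made-up rows (not the real dataset) and uses the `na_values` parameter of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a layout similar to auto-mpg.csv
csv_text = """mpg,horsepower,car name
18.0,130,car a
25.0,?,car b
21.0,?,car c
"""

# Default parsing: the '?' keeps the whole column non-numeric
raw = pd.read_csv(io.StringIO(csv_text), index_col="car name")
print(raw["horsepower"].dtype)

# Declaring '?' as a missing-value marker makes the column float64
# and exposes the gaps to isnull()
clean = pd.read_csv(io.StringIO(csv_text), index_col="car name", na_values="?")
print(clean["horsepower"].dtype)
print(clean.isnull().sum())
```

With this approach `isnull()` reports the missing horsepower entries directly, and `dropna()` could then be used instead of a string comparison.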

In [9]:
data = data[data.horsepower != '?']

In [10]:
# `in` on a Series tests the index labels, so compare the values instead
print((data.horsepower == '?').any())

False
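A word of caution on this check: Python's `in` operator applied to a pandas Series looks at the index labels, not at the values. The following minimal sketch, using a made-up three-element series, shows the difference and two ways to test the values themselves.

```python
import pandas as pd

# Hypothetical mini-series; the index plays the role of 'car name'
hp = pd.Series(["130", "?", "95"], index=["car a", "car b", "car c"])

print("?" in hp)               # False: '?' is not an index label
print("car b" in hp)           # True: membership looks at the index

# Value-based checks
print((hp == "?").any())       # True: at least one value equals '?'
print(hp.isin(["?"]).any())    # True: same check via isin()
```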


In [11]:
data.shape

(392, 8)

In [12]:
data.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
origin            int64
dtype: object

So all entries with '?' as a placeholder have been removed. However, the horsepower column is still of object type and not float. That is because pandas read the entire column as object when the data set was imported, due to the '?' entries, so let's convert that column to float.

In [15]:
data.horsepower = data.horsepower.astype('float')
data.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model year        int64
origin            int64
dtype: object
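An alternative to filtering out the '?' rows and then calling astype() is `pd.to_numeric` with `errors="coerce"`, which converts unparseable entries to NaN in one step. This is only a sketch on a made-up mini-frame, not the route taken in this notebook.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the raw horsepower column
df = pd.DataFrame(
    {"mpg": [18.0, 25.0, 21.0], "horsepower": ["130", "?", "95"]},
    index=["car a", "car b", "car c"],
)

# errors="coerce" turns anything unparseable (here '?') into NaN
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Drop the rows whose horsepower could not be parsed
df = df.dropna(subset=["horsepower"])

print(df.dtypes)
print(df.shape)
```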

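Now everything looks in order, so let's continue and describe the dataset. Note that the output shown below omits horsepower: judging by the cell numbers, this cell was executed before the conversion above, when horsepower was still a non-numeric column.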

In [14]:
data.describe()

             mpg  cylinders  displacement       weight  acceleration  model year    origin
count      392.0      392.0         392.0        392.0         392.0       392.0     392.0
mean   23.445918   5.471939     194.41199  2977.584184     15.541327   75.979592  1.576531
std     7.805007   1.705783    104.644004    849.40256      2.758864    3.683737  0.805518
min          9.0        3.0          68.0       1613.0           8.0        70.0       1.0
25%         17.0        4.0         105.0      2225.25        13.775        73.0       1.0
50%        22.75        4.0         151.0       2803.5          15.5        76.0       1.0
75%         29.0        8.0        275.75      3614.75        17.025        79.0       2.0
max         46.6        8.0         455.0       5140.0          24.8        82.0       3.0


- The first quartile, 17 MPG, is the value for which 25% of the entire MPG observations are smaller and 75% are larger.
- Q2, 22.75 MPG, is the same as the median (50% of MPG observations are smaller than Q2, 50% are larger)
- Only 25% of the observations are greater than the third quartile, 29 MPG.
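The quartile reading above can be checked on a small example. The sketch below uses a made-up series of the numbers 1 to 100 (not the MPG column) so that the fractions come out exactly; on real data they are only approximately 25% and 50%.

```python
import pandas as pd

# Hypothetical values 1..100, chosen so the shares are easy to verify
values = pd.Series(range(1, 101), dtype=float)

q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
print(q1, q2, q3)

# Q2 coincides with the median
print(q2 == values.median())

# Share of observations below Q1 and above Q3
print((values < q1).mean())
print((values > q3).mean())
```

On the notebook's own data, the same calls would be applied to `data['mpg']`.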

In [13]:
data.head()

car name                    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin
chevrolet chevelle malibu  18.0          8         307.0         130    3504          12.0          70       1
buick skylark 320          15.0          8         350.0         165    3693          11.5          70       1
plymouth satellite         18.0          8         318.0         150    3436          11.0          70       1
amc rebel sst              16.0          8         304.0         150    3433          12.0          70       1
ford torino                17.0          8         302.0         140    3449          10.5          70       1


## Regression Analysis

Let us use linear regression to predict the value of MPG from a set of features that are correlated with MPG.

In [16]:
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

In [17]:
factors = ['cylinders','displacement','horsepower','acceleration','weight','origin','model year']
X = pd.DataFrame(data[factors].copy())
y = data['mpg'].copy()
y

car name
chevrolet chevelle malibu    18.0
buick skylark 320            15.0
plymouth satellite           18.0
amc rebel sst                16.0
ford torino                  17.0
                             ... 
ford mustang gl              27.0
vw pickup                    44.0
dodge rampage                32.0
ford ranger                  28.0
chevy s-10                   31.0
Name: mpg, Length: 392, dtype: float64

In [18]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.2,random_state=324)
#X_train.shape[0] == y_train.shape[0]
# Always split the data into train and test subsets first, particularly before any preprocessing steps.
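To illustrate why the split comes first, here is a minimal sketch in which a scaler is fitted on the training rows only and then reused on the test rows. The data are random stand-ins, the names `X_demo` and `y_demo` are hypothetical, and scaling is not actually applied in this notebook's linear regression; `StandardScaler` is used purely as an example of a preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and target, standing in for X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 2))
y_demo = rng.normal(size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=324
)

# Fit the scaler on the training rows only, so no test-set statistics leak in
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr.shape, X_te.shape)
print(X_tr_scaled.mean(axis=0).round(6))
```

Fitting the scaler on the full data before splitting would let the test rows influence the mean and standard deviation used in training, which is the leakage the comment above warns about.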

In [19]:
reg_model = LinearRegression()
# Selecting linear regression

In [20]:
reg_model.fit(X_train,y_train)
# Training
# Fitting the model to the training data is essentially the training part of the modeling process.
# It finds the coefficients (beta weights) for the equation specified by the algorithm being used.

In [21]:
y_predicted = reg_model.predict(X_test)
# The learned beta weights (coefficients) are then used to compute the predictions for the unseen input data X_test.
# Note that the most important part of this process is finding the coefficients that fit your training data.
# y = b0 + b1*x1 + b2*x2 + ... + bn*xn: fit() fits the model to the training data and finds the coefficients b0, ..., bn; suppose, for illustration, that they are [1, 2, 3, ..., 11].
# Once new, unseen test inputs [x1, ..., xn] are provided, it is easy to calculate y = 1 + 2*x1 + 3*x2 + ... + 11*xn using the learned coefficients (weights).
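The relationship between fit(), the learned coefficients, and predict() can be sketched on synthetic data. The example below assumes noise-free data generated from y = 1 + 2*x1 + 3*x2 (a made-up equation, not the Auto-MPG model) and checks that predict() matches the equation evaluated by hand with `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# fit() recovers b0 (intercept_) and b1, b2 (coef_)
model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_, model.coef_)

# predict() is the learned equation applied to new inputs
X_new = np.array([[0.5, -1.0], [2.0, 0.0]])
manual = model.intercept_ + X_new @ model.coef_
print(np.allclose(manual, model.predict(X_new)))
```

In the notebook itself, the same attributes are available as `reg_model.intercept_` and `reg_model.coef_` after the fit() call.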

In [22]:
# Evaluation metric: mean absolute error (MAE); closer to zero means better accuracy
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_predicted)

2.5694726527406306
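An MAE of about 2.57 means the predictions are off by roughly 2.57 MPG on average. To make the metric concrete, the sketch below computes MAE by hand with NumPy on made-up values, along with RMSE, which the imports of `mean_squared_error` and `sqrt` above suggest as a companion metric even though it is not computed in this section.

```python
import numpy as np

# Hypothetical true and predicted MPG values
y_true = np.array([18.0, 25.0, 30.0, 22.0])
y_pred = np.array([20.0, 24.0, 27.0, 22.0])

errors = y_true - y_pred              # [-2, 1, 3, 0]

# MAE: average of the absolute errors, (2 + 1 + 3 + 0) / 4
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error, sqrt((4 + 1 + 9 + 0) / 4)
rmse = np.sqrt(np.mean(errors ** 2))

print(mae)
print(rmse)
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does.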