# Exercise: Optimize a multiple linear regression model

In this exercise, we will optimize the performance of our *multiple linear regression model* by working on the dataset.

## Data preparation

Let's start this exercise by loading in and having a look at our data.

In [1]:
import pandas as pd
import matplotlib.pyplot as graph
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from numpy import reshape, array, percentile, subtract, where, concatenate, shape, quantile
import joblib

#Import the data from the .csv file
dataset = pd.read_csv('Data/auto-mpg-cleaned.csv') 


The *y values*, or *labels*, are represented by the `mpg` column. These are the values we are training our model to predict. There are 8 features for each label, which means that each *y value* is defined as a function of these 8 *x values*.
The features can be split into two types: *numerical* and *categorical*. Let's have a look at the *categorical variables* and see if we can turn them into *numerical variables*.

In [2]:
#Let's use pandas' concatenate function to create our categorical features subset
categorical_features = pd.concat([dataset.iloc[:,1],dataset.iloc[:,7:]],axis=1)

#How many different unique categories are there for each feature?
unique_cylinders = categorical_features['cylinders'].unique()
unique_origin = categorical_features['origin'].unique()
unique_car_name = categorical_features['car name'].unique()

print(f"# unique cylinders: {len(unique_cylinders)}")
print(f"# unique origin: {len(unique_origin)}")
print(f"# unique car names: {len(unique_car_name)}")

print(categorical_features)


# unique cylinders: 5
# unique origin: 3
# unique car names: 301
     cylinders  origin                      car name
0            4       3                  datsun pl510
1            4       2                      bmw 2002
2            4       2  volkswagen 1131 deluxe sedan
3            4       2                   peugeot 504
4            4       2                      saab 99e
..         ...     ...                           ...
387          4       1      chevrolet cavalier wagon
388          4       1    chrysler lebaron medallion
389          6       1         buick century limited
390          4       1          ford fairmont futura
391          6       1                ford granada l

[392 rows x 3 columns]


We can see that there are 301 unique car models out of a total of 392 entries in the dataset. With so few cars sharing a name, `car name` is not a very useful feature on its own. What if we extract the car manufacturer from this column and use that as a feature instead?

In [3]:
manufacturer = dataset['car name'].str.split(expand=True).iloc[:,0]
unique_manufacturer = manufacturer.unique()
print(f"# unique manufacturer: {len(unique_manufacturer)}")
print(manufacturer)

#Join manufacturers column to categorical variables and drop the car name column

categorical_features = pd.concat([categorical_features,manufacturer],axis=1)
categorical_features = categorical_features.drop('car name', axis = 1)
print(categorical_features)

#Rename the new column; an in-place rename returns None, which is why 'None' is printed
print(categorical_features.rename(columns={0:'manufacturer'},inplace=True))

# unique manufacturer: 37
0          datsun
1             bmw
2      volkswagen
3         peugeot
4            saab
          ...    
387     chevrolet
388      chrysler
389         buick
390          ford
391          ford
Name: 0, Length: 392, dtype: object
     cylinders  origin           0
0            4       3      datsun
1            4       2         bmw
2            4       2  volkswagen
3            4       2     peugeot
4            4       2        saab
..         ...     ...         ...
387          4       1   chevrolet
388          4       1    chrysler
389          6       1       buick
390          4       1        ford
391          6       1        ford

[392 rows x 3 columns]
None


In [4]:
print(categorical_features.columns)

Index(['cylinders', 'origin', 'manufacturer'], dtype='object')


We can use these 37 `manufacturer` values as features instead! It's not unreasonable to assume that certain manufacturers may place a stronger emphasis on mpg performance than others.
Now, all we need is to convert these categorical variables to numerical variables using *one hot encoding*.
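
As a rough illustration of what one hot encoding produces, here is a minimal sketch on a toy frame. The values are hypothetical and not taken from the exercise dataset; `dtype=int` is passed only so the output is 0/1 regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame with hypothetical values; not taken from the exercise dataset.
toy = pd.DataFrame({
    "cylinders": [4, 6, 4],
    "manufacturer": ["ford", "bmw", "ford"],
})

# Casting to 'category' makes get_dummies encode the integer column too.
toy["cylinders"] = toy["cylinders"].astype("category")
toy["manufacturer"] = toy["manufacturer"].astype("category")

# Each category becomes its own column holding 1 where the row has that value.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Every row ends up with exactly one `1` per original feature, which is where the name "one hot" comes from.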

In [5]:

#One hot encoding
categorical_features['cylinders'] = categorical_features['cylinders'].astype('category')
categorical_features['origin'] = categorical_features['origin'].astype('category')
categorical_features['manufacturer'] = categorical_features['manufacturer'].astype('category')

enc_categorical_features = pd.get_dummies(categorical_features)

#Join to main dataset

dataset = dataset.drop(['cylinders','origin','car name'], axis = 1)
dataset = pd.concat([dataset,enc_categorical_features], axis=1)
print(dataset)

      mpg  displacement  horsepower  weight  acceleration  model year  \
0    27.0          97.0          88    2130          14.5          70   
1    26.0         121.0         113    2234          12.5          70   
2    26.0          97.0          46    1835          20.5          70   
3    25.0         110.0          87    2672          17.5          70   
4    25.0         104.0          95    2375          17.5          70   
..    ...           ...         ...     ...           ...         ...   
387  27.0         112.0          88    2640          18.6          82   
388  26.0         156.0          92    2585          14.5          82   
389  25.0         181.0         110    2945          16.4          82   
390  24.0         140.0          92    2865          16.4          82   
391  22.0         232.0         112    2835          14.7          82   

     cylinders_3  cylinders_4  cylinders_5  cylinders_6  ...  \
0              0            1            0            0  ..

## Model performance

Ok, let's test model performance!

In [6]:
X = array(dataset.drop(['mpg'], axis=1))
y = array(dataset['mpg'])

#random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)

#reshape arrays for 'LinearRegression'

y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

model = LinearRegression().fit(X_train,y_train)

y_pred = model.predict(X_test)

r2 = r2_score(y_test,y_pred)
print(f"R2 score: {r2}")

# Save the model as a pickle file
filename_1 = './cars_model_1.pkl'
joblib.dump(model, filename_1)

R2 score: 0.7525953593910831


['./cars_model_1.pkl']

Here, we gauge the performance of our model using the *R2* metric (the coefficient of determination). *R2* measures the proportion of the variance in the labels that the model's predictions explain. The closer the *R2* value is to 1, the better our model explains the data points; a value of 0 corresponds to always predicting the mean label.
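
To make the metric concrete, the sketch below computes *R2* by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r2` and the label values are hypothetical, chosen only for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after fitting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [20.0, 25.0, 30.0, 35.0]             # hypothetical mpg labels
perfect = r2(y_true, y_true)                  # predictions match the labels
baseline = r2(y_true, [27.5] * 4)             # always predict the mean label
reversed_fit = r2(y_true, [35.0, 30.0, 25.0, 20.0])  # worse than the mean
print(perfect, baseline, reversed_fit)
```

A model that does worse than predicting the mean gets a negative score, so *R2* is not bounded below by 0.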

## Further optimization

Let's see if we can optimize this model further. Let's start by having a look at the feature names again.

In [7]:
print(dataset.columns.values)

['mpg' 'displacement' 'horsepower' 'weight' 'acceleration' 'model year'
 'cylinders_3' 'cylinders_4' 'cylinders_5' 'cylinders_6' 'cylinders_8'
 'origin_1' 'origin_2' 'origin_3' 'manufacturer_amc' 'manufacturer_audi'
 'manufacturer_bmw' 'manufacturer_buick' 'manufacturer_cadillac'
 'manufacturer_capri' 'manufacturer_chevroelt' 'manufacturer_chevrolet'
 'manufacturer_chevy' 'manufacturer_chrysler' 'manufacturer_datsun'
 'manufacturer_dodge' 'manufacturer_fiat' 'manufacturer_ford'
 'manufacturer_hi' 'manufacturer_honda' 'manufacturer_maxda'
 'manufacturer_mazda' 'manufacturer_mercedes' 'manufacturer_mercedes-benz'
 'manufacturer_mercury' 'manufacturer_nissan' 'manufacturer_oldsmobile'
 'manufacturer_opel' 'manufacturer_peugeot' 'manufacturer_plymouth'
 'manufacturer_pontiac' 'manufacturer_renault' 'manufacturer_saab'
 'manufacturer_subaru' 'manufacturer_toyota' 'manufacturer_toyouta'
 'manufacturer_triumph' 'manufacturer_vokswagen' 'manufacturer_volkswagen'
 'manufacturer_volvo' 'manufact

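It seems that there are some mistakes in the names of the car manufacturers, such as `chevroelt`, `maxda`, `toyouta` and `vokswagen`, along with abbreviations like `chevy`. Each variant became its own one hot encoded column, leaving us with extra, unneeded features. Let's fix that by combining them.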

In [8]:
#sum the one hot columns of each variant spelling into the correct manufacturer's column
dataset["manufacturer_volkswagen"] = dataset["manufacturer_volkswagen"] + dataset["manufacturer_vokswagen"] + dataset['manufacturer_vw']
dataset["manufacturer_toyota"] = dataset["manufacturer_toyota"] + dataset["manufacturer_toyouta"]
dataset["manufacturer_chevrolet"] = dataset["manufacturer_chevrolet"] + dataset["manufacturer_chevroelt"] + dataset["manufacturer_chevy"]
dataset["manufacturer_mazda"] = dataset["manufacturer_mazda"] + dataset["manufacturer_maxda"]
dataset["manufacturer_mercedes-benz"] = dataset["manufacturer_mercedes-benz"] + dataset["manufacturer_mercedes"]

#drop the unneeded columns
dataset = dataset.drop(['manufacturer_vokswagen','manufacturer_vw','manufacturer_toyouta','manufacturer_chevroelt','manufacturer_chevy','manufacturer_maxda','manufacturer_mercedes'], axis = 1)
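
An alternative to summing columns after encoding is to normalize the spellings before encoding. The sketch below is one possible way to do that, not the approach used in this exercise; the sample series and the `canonical` mapping are hypothetical and not exhaustive.

```python
import pandas as pd

# Hypothetical sample of raw manufacturer strings, including variant spellings.
manufacturer = pd.Series(["vw", "vokswagen", "toyouta", "chevy", "maxda", "ford"])

# Mapping from variant spelling to canonical name (illustrative only).
canonical = {
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "toyouta": "toyota",
    "chevroelt": "chevrolet",
    "chevy": "chevrolet",
    "maxda": "mazda",
    "mercedes": "mercedes-benz",
}

# replace() swaps whole values that match a key and leaves the rest untouched.
cleaned = manufacturer.replace(canonical)

# Encoding the cleaned names yields one column per real manufacturer.
encoded = pd.get_dummies(cleaned, prefix="manufacturer", dtype=int)
print(encoded.columns.tolist())
```

Cleaning first means the duplicate columns are never created, so there is nothing to sum or drop afterwards.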


In [9]:
X = array(dataset.drop(['mpg'], axis=1))
y = array(dataset['mpg'])

#random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)

#reshape arrays for 'LinearRegression'

y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

model = LinearRegression().fit(X_train,y_train)

y_pred = model.predict(X_test)

r2 = r2_score(y_test,y_pred)
print(f"R2 score: {r2}")

# Save the model as a pickle file
filename_2 = './cars_model_2.pkl'
joblib.dump(model, filename_2)

R2 score: 0.76854834172345


['./cars_model_2.pkl']

The *R2* score obtained by this model is closer to 1 than that of the previous model. Simply by looking at the dataset, we were able to spot small, easy-to-fix mistakes that were negatively impacting our model's performance. Let's see if we can use some statistical techniques to improve performance even more!

## Outlier removal

We tend to assume that most of the values in our data fall within a certain range, but sometimes, perhaps due to a measurement error, some values are completely different from the rest. We call these values *outliers*, and it is important to deal with them to reduce their impact on our predictions. There are many ways to treat *outliers*, but this time we will simply remove any rows containing them from the dataset.

In [10]:
#We don't include the one hot encoded columns, just columns 1 to 4: displacement, horsepower, weight and acceleration

ol_dataset = dataset.iloc[:,1:5]

Q1 = ol_dataset.quantile(0.25)
Q3 = ol_dataset.quantile(0.75)
IQR = Q3 - Q1

#Sensitivity is 1.5x: flag values more than 1.5 * IQR below Q1 or above Q3
Q1_bound = Q1 - 1.5 * IQR
Q3_bound = Q3 + 1.5 * IQR

dataset = dataset[~((ol_dataset < Q1_bound) |(ol_dataset > Q3_bound)).any(axis=1)]
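
The interquartile range (IQR) rule used above can be followed step by step on a single column. This is a minimal sketch with hypothetical values, where one entry is deliberately far from the rest.

```python
import pandas as pd

# Hypothetical horsepower-like values; 400 is far from the rest.
values = pd.Series([80, 90, 95, 100, 105, 110, 120, 400])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
kept = values[~is_outlier]
print(f"bounds: [{lower}, {upper}], removed: {values[is_outlier].tolist()}")
```

On a DataFrame the same comparison runs column by column, and `.any(axis=1)` then marks a row for removal if any of its columns is out of bounds.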



In [11]:
X = array(dataset.drop(['mpg'], axis=1))
y = array(dataset['mpg'])

#random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)

#reshape arrays for 'LinearRegression'

y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

model = LinearRegression().fit(X_train,y_train)

y_pred = model.predict(X_test)

r2 = r2_score(y_test,y_pred)
print(f"R2 score: {r2}")

# Save the model as a pickle file
filename_3 = './cars_model_3.pkl'
joblib.dump(model, filename_3)

R2 score: 0.779201794911517


['./cars_model_3.pkl']

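Removing the *outliers* from our dataset has further improved our *R2* score. By removing the rows that fall outside the defined range for any of the four checked features, we make sure that extreme values no longer skew the fitted model. Note that the train/test split is redrawn on the smaller dataset, so this score is not measured on exactly the same test rows as the previous ones.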
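
Each model in this exercise was written to disk with `joblib.dump`. As a minimal sketch of the reverse step, the example below saves a toy model and reads it back with `joblib.load`. The toy data, the temporary directory and the file name `toy_model.pkl` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1; stands in for the car features and mpg labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model predicts without being refitted.
X_new = np.array([[4.0]])
print(restored.predict(X_new))
```

A reloaded model expects features in the same column order it was trained on, so the same preprocessing has to be applied to any new data first.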


## Summary
We covered the following concepts in this exercise:

- Improving a model's performance by working on the data: we used one hot encoding, manually cleaned up the data, and removed outliers
- The R2 score, and how it can be used to gauge a model's performance
