** Here we will be exploring the Automobile dataset from the UCI Machine Learning repository. A copy of the data is provided in the data folder. The data and its descriptions were obtained from: https://archive.ics.uci.edu/ml/datasets/automobile **

In this example, I will be attempting to predict the make of these cars.

This example will take you through:

1. Data loading and preprocessing (that includes data cleaning and selection)

2. Handling of categorical data

3. Splitting dataset into training and test datasets

4. Basic Machine Learning modeling

For a more in-depth beginner ML example (using the famous Titanic dataset) please see https://www.kaggle.com/sramml/simple-tutorial-for-beginners


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
%matplotlib inline

pd.options.display.max_columns = 99

** 1. Data loading and preprocessing **

In [3]:
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
autos = pd.read_csv('imports-85.data',names=cols,na_values='?')

In [4]:
autos.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


Most columns are self-explanatory. However, there are two that you won't know unless you are in the car insurance business:
- Symboling: 
      corresponds to the degree to which the auto is more risky than its price indicates.
      Cars are initially assigned a risk factor symbol associated with its
      price.   Then, if it is more risky (or less), this symbol is
      adjusted by moving it up (or down) the scale.  Actuarians call this
      process "symboling".  A value of +3 indicates that the auto is
      risky, -3 that it is probably pretty safe.
- Normalized loss:
      is the relative average loss payment per insured
      vehicle year.  This value is normalized for all autos within a
      particular size classification (two-door small, station wagons,
      sports/speciality, etc...), and represents the average loss per car
      per year.

In [5]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    164 non-null float64
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         203 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 201 non-null float64
stroke               201 non-null float64
compression-rate     205 non-null float64
horsepower           203 non-

In [6]:
autos.isnull().sum()

symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

There are a few NaNs, mainly in numeric columns. Instead of discarding these rows, we shall replace the NaNs with the mean value of each column. There are 4 cars without a price. We can either replace the price with the mean value or remove these 4 cars. I prefer to remove them, as using the mean would skew the results if we wanted to predict the price.

In [7]:
autos = autos.dropna(subset=['price'])

In [8]:
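# numeric_only=True restricts the means to numeric columns (recent pandas versions need it when string columns are present)
autos = autos.fillna(autos.mean(numeric_only=True))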

In [9]:
# Confirm that there are no more missing values!
autos.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         2
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
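
The two cleaning steps above can be sketched on a small made-up frame (the values are invented for illustration, not taken from the dataset). Passing `numeric_only=True` to `mean()` is an assumption about the pandas version in use: recent versions need it when the frame also holds string columns such as the make.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for the autos data
toy = pd.DataFrame({
    'make': ['audi', 'bmw', 'audi', 'mazda'],
    'horsepower': [100.0, np.nan, 120.0, 80.0],
    'price': [14000.0, 16000.0, np.nan, 9000.0],
})

# 1. Drop the rows where the price is missing
toy = toy.dropna(subset=['price'])

# 2. Fill the remaining numeric NaNs with the column means,
#    computed on the rows that survived step 1
toy = toy.fillna(toy.mean(numeric_only=True))

print(toy)
```

Note that the order matters: the means are computed after the rows without a price have been removed.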

#There seem to be two rows in num-of-doors with NaN. Looking at the column, it describes the number of doors as a string, i.e. two, four ...

In [10]:
print('Unique values in "num-of-doors" \n{}'.format(autos['num-of-doors'].unique()))
autos[autos['num-of-doors'].isnull()]

Unique values in "num-of-doors" 
['two' 'four' nan]


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
27,1,148.0,dodge,gas,turbo,,sedan,fwd,front,93.7,157.3,63.8,50.6,2191,ohc,four,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,8558.0
63,0,122.0,mazda,diesel,std,,sedan,fwd,front,98.8,177.8,66.5,55.5,2443,ohc,four,122,idi,3.39,3.39,22.7,64.0,4650.0,36,42,10795.0


Hmmm ... Since it is a single column, I'll drop it, as there may be only a weak correlation between the number of doors and the make. I may be wrong, but it's better than removing the rows for these cars.

In [11]:
autos = autos.drop('num-of-doors',axis=1)
autos.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,122.0,alfa-romero,gas,std,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,122.0,alfa-romero,gas,std,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0
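
An alternative to dropping the column, which this notebook does not use, would be to fill the missing entries with the most common value. A minimal sketch on a made-up series (the values are for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up door counts with two missing entries
doors = pd.Series(['two', 'four', np.nan, 'four', np.nan, 'four'],
                  name='num-of-doors')

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = doors.mode()[0]

# Replace the missing entries with that value
filled = doors.fillna(most_common)
print(filled.value_counts())
```

Whether this is better than dropping the column depends on how informative the number of doors turns out to be for the make.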


** Data exploration **

In [12]:
# number of unique car makes
len(autos['make'].unique())

22

In [13]:
autos['make'].value_counts()

toyota           32
nissan           18
mazda            17
honda            13
mitsubishi       13
volkswagen       12
subaru           12
peugot           11
volvo            11
dodge             9
mercedes-benz     8
bmw               8
plymouth          7
saab              6
audi              6
porsche           4
alfa-romero       3
chevrolet         3
jaguar            3
isuzu             2
renault           2
mercury           1
Name: make, dtype: int64

** Hmmm, the dataset is very unbalanced. It is unlikely we will get any reliable results from this. Nonetheless, let's try! **

** Now that we have the final dataset, it is time to prepare it for machine learning. You may have noticed that there are a lot of features. For this example I'll be selecting a few. I am going to assume that the make is related to the price, length, width, height and engine-size, and I'll throw in the body-style. **

In [16]:
numerical_cols = ['length','width','height','engine-size','price']
dummy = pd.get_dummies(autos['body-style'])
auto_features = pd.concat([autos[numerical_cols],dummy],axis=1)
auto_features.head()

Unnamed: 0,length,width,height,engine-size,price,convertible,hardtop,hatchback,sedan,wagon
0,168.8,64.1,48.8,130,13495.0,1,0,0,0,0
1,168.8,64.1,48.8,130,16500.0,1,0,0,0,0
2,171.2,65.5,52.4,152,16500.0,0,0,1,0,0
3,176.6,66.2,54.3,109,13950.0,0,0,0,1,0
4,176.6,66.4,54.3,136,17450.0,0,0,0,1,0
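
To make the one-hot encoding step easier to follow, here is a small sketch on a made-up frame. The `dtype=int` argument is an assumption added here so that the dummy columns come out as 0/1 integers, as in the table above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column
toy = pd.DataFrame({
    'length': [168.8, 171.2, 176.6],
    'body-style': ['convertible', 'hatchback', 'sedan'],
})

# One column per category, with a 1 marking the category of each row
dummy = pd.get_dummies(toy['body-style'], dtype=int)

# Put the numeric column and the dummy columns side by side
features = pd.concat([toy[['length']], dummy], axis=1)
print(features)
```

Each row has exactly one 1 among the dummy columns, because every car has exactly one body style.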


**Now we split the dataset into a training set and a test set. The training set is what we fit the model on, and the test set is used to see how well the fitted model does on data it has not seen. **

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X = auto_features
y = autos['make']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

In [20]:
print('Shape of the Training set \n X = {}, y = {}'.format(X_train.shape,y_train.shape))
print('Shape of the test set \n X = {}, y = {}'.format(X_test.shape,y_test.shape))

Shape of the Training set 
 X = (140, 10), y = (140,)
Shape of the test set 
 X = (61, 10), y = (61,)
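
A plain random split can leave some makes entirely out of the training or test set. One possible variation, not used in this notebook, is a stratified split with a fixed seed. The sketch below uses made-up labels and assumes that makes with a single car are dropped first, since stratifying needs at least two samples per class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: two common makes and one make with a single car
y = pd.Series(['toyota'] * 10 + ['nissan'] * 10 + ['mercury'])
X = pd.DataFrame({'feature': np.arange(len(y))})

# Keep only the makes that occur at least twice
counts = y.value_counts()
keep = y.isin(counts[counts >= 2].index)
X_kept, y_kept = X[keep], y[keep]

# stratify keeps the class proportions equal in both parts;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_kept, y_kept, test_size=0.3, stratify=y_kept, random_state=0)

print(y_train.value_counts())
print(y_test.value_counts())
```

The price of this approach is that the rarest makes can no longer be predicted at all, so it is a trade-off rather than a fix.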


** We are trying to classify the make of the car. For this I'll use only the Logistic Regression Model. See the Automobile_Data_Set_price_estimation_example and the linked example at the top of the notebook for other ML models **

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
# Instantiate the model (despite the variable name, logistic regression is a classifier)
regressor = LogisticRegression()
# fit the model
regressor.fit(X_train, y_train)
# predict the make
y_pred = regressor.predict(X_test)
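
The selected features live on very different scales (prices in the thousands next to 0/1 dummy columns), which can make it harder for logistic regression to converge. A common remedy, shown here only as a sketch on synthetic data and not as what this notebook does, is to standardise the features in a pipeline. The synthetic data, the scaling of the first column and `max_iter=1000` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the car data
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
X[:, 0] *= 10000  # mimic a price-like column on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
score = clf.score(X_test, y_test)
print('Accuracy = {}'.format(score))
```

Scaling will not cure the class imbalance, but it removes one possible source of a poor fit.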

In [26]:
# find the accuracy
accuracy = (y_pred == y_test).sum() / len(y_test)

print('Logistic Regression; Accuracy = {}'.format(accuracy))

Logistic Regression; Accuracy = 0.26229508196721313
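
To judge an accuracy figure it helps to compare it with a trivial baseline, such as always guessing the most common make. In the full data toyota accounts for 32 of the 201 cars, so that guess would be right about 16% of the time (the figure on a particular test split will differ). The sketch below uses made-up labels and also shows that scikit-learn's `accuracy_score` gives the same number as the manual calculation.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up true and predicted makes
y_true = pd.Series(['toyota', 'toyota', 'toyota', 'nissan', 'mazda', 'nissan'])
y_hat = pd.Series(['toyota', 'toyota', 'nissan', 'nissan', 'toyota', 'nissan'])

# Accuracy: the share of predictions that match the true label
acc = accuracy_score(y_true, y_hat)
manual = (y_hat == y_true).sum() / len(y_true)

# Baseline: always predict the most frequent make
majority = y_true.mode()[0]
baseline = (y_true == majority).mean()

print('Model accuracy = {:.3f}, baseline = {:.3f}'.format(acc, baseline))
```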


In [27]:
print(confusion_matrix(y_test,y_pred))

[[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0]
 [0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 4 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 4 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 0 0 3 1 0 0 1 0 0 6 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 3 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0]]
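
The raw matrix above is hard to read because the rows and columns carry no labels. One way to get a labelled version is `pd.crosstab`, sketched here on made-up labels (rows are the actual makes, columns the predicted ones):

```python
import pandas as pd

# Made-up actual and predicted makes
y_true = pd.Series(['audi', 'audi', 'bmw', 'mazda'], name='actual')
y_hat = pd.Series(['audi', 'bmw', 'bmw', 'audi'], name='predicted')

# Count how often each actual make was predicted as each make
table = pd.crosstab(y_true, y_hat)
print(table)
```

Unlike `confusion_matrix`, the table only has columns for makes that were actually predicted, so it is not necessarily square.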


**Yikes, 26% accuracy. Let's have a look at the classification report and the confusion matrix. **

In [25]:
print(classification_report(y_test,y_pred))

               precision    recall  f1-score   support

  alfa-romero       0.00      0.00      0.00         1
         audi       1.00      0.50      0.67         2
          bmw       0.00      0.00      0.00         1
    chevrolet       0.50      1.00      0.67         1
        dodge       0.00      0.00      0.00         4
        honda       0.50      0.20      0.29         5
        mazda       1.00      0.33      0.50         6
mercedes-benz       0.00      0.00      0.00         3
      mercury       0.00      0.00      0.00         1
   mitsubishi       0.25      1.00      0.40         2
       nissan       0.20      0.20      0.20         5
       peugot       0.12      0.50      0.20         2
     plymouth       0.00      0.00      0.00         2
      porsche       0.25      0.50      0.33         2
      renault       0.00      0.00      0.00         0
         saab       0.00      0.00      0.00         2
       subaru       0.00      0.00      0.00         3
       to



** Overall we are not doing well, but for a few makes we are doing OK. Why? **

In [28]:
y_train.value_counts()

toyota           21
nissan           13
mitsubishi       11
mazda            11
peugot            9
subaru            9
honda             8
volkswagen        8
bmw               7
volvo             7
plymouth          5
mercedes-benz     5
dodge             5
audi              4
saab              4
jaguar            3
chevrolet         2
renault           2
porsche           2
isuzu             2
alfa-romero       2
Name: make, dtype: int64

In [132]:
y_test.value_counts()

nissan           7
mazda            6
volkswagen       6
subaru           6
toyota           5
peugot           5
honda            4
volvo            3
bmw              3
plymouth         3
jaguar           2
mitsubishi       2
mercedes-benz    1
mercury          1
saab             1
renault          1
porsche          1
chevrolet        1
dodge            1
audi             1
Name: make, dtype: int64

** As you can see from above, the makes are very unbalanced. **
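
One common way to soften this kind of imbalance, not tried in this notebook, is to give rare classes a larger weight during training, for example with `LogisticRegression(class_weight='balanced')`. The sketch below uses made-up labels to show what those balanced weights look like: each class gets n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, unbalanced labels: 6 toyota, 3 nissan, 1 mercury
y = np.array(['toyota'] * 6 + ['nissan'] * 3 + ['mercury'])
classes = np.unique(y)  # sorted: mercury, nissan, toyota

# 'balanced' weights are inversely proportional to the class frequencies
values = compute_class_weight(class_weight='balanced', classes=classes, y=y)
weights = dict(zip(classes, values))

for make, weight in weights.items():
    print('{:10s} {:.3f}'.format(make, weight))
```

With only one or two cars for some makes, reweighting can only do so much; collecting more data or merging rare makes may be needed as well.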