### Summary

Automobile Data Set

https://archive.ics.uci.edu/ml/datasets/automobile

Data Set Information:

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.

Note: Several of the attributes in the database could be used as a "class" attribute.


In this course, we explored the fundamentals of machine learning using the k-nearest neighbors algorithm. In this guided project, you'll practice the machine learning workflow you've learned so far to predict a car's market price from its attributes. The data set contains information on various cars. For each car we have technical details such as the engine's displacement, the weight of the car, the miles per gallon, and more. The data set is described at, and can be downloaded from, the UCI page linked above.

### Import core Python packages

In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [104]:
column_names = ["symboling",
"normalized-losses",
"make",
"fuel-type",
"aspiration",
"num-of-doors",
"body-style",
"drive-wheels",
"engine-location",
"wheel-base",
"length",
"width",
"height",
"curb-weight",
"engine-type",
"num-of-cylinders",
"engine-size",
"fuel-system",
"bore",
"stroke",
"compression-ratio",
"horsepower",
"peak-rpm",
"city-mpg",
"highway-mpg",
"price"
]

In [105]:
column_names

['symboling',
 'normalized-losses',
 'make',
 'fuel-type',
 'aspiration',
 'num-of-doors',
 'body-style',
 'drive-wheels',
 'engine-location',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-type',
 'num-of-cylinders',
 'engine-size',
 'fuel-system',
 'bore',
 'stroke',
 'compression-ratio',
 'horsepower',
 'peak-rpm',
 'city-mpg',
 'highway-mpg',
 'price']

In [106]:
cars = pd.read_csv("imports-85.data", header=None, names=column_names)

First look at the data.

In [107]:
cars.shape

(205, 26)

205 observations and 26 variables.

In [108]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-ratio    205 non-null float64
horsepower           205 non-nul

In [109]:
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [110]:
cars.tail()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.4,23.0,106,4800,26,27,22470
204,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,19,25,22625


At first glance there are some major issues with the data types.
For example, num-of-doors is listed as a "non-null object", which points to the data type "string".
A closer look seems necessary.
Missing values also appear to be coded with a question mark.

In [111]:
for i in cars.columns:
    print(i, type(cars.iloc[5][i]))

symboling <class 'numpy.int64'>
normalized-losses <class 'str'>
make <class 'str'>
fuel-type <class 'str'>
aspiration <class 'str'>
num-of-doors <class 'str'>
body-style <class 'str'>
drive-wheels <class 'str'>
engine-location <class 'str'>
wheel-base <class 'numpy.float64'>
length <class 'numpy.float64'>
width <class 'numpy.float64'>
height <class 'numpy.float64'>
curb-weight <class 'numpy.int64'>
engine-type <class 'str'>
num-of-cylinders <class 'str'>
engine-size <class 'numpy.int64'>
fuel-system <class 'str'>
bore <class 'str'>
stroke <class 'str'>
compression-ratio <class 'numpy.float64'>
horsepower <class 'str'>
peak-rpm <class 'str'>
city-mpg <class 'numpy.int64'>
highway-mpg <class 'numpy.int64'>
price <class 'str'>


There are clearly issues with the data types.
Even the important target variable price is coded as a string, but it should be a continuous variable.
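One way to find such columns systematically is to attempt a numeric conversion on every column and see how many entries survive. The following is a minimal sketch on a small made-up frame (the `demo` data and the `numeric_share` name are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Hypothetical mini frame: one genuinely textual column and two
# numeric columns that were read as strings because of "?" entries.
demo = pd.DataFrame({
    "make": ["audi", "volvo", "bmw"],
    "price": ["13950", "?", "16430"],
    "horsepower": ["102", "114", "?"],
})

# Share of entries per column that can be parsed as a number.
# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric_share = {}
for col in demo.columns:
    converted = pd.to_numeric(demo[col], errors="coerce")
    numeric_share[col] = converted.notna().mean()

print(numeric_share)
```

A column with a share close to 1 is probably numeric data stored as text, while a share of 0 suggests a real categorical variable.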

Target variable: price.

In [112]:
cars[10:15]['price']

10    16430
11    16925
12    20970
13    21105
14    24565
Name: price, dtype: object

In [113]:
print(type(cars['price']))
print(type(cars.iloc[10]['price']))

<class 'pandas.core.series.Series'>
<class 'str'>


The price variable should, of course, be a continuous variable.

Missing values?

In [114]:
cars['price'].isnull().sum()

0

Are there missing values in the price column coded as a question mark?

In [115]:
# cars['price'].unique()
questionmark_or_not = list()

for i in cars['price']:
    if i == "?": 
        questionmark_or_not.append(1)
    else:
        questionmark_or_not.append(0)
        
print(questionmark_or_not)
print(sum(questionmark_or_not))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
4


In [116]:
number_questionmarks = sum([1 for x in cars['price'] if x =="?"])
number_questionmarks

4

Missing values? How many question marks are there in each variable?

In [117]:
cars.isnull().sum()[0:5]
# There are no null values because missing values are coded as "?".

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
dtype: int64

In [118]:
cars.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

In [119]:
qm_per_column = dict()

for i in cars.columns:
    nq=sum([1 for x in cars[i] if x =="?"])
    qm_per_column[i] = nq
    
print(qm_per_column)

{'symboling': 0, 'normalized-losses': 41, 'make': 0, 'fuel-type': 0, 'aspiration': 0, 'num-of-doors': 2, 'body-style': 0, 'drive-wheels': 0, 'engine-location': 0, 'wheel-base': 0, 'length': 0, 'width': 0, 'height': 0, 'curb-weight': 0, 'engine-type': 0, 'num-of-cylinders': 0, 'engine-size': 0, 'fuel-system': 0, 'bore': 4, 'stroke': 4, 'compression-ratio': 0, 'horsepower': 2, 'peak-rpm': 2, 'city-mpg': 0, 'highway-mpg': 0, 'price': 4}
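The same per-column count can probably be obtained without an explicit Python loop. The sketch below uses `DataFrame.isin` on a small made-up frame (the `demo` data and the `qm_counts` name are illustrative assumptions):

```python
import pandas as pd

# Hypothetical mini frame mixing an integer column with string columns.
demo = pd.DataFrame({
    "symboling": [3, 1, 2],
    "normalized-losses": ["?", "?", "164"],
    "price": ["13495", "?", "13950"],
})

# isin marks every cell equal to "?"; summing the booleans counts them per column.
qm_counts = demo.isin(["?"]).sum()

# Show only the columns that actually contain the placeholder.
print(qm_counts[qm_counts > 0])
```

The result is a Series indexed by column name, so it can be filtered or sorted directly.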


Replace "?" with NaN.

In [120]:
cars_2 = cars.copy()

In [121]:
cars_2=cars_2.replace("?", np.nan)
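An alternative to replacing the placeholder after loading is to declare it when the file is read, using the `na_values` parameter of `pd.read_csv`. The sketch below uses an in-memory sample with only four made-up columns rather than the real file, so the data is an assumption:

```python
import io
import pandas as pd

# Two hypothetical rows in the same comma-separated, header-less layout.
sample = io.StringIO("3,?,alfa-romero,13495\n2,164,audi,?\n")

demo = pd.read_csv(
    sample,
    header=None,
    names=["symboling", "normalized-losses", "make", "price"],
    na_values="?",  # treat "?" as missing while parsing
)

print(demo.dtypes)
print(demo.isnull().sum())
```

With this approach the affected numeric columns are parsed as floats right away, so a later string-to-number conversion may not be needed for them.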

In [122]:
car2_null_series = cars_2.isnull().sum()
print(type(car2_null_series))
print(car2_null_series.index)

<class 'pandas.core.series.Series'>
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')


Number of NaNs per column, shown only for columns that contain NaN values.

In [123]:
# Iterate over (column name, NaN count) pairs and print the non-zero ones.
for a, i in car2_null_series.items():
    if i != 0:
        print(a, ":", i)

normalized-losses : 41
num-of-doors : 2
bore : 4
stroke : 4
horsepower : 2
peak-rpm : 2
price : 4
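The filtering step can also be written as a boolean mask on the count Series. A minimal sketch on made-up data (`demo`, `null_counts` and `nonzero` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical mini frame with NaN values in two of three columns.
demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "bore": [3.19, np.nan, 3.78],
    "price": [np.nan, np.nan, 16845.0],
})

null_counts = demo.isnull().sum()

# Keep only the columns with at least one missing value.
nonzero = null_counts[null_counts > 0]
print(nonzero)
```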


Missing values: impute, drop, or leave the variable out as a feature?<br>
In a data set with 205 observations, dropping the 41 records with a missing "normalized-losses" value<br>
would discard a fifth of the data. <br>
The other columns have fewer missing values, so dropping those rows has less impact.
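To make the comparison concrete, the missing counts printed above can be turned into shares of the 205 rows. This is plain arithmetic on the numbers already shown; only the variable names are made up:

```python
# NaN counts per column as printed above, and the total number of rows.
n_rows = 205
missing = {
    "normalized-losses": 41,
    "num-of-doors": 2,
    "bore": 4,
    "stroke": 4,
    "horsepower": 2,
    "peak-rpm": 2,
    "price": 4,
}

# Share of rows with a missing value, per column.
shares = {col: n / n_rows for col, n in missing.items()}
for col, share in shares.items():
    print(f"{col}: {share:.1%}")
```

Only "normalized-losses" reaches 20 percent; every other column stays at or below about 2 percent.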

Imputing missing values in 'normalized-losses' with the median of this variable.

In [136]:
cars_2['normalized-losses'] = cars_2['normalized-losses'].astype(float)
type(cars_2.loc[4, 'normalized-losses'])

numpy.float64

In [131]:
print(cars_2[0:5]['normalized-losses'])

0      NaN
1      NaN
2      NaN
3    164.0
4    164.0
Name: normalized-losses, dtype: float64


In [137]:
cars_2['normalized-losses'] = cars_2['normalized-losses'].fillna(cars_2['normalized-losses'].median())

In [138]:
cars_2['normalized-losses'].describe()

count    205.000000
mean     120.600000
std       31.805105
min       65.000000
25%      101.000000
50%      115.000000
75%      137.000000
max      256.000000
Name: normalized-losses, dtype: float64
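One side effect of median imputation worth checking is that the median itself is unchanged while the spread of the column typically shrinks, because the filled rows all sit at the same value. A small sketch on made-up numbers (the `losses` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
losses = pd.Series([100.0, np.nan, 120.0, np.nan, 200.0], name="normalized-losses")

# median() skips NaN entries by default, so this is the median of the observed values.
median_observed = losses.median()
filled = losses.fillna(median_observed)

print("median before/after:", median_observed, filled.median())
print("std before/after:", round(losses.std(), 2), round(filled.std(), 2))
```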

Drop all remaining observations that contain NaN values from the data frame.

In [139]:
cars_2 = cars_2.dropna()

In [140]:
cars_2.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
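It can be useful to record how many rows a `dropna` call removes, and to compare it with dropping only on selected columns via the `subset` parameter. A minimal sketch on made-up data (`demo`, `cleaned` and `only_price` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical mini frame with NaN values in different rows.
demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "mazda"],
    "bore": [3.19, np.nan, 3.78, 3.03],
    "price": [13950.0, 16430.0, np.nan, 6795.0],
})

# Drop every row that has a NaN in any column.
cleaned = demo.dropna()
print(len(demo), "->", len(cleaned))

# Drop only rows where the target variable is missing.
only_price = demo.dropna(subset=["price"])
print(len(demo), "->", len(only_price))
```

Restricting the drop to the target and the features actually used would keep more rows, at the cost of NaN values remaining in unused columns.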

Convert variables coded as strings into numeric if justified.

In [None]:

2. normalized-losses: continuous from 65 to 256.

    
    

4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.
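Based on the ranges listed above, columns such as bore and price hold numbers stored as strings, and the two count columns hold number words. The conversion could be sketched as follows. The `demo` frame and the `word_to_int` mapping are assumptions for illustration, and whether the word columns should be treated as numeric at all is a modelling choice:

```python
import pandas as pd

# Hypothetical mini frame with the string-coded columns after NaN rows were dropped.
demo = pd.DataFrame({
    "num-of-doors": ["two", "four", "four"],
    "num-of-cylinders": ["four", "six", "four"],
    "bore": ["3.47", "3.19", "3.78"],
    "price": ["13495", "13950", "16845"],
})

# Digit strings can be converted directly.
for col in ["bore", "price"]:
    demo[col] = pd.to_numeric(demo[col])

# Number words need an explicit mapping (covers the values listed above).
word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}
for col in ["num-of-doors", "num-of-cylinders"]:
    demo[col] = demo[col].map(word_to_int)

print(demo.dtypes)
```

Any word missing from the mapping would become NaN after `map`, so a quick `isnull().sum()` check afterwards is advisable.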