# Predict Auto Prices
## Context

We want to predict the price of a used car from several features. The features fall into three types: characteristics of the vehicle, the normalized loss, and the risk factor. 
*Source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos*


## Hypothesis
Without looking at the data, our first hypothesis is that the price is affected by the risk factor, the fuel consumption per distance travelled and the fuel type (e.g. diesel), the mileage, the year of manufacture, and possibly other features. 


## Exploratory Data Analysis
Here is the attribute description:

     Attribute:                Attribute Range:
     ------------------        -----------------------------------------------
  1. symboling:                -3, -2, -1, 0, 1, 2, 3.
  2. normalized-losses:        continuous from 65 to 256.
  3. make:                     alfa-romero, audi, bmw, chevrolet, dodge, honda,
                               isuzu, jaguar, mazda, mercedes-benz, mercury,
                               mitsubishi, nissan, peugot, plymouth, porsche,
                               renault, saab, subaru, toyota, volkswagen, volvo
  4. fuel-type:                diesel, gas.
  5. aspiration:               std, turbo.
  6. num-of-doors:             four, two.
  7. body-style:               hardtop, wagon, sedan, hatchback, convertible.
  8. drive-wheels:             4wd, fwd, rwd.
  9. engine-location:          front, rear.
 10. wheel-base:               continuous from 86.6 to 120.9.
 11. length:                   continuous from 141.1 to 208.1.
 12. width:                    continuous from 60.3 to 72.3.
 13. height:                   continuous from 47.8 to 59.8.
 14. curb-weight:              continuous from 1488 to 4066.
 15. engine-type:              dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
 16. num-of-cylinders:         eight, five, four, six, three, twelve, two.
 17. engine-size:              continuous from 61 to 326.
 18. fuel-system:              1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
 19. bore:                     continuous from 2.54 to 3.94.
 20. stroke:                   continuous from 2.07 to 4.17.
 21. compression-ratio:        continuous from 7 to 23.
 22. horsepower:               continuous from 48 to 288.
 23. peak-rpm:                 continuous from 4150 to 6600.
 24. city-mpg:                 continuous from 13 to 49.
 25. highway-mpg:              continuous from 16 to 54.
 26. price:                    continuous from 5118 to 45400.


We begin by importing the dataset:

In [17]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sbn

# Package of reusable personal modules 
from datawrangling import attribute_to_list

# Getting data
path = "autos.csv"
file_headers = "attributes.csv"
df = pd.read_csv(path, names=attribute_to_list(file_headers, False))
df.shape

(205, 26)
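`attribute_to_list` comes from a personal package that is not shown here. As a rough stand-in, and assuming the attributes file simply holds one column name per line, the hypothetical helper `read_attribute_names` below returns the names as a list that could be passed to `names=` in `pd.read_csv`. The file layout and the helper name are assumptions, not the author's implementation.

```python
import os
import tempfile


def read_attribute_names(path):
    """Hypothetical stand-in for attribute_to_list (assumes one name per line)."""
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and skip blank lines.
        return [line.strip() for line in handle if line.strip()]


# Small demonstration on a temporary file holding three of the attribute names.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("symboling\nnormalized-losses\nmake\n")
    demo_path = tmp.name

demo_names = read_attribute_names(demo_path)
os.remove(demo_path)
print(demo_names)
```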

In [18]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


We can already spot some missing values, encoded as "?". 
We convert them to NaN so that pandas treats them as missing:

In [51]:
df.replace("?", np.nan, inplace=True)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
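The same conversion can also happen while the file is being parsed. The sketch below uses the `na_values` parameter of `pd.read_csv` on two rows reduced to four columns (a shortened layout chosen here for brevity, not the real file). A side effect is that numeric columns containing "?" are parsed as numbers straight away.

```python
import io

import pandas as pd

# Two rows taken from the preview above, reduced to four columns for brevity.
raw = io.StringIO("3,?,alfa-romero,13495\n2,164,audi,13950\n")
demo = pd.read_csv(
    raw,
    names=["symboling", "normalized-losses", "make", "price"],
    na_values="?",  # treat "?" as a missing value while parsing
)

print(demo["normalized-losses"].isna().sum())
print(demo.dtypes)
```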


Now let's check the variable types:

In [10]:
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Comparing the output above with the attribute description further up, we notice some type mismatches. The variables normalized-losses, bore, stroke, horsepower, peak-rpm and price are expected to be continuous, but their actual type is object, which corresponds to a string in native Python.

Change variable type:

In [60]:
# Force the columns that should be numeric to float64 instead of object
numeric_columns = ['normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']
for column in numeric_columns:
    df[column] = df[column].astype('float64', errors="ignore")
df.dtypes

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object
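One caveat with `astype(..., errors="ignore")` is that a column that fails to convert is returned unchanged, so a problem can go unnoticed. A possible alternative, sketched here on a toy frame with made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns unparsable entries into NaN so that they show up in the missing-value counts.

```python
import numpy as np
import pandas as pd

# Toy frame: numeric values stored as strings, plus one unparsable entry.
demo = pd.DataFrame({
    "bore": ["3.47", "2.68", np.nan],
    "horsepower": ["111", "154", "n/a"],
})

numeric_columns = ["bore", "horsepower"]
for column in numeric_columns:
    # errors="coerce" replaces anything that cannot be parsed with NaN.
    demo[column] = pd.to_numeric(demo[column], errors="coerce")

print(demo.dtypes)
print(demo.isnull().sum())
```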

Now we can go further with our data exploration.

In [61]:
# Descriptive statistics for all columns, numeric and categorical
df.describe(include="all")

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205,205,205,205,205,205,205,205.0,...,205.0,205,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
unique,,,22,2,2,2,5,3,2,,...,,8,,,,,,,,
top,,,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,,,,,,,,
freq,,,32,185,168,116,96,120,202,,...,,94,,,,,,,,
mean,0.834146,122.0,,,,,,,,98.756585,...,126.907317,,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122,13207.129353
std,1.245307,35.442168,,,,,,,,6.021776,...,41.642693,,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443,7947.066342
min,-2.0,65.0,,,,,,,,86.6,...,61.0,,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,94.0,,,,,,,,94.5,...,97.0,,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,115.0,,,,,,,,97.0,...,120.0,,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,2.0,150.0,,,,,,,,102.4,...,141.0,,3.59,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0


In [62]:
df.isnull().sum()

symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          0
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
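With 26 columns, it can help to list only the columns that actually have gaps, together with the share of rows affected. A minimal sketch on a toy frame (the values are illustrative, not the full dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df.
demo = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan, 158.0],
    "price": [13495.0, 13950.0, 17450.0, np.nan],
    "make": ["alfa-romero", "audi", "audi", "audi"],
})

missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),  # count of missing values per column
    "share": demo.isnull().mean(),     # fraction of rows that are missing
})
# Keep only the columns with at least one gap, largest first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```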

## Missing data 

Now we have to handle the missing data. We choose to impute the values instead of dropping the incomplete data entries.

num-of-doors: 2 missing values, mode = four; replace the missing values with the mode.

bore: 4 missing values, mean and median are close, low variance; replace the missing values with the mean.

stroke: 4 missing values, mean and median are close, low variance; replace the missing values with the mean.

horsepower: 2 missing values, mean and median are not far apart; replace the missing values with the mean.

peak-rpm: 2 missing values, mean and median are not far apart; replace the missing values with the mean.

price: 4 missing values, mean and median are not far apart; replace the missing values with the mean.

normalized-losses: 41 missing values. We first tried splitting the values into categories based on fuel type. There are more gas cars than diesel cars, so the two groups might have significantly different distributions and means. 

![title](Loss_Fueltype.png) 

In [66]:
# Missing values
# Assign the result back: calling fillna(inplace=True) on a selected column
# is not guaranteed to modify df in recent pandas versions
df['num-of-doors'] = df['num-of-doors'].fillna(df['num-of-doors'].mode()[0])
df['bore'] = df['bore'].fillna(df['bore'].mean())
df['stroke'] = df['stroke'].fillna(df['stroke'].mean())
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
df['peak-rpm'] = df['peak-rpm'].fillna(df['peak-rpm'].mean())
df['price'] = df['price'].fillna(df['price'].mean())
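The choice between mean and median imputation above rests on how close the two statistics are. One way to make that comparison explicit is sketched below on a toy frame (values chosen for illustration). The `relative_gap` column is a name introduced here, and what counts as "close enough" remains a judgment call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric columns of df.
demo = pd.DataFrame({
    "bore": [3.47, 3.47, 2.68, 3.19, np.nan],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, np.nan],
})

# One row per column, with its mean, median and their relative distance.
summary = demo.agg(["mean", "median"]).T
summary["relative_gap"] = (summary["mean"] - summary["median"]).abs() / summary["median"]
print(summary)

# Mean imputation for every column at once; fillna aligns the means by column name.
filled = demo.fillna(demo.mean())
print(filled)
```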


In [67]:
df.groupby(['fuel-type'])['normalized-losses'].mean()

fuel-type
diesel    109.000000
gas       123.308725
Name: normalized-losses, dtype: float64

We replace all the missing values with the overall mean. 

In [70]:
df['normalized-losses'] = df['normalized-losses'].fillna(df['normalized-losses'].mean())
df.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
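Since the group means differ between diesel and gas, an alternative to the overall mean would be to impute within each fuel type. This is not what the notebook does. The snippet is only a sketch of that option on a toy frame with made-up values, using `groupby(...).transform("mean")`.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing normalized-losses value in each fuel type.
demo = pd.DataFrame({
    "fuel-type": ["gas", "gas", "gas", "diesel", "diesel", "diesel"],
    "normalized-losses": [120.0, 130.0, np.nan, 100.0, np.nan, 110.0],
})

# transform("mean") returns, for every row, the mean of that row's own group.
group_mean = demo.groupby("fuel-type")["normalized-losses"].transform("mean")
demo["normalized-losses"] = demo["normalized-losses"].fillna(group_mean)
print(demo)
```

Each gap is filled with the mean of its own fuel type rather than the mean over all cars.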

Now that no missing values remain, we can continue the analysis.

# Univariate Analysis
## Make
![Title](Loss_symboling.png)

It seems we can bin the values into 3 categories: Entry Level (from chevrolet to mazda), Middle Range (from saab to volvo) and High End (from bmw to jaguar). We will make that decision later.
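If the binning is adopted, one way to derive the tiers from the data rather than by eye is to rank the makes by median price and cut them into three equal-sized groups. The sketch below assumes that approach and uses a toy frame with made-up prices. The `make-tier` column name and the quantile-based cut are choices made here, not decisions taken in the notebook.

```python
import pandas as pd

# Toy frame with a handful of makes and illustrative prices.
demo = pd.DataFrame({
    "make": ["chevrolet", "chevrolet", "mazda", "saab", "volvo", "bmw", "jaguar"],
    "price": [5151.0, 6295.0, 10000.0, 15000.0, 18000.0, 26000.0, 35000.0],
})

# Median price per make, then three quantile-based bins over those medians.
median_price = demo.groupby("make")["price"].median()
tiers = pd.qcut(median_price, q=3, labels=["Entry Level", "Middle Range", "High End"])

# Map every car to the tier of its make.
demo["make-tier"] = demo["make"].map(tiers)
print(demo)
```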


**Coming next: univariate analysis, bivariate analysis? Dimensionality reduction regarding the number of columns? Variable creation?**