# Automobile Mileage Prediction

The auto mpg dataset will be used for predicting mpg (Miles Per Gallon) i.e. Mileage of automobiles.

This dataset consists of the following attributes:

1. mpg
2. cylinders
3. displacement
4. horsepower
5. weight
6. acceleration
7. model year
8. origin
9. car name: string (unique for each instance)


### Tools Required:
Python
Jupyter-Notebook

### Packages:
Pandas
Numpy
Matplotlib

In [1]:
#importing libraries
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


In [2]:
# reading the dataset
df = pd.read_csv("C:/Users/admin/Downloads/auto-mpg.csv")
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


This shows that the dataset has in all 398 rows and 9 columns. There are no null values in the entire dataset. Out of the 9 columns 3 columns are of type 'Float', 4 of type 'Int' and 2 of type 'Object'.

In [4]:
df.replace('?', np.nan, inplace=True)
df = df.dropna()

In [5]:
df['horsepower'] = df.horsepower.astype(int)

In [6]:
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592,1.576531
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737,0.805518
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0,1.0
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


In [7]:
df.origin.duplicated().sum()

389

Out of the total 392 entries 389 entries have the same value. So if we drop this column it would not make much difference.

In [8]:
df = df.drop('origin',1)

  """Entry point for launching an IPython kernel.


In [9]:
# pandas-profiling
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

### Types of Data
1. Qualitative or Categorical Data: Data which aren't numbers.
2. Quantitative or Numerical Data: Numbers

### Kinds of Numerical Data
1. Continuous: Eg: temperature 69F or 69.4F or 69.45564F
2. Discrete: Eg: number of balls in pack 1 or 2 or 3

#### Central Tendency consists of mean, median and mode. 
#### Measures of spread contain range, variance and standard deviation.
#### Outliers are points far away from the mean.
#### Skewness is a measure of the lack of symmetry. Kurtosis tells us whether a data is heavily-tailed. Means data with high kurtosis   tend to have heavy tails, or outliers.

In [10]:
# Descriptive Statistics
variance = df.var() 
std_dev = df.std()
print("Variance\n")
print(variance)
print("\nStandard Deviation\n")
print(std_dev)

Variance

mpg                 60.918142
cylinders            2.909696
displacement     10950.367554
horsepower        1481.569393
weight          721484.709008
acceleration         7.611331
model year          13.569915
dtype: float64

Standard Deviation

mpg               7.805007
cylinders         1.705783
displacement    104.644004
horsepower       38.491160
weight          849.402560
acceleration      2.758864
model year        3.683737
dtype: float64


  
  This is separate from the ipykernel package so we can avoid doing imports until


#### 'weight' attribute has the highest standard deviation while 'cylinders' has the lowest standard deviation.

In [11]:
df['weight'].hist()

<AxesSubplot:>

In [12]:
df['cylinders'].hist()

<AxesSubplot:>

In [13]:
df['acceleration'].hist()

<AxesSubplot:>

We observe that weight plot is skewed while acceleration is a normal plot. A skewed plot illustrates that many points are far away from the mean. A normal plot has most number of points close to the mean. Hence, weight has the highest standard deviation while acceleration has a low standard deviation.

In [14]:
X = df.iloc[:, :-1].values
y = df.iloc[:, 0].values

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

In [16]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [17]:
y_pred = regressor.predict(X_test)
print(y_pred)

[28.  22.3 12.  38.  33.8 19.4 38.1 30.  20.  20.  27.  16.5 24.5 11.
 16.9 33.7 21.6 14.  26.  28.4 13.  16.  20.  25.  41.5 14.  25.8 25.1
 20.  17.  20.  31.6 22.  26.  21.  29.8 31.  13.  16.  14.  15.  44.6
 31.3 16.  29.  16.  29.  13.  17.5 18.  26.  15.  10.  22.  34.3 30.7
 20.2 22.  33.  21.  22.  24.  31.5 15.  26.  16.  14.  27.  25.  40.8
 36.1 30.  17.6 15.5 23.  14.  26.  19.2 31.5 33.5 20.5 34.2 24.  24.
 14.  23.9 24.  32.9 31.8 21.5 25.5 15.  21.5 19.  38.  23.  35.1 23.
 31.  39.4 12.  25.  24.  26.5 34.7 28.8 28.  18.2 44.  14.  15.5 36.
 25.5 19.2 26.  25.  27.  17.5 31.  17.  26.  32.3 11.  20.5 13.  15.
 26.  33.5 39.  32.2 21. ]


In [18]:
print(y_test)

[28.  22.3 12.  38.  33.8 19.4 38.1 30.  20.  20.  27.  16.5 24.5 11.
 16.9 33.7 21.6 14.  26.  28.4 13.  16.  20.  25.  41.5 14.  25.8 25.1
 20.  17.  20.  31.6 22.  26.  21.  29.8 31.  13.  16.  14.  15.  44.6
 31.3 16.  29.  16.  29.  13.  17.5 18.  26.  15.  10.  22.  34.3 30.7
 20.2 22.  33.  21.  22.  24.  31.5 15.  26.  16.  14.  27.  25.  40.8
 36.1 30.  17.6 15.5 23.  14.  26.  19.2 31.5 33.5 20.5 34.2 24.  24.
 14.  23.9 24.  32.9 31.8 21.5 25.5 15.  21.5 19.  38.  23.  35.1 23.
 31.  39.4 12.  25.  24.  26.5 34.7 28.8 28.  18.2 44.  14.  15.5 36.
 25.5 19.2 26.  25.  27.  17.5 31.  17.  26.  32.3 11.  20.5 13.  15.
 26.  33.5 39.  32.2 21. ]


In [19]:
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0


In [20]:
df.groupby(['car name','horsepower']).mpg.mean().sort_values()

car name              horsepower
hi 1200d              193            9.0
chevy c20             200           10.0
ford f250             215           10.0
dodge d200            210           11.0
chevrolet impala      150           11.0
                                    ... 
vw dasher (diesel)    48            43.4
vw pickup             52            44.0
vw rabbit c (diesel)  48            44.3
honda civic 1500 gl   67            44.6
mazda glc             65            46.6
Name: mpg, Length: 367, dtype: float64

#### This shows that horsepower is inversely proportional to mpg

In [21]:
df.groupby(['car name','displacement']).mpg.mean().sort_values()

car name              displacement
hi 1200d              304.0            9.0
ford f250             360.0           10.0
chevy c20             307.0           10.0
chevrolet impala      400.0           11.0
oldsmobile omega      350.0           11.0
                                      ... 
vw dasher (diesel)    90.0            43.4
vw pickup             97.0            44.0
vw rabbit c (diesel)  90.0            44.3
honda civic 1500 gl   91.0            44.6
mazda glc             86.0            46.6
Name: mpg, Length: 342, dtype: float64

#### This shows that displacement is inversely proportional to mpg

In [22]:
df.groupby(['car name','weight']).mpg.mean().sort_values()

car name              weight
hi 1200d              4732       9.0
chevy c20             4376      10.0
ford f250             4615      10.0
oldsmobile omega      3664      11.0
chevrolet impala      4997      11.0
                                ... 
vw dasher (diesel)    2335      43.4
vw pickup             2130      44.0
vw rabbit c (diesel)  2085      44.3
honda civic 1500 gl   1850      44.6
mazda glc             2110      46.6
Name: mpg, Length: 391, dtype: float64

#### This shows that weight is inversely proportional to mpg

In [23]:
import pickle

In [24]:
pickle.dump(df, open('test.pkl', 'wb'))

In [25]:
my_dict = pickle.load(open('./test.pkl', 'rb'))
my_dict

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,car name
0,18.0,8,307.0,130,3504,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,ford torino
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86,2790,15.6,82,ford mustang gl
394,44.0,4,97.0,52,2130,24.6,82,vw pickup
395,32.0,4,135.0,84,2295,11.6,82,dodge rampage
396,28.0,4,120.0,79,2625,18.6,82,ford ranger
