# Fuel Economy
### Data on Cars used for Testing Fuel Economy
<hr>

The test data used to determine fuel economy estimates is derived from vehicle testing done at EPA's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers who submit their own test data to EPA.

Each year, EPA provides fuel economy data to the Department of Energy (DOE), the Department of Transportation (DOT) and the Internal Revenue Service (IRS) so that they can administer their fuel economy-related programs.

In this scenario we will analyse how vehicle fuel economy has changed over a decade. To do so, we will use the datasets for 2010 and 2020 and compare them. One of the major concerns of recent times is pollution, which contributes to environmental degradation, and vehicles are a major contributor. With the steep increase in vehicle purchases, we need to analyse the fuel economy of vehicles: the more fuel is burnt, the more residue is emitted into the atmosphere, and the rapid depletion of non-renewable resources could leave a devastating scenario for future generations.


### About the Dataset
<hr>

Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA.

#### Attribute Information

The data set comprises the following vehicle features, one row per tested vehicle configuration:

* Model – vehicle make and model
* Displ – engine displacement in liters
* Cyl – number of engine cylinders
* Trans – transmission type plus number of gears
    1. Auto - Automatic
    2. Man - Manual
    3. SemiAuto - Semi-Automatic
    4. SCV - Selectable Continuously Variable (e.g. CVT with paddles)
    5. AutoMan - Automated Manual
    6. AMS - Automated Manual-Selectable (e.g. Automated Manual with paddles)
    7. Other - Other
    8. CVT - Continuously Variable
    9. CM3 - Creeper/Manual 3-Speed
    10. CM4 - Creeper/Manual 4-Speed
    11. C4 - Creeper/Manual 4-Speed
    12. C5 - Creeper/Manual 5-Speed
    13. Auto-S2 - Semi-Automatic 2-Speed
    14. Auto-S3 - Semi-Automatic 3-Speed
    15. Auto-S4 - Semi-Automatic 4-Speed
    16. Auto-S5 - Semi-Automatic 5-Speed
    17. Auto-S6 - Semi-Automatic 6-Speed
    18. Auto-S7 - Semi-Automatic 7-Speed
* Drive – 2-wheel Drive, 4-wheel drive/all-wheel drive
* Fuel – fuel(s)
* Cert Region – certification region code
    1. CA - California
    2. CE - Calif. + NLEV (Northeast trading area)
    3. CF - Clean Fuel Vehicle
    4. CL - Calif. + NLEV (All states)
    5. FA - Federal All Altitude
    6. FC - Tier 2 Federal and Calif.
    7. NF - CFV + NLEV(ASTR) + Calif.
    8. NL - NLEV (All states)
* Stnd – vehicle emissions standard code. See Stnd Description.
* Stnd Description – vehicle emissions standard description. See https://www.epa.gov/greenvehicles/federal-and-california-light-duty-vehicle-emissions-standards-air-pollutants
* Underhood ID – engine family or test group ID. See http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore
* Veh Class – EPA vehicle class. See http://www.fueleconomy.gov/feg/findacarhelp.shtml#epaSizeClass
* Air Pollution Score (Smog Rating) – see http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore and https://www.epa.gov/greenvehicles/smog-rating
* City MPG – city fuel economy in miles per gallon
* Hwy MPG – highway fuel economy in miles per gallon
* Cmb MPG – combined city/highway fuel economy in miles per gallon
* Greenhouse Gas Score (Greenhouse Gas Rating) – see https://www.epa.gov/greenvehicles/greenhouse-gas-rating
* SmartWay – Yes, No, or Elite. See https://www.epa.gov/greenvehicles/consider-smartway-vehicle
* Comb CO2 – combined city/highway CO2 tailpipe emissions in grams per mile
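
The Trans attribute packs two pieces of information into one string: the transmission type and the number of gears. A minimal sketch of how such a code could be split is shown here. The helper name `split_trans` is hypothetical and not part of the original notebook, and it assumes codes of the form `Type-N` (for example `SemiAuto-6`); codes without a trailing numeric part, such as `CVT` or `Auto-S6`, are returned unchanged with no gear count.

```python
import re

# Assumed layout: a transmission type, a hyphen, then the gear count in digits.
_TRANS_PATTERN = re.compile(r"^(?P<kind>.+)-(?P<gears>\d+)$")


def split_trans(code):
    """Split a Trans code into (type, gears); gears is None when not present."""
    code = code.strip()
    match = _TRANS_PATTERN.match(code)
    if match is None:
        # No "-<digits>" suffix, so keep the code as the type.
        return code, None
    return match.group("kind"), int(match.group("gears"))


print(split_trans("SemiAuto-6"))  # ('SemiAuto', 6)
print(split_trans("AMS-8"))       # ('AMS', 8)
print(split_trans("CVT"))         # ('CVT', None)
```

On a DataFrame column, the same function could be applied with `Series.map`, for example `data_2010['Trans'].map(split_trans)`.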





In [1]:
# import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Cleaning

In [2]:
# import data sets // both 2010 and 2020 economy data

data_2010 = pd.read_excel("data/all_alpha_10.xls")
data_2020 = pd.read_excel("data/all_alpha_20.xlsx")

In [3]:
# view data for 2010

data_2010.head()

|   | Model | Displ | Cyl | Trans | Drive | Fuel | Sales Area | Stnd | Stnd Description | Underhood ID | Veh Class | Air Pollution Score | City MPG | Hwy MPG | Cmb MPG | Greenhouse Gas Score | SmartWay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | 6.0 | SemiAuto-6 | 4WD | Gasoline | CA | U2 | California LEV-II ULEV | AHNXT03.7W19 | SUV | 7 | 16 | 21 | 18 | 4 | no |
| 1 | ACURA MDX | 3.7 | 6.0 | SemiAuto-6 | 4WD | Gasoline | FA | B5 | Federal Tier 2 Bin 5 | AHNXT03.7W19 | SUV | 6 | 16 | 21 | 18 | 4 | no |
| 2 | ACURA RDX | 2.3 | 4.0 | SemiAuto-5 | 2WD | Gasoline | CA | U2 | California LEV-II ULEV | AHNXT02.3Y19 | SUV | 7 | 19 | 24 | 21 | 5 | no |
| 3 | ACURA RDX | 2.3 | 4.0 | SemiAuto-5 | 4WD | Gasoline | CA | U2 | California LEV-II ULEV | AHNXT02.3Y19 | SUV | 7 | 17 | 22 | 19 | 4 | no |
| 4 | ACURA RDX | 2.3 | 4.0 | SemiAuto-5 | 2WD | Gasoline | FA | B5 | Federal Tier 2 Bin 5 | AHNXT02.3Y19 | SUV | 6 | 19 | 24 | 21 | 5 | no |


In [4]:
# view data for 2020

data_2020.head()

|   | Model | Displ | Cyl | Trans | Drive | Fuel | Cert Region | Stnd | Stnd Description | Underhood ID | Veh Class | Air Pollution Score | City MPG | Hwy MPG | Cmb MPG | Greenhouse Gas Score | SmartWay | Comb CO2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA ILX | 2.4 | 4.0 | AMS-8 | 2WD | Gasoline | CA | L3ULEV125 | California LEV-III ULEV125 | LHNXV02.4KH3 | small car | 3 | 24 | 34 | 28 | 6 | No | 316 |
| 1 | ACURA ILX | 2.4 | 4.0 | AMS-8 | 2WD | Gasoline | FA | T3B125 | Federal Tier 3 Bin 125 | LHNXV02.4KH3 | small car | 3 | 24 | 34 | 28 | 6 | No | 316 |
| 2 | ACURA MDX | 3.0 | 6.0 | AMS-7 | 4WD | Gasoline | CA | L3ULEV125 | California LEV-III ULEV125 | LHNXV03.0ABC | small SUV | 3 | 26 | 27 | 27 | 6 | No | 333 |
| 3 | ACURA MDX | 3.0 | 6.0 | AMS-7 | 4WD | Gasoline | FA | T3B125 | Federal Tier 3 Bin 125 | LHNXV03.0ABC | small SUV | 3 | 26 | 27 | 27 | 6 | No | 333 |
| 4 | ACURA MDX | 3.5 | 6.0 | SemiAuto-9 | 2WD | Gasoline | CA | L3ULEV125 | California LEV-III ULEV125 | LHNXV03.5PBM | small SUV | 3 | 20 | 27 | 23 | 5 | No | 387 |


In [5]:
# check columns

print(data_2010.columns)
print(data_2020.columns)

Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Sales Area', 'Stnd',
       'Stnd Description', 'Underhood ID', 'Veh Class', 'Air Pollution Score',
       'City MPG', 'Hwy MPG', 'Cmb MPG', 'Greenhouse Gas Score', 'SmartWay'],
      dtype='object')
Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Cert Region',
       'Stnd', 'Stnd Description', 'Underhood ID', 'Veh Class',
       'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
       'Greenhouse Gas Score', 'SmartWay', 'Comb CO2'],
      dtype='object')


#### A few changes have to be made:

1. The 2010 data set contains a column 'Sales Area' and the 2020 data set contains a column 'Cert Region'. They appear to hold the same information, so after checking their values we should rename one of them.


2. The 2010 data set does not contain the column 'Comb CO2', whereas the 2020 data set does, so we will remove that column from 2020.
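
The two differences listed above were spotted by reading the printed column lists. As a rough sketch, the same comparison can be done programmatically. The lists below are copied from the printed output, and the variable names are only illustrative.

```python
# Column names copied from the printed output of the two data sets.
cols_2010 = ['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Sales Area',
             'Stnd', 'Stnd Description', 'Underhood ID', 'Veh Class',
             'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
             'Greenhouse Gas Score', 'SmartWay']
cols_2020 = ['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Cert Region',
             'Stnd', 'Stnd Description', 'Underhood ID', 'Veh Class',
             'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
             'Greenhouse Gas Score', 'SmartWay', 'Comb CO2']

# Names that appear in one year only, in their original order.
only_2010 = [name for name in cols_2010 if name not in cols_2020]
only_2020 = [name for name in cols_2020 if name not in cols_2010]

print(only_2010)  # ['Sales Area']
print(only_2020)  # ['Cert Region', 'Comb CO2']
```

With the DataFrames loaded, `data_2010.columns.difference(data_2020.columns)` should give the same information directly, although the result is sorted rather than kept in column order.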

#### Check columns 'Sales Area' and 'Cert Region'.

In [6]:
data_2010['Sales Area'].value_counts()

FA    1118
CA    1041
FC     183
Name: Sales Area, dtype: int64

In [7]:
data_2020['Cert Region'].value_counts()

CA    1263
FA    1260
Name: Cert Region, dtype: int64
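
To make the comparison of the two columns explicit, the sketch below checks which region codes are shared and which occur in one year only. The counts are copied from the two outputs above; the dictionary names are illustrative.

```python
# Value counts copied from the outputs of the two cells above.
counts_2010 = {"FA": 1118, "CA": 1041, "FC": 183}  # 'Sales Area'
counts_2020 = {"CA": 1263, "FA": 1260}             # 'Cert Region'

codes_2010 = set(counts_2010)
codes_2020 = set(counts_2020)

shared = sorted(codes_2010 & codes_2020)
only_in_2010 = sorted(codes_2010 - codes_2020)
only_in_2020 = sorted(codes_2020 - codes_2010)

print(shared)        # ['CA', 'FA']
print(only_in_2010)  # ['FC']
print(only_in_2020)  # []
```

Every code used in 2020 also appears in 2010, and all three codes are in the Cert Region list of the attribute information, which supports treating the two columns as the same attribute. The FC code only occurs in the 2010 data.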

In [8]:
data_2010.rename(columns = {'Sales Area':'Cert Region'}, inplace=True)

### Drop Extraneous Columns

In [9]:
# data set 2010

data_2010.drop(['Stnd','Stnd Description','Underhood ID'], axis=1, inplace=True)

In [10]:
# data set 2020

data_2020.drop(['Stnd','Stnd Description','Underhood ID','Comb CO2'], axis=1, inplace=True)

In [11]:
# check whether all columns are the same and in the same order

(data_2010.columns == data_2020.columns).all()

True
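
The elementwise comparison above assumes both data sets have the same number of columns; as far as I can tell, pandas raises an error instead of returning `False` when the lengths differ. A minimal sketch of a length-tolerant check follows. The helper `same_columns` is hypothetical and not part of the original notebook.

```python
def same_columns(left, right):
    """Return True when both sequences hold the same names in the same order."""
    return list(left) == list(right)


cols_a = ["Model", "Displ", "Cert Region"]
cols_b = ["Model", "Displ", "Cert Region", "Comb CO2"]

print(same_columns(cols_a, cols_a))  # True
print(same_columns(cols_a, cols_b))  # False
```

It accepts plain lists as well as `DataFrame.columns`, for example `same_columns(data_2010.columns, data_2020.columns)`. The built-in `data_2010.columns.equals(data_2020.columns)` is another option.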

### Rename Columns

In [12]:
# replace space with _ and convert all letters to lowercase for ease
data_2010.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)

In [13]:
# replace space with _ and convert all letters to lowercase for ease
data_2020.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
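
Since the same renaming rule is applied to both data sets, it could also be written once as a named function. This is only a sketch, and `clean_column_name` is a hypothetical helper that is not part of the original notebook.

```python
def clean_column_name(name):
    """Trim surrounding whitespace, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


examples = ["Cert Region", "Air Pollution Score", " Cmb MPG "]
print([clean_column_name(name) for name in examples])
# ['cert_region', 'air_pollution_score', 'cmb_mpg']
```

The function can be passed to `rename` in place of the lambda, for example `data_2010.rename(columns=clean_column_name, inplace=True)`, which keeps the rule identical for both years.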

In [14]:
### Save Data Sets
data_2010.to_csv('data/data_10_clean.csv', index=False)
data_2020.to_csv('data/data_20_clean.csv', index=False)