# Using Pandas

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
## to make it possible to display multiple output inside one cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<b>load the data from the vehicles.csv file into pandas data frame

In [2]:
vehicles_df = pd.read_csv("Data/vehicles.csv")


First exploration of the dataset:

- How many observations does it have?
  **35,952 records**
- Look at all the columns: do you understand what they mean?
  **Engine Displacement** - measure of the cylinder volume and therefore loose indicator of the power
  **MPG** - miles per gallon (measure of fuel consumption)
- Look at the raw data: do you see anything weird?
  **No**
- Look at the data types: are they the expected ones for the information the column contains?
  **Cylinders** - it could be integer
  **Transmission** - could be changed to a binary variable (manual and auto)

In [3]:
vehicles_df

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100


In [4]:
type(vehicles_df["Make"].value_counts())

pandas.core.series.Series

In [5]:
# Function to store in dictionary the number of nan values per column

def nan_counter(df):
    
    """
    Returns a dictionary containing the number of nan values per column (for dataframe df)

    Parameters
    ----------
    df : Pandas dataframe

    Returns
    -------
    remaining_nan : Dictionary
        Contains the number of nan values in each column of the dataframe

    """
    
    remaining_nan = {}

    for column in df.columns:

        remaining_nan[column] = df[column][df[column].isna() == True].size

    return remaining_nan

In [6]:
nan_counter(vehicles_df)

{'Make': 0,
 'Model': 0,
 'Year': 0,
 'Engine Displacement': 0,
 'Cylinders': 0,
 'Transmission': 0,
 'Drivetrain': 0,
 'Vehicle Class': 0,
 'Fuel Type': 0,
 'Fuel Barrels/Year': 0,
 'City MPG': 0,
 'Highway MPG': 0,
 'Combined MPG': 0,
 'CO2 Emission Grams/Mile': 0,
 'Fuel Cost/Year': 0}

In [7]:
vehicles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

### Cleaning and wrangling data

- Some car brand names refer to the same brand. Replace all brand names that contain the word "Dutton" for simply "Dutton". If you find similar examples, clean their names too. Use `loc` with boolean indexing.

- Convert CO2 Emissions from Grams/Mile to Grams/Km

- Create a binary column that solely indicates if the transmission of a car is automatic or manual. Use `pandas.Series.str.startswith` and .

- convert MPG columns to km_per_liter

#### Car Brand Names

In [8]:
brands = (pd.DataFrame(vehicles_df["Make"].value_counts())
            .reset_index()
            .rename(columns={"index":"Brand"})
            .sort_values(by="Brand", ascending=True)
         )

In [9]:
brands

Unnamed: 0,Brand,Make
78,AM General,4
110,ASC Incorporated,1
33,Acura,302
52,Alfa Romeo,41
57,American Motors Corporation,22
41,Aston Martin,133
11,Audi,890
112,Aurora Cars Ltd,1
83,Autokraft Limited,4
5,BMW,1677


In [10]:
mask_dutton = vehicles_df["Make"].apply(lambda x: "Dutton" in x)
vehicles_df[mask_dutton]

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
11012,"E. P. Dutton, Inc.",Funeral Coach,1985,4.1,8.0,Automatic 4-spd,Front-Wheel Drive,Special Purpose Vehicles,Regular,19.388824,15,21,17,522.764706,1950
30164,S and S Coach Company E.p. Dutton,Funeral Coach 2WD,1984,6.0,8.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,9,11,10,888.7,3350
31754,Superior Coaches Div E.p. Dutton,Funeral Coach 2WD,1984,6.0,8.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,10,11,10,888.7,3350


In [33]:
vehicles_df[vehicles_df["Make"].str.contains("Dutton")]

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,CO2 Emission Grams/Km,City Km/L,Highway Km/L,Combined Km/L
11012,Dutton,Funeral Coach,1985,4.1,8.0,automatic,Front-Wheel Drive,Special Purpose Vehicles,Regular,19.388824,15,21,17,522.764706,1950,324.831736,6.377665,8.928731,7.22802
30164,Dutton,Funeral Coach 2WD,1984,6.0,8.0,automatic,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,9,11,10,888.7,3350,552.213951,3.826599,4.676954,4.251777
31754,Dutton,Funeral Coach 2WD,1984,6.0,8.0,automatic,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,10,11,10,888.7,3350,552.213951,4.251777,4.676954,4.251777


In [11]:
vehicles_df["Make"].where(~ mask_dutton, "Dutton", inplace=True)

In [12]:
vehicles_df[mask_dutton]

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
11012,Dutton,Funeral Coach,1985,4.1,8.0,Automatic 4-spd,Front-Wheel Drive,Special Purpose Vehicles,Regular,19.388824,15,21,17,522.764706,1950
30164,Dutton,Funeral Coach 2WD,1984,6.0,8.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,9,11,10,888.7,3350
31754,Dutton,Funeral Coach 2WD,1984,6.0,8.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,32.961,10,11,10,888.7,3350


#### Convert CO2 Emissions from Grams/Mile to Grams/Km

In [13]:
conversion_ratio_mile_km = 1 / 1.60934

In [14]:
co2_emission_grams_km = (vehicles_df["CO2 Emission Grams/Mile"]*conversion_ratio_mile_km)

In [15]:
vehicles_df["CO2 Emission Grams/Km"] = co2_emission_grams_km
vehicles_df.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,CO2 Emission Grams/Km
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950,324.831736
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,424.779962
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100,345.133719
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,424.779962
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550,345.133719


#### binary colum

In [16]:
vehicles_df["Transmission"].value_counts()

Automatic 4-spd                     10585
Manual 5-spd                         7787
Automatic (S6)                       2631
Automatic 3-spd                      2597
Manual 6-spd                         2423
Automatic 5-spd                      2171
Automatic 6-spd                      1432
Manual 4-spd                         1306
Automatic (S8)                        960
Automatic (S5)                        822
Automatic (variable gear ratios)      675
Automatic 7-spd                       662
Automatic (S7)                        261
Auto(AM-S7)                           256
Automatic 8-spd                       243
Automatic (S4)                        229
Auto(AM7)                             157
Auto(AV-S6)                           145
Auto(AM6)                             110
Auto(AM-S6)                            92
Automatic 9-spd                        90
Manual 3-spd                           74
Manual 7-spd                           68
Auto(AV-S7)                       

In [17]:
vehicles_df["Transmission"] = vehicles_df["Transmission"].map(lambda x: "automatic" if "Auto" in x else "manual")


In [18]:
vehicles_df["Transmission"].head

<bound method NDFrame.head of 0        automatic
1        automatic
2        automatic
3        automatic
4        automatic
           ...    
35947    automatic
35948    automatic
35949    automatic
35950    automatic
35951       manual
Name: Transmission, Length: 35952, dtype: object>

#### convert MPG columns to km_per_liter

In [19]:
conversion_ratio_mpg_kmliter = 1.60934 / 3.7851

In [20]:
city_kmliter = (vehicles_df["City MPG"]*conversion_ratio_mpg_kmliter)
highway_kmliter = (vehicles_df["Highway MPG"]*conversion_ratio_mpg_kmliter)
combined_kmliter = (vehicles_df["Combined MPG"]*conversion_ratio_mpg_kmliter)

In [21]:
vehicles_df["City Km/L"] = city_kmliter
vehicles_df["Highway Km/L"] = highway_kmliter
vehicles_df["Combined Km/L"] = combined_kmliter
vehicles_df.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,CO2 Emission Grams/Km,City Km/L,Highway Km/L,Combined Km/L
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,automatic,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950,324.831736,7.653198,7.22802,7.22802
1,AM General,FJ8c Post Office,1984,4.2,6.0,automatic,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,424.779962,5.52731,5.52731,5.52731
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,automatic,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100,345.133719,6.802843,7.22802,6.802843
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,automatic,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,424.779962,5.52731,5.52731,5.52731
4,ASC Incorporated,GNX,1987,3.8,6.0,automatic,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550,345.133719,5.952487,8.928731,6.802843


Converting Grams/Mile to Grams/Km

1 Mile = 1.60934 Km

Grams/Mile * Mile/Km -> Grams/Mile * 1 Mile/1.60934Km

$$ \frac{Grams}{Mile} * \frac{Mile}{Km} $$

$$ \frac{Grams}{Mile} * \frac{1 Mile}{1.60934Km}  $$

convert MPG columns to km_per_liter

MPG = Miles/Gallon -> Km/Liter

1 Mile = 1.60934 Km

1 Gallon = 3.78541 Liters

$$ \frac{Miles}{Gallon} -> \frac{Miles}{Gallon} * \frac{Km}{Miles} * \frac{Gallon}{Liters}$$

$$ \frac{Miles}{Gallon} -> \frac{Miles}{Gallon} * \frac{1.60934Km}{ 1Miles} * \frac{1 Gallon}{3.78541 Liters}$$

* ( 1.60934 / 3.78541 )


### Gathering insights:

- How many car makers are there? How many models? Which car maker has the most cars in the dataset?

- When were these cars made? How big is the engine of these cars?

- What's the frequency of different transmissions, drivetrains and fuel types?

- What's the car that consumes the least/most fuel?

#### How many car makers are there? How many models? Which car maker has the most cars in the dataset?

In [40]:
vehicles_df.nunique()

Make                        125
Model                      3608
Year                         34
Engine Displacement          65
Cylinders                     9
Transmission                  2
Drivetrain                    8
Vehicle Class                34
Fuel Type                    13
Fuel Barrels/Year           123
City MPG                     48
Highway MPG                  49
Combined MPG                 46
CO2 Emission Grams/Mile     575
Fuel Cost/Year               55
CO2 Emission Grams/Km       575
City Km/L                    48
Highway Km/L                 49
Combined Km/L                46
dtype: int64

In [46]:
(pd.DataFrame(vehicles_df["Make"].value_counts())
                                 .sort_values(by="Make", ascending=False)
)

Unnamed: 0,Make
Chevrolet,3643
Ford,2946
Dodge,2360
GMC,2347
Toyota,1836
BMW,1677
Mercedes-Benz,1284
Nissan,1253
Volkswagen,1047
Mitsubishi,950


#### What's the car that consumes the least/most fuel?
#### When were these cars made? How big is the engine of these cars?

In [None]:
# Car with minimum consumption

vehicles_df[vehicles_df["Combined Km/L"] == np.max(vehicles_df["Combined Km/L"])]

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,CO2 Emission Grams/Km,City Km/L,Highway Km/L,Combined Km/L
33279,Toyota,Prius Eco,2016,1.8,4.0,automatic,Front-Wheel Drive,Midsize Cars,Regular,5.885893,58,53,56,158.0,600,98.176892,24.660305,22.534417,23.80995
33280,Toyota,Prius Eco,2017,1.8,4.0,automatic,Front-Wheel Drive,Midsize Cars,Regular,5.885893,58,53,56,158.0,600,98.176892,24.660305,22.534417,23.80995


In [None]:
# Car with maximum consumption

vehicles_df[vehicles_df["Combined Km/L"] == np.min(vehicles_df["Combined Km/L"])]

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,CO2 Emission Grams/Km,City Km/L,Highway Km/L,Combined Km/L
20894,Lamborghini,Countach,1986,5.2,12.0,manual,Rear-Wheel Drive,Two Seaters,Premium,47.087143,6,10,7,1269.571429,5800,788.877073,2.551066,4.251777,2.976244
20895,Lamborghini,Countach,1987,5.2,12.0,manual,Rear-Wheel Drive,Two Seaters,Premium,47.087143,6,10,7,1269.571429,5800,788.877073,2.551066,4.251777,2.976244
20896,Lamborghini,Countach,1988,5.2,12.0,manual,Rear-Wheel Drive,Two Seaters,Premium,47.087143,6,10,7,1269.571429,5800,788.877073,2.551066,4.251777,2.976244
20897,Lamborghini,Countach,1989,5.2,12.0,manual,Rear-Wheel Drive,Two Seaters,Premium,47.087143,6,10,7,1269.571429,5800,788.877073,2.551066,4.251777,2.976244
20898,Lamborghini,Countach,1990,5.2,12.0,manual,Rear-Wheel Drive,Two Seaters,Premium,47.087143,6,10,7,1269.571429,5800,788.877073,2.551066,4.251777,2.976244


#### What's the frequency of different transmissions, drivetrains and fuel types?

In [47]:
vehicles_df["Transmission"].value_counts()

automatic    24290
manual       11662
Name: Transmission, dtype: int64

In [54]:
n_total = len(vehicles_df)
n_total

35952

In [59]:
n_manual = np.where(vehicles_df["Transmission"] == "manual", 1, 0).sum()
n_auto = np.where(vehicles_df["Transmission"] == "automatic", 1, 0).sum()

In [60]:
freq_manual = round(n_manual / n_total*100,2)
freq_auto = round(n_auto / n_total*100,2)
print(freq_manual)
print(freq_auto)

32.44
67.56


In [61]:
vehicles_df["Drivetrain"].value_counts()

Front-Wheel Drive             13044
Rear-Wheel Drive              12726
4-Wheel or All-Wheel Drive     6503
All-Wheel Drive                2039
4-Wheel Drive                  1058
2-Wheel Drive                   423
Part-time 4-Wheel Drive         158
2-Wheel Drive, Front              1
Name: Drivetrain, dtype: int64

In [64]:
def frequency_calculator(df):
    
    output_dict = {}
    counts_df = pd.DataFrame(df.value_counts())
    n_total = len(df)
    
    for index in counts_df.index:
        
        n_index = np.where(df == index, 1, 0).sum()
        index_freq = round(n_index / n_total*100, 1)
        output_dict[index] = index_freq
        
    return output_dict


In [65]:
frequency_calculator(vehicles_df["Drivetrain"])

{'Front-Wheel Drive': 36.3,
 'Rear-Wheel Drive': 35.4,
 '4-Wheel or All-Wheel Drive': 18.1,
 'All-Wheel Drive': 5.7,
 '4-Wheel Drive': 2.9,
 '2-Wheel Drive': 1.2,
 'Part-time 4-Wheel Drive': 0.4,
 '2-Wheel Drive, Front': 0.0}

What brand has the worse CO2 Emissions on average?

Hint: use the function `sort_values()`

In [70]:
(vehicles_df[["Make","CO2 Emission Grams/Km"]].groupby(["Make"])
                                             .mean()
                                             .sort_values(by="CO2 Emission Grams/Km", ascending=False)
)

Unnamed: 0_level_0,CO2 Emission Grams/Km
Make,Unnamed: 1_level_1
Vector,651.919248
Bugatti,542.497235
Laforza Automobile Inc,502.012683
Dutton,476.419879
Rolls-Royce,475.397772
Lamborghini,469.001266
Texas Coach Company,460.178293
Maybach,453.327003
Ferrari,442.812798
Bentley,426.290692


Do cars with automatic transmission consume more fuel than cars with manual transmission on average?

In [72]:
(vehicles_df[["Transmission","CO2 Emission Grams/Km"]].groupby(["Transmission"])
                                             .mean()
                                             .sort_values(by="CO2 Emission Grams/Km", ascending=False)
)


Unnamed: 0_level_0,CO2 Emission Grams/Km
Transmission,Unnamed: 1_level_1
automatic,302.853002
manual,279.718227
