# Pandas challenges

Read the `vehicles.csv` file into a DataFrame and store it in a variable called `cars`:

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 150)


cars = pd.read_csv('data/vehicles.csv')

cars.info()  # prints its own summary and returns None, so no print() or type() is needed
print('\n------------------------------------------------------------------------------------------------------------------')
print(cars.describe())
print('\n------------------------------------------------------------------------------------------------------------------')
cars.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550
5,Acura,2.2CL/3.0CL,1997,2.2,4.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,14.982273,20,26,22,403.954545,1500
6,Acura,2.2CL/3.0CL,1997,2.2,4.0,Manual 5-spd,Front-Wheel Drive,Subcompact Cars,Regular,13.73375,22,28,24,370.291667,1400
7,Acura,2.2CL/3.0CL,1997,3.0,6.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,16.4805,18,26,20,444.35,1650
8,Acura,2.3CL/3.0CL,1998,2.3,4.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,14.982273,19,27,22,403.954545,1500
9,Acura,2.3CL/3.0CL,1998,2.3,4.0,Manual 5-spd,Front-Wheel Drive,Subcompact Cars,Regular,13.73375,21,29,24,370.291667,1400


Explore the dataset:

- How many rows and columns are there?

- What are the data types of the columns?

- Are there missing values?

- What are the ranges / distributions of the numerical columns?

- What are the value counts for the categorical columns?

In [2]:
print(f"There are {len(cars.index)} rows and {len(cars.columns)} columns")

print(f"\nThe datatypes of the columns are as follows: \n{cars.dtypes}")

print(f"\nThere are {cars.isnull().sum().sum()} null values in the dataframe")

# describe() summarizes each numeric column; max - min gives the range of each one
numeric_summary = cars.describe()
print(f"\nThe distributions of numerical values are as follows:\n{numeric_summary.loc['max'] - numeric_summary.loc['min']}")

# value_counts() tallies each distinct value; count() would only report non-null rows
print("\nThe value counts for the categorical columns are as follows:")
for column in cars.select_dtypes(include='object'):
    print(cars[column].value_counts(), end='\n\n')

There are 35952 rows and 15 columns

The datatypes of the columns are as follows: 
Make                        object
Model                       object
Year                         int64
Engine Displacement        float64
Cylinders                  float64
Transmission                object
Drivetrain                  object
Vehicle Class               object
Fuel Type                   object
Fuel Barrels/Year          float64
City MPG                     int64
Highway MPG                  int64
Combined MPG                 int64
CO2 Emission Grams/Mile    float64
Fuel Cost/Year               int64
dtype: object

There are 0 null values in the dataframe

The distributions of numerical values are as follows:
Year                         33.000000
Engine Displacement           7.800000
Cylinders                    14.000000
Fuel Barrels/Year            47.027143
City MPG                     52.000000
Highway MPG                  52.000000
Combined MPG                 49.000000
CO2 Emis

Drop the column "Combined MPG"

In [3]:
cars.drop(columns="Combined MPG", inplace=True, errors='ignore')
cars.head(5)

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,555.4375,2550


Change column names so that there are no names with spaces or weird special characters:

In [4]:
# Map each unwanted character to its replacement
replacements = {' ': '_', '/': '_per_'}
for old, new in replacements.items():
    # regex=False treats `old` as a literal string rather than a pattern
    cars.columns = cars.columns.str.replace(old, new, regex=False)

cars

Unnamed: 0,Make,Model,Year,Engine_Displacement,Cylinders,Transmission,Drivetrain,Vehicle_Class,Fuel_Type,Fuel_Barrels_per_Year,City_MPG,Highway_MPG,CO2_Emission_Grams_per_Mile,Fuel_Cost_per_Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,246.000000,1100
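
The renaming step can also be written as one function passed to `DataFrame.rename`. The sketch below is only an illustration on a toy frame: the helper name `clean_column_name` and the rule of collapsing every other non-alphanumeric run into `_` are assumptions, not part of the original exercise.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Hypothetical helper: turn a column label into an identifier-like name."""
    name = name.replace('/', '_per_')            # keep the "per" meaning of the slash
    name = re.sub(r'[^0-9A-Za-z]+', '_', name)   # collapse any other non-alphanumeric run
    return name.strip('_')


# Toy frame standing in for `cars` (column labels copied from the dataset)
demo = pd.DataFrame(columns=['Fuel Barrels/Year', 'City MPG', 'CO2 Emission Grams/Mile'])

# rename() accepts a callable and applies it to every column label
demo = demo.rename(columns=clean_column_name)
print(list(demo.columns))
```

Passing a function keeps the rule in one place, so new special characters only need one extra line in the helper.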


Which brand has the most cars?

In [5]:
cars.groupby('Make').count()['Model'].idxmax()

'Chevrolet'

Which brand has the worst CO2 emissions on average?

In [6]:
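# Caution: .sum() adds up emissions over every model a brand sells, so it favors
# the brand with the most cars; use .mean() instead to rank brands on average.
cars.groupby('Make')['CO2_Emission_Grams_per_Mile'].sum().idxmax()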

'Chevrolet'

Which brands are the most environmentally friendly?

In [7]:
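# Caution: .sum() totals emissions over every model a brand sells, so brands with
# very few models rank first; use .mean() to compare a typical car per brand.
cars.groupby('Make')['CO2_Emission_Grams_per_Mile'].sum().sort_values().head(5)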

Make
Fisker                    169.000000
Panoz Auto-Development    403.954545
Qvale                     423.190476
Isis Imports Ltd          444.350000
London Taxi               462.727273
Name: CO2_Emission_Grams_per_Mile, dtype: float64
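
When comparing brands, totals and averages can point in opposite directions. The following is a minimal sketch on made-up numbers (the makes `A` and `B` and their emission values are invented for illustration) showing how a brand with many modest models can top a sum-based ranking while a brand with one high-emission model tops the mean-based one.

```python
import pandas as pd

# Invented data: brand A sells three moderate cars, brand B sells one heavy emitter
toy = pd.DataFrame({
    'Make': ['A', 'A', 'A', 'B'],
    'CO2_Emission_Grams_per_Mile': [300.0, 310.0, 320.0, 600.0],
})

co2 = toy.groupby('Make')['CO2_Emission_Grams_per_Mile']
totals = co2.sum()      # grows with the number of models a brand has
averages = co2.mean()   # emissions of a typical model of the brand

print(totals.idxmax())    # brand with the largest total
print(averages.idxmax())  # brand with the largest average

# Lowest average emissions first
cleanest = averages.sort_values().head(5)
print(cleanest)
```

Selecting the column before aggregating also avoids aggregating the text columns, which `mean()` cannot handle.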

Create 4 groups (bins) of cars, by Year. We want to explore how cars have evolved decade by decade.

In [8]:
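# pd.cut assigns each Year to a right-closed interval: (1980, 1990], (1990, 2000], ...
# so 1990 falls in the first bin. Grouping by that result yields one group per bin.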
cars_by_decade = cars.groupby(pd.cut(cars['Year'], [1980, 1990, 2000, 2010, 2020]))
cars_by_decade.describe()

Unnamed: 0_level_0,Year,Year,Year,Year,Year,Year,Year,Year,Engine_Displacement,Engine_Displacement,...,CO2_Emission_Grams_per_Mile,CO2_Emission_Grams_per_Mile,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year,Fuel_Cost_per_Year
Year,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
"(1980, 1990]",7926.0,1987.036841,1.897873,1984.0,1985.0,1987.0,1989.0,1990.0,7926.0,3.244651,...,592.466667,1269.571429,7926.0,1901.406763,525.528343,700.0,1500.0,1850.0,2200.0,5800.0
"(1990, 2000]",9169.0,1995.127277,2.9144,1991.0,1993.0,1995.0,1998.0,2000.0,9169.0,3.276508,...,555.4375,1110.875,9169.0,1930.17232,502.488933,650.0,1600.0,1850.0,2200.0,5050.0
"(2000, 2010]",10866.0,2005.690502,2.81401,2001.0,2003.0,2006.0,2008.0,2010.0,10866.0,3.470072,...,555.4375,1110.875,10866.0,1963.615866,486.60124,650.0,1600.0,1950.0,2200.0,5050.0
"(2010, 2020]",7991.0,2013.934051,1.927803,2011.0,2012.0,2014.0,2016.0,2017.0,7991.0,3.323777,...,480.0,888.7,7991.0,1744.180954,490.083058,600.0,1400.0,1700.0,2000.0,4050.0

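By default `pd.cut` builds right-closed intervals, which places 1990 in the same bin as the eighties. If calendar decades are wanted instead, one option is `right=False` together with explicit labels. This is a sketch on a handful of sample years; the label strings are an assumption chosen for readability.

```python
import pandas as pd

years = pd.Series([1984, 1989, 1990, 1999, 2000, 2010, 2017], name='Year')

decade = pd.cut(
    years,
    bins=[1980, 1990, 2000, 2010, 2020],
    right=False,  # intervals become [1980, 1990), [1990, 2000), ...
    labels=['1980s', '1990s', '2000s', '2010s'],  # assumed, human-readable names
)
print(decade.tolist())
```

The resulting categorical Series can be passed to `groupby` exactly like the unlabeled bins.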

Did cars consume more gas in the eighties?

In [9]:
print("Yes, cars did consume more gas in the eighties.")
cars_by_decade.Fuel_Barrels_per_Year.describe()['mean']

Yes, cars did consume more gas in the eighties.


Year
(1980, 1990]    18.551723
(1990, 2000]    18.220520
(2000, 2010]    17.860727
(2010, 2020]    15.630235
Name: mean, dtype: float64

Do cars with automatic transmission consume more fuel than cars with manual transmission?

In [26]:
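# Boolean masks: 'Auto' matches both "Automatic ..." and "Auto(AM5)"-style labels
is_manual = cars.Transmission.str.contains('Manual')
is_automatic = cars.Transmission.str.contains('Auto')
mean_fuel_manual = round(cars.loc[is_manual, 'Fuel_Barrels_per_Year'].mean(), 2)
mean_fuel_automatic = round(cars.loc[is_automatic, 'Fuel_Barrels_per_Year'].mean(), 2)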

print(f"Cars with manual transmission consume {mean_fuel_manual} barrels of fuel per year on average.")
print(f"Cars with automatic transmission consume {mean_fuel_automatic} barrels of fuel per year on average.")
print(f"Cars with {'manual' if mean_fuel_manual > mean_fuel_automatic else 'automatic'} transmission consume more fuel.")

Cars with manual transmission consume 16.7 barrels of fuel per year on average.
Cars with automatic transmission consume 18.04 barrels of fuel per year on average.
Cars with automatic transmission consume more fuel.
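
The same comparison can be expressed as a single `groupby` over a derived label. The sketch below uses a toy frame with invented fuel values and assumes that every transmission label not starting with "Manual" is an automatic, which should be checked against the real column before relying on it.

```python
import numpy as np
import pandas as pd

# Toy data: transmission labels in the dataset's style, fuel values invented
toy = pd.DataFrame({
    'Transmission': ['Automatic 3-spd', 'Manual 5-spd', 'Auto(AM5)', 'Manual 6-spd'],
    'Fuel_Barrels_per_Year': [20.0, 14.0, 10.0, 15.0],
})

# Assumption: anything that is not "Manual ..." counts as automatic
kind = np.where(toy['Transmission'].str.startswith('Manual'), 'manual', 'automatic')

# Group by the derived array and average the fuel column per transmission kind
mean_by_kind = toy.groupby(kind)['Fuel_Barrels_per_Year'].mean().round(2)
print(mean_by_kind)
```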


Group cars by fuel type and aggregate them by the following criteria: 

- The maximum number of cylinders
- The oldest year
- The average Miles Per Gallon in the city

In [34]:
print('Max cylinders per fuel type')
print(cars.groupby('Fuel_Type').Cylinders.max())
print('\n\n')

print('The oldest year per fuel type')
print(cars.groupby('Fuel_Type').Year.min())
print('\n\n')

print('The average miles per gallon in the city per fuel type')
print(cars.groupby('Fuel_Type').City_MPG.mean())
print('\n\n')

Max cylinders per fuel type
Fuel_Type
CNG                             8.0
Diesel                         10.0
Gasoline or E85                 8.0
Gasoline or natural gas         8.0
Gasoline or propane             8.0
Midgrade                        8.0
Premium                        16.0
Premium Gas or Electricity      8.0
Premium and Electricity         8.0
Premium or E85                 12.0
Regular                        12.0
Regular Gas and Electricity     4.0
Regular Gas or Electricity      4.0
Name: Cylinders, dtype: float64



The oldest year per fuel type
Fuel_Type
CNG                            1993
Diesel                         1984
Gasoline or E85                2000
Gasoline or natural gas        2000
Gasoline or propane            2001
Midgrade                       2011
Premium                        1985
Premium Gas or Electricity     2011
Premium and Electricity        2014
Premium or E85                 2004
Regular                        1984
Regular Gas and Electri
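
The three criteria can also be computed in one pass with named aggregation, which returns a single table with one row per fuel type. This is a sketch on a small invented frame; the output column names `max_cylinders`, `oldest_year` and `mean_city_mpg` are illustrative choices.

```python
import pandas as pd

# Invented rows using the cleaned column names from the notebook
toy = pd.DataFrame({
    'Fuel_Type': ['Regular', 'Regular', 'Diesel', 'Diesel'],
    'Cylinders': [4.0, 8.0, 6.0, 10.0],
    'Year': [1984, 2001, 1990, 1985],
    'City_MPG': [18, 22, 25, 27],
})

# Each keyword is an output column: (source column, aggregation function)
summary = toy.groupby('Fuel_Type').agg(
    max_cylinders=('Cylinders', 'max'),
    oldest_year=('Year', 'min'),
    mean_city_mpg=('City_MPG', 'mean'),
)
print(summary)
```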

We want to use "Drivetrain" in a statistical model. Convert the column to numeric.

In [42]:
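cars.Drivetrain.unique()  # inspect the labels (displayed only as the last expression of a cell)
four_wheel_drive = ['4-Wheel or All-Wheel Drive', 'All-Wheel Drive', '4-Wheel Drive', 'Part-time 4-Wheel Drive']
two_wheel_drive = ['2-Wheel Drive', 'Rear-Wheel Drive', 'Front-Wheel Drive', '2-Wheel Drive, Front']

# Assign the result back: calling replace(..., inplace=True) on a selected column
# may act on a temporary copy in recent pandas versions
cars['Drivetrain'] = cars['Drivetrain'].replace(four_wheel_drive, 4).replace(two_wheel_drive, 2)
cars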

Unnamed: 0,Make,Model,Year,Engine_Displacement,Cylinders,Transmission,Drivetrain,Vehicle_Class,Fuel_Type,Fuel_Barrels_per_Year,City_MPG,Highway_MPG,CO2_Emission_Grams_per_Mile,Fuel_Cost_per_Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,2,Midsize Cars,Premium,20.600625,14,21,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),2,Two Seaters,Premium,9.155833,34,39,246.000000,1100
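
Collapsing the labels to 2 and 4 treats the drivetrain as a quantity. If the model should instead treat each original label as its own category, one-hot encoding is a common alternative. The sketch below applies `pd.get_dummies` to a toy column holding the original text labels; whether this suits the intended statistical model is an assumption.

```python
import pandas as pd

# Toy column with the original (text) drivetrain labels
toy = pd.DataFrame({'Drivetrain': ['Rear-Wheel Drive', 'Front-Wheel Drive', '4-Wheel Drive']})

# One indicator column per distinct label, stored as 0/1 integers
dummies = pd.get_dummies(toy['Drivetrain'], prefix='Drivetrain', dtype=int)

# Swap the text column for its indicator columns
encoded = pd.concat([toy.drop(columns='Drivetrain'), dummies], axis=1)
print(encoded)
```

For linear models, passing `drop_first=True` to `get_dummies` drops one indicator so the remaining columns are not perfectly collinear.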


Read the `car_brands.csv` data:

In [39]:
car_brands = pd.read_csv('data/car_brands.csv')
car_brands

Unnamed: 0,brand,revenue,production
0,AM General,1537,1.002916
1,ASC Incorporated,232,1.628105
2,Acura,234,3.394481
3,Alfa Romeo,1174,2.313726
4,American Motors Corporation,1230,1.231024
...,...,...,...
122,Volkswagen,273,1.033316
123,Volvo,1312,0.057454
124,Wallace Environmental,277,5.744609
125,Yugo,508,0.520953


Join the cars dataframe with the car brands dataframe.

In [48]:
car_df = cars.merge(car_brands, left_on='Make',  right_on='brand').drop('brand', axis=1)
car_df

Unnamed: 0,Make,Model,Year,Engine_Displacement,Cylinders,Transmission,Drivetrain,Vehicle_Class,Fuel_Type,Fuel_Barrels_per_Year,City_MPG,Highway_MPG,CO2_Emission_Grams_per_Mile,Fuel_Cost_per_Year,revenue,production
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,522.764706,1950,1537,1.002916
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550,1537,1.002916
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,555.437500,2100,1537,1.002916
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,2,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,683.615385,2550,1537,1.002916
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,2,Midsize Cars,Premium,20.600625,14,21,555.437500,2550,232,1.628105
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,244.000000,1100,447,2.229253
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,243.000000,1100,447,2.229253
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),2,Two Seaters,Premium,9.155833,34,38,244.000000,1100,447,2.229253
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),2,Two Seaters,Premium,9.155833,34,39,246.000000,1100,447,2.229253
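
`merge` defaults to an inner join, which silently drops cars whose make has no match in the brands table. A more defensive variant is sketched below on toy frames (the make `Unknown Make` and the model names are invented): `how='left'` keeps every car, `validate` checks that each brand appears only once, and `indicator` flags unmatched rows.

```python
import pandas as pd

# Toy stand-ins for `cars` and `car_brands`
cars_toy = pd.DataFrame({
    'Make': ['Acura', 'Acura', 'Yugo', 'Unknown Make'],
    'Model': ['2.2CL', '2.3CL', 'GV', 'X'],
})
brands_toy = pd.DataFrame({'brand': ['Acura', 'Yugo'], 'revenue': [234, 508]})

merged = cars_toy.merge(
    brands_toy,
    left_on='Make',
    right_on='brand',
    how='left',               # keep cars even when the brand is missing
    validate='many_to_one',   # raise if a brand is listed more than once
    indicator=True,           # adds a `_merge` column describing each row's origin
)

# Makes that found no partner in the brands table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Make'].tolist()
print(unmatched)

merged = merged.drop(columns=['brand', '_merge'])
```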


Which brands have the most revenue?

In [54]:
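# Caution: after the merge, each brand's revenue repeats on every one of its rows,
# so .sum() multiplies the revenue by the brand's number of models.
car_df.groupby('Make').revenue.sum().sort_values(ascending=False).head(5)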

Make
Dodge        4071000
BMW          2782143
GMC          2501902
Chevrolet    1825143
Nissan       1684032
Name: revenue, dtype: int64
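
Because the merged frame repeats a brand's revenue on every one of its rows, summing inflates brands with many models. One way to rank brands by revenue alone is to keep a single row per make before sorting. The sketch below uses a toy frame and assumes `revenue` is a brand-level figure, as the `car_brands` table suggests.

```python
import pandas as pd

# Toy merged frame: revenue is repeated once per model of the brand
car_df_toy = pd.DataFrame({
    'Make': ['Acura', 'Acura', 'Acura', 'Yugo'],
    'revenue': [234, 234, 234, 508],
})

# Summing counts each brand's revenue once per model
summed = car_df_toy.groupby('Make')['revenue'].sum()

# Keeping one row per make recovers the brand-level revenue
per_brand = car_df_toy.drop_duplicates('Make').set_index('Make')['revenue']
top = per_brand.sort_values(ascending=False).head(5)

print(summed.idxmax())  # winner when revenue is multiplied by model count
print(top.index[0])     # winner by brand-level revenue
```

Sorting the brands table directly by `revenue` would give the same ranking without going through the merged frame.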