
# Introduction
Back in the day, cars were a luxury, and the only concern was "will you make it to your destination?". Nowadays we are far beyond that point. Multiple manufacturers have perfected the design of the internal combustion (IC) engine, so much so that we are now shifting to electric vehicles (notably, 2023 saw the last ever gas-powered Dodge Challenger, an icon of the IC era). At this point the race is for reliability, comfort and, of course, fuel economy. The latter matters not only because of the cost of gas, but also because of its environmental impact: even though electric cars are very much present on the roads today, IC vehicles are still the majority and take a significant toll on nature. That is why I decided to look into which types of cars provide the best fuel efficiency, to see which direction a manufacturer should take in order to reduce its ecological impact.

In [None]:
!pip install -U kaleido --quiet


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('/content/fuel.csv')
print(df.shape)
df.head()

(38113, 81)


  df = pd.read_csv('/content/fuel.csv')


Unnamed: 0,vehicle_id,year,make,model,class,drive,transmission,transmission_type,engine_index,engine_descriptor,...,hours_to_charge_ac_240v,composite_city_mpg,composite_highway_mpg,composite_combined_mpg,range_ft1,city_range_ft1,highway_range_ft1,range_ft2,city_range_ft2,highway_range_ft2
0,26587,1984,Alfa Romeo,GT V6 2.5,Minicompact Cars,,Manual 5-Speed,,9001,(FFS),...,0.0,0,0,0,0,0.0,0.0,,0.0,0.0
1,27705,1984,Alfa Romeo,GT V6 2.5,Minicompact Cars,,Manual 5-Speed,,9005,(FFS) CA model,...,0.0,0,0,0,0,0.0,0.0,,0.0,0.0
2,26561,1984,Alfa Romeo,Spider Veloce 2000,Two Seaters,,Manual 5-Speed,,9002,(FFS),...,0.0,0,0,0,0,0.0,0.0,,0.0,0.0
3,27681,1984,Alfa Romeo,Spider Veloce 2000,Two Seaters,,Manual 5-Speed,,9006,(FFS) CA model,...,0.0,0,0,0,0,0.0,0.0,,0.0,0.0
4,27550,1984,AM General,DJ Po Vehicle 2WD,Special Purpose Vehicle 2WD,2-Wheel Drive,Automatic 3-Speed,,1830,(FFS),...,0.0,0,0,0,0,0.0,0.0,,0.0,0.0


In [None]:
print('Columns:')
np.array(df.columns)

Columns:


array(['vehicle_id', 'year', 'make', 'model', 'class', 'drive',
       'transmission', 'transmission_type', 'engine_index',
       'engine_descriptor', 'engine_cylinders', 'engine_displacement',
       'turbocharger', 'supercharger', 'fuel_type', 'fuel_type_1',
       'fuel_type_2', 'city_mpg_ft1', 'unrounded_city_mpg_ft1',
       'city_mpg_ft2', 'unrounded_city_mpg_ft2',
       'city_gasoline_consumption_cd', 'city_electricity_consumption',
       'city_utility_factor', 'highway_mpg_ft1',
       'unrounded_highway_mpg_ft1', 'highway_mpg_ft2',
       'unrounded_highway_mpg_ft2', 'highway_gasoline_consumption_cd',
       'highway_electricity_consumption', 'highway_utility_factor',
       'unadjusted_city_mpg_ft1', 'unadjusted_highway_mpg_ft1',
       'unadjusted_city_mpg_ft2', 'unadjusted_highway_mpg_ft2',
       'combined_mpg_ft1', 'unrounded_combined_mpg_ft1',
       'combined_mpg_ft2', 'unrounded_combined_mpg_ft2',
       'combined_electricity_consumption',
       'combined_gasoline_

We see that there are a lot of columns describing each car in the dataset, including its specifications, type, make and model, as well as information about its engine and transmission. All of them could be useful, but there are too many to keep, so we will discuss later which ones to retain.

# Establishing the target and cleaning the data

Our main objective here is CO2 emissions, so we will use ghg_score and ghg_score_alt_fuel (an EPA score for tailpipe CO2 emissions) or tailpipe_co2_in_grams_mile_ft1 and tailpipe_co2_in_grams_mile_ft2 as our measures. There are two of each because one car may support multiple fuel types, e.g. the same engine may run on either "petrol" or a natural gas such as methane. We care about grams per mile rather than a raw CO2 figure because cars are used for transportation, and we are more concerned with how much CO2 a car produces over the distance it travels in its lifetime than with some synthetic CO2 measure. The GHG score also takes that into account, but it is produced by professionals and presented as a simple 0-10 score.

With that said, it is easier to work with a single measure, so let us take a look at the fuel types present in the dataset:

In [None]:
df.fuel_type.unique()

array(['Regular', 'Diesel', 'Premium', 'CNG', 'Electricity',
       'Gasoline or natural gas', 'Gasoline or E85',
       'Gasoline or propane', 'Premium or E85',
       'Premium Gas or Electricity', 'Midgrade',
       'Regular Gas and Electricity', 'Premium and Electricity',
       'Regular Gas or Electricity'], dtype=object)

In [None]:
print('Cars with natural gas as fuel type 2:', sum(df.fuel_type.apply(lambda x: 'natural gas' in x)))
print('Cars with electric fuel type 2:', sum(
        df.fuel_type.apply(lambda x: 'or Electricity' in x or 'and Electricity' in x)
    )
)
print('Fully electric cars:', sum(df.fuel_type == 'Electricity'))
print('Cars with E85 as fuel type 2:', sum(df.fuel_type.apply(lambda x: 'or E85' in x)))

Cars with natural gas as fuel type 2: 20
Cars with electric fuel type 2: 65
Fully electric cars: 133
Cars with E85 as fuel type 2: 1345


It is clear that it is safe for us to use only the first fuel type because:

*   Natural gas is very rare as an alternative fuel and is usually very complicated to switch to, so for the purposes of this analysis we will remove the cars with this alternative fuel type.
*   There are very few cars with alternative electric power, and since we are interested only in emissions, we are better off removing those cars too.
*   For the same reason we will remove fully electric vehicles, as they obviously have 0 tailpipe emissions (this is debatable because of the emissions produced during manufacturing, but that is out of scope here).
*   E85 is generally meant for a higher-performance mode and is generally not recommended by manufacturers for extensive use. Additionally, E85 compatibility always comes paired with compatibility with regular fuel, making E85 a rarer and more expensive alternative that many people will never use.
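
The removals described in this list could be sketched as follows. This is a minimal illustration on a small hand-made frame: the rows and the names toy, drop_patterns and kept are hypothetical, and only the fuel_type labels come from the dataset. It takes one possible reading of the plan, namely dropping natural gas, hybrid-electric and fully electric cars while keeping E85-capable cars and judging them by their first fuel.

```python
import pandas as pd

# Hypothetical rows; only the fuel_type labels are taken from the dataset.
toy = pd.DataFrame({
    'model': ['a', 'b', 'c', 'd', 'e', 'f'],
    'fuel_type': ['Regular', 'Gasoline or natural gas', 'Electricity',
                  'Regular Gas and Electricity', 'Gasoline or E85', 'Diesel'],
})

# Fuel types that contain any of these substrings are dropped.
# Matching 'Electricity' covers fully electric cars as well as the
# 'or Electricity' / 'and Electricity' combinations.
drop_patterns = ['natural gas', 'Electricity']
to_drop = toy.fuel_type.str.contains('|'.join(drop_patterns), regex=True)

kept = toy[~to_drop].reset_index(drop=True)
print(kept.fuel_type.tolist())
```

Matching on substrings keeps the rule short, but it would also catch any future label that merely mentions one of the patterns, so an explicit list of labels may be safer on the full dataset.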





In [None]:
zero_em = df.fuel_type[df.tailpipe_co2_ft1 == 0].unique()
if len(zero_em) == 1 and zero_em[0] == 'Electricity':
    print('Electric cars are the only ones with 0 emissions! Just like suspected.')
df = df[df.tailpipe_co2_ft1 != 0]
df.tailpipe_co2_in_grams_mile_ft1.min()

Electric cars are the only ones with 0 emissions! Just like suspected.


29.0

In [None]:
import plotly.io as pio
pio.renderers.default = 'colab'

Having settled on the metrics let us look at their distribution:

In [None]:
import plotly.express as px

fig = px.histogram(df, x="ghg_score")
fig.update_layout(title='GHG score distribution')
fig.show()

In [None]:
print('Unique dates for cars with GHG == -1:')
print(df[df.ghg_score == -1].sort_values(by='year').year.unique())
print('Cars made after 2001 with GHG == -1:', sum(((df.ghg_score == -1) & (df.year >= 2001))))

Unique dates for cars with GHG == -1:
[1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
 2012]
Cars made after 2001 with GHG == -1: 13154


We see a lot of -1 values, which clearly means that the test was not conducted for the vehicle in question. We could get rid of those rows, but the output above shows that, although the GHG score was first introduced in 2001, more than 10,000 cars produced since then did not receive a rating. This is a considerable part of the dataset, so removing those vehicles just because they lack the score is not a good idea. We can still use the score to find patterns in the data, because where it is present it still provides valuable information.
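
One way to keep these rows while preventing the -1 placeholder from distorting statistics is to convert it into a proper missing value. The sketch below uses a hypothetical five-element series (the names scores and clean are illustrative) and assumes, as discussed above, that -1 always means "not rated".

```python
import numpy as np
import pandas as pd

# Hypothetical scores; -1 marks "not rated", as in the dataset.
scores = pd.Series([5, -1, 7, -1, 6], name='ghg_score')

# Turn the sentinel into a real missing value so that it is skipped
# by mean(), median() and similar aggregations.
clean = scores.replace(-1, np.nan)

print(clean.isna().sum())   # number of unrated cars
print(clean.mean())         # mean over rated cars only
```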

In [None]:
dfg = df[df.ghg_score > -1]
fig = px.histogram(dfg, x="ghg_score")
fig.update_layout(title='GHG score distribution')
fig.show()

Additionally, let us check whether any other columns contain -1:

In [None]:
for i in df.columns:
    if -1 in df[i].values:
        print(i)

tailpipe_co2_ft1
tailpipe_co2_ft2
fuel_economy_score
ghg_score
ghg_score_alt_fuel




*   tailpipe_co2_ft1 and tailpipe_co2_ft2 are redundant, as we already have the per-mile measure, which doesn't have missing values, so we can remove them.
*   fuel_economy_score will be highly correlated with ghg_score, but its focus is not the same as ours. The main measurement for us is the amount of CO2 produced, while range and fuel consumption are out of scope. We can remove it.
*   ghg_score has already been investigated, and we agreed not to take ghg_score_alt_fuel into account, so the latter can be removed.

Before deleting them we should also address the following:

When reading a dataset, pandas automatically detects the data type of each column if the whole column is of the same type. This time, however, we saw a warning stating that it was not able to do so: "DtypeWarning: Columns (7,44) have mixed types." This has to be addressed so that we can work with this data confidently:



In [None]:
df[[df.columns[7], df.columns[44]]].dropna()

Unnamed: 0,transmission_type,gas_guzzler_tax
3209,2MODE,True
3693,2MODE,True
3696,2MODE,True
3698,2MODE,True
4888,CLKUP,True
...,...,...
21392,EMS 2MODE CLKUP,True
21393,EMS 2MODE CLKUP,True
21446,VMODE,True
21538,2MODE,True


We see that these columns are almost entirely null, which means that they can't be used for our purposes and can be removed too. On that note, let's remove all columns in which 50% or more of the values are null (as they are of no use to us), along with columns that will not be useful based on their meaning.
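
The 50% rule can be expressed compactly with the share of nulls per column. The following is a small sketch on a hypothetical two-column frame (toy, null_share and sparse_columns are illustrative names), not a replacement for the cells in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'sparse' is 75% null, 'dense' has no nulls.
toy = pd.DataFrame({
    'dense': [1.0, 2.0, 3.0, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 1.0],
})

# isna() gives a boolean frame; its column-wise mean is the share of nulls.
null_share = toy.isna().mean()
sparse_columns = null_share[null_share >= 0.5].index.tolist()

toy = toy.drop(columns=sparse_columns)
print(sparse_columns, list(toy.columns))
```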

In [None]:
# removing the redundant columns containing -1, the second-fuel CO2 measure and the two mixed-type columns
df = df.drop(columns=['tailpipe_co2_ft1','tailpipe_co2_ft2','fuel_economy_score','ghg_score_alt_fuel','tailpipe_co2_in_grams_mile_ft2',df.columns[7],df.columns[44]])

In [None]:
# removing columns with >=50% nans
columns_to_remove = []
for i in df.columns:
    if len(df[i].dropna())/len(df) <= .5:
        columns_to_remove.append(i)
print(f'Will remove {len(columns_to_remove)} columns because of too many nan values')
df = df.drop(columns=columns_to_remove)

Will remove 10 columns because of too many nan values


In [None]:
# removing non-useful columns
df = df.drop(columns=['vehicle_id','engine_descriptor','city_mpg_ft1',
                      'city_mpg_ft2','unrounded_city_mpg_ft2','x2d_passenger_volume',
                      'x2d_luggage_volume','x4d_passenger_volume','x4d_luggage_volume',
                      'hatchback_passenger_volume', 'hatchback_luggage_volume',
                      'gasoline_electricity_blended_cd', 'hours_to_charge_120v',
                      'hours_to_charge_240v', 'hours_to_charge_ac_240v',
                      'composite_city_mpg', 'composite_highway_mpg','city_range_ft1',
                      'highway_range_ft1', 'city_range_ft2', 'highway_range_ft2',
                      'my_mpg_data','save_or_spend_5_year','annual_consumption_in_barrels_ft1',
                      'annual_consumption_in_barrels_ft2','annual_fuel_cost_ft2',
                      'combined_electricity_consumption','combined_mpg_ft1','combined_mpg_ft2',
                      'unadjusted_city_mpg_ft1', 'unadjusted_highway_mpg_ft1',
                      'unadjusted_city_mpg_ft2', 'unadjusted_highway_mpg_ft2','fuel_type_1',
                      'unrounded_city_mpg_ft1','city_gasoline_consumption_cd', 'city_electricity_consumption',
                      'city_utility_factor', 'highway_mpg_ft1',
                      'unrounded_highway_mpg_ft1', 'highway_mpg_ft2',
                      'unrounded_highway_mpg_ft2', 'highway_gasoline_consumption_cd',
                      'highway_electricity_consumption', 'highway_utility_factor',
                      'unrounded_combined_mpg_ft2','composite_combined_mpg','combined_utility_factor',
                      'combined_gasoline_consumption_cd',
                     ])

Also, let's make sure that the remaining columns don't have zeroes in them. A zero does not necessarily mean that a value is missing (it depends on the context), but where it does, we should be aware of that when fitting a model.



In [None]:
for i in df.columns:
    if 0 in df[i].values:
        print(i)

engine_index
unrounded_combined_mpg_ft1
range_ft1


It is a good thing that we looked into this, because none of these can legitimately be 0 (the utility_factor columns, where a 0 might have been valid, have already been dropped). In future analysis I will make sure that these zeroes are dropped, but we will not remove these columns or rows now because the other data in them may be useful.
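
Dropping the zeroes only where a particular column is needed could look like the sketch below. The rows are hypothetical (column names follow the dataset), and the names required and usable are illustrative; the idea is to filter a working copy and leave the original frame untouched.

```python
import pandas as pd

# Hypothetical rows; a 0 stands for "value not filled in".
toy = pd.DataFrame({
    'unrounded_combined_mpg_ft1': [23.4, 0.0, 29.8, 18.1],
    'tailpipe_co2_in_grams_mile_ft1': [380.0, 410.0, 297.0, 490.0],
})

# Columns that must be strictly positive for the analysis at hand.
required = ['unrounded_combined_mpg_ft1']

# Keep the original frame intact and filter only the working copy.
usable = toy[(toy[required] > 0).all(axis=1)]
print(len(toy), len(usable))
```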

In [None]:
fig = px.histogram(df, x="tailpipe_co2_in_grams_mile_ft1")
fig.update_layout(title='Distribution of per mile CO2 production')
fig.show()

In [None]:
print('List of columns:')
np.array(df.columns)

List of columns:


array(['year', 'make', 'model', 'class', 'drive', 'transmission',
       'engine_index', 'engine_cylinders', 'engine_displacement',
       'fuel_type', 'unrounded_combined_mpg_ft1', 'annual_fuel_cost_ft1',
       'tailpipe_co2_in_grams_mile_ft1', 'ghg_score', 'range_ft1'],
      dtype=object)




Summarizing the columns that we have left:

*   year int: year the vehicle was manufactured in
*   make string: company producing the car
*   model string: model name for the car
*   class string: class of the vehicle (e.g. Vans, Compact Cars, etc.)
*   drive string: type of drivetrain (which wheels receive power and how)
*   transmission string: type of transmission (e.g. Automatic 3-Speed, Manual 5-Speed, etc.)
*   engine_index int: unique engine identifier
*   engine_cylinders float: number of cylinders in the engine
*   engine_displacement float: displacement of the engine (e.g. 3 liters, 4.5 liters, etc.)
*   fuel_type string: type of fuel a car uses (e.g. Regular, Premium, Diesel, etc.)
*   unrounded_combined_mpg_ft1 float: car fuel efficiency in miles per gallon
*   annual_fuel_cost_ft1 int: how much it will cost you on average to own this car (in terms of fuel), in USD per year
*   tailpipe_co2_in_grams_mile_ft1 float: amount of CO2 produced by the car in grams per mile
*   ghg_score int: GHG score
*   range_ft1 int: how many miles a car can go on a full fuel tank





# Analysis

# The Engine

In [None]:
import plotly.graph_objects as go
from IPython.display import HTML, display

mean_co2_cylinders = df.groupby('engine_cylinders')['tailpipe_co2_in_grams_mile_ft1'].mean()
mean_co2_displacement = df.groupby('engine_displacement')['tailpipe_co2_in_grams_mile_ft1'].mean()

scatter_trace1 = go.Scatter(
    x=df.engine_cylinders,
    y=df.tailpipe_co2_in_grams_mile_ft1,
    mode='markers',
    marker=dict(
        size=5,
        color='blue',
    ),
    name='Tailpipe CO2'
)

mean_trace1 = go.Scatter(
    x=mean_co2_cylinders.index,
    y=mean_co2_cylinders,
    mode='markers',
    marker=dict(
        size=8,
        color='lime',
        symbol='triangle-up'
    ),
    name='Mean CO2'
)

scatter_trace2 = go.Scatter(
    x=df.engine_displacement,
    y=df.tailpipe_co2_in_grams_mile_ft1,
    mode='markers',
    marker=dict(
        size=5,
        color='blue',
    ),
    name='Tailpipe CO2'
)

mean_trace2 = go.Scatter(
    x=mean_co2_displacement.index,
    y=mean_co2_displacement,
    mode='markers',
    marker=dict(
        size=8,
        color='lime',
        symbol='triangle-up'
    ),
    name='Mean CO2'
)

layout1 = go.Layout(
    xaxis=dict(title='Cylinders'),
    yaxis=dict(title='CO2')
)

layout2 = go.Layout(
    xaxis=dict(title='Displacement'),
    yaxis=dict(title='CO2')
)

fig1 = go.Figure(data=[scatter_trace1, mean_trace1], layout=layout1)
fig2 = go.Figure(data=[scatter_trace2, mean_trace2], layout=layout2)
fig1.update_layout(title='CO2 vs Cylinders')
fig2.update_layout(title='CO2 vs Displacement')

INTERACTABLE_PLOTS = True # Set to True if you want interactable plots, but it WILL take a toll on your computer

if INTERACTABLE_PLOTS:
    fig1.show()
    display(HTML('<a id="plots1"></a>'))
    fig2.show()
else:
    fig1.show(renderer='png')
    display(HTML('<a id="plots1"></a>'))
    fig2.show(renderer='png')

Now, it comes as no surprise that the bigger the motor, the more fuel it burns and the more CO2 it produces (and, obviously, more cylinders will generally mean higher displacement too). To get an idea of the effects of these two factors on CO2 emissions we can fit a linear regression:

In [None]:
import statsmodels.api as sm
x = df[['engine_cylinders','engine_displacement','tailpipe_co2_in_grams_mile_ft1']].dropna()
x, y = sm.add_constant(x.drop(columns='tailpipe_co2_in_grams_mile_ft1')), \
                       x.tailpipe_co2_in_grams_mile_ft1
model = sm.OLS(y, x).fit()
model.summary()

0,1,2,3
Dep. Variable:,tailpipe_co2_in_grams_mile_ft1,R-squared:,0.645
Model:,OLS,Adj. R-squared:,0.645
Method:,Least Squares,F-statistic:,34490.0
Date:,"Sat, 31 Aug 2024",Prob (F-statistic):,0.0
Time:,06:33:44,Log-Likelihood:,-215770.0
No. Observations:,37977,AIC:,431600.0
Df Residuals:,37974,BIC:,431600.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,223.4744,1.323,168.926,0.000,220.881,226.067
engine_cylinders,9.8327,0.484,20.324,0.000,8.884,10.781
engine_displacement,58.6367,0.622,94.199,0.000,57.417,59.857

0,1,2,3
Omnibus:,5884.877,Durbin-Watson:,0.761
Prob(Omnibus):,0.0,Jarque-Bera (JB):,26666.735
Skew:,0.694,Prob(JB):,0.0
Kurtosis:,6.863,Cond. No.,26.9


In [None]:
scatter_trace = go.Scatter3d(
    x=df.engine_cylinders,
    y=df.engine_displacement,
    z=df.tailpipe_co2_in_grams_mile_ft1,
    mode='markers',
    marker=dict(
        size=2,
        color='blue',
    ),
    name = 'CO2 level'
)


x = np.linspace(df.engine_cylinders.min(), df.engine_cylinders.max(), 100)
y = np.linspace(df.engine_displacement.min(), df.engine_displacement.max(), 100)

prediction_trace = go.Scatter3d(
    x=x,
    y=y,
    z=model.predict(
        np.column_stack((sm.add_constant(x),y))
    ),
    mode='lines',
    line=dict(
        width=10,
        color='red'
    ),
    name='Predictions'
)

layout = go.Layout(
    scene=dict(
        xaxis=dict(title='Cylinders'),
        yaxis=dict(title='Displacement'),
        zaxis=dict(title='CO2'),
        camera=dict(
            eye=dict(x=1.5, y=-1.5, z=1.5),
            center=dict(x=0, y=0, z=0)
        )
    )
)


fig = go.Figure(data=[scatter_trace, prediction_trace], layout=layout)
fig.update_layout(title='Predicted CO2 based on cylinders and displacement')


fig.show()

We can tell that the relation is prominent from the rather large $R^2$ value of 0.645, and we can also see it visually in the plot above. More importantly, both variables are statistically significant, because their respective p-values are close to 0. In addition, the coefficient of the displacement variable is almost 6 times higher than that of cylinders. Since both variables describe engine size, this imbalance may indicate that the two are highly correlated and that the model attributes their shared effect mostly to one of them, which should be investigated.
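
To make the coefficients more tangible, the fitted equation can be evaluated by hand. The sketch below copies the three coefficients from the regression summary above; the function name predict_co2 and the example engine (6 cylinders, 3.0 liters) are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the regression summary above.
CONST = 223.4744
COEF_CYLINDERS = 9.8327
COEF_DISPLACEMENT = 58.6367

def predict_co2(cylinders, displacement):
    """Fitted CO2 in grams per mile for a given engine configuration."""
    return CONST + COEF_CYLINDERS * cylinders + COEF_DISPLACEMENT * displacement

# Hypothetical engine: 6 cylinders, 3.0 liters.
base = predict_co2(6, 3.0)
# Same cylinder count with one more liter of displacement.
bigger = predict_co2(6, 4.0)

print(round(base, 1), round(bigger - base, 1))
```

Under this model, one extra liter of displacement adds about as much CO2 per mile as six extra cylinders would, which is the imbalance discussed above.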



In [None]:
dff = df.dropna(subset=['engine_cylinders','engine_displacement'])
print('Correlation between engine_cylinders and engine_displacement:')
np.corrcoef(dff.engine_cylinders, dff.engine_displacement)[0,1]

Correlation between engine_cylinders and engine_displacement:


0.9029127826783077

As predicted, there is very high correlation between displacement and cylinders. This means that it is not a good idea to use both of these variables for the machine learning models, at least without any modifications.
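
One common way to quantify this concern is the variance inflation factor (VIF), which for a model with exactly two predictors reduces to 1 / (1 - r^2), where r is their correlation. The sketch below plugs in the correlation printed above. The usual rule of thumb that values above roughly 5 deserve attention is a convention, not something established by this analysis.

```python
# Correlation between engine_cylinders and engine_displacement,
# copied from the output above.
r = 0.9029127826783077

# With exactly two predictors, each one's VIF is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)
print(round(vif, 2))
```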




We can also look into the GHG score metric which we saw earlier. The test includes multiple factors, but we can show that GHG score is based on the CO2 emission level:


In [None]:
dff = df[df.ghg_score > 0].sort_values(by=['tailpipe_co2_in_grams_mile_ft1','ghg_score'])

x = dff[['tailpipe_co2_in_grams_mile_ft1','ghg_score']].dropna()
x, y = sm.add_constant(x.drop(columns='ghg_score')), \
                       x.ghg_score
model = sm.OLS(y, x).fit()
model.summary()

0,1,2,3
Dep. Variable:,ghg_score,R-squared:,0.915
Model:,OLS,Adj. R-squared:,0.915
Method:,Least Squares,F-statistic:,64830.0
Date:,"Sat, 31 Aug 2024",Prob (F-statistic):,0.0
Time:,06:36:59,Log-Likelihood:,-4636.0
No. Observations:,6023,AIC:,9276.0
Df Residuals:,6021,BIC:,9289.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,12.0165,0.028,434.167,0.000,11.962,12.071
tailpipe_co2_in_grams_mile_ft1,-0.0166,6.52e-05,-254.617,0.000,-0.017,-0.016

0,1,2,3
Omnibus:,145.071,Durbin-Watson:,0.414
Prob(Omnibus):,0.0,Jarque-Bera (JB):,185.164
Skew:,0.303,Prob(JB):,6.2e-41
Kurtosis:,3.61,Cond. No.,1750.0



As we can see, the large $R^2$ indicates that the model fits very well, and the p-value for the tailpipe_co2_in_grams_mile_ft1 variable is near 0, which means that this variable is extremely likely to be significant in predicting the GHG score. This is exactly what we are trying to show. We can also see this relation on the plot:

In [None]:
fig = go.Figure()

fig.add_trace( go.Scatter(
    x=dff.tailpipe_co2_in_grams_mile_ft1,
    y=dff.ghg_score,
    mode='markers',
    marker=dict(
        size=10,
        color='blue',
    ),
    name='GHG score'
) )

fig.add_trace( go.Scatter(
    x=dff.tailpipe_co2_in_grams_mile_ft1,
    y=model.predict(
        x
    ),
    mode='lines',
    line=dict(
        width=5,
        color='red',
    ),
    name='Predicted GHG score'
) )

fig.update_layout(
    title='GHG score by CO2 emissions',
    xaxis=dict(title='CO2 emissions'),
    yaxis=dict(title='GHG score')
)

fig.show()


# Patterns in the data

In the displacement plot we can clearly see that many cars manage to keep their CO2 emissions below 400 grams per mile (which is common for engines with a displacement of 2 liters and below) despite having a displacement of 3 liters or more. Given our previous findings this may seem counter-intuitive, so let us look into how this is possible:

In [None]:
le = df[(df.engine_displacement >= 3) \
   & (df.tailpipe_co2_in_grams_mile_ft1 <= 400)]
le.sample(5)

Unnamed: 0,year,make,model,class,drive,transmission,engine_index,engine_cylinders,engine_displacement,fuel_type,unrounded_combined_mpg_ft1,annual_fuel_cost_ft1,tailpipe_co2_in_grams_mile_ft1,ghg_score,range_ft1
35231,2015,Mercedes-Benz,E400,Midsize Cars,Rear-Wheel Drive,Automatic 7-Speed,302,6.0,3.0,Premium,23.3505,1850,380.0,5,0
37671,2017,Lexus,RX 450h AWD,Standard Sport Utility Vehicle 4WD,All-Wheel Drive,Auto(AV-S6),85,6.0,3.5,Premium,29.7691,1400,297.0,7,0
32815,2013,Mercedes-Benz,SLK350,Two Seaters,Rear-Wheel Drive,Automatic 7-Speed,236,6.0,3.5,Premium,23.9364,1750,371.0,6,0
33673,2014,Honda,Crosstour 2WD,Small Sport Utility Vehicle 2WD,Front-Wheel Drive,Automatic (S6),28,6.0,3.5,Regular,23.1781,1500,384.0,6,0
34479,2015,BMW,435i xDrive Gran Coupe,Compact Cars,All-Wheel Drive,Automatic (S8),457,6.0,3.0,Premium,23.5371,1750,378.0,6,0


We can look into what traits are common for these vehicles:

In [None]:
from collections import Counter

to_investigate = ['class','drive','transmission','engine_cylinders',
                  'fuel_type','annual_fuel_cost_ft1','ghg_score',
                 ]

for i in to_investigate:
    if le[i].dtype == 'int64' or le[i].dtype == 'float64':
        # -1 marks a missing score, so turn it into NaN before aggregating
        values = le[i].replace(-1, np.nan).dropna()
        print(f'Mean   value for {i}: {values.mean()}')
        print(f'Median value for {i}: {values.median()}')
    else:
        uni = Counter(le[i].dropna())
        top_entries = uni.most_common(3)
        print(f'Top 3 entries for {i}:')
        total_entries = len(le[i])
        for entry, count in top_entries:
            percentage = count / total_entries
            print(f'\t{entry}: {count} ({percentage:%})')

Top 3 entries for class:
	Midsize Cars: 196 (28.201439%)
	Compact Cars: 124 (17.841727%)
	Subcompact Cars: 96 (13.812950%)
Top 3 entries for drive:
	Rear-Wheel Drive: 320 (46.043165%)
	Front-Wheel Drive: 181 (26.043165%)
	All-Wheel Drive: 128 (18.417266%)
Top 3 entries for transmission:
	Automatic (S8): 169 (24.316547%)
	Automatic 7-Speed: 103 (14.820144%)
	Automatic (S6): 81 (11.654676%)
Mean   value for engine_cylinders: 6.00863309352518
Median value for engine_cylinders: 6.0
Top 3 entries for fuel_type:
	Premium: 420 (60.431655%)
	Regular: 171 (24.604317%)
	Diesel: 38 (5.467626%)
Mean   value for annual_fuel_cost_ft1: 1670.9352517985612
Median value for annual_fuel_cost_ft1: 1700.0
Mean   value for ghg_score: 5.640144665461121
Median value for ghg_score: 5.0


Now, this does not give us a clear answer as to whether these parameters have a large effect on emissions, because this is a very surface-level analysis and each parameter needs a more in-depth look, but it gives us an idea of what our next steps could be.



# Powered wheels effect

Power distribution to the wheels clearly has something to do with CO2 emissions. While the drive type is unlikely to have a direct effect, it is likely that using RWD or FWD reduces fuel consumption per mile and thus reduces the emissions level.

In [None]:
print('Unique drive types:')
df.drive.dropna().unique()

Unique drive types:


array(['2-Wheel Drive', '4-Wheel or All-Wheel Drive', 'Rear-Wheel Drive',
       'Front-Wheel Drive', '4-Wheel Drive', 'All-Wheel Drive',
       'Part-time 4-Wheel Drive'], dtype=object)

In [None]:
dff = df[df.unrounded_combined_mpg_ft1 > 0]
rwd = dff[dff.drive == 'Rear-Wheel Drive']
# in '4-Wheel or All-Wheel Drive' cars 4wd is always secondary and only used for offroading
awd = dff[(dff.drive == 'All-Wheel Drive') | (dff.drive == '4-Wheel or All-Wheel Drive')]
fwd = dff[dff.drive == 'Front-Wheel Drive']
fourwd = dff[dff.drive == '4-Wheel Drive']

In [None]:
fig = go.Figure()
hovertemplate = '%{y}<br>%{x}'

fig.add_trace(
    go.Histogram(
        x=fwd.unrounded_combined_mpg_ft1, name='FWD mpg',
        marker={'color':'green','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=rwd.unrounded_combined_mpg_ft1, name='RWD mpg',
        marker={'color':'cornflowerblue','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=awd.unrounded_combined_mpg_ft1, name='AWD mpg',
        marker={'color':'yellow','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=fourwd.unrounded_combined_mpg_ft1, name='4WD mpg',
        marker={'color':'red','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.update_layout(barmode='group', title='FWD vs RWD vs AWD vs 4WD mpg',
    xaxis=dict(title='mpg'),
    yaxis=dict(title='Proportion'))
fig.update_xaxes(range=[10, 30])

fig.show()

Showing all drive types at once is a little cluttered, but it is clear that front-wheel drive is the absolute winner in miles per gallon (mpg). Based on my research, the biggest advantage of FWD cars is that they can ditch the complex drivetrain needed to move power from the engine to all four wheels or to just the rear wheels (since engines are generally positioned at the front of the vehicle). Not only is this better in terms of weight, it also means fewer losses, since moving and distributing power is not 100% efficient. In fact, typical drivetrain efficiency is about 84.4% (see analysis here).

Let us have a look at the remaining drive types:

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x=awd.unrounded_combined_mpg_ft1, name='AWD mpg',
        marker={'color':'yellow','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=fourwd.unrounded_combined_mpg_ft1, name='4WD mpg',
        marker={'color':'red','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.update_layout(barmode='group', title='AWD vs 4WD mpg',
    xaxis=dict(title='mpg'),
    yaxis=dict(title='Proportion'))
fig.update_xaxes(range=[10, 30])

fig.show()


We can observe that 4WD is skewed towards lower mpg values in comparison to AWD. Based on what I could find, the reason for this is that AWD systems generally control intelligently whether to use all four wheels or not, which makes them more efficient. In addition, 4WD systems include features like differential locks, which increase the weight of the vehicle.

Finally, let us compare AWD and RWD:

In [None]:
fig = go.Figure()
hovertemplate = '%{y}<br>%{x}'

fig.add_trace(
    go.Histogram(
        x=rwd.unrounded_combined_mpg_ft1, name='RWD mpg',
        marker={'color':'cornflowerblue','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=awd.unrounded_combined_mpg_ft1, name='AWD mpg',
        marker={'color':'yellow','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.update_layout(barmode='group', title='RWD vs AWD mpg',
    xaxis=dict(title='mpg'),
    yaxis=dict(title='Proportion'))
fig.update_xaxes(range=[10, 30])

fig.show()

It seems that AWD is actually better in terms of economy. This contradicts what I previously found regarding drivetrain losses: RWD is supposed to be simpler than AWD, as it not only supplies power to just two wheels but also lacks the hardware to manage how power is distributed between the axles. Let's see if this is caused by an uneven distribution of displacement:

In [None]:
fig = go.Figure()
hovertemplate = '%{y}<br>%{x}'

fig.add_trace(
    go.Histogram(
        x=rwd.engine_displacement, name='RWD displacement',
        marker={'color':'cornflowerblue','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.add_trace(
    go.Histogram(
        x=awd.engine_displacement, name='AWD displacement',
        marker={'color':'yellow','line':{'width':0}},
        hovertemplate=hovertemplate, histnorm='probability'
    ))
fig.update_layout(barmode='group', title='RWD vs AWD displacement',
    xaxis=dict(title='Displacement'),
    yaxis=dict(title='Proportion'))

fig.show()


Yes, it seems that in this dataset there are a lot of low-displacement AWD cars, which skews our findings. This is great information to be aware of: when fitting models to this data we will have to adjust for this to avoid introducing bias into the model.
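
One simple way to adjust for this is to compare drive types only within bands of similar displacement. The sketch below uses six made-up cars (column names follow the dataset; the values, the bin edges and the shortened drive labels are hypothetical) to show how a raw comparison and a within-band comparison can point in opposite directions.

```python
import pandas as pd

# Hypothetical cars; the column names follow the dataset.
toy = pd.DataFrame({
    'drive': ['RWD', 'RWD', 'RWD', 'AWD', 'AWD', 'AWD'],
    'engine_displacement': [2.0, 5.0, 5.7, 2.0, 2.4, 5.0],
    'unrounded_combined_mpg_ft1': [27.0, 17.0, 15.0, 25.0, 24.0, 15.0],
})

# Raw comparison: mixes small and large engines together.
raw = toy.groupby('drive')['unrounded_combined_mpg_ft1'].mean()

# Adjusted comparison: bin by displacement first, then compare within bins.
toy['displacement_bin'] = pd.cut(toy.engine_displacement, bins=[0, 3, 8],
                                 labels=['small', 'large'])
adjusted = (toy.groupby(['displacement_bin', 'drive'], observed=True)
               ['unrounded_combined_mpg_ft1'].mean().unstack())
print(raw)
print(adjusted)
```

In this toy data AWD looks better overall only because it has more small engines; within each displacement band RWD comes out ahead. Whether the same holds for the real dataset would need to be checked.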

# Transmission effect

The gearbox is an important factor in fuel economy, and thus in CO2 emissions, because it directly influences "how hard the engine has to work", so it is definitely an interesting thing to consider. However, it is important to take into account which engine the transmission is paired with, because different engines may respond differently. That is why we will group the transmission effects by engine_index:

In [None]:
from tqdm import tqdm

best_transmission = {}


for engine in tqdm(df.engine_index.unique()):
    sli = df[df.engine_index == engine].copy()
    # a 0 mpg value means "not filled in", so treat it as missing
    sli.unrounded_combined_mpg_ft1 = sli.unrounded_combined_mpg_ft1.replace(0, np.nan)
    sli = sli.dropna(subset=['unrounded_combined_mpg_ft1'])
    if len(sli) < 5: continue
    # the highest mpg corresponds to the lowest CO2 per mile for this engine
    best_mpg = sli.unrounded_combined_mpg_ft1.max()
    for i in sli.transmission[sli.unrounded_combined_mpg_ft1 == best_mpg].dropna().unique():
        if i not in best_transmission:
            best_transmission[i] = 0
        best_transmission[i] += 1

# convert counts into shares of all "best" entries
total = sum(best_transmission.values())
for i in best_transmission:
    best_transmission[i] = best_transmission[i]/total

100%|██████████| 2645/2645 [00:08<00:00, 309.32it/s]


In [None]:
fig = go.Figure()

x, y = list(best_transmission.keys()), np.array(list(best_transmission.values()))*100
sorted_lists = sorted(zip(y, x), key=lambda pair: pair[0], reverse=True)
y, x = zip(*sorted_lists)

fig.add_trace(
    go.Bar(
        x=x, y=y, name='% of best for engine',
        marker={'color':'darkorange','line':{'width':0}},
        hovertemplate=hovertemplate, offsetgroup=0
    ))
fig.update_xaxes(tickangle=90)
fig.show()

Above, we looked at which transmission gave the best fuel economy (and therefore the least CO2 per mile) for each engine. We then counted for how many engines each transmission was the best. Automatic transmissions make up the majority of the list and dominate the top spots. This makes sense, because electronically controlled shifts can be far more precise than those made by a driver, so less fuel is burned for the same work. Interestingly, a higher number of gears does not necessarily reduce CO2 emissions, but 6-speed appears to be a "sweet spot", even for manual transmissions.
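The same "best transmission per engine" count can also be sketched without an explicit loop, using `groupby`. The column names below mirror the dataset, but the rows are made up for illustration, and the sketch ranks by CO2 directly rather than by mpg.

```python
import pandas as pd

# Toy rows with invented values; only the column names come from the dataset
toy = pd.DataFrame({
    'engine_index': [1, 1, 1, 2, 2, 3, 3],
    'transmission': ['Manual 5-Speed', 'Automatic 4-Speed', 'Automatic 6-Speed',
                     'Manual 5-Speed', 'Automatic 6-Speed',
                     'Manual 6-Speed', 'Automatic 6-Speed'],
    'tailpipe_co2_in_grams_mile_ft1': [400.0, 420.0, 380.0, 350.0, 340.0, 300.0, 300.0],
})

# Lowest CO2 value observed for each engine, broadcast back to every row
best_per_engine = toy.groupby('engine_index')['tailpipe_co2_in_grams_mile_ft1'].transform('min')

# Keep the (engine, transmission) pairs that reach that minimum; ties are all kept
is_best = toy['tailpipe_co2_in_grams_mile_ft1'] == best_per_engine
winners = toy.loc[is_best, ['engine_index', 'transmission']].drop_duplicates()

# Share of all "wins" that belongs to each transmission
win_share = winners['transmission'].value_counts(normalize=True)
print(win_share)
```

Because ties are kept, an engine can hand out more than one "win", so the shares are fractions of all wins rather than of all engines.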

Additionally, let's fit a model on a few car characteristics to see whether transmission has a considerable effect on CO2 emissions:

In [None]:
from sklearn.ensemble import RandomForestRegressor

dff = df[['engine_displacement','year','class','transmission','tailpipe_co2_in_grams_mile_ft1']].dropna()
# One-hot encode the categorical columns (new columns are named '<feature>_<value>')
dff = pd.get_dummies(dff, columns=['class','transmission'], dtype=int)

# Fixed seed so that the importances are reproducible between runs
rf = RandomForestRegressor(n_estimators=100, random_state=0)

x, y = dff.drop(columns='tailpipe_co2_in_grams_mile_ft1'), dff.tailpipe_co2_in_grams_mile_ft1
rf.fit(x, y)

# One importance value per column of x, in the same order as x.columns
feature_importances = rf.feature_importances_

In [None]:
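fig = go.Figure()

# Pair each feature name with its importance and keep the 10 most important ones
sorted_lists = sorted(zip(feature_importances, x.columns), key=lambda pair: pair[0], reverse=True)[:10]
# Reverse the order so that the most important feature is drawn at the top
importances, names = zip(*sorted_lists[::-1])

fig.add_trace(
    go.Bar(
        x=importances, y=names, name='Feature importance',
        marker={'color':'darkgreen','line':{'width':0}},
        hovertemplate=hovertemplate, offsetgroup=0, orientation='h'
    ))

fig.update_layout(title='Feature importances')
fig.update_xaxes(tickangle=90, title='Importance')
fig.show()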


As we can see, transmission appears among the 10 most important features of this model, which suggests that it has predictive power.
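Since each transmission value becomes its own one-hot column, its importance is spread over many columns. A rough way to judge the feature as a whole is to sum the importances of all columns derived from it. The numbers below are invented for illustration and are not the model's actual importances; `original_feature` is a hypothetical helper.

```python
import pandas as pd

# Made-up importances for a handful of one-hot encoded columns (they sum to 1)
importances = pd.Series({
    'engine_displacement': 0.60,
    'year': 0.15,
    'class_Two Seaters': 0.05,
    'class_Minicompact Cars': 0.05,
    'transmission_Manual 5-Speed': 0.08,
    'transmission_Automatic 4-Speed': 0.07,
})

categorical = ['class', 'transmission']

def original_feature(column):
    """Map a one-hot column name back to the feature it was derived from."""
    for name in categorical:
        if column.startswith(f'{name}_'):
            return name
    return column  # numeric columns are left as they are

# Group the index labels by their original feature and add up the importances
grouped = importances.groupby(original_feature).sum().sort_values(ascending=False)
print(grouped)
```

With the real model, `importances` could be built as `pd.Series(rf.feature_importances_, index=x.columns)`.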

# Fuel effect

Obviously, fuel type will also affect CO2 emissions. But how large is this effect?

In [None]:
import statsmodels.api as sm

dff = df[['engine_displacement','fuel_type','tailpipe_co2_in_grams_mile_ft1']].dropna()
# Indicator variables: 1 if the fuel type mentions premium gasoline / diesel, else 0
dff['premium'] = dff.fuel_type.apply(lambda x: 'Premium' in x)*1
dff['diesel'] = dff.fuel_type.apply(lambda x: 'Diesel' in x)*1

dff = dff.drop(columns='fuel_type')

# Regress CO2 on displacement and the two fuel indicators, with an intercept
x = sm.add_constant(dff.drop(columns='tailpipe_co2_in_grams_mile_ft1'))
y = dff.tailpipe_co2_in_grams_mile_ft1
model = sm.OLS(y, x).fit()
model.summary()

OLS Regression Results
Dep. Variable:,tailpipe_co2_in_grams_mile_ft1,R-squared:,0.653
Model:,OLS,Adj. R-squared:,0.653
Method:,Least Squares,F-statistic:,23790.0
Date:,"Thu, 04 Jul 2024",Prob (F-statistic):,0.0
Time:,12:20:37,Log-Likelihood:,-215360.0
No. Observations:,37978,AIC:,430700.0
Df Residuals:,37974,BIC:,430800.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,243.4551,0.960,253.549,0.000,241.573,245.337
engine_displacement,71.4117,0.267,267.030,0.000,70.888,71.936
premium,-14.6589,0.819,-17.906,0.000,-16.264,-13.054
diesel,-74.1098,2.263,-32.754,0.000,-78.545,-69.675

Omnibus:,6365.781,Durbin-Watson:,0.74
Prob(Omnibus):,0.0,Jarque-Bera (JB):,33294.442
Skew:,0.714,Prob(JB):,0.0
Kurtosis:,7.359,Cond. No.,23.3


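From this summary we can see that all variables are significant (p < 0.001) and that the $R^2$ of 0.653 is respectable. Both fuel indicators have negative coefficients, meaning that, at the same displacement, these vehicles emit less CO2 per mile than the baseline of all other fuel types: about 74 g/mile less for diesel and about 15 g/mile less for premium. The diesel result may look surprising given diesel's reputation as a cheap but "dirty" fuel, but that reputation comes mainly from NOx and particulate emissions. Diesel engines are more thermally efficient, so they burn less fuel, and emit less CO2, per mile. The smaller premium effect likely reflects the engines that require this fuel rather than the fuel itself, since premium is meant to protect the engine, not to control emissions.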
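To make the coefficients concrete, the sketch below plugs them into the fitted equation for a hypothetical 2.0 engine. The coefficients are copied from the summary above; the helper name `predict_co2` and the choice of a 2.0 displacement (assumed here to be in liters) are illustrative assumptions.

```python
# Coefficients copied from the OLS summary above
coef = {
    'const': 243.4551,
    'engine_displacement': 71.4117,
    'premium': -14.6589,
    'diesel': -74.1098,
}

def predict_co2(displacement, premium=0, diesel=0):
    """Predicted tailpipe CO2 (g/mile) from the fitted linear model."""
    return (coef['const']
            + coef['engine_displacement'] * displacement
            + coef['premium'] * premium
            + coef['diesel'] * diesel)

# Same hypothetical engine size, three fuel types
co2_baseline = predict_co2(2.0)
co2_premium = predict_co2(2.0, premium=1)
co2_diesel = predict_co2(2.0, diesel=1)

print(f'baseline: {co2_baseline:.1f} g/mile')
print(f'premium:  {co2_premium:.1f} g/mile')
print(f'diesel:   {co2_diesel:.1f} g/mile')
```

Because the model has no interaction terms, the gap between fuel types is the same at every displacement; only the overall level changes.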

# Closing statement

Overall, we have found a lot of interesting patterns that can serve both as a reference for car manufacturers and as a starting point for further analysis, including building machine learning models. Unfortunately, the data has a lot of missing values, which limits the possibilities for analysis. For example, a good question to consider would be "How does forced induction affect CO2 emissions?", but there is not enough data to properly test this hypothesis. Hopefully, the dataset will be updated in the future and we will be able to address this.

