# Data Analysis

The main objective of this notebook is to detect the key features for predicting CO2 emissions. For this experiment, we will use Plotly, a cool library I discovered last year. I like Plotly a lot because its visualizations are interactive: you can zoom in and out and check specific values, and it works great with pandas DataFrames.

In future experiments I will try Polars, a newer DataFrame library written in Rust that is optimized for large datasets and is often faster than pandas.

### Steps
- Import libraries and dataset.
- Detect columns with redundant or useless information.
- Analyze histograms of the selected features to detect outliers.
- Analyze scatter plots to look for correlations between features.
- Generate a correlation matrix (correlation analysis with Pearson's correlation coefficient).
- Generate a new dataset with the selected features for the next experiment (creating a linear regression model).

### Import libraries and dataset

In [188]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

In [189]:
dataset_path = './datasets/processed_vehicles.csv'

df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,release_year,manufacturer,model,vehicle_class,engine_size,cylinders,transmission_type,fuel_type,fuel_consumption_on_city,fuel_consumption_on_highway,fuel_consumption_combinated_in_kpl,fuel_consumption_combinated_in_mpg,co2_emissions
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


### Detect columns with redundant or useless information

First, we will drop the categorical variables that we do not expect to help predict CO2 emissions in this experiment: the model of the car, the manufacturer, the transmission type and the vehicle class.

In [190]:
df = df.drop(
    columns=[
        'model',
        'manufacturer',
        'transmission_type',
        'vehicle_class',
    ]
)

I can't see a clear relationship between CO2 emissions and `fuel_type` in the histogram below, so I will drop this column too.

In [191]:
fig_config = {
    'data_frame': df,
    'x': 'co2_emissions',
    'nbins': 10,
    'width': 600, 
    'height': 400,
    'color': 'fuel_type',
}

layout_config = {
    'title': 'CO2 Emissions Histogram',
    'xaxis_title': 'CO2 Emissions',
    'yaxis_title': 'Count',
}

fig = px.histogram(**fig_config)
fig.update_layout(**layout_config)
fig.show()

In [192]:
df = df.drop(
    columns=[
        'fuel_type',
    ]
)

All the cars in this dataset were released in 2014, so the `release_year` column is not useful.

In [193]:
# count the vehicles released in each year (computed once, reused below)
year_counts = df['release_year'].value_counts()
names = year_counts.index
values = year_counts.values

# a pie chart of the number of vehicles released in each year
fig_config = {
    'data_frame': df,
    'names': names,
    'values': values,
    'width': 600, 
    'height': 400,
}

layout_config = {
    'title': 'Number of Vehicles Released in Each Year',
}


fig = px.pie(**fig_config)
fig.update_layout(**layout_config)
fig.show()

In [194]:
df = df.drop(
    columns=[
        'release_year',
    ]
)
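
The same conclusion can be reached without a chart. As a minimal sketch, the snippet below counts the distinct values per column on a toy frame built from the first three rows shown by `df.head()`; the names `toy` and `constant_columns` are only illustrative and are not part of the notebook.

```python
import pandas as pd

# Toy frame with the first three rows shown by df.head() (not the full dataset).
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2014],
    "engine_size": [2.0, 2.4, 1.5],
    "co2_emissions": [196, 221, 136],
})

# nunique() counts the distinct values in each column;
# a count of 1 means the column is constant and carries no information.
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

Running the same check on the real `df` before dropping `release_year` would list every constant column at once.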

Fuel consumption is reported separately for city and highway driving, and the combined consumption appears in two different units (l/100km and mpg). I will drop all of these columns except the combined fuel consumption in l/100km.

I will also rename `fuel_consumption_combinated_in_kpl` to `fuel_consumption`, because only one unit remains.

In [195]:
df = df.drop(
    columns=[
        'fuel_consumption_on_city', 
        'fuel_consumption_on_highway',
        'fuel_consumption_combinated_in_mpg'
    ]
)

df.rename(
    columns={'fuel_consumption_combinated_in_kpl':'fuel_consumption'}, inplace=True
)
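
To see why the mpg column is redundant, here is a small sketch of the unit conversion. It assumes the mpg column uses imperial gallons (4.54609 litres); that is my inference from the sample rows, not something stated in the dataset. The function name is hypothetical. Under that assumption, converting the l/100km values of the five rows shown by `df.head()` reproduces their mpg values after rounding.

```python
# Assumption: mpg means miles per IMPERIAL gallon (inferred from the sample rows).
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344


def l_per_100km_to_mpg_imperial(l_per_100km: float) -> float:
    """Convert a consumption in l/100km to miles per imperial gallon."""
    miles_per_100km = 100 / KM_PER_MILE
    gallons_per_100km = l_per_100km / LITRES_PER_IMPERIAL_GALLON
    return miles_per_100km / gallons_per_100km


# Combined consumption and mpg of the five rows shown by df.head().
sample_l_per_100km = [8.5, 9.6, 5.9, 11.1, 10.6]
sample_mpg = [33, 29, 48, 25, 27]

converted = [round(l_per_100km_to_mpg_imperial(v)) for v in sample_l_per_100km]
print(converted == sample_mpg)
```

Since one column is a deterministic function of the other, keeping both would add no information to the model.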

The final dataframe looks like this:

In [196]:
df.head()

Unnamed: 0,engine_size,cylinders,fuel_consumption,co2_emissions
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136
3,3.5,6,11.1,255
4,3.5,6,10.6,244


### Graphical Distribution Analysis

We now analyze the histograms of the selected features to detect outliers. In general, it is not a good idea to draw all the histograms in the same figure, because each one needs its own configuration (number of bins, range, etc.).

In this case it is fine, because we are only interested in the outliers.

`engine_size` and `co2_emissions` have outliers.

In [197]:
columns = df.columns
fig = make_subplots(rows=1, cols=len(columns))

# one histogram per column, side by side (subplot columns are 1-based)
for col_index, column in enumerate(columns, start=1):
    fig.add_trace(
        go.Histogram(
            x=df[column],
            name=column,
        ),
        row=1, col=col_index
    )

fig.show()

However, after plotting these columns as violin plots, we can see that the outliers are not far from the rest of the data.

In [198]:
fig_config = {
    'data_frame': df,
    'y': "co2_emissions",
    'width': 600, 
    'height': 400,
    'box': True,
    'points':"all"
}

fig = px.violin(**fig_config)
fig.show()

In [199]:
fig_config = {
    'data_frame': df,
    'y': "engine_size",
    'width': 600, 
    'height': 400,
    'box': True,
    'points':"all"
}

fig = px.violin(**fig_config)
fig.show()
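
A numeric complement to the plots is Tukey's rule of thumb, which flags values more than 1.5 times the interquartile range (IQR) away from the quartiles. The sketch below is only an illustration: `iqr_outliers` is a hypothetical helper, and the sample contains just the seven `co2_emissions` values visible in this notebook's outputs, so it says nothing about the full dataset. Different quartile conventions can move the fences slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Toy sample: co2_emissions values that appear in this notebook's outputs.
co2_sample = [196, 221, 136, 255, 244, 108, 488]
print(iqr_outliers(co2_sample))
```

On the real data, the same helper could be called with `df['co2_emissions'].tolist()` to count how many points fall outside the fences.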

In [200]:
fig_config = {
    'data_frame': df,
    'x': 'co2_emissions',
    'nbins': 10,
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'CO2 Emissions',
    'xaxis_title': 'CO2 Emissions',
    'yaxis_title': 'Count',
}

fig = px.histogram(**fig_config)
fig.update_layout(**layout_config)
fig.show()

In [201]:
fig_config = {
    'data_frame': df,
    'x': 'engine_size',
    'nbins': 8,
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'Engine Size',
    'xaxis_title': 'Engine Size',
    'yaxis_title': 'Count',
}

fig = px.histogram(**fig_config)
fig.update_layout(**layout_config)
fig.show()

In [202]:
fig_config = {
    'data_frame': df,
    'x': 'fuel_consumption',
    'nbins': 6,
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'Fuel Consumption',
    'xaxis_title': 'Fuel Consumption',
    'yaxis_title': 'Count',
}

fig = px.histogram(**fig_config)
fig.update_layout(**layout_config)
fig.show()

### Relationship between CO2 emission and fuel-consumption

If I compare the car with the highest CO2 emissions and the car with the lowest, I can see that the three numeric features differ widely between them. These features are `engine_size`, `fuel_consumption` and `cylinders`.

This is a great start.

Let's get the car with the **lowest CO2 emissions**.

In [203]:
# get the car with the lowest CO2 emissions
df[df['co2_emissions'] == df['co2_emissions'].min()]

Unnamed: 0,engine_size,cylinders,fuel_consumption,co2_emissions
988,1.5,4,4.7,108


Let's get the car with the **highest CO2 emissions**.

In [204]:
# get the car with the highest CO2 emissions
df[df['co2_emissions'] == df['co2_emissions'].max()]

Unnamed: 0,engine_size,cylinders,fuel_consumption,co2_emissions
349,6.8,10,21.2,488


Let's check the relationship between these features and the CO2 emissions with some scatter plots.

In [205]:
fig_config = {
    'data_frame': df,
    'x':'fuel_consumption',
    'y':'co2_emissions',
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'Fuel Consumption vs CO2 Emissions',
    'xaxis_title': 'Fuel Consumption',
    'yaxis_title': 'CO2 Emissions',
}

fig = px.scatter(**fig_config)

fig.update_layout(**layout_config)
fig.show()

In [206]:
fig_config = {
    'data_frame': df,
    'x':'engine_size',
    'y':'co2_emissions',
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'Engine Size vs CO2 Emissions',
    'xaxis_title': 'Engine Size',
    'yaxis_title': 'CO2 Emissions',
}

fig = px.scatter(**fig_config)

fig.update_layout(**layout_config)
fig.show()

In [207]:
fig_config = {
    'data_frame': df,
    'x':'cylinders',
    'y':'co2_emissions',
    'width': 600, 
    'height': 400
}

layout_config = {
    'title': 'Cylinders vs CO2 Emissions',
    'xaxis_title': 'Cylinders',
    'yaxis_title': 'CO2 Emissions',
}

fig = px.scatter(**fig_config)

fig.update_layout(**layout_config)
fig.show()

It looks like all the features have a positive correlation with the CO2 emissions. Let's go deeper with the correlation matrix.

In [208]:
corr_values = df.corr().round(3)

fig = px.imshow(
    corr_values,
    color_continuous_scale=px.colors.sequential.Plasma_r,
    text_auto=True,
    width=600,
    height=400,
)
fig.show()

In [209]:
corr = df.corr().round(2)

(
    corr[['co2_emissions']]
    .sort_values(
        by='co2_emissions', 
        ascending=False
    )
)

Unnamed: 0,co2_emissions
co2_emissions,1.0
fuel_consumption,0.89
engine_size,0.87
cylinders,0.85
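
For reference, `df.corr()` computes Pearson's coefficient by default. The sketch below shows what that coefficient measures with a hand-written version; `pearson_r` is a hypothetical helper, and the inputs are only the five rows shown by `df.head()`, so the value it prints is not comparable to the 0.89 obtained on the full dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys over the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# fuel_consumption and co2_emissions of the five rows shown by df.head().
fuel_consumption = [8.5, 9.6, 5.9, 11.1, 10.6]
co2_emissions = [196, 221, 136, 255, 244]

print(round(pearson_r(fuel_consumption, co2_emissions), 3))
```

The coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 meaning no linear relationship.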


All features have a strong positive correlation with the CO2 emissions (0.85 or higher). `fuel_consumption` has the highest correlation, at 0.89.

Now we have a better idea of the features that we will use in the next experiment. Let's export the new dataframe to a CSV file that will be the starting point for our linear regression model.

In [210]:
df.to_csv('./datasets/model_features.csv', index=False)