### Exploratory Data Analysis:
- Target variable: `co2`, which is the CO2 emission of the car in g/km
- Number of rows and columns: `df.shape` (rows are `df.shape[0]`, columns are `df.shape[1]`)
- Variable types: `df.dtypes`
- Number of variables of each type: `df.dtypes.value_counts().plot.pie()`
- Cells where there is no value: `df.isna()` (use `df.isna().sum()` for a count per column)
- None values
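
The checks listed above can be sketched on a small stand-in frame. The toy data below is hypothetical (it is not the real dataset); only the pandas calls matter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "lib_mrq": ["A", "B", "C", "D"],
    "co2": [120.0, np.nan, 180.0, 210.0],
    "puiss_max": [55.0, 77.0, np.nan, 100.0],
})

n_rows, n_cols = toy.shape                # number of rows and columns
dtype_counts = toy.dtypes.value_counts()  # how many columns of each type
missing_per_column = toy.isna().sum()     # count of missing values per column
missing_share = toy.isna().mean()         # fraction of missing values per column

print(n_rows, n_cols)
print(dtype_counts)
print(missing_per_column)
```

`df.isna()` alone returns a frame of booleans, so summing or averaging it is usually what makes the result readable.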
### Clean the data:
- Drop nearly empty columns: `df = df[df.columns[df.isna().sum()/df.shape[0] < 0.9]]`
  * this keeps only the columns with less than 90% missing values, so the columns that are at least 90% empty are dropped
- To drop another column: `df = df.drop('name of the column', axis=1)`
- To plot the target variable we can use:
  * `sns.histplot(df['co2'], kde=True)` (`sns.distplot` is deprecated in recent seaborn versions)
  * `plt.hist(x, bins, density=True)` followed by `plt.show()`
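
As a minimal sketch of the cleaning rule above, the snippet below applies the same 90% threshold to a hypothetical toy frame. The column names `all_empty` and `partly_empty` are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one fully empty column, one half-empty column.
toy = pd.DataFrame({
    "co2": [120.0, 150.0, 180.0, 210.0],
    "all_empty": [np.nan, np.nan, np.nan, np.nan],
    "partly_empty": [np.nan, 2.0, np.nan, 4.0],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Keep the columns with less than 90% missing values.
kept = toy.loc[:, missing_share < 0.9]

# Drop one more column by name.
trimmed = kept.drop(columns=["partly_empty"])

print(list(kept.columns))
print(list(trimmed.columns))
```

`toy.isna().mean()` is equivalent to `toy.isna().sum() / toy.shape[0]`, and `drop(columns=...)` is equivalent to `drop(..., axis=1)`.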
### Analysis of continuous variables:
- We want to plot the distribution of every continuous variable:
  * `for col in df.select_dtypes('float'):`
  * `plt.figure()`
  * `sns.histplot(df[col], kde=True)`
### Analysis of discrete variables:
  * `sns.histplot(df[''], discrete=True)`
### Analysis of categorical variables:
  * `for col in df.select_dtypes('object'):`
  * `plt.figure()`
  * `sns.countplot(x=df[col])`
  * `print(col, df[col].unique())` (this lists the distinct values taken by each column)
  * `for col in df.select_dtypes('object'):`
  * `plt.figure()`
  * `df[col].value_counts().plot.pie()` (this shows the share of each value across all vehicles)
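
The per-column loops above can be sketched as one helper. This is only an illustration: the function name `plot_column_distributions` and the toy frame are hypothetical, and plain matplotlib is used instead of seaborn to keep the sketch dependency-light.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_column_distributions(frame):
    """Draw one figure per column: histogram if numeric, bar chart of counts otherwise."""
    figures = {}
    for col in frame.columns:
        fig, ax = plt.subplots()
        if pd.api.types.is_numeric_dtype(frame[col]):
            # Continuous or discrete numeric column: normalised histogram.
            ax.hist(frame[col].dropna(), bins=20, density=True)
        else:
            # Categorical column: one bar per distinct value.
            frame[col].value_counts().plot.bar(ax=ax)
        ax.set_title(col)
        figures[col] = fig
    return figures


# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "co2": [120.0, 150.0, 180.0, 210.0],
    "cod_cbr": ["ES", "GO", "GO", "ES"],
})
figures = plot_column_distributions(toy)
```

In a notebook, calling `plt.show()` after the helper displays the figures.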
  







# README FILE:
# Dataset Analysis

Today, pollution from the transport sector accounts for 60.6% of total pollution in Europe. That is why many people wonder which car to buy in order to pollute less. In this dataset, we look at the parameters that increase vehicle pollution and at how to choose the least polluting car possible.
## The Data
- **What are the data?**  
  This dataset contains information about cars: their CO2 emissions, their horsepower, whether they are hybrid or not, etc.

- **Where do they come from?**  
  https://www.data.gouv.fr/fr/datasets/emissions-de-co2-et-de-polluants-des-vehicules-commercialises-en-france/

- **Data provenance:**  
  The data come from https://www.data.gouv.fr/fr/datasets/emissions-de-co2-et-de-polluants-des-vehicules-commercialises-en-france/, which is the open data website of the French government. It provides a lot of information about France.

- **Census or sample:**  
  It is a sample of cars.

- **Raw or processed:**  
  Processed, because we removed many columns, such as lib-mob-doss, which held the name of the model in the administrative file, and the date, because all the data are from March 2014.


- **Describe the units, size of the sample/population:**  
  - **Units:** Each row represents one car.  
  - **Population/Sample size:**  5504 cars.

- **Number of variables:**  
  11 columns (variables).

## Variables Description


1. **`lib_mrq`**  
   This variable represents the brand name of the car. It is a categorical variable.

2. **`lib_mod`**  
   It represents the model name of the car. It is a categorical variable.

3. **`cnit`** 
    The CNIT, or Code National d'Identification du Type (National Type Identification Code), is a code created by the manufacturer and then approved by the French government to identify the vehicle model defined by its technical characteristics (engine, version, etc.).
    It is shown in box D.2 of the vehicle registration certificate. It is a categorical variable.
4. **`cod_cbr`** 
    This variable represents the fuel code, i.e. the fuel used to power the car. Some examples from the dataset: 'ES' for petrol, 'GO' for diesel, 'EE' for ethanol, 'GP' for LPG, etc. It is a categorical variable.
5. **`hybride`** 
    This variable shows whether a car is hybrid or not. The two possible values are 'yes' and 'no'. It is a categorical variable.
6. **`puiss_admin_98`** 
    This variable represents the administrative horsepower (or fiscal horsepower), a value expressed in fiscal horsepower (HP) that is used in France to calculate the cost of vehicle registration (carte grise) and, in some cases, for insurance purposes. It is a discrete variable that ranges from 1 to 81 hp.
7. **`puiss_max`** 
    It represents the maximum power of the car. It is a discrete variable that ranges from 10 to 560 hp.
8. **`typ_boite_nb_rapp`** 
    It represents the type of gearbox of the car. It is a categorical variable. Each value is a letter followed by a number: the letter is the type of gearbox, like 'M' for manual, and the number is the number of gear ratios.
9. **`conso_mixte`** 
   This variable represents the average fuel consumption of the car in L/100 km. It is a continuous variable that ranges from 0.6 to 24.5 L/100 km.
10. **`co2`**
   This variable represents the CO2 emissions of the car in g/km. It is a continuous variable that ranges from 13 to 572 g/km.
11. **`masse_ordma_max`**
   This variable gives the maximum weight of the car, with a full tank, the oil and all the options. It is a continuous variable that ranges from 825 to 3094 kg.
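
Since `typ_boite_nb_rapp` packs two pieces of information into one string, it can be split into a gearbox type and a number of gears. The sketch below assumes the values look like a single letter, an optional space, then digits. The sample values and the output column names `gearbox_type` and `n_gears` are hypothetical.

```python
import pandas as pd

# Hypothetical sample values in the assumed "letter + number" format.
codes = pd.Series(["M 6", "A 7", "M 5", "A 8"], name="typ_boite_nb_rapp")

# Named groups become the column names of the resulting frame.
parts = codes.str.extract(r"^\s*(?P<gearbox_type>[A-Za-z])\s*(?P<n_gears>\d+)\s*$")

# The extracted digits are still strings, so convert them to numbers.
parts["n_gears"] = pd.to_numeric(parts["n_gears"])

print(parts)
```

Values that do not match the assumed pattern come out as missing, which makes unexpected formats easy to spot with `parts.isna().sum()`.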



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import jupyprint as jp



In [None]:
df=pd.read_csv("mars-2014-complete.csv", on_bad_lines='skip', sep=';')
df

In [None]:
df.shape[0]

In [None]:
df.shape[1]

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df['puiss_max']=pd.to_numeric(df['puiss_max'], errors='coerce')
df['conso_mixte']=pd.to_numeric(df['conso_mixte'], errors='coerce')
df.describe()

In [None]:
df.isna()

In [None]:
# Draw a random sample of 2000 rows (pass a fixed random_state to make the draw reproducible)
df=df.sample(n=2000)
df

In [None]:
df.describe()

In [None]:
df.to_csv('sampledataset.csv', index=False)

In [None]:
n=len(df)
n

In [None]:
df = df.filter(items=["lib_mrq", "lib_mod", "cnit", "cod_cbr", "hybride", "puiss_admin_98", "puiss_max", "typ_boite_nb_rapp" ,"conso_mixte","co2","masse_ordma_max"])
df

In [None]:
df.describe()

In [None]:
df.shape


# II - Description of some variables

## Puiss_admin visualisation

In [None]:
x = np.array(df['puiss_admin_98'][:])
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
fi = ni/n
Fi = np.cumsum(fi)
ft = pd.DataFrame(data = np.transpose([ni, fi, Fi]), index = levels, columns = ['frequencies', 'relative frequencies', 'cumulative relative frequencies'])
ft

In [None]:
positions = levels

# Create the figure first so that the bars are drawn on the large canvas
plt.figure(figsize= (30, 15))
plt.bar(positions, fi, width = 0.7)
plt.xticks(positions, levels) 
plt.ylabel('relative frequencies')
plt.xlabel('puiss_admin')
plt.show()

In [None]:
plt.plot(positions, ni)
plt.ylabel('frequencies')
plt.xlabel('puiss_admin')
plt.show()

## Puiss_max visualisation

In [None]:
x = np.array(df['puiss_max'][:])
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
fi = ni/n
Fi = np.cumsum(fi)
ft = pd.DataFrame(data = np.transpose([ni, fi, Fi]), index = levels, columns = ['frequencies', 'relative frequencies', 'cumulative relative frequencies'])
ft = ft.sort_values(by='frequencies', ascending=False).head(5)
ft

In [None]:
positions = levels

# Create the figure first so that the bars are drawn on the large canvas
plt.figure(figsize= (30, 15))
plt.bar(positions, fi, width = 0.7)
plt.xticks(positions, levels) 
plt.ylabel('relative frequencies')
plt.xlabel('puiss_max')
plt.tight_layout()
plt.show()

In [None]:
plt.plot(positions, ni)
plt.ylabel('frequencies')
plt.xlabel('puiss_max')
plt.show()

## Cod_cbr visualisation

In [None]:
x = np.array(df['cod_cbr'][:])
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
fi = ni/n
ft = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['frequencies', 'relative frequencies'])
ft = ft.sort_values(by='frequencies', ascending=False).head(3)

levels = ft.index
ni = ft['frequencies']
ft

In [None]:
plt.bar(levels, ni)
plt.ylabel('frequencies')
plt.show()

In [None]:
plt.pie(ni, labels = levels, autopct='%1.3f%%')
plt.show()

These are the main fuel codes (`cod_cbr`) in the dataset, but others exist, such as GP and GN.
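
The same kind of frequency table can also be built directly with pandas. This is a sketch of an alternative to the `np.unique` approach, on hypothetical toy values, not the code used for the figures above.

```python
import pandas as pd

# Hypothetical toy fuel codes (not the real column).
fuel = pd.Series(
    ["GO", "GO", "ES", "GO", "ES", "GP", "GN", "GO", "ES", "GP"],
    name="cod_cbr",
)

# value_counts sorts by decreasing frequency.
counts = fuel.value_counts()
freq_table = counts.to_frame(name="frequencies")
freq_table["relative frequencies"] = freq_table["frequencies"] / len(fuel)

# Keep only the three most frequent codes.
top3 = freq_table.head(3)
print(top3)
```

Because `value_counts` already sorts the levels, no separate `sort_values` call is needed before taking the top rows.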


## Cnit visualisation 

In [None]:
x = np.array(df['cnit'][:])
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
fi = ni/n
ft = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['frequencies', 'relative frequencies'])
ft = ft.sort_values(by='frequencies', ascending=False)

levels = ft.index
ni = ft['frequencies']
ft

In [None]:
plt.bar(levels, ni)
plt.ylabel('frequencies')
plt.show()

In [None]:
plt.pie(ni, labels = levels, autopct='%1.3f%%')
plt.show()

These are the `IDs` of the cars, so it is normal that they all have the same frequency. It is not relevant to analyse this variable further.
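
Whether a column really behaves like an identifier can be checked without any plot. A minimal sketch, where the helper name `is_identifier` and the toy values are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame: "cnit" is unique per row, "lib_mrq" is not.
toy = pd.DataFrame({
    "cnit": ["X1", "X2", "X3", "X4"],
    "lib_mrq": ["A", "A", "B", "C"],
})


def is_identifier(column):
    """Return True when every value is present and appears exactly once."""
    return bool(column.notna().all() and column.is_unique)


print(is_identifier(toy["cnit"]))
print(is_identifier(toy["lib_mrq"]))
```

When the check passes, every level has frequency 1, so a bar chart or pie chart of the column carries no information.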


## lib_mrq visualisation 

In [None]:
x = np.array(df['lib_mrq'][:]) 
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
levels, ni, sum(ni)
n = len(df)
fi = ni/n
fi, sum(fi)
ft_cartel = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['Frequencies', 'Relative Frequencies'])
ft_cartel

In [None]:
ft_cartel = ft_cartel.sort_values(by='Frequencies', ascending=False)
plt.figure(figsize=(12, 6)) 
plt.bar(ft_cartel.index, ft_cartel['Frequencies'])
plt.ylabel('Frequencies')
plt.xlabel('Brands')
plt.xticks(rotation=45) 
plt.show()

There are many brands in this dataset, so I will keep only the most frequent ones, the `TOP 3`.

In [None]:

ft_cartel = ft_cartel.sort_values(by='Frequencies', ascending=False).head(3)
plt.figure(figsize=(12, 6)) 
plt.bar(ft_cartel.index, ft_cartel['Frequencies'])
plt.ylabel('Frequencies')
plt.xlabel('Brands')
plt.show()

In [None]:
plt.pie(ft_cartel['Frequencies'], labels=ft_cartel.index, autopct='%1.3f%%')
plt.show()

We can see that `MERCEDES` is the most common brand, followed by `Volkswagen`.

## lib_mod Visualisation

In [None]:
x = np.array(df['lib_mod'][:]) 
x

In [None]:
levels, ni = np.unique(x, return_counts = True) 
levels, ni, sum(ni)
n = len(df)

fi = ni/n
fi, sum(fi)
ft_cartel = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['Frequencies', 'Relative Frequencies'])
ft_cartel

In [None]:
ft_cartel = ft_cartel.sort_values(by='Frequencies', ascending=False)
plt.figure(figsize=(12, 6)) 
plt.bar(ft_cartel.index, ft_cartel['Frequencies'])
plt.ylabel('Frequencies')
plt.xlabel('Models')
plt.xticks(rotation=45) 
 
plt.show()

There are many models in this dataset, so I will keep only the most frequent ones, the `TOP 4`.

In [None]:
ft_cartel = ft_cartel.sort_values(by='Frequencies', ascending=False).head(4)
plt.figure(figsize=(12, 6)) 
plt.bar(ft_cartel.index, ft_cartel['Frequencies'])
plt.ylabel('Frequencies')
plt.xlabel('Models')
plt.xticks(rotation=45) 
 
plt.show()

In [None]:
plt.pie(ft_cartel['Frequencies'], labels=ft_cartel.index, autopct='%1.3f%%')
plt.show()

We can see from this graph that ``Mercedes`` and ``Volkswagen`` remain in the race, with ``Volkswagen``'s ``Crafter`` as the most used model, followed by ``Mercedes``'s ``Vito``, ``Viano``, and ``Sprinter``

## Co2 Visualisation

In [None]:
x = np.array(df['co2'][:])
min(x), max(x)
bins = np.array([0.,150,200,210,225,250,300,400])
bins, len (bins)
ni, bins = np.histogram(x, bins = bins)
lo = bins[:-1]
hi = bins[1:]
fi = ni/n
Fi = np.cumsum(fi)
ft_quant = pd.DataFrame(data = np.transpose([lo, hi, ni, fi, Fi]), columns = ['lo', 'hi', 'frequencies', 'relative frequencies', 'cumulative relative frequencies'])
ft_quant

In [None]:
plt.hist(x, bins, density = True)
plt.show()

In [None]:
x = x[x > 0]
plt.boxplot(x)
plt.ylabel('g/km')
plt.xticks([1], ['co2'])
plt.show()


## masse_ordma_max Visualisation

In [None]:
x = np.array(df['masse_ordma_max'][:])
min(x), max(x)
bins = np.array([0.,1500,2000,2500,2750,3200])
bins, len (bins)
ni, bins = np.histogram(x, bins = bins)
lo = bins[:-1]
hi = bins[1:]
fi = ni/n
Fi = np.cumsum(fi)
ft_quant = pd.DataFrame(data = np.transpose([lo, hi, ni, fi, Fi]), columns = ['lo', 'hi', 'frequencies', 'relative frequencies', 'cumulative relative frequencies'])
ft_quant

In [None]:
plt.hist(x, bins, density = True)
plt.show()

In [None]:

plt.boxplot(x)
plt.ylabel('kg')
plt.xticks([1], ['masse_ordma_max'])
plt.show()


## hybride Visualisation

In [None]:
n=len(df)
x = np.array(df['hybride'][:]) 
levels, ni = np.unique(x, return_counts = True) 
levels, ni, sum(ni)

In [None]:
fi = ni/n
fi, sum(fi)

In [None]:
ft_hybride = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['frequencies', 'relative frequencies'])
ft_hybride

In [None]:
plt.bar(levels, ni)
plt.ylabel('frequencies')
plt.show()

In [None]:
plt.pie(ni, labels = levels, autopct='%1.1f%%')
plt.show()

That means that the majority of the vehicles of the dataset are not hybrid.

## conso_mixte Visualisation

In [None]:
x0 = np.array(df['conso_mixte'][:])
min(x0), max(x0)
bins = np.linspace(0, 100, 21) 
bins, len (bins)

In [None]:
bins = np.array([  0.,   5.,6,7,8,8.25,  10.,  15.])
ni0 = np.histogram(x0, bins = bins)[0]  
ni0

In [None]:
lo = bins[:-1]
hi = bins[1:]
fi0 = ni0/n
Fi0 = np.cumsum(fi0)
ft_conso_mixte = pd.DataFrame(data = np.transpose([lo, hi, ni0, fi0, Fi0]), columns = ['lo', 'hi', 'frequencies', 'relative frequencies', 'cumulative relative frequencies'])
ft_conso_mixte

In [None]:
plt.hist(x0, bins, density = True)
plt.show()

In [None]:
x0 = x0[x0 > 0]
plt.boxplot(x0)
plt.ylabel('L/100km')
plt.xticks([1], ['conso_mixte'])
plt.show()

## typ_boite_nb_rapp Visualisation

In [None]:
x1 = np.array(df['typ_boite_nb_rapp'][:]) 
levels, ni = np.unique(x1, return_counts = True) 
levels, ni, sum(ni)

In [None]:
fi = ni/n
fi, sum(fi)

In [None]:
ft_typ_boite_nb_rapp = pd.DataFrame(data = np.transpose([ni, fi]), index = levels, columns = ['frequencies', 'relative frequencies'])
ft_typ_boite_nb_rapp

In [None]:
plt.bar(levels, ni)
plt.ylabel('frequencies')
plt.show()

In [None]:
plt.pie(ni, labels = levels, autopct='%1.1f%%')
plt.show()

It seems that most of the vehicles in this dataset have a manual transmission with 6 gears.

#  III : Simple linear regression

In [None]:
x = df.co2
y = df.masse_ordma_max
n=len(df)
n

In [None]:
sns.scatterplot(data = df, x = 'co2', y = 'masse_ordma_max', marker = '*')
plt.show()

We assume a linear model. We have $n=2000$, with $x_{i}$ the CO2 emission of vehicle $i$ and $y_{i}$ its maximum mass, as in the code above (`x = df.co2`, `y = df.masse_ordma_max`). We consider the data to be an $n$-sample $((X_1,Y_1),...,(X_n,Y_n))$.
We suppose that:

$Y_i = a + b\, x_i + \varepsilon_{i}$ with $i\in\{1,...,n\}$

where $\varepsilon_1$, ..., $\varepsilon_n$ are independent and distributed as $\mathcal{N}(0,\sigma^2)$, $a$, $b$ and $\sigma>0$ being unknown parameters.
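
Under this model, the least-squares estimates have a closed form: $\hat b = \sum_i (x_i-\bar x)(y_i-\bar y) / \sum_i (x_i-\bar x)^2$, $\hat a = \bar y - \hat b\,\bar x$, and $\hat\sigma^2 = \mathrm{SSE}/(n-2)$. The sketch below evaluates these formulas on synthetic data and compares them with `scipy.stats.linregress`. The arrays `x_toy` and `y_toy` and the coefficients used to generate them are hypothetical.

```python
import numpy as np
import scipy.stats as st

# Synthetic data generated from a known line plus Gaussian noise.
rng = np.random.default_rng(0)
x_toy = rng.uniform(100, 300, size=50)
y_toy = 900 + 4.0 * x_toy + rng.normal(0, 150, size=50)

n_toy = len(x_toy)
x_bar, y_bar = x_toy.mean(), y_toy.mean()

# Closed-form least-squares estimates of the slope and the intercept.
b_hat = np.sum((x_toy - x_bar) * (y_toy - y_bar)) / np.sum((x_toy - x_bar) ** 2)
a_hat = y_bar - b_hat * x_bar

# Unbiased estimate of the noise variance: SSE / (n - 2).
residuals = y_toy - (a_hat + b_hat * x_toy)
sigma2_hat = np.sum(residuals ** 2) / (n_toy - 2)

# Reference fit from scipy for comparison.
check = st.linregress(x_toy, y_toy)
print(np.isclose(b_hat, check.slope), np.isclose(a_hat, check.intercept))
```

The division by $n-2$ rather than $n$ accounts for the two parameters estimated from the data.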


### Estimation of parameters

In [None]:
lr = st.linregress(x,y)
lr

In [None]:
jp.jupyprint(f'The estimations of the parameters $a$ and $b$ are $\\hat a =$ {lr.intercept:.3f} \
and $\\hat b =$ {lr.slope:.3f} respectively.')

In [None]:
hy = lr.intercept + lr.slope * x
e = y - hy
sse = np.sum(e**2)
hs2 = sse/(n-2)
jp.jupyprint(f'The estimation of the parameter $\\sigma^2$ is $\\hat\\sigma^2$ = {hs2:.3f} .')



We can see that the values of $Y$ are widely spread around the regression line, because the estimate $\hat\sigma^2$ is very large.
This is visible in the following plot, where most of the points are far from the regression line.

### Visualization of the regression

In [None]:
fit = [lr.slope, lr.intercept]
poly = np.poly1d(fit)

In [None]:
ran = max(x)-min(x)
lb = min(x)-ran/20
ub = max(x)+ran/20
z = [lb, ub]
sns.scatterplot(data = df, x = 'co2', y = 'masse_ordma_max', marker = '*')
plt.plot(z,poly(z),'r-', linewidth = 0.7)
plt.show()

In [None]:
jp.jupyprint('$R^2$ = {:2.3f}'.format(lr.rvalue**2)+ ' ,   $p$-value = {:1.2e}'.format(lr.pvalue))

The results give $R^2 = 0.25$, which means that the linear relationship between the two variables is not very strong (if it were strong, $R^2$ would be close to $1$).

Finally, the $p$-value is extremely low, which means that the relationship between the two variables is statistically significant, even though it is not strong.
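
In simple linear regression, $R^2$ is both the squared correlation coefficient and the share of variance explained, $1 - \mathrm{SSE}/\mathrm{SST}$. The sketch below checks this identity on synthetic data. The arrays and the generating coefficients are hypothetical.

```python
import numpy as np
import scipy.stats as st

# Synthetic, fairly noisy data.
rng = np.random.default_rng(1)
x_toy = rng.uniform(100, 300, size=200)
y_toy = 900 + 4.0 * x_toy + rng.normal(0, 400, size=200)

fit_toy = st.linregress(x_toy, y_toy)
fitted = fit_toy.intercept + fit_toy.slope * x_toy

# Residual and total sums of squares.
sse_toy = np.sum((y_toy - fitted) ** 2)
sst_toy = np.sum((y_toy - y_toy.mean()) ** 2)

r2_from_sums = 1 - sse_toy / sst_toy  # share of variance explained
r2_from_r = fit_toy.rvalue ** 2       # squared correlation coefficient

print(np.isclose(r2_from_sums, r2_from_r))
```

A low $R^2$ together with a tiny $p$-value is common with large samples: the slope is clearly non-zero, but the line explains only part of the variability.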

We can consider as outliers the points whose distance from the line exceeds 2.5 times the standard deviation of the residuals. We can visualize this on a graph.

In [None]:
sns.scatterplot(data = df, x = 'co2', y = 'masse_ordma_max', marker = '*')
plt.plot(z,poly(z),'r-', linewidth = 0.7)
plt.plot(z,poly(z)+2.5*np.sqrt(hs2),'b--', linewidth = 0.5)
plt.plot(z,poly(z)-2.5*np.sqrt(hs2),'b--', linewidth = 0.5)
plt.show()
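
The dashed band can also be used to flag the outlying points explicitly. A minimal sketch on synthetic data, using the same multiplier 2.5 as the dashed lines. The arrays, the generating coefficients and the three injected extreme points are hypothetical.

```python
import numpy as np
import scipy.stats as st

# Synthetic data with three deliberately extreme points.
rng = np.random.default_rng(2)
x_toy = rng.uniform(100, 300, size=300)
y_toy = 900 + 4.0 * x_toy + rng.normal(0, 200, size=300)
y_toy[:3] += 2000

fit_toy = st.linregress(x_toy, y_toy)
resid = y_toy - (fit_toy.intercept + fit_toy.slope * x_toy)

# Residual standard deviation, estimated with n - 2 degrees of freedom.
sigma_hat = np.sqrt(np.sum(resid ** 2) / (len(x_toy) - 2))

# A point is flagged when its vertical distance to the line exceeds k * sigma_hat.
k = 2.5
is_outlier = np.abs(resid) > k * sigma_hat

print(int(is_outlier.sum()), "points flagged out of", len(x_toy))
```

The boolean mask `is_outlier` can then be used to colour the flagged points in the scatter plot or to inspect the corresponding rows.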