<a href="https://colab.research.google.com/github/Carlosmtp/ML-CO2-Emissions/blob/main/ML_DT_CO2_Emissions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Report: *CO2 Emissions* (Decision Trees) 🚗💨

#### *Carlos Mauricio Tovar Parra - 1741699*
#### *Santiago Duque Chacón - 2180099*
---

### Summary

This report experiments with machine learning techniques to predict carbon dioxide (CO2) emissions from vehicles. Using a dataset of 1067 vehicles, it builds predictive models from four vehicle characteristics: engine size (ENGINESIZE), number of cylinders (CYLINDERS), and fuel consumption in the city (FUELCONSUMPTION_CITY) and on the highway (FUELCONSUMPTION_HWY). The target variable is CO2EMISSIONS, a binary label for low (0) or high (1) emissions.

The goal is to train decision trees under different hyperparameter settings and identify the combination that maximizes predictive accuracy.
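As a rough illustration of this kind of hyperparameter search, the sketch below uses scikit-learn's `GridSearchCV` over the same three hyperparameters the report varies (`criterion`, `splitter`, `max_depth`). It is not the procedure used in this notebook, which loops over depths by hand, and the data is a synthetic stand-in generated with `make_classification`, not the vehicle dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four vehicle features (assumption: not the real data).
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Candidate values for the hyperparameters explored in this report.
param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(1, 11)),
}

# 5-fold cross-validated accuracy for every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The grid search picks the combination with the highest mean cross-validated accuracy, so the test set can stay untouched until the final evaluation.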

*Attribute table:*

| Number | Attribute                  | Description                                                 |
|--------|----------------------------|-------------------------------------------------------------|
| 1      | ENGINESIZE                 | Engine size in liters                                       |
| 2      | CYLINDERS                  | Number of cylinders in the engine                           |
| 3      | FUELCONSUMPTION_CITY       | Fuel consumption of the vehicle in urban areas (L/100 km)   |
| 4      | FUELCONSUMPTION_HWY        | Fuel consumption of the vehicle on the highway (L/100 km)   |
| 5      | CO2EMISSIONS               | CO2 emissions of the vehicle (0: Low, 1: High)              |


*Example data instance:*

| Attribute | 1   | 2  | 3    | 4   | 5 |
|-----------|-----|----|------|-----|---|
| Value     | 3.7 | 6  | 13.4 | 9.5 | 1 |


*Libraries used:*
- numpy
- pandas
- sklearn

In [48]:
import numpy as np
import pandas as pd
import sklearn

## 1) Read the file

In [49]:
dataset_url = ("https://raw.githubusercontent.com/Carlosmtp/ML-CO2-Emissions/main/CO2%20emissions%20data.csv")
dataset = pd.read_csv(dataset_url, sep=",")
dataset.columns = ["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY", "FUELCONSUMPTION_HWY", "CO2EMISSIONS"]
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ENGINESIZE            1067 non-null   float64
 1   CYLINDERS             1067 non-null   int64  
 2   FUELCONSUMPTION_CITY  1067 non-null   float64
 3   FUELCONSUMPTION_HWY   1067 non-null   float64
 4   CO2EMISSIONS          1067 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 41.8 KB


*Continuous quantitative variables:*

In [50]:
dataset.select_dtypes(include=['float64']).describe()

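|       | ENGINESIZE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY |
|-------|------------|----------------------|---------------------|
| count | 1067.0     | 1067.0               | 1067.0              |
| mean  | 3.346298   | 13.296532            | 9.474602            |
| std   | 1.415895   | 4.101253             | 2.79451             |
| min   | 1.0        | 4.6                  | 4.9                 |
| 25%   | 2.0        | 10.25                | 7.5                 |
| 50%   | 3.4        | 12.6                 | 8.8                 |
| 75%   | 4.3        | 15.55                | 10.85               |
| max   | 8.4        | 30.2                 | 20.5                |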


*Discrete quantitative variables:*

In [51]:
dataset.select_dtypes(include=['int64']).describe()

|       | CYLINDERS | CO2EMISSIONS |
|-------|-----------|--------------|
| count | 1067.0    | 1067.0       |
| mean  | 5.794752  | 0.461106     |
| std   | 1.797447  | 0.498719     |
| min   | 3.0       | 0.0          |
| 25%   | 4.0       | 0.0          |
| 50%   | 6.0       | 0.0          |
| 75%   | 8.0       | 1.0          |
| max   | 12.0      | 1.0          |


## 2) Split the data into training and test sets (80%, 20%)

In [52]:
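from sklearn.model_selection import train_test_split

N = len(dataset)
cTrain = int(N * 0.8)   # 80% of the rows for training
cTest = N - cTrain      # remaining 20% for testing
print(N, cTrain, cTest)
# No random_state is set, so the rows assigned to each split change between runs.
train_data, test_data = train_test_split(dataset, train_size=cTrain, test_size=cTest)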

1067 853 214
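Because the split above has no fixed seed, rerunning the notebook yields different train and test rows. A minimal sketch of a reproducible, stratified alternative follows. The `demo` frame is hypothetical: it only mimics the dataset's size (1067 rows) and its roughly 46% share of high-emission labels, with random feature values, and the seed `123` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with the same size and roughly the same class balance
# as the dataset (1067 rows, about 46% labelled 1); feature values are random.
rng = np.random.default_rng(0)
labels = np.array([1] * 492 + [0] * 575)
demo = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 8.4, size=labels.size),
    "CO2EMISSIONS": labels,
})

train_demo, test_demo = train_test_split(
    demo,
    test_size=0.2,
    random_state=123,               # fixed seed: same split on every run
    stratify=demo["CO2EMISSIONS"],  # keep the class ratio in both parts
)

print(len(train_demo), len(test_demo))
print(round(train_demo["CO2EMISSIONS"].mean(), 3),
      round(test_demo["CO2EMISSIONS"].mean(), 3))
```

Stratifying keeps the share of high-emission vehicles nearly identical in both parts, which makes accuracy figures from different runs easier to compare.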


In [53]:
dataset.shape

(1067, 5)

In [54]:
dataset.head()

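|   | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | CO2EMISSIONS |
|---|------------|-----------|----------------------|---------------------|--------------|
| 0 | 2.0        | 4         | 9.9                  | 6.7                 | 0            |
| 1 | 2.4        | 4         | 11.2                 | 7.7                 | 0            |
| 2 | 1.5        | 4         | 6.0                  | 5.8                 | 0            |
| 3 | 3.5        | 6         | 12.7                 | 9.1                 | 0            |
| 4 | 3.5        | 6         | 12.1                 | 8.7                 | 0            |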


## 3) Normalize and impute the data

In [55]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

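# Numeric preprocessing: fill missing values with the column median, then standardize.
data_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# All columns except the last one (the CO2EMISSIONS target) are features.
attribs = dataset.columns[0:-1]
full_pipeline = ColumnTransformer([
    ("num", data_pipeline, attribs),
])

# Fit the preprocessing on the training split only; the test split is transformed later.
X_train = full_pipeline.fit_transform(train_data)
X_train.shape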

(853, 4)
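To make the two preprocessing steps concrete, here is a small NumPy sketch of median imputation followed by standardization. The four rows are hypothetical values chosen for illustration. In the actual dataset, `dataset.info()` reports no null entries, so the imputer leaves the values unchanged there.

```python
import numpy as np

# Hypothetical mini-batch of [ENGINESIZE, CYLINDERS] rows with one missing value.
X = np.array([
    [2.0, 4.0],
    [3.5, 6.0],
    [np.nan, 8.0],
    [5.0, 6.0],
])

# Step 1: median imputation, column by column
# (the strategy selected with SimpleImputer(strategy="median")).
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

# Step 2: standardization, z = (x - mean) / std, with the population
# standard deviation (ddof=0), the convention StandardScaler follows.
means = X_imputed.mean(axis=0)
stds = X_imputed.std(axis=0)
X_scaled = (X_imputed - means) / stds

print(medians)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After this transformation each column has mean 0 and standard deviation 1, which is why the first training row shown in the notebook contains small values centered around zero.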

In [56]:
X_train[0,:]

array([ 0.30511538,  0.09690863, -0.18719376, -0.25236097])

In [57]:
y_train = train_data["CO2EMISSIONS"]
y_train

880     0
813     0
1007    1
873     0
1051    0
       ..
532     1
497     0
1001    1
934     0
915     1
Name: CO2EMISSIONS, Length: 853, dtype: int64

In [58]:
from sklearn import tree
from sklearn.model_selection import cross_val_score

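# Configuration 1: Gini criterion, best splitter, max_depth from 1 to 10.
models1 = []
scores1 = []

for i in range(10):
  model = tree.DecisionTreeClassifier(criterion="gini",splitter="best",random_state=123,max_depth=i+1)
  model.fit(X_train, y_train)  # fitted model kept for the test-set predictions
  models1.append(model)
  # cross_val_score fits clones of the model, so the fit above is unaffected
  score = cross_val_score(model, X_train, y_train, cv=5)
  scores1.append(score.mean())  # mean accuracy over the 5 folds

print(scores1)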

[0.9624699002407981, 0.9577915376676988, 0.9601307189542483, 0.9613071895424836, 0.9753629170966633, 0.9718403852769179, 0.9659855521155831, 0.9671551427588578, 0.965985552115583, 0.965985552115583]
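In the loop above, the features were imputed and scaled once on the whole training split before cross-validation, so the scaler's statistics include the rows later used as validation folds. For decision trees, which split on thresholds, the effect is probably negligible, but a common way to keep each fold fully independent is to put the preprocessing inside the cross-validated estimator. The sketch below shows that pattern on synthetic data (an assumption, not the vehicle dataset).

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the unscaled feature matrix (assumption: not the real data).
X_raw, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

cv_means = {}
for depth in range(1, 11):
    clf = make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        DecisionTreeClassifier(criterion="gini", random_state=123, max_depth=depth),
    )
    # The whole pipeline is cloned and refit on each training fold,
    # so imputation and scaling never see the validation fold.
    cv_means[depth] = cross_val_score(clf, X_raw, y, cv=5).mean()

best_depth = max(cv_means, key=cv_means.get)
print(best_depth, round(cv_means[best_depth], 4))
```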


In [59]:
# Configuration 2: entropy criterion, best splitter, max_depth from 1 to 10.
models2 = []
scores2 = []

for i in range(10):
  model = tree.DecisionTreeClassifier(criterion="entropy",splitter="best",random_state=123,max_depth=i+1)
  model.fit(X_train, y_train)
  models2.append(model)
  score = cross_val_score(model, X_train, y_train, cv=5)
  scores2.append(score.mean())

print(scores2)

[0.9624699002407981, 0.9624699002407981, 0.9589542483660131, 0.9683178534571724, 0.9730099759201927, 0.9671689026487789, 0.9659924320605435, 0.9671620227038182, 0.9671620227038182, 0.9671620227038182]


In [60]:
# Configuration 3: entropy criterion, random splitter, max_depth from 1 to 10.
models3 = []
scores3 = []

for i in range(10):
  model = tree.DecisionTreeClassifier(criterion="entropy",splitter="random",random_state=123,max_depth=i+1)
  model.fit(X_train, y_train)
  models3.append(model)
  score = cross_val_score(model, X_train, y_train, cv=5)
  scores3.append(score.mean())

print(scores3)

[0.8417268661850704, 0.8487650498796011, 0.9425730994152047, 0.8733883728930169, 0.8874991400068799, 0.933154454764362, 0.9366219470244237, 0.9472033023735811, 0.9437151702786378, 0.9437014103887169]
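The three printed lists can be compared directly by picking, for each configuration, the depth with the highest mean cross-validation accuracy. The sketch below copies the printed scores rounded to four decimals; the helper `best_depth` and the configuration labels are names introduced here for illustration.

```python
# Mean 5-fold CV accuracies copied from the printed lists, rounded to four
# decimals. Position 0 corresponds to max_depth=1.
cv_scores = {
    "gini/best": [0.9625, 0.9578, 0.9601, 0.9613, 0.9754,
                  0.9718, 0.9660, 0.9672, 0.9660, 0.9660],
    "entropy/best": [0.9625, 0.9625, 0.9590, 0.9683, 0.9730,
                     0.9672, 0.9660, 0.9672, 0.9672, 0.9672],
    "entropy/random": [0.8417, 0.8488, 0.9426, 0.8734, 0.8875,
                       0.9332, 0.9366, 0.9472, 0.9437, 0.9437],
}


def best_depth(scores):
    """Return (max_depth, score) for the highest score; ties go to the shallower tree."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return index + 1, scores[index]


for name, scores in cv_scores.items():
    depth, score = best_depth(scores)
    print(f"{name}: best max_depth={depth}, mean CV accuracy={score:.4f}")
```

On these numbers both `splitter="best"` configurations peak at `max_depth=5`, while the random splitter peaks later, at `max_depth=8`, and with a lower score.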


In [61]:
X_test = full_pipeline.transform(test_data)
X_test

array([[-1.24281522, -1.02015972, -0.69440927, -0.56985936],
       [-0.96137329, -1.02015972, -0.74271551, -0.60513696],
       [-0.96137329, -1.02015972, -0.42872495, -0.42874897],
       [-1.24281522, -1.02015972, -0.81517487, -0.60513696],
       [ 0.44583635,  1.21397697,  0.03018432, -0.32291617],
       [ 1.29016213,  2.33104531,  1.86582141,  1.51151897],
       [-0.96137329, -1.02015972, -0.83932799, -0.74624736],
       [ 0.93835972,  1.21397697,  0.89969663,  1.15874298],
       [ 1.50124358,  1.21397697,  1.45521838,  2.18179335],
       [-1.10209426, -1.02015972, -0.83932799, -0.81680255],
       [ 0.16439442,  0.09690863,  0.80308415,  0.629579  ],
       [-1.24281522, -1.02015972, -1.03255294, -0.99319055],
       [-0.96137329, -1.02015972, -0.67025615, -0.81680255],
       [-0.96137329, -1.02015972, -0.91178735, -0.88735775],
       [-1.10209426, -1.02015972, -1.15331854, -1.06374575],
       [ 1.50124358,  1.21397697,  1.04461535,  1.37040858],
       [ 0.16439442,  0.

In [62]:
y_pred1 = []
y_pred2 = []
y_pred3 = []
for i in range(10):
  y_pred1.append(models1[i].predict(X_test))
  y_pred2.append(models2[i].predict(X_test))
  y_pred3.append(models3[i].predict(X_test))
y_pred1

[array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
        0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]),
 array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0

In [63]:
y_pred2

[array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
        0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]),
 array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0

In [64]:
y_pred3

[array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1,
        0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
        1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1,
        1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]),
 array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0

In [65]:
y_test = test_data["CO2EMISSIONS"]
y_test

359    0
735    0
408    0
808    0
31     1
      ..
52     1
446    1
805    0
834    1
238    1
Name: CO2EMISSIONS, Length: 214, dtype: int64

In [66]:
from sklearn.metrics import accuracy_score
accuracies1 = []
accuracies2 = []
accuracies3 = []
for i in range(10):
  accuracies1.append(accuracy_score(y_test,y_pred1[i]))
  accuracies2.append(accuracy_score(y_test,y_pred2[i]))
  accuracies3.append(accuracy_score(y_test,y_pred3[i]))
accuracies1

[0.9626168224299065,
 0.9626168224299065,
 0.9766355140186916,
 0.9579439252336449,
 0.9813084112149533,
 0.9766355140186916,
 0.9766355140186916,
 0.9766355140186916,
 0.9766355140186916,
 0.9766355140186916]

In [67]:
accuracies2

[0.9626168224299065,
 0.9626168224299065,
 0.9579439252336449,
 0.9813084112149533,
 0.9719626168224299,
 0.9813084112149533,
 0.9766355140186916,
 0.9766355140186916,
 0.9766355140186916,
 0.9766355140186916]

In [68]:
accuracies3

[0.8411214953271028,
 0.8037383177570093,
 0.9485981308411215,
 0.883177570093458,
 0.9018691588785047,
 0.9158878504672897,
 0.9672897196261683,
 0.985981308411215,
 0.9766355140186916,
 0.9719626168224299]
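With only 214 test rows, each accuracy above corresponds to a small whole number of misclassified vehicles. The sketch below converts a few of the reported accuracies back into error counts; the helper `misclassified` is a name introduced here for illustration.

```python
N_TEST = 214  # size of the test set reported by the split above


def misclassified(accuracy, n_samples=N_TEST):
    """Number of wrong predictions implied by an accuracy on n_samples rows."""
    return round((1.0 - accuracy) * n_samples)


# A few of the test accuracies reported above.
examples = [
    ("gini/best, max_depth=5", 0.9813084112149533),
    ("gini/best, max_depth=4", 0.9579439252336449),
    ("entropy/random, max_depth=8", 0.985981308411215),
    ("entropy/random, max_depth=2", 0.8037383177570093),
]

for label, acc in examples:
    print(f"{label}: {misclassified(acc)} of {N_TEST} misclassified")
```

The top configurations differ by only one or two test vehicles, so their ranking on this single split should be read with caution.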

In [69]:
model_names = ["model 1", "model 2", "model 3"]
accuracies_list = [accuracies1, accuracies2, accuracies3]

models = []
depths = []
accuracies = []

for model_name, accuracy_list in zip(model_names, accuracies_list):
    for depth, accuracy in enumerate(accuracy_list, start=1):
        models.append(model_name)
        depths.append(depth)
        accuracies.append(accuracy)

df = pd.DataFrame({
    "Model": models,
    "Depth": depths,
    "Accuracy": accuracies
})

df

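|   | Model   | Depth | Accuracy |
|---|---------|-------|----------|
| 0 | model 1 | 1     | 0.962617 |
| 1 | model 1 | 2     | 0.962617 |
| 2 | model 1 | 3     | 0.976636 |
| 3 | model 1 | 4     | 0.957944 |
| 4 | model 1 | 5     | 0.981308 |
| 5 | model 1 | 6     | 0.976636 |
| 6 | model 1 | 7     | 0.976636 |
| 7 | model 1 | 8     | 0.976636 |
| 8 | model 1 | 9     | 0.976636 |
| 9 | model 1 | 10    | 0.976636 |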
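A long table like this one is easier to compare across models once it is pivoted so that each model becomes a column. The sketch below rebuilds the frame from the test accuracies reported earlier (rounded to six decimals) and pivots it. Here "model 1", "model 2" and "model 3" are the gini/best, entropy/best and entropy/random configurations, in that order.

```python
import pandas as pd

# Test accuracies from the evaluation cells, rounded to six decimals.
accuracies1 = [0.962617, 0.962617, 0.976636, 0.957944, 0.981308,
               0.976636, 0.976636, 0.976636, 0.976636, 0.976636]
accuracies2 = [0.962617, 0.962617, 0.957944, 0.981308, 0.971963,
               0.981308, 0.976636, 0.976636, 0.976636, 0.976636]
accuracies3 = [0.841121, 0.803738, 0.948598, 0.883178, 0.901869,
               0.915888, 0.967290, 0.985981, 0.976636, 0.971963]

# Long format: one row per (model, depth) pair, as in the table above.
rows = [
    {"Model": name, "Depth": depth, "Accuracy": acc}
    for name, accs in [("model 1", accuracies1),
                       ("model 2", accuracies2),
                       ("model 3", accuracies3)]
    for depth, acc in enumerate(accs, start=1)
]
df = pd.DataFrame(rows)

# Wide format: one row per depth, one column per model.
wide = df.pivot(index="Depth", columns="Model", values="Accuracy")
print(wide)

# Depth with the highest test accuracy per model (first occurrence on ties).
print(wide.idxmax())
```

Choosing a depth from test accuracy gives an optimistic estimate; the cross-validation scores are the safer basis for selection, with the test set reserved for the final check.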
