## Decision tree

This model predicts the congestion surcharge applied to a trip, which depends on traffic:
* congestion_surcharge = 0 ---> no traffic
* congestion_surcharge = 2 ---> traffic

The required libraries are imported.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from joblib import dump

The model is trained on three datasets of green taxi trips covering three consecutive months of 2023 (October, November, and December). The three files are concatenated into a single DataFrame.

In [2]:
d0 = pd.read_parquet("C:/Users/Acer/Downloads/green_tripdata_2023-12_limpio.parquet")
df1 = pd.read_parquet("C:/Users/Acer/Downloads/green_tripdata_2023-11_limpio.parquet")
df2 = pd.read_parquet("C:/Users/Acer/Downloads/green_tripdata_2023-10_limpio.parquet")
df = pd.concat([d0, df1, df2])

In [3]:
df.shape

(183020, 12)
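
As a small illustration of the concatenation step, the sketch below uses two toy frames (hypothetical values, not the real parquet files) to show that `pd.concat` keeps each frame's original row labels unless `ignore_index=True` is passed. This would be consistent with the `Index: ... 0 to 61499` line that `df.info()` reports for far more than 61500 rows.

```python
import pandas as pd

# Toy stand-ins for two monthly files (hypothetical values).
part_a = pd.DataFrame({"trip_distance": [4.8, 2.99]})
part_b = pd.DataFrame({"trip_distance": [2.42, 1.23]})

# Default behaviour: row labels from each frame are kept, so they repeat.
combined = pd.concat([part_a, part_b])

# ignore_index=True assigns a fresh 0..n-1 index to the result.
reindexed = pd.concat([part_a, part_b], ignore_index=True)

print(combined.index.tolist())
print(reindexed.index.tolist())
```

Repeated labels do not affect the training below, but they matter if rows are later selected with `df.loc[label]`, which would return one row per source file.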

The dataset is filtered to keep only rows where 'congestion_surcharge' is greater than -1, since -1 represents null values.

In [4]:
df = df[df['congestion_surcharge'] > -1]
df.shape

(169725, 12)
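
The filter can be sanity-checked by comparing the number of dropped rows with the number of -1 sentinels. The sketch below does this on a toy frame (hypothetical values); the same comparison could be run on `df` before it is overwritten.

```python
import pandas as pd

# Toy column with two -1 sentinels (hypothetical values).
toy = pd.DataFrame({"congestion_surcharge": [0, 2, -1, 0, -1]})

# Count the sentinels before filtering.
n_sentinel = int((toy["congestion_surcharge"] == -1).sum())

# Same condition as in the notebook: keep values greater than -1.
kept = toy[toy["congestion_surcharge"] > -1]

# If these differ, the filter removed something other than the sentinel.
n_dropped = len(toy) - len(kept)
print(n_sentinel, n_dropped)
```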

In [5]:
df['congestion_surcharge'].value_counts()

congestion_surcharge
0    120670
2     49055
Name: count, dtype: int64
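
The counts above show that the two classes are not balanced. A short arithmetic sketch, using only the numbers printed in this notebook, gives the share of rows removed by the filter and the accuracy that a trivial "always predict the majority class" rule would obtain, which is a useful reference when reading any accuracy figure for this target.

```python
# Figures copied from the notebook outputs above.
rows_before = 183020
rows_after = 169725
counts = {0: 120670, 2: 49055}

# Rows removed by the congestion_surcharge > -1 filter.
removed = rows_before - rows_after
removed_share = removed / rows_before

# Accuracy of always predicting the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / sum(counts.values())

print(f"Removed rows: {removed} ({removed_share:.2%})")
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```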

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 169725 entries, 0 to 61499
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   start_trip            169725 non-null  datetime64[us]
 1   end_trip              169725 non-null  datetime64[us]
 2   pu_location_id        169725 non-null  int64         
 3   do_location_id        169725 non-null  int64         
 4   passenger_count       169725 non-null  int64         
 5   trip_distance         169725 non-null  float64       
 6   fare_amount           169725 non-null  float64       
 7   tip_amount            169725 non-null  float64       
 8   total_amount          169725 non-null  float64       
 9   payment_type          169725 non-null  int64         
 10  congestion_surcharge  169725 non-null  int64         
 11  type_of_taxi          169725 non-null  int64         
dtypes: datetime64[us](2), float64(4), int64(6)
memory usage: 16.8 MB

In [7]:
df.head(1)

| | start_trip | end_trip | pu_location_id | do_location_id | passenger_count | trip_distance | fare_amount | tip_amount | total_amount | payment_type | congestion_surcharge | type_of_taxi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-12-01 00:27:37 | 2023-12-01 00:42:48 | 74 | 243 | 1 | 4.8 | 22.6 | 5.02 | 30.12 | 1 | 0 | 1 |


The trip start and end timestamps are used to generate the following columns: year, month, day of the week, and trip duration in minutes.

In [15]:
df['year'] = df['start_trip'].dt.year
df['month'] = df['start_trip'].dt.month
df['day'] = df['start_trip'].dt.dayofweek
df['min_trip'] = (df['end_trip'] - df['start_trip']).dt.total_seconds() / 60
df.sample(5)

| | start_trip | end_trip | pu_location_id | do_location_id | passenger_count | trip_distance | fare_amount | tip_amount | total_amount | payment_type | congestion_surcharge | type_of_taxi | year | month | day | min_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5925 | 2023-12-04 08:43:15 | 2023-12-04 08:57:30 | 236 | 143 | 1 | 2.99 | 16.3 | 4.11 | 24.66 | 1 | 2 | 1 | 2023 | 12 | 0 | 14.25 |
| 50976 | 2023-10-26 18:40:12 | 2023-10-26 18:49:36 | 75 | 75 | 1 | 1.23 | 10.0 | 0.0 | 14.0 | -1 | 0 | 1 | 2023 | 10 | 3 | 9.4 |
| 14612 | 2023-12-08 01:28:06 | 2023-12-08 01:28:09 | 42 | 42 | 1 | 0.07 | 15.0 | 0.0 | 16.0 | 1 | 0 | 1 | 2023 | 12 | 4 | 0.05 |
| 18896 | 2023-11-10 08:13:15 | 2023-11-10 08:28:46 | 69 | 78 | 1 | 2.42 | 16.3 | 1.0 | 18.8 | 1 | 0 | 1 | 2023 | 11 | 4 | 15.516667 |
| 44625 | 2023-10-23 17:22:04 | 2023-10-23 17:56:54 | 97 | 177 | 1 | 4.7 | 31.0 | 0.0 | 35.0 | 1 | 0 | 1 | 2023 | 10 | 0 | 34.833333 |
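
The date features can be reproduced on a two-row frame built from timestamps shown in the sample above. This is a minimal sketch: it uses `DataFrame.assign` on a standalone frame rather than assigning into the filtered `df`, which is one way (an assumption about preference, not the notebook's method) to avoid pandas' chained-assignment warning. Note that `dt.dayofweek` returns 0 for Monday through 6 for Sunday, so the `day` column is the day of the week, not the day of the month.

```python
import pandas as pd

# Two trips taken from the sample output above.
trips = pd.DataFrame({
    "start_trip": pd.to_datetime(["2023-12-04 08:43:15", "2023-10-26 18:40:12"]),
    "end_trip": pd.to_datetime(["2023-12-04 08:57:30", "2023-10-26 18:49:36"]),
})

# Derive the same four columns as the notebook.
features = trips.assign(
    year=trips["start_trip"].dt.year,
    month=trips["start_trip"].dt.month,
    day=trips["start_trip"].dt.dayofweek,  # 0 = Monday, 6 = Sunday
    min_trip=(trips["end_trip"] - trips["start_trip"]).dt.total_seconds() / 60,
)

print(features[["year", "month", "day", "min_trip"]])
```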


In [9]:
df.columns

Index(['start_trip', 'end_trip', 'pu_location_id', 'do_location_id',
       'passenger_count', 'trip_distance', 'fare_amount', 'tip_amount',
       'total_amount', 'payment_type', 'congestion_surcharge', 'type_of_taxi',
       'year', 'month', 'day', 'min_trip'],
      dtype='object')

In [10]:
df.shape

(169725, 16)

In [11]:
# Decision tree

In [12]:
from sklearn import tree
# Select the independent variables
X = df[['min_trip', 'trip_distance', 'pu_location_id', 'do_location_id']]

# One-hot encode categorical variables (no effect here: all four columns are numeric)
X = pd.get_dummies(X)

# Select the dependent variable
y = df['congestion_surcharge']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create the decision tree classifier
arbol = tree.DecisionTreeClassifier(max_depth = 7, random_state=42)

# Train the model
arbol.fit(X_train, y_train)

# Evaluate the model (mean accuracy on the test set)
score = arbol.score(X_test, y_test)

# Predict the congestion surcharge
predicciones = arbol.predict(X_test)

# Print the model score
print(f"Model score: {score}")

# Print the first 10 predictions
#print(f"First 10 predictions: {predicciones[:10]}")

# Plot the decision tree
#tree.plot_tree(arbol)

Model score: 0.9317967571644042
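
The training cell calls `pd.get_dummies(X)` on four numeric columns. By default `get_dummies` only encodes object, string, and category columns, so the integer location IDs pass through unchanged and the tree treats them as ordinary numbers. The sketch below, on a toy frame with hypothetical values, shows the difference between the default call and an explicit `columns=` argument. Whether one-hot encoding the IDs would help this model is not something the notebook tests.

```python
import pandas as pd

# Toy frame with an integer-coded location ID (hypothetical values).
toy = pd.DataFrame({
    "trip_distance": [4.8, 2.99, 1.23],
    "pu_location_id": [74, 236, 74],
})

# Default call: numeric columns are left as they are.
unchanged = pd.get_dummies(toy)

# Explicit columns= forces one-hot encoding of the integer ID column.
encoded = pd.get_dummies(toy, columns=["pu_location_id"])

print(unchanged.columns.tolist())
print(encoded.columns.tolist())
```

With several hundred distinct zone IDs per column, explicit encoding would add a correspondingly large number of columns, which is a trade-off to weigh before changing the feature set.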


In [13]:
resultados = pd.DataFrame({'Valor real': y_test, 'Valor predicho': predicciones})

# Print the DataFrame
print(resultados.to_string())

       Valor real  Valor predicho
26660           0               0
41798           0               0
3571            2               2
38150           2               2
38007           0               0
30074           2               2
24634           0               0
44661           2               2
13178           0               2
51282           0               0
18614           2               2
41222           2               2
35312           0               0
55064           0               0
6112            0               0
3604            0               0
16049           0               0
3592            0               0
59098           0               0
49703           0               0
49983           2               2
31972           2               2
52620           2               2
57356           2               2
34723           2               2
25387           2               2
20960           2               2
58929           2               2
22001         

The model reaches a score (mean accuracy on the test set) of 0.93. The output above lists the actual values ('Valor real') next to the predicted values ('Valor predicho').
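
A confusion matrix summarizes the same comparison more compactly than a row-by-row listing. The sketch below uses only the 28 complete rows visible in the truncated output above, so its numbers describe that small excerpt and not the full test set; on the real data, `y_test` and `predicciones` would be passed instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The 28 complete rows visible in the printed output, in order.
y_true = [0, 0, 2, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0,
          0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2]
y_pred = [0, 0, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0,
          0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2]

# Rows are actual classes, columns are predicted classes, in the order [0, 2].
cm = confusion_matrix(y_true, y_pred, labels=[0, 2])
excerpt_accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(f"Accuracy on the excerpt: {excerpt_accuracy:.3f}")
```

In this excerpt the single error is a trip without surcharge that was predicted as having one (index 13178).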

The model is saved in joblib format so that it can later be consumed by Streamlit.

In [14]:
dump(arbol, 'E:/proyecto_final/Modelo/prediccion_congestion.joblib')

['E:/proyecto_final/Modelo/prediccion_congestion.joblib']
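
On the consuming side, the saved file can be read back with `joblib.load`. The following is a minimal round-trip sketch under stated assumptions: it trains a toy tree on four rows from the sample output (not the real model), writes it to a temporary directory with a hypothetical file name, and predicts for one hypothetical trip. The input must have the same feature columns, in the same order, as the data used for training.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn import tree

# Same feature columns and order as in the training cell.
feature_columns = ["min_trip", "trip_distance", "pu_location_id", "do_location_id"]

# Four rows taken from the sample output; a toy stand-in for the real data.
X = pd.DataFrame(
    [[14.25, 2.99, 236, 143],
     [9.4, 1.23, 75, 75],
     [15.516667, 2.42, 69, 78],
     [34.833333, 4.7, 97, 177]],
    columns=feature_columns,
)
y = [2, 0, 0, 0]

model = tree.DecisionTreeClassifier(max_depth=7, random_state=42).fit(X, y)

# Save and reload through a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(model, model_path)
    loaded = load(model_path)

# Predict for one hypothetical trip, using the same column layout.
new_trip = pd.DataFrame([[12.0, 2.5, 236, 143]], columns=feature_columns)
prediction = loaded.predict(new_trip)
print(prediction)
```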