# **Master's in Applied Artificial Intelligence**
## **Course: Big Data Analysis**
### Tecnológico de Monterrey
### Prof. Iván Olmos

## **Week 03 Activity**

### **Project: Big Data Database**

##### Names and student IDs of the team members:
*   Victoria Melgarejo Cabrera - A01795030
*   Héctor Alejandro Alvarez Rosas - A01796262
*   Andrea Xcaret Gomez Alfaro - A01796384
*   Mario Guillen De La Torre - A01796701


---


#### **Database Description:**

This notebook processes the Chicago taxi trips dataset, applying partitioning rules based on payment type and pickup zone. It then extracts representative subsamples that will be used to analyze tipping behavior.



---

### **Library Imports**

In [85]:
# Install PySpark in Colab
!pip install pyspark



In [108]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Note: min and max imported here shadow the Python built-ins of the same name.
from pyspark.sql.functions import (
    col, isnan, when, count, percentile_approx, min, max, mean, stddev,
    approx_count_distinct, expr, hour, dayofweek, unix_timestamp, to_timestamp,
)

In [87]:
# Display helper function
from IPython.display import display, HTML

def pretty_display(df, limit=1000):
    """
    Convert a PySpark DataFrame to pandas and show it as an HTML table with horizontal scrolling.

    Args:
        df (pyspark.sql.DataFrame): The PySpark DataFrame to display.
        limit (int): Maximum number of rows to display. Defaults to 1000.
    """
    pdf = df.limit(limit).toPandas()
    display(HTML(pdf.to_html(notebook=True)))

### **Creating the Spark Session**

In [88]:
spark = SparkSession.builder \
    .appName("ChicagoTaxyTripsAnalysis") \
    .getOrCreate()

### **Loading the Dataset**

In [89]:
filename = "Taxi_Trips.csv"

dftaxytrips = spark.read.csv(filename, header=True, inferSchema=True)

In [90]:
print("Number of records:", dftaxytrips.count())
print("Number of columns:", len(dftaxytrips.columns))

Number of records: 7917844
Number of columns: 23

### **Data Exploration**

In [91]:
# Dataset structure
dftaxytrips.printSchema()

root
 |-- Trip ID: string (nullable = true)
 |-- Taxi ID: string (nullable = true)
 |-- Trip Start Timestamp: string (nullable = true)
 |-- Trip End Timestamp: string (nullable = true)
 |-- Trip Seconds: integer (nullable = true)
 |-- Trip Miles: double (nullable = true)
 |-- Pickup Census Tract: long (nullable = true)
 |-- Dropoff Census Tract: long (nullable = true)
 |-- Pickup Community Area: integer (nullable = true)
 |-- Dropoff Community Area: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tips: double (nullable = true)
 |-- Tolls: double (nullable = true)
 |-- Extras: double (nullable = true)
 |-- Trip Total: double (nullable = true)
 |-- Payment Type: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Pickup Centroid Latitude: double (nullable = true)
 |-- Pickup Centroid Longitude: double (nullable = true)
 |-- Pickup Centroid Location: string (nullable = true)
 |-- Dropoff Centroid Latitude: double (nullable = true)
 |-- Dropoff Centroid 

In [92]:
# General statistics
pretty_display(dftaxytrips.summary())

Unnamed: 0,summary,Trip ID,Taxi ID,Trip Start Timestamp,Trip End Timestamp,Trip Seconds,Trip Miles,Pickup Census Tract,Dropoff Census Tract,Pickup Community Area,Dropoff Community Area,Fare,Tips,Tolls,Extras,Trip Total,Payment Type,Company,Pickup Centroid Latitude,Pickup Centroid Longitude,Pickup Centroid Location,Dropoff Centroid Latitude,Dropoff Centroid Longitude,Dropoff Centroid Location
0,count,7917844,7917844,7917844,7917778,7916303.0,7917775.0,3392900.0,3281682.0,7691067.0,7174891.0,7897269.0,7897269.0,7897269.0,7897269.0,7897269.0,7917844,7917844,7695376.0,7695376.0,7695376,7220191.0,7220191.0,7220191
1,mean,Infinity,,,,1254.9250153764958,6.802327488467674,17031508395.342648,17031413941.38261,36.09328575605959,26.293705646538744,22.89300292290886,2.903004674147452,0.0294344956465329,2.1529685008830275,28.182303074643663,,,41.901399650736,-87.7019076253514,,41.8921376218955,-87.66080016228891,
2,stddev,,,,,1648.0977197601726,7.957195844290252,374237.3662004645,342509.8545557547,26.291497670579886,20.81465342793271,33.2323786885119,4.284313925861233,5.0707919401752175,10.395597618506333,38.30632012119162,,,0.0652524674536966,0.1145551443830559,,0.0584354423127849,0.0720059893262844,
3,min,0000006aa752d456d05c6eeb43b057adb1ffa540,000daaa11a2d961100513e232a1ce05391c5d797d2dc56...,01/01/2024 01:00:00 AM,01/01/2024 01:00:00 AM,0.0,0.0,17031010100.0,17031010100.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,Cash,2733 - 74600 Benny Jona,41.650221676,-87.913624596,POINT (-87.5307124836 41.7030053028),41.650221676,-87.913624596,POINT (-87.5349029012 41.707311449)
4,25%,Infinity,,,,480.0,1.03,17031081700.0,17031081500.0,8.0,8.0,8.5,0.0,0.0,0.0,10.25,,,41.878865584,-87.750934289,,41.878865584,-87.66351755,
5,50%,Infinity,,,,943.0,3.34,17031320400.0,17031320100.0,32.0,28.0,16.0,0.57,0.0,0.0,19.22,,,41.89503345,-87.642648998,,41.892042136,-87.633308037,
6,75%,Infinity,,,,1714.0,11.98,17031980000.0,17031839100.0,63.0,32.0,34.5,4.0,0.0,2.25,43.22,,,41.97907082,-87.625192142,,41.922686284,-87.625192142,
7,max,ffffffdda8f2f9f98cf474cce05b7e5e34dc25e4,ffda53354c610fd3af1aee46d723028a49014e35f7280c...,12/31/2024 12:45:00 PM,12/31/2024 12:45:00 PM,86396.0,3397.8,17031980100.0,17031980100.0,77.0,77.0,9999.75,400.0,5550.0,5559.5,9999.75,Unknown,Wolley Taxi,42.021223593,-87.530712484,POINT (-87.913624596 41.9802643146),42.021223593,-87.534902901,POINT (-87.913624596 41.9802643146)


In [93]:
# Missing-value analysis for 'dftaxytrips'
missing_taxytrips = dftaxytrips.select([
    count(when(col(c).isNull() | isnan(c), c)).alias(c)
    for c in dftaxytrips.columns
])

print("Missing values in the Chicago Taxi Trips dataset (CSV):")
pretty_display(missing_taxytrips)

Missing values in the Chicago Taxi Trips dataset (CSV):

Unnamed: 0,Trip ID,Taxi ID,Trip Start Timestamp,Trip End Timestamp,Trip Seconds,Trip Miles,Pickup Census Tract,Dropoff Census Tract,Pickup Community Area,Dropoff Community Area,Fare,Tips,Tolls,Extras,Trip Total,Payment Type,Company,Pickup Centroid Latitude,Pickup Centroid Longitude,Pickup Centroid Location,Dropoff Centroid Latitude,Dropoff Centroid Longitude,Dropoff Centroid Location
0,0,0,0,66,1541,69,4524944,4636162,226777,742953,20575,20575,20575,20575,20575,0,0,222468,222468,222468,697653,697653,697653


### **Characterization Variables**

In [94]:
# Parse Trip Start Timestamp (loaded as a string) into a timestamp column
dftaxytrips = dftaxytrips.withColumn(
    "trip_start_ts",
    to_timestamp(col("Trip Start Timestamp"), "MM/dd/yyyy hh:mm:ss a")
)

# Hour of day
dftaxytrips = dftaxytrips.withColumn("trip_hour", hour(col("trip_start_ts")))

# Day of week (1 = Sunday, 7 = Saturday)
dftaxytrips = dftaxytrips.withColumn("trip_day_of_week", dayofweek(col("trip_start_ts")))

# Trip duration in minutes
dftaxytrips = dftaxytrips.withColumn("duration_minutes", col("Trip Seconds") / 60)

# Tip/Fare ratio
dftaxytrips = dftaxytrips.withColumn("tip_ratio",
    when(col("Fare") > 0, col("Tips") / col("Fare")).otherwise(0))

# Tip/Trip Miles ratio
dftaxytrips = dftaxytrips.withColumn("tip_per_mile",
    when(col("Trip Miles") > 0, col("Tips") / col("Trip Miles")).otherwise(0))

# Payment method grouping
dftaxytrips = dftaxytrips.withColumn("payment_group",
    when(col("Payment Type") == "Credit Card", "Credit Card")
    .when(col("Payment Type") == "Cash", "Cash")
    .when(col("Payment Type") == "Mobile", "Mobile")
    .otherwise("Other"))

# Company grouping
dftaxytrips = dftaxytrips.withColumn("company_group",
    when(col("Company") == "Flash Cab", "Flash Cab")
    .when(col("Company") == "Taxi Affiliation Services", "Taxi Affiliation")
    .when(col("Company") == "Taxicab Insurance Agency Llc", "Insurance Agency")
    .when(col("Company") == "Sun Taxi", "Sun Taxi")
    .when(col("Company") == "City Service", "City Service")
    .when(col("Company") == "Chicago Independents", "Chicago Independents")
    .otherwise("Other"))

# Pickup zone grouping (string labels in every branch so the column has a single type)
dftaxytrips = dftaxytrips.withColumn("pickup_zone_group",
    when(col("Pickup Community Area") == 76, "76")
    .when(col("Pickup Community Area") == 8, "8")
    .when(col("Pickup Community Area") == 32, "32")
    .when(col("Pickup Community Area") == 28, "28")
    .otherwise("Other"))

# Dropoff zone grouping (string labels in every branch so the column has a single type)
dftaxytrips = dftaxytrips.withColumn("dropoff_zone_group",
    when(col("Dropoff Community Area") == 8, "8")
    .when(col("Dropoff Community Area") == 32, "32")
    .when(col("Dropoff Community Area") == 28, "28")
    .when(col("Dropoff Community Area") == 76, "76")
    .otherwise("Other"))

# Rename selected columns
dftaxytrips = dftaxytrips.withColumnRenamed("Trip ID", "trip_id")
dftaxytrips = dftaxytrips.withColumnRenamed("Trip Miles", "trip_miles")
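
The timestamp-derived features above can be checked by hand on a single value. The following is a minimal plain-Python sketch, not part of the Spark pipeline: the helper name `spark_style_features` is hypothetical, and it assumes the Spark pattern `MM/dd/yyyy hh:mm:ss a` corresponds to the `strptime` format `%m/%d/%Y %I:%M:%S %p` in an English locale. It reproduces the day-of-week convention noted in the cell (1 = Sunday, 7 = Saturday).

```python
from datetime import datetime


def spark_style_features(ts_text):
    """Return (hour, day_of_week) for a 'MM/dd/yyyy hh:mm:ss AM/PM' string.

    day_of_week follows the convention used in the notebook:
    1 = Sunday, 2 = Monday, ..., 7 = Saturday.
    """
    ts = datetime.strptime(ts_text, "%m/%d/%Y %I:%M:%S %p")
    # isoweekday() returns Monday=1 ... Sunday=7; shift it so Sunday becomes 1.
    day_of_week = ts.isoweekday() % 7 + 1
    return ts.hour, day_of_week


# Sample value taken from the 'min' row of the summary table.
print(spark_style_features("01/01/2024 01:00:00 AM"))
```

January 1, 2024 was a Monday, so the helper returns hour 1 and day of week 2 under this convention.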

In [95]:
columnas_seleccionadas = [
    "trip_id",
    "trip_hour",
    "trip_day_of_week",
    "duration_minutes",
    "trip_miles",
    "tip_ratio",
    "tip_per_mile",
    "payment_group",
    "company_group",
    "pickup_zone_group",
    "dropoff_zone_group"
]

# Select specific columns
dftaxytrips_selected = dftaxytrips.select(*columnas_seleccionadas)
pretty_display(dftaxytrips_selected, 5)

Unnamed: 0,trip_id,trip_hour,trip_day_of_week,duration_minutes,trip_miles,tip_ratio,tip_per_mile,payment_group,company_group,pickup_zone_group,dropoff_zone_group
0,0000184e7cd53cee95af32eba49c44e4d20adcd8,17,6,67.516667,17.12,0.21978,0.584112,Credit Card,Flash Cab,76,32
1,000072ee076c9038868e239ca54185eb43959db0,14,1,29.15,12.7,0.0,0.0,Cash,Flash Cab,Other,Other
2,000074019d598c2b1d6e77fbae79e40b0461a2fc,9,6,8.616667,3.39,0.254812,0.820059,Mobile,Insurance Agency,Other,8
3,00007572c5f92e2ff067e6f838a5ad74e83665d3,8,2,34.166667,15.06,0.288153,0.750996,Credit Card,Other,76,Other
4,00007c3e7546e2c7d15168586943a9c22c3856cf,19,5,16.733333,1.18,0.233375,3.152542,Mobile,Other,32,32


### **Population Characterization**

In [96]:
# Missing-value analysis for 'dftaxytrips_selected'
missing_taxytrips = dftaxytrips_selected.select([
    count(when(col(c).isNull() | isnan(c), c)).alias(c)
    for c in dftaxytrips_selected.columns
])

print("Missing values in the Chicago Taxi Trips dataset (CSV):")
missing_taxytrips.show()

Missing values in the Chicago Taxi Trips dataset (CSV):




+-------+---------+----------------+----------------+----------+---------+------------+-------------+-------------+-----------------+------------------+
|trip_id|trip_hour|trip_day_of_week|duration_minutes|trip_miles|tip_ratio|tip_per_mile|payment_group|company_group|pickup_zone_group|dropoff_zone_group|
+-------+---------+----------------+----------------+----------+---------+------------+-------------+-------------+-----------------+------------------+
|      0|        0|               0|            1541|        69|        0|       18862|            0|            0|                0|                 0|
+-------+---------+----------------+----------------+----------+---------+------------+-------------+-------------+-----------------+------------------+



                                                                                

In [97]:
# Define the variable groups
numerical_vars = ["duration_minutes", "trip_miles", "tip_ratio", "tip_per_mile"]

categorical_vars = [
    "trip_hour", "trip_day_of_week",
    "payment_group", "company_group",
    "pickup_zone_group", "dropoff_zone_group"
]

In [98]:
# List of numerical variables to impute
vars_a_imputar = ["duration_minutes", "trip_miles", "tip_ratio", "tip_per_mile"]

# Impute missing values with the approximate median
for var in vars_a_imputar:
    mediana = dftaxytrips_selected.approxQuantile(var, [0.5], 0.01)[0]
    dftaxytrips_selected = dftaxytrips_selected.withColumn(
        var, when(col(var).isNull(), mediana).otherwise(col(var))
    )

In [99]:
# Numerical statistics
stats_exprs = [min(c).alias("min_" + c) for c in numerical_vars] + \
              [max(c).alias("max_" + c) for c in numerical_vars] + \
              [mean(c).alias("mean_" + c) for c in numerical_vars] + \
              [stddev(c).alias("stddev_" + c) for c in numerical_vars]

numerical_stats = dftaxytrips_selected.agg(*stats_exprs)

# Add percentiles 20, 33, 40, 60, 66, 80
quantile_probs = [0.2, 0.33, 0.4, 0.6, 0.66, 0.8]
quantiles = {}

for var in numerical_vars:
    quantiles[var] = dftaxytrips_selected.approxQuantile(var, quantile_probs, 0.01)

# Convert to pandas
numerical_stats_pd = numerical_stats.toPandas().T.reset_index()
numerical_stats_pd.columns = ["statistic", "value"]
numerical_stats_pd[["metric", "variable"]] = numerical_stats_pd["statistic"].str.split("_", n=1, expand=True)
numerical_stats_summary = numerical_stats_pd.pivot(index="variable", columns="metric", values="value").reset_index()

# Add the percentiles to the table
for var in numerical_stats_summary["variable"]:
    q_values = quantiles.get(var, [None] * 6)
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "q20"] = q_values[0]
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "p33"] = q_values[1]
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "q40"] = q_values[2]
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "q60"] = q_values[3]
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "p66"] = q_values[4]
    numerical_stats_summary.loc[numerical_stats_summary["variable"] == var, "q80"] = q_values[5]


                                                                                

In [100]:
print("Numerical statistics:")
display(numerical_stats_summary)

Numerical statistics:


| variable | max | mean | min | stddev | q20 | p33 | q40 | q60 | p66 | q80 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| duration_minutes | 1439.933333 | 20.914369 | 0.0 | 27.465725 | 6.8 | 10.0 | 12.0 | 20.233333 | 23.166667 | 32.0 |
| tip_per_mile | 6040.0 | 1.632858 | 0.0 | 27.056425 | 0.0 | 0.0 | 0.0 | 0.517045 | 0.5798 | 0.907407 |
| tip_ratio | 1515.0 | 0.14367 | 0.0 | 1.405698 | 0.0 | 0.0 | 0.0 | 0.195122 | 0.214286 | 0.252514 |
| trip_miles | 3397.8 | 6.802297 | 0.0 | 7.957168 | 0.8 | 1.43 | 1.9 | 6.64 | 9.35 | 13.5 |


In [101]:
# Categorical statistics
categorical_summary = []
for var in categorical_vars:
    mode_df = dftaxytrips_selected.groupBy(col(var)).count().orderBy(col("count").desc()).limit(1)
    mode_row = mode_df.collect()[0]
    unique_count = dftaxytrips_selected.select(approx_count_distinct(col(var)).alias("unique_count")).collect()[0]["unique_count"]
    categorical_summary.append({
        "variable": var,
        "unique_values": unique_count,
        "mode": mode_row[var],
        "mode_count": mode_row["count"]
    })

categorical_stats_summary = pd.DataFrame(categorical_summary)

                                                                                

In [102]:
print("Categorical statistics:")
display(categorical_stats_summary)

Categorical statistics:


| variable | unique_values | mode | mode_count |
| --- | --- | --- | --- |
| trip_hour | 25 | 17 | 569001 |
| trip_day_of_week | 7 | 5 | 1304627 |
| payment_group | 4 | Credit Card | 3066154 |
| company_group | 7 | Other | 1901457 |
| pickup_zone_group | 5 | Other | 2570284 |
| dropoff_zone_group | 5 | Other | 3703753 |


For the numerical variables, we created three segments per variable, using the 33rd and 66th percentiles (p33 and p66 in the table above) as approximate cutoffs:

In [103]:
# Trip duration (in minutes)
dftaxytrips_selected = dftaxytrips_selected.withColumn(
    "duration_group",
    (
        when(col("duration_minutes") <= 10.0, "Flash Riders")           # very short, high-turnover trips
        .when(col("duration_minutes") <= 23.2, "Urban Cruisers")        # typical trips within the city
        .otherwise("Long-Haul Nomads")                                  # long trips, possibly between distant districts
    )
)

# Tip per mile
dftaxytrips_selected = dftaxytrips_selected.withColumn(
    "tip_per_mile_group",
    (
        when(col("tip_per_mile") <= 0.0, "Non-Tippers")                 # riders who leave no tip
        .when(col("tip_per_mile") <= 0.58, "Appreciative Riders")       # moderate tip for the distance
        .otherwise("Tip Enthusiasts")                                   # generous tip per mile traveled
    )
)

# Tip-to-fare ratio
dftaxytrips_selected = dftaxytrips_selected.withColumn(
    "tip_ratio_group",
    (
        when(col("tip_ratio") <= 0.0, "Flat Fare Clients")              # leave no tip
        .when(col("tip_ratio") <= 0.21, "Grateful Givers")              # moderate tip relative to the fare
        .otherwise("High-Spirit Donors")                                # high tip-to-fare ratio
    )
)

# Trip distance (in miles)
dftaxytrips_selected = dftaxytrips_selected.withColumn(
    "trip_miles_group",
    (
        when(col("trip_miles") <= 1.4, "Neighborhood Navigators")       # very short distances within the area
        .when(col("trip_miles") <= 9.4, "City Explorers")               # average distances within the city
        .otherwise("Wide-Radius Riders")                                # long trips, possibly to the airport or the outskirts
    )
)

In [104]:
pretty_display(dftaxytrips_selected, 5)

Unnamed: 0,trip_id,trip_hour,trip_day_of_week,duration_minutes,trip_miles,tip_ratio,tip_per_mile,payment_group,company_group,pickup_zone_group,dropoff_zone_group,duration_group,tip_per_mile_group,tip_ratio_group,trip_miles_group
0,0000184e7cd53cee95af32eba49c44e4d20adcd8,17,6,67.516667,17.12,0.21978,0.584112,Credit Card,Flash Cab,76,32,Long-Haul Nomads,Tip Enthusiasts,High-Spirit Donors,Wide-Radius Riders
1,000072ee076c9038868e239ca54185eb43959db0,14,1,29.15,12.7,0.0,0.0,Cash,Flash Cab,Other,Other,Long-Haul Nomads,Non-Tippers,Flat Fare Clients,Wide-Radius Riders
2,000074019d598c2b1d6e77fbae79e40b0461a2fc,9,6,8.616667,3.39,0.254812,0.820059,Mobile,Insurance Agency,Other,8,Flash Riders,Tip Enthusiasts,High-Spirit Donors,City Explorers
3,00007572c5f92e2ff067e6f838a5ad74e83665d3,8,2,34.166667,15.06,0.288153,0.750996,Credit Card,Other,76,Other,Long-Haul Nomads,Tip Enthusiasts,High-Spirit Donors,Wide-Radius Riders
4,00007c3e7546e2c7d15168586943a9c22c3856cf,19,5,16.733333,1.18,0.233375,3.152542,Mobile,Other,32,32,Urban Cruisers,Tip Enthusiasts,High-Spirit Donors,Neighborhood Navigators
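
The three-way segmentation applied in the cell above follows one pattern: two cutoffs and three labels. As a minimal plain-Python sketch (the helper `three_way_group` is hypothetical and is not used by the notebook), the same rule can be written once and checked against `duration_minutes` values from the rows displayed above:

```python
def three_way_group(value, low_cut, high_cut, labels):
    """Assign one of three labels using two inclusive upper cutoffs."""
    if value <= low_cut:
        return labels[0]
    if value <= high_cut:
        return labels[1]
    return labels[2]


# Cutoffs and labels copied from the duration_group definition in the notebook.
DURATION_LABELS = ("Flash Riders", "Urban Cruisers", "Long-Haul Nomads")

# duration_minutes values taken from the displayed sample rows.
for minutes in (8.616667, 16.733333, 67.516667):
    print(minutes, three_way_group(minutes, 10.0, 23.2, DURATION_LABELS))
```

Both cutoffs are inclusive, matching the `<=` comparisons in the Spark `when` chain, so a trip of exactly 10.0 minutes falls in the first group.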


### **Partitioning Rule**

Since our goal is to analyze tips in Chicago taxi trips and predict relevant patterns, we considered these three variables:


*   **`payment_group`:** Payment method grouping: Credit Card, Cash, Mobile, Other. It is considered key because of the strong relationship between card payments and the tip given.
*   **`pickup_zone_group`:** Pickup zone grouping: specific community areas (76, 8, 32, 28) plus an "Other" group that covers the remaining zones. It acts as a proxy for socioeconomic or commercial location.
*   **`duration_group`:** Trip duration class: Flash Riders (≤10 min), Urban Cruisers (10 to 23.2 min), Long-Haul Nomads (>23.2 min). It captures the intensity and context of the trip.

We also consider that these variables capture key behavioral factors related to the passenger's decision to leave a tip.
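
As a quick sanity check on the size of the partition space, the three variables have 4, 5 and 3 levels respectively, so there are at most 4 × 5 × 3 = 60 combinations. The sketch below enumerates them in plain Python; it gives only an upper bound, since some combinations may contain no trips in the actual data.

```python
from itertools import product

# Levels of each partitioning variable as defined in the notebook.
payment_groups = ["Credit Card", "Cash", "Mobile", "Other"]
pickup_zone_groups = ["76", "8", "32", "28", "Other"]
duration_groups = ["Flash Riders", "Urban Cruisers", "Long-Haul Nomads"]

# Every possible (payment, pickup zone, duration) partition key.
partitions = list(product(payment_groups, pickup_zone_groups, duration_groups))

print(len(partitions))  # 4 * 5 * 3 = 60
```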

In [105]:
def categorical_summary(df, column):
    # Total count of non-null values
    count = df.filter(F.col(column).isNotNull()).count()
 
    # Count of distinct values
    unique = df.select(column).distinct().count()
 
    # Most frequent value and its frequency
    top_row = (df.groupBy(column)
                 .agg(F.count("*").alias("freq"))
                 .orderBy(F.desc("freq"))
                 .first())
 
    top = top_row[column] if top_row else None
    freq = top_row["freq"] if top_row else 0
 
    # Create a summary dictionary
    summary = {
        "count": count,
        "unique": unique,
        "top": top,
        "freq": freq
    }
 
    return summary

In [106]:
# Use `column` as the loop variable so the imported `col` function is not overwritten
for column in ["payment_group", "pickup_zone_group", "duration_group"]:
    print(f'{column} - {categorical_summary(dftaxytrips_selected, column)}')

payment_group - {'count': 7917844, 'unique': 4, 'top': 'Credit Card', 'freq': 3066154}


                                                                                

pickup_zone_group - {'count': 7917844, 'unique': 5, 'top': 'Other', 'freq': 2570284}




duration_group - {'count': 7917844, 'unique': 3, 'top': 'Long-Haul Nomads', 'freq': 2704954}


                                                                                

In [109]:
# Group by the key variables
partition_counts = dftaxytrips_selected.groupBy(
    "payment_group", "pickup_zone_group", "duration_group"
).agg(count("*").alias("count"))

# Compute the overall total
total_count = dftaxytrips_selected.count()

# Add the proportion for each combination
partition_counts = partition_counts.withColumn(
    "proportion", col("count") / total_count
)

# Sort by the most representative combinations
partition_counts.orderBy(col("proportion").desc()).show(60)




+-------------+-----------------+----------------+------+--------------------+
|payment_group|pickup_zone_group|  duration_group| count|          proportion|
+-------------+-----------------+----------------+------+--------------------+
|  Credit Card|               76|Long-Haul Nomads|932975| 0.11783195021270942|
|        Other|            Other|  Urban Cruisers|447445|0.056510964348375645|
|        Other|            Other|Long-Haul Nomads|430300| 0.05434560216139646|
|         Cash|            Other|    Flash Riders|310727| 0.03924389012968682|
|         Cash|                8|    Flash Riders|306484| 0.03870801192849973|
|  Credit Card|               32|    Flash Riders|291907|0.036866980455790746|
|  Credit Card|                8|    Flash Riders|276498| 0.03492086987316244|
|         Cash|               32|    Flash Riders|255828| 0.03231031073610442|
|       Mobile|                8|    Flash Riders|225234|0.028446380100441485|
|  Credit Card|            Other|Long-Haul Nomads|21

                                                                                

**Partition R1**

***Payment method:*** Credit Card

***Pickup zone:*** 76 (airport or tourist area: O'Hare)

***Trip duration:*** Long-Haul Nomads (>23.2 minutes)

***Proportion:*** 11.78%

***Comment:*** This profile represents travelers who take long trips and pay by card, possibly tourists or airport users; they are likely to leave high tips.
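
The 11.78% figure can be reproduced directly from two numbers reported in the notebook: the partition's record count in the grouped output and the total number of records. A minimal check in plain Python:

```python
# Counts copied from the notebook outputs.
r1_count = 932_975       # Credit Card / zone 76 / Long-Haul Nomads
total_count = 7_917_844  # all records in the dataset

share = r1_count / total_count
print(f"{share:.2%}")
```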

### **Sampling Technique**

In [110]:
taxitrips_r1 = dftaxytrips_selected.filter(
    (col("payment_group") == "Credit Card") &
    (col("pickup_zone_group") == "76") &
    (col("duration_group") == "Long-Haul Nomads")
)
print(f'Record count: {taxitrips_r1.count()}')
pretty_display(taxitrips_r1, 5)

Record count: 932975

Unnamed: 0,trip_id,trip_hour,trip_day_of_week,duration_minutes,trip_miles,tip_ratio,tip_per_mile,payment_group,company_group,pickup_zone_group,dropoff_zone_group,duration_group,tip_per_mile_group,tip_ratio_group,trip_miles_group
0,0000184e7cd53cee95af32eba49c44e4d20adcd8,17,6,67.516667,17.12,0.21978,0.584112,Credit Card,Flash Cab,76,32,Long-Haul Nomads,Tip Enthusiasts,High-Spirit Donors,Wide-Radius Riders
1,00007572c5f92e2ff067e6f838a5ad74e83665d3,8,2,34.166667,15.06,0.288153,0.750996,Credit Card,Other,76,Other,Long-Haul Nomads,Tip Enthusiasts,High-Spirit Donors,Wide-Radius Riders
2,0003add631b15f0dce6b59ccfe67b3ebaaf09ba4,19,1,27.333333,18.28,0.220225,0.536105,Credit Card,Insurance Agency,76,8,Long-Haul Nomads,Appreciative Riders,High-Spirit Donors,Wide-Radius Riders
3,000628749a1a6d4cb012c0e5726beeaf0eefd433,16,6,45.033333,28.24,0.220588,0.531161,Credit Card,Insurance Agency,76,Other,Long-Haul Nomads,Appreciative Riders,High-Spirit Donors,Wide-Radius Riders
4,0008253c00e059ff3bfac5763c556ebe83e618a5,18,1,46.866667,21.19,0.276538,0.678622,Credit Card,Chicago Independents,76,Other,Long-Haul Nomads,Tip Enthusiasts,High-Spirit Donors,Wide-Radius Riders


In [111]:
pretty_display(taxitrips_r1.groupBy("tip_ratio_group").count())

                                                                                

| tip_ratio_group | count |
| --- | --- |
| Flat Fare Clients | 46515 |
| Grateful Givers | 207628 |
| High-Spirit Donors | 678832 |


Since the groups defined by our partitioning variables differ substantially in their number of records, we chose stratified sampling. This technique ensures that each group (stratum) is proportionally represented in the sample, avoiding biases that could lead to models that overfit dominant classes or underfit minority classes.

In particular, because the central objective of the analysis is to study the factors that influence tipping behavior, it is essential to ensure adequate representation of the different levels of the `tip_ratio` variable, which was built as the ratio of Tips to Fare and reflects the tip as a proportion of the trip cost. To simplify analysis and segmentation, this variable was binned into a new categorical variable called `tip_ratio_group`, with three groups: `"Flat Fare Clients"` (riders who leave no tip), `"Grateful Givers"` (moderate tip) and `"High-Spirit Donors"` (high tip). Correct representation of these groups in each partition is key to capturing relevant patterns and avoiding bias in supervised learning.
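
To make the proportional representation concrete, the sketch below computes how many rows each `tip_ratio_group` stratum would contribute to a sample drawn from partition R1. It is a plain-Python illustration under stated assumptions: the sample size of 10,000 and the helper `proportional_allocation` are hypothetical, and only the stratum counts come from the notebook output.

```python
def proportional_allocation(stratum_counts, sample_size):
    """Split sample_size across strata in proportion to their record counts."""
    total = sum(stratum_counts.values())
    return {
        name: round(sample_size * n / total)
        for name, n in stratum_counts.items()
    }


# Stratum sizes for partition R1, copied from the groupBy output above.
r1_counts = {
    "Flat Fare Clients": 46_515,
    "Grateful Givers": 207_628,
    "High-Spirit Donors": 678_832,
}

SAMPLE_SIZE = 10_000  # hypothetical target, not taken from the notebook

allocation = proportional_allocation(r1_counts, SAMPLE_SIZE)
# Under proportional allocation every stratum shares the same sampling fraction.
fraction = SAMPLE_SIZE / sum(r1_counts.values())

print(allocation)
print(round(fraction, 6))
```

In Spark, the same idea could be expressed by passing a dictionary that maps each stratum label to this fraction to `DataFrame.sampleBy("tip_ratio_group", fractions, seed)`. Note that `sampleBy` samples each row independently, so the resulting stratum sizes are approximate rather than exact.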