# **Master's in Applied Artificial Intelligence**
## **Course: Analysis of Large Volumes of Data**
### Tecnológico de Monterrey
### Prof. Iván Olmos

## **Activity: Week 03**

### **Project: Big Data Database**

##### Names and student IDs of the team members:
*   Victoria Melgarejo Cabrera - A01795030
*   Héctor Alejandro Alvarez Rosas - A01796262
*   Andrea Xcaret Gomez Alfaro - A01796384
*   Mario Guillen De La Torre - A01796701


---


#### **Activity Description:**

Identify a big data dataset on which to apply the concepts learned in the course, starting with basic file reading and writing in PySpark.



---

In [None]:
# Install PySpark in Colab
!pip install pyspark



### **Importing Libraries**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, dayofweek, hour, isnan, percentile_approx, unix_timestamp, when
)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

### **Creating the Spark Session**

In [None]:
spark = SparkSession.builder \
    .appName("ChicagoTaxyTripsAnalysis") \
    .getOrCreate()

### **Loading the Dataset**

In [None]:
import os

# Link: https://drive.google.com/file/d/173pdvU0gxH0vzbJJA2GetofZvMg7XdQg/view?usp=drive_link

# Google Drive file ID
file_id = "173pdvU0gxH0vzbJJA2GetofZvMg7XdQg"
filename = "Taxi_Trips__2024-__20250426.csv"

# Local path where the file will be saved
local_path = f"/content/{filename}"

# Install gdown if it is not already installed
try:
    import gdown
except ImportError:
    !pip install gdown
    import gdown

# Download only if the file does not already exist
if not os.path.exists(local_path):
    print(f"Downloading {filename} from Google Drive...")
    gdown.download(id=file_id, output=local_path, quiet=False)
else:
    print(f"File {filename} already exists at {local_path}; skipping download.")

# Load the CSV file (inferSchema=True makes Spark scan the data to guess column types)
dftaxytrips = spark.read.csv(local_path, header=True, inferSchema=True)

Downloading Taxi_Trips__2024-__20250426.csv from Google Drive...


Downloading...
From (original): https://drive.google.com/uc?id=173pdvU0gxH0vzbJJA2GetofZvMg7XdQg
From (redirected): https://drive.google.com/uc?id=173pdvU0gxH0vzbJJA2GetofZvMg7XdQg&confirm=t&uuid=60a6798d-233d-4353-9d4e-fa218abcc848
To: /content/Taxi_Trips__2024-__20250426.csv
100%|██████████| 3.28G/3.28G [00:39<00:00, 82.4MB/s]


In [None]:
print("Number of records:", dftaxytrips.count())
print("Number of columns:", len(dftaxytrips.columns))

Number of records: 7917844
Number of columns: 23


### **Data Exploration**

In [None]:
# Dataset structure
dftaxytrips.printSchema()

root
 |-- Trip ID: string (nullable = true)
 |-- Taxi ID: string (nullable = true)
 |-- Trip Start Timestamp: string (nullable = true)
 |-- Trip End Timestamp: string (nullable = true)
 |-- Trip Seconds: integer (nullable = true)
 |-- Trip Miles: double (nullable = true)
 |-- Pickup Census Tract: long (nullable = true)
 |-- Dropoff Census Tract: long (nullable = true)
 |-- Pickup Community Area: integer (nullable = true)
 |-- Dropoff Community Area: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tips: double (nullable = true)
 |-- Tolls: double (nullable = true)
 |-- Extras: double (nullable = true)
 |-- Trip Total: double (nullable = true)
 |-- Payment Type: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Pickup Centroid Latitude: double (nullable = true)
 |-- Pickup Centroid Longitude: double (nullable = true)
 |-- Pickup Centroid Location: string (nullable = true)
 |-- Dropoff Centroid Latitude: double (nullable = true)
 |-- Dropoff Centroid 

In [None]:
# First rows
dftaxytrips.show(5)

+--------------------+--------------------+--------------------+--------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+-----+-----+-----+------+----------+------------+--------------------+------------------------+-------------------------+------------------------+-------------------------+--------------------------+--------------------------+
|             Trip ID|             Taxi ID|Trip Start Timestamp|  Trip End Timestamp|Trip Seconds|Trip Miles|Pickup Census Tract|Dropoff Census Tract|Pickup Community Area|Dropoff Community Area| Fare| Tips|Tolls|Extras|Trip Total|Payment Type|             Company|Pickup Centroid Latitude|Pickup Centroid Longitude|Pickup Centroid Location|Dropoff Centroid Latitude|Dropoff Centroid Longitude|Dropoff Centroid  Location|
+--------------------+--------------------+--------------------+--------------------+------------+----------+-------------------+--------------------+------

In [None]:
dftaxytrips.summary().show()

+-------+--------------------+--------------------+--------------------+--------------------+------------------+-----------------+--------------------+--------------------+---------------------+----------------------+-----------------+------------------+--------------------+------------------+------------------+------------+--------------------+------------------------+-------------------------+------------------------+-------------------------+--------------------------+--------------------------+
|summary|             Trip ID|             Taxi ID|Trip Start Timestamp|  Trip End Timestamp|      Trip Seconds|       Trip Miles| Pickup Census Tract|Dropoff Census Tract|Pickup Community Area|Dropoff Community Area|             Fare|              Tips|               Tolls|            Extras|        Trip Total|Payment Type|             Company|Pickup Centroid Latitude|Pickup Centroid Longitude|Pickup Centroid Location|Dropoff Centroid Latitude|Dropoff Centroid Longitude|Dropoff Centroid  Lo

In [None]:
# General statistics
dftaxytrips.describe().show()

+-------+--------------------+--------------------+--------------------+--------------------+------------------+-----------------+--------------------+--------------------+---------------------+----------------------+-----------------+------------------+--------------------+------------------+------------------+------------+--------------------+------------------------+-------------------------+------------------------+-------------------------+--------------------------+--------------------------+
|summary|             Trip ID|             Taxi ID|Trip Start Timestamp|  Trip End Timestamp|      Trip Seconds|       Trip Miles| Pickup Census Tract|Dropoff Census Tract|Pickup Community Area|Dropoff Community Area|             Fare|              Tips|               Tolls|            Extras|        Trip Total|Payment Type|             Company|Pickup Centroid Latitude|Pickup Centroid Longitude|Pickup Centroid Location|Dropoff Centroid Latitude|Dropoff Centroid Longitude|Dropoff Centroid  Lo

In [None]:
# Missing-value analysis for 'dftaxytrips': count NULL or NaN entries per column
missing_taxytrips = dftaxytrips.select([
    count(when(col(c).isNull() | isnan(c), c)).alias(c)
    for c in dftaxytrips.columns
])

print("Missing values in Chicago Taxi Trips Dataset (csv):")
missing_taxytrips.show()


Missing values in Chicago Taxi Trips Dataset (csv):
+-------+-------+--------------------+------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+-----+-----+-----+------+----------+------------+-------+------------------------+-------------------------+------------------------+-------------------------+--------------------------+--------------------------+
|Trip ID|Taxi ID|Trip Start Timestamp|Trip End Timestamp|Trip Seconds|Trip Miles|Pickup Census Tract|Dropoff Census Tract|Pickup Community Area|Dropoff Community Area| Fare| Tips|Tolls|Extras|Trip Total|Payment Type|Company|Pickup Centroid Latitude|Pickup Centroid Longitude|Pickup Centroid Location|Dropoff Centroid Latitude|Dropoff Centroid Longitude|Dropoff Centroid  Location|
+-------+-------+--------------------+------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+-----+-----+----

In [None]:
# Make sure Trip Start Timestamp is of timestamp type
dftaxytrips = dftaxytrips.withColumn("trip_start_ts", col("Trip Start Timestamp").cast("timestamp"))

dftaxytrips = dftaxytrips.withColumn("trip_hour", hour(col("trip_start_ts")))              # Hour of the day
dftaxytrips = dftaxytrips.withColumn("trip_day_of_week", dayofweek(col("trip_start_ts")))  # Day of the week (1 = Sunday, 7 = Saturday)
dftaxytrips = dftaxytrips.withColumn("duration_minutes", col("Trip Seconds") / 60)         # Trip duration in minutes

# Origin/destination zone
dftaxytrips = dftaxytrips.withColumnRenamed("Pickup Community Area", "pickup_zone")
dftaxytrips = dftaxytrips.withColumnRenamed("Dropoff Community Area", "dropoff_zone")

# Tip/Fare ratio (0 when the fare is zero, negative, or missing)
dftaxytrips = dftaxytrips.withColumn("tip_ratio",
    when(col("Fare") > 0, col("Tips") / col("Fare")).otherwise(0))

# Payment method grouping
# when() needs both a condition and a value; the original omitted the values,
# so the group labels below are assumed to mirror the payment type names.
dftaxytrips = dftaxytrips.withColumn("payment_group",
    when(col("Payment Type") == "Credit Card", "Credit Card")
    .when(col("Payment Type") == "Cash", "Cash")
    .otherwise("Other"))



In [None]:
# Group and count trips by pickup_zone
conteo_df = dftaxytrips.groupBy("pickup_zone").agg(count("*").alias("conteo"))

# Sort by count in descending order
conteo_df = conteo_df.orderBy(col("conteo").desc())

# Show the results
conteo_df.show()

+-----------+-------+
|pickup_zone| conteo|
+-----------+-------+
|         76|1681489|
|          8|1658939|
|         32|1254835|
|         28| 752297|
|         33| 272803|
|         56| 262608|
|          6| 251015|
|       NULL| 226777|
|          7| 161248|
|          3| 125293|
|         77|  95727|
|         24|  74223|
|         41|  67705|
|          2|  59407|
|         35|  53505|
|         38|  48514|
|         43|  47948|
|         44|  45537|
|          1|  44780|
|         39|  39289|
+-----------+-------+
only showing top 20 rows

