# New York City taxi trip data ingestion - NYC TLC data

# Bronze layer

Extraction of NYC taxi trip records for January through May 2023.

Link: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The data is published by the NYC Taxi and Limousine Commission (TLC), the agency responsible for licensing and regulating taxis in New York City.
Data is available for:
- Green and yellow taxis;
- For-hire vehicles (FHV)

Acronyms:
- TLC: NYC Taxi and Limousine Commission
- TPEP/LPEP: Taxicab & Livery Passenger Enhancement Programs
- FHV: For-Hire Vehicle
- HVFHV: High Volume For-Hire Vehicle (file prefix `fhvhv`)

Data dictionaries:
- Yellow taxi: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- Green taxi: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
- FHV: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_fhv.pdf
- HVFHV: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf


## Libraries

In [0]:
import os
import requests
from datetime import datetime
from dateutil.relativedelta import relativedelta
import tempfile
from pyspark.sql.functions import lit

## Paths

In [0]:
# Root directory for the downloaded files (DBFS via the local /dbfs mount)
base_dir = "/dbfs/case_ifood_nyc/bronze/nyc_taxi"

# Destination folder (defined here but not referenced by the cells below)
path = "/FileStore/bronze/nyc_taxi"
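The notebook later needs the same location in two other forms: a bare DBFS path for `dbutils.fs` and a `dbfs:/...` URI for Spark. A minimal sketch of deriving both from the `/dbfs/...` mount path, assuming hypothetical helpers `to_dbfs_path` and `to_dbfs_uri` that are not part of the original notebook. It strips only a leading `/dbfs` prefix, whereas `str.replace` would rewrite every occurrence in the string.

```python
FUSE_PREFIX = "/dbfs"  # local mount point of DBFS on a Databricks cluster


def to_dbfs_path(fuse_path: str) -> str:
    """'/dbfs/a/b' -> '/a/b' (the form this notebook passes to dbutils.fs)."""
    if fuse_path != FUSE_PREFIX and not fuse_path.startswith(FUSE_PREFIX + "/"):
        raise ValueError(f"not a /dbfs path: {fuse_path!r}")
    return fuse_path[len(FUSE_PREFIX):] or "/"


def to_dbfs_uri(fuse_path: str) -> str:
    """'/dbfs/a/b' -> 'dbfs:/a/b' (the form this notebook passes to Spark)."""
    return "dbfs:" + to_dbfs_path(fuse_path)


print(to_dbfs_path("/dbfs/case_ifood_nyc/bronze/nyc_taxi"))
print(to_dbfs_uri("/dbfs/case_ifood_nyc/bronze/nyc_taxi"))
```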

## Functions

In [0]:
# Processing function: download one monthly file and write it to the bronze layer
def process_month(vehicle, year_month):
    try:
        year, month = year_month.split("-")
        url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{vehicle}_tripdata_{year_month}.parquet"

        print(f"\nProcessando {year_month}...")
        print(f"URL: {url}")

        # Create a local temporary directory (removed when the block exits)
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, f"{vehicle}_tripdata_{year_month}.parquet")

            # Download the file
            print("Baixando arquivo...")
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            # Save it locally
            with open(local_path, 'wb') as f:
                f.write(response.content)

            # Load it into Spark
            print("Carregando dados para o Spark...")
            df = spark.read.parquet(f"file:{local_path}")

            # Create the destination directory on DBFS
            output_dir = f"{base_dir}/{vehicle}/year={year}/month={month}"
            dbutils.fs.mkdirs(output_dir.replace("/dbfs", ""))

            # Write directly into the partition directory encoded in the path
            print("Salvando no DBFS...")
            df.write.mode("overwrite").parquet(output_dir.replace("/dbfs", "dbfs:"))

            print(f"✅ Dados salvos em: {output_dir}")
            return True

    except Exception as e:
        # Report the failure and let the caller continue with the next month
        print(f"❌ Erro ao processar {year_month}: {str(e)}")
        return False
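`process_month` holds the whole response body in memory through `response.content` before writing it to disk. For large monthly files the download could instead be streamed in chunks. A minimal sketch, assuming hypothetical helpers `write_chunks` and `download_to_file` (neither is in the original notebook) and an arbitrary 1 MiB chunk size. `requests` is imported inside the function so the chunk writer can be exercised offline.

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], local_path: str) -> int:
    """Write an iterable of byte chunks to local_path; return bytes written."""
    total = 0
    with open(local_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download_to_file(url: str, local_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream url to local_path without holding the full body in memory."""
    import requests  # imported lazily; only needed for real downloads

    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), local_path)
```

Inside `process_month`, the `requests.get(...)` call and the `f.write(response.content)` block would then collapse into a single `download_to_file(url, local_path)` call.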

## Data extraction

##### Parameters

In [0]:
# Configuration: notebook widgets for vehicle type and period
dbutils.widgets.dropdown("vehicle_type", "yellow", ["yellow", "green", "fhv", "fhvhv"])
dbutils.widgets.text("start_date", "2023-01", "Data Início (YYYY-MM)")
dbutils.widgets.text("end_date", "2023-01", "Data Fim (YYYY-MM)")

In [0]:
# Read the selected parameters
vehicle_type = dbutils.widgets.get("vehicle_type")
start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")

print(f"Processando {vehicle_type} táxis de {start_date} a {end_date}")

Processando fhvhv táxis de 2023-01 a 2023-05
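Because `start_date` and `end_date` are free-text widgets, a malformed value only surfaces later, when `datetime.strptime` raises in the extraction cell. A minimal sketch of validating both values up front, assuming a hypothetical helper `parse_period` that is not part of the original notebook:

```python
from datetime import datetime


def parse_period(start_date: str, end_date: str):
    """Parse two 'YYYY-MM' strings and check that start is not after end."""
    try:
        start = datetime.strptime(start_date, "%Y-%m")
        end = datetime.strptime(end_date, "%Y-%m")
    except ValueError as exc:
        raise ValueError(
            f"expected YYYY-MM, got {start_date!r} and {end_date!r}"
        ) from exc
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


start, end = parse_period("2023-01", "2023-05")
print(start.date(), end.date())
```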


## Extraction

In [0]:
# Create the root directory
os.makedirs(base_dir, exist_ok=True)

In [0]:

# Process every month in the period
start = datetime.strptime(start_date, "%Y-%m")
end = datetime.strptime(end_date, "%Y-%m")
current = start
success_count = 0

while current <= end:
    year_month = current.strftime("%Y-%m")
    if process_month(vehicle_type, year_month):
        success_count += 1
    current += relativedelta(months=1)


Processando 2023-01...
URL: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
Baixando arquivo...
Carregando dados para o Spark...
Salvando no DBFS...
✅ Dados salvos em: /dbfs/case_ifood_nyc/bronze/nyc_taxi/fhvhv/year=2023/month=01

Processando 2023-02...
URL: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-02.parquet
Baixando arquivo...
Carregando dados para o Spark...
Salvando no DBFS...
✅ Dados salvos em: /dbfs/case_ifood_nyc/bronze/nyc_taxi/fhvhv/year=2023/month=02

Processando 2023-03...
URL: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-03.parquet
Baixando arquivo...
Carregando dados para o Spark...
Salvando no DBFS...
✅ Dados salvos em: /dbfs/case_ifood_nyc/bronze/nyc_taxi/fhvhv/year=2023/month=03

Processando 2023-04...
URL: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-04.parquet
Baixando arquivo...
Carregando dados para o Spark...
Salvando no DBFS...
✅ Dados salvos em: /dbfs/case_ifo
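The extraction loop keeps only a success count, so a month that failed is visible only in the printed log. A minimal sketch of the same month-by-month iteration that also records which months failed, using the standard library only. `month_range` and `run_period` are hypothetical names, and the lambda is a stand-in for `process_month`.

```python
from datetime import datetime


def month_range(start_date: str, end_date: str):
    """Yield 'YYYY-MM' strings from start_date to end_date, inclusive."""
    start = datetime.strptime(start_date, "%Y-%m")
    end = datetime.strptime(end_date, "%Y-%m")
    # Number months consecutively so that December rolls over into January.
    first = start.year * 12 + (start.month - 1)
    last = end.year * 12 + (end.month - 1)
    for index in range(first, last + 1):
        year, month = divmod(index, 12)
        yield f"{year:04d}-{month + 1:02d}"


def run_period(start_date, end_date, process):
    """Call process(year_month) per month; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for year_month in month_range(start_date, end_date):
        (succeeded if process(year_month) else failed).append(year_month)
    return succeeded, failed


# Stand-in for process_month: pretend that only 2023-03 fails.
ok, bad = run_period("2022-11", "2023-04", lambda ym: ym != "2023-03")
print(ok)
print(bad)
```

In the notebook, the callback would be `lambda ym: process_month(vehicle_type, ym)`, and `len(ok)` plays the role of `success_count`.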

In [0]:
# Register the table
if success_count > 0:
    print("\nRegistrando tabela no metastore...")
    spark.sql("CREATE DATABASE IF NOT EXISTS case_ifood_nyc_taxi") # Create the schema
    
    # Define the table name
    table_name = f"bronze_{vehicle_type}_taxi"
    
    # Drop the table if it already exists
    spark.sql(f"DROP TABLE IF EXISTS case_ifood_nyc_taxi.{table_name}")
    
    # Create an external table pointing at the bronze location
    spark.sql(f"""
        CREATE TABLE case_ifood_nyc_taxi.{table_name}
        USING PARQUET
        LOCATION 'dbfs:/case_ifood_nyc/bronze/nyc_taxi/{vehicle_type}'
    """)
    
    # Refresh the partition metadata
    spark.sql(f"MSCK REPAIR TABLE case_ifood_nyc_taxi.{table_name}")
    
    print(f"Tabela registrada: case_ifood_nyc_taxi.{table_name}")
    
    print("\nPartições disponíveis:")
    display(spark.sql(f"SHOW PARTITIONS case_ifood_nyc_taxi.{table_name}"))
    
    print("\nAmostra dos dados:")
    display(spark.table(f"case_ifood_nyc_taxi.{table_name}").limit(5))
else:
    print("\nNenhum mês foi processado com sucesso.")


Registrando tabela no metastore...
Tabela registrada: case_ifood_nyc_taxi.bronze_fhvhv_taxi

Partições disponíveis:


partition
year=2023/month=01
year=2023/month=02
year=2023/month=03
year=2023/month=04
year=2023/month=05
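Each entry in this listing is a Hive-style `key=value` directory path, and Spark derives the `year` and `month` columns from those directory names. With partition column type inference enabled (Spark's default), numeric-looking values are typed as numbers, which would explain why the data sample shows `month` as `1` while the directory is `month=01`. A minimal sketch of that behaviour, assuming a hypothetical `parse_partition` helper that only imitates the integer case and is not Spark's actual implementation:

```python
def parse_partition(spec: str, infer_types: bool = True) -> dict:
    """Split 'year=2023/month=01' into a dict of partition values."""
    values = {}
    for segment in spec.strip("/").split("/"):
        key, sep, raw = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value segment: {segment!r}")
        # Rough imitation of type inference: all-digit values become integers.
        values[key] = int(raw) if infer_types and raw.isdecimal() else raw
    return values


print(parse_partition("year=2023/month=01"))
print(parse_partition("year=2023/month=01", infer_types=False))
```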



Amostra dos dados:


hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag,year,month
HV0003,B03404,B03404,2023-01-01T00:18:06.000+0000,2023-01-01T00:19:24.000+0000,2023-01-01T00:19:38.000+0000,2023-01-01T00:48:07.000+0000,48,68,0.94,1709,25.95,0.0,0.78,2.3,2.75,0.0,5.22,27.83,N,N,,N,N,2023,1
HV0003,B03404,B03404,2023-01-01T00:48:42.000+0000,2023-01-01T00:56:20.000+0000,2023-01-01T00:58:39.000+0000,2023-01-01T01:33:08.000+0000,246,163,2.78,2069,60.14,0.0,1.8,5.34,2.75,0.0,0.0,50.15,N,N,,N,N,2023,1
HV0003,B03404,B03404,2023-01-01T00:15:35.000+0000,2023-01-01T00:20:14.000+0000,2023-01-01T00:20:27.000+0000,2023-01-01T00:37:54.000+0000,9,129,8.81,1047,24.37,0.0,0.73,2.16,0.0,0.0,0.0,20.22,N,N,,N,N,2023,1
HV0003,B03404,B03404,2023-01-01T00:35:24.000+0000,2023-01-01T00:39:30.000+0000,2023-01-01T00:41:05.000+0000,2023-01-01T00:48:16.000+0000,129,129,0.67,431,13.8,0.0,0.41,1.22,0.0,0.0,0.0,7.9,N,N,,N,N,2023,1
HV0003,B03404,B03404,2023-01-01T00:43:15.000+0000,2023-01-01T00:51:10.000+0000,2023-01-01T00:52:47.000+0000,2023-01-01T01:04:51.000+0000,129,92,4.38,724,20.49,0.0,0.61,1.82,0.0,0.0,0.0,16.48,N,N,,N,N,2023,1
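The registration cell interpolates `vehicle_type` directly into the SQL text and into the table location. The dropdown widget already constrains the value, but the check can be made explicit before any statement is built. A minimal sketch, assuming a hypothetical `build_register_statements` helper and module-level constants that mirror the names used in the notebook:

```python
ALLOWED_VEHICLES = ("yellow", "green", "fhv", "fhvhv")  # same options as the widget
DATABASE = "case_ifood_nyc_taxi"
BRONZE_ROOT = "dbfs:/case_ifood_nyc/bronze/nyc_taxi"


def build_register_statements(vehicle_type: str) -> list:
    """Return the SQL statements that (re)register one bronze table."""
    if vehicle_type not in ALLOWED_VEHICLES:
        raise ValueError(f"unknown vehicle type: {vehicle_type!r}")
    table = f"{DATABASE}.bronze_{vehicle_type}_taxi"
    return [
        f"CREATE DATABASE IF NOT EXISTS {DATABASE}",
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} USING PARQUET LOCATION '{BRONZE_ROOT}/{vehicle_type}'",
        f"MSCK REPAIR TABLE {table}",
    ]


for statement in build_register_statements("fhvhv"):
    print(statement)
```

In the notebook, each returned statement would be passed to `spark.sql(...)` in order.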
