# Polars

The data comes from the [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for yellow taxis in February 2025.

In [1]:
import random
import timeit
import polars as pl 
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data_path = "../data/nyc_taxis/viajes_taxis.parquet"

In [3]:
df_pl = pl.read_parquet(data_path)
df_pd = pd.read_parquet(data_path)
df_pl.shape

(3577543, 7)

In [4]:
df_pl.head(5)

inicio_viaje,fin_viaje,inicio_zona,fin_zona,distancia,metodo_pago,total_pago
datetime[μs],datetime[μs],str,str,f64,i64,f64
2025-02-01 00:12:18,2025-02-01 00:32:33,"""West Chelsea/Hudson Yards""","""East Village""",3.12,1,30.66
2025-02-01 00:40:04,2025-02-01 00:49:15,"""Greenwich Village South""","""East Village""",1.4,1,18.9
2025-02-01 00:06:09,2025-02-01 00:11:51,"""SoHo""","""Little Italy/NoLiTa""",0.4,1,13.25
2025-02-01 00:15:13,2025-02-01 00:20:19,"""Greenwich Village North""","""West Village""",0.7,1,14.95
2025-02-01 00:02:52,2025-02-01 00:20:25,"""Greenwich Village North""","""Yorkville West""",4.19,1,30.66


In [33]:
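def medir_tiempo(func):
    # Run func 5 times (one call per run) and collect each wall-clock duration.
    tiempos = timeit.repeat(func, repeat=5, number=1)
    # The printed label says "total", but the value is the mean of the 5 runs.
    tiempo_medio = sum(tiempos) / len(tiempos)
    print(f"Tiempo total: {tiempo_medio:.8f} segundos")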
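A single mean can hide warm-up effects and outliers. As a minimal sketch, the same `timeit.repeat` call can also report the fastest run and the spread. The helper name `resumen_tiempos` and its dictionary keys are hypothetical and not part of the notebook.

```python
import statistics
import timeit


def resumen_tiempos(func, repeat=5, number=1):
    """Time func with timeit.repeat and summarize per-call durations in seconds."""
    tiempos = timeit.repeat(func, repeat=repeat, number=number)
    # Each entry is the total for `number` calls, so divide to get a per-call time.
    por_llamada = [t / number for t in tiempos]
    return {
        "min": min(por_llamada),
        "mean": statistics.mean(por_llamada),
        "stdev": statistics.stdev(por_llamada) if len(por_llamada) > 1 else 0.0,
    }


resumen = resumen_tiempos(lambda: sum(range(10_000)))
print(f"min={resumen['min']:.8f}s mean={resumen['mean']:.8f}s stdev={resumen['stdev']:.8f}s")
```

The minimum is often the most stable figure for comparing two libraries, since slower runs usually reflect interference from other processes rather than the code under test.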

### Read file

In [27]:
medir_tiempo(lambda: pl.read_parquet(data_path))
medir_tiempo(lambda: pd.read_parquet(data_path))

Tiempo total: 0.070319 segundos
Tiempo total: 0.501953 segundos


### Select columns

In [28]:
rand_cols = random.sample(df_pl.columns, 3)

medir_tiempo(lambda: df_pl[:3,rand_cols])

df_pl[:3,rand_cols]

Tiempo total: 0.000031 segundos


distancia,metodo_pago,inicio_viaje
f64,i64,datetime[μs]
3.12,1,2025-02-01 00:12:18
1.4,1,2025-02-01 00:40:04
0.4,1,2025-02-01 00:06:09


In [29]:
rand_cols = df_pd.columns.to_series().sample(3)

medir_tiempo(lambda: df_pd.iloc[0:3][rand_cols])

df_pd.iloc[0:3][rand_cols]

Tiempo total: 0.000475 segundos


Unnamed: 0,fin_viaje,inicio_zona,inicio_viaje
0,2025-02-01 00:32:33,West Chelsea/Hudson Yards,2025-02-01 00:12:18
1,2025-02-01 00:49:15,Greenwich Village South,2025-02-01 00:40:04
2,2025-02-01 00:11:51,SoHo,2025-02-01 00:06:09


### Select cell

In [38]:
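# Pass the expression itself: a quoted expression is returned as a str and never evaluated.
# Polars has no .at accessor; [row, column] indexing returns the scalar.
# The time printed below comes from a run that timed only a string literal, not this lookup.
medir_tiempo(lambda: df_pl[6, 'metodo_pago'])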

Tiempo total: 0.00000030 segundos


In [42]:
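# Pass the expression itself: a quoted expression is returned as a str and never evaluated.
# The time printed below comes from a run that timed only a string literal, not this lookup.
medir_tiempo(lambda: df_pd.at[6, 'metodo_pago'])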

Tiempo total: 0.00000046 segundos


### Change column type

In [None]:
# Pass the expression itself so the cast actually runs inside the timed call.
medir_tiempo(lambda: df_pl.with_columns(pl.col('inicio_viaje').cast(pl.Float64)))

- https://github.com/deployr-ai/kt-02-pandasvspolars/tree/main

- https://realpython.com/polars-lazyframe/