Because the Parquet files are large, the traditional approach of loading everything into pandas is impractical. Instead, DuckDB is used for querying and Polars for data manipulation, since both handle data of this size far better.

See this link for speed comparisons: https://docs.coiled.io/blog/tpch.html#polars-vs-duckdb



In [1]:
import duckdb
import polars as pl

# The pyarrow package must be installed, but it does not need to be imported.
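
One way to confirm that pyarrow is available without importing it is to look up its module spec. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return find_spec(package_name) is not None


# Report whether the Arrow bridge used by DuckDB and Polars is present.
print("pyarrow installed:", is_installed("pyarrow"))
```

If this prints `False`, installing pyarrow in the same environment as the notebook kernel should be enough for the Arrow conversion to work.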

In [31]:
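# Read only the two needed columns from every yearly partition, skipping rows with nulls.
query = """
    SELECT DepDelay, Month
    FROM 'Data/Year=*/data_0.parquet'
    WHERE DepDelay IS NOT NULL AND Month IS NOT NULL
"""
# Fetch the result as an Arrow table, then convert it to a Polars DataFrame.
result_arrow = duckdb.query(query).arrow()
result = pl.from_arrow(result_arrow)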

In [34]:
print(f'pulled dataset shape: {result.shape}')
result.head()

pulled dataset shape: (213454906, 2)


DepDelay,Month
f64,i64
-1.0,10
-1.0,10
3.0,10
4.0,10
-1.0,10
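
The result above holds roughly 213 million rows in memory. If only per-month statistics are needed, one option is to push the aggregation into DuckDB so that just twelve rows come back. The sketch below assumes the same file layout and column names as the query above; the function names `monthly_delay_query` and `monthly_delay_summary` are hypothetical, and `.pl()` is DuckDB's direct conversion to a Polars DataFrame (it also relies on pyarrow being installed).

```python
def monthly_delay_query(parquet_glob: str = "Data/Year=*/data_0.parquet") -> str:
    """Build a SQL query that aggregates DepDelay per Month inside DuckDB."""
    return f"""
        SELECT Month,
               COUNT(*)      AS n_flights,
               AVG(DepDelay) AS mean_dep_delay
        FROM '{parquet_glob}'
        WHERE DepDelay IS NOT NULL AND Month IS NOT NULL
        GROUP BY Month
        ORDER BY Month
    """


def monthly_delay_summary(parquet_glob: str = "Data/Year=*/data_0.parquet"):
    """Run the aggregation and return a small Polars DataFrame."""
    # Imported here so the query builder can be used without DuckDB installed.
    import duckdb

    return duckdb.sql(monthly_delay_query(parquet_glob)).pl()


# Inspect the generated SQL; call monthly_delay_summary() to execute it.
print(monthly_delay_query())
```

Aggregating in the database keeps peak memory low, at the cost of losing the row-level data, so this suits summary tables and plots rather than per-flight analysis.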


In [1]:
import sys

# Show the interpreter version and module search path for this environment.
print(sys.version)
print(sys.path)

3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
['C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2288.0_x64__qbz5n2kfra8p0\\python312.zip', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2288.0_x64__qbz5n2kfra8p0\\DLLs', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2288.0_x64__qbz5n2kfra8p0\\Lib', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2288.0_x64__qbz5n2kfra8p0', '', 'C:\\Users\\Isaac\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python312\\site-packages', 'C:\\Users\\Isaac\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python312\\site-packages\\win32', 'C:\\Users\\Isaac\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python312\\site-packages\\win32\\lib', '