# Handling Large Data
"Large" here means roughly a few GB up to just under 100 GB. These notes cover processing data of that size in Python.  

In [2]:
import polars as pl

# Set the display width of string columns to 50 characters
pl.Config.set_fmt_str_lengths(50)

polars.config.Config

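## ● One file of about 50 million rows × 5 columns (4.2 GB)  
The case where you want to process a fairly large file stored on a USB drive.  
It is probably rare for an analysis to need every column and every record of the data,  
so use LazyFrame and keep both speed and memory usage in mind.  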

In [2]:
! wc -l ../../../sample_data/from_HDD/big_data/sample_big_data_1.csv

49061300 ../../../sample_data/from_HDD/big_data/sample_big_data_1.csv


In [4]:
# Define the read settings
input_file = '../../../sample_data/from_HDD/big_data/sample_big_data_1.csv'
col_names_dtypes = {
    'datetime_col': pl.Utf8, 
    'value_1': pl.Float32, 
    'value_2': pl.Float32, 
    'value_3': pl.Float32, 
    'labels': pl.Utf8}

In [4]:
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)
    .head().collect()
)

datetime_col,value_1,value_2,value_3,labels
str,f32,f32,f32,str
"""2023-11-03 13:39:36.088815""",404.69751,443.679321,87.651566,"""CCCC"""
"""2023-11-03 13:39:36.088875""",457.07785,634.807251,549.272156,"""EEEE"""
"""2023-11-03 13:39:36.088891""",620.065186,173.628784,219.745941,"""BBBB"""
"""2023-11-03 13:39:36.088906""",25.579391,143.887222,408.059845,"""BBBB"""
"""2023-11-03 13:39:36.088920""",654.784729,528.226379,1026.964355,"""BBBB"""


Displaying just the first five rows is no problem at all.

In [7]:
%%time
# Try filter and with_columns.
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes, truncate_ragged_lines=True)
    .filter(pl.col('labels') == 'EEEE')
    .with_columns(
        val_1_90tile = pl.col('value_1').quantile(0.9)
    ).collect(streaming=True)
)

CPU times: user 35.8 s, sys: 2min 41s, total: 3min 17s
Wall time: 1min 6s


datetime_col,value_1,value_2,value_3,labels,val_1_90tile
str,f32,f32,f32,str,f32
"""2023-11-03 13:39:36.088875""",457.07785,634.807251,549.272156,"""EEEE""",900.028076
"""2023-11-03 13:39:36.088961""",484.879608,265.178833,711.663208,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089092""",713.928345,101.049217,836.95105,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089165""",964.206848,974.932983,227.194,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089313""",864.268555,748.114075,447.793671,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089325""",730.194397,89.366402,932.182983,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089375""",140.499466,857.064331,237.348999,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089387""",719.163879,674.034851,836.932312,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089461""",832.301819,503.263397,335.356476,"""EEEE""",900.028076
"""2023-11-03 13:39:36.089485""",359.132294,8.830131,935.493713,"""EEEE""",900.028076


In [10]:
%%time
# Try filter, then select only the quantile.
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)
    .filter(pl.col('labels') == 'EEEE')
    .select(pl.col('value_1').quantile(0.9))
    .collect(streaming=True)
)

CPU times: user 21.1 s, sys: 3min 1s, total: 3min 22s
Wall time: 53.8 s


value_1
f32
900.028076


In [11]:
%%time
# Try filter only.
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)
    .filter(pl.col('labels') == 'EEEE')
    .collect(streaming=True)
)

CPU times: user 32.8 s, sys: 2min 47s, total: 3min 20s
Wall time: 1min 6s


datetime_col,value_1,value_2,value_3,labels
str,f32,f32,f32,str
"""2023-11-03 13:39:36.088875""",457.07785,634.807251,549.272156,"""EEEE"""
"""2023-11-03 13:39:36.088961""",484.879608,265.178833,711.663208,"""EEEE"""
"""2023-11-03 13:39:36.089092""",713.928345,101.049217,836.95105,"""EEEE"""
"""2023-11-03 13:39:36.089165""",964.206848,974.932983,227.194,"""EEEE"""
"""2023-11-03 13:39:36.089313""",864.268555,748.114075,447.793671,"""EEEE"""
"""2023-11-03 13:39:36.089325""",730.194397,89.366402,932.182983,"""EEEE"""
"""2023-11-03 13:39:36.089375""",140.499466,857.064331,237.348999,"""EEEE"""
"""2023-11-03 13:39:36.089387""",719.163879,674.034851,836.932312,"""EEEE"""
"""2023-11-03 13:39:36.089461""",832.301819,503.263397,335.356476,"""EEEE"""
"""2023-11-03 13:39:36.089485""",359.132294,8.830131,935.493713,"""EEEE"""


### ● Computing only a quantile with select

In [12]:
%%time
# Compute only the quantile with select (no filter).
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)
    .select(pl.col('value_1').quantile(0.9))
    .collect(streaming=True)
)

CPU times: user 34.1 s, sys: 3min 4s, total: 3min 38s
Wall time: 51 s


value_1
f32
900.036194


### ● Computing several quantiles with select

In [14]:
%%time
# Compute quantiles for several columns with select.
(
    pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)
    .select([
        pl.col('value_1').quantile(0.9),
        pl.col('value_2').quantile(0.9),
        pl.col('value_3').quantile(0.9),
    ])
    .collect(streaming=True)
)

CPU times: user 1min 19s, sys: 2min 50s, total: 4min 10s
Wall time: 1min 3s


value_1,value_2,value_3
f32,f32,f32
900.036194,900.887268,989.969238


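Computing quantiles for several columns does not seem to change the processing time much (1min 3s wall time versus 51 s for a single column).  
Most of the processing time is probably spent scanning the large file.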

In [5]:
%%time
df = pl.scan_csv(input_file, has_header=False, schema=col_names_dtypes)

CPU times: user 6.35 ms, sys: 1.09 ms, total: 7.44 ms
Wall time: 33.1 ms


In [12]:
df.select(pl.col('value_1').quantile(0.9)).collect()

value_1
f32
900.036194
