# Testing in-memory data loads and exports with Pandas

In this notebook, the loading (in-memory) and exporting capabilities of Pandas are tested. The following steps are performed:

- Loading data from multiple csv files into one dataframe (around 20 files of different sizes)
- Loading data from one big csv file (3.2 million rows)
- Exporting the data from the dataframe to a Parquet file

Prerequisites for running this notebook:
- The preceding notebook "data_setup/00_Get_Data.ipynb" must be executed first; it stores the csv files in the directories referenced below.
- **Increase buffer size** when running locally: <code>jupyter notebook --NotebookApp.max_buffer_size=your_value</code>, where your_value is the desired buffer size in bytes (no spaces around the equals sign).

In [1]:
import os
import sys
import glob
import time
import psutil
from pathlib import Path
import pyarrow
import pandas as pd

In [2]:
app_dir = Path().cwd().parent.absolute()
sys.path.insert(0, str(app_dir))

In [3]:
from app.functions import monitor_usage
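
The source of `monitor_usage` is not shown in this notebook. Judging only from its printed output (`CPU Usage` and `Memory Usage` in percent), it could look roughly like the sketch below. The function name `monitor_usage_sketch`, the `interval` parameter and the return value are assumptions made for illustration, not the actual implementation in `app.functions`.

```python
import psutil


def monitor_usage_sketch(interval=0.1):
    """Print and return system-wide CPU and memory utilisation in percent.

    Hypothetical stand-in for app.functions.monitor_usage.
    """
    # cpu_percent blocks for `interval` seconds and averages over that window
    cpu = psutil.cpu_percent(interval=interval)
    # Share of physical RAM currently in use, across all processes
    mem = psutil.virtual_memory().percent
    print(f"CPU Usage: {cpu}%")
    print(f"Memory Usage: {mem}%")
    return cpu, mem


cpu, mem = monitor_usage_sketch()
```

Both figures are system-wide, so other running processes influence them as well as the notebook kernel.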

In [4]:
# Remember the notebook's starting directory so later cells can return to it
path = os.getcwd()

# Test 1: Loading from Multiple Files

In [5]:
# Path to the GWR csv files on canton level
os.chdir(path)
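# Build the path with pathlib: "\d" and "\G" are invalid escape sequences and backslashes only work on Windows
os.chdir(Path("..") / "datasets" / "GWR")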

### Loading with Pandas (in-memory)

In [6]:
monitor_usage()
ct = time.time()
cantons_df_pandas = pd.concat([pd.read_csv(f, on_bad_lines='skip', sep='\t') for f in glob.glob('*.csv')])
print(f"Pandas load time: {(time.time() - ct)}")
print(f"Number of dataframe rows: {cantons_df_pandas.shape[0]}")
monitor_usage()

CPU Usage: 0.2%
Memory Usage: 44.4%
Pandas load time: 4.291918516159058
Number of dataframe rows: 2573168
CPU Usage: 0.4%
Memory Usage: 47.1%
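
One detail of the `pd.concat` call above: without `ignore_index=True`, each file keeps its own row labels, so the combined index contains duplicates. The sketch below illustrates this with two tiny tab-separated files created in a temporary directory; the file names and contents are made up and only stand in for the per-canton files.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Two small tab-separated files standing in for the per-canton exports
    (tmp_dir / "a.csv").write_text("EGID\tDPLZNAME\n1\tAarau\n2\tBaden\n")
    (tmp_dir / "b.csv").write_text("EGID\tDPLZNAME\n3\tBern\n")

    # sorted() makes the load order deterministic; glob order is not guaranteed
    frames = [pd.read_csv(f, sep="\t") for f in sorted(tmp_dir.glob("*.csv"))]

kept = pd.concat(frames)                           # per-file row labels are kept
renumbered = pd.concat(frames, ignore_index=True)  # one continuous RangeIndex

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```

The row count is the same in both cases, so the timing comparison is unaffected; the difference only matters for later label-based lookups.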


In [7]:
# Checking the structure
cantons_df_pandas.head(1)

Unnamed: 0,EGID,EDID,EGAID,DEINR,ESID,STRNAME,STRNAMK,STRINDX,STRSP,STROFFIZIEL,DPLZ4,DPLZZ,DPLZNAME,DKODE,DKODN,DOFFADR,DEXPDAT
0,520001,0,100428604,43,10109554.0,Schiffländestrasse,Schiffländestr.,Sch,9901.0,1.0,5000,0,Aarau,2645359.617,1249365.315,1.0,2023-06-24
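
The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a file is empty, for example when a dataframe was written with its index. Whether that is how the GWR files were created is an assumption. The sketch below uses made-up rows to show the effect and how `index_col=0` avoids the extra column.

```python
import io

import pandas as pd

# Header starts with an empty field, as written by DataFrame.to_csv with its default index=True.
# The rows are illustrative and not taken from the GWR data.
raw = "\tEGID\tDPLZNAME\n0\t520001\tAarau\n1\t520002\tAarau\n"

as_column = pd.read_csv(io.StringIO(raw), sep="\t")
as_index = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(as_column.columns.tolist())  # ['Unnamed: 0', 'EGID', 'DPLZNAME']
print(as_index.columns.tolist())   # ['EGID', 'DPLZNAME']
```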


# Test 2: Loading from Single File

In [8]:
# Path to the single GWR csv file (whole Switzerland)
os.chdir(path)
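# Build the path with pathlib: "\d" and "\G" are invalid escape sequences and backslashes only work on Windows
os.chdir(Path("..") / "datasets" / "GWR" / "GWR_Total")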

### Loading with Pandas (in-memory)

In [9]:
monitor_usage()
ct = time.time()
switzerland_df_pandas = pd.read_csv('eingang_entree_entrata_total.csv', on_bad_lines='skip', sep='\t') 
print(f"Pandas load time: {(time.time() - ct)}")
print(f"Number of dataframe rows: {switzerland_df_pandas.shape[0]}")
monitor_usage()

CPU Usage: 0.0%
Memory Usage: 47.1%
Pandas load time: 4.8594348430633545
Number of dataframe rows: 3262598
CPU Usage: 0.6%
Memory Usage: 50.4%
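
Both tests measure elapsed time with `time.time()`, which follows the wall clock and can jump if the system clock is adjusted. A minimal sketch of an alternative based on `time.perf_counter()` is given below; the helper name `timed` and the `results` dictionary are hypothetical and not part of the notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock duration of the enclosed block."""
    # perf_counter is monotonic and has the highest available resolution
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw value for later comparison
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("dummy workload", timings):
    total = sum(range(1_000_000))
```

In the notebook, the `with` block would wrap the `pd.read_csv` call instead of the dummy workload.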


## Initial Results:
In the **first scenario**, DuckDB was up to **2.5 sec faster** than Pandas, or around **50% faster** relative to the total loading time. However, this advantage only holds as long as the referenced object fits into memory. In the **second scenario**, DuckDB also performed better, but it was initially unable to read the data in one transaction; this only worked after increasing the memory buffer for the notebook kernel.
<br> <br>
Also worth mentioning: less code is required in DuckDB to achieve the same results. For instance, DuckDB does not need to be told the separator when reading the csv file. Pandas, by contrast, assumes a comma by default, so a file with a different separator is parsed incorrectly or raises an error unless `sep` is set.
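
How Pandas treats a tab-separated file when `sep` is omitted can be checked with a small in-memory sample. The rows below are made up for illustration; only the column names are borrowed from the GWR data.

```python
import io

import pandas as pd

# Tab-separated sample without any commas in the data
raw = "EGID\tDPLZ4\tDPLZNAME\n520001\t5000\tAarau\n520002\t5000\tAarau\n"

default_sep = pd.read_csv(io.StringIO(raw))        # sep defaults to ","
tab_sep = pd.read_csv(io.StringIO(raw), sep="\t")  # separator given explicitly

print(default_sep.shape)  # (2, 1): every line ends up in a single column
print(tab_sep.shape)      # (2, 3)
```

In this sample the default call does not fail; it silently returns one wide column. A parser error typically appears only when stray commas make the field count differ between lines.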

# Export dataframe to parquet

In [10]:
monitor_usage()
ct = time.time()
# File name reflects the codec actually used: compression='snappy' writes Snappy, not gzip
cantons_df_pandas.to_parquet('cantons_gwr_pandas.snappy.parquet', engine='auto', compression='snappy',
              index=None, partition_cols=None, storage_options=None)
print(f"Pandas export time: {(time.time() - ct)}")
monitor_usage()

CPU Usage: 1.0%
Memory Usage: 50.3%
Pandas export time: 2.8576462268829346
CPU Usage: 0.6%
Memory Usage: 51.9%


The export to Parquet with DuckDB was 50% faster than with Pandas.<br>
*Note: Pandas requires an additional library (**pyarrow** or **fastparquet**) to export the dataframe as a Parquet file.*