## Data retrieval and Bodo

+ Having seen some of Bodo's optimizations, let's look closely at how data retrieval performs—with & without Bodo.

---

In [1]:
import pandas as pd, numpy as np
pd.set_option('display.precision', 2)
import time
import bodo

---

+ We begin, as usual, by importing useful packages: Pandas, NumPy, `time`, and `bodo`.

---

In [2]:
import math
def human_bytes(size_bytes):
    "Converts numerical size in Bytes to sensible units"
    assert size_bytes>0, "Only works with positive values"
    scale = math.floor(math.log(size_bytes)/math.log(1024))
    units = ['B', 'kiB', 'MiB', 'GiB', 'TiB']
    return f'{size_bytes/(1024**scale):.2f} {units[scale]}'

---

+ We also define a convenient utility function `human_bytes`.
+ The math details are less important than what it does:
  + convert positive sizes stated in Bytes to more useful units.

---

In [3]:
for bsize in [53, 2_563, 35_512_191, 567_123_523_527]:
    print(f"{f'{bsize} B is {human_bytes(bsize)}':>28s}")

             53 B is 53.00 B
          2563 B is 2.50 kiB
     35512191 B is 33.87 MiB
567123523527 B is 528.17 GiB


---

+ As an example, here we make a list of integers representing file sizes of different scales.
+ The function `human_bytes` converts the numbers into strings in binary units: bytes (B), kibibytes (kiB), mebibytes (MiB), gibibytes (GiB), or tebibytes (TiB), as appropriate.
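
---

+ As written, `human_bytes` rejects a size of zero and would run out of unit labels beyond TiB.
+ A minimal sketch of a more forgiving variant follows; the name `human_bytes_safe` is hypothetical and not part of the original notebook.

```python
def human_bytes_safe(size_bytes):
    """Hypothetical variant of human_bytes: accepts 0 and caps the unit at TiB."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    units = ['B', 'kiB', 'MiB', 'GiB', 'TiB']
    size = float(size_bytes)
    scale = 0
    # Divide by 1024 until the value fits the unit or we run out of units.
    while size >= 1024 and scale < len(units) - 1:
        size /= 1024
        scale += 1
    return f'{size:.2f} {units[scale]}'

for bsize in [0, 53, 2_563, 35_512_191, 567_123_523_527]:
    print(f'{bsize} B is {human_bytes_safe(bsize)}')
```

+ For the four sizes used above, this variant produces the same strings as `human_bytes`.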

---

### Layout of the data

```bash
bodo-examples-data/bodo-training-fundamentals/DATA/
|
└─ CSV ─────────────── PARQUET ────────── PARQUET_010 ────── PARQUET_050
```

+ Four principal subdirectories of files

---

+ The data we're working with is stored in a remote *S3 bucket*.
+ It's conceptually laid out as four main subdirectories.

---

### Layout of the data

```bash
bodo-examples-data/bodo-training-fundamentals/DATA/
|
└─ CSV ─────────────── PARQUET ────────── PARQUET_010 ────── PARQUET_050
    ├── samples_001.csv  ├── samples_001.pq  ├── samples_001.pq  ├── ...
    ├── samples_002.csv  ├── samples_002.pq  ├── samples_002.pq  ├── ...
    :                    :                   :                   :
    :                    :                   └── samples_010.pq  :
    :                    :                                       :
    :                    :                                       └── ...
    ├── samples_499.csv  ├── samples_499.pq
    └── samples_500.csv  └── samples_500.pq
```

---

+ The directories `CSV` & `PARQUET` each hold 500 individual files...
  + ...each file containing one million rows, encoded as CSV and Parquet respectively.
+ The directories `PARQUET_010` & `PARQUET_050` hold smaller subsets of the Parquet data (10 and 50 files respectively).

---

In [4]:
DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'

loading_opts = dict(storage_options=dict(anon=True))

In [5]:
from s3fs import S3FileSystem
s3 = S3FileSystem(anon=True)

In [6]:
s3.du(f'{DATA_ROOT}/CSV/samples_001.csv') # gets file size in Bytes

129939814

---

+ We'll define `DATA_ROOT` to represent the main path in the public S3 bucket containing our data.
+ We'll also define `loading_opts` as a dictionary to use with Pandas functions for reading dataframes.

---

+ Importing the class `S3FileSystem` from the module `s3fs` enables exploration of this public S3 bucket.
+ To do this, instantiate an object `s3` from this class.

---

+ The `s3` object now has numerous useful methods mimicking Unix file operations.
+ For instance, the `du` method returns the "disk usage" of a file or directory as the Unix command would.

---

In [7]:
s3.ls(f'{DATA_ROOT}/PARQUET_010') # gets contents of "directory"

['bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_001.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_002.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_003.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_004.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_005.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_006.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_007.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_008.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_009.pq',
 'bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_010/samples_010.pq']

---

+ Similarly, the `ls` method applied to a directory-like path returns...
  + ...the contents of the directory as a list of strings.

---

In [30]:
CSV_FILE = 'CSV/samples_001.csv'
%time df = pd.read_csv(f's3://{DATA_ROOT}/{CSV_FILE}', nrows=3)
df  # Shows the entire dataframe (only three rows)

CPU times: user 138 ms, sys: 33 ms, total: 171 ms
Wall time: 2.32 s


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
0,Tomas Talley,49,1994-12-09 13:50:37,16.3,Health,
1,Paulene Greer,31,2004-06-05 14:43:15,129.76,Sporting-Goods,Good : Any element of a tuple can be accessed ...
2,Barrett Mccray,69,1990-04-06 18:35:10,85.19,Electronics,Fine : I don't even care.


---

+ We can mimic the Unix `head` command to examine the first few rows of a text file.
+ The strategy here is to use Pandas `read_csv` with the `nrows` keyword set to a small value.
+ This ensures that only a small number of rows are read from the S3 bucket & transferred.
+ Notice the wall time is small (about two seconds), reflecting the small amount of data transferred & parsed.
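
---

+ The `nrows` strategy can be tried without any network access.
+ The sketch below uses an in-memory string (`csv_text`, a made-up stand-in for a larger CSV file) in place of the S3 path.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a larger CSV file (1000 data rows).
csv_text = "Name,Age\n" + "".join(f"person_{i},{20 + i}\n" for i in range(1000))

# Only the header and the first three data rows are parsed.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(head)
```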

---

+ Here, the data retrieved is stored internally as a Pandas dataframe.
+ As expected, it has only three rows.

---

In [9]:
print(df.to_csv(index=False))

Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
Tomas Talley,49,1994-12-09 13:50:37,16.3,Health,
Paulene Greer,31,2004-06-05 14:43:15,129.76,Sporting-Goods,Good : Any element of a tuple can be accessed in constant time.
Barrett Mccray,69,1990-04-06 18:35:10,85.19,Electronics,Fine : I don't even care.



+ Converting this dataframe using the `to_csv` method returns the first few rows of the original text file.
+ The option `index=False` omits the row index from the output.

---

In [10]:
def get_size(S3_PATH):
    size = s3.du(S3_PATH)
    return human_bytes(size)

print(f'{CSV_FILE}:\t{get_size(f"{DATA_ROOT}/{CSV_FILE}")}')

CSV/samples_001.csv:	123.92 MiB


+ Combining `du` with `human_bytes` from above lets us get...
  + ...a sense of file & directory sizes from the S3 bucket.

---

In [11]:
for FILE in ['CSV/samples_001.csv', 'PARQUET/samples_001.pq']:
    print(f'{FILE:>22s}:\t{get_size(f"{DATA_ROOT}/{FILE}"):>10s}')


   CSV/samples_001.csv:	123.92 MiB
PARQUET/samples_001.pq:	 22.94 MiB


+ This function `get_size` permits us to compare the relative sizes of files encoded as CSV & Parquet.
+ Sure enough, the Parquet data is substantially smaller than the same data encoded as CSV text files (about a factor of five).
+ The Apache Parquet format is designed with *compressed column-based storage*, which also helps with data parsing, retrieval, & computation.

---

In [12]:
for PATH in ['CSV', 'PARQUET', 'PARQUET_010', 'PARQUET_050']:
    print(f'{PATH:<12}:\t{get_size(f"{DATA_ROOT}/{PATH}"):>10s}')

CSV         :	 55.48 GiB
PARQUET     :	 11.15 GiB
PARQUET_010 :	228.43 MiB
PARQUET_050 :	  1.12 GiB


+ Examining the directories shows the relative storage requirements...
  + ...of CSV versus Parquet formats.
+ Again, the CSV data requires about five times as much disk space.
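
---

+ The "factor of five" can be checked directly from the sizes printed above.
+ This is plain arithmetic on the reported numbers, not a new measurement.

```python
# Sizes as reported above by get_size (MiB for single files, GiB for directories).
file_ratio = 123.92 / 22.94   # CSV/samples_001.csv vs. PARQUET/samples_001.pq
dir_ratio = 55.48 / 11.15     # CSV directory vs. PARQUET directory

print(f'single file: {file_ratio:.2f}x, whole directory: {dir_ratio:.2f}x')
```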

---

### Reading from CSV

In [13]:
DATA_SRC = f's3://{DATA_ROOT}/{CSV_FILE}' # Single CSV file
%time df = pd.read_csv(DATA_SRC, **loading_opts)
df.tail(1)

CPU times: user 4.85 s, sys: 744 ms, total: 5.6 s
Wall time: 1min 26s


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
999999,Zonia Browning,40.0,2017-03-30 14:48:36,803.67,Computers,Fine : Initially composing light-hearted and i...


+ Let's look more closely at Pandas functions for reading from S3.
+ The `read_csv` function can load a CSV file straight from the S3 bucket.
+ The resulting dataframe contains records of purchases...
   + ...with columns like `Name`, `Age`, `Purchase_Amount`, `Product`, and so on.
+ Notice, when reading from CSV, datatypes are *inferred* unless specified.

--- 

#### Examining the schema

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   Name             1000000 non-null  object 
 1   Age              838868 non-null   float64
 2   Purchase_Date    1000000 non-null  object 
 3   Purchase_Amount  1000000 non-null  float64
 4   Product          1000000 non-null  object 
 5   Product_Review   962920 non-null   object 
dtypes: float64(2), object(4)
memory usage: 45.8+ MB


+ To see the column types explicitly, use the `DataFrame.info` method.
  + There are a million rows, some with missing entries.
  + The reported memory footprint is at least 45.8 MB; the `+` marks a lower bound because the contents of `object` columns are not fully counted.
  + Four columns are of type `object` (strings, in practice); the other two (`Age` & `Purchase_Amount`) are of type `float64`.
  + The `Age` column values are inferred as `float64` because of missing (`NaN`, or not-a-number) entries.
+ We'll examine more practically encoded data shortly.
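
---

+ For `object` columns, Pandas by default counts only the references to the Python objects, not the strings themselves.
+ A small sketch of the difference, using three names from the sample rows above; `deep=True` asks Pandas to measure the strings as well.

```python
import pandas as pd

names = pd.Series(['Tomas Talley', 'Paulene Greer', 'Barrett Mccray'], dtype=object)

shallow = names.memory_usage(index=False)            # counts only the object references
deep = names.memory_usage(index=False, deep=True)    # also counts the Python string objects

print(f'shallow: {shallow} bytes, deep: {deep} bytes')
```

+ On a full dataframe, `df.info(memory_usage='deep')` reports the deeper measurement, at the cost of extra time.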

--- 

In [15]:
# Specifying column datatypes for CSV parsing

In [16]:
date_cols = ['Purchase_Date']

In [17]:
col_dtypes = {'Name': pd.StringDtype(),
              'Age': pd.Int32Dtype(),
              'Purchase_Amount':np.float16,
              'Product':pd.StringDtype(),
              'Product_Review':pd.StringDtype()}

---

+ To specify datatypes explicitly when parsing CSV files, use keyword arguments. This approach will be useful later with Bodo.

---

+ The singleton list `date_cols` specifies that the `Purchase_Date` column should be parsed as `datetime64` data.

---

+ The dictionary `col_dtypes` associates column names with NumPy or Pandas data types.
+ We choose Pandas `Int32Dtype` for the `Age` column, which yields an extension array that can hold integers alongside missing values (`pd.NA`).
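
---

+ The effect of `Int32Dtype` is easy to see on a tiny example.
+ The sketch below parses a made-up three-row CSV string (`csv_text`) twice: once with inferred types and once with an explicit nullable integer type.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with one missing Age value.
csv_text = "Name,Age\nA,49\nB,\nC,69\n"

inferred = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text), dtype={'Age': pd.Int32Dtype()})

print(inferred['Age'].dtype)   # float64: the missing entry forces a float column
print(typed['Age'].dtype)      # Int32: integers plus a missing-value marker
print(typed['Age'].tolist())
```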

---

In [18]:
csv_opts = {'parse_dates':date_cols, 'dtype':col_dtypes}

In [19]:
%time df = pd.read_csv(DATA_SRC, **csv_opts, **loading_opts)
df.tail(1)

CPU times: user 4.72 s, sys: 586 ms, total: 5.3 s
Wall time: 40.4 s


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
999999,Zonia Browning,40,2017-03-30 14:48:36,803.5,Computers,Fine : Initially composing light-hearted and i...


---

+ The preceding objects are put together in a dictionary `csv_opts` with keys `parse_dates` & `dtype`.
+ These keys are *keyword arguments* for the Pandas `read_csv` function.
+ The key `dtype` links to the dictionary `col_dtypes` of datatypes for the columns.

---

+ The subsequent call to `read_csv` is a little cleaner using the dictionary-unpacking operator `**` for keyword arguments.
+ Observe that the wall time drops from 1 min 26 s to 40.4 s, while the CPU time barely changes (5.6 s versus 5.3 s).
  + The same bytes are transferred from S3 either way, so most of this difference likely reflects network variability rather than the explicit type conversions.
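
---

+ Several dictionaries can be unpacked in one call, as long as their keys do not collide.
+ A minimal sketch with a hypothetical stand-in function, `show_kwargs`, that simply returns the keyword arguments it receives:

```python
def show_kwargs(path, **kwargs):
    """Hypothetical stand-in for a reader function: returns the keywords it received."""
    return kwargs

csv_opts = {'parse_dates': ['Purchase_Date'], 'dtype': {'Age': 'Int32'}}
loading_opts = {'storage_options': {'anon': True}}

received = show_kwargs('data.csv', **csv_opts, **loading_opts)
print(sorted(received))

# A key present in both dictionaries raises TypeError at call time.
try:
    show_kwargs('data.csv', **csv_opts, **{'dtype': {}})
    clashed = False
except TypeError:
    clashed = True
print('duplicate key rejected:', clashed)
```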

---

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   Name             1000000 non-null  string        
 1   Age              838868 non-null   Int32         
 2   Purchase_Date    1000000 non-null  datetime64[ns]
 3   Purchase_Amount  1000000 non-null  float16       
 4   Product          1000000 non-null  string        
 5   Product_Review   962920 non-null   string        
dtypes: Int32(1), datetime64[ns](1), float16(1), string(3)
memory usage: 37.2 MB


+ Looking at the schema for this new DataFrame, the column data types have been assigned explicitly as the CSV file is parsed.
   + The columns are encoded in memory as strings, 16-bit floats, 32-bit integers, & 64-bit datetimes.
+ This reduces the memory footprint & enables efficiently stored missing (`<NA>`) entries in integer & string columns.
+ The 16-bit floats trade precision for space: the last row's `Purchase_Amount` of 803.67 now reads 803.5.
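
---

+ The rounding introduced by `float16` can be reproduced with NumPy alone.
+ The sketch below converts the last row's amount to 16-bit and 32-bit floats; using `float32` here is only an illustrative comparison, not a choice made in the notebook.

```python
import numpy as np

amount = 803.67                        # Purchase_Amount from the last row above

as_f16 = float(np.float16(amount))     # values between 512 and 1024 are spaced 0.5 apart
as_f32 = float(np.float32(amount))     # much finer spacing at this magnitude

print(as_f16)
print(round(as_f32, 2))
```

+ The 16-bit value matches the 803.5 shown in the dataframe above.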

---

### Reading from Parquet

In [21]:
PARQUET_FILE = 'PARQUET/samples_001.pq' # Single Parquet file
DATA_SRC = f's3://{DATA_ROOT}/{PARQUET_FILE}' 
pq_opts = {'use_nullable_dtypes':True}

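+ Going beyond CSV storage, Apache Parquet has three significant goals:
  + Interoperability
  + Space efficiency
  + Query efficiency

---

+ Pandas can parse Parquet files.
+ We'll use the dictionary `pq_opts` to specify the keyword argument `use_nullable_dtypes`.
  + This is not strictly necessary, but it yields nullable extension types such as `Int32` & `string`.
  + In pandas 2.0 and later, this keyword is deprecated in favor of `dtype_backend='numpy_nullable'`.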

---

In [22]:
%time df = pd.read_parquet(DATA_SRC, **pq_opts, **loading_opts)
df.tail(1)

CPU times: user 1.09 s, sys: 270 ms, total: 1.36 s
Wall time: 19.2 s


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
999999,Zonia Browning,40,2017-03-30 14:48:36,803.67,Computers,Fine : Initially composing light-hearted and i...


+ The function `read_parquet` parses this file in about 19 seconds (versus 40 seconds or more for the CSV version).
+ Again, we can examine the data quickly by looking at the last row.

---

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   Name             1000000 non-null  string        
 1   Age              838868 non-null   Int32         
 2   Purchase_Date    1000000 non-null  datetime64[ns]
 3   Purchase_Amount  1000000 non-null  float64       
 4   Product          1000000 non-null  string        
 5   Product_Review   962920 non-null   string        
dtypes: Int32(1), datetime64[ns](1), float64(1), string(3)
memory usage: 42.9 MB


+ Moreover, the `info` method shows the column types obtained when the file is parsed from S3, with no `dtype` arguments required.
+ The Parquet data is stored column-wise in a compressed binary format—with types encoded.
+ This permits the data on disk to preserve the schema efficiently for retrieval and processing.

---

### Reading from a directory of files

In [24]:
%%time
DATA_SRC = f's3://{DATA_ROOT}/PARQUET_010' # Directory
df = pd.read_parquet(DATA_SRC, **pq_opts, **loading_opts)
df.tail(1)

CPU times: user 9.81 s, sys: 2.93 s, total: 12.7 s
Wall time: 1min 10s


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
9999999,Hsiu Shelton,24,1992-10-27 14:03:09,40.43,Toys,Terrible : Ports are created with the built-in...


+ Pandas' `read_parquet` function supports reading an entire directory of files into a single dataframe.
+ The leading argument needs to be the folder containing a set of Parquet files encoded with the same schema.
+ Using `df.tail(1)` confirms that the dataframe loaded has 10 million rows (the last index is 9,999,999).
+ Reading multiple CSV files into a single dataframe requires a loop with the `concat` function, as seen previously.
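
---

+ As a reminder, the loop-and-`concat` pattern for CSV files looks roughly like this.
+ The sketch uses two in-memory strings as hypothetical stand-ins for separate CSV files; with real data, the list would hold file paths or S3 URLs instead.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for separate CSV files with the same columns.
csv_files = [
    io.StringIO("Name,Age\nA,49\nB,31\n"),
    io.StringIO("Name,Age\nC,69\n"),
]

# Parse each file separately, then stack the pieces into one dataframe.
frames = [pd.read_csv(f) for f in csv_files]
df_all = pd.concat(frames, ignore_index=True)   # renumber rows 0..n-1

print(df_all)
```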

---

### Trying again with Bodo

+ Load 50 files from `PARQUET_050` 
  + `df`: dataframe (50,000,000 rows).
+ Compute the `df.Purchase_Amount.mean()`.

+ Having experimented with various ways of loading Pandas DataFrames, let's use Bodo again.
+ We'll load 50 files from Parquet data each with a million rows.
+ Once we've assembled our DataFrame, we'll compute the mean of the `Purchase_Amount` column.

---

```python
# Retrieving the data and computing the mean (no timing yet)
@bodo.jit
def compute_mean_purchase():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC = f's3://{DATA_ROOT}/PARQUET_050'

    df = pd.read_parquet(DATA_SRC)

    avg = df['Purchase_Amount'].mean()
    return avg
```

+ Again, embed the computation within a function.
+ The key parts are:
  + retrieving the data with `read_parquet`;
  + computing the mean;
  + and, of course, the `bodo.jit` decorator.

---

```python
# Timing data retrieval only...
@bodo.jit
def compute_mean_purchase():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC = f's3://{DATA_ROOT}/PARQUET_050'
    start = time.time()
    df = pd.read_parquet(DATA_SRC)
    elapsed = time.time() - start
    avg = df['Purchase_Amount'].mean()
    return avg, elapsed
```

+ This time, the calls to `time.time` are sandwiched around the call to `read_parquet`.
+ In addition, rather than printing the elapsed time within the function, it is returned with the function value.

---

In [25]:
# Timing data retrieval only...
@bodo.jit
def compute_mean_purchase():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC = f's3://{DATA_ROOT}/PARQUET_050'
    start = time.time()
    df = pd.read_parquet(DATA_SRC)
    elapsed = time.time() - start
    avg = df['Purchase_Amount'].mean()
    return avg, elapsed

In [26]:
avg, elapsed = compute_mean_purchase()
print(f'Average Purchase_Amount: ${avg:,.2f}')
print(f'Elapsed time: {elapsed:.3f} s')

Average Purchase_Amount: $188.28
Elapsed time: 88.100 s


+ When executed on my laptop, it takes almost a minute & a half just to read the data from S3.
+ On this machine, the same computation doesn't even complete without Bodo.
  + The Bodo compiler recognizes that only a single column needs to be fetched from S3.
  + Without the compiler, all columns of all 50 files would be downloaded from S3 & assembled in memory.
  + That crashes the Jupyter kernel on my laptop, most likely by exhausting memory.
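
---

+ In plain Pandas, a single column can be requested by hand with the `columns` keyword of `read_parquet`.
+ The sketch below is only a manual analog of the column selection described above, not a description of what Bodo does internally; the helper name `mean_of_column` is hypothetical, and whether the result fits in memory on a given machine is untested here.

```python
import pandas as pd

def mean_of_column(path, column, **read_opts):
    """Read a single column from Parquet data and return its mean."""
    # Only the requested column is loaded into the dataframe.
    df = pd.read_parquet(path, columns=[column], **read_opts)
    return df[column].mean()

# Example call (requires network access; not run here):
# mean_of_column('s3://bodo-examples-data/bodo-training-fundamentals/DATA/PARQUET_050',
#                'Purchase_Amount', storage_options={'anon': True})
```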

---

### Summary

+ Plausible data retrieval bottlenecks:
  + Single file vs. many files (API mostly)
  + Parquet vs. CSV (or others)
  + Local vs. remote storage (e.g., S3)

---

+ When looking at bottlenecks in data retrieval for a computation, consider:
  + whether the data is in a single file or in many files (this may affect the coding APIs available);
  + whether or not the data is stored in an optimized format such as Parquet; and
  + whether the data is stored locally or remotely (e.g., in an S3 bucket).
+ Generally, reading Parquet from S3 is a lot faster than reading CSV from an S3 bucket.
+ A number of issues are at play:
  + moving data across a network;
  + smaller file sizes; and
  + parsing data from storage into abstract data types in memory.
+ We'll discuss all this next.

---