## Accelerating Analysis with Bodo

+ We've timed some data retrieval examples and put them in the context of computer hardware.
+ Let's develop some more examples of analysis with Bodo.

---

In [1]:
import pandas as pd, numpy as np
import time
import bodo
from s3fs import S3FileSystem
s3 = S3FileSystem(anon=True)
pd.set_option('display.precision', 2)

+ We begin with the usual imports.
+ We'll also use `s3fs` to probe the files on the S3 bucket.

---

### Building DataFrame from 50 files on S3

In [2]:
DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
DATA_DIR  = 'PARQUET_050'
DATA_SRC  = f's3://{DATA_ROOT}/{DATA_DIR}'
print(f'Disk usage: {s3.du(f"{DATA_ROOT}/{DATA_DIR}")/(1024**3):.2f} GiB')

Disk usage: 1.12 GiB


+ We'll work with data from 50 Parquet files.
+ These are about 1.1 GiB on the S3 bucket.

---

In [3]:
loading_opts = dict(storage_options=dict(anon=True))
%time df = pd.read_parquet(DATA_SRC, **loading_opts)

CPU times: user 51.8 s, sys: 10.1 s, total: 1min 1s
Wall time: 4min 15s


In [4]:
df.memory_usage().sum() / (1024**3)

2.095475912094116

---

+ We define `loading_opts` as a dictionary to enable reading from S3.
+ This takes a while to load (over 4 minutes of wall time) but it does complete.

---

+ Invoking the DataFrame's `memory_usage` method & summing reveals the total memory footprint as about 2.1 GiB.
+ This is nearly twice the size of the S3 storage (1.12 GiB) because the Parquet files are compressed on disk.
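+ As a side note, `memory_usage` defaults to `deep=False`, which counts only the pointers held in object-dtype columns such as `Product_Review`, so the figure above may understate the true footprint.
+ A minimal sketch on a made-up toy frame (not the notebook's data) comparing the two settings:

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'Purchase_Amount': [16.30, 129.76, 85.19],
    'Product_Review': pd.Series(
        ['Good : fine product', None, 'Terrible : broke within a week'],
        dtype=object),
})

GIB = 1024**3
shallow = toy.memory_usage().sum()           # object columns: pointers only
deep = toy.memory_usage(deep=True).sum()     # also counts the Python string objects

print(f'shallow: {shallow} bytes ({shallow / GIB:.2e} GiB)')
print(f'deep:    {deep} bytes ({deep / GIB:.2e} GiB)')
```

+ On the full DataFrame, `deep=True` has to inspect every string, so it can take noticeably longer than the default.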

---

### A Groupby Example

In [5]:
# Computing the average Purchase_Amount grouped by Product
%time avgs = df.groupby('Product')['Purchase_Amount'].mean()

CPU times: user 2.59 s, sys: 713 ms, total: 3.3 s
Wall time: 2.87 s


---

+ With the DataFrame in memory, we'll group the transactions by `Product` category...
 + ...and compute the mean of the groups of `Purchase_Amount`s.
+ This computation takes about 3 seconds.

---

In [6]:
print(f'Average Purchase_Amounts grouped by Product:\n')
display(pd.DataFrame(avgs).transpose())

Average Purchase_Amounts grouped by Product:



Product,Automotive,Beauty,Books,Clothes,Computers,Electronics,Food,Health,Music,Sporting-Goods,Toys
Purchase_Amount,687.98,25.8,34.4,64.5,860.03,86.01,8.6,21.5,17.2,129.01,43.0


+ The result of the groupby & mean just computed is a Pandas Series.
+ We convert it to a DataFrame & transpose for easier display.
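+ The same groupby, mean, and transpose pattern can be tried on a tiny frame; the column names follow the notebook, but the values below are made up.

```python
import pandas as pd

# Toy data with made-up amounts.
toy = pd.DataFrame({
    'Product': ['Food', 'Books', 'Food', 'Books', 'Toys'],
    'Purchase_Amount': [8.0, 30.0, 10.0, 40.0, 43.0],
})

avgs_toy = toy.groupby('Product')['Purchase_Amount'].mean()  # Series indexed by Product
wide = avgs_toy.to_frame().transpose()                       # one row, one column per Product

print(type(avgs_toy).__name__)
print(wide)
```

+ Here `to_frame()` plays the same role as wrapping the Series in `pd.DataFrame(...)`.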

---

### A Transformation & Groupby Example

In [7]:
df.Product_Review.sample(n=10)

47855898    Good : Ports are used to communicate with the ...
9139494     Terrible : Erlang is known for its designs tha...
46573108                                                 None
45351498                                                 None
25215257    Great : Erlang is a general-purpose, concurren...
35819821                                                 None
49175526    Fine : Erlang is known for its designs that ar...
22546398    Great : They are written as strings of consecu...
31534942    Good : The syntax {D1,D2,...,Dn} denotes a tup...
10678903    Terrible : Atoms can contain any character if ...
Name: Product_Review, dtype: object

---

+ For another example computation, with a more complicated flavour,...
 + ...let's examine a few random rows of the `Product_Review` column of the DataFrame in memory.
+ These are strings that begin with `Terrible`, `Fine`, `Good`, or `Great`.
+ Some rows have no review (indicated by `None`).

---

In [8]:
row = df.iloc[-1]
print(row.Product_Review)

Terrible : Any element of a tuple can be accessed in constant time.


In [9]:
print(row.Product_Review.split())

['Terrible', ':', 'Any', 'element', 'of', 'a', 'tuple', 'can', 'be', 'accessed', 'in', 'constant', 'time.']


---

+ We'll extract a single row as a Series, so we can work out how to convert each review to a numerical score.

---

+ After extracting the (non-empty) `Product_Review` from a row, the Python string method `split` returns a list of words.
+ The zeroth word in this list, `Terrible`, is the one we want.

---

In [10]:
scores = {'Terrible':1, 'Fine':2, 'Good':3, 'Great':4}
key = row.Product_Review.split()[0]
translation = f'{key} -> {scores[key]}'
print(translation)

Terrible -> 1


+ We can use a dictionary `scores` to represent the mapping of the review keywords to numbers. 
+ The numerical scores range from one to four...
+ ...and, for this row, the keyword `Terrible` translates to a score of `1`.
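+ The lookup can also be sketched on bare strings. The helper name `keyword_score` is hypothetical (it is not part of the notebook), and returning `NaN` for unrecognized keywords is an assumption added here for robustness.

```python
import math

SCORES = {'Terrible': 1, 'Fine': 2, 'Good': 3, 'Great': 4}

def keyword_score(review):
    """Return the score for a review string, or NaN if it is missing or unrecognized."""
    if not isinstance(review, str) or not review.strip():
        return math.nan
    keyword = review.split()[0]            # zeroth word, e.g. 'Terrible'
    return SCORES.get(keyword, math.nan)   # .get avoids a KeyError on unknown keywords

for text in ['Terrible : Any element of a tuple ...', 'Great : ok', 'Meh : unknown', None]:
    print(repr(text), '->', keyword_score(text))
```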

---

In [11]:
def extract_score(row):
    scores = {'Terrible':1, 'Fine':2, 'Good':3, 'Great':4}
    return (np.nan if pd.isna(row.Product_Review)
                   else scores[row.Product_Review.split()[0]])

+ The preceding logic can be wrapped in a Python function `extract_score`.
+ The missing entries are assigned a score of `NaN` using a conditional (ternary) expression.
+ This also avoids the exceptions that would be raised by trying to apply the `split` method to non-strings.

---

In [12]:
display(df.head())

Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
0,Tomas Talley,49,1994-12-09 13:50:37,16.3,Health,
1,Paulene Greer,31,2004-06-05 14:43:15,129.76,Sporting-Goods,Good : Any element of a tuple can be accessed ...
2,Barrett Mccray,69,1990-04-06 18:35:10,85.19,Electronics,Fine : I don't even care.
3,Cammie Adkins,57,2020-07-17 14:25:20,101.17,Electronics,Good : He looked inquisitively at his keyboard...
4,Breann Moses,59,2003-03-16 17:18:02,7.49,Food,Good : Where are my pants?


+ As a first check that the function `extract_score` works, examine the first few rows of the dataframe `df`...

---

In [13]:
score = df.head().apply(extract_score, axis=1)
display(score)

0    NaN
1    3.0
2    2.0
3    3.0
4    3.0
dtype: float64

+ ...and then use the `apply` method to return the corresponding Series of scores.

---

In [14]:
# First, extract a smaller DataFrame
sub_df = df.head(1_000_000).copy() # Extract one file of data
sub_df.tail(3)

Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
999997,Diego Dominguez,21,2002-01-18 08:38:38,16.4,Music,
999998,Twanna Phillips,64,2018-12-23 10:36:36,56.48,Clothes,Terrible : The Galactic Empire is nearing comp...
999999,Zonia Browning,40,2017-03-30 14:48:36,803.67,Computers,Fine : Initially composing light-hearted and i...


+ As another check, we extract the equivalent of a single file's worth of data as a sub-dataframe.

---

In [15]:
# Create a column "Score" by extracting numerical values corresponding to the first word of each review.
%time sub_df['Score'] = sub_df.apply(extract_score, axis=1)
sub_df.tail(3)[['Name', 'Age', 'Score']]

CPU times: user 15.8 s, sys: 52.2 ms, total: 15.8 s
Wall time: 15.8 s


Unnamed: 0,Name,Age,Score
999997,Diego Dominguez,21,
999998,Twanna Phillips,64,1.0
999999,Zonia Browning,40,2.0


+ We then use the `apply` method to evaluate the function `extract_score` on every row of the sub-dataframe.
 + The computed Series is stored as a new column `Score`.
+ As expected, this is quite slow (about 16 seconds): with `axis=1`, `apply` calls the Python function `extract_score` once per row.
+ Remember, this is for a much smaller DataFrame that fits in about 50 MiB of memory;
 + ...we want to generalize this to large DataFrames.
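+ For comparison only, the same scores can be computed in plain Pandas without a row-wise `apply`, using the `.str` accessor and `map`. This is an alternative sketch on made-up reviews, not the approach taken in this notebook, and its behaviour under Bodo is not examined here.

```python
import numpy as np
import pandas as pd

scores = {'Terrible': 1, 'Fine': 2, 'Good': 3, 'Great': 4}

def extract_score(row):
    # Same logic as the notebook's row-wise function.
    return (np.nan if pd.isna(row.Product_Review)
            else scores[row.Product_Review.split()[0]])

# Toy frame with made-up reviews, including missing entries.
toy = pd.DataFrame({'Product_Review': pd.Series(
    ['Good : a', None, 'Terrible : b', 'Great : c', None, 'Fine : d'],
    dtype=object)})

row_wise = toy.apply(extract_score, axis=1)

# Vectorized: split every string, take the first token, map it through the dict.
first_word = toy['Product_Review'].str.split().str[0]
vectorized = first_word.map(scores)

print(pd.DataFrame({'row_wise': row_wise, 'vectorized': vectorized}))
```

+ Both columns should agree, with `NaN` wherever the review is missing.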

---

In [16]:
# Next, groupby Product, extract the Purchase_Amount & Score columns, & aggregate by mean.
cols = ['Purchase_Amount', 'Score']
%time result = sub_df.groupby('Product')[cols].mean()
display(result.transpose())

CPU times: user 60.1 ms, sys: 16.6 ms, total: 76.7 ms
Wall time: 75 ms


Product,Automotive,Beauty,Books,Clothes,Computers,Electronics,Food,Health,Music,Sporting-Goods,Toys
Purchase_Amount,687.59,25.82,34.36,64.46,859.39,86.1,8.6,21.5,17.21,129.02,43.04
Score,2.17,2.23,2.18,2.37,3.13,2.04,2.62,2.51,2.3,2.43,1.98


+ Finally, we group by `Product` and look at the mean `Purchase_Amount` and `Score`.
+ This takes less than a second, but we are still working on a small dataset.
+ Again, we transpose the result for convenient display.

---

### Improving these examples with Bodo

In [17]:
# First jitted function:
# (i) groupby Product, mean Purchase_Amount
@bodo.jit
def compute_groupby_mean_bodo():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC  = f's3://{DATA_ROOT}/PARQUET_050'
    t0 = time.time()
    df = pd.read_parquet(DATA_SRC)
    avgs = df.groupby('Product')['Purchase_Amount'].mean()
    t1 = time.time()
    return avgs, t1 - t0

+ We can encapsulate the preceding computations in Bodo-jitted functions.
+ The first one, `compute_groupby_mean_bodo`, simply groups by `Product` & returns the mean `Purchase_Amount`.
+ Notice the time includes both the loading time and the time for the groupby computation.
+ The `bodo.jit` decorator will recognize that only two columns (`Product` and `Purchase_Amount`) are used;
 + the compiled function will save a lot in memory & network use compared to the preceding implementation.

---

In [18]:
# Second jitted function:
# (i) compute Score column
# (ii) groupby Product, mean Purchase_Amount & Score
@bodo.jit
def compute_scores_groupby_means_bodo():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC  = f's3://{DATA_ROOT}/PARQUET_050'
    t0 = time.time()
    df = pd.read_parquet(DATA_SRC)
    df['Score'] = df.apply(extract_score, axis=1)
    result = df.groupby('Product')\
              [['Purchase_Amount', 'Score']].mean()
    t1 = time.time()
    return result, t1 - t0

+ The second jitted function, `compute_scores_groupby_means_bodo`, embeds the preceding transformation logic.
+ It is more involved because it has to transform the `Product_Review` column to yield the `Score` column.
+ The function returns a DataFrame—the result of the groupby—as well as the time elapsed.

---

In [19]:
avgs_bodo, elapsed = compute_groupby_mean_bodo()

In [20]:
print(f'Average Purchase_Amounts grouped by Product:')
display(pd.DataFrame(avgs_bodo).transpose())
print(f'Time elapsed: {elapsed:10.4f} s')

Average Purchase_Amounts grouped by Product:


Product,Health,Sporting-Goods,Electronics,Food,Toys,Computers,Music,Clothes,Automotive,Beauty,Books
Purchase_Amount,21.5,129.01,86.01,8.6,43.0,860.03,17.2,64.5,687.98,25.8,34.4


Time elapsed:    96.9399 s


+ Consider the first jitted function.
+ The application of the Bodo JIT decorator reduces the total time (loading plus groupby) from over 4 minutes to about 100 seconds.
+ Notice that the result computed is superficially different in that the index is not sorted.

In [21]:
# Verify result computed is the same!
pd.DataFrame(avgs - avgs_bodo).transpose()

Product,Automotive,Beauty,Books,Clothes,Computers,Electronics,Food,Health,Music,Sporting-Goods,Toys
Purchase_Amount,-7.62e-12,-7.64e-13,-1.89e-12,-1.53e-12,1.84e-11,2.09e-12,2.13e-13,1.57e-12,3.55e-14,-4.8e-12,-4.74e-12


+ We can check that the Series computed are in fact the same by subtracting.
+ Remember, `avgs` is the original Series computed; and `avgs_bodo` is computed by the Bodo-jitted function.
+ We display the difference as a transposed DataFrame for convenience.
+ The differences are on the order of $10^{-11}$ or smaller in magnitude;
 + these absolute differences reflect ordinary floating-point rounding.
 + That is, the Bodo-optimized code likely accumulates sums in a different order.
 + We then expect the results to agree to within roughly machine precision multiplied by 50 million...
 + ...(the number of rows in our DataFrame).
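+ A small sketch of both points: floating-point addition depends on the order of summation, and two results whose indexes are ordered differently can be compared after alignment. The Series values and the tolerance `1e-9` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Floating-point addition is not associative, so summation order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left, abs(left_to_right - right_to_left))

# Two toy results with nearly equal values but a different index order.
a = pd.Series({'Books': 34.4, 'Food': 8.6, 'Toys': 43.0})
b = pd.Series({'Toys': 43.0 + 1e-12, 'Books': 34.4, 'Food': 8.6 - 1e-12})

diff = a - b                                   # subtraction aligns on index labels
same = np.allclose(a, b.reindex(a.index), rtol=0, atol=1e-9)

# Loose relative scale for accumulated rounding: machine epsilon times the row count.
bound = np.finfo(np.float64).eps * 50_000_000
print(diff.abs().max(), same, bound)
```

+ Because Pandas aligns on index labels, the unsorted index of the Bodo result does not affect the subtraction.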

---

In [22]:
%%time
scores_avgs, elapsed = compute_scores_groupby_means_bodo()

CPU times: user 1min 15s, sys: 2.06 s, total: 1min 17s
Wall time: 2min 52s


+ Remember, it took over 4 minutes just to load the DataFrame from the S3 bucket.
+ And it took about 16 seconds to apply the transformation to one fiftieth of the dataset.
+ In plain Pandas, this computation was not really practical with all 50 million rows;
 + the Bodo-jitted function loads, transforms, and aggregates all of them in about three minutes!
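+ A rough back-of-the-envelope comparison, using the timings printed in the cells above. The plain-Pandas cost for all rows is an extrapolation that assumes `apply` scales linearly with the number of rows; it was not measured.

```python
# Timings copied from the cells above (seconds).
load_pandas = 4 * 60 + 15        # wall time of pd.read_parquet
apply_one_file = 15.8            # apply on 1 million rows
n_files = 50

# Assumption: apply cost grows linearly with the number of rows.
apply_all = apply_one_file * n_files
pandas_total = load_pandas + apply_all
bodo_wall = 2 * 60 + 52          # %%time wall time of the jitted function

print(f'estimated apply on all rows: {apply_all / 60:.1f} min')
print(f'estimated Pandas total:      {pandas_total / 60:.1f} min')
print(f'estimated speedup:           {pandas_total / bodo_wall:.1f}x')
```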
 

---

In [23]:
print(f'Average Purchase_Amounts & Scores grouped by Product:')
display(scores_avgs.transpose())
print(f'Time elapsed: {elapsed:10.4f} s')

Average Purchase_Amounts & Scores grouped by Product:


Product,Health,Sporting-Goods,Electronics,Food,Toys,Computers,Music,Clothes,Automotive,Beauty,Books
Purchase_Amount,21.5,129.01,86.01,8.6,43.0,860.03,17.2,64.5,687.98,25.8,34.4
Score,2.48,2.55,2.51,2.55,2.42,2.51,2.47,2.45,2.48,2.47,2.62


Time elapsed:   154.4954 s


+ When we display the results, notice that the quantity `elapsed` returned by the function...
 + ...is about 15-20 seconds less than the wall time recorded by the `%%time` IPython magic command.
+ The difference largely reflects the time required to compile the function.
+ That may seem long, but it is much less than the computation time;
 + indeed, it's much less than the 4+ minutes needed just to download the original data in Pandas!
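+ The gap can be worked out directly from the two numbers reported above; attributing all of it to compilation is an approximation.

```python
wall_time = 2 * 60 + 52      # seconds, from %%time around the call
elapsed_inside = 154.4954    # seconds, measured inside the jitted function

overhead = wall_time - elapsed_inside
print(f'time outside the timed region: {overhead:.1f} s '
      f'({overhead / wall_time:.0%} of the wall time)')
```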

---

### Summary

+ `bodo.jit` decorator compiles *Python functions*
+ Adapt from traditional Pandas-style analysis

+ To summarize, the key strategy is to build our analysis incrementally—exactly as we would in Pandas/NumPy/etc.
+ We can get noticeable speedups from functions decorated appropriately.

---