## A first Bodo computation

+ Welcome to this first lesson on using Bodo to speed up data processing.

---

<center><img src='./img/data_preview.png'></img></center>

+ Simulated customer transaction data
+ Some entries missing

---

+ We'll look at synthetic customer transaction data.
+ The features include `Name`, `Age`, `Purchase_Date`, `Purchase_Amount`, and so on.
+ Observe that the `Age` & `Product_Review` columns have some missing entries...
 + ...which is common in real data.

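+ As a small illustration of missing entries, here is a sketch on a made-up four-row table (the values and the name `sample` are assumptions, not the course data).
+ It counts missing entries per column with `isna().sum()` and shows that `mean` skips them by default.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the transaction table; all values are made up.
sample = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cy', 'Dee'],
    'Age': [24.0, np.nan, 31.0, np.nan],
    'Purchase_Amount': [40.43, 12.00, 99.99, 7.58],
    'Product_Review': ['Great', None, 'Terrible', 'OK'],
})

# Count the missing entries in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Series.mean() skips missing values by default (skipna=True).
mean_age = sample['Age'].mean()
print(f'Mean age over non-missing rows: {mean_age:.1f}')
```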
---

### Setting up a computation with Bodo

+ Compute average `Purchase_Amount` over a large dataset (10,000,000 rows)

In [1]:
import pandas as pd, numpy as np
pd.set_option('display.precision', 2)

---

+ Our first computation is to read 10 files of Parquet data—each with a million rows—and compute an average.
+ We start with generic imports—Pandas & NumPy

---

In [2]:
DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'

loading_opts = dict(storage_options=dict(anon=True))

In [3]:
%time df = pd.read_parquet(DATA_SRC, **loading_opts)

CPU times: user 10.6 s, sys: 2.38 s, total: 12.9 s
Wall time: 1min 2s


---

+ To get the data, we define a string `DATA_SRC` to describe its location...
  + on a public remote S3 bucket.
  + The object `DATA_SRC` is defined over two lines here simply to fit on screen.
+ We also define a dictionary `loading_opts`...
  + ...which is unpacked into keyword arguments for Pandas's `read_parquet` function.
  + ...(these options are needed to read data from a remote S3 bucket).
---
+ When executed, this took some time to load (about a minute on this laptop & internet connection).
+ We'll look at timings like this more closely later.

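+ The `**loading_opts` syntax unpacks a dictionary into keyword arguments.
+ A minimal sketch with a hypothetical stand-in function `read_stub` (not part of Pandas, and with a placeholder path) suggests the two call styles are equivalent:

```python
# Stand-in for a reader function; hypothetical, NOT the Pandas implementation.
def read_stub(path, storage_options=None):
    return path, storage_options

loading_opts = dict(storage_options=dict(anon=True))

# Unpacking the dictionary with ** ...
unpacked = read_stub('s3://some-bucket/some-path', **loading_opts)
# ... passes the same arguments as spelling out the keyword by hand.
explicit = read_stub('s3://some-bucket/some-path',
                     storage_options={'anon': True})

print(unpacked == explicit)
```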
---

In [4]:
print('The dataframe has {:,d} rows & {} columns.'
      .format(*df.shape))

df.tail(1)

The dataframe has 10,000,000 rows & 6 columns.


Unnamed: 0,Name,Age,Purchase_Date,Purchase_Amount,Product,Product_Review
9999999,Hsiu Shelton,24,1992-10-27 14:03:09,40.43,Toys,Terrible : Ports are created with the built-in...


---

+ We can examine the contents of the dataframe in memory to ensure it's loaded appropriately.
+ Sure enough, there are 10 million rows.
+ Examining the last row with the `tail` method shows the columns are as we expect.

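+ The `format(*df.shape)` idiom unpacks the `(rows, columns)` tuple into the two placeholders of the format string.
+ A sketch with a plain tuple standing in for `df.shape` (the names `shape` and `message` are assumptions):

```python
# Plain tuple standing in for df.shape: (number of rows, number of columns).
shape = (10_000_000, 6)

# {:,d} formats an integer with thousands separators; {} uses the default format.
message = 'The dataframe has {:,d} rows & {} columns.'.format(*shape)
print(message)
```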
---

```python
# Computing average of Purchase_Amount column
df['Purchase_Amount']
```

+ We can apply standard Pandas idioms for extracting columns...

---

```python
# Computing average of Purchase_Amount column
avg = df['Purchase_Amount'].mean()
```

+ ... and computing statistics like averages.

---

```python
# Computing average of Purchase_Amount column
%time avg = df['Purchase_Amount'].mean()
print(f'Average Purchase_Amount: ${avg:,.2f}')
```

+ In particular, let's time a computation of the mean of the `Purchase_Amount` column.

---

In [5]:
# Computing average of Purchase_Amount column
%time avg = df['Purchase_Amount'].mean()
print(f'Average Purchase_Amount: ${avg:,.2f}')

CPU times: user 15.9 ms, sys: 0 ns, total: 15.9 ms
Wall time: 15.2 ms
Average Purchase_Amount: $184.91


---

+ This is pretty fast in Pandas: about 15 milliseconds on this machine.
+ This speed is largely possible because the dataframe fits in local memory.

---

```python
# Embedding in a function...

def compute_mean_purchase():

    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()

    return avg
```

---

+ All of the preceding steps to compute an average purchase price can be encapsulated in a single function.
+ This is an important step in using Bodo:
  + setting up analysis in a *compilable* function like `compute_mean_purchase`.

---

```python
# Embedding in a function...
import time
def compute_mean_purchase():
    start = time.time()
    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg
```

---

+ We'll import the `time` module & modify the function `compute_mean_purchase` to call `time.time` twice: once at the start and once just before returning.
  + The difference between the two readings is the elapsed time; we'll look at this more later.

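+ The start/stop timing pattern can be tried on its own, outside any dataframe work.
+ This sketch uses a hypothetical function `timed_sum` and `time.perf_counter`, a standard-library clock intended for measuring intervals; the course code uses `time.time`, and whether `perf_counter` is usable inside Bodo-compiled code is not checked here.

```python
import time

def timed_sum(n):
    start = time.perf_counter()             # first clock reading
    total = sum(range(n))                   # stand-in for the real work
    elapsed = time.perf_counter() - start   # second reading minus the first
    print(f'Elapsed time: {elapsed:.3f} s')
    return total, elapsed

total, elapsed = timed_sum(1_000_000)
```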
---

In [6]:
# Embedding in a function...
import time
def compute_mean_purchase():
    start = time.time()
    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg

In [7]:
avg = compute_mean_purchase()
print(f'Average Purchase_Amount: ${avg:,.2f}')

Elapsed time: 57.544 s
Average Purchase_Amount: $184.91


---

+ On defining...
---
+ ...and executing the function `compute_mean_purchase`, the elapsed time is printed to the screen.
  + This took about 60 seconds on this machine.
+ Remember, this includes *both* loading the data & executing the actual computation.

---

### The bodo.jit decorator

```python
# Load module to enable bodo.jit...
import bodo

def compute_mean_purchase():
    start = time.time()
    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg
```

---

+ Now, we'll do this again.
---
+ We start by importing the `bodo` package.
+ Starting from the function `compute_mean_purchase`, we'll make three changes.

---

### The bodo.jit decorator

```python
# Load module to enable bodo.jit...
import bodo
@bodo.jit
def compute_mean_purchase():
    start = time.time()
    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg
```

---

+ First, we'll prepend the decorator `bodo.jit` to the function header.
+ Remember, decorator functions are higher-order functions.
+ This decorated function replaces the original function `compute_mean_purchase`
  + with one that is passed through Bodo's *Just-in-Time* compiler.

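+ To make "higher-order" concrete: a decorator takes a function and returns a function.
+ The decorator `announce` below is a hypothetical plain-Python example, unrelated to how `bodo.jit` works internally:

```python
import functools

def announce(func):
    """Hypothetical decorator: takes a function, returns a wrapped function."""
    @functools.wraps(func)          # keep the original name and docstring
    def wrapper(*args, **kwargs):
        print(f'calling {func.__name__}')
        return func(*args, **kwargs)
    return wrapper

@announce
def add(a, b):
    return a + b

# The @ syntax above is shorthand for rebinding the name by hand:
def multiply(a, b):
    return a * b
multiply = announce(multiply)

result_add = add(2, 3)
result_mul = multiply(2, 3)
```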
---

### The bodo.jit decorator

```python
# Load module to enable bodo.jit...
import bodo
@bodo.jit
def compute_mean_purchase_bodo():
    start = time.time()
    loading_opts = dict(storage_options=dict(anon=True))
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC, **loading_opts)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg
```

---

+ Next, we'll rename the function to `compute_mean_purchase_bodo`
  + to distinguish it as a new function.
+ This isn't strictly necessary unless we want to have both the decorated & undecorated functions in our namespace.

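+ One way to keep both versions in a namespace is to call a decorator as an ordinary function and bind the result to a new name.
+ This sketch uses the standard library's `functools.lru_cache` purely as a stand-in decorator; `square` and `square_cached` are made-up names, and the same pattern with `bodo.jit` is not shown or tested here.

```python
import functools

def square(n):
    return n * n

# Calling the decorator directly binds the decorated version to a NEW name,
# leaving the original function untouched.
square_cached = functools.lru_cache(maxsize=None)(square)

print(square(12), square_cached(12))
print(square is square_cached)
```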
---

### The bodo.jit decorator

```python
# Load module to enable bodo.jit...
import bodo
@bodo.jit
def compute_mean_purchase_bodo():
    start = time.time()
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg
```

---

+ Finally, we'll remove the dictionary `loading_opts` (and all references to it) from inside the function.
+ The Bodo JIT decorator removes the need for this extra option...
  + ...inside function calls when reading from S3 buckets.

---

In [8]:
# Load module to enable bodo.jit...
import bodo

@bodo.jit
def compute_mean_purchase_bodo():
    start = time.time()
    DATA_ROOT = 's3://bodo-examples-data/bodo-training-fundamentals'
    DATA_SRC = f'{DATA_ROOT}/DATA/PARQUET_010'
    df = pd.read_parquet(DATA_SRC)
    avg = df['Purchase_Amount'].mean()
    print(f'Elapsed time: {time.time() - start:.3f} s')
    return avg

In [9]:
avg = compute_mean_purchase_bodo()
print(f'Average Purchase_Amount: ${avg:,.2f}')

Elapsed time: 40.814 s
Average Purchase_Amount: $184.91


---

+ When this function is evaluated, it takes about 40 seconds (versus a minute previously)
  + ... not huge savings, but it's a decent start.
+ This is a glimpse of how Bodo's Just-in-Time compiler technology helps data analytics.
+ The JIT compiler recognizes that only one column is read from the files in the remote S3 bucket
  + ... and sets up the compiled function with optimized data access, transmission, & computations.
+ This decorated function gives a slight improvement for this relatively small dataset;
  + greater gains are observable with more data.

---

### Summary

+ `bodo.jit`: Just-in-Time compiler
+ Implemented through Python decorator

---

+ We've had a brief look at how to apply Bodo in a simple case; we'll see more examples soon.
+ Later, we'll see how Bodo incorporates parallelism to scale analysis further still.

---