# Tutorial - Remote Arrow

This notebook serves as an interactive tutorial (and documentation) for **Remote Arrow**.
More broadly, it introduces you to **Apache Arrow**, specifically PyArrow, and **Apache Arrow Flight**.

**Apache Arrow** is a language-independent columnar memory format for flat and hierarchical data. <br>

**Apache Arrow Flight** is a framework for high-speed transmission of Apache Arrow Tables over a network.  <br>
It achieves this by eliminating the overhead of (de)serializing the data and by allowing parallel downloads and uploads of data chunks.<br>

**Remote Arrow** is a custom framework built on top of Apache Arrow Flight. It aims to solve the problem of redundancy <br>
and performance bottlenecks when handling large, shared datasets by outsourcing them to a centralized remote server. <br>
In other words, it tries to reduce the cost of transmitting, querying, computing on, and storing shared data for big data / ML applications. <br>


You can learn more by reading through the README and sources listed below:
#### Sources:

1. [Apache Arrow](https://arrow.apache.org/ "Apache Arrow")
2. [Introducing Apache Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/ "Introducing Apache Arrow Flight") 
3. [Apache Arrow Flight API](https://arrow.apache.org/docs/format/Flight.html "Apache Arrow Flight API") 
4. [Benchmarking Apache Arrow Flight (Tanveer Ahmad, Zaid Al Ars and H. Peter Hofstee)](https://arxiv.org/pdf/2204.03032.pdf) 


## Arrow Table
Let's take a look at **the Arrow Table** format by loading a CSV file:

In [None]:
import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('hawaii_covid.csv')
print(table.schema)

Arrow Tables support a variety of common functions for handling tabular data, such as **join**, **sort_by**, **group_by**, **drop**, **select** or **take**:

In [None]:
table.select([5]) # Get column at index 5

In [None]:
table.num_rows

If you're familiar with **pandas** and/or **SQL**, it may come as a relief that **Arrow Table** provides similar methods and APIs. <br>
You can also convert Arrow Tables into pandas DataFrames at any time.

In [None]:
table.take([0, 1, 2]).to_pandas() # Get rows 0, 1, and 2 and convert to a pandas DataFrame

So far, the data and queries have only been handled locally. <br> 
Let's see how we can make them accessible to other clients.

## Starting the server
Remote Arrow uses a centralized server to handle datasets and exposes a set of commands for uploads, downloads, and queries to all clients in the network. <br>
The API is built on top of gRPC and is language-agnostic. That is, you can develop clients in any language <br>
or through any application, as long as they follow the **Apache Arrow Flight API** [3]. <br>

In this tutorial, we are going to use Python to demonstrate the client and server interactions.
Let's start the server.

In [None]:
import subprocess
import time

SERVER_IP = "localhost"
SERVER_PORT = "5005"

server_process = subprocess.Popen(["python3","server.py","--host", SERVER_IP,"--port", SERVER_PORT])
pid = server_process.pid
time.sleep(2) # Wait for server to boot before running other cells

The server is running in the background (at localhost:5005) as long as the notebook is active. <br>
Of course, you can also run it in a separate terminal if you want to read the console output or change the host and port:

`python3 server.py --host <HOST> --port <PORT>`
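
A fixed `time.sleep(2)` can be too short on a slow machine. As an alternative, a small helper can poll the port until the server accepts connections. The function `wait_for_port` below is a hypothetical helper, not part of Remote Arrow, and uses only the standard library.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, int(port)), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port(SERVER_IP, SERVER_PORT)` after `subprocess.Popen(...)` could then stand in for the fixed sleep.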

For now, let's connect a client to the server and upload a local file:

In [None]:
from RemoteArrow import RemoteDataset
client_A = RemoteDataset("hawaii_covid.csv", hostname=SERVER_IP+":"+SERVER_PORT)

Here, we have instantiated a **RemoteDataset** object with the path to the local CSV file. <br>
Under the hood, our object (**client_A**) converted that file into an **Arrow Table** and uploaded it to the server.

The dataset now resides on the server and is available to all clients as a **Flight** with **ID 0**. <br>

In [None]:
client_A.list_flights()

You will also find that the server has stored that file in the */datasets* subdirectory in **.parquet** format. <br>

Instantiating **client_A** as a **RemoteDataset** also overrides the **Arrow Table methods** with corresponding RPC calls to our server. <br>
In other words, calling methods like **take** or **select** as above does not compute the query locally; instead, it instructs the server to run the query and return the result.

In [None]:
query_result = client_A.take([0,1,2]) # Will be executed on the server
query_result.to_pandas()

In [None]:
# Query results are made available as new Flights
client_A.list_flights()

Now, let's pretend to be another client (**client_B**) and connect to one of the flights.

In [None]:
client_B = RemoteDataset()

In [None]:
client_B.inspect(0)
client_B.connect(0)

In [None]:
client_B.take([0,1,2,3,4,5]).to_pandas()

Note that **client_B** can now query the original dataset (**hawaii_covid**) without downloading it. <br>
If we do want to download it, we can use **fetch()**.

In [None]:
ds = client_B.fetch() # Download flight

# Convert it to pandas dataframe and continue to process it locally
df = ds.to_pandas()
print("Original num of rows: ", len(df))

dropped = df.dropna()
print("Num of rows without nulls: ", len(dropped))

dropped["case_month"]

As you can see, there is a lot of flexibility involved in processing local and shared datasets.<br>

## Actions

**Apache Arrow Flight** provides another interface to interact with Flights, so-called **Actions**. <br>
Actions are application-specific. Our server provides Actions for **saving, deleting, clearing flights** and **shutting down the server**. <br>

Any client can send actions, just as any client can access the flights as shown above. Here, however, we use a new client (**client_C**).

In [None]:
client_C = RemoteDataset()
client_C.list_actions()

In [None]:
client_C.action(type="save", body="1 my_new_dataset") # Request server to save Flight 1 as "my_new_dataset.parquet"

In [None]:
client_C.action("clear") # Clear all flights (not affecting .parquet files)

In [None]:
client_C.list_flights()

Let's see what happens if our server crashes or reboots gracefully.

In [None]:
client_C.action("shutdown")
time.sleep(5) # Wait for server to properly shutdown before running other cells

In [None]:
# Restart the server
server_process = subprocess.Popen(["python3","server.py","--host", SERVER_IP,"--port", SERVER_PORT])
pid = server_process.pid

time.sleep(2) # Wait for server to boot
client_D = RemoteDataset()

You should see that the server recovered the saved datasets. <br>
As a final step in this tutorial, we can delete them via the "delete" action. The corresponding .parquet files will also be removed from the server.

In [None]:
client_D.list_actions()
client_D.action("delete", "0")
client_D.action("delete", "1")

client_D.list_flights()

Note that the IDs did not disappear and were not rearranged, in order to maintain consistency throughout the runtime of the server.

## Conclusion
In this tutorial, we have examined how **Remote Arrow** can help to offload the burden of computation, storage, and shared access of data to a single server. <br>
Though still in its prototype phase, it can offer considerable flexibility for managing shared tabular data.<br>
This benefit becomes more evident when datasets are used in a network of heterogeneous applications written in different languages, <br>
and even more so when the shared data is extremely large and/or the clients have limited computational resources.

Another aspect we have not yet discussed in depth is the boost in transmission speed benchmarked by *Ahmad et al.* [4]. <br>
Their publication reports that *"Flight is able to achieve up to **6000 MB/s and 4800 MB/s throughput**"* <br>
and that companies like **Dremio** that use Arrow Flight **perform 20x and 30x better** compared to traditional connection methods. <br>
That makes Arrow Flight an even better choice for applications that require long-haul communication over the internet. <br>

Regardless of the use case, hopefully you have learned that **Arrow**, especially in combination with **Arrow Flight**, provides a highly performant framework for working with big data.
