# PyArrow - Getting Started

## Quick Demo

## Import Package

In [None]:
import pyarrow as pa

## Let's look at two basic structures: Arrays & Tables

### Create a simple array, starting from a Python list

In [56]:
a_list = [1,2,3,4,5]

# Convert to PyArrow Array
array_a = pa.array(a_list)

In [57]:
# Describe the array

array_a

<pyarrow.lib.Int64Array object at 0x117925240>
[
  1,
  2,
  3,
  4,
  5
]

### Same operation, but this time specify the type as int8

In [25]:
array_b = pa.array(a_list, type=pa.int8())

In [26]:
array_b

<pyarrow.lib.Int8Array object at 0x1178d14e0>
[
  1,
  2,
  3,
  4,
  5
]

### Why is this useful?  
#### Let's consider the size of these two objects

In [None]:
import sys

In [28]:
sys.getsizeof(array_a)

128

In [29]:
sys.getsizeof(array_b)

93

##### Takeaway: Characterise your data and specify an appropriate type when loading, in order to minimise memory use

### Common Mistakes

#### Try this: Specify int8, then int16, as the type for the following array

In [35]:
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int8())

ArrowInvalid: Value 1990 too large to fit in C integer type

In [38]:
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())

In [39]:
years = pa.array([1990, 2000, 1995, 2000, 1995])
type(years)

pyarrow.lib.Int64Array

#### Takeaway: A type that is too narrow for the data raises an error (int8 holds only -128 to 127), while the default int64 is safe here but uses more memory than int16

#### See Data Types and In-Memory Data Models (https://arrow.apache.org/docs/python/data.html#data) for more details.

### Create a Table

#### A table can be built from a list of lists, like this:

In [40]:
a_table = pa.table([[1,2,3,4],[10,20,30,40],[100,200,300,400]],  names=["units", "tens", "huns"])

In [41]:
a_table

pyarrow.Table
units: int64
tens: int64
huns: int64
----
units: [[1,2,3,4]]
tens: [[10,20,30,40]]
huns: [[100,200,300,400]]

In [42]:
type(a_table["tens"])

pyarrow.lib.ChunkedArray

#### Or, you can define the arrays beforehand and then load them into a table

In [None]:
units = pa.array([1,2,3,4])
tens = pa.array([10,20,30,40])
hundreds = pa.array([100,200,300,400])
b_table = pa.table([units,tens,hundreds],  names=["units", "tens", "huns"])

In [44]:
b_table

pyarrow.Table
units: int64
tens: int64
huns: int64
----
units: [[1,2,3,4]]
tens: [[10,20,30,40]]
huns: [[100,200,300,400]]

In [45]:
type(b_table["tens"])

pyarrow.lib.ChunkedArray

## Conversion from Pandas to PyArrow

Pandas DataFrames are the established in-memory format for many data science projects. The code below converts a Pandas DataFrame to a PyArrow table. Whether to use PyArrow or Pandas depends on factors such as memory footprint, support in downstream packages, and suitability for compute operations.

In [47]:
# Create a dummy data frame

import pandas as pd

pdf = pd.DataFrame({"A_Column":[1,2,4,5,6], "B_Column":[10,20,40,50,60], "C_Column":["Alpha", "Beta", "Gamma", "Delta", "Epsilon"]})

In [49]:
# Create a list of column values, one inner list per column

list_of_lists = []

for col in pdf.columns:
    list_of_lists.append(pdf[col].tolist())

### The above step illustrates the structure. There is a ready-made function to convert from Pandas

In [None]:
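# Manual route: one list per column, plus the column names
a_table = pa.table(list_of_lists, names=list(pdf.columns))
# Built-in route: convert the DataFrame directly
b_table = pa.Table.from_pandas(pdf)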

In [None]:
a_table

In [54]:
b_table

pyarrow.Table
A_Column: int64
B_Column: int64
C_Column: string
----
A_Column: [[1,2,4,5,6]]
B_Column: [[10,20,40,50,60]]
C_Column: [["Alpha","Beta","Gamma","Delta","Epsilon"]]

## To summarise:

1. We have seen how to import and use the package
2. We have seen how to create PyArrow arrays
3. We have seen how to create PyArrow table objects
4. We have an appreciation of how to plan the datatype of an array.
5. We have seen how to convert a Pandas dataframe to a PyArrow table.
