# An Introduction to the DataFrame Data Structure

Dataframes have been around for a while. I believe the form in use today originated with R. The R ecosystem offers many dataframe implementations, but Python has surprisingly few.

We'll introduce dataframes piece by piece, starting with their overall structure.

## A Naive Implementation

We'll build up a naive implementation of a dataframe from dictionaries with lists as values.

In [15]:
# Built from primitives
from collections import defaultdict
import random
from prettytable import PrettyTable

dataframe = defaultdict(list)

for column in ["A", "B", "C"]:
    dataframe[column] = [random.random() 
                        for _ in range(10)]

table = PrettyTable(["Col Name"] + list(dataframe.keys()))
for index in range(len(dataframe["A"])):
    row = [index]
    for column in ["A", "B", "C"]:
        row.append(dataframe[column][index])
    table.add_row(row)
print(table)

+----------+---------------------+---------------------+---------------------+
| Col Name |          A          |          B          |          C          |
+----------+---------------------+---------------------+---------------------+
|    0     |  0.9815802415151975 |  0.6639923657600588 |  0.4150559067922459 |
|    1     |  0.8982960101852917 |  0.5459500167830481 | 0.15611034055984496 |
|    2     |  0.9662552458539788 |  0.5450712225858546 |  0.7084079253562369 |
|    3     | 0.27573151579338684 |  0.8904388710113929 |  0.8855629116530425 |
|    4     |  0.8058782741796997 |  0.6598275465304131 | 0.20941199751687345 |
|    5     |  0.5866684676574484 |  0.9721774362398216 |  0.7173778073156678 |
|    6     |  0.6406917959958436 |  0.7739899479535122 |  0.6718462154970831 |
|    7     |  0.6740708083001122 |   0.83501265368149  |  0.8159641750417831 |
|    8     |  0.7471784497437283 | 0.35134490213762215 |  0.769034069303954  |
|    9     |  0.7038010414515948 |  0.92548490724185

As you can see from this naive implementation, we have our first two primitives:

* rows 
* columns

These rows and columns are analogous to those of an Excel spreadsheet or a database table.  The difference is that dataframes are in-memory data stores.
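
As a minimal sketch of those two primitives, assuming the same dictionary-of-lists layout as above: a column is a single dictionary lookup, while a row has to be assembled by reading the same position from every list. The helper names `get_column` and `get_row` are hypothetical, chosen only for illustration.

```python
# A tiny dict-of-lists "dataframe" with fixed values so the result is predictable.
dataframe = {
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [100, 200, 300],
}


def get_column(frame, name):
    """Return one column: a single dictionary lookup."""
    return frame[name]


def get_row(frame, index):
    """Return one row as a dict by reading position `index` from every column."""
    return {name: values[index] for name, values in frame.items()}


column_b = get_column(dataframe, "B")
row_1 = get_row(dataframe, 1)

print(column_b)
print(row_1)
```

Storing the data column by column makes column access cheap and row access comparatively expensive, which is one reason the rest of this introduction works mostly with whole columns.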

Now we'll introduce slicing, first for arrays and then for matrices.  Fundamentally, a dataframe is a matrix whose columns are named for semantic purposes.

## An Introduction to Slicing

Before we can understand advanced slicing, we'll need to review regular slicing.

In [43]:
import random
listing = [random.random() for _ in range(10)]

# Selecting the first element
print("First index", listing[0])

print()

# Selecting the last element
print("Last index", listing[-1])

print()

# Reversing the list
print("The reversed list", listing[::-1])

print()

# Selecting the first 3 elements
print("The first few elements", listing[0:3])

print()

# Selecting all the elements at even indexes
print("The elements at even offsets", listing[::2])

print()

# Selecting every third element
print("Every third element", listing[::3])

First index 0.7916158970528665

Last index 0.5691737251739554

The reversed list [0.5691737251739554, 0.7213676621917036, 0.8339543595872136, 0.45527716184948375, 0.622606105789058, 0.2980052187130885, 0.11522601924446241, 0.09250193886597169, 0.11830669453024512, 0.7916158970528665]

The first few elements [0.7916158970528665, 0.11830669453024512, 0.09250193886597169]

The elements at even offsets [0.7916158970528665, 0.09250193886597169, 0.2980052187130885, 0.45527716184948375, 0.7213676621917036]

Every third element [0.7916158970528665, 0.11522601924446241, 0.45527716184948375, 0.5691737251739554]
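
The same operations are easier to check on a list with known contents. The sketch below uses a hypothetical list, `letters`, and spells out the `start:stop:step` form behind each slice.

```python
letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

# A slice is written start:stop:step; any part left out takes a default.
first_three = letters[0:3]      # start at 0, stop before index 3
even_offsets = letters[::2]     # whole list, every second element
every_third = letters[::3]      # whole list, every third element
reversed_list = letters[::-1]   # a negative step walks the list backwards
last_two = letters[-2:]         # negative indexes count from the end

print(first_three)
print(even_offsets)
print(every_third)
print(reversed_list)
print(last_two)
```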


## An Introduction to Advanced Slicing

Now that we've reviewed basic slicing, let's move on to numpy slicing, which extends the same syntax to multiple dimensions.  Pandas dataframes build on the same slicing semantics.

In [49]:
import numpy as np

listing = np.random.rand(10)

# Selecting first element
print("First element", listing[0])

print()

# Reversing the list
print("Reversed list", listing[::-1])

print()

# Getting the first few elements
print("First few elements", listing[:3])

First element 0.6240969508813186

Reversed list [0.94434009 0.02207917 0.7003305  0.76394935 0.9425563  0.26801212
 0.85224238 0.41886283 0.91837224 0.62409695]

First few elements [0.62409695 0.91837224 0.41886283]


As you can see, numpy arrays support the same basic slicing operations as a built-in Python list.  They also handle multi-dimensional arrays well:

In [58]:
tensor = np.random.rand(10, 10)

# Selecting first element
print("First element", tensor[0, 0])

print()

# Selecting the first few elements
print("First few elements", tensor[0, 0:3])

print()

# Reversing the first column
print("Reversing the first column", tensor[::-1, 0])

print()

# Reversing the entire tensor (both rows and columns)
print("Reversing the tensor", tensor[::-1, ::-1])

First element 0.6601583587726582

First few elements [0.66015836 0.88787503 0.70087991]

Reversing the first column [0.00143466 0.3519615  0.27606834 0.66395007 0.15528332 0.52608354
 0.81092874 0.02186712 0.61935367 0.66015836]

Reversing the tensor [[0.85694557 0.50646455 0.17699702 0.30929738 0.60297545 0.93087336
  0.66829771 0.7273428  0.24125466 0.00143466]
 [0.9521811  0.43468107 0.94722499 0.2075009  0.97813676 0.05974539
  0.3879473  0.18905489 0.6598998  0.3519615 ]
 [0.85036124 0.86061632 0.46177964 0.72705882 0.09075407 0.13491748
  0.9627146  0.87786067 0.70111903 0.27606834]
 [0.08289707 0.19962095 0.58755316 0.51956677 0.85507605 0.50793696
  0.5693193  0.50711952 0.78148824 0.66395007]
 [0.45511725 0.61644199 0.87466013 0.62945482 0.39404126 0.98908835
  0.90497992 0.11440851 0.76904577 0.15528332]
 [0.07576779 0.84627586 0.14914789 0.6273588  0.6106119  0.64513274
  0.05258538 0.05157902 0.79901776 0.52608354]
 [0.09249271 0.26937426 0.15255123 0.44025906 0.05046224 0.710
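
Random values make it hard to see which axis each slice touches, so here is a small sketch on a hypothetical 3 by 4 array with known contents. The first position inside the brackets selects rows and the second selects columns.

```python
import numpy as np

# A 3x4 grid holding 0..11, so every value reveals its own position:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

first_row = grid[0, :]             # row 0, every column
first_column = grid[:, 0]          # every row, column 0
reversed_row = grid[0, ::-1]       # row 0, columns in reverse order
reversed_column = grid[::-1, 0]    # column 0, rows in reverse order
top_left_block = grid[0:2, 0:2]    # first two rows and first two columns

print(first_row)
print(first_column)
print(reversed_row)
print(reversed_column)
print(top_left_block)
```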

Now that we understand how multi-dimensional indexing works, let's move on to the main workhorse of dataframes: broadcasting.

## Broadcasting

Numpy arrays and pandas dataframes are _incredibly_ fast.  This is because much of their work is done in compiled C code, using a style of programming this tutorial calls broadcasting.  (Numpy's own documentation calls the loop-free style vectorization and uses "broadcasting" more narrowly, for the rules that let arrays of different shapes be combined.)

Put simply, the loop over the elements runs in C rather than in Python, so we get C-level iteration speeds.  In other words, when we use numpy arrays or pandas dataframes, we get close to _C-level performance in Python_.
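
To illustrate the difference, the sketch below computes the same sum of squares twice: once with a Python-level loop and once with a single numpy expression whose loop runs inside compiled code. The array size and the function names are arbitrary choices for this example, and actual timings vary by machine, so none are shown.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def sum_of_squares_loop(array):
    """Iterate element by element in the Python interpreter."""
    total = 0.0
    for value in array:
        total += value * value
    return total


def sum_of_squares_vectorized(array):
    """Let numpy do the multiplication and the summation in compiled code."""
    return float((array * array).sum())


loop_result = sum_of_squares_loop(values)
vectorized_result = sum_of_squares_vectorized(values)

# Both approaches agree up to floating-point rounding.
print(np.isclose(loop_result, vectorized_result))

# Time five runs of each version; the vectorized one is typically much faster.
print("loop:      ", timeit.timeit(lambda: sum_of_squares_loop(values), number=5))
print("vectorized:", timeit.timeit(lambda: sum_of_squares_vectorized(values), number=5))
```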

### For loops

Not all for loops can be expressed through broadcasting.  Because the syntax is modeled on array slicing, only simple loops can be expressed this way.  Specifically:

* loops that transform all the elements of a column or columns
* loops that filter down the data structure to a specific subset 

It's worth noting that filtering is expressed via indexing (usually with a boolean mask), while transformations are expressed via operators and method calls.  Both make use of broadcasting to do the computation.
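
The two loop shapes listed above can be sketched side by side with their array equivalents. The data and variable names here are made up for the example.

```python
import numpy as np

prices = np.array([4.0, 12.5, 7.0, 20.0, 1.5])

# Transform: apply the same operation to every element.
doubled_loop = [price * 2 for price in prices]
doubled_array = prices * 2

# Filter: keep only the elements that pass a test.
expensive_loop = [price for price in prices if price > 5]
mask = prices > 5                  # boolean array, one True/False per element
expensive_array = prices[mask]     # indexing with the mask keeps the True positions

print(doubled_array)
print(mask)
print(expensive_array)
```

In both cases the array version produces the same values as the explicit loop without a Python-level `for` statement.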

In [83]:
import numpy as np

tensor = np.random.rand(10, 10)
A = np.random.rand(10, 10)
B = np.random.rand(10, 10)

# Filtering down
print("All elements in the tensor less than 0.1", tensor[tensor < 0.1])

print()

# Transforming
print("Multiplying two tensors", A * B)

print()

# summing a row
print("Sum all the elements of a tensor row", tensor[0].sum())

print()

# summing a column
print("Sum all the elements of a tensor column", tensor[:, 0].sum())

All elements in the tensor less than 0.1 [0.06534567 0.04042085 0.04911574 0.05536978 0.04123728 0.01482627
 0.08344982 0.0756749  0.02506227]

Multiplying two tensors [[1.58275377e-01 6.63622410e-02 5.66779847e-01 2.45977986e-01
  1.78088406e-01 6.86077686e-02 1.95636305e-01 2.47245466e-01
  2.63428948e-01 1.92707349e-01]
 [9.55546504e-01 5.09681317e-01 8.00711758e-02 2.32874162e-01
  6.72384364e-01 1.65838925e-02 1.23901420e-01 5.76035660e-02
  6.00990124e-01 5.17757414e-03]
 [3.33456071e-02 6.71825761e-02 7.02695860e-01 3.86201178e-02
  7.31771625e-02 2.14030858e-02 1.69025874e-01 1.71363882e-01
  1.78744770e-02 5.66891594e-02]
 [8.47561523e-01 3.29271988e-01 5.43994409e-01 5.63684359e-01
  8.54528066e-02 8.25867481e-02 5.41194816e-02 9.09876805e-02
  4.03518101e-02 1.43864444e-01]
 [7.65278042e-02 9.69540051e-02 2.72737516e-01 9.59720153e-05
  1.79655520e-01 1.27945352e-01 3.91259060e-01 5.53385548e-02
  2.34934309e-01 6.82280821e-01]
 [1.49319611e-02 3.68881689e-01 4.36746628e-01 

Notice the last computation:

`tensor[:, 0].sum()`

This computation combines slicing (`tensor[:, 0]` selects the first column) with a reduction (`.sum()` adds up the selected values).  It also shows how we can select any section of an array and act on that selection in a single statement.
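
A small sketch on a hypothetical array with known values shows the two steps, selecting and then reducing, and how they compare with the `axis` argument of `sum`, which totals every row or every column at once.

```python
import numpy as np

# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

first_column = grid[:, 0]               # step 1: select column 0
first_column_sum = first_column.sum()   # step 2: reduce the selection

one_liner = grid[:, 0].sum()            # both steps in a single statement

column_sums = grid.sum(axis=0)   # collapse the rows: one total per column
row_sums = grid.sum(axis=1)      # collapse the columns: one total per row

print(first_column_sum, one_liner)
print(column_sums)
print(row_sums)
```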

There is a clear danger with syntactic sugar like this: it invites incredibly long and complex one-liners.  That encourages opaque code which is hard to read or reason about because it does too much on a single line.
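
As an illustration of that trade-off, the sketch below computes the same value twice, once as a dense one-liner and once with each step named. The data, the threshold, and the variable names are invented for the example.

```python
import numpy as np

scores = np.array([
    [0.2, 0.9, 0.4],
    [0.7, 0.1, 0.8],
    [0.5, 0.6, 0.3],
])

# Dense version: select, filter, transform and reduce in one statement.
dense = (scores[:, 1:][scores[:, 1:] > 0.5] * 100).round().sum()

# Same computation with each step named.
last_two_columns = scores[:, 1:]
above_threshold = last_two_columns[last_two_columns > 0.5]
as_percentages = (above_threshold * 100).round()
readable = as_percentages.sum()

print(dense, readable)
```

Both versions do the same work; the second simply gives a reader a name to hold on to at each stage.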

We'll see more of this complexity when we start working with pandas dataframes.
