# The Dataframe data structure

The data scientist's primary tool is the dataframe.  This data structure enables the user to quickly filter and select records, as well as complete transformations, similar to SQL statements, but typically in a more expressive manner.  Once the data are in the correct format, they can be used as the input for a statistical or machine learning model.

The dataframe is described as rows of records, or observations, and columns of data variables describing those records.  It was popularized by the R language; however, it is now ubiquitous in data science frameworks, such as Python's Pandas, Scala's Spark, and Java's Tablesaw.

## Introduction

In the data science debate of R versus Pandas, the comparison is really one of apples and oranges.  R is a domain-specific language for the fields of statistics, analytics, and data visualization.  This makes R great for consulting, research, and basic analysis, especially within a careful academic context.

In contrast, Python's statistics packages are woefully inadequate and rarely mention details which are of great importance to statistical practitioners.  An example of this is the use of contrasts in linear models.  The different Types (I-IV) of Analysis of Variance models use different encodings for data.  Determining their estimators is not trivial.

However, if you want tight integration with other applications, the strengths of typical programming languages, and want to 'just get stuff done', then Python / Pandas is a great solution.  Pandas is quite good at data manipulation.  Python has the very strong NumPy and scikit-learn modules, which are very good for matrix operations and predictive modeling, respectively.  And Python is a really good general scripting language with strong support for strings and datetime types.

## Config

We will begin by installing both the Jupyter R kernel (`r-irkernel`) and `rpy2` so that we can move data between R and Pandas and compare expressions and results.

In [2]:
import pandas as pd
import numpy as np

! pip install rpy2

%load_ext rpy2.ipython

In [3]:
trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:01.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:01.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:03.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:03.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:03.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],   #to use the timestamp as the index, call `.set_index('timestamp')` on the result
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
trades.head()

                timestamp ticker   price  quantity
0 2016-05-25 13:30:01.023   MSFT   51.95        75
1 2016-05-25 13:30:01.038   MSFT   51.95       155
2 2016-05-25 13:30:03.048   GOOG  720.77       100
3 2016-05-25 13:30:03.048   GOOG  720.92       100
4 2016-05-25 13:30:03.048   AAPL   98.00       100


In [None]:
%%R -i trades
head( trades )

Everything looks to be working, so let's move on.

## Numpy Arrays

The two primary classes we are interested in are the DataFrame and the Series, the latter serving as the DataFrame's columns.  Both are built on top of numpy arrays and add attributes and methods that integrate them with typical data science work.  We usually import numpy with pandas (as we did above) so that we can flexibly change between the interfaces.

Spending some time with the numpy module is very important for understanding what is happening under the hood of pandas.  We will show just a few features here, because they extend to features of pandas.  It's important to keep the following in mind:

* numpy is built on C for efficiency of memory and processing speed
* numpy arrays must be of homogeneous data types
* because they are created with a fixed size, they must be recreated if the size changes
* arrays are founded on mathematical matrices and enable corresponding operations
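
The points above can be checked directly.  The following is a minimal sketch, with made-up values and illustrative variable names (`mixed`, `base`, `grown`, `m`), showing type homogeneity, the fixed size of an array, and a matrix operation.

```python
import numpy as np

# Homogeneous types: mixing an int and a float upcasts every element to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Fixed size: np.append builds a new, larger array and leaves the original untouched.
base = np.array([1, 2, 3])
grown = np.append(base, 4)
print(base.shape, grown.shape)

# Matrix operations: `@` is matrix multiplication, `.T` is the transpose.
m = np.array([[1, 2], [3, 4]])
product = m @ m
print(product)
print(m.T)
```

Because growing an array means allocating a new one, it is usually cheaper to collect values in a list and convert once at the end.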

The `ndarray` class is referred to as an “N-dimensional array”, so it can represent a vector (one dimension), a matrix (two dimensions), or an array of higher dimension.


After getting accustomed to the numpy functionality, you may find yourself reaching for it instead of a list.  This tendency should be tempered by two considerations.  First, numpy must be installed separately because it is not part of the Python standard library.  Second, it takes up a fair amount of space on disk.

Let's see some useful functionality:

In [3]:
np.arange(4)

array([0, 1, 2, 3])

In [4]:
np.arange(2, 9, 2)

array([2, 4, 6, 8])

In [5]:
np.linspace(0, 10, num=5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

Numpy describes matrices using axes.  Many operations in numpy and pandas are performed along axis 0 (down the rows) or axis 1 (across the columns).
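
As a small sketch of what the axis argument does, the following sums an illustrative 2x3 matrix along each axis.  The matrix `m` and the result names are made up for this example.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0 collapses the rows, leaving one total per column.
col_totals = m.sum(axis=0)
print(col_totals)

# axis=1 collapses the columns, leaving one total per row.
row_totals = m.sum(axis=1)
print(row_totals)
```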

In [6]:
#vector
arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
np.sort(arr)

array([1, 2, 3, 4, 5, 6, 7, 8])

In [7]:
arr.shape

(8,)

In [8]:
#matrix
x = np.array([[0, 3], [2, 2]])
np.argsort(x, axis=1)


array([[0, 1],
       [0, 1]])

In [9]:
x.shape

(2, 2)

Indexing and accessing items is fairly intuitive, with the value before each comma referring to an axis.  Notice with `arr[0]` that the earlier `np.sort(arr)` returned a sorted copy rather than sorting in place; therefore, the index acts upon the original unsorted array.

We can slice a sequence of items, instead of a single index, like this: `[start:end]`, or with a step: `[start:end:step]`.  The item at `end` is excluded.

In [12]:
arr[0]

2

In [13]:
x[1,1]

2

In [14]:
arr[0:3]

array([2, 1, 5])

In [15]:
arr[0:3:2]

array([2, 5])

In [19]:
x[0:1]    #the columns colon is implied
x[0:1,:]     

array([[0, 3]])

In [18]:
x[:,0:1]    #must use colon for rows, before comma

array([[0],
       [2]])

We can also reshape arrays and matrices.  This example flattens the matrix into a vector.

In [20]:
x.reshape(-1)

array([0, 3, 2, 2])

Instead of getting the values at a given index, we may want the inverse: the indices at which a given value occurs.

In [21]:
arr = np.array([1, 2, 3, 4, 5, 4, 4])
np.where(arr == 4)

(array([3, 5, 6]),)

Filters on an array are performed by indexing with a boolean array of the same shape.

In [22]:
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
arr[x]

array([41, 43])

A comparison such as `arr > 3` creates a boolean 'mask' of the values that fit the selection criteria, and the mask can then be used as a filter.

In [23]:
arr = np.array([1, 2, 3, 4, 5, 4, 4])
mask = arr > 3
arr[mask]

array([4, 5, 4, 4])
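
Masks can also be combined, and `np.where()` accepts a mask directly.  The sketch below reuses the same values; the names `selected`, `positions`, and `capped` are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mask = (arr > 3) & (arr < 5)
selected = arr[mask]
print(selected)

# With one argument, np.where returns the positions where the mask is True.
positions = np.where(mask)[0]
print(positions)

# With three arguments, np.where picks from two alternatives element by element.
capped = np.where(arr > 3, 3, arr)
print(capped)
```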

Python’s floating-point numbers are usually 64-bit floating-point numbers.  Numpy refines the description of data types from a simple `float` to a variety of different forms that affect memory allocation.  See the reference for [details](https://numpy.org/doc/stable/user/basics.types.html).
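
To make the effect on memory concrete, the following sketch stores the same made-up values as 64-bit and as 32-bit floats and compares their sizes.

```python
import numpy as np

values = [1.5, 2.5, 3.5, 4.5]

a64 = np.array(values)                    # floats default to float64
a32 = np.array(values, dtype=np.float32)  # request a smaller type explicitly

# itemsize is the number of bytes per element; nbytes is the total for the array.
print(a64.dtype, a64.itemsize, a64.nbytes)
print(a32.dtype, a32.itemsize, a32.nbytes)
```

Halving the element size halves the memory used, at the cost of reduced precision.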

Python and Pandas use these data types:

* string - used to represent text data; the text is enclosed in quotation marks, e.g. "ABCD"
* integer - used to represent whole numbers, e.g. -1, -2, -3
* float - used to represent real numbers. e.g. 1.2, 42.42
* boolean - used to represent True or False.
* complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j

The following lists the kinds of data types in NumPy and the characters used to represent them:

* i - integer
* b - boolean
* u - unsigned integer
* f - float
* c - complex float
* m - timedelta
* M - datetime
* O - object
* S - byte string
* U - unicode string
* V - fixed chunk of memory for other type ( void )
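
These characters are exposed through the `dtype.kind` attribute.  As a quick sketch, the sample arrays below (all made up for illustration) are inspected for their kind character.

```python
import numpy as np

# Each sample array is hypothetical; only the resulting dtype.kind matters here.
samples = {
    "integer": np.array([1, 2, 3]),
    "unsigned": np.array([1, 2, 3], dtype=np.uint8),
    "float": np.array([1.2, 42.42]),
    "complex": np.array([1.0 + 2.0j]),
    "boolean": np.array([True, False]),
    "unicode": np.array(["ABCD"]),
    "bytes": np.array([b"ABCD"]),
    "datetime": np.array(["2016-05-25"], dtype="datetime64[D]"),
    "timedelta": np.array([1], dtype="timedelta64[D]"),
    "object": np.array([1, "a", None], dtype=object),
}

# Map each sample name to its one-character kind code.
kinds = {name: sample.dtype.kind for name, sample in samples.items()}
for name, kind in kinds.items():
    print(f"{name:10s} {kind}")
```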

In [24]:
x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
np.argsort(x, order=('x','y'))

array([1, 0])

Here, `<` means little-endian byte order, `i` means signed integer, and `4` means a size of 4 bytes.

In [25]:
np.argsort(x, order=('y','x'))

array([0, 1])

TODO: Vectorization

## Pandas Collections

The collections available through pandas are mostly convenience functions that make use of the optimized numpy functionality.  This greatly improves the speed of your workflow because the operations you repeat most often are already provided.  When reviewing the pandas API, take note of the precedent that is set by numpy and which pandas tries to follow.
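
The relationship with numpy can be seen directly.  This is a minimal sketch using a made-up Series with a string index; `underlying` and `roots` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# The values of a Series can be retrieved as a plain numpy array.
underlying = s.to_numpy()
print(type(underlying), underlying.dtype)

# numpy functions accept a Series and return a Series with the index preserved.
roots = np.sqrt(s)
print(roots)
```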


### Creation

There are many ways to create pandas Series and DataFrames.  We will show how to load data dynamically, from Python objects, and through I/O operations.

In [27]:
#series from list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [34]:
#df from series
df = s.to_frame(name='mycol')
df

   mycol
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0


In [None]:
#list of dicts
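
The list-of-dicts case might look like the sketch below.  The records are a made-up subset of the trades data and `quotes` is an illustrative name; the last record deliberately omits a key to show how missing values are handled.

```python
import pandas as pd

# Each dict is one row; the keys become the column names.
records = [
    {"ticker": "MSFT", "price": 51.95, "quantity": 75},
    {"ticker": "GOOG", "price": 720.77, "quantity": 100},
    {"ticker": "AAPL", "price": 98.00},  # a missing key becomes NaN
]
quotes = pd.DataFrame(records)
print(quotes)
```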

In [None]:
#list of lists
trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:01.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:01.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:03.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:03.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:03.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],   #to use the timestamp as the index, call `.set_index('timestamp')` on the result
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
trades.head()

### Selections

We will start by comparing against typical SQL queries.  The dataframe really shows its expressive nature through brackets `[]`.  R and Pandas are similar in concept, but differ in the nuances.

* select columns: `SELECT column1, column2, ... FROM table_name;`
* select distinct: `SELECT DISTINCT column1, column2, ... FROM table_name;` 
* where (with AND, OR, NOT): `SELECT column1, column2, ... FROM table_name WHERE condition;`
* order by: `SELECT column1, column2, ... FROM table_name ORDER BY column1, column2, ... ASC|DESC;`
* insert into: `INSERT INTO table_name VALUES (value1, value2, value3, ...);`

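Many of these methods accept the argument `inplace`, which defaults to `False` so that a new dataframe is returned.  Passing `inplace=True` modifies the existing dataframe instead, so you don't need to create a new dataframe at each step.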
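
The difference between the two settings can be sketched with a small made-up frame.  The names `demo`, `sorted_copy`, and `result` are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame(
    {"ticker": ["MSFT", "GOOG", "AAPL"], "price": [51.95, 720.77, 98.00]}
)

# Default (inplace=False): a sorted copy is returned and demo keeps its order.
sorted_copy = demo.sort_values(by="price")
order_before = demo["ticker"].tolist()
print(order_before)

# inplace=True: demo itself is reordered and the call returns None.
result = demo.sort_values(by="price", inplace=True)
order_after = demo["ticker"].tolist()
print(result, order_after)
```

Because the in-place call returns `None`, it cannot be chained with further method calls.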

In [None]:
#select columns: `SELECT column1, column2, ... FROM table_name;`
trades[['ticker', 'price']]

In [None]:
#select distinct: `SELECT DISTINCT column1, column2, ... FROM table_name;`
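trades['ticker'].unique()    #distinct values of a single column
trades[['ticker', 'price']].drop_duplicates()    #distinct combinations of several columns
trades.duplicated(subset='ticker', keep='first')    #flag rows whose ticker repeats an earlier row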

In [None]:
#where (with AND, OR, NOT): `SELECT column1, column2, ... FROM table_name WHERE condition;`
trades[ (trades['ticker']=='MSFT') & (trades['quantity']>75)]

In [None]:
trades[ (trades['ticker']=='MSFT') | (trades['quantity']<75)]

In [None]:
trades[ ~((trades['ticker']=='MSFT') | (trades['quantity']>75))]

In [None]:
#order by: `SELECT column1, column2, ... FROM table_name ORDER BY column1, column2, ... ASC|DESC;`
trades.sort_values(by=['ticker'], ascending=False, inplace=False)

Pandas is not quite as expressive as R here, as the `.loc[]` indexer is needed to assign values to selected rows.  The `.iloc[]` indexer allows rows to be selected by integer position, much as R does with numeric row indices.

In [None]:
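#insert into: `INSERT INTO table_name VALUES (value1, value2, value3, ...);`
#note: assigning through a boolean selection overwrites existing rows, which is closer to an SQL UPDATE
trades.loc[trades['quantity']>75, 'ticker'] = 'TEST'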
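
Adding new rows, which is the closer analogue of `INSERT INTO`, could be sketched as follows.  The frame `trades_demo` and the names `more` and `combined` are illustrative, and the sketch assumes the default integer index.

```python
import pandas as pd

trades_demo = pd.DataFrame(
    [["MSFT", 51.95, 75], ["GOOG", 720.77, 100]],
    columns=["ticker", "price", "quantity"],
)

# Insert one row by assigning to the next unused integer label.
trades_demo.loc[len(trades_demo)] = ["AAPL", 98.00, 100]

# Insert several rows at once by concatenating another frame.
more = pd.DataFrame([["MSFT", 51.95, 155]], columns=trades_demo.columns)
combined = pd.concat([trades_demo, more], ignore_index=True)
print(combined)
```

For many rows, a single `pd.concat` is generally preferable to repeated `.loc[]` assignments, since each enlargement may copy the data.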

In [None]:
trades.iloc[1:3]
trades.iloc[[1,3]]