# The Dataframe data structure

The data scientist's primary tool is the dataframe.  This data structure lets the user quickly filter and select records and apply transformations, much like SQL statements, but typically in a more expressive manner.  Once the data are in the correct format, they can be used as the input for a statistical or machine learning model.

A dataframe is organized as rows of records, or observations, and columns of variables describing those records.  It was popularized by the R language (which inherited it from S); however, it is now ubiquitous in data science frameworks, such as Python's Pandas, Apache Spark (written in Scala), and Java's Tablesaw.

## Introduction

The R versus Pandas debate in data science is really an apples-and-oranges comparison.  R is a domain-specific language for statistics, analytics, and data visualization.  This makes R great for consulting, research, and basic analysis, especially within a careful academic context.

In contrast, Python's statistics packages are woefully inadequate and rarely mention details that are of great importance to statistical practitioners.  An example of this is the use of contrasts in linear models.  The different types (I-IV) of analysis of variance (ANOVA) models use different encodings for data.  Determining their estimators is not trivial.
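To make the point about contrasts concrete, here is a minimal sketch using only NumPy. The data and the helper functions `treatment_design` and `sum_design` are hypothetical and written for this illustration; they mimic the idea behind R's `contr.treatment` and `contr.sum` rather than reproduce any library's implementation. Both encodings give the same fitted values, but the coefficients mean different things.

```python
import numpy as np

# Hypothetical one-factor data: three groups, two observations each.
groups = np.array(["a", "b", "c", "a", "b", "c"])
y = np.array([1.0, 2.0, 3.0, 1.5, 2.5, 3.5])
levels = ["a", "b", "c"]

def treatment_design(values, levels):
    """Intercept plus 0/1 indicators for every level except the first (the reference)."""
    cols = [np.ones(len(values))]
    cols += [(values == lvl).astype(float) for lvl in levels[1:]]
    return np.column_stack(cols)

def sum_design(values, levels):
    """Intercept plus sum-to-zero columns; the last level is coded -1 in each column."""
    is_last = (values == levels[-1]).astype(float)
    cols = [np.ones(len(values))]
    cols += [(values == lvl).astype(float) - is_last for lvl in levels[:-1]]
    return np.column_stack(cols)

X_treat = treatment_design(groups, levels)
X_sum = sum_design(groups, levels)

# Ordinary least squares under each encoding.
beta_treat, *_ = np.linalg.lstsq(X_treat, y, rcond=None)
beta_sum, *_ = np.linalg.lstsq(X_sum, y, rcond=None)

print("treatment coefficients:", beta_treat.round(3))
print("sum coefficients:      ", beta_sum.round(3))
print("same fitted values:", np.allclose(X_treat @ beta_treat, X_sum @ beta_sum))
```

Under treatment coding the intercept is the mean of the reference group and the other coefficients are differences from it; under sum coding the intercept is the mean of the group means and the coefficients are deviations from that. This is why the choice of encoding matters when interpreting estimates.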

However, if you want tight integration with other applications and the strengths of a general-purpose programming language, and you want to 'just get stuff done', then Python with Pandas is a great solution.  Pandas is quite good at data manipulation.  Python has the very strong NumPy and scikit-learn modules, which are very good for matrix operations and predictive modeling, respectively.  And Python is a really good general scripting language with strong support for string and datetime types.

## Config

We will begin by installing both the Jupyter R kernel (`IRkernel`) and `rpy2` so that we can move data between R and Pandas and compare expressions and results.

In [1]:
import pandas as pd
import numpy as np

! pip install rpy2

%load_ext rpy2.ipython

In [3]:
trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:01.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:01.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:03.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:03.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:03.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],   # to index by time, call `.set_index("timestamp")` on the result
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
trades.head()

|   | timestamp | ticker | price | quantity |
|---|---|---|---|---|
| 0 | 2016-05-25 13:30:01.023 | MSFT | 51.95 | 75 |
| 1 | 2016-05-25 13:30:01.038 | MSFT | 51.95 | 155 |
| 2 | 2016-05-25 13:30:03.048 | GOOG | 720.77 | 100 |
| 3 | 2016-05-25 13:30:03.048 | GOOG | 720.92 | 100 |
| 4 | 2016-05-25 13:30:03.048 | AAPL | 98.0 | 100 |


In [None]:
%%R -i trades
head( trades )

Everything looks to be working, so let's move on.

## Selections

We will start by comparing against typical SQL queries.  The dataframe really shows its expressive nature through bracket indexing (`[]`).  R and Pandas are similar in concept but differ in the nuances.

* select columns: `SELECT column1, column2, ... FROM table_name;`
* select distinct: `SELECT DISTINCT column1, column2, ... FROM table_name;` 
* where (with AND, OR, NOT): `SELECT column1, column2, ... FROM table_name WHERE condition;`
* order by: `SELECT column1, column2, ... FROM table_name ORDER BY column1, column2, ... ASC|DESC;`
* insert into: `INSERT INTO table_name VALUES (value1, value2, value3, ...);`

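Many of these methods accept an `inplace` argument.  With the default, `inplace=False`, the method returns a new dataframe and leaves the original untouched; with `inplace=True`, it modifies the existing dataframe, so you don't need to create a new one at each step.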
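To see what the `inplace` argument does in practice, here is a minimal sketch using `sort_values`. The small `trades` frame is a copy of the one above without the timestamp column, rebuilt so the example runs on its own; the variable names `original_order` and `sorted_copy` are introduced only for this illustration.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    }
)

# Default (inplace=False): a new, sorted dataframe is returned; `trades` is unchanged.
sorted_copy = trades.sort_values(by="price")
original_order = trades["price"].tolist()
print("original:", original_order)
print("copy:    ", sorted_copy["price"].tolist())

# inplace=True: `trades` itself is reordered, so no new variable is needed.
trades.sort_values(by="price", inplace=True)
print("in place:", trades["price"].tolist())
```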

In [None]:
#select columns: `SELECT column1, column2, ... FROM table_name;`
trades[['ticker', 'price']]

In [None]:
#select distinct: `SELECT DISTINCT column1, column2, ... FROM table_name;`
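trades['ticker'].unique()                          # distinct values of a single column
trades[['ticker', 'price']].drop_duplicates()      # distinct rows across several columns
trades.duplicated(subset='ticker', keep='first')   # True where the ticker repeats an earlier row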

In [None]:
#where (with AND, OR, NOT): `SELECT column1, column2, ... FROM table_name WHERE condition;`
trades[ (trades['ticker']=='MSFT') & (trades['quantity']>75)]

In [None]:
trades[ (trades['ticker']=='MSFT') | (trades['quantity']<75)]

In [None]:
trades[ ~((trades['ticker']=='MSFT') | (trades['quantity']>75))]
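For readers coming from SQL, the same filters can also be written as strings with `DataFrame.query`, which reads closer to a `WHERE` clause. This is a minimal sketch on a copy of the `trades` frame (timestamps omitted so the block runs on its own); the variable names are introduced only for this illustration.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    }
)

# WHERE ticker = 'MSFT' AND quantity > 75, written with a boolean mask ...
mask_result = trades[(trades["ticker"] == "MSFT") & (trades["quantity"] > 75)]

# ... and the equivalent string expression.
query_result = trades.query("ticker == 'MSFT' and quantity > 75")
print(query_result)

# WHERE ticker = 'MSFT' OR quantity < 75
either_result = trades.query("ticker == 'MSFT' or quantity < 75")
print(either_result)
```

The boolean-mask form requires parentheses around each comparison because `&` and `|` bind more tightly than `==` and `>`; the string form avoids that pitfall.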

In [None]:
#order by: `SELECT column1, column2, ... FROM table_name ORDER BY column1, column2, ... ASC|DESC;`
trades.sort_values(by=['ticker'], ascending=False, inplace=False)
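An `ORDER BY` with several columns and mixed directions maps onto lists passed to `by` and `ascending`. The sketch below assumes a copy of the `trades` frame without timestamps, rebuilt so the block is self-contained; `ordered` is a name introduced only for this illustration.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    }
)

# ORDER BY ticker ASC, price DESC
ordered = trades.sort_values(by=["ticker", "price"], ascending=[True, False])
print(ordered)

# The original row labels travel with the rows; reset them if a fresh 0..n-1 index is wanted.
print(ordered.reset_index(drop=True))
```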

Pandas is not quite as expressive as R here, as the `.loc[]` indexer is needed to perform an insert or update.  The `.iloc[]` indexer selects rows by integer position, much as R does with `trades[2:3, ]`; note that Pandas positions are 0-based, whereas R's are 1-based.

In [None]:
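#insert into: `INSERT INTO table_name VALUES (value1, value2, value3, ...);`
trades.loc[len(trades)] = [pd.Timestamp("2016-05-25 13:30:04.000"), "TEST", 1.00, 10]   # append one row (illustrative values)
#update: `UPDATE table_name SET column1 = value1 WHERE condition;`
trades.loc[trades['quantity']>75, 'ticker'] = 'TEST'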

In [None]:
trades.iloc[1:3]     # rows at positions 1 and 2 (the end of the slice is exclusive)
trades.iloc[[1,3]]   # rows at positions 1 and 3
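The difference between position-based `.iloc[]` and label-based `.loc[]` is easiest to see once the rows have been reordered. This is a minimal sketch on a copy of the `trades` frame (timestamps omitted so it runs on its own); `by_price` is a name introduced only for this illustration.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    }
)

# Slices: .iloc excludes the end position, .loc includes the end label.
print(len(trades.iloc[1:3]))   # positions 1 and 2
print(len(trades.loc[1:3]))    # labels 1, 2 and 3

# After sorting, positions and labels no longer coincide.
by_price = trades.sort_values(by="price", ascending=False)
print(by_price.iloc[0]["ticker"])   # first row by position: the most expensive trade
print(by_price.loc[0, "ticker"])    # row whose label is 0: the first row of the original frame
```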