This notebook will cover the assumed knowledge of pandas.

Outline:

- Philosophy
- NumPy Foundation
- Basic IO
- Data Structures

## Philosophy

pandas aims to provide an API that enables you to expressively and concisely work with your data.
Achieving this goal is complicated by the nature of the world: datasets are messy or incomplete.

## NumPy Foundation

pandas builds atop NumPy, both historically and in the actual library.
It's helpful to have a good understanding of some NumPyisms. [Speak the vernacular](https://www.youtube.com/watch?v=u2yvNw49AX4).

### ndarray

The core of NumPy is the `ndarray`, the N-dimensional array. These are homogeneously typed, fixed-length data containers.
NumPy also provides many convenient and fast methods implemented on the `ndarray`.

In [16]:
import numpy as np
import pandas as pd

x = np.array([1, 2, 3])
x

array([1, 2, 3])

In [5]:
x.dtype

dtype('int64')

In [6]:
y = np.array([[True, False], [False, True]])
y

array([[ True, False],
       [False,  True]], dtype=bool)

In [7]:
y.shape

(2, 2)

In [9]:
a = [1, 2, 3]
a

[1, 2, 3]
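To make the contrast between the list `a` and the array `x` concrete, here's a small sketch. The names `doubled_list` and `doubled_array` are mine, purely for illustration.

```python
import numpy as np

a = [1, 2, 3]            # plain Python list
x = np.array([1, 2, 3])  # ndarray with a single dtype

# `*` repeats a list, but multiplies an ndarray elementwise
doubled_list = a * 2
doubled_array = x * 2

print(doubled_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_array)  # [2 4 6]

# reductions are implemented as methods on the array
print(x.sum(), x.mean())  # 6 2.0
```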

### dtypes

Unlike Python lists, NumPy arrays care about the type of data stored within.
The full list of NumPy dtypes can be found in the [NumPy documentation](http://docs.scipy.org/doc/numpy/user/basics.types.html).
We sacrifice the convenience of mixing bools and ints and floats within an array for much better performance.
However, an unexpected `dtype` change will probably bite you at some point in the future.

The two biggest things to remember are

- Missing values (NaN) cast integer or boolean arrays to floats
- the object dtype is the fallback

You'll want to avoid object dtypes. They're typically slow.
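A quick sketch of both rules, using small made-up Series (the names `ints`, `with_missing`, and `mixed` are just for this example):

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
print(ints.dtype)  # an integer dtype

# introducing a missing value upcasts the integers to floats
with_missing = pd.Series([1, 2, np.nan])
print(with_missing.dtype)  # float64

# mixing unrelated types falls back to object
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)  # object
```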

In [14]:
s = pd.Series(np.random.randn(10000))
%timeit s.mean()

The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 29.2 µs per loop


In [15]:
s = pd.Series(np.random.randn(10000)).astype(object)
%timeit s.mean()

1000 loops, best of 3: 603 µs per loop


### Broadcasting

It's super cool and super useful. The one-line explanation is that when doing elementwise operations, things expand to the "correct" shape.

In [22]:
# add a scalar to a 1-d array
x = np.arange(5)
print('x: ', x)
print('x+1', x + 1, end='\n\n')

y = np.random.uniform(size=(2, 5))
print('y:  ', y,  sep='\n')
print('y+1:', y + 1, sep='\n')

x:  [0 1 2 3 4]
x+1 [1 2 3 4 5]

y:  
[[ 0.96173856  0.91094749  0.07895303  0.53317977  0.10373788]
 [ 0.70308853  0.34366506  0.19683112  0.38510709  0.81542228]]
y+1:
[[ 1.96173856  1.91094749  1.07895303  1.53317977  1.10373788]
 [ 1.70308853  1.34366506  1.19683112  1.38510709  1.81542228]]


Since `x` is shaped `(5,)` and `y` is shaped `(2, 5)`, broadcasting lets us combine them elementwise.

In [23]:
x * y

array([[ 0.        ,  0.91094749,  0.15790607,  1.59953931,  0.41495154],
       [ 0.        ,  0.34366506,  0.39366224,  1.15532127,  3.26168913]])
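Here's a minimal sketch of when shapes are compatible. NumPy compares shapes starting from the trailing dimension, and two dimensions are compatible when they're equal or one of them is 1. The arrays `z` and `column` are made up for the example.

```python
import numpy as np

x = np.arange(5)     # shape (5,)
y = np.ones((2, 5))  # shape (2, 5)

# (5,) is treated like (1, 5) and stretched along the first axis
product = x * y
print(product.shape)  # (2, 5)

# trailing dimensions 5 and 3 disagree, so this cannot be broadcast
z = np.ones((2, 3))
try:
    x * z
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print('cannot broadcast:', err)

# adding an axis makes the intent explicit: (5, 1) with (1, 3) gives (5, 3)
column = x[:, np.newaxis]
outer = column * np.ones((1, 3))
print(outer.shape)  # (5, 3)
```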

We'll breeze through the basics here, and get onto some interesting applications in a bit. I want to provide the *barest* of intuition so things stick down the road.

## Why pandas

NumPy is great. But it lacks a few things that are key to doing statistical analysis. By building on top of NumPy, pandas provides

- labeled arrays
- heterogeneous data types within a table
- better missing data handling
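As a rough preview of those three points, here's a sketch with made-up data (`heights` and `people` are hypothetical names, not part of the examples that follow):

```python
import numpy as np
import pandas as pd

# labeled array: values are addressed by label rather than position
heights = pd.Series([1.5, 2.0, np.nan], index=['ann', 'bob', 'cat'])
print(heights['bob'])  # 2.0

# heterogeneous table: each column keeps its own dtype
people = pd.DataFrame({'height': heights, 'age': [31, 45, 27]})
print(people.dtypes)

# missing data: pandas reductions skip NaN by default, NumPy's propagate it
pandas_mean = heights.mean()
numpy_mean = np.mean(heights.to_numpy())
print(pandas_mean)  # 1.75
print(numpy_mean)   # nan
```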

## Data Structures

This is the typical starting point for any intro to pandas.
We'll follow suit.

### The DataFrame

Here we have the workhorse data structure for pandas.
It's an in-memory table holding your data, and it provides a few conveniences over lists of lists or NumPy arrays.

In [25]:
import numpy as np
import pandas as pd

In [26]:
# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df

   A      B         C
a  1   True  0.496714
b  2   True -0.138264
c  3  False  0.647689


### Selecting

Our first improvement over NumPy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular Python `__getitem__` machinery. Pass in a single column label `'A'` or a list of labels `['A', 'C']` to select subsets of the original `DataFrame`.

In [27]:
# select A, C from table

In [32]:
cols = ['A', 'C']
df[cols]

   A         C
a  1  0.496714
b  2 -0.138264
c  3  0.647689


In [33]:
# Single column
df['A']

a    1
b    2
c    3
Name: A, dtype: int64

For row-wise selection, use the special `.loc` accessor.

In [35]:
df.loc[['a', 'b']]

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


When your index labels are ordered, you can use *ranges* to select rows or columns.

In [36]:
df.loc['a':'b']

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


Notice that the slice is *inclusive* on both sides, unlike your typical slicing of a list. Sometimes, you'd rather slice by *position* instead of label. `.iloc` has you covered:

In [38]:
df.iloc[0:2]

   A     B         C
a  1  True  0.496714
b  2  True -0.138264
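The endpoint difference is easy to trip over, so here's a small sketch on a cut-down frame (same index labels as above, with only the `A` column to keep it short):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

by_label = df.loc['a':'b']  # label slice: both endpoints are included
by_position = df.iloc[0:1]  # position slice: the stop is excluded, like a list

print(list(by_label.index))     # ['a', 'b']
print(list(by_position.index))  # ['a']
```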


As I mentioned, you can slice both rows and columns. Use `.loc` for label or `.iloc` for position indexing.

In [42]:
df.loc['a', 'B']

True

Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back a `Series` (see below). Select a single row and a single column, and you get a scalar.
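A sketch of that dimension reduction, checking the type that comes back from each kind of selection (the frame is a trimmed copy of `df`, and the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False]},
                  index=['a', 'b', 'c'])

two_d = df[['A']]         # a list of labels keeps both dimensions
one_col = df['A']         # a single label drops one dimension
one_row = df.loc['a']     # a single row also drops one dimension
cell = df.loc['a', 'A']   # a single row and column gives a scalar

print(type(two_d).__name__)    # DataFrame
print(type(one_col).__name__)  # Series
print(type(one_row).__name__)  # Series
print(cell)                    # 1
```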

You can get pretty fancy:

In [43]:
df.loc['a':'b', ['A', 'C']]

   A         C
a  1  0.496714
b  2 -0.138264


#### Summary


- Use `.loc[row_labels, column_labels]` for label-based indexing
- Use `.iloc[row_positions, column_positions]` for positional indexing

I've left out boolean and hierarchical indexing, which we'll see later.

### Series

You've already seen some `Series` up above. It's the 1-dimensional analog of the DataFrame. Each column in a `DataFrame` is in some sense a `Series`. You can select a `Series` from a DataFrame in a few ways:

In [48]:
# __getitem__ like before
df['A']

a    1
b    2
c    3
Name: A, dtype: int64

In [49]:
# .loc, like before
df.loc[:, 'A']

a    1
b    2
c    3
Name: A, dtype: int64

In [50]:
# using `.` attribute lookup
df.A

a    1
b    2
c    3
Name: A, dtype: int64

You'll have to be careful with the last one. It won't work if your column name isn't a valid Python identifier (say it has a space) or if it conflicts with one of the (many) methods on `DataFrame`. The `.` accessor is extremely convenient for interactive use though.

You should never *assign* a column with `.`, e.g. don't do

```python
# bad
df.A = [1, 2, 3]
```

It's unclear whether you're attaching the list `[1, 2, 3]` as an attribute of `df`, or whether you want it as a column. It's better to just say

```python
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
```
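Both pitfalls are easy to demonstrate. This is a sketch on a throwaway frame with a column deliberately named `count` to collide with the `DataFrame.count` method. The column name `D` is made up too.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'count': [10, 20, 30]})

# `count` is also a DataFrame method, so attribute access finds the method
attr_is_method = callable(df.count)
print(attr_is_method)        # True
print(df['count'].tolist())  # [10, 20, 30]

# assigning a brand-new attribute does not create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about exactly this
    df.D = [7, 8, 9]
created_by_attribute = 'D' in df.columns
print(created_by_attribute)  # False

# item assignment does what you meant
df['D'] = [7, 8, 9]
created_by_item = 'D' in df.columns
print(created_by_item)  # True
```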

`Series` share many of the same methods as DataFrames.

In [53]:
frame_methods = set(filter(lambda x: not x.startswith('_'), dir(df)))
series_methods = set(filter(lambda x: not x.startswith('_'), dir(df['A'])))
len(frame_methods & series_methods) / len(frame_methods | series_methods)

0.6937984496124031

About 70% overlap.

You might wonder what benefits a separate `Series` class has over just using a `DataFrame` with a single column. As we uncover more of pandas, I'll be sure to point them out. But as a preview, `Series` are mappable:

In [54]:
print('Series:\n', df.A.map({1: 'one', 2: 'two'}), end='\n\n', sep='')

try:
    print(df[['A']].map({1: 'one', 2: 'two'}))
except (AttributeError, TypeError):  # which one is raised depends on the pandas version
    print('DataFrames are not mappable!')

Series:
a    one
b    two
c    NaN
Name: A, dtype: object

DataFrames are not mappable!


`Series` can be used as boolean indexers (more later):

In [75]:
# return rows where 'B' is True
print("Series:", df[df.B], sep='\n', end='\n\n')

print("DataFrame:", df[df[['B']]], sep='\n')

Series:
   A     B         C
a  1  True -0.941808
b  2  True -0.921615

DataFrame:
    A   B   C
a NaN   1 NaN
b NaN   1 NaN
c NaN NaN NaN


The DataFrame result probably isn't what you expected to get. I'll explain why in `reindexing`.

`Series` can be used in `groupby`.

In [55]:
print("Series", df.groupby(df['B']).A.count(), sep='\n', end='\n')

try:
    df.groupby(df[['B']]).A.count()
except ValueError:
    print("Can't group by DataFrame")

Series
B
False    1
True     2
Name: A, dtype: int64
Can't group by DataFrame


And other methods like `value_counts` or the `.str` methods either haven't been implemented on `DataFrame` or don't make as much sense.
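For a taste of those two on a `Series`, here's a tiny sketch with a made-up `fruit` Series:

```python
import pandas as pd

fruit = pd.Series(['apple', 'banana', 'apple'])

# value_counts tallies how often each distinct value appears
counts = fruit.value_counts()
print(counts['apple'])  # 2

# the .str accessor applies string methods elementwise
shouted = fruit.str.upper()
print(shouted.tolist())  # ['APPLE', 'BANANA', 'APPLE']
```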

### Index

`Index`es are something of a peculiarity to pandas.
First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries.
In pandas, `Index`es are about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two `DataFrame`s or `Series`.

R does have row labels, but they're nowhere near as powerful (or complicated) as they are in pandas. You can access the index of a `DataFrame` or `Series` with the `.index` attribute.

In [52]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [54]:
df

   A      B         C
a  1   True  0.496714
b  2   True -0.138264
c  3  False  0.647689


In [55]:
df2 = pd.DataFrame({'A': [1, 4, 9], 'B': [True, False, False], 'C': [.1, .2, .3]},
                   index=['a', 'b', 'd'])
df2

   A      B    C
a  1   True  0.1
b  4  False  0.2
d  9  False  0.3


In [56]:
df + df2

     A    B         C
a  2.0  2.0  0.596714
b  6.0  1.0  0.061736
c  NaN  NaN       NaN
d  NaN  NaN       NaN
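The same alignment happens with `Series`, which makes it a bit easier to see. In this sketch `s1` and `s2` are made-up Series that share the labels `'a'` and `'b'` only. `fill_value` is an optional argument of the `add` method, shown here as one way to avoid the NaNs.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'd'])

# labels are matched first; a label missing from either side gives NaN
total = s1 + s2
print(total)

# fill_value treats a label missing on one side as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```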


There are special kinds of `Index`es that you'll come across. Some of these are

- `MultiIndex` for multidimensional (Hierarchical) labels
- `DatetimeIndex` for datetimes
- `Float64Index` for floats (removed in pandas 2.0, where a plain `Index` with a float dtype takes its place)
- `CategoricalIndex` for, you guessed it, `Categorical`s
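You rarely need to build these by hand, but it can help to see a few once. A sketch using standard pandas constructors, with made-up labels:

```python
import pandas as pd

# a range of dates produces a DatetimeIndex
dates = pd.date_range('2015-01-01', periods=3)
print(type(dates).__name__)  # DatetimeIndex

# tuples of labels produce a MultiIndex with one level per tuple position
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
print(type(multi).__name__, multi.nlevels)  # MultiIndex 2

# repeated labels drawn from a fixed set suit a CategoricalIndex
cats = pd.CategoricalIndex(['low', 'high', 'low'])
print(type(cats).__name__)  # CategoricalIndex
```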

We'll talk *a lot* more about indexes. They're a complex topic and can introduce headaches.

<blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/gjreda">@gjreda</a> <a href="https://twitter.com/treycausey">@treycausey</a> in some cases row indexes are the best thing since sliced bread, in others they simply get in the way. Hard problem</p>&mdash; Wes McKinney (@wesmckinn) <a href="https://twitter.com/wesmckinn/status/547177248768659457">December 22, 2014</a></blockquote>

Pandas, for better or for worse, does provide ways around row indexes getting in the way. The problem is knowing *when* they are just getting in the way, which mostly comes with experience. Sorry.
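As one illustration of a "way around", here's a sketch where two made-up Series share no labels, so label alignment is unhelpful and you want to combine by position instead. These are two options among several, not the only approach.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# no labels in common, so label alignment gives all NaN
all_missing = (s1 + s2).isna().all()
print(all_missing)  # True

# option 1: drop down to NumPy arrays and combine by position
raw_sum = s1.to_numpy() + s2.to_numpy()
print(raw_sum)  # [11 22 33]

# option 2: replace both indexes with the default 0..n-1 range
by_position = s1.reset_index(drop=True) + s2.reset_index(drop=True)
print(by_position.tolist())  # [11, 22, 33]
```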