We'll cover the basics here.

Outline:

- Philosophy
- NumPy foundation
- Basic IO
- Data Structures

## Philosophy

It's helpful to understand the goals of pandas, in the context of the problems it tries to solve.
pandas aims to provide an API that enables you to work with your data expressively and concisely.
Achieving this goal is complicated by the nature of the world: datasets are messy or incomplete.

## Numpy Foundation

pandas builds atop NumPy, both historically and in the actual library.
It's helpful to have a good understanding of some NumPyisms. [Speak the vernacular](https://www.youtube.com/watch?v=u2yvNw49AX4).

### dtypes

It's an unfortunate fact that at some point in using pandas (or NumPy), you'll have to worry about data-types.

The full list of NumPy dtypes can be found in the [NumPy documentation](http://docs.scipy.org/doc/numpy/user/basics.types.html).

The two biggest things to remember are:

- Missing values (NaN) cast integers to floats, and booleans to floats or objects
- The object dtype is the fallback

You'll want to avoid object dtypes where you can, since operations on them fall back to slower Python-level code.
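Here's a minimal sketch of both points. The variable names (`ints`, `with_missing`, `mixed`) are made up for illustration; the only assumption is a reasonably recent pandas.

```python
import pandas as pd

# Integer data stays integer as long as nothing is missing.
ints = pd.Series([1, 2, 3])
print(ints.dtype)          # an integer dtype, typically int64

# Reindexing to a label that doesn't exist introduces a NaN,
# which forces the whole Series to float.
with_missing = ints.reindex([0, 1, 2, 3])
print(with_missing.dtype)  # float64

# Mixing numbers and strings falls back to object.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)         # object
```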

### Broadcasting

Broadcasting is how NumPy combines arrays of different shapes in a single operation. It's super cool and super useful.

We'll breeze through the basics here, and get onto some interesting applications in a bit. I want to provide the *barest* of intuition so things stick down the road.
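As a rough sketch of that intuition (the arrays `a`, `col`, and `m` are invented for this example), NumPy stretches the smaller operand so the shapes line up:

```python
import numpy as np

# A scalar is stretched to match the shape of the array.
a = np.array([1.0, 2.0, 3.0])
doubled = a * 2
print(doubled)      # [2. 4. 6.]

# A (3, 1) column plus a (3,) row broadcasts to a (3, 3) grid.
col = a.reshape(3, 1)
grid = col + a
print(grid.shape)   # (3, 3)

# Subtract each column's mean from a (2, 3) matrix:
# the (3,) vector of means is broadcast across both rows.
m = np.array([[1.0, 2.0, 3.0],
              [3.0, 4.0, 5.0]])
centered = m - m.mean(axis=0)
print(centered)
```

No loops and no manual tiling; the shapes just have to be compatible.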

## Data Structures

This is the typical starting point for any intro to pandas.
We'll follow suit.

### The DataFrame

Here we have the workhorse data structure for pandas.
It's an in-memory table holding your data, which provides a few conveniences over lists of lists or NumPy arrays:

- labeled indexing
- heterogeneous data
- missing data
- convenience methods

In [8]:
import numpy as np
import pandas as pd

In [3]:
# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df

|   | A | B | C |
|---|---|---|---|
| a | 1 | True | 0.496714 |
| b | 2 | True | -0.138264 |
| c | 3 | False | 0.647689 |


### Selecting

Our first improvement over NumPy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular Python `__getitem__` machinery. Pass in a single column label `'A'` or a list of labels `['A', 'C']` to select subsets of the original `DataFrame`.

In [4]:
df[['A', 'C']]

|   | A | C |
|---|---|---|
| a | 1 | 0.496714 |
| b | 2 | -0.138264 |
| c | 3 | 0.647689 |


For row-wise selection, use the special `.loc` accessor.

In [5]:
df.loc['a']

A            1
B         True
C    0.4967142
Name: a, dtype: object

When your index labels are ordered, you can use *ranges* to select rows or columns.

In [17]:
df.loc['a':'b']

|   | A | B | C |
|---|---|---|---|
| a | 1 | True | 0.496714 |
| b | 2 | True | -0.138264 |


Notice that the slice is *inclusive* on both sides, unlike your typical slicing of a list. Sometimes, you'd rather slice by *position* instead of label. `.iloc` has you covered:

In [21]:
df.iloc[0]

A            1
B         True
C    0.4967142
Name: a, dtype: object
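To make the inclusive-versus-exclusive difference concrete, here's a small sketch on a throwaway frame (rebuilt here so the snippet runs on its own). Label slices with `.loc` include the end label, while position slices with `.iloc` behave like ordinary Python slices.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

by_label = df.loc['a':'b']     # label slice: includes 'b'
by_position = df.iloc[0:1]     # position slice: excludes position 1

print(by_label.index.tolist())     # ['a', 'b']
print(by_position.index.tolist())  # ['a']
```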

As I mentioned, you can slice both rows and columns. Use `.loc` for label or `.iloc` for position indexing.

In [24]:
df.loc['a', 'B']

True

Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back a `Series` (see below). Select a single row and a single column, and you get a scalar.
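A quick way to see the dimension reduction is to check the type of each result. This sketch rebuilds the same `df` so it runs standalone; the helper names (`two_d`, `one_col`, `one_row`, `cell`) are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])

two_d = df[['A']]        # list of labels: stays a DataFrame
one_col = df['A']        # single label: drops to a Series
one_row = df.loc['a']    # single row: also a Series
cell = df.loc['a', 'A']  # single row and column: a scalar

print(type(two_d).__name__)    # DataFrame
print(type(one_col).__name__)  # Series
print(type(one_row).__name__)  # Series
print(np.ndim(cell))           # 0
```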

You can get pretty fancy:

In [26]:
df.loc['a':'b', ['A', 'C']]

|   | A | C |
|---|---|---|
| a | 1 | 0.496714 |
| b | 2 | -0.138264 |


### Series

You've already seen some `Series` up above. It's the 1-dimensional analog of the DataFrame. Each column in a `DataFrame` is in some sense a `Series`. You can select a `Series` from a DataFrame in a few ways:

In [43]:
# __getitem__ like before
df['A']

a    1
b    2
c    3
Name: A, dtype: int64

In [44]:
# .loc, like before
df.loc[:, 'A']

a    1
b    2
c    3
Name: A, dtype: int64

In [46]:
# using `.` attribute lookup
df.A

a    1
b    2
c    3
Name: A, dtype: int64

You'll have to be careful with the last one. It won't work if your column name isn't a valid Python identifier (say it has a space) or if it conflicts with one of the (many) methods on `DataFrame`. The `.` accessor is extremely convenient for interactive use though.
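Both failure modes are easy to reproduce. The frame `tricky` and its column names are invented for this sketch: one column has a space, the other shares its name with the `DataFrame.count` method.

```python
import pandas as pd

tricky = pd.DataFrame({'my col': [1, 2], 'count': [3, 4]})

# `tricky.my col` isn't even valid Python, so it never reaches pandas.
try:
    compile("tricky.my col", "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False
print(parses)                     # False

# `tricky.count` finds the DataFrame method, not the column.
print(callable(tricky.count))     # True

# __getitem__ works for both.
print(tricky['my col'].tolist())  # [1, 2]
print(tricky['count'].tolist())   # [3, 4]
```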

You should never *assign* a column with `.`; for example, don't do

```python
# bad
df.A = [1, 2, 3]
```

It's unclear whether you're attaching the list `[1, 2, 3]` as an attribute of `df`, or whether you want it as a column. It's better to just say

```python
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
```
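The ambiguity bites hardest when the name doesn't exist yet. In this sketch (the frame `small` and the name `D` are made up), attribute assignment quietly creates a plain Python attribute rather than a column. pandas may emit a warning about it, which is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

small = pd.DataFrame({'A': [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about this very mistake
    small.D = [4, 5, 6]              # sets an attribute, not a column

column_after_attribute = 'D' in small.columns
print(column_after_attribute)        # False

small['D'] = [4, 5, 6]               # this one really adds a column
column_after_getitem = 'D' in small.columns
print(column_after_getitem)          # True
```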

`Series` share many of the same methods as DataFrames.

In [18]:
frame_methods = set(filter(lambda x: not x.startswith('_'), dir(df)))
series_methods = set(filter(lambda x: not x.startswith('_'), dir(df['A'])))
len(frame_methods & series_methods) / len(frame_methods | series_methods)

0.6937984496124031

About 70% overlap.

You might wonder what benefits having a separate class for `Series` has over just having a `DataFrame` with a single column. As we uncover more of pandas, I'll be sure to point them out. But as a preview, `Series` are mappable:

In [64]:
print('Series:\n', df.A.map({1: 'one', 2: 'two'}), end='\n\n', sep='')

try:
    print(df[['A']].map({1: 'one', 2: 'two'}))
except Exception:
    # Older pandas has no DataFrame.map; newer versions expect a callable, not a dict
    print('DataFrames are not mappable!')

Series:
a    one
b    two
c    NaN
Name: A, dtype: object

DataFrames are not mappable!


Series can be used as boolean indexers (more later)

In [75]:
# return rows where 'B' is True
print("Series:", df[df.B], sep='\n', end='\n\n')

print("DataFrame:", df[df[['B']]], sep='\n')

Series:
   A     B         C
a  1  True  0.496714
b  2  True -0.138264

DataFrame:
    A   B   C
a NaN   1 NaN
b NaN   1 NaN
c NaN NaN NaN


The DataFrame result probably isn't what you expected to get. I'll explain why in `reindexing`.

Series can be used in `groupby`.

In [84]:
print("Series", df.groupby(df['B']).A.count(), sep='\n', end='\n')

try:
    df.groupby(df[['B']]).A.count()
except ValueError:
    print("Can't group by DataFrame")

Series
B
False    1
True     2
Name: A, dtype: int64
Can't group by DataFrame




Other methods, like `value_counts` or the `.str` methods, either haven't been implemented on `DataFrame` or don't make as much sense there.
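For a taste of those `Series` methods, here's a small sketch on a made-up `Series` of strings:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple'])

# Count how often each distinct value appears.
counts = s.value_counts()
print(counts['apple'])   # 2

# Vectorized string methods live under the .str accessor.
upper = s.str.upper()
print(upper.tolist())    # ['APPLE', 'BANANA', 'APPLE']
```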

### Index

Indexes are something of a peculiarity to pandas.
First off, they are not the kind of Indexes you'll find in SQL, which are used to help the engine speed up certain queries.
In pandas, Indexes are about row labels.

R does have row labels, but they're nowhere near as powerful (or complicated) as they are in pandas. You can access the index of a `DataFrame` or `Series` with the `.index` attribute.

In [88]:
df.index

Index(['a', 'b', 'c'], dtype='object')

What are indexes useful for? Without getting too far into the weeds, you can slice with them:

In [94]:
df.loc[df.index[:2]]

|   | A | B | C |
|---|---|---|---|
| a | 1 | True | 0.496714 |
| b | 2 | True | -0.138264 |


There are special kinds of `Index`es that you'll come across. Some of these are

- `MultiIndex` for multidimensional (Hierarchical) labels
- `DatetimeIndex` for datetimes
- `Float64Index` for floats
- `CategoricalIndex` for, you guessed it, `Categorical`s
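Here's a hedged sketch of how a few of these can be constructed directly. The dates and labels are invented for illustration. `Float64Index` is left out because, as far as I know, it isn't available as a separate class in recent pandas releases.

```python
import pandas as pd

# date_range returns a DatetimeIndex.
dates = pd.date_range('2015-01-01', periods=3, freq='D')
print(type(dates).__name__)  # DatetimeIndex

# A MultiIndex built from every combination of two levels.
multi = pd.MultiIndex.from_product([['x', 'y'], [1, 2]],
                                   names=['letter', 'number'])
print(multi.nlevels)         # 2
print(len(multi))            # 4

# A CategoricalIndex stores labels as categories.
cat = pd.CategoricalIndex(['low', 'high', 'low'])
print(type(cat).__name__)    # CategoricalIndex
```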

We'll talk *a lot* more about indexes. They're a complex topic and can introduce headaches.

<blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/gjreda">@gjreda</a> <a href="https://twitter.com/treycausey">@treycausey</a> in some cases row indexes are the best thing since sliced bread, in others they simply get in the way. Hard problem</p>&mdash; Wes McKinney (@wesmckinn) <a href="https://twitter.com/wesmckinn/status/547177248768659457">December 22, 2014</a></blockquote>

Pandas, for better or for worse, does provide ways around row indexes getting in the way. The problem is knowing *when* they are just getting in the way, which mostly comes with experience. Sorry.
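One such escape hatch, sketched here on a throwaway frame, is `reset_index`, which either moves the row labels into an ordinary column or drops them in favor of the default integer positions. `set_index` goes the other way. This is just one option, not the only way around the problem.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

# Move the row labels into a regular column (named 'index' by default).
flat = df.reset_index()
print(flat.columns.tolist())   # ['index', 'A']

# Or throw the labels away entirely.
dropped = df.reset_index(drop=True)
print(dropped.index.tolist())  # [0, 1, 2]

# set_index turns a column back into row labels.
restored = flat.set_index('index')
print(restored.index.tolist()) # ['a', 'b', 'c']
```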