Pandas, written by Wes McKinney and others, is a powerful module that you can think of as Excel for Python.    
Throughout this section, we'll be closely following the presentation in Wes McKinney's book Python for Data Analysis. I can't recommend this book enough! I've added context to give you a solid understanding of Pandas, as well as exercises to pace the lesson and give you an opportunity to experiment.    

Warning: Learning to use Pandas well takes time and practice. There are no shortcuts :-(    

First, just as NumPy is usually imported as np, most people import pandas with the shorthand pd. Its two key data structures, Series and DataFrame, are so common that people tend to import them directly as well:

In [1]:
from pandas import Series, DataFrame
import pandas as pd

#### Series

A Series is like a NumPy array, but with a set of element labels, called an index. Think of it as a single column in a spreadsheet.

In [2]:
obj = Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
obj[1]

7

You can also grab the underlying values and index directly:

In [4]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
obj.index

Int64Index([0, 1, 2, 3], dtype='int64')

You can create a series with a different index as follows:

In [6]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

Now you can use that index label to get one or more rows in the series:

In [7]:
obj2['d']

4

In [8]:
obj2[['a', 'c']]  # Get more than one row

a   -5
c    3
dtype: int64

In [9]:
# Series have a defined ordering
obj2[['c', 'a']]

c    3
a   -5
dtype: int64
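Square brackets look things up by label here. pandas also has two explicit accessors, .loc (by label) and .iloc (by integer position), which are not used in the text above. A minimal sketch of the same lookups written with them:

```python
from pandas import Series

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = obj2.loc['a']        # label lookup, same value as obj2['a']
by_position = obj2.iloc[2]      # third element, whatever its label is
subset = obj2.loc[['c', 'a']]   # several labels, in the order requested

print(by_label, by_position)
print(subset)
```

Spelling out .loc or .iloc makes it obvious to a reader whether a number means a label or a position.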

As with a NumPy array, you can perform element-wise operations and filtering on a Series, but unlike NumPy, these operations preserve the index labels:

In [10]:
obj2[obj2 > 2]

d    4
b    7
c    3
dtype: int64

In [11]:
obj2 + 1

d    5
b    8
a   -4
c    4
dtype: int64

Series can also be thought of as "ordered dictionaries". In fact, you can create a series from a dictionary:

In [12]:
country_pops = {
    'Belgium':     11259000,
    'Netherlands': 16933000,
    'Luxembourg':    570000,
    'France':      67063000,
    'Germany':     81276000,
}
obj3 = Series(country_pops)
obj3

Belgium        11259000
France         67063000
Germany        81276000
Luxembourg       570000
Netherlands    16933000
dtype: int64

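By default, all the keys in the dictionary are included. In the output above they appear in sorted order; recent pandas versions keep the dictionary's insertion order instead. You can choose which keys to include, and in what order, by specifying the index explicitly: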

In [13]:
obj4 = Series(country_pops, index=['Belgium', 'Netherlands', 'Spain'])
obj4

Belgium        11259000
Netherlands    16933000
Spain               NaN
dtype: float64

Note that because there was no value for 'Spain' in country_pops, the value in the series is NaN (Not-a-Number). This is Pandas' standard way of representing missing data, and since NaN is a floating-point value, it is also why the dtype changed to float64. You can check which elements are missing with the isnull and notnull methods:

In [14]:
obj4.isnull()

Belgium        False
Netherlands    False
Spain           True
dtype: bool

In [15]:
obj4.notnull()

Belgium         True
Netherlands     True
Spain          False
dtype: bool
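Once you know which entries are missing, you usually want to either drop them or fill them in. The sketch below uses the dropna and fillna methods, which the text above does not cover; the fill value 0 is just a placeholder for illustration, not a real population figure.

```python
from pandas import Series

country_pops = {'Belgium': 11259000, 'Netherlands': 16933000}
obj4 = Series(country_pops, index=['Belgium', 'Netherlands', 'Spain'])

known = obj4[obj4.notnull()]   # boolean filtering keeps the non-missing rows
dropped = obj4.dropna()        # same result via the dedicated method
filled = obj4.fillna(0)        # replace missing entries with a placeholder

print(known)
print(filled)
```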

Vector operations between Series align along the index labels, not along the row numbers, which can be very useful:

In [16]:
obj3

Belgium        11259000
France         67063000
Germany        81276000
Luxembourg       570000
Netherlands    16933000
dtype: int64

In [17]:
obj4

Belgium        11259000
Netherlands    16933000
Spain               NaN
dtype: float64

In [18]:
obj3 + obj4

Belgium        22518000
France              NaN
Germany             NaN
Luxembourg          NaN
Netherlands    33866000
Spain               NaN
dtype: float64
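To see that alignment really follows labels rather than positions, here is a small sketch with two made-up Series (s1 and s2 are illustrative, not from the text) that hold the same labels in a different order:

```python
from pandas import Series

s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = Series([30, 10, 20], index=['c', 'a', 'b'])

aligned = s1 + s2                  # matched up by label
positional = s1.values + s2.values # plain NumPy arrays: matched up by position

print(aligned)
print(positional)
```

The Series sum pairs 'a' with 'a', 'b' with 'b' and so on, while the raw NumPy arrays simply add first element to first element.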

You can assign names to both the Series itself and the index. Think of these as column names in a spreadsheet:

In [19]:
obj4.name = 'population'
obj4.index.name = 'country'
obj4

country
Belgium        11259000
Netherlands    16933000
Spain               NaN
Name: population, dtype: float64

#### Ex Starting from a dictionary, build a Series to represent the population of Belgium broken down by province. Give both the series and the index a proper name. From that Series, use indexing to sort the Series by population.

### DataFrame

A DataFrame is the 2D analog to the 1D Series object. Think of it as a spreadsheet. You have a set of columns (each of a single, fixed type). Both the columns and the rows have an index as well.    
A common way to create a DataFrame is with a dictionary of equal-length lists, one list per column:

In [20]:
# From https://en.wikipedia.org/wiki/List_of_regions_by_past_GDP_%28PPP%29_per_capita
data = {
    'country': ['BE', 'BE', 'BE', 'NL', 'NL', 'NL'],
    'year': [1913, 1950, 2003, 1913, 1950, 2003],
    'gdp_per_capita': [4220, 5462, 21205, 4049, 5996, 21480]
}
frame = DataFrame(data)
frame

Unnamed: 0,country,gdp_per_capita,year
0,BE,4220,1913
1,BE,5462,1950
2,BE,21205,2003
3,NL,4049,1913
4,NL,5996,1950
5,NL,21480,2003
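Each column has its own type, which you can inspect with the dtypes attribute (not shown in the text above). A minimal sketch on a shortened version of the same data:

```python
from pandas import DataFrame

data = {
    'country': ['BE', 'BE', 'NL'],
    'year': [1913, 1950, 1913],
    'gdp_per_capita': [4220, 5462, 4049],
}
frame = DataFrame(data)

print(frame.dtypes)   # one dtype per column
print(frame.shape)    # (number of rows, number of columns)
```

The two numeric columns come out as integers; how the text column is labelled depends on your pandas version.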


Notice that IPython has special support for rendering DataFrames.    
We can reorder the columns as we wish and give the rows some other index:

In [21]:
DataFrame(data,
          columns=['year', 'country', 'gdp_per_capita'],
          index=['one', 'two', 'three', 'four', 'five', 'six'])

Unnamed: 0,year,country,gdp_per_capita
one,1913,BE,4220
two,1950,BE,5462
three,2003,BE,21205
four,1913,NL,4049
five,1950,NL,5996
six,2003,NL,21480


As with Series, any columns you specify that are not in the dictionary are filled with missing values:

In [22]:
frame2 = DataFrame(data,
                   columns=['year', 'country', 'gdp_per_capita', 'beer_quality'],
                   index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,
two,1950,BE,5462,
three,2003,BE,21205,
four,1913,NL,4049,
five,1950,NL,5996,
six,2003,NL,21480,


As with Series and their indices, we can get at the underlying data in a DataFrame:

In [23]:
frame2.columns

Index([u'year', u'country', u'gdp_per_capita', u'beer_quality'], dtype='object')

In [24]:
frame2.index

Index([u'one', u'two', u'three', u'four', u'five', u'six'], dtype='object')

In [25]:
frame2.values

array([[1913L, 'BE', 4220L, nan],
       [1950L, 'BE', 5462L, nan],
       [2003L, 'BE', 21205L, nan],
       [1913L, 'NL', 4049L, nan],
       [1950L, 'NL', 5996L, nan],
       [2003L, 'NL', 21480L, nan]], dtype=object)

You can select one column from a DataFrame, either like a dictionary or (thanks to some Python magic) as if it were an attribute of the DataFrame:

In [26]:
frame2['year']

one      1913
two      1950
three    2003
four     1913
five     1950
six      2003
Name: year, dtype: int64

In [27]:
frame2.country

one      BE
two      BE
three    BE
four     NL
five     NL
six      NL
Name: country, dtype: object

Note how a column in a DataFrame is represented as a Series, with an appropriate index and name.

#### Ex Build a DataFrame that captures how the population in each province of Flanders changed in the years 2013, 2014 and 2015.

You can index rows by label with the .loc[] notation (older pandas releases also offered .ix[], which has since been removed):

In [28]:
frame2.loc['one']    # Single row, returned as a Series

year              1913
country             BE
gdp_per_capita    4220
beer_quality       NaN
Name: one, dtype: object

In [29]:
frame2.loc[['one', 'two']]  # Multiple rows, returned as a DataFrame

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,
two,1950,BE,5462,


You can use slicing notation, but unlike slices of lists and NumPy arrays, slices by index label include the endpoint:

In [30]:
frame2['one':'three']

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,
two,1950,BE,5462,
three,2003,BE,21205,
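The inclusive endpoint applies to slices by label. Slices by integer position behave like ordinary Python slices and leave the endpoint out. A small sketch contrasting the two, using the .loc and .iloc accessors on a cut-down frame (the frame here is illustrative, not frame2 itself):

```python
from pandas import DataFrame

frame = DataFrame({'year': [1913, 1950, 2003, 1913]},
                  index=['one', 'two', 'three', 'four'])

by_label = frame.loc['one':'three']   # label slice: 'three' is included
by_position = frame.iloc[0:2]         # positional slice: row 2 is excluded

print(len(by_label), len(by_position))
```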


You can also select only some of the columns:

In [31]:
frame2.loc[['one', 'two'], ['year', 'country', 'gdp_per_capita']]

Unnamed: 0,year,country,gdp_per_capita
one,1913,BE,4220
two,1950,BE,5462


It might be easier to drop certain rows or columns instead of selecting the complement:

In [32]:
frame2.drop(['one', 'three'])  # Returns a copy with rows removed

Unnamed: 0,year,country,gdp_per_capita,beer_quality
two,1950,BE,5462,
four,1913,NL,4049,
five,1950,NL,5996,
six,2003,NL,21480,


In [33]:
frame2.drop('year', axis=1)  # Returns a copy with columns removed

Unnamed: 0,country,gdp_per_capita,beer_quality
one,BE,4220,
two,BE,5462,
three,BE,21205,
four,NL,4049,
five,NL,5996,
six,NL,21480,


We can modify columns in a frame: either set them to the same value, or specify a list of values, one for each row:

In [34]:
frame2.beer_quality = 10.0
frame2

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,10
two,1950,BE,5462,10
three,2003,BE,21205,10
four,1913,NL,4049,10
five,1950,NL,5996,10
six,2003,NL,21480,10


In [35]:
# Not quite right...
frame2.beer_quality = [10.0, 10.0, 10.0, 4.0, 3.0, 2.0]
frame2

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,10
two,1950,BE,5462,10
three,2003,BE,21205,10
four,1913,NL,4049,4
five,1950,NL,5996,3
six,2003,NL,21480,2


You can also assign a Series to a column, in which case rows are matched by index and missing elements are set to NaN:

In [36]:
val = Series([10.0, 2.0, 1.0], index=['one', 'four', 'five'])
val

one     10
four     2
five     1
dtype: float64

In [37]:
frame2.beer_quality = val
frame2

Unnamed: 0,year,country,gdp_per_capita,beer_quality
one,1913,BE,4220,10.0
two,1950,BE,5462,
three,2003,BE,21205,
four,1913,NL,4049,2.0
five,1950,NL,5996,1.0
six,2003,NL,21480,


You can create a new column by simply assigning something to it:

In [38]:
frame2['livable'] = (frame2.beer_quality >= 9.0)
# Can't do this though: frame2.livable = ...

frame2

Unnamed: 0,year,country,gdp_per_capita,beer_quality,livable
one,1913,BE,4220,10.0,True
two,1950,BE,5462,,False
three,2003,BE,21205,,False
four,1913,NL,4049,2.0,False
five,1950,NL,5996,1.0,False
six,2003,NL,21480,,False
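The comment about frame2.livable deserves a closer look: attribute-style assignment only works for columns that already exist. The sketch below, on a small illustrative frame, shows that assigning to a new attribute name leaves the columns untouched (pandas may emit a warning at that point, which is silenced here), whereas the bracket form creates the column.

```python
import warnings
from pandas import DataFrame

frame = DataFrame({'beer_quality': [10.0, 2.0]}, index=['one', 'four'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')             # pandas may warn about the next line
    frame.livable = frame.beer_quality >= 9.0   # sets a plain Python attribute only

before = 'livable' in frame.columns             # no such column yet

frame['livable'] = frame.beer_quality >= 9.0    # bracket form creates a real column
after = 'livable' in frame.columns

print(before, after)
```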


You can also delete columns in place with del:

In [39]:
del frame2['beer_quality']
frame2

Unnamed: 0,year,country,gdp_per_capita,livable
one,1913,BE,4220,True
two,1950,BE,5462,False
three,2003,BE,21205,False
four,1913,NL,4049,False
five,1950,NL,5996,False
six,2003,NL,21480,False


#### Ex Quick: what's the difference between frame2.drop('beer_quality') and del frame2['beer_quality']?

You can also create a DataFrame using nested dictionaries. The outer keys are the columns, the inner ones are the rows, and missing data is marked appropriately:

In [40]:
data = {
    'Belgium': {1913: 4220.0, 1950: 5462.0, 2003: 21205.0},
    'Netherlands': {1950: 5996.0, 1973: 13082.0, 2003: 21480.0}
}
frame3 = DataFrame(data)
frame3

Unnamed: 0,Belgium,Netherlands
1913,4220.0,
1950,5462.0,5996.0
1973,,13082.0
2003,21205.0,21480.0


As with matrices, you can transpose a DataFrame:

In [41]:
frame3.T

Unnamed: 0,1913,1950,1973,2003
Belgium,4220.0,5462,,21205
Netherlands,,5996,13082.0,21480


It's good form to explicitly name the row and column indices:

In [42]:
frame3.index.name = 'year'
frame3.columns.name = 'country'
frame3

country,Belgium,Netherlands
year,Unnamed: 1_level_1,Unnamed: 2_level_1
1913,4220.0,
1950,5462.0,5996.0
1973,,13082.0
2003,21205.0,21480.0


Next, we'll quickly survey a few DataFrame operations, then work through a long and interesting example.    

As with Series, arithmetic between DataFrames is defined and aligns over row and column indices. For example:

In [43]:
import numpy as np
df1 = DataFrame(np.arange(9.).reshape((3,3)),
                columns=['b', 'c', 'd'],
                index=['Belgium', 'Netherlands', 'Luxembourg'])
df2 = DataFrame(np.arange(12.).reshape((4,3)),
                columns=['b', 'd', 'e'],
                index=['France', 'Belgium', 'Netherlands', 'Germany'])

In [44]:
df1 

Unnamed: 0,b,c,d
Belgium,0,1,2
Netherlands,3,4,5
Luxembourg,6,7,8


In [45]:
df2

Unnamed: 0,b,d,e
France,0,1,2
Belgium,3,4,5
Netherlands,6,7,8
Germany,9,10,11


In [46]:
df1 + df2

Unnamed: 0,b,c,d,e
Belgium,3.0,,6.0,
France,,,,
Germany,,,,
Luxembourg,,,,
Netherlands,9.0,,12.0,


Note that in places where either df1 or df2 was missing data, the result of df1 + df2 will be missing as well. We can instead tell pandas to assume some other value, like 0, if we use the add method:

In [47]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Belgium,3,1.0,6,5.0
France,0,,1,2.0
Germany,9,,10,11.0
Luxembourg,6,7.0,8,
Netherlands,9,4.0,12,8.0


Notice that there are still some entries for which data wasn't available in either df1 or df2, and those are still marked as missing.    
There are also analogous methods for - (sub), * (mul) and / (div).    
You can also do arithmetic between DataFrames and Series.  pandas lines these up along the columns of the DataFrame, then uses the same Series for every row:

In [48]:
df = DataFrame({
        'BeerLovers': {'BE': 55, 'FR': 15},
        'WineLovers': {'BE': 10, 'FR': 30}
    })
df

Unnamed: 0,BeerLovers,WineLovers
BE,55,10
FR,15,30


In [49]:
df.loc['Total'] = df.loc['BE'] + df.loc['FR']
df

Unnamed: 0,BeerLovers,WineLovers
BE,55,10
FR,15,30
Total,70,40


In [50]:
df.astype(float) / df.loc['Total']  # astype(float) avoids integer division on Python 2

Unnamed: 0,BeerLovers,WineLovers
BE,0.785714,0.25
FR,0.214286,0.75
Total,1.0,1.0


Perhaps that's a bit more natural in the other orientation. We can specify an axis along which to replicate the series when using the explicit div method:

In [52]:
df = DataFrame({
        'BE': {'BeerLovers': 55, 'WineLovers': 10},
        'FR': {'BeerLovers': 15, 'WineLovers': 30}
    })
df

Unnamed: 0,BE,FR
BeerLovers,55,15
WineLovers,10,30


In [53]:
df['Total'] = df.BE + df.FR
df

Unnamed: 0,BE,FR,Total
BeerLovers,55,15,70
WineLovers,10,30,40


In [54]:
df2 = df.astype(float).div(df.Total, axis=0)
df2

Unnamed: 0,BE,FR,Total
BeerLovers,0.785714,0.214286,1
WineLovers,0.25,0.75,1


In [55]:
df2['Total']

BeerLovers    1
WineLovers    1
Name: Total, dtype: float64

Or maybe you want the proportions within each country:

In [56]:
print(df)
df2 = df.T
df2

            BE  FR  Total
BeerLovers  55  15     70
WineLovers  10  30     40


Unnamed: 0,BeerLovers,WineLovers
BE,55,10
FR,15,30
Total,70,40


In [57]:
df2['Total'] = df2.BeerLovers + df2.WineLovers
df2

Unnamed: 0,BeerLovers,WineLovers,Total
BE,55,10,65
FR,15,30,45
Total,70,40,110


In [58]:
df2.div(df2.Total, axis=0)

Unnamed: 0,BeerLovers,WineLovers,Total
BE,0.846154,0.153846,1
FR,0.333333,0.666667,1
Total,0.636364,0.363636,1
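If you only need the proportions, you can skip the helper Total column and divide by the row sums directly. This is a sketch under the assumption that every column is a count you want included in the total; the name counts is illustrative.

```python
from pandas import DataFrame

counts = DataFrame({
    'BeerLovers': {'BE': 55, 'FR': 15},
    'WineLovers': {'BE': 10, 'FR': 30},
})

row_totals = counts.sum(axis=1)           # one total per country
shares = counts.div(row_totals, axis=0)   # divide each row by its own total

print(shares)
```

Each row of shares then adds up to 1, matching the BE and FR rows in the table above.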


You can apply NumPy functions element by element to a DataFrame:

In [59]:
df = DataFrame(np.random.rand(3,4), columns=['a','b','c','d'])
df

Unnamed: 0,a,b,c,d
0,0.54917,0.023071,0.695001,0.654539
1,0.253293,0.082519,0.945314,0.459492
2,0.592841,0.069312,0.772401,0.520921


In [60]:
np.exp(df)

Unnamed: 0,a,b,c,d
0,1.731814,1.023339,2.003712,1.924255
1,1.288261,1.08602,2.573621,1.58327
2,1.809121,1.07177,2.164957,1.683578


Or you can apply a function to each row or column:

In [61]:
def dynamic_range(xs):
    return xs.max() - xs.min()

In [62]:
df.apply(dynamic_range)  # Apply it to each column

a    0.339548
b    0.059449
c    0.250312
d    0.195047
dtype: float64

In [63]:
df.apply(dynamic_range, axis=1) # Apply it to each row

0    0.671931
1    0.862794
2    0.703089
dtype: float64

Instead of collapsing each row or column to a single value, the applied function can itself return a Series:

In [64]:
def dynamic_range_2(xs):
    return Series([xs.min(), xs.max()], index=['min', 'max'])

In [65]:
df

Unnamed: 0,a,b,c,d
0,0.54917,0.023071,0.695001,0.654539
1,0.253293,0.082519,0.945314,0.459492
2,0.592841,0.069312,0.772401,0.520921


In [66]:
df.apply(dynamic_range_2)

Unnamed: 0,a,b,c,d
min,0.253293,0.023071,0.695001,0.459492
max,0.592841,0.082519,0.945314,0.654539


In [67]:
df.apply(dynamic_range_2, axis=1)

Unnamed: 0,min,max
0,0.023071,0.695001
1,0.082519,0.945314
2,0.069312,0.772401
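For common reductions like these you may not need a custom function at all. As a hedged alternative to apply, the agg method accepts a list of reduction names and labels the result rows for you. The sketch uses a fixed random seed so it is repeatable, so the numbers differ from the table above.

```python
import numpy as np
from pandas import DataFrame

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
df = DataFrame(rng.rand(3, 4), columns=['a', 'b', 'c', 'd'])

summary = df.agg(['min', 'max'])   # one row per reduction, one column per input column
print(summary)
```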


There are a couple of predefined functions that are useful:

In [68]:
df

Unnamed: 0,a,b,c,d
0,0.54917,0.023071,0.695001,0.654539
1,0.253293,0.082519,0.945314,0.459492
2,0.592841,0.069312,0.772401,0.520921


In [69]:
df.sum()

a    1.395304
b    0.174902
c    2.412716
d    1.634953
dtype: float64

In [70]:
df.sum(axis=1)

0    1.921781
1    1.740619
2    1.955475
dtype: float64

In [71]:
df

Unnamed: 0,a,b,c,d
0,0.54917,0.023071,0.695001,0.654539
1,0.253293,0.082519,0.945314,0.459492
2,0.592841,0.069312,0.772401,0.520921


In [72]:
df.mean()

a    0.465101
b    0.058301
c    0.804239
d    0.544984
dtype: float64

In [73]:
df.cumsum()

Unnamed: 0,a,b,c,d
0,0.54917,0.023071,0.695001,0.654539
1,0.802463,0.10559,1.640315,1.114031
2,1.395304,0.174902,2.412716,1.634953


A particularly useful method for exploring data is describe:

In [74]:
df.describe()

Unnamed: 0,a,b,c,d
count,3.0,3.0,3.0,3.0
mean,0.465101,0.058301,0.804239,0.544984
std,0.184726,0.031216,0.128157,0.099725
min,0.253293,0.023071,0.695001,0.459492
25%,0.401232,0.046191,0.733701,0.490207
50%,0.54917,0.069312,0.772401,0.520921
75%,0.571005,0.075916,0.858857,0.58773
max,0.592841,0.082519,0.945314,0.654539


On non-numeric data, describe yields different results:

In [75]:
['a', 'b'] * 3

['a', 'b', 'a', 'b', 'a', 'b']

In [76]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
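The top and freq entries come from counting how often each distinct value occurs. If you want the full tally rather than just the winner, the value_counts method (not covered above) gives it. A minimal sketch on the same data:

```python
from pandas import Series

obj = Series(['a', 'a', 'b', 'c'] * 4)
counts = obj.value_counts()   # occurrences of each distinct value, most frequent first

print(counts)
```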

One final topic we'll touch on is how to read and write CSV files. CSV files (and their tab-separated cousins, TSV files) are the most common way of getting large amounts of data into and out of Pandas.    
Let's create a CSV file in the current directory:

In [77]:
# We haven't discussed Python's functions for reading and writing
# text in files, but I hope these are self-explanatory
f = open('mysample.csv', 'w')
f.write(
"""Year,Belgium,Netherlands
1913,4220,4049
1950,5462,5996
2003,21205,21480
""")
f.close()

With the file in place, here's how we can read it into a DataFrame:

In [None]:
df = pd.read_csv('mysample.csv')
df

If you show the help for read_csv, you'll see it has a ton of options. With the right options, you can read almost any CSV file out there into the exact DataFrame of your dreams.    
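To give a flavour of those options, here is a sketch of three that come up often: index_col, usecols and sep. It reads from an in-memory string via io.StringIO instead of a file, purely so the example is self-contained; the variable names are illustrative.

```python
from io import StringIO
import pandas as pd

csv_text = """Year,Belgium,Netherlands
1913,4220,4049
1950,5462,5996
2003,21205,21480
"""

# index_col promotes a column to the row index; usecols keeps a subset of columns.
by_year = pd.read_csv(StringIO(csv_text), index_col='Year')
belgium = pd.read_csv(StringIO(csv_text), usecols=['Year', 'Belgium'])

# For tab-separated (TSV) data, pass sep='\t'.
tsv_text = csv_text.replace(',', '\t')
from_tsv = pd.read_csv(StringIO(tsv_text), sep='\t')

print(by_year)
```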
Now let's save the data to a CSV file:

In [None]:
df.to_csv('mysample2.csv')

In [None]:
# Read the file back and print it
for line in open('mysample2.csv', 'r'):
    print(line.strip())
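By default to_csv also writes the row index as an unnamed first column. If you don't want that, pass index=False. The sketch below compares the two on a small illustrative frame; it relies on to_csv returning the CSV text when no path is given, which keeps the example free of temporary files.

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({'Year': [1913, 1950], 'Belgium': [4220, 5462]})

with_index = df.to_csv()                 # no path given: returns the CSV text
without_index = df.to_csv(index=False)   # leave out the row labels

print(with_index)
print(without_index)

# Reading the index-free text back gives the original frame again.
round_trip = pd.read_csv(StringIO(without_index))
```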