# Pandas

* The name Pandas can roughly be read as "Python data analysis", although it originally came from "panel data".
* Its data is stored in NumPy arrays, and many concepts carry over. On top of that, it offers a much nicer interface for data analysis, with higher-level support for typical data processing.
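
To make the NumPy connection concrete, here is a minimal sketch. The names `arr` and `frame` and the row and column labels are made up for illustration; the point is only that a DataFrame adds labels around array data and can hand that data back as a NumPy array.

```python
import numpy as np
import pandas as pd

# A plain 2-D NumPy array: positions only, no labels.
arr = np.arange(6, dtype=float).reshape(3, 2)

# The same numbers wrapped in a DataFrame with labelled rows and columns.
frame = pd.DataFrame(arr, index=['r0', 'r1', 'r2'], columns=['x', 'y'])

# Label-based access, instead of remembering that 'y' is column 1.
y_by_label = frame['y']
y_by_position = arr[:, 1]

# to_numpy() returns the values as a NumPy array again.
back = frame.to_numpy()
```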

These examples come directly from "10 minutes to Pandas" from the Pandas documentation.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [9]:
dates = pd.date_range('20130101', periods=6, freq='D')
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.191875,0.884182,0.65532,-0.053731
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208
2013-01-03,2.202912,0.239354,0.921805,-0.057907
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
2013-01-05,0.837581,-0.34288,0.899612,0.670063
2013-01-06,0.340025,0.902787,0.886391,-0.79562


In [14]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [15]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [16]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-0.191875,0.884182,0.65532,-0.053731
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208
2013-01-03,2.202912,0.239354,0.921805,-0.057907
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
2013-01-05,0.837581,-0.34288,0.899612,0.670063


In [17]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
2013-01-05,0.837581,-0.34288,0.899612,0.670063
2013-01-06,0.340025,0.902787,0.886391,-0.79562


In [18]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [20]:
df.values

array([[-0.1918754 ,  0.88418166,  0.65531961, -0.05373085],
       [-0.92058394, -0.4196273 , -1.61773111, -0.17420796],
       [ 2.2029118 ,  0.2393544 ,  0.9218051 , -0.05790738],
       [ 0.75433673,  0.53404456, -2.19262975, -1.46549901],
       [ 0.83758051, -0.34287988,  0.89961202,  0.67006346],
       [ 0.34002473,  0.9027869 ,  0.88639123, -0.79561977]])

In [21]:
df2.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

In [22]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.503732,0.299643,-0.074539,-0.312817
std,1.058378,0.582105,1.432864,0.732003
min,-0.920584,-0.419627,-2.19263,-1.465499
25%,-0.0589,-0.197321,-1.049468,-0.640267
50%,0.547181,0.386699,0.770855,-0.116058
75%,0.81677,0.796647,0.896307,-0.054775
max,2.202912,0.902787,0.921805,0.670063


In [23]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,-0.191875,-0.920584,2.202912,0.754337,0.837581,0.340025
B,0.884182,-0.419627,0.239354,0.534045,-0.34288,0.902787
C,0.65532,-1.617731,0.921805,-2.19263,0.899612,0.886391
D,-0.053731,-0.174208,-0.057907,-1.465499,0.670063,-0.79562


In [24]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208
2013-01-05,0.837581,-0.34288,0.899612,0.670063
2013-01-03,2.202912,0.239354,0.921805,-0.057907
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
2013-01-01,-0.191875,0.884182,0.65532,-0.053731
2013-01-06,0.340025,0.902787,0.886391,-0.79562


# Selection and indexing
Pandas selection is often more intuitive than NumPy's, but that convenience sometimes makes it inconsistent.

In [34]:
# Select columns
df['A']      # same as df.A.  Problem?

2013-01-01   -0.191875
2013-01-02   -0.920584
2013-01-03    2.202912
2013-01-04    0.754337
2013-01-05    0.837581
2013-01-06    0.340025
Freq: D, Name: A, dtype: float64

In [65]:
# Select rows - note same syntax as above.  What happens if there is ambiguity?
df['A':'B']

ValueError: Given date string not likely a datetime.

In [40]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208
2013-01-03,2.202912,0.239354,0.921805,-0.057907
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
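
The plain `[]` operator selects columns when given a label and rows when given a slice. A minimal sketch of the explicit alternatives, `.loc` (labels) and `.iloc` (positions), may help; the frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=6, freq='D')
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

col = demo['A']                           # a single label in [] selects a column
rows = demo.loc['20130102':'20130104']    # label slice on rows, end point included
rows_pos = demo.iloc[1:4]                 # position slice on rows, end point excluded
cols = demo.loc[:, 'A':'B']               # label slice on columns, end point included
```

With `.loc` and `.iloc` the axis being sliced is always explicit, so there is no guessing whether a slice refers to rows or columns.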


In [41]:
df.loc[dates[0]]

A   -0.191875
B    0.884182
C    0.655320
D   -0.053731
Name: 2013-01-01 00:00:00, dtype: float64

In [42]:
df.loc[:,['A','B']]


Unnamed: 0,A,B
2013-01-01,-0.191875,0.884182
2013-01-02,-0.920584,-0.419627
2013-01-03,2.202912,0.239354
2013-01-04,0.754337,0.534045
2013-01-05,0.837581,-0.34288
2013-01-06,0.340025,0.902787


In [44]:
df.loc['20130102',['A','B']]

A   -0.920584
B   -0.419627
Name: 2013-01-02 00:00:00, dtype: float64

In [45]:
df.loc[dates[0],'A']


-0.19187539839374196

In [46]:
df.at[dates[0],'A']

-0.19187539839374196

In [47]:
df.iloc[3]

A    0.754337
B    0.534045
C   -2.192630
D   -1.465499
Name: 2013-01-04 00:00:00, dtype: float64

In [48]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,0.754337,0.534045
2013-01-05,0.837581,-0.34288


In [52]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,0.884182,0.65532
2013-01-02,-0.419627,-1.617731
2013-01-03,0.239354,0.921805
2013-01-04,0.534045,-2.19263
2013-01-05,-0.34288,0.899612
2013-01-06,0.902787,0.886391


In [54]:
df.iloc[1,1]

-0.4196272982509927

In [57]:
df[df.B > 0]

Unnamed: 0,A,B,C,D
2013-01-01,-0.191875,0.884182,0.65532,-0.053731
2013-01-03,2.202912,0.239354,0.921805,-0.057907
2013-01-04,0.754337,0.534045,-2.19263,-1.465499
2013-01-06,0.340025,0.902787,0.886391,-0.79562


In [58]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.884182,0.65532,
2013-01-02,,,,
2013-01-03,2.202912,0.239354,0.921805,
2013-01-04,0.754337,0.534045,,
2013-01-05,0.837581,,0.899612,0.670063
2013-01-06,0.340025,0.902787,0.886391,


In [59]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.191875,0.884182,0.65532,-0.053731,one
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208,one
2013-01-03,2.202912,0.239354,0.921805,-0.057907,two
2013-01-04,0.754337,0.534045,-2.19263,-1.465499,three
2013-01-05,0.837581,-0.34288,0.899612,0.670063,four
2013-01-06,0.340025,0.902787,0.886391,-0.79562,three


In [60]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2013-01-01,-0.191875,0.884182,0.65532,-0.053731,
2013-01-02,-0.920584,-0.419627,-1.617731,-0.174208,1.0
2013-01-03,2.202912,0.239354,0.921805,-0.057907,2.0
2013-01-04,0.754337,0.534045,-2.19263,-1.465499,3.0
2013-01-05,0.837581,-0.34288,0.899612,0.670063,4.0
2013-01-06,0.340025,0.902787,0.886391,-0.79562,5.0


In [62]:
df.at[dates[0],'A'] = 0

# Missing data
Missing data handling is one of the nicest features of pandas:

In [None]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1


In [None]:
df1.dropna(how='any')

In [None]:
df1.fillna(value=5)
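
Since the cells in this section are shown without output, here is a small self-contained sketch of the same calls on a made-up frame (`small` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, np.nan, 6.0]})

mask = small.isna()                   # boolean frame, True where a value is missing
complete = small.dropna(how='any')    # keep only rows without any missing value
filled = small.fillna(value=5)        # replace every missing value with 5
```

Here only the last row survives `dropna(how='any')`, because the other two rows each contain at least one `NaN`.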

# Operations

Operations generally do the right things in the face of missing data!

In [None]:
df.mean()

In [None]:
df.mean(axis=1)
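
As a small illustration of what "the right thing" means here, the sketch below compares the pandas default with a plain NumPy mean. The series `vals` is made up for this example.

```python
import numpy as np
import pandas as pd

vals = pd.Series([1.0, np.nan, 3.0])

pandas_mean = vals.mean()                # NaN is skipped: (1 + 3) / 2
strict_mean = vals.mean(skipna=False)    # opt out: NaN propagates
numpy_mean = np.mean(vals.to_numpy())    # plain NumPy also propagates NaN
```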

# Time series

In [None]:
# Lowercase frequency aliases: 'S' and '5Min' are deprecated in recent pandas versions
rng = pd.date_range('1/1/2012', periods=100, freq='s')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5min').sum()

In [None]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts_utc = ts.tz_localize('UTC')

In [None]:
ts_utc.tz_convert('US/Eastern')

In [None]:
rng = pd.date_range('1/1/2012', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

# Data alignment
Data alignment based on the index, combined with missing data handling, is another of the most useful features of pandas.

In [None]:
dates = pd.date_range('1/1/2012', periods=10, freq='D')
ts = pd.Series(np.random.randint(0, 10, len(dates)), index=dates)
ts

In [None]:
ts[:5] + ts

In [None]:
dates1 = pd.date_range('1/1/2012', periods=10, freq='D')
ts1    = pd.Series(np.random.randint(0, 10, len(dates1)), index=dates1)
dates2 = pd.date_range('1/5/2012', periods=10, freq='D')
ts2    = pd.Series(np.random.randint(0, 10, len(dates2)), index=dates2)
ts1 + ts2
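
Because the cells above use random numbers, the following sketch repeats the idea with small fixed values (the series `a` and `b` are made up) so the result can be checked by hand. It also shows `Series.add` with `fill_value`, one possible way to treat a missing partner as zero instead of producing `NaN`.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=pd.date_range('1/1/2012', periods=3, freq='D'))
b = pd.Series([10, 20, 30], index=pd.date_range('1/2/2012', periods=3, freq='D'))

# Values are matched by date, not by position. The result covers the union
# of both indexes and is NaN where only one series has a value.
total = a + b

# Treat a missing partner as 0 instead.
filled = a.add(b, fill_value=0)
```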

# Plotting
We'll do this later (day 3), but for now...

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

# Reading data in pandas
There is extensive support for automatically reading data in various formats. Let's use the same dataset as last time and see what else we can do with it.

In [None]:
iris = pd.read_csv('../data/iris.data',
                   names=('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'type'))
iris.describe()
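
If the data file is not at hand, the same call can be tried on an in-memory string. This is a minimal sketch: the four rows in `text` are only a small sample in the layout of `iris.data`, and `io.StringIO` stands in for the file path.

```python
import io
import pandas as pd

# A few rows in the same comma-separated layout, without a header line.
text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

cols = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'type']
sample = pd.read_csv(io.StringIO(text), names=cols)

# Two things that are awkward with a bare NumPy array:
counts = sample['type'].value_counts()                        # rows per type
mean_by_type = sample.groupby('type')['sepallength'].mean()   # mean per type
```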

# Exercises 04
These exercises are taken with pride from https://github.com/ajcr/100-pandas-puzzles

## DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**4.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [None]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']


**5.** Display a summary of the basic information about this DataFrame and its data.

**6.** Return the first 3 rows of the DataFrame `df`.

**7.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

**8.** Select the data in rows `[3, 4, 8]` *and* in columns `['animal', 'age']`.

**9.** Select only the rows where the number of visits is greater than 3.

**10.** Select the rows where the age is missing, i.e. is `NaN`.

**11.** Select the rows where the animal is a cat *and* the age is less than 3.

**12.** Select the rows where the age is between 2 and 4 (inclusive).

**13.** Change the age in row 'f' to 1.5.

**14.** Calculate the sum of all visits (the total number of visits).

**15.** Calculate the mean age for each different animal in `df`.

**16.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.

**17.** Count the number of each type of animal in `df`.

**18.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order.

**19.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

**20.** In the 'animal' column, change the 'snake' entries to 'python'.

**21.** For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

## DataFrames: beyond the basics

### Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: *medium*

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

**22.** You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

**23.** Given a DataFrame of numeric values, say
```python
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
```

how do you subtract the row mean from each element in the row?

**24.** Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? (Find that column's label.)

**25.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

The next three puzzles are slightly harder...

**26.** You have a DataFrame that consists of 10 columns of floating-point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the *column* which contains the *third* NaN value.

(You should return a Series of column labels.)

**27.** A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For example:

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.

**28.** A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.