# Pandas

* The name pandas can roughly be read as "Python data analysis", although it originally came from "panel data".
* Its data is stored in numpy arrays, and many concepts carry over from numpy, but pandas offers a much nicer interface for data analysis, with more high-level support for typical data processing.
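
As a quick illustration of the numpy connection, here is a minimal sketch on a small made-up Series (the name `s_demo` and its values are illustrative only and are not taken from the pandas guide). The underlying values come back as a numpy array, and numpy functions apply element-wise while keeping the index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not part of the original tutorial
s_demo = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

values = s_demo.to_numpy()   # the underlying data as a numpy array
roots = np.sqrt(s_demo)      # numpy ufuncs work and keep the index labels

print(type(values))
print(roots)
```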

The examples below are taken directly from "10 minutes to pandas" in the pandas documentation.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
s = pd.Series([1,3,5,np.nan,6,8])

In [None]:
dates = pd.date_range('20130101', periods=6)

In [None]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [None]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })


In [None]:
df2.dtypes

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

In [None]:
df2.values

In [None]:
df.describe()

In [None]:
df.T

In [None]:
df.sort_values(by='B')

# Selection and indexing
Pandas selection is often more intuitive than numpy's, but that convenience sometimes comes at the cost of consistency.
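
To make the inconsistency concrete, here is a minimal sketch on a hypothetical frame `demo` (the frame and its column names are made up to expose the pitfalls). Plain brackets select columns for a label but rows for a slice, attribute access can be shadowed by a method of the same name, and `.loc` / `.iloc` state the intent explicitly.

```python
import pandas as pd

# Hypothetical frame whose column names are chosen to show the pitfalls
demo = pd.DataFrame({'A': [1, 2, 3], 'mean': [10, 20, 30]},
                    index=['x', 'y', 'z'])

col = demo['A']          # a single label selects a column
rows = demo[0:2]         # a slice selects rows by position instead

# Attribute access breaks down when a column name collides with a method:
attr = demo.mean         # this is the DataFrame.mean method, not the column
col_mean = demo['mean']  # bracket access always returns the column

# .loc (labels) and .iloc (positions) make the intent explicit
by_label = demo.loc['x':'y', 'A']   # label slices include both endpoints
by_pos = demo.iloc[0:2, 0]          # position slices exclude the end
```

When in doubt, `.loc` and `.iloc` are usually the safer choice, since they never switch between rows and columns depending on the argument type.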

In [None]:
# Select columns
df['A']      # same as df.A.  Problem?

In [None]:
# Select rows - note same syntax as above.  What happens if there is ambiguity?
df[0:3]

In [None]:
df['20130102':'20130104']

In [None]:
df.loc[dates[0]]

In [None]:
df.loc[:,['A','B']]


In [None]:
df.loc['20130102',['A','B']]

In [None]:
df.loc[dates[0],'A']


In [None]:
df.at[dates[0],'A']

In [None]:
df.iloc[3]

In [None]:
df.iloc[3:5,0:2]

In [None]:
df.iloc[:,1:3]

In [None]:
df.iloc[1,1]

In [None]:
df[df.A > 0]

In [None]:
df[df > 0]

In [None]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

In [None]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
df

In [None]:
df.at[dates[0],'A'] = 0

# Missing data
Missing data handling is one of the nicest features of pandas:

In [None]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1


In [None]:
df1.dropna(how='any')

In [None]:
df1.fillna(value=5)

# Operations

Operations generally do the right things in the face of missing data!
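
To make "the right thing" concrete: by default, reductions such as `mean` skip missing values instead of returning NaN. A minimal sketch with made-up data (the name `s_demo` is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s_demo = pd.Series([1.0, np.nan, 3.0])

skipped = s_demo.mean()              # NaN is ignored by default: (1 + 3) / 2
strict = s_demo.mean(skipna=False)   # NaN propagates when skipping is turned off
counted = s_demo.count()             # number of non-missing values

print(skipped, strict, counted)
```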

In [None]:
df.mean()

In [None]:
df.mean(axis=1)

# Time series

In [None]:
# Lowercase frequency aliases ('s', '5min') avoid deprecation warnings in recent pandas
rng = pd.date_range('1/1/2012', periods=100, freq='s')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5min').sum()

In [None]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts_utc = ts.tz_localize('UTC')

In [None]:
ts_utc.tz_convert('US/Eastern')

In [None]:
rng = pd.date_range('1/1/2012', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

# Data alignment
Data alignment based on the index, combined with missing-data handling, is another of the most useful features of pandas.
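
A minimal sketch of the idea, using two small hypothetical Series with partially overlapping labels: arithmetic matches values by label rather than by position, and labels present on only one side produce NaN unless a fill value is given.

```python
import pandas as pd

# Hypothetical series with partially overlapping labels
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

total = a + b                     # aligned on labels; unmatched labels give NaN
filled = a.add(b, fill_value=0)   # treat a label missing on one side as 0

print(total)
print(filled)
```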

In [None]:
dates = pd.date_range('1/1/2012', periods=10, freq='D')
ts = pd.Series(np.random.randint(0, 10, len(dates)), index=dates)
ts

In [None]:
ts[:5] + ts

In [None]:
dates1 = pd.date_range('1/1/2012', periods=10, freq='D')
ts1    = pd.Series(np.random.randint(0, 10, len(dates1)), index=dates1)
dates2 = pd.date_range('1/5/2012', periods=10, freq='D')
ts2    = pd.Series(np.random.randint(0, 10, len(dates2)), index=dates2)
ts1 + ts2

# Plotting
We'll do this later (day 3), but for now...

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

# Reading data in pandas
Pandas has extensive support for automatically reading data in various formats.  Let's use the same dataset as last time and see what more we can do with it.
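
As a sketch that runs without the data file, the snippet below reads a few sample rows in the same comma-separated, header-less layout from an in-memory buffer. The buffer, the name `iris_demo`, and the choice of rows are assumptions for illustration; the column names match the cell that follows.

```python
import io
import pandas as pd

# A few sample rows in the same header-less CSV layout as iris.data,
# held in memory so the sketch does not depend on the file being present
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

cols = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'type']
iris_demo = pd.read_csv(sample, names=cols)   # names= supplies the missing header

# Column types are inferred, and labelled columns make grouped summaries short
per_type = iris_demo.groupby('type')['petallength'].mean()
print(iris_demo.dtypes)
print(per_type)
```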

In [None]:
iris = pd.read_csv('../data/iris.data',
                   names=('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'type'))
iris.describe()

# Exercises 04
These exercises are taken with pride from https://github.com/ajcr/100-pandas-puzzles

## DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**4.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [None]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']


**5.** Display a summary of the basic information about this DataFrame and its data.

**6.** Return the first 3 rows of the DataFrame `df`.

**7.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

**8.** Select the data in rows `[3, 4, 8]` *and* in columns `['animal', 'age']`.

**9.** Select only the rows where the number of visits is greater than 3.

**10.** Select the rows where the age is missing, i.e. is `NaN`.

**11.** Select the rows where the animal is a cat *and* the age is less than 3.

**12.** Select the rows where the age is between 2 and 4 (inclusive).

**13.** Change the age in row 'f' to 1.5.

**14.** Calculate the sum of all visits (the total number of visits).

**15.** Calculate the mean age for each different animal in `df`.

**16.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.

**17.** Count the number of each type of animal in `df`.

**18.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order.

**19.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

**20.** In the 'animal' column, change the 'snake' entries to 'python'.

**21.** For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

## DataFrames: beyond the basics

### Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: *medium*

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

**22.** You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

**23.** Given a DataFrame of numeric values, say
```python
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
```

how do you subtract the row mean from each element in the row?

**24.** Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? (Find that column's label.)

**25.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

The next three puzzles are slightly harder...

**26.** You have a DataFrame that consists of 10 columns of floating-point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the *column* which contains the *third* NaN value.

(You should return a Series of column labels.)

**27.** A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For example:

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.

**28.** A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.