# pandas - a Pythonic interface

**Dr. Kristian Rother** 

[www.academis.eu](http://www.academis.eu)

![Academis Logo](images/academis_logo.png)

# Ovogena lano-lakto-porko

![Wollmilchsau](images/wollmilchsau.png)

(Esperanto for *"eierlegende Wollmilchsau"*, the German "egg-laying wool-milk-sow": one tool that is supposed to do everything)

*photograph by Georg Mittenecker [kamelopedia.mormo.org](http://kamelopedia.mormo.org/index.php/Datei:Wollmilchsau.jpg) CC BY-SA 2.5*

# Sometimes you need to make a compromise

![Platypus](images/platypus.jpg)

*image by Heinrich Harder, Public Domain [wikimedia commons](https://commons.wikimedia.org/w/index.php?curid=2425503)*

# The Dataset of U.S. Baby Names

![Ewa](images/baby.png)

**all names given at least 5 times per year since 1880**

available at [www.ssa.gov/oact/babynames/limits.html](http://www.ssa.gov/oact/babynames/limits.html)

# Reading a dataset with pandas

In [None]:
import pandas as pd

names = []
PATH = 'names'

for year in range(1880, 2015):
    fn = '{}/yob{}.txt'.format(PATH, year)
    
    data_frame = pd.read_csv(fn, names=['name', 'gender', 'count'])
    data_frame['year'] = year
    
    names.append(data_frame)

*`pandas` looks like the average Python library so far.*
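The `names=` argument matters because the yearly files are assumed here to have no header row, only lines of the form `name,gender,count`. A minimal sketch with made-up sample lines (not real figures from the dataset), read from an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Made-up sample in the assumed file layout: no header row, three columns
sample = io.StringIO("Anna,F,100\nMary,F,50\nJohn,M,80\n")

# names= supplies the column labels that the file itself does not contain
df = pd.read_csv(sample, names=["name", "gender", "count"])

# Assigning a scalar to a new column repeats it for every row
df["year"] = 1880
```

Without `names=`, `read_csv` would treat the first data line as the header.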

# Reading a dataset with pandas

In [None]:
names = pd.concat(names)
names[:10]

# Statistics for girls' names
*boolean expressions inside an index?*

In [None]:
def findname(df, name): 
    return df[df['name']==name].sort_values(by='year')

girls = names[names.gender=='F']
findname(girls, "Khaleesi")
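What happens inside the brackets can be taken apart on a toy table. The sketch below uses made-up rows in the same column layout; the variable names are only illustrative:

```python
import pandas as pd

# Tiny made-up table in the same layout as the baby names data
toy = pd.DataFrame({
    "name": ["Anna", "Ben", "Anna", "Cleo"],
    "gender": ["F", "M", "F", "F"],
    "count": [10, 7, 12, 5],
    "year": [1990, 1990, 1991, 1991],
})

# The comparison yields a boolean Series, one True/False per row
mask = toy["name"] == "Anna"

# Indexing with that Series keeps only the rows where it is True
annas = toy[mask]

# Conditions combine with & (and) and | (or); each one needs parentheses
anna_1991 = toy[(toy["name"] == "Anna") & (toy["year"] == 1991)]
```

So `df[df['name']==name]` is two steps written as one: build a mask, then filter with it.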

# Statistics for boys' names
*the double square bracket is not a typo!*

In [None]:
boys = names[names.gender=='M']

tyrion = findname(boys, "Tyrion")
tyrion = tyrion[["year", "count"]]
tyrion = tyrion.set_index('year')
tyrion.transpose()
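The outer brackets do the indexing and the inner brackets are an ordinary Python list of column names. A small sketch on made-up data shows how the result type differs:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Anna", "Anna"],
    "year": [1990, 1991],
    "count": [10, 12],
})

# A single string selects one column and returns a Series
one_column = toy["count"]

# A list of strings selects several columns and returns a DataFrame
two_columns = toy[["year", "count"]]

# A one-element list still returns a DataFrame, not a Series
still_a_frame = toy[["count"]]
```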

# Like a Prayer

In [None]:
madonna = findname(girls, "Madonna")

![Madonna](images/madonna.png)

# Total population
*group, select, sum, slice all in one*

In [None]:
names.groupby('year')['count'].sum()[::20]
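The one-liner chains four operations. Written out step by step on a made-up table (the intermediate names are only for illustration), it might look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1992],
    "count": [10, 7, 12, 5, 3],
})

grouped = toy.groupby("year")   # group: one group per distinct year
counts = grouped["count"]       # select: only the 'count' column of each group
totals = counts.sum()           # sum: one total per year, indexed by year
every_other = totals[::2]       # slice: every second entry of the result
```

The result of the aggregation is a Series indexed by year, which is why the final slice steps through years.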

# Names with four e's
*apply a function and create a new column*

In [None]:
def eeee(x):
    return x.lower().count('e') == 4

names['eeee'] = names['name'].apply(eeee)
names[names['eeee']]['name'][:3]
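For simple string tests, the `.str` accessor offers an alternative to `apply` with a hand-written function. A sketch on a few example names, comparing both routes:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Genevieve", "Ben", "Ebenezer", "Anna"]})

def eeee(x):
    return x.lower().count("e") == 4

# Route 1: call the Python function once per value
via_apply = toy["name"].apply(eeee)

# Route 2: chain the string methods on the whole column
via_str = toy["name"].str.lower().str.count("e") == 4

# Either boolean Series can become a new column and a row filter
toy["eeee"] = via_str
four_e_names = toy[toy["eeee"]]["name"]
```

Note that `.str.count` interprets its argument as a regular expression, which makes no difference for a plain letter like `"e"`.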

# First character preference: boy/girl ratio

In [None]:
names['first_char'] = names['name'].str[0]

mrc = names[names.gender=='M'].groupby('first_char')['count'].sum()
frc = names[names.gender=='F'].groupby('first_char')['count'].sum()
ratio = mrc / frc
ratio[:10]
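The division works even if the two Series do not contain the same letters, because pandas matches values by index label rather than by position. A minimal sketch with made-up totals:

```python
import pandas as pd

# Made-up totals per first letter; the boys' Series has no entry for "C"
boys_by_letter = pd.Series({"A": 10, "B": 30})
girls_by_letter = pd.Series({"A": 20, "B": 10, "C": 5})

# Values are paired up by label; a label missing on one side gives NaN
ratio = boys_by_letter / girls_by_letter

# The method form can treat a missing label as 0 instead
ratio_filled = boys_by_letter.div(girls_by_letter, fill_value=0)
```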

# Conclusions
## Pro pandas
* powerful expressions in a few lines
* built on NumPy, so it stays fast with millions of rows
* copes with gaps in data well
* integration with scikit-learn
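As an illustration of the point about gaps: missing values are represented as `NaN`, and aggregations skip them by default. A short sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up counts with one missing value
counts = pd.Series([5.0, np.nan, 7.0])

total = counts.sum()              # NaN is skipped by default
n_missing = counts.isna().sum()   # count the gaps
without_gaps = counts.dropna()    # drop the missing entries
zero_filled = counts.fillna(0)    # or replace them with a default
```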

## Con pandas
* syntax is a bit obscure at times
* steep learning curve

# Don't try using all features at the same time!

![Milk](images/milch-junkie.jpg)

### Contact

Email: `krother@academis.eu`

Twitter: `@k_rother`

In [None]:
# generating Madonna plot
madonna = madonna.set_index('year')
%matplotlib inline
madonna.plot()