In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
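# 'savefig.dpi' can be the string 'figure' in newer matplotlib, so double the numeric figure dpi instead
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['figure.dpi']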

# Pandas

Pandas is Python's answer to R.  It's a good tool for small(ish) data analysis -- i.e. when everything fits into memory.

The basic new "noun" in pandas is the **data frame**.

## Nouns (objects) in Pandas

### Data Frames

Like a table, with rows and columns (e.g. as in SQL).  Except:
  - The rows can be indexed by something interesting (there is special support for labels like categorical and timeseries data).  This is especially useful when you have timeseries data with potentially missing data points.
  - Cells can store Python objects. Like in SQL, columns are type homogeneous.
  - Instead of "NULL", the name for a non-existent value is "NA".  Unlike R, pandas data frames traditionally only support NAs in columns of some data types (basically: floating point numbers and 'objects') -- but this is mostly a non-issue (because pandas will "up-cast" integers to float64, etc.)
  
### Data Series
These are the named columns of a DataFrame (more correctly, a DataFrame behaves like a dictionary of Series).  The entries of a Series have a homogeneous type.
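
A quick way to check both points is the sketch below. It uses a toy frame (the names `toy`, `col` and `with_na` are made up for illustration) to show that selecting a column gives a Series, and that an integer column is up-cast to `float64` once it has to hold an NA.

```python
import numpy as np
import pandas as pd

# Toy data; the variable and column names are illustrative only.
toy = pd.DataFrame({'number': [1, 2, 3], 'animal': ['cat', 'dog', 'mouse']})

# Each column of a DataFrame is a Series.
col = toy['number']
print(type(col).__name__)   # Series
print(col.dtype.kind)       # i  (an integer dtype)

# Reindexing adds a row with no data, so the integers are
# up-cast to floating point to make room for NaN.
with_na = col.reindex([0, 1, 2, 3])
print(with_na.dtype)        # float64
print(with_na.isnull().sum())
```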

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# a data frame
df1 = pd.DataFrame({
    'number': [1, 2, 3],
    'animal': ['cat', 'dog', 'mouse']
})

# the same data frame
df2 = pd.DataFrame([
    ('cat', 1),
    ('dog', 2),
    ('mouse', 3),
], columns=['animal', 'number'])

# compare with the columns in the same order; `==` requires identical labels
assert (df1[df2.columns] == df2).all().all()

# a series
print(df1['animal'])

## Verbs (operations) in Pandas
  
Pandas provides a "batteries-included" basic data analysis:
  - **Loading data:** `read_csv`, `read_table`, `read_sql`, and `read_html`
  - **Selection, filtering, and aggregation** (i.e. SQL-type operations): There's a special syntax for `SELECT`ing.  There's the `merge` method for `JOIN`ing.  There's also an easy syntax for what in SQL is a mouthful: creating a new column whose value is computed from another column -- with the bonus that the computation can use the full power of Python (though arbitrary Python code will usually run slower than the built-in vectorized operations).
  - **"Pivot table" style aggregation:** If you're an Excel power user, you may appreciate this.
  - **NA handling:** Like R's data frames, there is good support for transforming NA values with default values / averaging tricks / etc.
  - **Basic statistics:** e.g. `mean`, `median`, `max`, `min`, and the convenient `describe`.
  - **Plugging into more advanced analytics:** Okay, this isn't batteries included.  But still, it plays reasonably with `sklearn`.
  - **Visualization:** For instance `plot` and `hist`.
  
We'll go through a little of each of these in the context of an example.  To follow along, you must have the (output) data files from the HMDA "Project structure" example.  We're going to explore a dataset of mortgage insurance issued by the Federal Housing Administration (FHA).  The data is broken down by census tract and tells us how big a player the FHA is in each tract (how many homes etc.).

## Loading data (and basic statistics / visualization)

In [None]:
names = ["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
df = pd.read_csv('small_data/fha_by_tract.csv', names=names)  ## Loading a CSV file, without a header (so we have to provide field names)

df['GEOID'] = df['Census_Tract_Number']*100 + 10**6 * df['County_Code'] \
    + 10**9 * df['State_Code']   ## A computed field!

df2 = df.sort_values('State_Code')  # sorting data to make it easier to read (`sort` was removed from newer pandas)
df2.head()
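
The loading step and the computed `GEOID` field can be tried without the data files. The sketch below is only an illustration: it reads two made-up rows from an in-memory string (the values and the name `toy` are assumptions, not taken from the real file) and applies the same arithmetic.

```python
import io
import pandas as pd

# Two made-up rows in a headerless layout: state, county, tract.
raw = io.StringIO("1,1,201.0\n1,3,9613.0\n")
toy = pd.read_csv(raw, names=['State_Code', 'County_Code', 'Census_Tract_Number'])

# Same arithmetic as the GEOID column: state, county and tract
# are packed into a single number.
toy['GEOID'] = (toy['Census_Tract_Number'] * 100
                + 10**6 * toy['County_Code']
                + 10**9 * toy['State_Code'])

print(toy['GEOID'].astype('int64').tolist())   # [1001020100, 1003961300]
```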

In [None]:
## To drop a column

df.drop('GEOID', axis=1).head()

In [None]:
# Tip: for large dataframes, df.sample(n) can speed up plotting/processing by subsampling the rows.

print("Basic statistics and a histogram of the percentage of mortgages")
print("in each census tract insured by FHA")
print(df['PCT_AMT_FHA'].describe())
df['PCT_AMT_FHA'].hist(bins=50, alpha=0.5)

In [None]:
print("Skewed distributions are often easier to read on a log scale; let's look at the log of AMT_ALL")
print("We can save off the data into a new column")
df['LOG_AMT_ALL'] = np.log1p(df['AMT_ALL'])
print(df['LOG_AMT_ALL'].describe())

print("We can use the apply function to transform data")
df['AMT_ALL'].apply(np.log1p).hist(bins=50)

## Selecting data

The main ways to select data are plain `df[...]` indexing, label-based `.loc`, and position-based `.iloc`.

In [None]:
print("Selecting off a column")
print(df['State_Code'].head())
print("")
print("Selecting off multiple columns")
print(df[['State_Code', 'County_Code']].head())
print("")
print("programmatically access column names")
print(all([col for col in df] == df.columns))

In [None]:
df_simple = df.dropna()

# to select both row and column, use `.loc`
print(df_simple.loc[3, 'State_Code'])

# to index columns by their integer position (instead of the label), use `.iloc`
print((df_simple.loc[:,'State_Code'] == df_simple.iloc[:,0]).all())

# older pandas also had `.ix`, which accepted both labels and positions;
# it was deprecated and then removed in pandas 1.0, so use `.loc` / `.iloc` instead
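
The difference between labels and positions is easiest to see when the row labels are not simply 0, 1, 2. The following is a minimal sketch on a toy frame (the data and the name `toy` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {'animal': ['cat', 'dog', 'mouse'], 'number': [1, 2, 3]},
    index=[10, 20, 30],   # labels deliberately differ from positions
)

print(toy.loc[20, 'animal'])   # by label: dog
print(toy.iloc[1, 0])          # by position: dog
print(toy.iloc[0]['number'])   # first row by position: 1
```

Here `toy.loc[1, 'animal']` would fail with a `KeyError`, because there is no row labelled 1.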

## Filtering data

Now the `df[...]` notation is very flexible:
  - It accepts column names (strings and lists of strings);
  - It accepts slices, which select rows by position (e.g. `df[0:5]`);
  - It accepts _boolean data series!_
  
This means that you can write

        df[ (df['column_name2']=='MD') & ( (df['column_name1']==5) | (df['column_name1']==6) ) ]
   
for what you would write in SQL as

        SELECT * FROM df WHERE
            column_name2='MD' AND (column_name1=5 OR column_name1=6)

Note the parentheses around each comparison: in Python, `&` and `|` bind more tightly than `==`.
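
As a runnable version of that filter, here is a small sketch on a made-up frame that reuses the placeholder column names from the text. The `isin` variant is an equivalent spelling offered as a suggestion, not something the example above requires.

```python
import pandas as pd

toy = pd.DataFrame({
    'column_name1': [5, 6, 7, 5],
    'column_name2': ['MD', 'MD', 'MD', 'VA'],
})

# Each comparison gets its own parentheses before combining with & and |.
mask = (toy['column_name2'] == 'MD') & (
    (toy['column_name1'] == 5) | (toy['column_name1'] == 6)
)
selected = toy[mask]

# isin() is a shorter way to write the OR over a single column.
same = toy[(toy['column_name2'] == 'MD') & toy['column_name1'].isin([5, 6])]

print(selected.index.tolist())   # [0, 1]
print(selected.equals(same))     # True
```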

In [None]:
print("Comparison returns a boolean Series ..")
print((df['State_Code'] == 1).head())
print("")
print("... we can apply the usual boolean operators to it")
print(((df['State_Code'] == 1) & (df['Census_Tract_Number'] == 9613)).head())
print("")
print("pandas indices take boolean lists of the appropriate length")
print(df[df['State_Code'] == 5].head())
print("")
print("p.s. - numpy indices also take boolean arguments: here are some odd numbers")
print(np.arange(10)[np.arange(10) % 2 == 1])
print("")

**Note:** selecting rows with a boolean Series only works if that Series shares the same index as the data frame!
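
To make the note concrete, the sketch below (toy data, hypothetical names `good_mask` and `bad_mask`) builds one mask from the frame itself and one with a mismatched index. It assumes a reasonably recent pandas, where the mismatched mask is rejected with an exception.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

# A mask built from the frame itself shares its index, so this works.
good_mask = toy['x'] > 1
print(toy[good_mask].index.tolist())   # ['b', 'c']

# A mask whose index is missing some of the frame's labels cannot be aligned.
bad_mask = pd.Series([True, False], index=['a', 'b'])
try:
    toy[bad_mask]
    raised = False
except Exception:
    # pandas refuses to guess what to do with row 'c'
    raised = True

print(raised)
```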

**Exercise:**
1. Plot the histogram of percentages for different states in the same graph to compare them.
2. Notice that there is a spike at 100%.  This means that the FHA has insured 100% of the houses in that census tract.  See what happens to the histogram when we restrict it to the case where the total number of loans is non-negligible.

## Joining data

The analogue of

    SELECT * 
        FROM df1
        INNER JOIN df2 
        ON df1.field_name=df2.field_name;

is

    df_joined = df1.merge(df2, on='field_name')

You can also do left / right / outer joins, mix-and-match column names, etc.  For that consult the Pandas documentation. (The example below will do a left join.)
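
As a rough illustration of how the join types differ, here is a sketch on two made-up frames (the names `left`, `right`, `left_val` and `right_val` are placeholders) that share the key column `field_name`:

```python
import pandas as pd

left = pd.DataFrame({'field_name': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'field_name': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = left.merge(right, on='field_name')                # default is how='inner'
left_join = left.merge(right, on='field_name', how='left')
outer = left.merge(right, on='field_name', how='outer')

print(len(inner), len(left_join), len(outer))   # 2 3 4

# Rows of `left` with no match are kept in a left join and padded with NA.
print(left_join['right_val'].isnull().sum())    # 1
```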

Of course, just looking at the distribution of insurance by census tract isn't interesting unless we know more about the census tract.

In [None]:
# Loading information about census tracts
df_geo = pd.read_csv('small_data/2013_Gaz_tracts_national.tsv', sep='\t')

print("joining two dataframes")
df_joined = df.merge(df_geo, on='GEOID', how='left')
df_joined.sort_values('AMT_ALL', ascending=False).head()

## Aggregating data

The analog of SQL's `GROUP BY` is

    grouped = df.groupby(['field_name1', ...])...

The above is analogous to

    SELECT AVG(df.value1), STDDEV(df.value2) 
        FROM df
        GROUP BY df.field_name1, ...

Pandas is somewhat more flexible in how you can use grouping, not requiring you to specify an aggregation function up front.  A few examples are:


In [None]:
# This isn't a SQL-style 'GROUP BY'.
df_joined.groupby('USPS').first().head()

In [None]:
# This is the analog of
# SELECT USPS, SUM(AMT_FHA), SUM(AMT_ALL), ... FROM df GROUP BY USPS;
# Selecting several columns from a groupby needs a list (double brackets) in newer pandas.
df_by_state = df_joined.groupby('USPS')[['AMT_FHA', 'AMT_ALL', 'NUM_FHA', 'NUM_ALL']].sum()

df_by_state['PCT_AMT_FHA'] = 100.0 * df_by_state['AMT_FHA']  / df_by_state['AMT_ALL']
df_by_state['PCT_NUM_FHA'] = 100.0 * df_by_state['NUM_FHA']  / df_by_state['NUM_ALL']

# This sure looks different than the census-tract level histogram!
df_by_state['PCT_AMT_FHA'].hist(bins=20)
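
The `GROUP BY` analogy from this section can also be tried on a toy frame. The sketch below reuses the placeholder names `field_name1`, `value1` and `value2` from the text; the data is made up, and passing a dictionary to `agg` is just one of several ways to ask for a different aggregate per column.

```python
import pandas as pd

toy = pd.DataFrame({
    'field_name1': ['a', 'a', 'b', 'b'],
    'value1': [1.0, 3.0, 10.0, 20.0],
    'value2': [2.0, 4.0, 5.0, 5.0],
})

# Roughly: SELECT field_name1, AVG(value1), STDDEV(value2) ... GROUP BY field_name1
summary = toy.groupby('field_name1').agg({'value1': 'mean', 'value2': 'std'})
print(summary)

# The grouped object can be kept around and aggregated later.
grouped = toy.groupby('field_name1')
print(grouped.size().tolist())   # [2, 2]
```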

## Handling missing and NA data


When you read in a CSV file / SQL database there are often "NA" (or "null", "None", etc.) values.  The CSV reader has a special field for specifying how this is denoted, and SQL has the built-in notion of NULL.  Pandas provides some tools for working with these -- they are generally similar to (and a little bit worse than) `R`.

Note that these methods are by default not in place -- that is, they create a new series and do not change the original one.

For more details: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
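
A minimal sketch of the "not in place" behaviour, using a made-up three-element Series (the names `s`, `filled` and `dropped` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)     # returns a new Series
dropped = s.dropna()     # likewise

print(s.isnull().sum())   # 1  (the original still has its NA)
print(filled.tolist())    # [1.0, 0.0, 3.0]
print(len(dropped))       # 2
print(s.mean())           # 2.0  (NaN is skipped by default)
```

To keep a result, assign it back, e.g. `s = s.fillna(0)`.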

In [None]:
print(df['GEOID'][:10])
print(df['GEOID'].size)

In [None]:
# isnull() / notnull() test for null-ness and return a Boolean series

df['GEOID'].isnull()[:10]

In [None]:
# dropna() removes the rows with null data

df['GEOID'].dropna().size

In [None]:
# fillna() replaces N/A values with something else
# interpolate() replaces null values by (linear, or quadratic, or...) interpolation
# There is support for indexing by times (not necessarily equally spaced), etc. in the documentation.

df['fill_0'] = df['GEOID'].fillna(0)                        # Fills constant value, here 0
df['fill_forward'] = df['GEOID'].ffill()                       # Fill forwards
df['fill_back'] = df['GEOID'].bfill(limit=5)                   # Fill backwards, at most 5
df['fill_mean'] = df['GEOID'].fillna(df['GEOID'].mean())       # Fills constant value, here the mean (imputation)
df['fill_interp'] = df['GEOID'].interpolate()                    # Fills interpolated value
df[['fill_0', 'fill_forward', 'fill_back', 'fill_mean', 'fill_interp']][:10]

### Note
N/A values are (usually) smartly ignored when performing other calculations on dataframes. For example, when using string methods on series:

In [None]:
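# Convert to text but keep the NAs as NAs (a plain apply(str) would turn them into the string 'nan')
pct = df['PCT_NUM_FHA'].replace(0, np.nan)
text_series = pct.astype(str).where(pct.notnull())
print(text_series[:10])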

In [None]:
text_series[:10].str.split('.')

## Manipulating strings

In [None]:
states = df_joined['USPS'].dropna()
states[states.str.contains('A')].head()

### Pandas HTML data import example

Pandas takes a "batteries included" approach and throws in a whole lot of convenience functions.  For instance it has import functions for a variety of formats.  One of the pleasant surprises is a command `read_html` that's meant to automate the process of extracting tabular data from HTML.  In particular, it works pretty well with tables on Wikipedia.  

Let's do an example: We'll try to extract the list of the world's tallest buildings from
http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world


(This example will, likely, not get lecture time.)

In [None]:
dfs = pd.read_html('http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world', header=0, parse_dates=False)

# There are several tables on the page.  By inspection we can figure out which one we want
tallest = dfs[2]

# The coordinates column needs to be fixed up.  This is a bit of string parsing:
def clean_lat_long(s):
    try:
        parts = s.split("/")
    except AttributeError:
        return (None, None)
    if len(parts)<3:
        return (None, None)
    m=re.search(r"(\d+[.]\d+);[^\d]*(\d+[.]\d+)[^\d]", parts[2])
    if not m:
        return (None, None)
    return (m.group(1), m.group(2))

tallest['Clean_Coordinates'] = tallest['Coordinates'].apply(clean_lat_long)
tallest['Latitude'] = tallest['Clean_Coordinates'].apply(lambda x:x[0])
tallest['Longitude'] = tallest['Clean_Coordinates'].apply(lambda x:x[1])

# Et voila
tallest.head()

In [None]:
tallest.info()

### Some remarks
Unfortunately, that didn't work well on height (ft) and is so-so on year built.  

In both cases, it is because it is assuming that the field is a _date_ rather than just a number.  We can hint these things to `read_html` through extra parameters.

In this particular case we would probably not bother: It is pretty good on the year built, and height in meters is good enough.

**Exercise**

1. Parse the height correctly.
1. Parse the ranking tables of [UK universities available on Wikipedia](https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom).

## "Pivot table" syntax, stacking and unstacking

These are useful tools -- and if you're already familiar with Pivot Tables in Excel, for example, they'll be familiar. See http://pandas.pydata.org/pandas-docs/stable/reshaping.html for more information.

In general pivot tables are useful when you want to look at aggregations but change what the dependent and independent variables are.
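
As a small illustration of swapping the roles of the variables, the sketch below pivots the same made-up data two ways. The column names `state`, `year` and `loans` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['MD', 'MD', 'VA', 'VA'],
    'year': [2012, 2013, 2012, 2013],
    'loans': [10, 20, 30, 40],
})

# Rows by state, columns by year ...
by_state = pd.pivot_table(toy, index='state', columns='year',
                          values='loans', aggfunc='sum')
# ... or the other way around, from the same data.
by_year = pd.pivot_table(toy, index='year', columns='state',
                         values='loans', aggfunc='sum')

print(by_state.loc['MD', 2013])   # 20
print(by_year.loc[2012, 'VA'])    # 30
```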

In [None]:
# create a pivot table from a dataframe

pt = pd.pivot_table(df, index='State_Code', values=['NUM_ALL', 'NUM_FHA'], aggfunc='sum')
pt.head()

The stack() and unstack() methods are used to manage MultiIndexes, and can control whether you have a row MultiIndex, a column MultiIndex, or neither:

In [None]:
grouped = df[['State_Code', 'County_Code', 'NUM_ALL', 'NUM_FHA']].groupby(['State_Code', 'County_Code']).mean()
grouped.head()

In [None]:
grouped.unstack().head()

In [None]:
unstacked = grouped.unstack(level=['County_Code', 'State_Code'])
unstacked.head()
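
The round trip between a row MultiIndex and columns is easier to follow on a tiny example. This is a sketch with made-up numbers; the names `state`, `county` and `num` are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': [1, 1, 2, 2],
    'county': [1, 3, 1, 3],
    'num': [10, 20, 30, 40],
}).set_index(['state', 'county'])

# unstack() moves the inner row level ('county') into the columns.
wide = toy['num'].unstack()
print(wide.shape)          # (2, 2)
print(wide.loc[2, 3])      # 40

# stack() moves it back, giving a Series with a row MultiIndex again.
back = wide.stack()
print(list(back.index.names))   # ['state', 'county']
print(back.tolist())            # [10, 20, 30, 40]
```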

## Plugging into more advanced analytics

Almost any "advanced analytics" tool in the Python ecosystem takes NumPy arrays (`np.ndarray`) as input.  You can access the underlying array of a data frame column as

        df['column'].values
        
Tools that expect a two-dimensional array can be given the data of the whole data frame, which is accessed by

        df.values
        
directly.  *Most* of the time, they will accept `df['column']` and `df` as they are, without you needing to reach for `.values`.
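
A short sketch of what `.values` returns, on a made-up two-column frame (the names `toy`, `col` and `mat` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

col = toy['a'].values      # 1-d numpy array
mat = toy.values           # 2-d numpy array, one row per data frame row

print(type(col).__name__, col.shape)   # ndarray (3,)
print(mat.shape)                       # (3, 2)

# Many numpy functions also accept the pandas objects directly.
print(float(np.mean(toy['b'])))        # 5.0
```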

This is particularly important if you want to use Pandas with the sklearn library. See this [blog post](http://www.markhneedham.com/blog/2013/11/09/python-making-scikit-learn-and-pandas-play-nice/) for an example.

### Exit Tickets
1. Describe several situations where you would prefer to use Pandas, and several where you'd prefer SQL.
1. Write the syntax for slicing the rows of a dataframe according to a filtering criterion.
1. Write the syntax for selecting, adding, or removing columns of a dataframe.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*