# Pandas

Pandas is Python's answer to R.  It's a good tool for small(ish) data analysis -- i.e., when everything fits into memory.

The basic new "noun" in pandas is the **data frame**.

## What is a data frame?

It's like a table, with rows and columns (e.g., as in SQL).  Except:
  - The rows can be indexed by something interesting (there is special support for labels like categorical and timeseries data).  This is especially useful when you have timeseries data with potentially missing data points.
  - Cells can store Python objects. (Like in SQL, columns are homogeneous.)
  - Instead of "NULL", the name for a non-existent value is "NA".  Unlike R, pandas data frames only support NAs in columns of some data types (basically: floating point numbers and 'objects'), but this is mostly a non-issue because pandas will "up-cast" integers to float64, etc.
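
The up-casting mentioned in the last point can be seen in a minimal sketch. The series and its labels below are made up for illustration; the only behavior relied on is that an integer series gains a missing value when it is reindexed onto a label it does not have.

```python
import pandas as pd

# A small integer-valued series with made-up labels
counts = pd.Series([3, 5, 8], index=["a", "b", "c"])

# Reindexing onto an extra label "d" leaves a hole, which pandas fills with NaN
with_gap = counts.reindex(["a", "b", "c", "d"])

print(counts.dtype)       # an integer dtype
print(with_gap.dtype)     # float64: the integers were up-cast to make room for NaN
print(with_gap.isnull())  # True only for the new label "d"
```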
  
Pandas provides a "batteries-included" basic data analysis:
  - **Loading data:** `read_csv`, `read_table`, `read_sql`, and `read_html`
  - **Selection, filtering, and aggregation** (i.e., SQL-type operations): There's a special syntax for `SELECT`ing.  There's the `merge` method for `JOIN`ing.  There's also an easy syntax for what in SQL is a mouthful: creating a new column whose value is computed from other columns, with the bonus that the computations can now use the full power of Python (though it might be faster if it didn't).
  - **"Pivot table" style aggregation**: If you're an Excel cognoscente, you may appreciate this.
  - **NA handling**: Like R's data frames, there is good support for transforming NA values with default values / averaging tricks / etc.
  - **Basic statistics:** e.g. `mean`, `median`, `max`, `min`, and the convenient `describe`.
  - **Plugging into more advanced analytics:** Okay, this isn't batteries included.  But still, it plays reasonably well with `sklearn`.
  - **Visualization:** For instance `plot` and `hist`.
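
Since pivot tables are not revisited in the example, here is a minimal sketch of the idea. The `loans` frame and its column names are invented for illustration and are not part of the FHA data set.

```python
import pandas as pd

# Hypothetical toy data: one row per loan
loans = pd.DataFrame({
    "state":  ["MD", "MD", "MD", "VA", "VA"],
    "year":   [2012, 2012, 2013, 2012, 2013],
    "amount": [100, 50, 120, 70, 90],
})

# One row per state, one column per year, each cell the summed amount
pivot = loans.pivot_table(index="state", columns="year",
                          values="amount", aggfunc="sum")
print(pivot)
```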
  
We'll go through a little on all of these in the context of an example.  To go through it, you must have the (output) data files from the HMDA "Project structure" example.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re

We're going to explore a dataset of mortgage insurance issued by the Federal Housing Administration (FHA).  The data is broken down by census tract and tells us how big a player the FHA is in each tract (how many mortgages it insures, for how much, and what share of the total that is).

## Loading data (and basic statistics / visualization)


In [None]:
names =["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
df = pd.read_csv('../small_data/fha_by_tract.csv', names=names)  ## Loading a CSV file, without a header (so we have to provide field names)

df['GEOID'] = df['Census_Tract_Number']*100 + 10**6 * df['County_Code'] \
    + 10**9 * df['State_Code']   ## A computed field!
    
df = df.sort_values('State_Code')  # sorting data to make it easier to read (the old `sort` method was removed from pandas)
df.head()
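
If the data files are not at hand, the same loading pattern can be tried on an in-memory CSV. This is only a sketch: the two rows below are invented, and only a subset of the real column names is used.

```python
import io
import pandas as pd

# Two made-up rows in a header-less CSV (not real FHA data)
raw = io.StringIO("1,1,201.0,10,2\n1,3,101.5,20,5\n")
names = ["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA"]

# Without a header row, the field names have to be supplied
toy = pd.read_csv(raw, names=names)

# The computed field: state, county and tract codes packed into one number
toy["GEOID"] = (toy["Census_Tract_Number"] * 100
                + 10**6 * toy["County_Code"]
                + 10**9 * toy["State_Code"])
print(toy)
```

The arithmetic runs on whole columns at once, so no explicit loop over rows is needed.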

In [None]:
# Basic statistics and a histogram of the percentage of mortgages
# in each census tract insured by FHA
print(df['PCT_AMT_FHA'].describe())
df['PCT_AMT_FHA'].hist(bins=50, alpha=0.5)

In [None]:
# The above distribution looks a little skewed.  Let's look at the log of AMT_ALL.

# We can save off the data into a new column
df['LOG_AMT_ALL'] = np.log(df['AMT_ALL'])
print(df['LOG_AMT_ALL'].describe())

# We can use the apply function to transform data
df['AMT_ALL'].apply(np.log).hist(bins=50)
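
The two styles in the cell above (calling a NumPy function on the column directly, and going through `apply`) give the same numbers. A small sketch with made-up values; the `lambda` is a hypothetical example of logic that has no ready-made vectorized form.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([1.0, 10.0, 100.0, 1000.0])   # made-up values

vectorized = np.log10(amounts)          # NumPy function applied to the whole column
elementwise = amounts.apply(np.log10)   # the same transformation via apply

# apply also accepts arbitrary Python functions, at the cost of speed
size_label = amounts.apply(lambda x: "large" if x >= 100 else "small")
print(size_label)
```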

## Selecting

In [None]:
# Selecting a single column
print(df['State_Code'].head())

# Selecting multiple columns
print(df[['State_Code', 'County_Code']].head())

# Programmatically accessing column names (iterating over a data frame yields them)
print([col for col in df])

## Filtering

This is a slightly fancied up version of Python's index notation.  When you write something like 

        df['column_name']==5

what's actually happening is that pandas creates a new, __Boolean__, data series indexed by the same indexing set as `df`.  You can combine such Boolean series using the element-wise `&` (`and`) and element-wise `|` (`or`) operations.  Because `&` and `|` bind more tightly than `==` in Python, each comparison needs its own parentheses.  For instance:

        (df['column_name2']=='MD') & ( (df['column_name1']==5) | (df['column_name1']==6) )

Now the `df[...]` notation is very flexible:
  - It accepts column names;
  - It accepts integer labels when the columns are labeled by integers (e.g., after `read_csv(..., header=None)`); for access by position use `df.iloc[:, 0]`;
  - It accepts _Boolean data series!_
  
This means that you can write

        df[ (df['column_name2']=='MD') & ( (df['column_name1']==5) | (df['column_name1']==6) ) ]
   
for what you would write in SQL as

>         SELECT * FROM df WHERE
            column_name2='MD' AND (column_name1=5 OR column_name1=6)
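
A runnable toy version of this filter, as a sketch. The frame and its column names are invented; the `isin` line is an equivalent spelling of the `OR` over values of the same column.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "state": ["MD", "MD", "VA", "MD"],
    "value": [5, 6, 5, 7],
})

# Each comparison gets its own parentheses, because & and | bind tighter than ==
mask = (toy["state"] == "MD") & ((toy["value"] == 5) | (toy["value"] == 6))
selected = toy[mask]

# Equivalent: isin() replaces the chain of ORs
same = toy[(toy["state"] == "MD") & toy["value"].isin([5, 6])]
print(selected)
```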

In [None]:
# A comparison returns a Boolean series ...
print((df['State_Code'] == 1).head())

# ... we can apply the usual Boolean operators to it
print(((df['State_Code'] == 1) & (df['Census_Tract_Number'] == 9613)).head())

# pandas indexing accepts a Boolean series of the appropriate length
# and keeps the rows where it is True
print(df[(df['State_Code'] == 1) & (df['Census_Tract_Number'] == 9613)].head())

## Join-ing

The analogue of a

>             
    SELECT * 
        FROM df1
        INNER JOIN df2 
        ON df1.field_name=df2.field_name;

is

    df_joined = df1.merge(df2, on='field_name')

You can also do left / right / outer joins, mix-and-match column names, etc.  For that, consult the pandas documentation.  (The example below does a left join.)
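
The difference between the default inner join and a left join can be sketched on two tiny frames. The values are made up; only the `GEOID` and `USPS` column names are borrowed from the real data.

```python
import pandas as pd

# Made-up frames sharing a GEOID key; GEOID 3 has no match on the right
left = pd.DataFrame({"GEOID": [1, 2, 3], "AMT_ALL": [100, 200, 300]})
right = pd.DataFrame({"GEOID": [1, 2, 4], "USPS": ["AL", "AK", "AZ"]})

inner = left.merge(right, on="GEOID")                  # keeps only keys present in both
left_join = left.merge(right, on="GEOID", how="left")  # keeps every row of `left`

print(inner)
print(left_join)  # the unmatched row gets NA in the USPS column
```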

Of course, just looking at the distribution of insurance by census tract isn't interesting unless we know more about the census tract.

In [None]:
# Loading information about census tracts
df_geo = pd.read_csv('../small_data/2013_Gaz_tracts_national.tsv', sep='\t')

# And join it in
df_joined = df.merge(df_geo, on='GEOID', how='left')
df_joined.sort_values('AMT_ALL', ascending=False).head()

## Aggregation

The analog of SQL's `GROUP BY` is

    grouped = df.groupby(['field_name1', ...])

The above is analogous to
>             
    SELECT * 
        FROM df
        GROUP BY df.field_name1, ...

Pandas is somewhat more flexible in how you can use grouping, not requiring you to specify an aggregation function up front.  A few examples are:


In [None]:
# This isn't a SQL-style 'GROUP BY': for each group it returns the first non-null entry of every column.
df_joined.groupby('USPS').first().head()

In [None]:
# This is the analog of
# SELECT USPS, SUM(AMT_FHA), SUM(AMT_ALL), ... FROM df GROUP BY USPS;
df_by_state = df_joined[['USPS', 'AMT_FHA', 'AMT_ALL', 'NUM_FHA', 'NUM_ALL']].groupby('USPS').sum()

df_by_state['PCT_AMT_FHA'] = 100.0 * df_by_state['AMT_FHA']  / df_by_state['AMT_ALL']
df_by_state['PCT_NUM_FHA'] = 100.0 * df_by_state['NUM_FHA']  / df_by_state['NUM_ALL']

# This sure looks different than the census-tract level histogram!
df_by_state['PCT_AMT_FHA'].hist(bins=20)
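
One reason the state-level picture differs is that a percentage computed from summed totals is not the average of the tract-level percentages. A sketch on invented numbers, which also shows one `groupby` object being reused for several aggregations:

```python
import pandas as pd

# Made-up tract-level rows (not real FHA data)
tracts = pd.DataFrame({
    "USPS":    ["MD", "MD", "VA"],
    "AMT_FHA": [10.0, 90.0, 20.0],
    "AMT_ALL": [100.0, 300.0, 80.0],
})
tracts["PCT_AMT_FHA"] = 100.0 * tracts["AMT_FHA"] / tracts["AMT_ALL"]

grouped = tracts.groupby("USPS")   # no aggregation chosen yet

sizes = grouped.size()                          # number of tracts per state
totals = grouped[["AMT_FHA", "AMT_ALL"]].sum()  # summed amounts per state
pct_from_totals = 100.0 * totals["AMT_FHA"] / totals["AMT_ALL"]
mean_of_pcts = grouped["PCT_AMT_FHA"].mean()    # average of tract percentages

print(pct_from_totals)
print(mean_of_pcts)
```

For the made-up "MD" rows, the large tract dominates the total, so the two numbers disagree.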

## NA handling


When you read in a CSV file or SQL database there are often "NA" (or "null", "None", etc.) values.  The CSV reader has a parameter (`na_values`) for specifying how these are denoted, and SQL has the built-in notion of NULL.  Pandas provides some tools for working with these; they are generally similar to (and a little bit worse than) those in `R`:

- `isnull` / `notnull`: Testing for null-ness e.g., 
>       
        df['column_name'].isnull()
        
   returns a Boolean series
- `fillna`: Replacing null values by something else, e.g.,
>         
        df['column_name'].fillna(0)        # Fills constant value, here 0
        df['column_name'].ffill()          # Fill forwards (older code: fillna(method='ffill'), now deprecated)
        df['column_name'].bfill(limit=5)   # Fill backwards, at most 5 consecutive values
        
    At least by default, this is *not in place* -- that is, it creates a new series and does not change the original one.

- `interpolate`: Replacing null values by (linear, or quadratic, or...) interpolation.  There is support for indexing by times (not necessarily equally spaced), etc. in the documentation.  The most basic usage is
>        
        df['column_name'].interpolate()
    
    As above, this is not in place.
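
The tools in this list can be compared side by side on a short series. This is a sketch with made-up values; `ffill` is used as the forward-fill spelling.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])   # made-up data with two holes

print(s.isnull())              # Boolean series marking the holes

filled_zero = s.fillna(0)      # constant fill
filled_fwd = s.ffill()         # carry the last seen value forwards
filled_lin = s.interpolate()   # linear interpolation between the neighbours

# None of the calls above modified `s` itself
print(s.isnull().sum())
```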


For more details: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
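
As for telling the CSV reader how missing values are written, a minimal sketch follows. The file contents and the two NA markers (`missing` and `-`) are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV in which missing amounts are written as "missing" or "-"
raw = io.StringIO("tract,amount\nA,10\nB,missing\nC,-\n")

# na_values lists extra strings that should be read as NA
df_na = pd.read_csv(raw, na_values=["missing", "-"])

print(df_na)
print(df_na["amount"].dtype)   # float64, since the column now contains NaN
```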

Pandas HTML data import example
================================
Pandas takes a "batteries included" approach and throws in a whole lot of convenience functions.  For instance, it has import functions for a variety of formats.  One of the pleasant surprises is a function, `read_html`, that's meant to automate the process of extracting tabular data from HTML.  In particular, it works pretty well with tables on Wikipedia.  

Let's do an example: We'll try to extract the list of the world's tallest buildings from
http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world


(This example will, likely, not get lecture time.)

In [None]:
dfs = pd.read_html('http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world')

# There are several tables on the page.  By inspection we can figure out which one we want
tallest = dfs[2]  

# The header appeared as the first row.  We can get at it via tallest[0:1].values, copy-paste-fix and get the next line:
tallest.columns=['Category', 'Structure', 'Country', 'City', 'Height (metres)', 'Height (ft)', 'Year built', 'Coordinates']
tallest = tallest[1:].copy()  # copy, so the column assignments below act on an independent frame

# The coordinates column needs to be fixed up.  This is a bit of string parsing:
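def clean_lat_long(s):
    # Non-string cells (e.g. NaN) have no .split method
    try:
        parts = s.split("/")
    except AttributeError:
        return (None, None)
    # Always return a pair, so that the x[0] / x[1] lookups below keep working
    if len(parts)<3:
        return (None, None)
    m=re.search(r"(\d+[.]\d+);[^\d]*(\d+[.]\d+)[^\d]", parts[2])
    if not m:
        return (None, None)
    return (m.group(1), m.group(2))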

tallest['Coordinates'] = tallest['Coordinates'].apply(clean_lat_long)
tallest['Latitude'] = tallest['Coordinates'].apply(lambda x:x[0])
tallest['Longitude'] = tallest['Coordinates'].apply(lambda x:x[1])

tallest.head()
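
A different technique that reaches a similar result without a helper function is the vectorized `str.extract`. The sketch below is not the method used above: the sample strings are hypothetical, the pattern is applied to the whole cell rather than to the part after the second `/`, and the results are converted to floats. Like the pattern above, it ignores minus signs.

```python
import pandas as pd

# Hypothetical cell contents in an "a / b / lat; long" layout
coords = pd.Series([
    "25 11 50 N 55 16 27 E / 25.19722 N 55.27417 E / 25.19722; 55.27417 (Burj Khalifa)",
    None,
    "no coordinates here",
])

# Each capture group becomes a column; rows without a match become NaN
lat_long = coords.str.extract(r"(\d+[.]\d+);[^\d]*(\d+[.]\d+)")
lat_long.columns = ["Latitude", "Longitude"]
lat_long = lat_long.astype(float)
print(lat_long)
```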