## Pandas dataframes

Pandas dataframes extend NumPy 2D arrays by labeling the columns and attaching an explicit index to the rows.
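As a minimal sketch of that idea, the same 2D data can be held as a bare NumPy array or wrapped in a DataFrame that attaches labels. The array contents and the labels `row_a`, `row_b`, `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: a plain 2D array only knows integer positions.
raw = np.array([[1.0, 2.0],
                [3.0, 4.0]])

# Wrapping it in a DataFrame attaches labels to the rows (index) and columns.
labeled = pd.DataFrame(raw, index=['row_a', 'row_b'], columns=['x', 'y'])

print(raw[1, 0])                  # positional access on the array
print(labeled.loc['row_b', 'x'])  # label-based access to the same element
```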

In [1]:
import numpy as np
import pandas as pd

In [2]:
pop2014 = pd.Series([100,99.3,95.5,93.5,92.4,84.8,84.5,78.9,74.3,72.8],
                    index=['Java','C','C++','Python','C#','PHP','JavaScript','Ruby','R','Matlab'])

In [3]:
pop2015 = pd.Series({'Java': 100,'C': 99.9,'C++': 99.4,'Python': 96.5,'C#': 91.3,
                     'R': 84.8,'PHP': 84.5,'JavaScript': 83.0,'Ruby': 76.2,'Matlab': 72.4})

1. Making DataFrames from Series objects, Python dicts, and NumPy arrays
2. Setting indexes
3. Selecting, combining and creating columns
4. Performing database-style operations like relational joins on dataframes

From the two series, we can create a dataframe. The indexes will be matched automatically.

In [7]:
twoyears = pd.DataFrame({'2014': pop2014, '2015': pop2015})
# keys will be used as column names

In [5]:
twoyears

,2014,2015
C,99.3,99.9
C#,92.4,91.3
C++,95.5,99.4
Java,100.0,100.0
JavaScript,84.5,83.0
Matlab,72.8,72.4
PHP,84.8,84.5
Python,93.5,96.5
R,74.3,84.8
Ruby,78.9,76.2
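The automatic index matching can be seen more directly when the two series do not share all their labels. The sketch below uses small made-up series; a label missing from one series should show up as NaN in that column.

```python
import pandas as pd

# Two small Series with partly different labels, listed in different orders.
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'x'])

# Rows are matched by label, not by position; 'z' has no value in b.
aligned = pd.DataFrame({'a': a, 'b': b})
print(aligned)
```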


In [6]:
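# We can sort the dataframe by the values in one of the columns.
# sort_values returns a sorted copy, so twoyears itself keeps its original order.
# (Older Pandas versions offered DataFrame.sort for this; it has since been removed.)

twoyears.sort_values('2015', ascending=False)

,2014,2015
Java,100.0,100.0
C,99.3,99.9
C++,95.5,99.4
Python,93.5,96.5
C#,92.4,91.3
R,74.3,84.8
PHP,84.8,84.5
JavaScript,84.5,83.0
Ruby,78.9,76.2
Matlab,72.8,72.4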
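In current Pandas, sorting by a column is done with `sort_values`, and sorting by the row labels with `sort_index`. The following sketch uses three of the rows from the table as a stand-in; the names `scores`, `by_2015` and `by_label` are illustrative.

```python
import pandas as pd

scores = pd.DataFrame({'2014': [99.3, 100.0, 93.5],
                       '2015': [99.9, 100.0, 96.5]},
                      index=['C', 'Java', 'Python'])

# sort_values returns a new, sorted DataFrame; scores itself is unchanged.
by_2015 = scores.sort_values('2015', ascending=False)

# sort_index orders the rows by their labels instead of by a column.
by_label = by_2015.sort_index()

print(by_2015)
print(by_label)
```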

Since Pandas is built on top of NumPy, the data held in a DataFrame can be extracted as a NumPy array.
We can get it by asking for the attribute, "values". The row and column labels are not part of the result.

In [8]:
twoyears.values

array([[  99.3,   99.9],
       [  92.4,   91.3],
       [  95.5,   99.4],
       [ 100. ,  100. ],
       [  84.5,   83. ],
       [  72.8,   72.4],
       [  84.8,   84.5],
       [  93.5,   96.5],
       [  74.3,   84.8],
       [  78.9,   76.2]])
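Besides the "values" attribute, Pandas also provides the `to_numpy()` method for the same purpose. A small sketch, using the first two rows of the table as sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'2014': [99.3, 92.4], '2015': [99.9, 91.3]},
                  index=['C', 'C#'])

arr = df.to_numpy()   # same data as df.values, without the labels
print(type(arr), arr.shape)
```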

In [9]:
#We can extract the index and the names of the columns

twoyears.index

Index(['C', 'C#', 'C++', 'Java', 'JavaScript', 'Matlab', 'PHP', 'Python', 'R',
       'Ruby'],
      dtype='object')

In [10]:
twoyears.columns

Index(['2014', '2015'], dtype='object')

In [11]:
#indexing the dataframe with brackets naturally gives a column

twoyears['2014']

C              99.3
C#             92.4
C++            95.5
Java          100.0
JavaScript     84.5
Matlab         72.8
PHP            84.8
Python         93.5
R              74.3
Ruby           78.9
Name: 2014, dtype: float64

In [12]:
# To select a subset of rows from the dataframe, use iloc (by integer position) or loc (by label)

twoyears.iloc[0:2]

,2014,2015
C,99.3,99.9
C#,92.4,91.3


In [13]:
twoyears.loc['C':'Java']

,2014,2015
C,99.3,99.9
C#,92.4,91.3
C++,95.5,99.4
Java,100.0,100.0
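One difference between the two is easy to miss: a positional slice with iloc excludes its end point, as in ordinary Python slicing, while a label slice with loc includes it. A short sketch on a four-row stand-in for the table:

```python
import pandas as pd

df = pd.DataFrame({'2014': [99.3, 92.4, 95.5, 100.0]},
                  index=['C', 'C#', 'C++', 'Java'])

first_two = df.iloc[0:2]       # positions 0 and 1; position 2 is excluded
c_to_cpp = df.loc['C':'C++']   # labels 'C' through 'C++'; the end label is included

print(first_two)
print(c_to_cpp)
```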


In [14]:
# We can do numerical operations on entire columns and store the result in a new column

twoyears['avg'] = 0.5*(twoyears['2014'] + twoyears['2015'])

twoyears

,2014,2015,avg
C,99.3,99.9,99.6
C#,92.4,91.3,91.85
C++,95.5,99.4,97.45
Java,100.0,100.0,100.0
JavaScript,84.5,83.0,83.75
Matlab,72.8,72.4,72.6
PHP,84.8,84.5,84.65
Python,93.5,96.5,95.0
R,74.3,84.8,79.55
Ruby,78.9,76.2,77.55
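The same pattern works for other column arithmetic. As a sketch, the block below computes the year-over-year change for two of the rows, and also shows `assign`, which returns a new DataFrame instead of modifying the existing one. The column name `change` and the variable `with_avg` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'2014': [93.5, 74.3], '2015': [96.5, 84.8]},
                  index=['Python', 'R'])

# Arithmetic between columns is applied row by row, matched on the index.
df['change'] = df['2015'] - df['2014']

# assign returns a new DataFrame with the extra column and leaves df untouched.
with_avg = df.assign(avg=df[['2014', '2015']].mean(axis=1))

print(with_avg)
```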


There are more ways to make a DataFrame. For instance, we can describe each row as a Python dict that maps column names to values, and then pass a list of such dicts to pd.DataFrame.

In [15]:
presidents = pd.DataFrame([{'name': 'Barack Obama', 'inag': 2009, 'birthyear': 1961},
                           {'name': 'George W. Bush', 'inag': 2001, 'birthyear': 1946},
                           {'name': 'Bill Clinton', 'inag': 1993, 'birthyear': 1946}, 
                           {'name': 'George H. W. Bush', 'inag': 1989, 'birthyear': 1924}])

In [16]:
presidents

,birthyear,inag,name
0,1961,2009,Barack Obama
1,1946,2001,George W. Bush
2,1946,1993,Bill Clinton
3,1924,1989,George H. W. Bush
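The list-of-dicts form is row-oriented. An equivalent column-oriented construction, sketched here with the same data, passes a single dict whose keys are the column names and whose values are lists. The variable name `presidents_by_column` is illustrative.

```python
import pandas as pd

# Column-oriented construction: each dict key is a column, each list its values.
presidents_by_column = pd.DataFrame({
    'name': ['Barack Obama', 'George W. Bush', 'Bill Clinton', 'George H. W. Bush'],
    'inag': [2009, 2001, 1993, 1989],
    'birthyear': [1961, 1946, 1946, 1924],
})

print(presidents_by_column)
```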


In [17]:
#We can choose one of the columns to be used as the index

presidents_indexes = presidents.set_index('name')

In [18]:
presidents_indexes

name,birthyear,inag
Barack Obama,1961,2009
George W. Bush,1946,2001
Bill Clinton,1946,1993
George H. W. Bush,1924,1989
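Note that set_index returns a new DataFrame rather than changing the original, which is why the result is stored under a new name. A minimal sketch with a two-row table, also showing `reset_index` to turn the index back into an ordinary column:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bill Clinton', 'Barack Obama'],
                   'inag': [1993, 2009]})

indexed = df.set_index('name')     # df itself keeps its integer index
restored = indexed.reset_index()   # 'name' becomes a regular column again

print(indexed)
print(restored)
```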


In [19]:
presidents_indexes.loc['Bill Clinton']     #gives entire record for Clinton

birthyear    1946
inag         1993
Name: Bill Clinton, dtype: int64

In [20]:
#Then we can choose the right column

presidents_indexes.loc['Bill Clinton']['inag']

1993

In [21]:
#Or we could select the column first and then index the president we want

presidents_indexes['inag'] 

name
Barack Obama         2009
George W. Bush       2001
Bill Clinton         1993
George H. W. Bush    1989
Name: inag, dtype: int64

In [22]:
presidents_indexes['inag']['Bill Clinton'] 

1993
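Both lookups above chain two indexing steps. loc also accepts the row label and the column label together, which does the same lookup in one step, and `at` is a variant restricted to single values. A sketch on a two-row stand-in for the table:

```python
import pandas as pd

df = pd.DataFrame({'birthyear': [1946, 1961], 'inag': [1993, 2009]},
                  index=['Bill Clinton', 'Barack Obama'])

year = df.loc['Bill Clinton', 'inag']       # row label and column label in one lookup
same_year = df.at['Bill Clinton', 'inag']   # .at only ever returns a single value

print(year, same_year)
```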

In Pandas, joins are performed with the merge function.


In [23]:
presidents_fathers = pd.DataFrame([{'son': 'Barack Obama','father':'Barack Obama, Sr.'},
                                  {'son': 'George W. Bush','father':'George H. W. Bush'},
                                  {'son': 'George H. W. Bush','father':'Prescott Bush'}])

In [27]:
presidents_fathers

,father,son
0,"Barack Obama, Sr.",Barack Obama
1,George H. W. Bush,George W. Bush
2,Prescott Bush,George H. W. Bush


In [24]:
# We combine two tables by matching values between two columns. In this case, we match
# the name column of the first DataFrame against the son column of the second.
# We specify these columns with the left_on and right_on arguments to merge.


pd.merge(presidents, presidents_fathers,left_on='name',right_on='son')

,birthyear,inag,name,father,son
0,1961,2009,Barack Obama,"Barack Obama, Sr.",Barack Obama
1,1946,2001,George W. Bush,George H. W. Bush,George W. Bush
2,1924,1989,George H. W. Bush,Prescott Bush,George H. W. Bush


In [25]:
# The resulting table has a redundant column, son, which we can remove with the drop method.
# Passing axis=1 tells drop that 'son' is a column label rather than a row label.

pd.merge(presidents, presidents_fathers,left_on='name',right_on='son').drop('son',axis=1)

,birthyear,inag,name,father
0,1961,2009,Barack Obama,"Barack Obama, Sr."
1,1946,2001,George W. Bush,George H. W. Bush
2,1924,1989,George H. W. Bush,Prescott Bush


In [26]:
# Notice also that the join omitted the record for Bill Clinton, since it could not match him to a father.
# A slightly different kind of join, a so-called left join, keeps every record of the
# left table and fills in missing values (NaN) where there is no match.

pd.merge(presidents, presidents_fathers,left_on='name',right_on='son',how='left').drop('son',axis=1)


,birthyear,inag,name,father
0,1961,2009,Barack Obama,"Barack Obama, Sr."
1,1946,2001,George W. Bush,George H. W. Bush
2,1946,1993,Bill Clinton,NaN
3,1924,1989,George H. W. Bush,Prescott Bush
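merge supports other join types through the same how argument. As a sketch on cut-down versions of the two tables, an outer join keeps unmatched records from both sides, and indicator=True adds a `_merge` column recording where each row came from. The reduced tables and the name `fathers` are illustrative.

```python
import pandas as pd

presidents = pd.DataFrame({'name': ['Barack Obama', 'Bill Clinton']})
fathers = pd.DataFrame({'son': ['Barack Obama', 'George H. W. Bush'],
                        'father': ['Barack Obama, Sr.', 'Prescott Bush']})

# how='outer' keeps rows that appear in either table;
# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
merged = pd.merge(presidents, fathers, left_on='name', right_on='son',
                  how='outer', indicator=True)

print(merged[['name', 'son', '_merge']])
```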
