In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Lecture 3C - Group By, Aggregation Operations and Shaping Data*
---


### Content

1. Group-by operations
2. Pivot tables
3. Cross tabulation

 

\* Content in this notebook is based on the material in the "Python for Data Analysis" book by Wes McKinney, chapter 9.

### Learning Outcomes

At the end of this lecture, you should be able to:

* explain the group by mechanism
* split pandas objects into groups using one or more keys
* compute group summary statistics
* perform cross tabulation
* generate pivot tables

---

In [1]:
from IPython.display import HTML, IFrame
IFrame("http://pandas.pydata.org/pandas-docs/dev/groupby.html", width=1100, height=500)



Categorizing a data set and applying operations to each group, be it **transformations or aggregations, is frequently a critical component of a data analysis workflow**. 

After loading, merging, and preparing a data set, a **common task is to compute group statistics or possibly pivot tables for reporting or visualization purposes**. 

**Pandas provides a flexible and high-performance groupby facility**, enabling you to slice and dice, and summarize data sets in a natural way.

One reason for the popularity of relational databases and SQL (which stands for “structured query language”) is the **ease with which data can be joined, filtered, transformed, and aggregated**. However, query languages like **SQL suffer from limitations in the kinds of group operations that can be performed**. 

With the expressiveness and power of Python and pandas, **much more complex grouping operations can be executed** by utilizing any function that accepts a pandas object or NumPy array.

Most off-the-shelf software like Excel can, to some degree, perform highly valued operations such as pivot tables. However, Excel has serious limitations. The greatest of these is memory. Even the latest Excel suite can only handle about 1M rows (1,048,576) per worksheet, although the Power Pivot add-in extends this. Pandas provides the **ability to flexibly handle millions of rows of spreadsheet data up to the maximum capacity of your PC's on board RAM**. 

# 1. Group-By operations

One of the most powerful features of Pandas is its **GroupBy** functionality. On occasion we may want to perform operations on *groups* of observations within a dataset. For example:

* **aggregation**, such as computing the sum or mean of each group, which involves applying a function to each group and returning the aggregated results
* **slicing** the DataFrame into groups and then doing something with the resulting slices (*e.g.* plotting)
* group-wise **transformation**, such as standardization/normalization
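
The difference between aggregation and transformation can be sketched on a small made-up table. The `scores` frame and its column names below are illustrative assumptions, not part of the lecture data: an aggregation returns one value per group, whereas a transformation returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group statistics are easy to check by hand
scores = pd.DataFrame({
    "team": ["x", "x", "x", "y", "y", "y"],
    "points": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0],
})

by_team = scores.groupby("team")["points"]

# Aggregation: one value per group
team_means = by_team.mean()

# Transformation: one value per original row, standardised within its own group
zscores = by_team.transform(lambda s: (s - s.mean()) / s.std())

print(team_means)
print(zscores)
```

Note that `zscores` keeps the index of `scores`, so it can be assigned straight back as a new column.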

A common data analysis procedure is the **split-apply-combine** operation, which **groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table**.

For example, we may want to aggregate our data with some function.

![split-apply-combine](http://f.cl.ly/items/0s0Z252j0X0c3k3P1M47/Screen%20Shot%202013-06-02%20at%203.04.04%20PM.png)

<div align="right">*(figure source "Python for Data Analysis", p.252)*</div>
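
The three steps can also be written out by hand, which may help to see what `groupby` does for us. This is only a sketch on made-up data (the `sales` frame is an assumption for illustration), not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical data with three groups of two rows each
sales = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [0, 5, 10, 5, 10, 15],
})

# Split: one sub-frame per distinct key
pieces = {key: sales[sales["key"] == key] for key in sales["key"].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece["data"].sum() for key, piece in pieces.items()}

# Combine: reassemble the per-group results into one labelled Series
manual = pd.Series(sums, name="data").sort_index()
manual.index.name = "key"

# The same computation expressed with groupby
shortcut = sales.groupby("key")["data"].sum()

print(manual)
print(shortcut)
```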

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import rcParams

%matplotlib inline

In [3]:
# Set some Pandas options as you like
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 30)

In [4]:
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20

In [5]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.796999,2.097353
1,a,two,0.314729,-1.741249
2,b,one,-0.013359,-0.285786
3,b,two,-0.150975,0.230213
4,a,one,-0.251338,1.045979


Suppose you wanted to compute the mean of the data1 column using the grouped labels
from key1. There are a number of ways to do this. One is to slice data1 and call
groupby with the column (a Series) at key1:

In [6]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001944516A908>

This grouped variable is now a GroupBy object. It has not actually computed anything
yet except for some intermediate data about the group key df['key1']. 

The idea is that this object has all of the information needed to then apply some operation to each of
the groups. 

We can find out which groups we have, and which index labels belong to each group:

In [7]:
grouped.groups

{'a': Int64Index([0, 1, 4], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64')}

### Iterating over groups

In order to inspect which rows have been associated with grouping values, we need to iterate through each group.

In [8]:
for key, value in grouped:
    print(key, type(value))
    print(value)

a <class 'pandas.core.series.Series'>
0    0.796999
1    0.314729
4   -0.251338
Name: data1, dtype: float64
b <class 'pandas.core.series.Series'>
2   -0.013359
3   -0.150975
Name: data1, dtype: float64
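
Because iteration yields (key, group) pairs, the pieces can be collected into a dictionary when you want to look at one group at a time. A minimal sketch, using a small stand-in frame (`df_demo` is an assumption, with fixed numbers instead of the random values above):

```python
import pandas as pd

# Stand-in for the lecture's df, with fixed values so the result is reproducible
df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Iterating a GroupBy yields (key, sub-frame) pairs, which dict() accepts directly
pieces = dict(list(df_demo.groupby("key1")))

# Each value is the sub-frame holding the rows of that group
print(pieces["b"])
```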


Now that we have our groupings, we can compute operations on the groups using various built-in methods. Here is an example of computing the mean of each group:

In [9]:
grouped.mean()

key1
a    0.286796
b   -0.082167
Name: data1, dtype: float64

We could have taken a shortcut and calculated the mean and returned a Series object:

In [None]:
df['data1'].groupby(df['key1']).mean()

or alternatively, select the columns first and group by column name, which returns a DataFrame rather than a Series:

In [None]:
df[['data1','key1']].groupby('key1').mean()

**Exercise**: Perform group by on 'key2' and calculate the sum of values on the 'data2' column. 

**Exercise:** Using df above, display the result of a group by operation on key1, showing the count/number of values that have been aggregated:

We sometimes want to aggregate all columns in a data frame at once, whilst grouping the data on one or more columns. This can be done as follows:

In [11]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.796999,2.097353
1,a,two,0.314729,-1.741249
2,b,one,-0.013359,-0.285786
3,b,two,-0.150975,0.230213
4,a,one,-0.251338,1.045979


In [10]:
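# numeric_only=True skips the string column 'key2'; recent pandas versions
# raise a TypeError instead of dropping it silently
gr = df.groupby('key1').mean(numeric_only=True)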
gr

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.286796,0.467361
b,-0.082167,-0.027787


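Notice that the returned object in this instance is a DataFrame, since every remaining column was aggregated, and that the 'key2' column is left out because a mean cannot be computed on strings. Older pandas versions dropped such columns silently; from pandas 2.0 onwards you must ask for this explicitly with `numeric_only=True`, otherwise a `TypeError` is raised.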

### Applying multiple operations on groups


Pandas comes with a number of built-in and optimised operations which can be applied simultaneously to the groups using the `agg` method.

Optimized groupby methods:

**Function name = Description**
- count = Number of non-NA values in the group
- sum = Sum of non-NA values
- mean = Mean of non-NA values
- median = Arithmetic median of non-NA values
- std, var = Unbiased (n - 1 denominator) standard deviation and variance
- min, max = Minimum and maximum of non-NA values
- prod = Product of non-NA values
- first, last = First and last non-NA values

In addition, we can define our own functions to apply to groups:

In [None]:
def my_length(x):
    return len(x)

In [None]:
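# Select the numeric columns so that every aggregation is well defined
# (recent pandas versions raise an error when asked for the mean of strings)
grouped = df.groupby('key1')[['data1', 'data2']]
result = grouped.agg(['sum','count','mean','median','min','max','std','var', my_length])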
result
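
When only a few statistics are needed, pandas also accepts keyword arguments of the form `new_name=(column, function)` in `agg`, which gives flat, readable column names. The sketch below uses a stand-in frame and a custom function; `df_demo`, `value_range` and the output column names are all assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the lecture's df, with fixed values so the result is reproducible
df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def value_range(x):
    """Custom aggregation: spread between the largest and smallest value."""
    return x.max() - x.min()

# Each keyword names an output column: (input column, function or function name)
summary = df_demo.groupby("key1").agg(
    data1_mean=("data1", "mean"),
    data1_range=("data1", value_range),
    data2_total=("data2", "sum"),
)
print(summary)
```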

As seen from the above example, we can define our own functions and apply them to the groups:

In [None]:
grouped.apply(my_length)

In [None]:
grouped

### Grouping on multiple columns

We could also perform group by on multiple columns:

In [None]:
df

In [None]:
grouped = df.groupby(['key1','key2'])
grouped.groups

Notice that in this case the group keys are no longer single values but tuples, with one element per grouping column. Aggregating such a grouping produces what is called a **hierarchical index**. Each group is accessible using the combination of the two keys:

In [None]:
grouped.groups['a','one']

We can use these index values to access the actual values in the original dataframe

In [None]:
# groups stores index labels, so select with .loc (label-based) rather than .iloc
df.loc[grouped.groups['a','one']]
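
The rows of a single group can also be fetched directly with `get_group`, which avoids going through the index labels. A small sketch on a stand-in frame (`df_demo` and `grouped_demo` are assumptions, using fixed values):

```python
import pandas as pd

# Stand-in for the lecture's df, with fixed values so the result is reproducible
df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped_demo = df_demo.groupby(["key1", "key2"])

# Group keys are tuples, one element per grouping column
print(list(grouped_demo.groups))

# get_group returns the rows of one group as a DataFrame
a_one = grouped_demo.get_group(("a", "one"))
print(a_one)
```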

We now create a new dataframe that has the mean values of all the groups

In [None]:
df_grouped = grouped.mean()
df_grouped

Make a quick plot of this data:

In [None]:
df_grouped.plot(kind='bar')

### Shaping and reshaping data with **hierarchical indexes**

Hierarchical indexes are created when we group data frames using multiple columns. They are very useful as they allow us to reshape our data.

Consider the dataframe object which was generated from the example above:

In [None]:
df_grouped

In order to  reshape data (hierarchically structured dataframes) we have functions **`stack`** and **`unstack`** available to us.

        stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new 
        inner most level of row labels.
        
        unstack: inverse operation from stack: “pivot” a level of the (possibly hierarchical) row index to the column axis, 
        producing a reshaped DataFrame with a new inner-most level of column labels.

The above dataframe already has a hierarchical row index, as shown by its two index levels ('key1' and 'key2'). We can stack it one level further by taking the column labels and making them into the innermost level of the row index:

In [None]:
stacked_df = df_grouped.stack()
pd.DataFrame(stacked_df)

In [None]:
df_grouped

Alternatively, we can unstack the original dataframe by taking the inner-most index ('key2') and moving it into the column axis:

In [None]:
unstacked_df = df_grouped.unstack()
unstacked_df
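
Since `unstack` is the inverse of `stack`, applying one after the other should bring the data back to its starting shape. The sketch below checks this on a small hand-built frame; `wide` and its values are assumptions chosen to mimic the layout of `df_grouped`.

```python
import pandas as pd

# Hand-built frame with the same layout as df_grouped: two row levels, two columns
wide = pd.DataFrame(
    {"data1": [1.0, 2.0, 3.0, 4.0], "data2": [10.0, 20.0, 30.0, 40.0]},
    index=pd.MultiIndex.from_product(
        [["a", "b"], ["one", "two"]], names=["key1", "key2"]
    ),
)

# stack: the column labels become a new innermost row level, giving a Series
long = wide.stack()

# unstack: the innermost row level moves back to the column axis
round_trip = long.unstack()

print(long)
print(round_trip)
```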

In [None]:
df_grouped

We can slice into dataframes and perform stacking/unstacking on selected columns:

In [None]:
s = df_grouped['data1']
s

The above is now a Series with a hierarchical index, which we can reshape into a DataFrame with the help of the `unstack` method:

In [None]:
df2 = s.unstack()
df2

**Exercise**: Stack the df_grouped dataframe so that data1 and data2 now become rows and store this result in a new dataframe. Now unstack this new dataframe by making key1 into columns. Look up the `unstack` documentation to see how you can specify explicitly which index level to unstack.

#### Removing hierarchical indexes from group by operations

Sometimes we do not want to preserve the hierarchical index. We can stop the group by operation from generating it by setting the as_index argument to False:

In [None]:
df

In [None]:
df_grouped_mean = df.groupby(['key1','key2'], as_index=False).mean()
df_grouped_mean

Or...

In [None]:
df.groupby(['key1','key2']).mean().reset_index()
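
Both spellings should give the same flat table, with the group keys as ordinary columns. A quick check on a stand-in frame (`df_demo` is an assumption, using fixed values):

```python
import pandas as pd

# Stand-in for the lecture's df, with fixed values so the result is reproducible
df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Option 1: never build the hierarchical index
flat_a = df_demo.groupby(["key1", "key2"], as_index=False).mean()

# Option 2: build it, then move the index levels back into columns
flat_b = df_demo.groupby(["key1", "key2"]).mean().reset_index()

print(flat_a)
print(flat_b)
```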

---

---

# 2. Pivot Tables

A **pivot table is a data summarization tool** that is very much like the group by facility (and in pandas is actually powered by group by under the hood). This operation is routinely performed in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables are often regarded as one of Excel's greatest strengths.

Pandas' groupby facility, together with the reshape operations utilizing hierarchical indexing seen earlier, enables pivot table functionality.

DataFrame has a dedicated `pivot_table` method, and additionally there is a top-level `pandas.pivot_table` function.
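
The relationship between the two approaches can be sketched on a tiny made-up table: a pivot table of means should match a group by on two keys followed by `unstack`. The `orders` frame and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: two grouping columns and one numeric column
orders = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north", "south"],
    "product": ["tea", "coffee", "tea", "coffee", "tea", "tea"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
})

# Pivot table: regions on the rows, products on the columns, mean by default
via_pivot = orders.pivot_table(values="amount", index="region", columns="product")

# Same result built by hand: group on both keys, then move 'product' to the columns
via_groupby = orders.groupby(["region", "product"])["amount"].mean().unstack()

print(via_pivot)
print(via_groupby)
```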

To illustrate these features, we will use a less trivial dataset on restaurant tipping called "tips.csv". It comes from the R reshape2 package and was originally found in Bryant & Smith's 1995 text on business statistics (it is also available in the book's GitHub repository).

In [None]:
df_tips = pd.read_csv("../datasets/tips.csv")
df_tips.head()

Suppose we wanted to compute a table of group means (the default  pivot_table aggregation type) arranged by  sex and  smoker on the rows:

In [None]:
df_tips.pivot_table(index=['sex', 'smoker'])

This could have been easily produced using  groupby as well.

**Exercise:** Perform the above using groupby.

However, as the perspective which you would like to view data from becomes more complex, additional constructs become helpful.

Now suppose we want to aggregate only tip_percent and size (the size of the party at the table), with the aggregations split between smokers and non-smokers, and that we additionally want to group by sex and day. 

In [None]:
df_tips.pivot_table(values=['tip_percent', 'size'], index=['sex', 'day'], columns='smoker')

**Exercise:** Create a pivot table that aggregates only tip_percent by mean, with the aggregations split between dinners and lunches and grouped by sex and smoker. 

Mean is the default aggregation performed by pivot tables. The `pivot_table` options are:

Option name Description
- values = Column name or names to aggregate. By default aggregates all numeric columns
- index = Column names or other group keys to group on the rows of the resulting pivot table
- columns = Column names or other group keys to group on the columns of the resulting pivot table
- aggfunc = Aggregation function or list of functions; 'mean' by default. Can be any function valid in a groupby context
- fill_value = Replace missing values in result table
- margins = Add row/column subtotals and grand total, False by default

In [None]:
df_tips.pivot_table(values='tip_percent', index=['sex', 'smoker'], columns='day', aggfunc=len, margins=True)
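
The effect of `fill_value` and `margins` is easiest to see on a table with an empty cell. The sketch below uses a small made-up `orders` frame (an assumption, not the tips data) in which one region/product combination never occurs.

```python
import pandas as pd

# Hypothetical data: the south region has no coffee orders
orders = pd.DataFrame({
    "region": ["north", "north", "south", "north"],
    "product": ["tea", "coffee", "tea", "tea"],
    "amount": [10.0, 20.0, 30.0, 50.0],
})

table = orders.pivot_table(
    values="amount",
    index="region",
    columns="product",
    aggfunc="max",
    fill_value=0,   # the missing south/coffee cell would otherwise be NaN
    margins=True,   # adds an "All" row and column computed with the same aggfunc
)
print(table)
```

The "All" entries are computed from the underlying rows with the chosen `aggfunc`, so here they hold maxima rather than totals.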

**Exercise:** Create a pivot table that sums up the total bills for each day and splits them by smoker.

---

---

# 3. Cross-Tabulation

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.

The first two arguments to `crosstab` can each be an array, a Series, or a list of arrays. For example, with the tips data:

In [None]:
pd.crosstab([df_tips.time, df_tips.day], df_tips.smoker)
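
To see why a crosstab is described as a frequency table, it can be compared with counting group sizes by hand. A minimal sketch on made-up data (the `visits` frame is an assumption standing in for two columns of the tips data):

```python
import pandas as pd

# Hypothetical stand-in for two categorical columns of the tips data
visits = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "smoker": ["No", "Yes", "No", "No", "No"],
})

# Frequency table: rows are 'time', columns are 'smoker'
counts = pd.crosstab(visits["time"], visits["smoker"])

# The same counts from groupby: size of each group, then reshape;
# fill_value=0 covers combinations that never occur
same_counts = visits.groupby(["time", "smoker"]).size().unstack(fill_value=0)

print(counts)
print(same_counts)
```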

**Exercise:** Cross-tabulate the tips data in order to count the number of tips for dinners and lunches, while grouping by sex and smoker and showing all the margins too. 

---

---

# Exercises

Exercises below will use a geographic dataset sourced from http://www.geonames.org/countries/ called "country_info.csv".

Load the above dataset into a dataframe and inspect the data.

Use this dataset to answer the exercises below.

**Exercise** Group the dataset by continent, showing the total population in each one. Plot the result.

**Exercise** Group the dataset by CurrencyName, showing the total surface Area that each one covers. Sort the result by size and display the top ten currencies by land surface area. Plot the result.

**Exercise** Perform the same aggregation as above, this time using Population instead of land Area.

**Exercise** Use the pivot table to display information for currencies in Europe only. Group the data by continent and currency name, showing only the results from Europe. Aggregate by using summation for the Area and the Population variables and show the margins.

**Exercise**: Cross-tabulate the dataset to find out the counts for currency names across Europe and Americas together with the margins.

**Exercise**: On the Stream site, you will find links to a number of repositories. 

1. Go through these repositories and identify two datasets on which you can practise merging and concatenating.
2. Apply appropriate group by operations on this data in order to draw out insights.
3. Apply combinations of appropriate group by operations as well as pivots and cross-tabulation on this data in order to draw out insights.