In [119]:
import pandas as pd
import numpy as np


Categorising data (for example, by passenger class) and applying a function to each group is a core operation in data analysis.


## Split-apply-combine
The prolific R developer Hadley Wickham, creator of ggplot2, coined the term split-apply-combine.
This technique was later brought to Python through pandas.


## Stages
1. Data contained in a pandas object is split into groups based on one or more keys. Splitting is done along a particular axis.
2. A function is then applied to each group, producing a new value.
3. The results of those function applications are then combined into a new result object.
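
The three stages can be sketched by hand, which helps show what groupby() automates. This is only an illustration on a small made-up frame (the names `toy`, `pieces`, `applied` and `combined` are hypothetical), not a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's df
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# 1. Split: one sub-Series per unique key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# 2. Apply: reduce each piece to a single value
applied = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
combined = pd.Series(applied, name="value")
combined.index.name = "key"
print(combined)

# The same aggregation expressed with groupby()
print(toy.groupby("key")["value"].mean())
```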


## Keys
The key can take several forms:
- A list or array of values that is the same length as the axis being grouped
- A value indicating a column name in a DataFrame
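
Both forms of key might look like this in practice. The frame `toy` and the array `labels` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# Key given as a column name in the DataFrame
by_name = toy.groupby("key")["value"].sum()

# Key given as an array with the same length as the axis being grouped;
# it does not have to be a column of the frame at all
labels = np.array(["x", "y", "x", "y"])
by_array = toy["value"].groupby(labels).sum()

print(by_name)
print(by_array)
```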

In [120]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None], "key2": pd.Series([1,2,1,2,1,None,1]), 
                   "data1" : np.random.standard_normal(7), "data2" : np.random.standard_normal(7)})
df

   key1  key2     data1     data2
0     a   1.0 -1.106851  0.274503
1     a   2.0  1.245186 -0.783344
2  None   1.0  1.193912  0.736214
3     b   2.0 -0.517307 -0.112067
4     b   1.0  0.198812 -0.123519
5     a   NaN  0.336315 -1.573983
6  None   1.0  0.764400 -0.315719


## Grouping
We want to calculate the mean of the data1 column for each value of key1.
That is, we want the mean of data1 where key1 == "a", and the mean of data1 where key1 == "b".

## Slow way
We can use boolean indexing to get these values.

In [121]:
print(df[df["key1"] == "a"]["data1"].mean())
print(df[df["key1"] == "b"]["data1"].mean())

0.158216725166237
-0.1592471824989751


## The groupby way
The concise alternative is to use groupby().
We begin by selecting the column we wish to operate on (plain column selection, not boolean indexing).
Next, we provide the key (here a column) on which to group the values.
The resulting variable, grouped, is a special GroupBy object.

In [122]:
grouped = df["data1"].groupby([df["key1"]])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000014B40160FA0>
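
Although the GroupBy object prints as an opaque reference, it can be inspected. The following sketch uses a small hypothetical frame `toy` rather than the notebook's df, and shows the number of groups and the (name, piece) pairs produced by iterating over the object.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
grouped = toy["value"].groupby(toy["key"])

# Number of distinct groups found in the key
print(grouped.ngroups)

# Iterating yields (group name, sub-Series) pairs
for name, piece in grouped:
    print(name, list(piece))
```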

## Applying a function to the groups
Our GroupBy object has not yet computed anything, except for some intermediate data about the group key df["key1"].
This object has all of the information needed to then apply an operation to each group. For this example, that operation is the arithmetic mean, mean().

In [123]:
grouped.mean()

key1
a    0.158217
b   -0.159247
Name: data1, dtype: float64

The output (a Series) has been aggregated by splitting the data on the group key, producing a new Series that is indexed by the unique values in key1.
The index of the returned Series has the name key1 because the grouping key, the DataFrame column df["key1"], carries that name.
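
Note that the rows where key1 is missing do not appear in the result: by default, groupby() excludes missing values in the group key. A small sketch with a hypothetical frame `toy`; the `dropna=False` argument (available in pandas 1.1 and later) keeps the missing key as a group of its own.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", None, "b", None], "value": [1.0, 2.0, 3.0, 4.0]})

# Default: rows with a missing key are left out of every group
default = toy.groupby("key")["value"].mean()

# dropna=False keeps the missing key as a separate group
with_missing = toy.groupby("key", dropna=False)["value"].mean()

print(len(default))       # 2 groups: a and b
print(len(with_missing))  # 3 groups: a, b and the missing key
```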

Passing multiple arrays as a list to groupby() allows us to group our DataFrame by multiple keys.

In [124]:
d = df["data1"].groupby([df["key1"], df["key2"]]).mean()
d

key1  key2
a     1.0    -1.106851
      2.0     1.245186
b     1.0     0.198812
      2.0    -0.517307
Name: data1, dtype: float64

## Multiple index levels
A closer inspection of the returned object reveals a Series with a
two-level index: key1 and key2.

Recall: index levels are not data columns; they label the values in our
Series/DataFrame.

In [125]:
d.index

MultiIndex([('a', 1.0),
            ('a', 2.0),
            ('b', 1.0),
            ('b', 2.0)],
           names=['key1', 'key2'])

## Accessing values in a MultiIndex

In [126]:
print(d["a"],"\n----")
print(d["a"][1])

key2
1.0   -1.106851
2.0    1.245186
Name: data1, dtype: float64 
----
-1.1068510012523922
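
The chained lookup d["a"][1] performs two separate selections. One alternative is a single lookup with a tuple of labels via .loc. The sketch below rebuilds a Series of the same shape with made-up values (the name `s` is hypothetical).

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 1.0), ("a", 2.0), ("b", 1.0), ("b", 2.0)], names=["key1", "key2"]
)
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=index)

# One label per level, passed together as a tuple
print(s.loc[("a", 1.0)])  # 10.0

# Outer label only: returns the inner Series for "a", indexed by key2
print(s.loc["a"])
```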


## Hierarchical indexing

Hierarchical indexing allows you to have multiple (two or more) index levels
on an axis.

You can think of it as a way to work with higher-dimensional data in a lower-dimensional form.

Here we pass a list of lists (or arrays) as the index of our Series:


In [127]:
data = pd.Series(
    np.random.uniform(size=9),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"], [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)
data

a  1    0.463076
   2    0.720582
   3    0.214094
b  1    0.039333
   3    0.756247
c  1    0.082883
   2    0.816770
d  2    0.761913
   3    0.328906
dtype: float64

## MultiIndex

In [128]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
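
The levels of this index have no names, unlike the index of the groupby result earlier. A sketch of attaching names, using hypothetical level names `outer` and `inner` and made-up values:

```python
import pandas as pd

s = pd.Series(
    [0.1, 0.2, 0.3, 0.4],
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

# Name the two levels after the fact
s.index.names = ["outer", "inner"]

# The same index built explicitly, names included
idx = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["outer", "inner"]
)

print(s.index.equals(idx))  # True
print(s.index.nlevels)      # 2
```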

## Accessing Data
Partial indexing is possible with a MultiIndex: selecting an outer label returns all rows under it.

In [129]:
# Partial indexing
data["a"]

1    0.463076
2    0.720582
3    0.214094
dtype: float64

Slices over the outer level work as well:

In [130]:
data["a":"b"]

a  1    0.463076
   2    0.720582
   3    0.214094
b  1    0.039333
   3    0.756247
dtype: float64

In [131]:
data.loc[["b","d"]]

b  1    0.039333
   3    0.756247
d  2    0.761913
   3    0.328906
dtype: float64

## Inner indexing
We can also select values by a label from the inner index level.

In [132]:
# Inner indexing
data.loc[:,2]

a    0.720582
c    0.816770
d    0.761913
dtype: float64
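
An alternative for selecting on an inner level is Series.xs() with its `level` argument. A minimal sketch on a hypothetical Series `s` with made-up values:

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "b", "c", "c"], [1, 2, 1, 2, 3]],
)

# Select every row whose inner label is 2; the inner level is dropped
via_loc = s.loc[:, 2]
via_xs = s.xs(2, level=1)

print(via_loc)
print(via_xs)
```

Both calls select the same rows here; xs() can be easier to read when the index has more than two levels, because the level is named explicitly.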

## Stacking and unstacking
Hierarchical indexing plays an important role in groupby operations.
We can rearrange our data into a DataFrame with unstack(), which moves the inner index level into the columns:

In [133]:
data.unstack()

          1         2         3
a  0.463076  0.720582  0.214094
b  0.039333       NaN  0.756247
c  0.082883  0.816770       NaN
d       NaN  0.761913  0.328906


## Stack()
The inverse operation is stack(), which pivots the columns of a DataFrame back into the inner level of a MultiIndex, returning a Series:

In [134]:
data.unstack().stack()

a  1    0.463076
   2    0.720582
   3    0.214094
b  1    0.039333
   3    0.756247
c  1    0.082883
   2    0.816770
d  2    0.761913
   3    0.328906
dtype: float64
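
unstack() pivots the innermost level by default; passing `level` moves a different level into the columns instead. A sketch on a hypothetical Series `s` with made-up values, which also shows where the NaN entries come from:

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0],
    index=[["a", "a", "b"], [1, 2, 1]],
)

wide = s.unstack()               # inner level becomes the columns
wide_outer = s.unstack(level=0)  # outer level becomes the columns instead

print(wide.shape)                # (2, 2)
print(wide.isna().sum().sum())   # 1: the ("b", 2) combination has no value
print(list(wide_outer.columns))  # ['a', 'b']
```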

## Unstacking our DF
We can now return to our earlier DataFrame example and unstack() the grouped result, which turns the two-level Series into a regular DataFrame.

In [135]:
print(df)
# Unstack the grouped means from our earlier df
df["data1"].groupby([df["key1"], df["key2"]]).mean().unstack()

   key1  key2     data1     data2
0     a   1.0 -1.106851  0.274503
1     a   2.0  1.245186 -0.783344
2  None   1.0  1.193912  0.736214
3     b   2.0 -0.517307 -0.112067
4     b   1.0  0.198812 -0.123519
5     a   NaN  0.336315 -1.573983
6  None   1.0  0.764400 -0.315719


key2       1.0       2.0
key1                    
a    -1.106851  1.245186
b     0.198812 -0.517307
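
The same kind of table can also be produced by passing column names as keys instead of arrays. The sketch below uses a small hypothetical frame `toy` with made-up values to compare the two spellings.

```python
import pandas as pd

toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": [1, 2, 1, 2, 1],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Arrays as keys, as in the cell above
from_arrays = toy["data1"].groupby([toy["key1"], toy["key2"]]).mean().unstack()

# Column names as keys
from_names = toy.groupby(["key1", "key2"])["data1"].mean().unstack()

print(from_names)
print((from_arrays.values == from_names.values).all())
```

Grouping by column names is usually shorter to write; grouping by arrays is useful when the key is not stored in the frame.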
