# Concatenating

In [1]:
import pandas as pd
import numpy as np

In [2]:
df1 = pd.DataFrame(np.full((2,3),'x', dtype=object), columns=['A', 'B', 'C'])
df1

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x


In [3]:
df2 = pd.DataFrame(np.full((3,3),'o', dtype=object), columns=['A', 'B', 'C'])
df2 

Unnamed: 0,A,B,C
0,o,o,o
1,o,o,o
2,o,o,o


In [4]:
df3 = pd.DataFrame(np.full((2,2),'v', dtype=object), columns=['D', 'E'])
df3

Unnamed: 0,D,E
0,v,v
1,v,v


In [5]:
pd.concat([df1,df2])

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x
0,o,o,o
1,o,o,o
2,o,o,o


In [6]:
pd.concat([df1,df2]).reset_index(drop=True)

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x
2,o,o,o
3,o,o,o
4,o,o,o


In [7]:
pd.concat([df1,df3])

FutureWarning: A future version of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


Unnamed: 0,A,B,C,D,E
0,x,x,x,,
1,x,x,x,,
0,,,,v,v
1,,,,v,v


#### The keys parameter

Suppose that after concatenating the DataFrames, we still want to keep the data from each DataFrame in a separate group. This can be useful for determining later on which DataFrame a certain entry came from. We can achieve this with the keys parameter, which creates a hierarchical index on the DataFrame. A hierarchical index means that each row or observation can have more than one index label. This is typically thought of as having different levels, with the first index on the first level and further indices on the next levels. Let's look at the example of concatenating df1 and df2 once again, but this time with the keys parameter:

In [8]:
df4 = pd.concat([df1,df2], keys=['df1', 'df2'])
df4

Unnamed: 0,Unnamed: 1,A,B,C
df1,0,x,x,x
df1,1,x,x,x
df2,0,o,o,o
df2,1,o,o,o
df2,2,o,o,o


In [9]:
df4.loc['df2',:]

Unnamed: 0,A,B,C
0,o,o,o
1,o,o,o
2,o,o,o
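
To illustrate how the hierarchical index lets us trace each row back to its source, here is a minimal sketch. The level names 'source' and 'row' passed to names= are illustrative choices and do not appear in the example above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.full((2, 3), 'x', dtype=object), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.full((3, 3), 'o', dtype=object), columns=['A', 'B', 'C'])

# names= is optional; 'source' and 'row' are hypothetical level names
df4 = pd.concat([df1, df2], keys=['df1', 'df2'], names=['source', 'row'])

# The outer index level records the DataFrame each row came from
origin = df4.index.get_level_values('source')
print(list(origin))

# Moving that level into a regular column makes it easy to filter on
flat = df4.reset_index(level='source')
print(flat[flat['source'] == 'df2'])
```

Whether to keep the origin in the index or in a column is a matter of taste; the index form works well with .loc, while the column form is convenient for boolean filtering.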


In [10]:
pd.concat([df1, df3], axis=1)

Unnamed: 0,A,B,C,D,E
0,x,x,x,v,v
1,x,x,x,v,v


In [11]:
pd.concat([df1,df2], axis=1)

Unnamed: 0,A,B,C,A.1,B.1,C.1
0,x,x,x,o,o,o
1,x,x,x,o,o,o
2,,,,o,o,o


#### The join parameter

The concat() function has another useful parameter called join. Here, we can use either the default setting which is **join='outer'**, or instead we can choose to use **join='inner'**. What is the difference between them? We can think of the join='outer' as a union of the indices or labels (depending on along which axis we perform the concatenation). This is what we have seen so far. The row indices or column labels that were in common were not duplicated, and those that were not in common were each added separately with the appropriate NaN values. On the other hand, join='inner' refers to the intersection of row indices or column labels. That is, we keep only those that are in common, and discard the rest. Here is the previous example, but this time with join='inner':

In [12]:
pd.concat([df1,df2], axis=1, join='inner')

Unnamed: 0,A,B,C,A.1,B.1,C.1
0,x,x,x,o,o,o
1,x,x,x,o,o,o


Since only the first two rows were in common, only these were kept. The third row of the second DataFrame, the one which previously had NaN values, was discarded. What if there are no row indices or column labels in common at all?

In [13]:
pd.concat([df1,df3], join='inner')

0
1
0
1
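
The output above contains only the row index: with no column labels in common, the intersection of the columns is empty. The following sketch compares the shapes of the outer and inner results for the same two DataFrames. Passing sort=False is an assumption made here to avoid the sorting warning in older pandas versions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.full((2, 3), 'x', dtype=object), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.full((2, 2), 'v', dtype=object), columns=['D', 'E'])

# Union of the column labels: all five columns, padded with NaN
outer = pd.concat([df1, df3], join='outer', sort=False)

# Intersection of the column labels: nothing in common, so no columns remain
inner = pd.concat([df1, df3], join='inner')

print(outer.shape)
print(inner.shape)
print(inner.empty)
```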


# Merging and joining
So far we have seen how to combine DataFrames using the concat() function. This function allowed us to combine different DataFrames along their row index or column labels. In this unit, we look at an extension of this using the merge() function. The main role of the merge() function is to allow us to combine DataFrames along multiple columns, or along columns other than the index.

#### Merging on a single column

In [14]:
users = pd.DataFrame( {'userID': [5672, 3452, 2878, 3234],
                'First Name': ['Christopher', 'Johnnie', 'Debbie', 'Teri'],
                'Last Name': ['Boyd','Baldwin', 'Alvarez', 'Gill']})
users

Unnamed: 0,userID,First Name,Last Name
0,5672,Christopher,Boyd
1,3452,Johnnie,Baldwin
2,2878,Debbie,Alvarez
3,3234,Teri,Gill


In [15]:
scores = pd.DataFrame( {'userID': [2878, 5672, 3234, 5672, 2878],
                'Score': [84,56,72,77,88]})
scores

Unnamed: 0,userID,Score
0,2878,84
1,5672,56
2,3234,72
3,5672,77
4,2878,88


In [16]:
merged_df = pd.merge(users, scores) 
merged_df

Unnamed: 0,userID,First Name,Last Name,Score
0,5672,Christopher,Boyd,56
1,5672,Christopher,Boyd,77
2,2878,Debbie,Alvarez,84
3,2878,Debbie,Alvarez,88
4,3234,Teri,Gill,72


Pandas found the column that the two DataFrames have in common, userID, on its own, and merged the two DataFrames on this column. This worked because the columns had the same label. But what if this is not the case? Consider instead the following alternative definition of the second DataFrame:

In [17]:
scores2 = pd.DataFrame( {'studentID': [2878, 5672, 3234, 5672, 2878],
                'Score': [84,56,72,77,88]})

In [18]:
pd.merge(users, scores2)
# MergeError: No common columns to perform merge on

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

In [None]:
pd.merge(users, scores2, left_on='userID', right_on='studentID')
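
Since the cell above has no recorded output, here is a small self-contained sketch of what the left_on/right_on merge produces. It shows that both key columns are kept in the result. Dropping the redundant studentID column afterwards is an optional tidy-up step suggested here, not part of the original example.

```python
import pandas as pd

users = pd.DataFrame({'userID': [5672, 3452, 2878, 3234],
                      'First Name': ['Christopher', 'Johnnie', 'Debbie', 'Teri'],
                      'Last Name': ['Boyd', 'Baldwin', 'Alvarez', 'Gill']})
scores2 = pd.DataFrame({'studentID': [2878, 5672, 3234, 5672, 2878],
                        'Score': [84, 56, 72, 77, 88]})

# Match users.userID against scores2.studentID
merged = pd.merge(users, scores2, left_on='userID', right_on='studentID')
print(merged.columns.tolist())

# Both key columns hold the same values, so one of them can be dropped
tidy = merged.drop(columns='studentID')
print(tidy)
```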

#### Merging on multiple columns

In [None]:
gold = pd.DataFrame({'Code': ['CAN', 'GER', 'USA', 'NOR'],
               'Country': ['Canada', 'Germany', 'United States', 'Norway'],
               'Total': [ 14, 10, 9, 9]})
gold

In [None]:
bronze= pd.DataFrame({'Code': ['USA','GER', 'NOR', 'AUS'],
               'Country': ['United States', 'Germany', 'Norway', 'Austria'],
               'Total': [ 13, 7, 7, 6]})
bronze

In [None]:
pd.merge(gold,bronze)

What we actually get back is an empty DataFrame. This is because, by default, pandas merges on all common columns. The rows of the merged DataFrame are therefore only those where the Code, Country, and Total columns are identical in both DataFrames. The result is empty because the entries in the Total column of the two DataFrames never match. To achieve our intended merge, we must tell pandas to merge on the columns Code and Country only. We can do this using the on parameter as follows:

In [None]:
pd.merge(gold, bronze, on=['Code', 'Country'])

In [None]:
pd.merge(gold, bronze, on=['Code', 'Country'], suffixes=['_gold', '_bronze'])
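
The three merges above have no recorded output, so the following sketch reproduces them and prints the resulting column labels. It uses a tuple for suffixes, which is equivalent to the list used above.

```python
import pandas as pd

gold = pd.DataFrame({'Code': ['CAN', 'GER', 'USA', 'NOR'],
                     'Country': ['Canada', 'Germany', 'United States', 'Norway'],
                     'Total': [14, 10, 9, 9]})
bronze = pd.DataFrame({'Code': ['USA', 'GER', 'NOR', 'AUS'],
                       'Country': ['United States', 'Germany', 'Norway', 'Austria'],
                       'Total': [13, 7, 7, 6]})

# Default: merge on every common column (Code, Country and Total)
default = pd.merge(gold, bronze)
print(default.empty)

# Merge on Code and Country only; the clashing Total columns get _x and _y
by_country = pd.merge(gold, bronze, on=['Code', 'Country'])
print(by_country.columns.tolist())

# Custom suffixes make the origin of each Total column explicit
named = pd.merge(gold, bronze, on=['Code', 'Country'], suffixes=('_gold', '_bronze'))
print(named)
```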

#### Different types of joins


What we did in the last example is referred to as an inner join: we took the rows that matched in the Code and Country columns of both DataFrames. Since Canada appeared in the gold DataFrame but not in the bronze one, and Austria appeared in the bronze DataFrame but not in the gold one, these two rows were not included in our merged DataFrame. This corresponds to an intersection.

In contrast to this we can opt for an outer join where we keep all the rows, corresponding to a union. We can do this using the how parameter.

In [None]:
pd.merge(gold, bronze, on=['Code', 'Country'], suffixes=['_gold', '_bronze'], how='outer')

This type of join returns both the merge of the matched rows and the unmatched values from both the left and right DataFrames. Notice that the unmatched entries were filled with the NaN value. In addition to inner and outer joins we have two more options:

- left join: return the merge of the matched rows and the unmatched values from only the left DataFrame
- right join: return the merge of the matched rows and the unmatched values from only the right DataFrame

In [None]:
pd.merge(gold, bronze, on=['Code', 'Country'], suffixes=['_gold', '_bronze'], how='left')

In [None]:
pd.merge(gold, bronze, on=['Code', 'Country'], suffixes=['_gold', '_bronze'], how='right')
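
As a rough way to compare the four join types, the sketch below counts the rows each one returns for the gold and bronze DataFrames. It also uses the indicator parameter of merge(), which is not covered in the text above, to show where each row of the outer join came from.

```python
import pandas as pd

gold = pd.DataFrame({'Code': ['CAN', 'GER', 'USA', 'NOR'],
                     'Country': ['Canada', 'Germany', 'United States', 'Norway'],
                     'Total': [14, 10, 9, 9]})
bronze = pd.DataFrame({'Code': ['USA', 'GER', 'NOR', 'AUS'],
                       'Country': ['United States', 'Germany', 'Norway', 'Austria'],
                       'Total': [13, 7, 7, 6]})

# Number of rows returned by each join type on the same key columns
sizes = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(gold, bronze, on=['Code', 'Country'], how=how)
    sizes[how] = len(result)
print(sizes)

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = pd.merge(gold, bronze, on=['Code', 'Country'],
                 suffixes=('_gold', '_bronze'), how='outer', indicator=True)
print(outer[['Code', '_merge']])
```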

Remark: We would like to draw your attention to one particular issue that can arise when performing an outer merge. Suppose we have two DataFrames containing integer values:

In [None]:
df1 = pd.DataFrame({'key': [1,2,3,4], 'val1': [1,2,3,4]})
df2 = pd.DataFrame({'key': [1,2,3,5], 'val2': [1,2,3,4]})

In [None]:
df_in = df1.merge(df2, how='inner')
df_in

In [None]:
df_in.dtypes

In [None]:
df_out = df1.merge(df2, how='outer')
df_out

In [None]:
df_out.dtypes
#key       int64
#val1    float64
#val2    float64
#dtype: object 

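Comparing the two results, we notice that the val1 and val2 columns have been changed from int64 to float64 in the outer merge. This happens because the unmatched rows are filled with NaN, which an int64 column cannot hold. In older pandas versions the key column could be converted as well, which was reported as BUG: DataFrame outer merge changes key columns from int64 to float64. For most practical cases, this is not really a problem, but if it bothers you, you can of course always convert the columns back to the original dtypes.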
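
Converting the columns back could look like the following sketch. It shows two possible approaches, neither of which is prescribed by the text: the nullable 'Int64' dtype, which keeps the missing values, and filling the gaps with a sentinel before casting. The sentinel -1 is an arbitrary choice made for illustration, and the names left and right stand in for df1 and df2.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
right = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})

df_out = left.merge(right, how='outer')
print(df_out.dtypes)

# Option 1: nullable integer dtype, missing entries stay missing
restored = df_out.astype({'val1': 'Int64', 'val2': 'Int64'})
print(restored.dtypes)

# Option 2: replace missing entries with a sentinel (-1 here), then cast back
filled = df_out.fillna(-1).astype({'val1': 'int64', 'val2': 'int64'})
print(filled.dtypes)
```

The first option preserves the information that a value was missing, while the second produces plain int64 columns at the cost of a made-up placeholder value.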

# Exercise: merging with different joins

In [None]:
left = pd.DataFrame({'key1': ['a', 'b', 'c'], 'key2': ['A', 'B', 'C'], 'lval': [ 0, 1, 2]})
right = pd.DataFrame({'key1': ['a', 'b', 'c'], 'key2': ['A', 'D', 'C'], 'rval': [ 3, 4, 6]})

In [None]:
left

In [None]:
right

In [None]:
df_A = pd.merge(left,right)
df_A

In [None]:
df_B = pd.merge(left,right, how = 'outer')
df_B

In [None]:
df_C = pd.merge(left,right, how = 'right')
df_C

In [None]:
df_D = pd.merge(left,right, on=['key1'])
df_D

# Pivoting

Often, we might want to reshape our data to make it easier to view certain relationships between the variables. The pivot() and pivot_table() functions from pandas let us reorganize the entire DataFrame as we wish. Let's look at each in more detail.

#### The pivot() function

The pivot() function is applied to a DataFrame and has three important parameters: index, columns and values. To each of these parameters, we pass the name of an existing column of our DataFrame. Pandas then performs the following actions to obtain the new DataFrame:

- it takes the entries from the column passed to index and makes these the indices of the new DataFrame
- it takes the entries from the column passed to columns and makes these the column labels of the new DataFrame
- it takes the entries from the column passed to values and uses them to fill in the new DataFrame, by putting them in the corresponding columns

This is easiest to understand with an example. Suppose we have a sensor that reports the coordinates of some mobile device at equal time intervals. Here are the readings from the first four intervals:

In [1]:
import pandas as pd
values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
coordinates = ['x','y','z'] * 4
time = [0]*3 + [1]*3 + [2]*3 + [3]*3
df = pd.DataFrame({'time':time, 'coordinates':coordinates, 'values':values})
df

Unnamed: 0,time,coordinates,values
0,0,x,3
1,0,y,81
2,0,z,1
3,1,x,56
4,1,y,71
5,1,z,91
6,2,x,54
7,2,y,94
8,2,z,64
9,3,x,90


In [2]:
df_pivot = df.pivot(index='time', columns='coordinates', values='values')
df_pivot

coordinates,x,y,z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,3,81,1
1,56,71,91
2,54,94,64
3,90,21,36
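
To make the three steps concrete, the sketch below rebuilds the same DataFrame, looks up a single reading in the pivoted result, and reproduces the reshaping with set_index() followed by unstack(). The second route is shown only for comparison and is not part of the original example.

```python
import pandas as pd

values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
coordinates = ['x', 'y', 'z'] * 4
time = [0]*3 + [1]*3 + [2]*3 + [3]*3
df = pd.DataFrame({'time': time, 'coordinates': coordinates, 'values': values})

df_pivot = df.pivot(index='time', columns='coordinates', values='values')

# One row per time step, one column per coordinate
reading = df_pivot.loc[1, 'y']
print(reading)

# The same result, built by hand: index on both columns, then unstack the inner one
alt = df.set_index(['time', 'coordinates'])['values'].unstack()
print((alt.values == df_pivot.values).all())
```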


#### The pivot_table() function
The pivot_table() function is a generalization of the pivot() function that allows for duplicated values in the pivoted index/column pairs. To demonstrate this, we need a new example where we have such duplicated values. Let's suppose that our data contained coordinates from a second sensor which was paired with a different mobile device, defined as follows:

In [3]:
values2=[6,82,9,47,8,12,64,88,53,46,59,60]

In [4]:
df2=pd.DataFrame({'time':time*2, 'coordinates':coordinates*2, 'values':values+values2})
df2

Unnamed: 0,time,coordinates,values
0,0,x,3
1,0,y,81
2,0,z,1
3,1,x,56
4,1,y,71
5,1,z,91
6,2,x,54
7,2,y,94
8,2,z,64
9,3,x,90


In [5]:
df2.pivot(index='time', columns='coordinates', values='values')

ValueError: Index contains duplicate entries, cannot reshape

Pandas gives us an error that we have duplicated entries. For example, rows 0 and 12 both have an x in the column coordinates and a 0 in the column time. This means that the entries of the column values of these two rows would map to the same entry of the new DataFrame. Since pandas doesn't know how to handle this, it gives us an error. However, there is a solution, provided by the pivot_table() function. This function has an additional parameter called aggfunc, which allows us to specify a function that tells pandas how to aggregate or combine the different values that map to the same entry, and return a single value. The default option is the mean() function. Let's take a look:

In [6]:
df2_pivot = df2.pivot_table(index='time', columns='coordinates', values='values')
df2_pivot

coordinates,x,y,z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,4.5,81.5,5.0
1,51.5,39.5,51.5
2,59.0,91.0,58.5
3,68.0,40.0,48.0
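
The following self-contained sketch puts the two behaviours side by side: pivot() fails on the duplicated time/coordinates pairs, while pivot_table() averages them. Passing aggfunc='mean' explicitly is done here only to confirm that it matches the default.

```python
import pandas as pd

values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
values2 = [6, 82, 9, 47, 8, 12, 64, 88, 53, 46, 59, 60]
coordinates = ['x', 'y', 'z'] * 4
time = [0]*3 + [1]*3 + [2]*3 + [3]*3
df2 = pd.DataFrame({'time': time*2, 'coordinates': coordinates*2,
                    'values': values + values2})

# pivot() cannot place two values in the same cell
try:
    df2.pivot(index='time', columns='coordinates', values='values')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot() failed:', err)

# pivot_table() aggregates the duplicates, using the mean unless told otherwise
default = df2.pivot_table(index='time', columns='coordinates', values='values')
explicit = df2.pivot_table(index='time', columns='coordinates', values='values',
                           aggfunc='mean')
print(default.equals(explicit))
```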


In [7]:
import numpy as np
def distance(a):
    x = np.max(a) - np.min(a)
    return x

In [8]:
df2_pivot = df2.pivot_table(index='time', columns='coordinates', values='values', aggfunc=distance)
df2_pivot

coordinates,x,y,z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,3,1,8
1,9,63,79
2,10,6,11
3,44,38,24


In [9]:
df2_pivot = df2.pivot_table(index='time', columns='coordinates', values='values', aggfunc=lambda x: tuple(x))
df2_pivot

coordinates,x,y,z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"(3, 6)","(81, 82)","(1, 9)"
1,"(56, 47)","(71, 8)","(91, 12)"
2,"(54, 64)","(94, 88)","(64, 53)"
3,"(90, 46)","(21, 59)","(36, 60)"


# Exercise: Pivoting

In [10]:
df=pd.read_csv('songs.csv')
df.head()

Unnamed: 0,Musician,Genre,Name,Decade,Minutes
0,Led Zeppelin,hard rock,Stairway to Heaven,70,08:02
1,Led Zeppelin,hard rock,Kashmir,70,08:37
2,Led Zeppelin,hard rock,Immigrant Song,70,02:26
3,Led Zeppelin,hard rock,Whole Lotta Love,60,05:33
4,Led Zeppelin,hard rock,Black Dog,70,04:55


In [17]:
def set_x(minutes):
    return 'x'

In [20]:
df_pivot = df.pivot_table(index=['Decade', 'Musician', 'Name'], columns=['Genre'], values='Minutes', aggfunc=set_x, fill_value=' ')
df_pivot

Unnamed: 0_level_0,Unnamed: 1_level_0,Genre,folk rock,hard rock,pop rock
Decade,Musician,Name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
60,Bob Dylan,Blowin' in the Wind,x,,
60,Bob Dylan,Like a Rolling Stone,x,,
60,David Bowie,Space Oddity,,,x
60,Led Zeppelin,Good Times Bad Times,,x,
60,Led Zeppelin,I Can't Quit You Baby,,x,
60,Led Zeppelin,Moby Dick,,x,
60,Led Zeppelin,Ramble On,,x,
60,Led Zeppelin,Whole Lotta Love,,x,
70,Bob Dylan,Tangled Up in Blue,x,,
70,David Bowie,China Girl,,,x


# Hierarchical indexing

#### Stacking and unstacking
In this unit, we will look at the stack() and unstack() methods of pandas. These methods are useful for DataFrames that have a hierarchical index (MultiIndex) on the rows or columns. Their purpose is the following:

- the stack() function takes the innermost column label and turns it into the innermost row index. The overall effect is to make the DataFrame taller.
- the unstack() function is the inverse operation: it takes the innermost row index and turns it into the innermost column label. The overall effect is to make the DataFrame wider.
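
Before moving to a DataFrame with multiple levels, the two operations can be sketched on a small single-level frame. The DataFrame small and its labels are made up for this illustration.

```python
import pandas as pd

# Hypothetical 2x2 frame with plain (single-level) row and column labels
small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r0', 'r1'])

# stack(): the column labels become the innermost row index level (taller)
tall = small.stack()
print(tall)

# unstack(): the innermost row index level goes back to the columns (wider)
wide = tall.unstack()
print(wide)
```

With single-level columns, stack() returns a Series whose index has two levels, and unstack() restores the original layout.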

In [21]:
import pandas as pd
import numpy as np
#define the MultiIndex for the rows
row_levels = [['R0','R1'], ['r00', 'r01', 'r10', 'r11']]
row_labels = [[0,0,1,1],[0,1,2,3]]
row_indices = pd.MultiIndex(row_levels, row_labels)
#define the MultiIndex for the columns
col_levels = [['C0','C1'], ['c00', 'c01', 'c10', 'c11']]
col_labels = [[0,0,1,1],[0,1,2,3]]
col_indices = pd.MultiIndex(col_levels, col_labels)
#define the data
data = np.arange(16).reshape(4,4)
#create the dataframe
df = pd.DataFrame(data, index=row_indices, columns=col_indices)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0,1,2,3
R0,r01,4,5,6,7
R1,r10,8,9,10,11
R1,r11,12,13,14,15


In [22]:
df.stack()

Unnamed: 0,Unnamed: 1,Unnamed: 2,C0,C1
R0,r00,c00,0.0,
R0,r00,c01,1.0,
R0,r00,c10,,2.0
R0,r00,c11,,3.0
R0,r01,c00,4.0,
R0,r01,c01,5.0,
R0,r01,c10,,6.0
R0,r01,c11,,7.0
R1,r10,c00,8.0,
R1,r10,c01,9.0,


In [23]:
df.unstack()

Unnamed: 0_level_0,C0,C0,C0,C0,C0,C0,C0,C0,C1,C1,C1,C1,C1,C1,C1,C1
Unnamed: 0_level_1,c00,c00,c00,c00,c01,c01,c01,c01,c10,c10,c10,c10,c11,c11,c11,c11
Unnamed: 0_level_2,r00,r01,r10,r11,r00,r01,r10,r11,r00,r01,r10,r11,r00,r01,r10,r11
R0,0.0,4.0,,,1.0,5.0,,,2.0,6.0,,,3.0,7.0,,
R1,,,8.0,12.0,,,9.0,13.0,,,10.0,14.0,,,11.0,15.0


#### Stacking and unstacking on different levels

In [24]:
df.stack(level=0)

Unnamed: 0,Unnamed: 1,Unnamed: 2,c00,c01,c10,c11
R0,r00,C0,0.0,1.0,,
R0,r00,C1,,,2.0,3.0
R0,r01,C0,4.0,5.0,,
R0,r01,C1,,,6.0,7.0
R1,r10,C0,8.0,9.0,,
R1,r10,C1,,,10.0,11.0
R1,r11,C0,12.0,13.0,,
R1,r11,C1,,,14.0,15.0


In [26]:
df.stack().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C0,C0,C1,C1,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11,c00,c01,c10,c11
R0,r00,0.0,1.0,,,,,2.0,3.0
R0,r01,4.0,5.0,,,,,6.0,7.0
R1,r10,8.0,9.0,,,,,10.0,11.0
R1,r11,12.0,13.0,,,,,14.0,15.0


In [29]:
df.stack().unstack().dropna(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0.0,1.0,2.0,3.0
R0,r01,4.0,5.0,6.0,7.0
R1,r10,8.0,9.0,10.0,11.0
R1,r11,12.0,13.0,14.0,15.0


In [30]:
df.stack(level=0).unstack(level=0).dropna(axis=1)

r00,C0
r00,C1
r01,C0
r01,C1
r10,C0
r10,C1
r11,C0
r11,C1


In [33]:
df.stack(level=0).unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,c00,c00,c01,c01,c10,c10,c11,c11
Unnamed: 0_level_1,Unnamed: 1_level_1,C0,C1,C0,C1,C0,C1,C0,C1
R0,r00,0.0,,1.0,,,2.0,,3.0
R0,r01,4.0,,5.0,,,6.0,,7.0
R1,r10,8.0,,9.0,,,10.0,,11.0
R1,r11,12.0,,13.0,,,14.0,,15.0


In [35]:
df.stack(level=0).unstack().swaplevel(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C1,C0,C1,C0,C1,C0,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c00,c01,c01,c10,c10,c11,c11
R0,r00,0.0,,1.0,,,2.0,,3.0
R0,r01,4.0,,5.0,,,6.0,,7.0
R1,r10,8.0,,9.0,,,10.0,,11.0
R1,r11,12.0,,13.0,,,14.0,,15.0


In [36]:
df.stack(level=0).unstack().swaplevel(axis=1).dropna(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0.0,1.0,2.0,3.0
R0,r01,4.0,5.0,6.0,7.0
R1,r10,8.0,9.0,10.0,11.0
R1,r11,12.0,13.0,14.0,15.0


# Grouping

#### The split-apply-combine process
In the first course, you have already seen the pandas groupby() function. This is an extremely powerful function that lets us slice, dice and summarize data sets. The general process in which we will use the groupby() function is known as a split-apply-combine procedure, which consists of the following three steps:

- first, split the data into groups
- then, apply a function to each group
- finally, aggregate the results and combine them back into a DataFrame
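
The three steps can be sketched by hand on a tiny DataFrame and compared with what groupby() does in a single call. The data and the names pieces, means and manual are invented for this illustration.

```python
import pandas as pd

# Hypothetical toy data: two teams, one score column
toy = pd.DataFrame({'team': ['a', 'a', 'b'], 'score': [1, 3, 10]})

# Split: one sub-DataFrame per team
pieces = {name: group for name, group in toy.groupby('team')}

# Apply: compute the mean score within each piece
means = {name: group['score'].mean() for name, group in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name='score').rename_axis('team')

# The same three steps in one groupby() expression
builtin = toy.groupby('team')['score'].mean()

print(manual)
print(builtin)
```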

In [37]:
from IPython.display import Image
Image(url= "https://d7whxh71cqykp.cloudfront.net/uploads/image/data/3229/Screen_Shot_2017-12-15_at_10.46.23.png")

In [39]:
import pandas as pd
import numpy as np

raw_data = {'team': ['Ten Snakes', 'Ten Snakes', 'Ten Snakes', 'Ten Snakes', 
                                    'Nine Monkeys', 'Nine Monkeys', 'Nine Monkeys', 'Nine Monkeys', 
                                    'Eight Eagles', 'Eight Eagles'], 
        'rank': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '2nd'], 
        'name': ['James', 'Allen', 'Matthew', 'James', 'Devon', 'Sam', 'Justin', 'Sam', 'Paul', 'Ross'], 
        'score1': [16,35,55,29,2,61,68,41,94,18],
        'score2': [81,65,54,44,28,93,2,5,53,99]}
df = pd.DataFrame(raw_data, columns = ['team', 'rank', 'name', 'score1', 'score2'])

In [40]:
df.head()

Unnamed: 0,team,rank,name,score1,score2
0,Ten Snakes,1st,James,16,81
1,Ten Snakes,1st,Allen,35,65
2,Ten Snakes,2nd,Matthew,55,54
3,Ten Snakes,2nd,James,29,44
4,Nine Monkeys,1st,Devon,2,28


#### Grouping by a single variable

In [45]:
grouped=df.groupby('team')


An important thing to realize is that the variable grouped does not directly have the grouped data. Instead, this variable is now a DataFrameGroupBy object.

In [46]:
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000073721F6CF8>

In [47]:
list(grouped)

[('Eight Eagles',
             team rank  name  score1  score2
  8  Eight Eagles  1st  Paul      94      53
  9  Eight Eagles  2nd  Ross      18      99),
 ('Nine Monkeys',
             team rank    name  score1  score2
  4  Nine Monkeys  1st   Devon       2      28
  5  Nine Monkeys  1st     Sam      61      93
  6  Nine Monkeys  2nd  Justin      68       2
  7  Nine Monkeys  2nd     Sam      41       5),
 ('Ten Snakes',
           team rank     name  score1  score2
  0  Ten Snakes  1st    James      16      81
  1  Ten Snakes  1st    Allen      35      65
  2  Ten Snakes  2nd  Matthew      55      54
  3  Ten Snakes  2nd    James      29      44)]

In [48]:
grouped.describe()

Unnamed: 0_level_0,score1,score1,score1,score1,score1,score1,score1,score1,score2,score2,score2,score2,score2,score2,score2,score2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
team,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Eight Eagles,2.0,56.0,53.740115,18.0,37.0,56.0,75.0,94.0,2.0,76.0,32.526912,53.0,64.5,76.0,87.5,99.0
Nine Monkeys,4.0,43.0,29.631065,2.0,31.25,51.0,62.75,68.0,4.0,32.0,42.292631,2.0,4.25,16.5,44.25,93.0
Ten Snakes,4.0,33.75,16.23525,16.0,25.75,32.0,40.0,55.0,4.0,61.0,15.853496,44.0,51.5,59.5,69.0,81.0


In [49]:
grouped.mean()

Unnamed: 0_level_0,score1,score2
team,Unnamed: 1_level_1,Unnamed: 2_level_1
Eight Eagles,56.0,76.0
Nine Monkeys,43.0,32.0
Ten Snakes,33.75,61.0


In [50]:
grouped.size()

team
Eight Eagles    2
Nine Monkeys    4
Ten Snakes      4
dtype: int64

In [51]:
grouped.count()

Unnamed: 0_level_0,rank,name,score1,score2
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Eight Eagles,2,2,2,2
Nine Monkeys,4,4,4,4
Ten Snakes,4,4,4,4


If we want to get a specific group we call the get_group() function and pass it a group name. The groups are labeled according to the entries of the variable that we decided to group by:

In [52]:
grouped.get_group('Ten Snakes')

Unnamed: 0,team,rank,name,score1,score2
0,Ten Snakes,1st,James,16,81
1,Ten Snakes,1st,Allen,35,65
2,Ten Snakes,2nd,Matthew,55,54
3,Ten Snakes,2nd,James,29,44


#### Grouping by multiple index levels


In [54]:
df2 = df.set_index(['team','rank'])
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,name,score1,score2
team,rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Ten Snakes,1st,James,16,81
Ten Snakes,1st,Allen,35,65
Ten Snakes,2nd,Matthew,55,54
Ten Snakes,2nd,James,29,44
Nine Monkeys,1st,Devon,2,28
Nine Monkeys,1st,Sam,61,93
Nine Monkeys,2nd,Justin,68,2
Nine Monkeys,2nd,Sam,41,5
Eight Eagles,1st,Paul,94,53
Eight Eagles,2nd,Ross,18,99


In [55]:
grouped2 = df2.groupby(level=['team', 'rank'])

In [56]:
grouped2.size()

team          rank
Eight Eagles  1st     1
              2nd     1
Nine Monkeys  1st     2
              2nd     2
Ten Snakes    1st     2
              2nd     2
dtype: int64

In [57]:
grouped2.get_group(('Eight Eagles','1st'))

Unnamed: 0_level_0,Unnamed: 1_level_0,name,score1,score2
team,rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Eight Eagles,1st,Paul,94,53


#### Applying a function

In [58]:
grouped.agg(np.sum)

Unnamed: 0_level_0,score1,score2
team,Unnamed: 1_level_1,Unnamed: 2_level_1
Eight Eagles,112,152
Nine Monkeys,172,128
Ten Snakes,135,244


In [59]:
grouped2.agg(np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,score1,score2
team,rank,Unnamed: 2_level_1,Unnamed: 3_level_1
Eight Eagles,1st,94,53
Eight Eagles,2nd,18,99
Nine Monkeys,1st,63,121
Nine Monkeys,2nd,109,7
Ten Snakes,1st,51,146
Ten Snakes,2nd,84,98


#### Filtering by groups

Once we have split the data and applied certain functions on each group separately, we arrive at the last part of the process, which is putting the data back together again. Now, the pandas GroupBy object has a very useful function called filter() which allows us to decide whether to include a certain group or not in the final combination.

This is how it works:

- first, we define a function that when passed a group returns either True or False
- and then, we pass this function to filter()

We will then get back the groups for which the function we defined returned True. Let's try this on our example grouped. Suppose we are interested in keeping only those groups that have an average value greater than 50 in both score1 and score2. We already saw that we can get the average value in a group using the mean() method. This returns a DataFrame that has the mean values per group for each column. So we can define our function as follows:

In [60]:
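def f(x):
    # average only the numeric score columns; the text columns cannot be averaged
    m = x[['score1', 'score2']].mean()
    # keep the group only if both averages exceed 50
    return (m.score1 > 50) & (m.score2 > 50)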

In [61]:
grouped.filter(f)

Unnamed: 0,rank,name,score1,score2
8,1st,Paul,94,53
9,2nd,Ross,18,99
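
To check which groups pass the condition, the sketch below computes the group means first and then applies the same test through filter(). The lambda is an assumed alternative to the named function f above, and the final isin() comparison is added only as a cross-check.

```python
import pandas as pd

raw_data = {'team': ['Ten Snakes'] * 4 + ['Nine Monkeys'] * 4 + ['Eight Eagles'] * 2,
            'rank': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '2nd'],
            'name': ['James', 'Allen', 'Matthew', 'James', 'Devon',
                     'Sam', 'Justin', 'Sam', 'Paul', 'Ross'],
            'score1': [16, 35, 55, 29, 2, 61, 68, 41, 94, 18],
            'score2': [81, 65, 54, 44, 28, 93, 2, 5, 53, 99]}
df = pd.DataFrame(raw_data)

# Group means of the two score columns
means = df.groupby('team')[['score1', 'score2']].mean()
keep = means[(means.score1 > 50) & (means.score2 > 50)].index
print(list(keep))

# filter() returns the original rows of every group that passes the test
kept = df.groupby('team').filter(
    lambda g: g.score1.mean() > 50 and g.score2.mean() > 50)
print(kept)

# Cross-check: selecting the rows of the passing teams gives the same rows
same_rows = df[df['team'].isin(keep)].equals(kept)
print(same_rows)
```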


#### Exercise: grouping and filtering

In this exercise, a group is kept if and only if all the values of the group are greater than 50 in both score1 and score2.

In [69]:
grouped2 = df2.groupby(level=['team', 'rank'])


In [63]:
def f(x):
    return (x.score1.min() > 50) & (x.score2.min() > 50)

In [64]:
grouped2.filter(f)

Unnamed: 0_level_0,Unnamed: 1_level_0,name,score1,score2
team,rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Eight Eagles,1st,Paul,94,53


In [67]:
df2[(df2['score1']>50) &  (df2['score2']>50)]

Unnamed: 0_level_0,Unnamed: 1_level_0,name,score1,score2
team,rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Ten Snakes,2nd,Matthew,55,54
Nine Monkeys,1st,Sam,61,93
Eight Eagles,1st,Paul,94,53
