# Exercise 8: Splitting / Applying / Combining Data Sources

Let's group by the values in the Embarked column. What groups are available?

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('titanic.csv')

In [2]:
embarked_grouped = df.groupby('Embarked')

print(f'There are {len(embarked_grouped)} Embarked groups')

There are 3 Embarked groups
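The count alone does not say which groups exist. As a minimal sketch, the group labels and their sizes can be read off the GroupBy object; the small `toy` frame below is a hypothetical stand-in for `titanic.csv` so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Age': [22.0, 38.0, 26.0, 30.0, 35.0, 54.0],
})
toy_grouped = toy.groupby('Embarked')

group_names = sorted(toy_grouped.groups.keys())  # labels of the groups
group_sizes = toy_grouped.size()                 # number of rows in each group

print(group_names)
print(group_sizes)
```

The same two calls should work on `embarked_grouped` from the cell above.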


What is actually in the groups?

In [3]:
embarked_grouped.groups

{'C': Int64Index([   1,    9,   19,   26,   30,   31,   34,   36,   39,   42,
             ...
             1260, 1262, 1266, 1288, 1293, 1295, 1296, 1298, 1305, 1308],
            dtype='int64', length=270),
 'Q': Int64Index([   5,   16,   22,   28,   32,   44,   46,   47,   82,  109,
             ...
             1206, 1249, 1271, 1272, 1279, 1287, 1290, 1299, 1301, 1302],
            dtype='int64', length=123),
 'S': Int64Index([   0,    2,    3,    4,    6,    7,    8,   10,   11,   12,
             ...
             1289, 1291, 1292, 1294, 1297, 1300, 1303, 1304, 1306, 1307],
            dtype='int64', length=914)}
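The dictionary above maps each group label to the index labels of its rows. A short sketch of two ways to pull out the rows themselves, again on a hypothetical `toy` frame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Fare': [7.25, 71.25, 8.0, 8.5, 30.75, 52.0],
})
toy_grouped = toy.groupby('Embarked')

# Option 1: ask the GroupBy object for one group directly
cherbourg = toy_grouped.get_group('C')

# Option 2: use the stored index labels to select rows from the original frame
labels = toy_grouped.groups['C']
cherbourg_by_label = toy.loc[labels]

print(cherbourg)
```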

The first index in group C is 1, so look at that row (with the default integer index, position and label coincide):

In [4]:
df.iloc[1]

Unnamed: 0                                                    1
Cabin                                                       C85
Embarked                                                      C
Fare                                                    71.2833
Pclass                                                        1
Ticket                                                 PC 17599
Age                                                          38
Name          Cumings, Mrs. John Bradley (Florence Briggs Th...
Parch                                                         0
Sex                                                      female
SibSp                                                         1
Survived                                                      1
Name: 1, dtype: object

We can also iterate through the groups: each iteration yields the group name and the corresponding rows. Say we wanted the average age of the passengers in each Embarked group:

In [5]:
for name, group in embarked_grouped:
    print(name, group.Age.mean())

C 32.33216981132075
Q 28.63
S 29.245204603580564
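The loop can usually be replaced by selecting the column on the GroupBy object and calling the reduction directly. A minimal sketch on a hypothetical `toy` frame, checking that both approaches agree:

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Age': [22.0, 38.0, None, 30.0, 35.0, 54.0],
})

# Explicit loop, as in the cell above
loop_means = {name: group['Age'].mean()
              for name, group in toy.groupby('Embarked')}

# Equivalent one-liner: select the column, then reduce per group
age_means = toy.groupby('Embarked')['Age'].mean()

print(age_means)
```

Missing ages are skipped by `mean()` in both versions.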


We can also use the **aggregate** method, or **agg** for short, to apply a function to every column of each group:

In [6]:
embarked_grouped.agg(np.mean)

| Embarked | Unnamed: 0 | Fare | Pclass | Age | Parch | SibSp | Survived |
|---|---|---|---|---|---|---|---|
| C | 689.655556 | 62.336267 | 1.851852 | 32.33217 | 0.37037 | 0.4 | 0.553571 |
| Q | 667.593496 | 12.409012 | 2.894309 | 28.63 | 0.113821 | 0.341463 | 0.38961 |
| S | 642.095186 | 27.418824 | 2.347921 | 29.245205 | 0.426696 | 0.550328 | 0.336957 |
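On recent pandas releases, passing a NumPy function such as `np.mean` to `agg` may emit a deprecation warning, and text columns may need to be excluded explicitly. A hedged sketch of an equivalent call using the built-in reduction with `numeric_only=True`, on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],   # text column, cannot be averaged
    'Fare': [7.25, 71.25, 8.0, 8.5, 30.75, 52.0],
    'Age': [22.0, 38.0, 26.0, 30.0, 35.0, 54.0],
})

# Built-in mean, restricted to numeric columns
means = toy.groupby('Embarked').mean(numeric_only=True)

print(means)
```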


Any function that reduces a column to a single value can be passed to the aggregate method. For example, to compute per-group sums:

In [7]:
embarked_grouped.agg(np.sum)

| Embarked | Unnamed: 0 | Fare | Pclass | Age | Parch | SibSp | Survived |
|---|---|---|---|---|---|---|---|
| C | 186207 | 16830.7922 | 500 | 6854.42 | 100 | 108 | 93.0 |
| Q | 82114 | 1526.3085 | 356 | 1431.5 | 14 | 42 | 30.0 |
| S | 586875 | 25033.3862 | 2146 | 22869.75 | 390 | 503 | 217.0 |
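Summing every column is rarely what is wanted (a sum of `Pclass` has little meaning). One possible alternative, sketched on a hypothetical `toy` frame, is named aggregation, which picks a different reduction per column; the output column names `total_fare` and `mean_age` are made up for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Fare': [7.25, 71.25, 8.0, 8.5, 30.75, 52.0],
    'Age': [22.0, 38.0, 26.0, 30.0, 35.0, 54.0],
})

# Each keyword is an output column: (input column, reduction)
summary = toy.groupby('Embarked').agg(
    total_fare=('Fare', 'sum'),
    mean_age=('Age', 'mean'),
)

print(summary)
```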


We can also apply our own functions, as long as the function accepts a Pandas Series (one column of one group) and returns a single value.

In [8]:
def first_val(x):
    # x is the Series holding one column's values for one group
    return x.values[0]

embarked_grouped.agg(first_val)

| Embarked | Unnamed: 0 | Cabin | Fare | Pclass | Ticket | Age | Name | Parch | Sex | SibSp | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 1 | C85 | 71.2833 | 1 | PC 17599 | 38.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | female | 1 | 1.0 |
| Q | 5 | NaN | 8.4583 | 3 | 330877 | NaN | Moran, Mr. James | 0 | male | 0 | 0.0 |
| S | 0 | NaN | 7.25 | 3 | A/5 21171 | 22.0 | Braund, Mr. Owen Harris | 0 | male | 1 | 0.0 |
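Note that a custom function like this returns the literal first value, even when it is missing (as with the Cabin and Age of group Q). The built-in `first()` method behaves differently: by default it returns the first non-null value. A small sketch of the difference on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv (not the real data)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'Q'],
    'Age': [22.0, 38.0, None, 30.0],   # first Q passenger has no recorded age
})
ages = toy.groupby('Embarked')['Age']

# Custom aggregation: literal first value of each group, missing or not
literal_first = ages.agg(lambda s: s.iloc[0])

# Built-in: first non-null value of each group
first_non_null = ages.first()

print(literal_first)
print(first_non_null)
```

Which behavior is appropriate depends on whether a missing first value is itself informative.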
