In [3]:
import pandas as pd
# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'
# Load data
dataframe = pd.read_csv(url)
dataframe.head(5)

,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


groupby is one of the most powerful features in pandas.

In [5]:
# Group rows by the values of the column 'Sex', calculate the mean
# of each numeric column in each group
dataframe.groupby('Sex').mean(numeric_only=True)


groupby is where data wrangling really starts to take shape. It is very common
to have a DataFrame where each row is a person or an event, and to want to
group those rows according to some criterion and then calculate a statistic
for each group. For example, imagine a DataFrame where each row is an
individual sale at a national restaurant chain and we want the total sales per
restaurant. We can accomplish this by grouping rows by individual restaurant
and then calculating the sum of each group.
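The restaurant example can be sketched as follows. This is a minimal illustration, not data from the Titanic dataset: the `sales` DataFrame, its `restaurant` and `amount` columns, and every value in it are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per individual sale
sales = pd.DataFrame({
    'restaurant': ['Austin', 'Boston', 'Austin', 'Chicago', 'Boston'],
    'amount': [12.50, 8.00, 20.00, 15.25, 7.00],
})

# Group rows by restaurant, then sum the sale amounts in each group
total_sales = sales.groupby('restaurant')['amount'].sum()
print(total_sales)
```

The result is a Series indexed by restaurant name, with one total per restaurant.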


What happens if we call groupby without an operation, as in
dataframe.groupby('Sex')? We get back a DataFrameGroupBy object rather than a
table of results. Why didn’t it return something more useful? The reason is that
groupby needs to be paired with some operation we want to apply to each group,
such as calculating an aggregate statistic (e.g., mean, median, sum). When
talking about grouping we often use shorthand and say “group by gender,” but
that is incomplete. For grouping to be useful, we need to group by something and
then apply a function to each of those groups.
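A small sketch of the difference between grouping alone and grouping plus an operation. The `people` DataFrame and its values are made up for illustration; only the column names `Sex` and `Age` mirror the Titanic data.

```python
import pandas as pd

# Hypothetical data with the same column names as the Titanic example
people = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [29.0, 30.0, 25.0, 47.0],
})

# groupby on its own only describes how the rows are split into groups
grouped = people.groupby('Sex')
grouped_type = type(grouped).__name__
print(grouped_type)

# Pairing the grouping with an operation produces an actual result
mean_age = grouped['Age'].mean()
print(mean_age)
```

Nothing is computed until an operation such as mean is applied, which is why the bare groupby object is not informative when displayed.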

In [10]:
dataframe.groupby('Sex').count()

Sex,Name,PClass,Age,Survived,SexCode
female,462,462,288,462,462
male,851,851,468,851,851


In [11]:
# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()

Survived
0    863
1    450
Name: Name, dtype: int64

Notice the ['Name'] added after the groupby call? It selects the column that the
operation is applied to, because particular summary statistics are only
meaningful for certain types of data. For example, while calculating the average
age by gender makes sense, calculating the total age by gender does not. In this
case we group the data into survived or not, then count the number of names
(i.e., passengers) in each group.
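The choice of column matters for count in particular, because count skips missing values. That is consistent with the earlier output, where the Age counts are lower than the Name counts. The sketch below uses a hypothetical `passengers` DataFrame with made-up values to contrast count with size, which counts rows regardless of missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two passengers have a missing age
passengers = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Survived': [0, 1, 1, 0],
    'Age': [29.0, np.nan, 30.0, np.nan],
})

# count() tallies only the non-missing values of the selected column
age_counts = passengers.groupby('Survived')['Age'].count()

# size() tallies the rows in each group, missing values included
group_sizes = passengers.groupby('Survived').size()

print(age_counts)
print(group_sizes)
```

Counting a column with no missing values, such as Name in the example above, therefore gives the number of rows per group.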

We can also group by a first column, then group that grouping by a second
column:

In [12]:
# Group rows, calculate mean
dataframe.groupby(['Sex','Survived'])['Age'].mean()

Sex     Survived
female  0           24.901408
        1           30.867143
male    0           32.320780
        1           25.951875
Name: Age, dtype: float64
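Grouping by two columns returns a Series with a two-level index, as in the output above. One way to read such a result more easily is to pivot the inner level into columns with unstack. The sketch below assumes a small hypothetical `people` DataFrame; its values are illustrative and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical data with the same column names as the Titanic example
people = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 0, 0, 1, 0],
    'Age': [29.0, 25.0, 30.0, 47.0, 40.0],
})

# Group by Sex, then by Survived within each Sex, and average Age
mean_age = people.groupby(['Sex', 'Survived'])['Age'].mean()

# Move the 'Survived' index level into columns: one row per Sex
mean_age_table = mean_age.unstack('Survived')
print(mean_age_table)
```

Individual values can still be read from the grouped Series with a tuple key, for example `mean_age[('male', 0)]`.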