# Grouping data


## 1. Grouping by multiple columns
In this exercise, you will return to working with the `Titanic` dataset using `.groupby()` to analyse the distribution of passengers who boarded the `Titanic`.

The `'pclass'` column identifies which class of ticket was purchased by the passenger and the `'embarked'` column indicates at which of the three ports the passenger boarded the Titanic. `'S'` stands for Southampton, England, `'C'` for Cherbourg, France and `'Q'` for Queenstown, Ireland.

Your job is to first group by the `'pclass'` column and count the number of rows in each class using the `'survived'` column. You will then group by the `'embarked'` and `'pclass'` columns and count the number of passengers.

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv("data/titanic.csv")
titanic.tail()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


In [12]:
# Group titanic by 'pclass'
by_class = titanic.groupby("pclass")
by_class.count()

pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,323,323,323,284,323,323,323,323,256,321,201,35,289
2,277,277,277,261,277,277,277,277,23,277,112,31,261
3,709,709,709,501,709,709,709,708,16,709,173,55,195


In [10]:
# Aggregate 'survived' column of by_class by count
count_by_class = by_class["survived"].count()

# Print count_by_class
count_by_class

pclass
1    323
2    277
3    709
Name: survived, dtype: int64

In [11]:
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(["embarked", "pclass"])
by_mult.count()

embarked,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,boat,body,home.dest
C,1,141,141,141,128,141,141,141,141,111,98,14,121
C,2,28,28,28,24,28,28,28,28,4,15,2,27
C,3,101,101,101,60,101,101,101,101,3,36,9,24
Q,1,3,3,3,3,3,3,3,3,3,2,1,3
Q,2,7,7,7,5,7,7,7,7,1,2,0,4
Q,3,113,113,113,42,113,113,113,113,1,34,6,30
S,1,177,177,177,151,177,177,177,177,140,99,20,164
S,2,242,242,242,232,242,242,242,242,18,95,29,230
S,3,495,495,495,399,495,495,495,494,12,103,40,141


In [9]:
# Aggregate 'survived' column of by_mult by count
count_mult = by_mult["survived"].count()

# Print count_mult
count_mult

embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         495
Name: survived, dtype: int64

Well done! Grouping your data by one or more columns and counting a single column, in this case `'survived'`, lets you compare group sizes directly. Here, for example, third-class passengers who boarded at Southampton form the largest group, with 495 passengers.
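
As a self-contained illustration of the same pattern, here is a minimal sketch on a small, made-up DataFrame (the `toy` data is hypothetical and only mimics the Titanic column names). It selects the column before aggregating, so only that column is counted.

```python
import pandas as pd

# Hypothetical toy data that mimics the Titanic columns used above
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "embarked": ["S", "C", "S", "S", "Q", "S"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Group by one column, then count the non-missing 'survived' values per group
count_by_class = toy.groupby("pclass")["survived"].count()

# Group by two columns: the result is indexed by (embarked, pclass) pairs
count_mult = toy.groupby(["embarked", "pclass"])["survived"].count()

print(count_by_class)
print(count_mult)
```

Grouping by a list of columns produces a Series with a multi-level index, so a single group can be looked up with a tuple such as `count_mult.loc[("S", 3)]`.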

## 2. Grouping by another series
In this exercise, you'll use two data sets from `Gapminder.org` to investigate the average life expectancy (in years) at birth in 2010 for the 6 continental regions. To do this you'll read the life expectancy data per country into one pandas DataFrame and the association between country and region into another.

By setting the index of both DataFrames to the country name, you'll then use the region information to group the countries in the life expectancy DataFrame and compute the mean value for 2010.

In [34]:
# Read life_fname into a DataFrame: life
life = pd.read_csv("data/life_expectancy.csv", index_col='Country')
life.info()

<class 'pandas.core.frame.DataFrame'>
Index: 204 entries, Afghanistan to Åland
Data columns (total 50 columns):
1964    202 non-null float64
1965    202 non-null float64
1966    202 non-null float64
1967    202 non-null float64
1968    202 non-null float64
1969    202 non-null float64
1970    202 non-null float64
1971    202 non-null float64
1972    202 non-null float64
1973    202 non-null float64
1974    202 non-null float64
1975    202 non-null float64
1976    202 non-null float64
1977    202 non-null float64
1978    202 non-null float64
1979    202 non-null float64
1980    202 non-null float64
1981    202 non-null float64
1982    202 non-null float64
1983    202 non-null float64
1984    202 non-null float64
1985    202 non-null float64
1986    202 non-null float64
1987    202 non-null float64
1988    202 non-null float64
1989    202 non-null float64
1990    202 non-null float64
1991    202 non-null float64
1992    202 non-null float64
1993    202 non-null float64
1994    202 non-nu

In [35]:
# Read regions_fname into a DataFrame: regions
regions = pd.read_csv("data/regions.csv", index_col="Country")
regions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 204 entries, Afghanistan to Åland
Data columns (total 1 columns):
region    204 non-null object
dtypes: object(1)
memory usage: 3.2+ KB


In [29]:
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions["region"])

# Print the mean over the '2010' column of life_by_region
life_by_region['2010'].mean()

region
America                       74.037350
East Asia & Pacific           73.405750
Europe & Central Asia         75.656387
Middle East & North Africa    72.805333
South Asia                    68.189750
Sub-Saharan Africa            57.575080
Name: 2010, dtype: float64

Great work! It looks like the average life expectancy (in years) at birth in 2010 was highest in Europe & Central Asia and lowest in Sub-Saharan Africa.
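
The key point in this exercise is that the grouping Series is matched to the DataFrame by index label, not by position. A minimal sketch with made-up values (the names `life_toy` and `region_toy` and all numbers are hypothetical) shows this by listing the countries in a different order in each object.

```python
import pandas as pd

# Hypothetical life expectancy values, indexed by country
life_toy = pd.DataFrame(
    {"2010": [70.0, 80.0, 60.0, 50.0]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)

# Hypothetical region lookup, deliberately listed in a different order
region_toy = pd.Series(
    ["North", "North", "South", "South"],
    index=pd.Index(["D", "C", "B", "A"], name="Country"),
    name="region",
)

# The Series is aligned to life_toy on the 'Country' index before grouping
mean_2010 = life_toy.groupby(region_toy)["2010"].mean()
print(mean_2010)
```

Countries `A` and `B` fall into `South` and `C` and `D` into `North`, even though the two objects list them in opposite order.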

## 3. Computing multiple aggregates of multiple columns
The `.agg()` method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.

In this exercise, you're going to group passengers on the Titanic by `'pclass'` and aggregate the `'age'` and `'fare'` columns by the functions `'max'` and `'median'`. You'll then use multi-level selection to find the oldest passenger per class and the median fare price per class.

In [37]:
# Group titanic by 'pclass': by_class
by_class = titanic.groupby("pclass")

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(["max", "median"])
aggregated

,age,age,fare,fare
pclass,max,median,max,median
1,80.0,39.0,512.3292,60.0
2,70.0,29.0,73.5,15.0458
3,74.0,24.0,69.55,8.05


In [40]:
# Print the maximum and median age in each class
aggregated.loc[:, "age"]

pclass,max,median
1,80.0,39.0
2,70.0,29.0
3,74.0,24.0


In [41]:
# Print the maximum age in each class
aggregated.loc[:, ("age", "max")]

pclass
1    80.0
2    70.0
3    74.0
Name: (age, max), dtype: float64

In [39]:
# Print the median fare in each class
aggregated.loc[:, ("fare", "median")]

pclass
1    60.0000
2    15.0458
3     8.0500
Name: (fare, median), dtype: float64

Fantastic work! It isn't surprising that the highest median fare was for the 1st passenger class.
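
To see the multi-level column index in isolation, here is a minimal sketch on hypothetical data (the `toy` values are invented for illustration). Each column of the result is addressed by a `(column, aggregation)` tuple.

```python
import pandas as pd

# Hypothetical data with the same column names as the exercise
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "age":    [40.0, 60.0, 20.0, 30.0],
    "fare":   [100.0, 50.0, 10.0, 20.0],
})

# A list of aggregations applied to two columns gives two-level columns
aggregated = toy.groupby("pclass")[["age", "fare"]].agg(["max", "median"])
print(aggregated.columns.tolist())

# Select one inner column with a (column, aggregation) tuple
max_age = aggregated.loc[:, ("age", "max")]
median_fare = aggregated[("fare", "median")]
print(max_age)
print(median_fare)
```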

## 4. Aggregating on index levels/fields
If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.

In this exercise you'll use the full `Gapminder` dataset which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013.

Your job is to create a DataFrame with a multi-level row index built from the columns `'Year'`, `'region'` and `'Country'`. Next you'll group the DataFrame by the `'Year'` and `'region'` levels. Finally, you'll apply a dictionary aggregation to compute the total population, spread of per capita GDP values and average child mortality rate.

In [94]:
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv("data/gapminder_tidy.csv", index_col = ["Year", "region", "Country"]).sort_index()
gapminder.tail()

Year,region,Country,fertility,life,population,child_mortality,gdp
2013,Sub-Saharan Africa,Tanzania,5.214,61.53,49153002.0,52.6,2382.0
2013,Sub-Saharan Africa,Togo,4.639,56.537,6412560.0,83.3,1346.0
2013,Sub-Saharan Africa,Uganda,5.867,59.209,36759274.0,65.1,1621.0
2013,Sub-Saharan Africa,Zambia,5.687,58.105,14314515.0,72.8,3800.0
2013,Sub-Saharan Africa,Zimbabwe,3.486,59.871,13327925.0,83.3,1773.0


In [59]:
# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level = ["Year", "region"])

In [60]:
# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

In [61]:
# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)

# Print the last 6 entries of aggregated 
aggregated.tail(6)

Year,region,population,child_mortality,gdp
2013,America,962908700.0,17.745833,49634.0
2013,East Asia & Pacific,2244209000.0,22.285714,134744.0
2013,Europe & Central Asia,896878800.0,9.831875,86418.0
2013,Middle East & North Africa,403050400.0,20.2215,128676.0
2013,South Asia,1701241000.0,46.2875,11469.0
2013,Sub-Saharan Africa,920599600.0,76.94449,32035.0


Excellent work! Are you able to see any correlations between `population`, `child_mortality`, and `gdp`?
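
The combination of `level=` grouping and a dictionary of aggregations can be tried on a much smaller table. The sketch below uses invented numbers (the `toy` frame is hypothetical) and mixes a built-in aggregation name with the custom `spread` function.

```python
import pandas as pd

# Hypothetical data with a three-level row index, as in the exercise
toy = pd.DataFrame({
    "Year":       [2000, 2000, 2000, 2001, 2001, 2001],
    "region":     ["East", "East", "West", "East", "East", "West"],
    "Country":    ["A", "B", "C", "A", "B", "C"],
    "population": [10, 20, 30, 11, 21, 31],
    "gdp":        [100.0, 300.0, 500.0, 110.0, 330.0, 550.0],
}).set_index(["Year", "region", "Country"]).sort_index()

def spread(series):
    """Return the range (max minus min) of a Series."""
    return series.max() - series.min()

# Group on two index levels and aggregate each column differently
summary = toy.groupby(level=["Year", "region"]).agg(
    {"population": "sum", "gdp": spread}
)
print(summary)
```

A region with a single country, such as `West` here, has a spread of zero because its maximum and minimum coincide.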

## 5. Grouping on a function of the index
Groupby operations can also be performed on transformations of the index values. In the case of a `DatetimeIndex`, we can extract portions of the datetime over which to group.

In this exercise you'll read in a set of sample sales data from February 2015 and assign the `'Date'` column as the index. Your job is to group the sales data by the day of the week and aggregate the sum of the `'Units'` column.

Is there a day of the week that is more popular for customers? To find out, you're going to use `.strftime('%a')` to transform the index datetime values to abbreviated days of the week.

In [66]:
# Load the sales dataframe
sales= pd.read_csv("data/sales/sales-feb-2015.csv", index_col="Date", parse_dates=True)
sales.tail()

Date,Company,Product,Units
2015-02-19 16:00:00,Mediacore,Service,10
2015-02-21 05:00:00,Mediacore,Software,3
2015-02-21 20:30:00,Hooli,Hardware,3
2015-02-25 00:30:00,Initech,Service,10
2015-02-26 09:00:00,Streeplex,Service,4


In [69]:
# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime("%a"))

# Create sum: units_sum
units_sum = by_day["Units"].sum()

# Print units_sum
units_sum

Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64

Well done! It looks like Monday, Wednesday, and Thursday were the most popular days for customers!
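
The same technique works on any `DatetimeIndex`. Here is a minimal sketch with three invented sales rows (the `toy_sales` data is hypothetical). Note that `%a` produces locale-dependent names, so the labels `Mon` and `Tue` assume an English locale.

```python
import pandas as pd

# Hypothetical sales rows: 2015-02-02 and 2015-02-09 are Mondays, 2015-02-03 a Tuesday
idx = pd.to_datetime(["2015-02-02 08:30", "2015-02-03 14:00", "2015-02-09 09:00"])
toy_sales = pd.DataFrame({"Units": [3, 5, 4]}, index=idx)

# strftime returns an Index of strings with the same length as the DataFrame,
# which can be passed straight to groupby
by_day = toy_sales.groupby(toy_sales.index.strftime("%a"))
units_by_day = by_day["Units"].sum()
print(units_by_day)
```

Because the groups are plain strings, the result is sorted alphabetically rather than in calendar order, which is why `Thu` appears before `Tue` in the output above.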

## 6. Detecting outliers with Z-Scores
After grouping, the `.transform()` method applies a function such as `zscore` to each group independently and returns a result with the same index as the input. The z-score is also useful for finding outliers: **a value whose z-score is above 3 or below -3 is generally considered an outlier.**

In this example, you're going to normalize the `Gapminder` data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, you will filter for countries that have high fertility rates and low life expectancy for their region.

In [95]:
# Reset current Index
gapminder.reset_index(inplace=True)

In [96]:
# Preparing gapminder dataframe for the exercise
gapminder_2010 = gapminder[gapminder["Year"] == 2010].set_index("Country")
gapminder_2010.tail()

Country,Year,region,fertility,life,population,child_mortality,gdp
Tanzania,2010,Sub-Saharan Africa,5.428,59.224,44841226.0,62.3,2143.0
Togo,2010,Sub-Saharan Africa,4.792,55.514,6027798.0,90.9,1246.0
Uganda,2010,Sub-Saharan Africa,6.155,57.319,33424683.0,79.5,1515.0
Zambia,2010,Sub-Saharan Africa,5.813,54.549,13088570.0,84.8,3451.0
Zimbabwe,2010,Sub-Saharan Africa,3.721,53.684,12571454.0,95.1,1484.0


In [99]:
# Import zscore
from scipy.stats import zscore

# Group gapminder_2010: standardized
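# Select the two columns with a list; selecting with a bare tuple is not supported in recent pandas
standardized = gapminder_2010.groupby("region")[['life', 'fertility']].transform(zscore)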

In [102]:
# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
gm_outliers

Country,Year,region,fertility,life,population,child_mortality,gdp
Guatemala,2010,America,3.974,71.1,14388929.0,34.5,6849.0
Haiti,2010,America,3.35,45.0,9993247.0,208.8,1518.0
Timor-Leste,2010,East Asia & Pacific,6.237,65.952,1124355.0,63.8,1777.0
Tajikistan,2010,Europe & Central Asia,3.78,66.83,6878637.0,52.6,2110.0


Wonderful work! Using z-scores like this is a great way to identify outliers in your data.
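
To make the per-group standardization concrete, the sketch below computes z-scores by hand on invented data, without SciPy. The helper `zscore_pop` is a hypothetical stand-in that uses the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`. The threshold of -1 is chosen only because groups of three values cannot reach a z-score of 3.

```python
import pandas as pd

# Hypothetical life expectancy values for six countries in two regions
toy = pd.DataFrame(
    {
        "region": ["East"] * 3 + ["West"] * 3,
        "life":   [70.0, 72.0, 74.0, 50.0, 60.0, 70.0],
    },
    index=pd.Index(list("ABCDEF"), name="Country"),
)

def zscore_pop(series):
    """Standardize a Series using the population standard deviation (ddof=0)."""
    return (series - series.mean()) / series.std(ddof=0)

# transform returns one value per input row, aligned to toy's index
standardized = toy.groupby("region")[["life"]].transform(zscore_pop)

# Boolean indexing: rows that are low relative to their own region
low_life = toy.loc[standardized["life"] < -1]
print(low_life)
```

Country `A` is flagged even though its life expectancy exceeds every value in `West`, because each value is compared only with the mean and spread of its own region.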

## 7. Filling missing data (imputation) by group
Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). So far, you have practiced using the `.dropna()` method to drop missing values. Now, you will practice imputing missing values. You can use `.groupby()` and `.transform()` to fill missing data appropriately for each group.

Your job is to fill in missing `'age'` values for passengers on the Titanic with the median age of passengers of the same `'sex'` and `'pclass'`. To do this, you'll group by the `'sex'` and `'pclass'` columns and transform each group with a custom function to call `.fillna()` and impute the median value.

Try `titanic.tail(10)`. Notice in particular the `NaNs` in the `'age'` column.

In [103]:
# Inspecting data
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


In [104]:
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(["sex", "pclass"])

In [105]:
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

In [106]:
# Impute age and assign to titanic['age']
titanic["age"] = by_sex_class["age"].transform(impute_median)

# Print the output of titanic.tail(10)
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,25.0,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,25.0,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,22.0,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


Well done! Imputing missing values from comparable rows is often preferable to dropping them entirely, because it keeps the rest of each row available for analysis.
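
Here is a minimal sketch of group-wise imputation on invented data (the `toy` frame is hypothetical). It writes the result to a new column, `age_filled`, so the original and imputed values can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: one missing age in each (sex, pclass) group
toy = pd.DataFrame({
    "sex":    ["male", "male", "male", "female", "female", "female"],
    "pclass": [3, 3, 3, 1, 1, 1],
    "age":    [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

def impute_median(series):
    """Fill missing values with the median of the non-missing values."""
    return series.fillna(series.median())

# Each group is filled with its own median; the result aligns with toy's index
toy["age_filled"] = toy.groupby(["sex", "pclass"])["age"].transform(impute_median)
print(toy)
```

The missing male age becomes 25.0 and the missing female age becomes 45.0, the medians of their respective groups.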



## 8. Other transformations with .apply
The `.apply()` method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The `.apply()` method will then combine the results in an intelligent way.

In this exercise, you're going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you'll define a function to compute the aggregate spread of per capita GDP in each region and the individual country's z-score of the regional per capita GDP. You'll then select three countries (United States, United Kingdom and China) to see a summary of the regional GDP and that country's z-score against the regional mean.

In [111]:
# The following function has been defined for use:

def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})

In [114]:
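# Group gapminder_2010 by 'region': regional
# group_keys=False keeps 'Country' as the only index level in the result of .apply()
regional = gapminder_2010.groupby("region", group_keys=False)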

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
reg_disp.loc[['United States','United Kingdom','China']]

Country,z(gdp),regional spread(gdp)
United States,3.013374,47855.0
United Kingdom,0.572873,89037.0
China,-0.432756,96993.0
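
The following sketch repeats the `disparity` idea on invented data (the `toy` frame and its values are hypothetical). It passes `group_keys=False` and selects the `'gdp'` column explicitly, on the assumption that this keeps the result indexed by country alone across pandas versions.

```python
import pandas as pd

# Hypothetical per capita GDP values for five countries in two regions
toy = pd.DataFrame(
    {
        "region": ["East", "East", "East", "West", "West"],
        "gdp":    [10.0, 20.0, 30.0, 100.0, 300.0],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="Country"),
)

def disparity(gr):
    """Return each row's z-score of 'gdp' and the group's gdp range."""
    s = gr["gdp"].max() - gr["gdp"].min()                 # one scalar per group
    z = (gr["gdp"] - gr["gdp"].mean()) / gr["gdp"].std()  # one value per row
    # The scalar s is broadcast to every row of the group
    return pd.DataFrame({"z(gdp)": z, "regional spread(gdp)": s})

reg_disp = toy.groupby("region", group_keys=False)[["gdp"]].apply(disparity)
print(reg_disp.loc[["A", "E"]])
```

Unlike `.agg()` or `.transform()`, the function passed to `.apply()` receives the whole group as a DataFrame, so it can return both a per-row value and a per-group value at once.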


## 9. Grouping and filtering with .apply()
By using `.apply()`, you can write functions that filter rows within groups. The `.apply()` method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.

In this exercise you'll take the Titanic data set and analyze survival rates from the `'C'` deck, which contained the most passengers. To do this you'll group the dataset by `'sex'` and then use the `.apply()` method on a provided user defined function which calculates the mean survival rates on the `'C'` deck:

In [115]:
def c_deck_survival(gr):

    c_passengers = gr['cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'survived'].mean()

In [117]:
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby("sex")

# Call by_sex.apply with the function c_deck_survival
c_surv_by_sex = by_sex.apply(c_deck_survival)

# Print the survival rates
c_surv_by_sex

sex
female    0.913043
male      0.312500
dtype: float64

Excellent work! It looks like female passengers on the `'C'` deck had a much higher chance of surviving!
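
A minimal sketch of the same filter-then-aggregate pattern on invented data follows (the `toy` frame is hypothetical). It uses the `na=False` argument of `.str.startswith()` as an alternative to `.fillna(False)` for treating missing cabins as not on the `'C'` deck.

```python
import pandas as pd

# Hypothetical passengers; one cabin value is missing
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "male"],
    "cabin":    ["C12", None, "C3", "B5", "C7"],
    "survived": [1, 0, 0, 1, 1],
})

def c_deck_survival(gr):
    """Mean of 'survived' for the rows of gr whose cabin starts with 'C'."""
    on_c_deck = gr["cabin"].str.startswith("C", na=False)
    return gr.loc[on_c_deck, "survived"].mean()

# The function returns one scalar per group, so the result is a Series
rates = toy.groupby("sex")[["cabin", "survived"]].apply(c_deck_survival)
print(rates)
```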



## 10. Grouping and filtering with .filter()
You can use groupby with the `.filter()` method to remove whole groups of rows from a DataFrame based on a boolean condition.

In this exercise, you'll take the February sales data and remove entries from companies that purchased 35 or fewer units in the whole month.

First, you'll compute how many units each company bought, so you can verify the result. Next, you'll use the `.filter()` method after grouping by `'Company'` to remove all rows belonging to companies whose total `'Units'` was less than or equal to 35. Finally, verify that the three companies whose total units purchased were 35 or fewer have been filtered out of the DataFrame.

In [121]:
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('data/sales/sales-feb-2015.csv', index_col='Date', parse_dates=True)
sales.tail()

Date,Company,Product,Units
2015-02-19 16:00:00,Mediacore,Service,10
2015-02-21 05:00:00,Mediacore,Software,3
2015-02-21 20:30:00,Hooli,Hardware,3
2015-02-25 00:30:00,Initech,Service,10
2015-02-26 09:00:00,Streeplex,Service,4


In [122]:
# Group sales by 'Company': by_company
by_company = sales.groupby("Company")

In [126]:
# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company["Units"].sum()
by_com_sum

Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64

In [129]:
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g["Units"].sum() > 35)
by_com_filt

Date,Product,Units
2015-02-02 21:00:00,Hardware,9
2015-02-04 15:30:00,Software,13
2015-02-09 09:00:00,Service,19
2015-02-09 13:00:00,Software,7
2015-02-19 11:00:00,Hardware,16
2015-02-19 16:00:00,Service,10
2015-02-21 05:00:00,Software,3
2015-02-26 09:00:00,Service,4
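
To check the behaviour of `.filter()` on a case small enough to verify by hand, here is a minimal sketch with invented purchases (the company labels and unit counts are hypothetical).

```python
import pandas as pd

# Hypothetical purchases: totals are X = 40, Y = 10, Z = 36
toy = pd.DataFrame({
    "Company": ["X", "X", "Y", "Z", "Z"],
    "Units":   [20, 20, 10, 30, 6],
})

# The function is called once per group and must return a single True or False;
# every row of a group is kept or dropped together
kept = toy.groupby("Company").filter(lambda g: g["Units"].sum() > 35)
print(kept)
```

Note that the individual rows are returned unchanged: the row for `Z` with 6 units survives because the decision is made on the group total, not on each row.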


## 11. Filtering and grouping with .map()
You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function or transformation of a column, which you pass to `.groupby()` as a Series. The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.

In this exercise your job is to investigate survival rates of passengers on the Titanic by `'age'` and `'pclass'`. In particular, the goal is to find out what fraction of children under 10 survived in each `'pclass'`. You'll do this by first creating a Boolean Series where `True` marks passengers under 10 years old and `False` marks everyone else (passengers aged 10 or over, and any whose age is missing, since a comparison with `NaN` is `False`). You'll use `.map()` to change these values to strings.

Finally, you'll group by the under 10 series and the `'pclass'` column and aggregate the `'survived'` column. The `'survived'` column has the value 1 if the passenger survived and 0 otherwise. The mean of the `'survived'` column is the fraction of passengers who lived.

In [131]:
# Create the Boolean Series: under10
under10 = (titanic.age < 10).map({True: "under 10", False: "over 10"})
under10[-5: -1]

1304    over 10
1305    over 10
1306    over 10
1307    over 10
Name: age, dtype: object

In [132]:
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)["survived"].mean()
survived_mean_1

age
over 10     0.366748
under 10    0.609756
Name: survived, dtype: float64

In [133]:
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, "pclass"])["survived"].mean()
survived_mean_2

age       pclass
over 10   1         0.617555
          2         0.380392
          3         0.238897
under 10  1         0.750000
          2         1.000000
          3         0.446429
Name: survived, dtype: float64

Excellent work! It looks like passengers under the age of 10 had a higher survival rate than older passengers, both overall and within each class.
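
Mixing a derived Series with a column name in one `.groupby()` call can be sketched on a handful of invented rows (the `toy` data and the label `"10 or over"` are hypothetical choices for illustration).

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "age":      [5.0, 8.0, 30.0, 40.0, 9.0, 50.0],
    "pclass":   [1, 3, 1, 3, 3, 3],
    "survived": [1, 0, 1, 0, 1, 0],
})

# A Boolean Series mapped to readable labels; it shares toy's index
under10 = (toy["age"] < 10).map({True: "under 10", False: "10 or over"})

# Group by the derived Series and by the 'pclass' column at the same time
rate = toy.groupby([under10, "pclass"])["survived"].mean()
print(rate)
```

The first index level takes its name, `age`, from the Series that produced it, which is why the output in this section is labelled `age` rather than `under10`.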

