# Grouping Basics
*Curtis Miller*

In this notebook we look at how to form groups with `groupby` and how to apply some basic operations to them.

Let's load in the population pyramid dataset and get started.

In [None]:
import pandas as pd
from pandas import Series, DataFrame
%matplotlib inline

In [None]:
pop_pyramids = pd.read_csv("PopPyramids.csv", index_col=["Country", "Year", "Age"])
# I want a version including only male/female populations and no "Total" rows
pop_pyramids = pop_pyramids.loc[:, ["Male Population", "Female Population"]].drop("Total", axis=0, level="Age")
pop_pyramids.columns = pd.Index(["Male", "Female"])
pop_pyramids.sort_index(inplace=True)    # Can't do slicing without this
pop_pyramids.head()

In [None]:
# A version only including 2016 data
pop_pyramids_16 = pop_pyramids.loc[(slice(None), 2016), :]
pop_pyramids_16.index = pop_pyramids_16.index.droplevel("Year")    # Redundant level
pop_pyramids_16.head()

In [None]:
# Store data in columns, not the index
pp16_nomulti = pop_pyramids_16.reset_index()
pp16_nomulti.head()

In [None]:
# Go to long-form format
pp16_longform = pd.melt(pp16_nomulti,                  # DataFrame we're melting
                        id_vars=["Country", "Age"],    # The ID variables; the rest will be melted
                        var_name="Sex",                # The name of the column "variable" is now "Sex"
                        value_name="Population")       # The name of the column "value" is now "Population"
pp16_longform.head()
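
To make the wide-to-long reshaping concrete, here is a minimal sketch on a tiny made-up table. The `wide` DataFrame and its values are hypothetical and are not taken from `PopPyramids.csv`. It only illustrates how `pd.melt` stacks the `Male` and `Female` columns into `Sex`/`Population` pairs.

```python
import pandas as pd

# Hypothetical wide-form data: one row per (Country, Age), one column per sex
wide = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Age": ["0-4", "5-9", "0-4", "5-9"],
    "Male": [10, 12, 20, 22],
    "Female": [11, 13, 21, 23],
})

# Every column not listed in id_vars is melted into (Sex, Population) pairs
long_form = pd.melt(wide,
                    id_vars=["Country", "Age"],
                    var_name="Sex",
                    value_name="Population")

print(long_form)
```

Each original row appears once per melted column, so the four wide rows become eight long rows.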

Here I create several groupings and briefly demonstrate how they can be used.

In [None]:
# We can create a group for:
agegroup = pp16_longform.groupby("Age")    # Age groups
agegroup

In [None]:
countrygroup = pp16_longform.groupby("Country")    # Countries
sexgroup = pp16_longform.groupby("Sex")    # Sex
agesexgroup = pp16_longform.groupby(["Age", "Sex"])    # Age AND Sex

In [None]:
# See groups
sexgroup.groups
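
The `groups` attribute can be hard to read on a large dataset, so here is a small sketch on a made-up table (the `toy` DataFrame is hypothetical). It shows that `groups` maps each group name to the row labels belonging to that group, and that grouping by a list of columns creates one group per combination of values.

```python
import pandas as pd

# Hypothetical long-form data with a default integer index (0..3)
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female"],
    "Age": ["0-4", "0-4", "5-9", "5-9"],
    "Population": [10, 11, 12, 13],
})

by_sex = toy.groupby("Sex")

# .groups maps each group name to the index labels of its rows
group_rows = {name: list(labels) for name, labels in by_sex.groups.items()}
print(group_rows)
print(by_sex.ngroups)

# Grouping by two columns gives one group per (Age, Sex) combination
by_age_sex = toy.groupby(["Age", "Sex"])
print(by_age_sex.size())
```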

In [None]:
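# Total population per country, just to see what grouping does
# numeric_only=True keeps sum() from concatenating the string columns (Age, Sex)
countrygroup.sum(numeric_only=True).sort_values("Population", ascending=False)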

In [None]:
sexgroup.sum(numeric_only=True)    # Sum only the numeric Population column

In [None]:
# Age-group labels in their natural order (sorted as strings, '10-14' would come before '5-9')
agegrpvec = pd.Categorical(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
                            '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94',
                            '95-99', '100+'])    # Categorical data; the values keep the order given here

In [None]:
agegroup.sum(numeric_only=True).loc[agegrpvec]

In [None]:
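# Series.plot() accepts only keyword arguments, so the plot type is passed as kind=
agegroup.sum(numeric_only=True).loc[agegrpvec, "Population"].plot(kind="bar")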
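
The reordering step matters because group keys come back sorted as strings. The sketch below uses a made-up table (hypothetical `toy` data) and a plain list with `reindex`, which is a different technique from the `.loc` lookup used in the notebook's own cells, to show the default order and one way to restore the natural one.

```python
import pandas as pd

# Hypothetical data with age labels that sort badly as strings
toy = pd.DataFrame({
    "Age": ["10-14", "0-4", "5-9", "0-4", "10-14", "5-9"],
    "Population": [3, 1, 2, 10, 30, 20],
})

totals = toy.groupby("Age")["Population"].sum()
default_order = list(totals.index)    # string sort: '10-14' lands before '5-9'
print(default_order)

# Reindex with the labels in the order we actually want
age_order = ["0-4", "5-9", "10-14"]
ordered = totals.reindex(age_order)
print(ordered)
```

The reordered Series can then be plotted with `ordered.plot(kind="bar")`, as in the cell above.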

In [None]:
agesexgroup.sum(numeric_only=True)

In [None]:
# We can also group using a hierarchical index
yeargroup = pop_pyramids.groupby(level="Year")
yeargroup.sum()

In [None]:
yeargroup.sum().sum(axis=1)    # Yearly populations

In [None]:
yeargroup.sum().sum(axis=1).plot()

In [None]:
yearcountrygroup = pop_pyramids.groupby(level=["Year", "Country"])
yearcountrygroup.sum()

In [None]:
yearcountrygroup.sum().sum(axis=1)

In [None]:
yearcountrygroup.sum().sum(axis=1).loc[:, "UnitedStates"]    # US annual populations
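
For readers without the dataset at hand, the level-based grouping above can be sketched on a small made-up frame. The `toy` DataFrame, its country codes `"A"` and `"B"`, and its values are all hypothetical. It uses `xs` to pick out one country, which is an alternative to the `.loc[:, ...]` selection in the cell above.

```python
import pandas as pd

# Hypothetical data indexed like pop_pyramids: (Country, Year, Age)
idx = pd.MultiIndex.from_product(
    [["A", "B"], [2015, 2016], ["0-4", "5-9"]],
    names=["Country", "Year", "Age"],
)
toy = pd.DataFrame({"Male": range(8), "Female": range(8, 16)}, index=idx)

# Group on one index level, then add Male + Female to get yearly totals
yearly_total = toy.groupby(level="Year").sum().sum(axis=1)
print(yearly_total)

# Group on two levels, then select a single country's yearly totals
year_country_total = toy.groupby(level=["Year", "Country"]).sum().sum(axis=1)
country_a = year_country_total.xs("A", level="Country")
print(country_a)
```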

Once we have a grouping, we can also iterate over its groups. Each iteration yields the group's name along with a `DataFrame` containing that group's rows.

In [None]:
for name, data in sexgroup:
    print(name)
    print(data.head())
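
As a small follow-up sketch on made-up data (the `toy` DataFrame is hypothetical), iteration can be used to compute something per group by hand. When only one group is needed, `get_group` retrieves it directly without a loop.

```python
import pandas as pd

# Hypothetical long-form data
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female"],
    "Population": [10, 11, 12, 13],
})

# Each iteration yields the group name and a DataFrame of that group's rows
group_totals = {}
for name, data in toy.groupby("Sex"):
    group_totals[name] = int(data["Population"].sum())
print(group_totals)

# Fetch a single group by name instead of looping
females = toy.groupby("Sex").get_group("Female")
print(females)
```

In practice the loop above is equivalent to `toy.groupby("Sex")["Population"].sum()`. Explicit iteration is mainly useful when each group needs custom handling.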