## More on `agg` Function

Besides `.agg`, there is a whole host of other methods we can call on a grouped DataFrame. Some useful options are:

* [`.max`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.max.html): creates a new DataFrame with the maximum value of each group
* [`.mean`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html): creates a new DataFrame with the mean value of each group
* [`.size`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.size.html): creates a new Series with the number of entries in each group
* [`.filter`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html): creates a copy of the original DataFrame, keeping only the rows from subframes that obey a provided condition
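To make the four methods above concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are hypothetical stand-ins, not the real `babynames` data.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames`
toy = pd.DataFrame({
    "Year":  [2020, 2020, 2021, 2021, 2021],
    "Count": [10, 30, 5, 15, 40],
})

grouped = toy.groupby("Year")

max_per_year = grouped[["Count"]].max()    # DataFrame: largest Count in each year
mean_per_year = grouped[["Count"]].mean()  # DataFrame: average Count in each year
rows_per_year = grouped.size()             # Series: number of rows in each year

# Keep only the rows that belong to years whose total Count exceeds 50.
# The function receives one subframe at a time and must return True or False.
big_years = grouped.filter(lambda sf: sf["Count"].sum() > 50)
```

Note that `.filter` returns rows of the original DataFrame rather than one row per group, which is why it behaves differently from the three aggregation methods.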

## Joining Tables 

When working on data science projects, we're unlikely to have absolutely all the data we want contained in a single DataFrame – a real-world data scientist needs to grapple with data coming from multiple sources. If we have access to multiple datasets with related information, we can join two or more tables into a single DataFrame. 

To put this into practice, we'll revisit the `elections` dataset from last lecture.


## Aggregating Data with Pivot Tables

We know now that `.groupby` gives us the ability to group and aggregate data across our DataFrame. The examples above formed groups using just one column in the DataFrame. It's possible to group by multiple columns at once by passing in a list of column names to `.groupby`. 

Let's find the total number of baby names associated with each sex for each year in `babynames`. To do this, we'll group by *both* the `"Year"` and `"Sex"` columns.

In [None]:
#| code-fold: false
# Find the total number of baby names associated with each sex for each year in the data
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)

Notice that both `"Year"` and `"Sex"` serve as the index of the DataFrame (they are both rendered in bold). We've created a *multi-index* (a `MultiIndex` in `pandas`), where two different index values, the year and sex, are used together to uniquely identify each row. 

This isn't the most intuitive way of representing this data. Because multi-indexes have multiple levels in their index, they can often be difficult to use. 
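As a rough illustration of why a multi-index takes extra care, the sketch below groups a small made-up DataFrame by two columns. The name `toy` and its values are hypothetical and are not taken from `babynames`.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames`
toy = pd.DataFrame({
    "Year":  [2020, 2020, 2020, 2021, 2021],
    "Sex":   ["F", "F", "M", "F", "M"],
    "Count": [10, 30, 20, 5, 15],
})

totals = toy.groupby(["Year", "Sex"])[["Count"]].sum()

# Each row label is a (Year, Sex) pair, so selecting one row needs a tuple
f_2020 = totals.loc[(2020, "F"), "Count"]

# reset_index moves both index levels back into ordinary columns
flat = totals.reset_index()
```

Calling `.reset_index()` is one way to get back to a flat table when the tuple-based lookups become unwieldy.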

Another strategy to aggregate across two columns is to create a pivot table. You saw these back in [Data 8](https://inferentialthinking.com/chapters/08/3/Cross-Classifying_by_More_than_One_Variable.html#pivot-tables-rearranging-the-output-of-group). One set of values is used to create the index of the table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.

The best way to understand pivot tables is to see one in action. Let's return to our original goal of summing the total number of names associated with each combination of year and sex. We'll call the `pandas` [`.pivot_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) method to create a new table.

In [None]:
#| code-fold: false
# The `pivot_table` method is used to generate a Pandas pivot table
babynames.pivot_table(index = "Year", columns = "Sex", values = "Count", aggfunc = sum).head(5)

Looks a lot better! Now, our DataFrame is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of `"Year"` and `"Sex"`.

Let's take a closer look at the code implemented above. 

* `index = "Year"` specifies the column name in the original DataFrame that should be used as the index of the pivot table
* `columns = "Sex"` specifies the column name in the original DataFrame that should be used to generate the columns of the pivot table
* `values = "Count"` indicates what values from the original DataFrame should be used to populate the entry for each index-column combination
* `aggfunc = sum` tells `pandas` what function to use when aggregating the data specified by `values`. Here, we are `sum`ming the name counts for each pair of `"Year"` and `"Sex"`
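The four arguments can be seen working together on a small made-up DataFrame. This is a sketch only: `toy` and its values are hypothetical, and it passes the string `"sum"` as `aggfunc`, which `pandas` accepts as the name of an aggregation.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames`
toy = pd.DataFrame({
    "Year":  [2020, 2020, 2020, 2021, 2021],
    "Sex":   ["F", "F", "M", "F", "M"],
    "Count": [10, 30, 20, 5, 15],
})

pivot = toy.pivot_table(
    index="Year",     # one row per distinct year
    columns="Sex",    # one column per distinct sex
    values="Count",   # the numbers that fill each cell
    aggfunc="sum",    # how to combine rows that share a (Year, Sex) pair
)
```

In this toy table, the two 2020 rows with sex `"F"` are combined into a single cell holding their summed count.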


In [None]:
elections = pd.read_csv("data/elections.csv")
elections.head(5)

Say we want to understand the 2023 popularity of the names of each presidential candidate. To do this, we'll need the combined data of `babynames` *and* `elections`. 

We'll start by creating a new column containing the first name of each presidential candidate. This will help us join each name in `elections` to the corresponding name data in `babynames`. 

In [None]:
#| code-fold: false
# This `str` operation splits each candidate's full name at each 
# blank space, then takes just the candidate's first name
elections["First Name"] = elections["Candidate"].str.split().str[0]
elections.head(5)
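To see what each step of that chained `str` operation does, here is a small sketch on a made-up Series. The names in `full_names` are hypothetical examples and are not rows from `elections`.

```python
import pandas as pd

# Hypothetical full names, used only to illustrate the string operations
full_names = pd.Series(["Ada Lovelace", "Alan Mathison Turing"])

# .str.split() with no arguments splits on whitespace, giving a list per row
pieces = full_names.str.split()

# .str[0] then takes the first element of each list
first_names = pieces.str[0]
```

Because only the first element is kept, middle names and surnames are discarded regardless of how many words the full name contains.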

Now, we're ready to join the two tables. [`pd.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) is the `pandas` function used to join DataFrames together. The `left` and `right` parameters specify the DataFrames to be joined. The `left_on` and `right_on` parameters take the names (as strings) of the columns to use when performing the join. Together, these two `on` parameters tell `pandas` which values should act as pairing keys: rows from the two DataFrames are merged when their key values match. We'll talk more about this idea of a pairing key next lecture.
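A minimal sketch of such a join on two made-up tables is shown below. The DataFrames `toy_elections` and `toy_names`, along with every value in them, are hypothetical. The real join would use `elections` and `babynames` instead.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, with a "First Name" column
toy_elections = pd.DataFrame({
    "Candidate":  ["Ada Lovelace", "Alan Turing"],
    "First Name": ["Ada", "Alan"],
})

# Hypothetical stand-in for `babynames`
toy_names = pd.DataFrame({
    "Name":  ["Ada", "Alan", "Grace"],
    "Count": [120, 340, 560],
})

# Pair rows whose "First Name" (left) equals "Name" (right)
merged = pd.merge(
    left=toy_elections,
    right=toy_names,
    left_on="First Name",
    right_on="Name",
)
```

With the default settings, `pd.merge` performs an inner join, so the `"Grace"` row, which has no matching key in the left table, does not appear in `merged`.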