<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span><ul class="toc-item"><li><span><a href="#Aggregate-Data" data-toc-modified-id="Aggregate-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Aggregate Data</a></span><ul class="toc-item"><li><span><a href="#GroupBy" data-toc-modified-id="GroupBy-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>GroupBy</a></span></li></ul></li></ul></li></ul></div>

# Overview
So far, we've learned how to use the pandas library and how to create visualizations with data sets that didn't require much cleanup. However, most data sets in real life require extensive cleaning and manipulation to extract any meaningful insights. In fact, Forbes estimates that data scientists spend about 60% of their time cleaning and organizing data, so it's critical to be able to manipulate data quickly and efficiently.

In this course, we'll learn the following:

- Data aggregation
- How to combine data
- How to transform data
- How to clean strings with pandas
- How to handle missing and duplicate data

You'll need some basic knowledge of pandas and matplotlib to complete this course, including:

- Basic knowledge of pandas dataframes and series
- How to select values and filter a dataframe
- Knowledge of data exploration methods in pandas, such as the info and head methods
- How to create visualizations in pandas and matplotlib

Throughout this course, we'll work to answer the following questions:

- How can aggregating the data give us more insight into happiness scores?
- How did world happiness change from 2015 to 2017?
- Which factors contribute the most to the happiness score?

In this mission, we'll start by learning how to aggregate data. Then in the following missions, we'll learn different data cleaning skills that can help us aggregate and analyze the data in different ways. We'll start by learning each topic in isolation, but build towards a more complete data cleaning workflow by the end of the course.

## Aggregate Data
In this mission, we'll learn how to perform different kinds of **aggregations**, which apply a statistical operation to groups of our data, and how to visualize the results.

Recall that in the Pandas Fundamentals course, we learned a way to use loops for aggregation. Our process looked like this:

- Identify each unique group in the data set.
- For each group:
    - Select only the rows corresponding to that group.
    - Calculate the average for those rows.

Let's use the same process to find the mean happiness score for each region.
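The loop-based process above can be sketched as follows. This is a minimal illustration, not the course's exact solution: the small `happiness2015` frame is a stand-in built by hand, its scores are placeholder values rather than the real 2015 data, and the column name `Happiness Score` is assumed.

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

mean_happiness = {}
# Identify each unique group, then process one group at a time.
for region in happiness2015['Region'].unique():
    # Select only the rows corresponding to that group.
    region_rows = happiness2015[happiness2015['Region'] == region]
    # Calculate the average for those rows.
    mean_happiness[region] = region_rows['Happiness Score'].mean()

print(mean_happiness)
```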

### GroupBy
Let's break down the code we wrote in the previous screen into three steps:

1. Split the dataframe into groups.
2. Apply a function to each group.
3. Combine the results into one data structure.

As with many other common tasks, pandas has a built-in operation for this process. The `groupby` operation performs the "split-apply-combine" process on a dataframe, but condenses it into two steps:

1. Create a GroupBy object.
2. Call a function on the GroupBy object.

The GroupBy object, distinct from a dataframe or series object, allows us to split the dataframe into groups, but only in an abstract sense. Nothing is actually computed until a function is called on the GroupBy object.
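As a rough sketch of the two steps, the example below creates a GroupBy object and only then asks for a result. The data is a hand-built stand-in with placeholder scores, and selecting the assumed `Happiness Score` column before calling `mean()` is a choice made here so that the text columns are left out of the calculation.

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

# Step 1: create the GroupBy object. No averages are computed here.
grouped = happiness2015.groupby('Region')

# Step 2: call a function on it. The aggregation happens at this point.
mean_scores = grouped['Happiness Score'].mean()
print(mean_scores)
```

The result is a Series indexed by region name, with one mean per group.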

You can think of the `groupby` operation like this. Imagine a dataframe as a structure made of stacking blocks in all different colors and sizes.

You know you'll eventually want to group the blocks according to color instead, but you don't know yet what you want to do with them after. Using the groupby process, we would first create a mapping document, the `GroupBy` object, containing information on how to group the blocks by color and where each block is located in the original structure.

Once we create the mapping document, we can use it to easily rearrange the blocks into different structures. For example, let's say our manager asks us first to build another structure using the biggest block from each color.

Creating the initial mapping document, or GroupBy object, allows us to optimize our work, because we no longer have to refer back to the original dataframe. By working with the `groupby` operation, we make our code faster, more flexible, and easier to read.
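To make the analogy concrete, here is a small sketch in which the blocks are rows of a made-up dataframe. The `blocks` frame and its `color` and `block_size` columns are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical blocks: each row is one block with a color and a size.
blocks = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'blue', 'green'],
    'block_size': [3, 1, 5, 4, 2],
})

# The "mapping document": created once, reused for several requests.
by_color = blocks.groupby('color')

# First request: the biggest block from each color.
biggest = by_color['block_size'].max()

# A later request can reuse the same GroupBy object.
smallest = by_color['block_size'].min()

print(biggest)
print(smallest)
```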

The first step in the groupby operation is to create a GroupBy object. To do so, we use the `DataFrame.groupby()` method:

`df.groupby('col')`

where `col` is the column you want to use to group the data set. Note that you can also group the data set on multiple columns by passing a list into the `DataFrame.groupby()` method. However, for teaching purposes, we'll focus on grouping the data by just one column in this mission.
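For reference, the difference between grouping on one column and on a list of columns might look like the sketch below. The frame `df` and its `Year` column are hypothetical and the scores are placeholders; the rest of this mission sticks to a single column.

```python
import pandas as pd

# Hypothetical frame with a second column that could serve as a group key.
df = pd.DataFrame({
    'Region': ['North America', 'North America',
               'Western Europe', 'Western Europe'],
    'Year': [2015, 2016, 2015, 2015],
    'Happiness Score': [7.0, 7.2, 6.0, 6.5],
})

one_key = df.groupby('Region')            # one group per region
two_keys = df.groupby(['Region', 'Year'])  # one group per (region, year) pair

print(one_key.ngroups)
print(two_keys.ngroups)
```

Passing a list produces one group for each combination of values that actually occurs in the data.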

When choosing the column, think about which columns could be used to split the data set into groups. To put it another way, look at columns with the same value for multiple rows.

From the first couple of rows of the dataframe, we can see that the `Region` column fits this criterion. Let's confirm the number of regions and the number of rows in each region for the entire dataframe with the `Series.value_counts()` method next:

`happiness2015['Region'].value_counts()`

Since there's a small number of groups and each group contains more than one row, we can confirm the `Region` column is a good candidate to group by.
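A quick way to try this check on a small scale is sketched below, using a hand-built stand-in for `happiness2015` (placeholder scores, not the real data).

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

# One entry per distinct region; each value is the number of rows in it.
region_counts = happiness2015['Region'].value_counts()
print(region_counts)

# Number of distinct regions.
print(len(region_counts))
```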

Next, let's create a GroupBy object and group the dataframe by the `Region` column:

`happiness2015.groupby('Region')`

`print(happiness2015.groupby('Region'))`

`<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f77882fa470>`

Don't be alarmed! This isn't an error. This is telling us that an object of type GroupBy was returned, just like we expected.

Before we start aggregating data, we'll build some intuition around GroupBy objects. We'll start by using the `GroupBy.get_group()` method to select data for a certain group.

As an example, to select the data for just the North America group, we'd pass 'North America' into the `get_group()` method as follows:
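A minimal sketch of that call, run against a hand-built stand-in for `happiness2015` (placeholder scores, not the real data):

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

grouped = happiness2015.groupby('Region')

# Returns a dataframe holding only the rows of the requested group.
north_america = grouped.get_group('North America')
print(north_america)
```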

In the last exercise, we used the `GroupBy.get_group()` method to select the `Australia and New Zealand` group. The result is a dataframe containing just the rows for the countries in the `Australia and New Zealand` group:

We can also use the `GroupBy.groups` [attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.groups.html) to get more information about the GroupBy object:

`grouped = happiness2015.groupby('Region')`

`grouped.groups`

The result is a dictionary in which each key corresponds to a region name. See below for the first couple of keys:

Notice that each value holds the index labels of the rows in the original `happiness2015` dataframe that belong to the corresponding region. To prove this, let's again look at the data for the Australia and New Zealand group:

And we see that those rows correspond to Australia and New Zealand! Notice that the `get_group()` method also returned the same dataframe above.
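The same comparison can be sketched in code. The frame below is a hand-built stand-in with placeholder scores, so the index labels it prints will differ from those in the real data set.

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

grouped = happiness2015.groupby('Region')

# Index labels recorded for the group in the groups mapping.
anz_index = grouped.groups['Australia and New Zealand']
print(list(anz_index))

# Look those labels up in the original dataframe...
via_loc = happiness2015.loc[anz_index]
# ...and compare with what get_group() returns for the same name.
via_get_group = grouped.get_group('Australia and New Zealand')

print(via_loc['Country'].tolist())
print(via_get_group['Country'].tolist())
```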

Next, let's continue building our intuition by practicing using the `groups` attribute and `get_group()` method.

In the last exercise, we confirmed that the values for the 'North America' group returned by `grouped.groups` do correspond to the countries in North America in the `happiness2015` dataframe.

Now that we have a good understanding of `GroupBy` objects, let's use them to **aggregate** our data. In order to aggregate our data, we must call a function on the GroupBy object.

**SIZE**
A basic example of aggregation is computing the number of rows for each of the groups. We can use the `GroupBy.size()` method to confirm the size of each region group:

`grouped = happiness2015.groupby('Region')`

`grouped.size()`

Notice that the result is a Series and contains just one value for each group. Each value represents the number of rows in each group. For example, the 'Australia and New Zealand' group contains two rows.
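A runnable sketch of this aggregation, using a hand-built stand-in for `happiness2015` (placeholder scores, not the real data):

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

grouped = happiness2015.groupby('Region')

# One value per group: the number of rows that fall in that group.
region_sizes = grouped.size()
print(region_sizes)
```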

Pandas has a number of other built-in [common aggregation methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), including `mean()`, `sum()`, `count()`, `min()`, and `max()`.
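As an illustration, a few of these methods are called side by side below on a hand-built stand-in for `happiness2015` (placeholder scores, not the real data). Selecting the assumed `Happiness Score` column first and collecting the results in a `summary` dataframe are choices made for this sketch.

```python
import pandas as pd

# Hand-built stand-in for happiness2015 (placeholder scores, not real data).
happiness2015 = pd.DataFrame({
    'Country': ['Switzerland', 'Iceland', 'Canada',
                'New Zealand', 'Australia', 'United States'],
    'Region': ['Western Europe', 'Western Europe', 'North America',
               'Australia and New Zealand', 'Australia and New Zealand',
               'North America'],
    'Happiness Score': [7.6, 7.5, 7.4, 7.3, 7.2, 7.1],
})

# Group by region, then keep only the score column for aggregation.
scores = happiness2015.groupby('Region')['Happiness Score']

# Each method returns a Series with one value per region.
summary = pd.DataFrame({
    'mean': scores.mean(),
    'min': scores.min(),
    'max': scores.max(),
    'count': scores.count(),
})
print(summary)
```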

Let's practice using one of these aggregation methods next.

