# Data grouping and aggregation

## Grouping

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, can be a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a versatile `groupby` interface, enabling you to slice, dice, and summarize datasets in a natural way.

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (`axis="index"`) or its columns (`axis="columns"`). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. 
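
The three stages can be spelled out by hand to see what `groupby` does for you. The following is only a minimal sketch on a hypothetical toy DataFrame (`toy`, `pieces`, `applied` and `manual` are illustrative names, not part of the dataset used later), compared against the equivalent `groupby` call:

```python
import pandas as pd

# Hypothetical toy data, small enough to check the results by hand
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one sub-Series of values per distinct key
pieces = {key: toy.loc[toy["key"] == key, "value"] for key in toy["key"].unique()}

# Apply: reduce each piece to a single number
applied = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").sort_index()

# The same computation expressed with groupby
with_groupby = toy.groupby("key")["value"].mean()

print(manual)
print(with_groupby)
```

Both Series should hold the same group means; `groupby` simply performs the split, apply and combine steps in one expression.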

In [None]:
# Loading packages
import pandas as pd
import numpy as np

In [None]:
# Preparing dataset for later use
pov = pd.read_csv("C:\\Users\\iwo.augustynski\\Downloads\\share-of-population-in-extreme-poverty.csv", parse_dates=["Year"])

s = pov.Code.unique()  # unique values from the 'Code' column
s = np.random.choice(s, size=10, replace=False)  # 10 distinct random codes (sampling without replacement)

pov_sample = pov[pov.Code.isin(s)]  # keep only the rows for the selected countries


In [None]:
# Let's start with the following DataFrame

df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7),
                   "data2" : np.random.standard_normal(7)})

df

Suppose you wanted to compute the mean of the `data1` column in groups indicated by labels from `key1`. 

In [None]:
grouped = df["data1"].groupby(df["key1"])

grouped

This `grouped` variable is now a special "GroupBy" object. It has not actually computed anything yet except for some intermediate data about the group key `df["key1"]`. The idea is that this object has all of the information needed to then apply some operation to each of the groups. You should be familiar with this behavior because it is the same as in `R`.
For example, to compute group means we can call the GroupBy’s mean method:

In [None]:
grouped.mean()
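
To see what a GroupBy object holds before any aggregation is applied, you can iterate over it. This is a small sketch on hypothetical toy data (`toy` and `toy_grouped` are illustrative names) so that the numbers are reproducible:

```python
import pandas as pd

# Hypothetical toy data with fixed values
toy = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                    "data1": [1.0, 2.0, 3.0, 4.0, 6.0]})

toy_grouped = toy["data1"].groupby(toy["key1"])

# Iterating over a GroupBy object yields (group name, sub-Series) pairs
for name, piece in toy_grouped:
    print(name, piece.tolist())

# Collect the number of rows in each group
group_sizes = {name: len(piece) for name, piece in toy_grouped}

# The aggregation is only computed when a method such as mean is called
group_means = toy_grouped.mean()
print(group_means)
```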

This result is not very informative, because aggregating over a single key collapses a lot of detail. Passing multiple group keys as a list gives a finer breakdown:

In [None]:
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

means

Here the values from the `data1` column are grouped using two keys, and the resulting Series has a hierarchical index consisting of the unique pairs of keys observed. It can be reshaped into a DataFrame with `unstack`:

In [None]:
means.unstack()

As you can verify, each group contains only one value, so calculating the mean does not change anything:

In [None]:
df

You can also just pass column names as the group keys:

In [None]:
df.groupby("key1").mean()

In [None]:
df.groupby("key2").mean(numeric_only=True)

You may have noticed that there is no `key1` column in the second result. Because `df["key1"]` is not numeric data, it cannot be aggregated with `mean`. In pandas 2.0 and later you have to pass `numeric_only=True` to leave such columns out; older versions silently excluded them as *nuisance columns*. It is also possible to filter down to a subset of columns explicitly.
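
One way to filter down to a subset is to select the columns to aggregate right after grouping. The sketch below uses a hypothetical toy DataFrame (`toy`) with fixed values and shows two options that should give the same table here:

```python
import pandas as pd

# Hypothetical toy data: one text column, one integer key, two numeric columns
toy = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                    "key2": [1, 2, 1, 2],
                    "data1": [1.0, 3.0, 5.0, 7.0],
                    "data2": [10.0, 20.0, 30.0, 40.0]})

# Option 1: name the columns to aggregate explicitly
subset_means = toy.groupby("key2")[["data1", "data2"]].mean()

# Option 2: ask mean to skip every non-numeric column
numeric_means = toy.groupby("key2").mean(numeric_only=True)

print(subset_means)
print(numeric_means)
```

Naming the columns explicitly is the more predictable choice when the DataFrame may gain extra numeric columns later.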

In [None]:
df.groupby(["key1", "key2"]).mean()

However, trying to do the same on a single column selected from the DataFrame does not work:

In [None]:
df["data1"].groupby(["key1","key2"]).mean()

It fails because the column was first extracted as a Series, so the `key1` and `key2` columns are no longer available to group by.
Understanding this lets you rearrange the code properly:

In [None]:
df.groupby("key1")["data1"].mean()

As you can see, the whole DataFrame is now grouped first and the desired column is selected afterwards. `df.groupby("key1")[["data1"]].mean()` works as well. The two forms differ slightly in the type of output:

In [None]:
print(type(df.groupby("key1")[["data1"]].mean()))
print(type(df.groupby("key1")["data1"].mean()))
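
The difference between grouping by column names and grouping by arrays can be checked side by side. This is a sketch on hypothetical toy data (`toy`, `lookup_failed`, `by_arrays` and `by_names` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data with fixed values
toy = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                    "key2": [1, 2, 1, 2, 1],
                    "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# A bare Series has no "key1" or "key2" to look up, so names raise KeyError
try:
    toy["data1"].groupby(["key1", "key2"]).mean()
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Passing the key arrays themselves works on a Series
by_arrays = toy["data1"].groupby([toy["key1"], toy["key2"]]).mean()

# Grouping the DataFrame by names and selecting the column afterwards also works
by_names = toy.groupby(["key1", "key2"])["data1"].mean()

print(lookup_failed)
print(by_names)
```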

Regardless of the objective in using `groupby`, a generally useful GroupBy method is `size`, which returns a Series containing group sizes:

In [None]:
df.groupby(["key1", "key2"]).size()

Note that any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`:

In [None]:
df.groupby("key1", dropna=False).size()
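
The effect of `dropna` is easiest to see on data with known missing keys. A minimal sketch with a hypothetical toy DataFrame (`toy`):

```python
import pandas as pd

# Hypothetical toy data: two of the five rows have a missing group key
toy = pd.DataFrame({"key1": ["a", "a", None, "b", None],
                    "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Default: rows with a missing key are left out of every group
sizes_default = toy.groupby("key1").size()

# dropna=False: missing keys form a group of their own
sizes_with_na = toy.groupby("key1", dropna=False).size()

print(sizes_default)
print(sizes_with_na)
```

With the default setting the group sizes no longer add up to the number of rows, which is a quick way to spot missing keys.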

A group function similar in spirit to size is `count`, which computes the number of nonnull values in each group:

In [None]:
df.groupby("key1").count()
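
The difference between `size` and `count` only shows up when the data contains missing values. A short sketch on hypothetical toy data (`toy`) illustrates it:

```python
import pandas as pd

# Hypothetical toy data: group "a" has one missing value in data1
toy = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                    "data1": [1.0, None, 3.0, 4.0]})

# size counts rows per group, regardless of missing values
sizes = toy.groupby("key1").size()

# count counts only the non-null values per group
counts = toy.groupby("key1")["data1"].count()

print(sizes)
print(counts)
```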

## **Assignment**

Using the `pov_sample` dataset:

1. Count how many data points (years) there are for each country and store the result as `count`. The result should consist of only two columns: Entity and Year.

2. From `count`, select the country with the highest number of years available and store it as `maximum`. The result should still consist of the columns Entity and Year.

3. From `count`, select the country with the lowest number of years available and store it as `minimum`. The result should still consist of the columns Entity and Year. Use the `.loc` method.

4. Calculate the mean share of population below the poverty line for each country and store it as `means`. The result should consist of only two columns: 'Entity' and '$2.15 a day - share of population below poverty line'.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

count

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(maximum)
print(minimum)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
means

## Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including `mean`, `count`, and `size`.

Many common aggregations, such as those found in the table below, have optimized implementations. However, you are not limited to only this set of methods.

| Function name  | Description                                                                  |
|----------------|------------------------------------------------------------------------------|
| `any`, `all`       | Return True if any (one or more values) or all non-NA values are "truthy"    |
| `count`          | Number of non-NA values                                                      |
| `cummin`, `cummax` | Cumulative minimum and maximum of non-NA values                              |
| `cumsum`         | Cumulative sum of non-NA values                                              |
| `cumprod`        | Cumulative product of non-NA values                                          |
| `first`, `last`    | First and last non-NA values                                                 |
| `mean`           | Mean of non-NA values                                                        |
| `median`         | Arithmetic median of non-NA values                                           |
| `min`, `max`       | Minimum and maximum of non-NA values                                         |
| `nth`            | Retrieve the nth row from each group (in the original row order)             |
| `ohlc`           | Compute four "open-high-low-close" statistics for time series-like data      |
| `prod`           | Product of non-NA values                                                     |
| `quantile`       | Compute sample quantile                                                      |
| `rank`           | Ordinal ranks of non-NA values, like calling Series.rank                     |
| `size`           | Compute group sizes, returning result as a Series                            |
| `sum`            | Sum of non-NA values                                                         |
| `std`, `var`       | Sample standard deviation and variance                                       |
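
Note that some entries in the table, such as `cumsum` or `rank`, return one value per row rather than one value per group. The following sketch on hypothetical toy data (`toy`) contrasts the two kinds of result:

```python
import pandas as pd

# Hypothetical toy data with fixed values
toy = pd.DataFrame({"key1": ["a", "b", "a", "b"],
                    "data1": [1.0, 2.0, 3.0, 4.0]})

toy_grouped = toy.groupby("key1")["data1"]

# sum reduces each group to a single value, indexed by group key
totals = toy_grouped.sum()

# cumsum keeps the original index and accumulates within each group
running = toy_grouped.cumsum()

print(totals)
print(running)
```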

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
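
The same custom function can be passed to a grouped DataFrame, in which case it is applied to each remaining column separately. A minimal sketch, assuming a hypothetical toy DataFrame (`toy`) with fixed values:

```python
import pandas as pd


def peak_to_peak(arr):
    """Range of the values: maximum minus minimum."""
    return arr.max() - arr.min()


# Hypothetical toy data with fixed values
toy = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                    "data1": [1.0, 4.0, 2.0, 7.0],
                    "data2": [10.0, 10.5, 3.0, 1.0]})

# The function is called once per group and per column
ranges = toy.groupby("key1").agg(peak_to_peak)

print(ranges)
```

Custom functions are generally slower than the optimized methods from the table, so they are best kept for cases that the built-in methods do not cover.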

You may notice that some methods, like `describe`, also work, even though they are not aggregations, strictly speaking:

In [None]:
grouped.describe()

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

In [None]:
grouped.agg(["mean", "std", peak_to_peak])
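
If the default column names are not descriptive enough, pandas also accepts keyword arguments of the form `new_name=function` (named aggregation). The sketch below uses hypothetical toy data, and the output names `average` and `spread` are chosen only for illustration:

```python
import pandas as pd


def peak_to_peak(arr):
    """Range of the values: maximum minus minimum."""
    return arr.max() - arr.min()


# Hypothetical toy data with fixed values
toy = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                    "data1": [1.0, 4.0, 2.0, 7.0]})

toy_grouped = toy.groupby("key1")["data1"]

# Each keyword becomes a column name in the resulting DataFrame
summary = toy_grouped.agg(average="mean", spread=peak_to_peak)

print(summary)
```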

## **Assignment**

1. Group `pov_sample` by the Entity column and describe the dataset.

2. Group `pov_sample` by the Entity column and calculate the mean, standard deviation and peak_to_peak in one line of code.

3. Select the first values as `f` and the last values as `l`.

In [None]:
# 1. Answer
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# 2. Answer
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# `pd.concat` concatenates DataFrames along the given axis; `sort_values` sorts the result by the given column
pd.concat([f, l], axis="rows").sort_values(by="Entity")