## `groupby()`, split-apply-combine

The `.groupby()` method of a pandas DataFrame follows the **split-apply-combine** pattern.

In [None]:
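# An example: summarize column x within each group of A
# (dropna=False keeps rows with a missing A as their own group)
(
    df.groupby(['A'], dropna=False)
    .aggregate(x_numrows = ('x', 'size'),        # number of rows, missing x included
               x_nonmissing = ('x', 'count'),    # number of non-missing x values
               x_mean = ('x', 'mean'),
               x_std = ('x', 'std'),
               x_sem = ('x', 'sem'),             # standard error of the mean
               x_unique = ('x', 'nunique'))      # number of distinct non-missing x values
    .reset_index()
)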

In the code example above, **split** corresponds to `.groupby(['A'], dropna=False)`, which conceptually is `.groupby(grouping variable, ...)`. Conceptually, it separates the original data frame into a set of smaller data frames (groups), one for each distinct value of the grouping variable. With `dropna=False`, rows whose grouping value is missing form a group of their own instead of being dropped.
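To make the split step concrete, the sketch below iterates over a `groupby` object and looks at each group's sub-frame. The small `df` here is made-up illustration data, not the data frame used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: grouping variable A (one value missing) and a numeric column x.
df = pd.DataFrame({
    "A": ["a", "a", "b", np.nan],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
group_sizes = []
for key, group_df in df.groupby("A", dropna=False):
    print("group:", key, "rows:", len(group_df))
    group_sizes.append(len(group_df))

# dropna=False keeps the missing key as its own group; the default drops it.
n_groups_kept = df.groupby("A", dropna=False).ngroups  # 3 groups: 'a', 'b', NaN
n_groups_dropped = df.groupby("A").ngroups             # 2 groups: 'a', 'b'
```

Every row of `df` lands in exactly one group, so with `dropna=False` the group sizes add up to the number of rows in `df`.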

**Apply** (applying the aggregation methods) and **combine** correspond to `aggregate(x_numrows = ('x', 'size'), ...)`, which conceptually is `.aggregate(named_column = ('grouped/studied variable', 'aggregation method'), ...)`.

Each aggregation method returns a single summary statistic computed from one column (the grouped variable) of a single group produced by `groupby()`. **Apply** goes through every group produced by `groupby()` and applies the aggregation method to the grouped variable.

**Combine** then gathers the summary statistics that one aggregation method computed for one column (a grouped variable) across all groups into a single named column, and assembles all the named columns (one for each pair of grouped variable and aggregation method) into a data frame.

This data frame summarizes the data: each row is a group, and each column is one of the statistics you named.
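The apply and combine steps can also be written out by hand, which may help show what `aggregate()` does conceptually. This is only a sketch on made-up data (the `df` below is hypothetical), and it is not how pandas implements aggregation internally.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "b"],
    "x": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Apply: loop over the groups and compute one set of summary statistics per group.
rows = []
for key, group_df in df.groupby("A"):
    rows.append({
        "A": key,
        "x_numrows": len(group_df),
        "x_mean": group_df["x"].mean(),
    })

# Combine: stack the per-group results into a single summary data frame.
manual = pd.DataFrame(rows)

# The same summary using named aggregation.
builtin = (
    df.groupby("A")
    .aggregate(x_numrows=("x", "size"), x_mean=("x", "mean"))
    .reset_index()
)

print(manual)
print(builtin)
```

Both `manual` and `builtin` have one row per group and one column per named statistic.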

**PS**: An aggregation method reduces all the data you're interested in to a single value/statistic. It could be `sum`, `mean`, `std`, `sem`, or any other summary statistic you want to know about the data.
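As a small illustration, the sketch below uses two built-in aggregation methods alongside a user-defined one. The data and the helper `value_range` are hypothetical, chosen only to show that any function reducing a column to a single value can serve as an aggregation method, and that `size` and `count` differ when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing x value in group 'a'.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, 2.0, 5.0],
})

def value_range(s):
    """Hypothetical custom aggregation: reduce a column to max minus min."""
    return s.max() - s.min()

summary = df.groupby("A").aggregate(
    x_numrows=("x", "size"),        # counts every row, missing or not
    x_nonmissing=("x", "count"),    # counts only non-missing values
    x_range=("x", value_range),     # any function returning a single value works
)
print(summary)
```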

Below are some examples of aggregation methods:

In [None]:
# calculate the mean of every numeric column in the data frame
# (numeric_only=True skips non-numeric columns, which would otherwise raise in recent pandas)
df.mean(numeric_only=True)

df['A'].mean() # calculate the mean of one specified column, A, in the data frame

# calculate the mean of column A within each group of df defined by the unique values of G
(
    df.groupby(['G'], dropna=False)
    .aggregate(x_mean = ('A', 'mean'))
)


#### Question 1

In the pandas library, what is the primary purpose of the `groupby()` method?

a) To sort data in ascending or descending order based on one or more columns.

b) To merge or concatenate two or more DataFrames.

c) To split data into groups based on some criteria and then apply a function to each group independently.

d) To reshape the DataFrame from a wide format to a long format.

Answer: c) To split data into groups based on some criteria and then apply a function to each group independently.

#### Question 2


In pandas, after using the `groupby()` method on a DataFrame, the result is always another DataFrame with the same number of rows as the original.

a) True

b) False

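Answer: b) False. `groupby()` itself returns a `DataFrameGroupBy` object rather than a DataFrame, and an aggregation applied to it returns one row per group, not one row per original row.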
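A quick check of this answer, sketched on hypothetical data: the object returned by `groupby()` is not a DataFrame, an aggregation gives one row per group, and only an operation such as `transform` gives back one value per original row.

```python
import pandas as pd

# Hypothetical toy data: three rows, two groups.
df = pd.DataFrame({
    "G": ["u", "u", "v"],
    "A": [1.0, 3.0, 5.0],
})

grouped = df.groupby("G")
print(type(grouped).__name__)  # DataFrameGroupBy, not DataFrame

# Aggregation: one row per group.
per_group = grouped.aggregate(A_mean=("A", "mean"))
print(len(per_group))

# transform: broadcasts each group's statistic back to the original rows.
per_row = grouped["A"].transform("mean")
print(len(per_row))
```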