## Aggregation and Grouping

### Planets Data
* The Planets dataset, loaded here through Seaborn's `load_dataset`, describes planets that astronomers have discovered around other stars (exoplanets)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# load the planet data
planets = sns.load_dataset('planets')
planets.shape

In [None]:
planets.isnull().sum()

### Simple Aggregation in Pandas
* For a pandas Series, an aggregate returns a single value
* For a pandas DataFrame:
    * **by default** an aggregate returns one result per column
    * with the `axis` argument (e.g. `axis='columns'`), it aggregates within each row instead

<img src = "files/df_agg.png" width = 550>
<br>
<img src = "files/agg_para.png" width = 550>

In [None]:
# For a pd Series the aggregates return a single value
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser.sum(), ser.mean()

In [None]:
# For a pd DataFrame, by default 
# the aggregates return results within each column
df = pd.DataFrame({'A' : rng.rand(5),
                   'B' : rng.rand(5)})
df.mean()
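The row-wise case from the bullets above is not shown in the cells here, so the following is a minimal sketch of both directions. The table `df_axis` and its values are made up for illustration so the results can be checked by hand.

```python
import pandas as pd

# Hypothetical fixed table (not the random df used in the cells above)
df_axis = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                        'B': [4.0, 5.0, 6.0]})

# Default: aggregate within each column, one value per column
col_means = df_axis.mean()

# axis='columns': aggregate within each row, one value per row
row_means = df_axis.mean(axis='columns')

print(col_means)
print(row_means)
```

The default result is indexed by column name, while the `axis='columns'` result is indexed by the original row labels.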

In [None]:
df.keys()

#### df.describe(percentiles=None, include=None, exclude=None)

<font color = red> Go straight to the docstring </font>

Signature: df.describe(percentiles=None, include=None, exclude=None)
Docstring:
Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
``NaN`` values.

Analyzes both numeric and object series, as well
as ``DataFrame`` column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should
    fall between 0 and 1. The default is
    ``[.25, .5, .75]``, which returns the 25th, 50th, and
    75th percentiles.


include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

    - 'all' : All columns of the input will be included in the output.
    - A list-like of dtypes : Limits the results to the
      provided data types.
      To limit the result to numeric types submit
      ``numpy.number``. To limit it instead to object columns submit
      the ``numpy.object`` data type. Strings
      can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      select pandas categorical columns, use ``'category'``
    - None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

    - A list-like of dtypes : Excludes the provided data types
      from the result. To exclude numeric types submit
      ``numpy.number``. To exclude object columns submit the data
      type ``numpy.object``. Strings can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(exclude=['O'])``). To
      exclude pandas categorical columns, use ``'category'``
    - None (default) : The result will exclude nothing.
    
Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.

For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the
``count`` and ``top`` results will be arbitrarily chosen from
among those with the highest count.

For mixed data types provided via a ``DataFrame``, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If ``include='all'`` is provided as an option, the result
will include a union of attributes of each type.

The `include` and `exclude` parameters can be used to limit
which columns in a ``DataFrame`` are analyzed for the output.
The parameters are ignored when analyzing a ``Series``.
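The `percentiles` and `include` parameters described in the docstring can be tried on a small table. This is only a sketch: `mixed` is a hypothetical mixed-type DataFrame, not the planets data.

```python
import pandas as pd

# Hypothetical table with one string column and one numeric column
mixed = pd.DataFrame({'method': ['Transit', 'Transit', 'Imaging', 'Transit'],
                      'mass': [1.0, 2.0, 3.0, 4.0]})

# Default: only the numeric column is summarized
default_summary = mixed.describe()

# Custom percentiles replace the default 25%/50%/75% rows
custom_summary = mixed.describe(percentiles=[.1, .5, .9])

# include='all': union of numeric and object-style statistics
full_summary = mixed.describe(include='all')

print(default_summary)
print(custom_summary)
print(full_summary)
```

With `include='all'`, statistics that do not apply to a column (for example `mean` for `method`) are filled with `NaN`.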

In [None]:
planets.dropna().describe()

In [None]:
# Signatures of common aggregation methods, kept as comments for reference
# (as listed in the pandas version used for these notes; not meant to be run)
# df.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
# df.product(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

# df.cumsum(axis=None, skipna=True, *args, **kwargs)
# df.cumprod(axis=None, skipna=True, *args, **kwargs)

# df.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
# df.median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
# df.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
# df.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
# df.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
# df.kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

# df.aggregate(func, axis=0, *args, **kwargs)


<img src = "files/agg_met.png">
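A few of the arguments in the signatures above (`skipna`, `ddof`) and `aggregate` with a list of names can be sketched on a tiny table. The DataFrame `scores` and its values are assumptions chosen so the arithmetic is easy to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column A
scores = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0],
                       'B': [2.0, 4.0, 6.0, 8.0]})

# NaN values are skipped by default
total = scores.sum()

# skipna=False propagates the NaN into the result for column A
total_strict = scores.sum(skipna=False)

# ddof controls the divisor: ddof=1 (default) is the sample estimate,
# ddof=0 is the population estimate
sample_std = scores['B'].std()
pop_std = scores['B'].std(ddof=0)

# aggregate() with a list of names returns one row per statistic
summary = scores.aggregate(['sum', 'mean', 'max'])

print(total, total_strict, sample_std, pop_std, summary, sep='\n')
```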

### GroupBy: Split, Apply, Combine
* Aggregates conditionally on some label or index
* Implemented in the **groupby** operation
* The name "group by" comes from a command in SQL
* It is perhaps more illuminating to think of it in the terms coined by Hadley Wickham of Rstats fame: *split, apply, combine*

#### Split, apply, combine
* *split*: breaking up and grouping a df depending on the value of the specified key.
* *apply*: computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
* *combine*: merge the results of these operations into an output array

<img src = "files/groupby_operation.PNG" width = 450>

<br>
<font color = red>
1. While we could certainly do this manually using some combination of the masking, aggregation, and merging commands covered earlier, it's important to realize that *the intermediate splits do not need to be explicitly instantiated*.
2. Rather, the *GroupBy* can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
3. The power of the *GroupBy* is that it abstracts away these steps:
    * the user need not think about *how* the computation is done under the hood, but rather thinks about the **operation as a whole.**
</font>
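To make the three steps concrete, here is a rough sketch that performs split, apply, and combine by hand with masking and compares the result with `groupby`. The variable names (`pieces`, `sums`, `manual`) are illustrative only; this is not how pandas implements the operation internally.

```python
import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'data': range(6)})

# split: one Series of values per distinct key, selected by masking
pieces = {k: df.loc[df['key'] == k, 'data'] for k in df['key'].unique()}

# apply: compute an aggregate within each piece
sums = {k: piece.sum() for k, piece in pieces.items()}

# combine: assemble the per-group results into a single Series
manual = pd.Series(sums, name='data').rename_axis('key')

# The same result in one step, without building the intermediate pieces
grouped = df.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

The manual version materializes every intermediate split, which is exactly the work the GroupBy abstraction lets you skip.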


In [None]:
# Create a df
df = pd.DataFrame({'key':list('ABCABC'), 
                   'data': range(6)},
                 columns = ['key', 'data'])
df

In [None]:
df.groupby('key') # Return a DataFrameGroupBy object
                  # Does no actual computation until the aggregation is applied

In [None]:
# Apply an aggregate to produce a result;
# this performs the appropriate apply/combine steps.
# You can apply virtually any common pandas or NumPy aggregation function
df.groupby('key').sum()

#### The GroupBy object
* A very flexible abstraction.
* treat it as if it's a collection of dfs
* does the difficult things under the hood.
* Perhaps, the most important operations are
    * aggregate
    * filter
    * transform
    * apply

#### Column indexing
* The GroupBy object supports column indexing in the same way as a DataFrame, and returns a modified GroupBy object

In [None]:
planets.groupby('method')

In [None]:
# Select a particular Series group from the original DataFrame group
# by reference to its column name.
# no computation is done until we call some aggregate on the object
planets.groupby('method')['orbital_period']

In [None]:
planets.groupby('method')['orbital_period'].median()

#### Iteration over groups
* The GroupBy object supports direct iteration over the groups
* returning each group as a *Series* or *DataFrame*

In [None]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape = {1}".format(method, group.shape))

In [None]:
for (method, group) in planets.groupby('method'):
    print(method, "      ", group.shape)

#### Dispatch methods
* Any method not explicitly implemented by the GroupBy object is passed through and called on each group; the results are then combined

In [None]:
planets.groupby('method')['year'].describe()

#### Aggregate, filter, transform, apply
* GroupBy objects have aggregate(), filter(), transform(), and apply() methods
* These efficiently implement a variety of useful operations before combining the grouped data

In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns = ['key', 'data1', 'data2'])
df

#### Aggregation
* **aggregate()** method can take a string, a function, or a list thereof
* Can pass a dictionary mapping column names to operations to be applied on that column

In [None]:
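# aggregate() can take a string, a function, or a list thereof.
# String names are used here because recent pandas versions may warn
# when NumPy functions or builtins such as np.median and max are passed
df.groupby('key').aggregate(['min', 'median', 'max'])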

In [None]:
# pass a dictionary mapping column names 
# to operations to be applied on that column
df.groupby('key').aggregate({'data1' : 'min',
                             'data2' : 'max'})
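For comparison, the dictionary form can be placed next to pandas' named-aggregation syntax, which lets you choose the output column names. This is a sketch on a hypothetical table `df_demo` with fixed `data2` values (chosen for illustration rather than drawn from the random generator above); the output names `data1_min` and `data2_max` are arbitrary.

```python
import pandas as pd

# Hypothetical table with fixed values so the results are reproducible
df_demo = pd.DataFrame({'key': list('ABCABC'),
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Dictionary form: column name -> operation
by_dict = df_demo.groupby('key').aggregate({'data1': 'min',
                                            'data2': 'max'})

# Named aggregation: output name = (input column, operation)
named = df_demo.groupby('key').aggregate(data1_min=('data1', 'min'),
                                         data2_max=('data2', 'max'))

print(by_dict)
print(named)
```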

#### Filtering
* A **filtering** operation allows you to drop data based on the group properties

In [None]:
def filter_func(x):
    return x['data2'].std() > 4

In [None]:
df

In [None]:
df2 = df.groupby('key').std()
df2

In [None]:
df.groupby('key').filter(filter_func)

In [None]:
df.loc[df['key'] == 'A', 'data2'].std()
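The effect of the filter is easier to see with fixed numbers. The sketch below uses a hypothetical table `df_demo` and the same threshold of 4: groups whose `data2` standard deviation does not exceed the threshold are dropped entirely, and the surviving rows keep their original form.

```python
import pandas as pd

# Hypothetical table with fixed values
df_demo = pd.DataFrame({'key': list('ABCABC'),
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Per-group standard deviation of data2, to see which groups pass
group_std = df_demo.groupby('key')['data2'].std()

# Keep only the rows of groups whose data2 standard deviation exceeds 4
kept = df_demo.groupby('key').filter(lambda g: g['data2'].std() > 4)

print(group_std)
print(kept)
```

Here group A has a standard deviation of about 1.41, so its two rows are removed, while the rows of B and C are returned unchanged.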

#### Transformation
* While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
* For such a transformation, the output is the same shape as the input.

In [None]:
# Center the data by subtracting the group-wise mean
df.groupby('key').transform(lambda x: x - x.mean())

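* Here `transform` applies the lambda to each column within each group and returns a result aligned with the original index, so every value has its own group's mean subtracted.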
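What `transform` does in the centering cell above can be checked step by step. This is a sketch on a hypothetical table `df_demo` with fixed values; the numeric columns are selected explicitly so only they are passed to the function.

```python
import pandas as pd

# Hypothetical table with fixed values
df_demo = pd.DataFrame({'key': list('ABCABC'),
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# For each group and each column, subtract that group's mean.
# The result has one row per input row, in the original order.
centered = (df_demo.groupby('key')[['data1', 'data2']]
                   .transform(lambda x: x - x.mean()))

# After centering, every group mean is (numerically) zero
check = centered.groupby(df_demo['key']).mean()

print(centered)
print(check)
```

For example, group A holds `data1` values 0 and 3 with mean 1.5, so rows 0 and 3 become -1.5 and 1.5.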

#### The apply() method
* The **apply()** method lets you apply an arbitrary function to the group results
    * The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar;
    * The combine operation will be tailored to the type of output returned.
    * <font color = red> apply() within a GroupBy is quite flexible: the only criterion is that the function takes a DataFrame and returns a Pandas object or a scalar. </font>

In [None]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x = x.copy()  # work on a copy so the original group is not modified
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

In [None]:
df

In [None]:
df.groupby('key').apply(norm_by_data2)
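The point that the combine step depends on what the function returns can be sketched with two functions on a hypothetical table `df_demo`: one returns a scalar per group, the other a DataFrame per group. The columns are selected explicitly so the function only sees `data1` and `data2`.

```python
import pandas as pd

# Hypothetical table with fixed values
df_demo = pd.DataFrame({'key': list('ABCABC'),
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

groups = df_demo.groupby('key')[['data1', 'data2']]

# Scalar per group -> combined into a Series indexed by key
ratio = groups.apply(lambda g: g['data1'].sum() / g['data2'].sum())

# DataFrame per group -> combined into a DataFrame with one row per input row
normalized = groups.apply(
    lambda g: g.assign(data1=g['data1'] / g['data2'].sum()))

print(ratio)
print(normalized)
```

The exact index of the DataFrame result depends on the pandas version and the `group_keys` setting, so only its shape is relied on here.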

#### Specifying the split key

##### A list, array, series, or index providing the grouping keys
* The key can be any series or list with a length matching that of the DataFrame.

In [None]:
L = [0, 1, 0, 1, 2, 0]
df2 = df.copy()

In [None]:
df

In [None]:
df.groupby(L).sum()

In [None]:
df2['L'] = L

In [None]:
df2

In [None]:
df2.groupby('L').sum()

In [None]:
# a more verbose way
df2.groupby(df2['L']).sum()

##### A dictionary or series mapping index to group

In [None]:
df2 = df.set_index('key')
df2

In [None]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

##### Any Python function
* Pass any Python function that takes an index value as input and returns the group label

In [None]:
# Pass any Python function that will input the index value and output the group
df2.groupby(str.lower).mean()

#####  A list of valid keys
* Any of the preceding key choices can be combined to group on a multi-index

In [None]:
df2.groupby([str.lower, mapping]).mean()

#### Grouping Example

In [None]:
planets.keys()

In [None]:
decade = 10 * (planets['year']//10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
decade

In [None]:
planets.groupby(['method', decade]).sum()
# decade is a Series aligned with the rows of planets (same length),
# so it can be used directly as a grouping key

In [None]:
planets.groupby(['method', decade])['number'].sum()

In [None]:
planets.groupby(['method', decade])['number'].sum().unstack()

In [None]:
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
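The same decade-grouping pipeline can be followed on a few hand-made rows, which is convenient when the Seaborn dataset is not available. The table `toy` below is invented for illustration and only mimics the `method`, `year`, and `number` columns; its values are not taken from the planets data.

```python
import pandas as pd

# Hypothetical rows mimicking three columns of the planets data
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity',
               'Radial Velocity', 'Imaging'],
    'year': [2008, 2011, 1995, 2009, 2010],
    'number': [1, 2, 1, 3, 1],
})

# Build a decade label per row, e.g. 2008 -> '2000s'
decade = 10 * (toy['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

# Group by method and decade, sum the counts,
# pivot the decades into columns, and fill missing combinations with 0
table = (toy.groupby(['method', decade])['number']
            .sum()
            .unstack()
            .fillna(0))

print(table)
```

Combinations that never occur, such as Imaging in the 1990s, appear as `NaN` after `unstack()` and become 0 after `fillna(0)`.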