# Splitting DataFrames into groups

## Learning Objectives

After working through this topic, you should be able to:

- describe why groupby-type operations are useful and common
- call `groupby` and a subsequent method on DataFrames and Series
- describe which methods apply to which data types

## Materials

Here is the
[screencast](https://electure.uni-bonn.de/static/mh_default_org/engage-player/bc113be7-2155-4590-9665-36ad20c8ce2d/afa1d4c4-0ee8-4695-b58e-d2f380a1589e/2ceaf314-3696-426d-970a-034da44dba89.mp4)
and here are the [slides](pandas_data-groupby.pdf).

## Additional Materials



## Quiz

In [None]:
content = [
    {
        "question": "Examples for `groupby`-style operations are:",
        "type": "many_choice",
        "answers": [
            {
                "answer": "Calculating household-level income from individual incomes.",
                "correct": True,
                "feedback": (
                    "We did not mention this, but this would be a classic use case."
                ),
            },
            {
                "answer": ("Obtaining the fraction of 18-23-year-olds in a sample."),
                "correct": True,
                "feedback": (
                    "Precise semantics may differ (assuming that age is measured in "
                    "years directly, there would be quicker ways), but it is the same "
                    "idea."
                ),
            },
            {
                "answer": "Calculating average income in the sample.",
                "correct": False,
                "feedback": (
                    "To get an overall average, you do not need to calculate "
                    "group-level averages first (you could, but you would have to be "
                    "very careful with weighting the different averages)."
                ),
            },
            {
                "answer": "Calculating median age in a sample.",
                "correct": False,
                "feedback": (
                    "To get the median age over all individuals in a sample, grouping "
                    "would be counterproductive."
                ),
            },
        ],
    },
    {
        "question": (
            "*(very hard)* Assuming obvious interpretations of the columns `gender` "
            "and `is_working`, what will the variable `result` hold after executing the"
            " following code?"
        ),
        "code": """g = df.groupby("gender")["is_working"]
result = g.value_counts() / g.count()
""",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "The fraction of men and women who are working, respectively",
                "correct": True,
                "feedback": (
                    "Indeed. `value_counts` counts the number of non-missing "
                    "observations for each outcome of `is_working` (yes/no). Because it "
                    "is called on a grouped object, this is done separately by gender. "
                    "By analogy, we divide by the number of observations by gender to "
                    "obtain shares."
                ),
            },
            {
                "answer": "The number of men and women who are working, respectively",
                "correct": False,
                "feedback": (
                    "This is what `value_counts` by itself would get us. However, "
                    "we still divide by something."
                ),
            },
            {
                "answer": "The fraction of working individuals.",
                "correct": False,
                "feedback": (
                    "This is what we would have gotten by calling "
                    "`value_counts(normalize=True)` on `df['is_working']`. However, "
                    "here the shares are computed separately by gender."
                ),
            },
            {
                "answer": "The fraction in each gender cell.",
                "correct": False,
                "feedback": (
                    "`g.count()` would have returned the number of non-missing "
                    "observations in each gender cell. However, this is only the "
                    "denominator of the expression in the second line."
                ),
            },
        ],
    },
]

from jupyterquiz import display_quiz

display_quiz(content, colors="fdsp")
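
Both quiz ideas can be tried out on a small example. The following is a minimal sketch, assuming a made-up DataFrame: the columns `gender` and `is_working` follow the quiz, while `household`, `income`, and all values are hypothetical. It selects the `is_working` column on the grouped object before counting, so that numerator and denominator are both Series that align on `gender`.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration only.
df = pd.DataFrame(
    {
        "household": [1, 1, 2, 2, 3, 3],
        "gender": ["female", "female", "female", "male", "male", "male"],
        "is_working": [True, True, False, True, False, False],
        "income": [2000.0, 1500.0, 0.0, 3000.0, 0.0, 0.0],
    }
)

# First question: household-level income from individual incomes.
household_income = df.groupby("household")["income"].sum()

# Second question: counts per (gender, is_working) cell,
# divided by the number of non-missing observations per gender.
g = df.groupby("gender")["is_working"]
shares = g.value_counts() / g.count()

# Shortcut that should yield the same shares in one call.
shares_normalized = g.value_counts(normalize=True)

print(household_income)
print(shares)
```

The result `shares` has one entry per combination of `gender` and `is_working`, so it contains the share of working and of non-working individuals within each gender.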

In [None]:
content = [
    {
        "name": "Intro",
        "front": (
            "For each of the following descriptions, what is the method name and which "
            "data types does it apply to?"
        ),
        "back": (
            "Let's go through them step by step (click on next in the bottom right)."
        ),
    },
    {
        "name": "Averages",
        "front": "Calculate an average",
        "back": (
            "`.mean()`, applies to numeric data types, floating point variables in "
            "particular. Can also be useful for integers, if they have a cardinal "
            "interpretation."
        ),
    },
    {
        "name": "Standard deviation",
        "front": "Calculate the standard deviation",
        "back": (
            "`.std()`, applies to numeric data types, floating point variables in "
            "particular. Can also be useful for integers, if they have a cardinal "
            "interpretation."
        ),
    },
    {
        "name": "Quantiles",
        "front": "Calculate some quantile",
        "back": (
            "`.quantile()` with `.median()` as a special case. Applies to numeric data "
            "types, floating point variables in particular. Can also be useful for "
            "integers, if they take on enough different values."
        ),
    },
    {
        "name": "Minimum / Maximum",
        "front": "Calculate the minimum / maximum",
        "back": (
            "`.min()` / `.max()`. Applies to any ordered data type: any numerical "
            "type, but also ordered categoricals."
        ),
    },
    {
        "name": "Number of non-missing observations",
        "front": "Count the number of non-missing observations",
        "back": "`.count()`. Applies to any data type.",
    },
    {
        "name": "Number of observations per value",
        "front": "Calculate the number of observations per value a variable takes on.",
        "back": (
            "`.value_counts()`, can be used on any variable, but only makes sense if "
            "there are not too many different values. So rarely makes sense for "
            "floating point variables; for integers it depends. Careful: This will "
            "work on combinations of values across variables if called on more than "
            "one column."
        ),
    },
    {
        "name": "Pass your own function",
        "front": "Calculate something using your own function",
        "back": (
            "`.apply()`, what it applies to (sic!) depends on the function you pass."
        ),
    },
]


from jupytercards import display_flashcards

display_flashcards(content)
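
The methods from the flashcards can be sketched on grouped data as follows. This is only an illustration under stated assumptions: the DataFrame, its column names (`income`, `n_children`, `education`), and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration only.
df = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male", "male"],
        "income": [2000.0, 3000.0, 1000.0, None, 2500.0],
        "n_children": [0, 2, 1, 1, 0],
        "education": pd.Categorical(
            ["high", "low", "medium", "low", "high"],
            categories=["low", "medium", "high"],
            ordered=True,
        ),
    }
)
g = df.groupby("gender")

# Numeric data: averages, standard deviations, quantiles.
mean_income = g["income"].mean()
std_income = g["income"].std()
median_income = g["income"].quantile(0.5)  # same as g["income"].median()

# Any data type: number of non-missing observations.
n_income = g["income"].count()

# Ordered data types, here an ordered categorical.
max_education = g["education"].max()

# Variables with few distinct values: observations per value.
children_counts = g["n_children"].value_counts()

# Own function: here the income range within each group.
income_range = g["income"].apply(lambda s: s.max() - s.min())
```

Note that the missing income in the `male` group is skipped by `.mean()` and not counted by `.count()`.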