# Data types

## Learning objectives

After working through this topic, you should be able to:

- List the most important data types in pandas
- Discuss the benefits of modern strings
- Choose memory-saving data types for your data

## Materials

Here is the
[Screencast](https://electure.uni-bonn.de/static/mh_default_org/engage-player/313e6fa5-9607-4afc-b247-b215e30e2721/ad4e9cb2-867b-47e7-a90e-23219786b136/ac840202-32b2-4cd3-83a9-f9e2b93d8492.mp4)
and these are the [slides](pandas_data-datatypes.pdf).

## Additional materials

- [Pandas user guide on string/text data](https://pandas.pydata.org/docs/user_guide/text.html)
- [Pandas user guide on categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)

## Quiz

In [None]:
content = [
    {
        "name": "Intro",
        "front": (
            "Describe each of the data types by giving the type of values it can hold "
            "and an example use case."
        ),
        "back": (
            "Let's go through them step by step (click on next in the bottom right)."
        ),
    },
    {
        "name": "pd.Int64Dtype",
        "front": "pd.Int64Dtype",
        "back": ("Holds integers. We may use it to store the calendar year."),
    },
    {
        "name": "pd.UInt32Dtype",
        "front": "pd.UInt32Dtype",
        "back": (
            "An unsigned integer, taking on either 0 or a positive value. We may use "
            "it to store person identifiers in our data."
        ),
    },
    {
        "name": "pd.StringDtype",
        "front": "pd.StringDtype",
        "back": (
            "Strings of arbitrary length. We may use it to store answers to a "
            "free-text question in a survey."
        ),
    },
    {
        "name": "pd.CategoricalDtype (ordered)",
        "front": "pd.CategoricalDtype (ordered)",
        "back": (
            "A categorical variable, with a fixed number of possible values, which are "
            "ordered in a way specified by the user. An example would be responses to "
            "a question asking respondents to rate their health status with possible "
            "values excellent-very good-good-fair-poor."
        ),
    },
    {
        "name": "pd.CategoricalDtype (unordered)",
        "front": "pd.CategoricalDtype (unordered)",
        "back": (
            "A categorical variable, with a fixed number of possible values, which are "
            "not ordered. An example would be gender (female, male, other, ...)."
        ),
    },
]


from jupytercards import display_flashcards

display_flashcards(content)
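
To make the data types from the flashcards concrete, here is a minimal sketch that builds one column per type. The column names and values are invented for illustration and do not come from any course dataset.

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration.
df = pd.DataFrame(
    {
        # Nullable integer: can hold missing values next to whole numbers.
        "year": pd.array([2020, 2021, None], dtype=pd.Int64Dtype()),
        # Unsigned integer: zero or positive values only.
        "person_id": pd.array([1, 2, 3], dtype=pd.UInt32Dtype()),
        # Strings of arbitrary length.
        "comment": pd.array(["fine", "no answer", None], dtype=pd.StringDtype()),
        # Ordered categorical: the order of the categories is set by the user.
        "health": pd.Categorical(
            ["good", "poor", "excellent"],
            categories=["poor", "fair", "good", "very good", "excellent"],
            ordered=True,
        ),
        # Unordered categorical: a fixed set of values without any ranking.
        "gender": pd.Categorical(["female", "male", "other"]),
    }
)

print(df.dtypes)

# Comparisons are only meaningful for the ordered variant.
at_least_good = df["health"] >= "good"
print(at_least_good)
```

The same comparison on the unordered `gender` column would raise a `TypeError`, which is one practical reason to declare the order explicitly where it exists.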

In [None]:
content = [
    {
        "question": (
            "Assume you obtain a small survey dataset with columns for gender, income, "
            "and happiness (on a 3-point scale high-so/so-low), all of which are stored "
            "as pd.Int64Dtype. Tick all that apply."
        ),
        "type": "many_choice",
        "answers": [
            {
                "answer": ("We can just leave the variables as they are."),
                "correct": False,
                "feedback": (
                    "Gender as an integer is not helpful at all! You always need to "
                    "remember what the numbers mean. Let's not get started on plotting "
                    "or accidentally using the values 0, 1, 2 directly in a regression."
                ),
            },
            {
                "answer": (
                    "We should set income to be of the pd.Float64Dtype variant in "
                    "order to make clear that we are approximating a real number."
                ),
                "correct": True,
                "feedback": (
                    "We would recommend this indeed. Strictly speaking, you could not "
                    "calculate continuous distributions etc. with integers, although "
                    "the necessary type conversion will often happen implicitly."
                ),
            },
            {
                "answer": (
                    "We should set gender to be of the ordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": False,
                "feedback": "How would you order female/male/other/...?",
            },
            {
                "answer": (
                    "We should set gender to be of the unordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": True,
            },
            {
                "answer": (
                    "We should set happiness to be of the ordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": True,
                "feedback": "This is the correct representation.",
            },
            {
                "answer": (
                    "We should set happiness to be of the unordered pd.CategoricalDtype"
                    " variant."
                ),
                "correct": False,
                "feedback": (
                    "Order is built into the above scale by definition; the data "
                    "type should reflect this."
                ),
            },
            {
                "answer": "We can just leave happiness to be an integer type.",
                "correct": False,
                "feedback": (
                    "A categorical data type makes much clearer what the variable "
                    "contains. The only reason you may want to leave it as an integer "
                    "would be to include it in a regression where you are comfortable "
                    "interpreting differences in a cardinal way. While there is a "
                    "debate on whether this is defensible for scales with a larger "
                    "outcome space, for three outcomes it definitely is not."
                ),
            },
        ],
    },
]

from jupyterquiz import display_quiz

display_quiz(content, colors="fdsp")
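
The conversions discussed in the quiz could look roughly like the sketch below. The integer codes and their labels (for example `0` meaning `"female"`) are assumptions made for this example; in real data, take them from the codebook.

```python
import pandas as pd

# Hypothetical raw data in which every column is stored as pd.Int64Dtype.
raw = pd.DataFrame(
    {
        "gender": pd.array([0, 1, 2, 1], dtype="Int64"),
        "income": pd.array([2500, 3100, None, 1800], dtype="Int64"),
        "happiness": pd.array([2, 0, 1, 2], dtype="Int64"),
    }
)

# Assumed meaning of the integer codes (not taken from any real codebook).
gender_labels = {0: "female", 1: "male", 2: "other"}
happiness_labels = {0: "low", 1: "so/so", 2: "high"}

clean = pd.DataFrame(index=raw.index)

# Gender: unordered categorical, since there is no natural ranking.
clean["gender"] = (
    raw["gender"]
    .map(gender_labels)
    .astype(pd.CategoricalDtype(categories=list(gender_labels.values())))
)

# Income: nullable float, signalling an approximation of a real number.
clean["income"] = raw["income"].astype(pd.Float64Dtype())

# Happiness: ordered categorical, from lowest to highest.
clean["happiness"] = (
    raw["happiness"]
    .map(happiness_labels)
    .astype(pd.CategoricalDtype(categories=["low", "so/so", "high"], ordered=True))
)

print(clean.dtypes)
```

Building `clean` as a new DataFrame keeps the raw integers untouched, so the mapping can be checked against the original codes.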