# Introduction to Categorical Data

When exploring data, we are often interested in summarizing a large amount of information with a single number or visualization.

Depending on what we are trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using mean, median or mode and we can summarize spread using standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.

For example, here are the first five rows and some selected columns of a dataset from the 1994 U.S. census:

|age|education|marital.status|race|
|:--|:--------|:-------------|:---|
|90|HS-grad|Widowed|White|
|82|HS-grad|Widowed|White|
|66|Some-college|Widowed|Black|
|54|7th-8th|Divorced|White|
|41|Some-college|Separated|White|

Age is a quantitative variable, so we can calculate the average (or mean) age. However, for a variable like `marital.status`, we can’t calculate something like "average marital status" because the possible values of marital status are categories rather than numbers (e.g. "Married", "Widowed", "Separated", etc.). This lesson will cover summary statistics specifically for exploring categorical data.
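To make the distinction concrete, here is a minimal sketch that rebuilds the five rows shown above as a small dataframe. The name `sample` is only for illustration and is not part of the lesson's data files. The mean is meaningful for `age`, while `marital.status` is more naturally summarized by counting each category.

```python
import pandas as pd

# The five rows from the table above, typed in by hand for illustration.
sample = pd.DataFrame({
    "age": [90, 82, 66, 54, 41],
    "education": ["HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-college"],
    "marital.status": ["Widowed", "Widowed", "Widowed", "Divorced", "Separated"],
    "race": ["White", "White", "Black", "White", "White"],
})

# A quantitative column has a meaningful mean.
mean_age = sample["age"].mean()
print(mean_age)  # 66.6

# A categorical column is summarized by counting each category instead.
status_counts = sample["marital.status"].value_counts()
print(status_counts)
```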

### Exercise

In [1]:
import pandas as pd
import numpy as np

nyc_trees = pd.read_csv("./nyc_tree_census.csv")

1. The dataset we will explore in this lesson is a sample of the NYC 2015 Tree Census. This dataset contains information from a survey of trees in the city collected by parks department employees and community volunteers. A dataframe named `nyc_trees` has been loaded for you in the workspace. Take a look at the field descriptions below. Once you are ready, inspect the first five rows of `nyc_trees` using the `.head()` method and print the result.

    Data Description:

    |Column Name|Description|
    |:----------|:----------|
    |tree_id|Unique identifier for each tree in the survey|
    |trunk_diam|Diameter of the tree measured 54” above the ground|
    |status|Indicates whether the tree is alive, standing dead, or a stump.|
    |health|Indicates the user’s perception of tree health.|
    |spc_common|Common name for species, e.g. "red maple"|
    |neighborhood|Name of the neighborhood the tree is located in|

In [2]:
nyc_trees.head()

|Unnamed: 0|tree_id|trunk_diam|status|health|spc_common|neighborhood|
|:---------|:------|:---------|:-----|:-----|:---------|:-----------|
|0|199250|8|Alive|Good|crab apple|Lincoln Square|
|1|136891|17|Alive|Good|honeylocust|East Harlem North|
|2|200218|3|Alive|Good|ginkgo|Chinatown|
|3|53901|23|Alive|Good|green ash|Bayside-Bayside Hills|
|4|589218|21|Alive|Good|pin oak|Glen Oaks-Floral Park-New Hyde Park|


2. Which of the columns are categorical variables? Write the names into a list named `categorical_vars`. Each name should be a separate string. Although id fields (for example, `tree_id`) can technically be considered categorical data, you do not need to include them in your list.

In [3]:
categorical_vars = ['status', 'health', 'spc_common', 'neighborhood']

***

## Nominal Categories

Depending on the data, some of the summary statistics we use for quantitative data can still be meaningful for categorical data. Let us first consider a *nominal categorical variable*. A nominal categorical variable is a categorical variable with no intrinsic ordering to the categories. Examples from the census dataset introduced in the previous exercise include `marital.status` and `race`.

Because these variables' categories have no ordering or numeric equivalents, it is impossible to calculate a mean or median. It would also be impossible to describe spread with statistics like variance, standard deviation, a range, IQR, or percentiles, because these statistics all rely on being able to order the data in some way. However, it is still possible to calculate the *mode*, the most common value in the dataset.

We can do this in Python using the `.value_counts()` method. The `.value_counts()` method calculates the count of each value in a column and returns the result as a series. By default, `.value_counts()` orders categories in descending order by frequency, so the top row in the output will be the mode.

In the code below, we use `.value_counts()` to extract the most common responses in the field `marital.status`.

        counts = df['marital.status'].value_counts()
        print(counts)

### Output:

        Married-civ-spouse       14976
        Never-married            10683
        Divorced                  4443
        Separated                 1025
        Widowed                    993
        Married-spouse-absent      418
        Married-AF-spouse           23

This means that the most common value of `marital.status` in this dataset is `'Married-civ-spouse'` (married to a civilian spouse), with 14976 observations in that category. We can also extract the name of the modal category by taking the first index label of the series that `.value_counts()` returns.

        modal_cat = counts.index[0]
        print(modal_cat) # Output: Married-civ-spouse
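One caveat: taking the first index label returns a single category even when two categories are tied for the highest count. The sketch below uses a small made-up series (`responses` is a hypothetical name, not census data) to show two ways of recovering every tied mode.

```python
import pandas as pd

# Hypothetical responses with a tie between two categories.
responses = pd.Series(["Divorced", "Widowed", "Widowed", "Divorced", "Separated"])

counts = responses.value_counts()
top_count = counts.iloc[0]

# Keep every category whose count matches the highest count.
tied_modes = sorted(counts[counts == top_count].index)
print(tied_modes)  # ['Divorced', 'Widowed']

# Series.mode() also reports all tied modes, in sorted order.
print(responses.mode().tolist())  # ['Divorced', 'Widowed']
```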

### Exercise

1. Using the `nyc_trees` data, find the count of trees in each neighborhood (the column name for neighborhood is `neighborhood`). Save the result as `tree_counts` and print the result.

    Note that this data, like many datasets you will encounter in the real world, is large and a little messy! You will see that the neighborhoods with the fewest trees (only 1 in some cases) have some strange names that do not really seem like neighborhoods. Not to worry — this still tells you some important information!

In [4]:
tree_counts = nyc_trees['neighborhood'].value_counts()
tree_counts

Annadale-Huguenot-Prince's Bay-Eltingville    950
Great Kills                                   761
East New York                                 702
Bayside-Bayside Hills                         665
Rossville-Woodrow                             633
                                             ... 
BX33                                            1
82                                              1
40                                              1
86                                              1
5                                               1
Name: neighborhood, Length: 442, dtype: int64

2. Using the `nyc_trees` data, find the neighborhood with the highest tree count. Save the name of the neighborhood as a variable called `greenest_neighborhood` and print the result.

In [5]:
greenest_neighborhood = tree_counts.index[0]
greenest_neighborhood

"Annadale-Huguenot-Prince's Bay-Eltingville"

***

## Ordinal Categorical Variables - Central Tendency I

*Ordinal categorical variables* have ordered categories. For ordinal categorical variables, we can find the modal category just like in the previous exercise — but we can also calculate other summary statistics that are not possible for nominal categorical variables. For central tendency, this means we can also calculate a median.

In order to calculate numerical statistics for ordered categories, we need to first assign numerical values to the categories. Consider the variable `education` from the census data. We can inspect the unique categories in this variable using `.unique()`:

    print(list(df['education'].unique()))

### Output:

    ['HS-grad', 'Some-college', '7th-8th', '10th', 'Doctorate', 'Prof-school', 'Bachelors', 'Masters', '11th', 'Assoc-acdm', 'Assoc-voc', '1st-4th', '5th-6th', '12th', '9th', 'Preschool']

Then, we can associate each of these categories with a numerical value, indicating an individual’s "education level". In Python, the easiest way to do this is to convert the variable to type `'category'` using `pandas.Categorical()`. When converting a column to type `'category'`, we can also pass a list with the column's categories (and set the `ordered` parameter to `True`) to indicate the desired ordering.

    correct_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']
 
    df['education'] = pd.Categorical(df['education'], correct_order, ordered=True)

Variables stored as type `category` have an attribute (`.cat.codes`) that converts the categories to numbers, which lets us perform numerical operations on this categorical field. For example, we can calculate the median category using numpy's `median()` function:

    median_index = np.median(df['education'].cat.codes)
    print(median_index) # Output: 9.0
 
    median_category = correct_order[int(median_index)]
    print(median_category) # Output: Some-college

By using `.cat.codes` on `education`, we are able to calculate that the median code for education level is `9`, which translates to `'Some-college'`.
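A detail worth checking before taking a median of codes: pandas assigns the code `-1` to missing values, so a column containing `NaN` can pull the median toward the lowest category. The following is a minimal sketch with made-up data (`levels` and `ratings` are hypothetical names) that filters out the `-1` placeholder first.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered categories and data with one missing value.
levels = ["Low", "Medium", "High"]
ratings = pd.Series(
    pd.Categorical(["High", "High", "Medium", None, "Low", "High"],
                   categories=levels, ordered=True)
)

codes = ratings.cat.codes
print(codes.tolist())  # [2, 2, 1, -1, 0, 2]

# Including the -1 placeholder shifts the median down.
naive_index = int(np.median(codes))
print(levels[naive_index])  # Medium

# Dropping the placeholder gives the median of the observed values only.
valid_codes = codes[codes != -1]
median_index = int(np.median(valid_codes))
print(levels[median_index])  # High
```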

### Exercise

1. Using the NYC trees dataset, find the unique values in the column `health`. Save the unique categories to a variable named `tree_health_statuses` and print the result.

In [6]:
tree_health_statuses = nyc_trees['health'].unique()
tree_health_statuses

array(['Good', 'Poor', 'Fair', nan], dtype=object)

2. Create a list named `health_categories` which lists the categories from worst to best. You should exclude `NaN` values from your list.

In [7]:
health_categories = ['Poor', 'Fair', 'Good']

3. Using the `health_categories` list you created in the previous exercise, convert `health` in the original dataset to a categorical variable type (`'category'`).

In [8]:
nyc_trees['health'] = pd.Categorical(nyc_trees['health'], health_categories, ordered=True)

4. Using `cat.codes`, calculate the value that corresponds to the median value of `health`. Save it as a variable named `median_health_status` and print the result.

In [9]:
median_index = np.median(nyc_trees['health'].cat.codes)
median_health_status = health_categories[int(median_index)]
median_health_status

'Good'

***

## Ordinal Categorical Variables - Central Tendency II

In the previous exercise, we used `.cat.codes` to find the median category for an ordinal categorical variable. We can use `cat.codes` to return numeric values and perform a wide range of operations on categorical data as well. However, before performing any operations, you should check to make sure they make sense in the context of the data.

For example, remember that the categories for `education` (in order) are as follows:

    education_levels_ordered = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th','HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

While we can represent these categories with equally spaced numbers, there is not equal spacing between categories. Some gaps between educational attainment levels represent up to four additional years of schooling (e.g. '1st-4th' to '5th-6th'), while others represent a single additional year of schooling (e.g. from '9th' to '10th').

When we use `.cat.codes` to translate these categories into integers, those integers have equal spacing. Translating categories to numbers is often necessary to store and use the order of the categories (for example, to calculate a statistic like the median, which relies only on ordering, not spacing). However, we should not use those numbers to calculate statistics, such as the mean, for which the distance between values matters.

In practice, researchers sometimes (albeit, incorrectly) report means for ordinal categories. For example, a researcher might want to analyze survey responses to the question "Rate your happiness on a scale from 1 to 5 where 1 means 'very unhappy' and 5 means 'very happy'".

If that researcher calculates a "mean happiness score", they are assuming that the difference in happiness between ratings of 1 and 2 is the same as the difference in happiness between ratings of 3 and 4. In practice, this assumption is likely not true and should be acknowledged when reporting the mean of an ordinal categorical variable.
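The dependence on spacing can be illustrated with a small sketch. The responses and the alternative numbering below are invented for illustration; the point is only that any order-preserving recoding leaves the median category unchanged while the mean moves.

```python
import numpy as np

# Hypothetical survey responses on a 1-5 happiness scale.
responses = np.array([1, 2, 2, 3, 5, 5, 5])

# An alternative numbering that keeps the order but stretches the top gap.
# (Purely illustrative; nothing says this spacing is "right" either.)
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
recoded = np.array([stretched[int(r)] for r in responses])

# The median lands on the same category under both codings...
print(np.median(responses), np.median(recoded))  # 3.0 3.0

# ...but the mean shifts, because it depends on the distances between codes.
print(responses.mean(), recoded.mean())
```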

### Exercise

In [10]:
nyc_trees = pd.read_csv('nyc_tree_census2.csv')
nyc_trees['tree_diam_category'] = pd.Categorical(nyc_trees['tree_diam_category'], ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)'], ordered=True)

1. This dataset contains two variables related to trunk size. The first variable, `trunk_diam` contains the diameter of the trunk (in inches) for each tree. The variable `tree_diam_category`, on the other hand, categorizes each tree based on the size of the trunk. The categories are: `'Small (0-3in)'`, `'Medium (3-10in)'`, `'Medium-Large (10-18in)'`, `'Large (18-24in)'`, `'Very large (>24in)'`. You will notice that these categories are not evenly spaced with respect to diameter.

    Calculate the mean of `trunk_diam` (the quantitative variable), save it as `mean_diam`, and print the result.

In [11]:
mean_diam = nyc_trees['trunk_diam'].mean()
mean_diam

11.27048

2. We have already provided code to save `tree_diam_category` as an ordered categorical variable so that you can use `cat.codes`. Calculate the mean of `tree_diam_category`, save it in a variable named `mean_diam_cat` and print it out.

    Which category does this correspond to (remember that `cat.codes` translates the categories to numbers between 0 and 4)? Note how this is different from the mean you calculated in the last checkpoint. While the mean diameter is about 11.27 inches (which would be categorized as "Medium-Large"), the mean category index is about 1.97, which is between `'Medium (3-10in)'` and `'Medium-Large (10-18in)'`.

In [12]:
mean_diam_cat = nyc_trees['tree_diam_category'].cat.codes.mean()
mean_diam_cat

1.97282

***

## Ordinal Categories: Spread

In the last exercise, we learned that the mean is not interpretable for ordinal categorical variables because the mean relies on the assumption of equal spacing between categories.

Many other statistics we might normally use for numerical data rely on the mean. Because of this, these statistics are not appropriate for ordinal data. Remember that the standard deviation and variance both depend on the mean. Without a meaningful mean, we cannot have a reliable standard deviation or variance either!

Instead, we can rely on other summary statistics, like the proportion of the data within a range, or percentiles/quantiles. For example, consider the education variable from earlier. To calculate a range containing 80% of the data, we can use `np.percentile()`:

    tenth_perc_ind = np.percentile(df['education'].cat.codes, 10)
    tenth_perc_cat = correct_order[int(tenth_perc_ind)]
    print(tenth_perc_cat) # output: 11th
    
    ninetieth_perc_ind = np.percentile(df['education'].cat.codes, 90)
    ninetieth_perc_cat = correct_order[int(ninetieth_perc_ind)]
    print(ninetieth_perc_cat) # output: Bachelors

This tells us that at least 80% of respondents range in "education level" from 11th grade to a Bachelor's degree.
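By default, a percentile can fall between two codes, in which case `int()` simply truncates it. As an alternative sketch (using made-up data and the hypothetical names `sizes` and `trees`), the pandas `Series.quantile()` method accepts `interpolation="lower"`, which always returns a code that was actually observed.

```python
import pandas as pd

# Hypothetical ordered categories and a small sample.
sizes = ["Small", "Medium", "Large", "Very large"]
trees = pd.Series(
    pd.Categorical(
        ["Small", "Medium", "Medium", "Large", "Large", "Large",
         "Very large", "Very large"],
        categories=sizes, ordered=True)
)

# Codes are 0, 1, 1, 2, 2, 2, 3, 3; cast to plain integers for clarity.
codes = trees.cat.codes.astype(int)

# interpolation="lower" returns an observed code rather than a value in between.
p25_code = int(codes.quantile(0.25, interpolation="lower"))
p75_code = int(codes.quantile(0.75, interpolation="lower"))

print(sizes[p25_code], "to", sizes[p75_code])  # Medium to Large
```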

### Exercise

In [13]:
size_labels_ordered = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)']

1. Calculate the 25th percentile for `tree_diam_category`. Use the ordered list, `size_labels_ordered`, to find the corresponding label. Save your result (the label, not the index) to a variable named `p25_tree_diam_category` and print it to the console.

In [14]:
p25_tree_diam_category = size_labels_ordered[int(np.percentile(nyc_trees['tree_diam_category'].cat.codes, 25))]
p25_tree_diam_category

'Medium (3-10in)'

2. Calculate the 75th percentile of `tree_diam_category`. Use the ordered list, `size_labels_ordered`, to find the corresponding label. Save your result (the label, not the index) to a variable named `p75_tree_diam_category` and print it to the console.

    Together with the 25th percentile, we can use this value to determine the Interquartile Range (IQR) for `tree_diam_category`.

In [15]:
p75_tree_diam_category = size_labels_ordered[int(np.percentile(nyc_trees['tree_diam_category'].cat.codes, 75))]
p75_tree_diam_category

'Large (18-24in)'

***

## Table of Proportions

You have already seen that we can use the `.value_counts()` function to get a table of frequencies for a categorical variable. A table of frequencies is often the first approach a data scientist might use to summarize a categorical variable; however, it is sometimes useful to instead look at the proportion of values in each category.

For example, knowing that there are 14976 people in the census dataset who are married to a civilian spouse is hard to interpret without the context of knowing the numbers in the other categories. Instead, if we know that 46% of the surveyed population is married to a civilian spouse, we have more context about the relative frequency of this category. We can calculate proportions by dividing the frequency by the number of observations in the data.

    df['education'].value_counts()/len(df['education'])

We can also calculate proportions using `.value_counts()` by setting the `normalize` parameter equal to `True`:

    df['education'].value_counts(normalize = True).head()

### Output:

    HS-grad         0.322502
    Some-college    0.223918
    Bachelors       0.164461

### Exercise

In [16]:
nyc_trees = pd.read_csv("./nyc_tree_census.csv")

1. Calculate a table of proportions for the `status` column. Save this table of proportions as `tree_status_proportions` and print the result.

In [17]:
tree_status_proportions = nyc_trees['status'].value_counts(normalize=True)
tree_status_proportions

Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64

***

## Table of Proportions: Missing Data

One thing to keep in mind when calculating the proportion of data in a particular category: how are you dealing with missing data? For example, consider the `workclass` variable from the census data. This column contains 1836 missing values, coded as `NaN`. By default, those missing values are not counted by `.value_counts()`.

Therefore, the results of `df['workclass'].value_counts()/len(df['workclass'])` and `df['workclass'].value_counts(normalize = True)` will be slightly different. You can set the `dropna` parameter in `.value_counts()` to determine how `NaN` values are handled in summaries of data.

When we divide the frequency of each category by `len(df['workclass'])`, we are calculating the proportion of a specific workclass group as a portion of all people in the dataset. This is equivalent to setting `dropna = False` in the call to `value_counts()`.

`df.workclass.value_counts(dropna = False, normalize = True)`

### Output:

    Private             0.697030
    Self-emp-not-inc    0.078038
    Local-gov           0.064279
    NaN                 0.056386
    State-gov           0.039864
    Self-emp-inc        0.034274
    Federal-gov         0.029483
    Without-pay         0.000430
    Never-worked        0.000215

Here, we see that 5.6% of respondents have a missing (`NaN`) value of `workclass`. In contrast, using `.value_counts(normalize = True)` (or `.value_counts(normalize = True, dropna = True)` to be explicit) returns the proportion of a specific workclass group as a portion of people in the dataset who responded to this question.

`df.workclass.value_counts(normalize = True)`

### Output:

    Private             0.738682
    Self-emp-not-inc    0.082701
    Local-gov           0.068120
    State-gov           0.042246
    Self-emp-inc        0.036322
    Federal-gov         0.031245
    Without-pay         0.000456
    Never-worked        0.000228

Note that if we do not include the missing values in our denominator, we observe slightly larger proportions in each category (and no `NaN` category) in the above output. It is important to think about how you want to deal with missing data when summarizing a categorical variable and then interpret resulting values appropriately.
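The effect of the denominator is easy to verify on a tiny example. The sketch below uses a made-up series (not the census data) with 8 answers and 2 missing values, and compares the three calculations discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 8 answers and 2 missing values.
workclass = pd.Series(["Private"] * 6 + ["Local-gov"] * 2 + [np.nan] * 2)

# Denominator = all 10 rows (missing values included).
all_rows = workclass.value_counts(normalize=True, dropna=False)
print(all_rows["Private"])  # 0.6

# Denominator = the 8 non-missing rows only.
answered = workclass.value_counts(normalize=True)
print(answered["Private"])  # 0.75

# Dividing raw counts by len() matches the first table for non-missing categories.
manual = workclass.value_counts() / len(workclass)
print(manual["Private"])  # 0.6
```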

### Exercise

1. Using `.value_counts()`, calculate the proportions for each category in the `health` variable. The denominator for your proportions should be the number of non-missing values in the `health` column. Save the result to a variable named `health_proportions` and print the result.

In [18]:
health_proportions = nyc_trees['health'].value_counts(normalize=True)
health_proportions

Good    0.810986
Fair    0.146871
Poor    0.042143
Name: health, dtype: float64

2. Now, still using `.value_counts()`, add a parameter to include missing values in the denominator when calculating proportions for the `health` variable. Save the result to a variable named `health_proportions_2`. Why are the two sets of results different? Can you think of scenarios where one might be more appropriate to report than the other?

In [19]:
health_proportions_2 = nyc_trees['health'].value_counts(normalize=True, dropna=False)
health_proportions_2

Good    0.7736
Fair    0.1401
NaN     0.0461
Poor    0.0402
Name: health, dtype: float64

***

## Binary Categorical Variables

Binary categorical variables have only two categories. In Python, these variables are often coded as `0`/`1` or `True`/`False`. This makes it easy to calculate the frequency or proportion of a category in a dataset. For example, consider a variable `income_>50K`, which is equal to `1` if a person makes more than 50K USD per year, and `0` otherwise. If we add up all the `1`s and `0`s in this column, the sum will be exactly equal to the number of `1`s (people making more than 50K):

    np.sum(df['income_>50K'])  #output: 7841

In Python, the same behavior holds for columns coded as `True`/`False` because `True` gets coerced to `1` and `False` gets coerced to `0` (this is also true in most other programming languages used by data scientists). Similarly, we can calculate the proportion equal to `1` or `True` by taking the mean of the column. This works because the mean is just the sum of all values in the column (which is the frequency of `1`s or `True`s) divided by the number of values in the column:

    np.mean(df['income_>50K'])  #output: 0.24

Finally, we can make use of this nifty trick for any variable by using a conditional to translate a non-binary variable into `True` and `False` values. For example, recall the `workclass` variable from the previous exercise. Suppose that you want to calculate the number (or proportion) of people who work in local government. We could translate the `workclass` column to a binary variable indicating whether a person works in local government (`True`) or not (`False`) by using a conditional.

    print(df.workclass == 'Local-gov')

### Output:

    0        False
    1        True
    2        True
    3        False
    4        False
             ...  

Then, we can use the sum or mean to calculate a frequency or proportion of `True`s in the data.

    (df.workclass == 'Local-gov').sum()  #output: 2093
    (df.workclass == 'Local-gov').mean() #output: 0.064
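The same trick extends to conditions that cover several categories. As a sketch on made-up data (the series and the grouping of the three `-gov` categories into `gov_levels` are assumptions for illustration), `.isin()` builds the `True`/`False` indicator in one step.

```python
import pandas as pd

# Hypothetical sample using category names from the workclass column.
workclass = pd.Series([
    "Private", "Local-gov", "State-gov", "Private",
    "Federal-gov", "Private", "Private", "Local-gov",
])

# Treat any of these categories as "works in government" (illustrative grouping).
gov_levels = ["Local-gov", "State-gov", "Federal-gov"]
is_gov = workclass.isin(gov_levels)

gov_frequency = is_gov.sum()    # number of True values
gov_proportion = is_gov.mean()  # share of True values

print(gov_frequency)   # 4
print(gov_proportion)  # 0.5
```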

### Exercise

1. Find the frequency and proportion of trees that were recorded as `Alive`. You can do this by transforming the `status` variable to an indicator of whether a tree is alive (indicated by `status == 'Alive'`) or not. Save the results to variables named `living_frequency` and `living_proportion` and print them to the console.

In [20]:
living_frequency = (nyc_trees['status'] == 'Alive').sum()
living_proportion = (nyc_trees['status'] == 'Alive').mean()

print(living_frequency)
print(living_proportion)

47695
0.9539


2. Find the frequency and proportion of trees with `trunk_diam > 30`. Save the results to variables named `giant_frequency` and `giant_proportion` and print them to the console.

In [21]:
giant_frequency = (nyc_trees['trunk_diam'] > 30).sum()
giant_proportion = (nyc_trees['trunk_diam'] > 30).mean()

print(giant_frequency)
print(giant_proportion)

1788
0.03576


***

## Review

In this lesson you have learned the steps you can take to summarize and interpret summaries of nominal categorical and ordinal categorical variables.

* For *nominal categorical* variables, there is no ordering to the categories. Because of this, we are limited to using the *mode* to describe central tendency and there is no way to summarize the spread.
* For *ordinal categorical* variables, there is an implied ordering to the categories. In Python, we can use `pd.Categorical()` to transform a variable to a categorical type. The Categorical type allows us to access a numeric value for each category by using `.cat.codes`. From there, we may perform operations on this variable as if it were a regular, numeric variable.
* However, when calculating statistics for an *ordinal categorical* variable we should be mindful that some numeric statistics rely on the assumption of **equal spacing** between categories.
* For ordinal categorical variables, *median* and *mode* can be used to summarize the central tendency, and the IQR (or any difference between percentiles) can be used to summarize the spread.
* Certain summary statistics (e.g. frequencies and proportions) can be used for all categorical variables. You can create true/false columns and use `np.sum()` and `np.mean()` to quickly summarize how many observations, and what proportion of your data, meet certain criteria.

## Exercise

As a final exercise, a new dataset has been loaded for you in the cell below. Follow the instructions below to inspect and summarize the categorical variables in this data.

`film_permits` contains a sample of NYC filming permits. Inspect the first few rows. Think about how you might explore and summarize this data. Some exercises you might wish to work through:

* Which variables in this data are nominal? Which are ordinal?
* Which Boroughs are granted permits for the most TV pilot episodes?
* Summarize the types (`Category`) and subtypes (`SubCategoryName`) of projects that get filming permits granted.

Solution code is available in `solutions.py`.

In [22]:
film_permits = pd.read_csv('film_permits.csv')
film_permits.head()

|Unnamed: 0|EventID|EventType|StartDateTime|EndDateTime|Borough|Category|SubCategoryName|
|:---------|:------|:--------|:------------|:----------|:------|:-------|:--------------|
|0|446168|Shooting Permit|10/19/2018 02:00:00 PM|10/20/2018 02:00:00 AM|Manhattan|Film|Feature|
|1|186438|Shooting Permit|10/30/2014 07:00:00 AM|10/31/2014 02:00:00 AM|Queens|Television|Episodic series|
|2|445255|Shooting Permit|10/20/2018 07:00:00 AM|10/20/2018 06:00:00 PM|Brooklyn|Still Photography|Not Applicable|
|3|128794|Theater Load in and Load Outs|11/16/2013 12:01:00 AM|11/17/2013 06:00:00 AM|Manhattan|Theater|Theater|
|4|43547|Shooting Permit|01/10/2012 07:00:00 AM|01/10/2012 07:00:00 PM|Brooklyn|Television|Episodic series|


### Which variables in this data are nominal? Which are ordinal?

|column|type|
|:-----|:---|
|EventID|int|
|EventType|nominal|
|StartDateTime|datetime|
|EndDateTime|datetime|
|Borough|nominal|
|Category|nominal|
|SubCategoryName|nominal|

### Which Boroughs are granted permits for the most TV pilot episodes?

In [23]:
film_permits[film_permits['SubCategoryName'] == 'Pilot']['Borough'].value_counts()

Manhattan        149
Brooklyn          89
Queens            21
Bronx             10
Staten Island      2
Name: Borough, dtype: int64

### Summarize the types (Category) and subtypes (SubCategoryName) of projects that get filming permits granted.

In [24]:
film_permits['Category'].value_counts()

Television           5271
Film                 1765
Theater               966
Commercial            878
Still Photography     658
WEB                   313
Student                72
Documentary            48
Music Video            28
Name: Category, dtype: int64

In [25]:
film_permits['SubCategoryName'].value_counts()

Episodic series            2916
Feature                    1382
Not Applicable             1381
Cable-episodic             1033
Theater                     966
Commercial                  686
Pilot                       271
News                        202
Cable-other                 126
Reality                     124
Morning Show                121
Short                       120
Promo                       112
Made for TV/mini-series      90
Variety                      76
Student Film                 65
Special/Awards Show          59
Cable-daily                  55
Industrial/Corporate         54
Talk Show                    48
PSA                          27
Game show                    25
Signed Artist                15
Children                     12
Syndication/First Run        11
Independent Artist            9
Magazine Show                 8
Daytime soap                  5
Name: SubCategoryName, dtype: int64