# Summarising categorical variables

## Nominal categories

Nominal categorical variables have no intrinsic ordering among their categories and no numerical equivalent. As a result, it is not possible to calculate a mean or median, nor to describe spread with the variance, standard deviation, range, IQR or percentiles.

It is possible to describe the mode, the most common value in the dataset. This can be done by counting the occurrences of each value.

In [None]:
# Count how many times each value occurs (sorted from most to least frequent)
counts = df['marital.status'].value_counts()
print(counts)

# Get the mode by selecting the first result from the count list
modal_cat = counts.index[0]

print(modal_cat)
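
As a self-contained illustration of the counting approach, the sketch below uses a small made-up Series in place of the census column (the values are hypothetical). It also shows pandas' `Series.mode()`, which returns every value tied for most frequent, whereas `counts.index[0]` returns only one of them.

```python
import pandas as pd

# Hypothetical toy data standing in for the census 'marital.status' column
toy = pd.DataFrame({
    'marital.status': ['Married', 'Never-married', 'Married',
                       'Divorced', 'Married', 'Never-married']
})

# value_counts() sorts categories from most to least frequent
counts = toy['marital.status'].value_counts()
modal_from_counts = counts.index[0]

# mode() returns a Series holding all values tied for the highest count
modes = toy['marital.status'].mode()

print(modal_from_counts)
print(list(modes))
```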

## Ordinal categories
Ordinal categorical variables have ordered categories. We can find the modal category as before, but we can also calculate other statistics like median.

First, we need to assign numerical values to the categories. We start by inspecting the unique categories.

In [None]:
# Print a list of unique category names
print(list(df['education'].unique()))


Returns: \['HS-grad', 'Some-college', '7th-8th', '10th', 'Doctorate', 'Prof-school', 'Bachelors', 'Masters', '11th', 'Assoc-acdm', 'Assoc-voc', '1st-4th', '5th-6th', '12th', '9th', 'Preschool']

Then, we can associate each of these categories with a numerical value, indicating an individual’s "education level". In Python, the easiest way to do this is to convert the variable to type 'category' using pandas.Categorical(). When converting a column to type 'category', we can also pass a list with the column’s categories (and True to the ordered parameter) to indicate the desired ordering.

In [None]:
import pandas as pd

correct_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

# Convert the column to an ordered categorical with the categories in this order
df['education'] = pd.Categorical(df['education'], categories=correct_order, ordered=True)

Variables stored as type category have an accessor attribute (cat.codes) that converts the categories to integers reflecting their order. This allows us to perform numerical operations on the categorical field, such as calculating the median category using numpy’s median() function:

In [None]:
import numpy as np

# Median of the integer codes; np.median returns a float
median_index = np.median(df['education'].cat.codes)
print(median_index) # Output: 9.0

median_category = correct_order[int(median_index)]
print(median_category) # Output: Some-college

By using .cat.codes on education, we’re able to calculate that the median code for education level is 9, which corresponds to the category 'Some-college'.
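
The same steps can be tried end to end on a small example. This is a minimal sketch with hypothetical categories and responses, not the census data. It uses an odd number of observations so the median lands exactly on one code. With an even number, `np.median` may return a value halfway between two codes, which `int()` then truncates downward.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered categories and responses
levels = ['Low', 'Medium', 'High']
responses = ['Medium', 'Low', 'High', 'Medium', 'High', 'High', 'Low']

ordered = pd.Series(pd.Categorical(responses, categories=levels, ordered=True))

# Integer codes follow the declared order: Low -> 0, Medium -> 1, High -> 2
codes = ordered.cat.codes

median_code = np.median(codes)            # float, even though codes are integers
median_level = levels[int(median_code)]   # map the code back to its label

print(median_code, median_level)
```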

While we can represent these categories with equally spaced numbers, the categories themselves are not equally spaced. Some gaps between educational attainment levels represent up to four additional years of schooling (e.g. '1st-4th' to '5th-6th'), while others represent a single additional year of schooling (e.g. from '9th' to '10th').

When we use .cat.codes to translate these categories into integers, those integers have equal spacing. While translating categories to numbers is often necessary to store and use the order of the categories (for calculating a statistic like the median, which only relies on ordering, not spacing), we should not use those numbers to calculate statistics — such as the mean — for which the distance between values matters.

In practice, researchers sometimes (albeit, incorrectly) report means for ordinal categories. For example, a researcher might want to analyze survey responses to the question "Rate your happiness on a scale from 1 to 5 where 1 means 'very unhappy' and 5 means 'very happy'".

If that researcher calculates 'mean happiness score', they are assuming that the difference in happiness between a rating of 1 and 2 is the same as the difference in happiness for a rating of 3 and 4. In practice, this assumption is likely not true and should be acknowledged if reporting a mean of an ordinal categorical variable.
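
To make the equal-spacing assumption concrete, the sketch below compares two codings of the same made-up happiness ratings. The ratings and the alternative coding are hypothetical, chosen only for illustration. Both codings preserve the order of the categories, so the median points to the same rating under each, while the mean moves to a different part of the scale.

```python
import numpy as np

# Hypothetical happiness ratings on the 1-5 scale
ratings = np.array([1, 1, 1, 2, 5, 5, 5])

# An alternative order-preserving coding with unequal gaps (assumed for illustration)
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
recoded = np.array([recode[r] for r in ratings])

# The median only depends on order: both medians correspond to rating 2
median_original = np.median(ratings)
median_recoded = np.median(recoded)

# The mean depends on spacing: it falls between ratings 2 and 3 under the
# original coding, but between ratings 4 and 5 under the alternative coding
mean_original = ratings.mean()
mean_recoded = recoded.mean()

print(median_original, median_recoded)
print(mean_original, mean_recoded)
```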

In [None]:
# Mean of the category codes (assumes equal spacing between categories, so interpret with caution)
mean_diam_cat = np.mean(nyc_trees['tree_diam_category'].cat.codes)

### Spread

As discussed above, the mean is not interpretable for ordinal categorical variables because it relies on the assumption of equal spacing between categories.

Many other statistics we might normally use for numerical data rely on the mean. Because of this, these statistics aren’t appropriate for ordinal data. Remember that the standard deviation and variance both depend on the mean: without a mean, we can’t have a reliable standard deviation or variance either!

Instead, we can rely on other summary statistics, like the proportion of the data within a range, or percentiles/quantiles. For example, consider the education variable from earlier. To calculate a range containing 80% of the data, we can use np.percentile():

In [None]:
tenth_perc_ind = np.percentile(df['education'].cat.codes, 10)
tenth_perc_cat = correct_order[int(tenth_perc_ind)]
print(tenth_perc_cat) # Output: 11th

ninetieth_perc_ind = np.percentile(df['education'].cat.codes, 90)
ninetieth_perc_cat = correct_order[int(ninetieth_perc_ind)]
print(ninetieth_perc_cat) # Output: Bachelors

This tells us that at least 80% of respondents have an "education level" between 11th grade and a Bachelor’s degree.
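
One way to sanity-check a range like this is to compute the share of observations that actually fall inside it. The sketch below does so on a small hypothetical ordinal variable (the categories and counts are made up). Note that `np.percentile` interpolates between codes by default, so `int()` truncates the result to the lower category.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered categories and a toy sample of 10 observations
levels = ['Low', 'Medium', 'High', 'Very high']
values = ['Low'] * 1 + ['Medium'] * 4 + ['High'] * 4 + ['Very high'] * 1

edu = pd.Series(pd.Categorical(values, categories=levels, ordered=True))
codes = edu.cat.codes

# Percentiles of the codes, truncated to whole category indices
low_code = int(np.percentile(codes, 10))
high_code = int(np.percentile(codes, 90))

# Share of observations whose category lies within [low, high]
share_inside = ((codes >= low_code) & (codes <= high_code)).mean()

print(levels[low_code], levels[high_code], share_inside)
```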

### Table of Proportions
You’ve already seen that we can use the .value_counts() method to get a table of frequencies for a categorical variable. A table of frequencies is often the first approach a data scientist might use to summarize a categorical variable; however, it is sometimes useful to instead look at the proportion of values in each category.

For example, knowing that there are 14976 people in the census dataset who are married to a civilian spouse is hard to interpret without the context of knowing the numbers in the other categories. Instead, if we know that 32% of the surveyed population is married to a civilian spouse, we have more context about the relative frequency of this category. We can calculate proportions by dividing the frequency by the number of observations in the data.

In [None]:
df['education'].value_counts()/len(df['education'])

We can also calculate proportions using .value_counts() by setting the normalize parameter equal to True:

In [None]:
df['education'].value_counts(normalize = True).head()
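
When a column has no missing values, the two approaches give identical results. A minimal sketch on a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with no missing values
colours = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue', 'red', 'green'])

# Approach 1: divide the frequencies by the number of observations
by_division = colours.value_counts() / len(colours)

# Approach 2: let value_counts() normalise the frequencies
by_normalize = colours.value_counts(normalize=True)

# Compare after aligning on the category labels
same = np.allclose(by_division.sort_index(), by_normalize.sort_index())

print(by_normalize)
print(same)
```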

#### Missing data

One thing to keep in mind when calculating the proportion of data in a particular category: how are you dealing with missing data? For example, consider the workclass variable from the census data. This column contains 1836 missing values, coded as NaN. By default, those missing values are not counted by .value_counts().

Therefore, the results of df\['workclass'].value_counts()/len(df\['workclass']) and df\['workclass'].value_counts(normalize = True) will be slightly different. You can set the dropna parameter in .value_counts() to determine how NaN values are handled in summaries of data.

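When we divide the frequency of each category by len(df\['workclass']), we’re calculating the proportion of a specific workclass group as a portion of all people in the dataset, including those with a missing workclass. These proportions match the ones we get by setting both normalize = True and dropna = False in the call to .value_counts(), which additionally reports the proportion of NaN values as its own category.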

Note that if we don’t include the missing values in our denominator (the default behaviour when normalize = True), we observe slightly larger proportions in each category, and no NaN category. It is important to think about how you want to deal with missing data when summarizing a categorical variable and then interpret resulting values appropriately.
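
The effect of the denominator is easiest to see on a small example. The sketch below uses a hypothetical Series with two missing values rather than the census workclass column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 known values and 2 missing values
workclass = pd.Series(['Private', 'Private', 'Self-emp', np.nan,
                       'Private', np.nan, 'Self-emp', 'Private'])

# Denominator = 6 non-missing values; no NaN category
excl_missing = workclass.value_counts(normalize=True)

# Denominator = all 8 rows; NaN appears as its own category
incl_missing = workclass.value_counts(normalize=True, dropna=False)

# Denominator = all 8 rows, but NaN is still left out of the table
by_len = workclass.value_counts() / len(workclass)

print(excl_missing)
print(incl_missing)
print(by_len)
```

Here 'Private' makes up 4 of the 6 known values but only 4 of the 8 rows, so its proportion is larger when missing values are excluded from the denominator.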

## Binary categorical variables

These have only two categories, such as 0/1 or True/False. This makes it easy to calculate the count and proportion of each value in a dataset.

In [None]:
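import numpy as np
binary_values = np.array([1, 0, 1, 1, 0]) # Illustrative binary data
np.sum(binary_values) # Counts the number of 1s (True) in a dataset
np.mean(binary_values) # Calculates the proportion of 1s (True) in a dataset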

It's possible to make use of this by turning non-binary variables into binary variables.

In [None]:
living_frequency = np.sum(nyc_trees.status == 'Alive')
living_proportion = (nyc_trees.status == 'Alive').mean()
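
The same pattern can be run on its own with a small stand-in for the trees data. In this sketch the status values are hypothetical. Comparing the column to 'Alive' produces a boolean Series, where True counts as 1 and False as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the nyc_trees 'status' column
status = pd.Series(['Alive', 'Dead', 'Alive', 'Stump', 'Alive'])

# Turn the multi-category variable into a binary one
is_alive = status == 'Alive'

living_frequency = np.sum(is_alive)     # number of True values
living_proportion = is_alive.mean()     # share of True values

print(living_frequency, living_proportion)
```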