## Nominal Categories

A nominal categorical variable is a categorical variable with no intrinsic ordering to the categories.

Because the categories have no order, it’s impossible to calculate a mean or median. It is also impossible to describe spread with statistics like variance, standard deviation, range, IQR, or percentiles, because these statistics all rely on being able to order the data in some way. However, it is still possible to calculate the mode, the most common value in the dataset.

We can do this in Python using the pandas .value_counts() method. It calculates the count of each value in a column and returns the result as a Series. By default, .value_counts() sorts categories in descending order of frequency, so the top row in the output is the mode.

In [5]:
import pandas as pd

# Read NYC Trees Data
nyc_trees = pd.read_csv("./nyc_tree_census.csv")

# Get tree counts by neighborhood
tree_counts = nyc_trees.neighborhood.value_counts()

# Get the neighborhood with the most trees (the mode)
greenest_neighborhood = tree_counts.index[0]

print(tree_counts)

Annadale-Huguenot-Prince's Bay-Eltingville    950
Great Kills                                   761
East New York                                 702
Bayside-Bayside Hills                         665
Rossville-Woodrow                             633
                                             ... 
63                                              1
39                                              1
75                                              1
BX33                                            1
40                                              1
Name: neighborhood, Length: 442, dtype: int64
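One caveat when reading the mode from the top row: if two or more categories tie for the highest count, index[0] still returns only one of them. The sketch below uses a small made-up Series (the values and variable names are illustrative assumptions, not the census data) to show one way to recover every tied mode.

```python
import pandas as pd

# Hypothetical toy data: "Bayside" and "Great Kills" tie for most common
neighborhoods = pd.Series(
    ["Bayside", "Great Kills", "Bayside", "Great Kills", "East New York"]
)

counts = neighborhoods.value_counts()

# index[0] returns a single label even when several categories share the top count
top_label = counts.index[0]

# Keep every category whose count equals the maximum to report all tied modes
all_modes = sorted(counts[counts == counts.max()].index)

# Series.mode() also returns every tied value, in sorted order
modes_via_mode = neighborhoods.mode().tolist()

print(top_label)
print(all_modes)
print(modes_via_mode)
```

If ties are possible in your data, reporting all of the tied categories is usually safer than reporting only the first row.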


## Ordinal Categorical Variables

In order to calculate numerical statistics for ordered categories, we need to first assign numerical values to the categories. 

In Python, the easiest way to do this is to convert the variable to type 'category' using pandas.Categorical(). When converting a column to type 'category', we can also pass a list with the column’s categories (and True to the ordered parameter) to indicate the desired ordering.

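Variables stored as type 'category' have an accessor attribute (.cat.codes) that converts the categories to integers based on their order. This lets us perform numerical operations on the categorical field, such as calculating the median category with NumPy’s median() function. Note that missing values are coded as -1, so they should be excluded first if they should not count toward the statistic.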

In [6]:
# import pandas as pd
import numpy as np

# Read NYC trees data
# nyc_trees = pd.read_csv("./nyc_tree_census.csv")

tree_health_statuses  = nyc_trees.health.unique()
print(tree_health_statuses)

health_categories = ['Poor', 'Fair', 'Good']

nyc_trees['health'] = pd.Categorical(nyc_trees['health'], health_categories, ordered=True)

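# Convert the categories to numbers to perform numerical operations.
# Missing values are coded as -1, so drop them before taking the median.
health_codes = nyc_trees['health'].cat.codes
median_health_status_index = np.median(health_codes[health_codes != -1])
median_health_status = health_categories[int(median_health_status_index)]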
print(median_health_status)

['Good' 'Poor' 'Fair' nan]
Good
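Because .cat.codes assigns -1 to missing values, those rows can pull the median toward the lowest category if they are left in. A minimal sketch with made-up data (the values below are hypothetical, not taken from the tree census) shows the difference:

```python
import numpy as np
import pandas as pd

health_categories = ["Poor", "Fair", "Good"]

# Hypothetical toy data: five recorded values and two missing ones
health = pd.Series(
    pd.Categorical(
        ["Good", "Good", "Good", "Fair", "Poor", np.nan, np.nan],
        categories=health_categories,
        ordered=True,
    )
)

codes = health.cat.codes
print(codes.tolist())  # missing values appear as -1

# Median over all codes, with missing values treated as if they were below "Poor"
median_with_missing = np.median(codes)

# Median over recorded values only
valid_codes = codes[codes != -1]
median_without_missing = np.median(valid_codes)

print(health_categories[int(median_with_missing)])
print(health_categories[int(median_without_missing)])
```

With an even number of valid observations, np.median() may return a value halfway between two codes, and int() then truncates it to the lower category.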


We can use cat.codes to return numeric values and perform a wide range of operations on categorical data as well. However, before performing any operations, you should check to make sure they make sense in the context of the data.

When we use .cat.codes to translate these categories into integers, those integers have equal spacing. Translating categories to numbers is often necessary to store and use the order of the categories (for calculating a statistic like the median, which relies only on ordering, not spacing). However, we should not use those numbers to calculate statistics, such as the mean, for which the distance between values matters.

In practice, researchers sometimes (albeit incorrectly) report means for ordinal categories. For example, a researcher might want to analyze survey responses to the question "Rate your happiness on a scale from 1 to 5 where 1 means 'very unhappy' and 5 means 'very happy'".

If that researcher calculates a 'mean happiness score', they are assuming that the difference in happiness between a rating of 1 and 2 is the same as the difference between a rating of 3 and 4. In practice, this assumption is likely not true, and it should be acknowledged when reporting the mean of an ordinal categorical variable.
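To see why the spacing assumption matters, the sketch below recodes a set of made-up happiness ratings with a second set of numbers that preserves the order but not the spacing. Both the responses and the alternative coding are arbitrary assumptions chosen for illustration. The mean moves to a different place on the scale, while the median stays on the same category.

```python
import numpy as np

# Hypothetical survey responses on the 1-5 happiness scale
responses = np.array([1, 2, 2, 3, 5, 5, 5])

# An alternative numeric coding with the same order but unequal spacing
# (these particular values are arbitrary)
relabel = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}
inverse = {value: rating for rating, value in relabel.items()}
stretched = np.array([relabel[r] for r in responses])

# The mean depends on which coding we picked
mean_original = np.mean(responses)    # falls between ratings 3 and 4
mean_stretched = np.mean(stretched)   # falls between the codes for ratings 4 and 5

# The median category is the same under both codings
median_original = int(np.median(responses))
median_stretched = inverse[int(np.median(stretched))]

print(mean_original, mean_stretched)
print(median_original, median_stretched)
```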

In [15]:
nyc_trees = pd.read_csv("nyc_tree_census2.csv")

nyc_trees.tree_diam_category = pd.Categorical(nyc_trees.tree_diam_category, ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)'], ordered=True)

print(nyc_trees.tree_diam_category)

# Get the mean of the numeric diameter variable, `trunk_diam`
mean_diam = np.average(nyc_trees.trunk_diam)
print(mean_diam)

# Get the mean category code of `tree_diam_category` (for illustration only;
# missing values are coded as -1 and are included in this average)
mean_diam_cat = np.average(nyc_trees.tree_diam_category.cat.codes)
print(mean_diam_cat)

0        Medium-Large (10-18in)
1               Large (18-24in)
2               Medium (3-10in)
3            Very large (>24in)
4            Very large (>24in)
                  ...          
49995    Medium-Large (10-18in)
49996           Large (18-24in)
49997                       NaN
49998    Medium-Large (10-18in)
49999             Small (0-3in)
Name: tree_diam_category, Length: 50000, dtype: category
Categories (5, object): ['Small (0-3in)' < 'Medium (3-10in)' < 'Medium-Large (10-18in)' < 'Large (18-24in)' < 'Very large (>24in)']
11.27048
1.97282


## Ordinal Categories: Spread

The mean is not interpretable for ordinal categorical variables because the mean relies on the assumption of equal spacing between categories. 

Many other statistics we might normally use for numerical data rely on the mean. Because of this, these statistics aren’t appropriate for ordinal data. Remember that the standard deviation and variance both depend on the mean: without a mean, we can’t have a reliable standard deviation or variance either!

We can rely on other summary statistics, like the proportion of the data within a range, or percentiles/quantiles. For example, consider the education variable from earlier. To calculate a range containing 80% of the data, we can use np.percentile():

tenth_perc_ind = np.percentile(df['education'].cat.codes, 10)
tenth_perc_cat = correct_order[int(tenth_perc_ind)]
print(tenth_perc_cat)  # output: 11th

ninetieth_perc_ind = np.percentile(df['education'].cat.codes, 90)
ninetieth_perc_cat = correct_order[int(ninetieth_perc_ind)]
print(ninetieth_perc_cat)  # output: Bachelors

This tells us that at least 80% of respondents range in "education level" from 11th grade to a Bachelor’s degree.

In [16]:
size_labels_ordered = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)']

nyc_trees.tree_diam_category = pd.Categorical(nyc_trees.tree_diam_category, size_labels_ordered, ordered=True)

# Calculate 25th Percentile Category
p25_perc_ind = np.percentile(nyc_trees.tree_diam_category.cat.codes, 25)
p25_tree_diam_category = size_labels_ordered[int(p25_perc_ind)]
print(p25_tree_diam_category)

# Calculate 75th Percentile Category
p75_perc_ind = np.percentile(nyc_trees.tree_diam_category.cat.codes, 75)
p75_tree_diam_category = size_labels_ordered[int(p75_perc_ind)]
print(p75_tree_diam_category)


Medium (3-10in)
Large (18-24in)
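The percentile cells above pass every code to np.percentile(), including the -1 codes that stand for missing values, and rely on int() to turn an interpolated result into an index. One possible alternative, sketched here on made-up data with hypothetical labels, is to drop the missing codes first and ask pandas for a quantile that is always an observed code by passing interpolation="lower" to Series.quantile():

```python
import numpy as np
import pandas as pd

size_labels = ["Small", "Medium", "Large"]  # hypothetical ordered labels

# Hypothetical toy data with one missing value
sizes = pd.Series(
    pd.Categorical(
        ["Small", "Medium", "Medium", "Large", np.nan, "Medium", "Large", "Small"],
        categories=size_labels,
        ordered=True,
    )
)

codes = sizes.cat.codes.astype("int64")
valid_codes = codes[codes != -1]  # exclude missing values (coded as -1)

# interpolation="lower" returns an observed code rather than a fractional value
p25_code = valid_codes.quantile(0.25, interpolation="lower")
p75_code = valid_codes.quantile(0.75, interpolation="lower")

p25_label = size_labels[int(p25_code)]
p75_label = size_labels[int(p75_code)]

print(p25_label)
print(p75_label)
```

Whether to round down, round up, or take the nearest code is a judgment call, and the choice should be stated when reporting the result.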


## Table of Proportions

We can calculate proportions by dividing the frequency by the number of observations in the data.
We can also calculate proportions using .value_counts() by setting the normalize parameter equal to True.

In [18]:
tree_status_proportions_1 = nyc_trees.status.value_counts()/len(nyc_trees.status)
tree_status_proportions_2 = nyc_trees.status.value_counts(normalize = True)
print(tree_status_proportions_1)
print(tree_status_proportions_2)

Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64
Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64


## Missing Data

You can set the dropna parameter in .value_counts() to determine how NaN values are handled in summaries of data.

When we divide the frequency of each category by len(df['workclass']), we’re calculating the proportion of a specific workclass group as a portion of all people in the dataset, including those with a missing value. The resulting proportions match those from .value_counts(normalize = True, dropna = False), although the latter also reports a row for NaN.

In contrast, using .value_counts(normalize = True) (or .value_counts(normalize = True, dropna = True) to be explicit) returns the proportion of a specific workclass group as a portion of people in the dataset who responded to this question.

If we don’t include the missing values in our denominator, we observe slightly larger proportions in each category (and no NaN category) in the output. It is important to think about how you want to deal with missing data when summarizing a categorical variable and then interpret the resulting values appropriately.

In [19]:
health_proportions = nyc_trees.health.value_counts(normalize = True)
print(health_proportions)

health_proportions_2 = nyc_trees.health.value_counts(normalize = True, dropna = False)
print(health_proportions_2)


Good    0.810986
Fair    0.146871
Poor    0.042143
Name: health, dtype: float64
Good    0.7736
Fair    0.1401
NaN     0.0461
Poor    0.0402
Name: health, dtype: float64
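The three ways of computing proportions can be compared side by side on a small made-up column. The data here are hypothetical (8 recorded values and 2 missing), chosen so the denominators are easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 6 "Good", 2 "Fair", and 2 missing values
health = pd.Series(["Good"] * 6 + ["Fair"] * 2 + [np.nan] * 2)

# Denominator is all 10 rows; the NaN row is not shown
by_all_rows = health.value_counts() / len(health)

# Denominator is all 10 rows; the NaN row is shown
with_nan_row = health.value_counts(normalize=True, dropna=False)

# Denominator is the 8 rows with a recorded value
by_observed = health.value_counts(normalize=True)

print(by_all_rows)
print(with_nan_row)
print(by_observed)
```

Only the last two sets of proportions sum to 1. The first sums to the share of rows that have a recorded value.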


## Binary Categorical Variables

When a binary variable is coded as 0/1, summing the column gives the frequency of 1s. In Python, the same behavior holds for columns coded as True/False because True gets coerced to 1 and False gets coerced to 0 (this is also true in most other programming languages used by data scientists). Similarly, we can calculate the proportion equal to 1 or True by taking the mean of the column. This works because the mean is just the sum of all values in the column (which is the frequency of 1s or Trues) divided by the number of values in the column.

Finally, we can make use of this nifty trick for any variable by using a conditional to translate a non-binary variable into True and False values.

In [20]:
living_frequency = np.sum(nyc_trees.status == 'Alive')
living_proportion = np.average(nyc_trees.status == 'Alive')
print(living_frequency)
print(living_proportion)

giant_frequency = np.sum(nyc_trees.trunk_diam > 30)
giant_proportion = np.average(nyc_trees.trunk_diam > 30)
print(giant_frequency)
print(giant_proportion)

47695
0.9539
1788
0.03576
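One detail to keep in mind with this trick: a comparison against a missing value evaluates to False, so rows with NaN stay in the denominator of the mean. The sketch below uses made-up diameters (hypothetical values, not the census data) to show the difference between the proportion of all rows and the proportion of rows with a recorded measurement:

```python
import numpy as np
import pandas as pd

# Hypothetical trunk diameters with one missing measurement
trunk_diam = pd.Series([4.0, 35.0, np.nan, 12.0, 31.0])

# The comparison yields False for the NaN row
is_giant = trunk_diam > 30

giant_frequency = is_giant.sum()   # number of True values
prop_all_rows = is_giant.mean()    # denominator includes the NaN row

# Drop missing measurements first to use only recorded values in the denominator
prop_measured = (trunk_diam.dropna() > 30).mean()

print(giant_frequency)
print(prop_all_rows)
print(prop_measured)
```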


In [21]:
# Read CSV
film_permits = pd.read_csv('film_permits.csv')

# Look at the first few rows
print(film_permits.head())

# Nominal Vars
nominalvars = ['EventType', 'Borough', 'Category', 'SubCategoryName']

# Ordinal Vars - We might consider an Id like 'EventID' an ordinal variable in some situations

# Borough with the most permits for pilot episodes
print(film_permits[film_permits.SubCategoryName == 'Pilot'].Borough.value_counts())

# Summarize the Top Categories
print(film_permits.Category.value_counts())

# Summarize the Top Subcategories
print(film_permits.SubCategoryName.value_counts())


   EventID                      EventType           StartDateTime  \
0   446168                Shooting Permit  10/19/2018 02:00:00 PM   
1   186438                Shooting Permit  10/30/2014 07:00:00 AM   
2   445255                Shooting Permit  10/20/2018 07:00:00 AM   
3   128794  Theater Load in and Load Outs  11/16/2013 12:01:00 AM   
4    43547                Shooting Permit  01/10/2012 07:00:00 AM   

              EndDateTime    Borough           Category  SubCategoryName  
0  10/20/2018 02:00:00 AM  Manhattan               Film          Feature  
1  10/31/2014 02:00:00 AM     Queens         Television  Episodic series  
2  10/20/2018 06:00:00 PM   Brooklyn  Still Photography   Not Applicable  
3  11/17/2013 06:00:00 AM  Manhattan            Theater          Theater  
4  01/10/2012 07:00:00 PM   Brooklyn         Television  Episodic series  
Manhattan        149
Brooklyn          89
Queens            21
Bronx             10
Staten Island      2
Name: Borough, dtype: int64
Te

In [24]:
# import pandas as pd
# import numpy as np

car_eval = pd.read_csv('car_eval_dataset.csv')
print(car_eval.head())
print(car_eval.columns)
print(car_eval.manufacturer_country.value_counts())
print(car_eval.manufacturer_country.value_counts().index[0])
print(car_eval.manufacturer_country.value_counts(normalize = True))

print(car_eval.buying_cost.unique())
buying_cost_categories = ['low', 'med', 'high', 'vhigh']
car_eval.buying_cost = pd.Categorical(car_eval.buying_cost, buying_cost_categories, ordered=True)
median_index = np.median(car_eval.buying_cost.cat.codes)
median_category = buying_cost_categories[int(median_index)]
print(median_category)

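# Proportions out of all rows (a NaN row is shown if any values are missing)
luggage_frequency = car_eval.luggage.value_counts(dropna=False)/len(car_eval.luggage)
# Proportions out of rows with a recorded value
luggage_proportions = car_eval.luggage.value_counts(normalize=True)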
print(luggage_frequency)
print(luggage_proportions)

print(car_eval.doors.value_counts().sort_index())
freq_2 = (car_eval['doors']=='2').sum()
print(freq_2)
freq_5more = (car_eval['doors']=='5more').sum()
prop_5more = (car_eval['doors']=='5more').mean()
print(freq_5more)
print(prop_5more)


  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  
Index(['buying_cost', 'maintenance_cost', 'doors', 'capacity', 'luggage',
       'safety', 'acceptability', 'manufacturer_country'],
      dtype='object')
Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64
Japan
Japan            0.228
Germany  