# Categorical Data

**01. Import libraries and data**

In [10]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

**02. Introduction**

Depending on what we’re trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using the mean, median, or mode, and we can summarize spread using the standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.

**03. Nominal Categories**
- Depending on the data, some of the summary statistics we use for quantitative data can still be meaningful for categorical data. Let’s first consider a nominal categorical variable: a categorical variable with no intrinsic ordering to its categories. Because the categories cannot be ranked, the mode (the most frequent category) is the measure of central tendency that applies.

- We can find the mode in Python using the **.value_counts()** method, which counts how often each value occurs in a column and returns the result as a Series. By default, .value_counts() sorts categories in descending order of frequency, so the top row of the output is the mode.
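
As a small illustration of reading the mode off .value_counts(), here is a minimal sketch. The `species` Series is made up for the example and is not taken from the tree census.

```python
import pandas as pd

# Hypothetical species codes, for illustration only
species = pd.Series(['ACPL', 'QUPA', 'ACPL', 'GLTR', 'ACPL', 'QUPA'])

# Frequencies, sorted from most to least common by default
counts = species.value_counts()

# The first index label is the most frequent category, i.e. the mode
mode_from_counts = counts.index[0]

# Series.mode() returns every mode (ties are possible), so take the first
mode_direct = species.mode()[0]

print(counts)
print(mode_from_counts, mode_direct)
```

Both approaches should agree here; when several categories tie for the highest count, Series.mode() returns all of them, while counts.index[0] shows only one.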


In [3]:
ny_trees = pd.read_csv('new_york_tree_census_1996csv.csv')
ny_trees.head()

Unnamed: 0,recordid,species,diameter,status,sidewalk_condition
0,433600,QUPA,6,Good,Good
1,48050,GLTR,10,Excellent,Good
2,506340,QURU,21,Good,Good
3,348044,ACPL,15,Poor,Good
4,354765,ACPL,7,Good,


In [4]:
ny_trees.dtypes

recordid               int64
species               object
diameter               int64
status                object
sidewalk_condition    object
dtype: object

In [28]:
ny_trees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 5 columns):
recordid              4999 non-null int64
species               4999 non-null object
diameter              4999 non-null int64
status                4999 non-null category
sidewalk_condition    4788 non-null object
dtypes: category(1), int64(2), object(2)
memory usage: 161.6+ KB


In [21]:
ny_trees.groupby('status').recordid.nunique()

status
Excellent          918
Good              3265
Fair                 6
Poor               376
Dead               134
Planting Space     148
Stump               58
Shaft                2
Unknown             92
Name: recordid, dtype: int64

In [49]:
# practice
counts0 = ny_trees.sidewalk_condition.value_counts()
counts0

Good      4332
Raised     456
Name: sidewalk_condition, dtype: int64

In [52]:
# count the trees of each species and show the five most common species
# .value_counts() returns a Series
counts1 = ny_trees.species.value_counts()
counts1.head()

ACPL    1034
PLAC     811
QUPA     363
PYCA     317
GLTR     314
Name: species, dtype: int64

In [12]:
# list the status categories in their intended order
correct_order = ['Excellent', 'Good', 'Fair', 'Poor', 'Dead', 'Planting Space', 'Stump', 'Shaft', 'Unknown']
# store status as an ordered categorical variable
ny_trees['status'] = pd.Categorical(ny_trees['status'], categories=correct_order, ordered=True)
# .cat.codes maps each category to its integer position in correct_order
median_index = np.median(ny_trees.status.cat.codes)
# look up the label that corresponds to the median code
median_category = correct_order[int(median_index)]
median_category

'Good'
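
One caveat with this approach: .cat.codes encodes missing values as -1, which would pull the median down if the column contained NaN. The sketch below uses a made-up `condition` Series with a hypothetical four-level ordering and drops those codes first. It also takes the lower of the two middle values when the count is even, which is one possible convention rather than the only one.

```python
import numpy as np
import pandas as pd

order = ['Excellent', 'Good', 'Fair', 'Poor']  # hypothetical ordering
condition = pd.Series(
    pd.Categorical(['Good', 'Poor', None, 'Excellent', 'Good', 'Fair'],
                   categories=order, ordered=True)
)

codes = condition.cat.codes  # missing values are encoded as -1

# Keep only valid codes and sort them
valid_codes = np.sort(codes[codes >= 0].to_numpy())

# Lower middle value: always an existing code, no averaging of two codes
median_code = valid_codes[(len(valid_codes) - 1) // 2]
median_label = order[median_code]
print(median_label)
```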

**Ordinal Categorical Variables - Central Tendency I**

- While we can represent ordered categories with equally spaced numbers, the categories themselves are not necessarily equally spaced. Take educational attainment as an example: some gaps between levels represent up to four additional years of schooling (e.g. '1st-4th' to '5th-6th'), while others represent a single additional year (e.g. '9th' to '10th').

- When we use .cat.codes to translate these categories into integers, those integers have equal spacing. While translating categories to numbers is often necessary to store and use the order of the categories (for calculating a statistic like the median, which only relies on ordering, not spacing), we should not use those numbers to calculate statistics — such as the mean — for which the distance between values matters.

- In practice, researchers sometimes (albeit incorrectly) report means for ordinal categories. For example, a researcher might want to analyze survey responses to the question "Rate your happiness on a scale from 1 to 5 where 1 means 'very unhappy' and 5 means 'very happy'".

- If that researcher calculates a 'mean happiness score', they are assuming that the difference in happiness between ratings of 1 and 2 is the same as the difference between ratings of 3 and 4. This assumption is likely not true, and it should be acknowledged when reporting the mean of an ordinal categorical variable.
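
To make the spacing issue concrete, the sketch below recodes a made-up set of 1-5 ratings with a different spacing that preserves the order. Both the `ratings` values and the `recode` mapping are hypothetical. The median still points to the same category, while the mean changes with the arbitrary choice of codes.

```python
import numpy as np

# Hypothetical survey responses on a 1-5 happiness scale
ratings = np.array([1, 2, 2, 3, 5, 5, 5])

# An alternative coding: same order, different spacing (assumed for illustration)
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 50}
recoded = np.array([recode[r] for r in ratings])

# The median lands on the code for the same category under both codings
print(np.median(ratings), np.median(recoded))

# The mean depends on the spacing chosen for the codes
print(ratings.mean(), recoded.mean())
```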


**Ordinal Categories: Spread**

- Many other statistics we might normally use for numerical data rely on the mean, so they aren’t appropriate for ordinal data. Remember that the standard deviation and variance both depend on the mean; without a meaningful mean, we can’t have a reliable standard deviation or variance either.

- Instead, we can rely on other summary statistics, like the proportion of the data within a range, or percentiles/quantiles. For example, to find a range containing 80% of the data, we can use np.percentile() on the category codes to get the 10th and 90th percentiles. The cell below uses the same function to find the 95th percentile of the status variable:


In [40]:
# calculate the 95th percentile of the status codes, then use the ordered list correct_order to find the corresponding label
percent = np.percentile(ny_trees.status.cat.codes, 95)
p95_status = correct_order[int(percent)]
p95_status

'Planting Space'
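
The 80% range mentioned above can be sketched the same way, using the 10th and 90th percentiles of the codes. The `condition` data and its four-level ordering are made up for the example, and truncating with int() mirrors the cell above rather than being the only reasonable rounding rule.

```python
import numpy as np
import pandas as pd

order = ['Excellent', 'Good', 'Fair', 'Poor']  # hypothetical ordering
condition = pd.Series(pd.Categorical(
    ['Excellent'] * 2 + ['Good'] * 5 + ['Fair'] * 2 + ['Poor'] * 1,
    categories=order, ordered=True))

codes = condition.cat.codes.to_numpy()

# The 10th and 90th percentiles bracket the middle 80% of the data
low_code, high_code = np.percentile(codes, [10, 90])

# int() truncates any interpolated value down to a valid list index
low_label = order[int(low_code)]
high_label = order[int(high_code)]
print(low_label, high_label)
```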

**Table of Proportions**

- You’ve already seen that we can use the .value_counts() method to get a table of frequencies for a categorical variable. A table of frequencies is often the first approach a data scientist might use to summarize a categorical variable; however, it is sometimes useful to look at the proportion of values in each category instead.

- We can calculate proportions by dividing each frequency by the number of observations in the data.
- We can also calculate proportions using .value_counts() by setting the normalize parameter to True:

In [55]:
ny_trees.status.value_counts()/len(ny_trees.status)

Good              0.653131
Excellent         0.183637
Poor              0.075215
Planting Space    0.029606
Dead              0.026805
Unknown           0.018404
Stump             0.011602
Fair              0.001200
Shaft             0.000400
Name: status, dtype: float64

In [54]:
ny_trees.status.value_counts(normalize = True).head()

Good              0.653131
Excellent         0.183637
Poor              0.075215
Planting Space    0.029606
Dead              0.026805
Name: status, dtype: float64

**Table of Proportions: Missing Data**

- One thing to keep in mind when calculating the proportion of data in a particular category: how are you dealing with missing data? 
- You can set the dropna parameter in .value_counts() to control how NaN values are handled. With dropna=True (the default), NaN values are not displayed and are excluded from the denominator, so each proportion is the category count divided by the number of non-missing values. With dropna=False, NaN appears as its own category and is counted in the denominator.
- Note that when missing values are excluded from the denominator, we observe slightly larger proportions in each category (and no NaN category), as the two outputs below show. It is important to think about how you want to deal with missing data when summarizing a categorical variable, and then to interpret the resulting values accordingly.

In [67]:
# calculate the proportion for each category, the denominator for your proportions should be the number of non-missing values
ny_trees.sidewalk_condition.value_counts(normalize = True, dropna = True)

Good      0.904762
Raised    0.095238
Name: sidewalk_condition, dtype: float64

In [68]:
# calculate the proportion for each category including missing values
ny_trees.sidewalk_condition.value_counts(normalize = True, dropna = False)

Good      0.866573
Raised    0.091218
NaN       0.042208
Name: sidewalk_condition, dtype: float64

**Binary Categorical Variables**
- Binary categorical variables have only two categories. In Python, these variables are often coded as 0/1 or True/False, which makes it easy to calculate their frequency or proportion in a dataset.

- For a column coded as 0/1, the sum gives the number of 1s and the mean gives the proportion of 1s.
- The same holds for columns coded as True/False, because True is coerced to 1 and False is coerced to 0.

In [85]:
good = ny_trees.sidewalk_condition == 'Good'
print(good.mean())

0.8665733146629326


In [86]:
good.sum()

4332

In [88]:
# count the trees with a trunk diameter greater than 30
(ny_trees.diameter > 30).sum()

135

In [90]:
# calculate the proportion of trees with a trunk diameter greater than 30
(ny_trees.diameter > 30).mean()

0.027005401080216044
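
One thing to watch with this shortcut is missing data. A comparison such as `ny_trees.sidewalk_condition == 'Good'` evaluates to False for NaN, so the mean of 0.8666 above uses all rows as the denominator and matches the dropna=False proportion. The sketch below, on a made-up `sidewalk` Series, shows one way to use only non-missing rows instead.

```python
import numpy as np
import pandas as pd

sidewalk = pd.Series(['Good', 'Raised', np.nan, 'Good'])  # hypothetical data

# NaN == 'Good' evaluates to False, so missing rows count as "not Good"
is_good = sidewalk == 'Good'
prop_all_rows = is_good.mean()

# Drop missing rows first so that only observed values form the denominator
prop_non_missing = (sidewalk.dropna() == 'Good').mean()

print(prop_all_rows, prop_non_missing)
```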

**Review**
- In this lesson you’ve learned the steps you can take to summarize and interpret summaries of nominal categorical and ordinal categorical variables.

- For nominal categorical variables, there is no ordering to the categories. Because of this, we’re limited to using the mode to describe central tendency and there is no way to summarize the spread.

- For ordinal categorical variables, there is an implied ordering to the categories. In Python, we can use pd.Categorical() with ordered=True to transform a variable to an ordered categorical type. The Categorical type allows us to access a numeric value for each category by using .cat.codes. From there, we may perform operations on this variable as if it were a regular numeric variable.

- However, when calculating statistics for an ordinal categorical variable we should be mindful that some numeric statistics rely on the assumption of equal spacing between categories.

- For ordinal categorical variables, median and mode can be used to summarize the central tendency, and the IQR (or any difference between percentiles) can be used to summarize the spread.

- Certain summary statistics (e.g. frequencies and proportions) can be used for all categorical variables. You can create True/False columns and use np.sum() and np.mean() to quickly summarize how many rows, and what proportion of your data, meet certain criteria.