# Analyzing Categorical Data from the General Social Survey in Python

Welcome to your webinar workspace! In this session, we will introduce you to categorical variables in Python. We will be using a subset of data from the [General Social Survey](https://www.kaggle.com/datasets/norc/general-social-survey?select=gss.csv).

The following code block imports the main packages we will be using: [pandas](https://pandas.pydata.org/docs/index.html), [NumPy](https://numpy.org/doc/stable/index.html), and [Plotly](https://plotly.com/graphing-libraries/). We will also use [statsmodels](https://www.statsmodels.org/dev/index.html), which draws with Matplotlib, for a mosaic plot.

We will read in our data and preview it as an interactive table. Please follow along with the code and feel free to ask any questions!

In [1]:
# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Read in csv as a DataFrame and preview it



## Inspecting our data
### What types of data are in our dataset?
One of the simplest ways to get an overview of the types of data you are working with is to use the [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method, which will return a summary of your data, including:
- The column names.
- The number of non-null values per column.
- The data types.
- The memory usage of the DataFrame.

Above we see that our DataFrame contains `float64` (numerical) data as well as a number of `object` columns. In pandas, the `object` data type typically holds strings.
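As a minimal sketch of what `.info()` reports, the example below builds a tiny stand-in DataFrame. The column names and values are made up for illustration and are not the real survey data; the string columns are given the `object` data type explicitly so the output matches the description above.

```python
import io
import pandas as pd

# Hypothetical miniature of the survey data (values are invented for illustration)
df = pd.DataFrame({
    "year": [1972.0, 1985.0, 1998.0, 2010.0],
    "marital_status": pd.Series(
        ["MARRIED", "NEVER MARRIED", "DIVORCED", "MARRIED"], dtype="object"
    ),
    "labor_status": pd.Series(
        ["WORKING FULLTIME", "RETIRED", "KEEPING HOUSE", "WORKING PARTTIME"],
        dtype="object",
    ),
})

# .info() prints its summary; writing it to a buffer lets us keep it as text too
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# The dtypes attribute holds the same type information as a Series
print(df.dtypes)
```

In a notebook you would normally just call `df.info()`; the buffer is only used here so the summary can be inspected as a string.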

### Inspecting individual columns
To inspect a categorical column, use the [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method with the `include` parameter to select a particular data type (in this case `"O"`, for object). This returns the count, the number of unique values, the mode, and the frequency of the mode.

The [`.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method can give you a greater insight into the distribution and structure of a column.
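The two methods above can be sketched on a small stand-in DataFrame. The data is invented for illustration, and `normalize=True` is shown as one optional way to get proportions instead of counts.

```python
import pandas as pd

# Hypothetical sample data (not the real survey values)
df = pd.DataFrame({
    "age": [23.0, 45.0, 67.0, 34.0, 51.0],
    "marital_status": pd.Series(
        ["MARRIED", "NEVER MARRIED", "MARRIED", "DIVORCED", "MARRIED"],
        dtype="object",
    ),
})

# Summarize only the object (string) columns: count, unique, top (mode), freq
object_summary = df.describe(include="O")
print(object_summary)

# Count how often each value occurs, most frequent first
counts = df["marital_status"].value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts
shares = df["marital_status"].value_counts(normalize=True)
print(shares)
```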

## Manipulating categorical data
### Let's convert our object columns to categories
- The categorical data type is useful for several reasons:
    - It saves memory when a column contains only a few distinct values.
    - It lets you specify a precise order for the categories when the default (alphabetical) order is incorrect.
    - It can be compatible with other Python libraries.

Let's take our existing categorical variables and convert them from strings to categories. Here, we use [`.select_dtypes()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) to return only object columns, and with a dictionary set their type to be a category.

In [2]:
# Create a dictionary of column and data type mappings


# Convert our DataFrame and check the data types



Already we can see that the memory usage of the DataFrame has dropped from 7 MB to 4 MB, a reduction of almost half! This helps when working with large quantities of data, such as the survey we are using here.
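A minimal sketch of the conversion, assuming a small made-up DataFrame rather than the real survey file. The name `conversion_dict` is hypothetical, and the byte counts it prints will differ from the figures quoted above.

```python
import pandas as pd

# Hypothetical data: a numeric column plus two repetitive string columns
df = pd.DataFrame({
    "year": [1972.0, 1985.0, 1998.0, 2010.0] * 250,
    "marital_status": pd.Series(
        ["MARRIED", "NEVER MARRIED", "DIVORCED", "MARRIED"] * 250, dtype="object"
    ),
    "degree": pd.Series(
        ["HIGH SCHOOL", "BACHELOR", "GRADUATE", "HIGH SCHOOL"] * 250, dtype="object"
    ),
})

# Build a {column name: "category"} mapping for every object column
object_columns = df.select_dtypes(include="object").columns
conversion_dict = {col: "category" for col in object_columns}

# Apply the mapping; columns not in the dictionary keep their data type
df_cat = df.astype(conversion_dict)
print(df_cat.dtypes)

# Compare memory usage; deep=True counts the actual string contents
before = df.memory_usage(deep=True).sum()
after = df_cat.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```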

### Cleaning up the `labor_status` column
To analyze the relationship between employment and attitudes over time, we need to clean up the `labor_status` column. We can preview the existing categories using `.categories`.

Let's collapse some of these categories. The easiest way to do this is to replace the values inside the column using a dictionary, and then reset the data type back to a category.
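Here is a sketch of that collapse on a toy Series. The values are invented, and the name `collapse_map` is hypothetical. Converting to plain strings before calling `.replace()` is an assumption made here to keep the example independent of how a given pandas version handles `.replace()` on categorical data.

```python
import pandas as pd

# Hypothetical categorical column with a few labor statuses
labor = pd.Series(
    ["WORKING FULLTIME", "WORKING PARTTIME", "UNEMPL, LAID OFF",
     "TEMP NOT WORKING", "RETIRED"],
    dtype="category",
)

# Preview the existing categories
print(labor.cat.categories)

# Old category -> new category; values not listed are left unchanged
collapse_map = {
    "UNEMPL, LAID OFF": "UNEMPLOYED",
    "TEMP NOT WORKING": "UNEMPLOYED",
    "WORKING FULLTIME": "EMPLOYED",
    "WORKING PARTTIME": "EMPLOYED",
}

# Replace on plain strings, then convert the result back to a category
labor_clean = labor.astype("object").replace(collapse_map).astype("category")
print(labor_clean.cat.categories)
```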

In [3]:
# Create a dictionary of categories to collapse
new_labor_status = {"UNEMPL, LAID OFF": "UNEMPLOYED", 
                    "TEMP NOT WORKING": "UNEMPLOYED",
                    "WORKING FULLTIME": "EMPLOYED",
                    "WORKING PARTTIME": "EMPLOYED"
                   }

# Replace the values in the column and reset as a category


# Preview the new column


### Reordering categories
Another potential issue is the order of our opinion variables (`environment`, `law_enforcement`, and `drugs`). These are ordinal variables, or categorical variables with a clear ordering or ranking. However, these orders are not currently set. 

This will affect us later when we go to visualize our data. We can also take the opportunity to drop some unwanted categories.

Let's loop through the three variables and give them all an order. While we're at it, let's drop two categories that don't have any use for us: "DK" (don't know) and "IAP" (inapplicable). By removing them as categories, we set them to null so they won't be counted in the final analysis.

In [4]:
# Set the new order
new_order = ["TOO LITTLE", "ABOUT RIGHT", "TOO MUCH", "DK", "IAP"]
categories_to_remove = ["DK", "IAP"]

# Loop through each column

    # Reorder and remove the categories


# Preview one of the columns' categories


Now let's also apply these steps to education level in one go: collapsing, removing, and reordering.
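The three steps can be chained into a single expression. This is a sketch on a toy Series with invented values; the final category order (high school below college/university) is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical degree values stored as plain strings
degree = pd.Series(
    ["LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE", "DK"],
    dtype="object",
)

new_degree = {
    "LT HIGH SCHOOL": "HIGH SCHOOL",
    "BACHELOR": "COLLEGE/UNIVERSITY",
    "GRADUATE": "COLLEGE/UNIVERSITY",
    "JUNIOR COLLEGE": "COLLEGE/UNIVERSITY",
}

degree_clean = (
    degree.replace(new_degree)            # collapse the categories
    .astype("category")                   # convert to the category data type
    .cat.remove_categories(["DK"])        # "DK" values become missing
    .cat.reorder_categories(              # set an explicit, ordered ranking
        ["HIGH SCHOOL", "COLLEGE/UNIVERSITY"], ordered=True
    )
)

print(degree_clean)
```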

In [5]:
# Define a dictionary to map old degree categories to new ones
new_degree = {"LT HIGH SCHOOL": "HIGH SCHOOL", 
              "BACHELOR": "COLLEGE/UNIVERSITY",
              "GRADUATE": "COLLEGE/UNIVERSITY",
              "JUNIOR COLLEGE": "COLLEGE/UNIVERSITY"}

# Replace old degree categories with new ones and convert to categorical data type


# Remove "DK" category from degree_clean column


# Reorder degree_clean categories and set as ordered


# Preview the new column


### Let's simplify our date data
We can also bin numerical data to create categorical variables. There are a few reasons that we might want to do this:
- It can simplify data and allow us to more easily spot trends and patterns.
- It can make visualizing data easier, such as when you want to use bar plots.

Here we use a `pandas` [`IntervalIndex`](https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.html) to set cutoff ranges for the `year`. We then use [`pd.cut()`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) to cut our `year` column by these ranges, and set labels for each range.
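A sketch of the binning on a handful of made-up years. Two details are worth keeping in mind: intervals built with `IntervalIndex.from_tuples()` are closed on the right by default, and `pd.cut()` ignores its `labels` argument when the bins are an `IntervalIndex`, so the labels are applied afterwards with `.cat.rename_categories()`.

```python
import pandas as pd

# Hypothetical survey years
years = pd.Series([1972, 1979, 1980, 1994, 2005, 2018], name="year")

decade_boundaries = [(1970, 1979), (1979, 1989), (1989, 1999), (1999, 2009), (2009, 2019)]
decade_labels = ["1970s", "1980s", "1990s", "2000s", "2010s"]

# Right-closed intervals: (1979, 1989] covers 1980 through 1989
bins = pd.IntervalIndex.from_tuples(decade_boundaries)

# Assign each year to its interval; the result is an ordered categorical
decade = pd.cut(years, bins)

# Swap the interval categories for readable labels (same order as the bins)
decade = decade.cat.rename_categories(decade_labels)
print(decade)
```

With these boundaries, a year of exactly 1970 would fall outside every interval and become a missing value, because the first interval excludes its left edge.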

In [6]:
# Set the decade boundaries and labels
decade_boundaries = [(1970, 1979), (1979, 1989), (1989, 1999), (1999, 2009), (2009, 2019)]
decade_labels = ['1970s', '1980s', '1990s', '2000s', '2010s']

# Set the bins and cut the DataFrame


# Rename the categories


# Preview the new column


## Visualizing categorical variables

### Bar plots to show value counts
Earlier we used the `.value_counts()` method to show the counts for different categories. But we can also visualize this using Plotly. Let's start with a bar chart.

In [7]:
# Create a new figure object


# Hide the legend and show the plot



Good, but not great! It's often best to use a horizontal bar chart for categorical variables so the labels have room to breathe. Let's change the orientation of the plot and add a title.

In [8]:
# Create a new figure object



# Hide the legend and show the plot


### Bar charts to show a categorical average
Besides counts, bar charts can be a great way to show aggregations of a categorical variable. Let's use our `decade` variable from earlier to visualize the average household size over time.
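The aggregation step can be sketched with `.groupby()`. The column name `household_size` and all values below are assumptions for illustration; the real column may be named differently.

```python
import pandas as pd

# Hypothetical data: one row per respondent
df = pd.DataFrame({
    "decade": pd.Categorical(
        ["1970s", "1970s", "1980s", "1980s", "1990s"],
        categories=["1970s", "1980s", "1990s"],
        ordered=True,
    ),
    "household_size": [4.0, 3.0, 3.0, 2.0, 2.0],
})

# Average household size per decade; observed=True skips unused categories
household_by_decade = (
    df.groupby("decade", observed=True)["household_size"]
    .mean()
    .reset_index()
)
print(household_by_decade)
```

The result has one row per decade, which is the shape a bar chart function expects.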

In [9]:
# Aggregate household size by decade


In [10]:
# Create a new figure object





# Show the plot


### Boxplots
Boxplots display the median, quartiles, and range of a dataset in a way that allows for easy comparison between multiple groups or categories. The box in the plot represents the interquartile range (IQR), which contains the middle 50% of the data, with the median represented by a line inside the box. 

Here we use Plotly to create a [box plot](https://plotly.com/python/box-plots/) of the ages by employment. Do the distributions make sense per employment category?

In [17]:
# Create a new figure object


# Show the plot


### Mosaic plots
Sometimes you will want to visualize the relationship between two categorical variables. One way to do this is a frequency table, which will give you the counts across the different combinations of the two variables.

You can create a frequency table using [`pd.crosstab()`](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html), and passing in your two columns.
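For instance, on a small made-up dataset a frequency table might be built like this. The optional `normalize="index"` argument is shown as one way to turn counts into within-row proportions.

```python
import pandas as pd

# Hypothetical education and employment values
df = pd.DataFrame({
    "degree": ["HIGH SCHOOL"] * 3 + ["COLLEGE/UNIVERSITY"] * 3,
    "labor_status": ["EMPLOYED", "EMPLOYED", "RETIRED",
                     "EMPLOYED", "RETIRED", "RETIRED"],
})

# Counts for every combination of the two variables
freq = pd.crosstab(df["degree"], df["labor_status"])
print(freq)

# Proportions within each row (each degree level sums to 1)
row_share = pd.crosstab(df["degree"], df["labor_status"], normalize="index")
print(row_share)
```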

However, this can be hard to interpret, as it's not easy to get a sense of the proportions within the two categories. A better way to represent this data is a mosaic plot. Mosaic plots display the proportion of each category within each level of the other variable. This allows us to easily compare the distribution of the two variables. 

Mosaic plots are difficult to generate in Plotly, but fortunately `statsmodels` has a [`mosaic`](https://www.statsmodels.org/stable/generated/statsmodels.graphics.mosaicplot.mosaic.html) function that makes generating them a breeze.
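A minimal sketch of the `mosaic` call, assuming a made-up DataFrame with two string columns. The function draws onto a Matplotlib axis and returns the figure together with a dictionary of tile rectangles.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical education and employment values
df = pd.DataFrame({
    "degree": ["HIGH SCHOOL"] * 4 + ["COLLEGE/UNIVERSITY"] * 4,
    "labor_status": ["EMPLOYED", "EMPLOYED", "EMPLOYED", "RETIRED",
                     "EMPLOYED", "RETIRED", "RETIRED", "RETIRED"],
})

# Create an axis so the figure size can be controlled
fig, ax = plt.subplots(figsize=(8, 5))

# The list names the columns to split on: first horizontally, then vertically
fig, rects = mosaic(
    df,
    ["degree", "labor_status"],
    ax=ax,
    title="Employment by education level",
)

# rects maps each category combination to its tile position and size
print(sorted(rects))
```

In a notebook the figure renders after the cell runs; in a script, `plt.show()` would display it.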

In [22]:
# Create a mosaic plot and show it




### Line charts
The final plot type we will cover is a line plot. Line plots often (but not always!) show the relationship between time and a numerical variable. Adding in a categorical variable can be a great way to enrich a line plot and provide other information.

Here, we use the `.value_counts()` method as an aggregation function, and use this in combination with Plotly's [`px.line()`](https://plotly.com/python/line-charts/) to visualize the trend in marital statuses over the years.

In [20]:
# Group the dataframe by year and marital status, and calculate the normalized value counts


# Display the resulting DataFrame


In [23]:
# Create a new figure object




# Update the y-axis to show percentages


# Show the plot


## Next steps
To learn more techniques for working with categorical variables, check out [Working with Categorical Data in Python](https://app.datacamp.com/learn/courses/working-with-categorical-data-in-python). This course covers additional methods for working with categorical data, as well as advanced techniques like label and one-hot encoding.