# Interactive Visualizations

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import plotly.express as px

## Dynamic Visualizations

So far, we have only worked with static visualizations. That is, there were no moving pieces or changes that you would make to the graphs after they were made. These are the more traditional visualizations and most commonly used within reports (after all, you can't animate figures on a piece of paper ... at least not yet!).

However, now that presentations and reports are distributed and accessed online much more frequently, the use of interactive, dynamic visualizations has grown. Dynamic visualizations can open up a whole new dimension and allow you to show relationships and trends more clearly than with a static visualization. In addition, they are more flexible: a single interactive figure can offer several customized views of the data, which would otherwise require creating a separate static visualization for each view.

These interactive visualizations are typically used for:
- Flexible custom views of the data
- Creating dashboards and other data tools for non-technical users
- Showing a time dimension that would be otherwise difficult to show
- Making adjustments in real-time as data comes in
- And more ...

To start, we'll bring in some datasets to use within our figures.

In [None]:
with open('census-key.txt', 'r') as f:
    census_key = f.readline().strip()  # strip the trailing newline so the key is clean

In [None]:
from acs_data import get_county_data
from acs_data import get_us_data

In [None]:
census_data = get_county_data(2022, census_key)
census_data.head()

In [None]:
md_va = census_data[(census_data.state == 'Maryland') | (census_data.state == 'Virginia')]
md_va.head()

In [None]:
data_file = '201807-CAH_PulseOfTheNation_Raw.csv'
potn = pd.read_csv(data_file)
potn.head()

## Plotly

The `plotly` package provides an easy way to create interactive visualizations. Here, we'll go over how to use the "express" functions, which produce interactive figures without needing lots of code or customization. The `plotly` package also allows for more complicated animations and dynamic aspects, including maps and 3-D plots. For more guides on how to use `plotly`, see the tutorials at https://plotly.com/python/#animations.

The `scatter` function from plotly uses syntax similar to seaborn, but creates a scatterplot with points that you can hover over to get information about them. 

In [None]:
fig = px.scatter(census_data, x = 'percent_bachelors', y = 'mean_income')
fig.show()

Note that the hover information only has the values of the variables that are plotted. This is useful if we want to get the exact values, but we might also want to know what each observation represents. For example, which county is the one that had the highest percent of people with a Bachelor's degree? Or the lowest? We can add this information to the graph by adding the `hover_data` argument.

In [None]:
fig = px.scatter(census_data, x = 'percent_bachelors', y = 'mean_income', hover_data = ['county'])
fig.show()

In [None]:
fig = px.scatter(census_data.dropna(), x = 'percent_bachelors', y = 'mean_income', 
                 hover_data = ['county'], size = 'num_households')
fig.show()

<font color ='red'>**Question 1: Create a visualization that plots the percent of people with a bachelor's degree with the percent of people employed within a county, with the size of the observation scaled to the mean income of that county.**</font>

## Interactive Bar Charts

We can use `plotly` to make interactive bar charts as well. First, we start by using `crosstab` in order to make the table that contains the underlying data in the bar chart. We'll use `normalize = 'index'` in order to get proportions rather than raw counts so that we can see the relationship between two variables.

In this example, we look at political party and gender.

In [None]:
party_by_gender = pd.crosstab(potn.political_party, potn.gender, normalize = 'index')
party_by_gender
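
To see what `normalize = 'index'` does, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are invented for illustration and are not from the survey). Each row of the normalized table sums to 1, because every count is divided by its row total.

```python
import pandas as pd

# Made-up responses, for illustration only (not the survey data)
toy = pd.DataFrame({
    'party':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'gender': ['Female', 'Female', 'Female', 'Male', 'Female', 'Male'],
})

# Raw counts: how many respondents fall in each party/gender cell
counts = pd.crosstab(toy.party, toy.gender)

# Row proportions: each count divided by its row total
props = pd.crosstab(toy.party, toy.gender, normalize='index')

print(counts)
print(props)
```

Row proportions answer "within each party, what share is each gender?", which makes rows of very different sizes comparable.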

As before, we need to reorder the rows (the political party categories) so that they are in a more intuitive order.

In [None]:
party_order = ['Strong Democrat', 'Not Very strong Democrat', 'Independent',
              'Not very Strong Republican', 'Strong Republican', 'DK/REF']
party_by_gender = party_by_gender.loc[party_order, :]
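
Selecting rows with `.loc` raises a `KeyError` if any label in the list does not match the index exactly, including capitalization. One way to guard against that is to check for mismatches before reordering. The sketch below uses a made-up table; the names `table` and `wanted_order` are hypothetical.

```python
import pandas as pd

# Made-up proportions, for illustration only
table = pd.DataFrame(
    {'Female': [0.6, 0.5, 0.4], 'Male': [0.4, 0.5, 0.6]},
    index=['Independent', 'Democrat', 'Republican'],
)

wanted_order = ['Democrat', 'Independent', 'Republican']

# Labels in the desired order that do not appear in the table's index
missing = [label for label in wanted_order if label not in table.index]
print(missing)

# Safe to reorder once nothing is missing
reordered = table.loc[wanted_order, :]
print(reordered)
```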

Now that we have our data in the form that we need, using `px.bar` and specifying the x and y variables gives us the bar chart that we want. This bar chart allows us to see the relationship between gender and political party, as well as allowing us to see the exact values of each of the bars and colors by hovering over them. This provides a clean image for comparing the groups as well as a way to see exact values if we want to without having them clutter up the graph.

In [None]:
fig = px.bar(party_by_gender, x = party_by_gender.index, y = party_by_gender.columns)
fig.show()

We could have also made this into a horizontal bar graph by adding the `orientation = 'h'` argument. Note that this requires you to switch the `x` and `y` arguments, because the elements that you are putting on the x- and y-axes are now swapped.

In [None]:
fig = px.bar(party_by_gender, x = party_by_gender.columns, y = party_by_gender.index, orientation = 'h')
fig.show()

### Boxplots

You can make boxplots in the same manner as well. Hovering over a box shows the values that were computed to draw it, such as the median and quartile values. Similar to seaborn, we provide the DataFrame as well as specify the `x` and `y` variables. We can also provide an argument for `color`, which will further split the data into groups for additional comparisons.

In [None]:
fig = px.box(potn, x = 'gender', y = 'age', color = 'race')
fig.show()
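
The median and quartile values shown on hover can be cross-checked with pandas. This is a sketch on made-up data (the `toy` DataFrame is invented for illustration). The quartiles that `plotly` reports may differ slightly from these, because packages can use different interpolation rules for quartiles.

```python
import pandas as pd

# Made-up ages, for illustration only
toy = pd.DataFrame({
    'gender': ['Female'] * 5 + ['Male'] * 5,
    'age':    [20, 30, 40, 50, 60, 25, 35, 45, 55, 65],
})

# 25th, 50th (median), and 75th percentiles of age within each group,
# reshaped so that each quantile becomes a column
quartiles = toy.groupby('gender')['age'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```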

<font color ='red'>**Question 2: Create a visualization that compares the boxplot of mean income for counties in Maryland and in Virginia.**</font>

## Sliders and Animated Plots

The `plotly` package also provides the ability to add sliders and animations to graphs. This is most useful for when you want to show changes over time or want to look at different cuts of the data according to some categorical variable.

Let's take a look at an example using the built-in Gapminder dataset within `plotly`.

In [None]:
gm = px.data.gapminder()
gm.head()


This data contains country-level information about characteristics such as life expectancy, population, and GDP per capita. The Gapminder website (https://www.gapminder.org) also has resources to find data on other characteristics for countries around the world, such as fertility and child mortality. We can make a scatterplot similar to before using the `scatter` function, but also add an `animation_frame` argument so that it animates by year. The `animation_group` argument identifies which observations are the same across time points, so that each country's point moves smoothly from one frame to the next.

In [None]:
fig = px.scatter(gm, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
                 size="pop", color="continent", hover_name="country",
                 log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])
fig.show()

We could have done this with the ACS data too, but we would need to get data over multiple years first. The fact that we defined a function separately makes this easier to do. Using a `for` loop, we can use the `get_county_data` function to pull data for a range of years, then use `pd.concat` to combine the datasets together.

In [None]:
year_range = range(2013,2020)
acs = []

for year in year_range:
    df = get_county_data(year, census_key)
    df['year'] = year
    acs.append(df) 
    
acs_over_years = pd.concat(acs, ignore_index = True)
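
The same loop-and-concatenate pattern can be tried without an API key. In this sketch, `fake_get_county_data` is a hypothetical stand-in for `get_county_data` that returns made-up values; only the stacking pattern is the point.

```python
import pandas as pd

def fake_get_county_data(year):
    """Hypothetical stand-in for get_county_data; returns made-up values."""
    return pd.DataFrame({
        'county': ['County A', 'County B'],
        'mean_income': [50000 + year, 60000 + year],
    })

frames = []
for year in range(2013, 2016):
    df = fake_get_county_data(year)
    df['year'] = year          # tag each row with the year it came from
    frames.append(df)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1
stacked = pd.concat(frames, ignore_index=True)
print(stacked.shape)
print(stacked.year.unique())
```

Collecting the pieces in a list and calling `pd.concat` once at the end avoids repeatedly copying a growing DataFrame inside the loop.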

To avoid issues with some counties having the same name across states, we'll combine the `county` and `state` variables and create a new variable that has a unique county-state combination. 

In [None]:
acs_over_years['county_state'] = acs_over_years.county + ', ' + acs_over_years.state
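
A quick way to confirm that the combined label is needed, and that it works, is to check for duplicates before and after. This sketch uses a small hand-typed DataFrame (`toy`) rather than the ACS data.

```python
import pandas as pd

# Hand-typed rows, for illustration only
toy = pd.DataFrame({
    'county': ['Montgomery County', 'Montgomery County', 'Howard County'],
    'state':  ['Maryland', 'Virginia', 'Maryland'],
})

# The county name alone is repeated across states ...
print(toy.county.duplicated().any())

# ... but the combined county-state label is unique
toy['county_state'] = toy.county + ', ' + toy.state
print(toy.county_state.duplicated().any())
```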

Finally, we use `dropna()` to remove any rows with NAs in them, then create the visualization. Here, we are graphing the `percent_employed` variable with `mean_income`. Note that some of the variable names changed over the years, so it would take a bit more work to get the appropriate `percent_bachelors` variable.

In [None]:
fig = px.scatter(acs_over_years.dropna(), x="percent_employed", y="mean_income", 
                 animation_frame="year", animation_group="county_state",
                 size="num_households", hover_name="county",
                 range_x=[25,90], range_y=[0,300000])
fig.show()
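
Calling `dropna()` with no arguments drops a row if any column is missing, including columns the plot never uses. One possible alternative, sketched here on made-up data, is to pass `subset` so that only the plotted columns are checked. The `other_var` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'percent_employed': [60.0, 55.0, np.nan],
    'mean_income':      [70000.0, 65000.0, 80000.0],
    'num_households':   [1000.0, 2000.0, 1500.0],
    'other_var':        [np.nan, 1.0, 2.0],   # not used in the plot
})

plot_cols = ['percent_employed', 'mean_income', 'num_households']

all_cols = toy.dropna()                       # drops rows 0 and 2
plotted_only = toy.dropna(subset=plot_cols)   # drops only row 2

print(len(all_cols), len(plotted_only))
```

Whether to keep the extra rows is a judgment call; restricting to `subset` keeps more observations in the figure but means different plots may be based on different sets of counties.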

# Visualizations Overview

Visualizations can be quite complicated, and there are lots of options for displaying information. In general, you should start by thinking about what you want to show and why you want to show it. For example, you might use a bar plot with the raw numbers if you are interested in the number of people within groups, but you might choose a bar plot with proportions if you want to compare two variables. Make sure you keep in mind what the goal of the visualization is when making the graph. Then, after you have an initial version, think about what you would need to do to make the graph clear, such as adding labels or annotations.

**Things to consider before creating a visualization:**
- What type of data are you trying to visualize? Is it numerical? Is it categorical?
- What is the structure of the data you are trying to visualize? 
- What are you trying to show with the visualization? Is it a relationship? A trend over time? Or just raw numbers?
- What characteristics of the data do you want to map to what pieces of the visualizations? For example, what should go on the x-axis? The y-axis? Should there be different facets or colors?

**Things to consider after creating a visualization:**
- Does the visualization show the trend/relationship/numbers that you want it to?
- Is the information clear by itself? Do you need to add reference lines or annotations to make it clearer?

Using a figure-level function such as `relplot` with the `col` argument is much simpler than having to define subplots and graph each one separately! We can also add other dimensions to the plot, such as scaling the size of the points by the number of households in that county, so that we can see which counties are larger.

In [None]:
sns.relplot(
    data=md_va,
    x="percent_bachelors", y="mean_income", col="state",
    size="num_households"
)


Finally, let's clean this up a bit more by adding better axis labels and an overall title. Note that we need to do a bit more work to add a title over the whole figure because we used a figure-level function. The `.figure` attribute allows us to access the Figure object underlying the graph and add an overall title.

In [None]:
g = sns.relplot(
    data=md_va,
    x="percent_bachelors", y="mean_income", col="state",
    size="num_households"
)

# Set axis labels
g.set_axis_labels("Percent Bachelor's", "Mean Income")

# Add overall Title
g.figure.subplots_adjust(top=.85)
g.figure.suptitle("Percent Bachelor's by Mean Income for MD and VA")



Alternatively, we could have simply used a different point style and color to indicate the difference between Maryland and Virginia.

In [None]:
g = sns.scatterplot(
    data = md_va,
    x = "percent_bachelors", y = "mean_income", 
    hue = "state", style = 'state'
)
g.set(xlabel = "Percent Bachelor's", ylabel = "Mean Income", 
      title = "Mean Income by Percent Bachelor's for Counties in MD vs VA (2022)")

If we wanted to add trend lines to these plots so that we can see what the relationships are like for Maryland vs. Virginia, we can instead use an `lmplot`. This creates a scatterplot as well, but also adds the least squares regression line for each group along with a bootstrap confidence interval around it.

In [None]:
g = sns.lmplot(
    data = md_va,
    x = "percent_bachelors", y = "mean_income",
    hue = "state"
)
g.set(xlabel = "Percent Bachelor's", ylabel = "Mean Income", 
      title = "Mean Income by Percent Bachelor's \nfor Counties in MD vs VA (2022)")
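
The line that `lmplot` draws for each group is a least squares fit. As a rough sketch of what that means numerically, the slope and intercept can be computed per group with `np.polyfit`. The `toy` values below are made up and chosen to lie exactly on a line, so they say nothing about the real Maryland and Virginia data.

```python
import numpy as np
import pandas as pd

# Made-up values lying exactly on a line within each state
toy = pd.DataFrame({
    'state': ['Maryland'] * 3 + ['Virginia'] * 3,
    'percent_bachelors': [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    'mean_income': [60000.0, 80000.0, 100000.0, 55000.0, 65000.0, 75000.0],
})

fits = {}
for state, group in toy.groupby('state'):
    # A degree-1 polynomial fit returns (slope, intercept)
    slope, intercept = np.polyfit(group.percent_bachelors, group.mean_income, 1)
    fits[state] = (slope, intercept)
    print(state, round(slope, 2), round(intercept, 2))
```

Comparing the slopes across groups is one way to put a number on the difference that the two trend lines show visually.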

<font color ='red'>**Question 4: Look at the relationship between mean income and percent of people with a graduate degree for counties in Maryland and Virginia. Does it look like there are any differences between Maryland and Virginia?**</font>