# More Visualization Packages

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Other Plotting Libraries

While `matplotlib` can be very useful for creating graphs, it is easy to get bogged down in the intricacies of customizing every element. In this section, we introduce a few other plotting libraries that you can use to make graphs. We only show a few examples of simple histograms and boxplots here, because there are far too many possible visualizations to go over every little detail. Instead, these examples are meant to show a bit of the syntax and style of the graphs that are produced so that you can learn more about them on your own if you'd like.

First, let's bring in some data to work with.

In [None]:
with open('census-key.txt', 'r') as f:
    census_key = f.readline().strip()  # strip the trailing newline so only the key itself is kept

In [None]:
from acs_data import get_county_data
from acs_data import get_us_data

In [None]:
census_data = get_county_data(2022, census_key)
census_data.head()

In [None]:
md_va = census_data[(census_data.state == 'Maryland') | (census_data.state == 'Virginia')]
md_va.head()
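
As a side note on the filtering step above, the same kind of subset can be written with `isin`, which is easier to extend to a longer list of states. The sketch below uses a small made-up DataFrame (the `toy` name and its values are hypothetical, not the real `census_data`) to show that the two filters select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for census_data; the values are made up for illustration
toy = pd.DataFrame({
    "state": ["Maryland", "Virginia", "Delaware", "Maryland"],
    "mean_income": [95000, 88000, 79000, 102000],
})

# Boolean-OR filter, written the same way as the md_va cell above
either = toy[(toy.state == "Maryland") | (toy.state == "Virginia")]

# Equivalent filter using isin with a list of the states to keep
with_isin = toy[toy["state"].isin(["Maryland", "Virginia"])]

print(with_isin)
```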

In [None]:
data_file = '201807-CAH_PulseOfTheNation_Raw.csv'
potn = pd.read_csv(data_file)
potn.head()

## Seaborn

Seaborn is a package built on top of `matplotlib` that is meant to make some of the difficult tasks easier to do. It is built with pandas DataFrames in mind, with plotting functions that perform the mapping and aggregation necessary to intuitively produce informative plots. One nice thing about seaborn is that it makes using DataFrames for plots more intuitive and reduces a lot of preprocessing work that we might otherwise have to do with `matplotlib`.

Let's take a look at an example of how the seaborn syntax works. A basic histogram can be made using the `sns.displot` function, passing the DataFrame as the `data` argument, then specifying the `x` variable and any other arguments to adjust the plot as necessary.

In [None]:
sns.displot(data = potn, x = 'age')

This is (hopefully) a bit more intuitive than previous methods of graphing that we have used. Instead of needing to isolate the data from the DataFrame to create the histogram, we can simply use the whole DataFrame and specify the exact column that we want within the function.

The `displot` function is meant to be used to show distributions in general (hence the name). The default is a histogram, but you can also create different types of distribution plots, such as a density plot.

In [None]:
sns.displot(data = potn, x = 'age', kind = 'kde')

In seaborn, you can use figure-level functions (`relplot`, `displot`, or `catplot`) or axes-level functions (see subcategories below). 

<img src="seabornplots.png" width="600">

The figure-level functions allow for creating facets of the data more easily, while the axes-level functions create the graphs directly. For example, in order to create a histogram, instead of using `displot`, we could have used `histplot`.

In [None]:
sns.histplot(data = potn, x = 'age')

We can also set the axis labels and title for the histogram by using the `set` method. We also manually set which values are shown on the x-axis ticks using `plt.xticks`. This is helpful whenever the default produces so many x-tick labels that they run into each other and become hard to read.

In [None]:
g = sns.displot(data = potn, x = 'age')
g.set(xlabel = 'Age', title = 'Age for Pulse of the Nation Respondents')
plt.xticks([20, 40, 60, 80, 100])

<font color ='red'>**Question 1: Using seaborn and `census_data`, create a histogram of mean income for all counties in the US in 2022.**</font>

### Comparisons with Seaborn

Seaborn makes it easy to compare distributions across groups because it does any necessary aggregation and computation behind the scenes. For example, let's say we want to look at the age distribution split up by political leaning. Rather than needing to create different subsets and graph them all separately, we can just include `hue = 'political_leaning'` as an argument to create different colors for each of the groups in `political_leaning`.

In [None]:
sns.displot(data = potn, x = 'age', hue = 'political_leaning')

By default, this layers the histograms on top of each other, so it's hard to see the groups that sit behind the others. Let's use a density plot instead so that we get a separate line for each category of `political_leaning`.

In [None]:
sns.displot(data = potn, x = 'age', hue = 'political_leaning', kind = 'kde')

<font color ='red'>**Question 2: Using seaborn and `md_va`, create a visualization of mean income for counties in Maryland and Virginia split by state. Does it look like there is a difference in the distributions between Maryland and Virginia?**</font>

### Comparing with boxplots

Density plots can help differentiate distributions according to a categorical variable, but boxplots are actually much better at doing this because they can be shown side by side and split up by category much more easily. Fortunately for us, `seaborn` makes it easy for us to not only split up and aggregate data by categories, but also re-order categories to better compare them. 

Let's take a look at an example comparing age of respondents by political party.

In [None]:
sns.boxplot(data = potn, x = 'age', y = 'political_party')

In order to reorder categories, we can use the `order` argument in the `boxplot` function. 

In [None]:
pol_parties = ['Strong Democrat', 'Not Very strong Democrat', 'Independent', 
               'Not very Strong Republican','Strong Republican', 'DK/REF']
sns.boxplot(data = potn, x = 'age', y = 'political_party', order = pol_parties)
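
The `order` list does not have to be typed by hand. As a rough sketch on made-up data (the `toy` DataFrame and its `party` labels are hypothetical), the categories can be sorted by a summary statistic such as the median and the result passed to `order`.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data with three made-up groups
toy = pd.DataFrame({
    "age": [25, 31, 38, 52, 60, 66, 40, 44, 47],
    "party": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Median age per group, sorted from lowest to highest
by_median = toy.groupby("party")["age"].median().sort_values().index.tolist()
print(by_median)  # ['A', 'C', 'B']

# Boxes are drawn in the order given by the list
ax = sns.boxplot(data=toy, x="age", y="party", order=by_median)
```

Sorting by median makes sense when the categories have no natural order; for an ordered scale like party strength, a hand-written list as in the cell above is the better choice.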

We can actually add even more customization by adding a `hue` argument. This separates the data even more by another categorical variable. Here, we are using the `gender` variable to add another view of the data. 

Note that the code below uses `catplot` instead with the argument of `kind = 'box'` added on. This is the figure-level function and does pretty much the same as the boxplots above, but by default the legend is added outside the graph. Try doing the same with `sns.boxplot` to see how it looks.

In [None]:
pol_parties = ['Strong Democrat', 'Not Very strong Democrat', 'Independent', 
               'Not very Strong Republican','Strong Republican', 'DK/REF']
sns.catplot(data = potn, x = 'age', y = 'political_party', hue = 'gender', order = pol_parties, kind = 'box')

<font color ='red'>**Question 3: Using seaborn and `md_va`, create a visualization that compares the boxplot of mean income for counties in Maryland and in Virginia.**</font>

## Facets with Seaborn

One thing that seaborn makes much easier is creating facets of the data so that you can compare relationships or distributions for different groups. This is done using `relplot`, `displot`, or `catplot`, as shown in the figure above. For example, let's say we want to look at the relationship between mean income and percent of people with a bachelor's degree in counties in Maryland vs. counties in Virginia. We do this using `relplot` and specifying the `x` and `y` variables, as well as specifying that we want columns by the `state` variable.

In [None]:
sns.relplot(data=md_va, x="percent_bachelors", y="mean_income", col="state")



This is much simpler than having to define subplots and graph each one separately! We can also add other dimensions to the plot, such as scaling the size of the points by the number of households in that county, so that we can see which counties have more households.

In [None]:
sns.relplot(
    data=md_va,
    x="percent_bachelors", y="mean_income", col="state",
    size="num_households"
)


Finally, let's clean this up a bit more by adding better axis labels and an overall title. Note that we need to do a bit more work to add a title over the whole figure because we used a figure-level function. The `.figure` attribute allows us to access the Figure object underlying the graph and add an overall title.

In [None]:
g = sns.relplot(
    data=md_va,
    x="percent_bachelors", y="mean_income", col="state",
    size="num_households"
)

# Set axis labels
g.set_axis_labels("Percent Bachelor's", "Mean Income")

# Add overall Title
g.figure.subplots_adjust(top=.85)
g.figure.suptitle("Percent Bachelor's by Mean Income for MD and VA")
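
The titles above each panel can be adjusted as well. The sketch below uses a small made-up DataFrame (`toy` and its column names are hypothetical) and the `set_titles` method of the grid, whose template can refer to `{col_name}`, so each panel shows just the state name rather than the default `state = ...` text.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for md_va
toy = pd.DataFrame({
    "pct": [20, 35, 50, 25, 40, 55],
    "income": [60, 80, 110, 55, 75, 95],
    "state": ["Maryland"] * 3 + ["Virginia"] * 3,
})

g = sns.relplot(data=toy, x="pct", y="income", col="state")

# Panel titles: show only the value of the column variable
g.set_titles("{col_name}")

# Shared axis labels and an overall title, as in the cell above
g.set_axis_labels("Percent Bachelor's", "Mean Income (thousands)")
g.figure.subplots_adjust(top=0.85)
g.figure.suptitle("Toy example of a faceted scatterplot")
```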



Alternatively, we could have simply used a different point style and color to indicate the difference between Maryland and Virginia.

In [None]:
g = sns.scatterplot(
    data = md_va,
    x = "percent_bachelors", y = "mean_income", 
    hue = "state", style = 'state'
)
g.set(xlabel = "Percent Bachelor's", ylabel = "Mean Income", 
      title = "Mean Income by Percent Bachelor's for Counties in MD vs VA (2022)")

If we want to add trend lines to these plots so that we can see what the relationships look like for Maryland vs. Virginia, we can instead use `lmplot`. This creates a scatterplot as well, but also adds the least squares regression line for each group, along with a bootstrap confidence interval around each line.

In [None]:
g = sns.lmplot(
    data = md_va,
    x = "percent_bachelors", y = "mean_income",
    hue = "state"
)
g.set(xlabel = "Percent Bachelor's", ylabel = "Mean Income", 
      title = "Mean Income by Percent Bachelor's \nfor Counties in MD vs VA (2022)")
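
To give a sense of what a least squares line with a bootstrap interval involves, here is a simplified sketch using NumPy on simulated data. It is not seaborn's internal code: it bootstraps only the slope (seaborn shades a band around the whole line), and the data, the seed, and the 1000 resamples are all assumptions chosen for illustration.

```python
import numpy as np

# Simulated data: a linear relationship with slope 1 plus random noise
rng = np.random.default_rng(42)
x = rng.uniform(10, 60, size=50)
y = 40 + 1.0 * x + rng.normal(0, 5, size=50)

# Least squares fit of a straight line (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, deg=1)

# Bootstrap: refit the line on resampled (x, y) pairs, keeping each slope
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))  # sample rows with replacement
    b_slope, _b_intercept = np.polyfit(x[idx], y[idx], deg=1)
    boot_slopes.append(b_slope)

# Middle 95% of the bootstrap slopes gives a percentile interval
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope:.2f}, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")
```

A wide interval means the data are compatible with a broad range of slopes, which is what a wide shaded band in an `lmplot` conveys visually.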

<font color ='red'>**Question 4: Look at the relationship between mean income and percent of people with a graduate degree for counties in Maryland and Virginia. Does it look like there are any differences between Maryland and Virginia?**</font>

## Styles and More Information on Seaborn

So far, we have just been using the default seaborn style for graphs. One easy way to change the overall look of the graph is by trying out different styles. For more information about different seaborn styles, see https://seaborn.pydata.org/tutorial/aesthetics.html.

Seaborn offers a lot of tools for building and customizing visualizations. For more information on how to use seaborn, you can check out the help pages and see the examples they have provided at https://seaborn.pydata.org/tutorial.html.

## Using ggplot

Another popular visualization package is ggplot. This is a visualization toolset that has become extremely popular in R and has been ported over to Python; the port imported below is the `plotnine` package. It uses the same grammar of graphics syntax as the R version and provides many customization options.

We won't discuss ggplot here in much detail, but you can feel free to use it if you have experience with ggplot in R already.

In [None]:
from plotnine import *

In [None]:
(ggplot(md_va, aes(x = 'percent_bachelors', y= 'mean_income', color = 'state')) + geom_point() + 
 labs(title = 'MD vs VA') + xlab('Percent Bachelors'))