# Customizing Visualizations

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Adjustments to Visualizations

Many times, we want to add more customizations to a visualization than just adding a title or labeling axes. This might include additional text or lines to clarify relationships, or changing the color scheme to make certain parts of the graph more clear. These can add to the visual appeal of the graph, but also add to the clarity and successful delivery of information. 

In this notebook, we will discuss some techniques to add features to graphs and make them look nicer while keeping in mind the elements that are key for a successful visualization. 

Let's start by bringing in some data that we can make plots with.

In [None]:
with open('census-key.txt', 'r') as f:
    # strip() removes the trailing newline that readline() keeps
    census_key = f.readline().strip()

In [None]:
from acs_data import get_county_data
from acs_data import get_us_data

The `get_county_data` function (defined in `acs_data.py`) gets county-level data on characteristics like number of households, mean income, percent employed, percent with a bachelor's degree, and percent with a graduate degree. The `get_us_data` function does the same for the US as a whole (so there will be only one row). 

In [None]:
census_data = get_county_data(2022, census_key)
census_data.head()

In [None]:
us_data = get_us_data(2022, census_key)
us_data

## Styles

So far, we have just been using the default style for graphs. For example, if we create a quick graph, it might look like this.

In [None]:
census_data.mean_income.hist()

One easy way to change the overall look of the graph is by trying out different styles. For example, you could use the style modeled on the defaults of ggplot2, a widely used visualization package first developed for R.

In [None]:
plt.style.use('ggplot')
fig, axes = plt.subplots(figsize=(8,6))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")

Alternatively, you could use the style used by FiveThirtyEight (https://fivethirtyeight.com/about-us/), which became popular for its use of graphics to show polling results as well as successful election predictions.

In [None]:
plt.style.use('fivethirtyeight')

fig, axes = plt.subplots(figsize=(8,6))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")

More plotting styles are provided in the matplotlib documentation here: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
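As a minimal sketch of how to explore styles, `plt.style.available` lists the style names installed with your matplotlib version, and `plt.style.context` applies a style to a single figure without changing the style for the rest of the notebook. The data below is synthetic and only stands in for a column like `census_data.mean_income`.

```python
import matplotlib.pyplot as plt
import numpy as np

# List the style names installed with this matplotlib version
available_styles = sorted(plt.style.available)
print(available_styles[:5])

# Synthetic stand-in for a column such as census_data.mean_income (assumption)
rng = np.random.default_rng(0)
values = rng.normal(loc=70000, scale=15000, size=500)

# Apply a style to this one figure only; the previous style is restored afterwards
with plt.style.context('ggplot'):
    fig, axes = plt.subplots(figsize=(8, 6))
    axes.hist(values, bins=20)
    axes.set_title("Histogram drawn with the ggplot style")
```

In contrast, `plt.style.use` changes the style for every figure created afterwards, which is what the cells above rely on.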

## Adding Lines

Suppose we wanted to show the overall US mean income on this histogram. We could do this by adding a line and a note about what that line represents. The `axvline` function adds a vertical line at the x value you provide, and its other arguments control what the line looks like. The `color` argument adjusts the color, while `ls` adjusts the line style. Since we want to distinguish the line from the bars on the graph, we use red to contrast with the blue and make it dashed instead of solid.

In [None]:
# .iloc[0] selects the first (and only) row by position, whatever the index labels are
us_mean_income = us_data.mean_income.iloc[0]

In [None]:
fig, axes = plt.subplots(figsize=(8,6))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")
plt.axvline(x = us_mean_income, color = 'red', ls = '--')
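One possible variation, sketched here with synthetic data rather than the census data, is to give the line a `label` so that a legend can explain what it represents. The `lw` argument (line width) is an extra choice made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic incomes standing in for census_data.mean_income (assumption)
rng = np.random.default_rng(1)
incomes = rng.normal(loc=70000, scale=15000, size=500)
overall_mean = incomes.mean()

fig, axes = plt.subplots(figsize=(8, 6))
axes.hist(incomes, bins=20)

# label= gives the line a legend entry; lw sets the line width
line = axes.axvline(x=overall_mean, color='red', ls='--', lw=2,
                    label='Overall mean')
axes.legend()
```

Calling `axvline` on the Axes object, as done here, draws on that specific plot, whereas `plt.axvline` draws on whichever plot is currently active.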

### Adding Annotations

The line might be helpful in identifying where the mean income for the US overall might be, but we can't tell what the value is exactly. In order to make it clear where that line is, we can add an annotation with the exact value. We use the `annotate` method to add the annotation on the Axes object. In the example below, we use f-strings to construct the exact text we want to put on the graph and specify the location of the text using the data coordinates. We want it a little bit offset from the actual mean so that the text isn't right on the line, so we add `5000` to the x-value of the location, then put it sufficiently high up so that it isn't running into any bars. 

Note that we use a slightly fancy f-string here as well. The `{us_mean_income:,}` means that it should insert the value in `us_mean_income` while using commas for every three digits, similar to how it might be shown when writing numbers in English. This makes it easier to read on the graph.

In [None]:
fig, axes = plt.subplots(figsize=(6,4))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")
plt.axvline(x = us_mean_income, color = 'red', ls = '--')
axes.annotate(f"Mean Income for US: ${us_mean_income:,}", 
              xy=(us_mean_income + 5000, 150), xycoords='data')
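The part after the colon in an f-string is a format specification, and it can control decimals as well as commas. A small sketch with made-up numbers (not values from the census data):

```python
# Made-up values for illustration only
whole = 98765
decimal = 98765.4321

with_commas = f"{whole:,}"        # thousands separators
rounded = f"{decimal:,.0f}"       # separators, no decimal places
two_places = f"${decimal:,.2f}"   # separators, two decimal places

print(with_commas)  # 98,765
print(rounded)      # 98,765
print(two_places)   # $98,765.43
```

If a mean value has many decimal places, a specification like `,.0f` keeps the annotation short.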

We could also have positioned the text using figure coordinates, given either in pixels or as a fraction of the full figure. For example, to start the text halfway across the figure and 80% of the way up, we can use `xycoords = 'figure fraction'` with `xy=(0.5, 0.8)`. The `(0,0)` point is the bottom left of the figure, while `(1,1)` is the top right. Getting the text in the right place might require a bit of fiddling with the values.

In [None]:
fig, axes = plt.subplots(figsize=(6,4))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")
plt.axvline(x = us_mean_income, color = 'red', ls = '--')
axes.annotate(f"Mean Income for US: ${us_mean_income:,}", 
              xy=(0.5, 0.8), xycoords='figure fraction')
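A related option is `xycoords='axes fraction'`, which measures the position relative to the plotting area instead of the whole figure, so the margins around the plot do not affect where the text lands. A minimal sketch with placeholder data:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(figsize=(6, 4))
axes.plot([0, 1, 2], [10, 20, 15])  # placeholder data

# (0.05, 0.9) is 5% from the left edge and 90% of the way up the plotting area
note = axes.annotate("Top-left of the axes",
                     xy=(0.05, 0.9), xycoords='axes fraction')
```

Because the position does not depend on the data values, this kind of annotation stays in the same place even if the axis limits change.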

We might also want to make it clearer that this line represents the mean. That is, we might want to draw an arrow indicating that the text describes that line. To do this, we can add arguments to the `annotate` method so that it draws an arrow and knows where the arrow should point. The `xy` argument indicates where on the graph the arrow should point, and the `xytext` argument indicates where on the graph the text should be. An arrow will then be drawn from the text to the point in `xy`.

In [None]:
fig, axes = plt.subplots(figsize=(6,4))
census_data.mean_income.hist(bins=20)
axes.set_xlabel("Mean Income")
axes.set_ylabel("Frequency")
axes.set_title("Mean Income for Counties in the US")
plt.axvline(x = us_mean_income, color = 'red', ls = '--')
axes.annotate(f"Mean Income for US: ${us_mean_income:,}", 
              xy=(us_mean_income, 150), xycoords='data',
              xytext = (us_mean_income + 20000, 140), textcoords = 'data',
              arrowprops=dict(facecolor='black', shrink=0.05),
              horizontalalignment='left', verticalalignment='top')
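The `xy` and `xytext` arguments do not have to use the same coordinate system. The sketch below, which uses synthetic data, points at the tallest histogram bar in data coordinates while placing the text in axes-fraction coordinates. It also uses `arrowstyle='->'` for a thinner arrow, a styling choice made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(2)
values = rng.normal(size=300)

fig, axes = plt.subplots(figsize=(6, 4))
counts, edges, _ = axes.hist(values, bins=15)

# Locate the tallest bin and the x value at its center
tallest = int(np.argmax(counts))
bar_center = (edges[tallest] + edges[tallest + 1]) / 2

# Arrow tip in data coordinates, text placed relative to the plotting area
note = axes.annotate("Tallest bin",
                     xy=(bar_center, counts[tallest]), xycoords='data',
                     xytext=(0.7, 0.9), textcoords='axes fraction',
                     arrowprops=dict(arrowstyle='->', color='black'))
```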

<font color ='red'>**Question 1: Consider the scatterplot shown below. Add an annotation to show the county name of the point with the highest percent of people with a bachelor's degree as well as the lowest percent of people with a bachelor's degree.**</font>

In [None]:
income_bachelors = census_data[['mean_income','percent_bachelors']]

fig, axes = plt.subplots(figsize=(8,6))
income_bachelors.plot.scatter(x = 'percent_bachelors', y = 'mean_income', ax = axes)
axes.set_xlabel("Percent of people with a bachelor's degree")
axes.set_ylabel("Mean Income")
axes.set_title("Mean Income by Percent Bachelors for Counties in the US")

## Adjusting Categories 

Sometimes, when you make a graph with a categorical variable, the categories appear in an order that doesn't make sense. This happens most often with **ordinal** variables, in which the values of the categorical variable are ordered in some way (for example, something like shirt size with small, medium, and large).

Let's take a look at some categorical variables from the Pulse of the Nation dataset.

In [None]:
data_file = '201807-CAH_PulseOfTheNation_Raw.csv'
potn = pd.read_csv(data_file)
potn.head()

In [None]:
potn.political_party.value_counts().plot.barh()

This graph orders the categories by their counts, which makes it hard to see the overall political spectrum because the Republican and Democrat categories are interleaved. To reorder the categories, we can convert the `political_party` values to a `pd.Categorical` type with our own ordering of the categories. Counting and plotting the values in category order then enforces that ordering in the graph.



In [None]:
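pol_parties = ['Strong Democrat', 'Not Very strong Democrat', 'Independent', 
               'Not very Strong Republican', 'Strong Republican',
               'DK/REF']
# Keep the result: pd.Categorical returns a new object and does not modify potn
party_cat = pd.Series(pd.Categorical(potn.political_party, categories = pol_parties))
# sort_index() puts the counts in category order rather than frequency order
party_cat.value_counts().sort_index().plot.barh()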
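To see how a categorical ordering behaves on its own, here is a small sketch using the shirt-size example with made-up data. It also shows one thing to watch for: any value that does not exactly match one of the listed categories (including capitalization) becomes missing.

```python
import pandas as pd

# Made-up data for illustration only
sizes = pd.Series(['medium', 'small', 'large', 'medium', 'small', 'medium'])
size_order = ['small', 'medium', 'large']

sizes_cat = pd.Series(pd.Categorical(sizes, categories=size_order, ordered=True))

by_count = sizes_cat.value_counts()                   # most frequent first
by_category = sizes_cat.value_counts().sort_index()   # small, medium, large
print(list(by_category.index))

# Values that do not match a category exactly are turned into missing values
unmatched = pd.Categorical(['small', 'Small'], categories=size_order)
print(unmatched.isna().sum())
```

Checking for unexpected missing values after the conversion is a quick way to catch a misspelled category name.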


<font color ='red'>**Question 2: Create a bar graph of the education level for the Pulse of the Nation dataset. Make sure it is in a reasonable order.**</font>

## Using colors

The plotting defaults typically provide colors that work well for a given graph. However, sometimes, you might want to adjust these colors to better represent the data. This is often pertinent when using an ordinal variable, especially for a variable like political party, where a standard color is associated with the groups (blue for Democrats, red for Republicans). The default colors might be misleading in these cases, so it would be good to set our own colors. Let's take a look at an example by comparing political party by gender in the Pulse of the Nation dataset.

Recall that we used `pd.crosstab` to look at comparisons of two categorical variables, with the `normalize` argument allowing us to get proportions rather than raw numbers.

In [None]:
party_by_gender = pd.crosstab(potn.gender,potn.political_party, normalize='index')
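As a quick illustration of what `normalize='index'` does, the sketch below uses a tiny made-up DataFrame (the column names `group` and `answer` are hypothetical) and compares raw counts with row proportions.

```python
import pandas as pd

# Tiny made-up dataset; the column names are hypothetical
toy = pd.DataFrame({
    'group':  ['a', 'a', 'a', 'b', 'b'],
    'answer': ['yes', 'no', 'yes', 'no', 'no'],
})

counts = pd.crosstab(toy.group, toy.answer)                      # raw counts
shares = pd.crosstab(toy.group, toy.answer, normalize='index')   # row proportions

print(counts)
print(shares)
print(shares.sum(axis=1))  # each row of proportions adds up to 1
```

Using `normalize='columns'` instead would make each column sum to 1, which answers a different question about the same table.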

We will create a stacked bar chart so that we can compare the proportions across genders. Note that since we used `normalize='index'`, the proportions in each row sum to 1, so every stacked bar should have a total length of 1.

In [None]:
fig, axes = plt.subplots(figsize=(8,6))
party_by_gender.plot.barh(stacked = True, ax = axes)

There are a lot of things that we need to fix with this graph! First, as mentioned before, the colors are very confusing. Blue does not correspond to Democrats, nor does the color that is closest to red correspond to Republicans. In addition, we lost the ordering of the political parties when we used the `pd.crosstab` function. Finally, the legend is covering up part of the graph, making this harder to read.

Let's address each of these one by one. First, we will make sure that the order of the political parties matches what we used before, so that Independent sits between the Democrats and Republicans, with DK/REF separated out. Then, we'll move the legend by using the `.legend()` method, with the `bbox_to_anchor` argument letting us adjust its location. The `ncol = 3` argument splits the legend entries into three columns so that they are laid out more horizontally, saving space.

In [None]:
party_by_gender = party_by_gender.loc[:,pol_parties]

fig, axes = plt.subplots(figsize=(8,6))
party_by_gender.plot.barh(stacked = True, ax = axes)
axes.legend(loc='lower center', bbox_to_anchor=(0.5, -0.3), ncol = 3)

We still have an issue with the colors. We'd like to adjust these to better represent the parties. To do this, we can use the `color` argument within `barh` to assign a color to each of the six categories, in the same order as the columns. Colors can be specified in many ways, such as hex RGB values or existing named colors. Here, we simply use the names, but you can check the matplotlib color section to see other ways of specifying colors here: https://matplotlib.org/stable/gallery/color/color_demo.html. 

A list of named colors can be found here: https://matplotlib.org/stable/gallery/color/named_colors.html.

In [None]:
fig, axes = plt.subplots(figsize=(8,6))
party_by_gender.plot.barh(stacked = True, color = ['royalblue','skyblue', 'plum',
                                                    'orangered','crimson','gray'],
                         ax = axes)
axes.legend(loc='lower center', bbox_to_anchor=(0.5, -0.3), ncol = 3)
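Because the colors in the list are matched to the columns by position, reordering the columns without reordering the colors would mismatch them. One way to guard against this, sketched below with made-up proportions and simplified category names, is to keep the colors in a dictionary keyed by category and build the list from the column order.

```python
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import pandas as pd

# Made-up proportions and simplified category names, for illustration only
shares = pd.DataFrame({'Democrat': [0.4, 0.5],
                       'Independent': [0.3, 0.2],
                       'Republican': [0.3, 0.3]},
                      index=['Group A', 'Group B'])

# Tie each category to a color by name instead of by position
party_colors = {'Democrat': 'royalblue',
                'Independent': 'plum',
                'Republican': 'crimson'}

# Build the color list in whatever order the columns currently have
color_list = [party_colors[col] for col in shares.columns]

# to_hex converts a named color into its hex RGB string
hex_list = [mcolors.to_hex(c) for c in color_list]
print(hex_list)

fig, axes = plt.subplots(figsize=(8, 6))
shares.plot.barh(stacked=True, color=color_list, ax=axes)
```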


We could also have used existing colormaps from matplotlib to get colors from a range. For example, we might want to use the `brg` (blue-red-green) colormap, taking some colors from the blue-to-red part of the scale and setting "DK/REF" to green to separate it from the political parties.

For a list of colormaps, see https://matplotlib.org/stable/gallery/color/colormap_reference.html

In [None]:
import matplotlib as mpl
cmap = mpl.colormaps['brg']
cmap

In [None]:
colors = cmap(np.linspace(0, 1, 11))
colors

In [None]:
fig, axes = plt.subplots(figsize=(8,6))
party_by_gender.plot.barh(stacked = True, color = colors[[0,1,2,3,4,10]],
                         ax = axes)
axes.legend(loc='lower center', bbox_to_anchor=(0.5, -0.3), ncol = 3)
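An alternative sketch, which is not what the cell above does, samples a diverging colormap such as `coolwarm` for the five party categories and appends a neutral gray for "DK/REF". The choice of `coolwarm` and gray is an assumption made for this example.

```python
import matplotlib as mpl
import matplotlib.colors as mcolors
import numpy as np

# coolwarm runs from blue through a light neutral tone to red
party_cmap = mpl.colormaps['coolwarm']

# Five evenly spaced colors, one per party category (each row is RGBA)
party_colors = party_cmap(np.linspace(0, 1, 5))

# Append a neutral gray for the "DK/REF" category
dk_color = np.array([mcolors.to_rgba('gray')])
all_colors = np.vstack([party_colors, dk_color])
print(all_colors.shape)
```

The resulting array can be passed to the `color` argument of `barh` in the same way as `colors[[0,1,2,3,4,10]]`.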

<font color ='red'>**Question 3: Create a visualization that looks at the relationship between the `biz_regulations` variable and the `political_party` variable. Make sure you adjust colors and reorder variables as appropriate. Does it look like there is a relationship?**</font>

In [None]:
# Rows are biz_regulations answers, columns are political parties
biz_by_party = pd.crosstab(potn.biz_regulations, potn.political_party, normalize='index')
biz_by_party

