# Data Visualization
By Maryah Garner

## Table of Contents
* [Choosing a Data Visualization Package](#Choosing_Package)
* [Setup - Load Python Packages](#setup)
* [Read in Projects Data from 2012-2021](#read)
* [Checkpoint 1](#cp1) 
* [Visualizations](#visualizations)  
* [Barplots](#barplots)
* [Presentation Ready Figures](#presentation)    
* [Checkpoint 2](#cp2)
* [Percentage of Core Cancer projects over time](#cancer_pi)
* [Stacked Barplot](#stacked)   
* [Geopandas](#geopandas)
* [Heat Map of the United States ](#hm)
* [More visuals](#More)
    * [Lineplots](#lineplots)
    * [Histogram](#Histagram)
    * [Layering in Matplotlib](#Layering)
* [Checkpoint 3](#cp3)
* [More Resources](#Resources)
* [Other Python Visualization Libraries](#Other)

## Choosing a Data Visualization Package <a class="anchor" id="Choosing_Package"></a>

There are many excellent data visualization modules available in Python, but for the tutorial we will stick to the tried and true combination of **matplotlib** and **seaborn**. You can read more about different options for data visualization in Python in the [More Resources](#Resources) section at the bottom of this notebook. 

### Matplotlib
**matplotlib** is very expressive, meaning it has functionality that can easily account for fine-tuned graph creation and adjustment. However, this also means that **matplotlib** is somewhat more complex to code. The basic steps to create graphs with this package are:
1. Prepare your data
2. Create the plot
3. Plot the plot
4. Customize plot
5. Save and show plot

More info can be found here: https://matplotlib.org/users/beginner.html
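
The five steps listed above can be sketched in a few lines. This is only a minimal illustration: the numbers are made up, and the file name `five_steps_example.png` and the temporary output folder are assumptions chosen for this example, not part of the course data.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# 1. Prepare your data (made-up values, for illustration only)
years = [2019, 2020, 2021]
project_counts = [10, 12, 9]

# 2. Create the plot: a figure with a single set of axes
fig, ax = plt.subplots(figsize=(6, 4))

# 3. Plot the data onto the axes
bars = ax.bar(years, project_counts, color="steelblue")

# 4. Customize the plot
ax.set_xlabel("Fiscal Year")
ax.set_ylabel("Number of Projects")
ax.set_title("Example: Projects per Fiscal Year")
ax.set_xticks(years)

# 5. Save and show the plot (hypothetical file name in a temporary folder)
out_file = os.path.join(tempfile.gettempdir(), "five_steps_example.png")
fig.savefig(out_file, bbox_inches="tight")
plt.show()
```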

### Seaborn
**seaborn** is a higher-level visualization module, which means it is much less expressive and flexible than matplotlib, but far more concise and easier to code. On top of what matplotlib offers, this package makes it easier to:
1. Use default themes that are aesthetically pleasing
2. Set custom color palettes
3. Make attractive statistical plots
4. Easily and flexibly display distributions
5. Visualize information from matrices and DataFrames

You can see it as a complement, not a substitute, for Matplotlib. There are some tweaks that still require Matplotlib.
More info can be found here: http://seaborn.pydata.org/api.html#api-ref

It may seem like we need to choose between these two approaches, but this is not the case! Since **seaborn** is itself built on top of **matplotlib** (you will sometimes see **seaborn** called a **matplotlib** 'wrapper'), we can use **seaborn** for making graphs quickly and then **matplotlib** for specific adjustments. When you see `plt` referenced in the code below, we are using **matplotlib's** pyplot submodule.


**seaborn** also improves on **matplotlib** in important ways: it makes it easier to visualize regression model results and to create small multiples, and it offers better color palettes and improved default aesthetics. From [**seaborn**'s documentation](https://seaborn.pydata.org/introduction.html):

> If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too. 

## Setup - Load Python Packages <a class="anchor" id="setup"></a>

### Installation of Packages
The environment has the most commonly used packages installed, so you can import them directly. Other packages might not be installed, so we need to install them before we can import them. In this notebook we will be using the geopandas and us packages, which are not pre-installed. We can use the pip install command to install them. On your home computer you only have to do this once. Because our environment is only active for the current session, we have to do this every time we open the binder.

In [None]:
# Use pip to install packages
# We will use the geopandas package, along with the us package, to make a beautiful heat map of the US
%pip install geopandas
%pip install us

In [None]:
import pandas as pd
import numpy as np
import os
import glob
#from plotnine import *

import matplotlib as mplib
import matplotlib.pyplot as plt # visualization package 1
import seaborn as sns           # visualization package 2


import geopandas          #  geospatial data package 1
import us                 #  geospatial data package 2

# so images get plotted in the notebook
%matplotlib inline

## Read in Projects Data from 2012-2021  <a class="anchor" id="read"></a>

In [None]:
# Specify a path with the data folder
# Change "NAME" to your name as recorded on your computer
# path = 'C:/Users/NAME/PADM-GP_2505/Data/'
Path = "---/PADM-GP_2505/Data"


#### Set working directory
We will change the working directory to the Projects folder. We need to do this so we can read in all of the projects files at once.

In [None]:
# Use the chdir function from the os package to set your working directory
os.chdir(Path + "/Projects")

In [None]:
# Generate an empty dataframe that will hold all the projects data we have
all_projects = pd.DataFrame([])

# Now loop through each file in the folder that starts with RePORTER
# Read that file using only the columns we need
# And append it to the dataframe that we created above
# This might take a little while to run (no more than 1 minute)
for counter, file in enumerate(glob.glob("RePORTER*?")):
    print(counter,file)
    projects = pd.read_csv(file, usecols=['APPLICATION_ID','TOTAL_COST','CORE_PROJECT_NUM','FULL_PROJECT_NUM', 'FY', 'IC_NAME', 
                                          'ORG_NAME', 'ORG_STATE','PI_NAMEs','PROJECT_START',
                                          'PROJECT_END','PROJECT_TITLE','NIH_SPENDING_CATS'], 
                           encoding='latin-1')
    projects['ORG_STATE'] = projects['ORG_STATE'].astype(str)
    projects['TOTAL_COST'] = projects['TOTAL_COST'].astype(str)
    # DataFrame.append was removed in pandas 2.0, so stack the dataframes with pd.concat
    all_projects = pd.concat([all_projects, projects])
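
The cell above grows `all_projects` inside the loop. A common alternative is to collect each dataframe in a list and stack them once at the end. The sketch below tries that pattern on two tiny in-memory "files"; the file names and values are made up for illustration and stand in for the real RePORTER CSVs.

```python
import io

import pandas as pd

# Two small in-memory "files" stand in for the yearly RePORTER CSVs
# (hypothetical names and values, for illustration only)
fake_files = {
    "RePORTER_2012.csv": "APPLICATION_ID,FY,ORG_STATE\n1,2012,NY\n2,2012,CA\n",
    "RePORTER_2013.csv": "APPLICATION_ID,FY,ORG_STATE\n3,2013,TX\n",
}

# Read each file into its own dataframe and collect the dataframes in a list
frames = []
for name, contents in fake_files.items():
    df = pd.read_csv(io.StringIO(contents),
                     usecols=["APPLICATION_ID", "FY", "ORG_STATE"])
    frames.append(df)

# Stack all of the dataframes on top of each other in one step
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With `ignore_index=True` the combined dataframe gets a fresh 0, 1, 2, ... index, so there is no need to reset it afterwards.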

In [None]:
# View the first 5 observations 
all_projects.head()

#### Subset for Cancer Projects
In this notebook we want to focus on Cancer research. In the previous notebooks we have focused on projects funded by the `NATIONAL CANCER INSTITUTE`. In this notebook we will select all projects that have the word Cancer in `NIH_SPENDING_CATS`. 

In [None]:
# Subsetting by a keyword in a string using str.contains()
# Show all projects which contain word "Cancer" in the NIH_SPENDING_CATS variable
cancer_projects = all_projects[all_projects['NIH_SPENDING_CATS'].str.contains('Cancer', na = False)]

# Reset index
cancer_projects = cancer_projects.reset_index()

# view the first 5 observations
cancer_projects.head()
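
To see what `str.contains()` and `na = False` do, here is a small sketch on a toy dataframe. The rows are made up for illustration; only the column names match the projects data.

```python
import pandas as pd

# Toy data (hypothetical values); one project has no spending category
toy = pd.DataFrame({
    "APPLICATION_ID": [1, 2, 3, 4],
    "NIH_SPENDING_CATS": ["Cancer; Genetics", "Aging", None,
                          "Breast Cancer; Prevention"],
})

# na=False treats a missing category as "does not contain the keyword"
is_cancer = toy["NIH_SPENDING_CATS"].str.contains("Cancer", na=False)

# Keep only the rows where the keyword was found
toy_cancer = toy[is_cancer].reset_index(drop=True)
print(toy_cancer)
```

The match is case-sensitive by default, so a category written as "cancer" would be missed; passing `case=False` to `str.contains()` is one way to relax that.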

# Checkpoint 1: Subset data for your project  <a class="anchor" id="cp1"></a>

Make a dataframe called Project_data that is a subset of the all_projects dataframe where NIH_SPENDING_CATS is equal to a specific spending category, or where the string contains a specific keyword (as we did above with Cancer). (1 point)

Use the FY variable to subset your Project_data dataframe to the fiscal years you would like to include in the analysis for your project. (Please keep this dataframe named Project_data). (1 point) 

# Visualizations  <a class="anchor" id="visualizations"></a>
Before generating the graph you first want to think about the information you are trying to convey and what your visualization should look like. It might help to draw a sketch on paper first. Once you know what type of graph is best suited to illustrate what you want to show, you will need to think about how to prepare the data you need for the graph. For your presentations, you should only include figures that serve a purpose, and convey important information to your audience.

## Barplots  <a class="anchor" id="barplots"></a>

In [None]:
# calculate how many Cancer Projects each IC has
IC = cancer_projects.groupby(['IC_NAME'])['CORE_PROJECT_NUM'].nunique().sort_values(ascending=False)

# Convert into a dataframe and reset index
IC = IC.to_frame().reset_index()

# Rename CORE_PROJECT_NUM to Total_Cancer_Projects
IC.rename(columns={'CORE_PROJECT_NUM':'Total_Cancer_Projects'}, inplace = True)

# View the first 5 observations 
IC.head()

We can see that the NATIONAL CANCER INSTITUTE sponsors the most Cancer projects by far, but we can create a visualization to process this information more easily. This visualization will be used for data exploration, so we will not take time to make it pretty.

In [None]:
## Barplot function
# Note we can reference column names (in quotes) in the specified data:
sns.barplot(x='IC_NAME', 
            y='Total_Cancer_Projects', 
            data = IC)

# Rotate the IC names on the x-axis 90 degrees
plt.xticks(rotation=90, ha='right')
plt.show()

From this visual, we can see that only looking at projects sponsored by the NCI will cover most Cancer projects, but we might not want to leave out those other projects. In this notebook we will not be subsetting for NCI projects.

This visual is just used to help us understand our data, so we will not waste time on perfecting it. 

### Number of Core Cancer Projects Annually 
Next we want to see how the number of Cancer projects changes over the years. 
We will begin with preparing the data, then we will make a quick and easy graph. After that, we will update the figure to get it presentation ready.

#### Preparing the Annual Cancer data

In [None]:
# calculate how many Cancer Core Projects each fiscal year
Cancer_Annual = cancer_projects.groupby(['FY'])['CORE_PROJECT_NUM'].nunique()

# Convert into a dataframe and reset index
Cancer_Annual = Cancer_Annual.to_frame().reset_index()

# Rename CORE_PROJECT_NUM to Cancer_Core_Projects
Cancer_Annual.rename(columns={'CORE_PROJECT_NUM':'Cancer_Core_Projects'}, inplace = True)

# View the first 5 observations 
Cancer_Annual.head()
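
The same count, convert, and rename pattern is used several times in this notebook. The sketch below runs it on a toy dataframe so you can check the result by hand; the institute abbreviations and project numbers are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): project R01A appears twice under NCI
toy = pd.DataFrame({
    "IC_NAME": ["NCI", "NCI", "NCI", "NIA", "NIA"],
    "CORE_PROJECT_NUM": ["R01A", "R01A", "R01B", "R01C", "R01C"],
})

counts = (
    toy.groupby("IC_NAME")["CORE_PROJECT_NUM"]
    .nunique()                     # count each core project once per group
    .sort_values(ascending=False)  # largest group first
    .to_frame()                    # Series -> DataFrame
    .reset_index()                 # move IC_NAME from the index to a column
    .rename(columns={"CORE_PROJECT_NUM": "Total_Projects"})
)
print(counts)
```

Because `nunique()` counts distinct values, a core project that shows up in several rows of the same group is only counted once.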

#### Make a quick and easy barplot

In [None]:
## Barplot function
# Note we can reference column names (in quotes) in the specified data:
sns.barplot(x = 'FY', 
            y = 'Cancer_Core_Projects', 
            data = Cancer_Annual)
plt.show()

## Presentation Ready Figures <a class="anchor" id="presentation"></a>

### An Important Note on Graph Titles:

The title of a visualization occupies the most valuable real estate on the page. If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a **narrative** for your chart, highlighting the **important takeaway** of the figure. Many novice visualizers default to an **explanatory** title, something like: "Average Projects over Time (2012-2021)". This title is correct; it just isn't very useful. This is particularly true since any good graph will have explained what the visualization is through the axes and legends. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question "Why is this graph important?" and focus the viewer on the most critical takeaway.

### A Note on Data Sourcing

Data sourcing is a critical aspect of any data visualization. Although here we are simply referencing the agencies that created the data, it is ideal to provide as direct a path as possible for the viewer to find the data the graph is based on. When this is not possible (e.g. the data is sequestered), directing the viewer to documentation or methodology for the data is a good alternative. Regardless, providing clear sourcing for the underlying data is an **absolute requirement** of any respectable visualization, and further builds trust and enables reproducibility.

### A Note on Color Selection
The colors used in your visualization should convey meaning. Using different colors without purpose can lead to unnecessary confusion, because your audience will start looking for meaning in the colors. 


### Getting the barplot Presentation ready <a class="anchor" id="barplot_presentation"></a>
You may want to include a figure like this in your presentation. As is, this figure is inadequate for conveying information to your audience.
- First, we want to change the color
- Second, we will change the x- and y-axis labels
- Next, we will give the figure a title that tells the audience what you are trying to convey with the figure
- Next, we will add a black line to indicate what year the Cancer Moonshot began.
- Finally, we will add a source to the data

In [None]:
## Barplot function
# Note we can reference column names (in quotes) in the specified data:
# Change the color to purple using a hex code
yearly_projects = sns.barplot(x = 'FY', 
                             y = 'Cancer_Core_Projects', 
                             data = Cancer_Annual, 
                             color='#7851A9')
# Change the x-axis label
yearly_projects.set(xlabel ="Fiscal Year", 
# Change the y-axis label
                    ylabel = "Number of Cancer Core Projects",
# Change the title
                    title = "The Number of Funded Cancer Projects was Decreasing Before the Cancer Moonshot and Increasing After")

# Add a vertical line at 2016 (the fifth bar, which sits at position 4)
plt.vlines(x = 4, ymin = 0, ymax = 11000, color = 'black', linewidth=5)

# add a data source 
# xy are measured as a fraction of the axes length, from bottom left of graph:
plt.annotate('Source: NIH RePORTER', xy=(0.95,-0.20), xycoords="axes fraction")

plt.show()
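
The vertical line above is drawn at `x = 4` because the bars in a categorical barplot sit at positions 0, 1, 2, and so on, which makes 2016 the bar at position 4. The sketch below looks that position up from the list of years instead of hard-coding it. The counts are made up for illustration, and the plot uses plain matplotlib with explicit positions, which is an assumption of this example rather than the method used in the cell above.

```python
import matplotlib.pyplot as plt

# Toy fiscal years and counts (hypothetical values, for illustration only)
fiscal_years = [2012, 2013, 2014, 2015, 2016, 2017]
counts = [50, 48, 45, 44, 47, 52]

# Draw one bar per year at positions 0, 1, 2, ...
positions = list(range(len(fiscal_years)))
fig, ax = plt.subplots()
ax.bar(positions, counts, color="#7851A9")
ax.set_xticks(positions)
ax.set_xticklabels(fiscal_years)

# Look up the bar position of 2016 instead of hard-coding it
moonshot_position = fiscal_years.index(2016)

# axvline spans the full height of the axes, so no ymin/ymax are needed
ax.axvline(x=moonshot_position, color="black", linewidth=5)

plt.show()
```

Looking the position up this way keeps the line on the right bar if the range of fiscal years in your data changes.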

#### Using Hex Codes for Color
In the graph above, you can see I set the color of the graph with a pound sign `#` followed by a series of six characters. This is a hex code, which is short for hexadecimal code. Each character is a hexadecimal digit (`0`-`9` or `A`-`F`). A hexadecimal code lets you specify one of over 16 million colors using combinations of red, green, and blue. It first has two digits for red, then two digits for green, and lastly two digits for blue: `#RRGGBB`
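
As a small worked example, the sketch below splits a hex code into its red, green, and blue parts. The helper name `hex_to_rgb` is hypothetical and written just for this illustration.

```python
def hex_to_rgb(hex_code):
    """Convert a '#RRGGBB' string into three integers from 0 to 255."""
    digits = hex_code.lstrip("#")
    # Read two hexadecimal digits at a time: red, then green, then blue
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

# The purple used in the barplot above
print(hex_to_rgb("#7851A9"))  # (120, 81, 169)

# Each of the three channels has 256 possible values
print(256 ** 3)               # 16777216
```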

# Checkpoint 2: Make a Barplot  <a class="anchor" id="cp2"></a>
Using your `Project_data` dataframe make a Presentation ready barplot.  
- Prepare the data to make your barplot. You can graph the number of core projects each year as we did above, but you don't have to. You can be more creative with your barplot to depict an important aspect of your research project. (1 point)
- Give your barplot a title that tells your audience the important take away of your visualization (1 point)
- Change the color of the bars (1 point)
- Change the x and y axis labels and make sure you include a data source. (1 point)


### Percentage of Core Cancer projects over time  <a class="anchor" id="cancer_pi"></a>
Next we will look at the number of PIs working on Cancer projects over time and how it compares to the number of PIs working on non-cancer projects.

### Cleaning PI_NAMEs  <a class="anchor" id="pi"></a>
Using the same code as in the `Record_Linkage_and_measurement.ipynb` notebook, we will clean the PI names.

In [None]:
# Make a temporary dataframe that creates an observation for each PI. 
# Split PI_NAMEs at each ; and use the explode function to put each name on its own row
temp = cancer_projects['PI_NAMEs'].str.split(';').explode().reset_index()

# Rename the PI_NAMEs variable
temp = temp.rename(columns = {'PI_NAMEs': 'PI_NAME'})

# Only keep observations in the temp dataframe where PI_NAME is not an empty string
temp = temp[temp['PI_NAME'] !=""]

# For the PI_NAME variable, use the str.replace function to replace ` (contact)` with nothing
# regex=False matches the literal text, so the parentheses do not need to be escaped
temp['PI_NAME'] = temp['PI_NAME'].str.replace(' (contact)', '', regex=False)

# look at the first 5 observations
temp.head()
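
To make the split and explode steps easier to follow, here is the same idea on a toy dataframe with two projects. The names are made up, and the exact layout of the `PI_NAMEs` strings (semicolon-separated with a trailing semicolon) is an assumption for this illustration.

```python
import pandas as pd

# Toy data (hypothetical names); the first project has two PIs
toy = pd.DataFrame({
    "APPLICATION_ID": [101, 102],
    "PI_NAMEs": ["DOE, JANE (contact);ROE, RICHARD;", "SMITH, ALEX;"],
})

# Split at each semicolon, then give every name its own row.
# reset_index keeps the original row number in a column called 'index'
names = toy["PI_NAMEs"].str.split(";").explode().reset_index()
names = names.rename(columns={"PI_NAMEs": "PI_NAME"})

# Drop the empty strings left behind by the trailing semicolons
names = names[names["PI_NAME"] != ""].copy()

# regex=False replaces the literal text " (contact)"
names["PI_NAME"] = names["PI_NAME"].str.replace(" (contact)", "", regex=False)
print(names)
```

The `index` column is what lets us link each cleaned name back to the row it came from.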

In [None]:
# Reset the index
cancer_projects2 = cancer_projects.reset_index()

cancer_projects2[['index','APPLICATION_ID','PI_NAMEs']].head(10)

In [None]:
cancer_projects2 = cancer_projects2.reset_index(drop=True)

In [None]:
cancer_projects2[['level_0','index','APPLICATION_ID','PI_NAMEs']].head(10)

In [None]:
# Merge this temporary dataframe with cancer_projects2
# The level_0 column in cancer_projects2 matches the index column in temp
cancer_projects3 = cancer_projects2.merge(temp, left_on = 'level_0', right_on='index') 

# look at the first 2 observations
cancer_projects3.head(2)

In [None]:
# Look at the first 6 observations for select variables
cancer_projects3[['APPLICATION_ID','PI_NAMEs','PI_NAME']].head(6)

In [None]:
# Display floats with two decimal places instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)

#### Number of PIs working on Cancer projects each year
We will calculate the number of PIs working on Cancer projects each fiscal year.

In [None]:
# calculate how many PIs are working on Cancer Projects each fiscal year
Cancer_PI = cancer_projects3.groupby(['FY'])['PI_NAMEs'].nunique()

# Convert into a dataframe and reset index
Cancer_PI = Cancer_PI.to_frame().reset_index()

# Rename PI_NAMEs to Cancer_PIs
Cancer_PI.rename(columns={'PI_NAMEs':'Cancer_PIs'}, inplace = True)

# View the first 5 observations 
Cancer_PI.head()

#### Number of PIs working on projects each year
We will calculate the number of PIs working on any project each fiscal year.

In [None]:
# calculate how many PIs are working on all Projects each fiscal year
All_PI = all_projects.groupby(['FY'])['PI_NAMEs'].nunique()

# Convert into a dataframe and reset index
All_PI = All_PI.to_frame().reset_index()

# Rename PI_NAMEs to All_PIs
All_PI.rename(columns={'PI_NAMEs':'All_PIs'}, inplace = True)

# View the first 5 observations 
All_PI.head()

#### Percentage of PIs working on Cancer Projects each year
First we are going to merge together the All_PI dataframe and the Cancer_PI dataframe, then we will calculate the percentage of all PIs who work on cancer projects and the percentage of all PIs who do not.

In [None]:
# Merge together `All_PI` and `Cancer_PI` on FY, creating a new dataframe called `Annual_PI`.
# Use an outer merge. 
Annual_PI = pd.merge(All_PI, Cancer_PI, on='FY', how = 'outer')

# View the dataframe
Annual_PI 

We are going to add a new column that reports the percent of all PIs who work on cancer-related projects.

In [None]:
# Calculate the percent of PIs who work on cancer research and round to two decimal places
Annual_PI['Percent_Cancer'] = round((Annual_PI['Cancer_PIs']/Annual_PI['All_PIs'])*100,2)

# Calculate the percent of PIs who did not work on cancer research 
Annual_PI['Percent_Not_Cancer'] = 100 - Annual_PI['Percent_Cancer']

# Save only the FY, Percent_Cancer and Percent_Not_Cancer variables
Annual_PI2 = Annual_PI[['FY','Percent_Cancer', 'Percent_Not_Cancer']]

# View the data frame
Annual_PI2
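
The merge and percentage steps can be checked by hand on a toy example. The counts below are made up so that the arithmetic is easy to verify: 50 out of 200 is 25 percent, and 100 out of 250 is 40 percent.

```python
import pandas as pd

# Toy yearly counts (hypothetical values, for illustration only)
all_pi = pd.DataFrame({"FY": [2019, 2020], "All_PIs": [200, 250]})
cancer_pi = pd.DataFrame({"FY": [2019, 2020], "Cancer_PIs": [50, 100]})

# An outer merge keeps every fiscal year found in either dataframe
annual = pd.merge(all_pi, cancer_pi, on="FY", how="outer")

# Share of PIs working on cancer projects, rounded to two decimal places
annual["Percent_Cancer"] = round(annual["Cancer_PIs"] / annual["All_PIs"] * 100, 2)

# The two shares add up to 100 for every year
annual["Percent_Not_Cancer"] = 100 - annual["Percent_Cancer"]
print(annual)
```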

## Stacked Barplot   <a class="anchor" id="stacked"></a>
Below, we will produce a stacked barplot that depicts the percentage of PIs working on cancer projects in one color and the percentage of PIs not working on cancer projects in a different color.

In [None]:
# Set the Background color (https://matplotlib.org/gallery/style_sheets/style_sheets_reference.html)
plt.style.use('dark_background')

# Create stacked bar chart
yearly_PI = Annual_PI2.set_index('FY').plot(kind='bar', 
                                                  stacked=True, 
                                                  color=['#7851A9', 'Grey'])
# Add the x label
plt.xlabel('Fiscal Year', fontsize=12, labelpad=15)
# Add the y label
plt.ylabel('Percentage of PIs Working on Cancer Projects', fontsize=10, labelpad=15)
# Add the title
plt.title('The Percentage of PIs Working on Cancer Projects Remained Consistent Over Time', fontsize=15, fontweight='bold')

# add a data source 
plt.annotate('Source: NIH RePORTER', xy=(0.95,-0.20), xycoords="axes fraction")

#rotate x-axis labels
plt.xticks(rotation=45)

## create legend and put it to the right mid way up the graph
plt.legend(("Percent Cancer", "Percent Not Cancer"), loc = (1.05,0.50)) 


plt.show()

In [None]:
## Switch back to default style
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline

## Geopandas <a class="anchor" id="geopandas"></a>
GeoPandas is an open-source project to make working with geospatial data in Python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and matplotlib for plotting.
In this section we will use geopandas along with matplotlib to produce geospatial heat maps.

## Heat Map of the United States <a class="anchor" id="hm"></a>

#### Read in the Shape file
Please download the [tl_rd22_us_state.zip](https://drive.google.com/file/d/1M01wosaxtywlT8ss3lS04GPkR4zogIuz/view?usp=share_link) folder from the google drive, un-zip it and move it to the Data folder inside the PADM-GP_2505 folder you made for this class. Note, while we only read in the .shp file, you need all of the files in this folder to make geospatial images. 

In [None]:
# Use geopandas to read in the .shp file
states = geopandas.read_file(Path + '/tl_rd22_us_state/tl_rd22_us_state.shp')
type(states)

In [None]:
# Look at the first 3 observations
states.head(3)

### Number of projects by State
We are going to create a geospatial heat map of the U.S. where the color of each state depicts the number of Core Projects from the state. First we need to prepare the data. 

In [None]:
# Calculate how many Core Projects each state has
cancer_st = cancer_projects.groupby(['ORG_STATE'])['CORE_PROJECT_NUM'].nunique().sort_values(ascending=False)
# Convert into a dataframe and reset index
cancer_st = cancer_st.to_frame().reset_index()

# Rename CORE_PROJECT_NUM to Total_Core_Projects
cancer_st.rename(columns={'CORE_PROJECT_NUM':'Total_Core_Projects'}, inplace = True)

# View the first 2 observations 
cancer_st.head(2)

We need to link this data to the state geodataframe. Note, if we start with the `states` data and merge in the cancer_st data we end up with a geodataframe, however if we start with the cancer_st and merge in the states data we end up with a normal dataframe and not a geodataframe. This is important because we need a geodataframe to produce geospatial heat maps.

In [None]:
# Merge the cancer_st data into the states data,
# matching the STUSPS variable from the states data to the ORG_STATE variable from the cancer_st data
cancer_st2 = pd.merge(states, cancer_st, left_on=  ['STUSPS'],
                   right_on= ['ORG_STATE'], how = 'inner')

# View the first 3 observations 
cancer_st2.head(3)           

#### Make a quick heat map of the U.S.

In [None]:
# Indicate the variable that should be used for the colors for the heat map
col = 'Total_Core_Projects'

# Create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 8))

# Create the heat map of the US
cancer_st2.plot(column = col, ax = ax, edgecolor='0.8', linewidth=1, cmap='viridis')


#### Focus on the continental U.S. 
When including the entire U.S. and all of its territories in the visual, it is difficult to make out specific states. Depending on your research question, it may be sufficient to focus on the continental U.S. Below, that means dropping the island territories, Hawaii, and Alaska.

In [None]:
# Remove all of the small islands
cancer_st3 = cancer_st2[cancer_st2['REGION'] != '9']
# Remove Hawaii
cancer_st3 = cancer_st3[cancer_st3['STUSPS'] != 'HI']
# Remove Alaska
cancer_st3 = cancer_st3[cancer_st3['STUSPS'] != 'AK']
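
The three filters above can also be combined into a single condition. The sketch below shows one way to do that on a toy table; the rows are made up for illustration and only mimic the `STUSPS` and `REGION` columns of the shape file.

```python
import pandas as pd

# Toy stand-in for the states table (illustrative rows only)
toy_states = pd.DataFrame({
    "STUSPS": ["NY", "HI", "AK", "PR", "CA"],
    "REGION": ["1", "4", "4", "9", "4"],
})

# Keep rows that are not in region 9 and are not Hawaii or Alaska.
# ~ flips True and False, so ~isin(...) means "not in this list"
keep = (toy_states["REGION"] != "9") & ~toy_states["STUSPS"].isin(["HI", "AK"])
toy_mainland = toy_states[keep]
print(toy_mainland["STUSPS"].tolist())  # ['NY', 'CA']
```

Using `isin()` makes it easy to extend the list if you want to leave out more states.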

### Make a Presentation Ready heat map of the U.S. <a class="anchor" id="presentation_hm"></a>

In [None]:
# Indicate the variable that should be used for the colors for the heat map
col = 'Total_Core_Projects'


# Create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 8))

# Remove the axis
ax.axis('off')

# Create the heat map of the US, use the viridis color palette 
cancer_st3.plot(column=col, ax = ax, edgecolor = '0.8', linewidth = 1, cmap = 'viridis')

# Give your figure a title
title = 'California has the Most Core Cancer Projects'
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight': '3'})


# add a data source 
# xy are measured as a fraction of the axes length, from bottom left of graph:
plt.annotate('Source: NIH RePORTER', xy=(0.95,-0.20), xycoords="axes fraction")

# identify the max and min number of Cancer projects 
vmin = cancer_st3[col].min()
vmax = cancer_st3[col].max()
            
# Create colorbar as a legend
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap='viridis')


# Add the colorbar to the figure
cbaxes = fig.add_axes([0.15, 0.25, 0.01, 0.4])
cbar = fig.colorbar(sm, cax=cbaxes)
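
The colorbar in the cell above is built from a `ScalarMappable`, which pairs a color palette with the range of values it should cover. The sketch below shows that idea in isolation, without a map. The three values and the bar labels are made up for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Toy values (hypothetical), e.g. project counts for three places
values = [5, 20, 80]

# Normalize rescales the values so the smallest is 0 and the largest is 1
norm = Normalize(vmin=min(values), vmax=max(values))

# The ScalarMappable translates each rescaled value into a viridis color
sm = ScalarMappable(norm=norm, cmap="viridis")
colors = [sm.to_rgba(value) for value in values]

fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], values, color=colors)

# The same ScalarMappable drives the colorbar, so bars and legend agree
fig.colorbar(sm, ax=ax, label="Value")
plt.show()
```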

## More visuals  <a class="anchor" id="More"></a>
Below are a few visual types you might want to include in your projects. These are just quick and easy graphs, but you can add the same code we added to the barplot to make them presentation ready.

### Lineplots  <a class="anchor" id="lineplots"></a>

In [None]:
sns.lineplot(x = 'FY', 
             y = 'Cancer_Core_Projects', 
             data = Cancer_Annual)
plt.show()

### Histogram   <a class="anchor" id="Histagram"></a>

In [None]:
#### plt.hist(fund_lim[fund_lim["FY"] == 2010].FY_TOTAL_COST, facecolor="y", bins=50, alpha=0.5)

In [None]:
cancer_projects.columns

In [None]:
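# TOTAL_COST was converted to text when the files were read in, so convert it back to numbers
hist_data = cancer_projects.copy()
hist_data["TOTAL_COST"] = pd.to_numeric(hist_data["TOTAL_COST"], errors="coerce")
# Remove all observations with a missing TOTAL_COST
hist_data = hist_data[hist_data["TOTAL_COST"].notnull()]
hist_data_15 = hist_data[hist_data["FY"] == 2015]
hist_data_19 = hist_data[hist_data["FY"] == 2019]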

In [None]:
# Make a simple histogram:
# The plt.hist function draws histograms. You have to give it the dataframe and variable you want to plot
plt.hist(hist_data_15.TOTAL_COST)
plt.show()

### Layering in Matplotlib  <a class="anchor" id="Layering"></a>
This functionality - where we can make consecutive changes to the same plot - also allows us to layer on multiple plots. By default, the first graph you create will be at the bottom, with ensuing graphs on top.

Below, we see the 2019 histogram is beneath the 2015 histogram. You might also notice that the distribution of grant funding has shifted a bit over the years. 

In [None]:
## Layering plots
plt.hist(hist_data_19.TOTAL_COST, facecolor="y", bins=50, alpha=0.5)
plt.hist(hist_data_15.TOTAL_COST, facecolor="olive", bins=50, alpha=0.8)

## create legend
plt.legend(("2019", "2015"), loc='upper right') 

plt.show()
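
When two histograms are layered, each call with `bins=50` picks its own bin edges from its own data, so the bars of the two layers may not line up. One option, sketched below, is to compute a single set of bin edges and pass it to both calls. The data here are random numbers generated for illustration, not the project costs.

```python
import matplotlib.pyplot as plt
import numpy as np

# Two made-up samples standing in for costs from two different years
rng = np.random.default_rng(seed=0)
costs_a = rng.lognormal(mean=12.5, sigma=0.8, size=500)
costs_b = rng.lognormal(mean=12.7, sigma=0.8, size=500)

# Compute one set of 50 bins that covers both samples
bin_edges = np.histogram_bin_edges(np.concatenate([costs_a, costs_b]), bins=50)

fig, ax = plt.subplots()
# The first histogram is drawn at the bottom, the second on top of it
ax.hist(costs_a, bins=bin_edges, alpha=0.5, color="y", label="Group A")
ax.hist(costs_b, bins=bin_edges, alpha=0.5, color="olive", label="Group B")
ax.legend(loc="upper right")
plt.show()
```

Passing a `label` to each layer lets `legend()` pick up the names automatically, so the legend order cannot drift from the plotting order.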

# Checkpoint 3: Visualization of your choice  <a class="anchor" id="cp3"></a>
Create another presentation ready visualization using your `Project_data` dataframe. You can use any type of graph except a barplot (because you have already done that) or a pie chart (because they are not good at conveying information). 
- Prepare the data for your visualization (1 point)
- Give your visualization a title that tells your audience the important take away of your visualization (1 point)
- Change the color of the visualization (1 point)
- Change the x and y axis labels and make sure you include a data source. (1 point)


## More Resources  <a class="anchor" id="Resources"></a>

* [A Thorough Comparison of Python's DataViz Modules](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair)

* [Seaborn Documentation](http://seaborn.pydata.org)

* [Matplotlib Documentation](https://matplotlib.org)

* [Advanced Functionality in Seaborn](http://blog.insightdatalabs.com/advanced-functionality-in-seaborn)

## Other Python Visualization Libraries  <a class="anchor" id="Other"></a>

* [Bokeh](http://bokeh.pydata.org)

* [Altair](https://altair-viz.github.io)

* [ggplot](http://ggplot.yhathq.com)

* [Plotly](https://plot.ly)

### Exporting Completed Graphs

When you are satisfied with your visualization, you may want to save a copy outside of your notebook. You can do this with `matplotlib`'s savefig function. You simply need to run:

`plt.savefig("fileName.fileExtension")`

The file extension is surprisingly important. Image formats like png and jpeg are **not ideal**. These file formats store your graph as a giant grid of pixels, which is space-efficient, but can't be edited later. Saving your visualizations as a PDF instead is strongly advised. PDFs are a type of vector image, which means all the components of the graph will be maintained.

With PDFs, you can later open the image in a program like Adobe Illustrator and make changes like the size or typeface of your text, move your legends, or adjust the colors of your visual encodings. All of this would be impossible with a png or jpeg.
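
As a minimal sketch of exporting, the cell below saves the same toy figure as both a PDF and a png. The figure contents, the file names, and the temporary output folder are all assumptions made for this illustration; in your own work you would point the path at your project folder.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# A small toy figure (made-up values, for illustration only)
fig, ax = plt.subplots()
ax.plot([2019, 2020, 2021], [10, 12, 9])
ax.set_title("Example figure")

# Hypothetical output location: a fresh temporary folder
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "example_figure.pdf")
png_path = os.path.join(out_dir, "example_figure.png")

# The file extension decides the format that is written.
# bbox_inches="tight" trims extra white space around the figure
fig.savefig(pdf_path, bbox_inches="tight")

# For pixel formats such as png, dpi controls the resolution
fig.savefig(png_path, dpi=300, bbox_inches="tight")

plt.show()
```

In a notebook, it is safest to save in the same cell that draws the figure, before `plt.show()`, because the figure is closed once the cell has been displayed.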