
----

# Data Visualization



---

### Table of Contents


1 - [Matplotlib](#section1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Line Plots](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Scatter Plots](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Bar Plots](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Histograms](#subsection4)<br>

2 - [Working with Data](#section2)<br>



## Matplotlib <a id='section1'></a>

We'll create visualizations in Python using a popular Python package called matplotlib. Let's import matplotlib along with several other Python libraries that we will be using:

In [None]:
import math
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

To plot something, we need data! To begin, we will use very simple arrays. Let's try plotting just two points, (x1, y1) = (1, 1) and (x2, y2) = (5, 5):

In [None]:
# EXAMPLE

xpoints = np.array([1,5])
ypoints = np.array([1,5])

plt.plot(xpoints, ypoints);

**Quick note**: Put a semicolon `;` at the end of the last line in a Jupyter notebook cell to stop the notebook from printing that line's return value. Try removing the semicolon in the above cell and see how the output changes.

You can use the keyword argument `marker` to draw a marker at each point:

In [None]:
# EXAMPLE

plt.plot(xpoints, ypoints, marker='o');

There are other markers that you can find in [matplotlib's documentation page](https://matplotlib.org/stable/api/markers_api.html).

Choose a marker from the documentation page and try it out below!

In [None]:
# EXERCISE - Try plotting with a marker of your choice:

plt.plot(xpoints, ypoints, marker=...);

If you pass the marker as a short format string instead of using the `marker` keyword, the points are plotted without a connecting line:

In [None]:
# EXAMPLE

plt.plot(xpoints, ypoints, '^');
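
The string `'^'` above is a matplotlib format string, which can combine a marker, a line style, and a color in one short code. As a small sketch (the particular combination `'o--r'` is just an illustration), this draws circle markers joined by a dashed red line:

```python
import numpy as np
import matplotlib.pyplot as plt

xpoints = np.array([1, 5])
ypoints = np.array([1, 5])

# Format string: 'o' = circle marker, '--' = dashed line, 'r' = red
lines = plt.plot(xpoints, ypoints, 'o--r')
line = lines[0]  # plt.plot returns a list of Line2D objects
```

Leaving out the line style part (for example `'or'`) keeps the markers and the color but drops the connecting line.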

### 1.1 Line Plots <a id='subsection1'></a>

We've actually created line plots above, but let's go further. Any good plot needs a title and labels for the x and y axes:

In [None]:
# EXAMPLE

time = np.array([0, 5, 10, 15, 20])
distance = np.array([0, 12.5, 50, 112.5, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.plot(time, distance);

In [None]:
# EXAMPLE - even more customization with colors and linewidth

time = np.array([0, 5, 10, 15, 20])
distance = np.array([0, 12.5, 50, 112.5, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.plot(time, distance, linewidth=5, color='darkmagenta');

You can find a table of colors [here](https://www.w3schools.com/colors/colors_names.asp). Note that you can use either a color name, as above, or its hexadecimal value (for example, the hexadecimal value for Dark Magenta is #8B008B).
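
To check that a color name and a hexadecimal value refer to the same color, one option is matplotlib's `matplotlib.colors.to_hex` helper. This is only a quick sketch for verifying the Dark Magenta example above:

```python
import matplotlib.colors as mcolors

# Convert the named color to its hexadecimal string (returned in lowercase)
hex_value = mcolors.to_hex('darkmagenta')
print(hex_value)

# Both spellings describe the same color, so either can be passed to `color=`
same_color = mcolors.to_rgb('darkmagenta') == mcolors.to_rgb('#8B008B')
print(same_color)
```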

### 1.2 Scatter Plots <a id='subsection2'></a>

Scatter plots are useful when you are looking for a relationship between two variables.

We can use the `scatter()` function to draw a scatter plot:

In [None]:
# EXAMPLE

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.scatter(time, distance);

In [None]:
# EXAMPLE - Two data sets

time2 = np.array([0, 5, 10, 15, 20])
distance2 = np.array([0, 50, 100, 150, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.scatter(time, distance, color='red'); #Similar to before you can change the color
plt.scatter(time2, distance2, color='blue');
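
When two data sets share a plot, a legend helps the reader tell them apart. Here is a minimal sketch that reuses the numbers from the example above; the label text (`'Accelerating'`, `'Constant speed'`) is invented here for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

time = np.array([0, 5, 10, 15, 20])
distance = np.array([0, 12.5, 50, 112.5, 200])
distance2 = np.array([0, 50, 100, 150, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
# `label` sets the text shown in the legend for each data set
plt.scatter(time, distance, color='red', label='Accelerating')
plt.scatter(time, distance2, color='blue', label='Constant speed')
legend = plt.legend()
```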

For the exercise below, you are given data points for average temperature (in Fahrenheit) and average humidity (in %) for Berkeley over 12 months (data from [Climate-Data.org](https://en.climate-data.org/north-america/united-states-of-america/california/berkeley-1266/)). Make two scatter plots, one for each data set. Don't forget the plot title and axis labels.

In [None]:
#Data sets for the exercise

months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
temperature = np.array([48.1, 49.9, 52.7, 55.3, 59.5, 63.9, 64.6, 64.9, 65, 61.3, 54, 48.6])
humidity = np.array([77, 78, 74, 66, 62, 60, 65, 66, 62, 63, 72, 76])

In [None]:
# EXERCISE - Create a scatter plot for Berkeley's average temperature data



In [None]:
# EXERCISE - Create a scatter plot for Berkeley's average humidity data



### 1.3 Bar Plots <a id='subsection3'></a>

Bar plots represent categorical data with rectangular bars (vertical or horizontal) that are proportional to the values they represent.

In [None]:
# EXAMPLE

courses = np.array(["C", "C++", "Fortran", "Java", "MATLAB", "Python"])
students = np.array([10, 20, 5, 30, 30, 50])

plt.title('Number of students enrolled in coding classes')
plt.xlabel('Coding Courses')
plt.ylabel('Number of Students')
plt.bar(courses, students);

In [None]:
# EXAMPLE - same data but horizontal

plt.title('Number of students enrolled in coding classes')
# With horizontal bars, the counts run along the x-axis and the courses along the y-axis
plt.xlabel('Number of Students')
plt.ylabel('Coding Courses')
plt.barh(courses, students);

### 1.4 Histograms <a id='subsection4'></a>

A histogram is a common data distribution chart that is used to show the frequency with which specific values, or values within ranges, occur in a set of data.

The horizontal axis of a histogram displays the number range and the vertical axis represents the amount of data (frequency) that is present in each range.

In [None]:
# EXAMPLE

hoursSlept = np.array([3, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 12])

plt.title('Number of Hours Slept for A Class of 11th Graders')
plt.xlabel('Number of Hours')
plt.ylabel('Frequency')
plt.hist(hoursSlept);

Each bar in the histogram is called a bin. The default is 10 bins. Run the code below to see what happens when we reduce the number of bins to 5. What do you notice?

In [None]:
plt.title('Number of Hours Slept for A Class of 11th Graders')
plt.xlabel('Number of Hours')
plt.ylabel('Frequency')
plt.hist(hoursSlept, bins=5);
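
One way to see the numbers behind the bars is `np.histogram`, which returns the count in each bin and the bin edges without drawing anything. A short sketch using the same sleep data and 5 bins:

```python
import numpy as np

hoursSlept = np.array([3, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 12])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(hoursSlept, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} hours: {count} values")
```

Changing `bins` here and in `plt.hist` together lets you compare the printed counts with the heights of the bars.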

There is no "right" way to display a histogram, but it's important to keep in mind that some bin counts convey more information than others. For example, if we reduce the number of bins even further to 2, we no longer get a good grasp of the distribution of hours slept:

In [None]:
plt.title('Number of Hours Slept for A Class of 11th Graders')
plt.xlabel('Number of Hours')
plt.ylabel('Frequency')
plt.hist(hoursSlept, bins=2);

In [None]:
# EXERCISE - play around with the number of bins for this histogram of random numbers.
# Which number of bins do you prefer?

x = np.random.normal(size=10000)
plt.hist(x);

## Working with Data <a id='section2'></a>

Now let's try working with real data! For this example we will use the data of the brightness of a supernova (an exploding star), SN 1998aq. By plotting the brightness of the supernova over time, we can obtain light curves.

The light curves of one type of supernova, type Ia, allow these supernovae to be used as "standard candles" - objects used to accurately measure distances in the universe. By looking at how fast they brighten and fade, astronomers can calculate distances. This was key to the discovery of dark energy, the mysterious energy that is causing the accelerating expansion of the universe. Dr. Saul Perlmutter, astrophysicist at Berkeley Lab, was one of the winners of the 2011 Nobel Prize in Physics for this discovery.

*If you're curious to know more, [this quick video on type Ia supernovae](https://www.youtube.com/watch?v=jlqnKu82UxU) is a good starting point. For a more advanced overview article, check out ['Supernovae, Dark Energy, and the Universe'](https://supernova.lbl.gov/PDFs/PhysicsTodayArticle.pdf) written by Dr. Saul Perlmutter.*



<img src="images/riessetal2005fig3.jpg">

*Above figure and data from [Riess et al. 2005, ApJ 627, 579](https://ui.adsabs.harvard.edu/abs/2005ApJ...627..579R/abstract)*

To read in the data we will use another Python library called `pandas`. `pandas` provides DataFrame objects, which organize data in tabular form (like Google/Excel spreadsheets). Let's import the library:

In [None]:
import pandas as pd

To read in a `.csv` file we simply use `pd.read_csv`. This file happens to be tab-delimited, so we need to specify `sep='\t'`. We will also skip the first 3 rows with `skiprows=3`.

In [None]:
sn = pd.read_csv('data/sn1998aq_UBVRI.tsv', sep='\t', skiprows=3)
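
If you'd like to see what `sep` and `skiprows` do without opening the data file, here is a small sketch that reads from an in-memory string instead. The text below is made up for illustration and is not the contents of the supernova file:

```python
import io
import pandas as pd

# Hypothetical file contents: three header lines to skip, then a
# tab-separated table whose first row holds the column names
text = (
    "# comment line 1\n"
    "# comment line 2\n"
    "# comment line 3\n"
    "day\tmag\n"
    "1.0\t14.2\n"
    "2.0\t13.8\n"
)

example = pd.read_csv(io.StringIO(text), sep='\t', skiprows=3)
print(example)
```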

To look at the first few rows of the dataset we'll use the `.head()` method:

In [None]:
sn.head()

What are we looking at?
* JD-2450000 - the Julian Date minus 2450000. Astronomers simply count days (instead of dealing with months, years, leap years, etc.) and often use the Julian Date (if you are interested in learning more, [see here](https://www.aavso.org/about-jd)).
* U, B, V, R, I - Magnitudes (brightness) of the supernova measured in different filters. Although a bit counterintuitive, the smaller the magnitude, the brighter the supernova. Filters are used to look at specific colors in observational astronomy: U stands for ultraviolet, B for blue, V for visual, R for red, and I for infrared.
* Uerr, Berr, Verr, Rerr, Ierr - the uncertainty associated with each magnitude measurement.
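
To get a feel for the magnitude scale: a difference of 5 magnitudes corresponds to a factor of 100 in brightness, so two objects whose magnitudes differ by Δm differ in brightness by a factor of 10^(0.4 Δm). A small sketch, where the helper name `brightness_ratio` and the example magnitudes are made up for illustration:

```python
def brightness_ratio(mag_faint, mag_bright):
    """Return how many times brighter the second object is than the first."""
    return 10 ** (0.4 * (mag_faint - mag_bright))

# An object at magnitude 12 compared with one at magnitude 17
ratio = brightness_ratio(17, 12)
print(ratio)
```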

A common problem that you will come across when analyzing data is __missing__ data. You can check whether your data set contains missing values with the `.isnull()` method. It returns True wherever a value is missing (Null, NaN) and False wherever it is not. We can combine it with `.sum()` to count the missing values in each column.

In Python (as in most programming languages), True is represented by 1, and False by 0. So using `.sum()` lets us treat these True/False values as numbers: the total is simply the number of True values.
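
Here is a small sketch of that idea on a made-up table (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries in column 'B'
toy = pd.DataFrame({
    'U': [12.1, 12.3, 12.8],
    'B': [12.5, np.nan, np.nan],
})

print(toy.isnull())        # True wherever a value is missing
missing = toy.isnull().sum()
print(missing)             # number of missing values in each column
```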

Let's check if our data has any missing data:

In [None]:
sn.isnull().sum()

Phew! Seems like we got a data set with no missing values!

**Challenge:** Create plots for data from each filter (you can ignore the uncertainty values for now). The goal of your plots is to show how the magnitude (brightness) of the supernova changes over time. You can use one plot, but make sure to differentiate between different magnitude filters using the `color` argument (see for example 1.1 Line Plots and 1.2 Scatter Plots).

Before starting to code, discuss with your group what type of plot will make the most sense for this data. Check in with your instructor before moving on!

In [None]:
# CHALLENGE

# Here are arrays that you can use for the days and magnitudes:
days = sn['JD-2450000']
magU = sn['U']
magB = sn['B']
magV = sn['V']
magR = sn['R']
magI = sn['I']

#Code to invert the y-axis so it starts at magnitude 17 and goes until magnitude 11.
#This is needed as the brighter the supernova, the smaller the magnitude.
plt.ylim(17,11)

# Your code below. Don't forget to add a title and label your axes.



### Bonus - Error Bars

The data set had columns for uncertainties for the magnitude measurements. Matplotlib can also create plots with error bars through the `plt.errorbar()` function.

Note that for this example we zoom in on the rise of the light curve, because the error bars are quite small.

In [None]:
# EXAMPLE

# Reading in uncertainty measurements:
magUerr = sn['Uerr']
magBerr = sn['Berr']
magVerr = sn['Verr']
magRerr = sn['Rerr']
magIerr = sn['Ierr']

#You can change these values to see more/less of the graph
plt.ylim(14, 12)
plt.xlim(920,930)

plt.title('SN 1998aq B Light Curve')
plt.xlabel('JD-2450000 (days)')
plt.ylabel('B mag')
plt.errorbar(days, magB, yerr=magBerr, fmt='.', capsize=2, elinewidth=2);

## Exploring Python data visualization further...

We've only looked at matplotlib (and not even fully explored all its features!). Another popular data visualization library is [seaborn](https://seaborn.pydata.org/). There are also [Plotly](https://plot.ly/python/) and [Bokeh](http://bokeh.pydata.org/en/latest/), which can create interactive visualizations.

---
Notebook developed by: Alisa Bettale