# Data Visualization with Python

- `pandas` for loading and organizing data
- `matplotlib` and `seaborn` for making data visualizations

1. Basic plotting with Matplotlib
   - creating a simple line plot
   - adding labels and titles, legends, grids
   - customizing colours and styles
   - saving plots to files
2. Different types of plots
   - scatter, bars, histograms, pie charts
3. Exercises

# `matplotlib`  https://matplotlib.org/

`matplotlib` is a Python library specifically designed to make data visualizations.

To make data viz with `matplotlib`, we import just one module (called `pyplot`) from the library, and shorten it to `plt`. 


In [None]:
# import matplotlib 
from matplotlib import pyplot as plt

The function for a line chart is `.plot()`. We access it through the `pyplot` module (imported as `plt`), so to make a line chart we’ll use `plt.plot()`.

The line graph function has parameters for x and y data: `plt.plot(x, y)`.

- `x` is the data for the x-axis – in a line chart, this is very often a measure of time
- `y` is the data for y-axis – generally things we want to view over time, like interest rates, the price of a slice of pizza, or population size in a country

Run the cell to make a line graph plotting the total sales in the restaurant during the week.

In [None]:
# Create a plot
plt.plot(['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun'], [540, 500, 490, 590, 680, 710, 610])

# Display the plot
plt.show()

Depending on our data setup, x and y can be written directly into the function (as above) or pulled from data stored in a variety of data structures, including ***lists***, ***numpy arrays***, and ***pandas dataframes***.
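
As a rough sketch of that idea, the same sales figures can be held in a list, a numpy array, or a pandas dataframe and passed to `plt.plot()` in the same way. The variable names and the column names `day` and `sales` are made up for this illustration; they are not taken from the course files.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# The same sales figures stored in three different structures
days = ['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']
sales_list = [540, 500, 490, 590, 680, 710, 610]
sales_array = np.array(sales_list)
sales_df = pd.DataFrame({'day': days, 'sales': sales_list})  # hypothetical column names

# Each call accepts its data structure directly.
# The three lines overlap exactly because the data is identical.
plt.plot(days, sales_list)
plt.plot(days, sales_array)
plt.plot(sales_df['day'], sales_df['sales'])

plt.show()
```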

***Your turn:*** Uncomment (##) to plot:

In [None]:
# make a line chart using named lists
day = ['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']
sales_totals = [540, 500, 490, 590, 680, 710, 610]
##plt.plot(day, sales_totals)

##plt.show()

***Your turn:*** **Exercise 1**

Try to create the same plot using the data from the *restaurant_data.csv* file.

*Reminder: use `pd.read_csv()`*

In [None]:
## YOUR SOLUTION HERE ##

# import pandas

# load dataset

# make a chart


# Customizing and general functions

Matplotlib allows you to customize various aspects of your plots. You can change line styles, colors, markers, labels, and more. 

| Parameter | Example arguments |
|:--------- |:---|
| ***`linestyle`***    | 'dotted', 'dashed', 'solid', '--', '-.', ':'  |
| ***`marker`*** | 'o', 's', 'D', '*'|
|***`color`***     | 'r', 'green', '#FF0000', 'darkorange' | 
| ***`linewidth`***| 0.2, 0.5, 1.5 |
| ***`label`*** | 'your label', `plt.legend()` uses `plt.plot()`’s label parameter as its information source |

More parameters here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html.

General functions:

|  | general function | parameters |
|:--------- |:-|:--- |
| Add a title  | `plt.title()` | 'your title', fontsize=16, color='r', backgroundcolor='b', loc='right', fontstyle='italic', https://matplotlib.org/stable/api/text_api.html#matplotlib.text.Text|
| Add a legend    | `plt.legend()`  | bbox_to_anchor = (1, 0.5) |
| Add axis labels      | `plt.xlabel()`, `plt.ylabel()`  | 'your label', loc='center', fontsize=12, https://matplotlib.org/stable/api/text_api.html#matplotlib.text.Text |
| Adjust axes  | `plt.xlim()`, `plt.ylim()` | 1,10 |
| Modify tick labels | `plt.xticks()`, `plt.yticks()` | rotation=45,size='large', color='purple' | 
| Add a grid   |  `plt.grid()` | color='r', linestyle='--', linewidth=2, axis = 'x', alpha=0.5 |
| Display the graph  | `plt.show()`  | |
| **Save the figure** | `plt.savefig()` | 'my_plot.png', dpi=128, bbox_inches='tight' |
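
Saving is the one general function not used in the examples in this section, so here is a minimal sketch of it. The output location (the system's temporary folder) is an assumption made for this illustration; any writable path works. The sketch calls `plt.savefig()` before `plt.show()` because in many setups the figure is cleared once it has been shown, and saving afterwards can produce a blank image.

```python
import os
import tempfile
from matplotlib import pyplot as plt

plt.plot(['Mon', 'Tues', 'Weds'], [540, 500, 490])
plt.title('Restaurant Sales')

# Hypothetical output location: a file in the system's temporary folder
out_path = os.path.join(tempfile.gettempdir(), 'my_plot.png')

# Save first, then display
plt.savefig(out_path, dpi=128, bbox_inches='tight')
plt.show()
```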


Here's an example. Uncomment the lines marked with `##` and feel free to modify the plot:

In [None]:
plt.plot(['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun'], [540, 500, 490, 590, 680, 710, 610], label='First week', linestyle='dashed', color='red', marker='o', linewidth=1.8)
##plt.xlabel('Day',fontsize=12)
##plt.ylabel('Sales',fontsize=12)
##plt.legend()
##plt.ylim(480,720)
##plt.title('Restaurant Sales', fontsize=16)
##plt.grid(axis='y', alpha=0.3)

plt.show()

***Your turn:*** **Exercise 2**

We’re going to look at monthly average temperatures from around the world. 

In [None]:
import pandas as pd

# load dataset
data = pd.read_csv("2020-monthly-avg-temps-f.csv")
data.head()

Below is the code to make a line graph plotting the monthly average high temperature in different cities on the y-axis and month names on the x-axis.


Add the monthly average high temperatures for the rest of the cities in the *2020-monthly-avg-temps-f.csv* file. Rotate the tick labels on the x-axis by 45 degrees, and add a title, legend, axis labels, grid, etc. Save your plot.

In [None]:
## YOUR SOLUTION HERE ##

# make a line chart
plt.plot(data.month_name, data.alice_springs_avg_high, color='darkorange', label='Alice Springs')
plt.plot(data.month_name, data.windhoek_avg_high, color='pink', label='Windhoek')

plt.show()

# Different types of plots

Let’s learn about five common chart types in data visualization, and how we make them appear in `matplotlib`.

Each graph that can be made in `matplotlib` has a dedicated function. To change the graph type, we change which graph function we’re calling from `plt`. Here are five of the most commonly used charts and the functions we call to make them:

| Chart type    | Chart code  | Review | 
|:-------------- |:-----------|:-----------|
| Line chart    | `plt.plot()`   |shows continuous change, often used to measure change over time |
| Bar chart     | `plt.bar()`   | uses bar height to compare a measure between categorical variables|
| Scatter plot  | `plt.scatter()`   |uses position to show the relationship, or correlation, between two numeric values |
| Pie chart     | `plt.pie()`   | shows us the breakdown of a whole into its parts|
| Histogram     | `plt.hist()`   |  shows how one kind of data is distributed|


More types here: https://matplotlib.org/stable/plot_types/index.html

**Bar chart**

The bar chart graph function is `plt.bar()`. It takes the following parameters:

- `x`: categorical data for each bar
- `height`: numeric data to determine the height of each bar
- `width`: a number we can pass in to set the width of each bar
- `align`: set to 'center' or 'edge' to align each bar on the x-axis
- `color`, `edgecolor`
- more: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html
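
The parameters above could be combined roughly as follows. The fruit data is invented for this sketch and does not come from any of the course files.

```python
from matplotlib import pyplot as plt

# Hypothetical data, made up for this illustration
fruits = ['apple', 'banana', 'cherry', 'kiwi']
units_sold = [30, 45, 12, 27]

# x sets the categories, height sets the bar heights
bars = plt.bar(x=fruits, height=units_sold, width=0.6, align='center',
               color='skyblue', edgecolor='navy')

plt.title('Units sold per fruit')
plt.xlabel('Fruit')
plt.ylabel('Units sold')
plt.show()
```

Try `align='edge'` to see the bars shift so that their left edge, rather than their center, sits on each tick.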

***Your turn:*** **Exercise 3**

Below is a bar chart of *pH* and *average leaf width* using *'plant_data_simplified.csv'*. Check out the columns in the dataset to get a sense of what it contains. Set the width equal to 0.8 and align to center. Add a title: *Effect of soil pH on plant growth* and labels. Play with bar chart parameters.

In [None]:
## YOUR SOLUTION HERE ##

bar_data = pd.read_csv('plant_data_simplified.csv')

##plt.bar(x=bar_data.PH, height=bar_data.average_leaf_width)

##plt.show()

We can visually represent uncertainty by adding error bars to a graph. We’ll use matplotlib’s general function **`plt.errorbar()`**, which takes the following parameters:

- `x` and `y`: restate the X and Y values of the underlying graph
- `yerr` and/or `xerr`: set error values in the Y and/or X direction
- `color`: set the color of the error bar (optional)
- `fmt`: change the marker ('o', 's', 'D', '*')
- `capsize`: the length of the error bar caps in points.
- many more: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.errorbar.html
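
A small sketch of error bars drawn on top of a bar chart is shown here. The dose, response, and error values are invented for the illustration.

```python
from matplotlib import pyplot as plt

# Hypothetical measurements with an uncertainty for each value
doses = [1, 2, 3, 4]
responses = [2.1, 3.9, 6.2, 7.8]
errors = [0.3, 0.5, 0.4, 0.6]

# Underlying bar chart
plt.bar(x=doses, height=responses, color='lightgray')

# Restate x and y, then add the vertical error values
container = plt.errorbar(x=doses, y=responses, yerr=errors,
                         fmt='o', color='black', capsize=4)

plt.show()
```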



***Your turn:*** **Exercise 4**

Copy the code from the cell above. In the space above `plt.show()`, write the code to add error bars using the *error* column from the *'plant_data_simplified.csv'*. Set the marker to `'o'`, set the capsize to 6 and make the error bar color `'orangered'`.

In [None]:
## YOUR SOLUTION HERE ##



**Scatterplot**

Scatterplots are great for directly comparing two continuous numeric variables, helping us to visualize the relationship or **correlation** between them.

Use **`plt.scatter()`** with the following parameters:

- `x` and `y`: the continuous numeric variables to be compared
- `color`: marker color, as a color code, color name, or hex code
- `alpha`: marker opacity, as a number between 0 (transparent) and 1 (opaque)
- `marker` and `s`: marker shape and size
- more: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html
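
To see these parameters in action before turning to real data, here is a sketch that uses randomly generated points. The data is synthetic, and the positive relationship between `x` and `y` is built in on purpose.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic data: y loosely follows x, plus some random noise
rng = np.random.default_rng(seed=0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

# A low alpha makes dense regions appear darker than sparse ones
points = plt.scatter(x, y, color='teal', alpha=0.25, marker='o', s=20)

plt.xlabel('x')
plt.ylabel('y')
plt.show()
```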

***Your turn:*** **Exercise 5**

Run the cell to load the *spotify_data_by_genres.csv*. Check out the columns in the dataset to get a sense of what it contains.

In [None]:
# load dataset
scatter_data = pd.read_csv('spotify_data_by_genres.csv')

Write the code to make a scatterplot comparing `danceability` and `valence`. `danceability` is a measure of how “danceable” a song is, from 0 (not dance-y, like an audiobook or funeral dirge) to 1 (very dance-y, like disco or house music). `valence` is a measure of the song’s mood: sad to happy from 0 to 1. What kind of correlation do you expect?

Adjust the alpha to 0.25 to make the relationship more obvious. Play around with other alpha values to see how the graph’s readability changes!

In [None]:
## YOUR SOLUTION HERE ##

plt.scatter(scatter_data.danceability, scatter_data.valence)
plt.title('Mood and Danceability correlation in Spotify genres')
plt.xlabel('Danceability')
plt.ylabel('Valence / Mood (sadder to happier)')
plt.show()

Experiment with other Spotify categories in the x and y parameters, if you want!

In [None]:
## YOUR SOLUTION HERE ##

plt.title('___ and ___ correlation in Spotify genres')


# View in subplots

In data visualization, it’s often helpful to compare multiple simple visualizations. As we saw in the last exercise, there were many Spotify categories to compare across genres, and viewing them as individual scatterplots definitely made more sense than trying to fit them all into the same graph.

The `plt.subplots()` function creates a figure and a grid of subplots with a single call, while providing reasonable control over how the individual plots are created.

![fig-4-plots-2.png](attachment:fig-4-plots-2.png)

First we create the `figure` (which we can call `fig`) and at least one `axis` (which you can call `ax`) using the **`subplots()`** function from the pyplot module:

In [None]:
# Create figure and one plot (axis object) 
fig, ax = plt.subplots()

`fig` and `ax` are variable names for the **figure** and **axis** objects. You can call these items whatever you want. 

Notice that `fig` and `ax` are created at the same time by setting them equal to the output of the `pyplot.subplots()` function. As no other arguments have been provided, the result is a figure with one plot that is empty but ready for data.

You can change the size of your figure using the argument `figsize` to specify a width and height for your figure:

`figsize = (width, height)`

In [None]:
# Resize figure
fig, ax = plt.subplots(figsize = (5, 3))

In [None]:
# Figure with two plots stacked vertically (2 rows, 1 column)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (3, 3))

In [None]:
# Figure with two plots side by side (1 row, 2 columns)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (5, 3))

You can play around with both the number of rows and columns as well as figsize to arrive at the look that you want.


Here is an example. Note that `scatter` is now called as a method of each `axis` object.

In [None]:
fig, ((ax1,ax2,ax3,ax4),(ax5,ax6,ax7,ax8),(ax9,ax10,ax11,ax12)) = plt.subplots(3,4,figsize = (10, 6))

ax1.scatter(scatter_data.popularity, scatter_data.acousticness, alpha=0.05)
ax1.set(title='acousticness')
ax2.scatter(scatter_data.popularity, scatter_data.danceability, alpha=0.05)
ax2.set(title='danceability')
ax3.scatter(scatter_data.popularity, scatter_data.duration_ms, alpha=0.05)
ax3.set(title='duration_ms')
ax4.scatter(scatter_data.popularity, scatter_data.energy, alpha=0.05)
ax4.set(title='energy')
ax5.scatter(scatter_data.popularity, scatter_data.instrumentalness, alpha=0.05)
ax5.set(title='instrumentalness')
ax6.scatter(scatter_data.popularity, scatter_data.liveness, alpha=0.05)
ax6.set(title='liveness')
ax7.scatter(scatter_data.popularity, scatter_data.loudness, alpha=0.05)
ax7.set(title='loudness')
ax8.scatter(scatter_data.popularity, scatter_data.speechiness, alpha=0.05)
ax8.set(title='speechiness')
ax9.scatter(scatter_data.popularity, scatter_data.tempo, alpha=0.05)
ax9.set(title='tempo')
ax10.scatter(scatter_data.popularity, scatter_data.valence, alpha=0.05)
ax10.set(title='valence')
ax11.scatter(scatter_data.popularity, scatter_data.key, alpha=0.05)
ax11.set(title='key')
ax12.scatter(scatter_data.popularity, scatter_data.popularity, alpha=0.05)
ax12.set(title='popularity')

plt.subplots_adjust(hspace=0.5, wspace=0.3)
plt.suptitle('Relationship between Popularity and ______')

plt.show()
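
When a grid has many panels, one possible alternative to naming every axis is to keep the whole grid in a single variable and loop over it. The sketch below assumes synthetic data and made-up column names (`feature_a` to `feature_d`) in place of the Spotify columns.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-ins for dataset columns (names are hypothetical)
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 100, size=200)
columns = {
    'feature_a': rng.normal(size=200),
    'feature_b': rng.normal(size=200),
    'feature_c': rng.normal(size=200),
    'feature_d': rng.normal(size=200),
}

# axes is a 2x2 array of axis objects; axes.flat visits them one by one
fig, axes = plt.subplots(2, 2, figsize=(6, 5))
for ax, (name, values) in zip(axes.flat, columns.items()):
    ax.scatter(x, values, alpha=0.3)
    ax.set(title=name)

fig.subplots_adjust(hspace=0.5, wspace=0.3)
fig.suptitle('Four scatterplots drawn in a loop')
plt.show()
```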

# Pie chart

A pie chart represents data as proportions of a whole.

**`plt.pie()`** takes the following parameters:

- `x`: the numeric variable shown as pieces of the pie. The function adds up all the x values to compare part and whole and auto-generate the pieces of the pie.
- `labels`: the variable used to label each section of the pie chart
- `startangle`: the rotation of the pie chart, adjusted to improve readability
- `colors`: an array of colors the chart will cycle through. If blank, defaults to matplotlib’s default 10-color “Tableau” palette.
- more: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html

`x` is the only required parameter here, but the others can all help make the pie chart more readable. 
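
For illustration, a pie chart using all four parameters might look like the sketch below. The budget categories, amounts, and colors are invented for this example.

```python
from matplotlib import pyplot as plt

# Hypothetical monthly budget, made up for this illustration
categories = ['Rent', 'Food', 'Transport', 'Other']
amounts = [50, 25, 15, 10]

# The slices are sized by each amount's share of the total
wedges, texts = plt.pie(x=amounts, labels=categories, startangle=90,
                        colors=['steelblue', 'orange', 'seagreen', 'lightgray'])

plt.title('Monthly budget breakdown')
plt.show()
```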

***Your turn:*** **Exercise 6**

For the pie chart below add labels from the *'EV_market_share.csv'* file and set the start angle equal to 60.

In [None]:
# load dataset
pie_data = pd.read_csv('EV_market_share.csv')
pie_data.head()

In [None]:
## YOUR SOLUTION HERE ##

plt.pie(pie_data.number_sold)
plt.title('Top 10 EVs sold in the US by count, 2021')
plt.show()

# Histogram
A histogram represents the distribution of a dataset.

**`plt.hist()`** takes the following parameters:

- `x`: the value being distributed – note that it should not be an aggregated value
- `bins`: how many bins to make (e.g. 10); bins are the intervals into which the data is divided to construct the histogram
- `range`: the lower and upper range of the bins. If unspecified, set to the min and max values for x
- `color`: sets the color of the bars
- more: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

As with most of our other graph functions, only the data portion (`x`) is strictly necessary, but adjusting other parameters makes a much more useful graph.
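
Here is a sketch of these parameters working together. The heights are randomly generated for the illustration and do not come from a real dataset.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic, un-aggregated values: one entry per (hypothetical) person
rng = np.random.default_rng(seed=2)
heights_cm = rng.normal(loc=170, scale=8, size=1000)

# 12 equal-width bins spread between 140 and 200
counts, bin_edges, patches = plt.hist(x=heights_cm, bins=12,
                                      range=(140, 200), color='slateblue')

plt.xlabel('Height (cm)')
plt.ylabel('Number of people')
plt.show()
```

Values that fall outside `range` are left out of the histogram, so it is worth checking the minimum and maximum of the data before choosing it.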

***Your turn:*** **Exercise 7**

Run the code to check out the first few rows of hist_data to see what we’re working with. (Note: carapace_length refers to the length of the lobster’s shell.)

In [None]:
# load dataset
hist_data = pd.read_csv('lobster_dist.csv')
hist_data.head()

**Exercise**


1. Now that we've looked at some of the data in table form, let's see it as a histogram. Write the code to make a histogram. Set the bins to 10.

2. Play with the number of bins. Let's also make this graph more readable: title the graph "Lobsters tagged by size", and label the x and y axes "Carapace length (mm)" and "Number tagged", respectively.

3. Let's give the distribution a little breathing room on either side by specifying a range of (75, 155) - just slightly outside of our minimum and maximum x-values. Additionally, set the histogram color to 'gold'... because why not?

In [None]:
## YOUR SOLUTION HERE ##
plt.hist(hist_data.carapace_length, bins=20)
plt.show()

# Add AB lines

An **AB line** is a straight line added to a graph (i.e. “from point A to point B”). It can be vertical, horizontal, or diagonal. Generally, an AB line helps to demarcate one area from another or provide context that helps the viewer make sense of the data in the graph. In matplotlib, the function **`plt.axvline()`** makes a vertical AB line, and **`plt.axhline()`** makes a horizontal AB line. **`plt.axvline()`** takes the following parameters:

- `x`: where to position the line along the x-axis
- `ymin`: where the line starts, as a fraction of the plot height from 0 (bottom) to 1 (top). Usually set to 0
- `ymax`: where the line ends, on the same 0-to-1 scale. Usually set to 1
- `linewidth`: line width of the AB line
- `dashes`: dash pattern given as (line_length, space_length)
- `color`: color of the AB line

**`plt.axhline()`** functions exactly the same, but with `y`, `xmin`, and `xmax` parameters to determine position.
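
A short sketch of both functions on a histogram follows. The exam scores and the two threshold values are invented for the illustration.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic exam scores, made up for this illustration
rng = np.random.default_rng(seed=3)
scores = rng.normal(loc=70, scale=10, size=500)

plt.hist(scores, bins=20, color='lightgray')

# Vertical line at a hypothetical pass mark, spanning the full plot height
vline = plt.axvline(x=60, ymin=0, ymax=1, linewidth=2,
                    dashes=(4, 2), color='crimson')

# Horizontal line at a hypothetical count, spanning the full plot width
hline = plt.axhline(y=40, xmin=0, xmax=1, linewidth=1, color='black')

plt.show()
```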


We can easily add **annotations** to a graph (or label an AB line) using **`plt.annotate()`**, which allows us to position and format text on a graph. To split an annotation over two or more lines, we can simply add line breaks to the annotation text using \n.

**`plt.annotate()`** takes the following arguments:

- `text`: annotation text
- `xy`: (x, y) coordinate position for annotation
- `color`: color of annotation text and arrow
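
As a minimal sketch, an annotation with a line break could be added to a simple line chart like this. The data points and the annotation position are made up for the example.

```python
from matplotlib import pyplot as plt

# Hypothetical data with a peak at x = 3
plt.plot([1, 2, 3, 4, 5], [3, 5, 9, 6, 4])

# The \n splits the annotation text over two lines
note = plt.annotate('Peak\nvalue', xy=(3.1, 8.5), color='darkgreen')

plt.show()
```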

**Exercise**

1. Copy the last histogram.
2. Let's mark the minimum and maximum legal catch sizes for lobsters: 83mm minimum, and 127mm maximum. Place one vertical AB line at each of these numbers. Set the `linewidth` to `2`, `dashes` equal to `(1,2)`, and `color` to `mediumblue`. 
3. Nice! Now we can see clearly that the first peak in the population distribution happens _just_ before lobsters become big enough to be legally caught. Interesting! Let's annotate these lines to make that connection more obvious. Add an annotation for 'Minimum legal catch size' at `x = 84mm` and `y = 23`, and one for 'Maximum legal catch size' at `x = 128mm` and `y = 7`.
4. So close! Let's fix up the spacing on the second annotation to keep the graph tidy. Split the ‘Maximum legal catch size’ annotation into two lines using a line break, \n, instead of the space between the words ‘legal’ and ‘catch’.

In [None]:
## YOUR SOLUTION HERE ##



**Bibliography**
1. https://www.codecademy.com/
2. https://matplotlib.org/
3. https://www.geeksforgeeks.org/
4. https://www.earthdatascience.org/