<a href="https://colab.research.google.com/github/PrincetonUniversity/python-data-vis/blob/main/python_data_vis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ‼️ Please save a copy of this notebook in Google Drive (under the `File` tab) to use

# Data Visualization in Python

Welcome to the Data Visualization in Python workshop at the [Princeton Institute for Computational Science & Engineering](https://researchcomputing.princeton.edu/about/about-picscie) (PICSciE), [OIT](https://oit.princeton.edu/) Research Computing and [Center for Statistics and Machine Learning](https://csml.princeton.edu/) (CSML) at Princeton University.

During this course we will learn how to interact with some of the most powerful Python packages for data visualization.

We will try to cover a lot and move quickly to show you how to generally use these packages and what's possible.

---

## Why Visualize Your Data?

There are two main reasons to visualize your data:

1. It helps you **understand** your data so that you can gain better insights.
2. It helps you **communicate** your findings to others.

This workshop will help with both of these goals!

---

## Setting up your environment in Google Colab or Jupyter Notebooks

For making plots we will be using *Matplotlib*, *Seaborn* and, if time permits, *Plotly*.

We will also use *Numpy* and *Pandas* for creating and reshaping matrices and data tables, but more on this in a moment.

### Option 1: Google Colab

For today's workshop we will be running the notebook remotely in a virtual machine through [Google Colab](https://colab.research.google.com), which already has all of these packages installed.

In [None]:
# Note that we have to type an exclamation mark (!) before the pip command. This exclamation mark allows us to execute a shell command in our Colab Notebook.
!pip install numpy matplotlib pandas seaborn plotly

### Option 2: Jupyter Notebook

If you would like to run this notebook locally (on your computer), you would have to install *Jupyter Notebooks* and some other Python packages before running this notebook. Depending on your Python installation you can install packages using either **conda install *packagename*** or **pip install *packagename***.

#### Steps:

1. Install [Conda](https://www.anaconda.com/download).
2. Create an environment with the required dependencies:
  ```
  $ conda create --name dataviz python=3.10 numpy scipy pandas matplotlib seaborn plotly ipython jupyter
  ```

3. Activate the environment:
  ```
  $ conda activate dataviz
  ```

4. Open this notebook with Jupyter:
  ```
  $ jupyter notebook python_data_vis.ipynb
  ```

## Numpy

Numpy and Pandas are not **directly** related to data visualization, but they are extremely important for data analysis and for reshaping data into the formats that various plotting functions expect, allowing us to quickly understand the features of large datasets!

### *Very brief* intro to Numpy:
- matrix data structures, potentially multidimensional
- numpy arrays faster + more compact than python lists
- numpy arrays must be homogeneous (e.g. numerical data only!)
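
To make the homogeneity point concrete, here is a small sketch (the variable names are illustrative only) of how Numpy picks a single data type for the whole array when the inputs are mixed.

```python
import numpy as np

# All integers: Numpy stores them with an integer dtype.
ints = np.array([1, 2, 3])

# Mixing integers and floats: every element is upcast to a float.
mixed = np.array([1, 2.5, 3])

# Mixing numbers and strings: every element is converted to a string.
strings = np.array([1, 'two', 3])

print(ints.dtype, mixed.dtype, strings.dtype)
```

Checking `.dtype` after building an array is a quick way to catch an accidental conversion to strings.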






In [None]:
import numpy as np

Let's make a simple array from a list.

In [None]:
a = np.array([1, 2, 3, 4])
a

Or from a list of lists, i.e. a matrix with rows and columns. Note the extra square brackets!



In [None]:
a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

# what is the "shape"? Rows and columns.
a.shape

We can look at **all** elements in the second ROW (the `:` is optional here).

In [None]:
a[1, :]

Or **all** elements in the second column.

In [None]:
a[:, 1]

We can "broadcast" arithmetic operations.

In [None]:
a * 2
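
Broadcasting also works between arrays of different shapes, not only between an array and a single number. The sketch below rebuilds the same matrix and combines it with a row vector and a column vector; `row` and `col` are names made up for this illustration.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])      # shape (3, 4)

row = np.array([10, 20, 30, 40])     # shape (4,): one value per column
col = np.array([[1], [10], [100]])   # shape (3, 1): one value per row

# The row vector is "stretched" down the 3 rows of a.
added = a + row

# The column vector is "stretched" across the 4 columns of a.
scaled = a * col

print(added)
print(scaled)
```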

Numpy has many useful built-in functions, such as summing each column (`axis=0` collapses the rows).

In [None]:
a.sum(axis=0)

Or summing each row (`axis=1` collapses the columns).

In [None]:
a.sum(axis=1)
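
The `axis` argument names the dimension that disappears from the result, which is easy to mix up. A minimal sketch, using the same matrix, that checks the shapes and also shows the `keepdims` option:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])   # shape (3, 4)

col_sums = a.sum(axis=0)   # axis 0 (rows) is collapsed: one sum per column
row_sums = a.sum(axis=1)   # axis 1 (columns) is collapsed: one sum per row
total = a.sum()            # no axis given: sum of every element

print(col_sums.shape, row_sums.shape)

# keepdims=True keeps the collapsed axis with length 1, so the result
# broadcasts cleanly against the original array.
row_fractions = a / a.sum(axis=1, keepdims=True)
```

Each row of `row_fractions` sums to 1, which is a common normalization step before plotting proportions.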

There are **many** other numpy functions that quickly do complex operations with little code. We barely scratched the surface here. For more, see this [Python for Data Science Cheat Sheet](https://web.itu.edu.tr/iguzel/files/Python_Cheat_Sheets.pdf) or the official [Numpy User Guide](https://numpy.org/doc/stable/user/).

---

## Pandas
- Built on Numpy
    - Uses more memory, slower indexing
- *Easy* to use with 2D data tables or spreadsheets with *labeled* columns and rows!
  - Has simple functions to reshape data, summarize information, and do common/useful tasks
  - See this [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for an overview of capabilities/functions
- Reshaping or filtering data with Pandas allows us to pass different pieces of information to plotting functions to understand different features of our data.

In [None]:
import pandas as pd

### Loading and reshaping data with Pandas

We can use the *Pandas* function `read_csv` to create a *Pandas DataFrame*. Here we will use a URL that everyone has access to, but you may simply replace it with the path to a file located on your computer.

In [None]:
url = "https://corgis-edu.github.io/corgis/datasets/csv/weather/weather.csv"
df = pd.read_csv(url)

# or download locally:
# !wget https://corgis-edu.github.io/corgis/datasets/csv/weather/weather.csv
# df = pd.read_csv('./weather.csv')
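
`read_csv` is not limited to URLs. As a rough sketch, the example below reads CSV text from an in-memory buffer so that it runs without network access; the three rows and the name `small_df` are invented for illustration and are not taken from the weather dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = """Station.State,Date.Month,Data.Wind.Speed
New Jersey,1,5.2
New Jersey,2,6.1
California,1,3.4
"""

# read_csv accepts a URL, a local path such as './weather.csv',
# or a file-like object such as this StringIO buffer.
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)
print(small_df.dtypes)
```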

This dataset contains weather information collected from weather stations across the US from January 2016 to January 2017.

We can learn much about these data using *Pandas* built-in functions to interact with this DataFrame.

Let's peek at the first 3 lines of the DataFrame with `.head()` and the column names with `.columns`.

In [None]:
df.head(3)

For huge datasets with many columns that can't all be displayed on the screen, it may be easier to just look at the columns.

In [None]:
df.columns

We can also ask Pandas to generate some descriptive statistics using `.describe()`.

In [None]:
df.describe()

We can remove the whitespaces or dots "." from our column names and replace them with underscores "_". This makes downstream analyses with DataFrames much easier!

In [None]:
df.columns = [col.replace('.', '_').replace(' ', '_') for col in df.columns]

# OR, the less pythonic, longer way to do it for those not familiar with list comprehensions:
#new_col_names = []
#for col in df.columns:
#  new_name = col.replace('.', '_').replace(' ', '_')
#  new_col_names.append( new_name )
#df.columns = new_col_names
df.columns
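
Another way to do the same renaming is the vectorized `.str.replace` method available on the column index. This is only a sketch on an empty toy DataFrame (`toy` is a made-up name), with column names written in the dotted style described above.

```python
import pandas as pd

toy = pd.DataFrame(columns=['Station.State', 'Data.Wind.Speed',
                            'Data.Temperature.Avg Temp'])

# The character class [. ] matches either a dot or a space.
toy.columns = toy.columns.str.replace(r'[. ]', '_', regex=True)

print(list(toy.columns))
```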


We can access columns **by name** using square brackets.

In [None]:
df['Data_Wind_Direction']

Or using dot notation, which autocompletes in Jupyter notebooks (very useful!).

Note: we **had** to change the column names for this to work, because they originally contained dots and spaces. A "." has a specific meaning in Python (attribute access), and a name containing a space is not a valid attribute name.

In [None]:
df.Data_Wind_Direction

We can check the unique entries of each column using the `.unique()` method.



In [None]:
print(df.Station_State.unique())
print(len(df.Station_State.unique()))

There should be 51 entries here (50 states + Puerto Rico), but this quick check shows Virginia and Delaware are represented both by their full names and their 2-letter abbreviations. Pandas functions and plotting (below) allow us to quickly do many sanity checks!

Let's confirm this by looking at the rows in which the State is VA instead of Virginia.

The `.loc` indexer is used to access a group of rows and columns by label(s) or a boolean array.

In [None]:
condition = df.Station_State == 'VA' # this is a Series of True/False statements
df.loc[condition].head(5) # select rows that correspond to True in 'condition' Series
# or simply
#df.loc[df.Station_State == 'VA'].head(5)
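
`.loc` can also combine several conditions and select columns at the same time. The sketch below uses a small invented table (the cities and wind speeds are placeholders, not values from the dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    'Station_State': ['VA', 'Virginia', 'DE', 'Delaware', 'VA'],
    'Station_City': ['Norfolk', 'Richmond', 'Wilmington', 'Dover', 'Roanoke'],
    'Data_Wind_Speed': [7.1, 5.3, 9.8, 6.4, 4.2],
})

is_va = toy.Station_State == 'VA'
is_windy = toy.Data_Wind_Speed > 5

# Rows where both conditions hold, restricted to two columns.
windy_va = toy.loc[is_va & is_windy, ['Station_City', 'Data_Wind_Speed']]

# isin() tests membership in a list of labels.
abbreviated = toy.loc[toy.Station_State.isin(['VA', 'DE'])]

print(windy_va)
print(len(abbreviated))
```

Conditions are combined with `&` (and) or `|` (or); when a comparison is written inline, it needs its own parentheses.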

We can rename the Station_State of *DE* and *VA* to *Delaware* and *Virginia*, respectively, using the `.replace()` method of Pandas DataFrames.

In [None]:
# Exact-match replacement restricted to the Station_State column.
# (A regex replacement could also rewrite values that merely contain 'DE' or 'VA'.)
df = df.replace({'Station_State': {'DE': 'Delaware', 'VA': 'Virginia'}})

Let's check the number of unique entries in the *Station_State* column again.

In [None]:
print('Number of unique states in the dataframe:', len(df.Station_State.unique()))

We will now focus on data visualization, but see the [Pandas documentation and tutorials](https://pandas.pydata.org/pandas-docs/stable/index.html) for more information about many other useful functions!

Tip: Use Google to search what you want to do with "pandas" in the query. Usually the first few search hits will have the answer. Memorization then comes with time.

---

## Matplotlib

[Matplotlib](http://matplotlib.org/) is one of the most common Python packages for creating 1-D and 2-D plots. Here we will cover how to use Matplotlib to create three of the most widely used plot types for data visualization: line plots, bar plots, and scatter plots. We will also go through an example of how to combine different plot types using the flexibility that the package offers.

We also encourage you to use the package [documentation](https://matplotlib.org/stable/) and its [examples](https://matplotlib.org/stable/gallery/index.html) for details and inspiration.

We will start by importing the *Matplotlib* package as:

In [None]:
import matplotlib.pyplot as plt  # This is equivalent to 'from matplotlib import pyplot as plt'.

Since we are running this interactive code through a Notebook (instead of running it in the Python console) we will have to activate some flags to print out the *Matplotlib* graphical outputs:

In [None]:
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid') # common style

plt.style.use('fivethirtyeight') # cleaner aesthetics
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.edgecolor'] = 'white'
plt.rcParams['savefig.facecolor'] = 'white'
plt.rcParams['savefig.edgecolor'] = 'white'

---

### Line Plots

There are essentially two approaches to use Matplotlib:

1. **Function-based Approach**: rely on pyplot to automatically create and manage the figures and axes, and then use pyplot functions for plotting.
2. **Object-oriented Approach**: explicitly create figures and axes, and call methods on them.

We will start with a basic example to understand the syntax that is used to create figures in Matplotlib using the **function-based approach**.

In [None]:
# Generate some data.
x = [0., 2., 3., 4.]
y = [5., 2., 5., 4.]

# We can make a figure using:
plt.figure()

# We can then insert a 1D line plot as:
plt.plot(x, y, c='orange')

# We can set a title using the command:
plt.title('Simple Plot')

# And show the result using:
plt.show()

---

Now that we know the basics, we can go a bit further and improve our plots using different properties of the functions that are built in *Matplotlib*. For instance, we can add more data or change properties of the plots (e.g. figure size, resolution, colors, line styles...). We will also learn how to save our figures so we can use them in combination with other software such as *Adobe Illustrator®* ([available free to students, faculty, and staff](https://oit.princeton.edu/creativecloud)), *Affinity Designer* or *Inkscape*.

In [None]:
# Create your data.
x  = [1, 2, 3, 4, 5]
y1 = [5, 3, 5, 3, 5]
y2 = [2, 4, 6, 4, 2]

# Create a new figure.
plt.figure(figsize=(5.0, 3.0), dpi=100)

# Add plots in the figure (you can set a name to each series).
plt.plot(x, y1, label="Label A", color='green')
plt.plot(x, y2, label="Label B", color='#B2C144', lw=2.0, ls='--')

# Add title, axis labels, legend, etc.
plt.title("New example " + r'$\beta_1$')
plt.xlabel("x axis", size=12)
plt.ylabel(r'$\nu$', size=20)

plt.legend(loc=0)  # Automatic position using 'loc=0' (default).

# Save your figure using plt.savefig('filename.svg').
plt.savefig('./my_fig2.svg', dpi=300)

# Matplotlib Colors: https://matplotlib.org/stable/gallery/color/named_colors.html
# Hexadecimal Colors: https://htmlcolorcodes.com/color-picker/
# LaTeX Math notation: http://tug.ctan.org/info/undergradmath/undergradmath.pdf
# Matplotlib Linestyles: https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html

In [None]:
!ls

---

We can also generate data systematically using a function. Here we will use some built in functions from *Numpy*. We will import *Numpy* using the alias **np** since we will use this package very often.

In [None]:
import numpy as np

We can use *Numpy* to generate an array with equally spaced data-points in a certain range using the `np.arange` function. You may also find the parametrization in `np.linspace` more convenient for the same purposes.

In [None]:
x = np.arange(0, 10, 0.1)  # start, end (not included), step size.
x

x = np.linspace(0, 10, 101) # start, end (included), number of points.
x
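
The two functions differ in how they treat the end point, which the following sketch makes explicit. It uses a step of 0.5 so that the numbers are exact; `by_step` and `by_count` are illustrative names.

```python
import numpy as np

# arange: fixed step size; the end point is NOT included.
by_step = np.arange(0, 10, 0.5)

# linspace: fixed number of points; the end point IS included by default.
# retstep=True additionally returns the spacing between points.
by_count, step = np.linspace(0, 10, 21, retstep=True)

print(len(by_step), by_step[-1])    # 20 values, the last one is 9.5
print(len(by_count), by_count[-1])  # 21 values, the last one is 10.0
print(step)                         # 0.5
```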

Then we can pass the generated data to a Numpy function, in this case the cosine function. Like most others in Numpy, this function is _vectorized_, meaning that it operates on and returns arrays.

In [None]:
y = np.cos(x)
y
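
To see what "vectorized" means in practice, this sketch compares one call to `np.cos` with an explicit Python loop over the same values (`angles`, `vectorized` and `looped` are names chosen for the example).

```python
import math
import numpy as np

angles = np.linspace(0, 10, 101)

# Vectorized: a single call processes the whole array.
vectorized = np.cos(angles)

# Equivalent pure-Python loop, one element at a time.
looped = [math.cos(value) for value in angles]

print(np.allclose(vectorized, looped))
```

Both give the same numbers; the vectorized form is shorter and, for large arrays, usually much faster.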

Now we can use the generated data to plot the $\cos(a)$ function.

In [None]:
plt.figure()
plt.plot(x, y)

# Exercise 1. Set labels "a" and "f(a)" for the x and y axes.
# Exercise 2. Change the line color to green and set the linewidth to 2.0.
# Exercise 3. Add another plot representing sin(a).

---

### Subplots

Recap: There are essentially two approaches to use Matplotlib:

1. **Function-based Approach**: rely on pyplot to automatically create and manage the figures and axes, and then use pyplot functions for plotting.
2. **Object-oriented Approach**: explicitly create figures and axes, and call methods on them.

In the examples that we showed you before we used the function-based approach, which is simpler and easier to use for small, uncomplicated plots. However, for larger, more involved plots, we prefer the second approach, as it offers far more flexibility.

Here we will start by creating a *figure* and an *axes* Object, and then we will use these Objects to draw a 1-D Line Plot.

In [None]:
x  = [1, 2, 3, 4, 5]
y1 = [5, 3, 5, 3, 5]
y2 = [2, 4, 6, 4, 2]

fig, ax = plt.subplots()

fig.set_size_inches(8.0, 5.0)

ax.plot(x, y1)
ax.plot(x, y2)

# Careful! The syntax changes when using axes Objects in Matplotlib.
# Before we were using plt.xlabel('xlabel'). Now we need to
# prefix these calls with "set_".

ax.set_xlabel('x label')
ax.set_ylabel('y label')

ax.set_xlim(1.5, 4.5)
ax.set_ylim(2.0, 6.5)

plt.show()
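
As a possible shortcut, axes Objects also have a general `set` method that takes several properties as keyword arguments in one call. A minimal sketch with the same labels and limits (`fig_demo` and `ax_demo` are illustrative names):

```python
import matplotlib.pyplot as plt

fig_demo, ax_demo = plt.subplots(figsize=(8.0, 5.0))
ax_demo.plot([1, 2, 3, 4, 5], [5, 3, 5, 3, 5])

# One call to set() replaces several individual set_* calls.
ax_demo.set(xlabel='x label', ylabel='y label',
            xlim=(1.5, 4.5), ylim=(2.0, 6.5),
            title='Configured with ax.set()')
```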

Why should we use *axes Objects* then? Well, one of the motivations for creating *axes Objects* is that they offer an intuitive way of generating subplots in a given *figure*.

In [None]:
# Data.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [0, 3, 6, 9, 12, 15]
y1 = [-1, -2, -3, -4, -5, 0]
y2 = [1, 2, 3, 4, 5, 0]

# Create figure and axes Objects.
fig, axs = plt.subplots(nrows=1, ncols=2)

# Set figure size.
fig.set_size_inches(7., 3)

# We can use the index of the subplots to draw specific plots in these axes.
axs[0].plot(x1, y1, color='orange')  # Axes

axs[1].plot(x2, y2, color='blue')

# We can specify the axis labels in the same way.
axs[0].set_xlabel('x1')
axs[0].set_ylabel('y1')
axs[1].set_xlabel('x2')
axs[1].set_ylabel('y2')


We can even go further and create a grid of plots using the *axis* index notation.

In [None]:
fig, axarr = plt.subplots(nrows=2, ncols=2)
axarr[0][0].plot(x1, y1, 'orange', label='Line A')
axarr[0][1].plot(x2, y2, 'blue', label='Line B')
axarr[1][0].plot(x1, y2, 'red', label='Line C')
axarr[1][1].plot(x2, y1, 'olive', label='Line D')

fig.set_size_inches(5.0, 5.0)

fig.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
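
When every panel gets the same treatment, indexing each one by hand becomes repetitive. One option, sketched here with invented data, is to loop over the axes array with `.flat` and let `sharex`/`sharey` keep the scales comparable.

```python
import matplotlib.pyplot as plt

x_demo = [0, 1, 2, 3, 4, 5]
series = {
    'Line A': [-1, -2, -3, -4, -5, 0],
    'Line B': [1, 2, 3, 4, 5, 0],
    'Line C': [0, 1, 0, 1, 0, 1],
    'Line D': [5, 4, 3, 2, 1, 0],
}

# sharex/sharey give every panel the same axis limits.
fig_grid, axs_grid = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True,
                                  figsize=(5.0, 5.0))

# .flat walks the 2x2 array of axes row by row.
for ax_i, (name, values) in zip(axs_grid.flat, series.items()):
    ax_i.plot(x_demo, values)
    ax_i.set_title(name)

fig_grid.tight_layout()
```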

We can also build more complex figures by placing subplots of different shapes and sizes on a common grid. For that, we will use the `subplot2grid` function.

In [None]:
def random_data():
  x = np.cos(np.linspace(0, 100, 1000))
  x *= np.random.randint(0, 50, 1)
  y = np.cos(x) * np.sin(x)
  return x, y

fig = plt.figure()

ax1 = plt.subplot2grid((6, 2), (0, 0), rowspan=1, colspan=2)
ax2 = plt.subplot2grid((6, 2), (1, 0), rowspan=3, colspan=2)
ax3 = plt.subplot2grid((6, 2), (4, 0), rowspan=2, colspan=1)
ax4 = plt.subplot2grid((6, 2), (4, 1), rowspan=2, colspan=1)

x, y = random_data()
ax1.plot(x, y, color='blue')

x, y = random_data()
ax2.plot(x, y, color='navy')

x, y = random_data()
ax3.plot(x, y, color='purple')

x, y = random_data()
ax4.plot(x, y, color='lightblue')
plt.tight_layout()
plt.show()
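
The same layout can also be expressed in the object-oriented style with a grid specification, where slices play the role of `rowspan` and `colspan`. This is a sketch of an alternative, not a replacement for the cell above; the axes are left empty and their names are illustrative.

```python
import matplotlib.pyplot as plt

fig_gs = plt.figure()
gs = fig_gs.add_gridspec(nrows=6, ncols=2)

# Slices of the grid select which cells each subplot occupies.
top = fig_gs.add_subplot(gs[0, :])       # row 0, both columns
middle = fig_gs.add_subplot(gs[1:4, :])  # rows 1 to 3, both columns
left = fig_gs.add_subplot(gs[4:, 0])     # rows 4 and 5, left column
right = fig_gs.add_subplot(gs[4:, 1])    # rows 4 and 5, right column

fig_gs.tight_layout()
```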

---

### Bar Plots

In [None]:
x = ['Ruby', 'Python', 'C++', 'Java', 'Perl', 'JavaScript', 'C#', 'PHP']
y1 = [108065, 103492, 101561, 99951, 99857, 95902, 94653, 93169]
y2 = [121253, 119082, 101440, 104544, 107698, 117718, 76271, 93828]

plt.figure()

plt.bar(x, y1, label='2017 Average salary (in $)')
plt.bar(x, y2, label='2020 Average salary (in $)', color='orange', alpha=0.2)
plt.legend()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(15, 3.)

ax[0].bar(x, y1, label='2017')
ax[1].bar(x, y2, label='2020', color='orange')
ax[0].set_title('2017 Average salary (in $)')
ax[1].set_title('2020 Average salary (in $)')

ax[0].set_ylim(0, 125e3)
ax[1].set_ylim(0, 125e3)

In [None]:
x_positions = np.arange(0, len(x))

plt.figure(figsize=(12, 3))

plt.bar(x_positions-0.2, y1, width=0.4, label='2017')
plt.bar(x_positions+0.2, y2, width=0.4, label='2020', color='orange')

plt.xticks(x_positions, labels=x)

plt.legend()
plt.show()

---

### Scatter Plots

**Note**: Line plots are typically used for representing continuous functions. In contrast, scatter plots are used to show relationships between two sets of data without implicitly stating the ordering and connection between the data-points.

In [None]:
xvals = [0.0, 1.0, 4.3, 2.0, 3.5, 2.5, 0.5, 5.0, 6.0, 7.0]
yvals = [0.0126, 0.031, 0.072, 0.056, 0.079, 0.068, 0.020, 0.056, 0.031, 0.0125]

fig, ax = plt.subplots()
fig.set_size_inches(11, 7)

for xval, yval in zip(xvals, yvals):
  ax.scatter(xval, yval, s=803)


In [None]:
import random

fig, ax = plt.subplots()
fig.set_size_inches(11, 7)

for _ in range(500):
  xval = random.gauss(0, 1)
  yval = 2 * xval + 3 + random.gauss(0, 1.3)

  ax.scatter(xval, yval, s=103, alpha=0.41)
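
The loop above calls `scatter` once per point, which gives each point its own color from the color cycle. For larger datasets it is usually simpler to pass whole arrays in a single call. The sketch below assumes Numpy's random generator instead of the `random` module and colors the points by their y value; the seed and variable names are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)  # the seed value is arbitrary

x_pts = rng.normal(0, 1, size=500)
y_pts = 2 * x_pts + 3 + rng.normal(0, 1.3, size=500)

fig_sc, ax_sc = plt.subplots(figsize=(11, 7))

# A single call draws all 500 points; c=... maps each y value to a color.
points = ax_sc.scatter(x_pts, y_pts, s=103, alpha=0.41, c=y_pts, cmap='viridis')
fig_sc.colorbar(points, ax=ax_sc, label='y value')
```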


It is often useful to annotate scatter points with information on what a given point represents. For this we can use colours, shapes, labels, etc. In this case, we will use the flexibility of matplotlib to create a scatterpie plot, in which each scatter point is its own pie graph.

For this we will make use of the `inset_axes` function, which allows us to create a subplot of arbitrary size and position anywhere within an existing plot.

In [None]:
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

fig, ax = plt.subplots()
fig.set_size_inches(13, 13)

# fix the randomness so that we can make the same random plot each time
random.seed(3333)

for _ in range(13):
  xval = random.gauss(0, 1)
  yval = 2 * xval + 3 + random.gauss(0, 1.3)

  # choose the properties of the pie chart
  pie_clr = random.random(), random.random(), random.random()
  pie_size = random.uniform(0.5, 2.5)
  pie_prop = random.uniform(0.1, 0.9)

  # define the area the pie plot will cover by stating its lower-left corner and
  # its dimensions
  pie_bbox = (xval - pie_size / 2, yval - pie_size / 2,
              pie_size, pie_size)

  # create a new axis for the pie chart subplot
  pie_ax = inset_axes(ax, width='100%', height='100%',
                      bbox_to_anchor=pie_bbox, bbox_transform=ax.transData,
                      axes_kwargs=dict(aspect='equal'), borderpad=0)

  # add the pie chart to the subplot
  pie_ax.pie(x=[pie_prop, 1 - pie_prop],
             colors=[pie_clr + (0.91, ), pie_clr + (0.41, )],
             explode=[0.29, 0], startangle=90)

# clean up plot aesthetics
ax.set_xlim(-3, 3)
ax.set_ylim(-5, 13)
ax.grid(lw=0.71, alpha=0.43)


## Seaborn

- Often used along with matplotlib
- Integrates with Pandas and 2D Data Frames
- Simple syntax, complex plots with statistics: perfect for exploratory data analysis or even publishable figures

In [None]:
import seaborn as sns

### Distribution Plots

Now we will use *Seaborn* to analyze our data in a visual manner.
We can start by plotting a distribution plot for a given column (e.g. the distribution of the variable *Wind Speed*).

In [None]:
g = sns.histplot(df['Data_Wind_Speed'])

We can select the number of bins to represent our data using the `bins` argument.  We can also set some limits for our plot using the `set` function and the `xlim` argument.

In [None]:
g = sns.histplot(df.Data_Wind_Speed, bins=50)
g.set(xlim=(0, 25))

Very often it's useful to print out the minimum and maximum values for a given column to get an idea of the range of our data.
One can use the `sort_values` function to sort the values in ascending order. In addition, we can use the `min` and `max` functions to get the bounds of our data.

In [None]:
print(df.Data_Wind_Speed.min())
print(df.Data_Wind_Speed.max())
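
For reference, this is roughly how `sort_values` relates to `min` and `max`, shown on a short made-up Series rather than the full dataset.

```python
import pandas as pd

speeds = pd.Series([4.2, 31.0, 0.0, 7.5, 12.3], name='Data_Wind_Speed')

ascending = speeds.sort_values()                  # smallest value first
descending = speeds.sort_values(ascending=False)  # largest value first

# The first element of each sorted Series matches min() and max().
print(ascending.iloc[0], speeds.min())
print(descending.iloc[0], speeds.max())
```

Sorting is helpful when you want to inspect the few most extreme rows, not just the single extreme value.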

We can filter our dataset with a conditional statement and create a new dataset from the rows that satisfy the criteria we impose.
Here, we will create a second dataset (*df2*) with the entries that contain *Wind Speed* values below 25 units.

In [None]:
df2 = df[df.Data_Wind_Speed < 25]
print(len(df), len(df2))

We can observe that the maximum *Wind Speed* value in our new subset (*df2*) is below 25 units.

In [None]:
df2.Data_Wind_Speed.max()
# or df[df.Data_Wind_Speed < 25].Data_Wind_Speed.max() if you don't want to create a new data frame

Let's plot a new distribution plot for *Wind Speed* but now using our new subset (*df2*). No need to change axes now!

In [None]:
sns.histplot(df2.Data_Wind_Speed, bins=50)
print(f"Average wind speed: {np.average(df2.Data_Wind_Speed):0.4f}")

We can plot histograms on top of each other to visualize differences between groups of data.

In [None]:
# Distribution of the Max. Temperature in the U.S.A.
# Try modifying this code to use stat="density" so the total area under
# the histogram is 1. This is useful for comparing distributions!
sns.histplot(df.Data_Temperature_Max_Temp,
            stat="count",
            binwidth=5)
sns.histplot(df.Data_Temperature_Max_Temp[df.Station_State == "California"],
            stat="count",
            color="orange",
            binwidth=5)
plt.title('Distribution max temperature')
sns.despine()
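
To illustrate what a density normalization does to the bar heights, here is a sketch using `np.histogram` on randomly generated values (not the weather data). It assumes the usual definition, in which each bar height is the count divided by the total count and by the bin width.

```python
import numpy as np

rng = np.random.default_rng(seed=1)               # arbitrary seed
temps = rng.normal(loc=60, scale=15, size=1000)   # made-up "temperatures"

edges = np.arange(0, 125, 5)                      # bins of width 5
counts, _ = np.histogram(temps, bins=edges)
density, _ = np.histogram(temps, bins=edges, density=True)

# Area = sum of (bar height x bin width); it equals 1 for a density.
area = (density * np.diff(edges)).sum()
print(counts.sum(), area)
```

Because the area is always 1, two groups of very different sizes can be compared on the same vertical scale.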

---

### Joint Plots

We can also show a Joint plot for two different variables. For instance, `Data_Temperature_Min_Temp` and `Data_Temperature_Max_Temp`.

Seaborn easily generates multi-part plots! Doing this with Matplotlib would require more code.

In [None]:
sns.jointplot(data=df,
                x='Data_Temperature_Min_Temp',
                y='Data_Temperature_Max_Temp',
                alpha=0.1)

It seems that we have a strange value at the 60/60 position.
Let's interrogate our dataset...

In [None]:
df60 = df[(df['Data_Temperature_Max_Temp'] == 60) & (df['Data_Temperature_Min_Temp'] == 60) ]
df60.head()
# Note the use of parentheses to specify more than 1 condition to filter rows

We realize that some stations in Alaska contain a 60 value for the Avg, Max, Min temperature columns...that's weird – especially if we consider that some temperature recordings were made in January.
If we search on the internet we will find out that the average temperature in Alaska in January 2016 was 27°F.
So it seems that these entries could potentially be wrong.

This was a great example to show how visualization tools help to understand and find anomalies in our data.



---

### Pair Plots

A _pair plot_ draws pairwise relationships in a dataset, so one can use it to analyze the distribution of multiple variables at the same time.

We start by creating a new subset of our data by selecting some columns. For instance we can select the columns that contain temperature information.

In [None]:
dfTemp = df[['Data_Temperature_Avg_Temp', 'Data_Temperature_Max_Temp', 'Data_Temperature_Min_Temp']]

Let's take a look at our new subset.

In [None]:
dfTemp.head(5)

Now, we can plot a joint plot for each pair of columns in the temperatures subset using the *Seaborn* `pairplot` built-in function.
Note: this may be computationally expensive.


In [None]:
# the 'fivethirtyeight' theme set above doesn't play well with pairplot! Let's revert to defaults!
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

sns.pairplot(dfTemp, plot_kws={'alpha': 0.5})

Since the temperature data above fall into different categories (e.g. from year 2016 or 2017), we can also keep track of these categories with different colors using the hue argument.

Check color palettes in: https://seaborn.pydata.org/tutorial/color_palettes.html


In [None]:
dfTemp = df[['Date_Year', 'Data_Temperature_Avg_Temp', 'Data_Temperature_Max_Temp', 'Data_Temperature_Min_Temp']]
sns.pairplot(dfTemp, hue='Date_Year', palette='bright')

---

### Bar Plots

We can build Bar Plots with the values in our *Pandas* dataframe using the *Seaborn* `barplot` built-in function.
Here we decide to plot the maximum temperature (see `Data_Temperature_Max_Temp` column) for the different months of the year.

In [None]:
sns.barplot(x='Date_Month', y='Data_Temperature_Max_Temp', data=df)

Note that the *Seaborn* `barplot` function provides (by default) the average values over a given variable. In this example it computes the average of the `Data_Temperature_Max_Temp` values for a given month (these include different weeks, stations, states, ...).

We can also define specific estimators by passing a function to the `estimator` parameter of the *Seaborn* `barplot` method. For instance, we can ask *Seaborn* to compute the standard deviation of the `Data_Temperature_Max_Temp` values for a given month of the year.

In [None]:
sns.barplot(x='Date_Month', y='Data_Temperature_Max_Temp', data=df, estimator=np.std)
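
The numbers behind such bars can be cross-checked with a plain Pandas group-by. The sketch below uses a tiny invented table; note that Pandas' `std` is the sample standard deviation (`ddof=1`), whereas `np.std` defaults to `ddof=0`, so the two can differ slightly.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date_Month': [1, 1, 1, 7, 7, 7],
    'Data_Temperature_Max_Temp': [30, 40, 50, 80, 90, 100],
})

# One row per month, with the mean and the standard deviation of the group.
summary = (toy.groupby('Date_Month')['Data_Temperature_Max_Temp']
              .agg(['mean', 'std']))
print(summary)
```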

---

### Box Plots

A box plot (or box-and-whisker plot) shows the distribution of data and allows us to compare across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range (by default `whis=1.5`, proportion of the IQR past the low and high quartiles to extend the plot whiskers).

In [None]:
sns.boxplot(x='Date_Month', y='Data_Temperature_Avg_Temp', data=df,
            whis=1.5)
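
The whisker rule can be reproduced by hand. This sketch applies the 1.5 x IQR convention to ten invented values using Numpy's default percentile interpolation; the names `lower_fence` and `upper_fence` are ours, not Seaborn's.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(values, [25, 75])   # lower and upper quartiles
iqr = q3 - q1                              # inter-quartile range
whis = 1.5

lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# Points beyond the fences are drawn individually as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers stop at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, q3, outliers)
```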

If we pass the `Date_Year` column name to the `hue` argument, we can create box plots that directly compare the distributions across years for each month. We can select a color scheme passing a `palette` argument.

This is a decently complicated plot created using minimal code!

In [None]:
sns.boxplot(x='Date_Month', y='Data_Temperature_Avg_Temp',
            data=df, hue='Date_Year', palette='coolwarm')

Here we realize that our dataset is almost limited to data collected in 2016 (in 2017 we only have entries of data-points collected in January).

Again, representing our data in a visual manner allows us to uncover information about our dataset that we would have missed if we had only checked the tables.


---

### Violin Plots

Violin plots show similar information as the box plots – showing the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable – but they contain extra information. Violin Plots also show the probability distribution of our data, which is great to check whether our data is skewed.

In [None]:
sns.violinplot(x='Date_Month', y='Data_Temperature_Avg_Temp', data=df)

Here we can also pass the `hue` argument to show different series on different sides of the Violin Plots.

In [None]:
sns.violinplot(x='Date_Month', y='Data_Temperature_Avg_Temp', data=df, hue='Date_Year', split=True)

---

### Strip Plots

Strip Plots are useful to see the distribution of all the observations of one variable at a glance.

In [None]:
sns.stripplot(x='Date_Month', y='Data_Temperature_Avg_Temp', hue='Date_Month', palette='husl', legend=False, data=df)

A strip plot can be drawn on its own, but it is also a good complement to box or violin plots, especially in those cases in which showing the underlying distribution of the data is important.

In [None]:
# We can now apply what we learnt from Matplotlib to make a customized plot.

fig, ax = plt.subplots()
fig.set_size_inches(12, 6)

sns.boxplot(x='Date_Month', y='Data_Temperature_Avg_Temp', data=df, ax=ax, zorder=100)
sns.stripplot(x='Date_Month', y='Data_Temperature_Avg_Temp', data=df, ax=ax, hue='Date_Month', palette='husl', legend=False, linewidth=0.5, alpha=0.5)

# We can change the labels of the "x" axis.
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Average Temperature')


In [None]:
# Exercise. Build a violin plot comparing the Avg. temperature in CA and NJ through the different months of the year.

# Step 1. Create a subset with the data from California and New Jersey.
# Step 2. Create a figure and axis using Matplotlib using the Object method.
# Step 3. Plot violin plot using the previous subset and Seaborn.

### Heatmap Plot of a Correlation Matrix

A _heatmap plot_ is a 2-D grid representation of data in which each value is shown as a color drawn from a numerical color scale.

A _correlation matrix_ is a table that shows the correlation coefficients between different variables.  We can visually represent these Correlation Matrices using heatmaps.

Here, we will start by creating a correlation matrix using the precipitation, temperature and wind variables of our dataset.

In [None]:
df_grid = df[['Data_Precipitation', 'Data_Temperature_Min_Temp',
              'Data_Temperature_Max_Temp', 'Data_Temperature_Avg_Temp',
              'Data_Wind_Direction', 'Data_Wind_Speed']]

data_corr = df_grid.corr()
data_corr
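
By default `.corr()` computes Pearson correlation coefficients between every pair of columns. As a sanity check, the sketch below compares it with `np.corrcoef` on a small invented table (the values are placeholders).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Data_Temperature_Min_Temp': [30.0, 42.0, 55.0, 61.0, 70.0],
    'Data_Temperature_Max_Temp': [45.0, 60.0, 71.0, 80.0, 92.0],
    'Data_Wind_Speed': [9.0, 4.5, 7.2, 3.1, 5.0],
})

pandas_corr = toy.corr()                                 # Pearson by default
numpy_corr = np.corrcoef(toy.to_numpy(), rowvar=False)   # columns are variables

print(pandas_corr.round(2))
```

The matrix is symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.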

Now we can pass the Correlation Matrix to the *Seaborn* `heatmap` function in order to get a visual representation of the different values in the matrix.
We can even print the value of each cell inside the heatmap by passing the argument `annot=True`.

In [None]:
sns.heatmap(data_corr, cmap='viridis', linewidths=2.0, linecolor='white',
            annot=True)

In [None]:
# Exercise.
# Plot in a single figure 3 Heatmap Plots containing the correlation matrices of 3 subsets.
# Only include precipitation, temperature and wind variables.
# (Subset 1): only including data from Florida.
# (Subset 2): only including data from Ohio.
# (Subset 3): only including data from New York.

# Guideline.
# Step 1. Create a subset including precipitation, temperature and wind variables using Pandas.

# Step 2. Create the different subsets of the previous subset using Pandas.

# Step 3. Create the correlation matrices for each subset.

# Step 4. Create a figure and axes for each subplot.

# Step 5. Plot the Correlation Heatmaps in each axes.

# Step 6. Set titles for each plot.


**Extra (Advanced Usage).** One of the benefits of creating plots using a programming language is the fact that one can automate some tasks that are repeated through the process of creating your figures.

In [None]:
df3 = df[['Data_Precipitation', 'Data_Temperature_Min_Temp',
          'Data_Temperature_Max_Temp', 'Data_Temperature_Avg_Temp',
          'Data_Wind_Direction', 'Data_Wind_Speed']]

states = ['California', 'Washington', 'Florida']
# states = list(df['Station_State'].unique())[0:5]

fig, ax = plt.subplots(nrows=1, ncols=len(states))
fig.set_size_inches(5 * len(states), 5)

for i in range(0 , len(states)):
  subset_i = df3[df['Station_State'] == states[i]]
  corr_i = subset_i.corr()
  sns.heatmap(corr_i, cmap='viridis', annot=True, ax=ax[i], cbar=False)
  ax[i].set_title(states[i])

plt.tight_layout()

---

## Interactive Plots with Plotly

*Plotly* is an interactive visualization library. Using *Plotly* we are going to be able to create plots directly from our *Pandas* DataFrames.
We can import *Plotly*:

In [None]:
import plotly.graph_objects as go

In [None]:
def random_cos(a):
  y = np.cos(a)
  y *= np.random.randint(0, 50, 1)
  return y

x = np.linspace(0, 10, 50)
y = random_cos(x)

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, name="Random Data", line = dict(width=3)))
fig.add_trace(go.Scatter(x=x, y=-y, name="Random Data 2", line = dict(width=3)))
fig.update_layout(title='Title', showlegend=True)

fig.show(renderer="colab")

If you would like to create other types of plots, please take a look at the [Plotly Cheat Sheet](https://images.plot.ly/plotly-documentation/images/python_cheat_sheet.pdf).

---

Plotly also offers an express mode. Users often describe it as: "*plotly.express is to plotly what seaborn is to matplotlib*."
This express mode is very intuitive and works in a similar way to *Seaborn*.
For that, we will need to import the express mode as follows:

In [None]:
import plotly.express as px

Now we can create a Scatter Plot using the *Plotly Express* `scatter` built-in function.

In [None]:
fig = px.scatter(x=x, y=y, render_mode='svg')  # render_mode is one of 'auto', 'svg' or 'webgl'
fig.add_scatter(x=x, y=-y)
fig.show(renderer='colab')

---

We can also make subplots, much as we did in the *Matplotlib* section.

In [None]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=2,
                    specs=[[{"type": "xy"}, {"type": "polar"}],
                           [{"type": "scatter"}, {"type": "scene"}]])


fig.add_bar(row=1, col=1, x=['A', 'B', 'C', 'D'], y=[3, 4, 1, 10])
fig.add_scatter(row=2, col=1, x=x, y=y, mode='markers')
fig.add_barpolar(row=1, col=2, theta=[0, 45, 90], r=[2, 3, 1])
fig.add_scatter3d(row=2, col=2, x=[1, 2, 3], y=[1, 2, 3], z=[1, 1, 1])

fig.update_layout(height=700, showlegend=True)


fig.show(renderer='colab')

---

In [None]:
x = np.random.randint(0, 20, 10)
y = np.random.randint(0, 20, 10)
X, Y = np.meshgrid(x, y)
Z = np.cos(X) * np.cos(Y)

To finish, we will create an interactive 3-D Scatter Plot with some randomly generated data.

In [None]:
def random3d():
  x = np.random.randint(0, 20, 10)
  y =  np.random.randint(0, 20, 10)
  X, Y = np.meshgrid(x, y)
  Z = np.cos(X) * np.cos(Y)
  a, b, c = [X.reshape(-1), Y.reshape(-1), Z.reshape(-1)]
  return [a, b, c]


a, b, c = random3d()
fig = px.scatter_3d(x=a, y=b, z=c)

for i in range(0, 3):
  a, b, c = random3d()
  fig.add_scatter3d(x=a, y=b, z=c, mode='markers')

# fig.update_layout(height=700, showlegend=False)
fig.show(renderer='colab')

## Hopefully you can extrapolate the information that we have covered in this session to your own projects. We will be happy to help you through that process, so please send us an email if you have any questions!

## Before you leave, please fill out the survey for the Data Visualization in Python session. Having feedback from our audience is essential for improving our future bootcamps and workshops. Please complete the survey.

## Thank you very much!!!