<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# Visualizations

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## DataFrames and plots

* the `pandas` library supports plotting a `DataFrame` object directly
* the plotting is built on top of `matplotlib`
* for more advanced plots, you can use `matplotlib` directly
* `seaborn` is another Python visualization library, also based on `matplotlib`
  * has more advanced plots, such as **heatmaps**, **boxplots**, etc.
* `bokeh` is an excellent alternative  

https://seaborn.pydata.org/   
https://matplotlib.org/contents.html    
https://docs.bokeh.org/en/latest/

#### I strongly encourage you to study the `matplotlib` tutorials - they are very good!

https://matplotlib.org/tutorials/index.html

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Introduction to Matplotlib

In [None]:
import pandas as pd
import numpy as np
import datetime

import matplotlib
import matplotlib.pyplot as plt

# This allows us to show the plots inline
%matplotlib inline

# This increases the resolution of the plots
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
type(plt)

In [None]:
dir(plt)

<br><br>

### The `matplotlib` `pyplot` submodule

* often aliased as `plt` for brevity
* includes a variety of functions for drawing plots:
    * `plot()` for line plots
    * `boxplot()` for box plots
    * `scatter()` for scatter plots
    * `hist()` for histogram plots
    * `bar()` and `barh()` for bar plots
    * `violinplot()` for violin plots
    * etc.
* also includes definitions for plot objects that we can customize:
    * ticks
    * axes
    * titles
    * legends
    * etc.

**NOTES:** 

* Knowledge of object-oriented programming can help a lot with understanding `matplotlib` and being able to use more advanced features.
<br><br>
* You don't have to memorize these functions. You can always check the documentation, or the examples shown on the `matplotlib` page.

In [None]:
help(plt.boxplot)

<br><br><br>

## First look

In [None]:
x_values = pd.Series([1, 2, 3, 4, 5, 6])
y_values = pd.Series([4, 5, 7, 3, 2, 2])

In [None]:
# Let's draw a simple line plot

plt.plot(x_values, y_values)

plt.show()

#### A few observations:
* The plot is drawn inline, underneath the code cell
    * that's because of the `%matplotlib inline` directive, which tells Jupyter to display the plots inline
<br><br>
* Calling plotting functions like `plot()`, `boxplot()`, etc. creates Python objects representing the plots
<br><br>
* The `show()` function is what actually draws / displays the plots
    * in some cases, `show()` is not required and Jupyter can draw the plots on its own
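
To make the point about Python objects concrete, here is a small sketch (the variable name `lines` is our own choice). It shows that `plot()` returns a list of `Line2D` objects, which we can inspect or modify before the figure is displayed:

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# plot() returns a list with one Line2D object per line drawn
lines = plt.plot([1, 2, 3, 4, 5, 6], [4, 5, 7, 3, 2, 2])

print(type(lines[0]))

# The returned object can be customized after it has been created
lines[0].set_linewidth(3)
```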

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Figures and axes subplots

<br><br>

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/matplotlib_anatomy.png" width="500px" />

<br><br>

### Creating `Figure` and `AxesSubplot` objects

* use the `subplots()` function
    * it returns a tuple containing a `Figure` object and:
        * one `AxesSubplot` object OR
        * an ndarray object containing `AxesSubplot` objects
<br><br>        
* more often, we create plots by first creating a `Figure` object and one or more `AxesSubplot` objects
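
As a quick sketch of the return values described above (variable names are our own): with the default settings, a single subplot comes back as one axes object, a single row comes back as a 1-D array, and a grid comes back as a 2-D array. Depending on the `matplotlib` version, the class name is printed as `AxesSubplot` or simply `Axes`.

```python
import numpy as np
import matplotlib.pyplot as plt

# A single subplot: the second element is one axes object
fig1, ax = plt.subplots()

# One row, three columns: a 1-D ndarray of axes objects
fig2, row_axes = plt.subplots(nrows=1, ncols=3)

# Two rows, three columns: a 2-D ndarray indexed as [row, column]
fig3, grid_axes = plt.subplots(nrows=2, ncols=3)

print(type(ax).__name__)
print(row_axes.shape)
print(grid_axes.shape)
```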

In [None]:
plt.subplots()

In [None]:
plt.subplots(nrows=2, ncols=2)

In [None]:
plt.subplots(nrows=3, ncols=2)

In [None]:
type(plt.subplots(nrows=2, ncols=2))

In [None]:
type(plt.subplots(nrows=2, ncols=2)[0])

In [None]:
type(plt.subplots(nrows=2, ncols=2)[1])

<br><br>

We can use tuple unpacking to extract the `Figure` and `AxesSubplot` objects into separate variables:

In [None]:
fig, ax = plt.subplots() # one Figure, one AxesSubplot

**NOTES:**
* Because of the `%matplotlib inline` directive, Jupyter will actually draw the `Figure` and `AxesSubplot` objects inline. 
<br><br>
* Since we used none of the plotting functions, no object representing a plot is actually attached to the `AxesSubplot` object
    * i.e. no plot is drawn
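
One way to check this yourself is the `has_data()` method of the axes object. The sketch below (variable names are our own) calls it before and after drawing a line:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Nothing has been drawn on the axes yet
empty_before = ax.has_data()

ax.plot([1, 2, 3], [2, 4, 8])

# Now the axes holds a line
filled_after = ax.has_data()

print(empty_before, filled_after)
```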

<br><br><br><br><br><br>

## Properties and methods of the `AxesSubplot`

In [None]:
fig, ax = plt.subplots()

dir(ax)

We notice that the `AxesSubplot` data type (class) has all the plotting methods we mentioned earlier:
* `plot()` for line plots
* `boxplot()` for box plots
* `scatter()` for scatter plots
* `hist()` for histogram plots
* `bar()` and `barh()` for bar plots
* `violinplot()` for violin plots
* etc.

<br><br>
It also includes properties and methods for creating / manipulating plot objects:
* ticks
* axes
* titles
* legends
* etc.

<br><br><br><br><br><br>

## Drawing a plot in an AxesSubplot

In [None]:
fig, ax = plt.subplots()

ax.plot(
    [1, 2, 3, 4], 
    [10, 20, 25, 30], 
    color='#ADD8E6', 
    linewidth=2
)

plt.show() 

<br><br>

## Drawing multiple plots in the same AxesSubplot

In [None]:
fig, ax = plt.subplots()

# Add a line plot
ax.plot(
    [1, 2, 3, 4], 
    [10, 20, 25, 30], 
    color='#ADD8E6', 
    linewidth=1
)

# Add a scatter plot
ax.scatter(
    [0.3, 3.8, 1.2, 2.5], 
    [11, 25, 9, 26], 
    color='#FE5645', 
    marker='o'
)

ax.set_xlim(0.5, 4.5)

plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Drawing several subplots in a figure

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3)

plt.show()

In [None]:
type(fig)

In [None]:
axes

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3)

axes[0,0].plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

axes[0,1].scatter(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Adjust space between subplots

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0,0].plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

axes[0,1].scatter(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

plt.subplots_adjust(wspace=0.3, hspace=0.4)

plt.show()

**More info:** https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots_adjust.html
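
The same adjustment is also available as a method on the `Figure` object. A minimal sketch, which reads the stored spacing values back from the figure's `subplotpars` attribute:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2)

# wspace / hspace are expressed as fractions of the average axes width / height
fig.subplots_adjust(wspace=0.3, hspace=0.4)

# The current spacing values are stored on the figure
print(fig.subplotpars.wspace, fig.subplotpars.hspace)
```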

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Name subplots

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0,0].plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

axes[0,1].scatter(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

axes[0,0].set_title('Upper Left')
axes[0,1].set_title('Upper Right')
axes[1,0].set_title('Lower Left')
axes[1,1].set_title('Lower Right')

plt.subplots_adjust(wspace=0.4, hspace=0.5)

plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Set axes limits

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2)

# draw a line plot and a scatter plot on the upper-left axes
axes[0,0].plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

axes[0,0].scatter(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1]
)

# set axes limits
axes[0,0].set_ylim(top=10.0)
axes[0,0].set_xlim(left=-15)

plt.subplots_adjust(wspace=0.4, hspace=0.4)

plt.show()
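
Passing only `top=` or `left=` changes just that end of the axis. A small sketch (variable names are our own) that reads the limits back with `get_xlim()` and `get_ylim()`:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([-10, -5, 0, 5, 10, 15], [-1.2, 2, 3.5, -0.3, -4, 1])

# Only the named limit is changed; the other end of each axis is left as it was
ax.set_ylim(top=10.0)
ax.set_xlim(left=-15)

x_left, x_right = ax.get_xlim()
y_bottom, y_top = ax.get_ylim()
print(x_left, x_right)
print(y_bottom, y_top)
```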

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Show legend

In [None]:
fig, ax = plt.subplots()

# plot both signals on the same axes
ax.plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1], 
    label='Signal 1'
)

ax.plot(
    [-10, -5, 0, 5, 10, 15], 
    [1.2, 4, 1.5, 0.3, -2.5, 1], 
    label='Signal 2'
)

# show plot legend
ax.legend(loc='lower left')

plt.show()

**More info:** https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html
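
The legend entries come from the `label=` arguments of the plotting calls. As a sketch (with made-up data), we can look at what the legend will contain using `get_legend_handles_labels()`:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1, 2, 3], label='Signal 1')
ax.plot([0, 1, 2], [3, 2, 1], label='Signal 2')

# The handles are the drawn objects; the labels come from the label= arguments
handles, labels = ax.get_legend_handles_labels()
print(labels)

legend = ax.legend(loc='lower left')
```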

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Set axis labels

In [None]:
fig, ax = plt.subplots()

# plot the data
ax.plot(
    [-10, -5, 0, 5, 10, 15], 
    [-1.2, 2, 3.5, -0.3, -4, 1], 
    label='Data'
)

# show plot legend
ax.legend()

# set axis labels
ax.set_xlabel('foo')
ax.set_ylabel('bar')

plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Markers and line styles

In [None]:
fig, ax = plt.subplots()

ax.plot(
    [1, 2, 3, 4], 
    [10, 20, 25, 30], 
    color='#ADD8E6', 
    linewidth=2,
    linestyle='-.'
)

ax.scatter(
    [0.3, 3.8, 1.2, 2.5], 
    [11, 25, 9, 26], 
    color='#FE5645', 
    marker='d'
)

# ax.set_xlim(0.5, 4.5)

plt.show()

**More info:** https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

**For marker style options:** https://matplotlib.org/stable/api/markers_api.html
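
`plot()` also accepts a short format string that combines a marker and a line style, as an alternative to the separate keyword arguments. A minimal sketch:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# 'd-.' means: thin-diamond markers joined by a dash-dot line
(line,) = ax.plot([1, 2, 3, 4], [10, 20, 25, 30], 'd-.')

print(line.get_marker(), line.get_linestyle())
```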

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Custom ticks and annotations

In [None]:
fig, ax = plt.subplots()

ax.plot([1, 2, 3, 4], [10, 20, 25, 30])

# set custom ticks on the X Axis
ax.set_xticks([-4, -2, 0, 2, 4])

# set custom ticks on the Y Axis
ax.set_yticks(range(0, 80, 20))

plt.show()

In [None]:
fig, ax = plt.subplots()

ax.plot([1, 2, 3, 4], [40, 20, 25, 30])

ax.annotate(
    'this is bad', 
    xy=(2, 20), 
    xytext=(3, 30), 
    arrowprops={
        'facecolor': '#dfdfdf', 
        'width': 1,
    }
)

plt.show()
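
Besides choosing where the ticks go, we can also choose the text shown at each tick. A sketch using `set_xticklabels()`; the quarter labels are made up for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])

# Fix the tick positions first, then give each position a label
ax.set_xticks([1, 2, 3, 4])
ax.set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4'])

tick_labels = [label.get_text() for label in ax.get_xticklabels()]
print(tick_labels)
```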

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Set the figure size

In [None]:
fig, ax = plt.subplots(nrows=1)
ax.plot([1, 2, 3, 4], [40, 20, 25, 30])

fig.set_size_inches(16.5, 8.5)

plt.show()
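
The size can also be set when the figure is created, using the `figsize` argument of `subplots()`. A minimal sketch:

```python
import matplotlib.pyplot as plt

# figsize is given as (width, height) in inches
fig, ax = plt.subplots(figsize=(16.5, 8.5))
ax.plot([1, 2, 3, 4], [40, 20, 25, 30])

width, height = fig.get_size_inches()
print(width, height)
```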

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

### Save the figure to a file

In [None]:
fig, ax = plt.subplots()

fig.set_size_inches(16.5, 8.5)

ax.plot([1, 2, 3, 4], [40, 20, 25, 30])

ax.annotate(
    'this is bad', 
    xy=(2, 20), 
    xytext=(2, 25), 
    arrowprops={
        'facecolor': 'black', 
        'width': 2,
    }
)

# Saving before show() is the safer order:
# depending on the backend, show() may close the figure
fig.savefig('line_chart.jpeg', dpi=300)

plt.show()
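
`savefig()` picks the file format from the file extension. The sketch below writes a PNG into a temporary directory (the output location is an assumption made for this example) and uses `bbox_inches='tight'` to trim the empty margins:

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [40, 20, 25, 30])

# Hypothetical output location: a temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'line_chart.png')

# The format is inferred from the extension;
# bbox_inches='tight' trims the empty margins around the plot
fig.savefig(png_path, dpi=150, bbox_inches='tight')

print(os.path.exists(png_path))
```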

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Using `matplotlib` with `pandas`

In [None]:
# Read our sample survey data

data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv',
    parse_dates=['Date']
)

data.head()

In [None]:
data.describe()

In [None]:
filtered_data = data[['Helpfulness', 'Courtesy', 'Empathy']]
filtered_data.head()

In [None]:
# We can count how many responses of each type
# we got
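# (pd.Series.value_counts is used because the top-level
# pd.value_counts function is deprecated in recent pandas versions)
totals = filtered_data.apply(pd.Series.value_counts)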
totals

In [None]:
# We can rotate this dataframe using the T attribute, 
# which transposes (i.e. rotates) the dataframe

rotated_totals = totals.T
rotated_totals
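
To see what counting and transposing do on a table small enough to check by hand, here is a sketch with made-up survey answers (the values and the 1-3 scale are assumptions, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical survey answers on a 1-3 scale
answers = pd.DataFrame({
    'Helpfulness': [1, 2, 3, 3],
    'Courtesy': [3, 3, 2, 1],
    'Empathy': [1, 1, 2, 3],
})

# One column of counts per question, one row per answer value
counts = answers.apply(pd.Series.value_counts).sort_index()

# After transposing: one row per question, one column per answer value
rotated = counts.T

print(counts)
print(rotated)
```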

<br>

### Let's plot this data!

* DataFrames have support for plotting built-in
* underneath they use matplotlib functions
* not quite as many options and customizations available as when using `matplotlib` directly

<br>

Plotting functionality is available via the `PlotAccessor` object:

In [None]:
rotated_totals.plot

In [None]:
dir(rotated_totals.plot)

In [None]:
ax = rotated_totals.plot.bar()

ax.set_title('Aggregated Data')

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Bar plots

In [None]:
rotated_totals

In [None]:
rotated_totals.plot

In [None]:
rotated_totals.plot.bar()

In [None]:
rotated_totals.plot.barh()

**IMPORTANT: the `DataFrame` plotting functions return the `AxesSubplot` objects where those plots are drawn:**

In [None]:
type(rotated_totals.plot.barh())
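
Because the returned object is a regular `matplotlib` axes, everything from the first half of this notebook still applies. A sketch with a small made-up DataFrame:

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# Hypothetical aggregated counts
counts = pd.DataFrame(
    {'low': [10, 20], 'high': [30, 15]},
    index=['Helpfulness', 'Courtesy'],
)

ax = counts.plot.barh()

# The returned object is a matplotlib axes, so its usual methods are available
ax.set_xlabel('Number of responses')
ax.set_title('Aggregated Data')

print(isinstance(ax, Axes))
```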

<br><br>

### Stack bar plots

In [None]:
rotated_totals.plot.barh(stacked=True)

<br><br>

### Change the plot figure size

In [None]:
# figsize units are in inches...
rotated_totals.plot.barh(stacked=True, figsize=(9, 4))

<br><br>

### Custom legend

**IMPORTANT: the `DataFrame` plotting functions return the `AxesSubplot` objects where those plots are drawn:**

In [None]:
type(rotated_totals.plot.barh(stacked=True, figsize=(9, 4)))

<br>

#### We saw earlier how we can use the `legend()` method of the `AxesSubplot` object

In [None]:
rotated_totals.plot.barh(
    stacked=True, 
    figsize=(9, 6)
).legend(
    ['Unfavorable', 'Neutral', 'Favorable'], 
    bbox_to_anchor=(1.0, 0.5)
)

<br><br>

### Custom colors

**NOTE:** The order of the color `hex` codes should match the order of the bars representing the data.

In [None]:
rotated_totals.plot.barh(
    stacked=True, 
    figsize=(9, 6), 
    color=['#ff9a8d', '#4a536b', '#aed6dc']
).legend(
    ['Unfavorable', 'Neutral', 'Favorable'], 
    bbox_to_anchor=(1.0, 1.0)
)

<br><br>

### Specify the `x-ticks` values

In [None]:
rotated_totals

In [None]:
# And we can specify which ticks we want, say on X axis

rotated_totals.plot.barh(
    stacked=True, 
    figsize=(9, 6), 
    color=['#ff9a8d', '#4a536b', '#aed6dc'], 
    xticks=[0, 333, 250, 500, 750, 1000]
).legend(
    ['Unfavorable', 'Neutral', 'Favorable'], 
    bbox_to_anchor=(1.0, 1.0)
)

**More info:** https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.barh.html

<br><br>

### Save the figure to a file

In [None]:
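ax = rotated_totals.plot.barh(
    stacked=True, 
    figsize=(9, 6), 
    color=['#ff9a8d', '#4a536b', '#aed6dc'], 
    xticks=[0, 333, 250, 500, 750, 1000]
)

# Call legend() separately so that `ax` keeps pointing
# to the AxesSubplot object rather than to the Legend
ax.legend(
    ['Unfavorable', 'Neutral', 'Favorable'], 
    bbox_to_anchor=(1.0, 1.0)
)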

# Get the matplotlib Figure object 
# that contains the AxesSubplot object
fig = ax.get_figure()

# And call the savefig method on that
# object
fig.savefig('bar_chart.jpeg', dpi=300)

<br><br><br><br><br><br>

## Line plots

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/timeseries_survey_sample.csv', 
    index_col=['Date'], 
    parse_dates=['Date']
)

data.head()

In [None]:
data.info()

Let's plot the values read by `Sensor A` in `Budapest`.

In [None]:
budapest_data = data.loc[
    data['Location'] == 'Budapest', 
    ['Sensor A']]

**REMEMBER!** With `DatetimeIndex`, it's good to sort the index, unless you're 100% sure it's sorted.

In [None]:
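# Reassign instead of sorting in place: budapest_data was selected
# from `data`, and in-place changes on a selection can trigger warnings
budapest_data = budapest_data.sort_index()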
budapest_data.head()
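
If you are not sure whether an index is sorted, you can ask it. A small sketch with made-up readings whose dates are deliberately out of order:

```python
import pandas as pd

# Hypothetical readings with dates out of order
readings = pd.DataFrame(
    {'Sensor A': [1.5, 0.7, 2.1]},
    index=pd.to_datetime(['2014-03-01', '2014-01-01', '2014-02-01']),
)

print(readings.index.is_monotonic_increasing)

sorted_readings = readings.sort_index()
print(sorted_readings.index.is_monotonic_increasing)
```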

<br><br>

### Draw the plot using only `matplotlib`

In [None]:
fig, ax = plt.subplots()

fig.set_size_inches(16.5, 6)

ax.plot(
    budapest_data.index, 
    budapest_data['Sensor A'], 
    color='#aed6dc',
    linewidth=2
)

ax.set_title('Budapest, Sensor A')

plt.show()

<br><br>

### Draw the plot using `pandas` methods

In [None]:
budapest_data.plot.line(title='Budapest, Sensor A')

### Change the plot figure size

In [None]:
budapest_data.plot.line(figsize=(9, 3))

### Set a title for the plot

In [None]:
budapest_data.plot.line(
    figsize=(9, 3), 
    title='Budapest: Sensor A'
)

### Hide the legend

In [None]:
ax = budapest_data.plot.line(
    figsize=(9, 3), 
    title='Budapest: Sensor A', 
    legend=False, 
    rot=45
)

ax.set_ylabel('Sensor A')

<br><br><br><br><br><br>

## Plot more than one variable in the same line plot

In [None]:
budapest_data = data.loc[
    data['Location'] == 'Budapest', 
    ['Sensor A', 'Sensor B']]

budapest_data.head()

In [None]:
budapest_data.plot.line(figsize=(9, 3))

**Let's focus on a subset of the data:**

In [None]:
# Sort the index first, then slice the rows by date with .loc
bd_data = budapest_data.sort_index().loc['2014':'2015']
bd_data.plot.line(figsize=(9, 3), title='Budapest')
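
Slicing a `DatetimeIndex` with year strings includes both ends of the range. The sketch below uses made-up monthly data to show which rows a `'2014':'2015'` slice keeps:

```python
import pandas as pd

# Hypothetical monthly readings covering 2013 through 2016
index = pd.date_range('2013-01-01', periods=48, freq='MS')
readings = pd.DataFrame({'Sensor A': range(48)}, index=index)

# Both ends of the slice are inclusive: all of 2014 and all of 2015
subset = readings.loc['2014':'2015']

print(subset.index.min(), subset.index.max())
print(len(subset))
```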

<br>

### Plot each variable in a subplot (`pandas` way)

In [None]:
bd_data.plot.line(
    figsize=(9, 6),
    subplots=True,
    sharex=True
)

<br>

### Plot each variable in a subplot (`matplotlib` way)

In [None]:
fig, axes = plt.subplots(
    nrows=len(bd_data.columns), 
    sharex=True
)

colors = ['#ff9a8d', '#aed6dc']

fig.set_size_inches(16.5, 6)

for idx, column in enumerate(bd_data.columns):
    axes[idx].plot(
        bd_data.index, 
        bd_data[column],
        color=colors[idx],
        label=column
    )

    axes[idx].legend()

plt.show()

<br><br><br><br><br><br>

## `pandas` line plots customizations

### Emphasize data points with custom markers

In [None]:
bd_data.plot.line(figsize=(9, 3), marker='o')

### Set custom limits on the axes

In [None]:
ax = bd_data.plot.line(figsize=(9,3), marker='o')
ax.set_ylim(bottom=0.0, top=3.5)

### Add annotations

In [None]:
ax = bd_data.plot.line(figsize=(9,3), marker='o')

# Let's get the minimum value for Sensor B
min_row = (
    bd_data['Sensor B']
    .sort_values(ascending=True)
)

min_val = min_row.iloc[0]
min_date = min_row.index[0]

ax.text(
    min_date, 
    min_val, 
    'low point', 
    fontsize=12
)

plt.show()
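
Sorting works, but `pandas` can also return the minimum and its index label directly. A sketch of that alternative, using a small made-up series:

```python
import pandas as pd

# Hypothetical sensor readings
series = pd.Series(
    [2.4, 1.1, 3.0],
    index=pd.to_datetime(['2014-01-01', '2014-02-01', '2014-03-01']),
    name='Sensor B',
)

min_date = series.idxmin()   # index label of the smallest value
min_val = series.min()       # the smallest value itself

print(min_date, min_val)
```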

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Scatter plots

In [None]:
bd_data.head()

In [None]:
bd_data.plot.scatter(
    x='Sensor A', 
    y='Sensor B', 
    c='#ff9a8d'
)

In [None]:
data.head()

In [None]:
dv_data = data.loc[data['Location'] == 'Dubrovnik'].sort_index()
dv_data = dv_data.loc['2014':'2015']
dv_data.head()

In [None]:
dv_data.plot.scatter(
    x='Sensor A', 
    y='Sensor B', 
    c='#4a536b'
)

<br>

### How can we show multiple scatter plots in the same axes?

In [None]:
p_data = data[data['Location'] == 'Prague'].sort_index().loc['2014':'2015']
p_data.head()

In [None]:
ax = bd_data.plot.scatter(
    x='Sensor A', 
    y='Sensor B', 
    c='#ff9a8d', 
    label='Budapest'
)

dv_data.plot.scatter(
    x='Sensor A', 
    y='Sensor B', 
    c='#4a536b', 
    ax=ax, 
    label='Dubrovnik'
)

p_data.plot.scatter(
    x='Sensor A', 
    y='Sensor B', 
    c='#aed6dc', 
    ax=ax,
    label='Prague'
)

ax.legend(bbox_to_anchor=(1.0, 1.0))
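
The same idea can be written with `matplotlib` directly: group the rows by location and call `scatter()` once per group on a shared axes. This is only a sketch; the data below is made up, and the `colors` dictionary is our own helper:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical readings for three locations
readings = pd.DataFrame({
    'Location': ['Budapest', 'Budapest', 'Dubrovnik', 'Dubrovnik', 'Prague', 'Prague'],
    'Sensor A': [1.0, 1.5, 2.0, 2.2, 0.5, 0.9],
    'Sensor B': [2.1, 2.4, 1.0, 1.3, 3.0, 2.8],
})
colors = {'Budapest': '#ff9a8d', 'Dubrovnik': '#4a536b', 'Prague': '#aed6dc'}

fig, ax = plt.subplots()

# One scatter() call per location, all drawn on the same axes
for location, group in readings.groupby('Location'):
    ax.scatter(group['Sensor A'], group['Sensor B'], color=colors[location], label=location)

ax.legend(bbox_to_anchor=(1.0, 1.0))
```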

<br><br>

### Creating scatter plots using `matplotlib`

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/housing.csv'
)

data.head()

In [None]:
fig, ax = plt.subplots()

fig.set_size_inches(20, 18)

ax.scatter(
    x=data['longitude'],
    y=data['latitude'],
    s=data['population'] / 50,
    alpha=0.1, 
    c=data['median_house_value'],
    cmap=plt.get_cmap('jet')
)

plt.show()

<br>

#### BTW, the customization options above are also available for the `pandas` scatter plots:

In [None]:
data.plot.scatter(
    x='longitude', 
    y='latitude', 
    s=data['population'] / 50,
    alpha=0.1, 
    c='median_house_value',
    colormap=plt.get_cmap('jet'),
    figsize=(20, 18)
)
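
One difference worth knowing: `pandas` typically adds a colorbar on its own when `c` is a column name, while with `matplotlib` we have to ask for it. A sketch with random stand-in data (not the housing dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data standing in for longitude, latitude and house value
rng = np.random.default_rng(0)
x = rng.uniform(-124, -114, size=50)
y = rng.uniform(32, 42, size=50)
value = rng.uniform(50_000, 500_000, size=50)

fig, ax = plt.subplots()

# scatter() returns the object that carries the color mapping
points = ax.scatter(x, y, c=value, cmap='jet', alpha=0.5)

# The colorbar is drawn in its own, additional axes
colorbar = fig.colorbar(points, ax=ax, label='median_house_value')

print(len(fig.axes))
```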

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Histograms

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/titanic.csv'
)

data.head()

<br>

`Fare` stores the price that each passenger paid for their ticket

In [None]:
data.plot.hist()

In [None]:
data['Fare'].plot.hist(bins=5)

**More info:** https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
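
To see what the `bins` argument does, it helps to look at the numbers behind the bars. The `matplotlib` `hist()` method returns them; the fares below are made up for the example:

```python
import matplotlib.pyplot as plt

# Hypothetical ticket prices
fares = [7.25, 8.05, 13.0, 26.55, 30.0, 51.86, 71.28, 80.0, 146.52, 263.0]

fig, ax = plt.subplots()

# hist() returns the count per bin, the bin edges, and the drawn bars
counts, edges, bars = ax.hist(fares, bins=5)

print(counts)
print(edges)
```

With `bins=5` there are five counts and six edges, and the edges span the range from the smallest to the largest value.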

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Visualizing time series

In [None]:
# The QCOM.csv file contains stock price information for 
# Qualcomm between 1999 and 2022

qc = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/QCOM.csv', 
    index_col='Date', 
    parse_dates=['Date']
)

qc.sort_index(inplace=True)

qc.head()

In [None]:
qc.plot.line()

In [None]:
# Let's look at the closing stock price
qc_close = qc['Close']
qc_close.plot.line(figsize=(9, 4))

In [None]:
qc_close.loc['2018':'2022']

In [None]:
# We can zoom in on a period
qc_close.loc['2018':'2022'].plot.line(figsize=(9, 4))

### Rolling mean

In [None]:
# Let's look at the rolling mean from 2018 through 2022, 
# with a rolling window of 14 days

(
    qc_close['2018':'2022']
    .rolling('14D')
    .mean()
    .plot
    .line(figsize=(9, 4))
)
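
The cell above computes a rolling mean, which is related to resampling but not the same thing. The sketch below contrasts the two on made-up daily data: `rolling()` keeps one value per original row, while `resample()` produces one value per time bin.

```python
import pandas as pd

# Hypothetical daily values: 0, 1, ..., 27
index = pd.date_range('2018-01-01', periods=28, freq='D')
prices = pd.Series(range(28), index=index, dtype='float64')

# Rolling: one output value per input row, each averaging the trailing 14 days
rolling_mean = prices.rolling('14D').mean()

# Resampling: one output value per 7-day bin
weekly_mean = prices.resample('7D').mean()

print(len(rolling_mean), len(weekly_mean))
```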

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>


## Save `pandas` plots to file

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/housing.csv'
)

In [None]:
ax = data.plot.scatter(
    x='longitude', 
    y='latitude', 
    s=data['population'] / 50,
    alpha=0.1, 
    c='median_house_value',
    colormap=plt.get_cmap('jet'),
    figsize=(18, 18), 
    title='California housing prices'
)

In [None]:
dir(ax)

In [None]:
fig = ax.get_figure()

fig.savefig('housing.jpeg', dpi=300)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>