# CSC271H1: Intro to Data Visualization

## In this lesson
1. Basic plotting using `matplotlib`
    - line plot
    - bar plot
    - histogram
    - pie chart
    - saving plots to a file

## Introduction 

The Pandas library has helpful tools for visualizing data. The Pandas tools are built on a library called `matplotlib` that provides many options for customizing visualizations. (This is similar to how Pandas relies on NumPy for numerical operations.)

Pandas' DataFrames and Series both have a `plot` method. 

We are going to work with City of Toronto Open Data on the number of library card registrations per year for each Toronto Public Library (TPL) branch.

Source: https://open.toronto.ca/dataset/library-card-registrations/

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

tpl_df = pd.read_csv('tpl-card-registrations-annual-by-branch.csv')
tpl_df.head()


We will group the rows by year and then sum the number of registrations in each year:

In [None]:
tpl_df.groupby('Year')['Registrations'].sum()
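
To see what the grouping step does on a small scale, here is a minimal sketch that uses a made-up DataFrame rather than the real TPL file. The column names match the ones used in this lesson, but the branch codes and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the TPL data: the column names match this lesson,
# but the branch codes and values are made up for illustration.
toy_df = pd.DataFrame({
    'Year': [2021, 2021, 2022, 2022],
    'BranchCode': ['AAA', 'BBB', 'AAA', 'BBB'],
    'Registrations': [100, 250, 120, 300],
})

# Group the rows by year, select the Registrations column, and sum each group.
toy_by_year = toy_df.groupby('Year')['Registrations'].sum()
print(toy_by_year)
```

The result is a Series whose index is the year, which is why `plot` can later infer the x and y values without being told.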

### Line plot

A **line plot** displays data as points connected by line segments. It is useful for showing changes in data and is commonly used for time series data. A **time series** is a set of observations of a variable recorded at successive points in time, usually at regular intervals (such as daily, monthly, or yearly). The TPL registration data is time-series data.

Let's visualize the TPL data using the `plot` method. The default plot type generated is a line plot. The x-axis represents the year and the y-axis represents the total number of registrations across all TPL branches.

In [None]:
reg_by_year = tpl_df.groupby('Year')['Registrations'].sum()
reg_by_year.plot()
plt.show() 


<div class="alert alert-block alert-info">
<h3>The <code>show</code> function</h3>

`plt.show()` is required when running Python scripts (`.py` files). It opens a window containing the plot. In Jupyter notebooks, plots are displayed automatically, so the call to `plt.show()` could have been omitted above.
</div>

That's not a bad start! It's a bit ugly, but we can improve it and make it easier to see the data points.

The `plot` method call above returns a `matplotlib.axes.Axes` object:
https://matplotlib.org/stable/api/axes_api.html

The documentation describes the many `Axes` attributes. Let's customize a few by passing keyword arguments to `plot`:
- `marker`: symbol at each data point in the line plot
- `title`: title of the chart
- `ylabel`: label for the y‑axis

We'll also add a grid and set some of its attributes using `matplotlib.axes.Axes.grid`. You can learn more about grid attributes by reading the documentation:
https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.grid.html#matplotlib.axes.Axes.grid

In [None]:
axes = reg_by_year.plot(marker='o',
                        title='Library Card Registrations by Year',
                        ylabel='Registrations')
axes.grid(True, color="gray", linestyle=":", linewidth=0.7, alpha=0.6)
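
Because `plot` returns an `Axes` object, the same options can also be set after the plot has been created. The sketch below shows one way to do that with the `Axes` methods `set_title` and `set_ylabel`. The yearly totals are made up for illustration, and this is an alternative to the keyword arguments above, not a replacement for them.

```python
import pandas as pd

# Made-up yearly totals, for illustration only.
toy_by_year = pd.Series([350, 420, 390], index=[2021, 2022, 2023])
toy_by_year.index.name = 'Year'

# Create the line plot first, keeping a reference to the returned Axes.
toy_axes = toy_by_year.plot(marker='o')

# Then customize it through Axes methods.
toy_axes.set_title('Toy Registrations by Year')
toy_axes.set_ylabel('Registrations')
toy_axes.grid(True, linestyle=':')
```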


The default plot type (set using the `kind` parameter) is a line chart, but the table below shows other options.

<div class="alert alert-block alert-info">
<h3>Pandas plot types</h3>
<table>
  <thead>
    <tr>
      <th>Plot (<code>kind</code>)</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>line</td>
      <td>Line chart showing trends over an index or time</td>
    </tr>
    <tr>
      <td>bar</td>
      <td>Vertical bar chart for comparing values across categories</td>
    </tr>
    <tr>
      <td>barh</td>
      <td>Horizontal bar chart, useful for long category labels</td>
    </tr>
    <tr>
      <td>hist</td>
      <td>Distribution of a single variable using bins</td>
    </tr>
    <tr>
      <td>box</td>
      <td>Shows median, quartiles, and outliers</td>
    </tr>
    <tr>
      <td>area</td>
      <td>Stacked or unstacked area chart for cumulative values</td>
    </tr>
    <tr>
      <td>scatter</td>
      <td>Relationship between two numerical variables</td>
    </tr>
    <tr>
      <td>hexbin</td>
      <td>2D density plot for large scatter datasets</td>
    </tr>
    <tr>
      <td>pie</td>
      <td>Proportional representation of parts of a whole</td>
    </tr>
    <tr>
      <td>density / kde</td>
      <td>Smoothed probability density estimate</td>
    </tr>
  </tbody>
</table>
</div>
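
Each `kind` in the table can also be requested through an accessor method of the same name on `plot`, for example `plot.barh()`. The sketch below draws the same made-up data both ways. The `ax` parameter and `plt.subplots` are not covered in this lesson; they are used here only to put the two plots side by side.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals, for illustration only.
toy_by_year = pd.Series([350, 420, 390], index=[2021, 2022, 2023])

# Two side-by-side Axes so the plots do not draw on top of each other.
fig, (left_ax, right_ax) = plt.subplots(1, 2, figsize=(9, 3))

# These two calls request the same plot type in different ways.
toy_by_year.plot(kind='barh', ax=left_ax, title="kind='barh'")
toy_by_year.plot.barh(ax=right_ax, title='plot.barh()')
```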


### Bar chart


A **bar chart** uses rectangular bars to represent values for different categories.  

Let's plot the data again using a bar chart, with each bar showing the total registrations for a given year. This time we will set the `kind` parameter to `'bar'`. We also need to remove the `marker` argument, since bar charts don't have markers.

In [None]:
axes = tpl_df.groupby('Year')['Registrations'].sum().plot(kind='bar',
                                                    title='Library Card Registrations by Year',
                                                    ylabel='Registrations')
axes.grid(True, color="gray", linestyle=":", linewidth=0.7, alpha=0.6)


### Histogram

Let’s visualize the data in a different way by creating a histogram of registrations.

 **Histograms** show the distribution of numerical data. The values are organized into bins (ranges). The x-axis indicates the bin ranges, and the y-axis displays the number of items in each bin.

In this case, the bins will represent ranges of registration counts, and the y-axis will indicate the number of entries (one per branch per year) that fall into each range.
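
To make the idea of bins concrete, here is a minimal sketch that counts values per bin with NumPy's `histogram` function. The registration counts are made up, and the choice of 4 bins over the range 0 to 100 is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up registration counts, for illustration only.
toy_counts = pd.Series([5, 12, 18, 25, 31, 33, 47, 90])

# Split the range 0 to 100 into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(toy_counts, bins=4, range=(0, 100))

# Print each bin's range alongside the number of values that fall in it.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.0f} to {right:.0f}: {n} value(s)')
```

A histogram draws one bar per bin, with the bar height equal to that bin's count.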

In [None]:
reg_ax = tpl_df['Registrations'].plot(kind='hist')


Hmm, that isn't very helpful -- most registration counts fall into the first bin. Let's filter the data to look more closely at just the entries with 0 to 4999 registrations.
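
As a brief aside on how this kind of filtering works, the sketch below builds a boolean mask on a small made-up DataFrame and then uses it to select rows. The branch codes and values are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy_df = pd.DataFrame({
    'BranchCode': ['AAA', 'BBB', 'CCC'],
    'Registrations': [1200, 48000, 4999],
})

# Comparing a column to a number gives one True/False value per row.
mask = toy_df['Registrations'] < 5000
print(mask.tolist())

# Indexing the DataFrame with the mask keeps only the rows where it is True.
under_5000 = toy_df[mask]
print(under_5000)
```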

In [None]:
tpl_df[tpl_df['Registrations'] < 5000]['Registrations'].plot(kind='hist')


We'll improve the appearance of this plot as well by setting the following attributes:

- `bins`: number of bins along the x-axis
- `edgecolor`: color of each histogram bar border
- `linewidth`: thickness of bar border
- `figsize`: (width, height)
- `color`: bar color

In [None]:
under_5000_df = tpl_df[tpl_df['Registrations'] < 5000]['Registrations']
under_5000_df.plot(kind='hist',
                   title='Library Card Registrations (Under 5000)',
                   bins=30,
                   edgecolor="white",
                   linewidth=1,
                   figsize=(9, 4),
                   color="#1f77b4")
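
Instead of a number of bins, `bins` can also be given as a list of bin edges, which makes the bin boundaries fall on round numbers. The sketch below assumes edges every 500 registrations and uses made-up counts; it is one possible choice, not a recommendation for the real data.

```python
import pandas as pd

# Made-up registration counts, for illustration only.
toy_counts = pd.Series([150, 480, 520, 990, 1500, 2600, 2750, 4999])

# Bin edges every 500 registrations: 0, 500, ..., 5000 (10 bins in total).
bin_edges = list(range(0, 5001, 500))

# Pass the edges instead of a bin count.
toy_hist_ax = toy_counts.plot(kind='hist', bins=bin_edges, edgecolor='white')
```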

### Pie plot

Finally, we'll plot the data one last time using a pie plot. A **pie plot** shows the proportions of a whole. 

In this case, we'll consider the total registrations for the 10 branches with the highest total registration counts and use the pie plot to visually communicate what each branch's share of the total is.



In [None]:
top10_df = tpl_df.groupby('BranchCode')['Registrations'].sum().sort_values(ascending=False).head(10)
top10_df.plot(kind='pie',
              title='Ten TPL Branches With Highest Total Registrations for 2012-2023')
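
The share that each wedge represents can be computed directly, and matplotlib can print it on the wedge. The sketch below uses made-up branch totals and passes `autopct`, a parameter of matplotlib's pie function that Pandas forwards, to label each wedge with its percentage.

```python
import pandas as pd

# Made-up branch totals, for illustration only.
toy_totals = pd.Series({'AAA': 500, 'BBB': 300, 'CCC': 200}, name='Registrations')

# Each branch's share of the whole, as a percentage.
shares = toy_totals / toy_totals.sum() * 100
print(shares.round(1))

# autopct is forwarded to matplotlib and labels each wedge with its percentage.
toy_pie_ax = toy_totals.plot(kind='pie', autopct='%1.1f%%')
```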

### Saving plots to a file: `figure.savefig`

When we made our line plot above, we stored the value returned by `plot` in a variable. You'll do that for plots you generate in tutorial (for marking purposes), but it is also helpful because you have a way to refer to the plot again later.

We'll create a variable to refer to our top 10 pie chart and then use that variable to save the plot to a file. That will allow us to access the plot from outside of this notebook and VS Code.

The default location of the file is the current directory, but we can also specify an absolute or relative file path instead of just a file name, e.g.:
- `'../TPL-top10branches-piechart.pdf'`
- `'/Users/me/TPL-top10branches-piechart.svg'`
- `r'C:\Users\me\TPL-top10branches-piechart.png'` (a raw string, so that Python does not treat the backslashes as escape sequences)

In [None]:
top10_plot = top10_df.plot(kind='pie',
              title='Ten TPL Branches With Highest Total Registrations for 2012-2023')
top10_plot.figure.savefig('../TPL-top10branches-piechart.jpg')
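
The file extension in the path determines the output format. As a minimal sketch with made-up data, the example below saves a PNG into a temporary directory and checks that the file was written. The `dpi` and `bbox_inches` arguments are optional `savefig` parameters, and the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up branch totals, for illustration only.
toy_totals = pd.Series({'AAA': 500, 'BBB': 300, 'CCC': 200}, name='Registrations')
toy_plot = toy_totals.plot(kind='pie', title='Toy Branch Totals')

# Build the path with os.path.join so it works on any operating system.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'toy-piechart.png')

# The .png extension selects the format; dpi sets the resolution and
# bbox_inches='tight' trims extra whitespace around the plot.
toy_plot.figure.savefig(out_path, dpi=150, bbox_inches='tight')
print(os.path.exists(out_path))
```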

<div class="alert alert-block alert-success">

<h3>Practice Exercises (not-for-credit)</h3>

1. The pie plot above uses the default colors. One way to set the colors is to use a colormap. Read the documentation to find a colormap that you like and set the `colormap` attribute of the pie plot to it.

    https://matplotlib.org/stable/gallery/color/colormap_reference.html

2. TRL, the Toronto Reference Library, had the highest card registrations. Create a line plot to show its registrations by year.

    For our line plot above, our data was grouped by year and `plot` inferred the x and y values. When the data is not grouped, we can use `plot`'s `x` and `y` parameters to specify which columns of the dataframe to use for the axes (e.g., `x='ColumnName'`).

    Set the line `color` to 'red'.

    In our line plot earlier, we used a circle marker. Pick a different marker to use this time:
    
    https://matplotlib.org/stable/gallery/lines_bars_and_markers/marker_reference.html
</div>

<div class="alert alert-block alert-danger">

<h3>Practice Exercises: Sample Solutions</h3>

</div>

In [None]:
# 1.

top10_df.plot(kind='pie',
              title='Ten TPL Branches With Highest Total Registrations for 2012-2023',
              colormap='twilight')

In [None]:
# 2. 

trl_df = tpl_df[tpl_df['BranchCode'] == 'TRL']
trl_df.plot(x='Year', y='Registrations', marker='v', color='red')