# Data Visualization in Python

* There are many libraries for doing data visualization in Python, and none of them is perfect
* matplotlib is a very powerful data visualization toolkit for Python
* But it is a bit clunky and old school
* Increasingly there are newer alternatives for graphing in Python
    * [Seaborn](http://seaborn.pydata.org/#), built on top of matplotlib, provides an easier (and more aesthetically pleasing) programming interface
    * [Bokeh](http://bokeh.pydata.org/en/latest/), which is great for displaying data on the web
    * [Plotly](https://plot.ly/python/), an interactive plotting library from a company that also sells a hosted web service
    * [plotnine](https://plotnine.readthedocs.io/en/stable/), a Python implementation of the very popular R visualization library ggplot2
    * [Altair](https://altair-viz.github.io), which is a new *declarative* visualization library from one of the Jupyter developers
* Still, `matplotlib` is the OG of the Python data visualization libraries (respect), so we are going to spend the next couple of hours getting matplotlib'd



## Visualizing data


* I assume you already know what data visualization *is*
* The *real* question is *how* to visualize data AND *what* visualization to use
* There are many different kinds of data visualizations and which one is the best for any given situation is a loaded question

![XKCD comic on data vis](https://imgs.xkcd.com/comics/self_description.png)

### What kinds of visualizations are there?

* Here is a semi-ugly periodic table of visualization methods: [http://www.visual-literacy.org/periodic_table/periodic_table.html](http://www.visual-literacy.org/periodic_table/periodic_table.html)
* Here is a good website cataloguing many of the different types of visualizations: [http://www.datavizcatalogue.com](http://www.datavizcatalogue.com)
* For example, here is the information about [bar charts](http://www.datavizcatalogue.com/methods/bar_chart.html)
* When making a data visualization, you need to consider how various visual elements can convey the information you wish to show.
* There are many ways to represent quantitative information visually

## Visualizing Data

In [None]:
# load up our python libraries
# all will be explained later
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')

# create a bunch of random data
data = np.random.randn(1000)
data[0:10]

### Univariate or one dimensional data

* Histograms are useful because they show you how the values are distributed
* Not to be confused with bar charts!

In [None]:
# create a histogram of the data
plt.hist(data);
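
A histogram groups continuous values into bins and counts how many fall in each one, which is what separates it from a bar chart (one bar per category). Here is a minimal sketch of that idea. The seed, the `sample` name, and the choice of 30 bins are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.RandomState(42)
sample = rng.randn(1000)

# plt.hist returns the bin counts, the bin edges, and the drawn patches
counts, edges, patches = plt.hist(sample, bins=30)

# every value lands in exactly one bin, so the counts add up to the sample size
print(counts.sum())
```

Try a few different values for `bins` to see how much the shape of the histogram depends on it.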

### Bivariate or two dimensional data

* Scatter plots are useful for visualizing two dimensional numerical data

In [None]:
# create two lists of numerical data
x = np.linspace(0, 10, 30)
y = np.sin(x)

# plot the data as a series of points
plt.plot(x, y, 'o');

* Bar charts are useful for visualizing two dimensional data where one dimension is categorical

In [None]:
# create some categorical data
labels = ["cats", "dogs", "chickens", "spiders"]
x_pos = [1,2,3,4]
y = [1,5,2,5]

# plot as a bar chart
plt.bar(x_pos, y)
plt.xticks(x_pos, labels);

* And line charts can be used to represent continuous data

In [None]:
plt.plot(x, np.sin(x));

* So this plot has two dimensions
* How might we add more?

### Adding more dimensions?

* What if we want to visualize additional dimensions? 
* Color, size, and shapes add information from other dimensions

In [None]:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis')
plt.colorbar();  # show color scale

* How many dimensions are we displaying in this visualization?
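
The scatter plot above uses color and size but not shape. Here is a rough sketch of how marker shape could encode a categorical dimension. The `group` array is made-up data, and the marker choices are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(1)
x = rng.randn(60)
y = rng.randn(60)
# hypothetical category (0 or 1) for each point
group = rng.randint(0, 2, size=60)

# draw each category with its own marker shape
for value, marker in [(0, 'o'), (1, '^')]:
    mask = group == value
    plt.scatter(x[mask], y[mask], marker=marker, alpha=0.6,
                label=f"group {value}")

plt.legend();
```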

### Interacting with matplotlib

* Two interfaces (or "data structures" to be consistent with how we've been talking in these workshops)
    * `pyplot` - which is designed to resemble the *not pythonic* plotting interface of MATLAB
        * You programmatically issue commands to build your plot
    * The "object oriented" and more *pythonic* interfaces, consisting of 
        * `Figure` - a "data structure" that contains the figure. Has parameters and methods for setting the physical size of the visualizations, saving the visualization as an image to disk, etc. Usually stored in the `fig` variable.
        * `Axes` - a "data structure" that contains the specific elements of the visualization. You can have more than one axes if you want to create parallel plots. Has parameters and methods for setting the tick marks, legend, axis labels, etc. Usually stored in the `ax` variable.
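
To make the `Figure` and `Axes` idea concrete, here is a small sketch of the object oriented interface with one figure holding two axes. The data, titles, and figure size are illustrative choices, not part of the workshop material.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 30)

# one Figure holding two Axes side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# each Axes gets its own plot and its own labels
axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")

axes[1].plot(x, np.cos(x), 'o')
axes[1].set_title("cos(x)")
axes[1].set_xlabel("x")
```

Here `fig` controls things that belong to the whole image (size, saving), while each entry of `axes` controls one plot.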

### matplotlib and Jupyter Notebooks

* In order to use `matplotlib` with Jupyter Notebooks, you need to run this "magic" command
    * It is best to put it at the top of your notebook when you import libraries
* This command tells matplotlib to render the plots inside the notebook

In [None]:
%matplotlib inline

* The convention is to import the pyplot interface as `plt`

In [None]:
import matplotlib.pyplot as plt

* Now whenever you want to create a data visualization, you will call methods that are part of the `plt` "data structure"

In [None]:
# create two lists of numerical data
x = np.linspace(0, 10, 30)
y = np.sin(x)

# plot the data as a series of points
# the 'o' is a format string saying render as points
plt.plot(x, y, 'o');

* Without that `'o'` format string pyplot defaults to lines
* For more examples of format strings see the [matplotlib documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html)
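
A format string combines an optional color, marker, and line style in a few characters. The sketch below shows a handful of combinations; the particular curves are arbitrary and only there to keep the lines apart.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 30)

# color, marker, and line style can each be given or left out
plt.plot(x, np.sin(x), 'r-')        # red solid line
plt.plot(x, np.sin(x - 1), 'g--')   # green dashed line
plt.plot(x, np.sin(x - 2), 'b^')    # blue triangles, no connecting line
plt.plot(x, np.sin(x - 3), 'ks:');  # black squares joined by a dotted line
```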

In [None]:
# create two lists of numerical data
x = np.linspace(0, 10, 30)
y = np.sin(x)

# plot the data as a line (the default when there is no format string)
plt.plot(x, y);

* You can make subsequent calls using the `plt` object to modify your chart
* Note: Jupyter makes a new plot with each cell, so you must put all your commands for an individual chart in the same cell

In [None]:
# create some categorical data
labels = ["cats", "dogs", "chickens", "spiders"]
x_pos = [1,2,3,4]
y = [1,5,2,5]

# plot as a bar chart
plt.bar(x_pos, y);
#plt.xticks(x_pos, labels);

In [None]:
plt.xticks(x_pos, labels);

In [None]:
# create some categorical data
labels = ["cats", "dogs", "chickens", "spiders"]
x_pos = [1,2,3,4]
y = [1,5,2,5]

# plot as a bar chart
plt.bar(x_pos, y);
plt.xticks(x_pos, labels);

* Even if you use the other interface to build your viz, you start with `plt`
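
As a rough illustration of that last point, here is the same bar chart built with the object oriented interface. The call to `plt.subplots()` is the starting point, and everything after it goes through the `ax` object.

```python
import matplotlib.pyplot as plt

labels = ["cats", "dogs", "chickens", "spiders"]
x_pos = [1, 2, 3, 4]
y = [1, 5, 2, 5]

# plt.subplots() hands back the Figure and Axes objects to work with
fig, ax = plt.subplots()

# the Axes methods mirror the pyplot functions used above
ax.bar(x_pos, y)
ax.set_xticks(x_pos)
ax.set_xticklabels(labels);
```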

## Modifying your Chart



In [None]:
data = np.random.randn(30).cumsum()

In [None]:
plt.plot(data, 'ko--');

* Here is another way of saying the same thing without using a format string

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');

* You can add a legend to your chart by giving the line a `label`

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o', label='random walk');
plt.legend(loc='best');

* You can also set the title, axis labels, ticks, and tick labels using subsequent calls to the `plt` object

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30]);
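
The title and axis labels work the same way as the ticks. Here is a minimal sketch; the label text, the `walk` name, and the seed are placeholders chosen for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# same kind of random walk as above; the seed is only here for reproducibility
rng = np.random.RandomState(0)
walk = rng.randn(30).cumsum()

plt.plot(walk, color='k', linestyle='dashed', marker='o')
plt.title("A random walk")     # text above the plot
plt.xlabel("step")             # label for the x axis
plt.ylabel("cumulative sum");  # label for the y axis
```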

* You can also specify your own labels for the ticks

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30], ["start", "early", "middle", "end"]);

* There is an equivalent `plt.yticks()` for modifying the y axis.

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30], ["start", "early", "middle", "end"]);
plt.yticks([-5, 0, 5], ["low", "zero", "high"]);

* This plot still feels small, so let's make it bigger!

In [None]:
# note, you need to specify this before you run the plot function
plt.figure(figsize=(10,8))
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30], ["start", "early", "middle", "end"]);


## Saving plots to a file

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30], ["start", "early", "middle", "end"]);

plt.savefig('my-sweet-chart.png')

* Sweet! But what if we want to make it bigger?

In [None]:
plt.plot(data, color='k', linestyle='dashed', marker='o');
plt.xticks([0,10,20,30], ["start", "early", "middle", "end"]);

plt.savefig('my-sweet-chart-big.png', dpi=400, bbox_inches='tight')

* The format of the images can be specified by the file extension or with the `format` parameter
    * png, pdf, svg, ps, eps, and many more
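* The `dpi` parameter sets the resolution of the image in dots per inch. The default figure resolution is typically 100, which doesn't look pretty when printed
* The `bbox_inches` parameter controls which part of the figure is saved. Passing `'tight'` trims the extra whitespace around the figure.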
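
The sketch below shows both ways of choosing a format: once from the file extension and once with the `format` parameter. The file names and the temporary directory are assumptions made so the example leaves nothing behind in your working folder.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

walk = np.random.RandomState(0).randn(30).cumsum()

fig, ax = plt.subplots()
ax.plot(walk, color='k', linestyle='dashed', marker='o')

# hypothetical output location: a throwaway temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "chart.png")
svg_path = os.path.join(out_dir, "chart.svg")

# format inferred from the .png extension, with a higher resolution
fig.savefig(png_path, dpi=300, bbox_inches='tight')
# format given explicitly with the format parameter
fig.savefig(svg_path, format='svg', bbox_inches='tight')

print(os.path.exists(png_path), os.path.exists(svg_path))
```

Note that `dpi` only matters for raster formats like png. Vector formats like svg and pdf scale without losing quality.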