<DIV ALIGN=CENTER>

# Introduction to Data Visualizations
## Professor Robert J. Brunner
  
</DIV>  
-----

## Introduction to Python Visualizations

While there are a number of different options for plotting data in
[Python](http://www.python.org), we will use
[Matplotlib](http://matplotlib.org), which is one of the most widely
used Python plotting packages. In this lesson, we will focus on creating
basic plots by using matplotlib.


Note: When using an [IPython](http://ipython.org) Notebook, we will
always use the following statement to ensure that all figures generated
by MatPlotLib appear inline within the IPython notebook. If you are
writing a Python program outside of IPython, you should not use this
line. Instead, the figures will either be displayed by an appropriate
MatPlotLib visualization backend or be saved to a file.

-----

In [1]:
%matplotlib notebook

-----

## MatPlotLib Basics

The Matplotlib plotting library provides powerful data visualization
functionality. There are two basic methods for working with matplotlib.
The first method is to use the pylab interface, which provides a simple
interface to the power of matplotlib by mimicking the MATLAB approach.
The second method is to use the matplotlib pyplot interface, which is
the preferred method for Python programs as it provides
object-oriented access to the full matplotlib library. Since we will be
focusing on writing Python programs, we will use the pyplot access
method.

Thus our first step in using matplotlib will be to import the pyplot
interface:

```python
    import matplotlib.pyplot as plt
```

The second step will be to create a `Figure` object that will allow us to
control the global appearance of our visualization. This can include,
among other things, the size and the resolution (or dots per inch) of
the generated figure. We make a Figure object by calling the `figure()`
method within our pyplot interface:

```python
    fig = plt.figure()
```
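The size and resolution mentioned above can be set when the figure is created. The following is a minimal sketch, assuming the `figsize` and `dpi` keyword arguments of `figure()` and an arbitrary 6 by 4 inch figure at 100 dots per inch; the `Agg` backend line is an assumption that lets the sketch run as a plain script and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt

# figsize is (width, height) in inches; dpi is dots per inch.
# Both values are arbitrary choices for illustration.
fig = plt.figure(figsize=(6, 4), dpi=100)

width_in, height_in = fig.get_size_inches()
print(width_in, height_in, fig.get_dpi())

# The pixel dimensions follow from inches multiplied by dpi.
width_px = int(width_in * fig.get_dpi())
height_px = int(height_in * fig.get_dpi())
print(width_px, height_px)
```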

Next, we create an `Axes` object, which allows us to make an actual plot
within our figure. By separating out these concepts, we can easily add
multiple subplots to our figure, although for now we will simply stick
to a single plot. We can easily add a subplot to our `figure` object:

```python
    ax = fig.add_subplot()
```

Of course, for a quick and simple plot, the matplotlib library provides
a shortcut form that both creates the `Figure` object and creates a
single subplot `Axes` object:

```python
    fig, ax = plt.subplots()
```

We will use this simple technique for now, but later we will
switch to the more expressive technique of explicitly creating the
`Figure` and `Axes` objects to have more control over the resulting plot.

In the following code block, we demonstrate the basic concepts required
to create a simple plot.

-----

In [4]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()
plt.show()


-----

This plot is rather boring: we simply have an empty box with labels
on the bottom and left-hand sides. This box encloses your default plot
area, and the labels show that, by default, your first plot spans zero
to one in both the horizontal, or x-axis, and the vertical, or y-axis.

Our next step will be to actually display something. But first, I want
to briefly discuss our use of the IPython notebook. Since this course
will focus on the creation of Python programs, we will re-enter all
matplotlib Python commands in each IPython code cell in these notebooks.
Technically this is not necessary, since an IPython notebook carries
state through the notebook. What this means is that if you define a
variable or import a package in a cell, later cells in the notebook will
be able to use that variable or package. This makes a notebook a concise
visual representation of your work. However, if you rely on that shared
state, you can't simply cut and paste code from an IPython notebook into
a Python program.

Thus, we will treat each IPython code cell as a separate Python program
and include all relevant Python statements in that cell. We also will
explicitly call the matplotlib `show()` method, which generates and
displays our figure. By default, matplotlib does not actually create a
plot until it is forced to do so. While an IPython notebook will
generally do this automatically, this must be done explicitly within a
normal Python program. Later on we will discuss how to save a figure we
generate with matplotlib.

-----

## Plotting data with matplotlib

Now, given the `Axes` object, we can simply pass the data we wish to
visualize to the `plot()` method. In the following code sample, we use
numpy to create an array of linearly spaced values, which will be our
independent variable. Next, we define two constants, m and b, that will
be used to create our dependent variable. Given these two arrays, we can
plot this sequence of points, which follow a linear function, by passing
them to the `plot()` method.

Note that you can change the m and b values and reprocess the cell
(either by using the toolbar at the top, or by using the `shift+enter`
key combination within the cell).
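Before plotting, it can help to look at what `np.linspace` actually returns. The sketch below uses the same m and b values as this lesson; the variable `x_coarse` and its five-point spacing are illustrative additions. When the number of samples is omitted, `linspace` returns 50 evenly spaced values that include both endpoints.

```python
import numpy as np

# With no third argument, linspace returns 50 evenly spaced values,
# including both endpoints.
x = np.linspace(0, 10)

# Passing a third argument sets the number of samples explicitly.
# x_coarse is a hypothetical name used only in this sketch.
x_coarse = np.linspace(0, 10, 5)

m = -2
b = 5
y = m * x + b

print(len(x), x[0], x[-1])
print(x_coarse)
print(y[0], y[-1])
```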

----

In [5]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x = np.linspace(0,10)
y = m * x + b

ax.plot(x, y)
plt.show()


-----

### Adding text information

You have now actually plotted a linear function! But overall the plot
remains uninformative, so let's add information to make our plot
self-contained. First, we can place descriptive text along both axes.
This text is known as an axis label and can be easily added to your plot
by calling the appropriate set function on our `Axes` object:

```python
    ax.set_xlabel("X Axis")
    ax.set_ylabel("Y Axis")
```

Second, most plots benefit from having a title that gives context for
the rest of the data in the plot. You also might consider adding your
name to your plot, either in the title or within the plot box itself
(which will be discussed later).

```python
    ax.set_title("Our first plot!")
```

Together, these new methods will improve the readability of our plot, as
shown below, where we now have the two-dimensional plot with axis labels
and a title.

-----

In [7]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x = np.linspace(0,10)
y = m * x + b

ax.plot(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

ax.set_title("Our First Plot!")
plt.show()


-----

### Restricting the axes range

Matplotlib provides you with a great deal of control over the appearance
of your plot. Two items you may wish to change include the degree to
which your plot fills the subplot box, and how the numbers are displayed
on the axes. For the first item, you can change the range displayed in
either the x-axis, the y-axis or both by simply setting the limits
displayed by matplotlib. For instance, we can change the x-axis to show
-2 to 12, and the y-axis to show -20 to 10:

```python
    ax.set_xlim(-2, 12)
    ax.set_ylim(-20, 10)
```

In this case, we have told matplotlib to change our display to span from
-2 to 12 in the x-direction and from -20 to 10 in the y-direction.

You can also change how the numbers are displayed on the plot. These
numbers are placed at specific intervals along the axes, known as ticks.
Once again, you can control each axis independently and change only one
or both by using the appropriate set method:

```python
    ax.set_xticks(np.arange(0, 15, 5))
    ax.set_yticks(np.arange(-15, 10, 5))
```

In this example, we have used the `arange` function from numpy to create
an array of tick locations. The x-axis labels will now include 0, 5, and
10, while the y-axis labels will include -15, -10, -5, 0, and 5. Of
course, you can change these values to whatever you like, but be careful:
with too many ticks (and thus numerical labels as well) your plot can
become overly full and hard to read.
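One detail worth checking is that `np.arange` stops before its end value, which is why the stop values 15 and 10 do not appear among the ticks. The sketch below applies the same limits and ticks as above and reads them back; the `Agg` backend line is an assumption that lets it run as a plain script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

# Set the displayed range on each axis.
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 10)

# arange(start, stop, step) excludes the stop value itself.
xticks = np.arange(0, 15, 5)
yticks = np.arange(-15, 10, 5)
ax.set_xticks(xticks)
ax.set_yticks(yticks)

print(xticks.tolist())
print(yticks.tolist())
print(ax.get_xlim(), ax.get_ylim())
```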

-----

In [8]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x = np.linspace(0,10)
y = m * x + b

ax.plot(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 10)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 15, 5))
ax.set_yticks(np.arange(-15, 10, 5))
    
ax.set_title("Our Next Plot!")
plt.show()


-----

## Plotting multiple functions

We are not restricted to plotting only one item within a matplotlib
figure. We can either call plot twice with different data, or,
alternatively, pass multiple data to the plot method itself.

```python
    ax.plot(x1, y1)
    ax.plot(x2, y2)
```

or equivalently:

```python
    ax.plot(x1,y1, x2,y2)
```
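Either call style adds one line to the axes per x/y pair. As a small check, the sketch below uses the single-call form with the same data as this lesson and counts the resulting lines; the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

x1 = np.linspace(0, 10)
y1 = -2 * x1 + 5
x2 = x1
y2 = -1 * y1

# plot() returns a list with one line object per x/y pair.
lines = ax.plot(x1, y1, x2, y2)

print(len(lines))
print(len(ax.get_lines()))

# Each line is given its own color from the default color cycle.
print(lines[0].get_color(), lines[1].get_color())
```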

Note that when plotting multiple items, you may need to change your axis
limits or ticks to ensure everything is properly displayed. For
instance, in the demonstration below, we have changed our y-axis limits
to be -20 to 20, and our y-axis labels to now run to 15.

-----

In [9]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x1 = np.linspace(0,10)
y1 = m * x1 + b

x2 = x1
y2 = -1 * y1

# we can either plot each set of data separately as shown, or plot them all at 
# once by calling ax.plot(x1, y1, x2, y2)

ax.plot(x1, y1)
ax.plot(x2, y2)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 20)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 15, 5))
ax.set_yticks(np.arange(-15, 20, 5))
    
ax.set_title("Our Final Plot!")
plt.show()


-----

You should notice how matplotlib changed the plot color for the two
functions being displayed. While this was done automatically in this
case, you can explicitly change the color, the linestyle, the linewidth,
and many other plot characteristics should you so choose. We will cover
some of these topics in later modules, but you are encouraged to look at
the matplotlib documentation (and [example
gallery](http://matplotlib.org/gallery.html)) to learn more.

-----

## Saving your Plot

To this point, you have made several different plots and had them
displayed on the screen. In some cases, however, you might want to save
your plot to a file, which can then be displayed online, printed out, or
submitted as part of a programming assignment. To save a figure, you
call the `savefig()` method. This method takes several parameters, the
most important of which is the file name for your saved plot. In this
case, matplotlib will take the file extension you provide and select the
appropriate rendering method. For example, if your filename ends in
'pdf' the plot will be created using a PDF rendering function, while if
you use 'png' a different renderer will be used.

When saving your plot, either a PDF or PNG is preferred, as both formats
are portable. PDF is a vector graphics format, which means that the
instructions to make your plot are stored in the file as opposed to the
actual image of the plot. As a result, a PDF plot can be made larger or
smaller without affecting the image quality. PNG, by contrast, is a
losslessly compressed raster (pixel-based) format, so its quality depends
on the resolution at which it was saved.
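As a small aside, the sketch below writes the same figure to both formats so the effect of the file extension can be seen directly. The temporary directory, the file names `example.pdf` and `example.png`, and the `dpi=150` setting are all illustrative assumptions; the `Agg` backend line lets the sketch run as a plain script.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 10], [5, -15])

# Assumption: a throwaway directory, so no existing files are overwritten.
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "example.pdf")
png_path = os.path.join(out_dir, "example.png")

# The output format is chosen from the file extension.
fig.savefig(pdf_path)
# dpi controls the pixel resolution of raster output such as PNG.
fig.savefig(png_path, dpi=150)

print(os.path.getsize(pdf_path), os.path.getsize(png_path))
```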

In the example below, we save our plot by using a PDF renderer.

-----

In [10]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x1 = np.linspace(0,10)
y1 = m * x1 + b

x2 = x1
y2 = -1 * y1

# we can either plot each set of data separately as shown, or plot them all at 
# once by calling ax.plot(x1, y1, x2, y2)

ax.plot(x1, y1)
ax.plot(x2, y2)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 20)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 15, 5))
ax.set_yticks(np.arange(-15, 20, 5))
    
ax.set_title("Our Final Plot!")
plt.savefig('test.pdf')


-----

Notice how in the previous example, your plot was still displayed inline
as before. However, if you now navigate to the directory where this
IPython notebook is saved, you will see a new file called 'test.pdf'.
Open that file and you will see your new figure!

-----

In [11]:
!ls -la test.pdf

-rw-r--r-- 1 data_scientist staff 10152 Aug 17 18:05 test.pdf


## Scatter Plots

Scatter plots are a useful tool to visually explore the relationship
between two or more variables (or columns of data you may have read from
a file). The number of variables used in the plot corresponds to the
dimensionality of the plot. For practical purposes, most scatter plots
are two-dimensional, thus we will focus on two-dimensional scatter plots
in this module. In a two-dimensional scatter plot, the two variables are
displayed graphically inside a two-dimensional box. The horizontal
dimension is typically called the x-axis, while the vertical dimension
is called the y-axis. Each point in the data file is placed within this
two-dimensional box according to its particular x and y values. 

While we use the names x and y for the dimensions in this plot, in
practice the x and y values can be any two columns from your data file.
Thus, you can use a scatter plot to identify if there is a dependence
between any two variables in your data set. 

To make a scatter plot in Python, we can either use the `plot()` method
as discussed in the last module, or we can use the `scatter()` method.
Since the `scatter()` method provides greater flexibility when making
scatter plots, we will use it within this module. Throughout this
module, we will use the random module within the numpy library to
generate artificial data for plotting purposes. Thus, every time you run
this IPython notebook, or even just one cell within the notebook, you
will get different data and a different plot.
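If you would rather get the same data on every run, for example while debugging a plot, one option is to seed the random number generator first. This is an optional addition and is not used in the cells of this module; the seed value 42 is arbitrary.

```python
import numpy as np

# Seeding makes the sequence of random numbers repeatable.
# The seed value is an arbitrary choice for this sketch.
np.random.seed(42)
first = np.random.uniform(-10, 10, 5)

# Re-seeding with the same value reproduces the same numbers.
np.random.seed(42)
second = np.random.uniform(-10, 10, 5)

print(first)
print(np.array_equal(first, second))
```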

### Positive Correlation

In the first plot, we show a scatter plot where the vertical values, or
y-axis, display an increase as the horizontal value, or x-axis,
increases. This type of dependence is known as a positive correlation.

In [12]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100)
y = x + np.random.uniform(-10, 10, 50)

ax.scatter(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 120)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))
    
ax.set_title("A positive correlation scatter plot!")

plt.show()


-----

### Negative Correlation

In the second plot, we see the opposite effect, where the vertical
values tend to decrease as the horizontal values increase. This type of
dependence is known as a negative correlation.

-----

In [13]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100)
y = 100 - x + np.random.uniform(-10, 10, 50)

ax.scatter(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 120)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))
    
ax.set_title("A negative correlation scatter plot!")

plt.show()


----- 
### Null Correlation

In many cases, a scatter plot shows no obvious trend between the two
variables being plotted. In this case, we have a null (or no)
correlation.
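These three cases can also be summarized with a single number. The sketch below is an addition to the lesson: it assumes the Pearson correlation coefficient, as computed by numpy's `corrcoef` function, is a suitable measure, and it fixes an arbitrary seed so the output is repeatable. Values near +1 and -1 indicate strong positive and negative correlation, while values near 0 indicate little or no linear trend.

```python
import numpy as np

np.random.seed(0)  # assumption: arbitrary fixed seed for repeatable output

x = np.linspace(0, 100)
y_pos = x + np.random.uniform(-10, 10, 50)
y_neg = 100 - x + np.random.uniform(-10, 10, 50)

x_null = np.random.uniform(0, 100, 50)
y_null = np.random.uniform(0, 100, 50)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_null = np.corrcoef(x_null, y_null)[0, 1]

print("positive: %.2f" % r_pos)
print("negative: %.2f" % r_neg)
print("null:     %.2f" % r_null)
```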

-----

In [14]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.random.uniform(0, 100, 50)
y = np.random.uniform(0, 100, 50)

ax.scatter(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 120)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))
    
ax.set_title("A null correlation scatter plot!")

plt.show()


-----

### Outlier Detection

One final benefit of making a scatter plot is that it can be easy to
identify points that are outliers, or significantly different from the
typical trend shown by the majority of the points in the plot. For
example, in the following plot, there are two points with low values of
the x variable that have abnormally large y values, at least compared to
the rest of the data points. In some cases, these points will indicate
an error in data collection; while in other cases they may simply
reflect a lack of knowledge about a certain part of a problem.

-----

In [15]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# First we create 50 linearly spaced x values
x = np.linspace(0,100)

# Second, we create 50 y values that are linearly related to the x values
y = x + np.random.uniform(-10, 10, 50)
 
# Now we change two points to be outliers
y[2] = 60
y[6] = 75

ax.scatter(x, y)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 120)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))
    
ax.set_title("An outlier detection scatter plot!")

plt.show()


-----

The ability to visually __see__ a trend or __spot__ outlier points in a
scatter plot makes it an important tool for a data scientist. Later in
this course we will discuss using scatter plots to compare a model to
the actual data. But for now we will focus on making more complex
scatter plots.
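The two outliers from the previous example can also be flagged programmatically. The sketch below is a simple illustration, not a general outlier detector: it assumes we already know the underlying trend is y = x and that the noise never exceeds 10 in size, so any point more than 15 away from the trend (an arbitrary threshold) is reported.

```python
import numpy as np

np.random.seed(1)  # assumption: arbitrary fixed seed for repeatable output

# Same construction as the outlier example above.
x = np.linspace(0, 100)
y = x + np.random.uniform(-10, 10, 50)
y[2] = 60
y[6] = 75

# Distance of each point from the assumed trend y = x.
residual = np.abs(y - x)

# Assumption: the noise is bounded by 10, so 15 is a safe cutoff here.
threshold = 15
outlier_idx = np.flatnonzero(residual > threshold)

print(outlier_idx.tolist())
print(x[outlier_idx], y[outlier_idx])
```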

-----

### Comparing data with a scatter plot

The scatter plots shown previously have all been rather plain, but as
shown in the last module, we can plot multiple items within the same
matplotlib figure. For example, if we have a function that we wish to
compare to our data, we can use the `plot()` method to place the
function over our data plot. 

Another option would be to color points differently within a scatter
plot, based on a third dimension. For example, we might read three
columns from a file containing age, height, and gender. By using a
scatter plot to show age versus height and coloring the points
differently based on the gender, we can explore trends in more than two
dimensions.
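A minimal sketch of this idea follows. The data are entirely made up: `age`, `group`, and `height` are hypothetical columns generated at random, with `group` standing in for any two-valued category. Each category is drawn with its own `scatter()` call and color; the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(2)  # assumption: arbitrary fixed seed

# Hypothetical columns, generated only for illustration.
age = np.random.uniform(5, 18, 40)
group = np.random.randint(0, 2, 40)  # category coded as 0 or 1
height = 80 + 5 * age + 4 * group + np.random.uniform(-5, 5, 40)

fig, ax = plt.subplots()

# Select each category with a boolean mask and give it its own color.
for value, color in [(0, 'red'), (1, 'green')]:
    mask = group == value
    ax.scatter(age[mask], height[mask], color=color)

ax.set_xlabel("Age")
ax.set_ylabel("Height")

print(len(ax.collections))
```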

We employ both of these techniques in the following plot, where we
display a positive and negative correlation and overplot a function for
the positive correlation.

-----

In [16]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100)
y1 = x + np.random.uniform(-10, 10, 50)
y2 = 100 - x + np.random.uniform(-10, 10, 50)

ax.scatter(x, y1, color='red', marker='s')
ax.scatter(x, y2, color='green', marker='d')
ax.plot(x, x, color='blue', linestyle='dashed')

# We can replace the previous command with the following shortcut
#ax.plot(x, x, 'b--')

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 120)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))
    
ax.set_title("A complex correlation scatter plot!")

plt.show()


----

In the previous plot, we not only displayed two different types of data
that were differentiated by their color, we also changed the type of
marker used for each point. The range of [color
options](http://matplotlib.org/api/colors_api.html#module-matplotlib.colors)
that you can use within matplotlib is quite large. When choosing
colors, be sure to keep an eye on the overall design of your plot.
Certain colors go better together than others, and you want to ensure
viewers focus on the information content of your visualizations.
Likewise, there are a number of different [marker
types](http://matplotlib.org/api/markers_api.html) you can use within
your plots.

For a full description of the options available to the `scatter()`
method, see the appropriate [matplotlib
documentation](http://matplotlib.org/api/axes_api.html?highlight=scatter#matplotlib.axes.Axes.scatter). 

Finally, when we overplotted the function, y = x, we used the `plot()`
method and specified both the color and the linestyle. The
matplotlib documentation provides more information on the
[linestyle](http://matplotlib.org/api/artist_api.html#matplotlib.lines.Line2D.set_linestyle)
and other options for the
[plot](http://matplotlib.org/api/axes_api.html?highlight=plot#matplotlib.axes.Axes.plot)
method. Notice that abbreviations exist for many of
these option combinations. For example, you can specify a blue dashed
line by using

    plot(x, x, 'b--')
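To see that the abbreviation and the keyword form describe the same line, the sketch below draws both and compares their properties. The names `long_form` and `short_form` are illustrative, and the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 100)

# plot() returns a list of lines; unpack the single line from each call.
(long_form,) = ax.plot(x, x, color='blue', linestyle='dashed')
(short_form,) = ax.plot(x, x, 'b--')

# Convert both colors to RGBA so that 'blue' and 'b' compare equal.
same_color = (mcolors.to_rgba(long_form.get_color())
              == mcolors.to_rgba(short_form.get_color()))
same_style = long_form.get_linestyle() == short_form.get_linestyle()

print(same_color, same_style)
```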

-----

### Labeling Data

When overplotting multiple data or functions, it is generally a good
idea to label them so the viewer can quickly understand the differences.
This can be easily done via matplotlib by simply adding the `label`
keyword argument to each plotting command. For example, we can modify
our previous three plotting commands to have descriptive labels:

```python
    ax.scatter(x, y1, color='red', marker='s', label='positive')
    ax.scatter(x, y2, color='green', marker='d', label='negative')
    ax.plot(x, x, color='blue', linestyle='dashed', label='function')
```

The `legend()` method can be used to label these data within our plot,
as shown below.

-----

In [17]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100)
y1 = x + np.random.uniform(-10, 10, 50)
y2 = 100 - x + np.random.uniform(-10, 10, 50)

# We have now labeled our plot data
ax.scatter(x, y1, color='red', marker='s', label='positive')
ax.scatter(x, y2, color='green', marker='d', label='negative')
ax.plot(x, x, color='blue', linestyle='dashed', label='function')

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-20, 120)
ax.set_ylim(-20, 140)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 120, 20))
ax.set_yticks(np.arange(0, 120, 20))

ax.legend(loc='upper center')
    
ax.set_title("A complex correlation scatter plot!")

plt.show()


-----

In this plot, we now have multiple data overplotted, with a legend that
allows the viewer to understand the differences between the different
plot components. Notice how we first increased the range on the y-axis
to provide room for the legend, and second how we specified the location
of the legend by using the `loc` attribute, which in this case we set to
'upper center'. By default, the `legend()` method will display all plot
components that have a distinct label assigned. You can, however, control
the behavior of this method in a number of different manners, as
detailed in the [matplotlib
documentation](http://matplotlib.org/api/axes_api.html?highlight=legend#matplotlib.axes.Axes.legend).
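As a quick check of which components end up in the legend, the sketch below repeats the three labeled commands, adds one unlabeled line (an addition for illustration), and then asks the axes which labels it will use. Straight lines replace the random data to keep the sketch short; the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 100)

ax.scatter(x, x, color='red', marker='s', label='positive')
ax.scatter(x, 100 - x, color='green', marker='d', label='negative')
ax.plot(x, x, color='blue', linestyle='dashed', label='function')

# This line has no label, so it is left out of the legend.
ax.plot(x, 0 * x + 50, color='gray')

handles, labels = ax.get_legend_handles_labels()
legend = ax.legend(loc='upper center')

print(sorted(labels))
print(len(legend.get_texts()))
```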

-----


### Multiple Plots

We can compare multiple variables (or data columns) by using the subplot
functionality within matplotlib. This allows us to make a scatterplot
spreadsheet, where different variables are compared in different plots.
When creating subplots in matplotlib, we use the `add_subplot()` method,
which takes three parameters. The first two parameters specify the number
of rows and columns to use in the plot array, while the last parameter
specifies which subplot is currently active. We demonstrate this within
the following code, which first defines a function to generate a plot,
and then calls this function four times to make four different plots.

While powerful, this technique can lead to confusion if not done
properly. Care should be taken to make sure the axis labels are not
overlapping the plot ticks. You can use the `subplots_adjust()` method
to provide extra space for either the width, height, or both via the
`wspace` and `hspace` parameters.
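An alternative to repeated `add_subplot()` calls is to ask `plt.subplots()` for the whole grid at once. The sketch below is one possible arrangement rather than the approach used in this lesson: it builds a 2 by 2 grid, applies the same spacing values, and fills each plot in a loop. The `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a plain script
import matplotlib.pyplot as plt
import numpy as np

# With two rows and two columns, axes is a 2x2 numpy array of Axes objects.
fig, axes = plt.subplots(2, 2)
fig.subplots_adjust(wspace=0.2, hspace=0.3)

colors = ['black', 'green', 'blue', 'red']

# axes.flat visits the subplots row by row.
for i, (ax, c) in enumerate(zip(axes.flat, colors), start=1):
    x = np.random.uniform(-5, 5, 100)
    y = np.random.uniform(-5, 5, 100)
    ax.scatter(x, y, color=c)
    ax.set_title("Figure %d" % i)
    ax.set_xlim(-6, 6)
    ax.set_ylim(-6, 6)

print(axes.shape, len(fig.axes))
```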

-----

In [18]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig = plt.figure()
fig.subplots_adjust(wspace=0.2, hspace=0.3)

# We now define a function to make a random scatter plot within the current subplot
def makePlot(ax, t, c):
    x = np.random.uniform(-5, 5, 100)
    y = np.random.uniform(-5, 5, 100)
    ax.scatter(x, y, color=c)
    ax.set_title(t)
    ax.set_xlim(-6, 6)
    ax.set_ylim(-6, 6)
    ax.set_xticks(np.arange(-4, 5, 2))
    ax.set_yticks(np.arange(-4, 5, 2))

# We now make new subplots, and populate them accordingly
ax1 = fig.add_subplot(2, 2, 1)
makePlot(ax1, "Figure 1", 'black')

ax2 = fig.add_subplot(2, 2, 2)
makePlot(ax2, "Figure 2", 'green')

ax3 = fig.add_subplot(2, 2, 3)
makePlot(ax3, "Figure 3", 'blue')

ax4 = fig.add_subplot(2, 2, 4)
makePlot(ax4, "Figure 4", 'red')

plt.show()


-----

You should change attributes in the last plot in order to better learn
how the
[scatter](http://matplotlib.org/api/axes_api.html?highlight=scatter#matplotlib.axes.Axes.scatter)
method works within matplotlib.

-----

## Histograms

A [histogram](http://en.wikipedia.org/wiki/Histogram) is a binned
representation of a data set. As a result, it provides a concise
representation of the data along one dimension, where the size of the
representation is determined solely by the number of bins used and not
the total number of data points. This means it can be used to provide
a concise summary of a very large data set.

### Binning

Sometimes the binning can be determined easily, for example, months of
the year or days of the week might provide natural bins. Other times,
both the number of bins and the bin ranges will need to be determined
before the histogram is constructed. A general rule of thumb is that if
you have N data points you should have roughly the cube root of N bins.
The following code summarizes the results from this formula:

-----

In [19]:
# We need to import the math library for the ceil method, which returns the next largest 
# integer to a floating point value

import math as mt

# We want to loop from 10 to 10,000,000
for i in range(1,8):
    
    # Now print out the integer value, and the number of bins
    # We used a math trick here, (10**i)**(1/3) = 10**(i/3)
    print("%9d\t%4d\n" % (10**i, mt.ceil(pow(10, i/3.))))

       10	   3

      100	   5

     1000	  10

    10000	  22

   100000	  47

  1000000	 100

 10000000	 216



-----

### Making a Histogram

Now we are in position to actually demonstrate how to make and display a
histogram by using matplotlib. We first will need data, which for this
example we create by randomly generating data. We also start by using
the default number of bins, which is ten, and bin range, which is the
minimum and maximum data values.

-----

In [20]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need lots of data 
# that are randomly sampled from a particular function.

x = np.linspace(0, 100, 10000)
y = x + np.random.uniform(-10, 10, 10000)

# Now we want to make a default histogram
ax.hist(y)

# Set our axis labels and plot title
ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")

#Show final result
plt.show()


-----

In this simple example, we first construct random data in a similar
manner as we did when making scatter plots in the previous lesson. We
first create an array of 10,000 elements that are linearly spaced
between 0 and 100. This array is used to create a second array, where
each value has now been perturbed by a randomly selected value between
-10 and 10. If we did not do this random perturbation the histogram
would be flat since we have the same number of values in each bin.
Matplotlib automatically creates the ten bins and computes the frequency
with which values in the input data set occur in each bin and plots the
results.
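The binning itself can be inspected without drawing anything. The sketch below uses numpy's `histogram` function, which by default also uses ten equal-width bins spanning the minimum and maximum of the data, on the same kind of data as above. The fixed seed is an arbitrary assumption for repeatable output.

```python
import numpy as np

np.random.seed(3)  # assumption: arbitrary fixed seed for repeatable output

x = np.linspace(0, 100, 10000)
y = x + np.random.uniform(-10, 10, 10000)

# counts holds the number of values in each bin;
# edges holds the bin boundaries, one more than the number of bins.
counts, edges = np.histogram(y)

print(counts)
print(edges)
print(counts.sum())
```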

### Histogram Options

Now that our first histogram is completed, we can look at changing the
default selections, such as the number of bins, the bin locations, and
the color and style of the histogram bins. These values can all be
changed by passing parameters into the histogram function.

First, the number and locations of the bins used to construct the
histogram can be specified by using the `bins` parameter. There are
several different ways to control this parameter:

- `bins = 22` will give twenty-two bins.
- `bins = (0,20,90,100)` will produce three bins that span 0-20, 20-90,
and 90-100, respectively.
- `bins = np.linspace(0, 100, 101)` will produce one hundred bins
linearly spaced between zero and one hundred (101 bin edges define 100
bins).
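
The relationship between a bin specification and the resulting number of
bins can be checked without drawing anything. The following is a small
sketch that uses `np.histogram`, which accepts the same kinds of `bins`
values; the seed and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # Illustrative seed
y = np.linspace(0, 100, 10000) + rng.uniform(-10, 10, 10000)

# An integer gives that many equal-width bins over the data range
counts_int, edges_int = np.histogram(y, bins=22)

# A sequence gives the bin edges directly: N edges define N - 1 bins
counts_seq, edges_seq = np.histogram(y, bins=(0, 20, 90, 100))

# 101 evenly spaced edges define 100 bins
counts_lin, edges_lin = np.histogram(y, bins=np.linspace(0, 100, 101))

print(len(counts_int), len(counts_seq), len(counts_lin))  # prints: 22 3 100
```

Note that when explicit edges are given, values that fall outside the
first and last edge are not counted.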

Second, there are four different types of histogram plots that you can
make: `bar`, `barstacked`, `step`, and `stepfilled`, with `bar` being
the default value. Third, you can specify the line or fill color of the
bins by defining the `color` parameter. For example, `color =
'BurlyWood'` will set the histogram color to be the named HTML color
`BurlyWood`. For more color examples, see the [HTML color
name](http://www.w3schools.com/html/html_colornames.asp) page; other
colors that you might try include AntiqueWhite, DarkSalmon,
DarkTurquoise, IndianRed, or PeachPuff. Use the following code example
to see how these parameters change the appearance of the sample
histogram.

-----

In [21]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need lots of data 
# that are randomly sampled from a particular function.

x = np.linspace(0, 100, 10000)
y = x + np.random.uniform(-10, 10, 10000)

# Now we want to make a modified histogram
ax.hist(y, bins=20, histtype='stepfilled', color='BurlyWood')

# Set our axis labels and plot title
ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")

#Show final result
plt.show()


-----

### Histogram Range

Sometimes the frequency counts can vary dramatically between bins. In
that case, it is often convenient to change the presentation of the
frequency counts to make different bin counts easier to distinguish.
This can easily be accomplished by changing the vertical axis to display
the logarithm of the frequency count, which is done by setting the
optional parameter `log` to `True`. In the following example, see how
the generated histogram changes when you switch to a log histogram by
setting `log=True` in the histogram method.

-----

In [22]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled, but we want them to be non-uniform

x = np.sqrt(np.linspace(0, 10000, 10000))
y = x + np.random.uniform(-10, 10, 10000)

ax.hist(y, bins=20 , histtype='step', color='BurlyWood', log=False)

ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_xlim(-15, 115)
plt.show()

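
As a complement to the cell above, the linear and logarithmic versions
can be drawn next to each other for a direct comparison. This is only a
sketch: the two-panel layout, the seed, and the names `ax_lin` and
`ax_log` are illustrative choices rather than part of the lesson.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # Illustrative seed
y = np.sqrt(np.linspace(0, 10000, 10000)) + rng.uniform(-10, 10, 10000)

# Two panels: linear counts on the left, logarithmic counts on the right
fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.hist(y, bins=20, histtype='step', color='BurlyWood', log=False)
ax_lin.set_title("Linear Frequency")

ax_log.hist(y, bins=20, histtype='step', color='BurlyWood', log=True)
ax_log.set_title("Log Frequency")

# Both panels share the same axis labels
for ax in (ax_lin, ax_log):
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")

plt.show()
```

In the logarithmic panel, the sparsely populated bins at low values
should be easier to tell apart from one another.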

-----

### Probability

Another use of a histogram is to display the normalized frequency
counts, which can also be interpreted as the probability that a
particular value lies within a given bin. Formally, this is computed by
dividing the bin counts by the total counts (or frequencies of
occurrence by the total number of occurrences). With matplotlib, we
simply need to set the optional `density` parameter to `True` (older
matplotlib releases called this parameter `normed`, which has since been
removed). The values on the y-axis will now be a probability density,
normalized so that the total area of the bars equals one. To obtain the
probability for a value to lie within a bin, the bar height must
therefore be multiplied by the width of the bin (as can be seen in the
examples).

Note that we now change the y-axis label to __Probability__ to reflect
the change in what is being displayed.
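
The relationship between bar height, bin width, and probability can be
verified numerically. The sketch below uses NumPy's `np.histogram` with
`density=True`, which applies the same normalization; the seed and
variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # Illustrative seed
y1 = np.linspace(0, 100, 10000) + rng.uniform(-10, 10, 10000)

# Normalized histogram: heights are densities, not probabilities
density, edges = np.histogram(y1, bins=20, density=True)
widths = np.diff(edges)

# Probability per bin is height times width, and these sum to one
prob_per_bin = density * widths
print(prob_per_bin.sum())

# The same probabilities follow from the raw counts divided by the total
counts, _ = np.histogram(y1, bins=20)
print(np.allclose(prob_per_bin, counts / counts.sum()))
```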

-----

In [23]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100, 10000)
y1 = x + np.random.uniform(-10, 10, 10000)
y2 = x + np.random.uniform(-25, 25, 10000)

ax.hist(y1, bins=20, histtype='bar', density=True, color='BurlyWood')

# Complete the plot, but change the y-axis label accordingly
ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
ax.set_xlim(-15, 115)
plt.show()


-----

### Multiple Histograms

In certain cases, it may be instructive to compare two distributions
directly within the same plot. For example, if you have computed a
histogram of the ages of people in a population, you might want to
differentiate the male and female populations in separate histograms for
comparison. This can easily be done by simply overplotting two
histograms. In the following example, we create two populations, y1 and
y2, and display their histograms within the same plot window. Note that
by default, the two histograms will be overplotted, so to allow both to
be seen, we set the `alpha` parameter in the second one, which makes the
second histogram somewhat transparent (based on the value assigned to
the `alpha` parameter).

We also assign a label to each histogram, so that the `legend` method
can be used to differentiate between the two histograms.

-----

In [24]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100, 10000)
y1 = x + np.random.uniform(-10, 10, 10000)
y2 = x + np.random.uniform(-25, 25, 10000)

mybins = np.linspace(-20,120,20)
ax.hist(y1, bins=mybins, histtype='bar', density=True, color='BurlyWood', label='Type A')
ax.hist(y2, bins=mybins, histtype='bar', density=True, color='IndianRed', label='Type B', alpha=0.5)

# Complete the plot, and include a legend.

ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
ax.set_xlim(-35, 135)
ax.legend()
plt.show()




-----

Multiple histograms can also be plotted side-by-side, which can often
simplify a comparison. In the following example, we plot three
histograms side-by-side. In this case, we pass the three data sets as a
list to the same histogram method call, which means we also need to pass
the colors and labels as a list. All other parameters will be assigned
equally to the three histograms.

In this sample code, we have also used the
[figure method](http://matplotlib.org/api/figure_api.html#matplotlib.figure.Figure)
to specify a larger plot window (which is helpful for multiple
histograms), and also used the attributes in the
[add_subplot method](http://matplotlib.org/api/figure_api.html#matplotlib.figure.Figure.add_subplot)
to specify only one subplot, which should have an 'Ivory' background.

-----

In [25]:
# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, facecolor='Ivory')

# Now we generate something to plot. In this case, we will need data 
# that are randomly sampled from a particular function.

x = np.linspace(0,100, 10000)
y1 = x + np.random.uniform(-10, 10, 10000)
y2 = x + np.random.uniform(-25, 25, 10000)
y3 = x + np.random.uniform(-35, 35, 10000)

ax.hist((y1, y2, y3), bins=15, histtype='bar', density=True,
        color=('BurlyWood', 'IndianRed', 'DeepSkyBlue'), label=('Type A', 'Type B', 'Type C'))

# Complete the plot

ax.set_title("Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
ax.set_xlim(-45, 145)
ax.legend()
plt.show()


-----

In all of the sample code provided in this notebook, we have ignored the
return values from the `hist` method. In truth, this function returns
three items:

* `n`, the count of values within each bin (or the normalized value when
the histogram is normalized). The length of this array is the same as
the number of bins used to make the histogram.
* `bins`, the bin edges. The number of edges is one more than the number
of bins.
* `patches`, which are the matplotlib plotting objects used to draw the
bins; you will generally ignore these.

The first two arrays can often prove useful if you want to operate on
the histogrammed data (in addition to plotting them). To use these data,
you simply capture the returned values:

```python
n, bins, patches = ax.hist(y)
```
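
As one illustration of what can be done with the captured values, the
sketch below computes the bin centers from the edges and locates the
most populated bin. The seed, the choice of twenty bins, and the names
`centers` and `peak` are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # Illustrative seed
y = np.linspace(0, 100, 10000) + rng.uniform(-10, 10, 10000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(y, bins=20, color='BurlyWood')

# Midpoint of each bin, computed from consecutive edges
centers = 0.5 * (bins[:-1] + bins[1:])

# Index of the most populated bin
peak = np.argmax(n)
print(f"Fullest bin: {bins[peak]:.1f} to {bins[peak + 1]:.1f} "
      f"with {int(n[peak])} values")

# Mark the bin counts at the bin centers on top of the histogram
ax.plot(centers, n, 'o', color='IndianRed')
plt.show()
```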

-----

## Introduction to Seaborn

While the MatPlotLib library provides powerful plotting functionality to
the Python programmer, the MatPlotLib API can be rather daunting for
beginners, and the basic color schemes and plot styles are not the most
visually appealing. For anyone who has been exposed to the work and
ideas of [Edward Tufte][i], the importance of making good data
visualizations cannot be overstated. As a result, other libraries have
been developed that build on the MatPlotLib legacy and provide both
improved visual aesthetics and new plot functions. Of these, one of the
most interesting is the [Seaborn][1] library.

Seaborn is easy to start using, yet introduces a number of powerful new
plot styles along with statistical functionality that can make difficult
plotting tasks simple. Seaborn also simplifies the task of removing plot
clutter, and you can choose from different color palettes that have
already been screened to maximize the visual impact of your plots.
Seaborn can be easily imported into your Python program by using
`import seaborn as sns`, and with one function call to `sns.set()` your
plot colors and styles can be improved. The `set` method takes a number
of parameters like `style` or `font` that you can also specify
explicitly, as shown later in this Notebook.

-----
[i]: http://www.edwardtufte.com/tufte/
[1]: http://web.stanford.edu/~mwaskom/software/seaborn/index.html

In [26]:
# Start using the Seaborn library. Note that this will remain in effect in your 
# IPython Notebook until the Kernel is restarted or you restore the original
# settings, for example by calling sns.reset_orig().

import seaborn as sns

sns.set()

-----

To demonstrate the impact of these two Python statements, we can simply
repeat one of the first Python visualization examples from earlier in
this IPython Notebook. In this case we are simply plotting two lines,
along with labels for the `x` and `y` axes and a title for the plot.
Compare this version with the same plot created by only using the
MatPlotLib library.

-----

In [27]:
# Repeat this plot, but now with Seaborn default style

# First we need to import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x1 = np.linspace(0,10)
y1 = m * x1 + b

x2 = x1
y2 = -1 * y1

# we can either plot each set of data separately as shown, or plot them all at 
# once by calling ax.plot(x1, y1, x2, y2)

ax.plot(x1, y1)
ax.plot(x2, y2)

# Set our axis labels
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")

# Change the axis limits displayed in our plot
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 20)

# Change the ticks on each axis and the corresponding numerical values that are displayed
ax.set_xticks(np.arange(0, 15, 5))
ax.set_yticks(np.arange(-15, 20, 5))
    
ax.set_title("Our Final Plot!")
plt.show()


-----

In the previous plot, you saw how simply using the _Seaborn_ plotting
library changed the plot appearance. This is one of the primary
benefits of using Seaborn: the color schemes and plotting styles have
already been well thought out. In this case, a grey background with
soft lines at the tick marks is used along with a specific font type.
However, Seaborn also provides the capability of changing the plot
appearance, both by using other predefined styles and by modifying
specific features in the plot.

In the following code cell, we demonstrate how to change the overall
appearance of the plot by using a white style with tick marks, and a
specific Seaborn context, which can be one of four predefined types:

- `notebook`
- `paper`
- `talk`
- `poster`

A context can also be applied temporarily in a `with` statement by using
`sns.plotting_context`. We also use the `despine` method, which can
remove the box appearance of the plot, and we change the font size,
resulting in a very different plot appearance.

-----

In [28]:
fs = 24 # Font size

# Seaborn style and context settings only affect plot elements created
# after they are set, so we apply them before creating the figure.
sns.set_style("ticks")     # White background with tick marks
sns.set_context("poster")

# Now we create our figure and axes for the plot we will make.
fig, ax = plt.subplots()

# Now we generate something to plot. In this case, we will plot a straight line.
# You can change the constants: m and b, in the equation below to get a different plot.

m = -2
b = 5
x1 = np.linspace(0,10)
y1 = m * x1 + b

x2 = x1
y2 = -1 * y1

# We plot each set of data separately so that each line gets its own
# label and line style.

ax.plot(x1, y1, label = 'Plot 1', ls = ':')
ax.plot(x2, y2, label = 'Plot 2', ls = '--')

# Set our axis labels
ax.set_xlabel("X Axis", size=fs)
ax.set_ylabel("Y Axis", size=fs)

ax.legend(loc='upper left', fontsize=fs)

# Change the axis limits displayed in our plot
ax.set_xlim(-2, 12)
ax.set_ylim(-20, 20)

ax.set_title("Our Final Plot!")

# Remove the top and right spines. This is done after plotting so that
# trim=True can use the final tick positions.
sns.despine(offset=10, trim=True)

plt.show()

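
The `with` statement usage mentioned earlier can be sketched as follows.
This is a minimal example, assuming the Seaborn functions
`sns.axes_style` and `sns.plotting_context`, which return settings that
can act as context managers; the variables `font_before`, `font_inside`,
and `font_after` are introduced here only to show that the change is
temporary.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

x = np.linspace(0, 10)

font_before = plt.rcParams["font.size"]

# The style and context apply only to axes created inside the with block
with sns.axes_style("ticks"), sns.plotting_context("poster"):
    font_inside = plt.rcParams["font.size"]

    fig, ax = plt.subplots()
    ax.plot(x, -2 * x + 5, label='Plot 1', ls=':')
    ax.plot(x, 2 * x - 5, label='Plot 2', ls='--')
    ax.set_xlabel("X Axis")
    ax.set_ylabel("Y Axis")
    ax.legend(loc='upper left')

    # Remove the top and right spines after the lines have been drawn
    sns.despine(offset=10, trim=True)

# Outside the block, the previous settings are back in effect
font_after = plt.rcParams["font.size"]
print(font_before, font_inside, font_after)

plt.show()
```

Unlike `sns.set_style` and `sns.set_context`, this approach leaves the
settings for any later plots in the Notebook unchanged.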

### Additional References

1. The [MatPlotLib][1] website, which contains a gallery of interesting examples and the official project documentation.
2. The [Seaborn][2] website, which includes example plots with source code.
3. A [Matplotlib Tutorial][3] that covers some topics beyond those presented herein.
4. The official [Seaborn Tutorial][4], which contains full details on this library, most of which are beyond the level of this current lesson. We will come back to some of the more advanced features of Seaborn in later lessons.

-----

[1]: http://matplotlib.org
[2]: http://web.stanford.edu/~mwaskom/software/seaborn/index.html
[3]: http://nbviewer.ipython.org/github/WeatherGod/AnatomyOfMatplotlib/blob/master/AnatomyOfMatplotlib-Part1-pyplot.ipynb

[4]: http://web.stanford.edu/~mwaskom/software/seaborn/tutorial.html





-----