In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame

import expectexception

# NumPy


NumPy (pronounced `NUM-PIE`) is a powerful Python library for working with numerical data. It is widely used for scientific computing in Python. When used correctly, it can be significantly faster than ordinary Python code. It isn't that ordinary Python code is anything less than awesome; it is because NumPy adds certain constraints to the data structures that enable faster computations.

Consider this Python list of lists of numbers:

In [None]:
np.random.seed(42)

test_data = np.random.rand(40, 5).tolist()

test_data[:4]

We can index into this nested data structure in the usual manner:

In [None]:
test_data[3][2]

If we want to calculate the sum of all the numbers, we can do that with a list comprehension.

In [None]:
%%timeit

sum([sum(r) for r in test_data])

Now let's put the same data into a NumPy array data structure. We can do that as follows:

In [None]:
test_np_array = np.array(test_data)

In [None]:
%%timeit

test_np_array.sum()

It's faster! And there are bigger performance improvements if we make the test data larger.
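
A minimal sketch of that scaling claim, using the standard library's `timeit` module so it also runs outside a notebook. The array size (2000 x 50), the repeat count, and the names `big_array` and `big_list` are arbitrary choices for illustration and are not part of the notebook above. The timings you see will depend on your machine.

```python
import timeit

import numpy as np

# Assumed test size for illustration: 2000 rows of 50 random floats.
rng = np.random.default_rng(42)
big_array = rng.random((2000, 50))
big_list = big_array.tolist()

# Both approaches compute the same total.
list_total = sum(sum(r) for r in big_list)
numpy_total = big_array.sum()

# Time 20 repetitions of each approach.
list_time = timeit.timeit(lambda: sum(sum(r) for r in big_list), number=20)
numpy_time = timeit.timeit(lambda: big_array.sum(), number=20)

print("list of lists:", list_time, "seconds")
print("NumPy array:  ", numpy_time, "seconds")
```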

But why is it faster?

The most important reason is that Python allows the lists to contain general (object) data types. NumPy limits us to one data type.

The array we created is limited to floating point numbers. If we try to add a string, we will get an error.

In [None]:
%%expect_exception ValueError

test_data[3][4] = 'no error'
test_np_array[0, 0] = 'error'

The Python list of lists allows for any data type, but as a consequence, the Python `sum` function must first evaluate each object to determine what type it is, and if the addition operator is allowed on that object type. This dynamic typing is fundamental to Python's ease of use, but it also slows down execution.

NumPy imposes a constraint that all members of an array must have the same type, and it needs to know what that type is. We can find out using the `dtype` attribute:

In [None]:
test_np_array.dtype

In [None]:
type(test_np_array[0, 0])

Because of this type limitation, NumPy can offer an impressive collection of fast tools for working with data.
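
As a small illustration of the single-type rule (the arrays here are made up for the example), NumPy picks one `dtype` that can hold every value it is given. Because every element then has the same size, the data can be stored in one compact block of memory.

```python
import numpy as np

# A mix of integers and floats is stored as one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)          # float64

# The dtype can also be requested explicitly.
small_floats = np.array([1, 2, 3], dtype=np.float32)
print(small_floats.dtype)   # float32

# Every element takes the same number of bytes, so the total size is
# (number of elements) * (bytes per element).
print(small_floats.itemsize, small_floats.nbytes)   # 4 12
```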

Let's explore some of the basics of these tools.

## NumPy Indexing


As demonstrated above, we can index into a NumPy array using the square brackets [ ]. This is slightly different from regular Python indexing in that one pair of brackets can be used for indexing in multiple dimensions.

In [None]:
# Indexing into Python list of lists
print(test_data[3][4])

# Indexing into NumPy array
print(test_np_array[3, 4])

We can inspect and modify the shape of a NumPy array. This will also alter the array's indexing.

In [None]:
small_array = np.random.rand(12)

small_array.shape

In [None]:
small_array

In [None]:
small_array.shape = (4, 3)

small_array
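
Setting `.shape` changes the array in place. A related option, sketched below with a made-up array, is the `reshape` method, which leaves the original array's shape untouched and returns a reshaped array. Passing `-1` for one dimension asks NumPy to infer it from the array's size.

```python
import numpy as np

flat = np.arange(12)              # 0, 1, ..., 11

grid = flat.reshape(4, 3)         # 4 rows, 3 columns
inferred = flat.reshape(-1, 6)    # NumPy works out that there must be 2 rows

print(flat.shape, grid.shape, inferred.shape)

# Row 3, column 2 of the 4x3 grid is element 3 * 3 + 2 = 11 of the flat array.
print(grid[3, 2], flat[11])
```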

We can access a subset of the array if we wish:

In [None]:
smaller_array = small_array[1:, 2:]

smaller_array

This can also be used in assignments.

In [None]:
smaller_array[1:] = 42

smaller_array

We changed `smaller_array`. What about the data in `small_array`?

In [None]:
small_array

The original array changed also! But why?

The smaller array shares the memory space with the original array. The memory overlaps. Changes to one will be reflected in the other. NumPy was designed to do this for performance reasons.

If that's not what you need, use the `copy` method.

In [None]:
small_array_copy = small_array.copy()
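
If you are unsure whether two arrays overlap, NumPy can tell you. The sketch below uses a made-up array and `np.shares_memory` to contrast a slice (a view) with an explicit copy.

```python
import numpy as np

original = np.arange(12.0).reshape(4, 3)

view = original[1:, 2:]                 # a slice: shares memory with original
independent = original[1:, 2:].copy()   # an explicit copy: separate memory

print(np.shares_memory(original, view))          # True
print(np.shares_memory(original, independent))   # False

# Writing through the view changes the original, but not the copy.
view[:] = -1
print(original)
print(independent)
```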

### Conditional Indexing


NumPy allows you to use conditional statements to select a subset of the array.

Consider the situation where you want to select all rows of `small_array` where the number in the first column is greater than 0.5. To do that, first you must write code to determine which rows have a first-column value greater than 0.5. That can be done with the greater-than sign, like so:

In [None]:
small_array[:, 0] > 0.5

The result is an array of Booleans. We can use this array of Booleans as an index into `small_array`.

In [None]:
small_array[small_array[:, 0] > 0.5, :]

You might be wondering if the same memory sharing applies. For a question like this, there's only one way to find out:

In [None]:
test = small_array[small_array[:, 0] > 0.5]

test[:, :] = 42
print(test)
print(small_array)

It does not share memory. Indexing with a Boolean array returns a copy, because the selected rows generally cannot be described as a regularly spaced window onto the original data.
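
Because the Boolean selection is a copy, it matters where you do the assignment. In this sketch (the array values are made up), writing to the selected result leaves the original alone, while assigning through the Boolean index on the left-hand side updates the original in place.

```python
import numpy as np

arr = np.array([[0.9, 1.0],
                [0.1, 2.0],
                [0.7, 3.0]])
before = arr.copy()
mask = arr[:, 0] > 0.5            # [True, False, True]

selected = arr[mask]              # a copy of rows 0 and 2
selected[:, :] = 42               # changes only the copy
unchanged = np.array_equal(arr, before)
print(unchanged)                  # True

arr[mask] = 0                     # assigns into the original array
print(arr)
```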

### NumPy Array methods


NumPy comes with some built-in mathematical functions for you to use to transform your data. Here are a few:

In [None]:
sample = np.random.rand(10)
sample

In [None]:
# min and max
(sample.min(), sample.max())

In [None]:
# index of min and max
(sample.argmin(), sample.argmax())
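
On a 2D array these methods reduce over every element by default. A short sketch with a made-up array shows the `axis` argument, which reduces along one dimension instead.

```python
import numpy as np

grid = np.array([[1, 5, 3],
                 [4, 2, 6]])

total = grid.sum()                 # every element: 21
column_sums = grid.sum(axis=0)     # collapse the rows: [5 7 9]
row_maxes = grid.max(axis=1)       # collapse the columns: [5 6]
row_argmaxes = grid.argmax(axis=1) # column index of each row's max: [1 2]

print(total, column_sums, row_maxes, row_argmaxes)
```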

There are also important mathematical functions in the NumPy library that you should take note of. Here are just a few; explore the NumPy library yourself to see them all.

In [None]:
# all trig functions available
np.sin(sample)

In [None]:
# square root
np.sqrt(sample)

In [None]:
# natural log
print(np.log(sample))
# base 10 log
print(np.log10(sample))
# base 2 log
print(np.log2(sample))

## Saving NumPy Data Files


NumPy has its own binary data format for files. You can use it with the `np.save` and `np.load` functions.

In [None]:
np.save('small_array.npy', small_array)

In [None]:
!ls -l *.npy

In [None]:
retrieved_small_array = np.load('small_array.npy')

retrieved_small_array
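
A self-contained sketch of the same round trip, written to a temporary directory so it leaves no files behind. The file name `data.npy` and the array are made up for the example. It checks that both the values and the `dtype` survive saving and loading.

```python
import os
import tempfile

import numpy as np

data = np.arange(6.0).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.npy')   # hypothetical file name
    np.save(path, data)
    loaded = np.load(path)

print(np.array_equal(data, loaded))   # True
print(loaded.dtype, loaded.shape)
```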

## Relationship with Pandas


NumPy arrays are the foundation of Pandas. The data in a Pandas DataFrame is typically stored in NumPy arrays, and you can get it as a single array with the `.values` attribute.

In [None]:
test_df = DataFrame([[1, 2], [3, 4], [5, 6]],
                   columns=['X', 'Y'],
                   index=['a', 'b', 'c'])

test_df

Observe that ONLY the integers are in the NumPy array:

In [None]:
test_df.values

The column headers and the index are stored in different data structures that also have their own NumPy arrays:

In [None]:
test_df.index.values

In [None]:
test_df.columns.values
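
One consequence worth knowing, sketched here with a made-up DataFrame: because a NumPy array has a single `dtype`, converting columns of different types to one array forces a common type. The sketch uses `to_numpy`, the DataFrame method for converting to a NumPy array.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 5],
                   'Y': [2.5, 4.5, 6.5],
                   'label': ['a', 'b', 'c']})

# Integers and floats together become floats.
numeric = df[['X', 'Y']].to_numpy()
print(numeric.dtype)       # float64

# Adding a text column forces the general-purpose object type.
everything = df.to_numpy()
print(everything.dtype)    # object
```

An `object` array gives up the speed advantage described at the start of this notebook, so it is usually worth selecting the numeric columns first.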

## Matplotlib


_Matplotlib_ lets you plot things, and _pyplot_ is a layer on top of it to give it a MATLAB-like syntax.

Below are some basic examples of these charts:
- Line plots
- Bar plots and histograms
- Scatter plots

### Line plot


Matplotlib can do basic X-Y plots if you give it the `x` and `y` data of equal length.  Here is a plot of a few sample paths of Brownian Motion.

Notice that calling `plt.plot` multiple times results in multiple lines on the same figure.  Call `plt.figure` to create a new figure.

In [None]:
# Line plot example
xs = np.random.randn(5, 100)

plt.title("A few paths of Brownian Motion")
bms = xs.cumsum(1)
for bm in bms:
    plt.plot(np.arange(0, 1., .01), bm)

### Scatter plot


Matplotlib can draw 2D scatter plots.

In [None]:
# Generate randomly sampled dots within the unit circle, with gamma-distributed marker sizes
N = 250
A = 20
xo = np.random.uniform(low=-1, high=1, size=N)
yo = np.random.uniform(low=-1, high=1, size=N)
so = A * np.random.gamma(4.5, 1.0, size=N)

# Keep only the points that fall inside the unit circle
inside = xo**2 + yo**2 < 1
x = xo[inside]
y = yo[inside]
s = so[inside]

# Scatter plot, with _s_izes and translucent circles
plt.scatter(x, y, s=s, alpha=0.5)

### Histograms


Matplotlib can also plot histograms directly from raw data; it bins and counts the values for you.

In [None]:
data = np.random.gamma(4.5, 1.0, 10000)
plt.hist(data, bins=50)
plt.title("Gamma(4.5, 1.0) distribution, 10000 samples")
plt.xlabel("Value")
plt.ylabel("Occurrences per 10,000");

### Images


Matplotlib can plot arrays as 2D images, using a color map that you specify.

In [None]:
a = np.arange(-4, 4, 0.01)

x, y = np.meshgrid(a, a)
assert(x.shape == (len(a), len(a)))
r = np.sqrt(x ** 2 + y ** 2)
plt.imshow(r, cmap=plt.cm.viridis)
plt.colorbar()
plt.title("radius")
plt.xlabel("x")
plt.ylabel("y")

If you have a visual representation in mind for how you wish to plot your data, a good place to start is the Matplotlib gallery. Find a chart that is close to what you are looking for and then modify the sample code to build what you want.

- [Matplotlib Gallery](http://matplotlib.org/gallery.html)
- [Seaborn Gallery](http://seaborn.pydata.org/examples/index.html)

## Matplotlib and Pyplot


You'll notice that all of the plots created thus far started with `plt.` That references this import at the top of the notebook:

```python
import matplotlib.pyplot as plt
```

Pyplot is a special plotting "state machine" created for Matplotlib to simplify the creation of plots. Basically, it has an internal concept of the current chart being operated on by the set of methods made available to you. It is a wrapper around Matplotlib's object oriented plotting library.

For the previous plot, we could have created it like this:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(r, cmap=plt.cm.viridis)
fig.colorbar(ax.get_images()[0])
ax.set_title("radius")
ax.set_xlabel("x")
ax.set_ylabel("y")

This approach is more typing, but it exposes some of the hidden complexity in `pyplot`. There are figure and axes objects, and each has methods that contribute to the result.

One approach is not necessarily better than the other, but it is important to know that there is a `pyplot` state machine that creates plots and there is a separate object oriented approach for creating plots.

Later in your Python adventures you will see sample Matplotlib code on the Internet and will want to use it to add features to your data visualizations. The sample code might not easily fit the code you have already written if one is using `pyplot` and the other is not.

To help you with this, `pyplot` provides the `gcf` and `gca` functions. You can use these to get `pyplot`'s current figure or axes objects.
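
A minimal sketch of mixing the two styles, with made-up data and labels: the plot is started with `pyplot` calls, then `gcf` and `gca` hand back the objects so that object oriented sample code can pick up from there.

```python
import matplotlib.pyplot as plt

# Start in the pyplot style: a fresh figure and a single line.
plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])

# Ask pyplot for the objects it has been drawing on.
fig = plt.gcf()
ax = plt.gca()

# Continue in the object oriented style.
ax.set_title("Squares")
ax.set_xlabel("x")
fig.set_size_inches(6, 4)
```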

### Matplotlib subplots


Frequently you will want two or more plots in the same figure. You can do that with the `subplot` function.

`plt.subplot` takes three arguments: the number of rows, the number of columns, and the index of the current chart, counted from 1 across each row. The same three values can also be written as a single 3-digit number such as `221`, where the hundreds digit is the number of rows, the tens digit is the number of columns, and the ones digit is the current chart. You call it repeatedly to move from one subplot to the next.

In [None]:
# create a 2x2 subplot grid, and prepare to plot data into the first subplot.
plt.subplot(2, 2, 1)
plt.title('Upper Left')
plt.plot(np.random.rand(10))

# move to the second subplot
plt.subplot(2, 2, 2)
plt.title('Upper Right')
plt.plot(np.random.rand(10))

# move to the third
plt.subplot(2, 2, 3)
plt.title('Lower Left')
plt.plot(np.random.rand(10))

# move to the last subplot
plt.subplot(2, 2, 4)
plt.title('Lower Right')
plt.plot(np.random.rand(10))
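
For comparison, here is a sketch of the same 2x2 grid in the object oriented style, using `plt.subplots` (note the trailing "s"), which creates the figure and all of the axes in one call. The variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# axes is a 2x2 NumPy array of axes objects, indexed as axes[row, column].
fig, axes = plt.subplots(2, 2)

titles = ['Upper Left', 'Upper Right', 'Lower Left', 'Lower Right']

# axes.flat walks the grid row by row, matching the order of the titles.
for ax, title in zip(axes.flat, titles):
    ax.set_title(title)
    ax.plot(np.random.rand(10))

# Adjust spacing so titles and tick labels do not overlap.
fig.tight_layout()
```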

## Matplotlib plots from Pandas


The Pandas library comes with built-in plotting tools. Data stored in a DataFrame can be plotted just as easily as the previous examples.

In [None]:
test_data = DataFrame(np.random.rand(10, 2),
                      index=np.arange(10),
                      columns=['A', 'B'])
test_data

In [None]:
test_data.plot()

By default, it assumes you would like to see a line chart. Other choices are available:

In [None]:
test_data.plot.bar()

We can pass parameters to the `bar` method to adjust the chart.

In [None]:
test_data.plot.bar(stacked=True, color=['red', 'blue'], legend=False)

These plots can be useful for visually inspecting your data.

A histogram is particularly helpful for understanding the range and distribution of your data. Outliers will be visible, as well as potential data errors.

In [None]:
test_hist = DataFrame(np.random.beta(0.6, 0.5, size=5000),
                      columns=['Beta(0.6, 0.5)'])

test_hist.hist(bins=100, color='red')

One of the great features of Pandas and plotting is how it handles dates.

In [None]:
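# pandas.util.testing.makeTimeDataFrame is not available in recent pandas
# releases, so build a similar frame by hand: random values on a business-day index.
time_index = pd.date_range('2000-01-03', periods=50, freq='B')
time_df = DataFrame(np.random.randn(50, 4),
                    index=time_index,
                    columns=['A', 'B', 'C', 'D']).cumsum()

time_df.head()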

This DataFrame has dates in the index. Pandas tries to figure out an intelligent way of arranging the x axis so the labels look pretty.

In [None]:
time_df.plot()

### Exercises


1. At the beginning of this notebook we compared two approaches for summing numbers. Test this with arrays of varying sizes and plot the results.
1. Evaluate NumPy's sin, cos, and tan functions from -pi to pi and plot them in a 3x1 grid.
1. Visit the Matplotlib chart gallery and pick a chart that catches your eye. Customize the chart as you see fit.

### Exit Tickets


1. Why are numerical calculations on NumPy arrays faster than similar computations on Python lists?
2. Why do NumPy arrays share memory?
3. What is the Pyplot state machine?

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*