IPython Notebooks
==================
* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the "play" button in the menu.
* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``.
* You can also get help by executing ``function?``.

## Numpy Arrays

Manipulating `numpy` arrays is an important part of doing machine learning
(or, really, any type of scientific computation) in Python.  This will likely
be review for most readers, so we'll quickly go through some of the most important features.

In [5]:
import numpy as np

# Generating a random array
X = np.random.random((3, 5))  # a 3 x 5 array

print(X)

[[ 0.69652948  0.71322531  0.87303095  0.66338181  0.30747741]
 [ 0.17695642  0.92226946  0.80353339  0.16396814  0.63893805]
 [ 0.92485078  0.76047082  0.75959511  0.78755313  0.7737218 ]]


In [6]:
# Accessing elements

# get a single element
print(X[0, 0])

# get a row
print(X[1])

# get a column
print(X[:, 1])

0.696529483987
[ 0.17695642  0.92226946  0.80353339  0.16396814  0.63893805]
[ 0.71322531  0.92226946  0.76047082]


In [2]:
# Transposing an array
print(X.T)

[[ 0.69652948  0.17695642  0.92485078]
 [ 0.71322531  0.92226946  0.76047082]
 [ 0.87303095  0.80353339  0.75959511]
 [ 0.66338181  0.16396814  0.78755313]
 [ 0.30747741  0.63893805  0.7737218 ]]


In [7]:
# Turning a row vector into a column vector
y = np.linspace(0, 12, 5)
print(y)

[  0.   3.   6.   9.  12.]


In [8]:
# make into a column vector
print(y[:, np.newaxis])

[[  0.]
 [  3.]
 [  6.]
 [  9.]
 [ 12.]]
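The same column vector can be produced in a few equivalent ways. The following is a minimal sketch; the variable names are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np

y = np.linspace(0, 12, 5)          # one-dimensional, shape (5,)

col_newaxis = y[:, np.newaxis]     # insert a new axis of length 1
col_none = y[:, None]              # np.newaxis is an alias for None
col_reshape = y.reshape(-1, 1)     # -1 lets numpy infer the number of rows

print(col_newaxis.shape)                          # (5, 1)
print(np.array_equal(col_newaxis, col_none))      # True
print(np.array_equal(col_newaxis, col_reshape))   # True
```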


In [9]:
# getting the shape or reshaping an array
print(X.shape)
print(X.reshape(5, 3))

(3, 5)
[[ 0.69652948  0.71322531  0.87303095]
 [ 0.66338181  0.30747741  0.17695642]
 [ 0.92226946  0.80353339  0.16396814]
 [ 0.63893805  0.92485078  0.76047082]
 [ 0.75959511  0.78755313  0.7737218 ]]
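Note that `reshape` reads the elements in the same row-major order and simply regroups them; it is not a transpose. For a contiguous array like the one below it also returns a view rather than a copy. A small sketch with an illustrative array `A` (not the `X` used above):

```python
import numpy as np

A = np.arange(15).reshape(3, 5)   # values 0..14 laid out row by row
B = A.reshape(5, 3)               # same 15 values, read in the same order

print(B[0])                        # [0 1 2]
print(np.shares_memory(A, B))      # True: B is a view onto A's data here

B[0, 0] = -1                       # writing through the view changes A as well
print(A[0, 0])                     # -1

print(np.array_equal(A.reshape(5, 3), A.T))   # False: reshape is not a transpose
```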


In [11]:
# indexing by an array of integers (fancy indexing)
indices = np.array([3, 1, 0])
print(indices)
X[:, indices]

[3 1 0]


array([[ 0.66338181,  0.71322531,  0.69652948],
       [ 0.16396814,  0.92226946,  0.17695642],
       [ 0.78755313,  0.76047082,  0.92485078]])
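One detail worth knowing is that indexing with an integer array returns a copy, while basic slicing returns a view. The sketch below illustrates the difference and also shows indexing with a boolean mask; the array `A` and the other names are illustrative only.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

cols = A[:, [3, 1, 0]]    # integer-array indexing: picks and reorders columns
cols[0, 0] = 100          # modifying the result...
print(A[0, 3])            # 3: ...leaves A untouched, because cols is a copy

view = A[:, 1:3]          # basic slicing returns a view instead
view[0, 0] = 100
print(A[0, 1])            # 100: the change is visible in A

mask = A[:, 0] > 3        # boolean mask with one entry per row
print(A[mask])            # the rows whose first element exceeds 3
```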

There is much, much more to know, but these few operations are fundamental to what we'll
do during this tutorial.

## Scipy Sparse Matrices

We won't make much use of these in this tutorial, but sparse matrices are very useful
in some situations.  In some machine learning tasks, especially those associated
with textual analysis, the data may be mostly zeros.  Storing all these zeros is very
inefficient; representing the data in a way that stores only the non-zero values can be
much more efficient.  We can create and manipulate sparse matrices as follows:

In [12]:
from scipy import sparse

# Create a random array (most entries are set to zero in the next cell)
X = np.random.random((10, 5))
print(X)

[[ 0.25264316  0.59628956  0.87658157  0.55418026  0.72705737]
 [ 0.64619249  0.82075482  0.31258246  0.91256041  0.60851268]
 [ 0.19483739  0.48221296  0.03532192  0.90441105  0.14773225]
 [ 0.76030261  0.83135908  0.65258765  0.94433549  0.67232905]
 [ 0.34968841  0.45536481  0.02027762  0.40317953  0.69677195]
 [ 0.55846071  0.00912506  0.35630009  0.23567354  0.3943246 ]
 [ 0.81713671  0.12668524  0.02222078  0.94317755  0.38775845]
 [ 0.38066444  0.50797371  0.95025624  0.08957438  0.29307724]
 [ 0.09144117  0.9516462   0.30397811  0.48451704  0.50715675]
 [ 0.4446116   0.33774142  0.21781378  0.55669082  0.55307792]]


In [13]:
# set the majority of elements to zero
X[X < 0.7] = 0
print(X)

[[ 0.          0.          0.87658157  0.          0.72705737]
 [ 0.          0.82075482  0.          0.91256041  0.        ]
 [ 0.          0.          0.          0.90441105  0.        ]
 [ 0.76030261  0.83135908  0.          0.94433549  0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.81713671  0.          0.          0.94317755  0.        ]
 [ 0.          0.          0.95025624  0.          0.        ]
 [ 0.          0.9516462   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]


In [14]:
# turn X into a csr (Compressed-Sparse-Row) matrix
X_csr = sparse.csr_matrix(X)
print(X_csr)

  (0, 2)	0.876581572313
  (0, 4)	0.72705736896
  (1, 1)	0.820754824951
  (1, 3)	0.912560411259
  (2, 3)	0.904411047928
  (3, 0)	0.760302613666
  (3, 1)	0.831359078351
  (3, 3)	0.944335487558
  (6, 0)	0.817136712949
  (6, 3)	0.943177549427
  (7, 2)	0.950256236274
  (8, 1)	0.951646201985
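For intuition about what the CSR format stores, it can help to look at the three underlying arrays of a small matrix. This is a sketch using a hand-written matrix `M` (illustrative, not the `X` above) so that the values are easy to follow.

```python
import numpy as np
from scipy import sparse

M = np.array([[0, 0, 3],
              [4, 0, 0],
              [0, 5, 6]])
M_csr = sparse.csr_matrix(M)

print(M_csr.data)      # [3 4 5 6]  non-zero values, row by row
print(M_csr.indices)   # [2 0 1 2]  column index of each stored value
print(M_csr.indptr)    # [0 1 2 4]  row i occupies data[indptr[i]:indptr[i+1]]
```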


In [15]:
# convert the sparse matrix to a dense array
print(X_csr.toarray())

[[ 0.          0.          0.87658157  0.          0.72705737]
 [ 0.          0.82075482  0.          0.91256041  0.        ]
 [ 0.          0.          0.          0.90441105  0.        ]
 [ 0.76030261  0.83135908  0.          0.94433549  0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.81713671  0.          0.          0.94317755  0.        ]
 [ 0.          0.          0.95025624  0.          0.        ]
 [ 0.          0.9516462   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
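The storage savings become visible on a larger array. The sketch below compares the bytes used by a dense array against the bytes used by the arrays inside its CSR counterpart; the size (1000 x 1000), the threshold, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)        # seed chosen arbitrarily
dense = rng.random((1000, 1000))
dense[dense < 0.99] = 0               # keep roughly 1% of the entries

csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(csr.nnz, "stored values out of", dense.size)
print(dense_bytes, "bytes dense vs", sparse_bytes, "bytes as CSR")
```

The advantage shrinks as the fraction of non-zero entries grows, since each stored value also needs a column index.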


The CSR representation can be very efficient for computations, but it is
inefficient for adding elements one at a time.  For that, the LIL (list of lists)
representation is better:

In [16]:
# Create an empty LIL matrix and add some items
X_lil = sparse.lil_matrix((5, 5))

for i, j in np.random.randint(0, 5, (15, 2)):
    X_lil[i, j] = i + j

print(X_lil)

  (0, 1)	1.0
  (0, 4)	4.0
  (1, 1)	2.0
  (1, 2)	3.0
  (1, 3)	4.0
  (2, 0)	2.0
  (2, 1)	3.0
  (3, 1)	4.0
  (3, 4)	7.0
  (4, 0)	4.0
  (4, 4)	8.0


In [None]:
print(X_lil.toarray())

Often, once an LIL matrix has been built, it is useful to convert it to CSR format
(many scikit-learn algorithms require CSR or CSC format).

In [None]:
print(X_lil.tocsr())
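A typical workflow, then, is to build the matrix incrementally in LIL format and convert it once before doing arithmetic. The following sketch assumes a small bidiagonal matrix invented for illustration; the names `A_lil`, `A_csr`, and `v` are not part of the tutorial.

```python
import numpy as np
from scipy import sparse

# Build incrementally in LIL, where single-element assignment is cheap
A_lil = sparse.lil_matrix((4, 4))
for i in range(4):
    A_lil[i, i] = 2.0
    if i > 0:
        A_lil[i, i - 1] = -1.0

# Convert once, then use CSR for arithmetic
A_csr = A_lil.tocsr()
v = np.ones(4)

print(A_csr.format)   # csr
print(A_csr @ v)      # [2. 1. 1. 1.]
```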

The available sparse formats, each suited to different kinds of problems, are:

- `CSR` (compressed sparse row)
- `CSC` (compressed sparse column)
- `BSR` (block sparse row)
- `COO` (coordinate)
- `DIA` (diagonal)
- `DOK` (dictionary of keys)
- `LIL` (list of lists)

The ``scipy.sparse`` submodule also has a lot of functions for sparse matrices
including linear algebra, sparse solvers, graph algorithms, and much more.
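As a brief taste of those submodules, the sketch below finds the connected components of a tiny graph with `scipy.sparse.csgraph` and solves a small linear system with `scipy.sparse.linalg`. The matrices are toy examples made up for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components
from scipy.sparse.linalg import spsolve

# Graph algorithm: adjacency matrix of an undirected graph with edges 0-1 and 2-3
adjacency = sparse.csr_matrix(np.array([[0, 1, 0, 0],
                                        [1, 0, 0, 0],
                                        [0, 0, 0, 1],
                                        [0, 0, 1, 0]]))
n_components, labels = connected_components(adjacency, directed=False)
print(n_components)   # 2
print(labels)         # [0 0 1 1]

# Sparse solver: solve A x = b for a small diagonal system
A = sparse.csc_matrix(np.array([[2.0, 0.0],
                                [0.0, 4.0]]))
b = np.array([2.0, 8.0])
x = spsolve(A, b)
print(x)              # [1. 2.]
```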

## Matplotlib

Another important part of machine learning is visualization of data.  The most common
tool for this in Python is `matplotlib`.  It is an extremely flexible package, so
we will go over only some basics here.

First, a feature specific to the IPython notebook: the ``%matplotlib inline`` magic turns on
inline mode, which makes plots show up inside the notebook.

In [None]:
%matplotlib inline

In [18]:
import matplotlib.pyplot as plt

In [19]:
# plotting a line
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))

[<matplotlib.lines.Line2D at 0x7f5dfe9cee48>]
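The same kind of line plot can also be written with matplotlib's object-oriented interface, which makes it easier to add axis labels and a legend. This is a sketch only; the second curve, the labels, and the line style are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()                  # one figure containing one set of axes
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()                               # uses the label= text given above
```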

In [20]:
# scatter-plot points
x = np.random.normal(size=500)
y = np.random.normal(size=500)
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f5dfe95a9e8>

In [21]:
# showing images
x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]

im = y * np.sin(x) * np.cos(y)
print(im.shape)

(100, 100)


In [22]:
# imshow - note that origin is at the top-left by default!
plt.imshow(im)

<matplotlib.image.AxesImage at 0x7f5dfec5f128>

In [23]:
# Contour plot - note that origin here is at the bottom-left by default!
plt.contour(im)

<matplotlib.contour.QuadContourSet at 0x7f5dfec5f160>
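Because the two functions use different default origins, an image and its contours will appear vertically flipped relative to each other unless one of them is adjusted. One option, sketched below, is to pass `origin="lower"` to `imshow`; the side-by-side layout and the contour styling are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]
im = y * np.sin(x) * np.cos(y)

fig, (ax_upper, ax_lower) = plt.subplots(1, 2)
ax_upper.imshow(im)                     # default: row 0 is drawn at the top
ax_lower.imshow(im, origin="lower")     # row 0 is drawn at the bottom, as in contour
ax_lower.contour(im, colors="k", linewidths=0.5)

print(ax_upper.get_ylim())   # limits run from large to small (inverted y-axis)
print(ax_lower.get_ylim())   # limits run from small to large
```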

In [17]:
# 3D plotting
from mpl_toolkits.mplot3d import Axes3D
ax = plt.axes(projection='3d')
xgrid, ygrid = np.meshgrid(x, y.ravel())
ax.plot_surface(xgrid, ygrid, im, cmap=plt.cm.jet, cstride=2, rstride=2, linewidth=0)


There are many more plot types available.  One useful way to explore them is by
looking at the matplotlib gallery: http://matplotlib.org/gallery.html

You can test these examples out easily in the notebook: copy the address of the ``Source Code``
link on a gallery page and pass it to the ``%load`` magic in a notebook cell.
For example:

In [None]:
# %load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import EllipseCollection

x = np.arange(10)
y = np.arange(15)
X, Y = np.meshgrid(x, y)

XY = np.hstack((X.ravel()[:,np.newaxis], Y.ravel()[:,np.newaxis]))

ww = X/10.0
hh = Y/15.0
aa = X*9

fig, ax = plt.subplots()

ec = EllipseCollection(ww, hh, aa, units='x', offsets=XY,
                       transOffset=ax.transData)
ec.set_array((X+Y).ravel())
ax.add_collection(ec)
ax.autoscale_view()
ax.set_xlabel('X')
ax.set_ylabel('y')
cbar = plt.colorbar(ec)
cbar.set_label('X+Y')
plt.show()