In [2]:
import numpy as np

In [3]:
a = np.array([0, 1, 2, 3, 4, 5])

In [4]:
a

array([0, 1, 2, 3, 4, 5])

In [5]:
a.ndim

1

In [6]:
a.shape

(6L,)

We just created an array in a similar way to how we would create a list in Python.
However, NumPy arrays have additional information about the shape. In this case,
it is a one-dimensional array of six elements. No surprises so far.

We can now transform this array into a 2D matrix:

In [7]:
b = a.reshape((3, 2))

In [8]:
b

array([[0, 1],
       [2, 3],
       [4, 5]])

In [9]:
b.ndim

2

In [10]:
b.shape

(3L, 2L)

Aha! Eureka! ndim returns the number of dimensions (axes) of the array, and shape returns the size of the array along each of those dimensions.
a, which is a one-dimensional array, is a "straight line" of sequential numbers, so its ndim is 1 and its shape is (6,), shown above as (6L,). Similarly, b is two-dimensional: its shape (3, 2) means three rows of two numbers each.
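To make the relationship between ndim and shape concrete, here is a small sketch. It rebuilds the same values with np.arange (an assumption for brevity; the cells above use np.array) and also looks at size, which is not used elsewhere in these notes.

```python
import numpy as np

a = np.arange(6)         # same values as a above: 0, 1, 2, 3, 4, 5
b = a.reshape((3, 2))    # three rows, two columns

# ndim is simply the length of the shape tuple
print(a.ndim, a.shape)   # 1 (6,)
print(b.ndim, b.shape)   # 2 (3, 2)

# size is the total number of elements; reshaping never changes it
print(a.size, b.size)    # 6 6
```

Multiplying the entries of shape together always gives size, which is why (3, 2) and (2, 3) are both valid reshapes of six elements.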

In [11]:
c = a.reshape((2, 3))

In [12]:
c

array([[0, 1, 2],
       [3, 4, 5]])

In [13]:
c.ndim

2

In [14]:
c.shape

(2L, 3L)

In [15]:
b[1][0]=77

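In this case, we have modified the value 2 to 77 in b, and we can immediately see
the same change reflected in a as well, because reshape handed back a view that
shares its data with a rather than a copy. Keep that in mind whenever you need a
true copy. The code reads as: in array b, take the row at index 1 (the second row)
and set its first element (index 0) to 77.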

In [16]:
b

array([[ 0,  1],
       [77,  3],
       [ 4,  5]])

In [17]:
a

array([ 0,  1, 77,  3,  4,  5])

In [18]:
c = a.reshape((3,2)).copy()

In [19]:
c

array([[ 0,  1],
       [77,  3],
       [ 4,  5]])

In [20]:
c[0][0] = -99

In [21]:
c

array([[-99,   1],
       [ 77,   3],
       [  4,   5]])

In [22]:
a

array([ 0,  1, 77,  3,  4,  5])

Above, c is a totally independent copy: changing c leaves a untouched.
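One way to check whether two arrays share data, rather than finding out by accident, is np.shares_memory. The sketch below uses the hypothetical names view and independent (they do not appear in the cells above) and starts from a fresh array.

```python
import numpy as np

a = np.arange(6)
view = a.reshape((3, 2))                # shares its data with a
independent = a.reshape((3, 2)).copy()  # owns its own data

print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False

view[1, 0] = 77          # writes through to a
independent[0, 0] = -99  # does not touch a
print(a)                 # [ 0  1 77  3  4  5]
```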

### IMPORTANT- I'm going to forget this!!
Another big advantage of NumPy arrays is that the operations are propagated
to the individual elements.

In [23]:
a * 2

array([  0,   2, 154,   6,   8,  10])

In [24]:
a ** 2

array([   0,    1, 5929,    9,   16,   25])

Contrast that to ordinary Python lists:

In [25]:
[1,2,3,4,5]*2

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [26]:
[1,2,3,4,5]**2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
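For comparison, getting element-wise behavior out of a plain list takes an explicit loop or list comprehension. This is a minimal sketch with made-up variable names:

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Plain lists: spell out the per-element operation
doubled_list = [v * 2 for v in values]
squared_list = [v ** 2 for v in values]

# NumPy arrays: the operator is applied to every element for us
arr = np.array(values)
doubled_arr = arr * 2
squared_arr = arr ** 2

print(doubled_list, doubled_arr)   # [2, 4, 6, 8, 10] [ 2  4  6  8 10]
print(squared_list, squared_arr)   # [1, 4, 9, 16, 25] [ 1  4  9 16 25]
```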

In [None]:
X = np.array([  [0,0,1],
                [0,1,1],
                [1,0,1],
                [1,1,1] ])

In [None]:
X.ndim

In [None]:
X.shape

In [None]:
X = X ** 3  # element-wise cube; note that ^ would be bitwise XOR, not a power

In [None]:
X
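When raising an array to a power, keep in mind that Python spells exponentiation as `**`, while `^` is bitwise XOR, and NumPy follows the same convention element by element. The sketch below assumes the cube was what was wanted and shows both operators side by side on the same X:

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

print(X.ndim, X.shape)   # 2 (4, 3)

cubed = X ** 3   # element-wise power: 0 stays 0 and 1 stays 1
xored = X ^ 3    # element-wise bitwise XOR: 0 -> 3 and 1 -> 2

print(cubed)
print(xored)
```

Because X only contains zeros and ones, `X ** 3` leaves it unchanged, whereas `X ^ 3` produces an array of threes and twos.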

# Indexing

In [27]:
a[np.array([2,3,4])]

array([77,  3,  4])

The above selects the values stored in a at indices 2, 3 and 4, using another array as the index.

In addition to the fact that conditions are now propagated to the individual elements,
we gain a very convenient way to access our data.

In [28]:
a>4

array([False, False,  True, False, False,  True], dtype=bool)

This can also be used to trim outliers.

In [31]:
a[a>4]=4

In [32]:
a

array([0, 1, 4, 3, 4, 4])
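The comparison and the assignment above can be separated, which makes it easier to see what the mask is doing. In this sketch the names mask and trimmed are hypothetical, and np.where is swapped in as a non-mutating alternative to the in-place assignment:

```python
import numpy as np

a = np.array([0, 1, 77, 3, 4, 5])

mask = a > 4             # one boolean per element
print(a[mask])           # [77  5]  only the elements where mask is True
print(mask.sum())        # 2        True counts as 1, so this counts the matches

# Build a trimmed array without modifying a
trimmed = np.where(mask, 4, a)
print(trimmed)           # [0 1 4 3 4 4]
print(a)                 # [ 0  1 77  3  4  5]
```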

As this is a frequent use case, there is a special clip function for it, clipping the values
at both ends of an interval with one function call as follows:

In [37]:
a.clip(0,4)

array([0, 1, 4, 3, 4, 4])
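Since a had already been trimmed by hand, the clip call above does not change anything visible. The sketch below starts from an array with values outside the interval on both sides (the -3 is made up for illustration) and shows that clip returns a new array unless an output array is passed explicitly:

```python
import numpy as np

a = np.array([-3, 0, 1, 77, 3, 4, 5])

clipped = a.clip(0, 4)   # values below 0 become 0, values above 4 become 4
print(clipped)           # [0 0 1 4 3 4 4]
print(a)                 # [-3  0  1 77  3  4  5]  the original is untouched

np.clip(a, 0, 4, out=a)  # same operation, written back into a
print(a)                 # [0 0 1 4 3 4 4]
```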

# Handling non-existing values

In [39]:
c = np.array([1, 2, np.nan, 3, 4])
# let's pretend we have read this from a text file

In [40]:
c

array([  1.,   2.,  nan,   3.,   4.])

In [41]:
np.isnan(c)

array([False, False,  True, False, False], dtype=bool)

In [42]:
c[~np.isnan(c)]

array([ 1.,  2.,  3.,  4.])

In [43]:
np.mean(c[~np.isnan(c)])

2.5
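The filtering step matters because NaN propagates through ordinary arithmetic. As a sketch, the snippet below shows what happens without the filter and uses np.nanmean, which is not part of the cells above, as a shortcut for the same calculation:

```python
import numpy as np

c = np.array([1, 2, np.nan, 3, 4])

plain_mean = np.mean(c)                    # NaN poisons the result
filtered_mean = np.mean(c[~np.isnan(c)])   # the approach used above
nan_aware_mean = np.nanmean(c)             # skips NaN entries internally

print(plain_mean)       # nan
print(filtered_mean)    # 2.5
print(nan_aware_mean)   # 2.5
```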

# Comparing runtime behaviors

In [50]:
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)
print("Normal Python: %f sec"%normal_py_sec)
print("Naive NumPy: %f sec"%naive_np_sec)
print("Good NumPy: %f sec"%good_np_sec)

Normal Python: 1.039862 sec
Naive NumPy: 1.197771 sec
Good NumPy: 0.022111 sec


We make two interesting observations. First, just using NumPy as data storage
(Naive NumPy) takes about 15 percent longer in the run above (1.20 sec versus
1.04 sec), which is surprising since we believe it must be much faster as it is
written as a C extension. One reason for this is that the access of individual
elements from Python itself is rather costly. Only when we are able to apply
algorithms inside the optimized extension code do we get speed improvements, and
tremendous ones at that: using the dot() function of NumPy, we are roughly 47
times faster (0.022 sec versus 1.04 sec). In summary, in every algorithm we are
about to implement, we should always look at how we can move loops over individual
elements from Python to some of the highly optimized NumPy or SciPy extension
functions.
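The comparison can be rerun on Python 3 with a small adaptation. This is only a sketch: range replaces xrange, the repetition count is lowered to keep it quick, and the timings dictionary is a hypothetical convenience. The absolute numbers depend on the machine, so only the ordering is worth reading into.

```python
import timeit

import numpy as np

na = np.arange(1000)

# All three formulations compute the same sum of squares
py_result = sum(x * x for x in range(1000))
naive_result = int(sum(na * na))   # Python's sum() iterating over a NumPy array
dot_result = int(na.dot(na))       # the whole loop runs inside NumPy
assert py_result == naive_result == dot_result

setup = "import numpy as np; na = np.arange(1000)"
runs = 1000  # fewer repetitions than above, purely to keep the sketch quick

timings = {
    "Normal Python": timeit.timeit("sum(x * x for x in range(1000))", number=runs),
    "Naive NumPy": timeit.timeit("sum(na * na)", setup=setup, number=runs),
    "Good NumPy": timeit.timeit("na.dot(na)", setup=setup, number=runs),
}
for label, seconds in timings.items():
    print("%s: %f sec" % (label, seconds))
```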

However, the speed comes at a price. Using NumPy arrays, we no longer have the
incredible flexibility of Python lists, which can hold basically anything. NumPy
arrays always have only one datatype.

In [51]:
a = np.array([1,2,3])

In [52]:
a.dtype

dtype('int32')

If we try to use elements of different types, NumPy will do its best to coerce them
to the most reasonable common datatype:

In [54]:
np.array([1, "stringy"])

array(['1', 'stringy'], 
      dtype='|S11')

In [55]:
np.array([1, "stringy", set([1,2,3])])

array([1, 'stringy', set([1, 2, 3])], dtype=object)
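The same coercion happens more quietly with numbers: mixing integers and floats gives a float array. The sketch below, with made-up variable names, shows that case along with explicit conversion through astype and an explicitly requested object array:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])         # one float is enough to upcast everything

print(ints.dtype.kind)                # i  (an integer type; the width is platform dependent)
print(mixed.dtype)                    # float64

as_float = ints.astype(np.float64)    # explicit conversion returns a new array
print(as_float)                       # [1. 2. 3.]

# Asking for dtype=object keeps each element as an ordinary Python object
anything = np.array([1, "stringy", {1, 2, 3}], dtype=object)
print(anything.dtype)                 # object
```

An object array gives back the flexibility of a list, but the elements are then handled as generic Python objects, so the fast typed operations from the timing comparison no longer apply.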

# Learning SciPy

In [56]:
import scipy, numpy

In [57]:
scipy.version.full_version

'0.18.1'

In [58]:
scipy.dot is numpy.dot

True

| SciPy package | Functionality |
| --- | --- |
| cluster | Hierarchical clustering (cluster.hierarchy); vector quantization / K-Means (cluster.vq) |
| constants | Physical and mathematical constants; conversion methods |
| fftpack | Discrete Fourier transform algorithms |
| integrate | Integration routines |
| interpolate | Interpolation (linear, cubic, and so on) |
| io | Data input and output |
| linalg | Linear algebra routines using the optimized BLAS and LAPACK libraries |
| maxentropy | Functions for fitting maximum entropy models |
| ndimage | n-dimensional image package |
| odr | Orthogonal distance regression |
| optimize | Optimization (finding minima and roots) |
| signal | Signal processing |
| sparse | Sparse matrices |
| spatial | Spatial data structures and algorithms |
| special | Special mathematical functions such as Bessel or Jacobian |
| stats | Statistics toolkit |

The toolboxes most interesting to our endeavor are scipy.stats,
scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we
will explore only some features of the stats package and leave the others to be
explained when they show up in the chapters.