## Chapter 2: Introduction to NumPy
#### Book: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas

This chapter introduces NumPy (short for Numerical Python). NumPy forms the core of nearly the entire ecosystem of data science tools in Python and is the basis for how data is stored and manipulated.

The chapter covers:
* Understanding Data Types in Python
* The Basics of NumPy Arrays
* Computation on NumPy Arrays: Universal Functions
* Aggregations
* Computation on Arrays: Broadcasting
* Comparisons, Masks, and Boolean Logic
* Fancy Indexing
* Sorting Arrays
* Structured Data


__1. Understanding Data Types in Python__

Python is dynamically typed: the type of a variable is inferred at runtime, so we can assign any kind of data to any variable.

In [1]:
result = 0
for i in range(100):
    result += i

In [3]:
x = 4    # x refers to an integer
x = "4"  # the same name now refers to a string

__Python list__: a Python data structure that holds many Python objects, which may be of different types.

In [5]:
#a list of integers
L = list(range(10))
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [6]:
type(L[0])

int

In [7]:
#a list of strings
L2 = [str(c) for c in L]
L2

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [8]:
type(L2[0])

str

In [12]:
#heterogeneous list
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

[bool, str, float, int]

__Fixed-type arrays in Python__: the built-in ``array`` module creates dense arrays of a uniform type, which provides __efficient storage__.

NumPy's ``ndarray`` adds efficient __operations__ on that data.

In [13]:
import array
L = list(range(10))
A = array.array('i',L)
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
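
The difference between efficient storage and efficient operations can be seen in a minimal sketch like the one below. The variable names are only illustrative. With the built-in ``array`` type, ``*`` repeats the sequence as it does for a list, while a NumPy array applies it element by element.

```python
import array
import numpy as np

values = list(range(5))

# Built-in array: compact storage, but `*` repeats the sequence (like a list)
A = array.array('i', values)
repeated = A * 2

# NumPy array: `*` is applied element by element
N = np.array(values)
doubled = N * 2

print(list(repeated))    # the five values, twice over
print(doubled.tolist())  # each value multiplied by 2
```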

In [1]:
#standard NumPy import, under the alias np
import numpy as np


__Creating arrays from Python lists__ 

A NumPy array is constrained to elements that all share the same type. If the types do not match, NumPy will upcast where possible (for example, integers are upcast to floating point).

To set the data type of the resulting array explicitly, use the ``dtype`` keyword.

Unlike Python lists, NumPy arrays can be explicitly multidimensional.

In [15]:
#integer array
np.array([1,4,3,5,2])

array([1, 4, 3, 5, 2])

In [18]:
np.array([1,5,6,7,8],dtype='float32')

array([ 1.,  5.,  6.,  7.,  8.], dtype=float32)

In [19]:
# nested lists result in multidimensional arrays
#The inner lists are treated as rows of the resulting two-dimensional array.

np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])
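
As a small illustration of the same-type constraint, the sketch below mixes one float with integers and checks the resulting ``dtype``. The array names are arbitrary, and the default float width is assumed to be 64 bits, which is the usual case.

```python
import numpy as np

mixed = np.array([3.14, 4, 2, 3])                # one float among integers
as_float = np.array([1, 2, 3], dtype='float32')  # type requested explicitly

print(mixed.dtype)     # a floating-point dtype: the integers were upcast
print(as_float.dtype)  # float32
```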

__Creating Arrays from Scratch__: especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.

In [23]:
# Create a 3x5 floating-point array filled with 1s
np.ones((3,5))

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [22]:
# Create a 3x5 array filled with 3.14
np.full((3,5),3.14)

array([[ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14]])

In [26]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [27]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [28]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[ 0.74988009,  0.23010776,  0.71065657],
       [ 0.76599523,  0.25888275,  0.50780185],
       [ 0.98790211,  0.18900414,  0.80086819]])

In [29]:
# Create a 3x3 identity matrix
np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

__2. The Basics of NumPy Arrays__

Basic array manipulations are:

* *Attributes of arrays*: determining the size, shape, memory consumption, and data types of arrays
* *Indexing of arrays*: getting and setting the value of individual array elements
* *Slicing of arrays*: getting and setting smaller subarrays within a larger array
* *Reshaping of arrays*: changing the shape of a given array
* *Joining and splitting of arrays*: combining multiple arrays into one, and splitting one array into many

__2.1 NumPy Array Attributes__
* ``ndim``: the number of dimensions
* ``shape``: the size of each dimension
* ``size``: the total number of elements in the array
* ``dtype``: the data type of the array
* ``itemsize``: the size (in bytes) of each array element
* ``nbytes``: the total size (in bytes) of the array

In [3]:
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

In [4]:
print("x3 ndim: ", x3.ndim)
print("x3 shape: ", x3.shape)
print("x3 size: ", x3.size)
print("x3 dtype: ", x3.dtype)

x3 ndim:  3
x3 shape:  (3, 4, 5)
x3 size:  60
x3 dtype:  int64


In [5]:
print("x3 itemsize: ", x3.itemsize)
print("x3 nbytes: ", x3.nbytes)

x3 itemsize:  8
x3 nbytes:  480
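
In general, ``nbytes`` should equal ``itemsize`` times ``size``. A quick check, using an illustrative array with an explicitly chosen ``int64`` dtype so the byte counts do not depend on the platform default:

```python
import numpy as np

arr = np.zeros((3, 4, 5), dtype=np.int64)

print(arr.size)                               # 3 * 4 * 5 elements
print(arr.itemsize)                           # bytes per int64 element
print(arr.nbytes == arr.itemsize * arr.size)  # the relationship holds
```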


__2.2 Array Indexing: Accessing Single Elements__
* In a one-dimensional array, you can access the ``i``-th value (counting from zero) by specifying the desired index in square brackets
* To index from the end of the array, you can use negative indices
* In a multidimensional array, you access items using a comma-separated tuple of indices

In [4]:
x1

array([5, 0, 3, 3, 7, 9])

In [5]:
x1[4]

7

In [7]:
x1[-1]

9

In [8]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [12]:
x2[2,1]

6
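
The same index notation can also be used to set values. The sketch below rebuilds the arrays shown above by hand so that it stands alone. Note that because NumPy arrays have a fixed type, a floating-point value assigned into an integer array is silently truncated.

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])
x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

x2[0, 0] = 12      # set a single element with the same index notation
x1[0] = 3.14159    # the float is truncated to fit the integer dtype

print(x2[0, 0], x1[0])
```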

__2.3 Array Slicing: Accessing Subarrays__

We can also use square brackets to access subarrays with the *slice* notation, marked by the colon (``:``) character.

To access a slice of an array ``x``, use ``x[start:stop:step]``. Unspecified values default to ``start=0``, ``stop=`` *size of dimension*, and ``step=1``.

*One-dimensional subarrays*

In [14]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
x[:5] #first five elements

array([0, 1, 2, 3, 4])

In [16]:
x[5:] # elements from index 5 onward

array([5, 6, 7, 8, 9])

In [17]:
x[1::2] # every other element, starting at index 1

array([1, 3, 5, 7, 9])

In [18]:
x[::-1] # all elements reversed

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [19]:
x[5::-1] # reversed, starting from index 5

array([5, 4, 3, 2, 1, 0])

*Multidimensional subarrays*


In [20]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [21]:
x2[:2, :3] # two rows, three columns

array([[3, 5, 2],
       [7, 6, 8]])

In [22]:
x2[:,0] # first column of x2

array([3, 7, 1])

In [23]:
x2[0]  #first row

array([3, 5, 2, 4])

In [16]:
myarray3 = np.array([[[5, 20], [480,90]],[[250,50],[100,33]]])
myarray3

array([[[  5,  20],
        [480,  90]],

       [[250,  50],
        [100,  33]]])

In [17]:
myarray3.shape

(2, 2, 2)

In [18]:
#code to access the element [100,33]
myarray3[1,1]

array([100,  33])

In [19]:
myarray3[1,1][1]

33

In [20]:
myarray3[1,1][0]

100
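
The chained form ``myarray3[1,1][0]`` indexes twice: first to get a subarray, then to get an element of it. A single comma-separated tuple with one index per dimension should give the same element, as this short sketch suggests:

```python
import numpy as np

myarray3 = np.array([[[5, 20], [480, 90]], [[250, 50], [100, 33]]])

chained = myarray3[1, 1][0]   # index twice: first a subarray, then an element
direct = myarray3[1, 1, 0]    # one tuple with an index per dimension

print(chained == direct)
```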

*Subarrays as no-copy views*

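One important and extremely useful thing to know about array slices is that they
return *views* rather than *copies* of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices are copies.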

In [24]:
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


In [26]:
x2_sub = x2[:2,:2]
print(x2_sub)

[[3 5]
 [7 6]]


Now if we modify this subarray, we’ll see that the original array is changed!

In [28]:
x2_sub[0,0] =99
print(x2_sub)

[[99  5]
 [ 7  6]]


In [29]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


*Creating copies of arrays*

Despite the nice features of array views, it is sometimes useful to instead explicitly
copy the data within an array or a subarray. This is most easily done with the ``copy()`` method. If we then modify the copied subarray, the original array is not touched:

In [31]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

[[99  5]
 [ 7  6]]


In [32]:
x2_sub_copy[0,0] = 42
print(x2_sub_copy)

[[42  5]
 [ 7  6]]


In [33]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
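
One way to check whether two arrays refer to the same underlying data is ``np.shares_memory``. The sketch below uses an illustrative array rather than the ``x2`` from the cells above:

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))
view = arr[:2, :2]           # slice: a view onto arr's data
copy = arr[:2, :2].copy()    # independent copy of the same region

print(np.shares_memory(arr, view))  # True for a view
print(np.shares_memory(arr, copy))  # False for a copy
```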


__2.4 Reshaping of Arrays__

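Another useful type of operation is reshaping of arrays, which is most flexibly done with the ``reshape()`` method. For this to work, the size of the initial array must match the size of the reshaped array.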

For example, if you want to put the numbers 1 through 9 in a ``3×3`` grid, you can do the following:

In [35]:
grid = np.arange(1,10).reshape((3,3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [37]:
x = np.array([1,2,3])

In [38]:
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [39]:
# column vector via newaxis
x[:, np.newaxis]

array([[1],
       [2],
       [3]])
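
A few related cases, sketched under the same assumptions as the cells above: ``np.newaxis`` in the first position gives a row vector, ``-1`` asks NumPy to infer one dimension, and a size mismatch raises a ``ValueError``.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x[np.newaxis, :]                   # row vector, shape (1, 3)
auto = np.arange(12).reshape((3, -1))    # -1 lets NumPy infer the 4

try:
    np.arange(10).reshape((3, 3))        # 10 elements cannot fill a 3x3 grid
except ValueError as err:
    mismatch = str(err)                  # keep the message for inspection

print(row.shape, auto.shape)
```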

__2.5 Array Concatenation and Splitting__

Concatenation, or joining of two arrays in NumPy, is primarily accomplished
through the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.

``np.concatenate`` takes a tuple or list of arrays as its first argument. It can also be used for two-dimensional arrays, where the ``axis`` keyword selects the axis along which to join:

In [40]:
x= np.array([1,2,3])
y= np.array([3,2,1])
np.concatenate([x,y])

array([1, 2, 3, 3, 2, 1])

In [45]:
grid = np.array([[1,2,3],[4,5,6]])



In [46]:
# concatenate along the first axis
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [47]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack``
(vertical stack) and ``np.hstack`` (horizontal stack) functions:

In [48]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [49]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
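
To see why the stacking helpers are convenient for mixed dimensions, the sketch below tries ``np.concatenate`` directly on a one-dimensional and a two-dimensional array. This is expected to fail with a ``ValueError``. It then shows that ``np.vstack`` behaves like concatenation after promoting ``x`` to a single row. The flag name ``concat_ok`` is just for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

try:
    np.concatenate([x, grid])        # 1-D and 2-D inputs do not match
    concat_ok = True
except ValueError:
    concat_ok = False

stacked = np.vstack([x, grid])                    # x is treated as a single row
same = np.concatenate([x[np.newaxis, :], grid])   # explicit equivalent

print(concat_ok, np.array_equal(stacked, same))
```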

*Splitting of arrays*

The opposite of concatenation is *splitting*, which is implemented by the functions
``np.split``, ``np.hsplit``, and ``np.vsplit``.

For each of these, we can pass a list of indices giving the split points. Note that *N* split points lead to *N + 1* subarrays:

In [50]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [99 99] [3 2 1]


In [51]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [54]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [55]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
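
Besides a list of split points, ``np.split`` also accepts an integer, which requests that many equal-sized pieces. A brief sketch contrasting the two forms, with arbitrary example values:

```python
import numpy as np

x = np.arange(9)

parts = np.split(x, 3)          # integer: three equal-sized pieces
pieces = np.split(x, [2, 6])    # list: cut before index 2 and before index 6

print([p.tolist() for p in parts])
print([p.tolist() for p in pieces])
```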


__3. Computation on NumPy Arrays: Universal Functions__

Computation on NumPy arrays can be very fast, or it can be very slow. 

The key to making it fast is to use *vectorized* operations, generally implemented through NumPy’s *universal functions (ufuncs)*.

This is a large part of why NumPy is so important in the Python data science world: it provides an easy and flexible interface to optimized computation with arrays of data.

*Ufuncs* exist in two flavors: *unary ufuncs*, which operate on a single input, and *binary
ufuncs*, which operate on two inputs. The examples in this section cover:

* Array arithmetic
* Absolute value
* Trigonometric functions
* Exponents and logarithms
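
To make the contrast between slow and fast computation concrete, here is a minimal sketch that computes reciprocals twice: once with an explicit Python loop, and once as a single vectorized expression. The helper name ``compute_reciprocals`` and the array size are illustrative choices. Both approaches should give the same values, but the vectorized form pushes the loop into compiled code.

```python
import numpy as np

def compute_reciprocals(values):
    """Reciprocal of each element, one Python-level iteration at a time."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

rng = np.random.RandomState(0)
values = rng.randint(1, 10, size=1000)

looped = compute_reciprocals(values)   # explicit Python loop
vectorized = 1.0 / values              # same computation as one ufunc call

print(np.allclose(looped, vectorized))
```

Timing the two versions (for example with ``%timeit`` in IPython) on a large array is a good way to see the difference on your own machine.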

In [4]:
x= np.arange(4)
print("x = ", x)
print("x+5 = ", x+5)
print("x-5 = ", x-5)
print("x * 2 = ", x*2)
print("x/2 = ", x/2)
print("x//2 = ", x//2) #floor division

x =  [0 1 2 3]
x+5 =  [5 6 7 8]
x-5 =  [-5 -4 -3 -2]
x * 2 =  [0 2 4 6]
x/2 =  [ 0.   0.5  1.   1.5]
x//2 =  [0 0 1 1]


There is also a unary ufunc for negation, a ``**`` operator for exponentiation, and a ``%``
operator for modulus:

In [5]:
print("-x = ", -x)
print("x**2 = ", x**2)
print("x % 2 = ", x%2)

-x =  [ 0 -1 -2 -3]
x**2 =  [0 1 4 9]
x % 2 =  [0 1 0 1]
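
Each of these arithmetic operators is a convenient wrapper around a specific NumPy function. The sketch below pairs each operator with its ufunc and checks that the two forms agree. The dictionary is only a convenience for the comparison.

```python
import numpy as np

x = np.arange(4)

# operator form on the left, equivalent ufunc call on the right
pairs = {
    "+": (x + 5, np.add(x, 5)),
    "-": (x - 5, np.subtract(x, 5)),
    "unary -": (-x, np.negative(x)),
    "*": (x * 2, np.multiply(x, 2)),
    "/": (x / 2, np.divide(x, 2)),
    "//": (x // 2, np.floor_divide(x, 2)),
    "**": (x ** 2, np.power(x, 2)),
    "%": (x % 2, np.mod(x, 2)),
}

for op, (via_operator, via_ufunc) in pairs.items():
    print(op, np.array_equal(via_operator, via_ufunc))
```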


In [6]:
x = np.array([-2, -1, 0, 1, 2])
abs(x)

array([2, 1, 0, 1, 2])

In [7]:
np.abs(x)

array([2, 1, 0, 1, 2])

In [8]:
np.absolute(x)

array([2, 1, 0, 1, 2])

In [9]:
y = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(y)  #returns the magnitude

array([ 5.,  5.,  2.,  1.])

In [10]:
theta = np.linspace(0, np.pi, 3)

In [12]:
print("theta  = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta  =  [ 0.          1.57079633  3.14159265]
sin(theta) =  [  0.00000000e+00   1.00000000e+00   1.22464680e-16]
cos(theta) =  [  1.00000000e+00   6.12323400e-17  -1.00000000e+00]
tan(theta) =  [  0.00000000e+00   1.63312394e+16  -1.22464680e-16]


In [16]:
x = [1,2,3]
print("x = ",x)
print("e^x = ", np.exp(x))
print("2^x = ", np.exp2(x))
print("2^x = ", np.power(2,x))
print("3^x = ", np.power(3,x))


x =  [1, 2, 3]
e^x =  [  2.71828183   7.3890561   20.08553692]
2^x =  [ 2.  4.  8.]
2^x =  [2 4 8]
3^x =  [ 3  9 27]


In [18]:
x = [1, 2, 4, 10]
print("x = ", x)
print("ln(x) = ", np.log(x))
print("log2(x) = ",np.log2(x))
print("log10(x) = ", np.log10(x))



x =  [1, 2, 4, 10]
ln(x) =  [ 0.          0.69314718  1.38629436  2.30258509]
log2(x) =  [ 0.          1.          2.          3.32192809]
log10(x) =  [ 0.          0.30103     0.60205999  1.        ]


Another excellent source for more specialized and obscure ufuncs is the submodule
``scipy.special``.

In [19]:
from scipy import special

In [20]:
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) = ", special.gamma(x))
print("ln|gamma(x)| = ", special.gammaln(x))
print("beta(x, 2)= ", special.beta(x, 2))

gamma(x) =  [  1.00000000e+00   2.40000000e+01   3.62880000e+05]
ln|gamma(x)| =  [  0.           3.17805383  12.80182748]
beta(x, 2)=  [ 0.5         0.03333333  0.00909091]


In [21]:
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))

erf(x) = [ 0.          0.32862676  0.67780119  0.84270079]
erfc(x) = [ 1.          0.67137324  0.32219881  0.15729921]
erfinv(x) = [ 0.          0.27246271  0.73286908         inf]


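For large calculations, it is sometimes useful to specify the array where the result of the calculation will be stored. For all ufuncs, this can be done with the ``out`` argument, which writes computation results directly to the memory location where you’d like them to be rather than creating a temporary array.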

In [22]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)

[  0.  10.  20.  30.  40.]


In [23]:
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)

[  1.   0.   2.   0.   4.   0.   8.   0.  16.   0.]


__4. Aggregates__

To reduce an array with a particular operation, we can use the ``reduce`` method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.

For example, calling ``reduce`` on the ``add`` ufunc returns the sum of all elements in the
array. Similarly, calling ``reduce`` on the ``multiply`` ufunc results in the product of all array
elements.

If we’d like to store all the intermediate results of the computation, we can instead use
``accumulate``.

In [24]:
x = np.arange(1, 6)
np.add.reduce(x)

15

In [25]:
np.multiply.reduce(x)

120

In [26]:
np.add.accumulate(x)

array([ 1,  3,  6, 10, 15])

In [27]:
np.multiply.accumulate(x)

array([  1,   2,   6,  24, 120])
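
For these particular cases NumPy also provides dedicated functions (``np.sum``, ``np.prod``, ``np.cumsum``, ``np.cumprod``). A short sketch comparing them with the ``reduce`` and ``accumulate`` calls above:

```python
import numpy as np

x = np.arange(1, 6)

total = np.sum(x)            # same result as np.add.reduce(x)
product = np.prod(x)         # same result as np.multiply.reduce(x)
running_sum = np.cumsum(x)   # same result as np.add.accumulate(x)
running_prod = np.cumprod(x) # same result as np.multiply.accumulate(x)

print(total, product)
```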

__Outer products__

Any ufunc can compute the output of all pairs of two different inputs using the ``outer`` method.

This allows you, in one line, to do things like create a multiplication table:

In [28]:
x = np.arange(1, 6)
np.multiply.outer(x, x)

array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])

In [30]:
sum(np.random.random(5))

2.9054322527231942
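
The cell above uses Python's built-in ``sum``, which gives the expected result for a one-dimensional array. It is not interchangeable with ``np.sum`` on multidimensional arrays, though. The following sketch, on an arbitrary 3×4 array, shows the difference:

```python
import numpy as np

M = np.arange(12).reshape((3, 4))

builtin_total = sum(M)     # iterates over rows, so the result is a 1-D array
numpy_total = np.sum(M)    # sums every element, giving a single number

print(builtin_total)
print(numpy_total)
```

``np.sum`` is also aware of array dimensions through its ``axis`` argument, and it runs in compiled code, so it is generally the better choice for NumPy arrays.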

__Multidimensional aggregates__

One common type of aggregation operation is an aggregate along a row or column.

Aggregation functions take an additional argument specifying the ``axis`` along which
the aggregate is computed. The ``axis`` keyword specifies the dimension of the array that will be *collapsed*, rather than the dimension that will be returned.

For example, we can find the minimum value within each column by specifying ``axis=0``.

Similarly, we can find the maximum value within each row by specifying ``axis=1``.

In [31]:
M = np.random.random((3, 4))
print(M)

[[ 0.23339516  0.66858601  0.47506736  0.91742644]
 [ 0.25185677  0.64024972  0.62014868  0.2267939 ]
 [ 0.18170519  0.56225329  0.50423601  0.85699945]]


In [32]:
np.min(M,axis=0)

array([ 0.18170519,  0.56225329,  0.47506736,  0.2267939 ])

In [33]:
np.max(M,axis=1)

array([ 0.91742644,  0.64024972,  0.85699945])
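
Because the ``axis`` argument names the dimension that is collapsed, the shape of the result is the input shape with that dimension removed. A small deterministic sketch, using an illustrative array instead of the random ``M`` above:

```python
import numpy as np

A = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

col_min = A.min(axis=0)   # collapse the rows: one value per column
row_max = A.max(axis=1)   # collapse the columns: one value per row

print(col_min.shape, row_max.shape)
```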

__5. Computation on Arrays: Broadcasting__

Another means of *vectorizing* operations is to use NumPy’s *broadcasting* functionality.

Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of *different* sizes.

__Rules of Broadcasting__

Broadcasting in NumPy follows a strict set of rules to determine the interaction
between the two arrays:
* Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
* Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
* Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
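
As a sketch of rule 3 in action, consider a ``(3, 2)`` array and a ``(3,)`` array. Rule 1 pads the second shape to ``(1, 3)``, rule 2 stretches it to ``(3, 3)``, and the final shapes ``(3, 2)`` and ``(3, 3)`` still disagree, so an error is expected. Reshaping the one-dimensional array into a column is one way to make the shapes compatible. The flag name ``compatible`` is illustrative.

```python
import numpy as np

M = np.ones((3, 2))
a = np.arange(3)

try:
    M + a                      # shapes (3, 2) and (3,) cannot be broadcast
    compatible = True
except ValueError:
    compatible = False

# a[:, np.newaxis] has shape (3, 1), which stretches to (3, 2)
fixed = M + a[:, np.newaxis]

print(compatible)
print(fixed)
```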

__Broadcasting in Practice__

* Centering an array
* Plotting a two-dimensional function
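
Centering an array, the first item above, can be sketched as follows. The data here is randomly generated for illustration: 10 observations of 3 values each. The per-column mean has shape ``(3,)`` and is broadcast across all 10 rows when subtracted.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))   # 10 observations, 3 features

Xmean = X.mean(axis=0)           # mean of each column, shape (3,)
X_centered = X - Xmean           # (10, 3) - (3,) broadcasts row by row

# the column means of the centered data should be zero, up to rounding
print(np.allclose(X_centered.mean(axis=0), 0))
```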

In [34]:
a = np.array([0, 1, 2])

In [35]:
a+5

array([5, 6, 7])

In [36]:
M = np.ones((3, 3))
M

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [37]:
M + a

array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])

In [38]:
#broadcasting of both arrays

a = np.arange(3)
b = np.arange(3)[:, np.newaxis]
print(a)
print(b)

[0 1 2]
[[0]
 [1]
 [2]]


In [39]:
a + b

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

In [41]:
b=np.arange(3)

In [47]:
M = np.ones((2, 3))
a = np.arange(3)

In [48]:
M.shape

(2, 3)

In [49]:
a.shape

(3,)

In [58]:
M+a

array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])

__6. Comparisons, Masks, and Boolean Logic__

Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.

In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

In [59]:
x = np.array([1, 2, 3, 4, 5])

In [60]:
x <3

array([ True,  True, False, False, False], dtype=bool)

In [61]:
x==3

array([False, False,  True, False, False], dtype=bool)

In [62]:
x!=3

array([ True,  True, False,  True,  True], dtype=bool)

A summary of the comparison operators and their equivalent ufuncs:
<table>
    <tr><td><b>Operator</b></td><td><b>Equivalent ufunc</b></td></tr>
    <tr><td>==</td><td>np.equal</td></tr>
    <tr><td>!=</td><td>np.not_equal</td></tr>
    <tr><td>&lt;</td><td>np.less</td></tr>
    <tr><td>&lt;=</td><td>np.less_equal</td></tr>
    <tr><td>&gt;</td><td>np.greater</td></tr>
    <tr><td>&gt;=</td><td>np.greater_equal</td></tr>
</table>
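
The Boolean arrays produced by these comparisons can be used to count, test, and extract values. The following is a minimal sketch on an arbitrary example array, using ``np.count_nonzero``, ``np.any``, ``np.all``, and masking by index:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 6                      # Boolean array, same shape as x
count = np.count_nonzero(mask)    # how many values are less than 6
any_big = np.any(x > 8)           # is at least one value greater than 8?
all_small = np.all(x < 10)        # are all values less than 10?
selected = x[mask]                # 1-D array of the values where mask is True

print(count, any_big, all_small)
print(selected)
```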

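__Sorting Arrays__

The ``np.sort`` function returns a sorted copy of an array without modifying the input:

In [63]:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)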

array([1, 2, 3, 4, 5])

A related function is ``argsort``, which instead returns the indices of the sorted
elements:

In [64]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)

[1 0 3 2 4]


These indices can then be used (via fancy indexing) to construct the sorted array if desired:

In [65]:
x[i]


array([1, 2, 3, 4, 5])
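
Fancy indexing means passing an array or list of indices in place of a single index. A short sketch with made-up values is given below. Note that the result takes the shape of the index array, not the shape of the array being indexed.

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20])

ind = [3, 5, 0]
picked = x[ind]                    # elements at positions 3, 5, and 0

grid_ind = np.array([[3, 5],
                     [0, 1]])
shaped = x[grid_ind]               # result has the 2x2 shape of grid_ind

print(picked)
print(shaped)
```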

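*Sorting along rows or columns*

NumPy's sorting functions can sort along specific rows or columns of a multidimensional array using the ``axis`` argument: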

In [66]:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]


In [67]:
# sort each column of X
np.sort(X, axis=0)

array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])

In [68]:
# sort each row of X
np.sort(X, axis=1)

array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])

__Structured Data: NumPy’s Structured Arrays__

This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data.

In [69]:
#dictionary method
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})

dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [70]:
#A compound type can also be specified as a list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
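
To show how such a compound dtype might be used, the sketch below creates an empty structured array, fills it from three plain lists, and then filters it with a Boolean mask. The sample records are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical sample records, for illustration only
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Empty container with one named field per column
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

data['name'] = name
data['age'] = age
data['weight'] = weight

print(data[0])                      # first record
under_30 = data[data['age'] < 30]['name']
print(under_30)                     # names of people younger than 30
```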