# Introduction to NumPy

This chapter outlines techniques for effectively loading, storing, and manipulating in-memory data in Python.
The topic is very broad: datasets can come from a wide range of sources and in a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else. Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.

This chapter will cover NumPy in detail. NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.

If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go. If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there. Once you do, you can import NumPy and double-check the version:

In [1]:
import numpy
numpy.__version__

'1.18.1'

For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later. By convention, you'll find that most people in the SciPy/PyData world will import NumPy using np as an alias:

In [2]:
import numpy as np

## Understanding NumPy Data Types

Effective data-driven science and computation requires understanding how data is stored and manipulated. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a statically typed language like C or Java requires each variable to be explicitly declared, a dynamically typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```
while in Python the equivalent operation could be written this way:

```python
# Python code
result = 0
for i in range(100):
    result += i
```
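To make the contrast concrete, here is a small sketch of what dynamic typing allows. The variable names are illustrative only; the point is that the same name can be rebound to values of different types without any declaration.

```python
# Python code: a name is not tied to a type, only the value it refers to is
x = 4
type_before = type(x)   # the int class

x = "four"              # rebinding the same name to a string is allowed
type_after = type(x)    # the str class

print(type_before.__name__, "->", type_after.__name__)
```

The equivalent reassignment in C would fail to compile, because `x` would have been declared as an `int`.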

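But a Python integer is more than just an integer. The standard Python implementation is written in C, and every Python object is a C structure that carries extra information alongside its value. For an integer, that structure effectively looks like this: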

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```
This means that a single integer in Python 3.4 actually contains four pieces:

* ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
* ob_type, which encodes the type of the variable
* ob_size, which specifies the size of the following data members
* ob_digit, which contains the actual integer value that we expect the Python variable to represent
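One rough way to see this overhead is to compare the size of a Python integer object with the size of one element of a NumPy integer array. This is only a sketch: the exact number reported by `sys.getsizeof` depends on the interpreter and platform, so no specific value is assumed here.

```python
import sys
import numpy as np

# Size of a whole Python int object (value plus the bookkeeping fields above)
py_int_bytes = sys.getsizeof(1)

# Size of a single raw value inside an int64 NumPy array
np_int_bytes = np.dtype('int64').itemsize

print("Python int object:", py_int_bytes, "bytes")
print("int64 array item: ", np_int_bytes, "bytes")
```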
 
Similarly, a Python list is more than just a list. The list is Python's standard mutable multi-element container, and each of its items is a complete Python object with its own type information. The details of that layout are out of scope for this notebook, so let's get back to NumPy. Just as a Python list can be built from a range or a list comprehension, a NumPy array can be built from a Python sequence with `np.array`:

In [4]:
L = list(range(10))
L, type(L[0])

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], int)

In [6]:
# integer array:
np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [7]:
np.array([3.14, 1, 2, 3, 4, 5])

array([3.14, 1.  , 2.  , 3.  , 4.  , 5.  ])
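The output above shows integers being converted to floating point. A short sketch of how to check this: unlike a list, an array has a single `dtype`, and when the input types do not match, NumPy upcasts where it can.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([3.14, 1, 2, 3])

# All-integer input gives an integer dtype (its width is platform-dependent)
print(ints.dtype)

# One float in the input is enough to upcast every element to float64
print(mixed.dtype)
```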

In [9]:
np.array(range(10), dtype='float32')

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)

In [11]:
# a nested list comprehension produces a list of lists
[[i for i in range(i, i + 3)] for i in [2, 4, 6]]

[[2, 3, 4], [4, 5, 6], [6, 7, 8]]

In [12]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

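NumPy arrays contain values of a single type, which can be set with the `dtype` keyword as shown above. The standard NumPy data types are listed in the following table:

| Data type    | Description |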
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64``.| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128``.| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats|
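As a sketch of how the types in the table are used, a `dtype` can be given either as a string or as the corresponding NumPy object, and `np.iinfo` reports the range of an integer type. The array names here are illustrative.

```python
import numpy as np

# Two equivalent ways to request 16-bit integers
a = np.zeros(10, dtype='int16')
b = np.zeros(10, dtype=np.int16)
print(a.dtype == b.dtype)

# Each int16 element occupies 2 bytes
print(a.itemsize)

# Range of int8, matching the table row above
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)
```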

## Creating Arrays from Scratch

Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. Here are several examples:

In [13]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [14]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [15]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [16]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [17]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [20]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[0, 9, 8],
       [8, 3, 8],
       [0, 4, 9]])

In [21]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [19]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[-1.45002564, -0.08903001,  1.34583242],
       [ 1.04412898,  2.38358294, -0.6227037 ],
       [-0.45620765, -0.35299495,  1.92451558]])

In [18]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.87691581, 0.67340387, 0.23846048],
       [0.99687518, 0.52054276, 0.46180203],
       [0.0979753 , 0.5994925 , 0.81234344]])

### Creating reproducible results

In [28]:
np.random.seed(0)  # seed for reproducibility

x = np.random.randint(10, size=6)  # One-dimensional array
# x = np.random.randint(10, size=(3, 4))  # Two-dimensional array
# x = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
x

array([5, 0, 3, 3, 7, 9])
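To see what seeding buys you, the sketch below resets the seed and draws the same array twice. The names `first` and `second` are illustrative.

```python
import numpy as np

np.random.seed(0)
first = np.random.randint(10, size=6)

# Resetting the seed restarts the random sequence from the same point
np.random.seed(0)
second = np.random.randint(10, size=6)

print(np.array_equal(first, second))  # True
```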

In [29]:
print("x ndim: ", x.ndim)
print("x shape:", x.shape)
print("x size: ", x.size)

x ndim:  1
x shape: (6,)
x size:  6


In [30]:
print("dtype:", x.dtype)

dtype: int64


In [31]:
print("itemsize:", x.itemsize, "bytes")
print("nbytes:", x.nbytes, "bytes")

itemsize: 8 bytes
nbytes: 48 bytes
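The relationship between these attributes can be checked directly: `nbytes` should equal `itemsize` times `size`. Here is a minimal sketch using a hypothetical 3x4 array with an explicit `int64` dtype so the result does not depend on the platform.

```python
import numpy as np

a = np.zeros((3, 4), dtype='int64')

# 12 elements of 8 bytes each
print(a.size, a.itemsize, a.nbytes)  # 12 8 96
```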


### Array Indexing: Accessing Single Elements

In [32]:
x

array([5, 0, 3, 3, 7, 9])

In [33]:
x[0]

5

In [36]:
x = np.random.randint(10, size=(3, 4)); x

array([[8, 1, 5, 9],
       [8, 9, 4, 3],
       [0, 3, 5, 0]])

In [38]:
x[0, 0]

8

In [39]:
x[2, 0]

0
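Two further points about single-element access are worth a short sketch: negative indices count from the end, and values can be assigned with the same index notation. Because an array has a fixed type, a float assigned into an integer array is truncated without any warning.

```python
import numpy as np

a = np.array([5, 0, 3, 3, 7, 9])

# Negative indices count back from the end of the array
print(a[-1])  # 9
print(a[-2])  # 7

# Assigning a float into an integer array truncates the value
a[0] = 3.14159
print(a)  # [3 0 3 3 7 9]
```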

### Array Slicing: Accessing Subarrays

In [40]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [42]:
x[:5], x[5:]

(array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]))

In [43]:
x[4:7]

array([4, 5, 6])

In [44]:
x[::2]

array([0, 2, 4, 6, 8])

In [45]:
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

### Multi-dimensional Slicing

In [50]:
x = np.random.randint(10, size=(3, 4)); x

array([[0, 4, 7, 3],
       [2, 7, 2, 0],
       [0, 4, 5, 5]])

In [51]:
x[:2, :3]

array([[0, 4, 7],
       [2, 7, 2]])
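One thing to keep in mind is that array slices are views onto the original data rather than copies, unlike list slices. The sketch below uses a hypothetical `grid` array to show the effect, and `.copy()` as a way to get an independent subarray.

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))

# A slice is a view: writing to it changes the original array
sub = grid[:2, :2]
sub[0, 0] = 99
print(grid[0, 0])  # 99

# An explicit copy is independent of the original
sub_copy = grid[:2, :2].copy()
sub_copy[0, 0] = -1
print(grid[0, 0])  # 99
```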

### Reshaping of Arrays

Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape method. For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [54]:
grid = np.arange(1, 10)
print(grid)

[1 2 3 4 5 6 7 8 9]


In [55]:
print(grid.reshape((3, 3)))

[[1 2 3]
 [4 5 6]
 [7 8 9]]
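For this to work, the size of the initial array must match the size of the reshaped array. A small sketch of both cases follows; passing `-1` for one dimension asks NumPy to infer it from the total size.

```python
import numpy as np

values = np.arange(12)

# -1 lets NumPy work out the missing dimension: 12 / 3 = 4 columns
inferred = values.reshape((3, -1))
print(inferred.shape)  # (3, 4)

# 12 elements cannot fill a 5x3 grid, so reshape raises ValueError
try:
    values.reshape((5, 3))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```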


In [56]:
x = np.array([1, 2, 3])

# row vector via reshape
x.reshape((1, 3))

array([[1, 2, 3]])

In [57]:
# row vector via newaxis
x[np.newaxis, :]

array([[1, 2, 3]])

In [58]:
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [59]:
# column vector via newaxis
x[:, np.newaxis]

array([[1],
       [2],
       [3]])

In [60]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

In [61]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]
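`np.concatenate` also works on two-dimensional arrays, where the `axis` argument chooses the direction of the join. The following sketch uses a hypothetical 2x3 `grid`; `np.vstack` is shown as a convenience for stacking arrays of mixed dimensions.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Join along the first axis (rows); this is the default
rows = np.concatenate([grid, grid])
print(rows.shape)  # (4, 3)

# Join along the second axis (columns)
cols = np.concatenate([grid, grid], axis=1)
print(cols.shape)  # (2, 6)

# Stack a 1D array on top of a 2D array
stacked = np.vstack([np.array([9, 9, 9]), grid])
print(stacked.shape)  # (3, 3)
```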


## Introducing UFuncs

For many types of operations, NumPy provides a convenient interface into statically typed, compiled routines. These are known as vectorized operations, and they are implemented through universal functions (ufuncs). You simply perform an operation on the array, and it is applied to each element. The vectorized approach pushes the loop into the compiled layer that underlies NumPy, leading to much faster execution than an explicit Python loop.
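As a sketch of the difference, the hypothetical helper `reciprocal_loop` below computes reciprocals one element at a time in Python, while the vectorized expression does the same work in a single array operation. Both give the same values; the vectorized form is the one that runs in compiled code.

```python
import numpy as np

def reciprocal_loop(values):
    """Illustrative helper: element-by-element loop in pure Python."""
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = 1.0 / values[i]
    return out

values = np.arange(1, 6)

loop_result = reciprocal_loop(values)
vector_result = 1.0 / values   # the loop happens inside NumPy

print(np.allclose(loop_result, vector_result))  # True
```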

In [64]:
x = np.arange(4)

print("x      =", x)
print("x + 5  =", x + 5)
print("x - 5  =", x - 5)
print("x * 2  =", x * 2)
print("x / 2  =", x / 2)
print("x // 2 =", x // 2)  # floor division

x      = [0 1 2 3]
x + 5  = [5 6 7 8]
x - 5  = [-5 -4 -3 -2]
x * 2  = [0 2 4 6]
x / 2  = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


In [65]:
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


In [67]:
-(0.5*x + 1) ** 2

array([-1.  , -2.25, -4.  , -6.25])

In [68]:
np.add(x, 2)

array([2, 3, 4, 5])

| Operator | Equivalent ufunc | Description |
|----------|------------------|-------------|
| `+` | `np.add` | Addition (e.g., `1 + 1 = 2`) |
| `-` | `np.subtract` | Subtraction (e.g., `3 - 2 = 1`) |
| `-` | `np.negative` | Unary negation (e.g., `-2`) |
| `*` | `np.multiply` | Multiplication (e.g., `2 * 3 = 6`) |
| `/` | `np.divide` | Division (e.g., `3 / 2 = 1.5`) |
| `//` | `np.floor_divide` | Floor division (e.g., `3 // 2 = 1`) |
| `**` | `np.power` | Exponentiation (e.g., `2 ** 3 = 8`) |
| `%` | `np.mod` | Modulus/remainder (e.g., `9 % 4 = 1`) |
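The equivalences in the table can be checked directly. This sketch compares a few operators against their ufunc counterparts on a small array; the name `checks` is illustrative.

```python
import numpy as np

x = np.arange(4)

# Each operator gives the same result as the ufunc listed in the table
checks = {
    "add": np.array_equal(x + 2, np.add(x, 2)),
    "negative": np.array_equal(-x, np.negative(x)),
    "floor_divide": np.array_equal(x // 2, np.floor_divide(x, 2)),
    "power": np.array_equal(x ** 2, np.power(x, 2)),
    "mod": np.array_equal(x % 2, np.mod(x, 2)),
}
print(all(checks.values()))  # True
```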

In [70]:
np.abs(x)

array([0, 1, 2, 3])

### Trigonometric Functions

In [71]:
theta = np.linspace(0, np.pi, 3)

In [72]:
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.57079633 3.14159265]
sin(theta) =  [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) =  [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) =  [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]
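The values are computed to within machine precision, which is why entries that should be exactly zero show up as numbers like `1.2246468e-16`. When comparing such results, a tolerance-based check such as `np.isclose` or `np.allclose` is usually the safer choice, as in this sketch:

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)
sin_theta = np.sin(theta)

# The last entry is tiny but typically not exactly zero
print(sin_theta[-1])

# Comparing with a tolerance treats it as zero
near_zero = bool(np.isclose(sin_theta[-1], 0.0))
matches_expected = bool(np.allclose(sin_theta, [0, 1, 0]))
print(near_zero, matches_expected)  # True True
```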


In [74]:
x = [-1, 0, 1]
print("x         = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))

x         =  [-1, 0, 1]
arcsin(x) =  [-1.57079633  0.          1.57079633]
arccos(x) =  [3.14159265 1.57079633 0.        ]
arctan(x) =  [-0.78539816  0.          0.78539816]


### Exponents and Logs

In [76]:
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))

x     = [-1, 0, 1]
e^x   = [0.36787944 1.         2.71828183]
2^x   = [0.5 1.  2. ]


In [77]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]
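Since each logarithm is the inverse of the matching exponential, a quick sanity check is to apply one after the other and confirm the original values come back. The comparison uses `np.allclose` because the results are floating-point approximations.

```python
import numpy as np

x = np.array([1, 2, 4, 10])

# exp undoes the natural log; 2**x and 10**x undo log2 and log10
natural_ok = bool(np.allclose(np.exp(np.log(x)), x))
base2_ok = bool(np.allclose(np.exp2(np.log2(x)), x))
base10_ok = bool(np.allclose(10 ** np.log10(x), x))

print(natural_ok, base2_ok, base10_ok)  # True True True
```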


For further reading, head over to the [NumPy documentation](https://numpy.org/devdocs/user/quickstart.html) on [NumPy's website](https://numpy.org/).