In [1]:
# Last Updated : 02-03-2021
# This Notebook contains NumPy Tutorial

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. It provides the following capabilities:

1. ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
2. Mathematical functions for fast operations on entire arrays of data without having to write loops.
3. Tools for reading/writing array data to disk and working with memory-mapped files.
4. Linear algebra, random number generation, and Fourier transform capabilities.

NumPy is designed for efficiency on large arrays of data:

1. NumPy operations perform complex computations on entire arrays without the need for Python for loops.
2. NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects.
3. NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, much more effectively.

In [1]:
import numpy as np

In [3]:
my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [4]:
%time my_arr*2

CPU times: user 3.9 ms, sys: 5.41 ms, total: 9.31 ms
Wall time: 9.41 ms


array([      0,       2,       4, ..., 1999994, 1999996, 1999998])

In [5]:
%time for i in range(len(my_list)) : i*2

CPU times: user 110 ms, sys: 3.15 ms, total: 114 ms
Wall time: 116 ms
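The loop timed above multiplies the loop index and discards the result, so it is not quite the same work as `my_arr * 2`. The sketch below is one possible like-for-like comparison using the standard-library `timeit` module; the variable names `numpy_seconds`, `list_seconds`, and `same_values` are illustrative, and the timings will vary by machine.

```python
import timeit

import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Double every element: one vectorized expression vs. a list comprehension.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=5)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=5)

# Both approaches produce the same values.
same_values = (my_arr * 2).tolist() == [x * 2 for x in my_list]

print(f"NumPy: {numpy_seconds:.4f} s, list: {list_seconds:.4f} s, same: {same_values}")
```

Here both sides build a new container of doubled values, so the measured gap reflects the array operation itself rather than loop overhead alone.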


* One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. 
* An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. 
* The NumPy ndarray provides a means to interpret a block of homogeneous data (either contiguous or strided) as a multidimensional array object. 
* The data type, or dtype, determines how the data is interpreted: as floating point, integer, boolean, or another supported type.
* Part of what makes ndarray flexible is that every array object is a strided view on a block of data.<br>


In [6]:
# The easiest way to create an array is to use the array function
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

In [7]:
type(arr1)

numpy.ndarray

In [8]:
data2 = [[1,2,3,4],[4,5,6,7]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [4, 5, 6, 7]])

In [9]:
arr2.shape

(2, 4)

In [10]:
arr2.ndim

2

zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array 
without initializing its values to any particular value. It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.

In [11]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [12]:
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [13]:
np.empty((2,3,2))

array([[[-2.68156159e+154, -2.68156159e+154],
        [ 2.96439388e-323,  0.00000000e+000],
        [-2.68156159e+154, -2.68156159e+154]],

       [[ 1.97626258e-323,  0.00000000e+000],
        [ 0.00000000e+000,  0.00000000e+000],
        [-2.68156159e+154,  8.34402697e-309]]])

In [14]:
np.ones((3,3), dtype = 'int')

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

In [15]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [16]:
np.arange(1,10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [17]:
np.arange(1,10, 2)

array([1, 3, 5, 7, 9])

In [18]:
np.eye(2)

array([[1., 0.],
       [0., 1.]])

In [19]:
np.full((3,3), 5)

array([[5, 5, 5],
       [5, 5, 5],
       [5, 5, 5]])

In [20]:
# The data type, or dtype, is a special object containing the information (or metadata, data about data)
# the ndarray needs to interpret a chunk of memory as a particular type of data
arr3 = np.array([1,2,3], dtype = np.int32)

In [21]:
arr3.dtype

dtype('int32')

In [22]:
# Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as 
# the old dtype.
arr3.astype(np.float32)

array([1., 2., 3.], dtype=float32)

In [23]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)

In [24]:
# Be cautious with NumPy's fixed-size string types: string data in NumPy is fixed size
# and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
# np.bytes_ is used here; np.string_ was an alias for it and was removed in NumPy 2.0.

numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings.astype(np.float32)

array([ 1.25, -9.6 , 42.  ], dtype=float32)
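A minimal sketch of the silent truncation mentioned in the comment above, assuming a small Unicode string array (the name `names` and the sample values are only for illustration):

```python
import numpy as np

# The array's dtype is sized for the longest string at creation time (3 characters).
names = np.array(["Bob", "Joe"])
print(names.dtype)

# Assigning a longer string keeps only the first 3 characters, with no warning.
names[0] = "Robert"
print(names)
```

The stored value is cut to the width fixed when the array was created, which is why the width should be checked before assigning longer text.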

Arrays are important because they enable you to express batch operations on data without writing any for loops.
NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies the operation
element-wise. Arithmetic operations with scalars propagate the scalar argument to each element in the array.

In [25]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr)

[[1. 2. 3.]
 [4. 5. 6.]]


In [26]:
arr*2

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [27]:
arr+arr

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [28]:
arr-arr

array([[0., 0., 0.],
       [0., 0., 0.]])

In [29]:
1/arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [30]:
arr**0.5

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

In [31]:
# Comparisons between arrays of the same size yield boolean arrays
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [32]:
arr>arr2

array([[ True, False,  True],
       [False,  True, False]])

### Indexing and Slicing

In [33]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [34]:
arr[5]

5

In [35]:
arr[2:6]

array([2, 3, 4, 5])

In [36]:
arr[-1]

9

In [37]:
arr[3:5] = 12
arr

array([ 0,  1,  2, 12, 12,  5,  6,  7,  8,  9])

An important first distinction from Python’s built-in lists is that array slices are views on the original
array. This means that the data is not copied, and any modifications to the view will be reflected in the source array. If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array, for example with arr[5:8].copy().

In [38]:
arr_slice = arr[5:8]

In [39]:
arr_slice[1] = 1234

In [40]:
arr

array([   0,    1,    2,   12,   12,    5, 1234,    7,    8,    9])
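The difference between a view and an explicit copy can be sketched as follows. The names `view` and `copied` are illustrative, and `np.shares_memory` is used only to confirm which result still points at the original buffer.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # a slice: shares memory with arr
copied = arr[5:8].copy()   # an explicit copy: independent data

view[0] = -1               # this write is visible through arr
copied[1] = 99             # this write is not

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copied))
```

Only the write through `view` shows up in `arr`; the copy can be modified freely.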

In [41]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [42]:
arr2d[2]

array([7, 8, 9])

In [43]:
arr2d[2][1]

8

In [44]:
arr2d[2,1]

8

In [45]:
arr2d[0:2,1:3]

array([[2, 3],
       [5, 6]])

In [46]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [47]:
arr3d.shape

(2, 2, 3)

In [48]:
arr3d[0][1][0]

4

In [49]:
arr3d[0,1,0]

4

In [50]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7,4)

In [51]:
data

array([[-0.403291  ,  0.24885289,  0.23852086, -0.32008736],
       [ 0.87259096, -0.44393341, -0.75151061, -0.76422871],
       [-0.97472836, -0.46155518,  1.58584935,  0.38080582],
       [ 1.09493962,  0.06767838, -1.15654091,  0.2698516 ],
       [-0.40667794,  1.29025625, -0.30151619, -0.99322652],
       [-0.9117948 , -0.9097452 ,  1.83450566, -0.4871305 ],
       [-0.13261178,  0.09148674, -1.4176264 ,  1.5065429 ]])

In [52]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

In [53]:
data[names == 'Bob']

array([[-0.403291  ,  0.24885289,  0.23852086, -0.32008736],
       [ 1.09493962,  0.06767838, -1.15654091,  0.2698516 ]])

In [54]:
# To select everything but 'Bob', you can either use != or negate the condition using ~
# The ~ operator can be useful when you want to invert a general condition
data[~(names == 'Bob')]

array([[ 0.87259096, -0.44393341, -0.75151061, -0.76422871],
       [-0.97472836, -0.46155518,  1.58584935,  0.38080582],
       [-0.40667794,  1.29025625, -0.30151619, -0.99322652],
       [-0.9117948 , -0.9097452 ,  1.83450566, -0.4871305 ],
       [-0.13261178,  0.09148674, -1.4176264 ,  1.5065429 ]])

In [55]:
data[names != 'Bob']

array([[ 0.87259096, -0.44393341, -0.75151061, -0.76422871],
       [-0.97472836, -0.46155518,  1.58584935,  0.38080582],
       [-0.40667794,  1.29025625, -0.30151619, -0.99322652],
       [-0.9117948 , -0.9097452 ,  1.83450566, -0.4871305 ],
       [-0.13261178,  0.09148674, -1.4176264 ,  1.5065429 ]])

In [56]:
# To select two of the three names, combine multiple boolean conditions using
# boolean operators like & (and) and | (or)
mask = (names == 'Bob') | (names == 'Will') 
data[mask]

array([[-0.403291  ,  0.24885289,  0.23852086, -0.32008736],
       [-0.97472836, -0.46155518,  1.58584935,  0.38080582],
       [ 1.09493962,  0.06767838, -1.15654091,  0.2698516 ],
       [-0.40667794,  1.29025625, -0.30151619, -0.99322652]])

In [57]:
data[data < 0]

array([-0.403291  , -0.32008736, -0.44393341, -0.75151061, -0.76422871,
       -0.97472836, -0.46155518, -1.15654091, -0.40667794, -0.30151619,
       -0.99322652, -0.9117948 , -0.9097452 , -0.4871305 , -0.13261178,
       -1.4176264 ])

In [58]:
# Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
# A single index array selects whole rows; several one-dimensional index arrays (one per axis)
# yield a one-dimensional result.
# Fancy indexing, unlike slicing, always copies the data into a new array.

arr = np.empty((8,4))
for i in range(8):
    arr[i] = i
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

In [59]:
# To select out a subset of the rows in a particular order, you can simply pass a list or
# ndarray of integers specifying the desired order
arr[[4,3,0,6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

In [60]:
# Negative indices select rows from the end
arr[[-3, -5, -7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

In [61]:
arr = np.arange(32).reshape((8,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [62]:
# Passing multiple index arrays does something slightly different: it selects a one-dimensional
# array of elements corresponding to each tuple of indices,
# here (1, 0), (5, 3), (7, 1), and (2, 2)

arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])
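To make the shapes produced by fancy indexing concrete, here is a small sketch on the same 8 × 4 array. Using `np.ix_` to select a rectangular block is one option among several and is not taken from the cells above; the names `rows`, `pairs`, and `block` are illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

rows = arr[[1, 5, 7, 2]]                          # one index array: whole rows, shape (4, 4)
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]           # two index arrays: element pairs, shape (4,)
block = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]   # rows x columns selection, shape (4, 4)

rows[0, 0] = -1   # fancy indexing copied the data, so arr is unchanged

print(rows.shape, pairs.shape, block.shape)
print(arr[1, 0])
```

The write to `rows` does not reach `arr`, which is the practical consequence of fancy indexing returning a copy rather than a view.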

In [63]:
arr.T

array([[ 0,  4,  8, 12, 16, 20, 24, 28],
       [ 1,  5,  9, 13, 17, 21, 25, 29],
       [ 2,  6, 10, 14, 18, 22, 26, 30],
       [ 3,  7, 11, 15, 19, 23, 27, 31]])

In [64]:
np.transpose(arr)

array([[ 0,  4,  8, 12, 16, 20, 24, 28],
       [ 1,  5,  9, 13, 17, 21, 25, 29],
       [ 2,  6, 10, 14, 18, 22, 26, 30],
       [ 3,  7, 11, 15, 19, 23, 27, 31]])

In [65]:
np.dot(arr.T, arr)

array([[2240, 2352, 2464, 2576],
       [2352, 2472, 2592, 2712],
       [2464, 2592, 2720, 2848],
       [2576, 2712, 2848, 2984]])

In [66]:
# transpose will accept a tuple of axis numbers to permute the axes
arr = np.arange(16).reshape((2, 2, 4))
print(arr)
print(f'Shape is : {arr.shape}')

[[[ 0  1  2  3]
  [ 4  5  6  7]]

 [[ 8  9 10 11]
  [12 13 14 15]]]
Shape is : (2, 2, 4)


In [67]:
arr.transpose((1, 0, 2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

## Universal Functions 

A universal function, or ufunc, is a function that performs element-wise operations
on data in ndarrays. You can think of them as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results.

In [68]:
arr = np.arange(1,10)
arr

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [69]:
np.sqrt(arr)

array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798,
       2.44948974, 2.64575131, 2.82842712, 3.        ])

In [70]:
np.exp(arr)

array([2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 5.45981500e+01,
       1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
       8.10308393e+03])

In [71]:
x = np.random.randn(8)
y = np.random.randn(8)

In [72]:
# numpy.maximum computes the element-wise maximum of the elements in x and y.
np.maximum(x,y)

array([ 1.87532563,  0.49705625,  0.64113972,  0.97555421, -0.60158657,
        1.11555438, -0.5486063 ,  1.72248984])

A ufunc can return multiple arrays. modf is one example, a vectorized
version of the built-in Python divmod; it returns the fractional and integral
parts of a floating-point array.

In [73]:
# divmod(a, b) returns (a // b, a % b)
divmod(3,8)

(0, 3)

In [74]:
a = np.random.randn(7)*5
a

array([-3.42259194,  4.28008251, -4.01644711, -0.07615375,  4.13790125,
       -4.0841628 ,  4.3705004 ])

In [75]:
# Return the fractional and integral parts of the array as two separate arrays
remainder, whole_part = np.modf(a)

In [76]:
remainder

array([-0.42259194,  0.28008251, -0.01644711, -0.07615375,  0.13790125,
       -0.0841628 ,  0.3705004 ])

In [77]:
whole_part

array([-3.,  4., -4., -0.,  4., -4.,  4.])

In [78]:
# Compute the absolute value element-wise for integer, floating-point, or complex values
np.abs(a)

array([3.42259194, 4.28008251, 4.01644711, 0.07615375, 4.13790125,
       4.0841628 , 4.3705004 ])

In [79]:
# Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
# (only the result of the last expression, np.log1p, is displayed)
np.log(arr)
np.log10(arr)
np.log2(arr)
np.log1p(arr)

array([0.69314718, 1.09861229, 1.38629436, 1.60943791, 1.79175947,
       1.94591015, 2.07944154, 2.19722458, 2.30258509])

In [80]:
# Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
np.sign(a)

array([-1.,  1., -1., -1.,  1., -1.,  1.])

In [81]:
# Compute the ceiling of each element (i.e., the smallest integer greater than or equal to that number)
np.ceil(a)

array([-3.,  5., -4., -0.,  5., -4.,  5.])

In [82]:
# Compute the floor of each element (i.e., the largest integer less than or equal to each element)
np.floor(a)

array([-4.,  4., -5., -1.,  4., -5.,  4.])

In [83]:
# Return boolean array indicating whether each value is NaN (Not a Number)
a = np.array([1,2, np.nan, 5,6])
np.isnan(a)

array([False, False,  True, False, False])

In [84]:
# Binary universal functions
np.add(x,y)

array([ 2.75705651, -0.25641949,  0.45078178, -0.15744833, -2.54083498,
        0.75828424, -1.79016018,  3.00265294])

In [85]:
np.subtract(x,y)

array([-0.99359475,  1.25053199,  0.83149766,  2.10855674, -1.33766183,
       -1.47282451, -0.69294759, -0.44232675])

In [86]:
np.multiply(x,y)

array([ 1.65353251, -0.37451983, -0.12204604, -1.10530539,  1.1666258 ,
       -0.39855426,  0.68112428,  2.20506792])

In [87]:
np.divide(x,y)
np.floor_divide(x,y)

array([ 0., -1., -4., -1.,  3., -1.,  2.,  0.])

In [88]:
np.power(arr,2)

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [89]:
np.maximum(x,y)

array([ 1.87532563,  0.49705625,  0.64113972,  0.97555421, -0.60158657,
        1.11555438, -0.5486063 ,  1.72248984])

In [90]:
np.minimum(x,y)

array([ 0.88173088, -0.75347574, -0.19035794, -1.13300254, -1.93924841,
       -0.35727014, -1.24155388,  1.28016309])

In [91]:
# Element-wise modulus (remainder of division)
np.mod(x,y)

array([ 0.88173088, -0.25641949, -0.12029205, -0.15744833, -0.13448869,
        0.75828424, -0.14434129,  1.28016309])

Using NumPy arrays enables you to express many kinds of data processing tasks as
concise array expressions that might otherwise require writing loops. This practice of
replacing explicit loops with array expressions is commonly referred to as vectorization.

$\sqrt{x^2 + y^2}$
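As one way to evaluate the expression above without loops, the sketch below computes it over a small grid built with `np.meshgrid`. The grid range and the names `points`, `xs`, `ys`, and `z` are assumptions chosen for illustration.

```python
import numpy as np

points = np.arange(-2, 3)              # [-2, -1, 0, 1, 2]
xs, ys = np.meshgrid(points, points)   # two 5 x 5 arrays of coordinates

# One array expression replaces a nested loop over every (x, y) pair.
z = np.sqrt(xs ** 2 + ys ** 2)

print(z.shape)
print(z[2, 2], z[0, 0])
```

Each entry of `z` is the distance of the corresponding grid point from the origin.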

In [92]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

[(x if c else y) for x,y,c in zip(xarr, yarr, cond)]

[1.1, 2.2, 1.3, 1.4, 2.5]

In [93]:
np.where(cond,xarr,yarr)

array([1.1, 2.2, 1.3, 1.4, 2.5])

A typical use of where in data analysis is to produce a new array
of values based on another array. Suppose you had a matrix of randomly generated
data and you wanted to replace all positive values with 2 and all negative values with
–2. This is very easy to do with np.where

In [94]:
arr = np.random.randn(4,4)
arr

array([[-0.09298251,  0.66575709,  0.51219953,  1.78572014],
       [ 0.59431908,  1.54399541, -1.32293715, -0.27509134],
       [ 0.1413959 ,  0.32063672, -0.17714596,  0.24664338],
       [-0.03827581,  0.03456646, -2.60578704,  0.12884836]])

In [95]:
arr>0

array([[False,  True,  True,  True],
       [ True,  True, False, False],
       [ True,  True, False,  True],
       [False,  True, False,  True]])

In [96]:
np.where(arr>0,2,-2)

array([[-2,  2,  2,  2],
       [ 2,  2, -2, -2],
       [ 2,  2, -2,  2],
       [-2,  2, -2,  2]])

In [97]:
# The arrays passed to np.where can be more than just equal-sized arrays or scalars.
np.where(arr>0, 2, arr)

array([[-0.09298251,  2.        ,  2.        ,  2.        ],
       [ 2.        ,  2.        , -1.32293715, -0.27509134],
       [ 2.        ,  2.        , -0.17714596,  2.        ],
       [-0.03827581,  2.        , -2.60578704,  2.        ]])

You can use aggregations
(often called reductions) like sum, mean, and std (standard deviation) either by
calling the array instance method or using the top-level NumPy function.

In [98]:
arr = np.random.randn(5,4)
arr

array([[ 0.34407714, -0.5643335 ,  2.0648052 ,  1.27352862],
       [ 0.56279613,  1.35819313, -0.51068743, -0.2978583 ],
       [-0.44060907,  0.32660231,  0.2526144 , -0.30947365],
       [-0.31770598,  0.35856201,  0.72714821, -0.77572427],
       [-0.73655354,  0.80487324, -0.81713091,  0.23716766]])

In [99]:
arr.mean()

0.17701457080418842

In [100]:
np.mean(arr)

0.17701457080418842

In [101]:
np.mean(arr,axis=1)

array([ 0.77951937,  0.27811088, -0.0427165 , -0.00193001, -0.12791089])

In [102]:
np.mean(arr,axis=0)

array([-0.11759906,  0.45677944,  0.34334989,  0.02552801])

arr.mean(1) means “compute mean across the columns”, whereas arr.sum(0)
means “compute sum down the rows.”
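The axis convention is easier to check on a tiny array with known values. This is a minimal sketch; the 2 × 3 array and the names `row_means` and `col_sums` are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

row_means = arr.mean(axis=1)   # one value per row: the columns are collapsed
col_sums = arr.sum(axis=0)     # one value per column: the rows are collapsed

print(row_means)   # [2. 5.]
print(col_sums)    # [5 7 9]
```

The axis passed is the one that disappears from the result's shape.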

In [103]:
arr.sum(axis=1)

array([ 3.11807746,  1.11244354, -0.17086601, -0.00772002, -0.51164355])

In [104]:
arr.std(axis =0)

array([0.4906252 , 0.63292659, 1.01931721, 0.7015849 ])

In [105]:
np.std(arr)

0.772693227746066

Methods like cumsum and cumprod do not aggregate; instead they produce an array
of the intermediate results.

In [106]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [107]:
np.cumsum(arr)

array([ 0,  1,  3,  6, 10, 15, 21, 28, 36, 45])

In [108]:
arr = np.arange(1,9)
np.cumprod(arr)

array([    1,     2,     6,    24,   120,   720,  5040, 40320])

In [109]:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [110]:
np.cumsum(arr, axis = 1)

array([[ 0,  1,  3],
       [ 3,  7, 12],
       [ 6, 13, 21]])

In [111]:
np.cumprod(arr, axis = 0)

array([[ 0,  1,  2],
       [ 0,  4, 10],
       [ 0, 28, 80]])

In [112]:
np.argmax(arr)

8

In [113]:
# argmin, argmax: indices of the minimum and maximum elements, respectively
np.argmin(arr)

0

In [114]:
# variance
np.var(arr)

6.666666666666667

* There are two additional methods, any and all, useful especially for boolean arrays.
* any tests whether one or more values in an array is True, while all checks if every value is True. 
* These methods also work with non-boolean arrays, where non-zero elements evaluate to True

In [115]:
bools = np.array([False, False, True, False])

In [116]:
bools.any()

True

In [117]:
bools.all()

False
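The same methods applied to a non-boolean array can be sketched as follows, with an illustrative integer array `values`. Summing a boolean mask to count matches is a common companion idiom and is included as an extra.

```python
import numpy as np

values = np.array([0, 3, 0, 7])

has_nonzero = values.any()            # True: at least one element is non-zero
all_nonzero = values.all()            # False: some elements are zero
positive_count = (values > 0).sum()   # True counts as 1 when summed

print(has_nonzero, all_nonzero, positive_count)
```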

In [118]:
arr = np.random.randn(5,3)
arr

array([[ 1.16174763, -0.30700364, -0.30496934],
       [ 0.18303891,  1.05519517, -1.22934994],
       [ 1.3927857 ,  0.60885661,  0.80873381],
       [ 1.5828642 ,  1.69162955, -0.13227472],
       [-1.15915935,  1.80315499,  0.51127197]])

In [119]:
arr.sort()
arr

array([[-0.30700364, -0.30496934,  1.16174763],
       [-1.22934994,  0.18303891,  1.05519517],
       [ 0.60885661,  0.80873381,  1.3927857 ],
       [-0.13227472,  1.5828642 ,  1.69162955],
       [-1.15915935,  0.51127197,  1.80315499]])

In [120]:
arr.sort(axis=1)
arr

array([[-0.30700364, -0.30496934,  1.16174763],
       [-1.22934994,  0.18303891,  1.05519517],
       [ 0.60885661,  0.80873381,  1.3927857 ],
       [-0.13227472,  1.5828642 ,  1.69162955],
       [-1.15915935,  0.51127197,  1.80315499]])

In [121]:
# The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place.
np.sort(arr)

array([[-0.30700364, -0.30496934,  1.16174763],
       [-1.22934994,  0.18303891,  1.05519517],
       [ 0.60885661,  0.80873381,  1.3927857 ],
       [-0.13227472,  1.5828642 ,  1.69162955],
       [-1.15915935,  0.51127197,  1.80315499]])

In [122]:
arr

array([[-0.30700364, -0.30496934,  1.16174763],
       [-1.22934994,  0.18303891,  1.05519517],
       [ 0.60885661,  0.80873381,  1.3927857 ],
       [-0.13227472,  1.5828642 ,  1.69162955],
       [-1.15915935,  0.51127197,  1.80315499]])

In [123]:
# np.unique, returns the sorted unique values in an array
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [124]:
# sorted(set(names)) : Python way of doing
np.unique(names)

array(['Bob', 'Joe', 'Will'], dtype='<U4')

In [125]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

array([1, 2, 3, 4])

In [126]:
# np.in1d tests membership of the values in one array in another, returning a boolean array
# (np.in1d is deprecated as of NumPy 2.0; np.isin is the recommended replacement)
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2,3,7])

array([False, False, False,  True,  True, False, False])

<!DOCTYPE html>
<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
  border-collapse: collapse;
}
th {
      text-align: left;
}
    
</style>
</head>
<body>

<h5>Array Set Operations</h5>

<table >
  <tr>
    <th>Method</th>
    <th>Description</th> 
  </tr>
  <tr>
    <td>unique(x)</td>
    <td>Compute the sorted, unique elements in x</td>
  </tr>
  <tr>
    <td>intersect1d(x, y)</td>
    <td>Compute the sorted, common elements in x and y</td>
  </tr>
  <tr>
    <td>union1d(x, y)</td>
    <td>Compute the sorted union of elements</td>
  </tr>
    <tr>
    <td>in1d(x, y)</td>
    <td>Compute a boolean array indicating whether each element of x is contained in y </td>
  </tr>
    <tr>
    <td>setdiff1d(x, y)</td>
    <td>Set difference, elements in x that are not in y</td>
  </tr>
    <tr>
    <td>setxor1d(x, y) </td>
    <td>Set symmetric differences; elements that are in either of the arrays, but not both</td>
  </tr>
</table>

</body>
</html>
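The set operations listed in the table can be tried on two small arrays. This is a minimal sketch reusing the `values` data from the membership example; the names `x`, `y`, and the result variables are illustrative, and `np.isin` is used for the membership test.

```python
import numpy as np

x = np.array([6, 0, 0, 3, 2, 5, 6])
y = np.array([2, 3, 7])

common = np.intersect1d(x, y)     # sorted elements found in both arrays
combined = np.union1d(x, y)       # sorted union
only_in_x = np.setdiff1d(x, y)    # elements of x that are not in y
in_one_only = np.setxor1d(x, y)   # elements in exactly one of the arrays
membership = np.isin(x, y)        # boolean mask with the same shape as x

print(common, combined, only_in_x, in_one_only)
print(membership)
```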


## File Input and Output with Arrays
**np.save** and **np.load** are the two workhorse functions for efficiently saving and loading
array data on disk. Arrays are saved by default in an uncompressed raw binary
format with file extension .npy

In [127]:
arr = np.arange(18)
np.save('arr_values', arr)

In [128]:
np.load('arr_values.npy')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17])

In [129]:
# You can save multiple arrays in an uncompressed archive using np.savez and passing the arrays as keyword arguments
np.savez('array_archive.npz', a=arr, b=arr)

In [130]:
# When loading an .npz file, you get back a dict-like object that loads the individual arrays lazily
arch = np.load('array_archive.npz')
arch['b']

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17])

In [131]:
arch

<numpy.lib.npyio.NpzFile at 0x7fa397780910>

In [132]:
# If your data compresses well, you may wish to use numpy.savez_compressed instead
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

## Linear Algebra

In [2]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])

In [3]:
np.dot(x,y)

array([[ 28.,  64.],
       [ 67., 181.]])

In [4]:
# A matrix product between a two-dimensional array and a suitably sized one-dimensional
# array results in a one-dimensional array
np.dot(x,np.ones(3))

array([ 6., 15.])

In [5]:
x@np.ones(3)

array([ 6., 15.])

In [8]:
# numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant.
from numpy.linalg import inv, qr
x = np.random.randn(5,5)
mat = np.dot(np.transpose(x), x)
mat

array([[ 1.90914602,  2.66772291,  0.32639908, -0.84767272,  2.27165885],
       [ 2.66772291, 12.6075861 ,  0.27929802,  4.43536529,  1.25949214],
       [ 0.32639908,  0.27929802,  3.54223772, -0.48144081,  0.26784632],
       [-0.84767272,  4.43536529, -0.48144081,  5.36503417, -1.67868584],
       [ 2.27165885,  1.25949214,  0.26784632, -1.67868584,  3.38409189]])

In [9]:
inv(mat)

array([[ 66.4300438 , -15.40278647,  -0.68806533,  13.0515762 ,
        -32.33154411],
       [-15.40278647,   3.72769123,   0.13409155,  -3.20269095,
          7.35282943],
       [ -0.68806533,   0.13409155,   0.29745504,  -0.08444609,
          0.34654252],
       [ 13.0515762 ,  -3.20269095,  -0.08444609,   2.98583978,
         -6.08141255],
       [-32.33154411,   7.35282943,   0.34654252,  -6.08141255,
         16.21817841]])

In [10]:
mat.dot(inv(mat))

array([[ 1.00000000e+00, -1.98853859e-15,  1.58191527e-16,
         5.38507580e-16, -2.06624152e-14],
       [ 8.94598984e-15,  1.00000000e+00,  4.08056148e-17,
        -3.03417173e-15, -1.33206767e-14],
       [ 9.25635554e-15, -6.28323311e-16,  1.00000000e+00,
         4.10875962e-17, -1.18917019e-15],
       [-3.66356801e-14,  3.42094289e-15, -1.88424804e-16,
         1.00000000e+00,  4.89214824e-15],
       [ 2.42222932e-14, -3.06972171e-15, -2.46866962e-16,
         5.45031831e-15,  1.00000000e+00]])

In [140]:
# Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a
# square matrix with zeros on the off-diagonal
x = np.eye(3,3)
np.diag(x)

array([1., 1., 1.])

In [141]:
# Compute the sum of the diagonal elements
np.trace(x)

3.0

In [11]:
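# Compute the matrix determinant. Judging by the execution counts, this cell ran while x was still
# the random 5 x 5 array defined earlier, so the result is not the 1.0 that np.eye(3, 3) would give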
np.linalg.det(x)

2.271729718014758

1. eig   : Compute the eigenvalues and eigenvectors of a square matrix
2. inv   : Compute the inverse of a square matrix
3. pinv  : Compute the Moore-Penrose pseudo-inverse of a matrix
4. qr    : Compute the QR decomposition
5. svd   : Compute the singular value decomposition (SVD)
6. solve : Solve the linear system Ax = b for x, where A is a square matrix
7. lstsq : Compute the least-squares solution to Ax = b
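Two of the functions in the list above, solve and lstsq, can be sketched on small made-up systems. The matrices, the straight-line data, and all variable names here are assumptions chosen so the answers are easy to verify by hand.

```python
import numpy as np

# solve: a square system  3x + y = 9,  x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = np.linalg.solve(A, b)

# lstsq: fit intercept and slope to points lying on y = 1 + 2t
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * t
design = np.column_stack([np.ones_like(t), t])   # columns: constant term, t
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)

print(solution)
print(coeffs, rank)
```

solve requires a square, non-singular matrix, while lstsq also handles systems with more equations than unknowns, as in the line fit here.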

## Pseudorandom Number Generation
The numpy.random module supplements the built-in Python random with functions
for efficiently generating whole arrays of sample values from many kinds of probability
distributions. We say that these are pseudorandom numbers because they are generated by an algorithm
with deterministic behavior based on the seed of the random number generator.

In [12]:
# standard normal distribution using normal
np.random.normal(size = (4,4))

array([[ 2.09438439, -2.24143525,  0.70790465, -0.65108843],
       [ 1.40340659, -2.36801647, -1.11394008, -1.2314313 ],
       [-0.08698561, -0.03982736, -1.24814142,  0.39253046],
       [-0.29200018, -1.18217022,  0.39900315,  0.59890235]])

In [13]:
# You can change NumPy’s random number generation seed using np.random.seed
np.random.seed(1234)

The data generation functions in numpy.random use a global random seed. To avoid
global state, you can use numpy.random.RandomState to create a random number
generator isolated from others

In [14]:
rng = np.random.RandomState(1234)
rng.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])
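A short sketch of what isolation means in practice: two RandomState objects created with the same seed produce the same stream, and draws from the global generator do not disturb them. The names `rng_a`, `rng_b`, and the draw variables are illustrative.

```python
import numpy as np

rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)

draws_a = rng_a.randn(5)
draws_b = rng_b.randn(5)

# Drawing from the global generator does not advance rng_a or rng_b.
np.random.randn(3)

next_a = rng_a.randn(5)
next_b = rng_b.randn(5)

print(np.array_equal(draws_a, draws_b), np.array_equal(next_a, next_b))
```

Passing such a generator object into functions that need randomness keeps results reproducible without relying on the global seed.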

seed  : Seed the random number generator <br/>
permutation : Return a random permutation of a sequence, or return a permuted range <br/>
shuffle : Randomly permute a sequence in-place <br/>
rand :  Draw samples from a uniform distribution <br/>
randint : Draw random integers from a given low-to-high range <br/>
randn :  Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface) <br/>
binomial :  Draw samples from a binomial distribution <br/>
normal :  Draw samples from a normal (Gaussian) distribution <br/>
beta  : Draw samples from a beta distribution <br/>
chisquare : Draw samples from a chi-square distribution <br/>
gamma : Draw samples from a gamma distribution <br/>
uniform : Draw samples from a uniform [0, 1) distribution

In [15]:
np.random.chisquare(2,4)

array([0.42519732, 1.94629776, 1.15153819, 3.07757295])

A typical (C order) 3 × 4 × 5 array of float64 (8-byte) values has strides (160, 40,
8). Knowing about the strides can be useful because, in general, the larger the strides
on a particular axis, the more costly it is to perform computation along that axis. Strides can
even be negative, which enables an array to move “backward” through memory.

In [16]:
np.ones((3,4,5), dtype = np.float64).strides

(160, 40, 8)
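The stride values can be derived from the shape and the item size, and a reversed slice gives a quick example of a negative stride. This is a minimal sketch; the names `expected`, `transposed_strides`, and `reversed_view` are illustrative.

```python
import numpy as np

arr = np.ones((3, 4, 5), dtype=np.float64)
itemsize = arr.itemsize   # 8 bytes for float64

# In C order, stepping along an axis skips over all elements of the later axes.
expected = (4 * 5 * itemsize, 5 * itemsize, itemsize)

# Transposing reverses the strides without moving any data.
transposed_strides = arr.T.strides

# A reversed slice is a view that walks backward through the same memory.
reversed_view = np.arange(5, dtype=np.int64)[::-1]

print(arr.strides, expected)
print(transposed_strides, reversed_view.strides)
```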

In [17]:
# You can see all of the parent classes of a specific dtype by calling the type’s mro method
np.float64.mro()

[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]

In [18]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [19]:
# You can convert an array from one shape to another without copying any data
arr.reshape((4,2))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [20]:
arr.reshape((4,2)).reshape((2,4))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [22]:
# One of the passed shape dimensions can be –1, in which case the value used for that
# dimension will be inferred from the data
# Since an array’s shape attribute is a tuple, it can be passed to reshape, too
arr = np.arange(15)
arr.reshape((5,-1))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [24]:
arr = np.arange(15).reshape((5, 3))
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [25]:
# ravel does not produce a copy of the underlying values if the values in the result 
# were contiguous in the original array
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [26]:
# The flatten method behaves like ravel except it always returns a copy of the data
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
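Whether ravel and flatten return a view or a copy can be checked with `np.shares_memory`. The sketch below is illustrative; the transposed case is an added assumption showing when ravel has to copy.

```python
import numpy as np

arr = np.arange(15).reshape((5, 3))

raveled = arr.ravel()        # contiguous input: a view on the same data
flattened = arr.flatten()    # always a copy
raveled_t = arr.T.ravel()    # the transpose is not C-contiguous, so ravel must copy

print(np.shares_memory(arr, raveled))
print(np.shares_memory(arr, flattened))
print(np.shares_memory(arr, raveled_t))
```

Because `raveled` is a view here, writing to it would also change `arr`; `flattened` is safe to modify.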

C/row major order <br>
 Traverse higher dimensions first (e.g., axis 1 before advancing on axis 0). <br>
Fortran/column major order<br>
 Traverse higher dimensions last (e.g., axis 0 before advancing on axis 1).

In [28]:
arr.ravel('F')

array([ 0,  3,  6,  9, 12,  1,  4,  7, 10, 13,  2,  5,  8, 11, 14])

In [29]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

In [30]:
np.concatenate([arr1,arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [31]:
np.concatenate([arr1,arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [32]:
# hstack joins the arrays side by side (along axis 1); for 2D inputs np.column_stack gives the same result
np.hstack((arr1,arr2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [33]:
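# vstack stacks the arrays on top of each other (along axis 0); np.row_stack is an older alias, deprecated in NumPy 2.0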
np.vstack((arr1,arr2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [34]:
# split, on the other hand, slices apart an array into multiple arrays along an axis
arr = np.random.randn(5,2)
arr

array([[-0.72058873,  0.88716294],
       [ 0.85958841, -0.6365235 ],
       [ 0.01569637, -2.24268495],
       [ 1.15003572,  0.99194602],
       [ 0.95332413, -2.02125482]])

In [35]:
first, second, third = np.split(arr,[1,3])

In [164]:
first

array([[-0.72058873,  0.88716294]])

In [36]:
second

array([[ 0.85958841, -0.6365235 ],
       [ 0.01569637, -2.24268495]])

In [37]:
third

array([[ 1.15003572,  0.99194602],
       [ 0.95332413, -2.02125482]])
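The index list passed to np.split marks where each new piece starts. A minimal sketch, using a deterministic array instead of random data so the pieces are easy to compare with plain slices:

```python
import numpy as np

data = np.arange(10).reshape((5, 2))

# [1, 3] means: rows [0:1], rows [1:3], rows [3:]
first, second, third = np.split(data, [1, 3])

matches_slices = (
    np.array_equal(first, data[:1])
    and np.array_equal(second, data[1:3])
    and np.array_equal(third, data[3:])
)

print(first.shape, second.shape, third.shape)
```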

In [167]:
# hsplit/vsplit: convenience functions for splitting on axis 1 and axis 0, respectively
# split: split an array at the passed locations along a particular axis
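A short sketch of the two convenience functions on a small example array (the name `grid` is made up for this illustration). vsplit cuts between rows (axis 0) and hsplit cuts between columns (axis 1).

```python
import numpy as np

grid = np.arange(12).reshape((4, 3))

# vsplit: two equal pieces along axis 0 (rows)
top, bottom = np.vsplit(grid, 2)

# hsplit: cut before column 1 along axis 1 (columns)
left, right = np.hsplit(grid, [1])

print(top.shape, bottom.shape)  # row-wise pieces
print(left.shape, right.shape)  # column-wise pieces
```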

In [168]:
# repeat replicates each element in an array some number of times, producing a larger array
arr = np.arange(3)
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [169]:
arr.repeat([2,3,4])

array([0, 0, 1, 1, 1, 2, 2, 2, 2])

In [38]:
arr = np.random.randn(2, 2)
arr.repeat(2, axis=0)

array([[-0.33407737,  0.00211836],
       [-0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194],
       [ 0.40545341,  0.28909194]])

In [39]:
arr.repeat([2,3], axis=1)

array([[-0.33407737, -0.33407737,  0.00211836,  0.00211836,  0.00211836],
       [ 0.40545341,  0.40545341,  0.28909194,  0.28909194,  0.28909194]])

In [40]:
arr

array([[-0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194]])

In [173]:
np.tile(arr,2)

array([[-0.33407737,  0.00211836, -0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194,  0.40545341,  0.28909194]])

In [174]:
np.tile(arr,(3,2))

array([[-0.33407737,  0.00211836, -0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194,  0.40545341,  0.28909194],
       [-0.33407737,  0.00211836, -0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194,  0.40545341,  0.28909194],
       [-0.33407737,  0.00211836, -0.33407737,  0.00211836],
       [ 0.40545341,  0.28909194,  0.40545341,  0.28909194]])
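repeat and tile are easy to mix up. The sketch below, on a small integer array chosen for readability, shows that repeat duplicates individual elements while tile stacks whole copies of the array.

```python
import numpy as np

small = np.array([[1, 2], [3, 4]])

# repeat: each element is duplicated in place along the given axis
repeated = small.repeat(2, axis=1)

# tile: the whole array is laid down side by side
tiled = np.tile(small, 2)

# A tuple gives the number of copies per axis: 2 down, 1 across
tiled_rows = np.tile(small, (2, 1))

print(repeated)
print(tiled)
print(tiled_rows)
```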

In [175]:
arr = np.arange(10) * 100
inds = [7, 1, 2, 6]

In [176]:
arr[inds]

array([700, 100, 200, 600])

In [177]:
arr.take(inds)

array([700, 100, 200, 600])

In [178]:
# put does not accept an axis argument; it indexes into the flattened (one-dimensional, C order)
# version of the array
arr.put(inds,42)
arr

array([  0,  42,  42, 300, 400, 500,  42,  42, 800, 900])
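On a two-dimensional array the difference between take and put becomes visible. This is a sketch with an illustrative array: take can select along an axis, while put addresses positions in the flattened C-order layout and modifies the array in place.

```python
import numpy as np

arr2d = np.arange(12).reshape((3, 4))

# take along axis 1 picks whole columns and returns a new array
cols = arr2d.take([0, 2], axis=1)

# put uses flat indices: 0 -> position (0, 0), 5 -> position (1, 1)
arr2d.put([0, 5], [-1, -2])

print(cols)
print(arr2d)
```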

## Broadcasting

Broadcasting describes how arithmetic works between arrays of different shapes. It
can be a powerful feature, but one that can cause confusion, even for experienced
users.

In [179]:
# we say that the scalar value 4 has been broadcast to all of the other elements in
# the multiplication operation.
arr = np.arange(5)
arr*4

array([ 0,  4,  8, 12, 16])

In [180]:
arr = np.random.randn(4,3)
arr.mean(axis = 0)

array([ 0.04157829, -0.5014745 ,  0.52132896])

In [181]:
# we can demean each column of an array by subtracting the column means.
demean = arr-arr.mean(axis=0)
demean

array([[ 1.2795799 , -1.04543105, -0.72397529],
       [-0.69754764,  0.69489588,  0.03210995],
       [ 1.27657326,  0.03216922,  0.15422512],
       [-1.85860552,  0.31836596,  0.53764022]])

In [182]:
demean.mean(0)

array([0.00000000e+00, 5.55111512e-17, 2.77555756e-17])

### Broadcasting Rule :
Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting
from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is
then performed over the missing or length 1 dimensions.

In [183]:
# (4,3) + (3,) = (4,3)
# (4,3) + (4,1) = (4,3)
# (3,4,2) + (4,2) = (3,4,2)
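The shape combinations listed above can be verified by adding arrays of those shapes and inspecting the result. The sketch also includes one incompatible pair, (4, 3) and (4,), which raises a ValueError because the trailing lengths 3 and 4 differ and neither is 1.

```python
import numpy as np

a = np.zeros((4, 3))

shape_row = (a + np.ones(3)).shape                       # (4,3) + (3,)
shape_col = (a + np.ones((4, 1))).shape                  # (4,3) + (4,1)
shape_3d = (np.zeros((3, 4, 2)) + np.ones((4, 2))).shape  # (3,4,2) + (4,2)

try:
    a + np.ones(4)  # trailing dimensions 3 and 4 do not match
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(shape_row, shape_col, shape_3d, mismatch_raises)
```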

In [184]:
def demean_axis(arr, axis=0):
    means = arr.mean(axis)
    # This generalizes things like [:, :, np.newaxis] to N dimensions
    indexer = [slice(None)] * arr.ndim
    indexer[axis] = np.newaxis
    # Index with a tuple; recent NumPy versions reject a list of slices here
    return arr - means[tuple(indexer)]
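An alternative to building the indexer by hand is the `keepdims=True` argument of mean, which leaves the reduced axis in place with length 1 so the result broadcasts directly. The function name `demean_keepdims` and the test data are made up for this sketch.

```python
import numpy as np

def demean_keepdims(arr, axis=0):
    # keepdims=True keeps the reduced axis as length 1, ready for broadcasting
    return arr - arr.mean(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
data = rng.standard_normal((3, 4, 5))

result = demean_keepdims(data, axis=1)

# The means along the demeaned axis should be zero up to rounding error
max_abs_mean = np.abs(result.mean(axis=1)).max()
print(result.shape, max_abs_mean)
```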

In [185]:
col = np.array([1.28, -0.42, 0.44, 1.6])
arr[:] = col[:, np.newaxis]

In [186]:
arr

array([[ 1.28,  1.28,  1.28],
       [-0.42, -0.42, -0.42],
       [ 0.44,  0.44,  0.44],
       [ 1.6 ,  1.6 ,  1.6 ]])

In [187]:
arr[:2] = [[-1.37], [0.509]]

In [188]:
arr

array([[-1.37 , -1.37 , -1.37 ],
       [ 0.509,  0.509,  0.509],
       [ 0.44 ,  0.44 ,  0.44 ],
       [ 1.6  ,  1.6  ,  1.6  ]])

In [189]:
# reduce takes a single array and aggregates its values, optionally along an axis, by performing
# a sequence of binary operations
arr = np.arange(10)
np.add.reduce(arr)

45

In [190]:
np.sum(arr)

45

In [191]:
# we can use np.logical_and.reduce to check whether the values in each row of an array are sorted

In [192]:
np.random.seed(12346) # for reproducibility

In [193]:
arr = np.random.randn(5,5)
arr

array([[-8.99822478e-02,  7.59372617e-01,  7.48336101e-01,
        -9.81497953e-01,  3.65775545e-01],
       [-3.15442628e-01, -8.66135605e-01,  2.78568155e-02,
        -4.55597723e-01, -1.60189223e+00],
       [ 2.48256116e-01, -3.21536673e-01, -8.48730755e-01,
         4.60468309e-04, -5.46459347e-01],
       [ 2.53915229e-01,  1.93684246e+00, -7.99504902e-01,
        -5.69159281e-01,  4.89244731e-02],
       [-6.49092950e-01, -4.79535727e-01, -9.53521432e-01,
         1.42253882e+00,  1.75403128e-01]])

In [194]:
arr[::2].sort(1) # sort a few rows

In [195]:
arr[:, :-1] < arr[:, 1:]

array([[ True,  True,  True,  True],
       [False,  True, False, False],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True,  True]])

In [196]:
np.logical_and.reduce(arr[:, :-1] < arr[:, 1:], axis=1)

array([ True, False,  True, False,  True])
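For boolean data, np.logical_and.reduce gives the same answer as the all method. A minimal sketch on a small hand-written array, so the expected result is easy to see:

```python
import numpy as np

rows = np.array([[1, 2, 3],
                 [3, 1, 2],
                 [0, 5, 9]])

# True where each element is smaller than its right-hand neighbor
increasing = rows[:, :-1] < rows[:, 1:]

via_reduce = np.logical_and.reduce(increasing, axis=1)
via_all = increasing.all(axis=1)

print(via_reduce)
print(via_all)
```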

accumulate is related to reduce as cumsum is related to sum. It produces an array of
the same size with the intermediate “accumulated” values.

In [197]:
arr = np.arange(15).reshape((3,5))
np.add.accumulate(arr, axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])
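The relationship to cumsum can be checked directly, and the same pattern holds for other ufuncs; for example, np.multiply.accumulate matches cumprod. The names below are illustrative.

```python
import numpy as np

table = np.arange(15).reshape((3, 5))

running_sum = np.add.accumulate(table, axis=1)
same_as_cumsum = np.array_equal(running_sum, table.cumsum(axis=1))

# Running product of 1..5
running_prod = np.multiply.accumulate(np.arange(1, 6))
same_as_cumprod = np.array_equal(running_prod, np.arange(1, 6).cumprod())

print(running_sum)
print(running_prod)
```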

In [198]:
# outer performs a pairwise cross-product between two arrays
arr = np.arange(3).repeat([1, 2, 2])
np.multiply.outer(arr, np.arange(5))

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8],
       [0, 2, 4, 6, 8]])

In [199]:
# The number of dimensions of outer's output is the sum of the inputs' numbers of dimensions;
# its shape is x.shape + y.shape
x, y = np.random.randn(3, 4), np.random.randn(5)
result = np.subtract.outer(x, y)
result.shape

(3, 4, 5)
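The outer method can also be understood in terms of broadcasting. In this sketch (with a seeded generator chosen only for reproducibility), adding a trailing length-1 axis to x and subtracting y gives the same array as np.subtract.outer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 4))
y = rng.standard_normal(5)

via_outer = np.subtract.outer(x, y)

# (3, 4, 1) - (5,) broadcasts to (3, 4, 5)
via_broadcast = x[:, :, np.newaxis] - y

shapes_add_up = via_outer.shape == x.shape + y.shape
print(via_outer.shape, np.allclose(via_outer, via_broadcast))
```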

In [200]:
# reduceat, performs a “local reduce,” in essence an array groupby
# operation in which slices of the array are aggregated together. 
# It accepts a sequence of “bin edges” that indicate how to split and aggregate the values
arr = np.arange(10)
np.add.reduceat(arr, [0, 5, 8])

array([10, 18, 17])
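To make the bin edges concrete, the same result can be reproduced with explicit slices. This sketch assumes increasing edges, as in the example above: each edge starts a slice that runs up to the next edge, and the last slice runs to the end of the array.

```python
import numpy as np

values = np.arange(10)
edges = [0, 5, 8]

local_sums = np.add.reduceat(values, edges)

# Equivalent slices: values[0:5], values[5:8], values[8:]
manual = np.array([values[0:5].sum(), values[5:8].sum(), values[8:].sum()])

print(local_sums)
print(manual)
```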

In [201]:
arr = np.multiply.outer(np.arange(4), np.arange(5))
arr

array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])
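reduceat also accepts an axis argument. As a sketch on the same kind of multiplication table, the edges [0, 2, 4] applied along axis 1 sum columns 0-1, columns 2-3, and column 4 within each row.

```python
import numpy as np

table = np.multiply.outer(np.arange(4), np.arange(5))

grouped = np.add.reduceat(table, [0, 2, 4], axis=1)

# The same grouping written with explicit column slices
manual = np.column_stack([
    table[:, 0:2].sum(axis=1),
    table[:, 2:4].sum(axis=1),
    table[:, 4:].sum(axis=1),
])

print(grouped)
```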

reduce(x) : Aggregate values by successive applications of the operation <br/>
accumulate(x) : Aggregate values, preserving all partial aggregates<br/>
reduceat(x, bins) : “Local” reduce or “group by”; reduce contiguous slices of data to produce aggregated array<br/>
outer(x, y) : Apply operation to all pairs of elements in x and y; the resulting array has shape x.shape + y.shape

## Structured and Record Arrays
You may have noticed up until now that ndarray is a homogeneous data container;
that is, it represents a block of memory in which each element takes up the same
number of bytes, determined by the dtype. On the surface, this would appear not to
allow you to represent heterogeneous or tabular-like data. A structured array is an
ndarray in which each element can be thought of as representing a struct in C (hence
the “structured” name) or a row in a SQL table with multiple named fields.

In [202]:
dtype = [('x', np.float64), ('y', np.int32)]
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)
print(sarr)

[(1.5       ,  6) (3.14159265, -2)]


In [203]:
sarr[0]

(1.5, 6)

In [204]:
sarr[0]['y']

6

In [205]:
sarr['x']

array([1.5       , 3.14159265])

In [206]:
dtype = [('x', np.int64, 3), ('y', np.int32)]
arr = np.zeros(4, dtype=dtype)
arr

array([([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0)],
      dtype=[('x', '<i8', (3,)), ('y', '<i4')])
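When a field has its own shape, as 'x' does here, accessing that field on the whole array adds the field's dimensions to the result. A brief sketch (the name `rec` is illustrative) showing field access and assignment through the returned view:

```python
import numpy as np

dtype = [('x', np.int64, 3), ('y', np.int32)]
rec = np.zeros(4, dtype=dtype)

# Each record holds a length-3 'x', so the field view is 4 x 3
x_shape = rec['x'].shape

# Field access returns a view, so assignments write into rec
rec['x'][1] = [1, 2, 3]
rec['y'][1] = 7

second_x = rec[1]['x']
second_y = rec[1]['y']

print(x_shape, second_x, second_y)
```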