# NumPy

In [2]:
import numpy as np

# Why choose NumPy: The Benefits

If you already know Python, do you need to learn NumPy to be a Data Scientist? Well, not necessarily, but yes if you want to be a good one.

NumPy (**Num**erical **Py**thon) is used to do numerical computations ___efficiently___ in Python.

1. More speed: Under the hood, NumPy runs its loops in compiled C code, so array operations are typically far faster than the equivalent pure-Python loops.
2. Fewer loops: NumPy helps you to reduce loops and keep from getting tangled up in iteration indices. This is called vectorization.
3. Clearer code: Without loops, your code will look cleaner, more like the equations you’re trying to calculate.
4. Better quality: Stand on the shoulders of giants. Better programmers than us keep NumPy fast, friendly, and bug free.

NumPy is the de facto standard for multidimensional arrays in Python data science, and many of the most popular libraries are built on top of it. Learning it is a great way to lay a solid foundation as you expand your knowledge into more specific areas of data science.

# Motivation

## Vectorization

In [3]:
lst = [1, 2, 3, 4, 5]
lst

[1, 2, 3, 4, 5]

In [4]:
type(lst)

list

I want to multiply all elements by `2`

In [5]:
lst * 2

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Oops! We didn't want the list repeated twice. On a list, `*` means repetition, not element-wise multiplication.

`np.array` is the way to go.

In [6]:
arr = np.array(lst)
arr

array([1, 2, 3, 4, 5])

In [7]:
type(arr)

numpy.ndarray

In [8]:
arr * 2

array([ 2,  4,  6,  8, 10])

We even get fewer errors:

In [9]:
arr * 1.5

array([1.5, 3. , 4.5, 6. , 7.5])

In [10]:
lst * 1.5

TypeError: can't multiply sequence by non-int of type 'float'

## Simpler syntax

In [11]:
newLst = []

for val in lst:
    tmpVal = val * 2
    newLst.append(tmpVal)
    
newLst

[2, 4, 6, 8, 10]

In [12]:
[val * 2 for val in lst]

[2, 4, 6, 8, 10]

In [13]:
list( map(lambda x: x * 2, lst) )

[2, 4, 6, 8, 10]

### VS

In [15]:
arr * 2

array([ 2,  4,  6,  8, 10])

## Faster

NumPy takes less time because its array operations run in optimized, compiled C code instead of going through the Python interpreter element by element.

Create a really big list:

In [14]:
lst2 = list(range(1_000_000))  # One million elements

In [15]:
type(lst2)

list

In [16]:
%%timeit

list( map(lambda x: x * 2, lst2) )

89.4 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [22]:
%%timeit

newLst = []
for val in lst2:
    tmpVal = val * 2
    newLst.append(tmpVal)

82.2 ms ± 415 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [23]:
%%timeit

[n * 2 for n in lst2]

54.4 ms ± 781 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [19]:
arr2 = np.array(lst2)

In [20]:
type(arr2)

numpy.ndarray

In [24]:
%%timeit

arr2 * 2

1.31 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
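Outside of a notebook, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. This is only a rough illustration: the names `values`, `array`, `loop_time` and `numpy_time` are made up for this example, the list is smaller than the one above to keep the run short, and the actual timings depend on the machine.

```python
import timeit

import numpy as np

values = list(range(100_000))   # smaller than lst2, just to keep the run short
array = np.array(values)

# Total time for 20 repetitions of each approach
loop_time = timeit.timeit(lambda: [v * 2 for v in values], number=20)
numpy_time = timeit.timeit(lambda: array * 2, number=20)

# Both approaches must produce the same numbers
same_result = [v * 2 for v in values] == (array * 2).tolist()

print(f"list comprehension: {loop_time:.4f} s")
print(f"NumPy:              {numpy_time:.4f} s")
print("same result:", same_result)
```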


# Let's start with NumPy: creating arrays

#### 1 dimensional

In [26]:
a = np.array( [1, 2, 3] )
a

array([1, 2, 3])

In [27]:
type(a)

numpy.ndarray

In [28]:
a.shape

(3,)

#### 2 dimensional

In [29]:
b = np.array( [ [1, 2, 3], [4, 5, 6] ] )
b

array([[1, 2, 3],
       [4, 5, 6]])

In [30]:
b.shape

(2, 3)

Meaning 2 rows, 3 columns.

In [31]:
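# Rows of different lengths cannot form a regular 2D array; recent NumPy
# versions raise a ValueError here unless dtype=object is requested explicitly
b2 = np.array([[1, 2, 3], [4, 5, 6, 7]], dtype=object)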

In [32]:
b2

array([list([1, 2, 3]), list([4, 5, 6, 7])], dtype=object)

#### 3 dimensional

In [None]:
# BigList(
#        MediumList( SmallList(), SmallList() ),
#        MediumList( SmallList(), SmallList() ),
#        MediumList( SmallList(), SmallList() ),
#        MediumList( SmallList(), SmallList() )
#    )

In [34]:
c = np.array([
    [[55, 66, 3], [40, 90, 3]],
    [[10, 10, 3], [10, 11, 3]],
    [[8, 9, 354], [6, 75, 34]],
    [[2, 3, 443], [3, 4, 199]]
])

c

array([[[ 55,  66,   3],
        [ 40,  90,   3]],

       [[ 10,  10,   3],
        [ 10,  11,   3]],

       [[  8,   9, 354],
        [  6,  75,  34]],

       [[  2,   3, 443],
        [  3,   4, 199]]])

In [30]:
c.shape

(4, 2, 3)

An example of 3-dimensional arrays: RGB images (height, width, and 3 color channels).
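To make the RGB idea concrete, here is a minimal sketch of a tiny 2x2 "image" stored as a 3-dimensional array. The layout `(height, width, channels)` and the `uint8` type are common conventions assumed here, not something the notebook prescribes, and the names `image` and `red_channel` are invented for the example.

```python
import numpy as np

# A tiny 2x2 "image": shape is (height, width, channels)
image = np.zeros((2, 2, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]   # top-left pixel: pure red
image[1, 1] = [0, 0, 255]   # bottom-right pixel: pure blue

print(image.shape)          # (2, 2, 3)

# Taking index 0 on the last axis keeps only the red values (a 2D array)
red_channel = image[:, :, 0]
print(red_channel)
```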

### Built-in functions for creating arrays

In [32]:
np.zeros((3, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [33]:
np.zeros((3, 5), dtype=int)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [34]:
np.ones((2, 3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [35]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

$$I = 
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
\end{pmatrix}
$$

In [37]:
np.full((2, 3), fill_value=4)

array([[4, 4, 4],
       [4, 4, 4]])

__Ex__: Create a 2x2 matrix initially full of `10`s. Add `3.5` to every value, then divide every value by `7` and then subtract `5` from them. Finally add the identity matrix.

In [53]:
mat = np.full((2,2), fill_value=10)
((mat + 3.5 ) / 7 ) - 5 + np.eye(2)

array([[-2.07142857, -3.07142857],
       [-3.07142857, -2.07142857]])

### What about other numbers, like random values?

In [36]:
np.random

<module 'numpy.random' from '/home/palozano/anaconda3/lib/python3.7/site-packages/numpy/random/__init__.py'>

In [37]:
np.random.random

<function RandomState.random>

In [38]:
# Draws from a uniform distribution on [0, 1); see the documentation
np.random.random()

0.39834522994635935

![Uniform distribution](uniform.png)

__Ex__: A coworker wrote a function to test if the values from `np.random.random()` are correct. Some of the pieces she wrote went missing. Fill them in.

(The idea is to count how many values are below/above a certain threshold)

In [None]:
_____ = _

for i in range(100000):
    if ______________ <= ___:
        ______ += 1
        
______ / _______
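One possible way to approach this check is sketched here, first with a loop and then vectorized. The threshold `0.3`, the fixed seed and the names `threshold`, `n_samples`, `count` and `samples` are assumptions made for this example; they are not necessarily what the coworker wrote. For a uniform distribution on [0, 1), the fraction of values below a threshold should be close to the threshold itself.

```python
import numpy as np

np.random.seed(0)        # fixed seed so the run is repeatable (an assumption)
threshold = 0.3          # hypothetical threshold
n_samples = 100_000

# Loop version: draw one value at a time and count the ones below the threshold
count = 0
for i in range(n_samples):
    if np.random.random() <= threshold:
        count += 1
fraction_loop = count / n_samples

# Vectorized version: draw all values at once; the mean of a boolean
# array is the fraction of True values
samples = np.random.random(n_samples)
fraction_vectorized = (samples <= threshold).mean()

print(fraction_loop, fraction_vectorized)   # both should be close to 0.3
```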

We can generate arrays with specific properties:

In [45]:
np.random.randint(low=20, high=30, size=(2, 3))

array([[25, 21, 20],
       [22, 23, 25]])

In [41]:
np.random.standard_normal((3, 3))

array([[ 0.43627245, -0.2882815 ,  0.3794586 ],
       [-0.94698243, -0.79281318, -0.84480087],
       [-3.03300571, -0.6897332 , -1.14453822]])

`arange` is similar to `range`

In [42]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [43]:
np.arange(start=0, stop=20, step=2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

What if we know the limits of our interval and need a certain number of points? Then we use `linspace`:

In [53]:
np.linspace(start=0, stop=1, num=5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
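The difference between the two functions can be sketched side by side: `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and, by default, includes the stop value. The variable names `by_step` and `by_count` are just for illustration.

```python
import numpy as np

# arange: "start at 0, advance 0.25 each time, stop before reaching 1"
by_step = np.arange(0, 1, 0.25)
print(by_step)      # [0.   0.25 0.5  0.75]

# linspace: "give me 5 evenly spaced points from 0 to 1, both included"
by_count = np.linspace(0, 1, 5)
print(by_count)     # [0.   0.25 0.5  0.75 1.  ]
```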

## What properties do these objects have?

In [44]:
b = np.random.randint(0, 100, (3, 4))
b

array([[ 8,  9, 74, 44],
       [83,  5, 51,  5],
       [82,  3, 44, 93]])

In [45]:
b.shape

(3, 4)

In [46]:
b.size

12

In [47]:
b.ndim

2

In [48]:
c

array([[[ 55,  66,   3],
        [ 40,  90,   3]],

       [[ 10,  10,   3],
        [ 10,  11,   3]],

       [[  8,   9, 354],
        [  6,  75,  34]],

       [[  2,   3, 443],
        [  3,   4, 199]]])

In [49]:
c.ndim

3

In [50]:
b.dtype

dtype('int64')
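As a quick recap of how these properties relate to each other, here is a small sketch on a deterministic array (the name `grid` is made up for this example). `ndim` is the length of `shape`, `size` is the product of the entries of `shape`, and `astype` returns a copy with a different `dtype`.

```python
import numpy as np

grid = np.arange(24).reshape(2, 3, 4)

print(grid.shape)   # (2, 3, 4)
print(grid.ndim)    # 3  -> number of entries in shape
print(grid.size)    # 24 -> 2 * 3 * 4

# astype returns a new array with the requested type; grid itself is unchanged
as_float = grid.astype(float)
print(as_float.dtype)   # float64
```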

## Slicing

__Ex__: Create a 3x4 matrix with random integer values from `0` up to (but not including) `100`.

In [62]:
b = np.random.randint(0, 100, (3, 4))
b

In [64]:
b[0, 1]

16

In [66]:
b[0, :]

array([ 7, 16, 39, 50])

In [69]:
b[1:3, 0:2]

array([[97, 22],
       [37, 31]])
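Because `b` is random, the values above change on every run. The same slicing patterns can be sketched on a deterministic array, which makes the results easier to follow. The name `m` and the extra "last column" example are additions for illustration.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# m looks like:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

single = m[0, 1]        # row 0, column 1            -> 1
first_row = m[0, :]     # all columns of row 0       -> [0 1 2 3]
last_col = m[:, -1]     # last column of every row   -> [ 3  7 11]
block = m[1:3, 0:2]     # rows 1-2, columns 0-1      -> [[4 5] [8 9]]

print(single, first_row, last_col, block, sep="\n")
```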

### Conditional slicing

In [55]:
a = np.random.randint(0, 100, 20)
a

array([16, 61, 23, 38, 14, 31,  8, 48, 23, 87, 82, 32, 61, 79, 45, 77, 57,
       19, 82,  8])

In [56]:
a > 50

array([False,  True, False, False, False, False, False, False, False,
        True,  True, False,  True,  True, False,  True,  True, False,
        True, False])

In [57]:
a[ a > 50 ]

array([61, 87, 82, 61, 79, 77, 57, 82])

__Ex__: Construct a condition for the values of `a` that are greater than 10 and are even numbers. Then store it in a variable. Select the values of `a` that fulfill the condition.

In [58]:
cond = (a > 10) & (a % 2 == 0)

print(cond)

a[cond]

[ True False False  True  True False False  True False False  True  True
 False False False False False False  True False]


array([16, 38, 14, 48, 82, 32, 82])

__Ex__: Get the subarray of `a` with the elements whose first digit is `7`.

In [59]:
cond = ((a >= 70) & (a < 80)) | (a == 7)
# Alternative that only covers the two-digit case: cond = a // 10 == 7

a[cond]

array([79, 77])

**Exercise**: How many families have more than 2 children?

Remember that booleans are a subtype of integers in Python, so you can sum `True`s and `False`s: each `True` counts as 1 and each `False` as 0.

In [64]:
childrens = np.random.randint(0, 5, 100)
childrens

array([0, 4, 4, 1, 1, 3, 1, 4, 2, 0, 2, 2, 1, 3, 2, 1, 4, 2, 3, 4, 4, 0,
       4, 2, 1, 3, 2, 1, 4, 2, 1, 1, 2, 4, 3, 1, 2, 0, 1, 3, 3, 4, 4, 4,
       2, 4, 2, 1, 1, 1, 3, 3, 0, 4, 2, 0, 4, 1, 4, 4, 1, 4, 1, 3, 3, 2,
       1, 4, 4, 4, 3, 0, 2, 2, 2, 4, 4, 3, 2, 0, 3, 3, 2, 0, 3, 4, 1, 2,
       2, 4, 1, 0, 0, 1, 0, 2, 4, 0, 4, 1])

In [65]:
sum(childrens > 2)
# Equivalent NumPy method: (childrens > 2).sum()

43
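The counting trick can be sketched on a small fixed array, where the result is easy to verify by hand. The data and the names `children`, `mask` and `share` are invented for this example; `np.count_nonzero` is an alternative that states the intent more directly.

```python
import numpy as np

children = np.array([0, 4, 1, 3, 2, 2, 5, 3])   # made-up data

mask = children > 2          # [False  True False  True False False  True  True]

count_sum = mask.sum()                   # True counts as 1  -> 4
count_nonzero = np.count_nonzero(mask)   # same count        -> 4
share = mask.mean()                      # fraction of True  -> 0.5

print(count_sum, count_nonzero, share)
```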

Similar to conditional slicing, we have...

### Conditional assignment

In [80]:
b = np.random.randint(0, 100, (3, 4))
b

array([[63, 22, 63,  5],
       [35, 77, 22, 89],
       [39, 80, 99, 83]])

In [68]:
b < 50

array([[ True, False, False, False],
       [ True, False, False,  True],
       [False, False, False,  True]])

In [69]:
b[b < 50] = 0
b[b > 90] = 100

In [70]:
b

array([[  0,  60, 100,  85],
       [  0,  73,  53,   0],
       [ 79,  78,  79,   0]])

__Ex__: Set to 0 the values from `b` that are greater than 70 or less than 40:

In [83]:
cond = (b < 40) | (b > 70)
b[ cond ] = 0
b

array([[63,  0, 63,  0],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0]])
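Assigning through a mask modifies the array in place. When the original values should be kept, one option is `np.where`, which builds a new array instead. A minimal sketch, with made-up data and names (`scores`, `zeroed`):

```python
import numpy as np

scores = np.array([[63, 22, 63, 5],
                   [35, 77, 22, 89]])

# np.where(condition, value_if_true, value_if_false) returns a new array
zeroed = np.where(scores < 50, 0, scores)

print(zeroed)    # values below 50 replaced by 0
print(scores)    # the original array is left untouched
```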

## Useful array methods

In [98]:
a = np.random.randint(0, 1000, 30)
a

array([366, 530, 337, 767, 610, 149, 460, 511, 491, 397, 348, 405, 580,
       374, 983, 932, 830, 219,  44, 385, 894, 959, 361, 903, 303, 243,
       543, 420, 528, 950])

In [99]:
a.max()

983

In [100]:
a.min()

44

In [101]:
a.sum()

15822

In [110]:
b = np.random.randint(0, 1000, (4, 5))
b

array([[682, 254, 642, 710, 553],
       [663, 947, 815, 148,  75],
       [511, 640, 505,  85, 558],
       [682,  47, 448, 652, 742]])

In [111]:
b.max()

947

In [113]:
b.max(axis=0)

array([682, 947, 815, 710, 742])

In [114]:
b.max(axis=1)

array([710, 947, 640, 742])

In [109]:
b.mean()

547.2333333333333

In [110]:
b.mean(axis=0)

array([413.5, 489.5, 425. , 262.5, 271.5])

In [111]:
b.mean(axis=1)

array([368.4, 423.8, 365.8, 331.6])

In [115]:
b.std(axis=0)

array([ 71.72342714, 347.13037896, 141.50353352, 283.86913798,
       247.01518172])
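The `axis` argument is easier to follow on a small array whose results can be checked by hand. In this sketch (the name `t` is arbitrary), `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row.

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = t.sum(axis=0)     # one value per column -> [5 7 9]
row_sums = t.sum(axis=1)     # one value per row    -> [ 6 15]
col_max = t.max(axis=0)      # [4 5 6]
row_means = t.mean(axis=1)   # [2. 5.]

print(col_sums, row_sums, col_max, row_means, sep="\n")
```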

In [117]:
a = np.random.random(20)
a

In [119]:
a.round(1)

array([0.7, 0.3, 0.8, 0.7, 0.5, 0. , 0.8, 0.3, 0.8, 0.3, 0.3, 0.6, 0.4,
       0.9, 0.1, 0.5, 0.6, 0.6, 0.8, 0.3])

In [124]:
# turn to 1D array
b.flatten()

array([510, 140, 972, 200,  20, 436,  48, 310, 369, 956,  18, 971, 418,
       418,   4, 690, 799,   0,  63, 106])

In [126]:
b.reshape((5, 4))

array([[510, 140, 972, 200],
       [ 20, 436,  48, 310],
       [369, 956,  18, 971],
       [418, 418,   4, 690],
       [799,   0,  63, 106]])

In [125]:
b.reshape((3, 3))

ValueError: cannot reshape array of size 20 into shape (3,3)
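A reshape only works when the new shape holds exactly the same number of elements. As a small sketch (names invented for the example), one dimension can be given as `-1` so that NumPy infers it from the array's size; the error still appears when no valid value exists.

```python
import numpy as np

v = np.arange(20)

four_rows = v.reshape(4, -1)    # NumPy infers 5 columns  -> shape (4, 5)
two_cols = v.reshape(-1, 2)     # NumPy infers 10 rows    -> shape (10, 2)
print(four_rows.shape, two_cols.shape)

# 20 elements cannot be split into 3 equal rows, so this still fails
try:
    v.reshape(3, -1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("ValueError:", err)
```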

In [134]:
b / 2

array([[48.5, 16.5, 19.5],
       [15. , 24.5, 15.5],
       [ 1.5, 25. , 36.5]])

In [135]:
1 / b

array([[0.01030928, 0.03030303, 0.02564103],
       [0.03333333, 0.02040816, 0.03225806],
       [0.33333333, 0.02      , 0.01369863]])

In [141]:
b + b

array([[ 6,  2, 16],
       [ 4, 16, 10],
       [ 8,  2, 12]])

In [142]:
b * 3

array([[ 9,  3, 24],
       [ 6, 24, 15],
       [12,  3, 18]])

In [143]:
b ** 2

array([[ 9,  1, 64],
       [ 4, 64, 25],
       [16,  1, 36]])

In [144]:
np.exp(b)

array([[2.00855369e+01, 2.71828183e+00, 2.98095799e+03],
       [7.38905610e+00, 2.98095799e+03, 1.48413159e+02],
       [5.45981500e+01, 2.71828183e+00, 4.03428793e+02]])

__Ex__: Compute how many infected people there will be tomorrow.

In [84]:
# Array with infected people in 5 cities
infected_people = np.array([44, 55, 66, 155, 120])

# How infections will change
change = 1.20

In [85]:
infected_people_tomorrow = infected_people * change
infected_people_tomorrow

array([ 52.8,  66. ,  79.2, 186. , 144. ])

Will all the cities have more than 50 infected people?

In [90]:
all(infected_people_tomorrow > 50)

True

Does any city have fewer than 100?

In [91]:
any(infected_people_tomorrow < 100)

True
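The built-in `all` and `any` only behave as expected on 1-dimensional arrays. For arrays with more dimensions, the array methods `.all()` and `.any()` are the safer choice, and they also accept an `axis`. A sketch with made-up numbers (two regions with three cities each; the name `infected` is hypothetical):

```python
import numpy as np

# Rows: regions, columns: cities (invented data)
infected = np.array([[52.8, 66.0, 79.2],
                     [186.0, 144.0, 40.0]])

everywhere_above_50 = (infected > 50).all()     # False, because of the 40.0
somewhere_below_100 = (infected < 100).any()    # True

# One answer per region (row)
per_region = (infected > 50).all(axis=1)        # [ True False]

print(everywhere_above_50, somewhere_below_100, per_region)
```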

__Ex__: Create two 5x5 arrays that contain random __integers__ from 0 to 10.

In [120]:
b1 = np.random.randint(0, 10, (5, 5))
b1

array([[6, 1, 6, 0, 4],
       [4, 5, 8, 6, 7],
       [3, 3, 7, 0, 0],
       [8, 4, 9, 4, 2],
       [8, 2, 3, 5, 9]])

In [121]:
b2 = np.random.randint(0, 10, (5, 5))
b2

array([[0, 1, 7, 1, 9],
       [4, 7, 7, 1, 4],
       [6, 3, 5, 1, 2],
       [6, 9, 6, 8, 0],
       [6, 9, 0, 4, 1]])

Check where the values of the first array are less than or equal to the ones in the second array. (You only need to return True/False values in this case.)

In [122]:
b1 <= b2

array([[False,  True,  True,  True,  True],
       [ True,  True, False, False, False],
       [ True,  True, False,  True,  True],
       [False,  True, False,  True, False],
       [False,  True, False, False, False]])

Which values fulfill the previous condition?

In [125]:
b1[ b1 <= b2 ]

array([1, 6, 0, 4, 4, 5, 3, 3, 0, 0, 4, 4, 2])

Compute the average of those values.

In [126]:
b1[ b1 <= b2 ].mean()

2.769230769230769

# Broadcasting

Broadcasting allows NumPy to work with arrays of different shapes when performing arithmetic operations.

Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array, like adding a constant vector to each row of a matrix.
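The "constant vector added to each row" case can be sketched as follows, together with the column-wise variant. The arrays and their names (`matrix`, `row`, `col`) are invented for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # shape (2, 3)

row = np.array([10, 20, 30])            # shape (3,): reused for every row
plus_row = matrix + row
print(plus_row)
# [[11 22 33]
#  [14 25 36]]

col = np.array([[100],
                [200]])                 # shape (2, 1): reused for every column
plus_col = matrix + col
print(plus_col)
# [[101 102 103]
#  [204 205 206]]
```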

In [103]:
# Data: what a shop has in stock
# Rows: shirts, hats, trousers, and boots
# Columns: Monday, Tuesday, Friday

stock = np.full((4,3), fill_value=100)

# What it sells every week
#                 M  T  F
sell = np.array([[1, 2, 3],         # <-- shirts
                 [4, 5, 6],         # <-- hats
                 [7, 8, 9],         # <-- trousers
                 [10, 11, 12]])     # <-- boots

# Units restocked per day (M, T, F)
shipment = np.array([2, 2, 1])

__Ex__: How many products will you have in the shop after 7 weeks? 

In [107]:
fin = stock + 7 * (- sell + shipment)
fin

array([[107, 100,  86],
       [ 86,  79,  65],
       [ 65,  58,  44],
       [ 44,  37,  23]])

**Ex**: After those 7 weeks, what is the mean stock per day? (Days are in the columns.)

In [108]:
fin.mean(axis=0)

array([75.5, 68.5, 54.5])
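Broadcasting matches shapes starting from the last dimension, so a vector with one value per product (per row) does not line up with a `(4, 3)` array directly. A minimal sketch of the problem and one common fix, using a hypothetical `per_product` array that is not part of the exercise above:

```python
import numpy as np

stock = np.full((4, 3), 100)             # 4 products x 3 days
per_product = np.array([1, 2, 3, 4])     # hypothetical: one value per product

# Shapes (4, 3) and (4,) are incompatible: the last dimensions are 3 and 4
try:
    stock - per_product
    mismatch = False
except ValueError as err:
    mismatch = True
    print("ValueError:", err)

# Turning the vector into a column of shape (4, 1) makes it broadcast along the days
adjusted = stock - per_product[:, np.newaxis]
print(adjusted)
```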

# Further materials

NumPy's [website](https://numpy.org/)

NumPy's [documentation](https://numpy.org/doc/stable/)

NumPy [Cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

NumPy [broadcasting](https://cs231n.github.io/python-numpy-tutorial/#broadcasting)