# Introduction to NumPy
## Data Science for Data Scientists
---


## Why use NumPy?

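NumPy provides fast arrays and fast numerical processing for Python.

Pure Python loops are slow for numerical work because the interpreter handles each element one at a time; NumPy runs the same operation over a whole array in compiled code.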

## How do I import NumPy?

Exercise:

Load the numpy library with its common alias.

In [1]:
#Solution
import numpy as np

## How do you create a NumPy array?

You could start with a list, and then convert it:

Exercise:

1) Create a python list containing four integers (in square brackets) and save this as x_age.

2) Use np.array() to convert your python list into a numpy array. Save this as x.

3) Use the numpy array method mean() to calculate the mean of your four values.

Hint: mean() is a method that belongs to the array you saved as x. To access it, use a dot.

In [4]:
#Solution
x_age = [18, 22, 33, 41]

# as a NumPy array, x supports fast, vectorized calculations
x = np.array(x_age)

In [6]:
#Solution
x.mean()

28.5

You can also create numpy arrays using specific utilities...

Exercise:

1) Create a numpy array containing 0 to 8 in steps of two: use np.arange() with three arguments; the start value, the non-inclusive end value, and the step size.

2) Create a numpy array with five 0's and five 1's: use np.repeat() with two arguments; the first is a python list containing the numbers to be repeated, the second is the number of repeats required.

3) Create a numpy array with the results from rolling a six sided die 10 times: use np.random.choice() with two arguments; the first is a python list containing the numbers on the die, the second is the number of 'rolls' you would like to make.

In [8]:
#Solution
np.arange(0, 10, 2) # numbers from 0 up to (but not including) 10, in steps of 2

array([0, 2, 4, 6, 8])

In [11]:
#Solution
np.repeat([0, 1], 5) # repeat each element of [0, 1] five times

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
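As a small aside, `np.repeat` repeats each element in turn, which is easy to confuse with repeating the whole list. The sketch below contrasts it with `np.tile`, which is not used elsewhere in this notebook and is shown only for comparison; the variable names are made up for this example.

```python
import numpy as np

# np.repeat repeats each element in turn
repeated = np.repeat([0, 1], 5)
print(repeated)  # [0 0 0 0 0 1 1 1 1 1]

# np.tile repeats the whole sequence instead
tiled = np.tile([0, 1], 5)
print(tiled)  # [0 1 0 1 0 1 0 1 0 1]
```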

In [15]:
#Solution
np.random.choice([1, 2, 3, 4, 5, 6], 10) # 10 rolls of a die

array([2, 1, 2, 3, 3, 5, 6, 5, 3, 3])

...here we are using the random module within numpy to simulate experimental data (i.e., drawing a number out of a set (1 to 6) 10 times).

## How do you compute with arrays?

Suppose we generate an array which represents the ages of 10 people (mean $30$, standard deviation $5$).

Exercise:

1) Use np.random.normal() with three arguments: the mean of the normal distribution, the standard deviation, and the number of numbers to randomly generate. Save this as x_age.

2) Compare x_age with 3\*x_age + 1. Has this worked as you expected?

3) Does this work if the ages were stored as a python list?

In [16]:
#Solution
x_age = np.random.normal(30, 5, 10) 

x_age

array([26.75645146, 30.06000099, 33.07749316, 33.23508197, 32.8935573 ,
       25.18538857, 26.23845934, 40.78925534, 31.12088011, 20.20150136])

Suppose we need to compute $3x_{age} + 1$, then we write:

In [17]:
#Solution
3 * x_age + 1

array([ 81.26935439,  91.18000296, 100.23247948, 100.7052459 ,
        99.68067189,  76.55616571,  79.71537801, 123.36776601,
        94.36264032,  61.60450407])

Notice that `3*` is run on every element, as is `+1`. 

This is called **vectorization**. 

#Solution

This doesn't work with lists:

In [23]:
#Solution
x = [1, 2, 3]
# x = np.array([1, 2, 3])

x + 1

TypeError: can only concatenate list (not "int") to list
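To make the contrast concrete, here is a minimal sketch of the same calculation done on a list and on an array. The names `ages_list` and `ages_array` are made up for this example. With a list you need an explicit loop (here a list comprehension), and note that `3 * ages_list` would repeat the list three times rather than multiply its elements.

```python
import numpy as np

ages_list = [18, 22, 33, 41]      # plain Python list
ages_array = np.array(ages_list)  # NumPy array with the same values

# Lists need an explicit loop over the elements
from_list = [3 * age + 1 for age in ages_list]

# Arrays apply the arithmetic to every element at once
from_array = 3 * ages_array + 1

print(from_list)   # [55, 67, 100, 124]
print(from_array)  # [ 55  67 100 124]

# Multiplying a list repeats it instead of scaling it
print(len(3 * ages_list))  # 12
```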

## What is a Sequence?

Exercise:

1) Display x_age.

2) Investigate what x_age.shape returns.

3) Compare this to len(x_age)

In [25]:
#Solution
x_age

array([26.75645146, 30.06000099, 33.07749316, 33.23508197, 32.8935573 ,
       25.18538857, 26.23845934, 40.78925534, 31.12088011, 20.20150136])

#Solution

The shape of this array defines how it is structured for calculations, $(10,)$ -- a sequence of 10 elements...

In [26]:
#Solution
x_age.shape

(10,)

In [27]:
#Solution
len(x_age)

10
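For a sequence, `.shape` and `len()` tell you the same thing. The sketch below, using made-up arrays, suggests why both exist: once an array has rows and columns, `len()` only reports the number of rows, while `.shape` reports every dimension.

```python
import numpy as np

seq = np.array([10, 11, 12])    # 1-D: a sequence of 3 elements
table = np.array([[1, 2, 3],
                  [4, 5, 6]])   # 2-D: 2 rows, 3 columns

print(seq.shape, len(seq))      # (3,) 3
print(table.shape, len(table))  # (2, 3) 2
print(table.size)               # 6 (total number of elements)
```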

### How do I index a sequence?

Just the same as python lists...

Exercise:

1) Display the first item in your x_age array using x_age\[0\].

2) Use 0:2 to select the first and second items from your array (notice that the third item, at index 2, is not included).

3) Use -1 to select the last item from your array.


In [28]:
#Solution
x_age[0]

26.756451462116274

In [29]:
#Solution
x_age[0:2]

array([26.75645146, 30.06000099])

In [30]:
#Solution
x_age[-1]

20.20150135831472

## What is a Matrix?

A table of numbers...

Exercise:

1) Run the section of code below to create the matrix saved as M.

2) Which values are in the first row and which are in the first column?

In [46]:
M = np.array([
    [1000, 12, +1], #eg., Loan, Duration, Settle
    [2000, 9, -1], #eg., Loan, Duration, Settle  
    [3000, 6, -1], #eg., Loan, Duration, Settle  
])

In [47]:
#Solution
#The first row is 1000, 12, 1. The first column is 1000, 2000, 3000.
M

array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

### How do I index a matrix?

`M[row-index, col-index]`

Note, both indexes work like list indexes -- except now there are two. 

Exercise:

1) Use M[r,c] with appropriate values for r and c to display the value in the first row in the first column.

2) Display the value in the second row in the first column.

3) Display the first and second rows (using :) from the last column.

In [48]:
#Solution
M[0, 0] # first row, first column

1000

In [49]:
#Solution
M[1, 0] # second row, first column

2000

In [50]:
#Solution
M[0:2, -1] # first two rows, last column

array([ 1, -1])

### What is a Vector?
A vector is a matrix of one column.

Exercise:

1) Use np.array() to create a matrix with one column containing the numbers 10, 11, 12.

2) Display your results from question 1 to check it's a column rather than a row.

3) Select the first item in your vector using square brackets to address the location you want. Hint: You will still need to give the column location!

4) Attempt to select the item in the first row *second* column to see what error you get when you try to address an item outside of the dimensions of your matrix.

In [51]:
#Solution
x_profit = np.array([
    [10],
    [11],
    [12]
])

x_profit

array([[10],
       [11],
       [12]])

In [52]:
#Solution
x_profit[0, 0]

10

In [53]:
#Solution
x_profit[0, 1]

IndexError: index 1 is out of bounds for axis 1 with size 1

## Why would we use a matrix of one column?

In machine learning (libraries) we must always have our features ($X$) formatted as a matrix.

Each row of the feature matrix $X$ *must* be one complete observation. This is assumed in how these libraries process data. 
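In practice you often start with a 1-D sequence and need to turn it into a one-column matrix. One common way, sketched below with made-up values, is `reshape(-1, 1)`, where `-1` asks NumPy to work out the number of rows for you. Whether a particular library requires this shape is something to check in its own documentation.

```python
import numpy as np

x_age = np.array([18, 22, 33, 41])  # 1-D sequence
X = x_age.reshape(-1, 1)            # one column, one row per observation

print(x_age.shape)  # (4,)
print(X.shape)      # (4, 1)
print(X[0])         # [18]  (the first observation, as a row)
```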

## How do I select multiple elements?

Exercise:

1) Display M (so you have it for reference)

2) Explore the difference between M\[0:2,0\] and M\[:2,0\]

3) What does M\[:,0\] do?

4) What does M\[(0,2),0\] do?

5) Select the first and third rows (all columns).

In [54]:
#Solution
M

array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

#Solution

`:2` means from index `0` up to, but not including, `2`

In [55]:
#Solution
M[:2, 0] # 0:2   :2

array([1000, 2000])

#Solution

`:` - from the beginning to the end

In [56]:
#Solution
M[:, 0]

array([1000, 2000, 3000])

#Solution

NB. you can just read `:` as "all".

So, `M[:, 0]` means `M[all rows, first column]`

In [61]:
#Solution
M[ [0, 2], :] # choose rows indexed [0, 2] and all columns

array([[1000,   12,    1],
       [3000,    6,   -1]])

Remember:  `label[index]`  <- always means FIND `index` in `label`

Remember: `[data,]` <- always means `list`
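The sketch below works through question 4 using the same `M` as above. Inside the brackets, a list or a tuple in the row position both pick out several rows, but a tuple on its own is read as a (row, column) pair.

```python
import numpy as np

M = np.array([
    [1000, 12, 1],
    [2000, 9, -1],
    [3000, 6, -1],
])

print(M[[0, 2], 0])     # [1000 3000]  first and third rows, first column
print(M[(0, 2), 0])     # [1000 3000]  same result with a tuple of rows
print(M[[0, 2]].shape)  # (2, 3)       a list alone selects whole rows
print(M[(0, 2)])        # 1            a tuple alone means row 0, column 2
```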


## How do I select elements by a condition?

Comparisons are also *vectorized*, meaning, they run across every element:

Exercise:

1) Display x_age. Display the results from x_age < 30. Compare them to check it worked.

2) Use np.where(x_age < 30) and check that the results are the locations of the ages you have that are less than 30.

3) Use the code from the previous question to select all the elements where the condition is met. Hint: put np.where(x_age < 30) in the square brackets as the address.

4) Compare your previous answer to using x_age\[ x_age < 30 \].

Extension:

Use a for loop to create a list of the ages that are less than 30. Use %%timeit to compare how long this takes compared with your answers to question 3 and 4.



In [65]:
#Solution
x_age  < 30

array([ True, False, False, False, False,  True,  True, False, False,
        True])

`np.where` tells you the index of the `True` values... 

In [67]:
#Solution
np.where(x_age < 30)

(array([0, 5, 6, 9]),)

In [69]:
#Solution
x_age[ np.where(x_age < 30)  ]  # here I select the elements which match this condition

array([26.75645146, 25.18538857, 26.23845934, 20.20150136])

In [71]:
#Solution
x_age[ x_age < 30  ]  # FIND elements in x_age, WHERE  <30

array([26.75645146, 25.18538857, 26.23845934, 20.20150136])

#Solution

Aside: to do this in raw python, we would use a loop and a condition:

NOTE: far far slower... 

In [102]:
#Solution
keep = []
for age in x_age:
    if age < 30:
        keep.append(age)
keep

[19.260181550953583,
 24.35850053422343,
 23.18504804125957,
 17.933264321074905,
 9.931122518399246,
 18.16283598114136,
 22.236638562348347,
 18.91942978375155,
 5.084342289695776,
 24.375355468023493,
 20.56660078848965]
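Outside a notebook, where `%%timeit` is not available, the comparison from the extension can be sketched with the standard-library `timeit` module. The array size, the seed, and the function names below are assumptions made for this example, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)     # seeded so the data is repeatable
ages = rng.normal(30, 5, 100_000)  # assumed size for the comparison

def with_loop():
    # raw Python: test each element one at a time
    keep = []
    for age in ages:
        if age < 30:
            keep.append(age)
    return keep

def with_mask():
    # NumPy: one vectorized comparison, then one selection
    return ages[ages < 30]

# Both approaches keep the same number of elements
assert len(with_loop()) == len(with_mask())

loop_seconds = timeit.timeit(with_loop, number=5)
mask_seconds = timeit.timeit(with_mask, number=5)
print(f"loop: {loop_seconds:.4f}s  mask: {mask_seconds:.4f}s")
```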

## How do I combine conditions?

Recall, in python:

In [103]:
age = 18
email = "michael.burgess@qa.com"

(age <= 20) and ("@" in email)

True

The problem with using `and` (`or`, `not`, etc.) with numpy is that they only work on *single* True/False values, not on whole arrays of comparisons.

Exercise:

Run the below section of code to store the temp and hours arrays.

In [108]:
temp = np.array([19, 21, 23]) # eg., temp of a room 
hours = np.array([0, 0.5, 1]) # eg., duration of heating

To combine comparisons across an array we must use *vectorized* operators (ie., ones which work with arrays).

* `&` and 
* `|` or
* `~` not
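A minimal sketch of what goes wrong with `and`, reusing the `temp` and `hours` values from above: Python tries to reduce each whole array to a single True or False, which NumPy refuses to do, whereas `&` compares element by element.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

message = ""
try:
    (temp > 20) and (hours < 0.75)  # `and` needs one True/False per side
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# The vectorized operator combines the comparisons element by element
combined = (temp > 20) & (hours < 0.75)
print(combined)  # [False  True False]
```

The brackets around each comparison matter: `&` binds more tightly than `>` or `<`, so leaving them out changes the meaning.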

Exercise:

Display True or False for each row:

1) Temperature over 20 and hours less than 0.75

2) Temperature over 20 or hours 1 or more

3) Not the answer to question 1.

Extension

Is there an alternative to question 3 that gives the same results but without using any nots?

In [110]:
#Solution
(temp > 20) & (hours < 0.75)

array([False,  True, False])

In [114]:
#Solution
(temp > 20) | (hours >= 1)

array([False,  True,  True])

In [115]:
#Solution
~((temp > 20) & (hours < 0.75))

array([ True, False,  True])

#Solution

Alternatively:

`~((temp > 20) & (hours < 0.75))`

is the same as `~(temp > 20) | ~(hours < 0.75)` (by one of De Morgan's laws),

which is the same as `(temp <= 20) | (hours >= 0.75)`.
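These equivalences can be checked directly. The sketch below builds all three versions from the same `temp` and `hours` values and compares them with `np.array_equal`; the variable names are made up for this example.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

with_not = ~((temp > 20) & (hours < 0.75))
de_morgan = ~(temp > 20) | ~(hours < 0.75)
no_nots = (temp <= 20) | (hours >= 0.75)

print(with_not)                             # [ True False  True]
print(np.array_equal(with_not, de_morgan))  # True
print(np.array_equal(with_not, no_nots))    # True
```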

## How do you simulate real-valued data?

10 random values, whose mean will be approximately $30$, and which will vary from 30, on average, by about $5$ (the standard deviation)...

In [166]:
np.random.normal(30, 5, 10)

array([19.42974668, 45.13651107, 20.74212228, 34.93130155, 29.081382  ,
       41.01384116, 30.09996855, 15.98946629, 31.61639624, 32.35979923])
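Every run of the cell above gives different numbers. If you want the same simulated data each time, one option is a seeded generator from `np.random.default_rng`, sketched below. The seed value and the larger sample size are arbitrary choices for this example; with more values, the sample mean and standard deviation land closer to the 30 and 5 we asked for.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
sample = rng.normal(30, 5, 10_000)    # same arguments as np.random.normal

print(sample.mean())  # close to 30
print(sample.std())   # close to 5

# Re-creating the generator with the same seed reproduces the same values
repeat = np.random.default_rng(seed=42).normal(30, 5, 10_000)
print(np.array_equal(sample, repeat))  # True
```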

## How do you simulate categorical data?

Categorical data is represented as *labels* (eg., die faces, cards, answers to questions, locations)....

Exercise:

Run the below sections of code to save x_like_film and calculate the probability of 'NO'.

In [143]:
x_like_film = np.random.choice(["YES", "NO"], 10)
x_like_film

array(['YES', 'YES', 'YES', 'YES', 'NO', 'YES', 'YES', 'YES', 'NO', 'YES'],
      dtype='<U3')

In numpy, `random.choice` is the easiest way to simulate a categorical variable (eg., `x_like_film`). 

A categorical variable *IS NOT* numerical in the ordinary sense, so if we wish to compute statistics on it, we typically convert it to a frequency distribution (i.e., we count the entries).

In [147]:
categories, counts = np.unique(x_like_film, return_counts=True)

counts

array([2, 8])

The rate of "NO" (i.e., $P(x=\text{NO})$):

In [163]:
counts[0] / sum(counts)

0.2
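`np.unique` returns its categories in sorted order, so `counts[0]` belongs to "NO" here only because "NO" sorts before "YES". As a sketch that avoids relying on position, the example below pairs each label with its rate, and also computes the rate directly from a comparison. It hard-codes the ten answers displayed above; the names `rates` and `p_no` are made up for this example.

```python
import numpy as np

x_like_film = np.array(['YES', 'YES', 'YES', 'YES', 'NO',
                        'YES', 'YES', 'YES', 'NO', 'YES'])

categories, counts = np.unique(x_like_film, return_counts=True)

# Pair each label with its share of the answers
rates = dict(zip(categories.tolist(), (counts / counts.sum()).tolist()))
print(rates)  # {'NO': 0.2, 'YES': 0.8}

# Or: the mean of a True/False array is the fraction that is True
p_no = (x_like_film == "NO").mean()
print(p_no)  # 0.2
```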

## Exercise (25 min)

You are hired by a cinema to make film recommendations to customers as they speak to your front desk staff.

Your staff may observe: their age, budget, like_action, like_comedy. 

Note, $x : (age, budget, action, comedy) = (18, 10, +1, -1)$

Let's simulate some data:

$x_{age} \sim N(\mu=35, \sigma=15) $

In [151]:
x_age = np.random.normal(35, 15, 25) # normal = 35 +- 15
x_budget = np.random.normal(10, 1.50, 25) # normal = 10 +- 1.50
x_action = np.random.choice([1, 0], 25) # Bernoulli = coin flip
x_comedy = np.random.choice([1, 0], 25) # Bernoulli = coin flip

### Q1. Import and Compute
* import the numpy library
    * recall, use `np`
    
* you are given a regression and classification formula
* use numpy to compute the $y$ predictions for each person
* a formula to compute likely spend on concessions (food counter)
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = 0.1x_{age} + 0.1x_{budget} + x_{action} - x_{comedy}$

* what is the expected (ie., average) spend for these customers?
    * HINT: `.mean()`
    
#### EXTRA
* a formula to compute whether they will like the blockbuster currently showing
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = (x_{age} < 18) \text{ or } (x_{budget} > 10) \text{ and } x_{action}$
* HINT:
    * `(age < 18) | (budget > 10) & (action == 1)`
    
* what is $P(y=LikeFilm)$ ?
    * HINT: `.mean()`

In [152]:
#Solution

import numpy as np

y = 0.1*x_age + 0.1*x_budget + x_action - x_comedy
y

array([1.44704469, 4.42374746, 4.37874372, 4.51483304, 4.70004981,
       3.66928777, 5.30266447, 7.03543912, 5.33375516, 3.64851062,
       1.56088317, 3.93430881, 6.97832745, 6.1155721 , 2.06641119,
       4.69138717, 2.32444407, 3.26504077, 4.89035442, 4.84909597,
       3.95136611, 3.29794222, 1.55054888, 3.11699592, 3.47919903])

In [168]:
#Solution
y.mean()

4.021038125514765

In [169]:
#Solution
y_like = (x_age < 18) | (x_budget > 10) & (x_action == +1)

In [170]:
#Solution
y_like.mean()

0.36
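One thing worth checking in the classification formula is how `|` and `&` group when there are no extra brackets. In Python, `&` binds more tightly than `|`, so the "and" part is evaluated first. The sketch below uses four made-up customers (not the simulated data) to show which grouping the unbracketed expression matches.

```python
import numpy as np

# Four hypothetical customers, chosen so the two groupings disagree
x_age = np.array([16, 25, 40, 17])
x_budget = np.array([8.0, 12.0, 11.0, 12.0])
x_action = np.array([0, 1, 0, 0])

implicit = (x_age < 18) | (x_budget > 10) & (x_action == 1)
and_first = (x_age < 18) | ((x_budget > 10) & (x_action == 1))
or_first = ((x_age < 18) | (x_budget > 10)) & (x_action == 1)

print(implicit)                             # [ True  True False  True]
print(np.array_equal(implicit, and_first))  # True
print(np.array_equal(implicit, or_first))   # False
```

Adding the inner brackets explicitly, as in `and_first`, makes the intended grouping clear to the reader.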

### Q2. Select Elements & Describe

* Produce a report of the simulated data
    * show `.mean()` of all x
    * show `.std()` of all x
    * `.min()`, `.max()`
    
* Show sample observations
    * first, last
    * first two, last two
    * extra: the median
    
* EXTRA: Show the budget of people who are adults
    * HINT: `x_budget[ x_age ... ]`
    
    * and other conditions of interest...

In [155]:
#Solution
x_age.mean(), x_age.std()

(29.301154521178837, 11.895863094323746)

In [156]:
#Solution
x_action.mean(), x_action.std()

(0.56, 0.4963869458396342)

#Solution

Aside: since we simulated the categorical variable using `0`, `1`, we didn't need to convert into counts (i.e., we set it up so that the `mean` is the probability).

In [171]:
#Solution
answers, counts = np.unique(x_action, return_counts=True)

counts[1]/sum(counts)

0.56
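The remaining parts of Q2 could be approached along these lines. This is only a sketch: it regenerates its own data with a seeded generator (the seed is arbitrary), and it assumes "adult" means aged 18 or over, which the exercise does not define.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, so the run is repeatable
x_age = rng.normal(35, 15, 25)
x_budget = rng.normal(10, 1.50, 25)

# Range and middle of the ages
print(x_age.min(), x_age.max())
print(np.median(x_age))

# Sample observations: first, last, first two, last two
print(x_age[0], x_age[-1])
print(x_age[:2], x_age[-2:])

# Budgets of adults only (assumes adult means age >= 18)
adult_budgets = x_budget[x_age >= 18]
print(adult_budgets.mean())
```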

## Extension

Start to create your own NumPy cheat sheet:

* Using the NumPy user guide: https://numpy.org/doc/stable/numpy-user.pdf
* Read Chapter 3 to start populating your cheat sheet: include Array methods (functions), Matrix methods, and Mathematical functions. Include a short description/summary to help you use them later.
* Full NumPy documentation is available here: https://numpy.org/doc/stable/numpy-ref.pdf