# Introduction to NumPy
## Data Science for Data Scientists
---


## Why use NumPy?

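NumPy adds fast, fixed-type arrays and fast numerical operations on them to python.

Pure python is slow for numerical work: a loop over a list handles one element at a time in the interpreter. NumPy runs the same loop in compiled code over a compact block of memory.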
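As a rough illustration of the speed difference, the sketch below sums the squares of a million numbers, once with a python loop and once with numpy. The array size and the use of `time.perf_counter` are illustrative choices, and the timings will differ from machine to machine.

```python
import time

import numpy as np

n = 1_000_000
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)

# Pure python: the interpreter handles one element per iteration.
start = time.perf_counter()
total_loop = 0
for v in values_list:
    total_loop += v * v
loop_seconds = time.perf_counter() - start

# NumPy: the multiplication and the sum both run in compiled code.
start = time.perf_counter()
total_numpy = int((values_array * values_array).sum())
numpy_seconds = time.perf_counter() - start

print("same answer:", total_loop == total_numpy)
print(f"loop: {loop_seconds:.4f}s, numpy: {numpy_seconds:.4f}s")
```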

## How do I import NumPy?

Exercise:

Load the numpy library with its common alias.
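For reference, the conventional alias is `np`. The version print is only an added check that the import worked.

```python
import numpy as np

# Printing the version confirms that the library loaded.
print(np.__version__)
```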

## How do you create a NumPy array?

You could start with a list, and then convert it:

Exercise:

1) Create a python list containing four integers (in square brackets) and save this as x_age.

2) Use np.array() to convert your python list into a numpy array. Save this as x.

3) Use the numpy array method mean() to calculate the mean of your four values.

Hint: mean() is stored with the values when you saved it as x. To access it, use a dot.
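One possible version of these three steps is sketched below. The four ages are made up for illustration.

```python
import numpy as np

x_age = [25, 32, 41, 58]   # a plain python list (made-up values)
x = np.array(x_age)        # converted to a numpy array

# mean() is a method of the array, reached with a dot.
print(x.mean())            # 39.0
```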

You can also create numpy arrays using specific utilities...

Exercise:

1) Create a numpy array containing 0 to 8 in steps of two: use np.arange() with three arguments; the start value, the non-inclusive end value, and the step size.

2) Create a numpy array with five 0's and five 1's: use np.repeat() with two arguments; the first is a python list containing the numbers to be repeated, the second is the number of repeats required.

3) Create a numpy array with the results from rolling a six sided die 10 times: use np.random.choice() with two arguments; the first is a python list containing the numbers on the die, the second is the number of 'rolls' you would like to make.

array([0, 2, 4, 6, 8])

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

array([6, 4, 2, 6, 5, 1, 5, 6, 6, 1])

...here we are using the random module within numpy to simulate experimental data (i.e., drawing a number out of the set 1 to 6, 10 times).
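The three utilities can be sketched as follows. The variable names are illustrative, and the die rolls will differ on every run because they are random.

```python
import numpy as np

# start, non-inclusive end, step
evens = np.arange(0, 10, 2)

# each number in the list is repeated five times in turn
zeros_ones = np.repeat([0, 1], 5)

# ten independent draws from the six faces of a die
rolls = np.random.choice([1, 2, 3, 4, 5, 6], 10)

print(evens)
print(zeros_ones)
print(rolls)
```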

## How do you compute with arrays?

Suppose we generate an array which represents the ages of 10 people (mean 30, standard deviation 5, i.e., $30 \pm 5$)

Exercise:

1) Use np.random.normal() with three arguments: the mean of the normal distribution, the standard deviation, and the number of numbers to randomly generate. Save this as x_age.

2) Compare x_age with 3\*x_age + 1. Has this worked as you expected?

3) Does this work if the ages were stored as a python list?

array([34.1991967 , 30.24432994, 22.2981988 , 30.3467465 , 36.82930756,
       28.61027986, 21.12384923, 31.55380876, 30.37955958, 31.69838353])

array([103.5975901 ,  91.73298982,  67.89459639,  92.04023949,
       111.48792268,  86.83083958,  64.37154769,  95.66142628,
        92.13867873,  96.0951506 ])

30

Notice that `3*` is run on every element, as is `+1`. 

This is called **vectorization**. 
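A small sketch of the contrast between an array and a list, using three made-up ages. With a list, `3 *` repeats the whole list (so a ten-item list becomes thirty items) and `+ 1` then fails.

```python
import numpy as np

ages_list = [20.0, 30.0, 40.0]
ages_array = np.array(ages_list)

# Array: both operations are applied to every element (61, 91, 121).
scaled = 3 * ages_array + 1
print(scaled)

# List: 3 * repeats the list, giving 9 items here.
repeated = 3 * ages_list
print(len(repeated))

# Adding a number to a list is not defined, so python raises TypeError.
list_error = False
try:
    repeated + 1
except TypeError as err:
    list_error = True
    print("TypeError:", err)
```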

## What is a Sequence?

Exercise:

1) Display x_age.

2) Investigate what x_age.shape returns.

3) Compare this to len(x_age)

[34.1991967  30.24432994 22.2981988  30.3467465  36.82930756 28.61027986
 21.12384923 31.55380876 30.37955958 31.69838353]


(10,)

10
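The sketch below compares `.shape` with `len()` on made-up data. For a one-dimensional array they carry the same number; for an array with more dimensions, `len()` reports only the first one.

```python
import numpy as np

x = np.array([34.2, 30.2, 22.3])
print(x.shape)     # (3,)  a tuple with one entry per dimension
print(len(x))      # 3

grid = np.zeros((3, 2))
print(grid.shape)  # (3, 2)
print(len(grid))   # 3, the size of the first dimension only
```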

### How do I index a sequence?

Just the same as python lists...

Exercise:

1) Display the first item in your x_age array using x_age\[0\].

2) Use 0:2 to select the first and second items from your array (notice that the third item, at index 2, is not included).

3) Use -1 to select the last item from your array.


34.19919669861618

array([34.1991967 , 30.24432994])

31.698383533844563
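The same three selections on a short made-up array, as a sketch:

```python
import numpy as np

x = np.array([34.2, 30.2, 22.3, 30.3, 31.7])

first = x[0]        # index 0 is the first item
first_two = x[0:2]  # indexes 0 and 1; the end index 2 is excluded
last = x[-1]        # negative indexes count from the end

print(first, first_two, last)
```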

## What is a Matrix?

A table of numbers...

Exercise:

1) Run the section of code below to create the matrix saved as M.

2) Which values are in the first row and which are in the first column?

In [21]:
M = np.array([
    [1000, 12, +1], #eg., Loan, Duration, Settle
    [2000, 9, -1], #eg., Loan, Duration, Settle  
    [3000, 6, -1], #eg., Loan, Duration, Settle  
])

[[1000   12    1]
 [2000    9   -1]
 [3000    6   -1]] (3, 3)


array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

### How do I index a matrix?

`M[row-index, col-index]`

Note, both indexes work like list indexes -- except now there are two. 

Exercise:

1) Use M[r,c] with appropriate values for r and c to display the value in the first row in the first column.

2) Display the value in the second row in the first column.

3) Display the first and second rows (using :) from the last column.

1000

2000

array([ 1, -1])
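A sketch of these selections, rebuilding `M` so that the block runs on its own. The last two lines, which pull out a whole row and a whole column, are an addition to the exercise.

```python
import numpy as np

M = np.array([
    [1000, 12, +1],
    [2000, 9, -1],
    [3000, 6, -1],
])

print(M[0, 0])     # first row, first column: 1000
print(M[1, 0])     # second row, first column: 2000
print(M[0:2, -1])  # first two rows of the last column: 1 and -1

first_row = M[0, :]     # every column of row 0
first_column = M[:, 0]  # every row of column 0
print(first_row, first_column)
```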

### What is a Vector?
Here, a vector means a column vector: a matrix with exactly one column.

Exercise:

1) Use np.array() to create a matrix with one column containing the numbers 10, 11, 12.

2) Display your results from question 1 to check it's a column rather than a row.

3) Select the first item in your vector using square brackets to address the location you want. Hint: You will still need to give the column location!

4) Attempt to select the item in the first row *second* column to see what error you get when you try to address an item outside of the dimensions of your matrix.

[[10]
 [11]
 [12]]


(3, 1)

10
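One way to build and probe the column vector is sketched below. The `try`/`except` is only there so that the block keeps running after the expected error.

```python
import numpy as np

# Each inner list is one row, and each row holds a single value.
v = np.array([[10], [11], [12]])

print(v)
print(v.shape)  # (3, 1): three rows, one column
print(v[0, 0])  # 10; the column index is still required

# There is no second column, so index 1 on that axis is out of bounds.
out_of_bounds = False
try:
    v[0, 1]
except IndexError as err:
    out_of_bounds = True
    print("IndexError:", err)
```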

## Why would we use a matrix of one column?

Machine learning libraries generally expect the features ($X$) to be formatted as a matrix (a 2-D array), even when there is only one feature.

Each row of the feature matrix $X$ *must* be one complete observation. This is assumed in how these libraries process data. 
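As a sketch of how a one-dimensional array can be turned into such a matrix, `reshape(-1, 1)` produces a single column and `np.column_stack` places several features side by side. The ages and budgets are made up for illustration.

```python
import numpy as np

ages = np.array([34.2, 30.2, 22.3])
budgets = np.array([10.0, 8.5, 12.0])

# -1 lets numpy work out the number of rows; 1 fixes a single column.
X_one_feature = ages.reshape(-1, 1)
print(X_one_feature.shape)   # (3, 1)

# One row per person, one column per feature.
X_two_features = np.column_stack([ages, budgets])
print(X_two_features.shape)  # (3, 2)
print(X_two_features[0])     # the first complete observation
```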

## How do I select multiple elements?

Exercise:

1) Display M (so you have it for reference)

2) Explore the difference between M\[0:2,0\] and M\[:2,0\]

3) What does M\[:,0\] do?

4) What does M\[(0,2),0\] do?

5) Select the first and third rows (all columns).

[[1000   12    1]
 [2000    9   -1]
 [3000    6   -1]]


(array([1000, 2000]), array([1000, 2000]))

array([1000, 2000, 3000])

array([[1000,   12,    1],
       [3000,    6,   -1]])

Remember:  `label[index]`  <- always means FIND `index` in `label`

Remember: `[data,]` <- always means `list`
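The selections from this exercise can be sketched as below, with `M` rebuilt so that the block runs on its own.

```python
import numpy as np

M = np.array([
    [1000, 12, +1],
    [2000, 9, -1],
    [3000, 6, -1],
])

# A missing start defaults to 0, so these two are the same selection.
a = M[0:2, 0]
b = M[:2, 0]

all_rows = M[:, 0]            # a bare colon means every row
picked = M[(0, 2), 0]         # rows 0 and 2 of column 0
first_and_third = M[[0, 2], :]  # rows 0 and 2, all columns

print(a, b)
print(all_rows)
print(picked)
print(first_and_third)
```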


## How do I select elements by a condition?

Comparisons are also *vectorized*, meaning, they run across every element:

Exercise:

1) Display x_age. Display the results from x_age < 30. Compare them to check it worked.

2) Use np.where(x_age < 30) and check that the results are the locations of the ages you have that are less than 30.

3) Use the code from the previous question to select all the ages where the condition is met. Hint: put np.where(x_age < 30) in the square brackets as the address.

4) Compare your previous answer to using x_age\[ x_age < 30 \].

Extension:

Use a for loop to create a list of the ages that are less than 30. Use %%timeit to compare how long this takes compared with your answers to question 3 and 4.



array([False, False,  True, False, False,  True,  True, False, False,
       False])

`np.where` tells you the index of the `True` values... 

(array([2, 5, 6], dtype=int64),)
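The sketch below puts the mask, `np.where`, and a plain loop side by side. The ages are rounded, made-up values chosen so that the matches fall at positions 2, 5 and 6.

```python
import numpy as np

x_age = np.array([34.2, 30.2, 22.3, 30.3, 36.8, 28.6, 21.1, 31.6, 30.4, 31.7])

mask = x_age < 30            # one True/False per element
positions = np.where(mask)   # a tuple holding the indexes of the True values

by_position = x_age[positions]  # select using the indexes
by_mask = x_age[mask]           # select using the mask directly

# The loop version from the extension, for comparison.
by_loop = [age for age in x_age if age < 30]

print(positions)
print(by_position, by_mask, by_loop)
```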

## How do I combine conditions?

Recall, in python:

In [103]:
age = 18
email = "michael.burgess@qa.com"

(age <= 20) and ("@" in email)

True

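The problem with using `and` (`or`, `not`, etc.) with numpy is that they only work on *single* True/False values. Applied to an array with more than one element, they raise a `ValueError` because the truth value of a whole array is ambiguous.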

Exercise:

Run the below section of code to store the temp and hours arrays.

In [108]:
temp = np.array([19, 21, 23]) # eg., temp of a room 
hours = np.array([0, 0.5, 1]) # eg., duration of heating

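To combine comparisons across an array we must use *vectorized* operators (i.e., ones which work with arrays). Wrap each comparison in parentheses, because these operators bind more tightly than `<`, `>` and `==`.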

* `&` and 
* `|` or
* `~` not
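A short sketch of the difference, using the `temp` and `hours` arrays from the cell above. The `try`/`except` is only there to show the error without stopping the block.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

# `and` needs a single True/False on each side, so arrays make it fail.
and_failed = False
try:
    (temp > 20) and (hours < 0.75)
except ValueError as err:
    and_failed = True
    print("ValueError:", err)

# `&` compares element by element; note the parentheses around each test.
both = (temp > 20) & (hours < 0.75)
print(both)
```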

Exercise:

Display True or False for each row:

1) Temperature over 20 and hours less than 0.75

2) Temperature over 20 or hours 1 or more

3) Not the answer to question 1.

Extension

Is there an alternative to question 3 that gives the same results but without using any nots?
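One possible approach to the `or` and `not` questions is sketched below. The version without a not relies on De Morgan's law: negating "A and B" gives "not A or not B", and each negated comparison can be written as its opposite comparison.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

either = (temp > 20) | (hours >= 1)

# Question 3: negate the combined condition from question 1.
negated = ~((temp > 20) & (hours < 0.75))

# Extension: flip each comparison and swap & for |.
without_not = (temp <= 20) | (hours >= 0.75)

print(either)
print(negated)
print(without_not)
```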

## How do you simulate real-valued data?

10 random values, whose mean will be approximately $30$, and which will vary from 30, on average, by $5$...

In [166]:
np.random.normal(30, 5, 10)

array([19.42974668, 45.13651107, 20.74212228, 34.93130155, 29.081382  ,
       41.01384116, 30.09996855, 15.98946629, 31.61639624, 32.35979923])
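The sketch below shows why the mean is only approximately 30: a sample of 10 can drift noticeably, while a much larger sample settles close to the requested mean and spread. It uses `np.random.default_rng` with a fixed seed so that the run is repeatable; both the seed and the larger sample size are added assumptions, not part of the cell above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded generator (an assumption)

small = rng.normal(30, 5, 10)        # mean, standard deviation, count
large = rng.normal(30, 5, 100_000)

print("10 values:     ", small.mean(), small.std())
print("100,000 values:", large.mean(), large.std())
```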

## How do you simulate categorical data?

Categorical data is represented as *labels* (eg., die faces, cards, answers to questions, locations)....

Exercise:

Run the below sections of code to save x_like_film and calculate the probability of 'No'.

In [143]:
x_like_film = np.random.choice(["YES", "NO"], 10)
x_like_film

array(['YES', 'YES', 'YES', 'YES', 'NO', 'YES', 'YES', 'YES', 'NO', 'YES'],
      dtype='<U3')

In numpy, `random.choice` is the easiest way to simulate a categorical variable (eg., `x_like_film`). 

A categorical variable *IS NOT* numerical in the ordinary sense, so if we wish to compute statistics on it, we typically convert it to a frequency distribution (i.e., we count the entries). 

In [147]:
categories, counts = np.unique(x_like_film, return_counts=True)

counts

array([2, 8])

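The rate of "NO" (i.e., $P(x=\text{NO})$) is the count of "NO" divided by the total count. `np.unique` returns the categories in sorted order, so "NO" is at index 0:

In [163]:
counts[0] / sum(counts)

0.2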
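As a sketch, the whole frequency table can be computed in one go. The array below copies the ten answers printed earlier in this section so that the result is repeatable, and the mask version on the last lines is an alternative way to reach the same rate.

```python
import numpy as np

x_like_film = np.array(
    ["YES", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "NO", "YES"]
)

# Categories come back sorted, with one count per category.
categories, counts = np.unique(x_like_film, return_counts=True)
rates = counts / counts.sum()
table = dict(zip(categories.tolist(), rates.tolist()))
print(table)

# The mean of a True/False mask is the fraction of True values.
p_no = (x_like_film == "NO").mean()
print(p_no)
```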

## Exercise (25 min)

You are hired by a cinema to make film recommendations to customers as they speak to your front desk staff.

Your staff may observe: their age, budget, like_action, like_comedy. 

Note, $x : (age, budget, action, comedy) = (18, 10, +1, -1)$

Let's simulate some data:

$x_{age} \sim N(\mu=35, \sigma=15) $

In [151]:
x_age = np.random.normal(35, 15, 25) # normal = 35 +- 15
x_budget = np.random.normal(10, 1.50, 25) # normal = 10 +- 1.50
x_action = np.random.choice([1, 0], 25) # Bernoulli = coin flip
x_comedy = np.random.choice([1, 0], 25) # Bernoulli = coin flip

### Q1. Import and Compute
* import the numpy library
    * recall, use `np`
    
* you are given a regression formula and a classification formula
* use numpy to compute the $y$ predictions for each person
* a formula to compute likely spend on concessions (food counter)
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = 0.1x_{age} + 0.1x_{budget} + x_{action} - x_{comedy}$

* what is the expected (ie., average) spend for these customers?
    * HINT: `.mean()`
    
#### EXTRA
* a formula to compute whether they will like the blockbuster currently showing
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = (x_{age} < 18) \text{ or } (x_{budget} > 10) \text{ and } x_{action}$
* HINT:
    * `(x_age < 18) | (x_budget > 10) & (x_action == 1)`
    * `&` is evaluated before `|`, just as `and` is evaluated before `or`
    
* what is $P(y=LikeFilm)$ ?
    * HINT: `.mean()`
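One possible way to work through Q1 is sketched below; it is not a model answer. It regenerates the data with the same calls as the simulation cell; the seed is an added assumption so that the run is repeatable, and the names `spend` and `likes_film` are illustrative.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for a repeatable run

x_age = np.random.normal(35, 15, 25)
x_budget = np.random.normal(10, 1.50, 25)
x_action = np.random.choice([1, 0], 25)
x_comedy = np.random.choice([1, 0], 25)

# Regression formula: one predicted spend per customer.
spend = 0.1 * x_age + 0.1 * x_budget + x_action - x_comedy
print("expected spend:", spend.mean())

# Classification formula: the extra parentheses make the grouping explicit.
likes_film = (x_age < 18) | ((x_budget > 10) & (x_action == 1))
print("P(y=LikeFilm):", likes_film.mean())
```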

### Q2. Select Elements & Describe

* Produce a report of the simulated data
    * show `.mean()` of all x
    * show `.std()` of all x
    * `.min()`, `.max()`
    
* Show sample observations
    * first, last
    * first two, last two
    * extra: the median
    
* EXTRA: Show the budget of people who are adults
    * HINT: `x_budget[ x_age ... ]`
    
    * and other conditions of interest...
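A possible starting point for Q2 is sketched below, again with a fixed seed as an added assumption. Treating "adult" as age 18 or over is also an assumption, and the dictionary of features is just one convenient way to loop over them.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for a repeatable run

x_age = np.random.normal(35, 15, 25)
x_budget = np.random.normal(10, 1.50, 25)
x_action = np.random.choice([1, 0], 25)
x_comedy = np.random.choice([1, 0], 25)

features = {"age": x_age, "budget": x_budget,
            "action": x_action, "comedy": x_comedy}

# Summary report: one line per feature.
for name, x in features.items():
    print(name, x.mean(), x.std(), x.min(), x.max())

# Sample observations for one feature.
print(x_age[0], x_age[-1])    # first, last
print(x_age[:2], x_age[-2:])  # first two, last two
print(np.median(x_age))       # the median

# Budgets of adults, taking adult to mean age 18 or over.
adult_budget = x_budget[x_age >= 18]
print(adult_budget)
```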

## Extension

Start to create your own NumPy cheat sheet:

* Using the NumPy user guide: https://numpy.org/doc/stable/numpy-user.pdf
* Read Chapter 3 to start populating your cheat sheet: include Array methods (functions), Matrix methods, and Mathematical functions. Include a short description/summary to help you use them later.
* Full NumPy documentation is available here: https://numpy.org/doc/stable/numpy-ref.pdf