# Numpy basics part 2

## Prerequisites
- Numpy basics part 1

## Learning Objectives
- Perform mathematical operations on numpy arrays
- Use built-in numpy mathematical and statistical functions and methods on one-dimensional arrays
- Perform mathematical operations on two or more numpy arrays of the same shape
- Calculate covariance and correlation of two numpy arrays

## References
- https://wesmckinney.com/book/numpy-basics

In [1]:
# Execute this code block to import numpy
import numpy as np

### Basic mathematical operations with 1-dimensional arrays

In [2]:
# Suppose 10 students take an exam and the scores are stored in the array below
scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5 , 79.72, 80.55])
scores

array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5 , 79.72,
       80.55])

#### Addition and subtraction
- Adding a single number, or _scalar_, to an array adds the value to every element of the array.
    - Suppose the professor wants to add 5 points to every score.
    - In the code block below, create a new array called new_scores = scores + 5
        - _Hint_: scores + 5 returns an array with 5 added to every element, but does not change the values stored in scores. If you want to save the new array, you need to define a new object or redefine the array: scores = ...
    - Subtracting a scalar works the same way. Experiment with addition and subtraction of scalars in the code block below.
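A minimal sketch of the scalar addition and subtraction described above. The name lowered_scores is only an illustrative choice, not part of the exercise.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

# Adding a scalar returns a NEW array; scores itself is left unchanged
new_scores = scores + 5

# Subtracting a scalar works the same way (lowered_scores is a hypothetical name)
lowered_scores = scores - 5

print(new_scores)
print(lowered_scores)
print(scores)  # still the original values
```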


#### Multiplication and division
- Multiplication and division by a scalar work the same way.
- Find each element of scores divided by 100 as scores/100
- Find all scores doubled by entering scores*2
- Experiment with multiplication and division in the code block below
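One way the two expressions above might look in a code cell. The names fractions and doubled are illustrative assumptions.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

# Divide every element by 100 (hypothetical name: fractions)
fractions = scores / 100

# Multiply every element by 2 (hypothetical name: doubled)
doubled = scores * 2

print(fractions)
print(doubled)
```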

#### Raising to a power
- Use '**' to raise to a power.
- Raise every element of the array scores to the power of 2, or square each value, by entering scores**2
    - Find the square root of each element with scores**(1/2). _Note_: the parentheses around (1/2) matter. If you leave them out, Python evaluates scores**1 first and then divides the result by 2. Experiment in the code block below.
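The sketch below compares the expression with and without parentheses, so the difference described in the note can be checked directly. The variable names are illustrative.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

squared = scores**2      # every element squared
roots = scores**(1/2)    # square root of every element
halved = scores**1/2     # no parentheses: (scores**1) / 2, i.e. each score halved

print(np.allclose(roots, np.sqrt(scores)))  # True: same as the square root
print(np.allclose(halved, scores / 2))      # True: this is NOT a square root
```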

#### Multiple operations
- Order of operations applies here too. If you want to add 5 to scores, divide by 100, and then square that value, do the following:
  ((scores + 5)/100)**2
- Practice with order of operations below.
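To see why the parentheses are needed, the following sketch evaluates the same symbols with and without them. The names with_parens and no_parens are assumptions made for illustration.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

# Add 5, then divide by 100, then square
with_parens = ((scores + 5) / 100)**2

# Without parentheses Python applies ** first, then /, then +:
# 100**2 = 10000, 5/10000 = 0.0005, and finally scores + 0.0005
no_parens = scores + 5 / 100**2

print(with_parens)
print(no_parens)
```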

### Numpy mathematical functions

- To use any of numpy's functions, you need to prefix the function name with "np."; for example, np.log(scores) returns the natural log of each element in the array scores. Entering log(scores) on its own raises an error, because no function named log has been defined.
- Here are some built-in mathematical functions you might use in this class:
  - np.exp(_array name_) returns an array with the exponential constant 'e' (2.718...) raised to the value of each element of _array name_
  - np.sqrt(_array name_) returns an array with the square root of each element of _array name_
  - np.square(_array name_) returns an array with each element of _array name_ squared
  - np.abs(_array name_) returns an array with the absolute value of each element of _array name_
  - np.log(_array name_), np.log10(_array name_), np.log2(_array name_), and np.log1p(_array name_) return the natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively, of each element of _array name_
- Experiment with built-in functions below.
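A short sketch of how these functions relate to one another. The comparisons use np.allclose because floating-point results may differ in the last few digits. The name natural_log is illustrative.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

natural_log = np.log(scores)  # natural log of each element

# exp undoes log, and sqrt undoes square (for non-negative values)
print(np.allclose(np.exp(natural_log), scores))
print(np.allclose(np.sqrt(np.square(scores)), scores))

# log1p(x) is log(1 + x)
print(np.allclose(np.log1p(scores), np.log(1 + scores)))

# Absolute distance of each score from 80
print(np.abs(scores - 80))
```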

### Numpy statistical functions and methods 
- Often, it is possible to use either a function or a method for the same task.
- The "function" np.sum(scores) and the "method" scores.sum() both return the sum of the values in the array. Neither is wrong.  You may see both when reading and writing code, so we include both below.
 
_arrayname_.max() or np.max(_arrayname_) returns the maximum value of the elements of the array, $\max \{x \}$, where $x$ represents _arrayname_

_arrayname_.min() or np.min(_arrayname_) returns the minimum value of the elements of the array, $\min \{x \}$

_arrayname_.sum() or np.sum(_arrayname_) returns the sum of all values in the array,  $\sum_i x_i $

_arrayname_.mean() or np.mean(_arrayname_)  returns the mean of the values in the array,  $\bar{x} = \frac{\sum_i x_i}{N}$

_arrayname_.var() or np.var(_arrayname_) returns the _population variance_ of the values in the array, $\sigma^2 = \frac{\sum_i(x_i - \bar{x})^2}{N}$

_arrayname_.var(ddof=1) or np.var(_arrayname_, ddof=1) returns the _sample variance_ $s^2 = \frac{\sum_i(x_i - \bar{x})^2}{N - 1}$

_arrayname_.std() or np.std(_arrayname_) returns the _population standard deviation_ of the values in the array, $\sqrt{\sigma^2}$ 

_arrayname_.std(ddof=1) or np.std(_arrayname_, ddof=1) returns the _sample standard deviation_ $\sqrt{s^2}.$   

_Note:_ "ddof" in the .var() and .std() methods stands for "delta degrees of freedom": the divisor used in the calculation is $N - \text{ddof}$. Both np.var() and np.std() default to ddof=0, which gives the population variance and standard deviation. Use ddof=1 when calculating sample statistics or as directed by your instructor.

Experiment with mathematical and statistical functions and methods in the codeblock below
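The sketch below computes the variance by hand from the formulas above and compares it with numpy's results for both ddof settings. The names n, mean, pop_var, and sample_var are illustrative.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])

# Function and method forms give the same result
print(scores.max(), np.max(scores))
print(scores.min(), np.min(scores))

n = scores.size
mean = scores.sum() / n

# Variance from the formulas: divide by N (population) or N - 1 (sample)
pop_var = np.sum((scores - mean)**2) / n
sample_var = np.sum((scores - mean)**2) / (n - 1)

print(np.isclose(pop_var, scores.var()))                         # ddof=0 is the default
print(np.isclose(sample_var, scores.var(ddof=1)))
print(np.isclose(np.sqrt(sample_var), np.std(scores, ddof=1)))
```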

#### Mathematical operations with multiple arrays
- You can add, subtract, multiply, divide, and raise to a power two one-dimensional arrays with the same number of elements.
  
- _arrayname1_ + _arrayname2_ will add, element-by-element, the values of _arrayname1_ and _arrayname2_ as long as both arrays are of the same shape
   - +, -, *, /, and ** all work element-by-element on same-shape arrays
- run the codeblock below to create a second array

In [3]:
# Suppose the same 10 students take a second exam and the scores are stored in the array scores2
# Also suppose that only five students choose to take the final exam and the scores are stored in scores_final
# Run this code block to create the arrays scores2 and scores_final
scores2 = np.array([83.15, 88.43, 58.45, 84.22, 98.10, 38.21, 62.45, 81.72, 79.55, 84.38])
scores_final = np.array([75.12, 91.34, 100, 65.14, 55.88])

- Try adding and subtracting the arrays scores and scores2 in the code block below.
- Though it does not make sense for exam scores to be multiplied, divided, or raised to a power, try that as well.
  - Try some operations with scores and scores_final and confirm that you see an error because the two arrays have different shapes.
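A minimal sketch of element-by-element operations on the exam arrays. The try/except block is one possible way to display the shape error without stopping the cell. The names total and difference are illustrative.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])
scores2 = np.array([83.15, 88.43, 58.45, 84.22, 98.10, 38.21, 62.45, 81.72, 79.55, 84.38])
scores_final = np.array([75.12, 91.34, 100, 65.14, 55.88])

# Same shape (10 elements each): operations pair up elements by position
total = scores + scores2
difference = scores2 - scores
print(total)
print(difference)

# Different shapes (10 elements vs. 5): numpy raises a ValueError
try:
    scores + scores_final
except ValueError as err:
    print("Cannot add arrays of different shapes:", err)
```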

#### Statistical operations with two arrays
##### Covariance
The _sample covariance_ of two variables $x$ and $y$ is $s_{x,y} = \sum_i\frac{(x_i - \bar{x})(y_i - \bar{y})}{N-1}$


The function np.cov(_arrayname1_, _arrayname2_) returns the _sample_ covariance matrix of variables $x$ and $y$


<img src="sample_covariance.png" alt="Drawing" style="width: 300px;"/> 
$s^2_x$ is the sample variance of $x$ and $s^2_y$ is the sample variance of $y$. 

$s_{x,y}$ is the sample covariance of $x$ and $y$.  Note $s_{x,y} = s_{y,x}$ 

The _population covariance_ of two variables $x$ and $y$ is  $\sigma_{x,y} = \sum_i\frac{(x_i - \bar{x})(y_i - \bar{y})}{N}$

The function np.cov(_arrayname1_, _arrayname2_, ddof=0) returns the _population_ covariance matrix

<img src="population_covariance.png" alt="Drawing" style="width: 350px;"/> 

$\sigma^2_x$ is the population variance of $x$ and $\sigma^2_y$ is the population variance of $y$. 

$\sigma_{x,y}$ is the population covariance of $x$ and $y$. Also note $\sigma_{x,y} = \sigma_{y,x}$ 

Note: The function np.cov() defaults to ddof=1, unlike the function np.var(), which defaults to ddof=0. If you want the covariance matrix for a population, you must specify ddof=0.

- In the codeblock below, experiment with creating the population and sample covariance matrices for scores and scores2.

   - Confirm that the element in the first row and first column of np.cov(scores, scores2) is equal to the sample variance of scores, np.var(scores, ddof=1)
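The following sketch builds both covariance matrices and checks the off-diagonal entry against the sample covariance formula computed by hand. The names sample_cov, pop_cov, and manual_cov are assumptions for illustration.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])
scores2 = np.array([83.15, 88.43, 58.45, 84.22, 98.10, 38.21, 62.45, 81.72, 79.55, 84.38])

sample_cov = np.cov(scores, scores2)       # divides by N - 1 by default
pop_cov = np.cov(scores, scores2, ddof=0)  # divides by N

# Sample covariance from the formula
n = scores.size
manual_cov = np.sum((scores - scores.mean()) * (scores2 - scores2.mean())) / (n - 1)

print(sample_cov)
print(pop_cov)
print(np.isclose(sample_cov[0, 1], manual_cov))
print(np.isclose(sample_cov[0, 0], np.var(scores, ddof=1)))
```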


##### Recall selecting arrays
- np.cov(_arrayname1_, _arrayname2_) returns an array with two rows and two columns.  
- The sample covariance is found in the first row and second column _or_ the second row and first column.
- In the codeblock below, create a new array called scov = np.cov(scores, scores2).
  - The sample covariance of scores and scores2 can be returned with scov[0,1] or scov[1,0]
  - The sample variance of scores is returned with scov[0,0]
  - The sample variance of scores2 is returned with scov[1,1]
- Experiment with selecting elements of covariance arrays in the code block below.
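- Verify that scov[0,0] is equal to np.var(scores, ddof=1) and scov[1,1] is equal to np.var(scores2, ddof=1). The diagonal elements are variances, so their square roots equal np.std(scores, ddof=1) and np.std(scores2, ddof=1).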
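A sketch of selecting individual elements from the covariance array. The diagonal entries are variances, so taking their square roots gives the sample standard deviations. The names cov_xy, var_x, and var_y are illustrative.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])
scores2 = np.array([83.15, 88.43, 58.45, 84.22, 98.10, 38.21, 62.45, 81.72, 79.55, 84.38])

scov = np.cov(scores, scores2)

cov_xy = scov[0, 1]  # sample covariance (same value as scov[1, 0])
var_x = scov[0, 0]   # sample variance of scores
var_y = scov[1, 1]   # sample variance of scores2

print(np.isclose(scov[0, 1], scov[1, 0]))
print(np.isclose(var_x, np.var(scores, ddof=1)))
print(np.isclose(np.sqrt(var_y), np.std(scores2, ddof=1)))
```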

##### Correlation coefficients

The sample Pearson correlation coefficient between variables $x$ and $y$ is 

$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$, where $s_{x,y}$ is the sample covariance, $s_x$ is the sample standard deviation of $x$, and $s_y$ is the sample standard deviation of $y$.

np.corrcoef(_arrayname1_, _arrayname2_) returns the sample correlation _matrix_ of the two arrays. 

<img src="sample_correlation_matrix.png" alt="Drawing" style="width: 300px;"/> 

- $r_{xy}$ is the correlation coefficient of x and y.  $r_{xy}$ = $r_{yx}$ and $r_{xx}$ = $r_{yy}$ = 1.  This means we care only about $r_{xy}$ or $r_{yx}$
- Use the code block below to find the correlation coefficient of scores and scores2.  Notice that you cannot calculate the correlation coefficient for scores and scores_final because the arrays are not the same size.
- practice selecting elements of the two-by-two array

  
_Note_: Sometimes you see correlation written as $\rho_{x,y}$; this is the population correlation coefficient. The degrees of freedom appear in both the numerator and the denominator of the correlation coefficient formula and cancel out, so we do not need to worry about the ddof argument in the np.corrcoef() function.
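The sketch below computes the correlation coefficient three ways: with np.corrcoef(), from the sample formula, and from the population formula. It illustrates that the choice of ddof does not change the result. The names r, sample_r, and pop_r are illustrative, and the try/except block is just one way to show the size-mismatch error.

```python
import numpy as np

scores = np.array([85.72, 71.22, 64.48, 86.31, 92.11, 47.82, 63.37, 78.5, 79.72, 80.55])
scores2 = np.array([83.15, 88.43, 58.45, 84.22, 98.10, 38.21, 62.45, 81.72, 79.55, 84.38])
scores_final = np.array([75.12, 91.34, 100, 65.14, 55.88])

corr = np.corrcoef(scores, scores2)
r = corr[0, 1]  # same value as corr[1, 0]

# Sample formula: sample covariance over the product of sample standard deviations
sample_r = np.cov(scores, scores2)[0, 1] / (np.std(scores, ddof=1) * np.std(scores2, ddof=1))

# Population formula: the divisors cancel, so the result is the same
pop_r = np.cov(scores, scores2, ddof=0)[0, 1] / (np.std(scores) * np.std(scores2))

print(corr)
print(np.isclose(r, sample_r), np.isclose(r, pop_r))

# Arrays of different sizes cannot be correlated
try:
    np.corrcoef(scores, scores_final)
except ValueError as err:
    print("Cannot compute correlation:", err)
```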



### Up next: working with multi-dimensional arrays