
#<center> PPP-4012: Python coding notebooks.
#<center> Introduction to NumPy arrays for data analysis.



---



##Aims:
The aim of this notebook is to illustrate the advantages of numerical arrays (NumPy arrays) for data analysis. In particular, we illustrate (i) adding, subtracting and multiplying arrays with each other, and (ii) the "element-wise" application of an operation (or function) to an array, where the operation is applied to every element of the array. These features reduce the need for explicit for loops in code.


---



## Arrays vs lists

We'll first look at some differences between how lists and arrays behave. Create some data in a list, and then make a copy as an array.

In [None]:
import numpy as np #Arrays belong to  numpy

#first make a list of data
l1 = [20, 30, 35, 40, 50, 45, 38, 25, 10]

#copy it to a numpy array
a1 = np.array(l1)

#show the two
print("List =", l1)
print("Array =",a1)

Apart from the commas in the list, they look pretty much identical. But look at what happens when we try to multiply the data by 2:

In [None]:
#multiply the list by 2
print("List =", l1)
print('Multiply the list by 2')
print(l1 * 2)

print("\nArray =",a1)
print('Multiply the array by 2')
#multiply the array by 2
print(a1 * 2)

As you can see, the results are very different. Multiplying the list by 2 has repeated its contents, giving a list twice as long. In contrast, multiplying the array has **multiplied every element in it by 2**.
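To get the doubling behaviour from a plain list, each element has to be visited in turn. The sketch below compares the two approaches; the names `values`, `repeated`, `doubled_list` and `doubled_array` are illustrative and not part of the notebook.

```python
import numpy as np

values = [20, 30, 35, 40, 50, 45, 38, 25, 10]  # same numbers as l1 above

# list * 2 repeats the list, giving a list twice as long
repeated = values * 2

# element-wise doubling of a list needs an explicit loop (here a comprehension)
doubled_list = [v * 2 for v in values]

# the array version needs no loop
doubled_array = np.array(values) * 2

print('Length before and after list * 2:', len(values), len(repeated))
print('Doubled with a loop:', doubled_list)
print('Doubled as an array:', doubled_array)
```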

The way the array behaves is much more useful for modelling and data analysis, and is not limited to multiplication. We can similarly add a value to, or subtract a value from, all the elements in an array. For instance, to "center" data, the mean of the data is subtracted from every datapoint. Here is an example (see also **Example 1** (z-scores) below):

In [None]:
data = np.array([20, 30, 34, 40, 50, 45, 39, 42, 31])  #create some data
center = data - np.mean(data)  #center the data by subtracting the mean from all elements

print('Raw data =', data)
print('Mean of the data =', np.mean(data))
print('Centered data =', center)




Let's try another example, this time adding two lists or arrays together.

In [None]:

#make two new lists of equal length
l2 = [10, 20, 30 , 40]
l3 = [15, 25, 35, 45]

#make arrays from the lists
a2, a3 = np.array(l2), np.array(l3)

print('Add the two lists')
print(l2 + l3)

print('Add the two arrays')
print(a2 + a3)

Again, the difference in the results is clear. The data in the lists has been "joined together", whereas for the arrays corresponding data points have been added together (that is, the first element of a2 has been added to the first element of a3 and so on).
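For comparison, here is a minimal sketch of what element-wise addition looks like with plain lists, and of what happens when two arrays have different lengths. The names `list_sum` and `array_sum` are illustrative.

```python
import numpy as np

l2 = [10, 20, 30, 40]
l3 = [15, 25, 35, 45]

# element-wise addition of two lists needs zip and a loop
list_sum = [x + y for x, y in zip(l2, l3)]

# the same result from arrays, with no loop
array_sum = np.array(l2) + np.array(l3)

print('Lists, with a loop:', list_sum)
print('Arrays:', array_sum)

# arrays of different lengths cannot be added element by element
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as err:
    print('ValueError:', err)
```

This is why the arrays being combined need to be the same length.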

#Advantages of arrays

We can do many more useful things with arrays, things that in most cases will simply produce an error if we try to do the same with lists. For instance, instead of multiplying the array by a number, let's try dividing it.

In [None]:
print('Dividing arrays by 2')
print('\na2 =', a2)
print('a2/2 =', a2/2)
print('\na3 =', a3)
print('a3/2 =', a3/2)

Trying to do this with a list will simply produce an error. Try it yourself!

In [None]:
#try to divide any of the lists l1, l2 etc by a number
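One possible way to see the error without stopping the notebook is to wrap the division in a try/except block. This is only a sketch; `scores`, `result` and `halved` are illustrative names.

```python
import numpy as np

scores = [10, 20, 30, 40]  # an illustrative list

# dividing a list by a number is not defined, so Python raises a TypeError
try:
    result = scores / 2
except TypeError as err:
    result = None
    print('TypeError:', err)

# converting to an array first makes the division work element by element
halved = np.array(scores) / 2
print('Array divided by 2:', halved)
```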

Similarly, arrays can not only be added element-wise as above, they can also be subtracted, multiplied or divided, using the usual arithmetic signs. For example, subtraction:

In [None]:
#'Subtract one array from another'
print('Subtract a2 from a3')
print('a3 - a2 =', a3 - a2)


Try it yourself! In the following code window, try to

(i) multiply a2 by a3 (using \*)

(ii) divide a2 by a3 (using /)

In [None]:
#(i) Multiply a2 by a3

#(ii) divide a2 by a3

# Elementwise application of operations to arrays

Operations (functions) such as squaring, x\*\*2, or taking the square root, np.sqrt(x), apply to a single number, or "input variable", $x$, and produce a single number as output.
Suppose, however, we want to square all the numbers in a set of data. With a list, we would need to use a *for* loop, applying the squaring operation to every number in turn.

With NumPy, operations such as squaring can instead be applied to an *array*, with the result that the operation **acts on every element in the array**, producing a new array as output.

Here are a few examples:

In [None]:
# Elementwise application of functions
print('a2 = ', a2)

print('\n 1. Square every element of a2: a2 ** 2')
print(a2 ** 2)  # a2 ** 2

print('\n 2. Take the natural log of every element of a2: np.log(a2)')
print(np.log(a2))  # np.log is the natural logarithm (base e)

print('\n 3. Reciprocal of every element of a2: 1/a2')
print(1/a2)  # 1/a2
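For comparison, the loop-based version mentioned in the text might look like the sketch below, next to the array version. The names `data_list`, `squares_loop` and `squares_vec` are illustrative.

```python
import numpy as np

data_list = [10, 20, 30, 40]  # same numbers as a2

# loop version: visit each element, square it, collect the results
squares_loop = []
for x in data_list:
    squares_loop.append(x ** 2)

# array version: one expression, no loop
squares_vec = np.array(data_list) ** 2

print('With a for loop:', squares_loop)
print('With an array:  ', squares_vec)
```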

Try it yourself:

Using these methods, create an array of data of your own and apply the following operations to all the elements in the array (no for-loops allowed!):

(i) Get the cube (raise to the power 3) of every element in your data.

(ii) Get the square root of every element.

(iii) (Harder) Get 1/(x\*\*2) of every element x in the data (i.e. the reciprocal of the square; this combines two operations).

In [None]:
#(i) Cube

#(ii) Square root

#(iii) 1/(x**2)



##Example 1: Standardising data (the z-score)

Standardised, or z-score, data is created by subtracting the mean of the data from all the data points (called centering), and then dividing the centered data by the standard deviation. Whatever form the original data took, after conversion to the z-score it will have a mean of 0 and a standard deviation of 1. Values that were originally below the mean will be negative, those above will be positive, and the values will be in units of the standard deviation. For instance, a z-score of 0.5 means that the original data point was 0.5 standard deviations above the mean.

Below, we show how easily this can be computed using "vectorised" operations as described above:

In [None]:
import matplotlib.pyplot as plt  #for plotting results

#Raw data is a set of MCQ scores
MCQdata = np.array([59, 70, 75, 25, 50, 50, 80, 68, 64, 61, 57, 95.45, 55, 50, 50, 73, 34.09,
59, 82, 93, 65.91, 73, 77, 50, 48, 39, 57, 34, 27, 77, 63.64, 39, 57, 95, 77.27])

#use NumPy to get the mean and standard deviation of the MCQ data
MCQmean = np.mean(MCQdata)
MCQstd = np.std(MCQdata, ddof = 1) #ddof = 1 gives the sample standard deviation

#center the data, i.e., subtract the mean from every data point
centered = MCQdata - MCQmean

#z-score: divide centered data by the standard deviation
zed = centered/MCQstd

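#Define a function to return the z-scored data
#Note: np.std defaults to ddof = 0 (population SD), so the values differ
#slightly from zed above, which uses the sample SD (ddof = 1)
def zscore(data):
  """ z-transform of data in one line """
  return (data - np.mean(data))/np.std(data)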

#------------ THE END: Rest is plotting ---------

#plot the z-scored data
plt.bar(range(1, len(zed)+1), zed) #bar plot
plt.xlabel('Subject Number') #label x-axis
plt.ylabel('z-score ')       #label y-axis
plt.title('MCQ Z-SCORES')     #title for the plot
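As a quick check of the properties described above, the sketch below z-scores a small sample and confirms that the result has a mean of (essentially) 0 and a standard deviation of 1. It uses the first eight MCQ scores and the sample standard deviation (ddof = 1); the names `sample` and `z` are illustrative.

```python
import numpy as np

sample = np.array([59, 70, 75, 25, 50, 50, 80, 68])  # first eight MCQ scores

# center the data, then divide by the sample standard deviation
z = (sample - np.mean(sample)) / np.std(sample, ddof=1)

# the mean should be 0 (up to rounding error) and the SD should be 1
print('Mean of z-scores =', np.mean(z))
print('Sample SD of z-scores =', np.std(z, ddof=1))
```

The SD comes out as 1 only if the same ddof is used both to compute the z-scores and to check them.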

##Example 2: Correlation (introducing the dot product)

https://realpython.com/numpy-scipy-pandas-correlation-python/

The Pearson correlation (or $r$ value) measures the strength of the linear relationship between two variables, and has a value between $-1$ (perfect negative correlation) and $1$ (perfect positive correlation).
An $r$ value at or around $0$ means there is no linear relationship between the variables.

The $r$-value is related to the z-score, as $r$ can be computed via the "dot product" of the two sets of z-transformed data. The dot product of two (same size) data sets $V = (v_0, v_1, ..., v_{n-1})$, $W = (w_0, w_1, ..., w_{n-1})$ is the **sum of the products of the corresponding elements** of the two arrays. In maths, it is written with a dot like this: $V \cdot W$. So the definition is

$$
V \cdot W = \sum_{i=0}^{n-1} v_i w_i
$$
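The definition can be checked on a small example. The sketch below computes the dot product of two illustrative arrays in three ways: with an explicit loop, by summing the element-wise product, and with NumPy's `@` operator and `np.dot` function. The names `dot_loop`, `dot_sum`, `dot_at` and `dot_np` are illustrative.

```python
import numpy as np

V = np.array([1, 2, 3])
W = np.array([4, 5, 6])

# loop version: multiply corresponding elements and add them up
dot_loop = 0
for v, w in zip(V.tolist(), W.tolist()):
    dot_loop += v * w

# array versions
dot_sum = np.sum(V * W)  # element-wise product, then sum
dot_at = V @ W           # the @ operator
dot_np = np.dot(V, W)    # the np.dot function

# 1*4 + 2*5 + 3*6 = 32
print(dot_loop, dot_sum, dot_at, dot_np)
```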

Using this operation with the z-score function given above (denoted $z$), the correlation $r$ between two sets of data $D_1, D_2$ can be computed as

$$
r(D_1, D_2) = \frac{z(D_1) \cdot z(D_2)}{n}
$$

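($n$ is the number of data points. Dividing by $n$ gives the correct $r$ when the z-scores are computed with the population standard deviation, which is what the zscore function above does, since np.std uses ddof = 0 by default.)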

NumPy uses the @ symbol, i.e. V @ W, for the dot product of two one-dimensional arrays. So we can define a one-line function corr to compute the $r$-value:

https://media.geeksforgeeks.org/wp-content/uploads/20190413155221/dotproduct.png

In [None]:
def corr(data1, data2):
  """ the Pearson correlation (r value) of two sets of data,
   using dot product of the z-scores """
  return (zscore(data1) @ zscore(data2)) / len(data1)

#Some example data (you can change this to see the effect).
#Make sure both arrays have the same length, otherwise you will get an error.
D1 = np.array([2,4,6,8,10])
D2 = np.array([25, 26, 30, 35, 32])

#compute the correlation
print('r =', corr(D1,D2))

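#check the result: np.corrcoef returns a 2x2 correlation matrix;
#the off-diagonal entries are the r-value for D1 and D2
print('Checking result, using the NumPy corrcoef function')
print(np.corrcoef(D1,D2))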
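Because np.corrcoef returns a matrix rather than a single number, it can be convenient to pull out just the $r$-value. The sketch below does this by indexing, and also compares it with scipy.stats.pearsonr, which returns $r$ together with a p-value. The names `r_matrix`, `r_scipy` and `p_scipy` are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

D1 = np.array([2, 4, 6, 8, 10])
D2 = np.array([25, 26, 30, 35, 32])

# row 0, column 1 of the correlation matrix is r for D1 and D2
r_matrix = np.corrcoef(D1, D2)[0, 1]

# pearsonr returns the r-value and the p-value
r_scipy, p_scipy = pearsonr(D1, D2)

print('r from np.corrcoef =', r_matrix)
print('r from pearsonr    =', r_scipy)
```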

##Example 3: Paired-samples t-value

https://www.statology.org/paired-samples-t-test/

In a paired-samples design all subjects are tested in both of two conditions, allowing each subject to be compared directly with themselves. Suppose the data from the two conditions are in arrays $C_1, C_2 $, such that the corresponding data points in each array are *from the same subject*. For the t-test, we get the difference between the two conditions for each subject by the elementwise subtraction of one array from the other,
$$
D = C_1 - C_2
$$

where $D$ stands for difference. The formula for the t-value is then simply

$$
t = \frac{\mathrm{mean}(D)}{SE(D)}
$$

where $SE$ is the standard error, that is, the standard deviation ($SD$) divided by the square root of $n$; i.e.,

$$
SE(data) = \frac{SD(data)}{\sqrt n}
$$


Let's create a function SE for the standard error:


In [None]:
def SE(data):
  """ Compute the (sample) standard error of a data set """
  return np.std(data, ddof = 1)/np.sqrt(len(data)) #the input "ddof = 1" is to
                                                   #get the sample SD
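As a cross-check, scipy.stats provides a sem function (standard error of the mean), which also uses ddof = 1 by default. The sketch below compares it with the formula above on a few made-up numbers; `sample`, `se_manual` and `se_scipy` are illustrative names.

```python
import numpy as np
from scipy.stats import sem

sample = np.array([4, -1, 2, 3, 5])  # made-up difference scores

# standard error from the formula: sample SD divided by sqrt(n)
se_manual = np.std(sample, ddof=1) / np.sqrt(len(sample))

# the same quantity from scipy (ddof = 1 is the default)
se_scipy = sem(sample)

print('SE from the formula =', se_manual)
print('SE from scipy.stats =', se_scipy)
```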

To calculate the t-value using arrays, we can subtract the elements of one array of data from the corresponding elements of another with just the minus sign:

In [None]:
def tvalue(C1, C2):
  """ Calculate the t-value of the difference of the means of
   two conditions C1, C2 """

  #subtract each score in condition2, C2, from the corresponding score in C1
  #using NumPy array subtraction; D is the array of differences
  D = C1 - C2

  #the t-value is the mean difference divided by the standard error of the
  #differences
  return np.mean(D)/SE(D)

To test whether a course improves students' knowledge, the students take a 30-question MCQ at the start and at the end of the course. The code below contains the raw test scores for $n = 20$ students before (MCQ_pre) and after (MCQ_post) the course, and tests the difference between the means:

In [None]:
# MCQ scores before taking the course
MCQ_pre = np.array([18, 21, 16, 22, 19, 24, 17, 21, 23, 18, 14, 16, 16, 19,
                    18, 20, 12, 22, 15, 17])

# MCQ scores after taking the course
MCQ_post = np.array([22, 25, 17, 24, 16, 29, 20, 23, 19, 20, 15, 15, 18, 26,
                     18, 24, 18, 25, 19, 16])

#Results
#1. Means of the two conditions
print('Mean score: pre-test =', np.mean(MCQ_pre))
print('Mean score: post-test =', np.mean(MCQ_post))

#2. the array of difference scores, using array subtraction
diff = MCQ_post - MCQ_pre

#3. the mean difference
print('Mean score difference =', np.mean(diff))

#4. Standard error of the differences
print('SE of differences =', SE(diff))

#5. the t-value
print('t-value of difference =', tvalue(MCQ_post, MCQ_pre))

# ---------------------------------------
#Check result using the paired-samples ttest from the scipy.stats library
print('\n CHECKING THE RESULT WITH scipy.stats: Compare the t-values !')
#import the ttest function
from scipy.stats import ttest_rel as ttest
#scipy ttest on the same data: returns both the t-value (t) and p-value (p)
t, p = ttest(MCQ_post, MCQ_pre)
print('t-value =', t, ': p-value =', p)
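The p-value reported by scipy can also be recovered from the t-value. The sketch below assumes the usual two-sided test with $n - 1$ degrees of freedom and uses the survival function of the t-distribution, scipy.stats.t.sf; the names `t_manual`, `p_manual`, `t_scipy` and `p_scipy` are illustrative.

```python
import numpy as np
from scipy import stats

MCQ_pre = np.array([18, 21, 16, 22, 19, 24, 17, 21, 23, 18, 14, 16, 16, 19,
                    18, 20, 12, 22, 15, 17])
MCQ_post = np.array([22, 25, 17, 24, 16, 29, 20, 23, 19, 20, 15, 15, 18, 26,
                     18, 24, 18, 25, 19, 16])

# difference scores and their t-value, as in the tvalue function above
diff = MCQ_post - MCQ_pre
n = len(diff)
t_manual = np.mean(diff) / (np.std(diff, ddof=1) / np.sqrt(n))

# two-sided p-value: twice the upper-tail probability of |t| with n - 1 df
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# compare with scipy's paired-samples t-test
t_scipy, p_scipy = stats.ttest_rel(MCQ_post, MCQ_pre)

print('By hand: t =', t_manual, ': p =', p_manual)
print('scipy:   t =', t_scipy, ': p =', p_scipy)
```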
