# <span style="color:#d50283">IT Academy - Data Science Itinerary</span>
## Sprint 3 - Practice with numerical programming
### Assignment by: Kat Weissman

#### Python Learning Objectives:
- Arrays
- Matrices
- Functions

*Recommended learning resources:*
- https://www.w3schools.com/python/numpy/numpy_intro.asp

### Exercise 1

Create a function that, given an array of one dimension, gives you a basic statistical summary of the data. If it detects that the array has more than one dimension, it should display an error message.

The summary function defined below displays the same summary statistics provided by the built-in R summary() function for all numeric values of the array:
- minimum (https://numpy.org/doc/stable/reference/generated/numpy.amin.html)
- 1st quartile (https://numpy.org/doc/stable/reference/generated/numpy.quantile.html)
- median (https://numpy.org/doc/stable/reference/generated/numpy.median.html)
- mean (https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
- 3rd quartile (https://numpy.org/doc/stable/reference/generated/numpy.quantile.html)
- maximum (https://numpy.org/doc/stable/reference/generated/numpy.amax.html)

In [50]:
import numpy as np

#Define the function to create summary statistics of a one-dimensional array.
def summary(array):
    """
    Displays summary statistics including minimum, 1st quartile, median,
    mean, 3rd quartile, and maximum.
    
    Arguments:
    array -- a one-dimensional numpy array
    """
    if array.ndim == 1:
        print ("Minimum:",np.amin(array))
        print ("1st quartile:",np.quantile(array,0.25))
        print ("Median:", np.median(array))
        print ("Mean:", np.mean(array))
        print ("3rd quartile:",np.quantile(array,0.75))
        print ("Maximum:", np.amax(array))
    else:
        print ("Error: The argument of this function must be a one-dimensional array.")

In [51]:
#Create an array of one dimension in order to test the function.
arr = np.array([3,5,5,8,9,9,9,12,18,22,24,27,58,59,83,84,85,88])
print (arr)

[ 3  5  5  8  9  9  9 12 18 22 24 27 58 59 83 84 85 88]


In [52]:
#Test the function on a one-dimensional array
summary(arr)

Minimum: 3
1st quartile: 9.0
Median: 20.0
Mean: 33.77777777777778
3rd quartile: 58.75
Maximum: 88


In [53]:
#Reshape the array in order to test the function with more than one dimension.
newarr = arr.reshape(2, 9)
print(newarr)

[[ 3  5  5  8  9  9  9 12 18]
 [22 24 27 58 59 83 84 85 88]]


In [54]:
#Test the function on a two-dimensional array
summary(newarr)

Error: The argument of this function must be a one-dimensional array.
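
Because `summary` prints its results, the values cannot be reused in later calculations. As a minimal sketch, a variant could return the statistics in a dictionary and raise a `ValueError` for arrays that are not one-dimensional. The name `summary_dict` and the dictionary keys are hypothetical and are not part of the original assignment.

```python
import numpy as np

def summary_dict(array):
    """Return summary statistics of a 1-D array as a dictionary (illustrative variant)."""
    array = np.asarray(array)
    if array.ndim != 1:
        # Raise instead of printing, so the caller can handle the problem.
        raise ValueError("The argument must be a one-dimensional array.")
    return {
        "min": np.min(array),
        "q1": np.quantile(array, 0.25),
        "median": np.median(array),
        "mean": np.mean(array),
        "q3": np.quantile(array, 0.75),
        "max": np.max(array),
    }

# Same test data as the one-dimensional array used above.
arr = np.array([3, 5, 5, 8, 9, 9, 9, 12, 18, 22, 24, 27, 58, 59, 83, 84, 85, 88])
stats = summary_dict(arr)
print(stats["median"])  # 20.0
```

Returning a dictionary keeps the printing decision with the caller, at the cost of a slightly less direct display than the original function.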


### Exercise 2

Create a function that generates an NxN square of random numbers between 0 and 100.

Reference: https://www.w3schools.com/python/numpy/numpy_random.asp

In [55]:
from numpy import random

#Define the function to create an NxN square of random numbers between 0 and 100.
def random_square(N):
    """
    Generates and returns a numpy array which is an NxN matrix 
    of random integers from 0 to 99 (randint excludes the upper bound, 100).
    
    Arguments:
    N -- an integer
    """
    square = random.randint(100, size=(N, N))
    return square

In [56]:
#Test the function with the argument 3
my_square = random_square(3)
print (my_square)

[[36 87 61]
 [ 4 51 17]
 [35 38 39]]


In [57]:
#Test the function with the argument 5
my_square2 = random_square(5)
print (my_square2)

[[31 94 81 12 88]
 [45 46 73 96  1]
 [14 84 48 60 99]
 [77 67 73 83 36]
 [61 78 50 51 87]]
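
Note that `random.randint(100, ...)` draws integers from 0 up to, but not including, 100. If the exercise is read as allowing the value 100 itself, one option is NumPy's newer `Generator` interface, which can include the endpoint and accepts a seed for reproducible output. The following is only a sketch under that reading; the name `random_square_inclusive` and the `seed` parameter are hypothetical.

```python
import numpy as np

def random_square_inclusive(N, seed=None):
    """Return an NxN array of random integers from 0 to 100, both inclusive (illustrative)."""
    rng = np.random.default_rng(seed)
    # endpoint=True makes the upper bound (100) a possible value.
    return rng.integers(0, 100, size=(N, N), endpoint=True)

# Using the same seed twice gives the same square.
square = random_square_inclusive(4, seed=0)
print(square.shape)  # (4, 4)
```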


### Exercise 3
Create a function that, given a two-dimensional table, calculates the totals per row and the totals per column.

In [58]:
#Define the function to calculate row and column totals of a two-dimensional table.
def totals(array):
    """
    Generates and returns a tuple of two numpy arrays which are 
    the row and column totals of the two-dimensional table.
    
    Arguments:
    array -- a two-dimensional numpy array.
    """
    if array.ndim == 2:
        row_totals = np.sum(array, axis=1)
        column_totals = np.sum(array, axis=0)
        return (row_totals, column_totals)
    else:
        print("Error: The argument of this function must be a two-dimensional array.")

In [59]:
#Test the function by assigning the results to new variables.
#The first test prints an error message (and returns None) because it is using a one-dimensional array.
totals_arr = totals(arr) 
#The following tests are successful.
totals_newarr = totals(newarr)
totals_my_square = totals(my_square)
totals_my_square2 = totals(my_square2)

Error: The argument of this function must be a two-dimensional array.


In [60]:
#Display the totals by indexing the first and second elements of each result, which is a tuple.
print("The row totals of the array 'newarr' are:", totals_newarr[0])
print("The column totals of the array 'newarr' are:", totals_newarr[1])
print("The row totals of the array 'my_square' are:", totals_my_square[0])
print("The column totals of the array 'my_square' are:", totals_my_square[1])
print("The row totals of the array 'my_square2' are:", totals_my_square2[0])
print("The column totals of the array 'my_square2' are:", totals_my_square2[1])

The row totals of the array 'newarr' are: [ 78 530]
The column totals of the array 'newarr' are: [ 25  29  32  66  68  92  93  97 106]
The row totals of the array 'my_square' are: [184  72 112]
The column totals of the array 'my_square' are: [ 75 176 117]
The row totals of the array 'my_square2' are: [306 261 305 336 327]
The column totals of the array 'my_square2' are: [228 369 325 302 311]
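
A quick way to sanity-check row and column totals is to confirm that each set adds up to the grand total of the table. The sketch below repeats the values of `newarr` so that it runs on its own; the variable `grand_total` is introduced only for illustration.

```python
import numpy as np

# Same values as 'newarr' above.
newarr = np.array([[3, 5, 5, 8, 9, 9, 9, 12, 18],
                   [22, 24, 27, 58, 59, 83, 84, 85, 88]])

row_totals = np.sum(newarr, axis=1)     # one total per row
column_totals = np.sum(newarr, axis=0)  # one total per column
grand_total = np.sum(newarr)            # total of every element

# Both sets of totals must add up to the grand total.
print(row_totals.sum() == grand_total)     # True
print(column_totals.sum() == grand_total)  # True
```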


### Exercise 4
Manually implement a function that calculates the correlation coefficient. Learn about its uses and interpretation.

References: 
- https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
- https://en.wikipedia.org/wiki/Sample_mean_and_covariance
- https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/
- https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
- https://numpy.org/doc/stable/reference/generated/numpy.cov.html
- https://www.w3schools.com/datascience/ds_stat_correlation.asp
- https://www.w3schools.com/datascience/ds_stat_correlation_matrix.asp
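
For reference, the sample Pearson correlation coefficient for paired data $(x_i, y_i)$, $i = 1, \dots, n$, is commonly written as follows, where $\bar{x}$ and $\bar{y}$ denote the sample means. The symbols are the conventional ones and are assumed here rather than taken from the assignment.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator sums the products of the paired deviations from the means, and the denominator rescales that sum so the result lies between -1 and 1.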

In [61]:
#Define the function to calculate the Pearson correlation coefficient for a sample.
def Pearsons_r(X,Y):
    """
    Generates and returns the Pearson's correlation coefficient
    given paired data X and Y of sample size n.
    
    Arguments:
    X -- a one-dimensional numpy array of size n.
    Y -- a one-dimensional numpy array of size n.
    """
    if X.ndim == 1 and Y.ndim == 1:
        numerator = np.sum((X-np.mean(X))*(Y-np.mean(Y)))
        denominator = np.sqrt(np.sum((X-np.mean(X))**2))*np.sqrt(np.sum((Y-np.mean(Y))**2))
        return (numerator/denominator)
    else:
        print("Error: The arguments of this function must be two one-dimensional arrays.")

The following example is taken from the website "Statistics How To", which explains the correlation coefficient.

https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/#hand
- X: an array of ages
- Y: an array of blood glucose levels

In [62]:
#X values represent ages
X = np.array([43,21,25,42,57,59])
#Y values represent blood glucose levels
Y = np.array([99,65,79,75,87,81])

In [63]:
#Test the function using the X and Y arrays defined previously.
Pearsons_r(X,Y)

0.5298089018901744

Since Pearson's correlation coefficient ranges from -1 to 1, a result of 0.5298 indicates a moderate positive correlation: in this sample, blood glucose level tends to increase with age.
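
To illustrate the two ends of that range, the following sketch pairs the ages with made-up values that are exact linear functions of them. The arrays `Y_up` and `Y_down` are hypothetical and serve only as an illustration.

```python
import numpy as np

X = np.array([43, 21, 25, 42, 57, 59])
Y_up = 2 * X + 1   # increases exactly linearly with X
Y_down = -3 * X    # decreases exactly linearly with X

# corrcoef returns a 2x2 matrix; element [0, 1] is the coefficient between the two inputs.
r_up = np.corrcoef(X, Y_up)[0, 1]
r_down = np.corrcoef(X, Y_down)[0, 1]

print(np.isclose(r_up, 1.0))     # True
print(np.isclose(r_down, -1.0))  # True
```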

Next, I will compare the result of my function, which calculates the correlation coefficient manually, with the result of the NumPy function corrcoef.

In [64]:
#Use the NumPy corrcoef function on the same arrays.
np.corrcoef(X,Y)

array([[1.       , 0.5298089],
       [0.5298089, 1.       ]])

The NumPy function returns a correlation matrix instead of a single number, because it can take a 1-D or 2-D array containing multiple variables, whereas the function I wrote only compares two variables. The diagonal values are 1 because X is perfectly correlated with itself and Y is perfectly correlated with itself. The two off-diagonal values are equal because the correlation coefficient is the same for (X,Y) and (Y,X), regardless of the order of the variables.
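
To compare the two results as single numbers, the off-diagonal element can be taken from the matrix and checked against the manual calculation with `np.isclose`. This is a minimal sketch; it repeats the manual formula inline so that it runs on its own, and the names `r_numpy` and `r_manual` are introduced only for illustration.

```python
import numpy as np

X = np.array([43, 21, 25, 42, 57, 59])
Y = np.array([99, 65, 79, 75, 87, 81])

# Off-diagonal element of the 2x2 correlation matrix.
r_numpy = np.corrcoef(X, Y)[0, 1]

# Manual calculation, following the same formula as the function defined earlier.
dx = X - np.mean(X)
dy = Y - np.mean(Y)
r_manual = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

print(np.isclose(r_numpy, r_manual))  # True
```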

In [65]:
#Use the NumPy corrcoef function on the same arrays in reverse order.
np.corrcoef(Y,X)

array([[1.       , 0.5298089],
       [0.5298089, 1.       ]])

In [66]:
#Use my function on the same arrays in reverse order.
Pearsons_r(Y,X)

0.5298089018901744

The functions behave as expected and the results match.
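
As a further cross-check, the same coefficient can be obtained from the sample covariance divided by the product of the sample standard deviations, using `np.cov` from the references above. This is a sketch of an alternative route, not part of the original assignment; both `np.cov` (by default) and `np.std` with `ddof=1` use the n-1 divisor, so the divisors cancel.

```python
import numpy as np

X = np.array([43, 21, 25, 42, 57, 59])
Y = np.array([99, 65, 79, 75, 87, 81])

# np.cov returns a 2x2 covariance matrix; element [0, 1] is cov(X, Y).
cov_xy = np.cov(X, Y)[0, 1]

# Sample standard deviations (ddof=1 matches the default divisor used by np.cov).
r_from_cov = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))

print(round(r_from_cov, 4))  # 0.5298
```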