# Week 1 - Exploratory data analysis

## 2. Numpy

Python has been highlighted as a great programming language in the field of data science because it is easy to learn and is supported by a number of scientific computing libraries. Numpy is one of the vital libraries that deals with mathematical computation and enables users to compute on multi-dimensional data structures more efficiently and easily.

### 2.1 Basics

Before starting, make sure the Numpy package is installed by running this shell command:

In [None]:
!pip install numpy

In [None]:
import numpy as np
print(np.__version__) # prints the current version of Numpy

Numpy offers a very intuitive way of representing matrices as multidimensional arrays; this data structure resembles the ```list``` datatype in Python, except that all elements of an array share a single data type. Here is an example of initializing a numpy array and using some built-in attributes to retrieve information about it:

In [None]:
a = [1, 2, 3, 4] # normal Python list
b = np.array([1, 2, 3, 4]) # Numpy rank 1 array

print('-----List-----')
print('Type:', type(a), '\n')

print('-----Numpy Array-----')
print("Type: ", type(b))
print("Shape: ", b.shape)
print("The first element: ", b[0])
print("The last element: ", b[-1])

Numpy offers various ways of initializing arrays. Use Numpy's documentation at https://numpy.org/doc/ to answer the questions below.

In [None]:
"""
TODO: Replace 'None's with appropriate answers
e.g. b = np.None((2, 2)) --> b = np.ones((2, 2))
"""

# create a matrix full of ones
b = np.None((2, 2))
print("Matrix b")
print(b)

# create a matrix full of zeros
c = np.None((2, 3))
print("\nMatrix c")
print(c)

# create an identity matrix
d = np.None(3)
print("\nMatrix d")
print(d)

# create a matrix filled with random numbers between 0 and 1
e = np.None(2, 2)
print("\nMatrix e")
print(e)

# create an array which has 0-9 as its elements in sorted order
# expected output: [0 1 2 3 4 5 6 7 8 9]
f = np.None(10)
print("\nMatrix f")
print(f)

# create a matrix placeholder, without initializing entries (elements in the matrix).
g = np.None((5, 3))
print("\nMatrix g")
print(g)

#### Why Numpy?

You might be wondering, why do I need all these custom functions? Can I not do things the usual way, with for loops? The answer to that question is, of course, yes you can, but it will be *A LOT* slower. Let's see an example of the same function being implemented with and without NumPy.

In [None]:
# A library to help us measure how fast our algorithms are
import timeit

def numpy_method(n):
    return np.arange(n) ** 2
    
def for_loop_method(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result  # return the list so both methods produce a result
        
%timeit numpy_method(1000)
%timeit for_loop_method(1000)

When comparing the two timings, check the units: there are 1000 microseconds (µs) in a millisecond (ms). The NumPy method is typically many times faster than our for loop method, although the exact factor depends on your machine and on `n`. This shows us the beauty of NumPy: once you get the hang of it, you can get performance close to that of a low-level language (like C++) with the ease of use of a high-level language (like Python).

**Note**: If you re-run the above, you will observe different values; however, the ratio of the two times should be comparable.
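
The `%timeit` magic only works inside IPython or Jupyter. As a minimal sketch, the same comparison can be made with the standard `timeit` module, which also runs as a plain script. The repeat count of 200 is an arbitrary assumption, and `for_loop_method` returns its list here so that the two results can be compared.

```python
import timeit

import numpy as np


def numpy_method(n):
    # Vectorised: square every element of the array in one operation
    return np.arange(n) ** 2


def for_loop_method(n):
    # Pure Python: square the numbers one at a time
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result


repeats = 200  # arbitrary number of calls to average over
t_numpy = timeit.timeit(lambda: numpy_method(1000), number=repeats) / repeats
t_loop = timeit.timeit(lambda: for_loop_method(1000), number=repeats) / repeats

print(f"NumPy:    {t_numpy * 1e6:.1f} µs per call")
print(f"For loop: {t_loop * 1e6:.1f} µs per call")
print(f"Ratio:    {t_loop / t_numpy:.1f}x")
```

The printed ratio is the speed-up measured on your own machine, so it will vary from run to run.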

### 2.2 Matrix Operations

In machine learning, we will deal with a lot of matrix calculations. It is therefore good for us to get accustomed to some of the common operations we perform on them. Here is a list of the first few:

- `np.transpose()` : Transpose of an array
- `np.dot(a, b)` : Dot product of two arrays (the matrix product for two-dimensional arrays)
- `np.linalg.inv()` : Inverse matrix of an array (only valid for non-singular square matrices, whose dimension is n * n)
- `np.diagonal()` : Diagonal components of a two-dimensional array
- `a.reshape(x, y)` : Reshape an array to `x` rows and `y` columns (the total number of elements must stay the same)

Now let's check what each of them does.
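
Of the operations listed above, `np.linalg.inv()` is the only one without an exercise cell below, so here is a minimal sketch of it. The matrices `m` and `singular` are made-up examples, not part of the exercise data.

```python
import numpy as np

# A made-up 2x2 matrix with determinant 4*6 - 7*2 = 10, so it is invertible
m = np.array([[4.0, 7.0],
              [2.0, 6.0]])

m_inv = np.linalg.inv(m)
print(m_inv)

# Multiplying a matrix by its inverse gives the identity (up to rounding)
identity_check = np.dot(m, m_inv)
print(np.allclose(identity_check, np.eye(2)))

# A singular matrix (second row is twice the first) has no inverse
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    inverse_failed = False
except np.linalg.LinAlgError:
    inverse_failed = True
print("Inverse failed for singular matrix:", inverse_failed)
```

Comparing with `np.allclose` rather than `==` is a deliberate choice, because floating-point rounding means the product is rarely exactly the identity.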

In [None]:
# Initialise the data we will use below
x = np.array([
    [3, 11, 1],
    [7, 5, 2],
    [6, 8, 9],
    [0, 10, 4]
])
x

In [None]:
# To Do: Transpose the array

# Expected outcome:
# [[ 3  7  6  0]
#  [11  5  8 10]
#  [ 1  2  9  4]]

transposed = None
transposed

In [None]:
# To Do: Dot product of two arrays: the original x and 'transposed'
# (4x3) dot (3x4) should give you (4x4)

# Expected outcome:
# [[131  78 115 114]
#  [ 78  78 100  58]
#  [115 100 181 116]
#  [114  58 116 116]]

y = None
y

In [None]:
# TODO: Do elementwise multiplication with 'broadcaster' and 'x' (replace 'None')
# You will know what we meant by 'broadcast' once you check your result.

# Expected outcome for the variable 'elementwise_broadcasting':
# [[ 0  0  0]
#  [ 7  5  2]
#  [12 16 18]
#  [ 0 30 12]]

broadcaster = np.array([
    [0],
    [1],
    [2],
    [3]
])
print("broadcaster: \n{}\n".format(broadcaster))

elementwise_broadcasting = None
print("broadcasted: \n{}".format(elementwise_broadcasting))
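
Broadcasting lets NumPy combine arrays of different shapes by stretching any dimension of size 1 to match the other array. Here is a minimal sketch using addition on made-up arrays; `grid`, `row_offsets`, and `col_offsets` are illustrative names that are not part of the exercise.

```python
import numpy as np

grid = np.zeros((2, 3), dtype=int)     # shape (2, 3)
row_offsets = np.array([10, 20, 30])   # shape (3,)
col_offsets = np.array([[1], [2]])     # shape (2, 1)

# row_offsets is reused for every row of grid
added_rows = grid + row_offsets
print(added_rows)

# col_offsets is reused for every column of grid
added_cols = grid + col_offsets
print(added_cols)

# Two arrays of shapes (3,) and (2, 1) broadcast to shape (2, 3)
combined = row_offsets + col_offsets
print(combined.shape)
```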

In [None]:
# To Do: Extract the diagonal elements of an array x
# Expected outcome: [3 5 9]

diagonal = None
print(diagonal)

In [None]:
# To Do: Reshape an array x to one that has 6 rows and 2 columns
# Expected outcome: 
# [[ 3 11]
#  [ 1  7]
#  [ 5  2]
#  [ 6  8]
#  [ 9  0]
#  [10  4]]
reshaped = None
print(reshaped)

### 2.3 Statistics in Numpy

When we deal with large amounts of data, we will often want to know things about the data as a whole. This is where NumPy's statistics come to the rescue. Most of them are self-explanatory:

- `np.sum()` : Sum of all elements in an array
- `np.max()` : Maximum value of an array
- `np.min()` : Minimum value of an array
- `np.mean()` : Mean of the elements in an array
- `np.median()` : Median value among the elements
- `np.var()` : Variance of the elements in the array
- `np.std()` : Standard deviation of the elements in the array

As before, fill in the cells below to get used to these methods.
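
On a two-dimensional array, these functions also accept an optional `axis` argument that computes the statistic per column or per row instead of over the whole array. Here is a minimal sketch on a made-up array; `scores` is an illustrative name, not exercise data.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

total = np.sum(scores)                 # over all elements
column_sums = np.sum(scores, axis=0)   # collapse the rows: one value per column
row_sums = np.sum(scores, axis=1)      # collapse the columns: one value per row
row_means = np.mean(scores, axis=1)

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
print(row_means)    # [2. 5.]
```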

In [None]:
x = np.array(
    [34, 56, 6, 3, 9, 89, 120, 12, 201],
    dtype = np.int32
)

In [None]:
# To Do: Summation of elements 
# Expected outcome: 530
summation = np.None(x)
print(summation)

In [None]:
# To Do: Minimum element in the array
# Expected outcome: 3
minimum = x.None()
print(minimum)

In [None]:
# To Do: Maximum element in the array
# Expected outcome: 201
maximum = x.None()
print(maximum)

In [None]:
# To Do: Average value of elements in the array
# Expected outcome: 58.89 (rounded to two decimal places)
mean = x.None()
print(mean)

In [None]:
# To Do: Median element in the array
# Expected outcome: 34.0
median = np.None(x)
print(median)

In [None]:
# To Do: Variance of the array
# Expected outcome: 4008.098765432099
variation = np.None(x)
print(variation)

In [None]:
# To Do: Standard deviation of the array
# Expected outcome: 63.3095471902311
std = np.None(x)
print(std)

In [None]:
# To Do: Sample standard deviation of the array (divisor n-1, i.e. ddof=1)
# Expected outcome: approximately 67.15
std_sample = np.None(x)
print(std_sample)
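
For reference, the population standard deviation $\sigma$ and the sample standard deviation $s$ of $n$ values with mean $\bar{x}$ differ only in the divisor:

$$
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

`np.std()` divides by `N - ddof`, where the optional `ddof` argument defaults to 0, so the default result is $\sigma$.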

### 2.4 Exercise

Now let's start combining these concepts together to manipulate data.
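
The exercises below rely on indexing a two-dimensional array by row and column, which has not been shown yet. Here is a minimal sketch on a made-up array; `m` is an illustrative name, not the exercise data.

```python
import numpy as np

m = np.array([[10, 11, 12],
              [20, 21, 22]])

# m[row, column] selects a single element; indices start at 0
element = m[0, 1]
print(element)         # 11

# A colon selects everything along that axis
first_row = m[0, :]
second_column = m[:, 1]
print(first_row)       # [10 11 12]
print(second_column)   # [11 21]

# Negative indices count from the end
bottom_right = m[-1, -1]
print(bottom_right)    # 22
```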

In [None]:
# Data we will use in this exercise
x = np.array([
    [1, 52, 22, 2, 31, 65, 7, 8, 24, 10],
    [12, 2322, 33, 1, 2, 3, 99, 24, 1, 42],
    [623, 24, 3, 56, 5, 2, 7, 85, 22, 110],
    [63, 4, 3, 4, 5, 64, 7, 82, 3, 20],
    [48, 8, 3, 24, 57, 63, 7, 8, 9, 1032],
    [33, 64, 0, 24, 5, 6, 72, 832, 3, 10],
    [12, 242, 2, 11, 52, 63, 32, 8, 96, 2],
    [13, 223, 52, 4, 35, 62, 7, 8, 9, 10],
    [19, 2, 3, 149, 15, 6, 172, 2, 2, 11],
    [34, 23, 32, 24, 54, 63, 1, 5, 92, 7]
])

x.shape

In [None]:
# To Do: Extract the first column of x
# expected outcome: [1 12 623 63 48 33 12 13 19 34]
firstcol_x = None
print(firstcol_x)

In [None]:
# To Do: extract the last row of x
# expected outcome: [34 23 32 24 54 63 1 5 92 7]
lastrow_x = None
print(lastrow_x)

In [None]:
# To Do: calculate the mean of elements in the last row
# expected outcome: 33.5
mean_lastrow = None
print(mean_lastrow)

In [None]:
# To Do: extract the diagonal components of x
# expected outcome: [1 2322 3 4 57 6 32 8 2 7]
diag_x = None
print(diag_x)

In [None]:
# To Do: calculate the variance of the diagonal components of x
# expected outcome: 479979.9600000001
var_diag = None
print(var_diag)