<center>
  <h1>DS3000</h1>
  <h1>FLA1: Vectorized Operations - Numpy</h1>
</center>

This tutorial focuses on vectorized operations in the <b>numpy</b> package. By the end of the tutorial, you'll be able to confidently answer the following questions:
<ol>
<li> What are vectorized operations?
<li> How to use them?
</ol>

**Instructions:**
- It is okay if you work with another person on this FLA. Each of you should individually submit the completed file (even if it's the same file). 
- Synchronous students will complete this FLA during the live session.
- Asynchronous students will complete it by the deadline specified. If you want to collaborate with an asynchronous classmate, you will need to find them on your own (through discussion posts maybe).
- Upload your completed Notebook to Google Colab (https://colab.research.google.com/notebooks/)
- Get a shareable link to your Notebook on Colab
    * Go to Share near the top-right corner of the screen.
    * <font color='red'> **Make sure you select "Anyone with the Link Can View"** </font>
    * Copy the link
- Go to Canvas
- **Upload your completed Jupyter Notebook through the Canvas FLA link.**
- **Paste the link to your Colab Notebook in the Comments field.**
- Both are required!
<hr>



<center> 
    <h2>Part 1: Array Operations</h2>
</center>
<hr>

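1. Write a code snippet to generate two 3x3 arrays, arr1 and arr2. The arrays should be randomly populated with integers from 1 to 5. Remember there is a method for this! Refer to the sample outputs below; your values will differ because the arrays are random, and the samples themselves come from different runs.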

In [1]:
import numpy as np

arr1 = np.random.randint(1,6,(3,3))
arr2 = np.random.randint(1,6,(3,3))

In [2]:
arr1

array([[3, 3, 3],
       [1, 5, 5],
       [5, 3, 3]])

In [None]:
arr2

array([[3, 2, 5],
       [2, 1, 4],
       [5, 3, 4]])
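
Because the arrays are random, every run produces different values. The following is a minimal sketch, not part of the assignment, of how a fixed seed makes the output repeatable. The seed value `42` and the names `first` and `second` are assumptions chosen for illustration. It also checks that the upper bound passed to `np.random.randint` is exclusive, which is why `6` is used to get numbers up to 5.

```python
import numpy as np

np.random.seed(42)  # hypothetical seed value; any integer works
first = np.random.randint(1, 6, (3, 3))

np.random.seed(42)  # resetting the seed replays the same random sequence
second = np.random.randint(1, 6, (3, 3))

print(np.array_equal(first, second))        # True
print(first.min() >= 1, first.max() <= 5)   # True True
```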

2. Go ahead and **add** these two arrays using the addition operator (+) and store the result in the arr3 variable.

In [3]:
arr3 = arr1 + arr2
arr3

array([[4, 4, 6],
       [6, 9, 8],
       [6, 5, 5]])

3. Now go ahead and **multiply** these two arrays using the multiplication operator (*) and store the result in the arr4 variable.

In [4]:
arr4 = arr1 * arr2
arr4

array([[ 3,  3,  9],
       [ 5, 20, 15],
       [ 5,  6,  6]])

4. Based on arr3 and arr4, what can you conclude about array/matrix operations in Numpy? Type your answer below:

In [None]:
#answer goes here.

<center> 
    <h2>Part 2: Vectorization Tutorial</h2>
</center>
<hr>

<h2> 1. Introduction </h2>
Most real-life applications deal with large amounts of data, so a computationally suboptimal function can quickly become a bottleneck and result in high latency.

In other words, non-optimal functions can take the luster off your amazing code! Vectorization comes to save the day!

So what is vectorization? 
<i><p><center>"This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact [seen] in any kind of numerical computations." <br>~Wes McKinney (creator of the pandas package)</center></p></i>

Simply put, it is a way to write code without explicit loop statements! It is the ability to express operations over an entire array rather than over its individual elements.
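
As a small illustration of the idea, the sketch below doubles every element of an array twice: once with an explicit loop and once with a single array expression. The array `values` and the variable names are assumptions made up for this example.

```python
import numpy as np

values = np.array([1, 2, 3, 4])  # hypothetical sample data

# Loop version: visit each element explicitly and collect the results.
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Vectorized version: one expression applied to the whole array.
doubled_vec = values * 2

print(doubled_vec)  # [2 4 6 8]
```

Both versions compute the same numbers; the vectorized one simply hands the looping over to numpy.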

Let's look at some examples to further understand this.

<h2> 2. Examples </h2>

To demonstrate this, we will follow a simple approach: time non-vectorized and vectorized operations.
You are expected to write the non-vectorized code by following the instructions given in the corresponding blocks.

After every example, we'll check out the time differences.

<hr>
Note: Inferences are left as an exercise for the reader.

In [None]:
import numpy as np
import time

<h3> 2.1 Square Root </h3>

<b>Given</b> -> A 1-d array of size 10000 <br>
<b>To-do</b> -> Calculate the square root of each element of the array

Execute the following cell first.

In [None]:
data = np.random.randint(low = 1, high = 10000, size = 10000)
data

<h4>2.1.1. Non-Vectorized approach</h4>

In [None]:
'''
Here we are going to calculate the square root of each element in the given array.
Follow the instructions below.
'''

start_time = time.time() #execution start time

# COMPLETE THE CODE BELOW
# DEFINE AN EMPTY LIST 'res'
# ITERATE OVER THE 'data' VARIABLE USING FOR LOOP
# CALCULATE THE SQUARE ROOT OF EACH ELEMENT
# SAVE THE RESULTS IN 'res'

''' WRITE CODE BELOW '''
# answer goes here
''' WRITE CODE ABOVE '''

end_time = time.time() #execution end time

for_loop_time_0 = end_time - start_time #time taken

print ("Time taken:\t{}".format(for_loop_time_0))

<hr>
<h4>2.1.2. Vectorized Approach</h4>
Check out how to solve the same with vectorized operations by running the following two cells:

* You don't need to write or change anything for this one. Simply execute the cells.

**Note**: Pay attention to the difference in execution times.



In [None]:
# vector method
start_time = time.time() #execution start time

# simply calling the numpy.sqrt method
res = np.sqrt(data)

end_time = time.time() #execution end time

vector_time_0 = end_time - start_time #time taken

print ("Time taken:\t{}".format(vector_time_0))

In [None]:
print ("Vector operation is {:.2f}x faster".format(for_loop_time_0/vector_time_0))
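
A single `time.time()` measurement of a very fast operation can be noisy, and on some systems the measured difference may even come out as zero. As a hedged alternative, the sketch below averages many runs using the standard-library `timeit` module. The helper `average_seconds`, the array `sample`, and the repeat count are assumptions introduced for illustration and are not part of the assignment.

```python
import timeit
import numpy as np

# Hypothetical stand-in for the 'data' array used in this section.
sample = np.random.randint(low=1, high=10000, size=10000)

def average_seconds(func, repeats=100):
    """Call func `repeats` times and return the mean time per call in seconds."""
    total = timeit.timeit(func, number=repeats)
    return total / repeats

vector_avg = average_seconds(lambda: np.sqrt(sample))
print("Average time per call:\t{}".format(vector_avg))
```

The same helper can wrap any zero-argument function, so a loop-based version could be timed the same way for a fairer comparison.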

<h3> 2.2 Counting numbers </h3>

<b>Given</b> -> A 1-d array of 100000 integers between 1 and 999. <br>
<b>To-do</b> -> Count how many numbers are greater than or equal to 500.

<hr>
Careful readers will notice that the same approach can be applied to summing over a matrix as well.
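
To make the connection between counting and summing concrete, here is a minimal sketch on a handful of made-up values. The arrays `small` and `matrix` are assumptions for illustration. A comparison on an array produces a boolean array, and summing it counts the `True` entries because each `True` is treated as 1.

```python
import numpy as np

small = np.array([120, 500, 730, 499, 999])  # hypothetical sample values

mask = small >= 500          # boolean array, one entry per element
print(np.sum(mask))          # 3  (each True counts as 1)
print(np.sum(small[mask]))   # 2229  (sum of the values that passed the test)

# The same function sums over a 2-d array (matrix) as well.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(np.sum(matrix))          # 21
print(np.sum(matrix, axis=0))  # [5 7 9]  (column sums)
```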

Execute the following cell first.

In [None]:
data_1 = np.random.randint(low = 1, high = 1000, size = 100000, dtype = int)
data_1

<h4>2.2.1. Non-Vectorized Approach</h4>

In [None]:
'''
Here we will count the number of numbers greater than or equal to 500
'''

start_time = time.time() #execution start time

# COMPLETE THE CODE
# INITIALIZE A 'count' VARIABLE
# ITERATE OVER THE ARRAY USING A LOOP OF YOUR CHOICE
## CHECK IF THE ELEMENT IS GREATER THAN OR EQUAL TO 500; IF YES, INCREASE THE 'count' VARIABLE BY 1

''' WRITE CODE BELOW '''
# answer goes here
''' WRITE CODE ABOVE '''

end_time = time.time() #execution end time

for_loop_time_1 = end_time - start_time #time taken

print ("{} numbers are greater than or equal to 500\nTime taken: {}".format(count, for_loop_time_1))

<h4>2.2.2. Vectorized Approach</h4>

In [None]:
# vectorized approach
start_time = time.time() #execution start time

count = np.sum(data_1 >= 500)

end_time = time.time() #execution end time

vector_time_1 = end_time - start_time #time taken
print ("{} numbers are greater than or equal to 500\nTime taken: {}".format(count, vector_time_1))

In [None]:
print ("Vector operation is {:.2f}x faster".format(for_loop_time_1/vector_time_1))

<h3> 2.3 Dot Product </h3>
    
Simply put, the dot product of two arrays is the sum of the products of the corresponding elements of the two arrays. To compute the dot product of two arrays, you first perform an element-wise multiplication of the corresponding elements and then sum them up. In this question, you will practice this.
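
As a tiny worked example of this definition, the sketch below uses two made-up three-element arrays (the names `a`, `b`, `products`, `manual`, and `builtin` are assumptions for illustration). The element-wise products are 4, 10, and 18, and their sum is 32, which matches what `np.dot` returns.

```python
import numpy as np

a = np.array([1, 2, 3])  # hypothetical sample arrays
b = np.array([4, 5, 6])

products = a * b            # element-wise products: 4, 10, 18
manual = np.sum(products)   # 4 + 10 + 18 = 32
builtin = np.dot(a, b)      # same value computed in one call

print(manual, builtin)  # 32 32
```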

<b>Given</b> -> Two 1-d arrays of size 10000 each <br>
<b>To-do</b> -> Compute the dot product of those arrays

Execute the following cells first.


In [None]:
data_a = np.random.randint(low = 1, high = 500, size = (10000))
data_b = np.random.randint(low = 1, high = 500, size = (10000))

In [None]:
data_a

In [None]:
data_b

<h4>2.3.1. Non-Vectorized Approach</h4>

In [None]:
'''
Here we'll calculate the dot product of 2 arrays (sum of element-wise multiplication of 2 arrays)
'''

start_time = time.time() #execution start time

# COMPLETE THE CODE
# INITIALIZE A VARIABLE 'dot' TO HOLD THE RESULT
# ITERATE OVER ONE OF THE ARRAYs USING A FOR LOOP 
# NOTE: BOTH ARRAYS ARE SAME SIZE, SO IT WON'T MATTER WHICH ARRAY YOU ITERATE OVER
# MULTIPLY ELEMENTS AND UPDATE THE 'dot' VARIABLE

''' WRITE CODE BELOW '''
# answer goes here
''' WRITE CODE ABOVE '''

end_time = time.time() #execution end time

for_loop_time_2 = end_time - start_time #time taken
print ("Dot product:\t{}\nTime taken:\t{}".format(dot, for_loop_time_2))

<h4>2.3.2. Vectorized Approach</h4>

In [None]:
# vector method
start_time = time.time() #execution start time

dot = np.dot(data_a, data_b)

end_time = time.time() #execution end time

vector_time_2 = end_time - start_time #time taken
print ("Dot product:\t{}\nTime taken:\t{}".format(dot, vector_time_2))

In [None]:
print ("Vector operation is {:.2f}x faster".format(for_loop_time_2/vector_time_2))

<h3> 2.4 Matrix Multiplication </h3>
<b>Given</b> -> Two 2-d arrays of size 100x100 each  <br>
<b>To-do</b> -> Perform matrix multiplication

Execute the following cells first.

In [None]:
data_a = np.random.randint(low = 1, high = 1000, size = (100, 100))
data_b = np.random.randint(low = 1, high = 1000, size = (100, 100))

In [None]:
data_a

In [None]:
data_b

<h4>2.4.1. Non-Vectorized Approach</h4>

In [None]:
'''
Here we'll multiply 2 matrices. The code is already complete, so simply run the cell.
'''
# THE CODE BELOW IS COMPLETE
start_time = time.time()

result = np.zeros((100, 100))
for i in range(len(data_a)):
    for j in range(len(data_b[0])):
        for k in range(len(data_b)):
            result[i][j] += data_a[i][k] * data_b[k][j]

end_time = time.time()
for_loop_time_3 = end_time - start_time
print ("Time taken:\t{}".format(for_loop_time_3))

In [None]:
result.reshape(100, 100)

<h4>2.4.2. Vectorized Approach</h4>

In [None]:
# vector operations
start_time = time.time()

res = np.dot(data_a, data_b)

end_time = time.time()

vector_time_3 = end_time - start_time
print ("Time taken:\t{}".format(vector_time_3))

In [None]:
print ("Vector operation is {:.2f}x faster".format(for_loop_time_3/vector_time_3))
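
Speed only matters if both approaches give the same answer. The following is a minimal sketch of such a sanity check on small, made-up 2x2 matrices (`m1` and `m2` are assumptions for illustration). It reuses the triple-loop pattern from 2.4.1 and compares it against `np.dot` and the `@` operator, which also performs matrix multiplication on 2-d arrays.

```python
import numpy as np

m1 = np.array([[1, 2], [3, 4]])  # hypothetical small inputs
m2 = np.array([[5, 6], [7, 8]])

# Triple loop, same pattern as the non-vectorized cell above.
loop_result = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            loop_result[i][j] += m1[i][k] * m2[k][j]

dot_result = np.dot(m1, m2)  # vectorized matrix product
at_result = m1 @ m2          # equivalent operator form for 2-d arrays

print(np.allclose(loop_result, dot_result))   # True
print(np.array_equal(dot_result, at_result))  # True
```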

<h2> 3. Conclusion </h2>

The timings above show how much faster vectorized operations are than explicit loops. Although the differences here are on sub-second scales, they grow with the size of the data, which is why vectorized operations matter in practice.

<hr>
Learn more here: https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html
