NumPy, short for Numerical Python is used to analyze numeric data with Python. Although numeric operations may be performed without NumPy, NumPy is preferred for its efficiency, especially when working with large arrays of data. A couple of reasons that make NumPy more efficient are:

1. NumPy arrays use much less memory than other built-in Python data structures. This is because a NumPy array is densenly packed due to the homogenous nature of data stored in it. This helps retrieve the data faster as well, thereby making computations faster.
2. With NumPy, vectorized computations can replace the relatively more expensive python `for` loops.

We'll see the above two advantages of NumPy with the examples below.

Let us import the NumPy library to use its methods and functions.

In [5]:
import numpy as np

**Example 1:** This example shows that computations using NumPy arrays are typically much faster than computations with other data structures such as a list.

**Q:** Multiply whole numbers upto 1 million by an integer, say 2. Compare the time taken for the computation if the numbers are stored in a NumPy array vs a list.

Use the numpy function [arange()](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) to define a one-dimensional NumPy array.

In [25]:
#Examples showing NumPy arrays are more efficient for numerical computation
import time as tm
start_time = tm.time()
list_ex = list(range(1000000)) #List containinig whole numbers upto 1 million
a=(list_ex*2)
print("Time take to multiply numbers in a list = ", tm.time()-start_time)

start_time = tm.time()
tuple_ex = tuple(range(1000000)) #List containinig whole numbers upto 1 million
a=(tuple_ex*2)
print("Time take to multiply numbers in a tuple = ", tm.time()-start_time)

start_time = tm.time()
numpy_ex = np.arange(1000000) #tuple containinig whole numbers upto 1 million
a=(numpy_ex*2)
print("Time take to multiply numbers in a NumPy array = ", tm.time()-start_time)

Time take to multiply numbers in a list =  0.04031014442443848
Time take to multiply numbers in a tuple =  0.03827619552612305
Time take to multiply numbers in a NumPy array =  0.0


## Vectorized computation with NumPy

Several matrix algebra operations such as multiplications, decompositions, determinants, etc. can be performed convenienetly with NumPy. However, we'll focus on matrix multiplication as it is very commonly used to avoid python `for` loops and make computations faster. The [dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) function is used to multiply matrices:

In [98]:
#Defining a 2x3 matrix
a = np.array([[0,1],[3,4]])
a

array([[0, 1],
       [3, 4]])

In [99]:
#Defining a 3x2 matrix
b = np.array([[6,-1],[2,1]])
b

array([[ 6, -1],
       [ 2,  1]])

In [100]:
#Multiplying matrices 'a' and 'b' using the dot function
a.dot(b)

array([[ 2,  1],
       [26,  1]])

In [101]:
#Note that * results in element-wise multiplication
a*b

array([[ 0, -1],
       [ 6,  4]])

**Example 2:** This example will show vectorized computations with NumPy. Vectorized computations help perform computations more efficiently, and also make the code concise.

**Q:** Read the (1) quantities of roll, bun, cake and bread required by 3 people - Ben, Barbara & Beth, from *food_quantity.csv*, (2) price of these food items in two shops - Target and Kroger, from *price.csv*. Find out which shop should each person go to minimize their expenses.

In [32]:
#Reading the datasets on food quantity and price
import pandas as pd
food_qty = pd.read_csv('./Datasets/food_quantity.csv')
price = pd.read_csv('./Datasets/price.csv')

In [33]:
food_qty

Unnamed: 0,Person,roll,bun,cake,bread
0,Ben,6,5,3,1
1,Barbara,3,6,2,2
2,Beth,3,4,3,1


In [34]:
price

Unnamed: 0,Item,Target,Kroger
0,roll,1.5,1.0
1,bun,2.0,2.5
2,cake,5.0,4.5
3,bread,16.0,17.0


First, let's start from a simple problem. We'll compute the expenses of Ben if he prefers to buy all food items from Target

In [36]:
#Method 1: Using loop
bens_target_expense = 0 #Initializing Ben's expenses to 0
for k in range(4):   #Iterating over all the four desired food items
    bens_target_expense += food_qty.iloc[0,k+1]*price.iloc[k,1] #Total expenses on the kth item
bens_target_expense    #Total expenses for Ben if he goes to Target

50.0

In [37]:
#Method 2: Using NumPy array
food_num = food_qty.iloc[0,1:].to_numpy()  #Converting food quantity (for Ben) dataframe to NumPy array
price_num = price.iloc[:,1].to_numpy()     #Converting price (for Target) dataframe to NumPy array
food_num.dot(price_num)   #Matrix multiplication of the quantity vector with the price vector directly yields the result

50.0

Ben will spend $50 if he goes to Target

Now, let's add another layer of complication. We'll compute Ben's expenses for both stores - Target and Kroger

In [38]:
#Method 1: Using loops

#Initializing a Series of length two to store the expenses in Target and Kroger for Ben
bens_store_expense = pd.Series(0.0,index=price.columns[1:3])
for j in range(2):      #Iterating over both the stores - Target and Kroger
    for k in range(4):        #Iterating over all the four desired food items
        bens_store_expense[j] += food_qty.iloc[0,k+1]*price.iloc[k,j+1]
bens_store_expense

Target    50.0
Kroger    49.0
dtype: float64

In [39]:
#Method 2: Using NumPy array
food_num = food_qty.iloc[0,1:].to_numpy()  #Converting food quantity (for Ben) dataframe to NumPy array
price_num = price.iloc[:,1:].to_numpy()    #Converting price dataframe to NumPy array
food_num.dot(price_num)      #Matrix multiplication of the quantity vector with the price matrix directly yields the result

array([50.0, 49.0], dtype=object)

Ben will spend \\$50 if he goes to Target, and $49 if he goes to Kroger. Thus, he should choose Kroger.

Now, let's add the final layer of complication, and solve the problem. We'll compute everyone's expenses for both stores - Target and Kroger

In [40]:
#Method 1: Using loops
store_expense = pd.DataFrame(0.0,index=price.columns[1:3],columns = food_qty['Person'])
for i in range(3):    #Iterating over all the three people - Ben, Barbara, and Beth
    for j in range(2):     #Iterating over both the stores - Target and Kroger
        for k in range(4):        #Iterating over all the four desired food items
            store_expense.iloc[j,i] += food_qty.iloc[i,k+1]*price.iloc[k,j+1]
store_expense

Person,Ben,Barbara,Beth
Target,50.0,58.5,43.5
Kroger,49.0,61.0,43.5


In [41]:
#Method 2: Using NumPy array
food_num = food_qty.iloc[:,1:].to_numpy() #Converting food quantity dataframe to NumPy array
price_num = price.iloc[:,1:].to_numpy()  #Converting price dataframe to NumPy array
food_num.dot(price_num)  #Matrix multiplication of the quantity matrix with the price matrix directly yields the result

array([[50. , 49. ],
       [58.5, 61. ],
       [43.5, 43.5]])

Based on the above table, Ben should go to Kroger, Barbara to Target and Beth can go to either store.  \
Note that, with each layer of complication, the number of for loops keep increasing, thereby increasing the complexity of Method 1, while the method with NumPy array does not change much. Vectorized computations with arrays are much more efficient.

### In-class exercise {-}

Use matrix multiplication to find the average IMDB rating and average Rotten tomatoes rating for each genre - comedy, action, drama and horror. Use the data: *movies_cleaned.csv*. Which is the most preferred genre for IMDB users, and which is the least preferred genre for Rotten Tomatoes users?

**Hint:** 
1. Create two matrices - one containing the IMDB and Rotten Tomatoes ratings, and the other containing the genre flags (comedy/action/drama/horror). 
2. Multiply the two matrices created in 1.
3. Divide each row/column of the resulting matrix by a vector having the number of ratings in each genre to get the average rating for the genre.

In [78]:
#| echo: false
#| output: false

data = pd.read_csv('./Datasets/movies_cleaned.csv')
data.head()

Unnamed: 0,Title,IMDB Rating,Rotten Tomatoes Rating,Running Time min,Release Date,US Gross,Worldwide Gross,Production Budget,comedy,Action,drama,horror
0,Broken Arrow,5.8,55,108,Feb 09 1996,70645997,148345997,65000000,0,1,0,0
1,Brazil,8.0,98,136,Dec 18 1985,9929135,9929135,15000000,1,0,0,0
2,The Cable Guy,5.8,52,95,Jun 14 1996,60240295,102825796,47000000,1,0,0,0
3,Chain Reaction,5.2,13,106,Aug 02 1996,21226204,60209334,55000000,0,1,0,0
4,Clash of the Titans,5.9,65,108,Jun 12 1981,30000000,30000000,15000000,0,1,0,0


In [63]:
#| echo: false
#| output: false

# Getting ratings of all movies
drating = data[['IMDB Rating','Rotten Tomatoes Rating']]
drating_num = drating.to_numpy() #Converting the data to NumPy array

# Getting genres of all movies
dgenre = data.iloc[:,8:12]
dgenre_num = dgenre.to_numpy() #Converting the data to NumPy array

In [64]:
#| echo: false
#| output: false

dgenre_num

array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       ...,
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0]], dtype=int64)

In [65]:
#| echo: false
#| output: false

#Total IMDB and Rotten tomatoes ratings for each genre
ratings_sum_genre = drating_num.T.dot(dgenre_num)
ratings_sum_genre

array([[ 1785.6,  1673.1,  1630.3,   946.2],
       [14119. , 13725. , 14535. ,  6533. ]])

In [66]:
#| echo: false
#| output: false

#Number of movies in the data will be stored in 'rows', and number of columns stored in 'cols'
rows, cols = data.shape

In [67]:
#| echo: false
#| output: false

#Getting number of movies in each genre
movies_count_genre = dgenre_num.T.dot(np.ones(rows))
movies_count_genre

array([302., 264., 239., 154.])

In [68]:
#| echo: false
#| output: false

#Finding the average IMDB and average Rotten tomatoes ratings for each genre
ratings_sum_genre/movies_count_genre

array([[ 5.91258278,  6.3375    ,  6.82133891,  6.14415584],
       [46.75165563, 51.98863636, 60.81589958, 42.42207792]])

In [71]:
#| echo: false
#| output: false

pd.DataFrame(ratings_sum_genre/movies_count_genre,columns = ['comedy','Action','drama','horror'],
             index = ['IMDB Rating','Rotten Tomatoes Rating'])

Unnamed: 0,comedy,Action,drama,horror
IMDB Rating,5.912583,6.3375,6.821339,6.144156
Rotten Tomatoes Rating,46.751656,51.988636,60.8159,42.422078


In [None]:
#| echo: false

#IMDB users prefer *drama*, and are amused the least by *comedy* movies, on an average. However, Rotten tomatoes critics would rather watch *comedy* than *horror* movies, on an average.

## Pseudorandom number generation
Random numbers often need to be generated to analyze processes or systems, especially in cases when these processes or systems are governed by known probability distrbutions. For example, the number of personnel required to answer calls at a call center can be analyzed by simulating occurence and duration of calls.

NumPy's [random](https://numpy.org/doc/stable/reference/random/index.html) module can be used to generate arrays of random numbers from several different probability distributions. For example, a `3x5` array of uniformly distributed random numbers can be generated using the `uniform` function of teh `random` module.

In [81]:
np.random.uniform(size = (3,5))

array([[0.69256322, 0.69259973, 0.03515058, 0.45186048, 0.43513769],
       [0.07373366, 0.07465425, 0.92195975, 0.72915895, 0.8906299 ],
       [0.15816734, 0.88144978, 0.05954028, 0.81403832, 0.97725557]])

Random numbers can also be generated by Python's built-in `random` module. However, it generates one random number at a time, which makes it much slower than NumPy's random module.

**Example 3:** Suppose 500 people eat at Mod Pizza, and another 500 eat at Viet nom nom, everyday.

The waiting time at Viet nom nom has a normal distribution with mean 8 minutes and standard deviation 3 minutes, while the waiting time at Mod Pizza has a uniform distribution with minimum 5 minutes and maximum 25 minutes. 

Simulate a dataset containing waiting times for 500 ppl for 30 days in each of the food joints. Assume that the waiting time is meansured simultaneously at a certain time in both places, i.e., the observations are paired.

**On how many days is the average waiting time at Viet Nom Nom higher than that at Mod Pizza?**

**What percentage of times the waiting time at Viet nom nom was higher than the waiting time at Mod Pizza?**

Try both approaches: (1) Using loops to generate data, (2) numpy array to generate data. Compare the time taken in both approaches. 

In [82]:
import time as tm

In [88]:
#Method 1: Using loops
start_time = tm.time() #Current system time

#Initializing waiting times for 500 ppl over 30 days
waiting_times_MOD = pd.DataFrame(0,index=range(500),columns=range(30)) #Mod pizza
waiting_times_Vnom = pd.DataFrame(0,index=range(500),columns=range(30)) #Viet nom nom
import random as rm
for i in range(500):  #Iterating over 500 ppl
    for j in range(30): #Iterating over 30 days
        waiting_times_Vnom.iloc[i,j] = rm.gauss(8,3) #Simulating waiting time in Viet nom nom for the ith person on jth day
        waiting_times_MOD.iloc[i,j] = rm.uniform(5,25) #Simulating waiting time in Mod pizza for the ith person on jth day
time_diff = waiting_times_Vnom-waiting_times_MOD

print("On ",sum(time_diff.mean()>0)," days, the average waiting time at Viet Nom Nom higher than that at Mod Pizza")
print("Percentage of times waiting time at Viet nom nom was greater than that at Mod Pizza = ",100*(time_diff>0).sum().sum()/(30*500),"%")
end_time = tm.time() #Current system time
print("Time taken = ", end_time-start_time)

On  0  days, the average waiting time at Viet Nom Nom higher than that at Mod Pizza
Percentage of times waiting time at Viet nom nom was greater than that at Mod Pizza =  16.58 %
Time taken =  3.5454351902008057


In [89]:
#Method 2: Using NumPy arrays
start_time = tm.time()
waiting_time_Vnom = np.random.normal(8,3,size = (500,30)) #Simultaneously generating the waiting times of 500 ppl over 30 days in Viet nom nom
waiting_time_MOD = np.random.uniform(5,25,size = (500,30)) #Simultaneously generating the waiting times of 500 ppl over 30 days in MOD pizza
time_diff = waiting_time_Vnom-waiting_time_MOD
print("On ",(time_diff.mean()>0).sum()," days, the average waiting time at Viet Nom Nom higher than that at Mod Pizza")
print("Percentage of times waiting time at Viet nom nom was greater than that at Mod Pizza = ",100*(time_diff>0).sum()/15000,"%")
end_time = tm.time()
print("Time taken = ", end_time-start_time)

On  0  days, the average waiting time at Viet Nom Nom higher than that at Mod Pizza
Percentage of times waiting time at Viet nom nom was greater than that at Mod Pizza =  16.486666666666668 %
Time taken =  0.001995563507080078


The approach with NumPy is much faster than the one with loops.

### In-class exercise {-}

**Lab Question**: Bootstrapping \
Question) Find the 95% confidence interval of Profit for 'Action' movies, using Bootstrapping \
Answer) Bootstrapping is a non-parametric method for obtaining confidence interval. The Bootstrapping method for finding the confidence interval is as follows.\
(a) Find the profit for each of the 'Action' movies. Suppose there are *N* such movies. We will have a *Profit* column with *N* values. \
(b) Randomly sample *N* values with replacement from the *Profit* column \
(c) Find the mean of the *N* values obtained in (b) \
(d) Repeat steps (b) and (c) *M=1000* times \
(e) The 95% Confidence interval is the range between the 2.5% and 97.5% percentile values of the 1000 means obtained in (c) \
Use the *movies_cleaned.csv* dataset. \
Go ahead, code this up, and find the confidence interval!