
## Q1:

You are developing a backend system for an application that processes videos uploaded by users to the server. On the server side, each instance of your program should predict what objects are in each frame, and return the result to another process within the application as a list of pairs of frame IDs and bounding box locations in JSON format. Each instance of your program has a 4GB RAM limit. The model that creates bounding boxes for objects exists in GPU RAM and so it does not consume general purpose RAM. Your program must return only one JSON representing the results for the whole video, not partial results. The incoming video file for each upload can be up to 100GB. Please describe how you would feed the data into your model, and feed the resulting predictions into the JSON response.

## A1. 

In my opinion, the best way to analyze a large video file, where we want a prediction for every frame and there are a very large number of frames, is to break the video into chunks that fit in memory, pass one chunk at a time to the model as a batch, and at the end join the per-chunk `JSONs` in frame order. This is a common pattern, and there are ways to split a video into chunks of a desired size and process them one at a time. Because the result of one chunk does not affect the result of another, we can also use a cluster to run predictions for different chunks on different machines and then combine them (if chunks did depend on each other, parallelizing the predictions would need a more involved process). One way of doing this on a single machine is an executor pool or threading, combining the results of all chunks in order at the end.

Other than chunking, we can try compressing the input, hoping that this does not cost much precision, or we can try a model that takes smaller frames as input, allowing us to load more frames at once.
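
The chunked approach can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `predict_batch` is a hypothetical stand-in for the GPU model call, frames are assumed to come from any lazy iterator (for example a video decoder that yields one frame at a time), and results are written to a JSON file on disk as they are produced, so that neither the 100GB video nor the full result list has to sit in the 4GB of RAM at once.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(frames: Iterable, batch_size: int) -> Iterator[List]:
    """Yield lists of at most batch_size frames without loading the whole video."""
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def write_predictions(frames: Iterable,
                      predict_batch: Callable[[List], List],
                      out_path: str,
                      batch_size: int = 32) -> int:
    """Stream frames through the model and write one JSON array to out_path.

    predict_batch is hypothetical: it takes a list of frames and returns one
    list of bounding boxes per frame. Returns the number of frames processed.
    """
    frame_id = 0
    with open(out_path, "w") as out:
        out.write("[")
        for batch in batched(frames, batch_size):
            for boxes in predict_batch(batch):
                if frame_id > 0:
                    out.write(",")
                # Each entry is a [frame_id, boxes] pair, written immediately
                # so that only one batch is held in memory at a time.
                json.dump([frame_id, boxes], out)
                frame_id += 1
        out.write("]")
    return frame_id
```

The finished file is a single JSON document for the whole video, which can then be returned or streamed to the other process in one response.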

----


## Q2:

You are multithreading a list of parallel tasks using a thread pool. What is t.join() used for? What is the purpose of using t.join() rather than skipping it? This program seems to work with or without the t.join(). Why do we still include it?


In [1]:
from threading import Thread
from queue import Queue
import time

def worker(args,q):
    time.sleep(1)
    print("done {}".format(args))
    q.put(1)
    return

workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])

for i,workerPair in enumerate(workerList):
    workerPair[1].join()
    
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

done 0
done 2
done 1
ALL WORK DONE
TOTAL=3


## A2.
For the given example the workers do very simple work, which hides the need for `thread.join()`. `thread.join()` is needed:

i. If we want all the work running in the threads to finish before the main thread continues.
ii. If we want to use the result of one thread in another thread or in the main thread.

The reason you don't see any difference here is that the main thread only reads from the queues after the join, and `Queue.get()` itself blocks until a worker has put its result, so the total comes out the same either way. Without the join, only the ordering changes: `ALL WORK DONE` can be printed before the workers have finished. You could say join is relevant for the execution flow of the main thread. If we wanted to use results that the threads store somewhere that does not block, such as a shared list or variable, skipping the join would no longer work, especially in a more complex case.

ref: https://pages.mtu.edu/~shene/NSF-3/e-Book/FUNDAMENTALS/thread-management.html

In [3]:
print('#'*10 + ' '*3 +'WITH JOIN' + ' '*3 + '#'*10)
def worker(args,q):
    time.sleep(args*10)
    print("done {}".format(args))
    q.put(1)
    return

workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])

for i,workerPair in enumerate(workerList):
    workerPair[1].join()
    
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

print('#'*10 + ' '*3 +'WITHOUT JOIN' + ' '*3 + '#'*10)
workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

##########   WITH JOIN   ##########
done 0
done 2
done 1
done 2
ALL WORK DONE
TOTAL=3
##########   WITHOUT JOIN   ##########
done 0
ALL WORK DONE
done 1
done 2
TOTAL=3
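
A case where skipping `join()` does change the result can be sketched as follows. This is an illustrative sketch, not part of the original program: the hypothetical `slow_worker` writes into a shared list instead of a blocking `Queue`, so nothing makes the main thread wait unless it calls `join()`.

```python
import time
from threading import Thread


def slow_worker(index, results):
    """Simulate work, then store a result in the shared list."""
    time.sleep(0.2)
    results[index] = 1


def run(use_join):
    results = [0, 0, 0]
    threads = [Thread(target=slow_worker, args=(i, results)) for i in range(3)]
    for t in threads:
        t.start()
    if use_join:
        for t in threads:
            t.join()  # block until every worker has stored its result
    total = sum(results)  # read immediately; there is no blocking get() here
    for t in threads:
        t.join()  # clean up so no thread outlives this function
    return total


print("with join:   ", run(use_join=True))
print("without join:", run(use_join=False))
```

With the join the total is always 3. Without it, the main thread typically reads the list while the workers are still sleeping and reports a smaller total, usually 0.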


---

## Q3:

What is faster, `pd.concat([df1,df2,df3])` or a loop of `df.append()`? Explain your reasoning.

In [4]:
%%time
import random
import pandas as pd
numRows=10000

df = pd.DataFrame(columns=["age","gender"])
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    df=df.append(df2)
df.head()

CPU times: user 13.2 s, sys: 226 ms, total: 13.5 s
Wall time: 13.2 s


|  | age | gender |
|---|---|---|
| 0 | 1 | F |
| 0 | 112 | M |
| 0 | 107 | M |
| 0 | 37 | F |
| 0 | 72 | F |


In [5]:
%%time
import random
numRows=10000
resultArr = []
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    resultArr.append(df2)

df=pd.concat(resultArr)
df.head()

CPU times: user 5.49 s, sys: 370 ms, total: 5.86 s
Wall time: 5.59 s


|  | age | gender |
|---|---|---|
| 0 | 106 | F |
| 0 | 29 | F |
| 0 | 80 | F |
| 0 | 34 | F |
| 0 | 29 | M |


## A3. 
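From the above example it can be seen that, for this case, `pd.concat` is quicker than a loop of `df.append`. `df.append` is just one special case of `pd.concat`, namely `(axis=0, join='outer')`, and every call returns a new DataFrame, copying all the rows accumulated so far. A loop of appends therefore does work that grows quadratically with the number of rows, whereas collecting the pieces in a list and calling `pd.concat` once copies the data a single time. Because `df.append` is both a less general way of concatenating and slow when used this way, `pandas` deprecated it in version 1.4 and removed it in 2.0.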

ref: 
- https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#whatsnew-140-deprecations-frame-series-append
- https://github.com/pandas-dev/pandas/issues/41828
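
The reasoning can be made concrete with a back-of-the-envelope cost model. This is a sketch under a simplifying assumption, not a benchmark: it assumes each `append` copies every row already in the frame plus the new one, while a single `concat` copies each row once. The function names are hypothetical.

```python
def rows_copied_by_append_loop(num_rows: int) -> int:
    """Rows copied when one row is appended at a time.

    Assumption: the k-th append builds a new frame of k rows, so the loop
    copies 1 + 2 + ... + num_rows rows in total.
    """
    return sum(range(1, num_rows + 1))


def rows_copied_by_single_concat(num_rows: int) -> int:
    """Rows copied when all pieces are concatenated in one call."""
    return num_rows


n = 10_000  # same numRows as the cells above
print(rows_copied_by_append_loop(n))    # 50005000
print(rows_copied_by_single_concat(n))  # 10000
```

The measured gap in the cells above is much smaller than this ratio, most likely because both cells also spend a large share of their time constructing 10,000 one-row DataFrames.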

---

## Q4:

Compare the purposes of 1,2, and 3:

    Flask/Django/Others

    apache2/nginx

    gunicorn/other WSGI

## A4.
1. *Flask/Django/Others*: Flask, Django, FastAPI and others are Python web frameworks that make it easy to create APIs and web applications. They hold the application logic, such as routing and request handling.
2. *apache2/nginx*: These are web servers that sit as a gateway between the backend and the internet. They are also commonly used as a reverse proxy that distributes traffic to multiple backend servers.
3. *gunicorn/other WSGI*: WSGI is a specification that describes the communication between a web server and Python web applications, and how web applications can be chained together to process one request. Gunicorn is a server that implements this specification: it is the layer between the web server and the framework code, and it runs the application in worker processes.
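
The WSGI interface in item 3 can be illustrated with a minimal sketch that uses only the standard library. The names `simple_app` and `call_app` are hypothetical; frameworks such as Flask and Django expose a callable with this same shape, and a WSGI server such as Gunicorn is what invokes it for each request.

```python
from wsgiref.util import setup_testing_defaults


def simple_app(environ, start_response):
    """A bare WSGI application: the layer a web framework normally provides."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(app, path):
    """Play the role of the WSGI server: build environ, call app, collect output."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app(simple_app, "/health"))  # ('200 OK', b'Hello from /health')
```

In a deployment, a command such as `gunicorn mymodule:simple_app` (assuming the code lives in a hypothetical `mymodule.py`) would take the place of `call_app`, with nginx or apache2 in front as the reverse proxy.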

---

## Q5:

You are writing a program that scrapes text from a long list of websites. How would you apply parallelism to speed up the scraping task?

## A5.
I would use different threads to scrape different websites. Scraping is mostly waiting on network I/O, and the order in which the websites are scraped does not matter, so I can use an executor pool and submit one scraping task per website to its threads.
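
A minimal sketch of this idea, using the standard library's `ThreadPoolExecutor`. The helper names `fetch_text` and `scrape_all`, the worker count, and the timeout are assumptions chosen for illustration; a real scraper would also parse the HTML and respect rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download one page and return its body as text (no HTML parsing here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=fetch_text, max_workers=8):
    """Fetch every URL in a thread pool; return {url: text or Exception}."""
    urls = list(urls)  # iterated twice below, so materialize it once

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:  # one bad site should not stop the others
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though downloads overlap in time
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Calling `scrape_all(list_of_urls)` returns a dictionary keyed by URL. Passing a different `fetch` function makes it possible to try the pool logic without touching the network.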

---

## Q6:

You are writing a python3 program. When should you use Docker and when should you use venv?

## A6.
venv only isolates Python dependencies, whereas Docker packages the application together with its OS-level dependencies (system libraries and tools; the container still shares the host kernel). For a simple application or local deployment venv is suitable, whereas for a more complex application with major OS dependencies Docker is preferred.
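
What a venv actually contains can be inspected with a small sketch based on the standard library's `venv` module. The helper name `describe_venv` and the directory name `demo-env` are hypothetical. The point is only to show that the environment is a directory of Python files that reuses an existing interpreter, with no system packages inside it.

```python
import os
import tempfile
import venv


def describe_venv():
    """Create a throwaway virtual environment and report its top-level contents."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = os.path.join(tmp, "demo-env")
        # with_pip=False keeps the demo fast; a real project usually wants pip
        venv.create(env_dir, with_pip=False)
        with open(os.path.join(env_dir, "pyvenv.cfg")) as cfg:
            config = cfg.read()
        return sorted(os.listdir(env_dir)), config


entries, config = describe_venv()
print(entries)  # a few directories plus pyvenv.cfg
print(config)   # records which base interpreter the environment reuses
```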