# CSC4008 tut7
## Part 1: Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set or not. It uses a bit array and a set of hash functions to represent the presence or absence of elements in the set.

The operation of a Bloom filter can be summarized as follows:

* Initialize the bit array: A Bloom filter uses a bit array of m bits, which is initially set to all zeros.

* Choose k hash functions: A Bloom filter uses k hash functions, which map an element to k different positions in the bit array. Ideally, the hash functions should be **independent and uniformly** distributed.

* Add elements to the Bloom filter: To add an element to the Bloom filter, apply each of the k hash functions to the element and set the corresponding bits in the bit array to one. If a bit is already set to one, leave it unchanged.

* Check if an element is in the Bloom filter: To check if an element is in the Bloom filter, apply each of the k hash functions to the element and check if the corresponding bits in the bit array are **all set to one**. If any bit is set to zero, the element is definitely not in the set. If all bits are set to one, the element is probably in the set, but the answer may be a false positive.

* False positives: The probability of a false positive depends on the **size of the bit array**, **the number of hash functions**, and **the number of elements stored** in the Bloom filter.

When we insert an element into the Bloom filter, each hash function sets one position in the bit array to 1. Because each hash function is **uniformly distributed**, every position is equally likely to be chosen. For a single hash function and a given position, the probabilities that the position is set to 1 and that it is left at 0 are
$$p^{1} = \frac{1}{m}$$
$$p^{0} = 1 - \frac{1}{m}$$
After applying all k hash functions, and given that the hash functions are **independent**, the probability that a given position is still zero is
$$p_{k}^{0} = (1 - \frac{1}{m})^k$$
After inserting n elements,
$$p_{k,n}^{0} = (1 - \frac{1}{m})^{kn}$$
$$p_{k,n}^{1} = 1 - (1 - \frac{1}{m})^{kn}$$
When a new element arrives in the data stream and it is not a duplicate of any previous item, the Bloom filter still reports it as present if all k of its positions are already set to 1. Treating these k positions as independent, the probability of this is
$$p_{fp} = [1 - (1 - \frac{1}{m})^{kn}]^{k} = [1 - ((1 - \frac{1}{m})^{-m})^{-\frac{kn}{m}}]^{k}$$
This is known as the **false positive probability**.
Using the limit
$$\lim_{x \to \infty} (1 - \frac{1}{x})^{-x} = e$$
we further have, for large m:
$$p_{fp} \approx (1 - e^{-\frac{kn}{m}})^{k}$$
From the above formula, for a fixed k, increasing m or decreasing n reduces the false positive probability.
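
As a quick numerical check of the two expressions for $p_{fp}$, the following minimal sketch evaluates both the exact form and the exponential approximation. The function names and the parameter values (m = 1000, n = 100, k = 7) are assumptions chosen only for illustration.

```python
import math

def fp_exact(m, n, k):
    """False positive probability [1 - (1 - 1/m)^(kn)]^k."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def fp_approx(m, n, k):
    """Approximation (1 - e^(-kn/m))^k, valid for large m."""
    return (1 - math.exp(-k * n / m)) ** k

# Hypothetical parameters, chosen only for illustration.
m, n, k = 1000, 100, 7
exact = fp_exact(m, n, k)
approx = fp_approx(m, n, k)
print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")

# For a fixed k, a larger bit array or fewer elements lowers the probability.
larger_m = fp_approx(2 * m, n, k)
fewer_n = fp_approx(m, n // 2, k)
print(larger_m < approx, fewer_n < approx)
```

The two values should agree closely, which is why the simpler exponential form is normally used when sizing a filter.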

Furthermore, given m and n, we can calculate the optimal number of hash functions k.
Let $a = e^{\frac{n}{m}}$, so that
$$f(k) = (1 - a^{-k})^{k} \approx p_{fp}$$
$$\ln f(k) = k\ln(1 - a^{-k})$$
Differentiating both sides with respect to k,
$$\frac{1}{f(k)}f^{'}(k) = \ln(1 - a^{-k}) + \frac{ka^{-k}\ln a}{1-a^{-k}}$$
When $f(k) \approx p_{fp}$ reaches its minimum, we have $f^{'}(k)=0$. Therefore,
$$\ln(1 - a^{-k}) + \frac{ka^{-k}\ln a}{1-a^{-k}} = 0$$
$$(1-a^{-k})\ln(1 - a^{-k}) = -ka^{-k}\ln a$$
$$(1-a^{-k})\ln(1 - a^{-k}) = a^{-k}\ln(a^{-k})$$
Both sides have the form $x\ln x$, with $x = 1 - a^{-k}$ on the left and $x = a^{-k}$ on the right, so the equation holds when the two are equal:
$$1 - a^{-k} = a^{-k}$$
$$a^{-k} = \frac{1}{2}$$
Replacing a by $e^{\frac{n}{m}}$,
$$e^{\frac{-kn}{m}}= \frac{1}{2}$$
Finally,
$$k = \frac{m}{n}\ln 2 \approx \frac{0.7m}{n}$$
Since k must be a whole number, in practice it is rounded to a nearby integer.
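
The result can be checked numerically. The sketch below computes $k = \frac{m}{n}\ln 2$ and compares it with a brute-force search over integer values of k. The helper names and the values m = 1000, n = 100 are assumptions made for illustration.

```python
import math

def fp_approx(m, n, k):
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Real-valued minimizer k = (m / n) * ln 2."""
    return (m / n) * math.log(2)

# Hypothetical sizes, chosen only for illustration.
m, n = 1000, 100
k_star = optimal_k(m, n)
k_int = max(1, round(k_star))  # k must be a positive integer in practice

# Brute-force search over integer k as an independent check.
best_k = min(range(1, 51), key=lambda k: fp_approx(m, n, k))
print(f"k* = {k_star:.3f}, rounded = {k_int}, brute force = {best_k}")

# At the optimum, e^(-kn/m) = 1/2, i.e. about half of the bits are set.
print(math.exp(-k_star * n / m))
```

At the optimum each bit is equally likely to be 0 or 1, so the false positive probability there is approximately $(\frac{1}{2})^{k}$.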



In [None]:
!pip install bitarray
!pip install mmh3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitarray
  Downloading bitarray-2.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (269 kB)
Installing collected packages: bitarray
Successfully installed bitarray-2.7.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mmh3
  Downloading mmh3-3.0.0-cp39-cp39-manylinux2010_x86_64.whl (49 kB)
Installing collected packages: mmh3
Successfully installed mmh3-3.0.0


In [None]:
from bitarray import bitarray
import mmh3

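# Bloom filter backed by a bit array, using k seeded MurmurHash3 functions
class BloomFilter:
  def __init__(self, size, k):
    self.bitarr = bitarray(size)  # bit array of m = size bits
    self.bitarr.setall(0)         # start with every bit cleared
    self.size = size
    self.k = k                    # number of hash functions

  def __len__(self):
    # Size of the bit array (m), not the number of inserted elements
    return self.size

  def add(self, item):
    # Seeds 0..k-1 turn one hash routine into k different hash functions
    for seed in range(self.k):
      idx = mmh3.hash(item, seed) % self.size
      self.bitarr[idx] = 1

  def judgeContains(self, item):
    # A single zero bit proves the item was never added;
    # all ones means "possibly present" (may be a false positive)
    for seed in range(self.k):
      idx = mmh3.hash(item, seed) % self.size
      if self.bitarr[idx] == 0:
        return False
    return True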

In [None]:
#Testing
def main():
  bloom = BloomFilter(10000, 20)
  animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
        'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
        'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
  #Insert animals into the bloom filter
  for animal in animals:
    bloom.add(animal)

  #check existence for animals
  #all animals here are not in bloom filter
  test_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
           'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
           'deer','elephant','frog','falcon','goat','gorilla','hawk']
  for animal in test_animals:
    if bloom.judgeContains(animal):
      print("{} is not in the BloomFilter, but a false positive occurred".format(animal))
    else:
      print("{} is not in the BloomFilter as expected".format(animal))

main()


badger is not in the BloomFilter as expected
cow is not in the BloomFilter as expected
pig is not in the BloomFilter as expected
sheep is not in the BloomFilter as expected
bee is not in the BloomFilter as expected
wolf is not in the BloomFilter as expected
fox is not in the BloomFilter as expected
whale is not in the BloomFilter as expected
shark is not in the BloomFilter as expected
fish is not in the BloomFilter as expected
turkey is not in the BloomFilter as expected
duck is not in the BloomFilter as expected
dove is not in the BloomFilter as expected
deer is not in the BloomFilter as expected
elephant is not in the BloomFilter as expected
frog is not in the BloomFilter as expected
falcon is not in the BloomFilter as expected
goat is not in the BloomFilter as expected
gorilla is not in the BloomFilter as expected
hawk is not in the BloomFilter as expected
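
No false positives appear in this run, which is what the formula predicts: with m = 10000, k = 20 and n = 19 inserted animals, $p_{fp}$ is vanishingly small. To actually observe false positives, the filter has to be made much smaller. The following is a minimal sketch of such an experiment, not the class used above: it relies only on the standard library, with `hashlib.sha256` standing in for `mmh3` and a `bytearray` standing in for `bitarray`. The class name `TinyBloom`, the item names and the sizes are all assumptions made for illustration.

```python
import hashlib
import math

class TinyBloom:
    """Standard-library Bloom filter sketch (sha256 stands in for mmh3)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Prefixing the seed gives k different hash functions.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))

# Predicted probability for the parameters of the run above.
demo_fp = (1 - math.exp(-20 * 19 / 10000)) ** 20
print(f"predicted p_fp for the run above: {demo_fp:.2e}")

# A deliberately small filter (hypothetical sizes).
m, n, k = 1000, 100, 7
bloom = TinyBloom(m, k)
for i in range(n):
    bloom.add(f"member-{i}")

# Inserted items are always reported as present (no false negatives).
no_false_negatives = all(f"member-{i}" in bloom for i in range(n))

# Query items that were never inserted and count the false positives.
trials = 10_000
false_positives = sum(f"other-{i}" in bloom for i in range(trials))
empirical = false_positives / trials
predicted = (1 - math.exp(-k * n / m)) ** k
print(f"empirical = {empirical:.4f}, predicted = {predicted:.4f}")
```

The empirical rate should land near the predicted value, with some deviation because the formula is an approximation and the sample is finite.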


## Part 2: Studying COVID-19

### Setup

Let's set up Spark on your Colab environment.  Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=eaf6f4d3689b966f13fc305b0afd41ce0fbfbcb82e0cb184a0bb897ee2d5656d
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

Now we authenticate a Google Drive client to download the files we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Download each CSV from Google Drive by its file ID
# (file_id is used instead of id to avoid shadowing the built-in id())
file_id = '1YT7ttUAafCjbVdm6obeHp1TWAK0rEtoR'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('time_series_covid19_confirmed_global.csv')

file_id = '1YxEA5UQ2EFJ_9oLssM__Gs1ncVNufGNA'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('time_series_covid19_deaths_global.csv')

file_id = '1CNxszuZTeIw-5cF5yrzKMZdb1qV0hSoy'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('time_series_covid19_recovered_global.csv')

If you executed the cells above, you should be able to see the dataset we will use for this Colab under the "Files" tab on the left panel.

Next, we import some of the common libraries needed for our task.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.

In [None]:
# configure Spark (web UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# create the context and the session
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

### Data Loading

In this Colab, we will be analyzing the time series data of the Coronavirus COVID-19 Global Cases, collected by Johns Hopkins CSSE.

Here you can check a realtime dashboard based on this dataset: [https://www.arcgis.com/apps/opsdashboard/index.html?fbclid=IwAR2hQKsEZ3D38wVtXGryUhP9CG0Z6MYbUM_boPEaV8FBe71wUvDPc65ZG78#/bda7594740fd40299423467b48e9ecf6](https://www.arcgis.com/apps/opsdashboard/index.html?fbclid=IwAR2hQKsEZ3D38wVtXGryUhP9CG0Z6MYbUM_boPEaV8FBe71wUvDPc65ZG78#/bda7594740fd40299423467b48e9ecf6)

---



*   ```confirmed```: dataframe containing the cumulative number of confirmed COVID-19 cases, divided by geographical area
*   ```deaths```: dataframe containing the cumulative number of deaths due to COVID-19, divided by geographical area
*   ```recovered```: dataframe containing the cumulative number of recovered patients, divided by geographical area

The data sets contain data entries for each day, representing the cumulative totals as of that day.







In [None]:
confirmed = spark.read.csv('time_series_covid19_confirmed_global.csv', header=True)
deaths = spark.read.csv('time_series_covid19_deaths_global.csv', header=True)
recovered = spark.read.csv('time_series_covid19_recovered_global.csv', header=True)

In [None]:
confirmed.printSchema()
confirmed.count()

root
 |-- Province/State: string (nullable = true)
 |-- Country/Region: string (nullable = true)
 |-- Lat: string (nullable = true)
 |-- Long: string (nullable = true)
 |-- 1/22/20: string (nullable = true)
 |-- 1/23/20: string (nullable = true)
 |-- 1/24/20: string (nullable = true)
 |-- 1/25/20: string (nullable = true)
 |-- 1/26/20: string (nullable = true)
 |-- 1/27/20: string (nullable = true)
 |-- 1/28/20: string (nullable = true)
 |-- 1/29/20: string (nullable = true)
 |-- 1/30/20: string (nullable = true)
 |-- 1/31/20: string (nullable = true)
 |-- 2/1/20: string (nullable = true)
 |-- 2/2/20: string (nullable = true)
 |-- 2/3/20: string (nullable = true)
 |-- 2/4/20: string (nullable = true)
 |-- 2/5/20: string (nullable = true)
 |-- 2/6/20: string (nullable = true)
 |-- 2/7/20: string (nullable = true)
 |-- 2/8/20: string (nullable = true)
 |-- 2/9/20: string (nullable = true)
 |-- 2/10/20: string (nullable = true)
 |-- 2/11/20: string (nullable = true)
 |-- 2/12/20: string (

275

### Your Task

Consider the entries for May 1, 2021, in the time series, and compute:


*   number of confirmed COVID-19 cases across the globe
*   number of deaths due to COVID-19 (across the globe)
*   number of recovered patients across the globe



In [None]:
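# YOUR CODE HERE
# Here sum is pyspark.sql.functions.sum (imported with *), not Python's built-in.
# The columns were read as strings, so Spark casts them to double when summing.
confirmed.select(sum("5/1/21").alias("sum_confirmed")).show()
deaths.select(sum("5/1/21").alias("sum_deaths")).show()
recovered.select(sum("5/1/21").alias("sum_recovered")).show()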



+-------------+
|sum_confirmed|
+-------------+
| 1.52196159E8|
+-------------+

+----------+
|sum_deaths|
+----------+
| 3192930.0|
+----------+

+-------------+
|sum_recovered|
+-------------+
|  8.8919401E7|
+-------------+



Consider the data points for March 1, 2020, and March 1, 2021, and filter out the geographical locations where fewer than 50 cases have been confirmed.
For the areas still taken into consideration after the filtering step, compute the ratio between the number of deaths and the number of confirmed cases.

In [None]:
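# YOUR CODE HERE
def death_ratio(date):
  # Total confirmed cases per country on the given date
  df = confirmed.groupBy("Country/Region").agg(sum(date).alias("Sum_confirmed"))
  # Note: this condition keeps the countries with fewer than 50 confirmed cases;
  # to exclude them instead, it would be df.Sum_confirmed >= 50
  df_filtered = df.filter(df.Sum_confirmed < 50)
  df_filtered.show()

  # Total deaths per country on the same date
  deaths_df = deaths.groupBy("Country/Region").agg(sum(date).alias("Sum_deaths"))
  result = df_filtered.join(deaths_df,df_filtered["Country/Region"] == deaths_df["Country/Region"],"leftouter"). \
        select(df_filtered["Country/Region"],df_filtered.Sum_confirmed,deaths_df.Sum_deaths)
  # Ratio between deaths and confirmed cases
  result = result.withColumn("ratio",result.Sum_deaths/result.Sum_confirmed)
  result.show(100)
  print(result.count())

# March 1, 2020
death_ratio("3/1/20")
# March 1, 2021
death_ratio("3/1/21")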


+--------------+-------------+
|Country/Region|Sum_confirmed|
+--------------+-------------+
|          Chad|          0.0|
|      Paraguay|          0.0|
|        Russia|          2.0|
|         Yemen|          0.0|
|       Senegal|          0.0|
|    Cabo Verde|          0.0|
|        Sweden|         14.0|
|        Guyana|          0.0|
|         Burma|          0.0|
|       Eritrea|          0.0|
|   Philippines|          3.0|
|      Djibouti|          0.0|
|      Malaysia|         29.0|
|          Fiji|          0.0|
|        Turkey|          0.0|
|        Malawi|          0.0|
|          Iraq|         19.0|
|       Comoros|          0.0|
|   Afghanistan|          1.0|
|      Cambodia|          1.0|
+--------------+-------------+
only showing top 20 rows

+--------------------+-------------+----------+--------------------+
|      Country/Region|Sum_confirmed|Sum_deaths|               ratio|
+--------------------+-------------+----------+--------------------+
|                Chad| 
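
The cell above keeps the countries whose confirmed total is below 50. If "filter out" is read as excluding those countries, the condition is reversed; in Spark that would be `df.filter(df.Sum_confirmed >= 50)`. The following plain-Python sketch illustrates that reading on made-up totals. The country labels and numbers are hypothetical and are not taken from the dataset.

```python
# Hypothetical per-country totals for one date: (confirmed, deaths).
totals = {
    "A": (1200, 36),
    "B": (49, 1),   # fewer than 50 confirmed cases: excluded
    "C": (50, 0),   # exactly 50 confirmed cases: kept
    "D": (0, 0),    # excluded, which also avoids a zero denominator
}

THRESHOLD = 50

# Keep only the countries with at least THRESHOLD confirmed cases,
# then compute deaths / confirmed for the remaining ones.
ratios = {
    country: deaths / confirmed
    for country, (confirmed, deaths) in totals.items()
    if confirmed >= THRESHOLD
}
print(ratios)
```

Under this reading, every remaining country has a non-zero denominator, so the ratio is always defined.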

Consider the data points for March 1, 2021, and May 1, 2021, in the time series, and filter out the geographical locations where fewer than 50 deaths have been confirmed (as of March 1, 2021).
For the areas still taken into consideration after the filtering step, compute the percent increase in cumulative deaths between the two dates (with respect to the number of confirmed cases as of March 1, 2021).

In [None]:
# YOUR CODE HERE
# Total confirmed cases per country on March 1, 2021
df = confirmed.groupBy("Country/Region").agg(sum("3/1/21").alias("Sum_confirmed"))
# Note: this condition filters on confirmed cases and keeps the countries below 50
df_filtered = df.filter(df.Sum_confirmed < 50)
df_filtered.show()

# Cumulative deaths per country on the two dates
deaths_df = deaths.groupBy("Country/Region"). \
        agg(sum("3/1/21").alias("Sum_deaths_start"),
          sum("5/1/21").alias("Sum_deaths_end"))
result = df_filtered.join(deaths_df,df_filtered["Country/Region"] == deaths_df["Country/Region"],"leftouter"). \
        select(df_filtered["Country/Region"],df_filtered.Sum_confirmed,deaths_df.Sum_deaths_start,deaths_df.Sum_deaths_end)
# Increase in deaths relative to confirmed cases on March 1, 2021
# (a fraction; it is not multiplied by 100)
result = result.withColumn("precent_increase",(result.Sum_deaths_end - result.Sum_deaths_start)/result.Sum_confirmed)
result.show()






+--------------------+-------------+
|      Country/Region|Sum_confirmed|
+--------------------+-------------+
|          MS Zaandam|          9.0|
|    Marshall Islands|          4.0|
|                Laos|         45.0|
|            Holy See|         27.0|
|Saint Kitts and N...|         41.0|
|               Samoa|          3.0|
|     Solomon Islands|         18.0|
|          Micronesia|          1.0|
|             Vanuatu|          1.0|
+--------------------+-------------+

+--------------------+-------------+----------------+--------------+----------------+
|      Country/Region|Sum_confirmed|Sum_deaths_start|Sum_deaths_end|precent_increase|
+--------------------+-------------+----------------+--------------+----------------+
|          MS Zaandam|          9.0|             2.0|           2.0|             0.0|
|    Marshall Islands|          4.0|             0.0|           0.0|             0.0|
|                Laos|         45.0|             0.0|           0.0|             0.0|
| 
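
The last column above is a fraction rather than a percentage, and the filter is applied to confirmed cases. As a minimal sketch of the task as worded (excluding regions with fewer than 50 deaths on the start date, and expressing the increase in deaths as a percentage of the confirmed cases on that date), the arithmetic could look as follows. The region labels, field names and numbers are hypothetical and serve only to illustrate the formula.

```python
# Hypothetical per-region values on the start and end dates.
regions = {
    "A": {"confirmed_start": 10_000, "deaths_start": 200, "deaths_end": 250},
    "B": {"confirmed_start": 900, "deaths_start": 10, "deaths_end": 12},  # < 50 deaths: excluded
}

MIN_DEATHS = 50

def percent_increase(r):
    """Increase in cumulative deaths as a percentage of confirmed cases on the start date."""
    return 100 * (r["deaths_end"] - r["deaths_start"]) / r["confirmed_start"]

# Keep only regions with at least MIN_DEATHS deaths on the start date.
result = {
    name: percent_increase(r)
    for name, r in regions.items()
    if r["deaths_start"] >= MIN_DEATHS
}
print(result)
```

In Spark, the same two choices would correspond to filtering on the start-date deaths column with `>= 50` and multiplying the computed column by 100.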