
## Bloom Filters

### Setup

In this Colab, we need to install [bloom_filter](https://github.com/hiway/python-bloom-filter), a Python library that provides an implementation of Bloom filters. Run the cell below!

In [1]:
!pip install bloom_filter

Collecting bloom_filter
  Downloading bloom_filter-1.3.3-py3-none-any.whl (8.1 kB)
Installing collected packages: bloom-filter
Successfully installed bloom-filter-1.3.3


### Data Loading

From the NLTK (Natural Language ToolKit) library, we import a large list of English dictionary words, commonly used by the very first spell-checking programs in Unix-like operating systems.

In [1]:
import nltk
nltk.download('words')

from nltk.corpus import words
word_list = words.words()
print(f'Dictionary length: {len(word_list)}')
print(word_list[:15])

Dictionary length: 236736
['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru']


[nltk_data] Downloading package words to /Users/mohamed/nltk_data...
[nltk_data]   Package words is already up-to-date!


Then we load another dataset from the NLTK Corpora collection: ```movie_reviews```.

The movie reviews are categorized as either *positive* or *negative*, so we build one list of words (usually called a **bag of words**) for each category.

In [2]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

neg_reviews = []
pos_reviews = []

for fileid in movie_reviews.fileids('neg'):
  neg_reviews.extend(movie_reviews.words(fileid))
for fileid in movie_reviews.fileids('pos'):
  pos_reviews.extend(movie_reviews.words(fileid))

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/mohamed/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


### Your task

In this Colab, you will develop a very simplistic spell-checker. By no means should you consider using it in a real-world application, but it is an interesting exercise that highlights the strengths and weaknesses of Bloom filters!

In [3]:
from bloom_filter import BloomFilter

word_filter = BloomFilter(max_elements=236736)

for word in word_list:
  word_filter.add(word)

word_set = set(word_list)

If you executed the cell above, you now have 3 different variables in your scope:

1.   ```word_list```, a Python list containing the English dictionary (sorted in case-insensitive order)
2.   ```word_filter```, a Bloom filter where we have already added all the words in the English dictionary
3.   ```word_set```, a [Python set](https://docs.python.org/3.6/library/stdtypes.html#set-types-set-frozenset) built from the same list of words in the English dictionary
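
To make the idea behind ```word_filter``` more concrete, here is a minimal sketch of a Bloom filter written from scratch. It is not the implementation used by the `bloom_filter` library: the class name `TinyBloomFilter`, the salted SHA-256 hashing scheme, and the default `error_rate` of 0.1 are all assumptions chosen for illustration. It only shows the core mechanism: each item sets a handful of bits in a fixed-size bit array, and a membership test checks whether all of those bits are set.

```python
import hashlib
import math


class TinyBloomFilter:
    """Illustrative Bloom filter; NOT the bloom_filter library's implementation."""

    def __init__(self, max_elements, error_rate=0.1):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2
        self.num_bits = math.ceil(-max_elements * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / max_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 hash with the probe index
        # (a hypothetical scheme, picked only because it needs no extra packages).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


tiny = TinyBloomFilter(max_elements=1000)
for w in ["aardvark", "aardwolf", "Aaron"]:
    tiny.add(w)

print("aardvark" in tiny)  # an added word is always reported as present
```

Because the filter never stores the words themselves, it cannot produce false negatives, but unrelated words may occasionally hit bits that are already set and be reported as present.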

Let's inspect the size of each data structure using the [getsizeof()](https://docs.python.org/3/library/sys.html#sys.getsizeof) function!



In [5]:
from sys import getsizeof

print(f'Size of word_list (in bytes): {getsizeof(word_list)}')
print(f'Size of word_filter (in bytes): {getsizeof(word_filter)}')
print(f'Size of word_set (in bytes): {getsizeof(word_set)}')

# YOUR CODE HERE


Size of word_list (in bytes): 2115944
Size of word_filter (in bytes): 48
Size of word_set (in bytes): 8388824


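The Bloom filter appears to be by far the smallest structure, but read these numbers with care: ```getsizeof()``` reports only the shallow size of each object. The 48 bytes reported for ```word_filter``` do not include the bit array it references, and the figures for ```word_list``` and ```word_set``` do not include the string objects they point to. Even so, the Bloom filter remains very memory-efficient, because it stores only a few bits per word rather than the words themselves.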
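
Since ```getsizeof()``` does not follow references, a rough estimate of the filter's real footprint can be derived from the standard Bloom filter sizing formula $m = -n \ln p / (\ln 2)^2$. The sketch below assumes that the library keeps its default `error_rate` of 0.1 (the notebook only passes `max_elements`) and that it sizes the bit array with this formula; both are assumptions, not something verified against the installed version.

```python
import math

n = 236736  # max_elements passed to BloomFilter in this notebook
p = 0.1     # ASSUMED default error_rate of the bloom_filter library

# Standard sizing formula for a Bloom filter with n elements and error rate p
num_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
num_bytes = math.ceil(num_bits / 8)

print(f"bits: {num_bits}, bytes: {num_bytes}, bits per word: {num_bits / n:.2f}")
```

Under these assumptions the bit array takes on the order of 140 KB, which is far more than 48 bytes but still much smaller than the shallow sizes reported for the list and the set.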

Now let's find out how fast the main operation for which we build Bloom filters is: *membership testing*. To do so, we will use the ```%timeit``` IPython magic command, which times the repeated execution of a single Python statement.

In [7]:
%timeit -r 3 "California" in word_filter
%timeit -r 3 "California" in word_list
%timeit -r 3 "California" in word_set

# YOUR CODE HERE


9.03 µs ± 90.6 ns per loop (mean ± std. dev. of 3 runs, 100000 loops each)
279 µs ± 5.7 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
33.1 ns ± 0.595 ns per loop (mean ± std. dev. of 3 runs, 10000000 loops each)


Notice the performance gap between the linear search on a list, the multiple hash computations in a Bloom filter, and the single hash computation in a native Python ```set```.
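
The ```%timeit``` magic is only available inside IPython or Jupyter. As a hedged sketch of the same comparison in plain Python, the standard `timeit` module can be used instead. The synthetic `items` list below is a stand-in for the dictionary (an assumption made so the example runs without NLTK), and absolute timings will differ from machine to machine.

```python
import timeit

# Synthetic stand-in for the dictionary (hypothetical data, not the NLTK corpus)
items = [f"word{i}" for i in range(100_000)]
items_set = set(items)
target = items[-1]  # worst case for the linear scan: the last element

# Time 50 membership tests on each structure
list_time = timeit.timeit(lambda: target in items, number=50)
set_time = timeit.timeit(lambda: target in items_set, number=50)

print(f"list: {list_time:.6f} s for 50 lookups")
print(f"set:  {set_time:.6f} s for 50 lookups")
```

The list lookup scans elements one by one, so its cost grows with the dictionary size, while the set lookup hashes the word once and stays roughly constant.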

We now have all the building blocks required to build our spell-checker, and we understand the performance tradeoffs of each data structure we chose. Write a function that takes as arguments (1) a list of words and (2) any of the 3 dictionary data structures we constructed. The function must return the number of words that **do not appear** in the dictionary.

In [7]:
# YOUR CODE HERE
count = 0
for word in neg_reviews:
  if word not in word_filter:
    count+=1
print(count)


193802


In [19]:
count = 0
for word in neg_reviews:
  if word not in word_list:
    count+=1
print(count)

210258
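
The two loops above can be folded into the single function the task asks for. The following is one possible sketch: the name `count_unknown_words` and the tiny sample inputs are hypothetical, and the function relies only on the `in` operator, so it should accept a list, a set, or any object that supports membership tests.

```python
def count_unknown_words(words, dictionary):
    """Return how many items of `words` are not found in `dictionary`.

    `dictionary` can be any container that supports the `in` operator.
    """
    return sum(1 for word in words if word not in dictionary)


# Tiny hypothetical inputs, used here instead of the NLTK corpora
sample_dictionary = ["a", "film", "good", "plot"]
sample_review = ["a", "good", "film", ",", "gr8", "plot"]

print(count_unknown_words(sample_review, sample_dictionary))       # 2
print(count_unknown_words(sample_review, set(sample_dictionary)))  # 2
```

In the notebook, the equivalent calls would be `count_unknown_words(neg_reviews, word_filter)` and `count_unknown_words(neg_reviews, word_set)`; the set gives the same answer as the list but avoids a linear scan for every word. The gap between the two counts above, 210258 - 193802 = 16456, corresponds to words that the Bloom filter reported as present although they are not in the dictionary, that is, false positives.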
