In [15]:
import re


### Part 1

In this first part, we're going to use Spark to analyze the following books, which I have downloaded from Project Gutenberg and saved to the data folder.

| File name | Book Title|
|:---------:|:----------|
|43.txt | The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson|
|84.txt | Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley |
|398.txt  | The First Book of Adam and Eve by Rutherford Hayes Platt|
|3296.txt | The Confessions of St. Augustine by Bishop of Hippo Saint Augustine|

Our objective is to cluster these 4 books based on their similarity in terms of their most frequent context-specific words, i.e. not "the", "and", "or", etc.

* Once we've generated those words for these 4 works, we will use the resulting vectors to generate a hierarchical clustering that shows the similarity between these books.

For this assignment, you will need to make sure you're running from the PySpark Docker environment I introduced in class. You can start the PySpark Docker environment using the following command:

```
docker run --rm -p 4040:4040 -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook
```

Make sure you run the command from the directory containing this Jupyter notebook and your data folder.


<b>
# WARNING: For some reason, what I was pushing to GitHub didn't always sync properly. I suspect it had something to do with running in a mounted volume in Docker. As such, I strongly encourage you to push often and to check that the document synced properly to GitHub.
</b>

### Prologue

An important aspect of Natural Language Processing is the identification of texts that are similar. A naive approach to deciding whether two documents are similar is to treat each book as a collection of words (or bag of words) and compare the documents based on these words. For example, one would expect two books whose topic is religion (e.g. books 398.txt and 3296.txt) to have more words in common than a book that talks about religion and a book that discusses science fiction (e.g. books 84.txt and 398.txt).
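As a rough illustration of the bag-of-words idea, the sketch below compares three short made-up sentences (not quotes from the books) by the number of distinct words they share. The helper name `bag_of_words` is hypothetical, and plain Python sets stand in for the Spark RDDs used later in the assignment.

```python
import re


def bag_of_words(text):
    """Return the set of distinct lower-case words in a text (hypothetical helper)."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Made-up sentences for illustration only; they are not quotes from the books.
religion_1 = "faith and prayer guide the soul toward god"
religion_2 = "the soul seeks god through prayer and confession"
science_fiction = "the creature was built in a laboratory by a scientist"

# Words shared by the two "religion" texts, and by one of them with the other topic.
shared_religion = bag_of_words(religion_1) & bag_of_words(religion_2)
shared_mixed = bag_of_words(religion_1) & bag_of_words(science_fiction)

print(sorted(shared_religion))
print(sorted(shared_mixed))
```

In this toy case the only word shared across topics is the stop word "the", which hints at why stop words get filtered out later on.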

As mentioned above, we will be using Spark to analyze the data. While Spark is not necessary for such a small example, the platform would be ideal for analyzing a very large collection of documents, such as those often handled by large companies.

This part of the assignment will rely exclusively on RDDs.


QX. We'll start by importing Spark and making sure our environment is set up properly for the assignment.

Import the Spark context necessary to load a document as an RDD.

* Ignore any error messages

In [3]:
from pyspark import SparkContext
sc = SparkContext()
sc.version

'3.1.2'

QX Read in the file `43.txt` as a Spark RDD and save it to the variable `book_43`
 * make sure `book_43` is of type MapPartitionsRDD
   * str(type(book_43)) == "<class 'pyspark.rdd.RDD'>"


In [11]:
book_43 = sc.textFile('data/43.txt')
str(type(book_43)) == "<class 'pyspark.rdd.RDD'>"

True

QX How many lines does the file `book_43` contain?
* You can only use operations or actions to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [14]:
book_43.count()

2935

QX We first need to remove the occurrences of non-alphabetical characters and numbers. You can use the following function, which, given a line, removes digits and non-word characters and splits it into a collection of words.

```python
def clean_split_line(line):
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'[\W]+', ' ', a)
    return b.upper().split()
```

Use the function above on the variable `test_line` to see what it returns.
```python
test_line = "This is an example of that contains 234 and a dash-containing number"
```

In [19]:
def clean_split_line(line):
    line = re.sub(r'\d+', '', line)
    line = re.sub(r'[\W]+', ' ', line)
    return line.split()
test_line = "This is an example of that contains 234 and a dash-containing number"
clean_split_line(test_line)

['This',
 'is',
 'an',
 'example',
 'of',
 'that',
 'contains',
 'and',
 'a',
 'dash',
 'containing',
 'number']

QX How many words does this book contain? To answer this question, you may find it useful to apply the function in a Spark fashion. 
* You can only use operations or actions to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [23]:
book_43.flatMap(clean_split_line).count()

29116

QX How many of the words in `book_43` are unique? Given that words can appear in any case (e.g. The, THE, the), make sure you convert the words to lower case (an arbitrary choice).



In [27]:
book_43.flatMap(clean_split_line).map(lambda x: x.lower()).distinct().collect()

['project',
 'gutenberg',
 'ebook',
 'of',
 'strange',
 'mr',
 'hyde',
 'robert',
 'stevenson',
 'this',
 'is',
 'use',
 'anyone',
 'anywhere',
 'in',
 'united',
 'other',
 'world',
 'at',
 'no',
 'restrictions',
 'whatsoever',
 'may',
 'give',
 'away',
 'online',
 'www',
 'org',
 'are',
 'have',
 'check',
 'country',
 'where',
 'before',
 'using',
 'title',
 'october',
 'language',
 'set',
 'encoding',
 'utf',
 'produced',
 'widger',
 'start',
 'contents',
 'story',
 'search',
 'was',
 'quite',
 'carew',
 'murder',
 'letter',
 'last',
 'night',
 's',
 'narrative',
 'henry',
 'full',
 'statement',
 'utterson',
 'rugged',
 'never',
 'cold',
 'scanty',
 'discourse',
 'backward',
 'long',
 'dusty',
 'yet',
 'somehow',
 'friendly',
 'meetings',
 'when',
 'wine',
 'his',
 'something',
 'human',
 'beaconed',
 'indeed',
 'way',
 'into',
 'but',
 'spoke',
 'only',
 'these',
 'symbols',
 'after',
 'more',
 'loudly',
 'acts',
 'he',
 'drank',
 'gin',
 'though',
 'crossed',
 'twenty',
 'years',
 

QX 

* Generate an RDD that contains the frequency of each word in `book_43`. Call the variable `book_43_counts`.
* Since the collection may contain a large number of words, it would be imprudent to collect all the words on the same machine. Instead, display the count of the first word in your list.
  * You can only use operations or actions to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed
* Given the random nature of this operation, your result may be different. For me, the first entry was:

```
[('project', 88)]
```

In [31]:
book_43_counts = book_43.flatMap(clean_split_line).map(lambda x: (x.lower(),1)).reduceByKey(lambda x,y: x+y)
book_43_counts.take(1)


[('project', 88)]

QX Sort `book_43_counts` and print the 20 most common words in `book_43`. 
  * Hint: the function `sortByKey` sorts a collection of tuples on the first element of each tuple. Make sure you instead sort on the second element of each tuple in `book_43_counts`.
  * You can only use operations or actions to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [38]:
book_43_counts.map(lambda x:(x[1], x[0])).sortByKey(ascending=False).take(10)

[(1807, 'the'),
 (1068, 'of'),
 (1043, 'and'),
 (726, 'to'),
 (686, 'a'),
 (646, 'i'),
 (485, 'in'),
 (471, 'was'),
 (392, 'that'),
 (384, 'he')]
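
The swap-then-sort pattern used here can also be pictured on an ordinary list of `(word, count)` pairs. The sketch below is plain Python rather than Spark and uses a few of the counts printed above as sample data; `heapq.nlargest` with a `key` argument plays a role loosely similar to sorting on the second tuple element and taking the top entries.

```python
import heapq

# A few (word, count) pairs copied from the output above, in arbitrary order.
sample_counts = [("to", 726), ("the", 1807), ("a", 686), ("of", 1068), ("and", 1043)]

# Rank by the second element of each pair (the count) and keep the 3 largest.
top_3 = heapq.nlargest(3, sample_counts, key=lambda pair: pair[1])
print(top_3)
```

Ranking with a key function avoids having to swap each pair into `(count, word)` form first.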

QX 
Note that the most frequent words in `book_43_counts` include stop words such as `of`, `the`, `and`, etc.
It would be foolish to compare documents based on whether or not they contain such stop words. As such, it's common to remove them.
The library `sklearn.feature_extraction` provides access to a collection of English stop words. Those are accessible using the following snippet:

```
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS
```

* Explore the frozen set data structure (a set that you cannot modify) by printing any 10 words from it. 
 * Hint: convert the frozen set to something you can subscript




In [44]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
list(ENGLISH_STOP_WORDS)[:10]

['once',
 'thru',
 'sometimes',
 'seem',
 'give',
 'everything',
 'upon',
 'hereby',
 'never',
 'formerly']

QX

Filter the words in `book_43_counts` by removing those that appear in `ENGLISH_STOP_WORDS`.
Save the results to a new variable called `book_43_counts_filtered`.
  * You can only use operations or actions to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [52]:
book_43_counts_filtered = book_43_counts.filter(lambda x: x[0] not in ENGLISH_STOP_WORDS)


QX 
How many words are left in `book_43_counts_filtered` after removing stop words?

In [51]:
book_43_counts_filtered.count()

4296

QX Write a function, call it `process_RDD`, that combines the relevant steps above so that we can apply them to the three remaining books. Your function should take a text file as input and:
 * Read in the file as a text RDD
 * Clean and split each line
 * Filter out stop words
 * Return a word count RDD where each item is a tuple of a word and its count.
 



In [56]:
def process_RDD(file_path):
    book_rdd = sc.textFile(file_path)
    book_rdd_counts = book_rdd.flatMap(clean_split_line).map(lambda x: (x.lower(),1)).reduceByKey(lambda x,y: x+y)
    book_rdd_counts_filtered = book_rdd_counts.filter(lambda x: x[0] not in ENGLISH_STOP_WORDS)
    return book_rdd_counts_filtered

QX Apply the function `process_RDD` to book_84, book_398 and book_3296, save the results to the variables `book_84_counts_filtered`, `book_398_counts_filtered` and `book_3296_counts_filtered` respectively, and print the number of distinct words left in each of these books after filtering stop words.



In [62]:
book_84_counts_filtered = process_RDD("data/84.txt")
book_398_counts_filtered = process_RDD("data/398.txt")
book_3296_counts_filtered = process_RDD("data/3296.txt")

print("Book 84 count is: ", book_84_counts_filtered.count())
print("Book 398 count is: ", book_398_counts_filtered.count())
print("Book 3296 count is: ", book_3296_counts_filtered.count())

Book 84 count is:  7016
Book 398 count is:  2421
Book 3296 count is:  7293


QX. In the prologue, we discussed evaluating the similarity between two texts using the number of words they share. If that holds, book_398 and book_3296, which both talk about religion, will have more words in common than, say, book_84 and book_398. Test this hypothesis by writing code that compares and prints the number of words shared first between book_398 and book_3296 and then between book_84 and book_398.


In [73]:
print("number of words shared between book_398 and book_3296 is:")
print(book_398_counts_filtered.map(lambda x: x[0]).intersection(book_3296_counts_filtered.map(lambda x: x[0])).count())
print("number of words shared between book_84 and book_3296 is:")
print(book_84_counts_filtered.map(lambda x: x[0]).intersection(book_3296_counts_filtered.map(lambda x: x[0])).count())

number of words shared between book_398 and book_3296 is:
1790
number of words shared between book_84 and book_3296 is:
3608


QX. Based on the above, do you think counting the number of shared words is a good idea? Justify your answer.
* Hint: what's common to both book_84 and book_3296? 

######  ANSWER
Not a good idea. Both book_84 and book_3296 are much larger than the other books and are most likely to have more words in common just by virtue of their length.

### Part II

Another approach to estimating similarity consists of computing the Euclidean distance across a set of words. For example, suppose we have 3 books A, B and C with the following counts for the words `evolution`, `DNA`, `biology` and `finance`. 
```python 
A = [4, 9, 6, 8]
B = [3, 7, 7, 10]
C = [15, 10, 1, 1]
```
Note that although all documents contain exactly the same four words, the number of times these words are used may be indicative of their topic. For example, documents A and B are more likely to be business-related since the word finance occurs frequently (8 and 10 times respectively). The third may be a technical document since it focuses more on technical words (evolution and DNA) and less on finance.

The Euclidean distance, which can be computed with SciPy using the snippet below, is more indicative of topic-relatedness between two documents.
```python
from scipy.spatial.distance import euclidean 
print(f"The Euclidean distance between A and B is: {euclidean(A, B)}")

print(f"The Euclidean distance between A and C is: {euclidean(A, C)}")

print(f"The Euclidean distance between B and C is: {euclidean(B, C)}")
```

In [77]:
from scipy.spatial.distance import euclidean 

A = [4, 9, 6, 8]
B = [3, 7, 7, 10]
C = [15, 10, 1, 1]

print(f"The Euclidean distance between A and B is: {euclidean(A, B)}")
print(f"The Euclidean distance between A and C is: {euclidean(A, C)}")
print(f"The Euclidean distance between B and C is: {euclidean(B, C)}")

The Euclidean distance between A and B is: 3.1622776601683795
The Euclidean distance between A and C is: 14.0
The Euclidean distance between B and C is: 16.431676725154983
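
For reference, the Euclidean distance between two count vectors is the square root of the sum of the squared element-wise differences. The sketch below recomputes the three distances from that definition using only the standard library; the helper name `euclidean_distance` is made up for this illustration and is not part of SciPy.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


A = [4, 9, 6, 8]
B = [3, 7, 7, 10]
C = [15, 10, 1, 1]

# Squared differences for A and C: 121 + 1 + 25 + 49 = 196, so the distance is 14.
print(euclidean_distance(A, B))
print(euclidean_distance(A, C))
print(euclidean_distance(B, C))
```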


QX To test whether the Euclidean distance is a good approach to identify similar clusters, we first need to identify the set of words across which we will compare the documents. Here, we will explore the words that are common to all 4 documents. We will store the data in a matrix called `counts_matrix`.

Start by finding the words that are common to all four documents after stop-word filtering and store the counts for each word in a column of `counts_matrix`. 

To take the previous example, you can generate an empty matrix with 3 rows (books A, B and C) and 4 columns (words `evolution`, `DNA`, `biology` and `finance`) using the following code.

```python
import numpy as np
counts_matrix = np.zeros([3,4])
```

After generating the counts, you can fill in the counts for a document, say A, using the following code:

```python
counts_matrix[0, :] = [4, 9, 6, 8] 
```
* Other than for building the counts into `counts_matrix`, you should exclusively use operations or actions on the RDD to answer the question. 
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed



In [96]:
import numpy as np

# Words that appear in all four books after stop-word filtering.
common_words = (
    book_43_counts_filtered
    .map(lambda x: x[0])
    .intersection(book_84_counts_filtered.map(lambda x: x[0]))
    .intersection(book_398_counts_filtered.map(lambda x: x[0]))
    .intersection(book_3296_counts_filtered.map(lambda x: x[0]))
).collect()

# One row per book, one column per common word.
counts_matrix = np.zeros([4, len(common_words)])

# Each row follows the order in which collect() returns that book's common words.
pairs = book_43_counts_filtered.filter(lambda x: x[0] in common_words).collect()
counts_matrix[0, :] = [count for _, count in pairs]

pairs = book_84_counts_filtered.filter(lambda x: x[0] in common_words).collect()
counts_matrix[1, :] = [count for _, count in pairs]

pairs = book_398_counts_filtered.filter(lambda x: x[0] in common_words).collect()
counts_matrix[2, :] = [count for _, count in pairs]

pairs = book_3296_counts_filtered.filter(lambda x: x[0] in common_words).collect()
counts_matrix[3, :] = [count for _, count in pairs]
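
One thing to keep in mind when filling a matrix this way is that `collect()` does not promise that every book returns its words in the same order, so a given column may not refer to the same word in every row. The sketch below shows one possible way to pin the column order, using small made-up dictionaries in place of the collected `(word, count)` pairs. The names `book_counts` and `counts_rows` are hypothetical, and nested lists stand in for the NumPy matrix.

```python
# Made-up counts for illustration only; they are not taken from the books.
book_counts = {
    "book_a": {"night": 12, "god": 3, "life": 7, "door": 9},
    "book_b": {"life": 20, "night": 5, "god": 1, "science": 4},
    "book_c": {"god": 40, "life": 11, "night": 2, "garden": 6},
}

# Words present in every book, sorted so each column has one fixed meaning.
common_words = sorted(set.intersection(*(set(c) for c in book_counts.values())))

# One row per book; every row reads the words in the same sorted order.
counts_rows = [
    [counts[word] for word in common_words]
    for counts in book_counts.values()
]

print(common_words)
print(counts_rows)
```

With RDDs, sorting each filtered RDD by key before collecting (for example with `sortByKey()`) would likely have a similar effect, although that variant is not what produced the outputs shown in this notebook.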

QX. Compute the Euclidean distance between book_398 and book_3296, which both talk about religion, and between book_84 and book_398. What do you conclude about using the Euclidean distance for comparing documents?



In [99]:
print(f"The Euclidean distance between  \
      book_398 and book_3296 is: {euclidean(counts_matrix[2], counts_matrix[3])}")
print(f"The Euclidean distance between \
      book_84 and book_398 is: {euclidean(counts_matrix[1], counts_matrix[2])}")


The Euclidean distance between        book_398 and book_3296 is: 1470.8415958219293
The Euclidean distance between       book_84 and book_398 is: 867.2508287687017


Again, the counts are biased simply because the longer books contain more words, so the raw distance mostly reflects document length rather than topic.

QX 
Bonus question (5 points): Can you think of a few things we could do to improve similarity between documents that pertain to the same topic? Justify your answer without giving code.

* Normalize the data by the total number of words in the book.
* Selecting only words that are common to all four books excludes words that are common to a single topic (e.g. religion). All those words, which are highly specific, are missing.
* Ultimately, words alone don't mean much. If one book uses the words religion and dogma and a second book uses the words faith and belief, the match would be nil. If, however, we could use synonyms, we would know that all four words are closely related.
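
The first suggestion can be sketched with the toy vectors from Part II. This is only an illustration, under the assumption that each vector holds all the counts for its document; the helper names `relative_frequencies` and `distance` and the vector `A_long` are made up here. Dividing by the total turns raw counts into proportions, so a long and a short document with the same word mix end up close together.

```python
import math


def relative_frequencies(counts):
    """Divide each count by the vector's total so the entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


A = [4, 9, 6, 8]
A_long = [40, 90, 60, 80]  # hypothetical document: same word mix as A, ten times longer
C = [15, 10, 1, 1]

raw = distance(A, A_long)
normalized = distance(relative_frequencies(A), relative_frequencies(A_long))
normalized_a_c = distance(relative_frequencies(A), relative_frequencies(C))

print(raw)             # large, driven purely by length
print(normalized)      # close to zero, since the proportions match
print(normalized_a_c)  # stays clearly above zero: the word mixes differ
```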

### Part III

In this part we will build some basic analytics from data pertaining to all flights operated in the US by US airlines in one month. Here, you should exclusively use Spark DataFrames.

Load the file `flight_info.csv` into a Spark DataFrame. 

  * Note that you will have to create a `SparkSession` prior to loading the data

In [None]:
from pyspark.sql import SparkSession

session = SparkSession(sc)