In [15]:
import re


### Part 1

In this first part, we're going to use Spark to analyze the following books, which I have downloaded from Project Gutenberg and saved to the data folder.

| File name | Book Title|
|:---------:|:----------|
|43.txt | The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson|
|84.txt | Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley |
|398.txt  | The First Book of Adam and Eve by Rutherford Hayes Platt|
|3296.txt | The Confessions of St. Augustine by Bishop of Hippo Saint Augustine|

Our objective is to cluster these books based on their similarity in terms of their most frequent context-specific words, i.e., not "the", "and", "or", etc.

* Once we've generated those words for these works, we will use the resulting word vectors to generate a hierarchical clustering that shows the similarity between these books.

For this assignment, make sure you're running the PySpark Docker environment I introduced in class. You can start it using the following command:

```
docker run --rm -p 4040:4040 -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook
```

Make sure you run the command from the directory containing this Jupyter notebook and your data folder.





### Prologue

An important aspect of Natural Language Processing is the identification of texts that are similar. A naive approach to deciding whether two documents are similar is to treat each book as a collection of words (a bag of words) and compare the documents based on these words. For example, one would expect two books on religion (e.g., books 398.txt and 3296.txt) to have more words in common than a book about religion and a book of science fiction (e.g., books 84.txt and 398.txt).
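As a rough illustration of the bag-of-words idea, here is a plain-Python sketch (no Spark needed) that scores two texts by the overlap of their word sets. The Jaccard measure, the helper names `bag_of_words` and `jaccard_similarity`, and the three toy sentences are all assumptions made for this example; the assignment does not prescribe a particular similarity measure.

```python
import re


def bag_of_words(text):
    """Return the set of lower-cased alphabetic words in `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_similarity(text_a, text_b):
    """Fraction of distinct words shared by the two texts (0.0 to 1.0)."""
    words_a, words_b = bag_of_words(text_a), bag_of_words(text_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Toy sentences invented for illustration only
religious_a = "God spoke to Adam in the garden"
religious_b = "Adam heard God in the garden"
sci_fi = "The creature fled across the ice"

print(jaccard_similarity(religious_a, religious_b))  # 5 shared words out of 8
print(jaccard_similarity(religious_a, sci_fi))       # 1 shared word out of 11
```

Note that the only word shared by the second pair is "the", which hints at why very common words say little about a document's topic.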

As mentioned above, we will be using Spark to analyze the data. While Spark is not necessary for such a small example, the platform would be ideal for analyzing a very large collection of documents, such as those often handled by large companies.

This part of the assignment will rely exclusively on RDDs.


QX. We'll start by importing Spark and making sure our environment is set up properly for the assignment.

Import the Spark context necessary to load a document as an RDD.

* Ignore any error messages

In [3]:
from pyspark import SparkContext
sc = SparkContext()
sc.version

'3.1.2'

QX Read in the file `43.txt` as a Spark RDD and save it to the variable `book_43`.
 * Make sure `book_43` is of type MapPartitionsRDD
   * `str(type(book_43)) == "<class 'pyspark.rdd.RDD'>"`


In [11]:
book_43 = sc.textFile('data/43.txt')
str(type(book_43)) == "<class 'pyspark.rdd.RDD'>"

True

QX How many lines does the file `book_43` contain?
* You can only use RDD operations (transformations or actions) to answer the question.
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [14]:
book_43.count()

2935

QX We first need to remove non-alphabetical characters and numbers. You can use the following function, which, given a line, removes digits and non-word characters and splits the line into a list of words.

```python
def clean_split_line(line):
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'[\W]+', ' ', a)
    return b.upper().split()
```

Use the function above on the variable `test_line` to see what it returns.
```python
test_line = "This is an example of that contains 234 and a dash-containing number"
```

In [19]:
def clean_split_line(line):
    line = re.sub(r'\d+', '', line)     # remove digits
    line = re.sub(r'[\W]+', ' ', line)  # replace runs of non-word characters with a space
    return line.split()
test_line = "This is an example of that contains 234 and a dash-containing number"
clean_split_line(test_line)

['This',
 'is',
 'an',
 'example',
 'of',
 'that',
 'contains',
 'and',
 'a',
 'dash',
 'containing',
 'number']
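Two details of the regular expressions are worth checking on small inputs. The sketch below repeats the helper so that it runs on its own; the two sample strings are made up for illustration. Apostrophes count as non-word characters, so a possessive such as "Jekyll's" leaves behind a stray `s`, while underscores count as word characters and are kept.

```python
import re


def clean_split_line(line):
    line = re.sub(r'\d+', '', line)     # remove digits
    line = re.sub(r'[\W]+', ' ', line)  # replace runs of non-word characters with a space
    return line.split()


# Invented sample strings, not taken from the book
possessive = clean_split_line("Dr. Jekyll's will, dated 1885")
underscore = clean_split_line("snake_case stays whole")

print(possessive)  # ['Dr', 'Jekyll', 's', 'will', 'dated']
print(underscore)  # ['snake_case', 'stays', 'whole']
```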

QX How many words does this book contain? To answer this question, you may find it useful to apply the function in a Spark fashion.
* You can only use RDD operations (transformations or actions) to answer the question.
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [23]:
book_43.flatMap(clean_split_line).count()

29116
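The reason `flatMap` is the right tool here, rather than `map`, can be seen in a plain-Python analogy (no Spark needed). This is only a sketch of the idea: the two sample lines are invented, and `itertools.chain` stands in for the flattening that Spark performs.

```python
from itertools import chain

# Two invented lines standing in for the lines of an RDD
lines = ["Mr. Utterson the lawyer", "was a man of a rugged countenance"]

# map-like: one output item (a list of words) per input line
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are concatenated into one stream of words
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped))       # 2, i.e. the number of lines
print(len(flat_mapped))  # 11, i.e. the number of words
```

Counting the result of a `map` would therefore give the number of lines again, whereas counting the result of a `flatMap` gives the number of words.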

QX How many of the words in book_43 are unique? Since words can appear in any case (e.g., The, THE, the), make sure you convert the words to lower case (an arbitrary choice).



In [27]:
book_43.flatMap(clean_split_line).map(lambda x: x.lower()).distinct().collect()

['project',
 'gutenberg',
 'ebook',
 'of',
 'strange',
 'mr',
 'hyde',
 'robert',
 'stevenson',
 'this',
 'is',
 'use',
 'anyone',
 'anywhere',
 'in',
 'united',
 'other',
 'world',
 'at',
 'no',
 'restrictions',
 'whatsoever',
 'may',
 'give',
 'away',
 'online',
 'www',
 'org',
 'are',
 'have',
 'check',
 'country',
 'where',
 'before',
 'using',
 'title',
 'october',
 'language',
 'set',
 'encoding',
 'utf',
 'produced',
 'widger',
 'start',
 'contents',
 'story',
 'search',
 'was',
 'quite',
 'carew',
 'murder',
 'letter',
 'last',
 'night',
 's',
 'narrative',
 'henry',
 'full',
 'statement',
 'utterson',
 'rugged',
 'never',
 'cold',
 'scanty',
 'discourse',
 'backward',
 'long',
 'dusty',
 'yet',
 'somehow',
 'friendly',
 'meetings',
 'when',
 'wine',
 'his',
 'something',
 'human',
 'beaconed',
 'indeed',
 'way',
 'into',
 'but',
 'spoke',
 'only',
 'these',
 'symbols',
 'after',
 'more',
 'loudly',
 'acts',
 'he',
 'drank',
 'gin',
 'though',
 'crossed',
 'twenty',
 'years',
 

QX 

* Generate an RDD that contains the frequency of each word in `book_43`. Call the variable `book_43_counts`.
* Since the collection may contain a large number of words, it would be imprudent to collect all the words on the same machine. Instead, display the count of the first word in your list.
* You can only use RDD operations (transformations or actions) to answer the question.
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed
* Given the random nature of this operation, your result may be different. For me, the first entry was:

```
[('project', 88)]
```

In [31]:
book_43_counts = book_43.flatMap(clean_split_line).map(lambda x: (x.lower(),1)).reduceByKey(lambda x,y: x+y)
book_43_counts.take(1)


[('project', 88)]
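To make the map and `reduceByKey` steps concrete, here is a plain-Python sketch of the same logic on a handful of words. The helper `reduce_by_key` is a hypothetical stand-in written for this example, not a Spark function, and the word list is invented.

```python
# Invented word list standing in for the output of flatMap
words = ["The", "door", "the", "street", "door", "THE"]

# map step: emit a (word, 1) pair for every word, lower-cased
pairs = [(word.lower(), 1) for word in words]


def reduce_by_key(pairs, func):
    """Combine the values of identical keys with `func`, two at a time."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'the': 3, 'door': 2, 'street': 1}
```

In Spark the combining function is applied within each partition first and then across partitions, which is why it must give the same result regardless of the order in which values are combined; addition satisfies this.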

QX Sort book_43_counts and print the 20 most common words in book_43.
  * Hint: the function `sortByKey` sorts a collection of tuples on the first element of each tuple. Make sure you instead sort on the second element of each item in `book_43_counts`.
  * You can only use RDD operations (transformations or actions) to answer the question.
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [38]:
book_43_counts.map(lambda x:(x[1], x[0])).sortByKey(ascending=False).take(10)

[(1807, 'the'),
 (1068, 'of'),
 (1043, 'and'),
 (726, 'to'),
 (686, 'a'),
 (646, 'i'),
 (485, 'in'),
 (471, 'was'),
 (392, 'that'),
 (384, 'he')]
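The swap-then-sort trick can be checked in plain Python on a few of the counts shown in this notebook. This is only a sketch of the idea: `heapq.nlargest` with a key function is a plain-Python alternative that avoids swapping the pairs, and the choice of which pairs to include is arbitrary.

```python
import heapq

# A few (word, count) pairs taken from the outputs above, deliberately unordered
counts = [("project", 88), ("of", 1068), ("a", 686),
          ("the", 1807), ("to", 726), ("and", 1043)]

# Approach 1: swap to (count, word) so that sorting acts on the count
swapped_top_3 = sorted(((n, w) for w, n in counts), reverse=True)[:3]

# Approach 2: keep (word, count) and tell the selection which element to compare
keyed_top_3 = heapq.nlargest(3, counts, key=lambda pair: pair[1])

print(swapped_top_3)  # [(1807, 'the'), (1068, 'of'), (1043, 'and')]
print(keyed_top_3)    # [('the', 1807), ('of', 1068), ('and', 1043)]
```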

QX 
Note that the most frequent words in `book_43_counts` include stop words such as `of`, `the`, `and`, etc.
It would be foolish to compare documents based on whether or not they contain such stop words. As such, it's common to remove them.
The library `sklearn.feature_extraction` provides access to a collection of English stop words. Those are accessible using the following snippet:

```
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS
```

* Explore the frozen set data structure (a set that you cannot modify) by printing any 10 words from it.
 * Hint: convert the frozen set to something you can subscript




In [44]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
list(ENGLISH_STOP_WORDS)[:10]

['once',
 'thru',
 'sometimes',
 'seem',
 'give',
 'everything',
 'upon',
 'hereby',
 'never',
 'formerly']
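The behaviour of a frozen set can be explored without scikit-learn. In the sketch below, `stop_words` is a small hypothetical stand-in for `ENGLISH_STOP_WORDS`, chosen only so that the example runs on its own.

```python
# Hypothetical stand-in for ENGLISH_STOP_WORDS (the real set is much larger)
stop_words = frozenset({"the", "and", "or", "of"})

try:
    stop_words[0]                  # sets have no positions, so indexing fails
except TypeError as err:
    print("cannot subscript:", err)

print(hasattr(stop_words, "add"))  # False: a frozenset has no mutating methods

as_list = sorted(stop_words)       # a list can be sliced; sorting fixes the order
print(as_list[:2])                 # ['and', 'of']
print("the" in stop_words)         # True: membership tests still work
```

Because a set has no defined order, `list(...)[:10]` on the real stop-word set can return different words from one session to the next; sorting first makes the output reproducible.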

QX

Filter the words in `book_43_counts` by removing those that appear in `ENGLISH_STOP_WORDS`.
Save the results to a new variable called `book_43_counts_filtered`.
  * You can only use RDD operations (transformations or actions) to answer the question.
  * Code that uses methods such as `some_rdd.X().Y().Z()...` is allowed
  * Code that uses functions such as `some_func(...)` is not allowed


In [52]:
book_43_counts_filtered = book_43_counts.filter(lambda x: x[0] not in ENGLISH_STOP_WORDS)


QX 
How many words are left in `book_43_counts_filtered` after removing stop words?

In [51]:
book_43_counts_filtered.count()

4296

QX Write a function, call it `process_RDD`, that combines the relevant steps above so that we can apply them to the remaining books. Your function should take a text file as input and:
 * Read in the file as a text RDD
 * Clean and split the lines
 * Filter out stop words
 * Return a word-count RDD where each item is a tuple of a word and its count.
 



In [54]:
def process_RDD(file_path):
    # Read the text file at `file_path` as an RDD of lines
    book_rdd = sc.textFile(file_path)
    # Split each line into words, lower-case them, and count each word
    book_rdd_counts = book_rdd.flatMap(clean_split_line).map(lambda x: (x.lower(), 1)).reduceByKey(lambda x, y: x + y)
    # Drop the stop words
    book_rdd_counts_filtered = book_rdd_counts.filter(lambda x: x[0] not in ENGLISH_STOP_WORDS)
    return book_rdd_counts_filtered

QX Apply the function `process_RDD` to books 84, 398 and 3296, save the results to the variables book_84_filtered, book_398_filtered and book_3296_filtered respectively, and print the number of distinct words left in each of these books after filtering stop words.



In [55]:
book_84_filtered = process_RDD('data/84.txt')
book_398_filtered = process_RDD('data/398.txt')
book_3296_filtered = process_RDD('data/3296.txt')
print(book_84_filtered.count(), book_398_filtered.count(), book_3296_filtered.count())