# Setup Notebook for Exercises

##### <span style="color:red">IMPORTANT: Only modify cells which have the following comment:</span>
```python
# Modify this cell
```
##### <span style="color:red">Do not add any new cells when you submit the homework</span>

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc=SparkContext(master="local[4]")

In [3]:
import Tester.WordCount as WordCount
pickleFile="Tester/WordCount.pkl"

Importing all packages necessary to complete the homework

In [4]:
import numpy as np

In [5]:
WordCount.get_data()

# Exercise
A `k`-mer is a sequence of `k` consecutive words. 

For example, the `3`-mers in the line `you are my sunshine my only sunshine` are

* `you are my`
* `are my sunshine`
* `my sunshine my`
* `sunshine my only`
* `my only sunshine`

For the sake of simplicity we consider only the `k`-mers that appear in a single line. In other words, we ignore `k`-mers that span more than one line.
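The sliding-window idea behind this definition can be sketched in plain Python, without Spark. The helper name `kmers_in_line` is hypothetical and used only for illustration; it is not part of the homework or the `Tester` package.

```python
def kmers_in_line(line, k):
    """Return every k-mer (tuple of k consecutive words) found in a single line."""
    # split() with no argument drops leading, trailing, and repeated spaces
    words = line.split()
    # A line with n words has n - k + 1 windows; range() is empty when n < k
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


print(kmers_in_line("you are my sunshine", 3))
# [('you', 'are', 'my'), ('are', 'my', 'sunshine')]
```

Because each line is processed on its own, no window can span two lines, which matches the simplification above. A line with fewer than `k` words yields an empty list.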

Write a function, using Spark all the way to the end, to find the top 10 `k`-mers in a given text for a given `k`.

Specifically, write functions with the following signatures:
```python
def map_kmers(text,k):
    # text: an RDD of text lines. Lines contain only lower-case letters and spaces. Spaces should be ignored.
    # k: length of `k`-mers
    return  singles
    # singles: an RDD of pairs of the form (tuple of k words,1)
def count_kmers(singles):
    # singles: as above
    return counts
    # counts: an RDD of pairs of the form (tuple of k words, number of occurrences)
def sort_counts(counts):
    # counts: as above
    return sorted_counts
    # sorted_counts: an RDD of pairs of the form (number of occurrences, tuple of k words), sorted by decreasing number of occurrences
```
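To make the data shape at each stage concrete, here is a minimal local sketch that mimics the three steps on an ordinary Python list instead of an RDD. It is not a Spark solution, and the `local_*` names and the sample lines are assumptions made up for illustration.

```python
from collections import Counter


def local_map_kmers(lines, k):
    # Stage 1: emit one (k-mer, 1) pair per window in each line
    singles = []
    for line in lines:
        words = line.split()
        singles.extend((tuple(words[i:i + k]), 1)
                       for i in range(len(words) - k + 1))
    return singles


def local_count_kmers(singles):
    # Stage 2: add up the 1s for each distinct k-mer
    counts = Counter()
    for kmer, one in singles:
        counts[kmer] += one
    return list(counts.items())


def local_sort_counts(counts):
    # Stage 3: swap to (count, k-mer) and order by count, largest first
    return sorted(((n, kmer) for kmer, n in counts), reverse=True)


lines = ["the whale and the sea", "the whale"]  # made-up sample input
singles = local_map_kmers(lines, 2)
counts = local_count_kmers(singles)
top = local_sort_counts(counts)[:2]
print(top[0])
# (2, ('the', 'whale'))
```

In this sketch, ties between equal counts are broken by comparing the k-mer tuples, because `sorted` compares whole pairs. A Spark sort on the count key alone makes no such promise, so the order among equally frequent `k`-mers may differ.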

######  <span style="color:blue">Code:</span>
```python 
text_file = sc.textFile(u'../../Data/Moby-Dick.txt')
getkmers(text_file,5,2, map_kmers, count_kmers, sort_counts)
```
######  <span style="color:magenta">Output:</span>
most common 2-mers<br>
1796:	(u'of', u'the')<br>
1145:	(u'in', u'the')<br>
708:	(u'to', u'the')<br>
408:	(u'from', u'the')<br>
376:	(u'the', u'whale')

In [6]:
def map_kmers(text,k):
    # split() with no argument never yields empty strings, so blank lines
    # become empty lists and contribute no k-mers
    singles = text.map(lambda line: line.lower().split())\
        .flatMap(lambda words: [tuple(words[i:i+k]) for i in range(len(words)+1-k)])\
        .map(lambda kmer: (kmer, 1))
    return  singles

def count_kmers(singles):
    # Sum the 1s for each distinct k-mer
    counts = singles.reduceByKey(lambda a, b: a+b)
    return counts
    
def sort_counts(counts):
    # Swap to (count, k-mer) and sort by count, largest first
    sorted_counts = counts.map(lambda kv: (kv[1], kv[0]))\
        .sortByKey(False)
    return sorted_counts

In [7]:
# Do Not modify this cell
def getkmers(text_file, l,k, map_kmers, count_kmers, sort_counts):
    # text_file: the text_file RDD read above
    # k: k-mers
    # l: l most common k-mers
    
    import re
    def removePunctuation(text):
        return re.sub("[^0-9a-zA-Z ]", " ", text)
    text = text_file.map(removePunctuation)\
                    .map(lambda x: x.lower())
    
    singles=map_kmers(text,k)
    count=count_kmers(singles)
    sorted_counts=sort_counts(count)
    
    C=sorted_counts.take(l)
    print 'most common %d-mers\n'%k,'\n'.join(['%d:\t%s'%c for c in C])

In [8]:
# First, check that the text file is where we expect it to be
%ls -l ../../Data/Moby-Dick.txt

-rw-r--r-- 1 root root 1257260 Apr 24 21:57 ../../Data/Moby-Dick.txt


In [9]:
text_file = sc.textFile(u'../../Data/Moby-Dick.txt')

In [10]:
# Print the 5 most common 2-mers using getkmers
getkmers(text_file,5,2, map_kmers, count_kmers, sort_counts)

most common 2-mers
1796:	(u'of', u'the')
1145:	(u'in', u'the')
708:	(u'to', u'the')
408:	(u'from', u'the')
376:	(u'the', u'whale')


In [11]:
import Tester.WordCount as WordCount
WordCount.exercise(pickleFile, map_kmers, count_kmers, sort_counts, sc)

Correct Output: [(91, (u'of', u'the', u'whale')), (91, (u'the', u'sperm', u'whale')), (76, (u'the', u'white', u'whale')), (55, (u'one', u'of', u'the')), (54, (u'of', u'the', u'sea'))]
Great Job!

