---
#### Overview
For this assignment, you will use Python and Spark to analyze the [pointwise mutual information (PMI)](http://en.wikipedia.org/wiki/Pointwise_mutual_information) of tokens in the text of Shakespeare's plays. You will need the same text file (`Shakespeare.txt`) and Python tokenizer module (`simple_tokenize.py`) that you used for the first two assignments, and you will use the same definition of PMI that was used for [Assignment 1](https://lintool.github.io/bigdata-2018w/assignment1-431.html).

To use Spark from within a Python program, it is first necessary to tell the Python interpreter where the Spark installation is located.   You will be using the Spark installation in the CS451 course account.   The code in the following cell tells Python how to find this Spark installation.   Before going on, execute that code (by selecting the cell and hitting 'return' while holding down the shift key).   It will take a few seconds to run, and will produce no output.

In [1]:
import findspark
findspark.init("/u/cs451/packages/spark")

from pyspark import SparkContext, SparkConf

ModuleNotFoundError: No module named 'findspark'

Once Python knows where Spark is located, you can create a `SparkContext`.   All Spark commands must run within an active `SparkContext`.   The code below will create a `SparkContext`, and store a reference to the context in the variable `sc`. 
The `appName` parameter assigns a name of your choosing to the Spark jobs that are created in this context - this is useful mostly for debugging.   The `master` parameter indicates that Spark jobs will run in local mode, using two threads.   This means that your Spark jobs are not really running on a cluster (since we do not have a Spark cluster in the CS student computing environment), and are instead running in a single process on the local machine.   You program Spark jobs the same way whether they run in local mode or on a cluster - the main difference between local and cluster modes is, of course, performance.

Run the code in the cell below to create a Spark context.   Creating the `SparkContext` causes your Python program (running in this notebook) to prepare to run Spark jobs, and will take a few seconds to complete.  Be sure that you run this code only one time, because a single Python program may only have one active SparkContext.

In [2]:
sc = SparkContext(appName="YourTest", master="local[2]")

Next, let's test that your `SparkContext` has been set up properly by running some simple test code (adapted from the [Spark examples page](https://spark.apache.org/examples.html)).   This code uses a single Spark job to estimate the value of $\pi$.  `parallelize()` and `filter()` are Spark *transformations*, and `count()` is a Spark *action*.   Study the code in the cell below, then go ahead and run it.   It should take a few seconds, since a Spark job is being created and executed, and should print an estimate of $\pi$ when it finishes.   

In [5]:
import random

num_samples = 100000000

def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi_estimate = 4 * count / num_samples
print(pi_estimate)

3.14169016


---
#### Question 1  (4/30 marks):

In the following cell, briefly explain how the $\pi$-estimation example works.   What is the Spark job doing, and how is it used to estimate the value of $\pi$?

#### Your answer to Question 1:

*The underlying math: take the unit square from (0, 0) to (1, 1), whose area is $S_1 = 1$, and the unit circle centered at (0, 0). One quarter of the circle lies inside the square, and the area of that quarter is $S_2 = \pi/4$. Since $S_2/S_1 = \pi/4$, this ratio can be approximated by the probability $P$ that a dart thrown uniformly at random at the unit square lands inside the circle. Thus $\hat{\pi} = 4P$, where $P$ is estimated as (number of darts inside the circle) / (total number of darts).
The `inside()` function performs one dart throw: it returns True if the dart lands inside the circle and False otherwise. Its argument is ignored; only the two random coordinates matter.
The Spark job works as follows. First, it creates a parallelized collection holding the numbers 0 to num_samples-1. Then `filter()` applies `inside()` to each element of the collection and keeps only the elements for which `inside()` returns True. The surviving elements are the throws that landed inside the circle, so `count()` returns the number of hits. Plugging that count and the total number of throws (num_samples) into the formula above gives the estimate of $\pi$.*
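
The dart-throwing argument can also be checked without Spark. The following is a minimal sequential sketch of the same estimate, not part of the assignment code; `estimate_pi` and its `seed` parameter are names introduced here for illustration only.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # the point lies inside the quarter circle
            hits += 1
    # hits / num_samples approximates pi / 4
    return 4 * hits / num_samples

print(estimate_pi(200_000))
```

The Spark version distributes the loop body across partitions with `filter()` and sums the hits with `count()`; the arithmetic at the end is the same.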

---
#### Question 2  (4/30 marks):

Now it is your turn to write some Spark programs.   Start with the simple task of counting the number of *distinct* tokens which appear in `Shakespeare.txt`.   You have already written Python code to do this in Assignment 1, but for this assignment we want you to use Spark to solve the same problem.   You should compare the answer you get using Spark with the answer you got from your pure-Python solution from Assignment 1.   Both answers should, of course, be the same.

Your code should use Spark, not the Python driver code, to read `Shakespeare.txt` and do the counting.   The idea is to use Spark to give you a data-parallel alternative to the sequential Python solution you wrote for Assignment 1.

Write your solution in the code cell below.   It should use the `SparkContext` which was created previously (referenced by the variable `sc`), and it should print the number of distinct tokens.

In [7]:
from simple_tokenize import simple_tokenize
textFile = sc.textFile("Shakespeare.txt")
counts = textFile.flatMap(lambda line: simple_tokenize(line))\
                  .distinct()\
                  .count()
print(counts)


25975
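
To cross-check the Spark result against a sequential computation, as the question suggests, a pure-Python counter along the following lines could be used. This is only a sketch: `stand_in_tokenize` is a hypothetical placeholder, and the real comparison would pass `simple_tokenize` from the course module instead.

```python
def stand_in_tokenize(line):
    # Hypothetical tokenizer for illustration only; the assignment uses simple_tokenize.
    return line.lower().split()

def count_distinct_tokens(lines, tokenize=stand_in_tokenize):
    """Return the number of distinct tokens over all lines."""
    seen = set()
    for line in lines:
        seen.update(tokenize(line))  # a set keeps each token once
    return len(seen)

sample = ["To be or not to be", "that is the question"]
print(count_distinct_tokens(sample))  # prints 8
```

With the real file, the call would look like `count_distinct_tokens(open("Shakespeare.txt"), simple_tokenize)`, assuming the file is in the working directory.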


---
#### Question 3  (4/30 marks):

Next, write a Spark program that will count the number of distinct token pairs in `Shakespeare.txt`, as you did in Assignment 1.   Again, ensure that the answer that you get using Spark matches the answer you got in the first assignment.

Write your solution in the code cell below.   It should use the `SparkContext` which was created previously (referenced by the variable `sc`), and it should print the number of distinct token pairs.

In [8]:
# your solution to Question 3 here
from simple_tokenize import simple_tokenize
def combinations(text):
    # For each line, return every ordered pair (x, y) of different tokens as a tuple.
    t = set(simple_tokenize(text))  # deduplicate first; distinct() removes repeats anyway
    C_pair = []
    for i in t:
        for j in t:
            if i != j:
                C_pair.append((i, j))
    return C_pair

textFile = sc.textFile("Shakespeare.txt")
count_2 = textFile.flatMap(combinations)\
                  .distinct()\
                  .count()
print(count_2)

1969760
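
A sequential counterpart can help confirm the pair count on a small input. The sketch below assumes, as the Spark code above does, that a pair is an ordered tuple of two different tokens from the same line; `count_distinct_pairs` is a name introduced here, and it takes lines that are already tokenized.

```python
from itertools import permutations

def count_distinct_pairs(token_lines):
    """Count distinct ordered pairs (x, y), x != y, that share at least one line."""
    pairs = set()
    for tokens in token_lines:
        # permutations of the line's distinct tokens, taken two at a time
        pairs.update(permutations(set(tokens), 2))
    return len(pairs)

sample = [["a", "b", "a"], ["b", "c"], ["a", "b"]]
print(count_distinct_pairs(sample))  # prints 4: (a,b), (b,a), (b,c), (c,b)
```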


---
#### Question 4  (6/30 marks):

Next, write Spark code that will calculate $n(x)$ and $p(x)$ (as defined in Assignment 1) for every distinct token $x$ in `Shakespeare.txt`.   Your code should report (print) the 50 highest-probability tokens, and their probabilities.

Make sure that your solution calculates $n(x)$ and $p(x)$ and identifies the 50 highest-probability tokens in a data-parallel fashion, using Spark transformations and actions.   Only the 50 highest-probability tokens (and their probabilities) should be returned by Spark to your driver code.

Write your solution in the code cell below.   It should use the `SparkContext` which was created previously (referenced by the variable `sc`), and it should print the 50 highest-probability tokens, along with their counts ($n(x)$) and probabilities ($p(x)$).

In [9]:
# your solution to Question 4 here
from simple_tokenize import simple_tokenize
textFile = sc.textFile("Shakespeare.txt")
row_num = textFile.count()  # total number of lines in the text

def token(text):
    # distinct tokens on one line, so each token counts at most once per line
    return set(simple_tokenize(text))

def combiner(t):
    # (token x, n(x)) -> (token x, n(x), p(x)), where p(x) = n(x) / total number of lines
    return (t[0], t[1], t[1] / row_num)

# tokenize each line and keep its distinct tokens -> flatMap -> emit (token x, 1)
# -> reduceByKey to get (token x, n(x)) -> map to (token x, n(x), p(x))
# -> sort by p(x) in descending order -> take the 50 highest
count_3 = textFile.flatMap(token)\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda a, b: a + b)\
        .map(combiner)\
        .sortBy(lambda t: t[2], False)\
        .take(50)

# iterate over what was returned, in case there are fewer than 50 distinct tokens
for tok, n_x, p_x in count_3:
    print("token:{0},n(x):{1},p(x):{2}\n".format(tok, n_x, p_x))

token:and,n(x):24604,p(x):0.2009178657172255

token:the,n(x):24300,p(x):0.19843538192686472

token:i,n(x):18657,p(x):0.1523542765682928

token:to,n(x):18237,p(x):0.148924529226347

token:of,n(x):16624,p(x):0.1357526662202551

token:a,n(x):13280,p(x):0.10844534452628657

token:you,n(x):12196,p(x):0.09959332995802643

token:my,n(x):11549,p(x):0.09430988583840991

token:in,n(x):10614,p(x):0.08667461497003054

token:that,n(x):10569,p(x):0.08630714204053634

token:is,n(x):8756,p(x):0.07150206601447026

token:not,n(x):8230,p(x):0.06720671577193814

token:with,n(x):7552,p(x):0.06167012363422561

token:me,n(x):7396,p(x):0.06039621747864574

token:for,n(x):7326,p(x):0.05982459292165477

token:it,n(x):7147,p(x):0.05836286726877787

token:be,n(x):6662,p(x):0.054402325695340446

token:this,n(x):6425,p(x):0.05246696826667102

token:his,n(x):6403,p(x):0.05228731483447386

token:your,n(x):6233,p(x):0.050899083767495794

token:but,n(x):6205,p(x):0.05067043394469941

token:he,n(x):5816,p(x):0.047493834
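
For a small input, the same quantities can be computed sequentially. The sketch below follows what the solution code above does, counting each token at most once per line and dividing by the number of lines; `top_tokens` is a hypothetical helper name, and it expects lines that are already tokenized.

```python
from collections import Counter
import heapq

def top_tokens(token_lines, k):
    """Return the k tokens with the highest p(x) as (token, n(x), p(x)) tuples."""
    token_lines = list(token_lines)
    num_lines = len(token_lines)
    n = Counter()
    for tokens in token_lines:
        n.update(set(tokens))  # count each token once per line
    top = heapq.nlargest(k, n.items(), key=lambda kv: kv[1])
    return [(tok, cnt, cnt / num_lines) for tok, cnt in top]

sample = [["the", "king", "the"], ["the", "queen"], ["a", "king"], ["the"]]
for tok, cnt, p in top_tokens(sample, 2):
    print(tok, cnt, p)
```

On this sample, `the` appears on 3 of 4 lines and `king` on 2 of 4, so they are the two tokens returned.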

---
#### Question 5  (6/30 marks):

Next, write Spark code that will prompt the user to input a positive integer threshold $T$, and then print all token pairs that co-occur at least $T$ times in `Shakespeare.txt`.   For each such pair $(x,y)$, the program should also report $n(x,y)$ and PMI$(x,y)$.    You can compare the results produced by this code with the results of Two-Token queries (from Assignment 1) for consistency.

As always, calculations should be done in data-parallel fashion by using Spark.
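
For reference, the solution code in this notebook computes the quantities below. They are read off that code rather than quoted from Assignment 1, so treat them as an assumption about the Assignment 1 definition. Here $L$ is the total number of lines in `Shakespeare.txt` (`row_num` in the code), $n(x)$ is the number of lines containing $x$, and $n(x,y)$ is the number of lines containing both $x$ and $y$.

$$
p(x) = \frac{n(x)}{L}, \qquad
p(x,y) = \frac{n(x,y)}{L}, \qquad
\mathrm{PMI}(x,y) = \log_{10}\frac{p(x,y)}{p(x)\,p(y)} = \log_{10}\frac{L \cdot n(x,y)}{n(x)\,n(y)}
$$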

In [None]:
# your solution to Question 5 here
from math import log
from simple_tokenize import simple_tokenize
textFile = sc.textFile("Shakespeare.txt")
row_num = textFile.count()  # total number of lines

def token(text):
    # distinct tokens on one line
    return set(simple_tokenize(text))

def Pairs(text):
    # emit ((x, y), 1) for every ordered pair of different tokens on one line
    t = set(simple_tokenize(text))
    C_pair = []
    for i in t:
        for j in t:
            if j != i:
                C_pair.append(((i, j), 1))
    return C_pair

def Nx_PMI(U):
    # ((x, y), n_xy) -> ((x, y), (n_xy, PMI)), using base-10 logarithms
    Px = D[U[0][0]] / row_num
    Py = D[U[0][1]] / row_num
    n_xy = U[1]
    Pxy = n_xy / row_num
    PMI = log(Pxy / (Px * Py), 10)
    return (U[0], (n_xy, PMI))

# n(x) for every token; it does not depend on T, so compute it once
D = textFile.flatMap(token)\
            .map(lambda word: (word, 1))\
            .countByKey()

while True:
    T = input("Input a positive integer threshold(return to quit): ")
    if len(T) == 0:
        break
    T = int(T)
    if T <= 0:
        print('threshold must be positive integer')
        continue  # ask again instead of collecting an undefined or stale result
    M = textFile.flatMap(Pairs)\
                .reduceByKey(lambda a, b: a + b)\
                .filter(lambda x: x[1] >= T)\
                .map(Nx_PMI)
    A = M.collect()
    if len(A) > 0:
        for pair, (n_xy, PMI) in A:
            print("pairs:{0},n_xy:{1},PMI:{2}\n".format(pair, n_xy, PMI))
    else:
        print('There\'s no token pairs co-occur at least {0} times'.format(T))

Input a positive integer threshold(return to quit): 500
pairs:('he', 'his'),n_xy:772,PMI:0.4045965628641987

pairs:('with', 'me'),n_xy:714,PMI:0.19462649020263792

pairs:('of', 'he'),n_xy:689,PMI:-0.05915354395284624

pairs:('the', 'you'),n_xy:1894,PMI:-0.1064565524207891

pairs:('and', 'from'),n_xy:502,PMI:-0.01645220118755594

pairs:('would', 'i'),n_xy:929,PMI:0.4368562016159068

pairs:('will', 'it'),n_xy:513,PMI:0.25994370772774095

pairs:('and', 'for'),n_xy:1451,PMI:-0.006218058038888795

pairs:('our', 'to'),n_xy:581,PMI:0.1303120537332908

pairs:('a', 'be'),n_xy:795,PMI:0.041551587472925144

pairs:('that', 'not'),n_xy:888,PMI:0.09696639486903776

pairs:('the', 'what'),n_xy:516,PMI:-0.21894256555716282

pairs:('be', 'my'),n_xy:642,PMI:0.009373180570842032

pairs:('his', 'of'),n_xy:1045,PMI:0.07998442061341464

pairs:('if', 'you'),n_xy:819,PMI:0.37736045390329614

pairs:('and', 'your'),n_xy:1282,PMI:0.010172342142701409

pairs:('say', 'i'),n_xy:688,PMI:0.4325377426042171

pairs:('of

Input a positive integer threshold(return to quit): 10000000
There's no token pairs co-occur at least 10000000 times
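
The first line of the recorded output for $T = 500$ can be checked by hand. The sketch below recomputes PMI for the pair ('he', 'his') from the counts printed in Question 4. The line count 122458 is not printed anywhere in the notebook; it is an assumption inferred from n(and) = 24604 and p(and) of about 0.20092. The helper name `pmi` is introduced here.

```python
from math import log10

def pmi(n_xy, n_x, n_y, num_lines):
    """Base-10 PMI from line counts, following the formula used in Nx_PMI."""
    p_xy = n_xy / num_lines
    p_x = n_x / num_lines
    p_y = n_y / num_lines
    return log10(p_xy / (p_x * p_y))

NUM_LINES = 122458  # assumption: inferred from the Question 4 output, not stated in the notebook

# n(he, his) = 772, n(he) = 5816, n(his) = 6403 are taken from the outputs above
print(pmi(772, 5816, 6403, NUM_LINES))
```

The printed value should agree with the reported PMI of roughly 0.4046 for this pair.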


---
#### Question 6  (6/30 marks):

Finally, write Spark code that will prompt the user for two inputs: a positive integer threshold $T$, and a sample size $N$.   For every token $x$ in `Shakespeare.txt`, your code should find all tokens $y$ that co-occur with $x$ at least $T$ times, as well as PMI$(x,y)$ for each such pair.

For each $x$, the output of your program should be similar to the output that would be produced by a One-Token query on $x$ (see Assignment 1), with threshold $T$ - except that here you report all co-occurring tokens above the threshold, rather than just five.   Rather than producing output for all possible tokens $x$, your program should produce output only for $N$ different $x$'s, chosen *at random* from among all distinct tokens in the input file.
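
The two requirements, random choice of $N$ tokens and a threshold on co-occurrence counts, can be illustrated sequentially on toy data. The following is a minimal sketch, not the required Spark solution: `one_token_report` and its parameters are hypothetical names, the counts are assumed to be precomputed dictionaries, and PMI is taken as base-10 to match the code in Question 5.

```python
import random
from math import log10

def one_token_report(pair_counts, token_counts, num_lines, threshold, sample_size, seed=None):
    """For randomly chosen tokens x, list (y, n(x,y), PMI(x,y)) with n(x,y) >= threshold."""
    rng = random.Random(seed)
    tokens = sorted(token_counts)
    # if sample_size exceeds the number of distinct tokens, use all of them
    chosen = rng.sample(tokens, min(sample_size, len(tokens)))
    report = {}
    for x in chosen:
        rows = []
        # linear scan over all pairs; fine for a sketch, too slow for the full corpus
        for (a, y), n_xy in pair_counts.items():
            if a == x and n_xy >= threshold:
                pmi = log10(n_xy * num_lines / (token_counts[x] * token_counts[y]))
                rows.append((y, n_xy, pmi))
        report[x] = sorted(rows)
    return report

# toy data: 4 lines, line counts per token and per ordered pair
token_counts = {"a": 3, "b": 2, "c": 1}
pair_counts = {("a", "b"): 2, ("b", "a"): 2, ("a", "c"): 1, ("c", "a"): 1}
report = one_token_report(pair_counts, token_counts, 4, threshold=2, sample_size=5, seed=1)
for x, rows in report.items():
    print(x, rows)
```

In Spark, `RDD.takeSample(False, N)` could play the role of `random.sample` here, so that only the sampled tokens are returned to the driver.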


---
Don't forget to save your workbook!   When you are finished and you are ready to submit your assignment, download your notebook file (.ipynb) from the hub to your machine, and then follow the submission instructions in the assignment.

In [None]:
# your solution to Question 6 here
from simple_tokenize import simple_tokenize
from math import log
textFile = sc.textFile("Shakespeare.txt")
row_num = textFile.count()
def token(text):
    t=simple_tokenize(text)
    t=set(t)
    return t

def Co(text):
    t=simple_tokenize(text)
    t=set(t)
    b=[]
    for i in t:
        for j in t:
            if i!=j:
                b.append(tuple(((i,j),1)))          
    return b

def Nx_PMI(U):
    if U[1] !=[]:
        Px=D[U[0][0]]/row_num
        Py=D[U[0][1]]/row_num
        n_xy=U[1]
        Pxy=n_xy/row_num
        PMI=log(Pxy/(Px*Py),10)
        B=((U[0],(n_xy,PMI)))
    if U[1] ==[]:
        B=((U[0],([],[])))
    return B#essentially, it means if one pair appears at least T times ((x,y),n_xy)->((x,y),(n_xy,PMI)),if not,((x,y),[])->((x,y),([],[]))

def DIC(q):
    if q[1][0]!=[]:
        B=(q[0][0],(q[0][1],(q[1][0],q[1][1])))
    else:
        B=(q[0][0],())
    return B #essentially, it means if one pair appears at least T times ((x,y),(n_xy,PMI))->(x,(y,(n_xy,PMI))), if not ((x,y),([],[]))->(x,())

def J(s):
    if s[1]<T:
        C=(s[0],[])
    else:
        C=s
    return C# if appear at least T times return((x,y),n_xy), if not,((x,y),[])

while True:
    T = input("Input a positive integer threshold(return to quit): ")
    if len(T) == 0:
        break
    T=int(T)
    if  T<=0:
        print('threshold must be positive integer')
    else:
        N=int(input('Input the sample size'))
        if N<=0:
            print('The sample size must be positive integer')
        else:
            print('If the input sample size exceeds the number of distinct tokens, we will return all we have,\n it\'s running...')
            D=textFile.flatMap(token)\
                      .map(lambda word: (word,1))\
                      .countByKey()

            M=textFile.flatMap(Co)\
                      .reduceByKey(lambda a,b:a+b)\
                      .map(J)\
                      .map(Nx_PMI)\
                      .map(DIC)\
                      .reduceByKey(lambda a,b:a+b)\
                      .takeSample(False, N)  # draw N distinct tokens at random, rather than the first N
            for i in range(len(M)):
                print("  The token is {0}\n".format(M[i][0]))
                if M[i][1]==():
                    print('{0}\n'.format(M[i]))
                else:
                    for j in range(0,len(M[i][1]),2):
                        print("  n({0},{1}) = {2},  PMI({0},{1}) = {3}\n".format(M[i][0],M[i][1][j],M[i][1][j+1][0],M[i][1][j+1][1])) 

              # if the sample size exceeds the distinct token size, it will give all the samples we have

    


Input a positive integer threshold(return to quit): 50
Input the sample size20
If the input sample size exceeds the number of distinct tokens, we will return all we have,
 it's running...
  The token is heedful

('heedful', ())

  The token is circle

('circle', ())

  The token is suppresseth

('suppresseth', ())

  The token is divers

('divers', ())

  The token is tapp'd

("tapp'd", ())

  The token is music's

("music's", ())

  The token is destin'd

("destin'd", ())

  The token is tents

('tents', ())

  The token is wassails

('wassails', ())

  The token is neighing

('neighing', ())

  The token is pavilion'd

("pavilion'd", ())

  The token is spoken

('spoken', ())

  The token is gramercies

('gramercies', ())

  The token is skittish

('skittish', ())

  The token is petrarch

('petrarch', ())

  The token is impotence

('impotence', ())

  The token is exposture

('exposture', ())

  The token is why

  n(why,do) = 88,  PMI(why,do) = 0.3119736491231157

  n(why,have) = 