# **Text Analysis with Spark RDD API**



## Installing pyspark

The following cell installs the latest pyspark package.

In [None]:
!pip install pyspark



## Mounting Google Drive

The following cell mounts your Google Drive in the virtual machine running the notebook. You will be asked to authenticate your account to access Google Drive. Once authenticated, your Google Drive is mounted at `/content/drive`. Anything in your Google Drive can be accessed from `/content/drive/MyDrive`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The following cell lists the contents of your Google Drive. We assume you have created a folder called `data` in your Google Drive and uploaded all the data files for assignment 1 there.

In [None]:
!ls /content/drive/MyDrive/data

 Anti_assignment_CIC_g3.csv
 Governing_Law.csv
'Label Report - Anti-assignment, CIC (Group 3).xlsx'
'Label Report - Governing Law.xlsx'


## Initializing spark


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Cloud Computing1") \
    .getOrCreate()

# reference: https://colab.research.google.com/drive/1cdTy7-sLgO8FFliMMlUGn6LiVmVZOTT-?usp=sharing

## Read Data
Read the two CSV files (`Governing_Law.csv` and `Anti_assignment_CIC_g3.csv`) and convert each result to an RDD separately.

In [None]:
governing_law = spark.read.csv("file:///content/drive/MyDrive/data/Governing_Law.csv",header=True).rdd
anti_assignment_cic_g3 = spark.read.csv("file:///content/drive/MyDrive/data/Anti_assignment_CIC_g3.csv",header=True).rdd
governing_law_raw_data = governing_law.map(lambda x: x["Governing Law"]).filter(lambda y: y != "nan")
change_of_control_raw_data = anti_assignment_cic_g3.map(lambda x: x["Change of Control"]).filter(lambda y: y is not None)
anti_assignment_raw_data = anti_assignment_cic_g3.map(lambda x: x["Anti-assignment"]).filter(lambda y: y is not None)

# reference: https://colab.research.google.com/drive/1cdTy7-sLgO8FFliMMlUGn6LiVmVZOTT-?usp=sharing

Load stop words and punctuation



In [None]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
stopwords.append("page")
print("stopwords: ", end = "")
print(stopwords)

punctuations = string.punctuation
print("punctuation: " + punctuations)

# reference: https://colab.research.google.com/drive/1cdTy7-sLgO8FFliMMlUGn6LiVmVZOTT-?usp=sharing

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',

# **Method 1**

### **Remove stopwords and punctuation**

In [None]:
# Join consecutive tokens into candidate phrases, breaking at punctuation, stop words, and digits.
# Only runs of 1 to 4 words are kept.
def combine_word(result):
  final_result = []
  current_line = []
  i = 0
  while i < len(result):
      j = 0
      while j < len(result[i]):
        if result[i][j] in punctuations or result[i][j] in stopwords or result[i][j].isdigit():
          if not (len(current_line) < 1 or len(current_line) > 4):
            final_result.append(' '.join(current_line))
          current_line = []   
        else:
          current_line.append(result[i][j])
        j += 1
      if not (len(current_line) > 4 or len(current_line) < 1):
        final_result.append(' '.join(current_line))
      current_line = []
      i += 1
  return final_result

  # reference: https://github.com/aneesha/RAKE/blob/master/rake.py

In [None]:
import nltk
nltk.download('punkt')

# Start of split function.
def split_content(texts):

  # Put every word into lower case first.
  texts = texts.lower()


  result = []
  split_result = []

  # Split the document on the "(page " marker, then tokenize each part.
  for text in texts.split('(page '):
    result = []
    words = []
    words = nltk.word_tokenize(text)
  
    # word_tokenize does not split on "/", so split such tokens manually and
    # insert a "," after each part to act as a phrase boundary.
    i = 0
    while i < len(words):
      if "/" in words[i]:
          temp = words[i].split("/")
          words.pop(i)
          j = 0
          count = 0
          while j < len(temp):
              words.insert(i+j+count, temp[j])
              count += 1
              words.insert(i+j+count, ",")
              j += 1
          continue
      i += 1
    if len(words) != 0:
      result.append(words)
    
    # Group the tokens into candidate phrases, breaking at digits, punctuation, and stop words.
    final_result = combine_word(result)
    if not len(final_result) <= 0:
      split_result.append(final_result)
  return split_result

# reference: https://edstem.org/au/courses/8206/discussion/792452

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
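
The manual `/` handling inside `split_content` can also be written as a small standalone helper. The sketch below is an illustrative alternative, not the notebook's code; `split_slashes` is a hypothetical name. It replaces every token that contains `/` with its parts, each followed by a `,` that later acts as a phrase boundary.

```python
def split_slashes(tokens):
    """Split tokens on "/" and follow every part with "," as a boundary marker."""
    out = []
    for tok in tokens:
        if "/" in tok:
            for part in tok.split("/"):
                out.extend([part, ","])
        else:
            out.append(tok)
    return out


print(split_slashes(["sale", "and/or", "transfer"]))
```

Because `,` is punctuation, the two halves of `and/or` can never end up in the same candidate phrase.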


The `split_content()` function above is used to split the documents into keywords. This is also the **Candidate Phrase Identification for method 1**.

In [None]:
governing_law_data = governing_law_raw_data.flatMap(split_content)
change_of_control_data = change_of_control_raw_data.flatMap(split_content)
anti_assignment_data = anti_assignment_raw_data.flatMap(split_content)
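
The idea behind candidate phrase identification can be shown without Spark or NLTK. The following is a minimal sketch under stated assumptions: `STOPWORDS` is a tiny hypothetical list (the notebook uses NLTK's English list plus "page"), `candidate_phrases` is a hypothetical name, and any token made up only of punctuation characters is treated as a boundary, which is slightly broader than the notebook's check.

```python
import string

# Hypothetical, tiny stop-word list used only for this illustration.
STOPWORDS = {"the", "be", "by", "of"}
MAX_WORDS = 4  # runs longer than this are discarded, as in combine_word


def candidate_phrases(tokens):
    """Join runs of consecutive non-boundary tokens into candidate phrases."""
    phrases, current = [], []
    for tok in list(tokens) + [None]:  # None is a sentinel that flushes the last run
        is_boundary = (
            tok is None
            or tok in STOPWORDS
            or tok.isdigit()
            or all(ch in string.punctuation for ch in tok)
        )
        if is_boundary:
            if 1 <= len(current) <= MAX_WORDS:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    return phrases


tokens = ["the", "agreement", "shall", "be", "governed", "by",
          "the", "federal", "laws", "of", "canada", "."]
print(candidate_phrases(tokens))
# ['agreement shall', 'governed', 'federal laws', 'canada']
```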

**Candidate Phrase Identification of governing_law for method 1**

In [None]:
governing_law_data.take(4)

[['agreement',
  'accepted',
  'company',
  'state',
  'nevada',
  'shall',
  'governed',
  'construed',
  'accordance',
  'laws thereof',
  'laws shall prevail',
  'event',
  'conflict'],
 ['agreement shall',
  'governed',
  'laws',
  'province',
  'ontario',
  'federal laws',
  'canada applicable therein'],
 ['agreement',
  'subject',
  'laws',
  'regulations',
  'license conditions',
  'decisions',
  'canadian radio-television',
  'telecommunications commission',
  '‚äúcrtc‚äù',
  'municipal',
  'provincial',
  'federal governments',
  'authorities',
  'applicable',
  'rogers',
  'licensor',
  'force',
  'hereafter adopted',
  '‚äúapplicable law‚äù'],
 ['questions',
  'respect',
  'construction',
  'agreement',
  'rights',
  'liabilities',
  'parties hereto',
  'shall',
  'governed',
  'laws',
  'state',
  'florida']]

**Candidate Phrase Identification of change_of_control for method 1**

In [None]:
change_of_control_data.take(4)

[['purposes',
  'preceding sentence',
  'without limiting',
  'generality',
  'merger',
  'consolidation',
  'reorganization involving licensee',
  'regardless',
  'whether licensee',
  'surviving',
  'disappearing entity',
  'deemed',
  'transfer',
  'rights',
  'obligations',
  'performance',
  'agreement',
  'required'],
 ['``',
  'term',
  'agreement shall',
  'effective',
  'date first stated',
  'shall continue',
  'term',
  'three',
  'years',
  'unless terminated earlier',
  'accordance',
  'provisions',
  'agreement',
  'provided'],
 ['``',
  'purposes',
  'agreement',
  "`` '' change",
  "control '' '' means",
  'merger',
  'consolidation',
  'party'],
 ['neither party shall voluntarily',
  'operation',
  'law assign',
  'otherwise transfer',
  'rights',
  'obligations incurred pursuant',
  'terms',
  'agreement without',
  'prior written consent',
  'party']]

**Candidate Phrase Identification of anti_assignment for method 1**

In [None]:
anti_assignment_data.take(4)

[['may',
  'assign',
  'sell',
  'lease',
  'otherwise transfer',
  'whole',
  'party',
  'rights granted pursuant',
  'company'],
 ['agreement may',
  'assigned',
  'sold',
  'transferred without',
  'prior written consent',
  'party'],
 ['notwithstanding',
  'foregoing',
  'rogers may',
  'without consent',
  'assign',
  'rights',
  'obligations',
  'agreement',
  'whole',
  'part',
  'person',
  'directly',
  'indirectly controls',
  'controlled',
  'common control',
  'rogers',
  'ii',
  'purchaser',
  'substantially',
  'assets used',
  'connection',
  'rod service',
  'change',
  'control',
  'rogers shall',
  'considered',
  'assignment',
  'agreement'],
 ['purported assignment',
  'sale',
  'transfer',
  'contravention',
  'section shall',
  'null',
  'void']]

## Count Word Score

The first step in method 1 is to compute the score of each word in every keyword.

In [None]:
def calculate_word_score(word_list):
  word_score_list = []
  frequency_list = []
  frequency_value_list = []
  degree_list = []
  degree_value_list = []
  score_list = []
  score_value_list = []
  combine_list = []

  # The following block is to calculate the frequency value of each word
  i = 0
  while i < len(word_list):
    if word_list[i] not in frequency_list:
      frequency_list.append(word_list[i])
      frequency_value_list.append(1)
    else:
      frequency_value_list[frequency_list.index(word_list[i])] += 1
    phrase = word_list[i].split(" ")
    if len(phrase) == 1:
      temp = 1
    else:
      combine_list.append((phrase, word_list[i]))
    i += 1
  i = 0
  while i < len(combine_list):
    j = 0
    while j < len(combine_list[i][0]):
      if combine_list[i][0][j] not in frequency_list:
        frequency_list.append(combine_list[i][0][j])
        frequency_value_list.append(1)
      else:
        frequency_value_list[frequency_list.index(combine_list[i][0][j])] += 1
      j += 1
    i += 1

  # The following block is to calculate the degree of each word
  i = 0
  keys = list(frequency_list)
  while i < len(keys):
    if keys[i] in degree_list:
      temp = 0
    else:
      degree_list.append(keys[i])
      degree_value_list.append(frequency_value_list[frequency_list.index(keys[i])])
    j = 0
    while j < len(combine_list):
      if len(combine_list[j][0]) <= 4 and combine_list[j][0].count(keys[i]) > 0:
        degree_value_list[degree_list.index(keys[i])] += (len(combine_list[j][0]) - 1) * frequency_value_list[frequency_list.index(combine_list[j][1])] * combine_list[j][0].count(keys[i])
      j += 1
    i += 1

  # The following block is to calculate the score of each word
  i = 0
  while i < len(keys):
    score_list.append(keys[i])
    score_value_list.append(degree_value_list[degree_list.index(keys[i])] * 1.0 / frequency_value_list[frequency_list.index(keys[i])])
    i += 1

  # In the following block, entries with no more than 4 words are kept and then sorted by score (descending) and name
  i = 0
  while i < len(keys):
    if len(keys[i].split()) < 5:
      word_score_list.append((keys[i], score_value_list[score_list.index(keys[i])]))
    i += 1
  word_score_list.sort(key = (lambda x :(-x[1], x[0])))
    
  return word_score_list

In [None]:
governing_law_word_score = governing_law_data.map(lambda x: calculate_word_score(x))
change_of_control_word_score = change_of_control_data.map(lambda x: calculate_word_score(x))
anti_assignment_word_score = anti_assignment_data.map(lambda x: calculate_word_score(x))
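
For a single document, the word score is the ratio degree/frequency. The sketch below is a compact version of the standard RAKE word score, not the notebook's function; `word_scores` is a hypothetical name. It assumes the degree of a word is the total length of all candidate phrases that contain it. Unlike `calculate_word_score`, it returns scores for individual words only.

```python
from collections import Counter


def word_scores(candidates):
    """Return {word: degree / frequency} for one document's candidate phrases."""
    freq = Counter()
    degree = Counter()
    for phrase in candidates:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            degree[w] += len(words)  # the word itself plus its co-occurring words
    return {w: degree[w] / freq[w] for w in freq}


# Candidate phrases of the first governing_law document shown above.
doc = ["agreement", "accepted", "company", "state", "nevada", "shall",
       "governed", "construed", "accordance", "laws thereof",
       "laws shall prevail", "event", "conflict"]
scores = word_scores(doc)
print(scores["prevail"], scores["laws"], scores["shall"])  # 3.0 2.5 2.0
```

For this document the values agree with the word scores printed by the notebook (for example `laws` has degree 5 and frequency 2).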

Show **word score calculation result** for method 1

**word score calculation result of governing_law**

In [None]:
governing_law_word_score.take(3)

[[('prevail', 3.0),
  ('laws', 2.5),
  ('shall', 2.0),
  ('thereof', 2.0),
  ('accepted', 1.0),
  ('accordance', 1.0),
  ('agreement', 1.0),
  ('company', 1.0),
  ('conflict', 1.0),
  ('construed', 1.0),
  ('event', 1.0),
  ('governed', 1.0),
  ('laws shall prevail', 1.0),
  ('laws thereof', 1.0),
  ('nevada', 1.0),
  ('state', 1.0)],
 [('applicable', 3.0),
  ('canada', 3.0),
  ('therein', 3.0),
  ('agreement', 2.0),
  ('federal', 2.0),
  ('shall', 2.0),
  ('laws', 1.5),
  ('agreement shall', 1.0),
  ('canada applicable therein', 1.0),
  ('federal laws', 1.0),
  ('governed', 1.0),
  ('ontario', 1.0),
  ('province', 1.0)],
 [('adopted', 2.0),
  ('canadian', 2.0),
  ('commission', 2.0),
  ('conditions', 2.0),
  ('federal', 2.0),
  ('governments', 2.0),
  ('hereafter', 2.0),
  ('law‚äù', 2.0),
  ('license', 2.0),
  ('radio-television', 2.0),
  ('telecommunications', 2.0),
  ('‚äúapplicable', 2.0),
  ('agreement', 1.0),
  ('applicable', 1.0),
  ('authorities', 1.0),
  ('canadian radio-tele

**word score calculation result of change_of_control**

In [None]:
change_of_control_word_score.take(3)

[[('involving', 3.0),
  ('reorganization', 3.0),
  ('licensee', 2.5),
  ('disappearing', 2.0),
  ('entity', 2.0),
  ('limiting', 2.0),
  ('preceding', 2.0),
  ('sentence', 2.0),
  ('whether', 2.0),
  ('without', 2.0),
  ('agreement', 1.0),
  ('consolidation', 1.0),
  ('deemed', 1.0),
  ('disappearing entity', 1.0),
  ('generality', 1.0),
  ('merger', 1.0),
  ('obligations', 1.0),
  ('performance', 1.0),
  ('preceding sentence', 1.0),
  ('purposes', 1.0),
  ('regardless', 1.0),
  ('reorganization involving licensee', 1.0),
  ('required', 1.0),
  ('rights', 1.0),
  ('surviving', 1.0),
  ('transfer', 1.0),
  ('whether licensee', 1.0),
  ('without limiting', 1.0)],
 [('date', 3.0),
  ('earlier', 3.0),
  ('first', 3.0),
  ('stated', 3.0),
  ('terminated', 3.0),
  ('unless', 3.0),
  ('continue', 2.0),
  ('shall', 2.0),
  ('agreement', 1.5),
  ('``', 1.0),
  ('accordance', 1.0),
  ('agreement shall', 1.0),
  ('date first stated', 1.0),
  ('effective', 1.0),
  ('provided', 1.0),
  ('provisions

**word score calculation result of anti-assignment**

In [None]:
anti_assignment_word_score.take(3)

[[('granted', 3.0),
  ('pursuant', 3.0),
  ('rights', 3.0),
  ('otherwise', 2.0),
  ('transfer', 2.0),
  ('assign', 1.0),
  ('company', 1.0),
  ('lease', 1.0),
  ('may', 1.0),
  ('otherwise transfer', 1.0),
  ('party', 1.0),
  ('rights granted pursuant', 1.0),
  ('sell', 1.0),
  ('whole', 1.0)],
 [('consent', 3.0),
  ('prior', 3.0),
  ('written', 3.0),
  ('agreement', 2.0),
  ('may', 2.0),
  ('transferred', 2.0),
  ('without', 2.0),
  ('agreement may', 1.0),
  ('assigned', 1.0),
  ('party', 1.0),
  ('prior written consent', 1.0),
  ('sold', 1.0),
  ('transferred without', 1.0)],
 [('assets', 2.0),
  ('common', 2.0),
  ('consent', 2.0),
  ('controls', 2.0),
  ('indirectly', 2.0),
  ('may', 2.0),
  ('rod', 2.0),
  ('service', 2.0),
  ('shall', 2.0),
  ('used', 2.0),
  ('without', 2.0),
  ('rogers', 1.6666666666666667),
  ('control', 1.5),
  ('agreement', 1.0),
  ('assets used', 1.0),
  ('assign', 1.0),
  ('assignment', 1.0),
  ('change', 1.0),
  ('common control', 1.0),
  ('connection', 

The second step of method 1 is to calculate each keyword's score by adding up the scores of the words in that keyword.

In [None]:
def calculate_candidate_phrase_score(word_list):
  # The following block is the same as in calculate_word_score.
  # I originally calculated the keyword score directly and did not show the score of each word.
  # Because the rubric indicates that the score of each word must be shown, I copied this code and
  # deleted part of it to obtain calculate_word_score.
  keyword_score = []
  frequency_list = []
  frequency_value_list = []
  degree_list = []
  degree_value_list = []
  score_list = []
  score_value_list = []
  combine_list = []

  i = 0
  while i < len(word_list):
    if word_list[i] not in frequency_list:
      frequency_list.append(word_list[i])
      frequency_value_list.append(1)
    else:
      frequency_value_list[frequency_list.index(word_list[i])] += 1
    phrase = word_list[i].split(" ")
    if len(phrase) == 1:
      temp = 1
    else:
      combine_list.append((phrase, word_list[i]))
    i += 1

  i = 0
  while i < len(combine_list):
    j = 0
    while j < len(combine_list[i][0]):
      if combine_list[i][0][j] not in frequency_list:
        frequency_list.append(combine_list[i][0][j])
        frequency_value_list.append(1)
      else:
        frequency_value_list[frequency_list.index(combine_list[i][0][j])] += 1
      j += 1
    i += 1

  
  i = 0
  keys = list(frequency_list)
  while i < len(keys):
    if keys[i] in degree_list:
      temp = 0
    else:
      degree_list.append(keys[i])
      degree_value_list.append(frequency_value_list[frequency_list.index(keys[i])])
    j = 0
    while j < len(combine_list):
      if len(combine_list[j][0]) <= 4 and combine_list[j][0].count(keys[i]) > 0:
        degree_value_list[degree_list.index(keys[i])] += (len(combine_list[j][0]) - 1) * frequency_value_list[frequency_list.index(combine_list[j][1])] * combine_list[j][0].count(keys[i])
      j += 1
    i += 1

  i = 0
  while i < len(keys):
    score_list.append(keys[i])
    score_value_list.append(degree_value_list[degree_list.index(keys[i])] * 1.0 / frequency_value_list[frequency_list.index(keys[i])])
    i += 1
  i = 0

  # The following block is new compared to calculate_word_score: it adds up the scores of the distinct words in each keyword
  while i < len(combine_list):
    score = 0
    temp_list = []
    k = 0
    while k < len(combine_list[i][0]):
      if combine_list[i][0][k] not in temp_list:
        temp_list.append(combine_list[i][0][k])
      k += 1
    j = 0
    while j < len(temp_list):
      score += score_value_list[score_list.index(temp_list[j])]
      j += 1
    score_value_list[score_list.index(combine_list[i][1])] = score
    i += 1
  
  i = 0
  while i < len(word_list):
    if len(word_list[i].split()) < 5:
      keyword_score.append((word_list[i], score_value_list[score_list.index(word_list[i])]))
    i += 1
  keyword_score.sort(key=(lambda x :(-x[1], x[0])))
    
  return keyword_score

In [None]:
governing_law_candidate_phrase_score = governing_law_data.map(lambda x: calculate_candidate_phrase_score(x))
change_of_control_candidate_phrase_score = change_of_control_data.map(lambda x: calculate_candidate_phrase_score(x))
anti_assignment_candidate_phrase_score = anti_assignment_data.map(lambda x: calculate_candidate_phrase_score(x))
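
As a worked example of this step, the sketch below sums word scores for two phrases. `phrase_score` is a hypothetical helper, and the word scores are copied from the first governing_law document in the word score output earlier in the notebook. Each distinct word is counted once, as in `calculate_candidate_phrase_score`.

```python
# Word scores taken from the first governing_law document printed earlier.
WORD_SCORES = {"prevail": 3.0, "laws": 2.5, "shall": 2.0, "thereof": 2.0, "nevada": 1.0}


def phrase_score(phrase, word_scores):
    """Sum the scores of the distinct words in a candidate phrase."""
    return sum(word_scores[w] for w in set(phrase.split()))


print(phrase_score("laws shall prevail", WORD_SCORES))  # 7.5
print(phrase_score("laws thereof", WORD_SCORES))        # 4.5
```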

Show **candidate phrase score calculation result** for method 1

**candidate phrase score calculation of governing_law**

In [None]:
governing_law_candidate_phrase_score.take(3)

[[('laws shall prevail', 7.5),
  ('laws thereof', 4.5),
  ('shall', 2.0),
  ('accepted', 1.0),
  ('accordance', 1.0),
  ('agreement', 1.0),
  ('company', 1.0),
  ('conflict', 1.0),
  ('construed', 1.0),
  ('event', 1.0),
  ('governed', 1.0),
  ('nevada', 1.0),
  ('state', 1.0)],
 [('canada applicable therein', 9.0),
  ('agreement shall', 4.0),
  ('federal laws', 3.5),
  ('laws', 1.5),
  ('governed', 1.0),
  ('ontario', 1.0),
  ('province', 1.0)],
 [('canadian radio-television', 4.0),
  ('federal governments', 4.0),
  ('hereafter adopted', 4.0),
  ('license conditions', 4.0),
  ('telecommunications commission', 4.0),
  ('‚äúapplicable law‚äù', 4.0),
  ('agreement', 1.0),
  ('applicable', 1.0),
  ('authorities', 1.0),
  ('decisions', 1.0),
  ('force', 1.0),
  ('laws', 1.0),
  ('licensor', 1.0),
  ('municipal', 1.0),
  ('provincial', 1.0),
  ('regulations', 1.0),
  ('rogers', 1.0),
  ('subject', 1.0),
  ('‚äúcrtc‚äù', 1.0)]]

**candidate phrase score calculation of change_of_control**

In [None]:
change_of_control_candidate_phrase_score.take(3)

[[('reorganization involving licensee', 8.5),
  ('whether licensee', 4.5),
  ('disappearing entity', 4.0),
  ('preceding sentence', 4.0),
  ('without limiting', 4.0),
  ('agreement', 1.0),
  ('consolidation', 1.0),
  ('deemed', 1.0),
  ('generality', 1.0),
  ('merger', 1.0),
  ('obligations', 1.0),
  ('performance', 1.0),
  ('purposes', 1.0),
  ('regardless', 1.0),
  ('required', 1.0),
  ('rights', 1.0),
  ('surviving', 1.0),
  ('transfer', 1.0)],
 [('date first stated', 9.0),
  ('unless terminated earlier', 9.0),
  ('shall continue', 4.0),
  ('agreement shall', 3.5),
  ('agreement', 1.5),
  ('``', 1.0),
  ('accordance', 1.0),
  ('effective', 1.0),
  ('provided', 1.0),
  ('provisions', 1.0),
  ('term', 1.0),
  ('term', 1.0),
  ('three', 1.0),
  ('years', 1.0)],
 [("control '' '' means", 11.666666666666666),
  ("`` '' change", 8.666666666666666),
  ('``', 2.0),
  ('agreement', 1.0),
  ('consolidation', 1.0),
  ('merger', 1.0),
  ('party', 1.0),
  ('purposes', 1.0)]]

**candidate phrase score calculation of anti_assignment**

In [None]:
anti_assignment_candidate_phrase_score.take(3)

[[('rights granted pursuant', 9.0),
  ('otherwise transfer', 4.0),
  ('assign', 1.0),
  ('company', 1.0),
  ('lease', 1.0),
  ('may', 1.0),
  ('party', 1.0),
  ('sell', 1.0),
  ('whole', 1.0)],
 [('prior written consent', 9.0),
  ('agreement may', 4.0),
  ('transferred without', 4.0),
  ('assigned', 1.0),
  ('party', 1.0),
  ('sold', 1.0)],
 [('assets used', 4.0),
  ('indirectly controls', 4.0),
  ('rod service', 4.0),
  ('without consent', 4.0),
  ('rogers may', 3.666666666666667),
  ('rogers shall', 3.666666666666667),
  ('common control', 3.5),
  ('rogers', 1.6666666666666667),
  ('control', 1.5),
  ('agreement', 1.0),
  ('agreement', 1.0),
  ('assign', 1.0),
  ('assignment', 1.0),
  ('change', 1.0),
  ('connection', 1.0),
  ('considered', 1.0),
  ('controlled', 1.0),
  ('directly', 1.0),
  ('foregoing', 1.0),
  ('ii', 1.0),
  ('notwithstanding', 1.0),
  ('obligations', 1.0),
  ('part', 1.0),
  ('person', 1.0),
  ('purchaser', 1.0),
  ('rights', 1.0),
  ('substantially', 1.0),
  ('w

Next, calculate the feature extraction values for each keyword.

Calculate the rdf (referenced document frequency), edf (extracted document frequency), and ess (essentiality).

In [None]:
# Map each phrase to (phrase, 1), then sum the counts of identical phrases to get the frequency of each phrase
governing_law_rdf = governing_law_candidate_phrase_score.flatMap(lambda x: x).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
change_of_control_rdf = change_of_control_candidate_phrase_score.flatMap(lambda x: x).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
anti_assignment_rdf = anti_assignment_candidate_phrase_score.flatMap(lambda x: x).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
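
The `map` to `(phrase, 1)` followed by `reduceByKey` is a plain counting pattern. The sketch below reproduces it in pure Python on two made-up documents; `count_rdf` and the toy data are assumptions for illustration only. Like the Spark version, it counts repeated occurrences inside one document separately.

```python
from collections import Counter


def count_rdf(scored_docs):
    """Count every occurrence of each candidate phrase across all documents."""
    return Counter(phrase for doc in scored_docs for phrase, _score in doc)


# Two toy documents in the (phrase, score) format; the scores are ignored here.
docs = [
    [("prior written consent", 9.0), ("agreement may", 4.0), ("party", 1.0)],
    [("prior written consent", 9.0), ("term", 1.0), ("term", 1.0)],
]
rdf = count_rdf(docs)
print(rdf["prior written consent"])  # 2
print(rdf["term"])                   # 2 (both occurrences in the second document)
```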

**rdf of governing_law**

In [None]:
governing_law_rdf.map(lambda x: (x[0] + "(rdf: " + str(x[1]) + ")")).take(20)

['laws shall prevail(rdf: 1)',
 'laws thereof(rdf: 3)',
 'shall(rdf: 47)',
 'accepted(rdf: 3)',
 'accordance(rdf: 294)',
 'agreement(rdf: 196)',
 'company(rdf: 3)',
 'conflict(rdf: 112)',
 'construed(rdf: 263)',
 'event(rdf: 3)',
 'governed(rdf: 376)',
 'nevada(rdf: 8)',
 'state(rdf: 396)',
 'canada applicable therein(rdf: 4)',
 'agreement shall(rdf: 250)',
 'federal laws(rdf: 6)',
 'laws(rdf: 462)',
 'ontario(rdf: 13)',
 'province(rdf: 15)',
 'canadian radio-television(rdf: 1)']

**rdf of change_of_control**


In [None]:
change_of_control_rdf.map(lambda x: (x[0] + "(rdf: " + str(x[1]) + ")")).take(20)

['reorganization involving licensee(rdf: 1)',
 'whether licensee(rdf: 2)',
 'disappearing entity(rdf: 1)',
 'preceding sentence(rdf: 2)',
 'without limiting(rdf: 2)',
 'agreement(rdf: 152)',
 'consolidation(rdf: 32)',
 'deemed(rdf: 14)',
 'generality(rdf: 1)',
 'merger(rdf: 52)',
 'obligations(rdf: 23)',
 'performance(rdf: 4)',
 'purposes(rdf: 14)',
 'regardless(rdf: 2)',
 'required(rdf: 9)',
 'rights(rdf: 33)',
 'surviving(rdf: 2)',
 'transfer(rdf: 47)',
 'date first stated(rdf: 1)',
 'unless terminated earlier(rdf: 1)']

**rdf of anti_assignment**

In [None]:
anti_assignment_rdf.map(lambda x: (x[0] + "(rdf: " + str(x[1]) + ")")).take(20)

['rights granted pursuant(rdf: 2)',
 'otherwise transfer(rdf: 39)',
 'assign(rdf: 207)',
 'company(rdf: 34)',
 'lease(rdf: 7)',
 'may(rdf: 24)',
 'party(rdf: 270)',
 'sell(rdf: 21)',
 'whole(rdf: 84)',
 'prior written consent(rdf: 251)',
 'agreement may(rdf: 50)',
 'transferred without(rdf: 1)',
 'assigned(rdf: 113)',
 'sold(rdf: 4)',
 'assets used(rdf: 1)',
 'indirectly controls(rdf: 1)',
 'rod service(rdf: 1)',
 'without consent(rdf: 4)',
 'rogers may(rdf: 1)',
 'rogers shall(rdf: 1)']

Then calculate the edf of each corpus.

In [None]:
# Keep the 4 highest-scoring phrases of each document first (each list is already sorted by score)
# Then map each phrase to (phrase, 1) and sum the counts of identical phrases to get the frequency of each phrase
governing_law_edf = governing_law_candidate_phrase_score.flatMap(lambda x: x[:4]).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
change_of_control_edf = change_of_control_candidate_phrase_score.flatMap(lambda x: x[:4]).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
anti_assignment_edf = anti_assignment_candidate_phrase_score.flatMap(lambda x: x[:4]).map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
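
The only difference from the rdf count is that each document contributes just its top 4 phrases. A minimal pure-Python sketch, assuming every document list is already sorted by descending score; `count_edf` is a hypothetical name, and the sample document is the second governing_law result shown above.

```python
from collections import Counter


def count_edf(scored_docs, top_n=4):
    """Count phrases that are among the top_n highest-scoring ones of a document."""
    return Counter(phrase for doc in scored_docs for phrase, _score in doc[:top_n])


doc = [("canada applicable therein", 9.0), ("agreement shall", 4.0),
       ("federal laws", 3.5), ("laws", 1.5), ("governed", 1.0),
       ("ontario", 1.0), ("province", 1.0)]
edf = count_edf([doc])
print(sorted(edf))
# ['agreement shall', 'canada applicable therein', 'federal laws', 'laws']
```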

**edf of governing_law**

In [None]:
governing_law_edf.map(lambda x: (x[0] + "(edf: " + str(x[1]) + ")")).take(20)

['laws shall prevail(edf: 1)',
 'laws thereof(edf: 2)',
 'shall(edf: 2)',
 'accepted(edf: 1)',
 'canada applicable therein(edf: 4)',
 'agreement shall(edf: 226)',
 'federal laws(edf: 3)',
 'laws(edf: 56)',
 'canadian radio-television(edf: 1)',
 'federal governments(edf: 1)',
 'hereafter adopted(edf: 1)',
 'license conditions(edf: 1)',
 'parties hereto(edf: 2)',
 'agreement(edf: 64)',
 'construction(edf: 8)',
 'florida(edf: 5)',
 'nevada without giving effect(edf: 2)',
 'internal laws(edf: 14)',
 'law provision(edf: 5)',
 'without giving effect(edf: 38)']

**edf of change_of_control**

In [None]:
change_of_control_edf.map(lambda x: (x[0] + "(edf: " + str(x[1]) + ")")).take(20)

['reorganization involving licensee(edf: 1)',
 'whether licensee(edf: 1)',
 'disappearing entity(edf: 1)',
 'preceding sentence(edf: 1)',
 'date first stated(edf: 1)',
 'unless terminated earlier(edf: 1)',
 'shall continue(edf: 1)',
 'agreement shall(edf: 1)',
 "control '' '' means(edf: 2)",
 "`` '' change(edf: 2)",
 '``(edf: 2)',
 'agreement(edf: 12)',
 'neither party shall voluntarily(edf: 1)',
 'obligations incurred pursuant(edf: 1)',
 'prior written consent(edf: 18)',
 'agreement without(edf: 1)',
 'financial resources substantially similar(edf: 1)',
 'shall specifically assume(edf: 1)',
 'control transaction(edf: 2)',
 'skype holding may assign(edf: 1)']

**edf of anti_assignment**

In [None]:
anti_assignment_edf.map(lambda x: (x[0] + "(edf: " + str(x[1]) + ")")).take(20)

['rights granted pursuant(edf: 2)',
 'otherwise transfer(edf: 18)',
 'assign(edf: 28)',
 'company(edf: 1)',
 'prior written consent(edf: 233)',
 'agreement may(edf: 33)',
 'transferred without(edf: 1)',
 'assigned(edf: 15)',
 'assets used(edf: 1)',
 'indirectly controls(edf: 1)',
 'rod service(edf: 1)',
 'without consent(edf: 3)',
 'purported assignment(edf: 10)',
 'section shall(edf: 7)',
 'contravention(edf: 5)',
 'null(edf: 16)',
 'case whether voluntarily(edf: 2)',
 'licensee shall(edf: 5)',
 'section 11.7(edf: 1)',
 'delegation(edf: 5)']

Compute the ess of each corpus and combine ess, rdf, and edf for display.

In [None]:
# ess is calculated by joining rdf with edf and computing edf^2 / rdf (the ratio of edf to rdf, multiplied by edf), rounded to two decimal places
# Then the final result is shown in the format: keyword(ess: x; rdf: y; edf: z)
governing_law_ess_rdf_edf = governing_law_rdf.join(governing_law_edf).map(lambda x: (x[0], round((x[1][1]**2) * 1.0 / x[1][0], 2), x[1][0], x[1][1])).sortBy(lambda x: -x[1])
change_of_control_ess_rdf_edf = change_of_control_rdf.join(change_of_control_edf).map(lambda x: (x[0], round((x[1][1]**2) * 1.0 / x[1][0], 2), x[1][0], x[1][1])).sortBy(lambda x: -x[1])
anti_assignment_ess_rdf_edf = anti_assignment_rdf.join(anti_assignment_edf).map(lambda x: (x[0], round((x[1][1]**2) * 1.0 / x[1][0], 2), x[1][0], x[1][1])).sortBy(lambda x: -x[1])
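
To check the formula by hand, the sketch below recomputes ess from rdf and edf values that appear in the governing_law results of this notebook. `essentiality` is a hypothetical helper that mirrors the expression inside the `map` above.

```python
def essentiality(rdf, edf):
    """ess = edf**2 / rdf, rounded to two decimal places."""
    return round(edf ** 2 / rdf, 2)


print(essentiality(250, 226))  # 204.3  ('agreement shall')
print(essentiality(294, 141))  # 67.62  ('accordance')
print(essentiality(85, 68))    # 54.4   ('new york')
```

A phrase that is extracted every time it occurs (edf equal to rdf) gets an ess equal to its edf, for example 38.0 for 'without giving effect'.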

**Show the result of governing_law of method 1**

In [None]:
governing_law_ess_rdf_edf.map(lambda x: (x[0] + "(ess: " + str(x[1]) + "; rdf: " + str(x[2]) + "; edf: " + str(x[3]) + ")")).take(20)

['agreement shall(ess: 204.3; rdf: 250; edf: 226)',
 'accordance(ess: 67.62; rdf: 294; edf: 141)',
 'new york(ess: 54.4; rdf: 85; edf: 68)',
 'without giving effect(ess: 38.0; rdf: 38; edf: 38)',
 'without regard(ess: 33.47; rdf: 66; edf: 47)',
 'law principles(ess: 23.08; rdf: 39; edf: 30)',
 'agreement(ess: 20.9; rdf: 196; edf: 64)',
 'law provisions(ess: 17.05; rdf: 19; edf: 18)',
 'laws principles(ess: 15.56; rdf: 34; edf: 23)',
 'law rules(ess: 14.45; rdf: 20; edf: 17)',
 'without reference(ess: 14.44; rdf: 25; edf: 19)',
 'united states(ess: 11.13; rdf: 23; edf: 16)',
 'substantive laws(ess: 10.23; rdf: 22; edf: 15)',
 'british columbia(ess: 10.08; rdf: 12; edf: 11)',
 'performed entirely within(ess: 9.0; rdf: 9; edf: 9)',
 'new york without regard(ess: 9.0; rdf: 9; edf: 9)',
 'united nations convention(ess: 9.0; rdf: 9; edf: 9)',
 'new york applicable(ess: 9.0; rdf: 9; edf: 9)',
 'another jurisdiction(ess: 8.33; rdf: 12; edf: 10)',
 'internal laws(ess: 8.17; rdf: 24; edf: 14)']

**Show the result of change_of_control of method 1**

In [None]:
change_of_control_ess_rdf_edf.map(lambda x: (x[0] + "(ess: " + str(x[1]) + "; rdf: " + str(x[2]) + "; edf: " + str(x[3]) + ")")).take(20)

['prior written consent(ess: 16.2; rdf: 20; edf: 18)',
 'home named competitor(ess: 6.0; rdf: 6; edf: 6)',
 'ownership transfer(ess: 5.0; rdf: 5; edf: 5)',
 'days written notice(ess: 4.17; rdf: 6; edf: 5)',
 'indenture trustee(ess: 4.0; rdf: 4; edf: 4)',
 'agreement without consent(ess: 4.0; rdf: 4; edf: 4)',
 'provide services shall(ess: 4.0; rdf: 4; edf: 4)',
 'agreement may(ess: 3.5; rdf: 14; edf: 7)',
 'control shall(ess: 3.0; rdf: 3; edf: 3)',
 'agreement upon written notice(ess: 3.0; rdf: 3; edf: 3)',
 'control buy-out payment(ess: 3.0; rdf: 3; edf: 3)',
 'licensor may terminate(ess: 3.0; rdf: 3; edf: 3)',
 'express written consent(ess: 3.0; rdf: 3; edf: 3)',
 'franchised business(ess: 3.0; rdf: 3; edf: 3)',
 'another corporation(ess: 3.0; rdf: 3; edf: 3)',
 'control event(ess: 2.27; rdf: 11; edf: 5)',
 'competitive product(ess: 2.25; rdf: 4; edf: 3)',
 'affected party(ess: 2.25; rdf: 4; edf: 3)',
 'ordinary course(ess: 2.25; rdf: 4; edf: 3)',
 'agreement shall terminate(ess: 2.2

**Show the result of anti_assignment of method 1**

In [None]:
anti_assignment_ess_rdf_edf.map(lambda x: (x[0] + "(ess: " + str(x[1]) + "; rdf: " + str(x[2]) + "; edf: " + str(x[3]) + ")")).take(20)

['prior written consent(ess: 216.29; rdf: 251; edf: 233)',
 'neither party may assign(ess: 53.07; rdf: 57; edf: 55)',
 'agreement without(ess: 35.56; rdf: 79; edf: 53)',
 'either party without(ess: 30.42; rdf: 38; edf: 34)',
 'either party may assign(ess: 27.0; rdf: 27; edf: 27)',
 'agreement may(ess: 21.78; rdf: 50; edf: 33)',
 'attempted assignment(ess: 18.24; rdf: 29; edf: 23)',
 'neither party shall assign(ess: 18.0; rdf: 18; edf: 18)',
 'neither party shall(ess: 13.0; rdf: 13; edf: 13)',
 'obligations hereunder without(ess: 12.8; rdf: 20; edf: 16)',
 'express written consent(ess: 12.0; rdf: 12; edf: 12)',
 'third party without(ess: 11.53; rdf: 17; edf: 14)',
 'consent shall(ess: 11.26; rdf: 43; edf: 22)',
 'party may assign(ess: 11.25; rdf: 20; edf: 15)',
 'prior written approval(ess: 11.0; rdf: 11; edf: 11)',
 'agreement shall(ess: 11.0; rdf: 44; edf: 22)',
 'third party(ess: 10.72; rdf: 68; edf: 27)',
 'agreement(ess: 9.38; rdf: 493; edf: 68)',
 'express prior written consent(es

# **Method 2**

### **Calculate the candidate for method 2**

First, map the candidates for method 2.

In [None]:
# Simply flatMap the candidates from method 1 to remove the per-document list boundaries; the result is the candidate list for method 2
governing_law_method2_candidate = governing_law_data.flatMap(lambda x: x)
change_of_control_method2_candidate = change_of_control_data.flatMap(lambda x: x)
anti_assignment_method2_candidate = anti_assignment_data.flatMap(lambda x: x)

**Show the candidate for governing_law of method 2**

In [None]:
governing_law_method2_candidate.take(15)

['agreement',
 'accepted',
 'company',
 'state',
 'nevada',
 'shall',
 'governed',
 'construed',
 'accordance',
 'laws thereof',
 'laws shall prevail',
 'event',
 'conflict',
 'agreement shall',
 'governed']

**Show the candidate for change_of_control of method 2**

In [None]:
change_of_control_method2_candidate.take(15)

['purposes',
 'preceding sentence',
 'without limiting',
 'generality',
 'merger',
 'consolidation',
 'reorganization involving licensee',
 'regardless',
 'whether licensee',
 'surviving',
 'disappearing entity',
 'deemed',
 'transfer',
 'rights',
 'obligations']

**Show the candidate for anti_assignment of method 2**

In [None]:
anti_assignment_method2_candidate.take(15)

['may',
 'assign',
 'sell',
 'lease',
 'otherwise transfer',
 'whole',
 'party',
 'rights granted pursuant',
 'company',
 'agreement may',
 'assigned',
 'sold',
 'transferred without',
 'prior written consent',
 'party']

### **Calculate Word Score for Method 2**

The functions below calculate the score of each word that appears in a candidate phrase.

First, every candidate is mapped to the form `(candidate, 1)` and reduced by key, giving the number of times each candidate occurs.

Next, the frequency of each word is computed.

Then the co-occurrence of each word is computed; it is unioned with the frequency and reduced (summed) to give the degree of each word.

Finally, frequency is joined with degree, and degree / frequency gives the score of each word.
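
The steps above can be traced on a tiny input in plain Python. This is only a sketch of the same arithmetic, not the Spark implementation: the three candidate phrases are hypothetical, and it assumes no word repeats within a single phrase.

```python
from collections import Counter

# Hypothetical candidate phrases (not taken from the data set).
candidates = ["prior written consent", "written consent", "consent"]

# Step 1: count how often each candidate phrase occurs.
phrase_counts = Counter(candidates)

frequency = Counter()     # how often each word occurs across all candidates
cooccurrence = Counter()  # how many other words it shares a phrase with

# Steps 2 and 3: accumulate frequency and co-occurrence per word.
for phrase, count in phrase_counts.items():
    words = phrase.split(" ")
    for word in words:
        frequency[word] += count
        cooccurrence[word] += count * (len(words) - 1)

# Step 4: degree = co-occurrence + frequency; score = degree / frequency.
word_score = {
    word: round((cooccurrence[word] + frequency[word]) / frequency[word], 2)
    for word in frequency
}
print(word_score)
# {'prior': 3.0, 'written': 2.5, 'consent': 2.0}
```

A word that only ever appears on its own gets a score of 1.0, while words that mostly appear inside longer phrases score higher.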

In [None]:
# Calculate the frequency contribution of every word in a (candidate, count) pair
def calculate_frequency(word_list):
  words, countt = word_list
  result = []
  
  if len(words.split(" ")) > 1:
    i = 0
    words = words.split(" ")
    while i < len(words):
      result.append((words[i], words.count(words[i]) * countt))
      i += 1
  else:
    result.append((words, countt))
  return result
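
One property of this helper is worth checking on a small input: every occurrence of a word emits `words.count(word) * countt`, so a word that repeats inside a phrase contributes more than once. The snippet below uses a compact restatement under the hypothetical name `calculate_frequency_compact`, so that it runs on its own; the sample phrases are also hypothetical.

```python
def calculate_frequency_compact(word_list):
    # Compact restatement of calculate_frequency: same output for the same input.
    phrase, count = word_list
    words = phrase.split(" ")
    return [(word, words.count(word) * count) for word in words]

# A phrase with distinct words: each word contributes the phrase count once.
distinct = calculate_frequency_compact(("prior written consent", 3))
print(distinct)
# [('prior', 3), ('written', 3), ('consent', 3)]

# A phrase that repeats a word: each of the two occurrences contributes 2 * count,
# so a later reduceByKey would sum this word to 4 rather than 2.
repeated = calculate_frequency_compact(("-- --", 1))
print(repeated)
# [('--', 2), ('--', 2)]
```

If a plain per-word occurrence count is wanted instead, iterating over `set(words)` would be one option; the results shown in this notebook were produced with the helper as written.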

In [None]:
# Calculate the co-occurrence contribution of every word in a (candidate, count) pair
def calculate_cooccurrence(word_list):
  words, countt = word_list
  result = []
  
  if len(words.split(" ")) > 1:
    i = 0
    words = words.split(" ")
    while i < len(words):
      result.append((words[i], words.count(words[i]) * countt * (len(words) - 1)))
      i += 1
  else:
    result.append((words, 0))
  return result

In [None]:
def get_word_score(phrase_candidates):

  # Count how many times each candidate phrase occurs: (phrase, count)
  word_list = phrase_candidates.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

  # Word frequency: for each phrase, multiply the phrase count by the number of times the word occurs in that phrase, then sum per word.
  word_list_frequency = word_list.flatMap(calculate_frequency).map(lambda x: (x[0], x[1])).reduceByKey(lambda x, y: x + y)

  # Word co-occurrence: same as the frequency, additionally multiplied by (phrase length - 1), then summed per word.
  word_list_cooccurrence = word_list.flatMap(calculate_cooccurrence).map(lambda x: (x[0], x[1])).reduceByKey(lambda x, y: x + y)
  
  # Degree: union the co-occurrence with the frequency and sum per word.
  word_list_degree = word_list_cooccurrence.reduceByKey(lambda x, y: x + y).union(word_list_frequency).reduceByKey(lambda x, y: x + y)

  # Word score: join frequency with degree and take the ratio degree / frequency.
  return word_list_frequency.join(word_list_degree).map(lambda x: (x[0], round(x[1][1]*1.0/x[1][0], 2)))

In [None]:
governing_law_method2_word_score = get_word_score(governing_law_method2_candidate)
change_of_control_method2_word_score = get_word_score(change_of_control_method2_candidate)
anti_assignment_method2_word_score = get_word_score(anti_assignment_method2_candidate)

**Show the word score of governing_law**

In [None]:
governing_law_method2_word_score.take(20)

[('agreement', 1.64),
 ('accepted', 1.0),
 ('company', 1.25),
 ('state', 1.1),
 ('nevada', 1.6),
 ('shall', 2.06),
 ('governed', 1.03),
 ('construed', 1.06),
 ('accordance', 1.0),
 ('laws', 1.3),
 ('thereof', 2.83),
 ('prevail', 2.0),
 ('event', 1.0),
 ('conflict', 1.11),
 ('province', 1.0),
 ('ontario', 1.0),
 ('federal', 2.07),
 ('canada', 1.85),
 ('applicable', 2.19),
 ('therein', 3.08)]

In [None]:
governing_law_method2_word_score.take(10)

[('agreement', 1.64),
 ('accepted', 1.0),
 ('company', 1.25),
 ('state', 1.1),
 ('nevada', 1.6),
 ('shall', 2.06),
 ('governed', 1.03),
 ('construed', 1.06),
 ('accordance', 1.0),
 ('thereof', 2.83)]

**Show the word score of change_of_control**

In [None]:
change_of_control_method2_word_score.take(10)

[('purposes', 1.0),
 ('sentence', 2.25),
 ('preceding', 2.2),
 ('without', 2.04),
 ('limiting', 2.8),
 ('generality', 1.0),
 ('merger', 1.09),
 ('consolidation', 1.06),
 ('involving', 2.57),
 ('licensee', 1.78)]

**Show the word score of anti_assignment**

In [None]:
anti_assignment_method2_word_score.take(10)

[('may', 2.82),
 ('assign', 2.24),
 ('sell', 1.61),
 ('lease', 1.22),
 ('otherwise', 1.74),
 ('transfer', 1.45),
 ('whole', 1.0),
 ('party', 2.26),
 ('pursuant', 2.23),
 ('granted', 2.6)]

### **Calculate the keyword score of method 2**

In [None]:
# Split a candidate into (word_in_candidate, candidate) pairs, one pair per word
def split_keywords(word_list):
  result = []
  if len(word_list.split(" ")) > 1:
    words = word_list.split(" ")
    i = 0
    while i < len(words):
      result.append((words[i], word_list))
      i += 1
  else:
    result.append((word_list, word_list))

  return result

In [None]:
def get_sentence_score(candidates, word_score):

  # Pair each word with the distinct candidate phrase it belongs to
  sequences = candidates.distinct().flatMap(split_keywords).map(lambda x: (x[0], x[1]))

  # Join the split candidates with the word scores and sum the scores per candidate
  sequence_score = sequences.join(word_score).map(lambda x: x[1]).reduceByKey(lambda a,b: a+b)

  # Return the candidate scores together with the individual word scores, sorted by descending score
  return sequence_score.union(word_score).sortBy(lambda x: -x[1])
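
The core of this step, scoring a phrase as the sum of its word scores and ranking the phrases, can be sketched without Spark. The word scores and candidates below are hypothetical values chosen for illustration; the function above additionally unions in the per-word scores, which this sketch leaves out.

```python
# Hypothetical word scores, e.g. as produced by a degree / frequency computation.
word_score = {"prior": 3.0, "written": 2.5, "consent": 2.0}

# Distinct candidate phrases to rank.
candidates = {"prior written consent", "written consent", "consent"}

# A phrase's score is the sum of the scores of its words.
phrase_score = {
    phrase: sum(word_score[word] for word in phrase.split(" "))
    for phrase in candidates
}

# Highest score first, mirroring sortBy(lambda x: -x[1]).
ranking = sorted(phrase_score.items(), key=lambda item: -item[1])
print(ranking)
# [('prior written consent', 7.5), ('written consent', 4.5), ('consent', 2.0)]
```

Because scores are summed, longer phrases tend to rank higher, which matches the four-word phrases that dominate the results below.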

In [None]:
governing_law_method2_result = get_sentence_score(governing_law_method2_candidate, governing_law_method2_word_score)
change_of_control_method2_result = get_sentence_score(change_of_control_method2_candidate, change_of_control_method2_word_score)
anti_assignment_method2_result = get_sentence_score(anti_assignment_method2_candidate, anti_assignment_method2_word_score)

**Show the result score of governing_law**

In [None]:
governing_law_method2_result.map(lambda x: (x[0] + "(score: " + str(round(x[1], 2)) + ")")).take(20)

['-- -- -- --(score: 15.2)',
 'either party herein initiate(score: 13.75)',
 'met independently without reference(score: 13.26)',
 'intellectual property right applies(score: 12.25)',
 'german private international law(score: 12.04)',
 'parties hereto expressly attorns(score: 11.84)',
 'agreement shall become valid(score: 11.7)',
 'united states trademark act(score: 11.61)',
 'transactions contemplated herein shall(score: 11.29)',
 'agreement takes effect upon(score: 11.29)',
 'maryland without giving effect(score: 11.16)',
 'issues collateral thereto shall(score: 11.12)',
 'pennsylvania without giving effect(score: 10.91)',
 'intellectual property laws relevant(score: 10.88)',
 'transactions contemplated hereby shall(score: 10.79)',
 'either party may apply(score: 10.78)',
 'new york without recourse(score: 10.62)',
 'massachusetts without giving effect(score: 10.62)',
 'delaware without giving effect(score: 10.56)',
 'nevada without giving effect(score: 10.51)']

**Show the result score of change_of_control**

In [None]:
change_of_control_method2_result.map(lambda x: (x[0] + "(score: " + str(round(x[1], 2)) + ")")).take(20)

['vs key leadership position(score: 14.5)',
 'reasonable detail based upon(score: 13.39)',
 'successor- in-interest expressly assumes(score: 13.08)',
 'post-termination royalty term therefor(score: 13.03)',
 'without thereby becoming liable(score: 12.79)',
 'golf instruction related products(score: 12.57)',
 'advertising agency representing tda(score: 12.53)',
 'ehave companion solution within(score: 12.53)',
 "control '' '' means(score: 12.29)",
 'set aside within ninety(score: 12.25)',
 'providing ebix written notice(score: 12.18)',
 'vs’s outstanding voting securities(score: 12.11)',
 'first refusal shall cease(score: 12.1)',
 'janssen’s confidential information hereunder(score: 12.01)',
 'upon sending written notice(score: 11.95)',
 'authority granted hereunder shall(score: 11.93)',
 'maintenance services performed prior(score: 11.9)',
 'admit additional general partners(score: 11.75)',
 'reynolds group holdings limited(score: 11.71)',
 'dova hereunder whether accruing(score: 11.71

**Show the result score of anti_assignment**

In [None]:
anti_assignment_method2_result.map(lambda x: (x[0] + "(score: " + str(round(x[1], 2)) + ")")).take(20)

['transporter’s ferc gas tariff(score: 15.5)',
 'minimum net worth equal(score: 14.5)',
 'indirect loss thus caused(score: 14.44)',
 'restrictions set forth herein(score: 13.9)',
 'new start-up location franchisees(score: 13.55)',
 'managing group may decide(score: 13.52)',
 'event papa john’s wishes(score: 13.48)',
 'express prior written authorization(score: 13.46)',
 'taken together would constitute(score: 13.33)',
 'current franchise application fee(score: 13.25)',
 'el pollo loco’s sole(score: 12.99)',
 'expressly set forth herein(score: 12.82)',
 'nfla prior written consent(score: 12.67)',
 'forty niners sc without(score: 12.63)',
 'cause forty niners sc(score: 12.5)',
 'reynolds group holdings limited(score: 12.45)',
 'buyer’s current common parent(score: 12.38)',
 'current initial franchise fee(score: 12.25)',
 'proposed new owner’s directors(score: 12.24)',
 'otherwise create derivative works(score: 12.24)']