## File Input

First, import the files. To run this notebook, make sure the path in the following code points to your full_contract_txt folder. We define a list named content, where content[i] holds the text of the $i^{th}$ file in the folder full_contract_txt.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import re
import spacy
# pip install spacy
# python -m spacy download en_core_web_sm
import nltk
# pip install nltk
import collections

# Input folder Path, please modify it when running this file.
path = "/content/drive/MyDrive/ipynb/nlp_a1/full_contract_txt"
os.chdir(path)
content=[]
def read_text_file(file_path):
    with open(file_path, 'r',encoding='utf-8') as f:
        content.append(f.read())
# The working directory is the input folder, so each file name is also its path.
for file in os.listdir():
    if file.endswith(".txt"):
        read_text_file(file)
# May need to modify the path of stop words txt file.
path=os.getcwd()
path=path.replace("/full_contract_txt","")
os.chdir(path)
with open("stop_words.txt", 'r', encoding='utf-8') as f:
    stopword=f.read()
stopword = re.findall(r'\w+', stopword)
# The output and token txt files are written to the current working directory (set above). Call os.chdir() here to change it.

## Define the Tokenizer

I wrote a custom tokenizer using the re library. The re.findall function finds all non-overlapping parts of a string that match a pattern and returns them in a list. \\
I decided to use a custom tokenizer because other tokenizers may separate dates, times, runs of symbols like "!!!!!", or decimal numbers like 5985.2 into multiple tokens, whereas I want each of them to be a single token. \\
Here is a brief explanation of the pattern string: \\


*   The prefix r' ' starts a raw pattern string, so backslashes reach the regex engine unchanged.
*   "\w" matches one word character: A to Z, a to z, 0 to 9, and _.
*   "\s" matches one whitespace character.
*   "\[\]" defines a character class. It can list exact values or ranges: \[a-z\] means a to z, and \[,.?\] means only ,.?.
*   "^" at the start of a character class means "not". So \[^\w\s\] matches any character that is neither a word character nor whitespace, and therefore only picks symbols like ^@$+*/#%! (but not _, which counts as a word character).
*   "+" means one or more. For example, \w picks one character at a time ("a book" produces "\['a','b','o','o','k'\]"), and \w+ picks a whole word ("\['a','book'\]").
*   "|" means "or" and separates alternative patterns, so a string that matches any one of these patterns is added to the list.

Note that alternatives are tried from left to right: once an alternative matches at a position, Python does not try the ones after it. For example, with the pattern "r'\w|\w+'", the input "a book" produces "\['a','b','o','o','k'\]" instead of "\['a','book'\]". This is why the more specific alternatives must come first in a pattern.
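
The left-to-right behaviour of "|" can be checked directly. The following is a minimal sketch that uses only the standard re library; the variable names are made up for illustration.

```python
import re

text = "a book"

one_char = re.findall(r'\w', text)          # one word character per match
one_word = re.findall(r'\w+', text)         # one or more word characters per match
short_first = re.findall(r'\w|\w+', text)   # \w always wins, so \w+ is never tried
long_first = re.findall(r'\w+|\w', text)    # \w+ wins, so whole words are kept

print(one_char)     # ['a', 'b', 'o', 'o', 'k']
print(one_word)     # ['a', 'book']
print(short_first)  # ['a', 'b', 'o', 'o', 'k']
print(long_first)   # ['a', 'book']

# [^\w\s]+ picks runs of symbols; "_" is a word character, so it is skipped.
symbols = re.findall(r'[^\w\s]+', "Let's go!!! _ok_")
print(symbols)      # ["'", '!!!']
```

Swapping the order of the two alternatives is enough to change the result, which is the reason the order inside a pattern matters.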



In [None]:
test_text="2015/02/21, I moved to New York. Let's go! It's one-way. I spent $5985.2!!!!!!!!"
print("\nThe original text is: \n",test_text)

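# The main part of the tokenizer. Alternatives are tried from left to right:
# three-part numbers (dates), two-part numbers (decimals, times), words, then runs of symbols.
pattern=r'\d+[^\w\s]+\d+[^\w\s]+\d+|\d+[^\w\s]+\d+|\w+|[^\w\s]+'
tokens=re.findall(pattern, test_text)
tokens=[token.lower() for token in tokens]

print("\nThe tokenizer used in the corpus would produce:\n",tokens)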


The original text is: 
 2015/02/21, I moved to New York. Let's go! It's one-way. I spent $5985.2!!!!!!!!

The tokenizer used in the corpus would produce:
 ['2015/02/21', ',', 'i', 'moved', 'to', 'new', 'york', '.', 'let', "'", 's', 'go', '!', 'it', "'", 's', 'one', '-', 'way', '.', 'i', 'spent', '$', '5985.2', '!!!!!!!!']
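
To see what the two number alternatives add, the tokenizer can be compared with a simpler pattern that only has the word and symbol alternatives. This is a rough sketch: the helper name tokenize, the name naive_pattern, and the sample sentence are assumptions made for illustration and are not part of the assignment code.

```python
import re

# Same pattern as the tokenizer above.
pattern = r'\d+[^\w\s]+\d+[^\w\s]+\d+|\d+[^\w\s]+\d+|\w+|[^\w\s]+'
# Hypothetical simpler pattern without the date/decimal alternatives.
naive_pattern = r'\w+|[^\w\s]+'

def tokenize(text, regex=pattern):
    """Return the lowercased tokens found by the given pattern."""
    return [token.lower() for token in re.findall(regex, text)]

sample = "Due 2015/02/21 at 10:30, pay $5985.2!!!"

custom_tokens = tokenize(sample)
naive_tokens = tokenize(sample, naive_pattern)

print(custom_tokens)
# ['due', '2015/02/21', 'at', '10:30', ',', 'pay', '$', '5985.2', '!!!']
print(naive_tokens)
# ['due', '2015', '/', '02', '/', '21', 'at', '10', ':', '30', ',', 'pay', '$', '5985', '.', '2', '!!!']
```

With the simpler pattern the date, the time, and the decimal number are each split into several tokens, which is the behaviour the custom pattern is meant to avoid.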


I also wrote a function that expands contractions and normalizes verb tense.
The spaCy library can handle contractions. In the first loop of the token\_fixer function, token.lemma\_ expands each token that contains an apostrophe. This step may produce nested lists (for example, \["Let's",'go'\] becomes \[\['Let','us'\],'go'\]), so the second loop flattens them (\[\['Let','us'\],'go'\] becomes \['Let','us','go'\]). The third loop applies the stemmer from the NLTK library to normalize verb tense. \\
The function takes a tokenized list as input and returns a new list with contractions expanded and verb tense normalized. However, it was $\textbf{too computationally expensive}$ to apply to the corpus in the question (no output for more than 15 minutes). Therefore I instead apply this function to a simple text for demonstration. The following code is the token fixer. \\

In [1]:
import re
import nltk
import spacy

# Load the spaCy model and create the stemmer once; doing this for every token is very slow.
nlp = spacy.load("en_core_web_sm")
stemmer = nltk.PorterStemmer()

pattern_for_fix=r'\d+[^\w\s]+\d+[^\w\s]+\d+|\w+[^\w\s]+\w+|\w+|[^\w\s]+'
def token_fixer(tokenlist):
    word = []
    extend_word = []
    root_word = []
    # Loop 1: let spaCy split and lemmatize tokens that contain an apostrophe.
    for token in tokenlist:
        if re.match(r'\w+\'\w+', token):
            word.extend(t.lemma_ for t in nlp(token))
        else:
            word.append(token)
    # Loop 2: flatten any nested lists left by the previous step.
    for item in word:
        if isinstance(item, list):
            extend_word.extend(item)
        else:
            extend_word.append(item)
    # Loop 3: stem every token to normalize verb tense.
    for token in extend_word:
        root_word.append(stemmer.stem(token))
        #root_word.append(nltk.WordNetLemmatizer().lemmatize(token))
    return root_word

# The use of the tokenizer with the decontract and lemma function.
word=re.findall(pattern_for_fix, test_text)
root_word=token_fixer(word)
print("The tokenizer with the fixing function would produce:\n",root_word)

## Solve Question 1

# 1 (a)
Now apply the tokenizer to the full contract. The list token_full contains all numbers, words and symbols in the folder full_contract_txt.

In [None]:
token_words=[]
token_full=[]
for i in range(len(content)):
    tokens = re.findall(pattern, content[i])
    #tokens = re.findall(pattern_for_fix, content[i])
    # extend() appends in place; __add__ would copy the whole list for every file.
    token_full.extend(tokens)
#token_full=token_fixer(token_full)
token_full=[word.lower() for word in token_full]

As the question requires, write the result to a file called "output.txt".

In [None]:

with open("output.txt", 'w', encoding='utf-8') as f:
    f.write(f"{token_full}")

# 1 (b)
Here we used the collections library. Its Counter class accepts a list as input and outputs a Counter object that shows the frequency of every item in the input list. For example, if we have a list "\[1,2,1,2,1,2,2,1,1,3,2,1,2,3,3,4,5,5\]", then the collections.Counter function produces "Counter(\{1: 6, 2: 6, 3: 3, 5: 2, 4: 1\})". The length of the Counter object is the number of unique elements in the input list, which is 5 in the example.
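
The example list can be run as a small sketch. The variable names below are chosen for illustration; only collections.Counter and its most_common method come from the standard library.

```python
import collections

items = [1, 2, 1, 2, 1, 2, 2, 1, 1, 3, 2, 1, 2, 3, 3, 4, 5, 5]
counts = collections.Counter(items)

print(counts)                  # Counter({1: 6, 2: 6, 3: 3, 5: 2, 4: 1})
print(len(items))              # 18 items in total
print(len(counts))             # 5 unique items
print(counts.most_common(2))   # [(1, 6), (2, 6)]

# Type/token ratio of the toy list: unique items divided by all items.
type_token_ratio = len(counts) / len(items)

# Number of items that appear exactly once (only the 4 here).
appear_once = sum(1 for count in counts.values() if count == 1)
print(appear_once)             # 1
```

The same three quantities (total count, unique count, and items with frequency 1) are what the following parts of the question compute on the corpus.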

In [None]:
frequency=collections.Counter(token_full)
length_token=len(token_full)
unique_token=len(frequency)
print("The total number of tokens is:", length_token)
print("The total number of unique tokens is:", unique_token)
print("The type/token ratio of the corpus is:", unique_token/length_token)

The total number of tokens is: 4730682
The total number of unique tokens is: 38413
The type/token ratio of the corpus is: 0.008119970862552164


# 1 (c)
As the question requires, write the result to a file called "token.txt".

In [None]:
with open("token.txt", 'w', encoding='utf-8') as f:
    f.write(f"{frequency}")


# 1 (d)
Now apply a loop that counts the number of tokens in the Counter object whose frequency is 1.

In [None]:
token_once=0
for word in frequency:
    if frequency[word]==1:
        token_once+=1
print(f"{token_once} tokens appeared only once in the corpus.")

14437 tokens appeared only once in the corpus.


# 1 (e)
Intuitively, we should not consider "_" or numbers to be words, but the \w+ pattern treats them as words. So apply the re.search function here, which returns a match object (treated as True) if there is at least one a to z letter in the string and None otherwise. Every token was already lowercased, so checking a to z is enough. The variable token_words is the tokenizer output with only words.
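
A small sketch of this filter on a made-up token list (the tokens below are illustrative and do not come from the corpus):

```python
import re

tokens = ['section', '2', '_', '3rd', '$', '2015/02/21', 'party', '!!!']

# re.search returns a match object when a letter is found and None otherwise.
has_letter = re.search(r'[a-z]', 'section') is not None
no_letter = re.search(r'[a-z]', '2015/02/21') is None

# Keep only the tokens that contain at least one lowercase letter.
words = [token for token in tokens if re.search(r'[a-z]', token)]
print(words)  # ['section', '3rd', 'party']
```

Note that a mixed token such as '3rd' is kept, because one letter is enough to pass the check.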

In [None]:
token_words = [word for word in token_full if re.search(r'[a-z]+', word)]
frequency_word=collections.Counter(token_words)
length_token_word=len(token_words)
unique_token_word=len(frequency_word)
print("The total number of words is:", length_token_word)
print("The total number of unique words is:", unique_token_word)
print("The lexical diversity(type/word ratio) of the corpus is:", unique_token_word/length_token_word)

The total number of words is: 3920618
The total number of unique words is: 27721
The lexical diversity(type/word ratio) of the corpus is: 0.007070568976625623


# 1 (f)
The file "stop_words.txt" has one word per line, so we apply the re.findall function again with the pattern r'\w+', which places all words into a list with one word per element. This produces the stopword list. Then we need a new list token_without_stopwords that contains all words in the list token_words that are not in the list stopword. I initially programmed a loop for this step, but when I tried the filter function with a lambda function, a method I found online, the computation ran faster. \\
The filter function accepts a function and a list as input and produces a filter object. It calls the input function on every element of the input list and keeps only the elements for which the function returns True (or any other truthy value). Here the lambda function is a quick way to define a function. It returns True if the item is not in the list stopword and False otherwise, so stop words are dropped.
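
The filter step can be tried on a toy example. In this sketch the token and stop word lists are made up, and the set-based variant is a suggestion that is not part of the original code: membership tests on a set take roughly constant time, while a list is scanned element by element.

```python
tokens = ['the', 'parties', 'agree', 'to', 'the', 'terms']
stopword = ['the', 'to', 'of']

# Same approach as in the notebook: keep items for which the lambda returns True.
kept = list(filter(lambda item: item not in stopword, tokens))
print(kept)       # ['parties', 'agree', 'terms']

# Assumed optimization: look words up in a set instead of a list.
stopword_set = set(stopword)
kept_fast = [token for token in tokens if token not in stopword_set]
print(kept_fast)  # ['parties', 'agree', 'terms']
```

Both versions give the same list; only the lookup cost differs.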

In [None]:

token_without_stopwords = list(filter(lambda item: item not in stopword, token_words))
frequency_without_stopwords=collections.Counter(token_without_stopwords)
length_token_without_stopwords=len(token_without_stopwords)
unique_token_word_stopwords=len(frequency_without_stopwords)
print("the top 20 most frequent words and their frequencies are:",frequency_without_stopwords.most_common(20))
print("The lexical diversity(type/word ratio) of the corpus is:", unique_token_word_stopwords/length_token_without_stopwords)

the top 20 most frequent words and their frequencies are: [('agreement', 43655), ('party', 33277), ('parties', 13523), ('section', 13350), ('company', 12638), ('information', 10943), ('product', 10923), ('date', 10181), ('products', 8201), ('rights', 8067), ('services', 7890), ('applicable', 7540), ('business', 7343), ('set', 7058), ('confidential', 6916), ('written', 6818), ('terms', 6714), ('right', 6681), ('term', 6676), ('notice', 6660)]
The lexical diversity(type/word ratio) of the corpus is: 0.014535317221483754


# 1 (g)
The NLTK library has a function called bigrams, which returns a generator of bigrams. If the input is a string, it pairs every two adjacent characters in one tuple (for example, "list(nltk.bigrams("abcd"))" produces "\[('a', 'b'), ('b', 'c'), ('c', 'd')\]"). If the input is a list, it pairs every two adjacent elements in one tuple ("list(nltk.bigrams(\['a','b','c','d'\]))" produces "\[('a', 'b'), ('b', 'c'), ('c', 'd')\]").
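
The pairing itself can be illustrated without NLTK. The helper simple_bigrams below is a hypothetical pure-Python stand-in written for this sketch; it is not the NLTK implementation, but it produces the same pairs for these inputs.

```python
import collections

def simple_bigrams(sequence):
    """Pair every element with its right-hand neighbour."""
    items = list(sequence)
    return list(zip(items, items[1:]))

print(simple_bigrams("abcd"))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(simple_bigrams(['a', 'b', 'c', 'd']))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]

# Counting bigrams works the same way as counting single tokens.
toy_tokens = ['confidential', 'information', 'confidential', 'information', 'party']
toy_counts = collections.Counter(simple_bigrams(toy_tokens))
print(toy_counts.most_common(1))
# [(('confidential', 'information'), 2)]
```

Because the bigrams are built from the list with stop words removed, a pair can join two words that were separated by a stop word in the original text.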

In [None]:
consecutive_words=list(nltk.bigrams(token_without_stopwords))
frequency_consecutive_words=collections.Counter(consecutive_words)
print("the top 20 most frequent consecutive words and their frequencies are:",frequency_consecutive_words.most_common(20))

the top 20 most frequent consecutive words and their frequencies are: [(('confidential', 'information'), 3607), (('intellectual', 'property'), 2936), (('effective', 'date'), 2840), (('written', 'notice'), 2386), (('terms', 'conditions'), 2087), (('set', 'section'), 1826), (('prior', 'written'), 1814), (('term', 'agreement'), 1709), (('confidential', 'treatment'), 1540), (('termination', 'agreement'), 1434), (('parties', 'agree'), 1417), (('securities', 'exchange'), 1410), (('receiving', 'party'), 1368), (('pursuant', 'section'), 1353), (('written', 'consent'), 1330), (('party', 'party'), 1313), (('united', 'states'), 1269), (('applicable', 'law'), 1249), (('disclosing', 'party'), 1210), (('agreement', 'party'), 1206)]
