<a href="https://colab.research.google.com/github/Perriex/NLP_uOttawa/blob/AST_ONE/NLP_Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

## Parna Asadi, Pouya Khodaee, Melika Morafegh

The goal of this project was to analyze a corpus of legal contracts from the Contract Understanding Atticus Dataset (CUAD v1). This involved tokenizing the text, counting word frequencies, and computing summary statistics to characterize the vocabulary of legal language.


In [7]:
!unzip ./CUAD_v1.zip

Archive:  ./CUAD_v1.zip
   creating: CUAD_v1/
  inflating: CUAD_v1/CUAD_v1.json    
  inflating: CUAD_v1/CUAD_v1_README.txt  
   creating: CUAD_v1/full_contract_pdf/
   creating: CUAD_v1/full_contract_pdf/Part_I/
   creating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/
  inflating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement.pdf  
  inflating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605784_EX-10.27_Affiliate Agreement.pdf  
  inflating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/DigitalCinemaDestinationsCorp_20111220_S-1_EX-10.10_7346719_EX-10.10_Affiliate Agreement.pdf  
  inflating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/LinkPlusCorp_20050802_8-K_EX-10_3240252_EX-10_Affiliate Agreement.pdf  
  inflating: CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/SouthernStarEnergyInc_20051202_SB-2A_EX-9_801890_E

In [23]:
import re
import os

### Data Preparation:

*   Loaded the legal text documents from the Atticus dataset.
*   Concatenated the text from all documents to form a single corpus.
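
As a minimal sketch of this step, the snippet below reads every `.txt` file in a folder and joins the contents into one string. The helper name `load_corpus`, the explicit UTF-8 encoding, and the sorted file order are assumptions made for illustration, not the notebook's own loading code.

```python
from pathlib import Path

def load_corpus(folder):
    """Concatenate the text of every .txt file in `folder` (hypothetical helper)."""
    parts = []
    # Sort the paths so the concatenation order is reproducible across runs.
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" avoids a crash on stray bytes; this is an assumption,
        # not something the dataset documentation prescribes.
        parts.append(path.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(parts)
```

Joining with a newline keeps the last word of one file from fusing with the first word of the next.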

In [33]:
def tokenize_text(text):
  """
  Tokenizes the given text into a list of tokens,
  handling end-of-sentence dots appropriately.

  Args:
    text: The input text string.

  Returns:
    A list of tokens.
  """

  # Remove extra whitespace and newlines
  text = re.sub(r'\s+', ' ', text).strip()

  # Split text into potential words (including punctuation)
  words = re.findall(r"[\w']+|[.,!?;:]", text)

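  # Keep alphabetic tokens. Keep a period only when the next token is not
  # alphabetic, or when it is the last token. Everything else (numbers,
  # tokens with apostrophes, other punctuation) is dropped.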
  tokens = []
  for i, word in enumerate(words):
    if word.isalpha():
      tokens.append(word)
    elif word == ".":
      if i + 1 < len(words) and not words[i + 1].isalpha():
        tokens.append(word)
      elif i == len(words) - 1:
        tokens.append(word)

  return tokens

### Tokenization:

* Implemented a regex-based tokenizer that separates words from punctuation marks and discards numbers and other symbols.
* Treated punctuation explicitly: periods are the only punctuation marks that can be kept as tokens, and only under the conditions coded in `tokenize_text`.
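
To make the regular-expression step concrete, here is a small sketch of the pattern `[\w']+|[.,!?;:]` applied to a short sentence. The sample sentence and the variable names are illustrative assumptions; the pattern itself is the one used in `tokenize_text`.

```python
import re

# Same pattern as in tokenize_text: runs of word characters/apostrophes,
# or a single punctuation mark from the listed set.
PATTERN = r"[\w']+|[.,!?;:]"

sample = "The Party's notice, dated 2020, is final."  # illustrative sentence
raw = re.findall(PATTERN, sample)
print(raw)

# Tokens that are not purely alphabetic (numbers, words with apostrophes)
# fail str.isalpha(), so a filter based on isalpha() drops them.
alphabetic = [tok for tok in raw if tok.isalpha()]
print(alphabetic)
```

Possessives such as `Party's` and numerals such as `2020` are matched by the pattern but removed by an `isalpha()` filter, which lowers the token counts relative to a tokenizer that keeps them.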

In [35]:
# Assuming the text files are in a folder named "full_contract_txt"
folder_path = "CUAD_v1/full_contract_txt"

# Initialize empty lists to store tokens and all text
all_tokens = []
all_text = ""

for filename in os.listdir(folder_path):
  # Read the text file
  with open(os.path.join(folder_path, filename), 'r') as f:
    file_text = f.read()
    all_text += file_text

    # Tokenize the text
    tokens = tokenize_text(file_text)
    all_tokens.extend(tokens)

In [36]:
# Write all tokens to a file (output.txt) for analysis
with open("output.txt", 'w') as f:
  f.writelines(f"{token}\n" for token in all_tokens)  # One token per line

### Word Counting and Analysis:

* Counted the occurrences of each token in the corpus.
* Calculated and reported key statistics:
  * Total number of tokens
  * Number of unique tokens (types)
  * Type-token ratio
  * Number of tokens appearing only once (hapax legomena)
* Extracted words by excluding punctuation.
* Analyzed word frequencies and calculated lexical diversity (type-token ratio for words).
* Filtered words to exclude stopwords using a custom stopword list.
* Analyzed filtered word frequencies and calculated the type-token ratio of the filtered words, which the code reports as "lexical density".
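
The statistics in this list can be illustrated on a toy token list. This is a sketch under assumed inputs: the token list, the tiny stopword set, and the variable names are made up for the example and are not corpus values.

```python
from collections import Counter

tokens = ["the", "party", "shall", "notify", "the", "other", "party", "."]  # toy data
stop_words = {"the", "shall", "other"}  # illustrative stopword set

counts = Counter(tokens)
total = len(tokens)                               # number of tokens
types = len(counts)                               # number of distinct tokens
ttr = types / total                               # type-token ratio
hapax = [t for t, c in counts.items() if c == 1]  # tokens seen exactly once

words = [t for t in tokens if t.isalpha()]                   # drop punctuation
content = [w for w in words if w.lower() not in stop_words]  # drop stopwords
content_ttr = len(set(content)) / len(content)               # ratio after filtering

print(total, types, f"{ttr:.4f}", hapax, f"{content_ttr:.4f}")
```

On a corpus of millions of tokens the type-token ratio is far smaller than in this toy case, because frequent words repeat many times while the vocabulary grows slowly.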

In [37]:
from collections import Counter

# Count token frequencies
token_counts = Counter(all_tokens)

# Total number of tokens
total_tokens = len(all_tokens)

# Number of unique tokens (types)
unique_tokens = len(token_counts)

# Type-token ratio
type_token_ratio = unique_tokens / total_tokens

# Print statistics
print(f"Total Tokens: {total_tokens}")
print(f"Unique Tokens (Types): {unique_tokens}")
print(f"Type-Token Ratio: {type_token_ratio:.4f}")

Total Tokens: 3973224
Unique Tokens (Types): 38504
Type-Token Ratio: 0.0097


In [39]:
with open("tokens.txt", 'w') as f:
  for token, count in token_counts.most_common():
    f.write(f"{token}: {count}\n")

In [40]:
# Count occurrences of each token
token_counts = Counter(all_tokens)

# Find tokens that appeared only once (hapax legomena).
# Note: this rebinds unique_tokens from the type count (an int) to a list.
unique_tokens = [token for token, count in token_counts.items() if count == 1]

print(f"Number of tokens that appeared only once: {len(unique_tokens)}")

Number of tokens that appeared only once: 11782


In [41]:
# Extract only words (excluding punctuation and other symbols)
words = [token for token in all_tokens if token.isalpha()]

# Count word frequencies
word_counts = Counter(words)

# Total number of words
total_words = len(words)

# Number of unique words (word types)
unique_words = len(word_counts)

# Lexical diversity (type-token ratio for words)
lexical_diversity = unique_words / total_words

print(f"Total Words: {total_words}")
print(f"Unique Words (Word Types): {unique_words}")
print(f"Lexical Diversity: {lexical_diversity:.4f}")

# Print top 20 most frequent words
print("Top 20 Most Frequent Words:")
for word, count in word_counts.most_common(20):
  print(f"{word}: {count}")

Total Words: 3873485
Unique Words (Word Types): 38503
Lexical Diversity: 0.0099
Top 20 Most Frequent Words:
the: 239996
of: 151811
and: 128997
to: 127311
or: 106442
in: 74268
any: 58853
shall: 48424
a: 46607
by: 42048
be: 39165
Agreement: 37013
for: 35480
this: 35216
such: 34815
with: 32574
as: 31636
that: 27281
other: 25063
is: 21544


In [50]:
from collections import Counter
from nltk import ngrams
import pandas as pd

In [53]:
# Load the custom stopword list; the single column is assumed to hold one stopword per row
signs = pd.read_fwf('stop_words.txt', header=None, names=['stop_words'])

# A set gives constant-time membership checks when filtering millions of words
stop_words = set(signs['stop_words'].tolist())
stop_words.add(' ')

def removeStopWordsInRows(arry):
    new_words = [word for word in arry if word not in stop_words]
    return new_words

In [55]:
# Remove stopwords from the list of words using the custom list loaded above.
# The check is case-insensitive, but the kept words retain their original case.
filtered_words = [word for word in words if word.lower() not in stop_words]

# Count word frequencies after removing stopwords
filtered_word_counts = Counter(filtered_words)

# Total number of filtered words
total_filtered_words = len(filtered_words)

# Number of unique filtered words
unique_filtered_words = len(filtered_word_counts)

# Lexical density (type-token ratio for words without stopwords)
lexical_density = unique_filtered_words / total_filtered_words

print(f"Total Filtered Words (without stopwords): {total_filtered_words}")
print(f"Unique Filtered Words: {unique_filtered_words}")
print(f"Lexical Density: {lexical_density:.4f}")

# Print top 20 most frequent words after removing stopwords
print("Top 20 Most Frequent Words (without stopwords):")
for word, count in filtered_word_counts.most_common(20):
  print(f"{word}: {count}")

Total Filtered Words (without stopwords): 1840424
Unique Filtered Words: 37101
Lexical Density: 0.0202
Top 20 Most Frequent Words (without stopwords):
Agreement: 37013
Party: 19212
Section: 12406
party: 11043
Company: 9941
Product: 8852
Parties: 7492
set: 6873
written: 6735
applicable: 6477
information: 6342
right: 6216
rights: 6202
respect: 6183
terms: 6101
Products: 5718
notice: 5709
parties: 5431
Date: 5406
prior: 5243


### Bigram Analysis:

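* Computed the frequencies of bigrams over the word list after excluding stopwords and punctuation. Because filtering happens first, a bigram pairs words that are adjacent in the filtered list, not necessarily in the original text (for example, `('terms', 'conditions')`).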
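
Here is a short sketch of the bigram step, using `zip` from the standard library in place of `nltk.ngrams` so the example has no dependencies. The sample words and the stopword set are assumptions made for illustration.

```python
from collections import Counter

words = ["terms", "and", "conditions", "of", "this", "Agreement",
         "and", "the", "terms", "and", "conditions", "herein"]  # toy data
stop_words = {"and", "of", "this", "the"}  # illustrative stopword set

filtered = [w for w in words if w.lower() not in stop_words]

# Pair each word with its successor; equivalent to 2-grams over the list.
bigrams = list(zip(filtered, filtered[1:]))
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))
```

In the toy input, "terms" and "conditions" are always separated by "and", yet they form the most frequent bigram once the stopword is removed.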

In [56]:
from nltk import ngrams

# Create bigrams (pairs of consecutive words)
bigrams = list(ngrams(filtered_words, 2))

# Count bigram frequencies
bigram_counts = Counter(bigrams)

# Print top 20 most frequent bigrams
print("Top 20 Most Frequent Bigrams:")
for bigram, count in bigram_counts.most_common(20):
  print(f"{bigram}: {count}")

Top 20 Most Frequent Bigrams:
('Confidential', 'Information'): 2870
('written', 'notice'): 2371
('Effective', 'Date'): 2264
('terms', 'conditions'): 1902
('prior', 'written'): 1807
('set', 'Section'): 1755
('Intellectual', 'Property'): 1636
('written', 'consent'): 1323
('pursuant', 'Section'): 1308
('United', 'States'): 1256
('termination', 'Agreement'): 1204
('term', 'Agreement'): 1140
('obligations', 'Agreement'): 1098
('terminate', 'Agreement'): 1093
('terms', 'Agreement'): 1078
('Securities', 'Exchange'): 1041
('meaning', 'set'): 1036
('directly', 'indirectly'): 956
('intellectual', 'property'): 949
('filed', 'separately'): 873


In [60]:
from tabulate import tabulate

def generate_table(total_tokens, unique_tokens, type_token_ratio,
                   token_counts, words, word_counts,
                   filtered_words, filtered_word_counts, bigram_counts):
  """
  Generates the table for the corpus analysis.

  Args:
    total_tokens: Total number of tokens.
    unique_tokens: Number of unique tokens.
    type_token_ratio: Type-token ratio for all tokens.
    token_counts: Counter object with token frequencies.
    words: List of words (excluding punctuation).
    word_counts: Counter object with word frequencies (excluding punctuation).
    filtered_words: List of words (excluding punctuation and stopwords).
    filtered_word_counts: Counter object with word frequencies (excluding punctuation and stopwords).
    bigram_counts: Counter object with bigram frequencies.

  Returns:
    A list of lists representing the table data.
  """

  # Calculate number of tokens appearing only once
  tokens_once = sum(1 for count in token_counts.values() if count == 1)

  # Calculate type-token ratio for words (excluding punctuation)
  type_token_ratio_words = len(set(words)) / len(words)

  # Calculate type-token ratio for words (excluding punctuation and stopwords)
  type_token_ratio_filtered = len(set(filtered_words)) / len(filtered_words)

  # Get top 3 most frequent words
  top_3_words = [f"{word}: {count}" for word, count in word_counts.most_common(3)]

  # Get top 3 most frequent words (excluding stopwords)
  top_3_filtered_words = [f"{word}: {count}" for word, count in filtered_word_counts.most_common(3)]

  # Get top 3 most frequent bigrams
  top_3_bigrams = [f"{bigram}: {count}" for bigram, count in bigram_counts.most_common(3)]

  # Create the table data
  table_data = [
      ["# of tokens (b)", total_tokens],
      ["# of types (b)", len(token_counts)],
      ["type/token ratio (b)", f"{type_token_ratio:.4f}"],
      ["tokens appeared only once (d)", tokens_once],
      ["# of words (excluding punctuation) (e)", len(words)],
      ["type/token ratio (excluding punctuation) (e)", f"{type_token_ratio_words:.4f}"],
      ["List the top 3 most frequent words and their frequencies (e)", ", ".join(top_3_words)],
      ["type/token ratio (excluding punctuation and stopwords) (f)", f"{type_token_ratio_filtered:.4f}"],
      ["List the top 3 most frequent words and their frequencies (excluding stopwords) (f)", ", ".join(top_3_filtered_words)],
      ["List the top 3 most frequent bigrams", ", ".join(top_3_bigrams)]
  ]

  return table_data

# Example Usage:
# Assuming you have the necessary variables from your analysis
table_data = generate_table(total_tokens, unique_tokens, type_token_ratio,
                      token_counts, words, word_counts,
                      filtered_words, filtered_word_counts, bigram_counts)

# Print the table using tabulate
print(tabulate(table_data, headers=["Metric", "Value"], tablefmt="grid"))

+------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
| Metric                                                                             | Value                                                                                           |
+====================================================================================+=================================================================================================+
| # of tokens (b)                                                                    | 3973224                                                                                         |
+------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
| # of types (b)                                                                     | 38504                                                                                           |
+--------------------------------------------------------------------------