# Part 1: CUAD Filtering

This notebook filters the raw CUAD dataset down to a smaller, more manageable subset.

This notebook requires a path to the [CUAD](https://www.atticusprojectai.org/cuad) dataset, a copy of which can be stored in Google Drive and accessed from this notebook.

The output of this notebook is a filtered subset of the CUAD contracts, stored in a pickle file. This file is fetched in the next processing step.



## NLP Setup

In [None]:
!python -m spacy download en_core_web_md

In [2]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')

In [3]:
import nltk 

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Import Raw Contracts

In [4]:
import json
import re
import os
import numpy as np
import shutil

In [10]:
# Set these values to point to your Google Drive folder and to the CUAD dataset folder that contains the JSON file (CUAD_v1.json)
ROOT_PATH = '/content/drive/MyDrive/Masters/Thesis/contracts'

CUAD_PATH = f'{ROOT_PATH}/CUAD'

data_path = f'{CUAD_PATH}/CUAD_v1.json'

In [6]:
with open(data_path) as json_file:
    contract_data = json.load(json_file)

raw_contracts = contract_data['data']
print(f'Num Contracts: {len(raw_contracts)}')

Num Contracts: 510


## Contract Dict

In [7]:
# Construct a dictionary to store the token and character length of each contract
def build_contract_dict(contract_data):
    result = {}

    for c in contract_data:
        key = c['title'].strip()
        context = c['paragraphs'][0]['context']
        tokens = nltk.word_tokenize(context)
    
        result[key] = {
            'title': key,
            'num_chars': len(context),
            'num_tokens': len(tokens)
        }
    
    return result


In [8]:
full_contract_dict = build_contract_dict(raw_contracts)

In [9]:
# Example
key = list(full_contract_dict.keys())[0]
print(f'Key: {key}')
print(full_contract_dict[key])

Key: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT
{'title': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT', 'num_chars': 54290, 'num_tokens': 6864}


## Industry Dict

In [11]:
# Construct a dict to capture the "industries" that each contract is associated with
## This can be used to show the diversity of our contract domain 
def build_contract_type_dict(contract_dict):
    result = {}
    missing_docs = 0

    for folder in os.listdir(f'{CUAD_PATH}/full_contract_pdf'):
        for k in os.listdir(f'{CUAD_PATH}/full_contract_pdf/{folder}'):
            if k not in result:
                result[k] = []

            for f in os.listdir(f'{CUAD_PATH}/full_contract_pdf/{folder}/{k}'):
                key = f[:-4].strip()

                if key in contract_dict:
                    result[k].append(contract_dict[key])
                else:
                    missing_docs += 1
    
    print(f'Missing: {missing_docs}')

    result = {key:val for key, val in result.items() if len(val) > 0}
    return result
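
The lookup above derives each key with `f[:-4]`, which assumes every file name ends in a four-character extension such as `.pdf`. A minimal sketch of an extension-agnostic alternative, using `pathlib.Path.stem` from the standard library; the helper name `contract_key` and the sample file names are hypothetical.

```python
from pathlib import Path


def contract_key(file_name):
    # Hypothetical helper: drop only the final extension, whatever its length,
    # then trim surrounding whitespace as the notebook does
    return Path(file_name).stem.strip()


# Sample file names are illustrative only
print(contract_key('EXAMPLE_01_01_2020-EX-10.1-SERVICE AGREEMENT.pdf'))
print(contract_key('EXAMPLE_AGREEMENT.PDF'))
print(contract_key('EXAMPLE_AGREEMENT.docx'))
```

For ordinary `.pdf` files this yields the same key as `f[:-4].strip()`, so it could be swapped in without changing which contracts match.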

In [12]:
def print_ctd_stats(ctd):
    # Print one row per contract type: count, mean tokens, mean chars, mean chars per token
    print(f'{"TYPE":25} {"##":5} {"AV_TOKS":10} {"AV_CHARS":10} {"AV_T_LEN":10}')
    for x, y in ctd.items():
        av_tokens = np.mean([a['num_tokens'] for a in y])
        av_chars = np.mean([a['num_chars'] for a in y])
        print(f'{x:25} {len(y):5} {round(av_tokens):10} {round(av_chars):10} {round(av_chars/av_tokens, 2):10}')

In [13]:
ctd = build_contract_type_dict(full_contract_dict)

Missing: 2


In [14]:
print_ctd_stats(ctd)

TYPE                      ##    AV_TOKS    AV_CHARS   AV_T_LEN  
Collaboration                26      15604      81871       5.25
Affiliate Agreement           1       3434      27015       7.87
Agency Agreements            13      16229      91541       5.64
IP                           17       9672      55232       5.71
Distributor                  31       7081      40683       5.75
Hosting                      20       8536      49593       5.81
Franchise                    15      18281     103918       5.68
Endorsement Agreement         9       3899      22365       5.74
Development                  28      15021      82056       5.46
Maintenance                  34      10199      56979       5.59
Joint Venture _ Filing       14       1594       8629       5.41
Manufacturing                17      10048      54740       5.45
Marketing                    17      11701      63858       5.46
Promotion                    12      15677      86132       5.49
Sponsorship              

## Filtering Length

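We now keep only contracts that are roughly 500 to 3,000 words long. Because the filter operates on character counts, the word bounds are converted to character bounds using an average token length of 5.5 characters, in line with the AV_T_LEN column above.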

In [15]:
def get_context(c):
    return c['paragraphs'][0]['context']

In [16]:
av_token_length = 5.5
MIN_WORDS = 500
MAX_WORDS = 3000

min_chars = MIN_WORDS * av_token_length
max_chars = MAX_WORDS * av_token_length

In [17]:
short_contract_data = [x for x in contract_data['data'] if min_chars < len(get_context(x)) < max_chars]
print(len(short_contract_data))

109
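
The character bounds used above follow directly from the word bounds: 500 × 5.5 = 2,750 and 3,000 × 5.5 = 16,500 characters. A minimal sketch of the same check as a standalone function; the name `within_length_bounds` and the sample strings are hypothetical, and the 5.5 characters-per-token figure is the approximation used above.

```python
AV_TOKEN_LENGTH = 5.5  # approximate characters per token, as in the notebook
MIN_WORDS = 500
MAX_WORDS = 3000


def within_length_bounds(text, min_words=MIN_WORDS, max_words=MAX_WORDS,
                         av_token_length=AV_TOKEN_LENGTH):
    # Hypothetical helper: convert word bounds to character bounds and
    # apply the same strict inequalities as the list comprehension above
    min_chars = min_words * av_token_length
    max_chars = max_words * av_token_length
    return min_chars < len(text) < max_chars


print(MIN_WORDS * AV_TOKEN_LENGTH, MAX_WORDS * AV_TOKEN_LENGTH)  # 2750.0 16500.0
print(within_length_bounds('x' * 2750))   # False: exactly on the lower bound
print(within_length_bounds('x' * 8000))   # True
print(within_length_bounds('x' * 20000))  # False: too long
```

Both bounds are exclusive, so a contract of exactly 2,750 or 16,500 characters is dropped.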


In [18]:
# Rebuild our contract dicts on our filtered set
short_contract_dict = build_contract_dict(short_contract_data)

In [19]:
short_ctd = build_contract_type_dict(short_contract_dict)

Missing: 402


In [20]:
print_ctd_stats(short_ctd)

TYPE                      ##    AV_TOKS    AV_CHARS   AV_T_LEN  
Collaboration                 1        470       2811       5.98
Agency Agreements             2       1426       7620       5.34
IP                            4       2039      11726       5.75
Distributor                   8       1968      11339       5.76
Hosting                       3       1940      11019       5.68
Franchise                     5        855       4747       5.55
Endorsement Agreement         3       2114      13016       6.16
Development                   4       2382      13247       5.56
Maintenance                   8       1430       8245       5.77
Joint Venture _ Filing        3        874       4482       5.13
Manufacturing                 3       2257      11585       5.13
Promotion                     2       2288      12828       5.61
Sponsorship                   8       1946      10612       5.45
Service                      11       1670       9267       5.55
Outsourcing              

## Filtering Sentences

spaCy can only process sentences below a certain length, so we filter out any contract that exceeds this threshold. Contracts that fail are likely not following regular contract conventions.

In [21]:
def custom_format(text):
    t1 =  text.replace('\n', ' ')
    t2 = re.sub(' +', ' ', t1)
    return t2.lower()
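
To make the effect of `custom_format` concrete, the sketch below reproduces it on a made-up snippet. Note that the pattern `' +'` collapses runs of spaces only; tabs and other whitespace pass through unchanged. The sample text is illustrative and not taken from CUAD.

```python
import re


def custom_format(text):
    # Same logic as the notebook cell: newlines become spaces,
    # runs of spaces collapse to one, and the text is lowercased
    t1 = text.replace('\n', ' ')
    t2 = re.sub(' +', ' ', t1)
    return t2.lower()


sample = 'THIS AGREEMENT\n\nis made   on the 1st day\nof January.'
print(custom_format(sample))
# this agreement is made on the 1st day of january.
```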

In [22]:
short_contract_dict['KUBIENT,INC_07_02_2020-EX-10.14-MASTER SERVICES AGREEMENT_Part2']

{'title': 'KUBIENT,INC_07_02_2020-EX-10.14-MASTER SERVICES AGREEMENT_Part2',
 'num_chars': 7600,
 'num_tokens': 1459}

In [24]:
doc_dict = {}

# Run all contracts through the SpaCy NLP pipeline. 
## This will catch and filter out any contracts that contain long sentences
for sc in short_contract_data:
    key = sc['title'].strip()
    next_text = get_context(sc)
    next_text = custom_format(next_text)

    try:
        doc = nlp(next_text)
        doc_dict[key] = doc
    except Exception as e:
        # Report which contract failed and why, then skip it
        print(f'{key}: {e}')



## Saving the Pickle File

In [None]:
SAVE_PATH = f'{ROOT_PATH}/rq3_actual'
FILE_NAME = 'selected_docs.pickle'

In [None]:
import pickle 

# Save as pickle
with open(f'{SAVE_PATH}/{FILE_NAME}', 'wb') as f:
    pickle.dump(doc_dict, f)
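
The next processing step reads this file back with `pickle.load`. A minimal sketch of the round trip, assuming a plain dictionary as a stand-in for `doc_dict` (the real values are spaCy `Doc` objects) and a temporary directory in place of `SAVE_PATH`; the key and value shown are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for doc_dict: the notebook stores spaCy Doc objects as values
stand_in_docs = {'EXAMPLE_CONTRACT': 'this agreement is made on ...'}

with tempfile.TemporaryDirectory() as save_path:
    file_path = os.path.join(save_path, 'selected_docs.pickle')

    # Write in binary mode, as in the cell above
    with open(file_path, 'wb') as f:
        pickle.dump(stand_in_docs, f)

    # Read it back the way a later step might
    with open(file_path, 'rb') as f:
        loaded_docs = pickle.load(f)

print(loaded_docs == stand_in_docs)  # True
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code while loading.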

## Outputting Initial Set

We also copy the PDFs of the initial set into a separate folder for manual inspection. This step is optional.

In [None]:
key_type_dict = {}
for t in short_ctd.keys():
    for c in short_ctd[t]:
        key_type_dict[c['title'].strip()] = t

In [None]:
# Ensure this folder exists in your drive
copy_folder = f'{ROOT_PATH}/rq3_actual/inspect/pdf'
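
Rather than creating the folder by hand, it can be created from code. A minimal sketch using `os.makedirs` with `exist_ok=True`, demonstrated against a temporary directory as a stand-in for `ROOT_PATH`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root_path:
    # Stand-in for ROOT_PATH; the sub-path mirrors the one used above
    copy_folder = os.path.join(root_path, 'rq3_actual', 'inspect', 'pdf')

    # Creates intermediate folders as needed; no error if it already exists
    os.makedirs(copy_folder, exist_ok=True)
    os.makedirs(copy_folder, exist_ok=True)  # safe to call again

    folder_created = os.path.isdir(copy_folder)

print(folder_created)  # True
```

In the notebook itself, the same `os.makedirs(copy_folder, exist_ok=True)` call could be placed before the copy loop, assuming Google Drive is already mounted.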

In [None]:
pdf_folder = f'{CUAD_PATH}/full_contract_pdf'

for p in os.listdir(pdf_folder):
    for t in os.listdir(f'{pdf_folder}/{p}'):
        for f in os.listdir(f'{pdf_folder}/{p}/{t}'):
            key = f[:-4].strip()
            if key in key_type_dict:
                pdf_src = f'{pdf_folder}/{p}/{t}/{f}'
                pdf_dst = f'{copy_folder}/{f}'
                shutil.copy(pdf_src, pdf_dst)
