<a href="https://colab.research.google.com/github/RichardMWarburton/ExploringCUAD/blob/Dev/Named%20Entity%20Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition

This notebook investigates extracting named entities from the governing law annotations.





## The Data

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review

https://arxiv.org/abs/2103.06268

This code is an adaptation of the scrape.py file available in the CUAD GitHub repository. It has been adapted to run in Jupyter notebooks and to let us step through the code line by line.

## 1: Import Packages & Define Useful Functions

In [1]:
from zipfile import ZipFile
import json
import os
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import re
from random import sample, choice
import numpy as np
import pandas as pd
import string
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.cluster import AgglomerativeClustering

In [2]:
def extract_zip(pth,data_pth = None):
    """Extract the contents of a zip file to data_pth (the working directory if data_pth is not passed)."""
    with ZipFile(pth, 'r') as zipObj:
        #Extract all the contents of the zip file to the target directory
        zipObj.extractall(data_pth)

## 2: Download repository and extract data

In [3]:
#Download CUAD git repository
if not os.path.exists('main.zip'):
  !wget --no-check-certificate https://github.com/TheAtticusProject/cuad/archive/refs/heads/main.zip
  !unzip -q main.zip

#If it has not already been extracted, extract the contents of data.zip
if not os.path.exists('cuad-main/data'):
  os.makedirs('cuad-main/data')

if not os.path.exists('cuad-main/data/CUADv1.json'):
  extract_zip('cuad-main/data.zip','cuad-main/data/')

#Download a manually curated set of labels for the full CUAD data.
if not os.path.exists('labels3.txt'):
  !wget https://raw.githubusercontent.com/RichardMWarburton/ExploringCUAD/main/labels3.txt

--2021-07-16 13:48:39--  https://github.com/TheAtticusProject/cuad/archive/refs/heads/main.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/TheAtticusProject/cuad/zip/refs/heads/main [following]
--2021-07-16 13:48:39--  https://codeload.github.com/TheAtticusProject/cuad/zip/refs/heads/main
Resolving codeload.github.com (codeload.github.com)... 140.82.114.9
Connecting to codeload.github.com (codeload.github.com)|140.82.114.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘main.zip’

main.zip                [   <=>              ]  17.77M  21.3MB/s    in 0.8s    

2021-07-16 13:48:40 (21.3 MB/s) - ‘main.zip’ saved [18631176]

--2021-07-16 13:48:41--  https://raw.githubusercontent.com/RichardMWarburton/ExploringCUAD/main/labels3.txt
Resolving raw.githubuserconten

In [4]:
#Load CUADv1 JSON to contract_data
with open('cuad-main/data/CUADv1.json','r') as infile:
    contract_data = json.load(infile)

### 2.1: Read in Label Data & Generate Look Up Dictionary

In [5]:
#Initiate storage for labels look up (LU)
labels_LU = {}

#Read in labels data
with open('labels3.txt','r',encoding ='UTF-8') as infile:
  for line in infile:
    #Remove trailing special characters and split on tab
    data = line.strip().split(sep='\t')
    #Add name and label to labels_LU dictionary
    labels_LU[data[0]] = data[1]

The look up fails for one contract title, most likely because of an accented 'E' and a mismatch in encoding. For now, the label for that contract is set manually to 'marketing agreement' (TODO: either fix the mismatch or provide an example).
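If the failed look up really is caused by an accented character stored in two different Unicode forms (an assumption, as the cause has not been confirmed), normalizing both the keys and the queried titles would make them comparable. The sketch below uses a made-up title and a hypothetical `normalize_title` helper and `toy_labels_LU` dictionary; none of these are part of the notebook.

```python
import unicodedata

def normalize_title(title):
    """Return the title in NFC form so composed and decomposed accents compare equal."""
    return unicodedata.normalize('NFC', title)

#Made-up title written two ways: a precomposed 'É' (U+00C9) and
#a plain 'E' followed by a combining acute accent (U+0301)
composed = 'CAF\u00c9 Marketing Agreement'
decomposed = 'CAFE\u0301 Marketing Agreement'

#Hypothetical look up keyed on normalized titles
toy_labels_LU = {normalize_title(composed): 'marketing agreement'}

#The raw strings differ, but the normalized title is found in the look up
print(composed == decomposed)
print(normalize_title(decomposed) in toy_labels_LU)
```

If the mismatch has a different cause (for example a file read with the wrong codec), normalization alone will not help and the manual fallback remains necessary.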

### 2.2: Extract Raw Contract Data

In [6]:
#Set regular expression for characters to remove from contract text
spec_chars = '\\n|\\t'

#Set number of contracts in data
num_contracts = len(contract_data['data'])

#Initiate dictionary to store raw contract data
raw_contracts = defaultdict(list)

#for each contract
for i in range(num_contracts):
  #Append the title and label to the raw_contracts dictionary
  title = contract_data['data'][i]['title']
  raw_contracts['contract title'].append(title)
  raw_contracts['label'].append(labels_LU.get(title,'marketing agreement')) #<- manual error trap applied here (see above)
  
  #Parse raw text and process to remove breaks
  raw_text = contract_data['data'][i]['paragraphs'][0]['context']
  clean_text = re.sub(spec_chars,'',raw_text)

  #Split clean text into sentences and tokens
  sentance_text = clean_text.split(sep = '. ')
  token_text = clean_text.split(sep = ' ')

  #Append text to the respective key in the raw_contracts dictionary
  raw_contracts['raw text'].append(raw_text)
  raw_contracts['clean text'].append(clean_text)
  raw_contracts['sentance text'].append(sentance_text)
  raw_contracts['token text'].append(token_text)
  
  #Add character, sentence and token counts to raw_contracts dictionary
  raw_contracts['character count'].append(len(raw_text))
  raw_contracts['sentance count'].append(len(sentance_text))
  raw_contracts['token count'].append(len(token_text))


In [7]:
#Sanity check that the value lists for each key have the correct length (510)
for key in raw_contracts:
  print(key,len(raw_contracts[key]))

contract title 510
label 510
raw text 510
clean text 510
sentance text 510
token text 510
character count 510
sentance count 510
token count 510


### 2.3: Extract Clause Specific Data

In [8]:
#Define the number of clause categories
num_clauses = 41

#Initiate dictionary to store clause data
clause_data = defaultdict(list)

#For each contract
for i in range(num_contracts):
  title = contract_data['data'][i]['title']
  #for each clause category
  for j in range(num_clauses):
    qa = contract_data['data'][i]['paragraphs'][0]['qas'][j]
    #for each annotation found for this clause
    for answer in qa['answers']:
      #Add the contract title, label, clause name and annotation details
      clause_data['contract title'].append(title)
      clause_data['label'].append(labels_LU.get(title,'marketing agreement'))  #<- manual error trap applied here
      clause_data['clause'].append(qa['id'].split(sep='__')[1])
      clause_data['annotation'].append(answer['text'])
      clause_data['annotation start'].append(answer['answer_start'])
      clause_data['annotation length'].append(len(answer['text']))


In [9]:
#Sanity check that the value lists for each key have the correct length (13823)
for key in clause_data:
  print(key,len(clause_data[key]))

contract title 13823
label 13823
clause 13823
annotation 13823
annotation start 13823
annotation length 13823


In [10]:
np.unique(clause_data['clause'])

array(['Affiliate License-Licensee', 'Affiliate License-Licensor',
       'Agreement Date', 'Anti-Assignment', 'Audit Rights',
       'Cap On Liability', 'Change Of Control',
       'Competitive Restriction Exception', 'Covenant Not To Sue',
       'Document Name', 'Effective Date', 'Exclusivity',
       'Expiration Date', 'Governing Law', 'Insurance',
       'Ip Ownership Assignment', 'Irrevocable Or Perpetual License',
       'Joint Ip Ownership', 'License Grant', 'Liquidated Damages',
       'Minimum Commitment', 'Most Favored Nation',
       'No-Solicit Of Customers', 'No-Solicit Of Employees',
       'Non-Compete', 'Non-Disparagement', 'Non-Transferable License',
       'Notice Period To Terminate Renewal', 'Parties',
       'Post-Termination Services', 'Price Restrictions', 'Renewal Term',
       'Revenue/Profit Sharing', 'Rofr/Rofo/Rofn', 'Source Code Escrow',
       'Termination For Convenience', 'Third Party Beneficiary',
       'Uncapped Liability', 'Unlimited/All-You-Can-Eat

## 3: Cleaning data and extracting a single clause

In [11]:
#Initiate dataframe of all clause data
clause_df = pd.DataFrame(clause_data)

#Convert to lower case
#clause_df['annotation'] = clause_df['annotation'].apply(lambda x: x.lower())

#Remove any formatting characters or multiple spaces and replace with a single space
clause_df['annotation'] = clause_df['annotation'].apply(lambda x: re.sub(r'\t|\r|\n|\s{2,}',' ',x))

#Remove punctuation from the string
clause_df['annotation'] = clause_df['annotation'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

In [12]:
#Define clause of interest
clause_of_interest = 'Governing Law'

#Limit df to clause of interest and extract annotations of interest
of_interest_data = clause_df[clause_df['clause'] == clause_of_interest]
annotations_of_interest = of_interest_data['annotation'].values

#Identify where there are multiple annotations per contract
titles,counts = np.unique(of_interest_data['contract title'],return_counts =True)
dups = titles[counts >= 2]

#Output Analysis
print('There are {} contracts with \'{}\' annotations'.format(titles.shape[0],clause_of_interest))
print('There are {} contracts with more than one annotation'.format(dups.shape[0]))

There are 437 contracts with 'Governing Law' annotations
There are 25 contracts with more than one annotation


From the above we can see that: 

1.   Contracts may have multiple annotations for the same clause
2.   Not all contracts have an annotation of interest (here, 437 of the 510 contracts do)

Provisionally, we will concatenate all such annotations for a contract into one string. This string then represents all the salient points for the contract and clause in question.
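As a rough illustration of the concatenation idea, the sketch below joins annotations per contract with a pandas `groupby`. The `toy_df` frame and its titles and annotations are made up for the example; this is one possible approach, not necessarily the one used in the cells that follow.

```python
import pandas as pd

#Toy stand-in for of_interest_data (titles and annotations are made up)
toy_df = pd.DataFrame({
    'contract title': ['Contract A', 'Contract A', 'Contract B'],
    'annotation': ['governed by the laws of Missouri',
                   'executed in the State of Missouri',
                   'governed by English law'],
})

#Join every annotation belonging to the same contract with a single space,
#keeping contracts in order of first appearance
toy_combined = toy_df.groupby('contract title', sort=False)['annotation'].agg(' '.join)

print(toy_combined.to_dict())
```

Contracts with a single annotation pass through unchanged, so the result has exactly one string per contract.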

In [13]:
#output duplicate annotations and contract titles
dup_df = of_interest_data[of_interest_data['contract title'].isin(dups)][['contract title','annotation']]

#print sample of duplicate annotations
for i in dup_df.index[:8]:
  print(dup_df.loc[i,'contract title'])
  print(repr(dup_df.loc[i,'annotation']))
  #print(dup_df.loc[i,'annotation'].split(sep=' '))
  print('\n')

ChinaRealEstateInformationCorp_20090929_F-1_EX-10.32_4771615_EX-10.32_Content License Agreement
'This Termination Agreement shall be governed by the laws of the PRC without regard to conflicts of law principles'


ChinaRealEstateInformationCorp_20090929_F-1_EX-10.32_4771615_EX-10.32_Content License Agreement
'This Agreement and any dispute or claim arising out of or in connection with it or its subject matter shall be governed by and construed in accordance with the laws of the Peoples Republic of China without regard to its conflicts of laws rules that would mandate the application of the laws of another jurisdiction'


LOYALTYPOINTINC_11_16_2004-EX-10.2-RESELLER AGREEMENT
'This Agreement shall be subject to and governed by the laws of the State of Missouri USA'


LOYALTYPOINTINC_11_16_2004-EX-10.2-RESELLER AGREEMENT
'This Agreement shall be deemed to have been made and executed in the State of Missouri and any dispute arising thereunder shall be resolved in accordance with the laws o

**THE ABOVE COULD BE DISPLAYED BETTER**
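One possible way to tidy this display is to print each contract title once, followed by its annotations as a numbered, wrapped list. The sketch below works on made-up `(title, annotation)` pairs; `toy_rows` and `format_grouped` are hypothetical names introduced for illustration only.

```python
import textwrap
from collections import defaultdict

#Toy (title, annotation) pairs standing in for the rows of dup_df
toy_rows = [
    ('Contract A', 'This Agreement shall be subject to and governed by the laws of the State of Missouri USA'),
    ('Contract A', 'This Agreement shall be deemed to have been made and executed in the State of Missouri'),
    ('Contract B', 'This Agreement is governed by English law'),
]

def format_grouped(rows, width=60):
    """Return one text block per title, with numbered, wrapped annotations."""
    #Collect annotations under their contract title
    grouped = defaultdict(list)
    for title, annotation in rows:
        grouped[title].append(annotation)

    blocks = []
    for title, annotations in grouped.items():
        lines = [title]
        for n, annotation in enumerate(annotations, start=1):
            #Wrap long annotations and indent continuation lines under the number
            lines.append(textwrap.fill(annotation, width=width,
                                       initial_indent='  {}. '.format(n),
                                       subsequent_indent='     '))
        blocks.append('\n'.join(lines))
    return blocks

toy_blocks = format_grouped(toy_rows)
print('\n\n'.join(toy_blocks))
```

With the real data, the rows could be built from the 'contract title' and 'annotation' columns of `dup_df`.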

In [14]:
#Initiate storage for annotations within contracts
combined_annotations_list = defaultdict(list)
combined_annotations_string = {}

#For each annotation of interest found in the contract,
#append annotation to a default dict list with contract as key
for i in of_interest_data.index:
  name = of_interest_data.loc[i,'contract title']
  annotation = of_interest_data.loc[i,'annotation']
  combined_annotations_list[name].append(annotation)

#Produce a single string of all annotations found in each contract
for key in combined_annotations_list.keys():
  combined_annotations_string[key] = ' '.join(combined_annotations_list[key])

In [42]:
#Build array of contract names and concatenated annotations
contracts = np.array(list(combined_annotations_string.keys()))
combined_annotations = list(combined_annotations_string.values())

## 4: Testing Named Entity Extraction

In [35]:
txt = combined_annotations[0]
txt

'This Agreement is to be construed according to the laws of the State of Illinois'

In [25]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

#Download the tokenizer and part-of-speech tagger models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [26]:
def preprocess(sent):
    """Tokenize a sentence and tag each token with its part of speech."""
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [27]:
sent = preprocess(txt)
sent

[('This', 'DT'),
 ('Agreement', 'NNP'),
 ('is', 'VBZ'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('construed', 'VBN'),
 ('according', 'VBG'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('laws', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('State', 'NNP'),
 ('of', 'IN'),
 ('Illinois', 'NNP')]

In [29]:
#Chunk grammar: an NP is an optional preposition (IN) followed by a proper noun (NNP)
pattern = 'NP: {<IN>?<NNP>}'
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  This/DT
  (NP Agreement/NNP)
  is/VBZ
  to/TO
  be/VB
  construed/VBN
  according/VBG
  to/TO
  the/DT
  laws/NNS
  of/IN
  the/DT
  (NP State/NNP)
  (NP of/IN Illinois/NNP))


In [31]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('This', 'DT', 'O'),
 ('Agreement', 'NNP', 'B-NP'),
 ('is', 'VBZ', 'O'),
 ('to', 'TO', 'O'),
 ('be', 'VB', 'O'),
 ('construed', 'VBN', 'O'),
 ('according', 'VBG', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('laws', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('State', 'NNP', 'B-NP'),
 ('of', 'IN', 'B-NP'),
 ('Illinois', 'NNP', 'I-NP')]


In [33]:
import spacy
from spacy import displacy
import en_core_web_sm

#Load the small English spaCy pipeline
nlp = en_core_web_sm.load()


In [56]:
doc = nlp(combined_annotations[1])
pprint([(X.text, X.label_) for X in doc.ents])

[('English', 'LANGUAGE'), ('English', 'NORP')]


In [57]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(This, 'O', ''),
 (Agreement, 'O', ''),
 (is, 'O', ''),
 (governed, 'O', ''),
 (by, 'O', ''),
 (English, 'B', 'LANGUAGE'),
 (law, 'O', ''),
 (and, 'O', ''),
 (the, 'O', ''),
 (parties, 'O', ''),
 (submit, 'O', ''),
 (to, 'O', ''),
 (the, 'O', ''),
 (exclusive, 'O', ''),
 (jurisdiction, 'O', ''),
 (of, 'O', ''),
 (the, 'O', ''),
 (English, 'B', 'NORP'),
 (courts, 'O', ''),
 (in, 'O', ''),
 (relation, 'O', ''),
 (to, 'O', ''),
 (any, 'O', ''),
 (dispute, 'O', ''),
 (contractual, 'O', ''),
 (or, 'O', ''),
 (noncontractual, 'O', ''),
 (concerning, 'O', ''),
 (this, 'O', ''),
 (Agreement, 'O', ''),
 (save, 'O', ''),
 (that, 'O', ''),
 (either, 'O', ''),
 (party, 'O', ''),
 (may, 'O', ''),
 (apply, 'O', ''),
 (to, 'O', ''),
 (any, 'O', ''),
 (court, 'O', ''),
 (for, 'O', ''),
 (an, 'O', ''),
 (injunction, 'O', ''),
 (or, 'O', ''),
 (other, 'O', ''),
 (relief, 'O', ''),
 (to, 'O', ''),
 (protect, 'O', ''),
 (its, 'O', ''),
 (Intellectual, 'O', ''),
 (Property, 'O', ''),
 (Rights, 'O', '')]


## 5: Parsing Annotations for Named Entities

In [59]:
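#Initiate list to store the extracted entities for each annotation
new_features = []

for annotation in combined_annotations:
  #Entities found in this annotation
  annotation_GPEs = []

  #Run the spaCy pipeline over the annotation
  doc = nlp(annotation)

  #Keep geopolitical entities (GPE) and languages (LANGUAGE, e.g. 'English')
  for X in doc.ents:
    if X.label_ in ('GPE','LANGUAGE'):
      annotation_GPEs.append(X.text)

  new_features.append(annotation_GPEs)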



In [66]:
for i in range(10):
  print('Annotation Raw Text:')
  print(combined_annotations[i],'\n')
  print('Extracted Locations:')
  print(new_features[i],'\n')
  print('*'*75)

  


Annotation Raw Text:
This Agreement is to be construed according to the laws of the State of Illinois 

Extracted Locations:
['the State of Illinois'] 

***************************************************************************
Annotation Raw Text:
This Agreement is governed by English law and the parties submit to the exclusive jurisdiction of the English courts in relation to any dispute contractual or noncontractual concerning this Agreement save that either party may apply to any court for an injunction or other relief to protect its Intellectual Property Rights 

Extracted Locations:
['English'] 

***************************************************************************
Annotation Raw Text:
It will be governed by the law of the Peoples Republic of China otherwise it is governed by United Nations Convention on Contract for the International Sale of Goods 

Extracted Locations:
['the Peoples Republic of China'] 

*******************************************************************

In [61]:
combined_annotations[7]

'This Agreement and any and all matters arising directly or indirectly herefrom shall be governed by and construed and enforced in accordance with the internal laws of the  applicable to agreements made and to be performed entirely in such state including its statutes of limitation but without giving effect to the conflict of law principles thereof'