# Thoughtriver Technical Exercise
The objective is to build a system that identifies section headers in legal contracts. Each contract is represented as a JSON file containing the text of every paragraph in the original document. We want to classify each paragraph as either a section header or not, based on its text.

In [3]:
import re
import json
from pathlib import Path

In [6]:
path_to_data = Path('./document-headers/files/')

In [42]:
input_name = 'letter.json'

In [43]:
path_to_input = path_to_data / input_name

In [44]:
def read_document_json(path_to_input):
    """
    Reads a JSON file and returns its paragraphs as a list of strings

    Args:
        path_to_input (pathlib.Path): Path to the input JSON file

    Returns:
        List of paragraph strings, one per entry's 'p_text' field
    """
    with path_to_input.open('r') as f_in:
        document_json = json.load(f_in)
    
    return [paragraph['p_text'] for paragraph in document_json]
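
The function above assumes the file holds a JSON array of objects, each with at least a `p_text` key. A minimal sketch of that shape follows, using a temporary file. The `sample.json` name and the two-entry sample list are illustrative assumptions, not part of the dataset, and the real files may carry additional keys.

```python
import json
import tempfile
from pathlib import Path


def read_document_json(path_to_input):
    """Read a JSON file and return the list of paragraph strings."""
    with path_to_input.open('r') as f_in:
        document_json = json.load(f_in)
    return [paragraph['p_text'] for paragraph in document_json]


# Hypothetical sample data: a list of objects, each carrying a 'p_text' field.
sample = [
    {'p_text': '1 DISCLOSURE'},
    {'p_text': '1.1 In consideration of our disclosure to you ...'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample to disk, then read it back through the same function.
    sample_path = Path(tmp_dir) / 'sample.json'
    sample_path.write_text(json.dumps(sample))
    sample_paragraphs = read_document_json(sample_path)

print(sample_paragraphs)
```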

## Naive Regex Approach

In the highlighted document the headers follow a regular structure: a section number followed by the header text, e.g. `5. Exclusions:`. This regularity suggests that a regex-based approach may be enough to identify them.

In [45]:
paragraphs = read_document_json(path_to_input)

In [46]:
paragraphs

['[NAME OF RECIPIENT]',
 '[ADDRESS]',
 '[DATE] 2014',
 'Dear [●],',
 'non-disclosure of confidential information',
 '1 DISCLOSURE',
 '1.1 In consideration of our disclosure to you of [DESCRIBE CONFIDENTIAL INFORMATION] and other information (whether or not contained in documents) relating to this information ("Protected Material") for the purposes of [PURPOSE OF DISCLOSURE] ("Purpose"), you will keep the Protected Material confidential. Accordingly, for a period of [two] years from the date of this letter, you will not, without our prior written consent, either:',
 '1.1.1 communicate or otherwise make available the Protected Material to any third party except in accordance with paragraph 1.2; or',
 '1.1.2 use the Protected Material for any purpose other than the Purpose.',
 '1.2 You may disclose the Protected Material to the minimum extent required by: any order of any competent court or judicial, governmental or regulatory body; or the laws or regulations of any country with jurisdict

In [47]:
# First attempt: only uppercase letters (no spaces or lowercase) are allowed before the colon
comp = re.compile(r'[0-9]+\.\s+[A-Z]*:')

In [28]:
# Second attempt: a capital first letter, then any characters up to a colon
comp = re.compile(r'[0-9]+\.\s+[A-Z].+:')

In [29]:
res = comp.match('2. Confidentiality Period:')

In [30]:
res

<_sre.SRE_Match object; span=(0, 26), match='2. Confidentiality Period:'>
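
The two patterns tried above differ only in what they allow between the section number and the colon. The sketch below compares them on three strings that appear in the documents; the names `strict`, `relaxed`, `samples` and `results` are introduced here for illustration only.

```python
import re

# First attempt: only uppercase letters are allowed before the colon.
strict = re.compile(r'[0-9]+\.\s+[A-Z]*:')
# Second attempt: an uppercase first letter, then any characters up to a colon.
relaxed = re.compile(r'[0-9]+\.\s+[A-Z].+:')

samples = [
    '2. Confidentiality Period:',
    '4. Return of information',
    '1 DISCLOSURE',
]

# Map each sample to (matched by strict, matched by relaxed).
results = {
    text: (bool(strict.match(text)), bool(relaxed.match(text)))
    for text in samples
}

for text, (strict_hit, relaxed_hit) in results.items():
    print(f'{text!r}: strict={strict_hit}, relaxed={relaxed_hit}')
```

Only the relaxed pattern matches the mixed-case header. Neither pattern matches a header without a trailing colon or without a period after the number, so both requirements need to be loosened to cover those cases.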

In [31]:
def identify_header_with_regex(paragraph_text, compiled_regexes):
    """
    Returns True if any of the regexes matches at the start of the text

    Args:
        paragraph_text (str): Input string
        compiled_regexes: List of compiled regex objects

    Returns:
        True if at least one regex matches, False otherwise
    """
    for comp in compiled_regexes:
        if comp.match(paragraph_text):
            return True
    return False
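
Because the function relies on `match`, each pattern is tested only at the start of the paragraph, so numbered sub-clauses such as `1.1 ...` are not picked up by a pattern that expects whitespace straight after the section number. A small usage sketch follows, assuming the pattern from the next cell and a handful of paragraphs from the sample output; the `any`-based body is an equivalent rewrite used here for brevity, and `samples` and `flags` are illustrative names.

```python
import re


def identify_header_with_regex(paragraph_text, compiled_regexes):
    """Return True if any compiled regex matches at the start of the text."""
    return any(comp.match(paragraph_text) for comp in compiled_regexes)


# Section number, optional period, whitespace, then text starting with a capital.
regex_list = [re.compile(r'[0-9]+(\.?)\s+[A-Z].+')]

samples = [
    'Dear [●],',
    '[DATE] 2014',
    '1 DISCLOSURE',
    '1.1 In consideration of our disclosure to you ...',
    '1.1.1 communicate or otherwise make available ...',
    '5. Exclusions:',
]

# One boolean per sample paragraph, in the same order.
flags = [identify_header_with_regex(text, regex_list) for text in samples]

for text, flag in zip(samples, flags):
    print(flag, text)
```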

In [41]:
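header_indices = []

# Section number, optional period, whitespace, then text starting with a capital letter
regex_list = [
    re.compile(r'[0-9]+(\.?)\s+[A-Z].+'),
    # re.compile(r'[A-Z]+\s')
]

# Record the index of every paragraph that matches a header pattern
for idx, paragraph in enumerate(paragraphs):
    if identify_header_with_regex(paragraph, regex_list):
        header_indices.append(idx)
        print(paragraph)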

1. Purpose or Use:
2. Confidentiality Period:
3. Standard of Care:
4. Return of information
5. Exclusions:
6. Required Disclosure:
7. Warranty:
8. Rights:
9. Export:
10. No Obligations:
11. Relationship:
12. Severability:
13. Governing Law and dispute resolution:
14. Term and Termination:
15. General:


In [39]:
header_indices

[22, 25, 27, 30, 32, 38, 39, 40, 42, 44, 46, 48, 50, 52, 55, 57, 59]