# Final Project - Notice and comment
Project by Jason Danker, Proxima DasMohapatra, Ankur Kumar, Emily Witt and Kinshuk
## Notebook title - Summarize regulation text


**Overview**: 
This notebook contains functions to extract the following information from a document:
* Keywords
* Top relevant sentences

**Input**
Reads data stored in pickle format, as extracted by Kinshuk's notebook.


**Output**
`list_of_keywords`, `top_relevant_sentences` (at most 4), `summary`


**Usage**
Call the `extract_summary` function with the document text as a parameter.


**Notes**
* We tried TextTiling to segment the text by topic before extracting keywords, themes, etc. It did not improve keyword quality, and it increased the number of keywords, which defeats the purpose of summarizing the text. We decided not to use the algorithm in our processing pipeline.
* The function takes into account only unigrams and bigrams while extracting top relevant sentences.
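
To make the second note concrete, here is a minimal sketch of counting unigram and bigram keyword hits in one sentence. The name `ngram_hits` is hypothetical, and the sketch splits on whitespace for brevity, whereas the notebook code further down uses `nltk.word_tokenize` to build bigrams.

```python
def ngram_hits(sentence, unigrams, bigrams):
    """Count keyword matches among the single words and adjacent word pairs of a sentence."""
    words = sentence.lower().split()
    # Unigram hits: every word that is itself a keyword
    uni = sum(1 for w in words if w in unigrams)
    # Bigram hits: every adjacent pair whose joined form is a keyword phrase
    bi = sum(1 for pair in zip(words, words[1:]) if " ".join(pair) in bigrams)
    return uni + bi


hits = ngram_hits("The research room opens Monday", {"research", "room"}, {"research room"})
# A three-word keyword phrase never matches, because only pairs are compared
ignored = ngram_hits("national archives building", set(), {"national archives building"})
print(hits, ignored)
```

Longer phrases contribute nothing to the score under this scheme, which is the limitation the note describes.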

In [1]:
# Import List
from pickle import dump, load
import nltk
from nltk import word_tokenize,FreqDist
import re
from nltk.corpus import wordnet as wn
from nltk.util import ngrams

**This cell has the code for keyword extraction. The result is used directly as the top keywords and is also fed into the sentence extraction algorithm.**

In [2]:
#Input: Raw text 
#Output: supplementary information part and summary part 
#Explanation: The function cleans raw text by removing newline characters, return characters and HTML tags. 
#            Then we separate the text into summary and supplementary information
def get_document_text(raw_text):
    """ This function takes the raw document text we receive from the API and returns the cleaned
    supplementary information and summary parts of the document. It removes HTML tags, asterisks,
    newline characters and return characters.
    
    INPUT: string
    OUTPUT: tuple of two strings (supplementary information, summary)
    """
    raw_text = raw_text.replace('\n',' ')
    raw_text = raw_text.replace('*','') # added
    raw_text = raw_text.replace('\r',' ') # added
    supp_info_idx = raw_text.find("SUPPLEMENTARY INFORMATION:")
    summary_idx = raw_text.find("SUMMARY:")
    dates_idx = raw_text.find("DATES:")
    suppl_info = raw_text[supp_info_idx+26:] # To leave out the string 'Supplementary Information'
    summary = raw_text[summary_idx+8:dates_idx]
    # Remove any residual HTML tags in text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', suppl_info)
    cleansummary = re.sub(cleanr, '', summary)
    return cleantext, cleansummary

#Input: text sentence 
#Output: List of tokens
#Explanation: uses regex and a regex tokenizer

def tokenize_text(corpus):
    pattern = r'''(?x)    # set flag to allow verbose regexps
    (([A-Z]\.)+)       # abbreviations, e.g. B.C.
    |(\w+([-']\w+)*)       # words with optional internal hyphens e.g. after-ages or author's
    '''
    tokens = nltk.regexp_tokenize(corpus,pattern)
    all_token = [word.lower() for token in tokens for word in token if word != "" 
                 and word[0] != "'" and word[0] != "-"]
    return all_token

#Input: corpus
#Output: Nested list with one list of tokens per sentence
#Explanation: first tokenize sentences and run it through word tokenizer

def tokenize_text_sent(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences    
    return [tokenize_text(sent) for sent in raw_sents]

#Input: list of tokenized words in sentences (nested list)
#Output: tagged tokens
#Explanation: uses NLTK's default part-of-speech tagger
def tag_my_text(sents):
    return [nltk.pos_tag(sent) for sent in sents]

#Input: None
#Output: A noun phrase chunker based on regex
#Explanation: Chunk noun phrases in tree
def noun_phrase_chunker():
    grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
    """
    cp = nltk.RegexpParser(grammar)
    return cp

#Input: tagged sentences list and noun phrase chunker object
#Output: all noun phrase chunks
#Explanation: Extract only the NP marked phrases from the parse tree, that is the chunk we defined
def noun_phrase_extractor(sentences, chunker):
    res = []
    for sent in sentences:
        tree = chunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP' : 
                res.append(subtree[0:len(subtree)])
                #res.append(subtree[0])
                #print(subtree)
    return res


#Input: tagged text
#Output: A noun phrase list 
#Explanation: Drop the POS tags and keep only the words of each noun phrase
def noun_phrase_finder(tagged_text):
    all_proper_noun = noun_phrase_extractor(tagged_text,noun_phrase_chunker()) 
    #does not literally mean proper noun. Chunker only extracts common noun
    noun_phrase_list = []                                                      
    #noun_phrase_string_list =[]
    for noun_phrase in all_proper_noun:
        if len(noun_phrase) > 0: 
            small_list =[]
            for (word,tag) in noun_phrase:
                small_list.append(word)
            noun_phrase_list.append(small_list)
            #noun_phrase_string_list.append(' '.join(small_list))
    return noun_phrase_list

#Input: noun phrase list
#Output: dictionary mapping noun phrase length to the count of noun phrases of that length
#Explanation: Get the frequency distribution of lengths over all the noun phrases extracted. 
#             Something of the form {1:45,2:23}: how many one-word and two-word chunks there are, etc.
def get_length_np(nounPhrase):
    np_length={}
    for inner_np in nounPhrase:
        np_length[len(inner_np)] = np_length.get(len(inner_np),0) + 1
    return np_length

#Input: Nested list of all noun phrases and the length of np 
#Output: a frequency distribution object
#Explanation: get freq dist obj for noun phrase of different lengths
def find_freq(nested_list,nest_len):
    #from nltk.probability import FreqDist
    fdist_list =[]
    for inner_np in nested_list:
        if len(inner_np) == nest_len:
            fdist_list.append(' '.join(inner_np))
    fdist = FreqDist(fdist_list)
    return fdist

#Input: Noun Phrase list
#Output: Master list: one list of top noun phrases per phrase length, ordered by length
#Explanation: Make a grand list of the top occurring noun phrases of each size.
#             extract_summary passes this list to get_top_sents, which reads unigrams at index 0 and bigrams at index 1
def get_top_np(np):
    master_common_list=[]
    # Sort the lengths so index 0 holds unigrams and index 1 bigrams (assuming both lengths occur)
    len_list = sorted(get_length_np(np).keys())
    for item in len_list:
        fdist_np = find_freq(np,item)
        master_common_list.append(fdist_np.most_common(15))
    return master_common_list

#Input: Noun Phrase list
#Output: Top unigrams 
#Explanation: Most frequent unigrams longer than 3 characters, added in order of frequency
#             until their counts cover more than 30% of all unigram noun phrases
def get_top_unigrams(np):
    unigrams = []
    for item in np:
        if len(item) ==  1:
            unigrams.append(item)
    fdist_uni = find_freq(np,1)
    uni_list = fdist_uni.most_common()
    threshold = 0.3 * len(unigrams)
    top = []
    s = 0
    for word,count in uni_list:
        if(len(word)>3):
            top.append(word)
            s += count
            if s > threshold:
                break      
    return top
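
The cumulative cutoff in `get_top_unigrams` can be hard to follow from the loop alone. The following is a minimal, standard-library sketch of the same idea using `collections.Counter` instead of NLTK's `FreqDist`; the name `top_unigrams` and its `share` and `min_chars` parameters are assumptions made for illustration.

```python
from collections import Counter


def top_unigrams(phrases, share=0.3, min_chars=4):
    """Collect frequent one-word phrases until they cover `share` of all one-word phrases."""
    unigrams = [p[0] for p in phrases if len(p) == 1]
    threshold = share * len(unigrams)
    top, covered = [], 0
    # most_common() yields (word, count) pairs from most to least frequent
    for word, count in Counter(unigrams).most_common():
        if len(word) >= min_chars:      # short words are skipped and do not count as covered
            top.append(word)
            covered += count
            if covered > threshold:
                break
    return top


# Toy noun phrases: 15 one-word phrases and 4 two-word phrases
phrases = ([["fee"]] * 5 + [["rule"]] * 3 + [["comment"]] * 3 +
           [["pilot"]] * 2 + [["record"]] * 2 + [["proposed", "rule"]] * 4)
selected = top_unigrams(phrases)
print(selected)
```

Here the threshold is 0.3 * 15 = 4.5. "fee" is the most frequent word but is too short, so the loop takes "rule" (3 covered) and "comment" (6 covered) and then stops.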


**This cell has the sentence extraction algorithm and takes as input the keywords from the algorithm above.**

In [3]:
#Input: Corpus and list of keywords
#Output: Top sentences, sorted by position in the text
#Explanation: It takes the corpus and scores sentences based on 
#            the stepped length of the sentences, keyword occurrence and selected word occurrence.
#            Sentences that look like tables are penalized.
#            It then extracts the top 4 sentences based on score and then sorts them by index in text.
def get_top_sents(corpus,keywords_list):
    sentence_list = get_sentences(corpus)
    indexed_sents = sentence_indexing(sentence_list) # This is so that we can re-order most relevant sentences later
    
    table_scores = handle_tables(sentence_list)
    sentence_length_scores = get_sentence_lengths(sentence_list)
    keyphrase_scores = get_keyphrase_scores(corpus,sentence_list, keywords_list)
    #stepped length
    stepped_sentence_length =[]
    for each_score in sentence_length_scores:
        s = each_score//10
        if s>10:
            s = 10
        stepped_sentence_length.append(s)
        
    #sent_scores = [s+c for s,c in zip(sentence_length_scores,keyphrase_scores)] #original score = keyphrase +length
    #sent_scores = [c/s for s,c in zip(sentence_length_scores,keyphrase_scores)] #score = ratio of keyphrase /length
    sent_scores = [c+s+t for s,c,t in zip(stepped_sentence_length,keyphrase_scores,table_scores)] #score = key phrase + stepped length
    #sent_scores = [s+(k/l)*100 for s,l,k in zip(stepped_sentence_length,sentence_length_scores,keyphrase_scores)] #score = key phrase + stepped length
    
    idx_sent_scores = [(s,c) for s,c in zip(indexed_sents,sent_scores)]
    sorted_sents = sorted(idx_sent_scores,key=lambda sent: sent[1],reverse=True)
    
    # Keep the top 10% of the sentences, or the top 4, whichever is less
    top_10 = int(len(sorted_sents) * 0.1)
    if top_10 > 4: # changed from 10 to 4
        top_10 = 4
    x = sorted_sents[:top_10]
    top_list = [item[0] for item in x]
    sorted_top_list = sorted(top_list,key=lambda sent:sent[1],reverse=False)
    sorted_top_list = [sent[0] for sent in sorted_top_list]
    
    return sorted_top_list

#Input: Corpus
#Output: Clean list of sentences
#Explanation: We tokenize sentences and remove reference numbers of the form \1\, runs of dashes such as --- that mark
#             the end of a section (this also strips single hyphens), and page numbers of the form [[Page 123]]
def get_sentences(corpus):
    # First, tokenize the corpus into sentences
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    clean_sent =[]
    for sent in raw_sents:
        clean_sent.append(re.sub(r"\[+Page\s*\d+\]+|\\+\d+\\+|-+",'',sent))
    return clean_sent
    #return raw_sents

#Input: Sentence list
#Output: List of sentence as a tuple with index number
#Explanation: This is needed to arrange the selected sentences by order of occurrence in the main text so that they are coherent when read in order
def sentence_indexing(sent_list):
    indexed_sents = []
    for idx,sent in enumerate(sent_list):
        indexed_sents.append((sent,idx))
    return indexed_sents    


#Input: Sentence list
#Output: list of length of sentences
#Explanation: Sentence length (in words) is one criterion used to score sentences
def get_sentence_lengths(sent_list):
    sent_length = []
    for s in sent_list:
        sent_length.append(len(s.split(' ')))
    
    return sent_length

#Input: Corpus, Sentence list, keyword list
#Output: List of sentence keyword score
#Explanation: This is one criterion used to score sentences. Scored on the basis of occurrence of top unigrams and bigrams
def get_keyphrase_scores(corpus,sent_list, keywords):
    #keywords = get_keywords(corpus) # This gives us a list containing unigrams at index 0 and bigrams at index 1,etc
    
    unigrams = [item[0] for item in keywords[0]]
    bigrams = [item[0] for item in keywords[1]]

    unigram_scores = get_unigram_scores(unigrams,sent_list)
    bigram_scores = get_bigram_scores(bigrams,sent_list)

    sent_feature_import = [a+b for a,b in zip(unigram_scores,bigram_scores)]
    
    return sent_feature_import

#Input: Unigram list, Sentence list
#Output: List of sentence unigram score
#Explanation: Scored on the basis of occurrence of top unigrams and selected words
def get_unigram_scores(unigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        words = s.split(' ')
        occurence_count = 0
        for w in words:
            if w.lower() in unigram_list or w.lower() in ['complaint','concern','documented','evidence','warn']:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list

#Input: Sentence list
#Output: List of sentence table score
#Explanation: Scored on the basis of whether the sentence is a table. A whole table is treated as one sentence and is not very interesting or useful.
#             A table is mostly numeric, so if digits and dots make up 9% or more of the characters in a sentence it is penalized as a table.
#             9% was chosen after testing on various tables. The penalty is heavy because we never want to show a table,
#             no matter how many keywords it has or how long it is (table sentences are usually long).
def handle_tables(sent_list):
    scores = []
    for sent in sent_list:
        ss = re.sub(r"\s+",' ',sent)
        dots = ss.count('.')
        numbers = sum(ch.isdigit() for ch in ss) # count every digit 0-9 exactly once
        sent_len = len(ss)
        # Empty sentences are never tables; the check also avoids dividing by zero
        if sent_len > 0 and (dots+numbers)/sent_len >= 0.09:
            scores.append(-100)
        else:
            scores.append(0)
    return scores
            
#Input: Bigram list, Sentence list
#Output: List of sentence bigram score
#Explanation: Scored on the basis of occurrence of top bigrams
def get_bigram_scores(bigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        # create bigrams
        token=nltk.word_tokenize(s)
        bigram_phrases = ngrams(token,2)
        occurence_count = 0
        for w in bigram_phrases:
            w = [word.lower() for word in w]
            if ' '.join(w) in bigram_list:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list
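
The scoring above combines three signals: a stepped length score, keyword hits, and a heavy penalty for table-like sentences. Below is a compact, standard-library sketch of that combination followed by the "pick the best, then restore document order" step. All function names here are hypothetical, the keyword match is whitespace-based, and the sketch takes a fixed `k` rather than the 10% rule, so it approximates the cell rather than replacing it.

```python
import re


def stepped_length(sentence, step=10, cap=10):
    """One point per `step` words, capped so very long sentences do not dominate."""
    return min(len(sentence.split()) // step, cap)


def looks_like_table(sentence, cutoff=0.09):
    """True when digits and dots make up at least `cutoff` of the characters."""
    text = re.sub(r"\s+", " ", sentence)
    if not text:
        return False
    dense = sum(ch.isdigit() or ch == "." for ch in text)
    return dense / len(text) >= cutoff


def pick_top(sentences, keywords, k=4):
    """Score each sentence, keep the k best, and return them in document order."""
    scored = []
    for idx, sent in enumerate(sentences):
        hits = sum(1 for w in sent.lower().split() if w in keywords)
        score = stepped_length(sent) + hits
        if looks_like_table(sent):
            score -= 100                      # heavy penalty: tables should never surface
        scored.append((score, idx))
    best = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    return [sentences[idx] for _, idx in sorted(best, key=lambda t: t[1])]


sentences = [
    "The proposed rule changes research room hours.",
    "1.0 2.5 3.7 4.1 5.9 6.2",
    "Comments are due soon.",
    "Research hours apply to every research room.",
]
chosen = pick_top(sentences, {"rule", "research", "hours"}, k=2)
print(chosen)
```

Sorting by index at the end is what keeps the selected sentences readable in sequence, even though the last sentence scores higher than the first.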

In [4]:
#Input: text from data file
#Output: top keywords, top sentences, summary from document
#Explanation: Packages all the functions above together. 
def extract_summary(text):
    clean_text, clean_summary = get_document_text(text)
    tagged_tokens = tag_my_text(tokenize_text_sent(clean_text))
    np_list = noun_phrase_finder(tagged_tokens)
    keywords = get_top_np(np_list)
    top_np = get_top_unigrams(np_list)  
    #keywords = get_keywords(clean_text)
    top_sent = get_top_sents(clean_text,keywords)
    
    return top_np,top_sent,clean_summary
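
The first step of `extract_summary` relies on `get_document_text` finding the `SUMMARY:`, `DATES:` and `SUPPLEMENTARY INFORMATION:` markers. When a marker is absent, `str.find` returns -1 and the fixed offsets produce an unintended slice. The sketch below shows one way to guard against that; `split_sections` is a hypothetical helper, not part of the notebook, and unlike the original it strips surrounding whitespace and returns empty strings for missing sections.

```python
import re

SUMMARY = "SUMMARY:"
DATES = "DATES:"
SUPPLEMENTARY = "SUPPLEMENTARY INFORMATION:"


def split_sections(raw_text):
    """Return (supplementary_text, summary_text); a missing marker yields ''."""
    # Flatten line breaks first so HTML tags split across lines are still matched
    text = re.sub(r"[\r\n]", " ", raw_text).replace("*", "")
    text = re.sub(r"<.*?>", "", text)

    summary = ""
    start = text.find(SUMMARY)
    if start != -1:
        start += len(SUMMARY)
        end = text.find(DATES, start)
        # Without a DATES marker the summary runs to the end of the text
        summary = text[start:end if end != -1 else len(text)]

    body = ""
    start = text.find(SUPPLEMENTARY)
    if start != -1:
        body = text[start + len(SUPPLEMENTARY):]

    return body.strip(), summary.strip()


raw = ("SUMMARY: This rule extends\nthe comment period. DATES: Comments are due May 1. "
       "SUPPLEMENTARY INFORMATION: <p>Background</p> of the rule.")
body, summary = split_sections(raw)
print(summary)
print(body)
```

Using `len(marker)` instead of hard-coded offsets such as 26 and 8 keeps the slice correct if a marker string is ever changed.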

**Test**

Testing out the function

In [23]:
# Test
doc_list =load(open("data/Master_doc_content",'rb'))
document = doc_list[3]
document_text = []
for item in document['text']:
    document_text.append(str(item))

In [24]:
key,sentences, summary = extract_summary(document_text[0])

**Output**

In [25]:
key

['room',
 'research',
 'usage',
 'labor',
 'part',
 'nara',
 'washington',
 'march',
 'college',
 'monday',
 'amend']

In [26]:
for sent in sentences:
    print(sent)
    print("\n")

Research Room Hours in DC Area Facilities        Our research center and Central Research Room in the National   Archives Building and the research rooms at the National Archives at   College Park facility are currently open for research Monday through   Friday from 8:45 a.m. to 5 p.m.; on Tuesday, Thursday and Friday   evenings from 5 p.m. to 9 p.m.; and Saturdays from 8:45 a.m. to 4:45   p.m.


Researchers who conduct research in original archival records in the   evening or on Saturday currently must make a reference request in  person before 3:30 on weekdays to have the records identified and   retrieved from the stack areas for their research use; no records are   retrieved during those extended hours.


Currently the National Archives Experience (our Washington DC   museum) including the Rotunda for the Charters of Freedom (displaying   the Declaration of Independence, Constitution, and Bill of Rights) is   open to the public as follows:       The day after Labor Day through Marc

In [15]:
len(sentences)

4

In [32]:
summary

' This document provides an additional 15 days for interested   persons to submit comments on the proposed rule to amend the Customs   and Border Protection (CBP) regulations pertaining to pilots of any   private aircraft arriving in the United States from a foreign port or   location or departing the United States for a foreign port or location.   The proposed rule was published in the Federal Register on September   18, 2007, and the comment period was scheduled to expire on November   19, 2007.    '

### Writing to a file as JSON
After testing, we run it on every document in the data files and save the results in JSON format. We chose this format because it is easier to include on the demo website. 

The dictionary format in JSON looks like this:
```
{"data": [
    {
        "keywords": [],
        "sentences": [],
        "summary": <string>,
        "doc_id": <string>,
        "doc_title": <string>
    },
    ....
]}
```
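
As a quick illustration of that layout, the sketch below builds one record with placeholder values, serializes it with the standard `json` module, and reads it back. The document ID, title and text values are made up for the example and do not come from the data files.

```python
import json

# One entry per document; every value here is a placeholder
record = {
    "keywords": ["room", "research"],
    "sentences": ["First selected sentence.", "Second selected sentence."],
    "summary": "Short summary text.",
    "doc_id": "EXAMPLE-0000-0000-0001",   # hypothetical ID
    "doc_title": "Example title",         # hypothetical title
}
payload = {"data": [record]}

encoded = json.dumps(payload, indent=2)   # string form, as it would be written to a file
decoded = json.loads(encoded)             # round trip back to Python objects

print(decoded["data"][0]["doc_id"])
```

Because the top-level object wraps a list, the demo website can iterate over `data` without knowing how many documents were processed.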

In [5]:
data = []

In [6]:
doc_list1 =load(open("data/Master_doc_content",'rb'))
doc_list2 = load(open("data/Master2_doc_content",'rb'))

In [7]:
doc_id1 = ["FAA-2010-1127-0001","USCBP-2007-0064-1986","FMCSA-2015-0419-0001","NARA-06-0007-0001","APHIS-2006-0041-0001","EBSA-2012-0031-0001","IRS-2010-0009-0001","BOR-2008-0004-0001","OSHA-2013-0023-1443","DOL-2016-0001-0001","NRC-2015-0057-0086","CMS-2010-0259-0001","CMS-2009-0008-0003","CMS-2009-0038-0002","NPS-2014-0005-000","BIS-2015-0011-0001","HUD-2011-0056-0019","HUD-2011-0014-0001","OCC-2011-0002-0001","ACF-2015-0008-0124","ETA-2008-0003-0001","CMS-2012-0152-0004","CFPB-2013-0033-0001","USCIS-2016-0001-0001","FMCSA-2011-0146-0001","USCG-2013-0915-0001","NHTSA-2012-0177-0001","USCBP-2005-0005-0001"]
doc_id2 = ["HUD-2015-0101-0001","ACF-2010-0003-0001","NPS-2015-0008-0001","FAR-2014-0025-0026","CFPB-2013-0002-0001","DOS-2010-0035-0001","USCG-2013-0915-0001","SBA-2010-0001-0001"]

In [8]:
doc_title1 = ["Photo Requirements for Pilot Certificates",
             "Advance Information on Private Aircraft Arriving and Departing the United States",
             "Evaluation of Safety Sensitive Personnel for Moderate-to-Severe Obstructive Sleep Apnea",
             "Changes in NARA Research Room and Museum Hours",
             "Bovine Spongiform Encephalopathy; Minimal-Risk Regions; Importation of Live Bovines and Products Derived From Bovines",
             "Incentives for Nondiscriminatory Wellness Programs in Group Health Plans",
             "Furnishing Identifying Number of Tax Return Preparer",
             "Use of Bureau of Reclamation Land, Facilities, and Waterbodies",
             "Improve Tracking of Workplace Injuries and Illnesses",
             "Implementation of the Nondiscrimination and Equal Opportunity Provisions of the Workforce Innovation and Opportunity Act",
             "Linear No-Threshold Model and Standards for Protection Against Radiation; Extension of Comment Period",
             "Medicare Program: Accountable Care Organizations and the Medicare Shared Saving Program",
             "Medicare Program: Changes to the Competitive Acquisition of Certain Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) by Certain Provisions of the Medicare Improvements for Patients and Providers Act of 2008 (MIPPA)",
             "Medicare Program: Inpatient Rehabilitation Facility Prospective Payment System for Federal Fiscal Year 2010 ",
             "Special Regulations: Areas of the National Park System, Cuyahoga Valley National Park, Bicycling",
             "Wassenaar Arrangement Plenary Agreements Implementation; Intrusion and Surveillance Items",
             "Credit Risk Retention 2",
             "FR 5359–P–01 Equal Access to Housing in HUD Programs Regardless of Sexual Orientation or Gender Identity ",
             "Credit Risk Retention",
             "Head Start Performance Standards; Extension of Comment Period",
             "Senior Community Service Employment Program",
             "Patient Protection and Affordable Care Act: Benefit and Payment Parameters for 2014",
             "Debt Collection (Regulation F)",
             "U.S. Citizenship and Immigration Services Fee Schedule",
             "Applicability of Regulations to Operators of Certain Farm Vehicles and Off-Road Agricultural Equipment",
             "Carriage of Conditionally Permitted Shale Gas Extraction Waste Water in Bulk",
             "Federal Motor Vehicle Safety Standards: Event Data Recorders",
             "Documents Required for Travel Within the Western Hemisphere"]
doc_title2 = ["FR 5597-P-02 Instituting Smoke- Free Public Housing",
             "Head Start Program",
             "Off-Road Vehicle Management: Cape Lookout National Seashore",
             "Federal Acquisition Regulations: Fair Pay and Safe Workplaces; Second Extension of Time for Comments (FAR Case 2014-025)",
             "Ability to Repay Standards under Truth in Lending Act (Regulation Z)",
             "Schedule of Fees for Consular Services, Department of State and Overseas Embassies and Consulates",
             "Carriage of Conditionally Permitted Shale Gas Extraction Waste Water in Bulk",
             "Women-Owned Small Business Federal Contract Program"]

In [9]:
for i in range(len(doc_list1)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list1[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id1[i], doc_title1[i]
    data.append(info_dic)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


In [10]:
for i in range(len(doc_list2)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list2[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id2[i], doc_title2[i]
    data.append(info_dic)

0
1
2
3
4
5
6
7


In [11]:
top_obj ={}
top_obj["data"] = data

In [12]:
import json
with open('data/text_data.json', 'w') as outfile:
    json.dump(top_obj, outfile)