This notebook contains functions to extract the following information from a document:
* Keywords
* Top relevant sentences

** Input **
Reads in data stored in pickle format as extracted using Kinshuk's notebook


** Output **
`list_of_keywords`,`top_10_relevant_sentences`


** Usage **
Call `extract_summary` function with the document text as a parameter.


** Notes **


* I tried to use Texttiling in order to tokenize the text by topics and then use it to extract keywords, themes, etc. However, it did not result in any better quality keywords. A new challenge was that of increased number of keywords, hence beating the purpose of summarizing the text. We decided not to use the algorithm in our processing pipeline.
* The function takes into account only unigrams and bigrams while extracting top relevant sentences.

** Imports **

In [1]:
from pickle import dump, load
import nltk
from nltk import word_tokenize,FreqDist
import re
from nltk.corpus import wordnet as wn
from nltk.util import ngrams

In [36]:
def get_document_text(raw_text):
    """ This function takes in raw document text as input which we receive from the API and returns a clean text 
    of the associated document. It cleans up any HTML code in the text, newline characters, and extracts supplemental
    information part of the document.
    
    INPUT: string
    OUTPUT: string
    """
    raw_text = raw_text.replace('\n',' ')
    raw_text = raw_text.replace('*','') # added
    raw_text = raw_text.replace('\r',' ') # added
    supp_info_idx = raw_text.find("SUPPLEMENTARY INFORMATION:")
    summary_idx = raw_text.find("SUMMARY:")
    dates_idx = supp_info_idx = raw_text.find("DATES:")
    suppl_info = raw_text[supp_info_idx+26:] # To leave out the string 'Supplementary Information'
    summary = raw_text[summary_idx:dates_idx]
    # Remove any residual HTML tags in text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', suppl_info)
    cleansummary = re.sub(cleanr, '', summary)
    return cleantext, cleansummary

def get_keywords(clean_corpus):
    """ This function takes in a clean corpus as input and extracts most important keywords and top 10% of relevant 
    sentences from the text.
    
    INPUT: string
    OUTPUT: List of tuples: [(list_of_keywords,list_of_sentences)]
    """
    
    tagged_tokens = tag_my_text(tokenize_text_sent(clean_corpus))
    grand_list = get_top_np(noun_phrase_finder(tagged_tokens))
    
    return grand_list

def tokenize_text(corpus):
    pattern = r'''(?x)    # set flag to allow verbose regexps
    (([A-Z]\.)+)       # abbreviations, e.g. B.C.
    |(\w+([-']\w+)*)       # words with optional internal hyphens e.g. after-ages or author's
    '''
    tokens = nltk.regexp_tokenize(corpus,pattern)
    all_token = [word.lower() for token in tokens for word in token if word != "" 
                 and word[0] != "'" and word[0] != "-"]
    return all_token

def tokenize_text_sent(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences    
    return [tokenize_text(sent) for sent in raw_sents]

def tag_my_text(sents):
    return [nltk.pos_tag(sent) for sent in sents]

#Chunk noun phrases in tree 
def noun_phrase_chunker():
    grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
    """
    cp = nltk.RegexpParser(grammar)
    return cp

#Extract only the NP marked phrases from the parse tree, that is the chunk we defined
def noun_phrase_extractor(sentences, chunker):
    res = []
    for sent in sentences:
        tree = chunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP' : 
                res.append(subtree[0:len(subtree)])
                #res.append(subtree[0])
                #print(subtree)
    return res

def noun_phrase_finder(tagged_text):
    all_proper_noun = noun_phrase_extractor(tagged_text,noun_phrase_chunker()) 
    #does not literally mean proper noun. Chunker only extracts common noun
    noun_phrase_list = []                                                      
    #noun_phrase_string_list =[]
    for noun_phrase in all_proper_noun:
        if len(noun_phrase) > 0: #this means where the size of the phrase is greater than 1
            small_list =[]
            for (word,tag) in noun_phrase:
                small_list.append(word)
            noun_phrase_list.append(small_list)
            #noun_phrase_string_list.append(' '.join(small_list))
    return noun_phrase_list

#get frequency dist of different length in all the noun phrases extracted. 
#Something of the form {1:45,2:23} - how many 1phrased and 2 phrased chunks I have etc.
def get_length_np(nounPhrase):
    np_length={}
    for inner_np in nounPhrase:
        np_length[len(inner_np)] = np_length.get(len(inner_np),0) + 1
    return np_length

#get freq dist obj for noun phrase of different lengths
def find_freq(nested_list,nest_len):
    #from nltk.probability import FreqDist
    fdist_list =[]
    for inner_np in nested_list:
        if len(inner_np) == nest_len:
            fdist_list.append(' '.join(inner_np))
    fdist = FreqDist(fdist_list)
    return fdist

#make a grand list of top occuring noun phrases of different sizes
def get_top_np(np):
    master_common_list=[]
    len_list =get_length_np(np).keys()
    for item in len_list:
        fdist_np = find_freq(np,item)
        top = fdist_np.most_common(15) 
        top_list = []
        for w,c in top:
            if c >= 1: # changed to 1 from 10
#                 print (w)
                top_list.append((w,c))
                #top.remove((w,c))
        if len(top_list) > 0:
            master_common_list.append(top_list)
    return master_common_list

In [16]:
def get_top_sents(corpus,keywords_list):
    sentence_list = get_sentences(corpus)
    indexed_sents = sentence_indexing(sentence_list) # This is so that we can re-order most relevant sentences later
    
    sentence_length_scores = get_sentence_lengths(sentence_list)
    keyphrase_scores = get_keyphrase_scores(corpus,sentence_list, keywords_list)
    
    sent_scores = [s+c for s,c in zip(sentence_length_scores,keyphrase_scores)]
    idx_sent_scores = [(s,c) for s,c in zip(indexed_sents,sent_scores)]
    sorted_sents = sorted(idx_sent_scores,key=lambda sent: sent[1],reverse=True)
    
    # Keep top 10% of the sentences, or top 10 whichever is less
    top_10 = int(len(sorted_sents) * 0.1)
    if top_10 > 10:
        top_10 = 10
    x = sorted_sents[:top_10]
    top_list = [item[0] for item in x]
    sorted_top_list = sorted(top_list,key=lambda sent:sent[1],reverse=False)
    sorted_top_list = [sent[0] for sent in sorted_top_list]
    
    return sorted_top_list
    
def get_sentences(corpus):
    # First, tokenize the corpus into sentences
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    
    return raw_sents

def sentence_indexing(sent_list):
    indexed_sents = []
    for idx,sent in enumerate(sent_list):
        indexed_sents.append((sent,idx))
    return indexed_sents    

def get_sentence_lengths(sent_list):
    sent_length = []
    for s in sent_list:
        sent_length.append(len(s.split(' ')))
    
    return sent_length

def get_keyphrase_scores(corpus,sent_list, keywords):
    #keywords = get_keywords(corpus) # This gives us a list containing unigrams at index 0 and bigrams at index 1,etc
    
    unigrams = [item[0] for item in keywords[0]]
    bigrams = [item[0] for item in keywords[1]]

    unigram_scores = get_unigram_scores(unigrams,sent_list)
    bigram_scores = get_bigram_scores(bigrams,sent_list)

    sent_feature_import = [a+b for a,b in zip(unigram_scores,bigram_scores)]
    
    return sent_feature_import

def get_unigram_scores(unigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        words = s.split(' ')
        occurence_count = 0
        for w in words:
            if w.lower() in unigram_list or w.lower() in ['complaint','concern','documented','evidence','warn']:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list

def get_bigram_scores(bigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        # create bigrams
        token=nltk.word_tokenize(s)
        bigram_phrases = ngrams(token,2)
        occurence_count = 0
        for w in bigram_phrases:
            w = [word.lower() for word in w]
            if ' '.join(w) in bigram_list:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list

In [38]:
def extract_summary(text):
    clean_text, clean_summary = get_document_text(text)
    keywords = get_keywords(clean_text)
    top = get_top_sents(clean_text,keywords)
    
    return keywords[0],top, clean_summary

** Test **

Testing out the function

In [19]:
# Test
doc_list =load(open("data/Master_doc_content",'rb'))
document = doc_list[0]
document_text = []
for item in document['text']:
    document_text.append(str(item))

In [20]:
keywords,sentences, summary = extract_summary(document_text[0])

** Output **

In [21]:
keywords

[('certificate', 161),
 ('pilot', 124),
 ('photo', 82),
 ('sec', 52),
 ('student', 31),
 ('date', 28),
 ('act', 24),
 ('flight', 23),
 ('paragraph', 19),
 ('expiration', 16),
 ('information', 16),
 ('identification', 14),
 ('rule', 13),
 ('period', 13),
 ('replacement', 13)]

In [22]:
for sent in sentences:
    print(sent)
    print("\n")

---------------------------------------------------------------------------      Specifically, the FAA proposes to charge a $22 fee to process an  application for: (1) Exchanging an existing certificate without a photo  for a certificate with photo; (2) issuing a new pilot certificate or  student pilot certificate; and (3) replacing a pilot certificate with  photo whenever a replacement certificate is requested by a pilot or  required by regulation.


(7) If the FAA accepts digitally-captured photos, what are the  advantages and disadvantages of the following methods of acquiring the  photo: (a) An applicant uploading a self-captured photo to the IACRA  sub-system; (b) a FSDO capturing the photo when the application is  submitted; (c) a Knowledge Testing Center capturing the photo when an  application is submitted; and (d) a DPE capturing the photo when an  application is submitted?


Annual Burden Estimate: This proposal would result in a 20-year  recordkeeping and reporting burden as

In [23]:
summary

'SUMMARY: This action would require a person to carry a pilot  certificate with photo to exercise the privileges of the pilot  certificate. This proposal responds to section 4022 of the Intelligence  Reform and Terrorism Prevention Act (IRTPA). The FAA previously  required all pilots to obtain a plastic certificate (excepting  temporary certificates and student pilot certificates). This proposal  furthers the fulfillment of IRTPA by requiring a photo of the pilot to  be on all pilot certificates. The FAA also proposes to require student  pilots to obtain a plastic certificate with photo. Student pilot  certificates would also have the same duration as other pilot  certificates. Additionally, because of the new photo requirements, this  proposal modifies the application process and the fee structure for  pilot certificates.  '

In [24]:
title

'Photo Requirements for Pilot Certificates'

### Writing to a fle as JSON

In [44]:
data = []

In [26]:
doc_list =load(open("data/Master_doc_content",'rb'))
doc_list2 = load(open("data/Master2_doc_content",'rb'))

In [27]:
doc_id1 = ["FAA-2010-1127-0001","USCBP-2007-0064-1986","FMCSA-2015-0419-0001","NARA-06-0007-0001","APHIS-2006-0041-0001","EBSA-2012-0031-0001","IRS-2010-0009-0001","BOR-2008-0004-0001","OSHA-2013-0023-1443","DOL-2016-0001-0001","NRC-2015-0057-0086","CMS-2010-0259-0001","CMS-2009-0008-0003","CMS-2009-0038-0002","NPS-2014-0005-000","BIS-2015-0011-0001","HUD-2011-0056-0019","HUD-2011-0014-0001","OCC-2011-0002-0001","ACF-2015-0008-0124","ETA-2008-0003-0001","CMS-2012-0152-0004","CFPB-2013-0033-0001","USCIS-2016-0001-0001","FMCSA-2011-0146-0001","USCG-2013-0915-0001","NHTSA-2012-0177-0001","USCBP-2005-0005-0001"]
doc_id2 = ["HUD-2015-0101-0001","ACF-2010-0003-0001","NPS-2015-0008-0001","FAR-2014-0025-0026","CFPB-2013-0002-0001","DOS-2010-0035-0001"]

In [35]:
doc_title1 = ["Photo Requirements for Pilot Certificates",
             "Advance Information on Private Aircraft Arriving and Departing the United States",
             "Evaluation of Safety Sensitive Personnel for Moderate-to-Severe Obstructive Sleep Apnea",
             "Changes in NARA Research Room and Museum Hours",
             "Bovine Spongiform Encephalopathy; Minimal-Risk Regions; Importation of Live Bovines and Products Derived From Bovines",
             "Incentives for Nondiscriminatory Wellness Programs in Group Health Plans",
             "Furnishing Identifying Number of Tax Return Preparer",
             "Use of Bureau of Reclamation Land, Facilities, and Waterbodies",
             "Improve Tracking of Workplace Injuries and Illnesses",
             "Implementation of the Nondiscrimination and Equal Opportunity Provisions of the Workforce Innovation and Opportunity Act",
             "Linear No-Threshold Model and Standards for Protection Against Radiation; Extension of Comment Period",
             "Medicare Program: Accountable Care Organizations and the Medicare Shared Saving Program",
             "Medicare Program: Changes to the Competitive Acquisition of Certain Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) by Certain Provisions of the Medicare Improvements for Patients and Providers Act of 2008 (MIPPA)",
             "Medicare Program: Inpatient Rehabilitation Facility Prospective Payment System for Federal Fiscal Year 2010 ",
             "Special Regulations: Areas of the National Park System, Cuyahoga Valley National Park, Bicycling",
             "Wassenaar Arrangement Plenary Agreements Implementation; Intrusion and Surveillance Items",
             "Credit Risk Retention 2",
             "FR 5359–P–01 Equal Access to Housing in HUD Programs Regardless of Sexual Orientation or Gender Identity ",
             "Credit Risk Retention",
             "Head Start Performance Standards; Extension of Comment Period",
             "Senior Community Service Employment Program",
             "Patient Protection and Affordable Care Act: Benefit and Payment Parameters for 2014",
             "Debt Collection (Regulation F)",
             "U.S. Citizenship and Immigration Services Fee Schedule",
             "Applicability of Regulations to Operators of Certain Farm Vehicles and Off-Road Agricultural Equipment",
             "Carriage of Conditionally Permitted Shale Gas Extraction Waste Water in Bulk",
             "Federal Motor Vehicle Safety Standards: Event Data Recorders",
             "Documents Required for Travel Within the Western Hemisphere"]
doc_title2 = ["FR 5597-P-02 Instituting Smoke- Free Public Housing",
             "Head Start Program",
             "Off-Road Vehicle Management: Cape Lookout National Seashore",
             "Federal Acquisition Regulations: Fair Pay and Safe Workplaces; Second Extension of Time for Comments (FAR Case 2014-025)",
             "Ability to Repay Standards under Truth in Lending Act (Regulation Z)",
             "Schedule of Fees for Consular Services, Department of State and Overseas Embassies and Consulates  "]

In [45]:
for i in range(len(doc_list)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id1[i], doc_title1[i]
    data.append(info_dic)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


In [46]:
for i in range(len(doc_list2)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id2[i], doc_title2[i]
    data.append(info_dic)

0
1
2


In [47]:
top_obj ={}
top_obj["data"] = data

In [48]:
import json
with open('data/text_data.json', 'w') as outfile:
    json.dump(top_obj, outfile)

### for debugging from down here, do not execute

In [49]:
len(data)

31

In [50]:
for item in data:
    print(item["doc_title"])

Photo Requirements for Pilot Certificates
Advance Information on Private Aircraft Arriving and Departing the United States
Evaluation of Safety Sensitive Personnel for Moderate-to-Severe Obstructive Sleep Apnea
Changes in NARA Research Room and Museum Hours
Bovine Spongiform Encephalopathy; Minimal-Risk Regions; Importation of Live Bovines and Products Derived From Bovines
Incentives for Nondiscriminatory Wellness Programs in Group Health Plans
Furnishing Identifying Number of Tax Return Preparer
Use of Bureau of Reclamation Land, Facilities, and Waterbodies
Improve Tracking of Workplace Injuries and Illnesses
Implementation of the Nondiscrimination and Equal Opportunity Provisions of the Workforce Innovation and Opportunity Act
Linear No-Threshold Model and Standards for Protection Against Radiation; Extension of Comment Period
Medicare Program: Accountable Care Organizations and the Medicare Shared Saving Program
Medicare Program: Changes to the Competitive Acquisition of Certain Dur

In [59]:
clean_text, clean_summary= get_document_text(str(doc_list[1]['text'][0]))

In [60]:
keywords = get_keywords(clean_text) # This gives us a list containing unigrams at index 0 and bigrams at index 1,etc


In [61]:
keywords

[]

In [62]:
tagged_tokens = tag_my_text(tokenize_text_sent(clean_text))
    

In [63]:
tagged_tokens

[[('posed', 'VBN'),
  ('rule', 'NN'),
  ('published', 'VBN'),
  ('at', 'IN'),
  ('72', 'CD'),
  ('fr', 'JJ'),
  ('53394', 'CD'),
  ('september', 'NN'),
  ('18', 'CD'),
  ('2007', 'CD'),
  ('must', 'MD'),
  ('be', 'VB'),
  ('received', 'VBN'),
  ('on', 'IN'),
  ('or', 'CC'),
  ('before', 'IN'),
  ('december', 'JJ'),
  ('4', 'CD'),
  ('2007', 'CD')],
 [('addresses', 'NNS'),
  ('you', 'PRP'),
  ('may', 'MD'),
  ('submit', 'VB'),
  ('comments', 'NNS'),
  ('identified', 'VBN'),
  ('by', 'IN'),
  ('docket', 'NN'),
  ('number', 'NN'),
  ('by', 'IN'),
  ('one', 'CD'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('following', 'JJ'),
  ('methods', 'NNS'),
  ('federal', 'JJ'),
  ('erulemaking', 'VBG'),
  ('portal', 'JJ'),
  ('http', 'NN'),
  ('www', 'JJ'),
  ('regulations', 'NNS'),
  ('gov', 'VBP')],
 [('follow', 'VB'),
  ('the', 'DT'),
  ('instructions', 'NNS'),
  ('for', 'IN'),
  ('submitting', 'VBG'),
  ('comments', 'NNS'),
  ('via', 'IN'),
  ('docket', 'NN'),
  ('number', 'NN'),
  ('uscbp-2007-0064', 

In [64]:
#grand_list = get_top_np(noun_phrase_finder(tagged_tokens))
np_list = noun_phrase_finder(tagged_tokens)

In [69]:
grand = get_top_np(np_list)

In [70]:
grand

[[('border', 5),
  ('rule', 5),
  ('protection', 4),
  ('location', 3),
  ('office', 3),
  ('number', 3),
  ('docket', 3),
  ('dc', 2),
  ('washington', 2),
  ('period', 2),
  ('rulemaking', 2),
  ('background', 2),
  ('september', 2),
  ('fr', 2),
  ('cbp', 2)],
 [('international trade', 4),
  ('an extension', 2),
  ('private aircraft', 2),
  ('a notice', 2),
  ('this rulemaking', 1),
  ('regular business', 1),
  ('any pilot', 1),
  ('the agency', 1),
  ('the notice', 1),
  ('vereb branch', 1),
  ('the docket', 1),
  ('the comment', 1),
  ('the office', 1),
  ('supplementary information', 1),
  ('a m', 1)],
 [('a private aircraft', 2),
  ('a foreign location', 2),
  ('the federal register', 2),
  ('a foreign port', 2),
  ('any personal information', 1),
  ('advance electronic information', 1),
  ('nw 5th floor', 1)]]

#### doc_id and title -1 
0. FAA-2010-1127-0001
1. USCBP-2007-0064-1986
2. FMCSA-2015-0419-0001
3. NARA-06-0007-0001
4. APHIS-2006-0041-0001
5. EBSA-2012-0031-0001
6. IRS-2010-0009-0001
7. BOR-2008-0004-0001
8. OSHA-2013-0023-1443
9. DOL-2016-0001-0001
10. NRC-2015-0057-0086
11. CMS-2010-0259-0001
12. CMS-2009-0008-0003
13. CMS-2009-0038-0002
14. NPS-2014-0005-0001
15. BIS-2015-0011-0001
16. HUD-2011-0056-0019
17. HUD-2011-0014-0001
18. OCC-2011-0002-0001
19. ACF-2015-0008-0124
20. ETA-2008-0003-0001
21. CMS-2012-0152-0004
22. CFPB-2013-0033-0001
23. USCIS-2016-0001-0001
24. FMCSA-2011-0146-0001
25. USCG-2013-0915-0001
26. NHTSA-2012-0177-0001
27. USCBP-2005-0005-0001


#### doc_id and title -2
1. HUD-2015-0101-0001
2. ACF-2010-0003-0001
3. NPS-2015-0008-0001

In [4]:
s = str(doc_list2[2]['text'][0])


In [6]:
title_end = s.find("AGENCY:")


In [12]:
title_start = s.rfind("\n", 0, title_end - 4)

In [13]:
s[title_start:title_end]

'\nCape Lookout National Seashore, Off-Road Vehicle Management\n\n'

In [14]:
doc_list2[2]['text'][0]

<pre>
[Federal Register Volume 80, Number 243 (Friday, December 18, 2015)]
[Proposed Rules]
[Pages 79013-79020]
From the Federal Register Online via the Government Publishing Office [<a href="http://www.gpo.gov">www.gpo.gov</a>]
[FR Doc No: 2015-31793]



[[Page 79013]]

-----------------------------------------------------------------------

DEPARTMENT OF THE INTERIOR

National Park Service

36 CFR Part 7

[NPS-CALO-19111; PPWONRADE2, PMP00EI05.YP]
RIN 1024-AE24


Cape Lookout National Seashore, Off-Road Vehicle Management

AGENCY: National Park Service, Interior

ACTION: Proposed rule.

-----------------------------------------------------------------------

SUMMARY: The National Park Service proposes to designate routes for, 
and manage off-road vehicle use within Cape Lookout National Seashore, 
North Carolina. Under the National Park Service general regulations, 
the operation of motor vehicles off roads is prohibited unless 
authorized by special regulation. The proposed rule wou