This notebook contains functions to extract the following information from a document:
* Keywords
* Top relevant sentences

** Input **
Reads pickled document data produced by Kinshuk's extraction notebook.


** Output **
`list_of_keywords`, `top_relevant_sentences`, `summary` (the top 10% of sentences, capped at 4)


** Usage **
Call the `extract_summary` function with the raw document text as its only argument.


** Notes **


* I tried TextTiling to segment the text by topic before extracting keywords, themes, etc. It did not improve keyword quality, and it increased the number of keywords, which defeats the purpose of summarizing the text. We decided not to use the algorithm in our processing pipeline.
* Only unigrams and bigrams are taken into account when scoring sentences for relevance.
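
As a rough illustration of the unigram and bigram note above, the sketch below counts keyword hits in a single sentence using only the standard library. The name `ngram_hits`, the whitespace tokenization, and the sample keywords are assumptions made for this example; the notebook's own scoring lives in `get_unigram_scores` and `get_bigram_scores`.

```python
def ngram_hits(sentence, unigrams, bigrams):
    """Count unigram and bigram keyword occurrences in one sentence."""
    words = sentence.lower().split()
    # Every word that is a keyword counts once per occurrence.
    uni_hits = sum(1 for w in words if w in unigrams)
    # Adjacent word pairs are joined with a space and matched against the bigram keywords.
    bi_hits = sum(1 for a, b in zip(words, words[1:]) if f"{a} {b}" in bigrams)
    return uni_hits + bi_hits


score = ngram_hits(
    "The private aircraft rule applies to any private aircraft",
    unigrams={"rule"},
    bigrams={"private aircraft"},
)
print(score)
```

Longer phrases (trigrams and up) are extracted by the chunker but do not contribute to the sentence score.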

** Imports **

In [1]:
from pickle import dump, load
import nltk
from nltk import word_tokenize,FreqDist
import re
from nltk.corpus import wordnet as wn
from nltk.util import ngrams

In [2]:
def get_document_text(raw_text):
    """ Clean the raw document text received from the API.

    Replaces newline and carriage-return characters with spaces, removes asterisks and residual HTML tags,
    and extracts the supplementary information and summary sections of the document.

    INPUT: string
    OUTPUT: tuple of two strings: (supplementary information text, summary text)
    """
    raw_text = raw_text.replace('\n',' ')
    raw_text = raw_text.replace('*','') # added
    raw_text = raw_text.replace('\r',' ') # added
    supp_info_idx = raw_text.find("SUPPLEMENTARY INFORMATION:")
    summary_idx = raw_text.find("SUMMARY:")
    dates_idx = raw_text.find("DATES:")
    suppl_info = raw_text[supp_info_idx+26:] # To leave out the string 'Supplementary Information'
    summary = raw_text[summary_idx+8:dates_idx]
    # Remove any residual HTML tags in text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', suppl_info)
    cleansummary = re.sub(cleanr, '', summary)
    return cleantext, cleansummary

# def get_keywords(clean_corpus):
#     """ This function takes in a clean corpus as input and extracts most important keywords and top 10% of relevant 
#     sentences from the text.
    
#     INPUT: string
#     OUTPUT: List of tuples: [(list_of_keywords,list_of_sentences)]
#     """
    
#     tagged_tokens = tag_my_text(tokenize_text_sent(clean_corpus))
#     grand_list = get_top_np(noun_phrase_finder(tagged_tokens))
#     return grand_list

def tokenize_text(corpus):
    pattern = r'''(?x)    # set flag to allow verbose regexps
    (([A-Z]\.)+)       # abbreviations, e.g. B.C.
    |(\w+([-']\w+)*)       # words with optional internal hyphens e.g. after-ages or author's
    '''
    tokens = nltk.regexp_tokenize(corpus,pattern)
    all_token = [word.lower() for token in tokens for word in token if word != "" 
                 and word[0] != "'" and word[0] != "-"]
    return all_token

def tokenize_text_sent(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences    
    return [tokenize_text(sent) for sent in raw_sents]

def tag_my_text(sents):
    return [nltk.pos_tag(sent) for sent in sents]

#Chunk noun phrases in tree 
def noun_phrase_chunker():
    grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
    """
    cp = nltk.RegexpParser(grammar)
    return cp

#Extract only the NP marked phrases from the parse tree, that is the chunk we defined
def noun_phrase_extractor(sentences, chunker):
    res = []
    for sent in sentences:
        tree = chunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP' : 
                res.append(subtree[0:len(subtree)])
                #res.append(subtree[0])
                #print(subtree)
    return res

def noun_phrase_finder(tagged_text):
    all_proper_noun = noun_phrase_extractor(tagged_text,noun_phrase_chunker()) 
    #does not literally mean proper noun. Chunker only extracts common noun
    noun_phrase_list = []                                                      
    #noun_phrase_string_list =[]
    for noun_phrase in all_proper_noun:
        if len(noun_phrase) > 0: # keep every non-empty phrase (one or more words)
            small_list =[]
            for (word,tag) in noun_phrase:
                small_list.append(word)
            noun_phrase_list.append(small_list)
            #noun_phrase_string_list.append(' '.join(small_list))
    return noun_phrase_list

#get frequency dist of different length in all the noun phrases extracted. 
#Something of the form {1:45,2:23} - how many 1phrased and 2 phrased chunks I have etc.
def get_length_np(nounPhrase):
    np_length={}
    for inner_np in nounPhrase:
        np_length[len(inner_np)] = np_length.get(len(inner_np),0) + 1
    return np_length

#get freq dist obj for noun phrase of different lengths
def find_freq(nested_list,nest_len):
    #from nltk.probability import FreqDist
    fdist_list =[]
    for inner_np in nested_list:
        if len(inner_np) == nest_len:
            fdist_list.append(' '.join(inner_np))
    fdist = FreqDist(fdist_list)
    return fdist

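# Build a list of the 15 most common noun phrases for each phrase length, shortest first.
# get_keyphrase_scores expects unigrams at index 0 and bigrams at index 1.
def get_top_np(np):
    master_common_list=[]
    len_list = sorted(get_length_np(np).keys()) # sort so the list order follows phrase length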
    for item in len_list:
        fdist_np = find_freq(np,item)
        master_common_list.append(fdist_np.most_common(15))
    return master_common_list

def get_top_unigrams(np):
    unigrams = []
    for item in np:
        if len(item) ==  1:
            unigrams.append(item)
    fdist_uni = find_freq(np,1)
    uni_list = fdist_uni.most_common()
    threshold = 0.3 * len(unigrams)
    top = []
    s = 0
    for word,count in uni_list:
        if(len(word)>3):
            top.append(word)
            s += count
            if s > threshold:
                break      
    return top
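
The loop in `get_top_unigrams` can be read as a coverage rule: walk the one-word phrases from most to least frequent, keep those longer than three characters, and stop once the kept words account for more than 30% of all one-word phrases. The sketch below restates that idea with `collections.Counter`; `top_by_coverage` and its parameters are hypothetical names, not part of the notebook.

```python
from collections import Counter


def top_by_coverage(words, coverage=0.3, min_len=4):
    """Keep the most frequent words until they cover more than `coverage` of all words."""
    counts = Counter(words)
    threshold = coverage * len(words)  # short words still count toward the total
    top, running = [], 0
    for word, count in counts.most_common():
        if len(word) >= min_len:
            top.append(word)
            running += count
            if running > threshold:
                break
    return top


words = (["act"] * 5 + ["rule"] * 4 + ["smoke"] * 3 + ["housing"] * 2
         + ["lease", "fire", "unit", "part", "order", "state"])
top = top_by_coverage(words)
print(top)
```

In this sample, `act` is the most frequent word but is skipped for being too short, so the threshold is only crossed after two longer words are kept.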


In [3]:
def get_top_sents(corpus,keywords_list):
    sentence_list = get_sentences(corpus)
    indexed_sents = sentence_indexing(sentence_list) # This is so that we can re-order most relevant sentences later
    
    table_scores = handle_tables(sentence_list)
    sentence_length_scores = get_sentence_lengths(sentence_list)
    keyphrase_scores = get_keyphrase_scores(corpus,sentence_list, keywords_list)
    #stepped length
    stepped_sentence_length =[]
    for each_score in sentence_length_scores:
        s = each_score//10
        if s>10:
            s = 10
        stepped_sentence_length.append(s)
        
    #sent_scores = [s+c for s,c in zip(sentence_length_scores,keyphrase_scores)] #original score = keyphrase +length
    #sent_scores = [c/s for s,c in zip(sentence_length_scores,keyphrase_scores)] #score = ratio of keyphrase /length
    sent_scores = [c+s+t for s,c,t in zip(stepped_sentence_length,keyphrase_scores,table_scores)] #score = key phrase + stepped length + table penalty
    #sent_scores = [s+(k/l)*100 for s,l,k in zip(stepped_sentence_length,sentence_length_scores,keyphrase_scores)] #score = key phrase + stepped length
    
    idx_sent_scores = [(s,c) for s,c in zip(indexed_sents,sent_scores)]
    sorted_sents = sorted(idx_sent_scores,key=lambda sent: sent[1],reverse=True)
    
    # Keep the top 10% of the sentences, or the top 4, whichever is fewer
    top_10 = int(len(sorted_sents) * 0.1)
    if top_10 > 4: # changed from 10 to 4
        top_10 = 4
    x = sorted_sents[:top_10]
    top_list = [item[0] for item in x]
    sorted_top_list = sorted(top_list,key=lambda sent:sent[1],reverse=False)
    sorted_top_list = [sent[0] for sent in sorted_top_list]
    
    return sorted_top_list
    
def get_sentences(corpus):
    # First, tokenize the corpus into sentences
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    clean_sent =[]
    for sent in raw_sents:
        clean_sent.append(re.sub(r"\[+Page\s*\d+\]+|\\+\d+\\+|-+",'',sent))
    return clean_sent
    #return raw_sents

def sentence_indexing(sent_list):
    indexed_sents = []
    for idx,sent in enumerate(sent_list):
        indexed_sents.append((sent,idx))
    return indexed_sents    

def get_sentence_lengths(sent_list):
    sent_length = []
    for s in sent_list:
        sent_length.append(len(s.split(' ')))
    
    return sent_length

def get_keyphrase_scores(corpus,sent_list, keywords):
    #keywords = get_keywords(corpus) # This gives us a list containing unigrams at index 0 and bigrams at index 1,etc
    
    unigrams = [item[0] for item in keywords[0]]
    bigrams = [item[0] for item in keywords[1]]

    unigram_scores = get_unigram_scores(unigrams,sent_list)
    bigram_scores = get_bigram_scores(bigrams,sent_list)

    sent_feature_import = [a+b for a,b in zip(unigram_scores,bigram_scores)]
    
    return sent_feature_import

def get_unigram_scores(unigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        words = s.split(' ')
        occurence_count = 0
        for w in words:
            if w.lower() in unigram_list or w.lower() in ['complaint','concern','documented','evidence','warn']:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list

def handle_tables(sent_list):
    # Penalize sentences that look like flattened tables (dense in dots and digits)
    scores = []
    for sent in sent_list:
        ss = re.sub(r"\s+",' ',sent)
        sent_len = len(ss)
        if sent_len == 0: # nothing left after cleaning; avoid dividing by zero
            scores.append(0)
            continue
        dots = ss.count('.')
        # count the digits 1-9; '10' is counted once more on top of its '1', as in the original scoring
        numbers = sum(ss.count(d) for d in ['1','2','3','4','5','6','7','8','9','10'])
        if (dots+numbers)/sent_len >= 0.09:
            scores.append(-100)
        else:
            scores.append(0)
    return scores
            

def get_bigram_scores(bigram_list,sent_list):
    occurence_list = []
    for s in sent_list:
        # create bigrams
        token=nltk.word_tokenize(s)
        bigram_phrases = ngrams(token,2)
        occurence_count = 0
        for w in bigram_phrases:
            w = [word.lower() for word in w]
            if ' '.join(w) in bigram_list:
                occurence_count += 1
        occurence_list.append(occurence_count)
        
    return occurence_list
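
To make the ranking in `get_top_sents` easier to follow, here is a compact sketch of the same idea: add a stepped length score (one point per ten words, capped at 10) to the keyphrase and table scores, keep the best 10% (at most 4), and return them in document order. `pick_top_sentences` is a hypothetical helper, and it splits on any whitespace, which is an assumption that differs slightly from the notebook's `split(' ')`.

```python
def pick_top_sentences(sentences, keyphrase_scores, table_scores, max_keep=4):
    """Rank sentences by combined score, then restore document order."""
    scored = []
    for idx, sent in enumerate(sentences):
        stepped = min(len(sent.split()) // 10, 10)  # stepped length score, capped at 10
        total = stepped + keyphrase_scores[idx] + table_scores[idx]
        scored.append((idx, sent, total))
    keep = min(int(len(scored) * 0.1), max_keep)
    best = sorted(scored, key=lambda item: item[2], reverse=True)[:keep]
    # Sorting the kept tuples orders them by their original index.
    return [sent for _, sent, _ in sorted(best)]


sentences = [f"sentence number {i}" for i in range(20)]
keyphrase_scores = [0] * 20
keyphrase_scores[15] = 5
keyphrase_scores[3] = 4
table_scores = [0] * 20
table_scores[7] = -100
picked = pick_top_sentences(sentences, keyphrase_scores, table_scores)
print(picked)
```

With fewer than ten sentences the 10% rule rounds down to zero, so nothing is returned; the notebook's function behaves the same way.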

In [4]:
def extract_summary(text):
    clean_text, clean_summary = get_document_text(text)
    tagged_tokens = tag_my_text(tokenize_text_sent(clean_text))
    np_list = noun_phrase_finder(tagged_tokens)
    keywords = get_top_np(np_list)
    top_np = get_top_unigrams(np_list)  
    #keywords = get_keywords(clean_text)
    top_sent = get_top_sents(clean_text,keywords)
    
    return top_np,top_sent,clean_summary
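
The first step of `extract_summary` slices the text on section markers. `str.find` returns -1 when a marker is absent, in which case the slices in `get_document_text` silently start or end at the wrong offset. A minimal sketch of a more defensive slice follows; `section_between` is a hypothetical helper, and returning an empty string for a missing start marker is an assumption about the desired behavior.

```python
def section_between(text, start_marker, end_marker=None):
    """Return the text after start_marker, up to end_marker when it is present."""
    start = text.find(start_marker)
    if start == -1:
        return ""  # assumed fallback when the section is missing
    start += len(start_marker)
    if end_marker is None:
        return text[start:]
    end = text.find(end_marker, start)
    return text[start:] if end == -1 else text[start:end]


doc = "SUMMARY: Short summary. DATES: Today. SUPPLEMENTARY INFORMATION: Body text."
summary_part = section_between(doc, "SUMMARY:", "DATES:")
body_part = section_between(doc, "SUPPLEMENTARY INFORMATION:")
missing_part = section_between("no markers here", "SUMMARY:", "DATES:")
print(repr(summary_part), repr(body_part), repr(missing_part))
```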

** Test **

Testing out the function

In [23]:
# Test
doc_list =load(open("data/Master_doc_content",'rb'))
document = doc_list[3]
document_text = []
for item in document['text']:
    document_text.append(str(item))

In [24]:
key,sentences, summary = extract_summary(document_text[0])

** Output **

In [25]:
key

['room',
 'research',
 'usage',
 'labor',
 'part',
 'nara',
 'washington',
 'march',
 'college',
 'monday',
 'amend']

In [26]:
for sent in sentences:
    print(sent)
    print("\n")

Research Room Hours in DC Area Facilities        Our research center and Central Research Room in the National   Archives Building and the research rooms at the National Archives at   College Park facility are currently open for research Monday through   Friday from 8:45 a.m. to 5 p.m.; on Tuesday, Thursday and Friday   evenings from 5 p.m. to 9 p.m.; and Saturdays from 8:45 a.m. to 4:45   p.m.


Researchers who conduct research in original archival records in the   evening or on Saturday currently must make a reference request in  person before 3:30 on weekdays to have the records identified and   retrieved from the stack areas for their research use; no records are   retrieved during those extended hours.


Currently the National Archives Experience (our Washington DC   museum) including the Rotunda for the Charters of Freedom (displaying   the Declaration of Independence, Constitution, and Bill of Rights) is   open to the public as follows:       The day after Labor Day through Marc

In [15]:
len(sentences)

4

In [32]:
summary

' This document provides an additional 15 days for interested   persons to submit comments on the proposed rule to amend the Customs   and Border Protection (CBP) regulations pertaining to pilots of any   private aircraft arriving in the United States from a foreign port or   location or departing the United States for a foreign port or location.   The proposed rule was published in the Federal Register on September   18, 2007, and the comment period was scheduled to expire on November   19, 2007.    '

### Writing to a file as JSON

In [5]:
data = []

In [6]:
doc_list1 =load(open("data/Master_doc_content",'rb'))
doc_list2 = load(open("data/Master2_doc_content",'rb'))

In [7]:
doc_id1 = ["FAA-2010-1127-0001","USCBP-2007-0064-1986","FMCSA-2015-0419-0001","NARA-06-0007-0001","APHIS-2006-0041-0001","EBSA-2012-0031-0001","IRS-2010-0009-0001","BOR-2008-0004-0001","OSHA-2013-0023-1443","DOL-2016-0001-0001","NRC-2015-0057-0086","CMS-2010-0259-0001","CMS-2009-0008-0003","CMS-2009-0038-0002","NPS-2014-0005-0001","BIS-2015-0011-0001","HUD-2011-0056-0019","HUD-2011-0014-0001","OCC-2011-0002-0001","ACF-2015-0008-0124","ETA-2008-0003-0001","CMS-2012-0152-0004","CFPB-2013-0033-0001","USCIS-2016-0001-0001","FMCSA-2011-0146-0001","USCG-2013-0915-0001","NHTSA-2012-0177-0001","USCBP-2005-0005-0001"]
doc_id2 = ["HUD-2015-0101-0001","ACF-2010-0003-0001","NPS-2015-0008-0001","FAR-2014-0025-0026","CFPB-2013-0002-0001","DOS-2010-0035-0001","USCG-2013-0915-0001","SBA-2010-0001-0001"]

In [8]:
doc_title1 = ["Photo Requirements for Pilot Certificates",
             "Advance Information on Private Aircraft Arriving and Departing the United States",
             "Evaluation of Safety Sensitive Personnel for Moderate-to-Severe Obstructive Sleep Apnea",
             "Changes in NARA Research Room and Museum Hours",
             "Bovine Spongiform Encephalopathy; Minimal-Risk Regions; Importation of Live Bovines and Products Derived From Bovines",
             "Incentives for Nondiscriminatory Wellness Programs in Group Health Plans",
             "Furnishing Identifying Number of Tax Return Preparer",
             "Use of Bureau of Reclamation Land, Facilities, and Waterbodies",
             "Improve Tracking of Workplace Injuries and Illnesses",
             "Implementation of the Nondiscrimination and Equal Opportunity Provisions of the Workforce Innovation and Opportunity Act",
             "Linear No-Threshold Model and Standards for Protection Against Radiation; Extension of Comment Period",
             "Medicare Program: Accountable Care Organizations and the Medicare Shared Saving Program",
             "Medicare Program: Changes to the Competitive Acquisition of Certain Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) by Certain Provisions of the Medicare Improvements for Patients and Providers Act of 2008 (MIPPA)",
             "Medicare Program: Inpatient Rehabilitation Facility Prospective Payment System for Federal Fiscal Year 2010 ",
             "Special Regulations: Areas of the National Park System, Cuyahoga Valley National Park, Bicycling",
             "Wassenaar Arrangement Plenary Agreements Implementation; Intrusion and Surveillance Items",
             "Credit Risk Retention 2",
             "FR 5359–P–01 Equal Access to Housing in HUD Programs Regardless of Sexual Orientation or Gender Identity ",
             "Credit Risk Retention",
             "Head Start Performance Standards; Extension of Comment Period",
             "Senior Community Service Employment Program",
             "Patient Protection and Affordable Care Act: Benefit and Payment Parameters for 2014",
             "Debt Collection (Regulation F)",
             "U.S. Citizenship and Immigration Services Fee Schedule",
             "Applicability of Regulations to Operators of Certain Farm Vehicles and Off-Road Agricultural Equipment",
             "Carriage of Conditionally Permitted Shale Gas Extraction Waste Water in Bulk",
             "Federal Motor Vehicle Safety Standards: Event Data Recorders",
             "Documents Required for Travel Within the Western Hemisphere"]
doc_title2 = ["FR 5597-P-02 Instituting Smoke- Free Public Housing",
             "Head Start Program",
             "Off-Road Vehicle Management: Cape Lookout National Seashore",
             "Federal Acquisition Regulations: Fair Pay and Safe Workplaces; Second Extension of Time for Comments (FAR Case 2014-025)",
             "Ability to Repay Standards under Truth in Lending Act (Regulation Z)",
             "Schedule of Fees for Consular Services, Department of State and Overseas Embassies and Consulates",
             "Carriage of Conditionally Permitted Shale Gas Extraction Waste Water in Bulk",
             "Women-Owned Small Business Federal Contract Program"]

In [9]:
for i in range(len(doc_list1)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list1[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id1[i], doc_title1[i]
    data.append(info_dic)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


In [10]:
for i in range(len(doc_list2)):
    print(i)
    info_dic = {}
    doc_text = str(doc_list2[i]['text'][0])
    info_dic["keywords"],info_dic["sentences"], info_dic["summary"] = extract_summary(doc_text)
    info_dic["doc_id"], info_dic["doc_title"] = doc_id2[i], doc_title2[i]
    data.append(info_dic)

0
1
2
3
4
5
6
7


In [11]:
top_obj ={}
top_obj["data"] = data

In [14]:
data[28]

{'doc_id': 'HUD-2015-0101-0001',
 'doc_title': 'FR 5597-P-02 Instituting Smoke- Free Public Housing',
 'keywords': ['housing',
  'phas',
  'tobacco',
  'smoking',
  'smoke',
  'exposure',
  'rule',
  'health',
  'implementation',
  'state',
  'multiunit',
  'community',
  'fire',
  'percent',
  'enforcement',
  'information',
  'unit',
  'damage',
  'order',
  'research',
  'section',
  'lease',
  'part',
  'control',
  'notice'],
 'sentences': ['Surgeon General estimates that exposure to secondhand tobacco  smoke (i.e., the smoke that comes from burning tobacco products and is  exhaled by smokers) is responsible for the death of 41,000 adults non smokers in the United States each year from lung cancer and heart  disease. Secondhand smoke (SHS) contains hundreds of toxic chemicals  and is designated as a known human carcinogen by the U.S. Environmental  Protection Agency, the U.S. National Toxicology Program, and the  International Agency for Research on Cancer. Exposure to SHS can  al

In [12]:
import json
with open('data/text_data.json', 'w') as outfile:
    json.dump(top_obj, outfile)
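
After writing the file, it can be worth reading it back and checking that every record carries the five fields built in the loops above. The sketch below does this against a temporary file with one made-up record; the record contents and the temporary path are placeholders, not real pipeline output.

```python
import json
import os
import tempfile

# A made-up record with the same five fields the notebook stores per document.
top_obj = {"data": [{
    "doc_id": "EXAMPLE-0001",
    "doc_title": "Example title",
    "keywords": ["rule"],
    "sentences": ["An example sentence."],
    "summary": "An example summary.",
}]}

path = os.path.join(tempfile.mkdtemp(), "text_data.json")
with open(path, "w") as outfile:
    json.dump(top_obj, outfile)

with open(path) as infile:
    loaded = json.load(infile)

required = {"doc_id", "doc_title", "keywords", "sentences", "summary"}
# Indexes of records that lack one or more required fields.
incomplete = [i for i, rec in enumerate(loaded["data"]) if not required.issubset(rec)]
print(len(loaded["data"]), incomplete)
```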

### Debugging cells from here down; do not execute

In [49]:
len(data)

31

In [50]:
for item in data:
    print(item["doc_title"])

Photo Requirements for Pilot Certificates
Advance Information on Private Aircraft Arriving and Departing the United States
Evaluation of Safety Sensitive Personnel for Moderate-to-Severe Obstructive Sleep Apnea
Changes in NARA Research Room and Museum Hours
Bovine Spongiform Encephalopathy; Minimal-Risk Regions; Importation of Live Bovines and Products Derived From Bovines
Incentives for Nondiscriminatory Wellness Programs in Group Health Plans
Furnishing Identifying Number of Tax Return Preparer
Use of Bureau of Reclamation Land, Facilities, and Waterbodies
Improve Tracking of Workplace Injuries and Illnesses
Implementation of the Nondiscrimination and Equal Opportunity Provisions of the Workforce Innovation and Opportunity Act
Linear No-Threshold Model and Standards for Protection Against Radiation; Extension of Comment Period
Medicare Program: Accountable Care Organizations and the Medicare Shared Saving Program
Medicare Program: Changes to the Competitive Acquisition of Certain Dur

In [59]:
clean_text, clean_summary= get_document_text(str(doc_list[1]['text'][0]))

In [60]:
keywords = get_keywords(clean_text) # This gives us a list containing unigrams at index 0 and bigrams at index 1,etc


In [61]:
keywords

[]

In [62]:
tagged_tokens = tag_my_text(tokenize_text_sent(clean_text))
    

In [63]:
tagged_tokens

[[('posed', 'VBN'),
  ('rule', 'NN'),
  ('published', 'VBN'),
  ('at', 'IN'),
  ('72', 'CD'),
  ('fr', 'JJ'),
  ('53394', 'CD'),
  ('september', 'NN'),
  ('18', 'CD'),
  ('2007', 'CD'),
  ('must', 'MD'),
  ('be', 'VB'),
  ('received', 'VBN'),
  ('on', 'IN'),
  ('or', 'CC'),
  ('before', 'IN'),
  ('december', 'JJ'),
  ('4', 'CD'),
  ('2007', 'CD')],
 [('addresses', 'NNS'),
  ('you', 'PRP'),
  ('may', 'MD'),
  ('submit', 'VB'),
  ('comments', 'NNS'),
  ('identified', 'VBN'),
  ('by', 'IN'),
  ('docket', 'NN'),
  ('number', 'NN'),
  ('by', 'IN'),
  ('one', 'CD'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('following', 'JJ'),
  ('methods', 'NNS'),
  ('federal', 'JJ'),
  ('erulemaking', 'VBG'),
  ('portal', 'JJ'),
  ('http', 'NN'),
  ('www', 'JJ'),
  ('regulations', 'NNS'),
  ('gov', 'VBP')],
 [('follow', 'VB'),
  ('the', 'DT'),
  ('instructions', 'NNS'),
  ('for', 'IN'),
  ('submitting', 'VBG'),
  ('comments', 'NNS'),
  ('via', 'IN'),
  ('docket', 'NN'),
  ('number', 'NN'),
  ('uscbp-2007-0064', 

In [64]:
#grand_list = get_top_np(noun_phrase_finder(tagged_tokens))
np_list = noun_phrase_finder(tagged_tokens)

In [69]:
grand = get_top_np(np_list)

In [70]:
grand

[[('border', 5),
  ('rule', 5),
  ('protection', 4),
  ('location', 3),
  ('office', 3),
  ('number', 3),
  ('docket', 3),
  ('dc', 2),
  ('washington', 2),
  ('period', 2),
  ('rulemaking', 2),
  ('background', 2),
  ('september', 2),
  ('fr', 2),
  ('cbp', 2)],
 [('international trade', 4),
  ('an extension', 2),
  ('private aircraft', 2),
  ('a notice', 2),
  ('this rulemaking', 1),
  ('regular business', 1),
  ('any pilot', 1),
  ('the agency', 1),
  ('the notice', 1),
  ('vereb branch', 1),
  ('the docket', 1),
  ('the comment', 1),
  ('the office', 1),
  ('supplementary information', 1),
  ('a m', 1)],
 [('a private aircraft', 2),
  ('a foreign location', 2),
  ('the federal register', 2),
  ('a foreign port', 2),
  ('any personal information', 1),
  ('advance electronic information', 1),
  ('nw 5th floor', 1)]]

#### doc_id and title -1 
0. FAA-2010-1127-0001
1. USCBP-2007-0064-1986
2. FMCSA-2015-0419-0001
3. NARA-06-0007-0001
4. APHIS-2006-0041-0001
5. EBSA-2012-0031-0001
6. IRS-2010-0009-0001
7. BOR-2008-0004-0001
8. OSHA-2013-0023-1443
9. DOL-2016-0001-0001
10. NRC-2015-0057-0086
11. CMS-2010-0259-0001
12. CMS-2009-0008-0003
13. CMS-2009-0038-0002
14. NPS-2014-0005-0001
15. BIS-2015-0011-0001
16. HUD-2011-0056-0019
17. HUD-2011-0014-0001
18. OCC-2011-0002-0001
19. ACF-2015-0008-0124
20. ETA-2008-0003-0001
21. CMS-2012-0152-0004
22. CFPB-2013-0033-0001
23. USCIS-2016-0001-0001
24. FMCSA-2011-0146-0001
25. USCG-2013-0915-0001
26. NHTSA-2012-0177-0001
27. USCBP-2005-0005-0001


#### doc_id and title -2
1. HUD-2015-0101-0001
2. ACF-2010-0003-0001
3. NPS-2015-0008-0001

In [4]:
s = str(doc_list2[2]['text'][0])


In [6]:
title_end = s.find("AGENCY:")


In [12]:
title_start = s.rfind("\n", 0, title_end - 4)

In [13]:
s[title_start:title_end]

'\nCape Lookout National Seashore, Off-Road Vehicle Management\n\n'
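
The `find`/`rfind` steps above could be wrapped in a small helper that also trims the surrounding newlines. This is only a sketch: `extract_title` is a hypothetical name, and it assumes the title sits on a single line directly above the `AGENCY:` marker, as in the document shown here.

```python
def extract_title(raw_text, marker="AGENCY:"):
    """Return the last non-empty line before the marker, or None if the marker is missing."""
    marker_idx = raw_text.find(marker)
    if marker_idx == -1:
        return None
    before = raw_text[:marker_idx].rstrip()
    # rfind returns -1 when there is no newline, so +1 falls back to the start of the text.
    return before[before.rfind("\n") + 1:].strip()


sample = ("RIN 1024-AE24\n\n\n"
          "Cape Lookout National Seashore, Off-Road Vehicle Management\n\n"
          "AGENCY: National Park Service, Interior\n")
title = extract_title(sample)
print(title)
```

Titles that wrap across several lines would be cut to their final line under this assumption.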

In [14]:
doc_list2[2]['text'][0]

<pre>
[Federal Register Volume 80, Number 243 (Friday, December 18, 2015)]
[Proposed Rules]
[Pages 79013-79020]
From the Federal Register Online via the Government Publishing Office [<a href="http://www.gpo.gov">www.gpo.gov</a>]
[FR Doc No: 2015-31793]



[[Page 79013]]

-----------------------------------------------------------------------

DEPARTMENT OF THE INTERIOR

National Park Service

36 CFR Part 7

[NPS-CALO-19111; PPWONRADE2, PMP00EI05.YP]
RIN 1024-AE24


Cape Lookout National Seashore, Off-Road Vehicle Management

AGENCY: National Park Service, Interior

ACTION: Proposed rule.

-----------------------------------------------------------------------

SUMMARY: The National Park Service proposes to designate routes for, 
and manage off-road vehicle use within Cape Lookout National Seashore, 
North Carolina. Under the National Park Service general regulations, 
the operation of motor vehicles off roads is prohibited unless 
authorized by special regulation. The proposed rule wou

In [5]:
doc_list =load(open("data/Master_doc_content",'rb'))
document = doc_list[3]
document_text = str(document['text'][0])

In [106]:

#clean_text, clean_summary = get_document_text(document_text)
text = """In addition[[Page 123]] to -----causing multiple  diseases and cancers, tobacco smoking has many other 
adverse effects on  the body, including inflammation and impairment to the immune  system.
\\1\\ ---------------------------------------------------------------------------      
\\1\\ Office of the Surgeon General, ``The Health Consequences of  Smoking--50 Years of Progress,'' (2014), 
available at http://www.surgeongeneral.gov/library/reports/50-years-of-progress/full-report.pdf."""
#tokens = tokenize_text(text)

In [121]:
#text

"In addition[[Page 123]] to -----causing multiple  diseases and cancers, tobacco smoking has many other adverse effects on  the body, including inflammation and impairment to the immune  system.\\1\\ ---------------------------------------------------------------------------      \\1\\ Office of the Surgeon General, ``The Health Consequences of  Smoking--50 Years of Progress,'' (2014), available at http://www.surgeongeneral.gov/library/reports/50-years-of-progress/full-report.pdf."

In [16]:
# |\/\d+\/       # words of form /num/ -- references in our doc
#     |\[\[Page \d+\]\] # for page number markings in our doc of form  [[Page num]]
#     | -*
#     | .*
#text = text.replace('\\','|')
test = "abc[[Page 123]]\\59\\c---e"
# cr = re.compile('''\[\[Page \d+\]\]
#                 |\\+\d+\\+
#                 |-*''')

cr = re.compile(r'''\\\\\d+\\
                |-+''')
#cs = re.compile('\\\d+\\ ')
ct = re.sub(cr, r'', test)
#ct = re.sub(cs, '', ct)
ct

'abc[[Page 123]]\\59\\ce'

In [170]:
print(ct)

abc[[Page 123]]
\ce


In [6]:
clean_text, clean_summary = get_document_text(document_text)
#clean_text = clean_text.replace('\\','|')
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sents = sent_tokenizer.tokenize(clean_text)
cr = re.compile('\[\[Page \d+\]\]|-*|\\+[0-9]+\\+')
clean_sent =[]
for sent in raw_sents:
    clean_sent.append(re.sub(cr, '', sent))


In [88]:
clean_sent

['  I.',
 'Executive Summary  A.',
 'Purpose of the Proposed Rule      The purpose of the proposed rule is to require PHAs to, within 18  months of the final rule, establish a policy prohibiting lit tobacco  products, as such term is proposed to be defined in Sec.',
 '965.653(c).',
 "inside all indoor areas of public housing, including but not limited to  living units, indoor common areas, electrical closets, storage units,  and PHA administrative office buildings and in all outdoor areas within  25 feet of the housing and administrative office buildings  (collectively, ``restricted areas'').",
 'As further discussed in this  rule, such a policy is expected to improve indoor air quality in public  housing, benefit the health of public housing residents and PHA staff,  reduce the risk of catastrophic fires, and lower overall maintenance  costs.',
 'B.',
 'Summary of Major Provisions of the Proposed Rule      This proposed rule would apply to all public housing, other than  dwelling unit

In [89]:
raw_sents

['  I.',
 'Executive Summary  A.',
 'Purpose of the Proposed Rule      The purpose of the proposed rule is to require PHAs to, within 18  months of the final rule, establish a policy prohibiting lit tobacco  products, as such term is proposed to be defined in Sec.',
 '965.653(c).',
 "inside all indoor areas of public housing, including but not limited to  living units, indoor common areas, electrical closets, storage units,  and PHA administrative office buildings and in all outdoor areas within  25 feet of the housing and administrative office buildings  (collectively, ``restricted areas'').",
 'As further discussed in this  rule, such a policy is expected to improve indoor air quality in public  housing, benefit the health of public housing residents and PHA staff,  reduce the risk of catastrophic fires, and lower overall maintenance  costs.',
 'B.',
 'Summary of Major Provisions of the Proposed Rule      This proposed rule would apply to all public housing, other than  dwelling unit

In [194]:
re.sub(r"\[+Page\s*\d+\]+|\\+\d+\\+|-+",'',text)

"In addition to causing multiple  diseases and cancers, tobacco smoking has many other adverse effects on  the body, including inflammation and impairment to the immune  system.        Office of the Surgeon General, ``The Health Consequences of  Smoking50 Years of Progress,'' (2014), available at http://www.surgeongeneral.gov/library/reports/50yearsofprogress/fullreport.pdf."
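
The output above shows a side effect of the `-+` alternative: hyphens inside words and URLs are removed too (`50yearsofprogress`, `fullreport.pdf`). One possible variation, sketched below, only strips runs of five or more hyphens, which targets the separator rules while leaving ordinary hyphens alone. The threshold of five and the name `strip_noise` are assumptions for illustration, and this variant would leave a double hyphen such as the one in `Smoking--50` untouched.

```python
import re

# Page markers like [[Page 123]], footnote references like \1\, and long hyphen rules.
NOISE = re.compile(r"\[+Page\s*\d+\]+|\\+\d+\\+|-{5,}")


def strip_noise(sentence):
    """Remove page markers, footnote references, and separator rules from a sentence."""
    return NOISE.sub("", sentence)


sample = "In addition[[Page 123]] to -----causing harm.\\1\\ See the full-report.pdf"
cleaned = strip_noise(sample)
print(cleaned)
```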

In [7]:
sl = get_sentences(clean_text)
s = handle_tables(sl)

In [8]:
for i in range(len(s)):
    print(str(i)+" "+str(s[i]))

0 0
1 0
2 0
3 0
4 -20
5 -20
6 0
7 -20
8 0
9 0
10 0
11 0
12 -20
13 -20
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 -20
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 -20
45 -20
46 0
47 -20
48 -20
49 -20
50 -20
51 -20
52 -20
53 0
54 0
55 -20
56 -20
57 -20
58 -20
59 0
60 -20
61 -20
62 -20
63 -20
64 -20
65 -20
66 -20
67 -20
68 -20
69 0
70 0
71 -20
72 -20
73 -20
74 -20
75 -20
76 -20
77 0
78 0
79 -20
80 -20
81 -20
82 0
83 -20
84 -20
85 0
86 0
87 0
88 -20
89 0
90 -20
91 -20


In [9]:
for i in range(len(sl)):
    print(str(i)+" "+str(sl[i]))

0  A discussion of the changes we are making in   this rule follows.
1 Research Room Hours in DC Area Facilities        Our research center and Central Research Room in the National   Archives Building and the research rooms at the National Archives at   College Park facility are currently open for research Monday through   Friday from 8:45 a.m. to 5 p.m.; on Tuesday, Thursday and Friday   evenings from 5 p.m. to 9 p.m.; and Saturdays from 8:45 a.m. to 4:45   p.m.
2 This interim final rule would eliminate Saturday hours and change   the research room hours to 9 a.m. to 5 p.m. on weekdays, more closely   reflecting NARA official business hours in those facilities.
3 The new   research room hours are specified in Sec.
4 Sec.
5 1253.1(a) and   1253.2(b).
6 We are also amending Sec.
7 1253.8 since we no longer will   have Saturday hours.
8 During the evening and Saturday hours we must provide staff to   supervise the seven research rooms and assist researchers.
9 We also   require addition

In [18]:
ss = re.sub(r"\s+",' ',sl[13])
dots = sl[13].count('.')
numbers = sl[13].count('1') +sent.count('2') +sent.count('3')+ sent.count('4') +sent.count('5')+sent.count('6')+sent.count('7')+sent.count('8')+sent.count('9')+sent.count('10')
sent_len = len(sl[13])

In [19]:
dots + numbers


275

In [20]:
sent_len

3294

In [21]:
len(sl[13].split(' '))

1820

In [86]:
ss

"Annual Burden Estimate: This proposal would result in a 20year recordkeeping and reporting burden as follows: Summary of time and costs (20year): The following table sums up the costs and time: Total cost Annual cost Total time Annual time Pilotrelated costs: TriggerInitial $4,221,982 $211,099 82,923.34 4,146.17 Registration............... NonTriggerInitial 191,555,276 9,577,764 3,521,734.67 176,086.73 Registration............... NonTriggerRenewal........ 149,053,511 7,452,676 2,740,341.74 137,017.09 Additional/Replacement...... 9,654,806 482,740 363,509.25 18,175.46 Portals: KTC......................... 10,100,840 505,042 655,048.00 32,752.40 DPE......................... 17,545,925 877,296 233,945.67 11,697.28 FAA Contractor.................. 5,328,284 266,414 N/A N/A Total....................... 387,460,624 19,373,031 7,597,502.66 379,875.13 The agency is soliciting comments to (1) Evaluate whether the proposed information requirement is necessary for the proper performance of the f

In [11]:
tagged_tokens = tag_my_text(tokenize_text_sent(clean_text))
np_list = noun_phrase_finder(tagged_tokens)
keywords = get_top_np(np_list)
top_np = get_top_unigrams(np_list)

In [12]:
table_scores = handle_tables(sl)
sentence_length_scores = get_sentence_lengths(sl)
keyphrase_scores = get_keyphrase_scores(clean_text,sl, keywords)

In [13]:
table_scores[13]

-20

In [88]:
handle_tables([sl[271]])

[-100]
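
The table check boils down to a density test: what share of a whitespace-collapsed sentence consists of dots and digits. The sketch below expresses that test as a boolean helper. `looks_like_table` is a hypothetical name, and it counts every digit including `0`, which is a simplification of the counting used in `handle_tables`, so borderline sentences may be classified differently.

```python
import re


def looks_like_table(sentence, threshold=0.09):
    """Return True when dots and digits make up at least `threshold` of the sentence."""
    collapsed = re.sub(r"\s+", " ", sentence)
    if not collapsed:
        return False  # empty input cannot be a table and would divide by zero
    dense = sum(1 for ch in collapsed if ch.isdigit() or ch == ".")
    return dense / len(collapsed) >= threshold


row = "Registration............... 191,555,276 9,577,764"
prose = "We are also amending the research room hours."
print(looks_like_table(row), looks_like_table(prose))
```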

In [17]:
stepped_sentence_length =[]
for each_score in sentence_length_scores:
        s = each_score//10
        if s>10:
            s = 10
        stepped_sentence_length.append(s)
stepped_sentence_length[13]

10

In [18]:
keyphrase_scores[13]

40

In [13]:
s1 = "The fortyfive NAICS codes in which WOSBs are underrepresented are: 2213Water, Sewage and Other systems; 2361Residential Building Construction; 2371Utility System Construction; 2381Foundation, Structure, and Building Exterior Contractors; 2382Building Equipment Contractors; 2383Building Finishing Contractors; 2389Other Specialty Trade Contractors; 3149Other Textile Product Mills; 3159 Apparel Accessories and Other Apparel Manufacturing; 3219Other Wood Product Manufacturing; 3222Converted Paper Product Manufacturing; 3321; Forging and Stamping; 3323Architectural and Structural Metals Manufacturing; 3324Boiler, Tank, and Shipping Container Manufacturing; 3333Commercial and Service Industry Machinery Manufacturing; 3342Communications Equipment Manufacturing; 3345 Navigational, Measuring, Electromedical, and Control Instruments Manufacturing; 3346Manufacturing and Reproducing Magnetic and Optical Media; 3353Electrical Equipment Manufacturing; 3359Other Electrical Equipment and Component Manufacturing; 3369Other Transportation Equipment Manufacturing; 4842Specialized Freight Trucking; 4881 Support Activities for Air Transportation; 4884Support Activities for Road Transportation; 4885Freight Transportation Arrangement; 5121 Motion Picture and Video Industries; 5311Lessors of Real Estate; 5413Architectural, Engineering, and Related Services; 5414 Specialized Design Services; 5415Computer Systems Design and Related Services; 5416 Management, Scientific, and Technical Consulting Services; 5419Other Professional, Scientific, and Technical Services; 5611Office Administrative Services; 5612Facilities Support Services; 5614 Business Support Services; 5616Investigation and Security Services; 5617Services to Buildings and Dwellings; 6116Other Schools and Instruction; 6214Outpatient Care Centers; 6219Other Ambulatory Health Care Services; 7115Independent Artists, Writers, and Performers; 7223Special Food Services; 8111Automotive Repair and Maintenance; 8113Commercial and 
Industrial Machinery and Equipment (except Automotive and Electronic) Repair and Maintenance; and 8114 Personal and Household Goods Repair and Maintenance."

In [16]:
len(s1.split(' '))

228
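
One caveat about these word counts: `split(' ')` produces an empty string for every extra space, and the Federal Register text contains many runs of two or three spaces, so sentence lengths computed this way are inflated. The short example below compares the two forms on a made-up fragment.

```python
sample = "open for research Monday through   Friday"

by_space = sample.split(' ')   # splits on each single space, keeping empty strings
by_whitespace = sample.split() # splits on runs of whitespace

print(len(by_space), len(by_whitespace))
print(by_space)
```

Using `split()` with no argument in `get_sentence_lengths` would give counts closer to the true number of words, at the cost of changing the stepped length scores.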