## Natural Language Processing
### Example from a PDF


Note (possible consideration for projects): you do not need to work with a PDF for NLP!    
For example, you can work with a .txt file or read text off of a web page.

#### Notes for HW2   

Your code needs to be general enough to work for future documents. If I wanted to run it for 2017, I should not have to change more than one line of code. You don't need to verify that you get the correct answers for other years; just generalize your code enough that it runs for them.
     
Good code tells users up front, at the top of the file, what they will need to change.

You need to use functions in this assignment (and going forward).

It is ok to hard code the page numbers      
Best practice:     
- If you need to hard code, make it an argument
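The points above can be sketched as follows. This is only an illustration: the names `YEAR`, `build_filename`, and `select_page` are hypothetical, and the filename pattern is an assumption extrapolated from the single filename used later in this notebook.

```python
# Everything a user may need to change lives here, at the top of the file.
YEAR = 2020  # the only line to edit when running the code for another year


def build_filename(year):
    """Build a report filename for a fiscal year (assumed naming pattern)."""
    next_year_suffix = str(year + 1)[-2:]  # 2020 -> "21"
    return "FY_{}_{}_Section_B_Executive_Summary.pdf".format(year, next_year_suffix)


def select_page(pages, page_num=4):
    """Return one page of text; the hard-coded page number is an argument."""
    return pages[page_num]


print(build_filename(YEAR))
```

Because the page number is a default argument rather than a literal buried in the function body, a caller can override it with `select_page(pages, page_num=7)` without editing the function.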

You'll want to try to catch negations (e.g. "will not increase", "failed to rise") 
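One simple way to start on negations is a word-window check: look for a negation cue in the few words before the keyword. The sketch below uses plain Python rather than spaCy; the cue list, the window size, and the name `is_negated` are assumptions for illustration, and a heuristic like this will miss or misfire on some sentences.

```python
import re

# Assumed, incomplete list of negation cues; extend it for your documents.
NEGATION_CUES = {"not", "no", "never", "without", "fail", "fails", "failed"}


def is_negated(sentence, keyword, window=3):
    """Return True if a negation cue appears within `window` words before `keyword`."""
    words = re.findall(r"[a-z']+", sentence.lower())
    for i, word in enumerate(words):
        if word == keyword:
            preceding = words[max(0, i - window):i]
            # Also treat contractions such as "won't" or "didn't" as cues.
            if any(w in NEGATION_CUES or w.endswith("n't") for w in preceding):
                return True
    return False


print(is_negated("Rates will not increase", "increase"))
print(is_negated("Rates will increase", "increase"))
```

The first call prints `True` and the second prints `False`.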

Tip for writing functions:
- Think of what it should do, and mock it out
- Write example input
- Try again with different input
- Use print statements to tell you what's happening
- Remove or comment out the intermediary output when finished
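As a small illustration of these steps, here is a hypothetical helper, `count_term`, written with print statements behind a `debug` flag. The flag is one alternative to deleting or commenting out the prints when you are done; it is not required by the assignment.

```python
def count_term(text, term, debug=False):
    """Count case-insensitive occurrences of `term` in `text`."""
    lowered = text.lower()
    if debug:
        # Intermediary output: switch off with debug=False when finished.
        print("searching for", repr(term), "in", len(lowered), "characters")
    count = lowered.count(term.lower())
    if debug:
        print("found", count)
    return count


# Write example input first, then try again with different input.
print(count_term("The pandemic caused a recession. Pandemic relief followed.", "pandemic"))
print(count_term("No mention here.", "pandemic"))
```

The two calls print `2` and `0`.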

spaCy documentation: https://spacy.io/api/doc

In [308]:
# import statements belong at the top of your code
import os
import requests
import PyPDF2 
import spacy
#import pandas as pd

nlp = spacy.load("en_core_web_sm") #English

In [309]:
path = os.getcwd()

In [310]:
url = 'https://countyofsb.org/ceo/asset.c/4171'
filename = 'FY_2020_21_Section_B_Executive_Summary.pdf'

In [312]:
# make a comment for where this function is called
# e.g. called in main()
# but for this example, I'm calling my fns immediately (to demonstrate)
def get_pdf(url, filename, path):
    response = requests.get(url)
    response.raise_for_status()  # stop here if the download failed (e.g. a 404)
    with open(os.path.join(path, filename), 'wb') as ofile:
        ofile.write(response.content)



if filename not in os.listdir(path):  # check the same folder we download into
    print('downloading document from {}'.format(url))
    get_pdf(url, filename, path)
else:
    print('document already in {}'.format(path))

document already in /Users/Sarah/Documents/GitHub/Sarah-Discussion-Notebooks


In [313]:
os.listdir()

['2008-12-16.txt',
 '.DS_Store',
 'dependency parser.svg',
 'FY_2020_21_Section_B_Executive_Summary.pdf',
 '.ipynb',
 'lab_3-nlp and pdfs.ipynb',
 'Fed_nlp_example.py',
 '.ipynb_checkpoints',
 '.git',
 '2019-09-18.txt',
 'lab_1-get requests, pandas and functions.ipynb',
 'lab_2-pandas reshaping and merge.ipynb']

In [318]:
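def read_pdf(filename, path):
    # Note: PyPDF2 >= 3.0 removed the camelCase names used below; there, use
    # PdfReader, len(pdf.pages), pdf.pages[p], and page.extract_text() instead.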
    with open(os.path.join(path, filename), 'rb') as ifile:
        pdf = PyPDF2.PdfFileReader(ifile)

        print('Number of pages:', pdf.numPages)

        pages = []
        for p in range(pdf.numPages):
            page = pdf.getPage(p)
            text = page.extractText()
            text = text.replace("™", "'")
            text = text.replace("\n", "")
            pages.append(text)
        
        return pages

pages = read_pdf(filename, path)

Number of pages: 24


In [316]:
pages[0]

'    \n \n   \n \n   \n      Section B  Executive Summary '

In [328]:
pages[4]

"Executive Summary B3 both countywide discretionary revenues and departmentspecific revenues, expansion requests that would require ongoing General Fund commitmentŠsome of which might otherwise have been warrantedŠare not being recommended for funding at this time.  Departments have submitted requests for General Fund budget expansions totaling $12.7 million in ongoing funding, $4.3 million in onetime funds, and additional staffing of 59.5 FTE.  Requests for use of cannabis tax revenue total $1.5 million in ongoing funding, $341,000 in onetime funding, and 8.0 additional positions.Adherence to budget development policies continue. These policies were adopted by the Board in December and set guidelines for departments to follow while developing their budget requests.  Some policies called for contributions to reserve accounts for specific purposes, and the recommended budget reflects those allocations:o$500,000 has been set aside for Americans with Disabilities Act (ADA) improvements.  

In [325]:
pages[2][0:9]

'Executive'
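The extracted text contains stray characters left over from the PDF encoding: `read_pdf` already swaps "™" for an apostrophe, and "Š" shows up where a dash appears to belong. A minimal sketch of collecting such fixes in one place follows; the mapping and the name `clean_text` are assumptions, so check them against your own document before relying on them.

```python
# Assumed mapping from mis-decoded characters to what they likely stood for.
REPLACEMENTS = {
    "™": "'",    # appears where an apostrophe should be
    "Š": " - ",  # appears where a dash should be
    "\n": "",    # same newline handling as read_pdf
}


def clean_text(text, replacements=REPLACEMENTS):
    """Apply each character replacement to `text` in turn."""
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text


print(clean_text("commitmentŠsome of which might otherwise have been warrantedŠare"))
```

Keeping the mapping in a dictionary makes it an argument, so a different document can pass its own replacements without changing the function.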

In [326]:
def tokenize(pages, page_num):
    text = pages[page_num]
    doc = nlp(text)
    return doc

tokenized_page = tokenize(pages, 4)

In [338]:
tokenized_page[30]

for

In [339]:
list(tokenized_page[30].ancestors)

[recommended, countywide]

In [340]:
list(tokenized_page[30].children)

[funding]

#### Exploring our page

In [341]:
covid_terms = ['pandemic', 'COVID']
# use .text; the older .string attribute is deprecated
covid_tokens = [t for t in tokenized_page if any(e in t.text for e in covid_terms)]
covid_tokens

[COVID19, pandemic, COVID19, pandemic, pandemic]

In [342]:
covid_ancs = [list(t.ancestors) for t in covid_tokens]
covid_ancs

[[pandemic, accelerated, is],
 [accelerated, is],
 [impacts, impacts, against, lines, are],
 [by, caused, navigate, need],
 [after, normalﬂ, in, position, ways, need]]

In [343]:
# nested for loop
for ancs in covid_ancs:
    for anc in ancs:
        print(anc, anc.pos_)

pandemic NOUN
accelerated VERB
is VERB
accelerated VERB
is VERB
impacts NOUN
impacts NOUN
against ADP
lines NOUN
are VERB
by ADP
caused VERB
navigate VERB
need VERB
after ADP
normalﬂ NOUN
in ADP
position VERB
ways NOUN
need VERB


In [344]:
covid_anc_type = [[(anc, anc.pos_) for anc in ancs] for ancs in covid_ancs]
covid_anc_type

[[(pandemic, 'NOUN'), (accelerated, 'VERB'), (is, 'VERB')],
 [(accelerated, 'VERB'), (is, 'VERB')],
 [(impacts, 'NOUN'),
  (impacts, 'NOUN'),
  (against, 'ADP'),
  (lines, 'NOUN'),
  (are, 'VERB')],
 [(by, 'ADP'), (caused, 'VERB'), (navigate, 'VERB'), (need, 'VERB')],
 [(after, 'ADP'),
  (normalﬂ, 'NOUN'),
  (in, 'ADP'),
  (position, 'VERB'),
  (ways, 'NOUN'),
  (need, 'VERB')]]

In [345]:
covid_ancs_verbs = [[a for a in ancs if a.pos_ == 'VERB'] for ancs in covid_ancs]
covid_ancs_verbs

[[accelerated, is],
 [accelerated, is],
 [are],
 [caused, navigate, need],
 [position, need]]
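Since functions are required going forward, the verb filter above could be wrapped so that the part-of-speech tag becomes an argument. This is a sketch: `ancestors_with_pos` is a hypothetical name, and the stand-in tokens exist only so the example runs without a spaCy model (their tags are illustrative).

```python
from types import SimpleNamespace


def ancestors_with_pos(tokens, pos="VERB"):
    """For each token, list the ancestors whose coarse POS tag equals `pos`."""
    return [[a for a in t.ancestors if a.pos_ == pos] for t in tokens]


# Stand-in tokens that mimic the .text, .pos_ and .ancestors attributes.
is_tok = SimpleNamespace(text="is", pos_="VERB", ancestors=[])
accelerated = SimpleNamespace(text="accelerated", pos_="VERB", ancestors=[is_tok])
pandemic = SimpleNamespace(text="pandemic", pos_="NOUN", ancestors=[accelerated, is_tok])
covid = SimpleNamespace(text="COVID19", pos_="PROPN", ancestors=[pandemic, accelerated, is_tok])

verbs = ancestors_with_pos([covid, pandemic])
print([[a.text for a in ancs] for ancs in verbs])
```

With real tokens the call would be `ancestors_with_pos(covid_tokens)`, or `ancestors_with_pos(covid_tokens, pos="NOUN")` to keep nouns instead.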

In [346]:
for token_list in covid_ancs:
    for ancestor in token_list:
        print(ancestor.dep_)

nsubj
advcl
ROOT
advcl
ROOT
conj
pobj
prep
attr
ROOT
agent
ccomp
advcl
ROOT
prep
pobj
prep
relcl
conj
ROOT


In [347]:
covid_children = [list(t.children) for t in covid_tokens]
covid_children

[[], [the, COVID19], [], [the], [the]]

In [348]:
children_of_covid_ancs = [[list(a.children) for a in ancs] for ancs in covid_ancs]
children_of_covid_ancs

[[[the, COVID19],
  [ever, as, pandemic, has, transition, to],
  [need, greater, accelerated, .]],
 [[ever, as, pandemic, has, transition, to], [need, greater, accelerated, .]],
 [[COVID19, ,, and, recession],
  [unanticipated, State, budget, ,, impacts],
  [impacts],
  [our, first, of, against],
  [accounts, lines, .]],
 [[pandemic],
  [recession, by],
  [as, we, caused],
  [In, ,, navigate, ,, County, will, focus, ,, and, ways, .]],
 [[pandemic],
  [the, ﬁnext, or, life, after],
  [normalﬂ],
  [to, County, in, ,, through],
  [position],
  [In, ,, navigate, ,, County, will, focus, ,, and, ways, .]]]

In [349]:
list(tokenized_page[0:10].noun_chunks)

[Executive Summary B3, discretionary revenues, departmentspecific revenues]

In [350]:
list(tokenized_page[10:100].noun_chunks)

[expansion requests,
 ongoing General Fund,
 funding,
 this time,
 Departments,
 requests,
 General Fund budget expansions,
 ongoing funding,
 onetime funds,
 additional staffing,
 59.5 FTE,
 Requests,
 use,
 cannabis tax revenue,
 ongoing funding,
 onetime funding,
 8.0 additional positions,
 Adherence,
 development policies]

In [351]:
test = list(tokenized_page[45:47].noun_chunks)
test

[General Fund budget expansions]

In [352]:
test[0].root

expansions

In [353]:
# use .text; the older .string attribute is deprecated
budget_nchunks = [nc for nc in tokenized_page.noun_chunks if 'budget' in nc.text]
budget_nchunks

[General Fund budget expansions,
 their budget requests,
 the recommended budget,
 the adopted budget,
 unanticipated State budget impacts,
 the annual General Fund operating budget]

In [354]:
covid_ancs

[[pandemic, accelerated, is],
 [accelerated, is],
 [impacts, impacts, against, lines, are],
 [by, caused, navigate, need],
 [after, normalﬂ, in, position, ways, need]]

In [355]:
for t_list in covid_ancs:
    #print(t_list) #debug
    for token in t_list:
        #print(token) #debug
        if str(token) == 'accelerated':
            print('ancestor', list(token.ancestors))
            print('child', list(token.children))

ancestor [is]
child [ever, as, pandemic, has, transition, to]
ancestor [is]
child [ever, as, pandemic, has, transition, to]


In [356]:
for t_list in covid_ancs:
    for token in t_list:
        if str(token) == 'accelerated':
            accelerated_anc = list(token.ancestors) # expect "is"

for token in accelerated_anc:
    print(list(token.ancestors))
    print(list(token.children))

[]
[need, greater, accelerated, .]


**This can all be deeply unsatisfying, and you're going to hit a lot of dead ends. Sometimes you do just have to use a brute-force approach though. 
Try enough things and you'll get what you're looking for**
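One option that sometimes shortens the search is to step back from individual tokens and pull out whole sentences that mention a term, using the `sents` attribute of a parsed spaCy document. The sketch below is an assumption about how you might organize that; `sentences_mentioning` is a hypothetical name, and the stand-in document exists only so the example runs without a model.

```python
from types import SimpleNamespace


def sentences_mentioning(doc, terms):
    """Return the text of every sentence in `doc` that contains any of `terms`."""
    return [sent.text for sent in doc.sents
            if any(term in sent.text for term in terms)]


# Stand-in for a parsed document; with spaCy you would pass tokenized_page.
fake_doc = SimpleNamespace(sents=[
    SimpleNamespace(text="The COVID19 pandemic has caused a national recession."),
    SimpleNamespace(text="Departments have submitted requests."),
])

print(sentences_mentioning(fake_doc, ["pandemic", "COVID"]))
```

Once you have the relevant sentences, you can parse each one on its own, as in the single-sentence example that follows.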

### Let's take a look at just one sentence

In [357]:
text = 'The COVID19 pandemic has caused a national recession'
doc = nlp(text)
doc

The COVID19 pandemic has caused a national recession

In [358]:
for token in doc:
    print(token, list(token.ancestors))

The [pandemic, caused]
COVID19 [pandemic, caused]
pandemic [caused]
has [caused]
caused []
a [recession, caused]
national [recession, caused]
recession [caused]


In [359]:
for token in doc:
    print(token, list(token.children))

The []
COVID19 []
pandemic [The, COVID19]
has []
caused [pandemic, has, recession]
a []
national []
recession [a, national]


In [360]:
img = spacy.displacy.render(doc, style='dep', options={'distance' : 140}, jupyter=False)
with open('dependency parser.svg', 'w', encoding='utf-8') as f:
    f.write(img)

In [361]:
# subtree yields the token and all of its descendants (children, their children, ...)
for token in doc:
    print(token, list(token.subtree))

The [The]
COVID19 [COVID19]
pandemic [The, COVID19, pandemic]
has [has]
caused [The, COVID19, pandemic, has, caused, a, national, recession]
a [a]
national [national]
recession [a, national, recession]
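To see how ancestors, children, and subtree relate, the three outputs for this sentence can be reproduced with a plain dictionary that maps each word to its head. This is a toy model and not how spaCy stores the parse; the head table is read off the outputs shown here, and it works only because every word in the sentence is unique.

```python
# word -> head, in sentence order; the root ("caused") has no head.
HEADS = {
    "The": "pandemic", "COVID19": "pandemic", "pandemic": "caused",
    "has": "caused", "caused": None, "a": "recession",
    "national": "recession", "recession": "caused",
}
ORDER = list(HEADS)  # dicts keep insertion order, i.e. sentence order


def ancestors(word):
    """Follow head links upward until the root is reached."""
    out = []
    head = HEADS[word]
    while head is not None:
        out.append(head)
        head = HEADS[head]
    return out


def children(word):
    """Words whose immediate head is `word`."""
    return [w for w in ORDER if HEADS[w] == word]


def subtree(word):
    """The word itself plus every word that has it as an ancestor."""
    return [w for w in ORDER if w == word or word in ancestors(w)]


for w in ORDER:
    print(w, ancestors(w), children(w), subtree(w))
```

Note that `subtree("pandemic")` contains "The" and "COVID19" but not "caused": a subtree only goes down the tree, never up.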


In [362]:
nc = list(doc.noun_chunks)
nc

[The COVID19 pandemic, a national recession]

In [363]:
nc[0].root

pandemic

In [364]:
nc[1].root

recession