___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [22]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [23]:
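# Enter your code here:
# (this path assumes owlcreek.txt sits in the working directory; errors='ignore' skips undecodable bytes)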
with open('owlcreek.txt', errors='ignore') as f:
    doc = nlp(f.read())


In [24]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [25]:
len(doc)

4833

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [26]:
doc_sentences = [sent for sent in doc.sents]

len(doc_sentences)

211
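The hint matters because `doc.sents` is a generator, and generators have no length or indexing. A minimal sketch of that behavior in plain Python, with a hypothetical `fake_sents` generator standing in for `doc.sents` (the sample strings are illustrative, not taken from the story):

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items one at a time."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

try:
    len(fake_sents())            # generators do not support len()
    generator_has_len = True
except TypeError:
    generator_has_len = False

sentences = list(fake_sents())   # materialize once, then count and index freely
sentence_count = len(sentences)
second_sentence = sentences[1]
```

Building the list once also means later cells can index into it without re-running the sentence segmentation.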

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [27]:
print(doc_sentences[1].text)

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [28]:
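# NORMAL SOLUTION:
# NOTE: the third column uses token.lemma (no underscore), which is the lemma's
# integer hash ID, so the output below shows hashes where the dep tag belongs.
# Use token.dep_ in that position to print the dependency tag instead.
for token in doc_sentences[1]:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)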

A 	 DET 	 11901859001352538922 	 a
man 	 NOUN 	 3104811030673030468 	 man
stood 	 VERB 	 16121235759125543490 	 stand
upon 	 ADP 	 12776617025319584140 	 upon
a 	 DET 	 11901859001352538922 	 a
railroad 	 NOUN 	 11929338562591612190 	 railroad
bridge 	 NOUN 	 10505406131357236919 	 bridge
in 	 ADP 	 3002984154512732771 	 in
northern 	 ADJ 	 14402328224860809449 	 northern
Alabama 	 PROPN 	 2372026494511674142 	 alabama
, 	 PUNCT 	 2593208677638477497 	 ,
looking 	 VERB 	 16096726548953279178 	 look
down 	 PART 	 6421409113692203669 	 down

 	 SPACE 	 962983613142996970 	 

into 	 ADP 	 3278561384161438710 	 into
the 	 DET 	 7425985699627899538 	 the
swift 	 ADJ 	 9502497712543975804 	 swift
water 	 NOUN 	 7248544922998488549 	 water
twenty 	 NUM 	 8304598090389628520 	 twenty
feet 	 NOUN 	 779410287755165804 	 foot
below 	 ADV 	 13516515296229086732 	 below
. 	 PUNCT 	 12646065887601541794 	 .
  	 SPACE 	 8532415787641010193 	  


In [29]:
# CHALLENGE SOLUTION:
for token in doc_sentences[1]:
    print(f"{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}} {token.lemma_:{15}}")

A               DET   det        a              
man             NOUN  nsubj      man            
stood           VERB  ROOT       stand          
upon            ADP   prep       upon           
a               DET   det        a              
railroad        NOUN  compound   railroad       
bridge          NOUN  pobj       bridge         
in              ADP   prep       in             
northern        ADJ   amod       northern       
Alabama         PROPN pobj       alabama        
,               PUNCT punct      ,              
looking         VERB  advcl      look           
down            PART  prt        down           

               SPACE            
              
into            ADP   prep       into           
the             DET   det        the            
swift           ADJ   amod       swift          
water           NOUN  pobj       water          
twenty          NUM   nummod     twenty         
feet            NOUN  npadvmod   foot           
below           ADV 
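The column alignment in the challenge solution comes from the nested width field in the f-string: `{value:{15}}` pads `value` to 15 characters. A small sketch of the mechanism without spaCy, using the first three rows of the output above as plain tuples (the `rows` and `lines` names are only for illustration):

```python
# (text, pos, dep, lemma) rows copied from the first lines of the output above
rows = [
    ("A", "DET", "det", "a"),
    ("man", "NOUN", "nsubj", "man"),
    ("stood", "VERB", "ROOT", "stand"),
]

lines = []
for text, pos, dep, lemma in rows:
    # Strings are left-aligned by default and padded with spaces up to the width
    lines.append(f"{text:{15}} {pos:{5}} {dep:{10}} {lemma:{15}}")

print("\n".join(lines))
```

A value longer than its width is not truncated, so the widths should be chosen to fit the longest expected entry in each column.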

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [30]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [31]:
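# Create a pattern and add it to matcher:
# Strict version, requiring exactly one whitespace token between the words:
# pattern = [{'LOWER' : 'swimming'}, {'IS_SPACE' : True}, {'LOWER' : 'vigorously'}]

# 'OP': '*' lets the pattern match zero or more whitespace tokens instead:
pattern = [{'LOWER' : 'swimming'}, {'IS_SPACE' : True, 'OP':'*'}, {'LOWER' : 'vigorously'}]

# spaCy v2 call signature; spaCy v3 expects matcher.add('Swimming', [pattern])
matcher.add('Swimming', None, pattern)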

In [32]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)

print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]


**7. Print the text surrounding each found match**

In [33]:
for match_id, start, end in found_matches:
    span = doc[max(start-10, 0):end+10]            # the match plus up to 10 tokens of context on each side
    print(span.text)
    print('-----')

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and
-----
saw all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic
-----
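Subtracting a fixed number from `start` is safe here because both matches sit well inside the document. For a match near the beginning, the lower bound would go negative, and in Python a negative slice index counts from the end of the sequence. A sketch with a plain list standing in for the Doc (the `context_window` helper and the sample tokens are hypothetical, and it is an assumption that Doc slicing follows the same convention):

```python
def context_window(tokens, start, end, width=10):
    """Return tokens[start:end] plus up to `width` items of context per side."""
    lo = max(start - width, 0)           # clamp so the index never goes negative
    hi = min(end + width, len(tokens))   # slices tolerate overshoot, but be explicit
    return tokens[lo:hi]

tokens = "he was now swimming vigorously with the current".split()

# The match "swimming vigorously" covers positions 3..5 (end exclusive)
naive = tokens[3 - 10:5 + 10]            # start is -7, i.e. "7 from the end"
clamped = context_window(tokens, 3, 5)
```

In this sample the unclamped slice silently drops the first token, while the clamped version returns the whole list.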


In [34]:
# Answer key version: a helper that prints 5 tokens of context on each side of a match
def surrounding(doc, start, end):
    print(doc[start-5:end+5])

surrounding(doc, 1274, 1277)

surrounding(doc, 3607, 3610)

evade the bullets and, swimming
vigorously, reach the bank,
shoulder; he was now swimming
vigorously with the current.  


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [35]:
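# Sentences are in document order, so the first one whose end lies past
# the start of the first match is the sentence that contains it.
for sentence in doc_sentences:
    if found_matches[0][1] < sentence.end:
        print(sentence)
        break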

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [36]:
# Same lookup for the second match (index 1 in found_matches).
for sentence in doc_sentences:
    if found_matches[1][1] < sentence.end:
        print(sentence)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
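The two cells above run the same loop with a different match index. A sketch of that lookup as one reusable function, using plain `(start, end)` tuples as hypothetical stand-ins for sentence spans (with `end` exclusive, like `sentence.end`); the offsets are made up for illustration and are not the story's real token positions:

```python
def containing_sentence(sentence_bounds, match_start):
    """Return the index of the first sentence whose end lies past match_start.

    Assumes the bounds are sorted and contiguous, as sentence spans in a
    document are.
    """
    for i, (sent_start, sent_end) in enumerate(sentence_bounds):
        if match_start < sent_end:
            return i
    return None  # match_start lies beyond the last sentence

# Made-up token offsets for three consecutive sentences
bounds = [(0, 8), (8, 20), (20, 31)]
match_starts = [9, 25]

hits = [containing_sentence(bounds, s) for s in match_starts]
```

With the notebook's own objects, the equivalent would be to loop over `found_matches` and apply the `start < sentence.end` test to each match in turn.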


### Great Job!