# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [24]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [25]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt') as file:
    document = file.read()
    doc = nlp(document)


In [26]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [27]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [28]:
sentences = list(doc.sents)
len(sentences)


204

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [29]:
sentences[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
**CHALLENGE: Have values line up in columns in the print output.**

In [30]:
# NORMAL SOLUTION:
sent = sentences[1]
for token in sent:
    print(token.text, token.pos_, token.dep_, token.lemma_)

The DET det the
man NOUN poss man
's PART case 's
hands NOUN nsubj hand
were AUX ROOT be
behind ADP prep behind

 SPACE dep 

his PRON poss his
back NOUN pobj back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE dep  


In [31]:
# CHALLENGE SOLUTION:
print(f'{"Token":{15}} {"P.O.S.":{10}} {"Dep":{13}} {"Lemma":{15}}')
print('------------------------------------------------')
for token in sent:
    print(f'{token.text:{15}} {token.pos_:{10}} {token.dep_:{13}} {token.lemma_:{15}}')


Token           P.O.S.     Dep           Lemma          
------------------------------------------------
The             DET        det           the            
man             NOUN       poss          man            
's              PART       case          's             
hands           NOUN       nsubj         hand           
were            AUX        ROOT          be             
behind          ADP        prep          behind         

               SPACE      dep           
              
his             PRON       poss          his            
back            NOUN       pobj          back           
,               PUNCT      punct         ,              
the             DET        det           the            
wrists          NOUN       appos         wrist          
bound           VERB       acl           bind           
with            ADP        prep          with           
a               DET        det           a              
cord            NOUN       pobj          cord           
.               PUNCT      punct         .              
                SPACE      dep                          
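
The column alignment in the challenge solution comes from nesting a width inside the format spec, as in `{value:{width}}`. The sketch below isolates that idea in plain Python, without spaCy. The `rows` data and the `format_row` helper are hypothetical stand-ins for the token attributes, and the widths simply mirror the ones used above.

```python
# Hypothetical rows standing in for (text, pos, dep, lemma) token attributes.
rows = [
    ("The", "DET", "det", "the"),
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
]


def format_row(text, pos, dep, lemma, widths=(15, 10, 13, 15)):
    """Pad each field to a fixed width so the columns line up."""
    w_text, w_pos, w_dep, w_lemma = widths
    # Strings are left-aligned by default, so {text:{w_text}} pads on the right.
    return f"{text:{w_text}} {pos:{w_pos}} {dep:{w_dep}} {lemma:{w_lemma}}"


# Build the header with the same helper so it cannot drift out of alignment.
table = [format_row("Token", "P.O.S.", "Dep", "Lemma")]
table += [format_row(*row) for row in rows]
print("\n".join(table))
```

Padding only helps when the field itself is printable. Whitespace tokens such as a line break still disturb the layout, as the `SPACE` rows in the output above show.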

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [35]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [36]:
# Create a pattern and add it to matcher:
phrase = [
    {"LOWER": "swimming"},
    {"IS_SPACE": True},
    {"LOWER": "vigorously"}
]

matcher.add("Swimming", [phrase])


In [37]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
found_matches


[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
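
As a sanity check that does not depend on spaCy, the same phrase can be counted with a regular expression. This is only a sketch: the `sample` string is a short excerpt assembled from the two passages in this notebook, not the whole file, and `\s+` is used here as a rough analogue of the `IS_SPACE` token.

```python
import re

# Excerpt assembled from the two matching passages in this notebook;
# the line break after "swimming" is kept as it appears in the source text.
sample = (
    "By diving I could evade the bullets and, swimming\n"
    "vigorously, reach the bank, take to the woods and get away home.\n"
    "he was now swimming\n"
    "vigorously with the current."
)

# \s+ absorbs the newline that separates the two words in the text.
pattern = re.compile(r"swimming\s+vigorously", flags=re.IGNORECASE)
hits = [m.span() for m in pattern.finditer(sample)]

print(len(hits))                        # number of occurrences in the excerpt
print("swimming vigorously" in sample)  # literal single-space search
```

The literal search fails because the words are separated by a line break rather than a space, which is the reason the hint asks for an `IS_SPACE` pattern. Note that `\s+` would also match an ordinary single space, whereas spaCy does not normally produce a separate token for one, so the two approaches are not strictly equivalent.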

**7. Print the text surrounding each found match**

In [38]:
matches = []
for found_match in found_matches:
    # found_match[0] is the hash ID of the pattern name ("Swimming"); both matches share it
    # found_match[1] is the index of the first token of the match
    # found_match[2] is the index one past the last token of the match
    # The offsets 9 and 14 are arbitrary; they were chosen only to reproduce the expected output
    matches.append(doc[found_match[1]-9:found_match[2]+14])

print(matches[0])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.


In [39]:
print(matches[1])

all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic as his arms
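
Subtracting a fixed offset from the start index works here because both matches sit well inside the document. For a match near the beginning, `start - 9` would go negative, and Python-style slicing would then count from the end instead. Below is a minimal sketch of a clamped window. The `context_window` helper and the token list are hypothetical, and a plain list stands in for the `Doc`.

```python
def context_window(tokens, start, end, before=9, after=14):
    """Return tokens[start-before : end+after], clamped to the sequence bounds."""
    lo = max(start - before, 0)         # avoid a negative index wrapping to the end
    hi = min(end + after, len(tokens))  # explicit upper bound
    return tokens[lo:hi]


# Hypothetical token list; the match "swimming vigorously" covers indices 3..4.
tokens = "he was now swimming vigorously with the current".split()
start, end = 3, 5

naive = tokens[start - 9:end + 14]          # start index becomes -6
clamped = context_window(tokens, start, end)

print(naive)
print(clamped)
```

The helper relies only on `len()` and slicing, so it should also accept a spaCy `Doc`, though that is an assumption rather than something tested in this notebook.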



**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [40]:

match_sentence = []
for sentence in sentences:
    for (match_id, start, end) in found_matches:
        if start >= sentence.start and end <= sentence.end:
            match_sentence.append(sentence)
        
match_sentence[0]



By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  

In [41]:

match_sentence[1]



The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
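
The containment test can be illustrated on plain `(start, end)` token offsets. The sketch below uses inclusive comparisons, so a match that begins on a sentence's first token or stops at its last token is still found. The offsets are hypothetical and are not taken from `owlcreek.txt`.

```python
def containing_sentences(sentence_bounds, match_spans):
    """Pair each match with the (start, end) bounds of the sentence containing it."""
    results = []
    for m_start, m_end in match_spans:
        for s_start, s_end in sentence_bounds:
            # End offsets are exclusive, so equality on either side still means
            # the match lies entirely inside the sentence.
            if s_start <= m_start and m_end <= s_end:
                results.append(((m_start, m_end), (s_start, s_end)))
                break
    return results


# Hypothetical token offsets: three consecutive sentences and two matches,
# one at the very start of a sentence and one at the very end of another.
sentence_bounds = [(0, 10), (10, 25), (25, 40)]
match_spans = [(10, 13), (37, 40)]

pairs = containing_sentences(sentence_bounds, match_spans)
print(pairs)

# The same test with strict inequalities misses both boundary matches.
strict = [
    (m, s)
    for m in match_spans
    for s in sentence_bounds
    if m[0] > s[0] and m[1] < s[1]
]
print(strict)
```

spaCy spans also expose a `.sent` attribute, so `doc[start:end].sent` may be a shorter route to the same result when sentence boundaries have been set.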