# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [9]:
# Enter your code here:
with open('./owlcreek.txt', 'r', encoding='utf-8') as f:
    text = f.read()
doc = nlp(text)

In [10]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [12]:
num_tokens = len(doc)
num_tokens

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [14]:
sentences = list(doc.sents)
num_sentences = len(sentences)
num_sentences

204
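
The hint about building a list exists because `doc.sents` is a generator, and generators have no length. The sketch below shows the same behaviour with a stand-in generator; `toy_sents` and its three strings are made up for illustration and are not taken from the story.

```python
def toy_sents():
    # Stand-in for doc.sents: yields items lazily, one at a time.
    yield from ["First sentence.", "Second sentence.", "Third sentence."]

# Calling len() on a generator raises TypeError.
try:
    len(toy_sents())
    has_len = True
except TypeError:
    has_len = False

# Option 1: materialise the generator, which also allows indexing later.
toy_sentences = list(toy_sents())

# Option 2: count without keeping the items in memory.
toy_count = sum(1 for _ in toy_sents())

print(has_len, len(toy_sentences), toy_count)
```

Building the list is the better fit here because the next question indexes into it.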

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [28]:
second_sentence = sentences[1]
second_sentence

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [36]:
# NORMAL SOLUTION:
for token in second_sentence:
    print(f"{token.text} {token.pos_} {token.dep_} {token.lemma_}")


The DET det the
man NOUN poss man
's PART case 's
hands NOUN nsubj hand
were AUX ROOT be
behind ADP prep behind

 SPACE dep 

his PRON poss his
back NOUN pobj back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE dep  


In [None]:
# CHALLENGE SOLUTION:
for token in second_sentence:
    print(f"{token.text:<15} {token.pos_:<8} {token.dep_:<8} {token.lemma_:<8}")

A               DET   det        a              
man             NOUN  nsubj      man            
stood           VERB  ROOT       stand          
upon            ADP   prep       upon           
a               DET   det        a              
railroad        NOUN  compound   railroad       
bridge          NOUN  pobj       bridge         
in              ADP   prep       in             
northern        ADJ   amod       northern       
Alabama         PROPN pobj       alabama        
,               PUNCT punct      ,              
looking         VERB  advcl      look           
down            PART  prt        down           

               SPACE            
              
into            ADP   prep       into           
the             DET   det        the            
swift           ADJ   amod       swift          
water           NOUN  pobj       water          
twenty          NUM   nummod     twenty         
feet            NOUN  npadvmod   foot           
below           ADV 
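
The fixed column widths in the challenge solution are a guess at the longest value. As a minimal sketch, the widths can instead be derived from the data. The `rows` list is copied by hand from the first few lines of the plain output above; in the notebook it would presumably be built from `second_sentence`.

```python
# (text, pos, dep, lemma) tuples copied from the plain output above.
rows = [
    ("The", "DET", "det", "the"),
    ("man", "NOUN", "poss", "man"),
    ("'s", "PART", "case", "'s"),
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
]

# Each column is as wide as its longest entry.
widths = [max(len(cell) for cell in column) for column in zip(*rows)]

# Left-align every cell to its column width, two spaces between columns.
lines = [
    "  ".join(f"{cell:<{width}}" for cell, width in zip(row, widths)).rstrip()
    for row in rows
]
print("\n".join(lines))
```

Whitespace tokens that contain a line break will still split a row in two; printing `repr(token.text)` instead of the raw text is one way to keep them on a single line.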

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [38]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [42]:
# Create a pattern and add it to matcher:
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True}, {"LOWER": "vigorously"}]
matcher.add("Swimming", [pattern])
matches = matcher(doc)

In [48]:
# Create a list of matches called "found_matches" and print the list:
found_matches = []

for match_id, start, end in matches:
    found_matches.append((match_id, start, end))

print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
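
Each match spans three tokens because the line break between "swimming" and "vigorously" is tokenized as its own whitespace token. As a hedged variation, marking that token optional with `"OP": "?"` lets the same pattern also match when the two words share a line. The sketch uses a blank English pipeline so that no model download is needed; the names `blank_nlp`, `demo_matcher` and the two sample strings are illustrative and not part of the assessment.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline; separate names avoid clobbering the notebook's nlp/matcher.
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)

# "OP": "?" makes the whitespace token optional (zero or one occurrence).
demo_pattern = [
    {"LOWER": "swimming"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "vigorously"},
]
demo_matcher.add("Swimming", [demo_pattern])

# Made-up samples: one with a line break between the words, one without.
samples = ["he was swimming\nvigorously", "he was swimming vigorously"]

matched_tokens = []
for sample in samples:
    demo_doc = blank_nlp(sample)
    for match_id, start, end in demo_matcher(demo_doc):
        matched_tokens.append([token.text for token in demo_doc[start:end]])

print(matched_tokens)
```

For this text the stricter pattern from the hint is sufficient, since both occurrences are split across lines.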


**7. Print the text surrounding each found match**

In [73]:
if matches:
    # Get the first match
    match_id, start, end = matches[0]
    # Get the sentence containing the match
    sentence = next(sent for sent in doc.sents if start >= sent.start and end <= sent.end)
    print(sentence.text)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [78]:
if matches:
    # Get the second match
    match_id, start, end = matches[1]
    # Get the sentence containing the match
    sentence = next(sent for sent in doc.sents if start >= sent.start and end <= sent.end)
    print(sentence.text)

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
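
Another reading of "surrounding text" is a fixed window of tokens on either side of the match. A minimal sketch of that idea follows, with both ends of the window clamped so it never runs past the document. `context_window` is a hypothetical helper, and `demo_tokens` is a hand-tokenized stand-in for `doc`, so its indices do not correspond to the real match offsets.

```python
def context_window(tokens, start, end, before=5, after=5):
    """Return tokens[start:end] plus up to `before`/`after` neighbouring tokens."""
    left = max(start - before, 0)            # a negative start would wrap around
    right = min(end + after, len(tokens))    # do not run past the last token
    return tokens[left:right]

# Hand-tokenized excerpt; the match "swimming \n vigorously" sits at indices 9..12.
demo_tokens = ["By", "diving", "I", "could", "evade", "the", "bullets", "and", ",",
               "swimming", "\n", "vigorously", ",", "reach", "the", "bank"]

window = context_window(demo_tokens, 9, 12, before=3, after=2)
print(window)

# A window larger than the text simply returns everything.
whole = context_window(demo_tokens, 9, 12, before=20, after=20)
```

In the notebook, the same clamped bounds could be used to slice `doc` for each `(match_id, start, end)` tuple in `matches`.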


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [62]:
if matches:
    # Get the first match
    match_id, start, end = matches[0]
    # Get the sentence containing the match
    sentence = next(sent for sent in doc.sents if start >= sent.start and end <= sent.end)
    print(sentence.text)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [79]:
if matches:
    # Get the second match
    match_id, start, end = matches[1]
    # Get the sentence containing the match
    sentence = next(sent for sent in doc.sents if start >= sent.start and end <= sent.end)
    print(sentence.text)


The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
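
Scanning `doc.sents` once per match is fine for two matches. With many matches, one possible alternative is to collect the sentence start offsets once and binary-search them. The sketch below assumes a sorted list of start offsets; the values in `sentence_starts` are hypothetical, and in the notebook they would presumably come from `[sent.start for sent in sentences]`.

```python
from bisect import bisect_right

# Hypothetical token offsets at which each sentence begins, in ascending order.
sentence_starts = [0, 8, 21, 40]

def sentence_index(token_index):
    """Return the index of the sentence that contains the given token offset."""
    # bisect_right counts the starts that are <= token_index;
    # the last of those is the containing sentence.
    return bisect_right(sentence_starts, token_index) - 1

# A token exactly at a sentence start belongs to that sentence.
print(sentence_index(8))
print(sentence_index(20))
print(sentence_index(45))
```

spaCy also exposes the containing sentence directly through the `sent` attribute of a span, so `doc[start:end].sent` is a shorter route when sentence boundaries are set.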
