# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
with open('../TextFiles/owlcreek.txt') as f:
    text = f.read()
doc = nlp(text)

In [6]:
# Run this cell to verify it worked:
doc[:36]
# Note: A Span displays as its original text, including the whitespace and line breaks between tokens.

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [7]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [15]:
# Build a list so the sentences can be counted and indexed
sents = list(doc.sents)
len(sents)

204
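
The hint asks for a list because `doc.sents` is a generator: it has no length and can only be iterated once. The sketch below shows that behaviour with a plain Python generator as a stand-in, so it runs without spaCy. The `fake_sents` function and its strings are made up for illustration.

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items lazily."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

gen = fake_sents()
try:
    len(gen)  # generators do not support len()
    has_len = True
except TypeError:
    has_len = False

# Materialize once, then count and index freely
sent_list = list(fake_sents())
count = len(sent_list)
second = sent_list[1]

# Counting without keeping the items also works, but gives up indexing
count_only = sum(1 for _ in fake_sents())

print(has_len, count, second, count_only)
```

Building the list once is usually the better choice here, since later questions index into it.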

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [19]:
sents[1]
# Notice the sentence tokenizer does not split at new lines

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
**CHALLENGE: Have values line up in columns in the print output.**

In [20]:
# NORMAL SOLUTION:
for token in sents[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

The DET det the
man NOUN poss man
's PART case 's
hands NOUN nsubj hand
were AUX ROOT be
behind ADP prep behind

 SPACE dep 

his PRON poss his
back NOUN pobj back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE dep  


In [22]:
# CHALLENGE SOLUTION:
for token in sents[1]:
    print(f'{token.text:{12}} {token.pos_:{7}} {token.dep_:{12}} {token.lemma_:{12}}')

The          DET     det          the         
man          NOUN    poss         man         
's           PART    case         's          
hands        NOUN    nsubj        hand        
were         AUX     ROOT         be          
behind       ADP     prep         behind      

            SPACE   dep          
           
his          PRON    poss         his         
back         NOUN    pobj         back        
,            PUNCT   punct        ,           
the          DET     det          the         
wrists       NOUN    appos        wrist       
bound        VERB    acl          bind        
with         ADP     prep         with        
a            DET     det          a           
cord         NOUN    pobj         cord        
.            PUNCT   punct        .           
             SPACE   dep                      
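
The CHALLENGE solution relies on the width part of Python's format specification. As a small sketch of how that works, the example below reuses three rows from the output above, shows that a literal width and a nested `{12}` field pad identically, and then derives each column width from the data instead of hard-coding it. The variable names are illustrative only.

```python
rows = [
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
    ("behind", "ADP", "prep", "behind"),
]

# A literal width and a nested replacement field give the same padding
fixed = f"{'hands':12}|"
nested = f"{'hands':{12}}|"

# Derive each column width from the longest value in that column
widths = [max(len(row[i]) for row in rows) for i in range(4)]

# '<' left-aligns; the nested field lets the width come from a variable
lines = [
    " ".join(f"{value:<{width}}" for value, width in zip(row, widths))
    for row in rows
]
for line in lines:
    print(line)
```

Whitespace tokens such as the newline will still break the columns, as they do in the output above. Formatting `repr(token.text)` instead of `token.text` is one possible way around that.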


**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [23]:
# Import the Matcher class:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [26]:
# Create a pattern and add it to matcher:

# Pattern to match: r"swimming\s*vigorously"
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'vigorously'}]
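# spaCy v3 signature: matcher.add(key, patterns), where patterns is a list of patterns.
# The key used here is 'SwimmingVigorously' rather than the 'Swimming' named in the task.
matcher.add('SwimmingVigorously', [pattern])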



In [28]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
found_matches

[(13245044497498710760, 1274, 1277), (13245044497498710760, 3609, 3612)]
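
The comment in the pattern cell compares the token pattern to the regular expression `swimming\s*vigorously`. For comparison, here is a rough character-level analogue using Python's `re` module. The sample string is made up, with a line break inside the second phrase to mimic the source file, and `re.IGNORECASE` stands in for the `LOWER` attribute.

```python
import re

# Made-up sample text; the line break inside the second phrase mimics the source file
sample = "He kept swimming vigorously. Later he was Swimming\nvigorously again."

# Character-level analogue of the token pattern; IGNORECASE plays the role of LOWER
phrase = re.compile(r"swimming\s*vigorously", re.IGNORECASE)

spans = [(m.start(), m.end()) for m in phrase.finditer(sample)]
found = [m.group() for m in phrase.finditer(sample)]

print(spans)
print(found)
```

The analogy is only approximate. The regex reports character offsets, whereas the `Matcher` returns `(match_id, start, end)` tuples whose `start` and `end` are token indices into the `Doc`. The two matches above each span three tokens, because the line break between the words is a token of its own.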

**7. Print the text surrounding each found match**

In [31]:
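# match_id avoids shadowing the built-in id(); enumerate replaces the manual counter
for ctr, (match_id, start, end) in enumerate(found_matches, 1):
  print(f'Match #{ctr}')
  # max() keeps the left edge from going negative near the start of the doc
  print(doc[max(start-10, 0):end+10], '\n')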

Match #1
 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and 

Match #2
saw all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic 



**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [36]:
# Visualize the structure of doc.sents
for i in range(5):
  print(sents[i], sents[i].start, sents[i].end)

# Observation: sent.start and sent.end hold the token indices that bound each sentence in the doc (end is exclusive)

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.   0 36
The man's hands were behind
his back, the wrists bound with a cord.   36 54
A rope closely encircled his
neck.   54 63
It was attached to a stout cross-timber above his head and the
slack fell to the level of his knees.   63 88
Some loose boards laid upon the
ties supporting the rails of the railway supplied a footing for him
and his executioners--two private soldiers of the Federal army,
directed by a sergeant who in civil life may have been a deputy
sheriff.   88 138


In [37]:
ctr = 0
matches_list = found_matches.copy()
# Loop through the sentences and check whether the next match lies within each one.
# Assumes found_matches is in document order.
for sent in sents:
  # sent.end is exclusive, so a match ending exactly at sent.end is still inside
  if matches_list[0][1] >= sent.start and matches_list[0][2] <= sent.end:
    ctr += 1
    print(f"Match #{ctr}")
    print(sent, "\n")
    matches_list = matches_list[1:]  # remove the match just printed
    if not matches_list:  # no matches left, so stop the loop
      break

Match #1
By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.   

Match #2
The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.   
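
The containment check can also be expressed on plain integers. As a sketch, the example below takes the five sentence boundaries printed earlier (0, 36, 54, 63, 88, with the fifth sentence ending at 138) and uses binary search to find which sentence holds a given token span. The helper `sentence_index` and the two match spans are hypothetical and do not come from the story.

```python
from bisect import bisect_right

# Sentence start offsets from the printed output above (token indices)
sent_starts = [0, 36, 54, 63, 88]
doc_end = 138  # end of the fifth sentence, as printed above

def sentence_index(start, end):
    """Return the index of the sentence holding tokens [start, end), or None."""
    i = bisect_right(sent_starts, start) - 1
    if i < 0:
        return None
    # Each sentence ends where the next one starts; the last ends at doc_end
    sent_end = sent_starts[i + 1] if i + 1 < len(sent_starts) else doc_end
    return i if end <= sent_end else None

# Hypothetical match spans, not taken from the story
inside = sentence_index(40, 43)    # falls within the second sentence
crossing = sentence_index(52, 56)  # crosses a sentence boundary

print(inside, crossing)
```

Binary search pays off when there are many matches, since each lookup avoids a pass over all sentences. spaCy spans also expose a `.sent` attribute, so `doc[start:end].sent` may be a shorter route when sentence boundaries are set.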

