# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_md')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [5]:
# Enter your code here:
# Read the whole story into one string, then run it through the pipeline
with open('../TextFiles/owlcreek.txt') as f:
    txt = f.read()
doc = nlp(txt)

In [6]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [7]:
len(doc)

4835
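Note that `len(doc)` counts every token, including punctuation and whitespace tokens such as line breaks. The sketch below illustrates the difference on a made-up string. It assumes a blank English pipeline (`spacy.blank('en')`, tokenizer only) instead of `en_core_web_md`, and the names `nlp_blank`, `sample`, and `word_tokens` are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to
# illustrate token counting; the assessment itself uses en_core_web_md.
nlp_blank = spacy.blank('en')

# Hypothetical sample text, not taken from owlcreek.txt
sample = nlp_blank("Owl Creek\n\nBridge.")

# Every token, including the whitespace run and the final period
all_tokens = [t.text for t in sample]

# Only tokens that are neither whitespace nor punctuation
word_tokens = [t.text for t in sample if not (t.is_space or t.is_punct)]

print(len(sample), all_tokens)
print(len(word_tokens), word_tokens)
```

The same filter could be applied to `doc` if a word count, as opposed to a token count, is wanted.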

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [8]:
len(list(doc.sents))

203
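The hint matters because `doc.sents` is a generator: it can be iterated, but it has no length and cannot be indexed. A minimal sketch of this, assuming a blank pipeline with the rule-based `sentencizer` component as a lightweight stand-in for the parser in `en_core_web_md` (the sample text and variable names are made up):

```python
import spacy

# Assumption: a blank pipeline with the rule-based "sentencizer" stands in
# for the en_core_web_md parser used in the assessment.
nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')

sample = nlp_blank("This is one sentence. This is another.")

try:
    len(sample.sents)                  # a generator has no len()
except TypeError as err:
    print('TypeError:', err)

# Materialize the sentences once, then count and index the list freely
sample_sentences = list(sample.sents)
print(len(sample_sentences))
print(sample_sentences[1].text)
```

Building the list once and reusing it also avoids re-iterating the generator for every later question.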

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [15]:
sentences = list(doc.sents)
sentences[1]

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [31]:
# NORMAL SOLUTION:

for token in sentences[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

I PRON nsubj I


 SPACE dep 


A DET det a
man NOUN appos man
stood VERB ROOT stand
upon SCONJ prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj Alabama
, PUNCT punct ,
looking VERB advcl look
down ADP advmod down

 SPACE dep 

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE dep  


In [36]:
# CHALLENGE SOLUTION:
for token in sentences[1]:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.dep_:{10}} {token.lemma_}')

I            PRON   nsubj      I


           SPACE  dep        


A            DET    det        a
man          NOUN   appos      man
stood        VERB   ROOT       stand
upon         SCONJ  prep       upon
a            DET    det        a
railroad     NOUN   compound   railroad
bridge       NOUN   pobj       bridge
in           ADP    prep       in
northern     ADJ    amod       northern
Alabama      PROPN  pobj       Alabama
,            PUNCT  punct      ,
looking      VERB   advcl      look
down         ADP    advmod     down

            SPACE  dep        

into         ADP    prep       into
the          DET    det        the
swift        ADJ    amod       swift
water        NOUN   pobj       water
twenty       NUM    nummod     twenty
feet         NOUN   npadvmod   foot
below        ADV    advmod     below
.            PUNCT  punct      .
             SPACE  dep         
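In the challenge solution, the nested braces in `{token.text:{12}}` are a replacement field that supplies the column width, so the result is the same as writing `{token.text:12}`. The plain-Python sketch below shows this, and also one possible way to keep whitespace tokens from breaking the columns by printing their `repr`. The `rows` list and the `format_row` helper are hypothetical and only mimic a few rows of the output above.

```python
# Hypothetical rows mimicking the output above: (text, pos, dep, lemma)
rows = [
    ('A', 'DET', 'det', 'a'),
    ('railroad', 'NOUN', 'compound', 'railroad'),
    ('\n', 'SPACE', 'dep', '\n'),
]

def format_row(text, pos, dep, lemma):
    # !r applies repr() before padding, so a newline token is shown as an
    # escape sequence instead of splitting the row across lines.
    return f'{text!r:12} {pos:6} {dep:10} {lemma!r}'

# A literal width and a nested replacement field give identical padding
print(f"{'man':12}|" == f"{'man':{12}}|")

for row in rows:
    print(format_row(*row))
```

The nested form becomes useful when the width is computed at run time, for example from the longest token in the sentence.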


**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [72]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [73]:
# Create a pattern and add it to matcher:

patterns = [
    [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
]

matcher.add('Swimming', patterns)

In [74]:
# Create a list of matches called "found_matches" and print the list:


found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
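Each match is a `(match_id, start, end)` tuple: `match_id` is the hash of the rule name, and `start` and `end` are token indices with `end` exclusive. The three-token width of each span reflects the whitespace token between the two words. A small sketch of decoding such tuples, assuming a blank English pipeline and a made-up sentence (`nlp_blank`, `demo_matcher`, `sample`, and `decoded` are illustrative names):

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline and a made-up sentence are enough to
# show the shape of the match tuples.
nlp_blank = spacy.blank('en')
demo_matcher = Matcher(nlp_blank.vocab)
demo_matcher.add('Swimming', [[{'LOWER': 'swimming'},
                               {'IS_SPACE': True},
                               {'LOWER': 'vigorously'}]])

sample = nlp_blank("He was now swimming\nvigorously with the current.")

decoded = []
for match_id, start, end in demo_matcher(sample):
    rule_name = nlp_blank.vocab.strings[match_id]   # hash back to rule name
    span = sample[start:end]                        # end index is exclusive
    decoded.append((rule_name, start, end, span.text))

print(decoded)
```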


**7. Print the text surrounding each found match**

In [88]:
match1, start1, end1 = found_matches[0]
print(doc[start1-9:end1+13])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


In [93]:
match2, start2, end2 = found_matches[1]
print(doc[start2-7 : end2+4])

over his shoulder; he was now swimming
vigorously with the current.
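The offsets in the two cells above were chosen by hand for each match. A reusable alternative is sketched below as a hypothetical helper, `context_window`, which clamps the slice to the bounds of the sequence so that a match near the start cannot produce a negative (wrap-around) index. It is demonstrated on a plain list of strings; since a `Doc` supports `len()` and slicing, the same function should work on `doc` as well.

```python
def context_window(tokens, start, end, before=5, after=5):
    """Return tokens[start-before : end+after], clamped to the sequence bounds."""
    lo = max(0, start - before)          # avoid negative (wrap-around) indices
    hi = min(len(tokens), end + after)   # do not run past the last token
    return tokens[lo:hi]

# Hypothetical stand-in for a Doc: a plain list of token strings
words = ['he', 'was', 'now', 'swimming', '\n', 'vigorously',
         'with', 'the', 'current', '.']

print(context_window(words, 3, 6, before=2, after=3))
print(context_window(words, 3, 6, before=10, after=10))   # clamped to the whole list
```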


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [103]:
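# Both matches have the same text, so comparing each sentence against the
# text of the first match is enough to find both sentences.
matched_sentences = []
for sent in sentences:
    if doc[start1 : end1].text in sent.text:
        matched_sentences.append(sent)

matched_sentences[0]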

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  

In [104]:
matched_sentences[1]

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
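An alternative that does not rely on comparing text is to ask each matched span for its sentence through the `Span.sent` attribute. The sketch below is one way this could look, assuming a blank English pipeline with the rule-based `sentencizer` and a made-up text in place of the story (`nlp_blank`, `demo_matcher`, `sample`, and `containing` are illustrative names):

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline with the rule-based "sentencizer" provides
# the sentence boundaries that Span.sent needs.
nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')

demo_matcher = Matcher(nlp_blank.vocab)
demo_matcher.add('Swimming', [[{'LOWER': 'swimming'}, {'LOWER': 'vigorously'}]])

# Hypothetical sample text, not taken from owlcreek.txt
sample = nlp_blank("He dived. He was swimming vigorously. He reached the bank.")

# For each match, slice the span and look up the sentence that contains it
containing = [sample[start:end].sent.text
              for _, start, end in demo_matcher(sample)]
print(containing)
```

Because the lookup goes through token indices, it would still pick the right sentence if the same phrase appeared elsewhere with different surrounding text.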