# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [4]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [7]:
# ENTER CODE HERE

with open("owlcreek.txt") as f:
    doc = nlp(f.read())

In [8]:
# Run this cell to verify it worked

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [9]:
len(doc)

4833

**3. How many sentences are contained in the file?**<br>
**HINT:** You'll want to build a list first!

In [10]:
doc_sentences = [sent for sent in doc.sents]

In [11]:
len(doc_sentences)

222

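**4. Print the second sentence in the document.**<br>
**HINT:** Indexing starts at zero, and the title counts as the first sentence. How the title block is split can vary between spaCy versions; in the run below the target sentence sits at index 2.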

In [17]:
print(doc_sentences[2].text)

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


**5. For each token in the sentence above, print its `text`, `POS` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [20]:
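# NORMAL SOLUTION:
# NOTE: token.lemma (no trailing underscore) is the lemma's integer hash, which is what
# the output below shows; token.lemma_ gives the readable lemma string.

for token in doc_sentences[2]:
    print(f"{token.text} {token.pos_} {token.dep_} {token.lemma}")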

A DET det 11901859001352538922
man NOUN nsubj 3104811030673030468
stood VERB ROOT 16121235759125543490
upon ADP prep 12776617025319584140
a DET det 11901859001352538922
railroad NOUN compound 11929338562591612190
bridge NOUN pobj 10505406131357236919
in ADP prep 3002984154512732771
northern ADJ amod 14402328224860809449
Alabama PROPN pobj 1974316624015891830
, PUNCT punct 2593208677638477497
looking VERB advcl 16096726548953279178
down ADV advmod 6421409113692203669

 SPACE  962983613142996970
into ADP prep 3278561384161438710
the DET det 7425985699627899538
swift ADJ amod 9502497712543975804
water NOUN pobj 7248544922998488549
twenty NUM nummod 8304598090389628520
feet NOUN npadvmod 779410287755165804
below ADV advmod 13516515296229086732
. PUNCT punct 12646065887601541794
  SPACE  8532415787641010193


In [21]:
# CHALLENGE SOLUTION:

for token in doc_sentences[2]:
    # The nested {15}, {5}, {10} fields set minimum column widths so the values line up
    print(f"{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}} {token.lemma:{15}}")

A               DET   det        11901859001352538922
man             NOUN  nsubj      3104811030673030468
stood           VERB  ROOT       16121235759125543490
upon            ADP   prep       12776617025319584140
a               DET   det        11901859001352538922
railroad        NOUN  compound   11929338562591612190
bridge          NOUN  pobj       10505406131357236919
in              ADP   prep       3002984154512732771
northern        ADJ   amod       14402328224860809449
Alabama         PROPN pobj       1974316624015891830
,               PUNCT punct      2593208677638477497
looking         VERB  advcl      16096726548953279178
down            ADV   advmod     6421409113692203669

               SPACE            962983613142996970
into            ADP   prep       3278561384161438710
the             DET   det        7425985699627899538
swift           ADJ   amod       9502497712543975804
water           NOUN  pobj       7248544922998488549
twenty          NUM   nummod     830459
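
The column alignment above comes from Python's nested format fields rather than from anything in spaCy. A minimal sketch of the idea, assuming a hypothetical `rows` list that stands in for the token attributes (the three rows are copied from the output above) and a hypothetical helper `format_row`:

```python
# Rows copied from the output above: (text, pos_, dep_) for three tokens.
# `rows` and `format_row` are illustrative names, not part of the notebook.
rows = [
    ("A", "DET", "det"),
    ("railroad", "NOUN", "compound"),
    ("Alabama", "PROPN", "pobj"),
]

def format_row(text, pos, dep):
    # {value:{width}} pads a string with trailing spaces up to `width` characters.
    return f"{text:{15}} {pos:{5}} {dep:{10}}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Strings are left-aligned under this spec, while integers are right-aligned by default. A value longer than the requested width is printed in full with no padding, which is why the long hash values in the last column are not padded.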

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text.** <br>
**HINT:** You should include an `'IS_SPACE': True` pattern between the two words!

In [23]:
# Import the Matcher Library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [24]:
# Create a pattern and add it to matcher:

pattern = [{'LOWER':'swimming'},{'IS_SPACE':True, 'OP':'*'}, {'LOWER':'vigorously'}]

In [25]:
# spaCy v2 signature; spaCy v3 expects matcher.add('Swimming', [pattern])
matcher.add('Swimming', None, pattern)

In [26]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)

In [27]:
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]

**7. Print the text surrounding each found match.**

In [29]:
def surrounding(doc, start, end):
    # Clamp the left edge so a match near the start cannot produce a negative index
    print(doc[max(0, start-5):end+5])

In [30]:
surrounding(doc, 1274, 1277)

evade the bullets and, swimming
vigorously, reach the bank,


In [31]:
surrounding(doc, 3607, 3610)

shoulder; he was now swimming
vigorously with the current.  
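
The two calls above hard-code the offsets from the printed match list. Each match is a `(match_id, start, end)` tuple, so one loop can handle every match. The sketch below shows this on a plain Python list standing in for the `Doc`; the names `tokens`, `toy_matches` and `window` are hypothetical, and the offsets apply only to the toy list.

```python
# Stand-in for a spaCy Doc: a plain list of token strings (hypothetical toy data).
tokens = ["he", "was", "now", "swimming", "\n", "vigorously",
          "with", "the", "current", "."]

# Same shape as the matcher output: (match_id, start, end).
# These offsets fit the toy list only, not the real document.
toy_matches = [(12881893835109366681, 3, 6)]

def window(seq, start, end, pad=5):
    # Clamp the left edge: a negative slice start would count from the end instead.
    return seq[max(0, start - pad):end + pad]

for match_id, start, end in toy_matches:
    print(window(tokens, start, end))
```

In the notebook itself, the same loop would presumably iterate over `found_matches` and slice `doc` rather than a list.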


**EXTRA CREDIT:<br>
Print the sentence that contains each found match**

In [33]:
for sentence in doc_sentences:
    if found_matches[0][1] < sentence.end:
        print(sentence)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [34]:
for sentence in doc_sentences:
    if found_matches[1][1] < sentence.end:
        print(sentence)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
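
The two cells above repeat the same search with a different match index. A sketch of one way to generalize it, assuming each sentence is reduced to its `(start, end)` token offsets (in the notebook these would come from `sentence.start` and `sentence.end`); the boundaries, match positions and the helper `containing_sentence` are hypothetical.

```python
# Hypothetical (start, end) token offsets for three consecutive sentences.
sentence_bounds = [(0, 10), (10, 25), (25, 40)]

# Hypothetical match start offsets, standing in for found_matches[i][1].
match_starts = [12, 31]

def containing_sentence(bounds, position):
    # Return the index of the sentence whose half-open range [start, end)
    # contains `position`, or None if no sentence does.
    for index, (start, end) in enumerate(bounds):
        if start <= position < end:
            return index
    return None

for position in match_starts:
    print(position, containing_sentence(sentence_bounds, position))
```

Checking both boundaries means the result does not depend on the sentences being visited in document order, which the `break`-based loops above rely on.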
