# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [17]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [18]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt', encoding='utf-8') as f:
    text = f.read()

# Build the Doc after the file has been read and closed:
doc = nlp(text)

In [19]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I.

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [20]:
print("Number of tokens in the document:", len(doc))

Number of tokens in the document: 4835


**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [21]:
sentences = list(doc.sents)
print("Number of sentences in the document:", len(sentences))

Number of sentences in the document: 205
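
The hint in question 3 matters because `doc.sents` is a generator, and a generator has no length and cannot be indexed. The sketch below uses a plain Python generator as a stand-in to show why `list()` is needed before `len()` or indexing. The `fake_sents` function and its sample strings are hypothetical and are not spaCy output.

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields one sentence at a time."""
    yield "AN OCCURRENCE AT OWL CREEK BRIDGE"
    yield "A man stood upon a railroad bridge."
    yield "The man's hands were behind his back."

# Calling len() directly on a generator raises TypeError.
try:
    len(fake_sents())
    generator_has_len = True
except TypeError:
    generator_has_len = False

# Materializing the generator into a list allows len() and indexing.
sentences_demo = list(fake_sents())

print(generator_has_len)
print(len(sentences_demo))
print(sentences_demo[1])
```

The same reasoning applies to question 4: indexing with `[1]` only works on the list, not on the generator itself.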


**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [26]:
second_sentence = sentences[1].text
print("The second sentence in the document is:", second_sentence)

The second sentence in the document is: A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [31]:
# NORMAL SOLUTION:
sentence = sentences[1]
print(f"{'TEXT':{15}} {'POS':{10}} {'DEP':{10}} {'LEMMA':{15}}")
for token in sentence:
    print(token.text, token.pos_, token.dep_, token.lemma_)

TEXT            POS        DEP        LEMMA          
A DET det a
man NOUN nsubj man
stood VERB ROOT stand
upon SCONJ prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj Alabama
, PUNCT punct ,
looking VERB advcl look
down ADV advmod down

 SPACE dep 

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE dep  


In [30]:
# CHALLENGE SOLUTION:
sentence = sentences[1]
print(f"{'TEXT':{15}} {'POS':{10}} {'DEP':{10}} {'LEMMA':{15}}")
for token in sentence:
    print(f"{token.text:{15}} {token.pos_:{10}} {token.dep_:{10}} {token.lemma_:{15}}")

TEXT            POS        DEP        LEMMA          
A               DET        det        a              
man             NOUN       nsubj      man            
stood           VERB       ROOT       stand          
upon            SCONJ      prep       upon           
a               DET        det        a              
railroad        NOUN       compound   railroad       
bridge          NOUN       pobj       bridge         
in              ADP        prep       in             
northern        ADJ        amod       northern       
Alabama         PROPN      pobj       Alabama        
,               PUNCT      punct      ,              
looking         VERB       advcl      look           
down            ADV        advmod     down           

               SPACE      dep        
              
into            ADP        prep       into           
the             DET        det        the            
swift           ADJ        amod       swift          
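water           NOUN       pobj       water          
twenty          NUM        nummod     twenty         
feet            NOUN       npadvmod   foot           
below           ADV        advmod     below          
.               PUNCT      punct      .              
                SPACE      dep                       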
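
The challenge solution relies on Python's format specification: `{value:{15}}` pads `value` to a width of 15, with the inner braces evaluated first. As a minimal sketch, the block below applies the same idea to a few hand-typed rows taken from the table above (nothing here calls spaCy), using an explicit `<` for left alignment. The names `rows` and `format_row` are illustrative, not part of the assessment.

```python
# Hand-typed sample rows: (text, POS, dep, lemma).
rows = [
    ("A", "DET", "det", "a"),
    ("man", "NOUN", "nsubj", "man"),
    ("stood", "VERB", "ROOT", "stand"),
]

def format_row(text, pos, dep, lemma, widths=(15, 10, 10, 15)):
    """Left-align each field to its column width, separated by one space."""
    w_text, w_pos, w_dep, w_lemma = widths
    return f"{text:<{w_text}} {pos:<{w_pos}} {dep:<{w_dep}} {lemma:<{w_lemma}}"

header = format_row("TEXT", "POS", "DEP", "LEMMA")
print(header)
for row in rows:
    print(format_row(*row))
```

Keeping the widths in one tuple means the header and the data rows cannot drift out of alignment when a column is resized.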

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [32]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [33]:
# Create a pattern and add it to matcher:
pattern = [
    {"LOWER": "swimming"},  # Lowercased token text is 'swimming'
    {"IS_SPACE": True, "OP": "?"},  # Optional whitespace token, e.g. a line break between the two words
    {"LOWER": "vigorously"}  # Lowercased token text is 'vigorously'
]

matcher.add("Swimming", [pattern])

In [40]:
matches = matcher(doc)
print("List of found matches:", matches)

List of found matches: [(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]


**7. Print the text surrounding each found match**

In [41]:
for match_id, start, end in matches:
    # Widen the match by five tokens on each side, clamped to the Doc boundaries:
    start_context = max(start - 5, 0)
    end_context = min(end + 5, len(doc))
    surrounding_text = doc[start_context:end_context].text

    print("Surrounding text for match:", surrounding_text)

Surrounding text for match: evade the bullets and, swimming
vigorously, reach the bank,
Surrounding text for match: shoulder; he was now swimming
vigorously with the current.  
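
The `max(..., 0)` guard in the cell above matters because a negative start index in a Python slice counts from the end of the sequence instead of stopping at the beginning. The sketch below shows this on a plain list of words, used as a stand-in for the Doc. The `context_window` helper, the sample `tokens` list and the match indices are hypothetical.

```python
def context_window(tokens, start, end, size=5):
    """Return tokens[start:end] widened by `size` items per side, clamped to bounds."""
    left = max(start - size, 0)
    right = min(end + size, len(tokens))
    return tokens[left:right]

tokens = "he was now swimming vigorously with the current".split()

# Hypothetical match covering "swimming vigorously" (indices 3 up to 5).
window = context_window(tokens, 3, 5, size=2)
print(" ".join(window))

# A match at the very start: clamping keeps the left edge at 0.
edge = context_window(tokens, 0, 1, size=2)
print(" ".join(edge))

# Without clamping, the start becomes -2 and the slice comes back empty.
unclamped = tokens[0 - 2:1 + 2]
print(unclamped)
```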


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [42]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    
    sentence = matched_span.sent
    print("The sentence containing the match:", sentence.text)

The sentence containing the match: By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  
The sentence containing the match: The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
