# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [4]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [6]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())


In [7]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  
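
The hint's `open()` call relies on the platform's default text encoding. Because the Project Gutenberg link above serves a UTF-8 file, one cautious variant is to pass `encoding='utf-8'` explicitly. The sketch below is only an illustration: it uses a temporary file with made-up content as a stand-in for `owlcreek.txt`, and `read_text_utf8` is a hypothetical helper, not part of the assessment.

```python
import tempfile
from pathlib import Path


def read_text_utf8(path):
    """Read a whole text file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Stand-in for '../TextFiles/owlcreek.txt' (made-up sample content).
# \u00e9 is a non-ASCII character, so the encoding actually matters here.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Caf\u00e9 by the bridge.\n", encoding="utf-8")
    text = read_text_utf8(sample)

print(repr(text))
```

The returned string could then be passed to `nlp(...)` exactly as in the solution cell.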

**2. How many tokens are contained in the file?**

In [8]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [16]:
doc_list = [sent for sent in doc.sents]
# for sent in doc_list:
#     print(sent)
len(doc_list)


204
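
The hint about building a list exists because `doc.sents` is a generator, and generators support neither `len()` nor indexing. A minimal plain-Python sketch of the same situation, where `fake_sents` is a hypothetical stand-in for `doc.sents`:

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items lazily."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


try:
    len(fake_sents())           # generators do not support len()
    supports_len = True
except TypeError:
    supports_len = False

sentences = list(fake_sents())  # materialize once, then reuse
count = len(sentences)
second = sentences[1]           # indexing also needs a list

# Counting without keeping the items in memory:
count_streaming = sum(1 for _ in fake_sents())

print(supports_len, count, second, count_streaming)
```

Building the list once is the better fit here, since the later questions index into it.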

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [17]:
doc_list[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
**CHALLENGE: Have values line up in columns in the print output.**

In [18]:
# NORMAL SOLUTION:
# Iterate over the second sentence only, not the whole doc:
for token in doc_list[1]:
    print(token.text + "\t" + token.pos_ + "\t" + token.dep_ + "\t" + token.lemma_)


The	DET	det	the
man	NOUN	poss	man
's	PART	case	's
hands	NOUN	nsubj	hand
were	AUX	ROOT	be
behind	ADP	prep	behind

	SPACE	dep	

his	PRON	poss	his
back	NOUN	pobj	back
,	PUNCT	punct	,
the	DET	det	the
wrists	NOUN	appos	wrist
bound	VERB	acl	bind
with	ADP	prep	with
a	

In [26]:
# CHALLENGE SOLUTION:
def neat_align(text):
    for token in text:
        print(f" {token.text:{12}} {token.pos_:{6}} {token.dep_:<{22}} {token.lemma_} ")


In [27]:
neat_align(doc_list[1])

 The          DET    det                    the 
 man          NOUN   poss                   man 
 's           PART   case                   's 
 hands        NOUN   nsubj                  hand 
 were         AUX    ROOT                   be 
 behind       ADP    prep                   behind 
 
            SPACE  dep                    
 
 his          PRON   poss                   his 
 back         NOUN   pobj                   back 
 ,            PUNCT  punct                  , 
 the          DET    det                    the 
 wrists       NOUN   appos                  wrist 
 bound        VERB   acl                    bind 
 with         ADP    prep                   with 
 a
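
The challenge solution relies on nested replacement fields in f-strings: `{value:{width}}` pads `value` to `width` characters, and strings are left-aligned by default. A small sketch with hand-written rows (taken from the tagged output earlier in this section) in place of spaCy tokens; `format_row` and its `widths` parameter are hypothetical names used only for illustration:

```python
def format_row(text, pos, dep, lemma, widths=(12, 6, 22)):
    """Pad the first three fields to fixed widths; the last field is left as-is."""
    w_text, w_pos, w_dep = widths
    return f"{text:{w_text}} {pos:{w_pos}} {dep:{w_dep}} {lemma}"


rows = [
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
    ("behind", "ADP", "prep", "behind"),
]

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

A width only sets a minimum: a token longer than its column is not truncated, it simply pushes the later columns to the right. Whitespace tokens such as line breaks also print as literal newlines, which is why the `SPACE` rows in the output look broken.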

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [36]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [45]:
# Create a pattern and add it to matcher:
# 'OP': '*' lets the whitespace token (e.g. a line break) appear zero or more times.
pattern = [[{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]]

matcher.add('Swimming', pattern)

In [46]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
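
The optional `IS_SPACE` token is needed because the source text is hard-wrapped, so a line break can fall between the two words and spaCy keeps it as a separate whitespace token. As a rough cross-check that does not use spaCy at all, a regular expression with `\s+` tolerates the same line breaks. This is a character-level analogue, not the `Matcher`, and the sample string is made up from phrases in the story:

```python
import re

# Made-up sample: one occurrence split by a newline, one by a plain space.
sample = (
    "By diving I could evade the bullets and, swimming\n"
    "vigorously, reach the bank. He was now swimming vigorously with the current."
)

# \s+ accepts any run of whitespace (spaces or newlines) between the words.
phrase = re.compile(r"swimming\s+vigorously", flags=re.IGNORECASE)

spans = [m.span() for m in phrase.finditer(sample)]
matched = [sample[start:end] for start, end in spans]
print(spans)  # character offsets, not token indices
```

Note the difference in units: the regex reports character offsets, while the `Matcher` results above are token indices into `doc`.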


**7. Print the text surrounding each found match**

In [54]:
print(doc[1265:1290])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


In [55]:
doc[1277+15: 1277+30]

My
home, thank God, is as yet outside their lines; my

In [56]:
print(doc[3600:3615])

all this over his shoulder; he was now swimming
vigorously with the current
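
Rather than hard-coding slice bounds such as `1265:1290`, the window can be derived from each `(match_id, start, end)` tuple. A minimal sketch, in which `context_bounds` is a hypothetical helper and the default window of 9 tokens is an arbitrary choice; the document length and match tuples are the values printed earlier:

```python
def context_bounds(start, end, doc_len, window=9):
    """Return slice bounds `window` tokens either side of a match, clamped to the doc."""
    return max(0, start - window), min(doc_len, end + window)


doc_len = 4835  # len(doc), as reported earlier
found = [
    (12881893835109366681, 1274, 1277),
    (12881893835109366681, 3609, 3612),
]

bounds = [context_bounds(start, end, doc_len) for _, start, end in found]
print(bounds)
# With a real Doc, doc[lo:hi] for each (lo, hi) pair would give the surrounding text.
```

Clamping matters only for matches near the start or end of the document, but it keeps the helper safe to reuse.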


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [61]:
for sent in doc.sents:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [57]:
found_matches[0][1]

1274

In [63]:
for sent in doc.sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break



The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
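
The loops above pick the first sentence whose `end` lies beyond the match start, which works because sentences are contiguous and in order. The same test can be written once as a function. In this sketch the `(start, end)` pairs are made-up token offsets standing in for `sent.start` and `sent.end`, and `sentence_index_for` is a hypothetical helper:

```python
from bisect import bisect_right


def sentence_index_for(token_index, sentence_spans):
    """Return the index of the sentence whose [start, end) range holds token_index."""
    ends = [end for _, end in sentence_spans]
    # First sentence whose end is strictly greater than token_index.
    i = bisect_right(ends, token_index)
    if i == len(sentence_spans):
        raise IndexError("token index lies beyond the last sentence")
    return i


# Made-up contiguous sentence boundaries (token offsets, end exclusive).
spans = [(0, 10), (10, 25), (25, 40)]

first = sentence_index_for(0, spans)
boundary = sentence_index_for(10, spans)
last = sentence_index_for(39, spans)
print(first, boundary, last)
```

spaCy also exposes a `sent` attribute on spans, so `doc[start:end].sent` should return the containing sentence directly once sentence boundaries have been set.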


### Great Job!