## Tokenization with spaCy
#### Tokenizing a large text and using the spaCy API to find additional details about it

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

#### Creating a Doc object from the file owlcreek.txt

In [2]:
with open("../TextFiles/owlcreek.txt") as f:
    doc = nlp(f.read())

In [6]:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

#### Counting the number of tokens in the file

In [7]:
len(doc)

4835

#### Counting the number of sentences in the file

In [10]:
len(list(doc.sents))  # doc.sents is a generator of sentence spans, not tokens

249

#### Printing the sentence at index 2 in the document

In [3]:
sentences = list(doc.sents)  # doc.sents is a generator, so materialize it before indexing
sentences[2]

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

In [4]:
second = sentences[2]

#### Printing each token's `text`, `POS` tag, `dep` tag and `lemma` in the sentence above

In [17]:
for token in second:
    print(f"{token.text:{15}} {token.pos_:{10}} {token.dep_:{10}} {token.lemma_:{10}}")


A               DET        det        a         
man             NOUN       nsubj      man       
stood           VERB       ROOT       stand     
upon            SCONJ      prep       upon      
a               DET        det        a         
railroad        NOUN       compound   railroad  
bridge          NOUN       pobj       bridge    
in              ADP        prep       in        
northern        ADJ        amod       northern  
Alabama         PROPN      pobj       Alabama   
,               PUNCT      punct      ,         
looking         VERB       advcl      look      
down            ADV        prt        down      

               SPACE                 
         
into            ADP        prep       into      
the             DET        det        the       
swift           ADJ        amod       swift     
water           NOUN       pobj       water     
twenty          NUM        nummod     twenty    
feet            NOUN       npadvmod   foot      
below           ADV 
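
The `{token.text:{15}}` form used in the loop above is a nested f-string field, where the inner braces supply the column width. A minimal standalone sketch follows; the `rows` list and the `format_row` helper are illustrative assumptions that stand in for real spaCy tokens.

```python
# Hypothetical rows standing in for (text, pos_, dep_, lemma_) of spaCy tokens.
rows = [
    ("A", "DET", "det", "a"),
    ("man", "NOUN", "nsubj", "man"),
    ("stood", "VERB", "ROOT", "stand"),
]

def format_row(text, pos, dep, lemma, widths=(15, 10, 10, 10)):
    """Left-align each field in a fixed-width column, like the loop above."""
    w_text, w_pos, w_dep, w_lemma = widths
    return f"{text:{w_text}} {pos:{w_pos}} {dep:{w_dep}} {lemma:{w_lemma}}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Strings are padded but never truncated, so a value longer than its width pushes the later columns to the right. A whitespace token such as the line break after `down` is printed literally and splits its row; writing `{token.text!r:{15}}` would show it as `'\n'` instead.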

#### Writing a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text

In [7]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [5]:
# In the file the two words are separated by a line break, which spaCy keeps as a whitespace token
pattern1 = [{"LOWER": "swimming"}, {"IS_SPACE": True}, {"LOWER": "vigorously"}]

In [8]:
# spaCy v3 signature; the original spaCy v2 call was matcher.add("Swimming", None, pattern1)
matcher.add("Swimming", [pattern1])
found_matches = matcher(doc)
found_matches

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
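
Each result is a `(match_id, start, end)` tuple: the hash of the rule name plus the token offsets of the matched span. The sketch below shows one way to turn such a tuple back into readable values. It assumes a blank English pipeline (tokenizer only) and a made-up one-line text instead of the full story, so the offsets differ from those above.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
# Made-up sample text with the same line break between the two words.
doc = nlp("he was now swimming\nvigorously with the current")

matcher = Matcher(nlp.vocab)
# The line break is its own whitespace token, hence the IS_SPACE slot.
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True}, {"LOWER": "vigorously"}]
matcher.add("Swimming", [pattern])

results = []
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    results.append((name, start, end, doc[start:end].text))
    print(name, start, end, repr(doc[start:end].text))
```

Without the `IS_SPACE` entry the pattern would only match when the two words are adjacent tokens, which is not the case in this file.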

#### Printing the text surrounding each found match

In [27]:
doc[1264:1290]

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home

In [28]:
doc[3600:3620]

all this over his shoulder; he was now swimming
vigorously with the current.  His brain was
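
The slices above use hand-picked offsets around each match. A small helper can derive the window from the match offsets instead. This is a sketch only: `context` and its `window` parameter are hypothetical names, and the demo text is made up so the block runs without the story file.

```python
import spacy

def context(doc, start, end, window=10):
    """Return the span covering `window` tokens on each side of doc[start:end]."""
    left = max(start - window, 0)          # clamp so the slice never goes negative
    right = min(end + window, len(doc))    # clamp at the end of the document
    return doc[left:right]

# Blank English pipeline (tokenizer only) and a made-up sample text.
nlp = spacy.blank("en")
demo = nlp("he was now swimming\nvigorously with the current")

# Tokens 3..5 are "swimming", the line break, and "vigorously".
snippet = context(demo, 3, 6, window=2)
print(snippet.text)
```

With the notebook's own objects, the same helper could be called as `context(doc, start, end)` for each `(match_id, start, end)` in `found_matches`.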

#### Printing the sentence that contains each found match

In [17]:
for match_id, start, end in found_matches:
    # Span.sent is the sentence that contains the matched span
    print(doc[start:end].sent)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  
The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
