# 1. Create a Doc object from the file owlcreek.txt

In [2]:
import spacy

In [3]:
# loading the spacy language model
nlp = spacy.load('en_core_web_sm')

In [4]:
with open("owlcreek.txt", "r") as file:
    text = file.read()

# owlcreek.txt was uploaded to the Jupyter working directory, so the bare filename is enough.
# Alternatively, pass the full path of the file to open() instead of uploading it.
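
As a sketch of the file-path alternative mentioned in the comments above, `pathlib` can read the file from wherever it lives. The helper name `read_text_file` and the temporary demo file are assumptions made only so the snippet runs on its own; in the notebook the path would point at owlcreek.txt. The explicit `encoding="utf-8"` is likewise an assumption about how the file was saved.

```python
from pathlib import Path
import tempfile


def read_text_file(path):
    """Read a whole text file into a string (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8")


# Demo with a temporary file so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text(
        "A man stood upon a railroad bridge\nin northern Alabama.",
        encoding="utf-8",
    )
    text_demo = read_text_file(demo_path)

# repr() makes the line break visible as \n
print(repr(text_demo))
```
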

In [5]:
# Create a Doc object from the file's text
doc = nlp(text)
doc


AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  The man's hands were behind
his back, the wrists bound with a cord.  A rope closely encircled his
neck.  It was attached to a stout cross-timber above his head and the
slack fell to the level of his knees.  Some loose boards laid upon the
ties supporting the rails of the railway supplied a footing for him
and his executioners--two private soldiers of the Federal army,
directed by a sergeant who in civil life may have been a deputy
sheriff.  At a short remove upon the same temporary platform was an
officer in the uniform of his rank, armed.  He was a captain.  A
sentinel at each end of the bridge stood with his rifle in the
position known as "support," that is to say, vertical in front of the
left shoulder, the hammer resting on the forearm thrown straight
across the chest--a formal and unnatural position, enforcing an ere

# 2. How many tokens are contained in the file?

In [7]:
# Print every token in the Doc (line breaks and extra spaces appear as tokens of their own)
for t in doc:
    print(t)

AN
OCCURRENCE
AT
OWL
CREEK
BRIDGE



by
Ambrose
Bierce



I



A
man
stood
upon
a
railroad
bridge
in
northern
Alabama
,
looking
down


into
the
swift
water
twenty
feet
below
.
 
The
man
's
hands
were
behind


his
back
,
the
wrists
bound
with
a
cord
.
 
A
rope
closely
encircled
his


neck
.
 
It
was
attached
to
a
stout
cross
-
timber
above
his
head
and
the


slack
fell
to
the
level
of
his
knees
.
 
Some
loose
boards
laid
upon
the


ties
supporting
the
rails
of
the
railway
supplied
a
footing
for
him


and
his
executioners
--
two
private
soldiers
of
the
Federal
army
,


directed
by
a
sergeant
who
in
civil
life
may
have
been
a
deputy


sheriff
.
 
At
a
short
remove
upon
the
same
temporary
platform
was
an


officer
in
the
uniform
of
his
rank
,
armed
.
 
He
was
a
captain
.
 
A


sentinel
at
each
end
of
the
bridge
stood
with
his
rifle
in
the


position
known
as
"
support
,
"
that
is
to
say
,
vertical
in
front
of
the


left
shoulder
,
the
hammer
resting
on
the
forearm
thrown
straight


across


In [8]:
# Number of tokens in the Doc
print(f"len of tokens: {len(doc)}")

len of tokens: 4835


# 3. How many sentences are contained in the file?

In [10]:
sent = list(doc.sents)
print(f"len of sentences: {len(sent)}")

# Note: doc.sents is a generator, so it has no length and cannot be indexed. Converting it with list(doc.sents) makes it possible to count the sentences and to look them up by position.

len of sentences: 204
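
The generator behaviour can be tried on a small sample. This is a minimal sketch that assumes a blank English pipeline with spaCy's rule-based `sentencizer` instead of `en_core_web_sm`, purely so that it runs without a model download; the boundaries it finds on real text can differ from the model's. The names `demo_nlp`, `demo_doc` and `demo_sents` are illustrative.

```python
import spacy

# Assumption: blank pipeline + rule-based sentencizer (no trained model needed)
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

demo_doc = demo_nlp(
    "The man's hands were behind his back. "
    "A rope closely encircled his neck. "
    "He was a captain."
)

# doc.sents yields Span objects lazily; list() materializes them
demo_sents = list(demo_doc.sents)

print(len(demo_sents))     # number of sentences found
print(demo_sents[1].text)  # second sentence, looked up by index
```
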


# Questions 2 & 3 using NLTK

In [12]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [13]:
# Word Tokenize
#<-------------------->
word1 = word_tokenize(text)
print(f"len of tokens: {len(word1)}")

len of tokens: 4357


In [14]:
# Sentence Tokenize
#<--------------------->

sent1 = sent_tokenize(text)
print(f"len of sents: {len(sent1)}")

len of sents: 219


In [15]:
# Note: NLTK does not need a Doc object; its tokenizers take the raw text string directly.

# Difference between spaCy and NLTK tokenization

In [17]:
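# The token counts differ (4835 for spaCy, 4357 for NLTK) in large part because spaCy keeps
# whitespace in the Doc: line breaks and the extra space after a period become tokens of their own
# (see the blank lines in the token listing above), whereas NLTK's word_tokenize discards whitespace.
# The two tokenizers also follow different rules for splitting punctuation and contractions.

# The sentence counts differ as well (204 for spaCy, 219 for NLTK): spaCy's en_core_web_sm derives
# sentence boundaries from its dependency parse, while NLTK's sent_tokenize uses the Punkt tokenizer.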
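
To check how much of a spaCy token count comes from whitespace, the whitespace tokens can be counted separately with `token.is_space`. The sketch below assumes a blank English pipeline (tokenizer only) and a short excerpt of the opening lines; the `demo_` names are illustrative.

```python
import spacy

# Assumption: tokenizer-only pipeline, which needs no model download
demo_nlp = spacy.blank("en")

# A line break after "down" and a double space after the period, as in the file
sample = "looking down\ninto the swift water twenty feet below.  The man's hands"
demo_doc = demo_nlp(sample)

space_tokens = [t for t in demo_doc if t.is_space]
word_tokens = [t for t in demo_doc if not t.is_space]

print(f"all tokens:        {len(demo_doc)}")
print(f"whitespace tokens: {len(space_tokens)}")
print(f"other tokens:      {len(word_tokens)}")
```

Running the same two list comprehensions on the full `doc` would show how many of its tokens are whitespace.
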

# Important Notes:

In [19]:
# 1) spaCy works on a Doc object, so the text must first be passed through the nlp pipeline, which returns the Doc. For example:
# doc = nlp(text)

# 2) NLTK needs no such object. The text string can be passed directly to the desired function. For example:
# tokens = word_tokenize(text)

# 3) To use an existing spaCy Doc with NLTK, first convert it back to a string with str(doc) (or doc.text), then pass that string to NLTK:
# nltk_tokens = word_tokenize(str(doc))
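
Note 3 relies on the conversion back to a string being lossless. A minimal sketch of that round trip, assuming a blank English pipeline and an illustrative excerpt (the NLTK call itself is left out so the snippet needs no NLTK data):

```python
import spacy

# Assumption: tokenizer-only pipeline, enough to demonstrate the round trip
demo_nlp = spacy.blank("en")

raw = "his back, the wrists bound\nwith a cord.  A rope"
demo_doc = demo_nlp(raw)

# str(doc) and doc.text both give back the original text, whitespace included
as_string = str(demo_doc)

print(as_string == raw)
print(demo_doc.text == raw)
print(type(demo_doc).__name__, "->", type(as_string).__name__)
```
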

# 4. Print the second sentence in the document

In [21]:
sent[1]

The man's hands were behind
his back, the wrists bound with a cord.  

# 5. For each token in the sentence above, print its text, POS tag, dep tag and lemma.


In [23]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# sent is the list of sentence spans built in question 3; take the second one as a string
sentence = str(sent[1])

# Parse the sentence on its own as a separate Doc
# (a new name keeps the full-document doc from being overwritten)
sentence_doc = nlp(sentence)

# For each token in the sentence, print its text, POS tag, dependency tag, and lemma
for token in sentence_doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Dep: {token.dep_}, Lemma: {token.lemma_}")


Token: The, POS: DET, Dep: det, Lemma: the
Token: man, POS: NOUN, Dep: poss, Lemma: man
Token: 's, POS: PART, Dep: case, Lemma: 's
Token: hands, POS: NOUN, Dep: nsubj, Lemma: hand
Token: were, POS: AUX, Dep: ROOT, Lemma: be
Token: behind, POS: ADP, Dep: prep, Lemma: behind
Token: 
, POS: SPACE, Dep: dep, Lemma: 

Token: his, POS: PRON, Dep: poss, Lemma: his
Token: back, POS: NOUN, Dep: pobj, Lemma: back
Token: ,, POS: PUNCT, Dep: punct, Lemma: ,
Token: the, POS: DET, Dep: det, Lemma: the
Token: wrists, POS: NOUN, Dep: appos, Lemma: wrist
Token: bound, POS: VERB, Dep: acl, Lemma: bind
Token: with, POS: ADP, Dep: prep, Lemma: with
Token: a, POS: DET, Dep: det, Lemma: a
Token: cord, POS: NOUN, Dep: pobj, Lemma: cord
Token: ., POS: PUNCT, Dep: punct, Lemma: .
Token:  , POS: SPACE, Dep: dep, Lemma:  
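
In the output above, the whitespace tokens break the printed rows because the token text is a literal line break or space. One way around that, sketched here, is to print `repr(token.text)`. The snippet assumes a blank English pipeline, so it shows only the lexical flags `is_space` and `is_punct`; POS tags, dependency labels and lemmas need a trained model such as the one loaded in the notebook.

```python
import spacy

# Assumption: tokenizer-only pipeline (no POS, dep or lemma available)
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("The man's hands were behind\nhis back.")

# repr() turns the newline token into the visible string '\n'
rows = [(repr(token.text), token.is_space, token.is_punct) for token in demo_doc]

for text_repr, is_space, is_punct in rows:
    print(f"Token: {text_repr}, is_space: {is_space}, is_punct: {is_punct}")
```
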


# 6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text.

In [25]:
with open("owlcreek.txt", "r") as file:
    text = file.read()
    print(text)

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  The man's hands were behind
his back, the wrists bound with a cord.  A rope closely encircled his
neck.  It was attached to a stout cross-timber above his head and the
slack fell to the level of his knees.  Some loose boards laid upon the
ties supporting the rails of the railway supplied a footing for him
and his executioners--two private soldiers of the Federal army,
directed by a sergeant who in civil life may have been a deputy
sheriff.  At a short remove upon the same temporary platform was an
officer in the uniform of his rank, armed.  He was a captain.  A
sentinel at each end of the bridge stood with his rifle in the
position known as "support," that is to say, vertical in front of the
left shoulder, the hammer resting on the forearm thrown straight
across the chest--a formal and unnatural position, enforcing an ere

In [26]:
import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import PhraseMatcher
# Create a PhraseMatcher object
phrase_matcher = PhraseMatcher(nlp.vocab)

# Define phrases to match
phrases = ["swimming vigorously"]
patterns = [nlp(phrase) for phrase in phrases]
phrase_matcher.add("Pattern", patterns)


doc = nlp(text)
matches = phrase_matcher(doc)
matches

# This returns no matches; the next cell and its note explain why.

[]

In [27]:
import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import PhraseMatcher
# Create a PhraseMatcher object
phrase_matcher = PhraseMatcher(nlp.vocab)

# Define phrases to match (note the \n between the two words)
phrases = ["swimming \nvigorously"]
patterns = [nlp(phrase) for phrase in phrases]
phrase_matcher.add("Pattern", patterns)


doc = nlp(text)
matches = phrase_matcher(doc)
matches

# Important note: without the \n after "swimming" in the phrase, nothing matches,
# because in owlcreek.txt "vigorously" sits on the line after "swimming",
# and spaCy turns that line break into a token of its own.

# In short: when the two words are not on the same line, the PhraseMatcher finds
# them only if the phrase contains the same \n, which means knowing in advance
# that the words are split across lines.
# When we do not know whether the words share a line, use the token-based Matcher
# instead of the PhraseMatcher: it can still find the match.

[(2474596767086405709, 1274, 1277), (2474596767086405709, 3609, 3612)]

In [28]:
# Using Matcher
#<--------------->

# 1) read file
with open("owlcreek.txt", "r") as file:
    text = file.read()

# 2) import spacy
import spacy
nlp = spacy.load("en_core_web_sm")

# 3) import matcher
from spacy.matcher import Matcher

# 4) Creating a Matcher object
matcher = Matcher(nlp.vocab)

# 5) Define a pattern: "swimming", then optional whitespace tokens, then "vigorously" (case-insensitive)
pattern = [
    {'LOWER': 'swimming'}, 
    {'IS_SPACE': True, 'OP': '*'},  # Allow any number of spaces (including newlines)
    {'LOWER': 'vigorously'}
]

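# 6) Register the pattern with the matcher under a name
matcher.add("SWIMMING VIGOROUSLY", [pattern])

# 7) Create a Doc object from the text
doc2 = nlp(text)

# 8) Run the matcher; each match is a (match_id, start, end) tuple of token indices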
found_matches = matcher(doc2)
found_matches

[(13196987592300466462, 1274, 1277), (13196987592300466462, 3609, 3612)]
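
The two approaches can be compared side by side on a small sample. This sketch assumes a blank English pipeline and a made-up two-sentence text in which the first occurrence is split across lines and the second is not; the matcher keys and `demo_` names are illustrative. The plain phrase should be found only where both words share a line, while the token pattern with optional whitespace finds both.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Assumption: tokenizer-only pipeline; both matchers use lexical attributes only
demo_nlp = spacy.blank("en")

# Made-up sample: first occurrence split across lines, second on one line
demo_doc = demo_nlp(
    "he was now swimming\nvigorously with the current. "
    "Later, swimming vigorously, he reached the bank."
)

# Exact phrase, no newline in the pattern
demo_phrase_matcher = PhraseMatcher(demo_nlp.vocab)
demo_phrase_matcher.add("SWIMMING_PHRASE", [demo_nlp("swimming vigorously")])

# Token pattern that tolerates any whitespace tokens between the words
demo_token_matcher = Matcher(demo_nlp.vocab)
demo_token_matcher.add("SWIMMING_TOKENS", [[
    {"LOWER": "swimming"},
    {"IS_SPACE": True, "OP": "*"},
    {"LOWER": "vigorously"},
]])

phrase_hits = [demo_doc[s:e].text for _, s, e in demo_phrase_matcher(demo_doc)]
token_hits = [demo_doc[s:e].text for _, s, e in demo_token_matcher(demo_doc)]

print("PhraseMatcher:", phrase_hits)
print("Matcher:      ", token_hits)
```
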

# 7. Print the text surrounding each found match.

In [30]:
# Display matches
for match_id, start, end in found_matches:
    # Print the match and its surrounding context
    print(f"Match: {doc2[start:end].text}")
    print(f"Surrounding text: {doc2[max(0, start-5):min(len(doc2), end+5)].text}")
    print("-" * 40)

Match: swimming
vigorously
Surrounding text: evade the bullets and, swimming
vigorously, reach the bank,
----------------------------------------
Match: swimming
vigorously
Surrounding text: shoulder; he was now swimming
vigorously with the current.  
----------------------------------------


# Understanding the code above, step by step

In [32]:
# start and end still hold the token indices of the last match from the loop above
doc2[start:end].text

'swimming\nvigorously'

In [33]:
doc2

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  The man's hands were behind
his back, the wrists bound with a cord.  A rope closely encircled his
neck.  It was attached to a stout cross-timber above his head and the
slack fell to the level of his knees.  Some loose boards laid upon the
ties supporting the rails of the railway supplied a footing for him
and his executioners--two private soldiers of the Federal army,
directed by a sergeant who in civil life may have been a deputy
sheriff.  At a short remove upon the same temporary platform was an
officer in the uniform of his rank, armed.  He was a captain.  A
sentinel at each end of the bridge stood with his rifle in the
position known as "support," that is to say, vertical in front of the
left shoulder, the hammer resting on the forearm thrown straight
across the chest--a formal and unnatural position, enforcing an ere

In [34]:
doc2.text

# .text returns the whole document as one string; in its repr the line breaks show up as \n.

'AN OCCURRENCE AT OWL CREEK BRIDGE\n\nby Ambrose Bierce\n\nI\n\nA man stood upon a railroad bridge in northern Alabama, looking down\ninto the swift water twenty feet below.  The man\'s hands were behind\nhis back, the wrists bound with a cord.  A rope closely encircled his\nneck.  It was attached to a stout cross-timber above his head and the\nslack fell to the level of his knees.  Some loose boards laid upon the\nties supporting the rails of the railway supplied a footing for him\nand his executioners--two private soldiers of the Federal army,\ndirected by a sergeant who in civil life may have been a deputy\nsheriff.  At a short remove upon the same temporary platform was an\nofficer in the uniform of his rank, armed.  He was a captain.  A\nsentinel at each end of the bridge stood with his rifle in the\nposition known as "support," that is to say, vertical in front of the\nleft shoulder, the hammer resting on the forearm thrown straight\nacross the chest--a formal and unnatural posit

In [35]:
a = str(doc2)
a[::-1]

# Reverse the text.

# doc2 is a spaCy Doc, not a str, so convert it to a string before reversing it.

'\n\n.egdirb keerC lwO eht fo srebmit eht htaeneb edis ot edis morf\nyltneg gnuws ,kcen nekorb a htiw ,ydob sih ;daed saw rahuqraF notyeP\n\n!ecnelis dna ssenkrad si lla neht--nonnac a fo kcohs\neht ekil dnuos a htiw mih tuoba lla sezalb thgil etihw gnidnilb a\n;kcen eht fo kcab eht nopu wolb gninnuts a sleef eh reh psalc ot tuoba\nsi eh sA  .smra dednetxe htiw sdrawrof sgnirps eH  !si ehs lufituaeb\nwoh ,hA  .ytingid dna ecarg sselhctam fo edutitta na ,yoj elbaffeni fo\nelims a htiw ,gnitiaw sdnats ehs spets eht fo mottob eht tA  .mih teem\not adnarev eht morf nwod spets ,teews dna looc dna hserf gnikool ,efiw\nsih ;stnemrag elamef fo rettulf a sees eh ,klaw etihw ediw eht pu\nsessap dna etag eht nepo sehsup eh sA  .thgin eritne eht delevart evah\ntsum eH  .enihsnus gninrom eht ni lufituaeb dna thgirb lla dna ,ti\ntfel eh sa si llA  .emoh nwo sih fo etag eht ta sdnats eH  .muiriled\na morf derevocer ylerem sah eh spahrep--enecs rehtona sees eh won rof\n,gniklaw elihw peelsa nellaf dah

In [39]:
# Calculate start and end indices for surrounding text
start_index = max(0, start - 5)
end_index = min(len(doc2), end + 5)

# Get surrounding text
surrounding_text = doc2[start_index:end_index].text

# Print the surrounding text
print("Surrounding text:", surrounding_text)


Surrounding text: shoulder; he was now swimming
vigorously with the current.  
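
The `max(0, ...)` clamp matters because a negative start index does not stop at the beginning of the sequence: it counts from the end. A minimal sketch with a plain Python list standing in for the Doc; the helper name `context_window` and the sample tokens are made up for illustration, and spaCy's own slicing is not exercised here.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] plus up to `width` tokens on each side."""
    left = max(0, start - width)            # never go below index 0
    right = min(len(tokens), end + width)   # never go past the last token
    return tokens[left:right]


# Made-up sample with the match at the very start (tokens 0 and 1)
demo_tokens = "swimming vigorously with the current toward the bank".split()

# Clamped window: starts at 0 and keeps the match
print(context_window(demo_tokens, 0, 2))

# Unclamped window: start - width is -5, which counts from the end of the list
print(demo_tokens[0 - 5:2 + 5])
```

In the unclamped version the slice begins in the middle of the list, so the match itself is cut out of the window.
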
