# Basics

## 1- Tokenization Basics

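- Tokenization is the process of breaking a sentence into smaller units (tokens), such as words and punctuation marks. For example, "i like apple" becomes the tokens "i", "like", and "apple".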

In [1]:
s1 = "The sun dipped below the horizon, casting a warm glow across the tranquil landscape." 
s2 = "A gentle breeze rustled the leaves, creating a soothing melody in the quiet evening air."
s3 = "As shadows lengthened, the world seemed to slow down, embracing the serenity of the approaching night."
s4 = "In that moment, nature whispered its timeless secrets to those willing to listen."



In [8]:
# download the pre-trained model (only needed once)
import spacy
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [9]:
nlp = spacy.load("en_core_web_lg") # spacy.load() loads a pre-trained pipeline by name

In [11]:
import en_core_web_lg

In [12]:
nlp_1 = en_core_web_lg.load()

In [14]:
doc1 = nlp_1(s1)

In [15]:
for token in doc1:
    print(token)

The
sun
dipped
below
the
horizon
,
casting
a
warm
glow
across
the
tranquil
landscape
.


In [16]:
type(doc1)

spacy.tokens.doc.Doc

In [17]:
len(doc1) # number of tokens in the Doc

16
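
To make the idea concrete without loading a model, here is a minimal sketch of rule-based tokenization using only the standard library. It is not how spaCy tokenizes (spaCy also handles abbreviations, contractions, and other exceptions); the `simple_tokenize` helper and its regular expression are assumptions made for illustration. For the sentence `s1` it happens to produce the same 16 tokens as the output above.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy rule)."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

s1 = "The sun dipped below the horizon, casting a warm glow across the tranquil landscape."
tokens = simple_tokenize(s1)
print(tokens)
print(len(tokens))
```

A rule this simple would split "U.K." into four tokens, which is one reason a real tokenizer needs exception lists.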

## 2- Stemming and Lemmatization

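- Stemming: reduces a word to its stem by stripping suffixes with fixed rules. The result is not always a real word (for example, "easily" becomes "easili").
- Lemmatization: maps a word to its dictionary form (lemma), so the result is a form found in the vocabulary.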
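
The difference between the two can be sketched with a toy example. The code below is an illustration only: `toy_stem` is a hypothetical three-rule suffix stripper (far simpler than the Porter or Snowball algorithms), and `TOY_LEMMAS` is a hand-written lookup table standing in for a real lemmatizer's vocabulary.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule; the result need not be a real word."""
    for suffix in ("ing", "ly", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table: maps an inflected form to its dictionary form.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "runs", "ran", "easily"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function cannot relate "ran" to "run" and produces non-words such as "runn" and "easi", while the lookup returns valid dictionary forms but only for words it knows.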

### a) Stemming

In [18]:
words =['run','runner' ,'running','ran','runs','easily',"fairly"]

In [19]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [20]:
p_stemmer = PorterStemmer()
s_stemmer = SnowballStemmer(language="english")


In [21]:
for word in words:
    print(word +"------>"+ p_stemmer.stem(word))

run------>run
runner------>runner
running------>run
ran------>ran
runs------>run
easily------>easili
fairly------>fairli


In [22]:
for word in words:
    print(word +"------>"+ s_stemmer.stem(word))

run------>run
runner------>runner
running------>run
ran------>ran
runs------>run
easily------>easili
fairly------>fair


### b) Lemmatization

In [23]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [26]:
text = "RESTful API training This framework covers almost all the different aspects of API writing and contains a set of practical tools."
doc1 = nlp(text=text)

In [27]:
for token in doc1:
    print(token.text , "\t",token.lemma_)

RESTful 	 RESTful
API 	 api
training 	 train
This 	 this
framework 	 framework
covers 	 cover
almost 	 almost
all 	 all
the 	 the
different 	 different
aspects 	 aspect
of 	 of
API 	 api
writing 	 writing
and 	 and
contains 	 contain
a 	 a
set 	 set
of 	 of
practical 	 practical
tools 	 tool
. 	 .


In [28]:
for word in text.split():
    print(word +"------>"+ p_stemmer.stem(word))

RESTful------>rest
API------>api
training------>train
This------>thi
framework------>framework
covers------>cover
almost------>almost
all------>all
the------>the
different------>differ
aspects------>aspect
of------>of
API------>api
writing------>write
and------>and
contains------>contain
a------>a
set------>set
of------>of
practical------>practic
tools.------>tools.


## 3- Stop Words

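Stop words are very common words such as "a", "an", "the", and "always" that carry little meaning on their own and are often filtered out before further processing.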

In [30]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [31]:
print(nlp.Defaults.stop_words)

{'take', 'always', 'their', 'yours', 'a', 'say', 'herein', 'might', 'everywhere', 'for', 'did', 'other', 'anyone', 'on', 'same', "'ll", 'already', 'done', 'ca', 'regarding', 'besides', 'latterly', 'serious', 'hereupon', 'keep', 'eight', 'anything', 'last', 'he', 'she', 'through', 'from', 'an', 'its', 'become', 'thus', 'latter', 'own', 'any', 'using', 'almost', 'should', 'can', 'some', 'within', 'bottom', 'your', 'make', 'thence', 'two', 'after', 'nobody', 'until', 'former', 'yet', 'yourselves', '‘re', 'due', 'all', 'first', 'part', 'became', 'across', '’s', 'somehow', 'towards', 'seemed', 'put', 'ours', 'of', 'doing', 'per', 'anyhow', 'few', 'made', 'fifty', 'seeming', 'but', 'where', 'just', 'you', 'next', 'so', 'otherwise', 'by', 'ourselves', 'around', 'up', 'back', 'with', 'this', 'again', 'me', 'we', '‘ve', 'could', 'still', 'then', 'below', 'go', 'ten', 'namely', 'nothing', 'throughout', 'often', "'re", 'these', 'seem', 'beside', 'each', 'n‘t', 'along', 'nor', 'name', 'call', 'mos

In [32]:
len(nlp.Defaults.stop_words)

326

In [34]:
# Check if a word is a stop word or not
nlp.vocab["always"].is_stop 

True

In [38]:
# remove a word from the stop-word set (use .add() to add one)
# discard() is used instead of remove() so that a missing word does not raise KeyError
nlp.Defaults.stop_words.discard("wait")

In [39]:
len(nlp.Defaults.stop_words)

326
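
Since the stop-word collection is an ordinary Python set, filtering and editing it can be sketched without spaCy. The mini stop-word set and the `remove_stop_words` helper below are assumptions for illustration; spaCy's default English set is much larger.

```python
# Hypothetical mini stop-word set; spaCy's English default is much larger.
stop_words = {"a", "an", "the", "always", "of", "to"}

def remove_stop_words(tokens, stops):
    """Keep only tokens whose lowercase form is not in the stop-word set."""
    return [t for t in tokens if t.lower() not in stops]

tokens = ["The", "breeze", "always", "rustled", "the", "leaves"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)

# set.remove() raises KeyError for a missing word; set.discard() does not.
stop_words.discard("wait")   # no error even though "wait" is absent
stop_words.add("wait")       # now treat "wait" as a stop word
print("wait" in stop_words)
```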

## 4- Vocabulary and Matching
- Good examples of these tools in use can be found at https://explosion.ai/software

### A) Rule-Based Matching

In [44]:
# import the Matcher class
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # create a Matcher object and pass it nlp.vocab

# the matcher shares the vocabulary of the current nlp object
# named match rules can be added to and removed from it as needed

In [45]:
# Each pattern is a list of dictionaries, one dictionary per token.
# "hello world" can appear in the following forms:
# 1 - hello world
# 2 - hello-world
pattern_1 = [{"LOWER":"hello"},{"LOWER":"world"}]
pattern_2 = [{"LOWER":"hello"},{"IS_PUNCT":True},{"LOWER":"world"}]
# "LOWER" and "IS_PUNCT" are token attribute names,
# written in uppercase by convention

In [49]:
# Add a match rule to the matcher. A rule consists of:
# 1) an ID key
# 2) an optional on_match callback
# 3) one or more patterns
matcher.add(key="hello world",on_match=None ,patterns=[pattern_1 , pattern_2])

In [50]:
doc = nlp("'hello world' are the first two printed words for most of the programmers , printing 'hello world'")

In [51]:
doc

'hello world' are the first two printed words for most of the programmers , printing 'hello world'

#### Finding matches

In [53]:
find_matches = matcher(doc) # run the matcher on the doc and store the result

find_matches
# the result is a list of tuples:
# (match ID, start token index, end token index)

[(2758594965276909933, 1, 3), (2758594965276909933, 18, 20)]

In [58]:
# loop over the matches and print the details of each one
for match_id , start , end in find_matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc[start:end]
    print(match_id , string_id , start , end , span.text)

2758594965276909933 hello world 1 3 hello world
2758594965276909933 hello world 18 20 hello world
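
The large integer in each tuple is a hash of the rule name, and `nlp.vocab.strings` maps it back to the string. The class below is a toy stand-in for that two-way lookup, written only to show the idea: `ToyStringStore` is hypothetical, and it uses the first 8 bytes of SHA-256 as an assumed hash, which is not the hash function spaCy uses, so the IDs will differ from the ones printed above.

```python
import hashlib

class ToyStringStore:
    """Maps strings to 64-bit integer IDs and back (illustration only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Stand-in hash: first 8 bytes of SHA-256. spaCy uses a different hash.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        # Integer in, string out: the lookup used to recover a rule name.
        return self._by_id[key]

store = ToyStringStore()
match_id = store.add("hello world")
print(match_id, store[match_id])
```

Storing the integer instead of the string keeps match tuples small and makes comparisons cheap; the string is only looked up when it has to be shown.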


In [61]:
# remove the 'hello world' rule from the matcher
matcher.remove('hello world')

### Setting pattern options and quantifiers


In [62]:
# Redefine the patterns:
pattern_3 = [{'LOWER':"hello"},{'LOWER':'world'}]
pattern_4 = [{'LOWER':"hello"},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'world'}]
# 'OP':'*' lets the punctuation token match zero or more times

# Add the new patterns to the matcher under the key 'Hello World':
matcher.add(key='Hello World' , on_match=None , patterns= [pattern_3,pattern_4])

In [63]:
doc_2 = nlp('you can print Hello World or hello world or Hello-World')


In [64]:
find_matches = matcher(doc_2)
find_matches

[(8585552006568828647, 3, 5),
 (8585552006568828647, 6, 8),
 (8585552006568828647, 9, 12)]
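
How a token pattern with a `*` quantifier is evaluated can be sketched in plain Python. This is a simplified illustration, not spaCy's matching algorithm: the helper names (`token_ok`, `match_ends`, `find_all`) are hypothetical, only the `LOWER` and `IS_PUNCT` attributes and the `*` operator are supported, and the token list is typed in by hand to mirror how `doc_2` is split.

```python
import string

def token_ok(token, spec):
    """Check one token against one pattern dict (LOWER and IS_PUNCT only)."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True

def match_ends(tokens, i, pattern):
    """Return every end index at which `pattern` can finish when started at i."""
    if not pattern:
        return {i}
    spec, rest = pattern[0], pattern[1:]
    ends = set()
    if spec.get("OP") == "*":
        ends |= match_ends(tokens, i, rest)             # use the token zero times
        if i < len(tokens) and token_ok(tokens[i], spec):
            ends |= match_ends(tokens, i + 1, pattern)  # consume one and repeat
    elif i < len(tokens) and token_ok(tokens[i], spec):
        ends |= match_ends(tokens, i + 1, rest)
    return ends

def find_all(tokens, pattern):
    """Collect (start, end) pairs for every match of the pattern."""
    return [(s, e) for s in range(len(tokens))
            for e in sorted(match_ends(tokens, s, pattern))]

tokens = ["you", "can", "print", "Hello", "World", "or",
          "hello", "world", "or", "Hello", "-", "World"]
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "world"}]
spans = find_all(tokens, pattern)
print(spans)
```

The three spans line up with the start and end indices in the matcher output above, including the hyphenated form that needs the optional punctuation token.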

### B) Phrase Matching

In [68]:
import spacy
nlp  = spacy.load("en_core_web_lg")

In [69]:
# import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [73]:
phrase_list = ['Barack Obama',"Angela Markel","Washington","D.C."]


In [74]:
# convert each phrase to a Doc object
phrase_patterns = [nlp(text) for text in phrase_list] # built with a list comprehension

In [76]:
phrase_patterns # the patterns are Doc objects, not strings

[Barack Obama, Angela Markel, Washington, D.C.]

In [78]:
type(phrase_patterns[0])
# each pattern is a spaCy Doc,
# which is why it is displayed without quotes

spacy.tokens.doc.Doc

In [79]:
# pass the list of Doc objects to the matcher
matcher.add('TerminologyList', phrase_patterns)
# spaCy v3 takes the patterns as a list; the older v2 form was
# matcher.add('TerminologyList', None, *phrase_patterns)

In [80]:
doc_3 = nlp("German Chancellor Angela Markel and US President Barack Obama "
           "converse in the Oval office inside the white house in Washington, D.C.")

In [81]:
find_matches= matcher(doc_3)
find_matches

[(3766102292120407359, 2, 4),
 (3766102292120407359, 7, 9),
 (3766102292120407359, 19, 20),
 (3766102292120407359, 21, 22)]

In [83]:
# loop over the matches and print the details of each one
for match_id , start , end in find_matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc_3[start:end]
    print(match_id, string_id ,start , end, span.text)

3766102292120407359 TerminologyList 2 4 Angela Markel
3766102292120407359 TerminologyList 7 9 Barack Obama
3766102292120407359 TerminologyList 19 20 Washington
3766102292120407359 TerminologyList 21 22 D.C.
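
Phrase matching amounts to looking for exact token sequences. The sketch below illustrates that idea on a hand-written token list; `find_phrases` is a hypothetical helper, it splits phrases on whitespace instead of using a real tokenizer, and it compares token text exactly (case-sensitive).

```python
def find_phrases(tokens, phrases):
    """Return (start, end, phrase) for each exact token-sequence match."""
    # Pre-split every phrase into a tuple of tokens (whitespace split is a simplification).
    phrase_tokens = [tuple(p.split()) for p in phrases]
    matches = []
    for start in range(len(tokens)):
        for pt in phrase_tokens:
            end = start + len(pt)
            if tuple(tokens[start:end]) == pt:
                matches.append((start, end, " ".join(pt)))
    return matches

tokens = ["President", "Barack", "Obama", "met", "Angela", "Markel",
          "in", "Washington", ",", "D.C."]
phrases = ["Barack Obama", "Angela Markel", "Washington", "D.C."]
found = find_phrases(tokens, phrases)
for start, end, text in found:
    print(start, end, text)
```

Because the comparison is made token by token, a phrase can only match if the text is tokenized the way the phrase was; two words fused into one token by a missing space would never match.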


## 5- Part-of-Speech Tagging

In [84]:
s1 ="Apple is looking at buying U.K. startup for $1 billion"

In [86]:
import spacy
nlp = spacy.load(name="en_core_web_lg")

In [87]:
doc = nlp(s1)

In [90]:
for token in doc:
    print(token.text , token.pos_ , token.tag_ ,spacy.explain(token.tag_))
    # spacy.explain() returns a short description of a tag or label

Apple PROPN NNP noun, proper singular
is AUX VBZ verb, 3rd person singular present
looking VERB VBG verb, gerund or present participle
at ADP IN conjunction, subordinating or preposition
buying VERB VBG verb, gerund or present participle
U.K. PROPN NNP noun, proper singular
startup NOUN NN noun, singular or mass
for ADP IN conjunction, subordinating or preposition
$ SYM $ symbol, currency
1 NUM CD cardinal number
billion NUM CD cardinal number


In [93]:
# count how many tokens carry each coarse-grained POS tag
for key , val in doc.count_by(spacy.attrs.POS).items():
    print(key,doc.vocab[key].text,val)

96 PROPN 2
87 AUX 1
100 VERB 2
85 ADP 2
92 NOUN 1
99 SYM 1
93 NUM 2
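
The same tally can be reproduced with `collections.Counter`, which may be easier to read than integer attribute IDs. In this sketch the `(token, POS)` pairs are copied by hand from the tagging output above rather than computed, so no model is needed to run it.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the tagging output above.
tagged = [
    ("Apple", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("U.K.", "PROPN"), ("startup", "NOUN"), ("for", "ADP"),
    ("$", "SYM"), ("1", "NUM"), ("billion", "NUM"),
]

# Count how often each POS label occurs.
pos_counts = Counter(pos for _, pos in tagged)
for pos, n in pos_counts.items():
    print(pos, n)
```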


In [94]:
from spacy import displacy

In [96]:
displacy.render(docs=doc , style="dep" , jupyter=True ,options={"distance":100})

## 6- Named Entity Recognition

In [97]:
s1 = "Hello, how are you today?"
s2 = "The quick brown fox jumps over the lazy dog."
s3 = "123 Main Street, Anytown, USA"


In [99]:
import spacy 
nlp = spacy.load("en_core_web_lg")

In [100]:
doc_1 = nlp(s1)
doc_2 = nlp(s2)
doc_3 = nlp(s3)

In [102]:
doc_1.ents , doc_2.ents , doc_3.ents

((today,), (), (123, Main Street, Anytown, USA))

In [110]:
for i in doc_3.ents:
    print(i.text , i.label_ , str(spacy.explain(i.label_)) )

123 CARDINAL Numerals that do not fall under another type
Main Street FAC Buildings, airports, highways, bridges, etc.
Anytown GPE Countries, cities, states
USA GPE Countries, cities, states


In [121]:
ORG = doc_1.vocab.strings['ORG'] # hash ID of the 'ORG' label
from spacy.tokens import Span
new_ent = Span(doc_1 , 4,5,label =ORG) # tokens 4 to 5 (end exclusive), i.e. "you", labelled as ORG

In [122]:
doc_1.ents = list(doc_1.ents) + [new_ent]

In [123]:
doc_1.ents

(you, today)

In [124]:
for i in doc_1.ents:
    print(i.text , i.label_ , str(spacy.explain(i.label_)) )

you ORG Companies, agencies, institutions, etc.
today DATE Absolute or relative dates or periods
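
Entity spans are half-open token ranges, and spaCy expects the spans in `doc.ents` not to overlap. The sketch below illustrates that bookkeeping with plain tuples; `overlaps` and `add_entity` are hypothetical helpers, and the token indices assume the sentence is split as Hello(0) ,(1) how(2) are(3) you(4) today(5) ?(6).

```python
def overlaps(a, b):
    """True when two half-open token ranges (start, end, label) share a token."""
    return a[0] < b[1] and b[0] < a[1]

def add_entity(ents, new_ent):
    """Return a new sorted entity list, refusing spans that overlap existing ones."""
    for ent in ents:
        if overlaps(ent, new_ent):
            raise ValueError(f"{new_ent} overlaps {ent}")
    return sorted(ents + [new_ent])

# Token indices for "Hello, how are you today?": you=4, today=5.
ents = [(5, 6, "DATE")]
ents = add_entity(ents, (4, 5, "ORG"))
print(ents)
```

The span `(4, 5)` ends where `(5, 6)` begins, so the two are adjacent rather than overlapping and both can be kept.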


In [126]:
from spacy import displacy
displacy.render(docs = doc_1 , style="ent",jupyter=True)

In [127]:
displacy.render(docs = doc_1 , style="ent",jupyter=True , options={"ents":['ORG']})

## 7- Sentence Segmentation

In [169]:
s1 = "123 Main Street; Anytown; USA"
s2 = "123 Main Street. Anytown. USA"

In [170]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [171]:
doc_1 = nlp(s1)
doc_2 = nlp(s2)

In [173]:
for i in doc_1.sents:
    print(i.text)

123 Main Street; Anytown; USA


In [174]:
s3 = "123 Main Street U.K. . Anytown. USA"
#s3 = "123 Main Street U.K.. Anytown. USA"

doc_3 = nlp(s3)

In [175]:
for i in doc_3.sents:
    print(i.text)

123 Main Street U.K. .
Anytown.
USA


In [176]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [177]:
s1

'123 Main Street; Anytown; USA'

In [178]:
# add a custom component to the pipeline
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("set_custom_boundaries", before="parser")


<function __main__.set_custom_boundaries(doc)>

In [179]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']
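
A pipeline can be pictured as an ordered list of named functions that each receive the document and return it. The sketch below imitates inserting a component before another one; `add_before`, `run`, and `make` are hypothetical helpers, and the "doc" is just a dictionary that records which components touched it.

```python
def add_before(pipeline, name, component, before):
    """Insert (name, component) just ahead of the component called `before`."""
    names = [n for n, _ in pipeline]
    index = names.index(before)  # raises ValueError if `before` is not present
    return pipeline[:index] + [(name, component)] + pipeline[index:]

def run(pipeline, doc):
    """Pass the doc through every component in order."""
    for _, component in pipeline:
        doc = component(doc)
    return doc

def make(name):
    """Build a dummy component that records its own name on the doc."""
    def component(doc):
        doc["trace"].append(name)
        return doc
    return component

pipeline = [(n, make(n)) for n in ["tok2vec", "tagger", "parser", "ner"]]
pipeline = add_before(pipeline, "set_custom_boundaries",
                      make("set_custom_boundaries"), before="parser")
result = run(pipeline, {"trace": []})
print(result["trace"])
```

Order matters here because the custom component has to set sentence starts before the parser runs, so that the parser works with boundaries that are already fixed.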

In [180]:
doc_1 = nlp(s1)
for i in doc_1.sents:
    print(i.text)

123 Main Street;
Anytown;
USA
