In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc8 = nlp(u"Apple to build a Hong Kong factory for $6 milion")
for token in doc8:
    print(token.text,end=" | ")
print('\n--------')
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | milion | 
--------
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
6 - MONEY - Monetary values, including unit


Noun chunks: similar to Doc.ents, Doc.noun_chunks yields another set of Span objects, in this case base noun phrases.

In [3]:
doc9 = nlp(u"Autonomous cars shifts insurance liability towards manufacturers.")
for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars shifts insurance liability
manufacturers


In [4]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")
for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached.
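To make the "chop letters off the end" idea concrete, here is a minimal sketch of a suffix stripper. The suffix list and the `crude_stem` helper are made up for illustration; they are not the rules of the Porter stemmer or of any other real algorithm.

```python
# Toy suffix-stripping stemmer. The suffix list is a made-up illustration,
# not the rule set of any real stemming algorithm.
SUFFIXES = ("ing", "ly", "ed", "s")  # checked in this order


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "quickly", "running", "ran"]:
    print(w, "--->", crude_stem(w))
```

Note that this sketch turns "running" into "runn" and leaves "ran" untouched: it has no rule for doubled consonants and no knowledge of irregular forms, which is the kind of gap real stemmers add extra rules for.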

#### Porter Stemmer

The Porter stemmer is one of the most common and effective stemming tools.

In [8]:
#Import the toolkit and the Porter stemmer
import nltk
from nltk.stem.porter import PorterStemmer


In [9]:
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']

for word in words :
    print(word+'--->'+p_stemmer.stem(word))

run--->run
runner--->runner
running--->run
ran--->ran
runs--->run
easily--->easili
fairly--->fairli


Note how the stemmer recognizes "runner" as a noun, not a verb form or participle. Also, the adverbs "easily" and "fairly" are stemmed to the unusual roots "easili" and "fairli".

#### Snowball Stemmer

The Snowball stemmer offers a slight improvement over the original Porter stemmer, both in logic and speed. nltk exposes it under the name SnowballStemmer.

In [12]:
from nltk.stem.snowball import SnowballStemmer

#the Snowball stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [13]:
words = ['run','runner','running','ran','runs','easily','fairly']
# words = ['generous','generation','generously','generate']
for word in words :
    print(word+'--->'+s_stemmer.stem(word))

run--->run
runner--->runner
running--->run
ran--->ran
runs--->run
easily--->easili
fairly--->fair
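To see where the two stemmers differ, it can help to print their results side by side. The sketch below assumes nltk is installed; `compare_stemmers` is a hypothetical helper written for this note, and the word list is the commented-out one from the cell above.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")


def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) tuples for each word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


# Print one row per word so any differences line up in columns.
for word, p_stem, s_stem in compare_stemmers(
    ["generous", "generation", "generously", "generate"]
):
    print(f"{word:<12} porter={p_stem:<10} snowball={s_stem}")
```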


Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas lemmatization would likely return either see or saw depending on whether the token is used as a verb or a noun.

In [15]:
phrase = 'I am meeting him tommorow at the meeting'
for word in phrase.split():
    print(word+'----->'+p_stemmer.stem(word))

I----->i
am----->am
meeting----->meet
him----->him
tommorow----->tommorow
at----->at
the----->the
meeting----->meet


Here "meeting" appears twice, once as a verb and once as a noun, and yet the stemmer treats both the same way.

#### Lemmatization

In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words. The lemma of "was" is "be" and the lemma of "mice" is "mouse". Further, the lemma of "meeting" might be "meet" or "meeting" depending on its use in a sentence.
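The key difference from stemming is that the lemma depends on the part of speech as well as on the word. A minimal sketch of that idea, using a hand-written lookup table (`LEMMA_TABLE` and `toy_lemma` are hypothetical and far simpler than what spaCy actually does):

```python
# Hypothetical lookup table: (lowercase word, part of speech) -> lemma.
# A real lemmatizer uses a full vocabulary and morphological rules.
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    """Return the lemma for (word, pos); fall back to the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemma("meeting", "VERB"))  # verb reading
print(toy_lemma("meeting", "NOUN"))  # noun reading
print(toy_lemma("mice", "NOUN"))
```

The same surface form "meeting" maps to two different lemmas here, which a stemmer working only on the letters of the word cannot do.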

In [16]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [17]:
doc1 = nlp(u"I am a runner running in a race becuase I love to run since I ran today")
for token in doc1:
    print(token.text,'\t',token.pos_+'\t',token.lemma,'\t',token.lemma_)

I 	 PRON	 4690420944186131903 	 I
am 	 AUX	 10382539506755952630 	 be
a 	 DET	 11901859001352538922 	 a
runner 	 NOUN	 12640964157389618806 	 runner
running 	 VERB	 12767647472892411841 	 run
in 	 ADP	 3002984154512732771 	 in
a 	 DET	 11901859001352538922 	 a
race 	 NOUN	 8048469955494714898 	 race
becuase 	 NOUN	 3636336227294319702 	 becuase
I 	 PRON	 4690420944186131903 	 I
love 	 VERB	 3702023516439754181 	 love
to 	 PART	 3791531372978436496 	 to
run 	 VERB	 12767647472892411841 	 run
since 	 SCONJ	 10066841407251338481 	 since
I 	 PRON	 4690420944186131903 	 I
ran 	 VERB	 12767647472892411841 	 run
today 	 NOUN	 11042482332948150395 	 today


#token.pos_ :- part of speech
#token.lemma :- hash value of the lemma
#token.lemma_ :- the lemma as readable text
Here we're using an f-string to format the printed text by setting minimum field widths and left-aligning the lemma hash value.

In [22]:
def show_lemma(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [23]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemma(doc2)

I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !
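The field-width syntax used in show_lemma can be tried on plain values without loading a model. In this sketch `format_row` is a hypothetical stand-in for the print call, fed with two rows copied from the output above. Integers are right-aligned by default, which is why the hash needs an explicit `<`.

```python
# Two rows copied from the show_lemma output: text, POS, lemma hash, lemma.
rows = [
    ("saw", "VERB", 11925638236994514241, "see"),
    ("mice", "NOUN", 1384165645700560590, "mouse"),
]


def format_row(text, pos, lemma_hash, lemma):
    """Pad text to 12 and pos to 6 characters; left-align the hash in 22."""
    return f"{text:{12}} {pos:{6}} {lemma_hash:<{22}} {lemma}"


for row in rows:
    print(format_row(*row))

# Strings pad on the right by default, integers pad on the left.
print(repr(f"{'ab':{5}}"), repr(f"{42:{5}}"), repr(f"{42:<{5}}"))
```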


In [24]:
doc3 = nlp(u"I am meeting him tommorow at the meeting")
show_lemma(doc3)

I            PRON   4690420944186131903    I
am           AUX    10382539506755952630   be
meeting      VERB   6880656908171229526    meet
him          PRON   1655312771067108281    he
tommorow     VERB   14881451523362505806   tommorow
at           ADP    11667289587015813222   at
the          DET    7425985699627899538    the
meeting      NOUN   14798207169164081740   meeting


In [25]:
doc4 = nlp(u"That's an enormous automobile")
show_lemma(doc4)

That         PRON   4380130941430378203    that
's           AUX    10382539506755952630   be
an           DET    15099054000809333061   an
enormous     ADJ    17917224542039855524   enormous
automobile   NOUN   7211811266693931283    automobile


Note that lemmatization does not reduce words to their most basic synonym - that is, enormous doesn't become big and automobile doesn't become car.

#### Stop Words

Stop words are words like "a" and "the" that appear so frequently that they don't require tagging as thoroughly as nouns. spaCy holds a built-in list of 326 English stop words.
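A common use of a stop-word list is to drop those words before further processing. The sketch below shows the idea with plain Python; `TOY_STOP_WORDS` is a tiny hand-picked set used only for illustration, not spaCy's list, and whitespace splitting stands in for real tokenization.

```python
# Tiny hand-picked stop-word set for illustration only;
# spaCy's built-in list is much larger.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "to", "and"})


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stop word."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]


print(remove_stop_words("The lemma of mice is mouse"))
# prints ['lemma', 'mice', 'mouse']
```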

In [30]:
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)

{'re', 'now', 'yourself', 'then', 'since', 'if', 'has', 'four', 'per', 'will', 'beforehand', 'hence', 'within', 'whom', 'seemed', 'among', 'twenty', 'much', 'they', 'behind', 'keep', 'us', 'between', 'might', 'it', 'last', 'very', 'twelve', 'because', 'throughout', 'everything', 'wherever', 'his', 'can', 'nothing', 'none', 'several', 'its', 'although', 'here', 'unless', 'down', 'ever', 'quite', 'a', 'hereafter', 'only', 'five', 'both', 'sometime', 'doing', 'nobody', 'besides', '‘ve', 'elsewhere', 'therein', 'our', 'whereby', 'your', 'next', 'above', 'whoever', 'name', 'been', 'really', 'somehow', 'is', 'still', 'at', 'thus', 'used', 'of', 'across', 'thereafter', 'nowhere', '’ve', 'should', 'about', 'former', 'by', 'no', 'even', 'before', 'made', '‘s', 'we', '‘m', 'were', 'themselves', 'after', 'whither', 'may', 'due', 'towards', 'from', 'hundred', 'for', 'more', 'show', 'again', 'serious', 'ourselves', 'formerly', 'as', 'whereupon', 'along', 'amount', 'beyond', 'through', 'beside', 'bo

326

#### To see if a word is a stop word

In [29]:
nlp.vocab['myself'].is_stop

True

In [31]:
nlp.vocab['mystery'].is_stop

False

#### To add a stop word

Here we add one new stop word, "btw", to the existing set of stop words.

In [35]:
#Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

#Set the stop words tag on the lexeme 
nlp.vocab['btw'].is_stop = True

In [36]:
len(nlp.Defaults.stop_words)

327

#### To Remove a Stop Word

In [37]:
#Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

#Remove the stop word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [38]:
len(nlp.Defaults.stop_words)
# Here we removed the stop word 'beyond' from the set and set its is_stop flag to False, so the count is back to 326

326
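Adding and removing both take two steps: updating the set and updating the lexeme flag. One way to keep them in sync is a small wrapper. `set_stop_word` below is a hypothetical helper, not part of spaCy; it only relies on the `nlp.Defaults.stop_words` set and the `nlp.vocab[...].is_stop` flag used in the cells above.

```python
def set_stop_word(nlp, word, is_stop=True):
    """Add or remove `word` in both the defaults set and the vocab lexeme."""
    word = word.lower()  # stop-word entries are stored in lowercase
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard() does not raise if the word is already absent
        nlp.Defaults.stop_words.discard(word)
    # Keep the lexeme flag consistent with the set
    nlp.vocab[word].is_stop = is_stop


# Usage with the pipeline loaded earlier:
# set_stop_word(nlp, 'btw')                    # add
# set_stop_word(nlp, 'beyond', is_stop=False)  # remove
```

Using `discard` rather than `remove` is a design choice in this sketch: calling the helper twice for the same word then has no effect instead of raising a KeyError.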