
**Tokenization Basics**

In [5]:
s1 = 'Apple is looking to buying U.K. startup for $1 billion !'
s2 = 'Hello all, we are here to help you! email support@udemy.com or visit us at http://www.udemy.com!'
s3 = '10km cab ride almost costs $20 in NYC'
s4 = "Let's watch movie together."

In [2]:
import spacy

In [6]:
# Load spaCy's small pretrained English pipeline ('sm' = small)
nlp = spacy.load(name='en_core_web_sm')

In [7]:
!python -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.3.0/en_core_web_md-3.3.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [9]:
import en_core_web_md

In [10]:
# Load spaCy's medium pretrained English pipeline ('md' = medium)
nlp_1 = spacy.load(name='en_core_web_md')

In [13]:
doc1 = nlp(s1)
print(s1)
for token in doc1:
  print(token)

Apple is looking to buying U.K. startup for $1 billion !
Apple
is
looking
to
buying
U.K.
startup
for
$
1
billion
!


In [14]:
doc2 = nlp(s2)
print(s2)
for token in doc2:
  print(token)

Hello all, we are here to help you! email support@udemy.com or visit us at http://www.udemy.com!
Hello
all
,
we
are
here
to
help
you
!
email
support@udemy.com
or
visit
us
at
http://www.udemy.com
!


In [15]:
doc3 = nlp(s3)
print(s3)
for token in doc3:
  print(token)

10km cab ride almost costs $20 in NYC
10
km
cab
ride
almost
costs
$
20
in
NYC


In [16]:
doc4 = nlp(s4)
print(s4)
for token in doc4:
  print(token)

Let's watch movie together.
Let
's
watch
movie
together
.


In [17]:
type(doc4)

spacy.tokens.doc.Doc

In [18]:
len(doc4)

6

In [20]:
doc4[2]

watch

In [23]:
doc4[2:4]

watch movie
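
The splits shown above (`$` separated from `1`, `10` from `km`, `U.K.` kept whole) come from spaCy's own tokenizer rules and exception tables. As a rough illustration of why plain `str.split()` is not enough, the sketch below compares whitespace splitting with a small regular expression. `TOKEN_RE` and `toy_tokenize` are hypothetical names, and the pattern is an assumption written only to cope with these sample sentences; it is not how spaCy tokenizes.

```python
import re

s1 = 'Apple is looking to buying U.K. startup for $1 billion !'
s3 = '10km cab ride almost costs $20 in NYC'

# Hypothetical toy pattern (NOT spaCy's rules); alternatives are tried in order.
TOKEN_RE = re.compile(r"""
    https?://[^\s!]+       # URL, stopping before whitespace or '!'
  | [\w.]+@[\w.]+\w        # e-mail-like string
  | (?:[A-Za-z]\.){2,}     # dotted abbreviation such as U.K.
  | [A-Za-z]+              # run of letters
  | \d+                    # run of digits
  | \S                     # any other single non-space character
""", re.VERBOSE)

def toy_tokenize(text):
    """Return the list of substrings matched by the toy pattern."""
    return TOKEN_RE.findall(text)

# Whitespace splitting keeps '10km' and '$20' glued together.
print(s3.split())
# The toy pattern separates the number from its unit and currency symbol.
print(toy_tokenize(s3))
print(toy_tokenize(s1))
```

For `s1` and `s3` the toy pattern happens to produce the same token lists as the spaCy output above, but it is far less general: it would split `Let's` into `Let`, `'`, `s` rather than `Let`, `'s`.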

**Stemming**

In [24]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

In [25]:
import nltk

In [29]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [31]:
p_stemmer = PorterStemmer()
s_stemmer = SnowballStemmer(language='english')

In [32]:
for word in words:
  print(word + ' ------ ' +p_stemmer.stem(word))

run ------ run
runner ------ runner
running ------ run
ran ------ ran
runs ------ run
easily ------ easili
fairly ------ fairli


In [33]:
for word in words:
  print(word + ' ------ ' +s_stemmer.stem(word))

run ------ run
runner ------ runner
running ------ run
ran ------ ran
runs ------ run
easily ------ easili
fairly ------ fair
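
Both stemmers work by applying suffix rules to the word string, which is why the irregular form `ran` is left untouched in both outputs. A minimal sketch of that idea, assuming a tiny hypothetical rule set: `toy_stem` is an illustration only, not the Porter or Snowball algorithm, both of which use many more rules and conditions, so its results differ from the outputs above (for example `easi` instead of `easili`).

```python
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

def toy_stem(word):
    """Strip at most one suffix using a hypothetical toy rule set."""
    for suffix in ("ing", "ly", "s"):
        # Strip only when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final letter, e.g. "runn" -> "run".
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    # No rule matched: irregular forms such as "ran" pass through unchanged.
    return word

for word in words:
    print(word + ' ------ ' + toy_stem(word))
```

Because the rules only look at the end of the string, no amount of suffix stripping can map `ran` to `run`; that requires vocabulary knowledge, which is what lemmatization adds.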


**Lemmatization**

In [34]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [35]:
doc1 = nlp("The striped bats are hanging on their feet for best")

In [36]:
for token in doc1:
  print(token.text, '\t', token.lemma_)

The 	 the
striped 	 stripe
bats 	 bat
are 	 be
hanging 	 hang
on 	 on
their 	 their
feet 	 foot
for 	 for
best 	 good


In [37]:
# Compare the lemmas above with the Porter stemmer's output
s1 = "The striped bats are hanging on their feet for best"
for word in s1.split():
  print(word + ' ------ ' +p_stemmer.stem(word))

The ------ the
striped ------ stripe
bats ------ bat
are ------ are
hanging ------ hang
on ------ on
their ------ their
feet ------ feet
for ------ for
best ------ best
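
To make the comparison easier to read, the sketch below lines up the two outputs printed above and keeps only the rows where the lemma and the stem disagree. The triples are copied by hand from those outputs; `rows` and `differences` are names introduced here for illustration.

```python
# (token, spaCy lemma, Porter stem) copied from the two outputs above.
rows = [
    ("The", "the", "the"),
    ("striped", "stripe", "stripe"),
    ("bats", "bat", "bat"),
    ("are", "be", "are"),
    ("hanging", "hang", "hang"),
    ("on", "on", "on"),
    ("their", "their", "their"),
    ("feet", "foot", "feet"),
    ("for", "for", "for"),
    ("best", "good", "best"),
]

# Keep only the tokens where the two approaches give different results.
differences = [row for row in rows if row[1] != row[2]]

for token, lemma, stem in differences:
    print(f"{token:6} lemma={lemma:5} stem={stem}")
```

The three rows that differ (`are`, `feet`, `best`) are all irregular forms, which the lemmatizer resolves and the suffix-based stemmer leaves unchanged.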


**Stopwords**

In [38]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [40]:
print(nlp.Defaults.stop_words)

{'down', 'perhaps', 'every', 'they', 'though', 'not', 'why', 'where', 're', 'across', 'be', 'before', 'all', 'by', 'due', 'yet', 'everyone', 'own', 'few', 'six', "n't", 'will', 'into', 'beside', 'next', 'same', 'beyond', 'within', 'twelve', 'give', 'thereafter', 'had', "'s", 'ca', 'alone', 'because', 'towards', 'wherever', 'might', 'whereas', 'whoever', 'another', 'else', '‘m', 'although', 'anywhere', 'say', 'below', 'nobody', 'unless', 'whatever', 'whereafter', 'him', 'something', 'n‘t', 'thus', 'former', 'therefore', 'becoming', 'besides', 'front', 'regarding', 'top', 'i', 'on', 'except', 'without', 'please', 'even', 'my', 'nowhere', 'becomes', 'anyway', 'their', 'call', 'does', 'thereby', 'namely', 'up', '‘ll', 'should', 'since', 'you', 'twenty', 'herein', '‘ve', '’s', 'done', 'than', 'anything', 'such', 'very', 'which', 'our', 'then', 'just', 'when', 'yourselves', 'how', 'never', 'noone', 'can', 'keep', '’re', 'latter', 'quite', 'through', 'did', 'us', 'make', 'almost', 'but', 'you

In [41]:
len(nlp.Defaults.stop_words)

326

In [42]:
# Check whether a word is a stop word
nlp.vocab['always'].is_stop

True

In [43]:
nlp.vocab['finance'].is_stop

False

In [44]:
nlp.vocab['asdf'].is_stop

False

In [46]:
# Add a word to the stop-word list: update the default set and flag the vocabulary entry
nlp.Defaults.stop_words.add('asdf')
nlp.vocab['asdf'].is_stop = True
nlp.vocab['asdf'].is_stop

True

In [48]:
# The length is now 327 because 'asdf' was added to the stop-word set
len(nlp.Defaults.stop_words)

327
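
A common use of the stop-word set is to filter tokens before further processing. The sketch below shows the idea with a plain Python set. `STOP_WORDS_SUBSET` is a small hand-picked subset assumed to be contained in spaCy's default list (it is not the full 326-word set), `remove_stop_words` is a hypothetical helper, and the tokens come from `str.split()` rather than from `nlp`.

```python
# Hand-picked subset for illustration only, not spaCy's full default list.
STOP_WORDS_SUBSET = {"the", "are", "on", "their", "for"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The striped bats are hanging on their feet for best".split()

filtered = remove_stop_words(tokens, STOP_WORDS_SUBSET)
print(filtered)

# Extending the set is a plain set operation, mirroring stop_words.add(...) above.
custom = STOP_WORDS_SUBSET | {"best"}
print(remove_stop_words(tokens, custom))
```

With a spaCy `Doc`, the same filter would typically be written against each token's `is_stop` attribute instead of a hand-built set.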