# 1. How to tokenize a given text?
text= "Last week, the University of Cambridge shared its own 
research that shows if everyone wears a mask outside home,dreaded 
‘second wave’ of the pandemic can be avoided."

In [1]:
# pip install nltk

In [2]:
import nltk

In [3]:
# Tokenization with nltk

text= """Last week, the University of Cambridge shared its own 
research that shows if everyone wears a mask outside home,dreaded 
‘second wave’ of the pandemic can be avoided."""

tokens=nltk.word_tokenize(text)
for token in tokens:
    print(token)

Last
week
,
the
University
of
Cambridge
shared
its
own
research
that
shows
if
everyone
wears
a
mask
outside
home
,
dreaded
‘
second
wave
’
of
the
pandemic
can
be
avoided
.
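To make the splitting behaviour above more concrete, here is a minimal sketch that approximates it with a single regular expression. The helper name `simple_tokenize` is an illustration, not part of NLTK, and `word_tokenize` itself applies more elaborate rules (for example around contractions), so treat this only as a rough stand-in.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs each punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "wears a mask outside home,dreaded ‘second wave’ of the pandemic."
tokens = simple_tokenize(sample)
print(tokens)
```

Note how `home,dreaded` is split into three tokens even though the source text has no space after the comma.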


# 2. How to tokenize a text using the `transformers` package?
text="I love spring season. I go hiking with my friends"

In [4]:
#!pip install transformers

In [5]:
# Import tokenizer from transformers
from transformers import AutoTokenizer

text="I love spring season. I go hiking with my friends"

# Initialize the tokenizer
tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')

# Encoding with the tokenizer
inputs=tokenizer.encode(text)
print(inputs)
tokenizer.decode(inputs)

[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]


'[CLS] i love spring season. i go hiking with my friends [SEP]'
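The ID list above can be read as: lowercase the text, split it into pieces, wrap it in the special `[CLS]` and `[SEP]` tokens, and look each piece up in a vocabulary. The sketch below imitates that with a tiny vocabulary reconstructed from the printed output. `toy_vocab` and `toy_encode` are hypothetical names, and the real tokenizer additionally breaks unknown words into subword pieces, which this sketch does not attempt.

```python
import re

# Tiny vocabulary slice taken from the IDs printed above; the real
# bert-base-uncased vocabulary is far larger.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "i": 1045, "love": 2293, "spring": 3500,
    "season": 2161, ".": 1012, "go": 2175, "hiking": 13039, "with": 2007,
    "my": 2026, "friends": 2814,
}

def toy_encode(text):
    # Lowercase (the "uncased" part), split words and punctuation,
    # then add the special start and end tokens before the lookup.
    # Words missing from toy_vocab raise KeyError in this sketch.
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return [toy_vocab[p] for p in ["[CLS]", *pieces, "[SEP]"]]

ids = toy_encode("I love spring season. I go hiking with my friends")
print(ids)
```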

# 3. How to remove stop words in a text ?
Remove all the stop words (‘a’, ‘the’, ‘was’, …) from the text.

In [6]:
# Removing stopwords in nltk

from nltk.corpus import stopwords

text="""the outbreak of coronavirus disease 2019 (COVID-19) has 
created a global health crisis that has had a deep impact on the 
way we perceive our world and our everyday lives. Not only the 
rate of contagion and patterns of transmission threatens our 
sense of agency, but the safety measures put in place to contain 
the spread of the virus also require social distancing by 
refraining from doing what is inherently human, which is to find 
solace in the company of others. Within this context of physical 
threat, social and physical distancing, as well as public alarm, 
what has been (and can be) the role of the different mass media 
channels in our lives on individual, social and societal levels? 
Mass media have long been recognized as powerful forces shaping 
how we experience the world and ourselves. This recognition is 
accompanied by a growing volume of research, that closely follows 
the footsteps of technological transformations (e.g. radio, 
movies, television, the internet, mobiles) and the zeitgeist 
(e.g. cold war, 9/11, climate change) in an attempt to map mass 
media major impacts on how we perceive ourselves, both as 
individuals and citizens. Are media (broadcast and digital) still 
able to convey a sense of unity reaching large audiences, or are 
messages lost in the noisy crowd of mass self-communication? """



my_stopwords=set(stopwords.words('english'))
new_tokens=[]

# Tokenization using word_tokenize()
all_tokens=nltk.word_tokenize(text)

for token in all_tokens:
    if token not in my_stopwords:
        new_tokens.append(token)


" ".join(new_tokens)

'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . Not rate contagion patterns transmission threatens sense agency , safety measures put place contain spread virus also require social distancing refraining inherently human , find solace company others . Within context physical threat , social physical distancing , well public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . This recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g . radio , movies , television , internet , mobiles ) zeitgeist ( e.g . cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . Are media ( broadcast digital ) still able convey sense unity reaching large audiences , messages lost noisy crowd mass self-communication ?'
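In the output above, capitalized words such as "Not", "This" and "Are" survive because the comparison is case-sensitive and the stop word list is lowercase. A minimal sketch of the difference, using a hypothetical miniature stop list (`mini_stopwords`) in place of `stopwords.words('english')`:

```python
# Hypothetical miniature stop list standing in for NLTK's English list
mini_stopwords = {"the", "of", "not", "only", "this", "is", "are", "a"}

tokens = ["Not", "only", "the", "rate", "of", "contagion", ".", "This", "recognition"]

# Case-sensitive test: "Not" and "This" are kept
case_sensitive = [t for t in tokens if t not in mini_stopwords]

# Lowercasing before the test removes them as well
case_insensitive = [t for t in tokens if t.lower() not in mini_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first depends on the task; the case-insensitive variant is usually what is wanted for bag-of-words features.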

# 4. How to remove punctuation?
text="The match has concluded !!! India has won the match . "

In [7]:
# Removing punctuation in nltk with RegexpTokenizer

text="The match has concluded !!! India has won the match . "

tokenizer=nltk.RegexpTokenizer(r"\w+")

tokens=tokenizer.tokenize(text)
" ".join(tokens)

'The match has concluded India has won the match'

In [8]:
import string

# Removing punctuation with string.punctuation
text="The match has concluded !!! India has won the match"

# Print every character that is not an ASCII punctuation mark
for i in text:
    if i not in string.punctuation:
        print(i,end ="")

The match has concluded  India has won the match
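The character loop above leaves a double space where `!!!` used to be. As an alternative sketch, `str.translate` can drop the punctuation in one pass and a split/join can tidy the whitespace. The variable names are illustrative, and `string.punctuation` covers ASCII punctuation only, so curly quotes would pass through untouched.

```python
import string

text = "The match has concluded !!! India has won the match . "

# The third argument of str.maketrans lists characters to delete
table = str.maketrans("", "", string.punctuation)
no_punct = text.translate(table)

# Collapse the whitespace left behind by the removed characters
cleaned = " ".join(no_punct.split())
print(cleaned)
```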

# 5. How to perform stemming?
text= """Dancing is an art. Students should be taught dance as a
subject in schools . I danced in many of my school function. Some
people are always hesitating to dance."""

In [9]:
# Stemming with nltk's PorterStemmer

from nltk.stem import PorterStemmer

text= """Dancing is an art. Students should be taught dance as a
subject in schools . I danced in many of my school function. Some
people are always hesitating to dance."""

stemmer=PorterStemmer()
stemmed_tokens=[]
for token in nltk.word_tokenize(text):
    stemmed_tokens.append(stemmer.stem(token))

" ".join(stemmed_tokens)

'danc is an art . student should be taught danc as a subject in school . i danc in mani of my school function . some peopl are alway hesit to danc .'
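To show why "Dancing", "danced" and "dance" all collapse to the same stem, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which uses several ordered rule phases and conditions on the remaining stem; `naive_stem` and its suffix list are made up for illustration.

```python
def naive_stem(word):
    # Toy rule set: strip the first matching suffix, but only if at
    # least three characters remain. This is NOT the Porter algorithm.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Dancing", "danced", "dance", "schools", "is"]:
    print(w, "->", naive_stem(w))
```

A rule set this small will mis-stem many words, which is the reason to rely on `PorterStemmer` in practice.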

# 6. How to extract usernames from emails?
 text= "The new registrations are potter709@gmail.com , 
elixir101@gmail.com. If you find any disruptions, kindly contact 
granger111@gamil.com or severus77@gamil.com "

In [10]:
# Using regular expression to extract usernames
import re  

text= """The new registrations are potter709@gmail.com , 
elixir101@gmail.com. If you find any disruptions, kindly contact 
granger111@gamil.com or severus77@gamil.com """

# \S matches any non-whitespace character
# + repeats the preceding pattern one or more times
# @ matches the literal "@" in the email address
# The parentheses capture only the part before the "@"
usernames= re.findall(r'(\S+)@', text)
print(usernames)

['potter709', 'elixir101', 'granger111', 'severus77']
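If the domain is needed as well, a second capture group can be added. The sketch below is one possible pattern under simplifying assumptions about which characters appear in usernames and domains; it is not a full email address validator.

```python
import re

text = """The new registrations are potter709@gmail.com , 
elixir101@gmail.com. If you find any disruptions, kindly contact 
granger111@gamil.com or severus77@gamil.com """

# Group 1: the username before "@"; group 2: the domain after it.
# The domain part requires at least one ".label", so the sentence-ending
# full stop after "gmail.com" is not swallowed.
pattern = re.compile(r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)")

pairs = pattern.findall(text)
usernames = [user for user, _ in pairs]
print(pairs)
```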


# 7. How to find the most common words in the text excluding stopwords?

In [11]:
text="""Junkfood - Food that do no good to our body. And there's no 
need of them in our body but still we willingly eat them because 
they are great in taste and easy to cook or ready to eat. Junk 
foods have no or very less nutritional value and irrespective of 
the way they are marketed, they are not healthy to consume.The 
only reason of their gaining popularity and increased trend of 
consumption is that they are ready to eat or easy to cook foods. 
People, of all age groups are moving towards Junkfood as it is 
hassle free and often ready to grab and eat. Cold drinks, chips, 
noodles, pizza, burgers, French fries etc. are few examples from 
the great variety of junk food available in the market. Junkfood 
is the most dangerous food ever but it is pleasure in eating and 
it gives a great taste in mouth examples of Junkfood are kurkure 
and chips.. cold rings are also source of junk food... they shud 
nt be ate in high amounts as it results fatal to our body... it 
cn be eated in a limited extend ... in research its found tht ths 
junk foods r very dangerous fr our health Junkfood is very 
harmful that is slowly eating away the health of the present 
generation. The term itself denotes how dangerous it is for our 
bodies. Most importantly, it tastes so good that people consume 
it on a daily basis. However, not much awareness is spread about 
the harmful effects of Junkfood ."""



# Tokenize, lowercase, and keep alphabetic tokens that are not stop words
my_stopwords=set(stopwords.words('english'))
words=[token.lower() for token in nltk.word_tokenize(text)
       if token.isalpha() and token.lower() not in my_stopwords]

# Count the remaining words and show the ten most frequent
nltk.FreqDist(words).most_common(10)

In [12]:
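import nltk
# Open the interactive NLTK downloader to fetch corpora such as 'stopwords' and 'brown'
nltk.download()
# Load the Brown corpus and list its words
from nltk.corpus import brown
brown.words()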

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
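Counting word frequencies must happen on whole words, not on individual characters. A minimal standard-library sketch of the idea, assuming a shortened sample sentence and a hypothetical miniature stop list (`mini_stopwords`) rather than NLTK's full list:

```python
import re
from collections import Counter

# Shortened sample in the spirit of the text above
sample = ("Junk foods have no or very less nutritional value. "
          "Junk food is dangerous and junk foods are harmful.")

# Hypothetical miniature stop list for illustration only
mini_stopwords = {"have", "no", "or", "very", "is", "and", "are"}

# Lowercase, extract alphabetic words, drop stop words, then count
words = re.findall(r"[a-z]+", sample.lower())
counts = Counter(w for w in words if w not in mini_stopwords)

print(counts.most_common(2))
```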

# 8. How to do spell correction in a given text?
text="He is a gret person. He beleives in bod"
Desired Output:
text="He is a great person. He believes in god"

In [13]:
# Import textblob
from textblob import TextBlob
text="He is a gret person. He beleives in bod"
# Using textblob's correct() function
text=TextBlob(text)
print(text.correct())
#> He is a great person. He believes in god

He is a great person. He believes in god
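TextBlob's documentation describes its spelling correction as based on Peter Norvig's spelling corrector. The sketch below illustrates that general idea only: generate every string one edit away and pick the most frequent known word. The word counts are hypothetical, and the function names are not TextBlob's API.

```python
from collections import Counter

# Hypothetical word counts; a real corrector derives them from a large corpus
word_counts = Counter({"great": 50, "believes": 20, "god": 30, "person": 40})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # All strings one edit away: deletions, swaps, replacements, insertions
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    # Known words are kept; otherwise choose the most frequent known neighbour
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    return max(candidates, key=word_counts.get) if candidates else word

print([correct(w) for w in ["gret", "beleives", "bod"]])
```

Because the choice depends entirely on the word counts, a valid but unintended word (such as "bod") is only changed when it is absent from the vocabulary.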


# 9. How to extract all the nouns in a text?
text="James works at Microsoft. She lives in manchester and likes 
to play the flute"

In [14]:
#!pip install textblob

In [15]:
#!python -m textblob.download_corpora

In [16]:
from textblob import TextBlob

text="""James works at Microsoft. She lives in manchester and likes 
to play the flute"""

blob = TextBlob(text)

blob.words # Word tokenization

WordList(['James', 'works', 'at', 'Microsoft', 'She', 'lives', 'in', 'manchester', 'and', 'likes', 'to', 'play', 'the', 'flute'])

In [17]:
blob.noun_phrases # Noun phrase extraction

WordList(['james', 'microsoft'])

# 10. How to find the cosine similarity of two documents?

In [18]:
# Using Vectorizer of sklearn to get vector representation

text1='Taj Mahal is a tourist place in India'
text2='Great Wall of China is a tourist place in china'


documents=[text1,text2]
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer=CountVectorizer()
matrix=vectorizer.fit_transform(documents)

# Obtaining the document-word matrix
doc_term_matrix=matrix.todense()
doc_term_matrix

# Computing cosine similarity
df=pd.DataFrame(doc_term_matrix)

from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df,df))

#> [[1.         0.45584231]
#> [0.45584231 1.        ]]

[[1.         0.45584231]
 [0.45584231 1.        ]]
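The off-diagonal value can be checked by hand: cosine similarity is the dot product of the two count vectors divided by the product of their lengths. The sketch below recomputes it with the standard library, assuming tokens are lowercased and single-character tokens such as "a" are dropped, which is meant to mirror `CountVectorizer`'s defaults. `count_vector` and `cosine` are illustrative helper names.

```python
import math
import re
from collections import Counter

text1 = 'Taj Mahal is a tourist place in India'
text2 = 'Great Wall of China is a tourist place in china'

def count_vector(text):
    # Lowercase and keep tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product over shared terms, divided by the two vector norms
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

similarity = cosine(count_vector(text1), count_vector(text2))
print(round(similarity, 8))
```

Here the documents share four terms ("is", "tourist", "place", "in"), and the vector lengths are √7 and √11, giving 4/√77.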
