Solving an NLP problem is a multi-stage process. We need to clean the unstructured text data first
before we can even think about getting to the modeling stage. Cleaning the data consists of a few key
steps:
1. Split the paragraph into sentences.


In [23]:
import nltk
from nltk.tokenize import sent_tokenize
txt= '''Are  you  fascinated  by  the  amount  of  text  data  available  on  the  internet?  Are  you 
looking  for  ways  to  work  with  this  text  data  but  aren’t  sure  where  to  begin? 
Machines, after all, recognize numbers, not the letters of our language. And that can 
be a tricky landscape to navigate in machine learning.'''
print(sent_tokenize(txt))

['Are  you  fascinated  by  the  amount  of  text  data  available  on  the  internet?', 'Are  you \nlooking  for  ways  to  work  with  this  text  data  but  aren’t  sure  where  to  begin?', 'Machines, after all, recognize numbers, not the letters of our language.', 'And that can \nbe a tricky landscape to navigate in machine learning.']
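
The sentences printed above still carry the doubled spaces and line breaks of the input string (note the `\n` inside the second and fourth sentences). A minimal sketch of collapsing whitespace before tokenizing, using only the standard library; the helper name `collapse_whitespace` is an assumption for illustration and is not part of NLTK:

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    # str.split() with no argument splits on any whitespace run
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


raw = "Are  you \nlooking  for  ways  to  work  with  this  text  data?"
print(collapse_whitespace(raw))
# prints: Are you looking for ways to work with this text data?
```

The cleaned string can then be passed to `sent_tokenize` in place of the raw text.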


2. Split the above paragraph into words.

In [22]:
from nltk.tokenize import word_tokenize
txt ='''Are  you  fascinated  by  the  amount  of  text  data  available  on  the  internet?  Are  you 
looking  for  ways  to  work  with  this  text  data  but  aren’t  sure  where  to  begin? 
Machines, after all, recognize numbers, not the letters of our language. And that can 
be a tricky landscape to navigate in machine learning.'''
print(word_tokenize(txt))

['Are', 'you', 'fascinated', 'by', 'the', 'amount', 'of', 'text', 'data', 'available', 'on', 'the', 'internet', '?', 'Are', 'you', 'looking', 'for', 'ways', 'to', 'work', 'with', 'this', 'text', 'data', 'but', 'aren', '’', 't', 'sure', 'where', 'to', 'begin', '?', 'Machines', ',', 'after', 'all', ',', 'recognize', 'numbers', ',', 'not', 'the', 'letters', 'of', 'our', 'language', '.', 'And', 'that', 'can', 'be', 'a', 'tricky', 'landscape', 'to', 'navigate', 'in', 'machine', 'learning', '.']
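
In the token list above, "aren’t" comes out as three fragments ('aren', '’', 't') because the text uses a typographic apostrophe (U+2019) rather than the ASCII one. One possible fix, sketched here with the standard library only, is to map curly quotes to their ASCII equivalents before tokenizing. The names `QUOTE_MAP` and `normalize_quotes` are assumptions for illustration:

```python
# Map typographic quotation marks to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("but aren\u2019t sure where to begin?"))
# prints: but aren't sure where to begin?
```

With a straight apostrophe, `word_tokenize` should treat the contraction as a unit to split Treebank-style instead of breaking it around an unknown symbol.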


3. Find the stem and the lemma of each of the following words:

"cats"
"trouble"
"troubling"
"troubled"
"having"
"Corriendo"
"at"
"was"

In [21]:
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
words = ['cats', 'trouble', 'troubling', 'troubled', 'having', 'Corriendo', 'at', 'was']
print("stemming for :")
for x in words:
    # stem() lowercases its input and strips suffixes by rule
    print("   {} is {}".format(x, porter_stemmer.stem(x)))

stemming for :
   cats is cat
   trouble is troubl
   troubling is troubl
   troubled is troubl
   having is have
   Corriendo is corriendo
   at is at
   was is wa
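
Stems such as "troubl" and "wa" are not dictionary words because a stemmer only strips suffixes by rule. The toy function below illustrates that idea with four hard-coded suffixes. It is a deliberately simplified sketch and not the Porter algorithm, which applies several rule phases with extra conditions; `toy_stem` and its suffix list are assumptions for illustration:

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


for w in ["cats", "trouble", "troubling", "troubled", "was", "at"]:
    print("{} -> {}".format(w, toy_stem(w)))
```

All three forms of "trouble" collapse to "troubl", which is the point of stemming: related word forms share one key. The toy rules diverge from Porter elsewhere; for example, they reduce "having" to "hav", whereas the Porter output above is "have".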


In [20]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
words = ['cats', 'trouble', 'troubling', 'troubled', 'having', 'Corriendo', 'at', 'was']
print("lemma for :")
for x in words:
    # lemmatize() treats each word as a noun unless a pos argument is given
    print("   {} is {}".format(x, wordnet_lemmatizer.lemmatize(x)))

lemma for :
   cats is cat
   trouble is trouble
   troubling is troubling
   troubled is troubled
   having is having
   Corriendo is Corriendo
   at is at
   was is wa
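
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is supplied, which is why verb forms such as "troubling" and "having" come back unchanged. The lookup table below is a toy illustration of how the lemma depends on the part of speech. It is a hypothetical stand-in with a handful of hand-written entries, not WordNet data:

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
# A real lemmatizer consults WordNet; these entries are for illustration only.
TOY_LEMMAS = {
    ("cats", "n"): "cat",
    ("was", "v"): "be",
    ("having", "v"): "have",
    ("troubling", "v"): "trouble",
    ("troubled", "v"): "trouble",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


for w in ["cats", "troubling", "having", "was"]:
    print("{:<10} noun: {:<10} verb: {}".format(
        w, toy_lemmatize(w, "n"), toy_lemmatize(w, "v")))
```

With NLTK itself, the analogous call is `wordnet_lemmatizer.lemmatize(x, pos='v')`, which looks the word up as a verb.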


4. Remove the stop words from the given paragraph.

In [18]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
paragraph = '''The NLTK library  is  one  of  the  oldest  and  most  commonly  used  Python  libraries  for 
Natural Language Processing. NLTK supports stop word removal, and you can find the list 
of stop words in the  corpus  module. To remove stop words from a sentence, you can divide 
your text into words and then remove the word if it exits in the list of stop words provided 
by NLTK.'''
 
# Use a distinct name so the imported stopwords module is not shadowed.
stop_words = set(stopwords.words('english'))
words = word_tokenize(paragraph)
# Compare in lowercase so capitalized stop words such as 'The' are removed too.
r = [x for x in words if x.lower() not in stop_words]
print(r)

['NLTK', 'library', 'one', 'oldest', 'commonly', 'used', 'Python', 'libraries', 'Natural', 'Language', 'Processing', '.', 'NLTK', 'supports', 'stop', 'word', 'removal', ',', 'find', 'list', 'stop', 'words', 'corpus', 'module', '.', 'remove', 'stop', 'words', 'sentence', ',', 'divide', 'text', 'words', 'remove', 'word', 'exits', 'list', 'stop', 'words', 'provided', 'NLTK', '.']
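
The filtered list above still contains punctuation tokens such as '.' and ','. The sketch below shows one way to drop both stop words and punctuation in a single pass. It uses a tiny hand-written stop list so that it runs without any NLTK data; `TOY_STOP_WORDS` and `remove_stop_words` are assumptions for illustration, and NLTK's English list is much longer:

```python
import string

# Tiny hand-picked stop list for illustration only.
TOY_STOP_WORDS = {"the", "is", "of", "and", "to", "a", "in", "you", "can"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Drop stop words (case-insensitively) and pure-punctuation tokens."""
    return [
        t for t in tokens
        if t.lower() not in stop_words
        and not all(ch in string.punctuation for ch in t)
    ]


tokens = ["The", "NLTK", "library", "is", "one", "of", "the", "oldest", "."]
print(remove_stop_words(tokens))
# prints: ['NLTK', 'library', 'one', 'oldest']
```

Passing `set(stopwords.words('english'))` as the `stop_words` argument would apply the same filter with NLTK's list.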


5. Print the frequency of each word longer than three characters in the above paragraph using NLTK.

In [16]:
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

paragraph='''The NLTK library  is  one  of  the  oldest  and  most  commonly  used  Python  libraries  for 
Natural Language Processing. NLTK supports stop word removal, and you can find the list 
of stop words in the  corpus  module. To remove stop words from a sentence, you can divide 
your text into words and then remove the word if it exits in the list of stop words provided 
by NLTK.'''
 
w = word_tokenize(paragraph)
d_a = FreqDist(w)
# Keep only tokens longer than three characters; this also drops punctuation.
f_w = {x: y for x, y in d_a.items() if len(x) > 3}
for key in sorted(f_w):
    print("%s: %s" % (key, f_w[key]))

Language: 1
NLTK: 3
Natural: 1
Processing: 1
Python: 1
commonly: 1
corpus: 1
divide: 1
exits: 1
find: 1
from: 1
into: 1
libraries: 1
library: 1
list: 2
module: 1
most: 1
oldest: 1
provided: 1
removal: 1
remove: 2
sentence: 1
stop: 4
supports: 1
text: 1
then: 1
used: 1
word: 2
words: 4
your: 1
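
The listing above is alphabetical and case-sensitive. For comparison, here is a sketch of the same count with the standard library's `collections.Counter`, lowercasing the tokens and printing the most frequent words first. The regular-expression tokenization is a simplification and is not equivalent to `word_tokenize`; the sample sentence is taken from the paragraph above:

```python
import re
from collections import Counter

text = "NLTK supports stop word removal, and you can find the list of stop words."

# Lowercase first, then keep runs of letters as tokens (a rough approximation).
tokens = re.findall(r"[a-z]+", text.lower())

# Count only tokens longer than three characters, as in the NLTK example.
counts = Counter(t for t in tokens if len(t) > 3)

for word, n in counts.most_common(3):
    print("{}: {}".format(word, n))
```

`FreqDist` is itself a `Counter` subclass, so `d_a.most_common(n)` gives the same frequency-ordered view on the NLTK side.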
