The Natural Language Toolkit (NLTK) is a comprehensive Python library for natural language processing and text analytics.

# Tokenization

Tokenization is usually the first step in building a text application such as a text analyzer or summarizer. It is the process of splitting a string into a list of pieces, or tokens.

In [8]:
#Reading a text file
#The with block closes the file automatically once it has been read
with open("para_meteor.txt", "r") as file:
    para = file.read()

In [9]:
#Sentence tokenization
#Importing the sentence tokenizer
from nltk.tokenize import sent_tokenize
#Splitting the paragraph into sentences with the sentence tokenization function
sent_tokens = sent_tokenize(para)
#The result is a list of sentences

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for
many European languages, so it knows which punctuation and characters mark the end of a
sentence and the beginning of a new one.

The instance used in sent_tokenize() is actually loaded on demand from a pickle
file. So if we are going to tokenize a large number of sentences, it is more efficient to load the
PunktSentenceTokenizer instance once and call its tokenize() method instead.

In [10]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(para)

['Scientists estimate that about 48.5 tons (44 tonnes or 44,000 kilograms) of meteoritic material falls on the Earth each day.',
 'Almost all the material is vaporized in Earth\'s atmosphere, leaving a bright trail fondly called "shooting stars."',
 'Several meteors per hour can usually be seen on any given night.',
 'Sometimes the number increases dramatically—these events are termed meteor showers.Meteor showes occur annually or at regular intervals as the Earth passes through the trail of dusty debris left by a comet.',
 'Meteor showers are usually named after a star or constellation that is close to where the meteors appear in the sky.',
 'Perhaps the most famous are the Perseids, which peak in August every year.',
 'Every Perseid meteor is a tiny piece of the comet Swift-Tuttle, which swings by the Sun every 135 years.',
 'Taking photographs of a meteor shower can be an exercise in patience as meteors streak across the sky quickly and unannounced,but with these tips – and some goo

In [11]:
#To tokenize sentences in another language, load the pickle file for that language
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')

['Hola amigo.', 'Estoy bien.']

In [12]:
#Now tokenizing a sentence into words with word_tokenize
from nltk.tokenize import word_tokenize
#Tokenizing the first sentence into words
word_tokens = word_tokenize(sent_tokens[0])
word_tokens

['Scientists',
 'estimate',
 'that',
 'about',
 '48.5',
 'tons',
 '(',
 '44',
 'tonnes',
 'or',
 '44,000',
 'kilograms',
 ')',
 'of',
 'meteoritic',
 'material',
 'falls',
 'on',
 'the',
 'Earth',
 'each',
 'day',
 '.']

The word_tokenize() function is a wrapper function that calls tokenize() on an
instance of a Treebank-style word tokenizer (the TreebankWordTokenizer class in older NLTK releases).

In [13]:
from nltk.tokenize import TreebankWordTokenizer
#It separates words on spaces and punctuation; punctuation is not discarded but kept as tokens, so we can decide whether to keep it
tokenizer = TreebankWordTokenizer()
#It also splits contractions into separate tokens
w_t = tokenizer.tokenize(sent_tokens[0])
w_t

['Scientists',
 'estimate',
 'that',
 'about',
 '48.5',
 'tons',
 '(',
 '44',
 'tonnes',
 'or',
 '44,000',
 'kilograms',
 ')',
 'of',
 'meteoritic',
 'material',
 'falls',
 'on',
 'the',
 'Earth',
 'each',
 'day',
 '.']

In [27]:
#One of the tokenizer's most significant conventions is to split contractions (e.g. "can't" becomes "ca" and "n't"); other punctuation becomes separate tokens
word_tokenize("shouldn't and, can't!")

['should', "n't", 'and', ',', 'ca', "n't", '!']

In [28]:
#An alternative word tokenizer is WordPunctTokenizer
#It splits all punctuation into separate tokens, including the apostrophe inside contractions
from nltk.tokenize import WordPunctTokenizer
tokenizer_W = WordPunctTokenizer()
tokenizer_W.tokenize("Shouldn't and, can't!")

['Shouldn', "'", 't', 'and', ',', 'can', "'", 't', '!']
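To make the difference between the two word tokenizers easier to see, the following sketch runs both on the same input. It uses only classes that work without downloaded models, and the sample string is the one from the cell above.

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

sample = "Shouldn't and, can't!"

# Treebank-style tokenization splits a contraction into a stem and "n't"
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

# WordPunctTokenizer splits on every run of punctuation, so the apostrophe
# becomes a token of its own
wordpunct_tokens = WordPunctTokenizer().tokenize(sample)

print(treebank_tokens)
print(wordpunct_tokens)
```

Which one to prefer depends on whether later steps expect contractions such as "n't" to stay together as a single token.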

In [24]:
#Using a regular expression for tokenization lets us tokenize text exactly the way we want
#Different tokenizers are appropriate for different use cases
from nltk.tokenize import RegexpTokenizer
#Creating an instance of the regular expression tokenizer and passing it the pattern that matches a token
#A raw string keeps Python from treating \w as a string escape sequence
tokenizer = RegexpTokenizer(r"[\w']+")
#tokenizer.tokenize("Shouldn't couldn't")
#RegexpTokenizer can also work by matching the gaps, as opposed to the tokens

#If we don't want to create an instance of the RegexpTokenizer class, we can use the regexp_tokenize() helper function
#Internally the tokenizer uses re.findall() to match tokens, or re.split() to match the gaps on which to separate the tokens
from nltk.tokenize import regexp_tokenize
regexp_tokenize("shouldn't couldn't", r"[\w']+")

["shouldn't", "couldn't"]

In [30]:
#Using a simple whitespace tokenizer
#The gaps=True parameter means that the pattern identifies the gaps to split on. With gaps=False, the pattern would identify the tokens themselves
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("couldn't , shouldn't")

["couldn't", ',', "shouldn't"]

The default tokenizers usually work very well, but sometimes we want text to be split on specific patterns of our own. In those cases we use the regex tokenizer.
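The two modes of RegexpTokenizer can be compared side by side. This is a minimal sketch: the `\S+` token pattern is an illustrative choice that mirrors the `\s+` gap pattern used above, and the `re` calls are shown only as rough standard-library equivalents.

```python
import re
from nltk.tokenize import RegexpTokenizer

text = "couldn't , shouldn't"

# gaps=True: the pattern describes the separators between tokens
by_gaps = RegexpTokenizer(r"\s+", gaps=True).tokenize(text)

# gaps=False (the default): the pattern describes the tokens themselves
by_tokens = RegexpTokenizer(r"\S+").tokenize(text)

# Roughly equivalent calls from the standard library
split_result = re.split(r"\s+", text)
findall_result = re.findall(r"\S+", text)

print(by_gaps)      # ["couldn't", ',', "shouldn't"]
print(by_tokens)    # ["couldn't", ',', "shouldn't"]
```

For this input all four approaches give the same list; they differ only in whether the pattern is written for the tokens or for the text between them.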

NLTK provides a PunktSentenceTokenizer class that we can train on raw text to produce
a custom sentence tokenizer. We can get raw text either by reading in a file, or from an NLTK
corpus using the raw() method.

In [32]:
nltk.download('webtext')

[nltk_data] Downloading package webtext to
[nltk_data]     /home/bluebrain/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


True

In [40]:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
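#Note: calling PunktSentenceTokenizer() with no arguments creates a tokenizer with default parameters
#To train on the raw text, pass it to the constructor: PunktSentenceTokenizer(text)
sent_tokenizer = PunktSentenceTokenizer()
sents = sent_tokenizer.tokenize(text)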

In [48]:
sents[678]

'I only have a dollar...Can you spare some change?'

In [49]:
#Using the default pre-trained sentence tokenizer for comparison
from nltk.tokenize import sent_tokenize
sent_tokens = sent_tokenize(text)
sent_tokens[678]

'Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.'

The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn
what constitutes a sentence break. It is unsupervised because you don't have to give it any
labeled training data, just raw text.
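As a minimal sketch of that training step, the text can be passed straight to the constructor. The training string below is a made-up stand-in for a real corpus, and the variable names are hypothetical; with so little text the learned parameters will be far weaker than those from a full corpus.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical raw training text; in practice this would be a much larger corpus
train_text = (
    "Girl: Do you have any change? Hobo: I only have a dollar. "
    "Girl: But you already have a Big Mac. Hobo: Oh, this is all theatrical. "
    "Guy: That was a long day. Girl: It was. Let's go home."
)

# Passing text to the constructor trains the tokenizer on it (no labels needed)
custom_tokenizer = PunktSentenceTokenizer(train_text)

sentences = custom_tokenizer.tokenize("I missed the train. Can you pick me up?")
for sentence in sentences:
    print(sentence)
```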

# Stopwords Removal

After tokenization, the next step is to remove stopwords. These are words that do not contribute to the meaning of a sentence,
at least for the purposes of information retrieval and natural language processing.

Many search engines filter stopwords out of search queries and documents in order to save space in their index.

In [50]:
#The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader
from nltk.corpus import stopwords
#Creating a set of all the stopwords in the English language (words of little importance)
stop_words = set(stopwords.words("english"))
#Now removing all the stopwords from a text
text = "Hi my name is sourabh, and i am Data Science Lover"
list_words = word_tokenize(text)
list_stopwords_removed = [word for word in list_words if word not in stop_words]
text_filtered = " ".join(list_stopwords_removed)
print(text_filtered)

Hi name sourabh , Data Science Lover
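The filter above compares tokens case-sensitively, and the NLTK English stopwords are all lowercase, so a capitalized token such as "I" would not be removed. The sketch below shows one way to make the comparison case-insensitive. The helper name and the tiny stop list are illustrative assumptions, used here so the example runs without downloading the stopwords corpus.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]

# A tiny hand-written stop list for illustration only; the NLTK list is much longer
sample_stop_words = {"i", "my", "is", "and", "am", "a"}

tokens = ["Hi", "my", "name", "is", "sourabh", ",", "and", "I", "am",
          "a", "Data", "Science", "Lover"]
filtered = remove_stopwords(tokens, sample_stop_words)
print(" ".join(filtered))   # Hi name sourabh , Data Science Lover
```

In a real pipeline, `sample_stop_words` would be replaced by `set(stopwords.words("english"))`.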


In [51]:
#Viewing the English stopwords
stopwords.words("english")
#Stopwords for other languages can be listed in the same way

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

# Looking up synsets for a word in WordNet

WordNet is a lexical database for the English language. In other words, it's a dictionary
designed specifically for natural language processing.

A synset is a group of synonymous words that express the same concept.

In [64]:
from nltk.corpus import wordnet
syn = wordnet.synsets("core")[0]
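#Some synsets also provide usage examples through the examples() method
#wordnet.synsets() returns a list of Synset objects for the word; we take the first one
#syn.name()
#Looking up the definition of the first synset for "core"; other words work the same way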
syn.definition()

'a small group of indispensable persons or things'

Synsets are organized in a structure similar to that of an inheritance tree. More abstract terms
are known as hypernyms and more specific terms are hyponyms. This tree can be traced all
the way up to a root hypernym.

Hypernyms provide a way to categorize and group words based on their similarity to each
other. Similarity between two synsets can be calculated from the distance between them
in the hypernym tree.

In [62]:
syn.hypernyms()

[Synset('set.n.01')]

In [63]:
syn.hypernyms()[0].hyponyms()

[Synset('bracket.n.01'),
 Synset('chess_set.n.01'),
 Synset('choir.n.02'),
 Synset('conjugation.n.03'),
 Synset('core.n.01'),
 Synset('dentition.n.02'),
 Synset('field.n.12'),
 Synset('field.n.13'),
 Synset('field.n.15'),
 Synset('intersection.n.04'),
 Synset('manicure_set.n.01'),
 Synset('octet.n.03'),
 Synset('pair.n.01'),
 Synset('portfolio.n.02'),
 Synset('quartet.n.03'),
 Synset('quintet.n.04'),
 Synset('score.n.04'),
 Synset('septet.n.03'),
 Synset('sextet.n.04'),
 Synset('singleton.n.02'),
 Synset('suite.n.04'),
 Synset('synset.n.01'),
 Synset('threescore.n.01'),
 Synset('trio.n.04'),
 Synset('union.n.08')]

In [65]:
syn.root_hypernyms()

[Synset('entity.n.01')]

In [66]:
syn.hypernym_paths()

[[Synset('entity.n.01'),
  Synset('abstraction.n.06'),
  Synset('group.n.01'),
  Synset('collection.n.01'),
  Synset('set.n.01'),
  Synset('core.n.01')]]

The hypernym_paths() method returns a list of lists, where each list starts at the root
hypernym and ends with the original Synset. Most of the time, you'll only get one nested
list of Synsets.
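To illustrate how distance in the hypernym tree can be turned into a similarity score, here is a small sketch in plain Python that needs no WordNet data. The path for core.n.01 is copied from the output above. The path for pair.n.01 is an assumption: it supposes that pair.n.01 (listed above as a hyponym of set.n.01) has a single hypernym path through that synset. The `path_distance` helper is hypothetical, and scoring with 1 / (distance + 1) is similar in spirit to WordNet path similarity, not a copy of NLTK's implementation.

```python
# Hypernym paths written as plain lists of synset names, root first.
# core_path comes from the hypernym_paths() output above; pair_path is assumed.
core_path = ["entity.n.01", "abstraction.n.06", "group.n.01",
             "collection.n.01", "set.n.01", "core.n.01"]
pair_path = ["entity.n.01", "abstraction.n.06", "group.n.01",
             "collection.n.01", "set.n.01", "pair.n.01"]

def path_distance(path_a, path_b):
    """Count the edges between the ends of two root-first hypernym paths."""
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    # Steps up from each end to the deepest shared hypernym
    return (len(path_a) - shared) + (len(path_b) - shared)

distance = path_distance(core_path, pair_path)
similarity = 1 / (distance + 1)
print(distance, similarity)
```

Two synsets that share a direct hypernym are two edges apart, so they score 1/3; a synset compared with itself has distance 0 and scores 1.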