Algorithm for Tokenization, POS Tagging, Stop Word Removal, Stemming, and Lemmatization:

Step 1: Download the required packages

In [1]:
import nltk

In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Step 2: Initialize the text

In [4]:
text= "Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization."

Step 3: Perform Tokenization


In [5]:
# Sentence tokenization
from nltk.tokenize import sent_tokenize
tokenized_text = sent_tokenize(text)
print(tokenized_text)
# Word tokenization
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)
print(tokenized_word)


['Tokenization is the first step in text analytics.', 'The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization.']
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'analytics', '.', 'The', 'process', 'of', 'breaking', 'down', 'a', 'text', 'paragraph', 'into', 'smaller', 'chunks', 'such', 'as', 'words', 'or', 'sentences', 'is', 'called', 'Tokenization', '.']
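
To make the idea behind the two tokenizers concrete, the following is a minimal sketch using only regular expressions. It is not the algorithm NLTK uses (its sentence tokenizer is a trained Punkt model), and the naive sentence rule here would break on abbreviations such as "Dr."; it only illustrates what "splitting into sentences and words" means for this sample text.

```python
import re

text = ("Tokenization is the first step in text analytics. "
        "The process of breaking down a text paragraph into smaller chunks "
        "such as words or sentences is called Tokenization.")

# Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

print(len(sentences))
print(words[:9])
```

For this particular text the regular expressions yield the same two sentences and the same tokens as the NLTK output above, but that agreement should not be expected on arbitrary input.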


Step 4: Remove Punctuation and Stop Words

In [6]:
# Print the English stop words, then filter them out of a sample sentence
import re
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)
text = "How to remove stop words with NLTK library in Python?"
# Replace every non-letter character (punctuation, digits) with a space
text = re.sub(r'[^a-zA-Z]', ' ', text)
tokens = word_tokenize(text.lower())
filtered_text = []
for w in tokens:
    if w not in stop_words:
        filtered_text.append(w)
print("Tokenized Sentence:", tokens)
print("Filtered Sentence:", filtered_text)

{'all', 'both', 'mustn', 'from', 'at', 'your', "don't", 'under', 'further', "hadn't", 'between', 'through', 'here', 's', 'haven', "wouldn't", 'by', 'was', 'why', 'am', 'itself', "hasn't", 'not', 'be', 'on', 'what', "she's", "mightn't", 'those', 'its', 'having', 'which', 'doesn', 'as', 'herself', 'yourselves', 'that', 'had', 'after', 'theirs', 'an', 'when', 'out', 'down', 'of', 'any', 'to', 'couldn', 'where', 't', "wasn't", 'hadn', "should've", 'over', 'mightn', 'this', 'aren', 'you', 'few', 'some', 'these', 'just', 'whom', 'does', 'it', 'once', 'own', 'too', 'didn', 'wouldn', 'nor', "it's", "doesn't", 'his', 'below', 'how', 'hers', 'other', "you're", 'because', 'same', 'll', 'but', 'him', 'isn', 'no', 'ma', 'me', 'before', 'then', 'above', 'have', 'will', 'y', "you'd", 'against', "isn't", 'a', 'the', 'do', 'there', 'm', 'who', "haven't", 'shouldn', 've', 'o', 'd', "you've", 'off', "didn't", "won't", 'don', 'during', 'most', 'been', 'won', 'should', "mustn't", 'hasn', 're', 'we', "shan'
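
The filtering step can also be sketched without NLTK, which makes it easier to see what happens to the sample sentence. The stop-word set below is a small hand-picked subset chosen for illustration (an assumption, not NLTK's full English list), and `remove_stop_words` is a hypothetical helper name.

```python
import re

# Illustrative subset only; NLTK's English stop-word list is much longer.
STOP_WORDS = {"how", "to", "with", "in", "the", "a", "is"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # Replace every non-letter with a space, lowercase, then split on whitespace.
    cleaned = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return [w for w in cleaned.split() if w not in stop_words]

filtered = remove_stop_words("How to remove stop words with NLTK library in Python?")
print(filtered)
```

With this subset, the words "how", "to", "with", and "in" are dropped and the remaining six words are kept in their original order.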

Step 5: Perform Stemming

In [7]:
from nltk.stem import PorterStemmer
e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    # Print inside the loop so every word's stem is shown, not only the last one
    print(rootWord)


wait
wait
wait
wait
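
As a rough illustration of what a stemmer does, the sketch below strips a few common suffixes. This is a deliberately naive toy, not the Porter algorithm (which applies ordered rewrite rules with extra conditions), and `naive_stem` is a hypothetical name introduced here.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["wait", "waiting", "waited", "waits"]]
print(stems)
```

All four inputs reduce to "wait" here, but a rule this simple would mangle many other words, which is why a real stemmer is used in the step above.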


Step 6: Perform Lemmatization

In [8]:
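from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
# lemmatize() treats each word as a noun unless a pos argument is given,
# which is why 'studying' is returned unchanged below
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))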


Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


Step 7: Apply POS Tagging to text

In [9]:
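import nltk
from nltk.tokenize import word_tokenize
data = "The pink sweater fit her perfectly"
words = word_tokenize(data)
# Each word is tagged on its own here, so the tagger sees no sentence context
# (note 'pink' and 'fit' both come out as NN below); calling nltk.pos_tag(words)
# once on the full token list lets it use the surrounding words.
for word in words:
    print(nltk.pos_tag([word]))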


[('The', 'DT')]
[('pink', 'NN')]
[('sweater', 'NN')]
[('fit', 'NN')]
[('her', 'PRP$')]
[('perfectly', 'RB')]


Algorithm for creating a document representation by calculating TF-IDF:

Step 1: Import the necessary libraries.

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Step 2: Initialize the Documents

In [11]:
doc_a = 'Jupiter is the largest Planet'
doc_b = 'Mars is the fourth planet from the Sun'


Step 3: Create a Bag of Words (BoW) for Documents A and B.

In [12]:
bag_of_words_a = doc_a.split(' ')
bag_of_words_b = doc_b.split(' ')


Step 4: Create a collection of unique words from Documents A and B.

In [13]:
unique_words_set = set(bag_of_words_a).union(set(bag_of_words_b))
print(unique_words_set)

{'largest', 'Mars', 'Sun', 'from', 'planet', 'fourth', 'Planet', 'is', 'the', 'Jupiter'}
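
Note that the set above contains both 'planet' and 'Planet', because `split` is case-sensitive. If the two should count as the same word, one option (an assumption about the desired behavior, not something the steps here do) is to lowercase each document before splitting, as in this sketch:

```python
doc_a = "Jupiter is the largest Planet"
doc_b = "Mars is the fourth planet from the Sun"

# Lowercase first so that 'Planet' and 'planet' collapse into one entry.
tokens_a = doc_a.lower().split()
tokens_b = doc_b.lower().split()

# Sort only to get a stable, readable order; sets are unordered.
vocabulary = sorted(set(tokens_a) | set(tokens_b))
print(vocabulary)
```

This reduces the vocabulary from ten entries to nine. The remaining steps keep the case-sensitive vocabulary as originally built.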


Step 5: Create a dictionary of words and their occurrence counts for each document in the corpus

In [14]:

# Start every word in the shared vocabulary at zero
dict_a = dict.fromkeys(unique_words_set, 0)

# Count how often each word occurs in Document A
for word in bag_of_words_a:
    dict_a[word] += 1

print(dict_a)

# Repeat for Document B
dict_b = dict.fromkeys(unique_words_set, 0)
for word in bag_of_words_b:
    dict_b[word] += 1

print(dict_b)

{'largest': 1, 'Mars': 0, 'Sun': 0, 'from': 0, 'planet': 0, 'fourth': 0, 'Planet': 1, 'is': 1, 'the': 1, 'Jupiter': 1}
{'largest': 0, 'Mars': 1, 'Sun': 1, 'from': 1, 'planet': 1, 'fourth': 1, 'Planet': 0, 'is': 1, 'the': 2, 'Jupiter': 0}
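
The same per-document counts can be obtained with `collections.Counter` from the standard library. The sketch below is an alternative to the explicit loop, not a replacement for it: a `Counter` stores only the words that actually occur, and looking up a missing word returns 0 instead of raising `KeyError`.

```python
from collections import Counter

doc_b = "Mars is the fourth planet from the Sun"

# Counter tallies each token in a single pass.
counts_b = Counter(doc_b.split())

print(counts_b["the"])      # a word that occurs twice in Document B
print(counts_b["Jupiter"])  # a word absent from Document B
```

The explicit dictionaries above remain useful when every document must share exactly the same keys, for example when building a table with one column per vocabulary word.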


Step 6: Compute the term frequency for each of our documents.

In [22]:
def compute_term_frequency(word_dictionary, bag_of_words):
    term_frequency_dictionary = {}
    length_of_bag_of_words = len(bag_of_words)

    for word, count in word_dictionary.items():
        term_frequency_dictionary[word] = count / float(length_of_bag_of_words)

    return term_frequency_dictionary

# Implementation

print(compute_term_frequency(dict_a, bag_of_words_a))
print(compute_term_frequency(dict_b, bag_of_words_b))

{'largest': 0.2, 'Mars': 0.0, 'Sun': 0.0, 'from': 0.0, 'planet': 0.0, 'fourth': 0.0, 'Planet': 0.2, 'is': 0.2, 'the': 0.2, 'Jupiter': 0.2}
{'largest': 0.0, 'Mars': 0.125, 'Sun': 0.125, 'from': 0.125, 'planet': 0.125, 'fourth': 0.125, 'Planet': 0.0, 'is': 0.125, 'the': 0.25, 'Jupiter': 0.0}
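
As a worked check of the values above, the term frequency computed by `compute_term_frequency` is the word's count divided by the number of words in the document. Document A has 5 words and Document B has 8, so:

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{tf}(\text{Jupiter}, A) = \frac{1}{5} = 0.2,
\qquad
\mathrm{tf}(\text{the}, B) = \frac{2}{8} = 0.25
```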


Step 7: Compute the Inverse Document Frequency (IDF).

In [16]:
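import math

def compute_inverse_document_frequency(full_doc_list):
    length_of_doc_list = len(full_doc_list)

    # Document frequency: the number of documents in which each word occurs
    idf_dict = dict.fromkeys(full_doc_list[0].keys(), 0)
    for doc in full_doc_list:
        for word, count in doc.items():
            if count > 0:
                idf_dict[word] += 1

    # IDF = ln(N / document frequency); every vocabulary word occurs in at
    # least one document here, so the denominator is never zero
    for word, doc_freq in idf_dict.items():
        idf_dict[word] = math.log(length_of_doc_list / float(doc_freq))

    return idf_dict

final_idf_dict = compute_inverse_document_frequency([dict_a, dict_b])
print(final_idf_dict)


{'largest': 0.6931471805599453, 'Mars': 0.6931471805599453, 'Sun': 0.6931471805599453, 'from': 0.6931471805599453, 'planet': 0.6931471805599453, 'fourth': 0.6931471805599453, 'Planet': 0.6931471805599453, 'is': 0.0, 'the': 0.0, 'Jupiter': 0.6931471805599453}


Step 8: Compute the TF-IDF for all words.

In [27]:
def compute_term_frequency_inverse_document_frequency(word_counts, final_idf_dict):
    tfidf = {}
    for word, val in word_counts.items():
        tfidf[word] = val * final_idf_dict[word]
    # Return only after every word has been processed
    return tfidf

# dict_a and dict_b hold raw counts, so the raw count serves as the term-frequency weight here
tfidfdict_a = compute_term_frequency_inverse_document_frequency(dict_a, final_idf_dict)
tfidfdict_b = compute_term_frequency_inverse_document_frequency(dict_b, final_idf_dict)

print(tfidfdict_a)
print(tfidfdict_b)
df = pd.DataFrame([tfidfdict_a, tfidfdict_b])
df

{'largest': 0.6931471805599453, 'Mars': 0.0, 'Sun': 0.0, 'from': 0.0, 'planet': 0.0, 'fourth': 0.0, 'Planet': 0.6931471805599453, 'is': 0.0, 'the': 0.0, 'Jupiter': 0.6931471805599453}
{'largest': 0.0, 'Mars': 0.6931471805599453, 'Sun': 0.6931471805599453, 'from': 0.6931471805599453, 'planet': 0.6931471805599453, 'fourth': 0.6931471805599453, 'Planet': 0.0, 'is': 0.0, 'the': 0.0, 'Jupiter': 0.0}


,largest,Mars,Sun,from,planet,fourth,Planet,is,the,Jupiter
0,0.693147,0.0,0.0,0.0,0.0,0.0,0.693147,0.0,0.0,0.693147
1,0.0,0.693147,0.693147,0.693147,0.693147,0.693147,0.0,0.0,0.0,0.0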
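
For comparison, the whole pipeline can be condensed into one function. The sketch below is one possible formulation rather than the method used in the steps above: it assumes lowercased tokens, a length-normalized term frequency (count divided by document length), and IDF defined as ln(N / document frequency). The name `tf_idf` is hypothetical, and each result holds only the words that occur in that document.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = TF * IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

scores = tf_idf(["Jupiter is the largest Planet",
                 "Mars is the fourth planet from the Sun"])
print(scores[0])
print(scores[1])
```

Under these assumptions, words shared by both documents ("is", "the", and, after lowercasing, "planet") receive a weight of zero, while words unique to one document receive a positive weight scaled by that document's length.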
