# Text Analytics
1. Extract a sample document and apply the following document preprocessing methods:
tokenization, POS tagging, stop-word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency (TF) and Inverse Document
Frequency (IDF).

### Notes
NLTK (Natural Language Toolkit) is a widely used Python library for NLP (Natural Language Processing).
It provides tools to preprocess text data for further analysis, for instance with ML models.
Preprocessing is the first step towards converting text into numbers, which a model can then work with.

### 1. Tokenization
One of the most basic preprocessing steps is dividing a body of text into words or sentences. This is called tokenization.
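
As a rough illustration of the idea, the sketch below splits text with regular expressions from the Python standard library. It only approximates what a tokenizer does and is not how NLTK's tokenizers work internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical.

```python
import re

def simple_word_tokenize(text):
    # Runs of word characters become tokens; any other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when the mark is followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "I like cricket. PICT is in Pune."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a library tokenizer for real text.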

### 2. Stop-words
Stop-words are words that carry little meaning on their own, for instance ‘and’, ‘a’, ‘it's’, ‘they’, etc. They matter when we communicate with each other, but for analysis by a computer they are usually not very useful, so they are often removed.

### 3. Stemming vs. Lemmatization
- Stemming removes the last few characters from a word, which often leads to incorrect meanings and spellings.
    - A stemmer that simply strips the suffix ‘ing’ would turn ‘Caring’ into ‘Car’.
- Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
    - Lemmatizing the word ‘Caring’ would return ‘Care’.
- Stemming is used for large datasets where performance is an issue, because lemmatization is computationally expensive.
- The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.
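
The contrast can be sketched without any library. The code below is a toy illustration under stated assumptions: `naive_stem` is a hypothetical suffix stripper (much cruder than the Porter stemmer), and `LEMMAS` is a tiny hand-written dictionary standing in for a lexical knowledge base such as WordNet.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    # Strip the first matching suffix without checking that the result is a real word
    lower = word.lower()
    for suffix in suffixes:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)]
    return lower

# Hand-written lexicon (an assumption made for this example only)
LEMMAS = {"caring": "care", "leaves": "leaf", "studies": "study"}

def lookup_lemmatize(word):
    # Return the dictionary base form, or the word itself when it is unknown
    lower = word.lower()
    return LEMMAS.get(lower, lower)

for w in ["Caring", "leaves", "studies"]:
    print("{0:10}{1:10}{2:10}".format(w, naive_stem(w), lookup_lemmatize(w)))
```

The stemmer is a handful of string operations, while the lemmatizer is only as good as its dictionary, which mirrors the speed versus accuracy trade-off described above.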

### 4. POS tagging: Part of Speech Tagging
- Part of Speech tagging (POS tagging) labels each word in a text according to its word type (noun, adjective, adverb, verb, etc.).
- The pos_tag() function takes a list of tokenized words and returns a list of (word, tag) tuples, one Part of Speech tag per word.
- For example, in the tag set returned by pos_tag():
    - VB refers to a ‘verb’,
    - NNS refers to a ‘plural noun’,
    - DT refers to a ‘determiner’.
- The main word classes, with examples:
    - Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
    - Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
    - Adjective(ADJ)- big, happy, green, young, fun, crazy, three
    - Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
    - Preposition (P)- at, on, in, from, with, near, between, about, under
    - Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
    - Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
    - Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!
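
To relate the short tags (VB, NNS, DT, ...) to the word classes listed above, a small lookup by tag prefix is often enough. The sketch below is an assumption-laden helper, not part of NLTK: `coarse_pos` is a hypothetical name, and the prefix table covers only the most common Penn Treebank tags.

```python
def coarse_pos(tag):
    # Map a Penn Treebank tag to a coarse word class using its prefix
    prefixes = [
        ("NN", "noun"),          # NN, NNS, NNP, NNPS
        ("VB", "verb"),          # VB, VBD, VBG, VBN, VBP, VBZ
        ("JJ", "adjective"),     # JJ, JJR, JJS
        ("RB", "adverb"),        # RB, RBR, RBS
        ("PRP", "pronoun"),      # PRP, PRP$
        ("DT", "determiner"),
        ("IN", "preposition"),   # IN also covers subordinating conjunctions
        ("CC", "conjunction"),   # coordinating conjunctions
        ("UH", "interjection"),
    ]
    for prefix, name in prefixes:
        if tag.startswith(prefix):
            return name
    return "other"

tagged = [("My", "PRP$"), ("name", "NN"), ("like", "VBP"), ("top", "JJ")]
for word, tag in tagged:
    print(word, tag, coarse_pos(tag))
```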

# Tokenization

In [1]:
import numpy as np
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize
# nltk.download('punkt')
# nltk.download('all')

In [2]:
string1 = "My name is kaveri vinod raut. I am computers engineering students studing at PICT pune under SPPU. I likes playing cricket. PICT is top institute in pune under SPPU. I am having caring nature. I think guitar is  interesting instrument."
token_string1 = nltk.word_tokenize(string1)
print("Word Tokenizer: ",token_string1)

string2 = "My name is Krushna vinod raut. I am caring IT engineer working in Citi Bank."
token_string2 = nltk.sent_tokenize(string2)
print("Sent Tokenizer: ",token_string2)

Word Tokenizer:  ['My', 'name', 'is', 'kaveri', 'vinod', 'raut', '.', 'I', 'am', 'computers', 'engineering', 'students', 'studing', 'at', 'PICT', 'pune', 'under', 'SPPU', '.', 'I', 'likes', 'playing', 'cricket', '.', 'PICT', 'is', 'top', 'institute', 'in', 'pune', 'under', 'SPPU', '.', 'I', 'am', 'having', 'caring', 'nature', '.', 'I', 'think', 'guitar', 'is', 'interesting', 'instrument', '.']
Sent Tokenizer:  ['My name is Krushna vinod raut.', 'I am caring IT engineer working in Citi Bank.']


# Stop Words

In [3]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# collect all the stop words from 'english' language
all_stopwords = stopwords.words('english')
# print(all_stopwords)

# remove the stop words from token_string1 (the text must be tokenized first)
# and collect the remaining words in nonstop_string1
# note: the comparison is case-sensitive, so capitalised tokens such as 'My' and 'I' are kept
nonstop_string1 = ""
with_stopword_string = ""
stopword_found = []

for word in token_string1:  # iterate over the tokenized words
    with_stopword_string += word + " "
    if word in all_stopwords:  # stop word: record it in the stopword_found list
        stopword_found.append(word)
    else:                      # not a stop word: append it to nonstop_string1
        nonstop_string1 += word + " "

# print the tokenized words joined back into a string, stop words included
print("initial String :\n",with_stopword_string,"\n")

# print the string that remains after removing the stop words
print("after removal of stopword:\n",nonstop_string1,"\n")

# print the stop words found in token_string1
print("Stop words removed are: ",stopword_found)

initial String :
 My name is kaveri vinod raut . I am computers engineering students studing at PICT pune under SPPU . I likes playing cricket . PICT is top institute in pune under SPPU . I am having caring nature . I think guitar is interesting instrument .  

after removal of stopword:
 My name kaveri vinod raut . I computers engineering students studing PICT pune SPPU . I likes playing cricket . PICT top institute pune SPPU . I caring nature . I think guitar interesting instrument .  

Stop words removed are:  ['is', 'am', 'at', 'under', 'is', 'in', 'under', 'am', 'having', 'is']
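
In the output above, 'My' and 'I' survive because the membership test is case-sensitive and the stop-word list is lowercase. A possible variant, sketched here with a small hand-picked stop-word set (an assumption for illustration; in the notebook it would be `stopwords.words('english')`), compares the lower-cased token instead. The function name `remove_stopwords` is hypothetical.

```python
# Hand-picked stop words used only for this example
demo_stopwords = {"my", "i", "is", "am", "at", "in", "under", "having"}

def remove_stopwords(tokens, stop_set):
    # Compare the lower-cased token so that 'My' and 'I' are matched as well
    return [t for t in tokens if t.lower() not in stop_set]

tokens = ["My", "name", "is", "kaveri", ".", "I", "am", "having", "caring", "nature", "."]
print(remove_stopwords(tokens, demo_stopwords))
```

Using a set rather than a list for the stop words also makes each membership test constant time on average.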


# Stemming

In [5]:
from nltk.stem import PorterStemmer  

ps = PorterStemmer()  #Initialize Python porter stemmer
word = "leaves"
print("leaves -> ",ps.stem(word))
print("engineering -> ",ps.stem("engineering"))
print("beeches -> ",ps.stem("beeches"))

leaves ->  leav
engineering ->  engin
beeches ->  beech


In [6]:
# stemming works on single words, so split the stop-word-free string
# (nonstop_string1) back into tokens before passing them to the stemmer
split_str = nonstop_string1.split()

for word in split_str:
    # print each word next to its stem in two 20-character columns
    print("{0:20}{1:20}".format(word, ps.stem(word)))

My                  my                  
name                name                
kaveri              kaveri              
vinod               vinod               
raut                raut                
.                   .                   
I                   i                   
computers           comput              
engineering         engin               
students            student             
studing             stude               
PICT                pict                
pune                pune                
SPPU                sppu                
.                   .                   
I                   i                   
likes               like                
playing             play                
cricket             cricket             
.                   .                   
PICT                pict                
top                 top                 
institute           institut            
pune                pune                
SPPU            

# Lemmatization

In [7]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [8]:
wnl = WordNetLemmatizer()
word = "leaves"
print("leaves -> ",wnl.lemmatize("leaves"))
print("workers -> ",wnl.lemmatize("workers"))
print("beeches -> ",wnl.lemmatize("beeches"))

leaves ->  leaf
workers ->  worker
beeches ->  beech


In [9]:
# store the fully preprocessed tokens (tokenized + stop words removed + lemmatized) in final_string_list
final_string_list = []

split_str = nonstop_string1.split()
print(nonstop_string1)

for word in split_str:
    # pos="v" lemmatizes every token as a verb, so plural nouns such as 'students' are left unchanged
    new_str = wnl.lemmatize(word,pos="v")
    print("{0:20}{1:20}".format(word, new_str))
    final_string_list.append(new_str)  # final_string_list is reused below for POS tagging

My name kaveri vinod raut . I computers engineering students studing PICT pune SPPU . I likes playing cricket . PICT top institute pune SPPU . I caring nature . I think guitar interesting instrument . 
My                  My                  
name                name                
kaveri              kaveri              
vinod               vinod               
raut                raut                
.                   .                   
I                   I                   
computers           computers           
engineering         engineer            
students            students            
studing             stud                
PICT                PICT                
pune                pune                
SPPU                SPPU                
.                   .                   
I                   I                   
likes               like                
playing             play                
cricket             cricket             
.                  

# POS-tagging : Part Of Speech Tagging

In [10]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [11]:
# pos_tag() expects a list of tokens, so pass the lemmatized tokens in final_string_list
pos_tag_string = pos_tag(final_string_list)
print("POS Tagged words are: \n")
for pos_word in pos_tag_string:
    print(pos_word,"\n")

POS Tagged words are: 

('My', 'PRP$') 

('name', 'NN') 

('kaveri', 'VB') 

('vinod', 'NN') 

('raut', 'NN') 

('.', '.') 

('I', 'PRP') 

('computers', 'NNS') 

('engineer', 'VBP') 

('students', 'NNS') 

('stud', 'JJ') 

('PICT', 'NNP') 

('pune', 'NN') 

('SPPU', 'NNP') 

('.', '.') 

('I', 'PRP') 

('like', 'VBP') 

('play', 'NN') 

('cricket', 'NN') 

('.', '.') 

('PICT', 'NNP') 

('top', 'JJ') 

('institute', 'NN') 

('pune', 'NN') 

('SPPU', 'NNP') 

('.', '.') 

('I', 'PRP') 

('care', 'VBP') 

('nature', 'JJ') 

('.', '.') 

('I', 'PRP') 

('think', 'VBP') 

('guitar', 'JJ') 

('interest', 'NN') 

('instrument', 'NN') 

('.', '.') 



# Term Frequency (TF) 
- the number of times each term (word) occurs in a single document

In [12]:
# collect the unique words in a set
word_set = set()
# final_string_list holds the tokens produced by the lemmatization step
word_set = word_set.union(set(final_string_list))  # duplicate words are removed
df = dict.fromkeys(word_set, 0)  # dictionary of word -> count, every count starting at 0

In [13]:
for word in final_string_list:
    df[word] += 1

In [14]:
df

{'stud': 1,
 'guitar': 1,
 'vinod': 1,
 'care': 1,
 'kaveri': 1,
 'PICT': 2,
 'nature': 1,
 'raut': 1,
 'play': 1,
 'instrument': 1,
 'pune': 2,
 'computers': 1,
 'SPPU': 2,
 'top': 1,
 'interest': 1,
 'cricket': 1,
 'engineer': 1,
 'think': 1,
 'name': 1,
 '.': 6,
 'students': 1,
 'like': 1,
 'institute': 1,
 'I': 4,
 'My': 1}

## tf(t,d) = count of word in document / number of all words in document

In [26]:
TF = {}  # dictionary mapping each word to its tf value
corpusCount = len(final_string_list)  # total number of tokens in the document
for word, count in df.items():  # iterate over the (word, count) pairs
    TF[word] = count / corpusCount

TF  # display the tf values

{'stud': 0.0388316669075566,
 'guitar': 0.0388316669075566,
 'vinod': 0.0388316669075566,
 'care': 0.0388316669075566,
 'kaveri': 0.0388316669075566,
 'PICT': 0.030469722583557124,
 'nature': 0.0388316669075566,
 'raut': 0.0388316669075566,
 'play': 0.0388316669075566,
 'instrument': 0.0388316669075566,
 'pune': 0.030469722583557124,
 'computers': 0.0388316669075566,
 'SPPU': 0.030469722583557124,
 'top': 0.0388316669075566,
 'interest': 0.0388316669075566,
 'cricket': 0.0388316669075566,
 'engineer': 0.0388316669075566,
 'think': 0.0388316669075566,
 'name': 0.0388316669075566,
 '.': 0.017216354396899832,
 'students': 0.0388316669075566,
 'like': 0.0388316669075566,
 'institute': 0.0388316669075566,
 'I': 0.022107778259557644,
 'My': 0.0388316669075566}
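
The tf formula can also be written compactly with `collections.Counter`. This is a minimal sketch on a made-up five-token document; the function name `term_frequency` is hypothetical.

```python
from collections import Counter

def term_frequency(tokens):
    # tf(t, d) = count of t in d / total number of tokens in d
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = ["PICT", "top", "institute", "pune", "PICT"]
tf = term_frequency(doc)
print(tf)
```

Because every count is divided by the same total, the tf values of a document sum to 1 and a word that occurs more often always gets a larger value.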

# Inverse document Frequency (IDF) 
- df(t) = number of documents in which the term t appears
- idf(t) = log(N / df(t)), where N is the total number of documents in the document set
- the rarer a term is across the documents, the higher its idf

In [27]:
import math

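IDF_dict = {}  # dictionary mapping each word to its idf value
# only one document is available here, so N is the number of unique terms and the raw
# counts in df stand in for document frequencies; with a real corpus, N would be the
# number of documents and val the number of documents containing the word
N = len(df)
# fill a new dictionary: `IDF_dict = df` would only alias df, so the loop would
# overwrite the counts and change the result of any cell that is re-run afterwards
for word, val in df.items():  # iterate over the (word, count) pairs
    IDF_dict[word] = math.log10(N / val)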

In [28]:
IDF_dict

{'stud': 1.2524514742164006,
 'guitar': 1.2524514742164006,
 'vinod': 1.2524514742164006,
 'care': 1.2524514742164006,
 'kaveri': 1.2524514742164006,
 'PICT': 1.357769007767367,
 'nature': 1.2524514742164006,
 'raut': 1.2524514742164006,
 'play': 1.2524514742164006,
 'instrument': 1.2524514742164006,
 'pune': 1.357769007767367,
 'computers': 1.2524514742164006,
 'SPPU': 1.357769007767367,
 'top': 1.2524514742164006,
 'interest': 1.2524514742164006,
 'cricket': 1.2524514742164006,
 'engineer': 1.2524514742164006,
 'think': 1.2524514742164006,
 'name': 1.2524514742164006,
 '.': 1.6056963139188123,
 'students': 1.2524514742164006,
 'like': 1.2524514742164006,
 'institute': 1.2524514742164006,
 'I': 1.4970924079355536,
 'My': 1.2524514742164006}
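
With more than one document, idf follows the definition directly. The sketch below assumes a corpus given as a list of token lists; the function name `inverse_document_frequency` and the two small documents are made up for illustration.

```python
import math

def inverse_document_frequency(documents):
    # documents: list of token lists; N is the number of documents
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # idf(t) = log10(N / df(t))
    return {term: math.log10(n_docs / count) for term, count in doc_freq.items()}

docs = [["PICT", "top", "institute"], ["PICT", "guitar", "instrument"]]
idf = inverse_document_frequency(docs)
print(idf)
```

A term that appears in every document ('PICT' here) gets idf 0, while a term found in only one document gets a positive weight.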

# TF-IDF 
## tf-idf(t, d) = tf(t, d) * log(N / df(t))

In [30]:
TF_IDF = {}
for word, val in TF.items():
    TF_IDF[word] = val * IDF_dict[word]

In [31]:
TF_IDF

{'stud': 0.04863477846464948,
 'guitar': 0.04863477846464948,
 'vinod': 0.04863477846464948,
 'care': 0.04863477846464948,
 'kaveri': 0.04863477846464948,
 'PICT': 0.04137084499922329,
 'nature': 0.04863477846464948,
 'raut': 0.04863477846464948,
 'play': 0.04863477846464948,
 'instrument': 0.04863477846464948,
 'pune': 0.04137084499922329,
 'computers': 0.04863477846464948,
 'SPPU': 0.04137084499922329,
 'top': 0.04863477846464948,
 'interest': 0.04863477846464948,
 'cricket': 0.04863477846464948,
 'engineer': 0.04863477846464948,
 'think': 0.04863477846464948,
 'name': 0.04863477846464948,
 '.': 0.027644236794221996,
 'students': 0.04863477846464948,
 'like': 0.04863477846464948,
 'institute': 0.04863477846464948,
 'I': 0.033097386988706436,
 'My': 0.04863477846464948}

## Result: tf-idf measures how important a word is to a document in a collection or corpus
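
Putting both parts together for a small corpus could look like the sketch below. It is one possible arrangement under stated assumptions (unsmoothed idf with `log10`, documents given as token lists), and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    # Returns one {term: weight} dictionary per document
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["I", "like", "cricket"], ["I", "like", "guitar"]]
for w in tf_idf(docs):
    print(w)
```

Words shared by both documents ('I', 'like') receive weight 0, so only 'cricket' and 'guitar' remain to distinguish the two documents.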