## Text Analytics
<pre>
1. Extract a sample document and apply the following document preprocessing methods:
tokenization, POS tagging, stop-word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document
Frequency.
</pre>

In [1]:
#  !pip install nltk

In [2]:
import nltk
import math
import re
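# One-time setup: download the NLTK data used below (resource names may vary by NLTK version).
# for pkg in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"): nltk.download(pkg)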

In [3]:
# Read the whole document and lowercase it so that matching is case-insensitive.
with open("./input.txt", "r") as file:
    doc = file.read().lower()

In [4]:
doc

'as the sun dipped below the horizon, casting a warm glow across the sky, sarah found herself lost in thought. she sat on the old wooden bench, the texture rough against her palms. the gentle breeze played with her hair, carrying with it the scent of blooming flowers. in the distance, she could hear the faint chirping of crickets, signaling the onset of evening. it was in these quiet moments that sarah felt most at peace, away from the hustle and bustle of daily life. she closed her eyes, allowing herself to be enveloped by the serenity of the moment, grateful for the simple beauty that surrounded her.'

### Tokenization
<b>Tokenization is the process of breaking a text into individual words or tokens. It is often the first step in natural language processing tasks.</b>
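
To make the idea concrete without any NLTK data, here is a minimal regex-based sketch. The helper name `simple_tokenize` and the sample sentence are chosen only for illustration, and the pattern is a simplification: NLTK's tokenizer handles contractions, abbreviations, and other cases that this one does not.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (hypothetical helper)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "she sat on the old wooden bench, the texture rough against her palms."
tokens = simple_tokenize(sample)
print(tokens)
```

Punctuation marks come out as separate tokens here, just as they do in the NLTK output in this notebook.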

In [5]:
word_tokenize = nltk.word_tokenize(doc)
word_tokenize

['as',
 'the',
 'sun',
 'dipped',
 'below',
 'the',
 'horizon',
 ',',
 'casting',
 'a',
 'warm',
 'glow',
 'across',
 'the',
 'sky',
 ',',
 'sarah',
 'found',
 'herself',
 'lost',
 'in',
 'thought',
 '.',
 'she',
 'sat',
 'on',
 'the',
 'old',
 'wooden',
 'bench',
 ',',
 'the',
 'texture',
 'rough',
 'against',
 'her',
 'palms',
 '.',
 'the',
 'gentle',
 'breeze',
 'played',
 'with',
 'her',
 'hair',
 ',',
 'carrying',
 'with',
 'it',
 'the',
 'scent',
 'of',
 'blooming',
 'flowers',
 '.',
 'in',
 'the',
 'distance',
 ',',
 'she',
 'could',
 'hear',
 'the',
 'faint',
 'chirping',
 'of',
 'crickets',
 ',',
 'signaling',
 'the',
 'onset',
 'of',
 'evening',
 '.',
 'it',
 'was',
 'in',
 'these',
 'quiet',
 'moments',
 'that',
 'sarah',
 'felt',
 'most',
 'at',
 'peace',
 ',',
 'away',
 'from',
 'the',
 'hustle',
 'and',
 'bustle',
 'of',
 'daily',
 'life',
 '.',
 'she',
 'closed',
 'her',
 'eyes',
 ',',
 'allowing',
 'herself',
 'to',
 'be',
 'enveloped',
 'by',
 'the',
 'serenity',
 'of'

In [6]:
Sentence_tokenize = nltk.sent_tokenize(doc)
Sentence_tokenize

['as the sun dipped below the horizon, casting a warm glow across the sky, sarah found herself lost in thought.',
 'she sat on the old wooden bench, the texture rough against her palms.',
 'the gentle breeze played with her hair, carrying with it the scent of blooming flowers.',
 'in the distance, she could hear the faint chirping of crickets, signaling the onset of evening.',
 'it was in these quiet moments that sarah felt most at peace, away from the hustle and bustle of daily life.',
 'she closed her eyes, allowing herself to be enveloped by the serenity of the moment, grateful for the simple beauty that surrounded her.']

### Stop Words
<i>Stop words are common words like 'the', 'is', 'and', etc., which often do not carry significant meaning in text analysis. Remove these stop words from the text to focus on the more meaningful content.</i>
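
The filtering step can be sketched in plain Python. The stop-word set below is a tiny illustrative sample, not NLTK's English list, and the `drop_punctuation` option is an assumption added for this example: the notebook's own filter keeps punctuation tokens such as ',' and '.'.

```python
import string

# A tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "on", "her", "she", "with", "it", "of", "in", "a", "as"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, drop_punctuation=True):
    """Return the tokens that are neither stop words nor (optionally) pure punctuation."""
    kept = []
    for token in tokens:
        if token in stop_words:
            continue
        # A token made up only of punctuation characters carries no lexical content.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["she", "sat", "on", "the", "old", "wooden", "bench", ",",
          "the", "texture", "rough", "."]
content_words = remove_stop_words(tokens)
print(content_words)
```

Dropping punctuation at this stage keeps tokens like ',' from appearing in the later frequency counts.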

In [7]:
stop_words = set(nltk.corpus.stopwords.words('english'))
# Keep only the tokens that are not stop words (punctuation tokens are retained).
filtered_words = [token for token in word_tokenize if token not in stop_words]

filtered_words

['sun',
 'dipped',
 'horizon',
 ',',
 'casting',
 'warm',
 'glow',
 'across',
 'sky',
 ',',
 'sarah',
 'found',
 'lost',
 'thought',
 '.',
 'sat',
 'old',
 'wooden',
 'bench',
 ',',
 'texture',
 'rough',
 'palms',
 '.',
 'gentle',
 'breeze',
 'played',
 'hair',
 ',',
 'carrying',
 'scent',
 'blooming',
 'flowers',
 '.',
 'distance',
 ',',
 'could',
 'hear',
 'faint',
 'chirping',
 'crickets',
 ',',
 'signaling',
 'onset',
 'evening',
 '.',
 'quiet',
 'moments',
 'sarah',
 'felt',
 'peace',
 ',',
 'away',
 'hustle',
 'bustle',
 'daily',
 'life',
 '.',
 'closed',
 'eyes',
 ',',
 'allowing',
 'enveloped',
 'serenity',
 'moment',
 ',',
 'grateful',
 'simple',
 'beauty',
 'surrounded',
 '.']

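### POS Tagging

POS (part-of-speech) tagging assigns a grammatical category such as noun (NN), verb (VBD), or adjective (JJ) to each token. The tagger relies on the surrounding words, so running it on `filtered_words`, where the stop words have already been removed, can produce unreliable tags (for example, 'sat' is tagged JJ in the output).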

In [8]:
tagged = nltk.pos_tag(filtered_words)
tagged

[('sun', 'NN'),
 ('dipped', 'VBD'),
 ('horizon', 'NN'),
 (',', ','),
 ('casting', 'VBG'),
 ('warm', 'JJ'),
 ('glow', 'NN'),
 ('across', 'IN'),
 ('sky', 'NN'),
 (',', ','),
 ('sarah', 'NN'),
 ('found', 'VBD'),
 ('lost', 'RB'),
 ('thought', 'VBN'),
 ('.', '.'),
 ('sat', 'JJ'),
 ('old', 'JJ'),
 ('wooden', 'NN'),
 ('bench', 'NN'),
 (',', ','),
 ('texture', 'NN'),
 ('rough', 'JJ'),
 ('palms', 'NN'),
 ('.', '.'),
 ('gentle', 'JJ'),
 ('breeze', 'NN'),
 ('played', 'VBD'),
 ('hair', 'NN'),
 (',', ','),
 ('carrying', 'VBG'),
 ('scent', 'NN'),
 ('blooming', 'VBG'),
 ('flowers', 'NNS'),
 ('.', '.'),
 ('distance', 'NN'),
 (',', ','),
 ('could', 'MD'),
 ('hear', 'VB'),
 ('faint', 'JJ'),
 ('chirping', 'VBG'),
 ('crickets', 'NNS'),
 (',', ','),
 ('signaling', 'VBG'),
 ('onset', 'RP'),
 ('evening', 'VBG'),
 ('.', '.'),
 ('quiet', 'JJ'),
 ('moments', 'NNS'),
 ('sarah', 'VBP'),
 ('felt', 'VBD'),
 ('peace', 'NN'),
 (',', ','),
 ('away', 'RB'),
 ('hustle', 'JJ'),
 ('bustle', 'JJ'),
 ('daily', 'JJ'),
 ('lif

### Stemming
<pre>
<i>Stemming, in Natural Language Processing (NLP), is the process of reducing a word to its stem by stripping affixes (prefixes and suffixes).</i>

A stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form. It extracts the base form of a word by removing its affixes, much like cutting the branches of a tree back to its stem. For example, the stem of the words “eating,” “eats,” and “eaten” is “eat.”
</pre>
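
The "cutting off branches" idea can be illustrated with a deliberately naive suffix stripper. This is a toy sketch, not Porter's algorithm or any NLTK stemmer; the function name, the suffix list, and the minimum stem length are all assumptions made for the example.

```python
def naive_stem(word, suffixes=("ing", "en", "s"), min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in suffixes:
        # Only strip when the remaining stem keeps at least min_stem_len characters,
        # so that short words such as "sing" or "bus" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["eating", "eats", "eaten", "eat", "sing"]:
    print(w, ":", naive_stem(w))
```

Real stemmers apply many ordered rules with conditions on the remaining stem, which is what the Porter algorithm in this notebook does.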


![image.png](attachment:image.png)

### Porter's Stemmer Algorithm:
<p>
Porter’s Stemmer, proposed in 1980, is one of the most popular stemming algorithms. It is based on the idea that suffixes in the English language are built from combinations of smaller, simpler suffixes. The stemmer is known for its speed and simplicity, and its main applications include data mining and information retrieval. However, it is limited to English words. Also, several related words may be mapped onto the same stem, and the output stem is not necessarily a meaningful word (for example, ‘texture’ becomes ‘textur’). The algorithm is fairly lengthy and is one of the oldest stemmers still in use.

Example: EED -> EE means “if the part of the word before the EED ending contains at least one vowel followed by a consonant, change the ending to EE,” so ‘agreed’ becomes ‘agree.’</p>
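
The EED -> EE rule depends on what Porter calls the measure of a stem: the number of vowel-consonant sequences it contains. The sketch below implements only that measure and this single rule, under the usual convention that 'y' counts as a vowel when it follows a consonant. The function names are hypothetical, and this is not the full algorithm that `nltk.stem.PorterStemmer` runs.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel, otherwise a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-to-consonant transitions (the measure m) in the stem."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev == "V" and cur == "C")

def apply_eed_rule(word):
    """Apply the single rule (m > 0) EED -> EE and leave every other word unchanged."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

for w in ["agreed", "feed"]:
    print(w, ":", apply_eed_rule(w))
```

'agreed' has the stem 'agr' with measure 1, so the rule fires, while 'feed' has the stem 'f' with measure 0 and is left unchanged.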

In [9]:
stemmer = nltk.stem.PorterStemmer()
for w in filtered_words:
    print(w," : ",stemmer.stem(w))

sun  :  sun
dipped  :  dip
horizon  :  horizon
,  :  ,
casting  :  cast
warm  :  warm
glow  :  glow
across  :  across
sky  :  sky
,  :  ,
sarah  :  sarah
found  :  found
lost  :  lost
thought  :  thought
.  :  .
sat  :  sat
old  :  old
wooden  :  wooden
bench  :  bench
,  :  ,
texture  :  textur
rough  :  rough
palms  :  palm
.  :  .
gentle  :  gentl
breeze  :  breez
played  :  play
hair  :  hair
,  :  ,
carrying  :  carri
scent  :  scent
blooming  :  bloom
flowers  :  flower
.  :  .
distance  :  distanc
,  :  ,
could  :  could
hear  :  hear
faint  :  faint
chirping  :  chirp
crickets  :  cricket
,  :  ,
signaling  :  signal
onset  :  onset
evening  :  even
.  :  .
quiet  :  quiet
moments  :  moment
sarah  :  sarah
felt  :  felt
peace  :  peac
,  :  ,
away  :  away
hustle  :  hustl
bustle  :  bustl
daily  :  daili
life  :  life
.  :  .
closed  :  close
eyes  :  eye
,  :  ,
allowing  :  allow
enveloped  :  envelop
serenity  :  seren
moment  :  moment
,  :  ,
grateful  :  grate
simple  :

### Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it uses a vocabulary and the word's part of speech to return a valid dictionary form, called the lemma. For example, the verb "running" is reduced to "run." Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, which is why verb forms such as 'dipped' are left unchanged in the output.
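
To lemmatize verbs and adjectives as well as nouns, the Penn Treebank tags produced by `nltk.pos_tag` can be mapped to the single-letter part-of-speech codes that WordNet uses ('n', 'v', 'a', 'r'). The helper below is a hypothetical sketch of that mapping. It does not import NLTK, so it runs on its own, and the sample pairs are taken from the POS tagging output in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

tagged_sample = [("dipped", "VBD"), ("warm", "JJ"), ("flowers", "NNS"), ("away", "RB")]
for word, tag in tagged_sample:
    print(word, tag, "->", penn_to_wordnet(tag))
```

With NLTK and its WordNet data available, the result can be passed as the `pos` argument, for example `lemma.lemmatize(word, pos=penn_to_wordnet(tag))`.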


In [10]:
lemma = nltk.stem.WordNetLemmatizer()
for w in filtered_words:
    print(w," : ",lemma.lemmatize(w))

sun  :  sun
dipped  :  dipped
horizon  :  horizon
,  :  ,
casting  :  casting
warm  :  warm
glow  :  glow
across  :  across
sky  :  sky
,  :  ,
sarah  :  sarah
found  :  found
lost  :  lost
thought  :  thought
.  :  .
sat  :  sat
old  :  old
wooden  :  wooden
bench  :  bench
,  :  ,
texture  :  texture
rough  :  rough
palms  :  palm
.  :  .
gentle  :  gentle
breeze  :  breeze
played  :  played
hair  :  hair
,  :  ,
carrying  :  carrying
scent  :  scent
blooming  :  blooming
flowers  :  flower
.  :  .
distance  :  distance
,  :  ,
could  :  could
hear  :  hear
faint  :  faint
chirping  :  chirping
crickets  :  cricket
,  :  ,
signaling  :  signaling
onset  :  onset
evening  :  evening
.  :  .
quiet  :  quiet
moments  :  moment
sarah  :  sarah
felt  :  felt
peace  :  peace
,  :  ,
away  :  away
hustle  :  hustle
bustle  :  bustle
daily  :  daily
life  :  life
.  :  .
closed  :  closed
eyes  :  eye
,  :  ,
allowing  :  allowing
enveloped  :  enveloped
serenity  :  serenity
moment  :  mome

### Part 2

### TF-IDF
<p>Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).<br>

Text vectorization converts the words of a document into numeric importance scores. There are many vectorization scoring schemes, and TF-IDF is one of the most common.</p>
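
As a sketch of the term-frequency half, assuming TF is defined as a term's count divided by the total number of tokens in the document (the definition this notebook's code uses), the computation fits in a few lines. The function name and the toy token list are illustrative only.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total_tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tokens = ["sarah", "sat", "old", "bench", "sarah"]
tf_scores = term_frequency(tokens)
print(tf_scores)
```

Here 'sarah' appears twice among five tokens, so its TF is 0.4, while each of the other terms gets 0.2.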

In [11]:
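## calculation of TF
from collections import Counter

# Count every distinct token in one pass instead of calling list.count() per token.
words_freq = Counter(filtered_words)
n = len(filtered_words)  # total number of tokens in the document

tf = dict()
for w, c in words_freq.items():
    tf[w] = c / n
    print(w, " ====> ", tf[w])
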

sun  ====>  0.014084507042253521
dipped  ====>  0.014084507042253521
horizon  ====>  0.014084507042253521
,  ====>  0.1267605633802817
casting  ====>  0.014084507042253521
warm  ====>  0.014084507042253521
glow  ====>  0.014084507042253521
across  ====>  0.014084507042253521
sky  ====>  0.014084507042253521
sarah  ====>  0.028169014084507043
found  ====>  0.014084507042253521
lost  ====>  0.014084507042253521
thought  ====>  0.014084507042253521
.  ====>  0.08450704225352113
sat  ====>  0.014084507042253521
old  ====>  0.014084507042253521
wooden  ====>  0.014084507042253521
bench  ====>  0.014084507042253521
texture  ====>  0.014084507042253521
rough  ====>  0.014084507042253521
palms  ====>  0.014084507042253521
gentle  ====>  0.014084507042253521
breeze  ====>  0.014084507042253521
played  ====>  0.014084507042253521
hair  ====>  0.014084507042253521
carrying  ====>  0.014084507042253521
scent  ====>  0.014084507042253521
blooming  ====>  0.014084507042253521
flowers  ====>  0.01408

### Calculation of IDF:

Inverse Document Frequency (IDF) is a measure used in information retrieval and text mining to quantify how rare, and therefore how informative, a term is across a collection of documents.

![image.png](attachment:image.png) 
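
Written out, the definition that this notebook's code implements (natural logarithm, no smoothing) is shown below, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain the term $t$. Other texts use slightly different variants, so treat this as the form assumed here rather than the only one.

$$
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

For example, with $N = 2$ documents, a term found in only one of them has $\mathrm{idf} = \ln 2 \approx 0.693$. Multiplied by a term frequency of $1/71 \approx 0.0141$, this gives about $0.00976$, which matches the non-zero scores printed later in this notebook. A term found in both documents has $\mathrm{idf} = \ln 1 = 0$.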

In [12]:
# Load and preprocess the second document the same way as the first.
with open("./input2.txt", "r") as file:
    doc2 = file.read().lower()
doc2_tokens = nltk.word_tokenize(doc2)
# Use a set for fast membership tests, as for the first document.
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered2 = [w for w in doc2_tokens if w not in stop_words]
filtered2

['casting',
 'warm',
 'glow',
 'across',
 'sky',
 ',',
 'sarah',
 'found',
 'lost',
 'thought',
 '.',
 'sat',
 'old',
 'wooden',
 'bench',
 ',',
 'texture',
 'rough',
 'palms',
 '.']

In [13]:
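def calculate(doclist):
    """Return the IDF, log(N / df), of every term in a list of tokenized documents."""
    n = len(doclist)
    # Convert each document to a set so membership tests are fast.
    doc_sets = [set(d) for d in doclist]
    idf = dict()
    for w in set().union(*doc_sets):
        # Document frequency: the number of documents that contain the term.
        f = sum(1 for d in doc_sets if w in d)
        idf[w] = math.log(n / f)
    return idf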
        

In [16]:
idf  = calculate([filtered_words,filtered2])
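
With only two documents, every term that occurs in both gets an IDF of log(2/2) = 0, so its TF-IDF score is 0 regardless of how often it appears. The sketch below reproduces that effect on toy data and contrasts it with one common smoothed variant, log((1 + N) / (1 + df)) + 1. The smoothed formula and both function names are assumptions for illustration and are not part of this notebook's calculation.

```python
import math

def plain_idf(docs):
    """IDF as log(N / df) over a list of tokenized documents."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    return {t: math.log(n / sum(t in d for d in doc_sets)) for t in vocab}

def smoothed_idf(docs):
    """Smoothed variant log((1 + N) / (1 + df)) + 1 (assumed here, not the notebook's formula)."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    return {t: math.log((1 + n) / (1 + sum(t in d for d in doc_sets))) + 1 for t in vocab}

docs = [["sun", "warm", "sky"], ["warm", "sky"]]
plain = plain_idf(docs)
smooth = smoothed_idf(docs)
for term in sorted(plain):
    print(term, plain[term], smooth[term])
```

In the plain version, 'warm' and 'sky' score 0 because they occur in both documents. The smoothed version keeps them above zero while still ranking the rarer term 'sun' higher.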

In [17]:
for w in filtered_words:
    print(w," ===> ",tf[w]*idf[w])

sun  ===>  0.009762636345914722
dipped  ===>  0.009762636345914722
horizon  ===>  0.009762636345914722
,  ===>  0.0
casting  ===>  0.0
warm  ===>  0.0
glow  ===>  0.0
across  ===>  0.0
sky  ===>  0.0
,  ===>  0.0
sarah  ===>  0.0
found  ===>  0.0
lost  ===>  0.0
thought  ===>  0.0
.  ===>  0.0
sat  ===>  0.0
old  ===>  0.0
wooden  ===>  0.0
bench  ===>  0.0
,  ===>  0.0
texture  ===>  0.0
rough  ===>  0.0
palms  ===>  0.0
.  ===>  0.0
gentle  ===>  0.009762636345914722
breeze  ===>  0.009762636345914722
played  ===>  0.009762636345914722
hair  ===>  0.009762636345914722
,  ===>  0.0
carrying  ===>  0.009762636345914722
scent  ===>  0.009762636345914722
blooming  ===>  0.009762636345914722
flowers  ===>  0.009762636345914722
.  ===>  0.0
distance  ===>  0.009762636345914722
,  ===>  0.0
could  ===>  0.009762636345914722
hear  ===>  0.009762636345914722
faint  ===>  0.009762636345914722
chirping  ===>  0.009762636345914722
crickets  ===>  0.009762636345914722
,  ===>  0.0
signaling  ===>