<h2>Text Analytics</h2>

In [32]:
with open("sample.txt", "r", encoding="utf-8") as file:
    text = file.read()

print(text)

Text analytics, also known as text mining, is the process of deriving meaningful insights and patterns from unstructured text data. With the rise of the internet and social media, the amount of text data generated daily has skyrocketed, making text analytics an essential tool for businesses and organizations seeking to extract insights from this data.In this essay, we will explore the different techniques used in text analytics, the benefits of text analytics, and the challenges that come with implementing text analytics.Text analytics can be broadly divided into three main techniques: text classification, sentiment analysis, and topic modeling.Text classification involves categorizing text data into predefined categories. This can be useful for automating tasks such as spam detection, content filtering, and customer feedback analysis. For example, a company might use text classification to automatically route customer complaints to the appropriate department.Sentiment analysis, also k

In [48]:
import math
from collections import Counter

import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')  # models for sentence and word tokenization
nltk.download('wordnet')  # lexical database used by the lemmatizer
nltk.download('stopwords')  # stop word lists
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KARTIKI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\KARTIKI\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KARTIKI\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\KARTIKI\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


<h3>Part 1</h3>

__Tokenization__

In [34]:
tokens_sents = nltk.sent_tokenize(text)
print(tokens_sents)

['Text analytics, also known as text mining, is the process of deriving meaningful insights and patterns from unstructured text data.', 'With the rise of the internet and social media, the amount of text data generated daily has skyrocketed, making text analytics an essential tool for businesses and organizations seeking to extract insights from this data.In this essay, we will explore the different techniques used in text analytics, the benefits of text analytics, and the challenges that come with implementing text analytics.Text analytics can be broadly divided into three main techniques: text classification, sentiment analysis, and topic modeling.Text classification involves categorizing text data into predefined categories.', 'This can be useful for automating tasks such as spam detection, content filtering, and customer feedback analysis.', 'For example, a company might use text classification to automatically route customer complaints to the appropriate department.Sentiment analy

In [35]:
tokens_words = nltk.word_tokenize(text)
print(tokens_words)

['Text', 'analytics', ',', 'also', 'known', 'as', 'text', 'mining', ',', 'is', 'the', 'process', 'of', 'deriving', 'meaningful', 'insights', 'and', 'patterns', 'from', 'unstructured', 'text', 'data', '.', 'With', 'the', 'rise', 'of', 'the', 'internet', 'and', 'social', 'media', ',', 'the', 'amount', 'of', 'text', 'data', 'generated', 'daily', 'has', 'skyrocketed', ',', 'making', 'text', 'analytics', 'an', 'essential', 'tool', 'for', 'businesses', 'and', 'organizations', 'seeking', 'to', 'extract', 'insights', 'from', 'this', 'data.In', 'this', 'essay', ',', 'we', 'will', 'explore', 'the', 'different', 'techniques', 'used', 'in', 'text', 'analytics', ',', 'the', 'benefits', 'of', 'text', 'analytics', ',', 'and', 'the', 'challenges', 'that', 'come', 'with', 'implementing', 'text', 'analytics.Text', 'analytics', 'can', 'be', 'broadly', 'divided', 'into', 'three', 'main', 'techniques', ':', 'text', 'classification', ',', 'sentiment', 'analysis', ',', 'and', 'topic', 'modeling.Text', 'class
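
The outputs above contain tokens such as 'data.In' and 'analytics.Text', and several sentences are merged into one, because sample.txt has no space after some sentence-ending periods. One possible preprocessing step is sketched below. It assumes that a period sitting directly between a lowercase letter and an uppercase letter always ends a sentence, which will not hold for every text. The helper name `add_missing_sentence_spaces` is hypothetical and not part of NLTK.

```python
import re

def add_missing_sentence_spaces(raw_text):
    """Insert a space after a period that sits between a lowercase and an uppercase letter."""
    # Assumption: 'data.In' means 'data. In'; digits and lowercase pairs are left alone.
    return re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', raw_text)

sample = "insights from this data.In this essay, we explore text analytics.Text analytics can be divided."
cleaned = add_missing_sentence_spaces(sample)
print(cleaned)
```

In the notebook this could be applied as `text = add_missing_sentence_spaces(text)` before tokenizing. The tokenizer outputs would then differ from the ones shown here.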

__Stop Words__

In [36]:
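stopword = set(stopwords.words('english'))

# Keep tokens that are not in the stop word list. The comparison is
# case-sensitive, so capitalized stop words such as 'With' are kept.
filtered_tokens = [w for w in tokens_words if w not in stopword]

print(filtered_tokens)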

['Text', 'analytics', ',', 'also', 'known', 'text', 'mining', ',', 'process', 'deriving', 'meaningful', 'insights', 'patterns', 'unstructured', 'text', 'data', '.', 'With', 'rise', 'internet', 'social', 'media', ',', 'amount', 'text', 'data', 'generated', 'daily', 'skyrocketed', ',', 'making', 'text', 'analytics', 'essential', 'tool', 'businesses', 'organizations', 'seeking', 'extract', 'insights', 'data.In', 'essay', ',', 'explore', 'different', 'techniques', 'used', 'text', 'analytics', ',', 'benefits', 'text', 'analytics', ',', 'challenges', 'come', 'implementing', 'text', 'analytics.Text', 'analytics', 'broadly', 'divided', 'three', 'main', 'techniques', ':', 'text', 'classification', ',', 'sentiment', 'analysis', ',', 'topic', 'modeling.Text', 'classification', 'involves', 'categorizing', 'text', 'data', 'predefined', 'categories', '.', 'This', 'useful', 'automating', 'tasks', 'spam', 'detection', ',', 'content', 'filtering', ',', 'customer', 'feedback', 'analysis', '.', 'For', 'e
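
The filtered list above still contains 'With', 'This', 'For' and punctuation marks, because the stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a stricter filter follows. The function name `filter_tokens` and the small demo stop word set are assumptions made for illustration; in the notebook the set would be the `stopword` variable built from `stopwords.words('english')`.

```python
import string

def filter_tokens(tokens, stop_words):
    """Drop stop words (case-insensitively) and tokens made only of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # pure punctuation such as ',' or '.'
        kept.append(token)
    return kept

# Tiny illustrative stop word set (an assumption, not the NLTK list).
demo_stop_words = {"with", "the", "of", "and", "this", "for", "can", "be"}
demo_tokens = ["With", "the", "rise", "of", "the", "internet", ",",
               "This", "can", "be", "useful", "."]
result = filter_tokens(demo_tokens, demo_stop_words)
print(result)
```

Whether to drop punctuation depends on the later steps: it adds noise to term frequencies, but a POS tagger generally benefits from seeing it.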

__POS Tagging__

In [40]:
pos_tags = nltk.pos_tag(tokens_words)
print(pos_tags)

[('Text', 'NN'), ('analytics', 'NNS'), (',', ','), ('also', 'RB'), ('known', 'VBN'), ('as', 'IN'), ('text', 'NN'), ('mining', 'NN'), (',', ','), ('is', 'VBZ'), ('the', 'DT'), ('process', 'NN'), ('of', 'IN'), ('deriving', 'VBG'), ('meaningful', 'JJ'), ('insights', 'NNS'), ('and', 'CC'), ('patterns', 'NNS'), ('from', 'IN'), ('unstructured', 'JJ'), ('text', 'NN'), ('data', 'NNS'), ('.', '.'), ('With', 'IN'), ('the', 'DT'), ('rise', 'NN'), ('of', 'IN'), ('the', 'DT'), ('internet', 'NN'), ('and', 'CC'), ('social', 'JJ'), ('media', 'NNS'), (',', ','), ('the', 'DT'), ('amount', 'NN'), ('of', 'IN'), ('text', 'NN'), ('data', 'NNS'), ('generated', 'VBD'), ('daily', 'RB'), ('has', 'VBZ'), ('skyrocketed', 'VBN'), (',', ','), ('making', 'VBG'), ('text', 'JJ'), ('analytics', 'NNS'), ('an', 'DT'), ('essential', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('businesses', 'NNS'), ('and', 'CC'), ('organizations', 'NNS'), ('seeking', 'VBG'), ('to', 'TO'), ('extract', 'VB'), ('insights', 'NNS'), ('from', 'IN'), 

__Stemming and Lemmatization__

In [66]:
# Stemming: reduce each token to its Porter stem
stemmer = PorterStemmer()

stemmed = []

for w in filtered_tokens:
    stem = stemmer.stem(w)  # compute once, then store and print
    stemmed.append(stem)
    print(w, ':', stem)

Text : text
analytics : analyt
, : ,
also : also
known : known
text : text
mining : mine
, : ,
process : process
deriving : deriv
meaningful : meaning
insights : insight
patterns : pattern
unstructured : unstructur
text : text
data : data
. : .
With : with
rise : rise
internet : internet
social : social
media : media
, : ,
amount : amount
text : text
data : data
generated : gener
daily : daili
skyrocketed : skyrocket
, : ,
making : make
text : text
analytics : analyt
essential : essenti
tool : tool
businesses : busi
organizations : organ
seeking : seek
extract : extract
insights : insight
data.In : data.in
essay : essay
, : ,
explore : explor
different : differ
techniques : techniqu
used : use
text : text
analytics : analyt
, : ,
benefits : benefit
text : text
analytics : analyt
, : ,
challenges : challeng
come : come
implementing : implement
text : text
analytics.Text : analytics.text
analytics : analyt
broadly : broadli
divided : divid
three : three
main : main
techniques : techniqu


In [44]:
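# Lemmatization: map each token to its WordNet lemma.
# lemmatize() treats every word as a noun unless a pos argument is given,
# which is why verb forms such as 'deriving' come back unchanged.
lemmatizer = WordNetLemmatizer()

lemmatized = []

for w in filtered_tokens:
    lemma = lemmatizer.lemmatize(w)
    lemmatized.append(lemma)
    print(w, ':', lemma)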

Text : Text
analytics : analytics
, : ,
also : also
known : known
text : text
mining : mining
, : ,
process : process
deriving : deriving
meaningful : meaningful
insights : insight
patterns : pattern
unstructured : unstructured
text : text
data : data
. : .
With : With
rise : rise
internet : internet
social : social
media : medium
, : ,
amount : amount
text : text
data : data
generated : generated
daily : daily
skyrocketed : skyrocketed
, : ,
making : making
text : text
analytics : analytics
essential : essential
tool : tool
businesses : business
organizations : organization
seeking : seeking
extract : extract
insights : insight
data.In : data.In
essay : essay
, : ,
explore : explore
different : different
techniques : technique
used : used
text : text
analytics : analytics
, : ,
benefits : benefit
text : text
analytics : analytics
, : ,
challenges : challenge
come : come
implementing : implementing
text : text
analytics.Text : analytics.Text
analytics : analytics
broadly : broadly
divi
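
In the lemmatization output, 'deriving', 'generated' and 'making' are returned unchanged because `lemmatize()` defaults to the noun part of speech. One common approach is to pass the POS tag of each token to the lemmatizer. The sketch below shows only the tag mapping. The helper name `penn_to_wordnet` is hypothetical, and the single-letter codes are the ones WordNet uses for adjectives, verbs, adverbs and nouns.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code used by WordNet."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# (word, tag) pairs taken from the POS tagging output in this notebook
tagged = [('deriving', 'VBG'), ('meaningful', 'JJ'), ('insights', 'NNS'), ('daily', 'RB')]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))

# In the notebook this could be combined with the existing objects, e.g.:
# lemmatized = [lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in pos_tags]
```

Note that `pos_tags` was computed on `tokens_words`, so the commented line would lemmatize the unfiltered tokens. Tagging before removing stop words is usually preferable, since the tagger relies on sentence context.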

<h3>Part 2</h3>

In [68]:
# Calculate TF
tf = Counter(tokens_words)
total_terms = len(tokens_words)
tf = {term: freq/total_terms for term, freq in tf.items()}

# Print TF
print("Term Frequency (TF):")
for term, freq in tf.items():
    print(f"{term}: {freq}")

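# Calculate IDF
# IDF = log(N / df). With a single document, N = 1 and df = 1 for every term,
# so every IDF value (and therefore every TF-IDF value) is log(1) = 0.
unique_terms = set(tokens_words)
num_docs = 1  # Assuming we're dealing with a single document
docs = [tokens_words]
idf = {term: math.log(num_docs / sum(term in doc for doc in docs)) for term in unique_terms}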

# Print IDF
print("\nInverse Document Frequency (IDF):")
for term, score in idf.items():
    print(f"{term}: {score}")

# Calculate TF-IDF
tfidf = {term: tf[term] * idf[term] for term in unique_terms}

# Print TF-IDF
print("\nTF-IDF:")
for term, score in tfidf.items():
    print(f"{term}: {score}")

Term Frequency (TF):
Text: 0.0017667844522968198
analytics: 0.024734982332155476
,: 0.06713780918727916
also: 0.00530035335689046
known: 0.0035335689045936395
as: 0.00530035335689046
text: 0.04063604240282685
mining: 0.0035335689045936395
is: 0.0176678445229682
the: 0.03356890459363958
process: 0.0035335689045936395
of: 0.028268551236749116
deriving: 0.0017667844522968198
meaningful: 0.0017667844522968198
insights: 0.0088339222614841
and: 0.0353356890459364
patterns: 0.0017667844522968198
from: 0.007067137809187279
unstructured: 0.0017667844522968198
data: 0.015901060070671377
.: 0.030035335689045935
With: 0.0017667844522968198
rise: 0.0017667844522968198
internet: 0.0017667844522968198
social: 0.0088339222614841
media: 0.0088339222614841
amount: 0.0017667844522968198
generated: 0.0035335689045936395
daily: 0.0017667844522968198
has: 0.0017667844522968198
skyrocketed: 0.0017667844522968198
making: 0.0017667844522968198
an: 0.0035335689045936395
essential: 0.0017667844522968198
tool: 0.

In [61]:
tf = Counter(tokens_words)
total_terms = len(tokens_words)
tf = {term: freq/total_terms for term,freq in tf.items()}

for term,freq in tf.items():
    print(f"{term}:{freq}")

Text:0.0017667844522968198
analytics:0.024734982332155476
,:0.06713780918727916
also:0.00530035335689046
known:0.0035335689045936395
as:0.00530035335689046
text:0.04063604240282685
mining:0.0035335689045936395
is:0.0176678445229682
the:0.03356890459363958
process:0.0035335689045936395
of:0.028268551236749116
deriving:0.0017667844522968198
meaningful:0.0017667844522968198
insights:0.0088339222614841
and:0.0353356890459364
patterns:0.0017667844522968198
from:0.007067137809187279
unstructured:0.0017667844522968198
data:0.015901060070671377
.:0.030035335689045935
With:0.0017667844522968198
rise:0.0017667844522968198
internet:0.0017667844522968198
social:0.0088339222614841
media:0.0088339222614841
amount:0.0017667844522968198
generated:0.0035335689045936395
daily:0.0017667844522968198
has:0.0017667844522968198
skyrocketed:0.0017667844522968198
making:0.0017667844522968198
an:0.0035335689045936395
essential:0.0017667844522968198
tool:0.0035335689045936395
for:0.0088339222614841
businesses:0.

In [62]:
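unique_terms = set(tokens_words)
num_of_docs = 1  # single document, so every term has document frequency 1
docs = [tokens_words]
# IDF = log(N / df); with N = 1 and df = 1 this is log(1) = 0 for every term
idf = {term: math.log(num_of_docs / sum(term in doc for doc in docs)) for term in unique_terms}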

for term,score in idf.items():
    print(f"{term}:{score}")

analytics.Text:0.0
feedback:0.0
must:0.0
satisfaction:0.0
different:0.0
the:0.0
explore:0.0
automatically:0.0
used:0.0
articles:0.0
many:0.0
department.Sentiment:0.0
its:0.0
data.Despite:0.0
leveraging:0.0
determining:0.0
modeling.Text:0.0
discussed:0.0
privacy:0.0
patterns:0.0
competitive:0.0
automating:0.0
Since:0.0
predefined:0.0
reviews:0.0
broadly:0.0
challenges:0.0
As:0.0
text:0.0
analytics:0.0
account:0.0
seeking:0.0
them:0.0
,:0.0
in:0.0
inform:0.0
and:0.0
meaningful:0.0
launch.Topic:0.0
This:0.0
edge:0.0
by:0.0
analyzing:0.0
have:0.0
reliable.In:0.0
behavior:0.0
reputation:0.0
data.In:0.0
that:0.0
platforms:0.0
ambiguous:0.0
techniques:0.0
neutral:0.0
example:0.0
economy:0.0
years:0.0
considerations:0.0
make:0.0
online:0.0
For:0.0
increasingly:0.0
By:0.0
products.The:0.0
likely:0.0
Similarly:0.0
growth:0.0
amount:0.0
company:0.0
commonly:0.0
take:0.0
three:0.0
these:0.0
identifying:0.0
see:0.0
derived:0.0
vast:0.0
gain:0.0
strategy.Despite:0.0
are:0.0
spam:0.0
algorithms.Anoth

In [64]:
tfidf = {term: tf[term] * idf[term] for term in unique_terms}
for term, score in tfidf.items():
    print(f"{term}: {score}")

analytics.Text: 0.0
feedback: 0.0
must: 0.0
satisfaction: 0.0
different: 0.0
the: 0.0
explore: 0.0
automatically: 0.0
used: 0.0
articles: 0.0
many: 0.0
department.Sentiment: 0.0
its: 0.0
data.Despite: 0.0
leveraging: 0.0
determining: 0.0
modeling.Text: 0.0
discussed: 0.0
privacy: 0.0
patterns: 0.0
competitive: 0.0
automating: 0.0
Since: 0.0
predefined: 0.0
reviews: 0.0
broadly: 0.0
challenges: 0.0
As: 0.0
text: 0.0
analytics: 0.0
account: 0.0
seeking: 0.0
them: 0.0
,: 0.0
in: 0.0
inform: 0.0
and: 0.0
meaningful: 0.0
launch.Topic: 0.0
This: 0.0
edge: 0.0
by: 0.0
analyzing: 0.0
have: 0.0
reliable.In: 0.0
behavior: 0.0
reputation: 0.0
data.In: 0.0
that: 0.0
platforms: 0.0
ambiguous: 0.0
techniques: 0.0
neutral: 0.0
example: 0.0
economy: 0.0
years: 0.0
considerations: 0.0
make: 0.0
online: 0.0
For: 0.0
increasingly: 0.0
By: 0.0
products.The: 0.0
likely: 0.0
Similarly: 0.0
growth: 0.0
amount: 0.0
company: 0.0
commonly: 0.0
take: 0.0
three: 0.0
these: 0.0
identifying: 0.0
see: 0.0
derived: 0
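
Every IDF and TF-IDF score above is 0.0 because the corpus contains a single document: each term appears in 1 of 1 documents, and log(1/1) = 0. IDF only becomes informative when there are several documents. The sketch below keeps the same formula, idf = log(N / df), but takes a list of tokenized documents. The function name `tf_idf_per_document` and the toy documents are assumptions for illustration. In the notebook, one option would be to treat each sentence as a document.

```python
import math
from collections import Counter

def tf_idf_per_document(docs):
    """Return one {term: tf-idf} dict per document, using idf = log(N / df)."""
    num_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(num_docs / count) for term, count in df.items()}
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the count divided by the document length
        scores.append({term: (freq / total) * idf[term] for term, freq in counts.items()})
    return scores

# Toy documents; in the notebook these could be the word tokens of each sentence,
# e.g. [nltk.word_tokenize(s) for s in tokens_sents].
docs = [
    ["text", "analytics", "finds", "patterns"],
    ["sentiment", "analysis", "scores", "text"],
]
scores = tf_idf_per_document(docs)
print(scores[0])
```

Here 'text' appears in both documents and still scores 0, while terms unique to one document get a positive score. Library implementations often use a smoothed IDF variant instead, so their numbers will differ from this plain formula.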