# 7) Text Analytics
1. Extract a sample document and apply the following preprocessing methods: tokenization, POS tagging, stop-word removal, stemming, and lemmatization.
2. Create a representation of the documents by calculating Term Frequency and Inverse Document Frequency.

# In this example, we first define the sample document. Then, we perform the following preprocessing steps:

1) Tokenization: Break the document into individual words or tokens using word_tokenize() function.
2) POS Tagging: Assign a part-of-speech tag to each token using pos_tag() function.
3) Stop Words Removal: Remove stop words from the tokens using a set of stop words from the NLTK corpus.
4) Stemming: Reduce each word to its stem by stripping suffixes with PorterStemmer(); the result need not be a dictionary word (for example, 'sampl').
5) Lemmatization: Map each word to its dictionary form (lemma) using WordNetLemmatizer().

Finally, we calculate the Term Frequency (TF) using FreqDist() from NLTK and read the Inverse Document Frequency (IDF) from the idf_ attribute of a fitted TfidfVectorizer() from scikit-learn.

Adjust the code as needed for your specific requirements and incorporate your own document data.
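
One adjustment worth considering: `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a part of speech is passed. The sketch below is one possible way to reuse the tags produced by `pos_tag()`. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical, and the tag mapping is a common simplification, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun, the lemmatizer's default


def lemmatize_with_pos(tagged_tokens):
    """Lemmatize (word, tag) pairs; requires the NLTK 'wordnet' resource."""
    # Imported lazily so the mapping above can be used without NLTK installed.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]
```

Passing the `(word, tag)` pairs from `pos_tag(tokens)` to `lemmatize_with_pos` lets a verb such as 'contains' be looked up as a verb, which the noun default would typically leave unchanged.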

In [17]:
!pip install nltk
import nltk
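# Uncomment on the first run to download the resources used below.
# Recent NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead of the first two names.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')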



In [21]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.tag import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample Document
document = """
This is a sample document. It contains multiple sentences and words. 
We will apply various preprocessing techniques on this document.
"""

# Tokenization
tokens = word_tokenize(document)

# POS Tagging
pos_tags = pos_tag(tokens)

# Stop Words Removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.casefold() not in stop_words]

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

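# Term Frequency (TF): raw count of each lemmatized token
# (punctuation tokens are still present at this point and are counted too)
tf = FreqDist(lemmatized_tokens)

# Inverse Document Frequency (IDF): one value per vocabulary term.
# TfidfVectorizer tokenizes and lower-cases the raw text itself,
# so it does not use the tokens produced above.
corpus = [document]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_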

# Print the results, leaving two blank lines between stages
print("Tokens:", tokens, end="\n\n\n")
print("POS Tags:", pos_tags, end="\n\n\n")
print("Filtered Tokens (Stop Words Removal):", filtered_tokens, end="\n\n\n")
print("Stemmed Tokens:", stemmed_tokens, end="\n\n\n")
print("Lemmatized Tokens:", lemmatized_tokens, end="\n\n\n")
# most_common() lists (token, count) pairs; printing the FreqDist
# object itself would show only a one-line summary
print("Term Frequency (TF):", tf.most_common(), end="\n\n\n")
print("Inverse Document Frequency (IDF):", idf)


Tokens: ['This', 'is', 'a', 'sample', 'document', '.', 'It', 'contains', 'multiple', 'sentences', 'and', 'words', '.', 'We', 'will', 'apply', 'various', 'preprocessing', 'techniques', 'on', 'this', 'document', '.']


POS Tags: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('document', 'NN'), ('.', '.'), ('It', 'PRP'), ('contains', 'VBZ'), ('multiple', 'JJ'), ('sentences', 'NNS'), ('and', 'CC'), ('words', 'NNS'), ('.', '.'), ('We', 'PRP'), ('will', 'MD'), ('apply', 'VB'), ('various', 'JJ'), ('preprocessing', 'VBG'), ('techniques', 'NNS'), ('on', 'IN'), ('this', 'DT'), ('document', 'NN'), ('.', '.')]


Filtered Tokens (Stop Words Removal): ['sample', 'document', '.', 'contains', 'multiple', 'sentences', 'words', '.', 'apply', 'various', 'preprocessing', 'techniques', 'document', '.']


Stemmed Tokens: ['sampl', 'document', '.', 'contain', 'multipl', 'sentenc', 'word', '.', 'appli', 'variou', 'preprocess', 'techniqu', 'document', '.']


Lemmatized Tokens: ['sample', 'docu

# In this result, you can see the different stages of document preprocessing:

1) Tokenization: The document is split into individual tokens.
2) POS Tagging: Each token is assigned a part-of-speech tag.
3) Filtered Tokens: Stop words are removed, leaving a reduced set of tokens. Punctuation tokens such as '.' are not stop words, so they remain.
4) Stemmed Tokens: The filtered tokens are stemmed using PorterStemmer.
5) Lemmatized Tokens: The filtered tokens are lemmatized using WordNetLemmatizer, which treats each word as a noun because no part of speech is passed.
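
The filtered tokens in the output still contain '.', because stop-word removal does not touch punctuation. A minimal sketch of dropping both in one pass follows. The small `STOP_WORDS` set and the `filter_tokens` helper are assumptions made for illustration; the code in this section uses `stopwords.words('english')` from NLTK instead.

```python
import string

# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"this", "is", "a", "it", "and", "we", "will", "on"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words and tokens made up only of punctuation characters."""
    kept = []
    for tok in tokens:
        if tok.casefold() in stop_words:
            continue  # stop word, compared case-insensitively
        if all(ch in string.punctuation for ch in tok):
            continue  # pure punctuation such as '.' or '...'
        kept.append(tok)
    return kept


tokens = ["This", "is", "a", "sample", "document", ".", "It", "contains",
          "multiple", "sentences", "and", "words", "."]
print(filter_tokens(tokens))
# ['sample', 'document', 'contains', 'multiple', 'sentences', 'words']
```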

Additionally, the Term Frequency (TF) is calculated using FreqDist(), which counts how often each token occurs among the lemmatized tokens. The Inverse Document Frequency (IDF) is also calculated; idf_ holds one value per vocabulary term, and in this case every value is 1.0 because the corpus contains only one document, so each term appears in all of it.
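
As a worked check, assuming scikit-learn's default `smooth_idf=True`, the documented formula is shown below, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. With a single document, every term has $n = 1$ and $\mathrm{df}(t) = 1$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\quad\Longrightarrow\quad
\mathrm{idf}(t)\Big|_{n = 1,\ \mathrm{df}(t) = 1} = \ln\frac{2}{2} + 1 = 1
$$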

Note that the sample document is small, so its tokens show little variety and almost no repetition. With a larger, multi-document corpus, the TF and IDF values become more informative.
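
To illustrate how IDF starts to differ once there are several documents, here is a standard-library sketch over a hypothetical three-document corpus. The corpus contents and the helper names `smoothed_idf` and `tf_idf` are assumptions for illustration. The IDF follows the smoothed form that scikit-learn documents as its default, but the sketch skips the L2 row normalization that `TfidfVectorizer` also applies by default, so the weights will not match its output exactly.

```python
import math
from collections import Counter

# Hypothetical corpus, already tokenized and lower-cased.
corpus = [
    ["sample", "document", "contains", "words"],
    ["another", "document", "contains", "sentences"],
    ["preprocessing", "techniques", "document"],
]


def smoothed_idf(docs):
    """IDF per term: ln((1 + n) / (1 + df)) + 1, df = number of docs containing the term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tf_idf(doc, idf):
    """Raw term counts in one document weighted by IDF (no length normalization)."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}


idf = smoothed_idf(corpus)
weights = tf_idf(corpus[0], idf)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Here 'document' occurs in every document and receives the lowest IDF (1.0), while terms unique to one document receive the highest.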

Adjust the code and incorporate your own document data to see the results for your specific