# Practical 7: Text Analytics
1. Extract a sample document and apply the following document preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.

2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.

**Tokenization:** Splits the document into individual words or tokens. We use the word_tokenize() function from NLTK for this purpose.

**POS Tagging:** Assigns a part-of-speech tag to each token in the document. We use the pos_tag() function from NLTK for this.

**Stopwords Removal:** Removes common words like 'is', 'the', 'a', etc., which do not carry significant meaning. We use NLTK's built-in list of stopwords and a list comprehension to filter them out.

**Stemming:** Reduces words to their root or base form by stripping suffixes with heuristic rules, so the result is not always a valid word (for example, 'language' becomes 'languag'). We use the Porter Stemmer from NLTK for this.

**Lemmatization:** Similar to stemming, but instead of just removing suffixes, it returns a valid word (the lemma). It uses vocabulary and morphological analysis to achieve this. We use the WordNet Lemmatizer from NLTK for this.
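One detail worth illustrating: NLTK's WordNet Lemmatizer treats every word as a noun unless it is told otherwise, so feeding it the POS tags usually gives better lemmas. The sketch below is a minimal helper, under the assumption that the tags come from pos_tag() (Penn Treebank tags). The function name `penn_to_wordnet` is hypothetical; the single-letter return values are the ones WordNet uses for adjective, verb, adverb, and noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default: noun, which is also the lemmatizer's own default


# A few tags taken from typical pos_tag() output
for tag in ("JJ", "VBZ", "RB", "NNS"):
    print(tag, "->", penn_to_wordnet(tag))
```

With a working WordNet corpus, the result can be passed as the second argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair returned by pos_tag().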

In [1]:
#pip install nltk

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet


In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [14]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data] Error downloading 'omw-1.4' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/corpora/omw-1.4.zip>:   [WinError
[nltk_data]     10053] An established connection was aborted by the
[nltk_data]     software in your host machine


False

In [5]:
# Sample document
document = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."


In [6]:
# Tokenization
tokens = word_tokenize(document)
print("Tokenization:")
print(tokens)

Tokenization:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']


In [7]:
# Part of Speech (POS) Tagging
pos_tags = pos_tag(tokens)
print("\nPOS Tagging:")
print(pos_tags)


POS Tagging:
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('linguistics', 'NNS'), (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('language', 'NN'), (',', ','), ('in', 'IN'), ('particular', 'JJ'), ('how', 'WRB'), ('to', 'TO'), ('program', 'NN'), ('computers', 'NNS'), ('to', 'TO'), ('process', 'VB'), ('and', 'CC'), ('analyze', 'VB'), ('large', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]


In [8]:
# Stopwords removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("\nStopwords Removal:")
print(filtered_tokens)


Stopwords Removal:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'linguistics', ',', 'computer', 'science', ',', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', 'language', ',', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', '.']
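Note that punctuation tokens such as '(' and ',' survive the stopword filter, because they are not in the stopword list. If they are unwanted, one possible refinement is to keep only alphabetic tokens. The sketch below uses a small hand-written stopword set purely for illustration (NLTK's English list is much larger), and the helper name `clean_tokens` is hypothetical.

```python
# Illustrative subset only; stopwords.words('english') from NLTK is much larger
STOP_WORDS = {"is", "a", "of", "and", "with", "the", "between", "in", "how", "to"}

# First few tokens from the tokenization output above
tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',
          'subfield', 'of', 'linguistics', ',']


def clean_tokens(tokens, stop_words):
    """Drop stopwords and any token that is not purely alphabetic."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


cleaned = clean_tokens(tokens, STOP_WORDS)
print(cleaned)
```

The `isalpha()` test also drops tokens that contain digits or hyphens, so it may be too aggressive for text where such tokens matter.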


In [9]:
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("\nStemming:")
print(stemmed_tokens)


Stemming:
['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'linguist', ',', 'comput', 'scienc', ',', 'artifici', 'intellig', 'concern', 'interact', 'comput', 'human', 'languag', ',', 'particular', 'program', 'comput', 'process', 'analyz', 'larg', 'amount', 'natur', 'languag', 'data', '.']


In [21]:
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\nLemmatization:")
print(lemmatized_tokens)

BadZipFile: File is not a zip file

In [22]:
import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download() reports a failed download by returning False
# (it does not raise by default), so check the return value
if not nltk.download('wordnet'):
    print("Error downloading WordNet corpus")

# Small test list of tokens to lemmatize (overwrites the earlier filtered_tokens)
filtered_tokens = ["running", "cars", "better", "mice"]

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

# Print the lemmatized tokens
print("\nLemmatization:")
print(lemmatized_tokens)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\swati\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


BadZipFile: File is not a zip file
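A plausible cause of this BadZipFile error, given the aborted omw-1.4 download in the log above, is a truncated .zip file left in the nltk_data directory. This is an assumption, not something the output confirms. The sketch below shows one way to look for such files using only the standard library; `find_corrupt_zips` is a hypothetical helper, and the demo runs on a temporary directory rather than on the real nltk_data folder.

```python
import os
import tempfile
import zipfile


def find_corrupt_zips(root):
    """Return paths of *.zip files under root that are not valid zip archives."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".zip") and not zipfile.is_zipfile(path):
                bad.append(path)
    return sorted(bad)


# Demo on a throwaway directory: one valid archive, one truncated "download"
with tempfile.TemporaryDirectory() as root:
    with zipfile.ZipFile(os.path.join(root, "good.zip"), "w") as zf:
        zf.writestr("hello.txt", "hello")
    with open(os.path.join(root, "broken.zip"), "wb") as fh:
        fh.write(b"truncated download")

    corrupt_names = [os.path.basename(p) for p in find_corrupt_zips(root)]
    print(corrupt_names)
```

To apply this to the real installation, `root` would be the nltk_data directory shown in the download messages (the search paths are also listed in `nltk.data.path`). After deleting any flagged files, re-running `nltk.download('wordnet')` and `nltk.download('omw-1.4')` on a stable connection should fetch clean copies.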

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (a corpus). The weight increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word appears across the corpus (data set).

Terminology:

Term Frequency: In document d, the term frequency is the number of occurrences of a given term t. A term is therefore more relevant to a document the more often it appears in it. Since the ordering of terms is not significant, the document can be described by a vector in the bag-of-words model: for each distinct term in the document, there is an entry whose value is the term frequency.
The weight of a term that occurs in a document is simply proportional to the term frequency.

tf(t, d) = (count of t in d) / (number of words in d)

Document Frequency: This measures the importance of a term across the whole corpus and is closely related to TF. The only difference is that TF counts the occurrences of a term t within a single document d, while DF counts the documents in the corpus of N documents that contain t. In other words, DF is the number of documents in which the term is present.

df(t) = number of documents containing t

Inverse Document Frequency: IDF measures how informative a term is. The key aim of a search is to locate the relevant documents that match the query. Because tf treats all terms as equally significant, term frequencies alone are not enough to weight a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

df(t) = N(t)
where
df(t) = Document frequency of a term t
N(t) = Number of documents containing the term t


Term frequency is the number of occurrences of a term in a single document only, whereas document frequency is the number of separate documents in which the term appears, so it depends on the entire corpus. Now let's look at the definition of inverse document frequency. The IDF of a term is the number of documents in the corpus, N, divided by the document frequency of the term.

idf(t) = N / df(t) = N / N(t)

A more common term should be considered less significant, but this raw ratio (typically a large number for a big corpus) penalizes common terms too harshly. We therefore take the logarithm (with base 2) of the inverse document frequency. So the idf of the term t becomes:

idf(t) = log2(N / df(t))

Computation: tf-idf is one of the most widely used metrics for determining how significant a term is to a document in a corpus. It is a weighting scheme that assigns a weight to each word in a document based on its term frequency (tf) and its inverse document frequency (idf). Words with higher weights are deemed more significant.
Usually, the tf-idf weight consists of two terms:

1. Normalized Term Frequency (tf)
2. Inverse Document Frequency (idf)

tf-idf(t, d) = tf(t, d) * idf(t)
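The formulas above can be sketched directly in plain Python. This is a minimal illustration, assuming documents are already tokenized into lists of words. The three-document corpus and the function names are made up for the example, and the base-2 logarithm follows the description above.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: count of term in the document / number of words in it."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log2(N / df(t)).

    Raises ZeroDivisionError if the term appears in no document.
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log2(len(corpus) / df)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of three tokenized documents
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

for term in ("the", "cat", "mat"):
    print(term, ":", tf_idf(term, corpus[0], corpus))
```

In this toy corpus, 'the' appears in every document, so its idf is log2(3/3) = 0 and its tf-idf weight is zero however often it occurs. 'mat' appears in only one document and therefore receives the highest weight of the three terms.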

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample document
document = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# List of documents (only one document here, so every term gets the same idf)
documents = [document]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform the documents into TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create a dictionary to store TF-IDF values for each word
tfidf_representation = {}

# Iterate over the features and their TF-IDF values
for i in range(len(documents)):
    feature_index = tfidf_matrix[i,:].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
    for word_index, score in tfidf_scores:
        word = feature_names[word_index]
        tfidf_representation[word] = score

# Print TF-IDF representation
print("TF-IDF Representation:")
for word, score in tfidf_representation.items():
    print(word, ":", score)

TF-IDF Representation:
data : 0.13130643285972254
amounts : 0.13130643285972254
large : 0.13130643285972254
analyze : 0.13130643285972254
process : 0.13130643285972254
program : 0.13130643285972254
to : 0.2626128657194451
how : 0.13130643285972254
particular : 0.13130643285972254
in : 0.13130643285972254
human : 0.13130643285972254
computers : 0.2626128657194451
between : 0.13130643285972254
interactions : 0.13130643285972254
the : 0.13130643285972254
with : 0.13130643285972254
concerned : 0.13130643285972254
intelligence : 0.13130643285972254
artificial : 0.13130643285972254
and : 0.39391929857916763
science : 0.13130643285972254
computer : 0.13130643285972254
linguistics : 0.13130643285972254
of : 0.2626128657194451
subfield : 0.13130643285972254
is : 0.13130643285972254
nlp : 0.13130643285972254
processing : 0.13130643285972254
language : 0.39391929857916763
natural : 0.2626128657194451
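These scores differ from what the formulas above would give, because TfidfVectorizer by default uses a smoothed idf, ln((1 + N) / (1 + df(t))) + 1, and then scales each document vector to unit Euclidean (L2) length. With a single document, the idf is 1 for every term, so the scores reduce to normalized raw counts. The sketch below reproduces the numbers in plain Python under those assumed defaults (lowercasing, tokens of two or more word characters); it is a check on the output, not a reimplementation of scikit-learn.

```python
import math
import re
from collections import Counter

document = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# Assumed default tokenization: lowercase, keep runs of 2+ word characters
tokens = re.findall(r"\b\w\w+\b", document.lower())
counts = Counter(tokens)

# Smoothed idf with N = 1 document and df = 1 for every term: ln(2/2) + 1 = 1
n_docs = 1
idf = math.log((1 + n_docs) / (1 + 1)) + 1

# L2-normalize the weighted counts
norm = math.sqrt(sum((c * idf) ** 2 for c in counts.values()))
scores = {term: c * idf / norm for term, c in counts.items()}

for term in ("language", "computers", "data"):
    print(term, ":", scores[term])
```

Terms that occur once, twice, or three times get 1/sqrt(58), 2/sqrt(58), and 3/sqrt(58) respectively, which matches the three distinct values in the output. To see idf actually discriminate between terms, the vectorizer needs to be fitted on several documents.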
