  ## <div align="center">Labsheet-05-Ranjit-Menon </div>

### Introduction
#### Implement the TF-IDF algorithm using the 'textblob' package. Use the files given in the folder. The output should be the TF-IDF value of each word.

In [1]:
#install necessary package
#!pip install textblob

In [2]:
import pandas as pd
from textblob import TextBlob as tb 
import math
import os
import string
from nltk.corpus import stopwords
import nltk

### Step 1 - loading file data into Dataset
**prepare_dataset()** 
This method lists every file under Lab5_Data, reads each file's content, and loads it into a pandas dataframe. The later steps reuse this dataframe whenever they need the data.

The dataframe has two columns: document, the name of the file, and content, the text of that file. The later steps work mostly with content.

In [3]:
#method to load the dataset with file content
def prepare_dataset() :
    folder_path = 'Lab5_Data' #path to folder

    # list of all files in the folder
    file_list = os.listdir(folder_path)

    dfs = []

    # Loop through each file, read its content, and create a DataFrame
    for file_name in file_list:
        file_path = os.path.join(folder_path, file_name)

        # Open and read the content of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

        # Create a DataFrame for the current file
        current_df = pd.DataFrame({'document': [file_name], 'content': [content]})

        # Append the DataFrame to a list, because appending directly to a DataFrame gave a deprecation error
        dfs.append(current_df)

    # Concatenate all DataFrames into a single DataFrame
    df = pd.concat(dfs, ignore_index=True)    
    return df


In [4]:
#call dataset preparation method
df = prepare_dataset()
df

Unnamed: 0,document,content
0,Pearl1.txt,"\n \n\nJohn Steinbeck\n\n \t\t\t\n\t\n\n ""..."
1,Pearl2.txt,\n\nJohn Steinbeck\nChapter 2\nJohn Steinbeck\...
2,Pearl3.txt,\n\nJohn Steinbeck\nChapter 3\nJohn Steinbeck\...
3,Pearl4.txt,\n\nJohn Steinbeck\nChapter 4\nJohn Steinbeck\...
4,Pearl5.txt,\n\nJohn Steinbeck\nChapter 5\nJohn Steinbeck\...
5,Pearl6.txt,\n\nJohn Steinbeck\nChapter 6\nJohn Steinbeck\...


### Step 2 - Cleaning up the corpus by removing stop words.
**remove_stopwords**
This method removes the stop words from the text. Removing them leaves more relevant words for the TF-IDF calculation and keeps the corpus small, because these words contribute little to the weighting of the other words.
We are using the **nltk library** to remove the stop words.

Stopwords typically include words from various parts of speech, such as:

**Articles:** e.g., "a," "an," "the"

**Conjunctions:** e.g., "and," "but," "or"

**Prepositions:** e.g., "in," "on," "at," "with"

**Pronouns:** e.g., "I," "you," "he," "she," "it," "we," "they"

**Auxiliary verbs:** e.g., "is," "am," "are," "was," "were," "be," "been," "have," "has," "had," "do," "does," "did"


In [5]:
nltk.download('stopwords')

# Function to remove stop words from text
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    #print(stop_words)
    filtered_words = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Iterate through the DataFrame and update 'content' column
for index, row in df.iterrows():
    df.at[index, 'content'] = remove_stopwords(row['content'])

# Display the updated DataFrame
print(df)

     document                                            content
0  Pearl1.txt  John Steinbeck "In town tell story great pearl...
1  Pearl2.txt  John Steinbeck Chapter 2 John Steinbeck town l...
2  Pearl3.txt  John Steinbeck Chapter 3 John Steinbeck town t...
3  Pearl4.txt  John Steinbeck Chapter 4 John Steinbeck wonder...
4  Pearl5.txt  John Steinbeck Chapter 5 John Steinbeck late m...
5  Pearl6.txt  John Steinbeck Chapter 6 John Steinbeck wind b...


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ranjit09\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
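One detail of the output above is worth a closer look: the first document still starts with `"In`, even though "in" is a stop word. Splitting on whitespace leaves punctuation attached to tokens, so `"In` does not match the stop list. The sketch below reproduces the effect and shows one possible workaround. It uses a tiny hypothetical stop list in place of `stopwords.words('english')`, and the function names are made up for illustration.

```python
import string

# Hypothetical miniature stop list, standing in for stopwords.words('english')
STOP_WORDS = {"the", "a", "in", "and", "is"}

def remove_stopwords_simple(text, stop_words=STOP_WORDS):
    """Whitespace-split filter, mirroring the approach used in this step."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def remove_stopwords_stripped(text, stop_words=STOP_WORDS):
    """Same filter, but each token is compared with edge punctuation stripped."""
    kept = [
        w for w in text.split()
        if w.strip(string.punctuation).lower() not in stop_words
    ]
    return " ".join(kept)

sample = 'The pearl is in the town, and "the" doctor waits.'
simple = remove_stopwords_simple(sample)      # the quoted "the" survives
stripped = remove_stopwords_stripped(sample)  # the quoted "the" is removed
print(simple)
print(stripped)
```

Another option, under the same assumptions, would be to run the punctuation step before the stop word step.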


### Step 3 - Cleaning up the corpus by removing punctuation
**remove_punctuation**
This method removes the punctuation from the text, so that the corpus contains only words that can be used for the TF-IDF calculation. By excluding punctuation we can prioritize meaningful words, avoid noise, and improve the accuracy of the text analysis.

Examples of punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In [10]:
def remove_punctuation(text):    
    blob = tb(text) #use TextBlob to convert this into document/corpus
    words_without_punct = [word for word in blob.words if word not in string.punctuation]
    text_without_punct = ' '.join(words_without_punct)
    return text_without_punct

# Iterate through the DataFrame and update 'content' column
for index, row in df.iterrows():
    df.at[index, 'content'] = remove_punctuation(row['content'])

# Display the updated DataFrame
print(df)

     document                                            content
0  Pearl1.txt  john steinbeck in town tell story great pearl ...
1  Pearl2.txt  john steinbeck chapter 2 john steinbeck town l...
2  Pearl3.txt  john steinbeck chapter 3 john steinbeck town t...
3  Pearl4.txt  john steinbeck chapter 4 john steinbeck wonder...
4  Pearl5.txt  john steinbeck chapter 5 john steinbeck late m...
5  Pearl6.txt  john steinbeck chapter 6 john steinbeck wind b...
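Two side notes on the punctuation check, sketched with the standard library only. First, `word not in string.punctuation` is a substring test against one long string, not a per-character membership test; converting the string to a set makes the intent explicit. Second, `str.translate` is a possible alternative that deletes punctuation characters without tokenizing. The helper name `strip_punctuation` is an assumption for this example and not part of the notebook.

```python
import string

# `x in some_string` checks for a substring, so "()" counts as "punctuation"
is_substring = "()" in string.punctuation   # True: "()" occurs inside the string

# A set contains only the individual characters
PUNCT = set(string.punctuation)
is_member = "()" in PUNCT                   # False: only single characters are members

def strip_punctuation(text):
    """Delete every ASCII punctuation character using str.translate."""
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = strip_punctuation('"In the town," they tell the story.')
print(cleaned)
```

Note that this alternative also deletes apostrophes and hyphens inside words, so it does not tokenize the same way as `blob.words`.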


### Step 4 - Helper method for tf-idf calculation
**tf**
The method calculates term frequency (TF) by dividing the number of times a word occurs in a text blob by the total number of words in that blob.

![Alt Text](image/tf.png)

**idf**
The method calculates Inverse Document Frequency (IDF) by taking the logarithm of the ratio of the total number of documents to the number of documents containing a specific word. The code adds one to the denominator, a common guard against division by zero.

![Alt Text](image/idf.png)

**tfidf**
The method computes the Term Frequency-Inverse Document Frequency (TF-IDF) score for a word in a document by multiplying the term frequency (tf) by the Inverse Document Frequency (idf).
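Written out, the three helpers in this step compute the quantities below. The notation is introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of words in $d$, $|D|$ is the number of documents, and $n_t$ is the number of documents that contain $t$. The $1$ in the denominator and the natural logarithm follow the code in this step.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}
\qquad
\mathrm{idf}(t, D) = \ln\!\left(\frac{|D|}{1 + n_t}\right)
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t, D)
```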


In [7]:
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

#this method returns the number of documents in bloblist that contain the word
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
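To see how these helpers behave, here is a minimal sketch on three tiny made-up documents, using plain word lists instead of TextBlob objects. The `_list` function names and the toy data are assumptions for this example. With the `1 +` in the denominator, a word found in every document gets a negative idf, and a word found in all but one document gets exactly zero.

```python
import math

# Three tiny hypothetical "documents", already split into word lists
docs = [
    ["pearl", "doctor", "scorpion", "pearl"],
    ["pearl", "oyster", "basket", "doctor"],
    ["pearl", "dealer", "coin", "buyer"],
]

def tf_list(word, doc):
    # Share of the document's words that are `word`
    return doc.count(word) / len(doc)

def n_containing_list(word, docs):
    # Number of documents that contain `word` at least once
    return sum(1 for doc in docs if word in doc)

def idf_list(word, docs):
    return math.log(len(docs) / (1 + n_containing_list(word, docs)))

def tfidf_list(word, doc, docs):
    return tf_list(word, doc) * idf_list(word, docs)

idf_all = idf_list("pearl", docs)      # in 3 of 3 documents: log(3/4), negative
idf_most = idf_list("doctor", docs)    # in 2 of 3 documents: log(3/3), zero
idf_rare = idf_list("scorpion", docs)  # in 1 of 3 documents: log(3/2), positive
score = tfidf_list("scorpion", docs[0], docs)  # (1/4) * log(3/2)
print(idf_all, idf_most, idf_rare, score)
```

This is why words that occur in every chapter, however frequent, cannot reach the top of the rankings in the following steps.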

### Step 5 - Calculating the TF-IDF score - Without lowercase
The code below builds a TextBlob for each document in the dataframe, calculates the TF-IDF score of every word in it, sorts the scores in descending order, and prints the five highest-scoring words for each document.

#### In the output below, the words Scorpion and scorpion appear separately with different scores, and trackers and Trackers are likewise listed separately. This changes how the scores are assigned, because words that differ only in case are treated as different words. The next step repeats the run on lowercased text.

In [8]:
# Create TextBlob objects and add them to the bloblist
bloblist = [tb(content) for content in df['content']]

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    #print(scores.items())
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

Top words in document 1
	Word: Scorpion, TF-IDF: 0.0065
	Word: doctor, TF-IDF: 0.0048
	Word: scorpion, TF-IDF: 0.0041
	Word: tail, TF-IDF: 0.00295
	Word: Enemy, TF-IDF: 0.00295
Top words in document 2
	Word: oysters, TF-IDF: 0.01054
	Word: oyster, TF-IDF: 0.00958
	Word: basket, TF-IDF: 0.00862
	Word: Might, TF-IDF: 0.00766
	Word: shell, TF-IDF: 0.00544
Top words in document 3
	Word: Pearl, TF-IDF: 0.00678
	Word: doctor, TF-IDF: 0.00448
	Word: News, TF-IDF: 0.00368
	Word: Come, TF-IDF: 0.00232
	Word: priest, TF-IDF: 0.00221
Top words in document 4
	Word: Pearl, TF-IDF: 0.0099
	Word: One, TF-IDF: 0.00767
	Word: dealer, TF-IDF: 0.00587
	Word: Another, TF-IDF: 0.00419
	Word: Let, TF-IDF: 0.00344
Top words in document 5
	Word: god, TF-IDF: 0.00301
	Word: bring, TF-IDF: 0.00301
	Word: Quietly, TF-IDF: 0.00226
	Word: pathway, TF-IDF: 0.00226
	Word: boat, TF-IDF: 0.00222
Top words in document 6
	Word: Little, TF-IDF: 0.01928
	Word: trackers, TF-IDF: 0.00601
	Word: Trackers, TF-IDF: 0.00601
	Wo


### Step 6 - Calculating the TF-IDF score - With lowercase
The code below repeats the previous calculation on lowercased text: it builds a TextBlob for each document, calculates the TF-IDF score of every word, sorts the scores in descending order, and prints the five highest-scoring words for each document.

#### Now let us see how the results change when we convert the text to lowercase, since the run above treats the same word in different cases as separate words.

In [9]:
df['content'] = df['content'].str.lower() #convert the data into lowercase

bloblist = [tb(content) for content in df['content']]

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    #print(scores.items())
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

Top words in document 1
	Word: doctor, TF-IDF: 0.0048
	Word: scorpion, TF-IDF: 0.0041
	Word: tail, TF-IDF: 0.00295
	Word: hanging, TF-IDF: 0.00261
	Word: rope, TF-IDF: 0.00236
Top words in document 2
	Word: oysters, TF-IDF: 0.01054
	Word: oyster, TF-IDF: 0.00958
	Word: basket, TF-IDF: 0.00862
	Word: shell, TF-IDF: 0.00544
	Word: shells, TF-IDF: 0.00479
Top words in document 3
	Word: doctor, TF-IDF: 0.00448
	Word: priest, TF-IDF: 0.00221
	Word: school, TF-IDF: 0.00184
	Word: books, TF-IDF: 0.00184
	Word: capsule, TF-IDF: 0.00184
Top words in document 4
	Word: dealer, TF-IDF: 0.00587
	Word: buyer, TF-IDF: 0.00335
	Word: coin, TF-IDF: 0.00335
	Word: crowd, TF-IDF: 0.00293
	Word: offer, TF-IDF: 0.00251
Top words in document 5
	Word: bring, TF-IDF: 0.00301
	Word: pathway, TF-IDF: 0.00226
	Word: boat, TF-IDF: 0.00222
	Word: moon, TF-IDF: 0.00194
	Word: wind, TF-IDF: 0.00194
Top words in document 6
	Word: trackers, TF-IDF: 0.00601
	Word: pool, TF-IDF: 0.00474
	Word: cave, TF-IDF: 0.00348
	Wor
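The loops in Steps 5 and 6 call the idf helper for every word, and each call scans all documents again. One possible variation, sketched below on plain word lists rather than TextBlob objects, counts document frequencies once with `collections.Counter` and returns the score of every word, not only the top five. The function name `tfidf_all` and the toy data are assumptions for this example; the formula is the one used by the helper methods in Step 4.

```python
import math
from collections import Counter

def tfidf_all(docs):
    """Return one {word: score} dict per document, for pre-tokenized docs."""
    n_docs = len(docs)
    # Document frequency: count each word once per document it appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)   # raw term counts for this document
        total = len(doc)        # document length in words
        results.append({
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return results

# Tiny hypothetical corpus, already split into word lists
docs = [
    ["pearl", "doctor", "scorpion", "pearl"],
    ["pearl", "oyster", "basket", "doctor"],
    ["pearl", "dealer", "coin", "buyer"],
]

scores = tfidf_all(docs)
# Two highest-scoring words of the first document
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```

With the notebook's data, the word lists could be taken from `blob.words` for each blob in `bloblist`.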