
## 1. TF-IDF

Implement TF-IDF using Python, NumPy, pandas, and any text-cleaning library you need.

TF-IDF is the product of two statistics: term frequency and inverse document frequency. There are several ways to define each statistic; you can use the following formulas.

### Term Frequency
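$$tf_{t,d} = \log_{10}(\text{count}(t,d) + 1)$$

* $tf_{t,d}$ is the log-scaled frequency of the term $t$ in the document $d$
* $\text{count}(t,d)$ is the number of times the term $t$ occurs in the document $d$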
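
As a minimal sketch of the term-frequency formula above, the snippet below computes the log-scaled TF for a single document that has already been split into tokens. The function name `log_scaled_tf` and the sample tokens are assumptions made for illustration; they are not part of the required interface.

```python
from collections import Counter
from math import log10

def log_scaled_tf(tokens):
    """Return {term: log10(count + 1)} for one tokenized document (hypothetical helper)."""
    counts = Counter(tokens)
    return {term: log10(count + 1) for term, count in counts.items()}

# Illustrative tokens: "document" occurs twice, every other term once
tf = log_scaled_tf(["this", "document", "is", "the", "second", "document"])
print(tf["document"])  # log10(2 + 1)
print(tf["second"])    # log10(1 + 1)
```

Terms that do not occur in the document are simply absent from the dictionary, which corresponds to $\log_{10}(0 + 1) = 0$.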

### Inverse Document Frequency
$$idf_t = \log_{10}\left(\frac{N}{df_t}\right)$$

* $N$ is the total number of documents
* $df_t$ is the number of documents in which the term $t$ occurs
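
The IDF formula can be sketched in the same way. The helper name `inverse_document_frequency` and the three toy documents are hypothetical. Each document is converted to a set so that a term is counted at most once per document, which is what $df_t$ requires.

```python
from collections import Counter
from math import log10

def inverse_document_frequency(tokenized_docs):
    """Return {term: log10(N / df_t)} for a list of tokenized documents (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() ensures repeated words in one document add only 1 to df_t
        doc_freq.update(set(tokens))
    return {term: log10(n_docs / df) for term, df in doc_freq.items()}

# Illustrative corpus: "the" occurs in all three documents, "cat" in two, "dog" in one
toy_docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
idf = inverse_document_frequency(toy_docs)
print(idf["the"])  # log10(3 / 3), so a term found in every document gets weight 0
print(idf["cat"])  # log10(3 / 2)
print(idf["dog"])  # log10(3 / 1)
```

A term that appears in every document has an IDF of zero, so it contributes nothing to the final TF-IDF weight no matter how often it occurs.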

### TF-IDF
$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t $$
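
To make the product concrete, here is a small worked example under assumed numbers: a term that occurs twice in a document and appears in 3 of 4 documents overall. The function `tfidf_weight` is a hypothetical name used only for this illustration.

```python
from math import log10

def tfidf_weight(count, n_docs, doc_freq):
    """Combine the two formulas above for one (term, document) pair (hypothetical helper)."""
    tf = log10(count + 1)           # log-scaled term frequency
    idf = log10(n_docs / doc_freq)  # inverse document frequency
    return tf * idf

# Assumed values: count(t, d) = 2, N = 4, df_t = 3
weight = tfidf_weight(count=2, n_docs=4, doc_freq=3)
print(weight)  # log10(3) * log10(4 / 3)
```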

### What is expected?
Your implementation should include the following two functions:
 * `compute_tfidf_weights(train_docs)`
 * `word_tfidf_vector(word, tf_df, idf_df)`

To review TF-IDF, see the lecture notes and the further reading under Week 7.


In [None]:
import numpy as np
import pandas as pd
import re
from collections import Counter
from math import log10

def clean_text(text):
    """
    Clean the input text by converting it to lowercase and removing special characters and numbers.

    Parameters:
    text (str): The input text to be cleaned.

    Returns:
    str: The cleaned text.
    """
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

def compute_tf(train_docs):
    """
    Compute the term frequency (TF) matrix for a list of documents.

    Parameters:
    train_docs (list of str): The list of documents for which TF is to be computed.

    Returns:
    list of dict: List of dictionaries where each dictionary contains the TF values for words in a document.
    """
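    tf_values = []
    for doc in train_docs:
        # Tokenize once and reuse the tokens for both the counts and the length
        tokens = clean_text(doc).split()
        word_count = Counter(tokens)
        total_words = len(tokens)
        # Note: this is relative frequency (count / document length), one common
        # TF variant; it is not the log-scaled formula given in the description.
        tf_doc = {word: count / total_words for word, count in word_count.items()}
        tf_values.append(tf_doc)
    return tf_values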

def compute_idf(train_docs):
    """
    Compute the inverse document frequency (IDF) matrix for a list of documents.

    Parameters:
    train_docs (list of str): The list of documents for which IDF is to be computed.

    Returns:
    dict: Dictionary containing IDF values for words in the document corpus.
    """
    N = len(train_docs)
    # Document frequency: the number of documents that contain each word
    doc_freq = Counter()
    for doc in train_docs:
        # A set counts each word at most once per document
        doc_freq.update(set(clean_text(doc).split()))
    idf = {word: log10(N / df) for word, df in doc_freq.items()}
    return idf

def compute_tfidf_weights(train_docs):
    """
    Compute TF-IDF weights for a list of documents.

    Parameters:
    train_docs (list of str): The list of documents for which TF-IDF weights are to be computed.

    Returns:
    pandas.DataFrame: DataFrame containing TF values for words in each document.
    pandas.DataFrame: DataFrame containing IDF values for words in the document corpus.
    """
    # Compute TF values for each document
    tf_values = compute_tf(train_docs)
    # Compute IDF values for words in the document corpus
    idf_values = compute_idf(train_docs)
    # Convert TF values to DataFrame and fill NaN values with 0
    docs_tf = pd.DataFrame(tf_values).fillna(0)
    # Convert IDF values to DataFrame
    docs_idf = pd.DataFrame(idf_values.values(), index=idf_values.keys(), columns=['idf'])
    return docs_tf, docs_idf

def word_tfidf_vector(word, tf_df, idf_df):
    """
    Compute the TF-IDF vector for a specific word.

    Parameters:
    word (str): The word for which TF-IDF vector is to be computed.
    tf_df (pandas.DataFrame): DataFrame containing TF values for words in each document.
    idf_df (pandas.DataFrame): DataFrame containing IDF values for words in the document corpus.

    Returns:
    numpy.ndarray: Array containing TF-IDF vector for the given word.
    """
    # Retrieve TF values for the given word
    tf_values = tf_df[word]
    # Retrieve IDF value for the given word
    idf_value = idf_df.loc[word, 'idf']
    # Compute TF-IDF vector for the word
    tf_idf_value = np.array(tf_values * idf_value)
    return tf_idf_value

# Test cases
train_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Compute TF-IDF weights for the train documents
docs_tf, docs_idf = compute_tfidf_weights(train_docs)
print("TF DataFrame:")
print(docs_tf)
print("\nIDF DataFrame:")
print(docs_idf)

# Test TF-IDF vector computation for a specific word
word = 'document'
tf_idf_vector = word_tfidf_vector(word, docs_tf, docs_idf)
print(f"\nTF-IDF vector for '{word}':")
print(tf_idf_vector)


TF DataFrame:
       this        is       the  first  document    second       and  \
0  0.200000  0.200000  0.200000    0.2  0.200000  0.000000  0.000000   
1  0.166667  0.166667  0.166667    0.0  0.333333  0.166667  0.000000   
2  0.166667  0.166667  0.166667    0.0  0.000000  0.000000  0.166667   
3  0.200000  0.200000  0.200000    0.2  0.200000  0.000000  0.000000   

      third       one  
0  0.000000  0.000000  
1  0.000000  0.000000  
2  0.166667  0.166667  
3  0.000000  0.000000  

IDF DataFrame:
               idf
first     0.301030
the       0.000000
is        0.000000
document  0.124939
this      0.000000
second    0.602060
one       0.602060
third     0.602060
and       0.602060

TF-IDF vector for 'document':
[0.02498775 0.04164625 0.         0.02498775]
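
The vector above covers a single word. As a hedged sketch of how every word could be weighted at once, the TF table can be multiplied by the IDF values column by column. This assumes the same layout that `compute_tfidf_weights` returns (one TF column per word, and an `idf` column indexed by word). The small tables below are rebuilt by hand from the first two rows of the printed output, using the rounded values shown there, so that the snippet runs on its own.

```python
import pandas as pd

# Rounded values copied from the printed TF and IDF tables (documents 0 and 1 only)
tf_df = pd.DataFrame({"first": [0.2, 0.0], "document": [0.2, 0.333333]})
idf_df = pd.DataFrame({"idf": [0.301030, 0.124939]}, index=["first", "document"])

# Align the IDF Series with the TF columns by word and multiply element-wise
tfidf_df = tf_df.mul(idf_df["idf"], axis="columns")
print(tfidf_df)
```

Each row of `tfidf_df` is then the TF-IDF vector of one document, and each column matches what `word_tfidf_vector` returns for that word.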
