
The Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

Here's how to calculate TF-IDF:

Term Frequency (TF): This measures how frequently a term occurs in a document. Since documents vary in length, the raw count is usually normalized by dividing it by the total number of terms in the document:

TF(t,d)= (Number of times term t appears in a document d)/(Total number of terms in the document d)
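As a minimal sketch of this ratio, assuming a document that has already been split into tokens, the count of a term can be divided by the document length. The function name `term_frequency` and the toy sentence are illustrative and not part of the original notebook.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Relative frequency of `term` in a tokenized document."""
    if not tokens:
        # An empty document has no terms; return 0.0 instead of dividing by zero.
        return 0.0
    return Counter(tokens)[term] / len(tokens)


# Toy document (hypothetical): "the" occurs 2 times among 6 tokens.
doc = "the cat sat on the mat".split()
tf_the = term_frequency("the", doc)
print(tf_the)
```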

Inverse Document Frequency (IDF): This measures how informative a term is across the corpus. TF alone treats all terms as equally important, yet common words such as "is", "of", and "that" may appear many times while carrying little meaning. We therefore weight down terms that occur in many documents and scale up rare ones by computing the following:

IDF(t,D)= log((Total number of documents in D)/(Number of documents containing term t))
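The IDF formula can be sketched in plain Python as follows, assuming the corpus is a list of tokenized documents. The helper name `inverse_document_frequency` and the three toy sentences are hypothetical. The sketch uses the natural logarithm and returns 0 for a term that occurs in no document, which is one possible convention for avoiding a division by zero.

```python
import math


def inverse_document_frequency(term, corpus):
    """Natural log of (number of documents / documents containing `term`)."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    if n_containing == 0:
        # Assumed convention: an unseen term gets IDF 0 rather than raising an error.
        return 0.0
    return math.log(len(corpus) / n_containing)


# Toy corpus (hypothetical): "the" is in 2 of 3 documents, "dog" in 1 of 3.
corpus = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "a cat purred".split(),
]
idf_the = inverse_document_frequency("the", corpus)
idf_dog = inverse_document_frequency("dog", corpus)
print(idf_the, idf_dog)
```

The rarer term ("dog") receives the larger weight, which is the behavior the formula is designed to produce.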

TF-IDF: This is simply the TF multiplied by IDF:

TFIDF(*t,d,D*)=TF(*t,d*)×IDF(*t,D*)
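For reference, the three definitions can be written compactly. The symbols are assumptions introduced here for brevity: f_{t,d} is the number of times term t occurs in document d, N is the number of documents in D, and n_t is the number of documents that contain t. The natural logarithm is used because the code in this notebook calls `np.log`.

```latex
\mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t,D) = \ln\frac{N}{n_t}, \qquad
\mathrm{TFIDF}(t,d,D) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t,D)
```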

Let's calculate the TF-IDF for the following term-document pairs:
- TFIDF(*t3,d2*)
- TFIDF(*t6,d4*)
- TFIDF(*t7,d5*)

First, we calculate the TF for each term in its respective document. This requires the total number of term occurrences in each of *d2, d4*, and *d5*, which serves as the denominator when computing the TF for *t3, t6*, and *t7* respectively.

In [2]:
import pandas as pd
import numpy as np

# Raw term counts from the image: one list per term, one entry per document (d1..d5)
data = {
    't1': [0, 5, 15, 22, 0],
    't2': [4, 19, 0, 3, 7],
    't3': [10, 7, 0, 12, 0],
    't4': [8, 16, 4, 0, 9],
    't5': [0, 0, 9, 5, 2],
    't6': [5, 0, 0, 15, 4],
    't7': [0, 32, 17, 0, 12]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data, index=['d1', 'd2', 'd3', 'd4', 'd5'])

# Calculate the total number of terms in each document
df['total_terms'] = df.sum(axis=1)

# Calculate the TF for the specified document-term pairs
tf_d2_t3 = df.at['d2', 't3'] / df.at['d2', 'total_terms']
tf_d4_t6 = df.at['d4', 't6'] / df.at['d4', 'total_terms']
tf_d5_t7 = df.at['d5', 't7'] / df.at['d5', 'total_terms']

tf_d2_t3, tf_d4_t6, tf_d5_t7

(0.08860759493670886, 0.2631578947368421, 0.35294117647058826)

The Term Frequency (TF) for the specified term-document pairs is as follows:

- TF(*t3,d2*): 7/79, approximately 0.0886
- TF(*t6,d4*): 15/57, approximately 0.2632
- TF(*t7,d5*): 12/34, approximately 0.3529
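The same normalization can be applied to every cell at once. The following is a sketch using a plain NumPy array rather than the notebook's DataFrame; the array is a transcription of the counts above, with rows as documents and columns as terms, and the variable names are illustrative.

```python
import numpy as np

# Counts copied from the notebook's table: rows are d1..d5, columns are t1..t7.
counts = np.array([
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
])

# Row totals with shape (5, 1), so the division broadcasts across the columns.
doc_lengths = counts.sum(axis=1, keepdims=True)
tf = counts / doc_lengths

print(tf[1, 2])  # TF of t3 in d2 (zero-based row 1, column 2)
```

Each row of `tf` sums to 1, which is a quick sanity check that the normalization was applied along the right axis.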


Next, we calculate the Inverse Document Frequency (IDF) for these terms. To do this, we count the number of documents that contain each of *t3, t6,* and *t7*, then apply the IDF formula.

In [4]:
# Calculate the number of documents (D)
total_documents = df.shape[0]  # Total number of documents

# Calculate the number of documents containing each term
num_docs_with_term = (df.drop(columns='total_terms') > 0).sum()

# Calculate the IDF for the specified terms
idfs = {}
for term in ['t3', 't6', 't7']:
    num_docs_with_term_t = num_docs_with_term[term]
    idfs[term] = np.log(total_documents / num_docs_with_term_t) if num_docs_with_term_t else 0

idfs

{'t3': 0.5108256237659907, 't6': 0.5108256237659907, 't7': 0.5108256237659907}

The Inverse Document Frequency (IDF) for each term is as follows. The values are identical because each of the three terms appears in exactly three of the five documents, so each IDF equals ln(5/3) (`np.log` is the natural logarithm):

- IDF(*t3*): Approximately 0.5108
- IDF(*t6*): Approximately 0.5108
- IDF(*t7*): Approximately 0.5108
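To see how the remaining terms compare, the document frequency and IDF can be computed for all seven columns in one pass. This is a sketch on a NumPy transcription of the same counts; the masked assignment is an assumed way of mirroring the notebook's guard against terms that occur in no document.

```python
import numpy as np

# Counts copied from the notebook's table: rows are d1..d5, columns are t1..t7.
counts = np.array([
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
])

n_docs = counts.shape[0]
# Number of documents in which each term has a non-zero count.
doc_freq = (counts > 0).sum(axis=0)

# Leave IDF at 0 for any term with zero document frequency.
idf = np.zeros(len(doc_freq))
present = doc_freq > 0
idf[present] = np.log(n_docs / doc_freq[present])

print(doc_freq)
print(idf)
```

In this matrix *t2* and *t4* occur in four documents and every other term in three, so only two distinct IDF values appear: ln(5/4) and ln(5/3).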

With the TF and IDF values calculated, we can now compute the TF-IDF for each of the specified pairs by multiplying its TF by the IDF of the term.

In [5]:
# Calculate the TF-IDF for the specified document-term pairs
tfidf_d2_t3 = tf_d2_t3 * idfs['t3']
tfidf_d4_t6 = tf_d4_t6 * idfs['t6']
tfidf_d5_t7 = tf_d5_t7 * idfs['t7']

tfidf_d2_t3, tfidf_d4_t6, tfidf_d5_t7

(0.04526302995394855, 0.1344277957278923, 0.18029139662329086)

The TF-IDF values for the specified term-document pairs are:

- TFIDF(*t3,d2*): Approximately 0.0453
- TFIDF(*t6,d4*): Approximately 0.1344
- TFIDF(*t7,d5*): Approximately 0.1803

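These values reflect how important each term is to its document relative to the entire corpus: the higher the TF-IDF, the more strongly the term characterizes that document. Because all three terms share the same IDF here, the ranking is determined entirely by TF, so *t7* in *d5* scores highest and *t3* in *d2* lowest.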
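The three pairs above are individual cells of a full TF-IDF matrix. As a sketch, assuming the same counts transcribed into a NumPy array, the whole matrix and the highest-weighted term per document can be obtained as follows. The names `terms`, `docs`, and `top_terms` are illustrative, and the unguarded logarithm relies on every term here occurring in at least one document.

```python
import numpy as np

terms = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
docs = ["d1", "d2", "d3", "d4", "d5"]

# Counts copied from the notebook's table: rows are d1..d5, columns are t1..t7.
counts = np.array([
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
])

# TF: each row divided by its document length.
tf = counts / counts.sum(axis=1, keepdims=True)
# IDF: natural log of (number of documents / document frequency) per term.
idf = np.log(counts.shape[0] / (counts > 0).sum(axis=0))
# The (7,) IDF vector broadcasts across the rows of the (5, 7) TF matrix.
tfidf = tf * idf

# Index of the largest TF-IDF value in each row, mapped back to a term label.
top_terms = [terms[j] for j in tfidf.argmax(axis=1)]
for doc, term in zip(docs, top_terms):
    print(doc, term)
```

On this matrix the top terms are *t3* for *d1*, *t7* for *d2*, *d3*, and *d5*, and *t1* for *d4*. To reuse the notebook's DataFrame instead of retyping the counts, `df.drop(columns='total_terms').to_numpy()` yields the same array.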