# Task
Implement the TF-IDF algorithm manually in Python and compare the results with scikit-learn's `CountVectorizer` and `TfidfVectorizer` using the provided corpus. Explain the differences in the scores, particularly for common words, and summarize the findings in a README.md file. The final output should be a Jupyter Notebook (.ipynb) or Python script (.py) and a README.md file.

## Define corpus

### Subtask:
Define the corpus given in the problem description.


**Reasoning**:
The subtask requires defining a corpus as a list of strings. This can be achieved by creating a Python list and adding the specified sentences as elements.



In [1]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

## Manual tf-idf implementation

### Subtask:
Implement a function to calculate Term Frequency (TF).


**Reasoning**:
Implement the TF calculation function as described in the instructions.



In [2]:
import re

def calculate_tf(document):
  """Calculates the Term Frequency (TF) for each word in a document.

  Args:
    document: A string representing the document.

  Returns:
    A dictionary where keys are words and values are their corresponding TF scores.
  """
  # Convert to lowercase and split into words
  words = re.findall(r'\b\w+\b', document.lower())
  word_counts = {}
  for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1

  total_words = len(words)
  tf_scores = {}
  for word, count in word_counts.items():
    tf_scores[word] = count / total_words

  return tf_scores

# Test the function with the first document from the corpus
first_document = corpus[0]
tf_scores_first_doc = calculate_tf(first_document)
print(tf_scores_first_doc)

{'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}


## Manual tf-idf implementation

### Subtask:
Implement a function to calculate Inverse Document Frequency (IDF).


**Reasoning**:
Implement the `calculate_idf` function according to the instructions, including iterating through the corpus, counting document frequencies, calculating IDF scores, and testing the function.



In [3]:
import math
import re

def calculate_idf(corpus):
  """Calculates the Inverse Document Frequency (IDF) for each word in a corpus.

  Args:
    corpus: A list of strings, where each string is a document.

  Returns:
    A dictionary where keys are words and values are their corresponding IDF scores.
  """
  document_frequency = {}
  total_documents = len(corpus)

  for document in corpus:
    # Convert to lowercase and find unique words in the document
    words = set(re.findall(r'\b\w+\b', document.lower()))
    for word in words:
      document_frequency[word] = document_frequency.get(word, 0) + 1

  idf_scores = {}
  for word, doc_count in document_frequency.items():
    idf_scores[word] = math.log(total_documents / doc_count)

  return idf_scores

# Test the function with the defined corpus
idf_scores_corpus = calculate_idf(corpus)
print(idf_scores_corpus)

{'this': 0.0, 'is': 0.0, 'document': 0.28768207245178085, 'the': 0.0, 'first': 0.6931471805599453, 'second': 1.3862943611198906, 'and': 1.3862943611198906, 'third': 1.3862943611198906, 'one': 1.3862943611198906}


## Manual tf-idf implementation

### Subtask:
Implement a function to calculate TF-IDF using the TF and IDF functions.


**Reasoning**:
Implement the `calculate_tfidf` function, calculate TF-IDF for the first document, and print the results as instructed.



In [4]:
def calculate_tfidf(tf_scores, idf_scores):
  """Calculates the TF-IDF scores for words in a document.

  Args:
    tf_scores: A dictionary of TF scores for a document.
    idf_scores: A dictionary of IDF scores for the corpus.

  Returns:
    A dictionary where keys are words and values are their TF-IDF scores.
  """
  tfidf_scores = {}
  for word, tf in tf_scores.items():
    idf = idf_scores.get(word, 0)  # Get IDF score, default to 0 if word not in IDF scores
    tfidf_scores[word] = tf * idf
  return tfidf_scores

# Calculate TF-IDF scores for the first document
tfidf_scores_first_doc = calculate_tfidf(tf_scores_first_doc, idf_scores_corpus)

# Print the resulting TF-IDF scores
print(tfidf_scores_first_doc)

{'this': 0.0, 'is': 0.0, 'the': 0.0, 'first': 0.13862943611198905, 'document': 0.05753641449035617}
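
As a check on the output above, the scores for the first document can be worked by hand. Each word has a TF of 1/5, the corpus has 4 documents, and `math.log` is the natural logarithm:

$$
\begin{aligned}
\text{tfidf}(\text{first}) &= \tfrac{1}{5}\,\ln\tfrac{4}{2} \approx 0.2 \times 0.6931 = 0.1386 \\
\text{tfidf}(\text{document}) &= \tfrac{1}{5}\,\ln\tfrac{4}{3} \approx 0.2 \times 0.2877 = 0.0575 \\
\text{tfidf}(\text{this}) &= \tfrac{1}{5}\,\ln\tfrac{4}{4} = 0
\end{aligned}
$$

The same zero arises for 'is' and 'the', which also appear in all four documents.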


## Manual tf-idf implementation

### Subtask:
Calculate the manual TF-IDF scores for the corpus.


**Reasoning**:
Iterate through each document in the corpus, calculate its TF and TF-IDF scores, and store the results.



In [5]:
tfidf_scores_corpus_manual = []

for document in corpus:
  tf_scores = calculate_tf(document)
  tfidf_scores = calculate_tfidf(tf_scores, idf_scores_corpus)
  tfidf_scores_corpus_manual.append(tfidf_scores)

print(tfidf_scores_corpus_manual)

[{'this': 0.0, 'is': 0.0, 'the': 0.0, 'first': 0.13862943611198905, 'document': 0.05753641449035617}, {'this': 0.0, 'document': 0.09589402415059362, 'is': 0.0, 'the': 0.0, 'second': 0.23104906018664842}, {'and': 0.23104906018664842, 'this': 0.0, 'is': 0.0, 'the': 0.0, 'third': 0.23104906018664842, 'one': 0.23104906018664842}, {'is': 0.0, 'this': 0.0, 'the': 0.0, 'first': 0.13862943611198905, 'document': 0.05753641449035617}]
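
The list of dictionaries above is sparse: each document only lists the words it contains. One possible way to make it easier to compare with a matrix-style result is to align every document to a shared, alphabetically sorted vocabulary. The sketch below is self-contained, so it recomputes the manual scores with the same formulas as the cells above; `manual_tfidf` and `to_dense` are hypothetical helper names introduced here, not part of the notebook.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(document):
    # Same tokenization as the manual implementation: lowercase, word characters only
    return re.findall(r"\b\w+\b", document.lower())

def manual_tfidf(documents):
    """Return one {word: score} dict per document, with tf = count/len and idf = ln(N/df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            doc_scores[word] = tf * math.log(n_docs / doc_freq[word])
        scores.append(doc_scores)
    return scores

def to_dense(score_dicts):
    """Hypothetical helper: align sparse dicts to a sorted vocabulary, filling gaps with 0.0."""
    vocabulary = sorted({word for scores in score_dicts for word in scores})
    matrix = [[scores.get(word, 0.0) for word in vocabulary] for scores in score_dicts]
    return vocabulary, matrix

vocabulary, dense_manual = to_dense(manual_tfidf(corpus))
print(vocabulary)
for row in dense_manual:
    print([round(value, 4) for value in row])
```

Sorting the vocabulary is a choice made here so that the columns are in a predictable order; any fixed ordering would work as long as both results use the same one.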


## Scikit-learn tf-idf implementation

### Subtask:
Use `CountVectorizer` to get the term frequencies.


**Reasoning**:
Import the necessary class and instantiate it to calculate the term frequencies using CountVectorizer.



In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to get the term frequencies
count_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Display the count matrix and feature names
print("Count Matrix:\n", count_matrix.toarray())
print("\nFeature Names:", feature_names)

Count Matrix:
 [[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
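
The feature names above match the words found by the manual regex, but that agreement depends on the corpus. `CountVectorizer`'s documented default `token_pattern` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters, while the manual regex `\b\w+\b` also keeps single-character tokens. The sketch below approximates both tokenizers with plain `re` (it does not call scikit-learn), and the sample sentence is a hypothetical one chosen to expose the difference.

```python
import re

# Regex used by the manual implementation in this notebook
MANUAL_PATTERN = r"\b\w+\b"
# Default token_pattern documented for CountVectorizer and TfidfVectorizer
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def tokens(pattern, text):
    # Both approaches lowercase the text before matching
    return re.findall(pattern, text.lower())

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# On the notebook's corpus the two patterns agree, since no token is a single character
corpus_agrees = all(
    tokens(MANUAL_PATTERN, doc) == tokens(SKLEARN_DEFAULT_PATTERN, doc) for doc in corpus
)
print(corpus_agrees)

# Hypothetical sentence (not from the corpus) containing single-character words
sample = "This is a test, I think."
manual_tokens = tokens(MANUAL_PATTERN, sample)
sklearn_like_tokens = tokens(SKLEARN_DEFAULT_PATTERN, sample)
print(manual_tokens)
print(sklearn_like_tokens)
```

On text with words such as "a" or "I", the manual TF denominators would therefore be larger than counts derived from `CountVectorizer`, unless a matching `token_pattern` is passed to the vectorizer.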


## Scikit-learn tf-idf implementation

### Subtask:
Use `TfidfVectorizer` to get the TF-IDF scores.


**Reasoning**:
Import TfidfVectorizer, instantiate it, fit and transform the corpus, get feature names, and display the results.



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus to get the TF-IDF scores
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Get the feature names (words)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Display the TF-IDF matrix and feature names
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
print("\nFeature Names:", tfidf_feature_names)

TF-IDF Matrix:
 [[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]

Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
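
The matrix above can be approximated without scikit-learn, which helps show where the numbers come from. The sketch below assumes the documented defaults of `TfidfVectorizer`: raw counts as term frequency, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The function name `smoothed_l2_tfidf` is hypothetical, and the code is an illustration of those defaults rather than the library's actual implementation.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def smoothed_l2_tfidf(documents):
    """Sketch: raw count * (ln((1 + N) / (1 + df)) + 1), then L2-normalize each row."""
    # Tokens of two or more word characters, mirroring the vectorizer's default pattern
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Document frequency: number of documents containing each word
    doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}
    rows = []
    for tokens in tokenized:
        weighted = [tokens.count(word) * idf[word] for word in vocabulary]
        norm = math.sqrt(sum(value * value for value in weighted))
        rows.append([value / norm if norm else 0.0 for value in weighted])
    return vocabulary, rows

vocabulary, rows = smoothed_l2_tfidf(corpus)
print(vocabulary)
for row in rows:
    print([round(value, 8) for value in row])
```

Because each row is divided by its own L2 norm, using raw counts or counts divided by document length as the term frequency would give the same normalized row.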


## Comparison and analysis

### Subtask:
Compare the results from the manual implementation and scikit-learn's implementations.


**Reasoning**:
Compare the manual and scikit-learn TF and TF-IDF results and explain the differences.



In [8]:
import pandas as pd

# 1. Compare manual TF-IDF with scikit-learn TF-IDF
print("--- TF-IDF Comparison ---")
print("Manual TF-IDF Scores (first document):", tfidf_scores_corpus_manual[0])
print("Scikit-learn TF-IDF Scores (first document):")

# Align scikit-learn TF-IDF scores with manual scores based on feature names
tfidf_sklearn_first_doc = {}
sklearn_tfidf_array = tfidf_matrix.toarray()[0]
for i, word in enumerate(tfidf_feature_names):
    tfidf_sklearn_first_doc[word] = sklearn_tfidf_array[i]

print(tfidf_sklearn_first_doc)

# 2. Compare manual TF with scikit-learn TF
print("\n--- TF Comparison ---")
print("Manual TF Scores (first document):", tf_scores_first_doc)
print("Scikit-learn TF Scores (first document):")

# Align scikit-learn TF scores with manual scores based on feature names
tf_sklearn_first_doc = {}
sklearn_count_array = count_matrix.toarray()[0]
# Total token count of the first document, computed once outside the loop
total_words_first_doc = sklearn_count_array.sum()
for i, word in enumerate(feature_names):
    # Calculate TF from the raw count
    tf_sklearn_first_doc[word] = sklearn_count_array[i] / total_words_first_doc if total_words_first_doc > 0 else 0

print(tf_sklearn_first_doc)


# 3. Explanation of differences
print("\n--- Explanation of Differences ---")
print("Comparing the manual and scikit-learn implementations:")

print("\nTF Scores:")
print("Manual TF calculation divides the word count by the total number of words in the document.")
print("Scikit-learn's CountVectorizer provides raw counts. To get TF from scikit-learn's CountVectorizer, we also need to divide the word count by the total number of words in the document.")
print("For this corpus the TF scores are identical between the manual and scikit-learn implementations, because both tokenizers produce the same tokens.")
print("On other text, differences can arise from tokenization: CountVectorizer's default token_pattern keeps only tokens of two or more characters, while the manual regex also keeps single-character tokens.")

print("\nTF-IDF Scores:")
print("The core difference lies in the IDF calculation.")
print("Manual IDF uses the formula: log(Total Documents / Document Frequency).")
print("Scikit-learn's TfidfVectorizer uses a slightly different formula for IDF: log((Total Documents + 1) / (Document Frequency + 1)) + 1. It also applies L2 normalization to the resulting TF-IDF vectors.")
print("This difference in the IDF formula, particularly the '+1' smoothing terms and the final '+1', is the primary reason for the differences in the final TF-IDF scores, especially for words that appear in all documents (where manual IDF is 0, but scikit-learn's is not).")
print("For common words that appear in all documents ('this', 'is', 'the'), the manual TF-IDF score is 0 because their IDF is log(4/4) = 0. Scikit-learn gives these words an IDF of log(5/5) + 1 = 1, so their TF-IDF score stays non-zero (0.384 in the first document after L2 normalization). The final '+1' is what prevents the zero, not the smoothing inside the logarithm.")
print("The L2 normalization in scikit-learn also scales the scores, making direct comparison of raw values challenging without reproducing the normalization manually.")

--- TF-IDF Comparison ---
Manual TF-IDF Scores (first document): {'this': 0.0, 'is': 0.0, 'the': 0.0, 'first': 0.13862943611198905, 'document': 0.05753641449035617}
Scikit-learn TF-IDF Scores (first document):
{'and': np.float64(0.0), 'document': np.float64(0.46979138557992045), 'first': np.float64(0.5802858236844359), 'is': np.float64(0.38408524091481483), 'one': np.float64(0.0), 'second': np.float64(0.0), 'the': np.float64(0.38408524091481483), 'third': np.float64(0.0), 'this': np.float64(0.38408524091481483)}

--- TF Comparison ---
Manual TF Scores (first document): {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
Scikit-learn TF Scores (first document):
{'and': np.float64(0.0), 'document': np.float64(0.2), 'first': np.float64(0.2), 'is': np.float64(0.2), 'one': np.float64(0.0), 'second': np.float64(0.0), 'the': np.float64(0.2), 'third': np.float64(0.0), 'this': np.float64(0.2)}

--- Explanation of Differences ---
Comparing the manual and scikit-learn implementations:

## Summary:

### Data Analysis Key Findings

*   Manual TF scores for the first document were identical to TF scores derived from scikit-learn's `CountVectorizer` (by dividing raw counts by total words), indicating consistent tokenization and term counting on this corpus.
*   Manual TF-IDF scores differed from scikit-learn's `TfidfVectorizer` scores, particularly for common words.
*   The primary reason for the TF-IDF difference is the IDF calculation formula: manual uses $\log(\text{Total Documents} / \text{Document Frequency})$, while scikit-learn uses a smoothed version $\log((\text{Total Documents} + 1) / (\text{Document Frequency} + 1)) + 1$. Both use the natural logarithm.
*   Words appearing in all documents (like 'this', 'is', 'the') have a manual TF-IDF score of 0 because their manual IDF is $\log(4/4) = 0$, whereas scikit-learn gives them an IDF of $\log(5/5) + 1 = 1$ and therefore a non-zero score (0.384 in the first document after normalization). The final $+1$ in the formula, not the smoothing inside the logarithm, is what keeps these scores above zero.
*   Scikit-learn's `TfidfVectorizer` also applies L2 normalization to the resulting vectors, further contributing to score differences.
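
To make the IDF finding concrete, the two formulas can be tabulated for this corpus size. This is a small sketch using only `math`; the function names `plain_idf` and `smoothed_idf` are hypothetical labels for the manual formula and for the smoothed formula quoted above.

```python
import math

def plain_idf(n_docs, doc_freq):
    # Manual formula from this notebook: ln(N / df)
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # Smoothed formula quoted above: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4  # size of the corpus used in this notebook
table = [
    (doc_freq, plain_idf(n_docs, doc_freq), smoothed_idf(n_docs, doc_freq))
    for doc_freq in range(1, n_docs + 1)
]
for doc_freq, plain, smoothed in table:
    print(f"df={doc_freq}  plain={plain:.4f}  smoothed={smoothed:.4f}")
```

Both columns decrease as a word appears in more documents, so the ranking of words by rarity is the same. Only the plain formula reaches zero at df = N; the smoothed one bottoms out at 1.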

### Insights or Next Steps

*   Understanding the specific IDF formula and normalization applied by libraries is crucial when comparing manual implementations with library results.
*   For practical applications, using optimized and standardized library implementations like scikit-learn is generally preferred over manual implementation due to robustness and efficiency.
