# Implement TF-IDF (Term Frequency-Inverse Document Frequency)

## Task: Implement TF-IDF (Term Frequency-Inverse Document Frequency)

Your task is to implement a function that computes the TF-IDF scores for a query against a given corpus of documents.

Write a function `compute_tf_idf(corpus, query)` that takes the following inputs:

- corpus: A list of documents, where each document is a list of words.
- query: A list of words for which you want to compute the TF-IDF scores.

The function should return a list of lists with one inner list per document, in corpus order. Each inner list holds the TF-IDF score of every query word in that document, in query order, rounded to five decimal places.

Example
```python
import numpy as np

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

print(compute_tf_idf(corpus, query))

# Expected Output:
# [[0.21461], [0.25754], [0.0]]
```

## Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic used to reflect the importance of a word in a document within a collection or corpus. It is commonly used in information retrieval and text mining.

## Mathematical Formulation

TF-IDF is the product of two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency (TF):

$$TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$
 
Inverse Document Frequency (IDF):

$$IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$
 
TF-IDF:

$$TFIDF(t, d) = TF(t, d) \times IDF(t)$$

Note: For this problem, we use the smoothed inverse document frequency, with log taken as the natural logarithm (as computed by `np.log` in the solution):

$$Smooth_{IDF}(t) = \log\left(\frac{\text{Total number of documents} + 1}{\text{Number of documents containing term } t + 1}\right) + 1$$

The constant “1” is added to the numerator and denominator of the IDF as if an extra document had been seen containing every term in the collection exactly once, which prevents division by zero.
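To make the effect of smoothing concrete, here is a minimal sketch that evaluates the smoothed formula for a 3-document corpus. The helper names `smooth_idf` and `plain_idf` are hypothetical and used only for illustration; the natural logarithm is assumed, matching `np.log` in the solution.

```python
import math

def smooth_idf(num_docs, doc_freq):
    """Smoothed IDF with the natural logarithm (hypothetical helper)."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1

def plain_idf(num_docs, doc_freq):
    """Unsmoothed IDF; raises ZeroDivisionError when doc_freq is 0."""
    return math.log(num_docs / doc_freq)

num_docs = 3
# doc_freq = 0: term absent from the corpus, 2: term in two documents,
# 3: term present in every document.
for doc_freq in (0, 2, 3):
    print(doc_freq, round(smooth_idf(num_docs, doc_freq), 5))
# prints:
# 0 2.38629
# 2 1.28768
# 3 1.0
```

A term that appears in no document still gets a finite IDF, whereas `plain_idf(3, 0)` would fail with a division by zero.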

## Implementation Steps

- Compute TF: For each term in each document, calculate its term frequency.
- Compute IDF: Calculate the inverse document frequency for each unique term in the corpus.
- Calculate TF-IDF: Multiply TF and IDF for each term in each document.
- Normalize (optional): Normalize the TF-IDF vectors for each document. The expected outputs in this problem are not normalized.
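The optional normalization step is not part of this problem's expected output. As an illustration only, one common choice is to scale each document's TF-IDF vector to unit Euclidean (L2) length. The function name `l2_normalize_rows` below is hypothetical, and the input is assumed to be a documents-by-terms matrix of scores.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing an all-zero row by zero
    return matrix / norms

# Toy scores: the first row has length 5, the second row is all zeros.
scores = np.array([[3.0, 4.0], [0.0, 0.0]])
print(l2_normalize_rows(scores).tolist())
# prints: [[0.6, 0.8], [0.0, 0.0]]
```

After this step, documents of different lengths become directly comparable, for example with a dot product as a cosine similarity.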

## Example Calculation

Given:

- Corpus: 3 documents
- Doc1: "The cat sat on the mat"
- Doc2: "The dog chased the cat"
- Doc3: "The bird flew over the mat"

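Compute TF-IDF for the word "cat" in Doc1, using the basic (unsmoothed) IDF with a base-10 logarithm:

TF("cat", Doc1):

$$TF(\text{"cat"}, \text{Doc1}) = \frac{1}{6} = 0.1667$$

IDF("cat"):

$$IDF(\text{"cat"}) = \log_{10}\left(\frac{3}{2}\right) = 0.1761$$

TF-IDF("cat", Doc1):

$$\text{TF-IDF}(\text{"cat"}, \text{Doc1}) = 0.1667 \times 0.1761 = 0.0293$$

This value differs from the 0.21461 that `compute_tf_idf` returns for the same word and document, because the function uses the smoothed IDF with the natural logarithm.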
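For comparison, the same quantity can be worked out with the smoothed IDF and the natural logarithm that this problem specifies. This is a sketch that reuses the TF value of 1/6 and the document counts from the example (3 documents, 2 of them containing "cat"):

$$Smooth_{IDF}(\text{"cat"}) = \ln\left(\frac{3 + 1}{2 + 1}\right) + 1 \approx 1.2877$$

$$\text{TF-IDF}(\text{"cat"}, \text{Doc1}) = \frac{1}{6} \times 1.2877 \approx 0.2146$$

This agrees with the 0.21461 listed for the first document in the expected output of the task example.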

## Applications

TF-IDF is widely used in:

- Information Retrieval
- Text Mining
- Document Classification
- Search Engines
- Recommendation Systems

The statistic is used to rank documents by relevance to a user query, to summarize documents, and to extract features for machine learning algorithms in natural language processing.
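As a rough illustration of the ranking use case, the sketch below orders documents by the sum of their TF-IDF scores over the query words. The function `rank_documents` is hypothetical, summing the scores is only one simple relevance heuristic, and the code assumes non-empty documents and the smoothed, natural-log IDF used in this problem.

```python
import math
from collections import Counter

def rank_documents(corpus, query):
    """Return document indices sorted by summed TF-IDF over the query words."""
    num_docs = len(corpus)
    totals = []
    for document in corpus:
        counts = Counter(document)
        total = 0.0
        for word in query:
            # Number of documents that contain the query word.
            doc_freq = sum(1 for other in corpus if word in other)
            idf = math.log((num_docs + 1) / (doc_freq + 1)) + 1
            total += (counts[word] / len(document)) * idf
        totals.append(total)
    # Highest total first; ties keep their original corpus order.
    return sorted(range(num_docs), key=lambda i: totals[i], reverse=True)

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]
print(rank_documents(corpus, ["cat", "mat"]))
# prints: [0, 1, 2]
```

The first document ranks highest because it is the only one that contains both query words.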

In [1]:
import numpy as np

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents.
    
    :param corpus: List of documents, where each document is a list of words
    :param query: List of words in the query
    :return: List of lists containing TF-IDF scores for the query words in each document
    """
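    # Build a sorted vocabulary from every corpus word plus the query words,
    # so that query words absent from the corpus still get a column.
    vocab = sorted(set(word for document in corpus for word in document).union(query))
    word_to_index = {word: idx for idx, word in enumerate(vocab)}

    # Term frequency: raw counts per document, divided by document length.
    tf = np.zeros((len(corpus), len(vocab)))

    for doc_idx, document in enumerate(corpus):
        for word in document:
            word_idx = word_to_index[word]
            tf[doc_idx, word_idx] += 1
        if document:  # skip empty documents to avoid dividing by zero
            tf[doc_idx, :] /= len(document)

    # Document frequency: number of documents in which each term appears.
    df = np.count_nonzero(tf > 0, axis=0)

    # Smoothed inverse document frequency (natural logarithm).
    num_docs = len(corpus)
    idf = np.log((num_docs + 1) / (df + 1)) + 1

    tf_idf = tf * idf

    # Keep only the columns for the query words, in query order.
    query_indices = [word_to_index[word] for word in query]
    tf_idf_scores = tf_idf[:, query_indices]

    tf_idf_scores = np.round(tf_idf_scores, 5)

    return tf_idf_scores.tolist()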

In [2]:
print('Test Case 1: Accepted')
print('Input:')
print('import numpy as np\ncorpus = [\n    ["the", "cat", "sat", "on", "the", "mat"],\n    ["the", "dog", "chased", "the", "cat"],\n    ["the", "bird", "flew", "over", "the", "mat"]\n]\nquery = ["cat"]\nprint(compute_tf_idf(corpus, query))')
print()
print('Output:')
import numpy as np
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]
print(compute_tf_idf(corpus, query))
print()
print('Expected:')
print('[[0.21461], [0.25754], [0.0]]')
print()
print()



print('Test Case 2: Accepted')
print('Input:')
print('import numpy as np\ncorpus = [\n    ["the", "cat", "sat", "on", "the", "mat"],\n    ["the", "dog", "chased", "the", "cat"],\n    ["the", "bird", "flew", "over", "the", "mat"]\n]\nquery = ["cat", "mat"]\nprint(compute_tf_idf(corpus, query))')
print()
print('Output:')
import numpy as np
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat", "mat"]
print(compute_tf_idf(corpus, query))
print()
print('Expected:')
print('[[0.21461, 0.21461], [0.25754, 0.0], [0.0, 0.21461]]')
print()
print()



print('Test Case 3: Accepted')
print('Input:')
print('import numpy as np\ncorpus = [\n    ["this", "is", "a", "sample"],\n    ["this", "is", "another", "example"],\n    ["yet", "another", "sample", "document"],\n    ["one", "more", "document", "for", "testing"]\n]\nquery = ["sample", "document", "test"]\nprint(compute_tf_idf(corpus, query))')
print()
print('Output:')
import numpy as np
corpus = [
    ["this", "is", "a", "sample"],
    ["this", "is", "another", "example"],
    ["yet", "another", "sample", "document"],
    ["one", "more", "document", "for", "testing"]
]
query = ["sample", "document", "test"]
print(compute_tf_idf(corpus, query))
print()
print('Expected:')
print('[[0.37771, 0.0, 0.0], [0.0, 0.0, 0.0], [0.37771, 0.37771, 0.0], [0.0, 0.30217, 0.0]]')

Test Case 1: Accepted
Input:
import numpy as np
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]
print(compute_tf_idf(corpus, query))

Output:
[[0.21461], [0.25754], [0.0]]

Expected:
[[0.21461], [0.25754], [0.0]]


Test Case 2: Accepted
Input:
import numpy as np
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat", "mat"]
print(compute_tf_idf(corpus, query))

Output:
[[0.21461, 0.21461], [0.25754, 0.0], [0.0, 0.21461]]

Expected:
[[0.21461, 0.21461], [0.25754, 0.0], [0.0, 0.21461]]


Test Case 3: Accepted
Input:
import numpy as np
corpus = [
    ["this", "is", "a", "sample"],
    ["this", "is", "another", "example"],
    ["yet", "another", "sample", "document"],
    ["one", "more", "document", "for", "testing"]
]
query = ["sample", "document", "test"]
