#### The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to convert a collection of text documents into a matrix of TF-IDF features. It is commonly used in text mining and information retrieval to reflect the importance of a word in a document relative to a collection of documents.

In [5]:
import math
from collections import Counter

class TFIDF:
    def __init__(self):
        self.vocabulary = {}  # Vocabulary to store word indices
        self.idf_values = {}  # IDF values for words

    def fit(self, documents):
        """
        Compute IDF values based on the provided documents.
        
        Args:
            documents (list of str): List of documents where each document is a string.
        """
        doc_count = len(documents)
        term_doc_count = Counter()  # To count the number of documents containing each word

        # Count document frequency: each word is counted once per document
        for doc in documents:
            words = set(doc.split())  # Unique words in the current document
            for word in words:
                term_doc_count[word] += 1

        # Compute IDF values
        self.idf_values = {
            word: math.log(doc_count / (count + 1))  # +1 smooths the denominator (count is always >= 1 here)
            for word, count in term_doc_count.items()
        }

        # Build vocabulary
        self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())}

    def transform(self, documents):
        """
        Transform documents into TF-IDF representation.

        Args:
            documents (list of str): List of documents where each document is a string.
        
        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        rows = []
        for doc in documents:
            words = doc.split()
            word_count = Counter(words)
            doc_length = len(words)
            row = [0] * len(self.vocabulary)

            for word, count in word_count.items():
                if word in self.vocabulary:
                    tf = count / doc_length
                    idf = self.idf_values[word]
                    index = self.vocabulary[word]
                    row[index] = tf * idf
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        """
        Compute IDF values and transform documents into TF-IDF representation.

        Args:
            documents (list of str): List of documents where each document is a string.

        Returns:
            list of list of float: TF-IDF matrix where each row corresponds to a document.
        """
        self.fit(documents)
        return self.transform(documents)

In [6]:
# Example usage
if __name__ == "__main__":
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"
    ]

    tfidf = TFIDF()
    tfidf_matrix = tfidf.fit_transform(documents)
    for i, row in enumerate(tfidf_matrix):
        print(f"Document {i}: {row}")

Document 0: [0.0, -0.09589402415059363, 0.06757751801802739, 0.06757751801802739, 0.06757751801802739, 0, 0, 0, 0, 0]
Document 1: [0, -0.05753641449035618, 0, 0, 0, 0.08109302162163289, 0.08109302162163289, 0.0, 0.0, 0]
Document 2: [0.0, -0.09589402415059363, 0, 0, 0, 0, 0, 0.0, 0.0, 0.06757751801802739]


In [7]:
# Additional example usage
if __name__ == "__main__":
    # Sample documents
    documents = [
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial"
    ]

    # Initialize the TF-IDF model
    tfidf = TFIDF()
    
    # Fit the model and transform the documents
    tfidf_matrix = tfidf.fit_transform(documents)
    
    # Print the vocabulary
    print("Vocabulary:", tfidf.vocabulary)
    
    # Print the TF-IDF representation
    print("TF-IDF Representation:")
    for i, vector in enumerate(tfidf_matrix):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights"
    ]

    # Fit the model and transform the new set of documents
    tfidf_more = TFIDF()
    tfidf_matrix_more = tfidf_more.fit_transform(more_documents)
    
    # Print the vocabulary for the new documents
    print("\nVocabulary for new documents:", tfidf_more.vocabulary)
    
    # Print the TF-IDF representation for the new documents
    print("TF-IDF Representation for new documents:")
    for i, vector in enumerate(tfidf_matrix_more):
        print(f"Document {i + 1}: {vector}")

Vocabulary: {'love': 0, 'I': 1, 'Python': 2, 'programming': 3, 'in': 4, 'learning': 5, 'fun': 6, 'Machine': 7, 'is': 8, 'a': 9, 'language': 10, 'versatile': 11, 'Learning': 12, 'beneficial': 13, 'new': 14, 'always': 15, 'skills': 16}
TF-IDF Representation:
Document 1: [0.13862943611198905, 0.13862943611198905, 0.05753641449035617, 0.13862943611198905, 0.13862943611198905, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 0, 0, 0, 0.17328679513998632, 0.17328679513998632, 0.17328679513998632, 0.0, 0, 0, 0, 0, 0, 0, 0, 0]
Document 3: [0, 0, 0.05753641449035617, 0, 0, 0, 0, 0, 0.0, 0.13862943611198905, 0.13862943611198905, 0.13862943611198905, 0, 0, 0, 0, 0]
Document 4: [0, 0, 0, 0, 0, 0, 0, 0, 0.0, 0, 0, 0, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421, 0.11552453009332421]

Vocabulary for new documents: {'brown': 0, 'fox': 1, 'quick': 2, 'over': 3, 'the': 4, 'lazy': 5, 'dog': 6, 'jumps': 7, 'thousand': 8, 'journey': 9, 'single': 10, 'a': 11, 'st

#### Explanation:

1. **Initialization**:
   - `self.vocabulary`: Dictionary to store the mapping of words to their indices in the TF-IDF matrix.
   - `self.idf_values`: Dictionary to store the IDF (Inverse Document Frequency) values for each word.

2. **`fit` Method**:
   - **Input**: List of documents.
   - **Purpose**: Calculate the IDF values for all unique words in the corpus.
   - **Steps**:
     1. Count the number of documents containing each word.
     2. Compute the IDF for each word using the formula:
        $$
        \text{IDF}(word) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word} + 1}\right)
        $$
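        The `+ 1` smooths the denominator. Every word in the vocabulary occurs in at least one document, so the count is never zero here. As a side effect, a word that occurs in every document gets a negative IDF: in the first example, `the` appears in all 3 documents, giving $\log(3/4) \approx -0.288$.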
     3. Build the vocabulary with word-to-index mapping.
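
The IDF step can be checked by hand on the first example corpus. The snippet below is a minimal standalone sketch, not part of the `TFIDF` class; the names `doc_freq` and `idf` are introduced here for illustration only. It applies the same formula as `fit`, $\log(N / (\text{df} + 1))$.

```python
import math
from collections import Counter

# Corpus taken from the first example cell.
documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the dog food",
]

# Document frequency: the number of documents each word appears in.
doc_freq = Counter()
for doc in documents:
    doc_freq.update(set(doc.split()))  # set() counts a word once per document

# Same formula as in fit: log(N / (df + 1)).
n_docs = len(documents)
idf = {word: math.log(n_docs / (df + 1)) for word, df in doc_freq.items()}

for word in ("the", "cat", "sat"):
    print(f"{word}: df={doc_freq[word]}, idf={idf[word]:.4f}")
```

With this formula, a word found in all three documents (`the`) ends up with a negative IDF, a word found in two (`cat`) gets exactly zero, and only words found in a single document get a positive weight.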

3. **`transform` Method**:
   - **Input**: List of documents.
   - **Purpose**: Convert each document into a TF-IDF representation.
   - **Steps**:
     1. Compute Term Frequency (TF) for each word in the document:
        $$
        \text{TF} = \frac{\text{Count of the word}}{\text{Total number of words in the document}}
        $$
     2. Compute the TF-IDF value:
        $$
        \text{TF-IDF} = \text{TF} \times \text{IDF}
        $$
     3. Store the TF-IDF values in a matrix where each row corresponds to a document.
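
As a worked check of the steps above, the following sketch recomputes the TF-IDF values for the first document of the first example. It is standalone and illustrative: the document frequencies in `doc_freq` are written out by hand from that corpus rather than produced by the class.

```python
import math
from collections import Counter

doc = "the cat sat on the mat"
words = doc.split()
counts = Counter(words)

# Document frequencies of these words in the 3-document example corpus,
# written out by hand for this illustration.
n_docs = 3
doc_freq = {"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1}
idf = {w: math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}

# TF-IDF = (count / document length) * IDF
tfidf = {w: (counts[w] / len(words)) * idf[w] for w in counts}
for w, value in tfidf.items():
    print(f"{w}: tf={counts[w]}/{len(words)}, tf-idf={value:.4f}")
```

The value for `the` comes out at about -0.0959 and the values for `sat`, `on`, and `mat` at about 0.0676, which agrees with the `Document 0` row printed by the first example.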

4. **`fit_transform` Method**:
   - **Purpose**: Perform both fitting (computing IDF values) and transforming (converting documents to TF-IDF representation) in one step.
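
Fitting and transforming are separate steps so that IDF values learned from one set of documents can be applied to another. The sketch below illustrates this with two hypothetical helper functions, `fit_idf` and `transform_doc`, which mirror the logic of `fit` and `transform` but return dictionaries instead of matrix rows to keep the example short.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Return {word: idf} using log(N / (df + 1)), as in the fit step."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))
    n_docs = len(documents)
    return {w: math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}

def transform_doc(doc, idf):
    """Return {word: tf-idf} for one document; unknown words are skipped."""
    words = doc.split()
    counts = Counter(words)
    return {w: (c / len(words)) * idf[w] for w, c in counts.items() if w in idf}

train = ["the cat sat on the mat", "the dog ate my homework"]
idf = fit_idf(train)

# "bird" never appeared during fitting, so it receives no weight.
# It still counts toward the document length used for TF.
scores = transform_doc("the bird sat", idf)
print(scores)
```

As in the class, a word that was not seen during fitting is ignored at transform time, but it still contributes to the document length in the TF denominator.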