---   
<img align="left" width="110"   src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg"> 
<h1 align="center">Tools and Techniques for Data Science</h1>
<h1 align="center">Course: Natural Language Processing</h1>

--- 
<h2><div align="right">Muhammad Sheraz (Data Scientist)</div></h2>
<h1 align="center">Lecture 4: (Text Representation)</h1>

<img align="center" width="1100"  src="../images/phase3.png"  > 

## Learning agenda of this notebook

1. **Feature Extraction**
    - Bag of Words (BoW) Representation
    - Creating Bag of N-grams
    - Term Frequency-Inverse Document Frequency

## Bag of Words (BoW) Representation

- **Definition:**
  - BoW is a common technique in natural language processing (NLP) to represent text data as a numerical matrix.
  - It focuses on the occurrence and frequency of words in a document, disregarding grammar and word order.

- **Process:**
  - **Tokenization:** Breaks text into individual words or tokens.
  - **Lowercasing:** Converts all words to lowercase to ensure uniformity.
  - **Stopword Removal:** Eliminates common words (e.g., 'the', 'is') that add little meaning.
  - **Counting:** Creates a matrix where each row represents a document, and each column represents a unique word, counting the occurrences.

- **Key Components:**
  - **Document-Term Matrix (DTM):** The resulting matrix showing the frequency of each word in each document.
  - **Vocabulary:** The set of unique words across all documents.

- **Pros:**
  - Simple and computationally efficient.
  - Captures important term frequencies for basic text analysis.

- **Cons:**
  - Sparsity: most entries of the matrix are zero, because each document uses only a small part of the vocabulary.
  - Ignores word order and semantics.
  - Doesn't consider relationships between words (e.g., 'good' and 'great' are treated as separate entities).

- **Use Cases:**
  - Commonly used in text classification, sentiment analysis, and information retrieval.
  - Foundation for more advanced NLP techniques like TF-IDF and word embeddings.

- **Libraries:**
  - Popular libraries like scikit-learn in Python provide tools (e.g., `CountVectorizer`) for easy implementation.

- **Considerations:**
  - Customize preprocessing steps and hyperparameters based on specific needs.
  - May require additional techniques for handling large vocabularies or improving semantic understanding.


### Bag of Words (BoW) Example

Consider a corpus with three documents:

1. Document 1: "This is the first document."
2. Document 2: "This document is the second document."
3. Document 3: "And this is the third one."

#### Step 1: Tokenization

- Unique words across all documents (ignoring case, so "This" and "this" count as one word):
  - {This, is, the, first, document, second, And, third, one}

#### Step 2: Create Vocabulary

- Assign an index to each unique word:
  - Vocabulary: {This: 0, is: 1, the: 2, first: 3, document: 4, second: 5, And: 6, third: 7, one: 8}

#### Step 3: Document-Term Matrix (DTM)

- Represent each document as a vector of word frequencies based on the vocabulary:

| Document | This | is | the | first | document | second | And | third | one |
|----------|------|----|-----|-------|----------|--------|-----|-------|-----|
| 1        | 1    | 1  | 1   | 1     | 1        | 0      | 0   | 0     | 0   |
| 2        | 1    | 1  | 1   | 0     | 2        | 1      | 0   | 0     | 0   |
| 3        | 1    | 1  | 1   | 0     | 0        | 0      | 1   | 1     | 1   |

The Document-Term Matrix (DTM) captures the frequency of each word in each document, forming the basis of the Bag of Words (BoW) representation.
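
The three steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the regex tokenizer and the first-appearance ordering of the vocabulary are assumptions chosen to match the table.

```python
from collections import Counter
import re

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

# Step 1: lowercase and tokenize (a simple regex tokenizer, assumed for illustration)
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

# Step 2: build the vocabulary in order of first appearance
vocabulary = {}
for tokens in tokenized:
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))

# Step 3: count every vocabulary word in every document
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[word] for word in vocabulary])

print(list(vocabulary))
for row in dtm:
    print(row)
```

Each printed row corresponds to one row of the Document-Term Matrix, with columns in vocabulary order.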


In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a sample dataframe
data = {'Text': ['This is the first document.',
                 'This document is the second document.',
                 'And this is the t||hird one.',
                 'Is this the first document?'],
        'Output': [1, 0, 1, 0]}  # Adding a binary output column

df = pd.DataFrame(data)

df


Unnamed: 0,Text,Output
0,This is the first document.,1
1,This document is the second document.,0
2,And this is the t||hird one.,1
3,Is this the first document?,0


In [7]:
# Apply bag-of-words representation with specified hyperparameters
vectorizer = CountVectorizer(
    lowercase=True,      # Convert all characters to lowercase
    #stop_words='english', # Remove common English stop words
    max_features=None,    # Keep all unique words (no limit on features)
    binary=False,         # Count occurrences (binary=False) or presence (binary=True)
    ngram_range=(1, 1)    # Use unigrams (single words), can be adjusted for bigrams, trigrams, etc.
)

bow_matrix = vectorizer.fit_transform(df['Text'])

# Convert the bag-of-words matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
bow_df

Unnamed: 0,and,document,first,hird,is,one,second,the,this
0,0,1,1,0,1,0,0,1,1
1,0,2,0,0,1,0,1,1,1
2,1,0,0,1,1,1,0,1,1
3,0,1,1,0,1,0,0,1,1


In [8]:
bow_matrix.toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 1],
       [0, 2, 0, 0, 1, 0, 1, 1, 1],
       [1, 0, 0, 1, 1, 1, 0, 1, 1],
       [0, 1, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)

In [9]:
bow_matrix[0].toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)

In [10]:
bow_matrix[3].toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)

In [11]:
vectorizer.transform(['document is the final document']).toarray()

array([[0, 2, 0, 0, 1, 0, 0, 1, 0]], dtype=int64)
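
`transform` reuses the vocabulary learned during `fit_transform`, so a word never seen while fitting ('final' here) has no column and is dropped. The sketch below mimics that behaviour in plain Python. The helper `transform_with_vocabulary` is hypothetical and the regex only approximates the default tokenization; the vocabulary is copied from the fitted vectorizer in this notebook.

```python
import re

# Vocabulary learned during fitting (term -> column index)
vocabulary = {"and": 0, "document": 1, "first": 2, "hird": 3, "is": 4,
              "one": 5, "second": 6, "the": 7, "this": 8}

def transform_with_vocabulary(text, vocabulary):
    """Count only the tokens that already exist in the fitted vocabulary."""
    vector = [0] * len(vocabulary)
    # Lowercase, then keep tokens of two or more word characters
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        index = vocabulary.get(token)
        if index is not None:  # unseen words such as 'final' are skipped
            vector[index] += 1
    return vector

new_vector = transform_with_vocabulary("document is the final document", vocabulary)
print(new_vector)
```

The result has the same nine columns as the training matrix, regardless of what the new text contains.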

## Understanding `vectorizer.vocabulary_` in `CountVectorizer`

In the scikit-learn `CountVectorizer` class, `vectorizer.vocabulary_` is an attribute that provides a mapping between terms (words) and their indices in the bag-of-words matrix. Here's how it works:

### Vocabulary Building Process

1. **Tokenization:**
   - The text data is tokenized, breaking it into individual words or tokens.

2. **Lowercasing:**
   - All words are converted to lowercase to ensure consistency.

3. **Stopword Removal (if specified):**
   - Common English stop words (e.g., 'the', 'is', 'and') may be removed based on the `stop_words` parameter.

4. **Building the Vocabulary:**
   - For each unique word in the preprocessed text data, a unique index is assigned.
   - The resulting vocabulary is stored in a dictionary, where the keys are the words, and the values are their corresponding indices.
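
As a rough sketch of this process (without stopword removal), the snippet below builds a term-to-index dictionary by hand. It assumes the default `token_pattern` and relies on the fact that `CountVectorizer` numbers its terms in sorted order; it is an illustration, not the library's code.

```python
import re

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Default CountVectorizer token pattern: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Tokenize the lowercased text and collect the unique terms
unique_terms = set()
for doc in documents:
    unique_terms.update(TOKEN_PATTERN.findall(doc.lower()))

# Indices follow the sorted (alphabetical) order of the terms
vocabulary = {term: index for index, term in enumerate(sorted(unique_terms))}
print(vocabulary)
```

Because the indices come from sorting, the order in which words first appear in the corpus does not affect them.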



### Example

Consider the following text data:
```plaintext
['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?']
```

After tokenization, lowercasing, and stopword removal with `stop_words='english'`, only two unique words remain, because scikit-learn's built-in English list also treats 'first', 'third', and 'one' as stop words:

    {'document', 'second'}

The vocabulary dictionary would look like:

    {
     'document': 0,
     'second': 1
    }


## Example of Vocabulary Building in Bag-of-Words (BoW)

Consider the following text data:

1. "This is the first document."
2. "This document is the second document."
3. "And this is the third one."
4. "Is this the first document?"

### Step 1: Tokenization

Tokenize the text into individual words:

["This", "is", "the", "first", "document", "This", "document", "is", "the", "second", "document", "And", "this", "is", "the", "third", "one", "Is", "this", "the", "first", "document"]

### Step 2: Lowercasing

Convert all words to lowercase:

["this", "is", "the", "first", "document", "this", "document", "is", "the", "second", "document", "and", "this", "is", "the", "third", "one", "is", "this", "the", "first", "document"]

### Step 3: Vocabulary Building

Build the vocabulary by assigning a unique index to each unique word (here in order of first appearance):

```markdown
{
 'this': 0,
 'is': 1,
 'the': 2,
 'first': 3,
 'document': 4,
 'second': 5,
 'and': 6,
 'third': 7,
 'one': 8
}
```

`CountVectorizer` itself numbers the terms in alphabetical order, so the indices it reports differ from this illustration.

In [12]:
vectorizer.vocabulary_

{'this': 8,
 'is': 4,
 'the': 7,
 'first': 2,
 'document': 1,
 'second': 6,
 'and': 0,
 'hird': 3,
 'one': 5}

In [13]:
result_df = pd.concat([df, bow_df], axis=1)
result_df

Unnamed: 0,Text,Output,and,document,first,hird,is,one,second,the,this
0,This is the first document.,1,0,1,1,0,1,0,0,1,1
1,This document is the second document.,0,0,2,0,0,1,0,1,1,1
2,And this is the t||hird one.,1,1,0,0,1,1,1,0,1,1
3,Is this the first document?,0,0,1,1,0,1,0,0,1,1


## CountVectorizer Hyperparameters

- **`lowercase` (default=True):**
  - *Description:* Converts all text to lowercase. Helps in treating words with different cases as the same.
  - *Default Value:* True
  - *Use Case:* Set to False if you want to preserve the case sensitivity of words.

- **`stop_words` (default=None):**
  - *Description:* Removes common English stop words (e.g., 'the', 'is', 'and') to focus on more meaningful words.
  - *Default Value:* None
  - *Use Case:* Pass 'english' to remove common English stop words. Custom stop words can be provided as a list.

- **`max_features` (default=None):**
  - *Description:* Limits the number of unique words to consider. If specified, the most frequent words are selected.
  - *Default Value:* None
  - *Use Case:* Set a specific number to limit the vocabulary size, helpful when dealing with large datasets or when focusing on top words.

- **`binary` (default=False):**
  - *Description:* If True, the matrix representation is binary (1 if the word is present, 0 if not). If False, it counts the occurrences.
  - *Default Value:* False
  - *Use Case:* Set to True for binary representation when only presence/absence matters, not the frequency.

- **`ngram_range` (default=(1, 1)):**
  - *Description:* Specifies the range of n-grams to consider. For example, (1, 1) considers only unigrams, (1, 2) considers unigrams and bigrams, etc.
  - *Default Value:* (1, 1)
  - *Use Case:* Adjust to capture more context by including bigrams (or trigrams) in addition to unigrams.

- **`tokenizer` (default=None):**
  - *Description:* Custom function for tokenization. If None, it uses the default tokenizer.
  - *Default Value:* None
  - *Use Case:* Provide a custom tokenizer function if the default tokenization is not suitable for your data.

- **`preprocessor` (default=None):**
  - *Description:* Custom function applied to each document before tokenization and stop word removal.
  - *Default Value:* None
  - *Use Case:* Use when additional preprocessing is required before tokenization.

- **`max_df` (default=1.0):**
  - *Description:* Ignores terms that appear in more documents than the given threshold. A float in [0.0, 1.0] is read as a proportion of documents; an integer is read as an absolute document count.
  - *Default Value:* 1.0
  - *Use Case:* Exclude words that are too common and may not provide meaningful information.

- **`min_df` (default=1):**
  - *Description:* Ignores terms that appear in fewer documents than the given threshold. A float in [0.0, 1.0] is read as a proportion of documents; an integer is read as an absolute document count.
  - *Default Value:* 1
  - *Use Case:* Exclude words that are too rare and may not contribute much to the analysis.

- **`vocabulary` (default=None):**
  - *Description:* List of words to consider. If not None, it ignores all terms that are not in this list.
  - *Default Value:* None
  - *Use Case:* Provide a custom vocabulary list to restrict the features to a predefined set.

- **`strip_accents` (default=None):**
  - *Description:* Remove accents during the preprocessing step.
  - *Default Value:* None
  - *Use Case:* Set to 'ascii' or 'unicode' to remove accents from words.

- **`token_pattern` (default=r"(?u)\b\w\w+\b"):**
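  - *Description:* Regular expression defining what counts as a token when `analyzer='word'`. The default matches runs of two or more word characters, so single-character tokens are dropped and punctuation acts as a separator.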
  - *Default Value:* r"(?u)\b\w\w+\b"
  - *Use Case:* Customize the pattern to suit specific tokenization requirements.

- **`analyzer` (default='word'):**
  - *Description:* Determines whether features are built from word n-grams or character n-grams.
  - *Default Value:* 'word'
  - *Use Case:* Set to 'char' or 'char_wb' for character n-grams instead of word n-grams.

- **`dtype` (default=np.int64):**
  - *Description:* Type of the matrix returned.
  - *Default Value:* np.int64
  - *Use Case:* Adjust if a different data type for the matrix is required.

- **`input` (default='content'):**
  - *Description:* 'content' interprets the input as a collection of raw text documents; 'filename' and 'file' read each document from a path or a file object instead.
  - *Default Value:* 'content'
  - *Use Case:* Typically, there's no need to change this unless the documents are stored in files.

These hyperparameters offer flexibility in customizing the behavior of the `CountVectorizer` based on specific requirements and characteristics of the input text data. Adjusting these hyperparameters allows for fine-tuning the Bag of Words representation.


In [15]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create an advanced sample dataframe
data = {
        'Text': ['This is the first book about a thrilling adventure.',
                 'Book2 is a non-fiction work by AuthorB.',
                 'The third book is a fictional story by AuthorA.',
                 'AuthorC wrote a mystery novel in Book4.']}
df_advanced = pd.DataFrame(data)

# Display the advanced sample dataframe
df_advanced



Unnamed: 0,Text
0,This is the first book about a thrilling adven...
1,Book2 is a non-fiction work by AuthorB.
2,The third book is a fictional story by AuthorA.
3,AuthorC wrote a mystery novel in Book4.


In [16]:
# Apply bag-of-words representation
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
bow_matrix = vectorizer.fit_transform(df_advanced['Text'])

# Convert the bag-of-words matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

bow_df

Unnamed: 0,adventure,authora,authorb,authorc,book,book2,book4,fiction,fictional,mystery,non,novel,story,thrilling,work,wrote
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0
2,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
3,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1


In [17]:
vectorizer.vocabulary_

{'book': 4,
 'thrilling': 13,
 'adventure': 0,
 'book2': 5,
 'non': 10,
 'fiction': 7,
 'work': 14,
 'authorb': 2,
 'fictional': 8,
 'story': 12,
 'authora': 1,
 'authorc': 3,
 'wrote': 15,
 'mystery': 9,
 'novel': 11,
 'book4': 6}

In [18]:
result_df_advanced = pd.concat([df_advanced, bow_df], axis=1)

result_df_advanced

Unnamed: 0,Text,adventure,authora,authorb,authorc,book,book2,book4,fiction,fictional,mystery,non,novel,story,thrilling,work,wrote
0,This is the first book about a thrilling adven...,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,Book2 is a non-fiction work by AuthorB.,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0
2,The third book is a fictional story by AuthorA.,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
3,AuthorC wrote a mystery novel in Book4.,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1


## Creating N-grams
- **What are n-grams?** 
    - A sequence of n consecutive words: a bigram (n = 2), a trigram (n = 3), and so on.
- **Why use n-grams?** 
    - They capture contextual information (`good food` carries more meaning than `good` and `food` observed independently).
    - Applications of n-grams:
        - Sentence completion
        - Automatic spell checking and correction
        - Automatic grammar checking and correction
    - Is there a perfect value of n?
        - No. Different values of n suit different applications, so try several on your data before concluding which one works best for your text analysis.

- **How to create n-grams?** 

In [64]:
import nltk
mystr = "Allama Iqbal was a visionary philosopher and politician. Thank you"
tokens = nltk.tokenize.word_tokenize(mystr)
bgs = nltk.bigrams(tokens)
print(bgs)
for grams in bgs:
    print(grams)

<generator object bigrams at 0x0000023602AAEAB0>
('Allama', 'Iqbal')
('Iqbal', 'was')
('was', 'a')
('a', 'visionary')
('visionary', 'philosopher')
('philosopher', 'and')
('and', 'politician')
('politician', '.')
('.', 'Thank')
('Thank', 'you')


>- The formula to calculate the count of n-grams in a document is: **`X - N + 1`**, where `X` is the number of tokens in the document (11 here, because `word_tokenize` counts the full stop as a token) and `N` is the number of tokens in each n-gram
\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 2 + 1 \hspace{0.5cm} = \hspace{0.5cm} 10
\end{equation}


In [65]:
tgs = nltk.trigrams(tokens)
for grams in tgs:
    print(grams)

('Allama', 'Iqbal', 'was')
('Iqbal', 'was', 'a')
('was', 'a', 'visionary')
('a', 'visionary', 'philosopher')
('visionary', 'philosopher', 'and')
('philosopher', 'and', 'politician')
('and', 'politician', '.')
('politician', '.', 'Thank')
('.', 'Thank', 'you')


\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 3 + 1 \hspace{0.5cm} = \hspace{0.5cm} 9
\end{equation}


In [66]:
ngrams = nltk.ngrams(tokens, 4)
for grams in ngrams:
    print(grams)

('Allama', 'Iqbal', 'was', 'a')
('Iqbal', 'was', 'a', 'visionary')
('was', 'a', 'visionary', 'philosopher')
('a', 'visionary', 'philosopher', 'and')
('visionary', 'philosopher', 'and', 'politician')
('philosopher', 'and', 'politician', '.')
('and', 'politician', '.', 'Thank')
('politician', '.', 'Thank', 'you')
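
The `X - N + 1` formula can be checked with a small n-gram function written from scratch. This is a sketch: the helper `make_ngrams` is hypothetical, and the token list is copied from the `word_tokenize` result above so that NLTK is not needed.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens as produced by word_tokenize above (the full stop is its own token)
tokens = ["Allama", "Iqbal", "was", "a", "visionary", "philosopher",
          "and", "politician", ".", "Thank", "you"]

for n in (2, 3, 4):
    grams = make_ngrams(tokens, n)
    # Both numbers printed after n should agree: actual count and X - N + 1
    print(n, len(grams), len(tokens) - n + 1)
```

For the 4-grams this gives 11 - 4 + 1 = 8, which matches the eight tuples printed above.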


## N-Grams in Bag of Words (BoW)

- **Definition:**
  - N-Grams extend the BoW representation by considering sequences of 'n' consecutive words as a single feature.
  - Unigrams (1-grams) are single words, bigrams (2-grams) are pairs of consecutive words, trigrams (3-grams) are triplets, and so on.

- **Enhancements:**
  - **Contextual Information:** Captures the context and relationship between adjacent words.
  - **Increased Complexity:** Larger 'n' introduces more features, leading to a richer representation.

- **Use Cases:**
  - Useful in tasks where word order matters, such as language modeling and certain types of text analysis.
  - Provides more nuanced information for understanding the meaning of phrases.

- **Implementation:**
  - Supported in libraries like scikit-learn through the `ngram_range` parameter in `CountVectorizer`.

- **Considerations:**
  - Larger 'n' increases the dimensionality of the feature space, which may impact computational efficiency.
  - Finding the right balance between granularity and complexity is crucial.
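
To make the dimensionality point concrete, the sketch below counts the distinct unigrams, bigrams, and trigrams in a four-sentence corpus. The helper `unique_ngrams` is hypothetical, and the regex only approximates the default `CountVectorizer` tokenization.

```python
import re

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def unique_ngrams(docs, n):
    """Collect the distinct word n-grams found inside each document."""
    found = set()
    for doc in docs:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        found.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return found

# Number of features contributed by each n
sizes = {n: len(unique_ngrams(documents, n)) for n in (1, 2, 3)}
print(sizes)
print(sum(sizes.values()))
```

Even on this tiny corpus, adding bigrams and trigrams takes the feature count from 9 to 34; on real text the growth is far steeper.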



## N-Grams in Bag of Words (BoW) Example

Consider a corpus with three documents:

1. Document 1: "This is the first document."
2. Document 2: "This document is the second document."
3. Document 3: "And this is the third one."

#### Step 1: Tokenization and N-Gram Formation

- Unique unigrams and bigrams across all documents (ignoring case; bigrams are formed only within a document, never across document boundaries):
  - {This, is, the, first, document, second, And, third, one, This is, is the, the first, first document, This document, document is, the second, second document, And this, the third, third one}

#### Step 2: Create Vocabulary

- Assign an index to each unique n-gram:
  - Vocabulary: {This: 0, is: 1, the: 2, first: 3, document: 4, second: 5, And: 6, third: 7, one: 8, This is: 9, is the: 10, the first: 11, first document: 12, This document: 13, document is: 14, the second: 15, second document: 16, And this: 17, the third: 18, third one: 19}

#### Step 3: Document-Term Matrix (DTM) for N-Grams

- Represent each document as a vector of n-gram frequencies based on the vocabulary:

| Document | This | is | the | first | document | second | And | third | one | This is | is the | the first | first document | This document | document is | the second | second document | And this | the third | third one |
|----------|------|----|-----|-------|----------|--------|-----|-------|-----|---------|--------|-----------|----------------|---------------|-------------|------------|-----------------|----------|-----------|-----------|
| 1        | 1    | 1  | 1   | 1     | 1        | 0      | 0   | 0     | 0   | 1       | 1      | 1         | 1              | 0             | 0           | 0          | 0               | 0        | 0         | 0         |
| 2        | 1    | 1  | 1   | 0     | 2        | 1      | 0   | 0     | 0   | 0       | 1      | 0         | 0              | 1             | 1           | 1          | 1               | 0        | 0         | 0         |
| 3        | 1    | 1  | 1   | 0     | 0        | 0      | 1   | 1     | 1   | 1       | 1      | 0         | 0              | 0             | 0           | 0          | 0               | 1        | 1         | 1         |

The Document-Term Matrix (DTM) for N-Grams captures the frequency of each n-gram in each document, forming the basis of the Bag of Words (BoW) representation with N-Grams.


In [70]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a sample dataframe
data = {'Text': ['This is the first document.',
                 'This document is the second document.',
                 'And this is the third one.',
                 'Is this the first document?']}
df = pd.DataFrame(data)
df



Unnamed: 0,Text
0,This is the first document.
1,This document is the second document.
2,And this is the third one.
3,Is this the first document?


In [71]:
# Apply N-Grams using CountVectorizer
ngram_vectorizer = CountVectorizer(ngram_range=(1, 3))  # Use unigrams, bigrams, and trigrams
ngram_matrix = ngram_vectorizer.fit_transform(df['Text'])

# Convert the N-Grams matrix to a DataFrame
ngram_df = pd.DataFrame(ngram_matrix.toarray(), columns=ngram_vectorizer.get_feature_names_out())

# Concatenate the N-Grams DataFrame with the original dataframe
result_df = pd.concat([df, ngram_df], axis=1)

result_df


Unnamed: 0,Text,and,and this,and this is,document,document is,document is the,first,first document,is,...,the third one,third,third one,this,this document,this document is,this is,this is the,this the,this the first
0,This is the first document.,0,0,0,1,0,0,1,1,1,...,0,0,0,1,0,0,1,1,0,0
1,This document is the second document.,0,0,0,2,1,1,0,0,1,...,0,0,0,1,1,1,0,0,0,0
2,And this is the third one.,1,1,1,0,0,0,0,0,1,...,1,1,1,1,0,0,1,1,0,0
3,Is this the first document?,0,0,0,1,0,0,1,1,1,...,0,0,0,1,0,0,0,0,1,1


In [72]:
ngram_vectorizer.vocabulary_

{'this': 27,
 'is': 8,
 'the': 18,
 'first': 6,
 'document': 3,
 'this is': 30,
 'is the': 9,
 'the first': 19,
 'first document': 7,
 'this is the': 31,
 'is the first': 10,
 'the first document': 20,
 'second': 16,
 'this document': 28,
 'document is': 4,
 'the second': 21,
 'second document': 17,
 'this document is': 29,
 'document is the': 5,
 'is the second': 11,
 'the second document': 22,
 'and': 0,
 'third': 25,
 'one': 15,
 'and this': 1,
 'the third': 23,
 'third one': 26,
 'and this is': 2,
 'is the third': 12,
 'the third one': 24,
 'is this': 13,
 'this the': 32,
 'is this the': 14,
 'this the first': 33}

## TF-IDF (Term Frequency-Inverse Document Frequency)

- **Definition:**
  - TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents.
  - It balances the frequency of a word in a document (Term Frequency - TF) with its rarity across documents (Inverse Document Frequency - IDF).

- **Components:**
  - **Term Frequency (TF):**
    - Measures how often a term appears in a document.
    - Calculated as the number of occurrences of a term divided by the total number of terms in the document.

  - **Inverse Document Frequency (IDF):**
    - Measures how unique or rare a term is across all documents.
    - Calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

- **Formula:**
  - TF-IDF = TF * IDF

- **Key Characteristics:**
  - Words with high TF-IDF scores are important in a specific document but not common across all documents.
  - Common words receive lower scores, emphasizing the uniqueness of terms.

- **Pros:**
  - Captures the importance of terms in the context of a document and a corpus.
  - Penalizes common words and highlights distinctive terms.

- **Cons:**
  - Vocabulary size, and with it the dimensionality of the matrix, grows with the corpus.
  - Results depend on parameter choices (e.g., `min_df`, `max_df`, `ngram_range`), which need careful tuning.

- **Use Cases:**
  - Document retrieval and ranking.
  - Keyword extraction.
  - Text mining and clustering.

- **Libraries:**
  - Implemented in scikit-learn as `TfidfVectorizer`.


### Steps to Calculate TF-IDF

- **Step 1: Compute Term Frequency (TF)**
    - Term Frequency (TF) is calculated for each term in each document. It represents the frequency of a term in a document relative to the total number of terms in that document.

- **Step 2: Compute Inverse Document Frequency (IDF)**
    - Inverse Document Frequency (IDF) measures the importance of a term in the entire corpus. It is computed by taking the logarithm of the ratio of the total number of documents to the number of documents containing the term.

- **Step 3: Compute TF-IDF**
    - TF-IDF (Term Frequency-Inverse Document Frequency) is the product of TF and IDF. It represents the importance of a term in a document relative to its importance in the entire corpus.

The TF-IDF representation captures the importance of terms in each document, emphasizing the uniqueness of terms across the entire corpus.
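
Written out with symbols, the three steps read as follows. The notation is introduced here for illustration only: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents that contain $t$.

\begin{equation}
    \text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
    \text{IDF}(t) = \log \frac{N}{\text{df}(t)}, \qquad
    \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\end{equation}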


## Example

Consider a corpus with three documents:

1. Document 1: "This is the first document."
2. Document 2: "This document is the second document."
3. Document 3: "And this is the third one."


### Term Frequency (TF) Table

Each entry is the count of the term divided by the number of words in the document (5, 6, and 6 words respectively).

| Term      | Document 1   | Document 2   | Document 3   |
|-----------|--------------|--------------|--------------|
| This      | 1/5          | 1/6          | 1/6          |
| is        | 1/5          | 1/6          | 1/6          |
| the       | 1/5          | 1/6          | 1/6          |
| first     | 1/5          | 0            | 0            |
| document  | 1/5          | 2/6          | 0            |
| second    | 0            | 1/6          | 0            |
| And       | 0            | 0            | 1/6          |
| third     | 0            | 0            | 1/6          |
| one       | 0            | 0            | 1/6          |


### Inverse Document Frequency (IDF) Table

IDF is computed here with a base-10 logarithm: log10(3 / number of documents containing the term).

| Term      | IDF          |
|-----------|--------------|
| This      | 0            |
| is        | 0            |
| the       | 0            |
| first     | 0.477        |
| document  | 0.176        |
| second    | 0.477        |
| And       | 0.477        |
| third     | 0.477        |
| one       | 0.477        |


### TF-IDF Table

| Term      | Document 1   | Document 2   | Document 3   |
|-----------|--------------|--------------|--------------|
| This      | 0            | 0            | 0            |
| is        | 0            | 0            | 0            |
| the       | 0            | 0            | 0            |
| first     | 0.095        | 0            | 0            |
| document  | 0.035        | 0.059        | 0            |
| second    | 0            | 0.080        | 0            |
| And       | 0            | 0            | 0.080        |
| third     | 0            | 0            | 0.080        |
| one       | 0            | 0            | 0.080        |
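
The hand calculation can be recomputed with a few lines of plain Python. This sketch assumes the textbook definitions used in this example (TF as count divided by document length, IDF as a base-10 logarithm) and a simple regex tokenizer; it is not what `TfidfVectorizer` computes by default.

```python
import math
import re

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
n_docs = len(tokenized)
terms = sorted({token for tokens in tokenized for token in tokens})

# IDF: base-10 log of (number of documents / documents containing the term)
idf = {t: math.log10(n_docs / sum(t in tokens for tokens in tokenized)) for t in terms}

# TF-IDF: (term count / document length) * IDF
tfidf = [
    {t: tokens.count(t) / len(tokens) * idf[t] for t in terms}
    for tokens in tokenized
]

for number, row in enumerate(tfidf, start=1):
    nonzero = {t: round(value, 3) for t, value in row.items() if value > 0}
    print(f"Document {number}: {nonzero}")
```

Terms that occur in every document ('this', 'is', 'the') get an IDF of 0 and therefore vanish from every row.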


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense NumPy array for easier manipulation
dense_matrix = tfidf_matrix.toarray()

# Display the TF-IDF values in a readable format
print("TF-IDF Values:")
pd.DataFrame(dense_matrix,columns=feature_names)



TF-IDF Values:


Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
1,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
2,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
3,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
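
These values differ from a textbook TF × IDF calculation because `TfidfVectorizer`, with its default settings, uses raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit Euclidean length. The sketch below reproduces the first row in plain Python under those assumptions; the helper `tfidf_row` is hypothetical.

```python
import math
import re

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)
terms = sorted({token for tokens in tokenized for token in tokens})

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in terms:
    df = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scale the row to unit length (L2 norm)."""
    weights = [tokens.count(term) * idf[term] for term in terms]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

first_row = dict(zip(terms, tfidf_row(tokenized[0])))
print({term: round(value, 6) for term, value in first_row.items()})
```

Because of the `+ 1` in the IDF, words that occur in every document keep a non-zero weight instead of dropping to 0.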


In [3]:
# Display feature names (terms)
print("\nFeature Names (Terms):")
feature_names



Feature Names (Terms):


array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
data = {
    'Document': ["This is the first document.",
                 "This document is the second document.",
                 "And this is the third one."]
}

# Create DataFrame
df = pd.DataFrame(data)

# Apply TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Document'])

# Convert TF-IDF matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Display TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(tfidf_df)


Original DataFrame:
                                Document
0            This is the first document.
1  This document is the second document.
2             And this is the third one.

TF-IDF DataFrame:
       and  document     first        is      one    second       the  \
0  0.00000  0.469417  0.617227  0.364544  0.00000  0.000000  0.364544   
1  0.00000  0.728445  0.000000  0.282851  0.00000  0.478909  0.282851   
2  0.49712  0.000000  0.000000  0.293607  0.49712  0.000000  0.293607   

     third      this  
0  0.00000  0.364544  
1  0.00000  0.282851  
2  0.49712  0.293607  
