<a href="https://colab.research.google.com/github/Abhilitcode/NLP_Practical/blob/main/Text_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Understanding Bag-of-Words

Bag-of-Words (BoW) is a technique used in natural language processing to convert text documents into numerical representations that can be understood by machine learning algorithms. It essentially counts the frequency of words in a document and represents it as a numerical vector.

Key Points:

Vocabulary: Each unique word across the corpus is assigned a specific index.
Document Representation: A document is represented as a vector where each element corresponds to the frequency of a specific word in that document.
Order and Syntax: BoW ignores the order and syntax of words, focusing solely on word occurrences.
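To make the last point concrete, here is a minimal pure-Python sketch (not the scikit-learn implementation) that builds a bag of words with `collections.Counter`. The helper name `bag_of_words` and the simple regex tokenizer are assumptions made for illustration only.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring case, punctuation, and order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog.")

print(a["the"])  # 2
print(a == b)    # True: identical bags even though the sentences mean different things
```

The two sentences describe opposite events, yet their bags are equal, which is exactly the information BoW gives up by ignoring word order.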
Example: A Simple Bag-of-Words Implementation

Creating a DataFrame and Applying Bag-of-Words

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = {'Document': ['This is the first document.',
                      'This document is the second document.',
                      'And the third one.',
                      'Is this the first document?'], 'output':[1,1,0,0]}


In [3]:
df = pd.DataFrame(data)

In [4]:
df

Unnamed: 0,Document,output
0,This is the first document.,1
1,This document is the second document.,1
2,And the third one.,0
3,Is this the first document?,0


In [18]:
cv = CountVectorizer()  # default ngram_range=(1, 1), i.e. unigrams, matching the vocabulary printed below

Understanding bow = cv.fit_transform(df['Document'])

This line of code is a common step in text preprocessing for machine learning tasks, specifically when working with text data using the Bag-of-Words (BoW) model. Let's break it down:

1. cv = CountVectorizer():
   - This creates an instance of the CountVectorizer class from the sklearn.feature_extraction.text module.
   - CountVectorizer converts a collection of text documents into a matrix of token counts.

2. cv.fit_transform(df['Document']):
   - This method applies the CountVectorizer to the column df['Document'] of the DataFrame df.
   - It performs two operations:
     - Fit: It learns the vocabulary from the text documents, identifying the unique words.
     - Transform: It transforms each document into a numerical feature vector, where each feature corresponds to a word in the vocabulary, and the value represents the frequency of that word in the document.

3. bow:
   - The resulting bow is a SciPy sparse matrix in Compressed Sparse Row (CSR) format.
   - Each row corresponds to a document, and each column corresponds to a word in the vocabulary.
   - The values in the matrix represent the frequency of the corresponding word in the respective document.

In essence:

This line of code converts a collection of text documents into a numerical representation that can be used as input for machine learning algorithms. By transforming text data into numerical features, we enable models to understand and process text effectively.

Example:

Consider the following text documents:

doc1 = "This is the first document."
doc2 = "This document is the second document."
After applying the CountVectorizer, the vocabulary is sorted alphabetically (document, first, is, second, the, this), and the resulting matrix, shown here in dense form, is:

[[1 1 1 0 1 1]
 [2 0 1 1 1 1]]
Here, each row represents a document, and each column represents a word. For instance, the first row indicates that "document", "first", "is", "the", and "this" each appear once in the first document, while "second" does not appear. In the second row, "document" appears twice.

By converting text into numerical representations, we can apply various machine learning algorithms, such as Naive Bayes, Support Vector Machines, or deep learning models, to tasks like text classification, sentiment analysis, or topic modeling.
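The fit and transform steps described above can be sketched in plain Python. This is an illustrative approximation, not scikit-learn's code: the function names `tokenize`, `fit`, and `transform` are hypothetical, the regex imitates CountVectorizer's default of lowercased tokens with two or more word characters, and the result is a dense list of lists rather than a sparse matrix.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit(docs):
    """Learn the vocabulary: each unique token gets an index in sorted order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Turn each document into a list of counts, one slot per vocabulary entry."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:  # tokens outside the vocabulary are skipped
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]
vocabulary = fit(docs)
matrix = transform(docs, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Keeping `fit` and `transform` separate matters in practice: the vocabulary is learned once from training text and then reused, unchanged, on any new text.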

In [6]:
bow = cv.fit_transform(df['Document'])

In [7]:
print(cv.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


The output [[0 1 1 1 0 0 1 0 1]], printed by bow[0].toarray() below, is the numerical representation of the first document in the dataset, as produced by the Bag-of-Words (BoW) model.

Breakdown of the Output:

Each element corresponds to a word in the vocabulary:

Index 0: "and"
Index 1: "document"
Index 2: "first"
Index 3: "is"
Index 4: "one"
Index 5: "second"
Index 6: "the"
Index 7: "third"
Index 8: "this"
The value at each index represents the frequency of the corresponding word in the document:

"and": 0 (not present)
"document": 1 (appears once)
"first": 1 (appears once)
"is": 1 (appears once)
"one": 0 (not present)
"second": 0 (not present)
"the": 1 (appears once)
"third": 0 (not present)
"this": 1 (appears once)
In essence:

This numerical representation captures the word frequencies in the document, disregarding the order and grammar. This allows machine learning algorithms to process and analyze text data effectively.

By converting text documents into such numerical representations, we can apply various machine learning techniques to tasks like text classification, sentiment analysis, and topic modeling.

In [8]:
print(bow[0].toarray())

[[0 1 1 1 0 0 1 0 1]]


In [9]:
print(bow[2].toarray())

[[1 0 0 0 1 0 1 1 0]]
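Reading a count vector is easier when each value is paired with its word. The sketch below does this by hand for the first document, copying the feature order and the row from the outputs above; the variable names are illustrative.

```python
feature_names = ["and", "document", "first", "is", "one",
                 "second", "the", "third", "this"]
row = [0, 1, 1, 1, 0, 0, 1, 0, 1]  # count vector of the first document

# Keep only the words that actually occur in the document.
present = {name: count for name, count in zip(feature_names, row) if count > 0}
print(present)  # {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}
```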


The out-of-vocabulary (OOV) problem of one-hot encoding is handled here: a word in a new sentence that is not present in the learned vocabulary is simply ignored when the sentence is transformed.

In [10]:
cv.transform(["This document is the best and this used to be my first document"]).toarray()

array([[1, 2, 1, 1, 0, 0, 1, 0, 2]])
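To see which words were silently dropped from that sentence, the following sketch splits it into tokens and checks each against the learned vocabulary. It is a plain-Python approximation that assumes the default tokenization (lowercase, tokens of two or more word characters); it does not call scikit-learn.

```python
import re

vocabulary = {"and", "document", "first", "is", "one",
              "second", "the", "third", "this"}
sentence = "This document is the best and this used to be my first document"

tokens = re.findall(r"\b\w\w+\b", sentence.lower())
kept = [t for t in tokens if t in vocabulary]
dropped = [t for t in tokens if t not in vocabulary]

print(dropped)                 # ['best', 'used', 'to', 'be', 'my']
print(len(tokens), len(kept))  # 13 8
```

The eight kept tokens account for the sum of the counts in the array above (1 + 2 + 1 + 1 + 1 + 2 = 8). The five dropped words leave no trace in the vector, so any information they carried is lost.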

In [11]:
# Binary transformation using CountVectorizer with binary=True
vectorizer_binary = CountVectorizer(binary=True)

Binary Transformation: The CountVectorizer(binary=True) creates a matrix where each term's presence is represented as 1 or 0 in each document.
Max Features: The CountVectorizer(max_features=N) limits the vocabulary to the N most frequent terms.

In [12]:
binary_matrix = vectorizer_binary.fit_transform(df['Document'])

In [13]:
print("Binary Transformation Matrix (Binary=True):")
print(binary_matrix.toarray())

Binary Transformation Matrix (Binary=True):
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
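The effect of binary=True can be imitated by clipping every non-zero count to 1. The sketch below starts from the count matrix that a default CountVectorizer would produce for the four documents and shows the only cell that changes; the variable names are illustrative.

```python
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],  # "document" occurs twice in the second document
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

# Presence/absence instead of frequency.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

changed = [(i, j)
           for i, row in enumerate(counts)
           for j, c in enumerate(row)
           if c != binary[i][j]]
print(changed)  # [(1, 1)]
```

Binary features are a reasonable choice when the mere presence of a word matters more than how often it is repeated, for example in very short texts.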


The method get_feature_names_out() of CountVectorizer returns an array of the feature names (words in the vocabulary) learned during fitting.

In [14]:
print("Feature Names (Vocabulary):", vectorizer_binary.get_feature_names_out())

Feature Names (Vocabulary): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


max_features is mostly used to drop rare words that are not needed: only the top N terms, ranked by their total count across the corpus, are kept.


In [15]:
# Max features transformation using CountVectorizer
max_features = 3  # Adjust the value as needed
vectorizer_max_features = CountVectorizer(max_features=max_features)
limited_matrix = vectorizer_max_features.fit_transform(df['Document'])

# Print max features transformation results
print("\nMax Features Transformation Matrix (max_features=3):")
print(limited_matrix.toarray())
print("Selected Feature Names (Limited Vocabulary):", vectorizer_max_features.get_feature_names_out())


Max Features Transformation Matrix (max_features=3):
[[1 1 1]
 [2 1 1]
 [0 0 1]
 [1 1 1]]
Selected Feature Names (Limited Vocabulary): ['document' 'is' 'the']
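The selection can be checked by counting how often each term occurs across the whole corpus. This is a plain-Python sketch that assumes the default tokenization; it ranks terms by total count and breaks ties alphabetically purely for display, which is not necessarily how scikit-learn breaks them.

```python
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]

totals = Counter(
    token
    for doc in docs
    for token in re.findall(r"\b\w\w+\b", doc.lower())
)

# Rank by total count (descending), then alphabetically for a stable display.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[:4])  # [('document', 4), ('the', 4), ('is', 3), ('this', 3)]
```

"is" and "this" are tied for the third slot with three occurrences each. The run above kept "is", but which term wins a tie is an implementation detail that is safer not to rely on.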


N-grams

An ngram_range of (1, 1) means unigrams only, (2, 2) means bigrams only, and (1, 2) means both unigrams and bigrams, and so on.

In [23]:
cv = CountVectorizer(ngram_range=(1,2))

In [24]:
bow = cv.fit_transform(df['Document'])

ngram_range=(2, 2) ensures that only bigrams (pairs of consecutive words) are extracted.
ngram_range=(1, 2) would extract both unigrams and bigrams.

In [25]:
print(cv.vocabulary_)

{'this': 18, 'is': 6, 'the': 12, 'first': 4, 'document': 2, 'this is': 20, 'is the': 7, 'the first': 13, 'first document': 5, 'second': 10, 'this document': 19, 'document is': 3, 'the second': 14, 'second document': 11, 'and': 0, 'third': 16, 'one': 9, 'and the': 1, 'the third': 15, 'third one': 17, 'is this': 8, 'this the': 21}


In [26]:
len(cv.vocabulary_)

22
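The vocabulary size of 22 can be reproduced by generating the n-grams by hand. The helper names `ngrams` and `vocabulary` below are hypothetical, and the tokenizer is an approximation of the default one.

```python
import re

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(docs, ngram_range):
    low, high = ngram_range
    terms = set()
    for doc in docs:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for n in range(low, high + 1):
            terms.update(ngrams(tokens, n))
    return sorted(terms)

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]

print(len(vocabulary(docs, (1, 1))))  # 9 unigrams
print(len(vocabulary(docs, (2, 2))))  # 13 bigrams
print(len(vocabulary(docs, (1, 2))))  # 22 = 9 + 13
```

Note that bigrams are built within a single document only, and that the vocabulary grows quickly as the range widens, which makes the feature matrix larger and sparser.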

The **TF-IDF (Term Frequency-Inverse Document Frequency)** Vectorizer transforms text data into a matrix of **TF-IDF features**. It represents text in a way that assigns weights to words based on their frequency in a document (Term Frequency) and how common or rare they are across all documents (Inverse Document Frequency). The idea is to highlight important words and diminish the weight of more common, less informative words.

Each component of the TF-IDF formula is described in words below:

### 1. Term Frequency (TF)
**Description**: The term frequency measures how often a word (term) appears in a single document relative to the total number of words in that document.

**Formula (in words)**:
Term Frequency (TF) is calculated as:
> The number of times the term appears in the document divided by the total number of terms (words) in the document.

### 2. Inverse Document Frequency (IDF)
**Description**: The inverse document frequency measures how important a term is across all the documents in the corpus. It decreases the weight of terms that occur very frequently across multiple documents and increases the weight for terms that appear rarely.

**Formula (in words)**:
Inverse Document Frequency (IDF) is calculated as:
> The logarithm of the total number of documents divided by the number of documents containing the term.

### 3. TF-IDF Weight
**Description**: The TF-IDF weight of a term is the product of its term frequency and inverse document frequency. It helps to emphasize words that are important (frequent in a particular document) but uncommon (infrequent across all documents) in the entire corpus.

**Formula (in words)**:
TF-IDF is calculated as:
> The term frequency multiplied by the inverse document frequency.
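The three formulas above translate directly into code. The sketch below is a literal, textbook-style implementation with hypothetical function names; it assumes the natural logarithm (the text does not fix a base) and that the term occurs in at least one document, otherwise the division would fail.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term divided by the number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]

print(tf_idf("the", corpus[0], corpus))               # 0.0
print(round(tf_idf("second", corpus[1], corpus), 4))  # 0.231
```

"the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes, while "second" occurs in only one document and receives a clearly positive weight.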



In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidfvectorizer = TfidfVectorizer()

In [29]:
tfidf_doc = tfidfvectorizer.fit_transform(df['Document'])

In [30]:
tfidf_doc.toarray()

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.66215942, 0.        , 0.33107971, 0.        ,
        0.51870034, 0.27067936, 0.        , 0.33107971],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

In [32]:
# Display feature names (vocabulary)
print("\nFeature Names (Vocabulary):")
print(tfidfvectorizer.get_feature_names_out())


Feature Names (Vocabulary):
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


The IDF (Inverse Document Frequency) value is constant for a term across different documents because it is calculated based on the entire corpus, not on a single document.

### Explanation:

1. **IDF Calculation**:
   - **IDF Formula (in words)**: The IDF is the logarithm of the total number of documents divided by the number of documents containing the term.
   - This means the IDF value depends solely on:
     - The total number of documents in the corpus.
     - The number of documents in which the specific term appears.
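   - Since these values do not change for different documents within the same corpus, the IDF for a particular term remains constant for all documents.
   - Note that scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed variant, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector, so its numbers differ from the plain formula.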

2. **Role of IDF**:
   - The purpose of IDF is to scale down the weight of terms that appear frequently across many documents (e.g., common words) and scale up the weight of terms that are rare across the documents.
   - A term appearing in many documents has a lower IDF value, while a term appearing in few documents has a higher IDF value.

### Constant Nature Across Documents:
- Because IDF is computed globally (based on the entire set of documents), its value remains constant regardless of which document we are calculating the term's TF-IDF weight for.
- However, **TF** (Term Frequency) varies for each document, so the overall **TF-IDF** score for a term can differ from document to document even though the IDF is constant.

### Example:
- Suppose we have a corpus of 10 documents, and a term "AI" appears in 2 of them.
- The **IDF** value of "AI" would be:
  - IDF = log(10 / 2) = log(5).
- This value (log(5)) remains the same for all documents in the corpus whenever we calculate the TF-IDF for "AI". However, the **TF** (term frequency) of "AI" will vary depending on how many times it appears in each specific document.

In [33]:
print(tfidfvectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.22314355 1.91629073 1.91629073
 1.         1.91629073 1.22314355]
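These values are not plain log(N / df): "the" appears in all four documents and still has an IDF of 1 rather than 0. With its default settings, TfidfVectorizer uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, raw counts as TF, and L2 normalization of each row. The sketch below recomputes the numbers by hand under those assumptions; the document frequencies are read off the four example documents and the variable names are illustrative.

```python
import math

n_docs = 4
# Number of documents that contain each term.
doc_freq = {"and": 1, "document": 3, "first": 2, "is": 3, "one": 1,
            "second": 1, "the": 4, "third": 1, "this": 3}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
print({t: round(v, 4) for t, v in idf.items()})

# First document: each of its five terms occurs exactly once (raw count TF = 1).
counts = {"document": 1, "first": 1, "is": 1, "the": 1, "this": 1}
weights = {t: c * idf[t] for t, c in counts.items()}

# L2 normalization: divide by the Euclidean length of the row.
norm = math.sqrt(sum(w * w for w in weights.values()))
row = {t: round(w / norm, 4) for t, w in weights.items()}
print(row)
```

The second print gives roughly 0.4388 for "document", "is", and "this", 0.542 for "first", and 0.3587 for "the", matching the non-zero entries of the first row of the TF-IDF matrix shown earlier.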


In [34]:
len(tfidfvectorizer.idf_)

9