<a href="https://colab.research.google.com/github/Abhilitcode/NLP_Practical/blob/main/Text_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Understanding Bag-of-Words

Bag-of-Words (BoW) is a technique used in natural language processing to convert text documents into numerical representations that can be understood by machine learning algorithms. It essentially counts the frequency of words in a document and represents it as a numerical vector.

Key Points:

Vocabulary: Each unique word across the documents is assigned a fixed index.
Document Representation: A document is represented as a vector where each element corresponds to the frequency of a specific word in that document.
Order and Syntax: BoW ignores the order and syntax of words, focusing solely on word occurrences.
Example: A Simple Bag-of-Words Implementation
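
Before turning to scikit-learn, the key points above can be sketched in plain Python. This is only an illustrative sketch, not how CountVectorizer is implemented: the helper names tokenize and vectorize are hypothetical, and the regular expression is an assumption chosen to roughly mirror CountVectorizer's default behavior (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique word gets a fixed index (alphabetical order here).
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def vectorize(text):
    # Document representation: one count per vocabulary word, order ignored.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocab]

vectors = [vectorize(doc) for doc in docs]
print(vocab)
print(vectors[0])
```

Each vector has one slot per vocabulary word, so all documents end up with the same length regardless of how many words they contain.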

Creating a DataFrame and Applying Bag-of-Words

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
data = {'Document': ['This is the first document.',
                      'This document is the second document.',
                      'And the third one.',
                      'Is this the first document?'], 'output':[1,1,0,0]}


In [6]:
df = pd.DataFrame(data)

In [7]:
df

                                Document  output
0            This is the first document.       1
1  This document is the second document.       1
2                     And the third one.       0
3            Is this the first document?       0


In [8]:
cv = CountVectorizer()

Understanding bow = cv.fit_transform(df['Document'])

This line of code is a common step in text preprocessing for machine learning tasks, specifically when working with text data using the Bag-of-Words (BoW) model. Let's break it down:

1. cv = CountVectorizer():

This creates an instance of the CountVectorizer class from the sklearn.feature_extraction.text module.
The CountVectorizer is a tool used to convert a collection of text documents into a matrix of token counts.
2. cv.fit_transform(df['Document']):

This method applies the CountVectorizer to the specified column df['Document'] of the DataFrame df.
It performs two operations:
Fit: It learns the vocabulary from the text documents, identifying unique words.
Transform: It transforms each document into a numerical feature vector, where each feature corresponds to a word in the vocabulary, and the value represents the frequency of that word in the document.
3. bow:

The resulting bow is a SciPy sparse matrix in Compressed Sparse Row (CSR) format, which stores only the non-zero counts.
Each row corresponds to a document, and each column corresponds to a word in the vocabulary.
The values in the matrix represent the frequency of the corresponding word in the respective document.
In essence:

This line of code converts a collection of text documents into a numerical representation that can be used as input for machine learning algorithms. By transforming text data into numerical features, we enable models to understand and process text effectively.
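
To make the fit and transform steps concrete, the sketch below runs them separately and compares the result with fit_transform. The variable names (cv_a, cv_b, bow_a, bow_b) are chosen for illustration only; the documents are the same four sentences used in this notebook.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: learn the vocabulary and build the count matrix together.
cv_a = CountVectorizer()
bow_a = cv_a.fit_transform(docs)

# Two steps: fit learns the vocabulary, transform builds the counts.
cv_b = CountVectorizer()
cv_b.fit(docs)
bow_b = cv_b.transform(docs)

# Both routes should give the same counts.
same = np.array_equal(bow_a.toarray(), bow_b.toarray())

print(issparse(bow_a))   # the result is a sparse matrix
print(bow_a.shape)       # (number of documents, vocabulary size)
print(same)
```

Keeping the two steps separate matters once there is unseen text: fit is called once on the training documents, and transform is then reused on any new documents so that they share the same columns.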

Example:

Consider the following text documents:

doc1 = "This is the first document."
doc2 = "This document is the second document."
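After applying the CountVectorizer, we get a count matrix like the following (shown here as a dense array):

[[1 1 1 0 1 1]
 [2 0 1 1 1 1]]
Here, each row represents a document, and each column represents a word. CountVectorizer lowercases the text and orders the columns alphabetically, so the columns are "document", "first", "is", "second", "the", and "this". For instance, the first row indicates that "document", "first", "is", "the", and "this" each appear once in the first document and "second" does not appear, while the second row shows "document" appearing twice.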

By converting text into numerical representations, we can apply various machine learning algorithms, such as Naive Bayes, Support Vector Machines, or deep learning models, to tasks like text classification, sentiment analysis, or topic modeling.
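
As a rough sketch of how such counts can feed a classifier, the example below chains CountVectorizer with a Multinomial Naive Bayes model. The labels are the toy values from the 'output' column of this notebook's DataFrame, the test sentence is made up for illustration, and four documents are far too few for a meaningful model, so this only shows the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]
labels = [1, 1, 0, 0]  # toy labels taken from the 'output' column

# The pipeline vectorizes the text, then fits the classifier on the counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Hypothetical new sentence; it is vectorized with the learned vocabulary.
pred = model.predict(["This is another document."])
print(pred)
```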

In [9]:
bow = cv.fit_transform(df['Document'])

In [10]:
print(cv.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


The output [[0 1 1 1 0 0 1 0 1]], produced by print(bow[0].toarray()), is the numerical representation of the first document in the dataset, as processed by the Bag-of-Words (BoW) model.

Breakdown of the Output:

Each element corresponds to a word in the vocabulary:

Index 0: "and"
Index 1: "document"
Index 2: "first"
Index 3: "is"
Index 4: "one"
Index 5: "second"
Index 6: "the"
Index 7: "third"
Index 8: "this"
The value at each index represents the frequency of the corresponding word in the document:

"and": 0 (not present)
"document": 1 (appears once)
"first": 1 (appears once)
"is": 1 (appears once)
"one": 0 (not present)
"second": 0 (not present)
"the": 1 (appears once)
"third": 0 (not present)
"this": 1 (appears once)
In essence:

This numerical representation captures the word frequencies in the document, disregarding the order and grammar. This allows machine learning algorithms to process and analyze text data effectively.
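
The breakdown above, and the claim that word order is disregarded, can both be checked with a short sketch. It rebuilds the vectorizer on the same four documents so that it runs on its own; the names words, present, and same_vector are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]

cv = CountVectorizer()
bow = cv.fit_transform(docs)

# Recover the words in column order from the learned vocabulary.
words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# Keep only the words that actually occur in the first document.
first = bow[0].toarray()[0]
present = {w: int(c) for w, c in zip(words, first) if c > 0}
print(present)

# Documents 0 and 3 use the same words in a different order,
# so their count vectors come out identical.
same_vector = np.array_equal(bow[0].toarray(), bow[3].toarray())
print(same_vector)
```

Note that documents 0 and 3 carry different values in the 'output' column even though BoW cannot tell them apart, which is one practical consequence of ignoring order and punctuation.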

By converting text documents into such numerical representations, we can apply various machine learning techniques to tasks like text classification, sentiment analysis, and topic modeling.

In [11]:
print(bow[0].toarray())

[[0 1 1 1 0 0 1 0 1]]


In [12]:
print(bow[2].toarray())

[[1 0 0 0 1 0 1 1 0]]


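The out-of-vocabulary (OOV) problem seen with one-hot encoding is handled here: any word in a new sentence that is not present in the learned vocabulary is simply ignored. In the example below, "best", "used", "to", "be", and "my" are dropped, and only the known words are counted.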

In [13]:
cv.transform(["This document is the best and this used to be my first document"]).toarray()

array([[1, 2, 1, 1, 0, 0, 1, 0, 2]])
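
To see exactly which words were ignored, one option is to tokenize the new sentence with the vectorizer's own analyzer and compare the tokens against the learned vocabulary. This is a small sketch with illustrative variable names; build_analyzer returns the same preprocessing and tokenization callable that the vectorizer applies internally.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And the third one.",
    "Is this the first document?",
]
cv = CountVectorizer().fit(docs)

new_sentence = "This document is the best and this used to be my first document"

# Tokenize the sentence the same way the vectorizer does.
analyzer = cv.build_analyzer()
tokens = analyzer(new_sentence)

# Unique tokens, in order of first appearance, that are not in the vocabulary.
oov = [t for t in dict.fromkeys(tokens) if t not in cv.vocabulary_]

vector = cv.transform([new_sentence]).toarray()[0]

print(oov)
print(len(tokens), int(vector.sum()))  # total tokens vs. tokens actually counted
```

The gap between the two printed numbers is the number of tokens that were silently dropped, which is worth monitoring when new text differs a lot from the training documents.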