
# Code for Bag of Words embedding




**Bag of Words (BoW): Method of Operation**


The Bag of Words (BoW) model is a simple and widely used technique for text representation. It converts a collection of texts into a matrix where:

Rows represent documents (or sentences, or any other text unit).

Columns represent the unique words in the vocabulary (or n-grams such as bigrams and trigrams).

Values represent the frequency (count) of each word in each document.

Steps:

Tokenization: The text is broken down into individual words (or tokens).

Vocabulary Building: A vocabulary (the set of unique words) is created from the tokens of all documents.

Frequency Count: For every document, the occurrences of each vocabulary word are counted.
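The three steps can be sketched without any library. This is only a minimal illustration: the two toy documents are made up, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Two toy documents, made up for this illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1 - Tokenization: split each document on whitespace.
tokenized = [doc.split() for doc in docs]

# Step 2 - Vocabulary building: collect the unique tokens,
# sorted so that the column order is stable.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3 - Frequency count: one row per document,
# one column per vocabulary word.
bow_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_rows)    # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```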

**Advantages of BoW:**

Simplicity: BoW is easy to understand and implement.

Efficient for Smaller Datasets: Works well for small text data, especially when there's not much need for understanding context or word order.

No Need for Pre-trained Models: It doesn’t require pre-trained word embeddings, so it's quick to generate.

**Disadvantages of BoW:**

Ignores Context: It doesn't capture context or word order (e.g., "dog bites man" and "man bites dog" get identical representations).

Sparsity: As the vocabulary grows, each document uses only a small fraction of it, so most entries of the matrix are 0. Storing or processing such a matrix densely wastes memory and computation.

High Dimensionality: BoW creates one feature per vocabulary word, which gives a large feature space and can make training machine learning models harder.

Assumes All Words are Equally Important: It only considers the frequency of words, ignoring their importance or relevance.
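The word-order limitation listed above is easy to check. The sketch below uses a plain `Counter` as the bag; the helper name `bag_of_words` is hypothetical and chosen only for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count whitespace-separated tokens, discarding their order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences contain the same words the same number of times,
# so their bags are indistinguishable.
same_bag = (a == b)
print(same_bag)  # True
```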

**Applications of BoW:**

Text Classification: Frequently used in classification problems (spam detection, sentiment analysis) where contextual information isn't crucial.

Information Retrieval: Search engines often use BoW for indexing documents and retrieving them based on word matches.

Topic Modeling: It can be a base input for algorithms like LDA (Latent Dirichlet Allocation) to find topics in a corpus.

# Initialising libraries for preprocessing the data for embedding


In [29]:

import string  # String constants such as string.punctuation
import re  # Regular expressions, used below to strip punctuation
import nltk  # Natural Language Toolkit
from nltk.corpus import stopwords  # List of common English stopwords (e.g., "the", "is") that carry little information in text analysis
from nltk.tokenize import word_tokenize  # Tokenizes a string into individual words
from nltk.stem import WordNetLemmatizer  # Converts words to their base (dictionary) form via lemmatization
from sklearn.feature_extraction.text import CountVectorizer  # Converts a collection of text documents into a matrix of token counts

In [16]:
#Consider the following data with each sentence being treated as a separate document
data = ["”Yes, life is full, there is life even underground,” he began again." ,
"“You wouldn’t believe, Alexey, how I want to live now, what a thirst for existence and consciousness has sprung up in me within these peeling walls…",
" And what is suffering? I am not afraid of it, even if it were beyond reckoning.",
" I am not afraid of it now.",
" I was afraid of it before… And I seem to have such strength in me now, that I think I could stand anything, any suffering, only to be able to say and to repeat to myself every moment, ‘I exist.’",
"In thousands of agonies—I exist.",
" I’m tormented on the rack — but I exist! Though I sit alone on a pillar — I exist! I see the sun, and if I don’t see the sun, I know it’s there. And there’s a whole life in that, in knowing that the sun is there."
]


# Preprocessing the data

In [17]:
# Download stopwords if not already present
nltk.download('stopwords')

# Initialize stopwords list
stop_words = set(stopwords.words('english'))

# Function to preprocess text: lowercasing, removing punctuation, and removing stopwords
def preprocess(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stopwords
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

# Preprocess all sentences
preprocessed_data = [preprocess(sentence) for sentence in data]
print(preprocessed_data)

['yes life full life even underground began', 'wouldnt believe alexey want live thirst existence consciousness sprung within peeling walls', 'suffering afraid even beyond reckoning', 'afraid', 'afraid seem strength think could stand anything suffering able say repeat every moment exist', 'thousands agoniesi exist', 'im tormented rack exist though sit alone pillar exist see sun dont see sun know theres whole life knowing sun']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
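One side effect visible in the output above is the token 'agoniesi'. It comes from "agonies—I": deleting the dash fuses the two neighbouring words. The sketch below reproduces this on that one sentence and compares it with a hypothetical variant that replaces punctuation with a space instead. It is an illustration only, not the preprocessing used in the rest of this notebook.

```python
import re

sentence = "In thousands of agonies—I exist."

# As in the notebook: punctuation is deleted, so the dash vanishes
# and its two neighbouring words are fused into one token.
deleted = re.sub(r'[^\w\s]', '', sentence.lower()).split()

# Hypothetical variant: replace punctuation with a space,
# so the words on either side of the dash stay separate.
spaced = re.sub(r'[^\w\s]', ' ', sentence.lower()).split()

print(deleted)  # ['in', 'thousands', 'of', 'agoniesi', 'exist']
print(spaced)   # ['in', 'thousands', 'of', 'agonies', 'i', 'exist']
```

The variant is a trade-off: it would also split contractions such as "don’t" into "don" and "t".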


In [18]:
# Initialize CountVectorizer (BoW)
vectorizer = CountVectorizer()

In [20]:
# Fit and transform the data to create the BoW matrix
bow_matrix = vectorizer.fit_transform(preprocessed_data)
bow_matrix

<7x50 sparse matrix of type '<class 'numpy.int64'>'
	with 57 stored elements in Compressed Sparse Row format>
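The summary above already tells us how sparse the matrix is. As a small worked calculation, using only the numbers printed in that summary:

```python
# Figures taken from the sparse-matrix summary above.
n_rows, n_cols = 7, 50
stored = 57

total_cells = n_rows * n_cols    # 350 cells in the full matrix
density = stored / total_cells   # fraction of non-zero cells
sparsity = 1 - density           # fraction of zero cells

print(f"density:  {density:.3f}")   # density:  0.163
print(f"sparsity: {sparsity:.3f}")  # sparsity: 0.837
```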

In [21]:
# Convert matrix to an array
bow_array = bow_matrix.toarray()
bow_array

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 1, 0, 1, 1, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
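        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 1, 1, 1, 0,
        0, 0, 1, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 3, 1, 0, 0, 1, 0, 1, 0,
        0, 0, 1, 0, 0, 0]])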

In [22]:
# Get the feature names (i.e., unique words)
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['able', 'afraid', 'agoniesi', 'alexey', 'alone', 'anything',
       'began', 'believe', 'beyond', 'consciousness', 'could', 'dont',
       'even', 'every', 'exist', 'existence', 'full', 'im', 'know',
       'knowing', 'life', 'live', 'moment', 'peeling', 'pillar', 'rack',
       'reckoning', 'repeat', 'say', 'see', 'seem', 'sit', 'sprung',
       'stand', 'strength', 'suffering', 'sun', 'theres', 'think',
       'thirst', 'though', 'thousands', 'tormented', 'underground',
       'walls', 'want', 'whole', 'within', 'wouldnt', 'yes'], dtype=object)

In [23]:
# Display the BoW matrix and feature names
print("Feature Names (Vocabulary):")
print(feature_names)

Feature Names (Vocabulary):
['able' 'afraid' 'agoniesi' 'alexey' 'alone' 'anything' 'began' 'believe'
 'beyond' 'consciousness' 'could' 'dont' 'even' 'every' 'exist'
 'existence' 'full' 'im' 'know' 'knowing' 'life' 'live' 'moment' 'peeling'
 'pillar' 'rack' 'reckoning' 'repeat' 'say' 'see' 'seem' 'sit' 'sprung'
 'stand' 'strength' 'suffering' 'sun' 'theres' 'think' 'thirst' 'though'
 'thousands' 'tormented' 'underground' 'walls' 'want' 'whole' 'within'
 'wouldnt' 'yes']


In [24]:
print("\nBag of Words Matrix:")
print(bow_array)


Bag of Words Matrix:
[[0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0
  0 0 0 1 0 0 0 0 1 1 0 1 1 0]
 [0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1
  0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 1 0 0 2 0 0 1 1 1 1 0 0 0 1 1 0 0 0 2 0 1 0 0 0 0
  3 1 0 0 1 0 1 0 0 0 1 0 0 0]]
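A raw row of counts is hard to read on its own. One way to interpret it is to pair each count with its vocabulary word and keep only the non-zero entries. The sketch below does this for the first document; the vocabulary and the row are copied from the outputs above, so the block runs by itself.

```python
# Vocabulary copied from the notebook output above.
feature_names = [
    'able', 'afraid', 'agoniesi', 'alexey', 'alone', 'anything',
    'began', 'believe', 'beyond', 'consciousness', 'could', 'dont',
    'even', 'every', 'exist', 'existence', 'full', 'im', 'know',
    'knowing', 'life', 'live', 'moment', 'peeling', 'pillar', 'rack',
    'reckoning', 'repeat', 'say', 'see', 'seem', 'sit', 'sprung',
    'stand', 'strength', 'suffering', 'sun', 'theres', 'think',
    'thirst', 'though', 'thousands', 'tormented', 'underground',
    'walls', 'want', 'whole', 'within', 'wouldnt', 'yes',
]

# First row of the BoW matrix (the first document).
first_row = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
             0, 0, 0, 0, 0, 1]

# Keep only the words that actually occur in the document.
nonzero = {word: count for word, count in zip(feature_names, first_row) if count}
print(nonzero)
# {'began': 1, 'even': 1, 'full': 1, 'life': 2, 'underground': 1, 'yes': 1}
```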


In [27]:
# Output the number of unique words (vocabulary size)
print(f"Number of unique words in vocabulary: {len(feature_names)}")

# Output the shape of the BoW matrix (number of rows, number of columns/features)
print(f"Shape of the BoW matrix: {bow_array.shape}")

Number of unique words in vocabulary: 50
Shape of the BoW matrix: (7, 50)


**Inference and Analysis**

As we see, the BoW matrix is very sparse: only 57 of its 350 entries are non-zero. Some words, like 'exist', appear in three documents. Also, the word 'again' does not appear in the vocabulary at all, even though it occurs in the first document.

**Question:** The words 'wouldn't' and 'again' are both rare, yet 'again' was dropped from the vocabulary while 'wouldn't' was kept (as 'wouldnt'). Why did this happen?
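These observations can be checked by counting, for each word, the number of documents it occurs in (its document frequency). The sketch below works on the preprocessed strings copied from the output above and uses whitespace splitting, which for this data yields the same tokens as the vectorizer.

```python
from collections import Counter

# Preprocessed documents copied from the notebook output above.
preprocessed_data = [
    'yes life full life even underground began',
    'wouldnt believe alexey want live thirst existence consciousness sprung within peeling walls',
    'suffering afraid even beyond reckoning',
    'afraid',
    'afraid seem strength think could stand anything suffering able say repeat every moment exist',
    'thousands agoniesi exist',
    'im tormented rack exist though sit alone pillar exist see sun dont see sun know theres whole life knowing sun',
]

# Document frequency: in how many documents does each word occur at least once?
doc_freq = Counter()
for doc in preprocessed_data:
    doc_freq.update(set(doc.split()))

print(doc_freq["exist"])       # 3
print("again" in doc_freq)     # False
print(len(doc_freq))           # 50 (vocabulary size)
print(sum(doc_freq.values()))  # 57 (non-zero cells in the matrix)
```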

In [28]:
from nltk.corpus import stopwords
print("again" in stopwords.words('english'))  # Check if "again" is a stopword

True


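**Answer**:
Thus we see that

"again" was dropped: This is because "again" is included in NLTK's English stopwords list.

"wouldn't" was kept: The preprocessing strips punctuation before it removes stopwords, so "wouldn’t" has already become "wouldnt" by the time the stopword filter runs. NLTK's list contains the contraction with its apostrophe ("wouldn't"), not "wouldnt", so the token does not match and stays in the vocabulary. The same happens to "dont".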