<a href="https://colab.research.google.com/github/Swap1984/-Bag-of-Words-N-gram-NLP-Feature-Extraction-Complete-Conceptual-Walkthrough/blob/main/Bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code for Bag of Words embedding




**Bag of Words (BoW): Method of Operation**

The Bag of Words (BoW) model is a simple and widely used technique for text representation. It converts a collection of texts into a matrix where:

Rows represent documents (or sentences, or any other text unit).

Columns represent unique words (or bigrams, trigrams, etc.).

Values represent the frequency (count) of each word in each document.

Steps:

Tokenization: The text is broken down into individual words (or tokens).

Vocabulary Building: A vocabulary (set of unique words) is created from the tokens.

Frequency Count: Each word in the vocabulary is counted for its occurrences in the document.
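
The three steps above can be sketched in plain Python. This is only a minimal illustration: the function name bag_of_words, the naive whitespace tokenizer, and the three sample documents are assumptions made for the example, not part of the notebook's pipeline.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) for a list of plain-text documents."""
    # Step 1: tokenization (naive lowercase whitespace split, for illustration only)
    tokenized = [doc.lower().split() for doc in documents]
    # Step 2: vocabulary building (sorted list of unique tokens across all documents)
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Step 3: frequency count (one row per document, one column per vocabulary word)
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["the sun is there", "i see the sun", "i exist"]
vocab, bow = bag_of_words(docs)
print(vocab)  # ['exist', 'i', 'is', 'see', 'sun', 'the', 'there']
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so documents that share few words produce rows that are mostly zeros.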

**Advantages of BoW:**

Simplicity: BoW is easy to understand and implement.

Efficient for Smaller Datasets: Works well for small text data, especially when there's not much need for understanding context or word order.

No Need for Pre-trained Models: It doesn’t require pre-trained word embeddings, so it's quick to generate.

**Disadvantages of BoW:**

Ignores Context: It doesn't capture context or word order (e.g., "dog bites man" and "man bites dog" produce identical word counts).

Sparsity: As vocabulary size increases, the matrix becomes sparse (many 0s) for larger datasets, which can increase computational cost.

High Dimensionality: BoW creates a large feature space, which can make training machine learning models harder.

Assumes All Words are Equally Important: It only considers the frequency of words, ignoring their importance or relevance.
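
The word-order limitation can be checked directly. The sketch below uses a hypothetical helper, ngram_counts, written for this example: the two sentences have identical unigram counts, and they only become distinguishable once bigrams are counted.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


a = "dog bites man"
b = "man bites dog"

same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)
print(same_unigrams)  # True: word order is lost
print(same_bigrams)   # False: bigrams keep some local order
```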

**Applications of BoW:**

Text Classification: Frequently used in classification problems (spam detection, sentiment analysis) where contextual information isn't crucial.

Information Retrieval: Search engines often use BoW for indexing documents and retrieving based on word matches.

Topic Modeling: It can be a base input for algorithms like LDA (Latent Dirichlet Allocation) to find topics in a corpus.

# Initialising libraries for preprocessing the data for embedding


In [8]:

import string  # Used to remove punctuation from the text.
import re  # Regular expressions, useful for operations like replacing repeated characters.
from nltk.corpus import stopwords  # Provides a list of common English stopwords (e.g., "the", "is") that are generally not informative in text analysis.
from nltk.tokenize import word_tokenize  # Tokenizes a string into individual words.
from nltk.stem import WordNetLemmatizer  # Converts words to their base (dictionary) form using lemmatization.


In [9]:
# Use a paragraph of text as the input data
data = """Yes, life is full, there is life even underground,” he began again. “You wouldn’t believe, Alexey, how I want to live now, what a thirst for existence and consciousness has sprung up in me within these peeling walls… And what is suffering? I am not afraid of it, even if it were beyond reckoning. I am not afraid of it now. I was afraid of it before… And I seem to have such strength in me now, that I think I could stand anything, any suffering, only to be able to say and to repeat to myself every moment, ‘I exist.’ In thousands of agonies — I exist. I’m tormented on the rack — but I exist! Though I sit alone on a pillar — I exist! I see the sun, and if I don’t see the sun, I know it’s there. And there’s a whole life in that, in knowing that the sun is there."""

# Preprocessing the data

In [10]:
# Lowercasing
data = data.lower()
data


'yes, life is full, there is life even underground,” he began again. “you wouldn’t believe, alexey, how i want to live now, what a thirst for existence and consciousness has sprung up in me within these peeling walls… and what is suffering? i am not afraid of it, even if it were beyond reckoning. i am not afraid of it now. i was afraid of it before… and i seem to have such strength in me now, that i think i could stand anything, any suffering, only to be able to say and to repeat to myself every moment, ‘i exist.’ in thousands of agonies — i exist. i’m tormented on the rack — but i exist! though i sit alone on a pillar — i exist! i see the sun, and if i don’t see the sun, i know it’s there. and there’s a whole life in that, in knowing that the sun is there.'

In [11]:
# Remove ellipses and punctuation, including curly quotation marks and em dashes
data = re.sub(r'\.{2,}', '', data)  # Remove ASCII ellipses (two or more consecutive dots)
data = re.sub(r'…', ' ', data)  # Replace the Unicode ellipsis character with a space
data = re.sub(r'\s+', ' ', data)  # Collapse runs of whitespace into a single space
punctuation = string.punctuation + "“”‘’—"  # Add curly quotes and the em dash to the ASCII punctuation
data = data.translate(str.maketrans("", "", punctuation))  # Strip every punctuation character
data


'yes life is full there is life even underground he began again you wouldnt believe alexey how i want to live now what a thirst for existence and consciousness has sprung up in me within these peeling walls and what is suffering i am not afraid of it even if it were beyond reckoning i am not afraid of it now i was afraid of it before and i seem to have such strength in me now that i think i could stand anything any suffering only to be able to say and to repeat to myself every moment i exist in thousands of agonies  i exist im tormented on the rack  but i exist though i sit alone on a pillar  i exist i see the sun and if i dont see the sun i know its there and theres a whole life in that in knowing that the sun is there'

The ASCII ellipsis pattern alone did not remove the ellipses, because the text uses the single Unicode ellipsis character (…). That character is therefore replaced with a space, and runs of whitespace are then collapsed into one space.

string.punctuation covers only ASCII punctuation, so the curly quotation marks and the em dash survived the first attempt. Adding them to the punctuation set before calling translate removes them as well. The double spaces left where the em dashes used to be are harmless, because the later split and tokenization steps ignore them.

All punctuation marks have now been removed.
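
A small check illustrates why the extra characters had to be added. This is a sketch on a short sample taken from the paragraph; the variable names are chosen for the example, and the final split/join is an optional extra step (not in the cell above) that also collapses the double space left by the em dash.

```python
import string

# Typographic characters that appear in the paragraph
curly = "“”‘’—…"

# None of them is part of the ASCII-only string.punctuation
missing = [ch for ch in curly if ch not in string.punctuation]
print(missing)

sample = "‘i exist.’ in thousands of agonies — i exist!"

# ASCII punctuation only: the curly quotes and the em dash survive
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)

# Extended set: everything is removed; split/join normalizes the spacing
extended = sample.translate(str.maketrans("", "", string.punctuation + curly))
cleaned = " ".join(extended.split())
print(cleaned)  # i exist in thousands of agonies i exist
```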

In [12]:
import nltk
nltk.download('stopwords')  # Download the stopwords resource

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
# Removing stopwords
stop_words = set(stopwords.words("english"))
data = " ".join([word for word in data.split() if word not in stop_words])
data

'yes life full life even underground began wouldnt believe alexey want live thirst existence consciousness sprung within peeling walls suffering afraid even beyond reckoning afraid afraid seem strength think could stand anything suffering able say repeat every moment exist thousands agonies exist im tormented rack exist though sit alone pillar exist see sun dont see sun know theres whole life knowing sun'

In [14]:
# Required because an error occurred during tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [15]:
#Tokenization
tokens = word_tokenize(data)
tokens


['yes',
 'life',
 'full',
 'life',
 'even',
 'underground',
 'began',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peeling',
 'walls',
 'suffering',
 'afraid',
 'even',
 'beyond',
 'reckoning',
 'afraid',
 'afraid',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'suffering',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousands',
 'agonies',
 'exist',
 'im',
 'tormented',
 'rack',
 'exist',
 'though',
 'sit',
 'alone',
 'pillar',
 'exist',
 'see',
 'sun',
 'dont',
 'see',
 'sun',
 'know',
 'theres',
 'whole',
 'life',
 'knowing',
 'sun']

In [16]:
# Lemmatization to convert the words to their base form
import nltk
nltk.download('wordnet')  # Required because an error occurred during lemmatization
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]
tokens

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['yes',
 'life',
 'full',
 'life',
 'even',
 'underground',
 'began',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peeling',
 'wall',
 'suffering',
 'afraid',
 'even',
 'beyond',
 'reckoning',
 'afraid',
 'afraid',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'suffering',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousand',
 'agony',
 'exist',
 'im',
 'tormented',
 'rack',
 'exist',
 'though',
 'sit',
 'alone',
 'pillar',
 'exist',
 'see',
 'sun',
 'dont',
 'see',
 'sun',
 'know',
 'there',
 'whole',
 'life',
 'knowing',
 'sun']

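The lemmatized output still contains inflected forms such as 'began', 'peeling' and 'suffering'. WordNetLemmatizer treats every word as a noun unless a part of speech is passed, so verb forms are left unchanged. To fix this, we perform part-of-speech (POS) tagging and pass each token's tag to the lemmatizer: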


In [17]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')
#Function to get the WordNet POS tag for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if no match


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [18]:
tokens = word_tokenize(data)
pos_tags = pos_tag(tokens)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tags]
lemmatized_tokens

['yes',
 'life',
 'full',
 'life',
 'even',
 'underground',
 'begin',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peel',
 'wall',
 'suffer',
 'afraid',
 'even',
 'beyond',
 'reckon',
 'afraid',
 'afraid',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'suffer',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousand',
 'agony',
 'exist',
 'im',
 'torment',
 'rack',
 'exist',
 'though',
 'sit',
 'alone',
 'pillar',
 'exist',
 'see',
 'sun',
 'dont',
 'see',
 'sun',
 'know',
 'theres',
 'whole',
 'life',
 'know',
 'sun']

The lemmatized list contains repeated words. We will keep only the unique words:


In [19]:
#Remove duplicates while preserving order
unique_tokens = list(dict.fromkeys(lemmatized_tokens))
unique_tokens

['yes',
 'life',
 'full',
 'even',
 'underground',
 'begin',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peel',
 'wall',
 'suffer',
 'afraid',
 'beyond',
 'reckon',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousand',
 'agony',
 'im',
 'torment',
 'rack',
 'though',
 'sit',
 'alone',
 'pillar',
 'see',
 'sun',
 'dont',
 'know',
 'theres',
 'whole']

In [20]:
# Keep only purely alphabetic tokens (drops tokens that contain digits or other non-letters)
final_tokens = [token for token in unique_tokens if token.isalpha()]
final_tokens

['yes',
 'life',
 'full',
 'even',
 'underground',
 'begin',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peel',
 'wall',
 'suffer',
 'afraid',
 'beyond',
 'reckon',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousand',
 'agony',
 'im',
 'torment',
 'rack',
 'though',
 'sit',
 'alone',
 'pillar',
 'see',
 'sun',
 'dont',
 'know',
 'theres',
 'whole']

In [22]:
# Handling rare words, which may not add much value and can therefore be removed

from collections import Counter
word_freq = Counter(final_tokens)
final_tokens1 = [token for token in final_tokens if word_freq[token] > 1]
final_tokens1


[]

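The result is an empty list because the frequencies were computed on final_tokens, from which duplicates had already been removed. Every word therefore has a count of 1 and none passes the > 1 filter. We move ahead with the final_tokens list for embedding.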
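
For comparison, a rare-word filter only has something to work with if the counts are taken before deduplication. The sketch below uses a short excerpt of the lemmatized tokens (the tail of the list produced by the POS-aware lemmatization); the variable names are hypothetical and this is not the notebook's own cell.

```python
from collections import Counter

# Excerpt of lemmatized tokens with duplicates still present
lemmatized_excerpt = ["exist", "see", "sun", "dont", "see", "sun", "know",
                      "theres", "whole", "life", "know", "sun"]

# Count first, while repetitions are still visible
word_freq = Counter(lemmatized_excerpt)

# Then deduplicate (order-preserving) and keep words seen more than once
unique_excerpt = list(dict.fromkeys(lemmatized_excerpt))
frequent = [token for token in unique_excerpt if word_freq[token] > 1]
print(frequent)  # ['see', 'sun', 'know']
```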

In [23]:
final_tokens

['yes',
 'life',
 'full',
 'even',
 'underground',
 'begin',
 'wouldnt',
 'believe',
 'alexey',
 'want',
 'live',
 'thirst',
 'existence',
 'consciousness',
 'sprung',
 'within',
 'peel',
 'wall',
 'suffer',
 'afraid',
 'beyond',
 'reckon',
 'seem',
 'strength',
 'think',
 'could',
 'stand',
 'anything',
 'able',
 'say',
 'repeat',
 'every',
 'moment',
 'exist',
 'thousand',
 'agony',
 'im',
 'torment',
 'rack',
 'though',
 'sit',
 'alone',
 'pillar',
 'see',
 'sun',
 'dont',
 'know',
 'theres',
 'whole']

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [26]:
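# Request unigrams and bigrams. Each list element is treated as a separate
# one-word document here, so only unigrams end up in the vocabulary.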
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(final_tokens)
unigrams_bigrams = vectorizer.get_feature_names_out()
unigrams_bigrams

array(['able', 'afraid', 'agony', 'alexey', 'alone', 'anything', 'begin',
       'believe', 'beyond', 'consciousness', 'could', 'dont', 'even',
       'every', 'exist', 'existence', 'full', 'im', 'know', 'life',
       'live', 'moment', 'peel', 'pillar', 'rack', 'reckon', 'repeat',
       'say', 'see', 'seem', 'sit', 'sprung', 'stand', 'strength',
       'suffer', 'sun', 'theres', 'think', 'thirst', 'though', 'thousand',
       'torment', 'underground', 'wall', 'want', 'whole', 'within',
       'wouldnt', 'yes'], dtype=object)
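
The feature names above contain no bigrams because each element of the list passed to fit is treated as its own document, and a one-word document has no word pairs. The plain-Python sketch below mimics word n-gram extraction to show the difference. The helpers word_ngrams and vocabulary are assumptions written for this illustration, not scikit-learn's implementation.

```python
def word_ngrams(document, min_n=1, max_n=2):
    """Return all word n-grams of a document for n in [min_n, max_n]."""
    tokens = document.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def vocabulary(documents, min_n=1, max_n=2):
    """Sorted set of n-grams found across a list of documents."""
    return sorted({g for doc in documents for g in word_ngrams(doc, min_n, max_n)})


tokens = ["see", "sun", "know"]

# One document per token: no document is long enough to contain a bigram
per_token_vocab = vocabulary(tokens)
print(per_token_vocab)  # ['know', 'see', 'sun']

# One document made of all tokens: bigrams appear
joined_vocab = vocabulary([" ".join(tokens)])
print(joined_vocab)  # ['know', 'see', 'see sun', 'sun', 'sun know']
```

Passing a single joined string (wrapped in a list) to the vectorizer would be the analogous change if bigram features are actually wanted.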

# BOW embedding code

In [27]:
text_data = " ".join(final_tokens)
text_data

'yes life full even underground begin wouldnt believe alexey want live thirst existence consciousness sprung within peel wall suffer afraid beyond reckon seem strength think could stand anything able say repeat every moment exist thousand agony im torment rack though sit alone pillar see sun dont know theres whole'

In [28]:
# BoW matrix
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform([text_data])
bow_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 49 stored elements and shape (1, 49)>

In [29]:
#BOW array
bow_array = bow_matrix.toarray()
bow_array

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1]])

In [30]:
print("Unigrams and Bigrams:")
print(unigrams_bigrams)  # Feature names produced by the n-gram vectorizer

print("\nBag of Words Matrix:")
print(bow_array)  # The BoW representation (count of each vocabulary word)

Unigrams and Bigrams:
['able' 'afraid' 'agony' 'alexey' 'alone' 'anything' 'begin' 'believe'
 'beyond' 'consciousness' 'could' 'dont' 'even' 'every' 'exist'
 'existence' 'full' 'im' 'know' 'life' 'live' 'moment' 'peel' 'pillar'
 'rack' 'reckon' 'repeat' 'say' 'see' 'seem' 'sit' 'sprung' 'stand'
 'strength' 'suffer' 'sun' 'theres' 'think' 'thirst' 'though' 'thousand'
 'torment' 'underground' 'wall' 'want' 'whole' 'within' 'wouldnt' 'yes']

Bag of Words Matrix:
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1]]


# Inference

Analysis and interpretation

Sparsity:

Sparsity is the proportion of zero entries in the BoW matrix. Here there is no sparsity at all: the matrix has a single row, and the vocabulary was built from that same document, so every column has a non-zero count.
Normally a BoW matrix is sparse because each document contains only a small subset of the corpus vocabulary, so most of its entries are zero.

Word Frequencies:

Every token in the vocabulary has a count of 1. Although unigrams and bigrams were requested, the vocabulary contains unigrams only, because the n-gram vectorizer was fitted on a list of single-word strings.
Interpretation: The uniform counts are a result of the preprocessing, not of the text itself. Duplicates were removed before vectorization, so words that are repeated in the original paragraph (for example "life", "afraid", "exist" and "sun") are counted only once.

The preprocessing used here has therefore led to an unusual BoW matrix with no sparsity and uniform frequencies.
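
Sparsity can be computed directly once there is more than one document. The sketch below builds a small count matrix in plain Python from three short preprocessed sentences of the same paragraph and reports the share of zero entries. The variable names are chosen for the example and no scikit-learn call is involved.

```python
from collections import Counter

docs = [
    "yes life full life even underground began",
    "suffering afraid even beyond reckoning",
    "afraid",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})

# One row of counts per document, one column per vocabulary word
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

total = len(matrix) * len(vocab)
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / total
print(len(vocab), zeros, total)  # 10 18 30
print(sparsity)                  # 0.6
```

With a single document the same calculation gives 0, since every vocabulary word comes from that one document.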


**The output shown below treats each sentence of the input paragraph as a separate document and runs the BoW code on the resulting set of documents. The code lives in another Colab file but is run here.**

In [31]:
%run "/content/Assignment_Bag_of_words_multiple_documents_at_same_time.ipynb"

['yes life full life even underground began', 'wouldnt believe alexey want live thirst existence consciousness sprung within peeling walls', 'suffering afraid even beyond reckoning', 'afraid', 'afraid seem strength think could stand anything suffering able say repeat every moment exist', 'thousands agoniesi exist', 'im tormented rack exist though sit alone pillar exist see sun dont see sun know theres whole life knowing sun']
Feature Names (Vocabulary):
['able' 'afraid' 'agoniesi' 'alexey' 'alone' 'anything' 'began' 'believe'
 'beyond' 'consciousness' 'could' 'dont' 'even' 'every' 'exist'
 'existence' 'full' 'im' 'know' 'knowing' 'life' 'live' 'moment' 'peeling'
 'pillar' 'rack' 'reckoning' 'repeat' 'say' 'see' 'seem' 'sit' 'sprung'
 'stand' 'strength' 'suffering' 'sun' 'theres' 'think' 'thirst' 'though'
 'thousands' 'tormented' 'underground' 'walls' 'want' 'whole' 'within'
 'wouldnt' 'yes']

Bag of Words Matrix:
[[0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
