**Bag of Words (BoW) & TF-IDF (Term Frequency-Inverse Document Frequency)**

**Bag of Words (BoW)**

- Bag of Words (BoW) is an NLP method that represents text as numerical data: it builds a vocabulary of the unique words in the dataset and counts how often each word occurs in each document. It ignores grammar, word order, and context, keeping only word frequency.

<img src="https://miro.medium.com/v2/resize:fit:893/1*axffCQ9ae0FHXxhuy66FbA.png" style="max-width: 893px; height: 152px; margin: 0px; width: 351px;" alt="Bag of Words model in NLP (Aysel Aydin, Medium)">

<img src="https://i.postimg.cc/pLvhy7zs/basicbow.png" style="max-width: 880px; height: 156px; margin: 0.5px 0px; width: 351px;" alt="NLP - Bag of Words (BoW)">
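
As a minimal sketch of the idea, the snippet below builds a BoW matrix by hand using only the standard library. The two toy documents and the helper name `bag_of_words` are illustrative assumptions and are not part of this notebook's pipeline.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every unique word in the dataset, sorted for a stable column order
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["the cat sat on the mat", "the dog sat"]  # toy documents (assumed)
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in matrix:
    print(row)  # [1, 0, 1, 1, 1, 2] then [0, 1, 0, 0, 1, 1]
```

Word order is lost: "the cat sat" and "sat the cat" would produce the same row.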

**TF-IDF (Term Frequency-Inverse Document Frequency)**

- TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used in NLP to determine the importance of a word in a document relative to a collection of documents (corpus). It assigns higher weights to words that are frequent in a specific document but rare across the corpus.

<img src="https://assets.zilliz.com/TF_IDF_Understanding_Term_Frequency_Inverse_Document_Frequency_in_NLP_04d3c51de7.png" style="max-width: 2400px; height: 184px; margin: 0.5px 0px; width: 351px;" alt="TF-IDF - Understanding Term Frequency-Inverse Document Frequency in NLP (Zilliz Learn)">
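
To make the weighting concrete, here is a small sketch of the textbook formula, with tf = count / document length and idf = log(N / df). The toy documents and the helper name `tf_idf` are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    # Document frequency: number of documents that contain each word
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            w: (counts[w] / len(tokens)) * math.log(n_docs / df[w])
            for w in counts
        })
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]  # toy documents (assumed)
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "the" occurs in every document and gets weight 0, "sat" occurs in two documents and gets a small weight, and "cat" occurs only here and gets the largest weight.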

**Step 1: Importing Required Libraries**

In [1]:
# nltk: natural language processing toolkit
# re: regular expressions, used to clean the text
# stopwords: common words that carry little meaning and are removed
# PorterStemmer, WordNetLemmatizer: stemming and lemmatization

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# First run only: fetch the NLTK data used below (resource names can vary by NLTK version)
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')


**Step 2: Paragraph Input**

In [4]:
# Text data for preprocessing and feature extraction

paragraph = """Over the span of 3000 years, India has been a land where diverse cultures, ideas, and forces from around the world have interacted with us. We've experienced invasions, territorial conquests, and profound shifts in our ways of thinking. Today, I envision a future where India stands united, empowered by its rich heritage and ready to face the challenges of tomorrow...."""



In [6]:
paragraph

"Over the span of 3000 years, India has been a land where diverse cultures, ideas, and forces from around the world have interacted with us. We've experienced invasions, territorial conquests, and profound shifts in our ways of thinking. Today, I envision a future where India stands united, empowered by its rich heritage and ready to face the challenges of tomorrow...."

**Step 3: Tokenization (Splitting paragraph into sentences)**

In [9]:
sentences = nltk.sent_tokenize(paragraph)
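
`nltk.sent_tokenize` returns a list of sentence strings, one per detected sentence. The sketch below only illustrates the shape of that result: it uses a naive regular expression as a stand-in, not NLTK's pretrained Punkt tokenizer, and the helper name `split_sentences` and the shortened sample text are assumptions.

```python
import re

def split_sentences(text):
    """Rough sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Shortened sample in the spirit of the paragraph above (assumed)
sample = ("India has been a land of diverse cultures. "
          "We've experienced invasions. "
          "Today, I envision a united future.")
sentences_demo = split_sentences(sample)
for s in sentences_demo:
    print(s)
```

A rule this simple would also split after abbreviations such as "Dr.", which is one reason to prefer a trained tokenizer on real text.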


**Step 4: Initializing Tools for Cleaning**
- PorterStemmer for stemming (not used in this implementation)
- WordNetLemmatizer for lemmatization (used in this implementation)

In [12]:
ps = PorterStemmer()
wordnet = WordNetLemmatizer()

**Step 5: Text Cleaning and Preprocessing**
- Create an empty list to store the cleaned sentences

In [15]:
corpus = []

In [17]:
# Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    # Replace every non-alphabetic character with a space
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()  # Convert text to lowercase
    review = review.split()  # Split sentence into words

    # Remove stopwords and lemmatize the remaining words
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]

    # Join the cleaned words back into a sentence
    review = ' '.join(review)
    corpus.append(review)  # Append the cleaned sentence to the corpus
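
To see what the cleaning steps do to a single sentence, the following sketch applies the same regex, lowercasing, and stopword filtering to the second sentence of the paragraph. It deliberately leaves out lemmatization and uses a tiny hand-written stopword set in place of `stopwords.words('english')`. Both simplifications, along with the names `DEMO_STOPWORDS` and `clean_sentence`, are assumptions made to keep the example dependency-free.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch
DEMO_STOPWORDS = {"we", "ve", "and", "in", "our", "of"}

def clean_sentence(sentence, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split, and drop stopwords (no lemmatization)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stop_words)

raw = ("We've experienced invasions, territorial conquests, "
       "and profound shifts in our ways of thinking.")
cleaned = clean_sentence(raw)
print(cleaned)
# experienced invasions territorial conquests profound shifts ways thinking
```

Note that the regex turns the apostrophe in "We've" into a space, so the contraction becomes the two tokens "we" and "ve" before stopword removal.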

**Step 6: Creating the Bag of Words (BoW) Model**
- Importing CountVectorizer to create the BoW representation

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cv_bow = CountVectorizer()  # Initialize CountVectorizer
X_bow = cv_bow.fit_transform(corpus).toarray()  # Fit and transform the cleaned data into a matrix

**Step 7: Creating the TF-IDF Model**
- Importing TfidfVectorizer to create the TF-IDF representation

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()  # Initialize TfidfVectorizer
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()  # Fit and transform the cleaned data into a matrix

**Step 8: Displaying Results**
- Display the feature matrix for Bag of Words

In [26]:
print("\nBag of Words Matrix:")
print(X_bow)


Bag of Words Matrix:
[[1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1]
 [0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0]
 [0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0]]


- Display the feature matrix for TF-IDF

In [29]:
print("\nTF-IDF Matrix:")
print(X_tfidf)


TF-IDF Matrix:
[[0.30746099 0.         0.         0.30746099 0.30746099 0.
  0.         0.         0.         0.30746099 0.         0.
  0.30746099 0.23383201 0.30746099 0.         0.30746099 0.
  0.         0.         0.         0.30746099 0.         0.
  0.         0.         0.         0.         0.         0.30746099
  0.30746099]
 [0.         0.         0.35355339 0.         0.         0.
  0.         0.35355339 0.         0.         0.         0.
  0.         0.         0.         0.35355339 0.         0.35355339
  0.         0.         0.35355339 0.         0.         0.35355339
  0.35355339 0.         0.         0.         0.35355339 0.
  0.        ]
 [0.         0.28195987 0.         0.         0.         0.28195987
  0.28195987 0.         0.28195987 0.         0.28195987 0.28195987
  0.         0.21443775 0.         0.         0.         0.
  0.28195987 0.28195987 0.         0.         0.28195987 0.
  0.         0.28195987 0.28195987 0.28195987 0.         0.
  0.        ]]
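
The values in this matrix can be reproduced by hand from the Bag of Words matrix. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. In the BoW output every count is 1, and only column 13 is nonzero in more than one row (rows 1 and 3); the alphabetical column ordering suggests this is `india`. The helper names are assumptions.

```python
import math

n_docs = 3  # the paragraph was split into three sentences

def smooth_idf(df):
    """Smoothed IDF (scikit-learn's documented default): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(values):
    """Scale a row so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

# Row 1: ten words unique to it (df = 1) and one word shared with row 3 (df = 2).
# Every count is 1, so tf * idf is just idf.
row1 = l2_normalize([smooth_idf(1)] * 10 + [smooth_idf(2)])
# Row 2: eight words, all unique to it, so each weight is 1 / sqrt(8).
row2 = l2_normalize([smooth_idf(1)] * 8)

print(round(row1[0], 8), round(row1[-1], 8))
print(round(row2[0], 8))
```

Under these assumptions the three printed numbers should agree with the 0.30746099, 0.23383201, and 0.35355339 entries above. The shared word gets the lower weight because it is less specific to any one sentence.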
