# NLP: Bag of Words

> Description:
  * Bag of Words is a simple NLP technique that represents text as a collection of individual words and their frequencies. Each document is split into its constituent words, and the occurrences of each word are counted. Applied to a set of documents, this yields a matrix of word counts (one row per document, one column per vocabulary word) that is mostly zeros, i.e. sparse. The matrix can feed NLP tasks such as sentiment analysis, document classification, and information retrieval. Because Bag of Words ignores the order and context of words, it is less useful for analyses that depend on them.
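
To make the "order is ignored" point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping for one document (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

# Two hypothetical sentences containing the same words in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True: the two bags are identical, so word order is lost
```

The two sentences mean different things, yet their bags are equal, which is exactly the limitation described above.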

In [1]:
# importing the libraries 

import nltk

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [24]:
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [25]:
# Creating a paragraph on which we will be performing BagOfWords.

paragraph  = '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''

In [26]:
# Performing sentence tokenization:

sentences = nltk.sent_tokenize(paragraph)
print(len(sentences))

12
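
For intuition about what sentence tokenization does, the sketch below splits text with a plain regular expression. This is a deliberately naive stand-in, not the algorithm `nltk.sent_tokenize` uses; the function name `naive_sent_split` and the sample text are assumptions made for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "The CMB is faint, at just 2.7 Kelvin. It is remarkably uniform! Why?"
sentences_demo = naive_sent_split(sample)

for s in sentences_demo:
    print(s)
print(len(sentences_demo))  # 3

# The rule has no notion of abbreviations, so it splits here by mistake:
print(naive_sent_split("Dr. Smith left."))  # ['Dr.', 'Smith left.']
```

The decimal point in `2.7` is not treated as a boundary because no whitespace follows it, but the abbreviation case shows why a trained tokenizer is usually preferred over a bare regex.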


In [29]:
# Performing lemmatization and making all the characters lowercase:
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []

for sentence in sentences:
  # Replace every non-letter character (digits, punctuation) with a space
  review = re.sub("[^a-zA-Z]", ' ', sentence)
  review = review.lower()
  review = review.split()
  # Drop stop words and lemmatize the remaining words
  review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
  review = ' '.join(review)

  corpus.append(review)
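
To see what the cleaning step does to a single sentence, here is a rough standalone sketch. It assumes a tiny hand-picked stop word set (`DEMO_STOPWORDS`, hypothetical and far shorter than NLTK's English list) and skips lemmatization so that it runs without any NLTK data.

```python
import re

# Hypothetical, hand-picked stop words for illustration only.
DEMO_STOPWORDS = {"the", "is", "a", "of", "with", "just", "it"}

def clean_sentence(sentence, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    kept = [t for t in tokens if t not in stop_words]
    return " ".join(kept)

cleaned = clean_sentence(
    "The CMB is incredibly faint, with a temperature of just 2.7 Kelvin."
)
print(cleaned)  # cmb incredibly faint temperature kelvin
```

Note that the pattern `[^a-zA-Z]` also removes digits, so the value `2.7` disappears entirely, and a hyphenated word such as `horn-shaped` becomes two separate tokens.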


In [36]:
# Creating the Bag of Words using sklearn:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

# fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array
x = cv.fit_transform(corpus).toarray()

In [38]:
x.shape

(12, 124)
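
The shape reads as (number of sentences, number of distinct vocabulary words). The sketch below rebuilds that kind of count matrix with the standard library only, as an approximation of `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The function `count_matrix` and the three-sentence `demo_corpus` are hypothetical.

```python
import re

def count_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    token_re = re.compile(r"\b\w\w+\b")  # tokens of at least two word characters
    tokenized = [token_re.findall(d.lower()) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            matrix[i][index[t]] += 1
    return vocab, matrix

# Hypothetical pre-cleaned sentences, in the same style as the corpus list.
demo_corpus = [
    "cmb faint temperature kelvin",
    "cmb uniform across entire sky",
    "temperature variation temperature",
]

vocab, matrix = count_matrix(demo_corpus)
print(vocab)
print(len(matrix), len(vocab))  # 3 9
print(matrix[2])                # [0, 0, 0, 0, 0, 0, 2, 0, 1]
```

Each column index maps back to a word through `vocab`, which is what the numeric column labels of a count matrix stand for.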

In [43]:
import pandas as pd 

pd.DataFrame(x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,114,115,116,117,118,119,120,121,122,123
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,1,1,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,1,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
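
Most entries in a count matrix like this are zero, which is why it is usually described as sparse. The sketch below measures that on a small made-up matrix (`demo`, not the notebook's data) and shows one simple way to keep only the non-zero entries; the helper names `sparsity` and `to_sparse` are hypothetical.

```python
def sparsity(matrix):
    """Fraction of entries in a dense list-of-lists matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def to_sparse(matrix):
    """Keep only non-zero entries as {(row, column): count}."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }

# Made-up 4 x 5 count matrix for illustration.
demo = [
    [0, 0, 1, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]

print(sparsity(demo))   # 0.8
print(to_sparse(demo))  # {(0, 2): 1, (1, 1): 2, (3, 0): 1, (3, 4): 1}
```

Storing 4 entries instead of 20 is a small saving here, but the gap grows quickly as the vocabulary gets larger, which is the motivation for keeping count matrices in a sparse format until a dense array is actually needed.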
