#### Definition:

Bag of Words (BoW) is a simple, widely used model in NLP for representing text data. It converts each document into a fixed-length vector, with one entry per vocabulary word, by counting how often each word occurs in the document. Grammar and word order are discarded, but multiplicity is kept.

#### Steps Involved in BoW:

Tokenization: Splitting the text into words (tokens).

Vocabulary Creation: Building a list of unique words from the entire corpus.

Vectorization: Counting the frequency of each word in the vocabulary for each document.

#### Example:

Consider the following two sentences:

"I love NLP"

"NLP is great"

#### Tokenization:
    
Sentence 1: ["I", "love", "NLP"]
    
Sentence 2: ["NLP", "is", "great"]

#### Vocabulary Creation:
    
Vocabulary: ["I", "love", "NLP", "is", "great"]
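
#### Vectorization:

To complete the example, the counts for each sentence can be worked out by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenization and unchanged casing; the helper name `bag_of_words` is illustrative and not part of any library.

```python
from collections import Counter

sentences = ["I love NLP", "NLP is great"]

# Tokenization: split each sentence on whitespace (an assumption for this sketch).
tokenized = [sentence.split() for sentence in sentences]

# Vocabulary creation: unique words in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Vectorization: one count vector per sentence.
vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'great']
for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
# I love NLP -> [1, 1, 1, 0, 0]
# NLP is great -> [0, 0, 1, 1, 1]
```

Each vector has one entry per vocabulary word, so both sentences map to vectors of length 5 even though each contains only three words.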

#### Use Cases:

Text Classification: Classifying documents into categories like spam detection, sentiment analysis, etc.

Information Retrieval: Search engines use BoW models to retrieve relevant documents based on keyword searches.

Topic Modeling: Identifying topics within a large corpus of text.
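
As a rough illustration of the information retrieval use case, the sketch below ranks a few documents against a keyword query by the cosine similarity of their word-count vectors. The corpus, the query, and the helper names `bow` and `cosine_similarity` are made up for this example, and real search engines use considerably more elaborate weighting and indexing.

```python
from collections import Counter
import math


def bow(text):
    """Lowercase, split on whitespace, and count tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, made up for illustration.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply today",
]

query = "cat on mat"
query_vector = bow(query)

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    corpus,
    key=lambda doc: cosine_similarity(query_vector, bow(doc)),
    reverse=True,
)
print(ranked[0])  # the cat sat on the mat
```

The second document scores zero because BoW matches exact tokens only: "cats" and "cat" are treated as unrelated words unless the text is normalized first (for example by stemming).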

#### Implementation in Python:

We'll implement a simple Bag of Words model using CountVectorizer from scikit-learn (imported as sklearn).

#### Installation:
Make sure you have scikit-learn installed:

In [2]:
pip install scikit-learn


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = ["I love NLP", "NLP is great"]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the documents into vectors
X = vectorizer.fit_transform(documents)

# Print the vocabulary
print("Vocabulary:", vectorizer.vocabulary_)

# Print the vectors
print("Vectors:\n", X.toarray())




Vocabulary: {'love': 2, 'nlp': 3, 'is': 1, 'great': 0}
Vectors:
 [[0 0 1 1]
 [1 1 0 1]]


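#### Explanation:

CountVectorizer: Converts a collection of text documents to a matrix of token counts. By default it lowercases the text and keeps only tokens of two or more alphanumeric characters, which is why "NLP" appears as 'nlp' and the single-character token "I" is missing from the vocabulary.

fit_transform: Learns the vocabulary from the documents, then transforms them into a document-term matrix with one row per document and one column per vocabulary word.

vocabulary_: A dictionary that maps each word to its column index in the matrix. The indices follow alphabetical order, so the first row [0 0 1 1] counts 'love' and 'nlp', and the second row [1 1 0 1] counts 'great', 'is', and 'nlp'.

toarray(): Converts the sparse matrix to a dense NumPy array, which is easier to print and inspect.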
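
With its default settings, CountVectorizer lowercases the text and ignores single-character tokens, so the output above contains 'nlp' and has no entry for "I". If the goal is to reproduce the hand-built vocabulary from the earlier example, one option is to override the `lowercase` and `token_pattern` parameters. The sketch below assumes that keeping every run of word characters as a token is acceptable for your data; the variable name `feature_names` is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["I love NLP", "NLP is great"]

# Keep the original casing and accept tokens of one or more word characters.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(documents)

# Recover the words in column order from the vocabulary_ mapping.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(feature_names)  # ['I', 'NLP', 'great', 'is', 'love']
print(X.toarray())
# [[1 1 0 0 1]
#  [0 1 1 1 0]]
```

The columns are still sorted rather than kept in order of first appearance, and uppercase letters sort before lowercase ones, so the column order differs from the hand-built vocabulary even though the same five words are present.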