## Introduction to Bag of Words (BoW)

The Bag of Words (BoW) model is a simple, widely used technique for representing text in natural language processing (NLP). It maps each document to a fixed-length vector with one entry per vocabulary word, holding either that word's count or a binary occurrence flag. Grammar and word order are discarded; only which words appear, and how often, is kept.

Three terms recur throughout:

1. **Vocabulary**: The set of all unique words present in the corpus.
2. **Feature Vector**: A numerical representation of text based on word counts or frequencies.
3. **Sparsity**: BoW vectors are usually sparse, meaning they contain many zeros, because each document uses only a small fraction of the vocabulary.
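
To make these three terms concrete, here is a minimal sketch of BoW written from scratch, without scikit-learn. The helper names `build_vocabulary` and `vectorize`, the toy corpus, and the whitespace tokenization are assumptions made for illustration; real tokenizers handle punctuation and other details differently.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the sorted list of unique lowercase tokens in the corpus."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for words the document does not contain.
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, chosen only to keep the vectors short.
corpus = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, regardless of how long the document is, and the zeros mark vocabulary words that the document does not use.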

In [1]:
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
# Sample documents
documents = [
    "I love programming in Python",
    "Python programming is fun",
    "Machine learning is fascinating"
]

In [3]:
# Create the CountVectorizer object
vectorizer = CountVectorizer()

In [4]:
# Learn the vocabulary and build the document-term count matrix
X = vectorizer.fit_transform(documents)

In [5]:
# Convert the sparse matrix to a dense array and get the vocabulary (column names)
X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

In [6]:
# Display the BoW representation: one row per document, one column per word
import pandas as pd
df = pd.DataFrame(X_array, columns=feature_names)
print(df)

   fascinating  fun  in  is  learning  love  machine  programming  python
0            0    0   1   0         0     1        0            1       1
1            0    1   0   1         0     0        0            1       1
2            1    0   0   1         1     0        1            0       0
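
Note that the word "I" from the first document has no column in this table, and that "Python" appears as `python`. The sketch below reproduces that behavior with the standard `re` module, assuming the default settings documented for `CountVectorizer`: text is lowercased, and tokens are matched by a pattern that requires at least two word characters. The variable names are illustrative only.

```python
import re

# Default token pattern documented for CountVectorizer:
# a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A looser pattern (an assumption for comparison) that also keeps
# single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "I love programming in Python"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text.lower())
all_tokens = re.findall(SINGLE_CHAR_PATTERN, text.lower())

print(default_tokens)  # ['love', 'programming', 'in', 'python']
print(all_tokens)      # ['i', 'love', 'programming', 'in', 'python']
```

If single-character words matter for a task, passing the looser pattern through the vectorizer's `token_pattern` parameter should keep them.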


In [9]:
import pandas as pd

documents = [
    "Data science is an interdisciplinary field",
    "It uses scientific methods, processes, algorithms",
    "Data science is used for data analysis"
]

# Create the CountVectorizer object
vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
X = vectorizer.fit_transform(documents)
# Convert the sparse matrix to a dense array and get the feature names
X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()
# Display the BoW representation
df = pd.DataFrame(X_array, columns=feature_names)
print(df)

   algorithms  an  analysis  data  field  for  interdisciplinary  is  it  \
0           0   1         0     1      1    0                  1   1   0   
1           1   0         0     0      0    0                  0   0   1   
2           0   0         1     2      0    1                  0   1   0   

   methods  processes  science  scientific  used  uses  
0        0          0        1           0     0     0  
1        1          1        0           1     0     1  
2        0          0        1           0     1     0  
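
The third document contains "data" twice, so its `data` entry is 2; this is the difference between word counts and binary occurrences. As a rough sketch, the snippet below takes the count matrix copied by hand from the output above, converts it to occurrence flags, and measures how sparse it is. The variable names are illustrative, and plain Python lists stand in for the array returned by `toarray()`.

```python
# Count matrix copied from the printed table above
# (3 documents x 15 vocabulary words, columns in alphabetical order).
counts = [
    [0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 2, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
]

# Binary occurrence variant: 1 if the word appears at all, else 0.
binary = [[min(count, 1) for count in row] for row in counts]

# Sparsity: the fraction of entries that are zero.
total_entries = sum(len(row) for row in counts)
zero_entries = sum(count == 0 for row in counts for count in row)
sparsity = zero_entries / total_entries

print(binary[2])
print(f"{zero_entries} of {total_entries} entries are zero ({sparsity:.0%})")
```

Even in this tiny corpus more than half of the entries are zero, and the share of zeros typically grows as the vocabulary gets larger.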
