# <center> Bag-of-Words


## Overview:

The goal is to implement the Bag-of-Words model for text data representation. The notebook introduces the concept and demonstrates its implementation using scikit-learn's `CountVectorizer` to transform text into numerical vectors for machine learning applications.

# <a id= 'b0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Introduction](#b1)<br>
[2. scikit-learn CountVectorizer](#b2)<br>

## <a id = 'b1'>
    
<font size = 10 color = 'midnightblue'> <b> Introduction

## Bag of Words (BoW) Model

A **Bag of Words** (BoW) is a text representation method that describes the occurrence of words within a document, ignoring the grammar and order of the words. In BoW, a document is represented as a vector of word counts. This approach is widely used for feature extraction in various Natural Language Processing (NLP) tasks, such as text classification, sentiment analysis, and information retrieval.

### Key Characteristics of BoW:
- **Word Occurrence**: It focuses on the frequency of words in a document.
- **Order Insensitivity**: Word order is discarded, which simplifies the representation but loses any context that depends on it.
- **High Dimensionality**: The number of features (dimensions) corresponds to the vocabulary size, which can be very large for extensive corpora.
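
The order-insensitivity property can be checked with a minimal sketch. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the notebook; tokens are split on whitespace only.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but identical word counts
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a == b)    # True: both sentences map to the same bag of words
print(a["the"])  # 2
```

Because only counts are kept, a model trained on these features cannot tell the two sentences apart.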

### Manual Implementation of Bag of Words

Let's illustrate how to manually create a Bag of Words representation using Python. We will demonstrate this with a small corpus of text documents.


In [2]:
# Sample documents
documents = [
    "I love programming in Python",
    "Python is great for data science",
    "I love data science and programming"
]

#### Step 1: Create a vocabulary

In [3]:
vocabulary = set()
for doc in documents:
    words = doc.lower().split()  # Convert to lowercase and split into words
    vocabulary.update(words)  # Add words to the vocabulary set

# Convert the vocabulary to a sorted list
vocabulary = sorted(vocabulary)  # sorted() accepts a set and returns a list
print("Vocabulary:", vocabulary)

Vocabulary: ['and', 'data', 'for', 'great', 'i', 'in', 'is', 'love', 'programming', 'python', 'science']


#### Step 2: Create the BoW representation

In [4]:
bow_representation = []

for doc in documents:
    # Initialize a count vector for the current document
    count_vector = [0] * len(vocabulary)
    words = doc.lower().split()
    
    for word in words:
        if word in vocabulary:
            index = vocabulary.index(word)
            count_vector[index] += 1  # Increment the count for the word in the vector
            
    bow_representation.append(count_vector)

# Displaying the Bag of Words representation
print("Bag of Words Representation:")
for i, doc_vector in enumerate(bow_representation):
    print(f"Document {i+1}: {doc_vector}")

Bag of Words Representation:
Document 1: [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0]
Document 2: [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
Document 3: [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]
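
The loop above calls `vocabulary.index(word)`, which scans the list on every lookup. For a larger vocabulary, one possible variant builds a word-to-column dictionary once. The sketch below assumes the same three sample documents; the names `word_to_index` and `vectorize` are hypothetical.

```python
documents = [
    "I love programming in Python",
    "Python is great for data science",
    "I love data science and programming",
]

# Build the sorted vocabulary, then map each word to its column index once
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def vectorize(doc):
    """Return the count vector of `doc` over the fixed vocabulary."""
    counts = [0] * len(vocabulary)
    for word in doc.lower().split():
        index = word_to_index.get(word)  # None for out-of-vocabulary words
        if index is not None:
            counts[index] += 1
    return counts


bow_representation = [vectorize(doc) for doc in documents]
for i, doc_vector in enumerate(bow_representation, start=1):
    print(f"Document {i}: {doc_vector}")
```

This produces the same vectors as the loop above, and words outside the vocabulary are skipped rather than raising an error.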


In [5]:
import re
import pandas as pd
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer

<font size = 5 color = seagreen><b> Create a sample dataset

In [6]:
dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding."
]

[top](#b0)

## <a id = 'b2'>
<font size = 10 color = 'midnightblue'> <b> CountVectorizer 

<div class="alert alert-block alert-success">    
<font size = 4> 

<b>`CountVectorizer` from scikit-learn is used to fit the bag-of-words model.</b>

<font size = 5 color = seagreen><b>Define a count vectoriser

In [7]:
# Keep at most the 1000 most frequent terms, lowercase the text, and tokenize by word
bow = CountVectorizer(max_features=1000, lowercase=True, analyzer='word')

<font size = 5 color = seagreen><b> Fit the bag-of-words model

In [8]:
# fit() learns the vocabulary and returns the fitted vectorizer itself
bag_of_words = bow.fit(dataset)

<div class="alert alert-block alert-success">    
<font size = 4> 

After fitting, the vectorizer exposes the learned vocabulary through `get_feature_names_out()`; these feature names become the columns of the transformed matrix.

In [9]:
print(list(bow.get_feature_names_out()))

['and', 'attention', 'be', 'breeze', 'but', 'can', 'challenging', 'change', 'clear', 'climate', 'crucial', 'different', 'escape', 'exercise', 'fantastic', 'for', 'gentle', 'global', 'good', 'great', 'health', 'immediate', 'immerse', 'in', 'incredibly', 'is', 'issue', 'language', 'learning', 'maintaining', 'mental', 'new', 'oneself', 'physical', 'pressing', 'reading', 'reality', 'requires', 'rewarding', 'skies', 'that', 'the', 'to', 'today', 'way', 'weather', 'with', 'worlds']
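
The vocabulary contains no entry for "a", although the word occurs in several sentences. The default `token_pattern` of `CountVectorizer` is documented as `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters and treats punctuation as a separator. The sketch below applies that regular expression directly with `re` to approximate the default tokenization; the helper name `tokenize` is an assumption for illustration and is not part of scikit-learn.

```python
import re

# Regular expression documented as CountVectorizer's default `token_pattern`
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = "Learning a new language can be challenging but incredibly rewarding."
tokens = tokenize(sentence)
print(tokens)
# ['learning', 'new', 'language', 'can', 'be', 'challenging', 'but', 'incredibly', 'rewarding']
```

This also explains a difference from the manual implementation earlier, where `split()` kept the single-character token "i".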


<div class="alert alert-block alert-success">    
<font size = 4> 

`transform` returns a SciPy sparse matrix in which each row represents one sentence of the dataset and each column corresponds to one word in the vocabulary; `.toarray()` converts it to a dense NumPy array.


In [10]:
vector = bow.transform(dataset).toarray()

In [11]:
pd.DataFrame(vector,
             columns= list(bow.get_feature_names_out()),
             index = [f'sent_{i}' for i in range(1,len(dataset)+1)])

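| | and | attention | be | breeze | but | can | challenging | change | clear | climate | ... | rewarding | skies | that | the | to | today | way | weather | with | worlds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent_1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| sent_2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| sent_3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent_4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent_5 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |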


<font size = 5 color = seagreen><b> This vectorised data can be used as features (predictors) for any ML model

[top](#b0)