## Introduction to Bag of Words (BoW)

Welcome to this educational notebook where we explore the Bag of Words (BoW) model, a foundational technique in natural language processing (NLP). BoW is particularly useful for converting text into numerical representations that can be fed into various machine learning models. This notebook will guide you through applying the BoW model to analyze customer reviews of the Netflix app.

### What is Bag of Words?

The Bag of Words model is a way of extracting features from text so that it can be used as input to machine learning models. It involves two primary steps:
1. **Tokenization**: Splitting text into individual words or tokens.
2. **Vectorization**: Counting how many times each word (token) in the dataset occurs in each document and using this count as a feature.
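
The two steps above can be sketched in plain Python, without any library support. This is only a minimal illustration: it assumes simple whitespace tokenization, and `bag_of_words` is a hypothetical helper name, not part of any library. `CountVectorizer`, used later in this notebook, applies its own tokenization rules.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of strings.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    # Step 1: tokenization - split each document into tokens
    tokenized = [doc.split() for doc in documents]

    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Step 2: vectorization - count each vocabulary word in each document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["the app crashes", "love the app love the content"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['app', 'content', 'crashes', 'love', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 2, 2]]
```

Each inner list has one entry per vocabulary word, so every document ends up with a vector of the same length regardless of how long the text is.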

### Why is BoW Important?

BoW is crucial for many NLP tasks because it simplifies the complex task of understanding human language by reducing text to a bag of individual words. This model can be used for document classification, sentiment analysis, and other applications where text needs to be converted into a form that algorithms can process. Let's explore how we can implement and utilize this model effectively.

In [2]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

### Sample Dataset Preview

We believe that the best way to grasp this algorithm is through an easy example. So for now, we'll use a simple dataset. Here's a quick look at the one we'll be using in this notebook:

In [3]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

### Understanding the DataFrame

Our DataFrame `df` contains a single column, `Content_cleaned`, which holds review text in the same cleaned form we produced in the `text_preprocessing.ipynb` notebook.

In [4]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


The code below implements BoW in Python using scikit-learn's `CountVectorizer`.

In [5]:
# Initialize CountVectorizer, our BoW tool
vectorizer = CountVectorizer()

# Fit and transform the reviews into a matrix of token counts
document_term_matrix = vectorizer.fit_transform(df['Content_cleaned'])

# Extract the feature names (vocabulary) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix into a readable DataFrame with tokens as columns
bow_df = pd.DataFrame(document_term_matrix.toarray(), columns=feature_names, index=df.index)

# Insert the original reviews as the first column in the DataFrame
bow_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Display the resulting Bag of Words matrix
bow_df

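|   | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|-----------------|-----|-----|---------|---------|------|----------|-------------|-------|----|----|------|------|-----|-------|-----|----|-----|-----|
| 0 | the app is great new features but crashes often | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | love the app love the content but it crashes | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 2 | the app crashes too much it is frustrating | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | the content is great it is easy to use it is g... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |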


### Explaining the Matrix

We've built a matrix that turns our text data into a format that machine learning algorithms can use. Each row is a review, and each column is a word from our review collection, in alphabetical order. The values in the matrix show how often each word shows up in each review.
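
To make the row and column reading concrete, the sketch below looks up individual cells of the matrix. It rebuilds the same four sample reviews so that it runs on its own; the variable names (`reviews`, `great_col`, and so on) are chosen for illustration. It relies on the fitted vectorizer's `vocabulary_` attribute, which maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same four sample reviews used above
reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(reviews)

# vocabulary_ maps each token to the index of its column
great_col = vectorizer.vocabulary_['great']
is_col = vectorizer.vocabulary_['is']

# Row 3 is the fourth review; read its counts for 'great' and 'is'
great_count = matrix[3, great_col]
is_count = matrix[3, is_col]

print(f"'great' is column {great_col}, count in review 3: {great_count}")
print(f"'is' is column {is_col}, count in review 3: {is_count}")
```

The counts agree with the last row of the table: `great` appears twice and `is` three times in the fourth review.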


### Applying BoW on our entire dataset

Next, we apply BoW to our full Netflix dataset. The fitted vocabulary contains 39783 distinct tokens, so every document is represented as a vector of length 39783. The columns follow the vocabulary in alphabetical order.

In [6]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Fill empty text values that occurred after the text preprocessing
df.fillna('', inplace=True)

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

# Print the size of the vocabulary and the shape of the matrix
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Shape of the sparse matrix: {bow.shape}")


Vocabulary size: 39783
Shape of the sparse matrix: (113292, 39783)


### Sparse Matrix Representation

The BoW transformation gives us a matrix that's stored in a sparse matrix format. This format is really useful for dealing with high-dimensional data that has lots of zero entries. 

**Sparse Matrix Characteristics:**
- **Memory Efficiency**: Only the non-zero entries are stored, which saves a significant amount of memory.
- **Performance**: Many operations on sparse matrices can be faster than on dense matrices, because the zero entries are skipped instead of being stored and processed.
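
To give a rough sense of scale, the sketch below estimates how much memory a dense version of the matrix reported above would need. The shape is taken from the printed output; the 8 bytes per entry is an assumption based on `CountVectorizer`'s default 64-bit integer `dtype`.

```python
# Shape of the document-term matrix reported above
n_documents, vocabulary_size = 113292, 39783

# Assumption: each count is stored as a 64-bit integer (8 bytes)
bytes_per_entry = 8

# A dense array stores every cell, including all the zeros
dense_bytes = n_documents * vocabulary_size * bytes_per_entry
print(f"Dense matrix: {dense_bytes / 1e9:.1f} GB")
```

This comes to roughly 36 GB, which would not fit in the memory of a typical laptop. The sparse format avoids this because each review uses only a small fraction of the vocabulary; on the actual matrix, `bow.nnz` reports how many entries are stored.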

**Example of Sparse Matrix:**

Consider a simplified example where our document-term matrix might look like this:

| Document | Word A | Word B | Word C |
|----------|--------|--------|--------|
| Doc1     | 2      | 0      | 1      |
| Doc2     | 1      | 3      | 0      |
| Doc3     | 0      | 1      | 2      |

The sparse matrix representation of the document-term matrix is shown below:

| Row Index | Column Index | Value |
|-----------|--------------|-------|
| 0         | 0            | 2     |
| 0         | 2            | 1     |
| 1         | 0            | 1     |
| 1         | 1            | 3     |
| 2         | 1            | 1     |
| 2         | 2            | 2     |


Here, `(row_index, column_index) value` represents the non-zero values in the matrix. The sparse matrix format significantly reduces memory usage by storing only these non-zero entries.
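
The small example above can be reproduced with SciPy, which provides the sparse matrix types that scikit-learn returns. This is a sketch for illustration: it uses the coordinate (COO) format because it matches the row/column/value table directly, whereas `CountVectorizer` returns a compressed sparse row (CSR) matrix that holds the same information in a more compact layout.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Dense document-term matrix from the example table (rows: Doc1..Doc3)
dense = np.array([
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 2],
])

# Convert to coordinate format: only non-zero entries are kept
sparse = coo_matrix(dense)

# Each stored entry is a (row index, column index, value) triplet
triplets = sorted(zip(sparse.row.tolist(), sparse.col.tolist(), sparse.data.tolist()))
for row, col, value in triplets:
    print(f"({row}, {col}) {value}")

print(f"Stored entries: {sparse.nnz} of {dense.size} cells")
```

Converting back with `sparse.toarray()` recovers the original dense matrix, so no information is lost.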

The example below shows how one review from our dataset is represented in this format.

In [8]:
print(df['content_cleaned'][44])
print(bow[44])

i love to watch the movie s you have on here and the tv shows i love them growing_heart
  (0, 38860)	1
  (0, 2769)	1
  (0, 34247)	2
  (0, 34829)	1
  (0, 16108)	1
  (0, 31090)	1
  (0, 24131)	1
  (0, 22367)	1
  (0, 20654)	2
  (0, 34287)	1
  (0, 37517)	1
  (0, 16371)	1
  (0, 35537)	1
  (0, 15596)	1


The same row is easier to read once the column indices are mapped back to their words.

In [11]:
# Convert the sparse matrix row to a dense array
dense_array = bow[44].toarray()

# Create a DataFrame to map the indices to the words and their counts
words = vectorizer.get_feature_names_out()
word_counts = dense_array.flatten()
word_count_df = pd.DataFrame({
    'Word': [words[i] for i in range(len(word_counts)) if word_counts[i] > 0],
    'Count': [word_counts[i] for i in range(len(word_counts)) if word_counts[i] > 0]
})

# Display the original text and the word counts
print("Original text:")
print(df['content_cleaned'][44])
print("\nSparse Matrix Representation:")
print(word_count_df)

Original text:
i love to watch the movie s you have on here and the tv shows i love them growing_heart

Sparse Matrix Representation:
             Word  Count
0             and      1
1   growing_heart      1
2            have      1
3            here      1
4            love      2
5           movie      1
6              on      1
7           shows      1
8             the      2
9            them      1
10             to      1
11             tv      1
12          watch      1
13            you      1


## Pros and Cons of the Bag of Words Model

### Pros:
- **Simplicity**: BoW is easy and fast to implement and interpret.
- **Flexibility**: Easily adaptable for various NLP tasks.
- **Scalability**: Works well with large datasets and can be easily scaled.

### Cons:
- **Context Ignorance**: Fails to capture the context and semantics of words as order is not preserved.
- **High Dimensionality**: Can lead to very high-dimensional feature spaces with sparse matrices, especially with large vocabularies.
- **Common Words Issue**: Frequent words may dominate unless techniques like TF-IDF are used to normalize the counts.
- **Out-Of-Vocabulary Issue**: Words that were not in the vocabulary at fitting time cannot be represented. With `CountVectorizer`, such words are silently ignored when new text is transformed.
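
Two of the limitations above can be observed directly with a small experiment. The sketch below uses made-up sentences, not data from the Netflix dataset, and assumes `CountVectorizer` with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['the app crashes often', 'love the content'])

# Context ignorance: reordering the words gives exactly the same vector
a = vectorizer.transform(['the app crashes often']).toarray()
b = vectorizer.transform(['often crashes the app']).toarray()
same_vector = bool((a == b).all())
print(f"Same vector after reordering: {same_vector}")

# Out-of-vocabulary: neither word was seen during fitting,
# so the resulting vector contains only zeros
c = vectorizer.transform(['buffering constantly']).toarray()
print(f"Sum of counts for unseen words: {c.sum()}")
```

Both sentences in the first test produce identical vectors, and the second sentence is reduced to an all-zero vector, so a downstream model receives no information about it.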