## Introduction to Bag of Words (BoW)

Welcome to this notebook where we explore the Bag of Words (BoW) model, a foundational technique in natural language processing (NLP). BoW is particularly useful for converting text into numerical representations that can be fed into various machine learning models. This notebook will guide you through applying the BoW model to analyze customer reviews of the Netflix app.

### What is Bag of Words?

The Bag of Words model extracts features from text so they can be used in modeling, for example as input to machine learning algorithms. It involves two primary steps:
1. **Tokenization**: Splitting text into individual words or tokens.
2. **Vectorization**: Counting how many times each word (token) in the dataset occurs in each document and using this count as a feature.
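
The two steps above can be sketched in plain Python before reaching for a library. This is only an illustration: the two toy documents are made up, and the regular expression is an assumption chosen to keep runs of two or more word characters.

```python
import re
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the app is great", "love the app love the content"]

# Step 1: tokenization - lowercase and keep runs of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Step 2: vectorization - build a sorted vocabulary, then count per document
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of counts with one position per vocabulary word, which is the same idea `CountVectorizer` implements at scale.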

### Why is BoW Important?

BoW is useful for many NLP tasks because it sidesteps the complexity of human language by reducing each text to an unordered collection (a "bag") of its words. This model can be used for document classification, sentiment analysis, and other applications where text needs to be converted into a form that algorithms can process. Let's explore how we can implement and utilize this model effectively.

In [21]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

### Sample Dataset Preview

We believe that the best way to grasp this algorithm is through an easy example. So for now, we'll use a simple dataset. Here's a quick look at the one we'll be using in this notebook:

In [22]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

### Understanding the DataFrame

Our DataFrame `df` contains one column, `Content_cleaned`, which holds review text cleaned in the same way as in the `text_preprocessing.ipynb` notebook.

In [23]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


Let's start by applying the Bag-of-Words (BoW) technique on the simple dataset to see how it works.

In [24]:
# Initialize CountVectorizer, our BoW tool
vectorizer = CountVectorizer()

# Fit and transform the reviews into a matrix of token counts
document_term_matrix = vectorizer.fit_transform(df['Content_cleaned'])

# Extract the feature names (vocabulary) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix into a readable DataFrame with tokens as columns
bow_df = pd.DataFrame(document_term_matrix.toarray(), columns=feature_names, index=df.index)

# Insert the original reviews as the first column in the DataFrame
bow_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Display the resulting Bag of Words matrix
bow_df

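|  | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the app is great new features but crashes often | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | love the app love the content but it crashes | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 2 | the app crashes too much it is frustrating | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | the content is great it is easy to use it is g... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |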


### Explanation

The matrix we just created converts our text data into a format suitable for machine learning algorithms. Here's what it represents:

- Rows: Each row corresponds to a review from our dataset.
- Columns: Each column represents a word from our entire vocabulary (i.e., all the unique words found across the reviews), listed in alphabetical order.
- Values: The numbers in the matrix represent how many times each word appears in each review. Most values will be zero since not every word appears in every review.

### Applying BoW on our Netflix dataset

Now, let's apply the Bag-of-Words model to our full dataset and explore its output.

In [25]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Fill missing values left by the text preprocessing with empty strings
df.fillna('', inplace=True)

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

# Print the size of the vocabulary and the shape of the matrix
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Shape of the matrix: {bow.shape}")


Vocabulary size: 39783
Shape of the matrix: (113292, 39783)


Here’s what the output means:

- Vocabulary size (39,783): This represents the number of unique words found across all the reviews.
- Matrix shape (113,292, 39,783): This matrix has 113,292 rows (one for each review) and 39,783 columns (one for each unique word).

This means that each review will be represented as a vector of size 39,783, with mostly zeroes for the words that don’t appear in that particular review.
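
One way to quantify "mostly zeroes" is the density of the matrix, i.e. the share of cells that are non-zero. The sketch below measures it on the four sample reviews, since the full dataset is not reproduced here. Note that such a tiny sample is far denser than a real corpus would be, and the name `bow_sample` is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

bow_sample = CountVectorizer().fit_transform(reviews)

n_rows, n_cols = bow_sample.shape
# .nnz is the number of explicitly stored (non-zero) entries
density = bow_sample.nnz / (n_rows * n_cols)

print(f"Shape: {bow_sample.shape}")
print(f"Non-zero entries: {bow_sample.nnz}")
print(f"Density: {density:.1%}")
```

The same three lines can be run on the full `bow` matrix. With tens of thousands of columns and short reviews, the density there should be a small fraction of the value seen on this sample.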

### Sparse Matrix Representation

The BoW transformation results in a sparse matrix. Sparse matrices are extremely useful when dealing with high-dimensional data like text, where the majority of values are zero. Storing all those zeros explicitly would be highly inefficient, so instead, we use a sparse matrix format.

**Characteristics of a Sparse Matrix**:

- **Memory Efficiency**: Only the non-zero values are stored, significantly reducing the amount of memory required compared to a dense matrix.
- **Performance**: Many operations on sparse matrices are faster due to the reduced number of elements that need to be processed.
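
A rough back-of-the-envelope calculation shows why memory efficiency matters here. The sketch uses the matrix shape printed earlier and assumes 8 bytes per cell, which corresponds to the 64-bit integer dtype that `CountVectorizer` uses by default.

```python
# Shape of the full document-term matrix, as printed earlier
n_reviews, vocab_size = 113_292, 39_783

# Assumption: 8 bytes per cell (64-bit integer counts)
bytes_per_cell = 8

dense_cells = n_reviews * vocab_size
dense_gb = dense_cells * bytes_per_cell / 1e9

print(f"Dense cells: {dense_cells:,}")
print(f"Dense size:  {dense_gb:.1f} GB")
```

Under these assumptions, a dense copy of the matrix would need roughly 36 GB, which is why calling `.toarray()` on the full matrix is best avoided.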

**Example of Sparse Matrix:**

Consider a simplified example where our document-term matrix might look like this:

| Document | Word A | Word B | Word C |
|----------|--------|--------|--------|
| Doc1     | 2      | 0      | 1      |
| Doc2     | 1      | 3      | 0      |
| Doc3     | 0      | 1      | 2      |

The sparse matrix representation of the document-term matrix is shown below:

| Row Index | Column Index | Value |
|-----------|--------------|-------|
| 0         | 0            | 2     |
| 0         | 2            | 1     |
| 1         | 0            | 1     |
| 1         | 1            | 3     |
| 2         | 1            | 1     |
| 2         | 2            | 2     |


Each row of this table is a `(row_index, column_index, value)` triple describing one non-zero entry of the matrix. The sparse matrix format significantly reduces memory usage by storing only these non-zero entries.
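
The table of triples can be reproduced with SciPy. This is a minimal sketch using the three-document example. `CountVectorizer` returns a CSR matrix, so the sketch converts to coordinate (COO) form only to list the triples.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The example document-term matrix (rows: Doc1..Doc3, columns: Word A..C)
dense = np.array([
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 2],
])

sparse = csr_matrix(dense)

# COO format exposes parallel arrays of row indices, column indices, and values
coo = sparse.tocoo()
triples = sorted(
    (int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)
)

for triple in triples:
    print(triple)
```

Only six of the nine cells are stored. On a matrix this small the saving is negligible, but it grows as the proportion of zeros grows.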

**Visualizing a Single Row of the Sparse Matrix**:

Let’s take a look at how the sparse matrix is represented for a single review:

In [29]:
row_index = 44  # Specific row (review) we are inspecting

# Display the original review text
print("Original Text:\n")
print(df['content_cleaned'][row_index])

# Get the positions of the non-zero entries for the specific row;
# nonzero() returns a (row_indices, column_indices) tuple
non_zero_elements = bow[row_index].nonzero()

# Prepare data for the DataFrame
data = {
    'Row Index': [],
    'Column Index (Word)': [],
    'Count': []
}

# Get the feature names (vocabulary) from the vectorizer
words = vectorizer.get_feature_names_out()

# Populate the data
for col_index in non_zero_elements[1]:
    count = bow[row_index, col_index]  # Get the count of the word in the specific document
    word = words[col_index]  # Get the actual word corresponding to the column index
    data['Row Index'].append(row_index)
    data['Column Index (Word)'].append(f"{col_index} ({word})")  # Merge index and word
    data['Count'].append(count)

# Create a DataFrame to display the row index, combined column index and word, and count
sparse_matrix_df = pd.DataFrame(data)

# Display the DataFrame
print("\n\nSparse Matrix Representation as a Table:\n")
print(sparse_matrix_df)

Original Text:

i love to watch the movie s you have on here and the tv shows i love them growing_heart


Sparse Matrix Representation as a Table:

    Row Index    Column Index (Word)  Count
0          44            38860 (you)      1
1          44             2769 (and)      1
2          44            34247 (the)      2
3          44             34829 (to)      1
4          44           16108 (have)      1
5          44          31090 (shows)      1
6          44             24131 (on)      1
7          44          22367 (movie)      1
8          44           20654 (love)      2
9          44           34287 (them)      1
10         44          37517 (watch)      1
11         44           16371 (here)      1
12         44             35537 (tv)      1
13         44  15596 (growing_heart)      1


**Note**: Single-character tokens (such as `i` and `s` in the review above) are ignored by `CountVectorizer`'s default tokenization, which only keeps tokens of two or more characters. If you want single-character tokens in the vocabulary, supply a custom `token_pattern` or a custom tokenizer.

## Pros and Cons of the Bag of Words Model

### Pros:
- **Simplicity**: BoW is easy and fast to implement and interpret.
- **Flexibility**: Easily adaptable for various NLP tasks.
- **Scalability**: Works well with large datasets and can be easily scaled.

### Cons:
- **Context Ignorance**: Fails to capture the context and semantics of words as order is not preserved.
- **High Dimensionality**: Can lead to very high-dimensional feature spaces with sparse matrices, especially with large vocabularies.
- **Common Words Issue**: Frequent words may dominate unless techniques like TF-IDF are used to normalize the counts.
- **Out-Of-Vocabulary Issue**: Words that were not in the vocabulary when the model was fitted are ignored when new text is transformed, so any information they carry is lost.