## Introduction to Text Analysis Using Bag of Words (BoW)

Welcome to this educational notebook where we explore the Bag of Words (BoW) model, a foundational technique in natural language processing (NLP). BoW is particularly useful for converting text into numerical representations that can be fed into various machine learning models. This notebook will guide you through applying the BoW model to analyze customer reviews of the Netflix app.

### What is Bag of Words?

The Bag of Words model is a way of extracting features from text so that it can be used as input to machine learning models. It involves two primary steps:
1. **Tokenization**: Splitting text into individual words or tokens.
2. **Vectorization**: Counting how many times each token in the vocabulary occurs in each document and using these counts as features.
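
The two steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the implementation used in the rest of the notebook: the helper name `bag_of_words` is hypothetical, and it assumes plain whitespace tokenization, whereas `CountVectorizer` by default lowercases the text and keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of text documents."""
    # Step 1: tokenization (naive whitespace split, an assumption of this sketch)
    tokenized = [doc.split() for doc in documents]

    # The vocabulary is every distinct token in the corpus, sorted alphabetically
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Step 2: vectorization, one count per vocabulary entry for each document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["love the app love the content", "the app crashes"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['app', 'content', 'crashes', 'love', 'the']
print(vectors)  # [[1, 1, 0, 2, 2], [1, 0, 1, 0, 1]]
```

Every document becomes a vector with one entry per vocabulary word, regardless of how long the document is.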

### Why is BoW Important?

BoW is crucial for many NLP tasks because it sidesteps the complexity of human language by reducing each text to an unordered collection of word counts. This model can be used for document classification, sentiment analysis, and other applications where text needs to be converted into a form that algorithms can process. Let's explore how we can implement and utilize this model effectively.

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

### Sample Dataset Preview

Below is a small preview of the dataset we will use in this notebook:

In [2]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

### Understanding the DataFrame

Our DataFrame `df` contains one column, `Content_cleaned`, which holds the text of customer reviews. The texts are already preprocessed: extraneous symbols have been removed and everything is lowercased to standardize the input for our BoW model. This preprocessing is done in the `text_preprocessing.ipynb` file.

In [3]:
df.head()

| | Content_cleaned |
|---|---|
| 0 | the app is great new features but crashes often |
| 1 | love the app love the content but it crashes |
| 2 | the app crashes too much it is frustrating |
| 3 | the content is great it is easy to use it is g... |


In [4]:
# Initialize CountVectorizer, our BoW tool
vectorizer = CountVectorizer()

# Fit and transform the reviews into a matrix of token counts
document_term_matrix = vectorizer.fit_transform(df['Content_cleaned'])

# Extract the feature names (vocabulary) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix into a readable DataFrame with tokens as columns
bow_df = pd.DataFrame(document_term_matrix.toarray(), columns=feature_names, index=df.index)

# Insert the original reviews as the first column in the DataFrame
bow_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Display the resulting Bag of Words matrix
bow_df

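| | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the app is great new features but crashes often | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | love the app love the content but it crashes | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 2 | the app crashes too much it is frustrating | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | the content is great it is easy to use it is g... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |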


### Explaining the Matrix

The matrix we've created transforms our text data into a format that can be used by machine learning algorithms. Each row corresponds to a review, and each column corresponds to a unique word in our corpus of reviews. The values in the matrix represent the frequency of each word in each review.


### Applying BoW on our entire dataset

Next we apply BoW to our entire dataset. The fitted vocabulary contains 39783 distinct tokens, so every document is represented as a vector of that length. The columns follow the vocabulary in alphabetical order.

In [8]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Fill missing values left by the text preprocessing with empty strings
df.fillna('', inplace=True)

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)


39783
(113292, 39783)


Furthermore, the vectorized documents are stored in a sparse matrix format. Only the non-zero entries are kept in memory, which is essential in such a high-dimensional feature space, where almost every count is zero. Printing a row lists each non-zero entry as a `(row, column)` index pair followed by its count. Below are two examples:

In [11]:
print(df['content_cleaned'][2])
print(bow[2])

thumbs_up thumbs_up
  (0, 34610)	2


In [12]:
print(df['content_cleaned'][5])
print(bow[5])

always promoting anti hindu shows
  (0, 2589)	1
  (0, 27041)	1
  (0, 3013)	1
  (0, 16517)	1
  (0, 31090)	1


## Pros and Cons of the Bag of Words Model

### Pros:
- **Simplicity**: BoW is easy and fast to implement and interpret.
- **Flexibility**: Easily adaptable for various NLP tasks.
- **Scalability**: Counting is cheap, and sparse storage keeps memory use manageable even for large datasets.

### Cons:
- **Context Ignorance**: Word order is discarded, so the model cannot capture context or any meaning that depends on it.
- **High Dimensionality**: Can lead to very high-dimensional feature spaces with sparse matrices, especially with large vocabularies.
- **Common Words Issue**: Frequent words may dominate unless techniques like TF-IDF are used to reweight the counts.
- **Out-Of-Vocabulary Issue**: Words that were not in the vocabulary at fitting time cannot be represented. `CountVectorizer` silently ignores them when transforming new text.