Practical: Implementing Bag of Words (BoW) and TF-IDF using Scikit-Learn

#Objective:
To understand and implement the Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency) models for text data using Python's scikit-learn (sklearn) library.

#Dataset:
We will create a small dataset of ten product reviews from a fictional online store. Each review is a single sentence describing a product (a phone, laptop, tablet, or speaker), and we will apply both the BoW and TF-IDF techniques to this dataset.



Steps to Implement BoW and TF-IDF
1. Import necessary libraries:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd


2. Create the dataset:

In [3]:
documents = [
    "This phone has an excellent camera",
    "I love the screen quality of this laptop",
    "The tablet is light and easy to carry around",
    "Amazing sound quality from the speaker",
    "The camera quality is amazing, best phone camera",
    "Battery life of this phone lasts longer than expected",
    "The laptop is very fast and has a great processor",
    "The speaker has very clear sound and good bass",
    "I would recommend this tablet to anyone looking for portability",
    "This phone is very user friendly and fast"
]

Bag of Words (BoW):

CountVectorizer() is used to convert the text into a "bag" of word counts.
fit_transform() learns the vocabulary of the documents and returns a sparse matrix with one row per document and one column per vocabulary word; each entry is the number of times that word occurs in that document.
We then convert this sparse matrix into a Pandas DataFrame to display the word counts in a readable format.
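The three points above can be checked on a very small input. The sketch below uses two reviews from the dataset; the variable names (`sample_docs`, `camera_col`) are illustrative only. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, so punctuation and one-letter words such as "I" and "a" do not become columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two reviews taken from the dataset
sample_docs = [
    "This phone has an excellent camera",
    "The camera quality is amazing, best phone camera",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sample_docs)

# vocabulary_ maps each learned token to its column index
for token, column in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, token)

# The result is a sparse matrix of shape (n_documents, n_unique_tokens)
print(counts.shape)

# Look up how often "camera" occurs in the second review
camera_col = vectorizer.vocabulary_["camera"]
print(counts[1, camera_col])
```

For these two reviews the matrix has 2 rows and 11 columns, and the count for "camera" in the second review is 2.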

3. Implement Bag of Words (BoW) Model:

In [6]:
# Initialize CountVectorizer (BoW)
bow_vectorizer = CountVectorizer()

# Fit and transform the documents to get the BoW representation
# This step creates a matrix of word counts for each document
bow_matrix = bow_vectorizer.fit_transform(documents)

# Convert the BoW matrix to a DataFrame for better visualization
# The DataFrame will display the word counts for each word in the vocabulary
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=bow_vectorizer.get_feature_names_out())

# Display the BoW representation
print("BoW Representation:")
print(bow_df)

BoW Representation:
   amazing  an  and  anyone  around  bass  battery  best  camera  carry  ...  \
0        0   1    0       0       0     0        0     0       1      0  ...   
1        0   0    0       0       0     0        0     0       0      0  ...   
2        0   0    1       0       1     0        0     0       0      1  ...   
3        1   0    0       0       0     0        0     0       0      0  ...   
4        1   0    0       0       0     0        0     1       2      0  ...   
5        0   0    0       0       0     0        1     0       0      0  ...   
6        0   0    1       0       0     0        0     0       0      0  ...   
7        0   0    1       0       0     1        0     0       0      0  ...   
8        0   0    0       1       0     0        0     0       0      0  ...   
9        0   0    1       0       0     0        0     0       0      0  ...   

   sound  speaker  tablet  than  the  this  to  user  very  would  
0      0        0       0     0  ...
[remaining output truncated]

This will show a matrix where rows correspond to documents, and columns correspond to unique words. Each cell contains the count of occurrences of that word in the document.

TF-IDF (Term Frequency-Inverse Document Frequency):

TfidfVectorizer() is used to convert the text into a matrix of TF-IDF features.
TF-IDF scales the frequency of each word in a document by how common or rare that word is across the entire corpus, so a word that appears in most documents contributes less than a word that appears in only a few.
Again, the resulting sparse matrix is converted into a DataFrame for visualization.
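To make the weighting concrete, the scores can be reproduced by hand. With its default settings, scikit-learn's TfidfVectorizer uses the raw count as the term frequency, a smoothed inverse document frequency idf = ln((1 + n) / (1 + df)) + 1 (n is the number of documents, df the number of documents containing the word), and then scales each row to unit L2 length. The sketch below recomputes this with NumPy on three reviews and compares it with the library result; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "This phone has an excellent camera",
    "The camera quality is amazing, best phone camera",
    "Amazing sound quality from the speaker",
]

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library implementation
sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

The row normalization is why every TF-IDF value lies between 0 and 1, regardless of document length.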

4. Implement TF-IDF Model:

In [7]:
# Initialize TfidfVectorizer
# This vectorizer will compute the TF-IDF scores for each word in each document
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents to get the TF-IDF representation
# This step creates a matrix of TF-IDF scores for each document
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert the TF-IDF matrix to a DataFrame
# The DataFrame will display the TF-IDF scores for each word in the vocabulary
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF representation
print("\nTF-IDF Representation:")
print(tfidf_df)


TF-IDF Representation:
    amazing        an       and    anyone    around      bass   battery  \
0  0.000000  0.495948  0.000000  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.257228  0.000000  0.389015  0.000000  0.000000   
3  0.424553  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4  0.337907  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
5  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.364844   
6  0.000000  0.000000  0.276614  0.000000  0.000000  0.000000  0.000000   
7  0.000000  0.000000  0.263922  0.000000  0.000000  0.399139  0.000000   
8  0.000000  0.000000  0.000000  0.358105  0.000000  0.000000  0.000000   
9  0.000000  0.000000  0.297498  0.000000  0.000000  0.000000  0.000000   

       best    camera     carry  ...     sound   speaker    tablet      than  \
0  0.000000  0.421601  0.000000  ...  0.000000  0.000000  0.000000  0.000000
[remaining output truncated]

This will show a matrix where rows correspond to documents and columns correspond to unique words. Each cell contains the TF-IDF value for the word in that document.

5. Explanation of Output:

BoW (Bag of Words): The BoW representation simply counts the number of times a word appears in a document. It doesn't consider the word order, just the frequency of each word in each document.
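The loss of word order is easy to demonstrate. In the sketch below, two made-up sentences with opposite meanings use the same words, so they receive identical BoW vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative sentences: same words, different order and meaning
pair = [
    "the phone beats the laptop",
    "the laptop beats the phone",
]

vectors = CountVectorizer().fit_transform(pair).toarray()
print(vectors)

# Both rows are identical because only the counts are kept
same = np.array_equal(vectors[0], vectors[1])
print(same)
```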

TF-IDF (Term Frequency-Inverse Document Frequency): The TF-IDF representation weights each word's frequency in a document (Term Frequency) by how rare the word is across all documents (Inverse Document Frequency). Words that appear in many reviews, such as "the" or "this", are down-weighted, while words that appear in only one or two reviews receive higher scores.

BoW Output (selected columns, first four documents):
   amazing  and  bass  battery  camera  clear  easy  expected  ...
0        0    0     0        0       1      0     0         0  ...
1        0    0     0        0       0      0     0         0  ...
2        0    1     0        0       0      0     1         0  ...
3        1    0     0        0       0      0     0         0  ...

TF-IDF Output (same columns, values rounded to two decimals):
   amazing   and  bass  battery  camera  clear  easy  expected  ...
0     0.00  0.00  0.00     0.00    0.42   0.00  0.00      0.00  ...
1     0.00  0.00  0.00     0.00    0.00   0.00  0.00      0.00  ...
2     0.00  0.26  0.00     0.00    0.00   0.00  0.39      0.00  ...
3     0.42  0.00  0.00     0.00    0.00   0.00  0.00      0.00  ...


6. Conclusion:

Bag of Words (BoW) is simple and useful for basic text classification tasks, where word frequency is the primary focus.
TF-IDF is more sophisticated and useful when we want to consider the importance of words across the entire corpus, making it suitable for more nuanced tasks like document similarity, search engines, and information retrieval.
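As a small illustration of the document-similarity use case, the sketch below compares the reviews with cosine similarity on their TF-IDF vectors and finds the review closest to the first one. It uses scikit-learn's `cosine_similarity`; the variable names (`query`, `best`) are illustrative, and this is one simple way to rank documents, not a complete retrieval system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "This phone has an excellent camera",
    "I love the screen quality of this laptop",
    "The tablet is light and easy to carry around",
    "Amazing sound quality from the speaker",
    "The camera quality is amazing, best phone camera",
    "Battery life of this phone lasts longer than expected",
    "The laptop is very fast and has a great processor",
    "The speaker has very clear sound and good bass",
    "I would recommend this tablet to anyone looking for portability",
    "This phone is very user friendly and fast",
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Pairwise similarity between all reviews (10 x 10 matrix)
similarity = cosine_similarity(tfidf_matrix)

# Find the review most similar to the first one, excluding itself
query = 0
scores = similarity[query].copy()
scores[query] = -1.0
best = int(scores.argmax())

print("Query:       ", documents[query])
print("Most similar:", documents[best])
print("Score:       ", round(float(scores[best]), 3))
```

On this dataset the closest match to the first review is the other review that mentions both "phone" and "camera".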