In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Introduction

In this project, we used the BERTopic model to group product descriptions into topics, map each topic to a product category, and assign each category to a zone. We began by cleaning the data, then fit the BERTopic model, evaluated it against the labeled test set, and saved the model for future use.

We used various libraries such as `BERTopic`, `nltk`, `sklearn`, and `gensim` to carry out this work. Our approach involved data cleaning, feature extraction, and applying topic modeling techniques for clustering and classification.





## 1. Install Required Libraries

We begin by installing `BERTopic`, which also pulls in dependencies such as `hdbscan` and `umap-learn`.
We then import the libraries used for text handling (`nltk`), modeling (`sklearn`, `sentence_transformers`, `umap`), and evaluation.


In [None]:
%pip install BERTopic

Collecting BERTopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from BERTopic)
  Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from BERTopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->BERTopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)

In [None]:
# Standard library
import re
import string

# Third-party libraries
import joblib
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from umap import UMAP

# BERTopic
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.vectorizers import ClassTfidfTransformer

# Download the NLTK resources used for tokenization, lemmatization, and stop words
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('stopwords')

stopwords_english = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Load and Clean the Data
We load the training and test data and check each column for missing values. Only `description` has gaps (10 rows in the training set and 3 in the test set), so we drop those rows before modeling.

In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/nlp pp/dataset/X_train.csv')
X_test = pd.read_csv('/content/drive/MyDrive/nlp pp/dataset/X_test.csv')


In [None]:
X_train.isnull().sum()

Unnamed: 0,0
main_category,0
description,10


In [None]:
X_test.isnull().sum()

Unnamed: 0,0
main_category,0
description,3


In [None]:
X_train = X_train.dropna()

In [None]:
X_test = X_test.dropna()

## 3. Separate Features and Labels
We separate the features (product descriptions) from the labels (main categories) in both the training and test sets. The labels are used later to guide the topic model during fitting and to evaluate its predictions.

In [None]:
y_train = X_train['main_category']
X_train = X_train.drop('main_category', axis=1)


y_test = X_test['main_category']
X_test = X_test.drop('main_category', axis=1)


## 4. Initialize the Embedding and Clustering Models and Fit BERTopic

We initialize a sentence-transformer model (`all-mpnet-base-v2`) to generate embeddings for the product descriptions. We also set up a UMAP model for dimensionality reduction and a `MiniBatchKMeans` model with 15 clusters, which takes the place of BERTopic's default HDBSCAN clustering.

We then initialize the BERTopic model, passing in the embedding model, UMAP model, clustering model, vectorizer, and c-TF-IDF transformer, and set `nr_topics=5` so that the 15 clusters are reduced to five topics. Finally, we fit the model on the product descriptions, passing the encoded categories as `y` so that the labels can guide the model.

In [None]:
embedding_model = SentenceTransformer("all-mpnet-base-v2")
import torch
if torch.cuda.is_available():
    embedding_model = embedding_model.to("cuda")

# UMAP
umap_model = UMAP(n_neighbors=10, n_components=3, min_dist=0.1, metric="cosine")

# Clustering model
kmeans_model = MiniBatchKMeans(n_clusters=15, random_state=42, n_init=10)
vectorizer_model = CountVectorizer(stop_words="english")

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Initialize BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=kmeans_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    nr_topics=5
)
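
The `ClassTfidfTransformer` above weights words per topic rather than per document. The following is a minimal pure-Python sketch of that idea, assuming the weighting `tf * log(1 + A / f)`, where `A` is the average number of words per class and `f` is a word's total frequency, with a square root applied to `tf` when `reduce_frequent_words=True`. The function name `class_tfidf` and the toy tokens are hypothetical; this is not BERTopic's implementation.

```python
import math
from collections import Counter


def class_tfidf(tokens_per_class, reduce_frequent_words=True):
    """Score words per class with a simplified class-based TF-IDF (illustrative only)."""
    counts = {c: Counter(tokens) for c, tokens in tokens_per_class.items()}

    # Total frequency of each word across all classes
    word_freq = Counter()
    for counter in counts.values():
        word_freq.update(counter)

    # Average number of words per class
    avg_words = sum(word_freq.values()) / len(counts)

    scores = {}
    for c, counter in counts.items():
        total = sum(counter.values())
        scores[c] = {}
        for word, n in counter.items():
            tf = n / total  # term frequency, normalised within the class
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampen words that dominate a class
            idf = math.log(1 + avg_words / word_freq[word])
            scores[c][word] = tf * idf
    return scores


# Toy data: "set" appears in both classes, the other words in only one
toy_classes = {
    0: ["hair", "hair", "oil", "set"],
    1: ["book", "book", "author", "set"],
}
toy_scores = class_tfidf(toy_classes)
for c, word_scores in toy_scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(c, ranked)
```

In this toy example the shared word `set` receives the lowest score in both classes, which is the behavior that makes the topic representations distinctive.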


In [None]:
from sklearn.preprocessing import LabelEncoder

# Collect the documents and their category labels
documents = X_train['description'].tolist()
categories = y_train.tolist()

# Encode string labels to numeric values
label_encoder = LabelEncoder()
categories_encoded = label_encoder.fit_transform(categories)

# Fit BERTopic
topic_model.fit(documents, y=categories_encoded)

<bertopic._bertopic.BERTopic at 0x7855c3582c80>
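
To make the label-encoding step concrete, here is a small pure-Python sketch of what an encoder of this kind does: it assigns integer codes to the class names in sorted order. The helper `fit_label_encoding` is hypothetical and only mimics the behavior; the notebook itself uses `LabelEncoder` from `sklearn`.

```python
def fit_label_encoding(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


# Toy labels using the five category names that appear in this notebook
toy_categories = ["home", "beauty", "grocery", "books", "electronics", "beauty"]

encoding = fit_label_encoding(toy_categories)
encoded = [encoding[label] for label in toy_categories]

# The inverse mapping recovers the original strings
decoding = {code: label for label, code in encoding.items()}
decoded = [decoding[code] for code in encoded]

print(encoding)
print(encoded)  # [4, 0, 3, 1, 2, 0]
```

Keeping the inverse mapping around is useful whenever numeric predictions need to be reported as category names again.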

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,50512,0_hair_skin_nail_oil,"[hair, skin, nail, oil, brush, dry, wig, apply...",[aloe charcoal serum set peel pack contains de...
1,1,39075,1_book_author_life_review,"[book, author, life, review, story, university...",[review praise detective helen grace thriller ...
2,2,36619,2_camera_usb_cable_battery,"[camera, usb, cable, battery, video, compatibl...",[description pentax megapixel optio h digital ...
3,3,36345,3_plant_garden_flag_outdoor,"[plant, garden, flag, outdoor, material, grill...",[feature solar power wind chime wind chime out...
4,4,36094,4_flavor_tea_chocolate_taste,"[flavor, tea, chocolate, taste, delicious, cof...",[amazoncom updated look namethe great flavor o...


## 5. Save the BERTopic Model
Once we have trained the model, we save it in three serialization formats: safetensors, PyTorch, and pickle. This allows us to load the model later for prediction or further analysis.

In [None]:
# Save model with safetensors; the pointer should name the embedding model used for training
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
topic_model.save("/content/drive/MyDrive/Colab Notebooks/Bertopic_1", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model_name)

In [None]:
# Save model with pytorch
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
topic_model.save("/content/drive/MyDrive/Colab Notebooks/Bertopic_2", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model_name)


In [None]:
# Save model with pickle
topic_model.save("/content/drive/MyDrive/Colab Notebooks/Bertopic_3", serialization="pickle")




In [None]:
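# Load the saved pickle model
# Note: this path differs from the save path above, so it assumes the model folder was moved here
topic_model = BERTopic.load("/content/drive/MyDrive/nlp pp/Bertopic_3")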


In [None]:
print(topic_model.topic_mapper_.get_mappings())

{0: 2, 1: 0, 2: 4, 3: 3, 4: 1, 5: 4, 6: 2, 7: 0, 8: 0, 9: 3, 10: 2, 11: 3, 12: 3, 13: 0, 14: 1}
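
The dictionary printed above can be read as "original cluster ID to final topic ID", which is our interpretation of how the 15 KMeans clusters collapse into the five topics requested with `nr_topics=5`. A short sketch that inverts the printed mapping makes the grouping easier to see; the variable names are our own.

```python
from collections import defaultdict

# Mapping copied from the output above: original cluster ID -> final topic ID
cluster_to_topic = {
    0: 2, 1: 0, 2: 4, 3: 3, 4: 1, 5: 4, 6: 2, 7: 0,
    8: 0, 9: 3, 10: 2, 11: 3, 12: 3, 13: 0, 14: 1,
}

# Invert the mapping: final topic ID -> list of original cluster IDs
topic_to_clusters = defaultdict(list)
for cluster, topic in sorted(cluster_to_topic.items()):
    topic_to_clusters[topic].append(cluster)

for topic in sorted(topic_to_clusters):
    print(f"topic {topic}: clusters {topic_to_clusters[topic]}")
```

Each final topic absorbs between two and four of the original clusters.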


## 6. Map Topics to Categories
We manually map each topic ID to a product category based on the topic's top terms (for example, topic 0 with `hair`, `skin`, and `nail` maps to `beauty`). This mapping is used for evaluation and for the zone assignment later.

In [None]:
# Manual mapping from BERTopic topic ID to product category,
# chosen by inspecting each topic's top terms
mappings = {
    0: "beauty",
    1: "books",
    2: "electronics",
    3: "home",
    4: "grocery"
}

In [None]:
df = topic_model.get_topic_info()
df["Class"] = df.Topic.map(mappings)
df

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs,Class
0,0,50512,0_hair_skin_nail_oil,"[hair, skin, nail, oil, brush, dry, wig, apply...",[aloe charcoal serum set peel pack contains de...,beauty
1,1,39075,1_book_author_life_review,"[book, author, life, review, story, university...",[review praise detective helen grace thriller ...,books
2,2,36619,2_camera_usb_cable_battery,"[camera, usb, cable, battery, video, compatibl...",[description pentax megapixel optio h digital ...,electronics
3,3,36345,3_plant_garden_flag_outdoor,"[plant, garden, flag, outdoor, material, grill...",[feature solar power wind chime wind chime out...,home
4,4,36094,4_flavor_tea_chocolate_taste,"[flavor, tea, chocolate, taste, delicious, cof...",[amazoncom updated look namethe great flavor o...,grocery
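
The mapping in this notebook was written by hand. As an alternative, it could be derived by majority vote over labeled documents. The sketch below assumes we have one topic ID and one true label per document; `majority_vote_mapping` and the toy data are hypothetical and not part of the original workflow.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(topics, labels):
    """Map each topic ID to the label that occurs most often among its documents."""
    votes = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        votes[topic][label] += 1
    return {topic: counter.most_common(1)[0][0] for topic, counter in votes.items()}


# Toy data: topic assignments and true categories for eight documents
toy_topics = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = [
    "beauty", "beauty", "home",
    "books", "books",
    "electronics", "electronics", "beauty",
]

auto_mapping = majority_vote_mapping(toy_topics, toy_labels)
print(auto_mapping)
```

A vote-based mapping avoids manual inspection, but it should still be checked against the top terms, since a topic with mixed content can receive a misleading label.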


## 7. Evaluate the Model
We evaluate the model on the test set: we predict a topic for each description, convert the topic to a category with the manual mapping, and compare the result with the true category.



In [None]:
from sklearn.metrics import accuracy_score
doc = X_test['description'].tolist()
categ = y_test.tolist()

predicted_topics, _ = topic_model.transform(doc)

predicted_categories = [mappings.get(topic, "unknown") for topic in predicted_topics]

# Evaluate accuracy
accuracy = accuracy_score(categ, predicted_categories)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.97
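
The evaluation logic can be illustrated without the model. The sketch below reuses the manual topic-to-category mapping and computes accuracy by hand on toy predictions; the helper names and toy values are our own, and any topic ID missing from the mapping becomes `"unknown"`, which always counts as an error.

```python
# Manual mapping from this notebook: topic ID -> category
mappings = {0: "beauty", 1: "books", 2: "electronics", 3: "home", 4: "grocery"}


def topics_to_categories(topic_ids, mapping, default="unknown"):
    """Translate topic IDs to category names, falling back to a default."""
    return [mapping.get(topic, default) for topic in topic_ids]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example: six documents, one with an unmapped topic ID (7)
toy_pred_topics = [0, 1, 2, 3, 4, 7]
toy_true = ["beauty", "books", "electronics", "grocery", "grocery", "home"]

toy_pred = topics_to_categories(toy_pred_topics, mappings)
toy_acc = accuracy(toy_true, toy_pred)
print(toy_pred)
print(f"Accuracy: {toy_acc:.2f}")
```

Four of the six toy predictions match, so the toy accuracy is 4/6.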


In [None]:
from sklearn.metrics import classification_report

print(classification_report(categ, predicted_categories))

              precision    recall  f1-score   support

      beauty       0.97      0.96      0.97     12703
       books       0.98      0.99      0.99      9723
 electronics       0.98      0.97      0.97      9169
     grocery       0.97      0.97      0.97      9022
        home       0.94      0.96      0.95      9044

    accuracy                           0.97     49661
   macro avg       0.97      0.97      0.97     49661
weighted avg       0.97      0.97      0.97     49661
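
The summary rows of the report can be reproduced from the per-class rows. The following arithmetic check uses only the rounded values printed above, so it agrees with the report to two decimals rather than exactly.

```python
# Per-class rows copied from the report: (precision, recall, f1, support)
report = {
    "beauty": (0.97, 0.96, 0.97, 12703),
    "books": (0.98, 0.99, 0.99, 9723),
    "electronics": (0.98, 0.97, 0.97, 9169),
    "grocery": (0.97, 0.97, 0.97, 9022),
    "home": (0.94, 0.96, 0.95, 9044),
}

total_support = sum(row[3] for row in report.values())

# Macro average: unweighted mean over classes
macro_precision = sum(row[0] for row in report.values()) / len(report)
macro_recall = sum(row[1] for row in report.values()) / len(report)
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: mean over classes, weighted by support
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro avg:    {macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")
print(f"weighted f1:  {weighted_f1:.2f}")
```

Because the classes are of similar size, the macro and weighted averages coincide at two decimals.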



In [None]:
# Check the top terms in each topic
print(topic_model.get_topic_info())


   Topic  Count                          Name  \
0      0  50512          0_hair_skin_nail_oil   
1      1  39075     1_book_author_life_review   
2      2  36619    2_camera_usb_cable_battery   
3      3  36345   3_plant_garden_flag_outdoor   
4      4  36094  4_flavor_tea_chocolate_taste   

                                      Representation  \
0  [hair, skin, nail, oil, brush, dry, wig, apply...   
1  [book, author, life, review, story, university...   
2  [camera, usb, cable, battery, video, compatibl...   
3  [plant, garden, flag, outdoor, material, grill...   
4  [flavor, tea, chocolate, taste, delicious, cof...   

                                 Representative_Docs  
0  [aloe charcoal serum set peel pack contains de...  
1  [review praise detective helen grace thriller ...  
2  [description pentax megapixel optio h digital ...  
3  [feature solar power wind chime wind chime out...  
4  [amazoncom updated look namethe great flavor o...  


## 8. Calculate Coherence Score
To assess the quality of the topics, we calculate the `c_v` coherence score with `gensim`. This metric measures how strongly the top words of each topic are related to each other, based on how they co-occur in a reference corpus.

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Tokenize the test descriptions by whitespace
tokenized_texts = [desc.split() for desc in X_test['description'].tolist() if desc]

# Create a dictionary from tokenized texts
dictionary = Dictionary(tokenized_texts)

# Collect the top words of each topic (get_topic returns (word, score) pairs)
topics = [topic_model.get_topic(i) for i in range(len(topic_model.get_topics()))]
topics = [[word[0] for word in topic] for topic in topics if topic]

# Compute the c_v coherence score on the test descriptions
coherence_model = CoherenceModel(
    topics=topics,
    texts=tokenized_texts,
    dictionary=dictionary,
    coherence='c_v'
)
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score:.2f}")


Coherence Score: 0.71
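
To give some intuition for what a coherence score rewards, the sketch below computes a simpler, related measure: the average document-level NPMI (normalized pointwise mutual information) over all pairs of a topic's top words. This is a stand-in for illustration, not the `c_v` measure used above, and the toy documents and helper names are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """Document-level NPMI of two words; ranges from -1 to 1."""
    n = len(docs)
    p_a = sum(word_a in doc for doc in docs) / n
    p_b = sum(word_b in doc for doc in docs) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # the words co-occur in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words, docs):
    """Average NPMI over all pairs of top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens
toy_docs = [
    {"hair", "skin", "oil"},
    {"hair", "oil", "brush"},
    {"camera", "usb", "cable"},
    {"camera", "battery", "usb"},
    {"tea", "chocolate", "flavor"},
    {"hair", "skin", "brush"},
]

coherent = topic_coherence(["hair", "skin", "oil"], toy_docs)
mixed = topic_coherence(["hair", "usb", "tea"], toy_docs)
print(f"coherent topic: {coherent:.2f}")
print(f"mixed topic:    {mixed:.2f}")
```

Words that tend to appear in the same documents score well above zero, while a topic that mixes unrelated words scores at the minimum of -1.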


## 9. Map Categories to Zones
We map the predicted categories to specific zones, which is useful for organizing the products based on their topics.

In [None]:
# Define a mapping of categories to zones
category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone"
}

# Now map the predicted categories to zones
predicted_zones = [category_to_zone.get(category, "unknown zone") for category in predicted_categories]

df["Zone"] = df["Class"].map(category_to_zone)
df




Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs,Class,Zone
0,0,50512,0_hair_skin_nail_oil,"[hair, skin, nail, oil, brush, dry, wig, apply...",[aloe charcoal serum set peel pack contains de...,beauty,Cosmetic Zone
1,1,39075,1_book_author_life_review,"[book, author, life, review, story, university...",[review praise detective helen grace thriller ...,books,Dry Zone
2,2,36619,2_camera_usb_cable_battery,"[camera, usb, cable, battery, video, compatibl...",[description pentax megapixel optio h digital ...,electronics,Dry Zone
3,3,36345,3_plant_garden_flag_outdoor,"[plant, garden, flag, outdoor, material, grill...",[feature solar power wind chime wind chime out...,home,Bulk Zone
4,4,36094,4_flavor_tea_chocolate_taste,"[flavor, tea, chocolate, taste, delicious, cof...",[amazoncom updated look namethe great flavor o...,grocery,Food Zone


In [None]:
print(df[['Topic', 'Class', 'Zone']])


   Topic        Class           Zone
0      0       beauty  Cosmetic Zone
1      1        books       Dry Zone
2      2  electronics       Dry Zone
3      3         home      Bulk Zone
4      4      grocery      Food Zone
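
Once each prediction has a zone, the zones can be tallied to see how many products land in each one. A minimal sketch using the category-to-zone mapping from this section and a handful of made-up predictions:

```python
from collections import Counter

# Mapping from this notebook: category -> zone
category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone",
}

# Hypothetical predicted categories, including one unmapped value
toy_predicted = ["beauty", "books", "electronics", "grocery", "home", "books", "unknown"]

# Categories without a zone fall back to "unknown zone"
toy_zones = [category_to_zone.get(category, "unknown zone") for category in toy_predicted]
zone_counts = Counter(toy_zones)

for zone, count in zone_counts.most_common():
    print(f"{zone}: {count}")
```

Because books and electronics share the Dry Zone, that zone collects the predictions from both categories.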


# Conclusion

In this project, we applied the BERTopic model to categorize product descriptions into meaningful topics. We cleaned the data, fit the topic model, and evaluated its performance on a held-out test set.

The five topics generated by BERTopic correspond to distinct product categories: beauty, books, electronics, home, and grocery. Topic sizes range from 36,094 to 50,512 training documents, with beauty being the largest.

The classification performance of the model was strong, with an overall accuracy of **97%**. The classification report confirms that the model performs well across all categories, with precision, recall, and F1-scores of at least 0.94 for each category. Specifically, the per-category scores were:

- **Beauty**: Precision = 0.97, Recall = 0.96, F1-Score = 0.97
- **Books**: Precision = 0.98, Recall = 0.99, F1-Score = 0.99
- **Electronics**: Precision = 0.98, Recall = 0.97, F1-Score = 0.97
- **Grocery**: Precision = 0.97, Recall = 0.97, F1-Score = 0.97
- **Home**: Precision = 0.94, Recall = 0.96, F1-Score = 0.95

The **Coherence Score** (`c_v`) of **0.71** suggests that the top words of each topic co-occur consistently in the test descriptions, which supports the use of BERTopic for this type of analysis.

The project shows that topic modeling can be used for product categorization. With high accuracy and well-separated topics, the resulting model can classify product descriptions and assign them to zones.
