In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Introduction

In this project, we applied the **NMF (Non-negative Matrix Factorization)** algorithm to classify product descriptions into categories. The goal was to use topic modeling to identify patterns in the text and then map the resulting topics to product categories. We trained a Logistic Regression classifier on the topic representation to predict these categories and evaluated both the classification performance and the topic coherence. The steps run from data preprocessing through model training to evaluation.



- **Noor Alawlaqi** - S21107270
- **Maha Almashharawi** - S20106480
- **Mashael Alsalamah** - S20206926

## 1: Importing Necessary Libraries

We started by importing the necessary libraries for data processing, natural language processing (NLP), machine learning, and model evaluation. This includes tools like `nltk` for text preprocessing, `scikit-learn` for machine learning tasks, and `gensim` for coherence evaluation.

In [None]:
%pip install nltk scikit-learn pandas gensim



In [None]:
# Standard library
import re

# Data handling and NLP
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Machine learning
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Topic coherence evaluation
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## 2: Data Loading and Preprocessing
Next, we loaded the training and testing data and performed basic preprocessing. We removed any missing values from the datasets and separated the target category from the feature descriptions.

In [None]:
X_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/test/X_test.csv')
X_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/train/X_train.csv')

In [None]:
X_train = X_train.dropna()
X_test = X_test.dropna()

In [None]:
y_train = X_train['main_category']
X_train = X_train.drop('main_category', axis=1)


y_test = X_test['main_category']
X_test = X_test.drop('main_category', axis=1)

In [None]:
train_documents = X_train['description'].tolist()
test_documents = X_test['description'].tolist()

train_category = y_train.tolist()
test_category = y_test.tolist()

## 3: Text Vectorization
To convert the text data into a format suitable for machine learning models, we used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. This technique transforms the text into numerical vectors that capture the importance of words in the documents.

In [None]:
# Vectorize the training data
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_train = vectorizer.fit_transform(train_documents)
X_test = vectorizer.transform(test_documents)
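
To make the weighting concrete, here is a minimal sketch of the TF-IDF formula that scikit-learn documents as its default (raw term counts, smoothed IDF, L2 row normalization), written in plain Python. The three-document corpus and the helper name `tfidf_weights` are illustrative assumptions and not part of the project code; the sketch also skips the lowercasing, tokenization rules, and `max_df`/`min_df` filtering that `TfidfVectorizer` applies.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    "usb cable for camera".split(),
    "usb charger and cable".split(),
    "shampoo for dry hair".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_weights(docs[0])
for term, weight in sorted(weights.items(), key=lambda pair: -pair[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "camera" receives the largest weight because it appears in only one document, while "usb", "cable", and "for" each appear in two.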


## 4: Label Encoding
We also encoded the category labels (e.g., "beauty", "books", "electronics") into numerical values to make them suitable for machine learning algorithms.

In [None]:
# Encode the category labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_category)
y_test = label_encoder.transform(test_category)
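
As a rough illustration of what the encoding does, the sketch below mimics the behaviour in plain Python: each distinct label receives an integer based on its position in the sorted list of classes, which is also how `LabelEncoder` orders its `classes_`. The sample labels are made up for the example.

```python
# Hypothetical sample of category labels
labels = ["grocery", "beauty", "books", "beauty", "home", "electronics"]

# Sorted unique classes; the position in this list is the numeric code
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_id[label] for label in labels]
decoded = [classes[idx] for idx in encoded]  # analogous to inverse_transform

print(classes)
print(encoded)
```

Because the codes depend only on the sorted class names, the mapping fitted on the training labels can be reused unchanged on the test labels, provided the test set contains no unseen category.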


## 5: Model Training with NMF
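Now, we applied NMF (Non-negative Matrix Factorization) to extract topics from the text data. NMF is an unsupervised method that identifies hidden topics by decomposing the non-negative document-term matrix into a document-topic matrix and a topic-term matrix. We set the number of topics to 5 and used the multiplicative-update solver (`solver='mu'`) with the Frobenius-norm loss.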

In [None]:

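# NMF topic model: 5 topics, multiplicative-update solver with Frobenius loss
nmf = NMF(n_components=5, random_state=42, max_iter=200, solver='mu', beta_loss='frobenius')

# Logistic regression classifier to map NMF topics to categories
classifier = LogisticRegression(random_state=42, max_iter=1000)

# NMF is unsupervised, so it is fitted on the TF-IDF matrix alone (labels are not used)
nmf.fit(X_train)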
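
The multiplicative-update idea behind `solver='mu'` with the Frobenius loss can be sketched in a few lines of NumPy. This is a simplified illustration of the classic Lee and Seung update rules on a random toy matrix, not scikit-learn's actual implementation; the matrix sizes, the `eps` guard against division by zero, and the iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "document-term" matrix: 6 documents, 8 terms
V = rng.random((6, 8))
n_topics = 2

# Random non-negative initialization of the two factors
W = rng.random((6, n_topics))   # document-topic weights
H = rng.random((n_topics, 8))   # topic-term weights
eps = 1e-10                     # avoids division by zero

errors = [np.linalg.norm(V - W @ H)]  # Frobenius error before any update
for _ in range(200):
    # Element-wise multiplicative updates keep W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(V - W @ H))

print(f"initial error: {errors[0]:.4f}, final error: {errors[-1]:.4f}")
```

In scikit-learn terms, `H` plays the role of `nmf.components_` and `W` corresponds to the output of `nmf.transform(...)`.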

## 6: Displaying Top Words for Each Topic
After training the NMF model, we displayed the top 10 words for each topic. These words give us insight into the primary themes or subjects in the dataset.

In [None]:
# Get the top words for each topic
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(", ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))
        print()

display_topics(nmf, vectorizer.get_feature_names_out(), 10)


Topic 1:
usb, case, camera, feature, cable, flag, power, material, plant, battery

Topic 2:
hair, wig, clip, comb, shampoo, brush, conditioner, style, human, dry

Topic 3:
nail, polish, gel, art, apply, coat, file, sticker, glue, manicure

Topic 4:
skin, oil, ingredient, body, organic, flavor, tea, oz, cream, chocolate

Topic 5:
book, author, life, review, story, university, novel, world, read, history
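
The slice `topic.argsort()[:-num_top_words - 1:-1]` used in `display_topics` can be hard to read, so here is a small sketch of what it does. The vocabulary and topic weights are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights
feature_names = np.array(["usb", "hair", "nail", "skin", "book", "cable"])
topic = np.array([0.9, 0.0, 0.1, 0.05, 0.2, 0.7])
num_top_words = 3

# argsort sorts ascending; the negative-step slice walks back from the end
# and stops after num_top_words entries, giving the largest weights first
top_idx = topic.argsort()[:-num_top_words - 1:-1]
top_words = feature_names[top_idx].tolist()

# An equivalent, arguably more readable spelling
alt_idx = np.argsort(topic)[::-1][:num_top_words]

print(top_words)  # ['usb', 'cable', 'book']
```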



## 7: Classifier Training and Prediction
Next, we trained a Logistic Regression classifier on the document-topic representation produced by NMF. The same NMF transformation was then applied to the test set, and the classifier predicted a category for each test description. Finally, we evaluated the predictions against the true test labels.



In [None]:
# Train the classifier on the NMF topic representation of the training data
classifier.fit(nmf.transform(X_train), y_train)

In [None]:
# Predict categories for the test set and convert them back to label names
predicted_categories = classifier.predict(nmf.transform(X_test))
predicted_categories = label_encoder.inverse_transform(predicted_categories)

# Evaluate the model
accuracy = accuracy_score(test_category, predicted_categories)
print(classification_report(test_category, predicted_categories))


              precision    recall  f1-score   support

      beauty       0.77      0.70      0.73     12703
       books       0.99      0.98      0.98      9723
 electronics       0.61      0.72      0.66      9169
     grocery       0.68      0.72      0.70      9022
        home       0.53      0.46      0.49      9044

    accuracy                           0.72     49661
   macro avg       0.71      0.72      0.71     49661
weighted avg       0.72      0.72      0.72     49661
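
The two averaged F1 rows can be reproduced from the per-class rows of the report above. The sketch below recomputes them from the rounded per-class F1-scores and supports, so it only matches the report to two decimals.

```python
# Per-class (F1-score, support) copied from the classification report
report = {
    "beauty": (0.73, 12703),
    "books": (0.98, 9723),
    "electronics": (0.66, 9169),
    "grocery": (0.70, 9022),
    "home": (0.49, 9044),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro F1: {macro_f1:.2f}")        # macro F1: 0.71
print(f"weighted F1: {weighted_f1:.2f}")  # weighted F1: 0.72
```

The weighted average is slightly higher here because beauty and books, the two largest classes, have the highest F1-scores.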



In [None]:
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora, models

## 8: Topic Coherence Evaluation
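To assess the quality of the topics, we calculated a topic coherence score using Gensim's `c_v` measure on the whitespace-tokenized test documents. Coherence measures how strongly the top words of each topic co-occur in the corpus, which serves as a proxy for how interpretable the topics are.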

In [None]:
test_documents_tokens = [doc.split() for doc in test_documents]

In [None]:
# Create a Gensim dictionary and corpus
dictionary = Dictionary(test_documents_tokens)
corpus = [dictionary.doc2bow(doc) for doc in test_documents_tokens]

# Extract top words from the NMF model
def get_top_words(nmf_model, feature_names, num_top_words):
    topics = []
    for topic_idx, topic in enumerate(nmf_model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]
        topics.append(top_words)
    return topics

# Get top words for each topic
num_top_words = 10
top_words = get_top_words(nmf, vectorizer.get_feature_names_out(), num_top_words)

# Calculate the coherence score
coherence_model = CoherenceModel(topics=top_words, texts=test_documents_tokens, dictionary=dictionary, coherence='c_v')

coherence_score = coherence_model.get_coherence()

print(f"Topic Coherence Score: {coherence_score}")

Topic Coherence Score: 0.721909509714809
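
The intuition behind coherence can be shown with a much simpler measure than `c_v`. The sketch below computes an average pairwise NPMI (normalized pointwise mutual information) from document-level co-occurrence. It is a deliberately simplified stand-in, not Gensim's `c_v` algorithm, and the function name `npmi_coherence`, the toy documents, and the handling of edge cases are assumptions made for this example.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average pairwise NPMI of top_words over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # the pair never co-occurs
        elif p12 == 1:
            scores.append(0.0)    # convention in this sketch: no information
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical tokenized documents
documents = [
    ["hair", "wig", "comb"],
    ["hair", "shampoo", "comb"],
    ["usb", "cable", "camera"],
    ["usb", "cable", "power"],
]

coherent = npmi_coherence(["hair", "comb", "wig"], documents)
mixed = npmi_coherence(["hair", "usb", "comb"], documents)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

A topic whose top words appear in the same documents scores close to 1, while a topic that mixes words from unrelated documents scores below 0.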


## 9: Saving the Model
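After training the NMF model, we saved it with `joblib` for future use. Only the NMF model is saved here; reusing it on new descriptions also requires the fitted TF-IDF vectorizer and the classifier.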

In [None]:
import joblib
joblib.dump(nmf, 'nmf_Model.joblib')

['nmf_Model.joblib']
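
One possible way to persist everything needed for prediction in a single file is to bundle the vectorizer, the NMF model, and the classifier in a scikit-learn `Pipeline` and dump that object instead. The sketch below assumes this design, which differs from the separate objects used in this notebook; the toy documents, labels, and file name are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data
docs = [
    "usb cable for camera", "power cable and usb charger",
    "camera battery and usb power", "shampoo for dry hair",
    "hair brush and comb", "conditioner for dry hair",
]
labels = ["electronics", "electronics", "electronics",
          "beauty", "beauty", "beauty"]

# Vectorizer, topic model, and classifier chained into one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nmf", NMF(n_components=2, random_state=42, max_iter=500)),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipeline.fit(docs, labels)

# Save and reload the whole pipeline (a temporary directory keeps the sketch tidy)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_sketch.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly
print(restored.predict(["usb power cable"]))
```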

## 10: Mapping Categories to Zones
We mapped the predicted product categories to different zones based on their types. This step helps organize the products into different zones based on their category, providing further insights into the classification.

In [None]:

# Mapping from each product category to its zone
category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone"
}


# Map the predicted categories to their respective zones
predicted_zones = [category_to_zone[category] for category in predicted_categories]

# Create a DataFrame with the results
results_df = pd.DataFrame({
    'description': test_documents,  # X_test now holds the TF-IDF matrix, so use the raw texts
    'Category': predicted_categories,
    'Zone': predicted_zones
})

# Display the results
results_df.head()


                                          description     Category       Zone
0  cleanly mount accessory usb port dash vehicle ...  electronics   Dry Zone
1   review susan gregg gilmores novel voice simila...        books   Dry Zone
2       crunchy buttery indulgence rich buttery glaze      grocery  Food Zone
3   primera fragancia para jóvenes de burbu la fra...      grocery  Food Zone
4   brand weight g approx materialhightemperature ...  electronics   Dry Zone


In [None]:
print(results_df[['Category', 'Zone']].head())


      Category       Zone
0  electronics   Dry Zone
1        books   Dry Zone
2      grocery  Food Zone
3      grocery  Food Zone
4  electronics   Dry Zone
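
The dictionary lookup `category_to_zone[category]` raises a `KeyError` if a category is missing from the mapping. A slightly more defensive variant is sketched below; the helper name `zone_for`, the fallback label `"Unassigned Zone"`, and the `"toys"` category are hypothetical and only illustrate the idea.

```python
category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone",
}

def zone_for(category, default="Unassigned Zone"):
    """Return the zone for a category, or a fallback for unknown categories."""
    return category_to_zone.get(category, default)

# "toys" is not in the mapping, so it falls back to the default
predicted = ["books", "grocery", "toys"]
zones = [zone_for(category) for category in predicted]
print(zones)  # ['Dry Zone', 'Food Zone', 'Unassigned Zone']
```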


# Conclusion

In this project, we applied **Non-negative Matrix Factorization (NMF)** to extract meaningful topics from product descriptions. Using these topics, we trained a **Logistic Regression** classifier to predict the product categories, and we evaluated the model's performance using standard metrics such as **precision**, **recall**, and **F1-score**.

The results showed the following performance for each category:
- **Beauty**: Precision = 0.77, Recall = 0.70, F1-Score = 0.73
- **Books**: Precision = 0.99, Recall = 0.98, F1-Score = 0.98
- **Electronics**: Precision = 0.61, Recall = 0.72, F1-Score = 0.66
- **Grocery**: Precision = 0.68, Recall = 0.72, F1-Score = 0.70
- **Home**: Precision = 0.53, Recall = 0.46, F1-Score = 0.49


The overall **accuracy** of the model was **72%**, with a **macro average** F1-score of 0.71 and a **weighted average** F1-score of 0.72. The model performed very well on books (F1 = 0.98) but noticeably worse on electronics (F1 = 0.66) and especially home (F1 = 0.49).

We also evaluated the **topic coherence score** (`c_v`), which was **0.72**. This suggests that the top words of each NMF topic tend to co-occur in the descriptions, so the topics are reasonably coherent and useful for product categorization.


In conclusion, the NMF-based approach, combined with Logistic Regression, allowed us to classify product descriptions with reasonable accuracy. While some categories performed better than others, the overall system provides a solid foundation for automating product categorization.