## Introduction ##

Topic modeling is a natural language processing technique used to discover hidden structures or themes in a collection of text data. This project focuses on implementing Latent Dirichlet Allocation (LDA), one of the most popular methods for topic modeling.

LDA works by grouping words that frequently appear together into topics, providing an overview of the key themes present in the dataset. It assigns each document a probability distribution over these topics, enabling both the identification of dominant topics in a document and the exploration of thematic patterns across the dataset.
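
The mixture view described above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the project code: $K$ is the number of topics, $\theta_{d,k}$ is the probability of topic $k$ in document $d$, and $\phi_{k,w}$ is the probability of word $w$ under topic $k$.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
          \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1 .
```

The vector $(\theta_{d,1}, \dots, \theta_{d,K})$ is the per-document topic distribution that is later used as the feature vector for classification.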

In this implementation, we preprocess text data to create a structured input for LDA, train the model to identify topics, and use the results for analysis. The project also evaluates topic coherence to check that the topics are meaningful, and it uses the per-document topic distributions as features for predicting the category of each description with a logistic regression classifier.

- **Noor Alawlaqi** - S21107270
- **Maha Almashharawi** - S20106480
- **Mashael Alsalamah** - S20206926

## Data Loading and Exploration ##
1. We loaded the libraries required for data manipulation, topic modeling, and classification.

2. We loaded the training and testing datasets.

3. We checked for missing values and dropped the rows that contained them.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
from gensim import corpora, models
from gensim.models import CoherenceModel
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression


In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/train/X_train.csv')
X_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/test/X_test.csv')

In [None]:

X_train.isnull().sum()

Unnamed: 0        0
main_category     0
description      10


In [None]:
X_test.isnull().sum()

Unnamed: 0       0
main_category    0
description      3


In [None]:
X_train = X_train.dropna()
X_test = X_test.dropna()

## Data Preprocessing and Preparing Data for LDA ##

1. We reset the indices and separated the text data from the labels.

2. We tokenized the text data and encoded the labels.

3. We created a dictionary and corpus from the tokenized training data to use as input for the LDA model.

In [None]:
X_train = X_train.reset_index(drop=True)

In [None]:
X_test = X_test.reset_index(drop=True)

In [None]:
# Separate the labels from the text column
y_train = X_train['main_category'].tolist()
X_train = X_train['description'].tolist()

y_test = X_test['main_category'].tolist()
X_test = X_test['description'].tolist()


In [None]:
# Encode the labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert text data to tokenized lists (whitespace split)
X_train_tokens = [str(doc).split() for doc in X_train]
X_test_tokens = [str(doc).split() for doc in X_test]

# Create dictionary and corpus for training data.
# Every document is kept, even an empty one, so the corpus stays aligned with y_train_encoded.
id2word = corpora.Dictionary(X_train_tokens)
corpus = [id2word.doc2bow(text) for text in X_train_tokens]
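
To make the dictionary and corpus step concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. The helpers `build_vocab` and `toy_doc2bow` are hypothetical names written for this illustration; they only mimic the general shape of what `corpora.Dictionary` and `doc2bow` produce (a list of `(token_id, count)` pairs per document), and the ids assigned here may differ from the ones gensim would assign.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_doc2bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Tiny made-up corpus, for illustration only
docs = [["usb", "cable", "usb"], ["book", "author"]]
vocab = build_vocab(docs)
bows = [toy_doc2bow(doc, vocab) for doc in docs]
print(bows)  # [[(0, 2), (1, 1)], [(2, 1), (3, 1)]]

# A word that was never seen during training ("tablet") is simply dropped
unseen = toy_doc2bow(["usb", "tablet"], vocab)
print(unseen)  # [(0, 1)]
```

The second call shows why the test documents are encoded with the training dictionary: only words known from training can contribute to the topic inference.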


## LDA Model Training ##

1. We trained the LDA model to identify topics in the data.

2. We extracted and displayed the generated topics.

In [None]:
num_topics = 5
lda_model = models.LdaModel(corpus=corpus,
                             id2word=id2word,
                             num_topics=num_topics,
                             random_state=100,
                             update_every=1,
                             passes=20,
                             alpha='auto',
                             per_word_topics=True)

# Print topics
topics = lda_model.print_topics(num_words=10)
print("LDA Topics:")
for topic in topics:
    print(topic)


LDA Topics:
(0, '0.014*"hair" + 0.007*"skin" + 0.006*"nail" + 0.005*"material" + 0.005*"easy" + 0.005*"plant" + 0.004*"garden" + 0.004*"please" + 0.004*"brush" + 0.003*"feature"')
(1, '0.009*"flavor" + 0.008*"oil" + 0.007*"ingredient" + 0.006*"tea" + 0.005*"organic" + 0.005*"food" + 0.005*"taste" + 0.005*"chocolate" + 0.005*"oz" + 0.005*"delicious"')
(2, '0.005*"could" + 0.005*"said" + 0.004*"back" + 0.004*"know" + 0.004*"de" + 0.004*"get" + 0.004*"right" + 0.004*"even" + 0.004*"day" + 0.003*"way"')
(3, '0.008*"power" + 0.007*"usb" + 0.007*"camera" + 0.006*"battery" + 0.006*"cable" + 0.005*"compatible" + 0.005*"case" + 0.005*"device" + 0.005*"video" + 0.005*"feature"')
(4, '0.015*"book" + 0.013*"author" + 0.006*"life" + 0.006*"review" + 0.004*"university" + 0.004*"story" + 0.004*"world" + 0.004*"also" + 0.004*"read" + 0.003*"many"')
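
The printed topics are plain strings, which is awkward for further analysis. As a small sketch, the weighted terms can be recovered with a regular expression. `parse_topic_string` is a hypothetical helper written for this illustration, and the input below is a shortened copy of topic 4 from the output above.

```python
import re

# Matches one weighted term such as 0.015*"book"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic_4 = '0.015*"book" + 0.013*"author" + 0.006*"life" + 0.006*"review"'
terms = parse_topic_string(topic_4)
print(terms)
# [('book', 0.015), ('author', 0.013), ('life', 0.006), ('review', 0.006)]
```

When the trained model is still in memory, gensim's `show_topic` method returns `(word, probability)` pairs directly, so parsing is mainly useful for saved logs or printed output.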


## Topic Feature Extraction ##

1. We transformed the text data into topic distributions for further analysis.

2. We converted the topic distributions into feature vectors.


In [None]:
# Transform test data into topic distributions
test_corpus = [id2word.doc2bow(text) for text in X_test_tokens]
X_test_topics = [lda_model.get_document_topics(doc, minimum_probability=0) for doc in test_corpus]

def topic_distribution_to_features(doc_topics, num_topics):
    """Convert topic distributions into a fixed-length feature vector."""
    features = [0] * num_topics
    for topic_id, prob in doc_topics:
        features[topic_id] = prob
    return features

X_test_features = [topic_distribution_to_features(doc, num_topics) for doc in X_test_topics]
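
Each feature vector is a probability distribution over the topics, so its entries should sum to roughly 1 and its largest entry identifies the dominant topic. The sketch below illustrates both checks. `dominant_topic` is a hypothetical helper, and the example vector is made up rather than taken from the model output.

```python
def dominant_topic(features):
    """Return (topic_id, probability) for the largest entry of a topic feature vector."""
    topic_id = max(range(len(features)), key=lambda k: features[k])
    return topic_id, features[topic_id]

# Hypothetical 5-topic feature vector, for illustration only
example_features = [0.02, 0.05, 0.03, 0.85, 0.05]

topic_id, prob = dominant_topic(example_features)
print(topic_id, prob)  # 3 0.85

# Sanity check: the entries form a probability distribution
is_normalized = abs(sum(example_features) - 1.0) < 1e-6
print(is_normalized)  # True
```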

## Training the Classifier ##

1. We used the topic feature vectors to train a logistic regression model for classification.

In [None]:
# Build topic feature vectors for the training documents
X_train_topics = [lda_model.get_document_topics(doc, minimum_probability=0) for doc in corpus]
X_train_features = [topic_distribution_to_features(doc, num_topics) for doc in X_train_topics]


In [None]:
# Train the logistic regression classifier on the topic features
classifier = LogisticRegression(max_iter=1000, random_state=42)
classifier.fit(X_train_features, y_train_encoded)
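
As a rough illustration of what the classifier does with a topic feature vector, the sketch below computes class probabilities with a softmax over linear scores. It assumes a multinomial (softmax) formulation; whether the fitted scikit-learn model uses exactly this form depends on the library version and solver settings. The weights, biases, and features are made-up numbers, not values from the trained model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(features, weights, biases):
    """One linear score per class, followed by a softmax."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical setup: 3 topics, 2 classes
features = [0.1, 0.7, 0.2]            # topic distribution of one document
weights = [[2.0, -1.0, 0.0],          # class 0 favors topic 0
           [-1.0, 3.0, 0.0]]          # class 1 favors topic 1
biases = [0.0, 0.0]

probs = class_probabilities(features, weights, biases)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 1
```

Because the document is dominated by topic 1, the class whose weights favor that topic receives the higher probability.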

## Model Evaluation ##

1. We predicted categories on the test dataset.

2. We calculated the accuracy and the coherence score, and generated a classification report to evaluate the model's performance.

In [None]:
# Make predictions and calculate accuracy
y_pred = classifier.predict(X_test_features)
predicted_categories = label_encoder.inverse_transform(y_pred)
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f"\nAccuracy: {accuracy}")
id2word_test = corpora.Dictionary(X_test_tokens)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=X_test_tokens, dictionary=id2word_test, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_lda}")


Accuracy: 0.7485954773363404
Coherence Score: 0.4027059144824786
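
To give some intuition for what a coherence score measures, here is a simplified, UMass-style co-occurrence score in pure Python. This is not the `c_v` measure used above and not gensim's implementation; `umass_style_coherence` is a hypothetical helper, and the documents are made up. The idea it illustrates is that a topic is more coherent when its top words tend to appear in the same documents.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, tokenized_docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs with j < i,
    where D counts the documents containing all the given words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # guarantees j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_count(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_count(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

# Made-up reference documents, for illustration only
toy_docs = [
    ["book", "author", "story"],
    ["book", "author", "life"],
    ["usb", "cable", "battery"],
    ["usb", "camera", "battery"],
]

coherent_score = umass_style_coherence(["book", "author"], toy_docs)
mixed_score = umass_style_coherence(["book", "usb"], toy_docs)
print(coherent_score > mixed_score)  # True
```

Words that co-occur ("book", "author") score higher than words that never share a document ("book", "usb"), which mirrors how a higher coherence value suggests a more interpretable topic.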


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predicted_categories))


              precision    recall  f1-score   support

      beauty       0.59      0.77      0.67     12703
       books       0.96      0.99      0.98      9723
 electronics       0.85      0.87      0.86      9169
     grocery       0.81      0.92      0.86      9022
        home       0.45      0.17      0.25      9044

    accuracy                           0.75     49661
   macro avg       0.73      0.74      0.72     49661
weighted avg       0.72      0.75      0.72     49661
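
The two summary rows of the report differ only in how they average the per-class scores. As a quick check, the sketch below recomputes the macro and weighted precision from the rounded per-class values in the table above. The dictionary layout and variable names are assumptions made for this illustration.

```python
# Per-class (precision, support) copied from the classification report above
report = {
    "beauty": (0.59, 12703),
    "books": (0.96, 9723),
    "electronics": (0.85, 9169),
    "grocery": (0.81, 9022),
    "home": (0.45, 9044),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_precision = sum(p for p, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_precision = sum(p * s for p, s in report.values()) / total_support

print(total_support)                                          # 49661
print(round(macro_precision, 2), round(weighted_precision, 2))  # 0.73 0.72
```

The weighted average is slightly lower here because beauty, the largest class, has below-average precision.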



## Mapping Predictions to Zones ##

1. We mapped the predicted categories to predefined zones as follows:

- Beauty -> Cosmetic Zone
- Books -> Dry Zone
- Electronics -> Dry Zone
- Home -> Bulk Zone
- Grocery -> Food Zone

2. We created a results table with descriptions, categories, and zones for better interpretation.

In [None]:
# Category to Zone mapping

category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone"
}


predicted_zones = [category_to_zone[category] for category in predicted_categories]


# Collect descriptions, predicted categories, and zones in one table
results_df_lda = pd.DataFrame({
    'description': X_test,
    'Category': predicted_categories,
    'Zone': predicted_zones
})

results_df_lda[['description', 'Category', 'Zone']].head()


                                         description     Category           Zone
0  cleanly mount accessory usb port dash vehicle ...  electronics       Dry Zone
1  review susan gregg gilmores novel voice simila...        books       Dry Zone
2      crunchy buttery indulgence rich buttery glaze      grocery      Food Zone
3  primera fragancia para jóvenes de burbu la fra...        books       Dry Zone
4  brand weight g approx materialhightemperature ...       beauty  Cosmetic Zone
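
The direct dictionary lookup used for the zones raises a `KeyError` if a category is not in the mapping. A slightly more defensive variant, sketched below, normalizes the label and falls back to a default. The helper `to_zone` and the fallback name "Unassigned Zone" are hypothetical and not part of the original project.

```python
category_to_zone = {
    "beauty": "Cosmetic Zone",
    "books": "Dry Zone",
    "electronics": "Dry Zone",
    "home": "Bulk Zone",
    "grocery": "Food Zone",
}

def to_zone(category, default="Unassigned Zone"):
    """Look up the zone for a category, tolerating case and surrounding spaces."""
    return category_to_zone.get(category.strip().lower(), default)

example_categories = ["electronics", "Beauty ", "toys"]
example_zones = [to_zone(c) for c in example_categories]
print(example_zones)  # ['Dry Zone', 'Cosmetic Zone', 'Unassigned Zone']
```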


In [None]:
print(results_df_lda[['Category', 'Zone']].head())  # Show first few rows


      Category           Zone
0  electronics       Dry Zone
1        books       Dry Zone
2      grocery      Food Zone
3        books       Dry Zone
4       beauty  Cosmetic Zone


In [None]:
lda_model.save("/content/drive/MyDrive/LDAmodel/lda_model.gensim")

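## Conclusion ##
Through this project, we explored the use of LDA for topic modeling and its integration with machine learning. We learned how to preprocess text data, train an LDA model, and extract interpretable topics. Using the per-document topic distributions as features, we trained a logistic regression classifier that reached about 75% accuracy on the test set, and we mapped its predictions to predefined zones. Performance was uneven across categories: books were classified almost perfectly (F1 0.98), while home items were frequently missed (recall 0.17).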

This process taught us the importance of text preprocessing, the utility of topic modeling in understanding data patterns, and how to evaluate model performance using metrics like accuracy and coherence scores. We also gained experience in combining unsupervised and supervised learning techniques for practical applications.