
###Importing necessary libraries

In [27]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report


###Load 20 Newsgroups Dataset

In [28]:
categories = ['rec.autos', 'comp.graphics', 'sci.med', 'talk.politics.guns']

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes')
)

df = pd.DataFrame({
    'text': newsgroups.data,
    'label': newsgroups.target
})

print(df.head())


                                                text  label
0  Looking for a graphics/CAD/or-whatever package...      0
1  FOR IMMEDIATE RELEASE\n\nEditorial Contact:\nS...      0
2  \n\nUtah raster toolkit using getx11. Convert ...      0
3                      ^^^^^^^\nHuh!  I though Be...      1
4  I'm plannig to trade my Sentra SE-R in with a ...      1
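
The `label` column holds integer indices, not category names. In the output above, the graphics posts carry label 0 and the car post carries label 1, which suggests the indices follow alphabetical order of the category names rather than the order of the `categories` list. A minimal sketch of that mapping, assuming alphabetical ordering (in the notebook itself, `newsgroups.target_names` is the authoritative source; `label_to_name` is a name introduced here for illustration):

```python
# Same list as in the cell above.
categories = ['rec.autos', 'comp.graphics', 'sci.med', 'talk.politics.guns']

# Assumption: label i corresponds to the i-th category name in alphabetical
# order, which is what newsgroups.target_names is expected to contain.
label_to_name = dict(enumerate(sorted(categories)))

for label, name in label_to_name.items():
    print(label, name)
```

Under this assumption, label 0 is 'comp.graphics' and label 1 is 'rec.autos', which matches the rows printed by `df.head()`.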


###Convert text to a bag-of-words representation

In [29]:
# Bag-of-words counts: drop English stop words, keep the 5,000 most frequent terms
vectorizer = CountVectorizer(
    stop_words='english',
    max_features=5000
)

vec_matrix = vectorizer.fit_transform(df['text'])
feature_names = vectorizer.get_feature_names_out()


###Perform LDA modeling

In [30]:
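# Four topics, matching the number of selected categories
num_topics = 4

lda_model = LatentDirichletAllocation(
    n_components=num_topics,
    random_state=42
)

# Document-topic matrix: one row per document, one column per topic
topic_matrix = lda_model.fit_transform(vec_matrix)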


###Display topics and their top words

In [31]:
print("Topics and their top words:")
for topic_idx, topic in enumerate(lda_model.components_):
    # argsort() is ascending, so [:-6:-1] walks back from the end: the five largest weights
    top_words = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")


Topics and their top words:
Topic 1: gun, people, don, guns, think
Topic 2: car, like, just, don, know
Topic 3: edu, com, graphics, data, information
Topic 4: image, jpeg, file, images, graphics
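
The top words are selected with the slice `argsort()[:-6:-1]`, which can be hard to read at a glance. A minimal sketch of how it behaves, using hypothetical weights rather than values from the fitted model:

```python
import numpy as np

# Hypothetical word weights for one topic (not taken from lda_model).
weights = np.array([0.2, 3.1, 0.7, 5.4, 1.9, 0.1, 2.6])

ascending = weights.argsort()   # indices ordered from smallest to largest weight
top5 = ascending[:-6:-1]        # start at the last index, step backwards five times

print(top5.tolist())            # → [3, 1, 6, 4, 2]
```

The same result can be written as `weights.argsort()[::-1][:5]`, which some readers find clearer.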


###Create topic feature DataFrame

In [32]:
topic_features = pd.DataFrame(
    topic_matrix,
    columns=[f"Topic_{i+1}" for i in range(num_topics)]
)

X = topic_features
y = df['label']


###Split data into training and test sets

In [33]:
# Stratify so each category keeps the same share in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


###Train a Multinomial Naive Bayes classifier

In [34]:
# Topic proportions are non-negative, so MultinomialNB accepts them as features
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
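
As a hedged alternative to the cell above, the same kind of steps can be chained in a scikit-learn `Pipeline` with a different classifier. This is only a sketch: the toy documents, their labels, `n_components=3`, and the choice of `LogisticRegression` are assumptions made for illustration and do not come from this notebook's results.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = cars, 1 = graphics (illustration only).
toy_docs = [
    "car engine oil brake tire",
    "engine tire car dealer price",
    "brake oil car engine dealer",
    "image jpeg file pixel color",
    "pixel color image graphics file",
    "graphics jpeg image pixel file",
]
toy_labels = [0, 0, 0, 1, 1, 1]

pipeline = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=3, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(toy_docs, toy_labels)

# Output of the first two steps: topic proportions, one row per document.
toy_topics = pipeline[:-1].transform(toy_docs)
toy_pred = pipeline.predict(toy_docs)
print(toy_topics.shape, toy_pred)
```

Wrapping the steps in a `Pipeline` is a design choice of this sketch: the vectorizer and LDA are fitted only on whatever data is passed to `fit`, so the test set does not influence the learned topics.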


###Evaluate classification model performance

In [35]:
y_pred = classifier.predict(X_test)

print("Classification Report:")
# Label indices follow newsgroups.target_names (alphabetical), not the order of `categories`
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))


Classification Report:
                    precision    recall  f1-score   support

     comp.graphics       0.78      0.90      0.84       195
         rec.autos       0.52      0.65      0.58       198
           sci.med       0.48      0.26      0.34       198
talk.politics.guns       0.80      0.87      0.83       182

          accuracy                           0.66       773
         macro avg       0.65      0.67      0.65       773
      weighted avg       0.64      0.66      0.64       773



### Interpretation of Classification Model Performance

The model achieved an **accuracy of 66%** on the 773-document test set, meaning it correctly classified about two-thirds of the documents.

**Performance by Category:**

*   **'comp.graphics' and 'talk.politics.guns'**: These categories show strong performance, with **precision of 0.78 and 0.80**, **recall of 0.90 and 0.87**, and **f1-scores of 0.84 and 0.83**. The model finds most documents from these categories and produces relatively few false positives for them.

*   **'rec.autos'**: Performance is moderate, with a **precision of 0.52**, a **recall of 0.65**, and an **f1-score of 0.58**. Roughly half of the documents predicted as 'rec.autos' belong to another category, so reducing false positives is the main room for improvement.

*   **'sci.med'**: This category is the weakest, with low **precision (0.48)** and very low **recall (0.26)**, giving a poor **f1-score of 0.34**. The model recovers only about a quarter of the actual 'sci.med' documents, and fewer than half of its 'sci.med' predictions are correct. A likely cause is that none of the four LDA topics is clearly medical: Topics 3 and 4 both contain "graphics", and Topic 2 mixes "car" with generic conversational words, so 'sci.med' and 'rec.autos' documents may receive similar topic proportions.

**Summary:**
The model works well for 'comp.graphics' and 'talk.politics.guns', is moderate on 'rec.autos', and struggles significantly with 'sci.med'. Further improvements might involve refining the LDA topic modeling, increasing the number of topics, or exploring different classification algorithms, especially for the underperforming categories.
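
The accuracy and f1 figures quoted above can be cross-checked from the report's precision, recall, and support columns. The sketch below copies those rounded values row by row, so its results are only accurate to rounding. It relies on the standard definitions: f1 is the harmonic mean of precision and recall, and accuracy equals the support-weighted mean of per-class recall.

```python
# (precision, recall, support) for each report row, in the order printed above.
rows = [
    (0.78, 0.90, 195),
    (0.52, 0.65, 198),
    (0.48, 0.26, 198),
    (0.80, 0.87, 182),
]

# F1 is the harmonic mean of precision and recall.
f1_scores = [round(2 * p * r / (p + r), 2) for p, r, _ in rows]

# Accuracy is the share of correct predictions: sum(recall * support) / total.
total = sum(n for _, _, n in rows)
accuracy = sum(r * n for _, r, n in rows) / total

print(f1_scores)            # → [0.84, 0.58, 0.34, 0.83]
print(round(accuracy, 2))   # → 0.66
```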