### LDA Topic Modeling on Press Releases

In this notebook, we perform topic modeling with Latent Dirichlet Allocation (LDA), a generative probabilistic model that represents each document as a mixture of topics and each topic as a distribution over words.
We use `scikit-learn` (`sklearn`) to build the document-term matrix and fit the model, and `wordcloud` to visualize the resulting topics.


In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from wordcloud import WordCloud


### Load Press Releases Dataset


In [None]:
df = pd.read_csv("press_releases.csv")
documents = df['text'].dropna().tolist()


### Text Preprocessing Function

In [None]:
def preprocess_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply preprocessing
documents = [preprocess_text(doc) for doc in documents]
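
To see what the cleaning steps do to a single string, here is a small sketch on a made-up sentence. It repeats the function above and compares it with `preprocess_text_spaced`, a hypothetical variant (not part of the original notebook) that replaces removed characters with a space instead of deleting them.

```python
import re

def preprocess_text(text):
    # Same steps as the notebook's function
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_text_spaced(text):
    # Hypothetical variant: swap removed characters for a space so that
    # hyphenated words are split rather than fused together
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample sentence, for illustration only
sample = "Q3 revenue rose 12%, driven by cloud-based services!"
print(preprocess_text(sample))
print(preprocess_text_spaced(sample))
```

With the original function, "cloud-based" becomes the single token "cloudbased"; the variant keeps "cloud" and "based" separate. Which behavior is preferable depends on the corpus.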

### Convert Text to Document-Term Matrix

In [None]:
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(documents)

### Train LDA Model

In [None]:
num_topics = 5  # Define the number of topics
lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42)
lda_model.fit(X)
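
The value `num_topics = 5` is a choice rather than something the model learns. One possible way to compare candidate values, sketched here on an invented toy corpus, is to fit a model per candidate and compute perplexity on held-out documents with `LatentDirichletAllocation.perplexity` (lower is better). The corpus, the candidate values, and the split ratio are all assumptions made for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Invented toy corpus standing in for the press releases
toy_docs = [
    "company reports record quarterly earnings",
    "quarterly earnings beat analyst forecasts",
    "earnings growth lifts company shares",
    "company shares rise after earnings report",
    "new product launch announced today",
    "product launch draws customer interest",
    "customers praise new product design",
    "design team announces product update",
]
toy_X = CountVectorizer().fit_transform(toy_docs)

# Hold out a quarter of the documents for evaluation
X_train, X_test = train_test_split(toy_X, test_size=0.25, random_state=42)

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=10, random_state=42)
    model.fit(X_train)
    perplexities[k] = model.perplexity(X_test)

for k, score in perplexities.items():
    print(f"{k} topics: held-out perplexity = {score:.1f}")
```

Perplexity does not always agree with human judgments of topic quality, so it is worth reading the top words for each candidate as well.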

### Extract Words for Each Topic

In [None]:
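def display_topics(model, feature_names, num_words):
    """Print the num_words highest-weighted terms for each topic."""
    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so the reversed slice picks the largest weights first
        top_indices = topic.argsort()[:-num_words - 1:-1]
        print(f"Topic {topic_idx + 1}:")
        print(" ".join(feature_names[i] for i in top_indices))
        print()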

# Display the top words for each topic
display_topics(lda_model, vectorizer.get_feature_names_out(), 10)
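
The slice `argsort()[:-num_words - 1:-1]` used in `display_topics` is compact but easy to misread. The sketch below, using a made-up weight vector, shows that it returns the indices of the largest values in descending order, and that it matches the more explicit "reverse, then take the first n" form.

```python
import numpy as np

# Made-up topic-term weights, for illustration only
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
num_words = 3

# argsort sorts ascending; stepping backwards from the end and stopping
# before position -(num_words + 1) yields the num_words largest entries
top_compact = weights.argsort()[:-num_words - 1:-1]

# Equivalent, more explicit form
top_explicit = weights.argsort()[::-1][:num_words]

print(top_compact)   # indices of the three largest weights
print(top_explicit)
```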

### Visualizing Topics with Word Clouds

In [None]:
feature_names = vectorizer.get_feature_names_out()  # fetch once instead of once per term
fig, axes = plt.subplots(1, num_topics, figsize=(15, 5))
# np.atleast_1d keeps the loop valid when num_topics == 1 (subplots then returns a single Axes)
for i, ax in enumerate(np.atleast_1d(axes)):
    # Map each term to its weight in topic i
    topic_words = dict(zip(feature_names, lda_model.components_[i]))
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(topic_words)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.set_title(f"Topic {i+1}")
plt.show()

### Assign Topics to Documents

In [None]:
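# Each row is one document's probability distribution over the topics
doc_topic_distribution = lda_model.transform(X)
# argmax is 0-based; add 1 so the values match the "Topic N" labels used above
predicted_topics = np.argmax(doc_topic_distribution, axis=1) + 1
df_topics = pd.DataFrame({'Document': documents, 'Predicted_Topic': predicted_topics})
df_topics.head()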
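
Taking only the `argmax` discards how strongly a document belongs to its top topic. As a possible extension, the sketch below keeps the highest probability alongside the winning column. It runs on an invented four-document corpus with two topics; the column names `Top_Topic_Column` and `Top_Topic_Probability` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, for illustration only
toy_docs = [
    "company reports record quarterly earnings",
    "earnings growth lifts company shares",
    "new product launch announced today",
    "customers praise new product design",
]
toy_X = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=42)
toy_lda.fit(toy_X)
toy_dist = toy_lda.transform(toy_X)  # each row sums to 1

summary = pd.DataFrame({
    'Document': toy_docs,
    'Top_Topic_Column': toy_dist.argmax(axis=1),    # 0-based column index
    'Top_Topic_Probability': toy_dist.max(axis=1),  # weight of that topic
})
print(summary)
```

Documents whose top probability is close to `1 / n_components` are spread almost evenly across topics, so their assigned label deserves less trust.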