<a href="https://colab.research.google.com/github/Joshuajee/AI-ML-PROJECTS/blob/master/Topic%20Modelling%20on%20Financial%20Posts%20from%20Redit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pyLDAvis
!pip install bertopic
!pip install flair
!apt-get -qq install -y libfluidsynth1



** **
## Step 1: Loading the Data
** **
The data was collected manually from twenty-two financial subreddits and saved in CSV format to my GitHub repo.

In [None]:
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import gensim
import nltk
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess


In [None]:
def get_redit_data_from_github():
  file_path = "https://raw.githubusercontent.com/Joshuajee/AI-ML-PROJECTS/master/data/reddit/reddit_financial_data.csv"
  response = requests.get(file_path)
  if response.status_code == 200:
    # Cache the CSV locally, then load it into a DataFrame
    with open("reddit_financial_data.csv", "wb") as f:
      f.write(response.content)
    return pd.read_csv("reddit_financial_data.csv", sep=",")
  else:
    raise Exception("Error downloading", response.status_code)


In [None]:
reddit_data = get_redit_data_from_github()
reddit_data

** **
## Step 2: Data Cleaning
** **

The Reddit post data contains multiple columns, but only the title and text columns are useful for our topic modeling task, so the other columns will be ignored.

1. Join the title and the text columns.
2. Remove punctuation and special characters.
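
The two steps above can be sketched on a toy DataFrame. This is a minimal illustration rather than the notebook's exact pipeline: the sample rows and the helper name `clean_text` are hypothetical, and missing `text` values are assumed to be replaced by empty strings before joining.

```python
import re

import pandas as pd

# Hypothetical sample rows standing in for the Reddit data.
toy = pd.DataFrame({
    "title": ["Is $TSLA overvalued?", "Index funds 101"],
    "text": ["See https://example.com for charts!", None],
})

# Step 1: join title and text; fillna("") keeps posts that have no body text.
toy["content"] = toy["title"].fillna("") + " " + toy["text"].fillna("")


def clean_text(text: str) -> str:
    """Drop URLs and non-alphanumeric characters, lowercase, collapse whitespace."""
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"[^A-Za-z0-9\s]+", "", text)
    return re.sub(r"\s+", " ", text.lower()).strip()


# Step 2: strip punctuation and special characters.
toy["cleaned_content"] = toy["content"].apply(clean_text)
print(toy["cleaned_content"].tolist())
```

Filling missing values first matters because a `NaN` in either column would otherwise make the joined value `NaN` as well, and the regex-based cleaning expects a string.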




In [None]:
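# Join the title and text columns in a new content column.
# fillna('') prevents posts with a missing title or text from turning into NaN.
reddit_data['content'] = reddit_data['title'].fillna('') + ' ' + reddit_data['text'].fillna('')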
reddit_data

In [None]:
# Create a new DataFrame containing only the content column
content_df = reddit_data[['content']].copy()
content_df

In [None]:
def preprocess(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'[^A-Za-z0-9\s]+', '', text)  # Remove special characters
    text = text.lower()  # Lowercase text
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

In [None]:
# Clean the data collected from Reddit, as it contains irrelevant characters
content_df['cleaned_content'] = content_df['content'].apply(preprocess)
content_df

In [None]:
# Removing stop words
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
nltk.download('punkt_tab')

# Create a list of tokens, i.e. split each cleaned post into words
content_df['wordlist'] = content_df['cleaned_content'].apply(lambda x:word_tokenize(str(x)))
# Remove stop words from the wordlist column
content_df['wordlist_no_sw'] = content_df['wordlist'].apply(lambda x: [word for word in x if word not in stop_words])
content_df

** **
## Step 3: Exploratory Analysis <a class="anchor" id="eda"></a>
** **

To verify that the preprocessing worked, we’ll plot the distribution of post lengths and make a simple word cloud using the `wordcloud` package to get a visual representation of the most common words. This is key to understanding the data, ensuring we are on the right track, and deciding whether any more preprocessing is necessary before training the model.
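
As a textual complement to the word cloud, the most frequent tokens can also be counted directly. The sketch below uses `collections.Counter` on a hypothetical list of token lists shaped like the rows of the `wordlist_no_sw` column; the sample tokens are made up for illustration.

```python
from collections import Counter

# Hypothetical token lists, shaped like the rows of `wordlist_no_sw`.
token_lists = [
    ["stock", "market", "crash"],
    ["buy", "stock", "dip"],
    ["market", "stock", "rally"],
]

# Flatten the per-document lists and count how often each token occurs.
counts = Counter(token for tokens in token_lists for token in tokens)

# The top entries should be meaningful domain words, not stop words or debris.
top_two = counts.most_common(2)
print(top_two)
```

If stop words, URL fragments, or stray characters show up near the top of such a list, that is a sign the cleaning step needs another pass.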



In [None]:
text_lengths = [len(x) for x in content_df['wordlist_no_sw'].to_numpy()]

# Set up the figure size
plt.figure(figsize=(12, 6))

# Plot the histogram using seaborn with a KDE overlay.
sns.histplot(text_lengths, bins=50, kde=True, color="steelblue")

# Add plot labels and title
plt.title("Distribution of Text Lengths")
plt.xlabel("Text Length (number of tokens)")
plt.ylabel("Frequency")

# Show the plot
plt.show()

In [None]:
# Import the wordcloud library
from wordcloud import WordCloud

data_words = content_df['wordlist_no_sw'].explode().to_list()

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=3, width=1000, height=600, contour_color='steelblue')

# Join all tokens into one big chunk of text
big_chunk_text = " ".join(data_words)

# Generate a word cloud
wordcloud.generate(big_chunk_text)

# Visualize the word cloud
wordcloud.to_image()

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    # Each doc is already a list of tokens, so filter it directly
    return [[word for word in doc if word not in stop_words] for doc in texts]


data = content_df.cleaned_content.values.tolist()
data_words = list(sent_to_words(data))

# remove stop words
data_words = remove_stopwords(data_words)

print(data_words[:1][0][:30])

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0][:30])

** **
#### Step 5: LDA model training <a class="anchor" id="train_model"></a>
** **

To keep things simple, we'll leave all parameters at their defaults except the number of topics and the number of training passes. For this tutorial, we will build a model with 8 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.
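
To make the idea of a topic as a weighted combination of keywords concrete, here is a small sketch in plain Python. The weights are invented for illustration and the helper `format_topic` is hypothetical; it only imitates the `weight*"word"` string style that gensim's `print_topics` produces.

```python
# Hypothetical topic: each keyword carries a non-negative weight.
raw_weights = {"stock": 6.0, "dividend": 3.0, "etf": 1.0}

# Normalize so the weights form a probability distribution over the keywords.
total = sum(raw_weights.values())
topic = {word: weight / total for word, weight in raw_weights.items()}


def format_topic(topic_weights, topn=3):
    """Render the top keywords in a `weight*"word"` style, highest weight first."""
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    return " + ".join(f'{weight:.3f}*"{word}"' for word, weight in ranked[:topn])


print(format_topic(topic))
```

In a trained model the distribution covers the whole vocabulary rather than three words, and only the highest-weighted keywords are printed for each topic.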

In [None]:
from pprint import pprint

# number of topics
num_topics = 8

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=50)

# Print the keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

** **
#### Step 6: Analyzing our LDA model <a class="anchor" id="results"></a>
** **

Now that we have a trained model, let’s visualize the topics for interpretability. To do so, we’ll use pyLDAvis, a popular visualization package designed to help interactively with:

1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human interpretable name or “meaning” to each topic.

For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
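
The effect of λ in (1) can be sketched with the relevance score used to rank terms within a topic: relevance = λ · log p(w | t) + (1 - λ) · log(p(w | t) / p(w)). The probabilities below are invented for illustration, and the function name `relevance` is hypothetical rather than part of the pyLDAvis API.

```python
import math

# Hypothetical probabilities: p(word | topic) and overall p(word) in the corpus.
p_word_given_topic = {"stock": 0.20, "dividend": 0.05}
p_word = {"stock": 0.10, "dividend": 0.01}


def relevance(word, lam):
    """lam=1 ranks by in-topic probability; lam=0 ranks by lift over the corpus."""
    p_wt = p_word_given_topic[word]
    lift = p_wt / p_word[word]
    return lam * math.log(p_wt) + (1 - lam) * math.log(lift)


# Compare the ranking at both extremes of the slider.
for lam in (1.0, 0.0):
    ranked = sorted(p_word_given_topic, key=lambda w: relevance(w, lam), reverse=True)
    print(lam, ranked)
```

With λ = 1 the frequent word "stock" ranks first, while with λ = 0 the rarer but more topic-specific "dividend" does, which is why intermediate values of λ are often useful when naming topics.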

In [None]:
import pyLDAvis.gensim_models
import pickle
import pyLDAvis

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_data_filepath = '/content_' + str(num_topics)

# Preparing the visualization is a bit time-consuming. Set the condition
# below to False to skip it and reuse a previously pickled result instead.
if True:
    LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

pyLDAvis.save_html(LDAvis_prepared, 'content_'+ str(num_topics) +'.html')

LDAvis_prepared

In [None]:
lda_model