# Topic modelling - Latent Dirichlet Allocation (LDA)

This is a file where I tried to implement LDA with different packages, it is applied on the data of questions and answers between 2018 to 2022. This was done to compare topic models LDA and BERTopic to decide which one to use. We later decided to use BERTopic and applied it on all data. The models were initially saved to the models folder but were not provided in the final version to save space and since we decided not to use LDA. 

Made by: Elsa Kidman


## How to use

The following packages are required:
- nltk
- gensim
- sklearn

In [1]:
import json
f = open('../../data/data 2018-09-09 2022-09-11/data_FINAL_2018-09-09_to_2022-09-11.json')
data_init = json.load(f)

### Preprocess the text

For LDA it's imoportant to clean the text before applying the model

In [6]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

# Download NLTK stopwords data
nltk.download('stopwords')
nltk.download('punkt')

# A function to remove to/from formalia from each question.
def remove_names(question):

    ## Remove the first two lines
    lines = question.split('\n')[2:]
    result = '\n'.join(lines)

    return result

# Remove Swedish Stop Words
def remove_stopwords(text):

    stopword_custom = stopwords.words('swedish')
    stop_list = ["ska", "ske", "det", "vore", "samt"] # "se", "ge"
    stopword_custom.extend(stop_list)
    stop_words = set(stopword_custom)
    result = [word for word in text if word not in stop_words]
    return result

# Apply stemming for swedish
def stemming(text):

    stemmer = SnowballStemmer("swedish")
    resuling_text = [stemmer.stem(plural) for plural in text]
    return resuling_text

# Filters out all special characters, punktation etc. by using regular expressions. Only letters and numers left
def filter_bad_characters(tokens):

    # Regular expression can be used only get tokens containing letters and  numbers. Unicode is needed for å,ä,ö
    cleaned_tokens = [token for token in tokens if re.match(r'^[\wåäö]+$', token, flags=re.UNICODE)]
    return cleaned_tokens

def preprocess(text):

    # Tokenize the text
    words = word_tokenize(text, language='swedish')
    # Lowercase
    lowercase_words = [word.lower() for word in words]

    # Remove stop words
    stop_words_removed = remove_stopwords(lowercase_words)

    # Remove special characters and punktation
    cleaned_tokens = filter_bad_characters(stop_words_removed)

    #Apply stemming
    filtered_words = stemming(cleaned_tokens)

    # contains the preprocessed tokens.
    return filtered_words

[nltk_data] Downloading package stopwords to /home/elsa/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/elsa/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [8]:
# Gather data to apply the LDA models on
questions = []
questions_text = []
questions_tokenized = []

answers = []
answers_preproc = []

for entry in data_init:
    question = entry['question']
    answer = entry['answer']
    
    questions.append(question)
    
    preproc_question = preprocess(question)
    questions_tokenized.append(preproc_question)
    
    preproc_question = ' '.join(preproc_question)
    questions_text.append(preproc_question)

    answers.append(answer)
    
    preproc_answer = preprocess(answer)
    answers_preproc.append(preproc_answer)
    

## Implementation
Try to implement LDA with different packages: Gensim and Sklearn. 

### Gensim
[1] https://radimrehurek.com/gensim/intro.html

In [9]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

## We start on the data with the questions, which have been preprocessed.
# Create a dictionary representation from our questions (docs).
dictionary = Dictionary(questions_tokenized)

corpus = [dictionary.doc2bow(doc) for doc in questions_tokenized]

# Since LDA don't decide number of topics for us we need to decide how many topics we want.
num_topics = 50

# Index to word dictionary
tmp = dictionary[0]
id2word = dictionary.id2token


lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, alpha='auto', eval_every=5)

In [15]:
from gensim.test.utils import datapath

# How to save the model.  NOTE: this model is not included in the models folder to save space.
# path = datapath("models/gensim_LDA")
lda_model.save("models/gensim_LDA")

# How toload a pretrained model.
# lda_model = LdaModel.load(path)

In [16]:
# A function to get the topics of the model. Useful for displaying the topics.
def get_topics(model, n_topics):
  topics = {}
  for i, topic in model.show_topics(formatted=False, num_topics=n_topics):
    topic =  [x[0] for x in topic]
    topics[i] = topic
  return topics

# Print all topics found by LDA
topics =  get_topics(lda_model, num_topics)
print(len(topics))
for i, topic in topics.items():
  print(f"Topic ID: {i}, Words: {topic}")

50
Topic ID: 0, Words: ['elev', 'utbildningsminist', 'ekström', 'ann', 'lär', 'skolan', 'skol', 'utbildning', 'skolverket', 'skollag']
Topic ID: 1, Words: ['riksrevision', 'regering', 'arland', 'fråg', 'beskattning', 'granskningsrapport', 'isolering', 'diesel', 'finansminist', 'magdalen']
Topic ID: 2, Words: ['företag', 'sver', 'svensk', 'statsrådet', 'fråg', 's', 'näringsminist', 'regering', 'vill', 'baylan']
Topic ID: 3, Words: ['postnord', 'rätt', 'who', 'länd', 's', 'utrikesminist', 'barnkonvention', 'egypt', 'otillåtn', 'skarv']
Topic ID: 4, Words: ['regering', 'fråg', 'sver', 'statsrådet', 'svensk', 'mynd', 's', 'komm', 'statsminist', 'säkerhetsklass']
Topic ID: 5, Words: ['barn', 'ung', 'barnet', 'boend', 'skydd', 'rätt', 'föräldr', 'vuxn', 'familj', 'hallengr']
Topic ID: 6, Words: ['handläggningstid', 'lantmäteriet', 'skyddsjak', 'skadegör', 'jäg', 'pendl', 'statsrådet', 'fastighetsbildning', 'säl', 'licensjak']
Topic ID: 7, Words: ['statsrådet', 'fråg', 's', 'vill', 'student',

In [17]:
# How to get the most probable topic for a entry document
def get_top_topic_document(model, bow_doc, topics):
  doc_topics = lda_model.get_document_topics(bow_doc)
  max_topic = max(doc_topics, key=lambda x: x[1])
  return max_topic[0], topics[max_topic[0]]

# This gets the most likely topic for question at index 100
get_top_topic_document(lda_model, corpus[100], topics)

(11,
 ['sver',
  'fråg',
  'regering',
  'stämpelskat',
  's',
  'svensk',
  'vill',
  'finansminist',
  'statsrådet',
  'anledning'])

In [19]:
from gensim.models.coherencemodel import CoherenceModel


# To evaluate the model we can use different measures of coherence like c_v and U mass
texts = [[dictionary[word_id] for word_id, freq in doc] for doc in corpus]
c_v = CoherenceModel(model=lda_model, texts=texts, corpus=corpus, dictionary=dictionary, coherence='c_v')
coherence_c_v = c_v.get_coherence()
u_mass = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
coherence_u_mass = u_mass.get_coherence()

print(f"C_v coherence score: {coherence_c_v}")
print(f"U_mass coherence score: {coherence_u_mass}")

C_v coherence score: 0.5372385673459167
U_mass coherence score: -4.347784362688686


### Sklearn LDA

[2] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [21]:
#import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

top_words = 10
# We need to decide to nr of topics in advance. This can later be optimised with gridsearch.
nr_topics = 50

tf_vectorizer = CountVectorizer()#(max_df=0.95, min_df=2, max_features=top_words))
tf = tf_vectorizer.fit_transform(questions_text)

lda_model_2 = LatentDirichletAllocation(n_components=nr_topics)
doc_topic = lda_model_2.fit_transform(tf)
#doc_topic = lda_model_2.transform(tf)

tf_feature_names = tf_vectorizer.get_feature_names_out()

In [22]:
# To be able to print the topics for the lda model
def get_topics(model, top_words, tf_feature_names):
  topics = {}
  for i, topic in enumerate(model.components_):
    topics[i] = ([tf_feature_names[j] for j in topic.argsort()[:-top_words - 1:-1]])
  return topics

# Print each topic
topics_2 =  get_topics(lda_model_2, top_words, tf_feature_names)
print(len(topics_2))
for i, topic in topics_2.items():
  print(f"Topic ID: {i}, Words: {topic}")

50
Topic ID: 0, Words: ['handläggningstid', 'procent', 'per', '000', 'år', 'veck', '2020', 'bolund', 'kommun', '2019']
Topic ID: 1, Words: ['johansson', 'morgan', 'migrationsminist', 'minist', 'person', 'fråg', 'brott', 'sver', 'polis', 'vill']
Topic ID: 2, Words: ['kron', 'sver', 'regering', 'miljard', 'kostnad', 'år', 'kommun', 'vill', 'fråg', 'stor']
Topic ID: 3, Words: ['utrikesminist', 'ann', 'kin', 'lind', 'rätt', 'landet', 'regim', 'mänsk', 'iran', 'kinesisk']
Topic ID: 4, Words: ['kvinn', 'statsrådet', 'våld', 'män', 'vill', 'kommun', 'fråg', 'sver', 'anledning', 'fler']
Topic ID: 5, Words: ['lin', 'statsrådet', 'högskol', 'öresund', 'anledning', 'vill', 'kihlblom', 'utbildning', 'fråg', 'axelsson']
Topic ID: 6, Words: ['statsrådet', 'mynd', 'mikael', 'gävleborg', 'damberg', 'msb', 'beredskap', 'fråg', 'kommun', 'åtgärd']
Topic ID: 7, Words: ['sexualbrot', 'år', 'barn', 'procent', 'anmäld', 'våldtäk', 'person', 'brott', 'fler', 'finn']
Topic ID: 8, Words: ['statsrådet', 'nilsso

In [23]:
# How to get the most probable topic for each entry document
def get_top_topic_document_2(model, doc_topic_entry, topics):

  # Get the most likely topic for an entry
  max_topic = doc_topic_entry.argmax()
  return max_topic, topics[max_topic]

# For question at index 0
get_top_topic_document_2(lda_model_2, doc_topic[0], topics_2)

(41,
 ['riksrevision',
  'statsrådet',
  'regering',
  'fråg',
  'vill',
  'anledning',
  'brist',
  'granskning',
  'dag',
  'även'])

In [24]:
doc_topic.shape[0]

6603

In [31]:
from gensim.models import CoherenceModel

# To be able to use gensim's coherence score feature with Sklearn
model_components = lda_model_2.components_
dictionary2 = Dictionary(questions_tokenized)
feature_names = [dictionary[i] for i in range(len(dictionary2))]

# Get the top words for each topic from the components_ attribute
topic_words = []
for topic in model_components:
    words = [feature_names[i] for i in topic.argsort()[:-nr_topics - 1:-1]] # This gets the words for each topic
    topic_words.append(words)

c_v_2 = CoherenceModel(topics=topic_words, texts=questions_tokenized, dictionary=dictionary2, coherence='c_v')
coherence_c_v_2 = c_v_2.get_coherence()

u_mass_2 = CoherenceModel(topics=topic_words, texts=questions_tokenized, dictionary=dictionary2,  coherence='u_mass')
coherence_u_mass_2 = u_mass_2.get_coherence()  # get coherence value

print(f"C_v coherence score: {coherence_c_v_2}")
print(f"U_mass coherence score: {coherence_u_mass_2}")

C_v coherence score: 0.6948252362369609
U_mass coherence score: -19.451116631416294
