# Topic modelling - BERTopic

This file creates topics for the dataset containing "Skriflig fråga" (questions) and "svar på skriftlig fråga" (answers) fetched from riksdagen.se. This is done by using a model called BERTopic [1]. First the techinque is tested on a smaller dataset containing only data from 2018 to 2022 and is then applied to all dat fetched from 2006 to 2022. This file creates BERTopic models on this data. The models that was created from this file have been saved to a folder called "models" and can be easialy loaded (how to load them in can be seen in the code), so that we can have consistent topics and don't have to run this code each time. Note that the models for the data 2018 to 2022 is not included since this was only to test out BERTopic and I didn't want to include even more data then I already have.

File made by: Elsa Kidman

[1] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, 2022. https://maartengr.github.io/BERTopic/index.html

## How to run
To be able to run the following code the following python packages are required:
- tensorflow needs to be updated
- BERTopic
- sklearn
- pickle
- nltk
- numpy

In [None]:
# To be able to run this file the following packages needs to be installed
# !pip3 install --upgrade tensorflow
# !pip3 install BERTopic

In [3]:
import pickle
import json
import numpy as np
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## BERTopic
It exist different ways to find the topics of the questions and answers*. I'm going to get the topics of the following:
- Get topics for a combined entry, i.e. questions and answers combined.
- Get topics for answers and questions seperatly. However, execute the BERTopic model with both questions and answers so that they get the same set of topics.
- One could also try to get topics for answers and questions seperatly by fit and transform the data for questions and answers seperate. However, then they will get completely different set of topics and therefore I think this is not a good solution for us.


Note:
- The text does not need to be preprocessed (tokenisation, stemming, stopword removale etc.) before applying BERTopic [2].
- Outlier topics need to handled. These are topics of nr -1 [3].
- BERTopic automatically chooses the number of topics for us.


\* Questions refers to "skriftlig fråga" and answers refers to "svar på skriftlig fråga"

[2] M. Grootendorst, Frequently Asked Questions, 2023,
https://maartengr.github.io/BERTopic/faq.html#should-i-preprocess-the-data

[3] M. Grootendorst, Outlier reduction, 2023, https://maartengr.github.io/BERTopic/getting_started/outlier_reduction/outlier_reduction.html

In [13]:
# Reduces outlier topics. These are marked as the topic nr -1
def reduce_outliers(model, text, topics):

  # Reduce outliers
  new_topics = model.reduce_outliers(text, topics)
  # Update Topic Representation
  model.update_topics(text, topics=new_topics, vectorizer_model=vectorizer_model)
  return model

# Get list of Swedish stopwords and add some more
stopword_custom = stopwords.words('swedish')
stop_list = ["ska", "ske", "det", "vore"]
stopword_custom.extend(stop_list)
stop_words = set(stopword_custom)
stop_words = list(stop_words)

# define our CountVectorizer to remove stopwords
vectorizer_model = CountVectorizer(stop_words=stop_words)

# Improved model 2
vectorizer_model = CountVectorizer(min_df=2, stop_words = stop_words, ngram_range=(1, 2))


In [None]:
# Code to Save and load the models. Note: Change the path
# # Save model, topics and probabilities
# model.save("/models", serialization="pickle")

# with open("/models", "wb") as fp:
#   pickle.dump(topics, fp)

# with open("/models", "wb") as fp:
#   pickle.dump(probs, fp)

# # Load model, topics and probabilities
# model = BERTopic.load("/models")

# with open("/models", "rb") as fp:
#   topics = pickle.load(fp)

# with open("/models", "rb") as fp:
#   probs = pickle.load(fp)

# # Hot to save as json
# with open("/models", "w") as fp:
#     json.dump(result_dict, fp)  


## Dataset containing questions and answers from 2018 to 2022

First I test BERTopic on the smaller dataset containing data from 2018 to 2022. This is only to test and learn BERTopicon a smaller dataset, scroll down to all data for the final product. Again, this model is not included in the models folder since it is only to learn.

In [None]:
# TODO: change path
f = open('../../data/data 2018-09-09 2022-09-11/data_FINAL_2018-09-09_to_2022-09-11.json')
data_init = json.load(f)

In [5]:
# Remove person names from question/answers.
# NOTE: This worked for the data 2018-2022, but not on all data since the questions format is different troughout the years (see data_exploration_all_years).
def remove_names(question):
    ## Remove the first two lines
    lines = question.split('\n')[2:]
    new_text = '\n'.join(lines)
    return new_text

# Generate lists of the different data needed
orginal_text_combined = []
orginal_questions = []
orginal_answers = []
original_qa = []

for entry in data_init:
    question = entry['question']
    answer = entry['answer']

    # remove the to/from formality from question
    question2 = remove_names(question)
    orginal_questions.append(question2)
    original_qa.append(question2)

    # Answers
    orginal_answers.append(answer)
    original_qa.append(answer)

    # Combined questions and answers
    orginal_text_combined.append(question2+answer)

print(f"Amount of datapoints: {len(orginal_text_combined)}")

Amount of datapoints: 6603


### Preform BERTopic on questions and answers seperatly
NOTE: We decided not to use this since questions and answers get different set of topics.

In [None]:
# Train the BERTopic model on the questions
model_questions = BERTopic(language="swedish",calculate_probabilities=True, vectorizer_model=vectorizer_model)
topics_questions, probs_questions = model_questions.fit_transform(orginal_questions)
model_questions_improved = reduce_outliers(model_questions, orginal_questions, topics_questions)

# Train the BERTopic model on the answers
model_answers = BERTopic(language="swedish",calculate_probabilities=True, vectorizer_model=vectorizer_model)
topics_answers, probs_answers = model_answers.fit_transform(orginal_answers)
model_answers_improved = reduce_outliers(model_answers, orginal_answers, topics_answers)

In [None]:
# # Save questions model, topic and probs. NOTE: these are not included in the models folder since it would take up to much space
# model_questions.save("models/model_questions_trained_seperately", serialization="pickle")

# with open("models/topics_questions_trained_seperately", "wb") as fp:
#   pickle.dump(topics_questions, fp)

# with open("models/probs_questions_trained_seperately", "wb") as fp:
#   pickle.dump(probs_questions, fp)

# # Save answers model, topic and probs
# model_answers.save("models/model_answers_trained_seperately", serialization="pickle")

# with open("models/topics_answers_trained_seperately", "wb") as fp:
#   pickle.dump(topics_answers, fp)

# with open("models/probs_answers_trained_seperately", "wb") as fp:
#   pickle.dump(probs_answers, fp)

In [None]:
model_questions.get_topic_info()
model_answers.get_topic_info()

In [None]:
# Get topics for each document (data entry) and save them the the data json file.
document_topic_questions = model_questions.get_document_info(orginal_questions)
document_topic_answers = model_answers.get_document_info(orginal_answers)
document_topic_combined = model_combined.get_document_info(orginal_text_combined)

columns_to_extract = ['Name', 'Top_n_words']
data = data_init.copy()

for i, entry in enumerate(data):
    # combined
    id_topic, words = document_topic_combined.loc[i, columns_to_extract].values
    words = words.split(' - ')
    words = ', '.join(words)
    print(words)
    break

    #questions
    id_topic_Q, wordsQ = document_topic_questions.loc[i, columns_to_extract].values
    words_Q = wordsQ.split(' - ')
    words_Q = ', '.join(words_Q)

    if len(entry['answer']) == 0:
        # Answers
        id_topic_A = ''
        words_A =  []

    else:
        # Answers
        nameA, wordsA = document_topic_answers.loc[i, columns_to_extract].values
        wordsA = wordsA.split(' - ')

    new_element = { 'id_topic_combined': id, 'top_10_words_combined': words,
                    'id_topic_question': id_topic_Q, 'top_n_words_question': words_Q,
                    'id_topic_answer': id_topic_A, 'top_n_words_answer': words_A}

    entry.update(new_element)

### BERTopic on question and answers seperate but trained at the same time, so they get same set of topics
Note: Here bertopic setting nr_topics='auto' was used to automatically let BERTopic reduce amount of topics. After consideration we decided not to use this since we got a large nonsense topic (topic 0). It was not used on the dataset with all data from 2006-2022.

In [None]:
model_qa = BERTopic(language="swedish",
                    calculate_probabilities=True,
                    vectorizer_model=vectorizer_model,
                    nr_topics = 'auto') # by using auto, BERTopic decreases topics

topics_qa, probs_qa = model_qa.fit_transform(original_qa)

model_qa.get_topic_info()

In [None]:
# model_qa = BERTopic.load("/content/gdrive/MyDrive/IT5/Data/model_qa")
# model_qa.get_topic_info()

In [None]:
model_qa = reduce_outliers(model_qa, original_qa, topics_qa)
model_qa.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,10352,0_regeringen_fråga_sverige_åtgärder,"[regeringen, fråga, sverige, åtgärder, 2020, a...",[Viktor Wärnyck (M) har frågat mig om jag avse...
1,1,433,1_elever_skolan_lärare_skolor,"[elever, skolan, lärare, skolor, ekström, anna...",[Svar på fråga 2020/21:1487 av Thomas Morell (...
2,2,138,2_besvaras_övergångsregering_skriftliga_frågor,"[besvaras, övergångsregering, skriftliga, fråg...",[Meddelande om uteblivet svar på fråga 2020/21...
3,3,144,3_turkiet_turkiska_turkiets_syrien,"[turkiet, turkiska, turkiets, syrien, mänsklig...",[Svar på fråga 2019/20:1607 av Sara Gille (SD)...
4,4,127,4_israel_palestinska_palestina_israels,"[israel, palestinska, palestina, israels, pale...",[Svar på fråga 2019/20:1962 av Björn Söder (SD...
5,5,99,5_taiwan_taiwans_who_internationella,"[taiwan, taiwans, who, internationella, intern...",[Svar på frågorna 2019/20:864 av Hans Rothenbe...
6,6,115,6_iran_iranska_mänskliga_rättigheter,"[iran, iranska, mänskliga, rättigheter, mänskl...",[Svar på fråga 2018/19:420 av Anders Österberg...
7,7,90,7_etiopien_tigray_mänskliga_mänskliga rättigheter,"[etiopien, tigray, mänskliga, mänskliga rättig...",[Svar på fråga 2021/22:339 av Håkan Svenneling...
8,8,112,8_journalister_medier_radio_tv,"[journalister, medier, radio, tv, oberoende, p...",[Svar på fråga 2019/20:718 av Angelika Bengtss...
9,9,79,9_kulturen_kultur_kultursektorn_kulturarvet,"[kulturen, kultur, kultursektorn, kulturarvet,...",[ \nStora delar av kultursektorn är i behov av...


In [None]:
model_qa.visualize_heatmap(n_clusters = 20)

In [None]:
document_topic_qa = model_qa.get_document_info(original_qa)

document_topic_q = document_topic_qa.iloc[::2]  # Select every other row starting from the first row
document_topic_a = document_topic_qa.iloc[1::2]  # Select every other row starting from the second row

# Index needs to be resetted
document_topic_q.reset_index(drop=True, inplace=True)
document_topic_a.reset_index(drop=True, inplace=True)

### BERTopic combined
Note: Here bertopic setting nr_topics='auto' was used to automatically let BERTopic reduce amount of topics. After consideration we decided not to use this since we got a large nonsense topic (topic 0). It was not used on the dataset with all data from 2006-2022.

In [None]:
# Train the BERTopic model on a text that for each datapoint questions and answers are combined into one entry.
model_combined = BERTopic(language="swedish",
                          calculate_probabilities=True,
                          vectorizer_model=vectorizer_model,
                          nr_topics = 'auto',)

topics_combined, probs_combined = model_combined.fit_transform(orginal_text_combined)

model_combined = reduce_outliers(model_combined, orginal_text_combined, topics_combined)



### Save model result in data file
The following code saves the topic ID and top 10 words for a topic for each datapoint in our dataset.

In [None]:
# document_topic_q
# document_topic_a
document_topic_combined = model_combined.get_document_info(orginal_text_combined)

columns_to_extract = ['Topic', 'Top_n_words']
data = data_init.copy()

for i, entry in enumerate(data):
    # combined
    id_topic, words = document_topic_combined.loc[i, columns_to_extract].values
    id_topic = id_topic.item()
    words = words.split(' - ')
    words = ', '.join(words)

    #questions
    id_topic_Q, words_Q = document_topic_q.loc[i, columns_to_extract].values
    id_topic_Q = id_topic_Q.item()
    words_Q = words_Q.split(' - ')
    words_Q = ', '.join(words_Q)

    if len(entry['answer']) == 0:
        # Answers
        id_topic_A = ''
        words_A =  []

    else:
        # Answers
        id_topic_A, words_A = document_topic_a.loc[i, columns_to_extract].values
        id_topic_A = id_topic_A.item()
        words_A = words_A.split(' - ')
        words_A = ', '.join(words_A)

    new_element = { 'id_topic_combined': id_topic, 'top_10_words_combined': words,
                    'id_topic_question': id_topic_Q, 'top_10_words_question': words_Q,
                    'id_topic_answer': id_topic_A, 'top_10_words_answer': words_A}

    entry.update(new_element)

In [None]:
data[0]

{'id_': 'h911987',
 'question': 'av Betty Malmberg (M)\ntill Utbildningsminister Anna Ekström (S)\n\xa0\nÅr 2010 antog Europaparlamentet det så kallade djurförsöksdirektivet (2010/63/EU). Syftet med direktivet är att de försök som i dag görs på levande djur i antingen vetenskapliga eller utbildningsmässiga sammanhang ska ersättas med djurfria metoder, där det är vetenskapligt möjligt.\nNederländerna har redan antagit en strategi för detta som innebär att de fram till 2025 ska ha fasat ut många av djurförsöken samt ha antagit olika handlingsplaner för att minska djurförsök inom bland annat grundforskning. Nederländernas initiativ är mycket intressant ur flera aspekter såsom etik, effektivitet och ekonomi. Det är också högst rimligt eftersom det i dag finns alternativa sätt för att utveckla läkemedel som är mer effektiva och som dessutom kan korta processerna för framtagande av desamma. Det kan vara via studier på levande celler, i provrör eller genom beräkningsmodeller i datorn. Det inn

In [None]:
import json

with open("../../data/data 2018-09-09 2022-09-11/data_topic.json", "w") as fp:
    json.dump(data, fp)  # encode dict into JSON

### Improve BERTopic
We can now try to improve BERTopic even further by playing around with all features the package provides. I followed a guide [4] that provides additional representation models which could help give better top n words for each topic.

[4] M. Mansurova, Topics per Class Using BERTopic, 2023, Towards Data Science. https://towardsdatascience.com/topics-per-class-using-bertopic-252314f2640

In [None]:
# Here BERTopic is applied on questions and answers together. [question1, answer1, question2, answer2, etc...]
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, PartOfSpeech, MaximalMarginalRelevance

# use sv_core_news_sm for swedish
main_representation_model    = None
aspect_representation_model1 = KeyBERTInspired()
aspect_representation_model2 = PartOfSpeech("sv_core_news_sm")
aspect_representation_model3 = [KeyBERTInspired(),
                                MaximalMarginalRelevance(diversity=.5)]#diversity=.7)]

# Different representation models can be compared
representation_model = {
   "Main": main_representation_model,
   "Aspect1":  aspect_representation_model1,
   "Aspect2":  aspect_representation_model2,
   "Aspect3":  aspect_representation_model3
}

vectorizer_model = CountVectorizer(min_df=2, stop_words = stop_words, ngram_range=(1, 2))#min_df=2, ngram_range=(1, 2)

# Added swedish language and calculate_probabilities
topic_model = BERTopic(nr_topics = 'auto',
                      vectorizer_model = vectorizer_model,
                      representation_model = representation_model,
                       language="swedish",
                       calculate_probabilities=True, )

topics, ini_probs = topic_model.fit_transform(original_qa)

In [None]:
# check for empty entries
i = 0
for e in original_qa:
  if len(e) == 0:
    i+=1
print(i)

27


In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,4359,-1_regeringen_fråga_åtgärder_sverige,"[regeringen, fråga, åtgärder, sverige, 2020, v...","[Svar på fråga 2020/21:2373, 2020/21:2374, 202..."
1,0,7361,0_regeringen_fråga_sverige_åtgärder,"[regeringen, fråga, sverige, åtgärder, 2020, 2...",[Svar på fråga 2020/21:2001 av Ulrika Jörgense...
2,1,118,1_israel_palestinska_palestina_israels,"[israel, palestinska, palestina, israels, pale...",[Svar på fråga 2019/20:1962 av Björn Söder (SD...
3,2,115,2_turkiet_turkiska_turkiets_syrien,"[turkiet, turkiska, turkiets, syrien, mänsklig...",[Svar på fråga 2019/20:1703 av Markus Wiechel ...
4,3,101,3_bostäder_bostadsmarknaden_hyresrätter_hyress...,"[bostäder, bostadsmarknaden, hyresrätter, hyre...",[Svar på fråga 2019/20:461 av Momodou Malcolm ...
5,4,92,4_taiwan_taiwans_who_internationella,"[taiwan, taiwans, who, internationella, intern...",[Svar på frågorna 2019/20:864 av Hans Rothenbe...
6,5,69,5_etiopien_tigray_mänskliga_humanitära,"[etiopien, tigray, mänskliga, humanitära, mäns...",[Svar på fråga 2020/21:2513 av Magnus Jacobsso...
7,6,58,6_kina_kinesiska_kinas_hongkong,"[kina, kinesiska, kinas, hongkong, eu, ambassa...",[Svar på frågorna 2019/20:1712 Påtvingad säker...
8,7,53,7_spel_spelmarknaden_shekarabi_spelinspektionen,"[spel, spelmarknaden, shekarabi, spelinspektio...",[Svar på fråga 2020/21:1833 av John Weinerhall...
9,8,48,8_venezuela_kuba_chile_politiska,"[venezuela, kuba, chile, politiska, colombia, ...",[Svar på fråga 2018/19:398 Maktväxling i Venez...


In [None]:
topic_model = reduce_outliers(topic_model, original_qa, topics)
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Aspect1,Aspect2,Aspect3,Representative_Docs
0,0,12171,0_regeringen_fråga_sverige_åtgärder,"[regeringen, fråga, sverige, åtgärder, 2020, a...","[fråga 2020, 2020 21, 2022, åtgärder, 2019 20,...","[regeringen, fråga, åtgärder, svar, svenska, a...","[fråga 2020, 2020 21, 2022, åtgärder, 2019 20,...",[Viktor Wärnyck (M) har frågat mig om jag avse...
1,1,103,1_taiwan_taiwans_who_internationella,"[taiwan, taiwans, who, internationella, kina, ...","[taiwans deltagande, stödja taiwans, stöd taiw...","[internationella, internationella organisation...","[taiwans deltagande, stödja taiwans, stöd taiw...",[Svar på frågorna 2019/20:864 av Hans Rothenbe...
2,2,85,2_etiopien_tigray_fn_mänskliga,"[etiopien, tigray, fn, mänskliga, rättigheter,...","[etiopiska regeringen, etiopiska, humanitärt t...","[mänskliga, humanitära, rättigheter, mänskliga...","[etiopiska regeringen, etiopiska, humanitärt t...",[Svar på fråga 2020/21:2513 av Magnus Jacobsso...
3,3,66,3_mänskliga_venezuela_kuba_rättigheter,"[mänskliga, venezuela, kuba, rättigheter, eu, ...","[chiles regering, venezuela, chile, chiles, ku...","[politiska, mänskliga, rättigheter, mänskliga ...","[chiles regering, venezuela, chile, chiles, ku...",[Svar på fråga 2018/19:398 Maktväxling i Venez...
4,4,66,4_kina_hongkong_kinas_eu,"[kina, hongkong, kinas, eu, kinesiska, hongkon...","[säkerhetslagstiftning hongkong, situationen h...","[kinesiska, ambassaden, agerande, mänskliga, r...","[säkerhetslagstiftning hongkong, situationen h...",[Svar på frågorna 2019/20:1712 Påtvingad säker...
5,5,41,5_solceller_el_privatpersoner_solel,"[solceller, el, privatpersoner, solel, förnyba...","[solenergianläggningar, solcellsanläggningar, ...","[solceller, privatpersoner, solel, bygglov, fö...","[solenergianläggningar, solcellsanläggningar, ...",[Svar på fråga 2018/19:450 av Rickard Nordin (...
6,6,52,6_kulturarvet_statens_kulturarv_slottet,"[kulturarvet, statens, kulturarv, slottet, fas...","[nationalmuseum, museiverksamhet, kulturarvet,...","[kulturarvet, statens, kulturarv, museer, slot...","[nationalmuseum, museiverksamhet, kulturarvet,...",[Svar på fråga 2019/20:694 av Magnus Oscarsson...
7,7,36,7_postnord_post_brev_pts,"[postnord, post, brev, pts, samhällsomfattande...","[samhällsomfattande posttjänsten, posttjänsten...","[post, brev, samhällsomfattande, tidningar, po...","[samhällsomfattande posttjänsten, posttjänsten...",[Svar på fråga 2018/19:522 av Linus Sköld (S)R...
8,8,35,8_folkmordet_1915_folkmord_osmanska,"[folkmordet, 1915, folkmord, osmanska, erkänna...","[folkmordet 1915, erkänna folkmordet, svenska ...","[folkmordet, folkmord, riket, erkännande, arme...","[folkmordet 1915, erkänna folkmordet, svenska ...",[Svar på fråga 2020/21:2652 av Björn Söder (SD...
9,9,31,9_myanmar_militären_militärkuppen_fn,"[myanmar, militären, militärkuppen, fn, eu, la...","[burma, fråga utrikesminister, militärens ager...","[militären, militärkuppen, landet, militärens,...","[burma, fråga utrikesminister, militärens ager...",[Svar på frågorna 2020/21:1567 av Markus Wiech...


In [None]:
# This displays some different representation models for different topics
a = 0
b = 1
print(topic_model.get_topic_info().iloc[a]['Representation'])

print(topic_model.get_topic_info().iloc[a]['Aspect1'])
print(topic_model.get_topic_info().iloc[a]['Aspect2'])
print(topic_model.get_topic_info().iloc[a]['Aspect3'])

print("\nnytt topic \n")

print(topic_model.get_topic_info().iloc[b]['Representation'])
print(topic_model.get_topic_info().iloc[b]['Aspect1'])
print(topic_model.get_topic_info().iloc[b]['Aspect2'])
print(topic_model.get_topic_info().iloc[b]['Aspect3'])


['regeringen', 'fråga', 'sverige', 'åtgärder', '2020', 'avser', 'kommer', '2021', 'vill', 'även']
['fråga 2020', '2020 21', '2022', 'åtgärder', '2019 20', '2021', 'bland annat', '19', '2020', '2019']
['regeringen', 'fråga', 'åtgärder', 'svar', 'svenska', 'anledning', 'år', 'andra', 'annat', 'ministern']
['fråga 2020', '2020 21', '2022', 'åtgärder', '2019 20', '2021', 'bland annat', '19', '2020', '2019']

nytt topic 

['taiwan', 'taiwans', 'who', 'internationella', 'kina', 'internationella organisationer', 'linde', 'ann linde', 'deltagande', 'taiwans deltagande']
['taiwans deltagande', 'stödja taiwans', 'stöd taiwan', 'taiwan deltar', 'intresse taiwan', 'taiwans mottagning', 'taiwan', 'taiwans', 'taiwans möjlighet', 'fråga utrikesminister']
['internationella', 'internationella organisationer', 'deltagande', 'organisationer', 'svar', 'fråga', 'tidigare', 'utrikesminister', 'representation', 'tidigare svar']
['taiwans deltagande', 'stödja taiwans', 'stöd taiwan', 'taiwan deltar', 'intress

In [None]:
# Heatmap can help to see if topics are similar to each other
topic_model.visualize_heatmap(n_clusters = 20)

In [None]:
# If a representation model other than none were to be choosen. This would be probably be the one.
from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.4)
topic_model = BERTopic(representation_model=representation_model)

# Reduces outlier topics. These are often marked as the topic nr -1
def reduce_outliers_representation(model, text, topics):
  # Reduce outliers
  new_topics = model.reduce_outliers(text, topics)
  # Update Topic Representation
  model.update_topics(text, topics=new_topics, vectorizer_model=vectorizer_model)
  return model

Above different representation models and vectorizer models are combared. I evaluated different models by looking at the topics and the words that represented them. It was a difficult choice, however, I found that the when it comes to representation model the none one worked well. For vectorizer the below settings gave good result. It omitts words that occure less than two times, handles stop words and allows for words that often occure together.



```
vectorizer_model = CountVectorizer(min_df=2, stop_words = stop_words, ngram_range=(1, 2))
```



## BERTopic on all data

From 2006 to 2022

We will take topics for:
- Combined questions and answers
- Question and answers as seperate entities but fitted and transformed with BERTopic at same time.

The models that is created here can be find in the models folder:
- For the model for the combined question and answer: `models/BERTopic_alldata_model_combined`.
- For the model for the question and answer seperate but fitted and transformed together: `models/BERTopic_alldata_model_qa`.

In the models folder each model also have a file for their topics and probabilities. I also tried to reduce the nr of topics, however, this creates one big topic (topic 0) that contain a big chunk of entries and its words are nonsense. Therefore, we decided to stick with the initial nr of topics that BERTopic decides.


In [7]:
# TODO: change path
f = open('../../data/data_all/data_all_final.json')
data_all = json.load(f)

In [None]:
len(data_all)

18997

In [8]:
# Check if there are any empty questions and answers. Don't use empty questions
eq = []
ea = []
data_clean = []
for entry in data_all:
  if len(entry['question']) == 0 and len(entry['answer']) == 0:
    print(entry)
  else:
    data_clean.append(entry)
  if len(entry['question']) == 0:
    eq.append(entry['question'])
  if len(entry['answer']) == 0:
    ea.append(entry['answer'])

print(len(eq))
print(len(ea))

{'id_': 'h411987', 'question': '', 'question_date': '2017-03-03', 'answer': '', 'undertecknare_name': 'Maria Malmer Stenergard', 'undertecknare_partibet': 'M', 'besvaradav_name': 'Finansminister Magdalena Andersson', 'besvaradav_partibet': 'S', 'regering': 49}
{'id_': 'h411987', 'question': '', 'question_date': '2017-03-03', 'answer': '', 'undertecknare_name': 'Maria Malmer Stenergard', 'undertecknare_partibet': 'M', 'besvaradav_name': 'Finansminister Magdalena Andersson', 'besvaradav_partibet': 'S', 'regering': 49}
{'id_': 'h211450', 'question': '', 'question_date': '2015-04-21', 'answer': '', 'undertecknare_name': 'Roger Haddad', 'undertecknare_partibet': 'FP', 'besvaradav_name': 'Helene Hellmark Knutsson', 'besvaradav_partibet': 'S', 'regering': 49}
3
137


In [9]:
# Generate lists of the different data needed
all_text_combined = []
all_questions = []
all_answers = []
all_qa = []

for entry in data_clean:
    question = entry['question']
    answer = entry['answer']

    # remove the to/from formality from question. NOTE: this can't be done on all data since format is different over time.
    #question2 = remove_names(question)

    all_questions.append(question)
    all_qa.append(question)

    # Answers
    all_answers.append(answer)
    all_qa.append(answer)

    # Combined questions and answers
    all_text_combined.append(question+answer)

### A try to reduce number of topics

**Note:** This is just a try to reduce the topics with BERTopic's setting ```nr_topics = 'auto'```. As mentioned eariler, we decidd to not use this since it created a big nonsense topic that only contained words such as "goverment", "question", "sweden" etc. and did not provide good information. This can be seen in the following code for the topic 0.



In [None]:
# Train the BERTopic model on a text that for each datapoint questions and answers are combined into one entry.
model_combined = BERTopic(language="swedish",
                          calculate_probabilities=True,
                          vectorizer_model=vectorizer_model,
                          nr_topics = 'auto')

topics_combined, probs_combined = model_combined.fit_transform(all_text_combined)

model_combined = reduce_outliers(model_combined, all_text_combined, topics_combined)



In [None]:
model_combined.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,16178,0_regeringen_fråga_sverige_åtgärder,"[regeringen, fråga, sverige, åtgärder, avser, ...",[av Elisabeth Svantesson (M)\ntill Justitie- o...
1,1,185,1_flygplatser_flygplats_arlanda_bromma,"[flygplatser, flygplats, arlanda, bromma, flyg...",[av Hans Wallmark (M)\ntill Infrastrukturminis...
2,2,165,2_israel_palestinska_gaza_palestina,"[israel, palestinska, gaza, palestina, israels...",[av Håkan Svenneling (V)\ntill Utrikesminister...
3,3,143,3_alkohol_gårdsförsäljning_systembolaget_syste...,"[alkohol, gårdsförsäljning, systembolaget, sys...",[av Boriana Åberg (M)\ntill Näringsminister Ib...
4,4,175,4_kultur_lena adelsohn_adelsohn liljeroth_lilj...,"[kultur, lena adelsohn, adelsohn liljeroth, li...",[\nden 9 mars\nFråga \n2006/07:839 Besöken på ...
5,5,186,5_kina_mänskliga_kinesiska_hongkong,"[kina, mänskliga, kinesiska, hongkong, kinas, ...",[av Hans Wallmark (M)\ntill Utrikesminister An...
6,6,104,6_msb_räddningstjänsten_fyrverkerier_bränder,"[msb, räddningstjänsten, fyrverkerier, bränder...",[av Johan Hultberg (M)\ntill Socialminister Le...
7,7,94,7_telefoni_telia_pts_tillgång,"[telefoni, telia, pts, tillgång, bredband, pos...",[\nden 23 november\nFråga \n2007/08:333 Telefo...
8,8,118,8_hälso_vård_cancervården_vården,"[hälso, vård, cancervården, vården, behandling...",[av Camilla Waltersson Grönvall (M)\ntill Soci...
9,9,79,9_idrotten_idrottsrörelsen_idrott_idrottsevene...,"[idrotten, idrottsrörelsen, idrott, idrottseve...",[\nden \r\n23 november\nFråga \n2011/12:171 \r...


In [None]:
model_qa = BERTopic(language="swedish",
                    calculate_probabilities=True,
                    vectorizer_model=vectorizer_model,
                    nr_topics = 'auto')

topics_qa, probs_qa = model_qa.fit_transform(all_qa)
model_qa = reduce_outliers(model_qa, all_qa, topics_qa)

model_qa.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,33673,0_regeringen_fråga_sverige_åtgärder,"[regeringen, fråga, sverige, åtgärder, avser, ...",[\n\n\nDnr A2015/02886/A\n\n\n\nArbetsmarknads...
1,1,272,1_alkohol_gårdsförsäljning_systembolaget_syste...,"[alkohol, gårdsförsäljning, systembolaget, sys...",[Svar på fråga 2019/20:1093 och 2019/20:1094 a...
2,2,134,2____,"[, , , , , , , , , ]","[, , ]"
3,3,141,3_taiwan_taiwans_who_kina,"[taiwan, taiwans, who, kina, ann linde, linde,...",[Svar på fråga 2019/20:1605 av Sara Gille (SD)...
4,4,162,4_msb_bränder_fyrverkerier_beredskap,"[msb, bränder, fyrverkerier, beredskap, brands...",[Svar på fråga 2021/22: 440 av Johan Hultberg ...
...,...,...,...,...,...
90,90,13,90_honduras_situationen honduras_cáceres_mexiko,"[honduras, situationen honduras, cáceres, mexi...",[\n\n\n\n\n\nUtrikesdepartementet\n\nUtrikesmi...
91,91,22,91_flygplatser_flyget_fråga 2019_2019 20,"[flygplatser, flyget, fråga 2019, 2019 20, 201...",[Svar på fråga 2019/20:1698 av Hans Wallmark (...
92,92,14,92_kryptovalutor_bitcoin_per bolund_valutor,"[kryptovalutor, bitcoin, per bolund, valutor, ...",[av Mikael Eskilandersson (SD)\ntill Statsråde...
93,93,33,93_kriminalvården_östersund_kriminalvårdens_fä...,"[kriminalvården, östersund, kriminalvårdens, f...",[\nden 4 september\nSvar på fråga\n2008/09:117...


We can also reduce topics afterhand by specifying a number

In [None]:
model_combined2 = BERTopic.load("/content/gdrive/MyDrive/IT5/Data/model_all_comb")
model_qa2 = BERTopic.load("/content/gdrive/MyDrive/IT5/Data/model_all_qa")

In [None]:
model_combined2.get_topic_info()
model_qa2.get_topic_info()
model_combined2.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,3502,0_regeringen_fråga_sverige_statsrådet,"[regeringen, fråga, sverige, statsrådet, åtgär...",[av Marléne Lund Kopparklint (M)\ntill Närings...
1,1,1195,1_finansminister_borg_anders borg_finansminist...,"[finansminister, borg, anders borg, finansmini...",[\nden \r\n19 januari\nFråga \n2010/11:228 \r\...
2,2,1143,2_brott_polisen_polismyndigheten_poliser,"[brott, polisen, polismyndigheten, poliser, mo...",[av Boriana Åberg (M)\ntill Justitie- och migr...
3,3,910,3_hälso_vård_vården_regeringen,"[hälso, vård, vården, regeringen, sjukvården, ...",[av Jenny Petersson (M)\ntill Socialminister A...
4,4,901,4_elever_utbildning_skolan_lärare,"[elever, utbildning, skolan, lärare, utbildnin...",[av Michael Rubbestad (SD)\ntill Utbildningsmi...
...,...,...,...,...,...
65,65,15,65_arktis_arktiska_arktiska rådet_rådet,"[arktis, arktiska, arktiska rådet, rådet, arkt...",[av Sofia Arkelsten (M)\ntill Utrikesminister ...
66,66,14,66_telefonförsäljning_avtal_telefon_ångerrätt,"[telefonförsäljning, avtal, telefon, ångerrätt...",[\nden 29 mars\nFråga \n2006/07:960 Konsuments...
67,67,14,67_julen_uppmaningar_semester_jul,"[julen, uppmaningar, semester, jul, smittsprid...",[av Björn Söder (SD)\ntill Försvarsminister Pe...
68,68,13,68_hemlöshet_hemlösa_barn_bostadsmarknaden,"[hemlöshet, hemlösa, barn, bostadsmarknaden, m...",[av Sara Gille (SD)\ntill Socialminister Lena ...


### Actual implementation (no reduced topics)

In [None]:
# Train the BERTopic model on a text that for each datapoint questions and answers are combined into one entry.
model_combined2 = BERTopic(language="swedish",
                          calculate_probabilities=True,
                          vectorizer_model=vectorizer_model)

topics_combined2, probs_combined2 = model_combined2.fit_transform(all_text_combined)

model_combined2 = reduce_outliers(model_combined2, all_text_combined, topics_combined2)
model_combined2.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,638,0_trafikverket_sj_eneroth_tomas eneroth,"[trafikverket, sj, eneroth, tomas eneroth, tom...",[av Roland Utbult (KD)\ntill Statsrådet Tomas ...
1,1,685,1_finansminister_skatteverket_borg_anders borg,"[finansminister, skatteverket, borg, anders bo...",[\nden \r\n22 juni\nFråga \n2010/11:615 \r\nÅt...
2,2,505,2_migrationsverket_asylsökande_uppehållstillst...,"[migrationsverket, asylsökande, uppehållstills...",[av Mikael Cederbratt (M)\ntill Justitie- och ...
3,3,439,3_hälso_vården_vård_sjukvården,"[hälso, vården, vård, sjukvården, hälso sjukvå...",[av Camilla Waltersson Grönvall (M)\ntill Soci...
4,4,370,4_bankerna_finansiella_banker_finansinspektionen,"[bankerna, finansiella, banker, finansinspekti...",[\nden \r\n10 februari\nFråga \n2013/14:408 \r...
...,...,...,...,...,...
281,281,12,281_ofredanden_sexuella_övergrepp_åberg,"[ofredanden, sexuella, övergrepp, åberg, festi...",[av Boriana Åberg (M)\ntill Statsrådet Anders ...
282,282,18,282_mali_hultqvist_peter hultqvist_insatsen,"[mali, hultqvist, peter hultqvist, insatsen, f...",[av Hans Wallmark (M)\ntill Statsrådet Matilda...
283,283,12,283_fetma_övervikt_diabetes_övervikt fetma,"[fetma, övervikt, diabetes, övervikt fetma, fy...",[av Camilla Waltersson Grönvall (M)\ntill Soci...
284,284,13,284_transportbidraget_transportbidrag_vilt_för...,"[transportbidraget, transportbidrag, vilt, för...",[av Eric Palmqvist (SD)\ntill Näringsminister ...


In [None]:
# import pickle

# model_combined2.save("models/BERTopic_alldata_model_combined", serialization="pickle")

# with open("models/BERTopic_alldata_topics_combined", "wb") as fp:
#   pickle.dump(topics_combined2, fp)

# with open("models/BERTopic_alldata_probs_combined", "wb") as fp:
#   pickle.dump(probs_combined2, fp)

# # Load model, topics and probabilities
# model_combined2 = BERTopic.load("models/BERTopic_alldata_model_combined")

# with open("models/BERTopic_alldata_topics_combined", "rb") as fp:
#   topics_combined2 = pickle.load(fp)

# with open("models/BERTopic_alldata_probs_combined", "rb") as fp:
#   probs_combined2 = pickle.load(fp)



In [None]:
model_qa2 = BERTopic(language="swedish",
                    calculate_probabilities=True,
                    vectorizer_model=vectorizer_model)

topics_qa2, probs_qa2 = model_qa2.fit_transform(all_qa)
model_qa2 = reduce_outliers(model_qa2, all_qa, topics_qa2)

model_qa2.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1974,0_elever_skolan_lärare_utbildning,"[elever, skolan, lärare, utbildning, utbildnin...",[Svar på fråga 2021/22:1040 av Lars Hjälmered ...
1,1,1325,1_migrationsverket_asylsökande_uppehållstillst...,"[migrationsverket, asylsökande, uppehållstills...",[\n\n\nDnr Ju2014/6975\n\n\n\nJustitiedepartem...
2,2,1145,2_finansminister_skatteverket_anders borg_borg,"[finansminister, skatteverket, anders borg, bo...",[\nden 27 januari\nSvar på frågorna\n2010/11:2...
3,3,681,3_bostäder_bostadsmarknaden_hyresrätter_bostad,"[bostäder, bostadsmarknaden, hyresrätter, bost...",[\nden 19 augusti\nSvar på fråga\n2007/08:1520...
4,4,491,4_etiopien_västsahara_somaliland_somalia,"[etiopien, västsahara, somaliland, somalia, fn...",[av Birgitta Ohlsson (L)\ntill Statsrådet Isab...
...,...,...,...,...,...
460,460,18,460_tiggeri_beredskapspolisen_tigger_utsatta eu,"[tiggeri, beredskapspolisen, tigger, utsatta e...",[av Richard Jomshof (SD)\ntill Justitieministe...
461,461,14,461_griskött_tyskt_högnivågruppen_svensk mat,"[griskött, tyskt, högnivågruppen, svensk mat, ...",[\nden 1 juni\nSvar på fråga\n2010/11:548 Gris...
462,462,17,462_ersättningsnämnden_upprättelse_samira_mari...,"[ersättningsnämnden, upprättelse, samira, mari...",[\nden 17 januari\nSvar på fråga\n2006/07:454 ...
463,463,23,463_järnvägar_underhåll_vägar järnvägar_vägar,"[järnvägar, underhåll, vägar järnvägar, vägar,...",[av Sten Bergheden (M)\ntill Infrastrukturmini...


In [None]:

# model_qa2.save("models/BERTopic_alldata_topics_model_qa", serialization="pickle")

# with open("models/BERTopic_alldata_topics_qa", "wb") as fp:
#   pickle.dump(topics_qa2, fp)

# with open("models/BERTopic_alldata_topics_probs_qa", "wb") as fp:
#   pickle.dump(probs_qa2, fp)

# # Load model, topics and probabilities
# model_qa2 = BERTopic.load("models/BERTopic_alldata_model_qa")

# with open("models/BERTopic_alldata_topics_qa", "rb") as fp:
#   topics_qa2 = pickle.load(fp)

# with open("models/BERTopic_alldata_probs_qa", "rb") as fp:
#   probs_qa2 = pickle.load(fp)

#### Save the topics with the data

The top 10 words for each topic and the topic ID is saved together with the existing json dataset containing data from 2006 to 2022. This is saved to the data folder

In [None]:
# To save the topics in the existing dataset
document_topic_combined2 = model_combined2.get_document_info(all_text_combined)

document_topic_qa2 = model_qa2.get_document_info(all_qa)

document_topic_q2 = document_topic_qa2.iloc[::2] # To get the questions we need to take every other entry
document_topic_a2 = document_topic_qa2.iloc[1::2]

# Index needs to be resetted
document_topic_q2.reset_index(drop=True, inplace=True)
document_topic_a2.reset_index(drop=True, inplace=True)
data2 = data_clean.copy()

columns_to_extract = ['Topic', 'Top_n_words']

for i, entry in enumerate(data2):
    # combined
    id_topic, words = document_topic_combined2.loc[i, columns_to_extract].values
    id_topic = id_topic.item()
    words = words.split(' - ')
    words = ', '.join(words)

    #questions
    id_topic_Q, words_Q = document_topic_q2.loc[i, columns_to_extract].values
    id_topic_Q = id_topic_Q.item()
    words_Q = words_Q.split(' - ')
    words_Q = ', '.join(words_Q)

    # To handle empty answers. Some questions don't have answers.
    if len(entry['answer']) == 0:
        # Answers
        id_topic_A = ''
        words_A =  []

    else:
        # Answers
        id_topic_A, words_A = document_topic_a2.loc[i, columns_to_extract].values
        id_topic_A = id_topic_A.item()
        words_A = words_A.split(' - ')
        words_A = ', '.join(words_A)

    new_element = { 'id_topic_combined': id_topic, 'top_10_words_combined': words,
                    'id_topic_question': id_topic_Q, 'top_10_words_question': words_Q,
                    'id_topic_answer': id_topic_A, 'top_10_words_answer': words_A}

    entry.update(new_element)

In [None]:
# Save the new updated dataset that now contains topics. TODO: chnage path
with open("../../data/data_all/data_all_topic_final.json", "w") as fp:
    json.dump(data2, fp)

In [None]:
# Example of an entry
data2[200]

{'id_': 'gx11906',
 'question': '\nden \r\n17 juni\nFråga \n2009/10:906 \r\nHandelsavtal som gynnar befolkningen\nav Ameer Sachet (s)\ntill statsrådet Ewa Björling (m)\nMineralrika utvecklingsländers befolkning borde \r\ngynnas mer av resurserna som finns. Därför borde avtalen mellan länderna och \r\nglobala gruvföretag utformas annorlunda än i dag, anser den New York-baserade \r\ntillsynsorganisationen Revenue Watch Institute. Enligt Revenue Watch Institute \r\när det i dag bolagen och ett fåtal högt uppsatta personer i länderna som ofta \r\när de stora vinnarna. Revenue Watch Institute arbetar för att befolkningen i \r\nutvecklingsländer ska få större del av vinsten från landets naturtillgångar, \r\nsom olja, gas och mineraler. De reser runt och försöker påverka regeringar och \r\nföretag. I dag är många av de avtal som gruvbolag och länder ingår med \r\nvarandra, inte något som befolkningen får insyn i. Det öppnar för korruption \r\noch landets rika mineraltillgångar gynnar inte bef

### Coherence score

To be able to take the coherence score of a BERTopic model i followed this github issue:
https://github.com/MaartenGr/BERTopic/issues/90

In [31]:
# Load model, topics and probabilities. TODO: Change path
topic_model_comb = BERTopic.load("models/BERTopic_alldata_model_combined")
topic_model_qa = BERTopic.load("models/BERTopic_alldata_model_qa")

with open("models/BERTopic_alldata_topics_combined", "rb") as fp:
  topics_comb = pickle.load(fp)

with open("models/BERTopic_alldata_topics_qa", "rb") as fp:
  topics_qa = pickle.load(fp)

In [27]:
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess documents
cleaned_docs = topic_model_comb._preprocess_text(all_text_combined)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model_comb.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model_comb.get_topic(topic)] for topic in range(len(set(topics))-1)]


In [None]:
# Evaluate
c_v_comb = CoherenceModel(topics=topic_words, texts=tokens, corpus=corpus,dictionary=dictionary, coherence='c_v')
coherence = c_v_comb.get_coherence()

In [29]:
coherence

0.6899322122524747

In [36]:
# Evaluate
u_mass_comb = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='u_mass')
coherence = u_mass_comb.get_coherence()

In [37]:
coherence

nan