### 4.1.2 Validation

To validate the results of the previous section, we use word and topic intrusion tests based on [Reading Tea Leaves: How Humans Interpret Topic Models](https://proceedings.neurips.cc/paper/2009/file/f92586a25bb3145facd64ab20fd554ff-Paper.pdf). We implement an annotation interface and evaluate the labels assigned by human annotators, in this case the two authors.

In [None]:
from bertopic import BERTopic

import pickle

from sklearn.metrics import cohen_kappa_score

from functools import partial

import pandas as pd

import ipywidgets as widgets
from IPython.display import clear_output
from ipywidgets import IntProgress

import random
import numpy as np

from tqdm.notebook import tqdm
tqdm.pandas()

#### 4.1.2.1 Word intrusion

Word intrusion measures the coherence of topics. We show annotators five high-probability keywords of a particular topic together with an intruder keyword from another topic and ask them to identify the intruder. The model precision, as measured by the word intrusion score, is then defined as the number of times the intruder keyword was chosen divided by the number of topics shown.
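The definition above can be illustrated with a minimal sketch. The function name `word_intrusion_precision` and the example words are hypothetical and only serve to show the arithmetic; they are not part of the notebook's pipeline or its results.

```python
def word_intrusion_precision(intruder_words, chosen_words):
    """Fraction of trials in which the annotator picked the true intruder."""
    if len(intruder_words) != len(chosen_words):
        raise ValueError("Both lists must have one entry per trial.")
    if not intruder_words:
        raise ValueError("At least one trial is required.")
    hits = sum(
        intruder == chosen
        for intruder, chosen in zip(intruder_words, chosen_words)
    )
    return hits / len(intruder_words)


# Made-up annotations for four topics (illustration only, not real data)
intruders = ["tax", "football", "vaccine", "climate"]
chosen = ["tax", "goal", "vaccine", "climate"]

precision = word_intrusion_precision(intruders, chosen)
print(precision)  # 0.75, since three of the four intruders were identified
```

A score of 1.0 would mean that every intruder was found, which suggests that the top keywords of each topic form a coherent group.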

##### 4.1.2.1.1 Define functions

Before we can run the word intrusion task, we need to define a set of helper functions. We also create a simple interface so that the task can be carried out directly in the notebook cells.

In [None]:
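# Pick a random topic id that differs from `index`.
# BERTopic topic ids run from -1 (outlier topic) to number_documents - 2,
# so the exclusive upper bound of randrange must be number_documents - 1.
def choose_random_document(index, number_documents):
    rand_document = random.randrange(-1, number_documents - 1)
    if rand_document != index:
        return rand_document
    else:
        return choose_random_document(index, number_documents)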

In [None]:
# Function for creating a word intrusion dataset
def create_word_intrusion_dataset(topic_model):
    number_documents = len(topic_model.get_topics())
    records_list = []
    for i in range(number_documents): 
        word_list = []
        for j in range(5):
            word_list.append(topic_model.get_topic(i-1)[j][0])
        intruder_word = topic_model.get_topic(choose_random_document(i-1, number_documents))[0][0]
        # Five topic words allow six insert positions (0-5) for the intruder
        intruder_position = random.randrange(6)
        word_list.insert(intruder_position, intruder_word)
        word_list.append(intruder_word)
        word_list.append(intruder_position)
        records_list.append(word_list)
    word_intrusion_df = pd.DataFrame.from_records(records_list)
    word_intrusion_df.columns = ["word_0", "word_1", "word_2", "word_3", "word_4", "word_5", 
                                 "intruder_word", "intruder_index"]
    return word_intrusion_df

In [None]:
# A function that divides the word intrusion dataset into separate sets for the annotators
def generate_annotator_set(df, number_label, number_iaa, name_1, name_2):
    length = df.shape[0]
    if 2*number_label + number_iaa > length:
        raise ValueError("Too many labels for the size of the dataframe")
    # Shuffle the rows so that the assignment to annotators is random
    df_shuffled = df.sample(frac=1).reset_index(drop=True)
    # Rows labelled by annotator 1 and annotator 2; the overlap is the IAA set
    df_shuffled[name_1] = [1] * (number_label+number_iaa) + [0] * (length-number_label-number_iaa)
    df_shuffled[name_2] = [0] * (number_label) + [1] * (number_label+number_iaa) + [0] * (length-2*number_label-number_iaa)
    # Rows labelled by both annotators (used for inter-annotator agreement)
    df_shuffled["iaa_flag"] = [0] * number_label + [1] * number_iaa + [0] * (length-number_label-number_iaa)
    # Rows that count towards the intrusion score (everything outside the IAA set)
    df_shuffled["wis_label"] = [1] * number_label + [0] * number_iaa + [1] * (length-number_label-number_iaa)
    return df_shuffled
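To make the split easier to follow, the flag columns can be reproduced on a toy example with plain lists. This is only an illustrative sketch: `assignment_masks` is a hypothetical helper that mirrors the list arithmetic of `generate_annotator_set` without pandas or shuffling, and the sizes are made up.

```python
def assignment_masks(length, number_label, number_iaa):
    """Return the four 0/1 flag columns as plain lists (illustrative helper)."""
    rest_1 = length - number_label - number_iaa
    rest_2 = length - 2 * number_label - number_iaa
    if rest_2 < 0:
        raise ValueError("Too many labels for the number of rows")
    return {
        "annotator_1": [1] * (number_label + number_iaa) + [0] * rest_1,
        "annotator_2": [0] * number_label + [1] * (number_label + number_iaa) + [0] * rest_2,
        "iaa_flag": [0] * number_label + [1] * number_iaa + [0] * rest_1,
        "wis_label": [1] * number_label + [0] * number_iaa + [1] * rest_1,
    }


# Toy sizes: 10 rows, 3 exclusive rows per annotator, 2 shared rows
masks = assignment_masks(length=10, number_label=3, number_iaa=2)

# Rows seen by both annotators are exactly the rows flagged for agreement
shared = [a and b for a, b in zip(masks["annotator_1"], masks["annotator_2"])]

print(masks["annotator_1"])         # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(masks["annotator_2"])         # [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print(shared == masks["iaa_flag"])  # True
```

Each annotator therefore labels `number_label + number_iaa` rows, and only the `number_iaa` shared rows enter the agreement calculation.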

In [None]:
# A function that offers an interface in Jupyter notebook for the word intrusion task
def word_intrusion_test(word_df, name, medium):
    intrusion_df = word_df[word_df[name] == 1].reset_index(drop = True)
    
    max_count = intrusion_df.shape[0]
    global i
    i = 0
    
    button_0 = widgets.Button(description = intrusion_df.word_0[i])
    button_1 = widgets.Button(description = intrusion_df.word_1[i])
    button_2 = widgets.Button(description = intrusion_df.word_2[i])
    button_3 = widgets.Button(description = intrusion_df.word_3[i])
    button_4 = widgets.Button(description = intrusion_df.word_4[i])
    button_5 = widgets.Button(description = intrusion_df.word_5[i])


    chosen_words = []
    chosen_positions= []

    display("Word Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)
    display(button_4)
    display(button_5)


    def btn_eventhandler(position, obj):
        global i 
        i += 1
        
        
        clear_output(wait=True)
        
        display("Word Intrusion Test")
        display(f)
        f.value += 1
        
        chosen_text = obj.description
        chosen_words.append(chosen_text)
        
        chosen_positions.append(position)
        
        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.word_0[i])
            button_1 = widgets.Button(description = intrusion_df.word_1[i])
            button_2 = widgets.Button(description = intrusion_df.word_2[i])
            button_3 = widgets.Button(description = intrusion_df.word_3[i])
            button_4 = widgets.Button(description = intrusion_df.word_4[i])
            button_5 = widgets.Button(description = intrusion_df.word_5[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)
            display(button_4)
            display(button_5)
            
            button_0.on_click(partial(btn_eventhandler,0))
            button_1.on_click(partial(btn_eventhandler,1))
            button_2.on_click(partial(btn_eventhandler,2))
            button_3.on_click(partial(btn_eventhandler,3))
            button_4.on_click(partial(btn_eventhandler,4))
            button_5.on_click(partial(btn_eventhandler,5))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_word"] = chosen_words
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df.to_csv("../data/processed/word_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0))
    button_1.on_click(partial(btn_eventhandler,1))
    button_2.on_click(partial(btn_eventhandler,2))
    button_3.on_click(partial(btn_eventhandler,3))
    button_4.on_click(partial(btn_eventhandler,4))
    button_5.on_click(partial(btn_eventhandler,5))
    
    return intrusion_df

In [None]:
# Calculate the word intrusion score for the two annotator sets
def calculate_word_intrusion(name_1, name_2, medium):
    df_word_intrusion_1 = pd.read_csv("../data/processed/word_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_word_intrusion_2 = pd.read_csv("../data/processed/word_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows labelled by both annotators
    iaa_values_1 = df_word_intrusion_1[df_word_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_word_intrusion_2[df_word_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    df_word_intrusion = pd.concat([df_word_intrusion_1, df_word_intrusion_2])
    # Copy the filtered frame to avoid a SettingWithCopyWarning when adding a column
    df_word = df_word_intrusion[df_word_intrusion["wis_label"] == 1].copy()
    df_word["intruder_chosen"] = df_word["intruder_word"] == df_word["chosen_word"]
    return df_word["intruder_chosen"].mean(), kappa

##### 4.1.2.1.2 Validation of tweets topic model

Using the functions defined above, we run the word intrusion task for the tweets BERTopic model. The annotation is done by the two authors.

In [None]:
# Load model
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
word_intrusion_dataset_tweets = create_word_intrusion_dataset(topic_model_tweets)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_tweets_label = generate_annotator_set(word_intrusion_dataset_tweets, 45, 11, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# Uncomment if annotation is repeated
# df_word_intrusion_jakob_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# Uncomment if annotation is repeated
# df_word_intrusion_stjepan_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_tweets, word_kappa_tweets = calculate_word_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_tweets,2)))

Our inter-annotator agreement is at a satisfactory level and indicates good consensus between the two annotators.
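For reference, Cohen's kappa relates the observed agreement $p_o$ to the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The following sketch computes the unweighted statistic by hand on made-up button positions. The helper `cohen_kappa` is a hypothetical re-implementation for illustration only; the notebook itself relies on `cohen_kappa_score` from scikit-learn.

```python
from collections import Counter


def cohen_kappa(labels_1, labels_2):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("Both annotators need the same non-zero number of labels.")
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1:
        # Both annotators always used the same single label; kappa is undefined
        return float("nan")
    return (observed - expected) / (1 - expected)


# Made-up chosen positions for six shared items (not the authors' annotations)
positions_1 = [0, 1, 2, 2, 5, 3]
positions_2 = [0, 1, 2, 3, 5, 3]

kappa = cohen_kappa(positions_1, positions_2)
print(round(kappa, 2))  # 0.79
```

In this toy case the annotators agree on five of six items ($p_o = 5/6$) while chance agreement is $p_e = 7/36$, which gives $\kappa = 23/29 \approx 0.79$.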

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_tweets,2)))

We see a good intrusion score, as many of the intruder words were detected. The result could be improved further by addressing the limitations of our model identified earlier.

##### 4.1.2.1.3 Validation of speeches topic model

In [None]:
# Load model
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
word_intrusion_dataset_speeches = create_word_intrusion_dataset(topic_model_speeches)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_speeches_label = generate_annotator_set(word_intrusion_dataset_speeches, 10, 5, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_word_intrusion_jakob_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_word_intrusion_stjepan_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_speeches, word_kappa_speeches = calculate_word_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_speeches,2)))

Our inter-annotator agreement is at a satisfactory level and indicates good consensus between the two annotators.

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_speeches,2)))

We see a good intrusion score, as many of the intruder words were detected. The result could be improved further by addressing the limitations of our model identified earlier.

#### 4.1.2.2 Topic Intrusion

By measuring the topic intrusion score, we test whether the algorithm's probability distribution over topics for a document matches human assessment. We show an excerpt of the document, the three topics with the highest probability for this document, and a random low-probability topic (the intruder). To calculate the topic intrusion score, we take the mean of the differences between the log probability of the true intruder topic and the log probability of the selected topic.
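The score described above can be sketched in a few lines. The helper `topic_log_odds` and the probabilities are hypothetical and only illustrate the formula; the sketch assumes that all probabilities are strictly positive, since the logarithm of zero is undefined.

```python
import math


def topic_log_odds(intruder_probabilities, chosen_probabilities):
    """Mean of log p(intruder topic) - log p(chosen topic) over all trials."""
    if len(intruder_probabilities) != len(chosen_probabilities):
        raise ValueError("Both lists must have one entry per trial.")
    if not intruder_probabilities:
        raise ValueError("At least one trial is required.")
    differences = [
        math.log(p_intruder) - math.log(p_chosen)
        for p_intruder, p_chosen in zip(intruder_probabilities, chosen_probabilities)
    ]
    return sum(differences) / len(differences)


# Made-up values for three documents (illustration only)
intruder = [0.01, 0.02, 0.05]
# The annotator found the intruder in trials 1 and 3,
# but picked a high-probability topic in trial 2
chosen = [0.01, 0.40, 0.05]

score = topic_log_odds(intruder, chosen)
print(score)  # roughly -1.0; correct picks contribute 0, the miss contributes log(0.05)
```

When the annotator always selects the intruder, every difference is 0. The more probable the wrongly selected topics are relative to the intruder, the more negative the score becomes.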

In [None]:
# Create a function that combines key words into a single string
def create_topic_string(topic_info):
    word_list = []
    for i in range(8):
        word_list.append(topic_info[i][0])
    return ", ".join(word_list)

In [None]:
# Create a function that prepares the topic intrusion dataset
def create_topic_intrusion_dataset(data, topic_model, topic_probabilities, test_number = 100):
    number_documents = data.shape[0]
    if number_documents < test_number:
        raise ValueError("test_number cannot exceed the number of documents")
    number_topics = len(topic_model.get_topics())
    records_list = []
    for i in range(test_number): 
        topic_list = []
        high_probability_documents = sorted(zip(topic_probabilities[i].tolist(), list(range(number_topics))), reverse=True)[:3]
        low_probability_documents = sorted(zip(topic_probabilities[i].tolist(), list(range(number_topics))), reverse=True)[3:]
        for j in range(3):
            topic_index = high_probability_documents[j][1]
            topic_list.append(create_topic_string(topic_model.get_topic(topic_index)))
        intruder_document = low_probability_documents[random.randrange(number_topics-4)]
        intruder_topic = create_topic_string(topic_model.get_topic(intruder_document[1]))
        intruder_position = random.randrange(4)
        topic_list.insert(intruder_position, intruder_topic)
        for k in range(3):
            topic_list.append(high_probability_documents[k][0])
        topic_list.insert(intruder_position + 4, intruder_document[0])
        topic_list.append(intruder_topic)
        topic_list.append(intruder_document[0])
        topic_list.append(intruder_position)
        topic_list.append(data["text"][i])
        records_list.append(topic_list)
    df = pd.DataFrame.from_records(records_list)
    df.columns = ["topic_0", "topic_1", "topic_2", "topic_3","probability_topic_0","probability_topic_1",
                  "probability_topic_2","probability_topic_3", "intruder_topic", "intruder_topic_probability",
                  "intruder_index", "text"]
    return df

In [None]:
# Create a function that generates the interface for the topic intrusion test
def topic_intrusion_test(intrusion_df, name, medium):
    intrusion_df = intrusion_df[intrusion_df[name] == 1].reset_index(drop = True)
    
    max_count = intrusion_df.shape[0]
    global i
    i = 0
    
    layout = widgets.Layout(width='auto')

    button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
    button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
    button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
    button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)
    
    chosen_elements = []
    chosen_positions = []
    chosen_probabilities = []

    display("Topic Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    if len(intrusion_df.text[i]) < 1100:
        display(intrusion_df.text[i][0:1100])
    else :
        display(intrusion_df.text[i][100:1100])

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)


    def btn_eventhandler(position, column, obj):
        
        global i
        
        clear_output(wait=True)
        
        display("Topic Intrusion Test")
        display(f)
        f.value += 1
                
        chosen_text = obj.description
        chosen_elements.append(chosen_text)
        chosen_positions.append(position)
        chosen_probabilities.append(intrusion_df[column][i])
        
        i += 1
        
        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
            button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
            button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
            button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)
            
            if len(intrusion_df.text[i]) < 1100:
                display(intrusion_df.text[i][0:1100])
            else :
                display(intrusion_df.text[i][100:1100])
            
            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)
            
            button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
            button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
            button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
            button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_topic"] = chosen_elements
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df["chosen_topic_probability"] = chosen_probabilities
            intrusion_df.to_csv("../data/processed/topic_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
    button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
    button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
    button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))
    
    return intrusion_df

In [None]:
# Create a function to calculate the topic intrusion score
def calculate_topic_intrusion(name_1, name_2, medium):
    df_topic_intrusion_1 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_topic_intrusion_2 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows labelled by both annotators
    iaa_values_1 = df_topic_intrusion_1[df_topic_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_topic_intrusion_2[df_topic_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    df_topic_intrusion = pd.concat([df_topic_intrusion_1, df_topic_intrusion_2])
    # Copy the filtered frame to avoid a SettingWithCopyWarning when adding a column
    df_topic = df_topic_intrusion[df_topic_intrusion["wis_label"] == 1].copy()
    # Log odds: log probability of the true intruder minus that of the chosen topic
    df_topic["intruder_score"] = np.log(df_topic["intruder_topic_probability"]) - np.log(df_topic["chosen_topic_probability"])
    return df_topic["intruder_score"].mean(), kappa

##### 4.1.2.2.1 Validation of tweets topic model

In the first step, we calculate the validation score for the tweets BERTopic model.

In [None]:
# Load data
with open( "../data/processed/tweets_processed_bert.pickle", "rb" ) as handle:
    tweets_processed_bert = pickle.load(handle)
with open('../data/processed/probabilities_tweets_bert.pickle', 'rb') as handle:
    topic_probabilities_tweets = pickle.load(handle)

In [None]:
# Load model
# topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_tweets = create_topic_intrusion_dataset(tweets_processed_bert, topic_model_tweets,
                                                               topic_probabilities_tweets, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_tweets_label = generate_annotator_set(topic_intrusion_dataset_tweets, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_tweets, topic_kappa_tweets = calculate_topic_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_tweets,2)))

Our inter-annotator agreement is at a satisfactory level and indicates good consensus between the two annotators.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_tweets,2)))

It is difficult to evaluate the resulting topic intrusion score objectively. Compared with the results reported in the article, however, the score appears at least satisfactory and supports the validity of our model.

##### 4.1.2.2.2 Validation of speeches topic model

In [None]:
# Load data
with open( "../data/processed/speeches_processed_bert.pickle", "rb" ) as handle:
    speeches_processed_bert = pickle.load(handle).reset_index(drop = True)
with open('../data/processed/probabilities_speeches_bert.pickle', 'rb') as handle:
    topic_probabilities_speeches = pickle.load(handle)

In [None]:
# Load model
# topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_speeches = create_topic_intrusion_dataset(speeches_processed_bert, topic_model_speeches,
                                                               topic_probabilities_speeches, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_speeches_label = generate_annotator_set(topic_intrusion_dataset_speeches, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_speeches, topic_kappa_speeches = calculate_topic_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_speeches,2)))

The inter-annotator agreement on this task is rather low. We expected this: the speeches are generally long, so it is difficult to infer the right topics from a short excerpt.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_speeches,2)))

It is difficult to evaluate the resulting topic intrusion score objectively. Compared with the results reported in the article, however, the score appears at least satisfactory and supports the validity of our model.

#### 4.1.2.3 Conclusion

Based on the topic and word intrusion measures evaluated in this section, we can infer a satisfactory validity of our models. There are several possibilities for improvement, and we identified several limitations in the results section; nevertheless, the models still offer interesting insights.