### 4.2.2 Validation

As often stressed in the literature (cite), we need to revalidate the dictionary to see whether it fits the application we are trying to analyse. We therefore use gold standard validation to see how well the dictionary performs in comparison to human coders. We also check the intercoder reliability between our two authors, who acted as coders. <br>
Because context plays a major role in how emotions are expressed within a text, we need to be especially careful when using a non-specific dictionary to detect sentiment. We address this limitation in our validation approach.

First, we load the packages needed for the validation step. We look at accuracy, recall, precision and F1 score for the comparison with the gold standard of human-coded sentiment. In addition, we use Cohen's Kappa as a measure of the intercoder reliability of the human-coded data. These measures should prove useful in determining whether the dictionary performed well for our research questions and topic. <br>
We also implemented a small interface to perform the gold standard coding, which is why we need the ipywidgets library.

In [81]:
#import packages
import pickle

#get the scores
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
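from sklearn.metrics import cohen_kappa_score
#needed for the per-label confusion matrices further down
from sklearn.metrics import multilabel_confusion_matrix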

from functools import partial

import pandas as pd

import ipywidgets as widgets
from IPython.display import clear_output
from ipywidgets import IntProgress

import random
import numpy as np

from tqdm.notebook import tqdm
tqdm.pandas()

### Validation Twitter

Again we start off our validation with the sentiment scores for the Twitter data. First, we need to transform our dataset so that we can apply our gold standard validation. Until now, the sentiment was given as a polarity score between -1 and 1, so we needed a way to make this scoring system manageable for human coders. (How should they distinguish between a sentiment of 0.0001 and 0.0002?) To make our lives easier, we decided on a simple scoring system that sorts the tweets into positive, negative and neutral. This makes the classification less subjective, as we can for the most part agree on which messages have a positive or negative connotation. To generate the corresponding label from the dictionary approach, we classified the sentiment as positive if the polarity was positive and as negative if the polarity was negative. This leaves only tweets and speeches with a polarity of exactly 0 as neutral, which needs to be viewed critically, as the polarity score can be biased in one direction. We should therefore treat the neutral assignments made by the dictionary with care. Nevertheless, we can use this scoring system for our gold standard coding. <br>
After creating the new labels in a loop, we add them as a new column to our dataset.

In [2]:
#set up Twitter dataset for sentiment coding
pre_data_twitter= pickle.load(open('../data/processed/tweets_processed.p','rb'))
sentiment=[]
for polarity in pre_data_twitter['polarity_textblob']:
    if polarity>0:
        sentiment.append('Positive')
    elif polarity<0:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
pre_data_twitter['sentiment']=sentiment
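The threshold rule described above can also be written as a small reusable function, which would avoid repeating the loop for the speeches later. This is only a minimal sketch and not the code used for the reported results; the function name `polarity_to_label` and the toy polarity values are assumptions made for illustration.

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to a three-class sentiment label."""
    if polarity > 0:
        return 'Positive'
    if polarity < 0:
        return 'Negative'
    # only an exact score of 0 counts as neutral
    return 'Neutral'


# toy polarity values, made up for illustration
toy_polarities = [0.35, -0.2, 0.0, 0.0001]
toy_labels = [polarity_to_label(p) for p in toy_polarities]
print(toy_labels)
```

On a dataframe, the same function could be applied with `pre_data_twitter['polarity_textblob'].apply(polarity_to_label)`. Note that even a tiny score such as 0.0001 is labeled positive, which is one reason why the neutral class of the dictionary should be read with care.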

In preparation for the validation step, we define a function that lets us randomly select a certain number of tweets from our corpus. We do this by simply shuffling the data and then selecting the first rows up to the desired number.

In [3]:
#create a function to choose random tweets for manual coding
def create_sentiment_dataset(data, number):
    #shuffle all rows, then keep the first `number` of them
    data= data.sample(frac=1)
    #reset_index returns a new frame, which avoids modifying a slice in place
    data_test= data[0:number].reset_index(drop=True)
    return data_test
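Because the sample is drawn at random, rerunning the cell yields a different set of tweets each time. If the draw needs to be repeatable, pandas' `sample` accepts a `random_state` seed. The following is a sketch under that assumption; the function name `create_seeded_sentiment_dataset`, the seed value and the toy dataframe are hypothetical and not part of our pipeline.

```python
import pandas as pd


def create_seeded_sentiment_dataset(data, number, seed=42):
    """Draw `number` random rows; the same seed always gives the same rows."""
    sampled = data.sample(n=number, random_state=seed)
    return sampled.reset_index(drop=True)


# toy data standing in for the tweet corpus
toy = pd.DataFrame({'text': [f'tweet {k}' for k in range(100)]})
first_draw = create_seeded_sentiment_dataset(toy, 5)
second_draw = create_seeded_sentiment_dataset(toy, 5)
print(first_draw.equals(second_draw))
```

Each coder would need a different seed, otherwise both would receive the same rows.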

The next step is to create the interface for the manual coding. For this, we define a function that displays buttons the coder can press to select the sentiment while going through the randomly selected tweets. After the coder has labeled all the given tweets, we save the labels as a new column in the given dataframe and write the coded corpus to a file named after the coder.

In [4]:
#create the test interface
def sentiment_gold_dictionary_tweets(sentiment_df, name):
    max_count = sentiment_df.shape[0]
    global i
    i = 0

    button_0 = widgets.Button(description = "Positive")
    button_1 = widgets.Button(description = "Neutral")
    button_2 = widgets.Button(description = "Negative")
    
    chosen_elements = []

    display("Sentiment Gold Standard")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    display(sentiment_df.text_preprocessed_sentence[i])

    display(button_0)
    display(button_1)
    display(button_2)


    def btn_eventhandler(obj):
        global i 
        i += 1
        
        clear_output(wait=True)
        
        display("Sentiment Gold Standard")
        display(f)
        f.value += 1
                
        choosen_text = obj.description
        chosen_elements.append(choosen_text)
        
        if i < max_count:
            
            display(sentiment_df.text_preprocessed_sentence[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            
            button_0.on_click(btn_eventhandler)
            button_1.on_click(btn_eventhandler)
            button_2.on_click(btn_eventhandler)
            
        else:
            print ("Thanks " + name + " you finished all the work!")
            sentiment_df["choosen_sentiment"] = chosen_elements
            sentiment_df.to_csv("../data/processed/sentiment_gold_standard_tweets_" + name + ".csv", index = False)

    button_0.on_click(btn_eventhandler)
    button_1.on_click(btn_eventhandler)
    button_2.on_click(btn_eventhandler)
    
    return sentiment_df
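The bookkeeping inside `btn_eventhandler` (record the clicked label, move on to the next text, stop at the end) does not depend on the widgets themselves. A minimal sketch of that logic with a closure instead of a global counter could look as follows; `make_label_recorder` and its return values are hypothetical names for illustration and are not used in the interface above.

```python
def make_label_recorder(texts):
    """Return a `record` function that stores one label per text, in order."""
    labels = []

    def record(label):
        # store the label for the text currently shown
        labels.append(label)
        position = len(labels)
        if position < len(texts):
            # there is another text to show to the coder
            return texts[position]
        # all texts are labeled
        return None

    return record, labels


# toy texts standing in for sampled tweets
record, labels = make_label_recorder(['text a', 'text b', 'text c'])
next_text = record('Positive')   # coder clicked "Positive" for 'text a'
record('Neutral')
finished = record('Negative')
print(next_text, labels, finished)
```

Keeping the state inside the closure means that two labeling sessions in the same notebook cannot overwrite each other's counter, which can happen with `global i`.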

In our validation step, each coder manually labels 40 randomly selected tweets. We are well aware that the usual suggestion is to label at least 1% of the corpus manually when revalidating. As this would have meant labeling over 1000 tweets, we settled for fewer, in the same range as for the speeches later.

In [5]:
test_data=create_sentiment_dataset(pre_data_twitter, 40)

After selection, we labeled the tweets with the function defined above. While labeling, we of course get a glimpse of the tweets in the corpus. Most of the tweets seem to contain useful messages, but some were rather short or even just one word, as the corpus also seems to contain replies to other tweets.

In [6]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Stjepan')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Stjepan you finished all the work!


Once labeling is finished, the interface has saved the results to a CSV file, which we load here so we can analyze them later.

In [56]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Stjepan.csv')

The same labeling with 40 newly selected tweets is done by the second coder so we have a labeled corpus of 80 tweets in total. 

In [8]:
test_data=create_sentiment_dataset(pre_data_twitter, 40)
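Since both samples are drawn independently from the full corpus, a tweet could in principle end up in both coders' sets. For a corpus of our size this is unlikely, but it can be ruled out by drawing both sets in one go. The sketch below illustrates this idea; the function name `create_disjoint_datasets`, the seed and the toy dataframe are assumptions for illustration and were not used for the reported results.

```python
import pandas as pd


def create_disjoint_datasets(data, number, seed=0):
    """Draw two non-overlapping random samples of `number` rows each."""
    # sampling without replacement guarantees 2 * number distinct rows
    sampled = data.sample(n=2 * number, random_state=seed)
    first = sampled.iloc[:number].reset_index(drop=True)
    second = sampled.iloc[number:].reset_index(drop=True)
    return first, second


# toy data standing in for the tweet corpus
toy = pd.DataFrame({'tweet_id': range(200)})
coder_a, coder_b = create_disjoint_datasets(toy, 40)
overlap = set(coder_a['tweet_id']) & set(coder_b['tweet_id'])
print(len(coder_a), len(coder_b), len(overlap))
```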

In [9]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Jakob')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Jakob you finished all the work!


In [57]:
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Jakob.csv')

Afterwards, the labeled tweets are combined into one dataframe so we can analyze them.

In [58]:
test_sentiment_both=pd.concat([test_sentiment1,test_sentiment2])

As mentioned, we want to use different metrics to see how the dictionary performed in the sentiment analysis. We first look at F1 and accuracy to get a feeling for the performance and then at recall and precision to gain further insights into how this performance was achieved. <br>
For the whole corpus of labeled tweets, accuracy and F1 were rather similar at around 54% to 55%. At first glance this seems like a bad result, but considering the performance of dictionary approaches in general, a score in this region is close to the best one can hope for, given that we did not tune the dictionary to fit our problem. Recall (54%) and precision (60%) also do not differ too much from the F1 score. <br>
We can display the confusion matrices for the different labels in our validation corpus to see what kinds of errors the dictionary made while classifying. The pictures are quite different across labels, but overall the dictionary classifier makes both kinds of errors. All in all, the performance of the TextBlob dictionary seems to be fine.

In [48]:
f2=f1_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted')
print('F1 Score:',f2)
accuracy2=accuracy_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
print('Accuracy Score:',accuracy2)
precision2=precision_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Precision Score:',precision2)
recall2=recall_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Recall Score:',recall2)
cm = multilabel_confusion_matrix(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
display(cm)

F1 Score: 0.5494074201305393
Accuracy Score: 0.5375
Precision Score: 0.6026515151515152
Recall Score: 0.5375


array([[[48,  2],
        [14, 16]],

       [[29, 20],
        [11, 20]],

       [[46, 15],
        [12,  7]]])
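Each 2x2 block returned by `multilabel_confusion_matrix` has the layout `[[TN, FP], [FN, TP]]`, and without an explicit `labels` argument the blocks follow the sorted label order (here Negative, Neutral, Positive). As a sketch of how the reported scores relate to these counts, the per-class precision and recall can be recomputed from the array printed above. The variable names are chosen for illustration only.

```python
import numpy as np

# confusion matrices as printed above; label order assumed to be sorted
labels = ['Negative', 'Neutral', 'Positive']
cm = np.array([[[48, 2], [14, 16]],
               [[29, 20], [11, 20]],
               [[46, 15], [12, 7]]])

fp = cm[:, 0, 1]
fn = cm[:, 1, 0]
tp = cm[:, 1, 1]

precision = tp / (tp + fp)      # share of dictionary labels that were right
recall = tp / (tp + fn)         # share of human labels the dictionary found
support = tp + fn               # number of human labels per class

for name, p, r, s in zip(labels, precision, recall, support):
    print(f'{name}: precision={p:.2f} recall={r:.2f} support={s}')

# weighted averages, as computed with average='weighted'
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)
accuracy = tp.sum() / support.sum()
print(weighted_precision, weighted_recall, accuracy)
```

The weighted recall equals the accuracy by construction, which is why both are reported as 0.5375. The per-class view also shows that the dictionary is most precise on negative tweets (16 of 18 predictions correct) and least precise on positive ones (7 of 22).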

In the next step, we look at the intercoder reliability of our manually coded tweets. This gives us a measure of how reliable the previous results are. Only with a relatively good score here can we be confident that we have a sound gold standard. We randomly select 10 tweets that are labeled by both coders and perform the analysis on this basis.

In [19]:
test_data=create_sentiment_dataset(pre_data_twitter, 10)

Then, we again perform the manual coding, this time with both coders having the same tweets.

In [20]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Stjepan_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Stjepan_inter you finished all the work!


In [21]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Jakob_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Jakob_inter you finished all the work!


Afterwards, we load the results of the manual coding from the CSV files so we can access them at will.

In [22]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Stjepan_inter.csv')
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Jakob_inter.csv')


In the end, we calculate Cohen's Kappa as our reliability measure, for which we use the function implemented in the sklearn library. Our score of roughly 0.68 is an acceptable level of intercoder reliability. There is certainly room for improvement, but considering that we had no detailed coding manual for the coders and relied on simple sentiment impressions, this seems like a satisfying result.

In [23]:
#kappa
kappa= cohen_kappa_score(test_sentiment1['choosen_sentiment'],test_sentiment2['choosen_sentiment'])
print(kappa)

0.6825396825396826
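Cohen's Kappa compares the observed agreement between two coders with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). To make this concrete, the following sketch computes the measure by hand on made-up labels. The function name `cohen_kappa` and the toy labels are assumptions for illustration; the reported value above comes from sklearn's `cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    n = len(labels_a)
    # observed agreement: share of items with identical labels
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement if both coders labeled independently
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# toy labels, made up for illustration (not our coded tweets)
coder_a = ['Positive', 'Positive', 'Negative', 'Negative', 'Neutral', 'Neutral']
coder_b = ['Positive', 'Positive', 'Negative', 'Neutral', 'Neutral', 'Neutral']
print(cohen_kappa(coder_a, coder_b))
```

In the toy example the coders agree on 5 of 6 items (observed agreement of about 0.83), the chance agreement is about 0.33, and the resulting kappa is 0.75.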


### Validation Speeches

Next up are the speeches, for which we of course also have to revalidate the dictionary approach. Again we load the data and apply our code to generate a new column containing the three-class sentiment labels.

In [63]:
#set up Speeches dataset for sentiment coding
pre_data_speeches= pickle.load(open('../data/processed/speeches_processed.p','rb'))
sentiment=[]
for polarity in pre_data_speeches['polarity_textblob']:
    if polarity>0:
        sentiment.append('Positive')
    elif polarity<0:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
pre_data_speeches['sentiment']=sentiment

Then we again set up a function to randomly sample a given number of speeches that we want to code. (Note that it is the same function as above.)

In [64]:
#create a function to choose random speeches for manual coding
def create_sentiment_dataset(data, number):
    #shuffle all rows, then keep the first `number` of them
    data= data.sample(frac=1)
    #reset_index returns a new frame, which avoids modifying a slice in place
    data_test= data[0:number].reset_index(drop=True)
    return data_test

Afterwards, we create the interface for the manual coding of the speeches by implementing widgets that let us select the sentiment for the sampled speeches, as above.

In [65]:
#create the test interface
def sentiment_gold_dictionary_speeches(sentiment_df, name):
    max_count = sentiment_df.shape[0]
    global i
    i = 0

    button_0 = widgets.Button(description = "Positive")
    button_1 = widgets.Button(description = "Neutral")
    button_2 = widgets.Button(description = "Negative")
    
    chosen_elements = []

    display("Sentiment Gold Standard")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    display(sentiment_df.text_preprocessed_sentence[i])

    display(button_0)
    display(button_1)
    display(button_2)


    def btn_eventhandler(obj):
        global i 
        i += 1
        
        clear_output(wait=True)
        
        display("Sentiment Gold Standard")
        display(f)
        f.value += 1
                
        choosen_text = obj.description
        chosen_elements.append(choosen_text)
        
        if i < max_count:
            
            display(sentiment_df.text_preprocessed_sentence[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            
            button_0.on_click(btn_eventhandler)
            button_1.on_click(btn_eventhandler)
            button_2.on_click(btn_eventhandler)
            
        else:
            print ("Thanks " + name + " you finished all the work!")
            sentiment_df["choosen_sentiment"] = chosen_elements
            sentiment_df.to_csv("../data/processed/sentiment_gold_standard_speeches_" + name + ".csv", index = False)

    button_0.on_click(btn_eventhandler)
    button_1.on_click(btn_eventhandler)
    button_2.on_click(btn_eventhandler)
    
    return sentiment_df

Again we randomly choose 40 speeches to be coded by hand. This time this is around 1% of our speech corpus, as we have significantly fewer speeches from the 19th Bundestag. Each coder again labels the sentiment of 40 different speeches. The results are saved as CSV files so they can be accessed later.

In [66]:
test_data=create_sentiment_dataset(pre_data_speeches, 40)

In [67]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Stjepan')

In [68]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Stjepan.csv')

In [69]:
test_data=create_sentiment_dataset(pre_data_speeches, 40)

In [70]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Jakob')

In [71]:
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Jakob.csv')

After combining the two labeled corpora, we can again apply our evaluation metrics. This time the scores are lower overall, which leads us to believe that the dictionary performed worse on the Bundestag speeches. The F1 score drops by about 15 percentage points (from 0.55 to 0.40), which is a considerably lower result than before. The only score that does not drop much is precision (0.58 compared to 0.60). Reasons for this drop in performance could lie in the greater length of the speeches compared to the tweets and the higher complexity of the texts. We can thus cautiously assume that the dictionary struggles with longer texts, as more words influence the sentiment and the sentiment can change between different parts of a longer text.

In [72]:
test_sentiment_both=pd.concat([test_sentiment1,test_sentiment2])

In [73]:
f2=f1_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted')
print('F1 Score:',f2)
accuracy2=accuracy_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
print('Accuracy Score:',accuracy2)
precision2=precision_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Precision Score:',precision2)
recall2=recall_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Recall Score:',recall2)
cm = multilabel_confusion_matrix(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
display(cm)

F1 Score: 0.3975587794617645
Accuracy Score: 0.3875
Precision Score: 0.5754140866873065
Recall Score: 0.3875


array([[[36,  7],
        [25, 12]],

       [[50,  3],
        [20,  7]],

       [[25, 39],
        [ 4, 12]]])
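One way to read these matrices is to compare how often each label was assigned by the coders and by the dictionary. With the layout `[[TN, FP], [FN, TP]]` per label, the coders' count for a label is FN + TP and the dictionary's count is FP + TP. The sketch below recomputes both from the array printed above; the sorted label order (Negative, Neutral, Positive) is assumed and the variable names are chosen for illustration.

```python
import numpy as np

# confusion matrices as printed above; label order assumed to be sorted
labels = ['Negative', 'Neutral', 'Positive']
cm = np.array([[[36, 7], [25, 12]],
               [[50, 3], [20, 7]],
               [[25, 39], [4, 12]]])

human_counts = cm[:, 1, 0] + cm[:, 1, 1]        # FN + TP: labels given by coders
dictionary_counts = cm[:, 0, 1] + cm[:, 1, 1]   # FP + TP: labels given by TextBlob

for name, h, d in zip(labels, human_counts, dictionary_counts):
    print(f'{name}: coders={h} dictionary={d}')
```

Read this way, the dictionary labeled 51 of the 80 speeches as positive while the coders did so for only 16, which suggests that a good part of the drop may come from the dictionary leaning positive on longer texts.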

After these measures, we also want to look at the intercoder reliability of the human coders. For this we again take 10 speeches, which are coded by both authors, to determine the reliability of the results. We also save these results for future analysis.

In [74]:
test_data=create_sentiment_dataset(pre_data_speeches, 10)

In [76]:
test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Stjepan_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Stjepan_inter you finished all the work!


In [77]:
test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Jakob_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Jakob_inter you finished all the work!


In [78]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Stjepan_inter.csv')
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Jakob_inter.csv')

In the end, we again compute Cohen's Kappa with the help of the sklearn library. We get a value of 0.53, which is around 0.15 lower than for the tweets. Again this does not seem surprising, as the speeches are more complex in their structure and can address multiple issues, or one issue from different perspectives. In view of that, the Kappa score seems reasonable, but again there is definitely room for improvement.

In [79]:
#kappa
kappa= cohen_kappa_score(test_sentiment1['choosen_sentiment'],test_sentiment2['choosen_sentiment'])

In [80]:
print(kappa)

0.53125


In conclusion, our revalidation showed that the dictionary used to answer our research questions on sentiment has performed well for a dictionary. As already suggested in (cite), dictionaries seem to struggle to give brilliant results in automated media content analysis due to their limited capacities and strong assumptions. Nevertheless, for our research they offered a useful way to gain first insights into the corpus and to make the analysis reproducible. As mentioned before, dictionaries should be viewed as a first approach, and a next step should try out other techniques such as machine learning and semi-supervised approaches.