### 4.2.2 Validation

As often stressed in the literature (cite), a dictionary has to be revalidated to check whether it fits the application under analysis. We therefore use gold standard validation to see how well the dictionaries perform in comparison to human coders. We also check the intercoder reliability between the two authors who coded the data. <br>
Because context plays a major role in how emotions are expressed in a text, we need to be especially careful when using a non-specific dictionary to detect sentiment. Our validation approach addresses this limitation.

First, we load the packages needed for the validation step. We look at accuracy, recall, precision and F1 score for the comparison with the gold standard of human-coded sentiment. In addition, we use Cohen's Kappa to score the intercoder reliability of the human-coded data. These measures should prove useful in determining whether the dictionary performed well for our research questions and topic. <br>
We also implemented a small interface to perform the gold standard validation, which is why we need the ipywidgets library.
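
To illustrate the four gold standard measures, here is a minimal sketch on a handful of made-up labels. The lists `human` and `dictionary` and the dict `toy_scores` are hypothetical and are not taken from our corpus. With `average='weighted'`, recall is weighted by the number of human-coded items per class, which makes it mathematically identical to accuracy.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical toy labels (assumption: not taken from our data)
human = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
dictionary = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]

# y_true is the human coding, y_pred the dictionary label
toy_scores = {
    "accuracy": accuracy_score(human, dictionary),
    "precision": precision_score(human, dictionary, average="weighted", zero_division=0),
    "recall": recall_score(human, dictionary, average="weighted", zero_division=0),
    "f1": f1_score(human, dictionary, average="weighted", zero_division=0),
}

for name, value in toy_scores.items():
    print(f"{name}: {value:.3f}")
```

Because weighted recall and accuracy coincide, identical values for the two in a results table are expected and do not indicate an error.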

In [1]:
#import packages
import pickle

#get the scores
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import cohen_kappa_score

from functools import partial

import pandas as pd

import ipywidgets as widgets
from IPython.display import clear_output, display
from ipywidgets import IntProgress

import random
import numpy as np

from tqdm.notebook import tqdm
tqdm.pandas()

### Validation Twitter

Again we start our validation with the sentiment scores for the Twitter data. First, we need to transform our dataset so that we can apply our gold standard validation. Until now, the sentiment was available as a polarity score between -1 and 1, so we needed a way to make this scoring system manageable for human coders. (How should they distinguish between a sentiment of 0.0001 and 0.0002?) To make our lives easier, we decided on a simple scoring system that sorts the tweets into positive, negative and neutral. This makes the classification less subjective, as we can for the most part agree on which messages have positive and negative connotations. To generate the corresponding label from the dictionary approach, we classified the sentiment as positive if the polarity was positive and negative if the polarity was negative. This leaves only tweets and speeches with a polarity of exactly 0 as neutral, which needs to be viewed critically, as the polarity score can be biased in one direction. So we should treat the neutral assignments made by the dictionary with care. Nevertheless, for our gold standard coding we can use this scoring system. <br>
After creating the new labels in a loop, we add them as a new column to our dataset.

In [2]:
#set up Twitter dataset for sentiment coding
pre_data_twitter= pickle.load(open('../data/processed/tweets_processed.p','rb'))
sentiment=[]
for polarity in pre_data_twitter['polarity_textblob']:
    if polarity>0:
        sentiment.append('Positive')
    elif polarity<0:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
pre_data_twitter['sentiment']=sentiment
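
The same mapping can also be written without an explicit loop. The following is only a sketch of an alternative. The helper name `polarity_to_label` and the toy dataframe are hypothetical, and the column name `polarity_textblob` is reused from the cell above.

```python
import numpy as np
import pandas as pd


def polarity_to_label(polarity):
    """Map polarity scores in [-1, 1] to 'Positive', 'Negative' or 'Neutral'."""
    polarity = pd.Series(polarity, dtype=float)
    # First matching condition wins; everything else (exactly 0) is Neutral
    labels = np.select(
        [polarity > 0, polarity < 0],
        ["Positive", "Negative"],
        default="Neutral",
    )
    return pd.Series(labels, index=polarity.index)


# Hypothetical toy data (assumption: not our tweets)
toy = pd.DataFrame({"polarity_textblob": [0.35, -0.10, 0.0, 0.0001]})
toy["sentiment"] = polarity_to_label(toy["polarity_textblob"])
print(toy)
```

The last toy row shows the caveat mentioned above: a polarity of 0.0001 already counts as positive, so only scores of exactly 0 end up as neutral.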

In preparation for the validation step, we define a function that lets us randomly select a given number of tweets from our corpus. We do this by simply shuffling the data and then selecting the first rows up to the desired number.

In [3]:
#create a function to choose random tweets for manual coding
def create_sentiment_dataset(data, number):
    #shuffle all rows, keep the first `number` and renumber them from 0
    data_test = data.sample(frac=1).head(number).reset_index(drop=True)
    return data_test
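
Since the sample is drawn at random, rerunning the notebook yields a different set of tweets each time. If a reproducible sample is wanted, one option is to pass a seed to `DataFrame.sample`. The sketch below assumes a hypothetical variant `create_sentiment_sample` with a `seed` parameter; neither is part of the function we actually used.

```python
import pandas as pd


def create_sentiment_sample(data, number, seed=None):
    """Return `number` randomly chosen rows with a fresh 0..n-1 index."""
    # random_state makes the shuffle repeatable when a seed is given
    shuffled = data.sample(frac=1, random_state=seed)
    return shuffled.head(number).reset_index(drop=True)


# Hypothetical toy corpus (assumption: not our tweets)
toy = pd.DataFrame({"text": [f"tweet {k}" for k in range(100)]})
first = create_sentiment_sample(toy, 5, seed=42)
second = create_sentiment_sample(toy, 5, seed=42)
print(first.equals(second))
```

With the same seed both calls return the same rows, so the coded files can be matched to the sample again later.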

The next step is to create the interface for the validity testing. For this, we define a function that displays buttons the coder can press to select the sentiment while going through the randomly selected tweets. After the coder has labelled all the given tweets, we save the labels as a new column of the given dataframe and write the coded corpus to a file for that coder.

In [4]:
#create the test interface
def sentiment_gold_dictionary_tweets(sentiment_df, name):
    max_count = sentiment_df.shape[0]
    global i
    i = 0

    button_0 = widgets.Button(description = "Positive")
    button_1 = widgets.Button(description = "Neutral")
    button_2 = widgets.Button(description = "Negative")
    
    chosen_elements = []

    display("Sentiment Gold Standard")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    display(sentiment_df.text_preprocessed_sentence[i])

    display(button_0)
    display(button_1)
    display(button_2)


    def btn_eventhandler(obj):
        global i 
        i += 1
        
        clear_output(wait=True)
        
        display("Sentiment Gold Standard")
        display(f)
        f.value += 1
                
        choosen_text = obj.description
        chosen_elements.append(choosen_text)
        
        if i < max_count:
            
            display(sentiment_df.text_preprocessed_sentence[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            
            button_0.on_click(btn_eventhandler)
            button_1.on_click(btn_eventhandler)
            button_2.on_click(btn_eventhandler)
            
        else:
            print ("Thanks " + name + " you finished all the work!")
            sentiment_df["choosen_sentiment"] = chosen_elements
            sentiment_df.to_csv("../data/processed/sentiment_gold_standard_tweets_" + name + ".csv", index = False)

    button_0.on_click(btn_eventhandler)
    button_1.on_click(btn_eventhandler)
    button_2.on_click(btn_eventhandler)
    
    return sentiment_df

In [None]:
In our validation step, we use 40...

In [5]:
test_data=create_sentiment_dataset(pre_data_twitter, 40)

In [6]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Stjepan')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Stjepan you finished all the work!


In [7]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Stjepan.csv')

In [8]:
test_data=create_sentiment_dataset(pre_data_twitter, 40)

In [9]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Jakob')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Jakob you finished all the work!


In [10]:
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Jakob.csv')

In [11]:
test_sentiment_both=pd.concat([test_sentiment1,test_sentiment2])

In [12]:
f2=f1_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted')
print('F1 Score:',f2)
accuracy2=accuracy_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
print('Accuracy Score:',accuracy2)
precision2=precision_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Precision Score:',precision2)
recall2=recall_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Recall Score:',recall2)

F1 Score: 0.5494074201305393
Accuracy Score: 0.5375
Precision Score: 0.6026515151515152
Recall Score: 0.5375


In [13]:
for label in ['Positive', 'Negative', 'Neutral']:
    test_for_score=test_sentiment_both.loc[test_sentiment_both['choosen_sentiment']==label]
    f2=f1_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted')
    accuracy2=accuracy_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'])
    precision2=precision_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted',zero_division=1)
    recall2=recall_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted',zero_division=1)
    print('F1 Score for',label,':',f2)
    print('Accuracy Score for',label,':',accuracy2)
    print('Precision Score for',label,':',precision2)
    print('Recall Score for',label,':',recall2)
    print( )

F1 Score for Positive : 0.5384615384615384
Accuracy Score for Positive : 0.3684210526315789
Precision Score for Positive : 1.0
Recall Score for Positive : 0.3684210526315789

F1 Score for Negative : 0.6956521739130436
Accuracy Score for Negative : 0.5333333333333333
Precision Score for Negative : 1.0
Recall Score for Negative : 0.5333333333333333

F1 Score for Neutral : 0.7843137254901961
Accuracy Score for Neutral : 0.6451612903225806
Precision Score for Neutral : 1.0
Recall Score for Neutral : 0.6451612903225806
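
Note that the precision of 1.0 for every label follows from the way the subset is built: after filtering to the rows a human coded with one label, no false positives for that label can occur, so precision is 1.0 by construction and accuracy equals recall. A sketch of how per-class scores could instead be computed on the full sample is shown below. It uses `precision_recall_fscore_support` with `average=None` on made-up labels; the lists `human` and `dictionary` are hypothetical.

```python
from sklearn.metrics import precision_recall_fscore_support

labels = ["Positive", "Negative", "Neutral"]

# Hypothetical toy labels (assumption: not our coded tweets)
human = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]
dictionary = ["Positive", "Neutral", "Negative", "Positive", "Neutral", "Neutral"]

# average=None returns one value per class, in the order given by `labels`
precision, recall, f1, support = precision_recall_fscore_support(
    human, dictionary, labels=labels, average=None, zero_division=0
)

for label, p, r, f, n in zip(labels, precision, recall, f1, support):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} n={n}")
```

Computed this way, precision for a class also counts the items the dictionary assigned to that class although the human coder chose a different label.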



In [19]:
test_data=create_sentiment_dataset(pre_data_twitter, 10)

In [20]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Stjepan_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Stjepan_inter you finished all the work!


In [21]:
#test_sentiment=sentiment_gold_dictionary_tweets(test_data,'Jakob_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Jakob_inter you finished all the work!


In [22]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Stjepan_inter.csv')
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_tweets_Jakob_inter.csv')


In [23]:
#kappa
kappa= cohen_kappa_score(test_sentiment1['choosen_sentiment'],test_sentiment2['choosen_sentiment'])
print(kappa)

0.6825396825396826
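
For reference, Cohen's Kappa compares the observed agreement $p_o$ between the two coders with the agreement $p_e$ expected by chance from their label frequencies: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below recomputes it by hand on made-up codings and compares the result with `cohen_kappa_score`. The lists `coder_a` and `coder_b` are hypothetical and are not our coding files.

```python
from collections import Counter

from sklearn.metrics import cohen_kappa_score

# Hypothetical toy codings of ten items (assumption: not our data)
coder_a = ["Positive", "Positive", "Negative", "Neutral", "Neutral",
           "Negative", "Positive", "Neutral", "Negative", "Neutral"]
coder_b = ["Positive", "Neutral", "Negative", "Neutral", "Neutral",
           "Negative", "Positive", "Positive", "Negative", "Neutral"]

n = len(coder_a)

# Observed agreement: share of items with identical labels
p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement: product of both coders' label shares, summed over labels
count_a, count_b = Counter(coder_a), Counter(coder_b)
p_e = sum(count_a[label] * count_b[label] for label in count_a) / n**2

kappa_manual = (p_o - p_e) / (1 - p_e)
print(kappa_manual, cohen_kappa_score(coder_a, coder_b))
```

With only ten items, a single disagreement shifts the observed agreement by ten percentage points, so a Kappa from such a small sample should be read as a rough indication.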


### Validation Speeches

In [25]:
#set up Speeches dataset for sentiment coding
pre_data_speeches= pickle.load(open('../data/processed/speeches_processed.p','rb'))
sentiment=[]
for polarity in pre_data_speeches['polarity_textblob']:
    if polarity>0:
        sentiment.append('Positive')
    elif polarity<0:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
pre_data_speeches['sentiment']=sentiment

In [26]:
#create a function to choose random speeches for manual coding
def create_sentiment_dataset(data, number):
    #shuffle all rows, keep the first `number` and renumber them from 0
    data_test = data.sample(frac=1).head(number).reset_index(drop=True)
    return data_test

In [27]:
#create the test interface
def sentiment_gold_dictionary_speeches(sentiment_df, name):
    max_count = sentiment_df.shape[0]
    global i
    i = 0

    button_0 = widgets.Button(description = "Positive")
    button_1 = widgets.Button(description = "Neutral")
    button_2 = widgets.Button(description = "Negative")
    
    chosen_elements = []

    display("Sentiment Gold Standard")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    display(sentiment_df.text_preprocessed_sentence[i])

    display(button_0)
    display(button_1)
    display(button_2)


    def btn_eventhandler(obj):
        global i 
        i += 1
        
        clear_output(wait=True)
        
        display("Sentiment Gold Standard")
        display(f)
        f.value += 1
                
        choosen_text = obj.description
        chosen_elements.append(choosen_text)
        
        if i < max_count:
            
            display(sentiment_df.text_preprocessed_sentence[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            
            button_0.on_click(btn_eventhandler)
            button_1.on_click(btn_eventhandler)
            button_2.on_click(btn_eventhandler)
            
        else:
            print ("Thanks " + name + " you finished all the work!")
            sentiment_df["choosen_sentiment"] = chosen_elements
            sentiment_df.to_csv("../data/processed/sentiment_gold_standard_speeches_" + name + ".csv", index = False)

    button_0.on_click(btn_eventhandler)
    button_1.on_click(btn_eventhandler)
    button_2.on_click(btn_eventhandler)
    
    return sentiment_df

In [28]:
test_data=create_sentiment_dataset(pre_data_speeches, 40)

In [29]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Stjepan')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Stjepan you finished all the work!


In [30]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Stjepan.csv')

In [31]:
test_data=create_sentiment_dataset(pre_data_speeches, 40)

In [32]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Jakob')

'Sentiment Gold Standard'

IntProgress(value=39, max=40)

Thanks Jakob you finished all the work!


In [33]:
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Jakob.csv')

In [34]:
test_sentiment_both=pd.concat([test_sentiment1,test_sentiment2])

In [35]:
f2=f1_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted')
print('F1 Score:',f2)
accuracy2=accuracy_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'])
print('Accuracy Score:',accuracy2)
precision2=precision_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Precision Score:',precision2)
recall2=recall_score(test_sentiment_both['choosen_sentiment'], test_sentiment_both['sentiment'], average='weighted',zero_division=1)
print('Recall Score:',recall2)

F1 Score: 0.3975587794617645
Accuracy Score: 0.3875
Precision Score: 0.5754140866873065
Recall Score: 0.3875
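
To see where the dictionary and the human coders disagree, a cross-tabulation of the two label columns can help. The following is a sketch on made-up rows. The dataframe `toy` is hypothetical; only the column names `choosen_sentiment` and `sentiment` are taken from our coded files.

```python
import pandas as pd

# Hypothetical toy rows (assumption: not our coded speeches)
toy = pd.DataFrame({
    "choosen_sentiment": ["Negative", "Negative", "Neutral",
                          "Positive", "Neutral", "Negative"],
    "sentiment": ["Positive", "Negative", "Positive",
                  "Positive", "Neutral", "Positive"],
})

# Rows: human label, columns: dictionary label
confusion = pd.crosstab(
    toy["choosen_sentiment"], toy["sentiment"],
    rownames=["human"], colnames=["dictionary"],
)
print(confusion)
```

Off-diagonal cells show the type of error, for example how many items the human coder labelled as negative while the dictionary scored them as positive.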


In [36]:
for label in ['Positive', 'Negative', 'Neutral']:
    test_for_score=test_sentiment_both.loc[test_sentiment_both['choosen_sentiment']==label]
    f2=f1_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted')
    accuracy2=accuracy_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'])
    precision2=precision_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted',zero_division=1)
    recall2=recall_score(test_for_score['choosen_sentiment'], test_for_score['sentiment'], average='weighted',zero_division=1)
    print('F1 Score for ',label,':',f2)
    print('Accuracy Score for ',label,':',accuracy2)
    print('Precision Score for ',label,':',precision2)
    print('Recall Score for ',label,':',recall2)
    print( )

F1 Score for  Positive : 0.8571428571428571
Accuracy Score for  Positive : 0.75
Precision Score for  Positive : 1.0
Recall Score for  Positive : 0.75

F1 Score for  Negative : 0.4897959183673469
Accuracy Score for  Negative : 0.32432432432432434
Precision Score for  Negative : 1.0
Recall Score for  Negative : 0.32432432432432434

F1 Score for  Neutral : 0.4117647058823529
Accuracy Score for  Neutral : 0.25925925925925924
Precision Score for  Neutral : 1.0
Recall Score for  Neutral : 0.25925925925925924



In [37]:
test_data=create_sentiment_dataset(pre_data_speeches, 10)

In [38]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Stjepan_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Stjepan_inter you finished all the work!


In [39]:
#test_sentiment=sentiment_gold_dictionary_speeches(test_data,'Jakob_inter')

'Sentiment Gold Standard'

IntProgress(value=9, max=10)

Thanks Jakob_inter you finished all the work!


In [40]:
test_sentiment1=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Stjepan_inter.csv')
test_sentiment2=pd.read_csv('../data/processed/sentiment_gold_standard_speeches_Jakob_inter.csv')

In [None]:
#kappa
kappa= cohen_kappa_score(test_sentiment1['choosen_sentiment'],test_sentiment2['choosen_sentiment'])

In [43]:
print(kappa)

0.509348
