# Annotations Data Postprocessing

This notebook performs the following steps: 

1. <b>Load Annotated Data</b>
   - Load the ```.jsonl``` files that were downloaded directly from the [Doccano labeling application](https://doccano.github.io/doccano/) 

2. <b>Merge Annotated Data </b>
   - Merge each individual annotators results into a single DataFrame per annotator
   - Remove any duplicates in the individual DataFrames
   - Merge all annnotators results into a single DataFrame

3. <b>Adjust for Exact Triplet Annotations</b>
   - Check that each tweet (```message_id```)is contained in the merged results exactly 3 times (not just once or twice), indicating it was annotated 3x
   - Drop any tweets that aren't triplets

4. <b>Organize and Sort Data</b>
   - Sort the DataFrame by ```message_id``` so that all tweet-triplets are next to each other

5. <b>Inter-Annotator Agreement Analysis</b>
   - Define a function to identify continuous ranges and the most frequent word within these ranges
   - Initialize the final df for storing inter-annotator agreements
   - Iterate over each unique 'message_id':
       - Collect all annotations associated with the 'message_id'
       - For each annotation, extract and process the annotated ranges and emotions 
       - Filter out annotations not supported by at least 2 annotators (including “none”)
       - Apply the continuous ranges function to identify agreed-upon labels
       - Update the final df with the agreed-upon labels

6. <b>Save the Final Dataset</b>
   - Save the df containing inter-annotator agreements to a CSV file

7. <b>Reformat the Data into a .txt File for GRACE Model Training



### Set-Up

In [1]:
# ------------- Data Handling -----------------------------------------
import pandas as pd                                                     # Powerful data structures for data analysis, time series, and statistics
from collections import Counter                                         # A container that keeps count of the elements in an iterable
import ast                                                              # A Python module that helps to process trees of the Python abstract syntax grammar
from sklearn.model_selection import train_test_split

# ------------- Plotting Data -----------------------------------------
import matplotlib.pyplot as plt                                         # A plotting library for creating static, interactive, and animated visualizations in Python
import matplotlib.ticker as ticker                                      # Provides classes for configuring tick locating and formatting
import numpy as np


### 1. Load Data

In [2]:
# @title Annotator 1

a1_1 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a1_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a1_1)}')

a1_2 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a1_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a1_2)}')

a1_3 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a1_3/aspect-emotion.jsonl", lines=True)
print(f'third subset: {len(a1_3)}')

a1_4 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a1_4/aspect-emotion.jsonl", lines=True)
print(f'fourth subset: {len(a1_4)}')


# merge individual dfs into one for the annotator
a1_with_duplicates = pd.concat([a1_1, a1_2, a1_3, a1_4], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a1_with_duplicates)}')

# remove any duplications (by "message_id")
a1 = a1_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a1)}')

# save final annotated dataet for the annotator to a csv file
# a1.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator1.csv", index=False)
# print('Saved to drive.')

first subset: 104
second subset: 203
third subset: 667
fourth subset: 1170

final, merged dataset: 2144

duplicates removed: 1274


In [3]:
# @title Annotator 2
a2_1 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a2_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a2_1)}')

a2_2 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a2_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a2_2)}')

a2_3 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a2_3/aspect-emotion.jsonl", lines=True)
print(f'third subset: {len(a2_3)}')

a2_4 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a2_4/aspect-emotion.jsonl", lines=True)
print(f'fourth subset: {len(a2_4)}')


# merge individual dfs into one for the annotator
a2_with_duplicates = pd.concat([a2_1, a2_2, a2_3, a2_4], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a2_with_duplicates)}')

# remove any duplications (by "message_id")
a2 = a2_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a2)}')

# save final annotated dataet for the annotator to a csv file
# a2.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator2.csv", index=False)
# print('Saved to drive.')

first subset: 104
second subset: 104
third subset: 667
fourth subset: 1188

final, merged dataset: 2063

duplicates removed: 1292


In [4]:
# @title Annotator 3
a3_1 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a3_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a3_1)}')

a3_2 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a3_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a3_2)}')

a3_3 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a3_3/aspect-emotion.jsonl", lines=True)
print(f'third subset: {len(a3_3)}')

a3_4 = pd.read_json("Data/Training Data/1 - Annotation Results/First 1K/1K_a3_4/aspect-emotion.jsonl", lines=True)
print(f'fourth subset: {len(a3_4)}')


# merge individual dfs into one for the annotator
a3_with_duplicates = pd.concat([a3_1, a3_2, a3_3, a3_4], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a3_with_duplicates)}')

# remove any duplications (by "message_id")
a3 = a3_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a3)}')

# save final annotated dataet for the annotator to a csv file
# a3.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator3.csv", index=False)
# print('Saved to drive.')

first subset: 103
second subset: 203
third subset: 667
fourth subset: 1170

final, merged dataset: 2143

duplicates removed: 1273


In [5]:
# @title Annotator 4
a4_1 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a1_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a4_1)}')

a4_2 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a1_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a4_2)}')


# merge individual dfs into one for the annotator
a4_with_duplicates = pd.concat([a4_1, a4_2], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a4_with_duplicates)}')

# remove any duplications (by "message_id")
a4 = a4_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a4)}')

# save final annotated dataet for the annotator to a csv file
# a4.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator4.csv", index=False)
# print('Saved to drive.')

first subset: 500
second subset: 1000

final, merged dataset: 1500

duplicates removed: 1000


In [6]:
# @title Annotator 5
a5_1 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a2_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a5_1)}')

a5_2 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a2_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a5_2)}')


# merge individual dfs into one for the annotator
a5_with_duplicates = pd.concat([a5_1, a5_2], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a5_with_duplicates)}')

# remove any duplications (by "message_id")
a5 = a5_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a5)}')

# save final annotated dataet for the annotator to a csv file
# a5.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator5.csv", index=False)
# print('Saved to drive.')

first subset: 501
second subset: 1000

final, merged dataset: 1501

duplicates removed: 1000


In [7]:
# @title Annotator 6

a6_1 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a3_1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a6_1)}')

a6_2 = pd.read_json("Data/Training Data/1 - Annotation Results/Second 1K/2K_a3_2/aspect-emotion.jsonl", lines=True)
print(f'second subset: {len(a6_2)}')


# merge individual dfs into one for the annotator
a6_with_duplicates = pd.concat([a6_1, a6_2], ignore_index=True)
print(f'\nfinal, merged dataset: {len(a6_with_duplicates)}')

# remove any duplications (by "message_id")
a6 = a6_with_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(a6)}')

# save final annotated dataet for the annotator to a csv file
# a6.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator6.csv", index=False)
# print('Saved to drive.')

first subset: 500
second subset: 1000

final, merged dataset: 1500

duplicates removed: 1000


In [8]:
# @title Annotator 7
a7 = pd.read_json("Data/Training Data/1 - Annotation Results/Third 1K/3K_a1/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a7)}')

# save final annotated dataet for the annotator to a csv file
# a7.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator7.csv", index=False)
# print('Saved to drive.')

first subset: 501


In [9]:
# @title Annotator 8
a8 = pd.read_json("Data/Training Data/1 - Annotation Results/Third 1K/3K_a2/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a8)}')

# save final annotated dataet for the annotator to a csv file
# a8.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator8.csv", index=False)
# print('Saved to drive.')

first subset: 501


In [10]:
# @title Annotator 9
a9 = pd.read_json("Data/Training Data/1 - Annotation Results/Third 1K/3K_a3/aspect-emotion.jsonl", lines=True)
print(f'first subset: {len(a9)}')

# save final annotated dataet for the annotator to a csv file
# a9.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/annotator9.csv", index=False)
# print('Saved to drive.')

first subset: 500


### Check how many tweets were annotated overall


In [11]:
# @title All tweets that were annotated

# merge ALL individual dfs into one
all_tweets_annotated_w_duplicates = pd.concat([a1, a2, a3, a4, a5, a6, a7, a8, a9], ignore_index=True)
print(f'\nfinal, merged dataset: {len(all_tweets_annotated_w_duplicates)}')

# remove any duplications (by "message_id")
all_tweets_annotated = all_tweets_annotated_w_duplicates.drop_duplicates(subset='message_id', keep='first')
print(f'\nduplicates removed: {len(all_tweets_annotated)}')


final, merged dataset: 8341

duplicates removed: 2800


In [12]:
# @title All tweets that were annotated (adjusted for exact 3 duplicates)

# merge ALL individual dfs into one
all_tweets_annotated_w_duplicates = pd.concat([a1, a2, a3, a4, a5, a6, a7, a8, a9], ignore_index=True)
print(f'total, merged dataset: {len(all_tweets_annotated_w_duplicates)}')

# Group by 'message_id' and filter groups that have exactly 3 entries
all_tweets_annotated_temp = all_tweets_annotated_w_duplicates.groupby('message_id').filter(lambda x: len(x) == 3)
print(f'\nAll tweets, where the tweet was annotated by 3 annotators: {len(all_tweets_annotated_temp)}')

# save df with all tweets where there are 3 duplicates of each message_id as "all_3_duplicate_tweets.csv"
# all_tweets_annotated_temp.to_csv("/content/drive/MyDrive/Msc AGI/Master/Jan 2024 ABEA Training Dataset Labeling/Final Annotations/all_tweets_w_three_duplicates.csv", index=False)

# Now, drop duplicates to keep only one instance of each 'message_id'
all_tweets_annotated = all_tweets_annotated_temp.drop_duplicates(subset='message_id', keep='first')

# reset the index for cleanliness
all_tweets_annotated.reset_index(drop=True, inplace=True)

print(f'\nRows with exactly 3 duplicates, then removed all duplicates: {len(all_tweets_annotated)}')


total, merged dataset: 8341

All tweets, where the tweet was annotated by 3 annotators: 8298

Rows with exactly 3 duplicates, then removed all duplicates: 2766


### 2. - 4. Sorted DF with all 3 annotations

(all 3 annotations are next to grouped together in the df)

In [13]:
total_annotations = all_tweets_annotated_temp

print(len(total_annotations))

8298


In [14]:
# Sort the DataFrame by 'message_id'
sorted_df = total_annotations.sort_values(by='message_id')

# Reset the index after sorting if needed
sorted_df.reset_index(drop=True, inplace=True)

# sorted_df.to_csv("Data/Training Data/1 - Annotation Results/Final Results/all_tweets_w_three_duplicates_sorted.csv", index=False)
sorted_df.head(15)


Unnamed: 0,id,text,message_id,label,Comments
0,5521,"01.09.20\n\nAfter a handful of days off, the l...",1300734393309331456,"[[39, 49, Anger], [95, 98, Anger]]",[]
1,8185,"01.09.20\n\nAfter a handful of days off, the l...",1300734393309331456,"[[39, 49, Anger], [95, 98, Anger]]",[]
2,6853,"01.09.20\n\nAfter a handful of days off, the l...",1300734393309331456,"[[43, 50, Anger], [95, 99, Anger]]",[]
3,11120,Leopard? Spots? Wrong https://t.co/jZZpyQWUFi,1301600936926871552,[],[]
4,13786,Leopard? Spots? Wrong https://t.co/jZZpyQWUFi,1301600936926871552,[],[]
5,12453,Leopard? Spots? Wrong https://t.co/jZZpyQWUFi,1301600936926871552,[],[]
6,1895,The new breakfast line up. https://t.co/Lof0Dn...,1301784141223071744,[],[]
7,562,The new breakfast line up. https://t.co/Lof0Dn...,1301784141223071744,[],[]
8,3228,The new breakfast line up. https://t.co/Lof0Dn...,1301784141223071744,[],[]
9,5558,@robdgill Not to mention income multiples,1301946973088616448,[],[]


### 5. Check for Inter Annotator Agreements and Add to Final DF

In [15]:
def find_dominant_emotion(emotions):
    '''
    Determines the dominant emotion from a list of emotions. If there is no single dominant emotion, returns None
    '''
    emotion_counts = Counter(emotions).most_common(3)
    if len(emotion_counts) > 1 and emotion_counts[0][1] == emotion_counts[1][1]:
        return None                                                                     # No dominant emotion if the top two are equal
    return emotion_counts[0][0]                                                         # Return the emotion with the highest count

# function to check for overlapping index ranges
def find_cont_ranges_and_dom_emotions(d):
    '''
    Function that takes a dictionary as input, where the keys are character indices (e.g. 1) and the values are a list of emotions that were annotated for that character,
    and finds continuous character ranges (e.g. for a word or several words) and determines what the dominant emotion is that was labeled by 3 annotators
    Input: dictionary with keys that are character indexes and values that is a list of any/all labeled emotions for that character
    Output: A list of tuples with (start_idx, end_idx, emotion), where the range and the emotion are agreen upon by at least 2 annotators
    '''

    sorted_keys = sorted(d.keys())                                                      # Sort input dict by key values, which are character indexes
    ranges = []                                                                         # Init a list to store continuous character ranges

    current_range_start = sorted_keys[0]                                                # Begin at first index
    current_range_end = current_range_start                                             # Set end index to first index, but update as we go
    emotions_for_this_range = d[sorted_keys[0]]                                         # Init list for storing all emotions associated with this range of characters, and add first emotions

    for i in range(1, len(sorted_keys)):                                                # Iterate over each of the character indexes
        if sorted_keys[i] == sorted_keys[i - 1] + 1:                                    # Check if the current key continues the range
            current_range_end = sorted_keys[i]                                          # Update range end to current index
            emotions_for_this_range.extend(d[sorted_keys[i]])                           # Add the list of emotions from this character to the "all emotions" list
        else:                                                                           # If the current key doesn't continue the current range (meaning end of current annotated range is reached)
            dominant_emotion = find_dominant_emotion(emotions_for_this_range)           # Get the dominant emotion for this annotated range
            if dominant_emotion:                                                        # If not none, add to final results list
                ranges.append([current_range_start, current_range_end, dominant_emotion])
            current_range_start = current_range_end = sorted_keys[i]                    # Update start index 
            emotions_for_this_range = d[sorted_keys[i]]                                 # Get current emotion labels

    # Handle the last range
    dominant_emotion = find_dominant_emotion(emotions_for_this_range)
    if dominant_emotion:
        ranges.append([current_range_start, current_range_end, dominant_emotion])

    return ranges

In [51]:
# Set up final DF and add column for final label (which annotators agreed on)
final_iaa_df = all_tweets_annotated
final_iaa_df['label'] = ''

# Iterate through groups of message_ids (= always 3 tweets, one from each annotator), get the three current labels
for message_id, group in sorted_df.groupby('message_id'): 
  iaa_dict = {}                                                   # Initialise a dict for current tweet

  for i in range(3):                                              # For each annotator, get their label, i.e. annotated ranges
    annotator_label = group.iloc[i]['label']                      # e.g. [[0, 21, 'Happiness'], [128, 142, 'Sadness'], [205, 214, 'Sadness']]

    if len(annotator_label) == 0:                                 # If the annotator didn't annotate anything, acknowledge this by setting pseudo index and emotion=('none')
      start = -2
      end = -1
      emotion =  'none'

      wholerange = list(range(start, end+1))                      # List of the pseudo range [-2, -1, 0]
     
      for k in range(len(wholerange)):                            # For each pseudo index number, add the nr and 'none' emotion to the dict
        index_num = wholerange[k]                                 # Get current index number, e.g. -2
        if index_num in iaa_dict:                                 # If index is already in dict, just add to the emotion label, else add index and emotion label
          iaa_dict[index_num].append(emotion)
        else:
          iaa_dict[index_num] = [emotion]

    else:                                                         # If the annotator did annotation something
      for j in range(len(annotator_label)):                       # Loop over each annotated range (such as one word) in the sub lists by accessing length (e.g. 3 for the example above)
        start = annotator_label[j][0]                             # e.g. 0
        end = annotator_label[j][1]                               # e.g. 21
        emotion = annotator_label[j][2]                           # e.g. 'Happiness'

        wholerange = list(range(start, end+1))                    # List of the range, e.g. [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
        
        for k in range(len(wholerange)):                          # For each index number, add the nr and emotion to the dict
          index_num = wholerange[k]
          if index_num in iaa_dict:
            iaa_dict[index_num].append(emotion)
          else:
            iaa_dict[index_num] = [emotion]


  iaa_dict = {k: v for k, v in iaa_dict.items() if len(v) > 1}    # After collecting all annotated range numbers and emotions, drop any key-value pairs with only 1 emotion (= only one annotator marked it)

  if iaa_dict != {}:                                              # If there are any annotations left 
    ranges = find_cont_ranges_and_dom_emotions(iaa_dict)          # Get the ranges and the dominant emotion
    final_iaa_df.loc[final_iaa_df['message_id'] == message_id, 'label'] = str(ranges)     # add the final ranges and emotions to the 'label' column in the final df
  else:                                                           # if there are no emotions then just add empty list brackets in the label column
    final_iaa_df.loc[final_iaa_df['message_id'] == message_id, 'label'] = str('[]')

# Save to CSV
final_iaa_df.to_csv("Data/Training Data/1 - Annotation Results/Final Results/final_iaa_df_new.csv", index=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_iaa_df['label'] = ''


1302377117951369216
1302407703051022336
1322845152612933632
1323285372131004416
1327893319914020864
1477117401145610240
Total annotations: 51962
Total annotations dropped: 5704
{'iaa_2': 7, 'iaa_3': 4040}


In [50]:
def find_dominant_emotion(emotions, iaa_counts):
    '''
    Determines the dominant emotion from a list of emotions. If there is no single dominant emotion, returns None
    '''
    emotion_counts = Counter(emotions).most_common(3)   
    if len(emotion_counts) > 1 and emotion_counts[0][1] == emotion_counts[1][1]:
        iaa_counts['iaa_2'] += 1 
        return None, 1                                                # No dominant emotion if the top two are equal
    iaa_counts['iaa_3'] += 1 
    return emotion_counts[0][0] , 0                                 # Return the emotion with the highest count

# function to check for overlapping index ranges
def find_cont_ranges_and_dom_emotions(d, iaa_counts):
    '''
    Function that takes a dictionary as input, where the keys are character indices (e.g. 1) and the values are a list of emotions that were annotated for that character,
    and finds continuous character ranges (e.g. for a word or several words) and determines what the dominant emotion is that was labeled by 3 annotators
    Input: dictionary with keys that are character indexes and values that is a list of any/all labeled emotions for that character
    Output: A list of tuples with (start_idx, end_idx, emotion), where the range and the emotion are agreen upon by at least 2 annotators
    '''

    sorted_keys = sorted(d.keys())                                                      # Sort input dict by key values, which are character indexes
    ranges = []                                                                         # Init a list to store continuous character ranges

    current_range_start = sorted_keys[0]                                                # Begin at first index
    current_range_end = current_range_start                                             # Set end index to first index, but update as we go
    emotions_for_this_range = d[sorted_keys[0]]                                         # Init list for storing all emotions associated with this range of characters, and add first emotions
    get_msg_id = 0

    for i in range(1, len(sorted_keys)):                                                # Iterate over each of the character indexes
        if sorted_keys[i] == sorted_keys[i - 1] + 1:                                    # Check if the current key continues the range
            current_range_end = sorted_keys[i]                                          # Update range end to current index
            emotions_for_this_range.extend(d[sorted_keys[i]])                           # Add the list of emotions from this character to the "all emotions" list
        else:                                                                           # If the current key doesn't continue the current range (meaning end of current annotated range is reached)
            dominant_emotion, get_msg_id = find_dominant_emotion(emotions_for_this_range, iaa_counts)           # Get the dominant emotion for this annotated range
            if dominant_emotion:                                                        # If not none, add to final results list
                ranges.append([current_range_start, current_range_end, dominant_emotion])
            current_range_start = current_range_end = sorted_keys[i]                    # Update start index 
            emotions_for_this_range = d[sorted_keys[i]]                                 # Get current emotion labels

    # Handle the last range
    dominant_emotion, get_msg_id = find_dominant_emotion(emotions_for_this_range, iaa_counts)
    if dominant_emotion:
        ranges.append([current_range_start, current_range_end, dominant_emotion])

    return ranges, get_msg_id

In [53]:
pd.options.display.max_colwidth = None  # Set to None to display full column width without truncation


# Set up final DF and add column for final label (which annotators agreed on)
final_iaa_df = all_tweets_annotated
final_iaa_df['label'] = ''

total_annotations_count = 0
total_annotations_dropped = 0
iaa_counts = {'iaa_2': 0, 'iaa_3': 0}

# Iterate through groups of message_ids (= always 3 tweets, one from each annotator), get the three current labels
for message_id, group in sorted_df.groupby('message_id'): 
  
  if message_id in [1302377117951369216, 1302407703051022336, 1322845152612933632, 1323285372131004416, 1327893319914020864, 1477117401145610240]:
    print(group)

  iaa_dict = {}                                                   # Initialise a dict for current tweet

  for i in range(3):                                              # For each annotator, get their label, i.e. annotated ranges
    annotator_label = group.iloc[i]['label']                      # e.g. [[0, 21, 'Happiness'], [128, 142, 'Sadness'], [205, 214, 'Sadness']]

    if len(annotator_label) == 0:                                 # If the annotator didn't annotate anything, acknowledge this by setting pseudo index and emotion=('none')
      start = -2
      end = -1
      emotion =  'none'

      wholerange = list(range(start, end+1))                      # List of the pseudo range [-2, -1, 0]
     
      for k in range(len(wholerange)):                            # For each pseudo index number, add the nr and 'none' emotion to the dict
        index_num = wholerange[k]                                 # Get current index number, e.g. -2
        if index_num in iaa_dict:                                 # If index is already in dict, just add to the emotion label, else add index and emotion label
          iaa_dict[index_num].append(emotion)
        else:
          iaa_dict[index_num] = [emotion]

    else:                                                         # If the annotator did annotation something
      for j in range(len(annotator_label)):                       # Loop over each annotated range (such as one word) in the sub lists by accessing length (e.g. 3 for the example above)
        start = annotator_label[j][0]                             # e.g. 0
        end = annotator_label[j][1]                               # e.g. 21
        emotion = annotator_label[j][2]                           # e.g. 'Happiness'

        wholerange = list(range(start, end+1))                    # List of the range, e.g. [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
        
        for k in range(len(wholerange)):                          # For each index number, add the nr and emotion to the dict
          index_num = wholerange[k]
          if index_num in iaa_dict:
            iaa_dict[index_num].append(emotion)
          else:
            iaa_dict[index_num] = [emotion]

  initial_count = len(iaa_dict)                                   # count before removing all annotations that don't have at least two annotations
  total_annotations_count += initial_count
  iaa_dict = {k: v for k, v in iaa_dict.items() if len(v) > 1}    # After collecting all annotated range numbers and emotions, drop any key-value pairs with only 1 emotion (= only one annotator marked it)
  filtered_count = len(iaa_dict)                                  # count after removing annotations with less than two annotations
  dropped_count = initial_count - filtered_count
  total_annotations_dropped += dropped_count
  
  if iaa_dict != {}:                                              # If there are any annotations left 
    ranges, get_msg_id = find_cont_ranges_and_dom_emotions(iaa_dict, iaa_counts)          # Get the ranges and the dominant emotion
    if get_msg_id == 1: 
      print(message_id)
    final_iaa_df.loc[final_iaa_df['message_id'] == message_id, 'label'] = str(ranges)     # add the final ranges and emotions to the 'label' column in the final df
  else:                                                           # if there are no emotions then just add empty list brackets in the label column
    final_iaa_df.loc[final_iaa_df['message_id'] == message_id, 'label'] = str('[]')

# Save to CSV
# final_iaa_df.to_csv("Data/Training Data/1 - Annotation Results/Final Results/final_iaa_df_new.csv", index=False)
print(f"Total annotations: {total_annotations_count}")
print(f"Total annotations dropped: {total_annotations_dropped}")
print(iaa_counts)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_iaa_df['label'] = ''


       id  \
738  3177   
739  1844   
740   511   

                                                                                             text  \
738  @pallious In the old days the Guardian would be all over a case like this. Nowadays.....nah.   
739  @pallious In the old days the Guardian would be all over a case like this. Nowadays.....nah.   
740  @pallious In the old days the Guardian would be all over a case like this. Nowadays.....nah.   

              message_id                                   label Comments  
738  1302377117951369216      [[26, 38, Anger], [75, 83, Anger]]       []  
739  1302377117951369216                     [[26, 38, Sadness]]       []  
740  1302377117951369216  [[26, 38, Sadness], [75, 83, Sadness]]       []  
1302377117951369216
        id  \
789  25355   
790  26688   
791  28021   

                                                                                                                                             text  \
789  UK’s Br

In [45]:
# Initialize counters for agreement statistics
total_annotations_count = 0
total_annotations_dropped = 0
all_three_equal_count = 0
two_equal_count = 0
all_different_count = 0
iaa_counts = {'iaa_2': 0, 'iaa_3': 0}

# Iterate through groups of message_ids (= 3 annotations per tweet)
for message_id, group in sorted_df.groupby('message_id'): 
    iaa_dict = {}                                                   # Initialize a dict for current tweet

    for i in range(3):                                              # For each annotator, get their label, i.e., annotated ranges
        annotator_label = group.iloc[i]['label']                    # e.g., [[0, 21, 'Happiness'], [128, 142, 'Sadness'], ...]

        if len(annotator_label) == 0:                               # If no annotation, use pseudo range with 'none'
            emotion = 'none'
            iaa_dict[-2] = iaa_dict.get(-2, []) + [emotion]         # Pseudo range [-2, -1, 0]
        else:                                                       # If annotation is present
            for j in range(len(annotator_label)):                   # Loop over each annotated range
                start, end, emotion = annotator_label[j]
                wholerange = list(range(start, end + 1))
                
                for index_num in wholerange:                        # Add emotion to each index in range
                    iaa_dict[index_num] = iaa_dict.get(index_num, []) + [emotion]

    # Analyze each continuous range independently
    ranges = find_cont_ranges_and_dom_emotions(iaa_dict, iaa_counts)            # Get ranges with dominant emotions

    for start, end, dominant_emotion in ranges:
        emotions_list = [iaa_dict[i] for i in range(start, end + 1)]
        flattened_emotions = [emotion for sublist in emotions_list for emotion in sublist]
        emotion_counts = Counter(flattened_emotions)


    # Count dropped annotations
    initial_count = len(iaa_dict)
    total_annotations_count += initial_count
    iaa_dict = {k: v for k, v in iaa_dict.items() if len(v) > 1}    # Drop annotations with only one annotator
    filtered_count = len(iaa_dict)
    dropped_count = initial_count - filtered_count
    total_annotations_dropped += dropped_count

# Output overall statistics
print(f"Total annotations: {total_annotations_count}")
print(f"Total annotations dropped: {total_annotations_dropped}")
print(f"Ranges with a dominant emotion (iaa_3): {iaa_counts['iaa_3']}")
print(f"Ranges without a dominant emotion (iaa_2): {iaa_counts['iaa_2']}")
print(f"Ranges with all different emotions: {all_different_count}")

Total annotations: 50888
Total annotations dropped: 5015
Ranges with a dominant emotion (iaa_3): 4358
Ranges without a dominant emotion (iaa_2): 6
Ranges with all different emotions: 0


In [30]:
final_iaa_df.head(50)

Unnamed: 0,id,text,message_id,label,Comments
0,25334,Proud to receive an A+ rating from th Nevada F...,1524255702851620864,"[[20, 29, 'Happiness'], [92, 112, 'Happiness']]",[]
1,25335,@AnotherJay @mayor_anderson @BorisJohnson My f...,1323323443446747136,"[[225, 235, 'Anger']]",[]
2,25336,I can't even: Gunfire erupts after a high scho...,1532901446114365440,"[[14, 21, 'Fear'], [89, 94, 'Sadness']]",[]
3,25337,@codefknblack I had a good gun play session th...,1356204586806222848,"[[27, 43, 'Happiness']]",[]
4,25338,@PennySpalpeen @jenfox84 @pastebbins @LucyFan4...,1333825902220881920,"[[353, 357, 'Happiness'], [386, 392, 'Happines...",[]
5,25339,@myhairynipples @SkySportsPL Drogba you prick,1302449973204746240,"[[29, 35, 'Anger']]",[]
6,25340,"@SandraKobelt @MailOnline A lot more than 13, ...",1524641627184132096,"[[-2, -1, 'none']]",[]
7,25341,Give this a watch well worth it informative an...,1323368847748063232,"[[5, 9, 'Happiness']]",[]
8,25342,Can you tell we miss DofE? Getting outside one...,1322993372878020608,"[[21, 25, 'Happiness'], [109, 131, 'Happiness']]",[]
9,25343,The @SDSU Police Department and San Diego Fire...,1567966037509025792,"[[191, 199, 'Fear']]",[]


### Added: Re-Load Final Dataset for further Data-Cleaning and Subsetting

In [1]:
import pandas as pd

df = pd.read_csv("Data/Training Data/1 - Annotation Results/Final Results/final_iaa_df_new.csv")
print(len(df))
df.head()


2766


Unnamed: 0,id,text,message_id,label,Comments
0,25334,Proud to receive an A+ rating from th Nevada F...,1524255702851620864,"[[20, 29, 'Happiness'], [92, 112, 'Happiness']]",[]
1,25335,@AnotherJay @mayor_anderson @BorisJohnson My f...,1323323443446747136,"[[225, 235, 'Anger']]",[]
2,25336,I can't even: Gunfire erupts after a high scho...,1532901446114365440,"[[14, 21, 'Fear'], [89, 94, 'Sadness']]",[]
3,25337,@codefknblack I had a good gun play session th...,1356204586806222848,"[[27, 43, 'Happiness']]",[]
4,25338,@PennySpalpeen @jenfox84 @pastebbins @LucyFan4...,1333825902220881920,"[[353, 357, 'Happiness'], [386, 392, 'Happines...",[]


### 7. Reformat final Training Dataset for GRACE Training

Input data format example: 

> <b>Text</b>: Proud to receive an A+ rating from th Nevada Firearms Coalition. I will ALWAYS defend your 2nd amendment rights!
>
> <b>Label</b>: [[20, 29, 'Happiness'], [92, 112, 'Happiness']]


Output format example:

Proud - - O O O <br>
to - - O O O <br>
receive - - O O O <br>
an  - - O O O <br>
A+ - - B_AP HAPPINESS B_AP+HAPPINESS <br>
rating - - I_AP HAPPINESS I_AP+HAPPINESS <br>
from - - O O O <br>
th - - O O O <br>
Nevada - - O O O <br>
Firearms - - O O O <br>
Coalition - - O O O <br>
. - - O O O <br>
I - - O O O <br>
will - - O O O <br>
ALWAYS - - O O O <br>
defend - - O O O <br>
your - - O O O <br>
2nd - - B_AP HAPPINESS B_AP+HAPPINESS <br>
amendment - - I_AP HAPPINESS I_AP+HAPPINESS <br>
rights - - I_AP HAPPINESS I_AP+HAPPINESS <br>
! - - O O O <br>


In [6]:
print(len(df))


# Split the DataFrame into training and test sets
# 'test_size' is set to 0.15 for 15% test data and 85% training data
# 'random_state' can be set to an integer for reproducibility


train_df, test_df = train_test_split(final_iaa_df, test_size=0.1, random_state=42)
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=38)

print(len(train_df))
print(len(test_df))
test_df.head()


2766
2212
554


Unnamed: 0,id,text,message_id,label,Comments
2263,8994,Homeless camp fires are a regular thing. Light...,1531923081886769152,"[[41, 84, 'Anger']]",[]
1666,8396,https://t.co/GvLdJyIvcp - Quite frankly @Pasto...,1305950140658855936,"[[40, 59, 'Anger']]",[]
412,461,"Just posted a video @ Blackheath, London https...",1311576244899717120,"[[-2, -1, 'none']]",[]
960,1009,@Disney should do a #starwars what if scenario...,1302275358339858432,"[[0, 7, 'Happiness'], [20, 46, 'Happiness'], [...",['example for compound targets']
645,694,Thought Mall looked wonderfully dilapidated in...,1302268394738155520,"[[8, 12, 'Sadness']]",[]


In [7]:
def process_text(text, labels):
    words = text.split()
    indices = []
    start = 0
    for word in words:
        end = start + len(word)
        indices.append((start, end))
        start = end + 1  # +1 for the space or newline character

    formatted_text = []
    # reformat the lists which are stored as strings back into list formats
    labels = ast.literal_eval(labels)

    for i, (word, (start_idx, end_idx)) in enumerate(zip(words, indices)):
        label = "O O O"
        for label_range in labels:
            # a start index of -2 indicates that the label is "none"
            if start == -2:
                label = "O O O"
            else:
                range_start, range_end, emotion = label_range
                if start_idx >= range_start and end_idx <= range_end:
                    if start_idx == range_start:
                        label = f"B_AP {emotion.upper()} B_AP+{emotion.upper()}"
                    else:
                        label = f"I_AP {emotion.upper()} I_AP+{emotion.upper()}"
                    break
        formatted_text.append(f"{word} - - {label}")

    return formatted_text

def format_dataframe(df):
    all_formatted_texts = []
    for _, row in df.iterrows():
        formatted_text = process_text(row['text'], row['label'])
        all_formatted_texts.extend(formatted_text)
        all_formatted_texts.append("")  # Empty row between tweets

    return "\n".join(all_formatted_texts)

Whole dataset reformatted

In [43]:
# Process the DataFrame
formatted_data = format_dataframe(final_iaa_df)

# Write the output to a .txt file
output_file_path = "Data/Training Data/2 - Data Reformatted and Split for Training/abea_training_dataset.txt"
with open(output_file_path, 'w') as file:
    file.write('-DOCSTART-\n\n')
    file.write(formatted_data)

print(f"Output written to {output_file_path}")

Output written to Data/Training Data/2 - Data Reformatted and Split for Training/abea_training_dataset.txt


Train and Test Datasets Reformatted

In [46]:
# Process the DataFrame
formatted_data = format_dataframe(train_df)
# Write the output to a .txt file
training_file_paths = ["Data/Training Data/2 - Data Reformatted and Split for Training/abea_resplit_train.txt", 
                       "Data/Training Data/2 - Data Reformatted and Split for Training/abea__resplit_trial.txt"]
for path in training_file_paths:
    with open(path, 'w') as file:
        file.write('-DOCSTART-\n\n')
        file.write(formatted_data)

    print(f"Output written to {path}")



# Process the DataFrame
formatted_data = format_dataframe(test_df)
# Write the output to a .txt file
output_file_path = "Data/Training Data/2 - Data Reformatted and Split for Training/abea_test.gold.txt"
with open(output_file_path, 'w') as file:
    file.write('-DOCSTART-\n\n')
    file.write(formatted_data)

print(f"Output written to {output_file_path}")

Output written to Data/Training Data/2 - Data Reformatted and Split for Training/abea_train.txt
Output written to Data/Training Data/2 - Data Reformatted and Split for Training/abea_trial.txt
Output written to Data/Training Data/2 - Data Reformatted and Split for Training/abea_test.gold.txt


<hr>

### Processing of Refined Annotations

In [3]:
# load only one file
df = pd.read_json("Data/Training Data/Refined Annotations/aspect-emotion.jsonl", lines=True)
print(len(df))
df.head()

2766


Unnamed: 0,id,text,message_id,label,Comments
0,2767,Proud to receive an A+ rating from th Nevada F...,1524255702851620864,"[[20, 29, pride], [87, 112, pride]]",[]
1,2768,@AnotherJay @mayor_anderson @BorisJohnson My f...,1323323443446747136,"[[57, 62, irritation], [225, 235, frustration]]",[]
2,2769,I can't even: Gunfire erupts after a high scho...,1532901446114365440,"[[14, 21, disappointment], [81, 94, sadness]]",[]
3,2770,@codefknblack I had a good gun play session th...,1356204586806222848,"[[27, 43, excitement]]",[]
4,2771,@PennySpalpeen @jenfox84 @pastebbins @LucyFan4...,1333825902220881920,"[[348, 352, affection], [408, 413, cheerfulness]]",[]


In [4]:
# Define words that indicate rows to remove
# words_to_remove = ["remove", "food ad", "tricky", "grammar"]
words_to_remove = ["remove", "tricky", "grammar"]

# Function to check if any word in the list is in the comments
def filter_comments(comment_list):
    return not any(word in comment_list for word in words_to_remove)

# Filter rows based on comments
df = df[df['Comments'].apply(filter_comments)]

print(len(df))


2625


In [5]:
# change labels into the overarching labels
# Define the label mappings
emotion_mappings = {
    'affection': 'Happiness', 'lust': 'Happiness', 'cheerfulness': 'Happiness',
    'excitement': 'Happiness', 'contentment': 'Happiness', 'pride': 'Happiness',
    'optimism': 'Happiness', 'enthrallment': 'Happiness', 'relief': 'Happiness',
    'sarcasm - pos': 'Happiness', 'irritation': 'Anger', 'frustration': 'Anger',
    'rage': 'Anger', 'digust': 'Anger', 'envy': 'Anger', 'torment': 'Anger',
    'sarcasm - neg': 'Anger', 'suffering': 'Sadness', 'sadness': 'Sadness',
    'disappointment': 'Sadness', 'shame': 'Sadness', 'neglect': 'Sadness',
    'sympathy': 'Sadness', 'horror': 'Fear', 'nervousness': 'Fear',
    'neutral': 'None', 'surprise': 'None'
}

# Function to remap labels
def remap_labels_emotions(labels):
    try:
        # label_list = ast.literal_eval(label_str)  # Convert the string representation of the list into an actual list
        # print(labels)
        new_labels = []
        for label in labels:
            # print(label[2])
            if label[2] in emotion_mappings:
                new_labels.append([label[0], label[1], emotion_mappings[label[2]]])
        return new_labels
    except Exception as e:
        print(f"Error parsing {labels}: {e}")
        return []

# Apply the remapping function to the label column
df['label'] = df['label'].apply(remap_labels_emotions)

df.head(10)

Unnamed: 0,id,text,message_id,label,Comments
0,2767,Proud to receive an A+ rating from th Nevada F...,1524255702851620864,"[[20, 29, Happiness], [87, 112, Happiness]]",[]
1,2768,@AnotherJay @mayor_anderson @BorisJohnson My f...,1323323443446747136,"[[57, 62, Anger], [225, 235, Anger]]",[]
2,2769,I can't even: Gunfire erupts after a high scho...,1532901446114365440,"[[14, 21, Sadness], [81, 94, Sadness]]",[]
3,2770,@codefknblack I had a good gun play session th...,1356204586806222848,"[[27, 43, Happiness]]",[]
4,2771,@PennySpalpeen @jenfox84 @pastebbins @LucyFan4...,1333825902220881920,"[[348, 352, Happiness], [408, 413, Happiness]]",[]
5,2772,@myhairynipples @SkySportsPL Drogba you prick,1302449973204746240,"[[29, 35, Anger]]",[]
6,2773,"@SandraKobelt @MailOnline A lot more than 13, ...",1524641627184132096,"[[50, 74, None]]",[]
7,2774,Give this a watch well worth it informative an...,1323368847748063232,"[[5, 9, Happiness]]",[]
8,2775,Can you tell we miss DofE? Getting outside one...,1322993372878020608,"[[21, 25, Happiness], [109, 131, Happiness]]",[]
9,2776,The @SDSU Police Department and San Diego Fire...,1567966037509025792,"[[78, 90, None]]",[]


In [8]:
# change labels into the overarching labels
# Define the label mappings
sentiment_mappings = {
    'Happiness': 'Positive', 'Anger': 'Negative', 'Sadness': 'Negative',
    'Fear': 'Negative', 'None': 'Neutral'
}

# Function to remap labels
def remap_labels_sentiments(labels):
    try:
        # label_list = ast.literal_eval(label_str)  # Convert the string representation of the list into an actual list
        # print(labels)
        new_labels = []
        for label in labels:
            # print(label[2])
            if label[2] in sentiment_mappings:
                new_labels.append([label[0], label[1], sentiment_mappings[label[2]]])
        return new_labels
    except Exception as e:
        print(f"Error parsing {labels}: {e}")
        return []

# Apply the remapping function to the label column
df['label'] = df['label'].apply(remap_labels_sentiments)

df.head(10)

Unnamed: 0,id,text,message_id,label,Comments
0,2767,Proud to receive an A+ rating from th Nevada F...,1524255702851620864,"[[20, 29, Positive], [87, 112, Positive]]",[]
1,2768,@AnotherJay @mayor_anderson @BorisJohnson My f...,1323323443446747136,"[[57, 62, Negative], [225, 235, Negative]]",[]
2,2769,I can't even: Gunfire erupts after a high scho...,1532901446114365440,"[[14, 21, Negative], [81, 94, Negative]]",[]
3,2770,@codefknblack I had a good gun play session th...,1356204586806222848,"[[27, 43, Positive]]",[]
4,2771,@PennySpalpeen @jenfox84 @pastebbins @LucyFan4...,1333825902220881920,"[[348, 352, Positive], [408, 413, Positive]]",[]
5,2772,@myhairynipples @SkySportsPL Drogba you prick,1302449973204746240,"[[29, 35, Negative]]",[]
6,2773,"@SandraKobelt @MailOnline A lot more than 13, ...",1524641627184132096,"[[50, 74, Neutral]]",[]
7,2774,Give this a watch well worth it informative an...,1323368847748063232,"[[5, 9, Positive]]",[]
8,2775,Can you tell we miss DofE? Getting outside one...,1322993372878020608,"[[21, 25, Positive], [109, 131, Positive]]",[]
9,2776,The @SDSU Police Department and San Diego Fire...,1567966037509025792,"[[78, 90, Neutral]]",[]


In [9]:
# save
df.to_csv("Data/Training Data/Refined Annotations/prepared_sentiments.csv")