# Reddit AITA Dataset Analysis

Datasets Analyzed:
- [Reddit AITA Multiclass](https://huggingface.co/datasets/MattBoraske/reddit-AITA-submissions-and-comments-multiclass)
- [Reddit AITA Multiclass Top 2k](https://huggingface.co/datasets/MattBoraske/reddit-AITA-submissions-and-comments-multiclass-top-2k)
- [Reddit AITA Binary](https://huggingface.co/datasets/MattBoraske/reddit-AITA-submissions-and-comments-binary)
- [Reddit AITA Binary Top 2k](https://huggingface.co/datasets/MattBoraske/reddit-AITA-submissions-and-comments-binary-top-2k)

Analysis Done:
- Decision Ambiguity Analysis - Scores Histogram, Cumulative Frequency Plot, Zero Ambiguity Proportion
- Comment Agreement Analysis - Krippendorff's Alpha (holistic) and Cohen's Kappa (pairwise)
- Flan-T5 and Llama-2 Token Count Analysis - Histograms of the tokenized prompt lengths for each model's instruction prompts
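As a rough illustration of the pairwise agreement measure named in the list above, the following is a minimal, dependency-free sketch of Cohen's kappa for two raters. The `cohens_kappa` helper and the example verdict labels are hypothetical and are not taken from the notebook; the analysis itself may rely on a library implementation instead.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats the case as perfect agreement (an assumption).
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical verdicts from two commenters on the same five submissions.
rater_1 = ["NTA", "YTA", "NTA", "ESH", "NTA"]
rater_2 = ["NTA", "YTA", "YTA", "ESH", "NTA"]
kappa = cohens_kappa(rater_1, rater_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so two raters who mostly pick the majority label score lower than their raw agreement rate would suggest.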

## Prepare environment

In [None]:
%pip install datasets transformers krippendorff seaborn huggingface_hub ipywidgets pandas numpy

In [None]:
import os

# Create one output folder per dataset and analysis; works on any OS and is safe to re-run
for dataset_name in ['multiclass', 'multiclass-top-2k', 'binary', 'binary-top-2k']:
    for analysis_name in ['Decision_Ambiguity_Analysis', 'Comment_Agreement_Analysis', 'Token_Count_Analysis']:
        os.makedirs(os.path.join('analysis_results', dataset_name, analysis_name), exist_ok=True)

## Load datasets

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from datasets import load_dataset

multiclass_dataset = load_dataset('MattBoraske/reddit-AITA-submissions-and-comments-multiclass')
multiclass_top_2k_dataset = load_dataset('MattBoraske/reddit-AITA-submissions-and-comments-multiclass-top-2k')
binary_dataset = load_dataset('MattBoraske/reddit-AITA-submissions-and-comments-binary')
binary_top_2k_dataset = load_dataset('MattBoraske/reddit-AITA-submissions-and-comments-binary-top-2k')

## Flan-T5 and Llama-2 Token Count Analysis

In [None]:
from typing import Union

from transformers import PreTrainedTokenizer, AutoTokenizer
from datasets import Dataset, DatasetDict

def add_token_counts_to_dataset(dataset: Union[Dataset, DatasetDict], column: str, tokenizer: PreTrainedTokenizer, new_column_name: str) -> Union[Dataset, DatasetDict]:
    """
    Adds a new column holding the number of tokens in each row of a specified column.
    When given a DatasetDict, the column is added to every split.

    Parameters:
      dataset (Dataset or DatasetDict): A Hugging Face dataset object.
      column (str): The name of the text column to tokenize.
      tokenizer (PreTrainedTokenizer): A Hugging Face transformers pretrained tokenizer.
      new_column_name (str): The name of the new column to be added to the dataset.

    Returns:
      Dataset or DatasetDict: The modified dataset with an additional column for token counts.
    """

    def count_tokens(row):
        # Tokenize without padding or truncation so the count reflects the full prompt.
        # Plain Python lists are enough here, so no tensor framework is required.
        input_ids = tokenizer(row[column], padding=False, truncation=False)['input_ids']
        return {new_column_name: len(input_ids)}

    return dataset.map(count_tokens)

In [None]:
flanT5_tokenizer = AutoTokenizer.from_pretrained("MattBoraske/flan-t5-xl-reddit-AITA-multiclass", trust_remote_code=True)
llama2_tokenizer = AutoTokenizer.from_pretrained("MattBoraske/llama-2-7b-chat-reddit-AITA-multiclass", trust_remote_code=True)

multiclass_dataset = add_token_counts_to_dataset(multiclass_dataset, 'flanT5_instruction', flanT5_tokenizer, 'flanT5_instruction_token_count')
multiclass_dataset = add_token_counts_to_dataset(multiclass_dataset, 'llama2_instruction', llama2_tokenizer, 'llama2_instruction_token_count')

multiclass_top_2k_dataset = add_token_counts_to_dataset(multiclass_top_2k_dataset, 'flanT5_instruction', flanT5_tokenizer, 'flanT5_instruction_token_count')
multiclass_top_2k_dataset = add_token_counts_to_dataset(multiclass_top_2k_dataset, 'llama2_instruction', llama2_tokenizer, 'llama2_instruction_token_count')

binary_dataset = add_token_counts_to_dataset(binary_dataset, 'flanT5_instruction', flanT5_tokenizer, 'flanT5_instruction_token_count')
binary_dataset = add_token_counts_to_dataset(binary_dataset, 'llama2_instruction', llama2_tokenizer, 'llama2_instruction_token_count')

binary_top_2k_dataset = add_token_counts_to_dataset(binary_top_2k_dataset, 'flanT5_instruction', flanT5_tokenizer, 'flanT5_instruction_token_count')
binary_top_2k_dataset = add_token_counts_to_dataset(binary_top_2k_dataset, 'llama2_instruction', llama2_tokenizer, 'llama2_instruction_token_count')


In [None]:
def calculate_token_counts(dataset, split):
    # Collect both token-count columns for the given split
    columns = ['llama2_instruction_token_count', 'flanT5_instruction_token_count']
    return {name: list(dataset[split][name]) for name in columns}

# Get train/test token counts for each dataset
multiclass_token_counts = {
    'train': calculate_token_counts(multiclass_dataset, 'train'),
    'test': calculate_token_counts(multiclass_dataset, 'test')
}

multiclass_2k_token_counts = {
    'train': calculate_token_counts(multiclass_top_2k_dataset, 'train'),
    'test': calculate_token_counts(multiclass_top_2k_dataset, 'test')
}

binary_token_counts = {
    'train': calculate_token_counts(binary_dataset, 'train'),
    'test': calculate_token_counts(binary_dataset, 'test')
}

binary_2k_token_counts = {
    'train': calculate_token_counts(binary_top_2k_dataset, 'train'),
    'test': calculate_token_counts(binary_top_2k_dataset, 'test')
}
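Once the per-split token counts are collected, one quick check is how many prompts would exceed a given context budget. The sketch below is illustrative only: `share_over_budget` is a hypothetical helper, and both the example counts and the budget of 512 tokens are made-up values, not limits stated in this notebook.

```python
def share_over_budget(token_counts, max_tokens):
    """Return the fraction of prompts whose token count exceeds max_tokens."""
    if not token_counts:
        return 0.0
    return sum(count > max_tokens for count in token_counts) / len(token_counts)

# Hypothetical token counts and budget, for illustration only.
example_counts = [120, 480, 512, 700, 1024]
over = share_over_budget(example_counts, max_tokens=512)
```

The same function could be called on a list such as `multiclass_token_counts['train']['flanT5_instruction_token_count']` with whatever budget applies to the model being fine-tuned.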

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multiclass_token_counts['train']['flanT5_instruction_token_count'], multiclass_token_counts['test']['flanT5_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
flant5_train_counts, _ = np.histogram(multiclass_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges)
flant5_test_counts, _ = np.histogram(multiclass_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multiclass_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multiclass_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass Flan-T5 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/multiclass/Token_Count_Analysis/flanT5_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
flant5_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': flant5_train_counts, 'Test': flant5_test_counts}, index=bin_labels)
flant5_counts_df.to_csv('analysis_results/multiclass/Token_Count_Analysis/flanT5_tokenized_lengths_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multiclass_token_counts['train']['llama2_instruction_token_count'], multiclass_token_counts['test']['llama2_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(multiclass_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(multiclass_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multiclass_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multiclass_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass Llama 2 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/multiclass/Token_Count_Analysis/llama2_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
llama2_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
llama2_counts_df.to_csv('analysis_results/multiclass/Token_Count_Analysis/llama2_tokenized_lengths_bins.csv', index=False)
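Each histogram cell derives its bin edges from the combined train and test values so that both splits are counted against the same bins. A dependency-free sketch of that idea is given here; `shared_bin_edges` and `bin_counts` are hypothetical helpers, not part of the notebook, and the equal-width binning is an assumption that only approximates what `np.histogram_bin_edges(..., bins=20)` does.

```python
def shared_bin_edges(*samples, bins=20):
    """Equal-width bin edges spanning the min and max of all samples combined."""
    combined = [x for sample in samples for x in sample]
    low, high = min(combined), max(combined)
    if low == high:
        # Degenerate range: widen it so that the bins have non-zero width.
        low, high = low - 0.5, high + 0.5
    width = (high - low) / bins
    return [low + i * width for i in range(bins)] + [high]

def bin_counts(sample, edges):
    """Count values per bin; the last bin includes its right edge."""
    low, high, n_bins = edges[0], edges[-1], len(edges) - 1
    counts = [0] * n_bins
    for x in sample:
        if low <= x <= high:
            # Map the value to a bin index, clamping the maximum into the last bin.
            index = min(int((x - low) / (high - low) * n_bins), n_bins - 1)
            counts[index] += 1
    return counts

# Toy data, for illustration only.
train = [10, 20, 30, 40]
test = [15, 50]
edges = shared_bin_edges(train, test, bins=4)
train_counts = bin_counts(train, edges)
test_counts = bin_counts(test, edges)
```

Because the edges come from the pooled values, the two count lists line up bin by bin and can be written side by side in one table.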

In [None]:
# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multiclass_2k_token_counts['train']['flanT5_instruction_token_count'], multiclass_2k_token_counts['test']['flanT5_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
flant5_train_counts, _ = np.histogram(multiclass_2k_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges)
flant5_test_counts, _ = np.histogram(multiclass_2k_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multiclass_2k_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multiclass_2k_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass-Top-2k Flan-T5 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/multiclass-top-2k/Token_Count_Analysis/flanT5_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
flant5_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': flant5_train_counts, 'Test': flant5_test_counts}, index=bin_labels)
flant5_counts_df.to_csv('analysis_results/multiclass-top-2k/Token_Count_Analysis/flanT5_tokenized_lengths_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multiclass_2k_token_counts['train']['llama2_instruction_token_count'], multiclass_2k_token_counts['test']['llama2_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(multiclass_2k_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(multiclass_2k_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multiclass_2k_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multiclass_2k_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass-Top-2k Llama 2 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/multiclass-top-2k/Token_Count_Analysis/llama2_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
llama2_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
llama2_counts_df.to_csv('analysis_results/multiclass-top-2k/Token_Count_Analysis/llama2_tokenized_lengths_bins.csv', index=False)

In [None]:
# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_token_counts['train']['flanT5_instruction_token_count'], binary_token_counts['test']['flanT5_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
flant5_train_counts, _ = np.histogram(binary_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges)
flant5_test_counts, _ = np.histogram(binary_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary Flan-T5 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/binary/Token_Count_Analysis/flanT5_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
flant5_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': flant5_train_counts, 'Test': flant5_test_counts}, index=bin_labels)
flant5_counts_df.to_csv('analysis_results/binary/Token_Count_Analysis/flanT5_tokenized_lengths_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_token_counts['train']['llama2_instruction_token_count'], binary_token_counts['test']['llama2_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(binary_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(binary_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary Llama 2 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/binary/Token_Count_Analysis/llama2_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
llama2_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
llama2_counts_df.to_csv('analysis_results/binary/Token_Count_Analysis/llama2_tokenized_lengths_bins.csv', index=False)



In [None]:
# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_2k_token_counts['train']['flanT5_instruction_token_count'], binary_2k_token_counts['test']['flanT5_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
flant5_train_counts, _ = np.histogram(binary_2k_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges)
flant5_test_counts, _ = np.histogram(binary_2k_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_2k_token_counts['train']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_2k_token_counts['test']['flanT5_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary-Top-2k Flan-T5 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/binary-top-2k/Token_Count_Analysis/flanT5_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
flant5_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': flant5_train_counts, 'Test': flant5_test_counts}, index=bin_labels)
flant5_counts_df.to_csv('analysis_results/binary-top-2k/Token_Count_Analysis/flanT5_tokenized_lengths_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_2k_token_counts['train']['llama2_instruction_token_count'], binary_2k_token_counts['test']['llama2_instruction_token_count']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(binary_2k_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(binary_2k_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_2k_token_counts['train']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_2k_token_counts['test']['llama2_instruction_token_count'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary-Top-2k Llama 2 Tokenized Prompt Lengths', fontsize=16)
plt.xlabel('Number of Tokens', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.savefig('analysis_results/binary-top-2k/Token_Count_Analysis/llama2_tokenized_lengths.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
llama2_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
llama2_counts_df.to_csv('analysis_results/binary-top-2k/Token_Count_Analysis/llama2_tokenized_lengths_bins.csv', index=False)

In [None]:
multiclass_dataset["train"] = multiclass_dataset["train"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])
multiclass_dataset["test"] = multiclass_dataset["test"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])

multiclass_top_2k_dataset["train"] = multiclass_top_2k_dataset["train"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])
multiclass_top_2k_dataset["test"] = multiclass_top_2k_dataset["test"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])

binary_dataset["train"] = binary_dataset["train"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])
binary_dataset["test"] = binary_dataset["test"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])

binary_top_2k_dataset["train"] = binary_top_2k_dataset["train"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])
binary_top_2k_dataset["test"] = binary_top_2k_dataset["test"].remove_columns(["flanT5_instruction_token_count", "llama2_instruction_token_count"])


## Decision Ambiguity Analysis

In [None]:
def get_ambiguity_scores(dataset, split):
    # Read the whole column at once instead of iterating over examples
    return list(dataset[split]['ambiguity_score'])

# Calculate train/test ambiguity scores for each dataset
multi_ambiguity_scores = {
    'train': get_ambiguity_scores(multiclass_dataset, 'train'),
    'test': get_ambiguity_scores(multiclass_dataset, 'test')
}

multi_top_2k_ambiguity_scores = {
    'train': get_ambiguity_scores(multiclass_top_2k_dataset, 'train'),
    'test': get_ambiguity_scores(multiclass_top_2k_dataset, 'test')
}

binary_ambiguity_scores = {
    'train': get_ambiguity_scores(binary_dataset, 'train'),
    'test': get_ambiguity_scores(binary_dataset, 'test')
}

binary_top_2k_ambiguity_scores = {
    'train': get_ambiguity_scores(binary_top_2k_dataset, 'train'),
    'test': get_ambiguity_scores(binary_top_2k_dataset, 'test')
}

In [None]:
# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multi_ambiguity_scores['train'], multi_ambiguity_scores['test']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(multi_ambiguity_scores['train'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(multi_ambiguity_scores['test'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multi_ambiguity_scores['train'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multi_ambiguity_scores['test'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass Ambiguity Scores', fontsize=16)
plt.xlabel('Score', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.yscale('log')
plt.legend()
plt.savefig('analysis_results/multiclass/Decision_Ambiguity_Analysis/ambiguity_scores.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
ambiguity_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
ambiguity_counts_df.to_csv('analysis_results/multiclass/Decision_Ambiguity_Analysis/ambiguity_scores_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([multi_top_2k_ambiguity_scores['train'], multi_top_2k_ambiguity_scores['test']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(multi_top_2k_ambiguity_scores['train'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(multi_top_2k_ambiguity_scores['test'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(multi_top_2k_ambiguity_scores['train'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(multi_top_2k_ambiguity_scores['test'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Multiclass-Top-2k Ambiguity Scores', fontsize=16)
plt.xlabel('Score', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.yscale('log')
plt.legend()
plt.savefig('analysis_results/multiclass-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
ambiguity_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
ambiguity_counts_df.to_csv('analysis_results/multiclass-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_ambiguity_scores['train'], binary_ambiguity_scores['test']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(binary_ambiguity_scores['train'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(binary_ambiguity_scores['test'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_ambiguity_scores['train'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_ambiguity_scores['test'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary Ambiguity Scores', fontsize=16)
plt.xlabel('Score', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.yscale('log')
plt.legend()
plt.savefig('analysis_results/binary/Decision_Ambiguity_Analysis/ambiguity_scores.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
ambiguity_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
ambiguity_counts_df.to_csv('analysis_results/binary/Decision_Ambiguity_Analysis/ambiguity_scores_bins.csv', index=False)


# Calculate the common bin edges for both train and test submissions
combined_data = np.concatenate([binary_top_2k_ambiguity_scores['train'], binary_top_2k_ambiguity_scores['test']])
bin_edges = np.histogram_bin_edges(combined_data, bins=20)

# Get histogram bin counts for train and test submissions
ambiguity_train_counts, _ = np.histogram(binary_top_2k_ambiguity_scores['train'], bins=bin_edges)
ambiguity_test_counts, _ = np.histogram(binary_top_2k_ambiguity_scores['test'], bins=bin_edges)

# Plot histograms
plt.figure(figsize=(12, 6))
sns.histplot(binary_top_2k_ambiguity_scores['train'], bins=bin_edges, kde=True, label='Training Set', color='blue')
sns.histplot(binary_top_2k_ambiguity_scores['test'], bins=bin_edges, kde=True, label='Testing Set', color='orange')
plt.title('Reddit-AITA-Binary-Top-2k Ambiguity Scores', fontsize=16)
plt.xlabel('Score', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.yscale('log')
plt.legend()
plt.savefig('analysis_results/binary-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores.png')
plt.show()

# Log Histogram Bin Counts
bin_labels = [f"{bin_edges[i]:.2f} - {bin_edges[i+1]:.2f}" for i in range(len(bin_edges)-1)]
ambiguity_counts_df = pd.DataFrame({'Bin Ranges': bin_labels, 'Train': ambiguity_train_counts, 'Test': ambiguity_test_counts}, index=bin_labels)
ambiguity_counts_df.to_csv('analysis_results/binary-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores_bins.csv', index=False)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Compute the 0th through 100th percentiles for the train and test splits
train_percentiles = np.percentile(multi_ambiguity_scores['train'], np.arange(0, 101, 1))
test_percentiles = np.percentile(multi_ambiguity_scores['test'], np.arange(0, 101, 1))
combined_percentile_data = pd.DataFrame({
    "Percentile": np.arange(0, 101, 1),
    "Train Ambiguity Score": train_percentiles,
    "Test Ambiguity Score": test_percentiles
})

# Save the combined DataFrame to a CSV file
combined_percentile_data.to_csv("analysis_results/multiclass/Decision_Ambiguity_Analysis/ambiguity_score_cumulative_frequency_bins.csv", index=False)

# Plotting
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
plt.plot(train_percentiles, np.arange(0, 101, 1), label='Train')
plt.plot(test_percentiles, np.arange(0, 101, 1), label='Test')
plt.xlabel("Ambiguity Score")
plt.ylabel("Percentile")
plt.title("Reddit-AITA-Multiclass Ambiguity Score Cumulative Frequency")
plt.legend()
plt.savefig("analysis_results/multiclass/Decision_Ambiguity_Analysis/ambiguity_scores_cumulative_frequency.png")
plt.show()


# Compute the 0th through 100th percentiles for the train and test splits
train_percentiles = np.percentile(multi_top_2k_ambiguity_scores['train'], np.arange(0, 101, 1))
test_percentiles = np.percentile(multi_top_2k_ambiguity_scores['test'], np.arange(0, 101, 1))
combined_percentile_data = pd.DataFrame({
    "Percentile": np.arange(0, 101, 1),
    "Train Ambiguity Score": train_percentiles,
    "Test Ambiguity Score": test_percentiles
})

# Save the combined DataFrame to a CSV file
combined_percentile_data.to_csv("analysis_results/multiclass-top-2k/Decision_Ambiguity_Analysis/ambiguity_score_cumulative_frequency_bins.csv", index=False)

# Plotting
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
plt.plot(train_percentiles, np.arange(0, 101, 1), label='Train')
plt.plot(test_percentiles, np.arange(0, 101, 1), label='Test')
plt.xlabel("Ambiguity Score")
plt.ylabel("Percentile")
plt.title("Reddit-AITA-Multiclass-Top-2k Ambiguity Score Cumulative Frequency")
plt.legend()
plt.savefig("analysis_results/multiclass-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores_cumulative_frequency.png")
plt.show()


# Compute the 0th through 100th percentiles for the train and test splits
train_percentiles = np.percentile(binary_ambiguity_scores['train'], np.arange(0, 101, 1))
test_percentiles = np.percentile(binary_ambiguity_scores['test'], np.arange(0, 101, 1))
combined_percentile_data = pd.DataFrame({
    "Percentile": np.arange(0, 101, 1),
    "Train Ambiguity Score": train_percentiles,
    "Test Ambiguity Score": test_percentiles
})

# Save the combined DataFrame to a CSV file
combined_percentile_data.to_csv("analysis_results/binary/Decision_Ambiguity_Analysis/ambiguity_score_cumulative_frequency_bins.csv", index=False)

# Plotting
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
plt.plot(train_percentiles, np.arange(0, 101, 1), label='Train')
plt.plot(test_percentiles, np.arange(0, 101, 1), label='Test')
plt.xlabel("Ambiguity Score")
plt.ylabel("Percentile")
plt.title("Reddit-AITA-Binary Ambiguity Score Cumulative Frequency")
plt.legend()
plt.savefig("analysis_results/binary/Decision_Ambiguity_Analysis/ambiguity_scores_cumulative_frequency.png")
plt.show()


# Compute the 0th through 100th percentiles for the train and test splits
train_percentiles = np.percentile(binary_top_2k_ambiguity_scores['train'], np.arange(0, 101, 1))
test_percentiles = np.percentile(binary_top_2k_ambiguity_scores['test'], np.arange(0, 101, 1))
combined_percentile_data = pd.DataFrame({
    "Percentile": np.arange(0, 101, 1),
    "Train Ambiguity Score": train_percentiles,
    "Test Ambiguity Score": test_percentiles
})

# Save the combined DataFrame to a CSV file
combined_percentile_data.to_csv("analysis_results/binary-top-2k/Decision_Ambiguity_Analysis/ambiguity_score_cumulative_frequency_bins.csv", index=False)

# Plotting
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
plt.plot(train_percentiles, np.arange(0, 101, 1), label='Train')
plt.plot(test_percentiles, np.arange(0, 101, 1), label='Test')
plt.xlabel("Ambiguity Score")
plt.ylabel("Percentile")
plt.title("Reddit-AITA-Binary-Top-2k Ambiguity Score Cumulative Frequency")
plt.legend()
plt.savefig("analysis_results/binary-top-2k/Decision_Ambiguity_Analysis/ambiguity_scores_cumulative_frequency.png")
plt.show()
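The plots above put the percentile rank on the y-axis against the score found at that percentile. The same curve can be read the other way round: given a score threshold, what fraction of samples falls at or below it. The sketch below shows that reading; `empirical_cdf` is a hypothetical helper and the example scores are made up for illustration.

```python
from bisect import bisect_right

def empirical_cdf(scores, threshold):
    """Fraction of scores that are less than or equal to threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # bisect_right returns how many sorted values are <= threshold.
    return bisect_right(ordered, threshold) / len(ordered)

# Hypothetical ambiguity scores, for illustration only.
example_scores = [0.0, 0.0, 0.1, 0.25, 0.4, 0.9]
share_at_zero = empirical_cdf(example_scores, 0.0)
share_at_quarter = empirical_cdf(example_scores, 0.25)
```

Evaluated at a threshold of 0, this gives the same kind of quantity as the zero-ambiguity proportion computed later in the notebook.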

In [None]:
# Filter the datasets to include only samples with an ambiguity score of 0

multiclass_zero_ambiguity_dataset = multiclass_dataset.filter(lambda x: x['ambiguity_score'] == 0)
multiclass_top_2k_zero_ambiguity_dataset = multiclass_top_2k_dataset.filter(lambda x: x['ambiguity_score'] == 0)
binary_zero_ambiguity_dataset = binary_dataset.filter(lambda x: x['ambiguity_score'] == 0)
binary_top_2k_zero_ambiguity_dataset = binary_top_2k_dataset.filter(lambda x: x['ambiguity_score'] == 0)

In [None]:
import json

# Count zero-ambiguity samples across the train and test splits
zero_ambiguity_count = len(multiclass_zero_ambiguity_dataset['train']) + len(multiclass_zero_ambiguity_dataset['test'])

# Calculate the percentage of all samples that have zero ambiguity
total_count = len(multiclass_dataset['train']) + len(multiclass_dataset['test'])
zero_ambiguity_percentage = round((zero_ambiguity_count / total_count) * 100, 3)

# Store results in a dict and save them as JSON
zero_ambiguity_results = {
    'Number of Samples with Zero Ambiguity': [zero_ambiguity_count],
    'Percentage of Samples with Zero Ambiguity': [zero_ambiguity_percentage]
}

output_file = "analysis_results/multiclass/Decision_Ambiguity_Analysis/zero_ambiguity_analysis_results.json"

with open(output_file, 'w') as file:
    json.dump(zero_ambiguity_results, file, indent=4)


# Multiclass top 2k: count zero-ambiguity samples across the train and test splits
zero_ambiguity_count = len(multiclass_top_2k_zero_ambiguity_dataset['train']) + len(multiclass_top_2k_zero_ambiguity_dataset['test'])

# Calculate the percentage of all samples that have zero ambiguity
total_count = len(multiclass_top_2k_dataset['train']) + len(multiclass_top_2k_dataset['test'])
zero_ambiguity_percentage = round((zero_ambiguity_count / total_count) * 100, 3)

# Store results in a dictionary and save to an output JSON file
zero_ambiguity_results = {
    'Number of Samples with Zero Ambiguity': [zero_ambiguity_count],
    'Percentage of Samples with Zero Ambiguity': [zero_ambiguity_percentage]
}

output_file = "analysis_results/multiclass-top-2k/Decision_Ambiguity_Analysis/zero_ambiguity_analysis_results.json"

with open(output_file, 'w') as file:
    json.dump(zero_ambiguity_results, file, indent=4)


# Binary: count zero-ambiguity samples across the train and test splits
zero_ambiguity_count = len(binary_zero_ambiguity_dataset['train']) + len(binary_zero_ambiguity_dataset['test'])

# Calculate the percentage of all samples that have zero ambiguity
total_count = len(binary_dataset['train']) + len(binary_dataset['test'])
zero_ambiguity_percentage = round((zero_ambiguity_count / total_count) * 100, 3)

# Store results in a dictionary and save to an output JSON file
zero_ambiguity_results = {
    'Number of Samples with Zero Ambiguity': [zero_ambiguity_count],
    'Percentage of Samples with Zero Ambiguity': [zero_ambiguity_percentage]
}

output_file = "analysis_results/binary/Decision_Ambiguity_Analysis/zero_ambiguity_analysis_results.json"

with open(output_file, 'w') as file:
    json.dump(zero_ambiguity_results, file, indent=4)


# Binary top 2k: count zero-ambiguity samples across the train and test splits
zero_ambiguity_count = len(binary_top_2k_zero_ambiguity_dataset['train']) + len(binary_top_2k_zero_ambiguity_dataset['test'])

# Calculate the percentage of all samples that have zero ambiguity
total_count = len(binary_top_2k_dataset['train']) + len(binary_top_2k_dataset['test'])
zero_ambiguity_percentage = round((zero_ambiguity_count / total_count) * 100, 3)

# Store results in a dictionary and save to an output JSON file
zero_ambiguity_results = {
    'Number of Samples with Zero Ambiguity': [zero_ambiguity_count],
    'Percentage of Samples with Zero Ambiguity': [zero_ambiguity_percentage]
}

output_file = "analysis_results/binary-top-2k/Decision_Ambiguity_Analysis/zero_ambiguity_analysis_results.json"

with open(output_file, 'w') as file:
    json.dump(zero_ambiguity_results, file, indent=4)

## Comment Agreement Analysis
- Holistic: Krippendorff's Alpha
  - Key aspects
    - "Krippendorff's alpha coefficient, named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis. Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis."
    - "Krippendorff's alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, reliability of coding given sets of units (as distinct from unitizing) but it also distinguishes itself from statistics that are called reliability coefficients but are unsuitable to the particulars of coding data generated for subsequent analysis."
    - "Krippendorff's alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (note that this is not a metric in the mathematical sense, but often the square of a mathematical metric, see levels of measurement), and it adjusts itself to small sample sizes of the reliability data. The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any numbers of coders, values, different metrics, and unequal sample sizes."
  - [Wiki](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha)
  - [Lecture by Krippendorff on Calculation](https://www.asc.upenn.edu/sites/default/files/2021-03/Computing%20Krippendorff%27s%20Alpha-Reliability.pdf)
  - [Article Explanation](https://www.surgehq.ai/blog/inter-rater-reliability-metrics-an-introduction-to-krippendorffs-alpha)
    - Ranges from -1 to 1: 1 is complete agreement, 0 is agreement no better than random choice, and negative values indicate systematic disagreement
    - A value of 0.8 or higher is conventionally taken to indicate reliable agreement.

- Pairwise: Cohen's Kappa
  - [Wiki](https://en.wikipedia.org/wiki/Cohen%27s_kappa)
  - [Article Explanation](https://towardsdatascience.com/multi-class-metrics-made-simple-the-kappa-score-aka-cohens-kappa-coefficient-bdea137af09c)
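
To make the two metrics concrete, the following is a minimal from-scratch sketch of nominal Krippendorff's alpha (via the coincidence matrix) and Cohen's kappa. It is an illustration only, not the implementation used by the `krippendorff` or `scikit-learn` packages that this notebook calls. The function names `nominal_krippendorff_alpha` and `cohen_kappa` and the toy ratings are hypothetical, and the sketch assumes at least two distinct labels occur in the data (otherwise the expected disagreement is zero and alpha is undefined).

```python
from collections import Counter
from itertools import permutations


def nominal_krippendorff_alpha(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per rated item; None marks a missing rating.
    """
    # Build the coincidence matrix: every ordered pair of ratings within a
    # unit contributes 1 / (m - 1), where m is the unit's number of ratings.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        if len(values) < 2:
            continue  # a unit with fewer than two ratings has no pairs
        weight = 1 / (len(values) - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per label and the overall number of pairable ratings
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal metric: any pair of different labels counts as one disagreement
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    return 1 - observed / expected


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of each rater's label frequencies
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: two raters, ten items, binary labels
rater_1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_2 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

alpha = nominal_krippendorff_alpha(list(zip(rater_1, rater_2)))
kappa = cohen_kappa(rater_1, rater_2)
print(f"alpha = {alpha:.3f}, kappa = {kappa:.3f}")
```

On this toy data the raters agree on 6 of 10 items, yet both scores come out near 0.09, because most of that agreement is what chance alone would produce given how often each rater picks label 0.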

In [None]:
import numpy as np
import pandas as pd  # used by save_cohen_kappa_scores
import krippendorff
from sklearn.metrics import cohen_kappa_score
from itertools import combinations

def get_encoded_classifications(dataset):
    """
    Encodes AITA classifications into numeric values, retaining None values.

    Parameters:
    dataset (list of dictionaries): A huggingface dataset

    Returns:
    list[list]: Lists of numeric classifications, with None where input was None
    """

    # Mapping of AITA classifications to numeric values
    classification_values = {'YTA': 1, 'ESH': 2,
                             'INFO': 3, 'NAH': 4,
                             'NTA': 5}

    # Initialize a list of lists, one for each of the top 10 comments
    top_comments = [[] for _ in range(10)]

    # Iterate over each sample in the dataset
    for sample in dataset:
        # Iterate over the top 10 comments
        for i in range(10):
            key = f'top_comment_{i+1}_classification'
            # Append the classification to the corresponding list
            top_comments[i].append(sample.get(key, None))

    # Convert classifications to their numeric representations, keeping None as is
    top_comments_encoded = []
    for i in range(len(top_comments)):
        encoded_comment = [classification_values.get(c, None) for c in top_comments[i]]
        top_comments_encoded.append(encoded_comment)
    return top_comments_encoded


def calculate_krippendorffs_alpha(dataset):
  """
  Calculates Krippendorff's alpha for a given dataset.

  Parameters:
  dataset (list of dictionaries): A huggingface dataset.

  Returns:
  float: Krippendorff's alpha score.
  """

  # Encode top comment classifications
  top_comments_encoded = get_encoded_classifications(dataset)

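  # Build a raters x units matrix (rows are comment ranks, columns are samples),
  # with missing classifications as NaN, then calculate and return Krippendorff's alpha.
  # Note: krippendorff.alpha defaults to level_of_measurement='interval', so the
  # 1-5 codes are treated as an equally spaced scale rather than unordered labels.
  data = np.array([[np.nan if x is None else x for x in sublist] for sublist in top_comments_encoded], dtype=float)
  return krippendorff.alpha(data)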


def calculate_cohen_kappa(dataset):
  """
  Calculates Cohen's Kappa score for a given dataset.

  Parameters:
  dataset (list of dictionaries): A huggingface dataset.

  Returns:
  dict: A dictionary of Cohen's Kappa scores for each pair of top comments.
  """

  # encode top comment classifications
  top_comments_encoded = get_encoded_classifications(dataset)

  scores = {}
  # Enumerate the lists so each pair keeps its true position; looking the
  # position up with list.index() would return the wrong rank whenever two
  # comment ranks have identical classification lists
  for (index1, list1), (index2, list2) in combinations(enumerate(top_comments_encoded), 2):
      # Keep only samples where both comments have a classification
      filtered_list1 = []
      filtered_list2 = []
      for true, pred in zip(list1, list2):
          if true is not None and pred is not None:
              filtered_list1.append(true)
              filtered_list2.append(pred)
      score = cohen_kappa_score(filtered_list1, filtered_list2)
      key = (f"top_comment_{index1 + 1}", f"top_comment_{index2 + 1}")
      scores[key] = score
  return scores


def save_cohen_kappa_scores(cohen_kappa_scores, output_file):
  """
  Saves Cohen's Kappa scores to a CSV file.

  Parameters:
  cohen_kappa_scores (dict): A dictionary of Cohen's Kappa scores.
  output_file (str): The path to the output CSV file.
  """

  # create a list of column and row names
  comments = [f"top_comment_{i}" for i in range(1, 11)]

  # create an empty dataframe and fill with scores
  df = pd.DataFrame(index=comments, columns=comments)
  for (comment1, comment2), score in cohen_kappa_scores.items():
      df.at[comment1, comment2] = round(score, 3)

  # set the lower triangle to NaN, including the diagonal
  for i in range(len(df)):
      for j in range(i + 1):
          df.iat[i, j] = np.nan

  # save the dataframe
  df.to_csv(output_file, index=True)

In [None]:
import json

# Calculate and save Krippendorff's alphas for both train and test datasets
krippendorffs_alpha = {
    'train': calculate_krippendorffs_alpha(multiclass_dataset['train']),
    'test': calculate_krippendorffs_alpha(multiclass_dataset['test'])
}
output_file_alpha = "analysis_results/multiclass/Comment_Agreement_Analysis/krippendorffs_alpha.json"
with open(output_file_alpha, "w") as f:
    json.dump(krippendorffs_alpha, f)

# Calculate and save Cohen's Kappa scores for both train and test datasets
cohen_kappa_scores = {
    'train': {str(key): value for key, value in calculate_cohen_kappa(multiclass_dataset['train']).items()},
    'test': {str(key): value for key, value in calculate_cohen_kappa(multiclass_dataset['test']).items()}
}
output_file_kappa = "analysis_results/multiclass/Comment_Agreement_Analysis/cohen_kappa_scores.json"
with open(output_file_kappa, "w") as f:
    json.dump(cohen_kappa_scores, f)


# Calculate and save Krippendorff's alphas for both train and test datasets
krippendorffs_alpha = {
    'train': calculate_krippendorffs_alpha(multiclass_top_2k_dataset['train']),
    'test': calculate_krippendorffs_alpha(multiclass_top_2k_dataset['test'])
}
output_file_alpha = "analysis_results/multiclass-top-2k/Comment_Agreement_Analysis/krippendorffs_alpha.json"
with open(output_file_alpha, "w") as f:
    json.dump(krippendorffs_alpha, f)

# Calculate and save Cohen's Kappa scores for both train and test datasets
cohen_kappa_scores = {
    'train': {str(key): value for key, value in calculate_cohen_kappa(multiclass_top_2k_dataset['train']).items()},
    'test': {str(key): value for key, value in calculate_cohen_kappa(multiclass_top_2k_dataset['test']).items()}
}
output_file_kappa = "analysis_results/multiclass-top-2k/Comment_Agreement_Analysis/cohen_kappa_scores.json"
with open(output_file_kappa, "w") as f:
    json.dump(cohen_kappa_scores, f)


# Calculate and save Krippendorff's alphas for both train and test datasets
krippendorffs_alpha = {
    'train': calculate_krippendorffs_alpha(binary_dataset['train']),
    'test': calculate_krippendorffs_alpha(binary_dataset['test'])
}
output_file_alpha = "analysis_results/binary/Comment_Agreement_Analysis/krippendorffs_alpha.json"
with open(output_file_alpha, "w") as f:
    json.dump(krippendorffs_alpha, f)

# Calculate and save Cohen's Kappa scores for both train and test datasets
cohen_kappa_scores = {
    'train': {str(key): value for key, value in calculate_cohen_kappa(binary_dataset['train']).items()},
    'test': {str(key): value for key, value in calculate_cohen_kappa(binary_dataset['test']).items()}
}
output_file_kappa = "analysis_results/binary/Comment_Agreement_Analysis/cohen_kappa_scores.json"
with open(output_file_kappa, "w") as f:
    json.dump(cohen_kappa_scores, f)


# Calculate and save Krippendorff's alphas for both train and test datasets
krippendorffs_alpha = {
    'train': calculate_krippendorffs_alpha(binary_top_2k_dataset['train']),
    'test': calculate_krippendorffs_alpha(binary_top_2k_dataset['test'])
}
output_file_alpha = "analysis_results/binary-top-2k/Comment_Agreement_Analysis/krippendorffs_alpha.json"
with open(output_file_alpha, "w") as f:
    json.dump(krippendorffs_alpha, f)

# Calculate and save Cohen's Kappa scores for both train and test datasets
cohen_kappa_scores = {
    'train': {str(key): value for key, value in calculate_cohen_kappa(binary_top_2k_dataset['train']).items()},
    'test': {str(key): value for key, value in calculate_cohen_kappa(binary_top_2k_dataset['test']).items()}
}
output_file_kappa = "analysis_results/binary-top-2k/Comment_Agreement_Analysis/cohen_kappa_scores.json"
with open(output_file_kappa, "w") as f:
    json.dump(cohen_kappa_scores, f)