Assuming all the hate speech (HS) and counter speech (CS) datasets referenced in the thesis have been downloaded, this notebook cleans them and merges them. For CS, it also groups the messages of each dialogue into pairs. The outputs are the files LabeledHateTrainingDataset.csv and LabeledDialoguesForCounterTraining.csv.

In [1]:
# Standard Library Imports
import os
import json
import re
from collections import Counter

# Third-Party Library Imports
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy
import torch
from torch.utils.data import DataLoader, Subset

# Hugging Face Transformers and Datasets
import datasets
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification, 
    AutoTokenizer, 
    Trainer, 
    TrainingArguments, 
    DataCollatorWithPadding, 
    EarlyStoppingCallback, 
    pipeline
)




# Bring Labeled Datasets of Hate Speech Together

## Create Datasets from each source

### Ethos Dataset
https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset/tree/master

Provides a total of 94 hate comments

In [6]:
ethos_file_path = '../../DATA_Data/Ethos_Dataset_Multi_Label.csv'
ethos_data = pd.read_csv(ethos_file_path,sep=';')
ethos_filtered_data = ethos_data[ethos_data['sexual_orientation'] > 0]
ethos_dataset = ethos_filtered_data[['comment']]
ethos_dataset = ethos_dataset.rename(columns={'comment': 'text'})
ethos_dataset['label'] = 1

### FRENK-hate-en Dataset
- The test split has 1017 rows.
- The train split has 4337 rows.
- The dev split has 482 rows.

That makes a total of 5836 rows. Only the 4337 training rows are used here.

https://huggingface.co/datasets/classla/FRENK-hate-en?row=2

In [7]:
FRENKT_train_path = '../../DATA_Data/train.tsv'

columns = ['ID', 'Comment', 'Background', 'Offensive', 'Target', 'Category']
FRENK_train_dataset = pd.read_csv(FRENKT_train_path, sep='\t', header=None, names=columns)

FRENK_train_lgbt = FRENK_train_dataset[FRENK_train_dataset['Category'] == 'lgbt'].copy()
FRENK_train_lgbt.loc[:, 'Label'] = (FRENK_train_lgbt['Offensive'] == 'Offensive').astype(int)
FRENK_train_lgbt = FRENK_train_lgbt[['Comment', 'Label']]
FRENK_train_lgbt = FRENK_train_lgbt.rename(columns={
    'Comment': 'text',
    'Label': 'label'
})

### Dataset-for-Identification-of-Queerphobia
It has a total of 10000 rows: https://github.com/ShivumB/Dataset-for-Identification-of-Queerphobia/tree/main 

Paper: https://www.researchgate.net/publication/370504802_Dataset_for_identification_of_queerphobia

In [8]:
queer_phobia_file_path = '../../DATA_Data/queerPhobia.csv'

queer_phobia_dataset= pd.read_csv(queer_phobia_file_path)
queer_phobia_dataset = queer_phobia_dataset.rename(columns={
    'classification':'label'
})


### hatecheck-data

It has a total of 1014 hate messages directed towards gay people or trans people

https://github.com/paul-rottger/hatecheck-data/tree/main

Paper: https://aclanthology.org/2021.acl-long.4.pdf

In [9]:
hatecheck_file_path = '../../DATA_Data/test_suite_cases.csv'

hatecheck_data = pd.read_csv(hatecheck_file_path)
hatecheck_data_lgbtq = hatecheck_data[hatecheck_data['target_ident'].isin(['gay people', 'trans people'])]

hatecheck_dataset = hatecheck_data_lgbtq[['test_case', 'label_gold']].copy()
hatecheck_dataset.rename(columns={'test_case': 'text', 'label_gold': 'label'}, inplace=True)
hatecheck_dataset['label'] = (hatecheck_dataset['label'] == 'hateful').astype(int)

### HateXplain data

A total of 1940 sentences: https://github.com/hate-alert/HateXplain/tree/master/Data

Paper: HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection

In [10]:
with open('../../DATA_Data/dataset.json', 'r') as file:
    HateXplain_data = json.load(file)

for post_id, post_data in HateXplain_data.items():
    post_data['post_tokens'] = ' '.join(post_data['post_tokens'])

unique_targets = set()

for post_id, post_data in HateXplain_data.items():
    for annotator in post_data['annotators']:
        unique_targets.update(annotator['target'])

HateXplain_filtered_data = {}

for post_id, post_data in HateXplain_data.items():
    count = sum(1 for annotator in post_data['annotators'] if any(target in annotator['target'] for target in ['Homosexual', 'Asexual', 'Bisexual']))
    if count >= 2:
        HateXplain_filtered_data[post_id] = post_data

for post_id, post_data in HateXplain_filtered_data.items():
    post_data.pop('post_id', None)
    post_data.pop('rationales', None)
    target_vectors = []
    label_vector = []
    for annotator in post_data['annotators']:
        target_vectors.extend(annotator['target'])
        label_vector.append(annotator['label'])
    post_data["label"] = label_vector
    post_data['targets'] = target_vectors
    post_data.pop('annotators', None)

data_list = [{'text': data['post_tokens'], 'label': data['label']} for key, data in HateXplain_filtered_data.items()]

HateXplain_dataset = pd.DataFrame(data_list)

def sort_and_join(label):
    return ', '.join(sorted(label))

unique_sorted_labels = HateXplain_dataset['label'].apply(sort_and_join).unique()

def most_common_label(label):
    if not label:
        return "hatespeech"
    counts = Counter(label)
    return counts.most_common(1)[0][0]

HateXplain_dataset['label'] = HateXplain_dataset['label'].apply(most_common_label)

def map_label(label):
    return 0 if label == "normal" else 1

HateXplain_dataset['label'] = HateXplain_dataset['label'].apply(map_label)
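
As a quick check of the majority-vote step above, the sketch below runs the same two functions on hypothetical annotator votes (the sample lists are invented for illustration and are not taken from the dataset). One detail worth keeping in mind: `Counter.most_common` orders equal counts by first appearance, so a three-way disagreement resolves to the first annotator's label.

```python
from collections import Counter

def most_common_label(labels):
    """Return the majority label; fall back to 'hatespeech' for an empty list."""
    if not labels:
        return "hatespeech"
    return Counter(labels).most_common(1)[0][0]

def map_label(label):
    """Collapse the three classes into binary: 'normal' -> 0, anything else -> 1."""
    return 0 if label == "normal" else 1

# Hypothetical annotator votes, invented for illustration
samples = [
    ["normal", "normal", "offensive"],          # clear majority: normal
    ["hatespeech", "offensive", "normal"],      # three-way tie: first label wins
    ["offensive", "hatespeech", "hatespeech"],  # clear majority: hatespeech
]

binary_labels = [map_label(most_common_label(votes)) for votes in samples]
print(binary_labels)  # [0, 1, 1]
```

Because both "offensive" and "hatespeech" map to 1, a tie only changes the binary label when "normal" is one of the tied candidates.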

### SBIC

2770 rows

https://maartensap.com/social-bias-frames/ 

In [11]:
SBIC_train = pd.read_csv("../../DATA_Data/SBIC.v2/SBIC.v2.trn.csv")
SBIC_dev = pd.read_csv("../../DATA_Data/SBIC.v2/SBIC.v2.dev.csv")
SBIC_tst = pd.read_csv("../../DATA_Data/SBIC.v2/SBIC.v2.tst.csv")

SBIC_train = pd.concat([SBIC_train, SBIC_dev, SBIC_tst], ignore_index=True)

columns_to_remove = ["annotatorGender", "annotatorMinority", "speakerMinorityYN","WorkerId", "HITId",
                     "annotatorPolitics", "annotatorRace", "annotatorAge", "dataSource", "sexReason",
                     "whoTarget", "intentYN", "sexYN", "targetCategory"]

# Drop the annotator metadata and other columns that are not needed
SBIC_train = SBIC_train.drop(columns_to_remove, axis=1)

groups_to_include = ["gay men", "lesbian women", "lesbian women, gay men", "women, gay men",
                         "gay men, trans women, trans men, bisexual women, bisexual men",
                         "women, lesbian women, trans women, bisexual women",
                         "lesbian women, gay men, trans women, trans men, bisexual women, bisexual men",
                         "trans women, trans men", "gays", "trans men", "Non - binary",
                         "women, trans men", "trans women", "bisexual women, bisexual men", "women, trans women",
                         "women, gay men", "women, lesbian women", "lesbian women, gay men, bisexual women, bisexual men",
                         "bisexual women", "trans people", "gay people", "asexual people"]

keywords = ["gay", "lesbian", "trans", "transgender", "bisexual", "asexual", "Non-binary", "Non - binary", "queer",
            "Queer", "Demi-queer", "Fluid", "fluid", "Pan Sexual"]

SBIC_filtered_df = SBIC_train[SBIC_train['targetMinority'].isin(groups_to_include)]

for keyword in keywords:
    SBIC_filtered_df = pd.concat([
        SBIC_filtered_df, 
        SBIC_train[SBIC_train['targetMinority'].str.contains(keyword, case=False, na=False)]
    ], ignore_index=True)

SBIC_filtered_df = SBIC_filtered_df.drop_duplicates()
SBIC_filtered_df = SBIC_filtered_df[SBIC_filtered_df['offensiveYN'] == 1]
SBIC_dataset = pd.DataFrame({
    'text': pd.concat([SBIC_filtered_df['post'], SBIC_filtered_df['targetStereotype']]).dropna().unique()
})

SBIC_dataset = SBIC_dataset.drop_duplicates()
SBIC_dataset["label"] = 1
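
The keyword loop above concatenates one filtered copy of the frame per keyword and then removes the duplicates. A possible alternative, sketched here on an invented toy frame with a shortened keyword list, is to build a single case-insensitive pattern and apply it once. This is only an illustration of the idea, not the code used to produce the reported row count.

```python
import re
import pandas as pd

# Toy stand-in for the SBIC frame; these rows are invented for illustration
toy = pd.DataFrame({
    "post": ["a", "b", "c", "d"],
    "targetMinority": ["gay men", "black folks", None, "Trans women"],
})

keywords = ["gay", "lesbian", "trans", "bisexual", "asexual", "queer"]

# One case-insensitive pattern replaces the per-keyword concat loop
pattern = "|".join(re.escape(k) for k in keywords)
mask = toy["targetMinority"].str.contains(pattern, case=False, na=False)

filtered = toy[mask]
print(filtered["post"].tolist())  # ['a', 'd']
```

Either way this is substring matching, so a keyword such as "trans" also matches any longer word that contains it; the filtered groups are worth inspecting by hand.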

### CONAN

465 rows 

https://github.com/marcoguerini/CONAN/tree/master

In [12]:
conan_DATA = pd.read_csv("../../DATA_Data/Multitarget-CONAN.csv")


conan_DATA_LGBTQ = conan_DATA[conan_DATA["TARGET"] == "LGBT+"]
conan_DATA_LGBTQ = conan_DATA_LGBTQ[["HATE_SPEECH"]]
conan_DATA_LGBTQ["label"] = 1
conan_DATA_LGBTQ = conan_DATA_LGBTQ.rename(columns={'HATE_SPEECH': 'text'})
conan_DATA_LGBTQ = conan_DATA_LGBTQ.drop_duplicates()

### ucberkeley-dlab

https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech

In [13]:
berkeley_dataset = datasets.load_dataset('ucberkeley-dlab/measuring-hate-speech', 'default')   
berkeley_dataframe = berkeley_dataset['train'].to_pandas()


columns_to_keep = [
    "hate_speech_score",
    "text",
    "target_gender_non_binary",
    "target_gender_transgender_men",
    "target_gender_transgender_unspecified",
    "target_gender_transgender_women",
    "target_gender_other",
    "target_sexuality_bisexual",
    "target_sexuality_gay",
    "target_sexuality_lesbian",
    "target_sexuality_other",
    "target_sexuality",
]

berkeley_dataframe = berkeley_dataframe[columns_to_keep]
columns_to_check = [
    'target_gender_non_binary', 'target_gender_transgender_men',
    'target_gender_transgender_unspecified', 'target_gender_transgender_women',
    'target_gender_other', 'target_sexuality_bisexual', 'target_sexuality_gay',
    'target_sexuality_lesbian', 'target_sexuality_other', "target_sexuality"
]

# Create a condition that checks if any of the specified columns have True
condition = berkeley_dataframe[columns_to_check].any(axis=1)

# Filter the DataFrame based on the condition
filtered_berkeley_dataframe = berkeley_dataframe[condition]

# Take explicit copies so that later column assignments do not act on a slice
berkeley_hate_speech_dataset = filtered_berkeley_dataframe[["hate_speech_score", "text"]].copy()
berkeley_counter_speech_dataset = filtered_berkeley_dataframe[["hate_speech_score", "text"]].copy()



In [14]:
# Binarize the continuous score: values above 0.5 are labeled as hate speech
berkeley_hate_speech_dataset['hate_speech_score'] = berkeley_hate_speech_dataset['hate_speech_score'].apply(lambda x: 1 if x > 0.5 else 0)
berkeley_hate_speech_dataset = berkeley_hate_speech_dataset.rename(columns={'hate_speech_score': 'label'})
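
The thresholding in this cell can also be written as a vectorized comparison. The sketch below uses invented scores to show how the boundary behaves: a score of exactly 0.5 is labeled 0, because the comparison is strict.

```python
import pandas as pd

# Toy scores invented for illustration; the real column is continuous
toy = pd.DataFrame({
    "hate_speech_score": [-1.2, 0.5, 0.51, 2.3],
    "text": ["t1", "t2", "t3", "t4"],
})

labeled = toy.copy()  # explicit copy, so the assignment never targets a slice
labeled["label"] = (labeled["hate_speech_score"] > 0.5).astype(int)
labeled = labeled.drop(columns="hate_speech_score")

print(labeled["label"].tolist())  # [0, 0, 1, 1]
```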


## Join all labeled hate speech datasets
30471 texts:
- 19119 non hate speech
- 11352 hate speech

In [24]:
final_hate_df = pd.concat([ethos_dataset, FRENK_train_lgbt, queer_phobia_dataset, 
                         hatecheck_dataset, HateXplain_dataset, SBIC_dataset,
                         conan_DATA_LGBTQ,berkeley_hate_speech_dataset], ignore_index=True)
final_hate_df = final_hate_df.drop_duplicates(subset=["text"])
final_hate_df.to_csv('LabeledHateTrainingDataset.csv', index=False)
final_hate_df

Unnamed: 0,text,label
0,why haven't you faggots taken the bi-pill yet,1
1,Why does Bulgaria exist as a country? It's one...,1
2,what happened in the good old days when these ...,1
3,transgenders need help...they are sick in the ...,1
4,Trans are using the same logic feminists have ...,1
...,...,...
44614,Lailat al Miraj mubarak to all Muslims.. I'm w...,0
44840,Notwithstanding Marriyum Aurangzeb sahiba's po...,0
44949,"I, a Catholic and a Jesuit, am grateful for th...",0
45777,NUGS lauds Ghanaian Muslims for their immense ...,0
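
Since `drop_duplicates(subset=["text"])` keeps the first occurrence, a text that appears in two sources with different labels silently takes the label of whichever source comes first in the `pd.concat` list. The sketch below, on two invented sources, shows one way to surface such conflicts and to recompute the class counts after de-duplication. It also resets the index, which the cell above does not do; that is why the displayed index labels run higher than the number of rows.

```python
import pandas as pd

# Two hypothetical sources that share one text with conflicting labels
source_a = pd.DataFrame({"text": ["x", "y"], "label": [1, 0]})
source_b = pd.DataFrame({"text": ["y", "z"], "label": [1, 0]})

merged = pd.concat([source_a, source_b], ignore_index=True)

# Texts that appear with more than one distinct label across sources
label_variety = merged.groupby("text")["label"].nunique()
conflicting_texts = label_variety[label_variety > 1].index.tolist()

# Same de-duplication as above: the first occurrence (source_a's label) is kept
deduped = merged.drop_duplicates(subset=["text"]).reset_index(drop=True)
class_counts = deduped["label"].value_counts().to_dict()

print(conflicting_texts)  # ['y']
print(class_counts)       # two texts labeled 0, one labeled 1
```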


# Bring Labeled Datasets of Counter Speech Together

4213 conversations, each with two comments

In [25]:
dialoconan = pd.read_csv("../../CODES_context/Data/DIALOCONAN.csv")
dialoconan =  dialoconan[["text", "TARGET", "dialogue_id", "turn_id", "type"]]
dialoconan = dialoconan[dialoconan["TARGET"].isin(["LGBT+", "WOMEN/LGBT+"])]


multitarget_conan = pd.read_csv("../../CODES_context/Data/Multitarget-CONAN.csv")
multitarget_conan = multitarget_conan[["INDEX", "HATE_SPEECH", "COUNTER_NARRATIVE", "TARGET"]]
multitarget_conan = multitarget_conan.rename(columns={
    'INDEX': 'dialogue_id',
    'HATE_SPEECH': 'hateSpeech',
    'COUNTER_NARRATIVE': 'counterSpeech',
    'TARGET': 'TARGET'
})

rows = []

for index, row in multitarget_conan.iterrows():
    # Convert dialogue_id to an integer and add 3049
    new_dialogue_id = str(int(row['dialogue_id']) + 3049)

    # Create a row for hateSpeech
    hs_row = {
        'dialogue_id': new_dialogue_id,
        'TARGET': row['TARGET'],
        'text': row['hateSpeech'],
        'turn_id': 0,
        'type': 'HS'
    }
    rows.append(hs_row)

    # Create a row for counterSpeech
    cn_row = {
        'dialogue_id': new_dialogue_id,
        'TARGET': row['TARGET'],
        'text': row['counterSpeech'],
        'turn_id': 1,
        'type': 'CN'
    }
    rows.append(cn_row)

# Convert the list of dictionaries to a DataFrame
transformed_multitarget_conan = pd.DataFrame(rows)
transformed_multitarget_conan = transformed_multitarget_conan[transformed_multitarget_conan["TARGET"].isin(["LGBT+"])]
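
The row-by-row loop above can be expressed without `iterrows`. The following is a sketch on an invented two-row frame, assuming the same column names as after the rename; `ID_OFFSET` is a name introduced here for the constant 3049. It filters on `TARGET` first, so only the relevant pairs are expanded.

```python
import pandas as pd

# Toy stand-in for the renamed Multitarget-CONAN frame; rows are invented
toy = pd.DataFrame({
    "dialogue_id": [0, 1],
    "hateSpeech": ["hs0", "hs1"],
    "counterSpeech": ["cn0", "cn1"],
    "TARGET": ["LGBT+", "MIGRANTS"],
})

ID_OFFSET = 3049  # same offset as in the loop above

# Filter first, then shift the ids and convert them to strings
toy = toy[toy["TARGET"] == "LGBT+"]
new_ids = (toy["dialogue_id"].astype(int) + ID_OFFSET).astype(str)

hs_rows = pd.DataFrame({"dialogue_id": new_ids, "TARGET": toy["TARGET"],
                        "text": toy["hateSpeech"], "turn_id": 0, "type": "HS"})
cn_rows = pd.DataFrame({"dialogue_id": new_ids, "TARGET": toy["TARGET"],
                        "text": toy["counterSpeech"], "turn_id": 1, "type": "CN"})

# A stable sort on the shared index puts each HS row directly before its CN row
long_format = (pd.concat([hs_rows, cn_rows])
               .sort_index(kind="mergesort")
               .reset_index(drop=True))
print(long_format)
```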

In [26]:
combined_dataframe = pd.concat([transformed_multitarget_conan, dialoconan], ignore_index=True)
combined_dataframe['dialogue_id'] = combined_dataframe['dialogue_id'].astype(str)

grouped = combined_dataframe.groupby('dialogue_id')['text'].apply(lambda x: tuple(x)).reset_index()

# Find duplicate text sets
duplicates = grouped[grouped.duplicated(subset=['text'], keep=False)]
# Group by the text tuples and find dialogue_id pairs
duplicate_pairs = duplicates.groupby('text')['dialogue_id'].apply(list).reset_index()

# Print the pairs of dialogues that are duplicated along with their texts
for index, row in duplicate_pairs.iterrows():
    dialogue_ids = row['dialogue_id']
    texts = row['text']
    if len(dialogue_ids) > 1:
        print(f"Duplicated text set {index + 1}:")
        for dialogue_id in dialogue_ids:
            print(f" - dialogue_id: {dialogue_id}")
            # Print the texts within each dialogue
            dialogue_texts = combined_dataframe[combined_dataframe['dialogue_id'] == dialogue_id]['text'].tolist()
            for text in dialogue_texts:
                print(f"   * {text}")
        print("\n")

# Identify all dialogue_ids to be removed (keep only the last occurrence)
to_remove = []
for index, row in duplicate_pairs.iterrows():
    dialogue_ids = row['dialogue_id']
    if len(dialogue_ids) > 1:
        to_remove.extend(dialogue_ids[:-1])  # Keep the last occurrence, mark the rest for removal

# Remove the identified duplicates from the original dataframe
cleaned_dataframe = combined_dataframe[~combined_dataframe['dialogue_id'].isin(to_remove)]

Duplicated text set 1:
 - dialogue_id: 29
   * LGBT history month now being taught in schools… I’m sick of this being shoved down our throats, what happened to British, European and World history being taught? Identity politics gone mad.
   * We still can teach British, European and World history, but part of that is looking at the history of groups which have been erased from our history previously. Looking at the some of the historic struggles of LGBT people in the UK can inform the future for everyone!
   * Do we have to discuss the sexuality of everyone in history when it’s largely irrelevant? If they were persecuted for it then fair enough, stick it in there and say how wrong it was, otherwise leave well alone.
   * LGBT people have been persecuted until recently in the UK, and it's still illegal to be gay in over 70 countries. I think it's important that people see that there have always been LGBT people, throughout history and that LGBT identities are not 'new' or a 'trend', the
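
The duplicate check compares whole dialogues: two dialogue ids count as duplicates only if their full sequences of texts are identical. A compact sketch of the same idea on invented data is shown below. It uses `Series.duplicated(keep="first")`, so it keeps the first id of each duplicated group, whereas the cell above drops `dialogue_ids[:-1]` and therefore keeps the last one.

```python
import pandas as pd

# Toy long-format dialogues, invented for illustration; "1" and "2" are identical
toy = pd.DataFrame({
    "dialogue_id": ["1", "1", "2", "2", "3", "3"],
    "text": ["hs A", "cn A", "hs A", "cn A", "hs B", "cn B"],
})

# One tuple of texts per dialogue, in row order
signatures = toy.groupby("dialogue_id")["text"].apply(lambda s: tuple(s))

# Flag every repeat of a signature except the first copy
is_repeat = signatures.duplicated(keep="first")
to_remove = signatures[is_repeat].index.tolist()

cleaned = toy[~toy["dialogue_id"].isin(to_remove)]
print(to_remove)                                # ['2']
print(cleaned["dialogue_id"].unique().tolist())  # ['1', '3']
```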

In [11]:
# Split each dialogue into the turn pairs of interest and label every turn (1 for CN, 0 otherwise)
subdialogue_combinations = [(0, 1), (2, 3), (4, 5), (6,7), (1,2), (3,4), (5,6), (0,2), (2,4), (4,6)]
subdialogues = []

for dialogue_id in cleaned_dataframe['dialogue_id'].unique():
    subdialogue_counter = 1
    dialogue_df = cleaned_dataframe[cleaned_dataframe['dialogue_id'] == dialogue_id]

    for combo in subdialogue_combinations:
        if all(turn_id in dialogue_df['turn_id'].values for turn_id in combo):
            subdialogue = dialogue_df[dialogue_df['turn_id'].isin(combo)].copy()
            new_dialogue_id = f"{dialogue_id}_{subdialogue_counter}"
            subdialogue['dialogue_id'] = new_dialogue_id
            subdialogue_counter += 1
            subdialogues.append(subdialogue)

subdialogues_df = pd.concat(subdialogues).reset_index(drop=True)
subdialogues_df = subdialogues_df.drop(columns=['TARGET'])

subdialogues_df['label'] = subdialogues_df['type'].apply(lambda x: 1 if x == 'CN' else 0)
subdialogues_df.to_csv("LabeledDialoguesForCounterTraining.csv")

Unnamed: 0,dialogue_id,text,turn_id,type,label
0,3073_1,UK Education Secretary: ‘All children from the...,0,HS,0
1,3073_1,We are always developing as time goes on and s...,1,CN,1
2,3076_1,Since when has abusing the anus as a sex organ...,0,HS,0
3,3076_1,To recognise and accept practices that we don'...,1,CN,1
4,3079_1,UK Education Secretary: ‘All children from the...,0,HS,0
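
To see how many sub-dialogues a given dialogue contributes, the pair selection can be isolated from the DataFrame logic. The sketch below reuses the list of turn-id pairs from the cell above; `applicable_pairs` is a helper name introduced here for illustration.

```python
# Turn-id pairs copied from the cell above
subdialogue_combinations = [(0, 1), (2, 3), (4, 5), (6, 7), (1, 2),
                            (3, 4), (5, 6), (0, 2), (2, 4), (4, 6)]

def applicable_pairs(turn_ids):
    """Return the pairs whose two turns both exist in the dialogue."""
    present = set(turn_ids)
    return [pair for pair in subdialogue_combinations
            if all(turn in present for turn in pair)]

# A two-turn dialogue (such as a Multitarget-CONAN pair) yields one sub-dialogue
two_turns = applicable_pairs([0, 1])
print(two_turns)   # [(0, 1)]

# A hypothetical four-turn dialogue yields four
four_turns = applicable_pairs([0, 1, 2, 3])
print(four_turns)  # [(0, 1), (2, 3), (1, 2), (0, 2)]
```

Note that pairs such as (0, 2) skip the intermediate turn, so not every sub-dialogue consists of adjacent messages.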
