# HateXplain Dataset Preparation

The latest version of the Hugging Face `datasets` library did not work in my Colab session. It misidentifies the local file system as a remote file system and raises `NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.` Downgrading to `datasets==2.16.0` resolved the error. I also downgraded NumPy to `numpy==1.26.0` because that version is compatible with the trainer. After downgrading these libraries, Colab will ask you to restart the session.

In [None]:
!pip install datasets==2.16.0
!pip install numpy==1.26.0

Collecting datasets==2.16.0
  Downloading datasets-2.16.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.16.0)
  Downloading pyarrow_hotfix-0.7-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets==2.16.0)
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.16.0-py3-none-any.whl (507 kB)
Downloading fsspec-2023.10.0-py3-none-any.whl (166 kB)
Downloading pyarrow_hotfix-0.7-py3-none-any.whl (7.9 kB)
Installing collected packages: pyarrow-hotfix, fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninsta

ERROR: Operation cancelled by user
^C

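After the session restarts, it may be worth confirming that the pinned versions are the ones actually installed. The sketch below uses only the standard library. `installed_version` is a hypothetical helper name introduced here for illustration, and it returns `None` for a package that is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The pins are the ones used in the install cell above.
expected = {"datasets": "2.16.0", "numpy": "1.26.0"}
for package, pinned in expected.items():
    found = installed_version(package)
    status = "OK" if found == pinned else "MISMATCH"
    print(f"{package}: expected {pinned}, found {found} -> {status}")
```

A `MISMATCH` line usually means the install was interrupted or the session was not restarted.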

It is good practice to cache the dataset in a local directory instead of Google Drive, because Drive I/O is slow and can trigger a 'slow disk access' warning from `load_dataset()`. This function loads not only the data but also the metadata, which is important for proper data processing.

In [None]:
from datasets import load_dataset

ds = load_dataset("literAlbDev/hatexplain", cache_dir="/taher/ds")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Let's inspect the type and content of the `ds` variable. Notice that `ds` is a `DatasetDict`, a dictionary whose keys are split names and whose values are `Dataset` objects. Also notice that each split is stored in the `.arrow` format, which comes from Apache Arrow.

In [None]:
print("Data type:\n", type(ds))
print("Location of the cache file:\n", ds.cache_files)
print("Content of the ds object:\n", ds)
print("Keys in ds:\n", ds.keys())
print("Values in ds:\n", ds.values())
print("ID of the first train example:\n", ds['train']['id'][0])

Data type:
 <class 'datasets.dataset_dict.DatasetDict'>
Location of the cache file:
 {'train': [{'filename': '/taher/ds/literAlbDev___hatexplain/default/0.0.0/98fc2dd7b29744257acca45cbe4457cbdae2b979/hatexplain-train.arrow'}], 'validation': [{'filename': '/taher/ds/literAlbDev___hatexplain/default/0.0.0/98fc2dd7b29744257acca45cbe4457cbdae2b979/hatexplain-validation.arrow'}], 'test': [{'filename': '/taher/ds/literAlbDev___hatexplain/default/0.0.0/98fc2dd7b29744257acca45cbe4457cbdae2b979/hatexplain-test.arrow'}]}
Content of the ds object:
 DatasetDict({
    train: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 15383
    })
    validation: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 1922
    })
    test: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 1924
    })
})
Keys in ds:
 dict_keys(['train', 'validation', 'test'])
Values in ds:
 dic
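
As a quick sanity check on the printout, the split sizes can be turned into proportions. The row counts below are copied by hand from the output above. With `ds` loaded, `ds[name].num_rows` gives the same values.

```python
# Row counts copied from the DatasetDict printout above.
split_sizes = {"train": 15383, "validation": 1922, "test": 1924}

total = sum(split_sizes.values())
shares = {name: 100 * n / total for name, n in split_sizes.items()}

print("Total examples:", total)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The three splits add up to 19229 examples, split roughly 80/10/10.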

`ds` contains the train, test, and validation splits. Each split can be accessed much like a pandas DataFrame. To inspect the data further, we first convert each split to a pandas DataFrame and then print the first three rows of each.

In [None]:
df1 = ds['train'].to_pandas()
df2 = ds['test'].to_pandas()
df3 = ds['validation'].to_pandas()
print("Train data:\n", df1.head(3))
print("\n\n")
print("Test data:\n", df2.head(3))
print("\n\n")
print("Validation data:\n", df3.head(3))

Train data:
                             id  \
0                 23107796_gab   
1                  9995600_gab   
2  1227920812235051008_twitter   

                                          annotators  \
0  {'label': [0, 2, 2], 'annotator_id': [203, 204...   
1  {'label': [2, 2, 0], 'annotator_id': [27, 6, 4...   
2  {'label': [2, 2, 2], 'annotator_id': [209, 203...   

                                          rationales  \
0  [[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...   
1  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...   
2  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...   

                                         post_tokens  
0  [u, really, think, i, would, not, have, been, ...  
1  [the, uk, has, threatened, to, return, radioac...  
2  [if, english, is, not, imposition, then, hindi...  



Test data:
                             id  \
0  1178610029273976833_twitter   
1  1165785686903009283_twitter   
2  1252707503441313794_twitter   

                            

In the train, test, and validation splits, the order of the labels is very important. From the output of the code below, we can see that the class labels are ordered as ['hatespeech', 'normal', 'offensive']. This means that in the dataset
0 = hatespeech
1 = normal and
2 = offensive

In [None]:
f1 = ds['train'].features
f2 = ds['test'].features
f3 = ds['validation'].features
print("Train features:\n", f1)
print("\n\n")
print("Test features:\n", f2)
print("\n\n")
print("Validation features:\n", f3)

# Let's look into train dataset to see what order of label it used

Train features:
 {'id': Value(dtype='string', id=None), 'annotators': Sequence(feature={'label': ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None), 'annotator_id': Value(dtype='int32', id=None), 'target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None), 'rationales': Sequence(feature=Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), length=-1, id=None), 'post_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}



Test features:
 {'id': Value(dtype='string', id=None), 'annotators': Sequence(feature={'label': ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None), 'annotator_id': Value(dtype='int32', id=None), 'target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None), 'rationales': Sequence(feature=Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), length=-1, id=None), 'post_tokens': Sequence(feature=Value(dtype='st

Below is a small tutorial showing how to convert a pandas DataFrame into a Hugging Face dataset. The mapping between the numeric and string values of the label is important here. A `ClassLabel` feature can convert numeric values into the corresponding strings and vice versa.



```
from datasets import Dataset, Features, ClassLabel, Value

# Create a DataFrame
import pandas as pd
df = pd.DataFrame({
    "text": [
        "I love this movie!",
        "This is hate speech.",
        "You're amazing!",
        "That was offensive."
    ],
    "label": [0, 1, 0, 1]
})

# Define features with class names
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["normal", "hate"])
})

# Convert to Hugging Face Dataset with features
dataset = Dataset.from_pandas(df, features=features)
print(dataset.features)
print(dataset[0])

label_feature = dataset.features["label"]
print("Class names:", label_feature.names)
print("0 means:", label_feature.int2str(0))
print("1 means:", label_feature.int2str(1))
print("String 'normal' maps to:", label_feature.str2int("normal"))
print("String 'hate' maps to:", label_feature.str2int("hate"))

```



From the code below we get the following label mapping.

```
0 means: hatespeech
1 means: normal
2 means: offensive
```



In [None]:
# after you’ve loaded ds:
features = ds["train"].features

# drill down into the annotators sequence:
label_feature = features["annotators"].feature["label"]

# this is a ClassLabel object — its `.names` tell you the mapping:
print(label_feature.names)
print(label_feature)
print("0 means:", label_feature.int2str(0))
print("1 means:", label_feature.int2str(1))
print("2 means:", label_feature.int2str(2))

['hatespeech', 'normal', 'offensive']
ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None)
0 means: hatespeech
1 means: normal
2 means: offensive


In the original dataset, the label field contains one label per annotator. We choose a single label for each post by majority vote. The rationales field contains between zero and three rationale lists, one per annotator, and we merge them with an element-wise union. The post tokens and ID fields are kept as they are.

In [None]:
import pandas as pd
from collections import Counter

def dataset_to_dataframe(split):
    """
    Convert a Hugging Face Dataset split into a pandas DataFrame with:
    - id: example ID
    - label: most frequent label among annotators
    - rationales: element-wise union (OR) of annotator rationales
    - post_tokens: list of tokens as-is
    """
    rows = []
    for example in split:
        # Extract ID
        example_id = example['id']

        # Determine most frequent label among annotators
        labels = example['annotators']['label']
        most_common_label = Counter(labels).most_common(1)[0][0]

        # Union rationales (element-wise OR across annotators)
        rationale_lists = example['rationales']
        union_rationale = [
            int(any(token_flags))
            for token_flags in zip(*rationale_lists)
        ]

        # Post tokens as they are
        post_tokens = example['post_tokens']

        rows.append({
            'id': example_id,
            'label': most_common_label,
            'rationales': union_rationale,
            'post_tokens': post_tokens
        })

    return pd.DataFrame(rows)


df_train = dataset_to_dataframe(ds['train'])
df_valid = dataset_to_dataframe(ds['validation'])
df_test = dataset_to_dataframe(ds['test'])

# Display the first few rows of each split
display(df_train.head())
display(df_valid.head())
display(df_test.head())




Unnamed: 0,id,label,rationales,post_tokens
0,23107796_gab,2,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, ...","[u, really, think, i, would, not, have, been, ..."
1,9995600_gab,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[the, uk, has, threatened, to, return, radioac..."
2,1227920812235051008_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]","[if, english, is, not, imposition, then, hindi..."
3,1204931715778543624_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]","[no, liberal, congratulated, hindu, refugees, ..."
4,1179102559241244672_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[he, said, bro, even, your, texts, sound, redn..."


Unnamed: 0,id,label,rationales,post_tokens
0,1178613994371928065_twitter,1,[],"[me, getting, books, from, the, library, about..."
1,1170285336221638656_twitter,1,[],"[y, si, fuera, top, <number>, me, faltarían, h..."
2,1179099934731190272_twitter,1,[],"[<user>, <user>, <user>, i, am, a, lesbian, no..."
3,1178856372617846789_twitter,1,[],"[<user>, by, tweeting, about, a, civil, war, t..."
4,1178878849570021376_twitter,1,[],"[<user>, <user>, you, all, only, caring, about..."


Unnamed: 0,id,label,rationales,post_tokens
0,1178610029273976833_twitter,1,[],"[<user>, men, can, not, be, raped, can, not, b..."
1,1165785686903009283_twitter,1,[],"[<user>, you, are, missing, an, essential, pre..."
2,1252707503441313794_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[<user>, <user>, why, are, you, repeating, you..."
3,1103385226921762816_twitter,0,"[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[<user>, <user>, well, she, ’, muslim, so, of,..."
4,1169443635869487105_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[<user>, lol, not, me, i, don, ’, t, deal, wit..."
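
To make the two aggregation rules concrete, here is a small sketch on made-up inputs. It repeats the same `Counter` and `zip` logic used in `dataset_to_dataframe`. The helper names `majority_label` and `union_rationales` are introduced here only for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def union_rationales(rationale_lists):
    """Element-wise OR over annotator masks; no masks gives an empty list."""
    return [int(any(flags)) for flags in zip(*rationale_lists)]


# Toy inputs, not taken from the dataset.
print(majority_label([0, 2, 2]))
print(union_rationales([[0, 1, 0, 0], [0, 0, 1, 0]]))
print(union_rationales([]))

# zip() stops at the shortest mask, so masks of unequal length are truncated.
print(union_rationales([[1, 0, 0], [0, 1]]))
```

Two behaviours are worth keeping in mind: a three-way tie would silently resolve to the first annotator's label, and an example with no rationales yields an empty list, which matches the `[]` entries in the tables above.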


Let's save these DataFrames to Google Drive and load them back, so that we don't need to go through the steps above again and again.

In [None]:
# Each split is written to the file that carries its own name
train_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/train.parquet'
val_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/validation.parquet'
test_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/test.parquet'

df_train.to_parquet(train_path)
df_valid.to_parquet(val_path)
df_test.to_parquet(test_path)

In [None]:
import pandas as pd

train_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/train.parquet'
val_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/validation.parquet'
test_path = '/content/drive/MyDrive/Colab Notebooks/Hate speach detection/Dataset/test.parquet'

df_train = pd.read_parquet(train_path)
df_valid = pd.read_parquet(val_path)
df_test = pd.read_parquet(test_path)

display(df_train.head())
display(df_valid.head())
display(df_test.head())

Unnamed: 0,id,label,rationales,post_tokens
0,23107796_gab,2,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, ...","[u, really, think, i, would, not, have, been, ..."
1,9995600_gab,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[the, uk, has, threatened, to, return, radioac..."
2,1227920812235051008_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]","[if, english, is, not, imposition, then, hindi..."
3,1204931715778543624_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]","[no, liberal, congratulated, hindu, refugees, ..."
4,1179102559241244672_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[he, said, bro, even, your, texts, sound, redn..."


Unnamed: 0,id,label,rationales,post_tokens
0,1178613994371928065_twitter,1,[],"[me, getting, books, from, the, library, about..."
1,1170285336221638656_twitter,1,[],"[y, si, fuera, top, <number>, me, faltarían, h..."
2,1179099934731190272_twitter,1,[],"[<user>, <user>, <user>, i, am, a, lesbian, no..."
3,1178856372617846789_twitter,1,[],"[<user>, by, tweeting, about, a, civil, war, t..."
4,1178878849570021376_twitter,1,[],"[<user>, <user>, you, all, only, caring, about..."


Unnamed: 0,id,label,rationales,post_tokens
0,1178610029273976833_twitter,1,[],"[<user>, men, can, not, be, raped, can, not, b..."
1,1165785686903009283_twitter,1,[],"[<user>, you, are, missing, an, essential, pre..."
2,1252707503441313794_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[<user>, <user>, why, are, you, repeating, you..."
3,1103385226921762816_twitter,0,"[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[<user>, <user>, well, she, ’, muslim, so, of,..."
4,1169443635869487105_twitter,2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[<user>, lol, not, me, i, don, ’, t, deal, wit..."


I want to work only with the hatespeech and normal classes, so I am filtering out the offensive examples. Remember the label mapping:

0 means: hatespeech

1 means: normal

2 means: offensive
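
One way to express this filter, sketched on a toy DataFrame rather than the real splits: keep every row whose label is not 2. The frame `toy` and the constant `OFFENSIVE` are made up for illustration, and this is not necessarily the exact filter applied to `df_train`, `df_valid` and `df_test`.

```python
import pandas as pd

OFFENSIVE = 2  # label id for 'offensive' in the mapping above

# Toy frame with the same 'label' column as df_train (values are made up).
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "label": [0, 1, 2, 1],
})

# Keep rows whose label is not 'offensive', then renumber the index.
binary = toy[toy["label"] != OFFENSIVE].reset_index(drop=True)

print(binary)
print(sorted(binary["label"].unique()))
```

Because hatespeech and normal already use ids 0 and 1, no relabelling is needed after the filter.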