# Collaborative Data Labelling

## Overview

At this step, I needed to label my random sample so it could be used to fine-tune BERT to classify the rest of my data. In other words, I had to prepare my training data.

I was fortunate to have help in this regard, since labelling data takes time. To divide the task efficiently, I took the ten chunks of my training data (100 newspaper clippings apiece) and uploaded them to GitHub. Then I created ten nearly identical Google Colab notebooks, one for each chunk. The only differences between these notebooks were the .csv files they read and the filenames they saved.
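For readers who want to reproduce the chunking, here is a minimal sketch of how a sample could be split into files of 100 clippings each. It is an illustration, not the code used for this project: the `split_into_chunks` helper and the placeholder `sample` DataFrame are assumptions, while the `clippings` column name and the `clippings_part_N.csv` filename pattern follow the notebooks.

```python
import pandas as pd


def split_into_chunks(sample: pd.DataFrame, chunk_size: int = 100) -> list:
    """Split a DataFrame into consecutive chunks of at most chunk_size rows."""
    return [
        sample.iloc[start:start + chunk_size].reset_index(drop=True)
        for start in range(0, len(sample), chunk_size)
    ]


# Placeholder standing in for the real random sample of 1,000 clippings.
sample = pd.DataFrame({'clippings': [f'clipping text {i}' for i in range(1000)]})

chunks = split_into_chunks(sample, chunk_size=100)

# Save each chunk under the filename pattern the labelling notebooks read from.
for number, chunk in enumerate(chunks, start=1):
    chunk.to_csv(f'clippings_part_{number}.csv', index=False)
```

Each resulting file can then be uploaded to GitHub and referenced by its raw URL in the matching notebook.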

Here's the code and text as it appeared in these notebooks:

In [None]:
import pandas as pd
from google.colab import files
import textwrap

## 1) Directions

When you run the code in this notebook, you will be prompted to label newspaper clippings. The goal is to identify whether each clipping contains any reference to a lynching. The labelled data will then be used to fine-tune a BERT model to classify other clippings.

Please label the data in the following way:

- '__yes__' = yes, the clipping is entirely about a lynching
- '__no__' = no, the clipping has nothing to do with lynchings
- '__partial__' = part of the clipping is about a lynching, but not all of it
- '__unknown__' = the clipping is in a language I can't read or the OCR is so bad it's hard to tell

Once you've finished labelling the data, run the last bit of code to save your work as a .csv file. Then please email this .csv file to Matthew Kollmer at kollmer2@illinois.edu.

Thank you for your help!

In [None]:
# df was linked to different .csv files, one per chunk (i.e. part_1, part_2, part_3, etc.)
df = pd.read_csv('https://raw.githubusercontent.com/MatthewKollmer/messing-around/refs/heads/main/vrt_work/say_their_names/training_data/clippings_part_1.csv')

case_match_values = []

for index, clipping in df['clippings'].items():
    print('---------------------------------------')
    print('---------------------------------------')
    print('---------------------------------------')
    wrapped_clipping = '\n'.join(textwrap.wrap(str(clipping), width=60))
    print(wrapped_clipping)
    print()
    case_match = input('Does the clipping contain text about a lynching? Answers must be yes, partial, no, or unknown: ')
    case_match_values.append(case_match)

df['case_match'] = case_match_values
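
The loop above stores whatever the labeller types, so a typo such as "yse" would end up in the `case_match` column. A stricter prompt could re-ask until the answer is one of the four accepted labels. The sketch below is a possible variation, not what the notebooks actually ran; `VALID_LABELS`, `ask_for_label`, and its `ask` parameter are hypothetical names introduced for illustration.

```python
# Hypothetical helper, not part of the original notebooks.
VALID_LABELS = {'yes', 'no', 'partial', 'unknown'}


def ask_for_label(prompt, ask=input):
    """Repeat the prompt until the answer is one of the four accepted labels.

    `ask` defaults to the built-in input() and can be swapped out for testing.
    """
    while True:
        # Ignore surrounding whitespace and capitalisation.
        answer = ask(prompt).strip().lower()
        if answer in VALID_LABELS:
            return answer
        print(f"'{answer}' is not a valid label. Please type yes, partial, no, or unknown.")
```

Inside the labelling loop, the `input(...)` call would then be replaced with `ask_for_label(...)` using the same prompt text.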

## 2) Save and Finish

Once you've finished labelling the set of 100 clippings, run the last bit of code below. It will save the labelled data as a .csv file.

Last thing: please email this file to Matthew. Thanks again!

In [None]:
# changed depending on which chunk was labelled (i.e. part_1, part_2, part_3, etc.)
df.to_csv('clippings_part_1_labelled.csv', index=False)
files.download('clippings_part_1_labelled.csv')
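
Once the labelled files come back, they have to be merged into a single training set before fine-tuning. One way this could be done is sketched below, under stated assumptions: `combine_labelled_chunks` and `VALID_LABELS` are hypothetical names, and the sketch assumes each file keeps the `case_match` column created above. It also flags rows whose label is missing or is not one of the four accepted values, so they can be reviewed by hand.

```python
import pandas as pd

# Hypothetical helper for illustration; not part of the original notebooks.
VALID_LABELS = {'yes', 'no', 'partial', 'unknown'}


def combine_labelled_chunks(paths):
    """Concatenate labelled chunk files and return (combined, invalid_rows)."""
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    # Normalise stray whitespace and capitalisation before checking labels.
    # Missing labels become the string 'nan' and are flagged as invalid.
    combined['case_match'] = combined['case_match'].astype(str).str.strip().str.lower()
    invalid = combined[~combined['case_match'].isin(VALID_LABELS)]
    return combined, invalid
```

With all ten files in the working directory, the call might be `combine_labelled_chunks([f'clippings_part_{n}_labelled.csv' for n in range(1, 11)])`.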