# Collaborative Data Labelling

## Overview

At this step, I needed to label my random sample so it could be used to fine-tune BERT to classify the rest of my data.

I was fortunate to have help with this, since labelling data takes time. To divide the task efficiently, I took the ten chunks of my training data (100 newspaper clippings apiece) and uploaded them to GitHub. Then I created ten nearly identical Google Colab notebooks, one for each chunk. The only differences between these notebooks were the .csv files they read and the filenames they saved.
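For readers who want to reproduce the chunking step, here is a minimal sketch of one way to cut a sample into 100-row pieces. It is not the code I used to make the chunks; the helper `split_into_chunks`, the toy `sample_df`, and the output filename pattern are assumptions for illustration (the pattern is modelled on the paths that appear later in this notebook).

```python
import pandas as pd


def split_into_chunks(df, chunk_size=100):
    """Return consecutive slices of df, each with at most chunk_size rows."""
    return [
        df.iloc[start:start + chunk_size].reset_index(drop=True)
        for start in range(0, len(df), chunk_size)
    ]


# Toy stand-in for the random sample; the real data has a 'clippings' column.
sample_df = pd.DataFrame({'clippings': [f'clipping text {i}' for i in range(250)]})
chunks = split_into_chunks(sample_df, chunk_size=100)

for number, chunk in enumerate(chunks, start=1):
    # Hypothetical filename pattern, assumed for this sketch.
    filename = f'clippings_part_{number}.csv'
    print(filename, len(chunk))
    # chunk.to_csv(filename, index=False)  # uncomment to actually write the files
```

Slicing by position keeps each labeller's file small and makes it easy to tell which part a row came from when the pieces are later stitched back together.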

Here's the code and text as it appeared in these notebooks:

In [None]:
import pandas as pd
# from google.colab import files
import textwrap

## 1) Directions

When you run the code in the following notebook, you will be prompted to label newspaper clippings. The goal is to identify if the clippings contain any reference to lynchings. The labelled data will then be used to fine-tune a BERT model to classify other clippings.

Please label the data in the following way:

- '__yes__' = yes, the clipping contains a reference to a lynching
- '__no__' = no, the clipping has nothing to do with lynchings

Once you've finished labelling the data, run the last bit of code to save your work as a .csv file. Then please email this .csv file to Matthew Kollmer at kollmer2@illinois.edu.

Thank you for your help!

In [None]:
# df was linked to different .csv files, one per chunk (i.e. part_1, part_2, part_3, etc.)
df = pd.read_csv('training_data/clippings_part_10.csv')

case_match_values = []

counter = 0

for index, clipping in df['clippings'].items():
    counter += 1
    print('---------------------------------------')
    print('---------------------------------------')
    print('---------------------------------------')
    print(f'row: {counter}')
    wrapped_clipping = '\n'.join(textwrap.wrap(str(clipping), width=60))
    print(wrapped_clipping)
    print()
    case_match = input(f'Does the clipping contain text about a lynching? Answers must be yes or no: ')
    case_match_values.append(case_match)

df['case_match'] = case_match_values
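A note on the loop above: it stores whatever is typed at the prompt, so a typo such as 'yse' or a stray space would land in the case_match column unchanged. The following is a minimal sketch of one way to enforce the yes/no rule at input time. It was not part of the shared notebooks; the helper name `ask_yes_no` and its `read` parameter are hypothetical, introduced here only so the function can be exercised without a keyboard.

```python
def ask_yes_no(prompt, read=input):
    """Re-prompt until the answer is 'yes' or 'no'; return it stripped and lowercased."""
    while True:
        answer = read(prompt).strip().lower()
        if answer in ('yes', 'no'):
            return answer
        print("Please type 'yes' or 'no'.")


# Demo with scripted replies standing in for keyboard input:
# the first reply is a typo and is rejected, the second is accepted.
replies = iter(['yse', ' Yes '])
label = ask_yes_no(
    'Does the clipping contain text about a lynching? Answers must be yes or no: ',
    read=lambda _prompt: next(replies),
)
print(label)
```

In a notebook, `ask_yes_no(...)` could stand in for the bare `input(...)` call inside the loop, leaving the default `read=input` in place.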

## 2) Save and Finish

Once you've finished labelling the set of 100 clippings, run the last bit of code below. It will save the labelled data as a .csv file.

Last thing: please email this file to Matthew. Thanks again!

In [None]:
# changed depending on which chunk was labelled (i.e. part_1, part_2, part_3, etc.)
df.to_csv('training_data/clippings_part_10_labelled.csv', index=False)
# files.download('training_data/clippings_part_10_labelled.csv')

# BREAK - Collaborative Work Ended Here

## 3) Unification of Data

Once all twenty subsets of the data were labelled, I concatenated them into one dataframe and saved it as training_data.csv.

In [None]:
# Read the first 17 labelled subsets (parts 1 through 17)
labelled_parts = [
    pd.read_csv(f'training_data/clippings_part_{i}_labelled.csv')
    for i in range(1, 18)
]

# Stack them into a single dataframe with a fresh index
train_df = pd.concat(labelled_parts, ignore_index=True)

From the last three subsets (parts 18 through 20), I added only the rows labelled 'yes':

In [None]:
df18 = pd.read_csv('training_data/clippings_part_18_labelled.csv')
df19 = pd.read_csv('training_data/clippings_part_19_labelled.csv')
df20 = pd.read_csv('training_data/clippings_part_20_labelled.csv')

df18_yes = df18[df18['case_match'] == 'yes']
df19_yes = df19[df19['case_match'] == 'yes']
df20_yes = df20[df20['case_match'] == 'yes']

train_df = pd.concat([train_df, df18_yes, df19_yes, df20_yes], ignore_index=True)
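One caveat about filtering on `== 'yes'`: the comparison is exact, so a label typed as ' Yes' or 'YES' would be silently dropped. Below is a minimal sketch of normalizing the labels before filtering, using a toy dataframe rather than the real subsets. The helper `normalize_labels` is hypothetical and was not part of my workflow.

```python
import pandas as pd


def normalize_labels(df, column='case_match'):
    """Return a copy of df with labels stripped of whitespace and lowercased."""
    out = df.copy()
    # astype(str) turns missing values into the string 'nan',
    # which then shows up in the 'unexpected' check below.
    out[column] = out[column].astype(str).str.strip().str.lower()
    return out


# Toy data: two of the three labels would fail an exact match against 'yes'/'no'.
toy = pd.DataFrame({
    'clippings': ['text a', 'text b', 'text c'],
    'case_match': ['yes', ' Yes', 'no '],
})

cleaned = normalize_labels(toy)
yes_rows = cleaned[cleaned['case_match'] == 'yes']

# Anything that is still not 'yes' or 'no' needs a manual look.
unexpected = cleaned[~cleaned['case_match'].isin(['yes', 'no'])]
print(len(yes_rows), len(unexpected))
```

Running the same normalization on every subset before concatenation would also keep the label column consistent across labellers.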

Finally, I saved train_df:

In [None]:
train_df.to_csv('training_data/training_data.csv', index=False)
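Because the last three subsets contribute only 'yes' rows, the share of 'yes' labels in the combined data is higher than it is in the individual chunks. A quick way to see the resulting balance is to count the labels. This is a minimal sketch on a toy dataframe; `label_summary` is a hypothetical helper, and the numbers it prints here say nothing about the real training data.

```python
import pandas as pd


def label_summary(df, column='case_match'):
    """Return the row count and a label -> count mapping for the given column."""
    counts = df[column].value_counts(dropna=False)
    return {'rows': len(df), 'counts': counts.to_dict()}


# Toy stand-in for train_df
toy_train = pd.DataFrame({
    'clippings': ['text a', 'text b', 'text c', 'text d'],
    'case_match': ['yes', 'no', 'no', 'yes'],
})

summary = label_summary(toy_train)
print(summary)
```

Calling `label_summary(train_df)` on the real dataframe would show how many rows of each label went into training_data.csv.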