# Data Processing
This notebook cleans the raw game data and converts it into a format suitable for training the language model and the reward model.

This involves the following steps:
1. Removing all rows where *"black_card_pick_num"* is greater than 1, so that every remaining row contains a single punchline.
2. Removing rows where *"round_skipped"* is True, so that every remaining round has a winning joke.
3. Adding an extra column, *"chosen_white_card"*, containing the winning joke for that round. This is useful later for reward model training.
4. Mapping *"won"* values of False to 0 and True to 1 (stored as floats) to aid training further down the line.
5. Removing columns that are not immediately relevant and renaming *"fake_round_id"* to *"round_id"*.
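
To make the steps above concrete, here is a minimal sketch that applies them to a tiny hand-made table. The column names follow the list above, but the six rows and the names `toy`, `kept`, `winners` and `result` are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical miniature table: three rounds of two cards each (values are made up).
toy = pd.DataFrame({
    "fake_round_id": [1, 1, 2, 2, 3, 3],
    "black_card_pick_num": [1, 1, 2, 2, 1, 1],
    "round_skipped": [False, False, False, False, True, True],
    "white_card_text": ["A", "B", "C", "D", "E", "F"],
    "won": [False, True, True, False, False, False],
})

# Steps 1 and 2: keep single-pick rounds that were not skipped.
kept = toy[(toy["black_card_pick_num"] <= 1) & (~toy["round_skipped"])]

# Step 3: attach each round's winning card as a new column.
winners = kept.loc[kept["won"], ["fake_round_id", "white_card_text"]].rename(
    columns={"white_card_text": "chosen_white_card"}
)
kept = kept.merge(winners, on="fake_round_id", how="left")

# Step 4: turn the booleans into 0/1 (the notebook's own cell stores them as floats).
kept["won"] = kept["won"].astype(int)

# Step 5: drop the helper columns and rename the round identifier.
result = kept.drop(columns=["black_card_pick_num", "round_skipped"]).rename(
    columns={"fake_round_id": "round_id"}
)
print(result)
```

Only round 1 survives here: round 2 asks for two cards and round 3 was skipped. Both remaining rows carry the winning card "B" in *"chosen_white_card"*.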

In [15]:
# Imports the Pandas library
import pandas as pd

# Reads the .csv file and then omits certain rows based on conditions.
dataset = pd.read_csv('Data/cah_data.csv')
dataset = dataset[dataset['black_card_pick_num'] <= 1]
dataset = dataset[dataset['round_skipped'] != True]

# Creates a new column for the winning joke.
winning_jokes = dataset[dataset['won'] == True].copy()
winning_jokes.loc[:, 'chosen_white_card'] = winning_jokes['white_card_text']
dataset = pd.merge(dataset, winning_jokes[['fake_round_id', 'chosen_white_card']], on='fake_round_id', how='left')

# Alters other features within the dataset, as specified above, to aid with future tasks.
dataset['won'] = dataset['won'].map({False: 0, True: 1})
dataset['won'] = dataset['won'].astype(float)
dataset.drop(['winning_index', 'round_skipped', 'round_completion_seconds', 'black_card_pick_num'], axis=1, inplace=True)
dataset.rename(columns={'fake_round_id': 'round_id'}, inplace=True)

# Saves the dataset into a new .csv file.
dataset.to_csv('Data/proc_cah_data.csv', index=False)
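
The left merge in the cell above gives one output row per input row only if each round has at most one winning card. If a round had two winners, its rows would be duplicated. The sketch below shows one possible way to check that assumption, using hypothetical data in which round 2 deliberately has two winners. The names `rows`, `winners_per_round`, `ambiguous_rounds` and `merge_ok` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; round 2 deliberately has two winning cards.
rows = pd.DataFrame({
    "fake_round_id": [1, 1, 2, 2],
    "white_card_text": ["A", "B", "C", "D"],
    "won": [False, True, True, True],
})
winners = rows.loc[rows["won"], ["fake_round_id", "white_card_text"]].rename(
    columns={"white_card_text": "chosen_white_card"}
)

# Count winning rows per round; a count above 1 would duplicate rows in a left join.
winners_per_round = winners.groupby("fake_round_id").size()
ambiguous_rounds = winners_per_round[winners_per_round > 1].index.tolist()
print(ambiguous_rounds)

# validate="many_to_one" makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(rows, winners, on="fake_round_id", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)
```

Whether the real data contains such rounds is not known from this notebook, so the check is a precaution and not a correction of the cell above.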

In [14]:
dataset

Unnamed: 0,round_id,black_card_text,white_card_text,won,chosen_white_card
0,1,"Hi MTV! My name is Kendra, I live in Malibu, I...",Going inside at some point because of the mosq...,0,Shapes and colors.
1,1,"Hi MTV! My name is Kendra, I live in Malibu, I...",Being fat from noodles.,0,Shapes and colors.
2,1,"Hi MTV! My name is Kendra, I live in Malibu, I...",Letting this loser eat me out.,0,Shapes and colors.
3,1,"Hi MTV! My name is Kendra, I live in Malibu, I...",That chicken from Popeyes.®,0,Shapes and colors.
4,1,"Hi MTV! My name is Kendra, I live in Malibu, I...",A sorry excuse for a father.,0,Shapes and colors.
...,...,...,...,...,...
2446355,298955,Oh my god! _____ killed Kenny!,Breastfeeding a ten-year-old.,0,Jeff Bezos.
2446356,298955,Oh my god! _____ killed Kenny!,Happy daddies with happy sandals.,0,Jeff Bezos.
2446357,298955,Oh my god! _____ killed Kenny!,Jerking off to a 10-second RealMedia clip.,0,Jeff Bezos.
2446358,298955,Oh my god! _____ killed Kenny!,Getting naked and watching Nickelodeon.,0,Jeff Bezos.
