# Create final dataset

This notebook creates the final dataset. The file `./datasets/llama.csv`, which holds the original dataset plus a column with Llama's answers, is joined with the answers produced by the Phi model. The merged dataset is then split into training, validation and test sets.

In [14]:
import math
import json

import pandas as pd

In [3]:
df = pd.read_csv('./datasets/llama.csv')

df = df[['id', 'question', 'answer', 'llama']]
df.head()

Unnamed: 0,id,question,answer,llama
0,2020338,Why did the U.S Invade Iraq ?,A small group of politicians believed strongly...,"1) To get rid of Saddam Hussein, a terrorist a..."
1,2874684,How to get rid of a beehive?,Call an area apiarist. They should be able to...,1. Hire a pro. 2. Don't do it yourself.
2,4193114,Why don't European restaurants serve water?,There's a general belief in Europe (and in fac...,"1. Because they don't have it, and 2. They're ..."
3,1908421,Why hybrid cars gas mileage is better in city ?,hybrid cars save energy in two ways: 1.by stor...,1. Electric motor does not need to waste energ...
4,3608897,Can someone explain the theory of e=mc2?,In general it means that in a very high speed ...,81.314659278130909517 J/s^2


### Add Phi's answers to the dataset

In [4]:
with open('./datasets/phi_new.json', 'r') as f:
    df_phi = pd.read_json(f)

In [6]:
# Start with an empty answer for every row; rows without a Phi answer stay ''
phi_col = pd.Series([''] * len(df), dtype='str')

for index, row in df_phi.iterrows():
    phi_col.iloc[row['orig_id']] = row['answer']

df['phi'] = phi_col
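
The loop above treats `orig_id` as a zero-based row position in `df`. Under that same assumption, and assuming additionally that each `orig_id` appears only once, the alignment can be sketched without an explicit loop. The frames `toy_df` and `toy_phi` below are made-up stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for df and df_phi (hypothetical values, not the real data)
toy_df = pd.DataFrame({'question': ['q0', 'q1', 'q2']})
toy_phi = pd.DataFrame({'orig_id': [2, 0], 'answer': ['phi for q2', 'phi for q0']})

# Index Phi's answers by orig_id, then align them to toy_df's default 0..n-1 index.
# Rows without a Phi answer get an empty string, matching the loop above.
# Note: reindex raises an error if orig_id contains duplicates.
phi_answers = toy_phi.set_index('orig_id')['answer']
toy_df['phi'] = phi_answers.reindex(toy_df.index, fill_value='')

print(toy_df)
```

This only works when `df` keeps its default integer index, which is the case right after `pd.read_csv` and column selection.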

In [17]:
df.head()

Unnamed: 0,id,question,answer,llama,phi
0,2020338,Why did the U.S Invade Iraq ?,A small group of politicians believed strongly...,"1) To get rid of Saddam Hussein, a terrorist a...",The U.S. invaded Iraq in 2003 to remove Saddam...
1,2874684,How to get rid of a beehive?,Call an area apiarist. They should be able to...,1. Hire a pro. 2. Don't do it yourself.,You can call a professional beekeeper or use a...
2,4193114,Why don't European restaurants serve water?,There's a general belief in Europe (and in fac...,"1. Because they don't have it, and 2. They're ...","Person 2: Well, it's not that they don't serve..."
3,1908421,Why hybrid cars gas mileage is better in city ?,hybrid cars save energy in two ways: 1.by stor...,1. Electric motor does not need to waste energ...,"||Once upon a time, in a small town called Gre..."
4,3608897,Can someone explain the theory of e=mc2?,In general it means that in a very high speed ...,81.314659278130909517 J/s^2,Tutor: Sure! The theory of e=mc2 states that e...


### Do train-valid-test split for the dataset

In [7]:
# Validation dataset ratio, 12.5%
valid_ratio = 0.125
# Test dataset ratio, 12.5%
test_ratio = 0.125
# Train dataset ratio - the rest (100% - 12.5% - 12.5% = 75%)
train_ratio = 1 - (valid_ratio + test_ratio)

def extract_train(data):
    end = math.floor(len(data) * train_ratio)
    return data[:end]

def extract_valid(data):
    start = math.floor(len(data) * train_ratio)
    end = math.floor(start + len(data) * valid_ratio)
    
    return data[start:end]

def extract_test(data):
    start = math.floor(len(data) * train_ratio)
    start = math.floor(start + len(data) * valid_ratio)
    
    return data[start:]

In [10]:
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Sanity check: the list has 20 items (9 is skipped), so train covers positions 0-14
# and validation covers positions 15-16, which hold the values 16 and 17
assert [16, 17] == extract_valid(l)
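
A single assertion covers only the validation slice. As a complementary sketch, the check below verifies that the three slices together reproduce the input for several lengths. `split_bounds` is a hypothetical helper that repeats the same boundary arithmetic as the functions above; it is not part of the notebook.

```python
import math

VALID_RATIO = 0.125
TEST_RATIO = 0.125
TRAIN_RATIO = 1 - (VALID_RATIO + TEST_RATIO)

def split_bounds(n):
    """Return (train_end, valid_end) using the same arithmetic as the extract_* functions."""
    train_end = math.floor(n * TRAIN_RATIO)
    valid_end = math.floor(train_end + n * VALID_RATIO)
    return train_end, valid_end

for n in (0, 1, 7, 20, 1000):
    items = list(range(n))
    train_end, valid_end = split_bounds(n)
    train = items[:train_end]
    valid = items[train_end:valid_end]
    test = items[valid_end:]
    # The three slices must reassemble into the original list, in order
    assert train + valid + test == items

print(split_bounds(1000))  # (750, 875): 750 train, 125 validation, 125 test
```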

### Convert dataset to ready-for-usage format

To make training machine learning models easier, let's transform the data into two columns. The first contains the text. The second is the source of that text, encoded as an integer: `0` for a human, `1` for Llama, `2` for Phi.

In [12]:
data_answer = {'text': [], 'label': []}
data_llama = {'text': [], 'label': []}
data_phi = {'text': [], 'label': []}

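# Each row contributes exactly one text: Llama's answer if present,
# otherwise Phi's answer, otherwise the human answer
for index, row in df.iterrows():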
    answer = row['answer']
    llama = row['llama']
    phi = row['phi']

    if (isinstance(llama, str) and llama.strip() == '') or (not isinstance(llama, str) and math.isnan(llama)):
        if (isinstance(phi, str) and phi.strip() == '') or (not isinstance(phi, str) and math.isnan(phi)):
            data_answer['text'].append(answer)
            data_answer['label'].append(0)
        else:
            data_phi['text'].append(phi)
            data_phi['label'].append(2)
    else:
        data_llama['text'].append(llama)
        data_llama['label'].append(1)

# Keep at most 1000 texts per class
minimum = 1000
data_answer = {'text': data_answer['text'][:minimum], 'label': data_answer['label'][:minimum]}
data_llama = {'text': data_llama['text'][:minimum], 'label': data_llama['label'][:minimum]}
data_phi = {'text': data_phi['text'][:minimum], 'label': data_phi['label'][:minimum]}

data_train = {
    'text': extract_train(data_answer['text']) + extract_train(data_llama['text']) + extract_train(data_phi['text']),
    'label': extract_train(data_answer['label']) + extract_train(data_llama['label']) + extract_train(data_phi['label'])
}

data_valid = {
    # Use extract_valid for all three sources so that no training rows leak into validation
    'text': extract_valid(data_answer['text']) + extract_valid(data_llama['text']) + extract_valid(data_phi['text']),
    'label': extract_valid(data_answer['label']) + extract_valid(data_llama['label']) + extract_valid(data_phi['label'])
}

data_test = {
    'text': extract_test(data_answer['text']) + extract_test(data_llama['text']) + extract_test(data_phi['text']),
    'label': extract_test(data_answer['label']) + extract_test(data_llama['label']) + extract_test(data_phi['label'])
}

train_df = pd.DataFrame(data_train)
valid_df = pd.DataFrame(data_valid)
test_df = pd.DataFrame(data_test)
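
The emptiness checks in the loop above repeat the same condition for `llama` and `phi`. One way to factor it out is sketched below; `is_missing` is a hypothetical helper, not part of the notebook. It differs from the original condition in one respect: it treats `None` as missing, whereas `math.isnan(None)` would raise a `TypeError`.

```python
import math

def is_missing(value):
    """True for None, NaN, or a string that is empty after stripping whitespace."""
    if isinstance(value, str):
        return value.strip() == ''
    # Assumption: None counts as missing (the original condition would raise here)
    return value is None or (isinstance(value, float) and math.isnan(value))

print(is_missing('   '))          # whitespace-only string
print(is_missing(float('nan')))   # NaN, as pandas uses for empty CSV cells
print(is_missing('some answer'))  # regular text
```

With such a helper, the branching could read `if is_missing(llama): ...` and `if is_missing(phi): ...` without changing which text each row contributes.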

In [24]:
train_df.head()

Unnamed: 0,text,label
0,"Hmmm, don't play??? If you invested a $1 a day...",0
1,"The best way to get ""rid "" of them is take the...",0
2,use any P2P software. for eg limewire or warez,0
3,When you install it on your computer and conne...,0
4,HIV infects T-cells with the CD4 receptor. As...,0


### Write the split datasets to the filesystem

Store the datasets as JSON files. Pandas can read this format as well.

In [15]:
with open('./datasets/data_train_new.json', 'w') as f:
    json.dump(data_train, f)

with open('./datasets/data_valid_new.json', 'w') as f:
    json.dump(data_valid, f)

with open('./datasets/data_test_new.json', 'w') as f:
    json.dump(data_test, f)
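
To illustrate reading such a file back, here is a minimal sketch that writes a small dictionary in the same `{'text': [...], 'label': [...]}` layout to a temporary file and loads it into a DataFrame. The `sample` data and the temporary path are made up for the example. It uses `json.load` followed by `pd.DataFrame` so that values are loaded exactly as they were written.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical miniature of data_train, in the same dict-of-lists layout
sample = {
    'text': ['a human answer', 'a Llama answer', 'a Phi answer'],
    'label': [0, 1, 2],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')
    with open(path, 'w') as f:
        json.dump(sample, f)

    # Load the file back and build a DataFrame from the dict of lists
    with open(path, 'r') as f:
        loaded_df = pd.DataFrame(json.load(f))

# Count how many texts each label has
print(loaded_df['label'].value_counts().sort_index())
```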