# Build datasets from the raw dataset

### Datasets:
    - 4 test sets ✔
    - Train set ✔
    - Subset of the train set for further evaluation ✔
    
Several steps are needed to reach the final datasets. Because the datasets are large, the process is split into stages and the intermediate results are saved to disk.

## To rebuild:
    1. Download the GPT-2 dataset (small-117M) and place it in the project folder (the cells below read it from `original_data/`)
    2. Run through this notebook
    3. Delete or reuse the intermediate checkpoints along the way to work efficiently
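
The checkpointing idea in step 3 can be sketched as a small helper. This is only an illustration: `load_or_compute` is a hypothetical name that is not part of `utility`, and it assumes the intermediate result can be pickled.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Return the pickled result stored at `path`; if the file is missing,
    call `compute()`, save its result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Deleting a checkpoint file forces that stage to be recomputed on the next run, while keeping it lets the stage be skipped.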

### Necessary packages

In [2]:
import pickle
import json
import torch
import re
import operator
from utility import *

In [2]:
%%capture
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

## Split the original data into single sentences and save them

In [3]:
texts = []
# Read the test split: one JSON object per line, with the document under 'text'
with open("original_data/small-117M.test.jsonl", encoding="UTF-8") as f:
    for line in f:
        texts.append(json.loads(line)['text'])

In [4]:
splitted = split_eos(texts)
# Flatten the per-document sentence lists into one list of sentences
final = [item for sublist in splitted for item in sublist]
final = "<|endoftext|>".join(final)
with open("saves/splitOnEosDataset_v2_test.txt", "w", encoding="UTF-8") as f:
    f.write(final)

In [5]:
len(final)

16337865
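
Note that `final` is the joined string at this point, so the number above is a character count, not a sentence count. For readers without access to `utility`, the shape of this step can be sketched as follows. `naive_split_eos` is a hypothetical stand-in, not the real `split_eos`; it assumes that sentences end in `.`, `!` or `?` followed by whitespace, which the real helper may handle differently.

```python
import json
import re

def read_jsonl_texts(path):
    """Read the 'text' field from each line of a JSON Lines file."""
    with open(path, encoding="UTF-8") as f:
        return [json.loads(line)["text"] for line in f]

def naive_split_eos(texts):
    """Hypothetical stand-in for utility.split_eos.

    Returns one list of sentences per input document, splitting on
    whitespace that follows '.', '!' or '?'.
    """
    pattern = re.compile(r"(?<=[.!?])\s+")
    return [[s for s in pattern.split(t.strip()) if s] for t in texts]
```

The nested result is what the flattening comprehension `[item for sublist in splitted for item in sublist]` expects.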

In [43]:
texts = []
# Read the train split: one JSON object per line, with the document under 'text'
with open("original_data/small-117M.train.jsonl", encoding="UTF-8") as f:
    for line in f:
        texts.append(json.loads(line)['text'])

In [44]:
splitted = split_eos(texts)
final = [item for sublist in splitted for item in sublist]

In [45]:
len(final)

5106654

In [46]:
# Split into 5 parts for easier processing later on.
# Note: the truncating division drops the last len(final) % 5 sentences.
a = len(final) // 5
print(a)
for x in range(5):
    finalSplit = final[a*x:a*(x+1)]
    print(len(finalSplit))
    finalSplit = "<|endoftext|>".join(finalSplit)

    with open("saves/splitOnEosDataset_v2_" + str(x+1) + ".txt", "w", encoding="UTF-8") as f:
        f.write(finalSplit)

1021330
1021330
1021330
1021330
1021330
1021330
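
The part size is computed with truncating division, so the 5 parts cover 5 × 1,021,330 = 5,106,650 sentences and the last 4 of the 5,106,654 sentences are not written to any file. If that matters, a minimal sketch of a split that keeps the remainder could look like this (`split_into_parts` is a hypothetical helper, not part of `utility`):

```python
def split_into_parts(items, n_parts):
    """Split `items` into `n_parts` contiguous parts.

    The first n_parts - 1 parts have len(items) // n_parts elements each;
    the last part also takes whatever remains.
    """
    size = len(items) // n_parts
    parts = [items[size * i:size * (i + 1)] for i in range(n_parts - 1)]
    parts.append(items[size * (n_parts - 1):])
    return parts
```

With this variant the last file would hold 1,021,334 sentences instead of 1,021,330.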


# Correct and collect stats  

In [2]:
# The next step is scripted: run grammar_parser.py on every part of the saved data (outputs are named EOS_corrected_v2_(nr) and EOS_stats_v2_(nr))

# Build Frequency Stats: 

In [3]:
for x in range(5):
    with open("saves/EOS_stats_v2_" + str(x+1) + ".p", "rb") as f:
        stats = pickle.load(f)
    freq = build_frequency_stats(stats)
    with open("saves/EOS_freq_v2_" + str(x+1) + ".p", "wb") as f:
        pickle.dump(freq, f)

0


# Filter trash / correct sentences / long sentences (4 datasets)

In [3]:
# Original sentences, one per <|endoftext|>-delimited record
with open("saves/splitOnEosDataset_v2_test.txt", encoding="UTF-8") as f:
    inp1 = f.read().split("<|endoftext|>")
with open("saves/EOS_corrected_v2_test.p", "rb") as f:
    inp2 = pickle.load(f)
with open("saves/EOS_stats_v2_test.p", "rb") as f:
    inp3 = pickle.load(f)
with open("saves/EOS_freq_v2_test.p", "rb") as f:
    stats = pickle.load(f)

In [7]:
inp2 = add_correct_tokens(inp2,inp3[5],len(inp1))
sentences = (inp1,inp2)

In [7]:
len(sentences[0])

103127

In [8]:
# The second test set contains only incorrect sentences. To test performance in the wild,
# we need both incorrect and correct sentences, so datasets are built with both filter_trash_3 and filter_trash_2.

In [11]:
filtered1 = filter_trash_3(sentences,stats[-1],99)

23 were deleted since they had more than99 mistakes
42004 sentences had no grammar mistakes.


In [12]:
len(filtered1[0])

103104

In [13]:
filtered2 = filter_trash_2(sentences,stats[-1],1000)

0 were deleted since they had more than1000 mistakes
42004 sentences had no grammar mistakes. They were deleted from the dataset


### Test Format: All sentences 

In [18]:
out = []
for x in range(len(filtered1[0])):
    out.append(filtered1[0][x] + "==== " + filtered1[1][x])

In [19]:
# Keep only examples of at most 700 characters
outFiltered = [x for x in out if len(x) <= 700]

In [20]:
final = out
# Strip the <|endoftext|> marker from every even-indexed example
for x in range(len(final)):
    if x % 2 == 0:
        final[x] = final[x].replace("<|endoftext|>", "")
final = " ".join(final)
with open("build_data/EOS_new_no_filter_long.txt", "w", encoding="UTF-8") as f:
    f.write(final)

In [15]:
final = outFiltered
# Strip the <|endoftext|> marker from every even-indexed example
for x in range(len(final)):
    if x % 2 == 0:
        final[x] = final[x].replace("<|endoftext|>", "")
final = " ".join(final)
with open("build_data/EOS_new_no_filter_700.txt", "w", encoding="UTF-8") as f:
    f.write(final)


### Test format: Only wrong

In [30]:
out = []
for x in range(len(filtered2[0])):
    out.append(filtered2[0][x] + "==== " + filtered2[1][x])

In [17]:
# Keep only examples of at most 700 characters
outFiltered = [x for x in out if len(x) <= 700]

In [18]:
final = out
# Strip the <|endoftext|> marker from every even-indexed example
for x in range(len(final)):
    if x % 2 == 0:
        final[x] = final[x].replace("<|endoftext|>", "")
final = " ".join(final)
with open("build_data/EOS_new_filter_long.txt", "w", encoding="UTF-8") as f:
    f.write(final)

In [19]:
final = outFiltered
# Strip the <|endoftext|> marker from every even-indexed example
for x in range(len(final)):
    if x % 2 == 0:
        final[x] = final[x].replace("<|endoftext|>", "")
final = " ".join(final)
with open("build_data/EOS_new_filter_700.txt", "w", encoding="UTF-8") as f:
    f.write(final)

# Prepare the full dataset (requires a lot of RAM) 

In [None]:
inp1L = []  # original sentences per part
inp2L = []  # corrected sentences per part
inp3L = []  # correction stats per part
inpSL = []  # frequency stats per part
for x in range(5):
    with open("saves/splitOnEosDataset_v2_" + str(x+1) + ".txt", encoding="UTF-8") as f:
        inp1L.append(f.read().split("<|endoftext|>"))
    with open("saves/manual_dataset/EOS_corrected_v2_" + str(x+1) + ".p", "rb") as f:
        inp2L.append(pickle.load(f))
    with open("saves/manual_dataset/EOS_stats_v2_" + str(x+1) + ".p", "rb") as f:
        inp3L.append(pickle.load(f))
    with open("saves/manual_dataset/EOS_freq_v2_" + str(x+1) + ".p", "rb") as f:
        inpSL.append(pickle.load(f))

In [None]:
for x in range(len(inp2L)):
    print(x)
    inp2L[x] = add_correct_tokens(inp2L[x],inp3L[x][5],len(inp1L[x]))

0
1
2
3
4


In [None]:
sentences = []
for x in range(5):
    sentences.append((inp1L[x],inp2L[x]))

In [299]:
final = []
for x in range(5):
    final.append(filter_trash_3(sentences[x],inpSL[x][-1],100))

273 were deleted since they had more than100 mistakes
416139 sentences had no grammar mistakes. They were deleted from the dataset
249 were deleted since they had more than100 mistakes
413842 sentences had no grammar mistakes. They were deleted from the dataset
272 were deleted since they had more than100 mistakes
412081 sentences had no grammar mistakes. They were deleted from the dataset
259 were deleted since they had more than100 mistakes
413714 sentences had no grammar mistakes. They were deleted from the dataset
265 were deleted since they had more than100 mistakes
414202 sentences had no grammar mistakes. They were deleted from the dataset


In [300]:
out = []
for y in range(5):
    for x in range (len(final[y][0])):
        out.append(final[y][0][x] + "==== " + final[y][1][x])

In [311]:
# Keep only examples of at most 700 characters
outFiltered = [x for x in out if len(x) <= 700]
out = outFiltered

In [313]:
for x in range(len(out)):
    if x%2 == 0: 
        out[x] = out[x].replace("<|endoftext|>","")

In [318]:
# `out` holds the filtered, formatted training examples built in the cells above
final_train = " ".join(out)
with open("EOS_new_full_train.txt", "w", encoding="UTF-8") as f:
    f.write(final_train)

# Prepare a ~1% subset of the train set for later testing steps

In [15]:
with open("EOS_new_full_train.txt", "r", encoding="UTF-8") as f:
    dataR = f.read()

In [16]:
data = dataR.split("<|endoftext|>")

In [17]:
oneP = data[:5000]

In [18]:
oneP = "<|endoftext|>".join(oneP)

In [19]:
with open("EOS_new_full_train_5k.txt", "w", encoding="UTF-8") as f:
    f.write(oneP)
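
The subset step can also be written as a single function that reports which fraction of the records was kept, which makes it easy to check the "~1%" figure against the actual file. This is a minimal sketch; `head_records` is a hypothetical helper and it assumes records are delimited by `<|endoftext|>`, as in the cells above.

```python
def head_records(text, n, sep="<|endoftext|>"):
    """Return the first `n` records of `text`, rejoined with `sep`,
    together with the fraction of all records that was kept."""
    records = text.split(sep)
    kept = records[:n]
    fraction = len(kept) / len(records)
    return sep.join(kept), fraction
```

Calling `head_records(dataR, 5000)` would return the same string as `oneP` plus the share of records it represents.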

# Build the dataset for classic finetuning 

#### First step: Correct the raw data with grammar_parser_json.py (the original data was split into two parts to make it easier to handle)

In [95]:
with open("gpt2-dataset/correctedtrain100k.p", "rb") as f:
    cor1 = pickle.load(f)
with open("gpt2-dataset/correctedtrain150k.p", "rb") as f:
    cor2 = pickle.load(f)

In [96]:
cor1 = cor1 + cor2
print(len(cor1))
cor1 = "".join(cor1)

In [99]:
with open("classic_finetune_train.txt", "w", encoding="UTF-8") as f:
    f.write(cor1)

# Datasets done.