# Data preprocessing
----
Download the enwik8 dataset from https://mattmahoney.net/dc/textdata.html, \
then process it into a pretokenized binary file.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
DATA_CACHE_DIR = "/home/rod/storage/enwik8"

In [3]:
import os
import glob
import numpy as np
import pathlib
import pickle

In [4]:
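# Build the string of unique characters used by the tokenizer
dp = pathlib.Path(DATA_CACHE_DIR) / "enwik8"
with open(dp, "r", encoding="utf-8") as f:
    # NOTE: set iteration order is not stable across Python sessions,
    # so the pickled string written below is the reference ordering.
    characters = "".join(set(f.read()))

with open(dp.parent / "char.pkl", "wb") as f:
    pickle.dump(characters, f)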

In [5]:
# Sanity check: the pickled character string round-trips unchanged
with open(dp.parent / "char.pkl", "rb") as f:
    char = pickle.load(f)
characters == char

True

In [None]:
char

In [7]:
len(char)

6064
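
The character string above comes from iterating over a `set`, and for strings that order can differ between Python processes (string hashing is randomized unless `PYTHONHASHSEED` is fixed). Rebuilding the string in a new session may therefore give a different order than the pickled one. A minimal sketch of a reproducible alternative, assuming a sorted vocabulary is acceptable; `build_vocab` is a hypothetical helper, not part of the project:

```python
def build_vocab(text: str) -> str:
    """Return the unique characters of `text`, sorted by code point."""
    return "".join(sorted(set(text)))


sample = "hello world"
vocab = build_vocab(sample)
print(repr(vocab))  # ' dehlorw'
```

Sorting changes the character order, so any token file produced with the unsorted string would have to be regenerated.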

## Tokenizing the data

In [11]:
import sys
# Make the project's tokenizer module importable from this notebook
sys.path.append("/home/rod/Projects/llama")

In [12]:
from tokenizer import SimpleTokenizer
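
The source of `SimpleTokenizer` is not shown in this notebook. For orientation, here is a minimal sketch of what a character-level tokenizer built on a character string might look like. The class name `CharTokenizer` and its methods are hypothetical and are not the project's implementation:

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer; NOT the project's SimpleTokenizer."""

    def __init__(self, characters: str):
        # The position of a character in the string is its token id
        self.itos = list(characters)
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters outside the vocabulary
        return [self.stoi[ch] for ch in text]

    def decode(self, ids) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer(" dehlorw")
ids = tok.encode("hello world")
print(ids)  # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(tok.decode(ids))  # hello world
```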

In [13]:
with open(dp, "r", encoding="utf-8") as f:
    data = f.read()

In [14]:
print(data[:100], len(data))

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSch 99621832


In [16]:
print(data[22500:26000])

t==
{{seealso|Anarcho-syndicalism}}

[[Image:Flag of Anarcho syndicalism.svg|thumb|175px|The red-and-black flag, coming from the experience of anarchists in the labour movement, is particularly associated with anarcho-syndicalism.]]

[[Anarcho-syndicalism]] was an early 20th century working class movement seeking to overthrow capitalism and the state to institute a worker controlled society. The movement pursued [[industrial action]]s, such as [[general strike]], as a primary strategy. Many anarcho-syndicalists believed in [[anarchist communism]], though not all communists believed in syndicalism.

After the [[Paris Commune|1871 repression]] French anarchism reemerged, influencing the ''Bourses de Travails'' of autonomous workers groups and trade unions. From this movement the [[Confédération Générale du Travail]] (General Confederation of Work, CGT) was formed in 1895 as the first major anarcho-syndicalist movement. [[Emile Pataud]] and [[Emile Pouget]]'s writing for the CGT saw [[lib

In [17]:
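enc = SimpleTokenizer()
# The two positional False flags are assumed to disable BOS/EOS insertion; check tokenizer.py
tokens = enc.encode(data, False, False)
# uint16 suffices as long as every token id is below 65536 (the vocabulary has 6064 characters)
all_tok = np.array(tokens, dtype=np.uint16)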
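
Storing tokens as `uint16` halves the file size relative to `uint32`, but it only works if every id fits in 16 bits. A small sketch of that check, assuming ids run from 0 to `vocab_size - 1` (the id layout of `SimpleTokenizer` is not shown here, so that range is an assumption):

```python
import numpy as np

vocab_size = 6064  # number of distinct characters reported by len(char)
max_id = np.iinfo(np.uint16).max  # largest value a uint16 can hold: 65535

# Assumption: token ids lie in [0, vocab_size - 1]
assert vocab_size - 1 <= max_id

tokens = [0, 17, vocab_size - 1]  # illustrative ids only
arr = np.array(tokens, dtype=np.uint16)
print(arr.dtype, arr.itemsize)  # uint16 2
```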

In [18]:
tokenized_filename = pathlib.Path("/home/rod/storage/enwik8/enwik8.bin")

with open(tokenized_filename, "wb") as f:
    f.write(all_tok.tobytes())

In [19]:
# Sanity check
a = np.fromfile(tokenized_filename, dtype=np.uint16)
print(a, a.shape)

[2963 2697 2802 ... 5612 1435 3369] (99621832,)
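
`np.fromfile` loads the whole file into memory. For files of this size, `np.memmap` is an alternative that reads only the slices that are actually indexed. A minimal sketch on a small stand-in file (the temporary path and the `arange` contents are placeholders, not the real token file):

```python
import os
import tempfile

import numpy as np

# Stand-in file so the sketch runs anywhere; in the notebook this would be enwik8.bin
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
np.arange(1000, dtype=np.uint16).tofile(path)

# mode="r" maps the file read-only; data is paged in from disk on access
m = np.memmap(path, dtype=np.uint16, mode="r")
print(m.shape)

# Copy one 256-token window into an ordinary in-memory array
chunk = np.array(m[256:512])
```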


### Creating the split


In [21]:
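# Drop the trailing tokens that do not fill a 256-token row, then shuffle the rows in place.
# b is a view of a, so a is shuffled as well.
b = a[:256 * (a.shape[0] // 256)]
b = b.reshape((-1, 256))
np.random.shuffle(b)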
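
The shuffle above permutes whole 256-token rows and leaves the token order inside each row intact. It uses NumPy's global random state without a seed, so the split differs from run to run. A sketch of a reproducible variant on a toy array, assuming a fixed seed is wanted (the seed value and the short `seq_len` are arbitrary choices for illustration):

```python
import numpy as np

seq_len = 4  # 256 in the notebook; small here for readability
a = np.arange(10, dtype=np.uint16)

# Copy so the original array is left untouched (a reshaped slice would be a view)
b = a[: seq_len * (a.shape[0] // seq_len)].copy().reshape(-1, seq_len)

rng = np.random.default_rng(0)  # arbitrary seed
rng.shuffle(b)  # permutes rows in place; tokens inside a row keep their order

print(b)
```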

In [23]:
# 90% train, 5% val; test takes the remaining rows so none are lost to int() truncation
n = b.shape[0]
t = int(0.9 * n)
s = int(0.05 * n)
bounds = {"train": (0, t), "val": (t, t + s), "test": (t + s, n)}
for split, (start, end) in bounds.items():
    tokenized_filename = pathlib.Path(f"/home/rod/storage/enwik8/enwik8_{split}.bin")
    with open(tokenized_filename, "wb") as f:
        f.write(b[start:end].flatten().tobytes())
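
Because `int()` truncates, the 90/5/5 row counts need not add up to the total number of rows, so it helps to let the last split take the remainder. A sketch of the boundary arithmetic on its own; `split_bounds` is a hypothetical helper, and the row count is 99621832 // 256 from the shape printed earlier:

```python
def split_bounds(n_rows: int, train_frac: float = 0.9, val_frac: float = 0.05):
    """Return {split: (start, end)} row ranges; test takes whatever remains."""
    t = int(train_frac * n_rows)
    s = int(val_frac * n_rows)
    return {"train": (0, t), "val": (t, t + s), "test": (t + s, n_rows)}


n_rows = 99621832 // 256  # 389147 full rows of 256 tokens
bounds = split_bounds(n_rows)
for name, (start, end) in bounds.items():
    print(name, start, end, end - start)
```

With these numbers the truncated counts are 350232 and 19457, and twice the latter plus the former is one row short of 389147. Assigning the remainder to test accounts for that row.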