# 1. Context

This notebook loads and preprocesses data for a [Word2Vec implementation](https://arxiv.org/abs/1301.3781). The scale of data processing is reduced to fit consumer GPU constraints.

# 2. Dataset in Scope

The WikiText-2 corpus, which has around 2M tokens, is explored because it is

- Small enough to run on consumer GPUs
- Suitable for POC experiments

If resources permit, WikiText-103 will also be tried.

# 3. Basic Imports

In [18]:
from datasets import load_dataset
import re


# 4. Loading Data

In [30]:
dataset_raw = load_dataset("wikitext", "wikitext-2-raw-v1")

## 4.1. Processing Data

### 4.1.1. English Tokeniser 

Word2Vec did not use a sophisticated tokeniser, so to keep things simple, an English tokeniser is created with the following rules

1. Lowercasing
2. Punctuation separation
3. Removal of unwanted symbols such as currency signs and quotation marks (apostrophes are kept)


In [67]:
def basic_english_tokenizer(text):
    # Lowercase
    text = text.lower()
    # Separate punctuation from words
    text = re.sub(r"([.,!?;])", r" \1 ", text)
    # Remove unwanted characters such as dollar signs, colons and quotes
    text = re.sub(r"[^a-zA-Z0-9.,!?;'\s]", '', text)
    # Tokenize by whitespace
    tokens = text.split()
    return tokens

# Test Example
print(basic_english_tokenizer("South African cricket team won 2025 World Test Championship Final. Fans were waiting for more than 25 years"))


['south', 'african', 'cricket', 'team', 'won', '2025', 'world', 'test', 'championship', 'final', '.', 'fans', 'were', 'waiting', 'for', 'more', 'than', '25', 'years']
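
The sketch below probes a few edge cases of these rules. It repeats the tokenizer so the snippet runs on its own, and the sample strings are illustrative assumptions rather than part of the original notebook run. It suggests that non-ASCII letters such as `ō` and the `@-@` hyphen marker are dropped, while apostrophes are kept.

```python
import re

def basic_english_tokenizer(text):
    # Same rules as the tokenizer above, repeated so this runs standalone
    text = text.lower()
    text = re.sub(r"([.,!?;])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z0-9.,!?;'\s]", '', text)
    return text.split()

# Illustrative inputs (assumed examples)
samples = [
    "Senjō no Valkyria 3",   # non-ASCII letter is removed, truncating the word
    "role @-@ playing",      # the @-@ hyphen marker disappears entirely
    "Don't pay $5!",         # apostrophe kept, currency sign dropped
]
for s in samples:
    print(s, "->", basic_english_tokenizer(s))
```

This is why tokens such as `'senj'` can appear when the tokenizer is applied to text containing accented characters.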


### 4.1.2. Exploring Raw Data

1. The raw data is a list of strings that also contains the document metadata hierarchy, marked by `=` characters

For example, running the code below

```python
print(dataset_raw["train"]["text"][0:4])
```

produces
```python
['',
 ' = Valkyria Chronicles III = \n',
 '',
 ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n']
```
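
Because headings are marked with `=`, a row can be classified as heading or body text by inspecting its ends. The helper below is a minimal sketch: `heading_level` is a hypothetical name, and the second sample row assumes that deeper headings repeat the marker (for example ` = = Gameplay = = `), which should be checked against the data.

```python
def heading_level(line):
    """Return the number of leading '=' markers, or 0 for non-heading rows."""
    stripped = line.strip()
    if not (stripped.startswith("=") and stripped.endswith("=")):
        return 0
    level = 0
    for ch in stripped:
        if ch == "=":
            level += 1
        elif ch != " ":
            break  # first character of the title text
    return level

print(heading_level(' = Valkyria Chronicles III = \n'))  # top-level title row
print(heading_level(' = = Gameplay = = \n'))             # assumed sub-heading format
print(heading_level(''))                                  # blank row
```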

### 4.1.3. Preprocessing Scheme

We are interested in a sentence representation that preserves the basic English tokenisation defined in `basic_english_tokenizer`. To do that

1. Join the strings into one big string via the `join` function
2. Ensure a newline between each element with `"\n".join(dataset_raw["train"]["text"][0:4])`
3. Pass the big string to `basic_english_tokenizer`

This gives the output below


```python
['valkyria', 'chronicles', 'iii', 'senj', 'no', 'valkyria', '3', 'unrecorded', 'chronicles', 'japanese', '3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3', ',', 'commonly', 'referred', 'to', 'as', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', ',', 'is', 'a', 'tactical', 'role', 'playing', 'video', 'game', 'developed', 'by', 'sega', 'and', 'media', '.', 'vision', 'for', 'the', 'playstation', 'portable', '.', 'released', 'in', 'january', '2011', 'in', 'japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the', 'valkyria', 'series', '.', 'employing', 'the', 'same', 'fusion', 'of', 'tactical', 'and', 'real', 'time', 'gameplay', 'as', 'its', 'predecessors', ',', 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and', 'follows', 'the', 'nameless', ',', 'a', 'penal', 'military', 'unit', 'serving', 'the', 'nation', 'of', 'gallia', 'during', 'the', 'second', 'europan', 'war', 'who', 'perform', 'secret', 'black', 'operations', 'and', 'are', 'pitted', 'against', 'the', 'imperial', 'unit', 'calamaty', 'raven', '.']
```
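
Since the tokenizer ends with a whitespace `split()`, the exact separator passed to `join` should not change the tokens, as long as it is whitespace. A minimal sketch with shortened stand-in rows (the `rows` list is illustrative, not the real dataset), using plain `split()` in place of the full tokenizer:

```python
# Shortened stand-ins for dataset_raw["train"]["text"][0:4]
rows = ['', ' = Valkyria Chronicles III = \n', '',
        ' Senjō no Valkyria 3 : Unrecorded Chronicles \n']

single = "\n".join(rows)
double = "\n\n".join(rows)

# str.split() with no arguments collapses any run of whitespace,
# so both joined strings yield the same token sequence
assert single.split() == double.split()
print(single.split())
```

One consequence of joining is that row boundaries are lost: the result is a single flat token list.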

In [73]:
# basic_english_tokenizer("\n\n".join(dataset_raw["train"]["text"][0:4]))

In [85]:
def tokenize_example(example):
    return {'tokens': basic_english_tokenizer(example['text'])}

In [86]:
tokenized_dataset = dataset_raw.map(tokenize_example, batched=False)

Map:   0%|          | 0/4358 [00:00<?, ? examples/s]

Map:   0%|          | 0/36718 [00:00<?, ? examples/s]

Map:   0%|          | 0/3760 [00:00<?, ? examples/s]
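
Blank rows in the raw data tokenize to empty lists, and heading rows yield only a few tokens. If such rows are not wanted as training sentences, they could be filtered out after tokenisation. The sketch below works on plain Python lists; `drop_short_rows` and `min_tokens` are hypothetical names, and the threshold of 2 is an assumption (a row needs at least two tokens to form one centre/context pair).

```python
def drop_short_rows(token_rows, min_tokens=2):
    """Keep only rows with at least `min_tokens` tokens."""
    return [row for row in token_rows if len(row) >= min_tokens]

# Illustrative rows shaped like the tokenised output (not the full dataset)
rows = [[], ['valkyria', 'chronicles', 'iii'], [], ['senj', 'no', 'valkyria', '3']]
kept = drop_short_rows(rows)
print(len(rows), "->", len(kept))
```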

In [89]:
vv = tokenized_dataset['train']['tokens'][0:10]

In [90]:
vv[3]

['senj',
 'no',
 'valkyria',
 '3',
 'unrecorded',
 'chronicles',
 'japanese',
 '3',
 ',',
 'lit',
 '.',
 'valkyria',
 'of',
 'the',
 'battlefield',
 '3',
 ',',
 'commonly',
 'referred',
 'to',
 'as',
 'valkyria',
 'chronicles',
 'iii',
 'outside',
 'japan',
 ',',
 'is',
 'a',
 'tactical',
 'role',
 'playing',
 'video',
 'game',
 'developed',
 'by',
 'sega',
 'and',
 'media',
 '.',
 'vision',
 'for',
 'the',
 'playstation',
 'portable',
 '.',
 'released',
 'in',
 'january',
 '2011',
 'in',
 'japan',
 ',',
 'it',
 'is',
 'the',
 'third',
 'game',
 'in',
 'the',
 'valkyria',
 'series',
 '.',
 'employing',
 'the',
 'same',
 'fusion',
 'of',
 'tactical',
 'and',
 'real',
 'time',
 'gameplay',
 'as',
 'its',
 'predecessors',
 ',',
 'the',
 'story',
 'runs',
 'parallel',
 'to',
 'the',
 'first',
 'game',
 'and',
 'follows',
 'the',
 'nameless',
 ',',
 'a',
 'penal',
 'military',
 'unit',
 'serving',
 'the',
 'nation',
 'of',
 'gallia',
 'during',
 'the',
 'second',
 'europan',
 'war',
 'who',