# Task 2

## Summary

* Added ```GPT2likeModel``` in ```transformer.py```
* Added preprocessing function ```preprocess_data``` in ```dataset.py```. I preprocess the data (read, filter, tokenize) once and then pass the same preprocessed data to the Datasets as input
* Implemented ```BrainDataset```, ```BigBrainDataset``` and ```collate_fn``` according to the task
* Also implemented ```UltraDuperBigBrainBatchSampler```. I try to sample all data uniformly. Steps:
    * Calculate the table "lengths to inds"
    * Sample the "main" length with weights (the number of indexes for each length)
    * Set a "window" around the "main" length with width = n_bins
    * Find all lengths from the window which are in the table
    * Sample uniformly from the union of the indexes with these lengths
    * For batches with fewer than batch_size elements, the calculation time is scaled (according to the number of elements in the batch)
* I use only part of the data to make the calculations faster
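
The sampler steps above can be sketched roughly as follows. This is a minimal illustration, not the actual ```UltraDuperBigBrainBatchSampler```: the class name ```WindowedLengthBatchSampler```, the ```seed``` parameter, the exact placement of the window around the "main" length, and the removal of used indexes (so that each sample appears once per epoch) are all assumptions.

```python
import random
from collections import defaultdict


class WindowedLengthBatchSampler:
    """Hypothetical sketch of a length-windowed batch sampler."""

    def __init__(self, lengths, batch_size, n_bins, seed=0):
        self.lengths = list(lengths)  # lengths[i] = number of tokens in sample i
        self.batch_size = batch_size
        self.n_bins = n_bins          # window width, must be >= 1
        self.rng = random.Random(seed)

    def __iter__(self):
        # Step 1: table "length -> indexes of samples with that length".
        table = defaultdict(list)
        for idx, length in enumerate(self.lengths):
            table[length].append(idx)

        while table:
            # Step 2: sample the "main" length, weighted by the number
            # of indexes still available for each length.
            keys = list(table)
            weights = [len(table[k]) for k in keys]
            main = self.rng.choices(keys, weights=weights, k=1)[0]

            # Step 3 (assumed layout): n_bins consecutive lengths that
            # contain the main length.
            low = main - (self.n_bins - 1) // 2
            window = range(low, low + self.n_bins)

            # Step 4: keep only the lengths that are present in the table.
            candidates = [
                i for length in window if length in table for i in table[length]
            ]

            # Step 5: sample uniformly from the union of candidate indexes.
            # The batch may be smaller than batch_size.
            size = min(self.batch_size, len(candidates))
            batch = self.rng.sample(candidates, size)

            # Assumption: drop used indexes so every sample is yielded once.
            for i in batch:
                bucket = table[self.lengths[i]]
                bucket.remove(i)
                if not bucket:
                    del table[self.lengths[i]]

            yield batch
```

An object like this could be passed to ```torch.utils.data.DataLoader``` through its ```batch_sampler``` argument, which accepts any iterable that yields lists of indexes. With ```n_bins=1``` every batch contains rows of a single length, so no padding is needed, but small batches become frequent.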

## Experiments

In [2]:
import pandas as pd
from run_epoch import run_epoch, DataMode
from dataset import preprocess_data

In [3]:
data_path = "wikitext-103-raw/wiki.train.raw"

data, vocab = preprocess_data(data_path)

In [4]:
b = run_epoch(data, vocab, DataMode.BRAIN)

Warmup:


 26%|██▌       | 125/478 [00:04<00:13, 25.72it/s]


Train:


100%|██████████| 478/478 [00:41<00:00, 11.50it/s]

{'minimum': 0.07377535899286158, 'maximum': 0.1460416749905562, 'mean': 0.08628600094553616, 'median': 0.08739759100717492}





In [5]:
bb = run_epoch(data, vocab, DataMode.BIG_BRAIN)

Warmup:


 26%|██▌       | 125/478 [00:04<00:11, 30.50it/s]


Train:


100%|██████████| 478/478 [00:39<00:00, 12.22it/s]

{'minimum': 0.04276562300219666, 'maximum': 0.10457310400670394, 'mean': 0.08114802340774019, 'median': 0.08639974300604081}





In [6]:
udbb_1 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=1)

Warmup:


 38%|███▊      | 182/478 [00:02<00:04, 61.47it/s]


Train:


100%|██████████| 478/478 [00:21<00:00, 22.53it/s]

{'minimum': 0.00821347399323713, 'maximum': 0.2687243759864941, 'mean': 0.08958919143679632, 'median': 0.06779857650690246}





In [7]:
udbb_5 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=5)

Warmup:


 30%|██▉       | 142/478 [00:02<00:06, 48.92it/s]


Train:


100%|██████████| 478/478 [00:25<00:00, 18.60it/s]

{'minimum': 0.008278658002382144, 'maximum': 0.26943800796288997, 'mean': 0.07330763733603204, 'median': 0.06403578600293258}





In [8]:
udbb_10 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=10)

Warmup:


 29%|██▉       | 138/478 [00:03<00:07, 44.58it/s]


Train:


100%|██████████| 478/478 [00:27<00:00, 17.70it/s]

{'minimum': 0.009054530994035304, 'maximum': 0.2715580399380997, 'mean': 0.06734283700428005, 'median': 0.06339345399464946}





In [9]:
udbb_20 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=20)

Warmup:


 27%|██▋       | 131/478 [00:02<00:07, 45.70it/s]


Train:


100%|██████████| 478/478 [00:25<00:00, 18.43it/s]

{'minimum': 0.0081318369921064, 'maximum': 0.3507025999715552, 'mean': 0.06032542965493233, 'median': 0.05599357799655991}





In [10]:
udbb_50 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=50)

Warmup:


 27%|██▋       | 128/478 [00:02<00:07, 48.26it/s]


Train:


100%|██████████| 478/478 [00:27<00:00, 17.62it/s]

{'minimum': 0.010012742000981234, 'maximum': 0.2355414800113067, 'mean': 0.058628822775937756, 'median': 0.061706441003479995}





In [11]:
udbb_640 = run_epoch(data, vocab, DataMode.ULTRA_DUPER_BIG_BRAIN, n_bins=640)

Warmup:


 26%|██▌       | 125/478 [00:03<00:10, 32.17it/s]


Train:


100%|██████████| 478/478 [00:38<00:00, 12.36it/s]

{'minimum': 0.04487882800458465, 'maximum': 0.28268466400913894, 'mean': 0.08057339574560282, 'median': 0.08378183650347637}





In [24]:
columns = ["experiment", "minimum", "maximum", "mean", "median"]
names = ["b", "bb", "udbb_1", "udbb_5", "udbb_10", "udbb_20", "udbb_50", "udbb_640"]
experiments = [b, bb, udbb_1, udbb_5, udbb_10, udbb_20, udbb_50, udbb_640]
# One row per experiment: its name followed by the statistics returned by run_epoch
df = pd.DataFrame(
    [[name] + [exp[stat] for stat in columns[1:]] for name, exp in zip(names, experiments)],
    columns=columns,
)

In [23]:
df

| | experiment | minimum | maximum | mean | median |
|---|---|---|---|---|---|
| 0 | b | 0.073775 | 0.146042 | 0.086286 | 0.087398 |
| 1 | bb | 0.042766 | 0.104573 | 0.081148 | 0.0864 |
| 2 | udbb_1 | 0.008213 | 0.268724 | 0.089589 | 0.067799 |
| 3 | udbb_5 | 0.008279 | 0.269438 | 0.073308 | 0.064036 |
| 4 | udbb_10 | 0.009055 | 0.271558 | 0.067343 | 0.063393 |
| 5 | udbb_20 | 0.008132 | 0.350703 | 0.060325 | 0.055994 |
| 6 | udbb_50 | 0.010013 | 0.235541 | 0.058629 | 0.061706 |
| 7 | udbb_640 | 0.044879 | 0.282685 | 0.080573 | 0.083782 |


## Results

* ```BigBrainDataset``` shows slightly better results than ```BrainDataset``` (mean 0.0811 vs 0.0863). It could be much better, but I use batch_size=8, so a batch is quite likely to contain a row with more than 640 tokens
* The results of ```UltraDuperBigBrainDataset``` with ```n_bins=1``` are poor because of the small amount of data (I take 8192 rows). For a given length value there are often fewer than batch_size rows, and after time scaling we get worse results
* Larger ```n_bins``` values (5, 10, 20, 50) show good results: the mean goes down from 0.0733 to 0.0586, compared with 0.0863 for ```BrainDataset```
* ```n_bins=640``` shows nearly the same results as ```BigBrainDataset``` (mean 0.0806 vs 0.0811)
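
The effect of time scaling on small batches can be illustrated with a short sketch. The formula here is an assumption: the text only says that the time is scaled according to the number of elements in the batch, so a linear factor of batch_size / n_elements is used, and the function name ```scale_batch_time``` is hypothetical.

```python
def scale_batch_time(elapsed_seconds, n_elements, batch_size):
    """Rescale a measured batch time as if the batch were full.

    Assumption: linear scaling by batch_size / n_elements; the actual
    formula used in run_epoch may differ.
    """
    if not 0 < n_elements <= batch_size:
        raise ValueError("n_elements must be in the range 1..batch_size")
    return elapsed_seconds * batch_size / n_elements


# A full batch is left unchanged.
full_batch_time = scale_batch_time(0.08, n_elements=8, batch_size=8)

# A batch with only 2 of 8 rows has its time multiplied by 4, so any
# fixed per-batch overhead is multiplied by 4 as well.
small_batch_time = scale_batch_time(0.03, n_elements=2, batch_size=8)
```

Under this assumption a 2-element batch that takes 0.03 is reported as 0.12, which is larger than a typical full-batch time in the table above. This would be consistent with the high maximum and mean observed for ```n_bins=1```.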