### Note: This notebook needs preprocessed texts (tokenized and numericalized)

> Notebook based on:
> 1. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12_text.ipynb
> 2. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12a_awd_lstm.ipynb
> 3. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12b_lm_pretrain.ipynb
> 4. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12c_ulmfit.ipynb
> 
> Video:
> - https://youtu.be/vnOpEwmtFJ8?t=4687 from 1:18:00 to 2:08:00 (50 mins)

# Imports

In [1]:
import numpy as np
import pathlib
from tqdm.notebook import tqdm
from collections import Counter, defaultdict

import torch
from torch.utils.data import Dataset, DataLoader, Sampler

# Data

In [2]:
!ls "../../Datasets/NLP/IMBd_prepro"

test  train  unsup  vocab.pkl


In [3]:
!ls "../../Datasets/NLP/IMBd_prepro/train"

neg  pos


# Utils

In [5]:
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

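def parallel_map(func, array):
    """Apply func to every item of array, using all CPU cores when possible."""
    
    cpu_cores = multiprocessing.cpu_count()
    array_len = len(array)
    # ProcessPoolExecutor.map requires chunksize >= 1, so guard against short arrays
    chunksize = max(1, array_len // 100)
    
    if cpu_cores < 2:
        # Single core: plain sequential map
        return list(tqdm(map(func, array), total=array_len))
    else:
        with ProcessPoolExecutor(max_workers=cpu_cores) as ex:
            return list(tqdm(ex.map(func, array, chunksize=chunksize), total=array_len))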

---
# <center> Dataset & Dataloader for Language Model
- X: Text
- Y: Same text but shifted by 1 token
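
A minimal sketch of the X/Y relationship, using a hypothetical list of string tokens instead of the numericalized ids used in this notebook. The names `tokens`, `bptt` and `start` are illustrative only:

```python
# Hypothetical toy tokens; the real notebook works on integer token ids
tokens = ["xxbos", "the", "movie", "was", "great", "."]
bptt = 4    # window length (illustrative value)
start = 0   # position of the window in the stream

x = tokens[start : start + bptt]          # input window
y = tokens[start + 1 : start + bptt + 1]  # same window shifted by one token

# At every position, the target is the token that follows the input token
for inp, tgt in zip(x, y):
    print(f"{inp!r} -> {tgt!r}")
```

Each target is simply the next token in the stream, so the model is trained to predict the following token at every position.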

At every epoch:

1. **Shuffle** (sort randomly) our collection of texts.
2. **Concatenate** the individual texts together into a big stream. 
3. **Cut** this stream into a certain number of batches (which is our batch size).
   - For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens.
   
To recap: at every epoch we shuffle our collection of documents and concatenate them into one stream of tokens. We then cut that stream into `batch_size` consecutive mini-streams of equal length. The model reads each mini-stream in order, `bptt` tokens at a time, and because it carries its hidden state from one batch to the next, its activations do not depend on the sequence length we picked.
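
The three steps can be sketched on a toy corpus. This is only an illustration under stated assumptions: the four `texts` arrays and the `batch_size` value are made up, and leftover tokens that do not fill a whole mini-stream are simply dropped.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Hypothetical toy corpus: 4 "texts" of token ids, 20 tokens in total
texts = [np.arange(0, 5), np.arange(5, 12), np.arange(12, 14), np.arange(14, 20)]
batch_size = 4

# 1. Shuffle the order of the texts (not the tokens inside each text)
order = rng.permutation(len(texts))
shuffled = [texts[i] for i in order]

# 2. Concatenate the texts into one big stream
stream = np.concatenate(shuffled)

# 3. Cut the stream into batch_size mini-streams, dropping leftover tokens
mini_len = len(stream) // batch_size
mini_streams = stream[: mini_len * batch_size].reshape(batch_size, mini_len)

print(mini_streams.shape)  # (4, 5)
```

Each row of `mini_streams` is one contiguous slice of the shuffled stream, which is what lets a recurrent model keep its hidden state while reading along a row.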

# Dataset

In [6]:
class LM_Dataset(Dataset):
    
    def __init__(self, prepro_texts_dir, bptt=70, shuffle=False):
        
        # Read tokenized and numericalized text files (NumPy format)
        np_filepaths = list( pathlib.Path(prepro_texts_dir).glob('**/*.npy') ) 
        
        # Open numpy arrays in parallel
        self.texts_np = parallel_map(func=np.load, array=np_filepaths)

        self.bptt    = bptt
        self.shuffle = shuffle
        self.total_tokens = sum([len(t) for t in self.texts_np])
        
        self.concat_texts_into_stream()
        
    # Must be called at the beginning of every training epoch to reshuffle the texts
    def concat_texts_into_stream(self):
        
        # 1. Reorder texts if we need to
        if self.shuffle:
            np.random.shuffle(self.texts_np)
            #self.texts_np = self.texts_np[np.random.permutation(len(self.texts_np))]
            
        # 2. Concat texts into a large stream
        self.stream = np.concatenate(self.texts_np)
        #self.stream = torch.cat([torch.Tensor(t) for t in self.texts_np])
                
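    def __len__(self):
        # Number of complete bptt-sized windows; the -1 leaves room for the shifted target
        return (self.total_tokens - 1) // self.bptt
    
    def __getitem__(self, idx):
        # Item idx is the idx-th non-overlapping window of bptt tokens in the stream
        start = idx * self.bptt
        x = self.stream[start   : start+self.bptt]
        y = self.stream[start+1 : start+self.bptt+1] # shifted by 1
        
        # convert from numpy.uint16 to torch.int64        
        x = torch.tensor(x.astype("int64"))
        y = torch.tensor(y.astype("int64"))
        
        return x,y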

In [7]:
BPTT_LEN = 70 # Length of the mini-sequences cut from the big stream

train_ds = LM_Dataset("../../Datasets/NLP/IMBd_prepro/train", bptt=BPTT_LEN, shuffle=True)
valid_ds = LM_Dataset("../../Datasets/NLP/IMBd_prepro/test",  bptt=BPTT_LEN, shuffle=False)

  0%|          | 0/25000 [00:00<?, ?it/s]

  0%|          | 0/25000 [00:00<?, ?it/s]

In [9]:
train_ds[0]

(tensor([    2,    18,   160,   476,    19,    29,   144,   251,    16,   106,
           410,    60,     9,    18,    25,    59,  2837,   174,   176,    10,
            11,   166,    18,   167,  8772,     9,    18,   419,   416,     8,
           397,   473,    34,     7, 25389,     7, 61636,     9,    18,    85,
            52,    94,   267,    10,    18,    73,   133,    14,   142,    77,
           480,    19,    29,     9,     7,    39,  1200,     7,   536,     7,
          1712,    26,    88,    11,    18,   140,    39,   105,   133,    19]),
 tensor([   18,   160,   476,    19,    29,   144,   251,    16,   106,   410,
            60,     9,    18,    25,    59,  2837,   174,   176,    10,    11,
           166,    18,   167,  8772,     9,    18,   419,   416,     8,   397,
           473,    34,     7, 25389,     7, 61636,     9,    18,    85,    52,
            94,   267,    10,    18,    73,   133,    14,   142,    77,   480,
            19,    29,     9,     7,    39,  1200,

In [35]:
#" ".join(denumericalize(ds[0][0])), " ".join(denumericalize(ds[0][1]))

('xxbos xxmaj the choice to make this snl skit into a movie was far better thought out than other recent ones . xxmaj the humor involved in the character is not annoyance humor , and is also character driven enough to be stretched out for an hour or two . \n\n xxmaj oddly enough the sexual content seemed like it could be avoided , but that may have been because',
 'xxmaj the choice to make this snl skit into a movie was far better thought out than other recent ones . xxmaj the humor involved in the character is not annoyance humor , and is also character driven enough to be stretched out for an hour or two . \n\n xxmaj oddly enough the sexual content seemed like it could be avoided , but that may have been because the')

# Dataloader (with custom sampler for BPTT)

If we divide our big stream of **28 elements** with a **batch_size of 5**, each column is one contiguous mini-stream and each row is one batch:

| Batch         | Mini-stream 1 | Mini-stream 2 | Mini-stream 3 | Mini-stream 4 | Mini-stream 5 |
|---------------|---------------|---------------|---------------|---------------|---------------|
| **1st batch** | stream_idx 0  | stream_idx 6  | stream_idx 12 | stream_idx 18 | stream_idx 23 |
| **2nd batch** | stream_idx 1  | stream_idx 7  | stream_idx 13 | stream_idx 19 | stream_idx 24 |
| **3rd batch** | stream_idx 2  | stream_idx 8  | stream_idx 14 | stream_idx 20 | stream_idx 25 |
| **4th batch** | stream_idx 3  | stream_idx 9  | stream_idx 15 | stream_idx 21 | stream_idx 26 |
| **5th batch** | stream_idx 4  | stream_idx 10 | stream_idx 16 | stream_idx 22 | stream_idx 27 |
| **6th batch** | stream_idx 5  | stream_idx 11 | stream_idx 17 |               |               |

https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/samplers/bptt_sampler.html
    
    

In [11]:
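class BPTT_BatchSampler(Sampler):
    def __init__(self, n_elements, batch_size, drop_last):
        
        # Split the indexes into batch_size contiguous chunks (one per mini-stream);
        # the first n_elements % batch_size chunks get one extra element
        indexes = np.array_split(list(range(n_elements)), batch_size)
        
        # Batch i takes the i-th index of every chunk
        n_batches = n_elements//batch_size
        self.batches_idxs = np.array([x[:n_batches] for x in indexes]).T.tolist()
        
        if not drop_last:
            # Leftover indexes from the longer chunks form a final, smaller batch
            last_batch_idxs = np.array([x[n_batches:] for x in indexes if x[n_batches:].size==1]).T.tolist()
            self.batches_idxs += last_batch_idxs
        
    def __iter__(self):
        return iter(self.batches_idxs)
    
    def __len__(self):
        # Lets len(DataLoader) work when this is used as a batch_sampler
        return len(self.batches_idxs)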

In [12]:
s = BPTT_BatchSampler(n_elements=28, batch_size=5, drop_last=False)
list(s)

[[0, 6, 12, 18, 23],
 [1, 7, 13, 19, 24],
 [2, 8, 14, 20, 25],
 [3, 9, 15, 21, 26],
 [4, 10, 16, 22, 27],
 [5, 11, 17]]

In [13]:
s = BPTT_BatchSampler(n_elements=28, batch_size=5, drop_last=True)
list(s)

[[0, 6, 12, 18, 23],
 [1, 7, 13, 19, 24],
 [2, 8, 14, 20, 25],
 [3, 9, 15, 21, 26],
 [4, 10, 16, 22, 27]]
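
The reason for this layout is that each position in the batch keeps reading the same mini-stream from one batch to the next. A small check of that property, rebuilding the index grid with a hypothetical helper `bptt_batches` (not part of the notebook) that keeps full batches only, as with `drop_last=True`:

```python
import numpy as np

def bptt_batches(n_elements, batch_size):
    """Hypothetical helper: index grid with full batches only."""
    chunks = np.array_split(np.arange(n_elements), batch_size)
    n_batches = n_elements // batch_size
    # Rows are batches, columns are mini-streams
    return np.array([c[:n_batches] for c in chunks]).T

batches = bptt_batches(28, 5)

# Going from one batch to the next, every column advances by exactly one index
steps = np.diff(batches, axis=0)

print(batches[:2].tolist())  # [[0, 6, 12, 18, 23], [1, 7, 13, 19, 24]]
print((steps == 1).all())    # True
```

Because every column advances by one element per batch, the hidden state kept for batch position `j` always continues the text it saw in the previous batch.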

In [39]:
BATCH_SIZE = 64

train_dl = DataLoader(train_ds, batch_sampler=BPTT_BatchSampler(n_elements=len(train_ds),
                                                                batch_size=BATCH_SIZE,
                                                                drop_last=True))

valid_dl = DataLoader(valid_ds, batch_sampler=BPTT_BatchSampler(n_elements=len(valid_ds),
                                                                batch_size=BATCH_SIZE,
                                                                drop_last=True))

# <center> Model