## <center> Preprocess .txt dataset
# <center> `read_file` -> `tokenize` -> `numericalize`
> Notebook based on:
> 1. **https://github.com/fastai/course-v3/blob/master/nbs/dl2/12_text.ipynb**
> 2. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12a_awd_lstm.ipynb
> 3. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12b_lm_pretrain.ipynb
> 4. https://github.com/fastai/course-v3/blob/master/nbs/dl2/12c_ulmfit.ipynb
> 
> Video:
> - https://youtu.be/vnOpEwmtFJ8?t=4687 from 1:18:00 to 2:08:00 (50 mins)
> 
> [Useful blog post about PyTorch Dataset, DataLoader, samplers and collate](https://www.scottcondron.com/jupyter/visualisation/audio/2020/12/02/dataloaders-samplers-collate.html)

# Steps
1. Load the text data
2. Tokenizing (spaCy + custom tokens)
3. Create vocab
4. Numericalizing

### Imports

In [1]:
#from fastai.text.all import *

import numpy as np
import pathlib
import pickle
from collections import Counter, defaultdict
import torch

# For tokenizing
import re
import html
import spacy
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from spacy.symbols import ORTH
from tqdm.notebook import tqdm

# Constants

In [2]:
SPACY_TOKENIZER = spacy.blank("en").tokenizer

# Data

In [2]:
!ls "../../Datasets/NLP/IMBd"

test  train  unsup


In [3]:
!ls "../../Datasets/NLP/IMBd/train"

neg  pos


In [4]:
data_path = pathlib.Path("../../Datasets/NLP/IMBd")
data_path

PosixPath('../../Datasets/NLP/IMBd')

In [5]:
train_filenames = list( (data_path/"train").glob('**/*.txt') )
valid_filenames = list( (data_path/"test").glob('**/*.txt') )
unsup_filenames = list( (data_path/"unsup").glob('**/*.txt') )

In [6]:
print("Train:", len(train_filenames), "reviews")
print("Test: ", len(valid_filenames), "reviews")
print("Unsup:", len(unsup_filenames), "reviews")

Train: 25000 reviews
Test:  25000 reviews
Unsup: 50000 reviews


In [7]:
filenames = train_filenames + valid_filenames + unsup_filenames

In [8]:
len(filenames)

100000

# 1 Read file

In [9]:
def read_file(text_file): 
    with open(text_file, 'r', encoding='utf8') as f:
        return f.read()

In [10]:
read_file(filenames[0])

'The choice to make this SNL skit into a movie was far better thought out than other recent ones. The humor involved in the character is not annoyance humor, and is also character driven enough to be stretched out for an hour or two.<br /><br />Oddly enough the sexual content seemed like it could be avoided, but that may have been because the constraints of live television schooled me to not expect it. I suppose I was thinking more "Leisure Suit Larry" risqué than the producers were...<br /><br />Definitely not a PG-13 movie, which will probably hurt it from ever reaching the heights of its more successful predecessors, but still better premise and writing than its more dismal ones.<br /><br />I liked it, but I doubt it will be a smash hit... (which is sad, as Tim Meadows tends not to do characters that annoy me with quite the frequency other SNL alumni tend to)'

# 2 Tokenizing

### Special tokens
0. `xxunk`: Indicates the word is **unknown**. [`jkajkadsa`] -> [`xxunk`]
1. `xxpad`: Indicates **padding** (no more content)
2. `xxbos`: Indicates the **beginning of stream** (here, a movie review).
3. `xxeos`: Indicates the **end of stream** (here, a movie review).
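4. `xxfld`: Indicates separate **fields** (parts like title, summary etc). Not used in this notebook, so the vocab ids of the tokens below are one lower than their position in this list.
5. `xxrep`: Indicates **repetition**. [`hello!!!!`] -> [`hello`, `xxrep`, `4`, `!`]
6. `xxwrep`: Indicates **word repetition** (4 or more in a row). [`hello`, `hello`, `hello`, `hello`] -> [`xxwrep`, `4`, `hello`]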
7. `xxup`: Indicates the next word is all in capital (since we lowercased everything). [`GOD`] -> [`xxup`, `god`]
8. `xxmaj`: Indicates the next word begins with a capital (since we lowercased everything). [`This`] -> [`xxmaj`, `this`]

In [11]:
UNK     = "xxunk"  # 0
PAD     = "xxpad"  # 1
BOS     = "xxbos"  # 2
EOS     = "xxeos"  # 3
TK_REP  = "xxrep"  # 4
TK_WREP = "xxwrep" # 5
TK_UP   = "xxup"   # 6
TK_MAJ  = "xxmaj"  # 7

default_spec_tok = [UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ]

### Pre-tokenization rules

In [12]:
def sub_br(t):
    "Replaces the <br /> by \n"
    re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
    return re_br.sub("\n", t)

def spec_add_spaces(t):
    "Add spaces around / and #"
    return re.sub(r'([/#])', r' \1 ', t)

def rm_useless_spaces(t):
    "Remove multiple spaces"
    return re.sub(' {2,}', ' ', t)

def replace_rep(t):
    "Replace repetitions at the character level: cccc -> TK_REP 4 c"
    def _replace_rep(m) -> str:
        c,cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    re_rep = re.compile(r'(\S)(\1{3,})')
    return re_rep.sub(_replace_rep, t)
    
def replace_wrep(t):
    "Replace word repetitions (4 or more): word word word word -> TK_WREP 4 word"
    def _replace_wrep(m) -> str:
        c,cc = m.groups()
        return f' {TK_WREP} {len(cc.split())+1} {c} '
    re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')
    return re_wrep.sub(_replace_wrep, t)

def fixup_text(x):
    "Various messy things we've seen in documents"
    re1 = re.compile(r'  +')
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>',UNK).replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
    
default_pre_rules = [fixup_text, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, sub_br]

In [13]:
replace_rep('cccc')

' xxrep 4 c '

In [14]:
replace_wrep('word word word word word ')

' xxwrep 5 word  '
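
Both repetition rules use the quantifier `{3,}` on the back-reference, so they only fire from the fourth repetition onwards. The sketch below restates the two functions from the cell above (same regular expressions) and checks that threshold on a few made-up strings.

```python
import re

TK_REP, TK_WREP = "xxrep", "xxwrep"

def replace_rep(t):
    "Character repetitions: cccc -> TK_REP 4 c"
    def _replace_rep(m):
        c, cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    return re.sub(r'(\S)(\1{3,})', _replace_rep, t)

def replace_wrep(t):
    "Word repetitions: word word word word -> TK_WREP 4 word"
    def _replace_wrep(m):
        c, cc = m.groups()
        return f' {TK_WREP} {len(cc.split())+1} {c} '
    return re.sub(r'(\b\w+\W+)(\1{3,})', _replace_wrep, t)

# Three repetitions are left untouched, four are replaced
print(repr(replace_rep('wow!!!')))         # 'wow!!!'
print(repr(replace_rep('wow!!!!')))        # 'wow xxrep 4 ! '
print(repr(replace_wrep('ha ha ha ')))     # 'ha ha ha '
print(repr(replace_wrep('ha ha ha ha ')))  # ' xxwrep 4 ha  '
```

The captured word includes its trailing separator, which is why the `xxwrep` output ends with two spaces before `rm_useless_spaces` cleans it up.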

### Post-Tokenization rules
These rules are applied after tokenization, on the list of tokens.

In [15]:
def replace_all_caps(x):
    "Replace tokens in ALL CAPS by their lower version and add `TK_UP` before."
    res = []
    for t in x:
        if t.isupper() and len(t) > 1: res.append(TK_UP); res.append(t.lower())
        else: res.append(t)
    return res

def deal_caps(x):
    "Replace all Capitalized tokens by their lower version and add `TK_MAJ` before."
    res = []
    for t in x:
        if t == '': continue
        if t[0].isupper() and len(t) > 1 and t[1:].islower(): res.append(TK_MAJ)
        res.append(t.lower())
    return res

def add_eos_bos(x):
    return [BOS] + x + [EOS]

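# NOTE: deal_caps lowercases every token, so in this order replace_all_caps
# never sees an upper-case token and xxup is never emitted. Listing
# replace_all_caps first would emit xxup, but it would also change the vocab
# and the numericalized outputs shown in this notebook.
default_post_rules = [deal_caps, replace_all_caps, add_eos_bos]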

In [16]:
replace_all_caps(['I', 'AM', 'SHOUTING'])

['I', 'xxup', 'am', 'xxup', 'shouting']

In [17]:
deal_caps(['My', 'name', 'is', 'Javi'])

['xxmaj', 'my', 'name', 'is', 'xxmaj', 'javi']

In [18]:
add_eos_bos(['My', 'name', 'is', 'Javi'])

['xxbos', 'My', 'name', 'is', 'Javi', 'xxeos']
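
The order of the post rules matters, because `deal_caps` lowercases every token it returns. The sketch below restates the two capitalization rules and applies them in both orders to a made-up token list; `apply_rules` is a hypothetical helper introduced only for this illustration.

```python
TK_UP, TK_MAJ = "xxup", "xxmaj"

def replace_all_caps(x):
    "Lowercase ALL CAPS tokens and put TK_UP before them."
    res = []
    for t in x:
        if t.isupper() and len(t) > 1:
            res.extend([TK_UP, t.lower()])
        else:
            res.append(t)
    return res

def deal_caps(x):
    "Lowercase every token and put TK_MAJ before Capitalized ones."
    res = []
    for t in x:
        if t == '':
            continue
        if t[0].isupper() and len(t) > 1 and t[1:].islower():
            res.append(TK_MAJ)
        res.append(t.lower())
    return res

def apply_rules(tokens, rules):
    "Hypothetical helper: apply each rule in turn."
    for rule in rules:
        tokens = rule(tokens)
    return tokens

tokens = ['This', 'SNL', 'skit']
caps_last  = apply_rules(tokens, [deal_caps, replace_all_caps])
caps_first = apply_rules(tokens, [replace_all_caps, deal_caps])
print(caps_last)   # ['xxmaj', 'this', 'snl', 'skit']
print(caps_first)  # ['xxmaj', 'this', 'xxup', 'snl', 'skit']
```

With `deal_caps` first, the information that `SNL` was written in capitals is lost. With `replace_all_caps` first, both markers survive.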

### Tokenizer = `Pre rules` -> `Spacy English word tokenizer` -> `Post rules`

In [19]:
def tokenize(text):
    
    ######### Apply pre rules
    for pre_rule in default_pre_rules: text = pre_rule(text)
        
    ######### SPACY English Tokenizer
    tokens = [str(token) for token in SPACY_TOKENIZER(text)]
    
    ######### Apply post rules
    for post_rule in default_post_rules: tokens = post_rule(tokens)
        
    return tokens

In [20]:
" ".join(tokenize("Hello, my name is Javi!!!!"))

'xxbos xxmaj hello , my name is xxmaj javi xxrep 4 ! xxeos'

### Tokenize texts (in Parallel)
Since tokenizing and applying those rules takes a bit of time, we'll parallelize it using `ProcessPoolExecutor` to go faster.

In [21]:
def parallel_map(func, array):
    
    cpu_cores = multiprocessing.cpu_count()
    array_len = len(array)
    chunksize = max(1, array_len // 100) # ex.map requires chunksize >= 1
    
    if cpu_cores<2:
        return list(tqdm(map(func, array), total=array_len))
    else:
        with ProcessPoolExecutor(max_workers=cpu_cores) as ex:
            return list(tqdm(ex.map(func, array, chunksize=chunksize), total=array_len))

In [22]:
def readfile_and_tokenize(filename):
    return tokenize(read_file(filename))

texts_toks = parallel_map(func=readfile_and_tokenize, array=filenames)

  0%|          | 0/100000 [00:00<?, ?it/s]

# Create a Vocab

In [23]:
def create_vocab(texts_toks, max_vocab=65536, min_freq=2): # 60000
    
    # Count number of occurrences for each token
    token_counts = Counter(p for o in texts_toks for p in o)
    
    print( len(token_counts),                                    "different tokens exist")
    print( len([t for t in token_counts if token_counts[t]>=2]), "tokens appear at least 2 times")
    print( len([t for t in token_counts if token_counts[t]>=3]), "tokens appear at least 3 times")
    print( len([t for t in token_counts if token_counts[t]>=5]), "tokens appear at least 5 times")
    
    # Keep the max_vocab most frequent tokens that appear at least min_freq times
    vocab = [o for o,c in token_counts.most_common(max_vocab) if c >= min_freq]
    
    # Put special tokens (xx) at the beginning of the list
    for o in default_spec_tok[::-1]:
        if o in vocab:
            vocab.remove(o)
        vocab.insert(0, o)
        
    return vocab

vocab = create_vocab(texts_toks, max_vocab=65536, min_freq=2)
del texts_toks

172700 different tokens exist
89384 tokens appear at least 2 times
70718 tokens appear at least 3 times
54932 tokens appear at least 5 times
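
To see what the frequency filter and the special-token step do, here is a minimal sketch on a toy corpus. `toy_create_vocab` is a simplified stand-in for `create_vocab` (no printing), and the shortened `SPECIALS` list and the two toy reviews are assumptions made for the illustration.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]  # shortened list, illustration only

def toy_create_vocab(texts_toks, max_vocab=10, min_freq=2):
    # Count occurrences of every token across all texts
    counts = Counter(tok for toks in texts_toks for tok in toks)
    # Keep the most frequent tokens that reach min_freq
    vocab = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    # Move (or add) the special tokens to the front, preserving their order
    for tok in reversed(SPECIALS):
        if tok in vocab:
            vocab.remove(tok)
        vocab.insert(0, tok)
    return vocab

toy_texts = [
    ["xxbos", "a", "good", "film", "xxeos"],
    ["xxbos", "a", "bad", "film", "xxeos"],
]
toy = toy_create_vocab(toy_texts)
print(toy)
```

`good` and `bad` appear only once, so they are dropped and would later map to `xxunk`. Special tokens missing from the counts (`xxunk`, `xxpad`) are inserted on top of the `max_vocab` most frequent tokens, which is why the final vocab can be slightly larger than `max_vocab`.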


In [24]:
print("Vocab with", len(vocab), "different tokens in IMDb")
print("\nFirst tokens in the vocab:")
print(vocab[:50])

Vocab with 65539 different tokens in IMDb

First tokens in the vocab:
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the', '.', ',', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', '"', "'s", '-', '\n\n', 'was', 'as', 'with', 'for', 'movie', 'but', 'film', 'you', ')', 'on', "n't", '(', 'not', 'are', 'he', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who']


### Save vocab

In [25]:
with open(data_path/'vocab.pkl','wb') as vocab_file:
    pickle.dump(vocab, vocab_file)
print("Vocab saved on", str(data_path/'vocab.pkl'))

######################## Test for reading vocab file 
with open(data_path/'vocab.pkl','rb') as vocab_file:
    assert pickle.load(vocab_file) == vocab

Vocab saved on ../../Datasets/NLP/IMBd/vocab.pkl


# <center> Numericalizing
Once we have tokenized our texts, we replace each token with its index in the vocab. This is called numericalizing.
    
If the token string does not exist in the vocab, we assign 0 (the index of `xxunk`).

In [32]:
# int -> default factory that returns 0 (the id of xxunk) when the token string does not exist
otoi = defaultdict(int, {o:i for i,o in enumerate(vocab)})
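
A small sketch of how the `defaultdict(int, ...)` lookup behaves, using a made-up four-token vocab (`toy_vocab` and `toy_otoi` are illustrative names, not part of the notebook):

```python
from collections import defaultdict

toy_vocab = ["xxunk", "xxpad", "the", "movie"]
toy_otoi = defaultdict(int, {tok: i for i, tok in enumerate(toy_vocab)})

known   = toy_otoi["movie"]      # index of a token that is in the vocab
unknown = toy_otoi["jkajkadsa"]  # missing token: int() is called and returns 0
print(known, unknown)            # 3 0

# Side effect of defaultdict: the missing key has now been stored
print(len(toy_otoi))             # 5

# .get() returns the same default without growing the dict
print(toy_otoi.get("another_missing_token", 0), len(toy_otoi))  # 0 5
```

The silent insertion is harmless here, but it means the mapping grows with every unseen token, so `len(otoi)` is not a reliable vocab size after numericalizing.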

In [33]:
otoi

defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'xxrep': 4,
             'xxwrep': 5,
             'xxup': 6,
             'xxmaj': 7,
             'the': 8,
             '.': 9,
             ',': 10,
             'and': 11,
             'a': 12,
             'of': 13,
             'to': 14,
             'is': 15,
             'it': 16,
             'in': 17,
             'i': 18,
             'this': 19,
             'that': 20,
             '"': 21,
             "'s": 22,
             '-': 23,
             '\n\n': 24,
             'was': 25,
             'as': 26,
             'with': 27,
             'for': 28,
             'movie': 29,
             'but': 30,
             'film': 31,
             'you': 32,
             ')': 33,
             'on': 34,
             "n't": 35,
             '(': 36,
             'not': 37,
             'are': 38,
             'he': 39,
             'his': 40,
           

In [34]:
print(np.iinfo(np.uint8))
print(np.iinfo(np.uint16))
print(np.iinfo(np.uint32))

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------

Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------



In [39]:
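def numericalize(list_of_tokens):
    # uint16 only holds ids 0..65535. The vocab above has 65539 entries, so the
    # ids 65536..65538 do not fit: cap the vocab at 65536 entries or use np.uint32.
    return np.array([otoi[tok] for tok in list_of_tokens], dtype=np.uint16)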

def denumericalize(list_of_nums):
    return [vocab[i] for i in list_of_nums]

toks = tokenize(read_file(filenames[0]))
assert toks == denumericalize(numericalize(toks))
numericalize(toks)

array([    2,     7,     8,  1090,    14,   114,    19,  6434,  7502,
         104,    12,    29,    25,   247,   146,   217,    60,    93,
         102,  1190,   704,     9,     7,     8,   469,   588,    17,
           8,   123,    15,    37,  9268,   469,    10,    11,    15,
         103,   123,  2085,   214,    14,    42,  5656,    60,    28,
          48,   564,    55,   127,     9,    24,     7,  3051,   214,
           8,   888,  1533,   484,    53,    16,    96,    42,  4594,
          10,    30,    20,   227,    41,    99,   107,     8,  9463,
          13,   436,   722, 28287,    89,    14,    37,   534,    16,
           9,    18,  1370,    18,    25,   552,    69,    21,     7,
       16700,     7,  2009,     7,  2562,    21, 14208,    93,     8,
        1121,    86,    92,    24,     7,   423,    37,    12,  5336,
          29,    10,    80,   105,   261,  1515,    16,    51,   144,
        5075,     8,  5684,    13,   112,    69,  1152,  8486,    10,
          30,   152,
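
Storing ids as `uint16` halves the disk space compared with `uint32`, but it only works while the largest id is at most 65535. The check below is a plain-Python sketch of that condition; `fits_uint16` is a hypothetical helper, and the two sizes are the `max_vocab` value and the vocab length printed earlier.

```python
UINT16_MAX = 2**16 - 1  # 65535, the same value as np.iinfo(np.uint16).max

def fits_uint16(vocab_size):
    "True if every id in range(vocab_size) can be stored as uint16."
    largest_id = vocab_size - 1
    return largest_id <= UINT16_MAX

print(fits_uint16(65536))  # True:  ids 0..65535
print(fits_uint16(65539))  # False: ids 65536..65538 are out of range
```

When the check fails, the options are to truncate the vocab to 65536 entries (the rarest tokens then map to `xxunk`) or to save the arrays with a wider dtype.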

### Preprocess texts in parallel

In [37]:
prepro_dir = pathlib.Path("../../Datasets/NLP/IMBd_prepro")

def preprocess_text(input_file_path):
    file_name   = input_file_path.name[:-4]       # filename without ".txt"
    class_name  = input_file_path.parents[0].name # pos or neg
    subset_name = input_file_path.parents[1].name # train, test or unsup
    folder      = prepro_dir / subset_name / class_name
    folder.mkdir(parents=True, exist_ok=True)
    
    text_string = read_file(input_file_path) # 1. Read text file
    text_tokens = tokenize(text_string)      # 2. Tokenize text
    text_nums   = numericalize(text_tokens)  # 3. Numericalize tokens (uint16)
    np.save(folder/file_name, text_nums)     # 4. Save numpy array
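
The output folder is derived from the last two directories of each input path. The sketch below shows that decomposition with `pathlib` alone; the file names are made up, and it assumes the `unsup` reviews sit directly inside `unsup/` with no class subfolder.

```python
from pathlib import PurePosixPath

def split_path(p):
    "Return (file name without extension, parent folder, grandparent folder)."
    p = PurePosixPath(p)
    return p.stem, p.parents[0].name, p.parents[1].name

labeled = split_path("../../Datasets/NLP/IMBd/train/pos/0_9.txt")  # hypothetical file
unsup   = split_path("../../Datasets/NLP/IMBd/unsup/0_0.txt")      # hypothetical file
print(labeled)  # ('0_9', 'pos', 'train')
print(unsup)    # ('0_0', 'unsup', 'IMBd')
```

Under that assumption the unlabeled reviews end up in `IMBd_prepro/IMBd/unsup` rather than `IMBd_prepro/unsup`, which is worth knowing when loading the arrays back.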

In [38]:
_ = parallel_map(func=preprocess_text, array=filenames)

  0%|          | 0/100000 [00:00<?, ?it/s]