# Word-Level Tokenizing

If it is not already present, download the dataset from "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" to `data/01_raw/shakespeare.txt`.

In [12]:
%load_ext kedro.ipython
%reload_kedro

from typing import Any, Dict, List, Tuple

import re

from datasets import load_dataset
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

The kedro.ipython extension is already loaded. To reload it, use:
  %reload_ext kedro.ipython


# Data

In [2]:
shakespeare = load_dataset("tiny_shakespeare")
train_split = shakespeare["train"]
test_split = shakespeare["test"]

# Split on Words and Specific Punctuation

In [3]:


s = shakespeare['train']['text'][0][:100]

In [4]:
s

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [5]:
delimiters = r";|,|\n|'|`| "
pat = re.compile(delimiters)
t = re.split(pat, s[:100])
print(t)

['First', 'Citizen:', 'Before', 'we', 'proceed', 'any', 'further', '', 'hear', 'me', 'speak.', '', 'All:', 'Speak', '', 'speak.', '', 'First', 'Citizen:', 'You']


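This short sample contains no contractions (the `'` delimiter would split them), but we can see that words stay roughly whole, with trailing punctuation such as `:` and `.` still attached. The next step is to assign a unique integer to each word.
For us, a good-enough vocabulary is the set of all tokens in the entire dataset.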
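
Before handing this to a library, it may help to see the idea in plain Python. The following is a minimal sketch, not the method used in the rest of this notebook: `build_vocab` is a hypothetical helper name, and sorting the tokens before numbering them is an assumption made so the IDs are reproducible.

```python
import re

# Same delimiter set as the cell above.
pat = re.compile(r";|,|\n|'|`| ")

def build_vocab(texts):
    """Map each distinct token to an integer ID, numbered in sorted order."""
    tokens = set()
    for text in texts:
        tokens.update(re.split(pat, text))
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

sample = "First Citizen:\nBefore we proceed any further, hear me speak."
vocab = build_vocab([sample])
print(vocab)
```

Because Python sorts uppercase letters before lowercase ones, the empty string gets ID 0 and capitalized tokens come before lowercase tokens.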

In [6]:
count_vec = CountVectorizer(analyzer='word', tokenizer=lambda x: re.split(pat, x), lowercase=False, stop_words=None)
count_vec.fit(shakespeare['train']['text'])
# print the vocabulary encoding values, sorted
print(f"{len(count_vec.vocabulary_) = }")
{k: v for k, v in sorted(count_vec.vocabulary_.items(), key=lambda item: item[1])}

len(count_vec.vocabulary_) = 18120



{
    '': 0,
    '!': 1,
    '&C:': 2,
    '&c.': 3,
    '--': 4,
    '--Hold': 5,
    '--I': 6,
    '--O': 7,
    '--Plague': 8,
    '--Tybalt': 9,
    '--Where': 10,
    '--a': 11,
    '--an': 12,
    '--and': 13,
    '--as': 14,
    '--be': 15,
    '--believe': 16,
    '--but': 17,
    '--cast': 18,
    '--cousin': 19,
    '--do': 20,
    '--for': 21,
    '--give': 22,
    '--goddess!--O': 23,
    '--here': 24,
    '--how': 25,
    '--if': 26,


In [7]:
count_vec.vocabulary_['plague']

12682

In [8]:
def encoding(s: List[str], cv) -> List[int]:
    """s is the pre-split string. """
    vocab = cv.vocabulary_
    encoded = [vocab[tok] for tok in s if tok in vocab]
    return encoded

In [9]:
print(t)
my_encoding = encoding(t, count_vec)
print(my_encoding)

['First', 'Citizen:', 'Before', 'we', 'proceed', 'any', 'further', '', 'hear', 'me', 'speak.', '', 'All:', 'Speak', '', 'speak.', '', 'First', 'Citizen:', 'You']
[1051, 592, 327, 17454, 13052, 3702, 8521, 0, 9236, 11141, 15103, 0, 155, 2595, 0, 15103, 0, 1051, 592, 3175]


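Note the `0`s: they are not the spaces themselves but the empty string `''` (ID 0 in the vocabulary), which `re.split` emits whenever two delimiters are adjacent, such as `, ` or `\n\n`.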
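
To check what the IDs mean, we can invert the mapping and decode them. This is a minimal sketch under stated assumptions: `decode` is a hypothetical helper, and `toy_vocab` is a three-entry stand-in for `count_vec.vocabulary_` so the example runs on its own.

```python
from typing import Dict, List

def decode(ids: List[int], vocab: Dict[str, int]) -> List[str]:
    """Invert a token -> ID mapping and look each ID up."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return [inverse[i] for i in ids]

# Toy vocabulary standing in for count_vec.vocabulary_ (assumption).
toy_vocab = {"": 0, "Speak": 1, "speak.": 2}
ids = [1, 0, 2, 0]

tokens = decode(ids, toy_vocab)
print(tokens)

# Empty tokens carry no content, so they can be filtered out.
non_empty = [tok for tok in tokens if tok]
print(non_empty)
```

Decoding is lossy: the delimiters themselves were discarded by the split, so the original spacing and punctuation cannot be reconstructed from the IDs alone.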

## Thoughts

The idea is interesting: each word holds a _lot_ of contextual information. That is, each word, even a compound word, feels "unique" in the space. In the case of the Shakespeare dataset, this is `N=18120` unique tokens given our delimiter splitting (the `len(count_vec.vocabulary_)` printed above). A real English unigram vocabulary could be hundreds of thousands of tokens in cardinality.

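Limits: lack of shared meaning across similar words. `plague` and `plagues` are separate vocabulary entries with unrelated IDs (`count_vec.vocabulary_['plague']` is `12682` above). Their IDs are numerically close only because the vocabulary is sorted alphabetically; numeric distance does not represent similarity.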

### Stopwords

We may want to keep the vocabulary smaller. One option is to drop "stop" words: words that connect other words or are used very frequently. Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction.

In [13]:
nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /Users/b/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
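
As a rough sketch of how such a list could be applied to our tokens: the helper `drop_stop_words` is hypothetical, and `STOP_WORDS` is a small hand-picked subset of the NLTK list printed above so the example runs without a download. In the notebook, `set(stopwords.words('english'))` would presumably take its place.

```python
from typing import Iterable, List, Set

# Hand-picked subset of the NLTK English stop word list (assumption:
# the full list would be used in practice).
STOP_WORDS: Set[str] = {"we", "any", "further", "me", "you", "all", "before"}

def drop_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Remove empty tokens and tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok and tok.lower() not in stop_words]

tokens = ['First', 'Citizen:', 'Before', 'we', 'proceed', 'any',
          'further', '', 'hear', 'me', 'speak.']
kept = drop_stop_words(tokens, STOP_WORDS)
print(kept)
```

The NLTK list is lowercase, so tokens are lowercased only for the comparison. Tokens with punctuation still attached, such as `All:`, would not match `all` and would slip through unless the punctuation is stripped first.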
