# Character-Level Tokenizing

Thoughts

* A character-level vocabulary is one to two orders of magnitude smaller than a word-level one: tens to hundreds of symbols (ASCII alone has 128 code points), except for logographic scripts and some syllabaries.
* It follows that there will be far fewer out-of-vocabulary tokens.
* Each token carries significantly less information (context).
* Used in A. Karpathy's nanoGPT, whose Shakespeare example uses character-level tokenization.
* Downside: strings get translated into significantly longer sequences, so less text fits into a fixed-length context window.
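
The sequence-length and out-of-vocabulary points above can be illustrated on a toy string. This is a minimal sketch, not part of the notebook's pipeline: the sample text, the whitespace `split()` used as a stand-in for a word-level tokenizer, and the probe word `unseen` are all assumptions chosen for illustration.

```python
# Toy comparison of character-level vs. whitespace word-level tokenization.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

char_tokens = list(text)    # one token per character
word_tokens = text.split()  # crude word-level stand-in: split on whitespace

print(f"char-level sequence length: {len(char_tokens)}")
print(f"word-level sequence length: {len(word_tokens)}")

char_vocab = set(char_tokens)
word_vocab = set(word_tokens)

# A word never seen in the corpus is out-of-vocabulary at the word level,
# but can still be spelled out of characters that were seen.
unseen = "speaker"
print("in word vocab:", unseen in word_vocab)
print("covered by char vocab:", all(c in char_vocab for c in unseen))
```

On a sample this small the character vocabulary is not yet smaller than the word vocabulary; the size advantage only appears on a real corpus, where the number of distinct words keeps growing while the character set stays bounded.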


# Imports

In [1]:
%load_ext kedro.ipython
%reload_kedro

from typing import Any, Dict, List, Tuple

import re

from datasets import load_dataset
from datasets.dataset_dict import DatasetDict
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from transformers import CanineTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


# Data 

In [2]:
shakespeare = load_dataset("tiny_shakespeare")
train_split = shakespeare["train"]
test_split = shakespeare["test"]

# Implementation

Take the corpus and split it on every character (ASCII, Unicode, ...). That's it.

In [3]:
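# First 100 characters of the training text.
s = shakespeare['train']['text'][0][:100]

def char_split(texts) -> list[str]:
    # Accepts a string or an iterable of strings and returns a flat list
    # of single characters.
    chars = []
    for text in texts:
        # Extending a list with a string appends its characters one by one.
        chars.extend(text)
    return chars

print([s])
print(char_split(s))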

['First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou']
['F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n', ':', '\n', 'B', 'e', 'f', 'o', 'r', 'e', ' ', 'w', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 'e', 'd', ' ', 'a', 'n', 'y', ' ', 'f', 'u', 'r', 't', 'h', 'e', 'r', ',', ' ', 'h', 'e', 'a', 'r', ' ', 'm', 'e', ' ', 's', 'p', 'e', 'a', 'k', '.', '\n', '\n', 'A', 'l', 'l', ':', '\n', 'S', 'p', 'e', 'a', 'k', ',', ' ', 's', 'p', 'e', 'a', 'k', '.', '\n', '\n', 'F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n', ':', '\n', 'Y', 'o', 'u']
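
Splitting into characters is only half of a tokenizer; a model needs integer ids. The following is a minimal sketch of that step, assuming the vocabulary is simply the sorted set of characters seen in the text. The names `stoi`, `itos`, `encode`, and `decode` are illustrative and not defined elsewhere in this notebook.

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Vocabulary: every distinct character in the text, in a stable order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s: str) -> list[int]:
    # Raises KeyError for a character that is not in the vocabulary.
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode("hear me")
print(ids)
print(decode(ids))
```

Encoding followed by decoding returns the original string as long as every character was present when the vocabulary was built.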


## Canine Tokenizer
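
The `input_ids` in the encoding below look like plain Unicode code points (70 for `F`, 105 for `i`, and so on), preceded by 57344. The sketch below reproduces that mapping with the built-ins `ord` and `chr` only. Reading 57344 (`0xE000`, a private-use code point) as the tokenizer's leading special token is an assumption drawn from the output shown here, not something the notebook confirms.

```python
text = "First"

# Character-level ids as Unicode code points.
code_points = [ord(c) for c in text]
print(code_points)

# Assumption: the leading 57344 in the encoding is a special start token
# taken from the Unicode private-use area (0xE000).
ASSUMED_CLS = 0xE000
print([ASSUMED_CLS] + code_points)

# Code points map straight back to characters.
print("".join(chr(cp) for cp in code_points))
```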

In [4]:
s = shakespeare['train']['text'][:100]

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")

In [5]:
encoding = tokenizer(s, padding="longest", truncation=True)
encoding


[1m{[0m
    [32m'input_ids'[0m: [1m[[0m
        [1m[[0m
            [1;36m57344[0m,
            [1;36m70[0m,
            [1;36m105[0m,
            [1;36m114[0m,
            [1;36m115[0m,
            [1;36m116[0m,
            [1;36m32[0m,
            [1;36m67[0m,
            [1;36m105[0m,
            [1;36m116[0m,
            [1;36m105[0m,
            [1;36m122[0m,
            [1;36m101[0m,
            [1;36m110[0m,
            [1;36m58[0m,
            [1;36m10[0m,
            [1;36m66[0m,
            [1;36m101[0m,
            [1;36m102[0m,
            [1;36m111[0m,
            [1;36m114[0m,
            [1;36m101[0m,
            [1;36m32[0m,
            [1;36m119[0m,
            [1;36m101[0m,
            [1;36m32[0m,
            [1;36m112[0m,
            [1;36m114[0m,
            [1;36m111[0m,
            [1;36m99[0m,
            [1;36m101[0m,
            [1;36m101[0m,
            [1;36m100[0m,
            [1;36