In [9]:
from transformers import AutoTokenizer
from datasets import load_dataset
import torch

# Tokenization

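Hugging Face ships a tokenizer with each pretrained model. I still need to read more on why, since one would assume a tokenizer depends on the corpus and not on the model. The likely reason: the vocabulary is built from the pretraining corpus, and the model's embedding layer is indexed by the resulting token ids, so a model has to be used with the exact tokenizer it was trained with.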
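
To make the tokenizer/model coupling concrete, here is a toy sketch in plain Python. The two vocabularies, the embedding table and the `embed` helper are all made up for illustration; nothing here comes from a real checkpoint. It only shows that the same tokens yield different vectors when the ids come from a vocabulary the embedding table was not built for.

```python
# Toy illustration: every name and number below is hypothetical.
# A model's embedding table is indexed by token id, so the id assigned to
# each token must be the one the model saw during pretraining.

vocab_model_a = {"[PAD]": 0, "using": 1, "a": 2, "network": 3}
vocab_model_b = {"[PAD]": 0, "a": 1, "network": 2, "using": 3}

# One row per token id, as "learned" by the hypothetical model A.
embeddings_model_a = [
    [0.0, 0.0],  # id 0 -> [PAD]
    [0.9, 0.1],  # id 1 -> "using"
    [0.2, 0.7],  # id 2 -> "a"
    [0.5, 0.5],  # id 3 -> "network"
]


def embed(tokens, vocab, table):
    """Map tokens to ids with `vocab`, then look the ids up in `table`."""
    return [table[vocab[token]] for token in tokens]


tokens = ["using", "a", "network"]
matched = embed(tokens, vocab_model_a, embeddings_model_a)      # right vocabulary
mismatched = embed(tokens, vocab_model_b, embeddings_model_a)   # wrong vocabulary

print(matched)
print(mismatched)
```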

In [2]:
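# Note: despite the variable names, both BERT checkpoints use WordPiece (subword) tokenization
# and differ only in casing. ByT5 works on UTF-8 bytes rather than on characters.
word_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
character_tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
subword_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")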

In [3]:
sentence = "Using a Transformer network is simple"

print(word_tokenizer.tokenize(sentence))
print(character_tokenizer.tokenize(sentence))
print(subword_tokenizer.tokenize(sentence))

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
['U', 's', 'i', 'n', 'g', ' ', 'a', ' ', 'T', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'e', 'r', ' ', 'n', 'e', 't', 'w', 'o', 'r', 'k', ' ', 'i', 's', ' ', 's', 'i', 'm', 'p', 'l', 'e']
['using', 'a', 'transform', '##er', 'network', 'is', 'simple']
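
The `##` pieces in the BERT outputs come from WordPiece's greedy longest-match-first splitting. Below is a simplified sketch of that idea in plain Python. The `wordpiece_tokenize` helper and the tiny `toy_vocab` are assumptions made for this example, and the real tokenizer also handles normalization, punctuation and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: treat the whole word as unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical vocabulary, just large enough for the example sentence.
toy_vocab = {"using", "a", "transform", "##er", "network", "is", "simple"}

sentence = "Using a Transformer network is simple"
tokens = [
    piece
    for word in sentence.lower().split()
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print(tokens)
```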


# Loading and preparing data

Attempt at loading and preparing data for training. Hopefully the tokenizers define a padding token (this can be checked with `word_tokenizer.pad_token`).
Some tokenizers may take context and semantics into account; I don't think these three do, though I might be wrong. In this case I will simply use each review as one element of the batch.
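
As a rough picture of what padding does to a batch, here is a minimal sketch in plain Python. The `pad_batch` helper is hypothetical and is not how the library implements it; the pad id of 0 and the ids in the toy input are only placeholders.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy "reviews" of different lengths, already converted to ids.
toy_batch = pad_batch([[101, 2038, 102], [101, 2038, 12647, 106, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding to the longest sequence in the batch keeps every row the same length without adding more padding than needed.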

In [5]:
dataset = load_dataset("lhoestq/demo1")

In [8]:
dataset['train'][0]

{'id': '7bd227d9-afc9-11e6-aba1-c4b301cdf627',
 'package_name': 'com.mantz_it.rfanalyzer',
 'review': "Great app! The new version now works on my Bravia Android TV which is great as it's right by my rooftop aerial cable. The scan feature would be useful...any ETA on when this will be available? Also the option to import a list of bookmarks e.g. from a simple properties file would be useful.",
 'date': 'October 12 2016',
 'star': 4,
 'version_id': 1487}

In [35]:
reviews = [str(review['review']) for review in dataset['train']]

In [45]:
reviews

["Great app! The new version now works on my Bravia Android TV which is great as it's right by my rooftop aerial cable. The scan feature would be useful...any ETA on when this will be available? Also the option to import a list of bookmarks e.g. from a simple properties file would be useful.",
 "Great It's not fully optimised and has some issues with crashing but still a nice app  especially considering the price and it's open source.",
 "Works on a Nexus 6p I'm still messing around with my hackrf but it works with my Nexus 6p  Trond usb-c to usb host adapter. Thanks!",
 'The bandwidth seemed to be limited to maximum 2 MHz or so. I tried to increase the bandwidth but not possible. I purchased this is because one of the pictures in the advertisement showed the 2.4GHz band with around 10MHz or more bandwidth. Is it not possible to increase the bandwidth? If not  it is just the same performance as other free APPs.',
 'Works well with my Hackrf Hopefully new updates will arrive for extra f

In [46]:
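# len(review) counts characters, not tokens, so it is not a valid max_length.
# padding=True pads each review to the longest tokenized review in the batch;
# truncation=True cuts anything longer than the model's maximum input length.
batch = word_tokenizer(reviews, padding=True, truncation=True)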

In [47]:
batch

{'input_ids': [[101, 2038, 12647, 106, 1109, 1207, 1683, 1208, 1759, 1113, 1139, 139, 1611, 7137, 13693, 1794, 1134, 1110, 1632, 1112, 1122, 112, 188, 1268, 1118, 1139, 27915, 10485, 6095, 119, 1109, 14884, 2672, 1156, 1129, 5616, 119, 119, 119, 1251, 27269, 1592, 1113, 1165, 1142, 1209, 1129, 1907, 136, 2907, 1103, 5146, 1106, 13757, 170, 2190, 1104, 1520, 22328, 174, 119, 176, 119, 1121, 170, 3014, 4625, 4956, 1156, 1129, 5616, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

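At first glance there seems to be a problem: all `token_type_ids` are the same (0). This is expected rather than a bug. `token_type_ids` mark which segment of a sentence pair each token belongs to, and each review is passed to the tokenizer as a single sequence, so every token falls in segment 0. The tokenizer only produces a second segment when it is given two texts per example.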
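
A minimal sketch of how BERT-style segment ids (`token_type_ids`) are laid out, in plain Python. The `build_pair_inputs` helper is hypothetical and works on already-split tokens; it only mirrors the `[CLS] A [SEP] B [SEP]` convention and is not the library's implementation.

```python
def build_pair_inputs(tokens_a, tokens_b=None):
    """Add special tokens and assign segment ids the way BERT-style models expect."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first segment, including [CLS] and [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    return tokens, token_type_ids


# Single sequence: every token belongs to segment 0.
single_tokens, single_types = build_pair_inputs(["great", "app"])

# Sentence pair: the second text is marked as segment 1.
pair_tokens, pair_types = build_pair_inputs(["great", "app"], ["works", "well"])

print(single_tokens, single_types)
print(pair_tokens, pair_types)
```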