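# Hugging Face tutorial

Based on the Hugging Face NLP course, chapter 7, section 3 ("Fine-tuning a masked language model"), TensorFlow version: https://huggingface.co/learn/nlp-course/chapter7/3?fw=tf#fine-tuning-a-masked-language-model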

In [1]:
from transformers import TFAutoModelForMaskedLM

In [6]:
model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [5]:
model.summary()

Model: "tf_distil_bert_for_masked_lm"
_________________________________________________________________
 Layer (type)                             Output Shape   Param #
=================================================================
 distilbert (TFDistilBertMainLayer)       multiple       66362880
 vocab_transform (Dense)                  multiple       590592
 vocab_layer_norm (LayerNormalization)    multiple       1536
 vocab_projector (TFDistilBertLMHead)     multiple       23866170
=================================================================
Total params: 66,985,530
Trainable params: 66,985,530
Non-trainable params: 0
_________________________________________________________________

In [7]:
text = "This is a great [MASK]."

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [10]:
import numpy as np
import tensorflow as tf

inputs = tokenizer(text, return_tensors="np")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

>>> This is a great deal.
>>> This is a great success.
>>> This is a great adventure.
>>> This is a great idea.
>>> This is a great feat.
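The cell above ranks candidates by raw logits. To read the ranking as probabilities, the logits can be passed through a softmax first. The following is a minimal standard-library sketch, not part of the original notebook; `top_k_probabilities` and the toy logits are hypothetical names and values chosen for illustration.

```python
import math


def top_k_probabilities(logits, k=5):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    # Subtract the maximum logit for numerical stability before exponentiating
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort indices by descending probability and keep the first k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical logits over a six-token toy vocabulary (not model output)
toy_logits = [2.0, 0.5, 3.1, -1.0, 1.2, 0.0]
for index, prob in top_k_probabilities(toy_logits, k=3):
    print(f"token {index}: p = {prob:.3f}")
```

Because softmax is monotonic, the order of the candidates is the same as with `np.argsort(-mask_token_logits)`; only the scale changes.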


In [12]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [13]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

In [14]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [15]:
tokenizer.model_max_length

512

In [16]:
chunk_size = 128

In [17]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [18]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


In [19]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'
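The chunk lengths printed above follow directly from the arithmetic: 363 + 304 + 133 = 800 tokens, and 800 = 6 × 128 + 32. A small sketch, using only the numbers shown in the outputs (`chunk_lengths` is a hypothetical helper, not part of the notebook):

```python
def chunk_lengths(total_length, chunk_size):
    """Lengths of the consecutive slices produced by range(0, total_length, chunk_size)."""
    return [
        min(chunk_size, total_length - start)
        for start in range(0, total_length, chunk_size)
    ]


review_lengths = [363, 304, 133]  # token counts printed by the earlier cell
total = sum(review_lengths)
lengths = chunk_lengths(total, 128)
print(total, lengths)
```

The final 32-token chunk is the remainder that the next cell decides to drop.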


In [20]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
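To see what `group_texts` does on a small input, here is a self-contained sketch with the chunk size passed in as a parameter. It is an illustration rather than the notebook's function: `group_texts_toy` and the toy token ids are hypothetical, `itertools.chain` replaces `sum(..., [])` to avoid repeated list copies, and the labels are copied chunk by chunk.

```python
from itertools import chain


def group_texts_toy(examples, chunk_size):
    """Same steps as group_texts, with chunk_size passed in explicitly."""
    # Flatten each column into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of chunk_size, dropping the short tail
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the inputs; masking happens later
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Hypothetical token ids for three short "documents"
toy = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts_toy(toy, chunk_size=4)
print(grouped["input_ids"])
```

Ten tokens with a chunk size of 4 give two full chunks, and the last two tokens are discarded.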

In [21]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [22]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [23]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [24]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [25]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] i rented i am curious - yellow from my video store [MASK] professionals all the controversy that surrounded it when [MASK] was first released in [MASK]. side also [MASK] [MASK] [MASK] first it was seized by u. s. customs if it ever tried to enter this country, therefore [MASK] a [MASK] of films [MASK] " controversial " i really had [MASK] see this for myself. < br / > < br / > the plot [MASK] centered around steel young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attention [MASK] to making [MASK] sort of [MASK] on what the average [MASK]ede thought about certain political [MASK] such'

'>>> as [MASK] vietnam war and race issues [MASK] the united states. in between [MASK] politicians and ordinary denize [MASK] of stockholm about their opinions on politics, she has sex [MASK] her dil [MASK], [MASK], and married men. < br / [MASK] < br resulted > what kills me about i am curious - yellow is that 40 yea
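The output above shows both `[MASK]` tokens and a few unexpected words (for example "professionals" or "steel"). That matches the usual BERT-style recipe, in which a selected token is replaced by `[MASK]` most of the time, by a random token some of the time, and otherwise left unchanged. The sketch below imitates that recipe with the standard library; it is a simplified illustration under the assumption of an 80/10/10 split, not the collator's actual code, and `mask_tokens` is a hypothetical name.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    """Token-level masking sketch: returns (masked_ids, labels)."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, token in enumerate(input_ids):
        # Never select special tokens; select other tokens with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # assumed 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # assumed 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


# Toy sequence: ids 101, 102, 103 are assumed to be [CLS], [SEP], [MASK]
ids = [101] + list(range(1000, 1020)) + [102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, special_ids={101, 102})
print(masked)
print(labels)
```

Positions whose label is `-100` keep their original token, so the loss is only computed where the input was selected for masking.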

In [26]:
import collections
import numpy as np

from transformers.data.data_collator import tf_default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return tf_default_data_collator(features)
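The core of the collator is the map from word index to token positions, which lets all pieces of a word be masked together. The following sketch isolates that step so it can be run without TensorFlow; `words_to_token_indices` and the example `word_ids` lists are hypothetical.

```python
import collections


def words_to_token_indices(word_ids):
    """Group token positions by the word they belong to, skipping special tokens."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical word_ids: one single-token word, then one word split into three pieces
print(words_to_token_indices([None, 0, 1, 1, 1, None]))

# Edge case after concatenating texts: two different one-word texts both use word id 0
print(words_to_token_indices([None, 0, None, None, 0, None]))
```

The second call shows a subtlety of this grouping logic: because `current_word` is not reset at special tokens, two adjacent texts whose neighbouring words share the same id are treated as one word. This is probably rare in practice, but it is worth knowing when chunks span several short texts.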

In [27]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am curious [MASK] yellow from my video [MASK] because [MASK] [MASK] [MASK] [MASK] that [MASK] it when it was first released in 1967. i [MASK] heard that at first it was [MASK] [MASK] u. s. customs if it ever tried to enter this country, therefore being a fan of films considered " controversial " i really [MASK] to see this for myself [MASK] [MASK] br / > < br [MASK] > the [MASK] [MASK] centered around a young swedish [MASK] student named lena who wants to learn [MASK] she can about life. in particular she [MASK] to focus her attentions to [MASK] some sort of documentary on what the average swede thought [MASK] certain political issues such'

'>>> as [MASK] vietnam [MASK] and [MASK] issues in the united states. in [MASK] [MASK] [MASK] [MASK] ordinary [MASK] [MASK] [MASK] [MASK] [MASK] about their opinions on politics, she [MASK] sex [MASK] [MASK] drama teacher [MASK] classmates, [MASK] [MASK] [MASK]. [MASK] [MASK] / [MASK] < br / > [MASK] kills me [MASK] i am curi

In [28]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [29]:
from huggingface_hub import notebook_login

notebook_login()


In [30]:
tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [31]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

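# One pass over the training set: one optimizer step per batch
num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    # Note: with 10,000 chunks at batch size 32 there are far fewer than
    # 1,000 steps, so the learning rate is still warming up when training ends
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)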
model.compile(optimizer=optimizer)

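# Request mixed-precision float16. Note: a global policy only applies to layers
# created after it is set, so setting it here (after from_pretrained and
# compile) likely leaves the already-built model in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")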

model_name = model_checkpoint.split("/")[-1]
callback = PushToHubCallback(
    output_dir=f"{model_name}-finetuned-imdb", tokenizer=tokenizer
)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.


Cloning https://huggingface.co/BjankaV/distilbert-base-uncased-finetuned-imdb into local empty directory.
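The interaction between `num_warmup_steps=1_000` and a single epoch of training can be checked with a little arithmetic. The sketch below models a linear warmup followed by a linear decay; it is a simplified stand-in for the schedule's general shape, not the library's implementation, and it assumes that partial batches are dropped so that one epoch has `10_000 // 32` steps.

```python
def warmup_then_linear_decay(step, init_lr, num_warmup_steps, num_train_steps):
    """Simplified schedule: linear warmup to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return init_lr * (1.0 - progress)


# Assumption: 10,000 training chunks at batch size 32, partial batch dropped
steps_per_epoch = 10_000 // 32
peak_lr = max(
    warmup_then_linear_decay(s, 2e-5, 1_000, steps_per_epoch)
    for s in range(steps_per_epoch)
)
print(steps_per_epoch, peak_lr)
```

Under these assumptions the run ends after about a third of the warmup, so the learning rate never reaches `init_lr`. Reducing the warmup or training for more epochs would change that.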


In [33]:
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 23.48


In [34]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback])



<keras.callbacks.History at 0x21624cc6b20>

In [35]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 13.57
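Perplexity here is the exponential of the mean cross-entropy loss, so the two reported values can be converted back to losses with a logarithm. A short sketch using only the numbers printed above (`perplexity` and `loss_from_perplexity` are hypothetical helper names):

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl):
    """Inverse of perplexity(): recover the loss from a reported perplexity."""
    return math.log(ppl)


before, after = 23.48, 13.57  # values printed before and after fine-tuning
print(f"loss before fine-tuning: {loss_from_perplexity(before):.3f}")
print(f"loss after fine-tuning:  {loss_from_perplexity(after):.3f}")
```

Since the evaluation set is masked randomly by the data collator each time it is iterated, repeated evaluations may give slightly different perplexities.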


In [51]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="BjankaV/distilbert-base-uncased-finetuned-imdb"
)

In [52]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great deal.
>>> this is a great idea.
>>> this is a great adventure.
>>> this is a great success.
>>> this is a great one.


# OffensEval dataset

In [1]:
from transformers import TFAutoModelForMaskedLM

In [2]:
model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [3]:
model.summary()

Model: "tf_distil_bert_for_masked_lm"
_________________________________________________________________
 Layer (type)                             Output Shape   Param #
=================================================================
 distilbert (TFDistilBertMainLayer)       multiple       66362880
 vocab_transform (Dense)                  multiple       590592
 vocab_layer_norm (LayerNormalization)    multiple       1536
 vocab_projector (TFDistilBertLMHead)     multiple       23866170
=================================================================
Total params: 66,985,530
Trainable params: 66,985,530
Non-trainable params: 0
_________________________________________________________________

In [4]:
text = "Women are [MASK]."

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [6]:
import numpy as np
import tensorflow as tf

inputs = tokenizer(text, return_tensors="np")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

>>> Women are excluded.
>>> Women are welcome.
>>> Women are exempt.
>>> Women are included.
>>> Women are prohibited.


In [7]:
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict

train_dataset = load_dataset('csv', data_files='datasets/train_level_a.csv')['train']
test_dataset = load_dataset('csv', data_files='datasets/test_level_a.csv')['train']

train_dataset = train_dataset.rename_column("subtask_a", "label")

train_dataset = train_dataset.remove_columns(["id"])
test_dataset = test_dataset.remove_columns(["id"])

offens_dataset = DatasetDict({'train': train_dataset, 'test': test_dataset})
offens_dataset

DatasetDict({
    train: Dataset({
        features: ['tweet', 'label'],
        num_rows: 13240
    })
    test: Dataset({
        features: ['tweet', 'label'],
        num_rows: 860
    })
})

In [8]:
sample = offens_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Tweet: {row['tweet']}'")
    print(f"'>>> Label: {row['label']}'")

'>>> Tweet: @USER She is really good for him and told him how he needed to straighten up. I like her and I like them together. Sometimes you just need someone who calls you out on your sh*t so you can become a better person 💖'
'>>> Label: NOT'

'>>> Tweet: @USER Kirinodere and the Curious Case of the Fucked Customs Fees. 10/10 good book.'
'>>> Label: OFF'

'>>> Tweet: @USER @USER There's nothing more joyous than watching a snowflake cry especially when it's a google order follower.'
'>>> Label: NOT'


In [9]:
def tokenize_function(examples):
    result = tokenizer(examples["tweet"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = offens_dataset.map(
    tokenize_function, batched=True, remove_columns=["tweet", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 13240
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 860
    })
})

In [10]:
tokenizer.model_max_length

512

In [103]:
chunk_size = 128

In [12]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Tweet {idx} length: {len(sample)}'")

'>>> Tweet 0 length: 18'
'>>> Tweet 1 length: 27'
'>>> Tweet 2 length: 41'


In [13]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated tweets length: {total_length}'")

'>>> Concatenated tweets length: 86'


In [104]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 86'
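Unlike the IMDb reviews, three tweets together (18 + 27 + 41 = 86 tokens) do not fill a single 128-token chunk. A quick check of how many full chunks a batch yields, using only the lengths printed above (`full_chunks` is a hypothetical helper):

```python
def full_chunks(lengths, chunk_size):
    """Return (number of full chunks, leftover tokens) for concatenated sequences."""
    total = sum(lengths)
    return total // chunk_size, total % chunk_size


tweet_lengths = [18, 27, 41]  # token counts printed by the earlier cell
n_chunks, leftover = full_chunks(tweet_lengths, 128)
print(n_chunks, leftover)
```

With a rule that drops any chunk shorter than `chunk_size`, this three-tweet sample would produce no training examples at all. Grouping only works here because `map(batched=True)` concatenates many tweets per batch (1,000 examples by default) before chunking.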


In [105]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [106]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 3481
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 270
    })
})

In [107]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'user liberals are all kookoo!!! [SEP] [CLS] @ user @ user oh noes! tough shit. [SEP] [CLS] @ user was literally just talking about this lol all mass shootings like that have been set ups. it ’ s propaganda used to divide us on major issues like gun control and terrorism [SEP] [CLS] @ user buy more icecream!!! [SEP] [CLS] @ user canada doesn ’ t need another cuck! we already have enough # looneyleft # liberals f * * king up our great country! # qproofs # trudeaumustgo [SEP] [CLS] @ user @ user @ user it ’ s'

In [108]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

'user liberals are all kookoo!!! [SEP] [CLS] @ user @ user oh noes! tough shit. [SEP] [CLS] @ user was literally just talking about this lol all mass shootings like that have been set ups. it ’ s propaganda used to divide us on major issues like gun control and terrorism [SEP] [CLS] @ user buy more icecream!!! [SEP] [CLS] @ user canada doesn ’ t need another cuck! we already have enough # looneyleft # liberals f * * king up our great country! # qproofs # trudeaumustgo [SEP] [CLS] @ user @ user @ user it ’ s'

In [109]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [110]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] @ user she should [MASK] a few native americans what their take on this is. [SEP] [CLS] @ user @ user go [MASK] you [MASK] affair drunk!!! @ user # maga # trump [MASK]20 [UNK] url [SEP] [CLS] amazon is investigating chinese employees [MASK] are selling internal data to [MASK] - party sellers looking for an edge in the competitive [MASK]. url # amazon # maga # kag # china # [MASK]ot [SEP] [CLS] @ user someone should'vetake [MASK] " this piece of shit to a volcano. [UNK]inski [SEP] [CLS] @ user @ user obama wanted liberals & amp ; illegals to move into red states [SEP] [CLS] @'

'>>> user liberals are all kookoo!!! [SEP] [CLS] @ user @ user oh noes! tough shit. [SEP] [CLS] @ user [MASK] literally just talking about this lol all mass shootings like that have been set ups. it ’ s propaganda used to divide us on major issues like gun control and terrorism [SEP] [CLS] @ user buy more ice [MASK] [MASK]! [MASK]! [SEP] [CLS] @ [MASK] canada doesn [MASK] t need another cuck! [MASK] a

In [111]:
import collections
import numpy as np

from transformers.data.data_collator import tf_default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return tf_default_data_collator(features)

In [112]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] [MASK] user she should ask a few native americans what their [MASK] on [MASK] [MASK]. [SEP] [CLS] @ user @ user go home [MASK] ’ [MASK] drunk [MASK]!! @ user # [MASK] [MASK] [MASK] trump2020 [MASK] url [SEP] [CLS] [MASK] is [MASK] chinese employees [MASK] [MASK] selling internal [MASK] to third - party sellers looking for an edge in [MASK] competitive marketplace. url # [MASK] # maga [MASK] [MASK] [MASK] # china # tcot [SEP] [CLS] @ [MASK] someone should'vetaken " this piece of shit [MASK] [MASK] volcano. [MASK] " [SEP] [CLS] @ [MASK] @ user obama wanted liberals [MASK] [MASK] ; [MASK] [MASK] to move into [MASK] [MASK] [SEP] [CLS] @'

'>>> user liberals [MASK] [MASK] kookoo!! [MASK] [SEP] [CLS] @ user @ user oh noes! tough shit [MASK] [SEP] [CLS] [MASK] user [MASK] literally just talking [MASK] this lol all mass shootings like that have been set [MASK]. it ’ s [MASK] used to divide [MASK] on [MASK] issues like gun control [MASK] terrorism [SEP] [CLS] [MASK] user buy more ic

In [57]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
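# Left unexecuted: lm_datasets["train"] has only 3,481 chunks here, fewer than
# train_size, so the next cell uses the full train and test splits instead.
train_size = 10_000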
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

In [113]:
tf_train_dataset = model.prepare_tf_dataset(
    lm_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    lm_datasets["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [114]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

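# One pass over the training set: one optimizer step per batch
num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    # Note: with 3,481 chunks at batch size 32 there are far fewer than
    # 1,000 steps, so the learning rate is still warming up when training ends
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)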
model.compile(optimizer=optimizer)

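# Request mixed-precision float16. Note: a global policy only applies to layers
# created after it is set, so setting it here (after from_pretrained and
# compile) likely leaves the already-built model in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")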

model_name = model_checkpoint.split("/")[-1]
callback = PushToHubCallback(
    output_dir=f"{model_name}-finetuned-offens", tokenizer=tokenizer
)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
C:\Users\Bjanka\Desktop\APT\tar-project\distilbert-base-uncased-finetuned-offens is already a clone of https://huggingface.co/BjankaV/distilbert-base-uncased-finetuned-offens. Make sure you pull the latest changes with `repo.git_pull()`.


In [115]:
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 79.43


In [None]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback])

In [117]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 46.16


In [119]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model=model, tokenizer=tokenizer
)

In [120]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> women are excluded.
>>> women are welcome.
>>> women are included.
>>> women are prohibited.
>>> women are allowed.
