# Pre-training LLMs with Hugging Face

In [None]:
# # All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
!pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 torch==2.1.0+cu118
# # - Update a specific package
!pip install pmdarima -U
# # - Update a package to specific version
!pip install --upgrade pmdarima==2.0.2
# # Note: If your environment doesn't support "!pip install", use "!mamba install"

In [None]:
!pip install transformers==4.40.0 
!pip install -U git+https://github.com/huggingface/transformers
!pip install --user datasets # 2.15.0
!pip install --user "portalocker>=2.0.0"
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U accelerate
!pip install --user torch==2.3.0
!pip install -U torchvision
!pip install --user protobuf==3.20
!pip install --user dataset

- !pip install transformers==4.40.0

Installs version 4.40.0 of the transformers library. This library by Hugging Face provides pre-trained models and tools for natural language processing (NLP) and machine learning.
______________________________________________________________
- !pip install -U git+https://github.com/huggingface/transformers

Installs the latest version of the transformers library directly from Hugging Face's GitHub repository (-U ensures upgrading if a version is already installed).
______________________________________________________________
- !pip install --user datasets # 2.15.0

Installs the datasets library. The command itself does not pin a version; the trailing comment suggests that version 2.15.0 was used when the lab was written. This library, also by Hugging Face, is used for loading and manipulating datasets.
______________________________________________________________
- !pip install --user "portalocker>=2.0.0"

Installs version 2.0.0 or higher of the portalocker library, which helps in file locking for cross-platform applications. The requirement is quoted so that the shell does not interpret `>` as output redirection.
______________________________________________________________
- !pip install -q -U git+https://github.com/huggingface/accelerate.git

Silently (-q for quiet mode) installs the latest version of the accelerate library from its GitHub repository. This library optimizes and scales machine learning workflows.
______________________________________________________________
- !pip install -q -U accelerate

Updates the accelerate library to the latest version. If it's already installed, it ensures it's up-to-date.
______________________________________________________________
- !pip install --user torch==2.3.0

Installs version 2.3.0 of PyTorch, a popular machine learning library for deep learning.
______________________________________________________________
- !pip install -U torchvision

Updates the torchvision library, which provides tools and datasets for computer vision tasks.
______________________________________________________________
- !pip install --user protobuf==3.20

Installs version 3.20 of the protobuf library, which is used for serializing structured data and is often required by machine learning frameworks.
______________________________________________________________
- !pip install --user dataset

Installs the dataset library. This is different from Hugging Face's datasets library and is used to work with databases in Python.

# Set TOKENIZERS_PARALLELISM to false

In [None]:
# Set the environment variable TOKENIZERS_PARALLELISM to 'false'
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Pretraining and self-supervised fine-tuning
Let's load a pretrained model and run inference:

In [None]:
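from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the OPT-350M checkpoint and its tokenizer, then generate a continuation
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("this movie was really")[0]["generated_text"])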


# Self-supervised training of a BERT model
- Prepare the train dataset
- Train a tokenizer
- Preprocess the dataset
- Pre-train BERT using an MLM task
- Evaluate the trained model

In [None]:
from datasets import load_dataset

# Load the dataset and keep small subsets so the lab runs quickly
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(200))

## Why Create Text Files?

These files can serve as input to a TextDataset object, which is commonly used in NLP tasks like pretraining language models. Here's the workflow:

- Extract raw data: Text is extracted from datasets and saved to text files.
- Preprocessing: Text files are loaded into objects like TextDataset, where tokenization and formatting happen.
- Training models: Preprocessed datasets are then fed into models for training using objectives like Masked Language Modeling (MLM).

## Step 1: Specify File Paths
These lines define the paths where the training and test datasets will be saved as text files.

In [None]:
output_file_train = "wikitext_dataset_train.txt"
output_file_test = "wikitext_dataset_test.txt"


## Step 2: Save Training Dataset
- Open File: The training file (wikitext_dataset_train.txt) is opened in write mode ("w"), with UTF-8 encoding to handle special characters.

- Iterate Through Examples: Loop through each example in the training set (dataset["train"]).

- Write to File: For every example, write the text field from the dataset to the file, adding a newline ("\n") after each example.

In [None]:
with open(output_file_train, "w", encoding="utf-8") as f:
    for example in dataset["train"]:
        f.write(example["text"] + "\n")

## Step 3: Save Test Dataset
This is almost identical to the training dataset step, except it processes the test set (dataset["test"]) and writes to a separate file (wikitext_dataset_test.txt).

In [None]:
with open(output_file_test, "w", encoding="utf-8") as f:
    for example in dataset["test"]:
        f.write(example["text"] + "\n")
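
A quick way to sanity-check the saved files is to read them back. The sketch below mirrors the write loop above, but uses a small in-memory list in place of `dataset["train"]` and a temporary file path; both `toy_examples` and `toy_path` are stand-ins invented for illustration and are not part of the lab.

```python
import os
import tempfile

# Toy stand-in for dataset["train"]: a list of dicts with a "text" field.
toy_examples = [{"text": "first line"}, {"text": ""}, {"text": "second line"}]

# Hypothetical temporary path used only for this illustration.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_train.txt")

# Same pattern as the lab: one example per line, UTF-8 encoded.
with open(toy_path, "w", encoding="utf-8") as f:
    for example in toy_examples:
        f.write(example["text"] + "\n")

# Read the file back and strip the trailing newline from each line.
with open(toy_path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Empty lines carry no text, so they can be filtered out before training.
non_empty = [line for line in lines if line.strip()]

print(len(lines), len(non_empty))
```

Reading the file back this way makes it easy to spot empty examples, which may be worth filtering out before tokenization.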

You need to define a tokenizer to be used for tokenizing the dataset.

In [None]:
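from transformers import AutoModelForCausalLM, AutoTokenizer, BertTokenizerFast

# Create a tokenizer from an existing one to reuse its special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model_name = "bert-base-uncased"
# is_decoder is a model configuration option, so it is passed to the model rather than the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)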

# Training a Tokenizer (Optional)
This is especially helpful when applying transformers to specialized domains such as medicine, where the vocabulary differs from the general-purpose text that off-the-shelf tokenizers were trained on. You can skip this step if you do not want to train the tokenizer on your own data.

## Part 1: Create a Python Generator to Dynamically Load the Data
The code below is annotated with the markers #1, #2, and #3, which are explained in turn.

1- def batch_iterator(batch_size=10000):

Defines a function named batch_iterator that serves as a Python generator.
It accepts a parameter batch_size, which specifies the number of examples to be processed in each batch. The default value is 10000.


In [None]:
from tqdm import tqdm

## create a python generator to dynamically load the data
#1
def batch_iterator(batch_size=10000):
    #2
    # Iterate over the train split; len(dataset) would count the splits, not the examples
    for i in tqdm(range(0, len(dataset["train"]), batch_size)):
        #3
        yield dataset["train"][i : i + batch_size]["text"]


2- for i in tqdm(range(0, len(dataset["train"]), batch_size)):

Uses a for loop to iterate through the training split in increments of batch_size.

range(0, len(dataset["train"]), batch_size) generates indices from 0 up to the number of training examples, stepping by batch_size. For example, if the split has 50,000 examples and batch_size is 10,000, the loop iterates over indices [0, 10000, 20000, 30000, 40000]. Note that len(dataset) on its own returns the number of splits in the DatasetDict, not the number of examples.

tqdm adds a progress bar to visualize the loop's progress.

3- yield dataset["train"][i : i + batch_size]["text"]

The yield keyword makes the function a generator, producing one batch of data at a time rather than loading all the data into memory at once.

Fetches a slice of the training dataset (dataset["train"][i : i + batch_size]), containing up to batch_size examples starting at the current position i.

Extracts only the "text" field from the sliced examples.
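
To make the slicing behavior concrete, here is a minimal sketch of the same generator pattern on a plain Python list. The names `toy_batch_iterator` and `toy_texts` are hypothetical and used only for illustration; the lab's generator reads from the Hugging Face dataset instead.

```python
def toy_batch_iterator(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        # Slicing past the end of a list is safe: the last batch is just shorter.
        yield texts[start : start + batch_size]


# Toy stand-in for the "text" column of the training split.
toy_texts = [f"sentence {i}" for i in range(25)]

batches = list(toy_batch_iterator(toy_texts, batch_size=10))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With 25 items and a batch size of 10, the generator yields batches of 10, 10, and 5 items, so no example is dropped when the dataset size is not a multiple of the batch size.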

## Part 2: Create a Tokenizer

In [None]:
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## Part 3: Train the Tokenizer Using the Dataset
`train_new_from_iterator` trains a new tokenizer with the same tokenization pipeline and special tokens as the existing one, learning a new vocabulary from the dataset.

It uses an iterator (in this case, the generator function batch_iterator) to supply text data for training.

In [None]:
bert_tokenizer = bert_tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=30522)


text_iterator=batch_iterator()

Passes the generator batch_iterator() to the tokenizer training process. This ensures that text data is loaded dynamically in batches.

vocab_size=30522

Specifies the size of the vocabulary to build during training. In this case, the size matches the original BERT tokenizer vocabulary.

### Pretraining

In this step, we define the configuration of the BERT model and create the model:
#### Define the BERT Configuration
Here, we define the configuration settings for a BERT model using `BertConfig`. This includes setting various parameters related to the model's architecture:
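- **vocab_size=30522**: Specifies the size of the vocabulary. This number should match the vocabulary size used by the tokenizer, so the code below reads it from the tokenizer with `len(bert_tokenizer.get_vocab())` instead of hard-coding it.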
- **hidden_size=768**: Sets the size of the hidden layers.
- **num_hidden_layers=12**: Determines the number of hidden layers in the transformer model.
- **num_attention_heads=12**: Sets the number of attention heads in each attention layer.
- **intermediate_size=3072**: Specifies the size of the "intermediate" (i.e., feed-forward) layer within the transformer.
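
To get a feel for how these settings determine model size, the sketch below estimates the number of parameters in the embeddings and transformer layers. It is a back-of-the-envelope calculation, not a call into the library: the function name is hypothetical, `max_position_embeddings=512` and `type_vocab_size=2` are assumed from the `BertConfig` defaults because the lab does not override them, and the pooler and MLM head are left out.

```python
def estimate_bert_encoder_params(
    vocab_size,
    hidden_size,
    num_hidden_layers,
    num_attention_heads,
    intermediate_size,
    max_position_embeddings=512,  # assumed BertConfig default
    type_vocab_size=2,            # assumed BertConfig default
):
    """Approximate parameter count of BERT embeddings plus encoder layers."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")

    layer_norm = 2 * hidden_size  # scale and bias

    # Word, position, and token-type embeddings followed by a LayerNorm.
    embeddings = (
        vocab_size + max_position_embeddings + type_vocab_size
    ) * hidden_size + layer_norm

    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Two feed-forward projections (weights + biases), then LayerNorm.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )

    return embeddings + num_hidden_layers * (attention + feed_forward)


total = estimate_bert_encoder_params(30522, 768, 12, 12, 3072)
head_dim = 768 // 12  # size of each attention head
print(f"{total:,} parameters, head dimension {head_dim}")
```

Under these assumptions the estimate comes to 108,891,648 parameters with 64-dimensional attention heads. Note that `num_attention_heads` changes how the hidden dimension is split but not the parameter count, and that the real count in the lab depends on the vocabulary size the trained tokenizer actually ends up with.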


In [None]:
from transformers import BertConfig, BertForMaskedLM

# Define the BERT configuration
config = BertConfig(
    vocab_size=len(bert_tokenizer.get_vocab()),  # Must equal the tokenizer's vocabulary size
    hidden_size=768,  # Set the hidden size
    num_hidden_layers=12,  # Set the number of layers
    num_attention_heads=12,  # Set the number of attention heads
    intermediate_size=3072,  # Set the intermediate size
)

Now we create the BERT model for pre-training and inspect its architecture:


In [None]:
model = BertForMaskedLM(config)
model

### Tokenize Dataset Dynamically

In this section, we tokenize the dataset dynamically with `dataset.map`.
This approach provides greater flexibility and integrates well with modern NLP workflows.

#### **Tokenization Function**
The `tokenize_function` is used to preprocess the text data by tokenizing and formatting it for model training.

#### What This Code Does as a Whole
Defines a function to tokenize text examples using the BERT tokenizer.

Applies the tokenizer to the training and testing datasets, dynamically processing data in batches.

Removes the original raw text to save space and focuses on tokenized inputs.

Splits the processed dataset into train and test subsets, preparing it for the next stages of model training and evaluation.

This process ensures that your dataset is formatted and ready for input into a model like BERT.


In [None]:
# Tokenize dataset dynamically
def tokenize_function(examples):
    return bert_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Tokenize train and test datasets
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Print tokenized dataset sample
print(tokenized_datasets["train"][0])

# Split into training and test sets
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]
# Examine one sample: the token IDs are padded or truncated to the block size (512)
train_dataset[0]

bert_tokenizer(examples["text"], ...):

- Applies the bert_tokenizer to the "text" field of the input examples.
- truncation=True: Ensures that text longer than max_length (512 tokens) is truncated.
- padding="max_length": Pads shorter sequences to the maximum length (max_length=512) for consistent input size.
- max_length=512: Limits each input sequence to a maximum of 512 tokens.

dataset.map(tokenize_function, ...):

- Maps (applies) the tokenize_function to each example in every split of the dataset.
- batched=True: Examples are processed in batches rather than one at a time, which is more efficient.
- remove_columns=["text"]: Removes the original "text" field from the dataset after tokenization, as it is no longer needed.
- Output: Produces a new dataset (tokenized_datasets) where each example contains tokenized inputs instead of raw text.

tokenized_datasets["train"][0]:

- Accesses the first example from the training dataset.
- Shows the tokenized data (input_ids, token_type_ids, and attention_mask) instead of raw text.

train_dataset and test_dataset:

- The last two assignments pull the training and test splits out of the tokenized dataset.
- train_dataset stores the tokenized training data, which is used to train the model.
- test_dataset stores the tokenized test data, which is used to evaluate the model's performance.
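
The effect of `truncation=True` and `padding="max_length"` can be illustrated without the tokenizer itself. The sketch below is a simplified stand-in, not the library's implementation: the helper name `pad_or_truncate` is hypothetical, the padding ID of 0 is an assumption, and unlike the real tokenizer it does not preserve the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    # Truncation: keep at most max_length tokens.
    ids = list(token_ids[:max_length])
    # Real tokens get attention 1, padding positions get attention 0.
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


# A short sequence is padded; the IDs are arbitrary illustrative values.
short = pad_or_truncate([101, 7592, 2088, 102], max_length=6)

# A long sequence is cut down to max_length.
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Every output has exactly `max_length` entries, which is what allows examples to be stacked into a single tensor. The attention mask tells the model which positions are padding and should be ignored.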

Then, we prepare data for the MLM task (masking random tokens):
### Define the Data Collator for Language Modeling
This line of code sets up a `DataCollatorForLanguageModeling` from the Hugging Face Transformers library. A data collator is used during training to dynamically create batches of data. For language modeling, particularly for models like BERT that use masked language modeling (MLM), this collator prepares training batches by automatically masking tokens according to a specified probability. Here are the details of the parameters used:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used with the data collator. The `bert_tokenizer` is responsible for tokenizing the text and converting it to the format expected by the model.
- **mlm=True**: Indicates that the data collator should mask tokens for masked language modeling training. This parameter being set to `True` configures the collator to randomly mask some of the tokens in the input data, which the model will then attempt to predict.
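- **mlm_probability=0.15**: Sets the probability with which tokens are selected for masking. A probability of 0.15 means that, on average, 15% of the tokens in any sequence are selected. By default, the collator replaces 80% of the selected tokens with the mask token, replaces 10% with a random token, and leaves 10% unchanged.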
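
The masking step can be sketched in plain Python to show what the collator produces. This is an illustrative approximation of BERT-style masking (select tokens with probability `mlm_probability`, then apply the 80/10/10 rule), not the library's code: the function `mask_tokens`, the token IDs, and the fixed seed are all assumptions made for the example.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (masked_input_ids, labels) following BERT-style masking."""
    rng = random.Random(seed)
    masked = list(input_ids)
    # -100 marks positions that the loss function should ignore.
    labels = [-100] * len(input_ids)
    for pos, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected at random.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Illustrative IDs: 101 and 102 play the role of [CLS] and [SEP], 103 of [MASK].
toy_ids = [101] + list(range(1000, 1030)) + [102]
masked_ids, labels = mask_tokens(
    toy_ids, mask_id=103, vocab_size=30522, special_ids={0, 101, 102}
)
print(masked_ids)
print(labels)
```

Positions with a label of -100 are left untouched and do not contribute to the loss, so the model is trained only on the selected tokens.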

In [None]:
from transformers import DataCollatorForLanguageModeling

# Prepare the data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)
# check how collator transforms a sample input data record
data_collator([train_dataset[0]])

Now, we train the BERT model using the `Trainer` class. (For a complete list of training arguments, check [here](https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/trainer#transformers.TrainingArguments)). The training cell below is wrapped in a triple-quoted string, so it does not run by default; remove the quotes if you want to train the model yourself.
This section configures the training process by specifying various parameters that control how the model is trained, evaluated, and saved:

- **output_dir="./trained_model"**: Specifies the directory where the trained model and other output files will be saved.
- **overwrite_output_dir=True**: If set to `True`, this will overwrite the contents of the output directory if it already exists. This is useful when running experiments multiple times.
- **do_eval=True**: Enables evaluation of the model. If `True`, the model will be evaluated at the specified intervals.
- **evaluation_strategy="epoch"**: Defines when the model should be evaluated. Setting this to "epoch" means the model will be evaluated at the end of each epoch.
- **learning_rate=5e-5**: Sets the learning rate for training the model. This is a typical learning rate for fine-tuning BERT-like models.
- **num_train_epochs=10**: Specifies the number of training epochs. Each epoch involves a full pass over the training data.
- **per_device_train_batch_size=2**: Sets the batch size for training on each device. This should be set based on the memory capacity of your hardware.
- **save_total_limit=2**: Limits the total number of model checkpoints to be saved. Only the most recent two checkpoints will be kept.
- **logging_steps=20**: Determines how often to log training information, which can help monitor the training process.
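
These settings translate directly into the length of the training run. The sketch below works through the arithmetic for the 1,000-example training subset used in this lab. It assumes a single device and no gradient accumulation, and the helper `training_schedule` is a hypothetical name, not part of the Transformers API.

```python
import math


def training_schedule(num_examples, per_device_batch_size, num_epochs,
                      logging_steps, num_devices=1):
    """Estimate optimizer steps and log events for a simple training run."""
    # Assumes no gradient accumulation: one optimizer step per batch.
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "evaluations": num_epochs,  # evaluation_strategy="epoch"
    }


schedule = training_schedule(
    num_examples=1000, per_device_batch_size=2, num_epochs=10, logging_steps=20
)
print(schedule)
```

Under these assumptions, a batch size of 2 gives 500 steps per epoch and 5,000 steps in total, with a log entry every 20 steps (250 entries) and one evaluation per epoch.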

In [None]:
'''# Define the training arguments
training_args = TrainingArguments(
    output_dir="./trained_model",  # Specify the output directory for the trained model
    overwrite_output_dir=True,
    do_eval=True,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=10,  # Specify the number of training epochs
    per_device_train_batch_size=2,  # Set the batch size for training
    save_total_limit=2,  # Limit the total number of saved checkpoints
    logging_steps=20,  # Log training metrics every 20 steps
)

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Start the pre-training
trainer.train()'''

## Evaluating Model Performance

Let's check the performance of the trained model. Perplexity is commonly used to compare different language models or different configurations of the same model.
After training, perplexity can be calculated on a held-out evaluation dataset to assess the model's performance. The perplexity is calculated by feeding the evaluation dataset through the model and comparing the predicted probabilities of the target tokens with the actual token values that are masked.

A lower perplexity score indicates that the model has a better understanding of the language and is more effective at predicting the masked tokens. It suggests that the model has learned useful representations and can generalize well to unseen data.
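
Perplexity is the exponential of the average negative log-likelihood of the target tokens, which is why the lab computes it as `math.exp` of the evaluation loss. The sketch below shows the calculation on hand-picked probabilities; the function name and the probability values are made up for illustration.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities the model assigned to the true tokens."""
    # Cross-entropy loss is the mean negative log-probability of the targets.
    negative_log_likelihoods = [-math.log(p) for p in target_probs]
    mean_loss = sum(negative_log_likelihoods) / len(negative_log_likelihoods)
    return math.exp(mean_loss)


# A model that is always 25% sure of the right token behaves like a
# uniform guess among four options.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])

# Higher probabilities on the true tokens give lower perplexity.
confident = perplexity([0.9, 0.8, 0.95])

print(f"{uniform_guess:.2f} {confident:.2f}")
```

A perplexity of 4 can be read as the model being as uncertain as if it were choosing uniformly among four tokens, while a perfect model would reach the minimum value of 1.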

In [None]:
'''import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")'''

## Loading the saved model


In [None]:
import torch

!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BeXRxFT2EyQAmBHvxVaMYQ/bert-scratch-model.pt'
# Resize the embeddings to 30522 tokens before loading the saved weights on CPU
model.resize_token_embeddings(30522)
model.load_state_dict(torch.load('bert-scratch-model.pt', map_location=torch.device('cpu')))

The simplest way to try out the model for inference is to use it in a pipeline(). Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:

In [None]:
# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create a pipeline for the "fill-mask" task
mask_filler = pipeline("fill-mask", model=model,tokenizer=bert_tokenizer)

# Generate predictions by filling the mask in the input text
results = mask_filler(text)  # pass top_k to control how many predictions are returned

# Print the predicted sequences
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

You can see that [MASK] is replaced by the most frequent token. This weak performance can be due to insufficient training, lack of training data, model architecture, or not tuning hyperparameters. Let's try a pretrained model from Hugging Face:


## Inferencing a pretrained BERT model

In [None]:
# Load the pretrained BERT model and tokenizer
pretrained_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
pretrained_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create the pipeline
mask_filler = pipeline(task='fill-mask', model=pretrained_model,tokenizer=pretrained_tokenizer)

# Perform inference using the pipeline
results = mask_filler(text)
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

This pretrained model performs much better than the model you just trained for a few epochs on a single small dataset. Still, pretrained models cannot be used directly for specific tasks, such as sentiment extraction or sequence classification. This is why supervised fine-tuning methods are introduced.
