### **BERT (Bidirectional Encoder Representations from Transformers)**

- https://www.analyticsvidhya.com/blog/2022/09/fine-tuning-bert-with-masked-language-modeling/
- https://docs.google.com/document/d/1eV8EqaFCxh3u7Wluk2X5e05qV-xtDj4Smq2DEqIfs3o/edit

<div>
<img src = "nlp_images/BERT.png" width = "600">
</div>

**Input Embeddings:**
- **Token Embeddings**: These are representations of the individual words or tokens that the model processes. Each word in a sentence is converted into a vector that captures its meaning.
- **Segment Embeddings**: BERT can take two sentences as inputs at once (for tasks like question answering where you have a question and a context), and segment embeddings help the model to distinguish between the two different sentences.
- **Position Embeddings**: Since BERT needs to understand the order of words in a sentence, position embeddings are added to indicate the position of each word within the sentence.
- The embeddings for each word are summed together to form the input for the encoders.
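
As a minimal sketch of the summation step, the snippet below uses tiny hand-written lookup tables in place of BERT's learned embedding matrices. The table values, the 3-dimensional vector size, and the helper name `embed` are illustrative assumptions, not BERT's actual weights or API.

```python
# Toy lookup tables (made-up values); real BERT learns these during pre-training.
token_emb = {
    "[CLS]": [0.1, 0.0, 0.2],
    "my": [0.3, 0.1, 0.0],
    "dog": [0.5, 0.2, 0.1],
    "[SEP]": [0.0, 0.4, 0.1],
}
segment_emb = {0: [0.01, 0.01, 0.01], 1: [0.02, 0.02, 0.02]}
position_emb = [[0.0, 0.0, 0.1], [0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.1, 0.1, 0.0]]

def embed(tokens, segment_ids):
    """Return one input vector per token: token + segment + position embedding."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        summed = [t + s + p for t, s, p in
                  zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        vectors.append(summed)
    return vectors

inputs = embed(["[CLS]", "my", "dog", "[SEP]"], [0, 0, 0, 0])
```

Each entry of `inputs` is the element-wise sum of three vectors, which is what the encoder stack would receive for that position.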

**Encoders:**
- BERT uses multiple encoder layers (12 in this case, as is standard for BERT-base). Each layer is composed of self-attention mechanisms and fully connected neural network layers. These layers process the input embeddings, refining and transforming the information at each step to capture the context and relationships between words.

**Masked Language Modeling Head:**
- At the top, there is a "Masked Language Modeling Head." This is used during pre-training, where the model tries to predict masked words in a sentence. For example, in the sentence "My dog is [MASK] cute," BERT tries to predict what the masked word is based on the context provided by the other words.

**Logits:**
- The output of the model is a set of logits: raw, unnormalized scores for each position in the sentence. For example, when BERT predicts a masked word, the logits at that position contain one score per vocabulary entry; applying a softmax turns them into probabilities for each candidate word.
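
To make the relationship between logits and probabilities concrete, here is a small sketch that applies a softmax to the logits of a single masked position. The four-word vocabulary and the logit values are made up for illustration; a real BERT vocabulary has roughly 30,000 entries.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for one [MASK] position.
vocab = ["very", "so", "not", "cat"]
logits = [2.0, 1.5, 0.3, -1.0]

probs = softmax(logits)
best = vocab[probs.index(max(probs))]  # the highest-scoring candidate
```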

Overall, this architecture allows BERT to generate predictions for tasks like filling in the blanks in sentences, understanding the sentiment of a review, answering questions, and many other natural language processing tasks.


Pre-training the BERT model is essential because its primary goal is transfer learning, which involves leveraging knowledge gained from one problem to tackle another. In the original research paper, the BERT model is pre-trained using two self-supervised learning tasks, described as follows:

Masked Language Model (MLM): During this pre-training task, some of the input tokens are randomly replaced with a [MASK] token. The model is then trained to predict the original tokens using the context provided by the unmasked surrounding tokens. The original paper selects 15% of the input tokens at random for prediction; of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
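
The masking procedure can be sketched in a few lines. This is a simplified illustration on whitespace-split words rather than WordPiece tokens; it follows the 15% selection rate and the 80/10/10 replacement rule from the original BERT paper, and the function name `mask_tokens` is a hypothetical helper, not a library call.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)                      # not selected: ignored by the loss
            corrupted.append(tok)
    return corrupted, labels

tokens = "my dog is very cute and he likes to play in the park".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens)
```

Only the positions whose label is not `None` contribute to the MLM loss.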

Next Sentence Prediction (NSP): This is a binary classification task where the output corresponding to the [CLS] token is used. The goal is to predict whether the second sentence in a pair is the actual subsequent sentence in the original text. The training pairs are constructed automatically from the corpus so that 50% of the time the actual next sentence is provided with a target label of 1, and 50% of the time a random sentence is used with a target label of 0.


![image.png](attachment:image.png)

### **MLM**

- MLM stands for "Masked Language Modeling," a pre-training objective used in natural language processing (NLP). It randomly masks words (typically with a [MASK] token) in the text and trains the model to predict these masked words from the context provided by the surrounding words.

- Models like BERT (Bidirectional Encoder Representations from Transformers) utilize this method during their pre-training phase. MLM allows the model to consider the full context of a sentence—looking at both left and right context—which helps to enhance its understanding of language. Through this process, the model becomes better at discerning the relationships between words and their meanings within sentences.

- The learning approach of MLM helps the model to reflect real-life language usage more accurately, developing a generalized language understanding capability that can be applied to a wide array of more complex NLP tasks.


### Comparing Masked Language Modeling (MLM) vs. Causal Language Modeling (CLM)

1. Masked Language Modeling (MLM) - BERT
- Proposed by: Google
- How it works:
    - Some words in a sentence are hidden (masked).
    - The model is trained to predict these hidden words.
- Advantages:
    - Tries to understand the entire sentence.
    - Uses the information from the whole sentence to understand relationships.
- Disadvantages:
    - The loss is computed only on the masked tokens (about 15% of the input), so each training pass learns from a fraction of the text.


2. Causal Language Modeling (CLM) - GPT
- Proposed by: OpenAI
- How it works:
    - Predicts the next word in a sentence.
    - Uses only the current and previous words without looking at future words.
- Advantages:
    - Learns to predict the next word in order, following the sequence of the sentence.
- Disadvantages:
    - Each position sees only its left context, so the model cannot use the words that follow.

Limitations of both approaches:
- MLM: Understands the entire sentence but only uses parts of it for training.
- CLM: Predicts words in order but only sees the left context.

Usage:
- Both models need to be modified and fine-tuned for specific tasks.
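
One way to see the difference between the two objectives is through the attention mask each one implies. The sketch below builds both masks as plain nested lists; the function names are illustrative and not taken from any library.

```python
def bidirectional_mask(n):
    """MLM-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """CLM-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Row i lists which positions token i is allowed to look at.
for row in causal_mask(4):
    print(row)
```

The causal mask is lower-triangular, which is what prevents a GPT-style model from looking at future words.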

### **NSP**

- NSP stands for "Next Sentence Prediction." It's a training objective used in the pre-training of models like BERT (Bidirectional Encoder Representations from Transformers). The idea behind NSP is to help the model understand the relationship between two sentences, which is crucial for many natural language processing tasks, such as question answering and natural language inference.

- In NSP, the model is given pairs of sentences and must predict whether the second sentence logically follows the first one in the original document. During training, this is achieved by presenting the model with two types of sentence pairs:
    - Positive pairs, where the second sentence is the actual next sentence that follows the first sentence in the original text. These are labeled as "IsNext."
    - Negative pairs, where the second sentence is randomly chosen from the corpus and does not follow the first sentence. These are labeled as "NotNext."

- The model is then trained to correctly classify these pairs as either "IsNext" or "NotNext." This training helps the model learn the coherence and relationship between sentences, improving its understanding of context and how sentences are related within a text.
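
The pair construction can be sketched as follows. This is a simplified illustration that draws negatives from the same small list of sentences, whereas the description above draws them at random from the corpus; `make_nsp_pairs` is a hypothetical helper name.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from an ordered list of sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive pair: the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Negative pair: any sentence other than the current one and its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs

doc = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Then he walked home.",
    "Penguins are flightless birds.",
]
pairs = make_nsp_pairs(doc)
```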


- CLM: A method of predicting the next word within a sentence, primarily used for text generation.
- NSP: A method of predicting whether two sentences are connected, used for learning the relationship between sentences.

### **GPL**

- GPL, which stands for Generative Pseudo Labeling, is a semi-supervised learning technique. In semi-supervised learning, the goal is to leverage a large amount of unlabeled data in addition to a smaller set of labeled data to improve the performance of machine learning models. GPL specifically focuses on generating pseudo labels for the unlabeled data to achieve this goal.

- Here's how GPL generally works:
    - Initial Model Training: First, a model is trained on the available labeled data. This initial model is then used to make predictions on the unlabeled data.
    - Generating Pseudo Labels: The predictions made by the model on the unlabeled data are used to generate pseudo labels. These pseudo labels are determined based on a confidence score or threshold. For instance, if the model predicts a certain class for an unlabeled data point with high confidence (above a predefined threshold), that prediction can be considered a pseudo label.
    - Combining Data: The pseudo-labeled data is then combined with the originally labeled data. This expanded dataset, containing both real and pseudo labels, provides a larger set of training data.
    - Model Re-training: The model is re-trained on this combined dataset, allowing it to learn from both the originally labeled data and the new insights gained from the pseudo labels.
    - Iteration: This process can be iterated several times, with each iteration potentially improving the model's performance as it learns from an increasingly larger and diverse dataset.
- Similar to the Next Sentence Prediction (NSP) technique used in training models like BERT for understanding the relationship between sentences, GPL aims to enhance model learning by incorporating unlabeled data. However, instead of focusing on sentence relationships, GPL is more general-purpose and can be applied to various types of data and classification tasks. By using the model's own predictions as feedback, GPL effectively leverages unlabeled data to improve the model's accuracy and performance on tasks with limited labeled data.
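
The generic pseudo-labeling loop described in the steps above can be sketched with a deliberately simple model. The nearest-centroid classifier on 1-D points, the margin-based confidence score, and the threshold value are all assumptions chosen to keep the example short; they are not part of any specific GPL implementation.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean of each class."""
    centroids = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = sum(members) / len(members)
    return centroids

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is the distance margin to the runner-up class."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def pseudo_label(labeled_x, labeled_y, unlabeled_x, threshold=2.0, rounds=3):
    x, y, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(rounds):
        centroids = fit_centroids(x, y)          # (re-)train on the current labeled set
        remaining = []
        for u in pool:
            label, conf = predict_with_confidence(centroids, u)
            if conf >= threshold:                # keep only confident predictions
                x.append(u)
                y.append(label)
            else:
                remaining.append(u)
        pool = remaining
    return fit_centroids(x, y), pool

centroids, still_unlabeled = pseudo_label(
    labeled_x=[0.0, 10.0], labeled_y=["neg", "pos"],
    unlabeled_x=[1.0, 2.0, 9.0, 8.5, 5.2],
)
```

Points near a class centroid are absorbed into the training set, while the ambiguous point in the middle stays unlabeled because its confidence never reaches the threshold.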


### **RL (Reinforcement Learning)**

- Reinforcement Learning is a process where an agent learns by interacting with an environment. 

<div>
<img src = "nlp_images/RL.png" width = "400">
</div>

The diagram illustrates the basic Reinforcement Learning (RL) loop, in which an agent learns by interacting with an environment:
- **State**: This represents the current situation or configuration that the agent perceives in the environment. For example, in a game of chess, the arrangement of the chessboard would be the state.
- **Action**: This is a decision or move made by the agent. In chess, this would be a move like moving a pawn or a knight.
- **Reward**: After the agent takes an action, it receives feedback from the environment, which is called the reward. Positive rewards encourage the agent to repeat the action in similar future states, while negative rewards discourage it.
- **Next State**: This is the new situation the agent finds itself in after taking an action. In the chess example, this would be the new arrangement of the chessboard after the agent's move.

The goal of reinforcement learning is for the agent to learn a policy—a strategy for choosing actions—that maximizes the total reward over time. The agent undergoes a process of trial and error, exploring different actions and learning from the outcomes to determine the best course of action. This learning process involves balancing exploration (trying new actions) and exploitation (using known actions that have yielded high rewards). The agent aims to improve its policy through repeated interactions, adjusting its actions based on the rewards received to achieve the best possible results in the environment.
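
As a minimal sketch of this loop, the example below runs tabular Q-learning with an epsilon-greedy policy on a toy corridor. The environment (five states in a row, reward 1 for reaching the rightmost one), the hyperparameters, and all names are assumptions made for illustration; Q-learning is just one of many RL algorithms.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4; reaching state 4 ends the episode
LEFT, RIGHT = 0, 1

def step(state, action):
    """Environment: move along the corridor; reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: explore at random (or on ties), otherwise exploit the best-known action."""
    if rng.random() < epsilon or q_row[LEFT] == q_row[RIGHT]:
        return rng.randrange(2)
    return LEFT if q_row[LEFT] > q_row[RIGHT] else RIGHT

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = choose_action(q[state], epsilon, rng)
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
policy = ["left" if row[LEFT] > row[RIGHT] else "right" for row in q_table[:GOAL]]
```

The learned `policy` lists the preferred action in each non-terminal state; in this corridor the only way to collect reward is to keep moving right.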


#### **Summary**

**MLM (Masked Language Model):**
- Summary: A pre-training method where some tokens of input text are randomly masked, and the model predicts the masked tokens based on their context.
- Core Concept: Helps the model learn context-aware word representations.

**NSP (Next Sentence Prediction):**
- Summary: A pre-training method used in models like BERT where the model predicts if a second sentence logically follows a first sentence.
- Core Concept: Helps the model understand relationships between sentences.

**GPL (Generative Pseudo Labeling):**
- Summary: A semi-supervised learning technique that uses a model to generate labels for unlabeled data, which is then combined with labeled data for training.
- Core Concept: Leverages large amounts of unlabeled data effectively.

**RL (Reinforcement Learning):**
- Summary: A learning paradigm where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward.
- Core Concept: Balances exploration (trying new things) and exploitation (using what's known to work).


![image.png](attachment:image.png)

## Transformer Architecture
- The Transformer model fundamentally consists of two components: the Encoder and the Decoder. However, models like BERT, GPT, and T5 use this structure differently.

### BERT (Bidirectional Encoder Representations from Transformers)
- Architecture: Encoder only
- Reason: BERT processes all words in the input sentence simultaneously and learns the bidirectional context of each word. Therefore, it uses only the encoder to focus on understanding the overall context of the sentence.
- Usage: Mainly used for Natural Language Understanding (NLU) tasks such as sentence classification, question answering, and sentiment analysis.
- Features:
    - Masked Language Modeling (MLM): Trains the model by masking some words in the input sentence and predicting the masked words.
    - Next Sentence Prediction (NSP): Trains the model to predict whether the second sentence follows the first sentence.

### GPT (Generative Pre-trained Transformer)
- Architecture: Decoder only
- Reason: GPT aims to generate sentences sequentially, predicting the next word based on the words generated so far. It uses only the decoder, whose causal attention lets each position see past words but not future ones.
- Usage: Mainly used for Natural Language Generation (NLG) tasks such as dialogue generation and storytelling.
- Features:
    - Causal Language Modeling (CLM): Trains the model by predicting the next word using previous words.
    - Unidirectional: Processes sentences from left to right sequentially.

### T5 (Text-To-Text Transfer Transformer)
- Architecture: Both Encoder and Decoder
- Reason: T5 defines all NLP tasks as converting text input to text output. It uses an encoder-decoder structure to process the input sentence with the encoder and generate the desired output sentence with the decoder.
- Usage: Used for a variety of NLP tasks such as translation, summarization, and question answering.
- Features:
    - Text-to-Text Framework: Handles all inputs and outputs in text format.
    - Multi-task Learning: Designed to perform various tasks with a single model.

**Summary**
- BERT: Encoder only. Specializes in sentence understanding.
- GPT: Decoder only. Specializes in text generation.
- T5: Both encoder and decoder. Specializes in various text transformation tasks.

### **BERT (Bidirectional Encoder Representations from Transformers)**


In [1]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from transformers import DataCollatorForTokenClassification
from sklearn.model_selection import train_test_split
import datasets
import evaluate
from itertools import chain
import numpy as np



In [2]:
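import ast
import pandas as pd

# Label encoding: load the first 300 rows and parse each stringified tag list
df = pd.read_csv('TalkFile_ner_2.csv').iloc[:300, :]
df['Tag'] = df['Tag'].apply(ast.literal_eval)  # safer than eval for parsing list literals
list_all_tag = df.Tag.to_list()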
list_labels = ['O'] + sorted(set(chain.from_iterable(list_all_tag)) - {'O'})

label2ind = {label: idx for idx, label in enumerate(list_labels)}
ind2label = {idx: label for label, idx in label2ind.items()}
label2ind

{'O': 0,
 'B-art': 1,
 'B-eve': 2,
 'B-geo': 3,
 'B-gpe': 4,
 'B-nat': 5,
 'B-org': 6,
 'B-per': 7,
 'B-tim': 8,
 'I-art': 9,
 'I-eve': 10,
 'I-geo': 11,
 'I-gpe': 12,
 'I-nat': 13,
 'I-org': 14,
 'I-per': 15,
 'I-tim': 16}

In [3]:
#Sentence Tokenization & Label Encoding
data_dict = {
    'id': list(range(len(df))),
    'tokens': [sentence.split(' ') for sentence in df['Sentence']],
    'ner_tags': [list(map(label2ind.get, tags)) for tags in df['Tag']]
}
new_df = pd.DataFrame(data_dict)
new_df.head()

| | id | tokens | ner_tags |
|---|---|---|---|
| 0 | 0 | [Thousands, of, demonstrators, have, marched, ... | [0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, ... |
| 1 | 1 | [Families, of, soldiers, killed, in, the, conf... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | 2 | [They, marched, from, the, Houses, of, Parliam... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0] |
| 3 | 3 | [Police, put, the, number, of, marchers, at, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 4 | 4 | [The, protest, comes, on, the, eve, of, the, a... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 6, ... |


#### Model preparation and data preprocessing for token classification.
- Preparing a model architecture suitable for a token classification task.
- Freezing certain layers of the model as part of a transfer learning strategy.
- Tokenizing input data and aligning tokenized inputs with corresponding labels for training a model on a sequence labeling task like NER.

In [15]:
#Loading the Model
from transformers import AutoModelForTokenClassification

# Load the pre-trained DistilBERT model for token classification
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", 
    num_labels=len(label2ind.keys()),  # Specify the number of unique labels in your dataset
    id2label=ind2label,  # Map from label IDs to label names
    label2id=label2ind   # Map from label names to label IDs
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Freeze the embedding layer parameters (optional)
for name, param in model.named_parameters():
    if name.startswith("distilbert.embeddings"):
        param.requires_grad = False

In [6]:
#Subword Tokenization and Aligning Labels 
from transformers import DistilBertTokenizerFast

# Initialize the tokenizer: 
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert/distilbert-base-uncased')
def tokenize_and_align_labels(examples):
    #split words into subword tokens: e.g., "walking" -> "walk", "##ing"
    tokenized_inputs = tokenizer(
        examples["tokens"], 
        truncation=True,  # Ensure input sequences fit within the model's length limits
        is_split_into_words=True
    )
    
    # Align the word-level NER labels with the subword tokens the model actually sees.
    # word_ids() maps each token back to the index of the original word it came from
    # (None for special tokens such as [CLS] and [SEP], which get the ignore label -100).
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            label_id = label[word_idx] if word_idx is not None else -100
            label_ids.append(label_id)
        
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

- **tokenizer**: the attention mask, and any padding tokens, are created inside the tokenizer(...) call.
    - The DistilBERT tokenizer (distilbert/distilbert-base-uncased) uses the **WordPiece** tokenization algorithm.
    - *Tokenization*: When you call tokenizer(...), it lowercases the text (the model is uncased) and splits the input sequences into subwords based on the DistilBERT vocabulary.
    - *Truncation/Padding Logic*:
        - truncation=True tells the tokenizer to cut off sequences longer than the model's maximum input length (or the max_length argument, if given).
        - Padding is only added when the padding argument is set (for example padding="max_length"). The call above does not set it, so sequences keep their own lengths and are padded later, per batch, by the data collator.
    - *Attention Mask Generation*: The tokenizer creates the attention_mask. It marks positions corresponding to real tokens with '1' and padding tokens with '0'.
    - Example
        - Input Sentence: "I love eating pizza in New York City."
        - Pre-trained Model: 'distilbert/distilbert-base-uncased'
        - Max Length: 12, with padding="max_length"
        - Possible Tokenized Output (illustrative, assuming each word maps to a single token; ID_for_x stands for that token's vocabulary ID)
        - {
            "input_ids": [101, ID_for_i, ID_for_love, ID_for_eating, ID_for_pizza, ID_for_in, ID_for_new, ID_for_york, ID_for_city, ID_for_., 102, 0],
            "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
           }
            - 101: The [CLS] token, marking the start of the input.
            - ID_for_i ... ID_for_.: Numerical IDs for the lowercased tokens "i", "love", "eating", "pizza", "in", "new", "york", "city", and ".".
            - 102: The [SEP] token, marking the end of the sentence.
            - 0: The [PAD] token, added to reach the maximum length of 12.
            - attention_mask: The '0' values tell the model to ignore those positions during subsequent computations, as they don't contain meaningful information.
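
The padding and attention-mask logic can be sketched without the library. The helper below is a simplified stand-in, not the tokenizer's actual implementation: the name `pad_batch` is hypothetical, the token IDs 7, 8, and 9 are made up, and the naive truncation simply cuts the tail, whereas a real tokenizer keeps the closing [SEP] token.

```python
def pad_batch(batch_input_ids, pad_id=0, max_length=None):
    """Pad every sequence to the same length and build the matching attention masks."""
    target = max_length or max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        ids = ids[:target]                       # naive truncation if too long
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 = [CLS], 102 = [SEP], 0 = [PAD]; 7, 8, 9 are placeholder token IDs.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
```

The shorter sequence receives one [PAD] token, and its attention mask ends in a 0 at that position.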


- Sentence: "OpenAI develops innovative AI solutions."
- Words and labels: ["OpenAI", "develops", "innovative", "AI", "solutions."] with ["B-ORG", "O", "O", "O", "O"]
- Label IDs used in this example only: "B-ORG" = 0, "O" = 1 (the notebook's own label2ind maps "O" to 0)
- Tokenization (illustrative): ["[CLS]", "Open", "##AI", "develops", "innovative", "AI", "solutions", ".", "[SEP]"]
- Using word_ids: [None, 0, 0, 1, 2, 3, 4, 4, None]
- tokenized_inputs
    - { "input_ids": [101, ID_for_Open, ID_for_##AI, ID_for_develops, ID_for_innovative, ID_for_AI, ID_for_solutions, ID_for_., 102],
    -   "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1],
    -   "labels": [-100, 0, 0, 1, 1, 1, 1, 1, -100] }
    - -100 for [CLS] and [SEP]
    - 0 for "Open" and "##AI", indicating "B-ORG"
    - 1 for "develops", "innovative", "AI", "solutions", and ".", indicating "O"
- *word_ids* is a method of the encoding returned by a fast tokenizer; it maps each token back to the index of the word of the original input it was derived from.
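
The alignment step itself needs nothing from the library once the word indices are known. A minimal sketch, using the word_ids from the example above, one label per word (five words), and the toy mapping 0 = "B-ORG", 1 = "O"; `align_labels` is a hypothetical helper that mirrors the inner loop of tokenize_and_align_labels.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every subword the label of its word; special tokens get ignore_index."""
    return [ignore_index if w is None else word_labels[w] for w in word_ids]

word_ids = [None, 0, 0, 1, 2, 3, 4, 4, None]   # None marks [CLS] and [SEP]
word_labels = [0, 1, 1, 1, 1]                  # toy mapping: 0 = "B-ORG", 1 = "O"

aligned = align_labels(word_ids, word_labels)
```

Positions labeled -100 are skipped by the loss, so the special tokens never influence training.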

In [7]:
from sklearn.model_selection import train_test_split
import datasets #a powerful tool in the Hugging Face ecosystem, designed to handle datasets in a streamlined way for NLP tasks.

train_df,test_df = train_test_split(new_df,test_size=0.2,random_state=42)
dataset_dict = datasets.DatasetDict() #creates an empty DatasetDict object named dataset_dict
#These lines take your train and test dataframes (train_df and test_df), convert them into Dataset objects
dataset_dict['train'] = datasets.Dataset.from_pandas(train_df)
dataset_dict['test'] = datasets.Dataset.from_pandas(test_df)
tokenized_dataset = dataset_dict.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/240 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 240/240 [00:00<00:00, 9600.15 examples/s]
Map: 100%|██████████| 60/60 [00:00<00:00, 7964.37 examples/s]


Why Use a DatasetDict?

- Organization: A DatasetDict lets you keep your training, validation, and testing datasets separate and easily accessible by their respective names (keys).
- Convenience: The datasets library provides lots of methods for working with DatasetDict objects – loading, saving, processing, and more.
- Trainer Compatibility: Each split of a DatasetDict is a Dataset that can be passed directly to Hugging Face's Trainer (for example as train_dataset and eval_dataset).

In [8]:
# Evaluation setup for token classification (NER)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)  # pads inputs and labels per batch
seqeval = evaluate.load("seqeval")  # entity-level precision/recall/F1

# Sanity check: decode the word-level tags of the first training example
example = tokenized_dataset['train'][0]
labels = [ind2label[i] for i in example["ner_tags"]]

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [[ind2label[p] for p, l in zip(pred, lab) if l != -100] 
                        for pred, lab in zip(predictions, labels)]
    true_labels = [[ind2label[l] for p, l in zip(pred, lab) if l != -100] 
                       for pred, lab in zip(predictions, labels)]
    
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

- Why DataCollatorForTokenClassification is still needed after creating tokenized_dataset
    - tokenize_and_align_labels prepares each example individually, aligning tokens with their labels. DataCollatorForTokenClassification works at the batch level: it pads input_ids and attention_mask to the longest sequence in the batch and pads the labels with -100 so that padded positions are ignored by the loss.
- "seqeval" library evaluates sequence labeling tasks. This library provides functions to calculate metrics like precision, recall, and F1 score for your model's predictions compared to the ground truth labels. load("seqeval") specifically loads the seqeval metric from the evaluate module. 
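
To show what entity-level scoring means, here is a simplified sketch that extracts entity spans from BIO tag sequences and computes precision, recall, and F1 over exact span matches. It is an illustration of the idea only, not seqeval's implementation: for example, it ignores an I- tag that does not continue a matching B- tag, and the function names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                entities.add((etype, start, i - 1)) # close the entity that just ended
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]           # open a new entity
    return entities

def entity_f1(true_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entity spans."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-per", "I-per", "O", "B-geo"]
pred = ["B-per", "I-per", "O", "O"]
precision, recall, f1 = entity_f1(gold, pred)
```

Here the prediction finds the person entity but misses the location, so precision is perfect while recall is halved. Token-level accuracy would look much higher, which is why the accuracy figures in NER runs tend to exceed the F1 score.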

In [10]:
#Model training and evaluation & Fine-tuning
training_args = TrainingArguments(
    output_dir=".", #Where to save model checkpoints and results.
    learning_rate=2e-5,
    per_device_train_batch_size=16, #Batch sizes for training 
    per_device_eval_batch_size=16, #Batch sizes for evaluation
    num_train_epochs=7, #Number of passes through the entire training dataset.
    weight_decay=0.01, #A regularization technique to prevent overfitting.
    evaluation_strategy="epoch", #Evaluate the model at the end of each epoch.
    save_strategy="epoch", #Save model checkpoints at the end of each epoch.
    load_best_model_at_end=True, #Automatically load the best-performing model.
    push_to_hub=False, #Disables pushing the model to the Hugging Face Hub.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator, #The object that prepares data batches.
    compute_metrics=compute_metrics,
)

trainer.train()

{'eval_loss': 0.2975486218929291, 'eval_precision': 0.6284403669724771, 'eval_recall': 0.6715686274509803, 'eval_f1': 0.6492890995260663, 'eval_accuracy': 0.9277638190954773, 'eval_runtime': 0.51, 'eval_samples_per_second': 117.637, 'eval_steps_per_second': 7.842, 'epoch': 1.0}

{'eval_loss': 0.2400401532649994, 'eval_precision': 0.7155963302752294, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.7393364928909951, 'eval_accuracy': 0.9428391959798995, 'eval_runtime': 0.5189, 'eval_samples_per_second': 115.628, 'eval_steps_per_second': 7.709, 'epoch': 2.0}

{'eval_loss': 0.20450906455516815, 'eval_precision': 0.7546296296296297, 'eval_recall': 0.7990196078431373, 'eval_f1': 0.7761904761904763, 'eval_accuracy': 0.9528894472361809, 'eval_runtime': 0.6231, 'eval_samples_per_second': 96.299, 'eval_steps_per_second': 6.42, 'epoch': 3.0}

{'eval_loss': 0.18730327486991882, 'eval_precision': 0.75, 'eval_recall': 0.8088235294117647, 'eval_f1': 0.7783018867924529, 'eval_accuracy': 0.9547738693467337, 'eval_runtime': 0.6318, 'eval_samples_per_second': 94.967, 'eval_steps_per_second': 6.331, 'epoch': 4.0}

{'eval_loss': 0.17691276967525482, 'eval_precision': 0.7752293577981652, 'eval_recall': 0.8284313725490197, 'eval_f1': 0.8009478672985783, 'eval_accuracy': 0.957286432160804, 'eval_runtime': 0.6329, 'eval_samples_per_second': 94.805, 'eval_steps_per_second': 6.32, 'epoch': 5.0}

{'eval_loss': 0.17152731120586395, 'eval_precision': 0.7727272727272727, 'eval_recall': 0.8333333333333334, 'eval_f1': 0.8018867924528302, 'eval_accuracy': 0.957286432160804, 'eval_runtime': 0.6146, 'eval_samples_per_second': 97.619, 'eval_steps_per_second': 6.508, 'epoch': 6.0}

{'eval_loss': 0.17073394358158112, 'eval_precision': 0.7612612612612613, 'eval_recall': 0.8284313725490197, 'eval_f1': 0.7934272300469484, 'eval_accuracy': 0.9560301507537688, 'eval_runtime': 0.6318, 'eval_samples_per_second': 94.973, 'eval_steps_per_second': 6.332, 'epoch': 7.0}

TrainOutput(global_step=105, training_loss=0.20008919125511532, metrics={'train_runtime': 61.1578, 'train_samples_per_second': 27.47, 'train_steps_per_second': 1.717, 'train_loss': 0.20008919125511532, 'epoch': 7.0})

There are several core reasons why your training loss decreases across multiple runs of training your model. Let's explore them:

1. The Nature of Gradient Descent
- Most deep learning models train using some variant of gradient descent. The basic idea is to calculate how wrong the model is (the loss), then adjust its parameters slightly in the direction that reduces that error. With each update, assuming the learning rate is well chosen, the model becomes less wrong, which shows up directly as a decreasing loss.
2. Data Memorization
- Your model learns patterns in the training data. As it sees examples multiple times, it becomes better at recognizing those patterns and producing outputs that align with the expected labels. This "memorization" leads to lower errors and, therefore, lower loss.
3. Regularization
- Techniques like dropout, L1/L2 regularization, and weight decay are designed to prevent overfitting, which is when a model becomes too specialized to the training data and performs worse on new examples. Regularization constrains or penalizes overly complex models during training, which helps the model focus on generalizable patterns. It does not lower training loss by itself (it usually raises it slightly), but it helps keep validation loss from increasing as training continues.
4. Optimization Improvements
- More advanced gradient descent algorithms like Adam, Adadelta, etc., utilize techniques like momentum and adaptive learning rates. They help the model navigate the loss landscape more efficiently and can lead to faster and more consistent decreases in loss.
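
The first and third points can be illustrated with a toy example. The following is a minimal sketch, not the Trainer's actual optimizer: it fits a one-parameter model with plain gradient descent, once without and once with an L2 penalty. The data, the learning rate, and the `weight_decay` value are made up for illustration.

```python
# Toy illustration: fit y = w * x with plain gradient descent on mean squared error.
# The data, learning rate, and weight_decay values are hypothetical.

def mse_loss(w, xs, ys):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.05, steps=10, weight_decay=0.0):
    """Run gradient descent; return the final weight and the loss after each step."""
    w = 0.0
    losses = []
    for _ in range(steps):
        # An L2 penalty contributes weight_decay * w to the gradient.
        grad = mse_grad(w, xs, ys) + weight_decay * w
        w -= lr * grad
        # Record the unpenalized training loss so both runs are comparable.
        losses.append(mse_loss(w, xs, ys))
    return w, losses

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true weight w = 2

w_plain, losses_plain = train(xs, ys)
w_decay, losses_decay = train(xs, ys, weight_decay=0.5)

print("plain:", round(w_plain, 4), [round(l, 4) for l in losses_plain[:3]])
print("decay:", round(w_decay, 4), round(losses_decay[-1], 4))
```

In this sketch the plain run's loss shrinks at every step, while the penalized run settles on a slightly smaller weight and a slightly higher training loss. That is the trade regularization makes in exchange for better generalization.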

**Caveats**
- Overfitting: If training loss decreases significantly, but validation loss stops improving or starts getting worse, it's a sign of overfitting. Your model is getting too tied to the specifics of your training set.
- Learning Rate: A too-high learning rate can cause the loss to bounce around or diverge instead of converging. A too-low learning rate can prolong training unnecessarily.
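
The learning-rate caveat is easy to see on a toy problem. The sketch below minimizes f(w) = w² with three hypothetical learning rates. The function and the specific values are assumptions chosen for illustration and are unrelated to the defaults used by `Trainer`.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w
        losses.append(w ** 2)
    return losses

too_high = run_gradient_descent(lr=1.1)    # every step overshoots the minimum, so the loss grows
reasonable = run_gradient_descent(lr=0.3)  # the loss shrinks quickly
too_low = run_gradient_descent(lr=0.001)   # the loss shrinks, but only very slowly

for name, losses in [("too high", too_high), ("reasonable", reasonable), ("too low", too_low)]:
    print(f"{name:>10}: first={losses[0]:.4g} last={losses[-1]:.4g}")
```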

### MLM

In [13]:
from transformers import (
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

#Defines a custom dataset class to manage your tokenized sentences and integrate with Hugging Face's data handling mechanisms.
class TokenizedSentencesDataset:
    def __init__(self, sentences, tokenizer, max_length, cache_tokenization=False):
        self.tokenizer = tokenizer
        self.sentences = sentences
        self.max_length = max_length
        self.cache_tokenization = cache_tokenization

    def __getitem__(self, item):
        if not self.cache_tokenization:
            return self.tokenizer(self.sentences[item], add_special_tokens=True, truncation=True, max_length=self.max_length, return_special_tokens_mask=True)

        if isinstance(self.sentences[item], str):
            self.sentences[item] = self.tokenizer(self.sentences[item], add_special_tokens=True, truncation=True, max_length=self.max_length, return_special_tokens_mask=True)
        return self.sentences[item]

    def __len__(self):
        return len(self.sentences)

#Prepare the Datasets    
max_length = 100
mlm_prob=0.15 #Probability of masking a token
sentences = df['Sentence'].to_list()
train_sentences, dev_sentences = sentences[:260], sentences[260:]
train_dataset = TokenizedSentencesDataset(train_sentences, tokenizer, max_length)
# Build the dev set only if any sentences remain after the first 260
dev_dataset = TokenizedSentencesDataset(dev_sentences, tokenizer, max_length, cache_tokenization=True) if len(dev_sentences) > 0 else None

#Choose the Data Collator
do_whole_word_mask = True
if do_whole_word_mask:
    data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_prob)
else:
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_prob)

#Initialize the Model & Setup Training Arguments & Initialize the Trainer 
model2 = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-uncased")
training_args = TrainingArguments(
    output_dir= ".",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,  # preferred over the deprecated per_gpu_train_batch_size used in the original run
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model2,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
100%|██████████| 34/34 [00:25<00:00,  1.33it/s]

{'train_runtime': 25.5824, 'train_samples_per_second': 20.326, 'train_steps_per_second': 1.329, 'train_loss': 2.4633097929113053, 'epoch': 2.0}





TrainOutput(global_step=34, training_loss=2.4633097929113053, metrics={'train_runtime': 25.5824, 'train_samples_per_second': 20.326, 'train_steps_per_second': 1.329, 'train_loss': 2.4633097929113053, 'epoch': 2.0})
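
To make `mlm_prob` more concrete, here is a minimal pure-Python sketch of BERT-style token masking. It is not the Hugging Face collator: `mask_tokens`, the toy `vocab`, and the use of `None` for ignored positions are assumptions for illustration. The real collators work on token ids, skip special tokens, and mark ignored positions with `-100`. Whole-word masking additionally masks all sub-tokens of a word together, which this sketch does not attempt.

```python
import random

def mask_tokens(tokens, vocab, mlm_probability=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) using the BERT-style 80/10/10 rule.

    labels holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            masked.append(token)   # not selected: left unchanged, excluded from the loss
            labels.append(None)
            continue
        labels.append(token)       # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_token)         # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the original token
    return masked, labels

tokens = "my dog is very cute".split()
# A high probability is used here only so the short example shows some masking.
masked, labels = mask_tokens(tokens, vocab=["cat", "runs", "blue"], mlm_probability=0.5)
print(masked)
print(labels)
```

Only the selected positions contribute to the MLM loss, which is why the reported `train_loss` reflects how well the model recovers roughly 15% of the tokens rather than all of them.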