<div style="color: #7b6b59; font-size: 30px; text-align: center;">Diverse Approaches to Document Classification</div>

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Introduction</div>

**Text classification** consists in categorizing a text passage into several predefined labels. It is one of the most useful natural language processing (NLP) techniques and typical use cases include email routing, sentiment analysis of customer reviews, spam filtering, toxicity detection, etc. In practice, the circumstances may vastly differ from one use case to another. In particular:

1. **Labeled data** may be abundant, scarce, or simply inexistent;
1. **The vocabulary** used in the texts and the targeted labels may be **common or very specific to the context of a particular organization**;
1. **The organization** implementing the use case may have **limited or extensive NLP expertise**.

In this notebook, we’ll present several powerful text classification techniques to fit all these situations.

<img width="1163" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/d3ab4069-8cf9-45c8-836b-858610432158">

## <span style="color: #7b6b59;">What is Transfer Learning?</span>

A neural network is trained on a data. This network gains knowledge from this data, which is compiled as “weights” of the network. These weights can be extracted and then transferred to any other neural network. Instead of training the other neural network from scratch, we “transfer” the learned features.

- **Transfer learning** involves using a pre-trained model on a specific task and applying its learned knowledge to a different but related task. The basic ideology of the feature is that the features learned by the pre-trained model on a large dataset can be generalized and useful for other tasks, even if the new task has a different dataset. The process typically involves taking a pre-trained model, removing its last layers, and replacing them with new layers. The initial layers of the pre-trained model are fine-tuned with a small learning rate to preserve the learned representations. They help in capturing the general features. The newly added layers are then trained using the new dataset specific to the target task.

- **Fine-tuning** refers to the process of taking a pre-trained model and further training it on a new dataset. Fine-tuning pre-trained AI models is a powerful technique that has revolutionized the field of machine learning. It involves taking a model that has already been trained on a large dataset and then further training, or “fine-tuning,” it on a smaller, specific dataset. This process allows the model to adapt to new tasks with less data than training from scratch. **What is Fine-Tuning?** In the context of machine learning, fine-tuning is a process that takes a pre-trained model (a model trained on a large-scale dataset) and “tunes” this model for a different but related task. Pre-trained models are attractive for many reasons, but primarily because they’ve already learned patterns in their original training set. So, leveraging these learned patterns can lead to improved performance and faster training times on related tasks.

To illustrate the difference between not fine-tuned and fine-tuned models, let’s consider an example in the field of natural language processing (NLP). Suppose we have a pre-trained model on a large corpus of English text, and we want to adapt this model to perform sentiment analysis (i.e., determining whether a piece of text is positive, negative, or neutral).

- **Not Fine-Tuned:** If we use our pre-trained model without any fine-tuning, it might not perform well on the sentiment analysis task. This is because the original task it was trained on likely didn’t involve any form of sentiment classification, so the model might not have learned the necessary features for this task.

- **Fine-Tuned:** If we fine-tune our pre-trained model on a smaller dataset specifically labeled for sentiment analysis, it can learn from the pre-existing knowledge of the English language it gained during pre-training and combine this with the specific examples in our smaller dataset. This can lead to significantly improved performance on the sentiment analysis task.

The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world. You can then take advantage of these learned feature maps without having to start from scratch by training a large model on a large dataset.


### <span style="color: #7b6b59;">What is a Pre-trained Model?</span>

Simply put, a pre-trained model is a model created by some one else to solve a similar problem. Instead of building a model from scratch to solve a similar problem, you use the model trained on other problem as a starting point. A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. You either use the pretrained model as is or use transfer learning to customize this model to a given task.

For example, if you want to build a self learning car. You can spend years to build a decent image recognition algorithm from scratch or you can take inception model (a pre-trained model) from Google which was built on ImageNet data to identify images in those pictures.

A pre-trained model may not be 100% accurate in your application, but it saves huge efforts required to re-invent the wheel. Let me show this to you with a recent example.

*The only change that I made to the VGG16 existing architecture is changing the softmax layer with 1000 outputs to 16 categories suitable for our problem and re-training the dense layer. This architecture gave me an accuracy of 70% much better than MLP and CNN. Also, the biggest benefit of using the VGG16 pre-trained model was almost negligible time to train the dense layer with greater accuracy. So, I moved forward with this approach of using a pre-trained model and the next step was to fine tune my VGG16 model to suit this problem.*

What is our objective when we train a neural network? We wish to identify the correct weights for the network by multiple forward and backward iterations. By using pre-trained models which have been previously trained on large datasets, we can directly use the weights and architecture obtained and apply the learning on our problem statement. This is known as transfer learning. We “transfer the learning” of the pre-trained model to our specific problem statement.

You should be very careful while choosing what pre-trained model you should use in your case. If the problem statement we have at hand is very different from the one on which the pre-trained model was trained – the prediction we would get would be very inaccurate. For example, a model previously trained for speech recognition would work horribly if we try to use it to identify objects using it.

We make modifications in the pre-existing model by fine-tuning the model. Since we assume that the pre-trained network has been trained quite well, we would not want to modify the weights too soon and too much. While modifying we generally use a learning rate smaller than the one used for initially training the model.

**Why use pre-trained models?**

Pre-trained models are neural networks that have been trained on large amounts of data, usually for a general NLP task like language modeling, sentiment analysis, or question answering. They can capture complex patterns and features of natural language, and transfer them to other related tasks. Using pre-trained models can help you achieve better results, faster, and with less data than training a model from scratch. However, pre-trained models are not magic bullets. They might not fit your specific problem or domain, and they might have biases or limitations that affect their performance.

### <span style="color: #7b6b59;">Ways to Transfer Learning</span>

- **Two Main Approaches:** There are two main approaches to transfer learning in NLP: 
    - **feature-based transfer learning** and 
    - **fine-tuning.**


1. **Feature extraction** – We can use a pre-trained model as a feature extraction mechanism. What we can do is that we can remove the output layer( the one which gives the probabilities for being in each of the 1000 classes) and then use the entire network as a fixed feature extractor for the new data set. Use the representations learned by a previous network to extract meaningful features from new samples. You simply add a new classifier, which will be trained from scratch, on top of the pretrained model so that you can repurpose the feature maps learned previously for the dataset. You do not need to (re)train the entire model. The base convolutional network already contains features that are generically useful for classifying pictures. However, the final, classification part of the pretrained model is specific to the original classification task, and subsequently specific to the set of classes on which the model was trained. When working with a small dataset, it is a common practice to take advantage of features learned by a model trained on a larger dataset in the same domain. This is done by instantiating the pre-trained model and adding a fully-connected classifier on top. The pre-trained model is "frozen" and only the weights of the classifier get updated during training. In this case, the convolutional base extracted all the features associated with each image and you just trained a classifier that determines the image class given that set of extracted features.

1. **Train some layers while freeze others** – Another way to use a pre-trained model is to train is partially. What we can do is we keep the weights of initial layers of the model frozen while we retrain only the higher layers. We can try and test as to how many layers to be frozen and how many to be trained. Unfreeze a few of the top layers of a frozen model base and jointly train both the newly-added classifier layers and the last layers of the base model. This allows us to "fine-tune" the higher-order feature representations in the base model in order to make them more relevant for the specific task. To further improve performance, one might want to repurpose the top-level layers of the pre-trained models to the new dataset via fine-tuning. In this case, you tuned your weights such that your model learned high-level features specific to the dataset. This technique is usually recommended when the training dataset is large and very similar to the original dataset that the pre-trained model was trained on. One way to increase performance even further is to train (or "fine-tune") the weights of the top layers of the pre-trained model alongside the training of the classifier you added. The training process will force the weights to be tuned from generic feature maps to features associated specifically with the dataset. Also, you should try to fine-tune a small number of top layers rather than the whole MobileNet model. In most convolutional networks, the higher up a layer is, the more specialized it is. The first few layers learn very simple and generic features that generalize to almost all types of images. As you go higher up, the features are increasingly more specific to the dataset on which the model was trained. The goal of fine-tuning is to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.

1. **Use the Architecture of the pre-trained model** – What we can do is that we use architecture of the model while we initialize all the weights randomly and train the model according to our dataset again.

The below diagram should help you decide on how to proceed on using the pre trained model in your case.

<img width="772" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/578731ef-1025-486e-928b-217e9fc9b900">

- **Scenario 1:** Size of the Data set is small while the Data similarity is very high – In this case, since the data similarity is very high, we do not need to retrain the model. All we need to do is to customize and modify the output layers according to our problem statement. We use the pretrained model as a feature extractor. Suppose we decide to use models trained on Imagenet to identify if the new set of images have cats or dogs. Here the images we need to identify would be similar to imagenet, however we just need two categories as my output – cats or dogs. In this case all we do is just modify the dense layers and the final softmax layer to output 2 categories instead of a 1000.

- **Scenario 2:** Size of the data is small as well as data similarity is very low – In this case we can freeze the initial (let’s say k) layers of the pretrained model and train just the remaining(n-k) layers again. The top layers would then be customized to the new data set. Since the new data set has low similarity it is significant to retrain and customize the higher layers according to the new dataset.  The small size of the data set is compensated by the fact that the initial layers are kept pretrained(which have been trained on a large dataset previously) and the weights for those layers are frozen.

- **Scenario 3:** Size of the data set is large however the Data similarity is very low – In this case, since we have a large dataset, our neural network training would be effective. However, since the data we have is very different as compared to the data used for training our pretrained models. The predictions made using pretrained models would not be effective. Hence, its best to train the neural network from scratch according to your data.

- **Scenario 4:** Size of the data is large as well as there is high data similarity – This is the ideal situation. In this case the pretrained model should be most effective. The best way to use the model is to retain the architecture of the model and the initial weights of the model. Then we can retrain this model using the weights as initialized in the pre-trained model.

*Let’s now try to use a pretrained model for a simple problem. There are various architectures that have been trained on the imageNet data set. You can go through various architectures here. I have used vgg16 as pretrained model architecture and have tried to identify handwritten digits using it. Let’s see in which of the above scenarios would this problem fall into. We have around 60,000 training images of handwritten digits. This data set is definitely small. So the situation would either fall into scenario 1 or scenario 2. We shall try to solve the problem using both these scenarios.*

1. Retrain the output dense layers only – Here we use vgg16 as a feature extractor. We then use these features and send them to dense layers which are trained according to our data set. The output layer is also replaced with our new softmax layer relevant to our problem. The output layer in a vgg16 is a softmax activation with 1000 categories. We remove this layer and replace it with a softmax layer of 10 categories. We just train the weights of these layers and try to identify the digits.

2. Freeze the weights of first few layers – Here what we do is we freeze the weights of the first 8 layers of the vgg16 network, while we retrain the subsequent layers. This is because the first few layers capture universal features like curves and edges that are also relevant to our new problem. We want to keep those weights intact and we will get the network to focus on learning dataset-specific features in the subsequent layers.

Be sure that the pre-trained model you have selected has been trained on a similar data set as the one that you wish to use it on. There are various architectures people have tried on different types of data sets and I strongly encourage you to go through these architectures and apply them on your own problem statements. 

## <span style="color: #7b6b59;">Technique 1: Fine-tuning pre-trained models</span>
Fine-tuning is often necessary because pre-trained models are trained on general language understanding, and the fine-tuning process adapts the model to a specific task or domain. Fine-tuning involves training the model on task-specific data, where the model is initialized with the pre-trained weights, and the weights are updated during the fine-tuning process.

Fine-tuning involves adjusting various hyperparameters, such as learning rate, batch size, and number of training epochs, to optimize the model’s performance on the specific task. Fine-tuning can be done on a single task or on multiple tasks, where the model is trained on multiple tasks sequentially or simultaneously.


**Fine-tuning example: Fine-tuning BERT for text classification**

In this example, we want to build a sentiment analysis model for movie reviews. Instead of creating a new language model and training it from scratch, we use a pre-trained model called BERT, which has already learned the structure and nuances of the English language from a large text dataset.

We add a classification layer on top of the pre-trained BERT model and then fine-tune the entire model on our dataset of movie reviews. Fine-tuning means we continue the training process for a few more epochs with a smaller learning rate. This allows the BERT model to adjust its weights slightly to better understand the specific task of sentiment analysis while preserving the knowledge it has learned from the larger text dataset.

This approach leverages the knowledge BERT has already learned from the large text dataset to achieve better performance on our sentiment analysis task with less data and training time.

### <span style="color: #7b6b59;">Where to Start?</span>


Fine-tuning pre-trained models involves several steps:

1. **Select a Pre-Trained Model:** Choose a pre-trained model that is well-suited to your specific task. Models like BERT, GPT-3, and RoBERTa have different strengths, so consider the architecture and pre-training data when making your selection. The first step is to choose an appropriate pre-trained model. The choice of model usually depends on your specific task. For instance, if you’re working on an NLP task, models like BERT or GPT-3 could be good choices. The first step is to choose a pre-trained NLP model that suits your use case. There are many models available, such as BERT, GPT-2, RoBERTa, XLNet, and T5, each with different architectures, sizes, and pre-training objectives. You should consider factors such as the type of task (e.g., classification, generation, summarization, etc.), the language and domain of the data, the computational resources and time available, and the expected performance and accuracy. You can find pre-trained models on platforms such as Hugging Face or TensorFlow Hub, or train your own model from scratch or using transfer learning. When selecting a pre-trained model for your portfolio project, there are several factors to consider, such as the task and data, performance and resources, and availability and accessibility. It is important to evaluate how similar your data is to the data that the pre-trained model was trained on, as well as how accurate and robust you want your model to be. Additionally, it is necessary to consider how much computational power and memory you have, as well as how easy it is to find and download the pre-trained model. By comparing different pre-trained models based on these criteria, you can determine which one best suits your needs and constraints.
    - **How to load a pre-trained model?** Once you have chosen a pre-trained model, load it into your code. Depending on the model and the framework, there are different ways to do this. One of the most convenient options is to use libraries like Hugging Face Transformers or PyTorch Lightning, which provide easy access to a wide range of pre-trained models and their corresponding tokenizers, weights, and configurations. For example, using Hugging Face Transformers, you can load a pre-trained BERT model for sentiment analysis with just a few lines of code. You can also load custom or fine-tuned models from local files or online repositories, as long as they are compatible with the library. Initialize the pre-trained model with the pre-trained weights and train it on your task-specific dataset. Transfer learning helps the model leverage its prior knowledge, making fine-tuning more efficient.

2. **Prepare Your Dataset:** You’ll need a labeled dataset for your specific task. This dataset will be used to fine-tune the pre-trained model. The second step is to prepare the data for fine-tuning. You need to have a labeled dataset that matches your target task and domain. For example, if you want to fine-tune a model for sentiment analysis, you need a dataset of texts with positive or negative labels. You also need to split the data into training, validation, and test sets, and preprocess the data according to the model requirements. For example, you may need to tokenize, truncate, pad, or mask the texts, or add special tokens such as [CLS] or [SEP]. You can use libraries such as transformers or nltk to help you with data preparation. Gather and preprocess your task-specific dataset. This involves tokenization, handling special characters, padding sequences, and encoding labels for supervised tasks.

3. **Set the hyperparameters & Fine-Tune the Model:** Next, you’ll need to fine-tune your selected pre-trained model on your dataset. This usually involves setting up your training configuration and then training your model. The third step is to set the hyperparameters for fine-tuning. Hyperparameters are the variables that control the training process, such as the learning rate, the batch size, the number of epochs, the optimizer, the loss function, and the regularization methods. You should tune the hyperparameters to optimize the performance and avoid overfitting or underfitting. You can use methods such as grid search, random search, or Bayesian optimization to find the best hyperparameters for your model and data. You can also use tools such as Optuna or Ray Tune to automate the hyperparameter tuning process.
    - **How to fine-tune a pre-trained model?** Loading a pre-trained model is not enough to make it work on your specific task and data. You need to fine-tune it, which means adjusting its parameters to optimize its performance on your target domain. Fine-tuning a pre-trained model involves several steps, such as preparing the data by splitting it into training, validation, and test sets and preprocessing them according to the model's requirements. Additionally, you need to set the hyperparameters such as the learning rate, batch size, number of epochs, optimizer, loss function, and other options that affect the training process and the model's behavior. Furthermore, you must train the model by feeding it data and updating its parameters based on feedback from the loss function and validation metrics. Finally, you must evaluate the model by testing it on unseen data and measuring its performance using appropriate metrics such as accuracy, precision, recall, or F1-score. You may also need to analyze errors, outputs, and attention weights of the model in order to compare it with other models or baselines. Depending on your task, you may need to add custom layers on top of the pre-trained model. For example, for text classification, you can add a fully connected layer with softmax activation for predicting class labels.

4. **Evaluate Your Model:** After fine-tuning, it’s important to evaluate your model’s performance on a validation set to see how well it’s doing. The final step is to evaluate the results of fine-tuning. You should use the test set to measure the performance of your fine-tuned model on your target task and domain. You should use appropriate metrics that reflect your objectives and expectations, such as accuracy, precision, recall, F1-score, BLEU, ROUGE, etc. You should also compare the results with the baseline model or other models to assess the improvement and the trade-offs. You can use libraries such as scikit-learn or nltk to calculate the metrics and visualize the results.
    - **How to showcase your fine-tuned model?** After fine-tuning your pre-trained model, you might want to showcase it in your portfolio project and demonstrate your NLP skills and knowledge. Writing a blog post or report is one way to do this, as it allows you to explain the methodology and results of your project, as well as highlight any challenges and contributions. Additionally, visualizations, code snippets, and references can be included to support arguments and findings. Creating a demo or an app is another option. Tools like Streamlit, Flask, or Dash can be used to build web-based demos or apps that can be hosted on platforms like Heroku, AWS, or Google Cloud. Lastly, you can share your code and model on platforms like GitHub, Colab, or Kaggle to make them accessible and reproducible for others. Documentation of the code and model should also be included using comments, README files, or notebooks.


### <span style="color: #7b6b59;">How to Keep Improving?</span>

Large language models have achieved remarkable results in natural language processing (NLP) tasks, but they are not always optimized for specific tasks. Fine-tuning a pre-trained language model on a task or domain-specific dataset can improve its performance on that task or domain respectively. We will explore several fine-tuning techniques that can be applied to large language models to boost their performance on specific tasks or domains.

Fine-tuning is more of an art than a science, and there are several ways you can continue to improve your fine-tuned models:

1. **Task-specific training data:** The first and most important technique for fine-tuning a large language model is to use task-specific training data. A pre-trained language model has learned a general representation of language that can be applied to many tasks, but it may not perform well on a specific task without additional training. To fine-tune the model for a specific task, we need a dataset that is annotated with task-specific labels. For example, if we want to fine-tune the model for sentiment analysis, we need a dataset of text documents labeled with positive or negative sentiment. Once we have the task-specific training data, we can fine-tune the pre-trained language model on this dataset. During fine-tuning, the model is trained in a supervised way, where the input is the task-specific text and the output is the corresponding label. The fine-tuning process adjusts the pre-trained parameters to better fit the task-specific data. The output of the fine-tuned model is a model that is optimized for the specific task. Hence, a task-specific dataset contains text that is specific to a particular task, such as sentiment analysis or machine translation, and the goal is to fine-tune the pre-trained language model to perform that task better. The model is trained on a large corpus of text that is annotated or labeled for that task, which helps it learn the patterns and relationships between words that are specific to the task.

1. **Domain Specific training data:** Machine learning models are built to generalize, meaning they can learn patterns and relationships in the data they are trained on and apply them to new, unseen data. However, if the model is trained on data that is not representative of the target domain, it may not perform well in that domain. A domain-specific dataset contains text that is specific to a particular domain, such as legal or medical texts, and the goal is to fine-tune the pre-trained language model to better understand the language and jargon used in that domain. The model is trained on a large corpus of domain-specific text, which helps it learn domain-specific language patterns and terminology. If a model is trained on a general corpus of text, it may not perform well in a specific domain, such as legal or medical text. To improve the model’s performance in these domains, it is essential to fine-tune the model with domain-specific data. Hence, domain-specific datasets are used to fine-tune language models to better understand specific domains, while task-specific datasets are used to fine-tune language models to perform specific tasks better. Both approaches can be used in combination, depending on the use case.

1. **Hyperparameter Tuning:** You can experiment with different hyperparameters during the fine-tuning process such as learning rate, batch size, number of training epochs etc. Experiment with hyperparameters like dropout rates, layer sizes, and optimizer choice to find the best configuration for your task.

1. **Data Augmentation:** Techniques such as text augmentation for NLP tasks or image augmentation for computer vision tasks can provide your model with more varied examples and help improve its performance. Augment your training data to improve model generalization. For text data, this might include synonym replacement, paraphrasing, or back-translation.

1. **Regularization Techniques:** Methods like dropout or weight decay can help prevent overfitting during the fine-tuning process. Implement techniques like dropout or weight decay to prevent overfitting during fine-tuning. Regularization techniques can help prevent overfitting during fine-tuning. Overfitting occurs when the model memorizes the training data instead of generalizing to new data. Regularization techniques add constraints to the optimization process to encourage the model to learn a more general representation of the data. Two popular regularization techniques are **dropout** and **weight decay**. Dropout randomly drops out some of the neurons in the model during training, which can help prevent co-adaptation of the neurons and encourage the model to learn more robust features. Weight decay adds a penalty to the loss function for large weights, which can help prevent the model from overemphasizing noisy features.

1. **Early Stopping:** Implement early stopping by monitoring validation performance to avoid overfitting.

1. **Learning Rate:** Experiment with different learning rates during fine-tuning. A common approach is to use a smaller learning rate for the pre-trained layers and a larger learning rate for the custom layers. This helps retain valuable pre-trained knowledge while adapting to the new task. The learning rate is a hyperparameter that controls the step size taken during gradient descent optimization. A high learning rate can cause the model to overshoot the minimum of the loss function, while a low learning rate can cause slow convergence or getting stuck in local minima. During fine-tuning, adjusting the learning rate can improve performance. Starting with a high learning rate and gradually decreasing it over time can help the model converge to better solutions. One popular learning rate scheduling technique is called the “learning rate warmup.” During the warmup phase, the learning rate is gradually increased from a small value to the desired maximum value. This helps the model start learning more quickly and avoid getting stuck in suboptimal solutions.

1. **Gradient clipping:** Gradient clipping can prevent the gradients from becoming too large during fine-tuning. Large gradients can cause the optimization process to become unstable and lead to divergent or oscillating behavior. Gradient clipping sets a maximum threshold for the gradient norm, so that the gradients are scaled down if they exceed the threshold.

1. **Batch Size:** Adjust the batch size based on the available GPU memory. Smaller batch sizes are often used during fine-tuning to ensure model stability.

1. **Number of Epochs:** Fine-tuning does not typically require as many epochs as pre-training. Start with a small number of epochs, monitor performance, and increase if necessary.

1. **Loss Function:** Choose a loss function suitable for your task. Common choices include cross-entropy loss for classification tasks and mean squared error for regression tasks.

1. **Task-Specific Metrics:** Choose evaluation metrics that are specific to your task. For example, F1 score for classification, BLEU score for translation, or ROUGE score for text summarization.

1. **Inference:** After fine-tuning, save the model and develop an inference pipeline for your specific task. This may involve additional pre-processing and post-processing steps.

1. **Monitoring and Iteration:** Continually monitor the model's performance on a validation dataset and iterate on the fine-tuning process as needed. You may need to adjust hyperparameters or collect more data to improve results.

1. **Multi-task learning:** Multi-task learning is a technique that involves fine-tuning a pre-trained model on multiple related tasks simultaneously. The model is trained on a combination of the task-specific datasets, with each dataset contributing to the loss function in proportion to its importance. Multi-task learning can improve the performance of the model on all tasks by encouraging the model to learn more general features that are useful across tasks.

Fine-tuning a large language model can significantly improve its performance on specific tasks. By using task-specific training data, adjusting the learning rate, applying regularization techniques, using gradient clipping, and using multi-task learning, we can fine-tune a pre-trained model to better fit the task-specific data.

It’s worth noting that the specific finetuning techniques used will depend on the task and the specific model being used. Different tasks may require different levels of regularization or learning rate schedules, and some models may benefit more from multi-task learning than others.

Overall, fine-tuning a pre-trained language model is a powerful technique for improving the performance of the model on specific tasks. It allows us to leverage the power of pre-trained models while tailoring them to specific use cases. By experimenting with different finetuning techniques and hyperparameters, we can further optimize the performance of the model for our specific needs.


## <span style="color: #7b6b59;">Technique 2: Feature-based Transfer Learning</span>

1. Feature extraction – We can use a pre-trained model as a feature extraction mechanism. What we can do is that we can remove the output layer( the one which gives the probabilities for being in each of the 1000 classes) and then use the entire network as a fixed feature extractor for the new data set.

Feature Based Transfer learning is a method in machine learning where a pre-trained model is used as a starting point for training a new model on a different but related task. The idea behind transfer learning is that the knowledge gained from one task can be reused and applied to another task, reducing the amount of training data and computational resources required to achieve good performance. Transfer learning is especially useful in deep learning, where training large models from scratch can be very expensive and time-consuming.

Transfer learning example: Feature extraction

In this example, we want to build an image classifier to distinguish between images of cats and dogs. Instead of creating a new model and training it from scratch, we use a pre-trained model called VGG16, which has already learned to identify thousands of object categories from a large dataset called ImageNet.

We remove the last few layers of VGG16 (which are responsible for actual object classification) and use the remaining layers as a “feature extractor” for our cat and dog images. These layers can transform the input images into a compact representation that captures the essential information. We then use this compact representation as input for training a simpler classifier, like a Support Vector Machine (SVM) or logistic regression.

This approach leverages the knowledge VGG16 has already learned from the ImageNet dataset to speed up our training process and achieve better performance with less data.

How might an Enterprise use a pre-trained LLM Model in conjunction with feature based transfer learning ?

- **Image and video recognition:** Enterprises can use pre-trained models, such as VGG16, ResNet50, or MobileNet, to extract features from images and videos. They can then fine-tune the pre-trained models on their specific tasks, such as detecting defects in manufacturing products or identifying security threats in surveillance videos.

- **Natural language processing:** Enterprises can use pre-trained models, such as BERT, GPT-2, or RoBERTa, to extract features from text data. They can then fine-tune the pre-trained models on their specific tasks, such as sentiment analysis, question-answering, or document classification.

- **Recommendation systems:** Enterprises can use pre-trained models, such as deep autoencoders or matrix factorization, to learn latent representations of user preferences and item features. They can then fine-tune the pre-trained models on their specific recommendation tasks, such as product recommendations or personalized content recommendations.

- **Speech recognition:** Enterprises can use pre-trained models, such as DeepSpeech or Kaldi, to extract features from audio data. They can then fine-tune the pre-trained models on their specific speech recognition tasks, such as voice assistants or call center transcriptions.

- **Anomaly detection:** Enterprises can use pre-trained models, such as autoencoders or GANs, to learn the normal patterns of their data. They can then fine-tune the pre-trained models on their specific anomaly detection tasks, such as fraud detection or predictive maintenance.

## <span style="color: #7b6b59;">Technique 3: Training from Scratch</span>

Use the Architecture of the pre-trained model – What we can do is that we use architecture of the model while we initialize all the weights randomly and train the model according to our dataset again.


## <span style="color: #7b6b59;">Considerations</span>

- **Transfer Learning vs Fine-Tuning:** While these terms are often used interchangeably, they have subtle differences. Transfer learning generally refers to using the pre-trained model as a fixed feature extractor, while fine-tuning involves updating the weights of the pre-trained model during training. Fine-tuning involves taking the pre-trained model and training it on a new task or domain with additional data. Fine-tuning is a form of transfer learning, but they are not exactly the same thing. Transfer learning is a more general concept, while fine-tuning is a specific technique used within transfer learning. Fine-tuning involves taking a pre-trained model and further training it on a specific task or domain.

- **When to Fine-Tune:**  Not all tasks or datasets benefit from fine-tuning. Sometimes, using a pre-trained model as a fixed feature extractor might be enough.

- **Fine-Tuning vs Training from Scratch:** It’s important to understand when it’s beneficial to fine-tune a pre-trained model versus training a model from scratch.

- **Challenges in Fine-Tuning:** Fine-tuning is not without its challenges. Issues such as catastrophic forgetting (where the model forgets its previously learned knowledge) and domain shift (where the distribution of the new task data is different from the pre-training data) are important considerations.



In conclusion, fine-tuning pre-trained AI models is a powerful technique that enables the tailoring of versatile models to meet specific needs across various domains. It empowers AI practitioners to harness the vast knowledge encapsulated in these models and adapt it for custom applications. By following the steps outlined above and embracing ongoing refinement, you can unlock the true potential of fine-tuned AI models, making them invaluable tools for addressing a wide range of challenges and opportunities in the world of artificial intelligence. Whether you’re building chatbots, image classifiers, or tackling other complex tasks, fine-tuning is the key to achieving exceptional results while navigating ethical, legal, and practical considerations in this rapidly evolving field


**Pros:**

- Pre-trained models can save time and resources by allowing developers to start with a model that has already learned useful features from data. This can be especially useful when working with small datasets or when trying to solve complex problems.

- Pre-trained models can provide a good starting point for further fine-tuning and customization. This can help developers achieve better results more quickly than if they were starting from scratch.

- Pre-trained models can provide a level of performance that would be difficult to achieve with a model trained from scratch. This is because pre-trained models have been trained on large amounts of data and have learned to recognize many different patterns and relationships.

**Cons:**

- Pre-trained models may not be suitable for all tasks or domains. For example, a pre-trained model trained on text data may not perform well on image data.
- Pre-trained models may require additional fine-tuning and customization to achieve the desired level of performance. This can require additional time and resources.
- Pre-trained models may not always be available for the specific task or domain that a developer is working on. In this case, the developer may need to train their own model from scratch.

To make the most of a fine-tuned large language model, an enterprise should consider the following steps:

1. Identify the specific NLP tasks that would benefit from the LLM’s capabilities.
1. Collect, clean, and label the necessary data for fine-tuning the LLM for the specific tasks.
1. Fine-tune the pre-trained LLM using the collected data and evaluate its performance.
1. Deploy the fine-tuned model in production environments, such as chatbots, document management systems, or analytics platforms, using tools like NVIDIA’s Triton Inference Server.

Continuously monitor the model’s performance and update it with new data to ensure its accuracy and relevance to the tasks at hand.

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Import Python Libraries</div>


In [1]:
import os
import numpy as np
import pandas as pd
import gc
import plotly.express as px

from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

from datasets import Dataset
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup


import torch
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
#from torch.utils.data import DataLoader, Dataset

import time
import math
import random


# Import necessary classes and constants from the logging module
from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
from typing import Tuple
from sklearn.metrics import roc_auc_score




# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Load the Dataset</div>


In [2]:
train_prompts = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv")
train_essays = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
test_essays = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/test_essays.csv")

In [3]:
train_prompts.head()

Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,Does the electoral college work?,Write a letter to your state senator in which ...,# What Is the Electoral College? by the Office...


In [4]:
train_essays.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


In [5]:
train_essays.shape

(1378, 4)

In [6]:
test_essays.shape

(3, 3)

In [7]:
df = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv")
train_df = df[df.prompt_name != "Car-free cities"].reset_index(drop=True)
valid_df = df[df.prompt_name == "Car-free cities"].reset_index(drop=True)

In [8]:
train_df.shape

(40151, 5)

In [9]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Exploratory Data Analysis</div>


## <span style="color: #7b6b59;">Labels Distribution in Essay Data</span>

- `generated`: Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in test_essays.csv.


In [10]:
train_essays['generated'].value_counts()

generated
0    1375
1       3
Name: count, dtype: int64

In [11]:
train_essays['generated'].value_counts(normalize=True)

generated
0    0.997823
1    0.002177
Name: proportion, dtype: float64

# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Standard Approaches: Vectorization and Classic Machine Learning (ML) Model</div>

## <span style="color: #7b6b59;">Introduction</span>

A simple approach for text classification is to convert text passages in vectors and then use standard ML algorithms such as logistic regression or tree-based models. The key question then becomes: How do you transform a text passage in a vector?

### <span style="color: #7b6b59;">Option 1: TF-IDF, Sparse Vectorization and Classic Machine Learning (ML) Model</span>

TF-IDF (or **term frequency — inverse document frequency**) is one way to achieve this vectorization. It returns a vector with one dimension for each word in a given vocabulary. Each component of this vector reflects the frequency of the corresponding word in the input text compared to the entire collection of texts.

**TF-IDF has several drawbacks. It does not consider the order of the words in the text and it ignores the semantic similarity between words.** It also does not distinguish between the various meanings of a polysemous word (e.g., “sound” as in “a loud sound,” “they sound correct,” or “a sound proposal”).

### <span style="color: #7b6b59;">Option 2: Embeddings obtained from a pre-trained deep learning model, Dense Vectorization and Classic ML Model</span>

A more effective approach, in particular if the training dataset is relatively small, is to use the vector representations (or **sentence embeddings**) obtained from a pre-trained deep learning model such as BERT.

<img width="921" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/901505fd-908a-4810-8fb5-66022a0cbe76">

***Sparse vectorization with TF-IDF (left), dense vectorization with sentence embeddings (right)***

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 1: Sparse Vectorization and Classic Machine Learning (ML) Model</div>

## <span style="color: #7b6b59;">Overview of vectorization options</span>

**Vectors & Word Embeddings: TF-IDF vs Word2Vec vs Bag-of-words vs BERT:**

As discussed above, TF-IDF can be used to vectorize text into a format more agreeable for ML & NLP techniques. However while it is a popular NLP algorithm it is not the only one out there.

1. **Bag of Words:** Bag of Words (BoW) simply counts the frequency of words in a document. Thus the vector for a document has the frequency of each word in the corpus for that document.  The key difference between bag of words and TF-IDF is that the former does not incorporate any sort of inverse document frequency (IDF)  and is only a frequency count (TF).

1. **Word2Vec:**  Word2Vec is an algorithm that uses shallow 2-layer, not deep, neural networks to ingest a corpus and produce sets of vectors. Some key differences between TF-IDF and word2vec is that TF-IDF is a statistical measure that we can apply to terms in a document and then use that to form a vector whereas word2vec will produce a vector for a term and then more work may need to be done to convert that set of vectors into a singular vector or other format. Additionally TF-IDF does not take into consideration the context of the words in the corpus whereas word2vec does.

1. **BERT - Bidirectional Encoder Representations from Transformers:** BERT is an ML/NLP technique developed by Google that uses a transformer based ML model to  convert phrases, words, etc into vectors. Key differences between TF-IDF and BERT are as follows: TF-IDF does not take into account the semantic meaning or context of the words whereas BERT does. Also BERT uses deep neural networks as part of its architecture, meaning that it can be much more computationally expensive than TF-IDF which has no such requirements. 

**Feature Engineering with Bag-of-Words or TF-IDF:**

Instead of using deep learning methods, you might utilize statistical methods for text representation like Bag-of-Words or TF-IDF, combined with machine learning algorithms.


## <span style="color: #7b6b59;">TF-IDF</span>

Most machine learning algorithms are fulfilled with mathematical things such as statistics, algebra, calculus and etc. They expect the data to be numerical such as a 2-dimensional array with rows as instances and columns as features. The problem with natural language is that the data is in the form of raw text, so that the text needs to be transformed into a vector. **The process of transforming text into a vector is commonly referred to as text vectorization.** It’s a fundamental process in natural language processing because none of the machine learning algorithms understand a text, not even computers. Text vectorization algorithm namely TF-IDF vectorizer, which is a very popular approach for traditional machine learning algorithms can help in transforming text into vectors. In order to process natural language, the text must be represented as a numerical feature. **The process of transforming text into a numerical feature is called text vectorization.** TF-IDF is one of the most popular text vectorizers, the calculation is very simple and easy to understand. It gives the rare term high weight and gives the common term low weight. TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc)  in a document amongst a collection of documents (also known as a corpus).

**Term frequency-inverse document frequency** is a text vectorizer that transforms the text into a usable vector. It combines 2 concepts, Term Frequency (TF) and Document Frequency (DF). TF-IDF can be broken down into two parts **TF (term frequency)** and **IDF (inverse document frequency)**.


- **The term frequency** is the number of occurrences of a specific term in a document. Term frequency indicates how important a specific term in a document. Term frequency represents every text from the data as a matrix whose rows are the number of documents and columns are the number of distinct terms throughout all documents. Term frequency works by looking at the frequency of a particular term you are concerned with relative to the document. There are multiple measures, or ways, of defining frequency:

    - Number of times the word appears in a document (raw count).
    - Term frequency adjusted for the length of the document (raw count of occurences divided by number of words in the document).
    - Logarithmically scaled frequency (e.g. log(1 + raw count)).
    - Boolean frequency (e.g. 1 if the term occurs, or 0 if the term does not occur, in the document).

- **Document frequency** is the number of documents containing a specific term. Document frequency indicates how common the term is.

- **Inverse document frequency (IDF)** is the weight of a term, it aims to reduce the weight of a term if the term’s occurrences are scattered throughout all the documents. IDF can be calculated as follow:

    <img width="781" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/e321a50a-138a-438b-9ee4-9320d21a8aed">
    
    Where idfᵢ is the IDF score for term i, dfᵢ is the number of documents containing term i, and n is the total number of documents. The higher the DF of a term, the lower the IDF for the term. When the number of DF is equal to n which means that the term appears in all documents, the IDF will be zero, since log(1) is zero, when in doubt just put this term in the stopword list because it doesn't provide much information. **What is IDF (inverse document frequency)?** Inverse document frequency looks at how common (or uncommon) a word is amongst the corpus. IDF is calculated as follows where t is the term (word) we are looking to measure the commonness of and N is the number of documents (d) in the corpus (D).. The denominator is simply the number of documents in which the term, t, appears in. The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they appear frequently in an English corpus. Thus by taking inverse document frequency, we can minimize the weighting of frequent terms while making infrequent terms have a higher impact. Finally IDFs can also be pulled from either a background corpus, which corrects for sampling bias, or the dataset being used in the experiment at hand.

    <img width="706" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6012bdc9-5847-4234-9e8c-987adfd2828e">
    
    Note: It can be possible for a term to not appear in the corpus at all, which can result in a divide-by-zero error. One way to handle this is to take the existing count and add 1. Thus making the denominator (1 + count). An example of how the  popular library scikit-learn handles this can be seen below.
    
    <img width="739" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/68d7d8f0-84a7-4b3c-a811-4695eae291d9">

- The **TF-IDF score** as the name suggests is just a multiplication of the term frequency matrix with its IDF, it can be calculated as follow:
    
    <img width="692" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/3d3217ff-fccb-4012-a962-b70c8d40c379">
    
    Where wᵢⱼ is TF-IDF score for term i in document j, tfᵢⱼ is term frequency for term i in document j, and idfᵢ is IDF score for term i. To summarize the key intuition motivating TF-IDF is the importance of a term is inversely related to its frequency across documents.TF gives us information on how often a term appears in a document and IDF gives us information about the relative rarity of a term in the collection of documents. By multiplying these values together we can get our final TF-IDF value.
    
    <img width="710" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6ca35452-1cca-4f6b-b695-253cd13fe27a">


**The higher the TF-IDF score the more important or relevant the term is; as a term gets less relevant, its TF-IDF score will approach 0.**


**Example:** Suppose we have 3 texts and we need to vectorize these texts using TF-IDF.

<img width="614" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/a4e32193-47f9-4291-bfd4-5a1f6046f481">

1. **Step 1:** Create a term frequency matrix where rows are documents and columns are distinct terms throughout all documents. Count word occurrences in every text.
    
    <img width="830" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/77e48f3a-13cf-4c51-bd45-32010ff239d7">

1. **Step 2:** Compute inverse document frequency (IDF) using the previously explained formula. The term i and processing has 0 IDF score, as previously mentioned we can drop these terms, but for the sake of simplicity, we keep these terms here.
    
    <img width="839" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/0bc6eec5-5317-4694-9ebb-c84f9c1b9d88">
    
1. **Step 3:** Multiply TF matrix with IDF respectively. That's it 😃! the text is now ready to feed into a machine learning algorithm.

     <img width="833" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/fd090353-4915-4f01-a88c-ee3e381e6382">


### <span style="color: #7b6b59;">Pros of using TF-IDF</span>

The biggest advantages of TF-IDF come from how simple and easy to use it is. It is simple to calculate, it is computationally cheap, and it is a simple starting point for similarity calculations (via TF-IDF vectorization + cosine similarity).


### <span style="color: #7b6b59;">Limitations, Cons of using TF-IDF</span>

1. It is only useful as a lexical level feature.

1. Synonymities are neglected.

1. It doesn't capture semantic. Something to be aware of is that TF-IDF cannot help carry semantic meaning. It considers the importance of the words due to how it weighs them, but it cannot necessarily derive the contexts of the words and understand importance that way.

1. The highest TF-IDF score may not make sense with the topic of the document, since IDF gives high weight if the DF of a term is low.

1. It neglects the sequence of the terms. Also as mentioned above, like BoW, TF-IDF ignores word order and thus compound nouns like “Queen of England” will not be considered as a “single unit”. This also extends to situations like negation with “not pay the bill” vs “pay the bill”, where the order makes a big difference. In both cases using NER tools and underscores, “queen_of_england” or “not_pay” are ways to handle treating the phrase as a single unit. No concept of word order: TF-IDF treats all words as equally important, regardless of their order or position in the document. This can be problematic for certain applications, such as sentiment analysis, where word order can be crucial for determining the sentiment of a document.

1. Another disadvantage is that it can suffer from memory-inefficiency since TF-IDF can suffer from the curse of dimensionality. Recall that the length of TF-IDF vectors is equal to the size of the vocabulary. In some classification contexts this may not be an issue but in other contexts like clustering this can be unwieldy as the number of documents increases. Thus looking into some of the above named alternatives (BERT, Word2Vec) may be necessary. **Vocabulary size:** The vocabulary size can become very large when working with large datasets, which can lead to high-dimensional feature spaces and difficulty in interpreting the results.

1. Assumes independence: TF-IDF assumes that the terms in a document are independent of each other. However, this is often not the case in natural language, where words are often related to each other in complex ways.

### <span style="color: #7b6b59;">Where to use TF-IDF</span>

As we can see, TF-IDF can be a very handy metric for determining how important a term is in a document. But how is TF-IDF used? There are three main applications for TF-IDF. These are in machine learning, information retrieval, and text summarization/keyword extraction.


1. **Using TF-IDF in machine learning & natural language processing:** Machine learning algorithms often use numerical data, so when dealing with textual data or any natural language processing (NLP) task, a sub-field of ML/AI dealing with text, that data first needs to be converted to a vector of numerical data by a process known as vectorization. TF-IDF vectorization involves calculating the TF-IDF score for every word in your corpus relative to that document and then putting that information into a vector (see images above). Thus each document in your corpus would have its own vector, and the vector would have a TF-IDF score for every single word in the entire collection of documents. ***Once you have these vectors you can apply them to various use cases such as seeing if two documents are similar by comparing their TF-IDF vector using cosine similarity.***

1. **Using TF-IDF in information retrieval:** TF-IDF also has use cases in the field of information retrieval, with one common example being search engines. Since TF-IDF can tell you about the relevant importance of a term based upon a document, a search engine can use TF-IDF to help rank search results based on relevance, with results which are more relevant to the user having higher TF-IDF scores.

1. **Using TF-IDF in text summarization & keyword extraction:** Since TF-IDF weights words based on relevance, one can use this technique to determine that the words with the highest relevance are the most important. This can be used to help summarize articles more efficiently or to simply determine keywords (or even tags) for a document. Measures relevance: TF-IDF measures the importance of a term in a document, based on the frequency of the term in the document and the inverse document frequency (IDF) of the term across the entire corpus. This helps to identify which terms are most relevant to a particular document.

1. **Interpretable:** The scores generated by TF-IDF are easy to interpret and understand, as they represent the importance of a term in a document relative to its importance across the entire corpus.

1. Works well with different languages: TF-IDF can be used with different languages and character encodings, making it a versatile technique for processing multilingual text data.

### <span style="color: #7b6b59;">Conclusion</span>

TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It’s a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks. This includes building search engines, summarizing documents, or other tasks in the information retrieval and machine learning domains.


### <span style="color: #7b6b59;">How to implement TF-IDF with scikit-learn</span>

1. Thanks to the `TfidfVectorizer` class, implementing TF-IDF with `scikit-learn` is a fairly straightforward process. The first step is importing `TfidfVectorizer` and creating a list of documents to analyze and convert into TF-IDF features.

1. Next, create an instance of the `TfidfVectorizer` class with the desired customization options, such as tokenization patterns, stopword removal or IDF smoothing parameters.

1. Then, to fit and transform the corpus, call the `fit_transform()` method on the vectorizer instance and pass in the corpus. This computes term frequencies and inverse document frequencies while transforming the text data into a matrix of TF-IDF features.

1. Finally, call `get_feature_names()` to inspect feature names and their corresponding TF-IDF values, then convert the variable to an array using toarray():

By following these steps, you can implement TF-IDF with scikit-learn and transform your raw text data into valuable numerical representations for further analysis or feeding into machine learning models.

When using the `TfidfVectorizer` from `scikit-learn`, **you do not necessarily need to tokenize the text yourself before passing it to the vectorizer**. TfidfVectorizer has built-in capabilities to tokenize and preprocess the text. Here's how it works by default:

1. **Default Tokenization in TfidfVectorizer:**

    - **Tokenization:** By default, TfidfVectorizer tokenizes the text by extracting word tokens and ignores punctuation and whitespace. This is typically done using a regular expression that defines what constitutes a token (word). The default pattern is `r"(?u)\b\w\w+\b"`, which captures sequences of alphanumeric characters (words) that are at least two characters long. This pattern is specified in the token_pattern argument. So while spaces between words usually signify where one word ends and another begins (and thus often correspond to word boundaries), the regex isn't splitting text directly on spaces. Instead, it's looking for those alphanumeric sequences that are bounded by non-word characters or the edges of the string, which more robustly constitutes what we think of as whole, standalone words. This method is more reliable because:
        - **It ignores punctuation:** For example, in "end-of-sentence.", the period is not part of the last word, and the pattern correctly excludes it from the token "sentence".
        - **It handles complex word separations:** Not all words are neatly separated by spaces, especially in languages with different scripts or in cases with punctuation like hyphens, apostrophes, etc. The pattern correctly identifies words in many of these cases.
        In summary, while spaces are a significant part of how the pattern determines where words begin and end, the actual process involves identifying sequences of word characters that are delineated by word boundaries, which provides a more nuanced and effective approach to word tokenization in varied text environments.
        
    - **Preprocessing:** It converts all characters to lowercase (unless you set lowercase=False) and performs normalization, such as accent stripping, if specified.

1. **Customization Options:**
    1. **Custom Tokenizer:** You can provide a custom tokenizer function to the `tokenizer` parameter. This function takes a string as input and returns a list of tokens. If you have specific tokenization needs (e.g., handling special cases, working with a non-standard text format), you might implement and use your custom tokenizer.

    1. **Custom Preprocessor:** Similarly, you can provide a custom preprocessing function to the `preprocessor` parameter. This function also takes a string as input and returns the processed string. It's applied to the text before tokenization.


***Should You Tokenize Beforehand?***

- **Usually Unnecessary:** For standard text processing needs, the default behavior of TfidfVectorizer is often sufficient. It is designed to handle typical cases of text vectorization, including tokenization and case normalization.

- **Custom Needs:** If your text data requires specialized handling, such as dealing with a particular language's nuances, handling mixed text types, or integrating with an existing text processing pipeline, you might perform tokenization (and other text preprocessing) before vectorization. In such cases, you could use the tokenizer and preprocessor parameters to integrate your custom functions.


1. **`ngram_range=(1, 3)`:** This parameter defines the range of n-gram sizes to include in the token counts. (1, 3) means that it will consider unigrams (single words), bigrams (two consecutive words), and trigrams (three consecutive words) as individual features for vectorization. Essentially, it's looking at the individual words, pairs of consecutive words, and triplets of consecutive words when creating the vectors.

1. **`sublinear_tf=True`:** This parameter applies sublinear tf scaling, i.e., it replaces term frequency (tf) with 1 + log(tf). The idea is to reduce the sensitivity of the vectorizer to terms that occur very frequently and therefore might skew the results disproportionately. It's a way to temper the effect of terms that appear very often and might dominate the feature set. By transforming the frequency to the logarithmic scale, increases in term frequency have a gradually smaller effect on the computation of TF-IDF.

1. **`lowercase=False`:** This indicates that the text will not be automatically converted to lowercase before tokenizing. By default, TfidfVectorizer converts all characters to lowercase to ensure that the same words in different cases are counted as the same token.

1. **`analyzer='word'`:** This parameter sets the unit of features to words. Other options might include 'char' or 'char_wb' for character n-grams. 'word' means it will consider tokens of words as the feature base.

1. **`tokenizer=dummy`:** This specifies a custom tokenizer function. Typically, TfidfVectorizer tokenizes the string by extracting words of at least two letters. By setting tokenizer to 'dummy', you are replacing the default tokenizer with your own custom function named dummy. This function will be used to split the text into tokens.

1. **`token_pattern=None`:** Normally, this parameter defines the regex pattern that the tokenizer uses to find tokens in the text string. By setting it to None, and providing a custom tokenizer, you're effectively ignoring the default regex pattern and relying entirely on the custom tokenizer you've provided.

1. **`preprocessor=dummy`:** Similar to the tokenizer, this specifies a custom pre-processing function. The default preprocessor in TfidfVectorizer takes care of removing accents and performing other cleaning steps. By setting it to 'dummy', you are specifying that your own custom function named dummy should be used for preprocessing the text.

1. **`strip_accents='unicode'`:** This is used to remove accents during the preprocessing step. 'unicode' is a method that works on any characters that have a direct Unicode equivalent. It's an effective way to standardize text by removing accents and diacritical marks that might lead to variations in how words are processed.



1. **Fitting the Vectorizer:** Initially, when we fit TfidfVectorizer to our documents (e.g., using vectorizer.fit(texts)), it learns the vocabulary of the corpus, meaning it identifies all unique terms used across all documents, considering the constraints and specifications we've given it (like token patterns, n-grams, etc.).

1. **Building the Vocabulary Dictionary:** After fitting, the vectorizer has a complete list of terms used in the documents. It then creates a mapping of these terms to specific indices. This mapping is stored in `vectorizer.vocabulary_`.
    
    - Keys: Each unique term or token found in the corpus.
    - Values: A unique integer index corresponding to each term. This index is used when creating the sparse matrix representation of the documents where each term's TF-IDF score will be placed.
    
    ```python 
{
    'galaxy': 123,
    'black hole': 15,
    'star cluster': 678,
    'nebula': 321,
    ...
}
```
    In this hypothetical vocabulary:

    - The term 'galaxy' is found at column index 123 in the TF-IDF matrix.
    - The term 'black hole' is found at column index 15, and so on.

1. When you **transform** your documents into their TF-IDF representation using the fitted vectorizer (via vectorizer.transform(texts)), each document is represented as a sparse vector with the length of the total vocabulary, where most values are zero except for the indices corresponding to the terms present in the document.


In [12]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3),sublinear_tf=True)
X = vectorizer.fit_transform(train_df["text"])

In [13]:
# Inspect feature names and TF-IDF values 
print(vectorizer.get_feature_names_out()) 

['00' '00 00' '00 00 and' ... '完全禁止使用手机应该是合法和道路安全的唯一选择'
 '完全禁止使用手机应该是合法和道路安全的唯一选择 保护所有道路使用者的安全'
 '完全禁止使用手机应该是合法和道路安全的唯一选择 保护所有道路使用者的安全 司机必须在驾驶时将全部注意力都集中在道路上']


In [14]:
# get the first vector out (for the first document) 
first_vector_tfidfvectorizer=X[0] 

# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

Unnamed: 0,tfidf
be in contact,0.067319
way how,0.059980
always on their,0.059508
are always on,0.055284
in contact,0.051028
...,...
further without,0.000000
further with you,0.000000
further with this,0.000000
further with their,0.000000


The `vocab = vectorizer.vocabulary_` line returns a Python dictionary from the fitted `TfidfVectorizer` object. The dictionary's keys are the terms (or tokens) found in the document corpus, and the values are the column indices of these terms in the resulting TF-IDF matrix.
The term 'galaxy' is found at column index 123 in the TF-IDF matrix.
The term 'black hole' is found at column index 15, and so on.  This vocabulary is crucial because it maintains a consistent mapping of terms to indices, ensuring that when you transform new documents into vectors, the terms align correctly with the learned model's features. It's essential for both understanding the feature space of your model and for preparing new text inputs for predictions or further analysis with the trained vectorizer.



In [15]:
# Getting vocab
vocab = vectorizer.vocabulary_


# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 2: Extracting embeddings from pre-trained models and Classic Machine Learning (ML) Model - Transfer Learning without Fine-Tuning</div>



Instead of fine-tuning a pre-trained model, you could use it as a feature extractor. For instance, you can pass your documents through a pre-trained model (like BERT) to get embeddings and then train a simpler machine learning model (like Logistic Regression) on those features.

**Extracting embeddings from pre-trained BERT| Huggingface Transformers**

## <span style="color: #7b6b59;">Overview</span>


The need for standardization in training models and using the language model, Hugging Face, was found.NLP is democratized by Hugging Face, where the constructed API allows easy access to pre-trained models, datasets, and tokens. This Hugging Face's transformers library generates embeddings, and we use the pre-trained BERT model to extract the embeddings.

## <span style="color: #7b6b59;">How to use embeddings for feature extraction?</span>

Now, let’s talk about how you can use BERT with your text: The BERT Model learns complex understandings of the English language, which can help you extract different aspects of text for various tasks. If you have a set of sentences with labels, you can train a regular classifier using the information produced by the BERT Model as input (the text). To obtain the features of a particular text using this model in TensorFlow see the code below.

**How to use embeddings to extract information from text column?**

We are going to take advantage of the incredible hugging face 🤗 framework to extract information from this feature.

1. **Step 1:** First, we need to import the model and the tokenizer: There are different models that we can try, and you check them here: https://huggingface.co/models?pipeline_tag=feature-extraction It is important to use the model’s tokenizer so that it receives the data in a proper format and they are also useful since they already clean up the data for you. Each tokenizer will have different ways of dealing with the data, therefore it is important to read about them.

1. **Step 2:** Second, we extract the hidden state associated to the token CLS which represents an entire sequence of text and rather than dealing with a 768 array for each token in a string, we just need to deal with one (the 768 dimension varies from model to model).





In [16]:
# Step 1: We need to import the model and the tokenizer
#tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
#model = TFBertModel.from_pretrained("bert-base-cased")

from transformers import AutoModel, AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)


#custom_text = "You are welcome to utilize any text of your choice."
#encoded_input = tokenizer(custom_text, return_tensors='tf')
#output_embeddings = model(encoded_input)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [17]:
# Step 2: we extract the hidden state associated to the token CLS
#train_df["embeddings"] = train_df["text"].apply(lambda x: model(**tokenizer(x, return_tensors="pt", truncation=True)).last_hidden_state[:,0,:].detach().numpy()[0])

This piece of code is a straightforward example of how to use the BERT tokenizer and model from the Hugging Face `transformers `library for encoding text into embeddings. Here's a breakdown of what each part does:

1. **Importing the Necessary Classes:**
    - **BertTokenizer:** A tokenizer class for BERT. It handles the conversion from text to tokens that BERT understands.
    - **TFBertModel:** The BERT model class compatible with TensorFlow.

1. **Using the BERT Tokenizer and Model:**
    
    1. **Load Pre-trained Models:**
        - `tokenizer = BertTokenizer.from_pretrained('bert-base-cased')`: Loads the BERT tokenizer for the 'bert-base-cased' version. This tokenizer is responsible for breaking the text down into tokens that BERT can understand.
        - `model = TFBertModel.from_pretrained("bert-base-cased")`: Loads the pre-trained BERT model. This model will generate embeddings for the input text.

1. **Prepare Custom Text:**

    - `custom_text = "You are welcome to utilize any text of your choice."`: A sample text that you want to convert into embeddings.

1. **Tokenize the Text:**
    - `encoded_input = tokenizer(custom_text, return_tensors='tf')`: The tokenizer converts the text into a format suitable for the BERT model. The `return_tensors='tf'` argument tells the tokenizer to return TensorFlow tensors.

1. **Generate Embeddings:**

    - `output_embeddings = model(encoded_input)`: Passes the tokenized input to the BERT model. The model returns the embeddings, which are a rich, contextual representation of each token in the input text.

1. **Understanding the Output:** The `output_embeddings` returned by the model is typically a complex structure containing several types of embeddings:

    - **Last Hidden State:** The output corresponding to the last layer of the BERT model, which gives you the embeddings for each token in the input sequence.
    - **Pooler Output:** A pooled output of the last hidden state, which represents the entire input sequence, often used in classification tasks.

To print the dimensions of the output_embeddings, you would typically focus on these two parts. Here is how you can do it:

In these lines of code:

- `output_embeddings.last_hidden_state.shape` will give you the dimensions of the last hidden state, which is usually of the form `[batch_size, sequence_length, hidden_size]`.
- `output_embeddings.pooler_output.shape` will give you the dimensions of the pooled output, typically `[batch_size, hidden_size]`.
Understanding these dimensions:

- **batch_size:** The number of sequences processed at a time (for your case, it will be 1 as you're processing a single sentence).
- **sequence_length:** The length of the tokenized input (number of tokens).
- **hidden_size:** The size of the hidden layers in the BERT model. For 'bert-base-cased', it is usually 768.



# <div style="padding:20px;color:white;margin:0;font-size:24px;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Advanced Approaches: Fine-tune a pre-trained model</div>

## <span style="color: #7b6b59;">Introduction</span>

**What does fine-tuning a pre-trained model mean?** 

The fine-tuning technique is used to optimize a model’s performance on a new or different task. It is used to tailor a model to meet a specific need or domain, say cancer detection, in the field of healthcare. Pre-trained models are fine-tuned by training them on large amounts of labeled data for a certain task, such as Natural Language Processing (NLP) or image classification. Once trained, the model can be applied to similar new tasks or datasets with limited labeled data by fine-tuning the pre-trained model.

The fine-tuning process is commonly used in transfer learning, where a pre-trained model is used as a starting point to train a new model for a contrasting but related task. A pre-trained model can significantly diminish the labeled data required to train a new model, making it an effective tool for tasks where labeled data is scarce or expensive.

**How does fine-tuning pre-trained models work?**

Fine-tuning a pre-trained model works by updating the parameters utilizing the available labeled data instead of starting the training process from the ground up. The following are the generic steps involved in fine-tuning:

1. **Loading the pre-trained model:** The initial phase in the process is to select and load the right model, which has already been trained on a large amount of data, for a related task.

1. **Modifying the model for the new task - Adjust the Architecture:** Once a pre-trained model is loaded, its top layers must be replaced or retrained to customize it for the new task. Adapting the pre-trained model to new data is necessary because the top layers are often task specific. After selecting the pre-trained model, you need to make modifications to the model’s architecture to fit the requirements of your specific task. This typically involves modifying the top layers of the model. For example, you may need to change the number of output neurons in the final layer to match the number of classes in your classification task.

1. **Freezing particular layers:** The earlier layers facilitating low-level feature extraction are usually frozen in a pre-trained model. Since these layers have already learned general features that are useful for various tasks, freezing them may allow the model to preserve these features, avoiding overfitting the limited labeled data available in the new task. Depending on the complexity of your task and the size of your dataset, you can choose to freeze some layers in the pre-trained model. Freezing a layer means preventing it from updating its weights during the fine-tuning process. This can be beneficial if the lower layers of the pre-trained model have already learned general features that are useful for your task. On the other hand, unfreezing allows the corresponding layers to adapt to the new data during fine-tuning.

1. **Training the new layers:** With the labeled data available for the new task, the newly created layers are then trained, all the while keeping the weights of the earlier layers constant. As a result, the model’s parameters can be adapted to the new task, and its feature representations can be refined. Once you have adjusted the architecture and decided which layers to freeze or unfreeze, it’s time to train the modified model on your task-specific dataset. During training, it’s advisable to use a smaller learning rate than what was used in the initial pre-training phase. This helps prevent drastic changes to the already learned representations while allowing the model to adapt to the new data.

1. **Fine-tuning the model:** Once the new layers are trained, you can fine-tune the entire model on the new task using the available limited data. Every task and dataset is unique, and it may require further experimentation with hyperparameters, loss functions, and other training strategies. Fine-tuning is not a one-size-fits-all approach, and you may need to iterate and fine-tune your fine-tuning strategy to achieve optimal results.

**Understanding fine-tuning with an example**

Suppose you have a pre-trained model trained on a wide range of medical data or images that can detect abnormalities like tumors and want to adapt the model for a specific use case, say identifying a rare type of cancer, but you have a limited set of labeled data available. In such a case, you must fine-tune the model by adding new layers on top of the pre-trained model and training the newly added layers with the available data. Typically, the earlier layers of a pre-trained model, which extract low-level features, are frozen to prevent overfitting.

**Best practices to follow when fine-tuning a pre-trained model**

While fine-tuning a pre-trained model, several best practices can help ensure successful outcomes. Here are some key practices to follow:

1. **Understand the pre-trained model:** Gain a comprehensive understanding of the pre-trained model architecture, its strengths, limitations, and the task it was initially trained on. This knowledge can enhance the fine-tuning process and help make appropriate modifications.

1. **Select a relevant pre-trained model:** Choose a pre-trained model that aligns closely with the target task or domain. A model trained on similar data or a related task will provide a better starting point for fine-tuning.

1. **Freeze early layers:** Typically, the lower layers of a pre-trained model capture generic features and patterns. Freeze these early layers during fine-tuning to preserve the learned representations. This practice helps prevent catastrophic forgetting and lets the model focus on task-specific fine-tuning.

1. **Adjust learning rate**: Experiment with different learning rates during fine-tuning. It is typical to use a smaller learning rate compared to the initial pre-training phase. A lower learning rate allows the model to adapt more gradually and prevent drastic changes that could lead to overfitting.

1. **Utilize transfer learning techniques:** Transfer learning methods can enhance fine-tuning performance. Techniques like feature extraction, where pre-trained layers are used as fixed feature extractors, or gradual unfreezing, where layers are unfrozen gradually during training, can help preserve and transfer valuable knowledge.

1. **Regularize the model:** Apply regularization techniques, **such as dropout or weight decay,** during fine-tuning to prevent overfitting. Regularization helps the model generalize better and reduces the risk of memorizing specific training examples.

1. **Monitor and evaluate performance:** Continuously monitor and evaluate the performance of the fine-tuned model on validation or holdout datasets. Use appropriate evaluation metrics to assess the model’s progress and make informed decisions on further fine-tuning adjustments.

1. **Data augmentation:** Augment the training data by applying transformations, perturbations, or adding noise. Data augmentation can increase the diversity and generalizability of the training data, leading to better fine-tuning results.

1. **Consider domain adaptation:** If the target task or domain significantly differs from the pre-training data, consider domain adaptation techniques. These methods aim to bridge the gap between the pre-training data and the target data, improving the model’s performance on the specific task.

1. **Regularly backup and save checkpoints:** Save model checkpoints at regular intervals during fine-tuning to ensure progress is saved and prevent data loss. This practice allows for easy recovery and enables the exploration of different fine-tuning strategies.

There are two ways to do it: Since we are looking to fine-tune the model for a downstream task like classification, we can directly use:

### <span style="color: #7b6b59;">1. A simple way</span>

**Fine-tuning pretrained NLP models with Huggingface’s Trainer:** *A simple way to fine-tune pretrained NLP models without native Pytorch or Tensorflow*

While working on a data science competition, I was fine-tuning a pre-trained model and realised how tedious it was to fine-tune a model using native PyTorch or Tensorflow. I experimented with Huggingface’s **Trainer API** and was surprised by how easy it was.

- **Train Our Classification Model:** Now that our input data is properly formatted, it’s time to fine tune the pre-trained model, for instance a BERT model.
    - For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then we want to continue training the model on our dataset until that the entire model, end-to-end, is well-suited for our task.

    - **Classification Head:** Finally, the output from the pooler is passed through the classification head, which simply involves projecting the pooled embedding into a space with dimensionality equal to the number of different classes. It is called a head because this component of the model can be swapped out to suit a particular task. This is in contrast to the backbone of BERT — responsible for creating the contextualized representations of the tokens in the sequence — that remains the same regardless of the task.
    - Thankfully, the huggingface pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accomodate their specific NLP task. [Here](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html) is the current list of classes provided for fine-tuning.
    
    - The `BertForSequenceClassification` class is the outermost class that we call to instantiate our BERT model. It houses both the base architecture (self.bert) and the classification head (self.classifier). The outputs are the logits for which there is one value for each class. Taking the maximum value of these logits will give us the predicted class. However, if it is desired to interpret the logits as probabilities the softmax function will need to be applied. `BertForSequenceClassification` performs fine-tuning of logistic regression layer on the output dimension of 768.
   
    - We’ll be using `BertForSequenceClassification`. This is the normal BERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer is trained on our specific task.
    
    - So, in summary, we fine-tune the entire pre-trained BERT model, including the last layers specifically designed for our classification task. It adjusts the model to our document classification problem using the data we provide, but it doesn't train the model entirely from scratch. The distinction is that "from scratch" would mean initializing all the model's weights randomly and learning them solely from our data, which usually requires a much larger dataset and more computational resources. Here, we're leveraging the general understanding already embedded in the BERT model from its pre-training, which provides a significant head start for most NLP tasks.

### <span style="color: #7b6b59;">2. Adding Custom Layers on Top of a Hugging Face Model</span>

Alternatively, we can define a custom module, that created a bert model based on the pre-trained weights and adds layers on top of it.



# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 3: Fine-tune a pre-trained model with 🤗 Transformers</div>

## <span style="color: #7b6b59;">1. Introduction</span>

There are significant benefits to using a pretrained model. It reduces computation costs, your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. **When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning,** an incredibly powerful training technique. In this section, we will fine-tune a pretrained model with a deep learning framework of our choice:

- Fine-tune a pretrained model with 🤗 Transformers PyTorch Trainer.
- Fine-tune a pretrained model in TensorFlow with Keras.
- Fine-tune a pretrained model in native PyTorch.

## <span style="color: #7b6b59;">2. Create a dataset or Prepare the dataset</span>

- **From in-memory data:** Eventually, it’s also possible to instantiate a datasets.Dataset directly from in-memory data, currently one or:
    - a python dict, or
    - a pandas dataframe.

A `datasets.Dataset` instance is more precisely a table with rows and columns in which the columns are typed. Querying an example (a single row) will thus return a python dictionary with keys corresponding to columns names, and values corresponding to the example’s value for each column.

You can get the number of rows and columns of the dataset with various standard attributes. 

Sometimes, you may need to create a dataset if you’re working with your own data. Creating a dataset with **🤗 Datasets confers all the advantages of the library to your dataset: fast loading and processing, stream enormous datasets, memory-mapping, and more.** You can easily and rapidly create a dataset with 🤗 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as dragging and dropping your data files into a dataset repository on the Hub.

Creating a `Dataset` object from our dataset when fine-tuning a pre-trained model with Hugging Face Transformers is important for several reasons:

1. **Efficiency:** The `Dataset` object is optimized for performance. It enables efficient data loading, preprocessing, and iteration, which is crucial when dealing with large datasets common in NLP tasks.

1. **Easy Integration:** Hugging Face Transformers and Datasets libraries are designed to work together seamlessly. By using a `Dataset` object, we can directly apply transformations, tokenization, and batching, which are necessary for preparing our data for the model.

1. **Consistency and Reproducibility:** Creating a `Dataset` object ensures that data processing steps are consistent. This is important for reproducibility of results, a key aspect of any scientific experiment. You can share your dataset with others, and they'll be able to achieve the same results using the same preprocessing steps.

1. **Advanced Features:** The Dataset object comes with many advanced features like easy slicing, indexing, and even complex transformations. It supports operations like `map`, `filter`, and `shuffle`, which are essential for training neural networks.

1. **Scalability:** Datasets in Hugging Face are designed to be scalable. They can handle datasets much larger than your system's RAM and facilitate distributed training by efficiently managing memory and processing.

1. **Community Standards:** Using widely adopted standards like the Dataset object from Hugging Face ensures that your work is accessible and understandable by a broader community. It also makes it easier for you to use datasets and models shared by others.

In essence, **using a `Dataset` object simplifies the data preprocessing pipeline, ensures efficient and reproducible training, and aligns your work with community practices.**




In [18]:
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)

print(f"The shape of the train dataset is: {train_dataset.shape}")
print("---------------------------------------------------")

print(f"The number of columns in the train dataset is: {train_dataset.num_columns}")
print(f"The column names are: {train_dataset.column_names}")
print(f"The columns' detailed types are: {train_dataset.features}")

print("---------------------------------------------------")
print(f"The number of rows in the train dataset is: {train_dataset.num_rows}")
print(f"Or the length of the train dataset is: {len(train_dataset)}")

The shape of the train dataset is: (40151, 5)
---------------------------------------------------
The number of columns in the train dataset is: 5
The column names are: ['text', 'label', 'prompt_name', 'source', 'RDizzl3_seven']
The columns' detailed types are: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), 'prompt_name': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'RDizzl3_seven': Value(dtype='bool', id=None)}
---------------------------------------------------
The number of rows in the train dataset is: 40151
Or the length of the train dataset is: 40151


In [19]:
# While you can access a single row with the train_dataset[i] pattern, 
# you can also access several rows using slice notation or with a list of indices (or a numpy/torch/tf array of indices):
#print(train_dataset[1])
#print("--------------------------------\n")
#print(train_dataset[:2])
#print("--------------------------------\n")
#print(train_dataset["text"][:2])

## <span style="color: #7b6b59;">3. Initialise pre-trained model and tokenizer</span>

Before we can fine-tune a pretrained model, we have to prepare it for training. As you now know, we need a tokenizer to process the text and include a **padding** and **truncation** strategy to handle any variable sequence lengths. To process our dataset in one step, use 🤗 Datasets `map` method to apply a preprocessing function over the entire dataset:

To feed our text to deberta, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.

The tokenization must be performed by the tokenizer included with deberta–the below cell will download this for us.

In [20]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall", use_fast=True)

tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



Since we are using a pretrained model, we need to ensure that the input data is in the same form as what the pretrained model was trained on. Thus, we would need to instantiate the tokenizer using the name of the model.

Now that the model and tokenizer have been initialised, we can proceed to preprocess the data.

**Preprocess text using pretrained tokenizer**

Let us preprocess the text using the tokenizer intialised earlier.

The input text that we are using for the tokenizer is a list of strings.

We have set `padding=True`, `truncation=True`, `max_length=128` so that we can get same length inputs for the model- the long texts will be truncated to 128 tokens while the short texts will have extra tokens added to make it 128 tokens.

128 tokens is used because this is the maximum token length that the pre-trained model can take.

After tokenizing your text, you will get a python dictionary with 3 keys:

- Input_ids
- token_type_ids
- attention_mask



In [21]:
def tokenize_function(samples):
    return tokenizer(samples["text"], max_length=128, padding=True, truncation=True)

In [22]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_valid_dataset = valid_dataset.map(tokenize_function, batched=True)

  0%|          | 0/41 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

## <span style="color: #7b6b59;">4. Train with PyTorch Trainer</span>

🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

1. **Start by loading your model and specify the number of expected labels.**: You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.
1. **Training hyperparameters:** Next, create a `TrainingArguments` class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with these to find your optimal settings.
1. **Evaluate:** `Trainer` does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics. 
1. **Trainer:** Create a `Trainer` object with your model, training arguments, training and test datasets, and evaluation function. Then fine-tune your model by calling `train()`.

For this task, we first want to modify the pre-trained Deberta model to give outputs for classification, and then we want to continue training the model on our dataset until that the entire model, end-to-end, is well-suited for our task.

Thankfully, the huggingface pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained Deberta model, each has different top layers and output types designed to accomodate their specific NLP task. We’ll be using `AutoModelForSequenceClassification`. This is the normal Deberta model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer is trained on our specific task. Have also a look on [BertForSequenceClassification source code](https://huggingface.co/transformers/v3.0.2/_modules/transformers/modeling_bert.html#BertForSequenceClassification).


In [23]:
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-xsmall", num_labels=2)
model.cuda()

pytorch_model.bin:   0%|          | 0.00/241M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-xsmall and are newly initialized: ['classifier.bias', 'pooler.dense.weight', 'pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 384, padding_idx=0)
      (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=384, out_features=384, bias=True)
              (key_proj): Linear(in_features=384, out_features=384, bias=True)
              (value_proj): Linear(in_features=384, out_features=384, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine

In [24]:
metric_name = "roc_auc"
train_batch_size = 4
eval_batch_size = 32
grad_acc = 4
num_steps = len(train_df) // (train_batch_size * grad_acc)
num_steps

2509

**Defining TrainingArguments and Trainer**

Here is where the magic of the Trainer function is. We can define the training parameters in the TrainingArguments and Trainer class as well as train the model with a single command.

We need to first define a function to calculate the metrics of the validation set. Since this is a binary classification problem, we can use accuracy, precision, recall and f1 score.

Next, we specify some training parameters, set the pretrained model, train data and evaluation data in the TrainingArgs and Trainer class.

After we have defined the parameters , simply run `trainer.train()` to train the model.

In [25]:
training_args = TrainingArguments(
    output_dir="deberta-v3-xsmall_finetuned",
    evaluation_strategy="steps", # If you’d like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter in your training arguments to report the evaluation metric at the end of each epoch
    save_strategy = "steps",
    eval_steps = num_steps // 3,
    save_steps = num_steps // 3,
    learning_rate=2e-5,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=grad_acc,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
    report_to='none', # change to wandb after enabling internet access
)

In [26]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
    auc = roc_auc_score(labels, probs[:,1], multi_class='ovr')
    return {"roc_auc": auc}


In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [28]:
#trainer.train()

## <span style="color: #7b6b59;">5. Making prediction</span>

After the model is trained, we repeat the same steps for the test data:

1. Tokenize test data with pretrained tokenizer
1. Create torch dataset
1. Load trained model
1. Define Trainer

To load the trained model from the previous steps, set the model_path to the path containing the trained model weights.

To make prediction, only a single command is needed as well `test_trainer.predict(test_dataset)` .

After making a prediction, you will only get the raw prediction. Additional preprocessing steps will be needed to get it to a usable format.

Since the task is just a simple sequence classification task, we can just obtain the argmax across axis 1. Note that other NLP tasks may require different ways to preprocess the raw predictions.

In [29]:
# test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
# test_ds = Dataset.from_pandas(test)
# test_ds_enc = test_ds.map(tokenize_function, batched=True)


In [30]:
#test_preds = trainer.predict(test_ds_enc)

In [31]:
# logits = test_preds.predictions
# probs = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
# sub = pd.DataFrame()
# sub['id'] = test['id']
# sub['generated'] = probs[:,1]
# sub.to_csv('submission.csv', index=False)


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Approach 4: Fine-tuning by Adding Custom Layers on Top of a Hugging Face Pre-trained Model</div>


We are accustomed to the canonical way of fine-tuning: append just an additional output layer after Transformer for downstream tasks or back-end part of models which takes representations from the last layer of the pre-trained language models as the default input.

However, due to the multi-layer structure of Transformers, different layers capture different levels of representations. They learn a rich hierarchy of linguistic information i.e. with surface features in lower layers, syntactic features in middle layers, and semantic features in higher layers.

<img width="1023" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6a64afa0-0276-4706-a05a-706153b06507">

The BERT authors tested word-embedding strategies by feeding different vector combinations as input features to a BiLSTM used on a named entity recognition task and observing the resulting F1 scores. Concatenation of the last four layers produced the best results.

This is partially demonstrated by noting that the different layers of BERT encode very different kinds of information, so the appropriate pooling strategy will change depending on the application because different layers encode different kinds of information. This holds true for other variants as well.


The notebook will show many different ways these outputs and hidden representations can be utilized to do much more than just adding an output layer. Below are the various techniques we will be implementing.

# <span style="color: #7b6b59;">Step 4.1: Utils</span>


In [32]:
OUTPUT_DIR = "./"

if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

The `seed_everything` function is designed to set a fixed seed for various random number generators across different libraries and configurations in Python, to ensure that the execution of the code is deterministic. This is particularly useful in machine learning experiments where reproducibility of results is important. Several operations in machine learning and data processing introduce randomness, which can lead to different outcomes in different runs if the random number generators (RNGs) are not controlled. Here are some common sources of randomness:

1. **Data Shuffling:** In most machine learning workflows, datasets are shuffled before training to ensure that the model does not learn any unintended patterns from the order of the data. This shuffling process is random.

1. **Weight Initialization:** Neural networks and many other models initialize their parameters (weights) randomly. This random initialization can lead to different starting points for the training process, affecting the final model.

1. **Mini-batch Selection:** During training, especially with stochastic gradient descent (SGD) and its variants, data is often divided into mini-batches randomly for each training epoch. The selection of samples for each mini-batch introduces randomness.

1. **Dropout:** Dropout is a regularization technique used in neural networks where a random subset of neurons is "dropped" (i.e., their output is temporarily set to zero) during each training iteration to prevent overfitting.

1. **Ensemble Methods:** Some ensemble methods, like Random Forests or bagging techniques, rely on randomness to create diversity among the models they combine. For example, Random Forests use bootstrap sampling (sampling with replacement) and random feature selection for splitting nodes.

1. **Random Seeds in Algorithms:** Some machine learning algorithms, such as k-means clustering or algorithms that involve stochastic optimization, use random seeds to initiate processes.

1. **Data Augmentation:** In deep learning, particularly in computer vision tasks, data augmentation techniques (like random rotations, flipping, cropping, etc.) are used to artificially expand the training dataset by applying random transformations to the original images.

1. **Exploration in Reinforcement Learning:** Reinforcement learning algorithms often include an exploration mechanism where actions are chosen randomly to explore the environment, as opposed to exploiting the currently known best strategy.



In [33]:
def seed_everything(seed=42):
    """
    Seed all possible random number generators to ensure reproducibility of results.

    This function sets a fixed seed for the random number generators in the `random`, `numpy`, and `torch` libraries,
    and also ensures deterministic behavior in CUDA operations if PyTorch is used with CUDA.

    Args:
        seed (int, optional): The seed value to use for all random number generators. Defaults to 42.
    """
    random.seed(seed)  # Seed Python's built-in random module.
    os.environ['PYTHONHASHSEED'] = str(seed)  # Set PYTHONHASHSEED environment variable to ensure reproducibility when hashing objects in Python.
    np.random.seed(seed)  # Seed NumPy's random number generator.
    torch.manual_seed(seed)  # Seed PyTorch's random number generator for CPU operations.
    torch.cuda.manual_seed(seed)  # Seed PyTorch's random number generator for CUDA (GPU) operations.
    
    # Ensure that CUDA operations are deterministic. This might impact performance.
    torch.backends.cudnn.deterministic = True


In [34]:
# Example usage:
seed_everything(seed=42)

In [35]:
def get_logger(filename=OUTPUT_DIR+'train'):
    """
    Creates a logger that outputs log messages to both the console and a file.

    This function configures a logger to write log messages with the INFO level
    and above to both the standard output stream (console) and a specified log file.
    The format of the log messages is set to display the message content only.

    Args:
        filename (str): The base name of the log file. The '.log' extension will be appended
                        to this base name. Defaults to OUTPUT_DIR+'train', where OUTPUT_DIR
                        is assumed to be a predefined directory path.

    Returns:
        logging.Logger: A configured logger object.
    """

    # Create a logger with the name of the current module
    logger = getLogger(__name__)
    
    # Set the logger's severity level to INFO
    logger.setLevel(INFO)
    
    # Create a stream handler to output log messages to the console
    handler1 = StreamHandler()
    # Set the format for the stream handler to display only the message content
    handler1.setFormatter(Formatter("%(message)s"))
    
    # Create a file handler to output log messages to a file, appending '.log' to the filename
    handler2 = FileHandler(filename=f"{filename}.log")
    # Set the format for the file handler to display only the message content
    handler2.setFormatter(Formatter("%(message)s"))
    
    # Add the stream and file handlers to the logger
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    
    # Return the configured logger
    return logger

In [36]:
# Initialize the LOGGER by calling get_logger function without specifying a filename
# It uses the default filename derived from OUTPUT_DIR+'train'
LOGGER = get_logger()


# <span style="color: #7b6b59;">Step 4.2: Data Loading</span>


In [37]:
train = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
train.rename(columns={'generated': 'label'}, inplace=True)

test = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/test_essays.csv")
submission = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv")

print(f"train.shape: {train.shape}")
display(train.head())
print(f"test.shape: {test.shape}")
display(test.head())
print(f"submission.shape: {submission.shape}")
display(submission.head())

train.shape: (1378, 4)


Unnamed: 0,id,prompt_id,text,label
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


test.shape: (3, 3)


Unnamed: 0,id,prompt_id,text
0,0000aaaa,2,Aaa bbb ccc.
1,1111bbbb,3,Bbb ccc ddd.
2,2222cccc,4,CCC ddd eee.


submission.shape: (3, 2)


Unnamed: 0,id,generated
0,0000aaaa,0.1
1,1111bbbb,0.9
2,2222cccc,0.4


# <span style="color: #7b6b59;">Step 4.3: k-fold Cross-Validation</span>


## <span style="color: #7b6b59;">Cross-Validation</span>

In machine learning (ML), generalization usually refers to the ability of an algorithm to be effective across various inputs. It means that the ML model does not encounter performance degradation on the new inputs from the same distribution of the training data.

For human beings generalization is the most natural thing possible. We can classify on the fly. For example, we would definitely recognize a dog even if we didn’t see this breed before. Nevertheless, it might be quite a challenge for an ML model. That’s why checking the algorithm’s ability to generalize is an important task that requires a lot of attention when building the model.

To do that, we use Cross-Validation (CV). There is always a need to validate the stability of your machine learning model. I mean you just can’t fit the model to your training data and hope it would accurately work for the real data it has never seen before. You need some kind of assurance that your model has got most of the patterns from the data correct, and its not picking up too much on the noise, or in other words its low on bias and variance.

Cross Validation is a very useful technique:

- for **assessing the effectiveness of your model**, particularly in cases where you need to mitigate overfitting. 
- It is also of use in **determining the hyper parameters of your model, in the sense that which parameters will result in lowest test error.** 

This is all the basic you need to get started with cross validation. You can get started with all kinds of validation techniques using `Scikit-Learn`, that gets you up and running with just a few lines of code in python.

### <span style="color: #7b6b59;">What is cross-validation?</span>

This process of deciding whether the numerical results quantifying hypothesized relationships between variables, are acceptable as descriptions of the data, is known as validation. Generally, an error estimation for the model is made after training, better known as evaluation of residuals. In this process, a numerical estimate of the difference in predicted and original responses is done, also called the training error. However, this only gives us an idea about how well our model does on data used to train it. **Now its possible that the model is underfitting or overfitting the data. So, the problem with this evaluation technique is that it does not give an indication of how well the learner will generalize to an independent/ unseen data set. Getting this idea about our model is known as Cross Validation.**

- **Cross-validation is a technique for evaluating a machine learning model and testing its performance.** CV is commonly used in applied ML tasks. It helps to compare and select an appropriate model for the specific predictive modeling problem.

CV is easy to understand, easy to implement, and it tends to have a lower bias than other methods used to count the model’s efficiency scores. All this makes cross-validation a powerful tool for selecting the best model for the specific task.

There are a lot of different techniques that may be used to cross-validate a model. Still, all of them have a similar algorithm:

1. **Divide** the dataset into two parts: one for training, other for testing
1. **Train** the model on the training set
1. **Validate** the model on the test set
1. **Repeat** 1-3 steps a couple of times. This number depends on the CV method that you are using

As you may know, there are plenty of CV techniques. Some of them are commonly used, others work only in theory. 

1. **Hold-out**
1. **K-folds**
1. **Stratified K-folds**
1. **Repeated K-folds**
1. **Nested K-folds**
1. **Time series CV**

Above listed validation techniques are also referred to as **Non-exhaustive cross validation methods**. These do not compute all ways of splitting the original sample, i.e. you just have to decide how many subsets need to be made. Also, these are approximations of method listed below, also called **Exhaustive Methods,** that computes all possible ways the data can be split into training and test sets.

1. **Leave-one-out**
1. **Leave-p-out**

### <span style="color: #7b6b59;">Cross-Validation Techniques</span>

1. **Hold-out cross-validation:** Hold-out cross-validation is the simplest and most common technique. You might not know that it is a hold-out method but you certainly use it every day. *We usually use the hold-out method on large datasets as it requires training the model only once.* It is really easy to implement hold-out. The error estimation then tells how our model is doing on unseen data or the validation set. For example, you may do it using `sklearn.model_selection.train_test_split`. The algorithm of hold-out technique:

    - Divide the dataset into two parts: the **training set** and the **test set**. Usually, 80% of the dataset goes to the training set and 20% to the test set but you may choose any splitting that suits you better
    - Train the model on the training set
    - Validate on the test set
    - Save the result of the validation
    
    1. Disadvantages:
        - For example, a dataset that is not completely even distribution-wise. If so we may end up in a rough spot after the split. For example, the training set will not represent the test set. Both training and test sets may differ a lot, one of them might be easier or harder.  This is a simple kind of cross validation technique, also known as the holdout method. Although this method doesn’t take any overhead to compute and is better than traditional validation, it still suffers from issues of high variance. This is because it is not certain which data points will end up in the validation set and the result might be entirely different for different sets.
        - Moreover, the fact that we test our model only once might be a bottleneck for this method. Due to the reasons mentioned before, the result obtained by the hold-out technique may be considered inaccurate. 

1. **k-Fold cross-validation:** k-Fold cross-validation is a technique that minimizes the disadvantages of the hold-out method. k-Fold introduces a new way of splitting the dataset which helps to overcome the “test only once bottleneck”. As there is never enough data to train your model, removing a part of it for validation poses a problem of underfitting. **By reducing the training data, we risk losing important patterns/ trends in data set, which in turn increases error induced by bias.** So, what we require is a method that provides ample data for training the model and also leaves ample data for validation. K Fold cross validation does exactly that. In K Fold cross validation, the data is divided into k subsets. Now the holdout method is repeated k times, such that each time, one of the k subsets is used as the test set/ validation set and the other k-1 subsets are put together to form a training set. The error estimation is averaged over all k trials to get total effectiveness of our model. As can be seen, every data point gets to be in a validation set exactly once, and gets to be in a training set k-1 times. This significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set. Interchanging the training and test sets also adds to the effectiveness of this method. **As a general rule and empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed and it can take any value.**

    The algorithm of the k-Fold technique:
    
    <img width="500" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/376f6016-01a3-499d-816f-a677f544bf2e">
    
    - Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is less than the dataset’s length.
    - Split the dataset into k equal (if possible) parts (they are called folds)
    - Choose k – 1 folds as the training set. The remaining fold will be the test set
    - Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3 – 6 k times. Each time use the remaining  fold as the test set. 
    - In the end, you should have validated the model on every fold that you have. To get the final score average the results that you got on step 6.
    
    To perform k-Fold cross-validation you can use `sklearn.model_selection.KFold`. In general, it is always better to use k-Fold technique instead of hold-out. In a head to head, comparison k-Fold gives a more stable and trustworthy result since training and testing is performed on several different parts of the dataset. We can make the overall score even more robust if we increase the number of folds to test the model on many different sub-datasets.
    
    1. Disadvantages:
        - Still, k-Fold method has a disadvantage. Increasing k results in training more models and the training process might be really expensive and time-consuming.

1. **Stratified k-Fold cross-validation:** Sometimes we may face a large imbalance of the target value in the dataset. For example, in a dataset concerning wristwatch prices, there might be a larger number of wristwatch having a high price. In the case of classification, in cats and dogs dataset there might be a large shift towards the dog class. **Stratified k-Fold is a variation of the standard k-Fold CV technique which is designed to be effective in such cases of target imbalance.** In some cases, there may be a large imbalance in the response variables. For example, in dataset concerning price of houses, there might be large number of houses having high price. Or in case of classification, there might be several times more negative samples than positive samples. For such problems, a slight variation in the K Fold cross validation technique is made, such that each fold contains approximately the same percentage of samples of each target class as the complete set, or in case of prediction problems, the mean response value is approximately equal in all the folds. This variation is also known as Stratified K Fold.

    It works as follows. Stratified k-Fold splits the dataset on k folds such that each fold contains approximately the same percentage of samples of each target class as the complete set. In the case of regression, Stratified k-Fold makes sure that the mean target value is approximately equal in all the folds. The algorithm of Stratified k-Fold technique:

    - Pick a number of folds – k
    - Split the dataset into k folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set 
    - Choose k – 1 folds which will be the training set. The remaining fold will be the test set
    - Train the model on the training set. On each iteration a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3 – 6 k times. Each time use the remaining  fold as the test set. In the end, you should have validated the model on every fold that you have.
    - To get the final score average the results that you got on step 6.

    As you may have noticed, the algorithm for Stratified k-Fold technique is similar to the standard k-Folds. You don’t need to code something additionally as the method will do everything necessary for you. Stratified k-Fold also has a built-in method in sklearn – `sklearn.model_selection.StratifiedKFold`. All mentioned above about k-Fold CV is true for Stratified k-Fold technique. When choosing between different CV methods, make sure you are using the proper one. For example, you might think that your model performs badly simply because you are using k-Fold CV to validate the model which was trained on the dataset with a class imbalance. To avoid that you should always do a proper exploratory data analysis on your data.

1. **Repeated k-Fold cross-validation:** Repeated k-Fold cross-validation or Repeated random sub-sampling CV is probably the most robust of all CV techniques in this paper. It is a variation of k-Fold but in the case of Repeated k-Folds k is not the number of folds. It is the number of times we will train the model. The general idea is that on every iteration we will randomly select samples all over the dataset as our test set. For example, if we decide that 20% of the dataset will be our test set, 20% of samples will be randomly selected and the rest 80% will become the training set. The algorithm of Repeated k-Fold technique:

    - Pick k – number of times the model will be trained
    - Pick a number of samples which will be the test set
    - Split the dataset
    - Train on the training set. On each iteration of cross-validation, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 3-6 k times
    - To get the final score average the results that you got on step 6.
    
    Repeated k-Fold has clear advantages over standard k-Fold CV. Firstly, the proportion of train/test split is not dependent on the number of iterations. Secondly, we can even set unique proportions for every iteration. Thirdly, random selection of samples from the dataset makes Repeated k-Fold even more robust to selection bias. Still, there are some disadvantages. k-Fold CV guarantees that the model will be tested on all samples, whereas Repeated k-Fold is based on randomization which means that some samples may never be selected to be in the test set at all. At the same time, some samples might be selected multiple times. Thus making it a bad choice for imbalanced datasets. Sklearn will help you to implement a Repeated k-Fold CV. Just use `sklearn.model_selection.RepeatedKFold`. In sklearn implementation of this technique you must set the number of folds that you want to have (n_splits) and the number of times the split will be performed (n_repeats). It guarantees that you will have different folds on each iteration.

1. **Leave-one-out cross-validation:** Leave-one-out сross-validation (LOOCV) is an extreme case of k-Fold CV. Imagine if k is equal to n where n is the number of samples in the dataset. Such k-Fold case is equivalent to Leave-one-out technique. The algorithm of LOOCV technique:

    - Choose one sample from the dataset which will be the test set
    - The remaining n – 1 samples will be the training set
    - Train the model on the training set. On each iteration, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 1 – 5 n times as for n samples we have n different training and test sets
    - To get the final score average the results that you got on step 5.
    
    For LOOCV sklearn also has a built-in method. It can be found in the model_selection library – sklearn.model_selection.LeaveOneOut.
    
    1. Advantages:
        - The greatest advantage of Leave-one-out cross-validation is that it doesn’t waste much data. We use only one sample from the whole dataset as a test set, whereas the rest is the training set. 
        
    1. Disadvantages:
        - But when compared with k-Fold CV, LOOCV requires building n models instead of k models, when we know that n which stands for the number of samples in the dataset is much higher than k. It means LOOCV is more computationally expensive than k-Fold, it may take plenty of time to cross-validate the model using LOOCV. **Thus, the Data Science community has a general rule based on empirical evidence and different researches, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.**

1. **Leave-p-out cross-validation:** Leave-p-out cross-validation (LpOC) is similar to Leave-one-out CV as it creates all the possible training and test sets by using p samples as the test set. All mentioned about LOOCV is true and for LpOC. This approach leaves p data points out of training data, i.e. if there are n data points in the original sample then, n-p samples are used to train the model and p points are used as the validation set. This is repeated for all combinations in which original sample can be separated this way, and then the error is averaged for all trials, to give overall effectiveness. **This method is exhaustive in the sense that it needs to train and validate the model for all possible combinations, and for moderately large p, it can become computationally infeasible.** Still, it is worth mentioning that unlike LOOCV and k-Fold test sets will overlap for LpOC if p is higher than 1. The algorithm of LpOC technique:

    - Choose p samples from the dataset which will be the test set
    - The remaining n – p samples will be the training set
    - Train the model on the training set. On each iteration, a new model must be trained
    - Validate on the test set
    - Save the result of the validation
    - Repeat steps 2 – 5 Cpn times 
    - To get the final score average the results that you got on step 5
    
    You can perform Leave-p-out CV using sklearn – `sklearn.model_selection.LeavePOut`. LpOC has all the disadvantages of the LOOCV, but, nevertheless, it’s as robust as LOOCV. A particular case of this method is when p = 1. This is known as Leave one out cross validation. **This method is generally preferred over the previous one because it does not suffer from the intensive computation, as number of possible combinations is equal to number of data points in original sample or n.**

1. **Nested cross-validation:** In the case of k-fold and stratified k-fold cross-validation, we get a poor estimate of the error in training and test data. Hyperparameter tuning is done separately in the earlier methods. When cross-validation is used simultaneously for tuning the hyperparameters and generalizing the error estimate, nested cross-validation is required. Nested Cross Validation can be applicable in both k-fold and stratified k-fold variants. Nested k-fold cross-validation is an advanced form of cross-validation used primarily for model selection and hyperparameter tuning, particularly when the dataset is not very large. It's a way to more reliably estimate the performance of a model on unseen data. Two Layers of k-Fold Cross-Validation:

    - **Outer Loop:** This is for model evaluation. The data is split into k 'folds'. In each iteration, one fold is used as the test set, and the remaining k-1 folds are used for training (and further split in the inner loop).
    - **Inner Loop:** Within each iteration of the outer loop, the training data is again split into k 'folds' for hyperparameter tuning. The model is trained on k-1 of these folds and validated on the remaining fold. This process is repeated for each combination of hyperparameters to find the best set.
   
   **Purpose:**

    - **Model Selection and Hyperparameter Tuning:** The inner loop is used to select the best model and hyperparameters. The outer loop evaluates the performance of this best model.
    - **Avoiding Data Leakage:** By having a separate test set in the outer loop that is never used for training or hyperparameter tuning, nested k-fold cross-validation avoids leaking test data into the model training proces

   **When to Use Nested k-Fold Cross-Validation:**
    - **Small Datasets:** When your dataset is not large enough to afford a separate, dedicated hold-out set for final model evaluation.

    - **Reliable Performance Estimation:** When you need a more reliable and unbiased estimate of the model's performance on unseen data.

    - **Hyperparameter Tuning:** It is particularly useful when the process of hyperparameter tuning is critical to the performance of the model.

    - **Model Selection:** When comparing multiple models or configurations and you need a rigorous method to assess which model is likely to perform best on unseen data.

    Suppose you have a dataset and want to use a support vector machine (SVM) model. You're not sure what value of the regularization parameter C to use. Nested k-fold cross-validation would allow you to test different values of C in the inner loop to find the best one, while the outer loop would give you a reliable estimate of how well the SVM with this C value performs on unseen data. Nested k-fold cross-validation is a thorough and rigorous approach to model evaluation and selection, particularly useful in scenarios where every data point is valuable, and an unbiased estimate of the model's performance is crucial. It's more computationally intensive than simple k-fold cross-validation but provides a more accurate estimate of a model's generalization capabilities.

1. **Time-series cross-validation:** Traditional cross-validation techniques don’t work on sequential data such as time-series because we cannot choose random data points and assign them to either the test set or the train set as it makes no sense to use the values from the future to forecast values in the past. The above cross-validation methods may not be suitable for evaluating time series models because the order of the data is very important in time series data. That’s why time series cross-validation was introduced. There are mainly two ways to go about this:

    - **Rolling cross-validation:** Cross-validation is done on a rolling basis i.e. starting with a small subset of data for training purposes, predicting the future values, and then checking the accuracy on the forecasted data points. The following image can help you get the intuition behind this approach. In time series cross-validation, folds are created in a forward-chaining fashion. We start with a small subset of data as the training fold and a much smaller subset as the validation fold. The validation fold gets shifted in time and the previous validation fold gets added to the training fold in the next iteration. We can use the Scikit-learn TimeSeriesSplit() function to perform time series cross-validation. The number of splits is specified in the n_splits hyperparameter. For the time-series dataset, the split of data into train and validation is according to the time also referred to as forward chaining method or rolling cross-validation. For a particular iteration, the next instance of train data can be treated as validation data. As mentioned in the above diagram, for the 1st iteration, 1st 3 rows are considered as train data and the next instance T4 is validation data. The chance of choice of train and validation data is forwarded for further iterations.
    
        <img width="844" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/343151b4-cbf7-4707-a628-ab312a53b8e5">
        <img width="887" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/fed37a32-056a-42cd-ab85-b81eed1e4b9c">
        
    - **Blocked cross-validation:** The first technique may introduce leakage from future data to the model. The model will observe future patterns to forecast and try to memorize them. That’s why blocked cross-validation was introduced. It works by adding margins at two positions. The first is between the training and validation folds in order to prevent the model from observing lag values which are used twice, once as a regressor and another as a response. The second is between the folds used at each iteration in order to prevent the model from memorizing patterns from one iteration to the next.
   
        <img width="773" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/193a7093-dde0-4d03-a7c1-bfd66bda2b12">



**Best practices and tips**

It’s worth mentioning that sometimes performing cross-validation might be a little tricky. 

For example, it’s quite easy to make a logical mistake when splitting the dataset which may lead to an untrustworthy CV result. 

You may find some tips that you need to keep in mind when cross-validating a model below:

1. Be logical when splitting the data (does the splitting method make sense)
1. Use the proper CV method (is this method viable for my use-case)
1. When working with time series don’t validate on the past (see the first tip)
1. When working with medical or financial data remember to split by person. Avoid having data for one person both in the training and the test set as it may be considered as data leak
1. When cropping patches from larger images remember to split by the large image Id

Of course, tips differ from task to task and it’s almost impossible to cover all of them. That’s why performing a solid exploratory data analysis before starting to cross-validate a model is always the best practice.

In [38]:
# ====================================================
# CV split
# ====================================================

n_folds = 4
train_folds = [0, 1, 2, 3]

stratified_k_fold = StratifiedKFold(
    n_splits=n_folds,
    shuffle=True, 
    random_state=123
)

# Iterating Over Each Fold
# The enumerate function is used to iterate over the fold splits. 
# It provides two pieces of information for each iteration:
# fold: The current fold number (starting from 0).
# (train_index, val_index): Two arrays containing indices of the training and validation data for the current fold
for fold, (train_index, val_index) in enumerate(stratified_k_fold.split(train, train["label"])): # It generates indices for training and validation sets for each fold.
    
    # Inside the loop, for each fold, the validation indices (val_index) are used 
    # to assign the fold number to the corresponding rows in train.
    # This line does the assignment. It sets the 'fold' column of the DataFrame for rows in val_index to the current fold number.
    # This effectively tags each data point with the fold number it will be a part of in the validation set.
    # The purpose of this assignment is to keep track of which data points should be in the validation set for each fold.
    # When you actually train the model, you can easily filter the DataFrame to get the appropriate training and validation sets based on these fold numbers.

    train.loc[val_index, 'fold'] = int(fold)

train['fold'] = train['fold'].astype(int)
display(train.groupby('fold').size())



fold
0    345
1    345
2    344
3    344
dtype: int64

In [39]:
for fold in range(n_folds):
    
    train_folds = train[train["fold"] != fold].reset_index(drop=True)
    valid_folds = train[train["fold"] == fold].reset_index(drop=True)
    print(valid_folds.shape)
    valid_labels = valid_folds["label"].values


(345, 5)
(345, 5)
(344, 5)
(344, 5)


# <span style="color: #7b6b59;">Step 4.4: Prepare the Training Data - Tokenization and DataLoader</span>

## <span style="color: #7b6b59;">Tokenization with Hugging Face 🤗 Transformers</span>


A tokenizer is in charge of preparing the inputs for a model. The Hugging Face library contains tokenizers for all the models. 

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. Text, use a **Tokenizer** to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. The main tool for preprocessing textual data is a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). 

1. A tokenizer splits text into tokens according to a set of rules. 
1. The tokens are converted into numbers and then tensors, which become the model inputs. 
1. Any additional inputs required by the model are added by the tokenizer.

***Tip:*** If you plan on using a pretrained model, it’s important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referred to as the vocab) during pretraining.

### <span style="color: #7b6b59;">Step 1.1: Loading a pretrained tokenizer</span>


Get started by loading a pretrained tokenizer with the AutoTokenizer.from_pretrained() method. This downloads the vocab a model was pretrained with:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

```

In [40]:
from transformers import AutoTokenizer

OUTPUT_DIR = "./"
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
tokenizer.save_pretrained(OUTPUT_DIR + "tokenizer/")

tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/spm.model',
 './tokenizer/added_tokens.json',
 './tokenizer/tokenizer.json')

### <span style="color: #7b6b59;">Step 1.2: Then pass your text to the tokenizer</span>


The tokenizer returns a dictionary with three important items:

1. **input_ids** are the indices corresponding to each token in the sentence.
1. **attention_mask** indicates whether a token should be attended to or not.
1. **token_type_ids** identifies which sequence a token belongs to when there is more than one sequence.

In [41]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)


{'input_ids': [1, 771, 298, 57249, 267, 262, 6303, 265, 41267, 261, 270, 306, 281, 6245, 263, 1538, 264, 5693, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Return your input by decoding the input_ids:

In [42]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger.[SEP]'

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:


In [43]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[1, 420, 339, 314, 567, 2962, 302, 2], [1, 1310, 280, 297, 428, 313, 2212, 314, 567, 2962, 261, 31663, 260, 2], [1, 458, 314, 11583, 268, 3933, 302, 2]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


- **Pad:** Sentences aren’t always the same length which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.
    - Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence.
- **Truncation:** On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.
    - Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model.
- **Build tensors:** Finally, you want the tokenizer to return the actual tensors that get fed to the model.
    - Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:





In [44]:
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(encoded_input)

{'input_ids': tensor([[    1,   420,   339,   314,   567,  2962,   302,     2,     0,     0,
             0,     0,     0,     0],
        [    1,  1310,   280,   297,   428,   313,  2212,   314,   567,  2962,
           261, 31663,   260,     2],
        [    1,   458,   314, 11583,   268,  3933,   302,     2,     0,     0,
             0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}


In [45]:
text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
inputs = tokenizer.encode_plus(
    text, 
    return_tensors="pt", 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

In [46]:
tokenizer(
    "Do not meddle in the affairs of wizards, for they are subtle and quick to anger.", return_tensors=None, 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

{'input_ids': [1, 771, 298, 57249, 267, 262, 6303, 265, 41267, 261, 270, 306, 281, 6245, 263, 1538, 264, 5693, 260, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

I typically use the `tokenizer.encode_plus()` function to tokenize my input, but there is another function that can be used to tokenize input, and this `tokenizer.encode()`. The main difference between `tokenizer.encode_plus()` and `tokenizer.encode()` is that `tokenizer.encode_plus()` returns more information. Specifically, it returns the actual input ids, the attention masks, and the token type ids, and it returns all of these in a dictionary. `tokenizer.encode()` **only returns the input ids**, and it returns this either as a list or a tensor depending on the parameter, `return_tensors = “pt”`.

In Hugging Face's Transformers library, the difference between `tokenizer(input)` and `tokenizer.encode_plus(...)` lies in their functionality and the level of control they offer. In summary, `tokenizer(input)` is a simpler method for basic tokenization, while `tokenizer.encode_plus(...)` provides more options and is suitable for scenarios where you need to customize the tokenization process to fit specific model requirements. When you call `tokenizer(inputs)` in Hugging Face's Transformers library, it essentially acts as a high-level wrapper that internally calls methods like `encode_plus` or similar functionalities, depending on the specific tokenizer implementation. The `encode_plus` method is one of the comprehensive methods for encoding text, handling various tasks like tokenization, conversion to token IDs, adding special tokens, creating attention masks, and managing sequence length (truncation and padding).

So, in the background, when you use `tokenizer(inputs)`, it's likely invoking `encode_plus `or a functionally equivalent method, carrying out a series of steps to prepare the input text for processing by the model. The exact methods called can vary between different tokenizer classes, but they generally perform similar tasks to prepare and format the input data appropriately.


In [47]:
train_text = train_df['text'][:16].tolist()

In [48]:
features = tokenizer.batch_encode_plus(
    train_text, 
    return_tensors="pt", 
    add_special_tokens=True, 
    padding="max_length",
    max_length=512,
    truncation=True
)

In [49]:
type(features["input_ids"])

torch.Tensor

In [50]:
train_df.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


In [51]:
features["input_ids"].size()

torch.Size([16, 512])

In [52]:
def prepare_input(text):
    inputs = tokenizer.encode_plus(
        text, 
        return_tensors="pt", 
        add_special_tokens=True, 
        padding="max_length",
        max_length=512,
        truncation=True
    )
  
    return inputs

In [53]:
# ====================================================
# tokenizer
# ====================================================
OUTPUT_DIR = "./"
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
tokenizer.save_pretrained(OUTPUT_DIR + "tokenizer/")



('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/spm.model',
 './tokenizer/added_tokens.json',
 './tokenizer/tokenizer.json')


### <span style="color: #7b6b59;">Introduction to PyTorch Dataset and DataLoader</span>

Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset` that allow you to use pre-loaded datasets as well as your own data. **Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.**

Your training pipeline should be as modular as possible in order to aid quick prototyping and maintaining usability. Using a poorly-written data loader / not using a data loader (using a Python generator or a function) can affect the parallelization ability of your code. 

Dataset processing is a highly important part of any training pipeline and should be kept separate from modeling. 

***How to use `Datasets` and `DataLoader` in PyTorch for custom text data***

In this section, we'll go through the PyTorch data primitives, namely `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`, and understand how to create our own DataLoader and Datasets by subclassing these modules. 

We will learn how to make a custom Dataset and manage it with DataLoader in PyTorch. Creating a PyTorch `Dataset` and managing it with `Dataloader` keeps your data manageable and helps to simplify your machine learning pipeline. **A Dataset stores all your data, and Dataloader is can be used to iterate through the data, manage batches, transform the data, and much more.**


- **Pandas** is not essential to create a Dataset object. However, it’s a powerful tool for managing data so i’m going to use it.

- **`torch.utils.data`** imports the required functions we need to create and use Dataset and DataLoader.



In [54]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

### <span style="color: #7b6b59;">Implementing A Custom Dataset In PyTorch</span>

A dataset is an abstract class in PyTorch that represents a collection of data. It is responsible for loading and preprocessing data from a source and returning it in the form of a PyTorch tensor.


Now, for most purposes, you will need to write your own implementation of a `Dataset`. So let's see how you can write a custom dataset by subclassing `torch.utils.data.Dataset`.

You'll need to implement 3 functions. The Dataset class provides 3 main methods:

1. **`__init__`**: This function is called when instancing the object. It's typically used to store some essential locations like file paths and image transforms. `class TextDataset(Dataset)`: Create a class called ‘TextDataset’, this can be called whatever you want. Passed in to the class is the dataset module which we imported earlier. `def __init__(self, text, labels)`: When you initialise the class you need to import two variables. In this case, the variables are called ‘text’ and ‘labels’ to match the data which will be added.

1. **`__len__`**: This function returns the length of the dataset. `self.labels = labels` & `self.text = text`: The imported variables can now be used in functions within the class by using self.text or self.labels. `def __len__(self)`: This function just returns the length of the labels when called. E.g., if you had a dataset with 5 labels, then the integer 5 would be returned.

1. **`__getitem__`**: This is the big kahuna 🏅. This function is responsible for returning a sample from the dataset based on the index provided. returns a single data point from the dataset at a given index. The getitem method is where the actual data loading and preprocessing takes place. It takes an index as input and returns a data point, which can be a tensor or a dictionary of tensors. This method is used by the DataLoader class to load and preprocess the data.


### <span style="color: #7b6b59;">PyTorch DataLoader: A Complete Guide</span>

PyTorch provides an intuitive and incredibly versatile tool, the DataLoader class, to load data in meaningful ways. Because data preparation is a critical step to any type of data work, being able to work with, and understand, DataLoaders is an important step in your deep learning journey. The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

**DataLoader is an iterable that abstracts this complexity for us in an easy API.**

The PyTorch DataLoader class is built on top of the PyTorch Dataset class, which provides a standard interface for accessing data. The DataLoader class takes in a Dataset object and provides a way to iterate over the data in batches. This allows for efficient processing of large datasets by allowing parallelization of data loading and preprocessing.



**What Does a PyTorch DataLoader Do?**

The PyTorch DataLoader class is an important tool to help you prepare, manage, and serve your data to your deep learning networks. Because many of the pre-processing steps you will need to do before beginning training a model, finding ways to standardize these processes is critical for the readability and maintainability of your code.

The PyTorch DataLoader allows you to:

- **Define a dataset to work with:** identifying where the data is coming from and how it should be accessed.
- **Batch the data:** define how many training or testing samples to use in a single iteration. Because data are often split across training and testing sets of large sizes, being able to work with batches of data can allow your training and testing processes to be more manageable.
- **Shuffle the data:** PyTorch can handle shuffling data for you as it loads data into batches. This can increase representativeness in your dataset and prevent accidental skewness.
- **Support multi-processing:** PyTorch is optimized to run multiple processes at once in order to make better use of modern CPUs and GPUs and to save time in training and testing your data. The DataLoader class lets you define how many workers should go at once.
- **Merge datasets together:** optionally, PyTorch also allows you to merge multiple datasets together. While this may not be a common task, having it available to you is an a great feature.
- **Load data directly on CUDA tensors:** because PyTorch can run on the GPU, you can load the data directly onto the CUDA before they’re returned.

The DataLoader is a PyTorch utility class that provides a way to iterate over a Dataset object in batches. It is designed to handle large datasets efficiently and can be configured to load data in parallel, preprocess data on the fly, and shuffle data for each epoch.

The DataLoader takes in a Dataset object and provides a number of configuration options, including batch size, shuffling, and number of worker processes for parallel data loading. The DataLoader class is responsible for batching the data and returning it in a format that can be consumed by the model


`DataLoader` class has a lot of different parameters available. Of course, one of the most important parameters is the actual dataset. Generally, you’ll be working with at least a training and a testing dataset. **Because of this, it’s a convention that you’ll have at least two DataLoaders, to be able to load data for both your training and testing data.**

PyTorch lets you define many different parameters to influence how data are loaded. This can have a big impact on the speed at which your model can train, how well it can train, and ensuring that data are sampled appropriately.

We have loaded that dataset into the DataLoader and can iterate through the dataset as needed. Each iteration below returns a batch of train_features and train_labels (containing `batch_size=8` features and labels respectively). Because we specified `shuffle=True`, after we iterate over all batches the data is shuffled.


In [55]:
max_len = 512
batch_size = 4

In [56]:
class DAIGTDataset(Dataset):
    
    def __init__(self, df, tokenizer, max_len):
        self.texts = df["text"].values
        self.labels = df["label"].values
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = torch.tensor(self.labels[idx], dtype=torch.float)
        tokenized = self.tokenizer.encode_plus(
            text=text,
            return_tensors='pt',
            padding='max_length',
            max_length=self.max_len,
            truncation=True,
            add_special_tokens=True
        )
        return tokenized['input_ids'].squeeze(), tokenized['attention_mask'].squeeze(), label

In [57]:
sample_dataset = DAIGTDataset(train[0:20], tokenizer, max_len)
sample_loader = DataLoader(
    dataset=sample_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

In [58]:
def collate(input_ids, attention_mask):
    mask_len = attention_mask.sum(axis=1).max()
    return input_ids[:,:mask_len]

In [59]:
for step, (input_ids, attention_mask, labels) in enumerate(sample_loader):
    print(attention_mask.shape)
    print(attention_mask.sum(axis=1).max())
    inputs_v2 = collate(input_ids, attention_mask)
    print(inputs_v2.shape)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

torch.Size([4, 512])
tensor(512)
torch.Size([4, 512])
torch.Size([4, 512])
tensor(512)
torch.Size([4, 512])
torch.Size([4, 512])
tensor(512)
torch.Size([4, 512])
torch.Size([4, 512])
tensor(512)
torch.Size([4, 512])
torch.Size([4, 512])
tensor(512)
torch.Size([4, 512])


In [60]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger. I'm happ")
print(encoded_input)

{'input_ids': [1, 771, 298, 57249, 267, 262, 6303, 265, 41267, 261, 270, 306, 281, 6245, 263, 1538, 264, 5693, 260, 273, 280, 358, 110355, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# <span style="color: #7b6b59;">Step 4.5: Modelling with Hugging Face 🤗 Transformers</span>

### <span style="color: #7b6b59;">Introduction</span>


In Hugging Face Transformers there are 2 main outputs and 3 if configured; that we receive after giving input_ids and attention_mask as input.

- **pooler output (batch size, hidden size)**: Last layer hidden-state of the first token of the sequence
- **last hidden state (batch size, seq Len, hidden size)**: which is the sequence of hidden states at the output of the last layer.
- **hidden states (n layers, batch size, seq Len, hidden size)**: Hidden states for all layers and for all ids.

In this notebook, we will show many different ways these outputs and hidden representations can be utilized to do much more than just adding an output layer. Below are the various techniques we will be implementing.


### <span style="color: #7b6b59;">Last Hidden State Output</span>


<img width="894" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/923fb8e9-58b0-4984-a4e3-d097afde3b88">

**This is the first and default output from models.**

Last Hidden State output is the sequence of hidden-states at the output of the last layer of the model. The output is usually `[batch, maxlen, hidden_state]`, it can be narrowed down to `[batch, 1, hidden_state]` for `[CLS]` token, as the `[CLS]` token is 1st token in the sequence. Here , `[batch, 1, hidden_state]` can be equivalently considered as `[batch, hidden_state]`.

#### <span style="color: #7b6b59;">Implementation Details</span>


All models have outputs that are instances of subclasses of [`ModelOutput`](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.utils.ModelOutput). Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.

```python

outputs = model(**inputs, labels=labels)

```
When considering our outputs object as tuple, it only considers the attributes that don’t have None values. For instance, it has two elements, loss then logits, so will return the `tuple (outputs.loss, outputs.logits)`. `outputs[:2]`


In Hugging Face Transformers, when you use a model from the `AutoModel` class with `AutoModel.from_pretrained`, the specific subclass of `ModelOutput` that the model returns depends on the type of model you are using (e.g., BERT, GPT-2, T5, etc.) and the nature of the task (e.g., sequence classification, token classification, language modeling, etc.).

To determine which subclass of `ModelOutput` is returned, you should consider the following:

1. **Model Type:** Different models are designed for different kinds of tasks. For instance, BERT-like models might return `BaseModelOutput` or `SequenceClassifierOutput`, while GPT-like models might return `CausalLMOutput`.

1. **Task:** The nature of the task also influences the output type. For example:

    - For sequence classification tasks, models often return SequenceClassifierOutput.
    - For token classification tasks (like Named Entity Recognition), models might return TokenClassifierOutput.
    - For language modeling tasks, models could return CausalLMOutput or MaskedLMOutput.

1. **Documentation::** The best way to know for sure is to refer to the Hugging Face documentation for the specific model you are using. The documentation usually specifies the output format for each model.

1. **Inspecting the Output:** You can programmatically inspect the output to determine its type. For example, after running `outputs = self.model(**inputs)`, you can check `type(outputs)` to see the class of the output.

1. **Common Attributes:** Most ModelOutput subclasses have common attributes like `loss`, `logits`, `hidden_states`, and `attentions`, but the presence and relevance of these attributes can vary. The exact composition of the output object will align with the requirements of the model's intended task.

1. **Configuration:** Sometimes, the configuration of the model (self.config) can give you hints about the expected output type, especially if it contains task-specific configurations.

Remember that Hugging Face's design philosophy with `ModelOutput` is to provide flexibility and convenience, allowing outputs to be used like tuples, dictionaries, or objects with named attributes. This makes it easier to access the information you need for your specific application.

For instance, if we have a look on the [documentation for the Deberta Model](https://huggingface.co/docs/transformers/model_doc/deberta#transformers.DebertaModel.forward) in the `forward` method  
we will see that "**Returns `transformers.modeling_outputs.BaseModelOutput` or `tuple(torch.FloatTensor)`**". Now if we jump to the [BaseModelOutput documentation](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.modeling_outputs.BaseModelOutput) we'll get 

<img width="1044" alt="image" src="https://github.com/microsoft/DeBERTa/assets/28102493/dbe927a5-4f60-4673-bf28-29a8a96e05aa">

### <span style="color: #7b6b59;">Mean Pooling</span>

#### <span style="color: #7b6b59;">Introduction</span>

Since Transformers are contextual model, the idea is `[CLS]` token would have captured the entire context and would be sufficient for simple downstream tasks such as classification. Hence, for tasks such as classification using sentence representations, you can use `[batch, hidden_state]`.

We can also consider the last hidden state `[batch, maxlen, hidden_state]`, the average across maxlen dimensions to get averaged/mean embeddings.

There are multiple different ways to do this. We can simply take `torch.mean(last_hidden_state, 1)` but rather we will be implementing something different. We will make use of attention masks as well so that we can ignore padding tokens which is a better way of implementing average embeddings.

***What is pooling in Transformer models?***

In the context of transformers, pooling refers to the process of summarizing the outputs of the transformer layers into a fixed-size vector, often used for downstream tasks such as classification.

In a transformer architecture, the input sequence is processed by a series of self-attention and feedforward layers. Each layer produces a sequence of output vectors, which encode the input sequence in a higher-level representation. Pooling involves taking the output vectors from one or more of these layers and aggregating them into a single vector.

There are different types of pooling mechanisms used in transformer architectures, including:


1. **Max Pooling:** where the maximum value across the sequence of output vectors is selected as the summary representation.

1. **Mean Pooling:** where the average of the output vectors is taken as the summary representation.

1. **Last Hidden State:** where the final output vector of the transformer is used as the summary representation.

1. **Self-Attention Pooling:** where a weighted sum of the output vectors is computed, with the weights determined by a learned attention mechanism.

#### <span style="color: #7b6b59;">Neural Networks: Pooling Layers</span>

In this section, we’ll walk through **pooling**, a machine-learning technique ***widely used that reduces the size of the input and, thus the complexity of deep learning models while preserving important features and relationships in the input data***. In particular, we’ll introduce pooling, explain its usage, highlight its importance, and give brief examples of how it works.

***What Are Pooling Layers?***

In machine learning and neural networks, the dimensions of the input data and the parameters of the neural network play a crucial role. So this number can be controlled by the stacking of one or more pooling layers. Depending on the type of the pooling layer, an operation is performed on each channel of the input data independently to summarize its values into a single one and thus keep the most important features. These values are driven as input to the next layer of the model and so on. The pooling process may be repeated several times, and each iteration reduces the spatial dimensions. The value aggregation can be performed by using different techniques.

***Types of Pooling Layers***

There are many pooling operations and different extensions that have been developed to address specific challenges in different applications.


1. **Max Pooling:** Max pooling is a convolution technique that chooses the maximum value from the patch of the input data and summarizes these values into a feature map: This method maintains the most significant features of the input by reducing its dimensions.
    <img width="660" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4e844e1d-8d59-4f97-beb8-00845cc7e45a">
    
1. **Average Pooling:** Average pooling calculates the average value from a patch of input data and summarizes these values into a feature map: This method is preferable in cases in which smoothing the input data is necessary as it helps to identify the presence of outliers. 
     <img width="625" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/af4b21e0-10f8-4518-9a5c-dbdf960f48fa">

1. **Global Pooling:** Global pooling summarizes the values of all neurons for each patch of the input data into a feature map, regardless of their spatial location. This technique is also used to reduce the dimensionality of the input and can be performed either by using the maximum or average pooling operation. 

1. **Stochastic Pooling:** Stochastic pooling is a deterministic pooling operation that introduces randomness into the max pooling process. This technique helps in improving the robustness of the model to small variations in the input data.

***Advantages and Disadvantages***

In machine learning, pooling layers offer several advantages and disadvantages as well.

First of all, pooling layers help in keeping the most important characteristics of the input data. Furthermore, the addition of pooling layers in the neural network offers translation invariance, which means that the model can generate the same outputs regardless of small changes in the input. Moreover, these techniques help in reducing the impact of outliers.

On the other hand, the pooling processes may lead to information loss, increased training complexity, and limited model interpretability.

***Usages of Pooling Layers in Machine Learning***

Pooling layers play a critical role in the size and complexity of the model and are widely used in several machine-learning tasks. They are usually employed after the convolutional layers in the convolutional neural network’s structure and are mainly used for downsampling the output.

These techniques are commonly used in convolutional neural networks and deep learning models of computer vision, speech recognition, and natural language processing.


***In conclusion, pooling layers play a critical role in reducing the size and complexity of deep learning models while preserving important features and relationships in the input data.***




In [61]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "false"

In [62]:
config = AutoConfig.from_pretrained("microsoft/deberta-v3-base", output_hidden_states=True)
model = AutoModel.from_pretrained("microsoft/deberta-v3-base", config=config)

pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

In [63]:
with torch.no_grad():
    for step, (input_ids, attention_mask, labels) in enumerate(sample_loader):
        outputs = model(input_ids, attention_mask)
        break

In [64]:
type(outputs)

transformers.modeling_outputs.BaseModelOutput

In [65]:
outputs[0]
# OR outputs["last_hidden_state"]

tensor([[[ 2.0244e-02,  6.7374e-02, -2.3058e-02,  ..., -7.2960e-02,
           6.6919e-02,  3.3520e-02],
         [ 6.1867e-01, -2.4755e-01,  1.2476e-01,  ...,  1.2333e+00,
          -1.8161e-01, -9.5691e-02],
         [ 2.7716e-01,  3.4418e-01, -3.9991e-01,  ...,  6.6889e-01,
          -7.3343e-03,  4.5186e-01],
         ...,
         [-2.8558e-01,  5.3847e-01,  5.7045e-02,  ..., -9.0902e-01,
           3.6311e-01,  2.4040e-01],
         [-8.3092e-01,  6.3363e-01,  2.4380e-01,  ...,  6.6695e-01,
          -2.1120e-01,  1.0565e+00],
         [ 2.6948e-02,  6.5354e-02, -8.3646e-04,  ..., -7.3275e-02,
           7.2130e-02,  2.9461e-02]],

        [[ 2.3808e-02,  6.5923e-02,  1.0164e-02,  ..., -6.0424e-02,
           6.2456e-02,  2.7879e-02],
         [ 2.8578e-01,  4.9128e-01,  9.5755e-02,  ..., -1.1389e-01,
           1.4006e-01, -3.4079e-01],
         [ 1.8625e-01, -7.0034e-01,  1.8400e-02,  ..., -1.1186e+00,
           4.3184e-01,  5.4246e-02],
         ...,
         [ 1.7070e-01, -2

In [66]:
last_hidden_state = outputs[0]
outputs[0].shape

torch.Size([4, 512, 768])

- **Step 1 - Attention Mask Expansion:** Expand Attention Mask from `[batch_size, max_len]` to `[batch_size, max_len, hidden_size]`. `attention_mask` is used to identify the actual tokens and padding tokens in the input sequences. It has 1s for real tokens and 0s for padding.
    - `.unsqueeze(-1)` adds an extra dimension at the end of the `attention_mask`, making it compatible for element-wise multiplication with `last_hidden_state`. `attention_mask` is a 2D tensor where each row corresponds to a sequence in the batch, and each element in a row is either 0 (for padding tokens) or 1 (for actual tokens). For example, if you have a batch size of 2 and sequence length of 4, it might look like this:
    ```python

        [
            [1, 1, 0, 0], # First sequence with 2 actual tokens and 2 paddings
            [1, 1, 1, 0]  # Second sequence with 3 actual tokens and 1 padding
        ]  
    ```
    - Unsqueezing operation adds an extra dimension at the end, making it a 3D tensor. For the example above, after unsqueezing, it would look like:
    ```python
        [
            [
                [1], [1], [0], [0] # First sequence with an added dimension
            ],   
            [
                [1], [1], [1], [0] # Second sequence with an added dimension
            ]
        ]
    ```
    - Expansion (`expand(last_hidden_state.size())`): `.expand(last_hidden_state.size())` adjusts the size of the mask to match the dimensions of `last_hidden_state`, ensuring that each embedding vector in the sequence has a corresponding mask value. The `last_hidden_state tensor` has a shape similar to `[batch_size, seq_length, hidden_size]` where `hidden_size` is the size of the embedding vectors. The expansion makes the attention mask match this shape by repeating its values across the new dimension. After expansion, using the example above and assuming hidden_size is 3, the expanded mask would conceptually look like:
    ```python
        [
            [              # First sequence
                [1, 1, 1], # First token (actual token)
                [1, 1, 1], # Second token (actual token)
                [0, 0, 0], # Third token (padding)
                [0, 0, 0]  # Fourth token (padding)
            ],
            [              # Second sequence
                [1, 1, 1], # First token (actual token)
                [1, 1, 1], # Second token (actual token)
                [1, 1, 1], # Third token (actual token)
                [0, 0, 0]  # Fourth token (padding)
            ]           
        ]
    ```
    - In this tensor, each `[1, 1, 1]` or `[0, 0, 0]` corresponds to a token in the sequence, replicated across the hidden_size dimension. This expanded attention mask can now be element-wise multiplied with the last_hidden_state tensor, effectively zeroing out the embeddings of padding tokens and leaving the embeddings of actual tokens unchanged. This is a crucial step before summing the embeddings for mean pooling, as it ensures that only meaningful token embeddings are considered.
    - `.float()` converts the mask to float type for subsequent arithmetic operations.

In [67]:
print(attention_mask.shape)
print(attention_mask.unsqueeze(-1).shape)
#print(attention_mask)
#print(attention_mask.unsqueeze(-1))
print(attention_mask.unsqueeze(-1).expand(last_hidden_state.size()))

torch.Size([4, 512])
torch.Size([4, 512, 1])
tensor([[[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]],

        [[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]],

        [[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]])


In [68]:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()

- **Step 2 - Embeddings Summation:** Sum Embeddings along max_len axis so now we have `[batch_size, hidden_size]`.
    - The code multiplies `last_hidden_state` with `input_mask_expanded` to zero out embeddings corresponding to padding tokens.
    - `torch.sum(..., 1)` sums up the embeddings across the sequence length dimension (tokens), resulting in a single vector for each sequence in the batch.

In [69]:
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)

In [70]:
print(last_hidden_state.shape)
print(input_mask_expanded.shape)
element_wise = last_hidden_state * input_mask_expanded
print(element_wise.shape)
print(sum_embeddings.shape)

torch.Size([4, 512, 768])
torch.Size([4, 512, 768])
torch.Size([4, 512, 768])
torch.Size([4, 768])


- **Step 3 - Summation of Mask Values:** Sum Mask along `max_len axis`. This is done so that we can ignore padding tokens.
    - Summing up the `input_mask_expanded `across the sequence length gives the number of actual tokens (not padding) in each sequence. In this step, the code is summing up the values of the expanded attention mask (`input_mask_expanded`) across the sequence length dimension. The purpose of this operation is to count the number of actual tokens (ignoring padding tokens) in each sequence of the batch. This count is important for the mean pooling operation, as it tells us by what number we should divide the sum of embeddings to get the average embedding per sequence. `input_mask_expanded.sum(1)` computes the sum along dimension 1, which is the sequence **length dimension**. This effectively adds up the mask values for each token in a sequence, giving us a total count of actual tokens per sequence. Since `input_mask_expanded` was expanded to match the `last_hidden_state` size and had 1s for actual tokens and 0s for padding, summing it up over the sequence length dimension will count the number of 1s, i.e., the number of actual tokens in each sequence.
    - Before the summation, `input_mask_expanded` has the same dimensions as `last_hidden_state`, which is `[batch_size, seq_length, hidden_size]`. After summing across the sequence length (dimension 1), the size of `sum_mask` is reduced to `[batch_size, hidden_size]`. However, since all values across `hidden_size` are the same (because the mask was expanded by repeating the same values), this is effectively equivalent to having a tensor of shape `[batch_size, 1]` where each value represents the count of actual tokens in each sequence of the batch.
    - **Clamping:** After the summation, the code ensures that sum_mask does not contain zeros, to avoid division by zero in the mean calculation. `torch.clamp(..., min=1e-9)` ensures that `sum_mask` does not have any zero values, **preventing division by zero in the next step.** The minimum value `(1e-9)` is arbitrary but small enough to not significantly affect the mean. `torch.clamp(..., min=1e-9`) sets a lower bound on the values in `sum_mask`. If any value is less than `1e-9`, it is set to `1e-9`. This is a safety measure to avoid division by zero. In practice, since `sum_mask` counts the number of tokens, it should not have any zero values unless a sequence is entirely made of padding tokens, which is unlikely in well-formed input data.
    - So, the `sum_mask` tensor provides a count of actual tokens for each sequence, and its dimensions after summation are effectively `[batch_size, 1]`, representing the token count for each sequence in the batch.

In [71]:
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)

In [72]:
print(input_mask_expanded.shape)
print(input_mask_expanded)
print(sum_mask.shape)
print(sum_mask)

torch.Size([4, 512, 768])
tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [

- **Step 4- Computing Mean Embeddings:** Take Average. Dividing `sum_embeddings` by `sum_mask` gives the mean embeddings. Since `sum_embeddings` contains the sum of embeddings for actual tokens only, and `sum_mask` contains the count of actual tokens, the division computes the mean of the embeddings across each sequence.

In [73]:
mean_embeddings = sum_embeddings / sum_mask

In [74]:
print(sum_embeddings.shape)
print(sum_mask.shape)
print(mean_embeddings.shape)

torch.Size([4, 768])
torch.Size([4, 768])
torch.Size([4, 768])


In [75]:
class CFG:
    model="microsoft/deberta-v3-base"
    gradient_checkpointing=True
    epochs=2
    batch_size=8
    max_len=512
    apex=True
    scheduler='cosine' # ['linear', 'cosine']
    gradient_accumulation_steps=1
    num_warmup_steps=0
    num_cycles=0.5
    encoder_lr=2e-5
    decoder_lr=2e-5
    weight_decay=0.01
    n_folds=4
    train_folds=[0, 1, 2, 3]
    batch_scheduler=True
    print_freq=20

In [76]:
# ====================================================
# Model
# ====================================================
class MeanPooling(nn.Module):
    def __init__(self):
        super(MeanPooling, self).__init__()
        
    def forward(self, last_hidden_state, attention_mask):
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings

    
class DAIGTModel(nn.Module):
    
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel(self.config)
        if self.cfg.gradient_checkpointing:
            self.model.gradient_checkpointing_enable()
        self.pool = MeanPooling()
        self.classifier = nn.Linear(self.config.hidden_size, 1)
        
    def feature(self, input_ids, attention_mask):
        outputs = self.model(input_ids, attention_mask)
        last_hidden_states = outputs[0]
        feature = self.pool(last_hidden_states, attention_mask)
        return feature

    def forward(self, input_ids, attention_mask):
        embeddings = self.feature(input_ids, attention_mask)
        logits = self.classifier(embeddings)
        return logits

In [77]:
model = DAIGTModel(CFG, config_path=None, pretrained=True)

In [78]:
with torch.no_grad():
    for step, (input_ids, attention_mask, labels) in enumerate(sample_loader):
        logits = model(input_ids, attention_mask)
        break

In [79]:
print(logits.shape)

torch.Size([4, 1])


In [80]:
logits

tensor([[-0.0344],
        [ 0.2451],
        [ 0.0803],
        [-0.0541]])

# <span style="color: #7b6b59;">Step 4.6: Train the Model - Training Loop</span>

## <span style="color: #7b6b59;">Key Concepts in Deep Learning for Effective Model Training</span>

### <span style="color: #7b6b59;">Concept 1: Optimizers</span>

#### <span style="color: #7b6b59;">Introduction</span>


Many people may be using optimizers while training the neural network without knowing that the method is known as optimization. 

- Optimizers are algorithms or methods used to change the attributes of your neural network such as weights and learning rate in order to reduce the losses. How you should change your weights or learning rates of your neural network to reduce the losses is defined by the optimizers you use. Optimization algorithms or strategies are responsible for reducing the losses and to provide the most accurate results possible.


#### <span style="color: #7b6b59;">Optimization Algorithms</span>

We’ll learn about different types of optimizers and their advantages:


1. **Gradient Descent:** Gradient Descent is the most basic but most used optimization algorithm. It’s used heavily in linear regression and classification algorithms. Backpropagation in neural networks also uses a gradient descent algorithm. Gradient descent is a first-order optimization algorithm which is dependent on the first order derivative of a loss function. It calculates that which way the weights should be altered so that the function can reach a minima. Through backpropagation, the loss is transferred from one layer to another and the model’s parameters also known as weights are modified depending on the losses so that the loss can be minimized. algorithm: θ=θ−α⋅∇J(θ). Gradient descent is an optimization algorithm based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum. Gradient Descent iteratively reduces a loss function by moving in the direction opposite to that of steepest ascent. It is dependent on the derivatives of the loss function for finding minima. uses the data of the entire training set to calculate the gradient of the cost function to the parameters which requires large amount of memory and slows down the process. How big/small the steps are gradient descent takes into the direction of the local minimum are determined **by the learning rate**, which figures out how fast or slow we will move towards the optimal weights.

    <img width="879" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/1ce01c91-6d49-45ab-aed5-05679edcbc01">
    
    <img width="819" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/36cdb708-717a-47fe-83a4-dd63fe8c7af5">

    1. **Advantages:**
    
        - Easy computation.
        - Easy to implement.
        - Easy to understand.
    1. **Disadvantages:**

        - May trap at local minima.
        - Weights are changed after calculating gradient on the whole dataset. So, if the dataset is too large than this may take years to converge to the minima. Because this method calculates the gradient for the entire data set in one update, the calculation is very slow.
        - Requires large memory to calculate gradient on the whole dataset. It requires large memory and it is computationally expensive.

1. **Stochastic Gradient Descent:** It’s a variant of Gradient Descent. It tries to update the model’s parameters more frequently. In this, the model parameters are altered after computation of loss on each training example. So, if the dataset contains 1000 rows SGD will update the model parameters 1000 times in one cycle of dataset instead of one time as in Gradient Descent. `θ=θ−α⋅∇J(θ;x(i);y(i)) , where {x(i) ,y(i)}` are the training examples. As the model parameters are frequently updated parameters have high variance and fluctuations in loss functions at different intensities. It is a variant of Gradient Descent. It update the model parameters one by one. If the model has 10K dataset SGD will update the model parameters 10k times.

    <img width="812" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4bbb77e3-5c50-4bbd-a467-b1e240970db2">

    1. **Advantages:**

        - Frequent updates of model parameters hence, converges in less time.
        - Requires less memory as no need to store values of loss functions.
        - May get new minima’s.
        - Allows the use of large data sets as it has to update only one example at a time.
   
    1. **Disadvantages:**

        - High variance in model parameters.
        - May shoot even after achieving global minima.
        - To get the same convergence as gradient descent needs to slowly reduce the value of learning rate.
        - The frequent can also result in noisy gradients which may cause the error to increase instead of decreasing it.
        - Frequent updates are computationally expensive.

1. **Mini-Batch Gradient Descent:** It’s best among all the variations of gradient descent algorithms. It is an improvement on both SGD and standard gradient descent. It updates the model parameters after every batch. So, the dataset is divided into various batches and after every batch, the parameters are updated. `θ=θ−α⋅∇J(θ; B(i)), where {B(i)}` are the batches of training examples. It is a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. it can reduce the variance when the parameters are updated, and the convergence is more stable. It splits the data set in batches in between 50 to 256 examples, chosen at random.

    1. **Advantages:**

        - Frequently updates the model parameters and also has less variance.
        - Requires medium amount of memory.
        - It leads to more stable convergence.
        - more efficient gradient calculations.
        
    1. **All types of Gradient Descent have some challenges:**

        - Choosing an optimum value of the learning rate. If the learning rate is too small than gradient descent may take ages to converge. If the learning rate is too small, the convergence rate will be slow. If it is too large, the loss function will oscillate or even deviate at the minimum value.

        - Have a constant learning rate for all the parameters. There may be some parameters which we may not want to change at the same rate.
        - May get trapped at local minima. Mini-batch gradient descent does not guarantee good convergence,

1. **Momentum:** Momentum was invented for reducing high variance in SGD and softens the convergence. It accelerates the convergence towards the relevant direction and reduces the fluctuation to the irrelevant direction. One more hyperparameter is used in this method known as momentum symbolized by ‘γ’. `V(t)=γV(t−1)+α.∇J(θ)`. Now, the weights are updated by `θ=θ−V(t)`. The momentum term `γ` is usually set to 0.9 or a similar value.

    1. **Advantages:**

        - Reduces the oscillations and high variance of the parameters.
        - Converges faster than gradient descent.

    1. **Disadvantages:**

        - One more hyper-parameter is added which needs to be selected manually and accurately.

1. **Nesterov Accelerated Gradient:** Momentum may be a good method but if the momentum is too high the algorithm may miss the local minima and may continue to rise up. So, to resolve this issue the NAG algorithm was developed. It is a look ahead method. We know we’ll be using `γV(t−1)` for modifying the weights so, `θ−γV(t−1)` approximately tells us the future location. Now, we’ll calculate the cost based on this future parameter rather than the current one. `V(t)=γV(t−1)+α. ∇J( θ−γV(t−1) )` and then update the parameters using `θ=θ−V(t)`. 
     
     <img width="434" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/c9351e81-bce5-4717-8654-27b300497341">
 
     1. **Advantages:**
        
        - Does not miss the local minima.
        - Slows if minima’s are occurring.
    
     1. **Disadvantages:**

        - Still, the hyperparameter needs to be selected manually.
   
1. **Adagrad**: One of the disadvantages of all the optimizers explained is that the learning rate is constant for all parameters and for each cycle. This optimizer changes the learning rate. It changes the learning rate `‘η’` for each parameter and at every time step `‘t’`. It’s a type second order optimization algorithm. It works on the derivative of an error function.
    
    <img width="564" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6910d495-c191-4857-9360-e1402b7c6023">
    
    `η` is a learning rate which is modified for given parameter `θ(i)` at a given time based on previous gradients calculated for given parameter `θ(i)`. We store the sum of the squares of the gradients w.r.t. `θ(i)` up to time step `t`, while `ϵ` is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse. **It makes big updates for less frequent parameters and a small step for frequent parameters.**

    1. **Advantages:**

        - Learning rate changes for each training parameter.
        - Don’t need to manually tune the learning rate.
        - Able to train on sparse data.

    1. **Disadvantages:**

        - Computationally expensive as a need to calculate the second order derivative.

1. **AdaDelta:** It is an extension of AdaGrad which tends to remove the decaying learning Rate problem of it. Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w. In this exponentially moving average is used rather than the sum of all the gradients.

    1. **Advantages:**

        - Now the learning rate does not decay and the training does not stop.
        
    1. **Disadvantages:**
        - Computationally expensive.

1. **Adam:**  Adam (Adaptive Moment Estimation) works with momentums of first and second order. The intuition behind the Adam is that we don’t want to roll so fast just because we can jump over the minimum, we want to decrease the velocity a little bit for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients.

    1. **Advantages:**

        - The method is too fast and converges rapidly.
        - Rectifies vanishing learning rate, high variance.

    1. **Disadvantages:**

        - Computationally costly.

#### <span style="color: #7b6b59;">Conclusions</span>

- **Adam is the best optimizers. If one wants to train the neural network in less time and more efficiently than Adam is the optimizer.**

- **For sparse data use the optimizers with dynamic learning rate.**

- **If, want to use gradient descent algorithm than min-batch gradient descent is the best option.The learning rate is always decreasing results in slow training.**

**How to choose optimizers?**

1. If the data is sparse, use the self-applicable methods, namely Adagrad, Adadelta, RMSprop, Adam.

1. RMSprop, Adadelta, Adam have similar effects in many cases.

1. Adam just added bias-correction and momentum on the basis of RMSprop,

1. As the gradient becomes sparse, Adam will perform better than RMSprop.


<img width="911" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/110e15de-29a7-4a7b-8850-211ba8d8c0ce">


### <span style="color: #7b6b59;">Concept 2: Weight Decay</span>

#### <span style="color: #7b6b59;">Introduction</span>

I mentioned before that data augmentation helps deep learning models generalize well. That was on the data side of things. **What about the model side of things? What can we do while training our models, that will help them generalize even better.**

***We do weight decay.***

<img width="917" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/73c04cdc-31f8-4818-97f6-ff70461c83ec">

We start by looking at the image above. We see that we have a bunch of data points and that we cannot fit them well with a straight line. Hence, we use a 2nd degree polynomial to do so. We also notice that if increase the degree of the polynomial beyond a certain point, then our model becomes too complex and starts to overfit.

This means that in order to prevent overfitting, we shouldn’t allow our models to get too complex. **Unfortunately, this has led to a misconception in deep learning that we shouldn’t use a lot of parameters (in order to keep our models from getting overly complex).**


- **Origin of weight decay:**  First of all, real world data is not going to be as simple as the one shown above. Real world data is complex and in order to solve complex problems, we need complex solutions. Having fewer parameters is only one way of preventing our model from getting overly complex. But it is actually a very limiting strategy. More parameters mean more interactions between various parts of our neural network. And more interactions mean more non-linearities. These non-linearities help us solve complex problems. **However, we don’t want these interactions to get out of hand. Hence, what if we penalize complexity. We will still use a lot of parameters, but we will prevent our model from getting too complex. This is how the idea of weight decay came up.**

- **Weight decay:** One way to penalize complexity, would be to add all our parameters (weights) to our loss function. Well, that won’t quite work because some parameters are positive and some are negative. So what if we add the squares of all the parameters to our loss function. We can do that, however it might result in our loss getting so huge that the best model would be to set all the parameters to 0.

- **weight decay rate:** To prevent that from happening, we multiply the sum of squares with another smaller number. This number is called weight decay or wd. Our loss function now looks as follows:

<img width="829" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/056024ed-0084-486e-9dda-d6dfe99cbd1b">

- **Deciding the value of weight decay rate:** Generally a wd = 0.1 works pretty well. However, the folks at fastai have been a little conservative in this respect. Hence the default value of weight decay in fastai is actually 0.01 . The reason to choose this value is because if you have too much weight decay, then no matter how much you train, the model never quite fits well enough whereas if you have too little weight decay, you can still train well, you just have to stop a little bit early.

To understand why using the sum of squares of weights leads to smaller weight updates in a machine learning model, let's delve into the role of weight decay and how it influences the learning process.

**Regularization by Weight Decay**

- **Weight Decay Mechanism:** In weight decay, we add a regularization term to the loss function. This term is typically the sum of the squares of all the model weights, multiplied by a regularization parameter (λ). The modified loss function looks something like this:
    <img width="533" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/b62a55cb-c811-47bb-99ac-dd4d294c6ad1">

- **Purpose:** The purpose of this term is to penalize larger weights in the model. By doing so, it encourages the model to keep the weights as small as possible.

**Impact on Weight Updates**

- **Gradient Descent:** During training, models typically use gradient descent or its variants to minimize the loss. Gradient descent updates each weight by subtracting a fraction of the derivative (gradient) of the loss with respect to that weight.

- **Incorporating Weight Decay:** When the sum of the squares of weights is added to the loss function, the gradient of this additional term with respect to each weight is proportional to the weight itself (since the derivative of w^2  with respect to w is 2 * w)

- **The Update Rule:** So, the update rule for each weight not only involves the gradient derived from the original loss (which reflects how well the model is performing on the training data) but also includes a term from the weight decay. The update looks something like this:
    <img width="653" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/09d407c2-ea1a-4c0f-a08a-2da4331ee1c1">

- **Effect of the Weight Decay Term:** Notice the last part of the update rule: it effectively reduces the weight by a factor proportional to its current value. Larger weights are reduced more than smaller weights.



#### <span style="color: #7b6b59;">What is weight decay?</span>

- Weight decay is a **regularization technique** by adding a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function.  **`loss = loss + weight decay parameter * L2 norm of the weights`**

- Weight decay **is a form of regularization that penalizes large weights in the network.** It does this by adding a term to the loss function that is proportional to the sum of the squared weights. This term reduces the magnitude of the weights and prevents them from growing too large. Weight decay can also be seen as a way of introducing prior knowledge that the weights should be small and smooth.

Weight decay essentially pulls the weights towards 0. **While this is beneficial for convolutional and linear layer weights, Batchnorm layer parameters are meant to scale (the gamma parameter) and shift (the beta parameter) the normalized input of the layer. As such, forcing these values to a lower value would affect the distribution and result in inferior results.** The weight decay is a regularization parameter that prevents the model weights from ‘exploding’. Zeroing the weight decay for these parameters is usually done by default in various projects and frameworks, but it’s still worth checking since it is still not the default behavior for Pytorch. 

Some people prefer to only apply weight decay to the weights and not the bias. **PyTorch applies weight decay to both weights and bias.**

### <span style="color: #7b6b59;">How does weight decay work?</span>


Weight decay works by updating the weights in the opposite direction of their current value, scaled by a factor called the weight decay rate. This factor determines how much the weights are shrunk at each step of the optimization. A higher weight decay rate means more regularization and less overfitting, but also less flexibility and more underfitting. A lower weight decay rate means less regularization and more overfitting, but also more flexibility and less underfitting. The optimal weight decay rate depends on the data, the model, and the learning rate. During each training step, the weight decay mechanism updates the weights by subtracting a fraction of the weight's value from itself. This update is proportional to the weight and in the opposite direction of its current value. For instance:

- If a weight is positive and large, subtracting a fraction of it will make it less positive (i.e., smaller in magnitude).
- If a weight is negative and large (in magnitude), subtracting a fraction (which will be negative) will make it less negative (i.e., closer to zero).
 
### <span style="color: #7b6b59;">Why do we use weight decay?</span>

1. **To prevent overfitting.**

1. **To keep the weights small and avoid exploding gradient.** Because the L2 norm of the weights are added to the loss, each iteration of your network will try to optimize/minimize the model weights in addition to the loss. This will help keep the weights as small as possible, preventing the weights to grow out of control, and thus avoid exploding gradient.

**What are the benefits of weight decay?**

Weight decay offers a variety of advantages for neural network training and performance. It can reduce the variance of the model, which leads to better generalization ability. Additionally, weight decay prevents the weights from becoming too large, which could cause numerical instability or gradient explosion. Furthermore, it simplifies the model and makes it more interpretable. Finally, it can also improve the convergence speed and stability of the optimization algorithm.



### <span style="color: #7b6b59;">What are the drawbacks of weight decay??</span>


Weight decay has some drawbacks that should be taken into account. For instance, it adds an extra hyperparameter to tune, making the model selection process more complex and costly. Additionally, it can reduce the capacity and expressiveness of a model, especially for deep and complex networks. Furthermore, weight decay can interfere with the learning of sparse or important features since it treats all weights equally. Finally, it can cause underfitting if the weight decay rate is too high or if the data is noisy or insufficient.

### <span style="color: #7b6b59;">How do we use weight decay?</span>


To use weight decay, we can simply define the weight decay parameter in the `torch.optim.SGD` optimizer or the `torch.optim.Adam optimizer`. 

<img width="941" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/94444aa1-11d7-40e9-a39f-a86e7f666d09">

Note that Adam uses a different equation for the loss. But the key concept is the same. Also, as I mentioned above that **PyTorch applies weight decay to both weights and bias.** If you would like to only use weights, you can use `model.named_parameters()` function. **`model.named_parameters()` also allows you to do more complex weight decay operations like using weight decay in different layers.**


Weight decay is used to prevent overfitting, which happens when a model learns the training data so well that it performs poorly on new, unseen data. The idea is to make the model simpler and more generalizable by penalizing complexity.

**The Problem with Complexity in Weights**

1. **Why Penalize Weights:** In a neural network, weights (parameters) determine how the input data is transformed into predictions. If these weights are too large or complex, the model might start fitting the noise in the data, leading to overfitting.

1. **Adding Weights to Loss Function:** One intuitive idea to prevent overfitting is to keep the weights small. Initially, one might think of adding the weights directly to the loss function (which measures how wrong the model's predictions are). But there's a problem – weights can be both positive and negative, and just adding them can cancel each other out.  To address the issue of positive and negative weights canceling each other out, the sum of the squares of the weights is used. Squaring each weight ensures that every term is positive, regardless of the original sign of the weight. By adding the squares of the weights to the loss function, the model is penalized for having large weights – the larger the weight, the larger its square, and consequently, the larger the penalty.


**The Role of Weight Decay Coefficient**

1. **Risk of Too Much Penalty:** Simply adding the squares of the weights to the loss function can lead to another problem. The loss might become so dominated by these squares that the model might find setting all weights to zero as the best solution. This is not desirable, as it would make the model useless.

1. **Weight Decay Coefficient (wd):** To balance this, the sum of the squares of the weights is multiplied by a small number called the weight decay coefficient (wd). This coefficient controls how much we penalize the weights, preventing the loss from becoming excessively large due to the weight penalty.

1. **Tuning the Weight Decay:** The weight decay coefficient is a hyperparameter that needs to be chosen carefully. If it's too large, the model is overly penalized for having weights far from zero, which might hamper its ability to learn from data. If it's too small, it might not sufficiently prevent overfitting.

**Why This Leads to Smaller Weight Updates**

1. **Regularization Effect:** The weight decay term ensures that during each update, weights are pushed towards smaller values. This is because, in addition to moving in the direction that reduces the original loss, weights are also "shrinking" due to the weight decay term.

1. **Preventing Overfitting:** By penalizing large weights, weight decay helps in preventing overfitting. Overfitting often occurs when weights grow too large, fitting the training data too closely, including its noise.

1. **Smaller and More General Weights:** Over the course of training, this consistent push towards smaller weights leads to a model that, while trying to minimize the original loss, also maintains a more generalized set of weights that are less likely to overfit.

In summary, the sum of squares of weights in weight decay leads to smaller weight updates because it adds a term to the weight update rule that is proportional to the size of the weight itself, effectively shrinking larger weights more significantly. This results in a model that learns to balance fitting the training data and maintaining smaller, more general weights, which helps in preventing overfitting.

### <span style="color: #7b6b59;">How do you choose the weight decay rate?</span>

Choosing the weight decay rate is a trade-off between regularization and flexibility. There is no universal formula or rule for setting the weight decay rate, as it depends on many factors, such as the data, the model, the learning rate, and the optimization algorithm. However, some general guidelines are to start with a small weight decay rate and increase it gradually until an improvement in validation or test performance is seen. Validation sets or cross-validation can be used to evaluate the effect of different rates on model performance and select the one that minimizes validation error. Additionally, grid searches or random searches can be employed to explore a range of rates and find the optimal one for a given problem. ***Finally, learning rate schedulers or adaptive optimizers can adjust the weight decay rate dynamically based on learning progress.***


### <span style="color: #7b6b59;">How do you compare weight decay with other regularization methods?</span>

Weight decay is one of the most common and effective regularization methods for neural networks, but it is not the only one. Other regularization methods that can be used alone or in combination with weight decay include noise injection, dropout, batch normalization, early stopping, and data augmentation. Each of these methods has its own advantages and disadvantages, making the best choice dependent on the specific problem and data. As a general rule, use weight decay as a baseline regularization method as it is simple, effective, and widely applicable. For noisy, sparse, or imbalanced data or for very deep or complex networks, noise injection or dropout is recommended. Batch normalization or early stopping should be used when a network suffers from slow or unstable convergence or when the learning rate is too high or too low. Data augmentation should be used when data is limited, simple, or homogeneous or when the network is very flexible or expressive.


In [81]:
def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
    """
    Prepares the optimizer parameters for different parts of the model with specific learning rates and weight decay settings.

    Args:
        model: The neural network model for which the optimizer parameters are to be prepared.
        encoder_lr (float): The learning rate to be applied to the encoder part of the model.
        decoder_lr (float): The learning rate to be applied to the decoder part of the model.
        weight_decay (float, optional): The weight decay rate to be applied. Defaults to 0.0.

    Returns:
        list: A list of dictionaries, where each dictionary contains parameters ('params'), a learning rate ('lr'), and a weight decay rate ('weight_decay'). The list has three elements:
              1. Parameters of the model's encoder part with weight decay (excluding biases and LayerNorm parameters).
              2. Parameters of the model's encoder part without weight decay (only biases and LayerNorm parameters).
              3. Parameters outside the 'model' namespace, typically custom layers, with a separate learning rate and no weight decay.

    Example:
        optimizer_parameters = get_optimizer_params(model, encoder_lr=1e-5, decoder_lr=1e-4, weight_decay=0.01)
    """
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    
    optimizer_parameters = [
        # The condition if not any(nd in n for nd in no_decay) checks if the name of the parameter does
        # not include any of the strings listed in no_decay (like "bias" or "LayerNorm").
        {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)], # n, p represent the name and parameter, respectively.
         'lr': encoder_lr, 'weight_decay': weight_decay},
        {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
         'lr': encoder_lr, 'weight_decay': 0.0},
        # This condition checks if the string "model" is not present in the name (n) of each parameter.
        # Essentially, this filters out parameters that are specifically part of model.model.
        # In other words, it's selecting parameters that belong to model but not to the nested model.model.
        {'params': [p for n, p in model.named_parameters() if "model" not in n],
         'lr': decoder_lr, 'weight_decay': 0.0}
    ]
    return optimizer_parameters

**AdamW Optimizer:**

AdamW introduces a decoupling of the weight decay from the optimization steps taken w.r.t. the loss function. In traditional L2 regularization (also known as weight decay), the regularization term is added directly to the loss function, and the optimization step considers this augmented loss. However, this approach can interfere with the adaptive learning rate property of Adam. AdamW, proposed by Loshchilov and Hutter in their paper "Decoupled Weight Decay Regularization," separates the weight decay from the loss optimization, applying it directly to the weights themselves after the optimization step based on the loss function. This allows for better training stability and performance, especially in the context of complex deep learning models.

***Differences Between Adam and AdamW***

1. **Weight Decay Integration:** The key difference lies in how weight decay regularization is applied. Adam applies weight decay together with the loss gradient, which can lead to suboptimal updates due to the interaction with Adam's moment estimation. AdamW applies weight decay directly to the parameters after the gradient update, which is more in line with the original intention of L2 regularization.

1. **Performance in Practice:** AdamW can often lead to better generalization than Adam, especially in tasks that are sensitive to the settings of weight decay and require careful regularization.


In [82]:
optimizer_parameters = get_optimizer_params(
    model,
    encoder_lr=CFG.encoder_lr, 
    decoder_lr=CFG.decoder_lr,
    weight_decay=CFG.weight_decay
)

In [83]:
optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr)

### <span style="color: #7b6b59;">Concept 3: Learning Rate Schedulers</span>

#### What is the Learning Rate in Deep Learning?

Neural networks have many **hyperparameters** that affect the model’s performance. **One of the essential hyperparameters is the learning rate (LR), which determines how much the model weights change between training steps.** Learning rate is one of the most important hyperparameters in the training of neural networks, impacting the speed and effectiveness of the learning process. In the context of machine learning, the learning rate is a hyperparameter that determines the step size at which an optimization algorithm (like gradient descent) proceeds while attempting to minimize the loss function. In the simplest case, the LR value is a fixed value between 0 and 1. However, choosing the correct LR value can be challenging. 

- On the one hand, a large learning rate can help the algorithm to converge quickly. But it can also cause the algorithm to bounce around the minimum without reaching it or even jumping over it if it is too large. 
- On the other hand, a small learning rate can converge better to the minimum. However, the optimizer may take too long to converge or get stuck in a plateau if it is too small.

A learning rate that is too high can cause the model to oscillate around the minimum, while a learning rate that is too low can cause the training process to be very slow or even stall. **This section provides a visual introduction to learning rate schedulers, which are techniques used to adapt the learning rate during training.**

The importance of learning rate schedulers becomes evident when considering the dynamic nature of model training. As models traverse complex loss landscapes, a fixed learning rate may hinder convergence or cause overshooting. Learning rate schedulers address this challenge by adapting the learning rate based on the model’s performance during training. This adaptability is crucial for avoiding divergence, accelerating convergence, and facilitating the discovery of optimal model parameters.

#### What is a Learning Rate Scheduler?

- A learning rate scheduler is a method that adjusts the learning rate during the training process, often lowering it as the training progresses. This helps the model to make large updates at the beginning of training when the parameters are far from their optimal values, and smaller updates later when the parameters are closer to their optimal values, allowing for more fine-tuning.

One solution to help the algorithm converge quickly to an optimum is to use a learning rate scheduler. **A learning rate scheduler adjusts the learning rate according to a pre-defined schedule during the training process.**

***One solution to help the algorithm converge quickly to an optimum is to use a learning rate scheduler.***

Usually, the learning rate is set to a higher value at the beginning of the training to allow faster convergence. As the training progresses, the learning rate is reduced to enable convergence to the optimum and thus leading to better performance. Reducing the learning rate over the training process is also known as **annealing** or **decay**.

<img width="797" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/05db42fd-1bd2-4dba-87bb-44e045b74767">

An easy start is to use a constant learning rate in gradient descent algorithm. But you can do better with a learning rate schedule. A schedule is to make learning rate adaptive to the gradient descent optimization procedure, so you can increase performance and reduce training time.

In the neural network training process, data is feed into the network in batches, with many batches in one epoch. Each batch triggers one training step, which the gradient descent algorithm updates the parameters once. However, usually the learning rate schedule is updated once for each training epoch only.

You can update the learning rate as frequent as each step but usually it is updated once per epoch because you want to know how the network performs in order to determine how the learning rate should update. Regularly, a model is evaluated with validation dataset once per epoch.

There are multiple ways of making learning rate adaptive. At the beginning of training, you may prefer a larger learning rate so you improve the network coarsely to speed up the progress. In a very complex neural network model, you may also prefer to gradually increasse the learning rate at the beginning because you need the network to explore on the different dimensions of prediction. At the end of training, however, you always want to have the learning rate smaller. Since at that time, you are about to get the best performance from the model and it is easy to overshoot if the learning rate is large.

Therefore, the simplest and perhaps most used adaptation of the learning rate during training are techniques that reduce the learning rate over time. These have the benefit of making large changes at the beginning of the training procedure when larger learning rate values are used and decreasing the learning rate so that a smaller rate and, therefore, smaller training updates are made to weights later in the training procedure.

This has the effect of quickly learning good weights early and fine-tuning them later.

#### Different Learning Rate Schedulers

The amount of different learning rate schedulers can be overwhelming. Thus, this section aims to give you an overview of how different pre-defined learning rate schedulers in PyTorch adjust the learning rate during training:

1. **StepLR:** The StepLR reduces the learning rate by a multiplicative factor after every predefined number of training steps.
    
    <img width="842" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/b8cfa2bf-25f8-4591-b839-87ff77ae1eb2">
    
1. **MultiStepLR:** The MultiStepLR — similarly to the StepLR — also reduces the learning rate by a multiplicative factor but after each pre-defined milestone.
    
    <img width="780" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/c7682b7c-6dbb-45ed-a9cb-a07b83b8d770">

1. **ConstantLR:** The ConstantLR reduces learning rate by a multiplicative factor until the number of training steps reaches a pre-defined milestone. As you might have already noticed, if your starting factor is smaller than 1, this learning rate scheduler increases the learning rate over the course of the training process instead of decreasing it.

    <img width="992" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/ab72106f-b673-424f-abf9-b456c1fb7069">

1. **LinearLR:** The LinearLR — similarly to the ConstantLR— also reduces the learning rate by a multiplicative factor at the beginning of the training. But it linearly increases the learning rate over a defined number of training steps until it reaches its originally set learning rate. If your starting factor is smaller than 1, this learning rate scheduler also increases the learning rate over the course of the training process instead of decreasing it.
    
    <img width="948" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/f9404d66-7aa8-4a91-bdf3-8d4330c6f10e">
 
1. **ExponentialLR:** The ExponentialLR reduces learning rate by a multiplicative factor at every training step.

    <img width="941" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/b649f3f2-9137-410a-8598-177427d30068">

1. **PolynomialLR:** The PolynomialLR reduces learning rate by using a polynomial function for a defined number of steps
    
    <img width="851" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/893c40c8-0d35-4fc9-a6e1-971d2000e33d">

1. **CosineAnnealingLR:** The CosineAnnealingLR reduces learning rate by a cosine function. While you could technically schedule the learning rate adjustments to follow multiple periods, the idea is to decay the learning rate over half a period for the maximum number of iterations. ***Philipp Singer and Yauhen Babakhin, two Kaggle Competition Grandmasters, recommend using cosine decay as a learning rate scheduler for deep transfer learning.***
    
    <img width="874" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/80f3e558-a7ee-4f4b-bee1-73d265fd979d">

1. **CosineAnnealingWarmRestartsLR:** The CosineAnnealingWarmRestarts is similar to the cosine annealing schedule. However, it allows you to restart the LR schedule with the initial LR at, e.g., each epoch. This is called a warm restart and was introduced in 2017. Increasing the LR causes the model to diverge. However, this intentional divergence enables the model to escape local minima and find an even better global minimum.
    
    <img width="974" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/229a407a-2813-416b-954f-c66b55976135">
    
1. **CyclicLR:** The CyclicLR adjusted the learning rate according to a cyclical learning rate policy, which is based on the concept of warm restarts which we just discussed in the previous section. In PyTorch there are three built-in policies.
    
    <img width="975" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/7b07a4d0-b7cc-46dd-b7ae-aab6ec686603">
    <img width="1118" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/a40fd6f7-d3dd-4410-b34e-77ed8a2a1090">

1. **OneCycleLR:** The OneCycleLR reduces learning rate according to the 1cycle learning rate policy, which was introduced in a paper in 2017. In contrast to many other learning rate schedulers, the learning rate is not only decreased over the training process. Instead, the learning rate increases from an initial learning rate to some maximum learning rate and then decreases again.
    
    <img width="937" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/f94f5316-af75-43c7-89e3-edd25943c728">
    <img width="1028" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/01cf8cda-555f-45d3-a53c-74b000284f22">

1. **ReduceLROnPlateauLR:** The ReduceLROnPlateau reduces the learning rate by when the metric has stopped improving. As you can guess, this is difficult to visualize because the learning rate reduction timing depends on your model, data, and hyperparameters.

1. **Custom Learning Rate Schedulers with Lambda Functions:** If the built-in learning rate schedulers don’t fit your needs, you have the possibility to define a scheduler with lambda functions. The lambda function is a function that returns a multiplicative factor based on the epoch value.

#### Parameters and their Significance

- **optimizer:** Establishes the connection between the PyTorch learning rate scheduler and the optimizer responsible for updating the model parameters.
- **step_size:** Dictates the number of epochs between each adjustment of the learning rate, influencing how often the learning rate is updated during training.
- **gamma:** Scales the learning rate after each step, controlling the rate at which the learning rate decays or grows.
- **last_epoch:** A parameter that aids in resuming training from a specific epoch, providing flexibility in model development and training management.

#### Conclusion
Learning rate schedulers are an important tool in the machine learning practitioner’s toolkit, providing a mechanism to adjust the learning rate over time, which can help to improve the efficiency and effectiveness of the training process. The best learning rate scheduler to use can depend on the specific problem and dataset, and it is often helpful to experiment with different schedulers to see which one works best. 

**Tips for Using Learning Rate Schedules**

1. **Increase the initial learning rate.** Because the learning rate will very likely decrease, start with a larger value to decrease from. A larger learning rate will result in a lot larger changes to the weights, at least in the beginning, allowing you to benefit from the fine-tuning later. To select a learning rate schedule, a common practice is to start with a value that’s not too small, e.g., 0.5, and then exponentially lower it to get the smaller values, such as 0.01, 0.001, 0.0001;

1. **Use a large momentum.** Many optimizers can consider momentum. Using a larger momentum value will help the optimization algorithm continue to make updates in the right direction when your learning rate shrinks to small values. To build an effective model, we should also factor in other hyperparameters, such as momentum, regularization parameters (dropout, early stopping etc.).

1. **Experiment with different schedules.** It will not be clear which learning rate schedule to use, so try a few with different configuration options and see what works best on your problem. Also, try schedules that change exponentially and even schedules that respond to the accuracy of your model on the training or test datasets.

1. Although oftentimes being the default optimizer in deep learning applications, `Adam` under the hood does not necessarily outperforms all the time; it can cause model divergence;

Now that you have seen a variety of different built-in PyTorch learning rate schedulers, you are probably curious about which learning rate scheduler you should choose for your Deep Learning project.

Unfortunately, the answer is not that easy — as often in life. For a while, ReduceLROnPlateau was a popular learning rate scheduler. Today, other approaches like CosineAnnealingLR and OneCycleLR or approaches with warm restarts like CosineAnnealingWarmRestarts and CyclicLR have been increasing in popularity.

Nonetheless, you might need to run a few experiments to determine which learning rate scheduler best suits your problem. **But, what we can say is that using any learning scheduler will most likely lead to better model performance.**

Some more learning rates in one plot.

<img width="845" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/15d72733-9844-49f5-8c06-d93cfdfd4885">



The function below, `get_scheduler`, is designed to select and configure a learning rate scheduler for an optimizer used in a machine learning model. The function takes three parameters: `cfg`, `optimizer`, and `num_train_steps`.

Here's a breakdown of its functionality:

- **Parameters:**

    - **cfg**: A configuration object, presumably an instance of the CFG class. It contains settings for the scheduler.
    - **optimizer:** The optimizer for which the scheduler will be applied. In machine learning, optimizers are algorithms or methods used to change the attributes of the neural network such as weights and learning rate to reduce the losses.
    - **num_train_steps:** The total number of training steps (or iterations) that the model will undergo during training.
    - **Scheduler Selection:** The function first checks the scheduler attribute of the cfg object. Based on this value, it selects the type of scheduler to be used:
        - If cfg.scheduler is 'linear', it selects a linear scheduler.
        - If cfg.scheduler is 'cosine', it selects a cosine scheduler.

- **Scheduler Configuration:**

    - **Linear Scheduler (get_linear_schedule_with_warmup):** This scheduler adjusts the learning rate linearly after an initial "warmup" period during which the learning rate increases linearly from 0 to the initial learning rate set in the optimizer.
        - **num_warmup_steps:** The number of steps for the warmup phase.
        - **num_training_steps:** Total number of training steps.

    - **Cosine Scheduler (get_cosine_schedule_with_warmup):** This scheduler also starts with a warmup period like the linear scheduler. After the warmup, it adjusts the learning rate following a cosine curve, typically decreasing it gradually. This can be useful for converging to a better final solution.
        - **num_warmup_steps:** The number of steps for the warmup phase.
        - **num_training_steps:** Total number of training steps.
        - **num_cycles:** The number of cycles in the cosine curve; 0.5 means it will use half a cosine curve, which is common in practice.

- **Return Value:** The function returns the configured scheduler.

In the provided `CFG class`, the scheduler is set to `'cosine'`, which means the `get_scheduler` function will configure and return a cosine learning rate scheduler with the specified num_warmup_steps (0 in this case) and num_cycles (0.5). The number of training steps (num_train_steps) must be provided when the function is called.  In the context of machine learning and particularly in training neural networks, a warmup period is a phase at the beginning of the training process where the learning rate is gradually increased from a lower value to its initially defined value. This concept is often used in conjunction with learning rate schedulers, such as the linear and cosine schedulers you have in your code. Here's a detailed explanation of the warmup period:

1. **Purpose of Warmup:**

    - **Stabilizing Training:** Starting with a high learning rate can cause the model's parameters to change too rapidly, leading to unstable training. A warmup period helps by starting from a lower learning rate, allowing the model to gradually adapt to the learning process.
    - **Preventing Early Divergence:** A lower learning rate at the beginning helps in avoiding divergence of the model's loss, which can occur if the steps taken are too large.
    - **Improving Convergence:** Gradually increasing the learning rate can lead to better overall convergence, as the model initially makes small, careful steps towards the minimum loss, before taking larger steps.

1. **How It Works:** During the warmup period, the learning rate starts from a small value (often close to zero) and gradually increases to the pre-set learning rate of the optimizer. The length of the warmup period is determined by the num_warmup_steps. This is a hyperparameter that can be tuned based on the model and dataset.

1. **Implementation:** In your code, during the warmup phase, the scheduler adjusts the learning rate linearly from a smaller value up to the optimizer's initial learning rate. After the warmup period is completed, the scheduler switches to its main schedule (linear or cosine) for the rest of the training process.

1. **Application in Different Schedulers:**

    - **Linear Scheduler with Warmup:** Post warmup, the learning rate decreases linearly from its initial value to zero.
    - **Cosine Scheduler with Warmup:** After the warmup, the learning rate follows a cosine curve (or part of it, based on num_cycles), typically decreasing in a non-linear fashion.
    
<img width="1085" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/592f872b-bfba-4419-89a3-23d531a18225">


In [84]:
def get_scheduler(cfg, optimizer, num_train_steps):
    """
    Creates a learning rate scheduler based on the configuration provided.

    This function supports the creation of two types of schedulers: 'linear' and 'cosine'.
    The 'linear' scheduler uses a linear schedule with warmup, whereas the 'cosine'
    scheduler uses a cosine schedule with warmup.

    Args:
        cfg: A configuration object with attributes 'scheduler', 'num_warmup_steps', and optionally 'num_cycles'.
             - cfg.scheduler (str): Type of scheduler to use ('linear' or 'cosine').
             - cfg.num_warmup_steps (int): Number of warmup steps for the scheduler.
             - cfg.num_cycles (float, optional): Number of cycles for the cosine scheduler. Required if cfg.scheduler is 'cosine'.
        optimizer: The optimizer for which the scheduler is being created.
        num_train_steps (int): Total number of training steps.

    Returns:
        A learning rate scheduler configured as per the provided configuration.

    Raises:
        ValueError: If the scheduler type specified in cfg.scheduler is not supported.
    """
    if cfg.scheduler == 'linear':
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
        )
    elif cfg.scheduler == 'cosine':
        scheduler = get_cosine_schedule_with_warmup(
            optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
        )
    else:
        raise ValueError("Unsupported scheduler type provided.")
    return scheduler

### <span style="color: #7b6b59;">Concept 4: Automatic Mixed Precision (AMP)</span>

#### Introduction

Mixed-precision training is one of the essential techniques that lets us significantly boost training speeds on modern GPUs. Sometimes, this can results in 2x to 3x speed-ups! Let’s see how this works.

**Using 32-Bit Precision**

When training deep neural networks on a GPU, we typically use a lower-than-maximum precision, namely, 32-bit floating point operations (in fact, PyTorch uses 32-bit floats by default).

In contrast, in conventional scientific computing, we typically use 64-bit floats. In general, a larger number of bits corresponds to a higher precision, which lowers the chance of errors accumulating during computations. As a result, 64-bit floating point numbers (also known as double-precision) have long been the standard in scientific computing due to their ability to represent a wide range of numbers with higher accuracy.

However, in deep learning, using 64-bit floating point operations is considered unnecessary and computationally expensive since 64-bit operations are generally more costly, and GPU hardware is also not optimized for 64-bit precision. So instead, 32-bit floating point operations (also known as single-precision) have become the standard for training deep neural networks on GPUs.

In the context of floating-point numbers, “bits” refer to the binary digits used to represent a number in a computer’s memory. The more bits used to represent a number, the higher the precision and the greater the range of values that can be represented. In floating-point representation, numbers are stored in a combination of three parts: the sign, the exponent, and the significand (or mantissa).

<img width="862" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/c6cbd706-c038-45e7-b103-75ce2130b4c3">

So, coming back to the motivation behind using a lower precision, there are essentially two main reasons why 32-bit floating point operations are preferred over 64-bit when training deep neural networks on a GPU:

1. Reduced memory footprint. One of the primary advantages of using 32-bit floats is that they require half the memory compared to 64-bit floats. This allows for more efficient use of GPU memory, enabling the training of larger models (and larger batch sizes).

1. Increased compute and speed. Since 32-bit floating point operations require less memory, GPUs can process them more quickly, leading to faster training times. This speedup is crucial in deep learning, where training complex models can take days or even weeks.

**From 32-Bit to 16-Bit Precision**

Now that discussed the benefits of 32-bit floats, can we go even further? Yes, we can! Recently, mixed-precision training has become a common training scheme where we temporarily use 16-bit precision for floating point computation, which often referred to as “half” precision.

<img width="915" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/9caed63d-b116-497b-b381-4264b81b6c6e">

As shown in the figure above, float16 uses three fewer bits for the exponent and 13 fewer bits for the fractional value.

Deep Neural Network training has traditionally relied on IEEE single-precision format, however with mixed precision, you can train with half precision while maintaining the network accuracy achieved with single precision. This technique of using both single- and half-precision representations is referred to as **mixed precision technique**. Mixed precision methods combine the use of different numerical formats in one computational workload. This document describes the application of mixed precision to deep neural network training.

- **IEEE single-precision floating point computer numbering format**, is a binary computing format that occupies **4 bytes** (32 bits) in computer memory
- In computing, **half precision** is a binary floating-point computer number format that occupies 16 bits in computer memory.

There are numerous benefits to using numerical formats with lower precision than 32-bit floating point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth which speeds up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with **Tensor Core** support for that precision. Mixed precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full precision training. ***It does so by identifying the steps that require full precision and using 32-bit floating point for only those steps while using 16-bit floating point everywhere else.***

**Benefits of Mixed precision training**

- Speeds up math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
- Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
- Reduces memory requirements for training models, enabling larger models or larger minibatches.

*Nuance Research advances and applies conversational AI technologies to power solutions that redefine how humans and computers interact. The rate of our advances reflects the speed at which we train and assess deep learning models. With Automatic Mixed Precision, we’ve realized a 50% speedup in TensorFlow-based ASR model training without loss of accuracy via a minimal code change. We’re eager to achieve a similar impact in our other deep learning language processing applications.*, Wenxuan Teng, Senior Research Manager, Nuance Communications

#### Tensor cores

Tensor cores are specialized hardware units designed to accelerate deep learning computations, found in certain NVIDIA GPUs starting from the Volta architecture and beyond (including Turing, Ampere, and newer generations). These cores are specifically optimized for performing mixed-precision arithmetic operations, which are commonly used in machine learning and deep learning tasks.

Key Features of Tensor Cores:

1. **Mixed-Precision Computing:** Tensor cores are designed to perform operations in mixed precision, primarily using 16-bit floating-point (FP16) or 8-bit integer (INT8) formats for inputs and computations, while accumulating results in a higher precision format like 32-bit floating-point (FP32). This approach helps in speeding up computations without significantly impacting the model's accuracy.

1. **Matrix Operations Acceleration:** They are particularly efficient at accelerating matrix multiplication operations, which are at the heart of deep learning computations, especially in neural network training and inference. For example, a common operation performed by tensor cores is the matrix multiply-and-accumulate operation (WMMA - Warp Matrix Multiply-Accumulate).

1. **Increased Throughput:** By performing operations in lower precision and leveraging the specialized hardware, tensor cores can significantly increase the throughput of deep learning operations compared to using traditional CUDA cores alone. This results in faster training times and more efficient inference.

1. **Energy Efficiency:** Mixed-precision computations are not only faster but also more energy-efficient, which is crucial for scaling up deep learning models and deploying them in power-constrained environments.

#### Mixed-Precision Training Mechanics

**The use of both FP16 and FP32 is the reason this technique is called mixed-precision training.**

**Mixed precision** training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps (Enabling mixed precision involves two steps):

1. Porting the model to use the FP16 data type where appropriate.
1. Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

Deep learning researchers and engineers can easily get started enabling this feature on **Ampere**, **Volta** and **Turing** GPUs. On Ampere GPUs, automatic mixed precision uses FP16 to deliver a performance boost of 3X versus TF32, the new format which is already ~6x faster than FP32. On Volta and Turing GPUs, automatic mixed precision delivers up to 3X higher performance vs FP32 with just a few lines of code.

<img width="1121" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/ec6bbedf-82e8-43e2-bf0f-a3d0d3b2ee72">


Mixed precision is the combined use of different numerical precisions in a computational method.

- **Half precision (also known as FP16)** data compared to higher precision FP32 vs FP64 reduces memory usage of the neural network, allowing training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.

- **Single precision (also known as 32-bit)** is a common floating point format (float in C-derived programming languages), and 64-bit, known as double precision (double). 

Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including:

    - image processing and understanding
    - language modeling
    - language translation
    - speech processing
    - game playing, and many others.

DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. One way to lower the required resources is to use lower-precision arithmetic, which has the following benefits.

1. **Decrease the required amount of memory.** Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger mini-batches.

1. **Shorten the training or inference time.** Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers.

*Figure 1. Training curves for the bigLSTM English language model shows the benefits of the mixed-precision training techniques. The Y-axis is training loss. Mixed precision without loss scaling (grey) diverges after a while, whereas mixed precision with loss scaling (green) matches the single precision model (black).*

<img width="891" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/4ccef941-d4fb-4c3b-8f72-0e602c729743">

**Since DNN training has traditionally relied on IEEE single-precision format, this guide will focus on how to train with half precision while maintaining the network accuracy achieved with single precision (as Figure 1). This technique is called mixed-precision training since it uses both single and half-precision representations.**

It’s called “mixed-” rather than “low-“precision training because we don’t transfer all parameters and operations to 16-bit floats. Instead, we switch between 32-bit and 16-bit operations during training, hence, the term “mixed” precision.

As illustrated in the figure below, mixed-precision training involves converting weights to lower-precision (FP16) for faster computation, calculating gradients, converting gradients back to higher-precision (FP32) for numerical stability, and updating the original weights with the scaled gradients.

This approach allows for efficient training while maintaining the accuracy and stability of the neural network.

<img width="930" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/d979230c-a7fc-4ad4-aab0-3363d81e1c8b">

To combat this, a master copy of the weights is stored in FP32. This is converted into FP16 during part of each training iteration (one forward pass, back-propagation and weight update). At the end of the iteration, the weight gradients are used to update the master weights during the optimizer step.

<img width="818" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/3eb8e5ce-2608-406c-b934-6fe6f33bcede">


In more detail, the steps are as follows.

1. **Convert weights to FP16:** In this step, the weights (or parameters) of the neural network, which are initially in FP32 format, are converted to lower-precision FP16 format. This reduces the memory footprint and allows for faster computation, as FP16 operations require less memory and can be processed more quickly by the hardware.

1. **Compute gradients:** The forward and backward passes of the neural network are performed using the lower-precision FP16 weights. This step calculates the gradients (partial derivatives) of the loss function with respect to the network’s weights, which are used to update the weights during the optimization process.

1. **Gradient Scaling: `torch.cuda.amp.GradScaler`** If the forward pass for a particular op has float16 inputs, the backward pass for that op will produce float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values will flush to zero (“underflow”), so the update for the corresponding parameters will be lost. To prevent underflow, “gradient scaling” multiplies the network’s loss(es) by a scale factor and invokes a backward pass on the scaled loss(es). Gradients flowing backward through the network are then scaled by the same factor. In other words, gradient values have a larger magnitude, so they don’t flush to zero. Each parameter’s gradient (.grad attribute) should be unscaled before the optimizer updates the parameters, so the scale factor does not interfere with the learning rate. An instance scaler of GradScaler. Helps perform the steps of gradient scaling conveniently.As was shown in the previous section, successfully training some networks requires gradient value scaling to keep them from becoming zeros in FP16. This can be achieved with a single multiplication. You can scale the loss values computed in the forward pass, before starting backpropagation. By the chain rule, backpropagation ensures that all the gradient values of the same amount are scaled. This requires no extra operations during backpropagation and keeps the relevant gradient values from becoming zeros and losing that gradient information. Weight gradients must be unscaled before weight update, to maintain the magnitude of updates the same as in FP32 training. It is simplest to perform this descaling right after the backward pass but before gradient clipping or any other gradient-related computations. This ensures that no hyperparameters (such as gradient clipping threshold, weight decay, etc.) have to be adjusted. While many networks match FP32 training results when all tensors are stored in FP16, some require updating an FP32 copy of weights. Furthermore, values computed by large reductions should be left in FP32. Examples of this include statistics (mean and variance) computed by batch-normalization, SoftMax. Batch-normalization can still take FP16 inputs and outputs, saving half the bandwidth compared to FP32, it’s just that the statistics and value adjustment should be done in FP32. This leads to the following high-level procedure for training:

    - `scaler.scale(loss)` multiplies a given loss by scaler’s current scale factor. Scales loss.  Calls backward() on scaled loss to create scaled gradients.
    - `scaler.step(optimizer)` safely unscales gradients and calls `optimizer.step()`. `scaler.step()` first unscales gradients of the optimizer's params. If gradients don't contain infs/NaNs, `optimizer.step()` is then called, otherwise, `optimizer.step()` is skipped.
    - Updates the scale for next iteration. `scaler.update()` scaler dynamically estimates the scale factor each iteration. To minimize gradient underflow, a large scale factor should be used. However, float16 values can “overflow” (become inf or NaN) if the scale factor is too large. Therefore, the optimal scale factor is the largest factor that can be used without incurring inf or NaN gradient values. scaler approximates the optimal scale factor over time by checking the gradients for infs and NaNs during every scaler.step(optimizer)

1. **Convert gradients to FP32:** After computing the gradients in FP16, they are converted back to the higher-precision FP32 format. This conversion is essential for maintaining numerical stability and avoiding issues such as vanishing or exploding gradients that can occur when using lower-precision arithmetic.

1. **Multiply by learning rate and update weights:** Now in FP32 format, the gradients are multiplied by a learning rate (a scalar value that determines the step size during optimization). Here, we can see the benefit of keeping the FP32 copy of the weights. As the learning rate is often small, when multiplied by the weight gradients they can often be tiny values. For FP16, any number with magnitude smaller than 2^(-24) will be equated to zero as it cannot be represented (this is the denormalized limit for FP16). Therefore, by completing the updates in FP32, these update values can be preserved.

1. The product from step 4 is then used to update the original FP32 neural network weights. The learning rate helps control the convergence of the optimization process and is crucial for achieving good performance.

Sum up:

1. Maintain a primary copy of weights in FP32.
1. For each iteration:
    1. Make an FP16 copy of the weights.
    1. Forward propagation (FP16 weights and activations).
    1. Multiply the resulting loss with the scaling factor S.
    1. Backward propagation (FP16 weights, activations, and their gradients).
    1. Multiply the weight gradient with 1/S.
    1. Complete the weight update (including gradient clipping, etc.).

The above procedure sounds quite complicated, but in practice, it’s pretty simple to implement. In the next section, we will see how we can use mixed-precision training for finetuning an LLM by changing just one line of code.

#### Automatic Mixed Precision
Using mixed precision training requires three steps:

1. Converting the model to use the float16 data type where possible.
1. Keeping float32 master weights to accumulate per-iteration weight updates.
1. Using loss scaling to preserve small gradient values.

Frameworks that support fully automated mixed precision training also support:

1. Automatic loss scaling and master weights integrated into optimizer classes
1. Automatic casting between float16 and float32 to maximize speed while ensuring no loss in task-specific accuracy

In those frameworks with automatic support, using mixed precision can be as simple as adding one line of code or enabling a single environment variable. Currently, the frameworks with support for automatic mixed precision are TensorFlow, PyTorch, and MXNet. Refer to NVIDIA Automatic Mixed Precision for Deep Learning for more information, along with the Frameworks section below.


#### Automatic Mixed Precision (AMP) with PyTorch

Provides the `torch.amp` module for seamless integration of mixed precision training. Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable, add these two lines of code into your existing training script:

```python

scaler = GradScaler()

with autocast():
    output = model(input)
    loss = loss_fn(output, target)

scaler.scale(loss).backward()

scaler.step(optimizer)

scaler.update()

```

### <span style="color: #7b6b59;">Concept 5: Gradient Accumulation</span>

#### Introduction

Learn how to use gradient accumulation to train models with large batch sizes in order to work around hardware limitations when GPU memory is a concern. Since we don’t have multiple GPUs available for tensor sharding, what can we do to train the model with larger batch sizes? **One workaround is gradient accumulation, where we modify the training loop.**

Gradient accumulation is a technique used to effectively increase the batch size for training deep learning models without requiring additional memory. It's particularly useful when the desired batch size cannot fit into the GPU's memory. Instead of updating model weights after each small batch, gradients from several batches are accumulated before a single update is made. This simulates the effect of a larger batch size. This technique allows for training with larger effective batch sizes than what might fit in GPU memory at once. By accumulating gradients over multiple mini-batches and only updating model weights after a specified number of steps, it simulates the effect of a larger batch size.

#### What is gradient accumulation?

Gradient accumulation is a way to virtually increase the batch size during training, which is very useful when the available GPU memory is insufficient to accommodate the desired batch size. In gradient accumulation, gradients are computed for smaller batches and accumulated **(usually summed or averaged)** over multiple iterations instead of updating the model weights after every batch. Once the accumulated gradients reach the target “virtual” batch size, the model weights are updated with the accumulated gradients.

If we set accumulation_steps to 2, then `zero_grad()` and `optimizer.step()` will only be called every second epoch. Consequently, running the modified training loop with `accumulation_steps=`2 will have the same effect as doubling the batch size.

For example, if we want to use a batch size of 256 but can only fit a batch size of 64 into GPU memory, we can perform gradient accumulation over four batches of size 64. (After processing all four batches, we will have the accumulated gradients equivalent to a single batch of size 256.) This allows us to effectively emulate a larger batch size without requiring larger GPU memory or tensor sharding across different devices.

While gradient accumulation can help us train models with larger batch sizes, it does not reduce the total computation required. In fact, it can sometimes lead to a slightly slower training process, as the weight updates are performed less frequently. Nevertheless, it allows us to work around limitations where we have very small batch sizes that lead to noisy updates.

1. The optimizer step (updating model weights) and resetting the gradients (`optimizer.zero_grad()`) are performed only after a specified number of batches have been processed, as indicated by `CFG.gradient_accumulation_steps`. This is where the actual accumulation happens: gradients computed on each batch are added up across multiple batches. In this snippet:

    - `scaler.step(optimizer)` is called to adjust the gradients and perform the optimizer step, but only after CFG.gradient_accumulation_steps batches have been processed. This is when the accumulated gradients are finally used to update the model parameters.
    - `scaler.update()` prepares the scaler for the next accumulation cycle.
    - `optimizer.zero_grad()` resets the gradients in the model parameters, ensuring that accumulation starts fresh for the next set of batches.


- `model.model.named_parameters()`: The use of `model.model.named_parameters()` suggests that the model (model) contains a nested model or a submodule (model.model). This is common in complex architectures where a model might encapsulate another model as one of its components. The first two groups are specifically targeting parameters of this nested model. By calling `named_parameters()` on `model.model`, the code is accessing parameters that are part of this inner model. Consider a scenario where model is a wrapper that contains a pre-trained model like BERT or ResNet as `model.model`. In this case, the first two groups are meant to configure parameters of that pre-trained component.
    - ***First Group - Parameters of the Pretrained Model with Weight Decay***
    - ***Second Group - Parameters of the Pretrained Model without Weight Decay***
- `model.named_parameters()`: The third group uses `model.named_parameters()`. This targets parameters that are directly part of the model object, but not part of the nested `model.model`. This might be used in a situation where the outer model has its own layers or components in addition to the nested `model.model`. For example, if model includes some custom layers or a different decoder mechanism on top of the pre-trained `model.model`, this group would be configuring those additional components.
    - ***Third Group - Custom Layers' Parameters, Typically Without Weight Decay***
    
- In summary, the code distinguishes between parameters of a nested model (handled in the first two groups) and parameters that are part of the outer model but not part of the nested model (handled in the third group). This distinction is crucial for applying different training configurations (like learning rates and weight decay settings) to different parts of a complex model architecture.

- **Why Differentiate:** Differentiating these groups allows for more nuanced control over the training process. For instance, we might want to use different learning rates or apply weight decay differently to the pretrained model's parameters versus the custom layers added to CustomModel.

- **Weight Decay Considerations:** Typically, weight decay is not applied to biases and normalization layers (like LayerNorm) because these parameters are not directly associated with the magnitude of the activations and thus don’t contribute as much to overfitting.


The decision to not apply weight decay to custom layers' parameters in a deep learning model is typically a strategic choice, influenced by several factors rather than being just a random decision. Here are some key reasons and considerations:

1. **Nature of the Custom Layers**
    - **Simplicity and Scale:** Custom layers added to a pretrained model are often simpler and have fewer parameters compared to the complex structures within the pretrained model. Since weight decay primarily targets large weights that can contribute to overfitting, its impact might be less significant on smaller, simpler custom layers.
    - **Specific Roles:** These layers often serve specific roles (like adaptation, transformation, or output formatting) that may not require aggressive regularization. Over-regularizing could hamper their ability to perform these roles effectively.
2. **Pretrained Model Characteristics**

    - **Already Regularized:** Pretrained models (like those from AutoModel) are usually trained on large datasets and have already undergone extensive regularization. Additional weight decay on these parts helps in fine-tuning without overfitting.
    - **Complexity and Overfitting:** These models have a high capacity and are more prone to overfitting when adapted to a new task or dataset. Weight decay in these layers is crucial to maintain their generalization ability.
3. **Fine-Tuning Dynamics**

    - **Different Learning Rates:** Often, custom layers are trained with a higher learning rate because they start from random initialization, unlike the pretrained layers. Applying weight decay with a higher learning rate to these layers might lead to instability or hinder their rapid adaptation to the new task.
    - **Balancing Adaptation and Stability:** The goal in fine-tuning is often to slightly adjust the pretrained layers while more significantly training the custom layers to adapt to the new task. Weight decay on custom layers might counteract this adaptation process.

4. **Empirical Results and Task-Specific Factors:**
    - **Empirical Observations:** In many cases, the decision is backed by empirical testing. Models might be trained with various configurations, and the setup that yields the best results on validation data is chosen.

    - **Task Dependency:** The necessity of weight decay on custom layers can vary depending on the specific task and data. For some tasks, applying weight decay to all layers might yield better results.

**Conclusion**

The choice to exclude weight decay from custom layers is often a considered decision based on the model's architecture, the nature of the task, and empirical results. It strikes a balance between allowing the custom layers to learn effectively from the new data and regulating the pretrained layers to leverage their learned representations without overfitting. This approach is not universal and can vary based on the specific requirements and observations of different models and tasks.












### <span style="color: #7b6b59;">Concept 6: PyTorch Training Loop</span>

**Writing the Custom Training Loop from Scratch**

Unlike TensorFlow, PyTorch requires us to write a training loop in pure Python. After specifying the custom model architecture, we instantiated it and defined a loss function and an optimization algorithm to train our network. We have chosen Cross Entropy Loss as the loss function to backpropagate the errors (as we have a multi-class single-label problem), and Adam optimizer as the optimization algorithm. Taking all these components together, we will now write a custom training loop from scratch. Finally, we will write a simplistic and clean version of our training routine using PyTorch's API support for optimization called `torch.optim`.

- **Optimizer:** Optimization is the process of adjusting model parameters to reduce model error in each training step. Optimization algorithms define how this process is performed (in this example we use Stochastic Gradient Descent). All optimization logic is encapsulated in the optimizer object. Here, we use the ADAM optimizer; additionally, there are many different optimizers available in PyTorch such as SGD and RMSProp, that work better for different kinds of models and data. We initialize the optimizer by registering the model’s parameters that need to be trained, and passing in the learning rate hyperparameter. `optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)`
- **16-bit precision, Automatic Mixed Precision:** In a regular training loop, PyTorch stores all float variables in 32-bit precision. For people who are training their models with strict constraints, sometimes, this can cause their model to take up too much memory, forcing them to have a slower training process with a smaller model and a smaller batch size. However, storing all the variables/numbers in the model in 16-bit precision can improve upon and fix most of these problems, like dramatically decreasing the memory consumption of the model and speeding up the training loop while still maintaining the same performance/accuracy of the model. Converting all calculations to 16-bit precision in Pytorch is very simple to do and only requires a few lines of code. When you are doing backward propagation with loss and the optimizer, instead of doing loss.backward() and optimizer.step(), you need to do `scaler.scale(loss).backward` and `scaler.step(optimizer)`. This allows your scaler to convert all the gradients and do all the calculations in 16-bit precision. When you are doing everything with 16-bit precision, there may be some numerical instability that causes some functions that you may use to not work properly. Only certain operations work correctly in 16-bit precision. One common error in any large deep learning model is the problem of underflowing gradients (i.e., your gradients are too small to take into account). float16 tensors often don't take into account extremely small variations. To prevent this, we can scale our gradients by some factor, so they aren't flushed to zero. Not to be confused with vanishing gradients, these gradients might contribute to the learning process but are skipped because of computational limits.
- **Gradient Accumulation:** If you run into a CUDA out of memory error, this means that you have exceeded your computational resources. To fix this, there are several things you can do, including converting everything to 16-bit precision as I mentioned above, reducing the batch size of your model, and reducing the num_workers parameter when creating your Dataloaders. However, sometimes, switching to 16-bit precision and reducing num_workers may not completely fix the problem. The most direct way to fix the problem is to reduce your batch size, but suppose that you don’t want to reduce your batch size. If you don’t want to reduce your batch size, you can use gradient accumulation to stimulate your desired batch size. Note that another solution to the CUDA out of memory issue is simply to use more than one GPU, but this is an option not accessible to many people. Suppose that your machine/model can only support a batch size of 16 and increasing it results in a CUDA out of memory error, and you want to have a batch size of 32. Gradient accumulation works by running the model with a batch size of 16 twice, accumulating the gradients computed for each batch, and finally doing an optimizer step after those 2 forward passes and accumulation of gradients. To understand gradient accumulation, it is important to understand what specific functions are done in training a neural network. 
    - `loss.backward()` creates and stores the gradients for the model. Calling loss.backward() twice before calling optimizer accumulates the gradients. 
    - but `optimizer.step()` actually updates the weights. 







Let’s see the procedure step by step:

1. **Step 1:** We instantiate the model and the optimizer
1. **Step 2:** We decide on a number of epochs
1. **Step 3:** We create a for loop that iterates through the epochs
1. **Step 4:** For each epoch, we set the model to training mode with `model.train()` and cycle through the train_loader. For each epoch, we open a for loop that iterates over the dataset, in batches. Gets a batch of training data from the DataLoader. 
1. **Step 5:** For each batch of the train_loader, we call the model on the input data to retrive the predictions, then we use them to compute a loss value. Performs an inference - that is, gets predictions from the model for an input batch. Calculates the loss for that set of predictions vs. the labels on the dataset.
1. **Step 6:** Bring the calculation of the derivatives to 0 with `optimizer.zero_grad()`. Tensors, variables, optimizers are all interconnected to one another via hidden global state. Also, don't forget to call model.zero_grad() before loss.backward(), or you won't get the right gradients for your variables. Call `optimizer.zero_grad()` to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration. Zeros the optimizer’s gradients
1. **Step 7:** Calculates the backward gradients over the learning weights. Calling `loss.backward()` on a loss tensor triggers backpropagation. PyTorch's automatic differentiation engine called **autograd** keeps track of every operation on the tensors that require gradients, creating a computation graph consisting of all the tensor operations tensors are subjected to. In neural networks, the weight tensors of the parameters are what we want to optimize. These tensors are, by default, the leaf tensors requiring gradient grad. As is explained above, we aim to optimize these model parameters (weights and biases) using some criterion which is called the model loss. The model loss is a function of the model output and the target tensor. The model output, eventually, is a function of all the model parameters. The target tensor is a fixed one and hence has nothing to do with the optimization process regarding adjusting and updating the neural network. Hence, the model parameters are the leaf tensors in the graph of the loss of the neural network, and they have their requires_grad attribute to True. As soon as we call `.backward()` on the loss tensor, a backward graph is constructed. The gradients of the loss (the tensor on which backward is called) concerning the leaf tensors (the model parameters ) are calculated, and their grad attribute of them is populated. **These gradients are what we are required to update our model parameters and what we eventually access using `weight.grad` or `bias.grad` to update the weights and biases, respectively.** Finally, mathematically, the gradients of the loss function concerning the model parameters are calculated, and an optimization algorithm uses these gradients to update the model parameters. Backpropagate the prediction loss with a call to `loss.backward()`. PyTorch deposits the gradients of the loss w.r.t. each parameter.
1. **Step 8:** Once that's done, your optimizer is magically aware of the gradients for each variable and can update its variables, which is done via `optimizer.step()`. Outside the scope, we retrieve the gradients of the weights of the model with regard to the loss Finally, we use the optimizer to update the weights of the model based on the gradients. Once we have our gradients, we call `optimizer.step()` to adjust the parameters by the gradients collected in the backward pass. Tells the optimizer to perform one learning step - that is, adjust the model’s learning weights based on the observed gradients for this batch, according to the optimization algorithm we chose. 


At this point the training loop is complete, and if you want you can integrate the same logic on the validation dataloader as written in the code.

In [85]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [86]:
def get_score(y_trues, y_preds):
    """
    Calculate the AUC score from true labels and predicted probabilities.

    Args:
        y_trues (array-like): True binary class labels.
        y_preds (array-like): Predicted probabilities for the positive class.

    Returns:
        float: The AUC score.
    """
    # Calculate AUC score
    auc_score = roc_auc_score(y_trues, y_preds)
    return auc_score

In [87]:
class AverageMeter(object):
    """Computes and stores the average and current value.

    Attributes:
        val (float): The current value.
        avg (float): The running average.
        sum (float): The sum of all values encountered.
        count (int): The number of values encountered.
    """
    
    def __init__(self):
        """Initializes the AverageMeter and resets all attributes."""
        self.reset()

    def reset(self):
        """Resets all attributes to their initial state."""
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        """Updates the running average and current value.

        Args:
            val (float): The new value to update with.
            n (int, optional): The weight of the new value, i.e., how many times to count `val`. Defaults to 1.
        """
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


In [88]:
def asMinutes(s):
    """Converts a time duration from seconds to a string in minute-second format.

    Args:
        s (float): The time duration in seconds.

    Returns:
        str: The time duration in 'minute m second s' format.
    """
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    """Calculates elapsed time since a start point and estimates remaining time.

    Args:
        since (float): The start time (usually obtained from `time.time()`).
        percent (float): The progress made as a fraction (between 0 and 1).

    Returns:
        str: A string indicating elapsed time and estimated remaining time in 'minute m second s' format.
    """
    now = time.time()
    s = now - since
    es = s / percent  # Estimated total time
    rs = es - s  # Remaining time
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))


When you call `model.train()`, you're setting the entire model to training mode. This affects all layers of the model, not just a few. In training mode, certain behaviors are enabled that are suitable for training. Setting the model to training mode ensures that these and potentially other layer-specific behaviors that are relevant during training are correctly applied. If you want to train only certain layers of a model while keeping others frozen (i.e., their parameters are not updated), you would typically set `requires_grad` attribute of the parameters of the layers you want to freeze to `False`. This way, the optimizer won't compute gradients for those parameters, and they won't be updated during the training process. This technique is often used in transfer learning and fine-tuning scenarios.


In [89]:
def train_fn(fold: int, 
             train_loader: DataLoader, 
             model: torch.nn.Module, 
             criterion: torch.nn.Module, 
             optimizer: torch.optim, 
             epoch: int, 
             scheduler, 
             device: torch.device
            ) -> float:
    """Trains the model for one epoch through the entire dataset.

    Args:
        fold (int): The current fold in cross-validation.
        train_loader (DataLoader): The DataLoader for the training data.
        model (torch.nn.Module): The model to be trained.
        criterion (function): The loss function used for training.
        optimizer (torch.optim.Optimizer): The optimizer used for training.
        epoch (int): The current epoch number.
        scheduler (torch.optim.lr_scheduler): The learning rate scheduler.
        device (torch.device): The device on which to train the model, 'cuda' or 'cpu'.

    Returns:
        float: The average loss for this training epoch.
    """
    
    # Set the model to training mode - important for batch normalization and dropout layers
    model.train() # We set the model to training mode
    
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex) # Create a gradient scaler 
    losses = AverageMeter()
    start = end = time.time()

    for step, (input_ids, attention_mask, labels) in enumerate(train_loader): # We open a for loop that iterates over the dataset, in batches.
        
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        
        with torch.cuda.amp.autocast(enabled=CFG.apex):
            # Compute prediction and loss
            # calls the forward, model.forward(inputs) - performs calculations of the network
            logits = model(input_ids, attention_mask) # For each batch of the train_loader, we call the model on the input data to retrive the predictions, forward pass
            loss = criterion(logits.view(-1), labels.float()) # Then we use them to compute a loss value
                    
            
        
        losses.update(loss.item(), batch_size)
        
        # computes the derivative of the loss tensor w.r.t. the parameters 
        # using backpropagation and thus populates the `grad` attribute of model parameters
        scaler.scale(loss).backward() # backward pass, backpropagate, computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.
        
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer) # causes the optimizer to take a step based on the gradients of the parameters
            scaler.update()
            optimizer.zero_grad() # clears old gradients from the last step (otherwise you’d accumulate the gradients from all loss.backward() calls).
           
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          lr=scheduler.get_lr()[0]))

    return losses.avg


1. **Initial Setup:** 
    - The function starts by preparing for validation. It initializes a utility (likely an `AverageMeter`) to keep track of the average loss across all validation batches. 
    - The model is set to **evaluation mode**, which is crucial for certain layers that behave differently during training and evaluation (like dropout and batch normalization layers).
    - An empty list is created to store the model's predictions on the validation data, and the start time of the validation process is noted for potentially measuring the duration of validation.

1. **Iterating Over Batches:** The function enters a loop where it processes the validation data in batches. For each batch, it performs the following steps:

    - **Data Preparation:** The input data (`input_ids` and `attention_mask`) and `labels` for the current batch are moved to the designated computing device (CPU or GPU).
    - **Model Inference:** With **gradient calculation disabled** (to save memory and computation time), the model performs a forward pass on the input data to generate logits. These logits are then converted into probabilities using the sigmoid function, which is typical for binary classification tasks. In PyTorch, the context manager `torch.no_grad()` is used to disable gradient calculation, which effectively means that no training takes place and no backpropagation occurs within its scope. This is useful in situations where you are only performing forward passes, such as during model evaluation or inference, and you want to reduce memory usage and speed up computations by not keeping track of the operations for gradients. Context-manager that disables gradient calculation. Disabling gradient calculation is useful for inference, when you are sure that you will not call `Tensor.backward()`. It will reduce memory consumption for computations that would otherwise have `requires_grad=True`.
    - **Loss Calculation:** The loss is calculated by comparing the model's logits against the true labels using a specified loss function (criterion). This loss represents how well the model's predictions match the expected outputs.
    - **Tracking Loss and Predictions:** The calculated loss for the batch is recorded using the AverageMeter utility to compute an average over the entire validation set. The probabilities (model predictions) are detached from the computation graph (to prevent memory leaks), transferred to the CPU, and stored in a list.

1. **Reporting Progress:** Periodically, or at least after processing the last batch, the function prints a status update. This update includes the current batch number, the total number of batches, the average loss up to that point, and the elapsed time since the start of validation.

1. **Finalization:** After all batches have been processed, the function concatenates all batch predictions into a single array. It then returns the average loss across all validation batches and the array of concatenated predictions.

In [90]:
def validation_loop(valid_loader: DataLoader, model: torch.nn.Module, criterion: torch.nn.Module, device: torch.device) -> Tuple[float, np.ndarray]:
    """
    Perform the validation loop over the given DataLoader.

    Args:
        valid_loader (DataLoader): The DataLoader to iterate over for validation.
        model (torch.nn.Module): The model to evaluate.
        criterion (torch.nn.Module): The loss function to use for evaluation.
        device (torch.device): The device to run the validation on, e.g., 'cuda' or 'cpu'.

    Returns:
        Tuple[float, np.ndarray]: A tuple containing the average loss over the validation set and the concatenated predictions.
    """
    losses = AverageMeter()  # Initialize an object to track the average loss.
    model.eval()  # Set the model to evaluation mode.

    preds = []  # List to store predictions.
    start = time.time()  # Record the start time.

    for step, (input_ids, attention_mask, labels) in enumerate(valid_loader):  # Iterate over batches.
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)

        with torch.no_grad():  # Disable gradient calculation for validation.
            logits = model(input_ids, attention_mask)  # Forward pass to get logits.
            probabilities = torch.sigmoid(logits)  # Convert logits to probabilities using sigmoid.
            loss = criterion(logits.view(-1), labels.float())  # Compute loss.

        losses.update(loss.item(), batch_size)  # Update the average loss.
        preds.append(probabilities.detach().cpu().numpy())  # Store predictions, detach from graph and move to CPU.
        end = time.time()  # Record the end time.

        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):  # Print progress.
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    
    predictions = np.concatenate(preds)  # Concatenate all batch predictions.
    return losses.avg, predictions  # Return average loss and predictions.

### <span style="color: #7b6b59;">Concept 7: Loss functions</span>

Loss functions are an important component of a neural network. Interfacing between the forward and backward pass within a Deep Learning model, they effectively compute how poor a model performs (how big its loss) is. In this section, we're going to cover how to use a variety of PyTorch loss functions for classification.

#### What is a loss function?

Training a Deep Learning model involves what I call a high-level training process. This process is visualized below. It all starts with a training dataset, which - in the case of classification and regression - contains a set of descriptive variables (features) that jointly are capable of predicting some target variable.

Training the Deep Learning model, which often is a neural network, involves sequentially performing a forward pass and a backward pass, followed by optimization. In the forward pass, the dataset is fed to the network (in a batched fashion). This leads to predictions for the targets, which can then be compared with the true labels. No prediction is perfect, and hence there will be an error value. Using this error value, the error can be computed backwards into the neural network using backpropagation. Subsequently, with an optimizer, the model can be changed slightly in the hope that it performs better next time. By repeating this process over and over again, the model can improve and learn to generate accurate predictions.

Let's get back to this error value. As the name suggests, it is used to illustrate how poorly the model performs. In Deep Learning jargon, this value is also called a loss value. It is computed by means of a loss function. There are many functions that can be used for this purpose. Choosing one depends on:

- the problem you are solving (i.e. classification or regression), 
- the characteristics of your dataset, 
- and quite frequently on trial and error. 

In the rest of this section, we're going to walk through a lot of loss functions available in PyTorch. Let's take a look!

<img width="1420" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/877f6cdf-ea71-4ec0-a5b4-10109fe96b43">


#### PyTorch Classification loss function examples

1. **Binary Cross-entropy loss, on Sigmoid (`nn.BCELoss`) example:** Binary cross-entropy loss is a good candidate for binary classification problems, where a classifier has two classes. Implementing binary cross-entropy loss with PyTorch is easy. It involves the following steps:
    - Ensuring that the output of your neural network is a value between 0 and 1. **Recall that the Sigmoid activation function can be used for this purpose.** This is why we apply `nn.Sigmoid()` in our neural network below.
 - Ensuring that you use `nn.BCELoss()` as your loss function of choice during the training loop.

1. **Binary Cross-entropy loss, on logits (`nn.BCEWithLogitsLoss`):** Simple binary cross-entropy loss (represented by `nn.BCELoss` in PyTorch) computes BCE loss on the predictions generated in the range [0, 1]. **However, it is possible to generate more numerically stable variant of binary cross-entropy loss by combining the Sigmoid and the BCE Loss into one loss function:** "This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability." In PyTorch, this is combined into the **`nn.BCEWithLogitsLoss`** function. The difference between `nn.BCEWithLogitsLoss` and `nn.BCELoss` is that BCE with Logits loss adds the Sigmoid function into the loss function. With simple BCE Loss, you will have to add Sigmoid to the neural network, whereas with BCE With Logits Loss you will not.




### <span style="color: #7b6b59;">Concept 8: Other Topics</span>


#### Scheduler Step

The `scheduler.step()` function is typically called to update the learning rate according to a specific policy defined by the learning rate scheduler you're using. Learning rate schedulers adjust the learning rate during training, which can lead to faster convergence and/or improved performance of the model.

- **In epoch-wise learning rate scheduling,** `scheduler.step()` is often called at the end of each epoch. This is common with schedulers like `StepLR`, where the learning rate is decreased by a certain factor after a specified number of epochs, or `MultiStepLR`, where the learning rate is reduced at specific epochs.

- **In batch-wise (or iteration-wise) learning rate scheduling,** `scheduler.step()` is called after each batch or iteration. This approach is used with schedulers like `OneCycleLR` or `CyclicLR`, which adjust the learning rate more frequently to allow for strategies like cyclic learning rates or learning rate warm-up followed by decay.

**Batch Scheduler**

The term "batch scheduler" as mentioned with if `CFG.batch_scheduler`: suggests a configuration option that, when enabled, indicates that the learning rate scheduler should be updated on a per-batch basis instead of the more common per-epoch basis. This means `scheduler.step()` is called after processing each batch instead of at the end of each epoch. This approach is used in certain training regimes where adjusting the learning rate more granularly can lead to better performance or faster convergence.

When using a batch scheduler, it's important to ensure that the scheduler and the training loop are correctly configured to adjust the learning rate as intended after each batch. This often involves careful consideration of the total number of steps (batches) in the training process, especially when using learning rate schedulers designed with epoch-wise adjustments in mind.


#### Gradient Checkpointing

Even when we set the batch size to 1 and use gradient accumulation we can still run out of memory when working with large models. In order to compute the gradients during the backward pass all activations from the forward pass are normally saved. This can create a big memory overhead. Alternatively, one could forget all activations during the forward pass and recompute them on demand during the backward pass. This would however add a significant computational overhead and slow down training.

Gradient checkpointing strikes a compromise between the two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. See [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) explaining the ideas behind gradient checkpointing. One way to use significantly less GPU memory is to enabled “Gradient Checkpointing” (also known as “activation checkpointing”). When enabled, a lot of memory can be freed at the cost of small decrease in the training speed due to recomputing parts of the graph during back-propagation. The slowdown will depend on the model but quite often it is around 20-30%.


Gradient checkpointing is a technique used to reduce the memory footprint of training deep neural networks at the cost of additional computation. It allows for the training of very deep networks that would otherwise not fit into the GPU memory by trading off computational overhead for memory efficiency.

**How Gradient Checkpointing Works**

The basic idea behind gradient checkpointing is to store only a subset of intermediate activations during the forward pass and then recompute the other activations during the backward pass as needed for gradient computation. This is opposed to the standard approach where all intermediate activations are stored during the forward pass to be used in the backward pass, which can consume a significant amount of memory for deep networks.

When gradient checkpointing is enabled:

- During the forward pass, only the activations at certain "checkpoint" layers are saved.
- During the backward pass, when the gradient of a layer is needed, the network re-computes the forward pass from the nearest preceding checkpoint up to that layer. This requires additional computation because some of the forward pass computations are done multiple times.


**Why Use Gradient Checkpointing**

The primary reason to use gradient checkpointing is to enable the training of larger models or to use larger batch sizes within the memory constraints of the hardware. This can be particularly useful when working with very deep networks, such as those used in certain domains like deep learning for computer vision or large transformer models in natural language processing.

**Considerations**

- Trade-off: The main trade-off with gradient checkpointing is between memory usage and computational overhead. While it significantly reduces memory usage, it also increases the computational burden because of the need to recompute activations during the backward pass.

- Implementation: Modern deep learning frameworks like PyTorch and TensorFlow offer built-in support or extensions for gradient checkpointing, making it easier to apply this technique without extensive modifications to the model code.

- Use Cases: It's particularly useful in scenarios where memory is a limiting factor and where the additional computation time is acceptable. For example, when training very large models or when using particularly memory-intensive operations.

In summary, gradient checkpointing is a valuable technique for training larger models within memory constraints by efficiently managing the trade-off between memory usage and computational cost.


#### Logits
Logits are the outputs of a neural network before the activation function is applied. They are the unnormalized probabilities of the item belonging to a certain class. Logits are often used in classification tasks, where the goal is to predict the class label of an input. **Converting the logits to probabilities makes understanding the neural network's final output easier.**

Logits typically refer to the raw, unnormalized outputs of the last layer of a neural network, just before the application of a softmax function (or another type of normalization that converts them into probabilities). In the context of classification tasks, logits are the outputs of the final linear layer of a neural network, and they represent the input to the softmax function.

**Understanding Logits in Detail:**

1. **For Binary Classification:** In a binary classification problem, a single logit can be output by the model, which represents the log-odds of the positive class. The **sigmoid function** is then applied to this logit to obtain the probability of the positive class.

1. **For Multi-Class Classification:** In multi-class classification problems, the model outputs a logit for each class. The **softmax function** is then applied to the vector of logits to obtain a probability distribution over all possible classes. The softmax function ensures that the output probabilities sum to 1 and are in the range [0, 1].

#### Last but not Least


We need terminologies like epochs, batch size, iterations only when the data is too big which happens all the time in machine learning and we can’t pass all the data to the computer at once. So, to overcome this problem we need to divide the data into smaller sizes and give it to our computer one by one and update the weights of the neural networks at the end of every step to fit it to the data given.

1. **Epochs:** One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE. I know it doesn’t make sense in the starting that — passing the entire dataset through a neural network is not enough. And we need to pass the full dataset multiple times to the same neural network. But keep in mind that we are using a limited dataset and to optimise the learning and the graph we are using Gradient Descent which is an iterative process. So, updating the weights with single pass or one epoch is not enough.

1. **Batch:**  Since one epoch is too big to feed to the computer at once we divide it in several smaller batches. You can’t pass the entire dataset into the neural net at once. So, you divide dataset into Number of Batches or sets or parts. `No. of batches = (Size of the entire dataset / batch size) + 1`

1. **Batch Size:** Total number of training examples present in a single batch. Note: Batch size and number of batches are two different things.

1. **Iteration / Step:** Iterations is the number of batches needed to complete one epoch. Note: The number of batches is equal to number of iterations for one epoch. More precisely, a training step (iteration) is one gradient update. `No. of training steps = No. of batches = No. of gradient updates` `No. of ALL gradient updates = No. of batches x No. of epochs
`



We can divide the dataset of 2000 examples into batches of 500 then it will take 4 iterations to complete 1 epoch.


### <span style="color: #7b6b59;">Put it all together for training</span>



In [91]:
def training_loop(train, fold, tokenizer):
    """Executes the training loop for a given fold of the data.

    Args:
        train (DataFrame): The training dataset containing features and labels.
        fold (int): The current fold index to be used for validation within a cross-validation scheme.
        tokenizer: The tokenizer instance to process the text data.

    Returns:
        numpy.ndarray: The predictions for the validation set of the current fold.
    """
    
    # Log the start of training for the current fold
    LOGGER.info(f"========== fold: {fold} training ==========")

    # Split the data into training and validation sets for the current fold
    train_folds = train[train["fold"] != fold].reset_index(drop=True)
    valid_folds = train[train["fold"] == fold].reset_index(drop=True)
    valid_labels = valid_folds["label"].values
    
    # Prepare the training dataset and loader
    train_dataset = DAIGTDataset(train_folds, tokenizer, CFG.max_len)
    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=CFG.batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True
    )
    
    # Prepare the validation dataset and loader
    valid_dataset = DAIGTDataset(valid_folds, tokenizer, CFG.max_len)
    valid_loader = DataLoader(
        dataset=valid_dataset,
        batch_size=batch_size * 2,
        shuffle=False,
        num_workers=4,
        pin_memory=True,
        drop_last=False
    )
    
    # Initialize a custom model instance with some configuration and possibly load pretrained weights.
    model = DAIGTModel(CFG, config_path=None, pretrained=True)

    # Save the model's configuration to a file. 'model.config' is assumed to contain the configuration
    # of the model which could include its architecture, hyperparameters, etc. This is saved to a file
    # named 'config.pth' in the directory specified by 'OUTPUT_DIR'. 'OUTPUT_DIR' is assumed to be a
    # predefined directory path where output files are stored.
    torch.save(model.config, OUTPUT_DIR+'config.pth')

    # Move the model to a specific device. 'device' is assumed to be a string or torch.device object
    # that specifies whether the model should run on a CPU, a single GPU, or multiple GPUs.
    # Commonly, 'device' would be set to something like 'cpu', 'cuda', or 'cuda:0'.
    model.to(device)
    
    # Setup the optimizer and learning rate scheduler
    optimizer_parameters = get_optimizer_params(
        model,
        encoder_lr=CFG.encoder_lr, 
        decoder_lr=CFG.decoder_lr,
        weight_decay=CFG.weight_decay
    )
    
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr)
    
    num_train_steps = int(len(train_folds) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    
    best_score = np.inf
    # Initialize the BCEWithLogitsLoss
    criterion = nn.BCEWithLogitsLoss()
    
    for epoch in range(CFG.epochs):
        start_time = time.time()
        
        # Run one training epoch and return the average loss
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)
        
        # Run the validation loop and return the average validation loss and predictions
        avg_val_loss, predictions = validation_loop(valid_loader, model, criterion, device)
        
        # Compute the score based on validation predictions
        score = get_score(valid_labels, predictions)
        
        elapsed = time.time() - start_time
        
        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}')
        
        # Check if the current model is the best so far and save if it is
        if best_score > score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")
        

    # Load the best model's predictions for the validation set
    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    
    # Convert the numpy array to a pandas DataFrame
    predictions = pd.DataFrame(predictions)

    # Optionally, you can specify column names
    valid_folds["preds"] = predictions
    
    # Clear CUDA cache and collect garbage to free memory
    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds
        

In [92]:
if __name__ == '__main__':
    oof_df = pd.DataFrame()
    for fold in range(CFG.n_folds):
        _oof_df = training_loop(train, fold, tokenizer)
        oof_df= pd.concat([oof_df, _oof_df])
        LOGGER.info(f"========== fold: {fold} result ==========")
        score = get_score(oof_df["label"], oof_df["preds"])
        LOGGER.info(f'Score: {score:<.4f}')
        break # TO REMOVE IT FOR PROPER TRAINING
    
    oof_df = oof_df.reset_index(drop=True)
    
    LOGGER.info(f"========== CV ==========")
    score = get_score(oof_df["label"], oof_df["preds"])
    LOGGER.info(f'Score: {score:<.4f}')
    
    oof_df.to_pickle(OUTPUT_DIR+'oof_df.pkl')



Epoch: [1][0/130] Elapsed 0m 1s (remain 3m 54s) Loss: 0.9151(0.9151) LR: 0.00002000  
Epoch: [1][20/130] Elapsed 0m 19s (remain 1m 42s) Loss: 0.0024(0.3265) LR: 0.00001967  
Epoch: [1][40/130] Elapsed 0m 37s (remain 1m 21s) Loss: 0.0003(0.1677) LR: 0.00001878  
Epoch: [1][60/130] Elapsed 0m 55s (remain 1m 2s) Loss: 0.0002(0.1127) LR: 0.00001737  
Epoch: [1][80/130] Elapsed 1m 12s (remain 0m 43s) Loss: 0.0001(0.0849) LR: 0.00001552  
Epoch: [1][100/130] Elapsed 1m 30s (remain 0m 25s) Loss: 0.0014(0.0794) LR: 0.00001334  
Epoch: [1][120/130] Elapsed 1m 48s (remain 0m 8s) Loss: 0.0032(0.0738) LR: 0.00001097  
Epoch: [1][129/130] Elapsed 1m 55s (remain 0m 0s) Loss: 0.0019(0.0693) LR: 0.00000988  
EVAL: [0/44] Elapsed 0m 0s (remain 0m 21s) Loss: 0.0013(0.0013) 
EVAL: [20/44] Elapsed 0m 4s (remain 0m 5s) Loss: 0.0013(0.0013) 
EVAL: [40/44] Elapsed 0m 8s (remain 0m 0s) Loss: 0.0013(0.0210) 


Epoch 1 - avg_train_loss: 0.0693  avg_val_loss: 0.0200  time: 125s
Epoch 1 - Score: 0.9913
Epoch 1 - Save Best Score: 0.9913 Model


EVAL: [43/44] Elapsed 0m 9s (remain 0m 0s) Loss: 0.0014(0.0200) 
Epoch: [2][0/130] Elapsed 0m 1s (remain 2m 34s) Loss: 0.0017(0.0017) LR: 0.00000976  
Epoch: [2][20/130] Elapsed 0m 19s (remain 1m 38s) Loss: 0.0009(0.0012) LR: 0.00000735  
Epoch: [2][40/130] Elapsed 0m 36s (remain 1m 19s) Loss: 0.0007(0.0010) LR: 0.00000511  
Epoch: [2][60/130] Elapsed 0m 54s (remain 1m 1s) Loss: 0.0009(0.0305) LR: 0.00000315  
Epoch: [2][80/130] Elapsed 1m 12s (remain 0m 43s) Loss: 0.0011(0.0232) LR: 0.00000159  
Epoch: [2][100/130] Elapsed 1m 30s (remain 0m 25s) Loss: 0.0011(0.0188) LR: 0.00000054  
Epoch: [2][120/130] Elapsed 1m 47s (remain 0m 8s) Loss: 0.0010(0.0159) LR: 0.00000004  
Epoch: [2][129/130] Elapsed 1m 55s (remain 0m 0s) Loss: 0.0010(0.0150) LR: 0.00000000  
EVAL: [0/44] Elapsed 0m 0s (remain 0m 22s) Loss: 0.0007(0.0007) 
EVAL: [20/44] Elapsed 0m 4s (remain 0m 5s) Loss: 0.0007(0.0008) 
EVAL: [40/44] Elapsed 0m 8s (remain 0m 0s) Loss: 0.0007(0.0221) 


Epoch 2 - avg_train_loss: 0.0150  avg_val_loss: 0.0210  time: 125s
Epoch 2 - Score: 0.9884
Epoch 2 - Save Best Score: 0.9884 Model


EVAL: [43/44] Elapsed 0m 9s (remain 0m 0s) Loss: 0.0008(0.0210) 


Score: 0.9884
Score: 0.9884


***Attention: The primary objective of this notebook is to walk you through the concepts and detailed steps involved in fine-tuning a pretrained model. It is designed to be educational, focusing on each aspect of the process to enhance your understanding. Please note that we will not be training the actual model intended for submissions with the provided dataset. This is because the dataset in question is highly imbalanced, containing only 4 samples for the positive class, which would lead to results that are not representative of a well-trained model. You are encouraged to use this notebook as a template or skeleton code and apply the fine-tuning process to another, more balanced dataset to achieve meaningful results.***

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">Train your own Tokenizer</div>

## <span style="color: #7b6b59;">Introduction</span>

Large Language Generative AI models are developed mostly working with large amounts of text data. For this reason anyone working in this area should have specific skills in text processing. To enable AI models to learn from text data effectively we must first preprocess text into a format which is understandable to machines. Tokenization and Vectorization are two of the most important steps in this procedure. 

Before our data can be fed to a model, it needs to be transformed to a format the model can understand. Machine learning algorithms take numbers as inputs. This means that we will need to convert the texts into numerical vectors. There are two steps to this process:

1. **Tokenization:** Divide the texts into words or smaller sub-texts, which will enable good generalization of relationship between the texts and the labels. This determines the “vocabulary” of the dataset (set of unique tokens present in the data). (Splitting text into smaller units such as words or phrases.)

1. **Vectorization:** Define a good numerical measure to characterize these texts. Converting text into numerical representations for ML models.

In summary, the typical order is tokenization first to break down the text into understandable units and then vectorization to turn those units into a numerical format suitable for machine learning models.

## <span style="color: #7b6b59;">Tokenization</span>

Text tokenization is the process of reformatting a piece of text into smaller units called “tokens.” It transforms unstructured text into structured data that models can understand. The goal of tokenization is to break down text into meaningful units like words, phrases, sentences, etc. which can then be inputted into machine learning models. It’s one of the first and most important steps in natural language preprocessing, and often goes hand-in-hand with text vectorization.

Tokenization enables natural language processing tasks like part-of-speech tagging (identifying verbs vs nouns, etc.), named entity recognition (categories like person, organization, location), and relationship extraction (family relationships, professional relationships, etc.).

There are a number of different tokenization methods; some of the simpler ones include splitting text on whitespace or punctuation. Advanced techniques use language rules to identify word boundaries and tokenize text into linguistic units; this can split words into sub-word tokens (such as prefixes, or based on syllables), or even combine certain tokens into larger units based on language semantics. The goal is to produce tokens that best represent the original text for ML purposes.

## <span style="color: #7b6b59;">Vectorization</span>

Now since most Large language models today are based on Transformers and Deep Learning architectures, they still work best with numbers, so to enable them to learn from text we should also convert the tokens to numbers, so each word will be represented with a number instead of sequence of letters. After tokenization, the tokens can then be converted into numerical format through vectorization, which is necessary because machine learning models don't understand text directly; they understand numbers. Vectorization represents the tokens in a way that the model can understand, often as vectors in a high-dimensional space. There are several methods of vectorization, including **Bag of Words**, **TF-IDF**, and **word embeddings** like **Word2Vec** or **GloVe**.

Text vectorization is the process of converting text into numerical representations (or “vectors”) that can be understood by ML models. It transforms unstructured text into structured numeric data with the goal to represent the semantic meaning of text in a mathematical format.

Text vectorization allows for a variety of NLP tasks like document classification (checking whether something is an email or an essay, etc.), sentiment analysis (opinions or attitudes of the text, etc.), enhancing search engines, and so on.

Common text vectorization methods include **one-hot encoding** (assigning a unique integer value to each word), **bag-of-words** (counting the occurrence of words within each document), and **word embeddings** (mapping words to vectors so as to capturing meaning). ***The vector space allows words with similar meanings to have similar representations.***

## <span style="color: #7b6b59;">Tokenizers</span>

Before you can use your data in a model, the data needs to be processed into an acceptable format for the model. A model does not understand raw text, images or audio. These inputs need to be converted into numbers and assembled into tensors.

**The main tool for processing textual data is a tokenizer. A tokenizer starts by splitting text into tokens according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.**


On this section, we will have a closer look at tokenization. As we saw, tokenizing a text is splitting it into **words** or **subwords**, which then are converted to ids through a look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text). 
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: 

1. **Byte-Pair Encoding (BPE)**, 
1. **WordPiece**, 
1. and **SentencePiece**, and show examples of which tokenizer type is used by which model.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer type was used by the pretrained model. For instance, if we look at [BertTokenizer](https://huggingface.co/docs/transformers/v4.36.1/en/model_doc/bert#transformers.BertTokenizer), we can see that the model uses WordPiece.


### <span style="color: #7b6b59;">Introduction</span>

***What is a tokenizer?***

The definition of tokenization, as given by Stanford NLP group is:

“Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation”

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

There are different solutions available: **word-based**, **character-based** but the one used by the state-of-the-art transformer models are **sub-word tokenizers**: Byte-level BPE(GPT-2), WordPiece(BERT) etc.

#### <span style="color: #7b6b59;">Space & Punctuation Tokenization</span>

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so. For instance, let’s look at the sentence `"Don't you love 🤗 Transformers? We sure do."`

A simple way of tokenizing this text is to split it by spaces, which would give:

`["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]`

This is a sensible first step, but if we look at the tokens "Transformers?" and "do.", we notice that the punctuation is attached to the words "Transformer" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation of a word and every possible punctuation symbol that could follow it, which would explode the number of representations the model has to learn. Taking punctuation into account, tokenizing our exemplary text would give:

`["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]`

Better. However, it is disadvantageous, how the tokenization dealt with the word "Don't". "Don't" stands for "do not", so it would be better tokenized as `["Do", "n't"]`. This is where things start getting complicated, and part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.

**spaCy** and **Moses** are two popular **rule-based tokenizers**. Applying them on our example, spaCy and Moses would output something like:

`["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]`

As can be seen space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it’s the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., **Transformer XL uses space and punctuation tokenization**, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, **transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language.**

So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?

#### <span style="color: #7b6b59;">Character Tokenization</span>

While character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for the model to learn meaningful input representations. E.g. learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today". Therefore, character tokenization is often accompanied by a loss of performance. **So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.**

### <span style="color: #7b6b59;">Subword Tokenization</span>
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. In addition, subword tokenization enables the model to process words it has never seen before, by decomposing them into known subwords. For instance, the BertTokenizer tokenizes "I have a new GPU!" as follows:

`["i", "have", "a", "new", "gp", "##u", "!"]`

Because we are considering the uncased model, the sentence was lowercased first. We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer’s vocabulary, but the word "gpu" is not. Consequently, the tokenizer splits "gpu" into known subwords: `["gp" and "##u"]`. "##" means that the rest of the token should be attached to the previous one, without space (for decoding or reversal of the tokenization).
As another example, XLNetTokenizer tokenizes our previously exemplary text as follows:

`["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]`

We’ll get back to the meaning of those "▁" when we look at SentencePiece. As one can see, the rare word "Transformers" has been split into the more frequent subwords "Transform" and "ers".

Let’s now look at how the different subword tokenization algorithms work. Note that all of those tokenization algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained on.

Concepts related to BPE:


1. **Vocabulary:** A set of subword units that can be used to represent a text corpus.
1. **Byte:** A unit of digital information that typically consists of eight bits.
1. **Character:** A symbol that represents a written or printed letter or numeral.
1. **Frequency:** The number of times a byte or character occurs in a text corpus.
1. **Merge:** The process of combining two consecutive bytes or characters to create a new subword unit.

#### <span style="color: #7b6b59;">Byte-Pair Encoding (BPE)</span>

Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pretokenization can be as simple as space tokenization, e.g. GPT-2, RoBERTa. More advanced pre-tokenization include rule-based tokenization, e.g. XLM, FlauBERT which uses Moses for most languages, or GPT which uses Spacy and ftfy, to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.

As an example, let’s assume that after pre-tokenization, the following set of words including their frequency has been determined:

`("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)`

Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the base vocabulary, we obtain:

`("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)`

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes

`("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)`

BPE then identifies the next most common symbol pair. It’s "u" followed by "n", which occurs 16 times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 15 times. Again the pair is merged and "hug" can be added to the vocabulary.

At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words is represented as

`("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)`

Assuming, that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to `["b", "ug"]` but "mug" would be tokenized as `["<unk>", "ug"]` since the symbol `"m"` is not in the base vocabulary. In general, single letters such as `"m"` are not replaced by the `"<unk>"` symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.

As mentioned earlier, the vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter to choose. For instance GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

**Recap:** Steps involved in BPE:

1. Initialize the vocabulary with all the bytes or characters in the text corpus
1. Calculate the frequency of each byte or character in the text corpus.
1. Repeat the following steps until the desired vocabulary size is reached:
    - Find the most frequent pair of consecutive bytes or characters in the text corpus
    - Merge the pair to create a new subword unit.
    - Update the frequency counts of all the bytes or characters that contain the merged pair.
    - Add the new subword unit to the vocabulary.

1. Represent the text corpus using the subword units in the vocabulary.

#### <span style="color: #7b6b59;">Byte-level BPE</span>

A base vocabulary that includes all possible base characters can be quite large if e.g. all unicode characters are considered as base characters. To have a better base vocabulary, GPT-2 uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2’s tokenizer can tokenize every text without the need for the `<unk>` symbol. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

**Understanding Characters and Bytes:**

1. **Characters:** These are the basic units of text (like 'A', '7', '!', 'é', '中'). In human language, we see these as individual symbols or letters.
1. **Bytes:** A byte is a unit of digital information that commonly consists of eight bits. It's a fundamental concept in computer science and is used to represent data.

**Character Encoding:**

Characters are represented in computers using various encoding systems, which map characters to specific byte sequences. Two common encodings are ASCII and UTF-8:

1. **ASCII (American Standard Code for Information Interchange):** This is one of the oldest character encoding standards. It uses one byte (8 bits) per character and can represent up to 256 different symbols (0-255). ASCII is limited to English characters and some control characters and symbols. The maximum of 256 different symbols in ASCII is due to its use of one byte per character, and a byte consists of 8 bits. Here's a breakdown of why this limits it to 256 symbols. When you have a single bit, you have two possible values (0 or 1). With two bits, you can have 4 possible combinations (00, 01, 10, 11). For 8 bits (1 byte), the number of possible combinations is  2^8 = 256. This range is from 0 to 255, which gives 256 total possible values.

1. **UTF-8 (8-bit Unicode Transformation Format):** This is a more modern and versatile encoding standard capable of representing a vast array of characters from virtually all written languages. UTF-8 is backward compatible with ASCII but can use one to four bytes per character, allowing it to cover much more than the basic ASCII set.


For the GPT models, OpenAI uses a method known as byte-level byte pair encoding, instead of alphabets or ASCII, the base vocabulary is defined in bytes. Since every character in any encoding on a computer is created from bytes, the base vocabulary contains every possible combination of byte, and the tokenizer never runs into an unknown token.

**Byte-Level:** 

Instead of starting with a vocabulary of words or characters (like alphabets or ASCII characters), byte-level BPE operates on bytes, which are essentially the smallest addressable group of bits in a computer (usually 8 bits). This means the base vocabulary consists of all 256 possible byte values (from 0 to 255).
In traditional BPE or other tokenization methods that start with characters, the process involves looking at the text's character-level representation. For example, the word "hello" would be considered as 'h', 'e', 'l', 'l', 'o' – five separate characters.

In Byte-Level BPE, instead of looking at characters, we consider the byte representation of the text. This approach doesn't start with an understanding of "characters" per se but with the bytes that encode these characters. Here's why it's significant:


**Why Byte-Level?:**

1. **All-Inclusive:** Since every character (no matter the language or symbol) can be broken down into bytes, starting with bytes ensures that the vocabulary can represent any text without missing symbols or needing placeholders for unknowns.

1. **Simplifies Vocabulary:** Instead of potentially needing thousands of character tokens to cover various languages and symbols, Byte-Level BPE only needs 256 base tokens, corresponding to all possible values of a byte (0-255). This drastically simplifies the model's vocabulary.

1. **Handles Varied Text:** By using bytes, the tokenizer can handle texts in ASCII (like English text) and texts in more complex encodings like UTF-8 (which can represent virtually all human languages) without needing separate mechanisms or special handling for different languages or symbol sets.

1. **Universality:** Bytes are the fundamental building blocks of digital data. By using bytes, the model can represent any character in any language or even other forms of data like emojis or special symbols without being restricted to a specific character set. This universality means that it can process text in virtually any language or symbol system.

1. **No Unknown Tokens:** Traditional tokenizers might encounter characters or words they have never seen before (out-of-vocabulary words), leading to the use of a special "unknown" token. Byte-level BPE virtually eliminates this problem because every piece of text can be broken down into bytes, which are always within the model's vocabulary. Thus, the tokenizer is capable of handling any text input without encountering unknown tokens.


So, when we say a "character is a byte" in the context of Byte-Level BPE, it's a bit of a simplification. A more accurate statement would be: "All characters can be represented as sequences of bytes, and Byte-Level BPE uses these byte sequences as the foundational elements of its vocabulary." This means each character in text is represented by one or more bytes, depending on its encoding, and these bytes are the building blocks for the tokenizer's vocabulary and subsequent text processing. In summary, byte-level BPE is a way of preparing text for machine learning models like GPT that is both highly versatile and capable of handling a wide variety of languages and symbols without running into the issue of unknown tokens. It's a foundational aspect of how these models process and understand the text data they're trained on and generate.


Byte-Level Byte Pair Encoding (BPE) is a tokenization method that builds upon the standard BPE algorithm by using bytes as the fundamental unit for its vocabulary. This approach, as used in models like GPT-2, is particularly effective and efficient for several reasons. Here's a more detailed explanation of how it works and its benefits:

**Base Vocabulary:**

1. **Standard BPE:** Traditional Byte Pair Encoding starts with a base vocabulary of all unique characters (or tokens) in the training corpus and iteratively combines the most frequent pair of tokens to create new, longer tokens. This process continues for a number of merges, determined beforehand.

1. **Byte-Level BPE:** Instead of starting with characters, Byte-Level BPE considers each byte (256 possible values in total, representing all possible single-byte characters) as the base vocabulary. This approach automatically includes all possible characters in ASCII and extends to any byte value that might represent a part of a character in more extensive encoding systems like UTF-8.

***Advantages:***

1. **Compact and Comprehensive Base Vocabulary:** By using bytes, the base vocabulary is limited to 256 tokens (since there are 256 possible byte values), which is more compact compared to potentially thousands of Unicode characters. Yet, it's comprehensive enough to represent any text because all text can be broken down into bytes.

1. **Eliminating `<unk>` Tokens:** Traditional tokenizers might encounter unknown characters or words not present in the vocabulary, often represented by an `<unk>` (unknown) token. Since Byte-Level BPE can tokenize any text into bytes (and subsequentially into byte-level tokens), it theoretically doesn't need an `<unk>` symbol, as every possible byte can be represented in its vocabulary.

1. **Handling Diverse Scripts and Symbols:** With the ability to represent any character as a series of bytes, Byte-Level BPE is naturally equipped to handle text in multiple languages, including those with large character sets or special symbols, without needing separate models or token sets for different languages.

**GPT-2's Vocabulary:**
In the case of GPT-2:

- **256 Base Tokens:** Corresponding to all possible byte values.
- **Special End-of-Text Token:** Used to signify the end of a text.
- **50,000 Merges:** The tokenizer iteratively combines frequent pairs of these byte-level tokens to form higher-level tokens, up to 50,000 merges. These merges are learned from the training corpus and represent common words, subwords, or sequences of characters that appear frequently together.

The resulting vocabulary size is 50,257 (256 base tokens + 1 special token + 50,000 merged tokens), which provides a good balance between granularity and coverage. This means GPT-2's tokenizer is capable of handling a wide variety of texts, from different languages and domains, without a substantial increase in vocabulary size, making it efficient and powerful for language understanding and generation tasks.
    
#### <span style="color: #7b6b59;">WordPiece</span>

WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The algorithm was outlined in Japanese and Korean Voice Search (Schuster et al., 2012) and is very similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by its second symbol is the greatest among all symbol pairs. E.g. "u", followed by "g" would have only been merged if the probability of "ug" divided by "u", "g" would have been greater than for any other symbol pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.


#### <span style="color: #7b6b59;">SentencePiece</span>

All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language specific pre-tokenizers, e.g. XLM uses a specific Chinese, Japanese, and Thai pre-tokenizer). To solve this problem more generally, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.

The XLNetTokenizer uses SentencePiece for example, which is also why in the example earlier the "▁" character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be concatenated and "▁" is replaced by a space.

All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are ALBERT, XLNet, Marian, and T5.



## <span style="color: #7b6b59;">HuggingFace Tokenizers: `tokenizers` Library</span>

### <span style="color: #7b6b59;">Introduction</span>

Fast State-of-the-art [tokenizers](https://huggingface.co/docs/tokenizers/index), optimized for both research and production

[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an implementation of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

**Main features:**

- Train new vocabularies and tokenize, using today’s most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it’s always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.

### <span style="color: #7b6b59;">The tokenization pipeline</span>

In this section, we will try to understand the HuggingFace tokenizers in depth and will go through all the parameters and also the outputs returned by a tokenizer. We’ll dive into the AutoTokenizer class and see how to use a pre-trained tokenizer for our data.

So, let’s get started!

Hugging Face is a New York based company that has swiftly developed language processing expertise. The company’s aim is to advance NLP and democratize it for use by practitioners and researchers around the world.

In an effort to offer access to fast, state-of-the-art, and easy-to-use tokenization that plays well with modern NLP pipelines, Hugging Face contributors have developed and open-sourced Tokenizers. Tokenizers is, as the name implies, an implementation of today’s most widely used tokenizers with emphasis on performance and versatility.

An implementation of a tokenizer consists of the following pipeline of processes, each applying different transformations to the textual information. When calling Tokenizer.encode or Tokenizer.encode_batch, the input text(s) go through the following pipeline:

- normalization
- pre-tokenization
- model
- post-processing

We’ll see in details what happens during each of those steps in detail, as well as when you want to decode `<decoding>` some token ids, and how the 🤗 Tokenizers library allows you to customize each of those steps to your needs. 

Let’s go through these steps:

<img width="935" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/57dbec9a-de4a-4bed-b4a0-491639298f65">

1. **Normalization:** The [normalization step](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers) involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents. If you’re familiar with Unicode normalization (such as NFC or NFKC), this is also something the tokenizer may apply. `"Héllò hôw are yoü?"` Given the input above, the normalization step would transform it into: `"hello, how are you?"`. Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less random or “cleaner”. Common operations include stripping whitespace, removing accented characters or lowercasing all text. If you’re familiar with Unicode normalization, it is also a very common normalization operation applied in most tokenizers. Each normalization operation is represented in the 🤗 Tokenizers library by a `Normalizer`, and you can combine several of those by using a `normalizers.Sequence.` Here is a normalizer applying NFD Unicode normalization and removing accents as an example:

    ```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
# You can manually test that normalizer by applying it to any string:
normalizer.normalize_str("Héllò hôw are ü?")
``` 
    When building a Tokenizer, you can customize its normalizer by just changing the corresponding attribute: `tokenizer.normalizer = normalizer`. Of course, if you change the way a tokenizer applies normalization, you should probably retrain it from scratch afterward.

1. **Pre-tokenization:** A tokenizer cannot be trained on raw text alone. Instead, we first need to split the texts into small entities, like words. That’s where the pre-tokenization step comes in. A word-based tokenizer can simply split a raw text into words on whitespace and punctuation. Those words will be the boundaries of the subtokens the tokenizer can learn during its training. `"hello, how are you?"`. Given this string, the pre-tokenizer’s output will be something like: `[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]`. As we can see, the tokenizer also keeps track of the offsets. Also, the rules for [pre-tokenization](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers) can vary with the tokenizer being used. For instance, BERT will have different set of rules for this step than GPT-2. Pre-tokenization is the act of splitting a text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pre-tokenizer will split your text into “words” and then, your final tokens will be parts of those words. An easy way to pre-tokenize inputs is to split on spaces and punctuations, which is done by the `pre_tokenizers.Whitespace pre-tokenizer`. Of course, if you change the way the pre-tokenizer, you should probably retrain your tokenizer from scratch afterward. The output is a list of tuples, with each tuple containing one word and its span in the original sentence (which is used to determine the final offsets of our Encoding). Note that splitting on punctuation will split contractions like "I'm" in this example. You can combine together any PreTokenizer together. For instance, here is a pre-tokenizer that will split on space, punctuation and digits, separating numbers in their individual digits:
`pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
`

1. **Modeling:** After normalization and pre-processing steps, we apply [a training algorithm](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models) to the text data. This output of this step is dependent on the type of training strategy we are going to use. The state-of-the-art models use subword tokenization algorithms, for example BERT uses WordPiece tokenization, GPT, GPT-2 use BPE, AIBERT uses unigram etc. Using a BERT tokenizer, will tokenize the sentence like this: `["hello"; ","; "how"; "are"; "you"; "?"]`. Once the input texts are normalized and pre-tokenized, the Tokenizer applies the model on the pre-tokens. ***This is the part of the pipeline that needs training on your corpus (or that has been trained if you are using a pretrained tokenizer).*** ***The role of the model is to split your “words” into tokens, using the rules it has learned.*** It’s also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model. This model is passed along when intializing the Tokenizer so you already know how to customize this part. Currently, the 🤗 Tokenizers library supports:
    - models.BPE
    - models.Unigram
    - models.WordLevel
    - models.WordPiece

1. **Post-processing:** Similar to the modeling part, a number of post-processors are available depending on the training strategy used. They’re responsible for adding the special tokens to the input sequence as needed by the model. Using a BERT post-processor to our sequence will result in: `["CLS"; "hello"; ","; "how"; "are"; "you"; "?"; "SEP"]`. Here, `[CLS]` denotes the classification token, which tells the model that this is a classification task and `[SEP]` denotes the end of sentence and is also used between two sentences. Post-processing is the last step of the tokenization pipeline, to perform any additional transformation to the Encoding before it’s returned, like adding potential special tokens.

Subword tokenization methods, such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece, need to be trained on a specific corpus to learn an efficient and effective way of breaking down words into smaller units (subwords). The training process allows the tokenizer to adapt to the particularities of the text it will be processing.

**The Training Process:**

During training, a subword tokenizer typically starts with a large corpus of text and performs the following:

1. **Initial Vocabulary Creation:** It creates an initial vocabulary, often at the character level or using a simple character or word frequency threshold.

1. **Merging Rules Learning:** It iteratively finds the most frequent pairs of characters or subwords and merges them to form a new, longer subword. This process repeats until a set number of merges is reached or the desired vocabulary size is achieved.

1. **Final Vocabulary Compilation:** The final vocabulary consists of the original characters plus all the merged subwords, ensuring that any word can be tokenized using this set.

In essence, the training of a subword tokenizer is about learning the most efficient and effective way to break down and represent the text it will encounter, taking into account frequency, morphology, and the specific needs of the task or language. This process results in a tokenizer that can handle a wide variety of text inputs, generalize well to new text, and efficiently interface with downstream language models or other NLP tools.


### <span style="color: #7b6b59;">Build a tokenizer from scratch</span>

To illustrate how fast the 🤗 Tokenizers library is, let’s train a new tokenizer on wikitext-103 (516M of text) in just a few seconds. In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. Here, training the tokenizer means it will learn merge rules by:

1. Start with all the characters present in the training corpus as tokens.
1. Identify the most common pair of tokens and merge it into one token.
1. Repeat until the vocabulary (e.g., the number of tokens) has reached the size we want.

The main API of the library is the class `Tokenizer`, here is how we instantiate one with a BPE model:

In [93]:
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, StripAccents, NFC, Lowercase
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

LOWERCASE = False


In [94]:
# The main API of the library is the class Tokenizer, here is how we instantiate one with a BPE model:
# Creating Byte-Pair Encoding tokenizer
# we instantiate a new Tokenizer with this model - BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))


In [95]:
normalizer = normalizers.Sequence([NFD(), StripAccents()])
# You can manually test that normalizer by applying it to any string:
print(normalizer.normalize_str("Héllò hôw are ü?"))

normalizer = normalizers.Sequence([NFC()] + [normalizers.Lowercase()] if LOWERCASE else [])
print(normalizer.normalize_str("Héllò hôw are ü?"))
tokenizer.normalizer = normalizer

Hello how are u?
Héllò hôw are ü?


In [96]:
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))
pre_tokenizer = ByteLevel()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))

# We could train our tokenizer right now, but it wouldn’t be optimal.
# Without a pre-tokenizer that will split our inputs into words, we might get tokens that overlap several words:
# for instance we could get an "it is" token since those two words often appear next to each other. 
# Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer.
# Here we want to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting on whitespace.
# As we saw in the quicktour, you can customize the pre-tokenizer of a Tokenizer by just changing the corresponding attribute:
tokenizer.pre_tokenizer = pre_tokenizer

[('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19)), ('I', (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ('fine', (24, 28)), (',', (28, 29)), ('thank', (30, 35)), ('you', (36, 39)), ('.', (39, 40))]
[('ĠHello', (0, 5)), ('!', (5, 6)), ('ĠHow', (6, 10)), ('Ġare', (10, 14)), ('Ġyou', (14, 18)), ('?', (18, 19)), ('ĠI', (19, 21)), ("'m", (21, 23)), ('Ġfine', (23, 28)), (',', (28, 29)), ('Ġthank', (29, 35)), ('Ġyou', (35, 39)), ('.', (39, 40))]


In [97]:

# To train our tokenizer on the wikitext files, we will need to instantiate a [trainer]{.title-ref}, in this case a BpeTrainer
# We can set the training arguments like vocab_size or min_frequency 
# but the most important part is to give the special_tokens we plan to use later on (they are not used at all during training) so that they get inserted in the vocabulary.

# Adding special tokens and creating trainer instance
# The order in which you write the special tokens list matters: here "[UNK]" will get the ID 0, "[PAD]" will get the ID 1 and so forth.

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
VOCAB_SIZE = 30522
trainer = BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

# Now, we can just call the Tokenizer.train method with any list of files we want to use:

#files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
#tokenizer.train(files, trainer)


# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">References</div>


1. [Text Tokenization and Vectorization in NLP](https://medium.com/@WojtekFulmyk/text-tokenization-and-vectorization-in-nlp-ac5e3eb35b85)
1. [Developing LLMs for Generative AI Tokenization and Vectorization
](https://www.linkedin.com/pulse/developing-llms-generative-ai-tokenization-darko-medin/)
1. [Google Machine Learning Guide](https://developers.google.com/machine-learning/guides/text-classification/step-3#:~:text=Tokenization%3A%20Divide%20the%20texts%20into,measure%20to%20characterize%20these%20texts.)
1. [Hugging Face: Understanding tokenizers
](https://medium.com/@awaldeep/hugging-face-understanding-tokenizers-1b7e4afdb154)
1. [How to use [HuggingFace’s] Transformers Pre-Trained tokenizers? - To READ](https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa)
1. [Byte-Pair Encoding (BPE) in NLP](https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/)
1. https://neptune.ai/blog/vectorization-techniques-in-nlp-guide
1. [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)
1. [Build a tokenizer from scratch](https://huggingface.co/docs/tokenizers/quicktour)
1. https://huggingface.co/blog/how-to-train
1. [HuggingFace Tokenizers](https://huggingface.co/docs/tokenizers/index)
1. [Adding Custom Layers on Top of a Hugging Face Model](https://towardsdatascience.com/adding-custom-layers-on-top-of-a-hugging-face-model-f1ccdfc257bd)
1. [Add dense layer on top of Huggingface BERT model](https://stackoverflow.com/questions/64156202/add-dense-layer-on-top-of-huggingface-bert-model)
1. [FINE-TUNING PRE-TRAINED MODELS FOR GENERATIVE AI APPLICATIONS](https://www.leewayhertz.com/fine-tuning-pre-trained-models/)
1. [Fine-Tuning the Model: What, Why, and How
](https://medium.com/@amanatulla1606/fine-tuning-the-model-what-why-and-how-e7fa52bc8ddf)
1. https://rumn.medium.com/part-1-ultimate-guide-to-fine-tuning-in-pytorch-pre-trained-model-and-its-configuration-8990194b71e
1. https://towardsdatascience.com/fine-tuning-pretrained-nlp-models-with-huggingfaces-trainer-6326a4456e7b
1. https://medium.com/@alexmriggio/bert-for-sequence-classification-from-scratch-code-and-theory-fb88053800fa
1. https://mccormickml.com/2019/07/22/BERT-fine-tuning/#4-train-our-classification-model
1. https://huggingface.co/transformers/v2.2.0/model_doc/bert.html
1. https://towardsdatascience.com/fine-tuning-pretrained-nlp-models-with-huggingfaces-trainer-6326a4456e7b
1. [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training#train-with-pytorch-trainer)
1. [What’s in the Dataset object](https://huggingface.co/docs/datasets/v1.2.1/exploring.html)
1. [Loading a Dataset](https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html)
1. [The Dataset object](https://huggingface.co/docs/datasets/v2.2.1/en/access)
1. [Create a dataset](https://huggingface.co/docs/datasets/create_dataset)
1. [BertForSequenceClassification source code](https://huggingface.co/transformers/v3.0.2/_modules/transformers/modeling_bert.html#BertForSequenceClassification)
1. [7 Text Classification Techniques for Any Scenario](https://blog.dataiku.com/7-text-classification-techniques-for-any-scenario#:~:text=A%20simple%20approach%20for%20text,regression%20or%20tree%2Dbased%20models.)
1. [TF-IDF Simplified](https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530)
1. [Understanding TF-IDF for Machine Learning](https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/)
1. [Understanding TF-IDF in NLP: A Comprehensive Guide
](https://medium.com/@er.iit.pradeep09/understanding-tf-idf-in-nlp-a-comprehensive-guide-26707db0cec5)
1. [TF-IDF Guide: Using scikit-learn for TF-IDF implementation](https://www.capitalone.com/tech/machine-learning/scikit-tfidf-implementation/)
1. [Creating BERT Embeddings with Hugging Face Transformers](https://www.analyticsvidhya.com/blog/2023/08/bert-embeddings/)
1. [How to use embeddings for feature extraction?](https://medium.com/mlearning-ai/how-to-use-embeddings-for-feature-extraction-4956db52b5f5)
1. [Feedback Prize - English Language Learning](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/code?competitionId=38321&sortBy=voteCount)
1. [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer)
1. [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)
1. [The tokenization pipeline](https://huggingface.co/docs/tokenizers/pipeline)
1. [Preprocess](https://huggingface.co/docs/transformers/preprocessing)
1. [How to use BERT from the Hugging Face transformer library
](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)
1. [Neural Networks: Pooling Layers](https://www.baeldung.com/cs/neural-networks-pooling-layers)
1. [Understanding Pooling in Transformer Architecture, Aggregating Outputs for Downstream Tasks](https://www.datasciencebyexample.com/2023/04/30/what-is-pooling-in-transformer-model/)
1. https://huggingface.co/docs/transformers/main_classes/output
1. https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/output#transformers.utils.ModelOutput
1. [Deep learning basics — weight decay](https://medium.com/analytics-vidhya/deep-learning-basics-weight-decay-3c68eb4344e9)
1. [How do you compare weight decay with other regularization methods for neural networks?
](https://www.linkedin.com/advice/3/how-do-you-compare-weight-decay-other#:~:text=Weight%20decay%20is%20a%20form,them%20from%20growing%20too%20large.)
1. [Zero-Weight Decay on BatchNorm and Bias
](https://deci.ai/deep-learning-glossary/zero-weight-decay-on-batchnorm-and-bias/)
1. [Various Optimization Algorithms For Training Neural Network
](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6)
1. [Optimizers in Deep Learning
](https://medium.com/mlearning-ai/optimizers-in-deep-learning-7bf81fed78a0)
1. [DATASETS & DATALOADERS
](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
1. [An Introduction to Datasets and DataLoader in PyTorch
](https://wandb.ai/sauravmaheshkar/Dataset-DataLoader/reports/An-Introduction-to-Datasets-and-DataLoader-in-PyTorch--VmlldzoxMDI5MTY2)
1. [PyTorch DataLoader: Features, Benefits, and How to Use it
](https://saturncloud.io/blog/pytorch-dataloader-features-benefits-and-how-to-use-it/#:~:text=The%20basic%20architecture%20of%20PyTorch%20DataLoader&text=The%20DataLoader%20class%20takes%20in,of%20data%20loading%20and%20preprocessing.)
1. [A detailed example of how to generate your data in parallel with PyTorch
](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel)
1. [PyTorch DataLoader: A Complete Guide](https://datagy.io/pytorch-dataloader/)
1. [How does DataLoader work in PyTorch?](https://medium.com/noumena/how-does-dataloader-work-in-pytorch-8c363a8ee6c1)
1. [How to use Datasets and DataLoader in PyTorch for custom text data
](https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00)
1. [Playing with PyTorch and Datasets
](https://fede-bianchi.medium.com/playing-with-pytorch-and-datasets-fe64f5590f2)
1. [Effective Data Handling with Custom PyTorch Dataset Classes
](https://dantokeefe.medium.com/effective-data-handling-with-custom-pytorch-dataset-classes-b141bcb87b41)
1. [TRAINING WITH PYTORCH
](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#:~:text=The%20Dataset%20and%20DataLoader%20classes,processing%20single%20instances%20of%20data.)
1. [Training a PyTorch Model with DataLoader and Dataset
](https://machinelearningmastery.com/training-a-pytorch-model-with-dataloader-and-dataset/)
1. [Cross-Validation in Machine Learning
](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)
1. [Understanding 8 types of Cross-Validation
](https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d)
1. [7 Types of Cross-Validation (CV) Techniques You Should Know as a Data Scientist in 2023](https://rukshanpramoditha.medium.com/7-types-of-cross-validation-cv-techniques-you-should-know-as-a-data-scientist-in-2023-516bd17b9189)
1. [Cross-Validation in Machine Learning: How to Do It Right
](https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right)
1. [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision)
1. [Introduction to PyTorch: from training loop to prediction
](https://towardsdatascience.com/introduction-to-pytorch-from-training-loop-to-prediction-a70372764432)
1. [Writing a training loop from scratch in PyTorch
](https://keras.io/guides/writing_a_custom_training_loop_in_torch/)
1. [Writing a Custom Training loop](https://www.scaler.com/topics/pytorch/writing-a-custom-training-loop-with-pytorch/)
1. [OPTIMIZING MODEL PARAMETERS](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html)
1. [TRAINING WITH PYTORCH](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html)
1. [CUDA AUTOMATIC MIXED PRECISION EXAMPLES](https://pytorch.org/docs/stable/notes/amp_examples.html)
1. [Pytorch Training Tricks and Tips](https://towardsdatascience.com/pytorch-training-tricks-and-tips-a8808ebf746c)
1. [A (Very Short) Visual Introduction to Learning Rate Schedulers (With Code)
](https://medium.com/@theom/a-very-short-visual-introduction-to-learning-rate-schedulers-with-code-189eddffdb00)
1. [Using Learning Rate Schedule in PyTorch Training
](https://machinelearningmastery.com/using-learning-rate-schedule-in-pytorch-training/)
1. [How to Choose a Learning Rate Scheduler for Neural Networks](https://neptune.ai/blog/how-to-choose-a-learning-rate-scheduler#:~:text=A%20Learning%20rate%20schedule%20is,iterations%20as%20the%20training%20progresses.)
1. [A Visual Guide to Learning Rate Schedulers in PyTorch
](https://towardsdatascience.com/a-visual-guide-to-learning-rate-schedulers-in-pytorch-24bbb262c863)
1. [This thing called Weight Decay
](https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab)
1. [7 Text Classification Techniques for Any Scenario
](https://blog.dataiku.com/7-text-classification-techniques-for-any-scenario)
1. [Accelerating Large Language Models with Mixed-Precision Techniques
](https://lightning.ai/pages/community/tutorial/accelerating-large-language-models-with-mixed-precision-techniques/)
1. [Understanding Mixed Precision Training
](https://towardsdatascience.com/understanding-mixed-precision-training-4b246679c7c4)
1. [AUTOMATIC MIXED PRECISION PACKAGE - TORCH.AMP
](https://pytorch.org/docs/stable/amp.html#gradient-scaling)
1. [How to use pytorch loss functions](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-use-pytorch-loss-functions.md#binary-cross-entropy-loss-on-sigmoid-nnbceloss-example)

# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#7b6b59;overflow:hidden">QA</div>
