# Install the latest version of Hugging Face's Transformers library with PyTorch support to train a causal language model for dialogue generation.

In [None]:
## Check whether the code is running in a Google Colab environment
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    # Make the notebook experience better with rich output
    !pip install rich
    
    # Machine Learning Libraries
    
    ## Deep Learning
    
    ### Deep Learning Framework of Choice
    !pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

    
    ## Deep Learning for NLP
    
    ### Premade and Pretrained Deep Learning Models
    !pip install transformers
    
# Make the notebook experience better with rich output
from rich import print

## Machine Learning Libraries

### Machine Learning for NLP

import transformers
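
A quick way to confirm what the install step actually put in place is to print the installed package versions. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook installs (names taken from the pip commands above).
for name in ("torch", "transformers", "rich"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```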


# Efficient Single GPU Training (I am Broke)

## If I have a GPU, enable tf32, which is faster than fp32 and only available on CUDA GPUs
The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8-bit exponent), but instead of 23 bits of precision it has only 10 bits (same as fp16) and uses only 19 bits in total.
It’s magical in the sense that you can use the normal fp32 training and/or inference code and, by enabling tf32 support, get up to 3x throughput improvement. All you need to do is set `torch.backends.cuda.matmul.allow_tf32 = True` in your code.
- When this is done, CUDA will automatically switch to using tf32 instead of fp32 where possible. This, of course, assumes that the GPU in use is from the Ampere series.
- Note: tf32 mode is internal to CUDA and can’t be accessed directly via `tensor.to(dtype=torch.tf32)`, as `torch.tf32` doesn’t exist.
- Note: you need torch>=1.7 to enjoy this feature.
- You can also see a variety of benchmarks on tf32 vs other precisions: [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189).
- We’ve now seen how we can change the floating types to increase throughput, but we are not done yet! There is another area where we can save GPU memory: the optimizer.[9]
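
To make the bit counts above concrete, here is a minimal pure-Python sketch that emulates the reduced tf32 mantissa by zeroing the 13 lowest mantissa bits of an fp32 value. It is only an illustration: the helper names are hypothetical, and real hardware may round rather than truncate, so treat the truncation as an assumption.

```python
import struct

# Bit layout described above: 1 sign bit + 8 exponent bits + 10 mantissa bits.
TF32_SIGN_BITS, TF32_EXPONENT_BITS, TF32_MANTISSA_BITS = 1, 8, 10
FP32_MANTISSA_BITS = 23


def tf32_total_bits() -> int:
    """Total number of bits used by the tf32 format."""
    return TF32_SIGN_BITS + TF32_EXPONENT_BITS + TF32_MANTISSA_BITS


def truncate_to_tf32(x: float) -> float:
    """Emulate tf32 precision by clearing the low mantissa bits of x's fp32 encoding.

    Assumption: truncation is used here for simplicity; hardware may round instead.
    """
    dropped = FP32_MANTISSA_BITS - TF32_MANTISSA_BITS  # 13 bits are discarded
    mask = 0xFFFFFFFF ^ ((1 << dropped) - 1)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


print(tf32_total_bits())                # 19
print(truncate_to_tf32(1.0 + 2**-10))   # 1.0009765625 (still representable)
print(truncate_to_tf32(1.0 + 2**-11))   # 1.0 (falls below tf32 precision)
```

The exponent bits are untouched, which is why the numerical range matches fp32 while fine-grained differences below about one part in 1024 are lost.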

In [None]:
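## Check that the PyTorch version is 1.7 or newer before enabling tf32
import torch

# Compare numeric (major, minor) parts; comparing raw strings would rank '1.10.0' below '1.7.0'
torch_version = tuple(int(part) for part in torch.__version__.split('+')[0].split('.')[:2])
if torch_version >= (1, 7):
    print(f'PyTorch {torch.__version__} supports tf32; enabling it for CUDA matmuls')
    torch.backends.cuda.matmul.allow_tf32 = True
else:
    print(f'PyTorch {torch.__version__} is older than 1.7.0; updating to 1.7.0 or newer is strongly recommended')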
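
The version check alone does not confirm that the GPU is from the Ampere series, which the text above assumes. A possible extra guard, sketched here, looks at the CUDA compute capability: Ampere GPUs report 8.0 or higher. The helper name `tf32_capable` is hypothetical, and the guarded import is only there so the helper can be exercised on a machine without PyTorch.

```python
def tf32_capable(capability) -> bool:
    """Return True when a (major, minor) compute capability is Ampere-class (8.x) or newer."""
    major, _minor = capability
    return major >= 8


try:
    import torch
except ImportError:  # allows the helper above to be used without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    enable = tf32_capable(torch.cuda.get_device_capability())
    # Apply the same decision to matmuls and to cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = enable
    torch.backends.cudnn.allow_tf32 = enable
    print(f"tf32 enabled: {enable}")
else:
    print("No CUDA GPU detected; tf32 settings left unchanged")
```

On a pre-Ampere GPU (for example compute capability 7.5) the flags are set to `False`, so the rest of the notebook runs in plain fp32.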

# Use a Dataset from Hugging Face's Datasets Library


# References

[1] Scaling Instruction-Finetuned Language Models
By Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V Le, Jason Wei Container: arXiv.org Year: 2022 DOI: 10.48550/arXiv.2210.11416 URL: https://arxiv.org/abs/2210.11416

[2] LoRA: Low-Rank Adaptation of Large Language Models
By Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen Container: arXiv.org Year: 2021 DOI: 10.48550/arXiv.2106.09685 URL: https://arxiv.org/abs/2106.09685

[3] 8-bit Optimizers via Block-wise Quantization
By Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer Container: arXiv.org Year: 2021 DOI: 10.48550/arXiv.2110.02861 URL: https://arxiv.org/abs/2110.02861

[4] MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
By Yuhang Li, Mingzhu Shen, Jian Ma, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, Fengwei Yu, Junjie Yan Container: arXiv.org Year: 2021 DOI: 10.48550/arXiv.2111.03759 URL: https://arxiv.org/abs/2111.03759

[5] InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
By Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira Container: arXiv.org Year: 2023 DOI: 10.48550/arXiv.2301.01820 URL: https://arxiv.org/abs/2301.01820v2


[6] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
By Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré Container: Advances in Neural Information Processing Systems Year: 2022




[7] https://www.pinecone.io/learn/question-answering/

[8] https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/

[9] https://huggingface.co/docs/transformers/perf_train_gpu_one#bf16
