# Training our own LLM on Nepali Language (Part-1)

* Why: customization and support for the Nepali language
* Datasets: crawled Nepali newspapers, Common Crawl, Hugging Face datasets, Falcon RefinedWeb
* Model architectures: encoder, decoder (combined or standalone)
* Benchmarks: <to-search>

# Resources

* [BERT-paper](https://arxiv.org/abs/1810.04805)
* [Original GPT paper](https://www.mikecaptain.com/resources/pdf/GPT-1.pdf)
* GPT-2 (paper: Language Models are Unsupervised Multitask Learners (2019))
* GPT-3 (paper: Language Models are Few-Shot Learners (2020))
* Transformer-XL (paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019))
* XLNet (paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding (2019))
* T5 (paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019))
* BART model (paper: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019))
* exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models, Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann, 2019
* [RoBERTa](https://arxiv.org/abs/1907.11692)

## Books:
* [Mastering Transformers by Savaş Yıldırım](https://www.amazon.com/Mastering-Transformers-state-art-processing/dp/1801077657)

## Videos
* https://exbert.net/
* [Create a Large Language Model from Scratch with Python – Tutorial](https://m.youtube.com/watch?v=UU1WVnMk4E8)


# Nepali language models:
* [NepBERTa](https://aclanthology.org/2022.aacl-short.34/)
* [NepaliBERT (Huggingface)](https://huggingface.co/Rajan/NepaliBERT)
* [Distilled GPT-2 Nepali (Huggingface)](https://huggingface.co/Sakonii/distilgpt2-nepali)


# References
* [Build Your Own LLM Model Using OpenAI](https://medium.com/@nileshpatel7048/build-your-own-llm-model-using-openai-6ed0954e4db1)
* [A Step-by-Step Guide to Training Your Own Large Language Models (LLMs).](https://blog.gopenai.com/a-step-by-step-guide-to-training-your-own-llm-2d81ff810695)
* [How to Build an LLM from Scratch](https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9)

# Training Nepali LLM (Part-2/N)

# Datasets:

## 1. nepalitext-language-model-dataset
* 13 million Nepali text sequences
* Extracted by combining the OSCAR and CC100 datasets

## 2. OSCAR
* Open Super-large Crawled ALMAnaCH coRpus
* Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture
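
To make the filtering idea concrete, here is a minimal sketch of a script-based filter that keeps only lines written mostly in Devanagari. It is not the goclassy pipeline used by OSCAR: the function names and the `0.5` threshold are assumptions made for illustration, and a script check cannot tell Nepali apart from other Devanagari languages such as Hindi.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of letters that fall in the Devanagari block (U+0900 to U+097F)."""
    # str.isalpha() skips digits, punctuation, spaces and combining vowel signs.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return devanagari / len(letters)


def keep_devanagari_lines(lines, threshold=0.5):
    """Keep lines whose letters are mostly Devanagari (threshold is an assumed value)."""
    return [line for line in lines if devanagari_ratio(line) >= threshold]


if __name__ == "__main__":
    sample = ["नेपाल सुन्दर देश हो", "Hello world", "नेपाल Nepal"]
    for line in keep_devanagari_lines(sample):
        print(line)
```

A real pipeline would follow this with a proper language identifier and deduplication; the script check is only a cheap first pass.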

## 3. Common Crawl
* Free, open repository of web crawl data: https://commoncrawl.org

## 4. Crawling Online Nepali newspapers
* NepBERTa [4] seems to have crawled 36 Nepali online newspapers to obtain 1.5 billion tokens in total.
* Crawling online newspapers seems like a good idea, as they can provide a large amount of up-to-date data.
* We have implemented a simple Scrapy crawler [5] to crawl online Nepali newspapers.
* The data crawled so far is stored at [6].
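
As an illustration of the extraction step that follows a crawl, the sketch below pulls paragraph text out of a downloaded news page using only the Python standard library. It is not the Scrapy crawler referenced in [5]; the class and function names are hypothetical, and it assumes that article text lives inside `<p>` tags, which will not hold for every site.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> list[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    page = "<html><body><h1>शीर्षक</h1><p>काठमाडौं, <b>नेपाल</b>।</p><p>  </p></body></html>"
    print(extract_paragraphs(page))
```

In a Scrapy project the same job is usually done with CSS or XPath selectors inside the spider's parse callback, with selectors tuned per newspaper.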


# References:
1. [nepalitext-language-model-dataset](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset)
2. [OSCAR](https://huggingface.co/datasets/oscar)
3. [CC100](https://huggingface.co/datasets/cc100)
4. [NepBERTa](https://aclanthology.org/2022.aacl-short.34/)
5. [scrapy-crawler](https://github.com/Aananda-giri/scrapy_engine)
6. [crawled-dataset](https://drive.google.com/drive/folders/1v_dv0H56D3J-56VDPIaBkJ601djs8C0w?usp=sharing)

# Training Nepali LLM (Part-3/N)



# Tokenization

* Padding and truncation of sentences.

* N-gram: a contiguous sequence of n tokens (words or characters), a unit that sits between word-level and sentence-level tokenization

* Special tokens: [CLS] (classification), [SEP] (separator), [MASK]

* Note: tokens are the units of text data: words, subwords, characters, punctuation marks, etc.
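
The points above can be sketched in a few lines of Python. This is a toy example under stated assumptions: it splits on whitespace (real models use a trained subword tokenizer), and the `[PAD]` token and the function names `encode` and `ngrams` are introduced here for illustration only.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"  # [PAD] is an assumed padding token


def encode(sentence: str, max_len: int) -> list[str]:
    """Whitespace-tokenize, add special tokens, then truncate or pad to max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    tokens = sentence.split()
    tokens = tokens[: max_len - 2]               # truncation
    tokens = [CLS] + tokens + [SEP]              # special tokens
    tokens += [PAD] * (max_len - len(tokens))    # padding
    return tokens


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(encode("म नेपाली हुँ", max_len=6))
    print(encode("a b c d e f", max_len=5))
    print(ngrams(["a", "b", "c"], 2))
```

Padding and truncation give every sequence in a batch the same length, which is what lets the batch be stacked into a single tensor.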



# Training (Theory):

Language models like ChatGPT [2] seem to be trained in three stages:



## 1. Pre-training

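* Masked language modelling (MLM): some tokens in a sentence are replaced with [MASK], and the model predicts the original tokens.

* Next token prediction: given the preceding tokens, the model predicts the token that follows.

* Next Sentence Prediction (NSP): given two sentences, the model predicts whether the second one follows the first in the original text.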

* Continuous Bag of Words (CBOW): predicting a target word based on the context of its surrounding words.

* Skip-gram: Given a target word, the model predicts the context words surrounding it.

* Denoising autoencoder: instead of only masking random words, the input is corrupted with noise and the model is trained to reconstruct the original, uncorrupted input.

* Replaced Token Detection: similar to MLM, but a token in the sentence is replaced with another token and the model is trained to detect the replacement.

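* Permutation Language Modeling (PLM): a random permutation of the prediction order is sampled, and the model predicts each token from the tokens that precede it in that order (the approach used by XLNet). The word order of the input itself is not shuffled.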

* Document-level tasks: tasks such as predicting the next document in a sequence or understanding document semantics.
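
To show how some of these objectives turn raw text into training examples, here is a minimal sketch covering masked language modelling, next token prediction, and skip-gram. The function names are hypothetical, and the 15% masking rate is an assumption borrowed from BERT. BERT's actual recipe is more involved (it sometimes keeps or randomly replaces the chosen token), which this sketch leaves out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM: replace some tokens with [MASK]; labels hold the original token or None."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            labels.append(None)    # position is ignored by the loss
    return inputs, labels


def next_token_pairs(tokens):
    """Next token prediction: (context so far, token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def skipgram_pairs(tokens, window=1):
    """Skip-gram: (target word, one context word) for each neighbour in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


if __name__ == "__main__":
    sentence = "म नेपाली भाषा सिक्दै छु".split()
    print(mask_tokens(sentence, mask_prob=0.3))
    print(next_token_pairs(sentence))
    print(skipgram_pairs(sentence, window=1))
```

CBOW uses the same windows as skip-gram with the direction reversed: the context words are the input and the target word is the label.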



## 2. Reward Model

* Train a reward model.

* The reward model is trained to predict the reward that human feedback would assign to an output.

* The model generates multiple outputs, and a human labeler ranks them from best to worst.
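
One common way to turn such rankings into a training signal is a pairwise loss: for every pair of outputs, the reward model is penalized when it scores the lower-ranked output above the higher-ranked one. The sketch below assumes this pairwise formulation, loss = -log(sigmoid(r_better - r_worse)); the source does not specify the exact loss, and the function name is hypothetical.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_first):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs.

    rewards_best_first: reward-model scores for the outputs, ordered from the
    output the human labeler ranked best to the one ranked worst.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked output.
    pairs = list(combinations(rewards_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    # -log(sigmoid(d)) is the same as log(1 + exp(-d)).
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(pairwise_ranking_loss([2.0, 1.0, -1.0]))  # scores agree with the ranking
    print(pairwise_ranking_loss([-1.0, 1.0, 2.0]))  # scores contradict the ranking
```

The loss is small when the scores agree with the human ranking and grows as they disagree, so minimizing it pushes the reward model toward the labelers' preferences.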



## 3. Fine-tuning

* Train on downstream tasks such as question answering and summarization.

* Optimize the policy against the reward model using the PPO reinforcement learning algorithm.
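
The core of PPO is its clipped surrogate objective, which stops a single update from moving the policy too far from the one that generated the samples. The sketch below computes that objective for one sample. The function name and `clip_eps=0.2` are assumptions for illustration, and a full RLHF setup would also need advantage estimation and typically a penalty for drifting from the original model, which are not shown.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate objective for a single sample (to be maximized).

    ratio: new_policy_prob / old_policy_prob for the sampled action (token).
    advantage: how much better the action was than expected, here derived
               from the reward model's score.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside
    # the range [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))   # gain is capped by the clip
    print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))  # pessimistic clipped value is used
    print(ppo_clipped_objective(ratio=1.0, advantage=2.0))   # inside the clip range
```

In training, this value is averaged over a batch of sampled tokens and maximized with gradient ascent.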



# References:

1. [Paper. Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)

2. [Blog. ChatGPT introduction](https://openai.com/blog/chatgpt)

3. [Vid. Let's build GPT - Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)

