<img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/CALMLogo.png" alt="CALM"/>

This notebook was written for the Collaborative Arabic Language Model (CALM) project. It contains instructions on how to set up your collaborative training.


* For more information, please visit https://github.com/NCAI-Research/CALM and https://huggingface.co/CALM. 

---


# 📣 Pre-training required steps: 
1.   Create a [**Hugging Face account**](https://huggingface.co) and join the NCAI-CALM organization 👉🏻 https://huggingface.co/CALM using the link sent to you in the invitation email.

---

# Step 1: Clone the repo

In [None]:
!git clone https://github.com/NCAI-Research/CALM

# Step 2: Install the required libraries

NOTE: Be patient; this may take a couple of minutes.

In [None]:
print("Installing requirements...")
%cd CALM
!pip install -q -r requirements.txt &> log

# Step 3: Set up the experiment environment variables


In [None]:
# Initialize the name of the experiment
exp_name = "CALM"

# the name of the HF organization and model for the experiment
%env HF_ORGANIZATION_NAME=CALM
%env HF_MODEL_NAME={exp_name}

# WANDB information for tracking the run 

%env WANDB_API_KEY=65dbae2761bd93ee41c54b443c361114be29b8ec

# Name, project, and method for the WANDB Team
%env WANDB_ENTITY=calm
%env WANDB_PROJECT={exp_name}-hivemind-trainers
%env WANDB_START_METHOD=thread
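
To confirm that the variables above are set before moving on, a quick check can be run in a separate cell. This is only a minimal sketch: the variable names come from the cell above, but the helper `missing_env_vars` is a hypothetical name introduced here for illustration and is not part of the CALM repo.

```python
import os

# Variables set by the %env lines in the cell above.
REQUIRED_VARS = [
    "HF_ORGANIZATION_NAME",
    "HF_MODEL_NAME",
    "WANDB_API_KEY",
    "WANDB_ENTITY",
    "WANDB_PROJECT",
    "WANDB_START_METHOD",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All experiment environment variables are set.")
```

If anything is reported as missing, re-run the cell above before continuing.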

## Check the user authority in the HF organization 🤗

When the code runs, it will prompt you for your HF user access token 🔑. To get it:

1. Go to your [HF account](https://huggingface.co)
2. Go to Settings ⇒ Access Tokens
3. Generate a new Access Token and enter any name for "what's this token for"
4. Select the `read` role
5. Copy your access token
6. Paste it in the execution prompt in the notebook




In [None]:
import os
from huggingface_auth import authorize_with_huggingface

os.environ['HF_USER_ACCESS_TOKEN'] = authorize_with_huggingface().hf_user_access_token

## Download the punkt sentence tokenizer


In [None]:
import nltk
nltk.download('punkt')

# Step 4: Let's start training 👏 🕖
 

In [None]:
# Check the device capability to set the batch size
import torch
capability = torch.cuda.get_device_capability()
memory_gb = torch.cuda.mem_get_info()[1] / 1e9
gradient_checkpointing = False
if capability >= (8, 0):  # ampere
  batch_size, fp16 = 8, True
elif capability >= (6, 0):  # v100, t4, p100
  batch_size, fp16 = 4, True
else:  # maxwell, kepler
  batch_size, fp16 = 1, False
if memory_gb < 9:  # 8gb gpus: 1070, 2060S
  batch_size = min(batch_size, 2)
if memory_gb < 7:  # 6gb or less: try our best to fit
  batch_size, fp16 = min(batch_size, 1), True
  gradient_checkpointing = True
print(f"\nRunning {torch.cuda.get_device_name()}, setting batch size = {batch_size}, fp16 = {fp16}, gradient_checkpointing={gradient_checkpointing}\n")

# start the training
!ulimit -n 16384 && python run_trainer.py --run_id {exp_name} --per_device_train_batch_size {batch_size} --gradient_accumulation_steps 1 --fp16 {fp16} --gradient_checkpointing {gradient_checkpointing} \
  --client_mode --matchmaking_time 60 --initial_peers /ip4/34.124.232.172/tcp/12345/p2p/QmdGDSzDEi7uo8pTGG7n8s2dW12VGoPQKiDVDoQaVAo3bf /ip4/193.106.95.184/tcp/12345/p2p/QmRgdEXySu8hEB3xUxexJPxcv7M41PggRDnUTf9kStdgup /ip4/194.213.3.15/tcp/12345/p2p/QmYSF8GSLWxjJxSrtpAbdQGRSxbDT81MruruEcxNaDcZCD
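
The batch-size selection in the cell above can also be read as a pure function of compute capability and total GPU memory, which makes it possible to see what settings a given card would get without running on it. The following is a minimal sketch that mirrors the thresholds used above. The function name `choose_training_config` and the sample inputs are hypothetical and not part of the CALM repo.

```python
def choose_training_config(capability, memory_gb):
    """Mirror the notebook's heuristics.

    capability: (major, minor) tuple, as returned by torch.cuda.get_device_capability()
    memory_gb:  total device memory in GB
    Returns (batch_size, fp16, gradient_checkpointing).
    Hypothetical helper for illustration only.
    """
    gradient_checkpointing = False
    if capability >= (8, 0):      # Ampere or newer
        batch_size, fp16 = 8, True
    elif capability >= (6, 0):    # e.g. V100, T4, P100
        batch_size, fp16 = 4, True
    else:                         # Maxwell, Kepler
        batch_size, fp16 = 1, False
    if memory_gb < 9:             # 8 GB cards
        batch_size = min(batch_size, 2)
    if memory_gb < 7:             # 6 GB or less: try to fit anyway
        batch_size, fp16 = min(batch_size, 1), True
        gradient_checkpointing = True
    return batch_size, fp16, gradient_checkpointing


# Illustrative inputs only (not measured values).
samples = [
    ("capability 8.0, 40 GB", (8, 0), 40.0),
    ("capability 7.5, 15.8 GB", (7, 5), 15.8),
    ("capability 6.1, 6.4 GB", (6, 1), 6.4),
]
for label, cap, mem in samples:
    print(label, "->", choose_training_config(cap, mem))
```

Note that the memory checks are applied after the capability check, so a small-memory card always ends up with the lower batch size regardless of its architecture.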