# Sky Reach AI Bilingual Training

This notebook demonstrates how to train the Sky Reach AI bilingual model with Bangla and English data.

## Step 1: Install required packages

- Install the libraries the notebook needs: transformers, datasets, PyTorch, sentencepiece for tokenizer support, and accelerate, which the `Trainer` requires when used with PyTorch.

In [1]:
!pip install transformers datasets torch sentencepiece accelerate -q

  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [48 lines of output]
      Traceback (most recent call last):
        File "C:\Users\DELL\AppData\Roaming\Python\Python313\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 389, in <module>
          main()
          ~~~~^^
        File "C:\Users\DELL\AppData\Roaming\Python\Python313\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 373, in main
          json_out["return_val"] = hook(**hook_input["kwargs"])
                                   ~~~~^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\DELL\AppData\Roaming\Python\Python313\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 143, in get_requires_for_build_wheel
          return hook(config_settings)
        File "C:\Users\DELL\AppData\Local\Temp\pip-build-env-ola19u55\overlay\Lib\site-packages\setuptools\build_meta.py", line 3
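
The output above shows pip failing while getting the requirements to build a wheel, so some of the requested packages may not have been installed. Below is a minimal sketch, using only the standard library, for checking which imports are available before running the later cells. The helper name `missing_packages` is hypothetical, and the list of import names is an assumption based on the imports and install command in this notebook.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed from this notebook's cells (PyYAML is imported as `yaml`).
required = ["transformers", "datasets", "torch", "sentencepiece", "yaml"]

print("Python", sys.version.split()[0])
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

A build-from-source step like the one in the traceback typically happens when pip finds no prebuilt wheel for the interpreter in use (Python 3.13 in the paths above), so trying a Python version for which wheels are published may help.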

## Step 2: Load configuration, tokenizer, and model

- Load the training configuration from a YAML file.
- Initialize the tokenizer from the tokenizer path given in the configuration.
- Load the pretrained causal language model named in the configuration.

In [3]:
import yaml
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

# Load config; a forward slash avoids '\t' in the path being read as a tab escape
with open('SkyReach-AI-Bilingual-Training/train_config.yaml', encoding='utf-8') as f:
    config = yaml.safe_load(f)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(config['tokenizer_path'])
model = AutoModelForCausalLM.from_pretrained(config['model_name'])

ModuleNotFoundError: No module named 'transformers'
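
The cells in this notebook read six keys from the configuration: `tokenizer_path`, `model_name`, `dataset_path`, `epochs`, `batch_size`, and `learning_rate`. Below is a minimal sketch of checking them right after loading, written against a plain dictionary so it runs without PyYAML. The helper name `validate_config` and all example values are hypothetical. The numeric casts matter because PyYAML loads a value written as `5e-5` (no decimal point) as a string rather than a float.

```python
# Keys this notebook reads from the config.
REQUIRED_KEYS = [
    "tokenizer_path", "model_name", "dataset_path",
    "epochs", "batch_size", "learning_rate",
]

# Numeric keys and the type each is converted to.
NUMERIC_CASTS = {"epochs": float, "batch_size": int, "learning_rate": float}


def validate_config(config):
    """Check that required keys exist and coerce numeric values."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    validated = dict(config)
    for key, cast in NUMERIC_CASTS.items():
        validated[key] = cast(config[key])
    return validated


# Hypothetical values standing in for the result of yaml.safe_load.
example = {
    "tokenizer_path": "path/to/tokenizer",
    "model_name": "path/to/model",
    "dataset_path": "path/to/data.jsonl",
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": "5e-5",  # PyYAML loads this spelling as a string
}
clean = validate_config(example)
print(type(clean["learning_rate"]).__name__)
```

Calling `validate_config(config)` directly after `yaml.safe_load` would surface a missing or mistyped key before any model is downloaded.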

## Step 3: Load and preprocess dataset

- Load bilingual dataset from JSON files.
- Define a tokenization function that truncates each example to at most 512 tokens.
- Apply tokenization to dataset.

In [4]:
# Load dataset
dataset = load_dataset('json', data_files=config['dataset_path'])['train']

# Tokenize dataset
def tokenize_fn(example):
    return tokenizer(example['text'], truncation=True, max_length=512)

tokenized_ds = dataset.map(tokenize_fn, batched=True)

NameError: name 'load_dataset' is not defined
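
The tokenization function reads `example['text']`, so every record in the JSON data needs a `text` field. The dataset file itself is not shown here. The sketch below assumes a JSON Lines layout (one JSON object per line) and uses two made-up sample records to illustrate a pre-flight check with the standard library. The helper name `check_text_records` is hypothetical.

```python
import json
import os
import tempfile


def check_text_records(path):
    """Count records in a JSON Lines file, requiring a string 'text' field in each."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"Line {line_number} has no string 'text' field")
            count += 1
    return count


# Hypothetical sample records, one Bangla and one English.
samples = [
    {"text": "আমি বাংলায় কথা বলি।"},
    {"text": "I speak English."},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in samples:
            # ensure_ascii=False keeps Bangla characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    n_records = check_text_records(sample_path)
    print(n_records)
```

Opening files with `encoding="utf-8"` is worth doing explicitly for Bangla text, since the platform default encoding on Windows is often not UTF-8.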

## Step 4: Setup training parameters

- Define training hyperparameters like epochs, batch size, and learning rate.
- Configure logging and checkpoint saving frequency.

In [None]:
# Prepare training arguments
training_args = TrainingArguments(
    output_dir="../results",
    num_train_epochs=config['epochs'],
    per_device_train_batch_size=config['batch_size'],
    learning_rate=float(config['learning_rate']),  # PyYAML reads values such as 5e-5 as strings
    logging_dir='../logs',
    logging_steps=10,
    save_steps=100,
    save_total_limit=2
)
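
To see what `logging_steps=10`, `save_steps=100`, and `save_total_limit=2` mean for a given run, it can help to estimate the number of optimizer steps up front. The sketch below is a rough calculation that assumes a single device, no gradient accumulation, and a whole number of epochs. The helper name `training_schedule` and the dataset size are hypothetical, and some library versions may write one additional checkpoint at the final step.

```python
import math


def training_schedule(num_examples, batch_size, epochs,
                      logging_steps=10, save_steps=100, save_total_limit=2):
    """Estimate step counts, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    periodic_checkpoints = total_steps // save_steps
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "periodic_checkpoints": periodic_checkpoints,
        # save_total_limit caps how many checkpoints remain on disk
        "checkpoints_kept": min(periodic_checkpoints, save_total_limit),
    }


# Hypothetical dataset size and config values.
schedule = training_schedule(num_examples=5000, batch_size=8, epochs=3)
print(schedule)
```

With these made-up numbers, only the two most recent checkpoints would remain in the output directory at the end of the run, because older ones are deleted as new ones are written.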

## Step 5: Initialize Trainer and start training

- Instantiate the Hugging Face `Trainer` with model, args, and dataset.
- Begin fine-tuning the bilingual model on the tokenized data.

In [None]:
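# The collator pads each batch and copies input_ids into labels,
# which a causal LM needs in order to compute a loss
from transformers import DataCollatorForLanguageModeling

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: the tokenizer defines an EOS token

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

# Train the model
trainer.train()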
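
The `Trainer` can compute a loss only when each batch includes `labels`, and the sequences in a batch must have equal length. The tokenized dataset from Step 3 contains neither padding nor labels, so a data collator typically supplies both. In transformers, `DataCollatorForLanguageModeling` with `mlm=False` plays this role. The pure-Python sketch below only illustrates the idea and is not the library's implementation. The function name `collate_causal_lm` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(sequences, pad_token_id):
    """Pad token-id lists to equal length and build labels for a causal LM."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


# Hypothetical token ids for two examples of different length.
batch = collate_causal_lm([[11, 12, 13], [21, 22]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

The labels are an unshifted copy of the inputs because Hugging Face causal language models shift them internally when computing the next-token loss.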