## Installing Unsloth

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Cell 1: Import Libraries and Configure Model

This cell imports the necessary libraries for loading datasets, tokenization, and model handling. It also defines the model configuration for efficient training and inference.

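- **Imports**:
  - `load_dataset`: For loading the transliteration dataset.
  - `train_test_split`: For splitting the data into training and validation sets.
  - `LlamaTokenizer`: Tokenizer class for Llama models (imported here, though the tokenizer actually used is the one returned by `FastLanguageModel.from_pretrained`).
  - `FastLanguageModel`: Unsloth's wrapper for loading pre-trained Llama models.
  - `torch`: For tensor computation and the `Dataset` base class.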

- **Model Configuration**:
  - `max_seq_length`: Sets the maximum sequence length for text processing.
  - `dtype`: Determines the data type for model weights (e.g., Float16 for faster computation).
  - `load_in_4bit`: Enables 4-bit quantization to reduce memory usage.

- **Pre-Quantized Models**:
  Lists supported models optimized for 4-bit quantization and faster performance.

---

# Cell 2: Load Pre-trained Model and Tokenizer

This cell loads the pre-trained Llama model and tokenizer.

- **Model**: The `Meta-Llama-3.1-8B` model is used for transliteration.
- **Configuration**:
  - The model is loaded with support for 4-bit quantization.
  - `max_seq_length` ensures compatibility with long-text tasks.
- **Tokenizer**: Handles text tokenization and decoding for both Banglish and Bangla.

---

# Cell 3: Define Data Loading and Preprocessing Pipeline

This cell defines a function to load and preprocess the transliteration dataset.

- **Dataset Loading**:
  - Fetches the Banglish-to-Bangla transliteration dataset from Hugging Face.
  - Extracts Banglish (`rm`) and Bangla (`bn`) text fields.
  
- **Dataset Splitting**:
  - Splits the data into training and validation sets using an 80/20 split.

- **Tokenization**:
  - Tokenizes the text using the Llama tokenizer with padding and truncation.
  - Prepares both the input (Banglish) and target (Bangla) for training.
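
A minimal sketch of the 80/20 split described above, written with the standard library only so it runs without downloading the dataset. The helper name `split_pairs` and the toy data are illustrative assumptions; the notebook itself relies on scikit-learn's `train_test_split`.

```python
import random

def split_pairs(sources, targets, test_size=0.2, seed=42):
    """Shuffle aligned (source, target) pairs and split them, keeping each pair together."""
    if len(sources) != len(targets):
        raise ValueError("sources and targets must have the same length")
    indices = list(range(len(sources)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = int(round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [(sources[i], targets[i]) for i in train_idx]
    val = [(sources[i], targets[i]) for i in val_idx]
    return train, val

# Hypothetical toy data standing in for the `rm` and `bn` columns
banglish = [f"rm_{i}" for i in range(10)]
bangla = [f"bn_{i}" for i in range(10)]
train_pairs, val_pairs = split_pairs(banglish, bangla)
print(len(train_pairs), len(val_pairs))
```

Splitting on shared indices rather than shuffling the two lists separately is what keeps every Banglish sentence attached to its Bangla counterpart.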

---

# Cell 4: Filter Data by Length

This cell defines a function to filter overly short or excessively long sentences.

- **Purpose**:
  - Improves training efficiency by ensuring that sequences fall within a specified length range (`min_len` to `max_len`).
- **Implementation**:
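  - Iterates through the tokenized inputs and labels to exclude sequences outside the acceptable range. Because the tokenizer pads every sequence to `max_length=128`, the lengths compared here include padding, so every sequence measures 128 tokens and the filter removes nothing unless padding is excluded from the count.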
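
A hedged sketch of a length filter that counts only non-padding tokens, which is one way to make the `min_len` and `max_len` bounds meaningful when sequences are padded to a fixed length. The function name `filter_by_real_length`, the `pad_id` argument, and the toy token IDs are assumptions for illustration; it operates on plain Python lists rather than tensors.

```python
def filter_by_real_length(input_ids, labels, pad_id, min_len=5, max_len=128):
    """Keep pairs whose non-padding token counts both fall within [min_len, max_len]."""
    kept_inputs, kept_labels = [], []
    for enc, lbl in zip(input_ids, labels):
        enc_len = sum(1 for token in enc if token != pad_id)
        lbl_len = sum(1 for token in lbl if token != pad_id)
        if min_len <= enc_len <= max_len and min_len <= lbl_len <= max_len:
            kept_inputs.append(enc)
            kept_labels.append(lbl)
    return kept_inputs, kept_labels

PAD = 0  # hypothetical pad id for the toy example
toy_inputs = [[5, 6, 7, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6, 0, 0]]
toy_labels = [[9, 9, 0, 0, 0, 0, 0, 0], [8, 8, 8, 8, 8, 0, 0, 0]]
kept_inputs, kept_labels = filter_by_real_length(
    toy_inputs, toy_labels, pad_id=PAD, min_len=5, max_len=8
)
print(len(kept_inputs))  # the first pair has only 3 real input tokens and is dropped
```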

---

# Cell 5: Create PyTorch Dataset Class

This cell defines a custom PyTorch dataset class.

- **TransliterationDataset**:
  - A PyTorch-compatible dataset for the transliteration task.
  - Handles input IDs and corresponding labels.
- **Methods**:
  - `__len__`: Returns the size of the dataset.
  - `__getitem__`: Fetches input and label pairs for training or validation.

---

# Cell 6: Load and Return Datasets

This cell executes the data preprocessing pipeline and returns the prepared datasets.

- **Execution**:
  - Calls the `load_and_preprocess_data` function to prepare the data.
- **Output**:
  - `train_dataset`: Tokenized training dataset.
  - `val_dataset`: Tokenized validation dataset.
  - `tokenizer`: The Llama tokenizer for encoding new inputs.

---


In [None]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import LlamaTokenizer
import torch
from unsloth import FastLanguageModel

# Model Configuration
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

def load_and_preprocess_data():
    # Load the dataset from Hugging Face
    dataset = load_dataset("SKNahin/bengali-transliteration-data")

    # Extract Banglish and Bangla text
    banglish_texts = [item['rm'] for item in dataset['train']]
    bangla_texts = [item['bn'] for item in dataset['train']]

    # Split the dataset into training and validation subsets (80/20 split)
    train_banglish, val_banglish, train_bangla, val_bangla = train_test_split(
        banglish_texts, bangla_texts, test_size=0.2, random_state=42
    )

    # Tokenize the data
    def tokenize_function(texts):
        return tokenizer(
            texts,
            max_length=128,  # Adjust as needed for your task
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

    train_encodings = tokenize_function(train_banglish)
    train_labels = tokenize_function(train_bangla)['input_ids']

    val_encodings = tokenize_function(val_banglish)
    val_labels = tokenize_function(val_bangla)['input_ids']

    # Filter overly short or excessively long sentences if needed
    def filter_by_length(encodings, labels, min_len=5, max_len=128):
        filtered_encodings, filtered_labels = [], []
        for enc, lbl in zip(encodings['input_ids'], labels):
            if min_len <= len(enc) <= max_len and min_len <= len(lbl) <= max_len:
                filtered_encodings.append(enc)
                filtered_labels.append(lbl)
        return filtered_encodings, filtered_labels

    train_encodings['input_ids'], train_labels = filter_by_length(
        train_encodings, train_labels
    )
    val_encodings['input_ids'], val_labels = filter_by_length(
        val_encodings, val_labels
    )

    # Convert to PyTorch datasets
    class TransliterationDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            return {
                'input_ids': self.encodings[idx],
                'labels': self.labels[idx]
            }

    train_dataset = TransliterationDataset(
        train_encodings['input_ids'], train_labels
    )
    val_dataset = TransliterationDataset(
        val_encodings['input_ids'], val_labels
    )

    return train_dataset, val_dataset, tokenizer

train_dataset, val_dataset, tokenizer = load_and_preprocess_data()


==((====))==  Unsloth 2024.12.8: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [None]:
train_dataset

<__main__.load_and_preprocess_data.<locals>.TransliterationDataset at 0x789f5a47b6a0>

We now add LoRA adapters, so only a small fraction of the parameters is updated (41,943,040 trainable parameters in this run, well under 1% of the 8B base model).

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.12.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep
We use the Banglish-to-Bangla transliteration dataset [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), which pairs romanized Bengali (`rm`) with Bengali script (`bn`). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
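
The EOS note above can be illustrated with a small, tokenizer-free sketch. It assumes the EOS string is `<|end_of_text|>` (in the notebook this would come from `tokenizer.eos_token`), and the helper name `build_training_text` and the example sentence are hypothetical. Unlike the notebook's template, this sketch leaves out the literal `</s>` and relies on the appended EOS string alone.

```python
PROMPT_TEMPLATE = """
### Banglish Text:
{}

### Bengali Translation:
{}"""

def build_training_text(banglish, bangla, eos_token):
    """Fill the prompt template and append the EOS string so generation learns where to stop."""
    return PROMPT_TEMPLATE.format(banglish, bangla) + eos_token

# Assumed EOS string; in practice read it from the tokenizer instead of hard-coding it
example = build_training_text("ami bhalo achi", "আমি ভালো আছি", eos_token="<|end_of_text|>")
print(example)
```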

In [None]:
# Define the prompt for Banglish-to-Bangla translation
banglish_to_bangla_prompt = """
### Banglish Text:
{}

### Bengali Translation:
{}</s>"""  # Note: '</s>' is a literal string here, not Llama 3.1's EOS token; the real EOS is appended via EOS_TOKEN below

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN to ensure generation stops properly

# Function to format data into the prompt template
def format_transliteration_data(examples):
    banglish_texts = examples["rm"]  # Column for Banglish text
    bangla_texts = examples["bn"]  # Column for Bengali text
    texts = [banglish_to_bangla_prompt.format(banglish, bangla) + EOS_TOKEN for banglish, bangla in zip(banglish_texts, bangla_texts)]
    return {"text": texts}


In [None]:
dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 3495
})

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run uses `max_steps=100` to keep training short; for a full pass over the data, set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!

In [None]:
print(trainer.model.config)


LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "unsloth/meta-llama-3.1-8b-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms

In [None]:
total_params = sum(p.numel() for p in trainer.model.parameters())
trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_params}")
print(f"Trainable Parameters: {trainable_params}")


Total Parameters: 4582543360
Trainable Parameters: 41943040
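
Both numbers above can be reproduced by hand from the printed `LlamaConfig`, which is a useful sanity check on the LoRA setup. The sketch below makes two assumptions that are not visible in the output: a vocabulary size of 128256 (the config printout is truncated before that field), and that each 4-bit quantized linear layer stores two weights per element, so `numel()` reports half the dense count.

```python
# Values read from the LlamaConfig printed earlier in the notebook
hidden_size = 4096
intermediate_size = 14336
num_layers = 32
kv_dim = 8 * 128        # num_key_value_heads * head_dim
vocab_size = 128256     # assumption: not shown in the truncated config output
rank = 16               # LoRA rank passed to get_peft_model

# (in_features, out_features) for each LoRA target module in one decoder layer
shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# LoRA adds A (rank x in) and B (out x rank) to every target module
lora_params = num_layers * sum(rank * (i + o) for i, o in shapes.values())

# Dense weight count of the quantized linear layers, then halved for 4-bit packing (assumption)
dense_linear = num_layers * sum(i * o for i, o in shapes.values())
packed_linear = dense_linear // 2

embeddings = 2 * vocab_size * hidden_size           # embed_tokens + lm_head, left unquantized
norms = num_layers * 2 * hidden_size + hidden_size  # two RMSNorms per layer plus the final norm

total_reported = packed_linear + embeddings + norms + lora_params
print(f"Trainable (LoRA) parameters: {lora_params}")
print(f"Total as counted by numel(): {total_reported}")
```

Under these assumptions the totals match the cell output, which also explains why an 8B model reports roughly 4.58 billion parameters once loaded in 4-bit.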


In [None]:
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_params}")
print(f"Trainable Parameters: {trainable_params}")


Total Parameters: 4582543360
Trainable Parameters: 41943040


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
7.736 GB of memory reserved.


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

462.7198 seconds used for training.
7.71 minutes used for training.
Peak reserved memory = 7.922 GB.
Peak reserved memory for training = 1.938 GB.
Peak reserved memory % of max memory = 53.716 %.
Peak reserved memory for training % of max memory = 13.141 %.


<a name="Inference"></a>
### Inference
Let's run the model! Put your Banglish text in the prompt and leave the Bengali section empty for the model to fill in.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# Import necessary libraries
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import LlamaTokenizer
import torch
from unsloth import FastLanguageModel

# Model Configuration
max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# Load Pre-trained Model and Tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit
)


==((====))==  Unsloth 2024.12.8: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
# Apply LoRA Fine-Tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # Supports rank stabilized LoRA
    loftq_config=None  # And LoftQ
)


In [None]:
# Define the prompt for Banglish-to-Bangla translation
banglish_to_bangla_prompt = """
### Banglish Text:
{}

### Bengali Translation:
{}</s>"""  # Note: '</s>' is a literal string here, not Llama 3.1's EOS token; the real EOS is appended via EOS_TOKEN below

EOS_TOKEN = tokenizer.eos_token  # Ensure generation stops properly

# Function to format the dataset
def format_transliteration_data(examples):
    banglish_texts = examples["rm"]  # Column for Banglish text
    bangla_texts = examples["bn"]  # Column for Bengali text
    texts = [banglish_to_bangla_prompt.format(banglish, bangla) + EOS_TOKEN for banglish, bangla in zip(banglish_texts, bangla_texts)]
    return {"text": texts}

# Function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=2048,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )



In [None]:
# Load and preprocess the dataset
raw_dataset = load_dataset("SKNahin/bengali-transliteration-data", split="train")
formatted_dataset = raw_dataset.map(format_transliteration_data, batched=True)

# Tokenize the dataset
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)

# Split the dataset into training and validation sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=42)
train_dataset_hf = split_dataset["train"]
val_dataset_hf = split_dataset["test"]

# Convert to PyTorch datasets
class TokenizedDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset):
        self.input_ids = hf_dataset["input_ids"]
        self.attention_mask = hf_dataset["attention_mask"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx]
        }

train_dataset = TokenizedDataset(train_dataset_hf)
val_dataset = TokenizedDataset(val_dataset_hf)

# Verify the data
print(train_dataset[0])


Map:   0%|          | 0/5006 [00:00<?, ? examples/s]

{'input_ids': [128000, 198, 14711, 17343, 1706, 2991, 512, 10835, 52757, 26548, 887, 836, 426, 819, 278, 802, 469, 31764, 2194, 33820, 359, 597, 12052, 24688, 14711, 26316, 8115, 39141, 512, 11372, 228, 11372, 103, 87648, 50228, 108, 36278, 237, 11372, 104, 11372, 105, 62456, 36278, 228, 11372, 229, 11372, 94, 62456, 36278, 101, 60008, 11372, 106, 36278, 105, 81278, 114, 50228, 110, 36278, 228, 73358, 36278, 237, 11372, 229, 11372, 244, 50228, 101, 60008, 36278, 106, 50228, 106, 28025, 223, 87648, 36278, 243, 60008, 87648, 949, 524, 82, 29, 128001, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 12800

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# TrainingArguments for the 4,505-example training split
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=2048,  # Keeps context length high
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Small batch size for memory efficiency
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
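        max_steps=100,  # 100 optimizer steps (about 0.18 epoch on 4,505 examples); overrides num_train_epochs
        learning_rate=2e-4,  # Common starting LR for LoRA
        warmup_steps=100,  # Equals max_steps, so the LR is still warming up for the whole run; use a smaller value to reach the peak LR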
        fp16=not is_bfloat16_supported(),  # Mixed precision for faster training
        bf16=is_bfloat16_supported(),
        logging_steps=10,  # Log training metrics every 10 steps
        eval_steps=100,  # Evaluation interval; only takes effect when step-based evaluation is enabled
        save_steps=500,  # Save model checkpoint every 500 steps
        weight_decay=0.01,  # Regularization to prevent overfitting
        lr_scheduler_type="cosine",  # Cosine decay for smoother learning rate drop
        output_dir="outputs",  # Model output directory
        save_total_limit=2,  # Keep only the 2 latest checkpoints
        seed=42,  # Reproducibility
        report_to="none"  # Set to 'wandb' or 'tensorboard' for monitoring
    ),
)
trainer.train()


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,505 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 100
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,1.6613
20,1.3232
30,1.1092
40,1.0104
50,0.9676
60,0.9289
70,0.9404
80,0.9123
90,0.8868
100,0.9066
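
With `warmup_steps` equal to `max_steps`, the learning rate is still ramping up when training ends. The sketch below re-implements the usual linear-warmup plus cosine-decay rule in plain Python to show this; it is an approximation written for illustration, not the library's scheduler code, and the function name `lr_at_step` is hypothetical.

```python
import math

def lr_at_step(step, peak_lr=2e-4, warmup_steps=100, total_steps=100):
    """Linear warmup to peak_lr, then half-cosine decay to zero (illustrative re-implementation)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Settings used in this run: the LR climbs linearly and never decays
for step in (10, 50, 90):
    print(step, lr_at_step(step))

# A shorter warmup (hypothetical value) reaches the peak early and then decays
print(lr_at_step(10, warmup_steps=10), lr_at_step(100, warmup_steps=10))
```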


TrainOutput(global_step=100, training_loss=1.0646703052520752, metrics={'train_runtime': 3770.1271, 'train_samples_per_second': 0.212, 'train_steps_per_second': 0.027, 'total_flos': 7.41887283560448e+16, 'train_loss': 1.0646703052520752, 'epoch': 0.1775410563692854})
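
The `epoch` and throughput figures in this output follow from the batch settings. A short worked check, assuming the epoch counter is optimizer steps times accumulation steps divided by the number of batches per epoch (this assumption reproduces the logged value):

```python
import math

num_examples = 4505        # "Num examples" from the training banner
per_device_batch = 2
grad_accum = 4
global_step = 100
train_runtime = 3770.1271  # seconds, from TrainOutput

batches_per_epoch = math.ceil(num_examples / per_device_batch)
epoch_fraction = global_step * grad_accum / batches_per_epoch
samples_seen = global_step * per_device_batch * grad_accum
samples_per_second = samples_seen / train_runtime

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Fraction of an epoch completed: {epoch_fraction:.4f}")
print(f"Samples seen: {samples_seen}")
print(f"Samples per second: {samples_per_second:.3f}")
```

So 100 steps cover 800 examples, a little under a fifth of the training split.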

In [None]:
# Define a simple prompt structure for inference
banglish_prompt = """
### Banglish Text:
{}

### Bengali Translation:
"""

# Prepare the input for inference
FastLanguageModel.for_inference(model)  # Enable native faster inference
inputs = tokenizer(
    [
        banglish_prompt.format("xda-developers e ei browser er ekta good mod ase. Tar download link ta diven please. Ami download korte parsi naa.")
    ],
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)

# Generate predictions
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=300)


<|begin_of_text|>
### Banglish Text:
xda-developers e ei browser er ekta good mod ase. Tar download link ta diven please. Ami download korte parsi naa.

### Bengali Translation:
এক্সডিএ ডেভেলপার্স এ এই ব্রাউজার এর একটা গুড মোড আছে। তার ডাউনলোড লিংক টা দিন প্লিজ। আমি ডাউনলোড করতে পারছি না।</s><|end_of_text|>
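
The streamed output contains the prompt, the literal `</s>`, and the EOS string. A small post-processing sketch for pulling out just the Bengali text is shown below; the helper name `extract_transliteration` and its default stop strings are assumptions based on the output above, and the sample string is a shortened version of it.

```python
def extract_transliteration(generated, marker="### Bengali Translation:",
                            stop_strings=("</s>", "<|end_of_text|>")):
    """Return the text after the marker, cut at the first stop string and stripped of whitespace."""
    _, _, tail = generated.partition(marker)  # tail is empty if the marker is missing
    for stop in stop_strings:
        tail = tail.split(stop, 1)[0]
    return tail.strip()

sample = (
    "<|begin_of_text|>\n### Banglish Text:\nAmi download korte parsi naa.\n\n"
    "### Bengali Translation:\nআমি ডাউনলোড করতে পারছি না।</s><|end_of_text|>"
)
print(extract_transliteration(sample))
```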


In [None]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [40]:
import os

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# Read the Hugging Face token from the environment; never hard-code access tokens in a notebook
hf_token = os.environ["HF_TOKEN"]
model.push_to_hub("Tamim18/banglish_to_bangla_llama_fine_tuning", token = hf_token) # Online saving
tokenizer.push_to_hub("Tamim18/banglish_to_bangla_llama_fine_tuning", token = hf_token) # Online saving

README.md:   0%|          | 0.00/588 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Tamim18/banglish_to_bangla_llama_fine_tuning


tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
