# Amharic LLaMA 2 experiments

This notebook contains the code used to experiment with the Amharic LLaMA 2 model. More information about the model is available [here](https://medium.com/@garrilogistics/llama-2-amharic-llms-for-low-resource-languages-d6fb0ba332f4).



## Installing required packages

Run the following command to install the required packages. Some of the packages are optional and were added to handle specific exceptions thrown during the initial experiments.

In [22]:
!pip install --upgrade transformers accelerate  bitsandbytes-cuda110 bitsandbytes sentencepiece peft datasets -q

In [2]:
!nvidia-smi

Wed Jan 24 13:30:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              42W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [23]:
import torch
import os
import sys
import time
import json
from typing import List
import datasets
import csv
from transformers import LlamaTokenizer, LlamaForCausalLM

BASE_PROMPT = """Below is an interaction between a human and an AI fluent in English and Amharic, providing reliable and informative answers. The AI is supposed to answer test questions from the human with short responses saying just the answer and nothing else.

Human: {instruction}

Assistant [Amharic] : """

In [4]:
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the GNU General Public License version 3.

from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaConfig

# Function to load the main model for text generation
def load_model(model_name, quantization):
    model = LlamaForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=quantization,
        device_map='cuda:0',
        low_cpu_mem_usage=True,
    )
    return model


# Function to wrap the base model with a PEFT adapter (fine-tuned adapter weights)
def load_peft_model(model, peft_model):
    peft_model = PeftModel.from_pretrained(model, peft_model, offload_folder='./')
    return peft_model

# Build a model from its config only, so that FSDP checkpoints can be loaded into it
def load_llama_from_config(config_path):
    model_config = LlamaConfig.from_pretrained(config_path)
    model = LlamaForCausalLM(config=model_config)
    return model



In [49]:
def main(
    model,
    tokenizer,
    datasource,  # List of data sources to use, no default value
    csv_file_path,  # Path to the CSV file to save responses, no default value
    max_new_tokens=100,  # The maximum number of tokens to generate
    seed=42,  # seed value for reproducibility
    do_sample=True,  # Whether or not to use sampling; use greedy decoding otherwise.
    min_length=None,  # The minimum length of the sequence to be generated
    use_cache=True,  # [optional] Whether or not the model should use the past last key/values attentions
    top_p=1.0,  # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    temperature=1.0,  # [optional] The value used to modulate the next token probabilities.
    top_k=5,  # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
    repetition_penalty=5.0,  # The parameter for repetition penalty. 1.0 means no penalty.
    length_penalty=1,  # [optional] Exponential penalty to the length used with beam-based generation.
    enable_azure_content_safety=False,  # Enable safety check with Azure content safety API
    enable_sensitive_topics=False,  # Enable check for sensitive topics using AuditNLG APIs
    enable_saleforce_content_safety=False,  # Enable safety check with Salesforce safety T5
    **kwargs  # Additional arguments for the model.generate function
):
    # Note: Ensure that the appropriate tokenizer is used for the language.
    print("*** Ensure that you have replaced the default tokenizer with the appropriate one for your use case.")

    # Seed the random number generator so that sampled outputs are reproducible
    torch.manual_seed(seed)

    model.eval()

    # Use the test split of the Hugging Face dataset loaded globally as `hf_dataset`
    dataset = hf_dataset['test']

    # Prepare the CSV file for saving responses
    with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Instruction', 'Input Text', 'Datasource','response', 'gold_label'])  # Column headers

        for item in dataset:  # Change the split selected above if necessary
            instruction = item['instruction']  # Extracting the instruction
            input_text = item['input']  # Extracting the input text
            datasource = item['datasource']  # Per-row source; this overwrites the `datasource` argument
            gold_label = item['output']  # Reference answer for this row

            # Combine instruction and input_text for the prompt
            user_prompt = BASE_PROMPT.format(instruction=f"{instruction}\n{input_text}")

            batch = tokenizer(user_prompt, return_tensors="pt")
            batch = {k: v.to(model.device) for k, v in batch.items()}  # Ensure tensors are on the same device as the model

            start = time.perf_counter()

            with torch.no_grad():
                outputs = model.generate(
                    **batch,
                    max_new_tokens=max_new_tokens,
                    do_sample=do_sample,
                    top_p=top_p,
                    temperature=temperature,
                    min_length=min_length,
                    use_cache=use_cache,
                    top_k=top_k,
                    repetition_penalty=repetition_penalty,
                    length_penalty=length_penalty,
                    **kwargs,
                )

            e2e_inference_time = (time.perf_counter() - start) * 1000
            print(f"Inference time: {e2e_inference_time} ms")

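            # Decode only the newly generated tokens. Slicing the decoded string by
            # len(user_prompt) can misalign when decoding does not reproduce the prompt exactly.
            prompt_length = batch["input_ids"].shape[1]
            output_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
            # print("Model Output: {}".format(output_text))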

            # Write the instruction, input text, and output to the CSV file
            writer.writerow([instruction, input_text,datasource, output_text, gold_label])

# Example of how to use the function

In [None]:
base_model_name = "daryl149/llama-2-7b-hf"
adapters_name = 'iocuydi/llama-2-amharic-3784m'

In [9]:
model = load_model(base_model_name, quantization=True)

In [10]:
tokenizer = LlamaTokenizer.from_pretrained(adapters_name)
embedding_size = model.get_input_embeddings().weight.shape[0]

if len(tokenizer) != embedding_size:
    print("resize the embedding size by the size of the tokenizer")
    model.resize_token_embeddings(len(tokenizer))

resize the embedding size by the size of the tokenizer


In [11]:
# Load adapter model
model.load_adapter(adapters_name)

In [33]:
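# Load Hugging Face dataset (replace both placeholders with your dataset ID and access token).
# Recent versions of `datasets` take `token`; `use_auth_token` is the older, deprecated name.
hf_dataset = datasets.load_dataset("HuggingFace link here", token="HUGGING_FACE_TOKEN")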
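
The `datasource` argument of `main` is described as a list of data sources to use, but the loop evaluates every row of the test split. If only some sources should be evaluated, the rows could be filtered first. The following is a minimal sketch under stated assumptions: `filter_by_datasource` is a hypothetical helper, and the rows are made-up dictionaries that only reuse the column names read in `main`.

```python
from typing import Dict, Iterable, List


def filter_by_datasource(rows: Iterable[Dict[str, str]], allowed: List[str]) -> List[Dict[str, str]]:
    """Keep only the rows whose 'datasource' value is in `allowed` (hypothetical helper)."""
    allowed_set = set(allowed)
    return [row for row in rows if row["datasource"] in allowed_set]


# Made-up rows for illustration; not real dataset content.
rows = [
    {"instruction": "q1", "input": "", "output": "a1", "datasource": "source_a"},
    {"instruction": "q2", "input": "", "output": "a2", "datasource": "source_b"},
    {"instruction": "q3", "input": "", "output": "a3", "datasource": "source_a"},
]

subset = filter_by_datasource(rows, ["source_a"])
print([row["instruction"] for row in subset])
```

On a real `datasets.Dataset`, the same predicate could probably be passed to `Dataset.filter` instead of building a list by hand.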



In [None]:
main(model, tokenizer, ['DATASOURCE-HERE'], csv_file_path='responses.csv')
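
Once `responses.csv` has been written, the `response` and `gold_label` columns can be compared offline. The notebook does not say how responses are scored, so the sketch below assumes a simple whitespace-insensitive exact match; `exact_match_rate` is a hypothetical helper, and the demo rows are made up and written to a temporary file rather than taken from a real run.

```python
import csv
import os
import tempfile


def exact_match_rate(csv_path: str) -> float:
    """Fraction of rows whose stripped response equals the stripped gold label."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    hits = sum(row["response"].strip() == row["gold_label"].strip() for row in rows)
    return hits / len(rows)


# Demo on a tiny file that uses the same column headers main() writes.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "responses_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Instruction", "Input Text", "Datasource", "response", "gold_label"])
        writer.writerow(["q1", "", "source_a", " yes ", "yes"])  # matches after stripping
        writer.writerow(["q2", "", "source_a", "no", "yes"])     # does not match
    rate = exact_match_rate(demo_path)

print(f"Exact match: {rate:.2f}")
```

Exact match is a strict criterion for free-form generation, so a looser comparison may be more appropriate depending on the task.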

For this experiment, the weights for the base LLaMA 2 model are fetched from [here](https://huggingface.co/daryl149/llama-2-7b-chat-hf). Accessing the official LLaMA 2 weights on Hugging Face requires approval from Meta.