In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

To run inference with the saved LoRA adapters, first define the prompt template:

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Your task is to understand the provided text in English and generate a summary and a title for it in Arabic. The result should be formatted in JSON. Follow these steps:

1- Comprehension:
Understand the provided English text.
2- Summary Generation:
Create a brief and concise summary of the text in Arabic.
3- Title Generation:
Generate a relevant title in Arabic that accurately reflects the content and theme of the text.
4-Output Formatting:
Format the summary and title into a JSON object with the keys 'summary' and 'title'.

### Input:
{}

### Response:
{}"""
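
The two `{}` placeholders in the template are filled positionally by `str.format`: the first receives the English text, and the second is left empty so that the model continues from the `### Response:` header. A minimal sketch of that mechanism, using a shortened stand-in template (`demo_prompt` is a hypothetical name, not the notebook's full prompt):

```python
# Shortened stand-in for alpaca_prompt; it keeps the same two positional placeholders.
demo_prompt = """### Input:
{}

### Response:
{}"""

# First slot: the text to summarize. Second slot: an empty string, so the
# formatted prompt ends right after the "### Response:" header.
filled = demo_prompt.format("Some English text.", "")
print(filled)
```

Because the second slot is empty, everything the model generates after the header can later be treated as the response.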

In [None]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "AhmedBou/Llama-3-EngText-ArabicSummary", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
input = """
The fallout from Thursday’s attacks was announced on Al Masirah television, a Houthi-controlled channel, which broadcast a video that appeared to depict wounded civilians being treated in Hodeidah. At least 42 people were reportedly injured.

“The American-British aggression will not prevent us from continuing our military operations in support of Palestine,” Houthi official Mohammed al-Bukhaiti said on X, warning that the rebels would “meet escalation with escalation”.

The US Central Command (CENTCOM) said on X that attacks against 13 Houthi targets had “successfully destroyed” eight uncrewed aerial vehicles, or drones, in Houthi-controlled areas of Yemen and over the Red Sea.
"""

In [None]:
# Takes the generated text and extracts the part after "### Response:\n" and before '<|end_of_text|>'.
import re

def parse_response(text):
  json_string = text.split("### Response:\n")[1].split("<|end_of_text|>")[0]
  print(json_string)
  # Regular expression pattern to extract summary and title
  pattern = re.compile(r"'summary':\s*'(.*?)',\s*'title':\s*'(.*?)'")

  # Search for the pattern in the string
  match = pattern.search(json_string)

  # Fail loudly if the model output does not have the expected shape
  if match is None:
      raise ValueError("Could not find 'summary' and 'title' in the model output")

  # Build the dictionary from the two captured groups
  return {
      'summary': match.group(1),
      'title': match.group(2),
  }
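
The sample output in this notebook is a Python-style dict literal with single quotes, which `json.loads` would reject. As an alternative to the regular expression, the string could be parsed with the standard library's `ast.literal_eval`. The sketch below assumes the model emits a well-formed literal; `parse_dict_literal` is a hypothetical helper and not part of the original notebook:

```python
import ast

def parse_dict_literal(json_string):
    """Parse a "{'summary': ..., 'title': ...}" string; return None on failure."""
    try:
        # literal_eval only evaluates literals (strings, dicts, lists, ...),
        # so it does not execute arbitrary code from the model output.
        parsed = ast.literal_eval(json_string.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    # Accept the result only if it is a dict with both expected keys.
    if isinstance(parsed, dict) and {"summary", "title"} <= parsed.keys():
        return {"summary": parsed["summary"], "title": parsed["title"]}
    return None

sample = "{'summary':'ملخص قصير', 'title':'عنوان'}"
result = parse_dict_literal(sample)
print(result)
```

Unlike the regex, this approach handles escaped quotes inside the values, but it returns `None` whenever the generated text is truncated or otherwise malformed, so the caller has to handle that case.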

In [None]:
def generate_summary_and_title(text):
  inputs = tokenizer(
  [
      alpaca_prompt.format(
          text, # input
          "", # output - leave this blank for generation!
      )
  ], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 800, use_cache = True)
  result = tokenizer.batch_decode(outputs)[0]
  parsed_dict = parse_response(result)
  print(f"Summary: {parsed_dict['summary']}")
  print(f"Title: {parsed_dict['title']}")

In [None]:
generate_summary_and_title(input)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'summary':'قال متحدث عن القوات المسلحة الحوثية إنها سترد بالهجوم على أهداف غربية في اليمن بعد أن تسببت القاذفات الأمريكية والبريطانية في إصابة 42 شخصًا في الهجوم على مطار هوديداه.', 'title':'القوات المسلحة الحوثية: لن نتراجع في هوديداه'}
Summary: قال متحدث عن القوات المسلحة الحوثية إنها سترد بالهجوم على أهداف غربية في اليمن بعد أن تسببت القاذفات الأمريكية والبريطانية في إصابة 42 شخصًا في الهجوم على مطار هوديداه.
Title: القوات المسلحة الحوثية: لن نتراجع في هوديداه


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Use it only if you do not have `unsloth` installed: it can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

# Inference with the Gemma 7B model

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "AhmedBou/Gemma-7b-EngText-ArabicSummary", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Gemma patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Unsloth 2024.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [None]:
def generate_summary_and_title(text):
  inputs = tokenizer(
  [
      alpaca_prompt.format(
          text, # input
          "", # output - leave this blank for generation!
      )
  ], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 800, use_cache = True)
  result = tokenizer.batch_decode(outputs)[0]
  return result

In [None]:
generate_summary_and_title(input)
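
The Gemma version of `generate_summary_and_title` returns the raw decoded string, which still contains the prompt and any special tokens. A minimal sketch of how the response part could be isolated for either model is shown below. It assumes the end-of-sequence marker appears in the decoded text as `<|end_of_text|>` (Llama 3, as in the parsing code earlier) or `<eos>` (Gemma); `extract_response` is a hypothetical helper:

```python
def extract_response(decoded, end_markers=("<|end_of_text|>", "<eos>")):
    """Return the text after the last '### Response:' header, cut at the earliest end marker."""
    # rpartition keeps everything after the final header; if the header is
    # missing, the whole input string is kept.
    _, _, response = decoded.rpartition("### Response:")
    # Splitting on each marker in turn leaves the prefix before whichever
    # marker occurs first.
    for marker in end_markers:
        response = response.split(marker)[0]
    return response.strip()

raw = "<bos>Below is an instruction...\n\n### Response:\n{'summary':'s', 'title':'t'}<eos>"
print(extract_response(raw))
```

The returned string can then be passed to whatever parsing step is used to pull out the summary and title.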

