# Dataset
The dataset comes from the [Benchmark Arabic text diacritization dataset](https://github.com/AliOsm/arabic-text-diacritization/tree/master) repository:
- train.txt: 50,000 lines of diacritized Arabic text, used as the training set
- val.txt: 2,500 lines of diacritized Arabic text, used as the validation set
- test.txt: 2,500 lines of diacritized Arabic text, used as the test set


In [1]:
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/train.txt &> /dev/null
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/test.txt &> /dev/null
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/val.txt &> /dev/null

In [2]:
def read_file_content(file_path):
    # Use a context manager so the file handle is closed after reading
    with open(file_path, encoding="utf8") as f:
        return f.read()

# Read and split data based on lines
train_data = read_file_content("/content/train.txt").splitlines()
val_data = read_file_content("/content/val.txt").splitlines()
test_data = read_file_content("/content/test.txt").splitlines()

# Preprocessing dataset
The Gemini API will act as the word sense disambiguation module. It will provide the definition of each word based on its context.
The prompt that will be used:


> Assume the role of an Arabic language expert, you know the definition of words in a given context.
I'm going to provide you with a list of sentences, and for each sentence
provide me a list of words and their word sense in arabic language and part of speech.
Return the response as json
List of sentences: [List of sentences]

The dataset will be in this format:
```
[{
    "sentence": "some text in arabic",
    "words": [
      {
        "word": "word_1",
        "word_sense": "definition_1",
        "pos": "part of speech"
      }
    ]
}]
```
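
As a quick check of this format, each record can be validated before it is used downstream. The sketch below is only an illustration: the helper name `is_valid_record` and the two key sets are assumptions taken from the example above, not part of the original notebook.

```python
import json

# Keys taken from the format shown above; treat them as assumptions.
SENTENCE_KEYS = {"sentence", "words"}
WORD_KEYS = {"word", "word_sense", "pos"}

def is_valid_record(record):
    """Return True if `record` has the sentence/words layout shown above."""
    if not isinstance(record, dict) or not SENTENCE_KEYS.issubset(record):
        return False
    if not isinstance(record["words"], list):
        return False
    # Every word entry must be a dict carrying all expected keys
    return all(isinstance(w, dict) and WORD_KEYS.issubset(w) for w in record["words"])

sample = json.loads("""
[{
    "sentence": "some text in arabic",
    "words": [
      {"word": "word_1", "word_sense": "definition_1", "pos": "part of speech"}
    ]
}]
""")
valid_flags = [is_valid_record(r) for r in sample]
print(valid_flags)
```

The response schema defined later in the notebook names the definition field `sense` rather than `word_sense`, so `WORD_KEYS` would need adjusting to whichever name the stored files actually use.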

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%cd /content/drive/MyDrive/ATD-WSD

# Create directories in Drive to store the data (-p avoids an error if they already exist)
!mkdir -p train_wsd
!mkdir -p val_wsd
!mkdir -p test_wsd

/content/drive/MyDrive/ATD-WSD


# Gemini Limits
- 15 RPM (requests per minute)
- 1,500 RPD (requests per day)
- 8,192 maximum output tokens
- Safety settings can terminate a call
- 1 token is roughly 4 characters


The biggest issue is the model's maximum output length. I will therefore split the dataset by estimated token count, and sentences with a large token count will be processed in separate API calls.
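
To make these limits concrete, the sketch below estimates a token count with the 4-characters-per-token rule of thumb and derives the request spacing implied by 15 RPM. The helper names `estimate_tokens` and `min_seconds_between_calls` are hypothetical, and the character-based estimate is only a heuristic; the real tokenizer may count Arabic text differently.

```python
import math

CHARS_PER_TOKEN = 4        # heuristic from the list above, not an exact tokenizer
REQUESTS_PER_MINUTE = 15
REQUESTS_PER_DAY = 1500

def estimate_tokens(text):
    """Rough token estimate: one token per 4 characters, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def min_seconds_between_calls(rpm=REQUESTS_PER_MINUTE):
    """Smallest spacing between requests that respects the per-minute limit."""
    return 60 / rpm

print(estimate_tokens("a" * 399))                            # 100
print(min_seconds_between_calls())                           # 4.0
print(REQUESTS_PER_DAY * min_seconds_between_calls() / 60)   # 100.0 minutes to use the daily quota
```

Sleeping for whatever remains of that 4-second interval after each request is one simple way to stay under the per-minute limit.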

In [5]:
import math
def split_data_by_char_length(data, token_threshold=100):
  '''
    Args:
      data: list[str] -> list of strings
      token_threshold: int -> max number of tokens. 1 token = 4 characters
    Returns:
      list[str], list[str] -> acc_threshold_data, rej_threshold_data
  '''
  acc_threshold_data = []
  rej_threshold_data = []
  for sentence in data:
    if math.ceil(len(sentence)/4) < token_threshold:
      acc_threshold_data.append(sentence)
    else:
      rej_threshold_data.append(sentence)

  return acc_threshold_data, rej_threshold_data

In [6]:
token_threshold = 100

# Train split
train_small_token, train_large_token = split_data_by_char_length(train_data, token_threshold=token_threshold)
print(f"train_small_token: {len(train_small_token)} with max token length: {token_threshold}")
print(f"train_large_token: {len(train_large_token)} exceeds token length: {token_threshold}")

# val split
val_small_token, val_large_token = split_data_by_char_length(val_data, token_threshold=token_threshold)
print(f"val_small_token: {len(val_small_token)} with max token length: {token_threshold}")
print(f"val_large_token: {len(val_large_token)} exceeds token length: {token_threshold}")


# test split
test_small_token, test_large_token = split_data_by_char_length(test_data, token_threshold=token_threshold)
print(f"test_small_token: {len(test_small_token)} with max token length: {token_threshold}")
print(f"test_large_token: {len(test_large_token)} exceeds token length: {token_threshold}")

train_small_token: 34299 with max token length: 100
train_large_token: 15701 exceeds token length: 100
val_small_token: 1768 with max token length: 100
val_large_token: 732 exceeds token length: 100
test_small_token: 1700 with max token length: 100
test_large_token: 800 exceeds token length: 100


Install the Google Generative AI SDK. I will use Gemini 1.5 Flash.

In [7]:
# Install the Google Generative AI SDK
!pip install -q -U google-generativeai

In [8]:
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

model = genai.GenerativeModel(model_name="gemini-1.5-flash")

### Preparing the accepted JSON response from Gemini

In [9]:
import typing_extensions as typing

class Words(typing.TypedDict):
    word: str
    sense: str
    pos: str

class WSD(typing.TypedDict):
    sentence: str
    words: list[Words]

### Extract word sense

In [10]:
import json
import math
import time

prev_failed_char_token = 9999

def save_error_log(name, start_idx, end_idx):
  with open(f"{name}_error_log", mode="a") as f:
    f.write(f"{start_idx},{end_idx}\n")

def save_failed_text(name, start_idx, end_idx, text):
  with open(f"{name}_failed_text_log", mode="a") as f:
    f.write(f"{start_idx},{end_idx},{text}\n")

def extract_word_sense(prompt, name, batch_itr, start_idx, end_idx, total_char_tokens):
  global prev_failed_char_token

  # RPD is limited: skip prompts at least as large as one that previously produced a bad response
  if total_char_tokens >= prev_failed_char_token:
    print(f"A prompt of this size previously failed; batch stored in error_log to avoid a failed request: batch_number={batch_itr}, start_idx={start_idx}, end_idx={end_idx}")
    save_error_log(name, start_idx, end_idx)
    return # skip this batch

  result = model.generate_content(
      prompt,
      # Force Gemini to respond in JSON format
      generation_config=genai.GenerationConfig(
          response_mime_type="application/json", response_schema=list[WSD]
      ),
      # Gemini terminates the call on safety issues, so disable blocking (BLOCK_NONE).
      safety_settings=[
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}
      ]
    )

  # Check response
  if len(result.candidates) <= 0:
    print(f"No response returned: batch_number={batch_itr}, start_idx={start_idx}, end_idx={end_idx}")
    print(result)
    save_error_log(name, start_idx, end_idx)
    raise SystemExit('Stop code execution')
  else:
    # If request returned but not 'Natural stop'
    if result.candidates[0].finish_reason.name != "STOP":
      print(f"Bad response: batch_number={batch_itr}, start_idx={start_idx}, end_idx={end_idx}")
      print(f"Reason: {result.candidates[0].finish_reason} | Continue with next batch")
      save_error_log(name, start_idx, end_idx)
      return # skip this batch


  # Store response in file
  try:
    result_dict = json.loads(result.text)
    with open(f"{name}/{name}_{start_idx}_{end_idx}.json", mode="w", encoding="utf-8") as f:
      json.dump(result_dict, f, ensure_ascii=False, indent=2)

  # In case of an error, store the batch that failed
  except Exception as e:
    print(e)
    save_failed_text(name, start_idx, end_idx, result.text)
    print(f"Failed to parse: batch_number={batch_itr}, start_idx={start_idx}, end_idx={end_idx}, total_tokens={total_char_tokens}")
    # Store the input token size that causes bad request
    if total_char_tokens > 600:
      if total_char_tokens < prev_failed_char_token:
        prev_failed_char_token = total_char_tokens

    save_error_log(name, start_idx, end_idx)


def prepare_dataset(data, data_start, data_end, name, batch_size):
    '''
      Args:
        data: list[str] -> list of strings
        data_start: int -> index for the start of the data
        data_end: int -> index for the end of the data
        name: str -> name of the file to store results
        batch_size: int -> size of the batch to process
    '''
    batch_data = data[data_start:data_end]

    start = data_start
    end = start + batch_size

    number_of_calls = math.ceil(len(batch_data)/batch_size)

    # Stats related variables
    avg_res_time = 0
    total_time = 0
    for i in range(number_of_calls):
      if end > data_end:
        end = data_end

      prompt = f"""
        You are an Arabic language expert, you understand the definition of words in a given context.
        I will provide you with a list of sentences. For each sentence, please provide
        a list of words, their part of speech and their word sense.
        Return the response as JSON and keep the word sense in English.
        List of sentences:
        {data[start:end]}
      """
      start_time = time.time()
      extract_word_sense(
          prompt,
          name=name,
          batch_itr=i,
          start_idx=start,
          end_idx=end,
          total_char_tokens= math.ceil(len(prompt)/4)
          )
      end_time = time.time()
      total_time += (end_time - start_time)

      start = end
      end = end + batch_size

      # Print the stats
      if (i+1) % 10 == 0:
        avg_res_time = total_time/(i+1)
        print(f"Current Batch:{end} | iteration: {i+1}/{number_of_calls} | avg response time: {avg_res_time:.2f}")



In [11]:
# Sample result
prepare_dataset(train_large_token, data_start=0, data_end=1, name="train_wsd", batch_size=4)
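
The index arithmetic inside `prepare_dataset` can be illustrated in isolation. The sketch below is a simplified stand-in, not the notebook's function: `batch_slices` is a hypothetical helper that yields the same `(start, end)` pairs the loop walks through, assuming `data_end` does not exceed the length of the data.

```python
import math

def batch_slices(data_start, data_end, batch_size):
    """Yield (start, end) index pairs covering [data_start, data_end) in steps of batch_size."""
    number_of_calls = math.ceil((data_end - data_start) / batch_size)
    start = data_start
    for _ in range(number_of_calls):
        # The last batch is clipped so it never runs past data_end
        end = min(start + batch_size, data_end)
        yield start, end
        start = end

print(list(batch_slices(0, 1, 4)))    # the sample call above: [(0, 1)]
print(list(batch_slices(0, 10, 4)))   # [(0, 4), (4, 8), (8, 10)]
```

Each pair corresponds to one API request and, on success, to one `{name}_{start}_{end}.json` file.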

### Try with new batch size on the error log

In [13]:
import os
from os import listdir
from os.path import isfile, join


def get_failed_slices(file_name):
  failed_lines = []
  with open(file_name) as f:
    failed_lines = f.read().splitlines()

  # clear file
  with open(file_name, 'w'): pass

  return failed_lines

In [None]:
failed_lines = get_failed_slices("train_wsd_error_log")

for line in failed_lines:
  start_idx, end_idx = line.split(",")
  prepare_dataset(train_large_token, data_start=int(start_idx), data_end=int(end_idx), name="train_wsd", batch_size=2)
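
The retry loop assumes every log line has the form `start,end`. A slightly more defensive parse could look like the sketch below; `parse_error_log` is a hypothetical helper, and skipping blank lines is an assumption about what the log may contain.

```python
def parse_error_log(lines):
    """Turn 'start,end' log lines into (start, end) integer pairs, skipping blank lines."""
    slices = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start_idx, end_idx = line.split(",")
        slices.append((int(start_idx), int(end_idx)))
    return slices

print(parse_error_log(["0,4", "", "12,16"]))   # [(0, 4), (12, 16)]
```

Each returned pair can then be passed to `prepare_dataset` as `data_start` and `data_end` with a smaller `batch_size`.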

## Script for joining JSON files

In [14]:
def get_json_files(folder_name):
  cwd = os.getcwd()
  my_folder = f"{cwd}/{folder_name}"

  json_files = [my_folder + "/" + f for f in listdir(my_folder) if isfile(join(my_folder, f))]

  print(f"Number of files in {folder_name}: {len(json_files)}")
  return json_files

def join_jsons(files, output_file_name):
  # Keep the first record seen for each sentence so duplicates are dropped
  result = {}
  for f1 in files:
    with open(f1, 'r', encoding="utf-8") as infile:
      print(f1)
      loaded_file = json.load(infile)
      for d in loaded_file:
        if d["sentence"] not in result:
          result[d["sentence"]] = d

  # ensure_ascii=False writes Arabic text as-is, so open the output as UTF-8
  with open(f'{output_file_name}.json', 'w', encoding="utf-8") as output_file:
      dump_result = list(result.values())
      json.dump(dump_result, output_file, ensure_ascii=False, indent=2)

  cwd = os.getcwd()
  print(f"Joined in {cwd}/{output_file_name}.json")
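
The merge rule used here (the first record seen for a sentence wins) can be seen on toy data. This is a minimal sketch with in-memory lists standing in for the per-batch JSON files; `merge_unique` is a hypothetical name.

```python
def merge_unique(batches):
    """Merge lists of records, keeping the first record seen for each sentence."""
    merged = {}
    for batch in batches:
        for record in batch:
            # setdefault leaves an existing entry untouched, so later duplicates are ignored
            merged.setdefault(record["sentence"], record)
    return list(merged.values())

# Toy batches standing in for the per-batch JSON files
batch_a = [{"sentence": "s1", "words": []}, {"sentence": "s2", "words": []}]
batch_b = [{"sentence": "s2", "words": [{"word": "w"}]}, {"sentence": "s3", "words": []}]
merged = merge_unique([batch_a, batch_b])
print([r["sentence"] for r in merged])   # ['s1', 's2', 's3']
```

Because dictionaries preserve insertion order, the merged output follows the order in which the files are read, so sorting the file list first gives a reproducible result.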


# Join JSON files

### Train data

In [None]:
train_folder = "train_wsd"
train_wsd_jsons = get_json_files(train_folder)

# Join json files
join_jsons(train_wsd_jsons, "220_train_wsd")

# Script for checking the data

In [16]:
import os
from os import listdir
from os.path import isfile, join

files = get_json_files("train_wsd")

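def get_start_end_idx(file_name):
  # File names look like <name>_<start>_<end>.json. Parse the last two fields of
  # the base name so that underscores or dots in the directory path do not matter.
  base_name = os.path.splitext(os.path.basename(file_name))[0]
  _, start_idx, end_idx = base_name.rsplit("_", 2)
  return (int(start_idx), int(end_idx))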


def get_start_idx(file_name):
  return get_start_end_idx(file_name)[0]


files = sorted(files, key=get_start_idx)
for i in range(len(files)):
  if i + 1 < len(files):
    f = files[i]
    f_next = files[i + 1]
    f_start, f_end = get_start_end_idx(f)
    f_next_start, f_next_end = get_start_end_idx(f_next)
    if f_end != f_next_start:
      # A mismatch means either a gap or an overlap between consecutive batches
      print(f"gap or overlap between these two files: {os.path.basename(f)} - {os.path.basename(f_next)}")


Number of files in train_wsd: 44
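
The check above flags any pair of consecutive files whose indices do not line up. A sketch that also tells a gap (missing sentences) from an overlap (sentences processed twice) is shown below. It works on plain `(start, end)` tuples, and `find_breaks` is a hypothetical helper, not part of the notebook.

```python
def find_breaks(ranges):
    """Return (kind, previous, next) tuples where consecutive ranges do not line up."""
    breaks = []
    ordered = sorted(ranges)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt[0] > prev[1]:
            breaks.append(("gap", prev, nxt))        # indices prev[1]..nxt[0] are missing
        elif nxt[0] < prev[1]:
            breaks.append(("overlap", prev, nxt))    # indices nxt[0]..prev[1] appear twice
    return breaks

print(find_breaks([(0, 4), (4, 8), (10, 12), (11, 14)]))
# [('gap', (4, 8), (10, 12)), ('overlap', (10, 12), (11, 14))]
```

An empty result means the saved batches cover a contiguous index range with no repeats.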


In [17]:
import json

files = get_json_files("train_wsd")
number_of_sentences = 0
files = sorted(files, key=get_start_idx)

unique_sentences = set()
duplicated_sentences = {}
for f_name in files:
  with open(f_name, mode="r", encoding="utf-8") as f:
    sentences = json.load(f)
    for s in sentences:
      if s["sentence"] in unique_sentences:
        if f_name not in duplicated_sentences:
          duplicated_sentences[f_name] = [s["sentence"]]
        else:
          duplicated_sentences[f_name].append(s["sentence"])
      else:
        unique_sentences.add(s["sentence"])

    number_of_sentences += len(sentences)

print(number_of_sentences)

for f_name, sentence in duplicated_sentences.items():
  print(f_name)
  print(sentence)

Number of files in train_wsd: 44
220
