# Dataset
The data comes from the [Benchmark Arabic text diacritization dataset](https://github.com/AliOsm/arabic-text-diacritization/tree/master), which provides three files:
- train.txt: 50,000 lines of diacritized Arabic text, used as the training set
- val.txt: 2,500 lines of diacritized Arabic text, used as the validation set
- test.txt: 2,500 lines of diacritized Arabic text, used as the test set


In [None]:
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/train.txt &> /dev/null
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/test.txt &> /dev/null
!wget https://raw.githubusercontent.com/AliOsm/arabic-text-diacritization/refs/heads/master/dataset/val.txt &> /dev/null

In [113]:
def read_file_content(file_path):
    # Use a context manager so the file handle is closed after reading
    with open(file_path, encoding="utf8") as f:
        return f.read()

# Read and split data based on lines
train_data = read_file_content("/content/train.txt").splitlines()
val_data = read_file_content("/content/val.txt").splitlines()
test_data = read_file_content("/content/test.txt").splitlines()
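
A quick sanity check on the split sizes can catch a truncated or failed download early. The sketch below compares line counts against the sizes stated in the dataset description; the helper name `find_size_mismatches` and the stand-in data are illustrative assumptions, not part of the original notebook.

```python
# Expected sizes as stated in the dataset description.
EXPECTED_SIZES = {"train": 50_000, "val": 2_500, "test": 2_500}

def find_size_mismatches(splits, expected=EXPECTED_SIZES):
    """Return {split_name: (actual, expected)} for every split whose size differs."""
    return {
        name: (len(lines), expected[name])
        for name, lines in splits.items()
        if len(lines) != expected[name]
    }

# Stand-in data so the sketch runs on its own; in the notebook, pass
# {"train": train_data, "val": val_data, "test": test_data} instead.
demo_splits = {"train": ["x"] * 50_000, "val": ["x"] * 2_500, "test": ["x"] * 2_499}
print(find_size_mismatches(demo_splits))  # {'test': (2499, 2500)}
```

An empty dictionary means every split has the expected number of lines.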

# Preprocessing dataset
The Gemini API will act as the word sense disambiguation module: it provides the definition of each word based on its context.
The prompt that will be used:


> Assume the role of an Arabic language expert, you know the definition of words in a given context.
I'm going to provide you with a list of sentences, and for each sentence
provide me a list of words and their word sense in arabic language and part of speech.
Return the response as json
List of sentences: [List of sentences]

The resulting dataset will have this format:
```
[{
    "sentence": "some text in arabic",
    "words": [
      {
        "word": "word_1",
        "word_sense": "definition_1",
        "pos": "part of speech"
      }
    ]
}]
```
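
Since the records are produced by a language model, it may be worth checking that each one matches this structure before using it. The following is a minimal sketch under stated assumptions: the helper `find_malformed_records` and the sample records are hypothetical, and the key set follows the format shown above (the `Words` TypedDict in this notebook names the sense field `sense`, so the set may need adjusting for the generated files).

```python
import json

# Keys each word entry is expected to carry, per the format shown above.
# Assumption: adjust this set if the generated files use different key names.
WORD_KEYS = {"word", "word_sense", "pos"}

def find_malformed_records(records, word_keys=WORD_KEYS):
    """Return the indices of records that do not match the expected structure."""
    bad = []
    for i, record in enumerate(records):
        words = record.get("words") if isinstance(record, dict) else None
        ok = (
            isinstance(record, dict)
            and isinstance(record.get("sentence"), str)
            and isinstance(words, list)
            and all(isinstance(w, dict) and word_keys.issubset(w.keys()) for w in words)
        )
        if not ok:
            bad.append(i)
    return bad

# Hypothetical sample: the second record's word entry is missing "word_sense".
sample = json.loads("""
[
  {"sentence": "some text in arabic",
   "words": [{"word": "word_1", "word_sense": "definition_1", "pos": "part of speech"}]},
  {"sentence": "another sentence",
   "words": [{"word": "word_2", "pos": "part of speech"}]}
]
""")
print(find_malformed_records(sample))  # [1]
```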

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [114]:
%cd /content/drive/MyDrive/ATD-WSD

/content/drive/MyDrive/ATD-WSD


In [115]:
!pip install -q -U google-generativeai

In [116]:
# Import the Python SDK
import google.generativeai as genai

from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)


In [117]:
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

In [118]:
import typing_extensions as typing

class Words(typing.TypedDict):
    word: str
    sense: str
    pos: str

class WSD(typing.TypedDict):
    sentence: str
    words: list[Words]

In [None]:
import json
import math
import time

def extract_word_sense(prompt, name, is_start=True, is_end=False):
  result = model.generate_content(
      prompt,
      generation_config=genai.GenerationConfig(
          response_mime_type="application/json", response_schema=list[WSD]
      ),
      safety_settings=[
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}
      ]
    )

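  # Append this batch to the output file; together the batches form one JSON array
  with open(name, mode="a", encoding="utf8") as f:
    if is_start: f.write("[")

    # Drop the outer "[" and "]" of the response, ignoring surrounding whitespace
    f.write(result.text.strip()[1:-1])

    if is_end: f.write("]")
    else: f.write(",")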


def prepare_dataset(data, data_start, data_end, name, batch_size=200):
    file_name = f"{name}_{data_end}.json"

    batch_data = data[data_start:data_end]

    start = 0
    end = batch_size

    number_of_calls = math.ceil(len(batch_data)/batch_size)
    for i in range(number_of_calls):
      if i%15 == 0:
        time.sleep(1)

      prompt = f"""
        You are an Arabic language expert, you understand the definition of words in a given context.
        I will provide you with a list of sentences. For each sentence, please provide
        a list of words, their part of speech and their word sense. Ignore numbers and punctuations
        Return the response as json.
        List of sentences:
        {batch_data[start:end]}
      """
      #extract_word_sense(prompt, name=file_name, is_start=(i==0), is_end=(i >= number_of_calls-1))
      start = end
      end = end + batch_size
      if i%100 == 0:
        print(f"Current Batch: {data_end} | iteration: {i}/{number_of_calls}")
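
Stitching the batches together by trimming brackets from the raw response text is sensitive to trailing whitespace and to malformed output. One alternative, shown here as a sketch and not as the notebook's method, is to parse each response with `json.loads` and concatenate the resulting lists. The helper `merge_batch_responses` and the fake response strings are assumptions standing in for `result.text`.

```python
import json

def merge_batch_responses(response_texts):
    """Parse each batch's JSON array and concatenate the records into one list."""
    merged = []
    for text in response_texts:
        records = json.loads(text)  # raises json.JSONDecodeError on malformed output
        if not isinstance(records, list):
            raise ValueError("expected a JSON array per batch")
        merged.extend(records)
    return merged

# Hypothetical response texts standing in for `result.text` from two batches.
fake_responses = [
    '[{"sentence": "s1", "words": []}]\n',
    '[{"sentence": "s2", "words": []}, {"sentence": "s3", "words": []}] ',
]
merged = merge_batch_responses(fake_responses)
print(len(merged))  # 3

# ensure_ascii=False keeps Arabic characters readable when the result is saved.
serialized = json.dumps(merged, ensure_ascii=False)
```

The trade-off is that all records are held in memory until they are written, whereas appending to a file after each call preserves partial progress if a later call fails.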



In [None]:
prepare_dataset(train_data, data_start=0, data_end=1000, name="train_wsd", batch_size=2)
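
To see what this call implies, the batching arithmetic can be worked out separately. The sketch below is illustrative only: `plan_batches` is a hypothetical helper, and it assumes `data_end` does not exceed the length of the data.

```python
import math

def plan_batches(data_start, data_end, batch_size):
    """Return the (start, end) index pairs covered by each API call."""
    return [
        (start, min(start + batch_size, data_end))
        for start in range(data_start, data_end, batch_size)
    ]

name, data_start, data_end, batch_size = "train_wsd", 0, 1000, 2
plan = plan_batches(data_start, data_end, batch_size)

print(len(plan))                    # 500, equal to math.ceil(1000 / 2)
print(plan[0], plan[-1])            # (0, 2) (998, 1000)
print(f"{name}_{data_end}.json")    # train_wsd_1000.json
```

With `batch_size=2`, the first 1,000 training lines therefore take 500 requests, and the results are written to `train_wsd_1000.json`.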