<a href="https://colab.research.google.com/github/Tabook22/AI/blob/main/WSD_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install nltk scikit-learn



Replace 'train.json', 'test_wsd.json', and 'WSD_dict.json' in the loading cell with the actual paths to your JSON files containing the training data, the testing data, and the WSD dictionary.
This code does the following:

1. Loads the training data, testing data, and WSD dictionary from JSON files.
2. Preprocesses the data by reducing each record to a (context sentence, lemma ID) pair.
3. Builds a dictionary mapping each lemma ID to its possible glosses from the WSD dictionary. The later cells do not use this mapping; the classifier is trained on the contexts alone.
4. Defines extract_features(), which tokenizes a context sentence, lowercases the tokens, removes Arabic stopwords, and joins the remaining tokens back into a string.
5. Prepares the training data by applying extract_features() to each training context and collecting the corresponding lemma IDs as labels.
6. Fits a TF-IDF vectorizer on the training contexts and converts them into a matrix of TF-IDF features.
7. Trains a linear SVM (LinearSVC) on the training features and lemma IDs.
8. Evaluates the model on the testing data: each test context is run through extract_features() and the fitted vectorizer, the model predicts a lemma ID, and the prediction is compared with the true lemma ID.
9. Computes and prints the accuracy, which is the fraction of test contexts whose predicted lemma ID matches the true one.

Note that this is a basic implementation. It could be extended with richer features (for example, the glosses from the WSD dictionary), other classifiers, or hyperparameter tuning.
For the code to run, each training and testing record must contain the keys 'context' and 'lemma_id', and each WSD dictionary entry must contain 'lemma_id' and 'gloss'.
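
The loading and preprocessing cells read only a few keys from each record, so the expected file layout can be sketched as follows. The key names come from the code in this notebook; the sample values and the helper name `check_records` are hypothetical, and real files may carry additional fields.

```python
import json

# Hypothetical sample records; only the key names are taken from the notebook code.
sample_training = [
    {"context": "He sat on the bank of the river", "lemma_id": "bank_1"},
    {"context": "She opened an account at the bank", "lemma_id": "bank_2"},
]
sample_dictionary = [
    {"lemma_id": "bank_1", "gloss": "sloping land beside a body of water"},
    {"lemma_id": "bank_2", "gloss": "a financial institution"},
]

def check_records(records, required_keys):
    """Return the indices of records that lack any of the required keys."""
    return [i for i, rec in enumerate(records)
            if not all(key in rec for key in required_keys)]

# Round-trip through JSON to mimic what json.load() returns from a file.
training = json.loads(json.dumps(sample_training))
dictionary = json.loads(json.dumps(sample_dictionary))

bad_training = check_records(training, ("context", "lemma_id"))
bad_dictionary = check_records(dictionary, ("lemma_id", "gloss"))
print(bad_training, bad_dictionary)
```

Running a check like this right after loading gives a clearer message than the KeyError that a missing field would otherwise raise during preprocessing.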

In [None]:
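import json

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Download the tokenizer model and the stopword lists used below.
# Assumption: recent NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')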

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Load the training data, testing data, and WSD dictionary from JSON files
with open('train.json', 'r', encoding='utf-8') as f:
    training_data = json.load(f)

with open('test_wsd.json', 'r', encoding='utf-8') as f:
    testing_data = json.load(f)

with open('WSD_dict.json', 'r', encoding='utf-8') as f:
    wsd_dictionary = json.load(f)


In [None]:
# Preprocess the data
def preprocess_data(data):
    preprocessed_data = []
    for item in data:
        context = item['context']
        lemma_id = item['lemma_id']
        preprocessed_data.append((context, lemma_id))
    return preprocessed_data

training_data = preprocess_data(training_data)
testing_data = preprocess_data(testing_data)
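
Once the data is reduced to (context, lemma ID) pairs, it can help to count how many examples each lemma ID has, since labels with very few examples give a classifier little to learn from. A minimal sketch on hypothetical pairs (the name `toy_pairs` and its values are made up for illustration):

```python
from collections import Counter

# Hypothetical (context, lemma_id) pairs in the shape preprocess_data() returns.
toy_pairs = [
    ("river bank water", "bank_1"),
    ("bank loan money", "bank_2"),
    ("fishing from the bank", "bank_1"),
]

# Count the examples per label and list the labels that occur only once.
label_counts = Counter(lemma_id for _, lemma_id in toy_pairs)
singletons = [label for label, n in label_counts.items() if n == 1]

print(label_counts.most_common())
print(singletons)
```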

In [None]:
# Create a dictionary mapping lemma IDs to their possible glosses
lemma_to_glosses = {}
for item in wsd_dictionary:
    lemma_id = item['lemma_id']
    gloss = item['gloss']
    if lemma_id not in lemma_to_glosses:
        lemma_to_glosses[lemma_id] = []
    lemma_to_glosses[lemma_id].append(gloss)
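
The same grouping can be written more compactly with `collections.defaultdict`, which creates the empty list on first access. This is an equivalent sketch on hypothetical entries (`toy_dictionary` and `glosses_by_lemma` are illustrative names), not a change to the notebook's behaviour.

```python
from collections import defaultdict

# Hypothetical dictionary entries using the same keys as the notebook.
toy_dictionary = [
    {"lemma_id": "bank_1", "gloss": "sloping land beside a body of water"},
    {"lemma_id": "bank_2", "gloss": "a financial institution"},
    {"lemma_id": "bank_2", "gloss": "a building where banking is done"},
]

# defaultdict(list) supplies an empty list for any lemma ID not seen before.
glosses_by_lemma = defaultdict(list)
for entry in toy_dictionary:
    glosses_by_lemma[entry["lemma_id"]].append(entry["gloss"])

for lemma_id, glosses in glosses_by_lemma.items():
    print(lemma_id, len(glosses))
```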

In [None]:
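# Extract features from the context sentences
# Build the stopword set once; calling stopwords.words() for every token is slow
arabic_stopwords = set(stopwords.words('arabic'))

def extract_features(context):
    tokens = nltk.word_tokenize(context)
    # Lowercasing has no effect on Arabic script but normalizes any Latin tokens
    lowercase_tokens = [token.lower() for token in tokens]
    filtered_tokens = [token for token in lowercase_tokens if token not in arabic_stopwords]
    return ' '.join(filtered_tokens)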
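
To see what this kind of feature extraction produces without downloading any NLTK data, here is a simplified stand-in. It rests on two assumptions that differ from the notebook: it splits on whitespace instead of calling nltk.word_tokenize, and `toy_stopwords` is a tiny hypothetical English set, not NLTK's Arabic list.

```python
def extract_features_sketch(context, stopword_set):
    """Whitespace-tokenize, lowercase, drop stopwords, and re-join."""
    tokens = context.split()
    lowercase_tokens = [token.lower() for token in tokens]
    kept = [token for token in lowercase_tokens if token not in stopword_set]
    return " ".join(kept)

# Hypothetical stopword set, for illustration only.
toy_stopwords = {"the", "of", "on", "a"}

result = extract_features_sketch("He sat on the Bank of the river", toy_stopwords)
print(result)  # he sat bank river
```

The output is still a plain string because the TF-IDF vectorizer in the next step expects raw text and does its own tokenization.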

In [None]:
# Prepare the training data
train_contexts = [extract_features(context) for context, _ in training_data]
train_lemma_ids = [lemma_id for _, lemma_id in training_data]

In [None]:
# Create the TF-IDF vectorizer and transform the training contexts
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_contexts)
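
For intuition about what the vectorizer computes, the weighting can be reproduced by hand. The sketch below follows the formula scikit-learn documents for the default TfidfVectorizer settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It is a teaching aid on a hypothetical two-document corpus and ignores the library's own tokenization rules.

```python
import math
from collections import Counter

# Hypothetical corpus, already reduced to space-separated tokens.
docs = ["river bank water", "bank loan money"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

The term shared by both documents ("bank") receives a lower weight than the terms unique to one document, which is the effect the classifier relies on.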

In [None]:
# Train the WSD model using Linear SVM
model = LinearSVC()
model.fit(train_features, train_lemma_ids)
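
One alternative way to organize the vectorizer and the classifier is scikit-learn's `make_pipeline`, which keeps fitting and transforming in sync and predicts a whole list of contexts in one call. This is a sketch on hypothetical toy data, not the notebook's dataset; `random_state=0` is set only for repeatability, and all `toy_*` names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already-preprocessed contexts and their lemma IDs.
toy_train_contexts = [
    "river bank water",
    "fish river water",
    "money bank loan",
    "loan money deposit",
]
toy_train_labels = ["bank_1", "bank_1", "bank_2", "bank_2"]
toy_test_contexts = ["water river", "deposit loan"]
toy_test_labels = ["bank_1", "bank_2"]

# The pipeline fits the vectorizer and the SVM together on the training texts.
wsd_pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
wsd_pipeline.fit(toy_train_contexts, toy_train_labels)

# Predict all test contexts at once and score them.
toy_predictions = wsd_pipeline.predict(toy_test_contexts)
toy_accuracy = accuracy_score(toy_test_labels, toy_predictions)
print(f"Accuracy: {toy_accuracy:.2f}")
```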

In [None]:
# Evaluate the model on the testing data
correct_predictions = 0
total_predictions = 0

for context, true_lemma_id in testing_data:
    context_features = vectorizer.transform([extract_features(context)])
    predicted_lemma_id = model.predict(context_features)[0]

    if predicted_lemma_id == true_lemma_id:
        correct_predictions += 1
    total_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.03
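
An accuracy this low is worth diagnosing before tuning the model. One check, offered as a suggestion and not a confirmed explanation, is how many test examples carry a lemma ID that never appears in the training set, because a classifier can only predict labels it was trained on. The sketch uses hypothetical label lists; in this notebook the inputs would be train_lemma_ids and the lemma IDs taken from testing_data.

```python
def unseen_label_rate(train_labels, test_labels):
    """Fraction of test labels that do not occur among the training labels."""
    if not test_labels:
        return 0.0
    seen = set(train_labels)
    unseen = sum(1 for label in test_labels if label not in seen)
    return unseen / len(test_labels)

# Hypothetical labels: "bank_3" occurs only in the test set.
toy_train_labels = ["bank_1", "bank_2", "bank_1"]
toy_test_labels = ["bank_1", "bank_3", "bank_3", "bank_2"]

rate = unseen_label_rate(toy_train_labels, toy_test_labels)
print(f"Unseen-label rate: {rate:.2f}")  # Unseen-label rate: 0.50
```

A high rate would mean that part of the error cannot be fixed by changing features or hyperparameters, only by changing the training data or the way labels are defined.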
