<a href="https://colab.research.google.com/github/Tabook22/AI/blob/main/wsd3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
! pip install nltk scikit-learn ftfy tensorflow keras



Make sure to replace 'train.json', 'test_wsd.json', and 'WSD_dict.json' in the loading cell with the actual paths of your JSON files containing the training data, testing data, and WSD dictionary.
This code does the following:

It loads the training data, testing data, and WSD dictionary from JSON files, repairing text-encoding problems with ftfy before parsing.
It preprocesses the data by extracting the context sentence and lemma ID from each item.
It creates a dictionary mapping lemma IDs to their possible glosses using the WSD dictionary.
It defines a function extract_features() that tokenizes a context sentence, converts the tokens to lowercase, removes Arabic stopwords, and joins the remaining tokens back into a string.
It prepares the training data by applying extract_features() to the training contexts and storing the corresponding lemma IDs.
It creates a TF-IDF vectorizer to convert the training contexts into a matrix of TF-IDF features, and separately tokenizes and pads the contexts into integer sequences for the recurrent models.
It trains a Linear SVM, a decision tree, a random forest, and a multilayer perceptron on the TF-IDF features, scoring each with 5-fold cross-validation, and trains BiLSTM and BiGRU models on the padded sequences.
It selects the scikit-learn model with the highest mean cross-validation score and evaluates it on the testing data by extracting features from the testing contexts, predicting the lemma IDs, and comparing them with the true lemma IDs.
Finally, it calculates and prints the accuracy of the selected model on the testing data.
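
As a compact illustration of the TF-IDF and Linear SVM steps, the sketch below fits both on a few made-up English contexts. The sentences and the sense labels (`bank_river`, `bank_money`) are placeholders invented for this example and do not come from the dataset; the pipeline wrapper is one convenient way to chain the two steps, not necessarily how the notebook does it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy contexts and sense labels (hypothetical, for illustration only)
toy_contexts = [
    "she sat on the bank of the river",
    "the river bank was muddy after the rain",
    "he deposited money at the bank",
    "the bank approved the loan and the money arrived",
]
toy_labels = ["bank_river", "bank_river", "bank_money", "bank_money"]

# Chain the vectorizer and the classifier so both are fitted together
toy_model = make_pipeline(TfidfVectorizer(), LinearSVC())
toy_model.fit(toy_contexts, toy_labels)

# Predictions on the training contexts; a real evaluation needs held-out data
toy_predictions = toy_model.predict(toy_contexts)
print(list(toy_predictions))
```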

Note that this is a basic implementation and can be further enhanced by incorporating more advanced features, experimenting with different machine learning algorithms, and fine-tuning hyperparameters.
Remember to provide JSON files whose records carry the fields the code reads: 'context' and 'lemma_id' for the training and testing data, and 'lemma_id' and 'gloss' for the WSD dictionary.
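
The record shape can be pictured with a small sketch. Only the key names (`context`, `lemma_id`, `gloss`) are taken from the code in this notebook; the values, the use of integer IDs, and the absence of other fields are assumptions made for illustration.

```python
import json

# Hypothetical records; only the key names come from the notebook's code
training_json = (
    '[{"context": "example sentence one", "lemma_id": 101},'
    ' {"context": "example sentence two", "lemma_id": 102}]'
)
dictionary_json = (
    '[{"lemma_id": 101, "gloss": "first sense"},'
    ' {"lemma_id": 101, "gloss": "second sense"},'
    ' {"lemma_id": 102, "gloss": "only sense"}]'
)

training_records = json.loads(training_json)
dictionary_records = json.loads(dictionary_json)

# (context, lemma_id) pairs, the shape preprocess_data() returns
pairs = [(r["context"], r["lemma_id"]) for r in training_records]

# lemma_id -> list of glosses, the shape of lemma_to_glosses
glosses = {}
for r in dictionary_records:
    glosses.setdefault(r["lemma_id"], []).append(r["gloss"])

print(pairs)
print(glosses)
```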

In [4]:
import json
from ftfy import fix_text
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
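# Import the preprocessing helpers from tensorflow.keras, like the model classes below
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences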
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, GRU, Dense


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Load the training data, testing data, and WSD dictionary from JSON files
def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        json_text = f.read()
        fixed_json_text = fix_text(json_text)
        data = json.loads(fixed_json_text)
    return data

training_data = load_json_file('train.json')
testing_data = load_json_file('test_wsd.json')
wsd_dictionary = load_json_file('WSD_dict.json')

In [6]:
# Preprocess the data
def preprocess_data(data):
    preprocessed_data = []
    for item in data:
        context = item['context']
        lemma_id = item['lemma_id']
        preprocessed_data.append((context, lemma_id))
    return preprocessed_data

training_data = preprocess_data(training_data)
testing_data = preprocess_data(testing_data)

In [7]:
# Create a dictionary mapping lemma IDs to their possible glosses
lemma_to_glosses = {}
for item in wsd_dictionary:
    lemma_id = item['lemma_id']
    gloss = item['gloss']
    if lemma_id not in lemma_to_glosses:
        lemma_to_glosses[lemma_id] = []
    lemma_to_glosses[lemma_id].append(gloss)

In [8]:
# Extract features from the context sentences
# Build the stopword set once; calling stopwords.words() for every token is very slow
arabic_stopwords = set(stopwords.words('arabic'))

def extract_features(context):
    tokens = nltk.word_tokenize(context)
    lowercase_tokens = [token.lower() for token in tokens]
    filtered_tokens = [token for token in lowercase_tokens if token not in arabic_stopwords]
    return ' '.join(filtered_tokens)

In [9]:
# Prepare the training data
train_contexts = [extract_features(context) for context, _ in training_data]
train_lemma_ids = [lemma_id for _, lemma_id in training_data]

In [10]:
# Create the TF-IDF vectorizer and transform the training contexts
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_contexts)

In [11]:
# Tokenize and pad the training contexts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_contexts)
train_sequences = tokenizer.texts_to_sequences(train_contexts)
max_length = max(len(seq) for seq in train_sequences)
train_padded = pad_sequences(train_sequences, maxlen=max_length)

In [12]:
# Prepare the testing data
test_contexts = [extract_features(context) for context, _ in testing_data]
test_sequences = tokenizer.texts_to_sequences(test_contexts)
test_padded = pad_sequences(test_sequences, maxlen=max_length)
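
By default, `pad_sequences` pads on the left with zeros and, when a sequence is longer than `maxlen`, drops items from the front. The plain-Python helper below (`pad_pre`, a hypothetical name used only here) mimics that default on toy integer sequences, to make the shape of `train_padded` and `test_padded` easier to picture. It is a sketch of the behavior, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[4, 7], [1, 2, 3, 5], [9]]
toy_maxlen = 3
toy_padded = pad_pre(toy_sequences, toy_maxlen)
print(toy_padded)  # [[0, 4, 7], [2, 3, 5], [0, 0, 9]]
```

Because `max_length` is taken from the training sequences, any longer test context loses its leading tokens. Words the `Tokenizer` never saw during fitting are also skipped when no `oov_token` is set.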

In [13]:
# Create the BiLSTM model
bilstm_model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=max_length),
    Bidirectional(LSTM(64)),
    Dense(len(lemma_to_glosses), activation='softmax')
])
bilstm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [14]:
# Create the BiGRU model
bigru_model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=max_length),
    Bidirectional(GRU(64)),
    Dense(len(lemma_to_glosses), activation='softmax')
])
bigru_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
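# Train and evaluate different models
from sklearn.preprocessing import LabelEncoder

# sparse_categorical_crossentropy expects integer class indices in a NumPy array,
# so map the lemma IDs to 0..n_classes-1 for the Keras models.
# Assumes the number of distinct training lemma IDs does not exceed
# len(lemma_to_glosses), the size of the output layer.
label_encoder = LabelEncoder()
train_label_indices = label_encoder.fit_transform(train_lemma_ids)

models = [
    ('Linear SVM', LinearSVC()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Neural Network', MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=500)),
    ('BiLSTM', bilstm_model),
    ('BiGRU', bigru_model)
]

for model_name, model in models:
    if model_name in ['BiLSTM', 'BiGRU']:
        # Train the Keras model; validation accuracy is reported after each epoch
        model.fit(train_padded, train_label_indices, epochs=5, batch_size=32, validation_split=0.2)
    else:
        # Train the scikit-learn model on the full training set
        model.fit(train_features, train_lemma_ids)

        # Evaluate with cross-validation; cross_val_score fits fresh clones,
        # so the model fitted above is left unchanged
        cv_scores = cross_val_score(model, train_features, train_lemma_ids, cv=5)

        print(f"{model_name} - Cross-validation scores: {cv_scores}")
        print(f"{model_name} - Average cross-validation score: {cv_scores.mean():.2f}")
    print()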




In [None]:
# Select the best model based on cross-validation scores (excluding BiLSTM and BiGRU)
best_model_name, best_model = max(
    models[:-2],
    key=lambda x: cross_val_score(x[1], train_features, train_lemma_ids, cv=5).mean()
)

print(f"Best model: {best_model_name}")


In [None]:
# Evaluate the best model on the testing data
correct_predictions = 0
total_predictions = 0

for context, true_lemma_id in testing_data:
    if best_model_name in ['BiLSTM', 'BiGRU']:
        # Prepare the testing context for the Keras model
        test_sequence = tokenizer.texts_to_sequences([extract_features(context)])
        test_padded = pad_sequences(test_sequence, maxlen=max_length)
        predicted_lemma_id = best_model.predict(test_padded).argmax()
    else:
        # Prepare the testing context for the scikit-learn model
        context_features = vectorizer.transform([extract_features(context)])
        predicted_lemma_id = best_model.predict(context_features)[0]

    if predicted_lemma_id == true_lemma_id:
        correct_predictions += 1
    total_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy of the best model on testing data: {accuracy:.2f}")
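
The loop above scores one context at a time. For scikit-learn models, the same accuracy can usually be obtained from a single batched comparison. The sketch below shows the idea with `accuracy_score` on made-up label values (`true_ids` and `predicted_ids` are placeholders, not results from this notebook).

```python
from sklearn.metrics import accuracy_score

# Placeholder labels for illustration only
true_ids = [101, 102, 101, 103]
predicted_ids = [101, 102, 103, 103]

# Manual count, equivalent to the per-example loop
manual_accuracy = sum(p == t for p, t in zip(predicted_ids, true_ids)) / len(true_ids)

# Same figure from scikit-learn's helper
batched_accuracy = accuracy_score(true_ids, predicted_ids)
print(manual_accuracy, batched_accuracy)  # 0.75 0.75
```

With this notebook's variables, the predictions would presumably come from `best_model.predict(vectorizer.transform(test_contexts))`, since `test_contexts` already holds the extracted features for every test item.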