# Read AraWordNet

contributed by **Ali Ahmed**

A utility notebook that reads **AraWordNet** and builds a dictionary mapping each sense (synset) to its words.

AraWordNet[1][2] can be found at http://globalwordnet.org/resources/arabic-wordnet/.

## Prerequisite:
- Define `wordnet_path` variable

[1] Black W., Elkateb S., Rodriguez H., Alkhalifa M., Vossen P., Pease A., Bertran M., Fellbaum C., (2006) The Arabic WordNet Project, Proceedings of LREC 2006

[2] Lahsen Abouenour, Karim Bouzoubaa, Paolo Rosso (2013) On the evaluation and improvement of Arabic WordNet coverage and usability, Language Resources and Evaluation 47(3) pp 891–91

## Import and Setup

In [None]:
from bs4 import BeautifulSoup
from collections import Counter, defaultdict

wordnet_path = '/content/arb2-lmf.xml'
%store wordnet_path
%store -r wordnet_path

Stored 'wordnet_path' (str)


## Read AraWordNet

**AraWordNet** has the following structure which is embedded in XML:
- LexicalEntry:
  - Lemma. Its properties are: `partOfSpeech` and `writtenForm`. We are interested in `writtenForm`, which gives the written form of the word.
  - Sense. Its properties are: `id` and `synset`. We are interested in `synset`, which links the word to its relations in the WordNet.
  - WordForm. Its properties are: `formType` and `writtenForm`. We are not interested in any of these properties.
- Synset. Its properties are `baseConcept` and `id`. We are interested in `id`, which matches the `synset` property in the `Sense` node of every word:
  - SynsetRelations
    - SynsetRelation. Its properties are `relType` and `targets`. We are interested in both. `relType` gives the relation type (`hypernym`, `hyponym`, etc.). `targets` matches the `synset` property in the `Sense` node of every word.
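
To make the structure above concrete, here is a minimal sketch that parses a tiny hand-written sample with the standard library's `xml.etree.ElementTree` (swapped in here for BeautifulSoup so the snippet has no dependencies). The sample document, its ids and the wrapper element names `LexicalResource` and `Lexicon` are assumptions for illustration and are not taken from the real AraWordNet file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-word sample shaped like the structure described above.
SAMPLE = """
<LexicalResource>
  <Lexicon>
    <LexicalEntry id="w1">
      <Lemma partOfSpeech="n" writtenForm="مكان"/>
      <Sense id="w1_s1" synset="place_n1AR"/>
    </LexicalEntry>
    <LexicalEntry id="w2">
      <Lemma partOfSpeech="n" writtenForm="قبر"/>
      <Sense id="w2_s1" synset="grave_n1AR"/>
    </LexicalEntry>
    <Synset id="place_n1AR" baseConcept="3">
      <SynsetRelations>
        <SynsetRelation relType="hypernym" targets="grave_n1AR"/>
      </SynsetRelations>
    </Synset>
  </Lexicon>
</LexicalResource>
"""

root = ET.fromstring(SAMPLE)

# Lemma.writtenForm -> Sense.synset for every lexical entry
word_to_synset = {
    entry.find("Lemma").get("writtenForm"): entry.find("Sense").get("synset")
    for entry in root.iter("LexicalEntry")
}

# (Synset.id, relType, targets) for every relation of every synset
edges = [
    (synset.get("id"), rel.get("relType"), rel.get("targets"))
    for synset in root.iter("Synset")
    for rel in synset.iter("SynsetRelation")
]

print(word_to_synset)
print(edges)
```

The `Synset.id` and `targets` values are the join keys: both refer to the same identifiers that appear in `Sense.synset`.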

In [None]:
print("Loading AraWordNet")
# Read the LMF file as UTF-8 and close it once the contents are in memory
with open(wordnet_path, encoding="utf-8") as wordnet_file:
    wordnet = BeautifulSoup(wordnet_file.read(), "xml")

Loading AraWordNet


## Extracting relations from AraWordNet (AWN)

Relation types can be:
- `hypernym`: represents a parent to child relationship.
- `hyponym`: represents a child to parent relationship.
- `has_instance`: represents an object to one of its instances relationship.
- `is_instance`: represents an instance to its object relationship.

and other relationships that we are not interested in.

In [None]:
print("Reading hypernym relations")
relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its hypernym relations
    synset_hypernym_relations = list(filter((lambda relation: relation['relType'] == 'hypernym'), synset.findAll('SynsetRelation')))
    for relation in synset_hypernym_relations:
        # Construct a pair between each synonym set and its child
        relations.append((synset['id'], relation['targets']))

Reading hypernym relations


### Testing whether hypernym relations mirror hyponym relations

If this is true, we can safely ignore the `hyponym` relationship and work with the `hypernym` relationship only.

In [None]:
hyponym_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its hyponym relations
    synset_hyponym_relations = list(filter((lambda relation: relation['relType'] == 'hyponym'), synset.findAll('SynsetRelation')))
    for relation in synset_hyponym_relations:
        # Construct a pair between each synonym set and its parent
        hyponym_relations.append((relation['targets'], synset['id']))

relations.sort()
hyponym_relations.sort()
print("Test: Are hypernym relations similar to hyponym relations? {}".format(relations == hyponym_relations))
if relations == hyponym_relations: print("Considering hypernym relations only")

Test: Are hypernym relations similar to hyponym relations? True
Considering hypernym relations only
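
The four extraction loops in this notebook differ only in the relation type and in whether the pair is reversed. As a rough sketch, the same check can be written once over plain `(synset_id, rel_type, target)` triples. The helper name `collect_pairs` and the toy triples are hypothetical.

```python
def collect_pairs(edges, rel_type, reverse=False):
    """Return sorted pairs for one relation type, optionally swapping each pair."""
    pairs = [
        (target, source) if reverse else (source, target)
        for source, rel, target in edges
        if rel == rel_type
    ]
    return sorted(pairs)

# Toy triples: every hypernym edge has a matching hyponym edge in the other direction
edges = [
    ("A", "hypernym", "B"),
    ("B", "hyponym", "A"),
    ("A", "hypernym", "C"),
    ("C", "hyponym", "A"),
]

hypernym_pairs = collect_pairs(edges, "hypernym")
hyponym_pairs = collect_pairs(edges, "hyponym", reverse=True)

# True when the two relation types carry the same information
print(hypernym_pairs == hyponym_pairs)
```

Sorting both lists before comparing makes the test independent of the order in which synsets appear in the file.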


### We might also consider the is_instance and has_instance

In [None]:
has_instance_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its has_instance relations
    synset_has_instance_relations = list(filter((lambda relation: relation['relType'] == 'has_instance'), synset.findAll('SynsetRelation')))
    for relation in synset_has_instance_relations:
        # Construct a pair between each synonym set and its instance
        has_instance_relations.append((synset['id'], relation['targets']))

### Testing whether has_instance relations mirror is_instance relations

If this is true, we can again safely ignore the `is_instance` relationship and work with the `has_instance` relationship only.

In [None]:
is_instance_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its is_instance relations
    synset_is_instance_relations = list(filter((lambda relation: relation['relType'] == 'is_instance'), synset.findAll('SynsetRelation')))
    for relation in synset_is_instance_relations:
        # Construct a pair between each synonym set and its object
        is_instance_relations.append((relation['targets'], synset['id']))

is_instance_relations.sort()
has_instance_relations.sort()
is_instance_relations == has_instance_relations

True

### Testing if hypernym contains repeated relations

We have to remove repeated relations if they exist.

In [None]:
print("Number of hypernym relations: {}".format(len(relations)))
print("Contains unique relations only? {}".format(len(relations) == len(set(relations))))

Number of hypernym relations: 19806
Contains unique relations only? False


### Therefore, we have to consider the set of unique relations, ignoring the repeated ones

In [None]:
print("Considering unique hypernym relations only")
relations = list(set(relations))
print("Number of unique hypernym relations: {}".format(len(relations)))

Considering unique hypernym relations only
Number of unique hypernym relations: 9305
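
`list(set(...))` removes duplicates but does not guarantee any particular order, so the order of `relations` (and of samples printed from it) may differ between runs. If a reproducible order matters, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each pair in its original position. This is a sketch on toy pairs, not the approach used in this notebook.

```python
# Toy pairs with two repeated entries
pairs = [("A", "B"), ("A", "C"), ("A", "B"), ("B", "D"), ("A", "C")]

# dict keys are unique and keep insertion order, so duplicates drop out in place
unique_pairs = list(dict.fromkeys(pairs))

print(len(pairs), len(unique_pairs))
print(unique_pairs)
```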


### Testing if we have self- or bi-directional relations, and removing them
A self-directional relation links a synset to itself. A bi-directional relation links two synsets where each one is the parent (`hypernym`) of the other. Both kinds form loops, which are problematic when constructing the tree used to generate the catcode and word-sense-children files.

In [None]:
# List for the relations in both directions
bi_directional_relations = []
# Synset is the sense id for the parent, target is the sense id for the child
for synset, target in relations:
    # Add a relation between the parent and the child
    bi_directional_relations.append((synset, target))
    # Add a relation in the other way around
    bi_directional_relations.append((target, synset))

# Count the number of occurrences of each pair. Since the hypernym relations are already unique,
# a count above 1 means the pair exists in both directions or relates a synset to itself
counter = Counter(bi_directional_relations)
# If the counter of any pair is more than 1, it should be marked as invalid
invalid_relations = list(filter((lambda relation: counter[relation] > 1), bi_directional_relations))
print("Considering unique uni-directional hypernym relations only")
# Remove the invalid relations
relations = list(set(relations) - set(invalid_relations))
print("Number of unique uni-directional hypernym relations: {}".format(len(relations)))

Considering unique uni-directional hypernym relations only
Number of unique uni-directional hypernym relations: 9302
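
The counting trick above can be traced on a toy example. Each pair is added once as written and once reversed, so a pair that also exists in the opposite direction, or that points to itself, ends up with a count of 2. The pairs below are hypothetical.

```python
from collections import Counter

# ("A", "B") and ("B", "A") form a two-node loop, ("C", "C") is a self-loop
pairs = [("A", "B"), ("B", "A"), ("C", "C"), ("A", "D")]

both_directions = []
for parent, child in pairs:
    both_directions.append((parent, child))
    both_directions.append((child, parent))

counts = Counter(both_directions)

# Pairs seen more than once take part in a loop and are dropped
looping = {pair for pair in pairs if counts[pair] > 1}
kept = [pair for pair in pairs if pair not in looping]

print(sorted(looping))
print(kept)
```

Note that this only detects loops of length one and two. Longer cycles (for example A to B to C to A) would need a graph traversal.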


### Testing if every child occurs only once as a child

A child should have only one parent and should therefore occur exactly once as a child in the unique uni-directional relations. If a child occurs multiple times as a child, its parents' n-balls would have to intersect. As a result, we have to remove the relations containing repeated children.

In [None]:
children = list(map((lambda relation: relation[1]), relations))
counter = Counter(children)
invalid_children = list(filter((lambda child: counter[child] > 1), children))

print("Number of invalid children: {}".format(len(invalid_children)))

Number of invalid children: 1763


In [None]:
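# Use a set so that each membership test takes constant time
invalid_children_set = set(invalid_children)
relations = list(filter((lambda relation: relation[0] not in invalid_children_set), relations))
relations = list(filter((lambda relation: relation[1] not in invalid_children_set), relations))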
print("Number of valid hypernym relations without repeated children: {}".format(len(relations)))

Number of valid hypernym relations without repeated children: 7177
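
The effect of this filter is easier to see on a small example. In the sketch below (toy pairs, hypothetical names), `C` has two parents, so every relation that mentions `C` is removed, whether `C` appears as the child or as the parent.

```python
from collections import Counter

# "C" is a child of both "A" and "B"
pairs = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "E")]

child_counts = Counter(child for _, child in pairs)

# Distinct children that have more than one parent
multi_parent = {child for child, n in child_counts.items() if n > 1}

kept = [
    (parent, child)
    for parent, child in pairs
    if parent not in multi_parent and child not in multi_parent
]

print(sorted(multi_parent))
print(kept)
```

Here `multi_parent` holds distinct children. The `invalid_children` list in the cell above is built by filtering the full `children` list, so a child with two parents appears in it twice and the printed count reflects occurrences rather than distinct synsets.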


### Extract vocabulary

Our vocabulary is limited to the words appearing in the valid hypernym relations. We extract them because they provide the written form that is used to look words up in the word embedding file.

In [None]:
# List for synonym set ids
synset_ids = []
for relation in relations:
    # Extract parent synonym set id
    synset_ids.append(relation[0])
    # Extract child synonym set id
    synset_ids.append(relation[1])

# Create unique set of synonym set ids extracted from the valid hypernym relations
synset_ids = list(set(synset_ids))
print("Number of synonym set ids: {}".format(len(synset_ids)))

Number of synonym set ids: 7622


In [None]:
# Filter lexical entries which have synonym set id appearing in our list
lexical_entries = list(filter((lambda entry: entry.Sense['synset'] in synset_ids), wordnet.findAll('LexicalEntry')))
# Extract words that correspond to our synonym set id list. These words form our vocabulary list.
words = list(set(map((lambda entry: entry.Lemma['writtenForm']), lexical_entries)))
print("Number of unique words: {}".format(len(words)))

Number of unique words: 14391


## Dictionary for synset to words

Construct a dictionary for every synonym set id and its set of words. The key is the synonym set id and the value is a list of words.

In [None]:
lexical_entries = list(filter((lambda entry: entry.Sense['synset'] in synset_ids), wordnet.findAll('LexicalEntry')))
synset_dict = defaultdict(list)
for entry in lexical_entries:
    written_form = entry.Lemma['writtenForm']
    synset_dict[entry.Sense['synset']].append(written_form)
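
The same grouping pattern also works in the other direction. A word can belong to more than one synset, so an inverse index from word to synset ids can be useful alongside `synset_dict`. The following is a minimal sketch on hypothetical `(written form, synset id)` pairs and is not part of the original pipeline.

```python
from collections import defaultdict

# Hypothetical (written form, synset id) pairs; the first word has two senses
entries = [
    ("مكان", "place_n1AR"),
    ("بقعة", "place_n1AR"),
    ("مكان", "position_n1AR"),
]

synset_to_words = defaultdict(list)
word_to_synsets = defaultdict(list)
for written_form, synset_id in entries:
    synset_to_words[synset_id].append(written_form)
    word_to_synsets[written_form].append(synset_id)

# The two-sense word maps to both of its synsets
print(word_to_synsets[entries[0][0]])
print(len(synset_to_words["place_n1AR"]))
```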

# **Test**

In [None]:
# Test the synset to words dictionary
print("Testing synset to words dictionary:")
test_synset = list(synset_dict.keys())[0]
print(f"Synset: {test_synset}")
print(f"Words: {synset_dict[test_synset]}")

# Test the vocabulary list
print("\nTesting vocabulary list:")
print(f"Total words in the vocabulary: {len(words)}")
print(f"Sample words: {words[:10]}")

# Test the unique uni-directional hypernym relations
print("\nTesting unique uni-directional hypernym relations:")
print(f"Number of unique uni-directional hypernym relations: {len(relations)}")
print(f"Sample relations: {relations[:10]}")


Testing synset to words dictionary:
Synset: ZalAam_n1AR
Words: [' ظلْماء', ' دُهْمة', 'عتْمة', 'ظلام', 'ظُلْمة', 'غلس', 'قتْمة', ' ظَلْماء', ' دُهْمَة']

Testing vocabulary list:
Total words in the vocabulary: 14391
Sample words: ['', 'إِتْقان', 'أسْلُوب كِتابِي', 'قبْر', 'سير في موكب', 'وِقَاء', 'سخّن', 'شوى', 'عالج ببراعة', 'قُماش قُطْنِي']

Testing unique uni-directional hypernym relations:
Number of unique uni-directional hypernym relations: 7177
Sample relations: [('Eamaliy~ap_n1AR', 'jamoE__n1AR'), ('>abodaEa_*ihoniy~aA_v1AR', 'taxay~ala_v2AR'), ('>amad_n1AR', 'madaY_n1AR'), ('quw~ap_n7AR', 'taHak~um_n2AR'), ('tijaArap_n4AR', 'tijaArap_n6AR'), ('Eil~ap_n2AR', 'xalal__n1AR'), ('quw~ap_n3AR', 'jA*iby~ap _n1AR'), ('baAsotA_n1AR', 'lAzAnyA_n1AR'), ('mud~ap_muHad~adFp_n1AR', 'faSol__n1AR'), ('Tariyqap_n2AR', 'Hal~_n1AR')]


In [None]:
# Test the synset to words dictionary
print("Testing synset to words dictionary:")
test_synset = list(synset_dict.keys())[5]
print(f"Synset: {test_synset}")
print(f"Words: {synset_dict[test_synset]}")

# Test the vocabulary list
print("\nTesting vocabulary list:")
print(f"Total words in the vocabulary: {len(words)}")
print(f"Sample words: {words[:10]}")

# Test the unique uni-directional hypernym relations
print("\nTesting unique uni-directional hypernym relations:")
print(f"Number of unique uni-directional hypernym relations: {len(relations)}")
print(f"Sample relations: {relations[:10]}")


Testing synset to words dictionary:
Synset: $a>n_n1AR
Words: ['شأن', 'همّ']

Testing vocabulary list:
Total words in the vocabulary: 14391
Sample words: ['', 'إِتْقان', 'أسْلُوب كِتابِي', 'قبْر', 'سير في موكب', 'وِقَاء', 'سخّن', 'شوى', 'عالج ببراعة', 'قُماش قُطْنِي']

Testing unique uni-directional hypernym relations:
Number of unique uni-directional hypernym relations: 7177
Sample relations: [('Eamaliy~ap_n1AR', 'jamoE__n1AR'), ('>abodaEa_*ihoniy~aA_v1AR', 'taxay~ala_v2AR'), ('>amad_n1AR', 'madaY_n1AR'), ('quw~ap_n7AR', 'taHak~um_n2AR'), ('tijaArap_n4AR', 'tijaArap_n6AR'), ('Eil~ap_n2AR', 'xalal__n1AR'), ('quw~ap_n3AR', 'jA*iby~ap _n1AR'), ('baAsotA_n1AR', 'lAzAnyA_n1AR'), ('mud~ap_muHad~adFp_n1AR', 'faSol__n1AR'), ('Tariyqap_n2AR', 'Hal~_n1AR')]


In [None]:
def get_related_words(word, relation_type):
    """
    Get words related to the given word based on the chosen relation type.

    Parameters:
    - word: The input word.
    - relation_type: The type of relation (hypernym, hyponym, has_instance, is_instance).

    Returns:
    - A list of words related to the input word based on the chosen relation type.
    """
    related_words = []

    # Find the synset ID for the input word
    synset_id = None
    for synset, words in synset_dict.items():
        if word in words:
            synset_id = synset
            break

    if synset_id is not None:
        # Find related synsets based on the chosen relation type
        if relation_type == 'hypernym':
            related_synsets = [target for source, target in relations if source == synset_id]
        elif relation_type == 'hyponym':
            related_synsets = [source for source, target in relations if target == synset_id]
        elif relation_type == 'has_instance':
            related_synsets = [target for source, target in has_instance_relations if source == synset_id]
        elif relation_type == 'is_instance':
            related_synsets = [source for source, target in is_instance_relations if target == synset_id]
        else:
            print("Invalid relation type. Choose from hypernym, hyponym, has_instance, or is_instance.")
            return []

        # Get words corresponding to the related synsets
        related_words = [word for synset, words in synset_dict.items() if synset in related_synsets for word in words]

    return related_words

# Example usage:
input_word = "قبْر"
chosen_relation_type = "hyponym"  # Choose from hypernym, hyponym, has_instance, or is_instance
result_words = get_related_words(input_word, chosen_relation_type)

print(f"Words related to '{input_word}' based on {chosen_relation_type}:")
print(result_words)


Words related to 'قبْر' based on hyponym:
['بُقْعة', 'مكان', 'نُقْطة طُوبُوغْرافِيّة']


In [None]:
# Example usage:
input_word = "مكان"
chosen_relation_type = "hyponym"  # Choose from hypernym, hyponym, has_instance, or is_instance
result_words = get_related_words(input_word, chosen_relation_type)

print(f"Words related to '{input_word}' based on {chosen_relation_type}:")
print(result_words)


Words related to 'مكان' based on hyponym:
['حِلّة', 'بيْت', 'منْزِل', 'مقام', 'مقرّ', 'مرْكز', 'مسْكن', 'مسْكن', 'سكن', 'وَطَن']


# **Load Arabic Word-Embedding**

A utility to load word embedding model.

The AraVec N-Gram model is used as the source of word embeddings because it provides embeddings for a larger share of the words we have in AraWordNet. fastText and AraVec uni-gram are both uni-gram models, so they are missing many words in the WordNet. In this notebook, we load the N-gram model.

## Prerequisite:
- Define the `fasttext_path`, `uni_gram_aravec_path` and `n_gram_aravec_path` variables
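
The coverage claim above can be checked with a small helper that measures which share of a word list is present in an embedding vocabulary. The sketch below makes two assumptions that should be verified against the actual model: that the embedding vocabulary stores words without diacritics, and that multi-word entries are joined with underscores. The names `normalize` and `coverage` and the toy vocabulary are hypothetical.

```python
import re

# Arabic diacritics (U+064B to U+0652) and tatweel (U+0640)
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def normalize(word):
    """Strip diacritics and join multi-word entries with underscores (assumed convention)."""
    word = _DIACRITICS.sub("", word.strip())
    return re.sub(r"\s+", "_", word)

def coverage(words, known_tokens):
    """Fraction of non-empty words whose normalized form is in known_tokens."""
    tokens = [normalize(w) for w in words if w.strip()]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_tokens)
    return hits / len(tokens)

# Toy stand-in for an embedding vocabulary
toy_vocab = {"مكان", "قماش_قطني"}
sample = ["مكان", "قُماش قُطْنِي", "وِقَاء", ""]

print(f"Coverage: {coverage(sample, toy_vocab):.2f}")
```

With a loaded gensim model, the set of known tokens would come from the model's vocabulary instead of `toy_vocab`. Empty strings, such as the `''` entry that appears in the vocabulary sample earlier, are skipped.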

## Import and setup

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199772 sha256=7a560ee0c7d04e0850ce9408e873b6e1329193888d3c94fb9aa3668f4bc5fd5d
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [None]:
import fasttext.util
fasttext.util.download_model('ar', if_exists='ignore')  # Arabic
ft = fasttext.load_model('cc.ar.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ar.300.bin.gz






In [None]:
import io
import gensim

fasttext_path = '/content/cc.ar.300.bin'
uni_gram_aravec_path = '/content/full_grams_sg_wiki.mdl'
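# Both AraVec variables point to the same file, so the two AraVec loading cells below load the same model
n_gram_aravec_path = uni_gram_aravec_path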

%store fasttext_path
%store -r fasttext_path
%store uni_gram_aravec_path
%store -r uni_gram_aravec_path
%store n_gram_aravec_path
%store -r n_gram_aravec_path

Stored 'fasttext_path' (str)
Stored 'uni_gram_aravec_path' (str)
Stored 'n_gram_aravec_path' (str)


## Load fastText word embedding model

fastText[1] word embeddings can be found at https://fasttext.cc/docs/en/crawl-vectors.html. We use the Arabic word embedding.

[1] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, “Learning Word Vectors for 157 Languages”, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.

In [None]:
import pandas as pd
import numpy as np

# Function to calculate cosine similarity between one word and a list of related words
def cosine_similarity(word, related_words, model):
    # Look up the input word's vector and norm once instead of once per comparison
    word_vector = model[word]
    word_norm = np.linalg.norm(word_vector)
    similarities = []
    for related_word in related_words:
        related_vector = model[related_word]
        similarity = np.dot(word_vector, related_vector) / (word_norm * np.linalg.norm(related_vector))
        similarities.append(similarity)

    # Create a DataFrame with words and cosine similarities
    result_df = pd.DataFrame({'Words': related_words, 'Cosine_Similarity': similarities})
    return result_df

# Example usage
input_word = 'مكان'  # Replace with the desired word
related_words = ['حِلّة', 'بيْت', 'منْزِل', 'مقام', 'مقرّ', 'مرْكز', 'مسْكن', 'مسْكن', 'سكن', 'وَطَن']  # Replace with your list of related words

# Calculate cosine similarity
result_df = cosine_similarity(input_word, related_words, ft)

# Print the result DataFrame
print(result_df)
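
For reference, the quantity computed above is the dot product of the two vectors divided by the product of their Euclidean norms. The dependency-free sketch below reproduces it on toy vectors and adds a guard for zero vectors, which the cell above does not have. The function name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 1.0]))   # vectors at 45 degrees
print(cosine([1.0, 0.0], [0.0, 2.0]))   # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```

Because the result depends only on direction, scaling either vector leaves the similarity unchanged.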


In [None]:
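# Loader for the text (.vec) format, adapted from fastText: https://fasttext.cc/docs/en/crawl-vectors.html
def load_vectors(fname):
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the vocabulary size and the vector dimension
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data

# load_vectors parses text vectors only; a binary .bin model is already loaded above as `ft`
fasttext_model = ft if fasttext_path.endswith('.bin') else load_vectors(fasttext_path)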

## Load AraVec Uni-Gram word embedding model

AraVec[2] uni-gram word embeddings can be found at https://github.com/bakrianoo/aravec#unigrams-models. We use the Wikipedia-SkipGram with vector size 300.

[2] A. Soliman, K. Eisa, and S. R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

In [None]:
unigram_aravec_model = gensim.models.Word2Vec.load(uni_gram_aravec_path)

## Load AraVec N-Gram word embedding model

AraVec n-gram word embeddings can be found at https://github.com/bakrianoo/aravec#n-grams-models-1. We use the Wikipedia-SkipGram with vector size 300.

In [None]:
ngram_aravec_model = gensim.models.Word2Vec.load(n_gram_aravec_path)

# **Cosine similarity**

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


related_words = result_words

def calculate_cosine_similarity(input_word, related_words, model):
    vector_input_word = model[input_word]
    vectors_related_words = [model[word] for word in related_words]

    cosine_similarities = cosine_similarity([vector_input_word], vectors_related_words)[0]

    result_df = pd.DataFrame({'Words': related_words, 'Cosine Similarity': cosine_similarities})
    return result_df

# Calculate cosine similarity using the fastText model
df = calculate_cosine_similarity(input_word, related_words, ft)

# Print the result DataFrame
print(df)


# **WordCloud for مكان hyponym**

In [None]:
!pip install wordcloud

## Import Arabic font

In [None]:
arabic_font_path = "NotoKufiArabic-VariableFont_wght.ttf"
arabic_font = {"font_path": arabic_font_path, "width": 800, "height": 400, "background_color": 'white'}

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Create a WordCloud
wordcloud = WordCloud(**arabic_font).generate_from_frequencies(
        dict(zip(df['Words'], df['Cosine Similarity']))
    )

# Display the WordCloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## **References**

https://github.com/bakrianoo/aravec#unigrams-models


<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a>
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>