# Trained word2vec Model for the Afrikaans Language

To try to improve language modeling for Afrikaans language tasks, a language model was trained using word2vec. Language models are known to be sensitive to the domain of usage: a language model trained on legal documents would not be expected to perform well on a language task in a different domain (such as technology). As a first attempt, therefore, a 'general purpose' Afrikaans language model was trained using Afrikaans content from Wikipedia.

This model should appear in the same directory as this document, with the name 'word2vec_af_wikipedia_500000.kv'. The .kv extension indicates stored word vectors that are keyed by lookup tokens (gensim's KeyedVectors). Because only the keyed vectors are stored instead of the full model, the model cannot be trained further. The advantage, though, is that the saved model file is smaller. This model is 61 MB in size, and GitHub recommends a maximum file size of 50 MB.

The model can be used in the following way:



In [3]:
# !pip install gensim

Collecting gensim
  Using cached gensim-3.8.3-cp36-cp36m-win_amd64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.1.2-py3-none-any.whl (111 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-4.1.2




In [5]:
# Load word vectors
import gensim
model_word_vectors = gensim.models.KeyedVectors.load("word2vec_af_wikipedia_500000.kv")

This Word2Vec model was computed using the following parameters:
- vector_size = 100: the number of dimensions of the embeddings
- window = 5: the maximum distance between a target word and the words around it
- min_count = 500: the minimum number of times a word must occur to be included when training the model
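
To make the roles of `window` and `min_count` concrete, the sketch below applies both to a tiny made-up corpus in plain Python. It is a simplified illustration and not gensim's implementation; the function names `filter_by_min_count` and `context_pairs` and the toy sentences are assumptions introduced here.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Drop every token that occurs fewer than min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """Yield (target, context) pairs whose positions are at most `window` apart."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, sentence[j]


# Toy corpus (made up for illustration, not taken from the Wikipedia data).
toy_corpus = [
    ["die", "seun", "lees", "die", "boek"],
    ["die", "meisie", "lees", "die", "koerant"],
]

# With min_count=2 only 'die' and 'lees' survive in this toy corpus.
filtered = filter_by_min_count(toy_corpus, min_count=2)
pairs = list(context_pairs(filtered[0], window=1))
print(filtered[0])
print(pairs)
```

A high `min_count` such as 500 therefore removes rare words from the vocabulary entirely, which is one reason a lookup for an uncommon word can fail.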

These parameters were chosen arbitrarily, and different parameters will produce a different model. Assessing which parameters are optimal requires a specific use case. At the time of creating this model, no specific use case had been identified and tests were still being devised to assess the performance of the model.

The model was trained on 500,000 web pages retrieved from Wikipedia, specifically Afrikaans content. Minimal processing was done to the content of the web pages. The web pages were retrieved during December 2020 and January 2021.

It took 35 hours to process. Most of that time was spent processing the HTML from the web pages. Only 3 hours of it were spent calculating the Word2Vec model itself.
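
The preprocessing code is not shown in this document. Purely as an illustration of what a minimal HTML-to-tokens step might look like, the sketch below strips tags with Python's standard `html.parser` and splits the visible text into word tokens. The names `TextExtractor` and `tokenize_html` are hypothetical, and this is not the pipeline that was actually used.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize_html(html):
    """Return the word tokens found in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # \w+ is Unicode-aware in Python 3, so accented letters stay inside a token.
    return re.findall(r"\w+", text)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Die <b>vrou</b> lees.</p></body></html>"
)
tokens = tokenize_html(sample)
print(tokens)
```

Case is preserved in this sketch; whether the original processing lower-cased the text is not stated.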

In [6]:
# Look up the vector (embedding) for the word 'vrou'.
# The output shows that it is indeed 100-dimensional.
test_vector = model_word_vectors["vrou"]
test_vector

array([ 5.9409714 ,  4.496691  ,  6.708669  , -0.73155135,  1.3737465 ,
       -1.5263249 ,  3.3370247 , -4.6911836 ,  2.3992188 ,  3.6628628 ,
       -2.0540934 ,  4.4699187 , -0.831507  , -6.2117314 , -0.9083498 ,
        3.5368695 , -0.14083028, -1.5410953 , -4.560557  ,  9.646817  ,
       -0.47829846,  0.57315624, 10.241089  , -2.1930175 ,  4.319505  ,
       -2.8107812 , -1.0864744 ,  1.9947112 , -1.4881661 , -1.1401428 ,
        9.81603   ,  5.0182805 ,  3.5927055 , -6.8963404 ,  7.9481792 ,
        3.4396493 ,  0.7054076 ,  0.49245477,  0.9315639 ,  2.610291  ,
       -3.3183303 , -2.5655458 , -3.008428  , -1.5372294 , -2.4686952 ,
        1.6182895 , -5.2672544 , -0.30342856, -3.9773808 ,  1.6618026 ,
       -6.450713  , -1.3467321 , -1.3227992 ,  9.374692  , -1.404849  ,
       10.4987335 , -2.2145247 ,  0.48969543, -2.9401553 ,  3.302072  ,
       -0.516358  ,  7.0161767 ,  0.6384455 ,  2.995517  ,  7.2274675 ,
       -5.422364  , -2.1316519 ,  0.35439172,  2.482815  ,  2.94

In [7]:
# Test another word to see if it is present in the model
len(model_word_vectors["seun"])

100

In [8]:
# Test another word
len(model_word_vectors["ontbyt"])

100

If a particular word is not part of the model, a KeyError will be raised. For example: 'KeyError: "word 'Donald Trump' not in vocabulary"'
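
One way to guard against this is to catch the KeyError and fall back to `None`. The sketch below is a minimal illustration; the helper name `get_vector_or_none` is an assumption, and a plain dict stands in for the loaded model so that the example runs on its own. The same helper should work with `model_word_vectors`, since it relies only on item lookup raising KeyError.

```python
def get_vector_or_none(keyed_vectors, word):
    """Return the vector for `word`, or None if the word is not in the vocabulary."""
    try:
        return keyed_vectors[word]
    except KeyError:
        return None


# Stand-in for the loaded model: a plain dict also raises KeyError for unknown keys.
# The two-dimensional vectors are made up for illustration.
toy_vectors = {"vrou": [0.1, 0.2], "seun": [0.3, 0.4]}

for word in ["vrou", "Donald Trump"]:
    vector = get_vector_or_none(toy_vectors, word)
    # Compare against None explicitly; truth-testing a NumPy array is ambiguous.
    status = "found" if vector is not None else "not in vocabulary"
    print("{!r}: {}".format(word, status))
```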

### What can be done with the word vectors?
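
The queries in this section rest on cosine similarity: `similarity` returns the cosine between two word vectors, and `most_similar` with `positive` and `negative` words combines the normalized vectors, then ranks the remaining vocabulary by cosine similarity to the result. The sketch below reproduces that idea in plain Python on made-up three-dimensional vectors. The function names and the toy vectors are assumptions for illustration; they are not values from the trained model, and the code is not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, a, b, c):
    """Rank words by cosine similarity to unit(b) + unit(c) - unit(a), excluding a, b and c."""
    query = [
        vb + vc - va
        for va, vb, vc in zip(unit(vectors[a]), unit(vectors[b]), unit(vectors[c]))
    ]
    candidates = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Made-up toy vectors, chosen so that the analogy works out exactly.
toy = {
    "pa": [1.0, 1.0, 0.0],
    "ma": [-1.0, 1.0, 0.0],
    "vader": [1.0, 0.0, 1.0],
    "moeder": [-1.0, 0.0, 1.0],
    "boek": [0.0, 0.5, 0.5],
}

# 'pa' is to 'ma' as 'vader' is to ...?
ranked = analogy(toy, "pa", "ma", "vader")
print(ranked[0][0])
print(round(cosine(toy["pa"], toy["ma"]), 4))
```

With real embeddings the top-ranked word is often only approximately right, as the results further down show.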

In [45]:
# Family
# seun meisie broer suster
a = "seun"
b = "meisie"
c = "broer"
d = "suster"
result = model_word_vectors.most_similar(positive=[b, c], negative=[a])
most_similar_key, similarity = result[0]
print(">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {:.4f}>> EXPECTED '{}'" \
      .format(a, b, c, most_similar_key, similarity, d))

# Family
# pa ma vader moeder
a = "pa"
b = "ma"
c = "vader"
d = "moeder"
result = model_word_vectors.most_similar(positive=[b, c], negative=[a])
most_similar_key, similarity = result[0]
print(">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {:.4f}>> EXPECTED '{}'" \
      .format(a, b, c, most_similar_key, similarity, d))

# Plural
# vader moeder broers susters
a = "vader"
b = "moeder"
c = "broers"
d = "susters"
result = model_word_vectors.most_similar(positive=[b, c], negative=[a])
most_similar_key, similarity = result[0]
print(">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {:.4f}>> EXPECTED '{}'" \
      .format(a, b, c, most_similar_key, similarity, d))

# Opposites
# gelukkig ongelukkig eerlik oneerlik
a = "gelukkig"
b = "ongelukkig"
c = "eerlik"
d = "oneerlik"
result = model_word_vectors.most_similar(positive=[b, c], negative=[a])
most_similar_key, similarity = result[0]
print(">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {:.4f}>> EXPECTED '{}'" \
      .format(a, b, c, most_similar_key, similarity, d))

print()

a = "vrou"
b = "man"
similarity = model_word_vectors.similarity(a, b)
print(">> Similarity between '{}' and '{}' is {:.4f}".format(a, b, similarity))

a = "broer"
b = "seun"
similarity = model_word_vectors.similarity(a, b)
print(">> Similarity between '{}' and '{}' is {:.4f}".format(a, b, similarity))

a = "pa"
b = "man"
similarity = model_word_vectors.similarity(a, b)
print(">> Similarity between '{}' and '{}' is {:.4f}".format(a, b, similarity))

a = "bewus"
b = "onbewus"
similarity = model_word_vectors.similarity(a, b)
print(">> Similarity between '{}' and '{}' is {:.4f}".format(a, b, similarity))

print()

a = "oom"
similar = model_word_vectors.similar_by_word(a)
print(">> Most similar to '{}'': {}".format(a, similar[:3]))

a = "tannie"
similar = model_word_vectors.similar_by_word(a)
print(">> Most similar to '{}'': {}".format(a, similar[:3]))

a = "oupa"
similar = model_word_vectors.similar_by_word(a)
print(">> Most similar to '{}'': {}".format(a, similar[:3]))

a = "seun"
similar = model_word_vectors.similar_by_word(a)
print(">> Most similar to '{}'': {}".format(a, similar[:3]))


>> 'seun' staan tot 'meisie', soos 'broer' staan tot <<'vrou'- 0.5705>> EXPECTED 'suster'
>> 'pa' staan tot 'ma', soos 'vader' staan tot <<'moeder'- 0.6940>> EXPECTED 'moeder'
>> 'vader' staan tot 'moeder', soos 'broers' staan tot <<'dogters'- 0.6842>> EXPECTED 'susters'
>> 'gelukkig' staan tot 'ongelukkig', soos 'eerlik' staan tot <<'handeling'- 0.4491>> EXPECTED 'oneerlik'

>> Similarity between 'vrou' and 'man' is 0.5491
>> Similarity between 'broer' and 'seun' is 0.7525
>> Similarity between 'pa' and 'man' is 0.4007
>> Similarity between 'bewus' and 'onbewus' is 0.5069

>> Most similar to 'oom'': [('broer', 0.7135692238807678), ('dogter', 0.6887587904930115), ('vader', 0.6787451505661011)]
>> Most similar to 'tannie'': [('groothertogin', 0.5186679363250732), ('Nijinsky', 0.48672914505004883), ('Petrowna', 0.481914758682251)]
>> Most similar to 'oupa'': [('pa', 0.6433155536651611), ('broer', 0.6229803562164307), ('tante', 0.6226500272750854)]
>> Most similar to 'seun'': [('dogter', 