# Trained word2vec Model for the Afrikaans Language

To try to improve language modeling for Afrikaans language tasks, a language model was trained using word2vec. The modeling of language tasks is known to be sensitive to the domain of usage: a language model for legal documents would not be expected to perform well on a language task involving a different domain (like technology). So, as a first attempt, a 'general purpose' Afrikaans language model was trained using Afrikaans content from Wikipedia.

This model should appear in the same directory as this document, with the name 'word2vec_af_wikipedia_500000.kv'. The .kv extension indicates stored word vectors that are keyed by lookup tokens. Because only the keyed vectors are stored instead of the full model, the model cannot be trained further. The advantage, though, is that the saved model file is smaller. This model is 61 MB in size, which is above GitHub's recommended maximum file size of 50 MB.

The model can be used in the following way:



In [9]:
# !pip install gensim

In [10]:
# Load word vectors
import gensim
model_word_vectors = gensim.models.KeyedVectors.load("word2vec_af_wikipedia_500000_2021_01_07.kv")

This Word2Vec model was computed using the following parameters:
- vector_size = 100, The number of dimensions of the embeddings
- window = 5, The maximum distance between a target word and words around the target word
- min_count = 500, The minimum count of words to consider when training the model

These parameters were chosen arbitrarily, and different parameters will produce a different model. To assess which parameters are optimal, a specific use case is required. At the time of creating this model, no specific use case had been identified and tests were still being devised to assess the performance of the model.
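
To make the `window` and `min_count` parameters more concrete, the following is a minimal pure-Python sketch of what they control. The helper names (`build_vocab`, `context_pairs`) and the toy corpus are hypothetical and only for illustration. This is not how Gensim implements training, and real implementations add details (such as subsampling of frequent words) that are omitted here.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


def context_pairs(sentence, vocab, window):
    """Yield (target, context) pairs for in-vocabulary tokens within `window` positions."""
    # Tokens below min_count are dropped before the window is applied.
    kept = [token for token in sentence if token in vocab]
    for i, target in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, kept[j]


# Toy corpus (illustrative only, not taken from the Wikipedia data).
sentences = [
    ["die", "man", "en", "die", "vrou"],
    ["die", "seun", "en", "die", "meisie"],
]
vocab = build_vocab(sentences, min_count=2)
pairs = list(context_pairs(sentences[0], vocab, window=1))
print(sorted(vocab))
print(pairs)
```

With `min_count=2` only 'die' and 'en' survive in this toy corpus, which shows why a high value such as 500 removes rare words from the vocabulary entirely.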

The model was trained on 500,000 web pages retrieved from Wikipedia, specifically Afrikaans content. Minimal processing was done on the content of the web pages. The web pages were retrieved during December 2020 and January 2021.

It took 35 hours to process. Most of that time was spent processing the HTML from the web pages. Only 3 hours of that was for calculating the Word2Vec model itself.

In [11]:
# Using the word 'vrou', look at one of the vectors (embeddings).
# It can be seen that it is indeed 100-dimensional.
test_vector = model_word_vectors["vrou"]
test_vector

array([ 5.9409714 ,  4.496691  ,  6.708669  , -0.73155135,  1.3737465 ,
       -1.5263249 ,  3.3370247 , -4.6911836 ,  2.3992188 ,  3.6628628 ,
       -2.0540934 ,  4.4699187 , -0.831507  , -6.2117314 , -0.9083498 ,
        3.5368695 , -0.14083028, -1.5410953 , -4.560557  ,  9.646817  ,
       -0.47829846,  0.57315624, 10.241089  , -2.1930175 ,  4.319505  ,
       -2.8107812 , -1.0864744 ,  1.9947112 , -1.4881661 , -1.1401428 ,
        9.81603   ,  5.0182805 ,  3.5927055 , -6.8963404 ,  7.9481792 ,
        3.4396493 ,  0.7054076 ,  0.49245477,  0.9315639 ,  2.610291  ,
       -3.3183303 , -2.5655458 , -3.008428  , -1.5372294 , -2.4686952 ,
        1.6182895 , -5.2672544 , -0.30342856, -3.9773808 ,  1.6618026 ,
       -6.450713  , -1.3467321 , -1.3227992 ,  9.374692  , -1.404849  ,
       10.4987335 , -2.2145247 ,  0.48969543, -2.9401553 ,  3.302072  ,
       -0.516358  ,  7.0161767 ,  0.6384455 ,  2.995517  ,  7.2274675 ,
       -5.422364  , -2.1316519 ,  0.35439172,  2.482815  ,  2.94

In [12]:
# Test another word to see if it is present in the model
len(model_word_vectors["seun"])

100

In [13]:
# Test another word
len(model_word_vectors["ontbyt"])

100

In [14]:
# Test if the model is case-sensitive
test_vector_1 = model_word_vectors["Die"]
test_vector_1

array([ 0.77464277,  5.8762918 , -5.731501  , -3.6681118 , -4.011173  ,
       -9.028599  , -0.74995923, -0.8948402 , -0.31283137, -4.871647  ,
        3.034864  , -1.2406137 ,  7.918766  , 10.318445  , -4.517135  ,
       -9.164283  ,  0.38009557, -6.2928634 , -2.3856127 ,  7.126466  ,
       -6.122153  , -2.092636  , -5.2584124 ,  6.7173157 ,  3.5572379 ,
       12.739954  , -1.0829413 ,  1.7337722 ,  8.25426   ,  2.2497866 ,
       -5.311122  , -1.7938011 , -0.13378747, -6.582857  , -0.77637875,
        3.3784635 , 14.4127035 ,  3.6457427 , -2.245472  , -4.0668936 ,
       -3.8872232 , -4.1480355 ,  1.3257781 ,  0.9594052 , -4.2537293 ,
        4.333547  ,  0.96040857,  6.758412  , -2.2232096 , -0.6627186 ,
       -0.5521825 , -3.6668394 ,  5.2618895 , -2.8118012 ,  1.2557592 ,
        7.189046  , -0.835054  ,  0.38913158, -2.6705868 , -5.9174643 ,
       -6.101463  , 10.014037  ,  1.1932181 , -1.8210099 , -6.3528605 ,
        4.119733  ,  5.636462  , -1.952771  ,  2.5994592 ,  3.43

In [15]:
test_vector_2 = model_word_vectors["die"]
test_vector_2

array([ 0.9637588 ,  3.340163  ,  2.6202397 , -0.3631109 ,  2.8870285 ,
       -6.0887575 , -1.3381153 , -3.089379  ,  1.0239747 , -2.7498906 ,
        2.1360517 , -0.51465225, -0.898215  ,  3.830969  ,  4.102815  ,
       -2.3922975 , -9.089986  , -0.8491865 , -3.2739496 ,  4.5140324 ,
       -1.664932  ,  1.9908719 , -3.1572967 , -1.22494   , -5.072695  ,
        0.1696878 ,  0.8671259 , -6.3541837 , -1.3597887 , -0.68297356,
        0.7620054 ,  1.2939987 ,  4.094528  , -7.109388  ,  1.7496419 ,
        0.7566529 ,  5.1341224 ,  7.7791677 ,  4.6878214 , -0.8463696 ,
        3.348159  , -1.1573505 ,  0.30039543, -0.6708167 , -0.9934627 ,
        3.9892833 ,  0.46180266,  6.1970863 , -4.85614   , -3.649036  ,
       -6.3312283 , -1.9286971 , -0.21158388, -5.8725734 ,  2.6953738 ,
        3.5396297 , -3.1873872 ,  3.7628553 ,  0.5165237 ,  4.5804224 ,
       -0.90519804,  1.662768  ,  4.0671263 ,  2.8800857 , -0.37748143,
       -2.4882345 ,  6.057412  , -4.5931463 , -2.4023619 , -1.64

**'test_vector_1'** and **'test_vector_2'** are different vectors. Vectors have been computed for both the word **'die'** and the word **'Die'**, which indicates that the model is case-sensitive.

It is not right or wrong for a model to be case-sensitive; what matters is what the model will be used for. If there is a use case where it is important to be able to distinguish between words in their lower-case form and their proper-case form, then the model should be case-sensitive. Conversely, if the model is built using only lower-case words, then things like proper nouns would have to be converted to lower-case before being looked up.

If a particular word is not part of the model, a KeyError will be raised. For example, 'KeyError: "word 'Donald Trump' not in vocabulary"'
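
Because the model is case-sensitive and missing words raise a KeyError, a small lookup helper can be convenient. The sketch below is one possible approach, not part of the original notebook: `lookup_vector` is a hypothetical helper, and it assumes only that the loaded vectors support `key in vectors` and `vectors[key]`. A plain dictionary with made-up values stands in for the model here.

```python
def lookup_vector(vectors, word, default=None):
    """Return the vector for `word`, trying case variants before giving up.

    `vectors` is anything that supports `key in vectors` and `vectors[key]`,
    which is the interface assumed here for the loaded keyed vectors.
    """
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in vectors:
            return vectors[candidate]
    return default


# Toy stand-in for the loaded model (values are made up for illustration).
toy_vectors = {"die": [0.9, 3.3], "Die": [0.7, 5.8]}

print(lookup_vector(toy_vectors, "DIE"))           # falls back to the entry for "die"
print(lookup_vector(toy_vectors, "Donald Trump"))  # no variant found, so the default is returned
```

Whether falling back to another case variant is appropriate depends on the use case, since 'Die' and 'die' have different vectors in this model.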

### What can be done with the word vectors?

In [31]:
def run_simple_test(model_word_vectors):
    """Run a few analogy, similarity and nearest-neighbour spot checks."""
    print(">> Reasoning with word vectors")

    # Each tuple is (a, b, c, d): "a staan tot b, soos c staan tot d"
    analogies = [
        ("seun", "meisie", "broer", "suster"),             # Family
        ("pa", "ma", "vader", "moeder"),                   # Family
        ("vader", "moeder", "broers", "susters"),          # Plural
        ("gelukkig", "ongelukkig", "eerlik", "oneerlik"),  # Opposites
    ]
    analogy_ok = ">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {:.4f}>> EXPECTED '{}'"
    analogy_missing = ">> '{}' staan tot '{}', soos '{}' staan tot <<'{}'- {}>> EXPECTED '{}'"

    for a, b, c, d in analogies:
        try:
            result = model_word_vectors.most_similar(positive=[b, c], negative=[a])
            most_similar_key, similarity = result[0]
            print(analogy_ok.format(a, b, c, most_similar_key, similarity, d))
        except KeyError:
            # At least one of the words is not in the vocabulary
            print(analogy_missing.format("UNDEFINED", "UNDEFINED", "UNDEFINED",
                                         "UNDEFINED", "UNDEFINED", d))

    print()

    print(">> Simlarity between words")

    word_pairs = [("vrou", "man"), ("broer", "seun"), ("pa", "man"), ("bewus", "onbewus")]

    for a, b in word_pairs:
        try:
            similarity = model_word_vectors.similarity(a, b)
            print(">> Similarity between '{}' and '{}' is {:.4f}".format(a, b, similarity))
        except KeyError:
            print(">> Similarity between '{}' and '{}' is {}".format(a, b, "UNDEFINED"))

    print()

    print(">> Most similar words")

    for a in ["oom", "tannie", "oupa", "seun"]:
        try:
            similar = model_word_vectors.similar_by_word(a)
            print(">> Most similar to '{}'': {}".format(a, similar[:3]))
        except KeyError:
            print(">> Most similar to '{}'': {}".format(a, "UNDEFINED"))

In [32]:
run_simple_test(model_word_vectors=model_word_vectors)

>> Reasoning with word vectors
>> 'seun' staan tot 'meisie', soos 'broer' staan tot <<'vrou'- 0.5705>> EXPECTED 'suster'
>> 'pa' staan tot 'ma', soos 'vader' staan tot <<'moeder'- 0.6940>> EXPECTED 'moeder'
>> 'vader' staan tot 'moeder', soos 'broers' staan tot <<'dogters'- 0.6842>> EXPECTED 'susters'
>> 'gelukkig' staan tot 'ongelukkig', soos 'eerlik' staan tot <<'handeling'- 0.4491>> EXPECTED 'oneerlik'

>> Simlarity between words
>> Similarity between 'vrou' and 'man' is 0.5491
>> Similarity between 'broer' and 'seun' is 0.7525
>> Similarity between 'pa' and 'man' is 0.4007
>> Similarity between 'bewus' and 'onbewus' is 0.5069

>> Most similar words
>> Most similar to 'oom'': [('broer', 0.7135692238807678), ('dogter', 0.6887587904930115), ('vader', 0.6787451505661011)]
>> Most similar to 'tannie'': [('groothertogin', 0.5186679363250732), ('Nijinsky', 0.48672914505004883), ('Petrowna', 0.481914758682251)]
>> Most similar to 'oupa'': [('pa', 0.6433155536651611), ('broer', 0.6229803562

From the few simple tests devised it can be seen that the performance of the model is not very impressive. Under the 'Reasoning with word vectors' section, only one of the four tests was correct ('moeder'). The similarity values for these tests ranged from about 0.45 to 0.69. Larger similarity values are better, so perhaps an acceptable (or correct) result could be expected at values of 0.8 and higher.
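
The similarity scores reported by Gensim are cosine similarities, and the 'reasoning' tests rely on vector arithmetic. The sketch below illustrates both ideas with made-up two-dimensional vectors. It is a simplified illustration rather than Gensim's implementation (Gensim normalises the vectors before combining them), and `cosine_similarity`, `solve_analogy` and `toy_vectors` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


# Made-up 2-dimensional vectors: axis 0 ~ "gender", axis 1 ~ "generation".
toy_vectors = {
    "pa": [1.0, 1.0],
    "ma": [-1.0, 1.0],
    "vader": [1.0, 1.2],
    "moeder": [-1.0, 1.2],
    "seun": [1.0, -1.0],
}

# "pa" staan tot "ma", soos "vader" staan tot ...
print(solve_analogy(toy_vectors, "pa", "ma", "vader"))
```

In the toy data the analogy resolves cleanly because the vectors were constructed that way; with trained vectors the nearest word is often a related but different word, as the results above show.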

Two things should be noted about testing a Word2Vec model:
1. These few tests do not represent a comprehensive evaluation of the model. They were only used to see whether the model development is heading in the right direction. A more comprehensive test could be devised, and many such test suites, with thousands of individual tests, exist.
2. Even though a more comprehensive test could be devised, I consider the ultimate test to be an actual application of the model. What this means is that if a classifier is built from some Afrikaans text, then the performance of that classifier should be evaluated with and without the word vectors. If the performance increases when the word vectors are used, then the word vectors can be considered useful.
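
As a step towards a more systematic test, the analogy checks can be scored automatically. The following is a minimal sketch, assuming a `predict(a, b, c)` callable that raises KeyError for out-of-vocabulary words. `analogy_accuracy` and `predict_with_gensim` are hypothetical helpers, and the four cases are the ones used in this document.

```python
def analogy_accuracy(predict, cases):
    """Score analogy cases, counting out-of-vocabulary cases separately.

    `predict(a, b, c)` returns the best guess for "a is to b as c is to ?"
    and is expected to raise KeyError when a word is not in the vocabulary.
    """
    correct = answered = skipped = 0
    for a, b, c, expected in cases:
        try:
            guess = predict(a, b, c)
        except KeyError:
            skipped += 1
            continue
        answered += 1
        correct += guess == expected
    accuracy = correct / answered if answered else 0.0
    return accuracy, answered, skipped


# The four analogy cases used in this document.
cases = [
    ("seun", "meisie", "broer", "suster"),
    ("pa", "ma", "vader", "moeder"),
    ("vader", "moeder", "broers", "susters"),
    ("gelukkig", "ongelukkig", "eerlik", "oneerlik"),
]


def predict_with_gensim(word_vectors):
    """Adapt a Gensim KeyedVectors object to the `predict` interface above."""
    def predict(a, b, c):
        return word_vectors.most_similar(positive=[b, c], negative=[a])[0][0]
    return predict
```

With the model loaded earlier, this could be called as `analogy_accuracy(predict_with_gensim(model_word_vectors), cases)`. Reporting skipped cases separately keeps missing vocabulary from being confused with wrong answers.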


To try to improve these results, another model was trained. This time a pre-processing step was introduced. The text from the HTML was processed using the [NLTK](https://www.nltk.org/) library. Specifically, the 'sent_tokenize' function was used. Scanning some of the sentences (which are passed to the Gensim Word2Vec function) by eye suggested that the first model used large blocks of text (paragraphs) instead of sentences, whereas the second model appeared to use sentences.
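
The difference between the two inputs can be sketched as follows. Gensim's Word2Vec expects an iterable of token lists, one list per sentence. The regular expression below is only a naive stand-in for NLTK's 'sent_tokenize' so that the example runs without extra downloads; the function names and the sample paragraph are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace.

    This is only a stand-in for NLTK's `sent_tokenize`, which handles
    abbreviations and other edge cases that this regular expression does not.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def to_training_sentences(text):
    """Turn a block of text into a list of token lists, one per sentence."""
    return [sentence.split() for sentence in split_sentences(text)]


paragraph = "Die man lees. Die vrou skryf 'n boek."
print(to_training_sentences(paragraph))
```

Without the sentence split, the whole paragraph would be passed as a single token list, so the context window could span sentence boundaries.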

**Observation:**
When inspecting the word (token) counts of the text used for training the word vectors, it was observed that tokens like '.', '[' and '(' appeared. It might improve the model performance if these are removed from the input to the model development process, unless of course these are valuable tokens that might be needed when using the word vectors.
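
One way such tokens could be removed is sketched below. This is an assumption about a possible pre-processing step, not something that was applied to either model; `is_punctuation_token` and `drop_punctuation_tokens` are hypothetical helpers that drop tokens consisting entirely of Unicode punctuation characters.

```python
import unicodedata


def is_punctuation_token(token):
    """True when every character in the token is a Unicode punctuation mark."""
    return bool(token) and all(unicodedata.category(ch).startswith("P") for ch in token)


def drop_punctuation_tokens(sentences):
    """Remove punctuation-only tokens from each tokenised sentence."""
    return [[tok for tok in sentence if not is_punctuation_token(tok)]
            for sentence in sentences]


# Made-up tokenised sentences for illustration.
sentences = [["Die", "man", "(", "1950", ")", "lees", "."], ["[", "1", "]"]]
print(drop_punctuation_tokens(sentences))
```

Tokens that merely contain punctuation, such as '16-jarige' or "'n", are kept, since only punctuation-only tokens are filtered.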

In [33]:
# Word vector model built using an additional preprocessing step.
# HTML was processed by the NLTK library, to try to extract sentences
model_word_vectors_by_sentence = gensim.models.KeyedVectors.load("word2vec_af_wikipedia_10000_by_sent_speed_test.kv")
run_simple_test(model_word_vectors=model_word_vectors_by_sentence)

>> Reasoning with word vectors
>> 'seun' staan tot 'meisie', soos 'broer' staan tot <<'16-jarige'- 0.5693>> EXPECTED 'suster'
>> 'pa' staan tot 'ma', soos 'vader' staan tot <<'moeder'- 0.5485>> EXPECTED 'moeder'
>> 'UNDEFINED' staan tot 'UNDEFINED', soos 'UNDEFINED' staan tot <<'UNDEFINED'- UNDEFINED>> EXPECTED 'susters'
>> 'UNDEFINED' staan tot 'UNDEFINED', soos 'UNDEFINED' staan tot <<'UNDEFINED'- UNDEFINED>> EXPECTED 'oneerlik'

>> Simlarity between words
>> Similarity between 'vrou' and 'man' is 0.6617
>> Similarity between 'broer' and 'seun' is 0.7156
>> Similarity between 'pa' and 'man' is 0.2496
>> Similarity between 'bewus' and 'onbewus' is UNDEFINED

>> Most similar words
>> Most similar to 'oom'': UNDEFINED
>> Most similar to 'tannie'': UNDEFINED
>> Most similar to 'oupa'': UNDEFINED
>> Most similar to 'seun'': [('dogter', 0.7809075117111206), ('broer', 0.7156499028205872), ('moeder', 0.674589991569519)]


These results show that many of the words were not found. This is to be expected, because this model was trained on only 10,000 HTML files. One word, 'moeder', in the first set of tests ('Reasoning with word vectors') was correctly matched. It should be noted, though, that the similarity score is not that high (0.5485). This indicates that further testing with a model trained on more HTML pages should be revealing.