# Pretrained Word2Vec Model on Urdu Wikipedia Corpus

Word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”), or vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”), which resembles the vector for “Toronto Maple Leafs”.
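
The vector arithmetic described above can be illustrated with a small sketch. The 3-dimensional vectors below are invented purely for illustration (real word2vec vectors are learned from text and, in this notebook, have 300 dimensions), and the helper names `cosine` and `analogy` are hypothetical. Note also that this is plain vector addition followed by a cosine comparison, a simplification that is not necessarily identical to what gensim does internally.

```python
import numpy as np

# Hypothetical toy embeddings, made up for this example only.
# The axes can be read loosely as (royalty, maleness, femaleness).
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative).

    The input words themselves are excluded from the candidates.
    """
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# vec("king") - vec("man") + vec("woman")
best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

With these toy values the nearest remaining word is "queen", because the combined vector keeps the royalty component while swapping the male component for the female one.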

## Loading the pretrained Urdu word2vec 300 dimension vector model

This model was trained on data from 50,000 news posts.

In [1]:
# Import modules and set up logging
import logging
import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# Download the Word2Vec model from Google Drive (uncomment one of the options below to run)
#!wget -O - 'https://drive.google.com/uc?export=download&id=1yz8RfJeg65QByLs1aJUORtPujHYx_oQP' > urdu_wikipedia_vector300.bin
# import urllib.request

# model_url = "https://drive.google.com/uc?export=download&id=1yz8RfJeg65QByLs1aJUORtPujHYx_oQP"
# file_name = "urdu_wikipedia_vector300.bin"
# urllib.request.urlretrieve(model_url, file_name)

In [3]:
# Loading the model
model = gensim.models.KeyedVectors.load_word2vec_format('urdu_wikipedia_vector300.bin', binary=True)
# print(model)
# print(model.wv.vocab)

2018-06-30 18:04:43,075 : INFO : loading projection weights from urdu_wikipedia_vector300.bin
2018-06-30 18:04:43,888 : INFO : loaded (49003, 300) matrix from urdu_wikipedia_vector300.bin


## Evaluating
`Word2Vec` training is an unsupervised task, so there is no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released a test set of about 20,000 syntactic and semantic examples that follow the “A is to B as C is to D” analogy task. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. The dataset contains a total of 9 types of syntactic comparisons, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 
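
As a rough sketch of how such an analogy test set could be scored, the snippet below parses text in the questions-words layout (a `: section` header followed by lines of four words, A B C D) and computes per-section accuracy. The inline sample rows, the `FAKE_ANSWERS` table, and the helper names `parse_analogies`, `section_accuracy` and `fake_predict` are all hypothetical and stand in for a real file and a real model; they are not results from this notebook's model.

```python
from collections import defaultdict

# Tiny inline sample in the analogy-file layout. Rows are illustrative only.
SAMPLE = """\
: capital-common-countries
ایتھنز یونان بغداد عراق
اسلام_آباد پاکستان ٹوکیو جاپان
: family
لڑکا لڑکی بھائی بہن
"""


def parse_analogies(text):
    """Group four-word analogy rows under their ': section' headers."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.split()
        if current is not None and len(words) == 4:
            sections[current].append(tuple(words))
    return dict(sections)


def section_accuracy(sections, predict):
    """Fraction of rows per section where predict(a, b, c) equals d."""
    scores = {}
    for name, rows in sections.items():
        correct = sum(1 for a, b, c, d in rows if predict(a, b, c) == d)
        scores[name] = correct / len(rows)
    return scores


# Hypothetical stand-in for a model, with one deliberately wrong answer.
# With a loaded gensim model, predict could instead wrap something like
# model.most_similar(positive=[b, c], negative=[a])[0][0].
FAKE_ANSWERS = {
    ("ایتھنز", "یونان", "بغداد"): "عراق",
    ("اسلام_آباد", "پاکستان", "ٹوکیو"): "اوساکا",
    ("لڑکا", "لڑکی", "بھائی"): "بہن",
}


def fake_predict(a, b, c):
    return FAKE_ANSWERS.get((a, b, c))


scores = section_accuracy(parse_analogies(SAMPLE), fake_predict)
print(scores)
```

Only the top-ranked prediction counts as correct here, which is a strict criterion. A real evaluation would also need to decide how to treat rows containing words that are missing from the model's vocabulary.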

In [4]:
model.most_similar("پاکستان")

2018-06-30 18:04:46,922 : INFO : precomputing L2-norms of word weight vectors


[('پنجاب', 0.7467092275619507),
 ('فیصل_آباد', 0.6053035855293274),
 ('یونین_کونسلیں', 0.5992769002914429),
 ('سرگودھا', 0.5941762328147888),
 ('اسلام_آباد', 0.5806517601013184),
 ('راولپنڈی', 0.5633086562156677),
 ('پاکستانی', 0.5550359487533569),
 ('ساہیوال', 0.5546157360076904),
 ('ضلع', 0.5504013895988464),
 ('پختونخوا', 0.5464353561401367)]

In [5]:
model.most_similar(positive=['بغداد', 'یونان'], negative=['ایتھنز'])

[('عراق', 0.49037542939186096),
 ('مصر', 0.44794124364852905),
 ('عباسیہ', 0.4396800994873047),
 ('المقتدی', 0.4304294288158417),
 ('المتقی', 0.4190688729286194),
 ('بصرہ', 0.4182446002960205),
 ('باللہ', 0.4113025665283203),
 ('عباسی', 0.4049525260925293),
 ('واسط', 0.39807507395744324),
 ('المقتفی', 0.39746952056884766)]

In [6]:
model.most_similar(positive=['ٹوکیو', 'پاکستان'], negative=['اسلام_آباد'])

[('جاپان', 0.482408344745636),
 ('اوساکا', 0.4233572781085968),
 ('توکیو', 0.41428637504577637),
 ('ہیروشیما', 0.4139555096626282),
 ('اوکیناوا', 0.3928012549877167),
 ('یوکوہاما', 0.39099442958831787),
 ('کاناگاوا', 0.38639694452285767),
 ('چوو', 0.36407244205474854),
 ('چیبا', 0.3635532855987549),
 ('گونما', 0.3627404272556305)]

In [7]:
model.most_similar(positive=['بھائی', 'لڑکی'], negative=['لڑکا'])

[('بہن', 0.5604875087738037),
 ('بیٹی', 0.5328823924064636),
 ('بیوی', 0.510018527507782),
 ('چچا', 0.49377816915512085),
 ('ماں', 0.49138566851615906),
 ('بیٹے', 0.4737793207168579),
 ('بہنیں', 0.4572775363922119),
 ('اپنے', 0.4566517174243927),
 ('شوہر', 0.4563017785549164),
 ('باپ', 0.4543006420135498)]

In [8]:
model.most_similar(positive=['ماں', 'دادا'], negative=['باپ'])

[('پھالکے', 0.5259937644004822),
 ('دادی', 0.5064771771430969),
 ('والد', 0.5038189888000488),
 ('والدہ', 0.501828134059906),
 ('صاحب', 0.4324100613594055),
 ('بھائی', 0.43129321932792664),
 ('بیٹی', 0.4238676428794861),
 ('نانا', 0.4186864197254181),
 ('بچپن', 0.414540559053421),
 ('تھی', 0.41393712162971497)]

In [9]:
model.most_similar(positive=['دلہن', 'شوہر'], negative=['دولہا'])

[('شادی', 0.4839109182357788),
 ('بیوی', 0.4822179675102234),
 ('طلاق', 0.4765128195285797),
 ('بیٹی', 0.4720294177532196),
 ('عورت', 0.426933228969574),
 ('بیوہ', 0.4230973422527313),
 ('زوجیت', 0.40840286016464233),
 ('بائن', 0.40645647048950195),
 ('ماں', 0.39884424209594727),
 ('خاوند', 0.38704994320869446)]

In [10]:
model.most_similar(positive=['ملکہ', 'باپ'], negative=['بادشاہ'])

[('ماں', 0.5418938994407654),
 ('بہن', 0.4456590414047241),
 ('وکٹوریہ', 0.42152002453804016),
 ('شوہر', 0.4130823612213135),
 ('پرورش', 0.40089383721351624),
 ('والدین', 0.39476436376571655),
 ('بیٹی', 0.38815945386886597),
 ('بہنیں', 0.3837417960166931),
 ('علاتی', 0.3805781900882721),
 ('بیوی', 0.36311426758766174)]

In [28]:
model['پاکستان']

array([ 2.90719718e-01, -1.54507101e-01, -1.01328149e-01,  2.22087190e-01,
       -3.07357967e-01, -4.02863398e-02, -5.57583682e-02,  1.13708153e-01,
        3.64474282e-02,  1.82163820e-01, -1.03520408e-01, -7.86808804e-02,
       -1.73256442e-01,  1.38376638e-01, -4.12036739e-02, -2.63018042e-01,
        5.62551692e-02, -1.17260098e-01, -7.21818581e-02,  1.15282960e-01,
        3.13853800e-01,  1.01477422e-01, -5.95420636e-02, -2.33268905e-02,
        9.58233029e-02,  1.53424725e-01,  1.02251381e-01, -6.76823035e-02,
       -1.50715426e-01,  8.69434401e-02, -1.76381662e-01,  9.32310522e-02,
       -2.59102434e-01,  1.19624197e-01, -1.10111170e-01, -1.76104322e-01,
       -4.41194586e-02, -2.03329138e-02,  1.74071994e-02,  5.92087135e-02,
        5.02466500e-01, -9.65452045e-02,  1.82059482e-01, -1.86802164e-01,
       -2.26636663e-01,  2.67531220e-02,  3.24652210e-05,  1.36122927e-01,
        1.86685532e-01,  1.75396845e-01, -1.99274853e-01,  2.08395988e-01,
       -4.25279647e-01,  