# Pretrained Word2Vec Model on Urdu News Data

Word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”), or vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) ≈ vec(“Toronto Maple Leafs”).
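
To make the vector arithmetic concrete, here is a minimal sketch on hand-made two-dimensional vectors. The `toy_vectors` values and the `analogy` helper are hypothetical and chosen only for illustration; they are not taken from any trained model, and gensim's own `most_similar` additionally normalizes the vectors to unit length before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

The input words are excluded from the candidates because the nearest neighbour of the combined vector is otherwise often one of the query words themselves.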

## Loading the pretrained 300-dimensional Urdu word2vec model

This model was trained on 50,000 news posts. The loading log below reports a vocabulary of 27,994 words, each with a 300-dimensional vector.

In [9]:
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [10]:
# Download the Word2Vec model from Google Drive (uncomment one of the two options below)
#!wget -O - 'https://drive.google.com/uc?export=download&id=1MEQMqHdsAm1vaJgG9PrNccn0vFEhX41l' > urdu_web_news_vector300.bin
# import urllib.request

# model_url = "https://drive.google.com/uc?export=download&id=1MEQMqHdsAm1vaJgG9PrNccn0vFEhX41l"
# file_name = "urdu_web_news_vector300.bin"
# urllib.request.urlretrieve(model_url, file_name)

In [11]:
# Loading the model
model = gensim.models.KeyedVectors.load_word2vec_format('urdu_web_news_vector300.bin', binary=True)
# print(model)
# print(model.wv.vocab)

2018-06-29 12:34:36,472 : INFO : loading projection weights from urdu_web_news_vector300.bin
2018-06-29 12:34:37,037 : INFO : loaded (27994, 300) matrix from urdu_web_news_vector300.bin
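
The commented-out `model.wv.vocab` line above depends on the gensim version. As a hedged sketch, the hypothetical helper below (`vocabulary` is not part of gensim) lists the words of a loaded `KeyedVectors` object by trying the gensim 4.x attribute `key_to_index` first and falling back to the gensim 3.x attribute `vocab`.

```python
def vocabulary(kv):
    """Return the list of words in a KeyedVectors-like object.

    gensim 4.x exposes `key_to_index`; gensim 3.x exposed `vocab`.
    Both are dict-like and keyed by word.
    """
    mapping = getattr(kv, "key_to_index", None)
    if mapping is None:
        mapping = kv.vocab
    return list(mapping)

# Usage with the model loaded above (not executed here):
# words = vocabulary(model)
# print(len(words))  # the loading log suggests 27994
```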


## Evaluating
`Word2Vec` training is an unsupervised task, so there is no single objective way to evaluate the result; evaluation depends on your end application.

Google has released a test set of about 20,000 syntactic and semantic examples that follow the “A is to B as C is to D” pattern. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. The dataset contains a total of 9 types of syntactic comparisons, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 
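
The sketch below shows one way such an analogy file could be scored. It assumes the usual layout of the test set, where a line starting with `:` names a section and every other line holds four words `A B C D`. The helpers `parse_analogies` and `section_accuracy` and the sample lines are hypothetical; `most_similar` is assumed to behave like gensim's `KeyedVectors.most_similar`. Recent gensim versions also provide `KeyedVectors.evaluate_word_analogies`, which may be preferable in practice.

```python
def parse_analogies(lines):
    """Group 'A B C D' lines under the ': section' header that precedes them."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            if len(words) == 4 and current is not None:
                sections[current].append(tuple(words))
    return sections


def section_accuracy(questions, most_similar, vocab):
    """Fraction of in-vocabulary questions whose top prediction equals D.

    Returns None when no question is fully in vocabulary.
    """
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in vocab for w in (a, b, c, d)):
            continue  # skip questions with out-of-vocabulary words
        total += 1
        # A is to B as C is to D, so D should be near vec(B) - vec(A) + vec(C).
        predicted = most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        correct += predicted == d
    return correct / total if total else None


# Illustrative sample lines in the assumed file format.
sample = [
    ": capital-common-countries",
    "Paris France Tokyo Japan",
    ": family",
    "brother sister dad mom",
]
sections = parse_analogies(sample)
print(sorted(sections))  # ['capital-common-countries', 'family']
```

With the model loaded above, a call such as `section_accuracy(questions, model.most_similar, vocab)` would score one section, where `vocab` is the set of words the model knows.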

In [12]:
model.most_similar("پاکستان")

2018-06-29 12:34:53,829 : INFO : precomputing L2-norms of word weight vectors


[('بھارت', 0.6387995481491089),
 ('پاکستانی', 0.5963138937950134),
 ('نے', 0.5816565155982971),
 ('ملک', 0.5756391286849976),
 ('کہا', 0.5627798438072205),
 ('افغانستان', 0.5310071110725403),
 ('انہوں', 0.5267655253410339),
 ('کیلئے', 0.5223846435546875),
 ('عالمی', 0.5208458304405212),
 ('کیخلاف', 0.5103233456611633)]

In [7]:
model.most_similar(positive=['دہلی', 'پاکستان'], negative=['پنجاب'])

[('بھارت', 0.6196030974388123),
 ('بھارتی', 0.4861717224121094),
 ('انڈیا', 0.47880643606185913),
 ('ہندوستان', 0.4585074782371521),
 ('پاکستانی', 0.42618855834007263),
 ('نئی', 0.41773831844329834),
 ('دہلی:', 0.4116121530532837),
 ('ممبئی', 0.4058562219142914),
 ('ہندوستانی', 0.3850819170475006),
 ('ممبئی،', 0.38133400678634644)]
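
Some neighbours in the output above carry trailing punctuation (for example `'دہلی:'` and `'ممبئی،'`), which suggests that punctuation stayed attached to some tokens in the training text. As a hedged sketch, the hypothetical helper below drops such entries from a `most_similar` result list. The underscore is kept on the assumption that it joins multi-word tokens such as `اسلام_آباد`.

```python
import unicodedata

def has_punctuation(token, allowed="_"):
    """True if the token contains any Unicode punctuation other than `allowed`."""
    return any(
        ch not in allowed and unicodedata.category(ch).startswith("P")
        for ch in token
    )

def drop_punctuated(results):
    """Keep only (word, score) pairs whose word is free of punctuation."""
    return [(word, score) for word, score in results if not has_punctuation(word)]

# A few entries copied from the output above.
neighbours = [
    ("بھارت", 0.6196030974388123),
    ("دہلی:", 0.4116121530532837),
    ("ممبئی،", 0.38133400678634644),
]
cleaned = drop_punctuated(neighbours)
print([word for word, _ in cleaned])
```

Filtering the output only hides the symptom; stripping punctuation during tokenization before training would address the cause.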

In [17]:
model.most_similar(positive=['بغداد', 'یونان'], negative=['ایتھنز'])

[('عراق', 0.45472341775894165),
 ('عراقی', 0.41974303126335144),
 ('خودکش', 0.3903034031391144),
 ('ایران،', 0.36064645648002625),
 ('یمن', 0.34030547738075256),
 ('فرانس،', 0.3286679983139038),
 ('ادلب', 0.3282949924468994),
 ('موصل', 0.32810428738594055),
 ('دھماکے', 0.32595381140708923),
 ('حملہ', 0.3246413469314575)]

In [19]:
model.most_similar(positive=['ٹوکیو', 'پاکستان'], negative=['اسلام_آباد'])

[('جاپان', 0.518461287021637),
 ('جاپانی', 0.42522647976875305),
 ('بھارت', 0.3991791605949402),
 ('دنیا', 0.3974219858646393),
 ('چین', 0.3774305582046509),
 ('اوساکا', 0.3636421859264374),
 ('جاپان،', 0.35131868720054626),
 ('انڈیا', 0.3293466866016388),
 ('عالمی', 0.32560476660728455),
 ('جاپانیوں', 0.3245166540145874)]

In [20]:
model.most_similar(positive=['بھائی', 'لڑکی'], negative=['لڑکا'])

[('بہن', 0.5513333082199097),
 ('والد', 0.532108724117279),
 ('بیٹی', 0.5085018873214722),
 ('والدہ', 0.48878273367881775),
 ('کوقتل', 0.46216732263565063),
 ('بھائیوں', 0.45481085777282715),
 ('پولیس', 0.4398535490036011),
 ('باپ', 0.439206600189209),
 ('کزن', 0.417349249124527),
 ('خاتون', 0.4159335494041443)]

In [21]:
model.most_similar(positive=['ماں', 'دادا'], negative=['باپ'])

[('دادی', 0.43415290117263794),
 ('والد', 0.40385210514068604),
 ('والدہ', 0.3773915469646454),
 ('خالہ', 0.339709997177124),
 ('’میرے', 0.3385547697544098),
 ('چچا', 0.3361136019229889),
 ('دادا،', 0.33231672644615173),
 ('پوتے', 0.32845091819763184),
 ('بچے', 0.3262842893600464),
 ('میرے', 0.3215794563293457)]

In [22]:
model.most_similar(positive=['دلہن', 'شوہر'], negative=['دولہا'])

[('بیوی', 0.6536001563072205),
 ('خاوند', 0.6006074547767639),
 ('طلاق', 0.5600955486297607),
 ('خاتون', 0.5458393692970276),
 ('شادی', 0.5421558022499084),
 ('بیٹی', 0.5145429968833923),
 ('اداکارہ', 0.4982667863368988),
 ('ماں', 0.4932785630226135),
 ('عورت', 0.476948082447052),
 ('اہلیہ', 0.4722379744052887)]

In [23]:
model.most_similar(positive=['ملکہ', 'باپ'], negative=['بادشاہ'])

[('ماں', 0.5100770592689514),
 ('بیٹی', 0.4709329605102539),
 ('بیٹے', 0.42628371715545654),
 ('رشتے', 0.3735599219799042),
 ('بیٹوں', 0.3722909986972809),
 ('بہو', 0.37172698974609375),
 ('بیوی', 0.3640066385269165),
 ('بچی', 0.36133867502212524),
 ('شوہر', 0.36050519347190857),
 ('بہن', 0.3537209630012512)]

In [25]:
model['پاکستان']

array([ 0.07345428, -0.15035841,  0.10580754, -0.16741668, -0.11897424,
        0.03232108, -0.03160947,  0.15612677,  0.05198599, -0.00172963,
       -0.1187181 ,  0.02466597, -0.03736669,  0.1159278 , -0.01627684,
       -0.13789167,  0.04579989,  0.11536546,  0.00917979, -0.14214034,
        0.17768794, -0.11860704, -0.00795708, -0.14678237, -0.07083804,
        0.02779116,  0.25819954, -0.12104045,  0.04358017, -0.10631539,
       -0.10219189,  0.04885708, -0.06228257,  0.12631819,  0.11282982,
        0.17081013,  0.00242554,  0.00097731, -0.01763403,  0.05260813,
       -0.00712063, -0.2173309 , -0.05800922, -0.03741108, -0.08908761,
        0.21754497, -0.00073987, -0.02886635, -0.10344625,  0.00163667,
       -0.31038073, -0.23071307, -0.05358592, -0.19568776,  0.06148865,
       -0.0973029 , -0.03981625,  0.06310487, -0.04688233,  0.04291866,
       -0.02293593, -0.1755999 ,  0.11775573,  0.02200602,  0.11084885,
        0.1321153 ,  0.18022722, -0.10948982,  0.19681081, -0.18