# Pretrained Word2Vec Model on Urdu News Data

Word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”).
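
To make the vector arithmetic above concrete, here is a minimal sketch using hand-made 3-dimensional toy vectors. The vectors and the `analogy` helper are invented for illustration and do not come from the Urdu model; the sketch also skips the vector normalization that a library would typically apply before adding and subtracting.

```python
import math

# Toy vectors, invented for illustration only (not taken from any trained model).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

The input words are excluded from the candidates because the nearest vector to the target is often one of the query words themselves.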

## Loading the pretrained 300-dimensional Urdu word2vec model

This model was trained on data from 50,000 news posts.

In [1]:
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# Download the Word2Vec model from Google Drive (uncomment one of the two options below)
#!wget -O - 'https://drive.google.com/uc?export=download&id=13KLg3wUTOwWiF_YdAtZFe18j7MQmOWfb' > urdu_web_news_vector300.bin
# import urllib.request

# model_url = "https://drive.google.com/uc?export=download&id=13KLg3wUTOwWiF_YdAtZFe18j7MQmOWfb"
# file_name = "urdu_web_news_vector300.bin"
# urllib.request.urlretrieve(model_url, file_name)

In [3]:
# Loading the model
model = gensim.models.KeyedVectors.load_word2vec_format('urdu_web_news_vector300.bin', binary=True)
# print(model)
# print(model.wv.vocab)

2018-06-09 13:11:32,196 : INFO : loading projection weights from urdu_web_news_vector300.bin
2018-06-09 13:11:32,689 : INFO : loaded (24248, 300) matrix from urdu_web_news_vector300.bin


## Memory
At its core, `word2vec` model parameters are stored as matrices (NumPy arrays). Each array is **#vocabulary** (controlled by the `min_count` parameter) times **#size** (the `size` parameter) of floats (single precision, i.e. 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer `size=200`, the model will require approx. `100,000*200*4*3 bytes = ~229MB`.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
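
The arithmetic above can be wrapped in a small helper. This is a rough sketch: the function name `estimate_word2vec_ram` is made up here, it counts only the weight matrices (not the vocabulary structures), and the second call assumes the loaded Urdu file holds just the single (24248, 300) matrix reported in the loading log.

```python
def estimate_word2vec_ram(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Rough RAM estimate in bytes for the weight matrices only."""
    return vocab_size * vector_size * bytes_per_float * n_matrices

# The example from the text: 100,000 words, size=200, three matrices.
example_bytes = estimate_word2vec_ram(100_000, 200)
print(f"{example_bytes:,} bytes = ~{example_bytes / 2**20:.0f}MB")

# Assumption: the pretrained Urdu vectors are one (24248, 300) float32 matrix.
urdu_bytes = estimate_word2vec_ram(24_248, 300, n_matrices=1)
print(f"{urdu_bytes:,} bytes = ~{urdu_bytes / 2**20:.0f}MB")
```

Note that "MB" here means 2**20 bytes, which is how 240,000,000 bytes comes out to roughly 229MB.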

## Evaluating
Because `Word2Vec` training is an unsupervised task, there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 
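
As a rough sketch of how such an "A is to B as C is to D" test can be scored, the code below parses a few lines in the test set's plain-text layout (a `: section-name` header followed by four-word lines) and reports per-section accuracy. The sample lines, the function names, and the `toy_predict` stand-in are illustrative assumptions; a real run would read the actual file and query a trained model.

```python
def parse_analogies(text):
    """Group 'A B C D' lines under their ': section' headers."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            if current is not None and len(words) == 4:
                sections[current].append(tuple(words))
    return sections

def section_accuracy(sections, predict):
    """Fraction of analogies per section where predict(a, b, c) returns d."""
    scores = {}
    for name, items in sections.items():
        if items:
            correct = sum(1 for a, b, c, d in items if predict(a, b, c) == d)
            scores[name] = correct / len(items)
    return scores

# Illustrative sample in the same layout as the test file (not the real data).
sample = """\
: capital-common-countries
Paris France Tokyo Japan
: family
brother sister dad mom
: comparative
bad worse good better
"""

# Hypothetical stand-in for a model: it gets the last analogy wrong on purpose.
toy_answers = {
    ("Paris", "France", "Tokyo"): "Japan",
    ("brother", "sister", "dad"): "mom",
    ("bad", "worse", "good"): "best",
}

def toy_predict(a, b, c):
    return toy_answers.get((a, b, c))

sections = parse_analogies(sample)
scores = section_accuracy(sections, toy_predict)
print(scores)
```

With a loaded model, `predict` could instead wrap a call such as `model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]`, provided all three words are in the model's vocabulary.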

In [4]:
model.most_similar("پاکستان")

2018-06-09 13:11:37,295 : INFO : precomputing L2-norms of word weight vectors


[('افغانستان', 0.534391462802887),
 ('پاکستانی', 0.527515172958374),
 ('بھارت', 0.5176973342895508),
 ('زمبابوے', 0.5033701062202454),
 ('اورپاکستان', 0.491726279258728),
 ('خطے', 0.49089372158050537),
 ('امن', 0.484420508146286),
 ('ہندوستان', 0.48133471608161926),
 ('امریکہ', 0.4743611216545105),
 ('انڈیا', 0.4649897515773773)]

In [5]:
model.most_similar(positive=['دہلی', 'پاکستان'], negative=['پنجاب'])

[('دلی', 0.47923001646995544),
 ('انڈیا', 0.4310738444328308),
 ('بھارت', 0.4303123652935028),
 ('پاکستانی', 0.42918506264686584),
 ('بیجنگ', 0.42072150111198425),
 ('اسرائیل', 0.4158594310283661),
 ('کولکتہ', 0.4154762327671051),
 ('ماسکو', 0.40587425231933594),
 ('ہندوستان', 0.39771535992622375),
 ('نیپال', 0.3900505304336548)]

In [6]:
model['پاکستان'] # Raw NumPy vector of a word

array([ 0.3053402 , -0.94782573, -0.9116081 ,  1.4664174 ,  0.3587429 ,
       -2.006762  ,  0.84409386, -1.1533123 , -1.020578  , -2.472494  ,
       -0.35124967, -0.88023293,  1.082465  , -0.63575685,  1.9495397 ,
        0.8278821 ,  0.7919433 ,  1.0122516 ,  1.6798223 , -1.3990988 ,
       -1.1028582 , -0.06708375, -0.49618647,  1.4706877 ,  0.8937356 ,
       -0.05487218, -0.5642198 ,  0.16571091, -0.22957174, -0.0665009 ,
        1.4714681 , -0.4470899 ,  1.1184038 , -1.6914088 ,  0.21329756,
        0.23761903, -0.7907745 ,  1.875816  ,  1.2476456 ,  0.7480573 ,
       -0.24974996,  1.1688299 , -0.49197036, -0.37073052,  0.75272155,
       -0.8955356 ,  0.23544744,  0.63295424,  0.14902435,  1.802959  ,
        0.7852391 , -0.2467207 ,  0.5826205 ,  0.31808347,  0.365385  ,
       -0.396788  , -0.98084587, -0.31967983, -0.08042665,  0.06011335,
        0.7501427 , -0.04102356, -0.3784863 , -0.8184698 , -0.9788742 ,
        0.5499173 , -0.18837541, -0.9261864 , -0.8445961 ,  0.82