# Training Word2Vec Model

### Reading and Exploring the Dataset
The dataset used here is a perspectives dataset that comprises the theses of key practitioners and academics. The data is stored as an Excel file and can be read using pandas.

In [43]:
import gensim

In [44]:
import pandas as pd

In [45]:
df = pd.read_excel("D:/github/Python/Deep Learning/annotation.xlsx")

In [46]:
df.head()

| Unnamed: 0 | sentence | indicator | factor | source |
|---|---|---|---|---|
| 0 | Where superiority of numbers is overwhelming, ... | superiority of numbers | physical | Clausewitz, 1989, p.196 |
| 1 | Grand strategy should calculate and develop ec... | economic resources, man-power | physical | Hart, 1991, 322 |
| 2 | Beyond geography, money has always been the gr... | finance | physical | Smith, 2019, 19. |
| 3 | War is not so much a matter of armaments as of... | finance | physical | Thucydides, 1972 |
| 4 | Moral elements are among the most important in... | Spirit and will | moral | Clausewitz, 1989, p.184 |


### Simple Preprocessing & Tokenization
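The first step in any data science task is to clean the data.
For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation.
We do the same here with `gensim.utils.simple_preprocess`, which lowercases the text and splits it into tokens. By default it also discards tokens shorter than 2 or longer than 15 characters, which is why a one-letter word such as 'a' does not appear in the tokenized output.

We can additionally remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, for example 'running' to 'run'.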
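
As a rough illustration of these cleaning steps, the sketch below lowercases a sentence, keeps alphabetic tokens, and drops stop words using only the standard library. The `STOP_WORDS` set, the `tokenize` and `remove_stop_words` helpers, and the example sentence are hypothetical and introduced here for illustration. The length limits imitate the defaults of `gensim.utils.simple_preprocess`, but this is not gensim's implementation, and reducing words to their root forms would need a stemmer or lemmatizer that is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and return alphabetic tokens within the length limits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Made-up sentence, not taken from the dataset.
sentence = "A commander and an army win the war."
tokens = tokenize(sentence)
content_tokens = remove_stop_words(tokens)

print(tokens)          # tokens after lowercasing and length filtering
print(content_tokens)  # tokens that remain after stop-word removal
```

Whether to remove stop words before training is a design choice: on a very small corpus, frequent function words can otherwise dominate the nearest-neighbour lists.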

In [47]:
df.sentence[0]

'Where superiority of numbers is overwhelming, it is admittedly is the most important factor in the outcome of an engagement. '

In [48]:
perspectives = df.sentence.apply(gensim.utils.simple_preprocess)

In [49]:
perspectives

0     [where, superiority, of, numbers, is, overwhel...
1     [grand, strategy, should, calculate, and, deve...
2     [beyond, geography, money, has, always, been, ...
3     [war, is, not, so, much, matter, of, armaments...
4     [moral, elements, are, among, the, most, impor...
5     [once, war, is, declared, the, highest, comman...
6     [grand, strategy, should, calculate, and, deve...
7     [in, war, three, quarters, turns, on, morale, ...
8     [force, employment, or, the, doctrine, and, ta...
9     [those, who, masters, moral, influence, weathe...
10    [know, the, enemy, and, know, yourself, in, hu...
11    [strategy, directs, armies, to, the, decisive,...
12    [strategy, is, determinant, of, victorious, wi...
13    [indirect, approach, creates, dislocation, and...
14    [concept, of, victory, has, three, elements, g...
15    [every, engagement, is, bloody, and, destructi...
16    [in, the, engagement, loss, of, marale, has, p...
17    [by, doctrine, mean, organization, control

### Training the Word2Vec Model

Train the model on the tokenized perspectives. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. The `min_count` parameter sets the minimum number of times a word must occur in the corpus to be kept in the vocabulary; with `min_count=2`, words that appear only once are ignored.

`workers` defines how many CPU threads are used for training.
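
To make the `window` and `min_count` parameters more concrete, here is a minimal pure-Python sketch of what they control. The helpers `filter_by_min_count` and `context_pairs` and the toy corpus are hypothetical; gensim's actual training also applies details such as subsampling that this sketch ignores.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count=2):
    """Keep only words that occur at least min_count times across the whole corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

def context_pairs(tokens, window=2):
    """Return (target, context) pairs for all words within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(start, end) if j != i)
    return pairs

# Made-up toy corpus, not taken from the dataset.
corpus = [
    ["strategy", "directs", "armies"],
    ["strategy", "wins", "wars"],
    ["armies", "fight", "wars"],
]

filtered = filter_by_min_count(corpus, min_count=2)
pairs = context_pairs(["strategy", "directs", "armies", "to", "victory"], window=1)

print(filtered)  # words seen only once in the corpus are gone
print(pairs)     # each word paired with its immediate neighbours
```

Note that `min_count` is applied to word frequencies over the entire corpus, not to sentence length.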

### Initializing the gensim Model

In [50]:
model = gensim.models.Word2Vec(
    window=10,    # consider up to 10 words before and 10 words after the target word
    min_count=2,  # ignore words that occur fewer than 2 times in the corpus
    workers=4,    # number of CPU threads used for training, e.g. 4 on a 4-core CPU
)

In [51]:
# Build the vocabulary from the tokenized sentences
model.build_vocab(perspectives, progress_per=100)

In [52]:
model.epochs

5

In [53]:
model.corpus_count

26

In [54]:
model.train(perspectives, total_examples=model.corpus_count, epochs=model.epochs)

(586, 2715)

In [55]:
model.save("pax.model")

### Experimenting with the Model

### Finding Similar Words and Similarity Between Words
See the gensim Word2Vec documentation: https://radimrehurek.com/gensim/models/word2vec.html

In [56]:
model.wv.most_similar("war")

[('victor', 0.25118643045425415),
 ('the', 0.2239464819431305),
 ('should', 0.21791531145572662),
 ('win', 0.20920003950595856),
 ('calculate', 0.16436129808425903),
 ('deciding', 0.13799598813056946),
 ('be', 0.12990830838680267),
 ('effects', 0.10774846374988556),
 ('loss', 0.10681787133216858),
 ('death', 0.09493274241685867)]

In [61]:
model.wv.similarity(w1="war", w2="engagement")  # cosine similarity between the two word vectors

0.033534527
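
The similarity score above is a cosine similarity between two word vectors. As a minimal sketch of that measure, the snippet below computes it by hand for made-up three-dimensional vectors; the `cosine_similarity` helper and the toy vectors are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (illustration only).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

Values near 1 indicate vectors pointing in nearly the same direction, while values near 0 indicate little relationship, which is the range the scores in this section fall into.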