### Word2Vec is a word embedding approach
#### It maps each word to a dense vector and learns the vector's dimensions so that words used in similar contexts in a text end up close together in the vector space
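
"Close together" is usually measured with cosine similarity, the same score the later cells print. The following is a minimal sketch with hand-picked 3-dimensional vectors; the helper name `cosine_similarity` and the numbers are illustrative assumptions, not values from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand for illustration only.
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.35]
banana = [-0.2, 0.9, -0.5]

print(cosine_similarity(king, queen))   # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # negative: vectors point apart
```

A score near 1 means two vectors point in nearly the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposing directions.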

In [1]:
import nltk

In [2]:
from gensim.models import Word2Vec
from nltk.corpus import stopwords

In [3]:
import re

In [4]:
# a paragraph on neural networks
para = """A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data 
through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, 
either organic or artificial in nature. Neural networks can adapt to changing input; so the network generates the best possible 
result without needing to redesign the output criteria. The concept of neural networks, which has its roots in artificial 
intelligence, is swiftly gaining popularity in the development of trading systems.Neural networks, in the world of finance, 
assist in the development of such process as time-series forecasting, algorithmic trading, securities classification, credit 
risk modeling and constructing proprietary indicators and price derivatives.
A neural network works similarly to the human brain’s neural network. A “neuron” in a neural network is a mathematical function 
that collects and classifies information according to a specific architecture. The network bears a strong resemblance to 
statistical methods such as curve fitting and regression analysis.
A neural network contains layers of interconnected nodes. Each node is a perceptron and is similar to a multiple linear 
regression. The perceptron feeds the signal produced by a multiple linear regression into an activation function that may be 
nonlinear.In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input layer collects 
input patterns. The output layer has classifications or output signals to which input patterns may map. For instance, the 
patterns may comprise a list of quantities for technical indicators about a security; potential outputs could be “buy,” “hold” 
or “sell.”
Hidden layers fine-tune the input weightings until the neural network’s margin of error is minimal. It is hypothesized that 
hidden layers extrapolate salient features in the input data that have predictive power regarding the outputs. This describes 
feature extraction, which accomplishes a utility similar to statistical techniques such as principal component analysis.Neural 
networks are broadly used, with applications for financial operations, enterprise planning, trading, business analytics and 
product maintenance. Neural networks have also gained widespread adoption in business applications such as forecasting and 
marketing research solutions, fraud detection and risk assessment.
A neural network evaluates price data and unearths opportunities for making trade decisions based on the data analysis. The 
networks can distinguish subtle nonlinear interdependencies and patterns other methods of technical analysis cannot. According 
to research, the accuracy of neural networks in making price predictions for stocks differs. Some models predict the correct 
stock prices 50 to 60 percent of the time while others are accurate in 70 percent of all instances. Some have posited that a 10 
percent improvement in efficiency is all an investor can ask for from a neural network.
There will always be data sets and task classes that a better analyzed by using previously developed algorithms. It is not so 
much the algorithm that matters; it is the well-prepared input data on the targeted indicator that ultimately determines the 
level of success of a neural network."""

In [6]:
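# Data preprocessing
text = re.sub(r'\[[0-9]*\]', ' ', para)  # drop bracketed citation markers such as [12]
text = re.sub(r'\s+', ' ', text)         # collapse whitespace, including newlines
text = text.lower()                      # lowercase so 'Neural' and 'neural' share one token
text = re.sub(r'\d', ' ', text)          # replace every remaining digit with a space
text = re.sub(r'\s+', ' ', text)         # collapse the spaces left behind by the digits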
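
To see what these substitutions do, the same five steps can be wrapped in a function and run on a short string. This is only a sketch; the function name `clean` and the sample sentence are assumptions made for illustration.

```python
import re

def clean(raw):
    """Apply the same five steps as the preprocessing cell to any string."""
    text = re.sub(r'\[[0-9]*\]', ' ', raw)  # drop bracketed citation markers
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace and newlines
    text = text.lower()
    text = re.sub(r'\d', ' ', text)         # replace each digit with a space
    text = re.sub(r'\s+', ' ', text)        # collapse the spaces left by digits
    return text

sample = "Neural nets [3] reach 70\npercent accuracy."
print(clean(sample))
# neural nets reach percent accuracy.
```

Note that removing digits also removes figures such as "70 percent" from the paragraph, which is acceptable here because the numbers carry no meaning for the embedding.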

In [8]:
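# Tokenise into sentences, then words, and remove English stopwords.
# The NLTK 'punkt' and 'stopwords' data must be available (nltk.download) on first use.
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
print(sentences)  # stopwords are gone, but punctuation tokens such as '.' and ',' remain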

[['neural', 'network', 'series', 'algorithms', 'endeavors', 'recognize', 'underlying', 'relationships', 'set', 'data', 'process', 'mimics', 'way', 'human', 'brain', 'operates', '.'], ['sense', ',', 'neural', 'networks', 'refer', 'systems', 'neurons', ',', 'either', 'organic', 'artificial', 'nature', '.'], ['neural', 'networks', 'adapt', 'changing', 'input', ';', 'network', 'generates', 'best', 'possible', 'result', 'without', 'needing', 'redesign', 'output', 'criteria', '.'], ['concept', 'neural', 'networks', ',', 'roots', 'artificial', 'intelligence', ',', 'swiftly', 'gaining', 'popularity', 'development', 'trading', 'systems.neural', 'networks', ',', 'world', 'finance', ',', 'assist', 'development', 'process', 'time-series', 'forecasting', ',', 'algorithmic', 'trading', ',', 'securities', 'classification', ',', 'credit', 'risk', 'modeling', 'constructing', 'proprietary', 'indicators', 'price', 'derivatives', '.'], ['neural', 'network', 'works', 'similarly', 'human', 'brain', '’', 'ne
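
The printed lists still contain punctuation tokens (`'.'`, `','`, `';'`), because the stopword list covers words only. One possible extra step, which is not part of this notebook, is to drop every token that has no letter or digit. The helper name `drop_punctuation` and the sample list below are illustrative.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one alphanumeric character."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

sample = ['sense', ',', 'neural', 'networks', 'time-series', ';', '.', '“']
print(drop_punctuation(sample))
# ['sense', 'neural', 'networks', 'time-series']
```

Hyphenated tokens such as `'time-series'` survive because they contain letters; only pure punctuation is removed.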

In [9]:
model = Word2Vec(sentences, min_count=2)
# min_count=2 keeps only tokens that occur at least twice across all sentences
# of the paragraph; rarer tokens are dropped from the vocabulary.
words = model.wv.vocab  # gensim 3.x attribute; gensim >= 4.0 exposes model.wv.key_to_index instead

In [10]:
words

{'neural': <gensim.models.keyedvectors.Vocab at 0x15079635ef0>,
 'network': <gensim.models.keyedvectors.Vocab at 0x15079635f60>,
 'algorithms': <gensim.models.keyedvectors.Vocab at 0x15079635f98>,
 'data': <gensim.models.keyedvectors.Vocab at 0x15079635a20>,
 'process': <gensim.models.keyedvectors.Vocab at 0x15079635978>,
 'human': <gensim.models.keyedvectors.Vocab at 0x15079635940>,
 'brain': <gensim.models.keyedvectors.Vocab at 0x150796359b0>,
 '.': <gensim.models.keyedvectors.Vocab at 0x15078165e80>,
 ',': <gensim.models.keyedvectors.Vocab at 0x150775cecc0>,
 'networks': <gensim.models.keyedvectors.Vocab at 0x1507816db38>,
 'artificial': <gensim.models.keyedvectors.Vocab at 0x15077a95e48>,
 'input': <gensim.models.keyedvectors.Vocab at 0x15077a95d68>,
 ';': <gensim.models.keyedvectors.Vocab at 0x1506ea453c8>,
 'output': <gensim.models.keyedvectors.Vocab at 0x1506c44c9e8>,
 'development': <gensim.models.keyedvectors.Vocab at 0x1506c44cf98>,
 'trading': <gensim.models.keyedvectors.Voc
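
The vocabulary above contains only tokens that passed the `min_count=2` filter. The effect of that filter can be approximated by hand with a frequency count. This is a sketch on a toy corpus; `surviving_tokens` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

def surviving_tokens(tokenised_sentences, min_count):
    """Return the tokens whose total frequency is at least min_count."""
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

toy = [['neural', 'network', 'data'], ['neural', 'networks', 'data', 'brain']]
print(sorted(surviving_tokens(toy, min_count=2)))
# ['data', 'neural']
```

Here `'network'` and `'networks'` are counted separately, just as they appear as separate entries in the vocabulary above, since no stemming or lemmatisation is applied.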

In [13]:
vector = model.wv['analysis']
print(vector)
print(len(vector)) # 100: the vector for 'analysis' has one component per embedding dimension (the default size)

[ 1.7882499e-03 -7.3534460e-04  3.5770691e-03 -3.5137567e-03
 -1.3872167e-03 -4.8282347e-03 -2.2223189e-03  2.8980188e-03
  2.3327591e-03 -4.1577322e-03  1.7765195e-03  3.5341179e-03
 -2.6832540e-03  1.7329424e-03  4.0831664e-03 -4.7500078e-03
  3.0543541e-03 -3.2694545e-03  2.6768423e-03  1.7643366e-03
  2.9024247e-03 -4.6829074e-03 -1.9399243e-03 -5.2071345e-04
  3.7325355e-03 -3.4103862e-03 -1.0379353e-03 -2.3358960e-03
 -2.0473127e-03  4.7081904e-03 -3.7244470e-03  2.9085134e-03
 -1.8144966e-03 -3.9202212e-03 -3.8085724e-03  2.4605305e-03
 -4.4240062e-03 -4.6432149e-03  1.5382556e-04  3.0929346e-03
  4.2525982e-03 -2.0447746e-03 -3.8713627e-03  4.8777862e-03
  4.4917697e-03  1.3928565e-03  1.2972564e-03 -5.3491979e-04
  3.6973676e-03  1.1728632e-03  3.9730929e-03  1.8387869e-03
 -7.9959910e-04 -8.3132116e-05 -2.7351961e-03 -1.2522034e-04
 -1.4878893e-03  3.6511519e-03  2.5038652e-03 -3.7544046e-03
  4.6441276e-03  3.1028336e-03  1.3241119e-03  1.0910855e-03
 -2.7851027e-03 -1.14582

In [14]:
# Returns the 10 vocabulary words whose vectors are most similar (by cosine similarity) to the given word
similar = model.wv.most_similar('artificial')
similar

[(',', 0.2724544107913971),
 ('statistical', 0.24404123425483704),
 ('regression', 0.17220473289489746),
 ('collects', 0.14969736337661743),
 ('technical', 0.12971970438957214),
 ('neural', 0.12779946625232697),
 ('making', 0.12751170992851257),
 ('”', 0.11790597438812256),
 ('“', 0.11625520139932632),
 ('according', 0.11557957530021667)]

In [17]:
similar = model.wv.most_similar('neural')
similar

[('layers', 0.16600537300109863),
 ('hidden', 0.13354891538619995),
 ('multiple', 0.13139352202415466),
 ('artificial', 0.12779945135116577),
 ('networks', 0.12183421850204468),
 ('perceptron', 0.11365452408790588),
 ('outputs', 0.09578442573547363),
 ('data', 0.0943332314491272),
 ('development', 0.08591002225875854),
 ('layer', 0.0827164426445961)]

In [18]:
similar = model.wv.most_similar('price')
similar

[('trading', 0.2603369951248169),
 ('linear', 0.22607269883155823),
 ('methods', 0.19584564864635468),
 ('patterns', 0.17461207509040833),
 ('”', 0.15355277061462402),
 ('making', 0.13648977875709534),
 ('research', 0.11855696886777878),
 ('regression', 0.11451949179172516),
 ('algorithms', 0.10579060763120651),
 ('perceptron', 0.1013592928647995)]
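
The rankings above come from comparing one word's vector with every other vector in the vocabulary and sorting by cosine similarity. A minimal sketch of that idea follows, using hand-picked 2-dimensional vectors; the function and data are illustrative assumptions and do not reproduce gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d vectors, hand-picked for illustration.
toy_vectors = {
    'price':   [1.0, 0.1],
    'trading': [0.9, 0.2],
    'stock':   [0.7, 0.6],
    'brain':   [-0.3, 1.0],
}
print([w for w, _ in most_similar(toy_vectors, 'price')])
# ['trading', 'stock', 'brain']
```

The scores in the notebook output are all low (below 0.3) and include punctuation, which is what one would expect from a model trained on a single paragraph: there is too little text for the vectors to move far from their random initial values.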

In [24]:
# Create CBOW model (sg=0 is the default)
# Note: `size` is the gensim 3.x parameter name; gensim >= 4.0 calls it `vector_size`.
model1 = Word2Vec(sentences, min_count=1, size=200, window=20)
# Use model.wv.similarity; calling similarity directly on the model is deprecated.
print("Cosine similarity between 'neural' and 'networks' - CBOW : ", model1.wv.similarity('neural', 'networks'))
print("Cosine similarity between 'neural' and 'statistical' - CBOW : ", model1.wv.similarity('neural', 'statistical'))

# Create Skip Gram model (sg=1)
model2 = Word2Vec(sentences, min_count=1, size=200, window=20, sg=1)
print("Cosine similarity between 'neural' and 'networks' - Skip Gram : ", model2.wv.similarity('neural', 'networks'))
print("Cosine similarity between 'neural' and 'statistical' - Skip Gram : ", model2.wv.similarity('neural', 'statistical'))

Cosine similarity between 'neural' and 'networks' - CBOW :  0.12134572
Cosine similarity between 'neural' and 'statistical' - CBOW :  -0.064506575
Cosine similarity between 'neural' and 'networks' - Skip Gram :  0.3636789
Cosine similarity between 'neural' and 'statistical' - Skip Gram :  0.07178426
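
The two models differ in what they predict: CBOW predicts a target word from its surrounding context, while skip-gram (`sg=1`) predicts each context word from the target. The sketch below shows the training examples each approach would derive from one sentence. It is a simplification that ignores gensim details such as random window shrinking, subsampling and negative sampling, and all function names are illustrative.

```python
def context(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """(context words, target) pairs: the context predicts the centre word."""
    return [(context(tokens, i, window), tok) for i, tok in enumerate(tokens)]

def skipgram_examples(tokens, window):
    """(target, context word) pairs: the centre word predicts each neighbour."""
    return [(tok, ctx)
            for i, tok in enumerate(tokens)
            for ctx in context(tokens, i, window)]

sentence = ['hidden', 'layers', 'fine-tune', 'weightings']
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

With `window=20`, as used above, the context of almost every word spans its whole sentence, so both models effectively treat each sentence as one bag of co-occurring words.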


