### Word2Vec is a word embedding approach
#### It represents each word as a vector and learns the values along each dimension so that words used in similar contexts end up close together in the vector space.

![Word2Vec](https://user-images.githubusercontent.com/51756349/85860609-b1afd480-b7dc-11ea-9e7e-0aba671fc3d5.png)


![1_hELlVp9hmZbDZVFstS61pg](https://user-images.githubusercontent.com/51756349/85862210-14a26b00-b7df-11ea-92a2-1aeb2c6b813f.png)
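
To make "close together as vectors" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare word vectors. The 3-dimensional vectors are invented for illustration only; they are assumptions, not values from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example.
toy_vectors = {
    "neural":  [0.9, 0.8, 0.1],
    "network": [0.8, 0.9, 0.1],
    "price":   [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_vectors["neural"], toy_vectors["network"])
sim_unrelated = cosine_similarity(toy_vectors["neural"], toy_vectors["price"])
print(sim_related, sim_unrelated)  # the first value is much closer to 1
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are nearly unrelated.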


In [1]:
import nltk

In [2]:
from gensim.models import Word2Vec
from nltk.corpus import stopwords

In [3]:
import re

In [4]:
# a paragraph on neural networks
para = """A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data 
through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, 
either organic or artificial in nature. Neural networks can adapt to changing input; so the network generates the best possible 
result without needing to redesign the output criteria. The concept of neural networks, which has its roots in artificial 
intelligence, is swiftly gaining popularity in the development of trading systems.Neural networks, in the world of finance, 
assist in the development of such process as time-series forecasting, algorithmic trading, securities classification, credit 
risk modeling and constructing proprietary indicators and price derivatives.
A neural network works similarly to the human brain’s neural network. A “neuron” in a neural network is a mathematical function 
that collects and classifies information according to a specific architecture. The network bears a strong resemblance to 
statistical methods such as curve fitting and regression analysis.
A neural network contains layers of interconnected nodes. Each node is a perceptron and is similar to a multiple linear 
regression. The perceptron feeds the signal produced by a multiple linear regression into an activation function that may be 
nonlinear.In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input layer collects 
input patterns. The output layer has classifications or output signals to which input patterns may map. For instance, the 
patterns may comprise a list of quantities for technical indicators about a security; potential outputs could be “buy,” “hold” 
or “sell.”
Hidden layers fine-tune the input weightings until the neural network’s margin of error is minimal. It is hypothesized that 
hidden layers extrapolate salient features in the input data that have predictive power regarding the outputs. This describes 
feature extraction, which accomplishes a utility similar to statistical techniques such as principal component analysis.Neural 
networks are broadly used, with applications for financial operations, enterprise planning, trading, business analytics and 
product maintenance. Neural networks have also gained widespread adoption in business applications such as forecasting and 
marketing research solutions, fraud detection and risk assessment.
A neural network evaluates price data and unearths opportunities for making trade decisions based on the data analysis. The 
networks can distinguish subtle nonlinear interdependencies and patterns other methods of technical analysis cannot. According 
to research, the accuracy of neural networks in making price predictions for stocks differs. Some models predict the correct 
stock prices 50 to 60 percent of the time while others are accurate in 70 percent of all instances. Some have posited that a 10 
percent improvement in efficiency is all an investor can ask for from a neural network.
There will always be data sets and task classes that a better analyzed by using previously developed algorithms. It is not so 
much the algorithm that matters; it is the well-prepared input data on the targeted indicator that ultimately determines the 
level of success of a neural network."""

In [5]:
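# Data preprocessing
text = re.sub(r'\[[0-9]*\]', ' ', para)  # drop bracketed reference markers such as [1]
text = text.lower()                      # lowercase everything
text = re.sub(r'\d', ' ', text)          # replace digits with spaces
text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace, including newlines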

In [6]:
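# Tokenize into sentences, then into words, and remove English stop words
# Requires the NLTK 'punkt' and 'stopwords' data (install once via nltk.download)
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
print(sentences)  # tokens per sentence; punctuation tokens are still present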

[['neural', 'network', 'series', 'algorithms', 'endeavors', 'recognize', 'underlying', 'relationships', 'set', 'data', 'process', 'mimics', 'way', 'human', 'brain', 'operates', '.'], ['sense', ',', 'neural', 'networks', 'refer', 'systems', 'neurons', ',', 'either', 'organic', 'artificial', 'nature', '.'], ['neural', 'networks', 'adapt', 'changing', 'input', ';', 'network', 'generates', 'best', 'possible', 'result', 'without', 'needing', 'redesign', 'output', 'criteria', '.'], ['concept', 'neural', 'networks', ',', 'roots', 'artificial', 'intelligence', ',', 'swiftly', 'gaining', 'popularity', 'development', 'trading', 'systems.neural', 'networks', ',', 'world', 'finance', ',', 'assist', 'development', 'process', 'time-series', 'forecasting', ',', 'algorithmic', 'trading', ',', 'securities', 'classification', ',', 'credit', 'risk', 'modeling', 'constructing', 'proprietary', 'indicators', 'price', 'derivatives', '.'], ['neural', 'network', 'works', 'similarly', 'human', 'brain', '’', 'ne
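
The printed tokens still contain punctuation such as `'.'`, `','` and `';'`, which later show up in the vocabulary and in the similarity results. The notebook does not filter them; the helper below (`drop_punctuation`, a hypothetical name) is one possible way to do so, shown as a sketch rather than as part of the original pipeline.

```python
import string

def drop_punctuation(tokenized_sentences):
    """Remove tokens that consist only of punctuation characters."""
    # ASCII punctuation plus the curly quotes that appear in the paragraph.
    punct = set(string.punctuation) | {"’", "“", "”"}
    return [
        [tok for tok in sent if not all(ch in punct for ch in tok)]
        for sent in tokenized_sentences
    ]

# Small hand-written sample in the same list-of-token-lists shape as `sentences`.
sample = [["neural", ",", "networks", "."], ["time-series", ";", "’"]]
cleaned = drop_punctuation(sample)
print(cleaned)
```

Tokens that merely contain punctuation, such as `time-series`, are kept because only all-punctuation tokens are removed.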

In [7]:
model = Word2Vec(sentences, min_count=2)
# min_count=2 keeps only the tokens that occur at least twice across all sentences
# Note: model.wv.vocab exists in gensim 3.x; gensim 4.0+ replaces it with model.wv.key_to_index
words = model.wv.vocab

In [8]:
words

{'neural': <gensim.models.keyedvectors.Vocab at 0x293d3666fd0>,
 'network': <gensim.models.keyedvectors.Vocab at 0x293d2dfaa20>,
 'algorithms': <gensim.models.keyedvectors.Vocab at 0x293d47749e8>,
 'data': <gensim.models.keyedvectors.Vocab at 0x293d47749b0>,
 'process': <gensim.models.keyedvectors.Vocab at 0x293d4774a20>,
 'human': <gensim.models.keyedvectors.Vocab at 0x293d4774a58>,
 'brain': <gensim.models.keyedvectors.Vocab at 0x293d4774ac8>,
 '.': <gensim.models.keyedvectors.Vocab at 0x293d4774b00>,
 ',': <gensim.models.keyedvectors.Vocab at 0x293d4774b70>,
 'networks': <gensim.models.keyedvectors.Vocab at 0x293d4774ba8>,
 'artificial': <gensim.models.keyedvectors.Vocab at 0x293d4774be0>,
 'input': <gensim.models.keyedvectors.Vocab at 0x293d4774c88>,
 ';': <gensim.models.keyedvectors.Vocab at 0x293d4774cc0>,
 'output': <gensim.models.keyedvectors.Vocab at 0x293d4774cf8>,
 'development': <gensim.models.keyedvectors.Vocab at 0x293d4774d30>,
 'trading': <gensim.models.keyedvectors.Voc
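
The effect of `min_count` can be illustrated without gensim by counting token occurrences over the whole corpus. This is a sketch of the thresholding idea only, not gensim's internal code; `toy_sentences` and `tokens_kept` are hypothetical names introduced for the example.

```python
from collections import Counter

# A tiny made-up corpus in the same list-of-token-lists shape as `sentences`.
toy_sentences = [
    ["neural", "network", "data"],
    ["neural", "networks", "data"],
    ["hidden", "layers"],
]

def tokens_kept(tokenized_sentences, min_count):
    """Return the tokens whose total frequency is at least min_count."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

kept = tokens_kept(toy_sentences, min_count=2)
print(sorted(kept))  # ['data', 'neural']
```

With `min_count=1` every token would be kept, which is what the later cells in this notebook use.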

In [9]:
vector = model.wv['analysis']
print(vector)
print(len(vector)) # 100 by default: the word 'analysis' is represented by one value per embedding dimension

[ 2.2707290e-03 -4.2620939e-03 -9.7396289e-04  2.4291351e-03
 -1.3892971e-03  4.5290855e-03 -4.2491262e-03 -1.4231686e-03
 -4.5711634e-04  1.4904573e-03 -4.5003693e-04  3.8633293e-03
  4.5335721e-03  3.7968282e-03 -4.5820465e-03 -2.9537515e-03
  3.2541754e-03 -1.6681055e-03 -2.9305071e-03  2.4236403e-03
 -3.0991794e-03  3.7775543e-03  2.9707695e-03  3.7834714e-03
  1.3560386e-03  3.6193265e-03 -4.9997629e-03  4.5577675e-04
  4.5338240e-03  2.6881774e-03 -3.2558830e-03 -4.6232748e-03
 -3.1811534e-04 -3.9419634e-03  2.7059070e-03  3.5311822e-03
  3.7677403e-04 -3.7658469e-05 -3.4251995e-03 -8.1352255e-04
 -2.5429304e-03  9.4373716e-04 -3.5800645e-03  4.5270859e-03
 -3.9450689e-03  6.9655693e-04  1.5461041e-03  1.7647581e-03
 -3.4081824e-03  8.3387182e-05  4.4232449e-03 -4.1057039e-03
 -1.9614575e-03  2.4817986e-03 -3.5980509e-03  4.3026106e-03
 -1.9736392e-03 -3.9448808e-03  2.2245373e-03  2.6434494e-04
 -1.7638891e-03  3.1780011e-03  8.7420369e-04  4.2032823e-03
 -3.6597179e-05 -7.54954

In [10]:
# List the words whose vectors are most similar (by cosine similarity) to the given word
similar = model.wv.most_similar('artificial')
similar

[('hidden', 0.1423044502735138),
 ('outputs', 0.1080545112490654),
 ('forecasting', 0.09159283339977264),
 ('statistical', 0.0738331601023674),
 ('data', 0.07324998080730438),
 ('regression', 0.07109078019857407),
 ('similar', 0.06977398693561554),
 ('neural', 0.06019268557429314),
 ('risk', 0.05840431898832321),
 ('methods', 0.05552846938371658)]

In [11]:
similar = model.wv.most_similar('neural')
similar

[('interconnected', 0.2929634749889374),
 ('.', 0.16374357044696808),
 ('making', 0.16234111785888672),
 ('similar', 0.15972241759300232),
 ('network', 0.11722555756568909),
 ('business', 0.1146617978811264),
 ('data', 0.11317825317382812),
 ('layers', 0.10206985473632812),
 ('may', 0.0993332713842392),
 ('trading', 0.09606222063302994)]

In [12]:
similar = model.wv.most_similar('price')
similar

[('function', 0.22197595238685608),
 ('human', 0.14206382632255554),
 ('layer', 0.13823407888412476),
 ('statistical', 0.13101479411125183),
 ('collects', 0.12973207235336304),
 (',', 0.12816891074180603),
 ('perceptron', 0.12561340630054474),
 ('multiple', 0.12113900482654572),
 ('outputs', 0.11764165014028549),
 ('linear', 0.11197951436042786)]

### Word2Vec offers two neural network architectures for learning word vectors
### 1. CBOW - Continuous Bag of Words
### 2. Skip-gram
#### In the CBOW model, the distributed representations of the context (surrounding) words are combined to predict the word in the middle.
#### In the Skip-gram model, the distributed representation of the input word is used to predict its context.

![Architecture-of-Word2Vec-models-CBOW-and-Skip-Gram](https://user-images.githubusercontent.com/51756349/85862530-84b0f100-b7df-11ea-997e-b74b261cbc13.jpg)
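
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below generates those examples for a fixed window; the function names are hypothetical, and real implementations add refinements (such as negative sampling) that are omitted here.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word): the context predicts the middle word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

def skipgram_examples(tokens, window):
    """Yield (input_word, context_word): the input word predicts each neighbor."""
    for context, target in cbow_examples(tokens, window):
        for ctx_word in context:
            yield target, ctx_word

tokens = ["hidden", "layers", "extrapolate", "salient", "features"]
cbow = list(cbow_examples(tokens, window=1))
skip = list(skipgram_examples(tokens, window=1))

print(cbow[1])   # (['hidden', 'extrapolate'], 'layers')
print(skip[:3])  # [('hidden', 'layers'), ('layers', 'hidden'), ('layers', 'extrapolate')]
```

CBOW produces one example per position, while Skip-gram produces one example per (word, neighbor) pair, so it yields more training examples from the same text.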


In [13]:
# Create CBOW model (sg=0 is the default)
# Note: `size` is the gensim 3.x parameter name; gensim 4.0+ calls it `vector_size`
model1 = Word2Vec(sentences, min_count=1, size=200, window=20)
# Similarity is queried through model.wv; calling it on the model itself is deprecated
print("Cosine similarity between 'neural' and 'networks' - CBOW : ", model1.wv.similarity('neural', 'networks'))
print("Cosine similarity between 'neural' and 'statistical' - CBOW : ", model1.wv.similarity('neural', 'statistical'))

# Create Skip-gram model (sg=1)
model2 = Word2Vec(sentences, min_count=1, size=200, window=20, sg=1)
print("Cosine similarity between 'neural' and 'networks' - Skip Gram : ", model2.wv.similarity('neural', 'networks'))
print("Cosine similarity between 'neural' and 'statistical' - Skip Gram : ", model2.wv.similarity('neural', 'statistical'))

Cosine similarity between 'neural' and 'networks' - CBOW :  0.031239564
Cosine similarity between 'neural' and 'statistical' - CBOW :  0.12230882
Cosine similarity between 'neural' and 'networks' - Skip Gram :  0.3217716
Cosine similarity between 'neural' and 'statistical' - Skip Gram :  0.29052967


