# Word2Vec and Word Embedding

> https://wiki.pathmind.com/word2vec#:~:text=Word2vec%20is%20a%20two%2Dlayer,deep%20neural%20networks%20can%20understand. <br>
>https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469 <br>
> http://jalammar.github.io/illustrated-word2vec/


**Word Embedding**

- group similar words together in vector space
- distributed numerical representations of word features, most importantly context
- vectors are called "neural word embeddings"
- similar to an autoencoder, but trained against the words that neighbor them in the input corpus -> context is the learning motive
- words close to each other in vector space are semantically linked (synonyms or functionally identical)
- *Distributional Hypothesis in NLP*: Linguistic items with similar distributions have similar meanings.
- *Lemmatisation*: two occurrences of the same word in different forms count as the same context ("walk" and "walking")
- *Context*: the dimensionality depends on the context size (e.g. a 3-word or 5-word context); each column represents one possible preceding or following context word. Millions of dimensions are possible -> dimensionality reduction is necessary. Implementing the context function is complicated in itself.
- *Cosine Similarity* measures the angle between two vectors. The smaller the angle (cosine closer to 1), the more similar the two words are in the embedding space
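
The cosine similarity bullet can be checked with a few lines of plain Python. This is only a minimal sketch: the three-dimensional vectors below are made up for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "walk": [0.9, 0.1, 0.0],
    "run":  [0.8, 0.3, 0.1],
    "bus":  [0.0, 0.2, 0.9],
}

sim_walk_run = cosine_similarity(embeddings["walk"], embeddings["run"])
sim_walk_bus = cosine_similarity(embeddings["walk"], embeddings["bus"])
print(sim_walk_run > sim_walk_bus)  # "walk" points in nearly the same direction as "run"
```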

**Word2Vec**

- input is a corpus of text -> processed by a neural network -> producing a vector space representation with several hundred dimensions
- *Continuous Bag of Words (CBOW)* vs. *Skip-gram*: both look at a window of words, e.g. two in front and two behind. CBOW predicts the center word from its context; skip-gram predicts the context words from the center word. For skip-gram the input dataset has two columns, pairing the input word with each word in its window (e.g. [red,by],[red,a],[red,bus],[red,in] for "by a red bus in")
- *Negative Sampling*: a change of task: the model should predict whether two words are neighbors. This results in a simpler learning process (a logistic regression model). The dataset then includes the input word, the output word and a neighbor label. Problem: all the regular examples are positive -> the model could always output one. The fix is to infuse negative samples into the dataset by adding words randomly sampled from the vocabulary as output words and giving them a negative neighbor label.
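
The dataset construction described above can be sketched as follows. The function names are hypothetical, and drawing negatives uniformly from the vocabulary is a simplification assumed here; a random draw may also happen to hit a true neighbor, which this sketch does not filter out.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def add_negative_samples(pairs, vocabulary, num_negative=2, seed=0):
    """Label true pairs with 1 and add randomly drawn words labeled 0."""
    rng = random.Random(seed)
    dataset = []
    for center, context in pairs:
        dataset.append((center, context, 1))
        for _ in range(num_negative):
            # Assumption: uniform sampling from the vocabulary.
            dataset.append((center, rng.choice(vocabulary), 0))
    return dataset

tokens = "by a red bus in".split()
pairs = skipgram_pairs(tokens, window=2)
red_pairs = [p for p in pairs if p[0] == "red"]
dataset = add_negative_samples(red_pairs, vocabulary=tokens, num_negative=2)
```

`red_pairs` reproduces the four pairs from the example, and `dataset` holds each of them with label 1 plus two randomly drawn words with label 0.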

<div><img src="http://jalammar.github.io/images/word2vec/word2vec-embedding-context-matrix.png"></div>

**Word2Vec Training** 
<div><img src="http://jalammar.github.io/images/word2vec/word2vec-lookup-embeddings.png" width=65%></div>

Using the embedding values and the context values we can calculate (using the sigmoid of their dot product) a value between 0 and 1. Comparing it with the label gives an error value that can be used to update the embedding and context parameters to fit the labels.
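
A single update of this kind might look like the sketch below. It assumes plain gradient descent on the log loss with a hypothetical learning rate of 0.1 and made-up starting vectors; real implementations batch these updates.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(embedding, context, label, learning_rate=0.1):
    """One logistic-regression step on a (word, context word, label) triple.

    Updates both vectors in place and returns the prediction made
    before the update.
    """
    score = sum(e * c for e, c in zip(embedding, context))
    prediction = sigmoid(score)
    error = prediction - label  # gradient of the log loss w.r.t. the score
    for k in range(len(embedding)):
        e_k, c_k = embedding[k], context[k]
        embedding[k] -= learning_rate * error * c_k
        context[k] -= learning_rate * error * e_k
    return prediction

# Hypothetical starting values for one word and one context word.
embedding = [0.1, -0.2, 0.3]
context = [0.2, 0.1, -0.1]

before = train_step(embedding, context, label=1)
after = sigmoid(sum(e * c for e, c in zip(embedding, context)))
print(after > before)  # the prediction moved toward the label 1
```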


> <a href ="https://colab.research.google.com/drive/1neqSVpcGIIl3Q6bzKPto9Raumn5gHGdW#scrollTo=bEIJ4B2H8SIx"> Word2Vec Implementation </a>

# Kernel trick in SVM

In an SVM we assume that the dataset is linearly separable (by a line in 2D or a hyperplane in n dimensions). In practice, most datasets are not linearly separable. Example (left): 

<div><img src="https://miro.medium.com/max/700/1*mCwnu5kXot6buL7jeIafqQ.png"></div><br>
One approach is to map the dataset into a higher-dimensional space where it becomes linearly separable. In this example we suppose $x = (x_1,x_2,x_3)^T$ and $y =(y_1,y_2,y_3)^T$ are three-dimensional data points. Now suppose we want to map the data points into 9-dimensional space using a mapping function $\Phi$: $$\Phi (x) = (x_1^2,x_1x_2,x_1x_3,x_2x_1,x_2^2,x_2x_3,x_3x_1,x_3x_2,x_3^2) $$ and $$\Phi (y) = (y_1^2,y_1y_2,y_1y_3,y_2y_1,y_2^2,y_2y_3,y_3y_1,y_3y_2,y_3^2) $$
This introduces non-linearity, so that the dataset is mapped onto a non-linear surface. When we now calculate the dot product of the two mapped data points we get $$\Phi(x)^T\Phi(y) =\sum^3_{i,j=1}x_ix_jy_iy_j$$ The kernel trick relies on a mathematical identity that lets this nine-dimensional dot product be calculated in the original three-dimensional input space. This is denoted as the kernel function $$k(x,y) = (x^Ty)^2 = (x_1y_1+x_2y_2+x_3y_3)^2 = \sum^3_{i,j=1}x_ix_jy_iy_j$$ So the kernel trick is a way of calculating a higher-dimensional dot product in the lower-dimensional input space. 
For a good explanation video go to <a href="https://www.youtube.com/watch?v=wBVSbVktLIY">YouTube</a>, otherwise explain with screenshots in gDrive.
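
The identity can be checked numerically. In this small sketch, `phi` builds the nine products $x_ix_j$ explicitly and `kernel` computes $(x^Ty)^2$ in the input space; the two sample points are arbitrary values chosen for illustration.

```python
def phi(x):
    """Explicit map from 3 to 9 dimensions: all products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, y):
    """Polynomial kernel of degree 2: k(x, y) = (x^T y)^2."""
    return dot(x, y) ** 2

# Arbitrary sample points.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

explicit = dot(phi(x), phi(y))  # dot product computed in 9 dimensions
implicit = kernel(x, y)         # same value computed in 3 dimensions
print(explicit, implicit)       # 1024.0 1024.0
```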

> https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf <br>

# Application of Hopfield Networks

- Pattern Recognition
- Form basis of Restricted Boltzmann Machines and Deep Belief Nets
- When learning data is sparse, they can learn patterns from single examples
- Unsupervised learning, patterns are the input to the network
- Facial Recognition <- with low-resolution, binarized inputs (and a low number of patterns)

**How does the learning mechanism work?**

The neurons in a Hopfield net are initialized with the input. Each neuron holds a state $S$ with state value $a$, set from the input (each neuron corresponds to one input dimension). The network then learns by updating the weights $w$ according to Hebbian learning.  
$$w_{ij} = \sum_{k=1}^{P} a_i^{(k)}a_j^{(k)}$$ with $P$ being the number of patterns to be learned, and $a_i^{(k)}$ and $a_j^{(k)}$ being the state values of the $i$-th and $j$-th neuron in the $k$-th pattern.
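
A minimal sketch of the Hebbian rule is shown here. Several things are assumptions not stated above: states are bipolar ($+1$/$-1$), self-connections are left at zero, and the `recall` function adds the usual sign-based update to show a stored pattern being restored from a noisy input.

```python
def train_hebbian(patterns):
    """Build the weight matrix w_ij as the sum over patterns of a_i * a_j."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # assumption: no self-connections
                    weights[i][j] += pattern[i] * pattern[j]
    return weights

def recall(weights, state, max_sweeps=10):
    """Update neurons one at a time until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            activation = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if activation >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

stored = [1, -1, 1, -1, 1, -1]
weights = train_hebbian([stored])
noisy = [1, -1, 1, -1, 1, 1]      # last entry flipped
restored = recall(weights, noisy)
print(restored == stored)          # True
```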

> https://medium.com/@serbanliviu/hopfield-nets-and-the-brain-e5880070cdba

# Long Short Term Memory
> https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47 <br>

<div> <img src="https://miro.medium.com/max/700/1*Niu_c_FhGtLuHjrStkB_4Q.png" width=50%></div>

The symbols used here have the following meaning: <br>
a) X : Scaling of information (pointwise multiplication) <br>
b) + : Adding information <br>
c) σ : Sigmoid layer <br>
d) tanh: tanh layer <br>
e) h(t-1) : Output of last LSTM unit <br>
f) c(t-1) : Memory from last LSTM unit <br>
g) X(t) : Current input <br>
h) c(t) : New updated memory <br>
i) h(t) : Current output <br>
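
To connect the symbols to a computation, here is a sketch of one step of a standard LSTM cell, reduced to a single scalar unit for readability. The gate names and the weight values in `params` are hypothetical; the diagram above may group the operations slightly differently.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for a single scalar unit.

    params maps each gate name to (w_x, w_h, b), all hypothetical weights.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # sigmoid layer: how much of c(t-1) to keep
    i = gate("input", sigmoid)        # sigmoid layer: how much new information to add
    g = gate("candidate", math.tanh)  # tanh layer: candidate memory content
    o = gate("output", sigmoid)       # sigmoid layer: how much memory to expose

    c_t = f * c_prev + i * g          # X scales information, + adds it
    h_t = o * math.tanh(c_t)          # current output h(t)
    return h_t, c_t

# Made-up weights, identical for every gate, just to run one step.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, params=params)
```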

# Gated Recurrent Unit (GRU)

A GRU is similar to an LSTM but has no dedicated output gate and no separate cell state: the hidden state serves as both memory and output.
<div> <img src="https://upload.wikimedia.org/wikipedia/commons/3/37/Gated_Recurrent_Unit%2C_base_type.svg" width=50% ></div>
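
For reference, one common way to write the GRU update is given below. The symbols ($z_t$ for the update gate, $r_t$ for the reset gate, $W$, $U$, $b$ for weights and biases) are notation assumed here, and sources differ on whether $z_t$ or $1 - z_t$ multiplies the previous state.

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\hat{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t$$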

# RNN GANS for Abstract Reasoning Diagram Generation

> https://arnabgho.github.io/Contextual-RNN-GAN/ <br>
> https://github.com/ratschlab/RGAN <br>

<div><img src="https://www.researchgate.net/profile/Vinay_Namboodiri/publication/308744455/figure/fig1/AS:411855819427840@1475205487718/Context-RNN-GAN-model-where-the-generator-G-and-the-discriminator-D-where-Di-represents.png"></div>


The generator and discriminator models are both RNNs. 

# Recurrent GAN

<div><img src="https://www.researchgate.net/profile/Morteza_Mardani/publication/321347348/figure/fig1/AS:565866850336768@1511924578474/Recurrent-GAN-Top-the-closed-loop-circuit-diagram-and-right-unrolled-computational.png"></div>

https://medium.com/datadriveninvestor/autoencoders-like-google-but-for-your-genome-942b59046cbd

# Conditional Generative Adversarial Network - cGAN

> https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/ <br>

<div><img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/05/Example-of-a-Conditional-Generator-and-a-Conditional-Discriminator-in-a-Conditional-Generative-Adversarial-Network-1024x887.png" width = 60%></div><br>

An additional input is used that corresponds to the class labels. For example, in the case of the well-known MNIST dataset, the additional information is the digit shown in the image. These class labels are fed into the generator and the discriminator in addition to the usual GAN inputs. This allows more stable training and also targeted image generation. A best practice is to use an embedding layer followed by a fully connected layer with a linear activation that scales the embedding to the size of the image, before concatenating it in the model as an additional channel or feature map.
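
The label path described above (embedding, linear layer, reshape, concatenate) can be sketched without a deep learning framework. Everything here is illustrative: the function names, the 7x7 image size and the random weights are assumptions, and in a real cGAN the embedding table and dense weights are trained.

```python
import random

def make_label_channel(label, embedding_table, dense_weights, dense_bias, height, width):
    """Turn a class label into one extra image channel.

    Embedding lookup -> linear (fully connected) layer -> reshape to height x width.
    """
    embedded = embedding_table[label]
    flat = [
        sum(w * e for w, e in zip(row, embedded)) + b
        for row, b in zip(dense_weights, dense_bias)
    ]
    return [flat[r * width:(r + 1) * width] for r in range(height)]

def concat_channels(image_channels, label_channel):
    """Append the label channel to a channels-first image."""
    return image_channels + [label_channel]

rng = random.Random(0)
num_classes, embed_dim, height, width = 10, 4, 7, 7  # hypothetical sizes
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                   for _ in range(num_classes)]
dense_weights = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                 for _ in range(height * width)]
dense_bias = [0.0] * (height * width)

image = [[[0.0] * width for _ in range(height)]]  # one grayscale channel
label_channel = make_label_channel(3, embedding_table, dense_weights,
                                   dense_bias, height, width)
conditioned = concat_channels(image, label_channel)
print(len(conditioned))  # 2 channels: the image plus the label map
```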

**Spatial Embedding** based upon distance

- Principal Component Analysis (PCA)
- Multidimensional Scaling (MDS)


**Diffusion Embedding** 
- Diffusion map






**Not sure**
- Kernel PCA -> maximizes variance in dataset