# Word Embedding with Deep Learning
<br/>
<font color=green>__doc2vec and German elections__</font>

## Outline

1. Introduction to NLP

2. Case study: German elections with doc2vec
   - data preparation and overview
   - descriptive statistics

3. Theory
   - doc2vec and DBOW
   - step by step: matrices and manual application
   - additional notions: softmax, loss function and negative sampling

4. Code
   - gensim package
   - results overview

# Natural Language Processing

 - massive text corpora have accumulated over the years
 - turning text into numbers is never easy but badly needed
 - how can relationships between words be quantified?


### Word embedding as a solution
 - a vocabulary of 1 million words yields 1-million-dimensional one-hot vectors, so this is a dimensionality reduction problem
 - one-hot encoding yields very sparse data
 - TF-IDF and bag-of-words: count-based representations that are also sparse
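
To make the sparsity point concrete, here is a minimal sketch of one-hot and bag-of-words vectors in plain Python. The two-sentence corpus and the helper names `one_hot` and `bag_of_words` are made up for illustration and are not part of the case study.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the party wins the election",
    "the minister posts on facebook",
]

# Sorted vocabulary, so every word gets a fixed index.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(doc):
    """Return per-word counts for a document; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

election_vec = one_hot("election")  # one entry set, all others zero
bow_first = bag_of_words(docs[0])   # counts aligned with vocab

print(vocab)
print(election_vec)
print(bow_first)
```

Each vector is as long as the vocabulary, so with a realistic vocabulary almost every entry is zero.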

### Current arsenal
 - bag-of-words (BOW): simple but often ineffective because word order is lost
 - Latent Dirichlet Allocation (LDA): often more effective but harder to work with
 - word2vec and doc2vec
 - GloVe
 - ... and more to come

### Implementation areas
- document retrieval
- web search
- spam filtering
- topic modeling
- recommendations
- translation, etc.

### word2vec - familiar concepts used in an unusual way
- developed at Google (Mikolov, Chen et al., 2013)
- focuses on relationships between words
- one-hot encoding (id#4e2fsd) __vs__ dense vectors (0.54, 0.78, 0.12, ..., 0.83)
- captures synonyms, antonyms, and analogies
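
The contrast between one-hot IDs and dense vectors can be sketched with cosine similarity. The dense vectors below are hand-picked, hypothetical numbers (not trained embeddings), chosen only so that the classic analogy arithmetic works out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors: any two distinct words are orthogonal,
# so the encoding carries no notion of similarity.
one_hot_king = [1, 0, 0]
one_hot_queen = [0, 1, 0]

# Hypothetical dense vectors, hand-picked for illustration (not trained).
dense = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

# Analogy arithmetic: king - man + woman, then find the nearest word.
analogy = [k - m + w for k, m, w in
           zip(dense["king"], dense["man"], dense["woman"])]
closest = max(dense, key=lambda word: cosine(analogy, dense[word]))

print(cosine(one_hot_king, one_hot_queen))
print(closest)
```

With real embeddings the input words are usually excluded from the nearest-neighbour search; the toy numbers here make that step unnecessary.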

![vecs.jpeg](attachment:vecs.jpeg)

### Continuous Bag-of-Words model (CBOW)
1. words are feature vectors (semantic characteristics)
2. select the size of the sliding window
3. train
4. words become word vectors (calibrated semantic characteristics)
5. predict the missing word _(hello, Google search...)_
<img src="introImg1.jpeg" alt="img1" style="width: 600px;"/>
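
The sliding-window step can be sketched as follows. The function `cbow_pairs` and the example sentence are hypothetical and only show how (context, target) training examples could be cut out of a token list; the training itself is not shown.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

# Toy sentence (hypothetical, for illustration only).
tokens = "the minister posts on facebook".split()
pairs = cbow_pairs(tokens, window=1)

for context, target in pairs:
    print(context, "->", target)
```

A larger `window` gives each target more context words, and words at the edges of the text simply get a shorter context.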

### Skip-Gram model
1. reverse of CBOW: predict the surrounding words from a single input word
2. training data are (input word, context word) pairs, generated by moving a window across the text
3. the network learns the statistics from the number of times each pairing shows up
4. a neural network with one hidden layer is trained on a "fake" task: given an input word, predict a probability distribution over the words near it
5. words enter the network one-hot encoded; the input-to-hidden weights give us the word embeddings, so a hidden layer with 300 neurons yields 300-dimensional embeddings

<img src="introImg2.jpeg" alt="img2" style="width: 600px;"/>
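
A minimal sketch of the two mechanical pieces of skip-gram: generating word pairs with a moving window, and passing a one-hot input through a weight matrix. The names `skipgram_pairs`, `hidden_layer` and `W`, the toy sentence, and the placeholder weights are all assumptions for illustration; nothing is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, nearby_word) pairs from a moving window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence (hypothetical, for illustration only).
tokens = "the minister posts on facebook".split()
pairs = skipgram_pairs(tokens, window=1)

vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}
hidden_size = 3  # 300 in the example from the text; 3 keeps the output readable

# Hypothetical input-side weight matrix (untrained placeholder values),
# one row per vocabulary word, one column per hidden neuron.
W = [[0.1 * (r + 1) + 0.01 * c for c in range(hidden_size)]
     for r in range(len(vocab))]

def hidden_layer(word):
    """Multiply the word's one-hot vector by W, written out explicitly."""
    one_hot = [1 if i == index[word] else 0 for i in range(len(vocab))]
    return [sum(one_hot[r] * W[r][c] for r in range(len(vocab)))
            for c in range(hidden_size)]

embedding = hidden_layer("posts")
print(pairs)
print(embedding)
```

Because the input is one-hot, the matrix product just selects one row of `W`, so every word is mapped to a dense vector whose length equals the number of hidden neurons.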

We expected the ever-increasing use of social media in politics to be a good source of text data.

# Heiko Maas (SPD), minister of justice:

> In den vergangenen Wochen gab es eine sehr lebhafte und teilweise laute Diskussion um den Gesetzentwurf zur besseren Rechtsdurchsetzung in den sozialen Netzwerken. Ich freue mich, dass sich so viele Menschen an der Debatte beteiligt haben – auch hier auf Facebook.    

>**Over the past weeks there has been a very lively and at times loud discussion about the bill to improve law enforcement in social networks. I am glad that so many people have taken part in the debate, including here on Facebook.**

[...]

>Nur, wenn alle diesen Respekt zeigen, gibt es auch Freiheit für alle – und deshalb ist unser Gesetzentwurf keine Beschränkung der Meinungsfreiheit, sondern er stärkt und er schützt sie gegenüber denen, die sie verletzen. (Foto: DPA) Mehr Infos: www.bmjv.de/fair-im-netz

>**Only if everyone shows this respect is there freedom for everyone, and that is why our bill does not restrict freedom of speech; it strengthens it and protects it against those who violate it.**

Posted on June 30, 2017 on his Facebook page.

* likes: 204
* shares: 60
* angry: 63
* comments: 605

# Jens Spahn (CDU):

>Erschütternde Szenen aus Hamburg. Diese vermummten Linksfaschisten zerstören die Autos von Familien, Azubis, Bürgern, sie verletzen Menschen und skandieren Hass. Und zur Belohnung gibt es Applaus von den Linken und eine verständnisvolle Berichterstattung im öffentlich-rechtlichen. Ätzend. Die Polizei hat unsere volle Unterstützung verdient, wenn sie darauf mit der nötigen Härte reagiert. Punkt.

>**Shocking scenes from Hamburg. These masked left-wing fascists destroy the cars of families, apprentices and citizens; they injure people and chant hatred. And as a reward they get applause from the left and sympathetic coverage from the public broadcasters. Disgusting. The police deserve our full support when they respond with the necessary toughness. Period.**

Posted on July 7, 2017 on his Facebook page.

* shares: 16711
* likes: 18395
* angry: 8660
* comments: 5146

# Most successful post of Angela Merkel:

![alt text](angela_profile.png "merkel profile picture")

Her newest profile picture, posted on July 25, 2017.

* shares: 3128
* likes: 135151
* comments: 12774

# Data

* 1008 politicians and seven parties
* 177307 posts in total
* between **January 1** and **September 24, 2017** (election day)
