# Application of Doc2Vec in gensim

on facebook posts from politicans in the German election 2017

**GOAL:** to let the model find a distinctiveness in the data, e.g. difference between party candidates

**Approach**:

- two models **PV-DBOW** and **PV-DM**
- 100 dimensions
- for 1, 5, 10 and 20 epochs
- ...

*Remember:*
- one post = one doc

**Solution:** 
- Gensim allows to tag documents. Our tag = name of the candidate
- model creates and trains vectors for the candidates

#### Code

In [2]:
from plotly import __version__
from plotly import offline as pyoff

In [3]:
pyoff.init_notebook_mode(connected=True)

In [4]:
import pandas as pd
import numpy as np
import gensim
from nltk.corpus import stopwords
import logging
import multiprocessing
import os
from collections import namedtuple

#FORMAT = '%(asctime)s %(levelname)s %(message)s'
#DATEFORMAT = '%Y-%m-%d %H:%M:%S'
#logging.basicConfig(level=logging.INFO,
#                    format=FORMAT,
#                    datefmt=DATEFORMAT)

In [5]:
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
data_dir = os.path.join(parent_dir, 'data')
models_dir = os.path.join(parent_dir, 'models')
print('working directory: ', os.getcwd())
print('data directory:    ', data_dir, )
print('models directory:  ', models_dir)

working directory:  /Volumes/Datahouse/Users/Stipe/Documents/Studium/Master VWL/5 WS 2017/Seminar Information Systems/InformationSystemsWS1718/notebooks
data directory:     /Volumes/Datahouse/Users/Stipe/Documents/Studium/Master VWL/5 WS 2017/Seminar Information Systems/InformationSystemsWS1718/data
models directory:   /Volumes/Datahouse/Users/Stipe/Documents/Studium/Master VWL/5 WS 2017/Seminar Information Systems/InformationSystemsWS1718/models


##### loads the data

In [10]:
data = pd.read_pickle(os.path.join(data_dir, 'data_clean_4cols.pickle'))
data = data.drop(['id'], axis=1)

pd.set_option('max_colwidth', 1000)
data.sample(n=10)

Unnamed: 0,from_name,message,Partei_ABK
27810,Ute Finckh-Krämer,"Heute früh war ich zunächst in Lankwitz, um verwilderte Spielplätze zu besichtigen. Dann ""Fishbowl""-Diskussion beim deutsch-russischen Festival des Jugendaustauschs. Etwa 100 der 400 Teilnehmer*innen nahmen daran teil. Ich habe etliche Wünsche zum Thema Visa und Finanzierung mitgenommen.",SPD
76096,Wolfgang Kubicki,LIVE: WK schwört sich und sein Team auf den Wahlkampf ein! #dasbestefürsh TK,FDP
100446,Norbert Müller,#Schulverpflegung: Gut Lernen und gut essen gehören zusammen! Die Fraktion DIE LINKE. im Bundestag fordert ein »Bundesprogramm #Kita- und Schulverpflegung« für eine hochwertige und beitragsfreie Verpflegung,DIE LINKE
31998,Thomas Gebhart,"Am 24. September wählen wir den neuen Deutschen Bundestag. Manche dürfen ihre Stimme zum ersten Mal abgeben und freuen sich darauf. Vielleicht sehen einige das Ganze auch kritisch und und fragen sich, ob sie mit ihrer Stimme einen Einfluss haben oder was die Politik für sie tut. \nDieser Frage geht Ingo Espenschied auf den Grund: In einer multimedialen Zeitreise durch den Bundestag hat er hinter die Kulissen des Parlaments geblickt. Dazu hat er mir in Berlin zwei Wochen über die Schulter geschaut und mich in der Südpfalz begleitet. Sein Ergebnisse präsentiert er am 19. September in Herxheim in seiner neuen Dokumentation „Deine Stimme zählt!?“\nZur Premiere lade ich alle herzlich ein, gemeinsam einen unterhaltsamen Abend zu verbringen. Bringt dazu gerne auch Freunde und Bekannte mit oder ladet sie hier bei Facebook zur Veranstaltung ein. Gerne beantworte ich im Anschluss an die Vorführung offene Fragen und stehe für Gespräche zur Verfügung. \nIch freue mich auf eine spannende Dokume...",CDU
167173,Kai Whittaker,Veränderungen können auch ein Neuanfang sein. \n\n#digitalerstaat #egovernment #kaiwhittaker #cdu #bundestag,CDU
62553,Kirsten Kappert-Gonther,"Fantastisch - 500 SchülerInnen aus über 80 Ländern führen mit einem der führenden Orchestern der Welt auch dieses Jahr wieder eine furiose Stadtteiloper auf! Wenn es dieses Projekt nicht gäbe, müsste man es erfinden. Die Teilnahme an dieser Aufführung wird für alle Beteiligten einen positiven Einfluss auf ihr Leben haben - standing ovations, chapeau!",GRÜNE
95926,Jan Metzler,Herzliche Einladung auch nach Schornsheim zu unserem Info-Stand. Gemeinsam mit der CDU stehe ich Rede und Antwort - für Essen und Trinken ist gesorgt!,CDU
39513,"Manfred Grund, MdB",Heute letzter Tag des Info-Mobils in Nordhausen. Habe meinen Besuch für Bürgergespräche genutzt.,CDU
95272,Dr. Michael Meister,"Firmenbesuch bei Fa. Fieger – zusammen mit CDU Birkenau\n\nAuf dem Bild: Michael Fieger, Prokurist und Vertriebsleiter\nBirgit Heitland",CDU
144927,Petra Sitte,"Noch 18 Tage bis zur #Bundestagswahl\n#83 Gerechtigkeit für die Menschen in Ostdeutschland - Rentengerechtigkeit [1/2]\n\nAuch im dritten Jahrzehnt nach der deutschen #Einheit sind die Menschen in Ostdeutschland in vielen Bereichen nicht gleichgestellt: Der Rentenwert und die #Standardrente sind niedriger, die Löhne und Gehälter sind durchschnittlich niedriger, die #Wirtschaftsleistung liegt immer noch über ein Viertel unter der der westlichen Bundesländer.\n\nSo unterschiedlich die persönlichen Erfahrungen vor und nach 1989 waren und sind und so vielfältig deren Bewertungen ausfallen – es bleibt, dass die Wirklichkeit in den ostdeutschen Ländern weit hinter dem zurückbleibt, was in der #Wendezeit versprochen und erwartet wurde. Die Art und Weise der #Vereinigung hat die Lebensperspektiven von vielen Menschen beschnitten. Und neue strukturelle Probleme sind entstanden. Auch hoher Leistungswille aller Generationen im Osten, auch ausgeprägte Bereitschaft zum Verzicht sowie große Inve...",DIE LINKE


##### gives doc2vec a challenge

In [32]:
# removes party mentions to test w
for char in ['SPD', 'spd', 'FDP', 'fdp', 'CDU' 'cdu', 'AfD' 'afd', 'AFD', 'Grüne', 'GRÜNE', 'Die Grünen', 'GRÜNEN', 'Linke', 'LINKE', 'CSU', 'csu', 'Die Linke', 'DIE LINKE',]:
        data.message = data.message.str.replace(char, '')

In [33]:
any(data.message.str.count('FDP')>0)

False

In [34]:
import string
from nltk.tokenize.casual import TweetTokenizer

stopword_set = set(stopwords.words('german'))
tokenizer = TweetTokenizer
MessageDoc = namedtuple('MessageDoc', 'words tags')

alldocs = []  # Will hold all doacs in original order
for line_no, line in data.iterrows():
    #import pdb; pdb.set_trace()
    message = line.message.lower()
    words = [word for word in tokenizer().tokenize(message) if word not in stopword_set and word not in string.punctuation]
    tags = [str(line_no), line['from_name']] #, line['Partei_ABK']] # line_no needs to be converted as string to be included in tags 
    alldocs.append(MessageDoc(words, tags))

In [9]:
message = data.message[7433].lower()
words = [word for word in tokenizer().tokenize(message) if word not in stopword_set and word not in string.punctuation]

print('Post', '\n', message, '\n', '\n', 'Tokenization', '\n', words)

Post 
 mein ärger der woche beim sonntags-stammtisch: die abschiebung von ahmad pouya. der afghanischen künstler und musiker lebte bis letzten freitag 6 jahre in deutschland, spricht fließend deutsch (und daneben fünf weitere sprachen) und war vorbildlich in deutschland integriert. er hat als dolmetscher gearbeitet und sollte eine beratungsstelle für geflüchtete leiten. nun wurde er - trotz vieler fürsprecher - eiskalt nach afghanistan abgeschoben um ein exempel zu statuieren. noch immer bin ich fassungslos über dieses vorgehen des innenministers. 
 
 Tokenization 
 ['ärger', 'woche', 'beim', 'sonntags-stammtisch', 'abschiebung', 'ahmad', 'pouya', 'afghanischen', 'künstler', 'musiker', 'lebte', 'letzten', 'freitag', '6', 'jahre', 'deutschland', 'spricht', 'fließend', 'deutsch', 'daneben', 'fünf', 'weitere', 'sprachen', 'vorbildlich', 'deutschland', 'integriert', 'dolmetscher', 'gearbeitet', 'beratungsstelle', 'geflüchtete', 'leiten', 'wurde', 'trotz', 'vieler', 'fürsprecher', 'eiskalt'

In [35]:
alldocs[7433]

MessageDoc(words=['ärger', 'woche', 'beim', 'sonntags-stammtisch', 'abschiebung', 'ahmad', 'pouya', 'afghanischen', 'künstler', 'musiker', 'lebte', 'letzten', 'freitag', '6', 'jahre', 'deutschland', 'spricht', 'fließend', 'deutsch', 'daneben', 'fünf', 'weitere', 'sprachen', 'vorbildlich', 'deutschland', 'integriert', 'dolmetscher', 'gearbeitet', 'beratungsstelle', 'geflüchtete', 'leiten', 'wurde', 'trotz', 'vieler', 'fürsprecher', 'eiskalt', 'afghanistan', 'abgeschoben', 'exempel', 'statuieren', 'immer', 'fassungslos', 'vorgehen', 'innenministers'], tags=['7433', 'Margarete Bause'])

In [36]:
len(alldocs)

177307

In [37]:
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

# doc2vec model

#### sets parameters of the models

ran several models with changing parameters

In [16]:
from collections import OrderedDict

models = [
    #gensim.models.Doc2Vec(dm=0, size=60, negative=5, hs=0, min_count=5, workers=cores),
    gensim.models.Doc2Vec(dm=0, dbow_words=0, size=100, negative=5, hs=0, min_count=5, workers=cores),
    #gensim.models.Doc2Vec(dm=1, dm_concat=1, size=100, window=2, negative=5, hs=0, min_count=2, iter=1, workers=cores),
    gensim.models.Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=5, workers=cores),
]

#### builds vocabulary

In [17]:
# Speed up setup by sharing results of the 1st model's vocabulary scan
models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(models[0])
for model in models[1:]:
    model.reset_from(models[0])
    print(model)

Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)
Doc2Vec(dm/m,d100,n5,w10,mc5,s0.001,t4)


In [18]:
models_by_name = OrderedDict((str(model), model) for model in models[0:])

#from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
#models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([models[1], models[2]])
#models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([models[1], models[2]])

In [19]:
models_by_name

OrderedDict([('Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)',
              <gensim.models.doc2vec.Doc2Vec at 0x1a161372e8>),
             ('Doc2Vec(dm/m,d100,n5,w10,mc5,s0.001,t4)',
              <gensim.models.doc2vec.Doc2Vec at 0x1a16137358>)])

#### trains the models

In [26]:
epochs = 20

for name, train_model in models_by_name.items():
    train_model.train(alldocs, total_examples=len(alldocs), epochs=epochs, start_alpha=0.025, end_alpha=0.001)

#### saves the models

In [1]:
# saves the model
from datetime import datetime

count = 1
for name, trained_model in models_by_name.items():
    #import pdb; pdb.set_trace()
    name = name.replace('/', '')
    if count>len(models):
        break
    else:
        trained_model.save(os.path.join(models_dir, name + '_e' + str(epochs) + '.model'))
        count += 1
#logger.info('model saved')

NameError: name 'models_by_name' is not defined

#### creates candidate specific data

In [38]:
data2 = pd.read_pickle(os.path.join(data_dir, 'data_clean_4cols.pickle'))
candidate_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
#                      .set_index('from_name')
                 )

In [39]:
candidate_data.tail(10)

Unnamed: 0,from_name,Partei_ABK
173042,Dr. Daniela De Ridder,SPD
173559,Björn Simon,CDU
173902,Waldemar Westermayer,CDU
173963,AfD Party,AfD
174481,CDU Party,CDU
174974,SPD Party,SPD
175488,CSU Party,CSU
176078,GRÜNE Party,GRÜNE
176425,FDP Party,FDP
176969,DIE LINKE Party,DIE LINKE


In [45]:
party_colors = {'AfD': 'rgb(0, 0, 153)',
                'DIE LINKE': 'rgb(204, 0, 102)',
                'GRÜNE': 'rgb(0, 153, 0)',
                'CSU': 'rgb(102, 178, 255)',
                'CDU': 'rgb(0, 0, 0)',
                'FDP': 'rgb(255, 255, 51)',
                'SPD': 'rgb(255, 0, 0)'}

candidate_data['color'] = candidate_data['Partei_ABK'].map(party_colors)

leaders = ['Sahra Wagenknecht',
'Dietmar Bartsch',
'Katrin Göring-Eckardt',
'Cem Özdemir',
'Martin Schulz',
'Angela Merkel',
'Joachim Herrmann',
'Alexander Gauland',
'Alice Weidel']

candidate_data['leader'] = data['from_name'].isin(leaders)

candidate_data['size'] = 5
candidate_data.loc[candidate_data['from_name'].str.contains('Party'),'size'] = 10
candidate_data.loc[candidate_data['leader'], 'size'] = 8

candidate_data['symbol'] = 100
candidate_data.loc[candidate_data['from_name'].str.contains('Party'),'symbol'] = 101
candidate_data.loc[candidate_data['leader'], 'symbol'] = 114

# plots

### Code

In [40]:
from sklearn.manifold import TSNE
import plotly.graph_objs as go
from plotly import tools

from time import time
import numpy as np

In [41]:
model1_name = 'Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e1.model'
model2_name = 'Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e1.model' 
model3_name = 'Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e5.model'
model4_name = 'Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e5.model'
model5_name = 'Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e10.model'
model6_name = 'Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e10.model'
model7_name = 'Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e20.model'
model8_name = 'Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e20.model'

model1 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model1_name))
model2 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model2_name))
model3 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model3_name))
model4 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model4_name))
model5 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model5_name))
model6 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model6_name))
model7 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model7_name))
model8 = gensim.models.Doc2Vec.load(os.path.join(models_dir, model8_name))

In [42]:
fig1 = tools.make_subplots(rows=1, cols=2, subplot_titles=[model1_name, model2_name])
fig2 = tools.make_subplots(rows=1, cols=2, subplot_titles=[model3_name, model4_name])
fig3 = tools.make_subplots(rows=1, cols=2, subplot_titles=[model5_name, model6_name])
fig4 = tools.make_subplots(rows=1, cols=2, subplot_titles=[model7_name, model8_name])

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



#### Reducing dimensionality with t-SNE (t-distributed stochastic neighbor embedding)

In [43]:
# vector for model one
mask = [tag in candidate_data['from_name'].values for tag in model1.docvecs.offset2doctag]
candidate_vecs = model1.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne1 = tsne.fit_transform(X)

# vector for model two
mask = [tag in candidate_data['from_name'].values for tag in model2.docvecs.offset2doctag]
candidate_vecs = model2.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne2 = tsne.fit_transform(X)

# vector for model three
mask = [tag in candidate_data['from_name'].values for tag in model3.docvecs.offset2doctag]
candidate_vecs = model3.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne3 = tsne.fit_transform(X)

# vector for model four
mask = [tag in candidate_data['from_name'].values for tag in model4.docvecs.offset2doctag]
candidate_vecs = model4.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne4 = tsne.fit_transform(X)

# vector for model five
mask = [tag in candidate_data['from_name'].values for tag in model5.docvecs.offset2doctag]
candidate_vecs = model5.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne5 = tsne.fit_transform(X)

# vector for model six
mask = [tag in candidate_data['from_name'].values for tag in model6.docvecs.offset2doctag]
candidate_vecs = model6.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne6 = tsne.fit_transform(X)

# vector for model seven
mask = [tag in candidate_data['from_name'].values for tag in model7.docvecs.offset2doctag]
candidate_vecs = model7.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne7 = tsne.fit_transform(X)

# vector for model eight
mask = [tag in candidate_data['from_name'].values for tag in model8.docvecs.offset2doctag]
candidate_vecs = model8.docvecs.doctag_syn0[mask]

X = candidate_vecs
tsne = TSNE(n_components=2)
X_tsne8 = tsne.fit_transform(X)

In [46]:
trace = go.Scatter(x=X_tsne1[:, 0], y=X_tsne1[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig1.append_trace(trace, 1, 1)

trace = go.Scatter(x=X_tsne2[:, 0], y=X_tsne2[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig1.append_trace(trace, 1, 2)

trace = go.Scatter(x=X_tsne3[:, 0], y=X_tsne3[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig2.append_trace(trace, 1, 1)

trace = go.Scatter(x=X_tsne4[:, 0], y=X_tsne4[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig2.append_trace(trace, 1, 2)

trace = go.Scatter(x=X_tsne5[:, 0], y=X_tsne5[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig3.append_trace(trace, 1, 1)

trace = go.Scatter(x=X_tsne6[:, 0], y=X_tsne6[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig3.append_trace(trace, 1, 2)

trace = go.Scatter(x=X_tsne7[:, 0], y=X_tsne7[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig4.append_trace(trace, 1, 1)

trace = go.Scatter(x=X_tsne8[:, 0], y=X_tsne8[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
fig4.append_trace(trace, 1, 2)

In [47]:
fig1['layout'].update(title='Models with epochs=1', showlegend=False)
fig2['layout'].update(title='Models with epochs=5', showlegend=False)
fig3['layout'].update(title='Models with epochs=10', showlegend=False)
fig4['layout'].update(title='Models with epochs=20', showlegend=False)

## t-SNE plots (t-distributed stochastic neighbor embedding)

an algorithm to reduce number of dimensions

In [71]:
pyoff.iplot(fig1, filename='tsnePlot1_epochs1')

In [49]:
pyoff.iplot(fig2, filename='tsnePlot1_epochs5')

In [50]:
pyoff.iplot(fig3, filename='tsnePlot1_epochs10<b')

In [51]:
pyoff.iplot(fig4, filename='tsnePlot1_epochs20')

**But:** 

- we removed the party names in the posts to avoid that doc2vec identifies party names as the distinctive pattern in the data

# similarity between candidate and parties

#### Code

In [52]:
# calculate similarity for all candidates and parties
candidate1_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate2_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate3_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate4_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate5_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate6_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate7_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )
candidate8_data = (data2.drop(['id', 'message'], axis=1)
                      .drop_duplicates('from_name')
                                   )

# calculate similarity for all candidates and parties
for party in ['SPD Party', 'CDU Party', 'DIE LINKE Party', 'AfD Party', 'CSU Party', 'GRÜNE Party', 'FDP Party']:
    candidate1_data[party] = candidate1_data['from_name'].map(lambda candidate: model1.docvecs.similarity(candidate, party))
    candidate2_data[party] = candidate2_data['from_name'].map(lambda candidate: model2.docvecs.similarity(candidate, party))
    candidate3_data[party] = candidate3_data['from_name'].map(lambda candidate: model3.docvecs.similarity(candidate, party))
    candidate4_data[party] = candidate4_data['from_name'].map(lambda candidate: model4.docvecs.similarity(candidate, party))
    candidate5_data[party] = candidate5_data['from_name'].map(lambda candidate: model5.docvecs.similarity(candidate, party))
    candidate6_data[party] = candidate6_data['from_name'].map(lambda candidate: model6.docvecs.similarity(candidate, party))
    candidate7_data[party] = candidate6_data['from_name'].map(lambda candidate: model7.docvecs.similarity(candidate, party))
    candidate8_data[party] = candidate8_data['from_name'].map(lambda candidate: model8.docvecs.similarity(candidate, party))

# make a new column holding which party is most similar
candidate1_data['most similar'] = candidate1_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate2_data['most similar'] = candidate2_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate3_data['most similar'] = candidate3_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate4_data['most similar'] = candidate4_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate5_data['most similar'] = candidate5_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate6_data['most similar'] = candidate6_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate7_data['most similar'] = candidate7_data.loc[:,'SPD Party':].idxmax(axis=1)
candidate8_data['most similar'] = candidate8_data.loc[:,'SPD Party':].idxmax(axis=1)

In [53]:
candidate1_data

Unnamed: 0,from_name,Partei_ABK,SPD Party,CDU Party,DIE LINKE Party,AfD Party,CSU Party,GRÜNE Party,FDP Party,most similar
0,Valentin Abel,FDP,0.761478,0.795627,0.737060,0.444924,0.847686,0.804785,0.850111,FDP Party
93,Dr. Michael von Abercron,CDU,0.838051,0.827695,0.874978,0.655550,0.894852,0.871433,0.802610,CSU Party
168,Grigorios Aggelidis,FDP,0.835674,0.829460,0.878334,0.660755,0.895902,0.872086,0.800843,CSU Party
215,Diyar Agu,DIE LINKE,0.836147,0.829469,0.879475,0.661478,0.896334,0.873780,0.804404,CSU Party
269,Gökay Akbulut DIE LINKE,DIE LINKE,0.834320,0.829034,0.878312,0.659388,0.896118,0.873012,0.801917,CSU Party
284,Rolf Albach,FDP,0.836197,0.830647,0.879282,0.661394,0.896532,0.873173,0.802289,CSU Party
406,Stephan Albani MdB,CDU,0.834250,0.829579,0.879659,0.666180,0.894962,0.871228,0.797348,CSU Party
922,Katrin Albsteiger,CSU,0.835344,0.830173,0.878741,0.663264,0.896390,0.872788,0.802361,CSU Party
1017,Daniel Alff,SPD,0.780483,0.721963,0.749543,0.602536,0.747802,0.735432,0.660232,SPD Party
1033,Renata Alt,FDP,0.837057,0.834135,0.876815,0.653096,0.899927,0.874355,0.807551,CSU Party


### how many candidates are most similar to their own party?
Model 7: dbow, epochs=20

In [54]:
most_similar_candidates = pd.crosstab(candidate7_data['Partei_ABK'], candidate7_data['most similar'])

In [55]:
most_similar_candidates

most similar,AfD Party,CDU Party,CSU Party,DIE LINKE Party,FDP Party,GRÜNE Party,SPD Party
Partei_ABK,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AfD,97,6,6,5,9,1,0
CDU,3,174,16,3,6,6,1
CSU,0,4,41,0,1,0,0
DIE LINKE,0,0,0,108,0,2,1
FDP,3,6,11,3,140,6,4
GRÜNE,0,2,3,9,3,76,2
SPD,1,4,27,29,6,10,173


### average similarity

Model 8: dmm, epochs=20

#### code

In [56]:
# calculate average similarity of party candidates
candidate_data7 = (candidate7_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data7 = candidate_data7.stack().reset_index()

In [57]:
candidate_data7.sample(10)

Unnamed: 0,Partei_ABK,from_name,party_similarity,0
4902,GRÜNE,Julia Verlinden,DIE LINKE Party,0.275338
1938,CDU,Oswin Veith - Für die Wetterau im Bundestag,FDP Party,0.35113
3993,FDP,Jürgen Krämer,AfD Party,0.211499
3325,DIE LINKE,Simon Pschorr,SPD Party,0.319267
621,AfD,Nicole Jordan,GRÜNE Party,0.409322
1137,CDU,Dietrich Monstadt,AfD Party,0.099806
1897,CDU,Norbert Röttgen,SPD Party,0.330042
5655,SPD,Daniela Kolbe,FDP Party,0.334276
2403,CSU,Bernhard Loos - CSU,DIE LINKE Party,0.191257
1666,CDU,Maria Flachsbarth,SPD Party,0.211375


In [58]:
average_similarity7 = pd.pivot_table(data=candidate_data7,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

#### plot model8 dmm

In [67]:
import seaborn as sns
# inserting columns of zeros and ones to align the styling in the next step
# (otherwise, the colours would not mean the same in each row)
average_similarity_styled = average_similarity8.copy()
average_similarity_styled['0'] = 0.0
average_similarity_styled['1'] = 1.0
cm = sns.light_palette("blue", as_cmap=True)
average_similarity_styled = average_similarity_styled.style.background_gradient(cmap=cm, axis=1)
average_similarity_styled

party_similarity,AfD Party,CDU Party,CSU Party,DIE LINKE Party,FDP Party,GRÜNE Party,SPD Party,0,1
Partei_ABK,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AfD,0.57,0.23,0.32,0.23,0.28,0.23,0.23,0,1
CDU,0.22,0.47,0.34,0.19,0.34,0.29,0.33,0,1
CSU,0.22,0.34,0.63,0.17,0.32,0.28,0.26,0,1
DIE LINKE,0.18,0.16,0.21,0.57,0.23,0.36,0.3,0,1
FDP,0.24,0.29,0.33,0.27,0.55,0.33,0.29,0,1
GRÜNE,0.14,0.2,0.26,0.42,0.3,0.54,0.3,0,1
SPD,0.14,0.25,0.25,0.33,0.25,0.31,0.46,0,1


# evaluation of the models

determines to what degree the model found similarity between candidates and their parties and a difference between candidates and opposing parties 

In [69]:
# computing the mean of diagonal elements
average_diagonal7 = np.trace(average_similarity7) / 7

# computing the mean of off-diagonal elements
average_off_diagonal7 = (average_similarity7.values.sum() - average_diagonal7 * 7) / 42

#the difference can be seen as a performance metric of the model 
average_diff7 = average_diagonal7 - average_off_diagonal7

## code

In [60]:
# calculate average similarity of party candidates
candidate_data1 = (candidate1_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data1 = candidate_data1.stack().reset_index()

average_similarity1 = pd.pivot_table(data=candidate_data1,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal1 = np.trace(average_similarity1) / 7

# computing the mean of off-diagonal elements
average_off_diagonal1 = (average_similarity1.values.sum() - average_diagonal1 * 7) / 42

#the difference can be seen as a performance metric of the model 
average_diff1 = average_diagonal1 - average_off_diagonal1

In [61]:
# calculate average similarity of party candidates
candidate_data2 = (candidate2_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data2 = candidate_data2.stack().reset_index()

average_similarity2 = pd.pivot_table(data=candidate_data2,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal2 = np.trace(average_similarity2) / 7

# computing the mean of off-diagonal elements
average_off_diagonal2 = (average_similarity2.values.sum() - average_diagonal2 * 7) / 42

#the difference can be seen as a performance metric of the model 
average_diff2 = average_diagonal2 - average_off_diagonal2

In [62]:
# calculate average similarity of party candidates
candidate_data3 = (candidate3_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data3 = candidate_data3.stack().reset_index()

average_similarity3 = pd.pivot_table(data=candidate_data3,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal3 = np.trace(average_similarity3) / 7

# computing the mean of off-diagonal elements
average_off_diagonal3 = (average_similarity3.values.sum() - average_diagonal3 * 7) / 43

#the difference can be seen as a performance metric of the model 
average_diff3 = average_diagonal3 - average_off_diagonal3

In [63]:
# calculate average similarity of party candidates
candidate_data4 = (candidate4_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data4 = candidate_data4.stack().reset_index()

average_similarity4 = pd.pivot_table(data=candidate_data4,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal4 = np.trace(average_similarity4) / 7

# computing the mean of off-diagonal elements
average_off_diagonal4 = (average_similarity4.values.sum() - average_diagonal4 * 7) / 44

#the difference can be seen as a performance metric of the model 
average_diff4 = average_diagonal4 - average_off_diagonal4

In [64]:
# calculate average similarity of party candidates
candidate_data5 = (candidate5_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data5 = candidate_data5.stack().reset_index()

average_similarity5 = pd.pivot_table(data=candidate_data5,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal5 = np.trace(average_similarity5) / 7

# computing the mean of off-diagonal elements
average_off_diagonal5 = (average_similarity5.values.sum() - average_diagonal5 * 7) / 55

#the difference can be seen as a performance metric of the model 
average_diff5 = average_diagonal5 - average_off_diagonal5

In [65]:
# calculate average similarity of party candidates
candidate_data6 = (candidate6_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data6 = candidate_data6.stack().reset_index()

average_similarity6 = pd.pivot_table(data=candidate_data6,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal6 = np.trace(average_similarity6) / 7

# computing the mean of off-diagonal elements
average_off_diagonal6 = (average_similarity6.values.sum() - average_diagonal6 * 7) / 66

#the difference can be seen as a performance metric of the model 
average_diff6 = average_diagonal6 - average_off_diagonal6

In [66]:
# calculate average similarity of party candidates
candidate_data8 = (candidate8_data
                   .set_index(['Partei_ABK', 'from_name'])
                   .drop(columns=['most similar'])
                   .rename_axis('party_similarity', axis='columns')
                   .sort_index()
                  )

candidate_data8 = candidate_data8.stack().reset_index()

average_similarity8 = pd.pivot_table(data=candidate_data8,
                                    index='Partei_ABK',
                                    columns='party_similarity',
                                    values=0,
                                    aggfunc='mean'
                                    ).round(decimals=2)

# computing the mean of diagonal elements
average_diagonal8 = np.trace(average_similarity8) / 7

# computing the mean of off-diagonal elements
average_off_diagonal8 = (average_similarity8.values.sum() - average_diagonal8 * 7) / 88

#the difference can be seen as a performance metric of the model 
average_diff8 = average_diagonal8 - average_off_diagonal8

## results

In [70]:
print(model1_name, round(average_diff1, ndigits=4), '\n',
      model3_name, round(average_diff3, ndigits=4), '\n',
      model5_name, round(average_diff5, ndigits=4), '\n',
      model7_name, round(average_diff7, ndigits=4), '\n',
      
      '\n',

      model2_name, round(average_diff2, ndigits=4), '\n',
      model4_name, round(average_diff4, ndigits=4), '\n',
      model6_name, round(average_diff6, ndigits=4), '\n',
      model8_name, round(average_diff8, ndigits=4)
)

# code for other models is located in appendix

Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e1.model 0.1293 
 Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e5.model 0.2048 
 Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e10.model 0.2055 
 Doc2Vec(dbow,d100,n5,mc5,s0.001,t4)_e20.model 0.1676 
 
 Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e1.model 0.0519 
 Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e5.model 0.2664 
 Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e10.model 0.3733 
 Doc2Vec(dmm,d100,n5,w10,mc5,s0.001,t4)_e20.model 0.4146


# Appendix

In [None]:
import math

n_subplots = len(models_by_name)

plot_titles = list(models_by_name.keys())

fig2 = tools.make_subplots(rows=math.ceil(n_subplots/2), cols=2, subplot_titles=plot_titles)

row = 1
col = 1

count = 1

for name, trained_model in models_by_name.items():
    mask = [tag in candidate_data['from_name'].values for tag in trained_model.docvecs.offset2doctag]
    candidate_vecs = trained_model.docvecs.doctag_syn0[mask]
    
    X = candidate_vecs
    tsne = TSNE(n_components=2)
    X_tsne = tsne.fit_transform(X)
    
    trace = go.Scatter(x=X_tsne[:, 0], y=X_tsne[:, 1],
                       mode='markers',
                       marker=dict(color=candidate_data['color'],
                                   size=candidate_data['size'],
                                   symbol=candidate_data['symbol'],
                                   showscale=False,
                                   line=dict(color='black', 
                                             width=1)),
                       text=candidate_data['from_name'])
    
    fig2.append_trace(trace, row, col)
             
    if count % 2 == 0:
        row = row + 1
        col = 1
    else:
        col = col +1
        
    count += 1
    
fig2['layout'].update(title='Plots', showlegend=False, autosize=False, height=800, width=1000)
    
pyoff.iplot(fig2, filename='tsnePlot2')

In [None]:
n_subplots = len(models_by_name)

cols = 2
d_array = np.array([{'is_3d': True}, {'is_3d': True}])
np.tile(d_array, (n_subplots/2, 1))

d3_array = np.arry([{'is_3d': True}])
np.repeat(3d_array, n_subplots)

fig3d = tools.make_subplots(rows=n_subplots//2, cols=2, 
                            specs=[[{'is_3d': True}, {'is_3d': True}]])

row = 1
col = 1

count = 1

for name, trained_model in models_by_name.items():
    mask = [tag in candidate_data['from_name'].values for tag in trained_model.docvecs.offset2doctag]
    candidate_vecs = trained_model.docvecs.doctag_syn0[mask]
    
    X = candidate_vecs
    tsne = TSNE(n_components=3)
    X_tsne = tsne.fit_transform(X)
    
    trace = go.Scatter3d(x=X_tsne[:, 0], y=X_tsne[:, 1], z=X_tsne[:, 2],
                       name=name,
                       mode='markers',
                       marker=dict(color=candidate_data['color'], 
                                   showscale=False,
                                   line=dict(color='black', width=1)),
                       text=candidate_data['from_name'])
    
    fig3d.append_trace(trace, row, col)
             
    if count % 2 == 0:
        row = row + 1
        col = 1
    else:
        col = col +1
        
    count += 1
    
fig3d['layout'].update(title='Plots', showlegend=False)
    
pyoff.iplot(fig3d, filename='tsnePlot2')

In [None]:
alldocs[103717]