## Training Doc2Vec on German Political Speeches

This notebook trains a Doc2Vec model on 1912 transcribed videos from the official channels of both the far right-wing party AfD (Alternative für Deutschland) in Germany and the left-wing party Die Linke.

In [None]:
# Install dependencies
%pip install gensim
%pip install pypet
%matplotlib inline

Importing libraries

In [None]:
import gensim.models.doc2vec as d2v
import nltk
from pypet import progressbar
import numpy as np
import matplotlib.pyplot as plt
import json
import pandas as pd

Loading my data

In [None]:
DATASETPATH = "data/combined_dataset.json"

with open(DATASETPATH, "r", encoding="utf-8") as file:
    data = json.load(file)

Cleaning the data

In [None]:
for item in data:
    item.pop("score", None)
    item.pop("title", None)

In [None]:
len(data)

Tagging each speech with a unique ID, regardless of party

In [None]:
speeches = {}
for idx, rec in enumerate(data):
    speech_id = f"speech_{idx+1}" 
    speeches[speech_id] = {
        "party": rec["party"],
        "speech": rec["transcript"],
    }

print(f"Number of speeches: {len(speeches)}")
print(list(speeches.items())[0])
print(speeches["speech_1"]["speech"])

### Tokenization

NLTK is applied to tokenize the documents. <br>
For this step the tokenization is kept simple: `nltk.word_tokenize` splits each speech into word and punctuation tokens, and every token is lowercased.
<br> As an alternative preprocessing step for later, we can use:  <br> <br>
`gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)`
[gensim.utils documentation](https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess) <br>
Additionally, I would like to create a stopword list and remove those words.
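
The stopword removal mentioned above could look roughly like the sketch below. The stopword set here is a tiny, hand-picked placeholder for illustration only and is not part of the notebook; a fuller German list is available through `nltk.corpus.stopwords.words('german')` once the `stopwords` corpus has been downloaded.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# a real list would be much longer.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "in", "zu", "den"}


def remove_stopwords(tokens, stopwords=GERMAN_STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


example_tokens = ["die", "regierung", "ist", "in", "berlin"]
filtered = remove_stopwords(example_tokens)
print(filtered)
```

Whether removing stopwords helps Doc2Vec on this corpus is an open question; it would need to be checked against the test set.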

In [None]:
def tokenize_speechs(speechs):
    """Tokenize each speech with NLTK's German tokenizer and lowercase the tokens."""
    tokens = {}
    for idx, speech_id in enumerate(speechs):
        # Read from the function argument rather than the global `speeches`
        speech = speechs[speech_id]['speech']
        tokenized = [x.lower() for x in nltk.word_tokenize(speech, language='german')]
        tokens[speech_id] = tokenized
        progressbar(idx, len(speechs), reprint=False)
    return tokens

tokens = tokenize_speechs(speeches)

In [None]:
# Accessing the first entry to check its structure
print(list(tokens.items())[0])
print(list(tokens.items())[0][1][0])

In [None]:
# Using doc2vec to create tagged documents
def create_tagged_objects(tokens):
    """Converts tokens to gensim tagged documents"""
    tagged_docs = {}
    for idx, com_id in enumerate(tokens):
        tagged_doc = d2v.TaggedDocument(words=tokens[com_id], tags=[com_id])
        tagged_docs[com_id] = tagged_doc
        # Progress is measured against the argument, not the global `speeches`
        progressbar(idx, len(tokens), percentage_step=5, reprint=False)
    return tagged_docs

tagged_docs = create_tagged_objects(tokens)

In [None]:
print(list(tagged_docs.items())[0])

In [None]:
# Split train/test 80/20, stratified by party
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, test_size=0.2, stratify=[item["party"] for item in data], random_state=42)

In [None]:
train_df

Converting to a pandas DataFrame and renaming `party` to `labels` to avoid model confusion

In [None]:
train_df = pd.DataFrame(train_df)

if 'labels' not in train_df.columns and 'party' in train_df.columns:
    train_df.rename(columns={'party': 'labels'}, inplace=True)

print(train_df["labels"].value_counts())

Tagging the documents with unique id for the training

In [None]:
training_tagged_docs = []
for idx, row in train_df.iterrows():
    speech_id = f"train_{idx}"
    tokenized = [x.lower() for x in nltk.word_tokenize(row["transcript"], language='german')]
    doc = d2v.TaggedDocument(words=tokenized, tags=[speech_id])
    training_tagged_docs.append(doc)

In [None]:
# Check the tags
training_tagged_docs[10]

Training doc2vec model

In [None]:
model = d2v.Doc2Vec(vector_size=256,  
                    window=8, 
                    min_count=5, 
                    workers=4,  
                    sample=1e-4,
                    negative=5,
                    alpha=0.05, 
                    min_alpha=0.001)  

model.build_vocab(training_tagged_docs)

# Train the model
epochs = 20
for epoch in range(epochs):
    # Update learning rate for this epoch
    alpha = 0.05 - (0.05 - 0.001) * (epoch / epochs)
    model.alpha = alpha
    model.min_alpha = alpha
    
    # Train for one epoch
    model.train(training_tagged_docs, 
                total_examples=model.corpus_count, 
                epochs=1)
    
    print(f'Epoch {epoch+1}/{epochs}, Alpha: {alpha:.4f}')
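
The linear decay computed inside the loop above can be checked in isolation. The helper below is a hypothetical name introduced only for this sketch; it reproduces the same formula without touching the model.

```python
def linear_alpha_schedule(start_alpha, end_alpha, epochs):
    """Learning rate per epoch: start - (start - end) * epoch / epochs."""
    return [
        start_alpha - (start_alpha - end_alpha) * (epoch / epochs)
        for epoch in range(epochs)
    ]


schedule = linear_alpha_schedule(0.05, 0.001, 20)
print(f"first: {schedule[0]:.5f}, last: {schedule[-1]:.5f}")
```

Because `epoch / epochs` tops out at 19/20, the last epoch runs at about 0.00345 rather than the configured `min_alpha` of 0.001. Dividing by `epochs - 1` instead would make the final epoch land on the end value, if that is the intent.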

Checking if the model is able to identify the speeches

In [None]:
# Getting a speech and inspecting it
print(train_df.loc[800])
print(training_tagged_docs[800])

# Using the model to get the documents most similar to index 800 in the training dataset
print("\n Top 3 similar")
doc_vec = model.dv[800]  # `dv` replaces the deprecated `docvecs` alias in gensim 4
model.dv.most_similar([doc_vec], topn=3)

`infer_vector()` is a Doc2Vec method that returns the vector representation of a new document that was not seen during training.<br>
Inference is stochastic, so running it repeatedly on the same document returns a slightly different vector each time. <br>
<br>
For more stable results, increase the number of inference epochs.
[Doc2Vec documentation](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec.infer_vector)
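
One way to quantify how much two inferences of the same document differ is their cosine similarity. The sketch below uses made-up three-dimensional lists as stand-ins for two inferred vectors; the numbers are not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for two inference runs on the same text
run_1 = [0.10, -0.20, 0.30]
run_2 = [0.11, -0.19, 0.29]
similarity = cosine_similarity(run_1, run_2)
print(f"cosine similarity between runs: {similarity:.4f}")
```

Values close to 1 across repeated runs would suggest that the chosen number of inference epochs is enough for stable nearest-neighbour lookups.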

In [None]:
# infer_vector computes the vector of a new input text
def infer_vector(model, text):
    """Infer vector for a new piece of text"""
    tokenized = [x.lower() for x in nltk.word_tokenize(text, language='german')]
    return model.infer_vector(tokenized, epochs=30)

In [None]:
sample_text = "Berlin, 17. April 2025. Zu den Plänen der amtierenden Bundesregierung im Rahmen des sogenannten „Resettlement“-Programms nun auch Menschen aus dem Sudan per Flugzeug diskret nach Deutschland zu holen teilt die AfD-Bundessprecherin Alice Weidel mit: „Obwohl die Belastungsgrenze längst überschritten ist, die Massenmigration unsere sozialen Sicherungssysteme überfordert und die innere Sicherheit zusehends erodiert, hat die gescheiterte Rest-Ampel auf ihren letzten Metern nichts Besseres zu tun als noch möglichst viele weitere Migranten nach Deutschland zu verbringen. Neben den Maschinen mit Afghanen fliegt die Regierung nun auch noch Sudanesen ein – möglichst geräuschlos, ohne jede öffentliche Debatte und gegen den erklärten Willen der Bevölkerungsmehrheit. 2025 sollen so insgesamt 6.560 Migranten zusätzlich in Deutschland ‚angesiedelt‘ werden. Ein skandalöser Vorgang sondergleichen. Diese Art weltfremder und ideologiegetriebener Politik ist nicht nur verantwortungslos, sondern brandgefährlich für den sozialen Frieden in unserem Land. Und trotz der großspurigen Ankündigungspolitik von schwarz-rot, derartige Programme zu beenden, geht das UN-Flüchtlingswerk in Deutschland bereits davon aus, dass auch die neue Bundesregierung das ‚Resettlement‘ weiterführen wird. Die AfD fordert die Regierung auf, alle Möglichkeiten ausschöpfen, jeden weiteren Massenzustrom nach Deutschland zu unterbinden. Wir fordern ein sofortiges Ende sämtlicher Bundesaufnahmeprogramme – Deutschland ist kein Siedlungsgebiet. Wir fordern die unverzügliche Einführung von wirklich effektivem Grenzschutz, das heißt mit konsequenter Abweisung Illegaler. CDU und CSU haben eine echte Migrationswende versprochen und auch zur Bedingung für eine Regierungsbeteiligung erhoben. Sollte Merz nicht Wort halten, hat die Union nach ihrem Totalversagen 2015 in der Migrations- und Sicherheitspolitik nun endgültig jegliche Glaubwürdigkeit verloren.“"
sample_vec = infer_vector(model, sample_text)

In [None]:
similar_docs = model.dv.most_similar([sample_vec], topn=5)  # `dv` replaces the deprecated `docvecs` alias in gensim 4

In [None]:
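for doc_id, similarity in similar_docs:
    # Tags look like "train_<row index>"; avoid shadowing the built-in `id`
    row_id = int(doc_id.split("_")[1])
    # Single quotes inside the f-string keep this valid before Python 3.12
    print(f"{train_df.loc[row_id]['labels']}: {similarity:.3f}")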

### Stage 02

In [None]:
# TODO: also plot histograms of transcript length in words and in tokens
# For now, length is measured in characters
transcript_lengths = []
ids = []

for idx, item in enumerate(data):
    trans_length = len(item["transcript"])  # number of characters
    transcript_lengths.append(trans_length)
    ids.append(idx)
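
Note that `len()` on a string counts characters, not words. A minimal sketch of the difference, using two made-up records that only mimic the dataset's `transcript` key:

```python
# Hypothetical records with the same "transcript" key as the dataset
records = [
    {"transcript": "Guten Tag meine Damen und Herren"},
    {"transcript": "Vielen Dank"},
]

# Character counts: what len() on the raw string measures
char_lengths = [len(rec["transcript"]) for rec in records]

# Word counts: split on whitespace first, then count the pieces
word_lengths = [len(rec["transcript"].split()) for rec in records]

print(char_lengths, word_lengths)
```

For token counts, the lengths of the token lists produced by `nltk.word_tokenize` could be used in the same way.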

In [None]:
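# Length of each transcript, in dataset order
plt.bar(ids, transcript_lengths, width=2.8)
plt.xlabel('Speech ID')
plt.ylabel('Length of Transcript (characters)')
plt.show()

# Distribution of transcript lengths
plt.hist(transcript_lengths, bins=5)
plt.xlabel('Length of Transcript (characters)')
plt.ylabel('Number of Speeches')
plt.show()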

In [None]:
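def similar_docs_2_scores(similar_docs):
    """Sum the similarity scores of the retrieved documents per party label."""
    scores = {}
    for doc_id, similarity in similar_docs:
        # Tags look like "train_<row index>"; recover the row in train_df
        num_id = int(doc_id.split("_")[1])
        label = train_df.loc[num_id]["labels"]
        scores[label] = scores.get(label, 0) + similarity
    return scores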

In [None]:
similar_docs_2_scores(similar_docs)

In [None]:
test_df = pd.DataFrame(test_df)

if 'labels' not in test_df.columns and 'party' in test_df.columns:
    test_df.rename(columns={'party': 'labels'}, inplace=True)

print(test_df["labels"].value_counts())

In [None]:
test_labels = test_df["labels"]
print(test_labels[1])
test_transcript = test_df["transcript"]
print(test_transcript[1])

Creating predictions on the test dataset <br> 
1. Creates an empty `test_prediction` list, used later for accuracy.
2. Loops through the test transcripts.
3. Gets the top 10 most similar vectors.
4. Applies ``similar_docs_2_scores()`` to sum the similarity scores per party across the top 10.
5. Gets the party with the highest summed score.
6. Appends the party label to ``test_prediction``.
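
The scoring and selection steps above amount to a similarity-weighted vote among the nearest neighbours. A minimal standalone sketch, with made-up tags, similarities, and a hypothetical `tag_to_label` lookup in place of `train_df`:

```python
def predict_party(neighbours, tag_to_label):
    """Sum similarities per label and return (winning label, all scores)."""
    scores = {}
    for tag, similarity in neighbours:
        label = tag_to_label[tag]
        scores[label] = scores.get(label, 0.0) + similarity
    # The label with the largest summed similarity wins
    return max(scores, key=scores.get), scores


# Made-up neighbours in the (tag, similarity) shape returned by most_similar()
neighbours = [("train_3", 0.61), ("train_7", 0.55), ("train_1", 0.40)]
tag_to_label = {"train_3": "AFD", "train_7": "Die Linke", "train_1": "Die Linke"}

predicted, scores = predict_party(neighbours, tag_to_label)
print(predicted, scores)
```

Because similarities are summed, two moderately similar neighbours from one party can outweigh a single closer neighbour from the other, as happens in this example.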

In [None]:
test_prediction = []

for transcript in test_transcript:
    sample_vec = infer_vector(model, transcript)
    # `dv` replaces the deprecated `docvecs` alias in gensim 4
    similar_docs = model.dv.most_similar([sample_vec], topn=10)

    doc_scores = similar_docs_2_scores(similar_docs)
    print("Document score: ", doc_scores)

    # Get the key with the highest value
    doc_scores_list = [(k, v) for k, v in doc_scores.items()]  # [("afd", 0.543), ("linke", 0.123)]
    print("Document score list: ", doc_scores_list)
    doc_highest_score = max(doc_scores_list, key=lambda x: x[1])  # ("afd", 0.543)
    print("Document highest score: ", doc_highest_score)

    test_prediction.append(doc_highest_score[0])

In [None]:
print(test_prediction[0])
print(test_labels[0])

Confusion Matrix from Sklearn to check the true values vs. the predicted values from the test data
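
For intuition, the four cells of a two-class confusion matrix, and the precision, recall, and accuracy derived from them, can be computed by hand. The labels and predictions below are made up and are not results from this model.

```python
def binary_confusion(y_true, y_pred, positive):
    """Count true/false positives and negatives with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Made-up labels and predictions for illustration
y_true = ["AFD", "AFD", "Die Linke", "Die Linke", "AFD"]
y_pred = ["AFD", "Die Linke", "Die Linke", "AFD", "AFD"]

tp, fp, fn, tn = binary_confusion(y_true, y_pred, positive="AFD")
precision = tp / (tp + fp)  # share of predicted AFD that really is AFD
recall = tp / (tp + fn)     # share of true AFD that was found
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, precision, recall, accuracy)
```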

In [None]:
# Compare test_prediction against test_labels:
# accuracy, precision, recall, and the confusion matrix

from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)

party_labels = ["Die Linke", "AFD"]

cm = confusion_matrix(test_labels, test_prediction, labels=party_labels)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=party_labels)
disp.plot(cmap=plt.cm.Blues)
plt.show()

# Pass `labels` explicitly; otherwise classes are sorted alphabetically
# and target_names would be attached to the wrong rows
print(classification_report(test_labels, test_prediction,
                            labels=party_labels, target_names=party_labels))
accuracy = accuracy_score(test_labels, test_prediction)
print(f"Accuracy: {accuracy:.3f}")