In [1]:
import pandas as pd
from tqdm import tqdm

from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split
from torchmetrics.functional.classification import accuracy, f1_score, precision, recall

import numpy as np

import torch
from torch import nn

from sentence_transformers import SentenceTransformer

Check whether a GPU is available and fall back to the CPU otherwise.

In [2]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('Using CUDA!')
else:
    device = torch.device('cpu')
    print("Using CPU!")

Using CPU!


Combine the dataframe with the manually labeled sentences. The `_unlabeled`/`_labeled` file structure is a workaround that prevents the manually edited .csv file from being overwritten when new labels are added.

In [12]:
# Forward slashes work on every OS and avoid invalid escape sequences such as "\d"
df_manifesto = pd.read_pickle('data/df_manifesto_final.pkl')
new_labels = pd.read_csv('data/df_spendings_unlabeled.csv', sep=';', encoding='utf-8-sig', index_col=0)
old_labels = pd.read_csv('data/df_spendings_labeled.csv', sep=';', encoding='utf-8-sig', index_col=0)


manual_labels = pd.concat([old_labels, new_labels], ignore_index=False)
manual_labels = manual_labels.drop(columns=["description_md"], errors="ignore")
manual_labels = manual_labels.dropna()

manual_labels.to_csv('data/df_spendings_labeled.csv', sep=';', encoding='utf-8-sig')

In [54]:
non_na_count = df_manifesto["label"].notna().sum()
random_samples = df_manifesto[df_manifesto["label"].isna()].sample(non_na_count, random_state=30).index
df_spendings = df_manifesto.copy()
df_spendings.loc[random_samples, "label"] = 0
df_spendings = df_spendings.dropna(subset=["label"])
df_spendings["label"] = df_spendings["label"].astype(int)
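
The sampling step above can be tried on a toy frame. This is only an illustrative sketch: the column names follow the notebook, but the rows are made up and `toy` is a hypothetical stand-in for `df_manifesto`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_manifesto: two hand-labeled rows, four unlabeled rows.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
})

# Draw as many unlabeled rows as there are labeled ones and mark them as class 0.
n_labeled = int(toy["label"].notna().sum())
neutral_idx = toy[toy["label"].isna()].sample(n_labeled, random_state=30).index

toy_balanced = toy.copy()
toy_balanced.loc[neutral_idx, "label"] = 0

# Remaining unlabeled rows are dropped; labels become integers.
toy_balanced = toy_balanced.dropna(subset=["label"])
toy_balanced["label"] = toy_balanced["label"].astype(int)

print(toy_balanced["label"].value_counts().to_dict())
```

Because the number of sampled class-0 rows equals the number of all labeled rows, class 0 ends up as large as the labeled classes combined rather than as large as each of them.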

This code snippet prepares data and initializes a pre-trained sentence embedding model for a classification task.

1. **Extracting Sentences and Labels**:  
   The variables `sentences` and `labels` are populated from the `df_spendings` DataFrame. Specifically:
   - `df_spendings["text"].tolist()` extracts the text data (assumed to be sentences or documents) from the "text" column and converts it into a Python list.
   - `df_spendings["label"].values` retrieves the corresponding labels from the "label" column as a NumPy array. These labels likely represent the target classes for the classification task.

2. **Defining the Number of Classes**:  
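   The variable `num_classes` is set to 3, indicating that the classification task involves three distinct categories: class 0 (which includes the randomly sampled unlabeled sentences) and the manually labeled classes 1 and 2. The value is passed to `ClassifierHead` and to the metric functions in `model_pass`.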

3. **Loading the Sentence Transformer Model**:  
   The `SentenceTransformer` class is used to load a pre-trained model, specifically `'sentence-transformers/distiluse-base-multilingual-cased-v2'`. This model is a multilingual version of DistilUSE (Distilled Universal Sentence Encoder) and is designed to generate high-quality sentence embeddings for a wide range of languages. Sentence embeddings are dense vector representations of sentences that capture their semantic meaning.

4. **Moving the Model to the Device**:  
   The `.to(device)` method moves the model to the specified computation device (`device`), which is typically either a GPU or CPU. This ensures that the model's computations are performed on the appropriate hardware for efficiency.

In summary, this code prepares the input data (sentences and labels) and initializes a pre-trained sentence embedding model for multilingual text processing. The embeddings generated by this model will likely be used as input features for a downstream classification task involving three classes.

In [7]:
sentences, labels = df_spendings["text"].tolist(), df_spendings["label"].values
num_classes = 3
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2').to(device)

This cell generates embeddings for a set of sentences using the pre-trained model and saves these embeddings, along with the corresponding labels, to disk for later use.

The `with torch.no_grad():` block is used to disable gradient computation in PyTorch. This is important here because the operation involves encoding sentences into embeddings, which is a forward pass through the model. Since no training or backpropagation is required, disabling gradient computation reduces memory usage and speeds up the process.

Inside the block, the `model.encode` method is called to generate embeddings for the input `sentences`. The `convert_to_tensor=True` argument makes `encode` return the embeddings as a single PyTorch tensor instead of a NumPy array. The resulting tensor is then moved to the CPU using `.cpu()` and converted to a NumPy array with `.numpy()`. This conversion is necessary because the embeddings are saved with NumPy's file-saving utilities.

The `np.save` function is used to save the embeddings and labels as `.npy` files in the data directory. Specifically:
- `"data/embeddings.npy"` stores the sentence embeddings, which are numerical representations of the input sentences in a high-dimensional space.
- `"data/labels.npy"` stores the corresponding labels, taken from the `labels` array created in the previous cell.

Saving these files allows for efficient reuse of the embeddings and labels without needing to recompute them, which is particularly useful when working with large datasets or computationally expensive models.

In [8]:
with torch.no_grad():
    embeddings = model.encode(sentences, convert_to_tensor=True).cpu().numpy()

# Forward slashes keep the paths portable and avoid invalid escape sequences
np.save("data/embeddings.npy", embeddings)
np.save("data/labels.npy", labels)

In [9]:
embeddings = np.load("data/embeddings.npy")
labels = np.load("data/labels.npy")
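
The save/load round trip can also be written with `pathlib`, which sidesteps path-separator issues entirely. The sketch below is a minimal illustration with made-up arrays and a temporary directory; `demo_embeddings` and `demo_labels` are hypothetical stand-ins for the real data.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical stand-ins for the real embeddings (n_sentences x 512) and labels.
demo_embeddings = np.random.default_rng(0).normal(size=(4, 512)).astype(np.float32)
demo_labels = np.array([0, 1, 2, 0])

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()

    # Path objects build the file names without any backslash handling.
    np.save(data_dir / "embeddings.npy", demo_embeddings)
    np.save(data_dir / "labels.npy", demo_labels)

    loaded_embeddings = np.load(data_dir / "embeddings.npy")
    loaded_labels = np.load(data_dir / "labels.npy")

# The arrays come back with identical values, shapes, and dtypes.
roundtrip_ok = (
    np.array_equal(demo_embeddings, loaded_embeddings)
    and np.array_equal(demo_labels, loaded_labels)
)
print(roundtrip_ok, loaded_embeddings.shape)
```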

The Dataset class is a custom implementation for managing data, intended for use with PyTorch's DataLoader. It takes two inputs during initialization: embeddings and labels. The embeddings represent the feature vectors for the data points, while labels are the corresponding target values (here, class labels for classification).

The __len__ method returns the number of data points in the dataset, which is determined by the length of the labels array.
The __getitem__ method retrieves a single data point and its label based on the provided index (idx). It converts the embeddings and labels at the specified index into PyTorch tensors with appropriate data types (float32 for embeddings and int64 for labels). This ensures compatibility with PyTorch models and training pipelines.

The ClassifierHead class defines a neural network module for classification tasks. It inherits from PyTorch's nn.Module and is designed to process input feature vectors and produce one score (logit) per class. The number of classes is set by the num_classes argument, which is 3 in this notebook.

The constructor (__init__) initializes a sequential model (nn.Sequential) consisting of:

- A fully connected layer (nn.Linear) that maps the input features to 1024 dimensions.
- A ReLU activation function (nn.ReLU) to introduce non-linearity.
- A dropout layer (nn.Dropout) with a dropout rate of 0.1 to reduce overfitting by randomly zeroing some of the activations during training.
- Another fully connected layer that maps the 1024-dimensional features to num_classes output dimensions, one per class.

The forward method defines how the input data flows through the network. It takes an input tensor x, passes it through the sequential layers, and returns the output tensor out.

Together, these two classes form the foundation for the machine learning pipeline: the Dataset class handles data preparation and the ClassifierHead class defines the model architecture.


In [10]:
class Dataset:

    def __init__(self, embeddings, labels):
        self.embeddings = embeddings
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        embeddings = torch.tensor(self.embeddings[idx], dtype=torch.float32)
        labels = torch.tensor(self.labels[idx], dtype=torch.int64)
        return embeddings, labels

class ClassifierHead(nn.Module):

    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(1024, num_classes),  # num_classes outputs instead of a hard-coded 2
        )

    def forward(self, x):
        out = self.classifier(x)
        return out
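
A quick shape check helps confirm that the head produces one logit per class. The sketch below repeats the class so that it runs on its own and feeds it random numbers in place of real embeddings; `dummy_batch` is a made-up input, not data from the notebook.

```python
import torch
from torch import nn


class ClassifierHead(nn.Module):
    # Same layout as the class above, repeated so the snippet is self-contained.
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.classifier(x)


head = ClassifierHead(input_dim=512, num_classes=3)
head.eval()  # disable dropout so the check is deterministic

dummy_batch = torch.randn(4, 512)  # four made-up 512-dimensional embeddings
with torch.no_grad():
    logits = head(dummy_batch)

# Logits are unnormalized scores; softmax turns each row into class probabilities.
probs = torch.softmax(logits, dim=1)
print(logits.shape, probs.sum(dim=1))
```

The head returns raw logits, which is what `nn.CrossEntropyLoss` expects, so no softmax layer is needed inside the model.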

The function model_pass is a utility that handles a single pass (either training or evaluation) over a data loader. It takes several arguments: num_classes (the number of target classes, needed by the metric functions), model (the neural network to be trained or evaluated), criterion (the loss function), loader (a data loader providing batches of data), optimizer (used for updating model parameters during training; None for evaluation), and train (a boolean indicating whether the pass is for training or evaluation).

The function begins by setting the model's mode using model.train() or model.eval(), depending on whether the train flag is True or False. This ensures that the model behaves appropriately, such as enabling or disabling dropout layers during training or evaluation, respectively.

It initializes empty lists to track losses, predictions (preds), and targets (targs). The function then iterates over the data loader, which provides batches of features and labels. These are moved to the appropriate device (e.g., CPU or GPU) for computation. The torch.set_grad_enabled(train) context manager ensures that gradients are only computed during training, saving memory and computation during evaluation.

Within the loop, the model processes the input features to produce outputs, and the loss is computed using the provided criterion. If the pass is for training, the optimizer is used to update the model's parameters: stale gradients are cleared with optimizer.zero_grad(), new gradients are computed via loss.backward(), and the parameters are updated with optimizer.step(). Predictions are obtained by taking the index of the maximum value along the class dimension (torch.argmax with dim=1), and both predictions and labels are detached from the computation graph to avoid unnecessary memory usage.

After processing all batches, the predictions and targets are concatenated into single tensors using torch.cat. The function then computes several evaluation metrics: accuracy, F1 score, precision, and recall, using the torchmetrics functions accuracy, f1_score, precision, and recall. They are called with task="multiclass" and num_classes, so the metrics cover all three classes rather than a binary setup.

Finally, the function returns a dictionary containing the average loss and the computed metrics. This structure makes it easy to monitor the model's performance during training and evaluation, providing insights into how well the model is learning and generalizing.

In [11]:
def model_pass(num_classes, model, criterion, loader, optimizer=None, train=True):
    if train:
        model.train()
    else:
        model.eval()

    losses = []

    preds = []
    targs = []

    for features, labels in loader:
        features = features.to(device)
        labels = labels.to(device)

        with torch.set_grad_enabled(train):
            outputs = model(features)
            loss = criterion(outputs, labels)

        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        pred = torch.argmax(outputs, dim=1)

        preds.append(pred.detach())
        targs.append(labels.detach())

        losses.append(loss.item())

    preds = torch.cat(preds)
    targs = torch.cat(targs)

    acc = accuracy(preds, targs, task="multiclass", num_classes=num_classes)
    f1 = f1_score(preds, targs, task="multiclass", num_classes=num_classes)
    prec = precision(preds, targs, task="multiclass", num_classes=num_classes)
    rec = recall(preds, targs, task="multiclass", num_classes=num_classes)

    return {
        "loss": np.mean(losses),
        "accuracy": acc.item(),
        "f1": f1.item(),
        "precision": prec.item(),
        "recall": rec.item(),
    }
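
One detail worth checking is the averaging mode. The metric calls above do not pass `average`; assuming the installed torchmetrics version defaults to `average="micro"` for these multiclass wrappers, precision, recall, and F1 will all coincide with accuracy. The NumPy sketch below, with made-up predictions, shows why micro averaging behaves this way and how a macro average differs.

```python
import numpy as np

# Made-up predictions and targets for three classes.
preds = np.array([0, 0, 1, 2, 2, 0, 1, 0])
targs = np.array([0, 0, 1, 1, 2, 0, 2, 0])
num_classes = 3

# Per-class true positives, false positives, and false negatives.
tp = np.array([np.sum((preds == c) & (targs == c)) for c in range(num_classes)])
fp = np.array([np.sum((preds == c) & (targs != c)) for c in range(num_classes)])
fn = np.array([np.sum((preds != c) & (targs == c)) for c in range(num_classes)])

accuracy = np.mean(preds == targs)

# Micro averaging pools the counts over classes before dividing.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Macro averaging computes F1 per class and then takes the unweighted mean.
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

print(accuracy, micro_f1, macro_f1)
```

In single-label multiclass data every wrong prediction is one false positive and one false negative, so the pooled counts make micro precision, recall, and F1 equal to accuracy. If a per-class view matters, passing `average="macro"` to the torchmetrics calls is one option to consider.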



This code sets up the training pipeline for the three-class classification task using PyTorch. It begins by instantiating the ClassifierHead model with an input dimension of 512, which corresponds to the size of the feature vectors (embeddings). The model is moved to the appropriate device (e.g., CPU or GPU) for computation.

The dataset is split into training and validation sets using the train_test_split function. The test_size=0.1 parameter ensures that 10% of the data is reserved for validation, while the remaining 90% is used for training. The stratify=labels argument ensures that the class distribution in the training and validation sets matches the original dataset, which is important for imbalanced datasets. The random_state=11 argument ensures reproducibility by fixing the random seed.

To handle class imbalance, the compute_class_weight function calculates weights for each class based on their frequency in the training data. These weights are then used to create sample weights for each training example, ensuring that underrepresented classes are sampled more frequently during training. A WeightedRandomSampler is used in the DataLoader for the training dataset to implement this sampling strategy. The validation dataset, on the other hand, is loaded with a simple shuffle mechanism.
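
As a rough illustration of the weighting, the sketch below recomputes the `"balanced"` heuristic by hand, using the formula from the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The label array is made up, and the names ending in `_demo` are hypothetical.

```python
import numpy as np

# Made-up training labels: class 0 is twice as frequent as classes 1 and 2.
train_labels_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2])

# "balanced" weights: n_samples / (n_classes * count_per_class).
counts = np.bincount(train_labels_demo)
class_weights_demo = len(train_labels_demo) / (len(counts) * counts)

# One weight per training example, looked up by its class label.
# This lookup relies on the labels being the integers 0..K-1.
sample_weights_demo = class_weights_demo[train_labels_demo]

# Share of draws each class receives when sampling proportionally to the weights.
class_mass = np.bincount(train_labels_demo, weights=sample_weights_demo)
class_share = class_mass / class_mass.sum()

print(class_weights_demo, class_share)
```

With these weights every class carries the same total sampling mass, so a weighted sampler draws the three classes about equally often regardless of their raw frequencies.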

The Dataset class is used to wrap the training and validation data, making it compatible with PyTorch's DataLoader. The DataLoader batches the data and, in the case of the training set, uses the weighted sampler to draw 500 samples per epoch with replacement. The batch size for both loaders is set to 64.

The training process is configured to run for 300 epochs. The loss function used is CrossEntropyLoss, which is suitable for multi-class classification tasks. The optimizer is AdamW, a variant of the Adam optimizer that decouples weight decay from the gradient update for better regularization. The learning rate is set to 1.5e-4 (written as `.15*1e-3` in the code). A learning rate scheduler, CosineAnnealingLR, is also defined to gradually reduce the learning rate over the course of training, down to a minimum value of 1e-6. This helps the model converge by taking smaller steps as training progresses.

In [12]:
num_classes = 3

classifier = ClassifierHead(input_dim=512, num_classes=num_classes).to(device)

train_features, valid_features, train_labels, valid_labels = train_test_split(
    embeddings, labels, test_size=0.1, random_state=11, shuffle=True, stratify=labels
)

class_weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
train_sample_weights = np.array([class_weights[label] for label in train_labels])

train_dataset = Dataset(train_features, train_labels)
valid_dataset = Dataset(valid_features, valid_labels)

train_loader = torch.utils.data.DataLoader(
    train_dataset, 
    batch_size=64, 
    sampler=torch.utils.data.WeightedRandomSampler(
        train_sample_weights, 
        500,
        replacement=True
    ),
)
valid_loader = torch.utils.data.DataLoader(
    valid_dataset, 
    batch_size=64, 
    shuffle=True
)

epochs = 300

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(classifier.parameters(), lr=.15*1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)
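
To get a feel for the schedule, the closed-form cosine annealing expression can be evaluated directly. The sketch below assumes the scheduler is stepped once per epoch, as in the training loop; `cosine_annealing_lr` is a hypothetical helper, not part of PyTorch.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.15e-3, eta_min=1e-6, t_max=300):
    """Closed-form cosine annealing value after `epoch` scheduler steps."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# Learning rate at the start, at the quarter points, and at the end of training.
checkpoints = (0, 75, 150, 225, 300)
schedule = [cosine_annealing_lr(e) for e in checkpoints]

for epoch, lr in zip(checkpoints, schedule):
    print(f"epoch {epoch:3d}: lr = {lr:.2e}")
```

The rate starts at 1.5e-4, passes through the midpoint between the two bounds at epoch 150, and reaches 1e-6 at epoch 300.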

In [13]:
pbar = tqdm(range(epochs))

best_f1 = 0
for e in pbar:

    train_metrics = model_pass(num_classes, classifier, criterion, train_loader, optimizer, train=True)
    valid_metrics = model_pass(num_classes, classifier, criterion, valid_loader, optimizer=None, train=False)
    scheduler.step()

    if valid_metrics["f1"] > best_f1:
        best_f1 = valid_metrics["f1"]
        torch.save(classifier.state_dict(), "classifier.pt")
    
    pbar.set_description(f"Train F1 {train_metrics['f1']:.4f} - Valid F1 {best_f1:.4f}")

Train F1 0.8560 - Valid F1 0.7810: 100%|██████████| 300/300 [00:12<00:00, 23.56it/s]


This code snippet implements a training loop for a machine learning model, with a focus on tracking performance metrics and saving the best-performing model based on validation F1 score. Here's a detailed explanation:

1. **Progress Bar Initialization**:  
   The `tqdm` library is used to create a progress bar (`pbar`) that iterates over the specified number of epochs. This provides a visual indicator of the training progress, making it easier to monitor.

2. **Tracking the Best F1 Score**:  
   The variable `best_f1` is initialized to 0. It is used to keep track of the highest F1 score achieved on the validation set during training. This ensures that the best-performing model is saved.

3. **Training and Validation Loop**:  
   For each epoch, the following steps are performed:
   - **Training Pass**: The `model_pass` function is called with the training data loader (`train_loader`) and the optimizer. This function computes the loss, accuracy, F1 score, precision, and recall for the training set while updating the model's weights.
   - **Validation Pass**: The `model_pass` function is called again, this time with the validation data loader (`valid_loader`) and no optimizer. This evaluates the model's performance on the validation set without updating its weights.

4. **Learning Rate Adjustment**:  
   The `scheduler.step()` method is called once per epoch to advance the cosine annealing schedule, which lowers the learning rate step by step toward its minimum. This is typically used to improve convergence.

5. **Saving the Best Model**:  
   If the F1 score from the validation pass (`valid_metrics["f1"]`) exceeds the current `best_f1`, the model's state dictionary is saved to a file named `"classifier.pt"`. This ensures that the best-performing model is preserved for later use.

6. **Updating the Progress Bar Description**:  
   The `pbar.set_description` method is used to dynamically update the progress bar's description with the current training F1 score and the best validation F1 score. This provides real-time feedback on the model's performance during training.

Overall, this code combines training, validation, and model checkpointing into a single loop, ensuring efficient and organized model training while tracking key performance metrics.
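
Note that the loop writes the best weights to disk, while the in-memory `classifier` keeps the weights of the final epoch. If the best checkpoint should be used for prediction, it has to be loaded back. The sketch below shows that pattern with a tiny stand-in model and a temporary file; `demo_model` and the file name are hypothetical.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in for the classifier; the real model is ClassifierHead.
demo_model = nn.Linear(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = os.path.join(tmp, "classifier_demo.pt")  # hypothetical file name

    # Save the "best" weights and remember them for the comparison below.
    torch.save(demo_model.state_dict(), checkpoint_path)
    best_weight = demo_model.weight.detach().clone()

    # Simulate later epochs changing the in-memory weights.
    with torch.no_grad():
        demo_model.weight.add_(1.0)

    # Restore the saved weights and switch to eval mode before predicting.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    demo_model.load_state_dict(state_dict)
    demo_model.eval()

restored = torch.equal(demo_model.weight, best_weight)
print(restored)
```

In the notebook, the equivalent step would be loading `"classifier.pt"` into `classifier` after the training loop finishes.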

In [14]:
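def predict_policy_stance(sentences):
    # Accept a single sentence as well as a list. Note that .item() below
    # only works when exactly one sentence is passed.
    if not isinstance(sentences, list):
        sentences = [sentences]

    # Inference only: no gradients are needed for the encoder or the classifier.
    with torch.no_grad():
        embeddings = model.encode(sentences, convert_to_tensor=True).to(device)
        logits = classifier(embeddings)

    label = torch.argmax(logits, dim=1).item()

    # Map class index 2 to the stance code -1; classes 0 and 1 stay unchanged.
    if label == 2:
        return -1
    else:
        return label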
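
The function above handles one sentence per call because `.item()` requires a single-element tensor. A batched variant of the label mapping could look like the sketch below; `logits_to_stance` is a hypothetical helper and the logits are made up.

```python
import torch


def logits_to_stance(logits):
    """Map a (batch, 3) logit tensor to stance codes, turning class 2 into -1."""
    class_idx = torch.argmax(logits, dim=1)
    stance = torch.where(class_idx == 2, torch.full_like(class_idx, -1), class_idx)
    return stance.tolist()


# Made-up logits for three sentences, one row per sentence.
demo_logits = torch.tensor([
    [2.0, 0.1, 0.3],  # highest score in column 0
    [0.2, 1.5, 0.1],  # highest score in column 1
    [0.0, 0.4, 3.0],  # highest score in column 2, reported as -1
])

stances = logits_to_stance(demo_logits)
print(stances)
```

Encoding and classifying many sentences in one call is usually much faster than applying the single-sentence function row by row.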




Test the model by comparing the predicted label to the target label. In this case, the target is pro-expenditure (label 1).

In [16]:
# Example predictions
new_sentences = df_manifesto[df_manifesto['label'] == 1]['text'].sample(10).tolist()

for sentence in new_sentences:
    print(f"'{sentence}' → {predict_policy_stance(sentence)}")

'Wir haben deshalb den Erwerb von Wohneigentum z B durch die Eigenheimrente WohnRiester unterstützt' → 1
'So erhöhen wird die Kaufkraft und stärken den Binnenmarkt Vgl Kapitel I Gute Arbeit' → 1
'Das sind über 14  des Bruttoinlandsprodukts BIP' → 1
'Damit der Konsum steigt müssen die Menschen mehr verfügbares Einkommen haben das  sie für zusätzlichen Konsum ausgeben können' → 1
'Statt Dienstleistungen zu privatisieren und einzuschränken wollen wir dass öffentliche und soziale Leistungen ausgebaut werden  in Schulen und Hochschulen Pflege Betreuungsund Kultureinrichtungen öffentlichem Nahverkehr und im Umweltschutz' → 1
'Wer diesen Weg verfolgt setzt unsere Zukunft aufs Spiel oder will harte Einschnitte in den Sozialstaat' → 1
'Teilhaben Einmischen Zukunft schaffen  das beschreibt einen neuen Weg aus den Krisen und den Aufbruch hin zu einer offenen modernen Gesellschaft und einer Wirtschaft die besser und sparsamer mit unseren natürlichen Ressourcen umgeht' → 1
'Bei Produkten und Dienst

This function calculates the share of statements whose predicted label matches the target label.

In [17]:
def calculate_label_match_ratio(df):
    # Boolean Series: True where 'label' equals 'predicted_label'
    matches = df['label'] == df['predicted_label']
    
    # Number of matching rows (True counts as 1)
    yes_count = matches.sum()
    
    # Share of matching rows among all rows
    ratio = yes_count/len(df)
    
    return ratio
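
Since a random sample of the corpus consists mostly of unlabeled (class 0) sentences, the overall ratio can hide how well the rarer classes are predicted. The sketch below, on a made-up frame, computes the same overall ratio and adds a per-label breakdown.

```python
import pandas as pd

# Made-up target and predicted labels.
demo = pd.DataFrame({
    "label":           [1, 1, 0, 0, 0, -1],
    "predicted_label": [1, 0, 0, 0, 0, 1],
})

# The mean of a boolean Series is the share of True values.
overall = (demo["label"] == demo["predicted_label"]).mean()

# Share of correct predictions within each target label.
per_label = (
    demo.assign(match=demo["label"] == demo["predicted_label"])
        .groupby("label")["match"]
        .mean()
)

print(overall)
print(per_label)
```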



In [None]:
sample = df_manifesto.sample(300)
sample['predicted_label'] = sample['text'].apply(predict_policy_stance)
sample['label'] = sample['label'].apply(lambda label: 0 if pd.isna(label) else (-1 if label == 2 else label))

calculate_label_match_ratio(sample)

0.8366666666666667

Apply the model to the entire dataset to predict the stance of each policy statement.

In [None]:
df_manifesto['predicted_label'] = df_manifesto['text'].apply(predict_policy_stance)

Edit the label column so that it matches the coding scheme of the predictions: unlabeled sentences become 0 and class 2 becomes -1.

In [None]:
df_manifesto["label"]=df_manifesto["label"].fillna(0)


In [58]:
df_manifesto["label"] = df_manifesto["label"].replace(2, -1)
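
The two relabeling steps can be checked on a small made-up Series. This is a sketch only; the final `astype(int)` is an addition for readability and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up labels: 1 and 2 are manual labels, NaN marks unlabeled sentences.
demo_labels = pd.Series([1.0, 2.0, np.nan, 2.0, np.nan])

# Unlabeled becomes 0, class 2 becomes -1, class 1 stays unchanged.
remapped = demo_labels.fillna(0).replace(2, -1).astype(int)

print(remapped.tolist())
```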

Save the dataframe with the predictions locally.

In [63]:
df_manifesto.to_pickle('data/df_manifesto_predictions.pkl')
