Welcome to PoliticsToYou – your playground for exploring the heart of politics in Germany.
Would you like to gain insights into political debates or uncover party positions on specific topics across different legislative periods?
You can use the ChatBot to ask your questions or search for related speech content in the Keyword Search section.
You can try out the application on Hugging Face Spaces.
Continue reading for technical details and an explanation of each tab in the application.
- Execute `src/FAISS/FAISS.ipynb` to retrieve all speeches and store them in vector databases (~24h).
- Execute `Home.py` to run the app.
The speech data is retrieved from a locally running OpenDiscourse database.
Refer to their documentation to set up the database locally:
OpenDiscourse Documentation.
As a vector store, I chose FAISS due to its support for large-scale datasets with millions of vectors and its optimization for fast similarity search.
FAISS (Facebook AI Similarity Search) is an open-source vector database developed by Facebook AI Research. It is designed to efficiently find similar items in large datasets, such as text, images, or audio.
In Retrieval-Augmented Generation (RAG) systems, documents are typically split into smaller chunks. This is essential because smaller chunks fit more easily into the LLM’s context window and provide more focused information, improving overall response accuracy.
In this project, I implemented the well-established RecursiveCharacterTextSplitter from LangChain. The splitter recursively divides text along common structural boundaries such as double newlines ("\n\n"), single newlines ("\n"), and spaces (" "). Chunks are created by merging these smaller segments until a defined size threshold is reached. To preserve context, a fixed amount of overlap is introduced between adjacent chunks. The few very small chunks this produces are removed if they contain fewer than 100 characters.
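The split-merge-filter behavior described above can be sketched in plain Python. The actual project uses LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap values below are illustrative assumptions, only the 100-character minimum comes from the text:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=500):
    """Recursively split text along structural boundaries until pieces fit chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, rest, chunk_size))
        else:
            pieces.append(part)
    return pieces

def merge_with_overlap(pieces, chunk_size=500, overlap=50):
    """Merge small pieces into chunks up to chunk_size, carrying overlap between neighbors."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # keep trailing context for the next chunk
        current = (current + " " + piece).strip()
    if current:
        chunks.append(current)
    # Drop tiny fragments, mirroring the project's 100-character minimum
    return [c for c in chunks if len(c) >= 100]
```

Each chunk ends up between 100 and 500 characters, and consecutive chunks share a short overlapping tail.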
For the retrieval step, documents are converted into vector representations using a sentence-transformer model. In this project, I used the paraphrase-multilingual-MiniLM-L12-v2 model from Hugging Face, which has demonstrated strong performance in multilingual similarity-based retrieval tasks.
FAISS vector databases are created for each legislative period, as well as a global vector store containing all speeches. This enables users to retrieve information from specific time periods or across all available data.
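Conceptually, the retrieval FAISS accelerates is nearest-neighbour search over normalized embeddings: documents whose vectors have the highest cosine similarity to the query vector are returned. A minimal NumPy sketch of that inner-product search (random vectors stand in for real sentence-transformer outputs; function names are illustrative):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize document embeddings so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def search(index, query_vec, k=3):
    """Return indices of the k most similar documents (brute-force inner product)."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    return np.argsort(scores)[::-1][:k]
```

FAISS performs the same computation with index structures tuned for millions of vectors, which is why it scales where a brute-force loop would not.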
The app consists of two main sections: a chatbot interface and a keyword search.
The chatbot is based on the multilingual meta-llama/Llama-3.1-8B-Instruct model, which is optimized for instruction-following. Larger models are not used due to resource constraints.
As in any RAG implementation, relevant documents are retrieved based on the user’s query and provided to the LLM to generate answers grounded in parliamentary speeches.
The app supports both German and English through dedicated prompt templates, allowing users to interact with the LLM in either language. The templates for each scenario are implemented in `src/chatbot.py`.
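A minimal sketch of what language-specific templates can look like. The actual wording in src/chatbot.py will differ; the template strings and field names below are assumptions:

```python
# Hypothetical per-language templates; the real ones live in src/chatbot.py
PROMPTS = {
    "en": (
        "Answer the question using only the parliamentary speech excerpts below.\n"
        "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ),
    "de": (
        "Beantworte die Frage ausschließlich anhand der folgenden Redeauszüge.\n"
        "Kontext:\n{context}\n\nFrage: {query}\nAntwort:"
    ),
}

def build_prompt(language, context, query):
    """Fill the template for the selected UI language."""
    return PROMPTS[language].format(context=context, query=query)
```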
Additionally, users can select the underlying vector store that provides contextual documents to the LLM. They can choose:
- All speeches
- Speeches from a single legislative period
- Speeches from multiple legislative periods
The latter is implemented by merging vector stores, which results in slightly increased latency.
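Merging amounts to concatenating the vectors and document mappings of the selected per-period stores before searching, which is why it adds some latency. A simplified brute-force stand-in (the class and its fields are illustrative, not the project's actual store):

```python
import numpy as np

class MiniStore:
    """Toy stand-in for a per-period vector store: vectors plus their documents."""
    def __init__(self, vectors, docs):
        self.vectors = np.asarray(vectors, dtype=float)
        self.docs = list(docs)

    def merge_from(self, other):
        """Append another store's vectors and documents (what merging stores amounts to)."""
        self.vectors = np.vstack([self.vectors, other.vectors])
        self.docs.extend(other.docs)

    def search(self, query, k=2):
        """Return the k documents most similar to the query vector (cosine similarity)."""
        q = np.asarray(query, dtype=float)
        scores = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q))
        return [self.docs[i] for i in np.argsort(scores)[::-1][:k]]
```

After `merge_from`, a single search covers every selected legislative period.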
In the Keyword Search tab, users can enter any word or phrase to find related speeches from all parties or selected parties.
Users can also download their results as JSON, Excel, or CSV files for further analysis.
In the background, a similarity search is performed to present the most relevant documents corresponding to the user’s input.
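The download step can be sketched with pandas. This is a sketch under assumptions: the real result rows and column names in the app may differ, and the Excel path (via `DataFrame.to_excel`) is omitted here for brevity:

```python
import pandas as pd

def export_results(results, fmt="csv"):
    """Serialize keyword-search hits (a list of dicts) for download."""
    df = pd.DataFrame(results)
    if fmt == "csv":
        return df.to_csv(index=False)
    if fmt == "json":
        return df.to_json(orient="records")
    raise ValueError(f"unsupported format: {fmt}")
```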
- Experiment with different LLMs and prompt templates
- Include chat history
- Add a date or legislative-period filter to the Keyword Search
- Improve inference time
- Expand the scope to party manifestos and different countries
- Implement a pipeline to update the database every month with the latest content
A big thank you to the OpenDiscourse team for creating the underlying speech corpus.
Visit their website: https://opendiscourse.de/