Welcome to PoliticsToYou – your playground for exploring the heart of politics in Germany.
Would you like to gain insights into political debates or uncover party positions on specific topics across different legislative periods?
You can use the ChatBot to ask your questions or search for related speech content in the Keyword Search section.
You can try out the application on Hugging Face Spaces.
Continue reading for technical details and an explanation of each tab in the application.
- Execute `src/FAISS/FAISS.ipynb` to retrieve all speeches and store them in vector databases (~24h).
- Execute `Home.py` to run the app.
The speech data is retrieved from a locally running OpenDiscourse database.
Refer to their documentation to set up the database locally:
OpenDiscourse Documentation.
As a vector store, I chose FAISS due to its support for large-scale datasets with millions of vectors and its optimization for fast similarity search.
FAISS (Facebook AI Similarity Search) is an open-source vector database developed by Facebook AI Research. It is designed to efficiently find similar items in large datasets, such as text, images, or audio.
In Retrieval-Augmented Generation (RAG) systems, documents are typically split into smaller chunks. This is essential because smaller chunks fit more easily into the LLM’s context window and provide more focused information, improving overall response accuracy.
In this project, I implemented the well-established RecursiveCharacterTextSplitter from LangChain. The splitter recursively divides text along common structural boundaries such as double newlines ("\n\n"), single newlines ("\n"), and spaces (" "). Chunks are created by merging these smaller segments until a defined size threshold is reached. To preserve context, a fixed amount of overlap is introduced between adjacent chunks. The few very small chunks this produces are removed if they contain fewer than 100 characters.
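The split-merge-filter behavior described above can be sketched in plain Python. The actual project uses LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap values below are illustrative assumptions, only the 100-character minimum comes from the text:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=500):
    """Recursively split text along structural boundaries until pieces fit chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, rest, chunk_size))
        else:
            pieces.append(part)
    return pieces

def merge_with_overlap(pieces, chunk_size=500, overlap=50):
    """Merge small pieces into chunks up to chunk_size, carrying overlap between neighbors."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # keep trailing context for the next chunk
        current = (current + " " + piece).strip()
    if current:
        chunks.append(current)
    # Drop tiny fragments, mirroring the project's 100-character minimum
    return [c for c in chunks if len(c) >= 100]
```

Each chunk ends up between 100 and 500 characters, and consecutive chunks share a short overlapping tail.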
For the retrieval step, documents are converted into vector representations using a sentence-transformer model. In this project, I used the paraphrase-multilingual-MiniLM-L12-v2 model from Hugging Face, which has demonstrated strong performance in multilingual similarity-based retrieval tasks.
FAISS vector databases are created for each legislative period, as well as a global vector store containing all speeches. This enables users to retrieve information from specific time periods or across all available data.
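Conceptually, the retrieval FAISS accelerates is nearest-neighbour search over normalized embeddings: documents whose vectors have the highest cosine similarity to the query vector are returned. A minimal NumPy sketch of that inner-product search (random vectors stand in for real sentence-transformer outputs; function names are illustrative):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize document embeddings so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def search(index, query_vec, k=3):
    """Return indices of the k most similar documents (brute-force inner product)."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    return np.argsort(scores)[::-1][:k]
```

FAISS performs the same computation with index structures tuned for millions of vectors, which is why it scales where a brute-force loop would not.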
The app consists of two main sections: a chatbot interface and a keyword search.
The chatbot is based on the multilingual meta-llama/Llama-3.1-8B-Instruct model, which is optimized for instruction-following. Larger models are not used due to resource constraints.
As in any RAG implementation, relevant documents are retrieved based on the user’s query and provided to the LLM to generate answers grounded in parliamentary speeches.
The app supports both German and English through dedicated prompt templates, allowing users to interact with the LLM in either language. The templates for each scenario are implemented in `src/chatbot.py`.
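A minimal sketch of what language-specific templates can look like. The actual wording in src/chatbot.py will differ; the template strings and field names below are assumptions:

```python
# Hypothetical per-language templates; the real ones live in src/chatbot.py
PROMPTS = {
    "en": (
        "Answer the question using only the parliamentary speech excerpts below.\n"
        "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ),
    "de": (
        "Beantworte die Frage ausschließlich anhand der folgenden Redeauszüge.\n"
        "Kontext:\n{context}\n\nFrage: {query}\nAntwort:"
    ),
}

def build_prompt(language, context, query):
    """Fill the template for the selected UI language."""
    return PROMPTS[language].format(context=context, query=query)
```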
Additionally, users can select the underlying vector store that provides contextual documents to the LLM. They can choose:
- All speeches
- Speeches from a single legislative period
- Speeches from multiple legislative periods
The latter is implemented by merging vector stores, which results in slightly increased latency.
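Merging amounts to concatenating the vectors and document mappings of the selected per-period stores before searching, which is why it adds some latency. A simplified brute-force stand-in (the class and its fields are illustrative, not the project's actual store):

```python
import numpy as np

class MiniStore:
    """Toy stand-in for a per-period vector store: vectors plus their documents."""
    def __init__(self, vectors, docs):
        self.vectors = np.asarray(vectors, dtype=float)
        self.docs = list(docs)

    def merge_from(self, other):
        """Append another store's vectors and documents (what merging stores amounts to)."""
        self.vectors = np.vstack([self.vectors, other.vectors])
        self.docs.extend(other.docs)

    def search(self, query, k=2):
        """Return the k documents most similar to the query vector (cosine similarity)."""
        q = np.asarray(query, dtype=float)
        scores = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q))
        return [self.docs[i] for i in np.argsort(scores)[::-1][:k]]
```

After `merge_from`, a single search covers every selected legislative period.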
In the Keyword Search tab, users can enter any word or phrase to find related speeches from all parties or selected parties.
Users can also download their results as JSON, Excel, or CSV files for further analysis.
In the background, a similarity search is performed to present the most relevant documents corresponding to the user’s input.
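The download step can be sketched with pandas. This is a sketch under assumptions: the real result rows and column names in the app may differ, and the Excel path (via `DataFrame.to_excel`) is omitted here for brevity:

```python
import pandas as pd

def export_results(results, fmt="csv"):
    """Serialize keyword-search hits (a list of dicts) for download."""
    df = pd.DataFrame(results)
    if fmt == "csv":
        return df.to_csv(index=False)
    if fmt == "json":
        return df.to_json(orient="records")
    raise ValueError(f"unsupported format: {fmt}")
```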
- Experiment with different LLMs and prompt templates
- Include chat history
- Add a date or legislative-period filter to the Keyword Search
- Improve inference time
- Expand the scope to party manifestos and different countries
- Implement a pipeline to update the database every month with the latest content
A big thank you to the OpenDiscourse team for creating the underlying speech corpus.
Visit their website: https://opendiscourse.de/