# RAG system for TUM School of Computation, Information, and Technology

This project showcases the Haystack 2.0 API, with a focus on its pipelines. The notebook consists of three parts:
 1. Loading the website scraped from cit.tum.de with Scrapy (see the README)
 2. Processing the HTML text, chunking it, and writing the chunks to separate document stores
 3. A retrieval-augmented question answering system that handles English and German input differently.

This is the faster version of the two scripts.

In [59]:
# Get OpenAI token from .env

import os
from dotenv import load_dotenv
load_dotenv(override=True)

api_key = os.getenv("OAI_TOKEN")
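
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` only surfaces later, when the generator is first used. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Possible usage in the notebook, after load_dotenv():
# api_key = require_env("OAI_TOKEN")
```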

In [61]:
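# Always run this cell to get access to the database.
# Note: recreate_index=True recreates the collection, so previously indexed documents are dropped.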

from qdrant_haystack import QdrantDocumentStore

# Document Store
document_store_en = QdrantDocumentStore(
    location="localhost",
    port=6333,
    recreate_index=True,
    return_embedding=True,
    wait_result_from_api=True,
)

document_store_de = QdrantDocumentStore(
    location="localhost",
    port=6334,
    recreate_index=True,
    return_embedding=True,
    wait_result_from_api=True,
)


In [60]:
# 1. Load the websites scraped using scrapy
def get_files_dict(directory_path):
    """Return one {"path", "url"} dict per file; the file name serves as the "url" value."""
    return [
        {"path": os.path.join(directory_path, item), "url": item}
        for item in os.listdir(directory_path)
    ]

# adjust this to where your data lives
abs_path = "/Users/ferdydh/Code/haystack-rag-showcase/tum_data"
result = get_files_dict(abs_path)
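
`os.listdir` returns every entry in the directory in arbitrary order, including hidden files and subdirectories. The sketch below is one way to restrict the list to HTML files and make the order reproducible. It assumes the scraped pages are stored with an `.html` extension, and the function name `list_html_files` is hypothetical.

```python
from pathlib import Path


def list_html_files(directory_path: str) -> list[dict]:
    """Return one {"path", "url"} dict per .html file, sorted by file name."""
    directory = Path(directory_path)
    return [
        {"path": str(path), "url": path.name}
        for path in sorted(directory.iterdir())
        # Skip subdirectories and non-HTML entries such as .DS_Store
        if path.is_file() and path.suffix.lower() == ".html"
    ]
```

The returned list has the same shape as the one produced by `get_files_dict`, so it could be passed to the converter in the same way.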

In [62]:
# 2. Processing the data
# This step takes the list of HTML files from the previous step -> [{path, url}]

# It uses several Haystack pipeline components:
# A. Converter = HTMLToDocument
# B. Preprocessors = DocumentCleaner, DocumentSplitter
# C. Classifier = DocumentLanguageClassifier
# D. Router = MetadataRouter
# E. Embedder = SentenceTransformersDocumentEmbedder
# F. Writer = DocumentWriter

# In short, this block:
# Converts the HTML using A
# Processes (cleans, splits) it using B
# Classifies the language using C
# Routes each document according to its language (German or English) using D
# Embeds the documents with a language-specific embedder using E
# And finally writes them to a separate DocumentStore per language using F

from haystack import Pipeline
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

# Document Store
document_store_en = QdrantDocumentStore(
    location="localhost",
    port=6333,
    recreate_index=True,
    return_embedding=True,
    wait_result_from_api=True,
)

document_store_de = QdrantDocumentStore(
    location="localhost",
    port=6334,
    recreate_index=True,
    return_embedding=True,
    wait_result_from_api=True,
)

en_writer = DocumentWriter(document_store = document_store_en)
de_writer = DocumentWriter(document_store = document_store_de)


# Data pipeline
pipeline = Pipeline()
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=10))


# Language Pipeline Components
document_classifier = DocumentLanguageClassifier(languages = ["en", "de", "id"])
metadata_router = MetadataRouter(rules={"en": {"language": {"$eq": "en"}}, "de": {"language": {"$eq": "de"}}})
english_embedder = SentenceTransformersDocumentEmbedder()
german_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="PM-AI/bi-encoder_msmarco_bert-base_german")

pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.add_component(instance=metadata_router, name="metadata_router")
pipeline.add_component(instance=english_embedder, name="english_embedder")
pipeline.add_component(instance=german_embedder, name="german_embedder")
pipeline.add_component(instance=en_writer, name="en_writer")
pipeline.add_component(instance=de_writer, name="de_writer")

# Connect all
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "document_classifier.documents")

pipeline.connect("document_classifier.documents", "metadata_router.documents")
pipeline.connect("metadata_router.en", "english_embedder.documents")
pipeline.connect("metadata_router.de", "german_embedder.documents")
pipeline.connect("english_embedder", "en_writer")
pipeline.connect("german_embedder", "de_writer")


pipeline.run(
    {
        "converter": {
            "sources": [x["path"] for x in result],
            # "meta": [{"url": x["url"]} for x in result]
        }
    }
)

Error parsing HTML
Traceback (most recent call last):
  File "/Users/ferdydh/Code/haystack-rag-showcase/.venv/lib/python3.11/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
    bp_parser.feed(input_str)
  File "/Users/ferdydh/Code/haystack-rag-showcase/.venv/lib/python3.11/site-packages/boilerpy3/parser.py", line 658, in feed
    self.end_document()
  File "/Users/ferdydh/Code/haystack-rag-showcase/.venv/lib/python3.11/site-packages/boilerpy3/parser.py", line 461, in end_document
    self.flush_block()
  File "/Users/ferdydh/Code/haystack-rag-showcase/.venv/lib/python3.11/site-packages/boilerpy3/parser.py", line 540, in flush_block
    if self.last_start_tag.lower() == "title":
       ^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ferdydh/Code/haystack-rag-showcase/.venv/lib/python3.11/site-packages/boilerpy3/extrac

{'metadata_router': {'unmatched': [Document(id=724c6338cb851aed26b1f0451755d1d7e584d257d7105eedec38855b4c77ad0b, content: 'Google Custom Search
   Wir verwenden Google für unsere Suche. Mit Klick auf „Suche aktivieren“ aktivie...', meta: {'file_path': '/Users/ferdydh/Code/haystack-rag-showcase/tum_data/https___www.cit.tum.de_ito_fuer-mitarbeiter_fuer-neue_.html', 'source_id': 'fc29db3d0d62c87d0772227b552e7154092f260c51d2e7136ba199924229846c', 'language': 'unmatched'}),
   Document(id=34220d498a003479a9b953b9bde1bd95ef89950717c1982c98c4696c022cfefb, content: 'de / ucentral.ma.tum.de vornehmen können.
   Wie kann ich meine E-Mails von der Uni lesen / empfangen?
   ...', meta: {'file_path': '/Users/ferdydh/Code/haystack-rag-showcase/tum_data/https___www.cit.tum.de_ito_fuer-mitarbeiter_fuer-neue_.html', 'source_id': 'fc29db3d0d62c87d0772227b552e7154092f260c51d2e7136ba199924229846c', 'language': 'unmatched'}),
   Document(id=5371b966166dd0ee5ce0f57a692211fde4f7d81c9312ac98dd2d93c7e951b052, 
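
The output above lists chunks whose `language` meta field is `unmatched`. They end up in the router's `unmatched` output and are therefore neither embedded nor written to a store. The sketch below illustrates equality-based routing on a meta field in plain Python. It is a simplified stand-in, not Haystack's `MetadataRouter` implementation, and `route_by_language` and the dict-based documents are hypothetical.

```python
from collections import defaultdict


def route_by_language(documents: list[dict], languages: list[str]) -> dict[str, list[dict]]:
    """Group documents by meta["language"]; anything else goes to "unmatched"."""
    routed: dict[str, list[dict]] = defaultdict(list)
    for document in documents:
        language = document.get("meta", {}).get("language")
        key = language if language in languages else "unmatched"
        routed[key].append(document)
    return dict(routed)


docs = [
    {"content": "Welcome to TUM", "meta": {"language": "en"}},
    {"content": "Willkommen an der TUM", "meta": {"language": "de"}},
    {"content": "Google Custom Search", "meta": {"language": "unmatched"}},
]
routed = route_by_language(docs, ["en", "de"])
print({key: len(value) for key, value in routed.items()})
```

Counting the size of each group after a run is a quick way to see how much content is dropped before indexing.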

In [86]:
# 3. Answer questions
# Now the system is almost ready to answer your questions about TUM CIT,
# such as "When does TUM's 2024 summer semester start?"
# or "Who is TUM's president?"

# To do this, it uses the pipeline components:
# A. Embedder = SentenceTransformersTextEmbedder
# B. Retriever = QdrantRetriever
# C. Ranker = TransformersSimilarityRanker
# D. Builder = PromptBuilder
# E. Generator = OpenAIGenerator

# What happens is:
# You ask "When does TUM's 2024 summer semester start?"
# Haystack embeds the query, retrieves candidate documents, and ranks them by semantic similarity to the query
# Then the top results are passed to the LLM as context for answering the query

from haystack import Pipeline
from langdetect import detect
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from qdrant_haystack.retriever import QdrantRetriever
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# English 
english_pipeline = Pipeline()
english_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
english_pipeline.add_component("retriever", QdrantRetriever(document_store=document_store_en, top_k=50))
english_pipeline.add_component("ranker", TransformersSimilarityRanker())
english_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
english_pipeline.connect("retriever.documents", "ranker.documents")

# German
german_pipeline = Pipeline()
german_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model_name_or_path="PM-AI/bi-encoder_msmarco_bert-base_german"))
german_pipeline.add_component("retriever", QdrantRetriever(document_store=document_store_de, top_k=50))
german_pipeline.add_component("ranker", TransformersSimilarityRanker())
german_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
german_pipeline.connect("retriever.documents", "ranker.documents")


template_en = """
Given the following information, answer the question.

Context: 
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}?
"""

template_de = """
Gegeben die folgenden Informationen, beantworte die Frage.

Kontext:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Frage: {{ query }}?
"""


english_pipeline.add_component("prompt_builder", PromptBuilder(template=template_en))
english_pipeline.add_component("llm", OpenAIGenerator(api_key=api_key))
english_pipeline.connect("ranker", "prompt_builder.documents")
english_pipeline.connect("prompt_builder", "llm")

german_pipeline.add_component("prompt_builder", PromptBuilder(template=template_de))
german_pipeline.add_component("llm", OpenAIGenerator(api_key=api_key))
german_pipeline.connect("ranker", "prompt_builder.documents")
german_pipeline.connect("prompt_builder", "llm")



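def ask_rag(query: str):
    """Answer the query with the German pipeline if it is detected as German, else the English one."""
    language = detect(query)
    # Every language other than German falls back to the English pipeline
    rag_pipeline = german_pipeline if language == "de" else english_pipeline

    result = rag_pipeline.run(
        {
            "text_embedder": {"text": query},
            "ranker": {"query": query, "top_k": 10},
            "prompt_builder": {"query": query},
        }
    )

    return result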
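
Language detection on short queries can fail or return a language that neither pipeline supports. The sketch below wraps the detection step with a fallback. The detector is passed in as a function so the example runs without `langdetect`; `choose_language` and its parameters are hypothetical names.

```python
from typing import Callable, Sequence


def choose_language(
    query: str,
    detector: Callable[[str], str],
    supported: Sequence[str] = ("en", "de"),
    default: str = "en",
) -> str:
    """Return a supported language code for the query, or `default` if unsure."""
    try:
        language = detector(query)
    except Exception:
        # Detectors such as langdetect raise on inputs without usable
        # features (for example an empty string); fall back instead.
        return default
    return language if language in supported else default


def failing_detector(text: str) -> str:
    raise ValueError("no features in text")


print(choose_language("Wann beginnt das Semester", lambda text: "de"))
print(choose_language("", failing_detector))
```

In the notebook, `choose_language(query, detect)` could take the place of the direct `detect(query)` call. `langdetect` is also non-deterministic by default; its README suggests setting `DetectorFactory.seed = 0` for reproducible results.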



Now go ahead and ask your questions about TUM CIT.

Some examples:
 - When does the 2024 summer semester start at TUM
 - What are examples for elective modules
 - Wann wählt man das Anwendungsfach (German: "When do you choose the application subject")
 - Was macht Claudia Scheimbauer (German: "What does Claudia Scheimbauer do")

In [87]:
result = ask_rag("What are examples for elective modules")

Batches: 100%|██████████| 1/1 [00:00<00:00, 25.31it/s]


In [88]:
print(result["llm"]["replies"][0])
# print(result)

Some examples of elective modules mentioned in the given information include:

- Advanced Topics in Communications Systems (5 ECTS)
- Advanced Topics in Communications Electronics (5 ECTS)
- Elective Modules in Informatics (12 credits required)
- Elective Modules in Management (18 credits required)
- Support Electives (9 credits required)
- Applied Mathematics courses (2 out of 3 courses must be taken)
- D and E catalog courses in Computer Science (at least 23 ECTS required)
- MSNE Make-Up Modules (for MSNE students lacking specific engineering-related background knowledge)

Please note that these are just a few examples, and there may be more elective modules available.


# Pipeline

Data generation (Scrapy or Beautiful Soup)

<pre>
   ;;;;;   
   ;;;;;   
   ;;;;;   
   ;;;;;  
 ..;;;;;..    
  ':::::'   
    ':`
</pre>

Data processing:
 - Converters = HTMLToDocument
 - Preprocessors = DocumentCleaner, DocumentSplitter
 - Classifier = DocumentLanguageClassifier
 - Router = MetadataRouter
 - Embedder = SentenceTransformersDocumentEmbedder
 - Writer = DocumentWriter
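
To make the chunking step concrete, the sketch below approximates what `DocumentSplitter(split_by="sentence", split_length=10)` does: it splits text into sentences and groups a fixed number of them per chunk. It uses a naive regular expression and is not Haystack's implementation, so its boundaries may differ. The function name `split_into_chunks` is hypothetical.

```python
import re


def split_into_chunks(text: str, split_length: int = 10) -> list[str]:
    """Split text into naive sentences and group `split_length` of them per chunk."""
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


text = "One. Two! Three? Four. Five."
print(split_into_chunks(text, split_length=2))
# → ['One. Two!', 'Three? Four.', 'Five.']
```

A smaller `split_length` gives more focused chunks with less surrounding context per retrieved document.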
 
<pre>
   ;;;;;   
   ;;;;;   
   ;;;;;   
   ;;;;;  
 ..;;;;;..    
  ':::::'   
    ':`
</pre>

Retrieval, ranking, and QA:
 - Embedder = SentenceTransformersTextEmbedder
 - Retriever = QdrantRetriever
 - Ranker = TransformersSimilarityRanker
 - Builder = PromptBuilder
 - Generator = OpenAIGenerator
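
The retriever and ranker form a two-stage search: a cheap embedding similarity picks a larger candidate set (`top_k=50` in the notebook), and a second scorer reorders only those candidates (`top_k=10`). The sketch below shows the idea in plain Python with toy data. The word-overlap scorer is only a stand-in for the ranking model, and all names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_embedding, documents, rerank_score, retrieve_k=50, rank_k=10):
    """Stage 1: top `retrieve_k` by embedding similarity. Stage 2: reorder with `rerank_score`."""
    candidates = sorted(
        documents,
        key=lambda doc: cosine(query_embedding, doc["embedding"]),
        reverse=True,
    )[:retrieve_k]
    return sorted(candidates, key=rerank_score, reverse=True)[:rank_k]


query_words = {"summer", "semester", "start"}
documents = [
    {"content": "semester dates", "embedding": [1.0, 0.0]},
    {"content": "summer semester start dates", "embedding": [0.9, 0.1]},
    {"content": "summer semester start", "embedding": [0.0, 1.0]},
]


def overlap(doc: dict) -> int:
    """Toy reranking score: number of query words that appear in the document."""
    return len(query_words & set(doc["content"].split()))


top = retrieve_then_rerank([1.0, 0.0], documents, overlap, retrieve_k=2, rank_k=1)
print(top[0]["content"])
```

The third document would score well in the second stage but never reaches it, because the ranker only sees what the retriever returns. This is why the retrieval `top_k` is set larger than the ranking `top_k`.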


# Closing Words
Possible improvements to this pipeline include:
 - Discarding low-quality documents (e.g. text from cookie pop-ups)
 - Tuning the parameters (document size, retrieval top_k, ranking top_k, LLM prompt)
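
As a rough illustration of the first point, the sketch below flags chunks that are very short or contain a known boilerplate phrase. The marker list, the word threshold, and the name `is_boilerplate` are all assumptions that would need tuning against the actual scraped pages.

```python
# Hypothetical marker phrases; extend after inspecting the scraped data
BOILERPLATE_MARKERS = ("google custom search", "cookie")


def is_boilerplate(content: str, markers=BOILERPLATE_MARKERS, min_words: int = 5) -> bool:
    """Flag chunks that are very short or contain a known boilerplate phrase."""
    lowered = content.lower()
    if len(lowered.split()) < min_words:
        return True
    return any(marker in lowered for marker in markers)


chunks = [
    "Google Custom Search Wir verwenden Google für unsere Suche.",
    "Elective modules in Informatics require 12 credits.",
    "Menu",
]
kept = [chunk for chunk in chunks if not is_boilerplate(chunk)]
print(kept)
```

A filter like this could run between the splitter and the language classifier, for example wrapped in a custom pipeline component, which is not shown here.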

