# RAG system for TUM School of Computation, Information, and Technology

This project showcases the Haystack 2.0 API, with a focus on its pipelines. The notebook consists of three parts:
 1. Scraping the cit.tum.de website using Beautiful Soup
 2. Processing the HTML text, chunking it, and writing the chunks to different document stores
 3. A retrieval-augmented question answering system that handles English and German input differently

The first step is really slow. If you want to see this system at its full capability, please refer to [this repository](https://github.com/Ferdydh/haystack-rag-showcase) instead. There, I crawl with Scrapy instead of Beautiful Soup. Scrapy can't be easily used without access to local disk, which is why this example uses Beautiful Soup.
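
One likely contributor to the crawl time is that the `can_fetch` helper in the crawling cell downloads `robots.txt` again for every URL it checks. The sketch below shows one way to cache the parsed file per origin, using only the standard library. The names `RobotsCache` and `fetch_robots_lines` are hypothetical and not part of the notebook, and treating an unreachable `robots.txt` as "no restrictions" is an assumption.

```python
import urllib.request
from typing import Callable, Dict, List
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def fetch_robots_lines(origin: str) -> List[str]:
    """Download robots.txt for an origin such as 'https://www.cit.tum.de'."""
    try:
        with urllib.request.urlopen(f"{origin}/robots.txt", timeout=10) as response:
            return response.read().decode("utf-8", errors="replace").splitlines()
    except OSError:
        # Assumption: an unreachable robots.txt means "no restrictions".
        return []


class RobotsCache:
    """Parses robots.txt once per origin and reuses the parser afterwards."""

    def __init__(self, loader: Callable[[str], List[str]] = fetch_robots_lines):
        self._loader = loader
        self._parsers: Dict[str, RobotFileParser] = {}

    def can_fetch(self, url: str, user_agent: str = "*") -> bool:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # First URL for this origin: load and parse robots.txt once.
            parser = RobotFileParser()
            parser.parse(self._loader(origin))
            self._parsers[origin] = parser
        return parser.can_fetch(user_agent, url)
```

The loader is injectable so the cache can be tried without network access, for example `RobotsCache(loader=lambda origin: ["User-agent: *", "Disallow: /private"])`.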

In [2]:
# Insert your OpenAI API key between the quotes
api_key = ""

# Google Colab has limited computing power, so pick one of the settings below
number_of_fetched_websites = 3 # Use this at the start to make sure everything runs perfectly
# number_of_fetched_websites = 10 # Use this to play around with the system
# number_of_fetched_websites = 100 # "I want to see if this can handle more"
# number_of_fetched_websites = 900 # If you have the resources, just get all sites


In [None]:
# Install the required modules
!pip install haystack-ai==2.0.0b4
!pip install sentence-transformers langdetect beautifulsoup4 requests

In [3]:
# 1. Acquiring the Data
# If you only care about the Haystack part, skip this block
# Starting from cit.tum.de, it scrapes the HTML of each page it reaches
# It collects the relevant outgoing links of every page into a list and works through that list iteratively
# Each page is saved as {url, content} for use in part 2 (data processing)
# The crawl runs through a thread pool and tracks visited URLs in a set to avoid refetching pages

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin, urlunparse
from concurrent.futures import ThreadPoolExecutor
from urllib.robotparser import RobotFileParser
from tqdm import tqdm
import logging
import urllib3
from haystack.dataclasses import ByteStream

# Set the logging level for urllib3 to WARNING to suppress debug messages
urllib3_logger = logging.getLogger('urllib3')
urllib3_logger.setLevel(logging.WARNING)

def can_fetch(url, user_agent='*'):
    rp = RobotFileParser()
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()

    return rp.can_fetch(user_agent, url)

def is_media_url(url):
    media_extensions = ['.pdf', '.jpg', '.jpeg', '.png', '.gif']
    return any(url.lower().endswith(ext) for ext in media_extensions)

result = []

def crawl_website(root_url, max_workers=15, fetch_count=100):
    search_domain = urlparse(root_url).hostname
    visited_urls = set()
    stack = [root_url]
    visited_count = 0

    def process_page(url):
        try:
            # A timeout keeps one unresponsive server from stalling the crawl
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            visited_urls.add(url)

            # Process the page here, e.g., extract information or save data
            result.append({"url":url, "content": ByteStream.from_string(response.text)})

            # Find and collect outgoing links
            outgoing_links = [urlparse(urljoin(url, link['href'])) for link in soup.find_all('a', href=True)]
            outgoing_stripped_urls = [urlunparse((next_url.scheme, next_url.netloc, next_url.path, '', '', '')) for next_url in outgoing_links]

            # Filter out media URLs
            non_media_links = [outgoing_url for outgoing_url in outgoing_stripped_urls if not is_media_url(outgoing_url)]

            return [non_media_url for non_media_url in non_media_links if search_domain in non_media_url]

        except Exception as e:
            print(f"Error crawling {url}: {e}")
            return []

    with ThreadPoolExecutor(max_workers) as executor, tqdm(total=fetch_count) as pbar:
        while stack:
            if (visited_count >= fetch_count):
                return

            current_url = stack.pop()

            if current_url in visited_urls or search_domain not in current_url or not can_fetch(current_url):
                continue

            # Submit the page to the thread pool
            future = executor.submit(process_page, current_url)

            # Wait for the page and collect its outgoing links
            # (this blocks, so pages are effectively fetched one at a time)
            outgoing_links = future.result()

            # Add outgoing links to the stack
            stack.extend(filter(lambda url: url not in visited_urls, outgoing_links))

            visited_count += 1
            pbar.update(1)

crawl_website('https://www.cit.tum.de', fetch_count=number_of_fetched_websites)


100%|██████████| 5/5 [00:07<00:00,  1.50s/it]
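
The crawler above keeps a link when the root hostname appears anywhere in the URL string, which also matches unrelated URLs that merely mention the hostname in a path or query. A stricter check compares the parsed hostname instead. The following is a minimal sketch using only `urllib.parse`. The helper names `normalize_link` and `is_same_site` are hypothetical, and accepting subdomains of the root host is an assumption.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop the query string and fragment."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_same_site(url: str, root_host: str) -> bool:
    """True if the URL's hostname is root_host or one of its subdomains."""
    # urlsplit lowercases the hostname; links without one (e.g. mailto:) yield None.
    host = urlsplit(url).hostname or ""
    return host == root_host or host.endswith("." + root_host)


# Usage: only the first URL is actually hosted on www.cit.tum.de.
root = "www.cit.tum.de"
print(is_same_site("https://www.cit.tum.de/en/", root))
print(is_same_site("https://example.org/?next=www.cit.tum.de", root))
```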


In [5]:
# Classifier
# Routers


# 2. Processing the Data
# This step takes the HTML collected in the previous step as a list of {url, content} dicts

# It uses several Haystack pipeline components:
# A. Converter = HTMLToDocument
# B. Preprocessors = DocumentCleaner, DocumentSplitter
# C. Classifier = DocumentLanguageClassifier
# D. Router = MetadataRouter
# E. Embedder = SentenceTransformersDocumentEmbedder
# F. Writer = DocumentWriter

# In short, this block:
# Converts the HTML using A
# Processes (cleans and splits) it using B
# Classifies the language using C
# Routes each document by language (German or English) using D
# Embeds the documents with a language-specific embedder using E
# And finally writes them to a separate DocumentStore per language using F

from haystack import Pipeline
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Document Store
document_store_en = InMemoryDocumentStore()
document_store_de = InMemoryDocumentStore()

en_writer = DocumentWriter(document_store = document_store_en)
de_writer = DocumentWriter(document_store = document_store_de)


# Data pipeline
pipeline = Pipeline()
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=30))


# Language Pipeline Components
document_classifier = DocumentLanguageClassifier(languages = ["en", "de", "id"])
metadata_router = MetadataRouter(rules={"en": {"language": {"$eq": "en"}}, "de": {"language": {"$eq": "de"}}})
english_embedder = SentenceTransformersDocumentEmbedder()
german_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="PM-AI/bi-encoder_msmarco_bert-base_german")

pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.add_component(instance=metadata_router, name="metadata_router")
pipeline.add_component(instance=english_embedder, name="english_embedder")
pipeline.add_component(instance=german_embedder, name="german_embedder")
pipeline.add_component(instance=en_writer, name="en_writer")
pipeline.add_component(instance=de_writer, name="de_writer")

# Connect all
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "document_classifier.documents")

pipeline.connect("document_classifier.documents", "metadata_router.documents")
pipeline.connect("metadata_router.en", "english_embedder.documents")
pipeline.connect("metadata_router.de", "german_embedder.documents")
pipeline.connect("english_embedder", "en_writer")
pipeline.connect("german_embedder", "de_writer")


pipeline.run(
    {
        "converter": {
            "sources": [x["content"] for x in result],
            "meta": [{"url":x["url"]} for x in result]
            }
        })

{'metadata_router': {'unmatched': []},
 'en_writer': {'documents_written': 0},
 'de_writer': {'documents_written': 10}}
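
In the run above, all 10 chunks were routed to the German store and none to the English one. The routing rules only name `en` and `de`, so a chunk carrying any other language label ends up in `unmatched`. The plain-Python sketch below illustrates that idea on dictionaries. It is not Haystack's implementation, and `route_by_language` is a hypothetical helper.

```python
from typing import Dict, List


def route_by_language(documents: List[dict], languages: List[str]) -> Dict[str, List[dict]]:
    """Group documents by meta['language']; anything else goes to 'unmatched'."""
    routed: Dict[str, List[dict]] = {language: [] for language in languages}
    routed["unmatched"] = []
    for doc in documents:
        language = doc.get("meta", {}).get("language")
        key = language if language in languages else "unmatched"
        routed[key].append(doc)
    return routed


# Usage: the Indonesian document has no matching rule and lands in 'unmatched'.
docs = [
    {"content": "Willkommen an der TUM", "meta": {"language": "de"}},
    {"content": "Welcome to TUM", "meta": {"language": "en"}},
    {"content": "Selamat datang", "meta": {"language": "id"}},
]
routed = route_by_language(docs, ["en", "de"])
print({key: len(value) for key, value in routed.items()})
```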

In [6]:
# Rankers
# Retrievers
# GPTGenerator

# 3. Answer questions
# Now the system is almost ready to answer your questions about TUM CIT,
# such as "When does TUM's 2024 summer semester start?"
# or "Who is TUM's president?"

# To do this, it uses the pipeline components:
# A. Embedder = SentenceTransformersTextEmbedder
# B. Retriever = InMemoryEmbeddingRetriever
# C. Ranker = TransformersSimilarityRanker
# D. Builder = PromptBuilder
# E. Generator = OpenAIGenerator

# What happens is:
# You ask "When does TUM's 2024 summer semester start?"
# Haystack first retrieves candidate documents by embedding similarity, then ranks them by semantic similarity to the query
# The top results are then passed to the LLM as context for answering the query

from haystack import Pipeline
from langdetect import detect
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# English
english_pipeline = Pipeline()
english_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
english_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store_en, top_k=50))
english_pipeline.add_component("ranker", TransformersSimilarityRanker())
english_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
english_pipeline.connect("retriever.documents", "ranker.documents")

# German
german_pipeline = Pipeline()
german_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model_name_or_path="PM-AI/bi-encoder_msmarco_bert-base_german"))
german_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store_de, top_k=50))
german_pipeline.add_component("ranker", TransformersSimilarityRanker())
german_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
german_pipeline.connect("retriever.documents", "ranker.documents")


template_en = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}?
"""

template_de = """
Gegeben die folgenden Informationen, beantworte die Frage.

Kontext:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Frage: {{ query }}?
"""


english_pipeline.add_component("prompt_builder", PromptBuilder(template=template_en))
english_pipeline.add_component("llm", OpenAIGenerator(api_key=api_key))
english_pipeline.connect("ranker", "prompt_builder.documents")
english_pipeline.connect("prompt_builder", "llm")

german_pipeline.add_component("prompt_builder", PromptBuilder(template=template_de))
german_pipeline.add_component("llm", OpenAIGenerator(api_key=api_key))
german_pipeline.connect("ranker", "prompt_builder.documents")
german_pipeline.connect("prompt_builder", "llm")


# English and German queries go to different pipelines
def ask_rag(query: str):
    # langdetect returns a language code such as "de" or "en"
    language = detect(query)
    # Anything not detected as German falls back to the English pipeline
    rag_pipeline = german_pipeline if language == "de" else english_pipeline

    return rag_pipeline.run(
        {
            "text_embedder": {"text": query},
            "ranker": {"query": query, "top_k": 3},
            "prompt_builder": {"query": query},
        }
    )



Now go ahead! Ask your questions about TUM CIT.

*Note: an answer may not be found even though it exists on the TUM CIT website, because not enough pages were loaded.

In [7]:
result = ask_rag("When is Summer 2024's semester start for TUM")


In [8]:
print(result["llm"]["replies"][0])
# print(result)

To determine the start date of Summer 2024 semester at TUM, it is necessary to refer to the academic calendar of the university or contact the relevant department directly.
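
The question-answering pipelines retrieve up to 50 candidates by embedding similarity and then keep the 3 best after reranking. The toy sketch below illustrates that two-stage pattern with hand-written vectors and a stand-in scoring function. It is not how Haystack implements its retriever or ranker, and `retrieve_then_rerank`, `cosine`, and the `rerank_score` callback are hypothetical names.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    rerank_score: Callable[[str], float],
    retrieve_k: int = 50,
    final_k: int = 3,
) -> List[str]:
    # Stage 1: cheap vector similarity narrows the corpus to retrieve_k candidates.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:retrieve_k]
    # Stage 2: a (usually more expensive) scorer reorders only those candidates.
    reranked = sorted(candidates, key=lambda d: rerank_score(d[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


# Usage with made-up data: the stand-in scorer prefers texts that mention "fees".
docs = [
    ("semester dates", [1.0, 0.0]),
    ("canteen menu", [0.0, 1.0]),
    ("semester fees", [0.9, 0.1]),
]
print(retrieve_then_rerank([1.0, 0.0], docs, lambda text: float("fees" in text), retrieve_k=2, final_k=2))
```

The second stage can only reorder what the first stage returned, so a document missed by retrieval cannot be recovered by the ranker.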


# Pipeline

Data generation (Scrapy or Beautiful Soup)

<pre>
   ;;;;;   
   ;;;;;   
   ;;;;;   
   ;;;;;  
 ..;;;;;..    
  ':::::'   
    ':`
</pre>

Data processing:
 - Converters = HTMLToDocument
 - Preprocessors = DocumentCleaner, DocumentSplitter
 - Classifier = DocumentLanguageClassifier
 - Router = MetadataRouter
 - Embedder = SentenceTransformersDocumentEmbedder
 - Writer = DocumentWriter
 
<pre>
   ;;;;;   
   ;;;;;   
   ;;;;;   
   ;;;;;  
 ..;;;;;..    
  ':::::'   
    ':`
</pre>

Retrieval, ranking, and QA:
 - Embedder = SentenceTransformersTextEmbedder
 - Retriever = InMemoryEmbeddingRetriever
 - Ranker = TransformersSimilarityRanker
 - Builder = PromptBuilder
 - Generator = OpenAIGenerator
