#### General workflow with LangChain
- [LangChain Guide Reference](https://python.langchain.com/docs/how_to/#output-parsers)  
    1. Search: Query to url (e.g., using GoogleSearchAPIWrapper)   
    2. [Document Loader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/): load data from various sources (websites, databases, YouTube, arXiv) and of various types (PDF, HTML, JSON, PowerPoint, etc.) as a list of [```document objects```](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)  
        - Example: [AsyncChromiumLoader](https://python.langchain.com/docs/integrations/document_loaders/async_chromium/) / [AsyncHtmlLoader](https://python.langchain.com/docs/integrations/document_loaders/async_html/) - lightweight, loads raw HTML from a list of URLs concurrently
        - Possibly substitute the current crawler and transformer to make use of document metadata (read [HTML Loader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/html/), see also [Semantic Chunker: LangChain - How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)):  
            - Spider crawler, FireCrawl (subpage search, markdown output), AzureAIDocumentIntelligenceLoader (supports different file formats)
    3. [Transformer](https://python.langchain.com/docs/integrations/document_transformers/)
        - Example: [HTML2Text](https://python.langchain.com/v0.1/docs/integrations/document_transformers/html2text/): HTML2Text provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation. It is best suited to scenarios where the goal is to extract human-readable text without manipulating specific HTML elements.
    4. [Text Splitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/): Split up text into chunks. 
        - Common choice: [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/) ( [Blog : Understanding RecursiveCharacterTextSplitter](https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846) )
        - [Chunk visualization](https://chunkviz.up.railway.app)
        - [Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)
        - Further reading:
            - [Semantic Chunker: LangChain - How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)   
            - [How Chunk Sizes Affect Semantic Retrieval Results](https://ai.plainenglish.io/investigating-chunk-size-on-semantic-results-b465867d8ca1)
            - [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) - ★
    5. [Embeddings Models](https://python.langchain.com/docs/how_to/embed_text/)
        - Common choice: OpenAIEmbeddings (paid, better to use a cache), HuggingFace (open source; BGE, Mistral)
            - [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings#what-are-embeddings), [Huggingface Embeddings Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
        - Cache: [Caching](https://python.langchain.com/docs/how_to/caching_embeddings/), [LocalFileStore](https://python.langchain.com/api_reference/langchain/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html), [CacheBackedEmbeddings](https://python.langchain.com/api_reference/langchain/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html)
    
    6. [Vector Store](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/):
        - Common choice: Local: Chroma, FAISS (for us, FAISS can be a good choice) / Cloud: Pinecone, Weaviate, ElasticSearch / etc.: Lance, Qdrant (for asynchronous operations)
        - Vector store queries
            - Similarity search / Similarity search by vector / Maximum marginal relevance search / Asynchronous operations
            - Possible issue: documents embedded with ```.embed_documents``` and queries embedded with ```.embed_query``` may end up using different model versions after an embedding model update, so the model version should be pinned (version control needed)
        - [TODO] Understanding Indexing of vector store
    7. [TODO] [Retrievers Explanation](https://python.langchain.com/v0.1/docs/modules/data_connection/) / [Retrievers](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/): A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well. Retrievers accept a string query as input and return a list of Documents as output.
        - [TODO] read [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) 
        - Eval Framework ([langchain evals](https://python.langchain.com/docs/guides/evaluation/) / [llama index evals](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/) / [ragas evals](https://github.com/explodinggradients/ragas) ) 

- [Research Automation](https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)
- [RAG Architecture](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)    
- [YouTube: RAG & Agentic RAG](https://www.youtube.com/watch?v=hKfQ-0jLw3I)
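
The retriever contract described in step 7 (a string query goes in, a list of documents comes out) can be sketched without any vector store. The following is a minimal illustration only, not LangChain's own retriever class; `Doc`, `Retriever`, and `KeywordRetriever` are hypothetical names introduced here, and the sample texts are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Retriever(Protocol):
    """Anything that maps a string query to a list of documents."""
    def retrieve(self, query: str, k: int = 2) -> List[Doc]: ...


class KeywordRetriever:
    """Ranks documents by how many query words they contain (no embeddings)."""

    def __init__(self, docs: List[Doc]):
        self.docs = docs

    def retrieve(self, query: str, k: int = 2) -> List[Doc]:
        query_words = set(query.lower().split())
        scored = []
        for doc in self.docs:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; list.sort is stable, so ties keep input order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]


docs = [
    Doc("Kfz-Versicherung für Auto und Motorrad", {"source": "a"}),
    Doc("Hausratversicherung für Ihr Zuhause", {"source": "b"}),
    Doc("Motorrad Versicherung online berechnen", {"source": "c"}),
]
retriever: Retriever = KeywordRetriever(docs)
results = retriever.retrieve("motorrad versicherung")
print([d.metadata["source"] for d in results])
```

A vector-store-backed retriever would satisfy the same `retrieve` signature; only the ranking logic would change.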
  
#### Further Steps 
Read carefully and get a grasp of which tools to use before I proceed 
- [Agent Approaches: Plan-Execute Agents](https://blog.langchain.dev/planning-agents/), [structured chat](https://python.langchain.com/v0.1/docs/modules/agents/agent_types/structured_chat/), ReAct( [Paper: ReAct](https://arxiv.org/abs/2210.03629), [Blog: ReAct](https://dottxt-ai.github.io/outlines/latest/cookbook/react_agent/), [How to create a ReAct agent from scratch](https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/) ), [Tagging](https://python.langchain.com/docs/tutorials/classification/), schema & function design([tool calling/binding](https://python.langchain.com/docs/concepts/tool_calling/))   
- Multi Agent Approaches: LangGraph([A Comprehensive Guide about LangGraph](https://www.ionio.ai/blog/a-comprehensive-guide-about-langgraph-code-included)), Autogen ( Simple Version )
- Complementary Tools: BERTopic   
- Implement in Kedro / Streamlit 

#### Optional references
- [outlines](https://github.com/dottxt-ai/outlines)  
- [Langsmith Prompt Library](https://smith.langchain.com/hub) / [OpenAI - Prompt Engineering](https://platform.openai.com/docs/guides/prompt-engineering)    

##### etc 
[Tool List](https://python.langchain.com/v0.1/docs/integrations/tools/)    
[Langchain setup](https://python.langchain.com/v0.1/docs/get_started/installation/)    
[Paper: Adaptive-RAG ( Routing )](https://arxiv.org/abs/2403.14403)    
[Paper: Corrective-RAG ( Fallback )](https://arxiv.org/abs/2401.15884)  
[Paper: Self-RAG (Self-Correction)](https://arxiv.org/abs/2310.11511)  
[Video: RAPTOR - For really long texts](https://www.youtube.com/watch?v=gcdkISrpMCA)  


### 1. Crawl HTML files and inner URLs 

In [3]:
input_companies = [ { 'name' : 'AXA Germany', 'url' : 'https://www.axa.de' },
              { 'name' : 'HUK-COBURG', 'url' : 'https://www.huk.de' },
              { 'name' : 'Generali', 'url' : 'https://www.generali.de' } 
              ]

In [4]:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, quote

def crawl_company_websites(companies, output_dir='crawls', max_pages=200):
    os.makedirs(output_dir, exist_ok=True)
    crawl_results = []

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
    }

    for company in companies:
        company_name = company['name'].lower().replace(' ', '_')
        company_dir = os.path.join(output_dir, company_name)
        os.makedirs(company_dir, exist_ok=True)

        start_url = company['url']
        visited_urls = set()
        to_visit = [start_url]
        pages_crawled = 0

        try:
            while to_visit and pages_crawled < max_pages:
                url = to_visit.pop(0)
                if url in visited_urls:
                    continue

                # A timeout keeps one unresponsive page from blocking the whole crawl
                response = requests.get(url, headers=headers, timeout=10)
                response.raise_for_status()

                page_filename = generate_filename_from_url(url)
                page_path = os.path.join(company_dir, page_filename)
                
                
                with open(page_path, 'w', encoding='utf-8') as file:
                    file.write(response.text)

                soup = BeautifulSoup(response.content, 'html.parser')
                visited_urls.add(url)
                pages_crawled += 1

                for link in soup.find_all('a', href=True):
                    full_url = urljoin(url, link['href'])
                    if is_internal_link(start_url, full_url) and full_url not in visited_urls:
                        to_visit.append(full_url)
                        
            # added url_list for loader experiment.
            crawl_results.append({'company': company['name'], 'pages_crawled': pages_crawled, 'output_dir': company_dir, 'url_list': list(visited_urls)})
        except Exception as e:
            print(f"Error crawling {company['name']}: {e}")
            # added url_list 
            crawl_results.append({'company': company['name'], 'pages_crawled': pages_crawled, 'output_dir': None, 'error': str(e), 'url_list': list(visited_urls)})

        
    return crawl_results

def is_internal_link(base_url, test_url):
    base_domain = urlparse(base_url).netloc
    test_domain = urlparse(test_url).netloc
    return base_domain == test_domain

def generate_filename_from_url(url):
    parsed_url = urlparse(url)
    # Strip slashes first so that both "" and "/" fall back to "home"
    path = parsed_url.path.strip("/").replace("/", "_") or "home"
    query = parsed_url.query
    if query:
        path += "_" + quote(query, safe="")
    filename = f"{path}.html"
    return filename
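
One gap in the crawler above is that a URL with and without a `#fragment` or trailing slash counts as two different pages, so the same HTML can be fetched twice. A possible remedy, sketched under the assumption that fragments and trailing slashes never change the page content on these sites, is to normalize each URL before adding it to the queue. `normalize_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urldefrag, urlparse, urlunparse


def normalize_url(url: str) -> str:
    """Drop the #fragment and any trailing slash so equivalent URLs compare equal."""
    url_without_fragment, _ = urldefrag(url)
    parts = urlparse(url_without_fragment)
    path = parts.path.rstrip("/")
    # Lower-case scheme and host only; paths can be case-sensitive
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, parts.query, ""))


seen = {normalize_url(u) for u in [
    "https://www.axa.de/kontakt/formulare-download",
    "https://www.axa.de/kontakt/formulare-download/",
    "https://www.axa.de/kontakt/formulare-download#top",
    "https://WWW.AXA.de/kontakt/formulare-download?x=1",
]}
print(sorted(seen))
```

In the crawl loop, the normalized form would be the one stored in `visited_urls` and `to_visit`.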


In [5]:
# For experiment : adjust the max_pages
crawl_results = crawl_company_websites(input_companies, max_pages=200) 

KeyboardInterrupt: 

In [13]:
crawl_results[0]

{'company': 'AXA Germany',
 'pages_crawled': 200,
 'output_dir': 'crawls/axa_germany',
 'url_list': ['https://www.axa.de/kontakt/formulare-download',
  'https://www.axa.de/karriere/new-way-of-working',
  'https://www.axa.de/pk/kfz/p/motorradversicherung',
  'https://www.axa.de/geschaeftskunden/kautionsversicherung',
  'https://www.axa.de/site/axa-de/get/documents_E-1395595287/axade/medien/medien/axa-social-media/gewinnspiel-teilnahmebedingungen-social-media-axa-konzern-ag.pdf',
  'https://www.axa.de/pk/gesundheit/r/pflegeratgeber',
  'https://www.axa.de/schadenservice-360/schadenmeldung',
  'https://www.axa.de/pk/gesundheit/s/gesundheitsservice',
  'https://www.axa.de/geschaeftskunden/gastronomie',
  'https://www.axa.de/pk/altersvorsorge/p/betriebliche-altersvorsorge',
  'https://www.axa.de/presse/pm-axa-partner-uefa-womens-euro-2025',
  'https://www.axa.de/presse/pm-fondsrente-justinvest',
  'https://www.axa.de/pk/haftpflicht/p/verkehrsrechtsschutz',
  'https://www.axa.de/wir-ueber-un

### 2. Transform
[Transformer](https://python.langchain.com/docs/integrations/document_transformers/)
- Example: [HTML2Text](https://python.langchain.com/v0.1/docs/integrations/document_transformers/html2text/): HTML2Text provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation. It is best suited to scenarios where the goal is to extract human-readable text without manipulating specific HTML elements.
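
To make "extract human-readable text" concrete, the sketch below strips tags with the standard library's `html.parser`. It is a much cruder stand-in for `Html2TextTransformer` (no markdown-like formatting, no link handling) and is not how that transformer is implemented; `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(raw_html: str) -> str:
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Kfz-Versicherung</h1><p>Jetzt <b>berechnen</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```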

In [14]:
import os
from langchain.schema import Document  # Import the Document class
from langchain_community.document_transformers import Html2TextTransformer
from typing import Dict, List


def transform_html_to_plain_text(output_dir='cleansed') -> Dict[str, List[Document]]:
    os.makedirs(output_dir, exist_ok=True)
    docs_transformed = {}
    html_document_objects = get_document_objects()

    # Define the transformer
    html2text = Html2TextTransformer()

    for company, document_list in html_document_objects.items():
        company_dir = os.path.join(output_dir, company)
        os.makedirs(company_dir, exist_ok=True)

        # Transform the raw HTML documents into plain text
        docs_transformed[company] = html2text.transform_documents(document_list)

        # Save the cleansed text, one file per page
        for i, doc_transformed in enumerate(docs_transformed[company]):
            page_path = os.path.join(company_dir, company + "_cleansed_" + str(i) + ".txt")
            with open(page_path, 'w', encoding='utf-8') as file:
                file.write(doc_transformed.page_content)

    return docs_transformed


def get_document_objects(input_dir='crawls') -> Dict[str, List[Document]]:
    # Dictionary to store the raw HTML documents for each company
    html_document_objects = {}

    # Loop through each company folder inside input_dir
    for company in os.listdir(input_dir):
        company_folder = os.path.join(input_dir, company)

        # Check if it is a directory
        if os.path.isdir(company_folder):
            html_document_objects[company] = []

            # Loop through HTML files in the company's folder
            for html_file in os.listdir(company_folder):
                file_path = os.path.join(company_folder, html_file)

                # Ensure it's an HTML file
                if html_file.endswith(".html"):
                    with open(file_path, "r", encoding="utf-8") as file:
                        raw_html = file.read()
                        # Create a Document object for each HTML file
                        document = Document(page_content=raw_html, metadata={"source": file_path})
                        html_document_objects[company].append(document)

    return html_document_objects


In [15]:
transformed_docs = transform_html_to_plain_text()

In [16]:
for company, document_list in transformed_docs.items():
    # Print the first transformed content of each company 
    print(f"--- {company} ---")
    print(document_list[0].page_content[:1000])

--- generali ---
  * Privatkunden 
  * Geschäftskunden 

  * Journal 
  * Berater finden 
  * Service & Kontakt 

Suchen

  * Rundum-Schutz
  * Fahrzeug & Zuhause
  * Gesundheit & Freizeit
  * Recht & Haftung
  * Vorsorge & Finanzen

Rundum-Schutz

  * Vermögenssicherungspolice 
  * Vermögensaufbau & Sicherheitsplan 
  * Mein Zukunftsplan 
  * Mein Pflegeschutz 

Young Line

  * Young & Drive 
  * Young & Home 
  * Young & Life 
  * Young & Law 
  * Vermögensaufbau4you 

**Vermögenssicherungspolice**  
Rundum geschützt durchs Leben

mehr erfahren

Fahrzeug

  * Kfz-Versicherung 
  * Kfz-Schutzbrief 
  * Young & Drive 
  * Elektro-Fahrzeug 
  * Digitale Pannenhilfe 
  * Fahrer-Mobilitätsschutz 
  * Oldtimer Optimal 
  * Motorradversicherung 
  * Moped & E-Scooter 

Zuhause

  * Hausratversicherung 
  * Wohngebäudeversicherung 
  * Glasversicherung 
  * Haus- und Wohnungsschutzbrief 
  * Konto- und Finanzschutzbrief 
  * Photovoltaikversicherung 
  * Kunstversicherung 
  * Bauversicherun

### 3. Splitter
[Text Splitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/): Split up text into chunks. 
- Common choice: [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/) ( [Blog : Understanding RecursiveCharacterTextSplitter](https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846) )
- [Chunk visualization](https://chunkviz.up.railway.app)
- [Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)
- Further reading:
    - [Semantic Chunker: LangChain - How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)   
    - [How Chunk Sizes Affect Semantic Retrieval Results](https://ai.plainenglish.io/investigating-chunk-size-on-semantic-results-b465867d8ca1)
    - [Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
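
The idea behind recursive splitting is to try coarse separators first (paragraphs), and fall back to finer ones (lines, words, characters) only for pieces that are still too long. The sketch below is a simplified illustration of that idea, not the library's implementation: it has no chunk overlap, it drops separators at chunk borders, and `recursive_split` is a hypothetical name.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    The separator list must end with "" so that splitting always terminates.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush what has been collected so far
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


text = "Rundum-Schutz\n\nFahrzeug und Zuhause\n\nGesundheit"
chunks = recursive_split(text, chunk_size=15)
print(chunks)
```

Here the middle paragraph is longer than 15 characters, so only that paragraph is split again at spaces while the other two stay intact.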

In [17]:
import os
from typing import Dict, List

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_transformed_documents(transformed_docs, output_dir='split') -> Dict[str, List[Document]] :
    os.makedirs(output_dir, exist_ok=True)

    transformed_docs_split = {}

    # Define text_splitter: adjust the granularity to our project
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=256, 
        chunk_overlap=10,
        length_function=len,
        is_separator_regex=False,
    )
    
    for company, docs in transformed_docs.items():
        company_dir = os.path.join(output_dir, company)
        os.makedirs(company_dir, exist_ok=True)
        
        transformed_docs_split[company] = text_splitter.split_documents(docs)

        # Save the split chunks, one file per chunk
        for i, doc_transformed_split in enumerate(transformed_docs_split[company]):
            page_path = os.path.join(company_dir, company + "_split" + str(i) + ".txt")
            with open(page_path, 'w', encoding='utf-8') as file:
                file.write(doc_transformed_split.page_content)
        
    return transformed_docs_split
    

In [18]:
docs_split = split_transformed_documents(transformed_docs)

In [19]:
for company, document_list in docs_split.items():
    # Print the number of chunks and the first chunk (up to 1000 characters) of each company
    print(f"--- {company} ---")
    print(f'length of chunks : {len(docs_split[company])}')
    print(document_list[0].page_content[:1000])

--- generali ---
length of chunks : 11758
* Privatkunden 
  * Geschäftskunden 

  * Journal 
  * Berater finden 
  * Service & Kontakt 

Suchen

  * Rundum-Schutz
  * Fahrzeug & Zuhause
  * Gesundheit & Freizeit
  * Recht & Haftung
  * Vorsorge & Finanzen

Rundum-Schutz
--- axa_germany ---
length of chunks : 16552
Bitte aktivieren Sie JavaScript in den Browser-Einstellungen, um diese Seite
nutzen zu konnen.

Privatkunden Geschäftskunden Über AXA Karriere Medien My Axa Login Meine
Gesundheit Login Kontakt
--- huk-coburg ---
length of chunks : 17988
Zum Hauptinhalt überspringen

  * Auto & Mobilität


### 4. Embedding & Vector stores
[Embeddings Models](https://python.langchain.com/docs/how_to/embed_text/)
- Common choice: OpenAIEmbeddings (paid, better to use a cache), HuggingFace (open source; BGE, Mistral)
    - [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings#what-are-embeddings), [Huggingface Embeddings Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
- Cache: [Caching](https://python.langchain.com/docs/how_to/caching_embeddings/), [LocalFileStore](https://python.langchain.com/api_reference/langchain/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html), [CacheBackedEmbeddings](https://python.langchain.com/api_reference/langchain/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html)
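
Why caching matters for a paid embedding model can be shown with a toy example: texts are hashed, the hash (prefixed by a namespace such as the model name) is the cache key, and only unseen texts reach the model. This is a conceptual sketch with an in-memory dict, not the `CacheBackedEmbeddings` implementation; `CachedEmbedder` and `fake_embed` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wraps an embedding function and reuses vectors for texts seen before."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]], namespace: str):
        self.embed_fn = embed_fn
        self.namespace = namespace  # e.g. the model name, so models do not share entries
        self.cache: Dict[str, List[float]] = {}

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # dict.fromkeys removes duplicates while keeping the input order
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.cache]
        if missing:
            # Only texts without a cached vector reach the (paid) model
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(t)] for t in texts]


calls = []


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Toy embedding model: records each call and returns [length, vowel count]."""
    calls.append(list(texts))
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


embedder = CachedEmbedder(fake_embed, namespace="toy-model")
embedder.embed_documents(["auto", "haus"])
vectors = embedder.embed_documents(["auto", "boot"])
print(calls)
```

The second call sends only `"boot"` to the model because `"auto"` is already cached. Using the model name as namespace also guards against mixing vectors from different models.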
    
[Vector Store](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/):
- Common choice: Local: Chroma, FAISS (for us, FAISS can be a good choice) / Cloud: Pinecone, Weaviate, ElasticSearch / Lance, Qdrant (for asynchronous operations)
- Vector store queries
    - Similarity search / Similarity search by vector / Maximum marginal relevance search / Asynchronous operations
    - Possible issue: documents embedded with ```.embed_documents``` and queries embedded with ```.embed_query``` may end up using different model versions after an embedding model update, so the model version should be pinned (version control needed)
- [TODO] Understanding Indexing 
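
The difference between plain similarity search and maximum marginal relevance (MMR) search is easiest to see on a tiny example. The brute-force sketch below is an illustration only; real vector stores use optimized indexes, and the function names and 2-dimensional vectors are made up. The `lambda_mult` name follows the parameter LangChain uses for MMR, but the scoring code is an assumption about the general technique, not the library's code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query, vectors, k=2) -> List[int]:
    """Indices of the k vectors most similar to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


def mmr_search(query, vectors, k=2, lambda_mult=0.5) -> List[int]:
    """Greedy MMR: trade relevance to the query against similarity to already selected vectors."""
    selected: List[int] = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [1.0, 0.1],    # 0: very close to the query
    [1.0, 0.12],   # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but different from 0
]
print(similarity_search(query, vectors))
print(mmr_search(query, vectors))
```

Similarity search returns the two near-duplicates, while MMR keeps the best hit and then prefers the more diverse vector. For heavily repeated content such as navigation menus on every crawled page, this kind of diversity can matter.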


In [1]:
import os

from langchain_openai import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.vectorstores import FAISS


def embed_and_store_in_vector_store(docs_split, store_path='./embeddings_cache/', db_path='./db/faiss'):
    # Define the embeddings model
    embeddings_model = OpenAIEmbeddings()
    vector_stores = {}

    for company, company_docs in docs_split.items():
        # Directory used to cache this company's embeddings
        store = LocalFileStore(os.path.join(store_path, company))

        # Pipeline: embed -> cache -> store.
        # The embedder is built inside the loop because it depends on `store`.
        cached_embedder = CacheBackedEmbeddings.from_bytes_store(
            underlying_embeddings=embeddings_model,
            document_embedding_cache=store,
            namespace=embeddings_model.model,
        )

        # Create a FAISS vector store from the documents using cached embeddings
        faiss_db = FAISS.from_documents(company_docs, cached_embedder)

        # Save one index per company so that later companies do not overwrite earlier ones
        faiss_db.save_local(os.path.join(db_path, company))
        vector_stores[company] = faiss_db

    return vector_stores

In [2]:
embed_and_store_in_vector_store(docs_split)

NameError: name 'docs_split' is not defined

---

### Extra (See the metadata in the last cell)
Instead of using the stored HTML files, load the raw HTML documents directly from the fetched URLs.  
Loaders attach rich metadata (for example source, title, description, language) to each document.
  
[Document Loader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/): load data from various sources (websites, databases, YouTube, arXiv) and of various types (PDF, HTML, JSON, PowerPoint, etc.) as a list of [```document objects```](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)  
- Example: [AsyncChromiumLoader](https://python.langchain.com/docs/integrations/document_loaders/async_chromium/) / [AsyncHtmlLoader](https://python.langchain.com/docs/integrations/document_loaders/async_html/) - lightweight, loads raw HTML from a list of URLs concurrently  
  
**Possibly substitute the current crawler and transformer to make use of document metadata**:  
- Spider crawler, FireCrawl (subpage search, markdown output), AzureAIDocumentIntelligenceLoader (supports different file formats)   
- Read [HTML Loader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/html/), see also [Semantic Chunker: LangChain - How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)   
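
Where such metadata can come from is sketched below with the standard library: the page language, title, and description are all present in the HTML head. This is an assumption-laden illustration of the idea, not the loader's actual code; `MetadataParser` and `extract_metadata` are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser
from typing import Dict, List


class MetadataParser(HTMLParser):
    """Picks up <html lang>, <title> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.metadata: Dict[str, str] = {}
        self._title_parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.metadata["language"] = attributes["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.metadata["description"] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(source_url: str, raw_html: str) -> Dict[str, str]:
    parser = MetadataParser()
    parser.feed(raw_html)
    parser.close()
    metadata = {"source": source_url, **parser.metadata}
    title = "".join(parser._title_parts).strip()
    if title:
        metadata["title"] = title
    return metadata


sample_html = (
    '<!DOCTYPE html><html lang="de"><head><title> Beispielseite </title>'
    '<meta name="description" content="Eine Beispielbeschreibung">'
    "</head><body><p>Inhalt</p></body></html>"
)
metadata = extract_metadata("https://www.axa.de", sample_html)
print(metadata)
```

Keeping this metadata on each `Document` means it is still available after splitting, for example to show the source URL next to a retrieved chunk.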


In [22]:
# When using AsyncChromiumLoader in Jupyter Notebook 
import nest_asyncio
nest_asyncio.apply()

In [6]:
from langchain_community.document_loaders import AsyncChromiumLoader, AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain.schema import Document
from typing import Dict, List

def get_url_list(crawl_results):
    # Map each company name to the list of URLs visited during its crawl
    return {item['company']: item['url_list'] for item in crawl_results}

# Loader : load html document objects from URL 
def get_htmls_from_urls(url_lists) -> Dict[str, List[Document]] :
    html_docs = {} 

    for company, url_list in url_lists.items():
        # define loader 
        # loader = AsyncChromiumLoader(urls=url_list)
        loader = AsyncHtmlLoader(url_list)
        # load data into HTML document objects
        docs = loader.load() 
        html_docs[company] = docs

    return html_docs

# transform loaded html documents to markdown format
def transform_html_to_plain_text_from_urls(html_docs):
    html2text = Html2TextTransformer()
    docs_transformed_url = {}
    for company, docs in html_docs.items():
        docs_transformed_url[company] = html2text.transform_documents(docs)
    
    return docs_transformed_url

In [9]:
crawl_results = crawl_company_websites(input_companies, max_pages=20) 
url_lists = get_url_list(crawl_results)

In [None]:

url_lists = get_url_list(crawl_results)
html_docs = get_htmls_from_urls(url_lists)
# docs_transformed_url = transform_html_to_plain_text_from_urls(html_docs)
# transformed_docs_split_url = split_transformed_documents(docs_transformed_url)
# embed_and_store_in_vector_store(transformed_docs_split_url) 

Fetching pages: 100%|##########| 20/20 [00:06<00:00,  3.31it/s]
Fetching pages: 100%|##########| 20/20 [00:02<00:00,  8.98it/s]
Fetching pages: 100%|##########| 20/20 [00:08<00:00,  2.33it/s]


In [25]:
for company, doc in html_docs.items():
    print(f'{company}')
    print(f'----mata data----')
    print(doc[0].metadata)
    print(f'----page content----')
    print(doc[0].page_content[:100])

AXA Germany
----mata data----
{'source': 'https://www.axa.de/site/axa-de/redirect/MyAxaLogin?AKTIONSCODE=14015D', 'title': 'My AXA Login', 'language': 'de'}
----page content----


<!DOCTYPE html>
<html lang="de">
<head>
	<meta name="version" content="1.21.0">
    <meta charset=
HUK-COBURG
----mata data----
{'source': 'https://www.huk.de/fahrzeuge/kfz-versicherung/leichtkraftrad-versicherung.html', 'title': 'Leichtkraftrad-Versicherung: über 50 ccm - 125 ccm', 'description': 'Ihre Leichtkraftrad-Versicherung: 80ccm & 125ccm ✓ Niedrige Kosten - Top Leistung ➨ Kasko & Haftpflicht ✓ Jetzt berechnen!', 'language': 'de'}
----page content----
<!DOCTYPE html>

<html lang="de" class="no-js no-touchevents ">
	
<head>
    <meta http-equiv="Conte
Generali
----mata data----
{'source': 'https://www.generali.de/', 'title': 'Versicherungen, Vorsorge und Vermögensaufbau I Generali ', 'description': 'Versicherungen, Vorsorge und Vermögensaufbau – Informieren Sie sich über unsere Produkte oder nutzen Sie

#### Tested with a query, and it returned the corresponding responses.

---