# <b><i>Open-source RAG using Mixtral 8x7B for financial data in German, Italian, and French</i></b>

## <center> *'Liberté, égalité, architecture open-sourcé...'*</center>

---
# <b><i>Mixture-of-Experts</b></i> 🎆

<p align="justify">For the AI community, 2023 was certainly the year of the <b><i>Large Language Model (LLM)</i></b>. In 2024, the trendy concept is the <b><i>Mixture-of-Experts (MoE)</i></b>. MoE is not new: it was introduced in 1991 in a paper called <b><i>“Adaptive Mixtures of Local Experts”</i></b>.

<p align="justify">Behind the democratization of MoE is the Paris-based startup <b>Mistral AI</b>. The company recently released <b><i>Mixtral 8x7B</i></b>, an open-source large language model based on the MoE architecture (arXiv:2401.04088). This pre-trained generative Sparse Mixture of Experts outperforms Llama 2 70B and OpenAI's GPT-3.5 on many benchmarks and supports English, French, German, Italian, and Spanish.

<p align="justify">The <b><i>Mixtral 8x7B</i></b> weights are open-source and licensed under Apache 2.0. This license makes it possible to study the MoE architecture more closely, to use the model commercially, and to adapt it to other languages. In the paper <b><i>“Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through instruct tuning”</i></b>, researchers improved the Chinese conversational capabilities of Mixtral 8x7B by instruction-tuning it on 3 Chinese dialogue datasets (arXiv:2312.14557).

<p align="justify">Although OpenAI has not officially disclosed the architecture of its proprietary system, there is speculation that GPT-4 is based on an MoE architecture. In 2022, Google researchers <i>William Fedus, Barret Zoph</i>, and <i>Noam Shazeer</i> gave us great insights into Mixture-of-Experts models with the paper <b><i>“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”</i></b> (arXiv:2101.03961).

<p align="justify">A sparse Mixture-of-Experts network can scale to trillion-parameter models while keeping the computational cost per token roughly constant, which reduces cost and latency compared with a dense model of the same size. Mixtral 8x7B builds on the Mistral 7B architecture, with <b><i>Grouped-Query Attention (GQA)</i></b>, <b><i>Sliding-Window Attention (SWA)</i></b>, and a <b><i>Byte-fallback BPE</i></b> tokenizer, and was trained with a context size of 32k tokens.

<p align="justify">The important change in MoE is in the <b><i>Feed-Forward Networks</i></b> (FFNs). In each layer, the single FFN is replaced by 8 <b><i>“experts”</i></b>, and a dynamic router selects 2 experts per token to process the current state and combines their outputs.
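The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Mistral AI's implementation: the function name `top2_moe`, the toy router weights, and the toy "experts" (simple scaling functions standing in for real feed-forward blocks) are all hypothetical.

```python
import math

def top2_moe(x, router_weights, experts):
    """Route one token through its 2 highest-scoring experts and mix their outputs."""
    # Router logits: one score per expert (dot product of the token with a router row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the two largest logits.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only, so the two gates sum to 1.
    m = max(logits[i] for i in top2)
    exps = [math.exp(logits[i] - m) for i in top2]
    gates = [e / sum(exps) for e in exps]
    # Weighted sum of the two expert outputs; the other experts are never evaluated.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top2):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top2, gates

# Toy setup (hypothetical): 8 experts, expert k simply multiplies the token by k + 1.
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
router_weights = [[0.1 * k, -0.05 * k] for k in range(8)]
token = [1.0, 0.5]

output, chosen, gates = top2_moe(token, router_weights, experts)
print(chosen, gates, output)
```

Only 2 of the 8 experts run for a given token, which is why the compute per token stays far below what the total parameter count would suggest.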

<p align="justify">As an anthropomorphic analogy, a Mixture of Experts is one student putting a question to a group of eight teachers, each an expert in their own field. By contrast, a dense Large Language Model is one student putting the question to a single, very experienced teacher who needs time to reply.

# <b><i>- Experimentation in French, German, and Italian</b></i> 🔬

<p align="justify">The dominance of English in scientific communication creates a form of linguistic imperialism. In my previous notebook, I tried to explain the specificities of Japanese for a RAG (Retrieval-Augmented Generation) architecture in a low-resource environment. Similarly, this notebook is a preliminary test of a <b><i>vanilla RAG</i></b> with <b><i>Mixtral 8x7B in French, Italian</i></b>, and <b><i>German</i></b> over <b><i>financial documents in a low-resource environment.</i></b>

<p align="justify"><b><i>Disclaimer: </i></b>The minimum hardware requirements for Mixtral 8x7B are roughly 22.5 GB of VRAM in 4-bit precision, 45 GB in 8-bit precision, and 90 GB in half precision. I was able to run it with Ollama and vLLM on V100 and A100 GPUs on Google Colab Pro. In this notebook, I am using a <b><i>quantized model of Mixtral 8x7B</i></b> created by TheBloke, together with the inference framework <b><i>LlamaCPP</i></b>. Quantization degrades the quality of the output, so the results here do not fully reflect the quality of the official Mixtral 8x7B model built by Mistral AI.
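The three VRAM figures follow from simple arithmetic: number of parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces them under one assumption, labeled in the code: a parameter count of about 45 billion, back-computed from the 90 GB half-precision figure rather than taken from an official specification. It counts the weights only and ignores the KV cache and activations.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: ~45e9 parameters, inferred from the 90 GB half-precision figure above.
N_PARAMS = 45e9

for label, bits in [("4-bit", 4), ("8-bit", 8), ("half precision", 16)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```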

In [None]:
!pip install git+https://github.com/huggingface/transformers --quiet
!pip install llama-index cohere pymupdf typing_extensions --upgrade --quiet

!pip install llama-index-vector-stores-pgvecto-rs --upgrade --quiet  # Vector store integration for pgvecto.rs
!pip install llama-index-embeddings-huggingface --upgrade --quiet  # Embeddings from HuggingFace
!pip install llama-index-llms-llama-cpp --upgrade --quiet  # LLM inference with LlamaCPP

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install "llama-cpp-python==0.2.54" --no-cache-dir --quiet  # Build the Python wrapper with CUDA (cuBLAS) enabled


# <b><i>- Llama-Index</b></i> 🦙

<p align="justify"><b><i>Llama-Index</i></b> is a robust data framework created by <b><i>Jerry Liu</i></b> and <b><i>Simon Suo</i></b>, well suited to LLM-powered solutions. Version 0.10.x improved the overall implementation: Llama-Index split its packages into a core package plus separate integration and template packages, brought LlamaHub into the main repository, and deprecated ServiceContext in favour of a global Settings object.

<p align="justify">Llama-Index is now more flexible: it lets us set <b><i>global settings</i></b> and <b><i>local settings</i></b> for chunkers, vector stores, queries, or LLMs. This global/local split can be useful for fallback. For example, if my first LLM (Mixtral 8x7B) is temporarily down for technical reasons, a second LLM (GPT-4) could be loaded instead.
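The fallback idea can be illustrated independently of any framework. The sketch below is only an assumption about how such a fallback might be wired: `StubLLM` and `complete_with_fallback` are hypothetical names, not Llama-Index classes, and the outage is simulated. In Llama-Index terms, the first entry of the list would play the role of the global `Settings.llm`.

```python
class StubLLM:
    """Hypothetical stand-in for an LLM client; not a Llama-Index class."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available

    def complete(self, prompt):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"[{self.name}] answer to: {prompt}"

def complete_with_fallback(prompt, llms):
    """Try each LLM in order and return the first successful completion."""
    errors = []
    for llm in llms:
        try:
            return llm.complete(prompt)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("All LLMs failed: " + "; ".join(errors))

primary = StubLLM("Mixtral 8x7B", available=False)  # simulated outage
secondary = StubLLM("GPT-4")
answer = complete_with_fallback("Summarise the report.", [primary, secondary])
print(answer)
```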

In [None]:
import os, sys, logging, warnings
warnings.filterwarnings('ignore')

import nest_asyncio
nest_asyncio.apply()

from llama_index.core import Settings #Incorporate settings function
from llama_index.vector_stores.pgvecto_rs import PGVectoRsStore
from llama_index.core.node_parser import SentenceSplitter

import tqdm, textwrap, ipywidgets
from IPython.display import Markdown, display, clear_output

In [None]:
from transformers import AutoTokenizer

Settings.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1") #HuggingFace for Mixtral official Tokenizer


# <b><i>- BGE M3-Embedding</b></i> 🥉

<p align="justify">The <b><i>Beijing Academy of Artificial Intelligence (BAAI)</i></b> recently introduced a new embedding model, <b><i>BGE-M3</i></b>, a great alternative to <b><i>multilingual E5-Large-instruct</i></b>. According to their paper published in February 2024, <b><i>"BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation"</i></b>, the model can perform <b><i>dense retrieval, multi-vector retrieval</i></b>, and <b><i>sparse retrieval</i></b> in more than 100 languages (arXiv:2402.03216).

<p align="justify">The research team used a <b><i>Self-Knowledge Distillation</i></b> approach, optimized with a batching strategy for sizes 128, 1024, 4096, and 8192. Pre-training used 1.2 billion unsupervised multilingual text pairs to learn the dense score. Fine-tuning then introduced multilingual labeled data and synthetic data (to mitigate the shortage of long documents). In this multi-stage hybrid approach, the scores from dense retrieval, multi-vector retrieval, and lexical (sparse) retrieval are normalised and integrated. BGE M3-Embedding offers state-of-the-art (SoTA) performance against other multilingual embedding models.
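To make the three retrieval modes concrete, here is a minimal pure-Python sketch of how a dense score, a lexical score, and a multi-vector score can be computed and combined as a weighted sum. It is an illustration under assumptions, not the BGE-M3 code: the function names, the toy vectors, the token weights, and the combination weights are all made up for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dense_score(q, d):
    """Inner product of the two L2-normalised sentence embeddings."""
    return sum(a * b for a, b in zip(normalize(q), normalize(d)))

def lexical_score(q_weights, d_weights):
    """Sum of weight products over the tokens shared by query and document."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    """Late interaction: each query token keeps its best document match, then average."""
    q_n = [normalize(v) for v in q_vecs]
    d_n = [normalize(v) for v in d_vecs]
    best = [max(sum(a * b for a, b in zip(q, d)) for d in d_n) for q in q_n]
    return sum(best) / len(best)

def hybrid_score(dense, lexical, multi, weights=(1.0, 0.3, 1.0)):
    """Weighted sum of the three scores (the weights are placeholders)."""
    w_dense, w_lex, w_multi = weights
    return w_dense * dense + w_lex * lexical + w_multi * multi

# Toy inputs, invented for illustration only.
s_dense = dense_score([1.0, 0.0], [1.0, 1.0])
s_lex = lexical_score({"chiffre": 0.5, "affaires": 0.4}, {"affaires": 0.5, "hermès": 0.9})
s_multi = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
s_hybrid = hybrid_score(s_dense, s_lex, s_multi)
print(s_dense, s_lex, s_multi, s_hybrid)
```

In this notebook only the dense embedding is used, through `HuggingFaceEmbedding`; the lexical and multi-vector scores are shown purely to clarify what the hybrid ranking combines.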

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3") #HuggingFace for BGE M3 embedding model


In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# <b><i>- Quantized model of Mixtral 8x7B</i></b> ⚡

<p align="justify"><b><i>Tom Jobbins (TheBloke)</i></b> has released many quantized LLMs. For Mixtral 8x7B, the most important choice is the trade-off between performance and cost.

<p align="justify">With a quantized model, you trade model size in GB and RAM consumption against the quality of the responses. With the smaller quantizations, I saw many strange responses, so I settled on the following model:

- Mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
- Quant method: Q5_K_M
- Bits: 5
- Size: 32.23 GB
- Max RAM required: 34.73 GB

In [None]:
model_url = "https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"

# <b><i>- LlamaCPP - Inference</b></i> 💻

<p align="justify">Created by <b><i>Georgi Gerganov</i></b>, the inference framework <b><i>LlamaCPP</i></b> (llama.cpp), written in C, C++, and CUDA, has become one of the most widely used tools for running LLMs on local machines. It was originally designed to support Meta’s LLaMA and has since evolved into something much bigger, with a dynamic community. LlamaCPP can be used with LLaMA 2, Alpaca, Falcon, Baichuan, Mistral, Bloom, Qwen, etc., and even a few multimodal models such as LLaVA, BakLLaVA, and Yi-VL.

<p align="justify">In our case, it is important to enable <b><i>BLAS</i></b> (cuBLAS) during installation so that inference is accelerated by CUDA kernels on our Nvidia T4 GPU. Mixtral 8x7B can be run efficiently on single GPUs with high-performance specialized kernels. To optimize and accelerate inference, the number of layers offloaded to the GPU can be changed with the parameter n_gpu_layers, and a trade-off can be found with the maximum batch size (n_batch) and the context window (n_ctx).
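As a rough way to pick a starting value for n_gpu_layers, one can divide the model file size by the number of transformer blocks and see how many fit in VRAM. The sketch below assumes that all blocks have the same size and reserves a hypothetical 2 GB for the KV cache and CUDA context; `layers_that_fit` and `reserve_gb` are illustrative names, not LlamaCPP parameters. The 32.23 GB file size and the 32 blocks are the values reported in this notebook for the Q5_K_M model, and 16 GB is the nominal VRAM of a T4.

```python
import math

def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough count of transformer blocks that fit in VRAM, assuming equal-sized blocks."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# Values from this notebook: 32.23 GB GGUF file, 32 blocks; T4 with 16 GB of VRAM.
n_gpu_layers = layers_that_fit(model_size_gb=32.23, n_layers=32, vram_gb=16.0)
print(n_gpu_layers)
```

The estimate is only a first guess: in practice the value is tuned by watching actual VRAM usage while the model loads.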


In [None]:
llm = LlamaCPP(
    model_url=model_url,
    temperature=0.1,
    max_new_tokens=1000,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers":1, "f16_kv": True}, #At least 1 for BLAS - Offloading parameters -> n_gpu_layers:32 / n_batch:512 / n_threads:4
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Downloading url https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf to path /tmp/llama_index/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
total size (MB): 32229.28


30737it [01:27, 351.74it/s]
llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /tmp/llama_index/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32           

In [None]:
Settings.llm = llm
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20) #chunk_size=1024 will be using too much RAM on a T4

# <b><i>- RAG over complex financial documents</b></i> 💹

<p align="justify">In the latest version, <b><i>Llama-Index</b></i> incorporates <b><i>PyMuPDFReader</b></i> directly. We will be using the half-year financial reports of 2023 in PDF format from three European companies:

* <b><i>Hermès International for French</b></i>
* <b><i>Porsche AG for German</b></i>
* <b><i>Brunello Cucinelli S.p.A for Italian</b></i>

<p align="justify">Financial departments often outsource the layout of their reports to graphic agencies. In finance, documents containing regulated information are often proofread by legal departments and external PR agencies.

<p align="justify">In a high-performing RAG system, the ingestion of complex PDFs is the most important part of the work. The constant switching between text and tabular data is a big issue for financial documents. Tabular data problems can be addressed with tools such as <b><i>Camelot</i></b>, <b><i>Tabula</i></b>, <b><i>pdfplumber</i></b>, <b><i>LlamaParse</i></b>, and <b><i>Unstructured</i></b>.

<p align="justify">The trendy solution of using multimodal LLMs to scan images of tabular data in a financial report and produce a summary is computationally expensive. It is like using a huge LLM for a simple named-entity recognition task. Scanned images inside a PDF file can instead be handled by Optical Character Recognition (OCR) systems with spatial attention.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving CUCINELLI_Semestrale_2023_web.pdf to CUCINELLI_Semestrale_2023_web.pdf
Saving Halbjahresfinanzbericht 2023.pdf to Halbjahresfinanzbericht 2023.pdf
Saving hermes-rapport-financier-semetriel-2023-fr-01.pdf to hermes-rapport-financier-semetriel-2023-fr-01.pdf


# <b><i>- Metadata is a weapon</b></i> 🔪

<p align="justify">Metadata might be one of the most neglected parameters. In a RAG system, the file name is a key piece of metadata. As <b><i>General Michael V. Hayden</i></b> (former N.S.A. and C.I.A. Director) famously said at Johns Hopkins University in April 2014: <b><i>"We kill people based on metadata."</i></b>

<p align="justify">In our 3 examples, we have 3 mistakes:

* The French luxury house <b><i>Hermès</i></b> made a small typo in the file name “hermes-rapport-financier-semetriel-2023-fr-01.pdf”. An “s” is missing from an important word, “semestriel” (half-yearly in French).
* The German carmaker <b><i>Porsche</i></b> named its file “Halbjahresfinanzbericht 2023.pdf”, which gives no information about the company, just “half-year financial report 2023”.
* And finally, the Italian house <b><i>Brunello Cucinelli</i></b> included a confusing extra piece of metadata, “web”, in the name “CUCINELLI_Semestrale_2023_web.pdf”.

<p align="justify">A properly named file provides <b><i>spatio-temporal insights</i></b> that are critical in finance. A <b><i>clear hierarchical naming system</i></b> for PDF files must include the full name of the company, the ticker, the type of financial report, the date, and the language.

Example: <i>"Hermes RMS EPA financial report half-year 2023 FR 30 June 2023.pdf"</i>.
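One way to enforce such a naming scheme is to build the file name from a metadata dictionary, so that the same fields can also be attached to each document. The sketch below is a minimal illustration: `report_filename` and the dictionary keys are hypothetical names chosen for this example, and the values are the ones from the Hermès example above.

```python
from datetime import date

# Explicit month names avoid any dependence on the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def report_filename(meta):
    """Build a hierarchical file name: company, ticker, report, period, language, date."""
    closing = meta["closing_date"]
    parts = [
        meta["company"], meta["ticker"], meta["exchange"],
        meta["report"], meta["period"], str(closing.year),
        meta["language"],
        f"{closing.day} {MONTHS[closing.month - 1]} {closing.year}",
    ]
    return " ".join(parts) + ".pdf"

hermes_meta = {
    "company": "Hermes", "ticker": "RMS", "exchange": "EPA",
    "report": "financial report", "period": "half-year",
    "language": "FR", "closing_date": date(2023, 6, 30),
}
filename = report_filename(hermes_meta)
print(filename)
```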

In [None]:
from llama_index.readers.file import PyMuPDFReader  # Bundled reader; replaces the deprecated download_loader

loader = PyMuPDFReader()
hermes = loader.load(file_path="/content/hermes-rapport-financier-semetriel-2023-fr-01.pdf")
porsche = loader.load(file_path="/content/Halbjahresfinanzbericht 2023.pdf")
cucinelli = loader.load(file_path="/content/CUCINELLI_Semestrale_2023_web.pdf")

# <b><i>- Format freedom</b></i> 🎨

<p align="justify">The U.S. Securities and Exchange Commission has clearly defined a structure for company filings such as the 10-K, 10-Q, etc. The European Union, by contrast, gives public companies creative freedom in how they publish their regulated information. For NLP-powered quantitative analysis, the coverage of listed companies in the European Union must therefore be crafted in a bespoke manner. Regardless of the language, we are still victims of word games: semantics is always a b*tch (Gil Scott-Heron).

* <p align="justify"><b><i>Porsche AG</i></b> (or Dr. Ing. h.c. F. Porsche AG) went public on 29 September 2022, under the umbrella of Volkswagen Group. Porsche AG is a young public company with an austere report in black and white, with unjustified text, and very light at <b><i>863 KB for 46 pages</i></b>.

* <b><i>Hermès International</i></b> gave us an interactive PDF with a stylish orange layout, extremely pleasant to read, at <b><i>1,732 KB for 36 pages</i></b>.

* <b><i>Brunello Cucinelli S.p.A</i></b> provides a glossy half-year financial report with more than 15 high-quality pictures, at <b><i>3,528 KB for 112 pages</i></b> in Italian (the English version weighs 15,026 KB due to compression issues).

<p align="justify">Below we can see the different layouts used by companies in their financial reports. The tables of contents are weighted towards the notes for Hermès, set in all capitals for Porsche, and padded with many dot leaders for Brunello Cucinelli.

In [None]:
hermes[1]

Document(id_='5d9be335-6dc7-4678-a086-ee7ab4d6ab7d', embedding=None, metadata={'total_pages': 36, 'file_path': '/content/hermes-rapport-financier-semetriel-2023-fr-01.pdf', 'source': '2'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='1\nCHIFFRES CLÉS\n3\nPrincipales données consolidées\ndu premier semestre 2023\n3\n2\nRAPPORT SEMESTRIEL D’ACTIVITÉ\n5\n2.1\nFaits marquants du semestre\n5\nActivité à fin juin par métier\n6\nActivité à fin juin par zone géographique\n5\n2.2\nChiffre d’affaires et activité du premier semestre\n5\n2.3\nCommentaires sur les comptes semestriels consolidés\nrésumés\n7\n2.3.1 Compte de résultat\n7\n2.3.2 Flux de trésorerie et investissements\n8\n2.3.3 Situation financière\n8\n2.4\nPerspectives\n9\n2.5\nRisques et incertitudes\n9\n2.6\nTransactions avec les parties liées\n9\n3\nCOMPTES SEMESTRIELS CONSOLIDÉS\nRÉSUMÉS AU 30 JUIN 2023\n11\n3.1\nCompte de résultat consolidé\n11\n3.2\nÉtat du résultat global consolidé\n11\n

In [None]:
porsche[1]

Document(id_='e86490c7-e0d9-4e42-8c39-e7687ab84b2b', embedding=None, metadata={'total_pages': 46, 'file_path': '/content/Halbjahresfinanzbericht 2023.pdf', 'source': '2'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=' \n \n \n \nINHALT \n \n \n \n \n3 WESENTLICHE KENNZAHLEN \n \n \n KONZERN-ZWISCHENLAGEBERICHT \n5 GESCHÄFTSVERLAUF \n8 ERTRAGS-, FINANZ- UND VERMÖGENSLAGE \n16 PROGNOSE-, CHANCEN- UND RISIKOBERICHT \n \n \n KONZERN-ZWISCHENABSCHLUSS (KURZFASSUNG) \n20 KONZERN-GEWINN- UND VERLUSTRECHNUNG \n21 KONZERN-GESAMTERGEBNISRECHNUNG \n22 KONZERN-BILANZ \n23 KONZERN-EIGENKAPITALVERÄNDERUNGSRECHNUNG \n25 KONZERN-KAPITALFLUSSRECHNUNG \n26 \nKONZERN-ANHANG \n  \n44 VERSICHERUNG DER GESETZLICHEN VERTRETER \n45 BESCHEINIGUNG NACH PRÜFERISCHER DURCHSICHT \n46 WEITERE INFORMATIONEN \n \n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

In [None]:
cucinelli[1]

Document(id_='d0226d52-9f20-4c8c-bd0d-01e94664a71d', embedding=None, metadata={'total_pages': 112, 'file_path': '/content/CUCINELLI_Semestrale_2023_web.pdf', 'source': '2'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='1\nRELAZIONE FINANZIARIA SEMESTRALE AL 30 GIUGNO 2023\nINDICE \nDATI SOCIETARI ........................................................................................................................ 4\nCOMPOSIZIONE DEGLI ORGANI SOCIALI AL 30 GIUGNO 2023 ............................................................. 5\nORGANIGRAMMA SOCIETARIO DEL GRUPPO AL 30 GIUGNO 2023 ........................................................ 6\nCOMPOSIZIONE DEL GRUPPO AL 30 GIUGNO 2023 ............................................................................ 7\nRETE DISTRIBUTIVA  ................................................................................................................. 8\nRELAZIONE INTERMEDIA DEL CONSIGLIO DI AMMINISTR

# <b><i>- Pgvecto.rs</i></b> 🧮

<p align="justify">Written in Rust, <b><i>pgvecto.rs</i></b> is a Postgres extension focused on vector similarity search. Developed by <b><i>Allen Zhou</i></b> and <b><i>Ce Gao</i></b> of <b><i>TensorChord</i></b>, the beta version already has amazing features and has evolved extremely fast over the past few months. Pgvecto.rs supports vectors of up to 65,535 dimensions, which is well suited to LLMs. The Rust implementation enables various indexing algorithms and vector representations, including fp16, int8, and even sparse vectors.

<p align="justify">Pgvecto.rs provides basic distance metrics (Euclidean, dot product, and cosine distance), indexing algorithms (flat, IVF, and HNSW), and full-text/vector search. Last month, the talented team at TensorChord implemented <b><i>VBASE</i></b> filtering, following the paper <b><i>"VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity"</i></b> (<i>Microsoft Research Asia, 2023</i>).
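For reference, the three distance metrics mentioned above can be written out in plain Python. This is only an illustration of the definitions on two toy vectors, not how the extension computes them internally; the function names are chosen for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors; ignores vector length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)

# Two toy, orthogonal vectors of different lengths.
a, b = [1.0, 0.0], [0.0, 2.0]
print(euclidean_distance(a, b), dot_product(a, b), cosine_distance(a, b))
```

Cosine distance is a common choice for normalised text embeddings because it compares direction only, whereas Euclidean distance and dot product are also sensitive to vector length.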

<p align="justify">VBASE showed a tremendous speed-up over top-K-based systems with minimal changes. On the fly, VBASE builds a unified query engine: an index traversal at the start, the impressive <b><i>Relaxed Monotonicity</i></b> check, a filter in the middle, and a termination check that is updated with a "residual" of the relaxed monotonicity.

<p align="justify">Thanks to Rust, <b><i>pgvecto.rs</i></b> handles memory management efficiently, improving the performance of Postgres. Most importantly for us, it gives the open-source community a powerful tool that allows independent developers and companies to fully respect data governance. Hopefully, pgvecto.rs will be integrated by more vendors and deployed in more solutions.

In [None]:
!pip install psycopg2-binary asyncpg sqlalchemy[asyncio] greenlet --upgrade --quiet
!pip install "pgvecto_rs[sdk]" --quiet

In [None]:
from llama_index.core import StorageContext
from pgvecto_rs.sdk import PGVectoRs  # Python SDK

# Ngrok, a secure ingress platform, tunnels the local Postgres instance
# (running in Docker) to Google Colab Pro.
PORT = os.getenv("DB_PORT", 13854)  # Ngrok TCP tunneling
HOST = os.getenv("DB_HOST", "7.tcp.eu.ngrok.io")  # Ngrok TCP tunneling
USER = os.getenv("DB_USER", "postgres")
PASS = os.getenv("DB_PASS", "mysecretpassword")
DB_NAME = os.getenv("DB_NAME", "postgres")

URL = "postgresql+psycopg://{username}:{password}@{host}:{port}/{db_name}".format(
    port=PORT,
    host=HOST,
    username=USER,
    password=PASS,
    db_name=DB_NAME,
)

In [None]:
rms_client = PGVectoRs(
    db_url=URL,
    collection_name="hermes_H1_2023",  # Half-year (H1) report, not a first-quarter one
    dimension=1024, # BGE-M3 Dimension of 1024 / Sequence length of 8192
)

In [None]:
rms_vector_store = PGVectoRsStore(client=rms_client)
rms_storage_context = StorageContext.from_defaults(vector_store=rms_vector_store)

rms_index = VectorStoreIndex.from_documents(hermes, storage_context=rms_storage_context, show_progress=True)

Parsing nodes:   0%|          | 0/36 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/89 [00:00<?, ?it/s]

## 💄 <b> French | Hermès International S.C.A (EPA: RMS) </b> 💄

In [None]:
from llama_index.core import PromptTemplate

fr_template = PromptTemplate("""
<s>[INST]
Vous êtes un assistant en analyse financière. Répondez à la question en vous basant sur le contexte ci-dessous. Donnez une réponse courte et concise. Répondez "Je ne suis pas sûr de la réponse" si vous n'êtes pas sûr de la réponse.

---------------------
Contexte: {context_str}
---------------------

Question: {query_str}
Réponse:
[/INST]""")

In [None]:
rms_query_engine = rms_index.as_query_engine(text_qa_template=fr_template, use_async=True)

In [None]:
response1 = rms_query_engine.query("Quelle est le chiffre d’affaires consolidé au premier semestre 2023 pour la Maison Hermès ?")


llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      20.49 ms /    39 runs   (    0.53 ms per token,  1903.00 tokens per second)
llama_print_timings: prompt eval time =   36272.04 ms /   601 tokens (   60.35 ms per token,    16.57 tokens per second)
llama_print_timings:        eval time =   13815.35 ms /    38 runs   (  363.56 ms per token,     2.75 tokens per second)
llama_print_timings:       total time =   50239.79 ms /   639 tokens


In [None]:
print(textwrap.fill(str(response1)))

 Le chiffre d’affaires consolidé du groupe Hermès au premier semestre
2023 s’élève à 6 698 M€.


In [None]:
response2 = rms_query_engine.query("Quelle est l'évolution d'Hermès par zone géographique au 1er semestre 2023 ?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      86.31 ms /   165 runs   (    0.52 ms per token,  1911.65 tokens per second)
llama_print_timings: prompt eval time =   17649.44 ms /   287 tokens (   61.50 ms per token,    16.26 tokens per second)
llama_print_timings:        eval time =   59856.75 ms /   164 runs   (  364.98 ms per token,     2.74 tokens per second)
llama_print_timings:       total time =   78140.93 ms /   451 tokens


In [None]:
print(textwrap.fill(str(response2)))

 Au premier semestre 2023, toutes les zones géographiques d'Hermès ont
enregistré des hausses supérieures ou égales à 20% par rapport à la
même période de l'année précédente. L'Asie a connu une croissance
exceptionnelle en bénéficiant d'une base de comparaison favorable au
deuxième trimestre. Les ventes en magasins du groupe ont augmenté de
25% à taux de change constants, tandis que les ventes en gros ont
connu une hausse de 26%, profitant du rebond des ventes aux voyageurs.
Hermès continue de développer son réseau de distribution exclusif.


## <b>French</b>: <i>Correct mais...</i>
***- 2 questions, 2 correct answers***

<p align="justify">In our experiment in French with Hermès, the vanilla RAG handles the retrieval of information from text data perfectly. On the second question, <i>"Can you give us the breakdown of revenues by region for the first half of 2023?"</i>, our system does not handle the tabular data well. The breakdown of revenues by region is a borderless table on page 5. Please note that Hermès (like many luxury groups) separates France from Europe, and Japan from APAC, to highlight those two historical strategic markets. Treating Asia as a unified market might be too coarse for equity research in the luxury industry.

## 🍺 <b>German | Dr. Ing. h.c. F. Porsche AG (ETR: P911)</b> 🍺

In [None]:
P911_client = PGVectoRs(
    db_url=URL,
    collection_name="porsche_H1_2023",  # Half-year (H1) report, not a first-quarter one
    dimension=1024, # BGE-M3 Dimension of 1024 / Sequence length of 8192
)

In [None]:
P911_vector_store = PGVectoRsStore(client=P911_client)
P911_storage_context = StorageContext.from_defaults(vector_store=P911_vector_store)

P911_index = VectorStoreIndex.from_documents(porsche, storage_context=P911_storage_context, show_progress=True)

Parsing nodes:   0%|          | 0/46 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/135 [00:00<?, ?it/s]

In [None]:
ge_template = PromptTemplate("""
<s>[INST]
Du bist ein hilfsbereiter Finanzexperte. Beantworte Fragen basierend auf dem unten bereitgestellten Kontext. Stelle sicher dass die Antworten kurz und informativ sind.

---------------------
Kontext: {context_str}
---------------------

Frage: {query_str}
Antwort:
[/INST]""")

In [None]:
P911_query_engine = P911_index.as_query_engine(text_qa_template=ge_template, use_async=True)

In [None]:
response3 = P911_query_engine.query("Wieviel Umsatzerlöse hat die Porsche AG im Halbjahres 2023 erwirtschaftet?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      23.72 ms /    45 runs   (    0.53 ms per token,  1896.81 tokens per second)
llama_print_timings: prompt eval time =   33709.65 ms /   549 tokens (   61.40 ms per token,    16.29 tokens per second)
llama_print_timings:        eval time =   18841.17 ms /    44 runs   (  428.21 ms per token,     2.34 tokens per second)
llama_print_timings:       total time =   52724.49 ms /   593 tokens


In [None]:
print(textwrap.fill(str(response3)))

 Die Porsche AG hat im Halbjahres 2023 Umsatzerlöse in Höhe von 20.431
Mio. € erwirtschaftet.


In [None]:
response4 = P911_query_engine.query("Wieviele Fahrzeuge hat die Porsche AG ausgeliefert?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      22.00 ms /    42 runs   (    0.52 ms per token,  1909.35 tokens per second)
llama_print_timings: prompt eval time =   20031.24 ms /   500 tokens (   40.06 ms per token,    24.96 tokens per second)
llama_print_timings:        eval time =   18053.90 ms /    41 runs   (  440.34 ms per token,     2.27 tokens per second)
llama_print_timings:       total time =   38264.37 ms /   541 tokens


In [None]:
print(textwrap.fill(str(response4)))

 Im ersten Halbjahr 2023 hat die Porsche AG Konzern 167.354 Fahrzeuge
an Kunden ausgeliefert.


## <b>German</b>: <i>Perfekt!</i>
***- 2 questions, 2 correct answers...***

<p align="justify">In our experiment in German with Porsche AG, the model answered both questions perfectly according to the financial report. During testing with offloading, our vanilla RAG sometimes replied in English.

<p align="justify">However, even when stuck in English, our Mixtral RAG correctly found the 167.354 vehicles on page 5. Note that the figure is written <b><i>with a period</i></b>, which is the thousands separator in German number formatting; English formatting would write it as 167,354 <b><i>with a comma</i></b>.
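Because German-formatted figures use a period as the thousands separator and a comma as the decimal mark, it can help to normalise them before comparing answers across languages. The helpers below are a small hypothetical sketch, not part of the pipeline in this notebook; the two inputs are the figures returned by the German queries above.

```python
def german_to_english_number(text: str) -> str:
    """Swap thousands and decimal separators: '1.234,5' becomes '1,234.5'."""
    return text.translate(str.maketrans({".": ",", ",": "."}))

def parse_german_number(text: str) -> float:
    """Parse a German-formatted figure into a float."""
    return float(text.replace(".", "").replace(",", "."))

vehicles = german_to_english_number("167.354")  # vehicles delivered, from response4
revenue_mio_eur = parse_german_number("20.431")  # revenue in EUR million, from response3
print(vehicles, revenue_mio_eur)
```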

## 🐑 <b> Italian | Brunello Cucinelli S.p.A (BIT: BC) </b> 🐑

In [None]:
bc_client = PGVectoRs(
    db_url=URL,
    collection_name="bc_H1_2023",  # Half-year (H1) report, not a first-quarter one
    dimension=1024, # BGE-M3 Dimension of 1024 / Sequence length of 8192
)

In [None]:
bc_vector_store = PGVectoRsStore(client=bc_client)
bc_storage_context = StorageContext.from_defaults(vector_store=bc_vector_store)

bc_index = VectorStoreIndex.from_documents(cucinelli, storage_context=bc_storage_context, show_progress=True)

Parsing nodes:   0%|          | 0/112 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/257 [00:00<?, ?it/s]

In [None]:
it_template = PromptTemplate("""
<s>[INST]
Lei è un utile assistente finanziario. Rispondete alla domanda in base al contesto sottostante. Rispondi in modo breve e conciso.

---------------------
Contesto: {context_str}
---------------------

Domanda: {query_str}
Risposta:
[/INST]""")

In [None]:
bc_query_engine = bc_index.as_query_engine(text_qa_template=it_template, use_async=True)

In [None]:
response5 = bc_query_engine.query("Il Brunello Cucinelli conclude il primo semestre 2023 con ricavi consolidati de?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      23.30 ms /    43 runs   (    0.54 ms per token,  1845.57 tokens per second)
llama_print_timings: prompt eval time =   34490.39 ms /   580 tokens (   59.47 ms per token,    16.82 tokens per second)
llama_print_timings:        eval time =   15846.85 ms /    42 runs   (  377.31 ms per token,     2.65 tokens per second)
llama_print_timings:       total time =   50506.23 ms /   622 tokens


In [None]:
print(textwrap.fill(str(response5)))

 Il Brunello Cucinelli ha registrato ricavi consolidati di Euro
543.942 migliaia al termine del primo semestre del 2023.


In [None]:
response6 = bc_query_engine.query("Può spiegare l'obiettivo dell'accordo tra Brunello Cucinelli e Chanel?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   21756.38 ms
llama_print_timings:      sample time =      58.58 ms /   108 runs   (    0.54 ms per token,  1843.70 tokens per second)
llama_print_timings: prompt eval time =   19873.27 ms /   501 tokens (   39.67 ms per token,    25.21 tokens per second)
llama_print_timings:        eval time =   41562.33 ms /   107 runs   (  388.43 ms per token,     2.57 tokens per second)
llama_print_timings:       total time =   61876.17 ms /   608 tokens


In [None]:
print(textwrap.fill(str(response6)))

 L'obiettivo dell'accordo tra Brunello Cucinelli e Chanel è quello di
rafforzare la filiera italiana del lusso, con Chanel che investe nel
"Progetto Lanificio Cariaggi" come simbolo di eccellenza e made in
Italy. Questa collaborazione mira a creare una crescita umana e
professionale di qualità per entrambe le parti nei prossimi decenni.


## <b>Italian</b>: <i>Fantastico!</i>
***- 2 questions, 2 correct answers***

<p align="justify">In our Italian experiment with Brunello Cucinelli, our Mixtral RAG surprisingly found the answer to the first question, 543,9, in a column chart that has no title and no mention of the Italian word "ricavi" (revenues). The full amount is stated as 543.942 thousand euros (Euro 543.942 migliaia), alongside "il fatturato consolidato del Gruppo" (the consolidated turnover/sales of the group).
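
<p align="justify">The link between the two figures is a unit change: the model's answer is expressed in thousands of euros, while the chart label is in millions. A small sketch of that arithmetic, assuming Italian formatting ("." for thousands, "," for decimals); the helper name <code>italian_thousands_to_millions</code> is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def italian_thousands_to_millions(amount: str) -> Decimal:
    """Convert an Italian-formatted figure in thousands of euros to millions of euros."""
    # Drop the thousands separator, then turn the decimal comma into a point.
    normalized = amount.replace(".", "").replace(",", ".")
    thousands = Decimal(normalized)   # value in thousands of euros
    return thousands / Decimal(1000)  # value in millions of euros


millions = italian_thousands_to_millions("543.942")
rounded = millions.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(millions)  # 543.942
print(rounded)   # 543.9
```

<p align="justify">Rounded to one decimal, the reported amount matches the 543,9 shown in the chart.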

<p align="justify">My second question is rather tricky because it is extremely broad and abstract: <i>"Can you explain the objective of the partnership between Brunello Cucinelli and Chanel?"</i> Our Mixtral RAG gives a good explanation of the agreement between Brunello Cucinelli and Chanel and of the reasoning behind the investment in Cariaggi Lanificio S.p.A. for their respective supply chains. It even quoted the Chairman for the name of the project, "Progetto Lanificio Cariaggi".

## <b><i> - Technical observations</b></i> ⚠

<p align="justify"><u><i>Positive observations:</u></i> the combination of <b><i>BGE-M3</b></i> for the embedding model, <b><i>PgVecto.Rs</b></i> for the vector search on a Postgres database, <b><i>Mixtral 8x7B</b></i> for the multilingual LLM, and <b><i>LlamaIndex</b></i> for the orchestration performs remarkably well. This open-source Mixtral RAG could help many projects <b><i>to circumvent proprietary systems</b></i> and <b><i>strengthen data governance</b></i> in order to <b><i>respect the EU Artificial Intelligence Act</b></i>.

<p align="justify"><u><i>Negative observations:</u></i> LlamaCPP is an amazing tool for experimentation, but it might be too brittle for production. Different loading parameters (GPU layers, batch size) give different outputs, sometimes with responses in English. <b><i>Inference on an MoE architecture with a basic Nvidia T4 GPU is high quality, yet rather slow</b></i>. The real difficulty of the notebook was the so-called <b><i>"prompt engineering"</b></i>, that is, finding the correct prompt templates in French, Italian, and German. Changing a single word in the prompt template could drastically change the quality of the response.
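
<p align="justify">To put a number on "rather slow", the throughput can be recomputed from the timing lines that llama.cpp prints after each query. The sketch below parses two lines copied from the German query log above; the regular expression and the helper <code>tokens_per_second</code> are my own assumptions about that log layout, not an API of llama.cpp or LlamaIndex.

```python
import re

# Two lines copied from the llama.cpp timing output logged for the German query.
LOG = """\
llama_print_timings: prompt eval time =   20031.24 ms /   500 tokens (   40.06 ms per token,    24.96 tokens per second)
llama_print_timings:        eval time =   18053.90 ms /    41 runs   (  440.34 ms per token,     2.27 tokens per second)
"""

# Captures "<label> time = <milliseconds> ms / <count> tokens|runs".
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)"
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Return the throughput for each timing label found in the log."""
    rates = {}
    for label, ms, count in PATTERN.findall(log):
        rates[label] = int(count) / (float(ms) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} tokens/s")
```

<p align="justify">On these two lines, the recomputed rates agree with the logged ones: about 24.96 tokens per second for prompt evaluation and about 2.27 tokens per second for generation, which is the figure a user actually feels while waiting for an answer.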


## <b><i> - Partial conclusion,  future development</b></i> 🏁

<p align="justify">Many benchmarks do not reflect the actual linguistic performance of LLMs. More than open-washing, we have a lot of “benchmarketing”. <b><i>By being fully open-source</b></i>, Mixtral 8x7B enables students, developers, and SMEs in Europe to work with a powerful model and its variations. The cost of development and the need for heavy computing power are the two remaining roadblocks.

<p align="justify">OpenAI's terms restrict how the output of its models can be used, notably for developing competing models. Under the Apache 2.0 license, Mixtral gives us a powerful tool <b><i>to build synthetic data for many domain-specific datasets.</b></i>

<p align="justify">The <b><i>“Mixtral of Experts”</b></i> paper from January 2024 doesn’t reveal much about the dataset used during training. Perhaps thanks to its European team, Mixtral 8x7B outperforms Llama 2 70B in 4 languages: French, German, Spanish, and Italian, but not in English. I am wondering whether the Mistral AI team will create models for complex languages like Chinese, Japanese, Greek, Farsi, Arabic, etc., or focus only on European languages to lead the European market.

<p align="justify"><b><i>Mixtral is indeed impressive in English</b></i> and also has <b><i>pretty amazing capabilities in French, German, Spanish, and Italian</b></i>. The Sparse Mixture of Experts (SMoE) language model could be the de facto architecture for 2024. In the same spirit of combining multiple small structures to create a powerful one, we are also seeing the rise of the <b><i>“Merger of LLMs”</b></i>. The great concept of <b><i>“Mindstorms” for Natural Language-Based Societies of Mind (NLSOMs)</b></i> demonstrates incredible engineering prowess.

Thank you for reading!

Please feel free to contact me if you have any questions.<br>

<b><i>Akim Mousterou</b></i>
<br><br><br>

---

<p align="justify">
<b><i>Disclaimer:</b></i> <i>None of the content published on this notebook constitutes a recommendation that any particular security, portfolio of securities, transaction, or investment strategy is suitable for any specific person. None of the information providers or their affiliates will advise you personally concerning the nature, potential, value, or suitability of any particular security, portfolio of securities, transaction, investment strategy, or other matter.</i>