<a href="https://colab.research.google.com/github/JanMeow/2025Hack/blob/main/MaterialProductDataExtract%5BRAG%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG for material product data verification
This notebook extracts information from documents commonly seen in construction projects, such as PDFs, images, or plans.
We will use different ML approaches to extract the information we need,
and then see how to create insights from this data! ✌

First, let's install some dependencies!

In [1]:
!pip install pillow
!pip install pdfminer.six
!pip install --upgrade pymupdf
!pip install requests
!pip install transformers
!pip install torch
!pip install openai

Collecting pdfminer.six
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Downloading pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20240706
Collecting pymupdf
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.3
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)


Let's get an example product declaration.
Here we are using [the official CreaBeton sample](https://pepadocs.com/de/openWindow/Documents/byCode/694928caefd60034a8ce50c30ab540da/asMainPage/).

Specifically, we use the PDF of
Technische Wegleitung Betonsteinbeläge
for RAG. It has 54 pages, so we can also test whether it exceeds the context window.

In [4]:
# Get an example product declaration
import requests
from pathlib import Path

r = requests.get("https://img.socialcraft.me/cache/fileUgWYMwmDT88KPnEFlnTN7vspZVcf4W1EmPkXMy8gKYlKwAnifwwSFhd6pS0L/wegleitung-betonsteinbelaege.pdf")
r.raise_for_status()  # stop early if the download failed

pdf_folder = Path("data")
pdf_folder.mkdir(parents=True, exist_ok=True)

# Keep the full path (data/sample.pdf) so later cells open the file saved here
pdf_path = str(pdf_folder / "sample.pdf")

with open(pdf_path, "wb") as f:
    f.write(r.content)


In [6]:
from pdfminer.high_level import extract_text, extract_pages

texts = extract_text(pdf_path)

In [7]:
len(texts)

22075

As you can see, a product declaration usually contains A LOT of text (about 22,000 characters here).
A long document like this can be difficult for the model because it may exceed the context window.
There are a few ways to deal with this:


1. Fine-tuning
2. Chunk RAG
3. Sliding window attention
4. Summarization and pre-processing

Fine-tuning would require training part of the model or added layers, and product information changes all the time, so it is not an efficient approach here.
In this notebook, we will experiment with two techniques:
- Passing the text to a summarization model before RAG (4. Summarization and pre-processing)
- Performing chunk RAG (2. Chunk RAG)

In later cells I could also perform fine-tuning,
but it would focus on regulations or LCA data, since these do not change as frequently.
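
As a rough illustration of the chunking idea listed above, here is a minimal sketch that splits a long string into overlapping character windows. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are assumptions chosen for illustration, not values from this notebook; in practice one would probably chunk by tokens or by section instead of raw characters.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Stand-in text (hypothetical); in the notebook this would be `texts`
sample = "Betonsteinbelag " * 100
chunks = chunk_text(sample, chunk_size=500, overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, at the cost of embedding some text twice.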

# Formatting tool for Construction Product

Note that I currently summarise the document section by section; one could also do it page by page, but I figure section by section works better.

In future practice, we might want to specify certain font sizes for the user manual or add tags so we can extract information from the PDF.
Another option is to train a small linear model: we read text from many documents and label the font sizes of titles and content so the model can decide. But this is for later.
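
Short of training a model, one simple heuristic would be to treat the most common font size (weighted by character count) as body text and anything larger as a title. The sketch below assumes the spans have already been reduced to `(font_size, text)` pairs; `guess_title_threshold` is a hypothetical helper, and the sample spans are stand-in data, not taken from the PDF.

```python
from collections import Counter


def guess_title_threshold(spans):
    """Return the smallest font size above the dominant (body) size, or None.

    `spans` is an iterable of (font_size, text) pairs. Sizes are weighted by
    character count so that long body paragraphs dominate the tally.
    """
    weight = Counter()
    for size, text in spans:
        weight[round(size, 1)] += len(text)
    if not weight:
        return None

    body_size = weight.most_common(1)[0][0]
    larger = sorted(size for size in weight if size > body_size)
    return larger[0] if larger else None


# Stand-in spans (hypothetical data)
spans = [
    (14.0, "Brandschutz"),
    (9.5, "Die Einhaltung von Vorschriften und Normen ist entscheidend."),
    (10.0, "Innenwand, tragend"),
    (9.5, "Aufbau aus zweilagiger Gipsfaserplatte."),
]
threshold = guess_title_threshold(spans)
print(threshold)
```

A threshold found this way could be used in place of a hard-coded title font size, though documents with mixed layouts would probably need something more robust.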

## Getting and formatting the text so the LLM can summarise it and generate tags as a basis for a vector database

In [8]:
import fitz
from transformers import pipeline
import json
from openai import OpenAI
from google.colab import userdata
from pydantic import BaseModel


pdf = fitz.open(pdf_path)


# Use your own OpenAI API key, stored as a Colab secret named "OpenAiKey"
api_key = userdata.get("OpenAiKey")


# For test purposes: in this particular document, body text is size 9.5 and titles are either size 10 or 14

test_page = pdf[4]

# Named extract_sections so it does not shadow pdfminer's extract_text imported earlier
def extract_sections(page, title_font_size):
  """Group the spans of one page into {"title", "text"} sections based on font size."""
  # Read the spans from the page that was passed in, not from a global variable
  text_dict = page.get_text("dict")
  page_content = []

  for block in text_dict["blocks"]:
    if "lines" in block:
      for line in block["lines"]:
        if "spans" in line:
          for span in line["spans"]:
            # print(span["size"], span["text"])
            if span["size"] >= title_font_size:
              if len(page_content) > 0:
                # Titles are sometimes broken across several lines, so keep
                # appending to the previous title until body text has started
                if page_content[-1]["text"] == "":
                  page_content[-1]["title"] += span["text"]
                  continue
              dict_obj = {
                "title": span["text"],
                "text": ""
              }
              page_content.append(dict_obj)
            else:
              if len(page_content) > 0:
                # German documents contain soft hyphens (\xad) at line breaks; strip them
                cleaned = span["text"].replace("\xad", "")
                page_content[-1]["text"] += cleaned
  return page_content

def summarise_OpenAi(text, max_word, api_key):

  # Schema for the structured response: a list of summary strings
  class Summary(BaseModel):
    summary: list[str]

  client = OpenAI(
      api_key=api_key,
  )

  response = client.beta.chat.completions.parse(
      model="gpt-4o-mini",
      messages=[
          {"role": "user", "content": f"Summarise {text} and generate {max_word} keywords from the text in the original language"},
      ],
      response_format=Summary
  )

  output = response.choices[0].message.parsed
  return output



# ==============================================================================================================
page_content = extract_sections(test_page, 12)

for content in page_content:
  text = content["text"]
  word_counts = len(text.split())

  # Scale the keyword budget with the passage length (about 10% of the word count, at least 1),
  # so a longer passage can get a slightly longer summary.
  response = summarise_OpenAi(text, max(1, round(word_counts * 0.1)), api_key)
  content["summary"] = response.summary

# ==============================================================================================================

In [9]:
for content in page_content:
  print(content["title"])
  print(content["summary"])


Innenwand, tragend
['Brandschutz bezieht sich auf Maßnahmen und Praktiken, die der Verhütung und Bekämpfung von Bränden dienen. Ziel ist es, Menschenleben zu schützen, Sachwerte zu bewahren und Umweltschäden zu minimieren. Es umfasst sowohl bauliche Maßnahmen, wie feuerbeständige Materialien, als auch organisatorische Aspekte, wie das Erstellen von Notfallplänen und regelmäßige Schulungen. Die Einhaltung von Vorschriften und Normen ist entscheidend für die Effektivität des Brandschutzes.']
 
['TragwerkR 60 Brandabschnitt mit Aufbau aus zweilagiger Gipsfaserplatte (2x 15 mm), einem 160 mm breiten Ständer und 160 mm Mineralwolle RF1 (SP > 1000° C), abgedeckt mit erneut zweilagiger Gipsfaserplatte (2x 15 mm).', 'Optimal für Feuerwiderstandsklassen, um den Brandschutz in Gebäuden zu gewährleisten.']
Decke (DE-01)
['Brandschutz ist ein System von Maßnahmen und Vorschriften, die darauf abzielen, Brände zu verhüten und die Sicherheit von Personen und Sachwerten zu gewährleisten.', 'Es umfasst

Now we have used OpenAI for the summarization task.
(Although originally I wanted to use DeepSeek because it's much cheaper, lol.)
We could also store only a queryable form of the text instead of saving the entire text, but since this affects memory use rather than the pipeline itself, I decided to leave it aside.


# Creating a vector DB and a Graph DB

## Chroma, the open-source embedding database, with OpenAI embeddings!

Normally you would let Chroma embed the data with its default embedding model. Unfortunately, most environmental documents in Switzerland are in German, so we will use OpenAI embeddings instead.

In [10]:
# First, let's create some embeddings from our text.
from openai import OpenAI

# text-embedding-3-small returns 1536-dimensional vectors by default; we request 384
# dimensions instead, the same size as Chroma's default embedding model.
def get_embedding(text, model="text-embedding-3-small", dimensions=384):
  # The `dimensions` argument is only supported by the text-embedding-3 models,
  # so any other model name falls back to text-embedding-3-small.
  if not model.startswith("text-embedding-3"):
    model = "text-embedding-3-small"

  client = OpenAI(
      api_key=api_key,
  )

  response = client.embeddings.create(
      input=text,
      model=model,
      dimensions=dimensions
  )

  return response.data[0].embedding

for content in page_content:
  summary = " ".join(content["summary"])
  content["embedding"] = get_embedding(summary)

In [11]:
len(page_content[0]["embedding"])

384

In [12]:
# OK! Finally time to create the vector DB

!pip install chromadb --quiet

import chromadb
# Set up Chroma in-memory for easy prototyping. Persistence can be added easily!
chroma_client = chromadb.Client()

# Create the collection. get_collection, get_or_create_collection and delete_collection are also available!
collection = chroma_client.get_or_create_collection("all-my-documents")

# Add the docs to the collection. Entries can also be updated and deleted.

for i, content in enumerate(page_content):
  document = content["text"]
  embedding = content["embedding"]
  title = content["title"]

  collection.add(
      documents=[document], # raw section text, returned with the query results
      metadatas=[{"title": title,
                  "source": "page4, lineXXX PlaceHolder"}], # filter on these!
      ids=[f"id_{i}"], # unique for each doc
      embeddings=[embedding] # our own OpenAI embeddings, so Chroma does not embed the text itself
  )

Let's query our vector DB! Meow!

In [18]:
pset0 = {'BaseQuantities': {'NetSurfaceArea': 30.3922,
  'NetVolume': 5.2284,
  'id': 7386},
 'Cadwork3dProperties': {'Group': 'Fenster',
  'SubGroup': 'Fenster',
  'BTA TYP': 'Fenster 16.1',
  'id': 7390},
 'BIMWood_Common': {'Local coordinate system': {'id': 7393,
   'type': 'IfcComplexProperty',
   'UsageName': 'Local coordinate system',
   'properties': {'Location': {'id': 7394,
     'type': 'IfcComplexProperty',
     'UsageName': 'Location',
     'properties': {'X': 224.9999741, 'Y': 6510.0000975, 'Z': 7214.9501083}},
    'Axis': {'id': 7396,
     'type': 'IfcComplexProperty',
     'UsageName': 'Axis',
     'properties': {'X': 0.0, 'Y': -1.0, 'Z': 0.0}},
    'RefDirection': {'id': 7397,
     'type': 'IfcComplexProperty',
     'UsageName': 'Reference Direction',
     'properties': {'X': -1.0, 'Y': 0.0, 'Z': 0.0}}}},
  'id': 7392},
 'BIMWood_Production': {'ProductionNumber': '0',
  'Package': '',
  'Layer': 0,
  'id': 7399}}

In [19]:
pset1 = {'BaseQuantities': {'NetSurfaceArea': 107.1924,
  'NetVolume': 2.6723,
  'id': 888},
 'Cadwork3dProperties': {'Group': 'Fassade EG',
  'SubGroup': 'AW 1.OG',
  'BTA TYP': 'IS 15.1',
  'id': 892},
 'BIMWood_Common': {'Local coordinate system': {'id': 898,
   'type': 'IfcComplexProperty',
   'UsageName': 'Local coordinate system',
   'properties': {'Location': {'id': 899,
     'type': 'IfcComplexProperty',
     'UsageName': 'Location',
     'properties': {'X': 100.9998339, 'Y': 6524.5001187, 'Z': 5615.2774545}},
    'Axis': {'id': 903,
     'type': 'IfcComplexProperty',
     'UsageName': 'Axis',
     'properties': {'X': 1.0, 'Y': 0.0, 'Z': 0.0}},
    'RefDirection': {'id': 904,
     'type': 'IfcComplexProperty',
     'UsageName': 'Reference Direction',
     'properties': {'X': 0.0, 'Y': -1.0, 'Z': 0.0}}}},
  'id': 897},
 'BIMWood_Production': {'ProductionNumber': '0',
  'Package': '',
  'Layer': 0,
  'id': 906}}
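
Property sets like `pset0` and `pset1` are deeply nested and carry bookkeeping fields (`id`, `type`, `UsageName`) that probably add noise when the whole dict is pasted into a query. One option, sketched here under that assumption, is to flatten them into short `path: value` lines before embedding. `flatten_pset` and its `skip_keys` default are hypothetical and not part of the original notebook; the `example` dict is a shortened copy of `pset0`.

```python
def flatten_pset(pset, skip_keys=("id", "type", "UsageName"), prefix=""):
    """Flatten nested property sets into 'path: value' lines, skipping bookkeeping keys."""
    lines = []
    for key, value in pset.items():
        if key in skip_keys:
            continue
        # IfcComplexProperty wrappers nest their payload under "properties";
        # that level is dropped from the path to keep the lines short
        if key == "properties":
            path = prefix
        else:
            path = f"{prefix}.{key}" if prefix else key

        if isinstance(value, dict):
            lines.extend(flatten_pset(value, skip_keys, path))
        elif value not in ("", None):
            lines.append(f"{path}: {value}")
    return lines


# Shortened copy of pset0 for illustration
example = {
    "BaseQuantities": {"NetSurfaceArea": 30.3922, "NetVolume": 5.2284, "id": 7386},
    "Cadwork3dProperties": {"Group": "Fenster", "BTA TYP": "Fenster 16.1", "id": 7390},
    "BIMWood_Production": {"ProductionNumber": "0", "Package": "", "Layer": 0, "id": 7399},
}
query_text = "\n".join(flatten_pset(example))
print(query_text)
```

Whether this improves retrieval here is untested; it mainly keeps the query text shorter and closer to natural language.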

In [20]:
# Query/search for the n most similar results. You can also .get by id
# We need to wrap Chroma's query function a bit because we used OpenAI for the embeddings
def query(query_texts, n_results):
  # Embed the query with the same model and dimensions as the stored documents
  query_embedding = get_embedding(query_texts)
  results = collection.query(
      query_embeddings=[query_embedding],
      n_results=n_results,
      # where={"metadata_field": "is_equal_to_this"}, # optional filter
      # where_document={"$contains":"search_string"}  # optional filter
  )
  return results


query(f"what is the best conncetion type for the entity0 with {pset0} and entity1 {pset1}  ", 1)


{'ids': [['id_9']],
 'embeddings': None,
 'documents': [['1Geschossübergang IW tragendRG']],
 'uris': None,
 'data': None,
 'metadatas': [[{'source': 'page4, lineXXX PlaceHolder', 'title': ' '}]],
 'distances': [[1.4482554197311401]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
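
To get a feel for the `distances` value, the sketch below converts a squared L2 distance into a cosine similarity. It rests on two assumptions that are worth checking for your setup: that the collection uses Chroma's default squared-L2 distance, and that the embeddings are unit length. Under those assumptions, ‖a − b‖² = 2 − 2·cos(a, b). The helper names are hypothetical.

```python
import math


def cosine_from_squared_l2(distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - distance / 2.0


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Check the identity on two arbitrary unit vectors
a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, -1.0, 0.5])
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
print(abs(cosine_from_squared_l2(squared_l2) - cosine) < 1e-12)

# Distance reported by the query above
print(round(cosine_from_squared_l2(1.4482554197311401), 3))
```

If the assumptions hold, the reported distance of about 1.448 corresponds to a cosine similarity of roughly 0.28, which suggests the best match is only weakly related to the query.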

# Graph DB Set up

Objectives for today:
- Graph database
- Finish the summarization part
- Maybe revise a bit on scikit-learn