# A brief introduction to the patent system
The patent is a register, tipically a document, that to document a exclusive discovery, invention or method and aims to give to the patent holder exclusive rights over the discovery/invention.

To organize the patents and find a suitable way to structure its information, a commonly used method defines a patent with 2 characteristics:
1. **Task:** the method used in the described patent. In can be compress something or agilize a effect, for example.
2. **Object:** the "target" of the task. It can be a food, a construction material or any other object that, combined with the task, defines the patent.
3. **Kind of Effect:** the kind of effect the Task is causing to the Object.
4. **Effect:** the effect itself being caused to the object.

This method is defined by the Hallbach matrix, that defines a list of Task and Objects that can be extracted from the Title or the Resume of the patent.

# T,O Finder
The T,O Finder is the method that identifies the Task and the Object from a given patent and in this notebook we will construct a method to do such thing using Generative AI.

In [1]:
import pandas as pd
import os
from dotenv import load_dotenv
from tqdm import tqdm
import json
import uuid
import sys

from langchain.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA, ConversationalRetrievalChain, create_qa_with_structure_chain, create_retrieval_chain
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts.base import BasePromptTemplate
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))
sys.path.insert(0, project_root)
os.chdir(project_root)  # Optional: changes working directory to root

from src.utils.genai_utils import TRIZExtractor

load_dotenv()

True

Inserting a ID indexer method to ChromaDB based on discussions in the community. Ref.: https://github.com/langchain-ai/langchain/pull/17938

In [2]:
import langchain_community.vectorstores.chroma as chroma_mod
from typing import (
    Any,
    List,
    Tuple,
)

# Ref.: https://github.com/langchain-ai/langchain/pull/17938
def novo_results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (
            Document(
                page_content=result[0], metadata=(result[1] | {"id": result[3]}) or {}
            ),
            result[2],
        )
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]

chroma_mod._results_to_docs_and_scores = novo_results_to_docs_and_scores

In [3]:
df_triz = pd.read_csv("data/processed/base_efeitos_físicos_publicada_lemmatized.csv")
df_triz.head()

Unnamed: 0,TIPO DE EFEITO,TAREFA,OBJETO,EFEITO FÍSICO,SINONIMO 1 EFEITO FISICO,SINONIMO 2 EFEITO FISICO,PT Link,PT Description,Link Wiki (English),TAREFA_lemmatized
0,Aplicação,Apertar,Sólido,Matriz de Halbach,,,,,http://en.wikipedia.org/wiki/Halbach_array,apertar
1,Aplicação,Apertar,Sólido dividido,Matriz de Halbach,,,,,http://en.wikipedia.org/wiki/Halbach_array,apertar
2,Aplicação,Concentrar,Campo,Matriz de Halbach,,,,,http://en.wikipedia.org/wiki/Halbach_array,concentrar
3,Aplicação,Concentrar,Sólido dividido,Matriz de Halbach,,,,,http://en.wikipedia.org/wiki/Halbach_array,concentrar
4,Aplicação,Depositar,Sólido dividido,Matriz de Halbach,,,,,http://en.wikipedia.org/wiki/Halbach_array,depositar


# TRIZ Vector Store Data Preparation

We are preparing text-metadata pairs for a vector store that will be used to match patents with TRIZ (Theory of Inventive Problem Solving) principles. Here's how the data is structured:

**Text Format:** The text is formatted as a natural language sentence following this pattern:

`O "{TAREFA}" é um {TIPO DE EFEITO}, que no {OBJETO} causa {EFEITO FÍSICO}.`

For example:
> "O 'Aquecimento' é um Térmico, que na Água causa Aumento de temperatura."

Each text entry has **metadata** associated containing four key components from TRIZ:
- **tipo_de_efeito**: The type of effect (e.g., Mechanical, Thermal)
- **tarefa**: The task or action being performed
- **objeto**: The object being affected
- **efeito_fisico**: The resulting physical effect
- **id**: an ID used to index and identify a content

This structure allows for:
1. Semantic search through the text descriptions
2. Precise filtering using the metadata fields
3. Mapping between patents and existing TRIZ principles
4. Identification of new derived TRIZ relationships

The entries are created using the `build_entry()` function and then converted into Langchain Document objects for storage in the Chroma vector store.


In [4]:
# Creating the pairs of iniputs
def build_entry(row):
    text = f'O/a "{row["TAREFA"]}" atua no/na {row["OBJETO"]} para produzir um/uma {row["EFEITO FÍSICO"]}.'
    metadata = {
        "kind_effect": row["TIPO DE EFEITO"],
        "task": row["TAREFA"],
        "object": row["OBJETO"],
        "physical_effect": row["EFEITO FÍSICO"]
    }
    unique_id = str(uuid.uuid4())
    return {"text": text, "metadata": metadata, "id": unique_id}

documents = [build_entry(row) for _, row in df_triz.iterrows()]
documents[:2]

[{'text': 'O/a "Apertar" atua no/na Sólido para produzir um/uma \xa0Matriz de Halbach.',
  'metadata': {'kind_effect': 'Aplicação',
   'task': 'Apertar',
   'object': 'Sólido',
   'physical_effect': '\xa0Matriz de Halbach'},
  'id': '906b1ab2-ba7a-48b9-bdb6-d93fad956a84'},
 {'text': 'O/a "Apertar" atua no/na Sólido dividido para produzir um/uma \xa0Matriz de Halbach.',
  'metadata': {'kind_effect': 'Aplicação',
   'task': 'Apertar',
   'object': 'Sólido dividido',
   'physical_effect': '\xa0Matriz de Halbach'},
  'id': 'f78273d2-f707-414f-9d19-dc6aa606e44e'}]

Now we are going to create a Vectorstore to store the TRIZ data

In [5]:
docs = [
    Document(page_content=item["text"], metadata=item["metadata"], id=item["id"])
    for item in documents
]

embedding_model = OpenAIEmbeddings()

persist_directory = "notebooks/experiments/chroma_rag_tabular"

# If the vectorstore isnt created
if not os.path.exists(persist_directory):
    vectorstore = Chroma.from_documents(
        documents=docs,
        embedding=embedding_model,
        persist_directory=persist_directory,
        ids=[doc.id for doc in docs],
    )
    vectorstore.persist()

else:
    print("Vectorstore already exists")
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_model,
    )

persist_directory

  embedding_model = OpenAIEmbeddings()
  vectorstore = Chroma(


Vectorstore already exists


'notebooks/experiments/chroma_rag_tabular'

In [6]:
from langchain.vectorstores import Chroma

# Reopens the vectorstore
vectorstore = Chroma(
    persist_directory="notebooks/experiments/chroma_rag_tabular",
    embedding_function=embedding_model
)

# Similarity search
query = "Qual tarefa pode causar aumento da temperatura?"
results = vectorstore.similarity_search(query, k=2)

for doc in results:
    print(doc)
    print("Metadados:", doc.metadata)
    print("id: ", doc.metadata['id'])


page_content='O/a Aquecer atua no/na Superfície de tratamento para produzir um/uma Aumento da temperatura.' metadata={'derived_from': 'O/a "Aquecer" atua no/na Líquido para produzir um/uma Irradiação térmica.', 'physical_effect': 'Aumento da temperatura', 'task': 'Aquecer', 'object': 'Superfície de tratamento', 'id': 'beab7f1e-ef4c-4cd6-9a17-ccab5ccfcff3'}
Metadados: {'derived_from': 'O/a "Aquecer" atua no/na Líquido para produzir um/uma Irradiação térmica.', 'physical_effect': 'Aumento da temperatura', 'task': 'Aquecer', 'object': 'Superfície de tratamento', 'id': 'beab7f1e-ef4c-4cd6-9a17-ccab5ccfcff3'}
id:  beab7f1e-ef4c-4cd6-9a17-ccab5ccfcff3
page_content='O/a "Aquecer" atua no/na Gás para produzir um/uma Dilatação térmica.' metadata={'physical_effect': 'Dilatação térmica', 'kind_effect': 'Efeito', 'object': 'Gás', 'task': 'Aquecer', 'id': 'dd963441-787e-4eb9-b78c-569f5644c0a3'}
Metadados: {'physical_effect': 'Dilatação térmica', 'kind_effect': 'Efeito', 'object': 'Gás', 'task': 'Aq

## Execution of the Patent Mining System Based on GenAI

The following code will use the utils avalaible on `src/utils/genai_utils.py` to implement the data mining system. On [excalidraw-genai-to-finder_english.excalidraw](../../docs/meeting-notes/excalidraw-genai-to-finder_english.excalidraw) it is possible to see the visual representation of the system.

After the creation of the vectorstore, it will be connected to a setup using langchain. These are the following substeps:

1. **Patent Formatting:**  
   Each patent is formatted as a string:  `Título: {{title}} \nAbstract: {{abstract}}`

2. **TRIZ Extraction via LLM:**  
The `TRIZExtractor` uses a language model (e.g., ChatGPT 4.1) to extract the following TRIZ ontology elements from the patent text:
      - `kind_effect` (str): Type of effect (e.g., "Mechanical", "Thermal")
      - `task` (str): Action performed (e.g., "Compress", "Heat")
      - `object` (str): Target object (e.g., "Water", "Metal")
      - `physical_effect` (str): Resulting physical effect (e.g., "Temperature increase")

3. **Semantic Similarity Search:**  
The extracted TRIZ elements are compared against existing entries in the vectorstore using semantic search to find the most similar TRIZ relationships already present in the knowledge base.

4. **Derivative Decision Logic:**  
The `TRIZExtractor` applies a decision rule:
      - If the extracted TRIZ element is **sufficiently different** from the closest match (based on a similarity threshold), it is considered a new relationship. This new element is added to the vectorstore and linked to its closest match via the `derived_from` field.
      - If the extracted element is **not sufficiently different**, the system simply returns the closest existing TRIZ element, avoiding unnecessary duplication.

5. **Knowledge Base Expansion:**  
When a new, derived TRIZ element is identified, it is automatically inserted into the vectorstore, allowing the knowledge base to grow and improve with each processed patent.

**Summary:**  
This workflow enables automated, scalable mapping of patents to TRIZ principles, supports semantic search and reasoning, and ensures the TRIZ knowledge base evolves as new inventions are processed.

In [7]:
df_patents = pd.read_csv("data/processed/patentes_inpi_english_matched.csv")
df_patents.head()

Unnamed: 0,id_pedido,data_deposito,titulo,ipc,url,resumo,classifica_ipc,titulo_english,match_top_10_title
0,BR 11 2021 018393 0,02/03/2020,TRATAMENTO DE COLISÕES EM UPLINK,H04L 1/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,"A presente invenção se refere a métodos, sis...",H04L 1/18,Treatment of collisions in Uplink,"{'Move', 'Break Down', 'Change Phase', 'Separa..."
1,BR 11 2021 018071 0,02/03/2020,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO ANT...,H01T 13/14,https://busca.inpi.gov.br/pePI/servlet/Patente...,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO A...,H01T 13/14 ; H01T 13/20 ; H01T 13/32 ; H0...,"In this case, it is necessary to ensure that y...","{'Move', 'Break Down'}"
2,BR 11 2021 016947 4,02/03/2020,ANTICORPOS QUE RECONHECEM TAU,C07K 16/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,ANTICORPOS QUE RECONHECEM TAU. A invenção fo...,C07K 16/18 ; G01N 33/68,Antibodies that recognize you,"{'Move', 'Break Down', 'Change Phase', 'Cool',..."
3,BR 10 2020 004169 0,02/03/2020,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTÃO PAR...,F24H 3/00,https://busca.inpi.gov.br/pePI/servlet/Patente...,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTAO P...,F24H 3/008 ; F24H 4/06,Air heater with double exhaust to be used in a...,"{'Move', 'Break Down', 'Expand', 'Separate', '..."
4,BR 11 2021 006234 3,02/03/2020,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNICOS...,C12N 15/10,https://busca.inpi.gov.br/pePI/servlet/Patente...,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNIC...,C12N 15/10,Unique cell libraries and unique high-end nucl...,"{'Remove', 'Break Down', 'Move', 'Concentrate'..."


In [8]:
def clean_resumo(row):
    resumo = str(row["resumo"]) if pd.notnull(row["resumo"]) else ""
    titulo = str(row["titulo"]) if pd.notnull(row["titulo"]) else ""
    if resumo.startswith(titulo):
        resumo = resumo[len(titulo):]
    return resumo.lstrip()

df_patents["resumo"] = df_patents.apply(clean_resumo, axis=1)
df_patents.head()

Unnamed: 0,id_pedido,data_deposito,titulo,ipc,url,resumo,classifica_ipc,titulo_english,match_top_10_title
0,BR 11 2021 018393 0,02/03/2020,TRATAMENTO DE COLISÕES EM UPLINK,H04L 1/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,"A presente invenção se refere a métodos, siste...",H04L 1/18,Treatment of collisions in Uplink,"{'Move', 'Break Down', 'Change Phase', 'Separa..."
1,BR 11 2021 018071 0,02/03/2020,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO ANT...,H01T 13/14,https://busca.inpi.gov.br/pePI/servlet/Patente...,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO ANT...,H01T 13/14 ; H01T 13/20 ; H01T 13/32 ; H0...,"In this case, it is necessary to ensure that y...","{'Move', 'Break Down'}"
2,BR 11 2021 016947 4,02/03/2020,ANTICORPOS QUE RECONHECEM TAU,C07K 16/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,ANTICORPOS QUE RECONHECEM TAU. A invenção forn...,C07K 16/18 ; G01N 33/68,Antibodies that recognize you,"{'Move', 'Break Down', 'Change Phase', 'Cool',..."
3,BR 10 2020 004169 0,02/03/2020,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTÃO PAR...,F24H 3/00,https://busca.inpi.gov.br/pePI/servlet/Patente...,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTAO PAR...,F24H 3/008 ; F24H 4/06,Air heater with double exhaust to be used in a...,"{'Move', 'Break Down', 'Expand', 'Separate', '..."
4,BR 11 2021 006234 3,02/03/2020,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNICOS...,C12N 15/10,https://busca.inpi.gov.br/pePI/servlet/Patente...,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNICOS...,C12N 15/10,Unique cell libraries and unique high-end nucl...,"{'Remove', 'Break Down', 'Move', 'Concentrate'..."


In [9]:
extractor = TRIZExtractor(vectorstore=vectorstore, verbose=False, threshold=0.3, temperature=0)

In [None]:
# Calcula 10% do tamanho total
sample_size = int(len(df_patents) * 0.1)
df_patents_sample = df_patents.sample(n=sample_size, random_state=42)

results = []

for idx, row in tqdm(df_patents_sample.iterrows(), total=len(df_patents_sample)):
    try:
        title = row["titulo"]
        abstract = row["resumo"]

        # Use the extractor object to process and insert derived elements
        result = extractor.run(title=title, abstract=abstract)

        # Save fields to DataFrame
        df_patents_sample.at[idx, "kind_effect"] = result.get("effect", "")
        df_patents_sample.at[idx, "task"] = result.get("task", "")
        df_patents_sample.at[idx, "object"] = result.get("object", "")
        df_patents_sample.at[idx, "physical_effect"] = result.get("physical_effect", "")
        df_patents_sample.at[idx, "derived_from"] = result.get("derived_from", "")

        # print("Structured output:", result)

        results.append(result)
    except Exception as e:
        print(f"Erro no índice {idx}: {e}")
        df_patents_sample.at[idx, "kind_effect"] = "Erro"
        df_patents_sample.at[idx, "task"] = "Erro"
        df_patents_sample.at[idx, "object"] = "Erro"
        df_patents_sample.at[idx, "physical_effect"] = "Erro"
        df_patents_sample.at[idx, "derived_from"] = "Erro"
        results.append({"kind_effect": "Erro", "task": "Erro", "object": "Erro", "physical_effect": "Erro", "derived_from": "Erro"})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_patents_sample.at[idx, "kind_effect"] = result.get("effect", "")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_patents_sample.at[idx, "task"] = result.get("task", "")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_patents_sample.at[idx, "object"] = result.get("object", "")
A value is tryi

In [11]:
import numpy as np

# Filtra registros onde derived_from não é NaN ou vazio
mask_derived = df_patents_sample["derived_from"].notna() & (df_patents_sample["derived_from"] != "")
derived_rows = df_patents_sample[mask_derived]

# Para cada registro com derived_from preenchido
for idx, row in derived_rows.iterrows():
    kind_effect = row["kind_effect"]
    task = row["task"]
    obj = row["object"]
    physical_effect = row["physical_effect"]
    derived_value = row["derived_from"]

    # Busca outros registros com os mesmos campos, mas derived_from vazio ou NaN
    mask_same = (
        (df_patents_sample["kind_effect"] == kind_effect) &
        (df_patents_sample["task"] == task) &
        (df_patents_sample["object"] == obj) &
        (df_patents_sample["physical_effect"] == physical_effect) &
        (df_patents_sample["derived_from"].isna() | (df_patents_sample["derived_from"] == ""))
    )

    # Atualiza derived_from nesses registros
    df_patents_sample.loc[mask_same, "derived_from"] = derived_value

# df_patents_sample.to_csv("data/processed/patents_inpi_llm_matched_10percent_v2.csv", index=False)

In [12]:
# Carrega o CSV existente (se existir)
import os

csv_path = "data/processed/patents_inpi_llm_matched_10percent_v2.csv"

if os.path.exists(csv_path):
    df_existing = pd.read_csv(csv_path)
    # Concatena removendo duplicatas (por id ou outro campo, se desejar)
    df_final = pd.concat([df_existing, df_patents_sample], ignore_index=True)
else:
    df_final = df_patents_sample

# Salva o resultado final
df_final.to_csv(csv_path, index=False)

In [17]:
# Análise dos campos kind_effect, task, object e physical_effect em df_patents_sample

# Função para contar repetições e mostrar os mais frequentes
def analyze_repetitions(df, column):
    counts = df[column].value_counts()
    repeated = counts[counts >= 2]
    print(f"\nCampo: {column}")
    print(f"Total de valores únicos: {counts.shape[0]}")
    print(f"Valores que se repetem pelo menos 2 vezes: {repeated.shape[0]}")
    print("Top 5 mais frequentes:")
    print(repeated.head(5))

# Campos a analisar
fields = ["kind_effect", "task", "object", "physical_effect"]

for field in fields:
    analyze_repetitions(df_patents_sample, field)


Campo: kind_effect
Total de valores únicos: 1
Valores que se repetem pelo menos 2 vezes: 1
Top 5 mais frequentes:
kind_effect
    329
Name: count, dtype: int64

Campo: task
Total de valores únicos: 165
Valores que se repetem pelo menos 2 vezes: 62
Top 5 mais frequentes:
task
Inibir        17
Transmitir    15
Detectar      13
Proteger      13
Isolar         8
Name: count, dtype: int64

Campo: object
Total de valores únicos: 270
Valores que se repetem pelo menos 2 vezes: 23
Top 5 mais frequentes:
object
Líquido                                    18
Fluido                                      5
Sólido dividido                             5
Imagem                                      4
Chapa de aço elétrico de grão orientado     4
Name: count, dtype: int64

Campo: physical_effect
Total de valores únicos: 270
Valores que se repetem pelo menos 2 vezes: 37
Top 5 mais frequentes:
physical_effect
Isolamento físico                                5
Propagação eletromagnética                     

In [18]:
docs_data = [
    {
        "text": doc["text"],
        "kind_effect": doc["metadata"]["kind_effect"],
        "task": doc["metadata"]["task"],
        "object": doc["metadata"]["object"],
        "physical_effect": doc["metadata"]["physical_effect"],
        "id": doc["id"]
    }
    for doc in documents
]

# Cria o DataFrame e salva em CSV
df_docs = pd.DataFrame(docs_data)
df_docs.to_csv("data/processed/triz_documents_reference_ai.csv", index=False)

# Testing

In [None]:
extractor = TRIZExtractor(vectorstore=vectorstore, verbose=True, threshold=0.3, temperature=0.5)

In [None]:
titles = [
    "APLICADOR PARA PROVER UMA COMPOSIÇÃO ATIVA A UMA SUPERFÍCIE, E, MÉTODOS PARA PREPARAR E PARA PRODUZIR UM APLICADOR E PARA DESINFETAR UMA SUPERFÍCIE INANIMADA, PELE E SUPERFÍCIE DA PELE",
    "BARRA DE SABÃO EXTRUDADA E PROCESSOS PARA PREPARAR UMA BARRA DE SABÃO",
    "BEBIDA ENERGÉTICA COM NICOTINA",
    "APARELHO DE DETECÇÃO DE PINO DO TREM DE POUSO PARA DETECTAR A PRESENÇA DE UM PINO DO TREM DE POUSO"
]

filtered_patents = df_patents[df_patents["titulo"].isin(titles)]
filtered_patents

In [None]:
# for idx, row in tqdm(filtered_patents.iterrows(), total=len(filtered_patents)):
for idx, row in filtered_patents.iterrows():

    title = row["titulo"]
    abstract = row["resumo"]

    # Use the extractor object to process and insert derived elements
    result = extractor.run(title=title, abstract=abstract)