%md <img src="https://github.com/Databricks-BR/lab_genai/blob/main/img/header.png?raw=true" width=100%>

# Hands-On LAB 02 - Preparing a knowledge base

Hands-on training on the Databricks platform, focused on its Generative AI features.

In [0]:
# Default catalog matches the catalog used in the volume path and the SQL query further down.
dbutils.widgets.text("CATALOG", "workshop_databricks_jsf")
dbutils.widgets.text("SCHEMA", "agents")

In [0]:
%pip install PyMuPDF
dbutils.library.restartPython()

In [0]:
CATALOG = dbutils.widgets.get("CATALOG")
SCHEMA = dbutils.widgets.get("SCHEMA")

In [0]:
import re

In [0]:
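import fitz  # PyMuPDF

def read_pdf(pdf_path):
    """Return the text of every page in the PDF as one string, with newlines removed."""
    # The context manager closes the document once all pages have been read.
    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    # Line breaks are dropped so the regex below can match entries that span several lines.
    return text.replace("\n", "")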

In [0]:
pdf_path = "/Volumes/workshop_databricks_jsf/agents/vol_ir/P&R IRPF 2024 - v1.0 - 2024.05.03.pdf"

In [0]:
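pdf_text = read_pdf(pdf_path)
# Each Q&A entry starts with a three-digit number followed by a dash and ends at
# the "Retorno ao sumário" ("Back to summary") link; the group captures the text in between.
regex_pattern = r"\d{3} [—-](.*?)Retorno ao sumário"

In [0]:
lista_perguntas = re.findall(regex_pattern, pdf_text)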
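To make the extraction step easier to follow, the cell below applies the same pattern to a small made-up string. The sample text is invented for illustration and is not taken from the PDF; only `regex_pattern` comes from the notebook.

```python
import re

# Same pattern as in the notebook.
regex_pattern = r"\d{3} [—-](.*?)Retorno ao sumário"

# Synthetic sample (an assumption, not real PDF content), already without newlines.
sample_text = (
    "001 — Quem deve declarar? Quem recebeu rendimentos tributáveis. Retorno ao sumário"
    "002 - O que é o IRPF? É o imposto sobre a renda. Retorno ao sumário"
)

# The lazy group (.*?) stops at the first "Retorno ao sumário" after each number,
# so every numbered entry becomes its own match.
matches = [m.strip() for m in re.findall(regex_pattern, sample_text)]

for m in matches:
    print(m)
```

Both the em dash and the plain hyphen are accepted after the number, which is why the character class lists the two characters.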

In [0]:
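# Split each entry at its first "?" into question and answer.
# Entries without a "?" are skipped, since they cannot be unpacked into a pair.
lista_perguntas_respostas = [
    x.strip().split("?", 1) for x in lista_perguntas if "?" in x
]

# One row per Q&A pair; each row has a single 'chunk' column
# holding a one-element array of {Pergunta, Resposta} structs.
lista_perguntas_respostas_qm = [
    [[{"Pergunta": p + "?", "Resposta": r.strip()}]]
    for p, r in lista_perguntas_respostas
]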
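The split into question and answer can be sketched in plain Python, outside Spark. This is a minimal illustration with made-up entries: `split_question_answer` is a hypothetical helper, not part of the notebook, and `json.dumps` only approximates the string that `to_json` produces later.

```python
import json

def split_question_answer(entry):
    """Split an entry at its first '?'; return None when there is no question mark."""
    question, sep, answer = entry.strip().partition("?")
    if not sep:
        return None
    return {"Pergunta": question + "?", "Resposta": answer.strip()}

# Synthetic entries (assumptions for illustration only).
entries = [
    " Quem deve declarar? Quem recebeu rendimentos. ",
    " Texto sem pergunta ",
]

# Keep only the entries that actually contain a question.
pairs = [qa for qa in map(split_question_answer, entries) if qa is not None]

# Each chunk is a JSON array holding a single question/answer object.
chunks = [json.dumps([qa], ensure_ascii=False) for qa in pairs]

for chunk in chunks:
    print(chunk)
```

Using `partition` rather than `split` makes the missing-question case explicit, so a malformed entry is dropped instead of raising an unpacking error.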

In [0]:
from pyspark.sql.functions import monotonically_increasing_id, to_json, col
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([
    StructField('chunk', ArrayType(
        StructType([
            StructField('Pergunta', StringType(), True),
            StructField('Resposta', StringType(), True)
        ]), True), True)
])

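df = spark.createDataFrame(lista_perguntas_respostas_qm, schema)
# Serialize the array of structs to a JSON string so each chunk is stored as plain text.
df = df.withColumn('chunk', to_json(col('chunk')))
# The generated ids are unique and increasing, but not guaranteed to be consecutive.
df = df.withColumn('id', monotonically_increasing_id())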

In [0]:
df.display()

In [0]:
df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.ir_pdf_doc")

In [0]:
%sql
SELECT * FROM workshop_databricks_jsf.agents.ir_pdf_doc

Enable Change Data Feed on the table so that the knowledge base can be updated whenever new data is added to the table.

In [0]:
spark.sql(f"ALTER TABLE {CATALOG}.{SCHEMA}.ir_pdf_doc SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
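Once Change Data Feed is enabled, row-level changes can be read back from the table. The cell below is a minimal sketch, assuming a hypothetical helper named `read_table_changes` that is not part of the notebook. `starting_version` must be a table version at or after the one in which Change Data Feed was enabled, since earlier versions carry no change data.

```python
def read_table_changes(spark, table_name, starting_version):
    """Return a DataFrame with the changes recorded since starting_version.

    Hypothetical helper for illustration. Besides the table's own columns, the
    change feed adds _change_type, _commit_version and _commit_timestamp.
    """
    return (
        spark.read
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
    )

# Usage inside the notebook (the version number depends on the table history,
# which DESCRIBE HISTORY shows, so it is left as a placeholder here):
# changes = read_table_changes(spark, f"{CATALOG}.{SCHEMA}.ir_pdf_doc", starting_version=...)
# changes.display()
```

Passing `spark` in as an argument keeps the helper free of notebook globals, so it can be reused from a job or a module.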