# **Natural Language Processing**

## Master's in Applied Artificial Intelligence
#### Tecnológico de Monterrey
#### Prof. Luis Eduardo Falcón Morales

### **Team Activity: LLM + RAG System**

* **Names and student IDs:**

  *   List item
  *   List item
  *   List item

* **Team number:**


* ##### **The format of this Jupyter notebook is free, but it must include at least the following sections:**

  * ##### **Introduction to the problem to be solved.**
  * ##### **RAG + LLM system**
  * ##### **The chatbot, including test examples.**
  * ##### **Conclusions**

* ##### **You may import any packages or libraries you need.**

* ##### **You may include as many cells and lines of code as you wish.**

### OpenAI API setup

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

### Reading PDFs with LangChain

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

# Load all PDF files in the `docs/` folder
def load_pdfs_from_folder(folder_path):
    all_docs = []
    for pdf_path in Path(folder_path).glob("*.pdf"):
        loader = PyPDFLoader(str(pdf_path))
        docs = loader.load()
        for doc in docs:
            doc.metadata["source"] = pdf_path.name  # Add source metadata
        all_docs.extend(docs)
    return all_docs

documents = load_pdfs_from_folder("docs/")
print(f"✅ Loaded {len(documents)} pages from PDF files.")
print(f"📄 First page preview:\n\n{documents[0].page_content[:500]}")


✅ Loaded 500 pages from PDF files.
📄 First page preview:

THE STATUTES OF THE REPUBLIC OF SINGAPORE
PERSONAL DATA PROTECTION
ACT 2012
2020 REVISED EDITION
This revised edition incorporates all amendments up to and
including 1 December 2021 and comes into operation on 31 December 2021.
Prepared and Published by
THE LAW REVISION COMMISSION
UNDER THE AUTHORITY OF
THE REVISED EDITION OF THE LAWS ACT 1983
Informal Consolidation– version in force from 1/10/2022


### Splitting the documents into chunks for embedding and the vector store

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# A common starting point: 1000-character chunks with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)

# Apply to your loaded documents
chunks = text_splitter.split_documents(documents)

print(f"✅ Split into {len(chunks)} chunks.")
print(f"📄 First chunk preview:\n\n{chunks[0].page_content[:500]}")


✅ Split into 1887 chunks.
📄 First chunk preview:

THE STATUTES OF THE REPUBLIC OF SINGAPORE
PERSONAL DATA PROTECTION
ACT 2012
2020 REVISED EDITION
This revised edition incorporates all amendments up to and
including 1 December 2021 and comes into operation on 31 December 2021.
Prepared and Published by
THE LAW REVISION COMMISSION
UNDER THE AUTHORITY OF
THE REVISED EDITION OF THE LAWS ACT 1983
Informal Consolidation– version in force from 1/10/2022
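
The effect of `chunk_overlap` can be illustrated with a simplified sliding window. This is only a sketch: `RecursiveCharacterTextSplitter` first tries to split on the configured separators, so its real chunk boundaries will differ. The helper name `sliding_chunks` and the sample string are hypothetical and are not part of the notebook.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk are repeated at the start of the next. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.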


### Embedding and creating the vector store

In [7]:
from pathlib import Path

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Define embedding model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")

# 2. Set path to local DB
persist_directory = "chroma_db"

# 3. Load the existing DB if it is already on disk; otherwise build it from the chunks.
#    Chroma(...) does not raise when the directory is missing (it opens an empty
#    store), so check the folder explicitly instead of relying on try/except.
db_path = Path(persist_directory)
if db_path.is_dir() and any(db_path.iterdir()):
    vectorstore = Chroma(
        embedding_function=embedding_function,
        persist_directory=persist_directory
    )
    print("✅ Loaded existing Chroma DB.")
else:
    print("🚧 Creating new vector store...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=persist_directory
    )
    print("✅ New Chroma DB created.")

✅ Loaded existing Chroma DB.


### RAG + LLM 

In [9]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# 1. Set up the LLM with gpt-4o-mini
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# 2. Create a retriever from your Chroma vector store
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# 3. Build the Retrieval-Augmented Generation chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True  # So we can display sources
)

# 4. Test it!
query = "What are the legal bases for processing personal data under the GDPR?"
response = rag_chain.invoke(query)

# Show answer
print("🧠 Answer:\n", response["result"])

# Show sources
for doc in response["source_documents"]:
    print(f"\n📄 Source: {doc.metadata['source']}\n---\n{doc.page_content[:300]}")


🧠 Answer:
 The legal bases for processing personal data under the GDPR are as follows:

1. **Consent**: The data subject has given clear consent for their personal data to be processed for a specific purpose.
2. **Contractual necessity**: Processing is necessary for the performance of a contract to which the data subject is a party, or to take steps at the request of the data subject prior to entering into a contract.
3. **Legal obligation**: Processing is necessary for compliance with a legal obligation to which the controller is subject.
4. **Vital interests**: Processing is necessary to protect the vital interests of the data subject or another natural person.
5. **Public task**: Processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller.
6. **Legitimate interests**: Processing is necessary for the purposes of legitimate interests pursued by the controller or a third party, except where s
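
The retriever above is configured with `search_type="similarity"` and `k=4`, meaning it returns the four stored chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity on toy 3-dimensional vectors. It makes several assumptions: the metric Chroma actually uses depends on how the collection is configured, real embeddings have far more dimensions, and the names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings" (hypothetical values chosen for illustration)
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]
best = top_k(query_vec, toy_vectors, k=2)
print(best)
```

Here `chunk_b` ranks above `chunk_a` because its direction is closest to the query vector. In the notebook, the retrieved chunks are then passed to the LLM as context.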

### Comparing RAG + LLM against an LLM without RAG

In [11]:
query = "Does the GDPR allow processing without consent?"

# Without RAG (no context)
no_rag_response = llm.invoke(query)
print("🚫 No RAG Answer:\n", no_rag_response.content)

# With RAG
rag_response = rag_chain.invoke(query)
print("✅ RAG Answer:\n", rag_response["result"])


🚫 No RAG Answer:
 Yes, the General Data Protection Regulation (GDPR) allows for the processing of personal data without consent under certain circumstances. While consent is one of the legal bases for processing personal data, the GDPR outlines several other legal bases that can justify processing without the need for explicit consent. These include:

1. **Contractual Necessity**: Processing is necessary for the performance of a contract to which the data subject is a party or to take steps at the request of the data subject prior to entering into a contract.

2. **Legal Obligation**: Processing is necessary for compliance with a legal obligation to which the data controller is subject.

3. **Vital Interests**: Processing is necessary to protect the vital interests of the data subject or another natural person.

4. **Public Task**: Processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the data control

| Metric                   | ❌ Answer without RAG                  | ✅ Answer with RAG                    |
| ------------------------ | ------------------------------------- | ------------------------------------ |
| **Factual accuracy**     | ✅ Correct                             | ✅ Correct                            |
| **Clarity**              | ⚠️ Dense, lengthy                     | ✅ Clear and concise                  |
| **Redundancy**           | ❌ Repeats general definitions         | ✅ Focused summary                    |
| **Grounding in context** | ❌ The LLM relies on general knowledge | ✅ Based on the actual GDPR document  |
| **Style**                | 🗨️ Lecture-style answer              | ✅ Legal-assistant tone               |


### Using Gradio to build the chatbot

In [10]:
import gradio as gr

# RAG function to connect UI to the model
def ask_data_law(query):
    if not query.strip():
        return "Please enter a valid question."
    
    response = rag_chain.invoke(query)
    answer = response["result"]

    # Optional: show sources (for transparency)
    sources = "\n\n".join(
        f"📄 {doc.metadata['source']}" for doc in response["source_documents"]
    )
    
    return f"🧠 **Answer:**\n{answer}\n\n---\n**Sources:**\n{sources}"

# Gradio Blocks interface
with gr.Blocks() as demo:
    gr.Markdown("# 💼 Ask the Data Law Assistant")
    gr.Markdown("Ask about GDPR, CCPA, LGPD, APPI, and more.")

    with gr.Row():
        textbox = gr.Textbox(label="Your Question")
        submit_btn = gr.Button("Ask")

    output = gr.Markdown()

    submit_btn.click(fn=ask_data_law, inputs=textbox, outputs=output)

# Launch the app inline
demo.launch(inline=True, share=False)




* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
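
Because the retriever returns `k=4` chunks, several of them may come from the same PDF, so the source list built in `ask_data_law` can repeat a file name. A minimal sketch of one way to collapse repeats while preserving order follows. The helper name `unique_sources` and the file names are hypothetical and do not come from the notebook's data.

```python
def unique_sources(source_names: list[str]) -> list[str]:
    """Return the names in first-seen order with repeats removed."""
    # dict keys keep insertion order, so this deduplicates without reordering
    return list(dict.fromkeys(source_names))


# Hypothetical file names standing in for doc.metadata["source"] values
retrieved = ["gdpr.pdf", "gdpr.pdf", "pdpa.pdf", "gdpr.pdf"]
print(unique_sources(retrieved))
```

Inside `ask_data_law`, the same idea could be applied to the `doc.metadata['source']` values before joining them into the Markdown output.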




# **Conclusions:**

* #### **Include your conclusions from the LLM + RAG chatbot activity:**




# **End of the chatbot activity: LLM + RAG**