# Technical Test – Applied AI Engineer: RAG Systems and LLMs

## Import required libraries

In [50]:
import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import CharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from sklearn.metrics.pairwise import cosine_similarity

## 1. Data Ingestion 

- Select a document base (it can be public; technical documentation or articles in PDF/HTML/TXT are suggested).

In [51]:
def extract_text_from_url(url):
    """
    Download an HTML page and extract its clean text content.

    Sends an HTTP request to the given URL, removes non-textual HTML elements
    (scripts, styles, navigation, etc.) and returns the visible text as
    stripped, non-empty lines.

    Args:
        url (str): URL of the web page to process.

    Returns:
        str: Clean text extracted from the page's HTML content.

    Raises:
        Exception: If the HTTP request returns a status code other than 200.
        requests.RequestException: On network failures or timeouts.
    """
    # A timeout keeps a stalled server from hanging the notebook.
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        raise Exception(f"Error downloading the page: {response.status_code}")

    soup = BeautifulSoup(response.content, "html.parser")

    # Remove scripts, styles, navigation, etc.
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()

    # Extract the visible text and clean up whitespace
    text = soup.get_text(separator="\n")
    lines = [line.strip() for line in text.splitlines()]
    clean_text = "\n".join(line for line in lines if line)

    return clean_text
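
The whitespace cleanup at the end of `extract_text_from_url` can be tried in isolation. The following is a minimal sketch: `clean_lines` is a hypothetical helper name (not part of the notebook), and the sample string is made up to resemble what `soup.get_text(separator="\n")` might return.

```python
def clean_lines(text: str) -> str:
    """Strip each line and drop the empty ones (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up sample resembling raw get_text() output.
sample = "  LangChain v0.3  \n\n\n   What's changed \n\t\n Pydantic 2\n"
print(clean_lines(sample))
```

Because `get_text(separator="\n")` places the separator between text fragments, inline elements such as links end up on their own lines. This is visible in the chunks printed further down in the notebook.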

In [52]:
def extract_all_texts(urls):
    """
    Extract and concatenate the clean text of multiple HTML pages.

    For each URL in the list, the HTML content is downloaded, cleaned, and its
    visible text is extracted with `extract_text_from_url`. If an error occurs
    while downloading or processing a page, the error is printed and
    processing continues with the remaining URLs.

    Args:
        urls (list of str): List of URLs to process.

    Returns:
        str: Combined text of all pages, separated by line breaks.
    """
    all_text = ""
    for url in urls:
        try:
            print(f"🔗 Processing: {url}")
            all_text += extract_text_from_url(url) + "\n"
        except Exception as e:
            print(f"❌ Error in {url}: {e}")
    return all_text

In [53]:
# LangChain documentation links
urls = [
    "https://python.langchain.com/docs/versions/v0_3/",
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/tutorials/",
    "https://python.langchain.com/docs/tutorials/retrievers/",
    "https://python.langchain.com/docs/concepts/document_loaders/",
    "https://python.langchain.com/docs/concepts/embedding_models/",
    "https://python.langchain.com/docs/concepts/vectorstores/",
    "https://python.langchain.com/docs/tutorials/classification/",
    "https://python.langchain.com/docs/concepts/structured_outputs/",
    "https://python.langchain.com/docs/tutorials/extraction/",
    "https://python.langchain.com/docs/how_to/",
    "https://python.langchain.com/docs/how_to/structured_output/",
    "https://python.langchain.com/docs/how_to/tool_calling/",
    "https://python.langchain.com/docs/how_to/streaming/",
    "https://python.langchain.com/docs/how_to/debugging/",
    "https://python.langchain.com/docs/concepts/few_shot_prompting/",
    "https://python.langchain.com/docs/concepts/chat_models/",
]

raw_text = extract_all_texts(urls)
print(raw_text[:1000])  # Show the first 1000 characters of the extracted text

🔗 Processing: https://python.langchain.com/docs/versions/v0_3/
🔗 Processing: https://python.langchain.com/docs/introduction/
🔗 Processing: https://python.langchain.com/docs/tutorials/
🔗 Processing: https://python.langchain.com/docs/tutorials/retrievers/
🔗 Processing: https://python.langchain.com/docs/concepts/document_loaders/
🔗 Processing: https://python.langchain.com/docs/concepts/embedding_models/
🔗 Processing: https://python.langchain.com/docs/concepts/vectorstores/
🔗 Processing: https://python.langchain.com/docs/tutorials/classification/
🔗 Processing: https://python.langchain.com/docs/concepts/structured_outputs/
🔗 Processing: https://python.langchain.com/docs/tutorials/extraction/
🔗 Processing: https://python.langchain.com/docs/how_to/
🔗 Processing: https://python.langchain.com/docs/how_to/structured_output/
🔗 Processing: https://python.langchain.com/docs/how_to/tool_calling/
🔗 Processing: https://python.langchain.com/docs/how_to/streaming/
🔗 Processing: https://python.langchain.

- Process and index the information to enable semantic retrieval.

In [54]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=50,
    length_function=len
)

chunks = text_splitter.split_text(raw_text)

print(f"Total chunks generated: {len(chunks)}")
print(chunks[0])  # Show the first chunk

Created a chunk of size 555, which is longer than the specified 500
Created a chunk of size 1000, which is longer than the specified 500
Created a chunk of size 555, which is longer than the specified 500
Created a chunk of size 1000, which is longer than the specified 500
Created a chunk of size 877, which is longer than the specified 500
Created a chunk of size 590, which is longer than the specified 500
Created a chunk of size 627, which is longer than the specified 500
Created a chunk of size 680, which is longer than the specified 500
Created a chunk of size 589, which is longer than the specified 500
Created a chunk of size 616, which is longer than the specified 500
Created a chunk of size 693, which is longer than the specified 500
Created a chunk of size 636, which is longer than the specified 500
Created a chunk of size 510, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 930, which is longer t

Total chunks generated: 482
LangChain v0.3 | 🦜️🔗 LangChain
Skip to main content
Our
Building Ambient Agents with LangGraph
course is now available on LangChain Academy!
On this page
Last updated: 09.16.24
What's changed
​
All packages have been upgraded from Pydantic 1 to Pydantic 2 internally. Use of Pydantic 2 in user code is fully supported with all packages without the need for bridges like
langchain_core.pydantic_v1
or
pydantic.v1
.
Pydantic 1 will no longer be supported as it reached its end-of-life in June 2024.
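
The "longer than the specified 500" warnings are consistent with splitting only on `"\n"`: a single line longer than `chunk_size` has no separator inside it, so it cannot be broken further. The sketch below is a simplified, pure-Python approximation of line-based chunking with overlap, not LangChain's actual implementation. `split_by_lines` is a hypothetical helper and the sample text is made up.

```python
def split_by_lines(text, chunk_size=500, chunk_overlap=50):
    """Greedy line-based chunker (simplified sketch, hypothetical helper).

    Lines are appended to the current chunk until adding the next one would
    exceed chunk_size. The next chunk then starts with the trailing lines of
    the previous chunk that fit within chunk_overlap characters.
    """
    chunks, current = [], []
    for line in text.split("\n"):
        candidate = "\n".join(current + [line])
        if current and len(candidate) > chunk_size:
            chunks.append("\n".join(current))
            # Carry over trailing lines that fit in the overlap budget.
            overlap = []
            for prev in reversed(current):
                if len("\n".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            current = overlap
            # Drop overlap lines that would push the new chunk over the limit.
            while current and len("\n".join(current + [line])) > chunk_size:
                current.pop(0)
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


text = "\n".join(["short line"] * 5 + ["x" * 40, "y" * 30])
chunks = split_by_lines(text, chunk_size=60, chunk_overlap=15)

# A single line longer than chunk_size cannot be split on "\n" at all.
oversized = split_by_lines("z" * 100, chunk_size=60, chunk_overlap=15)

print([len(c) for c in chunks], [len(c) for c in oversized])
```

If oversized chunks matter for the embedding model, a splitter that falls back to finer separators (for example `RecursiveCharacterTextSplitter` from `langchain.text_splitter`) may be worth trying; whether it improves retrieval on this corpus is untested here.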


## 2. Retrieval Pipeline (RAG) 

- Implement a semantic search engine (e.g., with FAISS or Elasticsearch).

In [55]:
# Compact, efficient model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Create the embeddings
embeddings = embedding_model.encode(chunks, show_progress_bar=True)

print(f"Embeddings shape: {embeddings.shape}")

Batches: 100%|██████████| 16/16 [00:11<00:00,  1.41it/s]

Embeddings shape: (482, 384)





In [56]:
# Create the FAISS index (exact search over L2 distance)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
# FAISS expects float32 vectors
index.add(np.asarray(embeddings, dtype=np.float32))

# Map each index position to its chunk (optional, useful for traceability)
chunk_map = {i: chunk for i, chunk in enumerate(chunks)}

print(f"Total vectors indexed: {index.ntotal}")

Total vectors indexed: 482
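
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below mimics that behavior on made-up 2-D vectors; `flat_l2_search`, `normalize`, and `cosine` are hypothetical helpers, not FAISS functions. It also checks the identity that, for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine`, so ranking by L2 matches ranking by cosine similarity. Whether the embeddings used in this notebook are unit-normalized is an assumption worth verifying before relying on that equivalence.

```python
import math


def flat_l2_search(vectors, query, top_k):
    """Exhaustive nearest-neighbour search by squared L2 distance (sketch)."""
    scored = [
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()
    distances = [d for d, _ in scored[:top_k]]
    indices = [i for _, i in scored[:top_k]]
    return distances, indices


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 2-D "embeddings", normalized to unit length.
vectors = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([0.9, 0.1])

distances, indices = flat_l2_search(vectors, query, top_k=2)
print(indices)

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
for d, i in zip(distances, indices):
    print(round(d, 6), round(2 - 2 * cosine(vectors[i], query), 6))
```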


- Embed the documents using a model such as sentence-transformers or an equivalent.

In [57]:
def search_similar_chunks(question, top_k=3):
    """
    Retrieve the most relevant chunks from the corpus for a given question.

    Generates the embedding of the question, compares it with the vectors indexed
    in FAISS and returns the `top_k` most similar chunks. The results are
    concatenated into a single string separated by double line breaks.

    Args:
        question (str): Question in natural language.
        top_k (int, optional): Number of most relevant chunks to retrieve. Defaults to 3.

    Returns:
        str: Combined text of the most similar chunks, separated by double line breaks.
    """
    query_embedding = embedding_model.encode([question])
    distances, indices = index.search(np.array(query_embedding), top_k)
    retrieved_chunks = [chunk_map[i] for i in indices[0]]
    return "\n\n".join(retrieved_chunks)
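
FAISS pads the result with index `-1` when fewer than `top_k` neighbours are available, and `chunk_map[-1]` would then raise a `KeyError`. With 482 indexed vectors this does not occur here, but a defensive variant could skip the padding and keep the distances for inspection. A minimal sketch, assuming a hypothetical helper `collect_chunks` that receives one row of the `distances` and `indices` arrays returned by `index.search`; the toy data is made up.

```python
def collect_chunks(distances_row, indices_row, chunk_map):
    """Pair each retrieved chunk with its distance, skipping -1 padding (sketch)."""
    results = []
    for dist, idx in zip(distances_row, indices_row):
        if idx == -1:  # padding entry: no neighbour at this rank
            continue
        results.append((float(dist), chunk_map[int(idx)]))
    return results


# Made-up stand-ins for distances[0] and indices[0] from index.search(...).
toy_chunk_map = {0: "LangChain is a framework.", 1: "FAISS is a vector index."}
toy_distances = [0.12, 0.57, 3.4e38]
toy_indices = [1, 0, -1]

hits = collect_chunks(toy_distances, toy_indices, toy_chunk_map)
for dist, chunk in hits:
    print(f"{dist:.2f}  {chunk}")
```

Keeping the distance next to each chunk also makes it easier to see when a retrieved chunk is only weakly related to the question.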

In [58]:
question = "What is LangChain and what is it used for?"
context = search_similar_chunks(question, top_k=5)
print("🔍 Retrieved context:\n")
print(context)

🔍 Retrieved context:

this page
.
Integrations
​
LangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it.
If you're looking to get up and running quickly with
chat models
,
vector stores
,
or other LangChain components from a specific provider, check out our growing list of
integrations
.
API reference
​
Head to the reference section for full documentation of all classes and methods in the LangChain Python packages.
Ecosystem
​
🦜🛠️ LangSmith
​

here
.
Marked as deprecated a number of legacy chains and added migration guides for all of them. These are slated for removal in
langchain
1.0.0. See the deprecated chains and associated
migration guides here
.
How to update your code
​
If you're using
langchain
/
langchain-community
/
langchain-core
0.0 or 0.1, we recommend that you first
upgrade to 0.2
.
If you're using
langgraph
, upgrade to
langgraph>=0.2.20,<0.3
. This will work with either 0.2 or 0.3 versions of all the base packages.

Int

## 3. Generation with an LLM

- Integrate a language model such as OpenAI, HuggingFace, or Llama.cpp.
- Make sure the generated answers are conditioned on the retrieved documents (controlled prompting).

In [59]:
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load a text generation pipeline
llm_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

Device set to use cpu


In [60]:
def build_prompt(context, question):
    """
    Build an instruction prompt for a language model from a context and a question.

    The resulting prompt explicitly tells the model to generate an answer based
    only on the provided context, which is essential in RAG systems to reduce
    hallucination and increase the faithfulness of the answer.

    Args:
        context (str): Text fragment(s) retrieved from the corpus (relevant chunks).
        question (str): Question asked by the user.

    Returns:
        str: Prompt formatted for processing by a generative model.
    """
    return f"""Answer the following question based solely on the context provided.

Context:
{context}

Question:
{question}

Answer:"""
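
One way to tighten the prompt control is to number the retrieved chunks, cap the context size, and tell the model what to reply when the context does not contain the answer. The sketch below is an illustration under assumptions: `build_grounded_prompt` and `max_context_chars` are hypothetical names, the instruction wording is only a suggestion, and the character budget is a crude stand-in for counting tokens. It takes a list of chunks, so the retrieval function would need to return the list rather than the joined string. A small model such as `google/flan-t5-base` may not follow the abstention instruction reliably.

```python
def build_grounded_prompt(chunks, question, max_context_chars=1500):
    """Build a prompt from numbered chunks within a character budget (sketch)."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if selected and used + len(entry) > max_context_chars:
            break  # keep the highest-ranked chunks, drop the rest
        selected.append(entry)
        used += len(entry)
    context = "\n\n".join(selected)
    return (
        "Answer the question using only the numbered context below. "
        'If the context does not contain the answer, reply "I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer:"
    )


# Made-up chunks: the second one exceeds the budget and is dropped.
example_chunks = ["LangChain is a framework for LLM applications.", "x" * 2000]
prompt = build_grounded_prompt(example_chunks, "What is LangChain?")
print(prompt)
```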

In [61]:
def answer_question(question):
    """
    Generate a natural-language answer using an LLM and semantic retrieval.

    This function performs the following steps:
    1. Retrieves the most relevant chunks from the semantic index for the question.
    2. Builds a controlled prompt with that context and the original question.
    3. Sends the prompt to the LLM to generate an answer.

    Args:
        question (str): Question in natural language to answer.

    Returns:
        str: Answer generated by the language model, based on the retrieved context.
    """
    context = search_similar_chunks(question, top_k=3)
    prompt = build_prompt(context, question)
    response = llm_pipeline(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
    return response

## 4. Basic evaluation

- Provide example questions about the documents.
- Include metrics or criteria to verify that the generated answers are relevant and faithful to the content.


In [62]:
preguntas = [
    "What is LangChain?",
    "What are the main components of LangChain?",
    "What are LLM models used for in LangChain?",
    "Can LangChain be integrated with databases?",
    "What are the advantages of using LangChain in production?",
    "How does LangChain differ from direct use of OpenAI API?",
    "What are agents in LangChain and how do they work?",
    "Can LangChain interact with external APIs?",
    "How does LangChain support memory in conversations?",
    "What are tools in LangChain and how are they defined?",
    "How can I integrate LangChain with vector databases like FAISS or Pinecone?",
    "What types of memory does LangChain support?",
    "How do you persist memory in a LangChain app?",
    "What are retrieval-based QA chains in LangChain?",
    "How does LangChain handle prompt templating?",
    "What is the role of LangChain Expression Language (LCEL)?",
    "How do you debug chains and agents in LangChain?",
    "Can LangChain be used with Hugging Face models?",
    "How does LangChain handle streaming responses?",
    "What are some production-ready deployment strategies for LangChain apps?",
    "How can LangChain be integrated with frameworks like FastAPI or Streamlit?",
    "What are common use cases of LangChain in enterprise settings?",
    "How does LangChain work with multi-modal models?"
]

In [63]:
for pregunta in preguntas:
    print("🧠 Question:", pregunta)
    print("💬 Answer:", answer_question(pregunta))
    print("-" * 80)

🧠 Question: What is LangChain?
💬 Answer: a framework for developing applications powered by large language models
--------------------------------------------------------------------------------
🧠 Question: What are the main components of LangChain?
💬 Answer: chat models, vector stores, and other LangChain components
--------------------------------------------------------------------------------
🧠 Question: What are LLM models used for in LangChain?
💬 Answer: building robust and stateful multi-actor applications
--------------------------------------------------------------------------------
🧠 Question: Can LangChain be integrated with databases?
💬 Answer: Yes
--------------------------------------------------------------------------------
🧠 Question: What are the advantages of using LangChain in production?
💬 Answer: LangGraph builds stateful, multi-actor applications with LLMs. Integrates smoothly with LangChain, but can be used without it. LangGraph powers production

In [64]:
def score_similarity(response, context):
    """
    Compute the semantic similarity between a generated answer and its source context.

    Uses embeddings produced by the `sentence-transformers` model and computes the
    cosine similarity between the answer and the provided context. This gives a
    rough estimate of how closely the answer relates to the retrieved content.

    Args:
        response (str): Text generated by the model (answer).
        context (str): Source text used as context in the prompt.

    Returns:
        float: Cosine similarity, in the range -1 to 1 (higher value = more similar).
    """
    response_embedding = embedding_model.encode([response])
    context_embedding = embedding_model.encode([context])
    similarity = cosine_similarity(response_embedding, context_embedding)
    return similarity[0][0]
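
Cosine similarity between a short answer and a long multi-chunk context mostly reflects topical closeness; it does not show that the answer's wording is actually supported by the context. A complementary lexical check is the fraction of answer tokens that also occur in the context. The sketch below is a crude illustration, not an established faithfulness metric: `token_overlap` is a hypothetical helper and the strings are made up.

```python
import re


def token_overlap(response, context):
    """Fraction of distinct response tokens that also occur in the context (sketch)."""

    def tokens(text):
        # Lowercased alphanumeric tokens; punctuation is ignored.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    response_tokens = tokens(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & tokens(context)) / len(response_tokens)


# Made-up strings for illustration.
ctx = "LangChain is a framework for developing applications powered by LLMs."
supported = "a framework for developing applications"
unsupported = "a database engine written in Rust"

print(token_overlap(supported, ctx), token_overlap(unsupported, ctx))
```

A low overlap combined with a moderate cosine score can flag answers that are on topic but not grounded in the retrieved text. Stop words inflate the score, so filtering them may make the signal sharper.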

In [65]:
context = search_similar_chunks("What is LangChain?")
respuesta = answer_question("What is LangChain?")
print("🔢 Answer-context similarity:", score_similarity(respuesta, context))
print(respuesta)
print(context)

🔢 Answer-context similarity: 0.32325405
a framework for developing applications powered by large language models
this page
.
Integrations
​
LangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it.
If you're looking to get up and running quickly with
chat models
,
vector stores
,
or other LangChain components from a specific provider, check out our growing list of
integrations
.
API reference
​
Head to the reference section for full documentation of all classes and methods in the LangChain Python packages.
Ecosystem
​
🦜🛠️ LangSmith
​

here
.
Marked as deprecated a number of legacy chains and added migration guides for all of them. These are slated for removal in
langchain
1.0.0. See the deprecated chains and associated
migration guides here
.
How to update your code
​
If you're using
langchain
/
langchain-community
/
langchain-core
0.0 or 0.1, we recommend that you first
upgrade to 0.2
.
If you're using
langgraph
, upgrade to
langgraph>