# Building the chat model

In this notebook we build a RAG (retrieval-augmented generation) system on top of a document base that mixes freely accessible websites with sports science articles (also in the public domain).

**IMPORTANT**: This notebook uses services such as _FireCrawl_ and _LlamaParse_, which require API keys. The keys have been removed so that the file can be published in the GitHub repository without exposing them.
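
One common way to keep keys out of a published notebook is to read them from environment variables at run time. The sketch below assumes hypothetical variable names (`FIRECRAWL_API_KEY`, `LLAMA_CLOUD_API_KEY`) and a hypothetical helper `get_api_key`; none of them appear in the original notebook.

```python
import os


def get_api_key(env_var):
    """Return the API key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Hypothetical variable names; use whatever fits your own setup.
# firecrawl_key = get_api_key("FIRECRAWL_API_KEY")
# llamaparse_key = get_api_key("LLAMA_CLOUD_API_KEY")
```

The returned values could then replace the `'------APIKEY------'` placeholders used in the cells below.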

In [None]:
import numpy as np
import pandas as pd
import os
workpath = 'C:/Users/Legion/TFM/Tareas'
os.chdir(workpath)
from firecrawl import FirecrawlApp
import json
from llama_parse import LlamaParse
import nest_asyncio
import pickle
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from sentence_transformers import SentenceTransformer
import faiss

## Compiling the data for the vector store

The first step in building the RAG is a vector store that holds the information related to our topic, so that the model can answer questions about it. Here we start by converting our various information sources to markdown so that the documents can be embedded.

### Information contained in websites

In [None]:
# Create the app used for web scraping
scrape_app = FirecrawlApp(api_key='------APIKEY------')


In [None]:
urls = pd.read_table('./Docs/URLs.txt')['URLS'].tolist()

In [None]:
url_docs = [scrape_app.scrape_url(url) for url in urls]

In [None]:
# Save each scraped page as its own JSON file
for i, doc in enumerate(url_docs):
    with open(f'./Docs/webs/documents({i}).json', 'w', encoding='utf-8') as json_file:
        json.dump(doc, json_file, indent=4)
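
The list comprehension used for scraping above stops at the first URL that fails. A minimal sketch of a more forgiving loop follows; `scrape_all` and its `scrape_fn` argument are hypothetical names, and `scrape_fn` stands for any callable that takes a URL, such as `scrape_app.scrape_url`.

```python
def scrape_all(urls, scrape_fn):
    """Scrape every URL with `scrape_fn`, collecting failures instead of stopping.

    Returns (docs, failed): `docs` maps each URL to its scraped result,
    `failed` maps each failing URL to its error message.
    """
    docs, failed = {}, {}
    for url in urls:
        try:
            docs[url] = scrape_fn(url)
        except Exception as error:  # broad on purpose: record the failure, keep going
            failed[url] = str(error)
    return docs, failed


# Possible usage with the objects defined above:
# docs, failed = scrape_all(urls, scrape_app.scrape_url)
```

Keeping the failed URLs in a separate dictionary makes it easy to retry only those later.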

### Information contained in PDFs
#### Scientific articles



In [None]:
nest_asyncio.apply()

In [None]:
articulos_pdf = os.listdir('./Docs/Articles/')

In [None]:
# Parse every article to markdown; the list must exist before appending to it
pdfs = []
for articulo in articulos_pdf:
    pdfs.append(LlamaParse(api_key='------APIKEY------', result_type='markdown', parsing_instruction='''
    This document is a scientific article.
    Typically, these documents have the title, authors, and then an abstract summarizing the whole document.
    After this abstract, the documents are written with multiple columns.
    Images and tables can be ignored.
    ''').load_data('./Docs/Articles/'+articulo))

In [None]:
for i in range(len(pdfs)):
    with open(f'./Docs/Markdowns/articulo_{i}.pickle', 'wb') as file:
        pickle.dump(pdfs[i], file)

#### Articles and books (full-page, single-column layout)

In [None]:
libros = os.listdir('./Docs/Books/')

In [None]:
# Parse the first book on its own
libro1 = LlamaParse(api_key='------APIKEY------', result_type='markdown', parsing_instruction='''
    This document is in Spanish.
    Images and tables can be ignored.
    ''').load_data('./Docs/Books/'+libros[0])

In [None]:
# Parse every book to markdown; the list must exist before appending to it
libros_parsed = []
for libro in libros:
    libros_parsed.append(LlamaParse(api_key='------APIKEY------', result_type='markdown', parsing_instruction='''
    This document is about training and exercise.
    Images and tables can be ignored.
    ''').load_data('./Docs/Books/'+libro))

In [None]:
for i in range(len(libros_parsed)):
    with open(f'./Docs/Markdowns/libro_{i}.pickle', 'wb') as file:
        pickle.dump(libros_parsed[i], file)

### Merge all the information and create chunks

In [None]:
lista_webs = []
webs_path = os.listdir('./Docs/webs/')
for path in webs_path:
    with open(f'./Docs/webs/{path}', 'r', encoding='utf-8') as file:
        web = json.load(file)
    lista_webs.append(web[0])

In [None]:
lista_webs[0]['docs'][0].keys()

#### Create chunks from the documents

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,  # Adjust chunk size as needed
    chunk_overlap=200  # Overlap between chunks to maintain context
)

web_splits = []
for web in lista_webs:
    web_splits.append(text_splitter.split_text(web['docs'][0]['markdown']))

web_chunks = [item for sublist in web_splits for item in sublist]
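
To make the two parameters concrete, here is a simplified fixed-window sketch of how `chunk_size` and `chunk_overlap` interact. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper for illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each new window
    starts `chunk_size - chunk_overlap` characters after the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With `chunk_size=2048` and `chunk_overlap=200`, as configured above, each chunk would repeat up to 200 characters of the previous one, which helps keep sentences that straddle a boundary retrievable.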

In [None]:
web_chunks[5]

In [None]:
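# Pages that already fit in one chunk (at most 2048 characters) are kept whole;
# using <= keeps pages of exactly 2048 characters from being dropped
libros_list = [item.text for sublist in libros_parsed for item in sublist if len(item.text) <= 2048]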

In [None]:
libros_chunks = [text_splitter.split_text(i) for i in [item.text for sublist in libros_parsed for item in sublist if len(item.text) > 2048]]

In [None]:
libros_chunks_unwrapped = [item for sublist in libros_chunks for item in sublist]

In [None]:
for i in libros_list:
    libros_chunks_unwrapped.append(i)

In [None]:
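# Pages that already fit in one chunk (at most 2048 characters) are kept whole;
# using <= keeps pages of exactly 2048 characters from being dropped
articulos_list = [item.text for sublist in pdfs for item in sublist if len(item.text) <= 2048]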

In [None]:
articulos_chunks = [text_splitter.split_text(i) for i in [item.text for sublist in pdfs for item in sublist if len(item.text) > 2048]]

In [None]:
articulos_chunks_unwrapped = [item for sublist in articulos_chunks for item in sublist]

In [None]:
for i in articulos_list:
    articulos_chunks_unwrapped.append(i)
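
The books and the articles go through the same steps: keep short pages whole, split long ones, and flatten the result. A sketch of a single helper that does this in one pass is shown below. `chunk_texts` is a hypothetical name, and `split_fn` stands for any callable that returns a list of strings, such as `text_splitter.split_text`. Unlike the cells above, it keeps the original page order instead of placing the split chunks before the short pages.

```python
def chunk_texts(texts, split_fn, max_len=2048):
    """Return a flat list of chunks: short texts as-is, long texts split by `split_fn`."""
    chunks = []
    for text in texts:
        if len(text) <= max_len:
            chunks.append(text)  # already fits in a single chunk
        else:
            chunks.extend(split_fn(text))  # split_fn returns a list of pieces
    return chunks


# Possible usage with the objects defined above:
# pages = [item.text for sublist in pdfs for item in sublist]
# articulos_chunks_unwrapped = chunk_texts(pages, text_splitter.split_text)
```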

In [None]:
all_docs = web_chunks + libros_chunks_unwrapped + articulos_chunks_unwrapped

In [None]:
with open('./Docs/Markdowns/Chunks_all.pickle', 'wb') as file:
    pickle.dump(all_docs, file)

## Create the vector store

In [None]:
with open('./Docs/Markdowns/Chunks_all.pickle', 'rb') as file:
    all_docs = pickle.load(file)

In [None]:
# Use a multilingual embedding model to embed the document chunks

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(all_docs)


In [None]:
with open('./Docs/Markdowns/Embedded_chunks.pickle', 'wb') as file:
    pickle.dump(embeddings, file)

In [None]:
with open('./Docs/Markdowns/Embedded_chunks.pickle', 'rb') as file:
    embeddings = pickle.load(file)

**IMPORTANT:** The embedding model used to build the vector store must be the same one the application uses to embed queries. Otherwise the query vectors and the stored vectors belong to different spaces and retrieval fails. If the model is changed in one place, it must be changed everywhere.
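
One way to guard against such a mismatch is to store the model name next to the index and check it at load time. This is a sketch under assumptions: the sidecar JSON file and the helper names `save_store_metadata` and `check_store_metadata` are hypothetical, and are not part of the original notebook or of FAISS.

```python
import json


def save_store_metadata(meta_path, model_name, dimension):
    """Write the embedding model name and vector dimension to a JSON file."""
    with open(meta_path, "w", encoding="utf-8") as file:
        json.dump({"model_name": model_name, "dimension": dimension}, file)


def check_store_metadata(meta_path, model_name):
    """Raise ValueError if the index was built with a different embedding model."""
    with open(meta_path, "r", encoding="utf-8") as file:
        meta = json.load(file)
    if meta["model_name"] != model_name:
        raise ValueError(
            f"Index was built with {meta['model_name']!r}, "
            f"but queries would be embedded with {model_name!r}."
        )
    return meta
```

Calling `check_store_metadata` right after loading the index would turn a silent retrieval failure into an explicit error.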

In [None]:
# Convert the embeddings to a float32 NumPy array (the dtype FAISS expects)
embeddings = np.asarray(embeddings, dtype='float32')

# Create an index (the vector store)
dimension = embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)  # Index type: exact search with L2 distance

# Add the embeddings to the vector store
index.add(embeddings)

# Save the vector store locally
faiss.write_index(index, "vector_store.index")
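
`IndexFlatL2` performs an exact, brute-force search: it compares the query against every stored vector and reports squared L2 distances. The NumPy sketch below reproduces that idea on toy data; `brute_force_l2_search` is a hypothetical helper written for illustration and is not part of FAISS.

```python
import numpy as np


def brute_force_l2_search(vectors, queries, k):
    """Return (distances, indices) of the k nearest stored vectors for each query.

    Distances are squared L2 distances, sorted from closest to farthest.
    """
    vectors = np.asarray(vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise differences, shape (n_queries, n_vectors, dimension)
    diffs = queries[:, None, :] - vectors[None, :, :]
    squared = np.sum(diffs ** 2, axis=2)
    indices = np.argsort(squared, axis=1)[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = brute_force_l2_search(stored, [[1.0, 1.0]], k=2)
print(indices)    # [[1 0]]
print(distances)  # [[1. 2.]]
```

This scales linearly with the number of stored vectors, which is usually acceptable for a corpus of a few thousand chunks.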


In [None]:
# Smoke test
question = "¿Cómo hacer una dominada?"  # "How do you do a pull-up?"
query_vector = model.encode([question])[0]
D, I = index.search(np.array([query_vector]), k=5)  # Find the 5 chunks closest to the question

# D holds the distances, I the indices of the matching vectors
# (Both the vector store and the un-embedded chunks must be loaded to display the result)
closest_docs = [all_docs[i] for i in I[0]]


In [None]:
closest_docs

### Functions for using the vector store

In [None]:
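def create_vector_store(docspath):
    """Build a FAISS index from the pickled chunks at `docspath` and save it next to that file."""
    with open(docspath, 'rb') as file:
        all_docs = pickle.load(file)
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    embeddings = np.asarray(model.encode(all_docs), dtype='float32')
    dimension = embeddings.shape[1]  # Dimension of the embeddings
    index = faiss.IndexFlatL2(dimension)  # L2 distance index

    # Add embeddings to the index
    index.add(embeddings)

    # Save the index in the same directory as the chunks file
    # (docspath is a file, so appending "/vector_store.index" to it gives an invalid path)
    faiss.write_index(index, os.path.join(os.path.dirname(docspath), "vector_store.index"))
    return index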

def load_vector_store(path):
    return faiss.read_index(path)
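
To round off these helpers, a query function could wrap the test query shown earlier. The sketch below is an assumption about how retrieval might be packaged, not the author's implementation; `retrieve_chunks` is a hypothetical name. It relies only on `model.encode` and `index.search`, as used above, and skips the `-1` indices that FAISS returns when fewer than `k` vectors are available.

```python
import numpy as np


def retrieve_chunks(question, model, index, docs, k=5):
    """Return up to k (chunk, distance) pairs, closest first, for `question`."""
    query = np.asarray(model.encode([question]), dtype="float32")
    distances, indices = index.search(query, k)
    return [
        (docs[i], float(d))
        for d, i in zip(distances[0], indices[0])
        if i != -1  # -1 marks an empty result slot
    ]


# Possible usage with the objects defined above:
# index = load_vector_store("vector_store.index")
# results = retrieve_chunks("¿Cómo hacer una dominada?", model, index, all_docs)
```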