# Chat With Your Data

Requirements

In [None]:
%pip install langchain openai pypdf python-dotenv chromadb lark -q

## Structure

- API keys and environment variables
- Document Loading
- Document Splitting
- Vectorstores and Embedding (Storage)
- Retrieval
- Question Answering
- Chat
- Conclusion

<!-- ![](figs/0_preview.png) -->
![](https://python.langchain.com/assets/images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)

In this tutorial, we walk through the essential steps for building an application that lets you chat with your own documents using a large language model. We begin with API keys and the use of environment variables to keep credentials out of our code. Next comes document loading, covering several file types and data sources, followed by document splitting for efficient processing. We then create embeddings, which represent the semantic meaning of text as vectors, and store them in a vector store. After that, we explore retrieval techniques and question answering, culminating in a chat system built on these pieces.

Finally, we'll draw conclusions on the challenges and possibilities within this fascinating field of natural language processing.


## API Key and Environment Variables

Create an OpenAI API key at https://platform.openai.com/account/api-keys

Save that key.

There are several ways to load the key. One is to use environment files, which must stay private: add them to `.gitignore` if you are working in a shared repository (GitHub currently detects keys pushed to the platform, raises an alert, and disables the API key, so you have to create another one). The other is to enter the key manually.

### `Dotenv`

For this method, create a `.env` file in the working directory and list the variables in the form `variable = "value_variable"`:


```.env
NAME_OF_VARIABLE="sk-xxxxxxxxxxxxxxxx"
```

To use it, install `python-dotenv`, whose `load_dotenv` and `find_dotenv` functions load the variables defined in the `.env` file.

```python
# `!pip install python-dotenv`
import os

from dotenv import load_dotenv, find_dotenv

# Locate the nearest .env file and load its variables into os.environ.
load_dotenv(find_dotenv())

secret_variable = os.environ['NAME_OF_VARIABLE']
```
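
To make the mechanism less opaque, here is a rough sketch of what loading a `.env` file amounts to: parse `KEY=value` lines and copy them into `os.environ`. The helper `parse_env_text` and the variable `EXAMPLE_PLAIN_VALUE` are hypothetical, and the parser is far simpler than `python-dotenv` (no variable expansion, no multi-line values); in practice, keep using `load_dotenv`.

```python
import os


def parse_env_text(text):
    """Parse KEY=value lines into a dict (hypothetical helper, not python-dotenv)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
            value = value[1:-1]
        values[key] = value
    return values


sample = 'NAME_OF_VARIABLE="sk-xxxxxxxxxxxxxxxx"\n# a comment\nEXAMPLE_PLAIN_VALUE = plain'
parsed = parse_env_text(sample)

# Like load_dotenv's default, do not overwrite variables that are already set.
for key, value in parsed.items():
    os.environ.setdefault(key, value)

print(parsed)
```

Using `setdefault` means a value exported in the shell wins over the file, which is usually what you want on a deployment machine.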

### Colab 

In Colab we can prompt for the key with `getpass`, which hides the API key as it is typed. The drawback compared with the previous method is that the key must be pasted in again every time the notebook is run.

In [None]:
# !pip install openai
import getpass, openai, os
api_key = getpass.getpass(prompt="OPENAI - KEY: ")
openai.api_key = api_key
os.environ["OPENAI_API_KEY"] = api_key

## Document Loading

There are two cases to consider: the information is fetched from the web, or it is already in our working environment. In the first case we may need the `requests` library to download the file or its contents (although `langchain` already implements these cases). In the second case, the relative or absolute path of the file is enough.

All loaders follow the same pattern: `load()` returns a list of `Document` objects, each with two attributes: `page_content`, which holds the text, and `metadata`.

```
# `Method` is a placeholder for a concrete loader class such as PyPDFLoader.
from langchain.document_loaders import Method
loader = Method(file_path)
docs = loader.load()
print(docs[0])
# Document(
#     page_content="text",
#     metadata={"source": file_path, ...}
# )
```
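
The shape of that return value can be illustrated without LangChain. The sketch below uses a hypothetical `SimpleDocument` dataclass and a hypothetical `load_text_file` function to mimic the pattern (a loader returns a list of objects carrying `page_content` and `metadata`); the real `Document` class has more behavior than this.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(file_path):
    """Hypothetical loader: read one local text file into a one-element list."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(file_path)})]


# Write a small file first so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "note.txt"
    path.write_text("hello from a local file", encoding="utf-8")
    docs = load_text_file(path)

print(docs[0].page_content)
print(docs[0].metadata)
```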


### PDFs

For PDF files, `PyPDFLoader` can read the document from either a URL or a local path.

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf")
pages = loader.load()

print(pages[0].page_content[:100])
print(pages[0].metadata)

### Web Plain Text

For plain text served from a URL, use `WebBaseLoader`.

In [None]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://raw.githubusercontent.com/basecamp/handbook/master/getting-started.md")
loader.load()

### JSON

In [None]:
# from langchain_community.document_loaders import JSONLoader
# loader = JSONLoader(
#     file_path="",
#     jq_schema='.messages[].content',
#     text_content=False)

# data = loader.load()

For other document types, see the [Langchain - Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) documentation.

## Document Splitting

Splitting a document before passing it to an LLM (large language model) helps for two reasons. First, it keeps long documents manageable: an LLM can only process a limited amount of text at once, so very large texts run into context-length, memory, or compute limits. Second, it improves contextual representation, because smaller chunks capture local contextual relationships more effectively.

Most splitters in `langchain.text_splitter` accept the following parameters (the values shown are examples):

- `separator="\n"`: String used to split the text into parts (e.g., line breaks). Used by `CharacterTextSplitter`.
- `chunk_size=100`: Maximum size of each chunk, as measured by `length_function`.
- `chunk_overlap=20`: Amount of overlap between consecutive chunks, in the same units.
- `length_function`: Function used to measure the length of a chunk. With `len`, sizes are counted in characters; a token-counting function would make them count tokens.
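
To see how `chunk_size` and `chunk_overlap` interact, the following sketch cuts a string into fixed-size character windows. The function `sliding_chunks` is hypothetical and is not how LangChain's splitters work internally (they split on separators first); it only illustrates the two parameters.

```python
def sliding_chunks(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


alphabet_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in alphabet_chunks:
    print(chunk)
```

The overlap is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk, at the cost of storing some text twice.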



### Split by character

Splits text based on a user defined character. 

In [None]:
from langchain.document_loaders import WebBaseLoader

markdown = WebBaseLoader("https://raw.githubusercontent.com/basecamp/handbook/master/how-we-work.md")
markdown_doc = markdown.load()
text_markdown = markdown_doc[0].page_content
print(text_markdown[:500])

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text_splitted = text_splitter.split_text(text_markdown)  # text_markdown is already a plain string
print(text_splitted[0])
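
As a rough mental model of separator-based splitting, the sketch below splits on a separator and then greedily packs the pieces back together until `chunk_size` would be exceeded. The function `split_on_separator` is hypothetical, ignores `chunk_overlap`, and is a simplification rather than the actual `CharacterTextSplitter` implementation.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000, length_function=len):
    """Split on separator, then greedily pack pieces into chunks up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or length_function(candidate) <= chunk_size:
            # The piece fits, or it starts a new chunk. A single piece longer
            # than chunk_size is kept whole: there is no separator inside it.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "aaaa\n\nbbbb\n\ncccc\n\ndddddddddddd"
packed = split_on_separator(sample_text, chunk_size=10)
print(packed)
```

Passing a different `length_function`, for example `lambda s: len(s.split())`, makes `chunk_size` count words instead of characters.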

### Split for markdown

Splits text based on Markdown-specific characters. Notably, this adds relevant information about where each chunk came from (based on the Markdown headers).

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

md_header_splits = markdown_splitter.split_text(text_markdown)  # text_markdown is already a plain string
md_header_splits[:2]
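
The idea of carrying header metadata along with each chunk can be sketched in plain Python. The function `split_by_headers` below is hypothetical and simplified: it returns dictionaries instead of `Document` objects and does not handle `#` characters inside fenced code blocks.

```python
def split_by_headers(markdown_text, headers_to_split_on):
    """Group lines under their enclosing headers (simplified, hypothetical helper)."""
    current_headers = {}
    buffer = []
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(current_headers)})
        buffer.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        for prefix, name in headers_to_split_on:
            # The trailing space keeps "## Title" from matching the "#" prefix.
            if stripped.startswith(prefix + " "):
                flush()
                # A new header resets itself and every deeper level.
                for other_prefix, other_name in headers_to_split_on:
                    if len(other_prefix) >= len(prefix):
                        current_headers.pop(other_name, None)
                current_headers[name] = stripped[len(prefix):].strip()
                break
        else:
            buffer.append(line)
    flush()
    return chunks


sample_md = "# Title\nintro\n## Section A\ntext a\n## Section B\ntext b"
header_chunks = split_by_headers(sample_md, [("#", "Header 1"), ("##", "Header 2")])
for chunk in header_chunks:
    print(chunk)
```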

### Split for Code

Splits text based on characters specific to coding languages.

In [None]:
from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter,
)
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

In [None]:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
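
The separator list matters because a recursive splitter tries separators in order, falling back to the next one only for pieces that are still too large. The sketch below shows that idea with a hypothetical `recursive_split` function and an illustrative separator list; unlike the real `RecursiveCharacterTextSplitter`, it discards the separators and does not merge small neighboring pieces back together.

```python
def recursive_split(text, separators, chunk_size):
    """Split text with the first separator, recursing on oversized pieces.

    separators must be non-empty strings, ordered from coarsest to finest.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, remaining = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


code = 'def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()\n'
code_chunks = recursive_split(code, ["\ndef ", "\n\n", "\n", " "], chunk_size=30)
for chunk in code_chunks:
    print(repr(chunk))
```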

## Embedding and Vectorstores (Storage)

### Embeddings

Embeddings are vector representations of text in a high-dimensional space, learned during training. They capture semantic and contextual meaning, so texts with similar meaning map to nearby vectors, which makes it possible to compare and search text numerically.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
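
The two-method shape can be sketched with a toy class. `ToyEmbeddings` below is hypothetical: it hashes words into a small fixed-size vector, so it captures no real semantics and is only meant to show the interface (`embed_documents` takes a list, `embed_query` takes one string).

```python
import hashlib
import math


class ToyEmbeddings:
    """Hypothetical stand-in with the same two-method shape as an embeddings class."""

    def __init__(self, dimensions=16):
        self.dimensions = dimensions

    def _embed(self, text):
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            # hashlib gives a stable bucket; the built-in hash() is salted per process.
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0
        # Normalize to unit length so vectors are comparable.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_documents(self, texts):
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        return self._embed(text)


toy = ToyEmbeddings()
toy_vectors = toy.embed_documents(["Hi there!", "Hello"])
print(len(toy_vectors), len(toy_vectors[0]))
```

In this toy both methods share one implementation; a real provider may use different models or prompts for documents and queries, which is why the interface keeps them separate.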


Example:

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
embeddings = embedding.embed_documents(
    [
        "Hi there!",
        "Hello"
    ]
)
len(embeddings), len(embeddings[0])

In [None]:
text_embedding = embeddings[0]
print("length: ", len(text_embedding), "\nvector_sample: " ,text_embedding[:3])

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf")
pages = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(pages)

### Vectorstores

To build our database, we need a list of `Document` objects.

With Chroma, the database is stored locally. Note that there is not yet any directory for our Chroma database:

In [None]:
os.listdir()

Now we create the vector store from the document splits, the embedding method, and the location where the vector store will be persisted.

In [None]:
from langchain.vectorstores import Chroma

persist_directory = './vector_db_chroma/'

!rm -rf ./vector_db_chroma/  # remove old database files if any


vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

Similarity search

In [None]:
query_1 = vectordb.similarity_search(
    "What are some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA), particularly in comparison to disciplines like natural language processing and computer vision?",
    k=3,
)
query_2 = vectordb.similarity_search(
    "How does the LayoutParser library address the challenges mentioned in the summary and contribute to streamlining the usage of deep learning in DIA research and applications?",
    k=3,
)
# print(query_1)

In [None]:
query_1[0].page_content[:100]
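
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below uses cosine similarity on hand-written three-dimensional vectors; the functions `cosine_similarity` and `top_k` and the sample data are hypothetical, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=3):
    """Return the k texts from (text, vector) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
stored = [
    ("chunk about layout detection", [0.9, 0.1, 0.0]),
    ("chunk about OCR", [0.1, 0.9, 0.0]),
    ("chunk about licensing", [0.0, 0.1, 0.9]),
]
results = top_k([0.8, 0.2, 0.0], stored, k=2)
print(results)
```

A linear scan like this is fine for a few thousand chunks; dedicated vector stores add indexing and persistence on top of the same idea.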

## Retrieval
## Question Answering
## Chat
## Conclusion