### Author: Guilherme Resende

This notebook builds the local database for each chunking strategy we want to evaluate (see the explanation below).

---

### Reading the files

Since we are dealing with Markdown texts, there are unwanted characters throughout the files. The way we read the files depends on the pre-processing steps we want to perform and on the chunking strategy we are about to follow.

After quickly browsing through the AWS Documentation, I noticed that each `.md` file holds one entire text subsection. If the writers followed a reasonable writing strategy, we can assume the texts will not be too long and will be approximately self-contained. Hence, the first chunking alternative is already given: read each `.md` file and use it, as it is, as a single chunk.
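A minimal sketch of this first alternative, assuming the files sit somewhere under one root directory. The function name `load_files_as_chunks` and the `{"source", "text"}` layout are hypothetical choices for illustration, not part of this notebook's pipeline.

```python
from pathlib import Path


def load_files_as_chunks(root, pattern="*.md"):
    """Return one chunk per file as a list of {"source", "text"} dicts.

    Hypothetical helper: every matching file under `root` becomes
    exactly one chunk, with no further splitting.
    """
    chunks = []
    # sorted() keeps the chunk order stable between runs.
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty after stripping whitespace
            chunks.append({"source": str(path), "text": text})
    return chunks
```

Keeping the source path next to the text makes it possible to trace a retrieved chunk back to the file it came from.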

The second alternative we will test here concatenates all the pieces of text into one single article and then applies recursive chunking, based on text separators and a pre-defined chunk size.
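To make the idea of recursive chunking concrete, here is a simplified, pure-Python sketch. It is not LangChain's implementation: it ignores chunk overlap, and the function name `recursive_split` is hypothetical. It tries the coarsest separator first, merges neighbouring pieces while they fit within the chunk size, and falls back to finer separators (and finally to a hard cut) for pieces that are still too long.

```python
def recursive_split(text, separators=("\n\n", "\n"), chunk_size=2048):
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: no overlap between consecutive chunks.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks
```

For example, `recursive_split("aaaa\n\nbbbb\n\ncccc\ndddd\neeee", chunk_size=10)` keeps the first two paragraphs together and splits the third one on single newlines.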

In [None]:
import json
import os

# Load the OpenAI API key from a local credentials file.
with open("credentials.json", "r") as f:
    credentials = json.load(f)

os.environ["OPENAI_API_KEY"] = credentials["OPENAI_API_KEY"]

In [None]:
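# Swap the standard-library sqlite3 module for pysqlite3 before Chroma is
# imported; a common workaround when the system SQLite is too old for Chroma.
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')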

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

In [None]:
# loader = DirectoryLoader('./awsdocs_merged/', glob="*.txt", loader_cls=TextLoader)
loader = DirectoryLoader('./awsdocs_plain_text/', glob="*/*.txt", loader_cls=TextLoader)

documents = loader.load()

In [None]:
len(documents)

In [None]:
CHUNK_SIZE = 2048
OVERLAP = 256

recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"],
    chunk_size=CHUNK_SIZE,
    chunk_overlap=OVERLAP
)

chunked_documents = recursive_splitter.split_documents(documents)

In [None]:
n = len(chunked_documents)

print(f"There are {n} chunks of data.")

In [None]:
persist_directory = f"db_chunk_size_{CHUNK_SIZE}"

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=chunked_documents,
    embedding=embedding,
    persist_directory=persist_directory
)

In [None]:
vectordb.persist()