## Step 1 - Add your OpenAI API key
The key must come from a registered OpenAI account.
Keys are managed in the keys sub-menu: https://platform.openai.com/account/api-keys.

In [12]:
%env OPENAI_API_KEY YOUR_OPENAI_API_KEY

env: OPENAI_API_KEY=YOUR_OPENAI_API_KEY
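
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch, assuming the key was exported in the shell that launched Jupyter: the helper `require_api_key` is hypothetical (it is not part of this notebook, OpenAI, or Langchain) and only checks that the variable holds something other than an empty value or a leftover placeholder.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not an OpenAI or Langchain API.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat an empty value or an unreplaced placeholder as "not configured".
    if not key or key.endswith("_OPENAI_API_KEY"):
        raise RuntimeError(f"{name} is not set to a real key")
    return key
```

Calling `require_api_key()` at the top of the notebook fails early, before any request is sent.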


## Step 2 - Installing necessary libraries
The libraries below are needed for:
- Loading of text data from files and further preparation;
- Creating embeddings via Langchain;
- Prompting;
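
A quick way to see which libraries still need installing is to ask Python whether each one is importable. This is a sketch using only the standard library; the names in `REQUIRED` are an assumption about what the later cells import (Langchain itself, plus the packages that `OpenAIEmbeddings` and `PyPDFLoader` usually rely on), so adjust the list as needed.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed top-level import names; not taken from the original notebook.
REQUIRED = ["langchain", "openai", "tiktoken", "pypdf"]
print("Missing:", missing_packages(REQUIRED))
```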

In [5]:
!pip install langchain openai tiktoken pypdf


[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: pip install --upgrade pip


## Step 3 - Libraries import
The imported libraries are used throughout the notebook.

In [6]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

## Step 4 - Create tool for embedding generation
The tool is provided by Langchain and simplifies the embedding generation process while providing reasonable defaults.

By default, `OpenAIEmbeddings` uses nearly optimal settings:
- Model: `text-embedding-ada-002`;
- Max chunk length: 8191 tokens;

In [8]:
embeddings = OpenAIEmbeddings()

In [9]:
query = "What am I doing here?"
query_embedding = embeddings.embed_query(query)

query_embedding

[-0.009089913617436905,
 -0.0317989494628773,
 0.01587427686933764,
 -0.016390821866538088,
 -0.03472183068608283,
 0.004176446923055135,
 -0.00835289384611027,
 0.01373250981349047,
 0.017108944172611485,
 -0.016542005313854063,
 0.023521647340425018,
 -0.018809758886238717,
 -0.0026441380227851703,
 0.0115403479647638,
 -0.0005110320335433305,
 -0.003703997951700825,
 0.03404150703580599,
 -0.014740400083145313,
 -0.0047811811779802935,
 -0.0122080754779577,
 -0.02120349804383834,
 -0.015256944149023238,
 0.017436507066699403,
 -0.002888236491122946,
 0.008390689707939264,
 0.01700815458685249,
 0.009864729250592535,
 -0.02522246205405225,
 0.0069607445613177185,
 -0.01867117436997325,
 0.017020753517902994,
 0.007559179816378884,
 -0.015080563770928776,
 -0.01797824806335582,
 -0.02000662939636105,
 0.0009165505224679578,
 0.018557785853163748,
 -0.005496153751291066,
 0.019527881192312118,
 -0.011786020600990998,
 0.01057025262561543,
 -0.0023512198607798396,
 -0.017285323619383428
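
A single embedding vector is only useful when compared with other vectors. The sketch below assumes cosine similarity as the comparison metric, which is a common choice for embeddings although this notebook does not name one. It uses toy 3-dimensional vectors in place of real `text-embedding-ada-002` output (1536 dimensions), and `cosine_similarity` is an illustrative helper, not a Langchain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query_vec = [0.1, 0.3, -0.2]
chunk_vec = [0.2, 0.6, -0.4]  # same direction as query_vec, different length
print(cosine_similarity(query_vec, chunk_vec))
```

With real data, `query_embedding` would be compared against the embedding of each stored text chunk, and the highest-scoring chunks would be passed to the model as context.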

## Step 5 - Create tools for text data splitting

Related docs:
- [CharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html);
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html);

Despite being more advanced at splitting text into chunks, the `RecursiveCharacterTextSplitter` produced unimpressive results in a real-world app.

In [10]:
chunk_size = 350
chunk_overlap = 50
length_function = len

recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=length_function
)

character_text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=length_function,
)
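
To make the role of `separator` and `chunk_size` concrete, here is a simplified, standard-library illustration of separator-based splitting. It is not Langchain's implementation: it ignores `chunk_overlap`, and the function name `split_by_separator` is made up for this sketch. Pieces are merged greedily while the result fits into `chunk_size`, and a piece that is already too long is kept whole.

```python
def split_by_separator(text, separator="\n\n", chunk_size=350):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is emitted unchanged.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks


print(split_by_separator("aaa\n\nbbb\n\nccc", chunk_size=8))
```

Because splitting only happens at the separator, the quality of the chunks depends on how well the separator matches the text's real structure.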

## Step 6 - Creating context-based text chunks
Text chunks will be vectorised and put into storage.

To be as useful as possible, the text must be split into its contextual parts. These parts must meet the following requirements:
- Be as small as possible;
- Be as focused on the subject as possible;

LLMs can be used to achieve this goal automatically:
1. Derive contextual parts from the text (ask the LLM);
2. Split the overall document into chunks by their context (ask the LLM);
3. Split the resulting text by a delimiter, usually "\n\n" (this can be done with a Langchain text splitter or a custom one);
4. Repeat the operation until the desired granularity is reached (the simplest check is by length);
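
The loop described by these steps can be sketched as follows. Everything here is an assumption made for illustration: `split_until_small` is a hypothetical helper, `refine` stands for whatever function sends a block to the LLM and returns the same text with contextual parts separated by the delimiter, and `max_rounds` is an arbitrary safety limit so that a block the model cannot split does not cause an endless loop.

```python
from typing import Callable, List


def split_until_small(
    text: str,
    refine: Callable[[str], str],
    max_len: int = 350,
    delimiter: str = "\n\n",
    max_rounds: int = 3,
) -> List[str]:
    """Repeatedly refine blocks longer than max_len and split them by delimiter."""
    blocks = [text]
    for _ in range(max_rounds):
        next_blocks: List[str] = []
        changed = False
        for block in blocks:
            if len(block) <= max_len:
                next_blocks.append(block)  # already small enough
                continue
            parts = [p.strip() for p in refine(block).split(delimiter) if p.strip()]
            if len(parts) > 1:
                changed = True
                next_blocks.extend(parts)
            else:
                next_blocks.append(block)  # refine did not split it; keep it as-is
        blocks = next_blocks
        if not changed:
            break
    return blocks
```

In practice `refine` would wrap a call to the model with a prompt such as the one used in this step; for a dry run it can be replaced by any function that inserts the delimiter.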

Example of a prompt for a generative OpenAI model that breaks a text down into small contextual pieces:
```
Define the main components of the text below based on the context and semantics. All of the rules below must be used to perform the task:
1. Break down the original text into independent blocks by using the defined components;
2. Separate text blocks by an empty line between each other;
3. Define a general title, in a few words, for the text below based on its context;
4. Name each text block according to the general title of the original text and the main component that each block describes;
5. For block naming use only - the general title of the original text and the name of the component based on its context;
6. Do not use numbers in the block's name;
7. After the block's name use a new line;

Text:
{text}
```

The expected output of the prompt is:
```
General Title: {general_title_name}

{general_title_name} - {block_name}
{block_content}
```
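
Output in this shape is easy to turn into structured data. The sketch below assumes the model follows the format exactly (a `General Title:` line, then blocks separated by empty lines, each with its name on the first line); `parse_blocks` and the sample text are hypothetical and not taken from the preprocessed files.

```python
def parse_blocks(output: str):
    """Parse the expected model output into (general_title, list of blocks)."""
    sections = [s.strip() for s in output.strip().split("\n\n") if s.strip()]
    title = None
    blocks = []
    for section in sections:
        if section.startswith("General Title:"):
            title = section[len("General Title:"):].strip()
            continue
        # The first line is the block name, the rest is the block content.
        name, _, content = section.partition("\n")
        blocks.append({"name": name.strip(), "content": content.strip()})
    return title, blocks


# Hypothetical sample in the expected format.
sample = (
    "General Title: Microservices\n\n"
    "Microservices - Definition\n"
    "Small autonomous services.\n\n"
    "Microservices - Deployment\n"
    "Services deploy independently.\n"
)
title, blocks = parse_blocks(sample)
print(title, [b["name"] for b in blocks])
```

Real model output may deviate from the format, so a production version would need to handle missing titles or blocks without a name line.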

Content preprocessed with the above method is in the `docs/preprocessed_with_gpt` directory.

## Step 7 - Content loading and splitting by the Langchain tools
The purpose is to get text chunks that can later be passed to GPT as context.

Conclusion after experiments:
Working with raw content without "context based splitting" produces unsatisfactory results.

In [11]:
loaded_pdf = PyPDFLoader(file_path="./docs/what_is_microservice_architecture.pdf")
recursively_split_pages = loaded_pdf.load_and_split(text_splitter=recursive_text_splitter)
character_split_pages = loaded_pdf.load_and_split(text_splitter=character_text_splitter)

[
    recursively_split_pages[0],
    "+++",
    recursively_split_pages[1],
    "+++",
    recursively_split_pages[2],
    "============",
    character_split_pages[0],
    "+++",
    character_split_pages[1],
    "+++",
    character_split_pages[2],
]

[Document(page_content='30/04/2023, 16:12 Microservice architecture style - Azure Architecture Center | Microsoft Learn\nhttps://learn.microsoft.com/en-us/azure/architecture/guide/architecture-styles/microservices 1/5Microservice architecture style\nAzure\nA microservices architecture consists of a collection of small, autonomous services. Each', metadata={'source': './docs/what_is_microservice_architecture.pdf', 'page': 0}),
 '+++',
 Document(page_content='service is self-contained and should implement a single business capability within a\nbounded context. A bounded context is a natural division within a business and provides\nan explicit boundary within which a domain model exists.\nMicroservices are small, independent, and loosely coupled. A single small team of', metadata={'source': './docs/what_is_microservice_architecture.pdf', 'page': 0}),
 '+++',
 Document(page_content='developers can write and maintain a service.\nEach service is a separate codebase, which can be managed by a
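
One rough way to put a number on "unsatisfactory" is to count how many chunks stop in the middle of a sentence. This is a heuristic sketch, not a metric used in the original experiments: `ends_mid_sentence` simply checks the last character, and the sample strings are the endings of chunks printed in this notebook.

```python
SENTENCE_END = (".", "!", "?", ":")


def ends_mid_sentence(chunk: str) -> bool:
    """True when the chunk does not end with sentence-closing punctuation."""
    return not chunk.rstrip().endswith(SENTENCE_END)


def mid_sentence_ratio(chunks) -> float:
    """Fraction of chunks that end mid-sentence (0.0 for an empty input)."""
    chunks = list(chunks)
    if not chunks:
        return 0.0
    return sum(ends_mid_sentence(c) for c in chunks) / len(chunks)


# Endings of the first two chunks produced from the raw PDF.
raw_tails = [
    "small, autonomous services. Each",
    "A single small team of",
]
# Ending of a chunk produced from the GPT-preprocessed text.
preprocessed_tails = ["rebuilding and redeploying the entire application."]

print(mid_sentence_ratio(raw_tails), mid_sentence_ratio(preprocessed_tails))
```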

## Step 8 - GPT pre-processed content loading and splitting by the Langchain tools
The purpose is to get text chunks that can later be passed to GPT as context.

Conclusion after experiments:
Working with content that has been "context based split" produces the required chunking result!

Requirements:
- Contextual blocks are separated by a well identifiable delimiter (`\n\n` for example);

The actual chunking of context-grouped content can be conveniently done via `CharacterTextSplitter`.

In [28]:
loaded_text = TextLoader(file_path="./docs/preprocessed_with_gpt/what_is_microservice_architecture.txt")
recursively_split_pages = loaded_text.load_and_split(text_splitter=recursive_text_splitter)
character_split_pages = loaded_text.load_and_split(text_splitter=character_text_splitter)

[
    recursively_split_pages[0],
    "+++",
    recursively_split_pages[1],
    "+++",
    recursively_split_pages[2],
    "============",
    character_split_pages[0],
    "+++",
    character_split_pages[1],
    "+++",
    character_split_pages[2],
]

Created a chunk of size 627, which is longer than the specified 350
Created a chunk of size 571, which is longer than the specified 350
Created a chunk of size 365, which is longer than the specified 350
Created a chunk of size 2223, which is longer than the specified 350
Created a chunk of size 2272, which is longer than the specified 350
Created a chunk of size 1494, which is longer than the specified 350
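
These warnings are consistent with a delimiter-separated block being kept whole even though it is longer than `chunk_size`. The sketch below lists such blocks before any splitter runs, so they can go through another round of preprocessing. It uses only the standard library, and `oversized_blocks` is a hypothetical helper rather than part of Langchain.

```python
def oversized_blocks(text, separator="\n\n", chunk_size=350):
    """Return (index, length) for every separator-delimited block longer than chunk_size."""
    blocks = [b.strip() for b in text.split(separator)]
    return [(i, len(b)) for i, b in enumerate(blocks) if len(b) > chunk_size]


# Synthetic text: the middle block is 400 characters long.
demo = "short block\n\n" + "x" * 400 + "\n\nanother short block"
print(oversized_blocks(demo))
```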


[Document(page_content='1. What are microservices?', metadata={'source': './docs/preprocessed_with_gpt/what_is_microservice_architecture.txt'}),
 '+++',
 Document(page_content='Microservices are small, independent, and loosely coupled. A single small team of developers can write and maintain a service. Each service is a separate codebase, which can be managed by a small development team. Services can be deployed independently. A team can update an existing service without rebuilding and redeploying the entire application.', metadata={'source': './docs/preprocessed_with_gpt/what_is_microservice_architecture.txt'}),
 '+++',
 Document(page_content='rebuilding and redeploying the entire application. Services are responsible for persisting their own data or external state. Services communicate with each other by using well-defined APIs. Internal implementation details of each service are hidden from other services. Supports polyglot programming.', metadata={'source': './docs/preprocessed_wi