## Step 1 - Add your OpenAI API key
The key must come from a registered OpenAI account.
Keys are created in the API keys sub-menu: https://platform.openai.com/account/api-keys.

In [1]:
%env OPENAI_API_KEY <your-api-key>

env: OPENAI_API_KEY=<your-api-key>
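
Pasting the key into a cell leaves it in the saved notebook and in the cell output. A minimal sketch of one alternative, assuming the key was exported in the shell before Jupyter was started. `require_api_key` and `mask_key` are hypothetical helper names, not part of any library:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


def mask_key(key, visible=4):
    """Hide all but the last `visible` characters so the key can be printed safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Usage with a stand-in mapping instead of the real environment:
demo_env = {"OPENAI_API_KEY": "sk-example-not-a-real-key"}
print(mask_key(require_api_key(demo_env)))
```

Calling `require_api_key()` with no arguments reads the real process environment.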


## Step 2 - Installing necessary libraries
The libraries below are needed for:
- Loading text data from files and preparing it;
- Creating embeddings via LangChain;
- Prompting;

In [2]:
!pip install langchain openai tiktoken pypdf


[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: pip install --upgrade pip


## Step 3 - Libraries import
The imported libraries are used throughout the notebook.

In [3]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

## Step 4 - Create tool for embedding generation
The tool is provided by LangChain. It simplifies embedding generation and comes with reasonable defaults.

By default, `OpenAIEmbeddings` uses nearly optimal settings:
- Model: `text-embedding-ada-002`;
- Max chunk length: 8191 tokens;

In [4]:
embeddings = OpenAIEmbeddings()

In [5]:
query = "What am I doing here?"
query_embedding = embeddings.embed_query(query)

query_embedding

[-0.009068081133982732,
 -0.03183904282672644,
 0.01591952141336322,
 -0.016360329425402716,
 -0.034685410477182335,
 0.004175095809067048,
 -0.008400569650043679,
 0.013702878696115835,
 0.01699005808390151,
 -0.01651146445245404,
 0.023501444552416526,
 -0.018828863233520593,
 -0.0025803099791057707,
 0.011605884852391604,
 -0.0005061437829539754,
 -0.0037279890296396737,
 0.034030493268390034,
 -0.014685254509304285,
 -0.004798526398669893,
 -0.012285990611477409,
 -0.02120923432164348,
 -0.015277198479717674,
 0.017330111894766985,
 -0.0029754643258148965,
 0.008381677771662264,
 0.017040437047133667,
 0.009886727644973104,
 -0.025277277394979272,
 0.006933303999156955,
 -0.018627351105882258,
 0.017027841840664342,
 0.007575626467141216,
 -0.01518903669104526,
 -0.017959838690620636,
 -0.019962374073760364,
 0.0009949700883352013,
 0.01856437693618078,
 -0.005466037882608945,
 0.019471185235843565,
 -0.011801100308117847,
 0.010497563493767912,
 -0.0023850942905488852,
 -0.0173049
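
The vector is not meant to be read directly; it becomes useful when compared with other vectors. A minimal sketch of one common comparison, cosine similarity, in plain Python. The notebook does not show this step, so the choice of metric is an assumption, and the short vectors below are made-up stand-ins rather than real `text-embedding-ada-002` output:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

With real data, `query_vec` would be the result of `embeddings.embed_query(...)` and the other vectors would be the embeddings of text chunks.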

## Step 5 - Create tools for text data splitting

Related docs:
- [CharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html);
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html);

Despite being the more advanced of the two splitters, `RecursiveCharacterTextSplitter` gives unimpressive results in a real-world app.

In [6]:
chunk_size = 350
chunk_overlap = 50
length_function = len

recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=length_function
)

character_text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=length_function,
)
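
How `chunk_size` and `chunk_overlap` interact is easier to see in a simplified model. The function below is a rough plain-Python approximation of separator-based splitting, not LangChain's actual implementation. It splits on the separator, packs pieces into chunks of up to `chunk_size` characters, and carries trailing pieces into the next chunk while they fit within `chunk_overlap`. `split_with_overlap` is a hypothetical name:

```python
def split_with_overlap(text, separator="\n\n", chunk_size=350, chunk_overlap=50):
    """Pack separator-delimited pieces into chunks of at most chunk_size characters."""

    def joined_len(pieces):
        return len(separator.join(pieces))

    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "\n\n".join(["a" * 10, "b" * 10, "c" * 10])
for chunk in split_with_overlap(text, chunk_size=25, chunk_overlap=12):
    print(repr(chunk))
```

In this sketch a single piece longer than `chunk_size` is emitted as-is, because the function never cuts inside a piece.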

## Step 6 - Creating context-based text chunks
Text chunks will be vectorised and put into storage.

To be as useful as possible, the text must be split into its contextual parts. Those contextual parts must meet the following requirements:
- Be as small as possible;
- Be as focused on the subject as possible;

LLMs can be used to achieve this goal automatically:
1. Derive contextual parts from the text (ask LLM);
2. Split the overall document into chunks by their context (ask LLM);
3. Split the resulting text by a delimiter, usually "\n\n" (can be done with a LangChain text splitter or a custom one);
4. Repeat the operation until the desired granularity is reached (the simplest check is by length);
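
The four steps above can be sketched as a loop. This is an illustration under assumptions, not the code used to produce the preprocessed files: `ask_llm` is a hypothetical callable that takes a text and returns it regrouped into context blocks separated by the delimiter, and `max_rounds` is an added safety limit:

```python
def refine_chunks(text, ask_llm, max_len=350, delimiter="\n\n", max_rounds=3):
    """Repeatedly ask an LLM to regroup oversized blocks, then split on the delimiter."""
    chunks = [text]
    for _ in range(max_rounds):
        if all(len(chunk) <= max_len for chunk in chunks):
            break  # desired granularity reached
        refined = []
        for chunk in chunks:
            if len(chunk) <= max_len:
                refined.append(chunk)
                continue
            # Steps 1-2: the LLM derives contextual parts and regroups the text.
            regrouped = ask_llm(chunk)
            # Step 3: split the regrouped text on the delimiter.
            refined.extend(p.strip() for p in regrouped.split(delimiter) if p.strip())
        chunks = refined
    return chunks


def fake_llm(chunk):
    """Stand-in for a real model call: puts every sentence in its own block."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    return "\n\n".join(s + "." for s in sentences)


print(refine_chunks("First topic. Second topic. Third topic.", fake_llm, max_len=20))
```

If the model cannot make a block any smaller, the loop stops after `max_rounds` passes and returns the oversized block unchanged.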

Example of a prompt for a generative OpenAI model that breaks a text down into small contextual pieces:
```
Define the main components of the text below based on the context and semantics. All of the rules below must be followed when performing the task:
1. Break down the original text into independent blocks by using the defined components;
2. Separate text blocks by an empty line between each other;
3. Define a general title, in a few words, for the text below based on its context;
4. Name each text block according to the general title of the original text and the main component that each block describes;
5. For block naming use only - the general title of the original text and the name of the component based on its context;
6. Do not use numbers in the block's name;
7. After the block's name use a new line;

Text:
{text}
```

The expected output of the prompt is:
```
General Title: {general_title_name}

{general_title_name} - {block_name}
{block_content}
```
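
Output in this shape is straightforward to post-process. A minimal sketch of a parser, assuming the model follows the format exactly: a `General Title:` line, then blocks separated by empty lines, each starting with its block name. `parse_blocks` is a hypothetical helper, and real model output may need more defensive handling:

```python
def parse_blocks(llm_output):
    """Return (general_title, [(block_name, block_content), ...])."""
    general_title = None
    blocks = []
    for raw_block in llm_output.strip().split("\n\n"):
        lines = [line.strip() for line in raw_block.splitlines() if line.strip()]
        if not lines:
            continue
        # The title line is matched case-insensitively.
        if general_title is None and lines[0].lower().startswith("general title:"):
            general_title = lines[0].split(":", 1)[1].strip()
            lines = lines[1:]
            if not lines:
                continue
        # First remaining line is the block name, the rest is its content.
        name, content = lines[0], "\n".join(lines[1:])
        blocks.append((name, content))
    return general_title, blocks


sample = (
    "General Title: Microservice Architecture Style\n"
    "\n"
    "Microservice Architecture Style - Overview\n"
    "A microservices architecture consists of a collection of small, autonomous services.\n"
)
title, blocks = parse_blocks(sample)
print(title)
print(blocks)
```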

Content preprocessed with the above method is in the `docs/preprocessed_with_gpt` directory.

## Step 7 - Content loading and splitting by the Langchain tools
The purpose is to get text chunks that can later be passed to GPT as context.

Conclusion after experiments:
Working with raw content, without context-based splitting, produces unsatisfactory results.

In [7]:
loaded_pdf = PyPDFLoader(file_path="./docs/what_is_microservice_architecture.pdf")
recursively_split_pages = loaded_pdf.load_and_split(text_splitter=recursive_text_splitter)
character_split_pages = loaded_pdf.load_and_split(text_splitter=character_text_splitter)

[
    recursively_split_pages[0],
    "+++",
    recursively_split_pages[1],
    "+++",
    recursively_split_pages[2],
    "============",
    character_split_pages[0],
    "+++",
    character_split_pages[1],
    "+++",
    character_split_pages[2],
]

[Document(page_content='30/04/2023, 16:12 Microservice architecture style - Azure Architecture Center | Microsoft Learn\nhttps://learn.microsoft.com/en-us/azure/architecture/guide/architecture-styles/microservices 1/5Microservice architecture style\nAzure\nA microservices architecture consists of a collection of small, autonomous services. Each', metadata={'source': './docs/what_is_microservice_architecture.pdf', 'page': 0}),
 '+++',
 Document(page_content='service is self-contained and should implement a single business capability within a\nbounded context. A bounded context is a natural division within a business and provides\nan explicit boundary within which a domain model exists.\nMicroservices are small, independent, and loosely coupled. A single small team of', metadata={'source': './docs/what_is_microservice_architecture.pdf', 'page': 0}),
 '+++',
 Document(page_content='developers can write and maintain a service.\nEach service is a separate codebase, which can be managed by a

## Step 8 - GPT pre-processed content loading and splitting by the Langchain tools
The purpose is to get text chunks that can later be passed to GPT as context.

Conclusion after experiments:
Working with content that has been split by context produces the required chunking result!

Requirements:
- Contextual blocks are separated by a clearly identifiable delimiter (`\n\n`, for example);

The actual chunking of context-grouped content can be conveniently done with `CharacterTextSplitter`.

In [8]:
# The preprocessed file is plain text, so TextLoader is used instead of PyPDFLoader.
loaded_text = TextLoader(file_path="./docs/preprocessed_with_gpt/what_is_microservice_architecture.txt")
recursively_split_pages = loaded_text.load_and_split(text_splitter=recursive_text_splitter)
character_split_pages = loaded_text.load_and_split(text_splitter=character_text_splitter)

[
    recursively_split_pages[0],
    "+++",
    recursively_split_pages[1],
    "+++",
    recursively_split_pages[2],
    "============",
    character_split_pages[0],
    "+++",
    character_split_pages[1],
    "+++",
    character_split_pages[2],
]

Created a chunk of size 409, which is longer than the specified 350
Created a chunk of size 848, which is longer than the specified 350
Created a chunk of size 1077, which is longer than the specified 350
Created a chunk of size 2231, which is longer than the specified 350
Created a chunk of size 2425, which is longer than the specified 350
Created a chunk of size 1525, which is longer than the specified 350


[Document(page_content='General title: Microservice Architecture Style\nMicroservice Architecture Style - Overview', metadata={'source': './docs/preprocessed_with_gpt/what_is_microservice_architecture.txt'}),
 '+++',
 Document(page_content='A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained and should implement a single business capability within a bounded context. A bounded context is a natural division within a business and provides an explicit boundary within which a domain model exists.', metadata={'source': './docs/preprocessed_with_gpt/what_is_microservice_architecture.txt'}),
 '+++',
 Document(page_content='Microservice Architecture Style - What are Microservices?', metadata={'source': './docs/preprocessed_with_gpt/what_is_microservice_architecture.txt'}),
 Document(page_content='General title: Microservice Architecture Style\nMicroservice Architecture Style - Overview\nA microservices architecture consists of a co