#### CS1 - Install Libraries, define main variables, and some basic functions ####
You can convert this notebook to HTML by entering the following command in the VS Code terminal window:

`jupyter nbconvert --no-input --to html introduction.ipynb`

_Make sure that the terminal window is running in the same virtual environment.  On my Mac computer, if the terminal prompt is `(.venv) $ ` then I know it is running in the virtual environment.  Otherwise, I enter `source venv/bin/activate` to put it into the virtual environment_

**CS1, CS2, CS3, etc. are Code cells in the notebook.  The code does not appear in the HTML output.  You have to look into the notebook file (`introduction.ipynb`) to see the code contained in CS1, CS2, etc..**

In [14]:
# IMPORTANT :
# I am running this in the local virtual environment venv.  The python libraries installed when running introduction.ipynb
# do not have to be re-installed here as they are already installed in the virtual environment.
#
# The following are the additional libraries used in the notebook
#
# I set my OPENAI_API_KEY in a .env file.  You can also set it in your environment variables.
# The following two lines read the .env file and set the environment variable.
import dotenv
dotenv.load_dotenv()
import os

import math
import numpy as np
import melib
from melib.xt import mdx

# The following are the variables used in the notebook

PIE=math.pi
SECTION=0
Chapter="embeddings"
#
# Define the md() function to display markdown text
from IPython.display import display, Markdown
def md(s):
    display(Markdown(s))

# Establish OpenAI API key (see below for how to get one)
import os
import openai
from openai import OpenAI
openai.api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI()
LLM="text-embedding-ada-002"

#### CS1 Ends ####

### CS2 ####
Program flow control variables

In [15]:
RECREATE_EMBEDDINGS=False

### CS2 Ends ###

#### CS3 ####
Embedding computations.  The results are used in the text.

In [16]:
# The following function calculates the factors of the integer N
# I use it to find X and Y factors of the embedding vector size N
# For N=1536, which is the length of the embeddings created by OpenAI A, use X=32 and Y=48
def calculate_factors(N):
    factors = []
    for i in range(1, N+1):
        if N % i == 0:
            factors.append(i)
    return factors
# N = 1536
# factors = calculate_factors(N)
# print(factors)
# print(1536/48)
#
# The following function maps a vector to a matrix of size X by Y
def map_vector_to_array(vector, X, Y):
    A = np.reshape(vector, (X, Y))
    return A
# print(map_vector_to_array([1,2,3,4,5,6,7,8,9,10,11,12], 3, 4))
#
#
import matplotlib.pyplot as plt
def pickcolor(a, s):
    if s=='NP':
        if a < 0:
            color = 'red'
        else:
            color = 'black'
    return color
#
# The following function visualizes the array
# The coloring choices are 'NP' for negative-positive or 'viridis' for the viridis colormap

def visualize_array(A, coloring="NP"):
    # Create a new figure
    plt.figure()

    # Get the dimensions of the array
    rows, cols = A.shape

    if coloring == "NP":
    # Iterate over each element in the array
        for i in range(rows):
            for j in range(cols):
                marker='s'
                # Pick the color
                color = pickcolor(A[i, j], coloring)

                # Plot the element on the x-y plot
                plt.plot(j, i, marker, color=color)
    elif coloring == "viridis":
        plt.imshow(A, cmap='viridis')
        plt.colorbar()

    # Set the x and y axis labels
    plt.xlabel('X')
    plt.ylabel('Y')


In [17]:
# read a text file into a string
def read_text_file(file_name):
    with open(file_name, 'r') as file:
        data = file.read().replace('\n', '')
    return data

# Generate the embedding for the text
def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model=LLM
    )
    return np.array(response.data[0].embedding)



In [18]:
# The following function calculates the dot product of two vectors
def dot_product(v1, v2):
    return np.dot(v1, v2)

# The following function calculates the cosine similarity of two vectors
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# The following function searches for the closest vector in the embedding space.
# It returns the index of the closest vector and the cosine similarity
def find_closest_vector(v, vectors):
    similarity = -1
    index = -1
    for i in range(len(vectors)):
        s = cosine_similarity(v, vectors[i])
        if s > similarity:
            similarity = s
            index = i
    return index, similarity

In [19]:
# Define your string here.  Then run the following cell to get the embedding
sa=["Hello World!",
"One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve",
read_text_file("data/ozconstitution.txt"),
read_text_file("data/elon.txt"),
read_text_file("data/buildings.txt"),
read_text_file("data/pascal.txt")
]

In [20]:
if RECREATE_EMBEDDINGS:
    embeddings=[]
    for s in sa:
        embedding=get_embedding(s)
        embeddings.append(embedding)
    np.save("data/embeddings.npy", embeddings)
else:
    embeddings=np.load("data/embeddings.npy", allow_pickle=True)
    print("Loaded embeddings from file")

Loaded embeddings from file


In [21]:
iembed=1
v=embeddings[iembed]
A=map_vector_to_array(v, 32,48)
# Count the number of negative numbers in the array
print(np.sum(A < 0))
# Visualize the array
visualize_array(A,"viridis")


806


In [22]:
s1="Do lorikeets like being kept in cages?"
s2="When did Turkey have an earthquake?"
s3="Kindness is a friendly hello"
s4="What is the first name of the person who bought Twitter?"
s=s4
v=get_embedding(s)
if v is None:
    print("No embedding for string ", s)
else:
    (i,similarity)=find_closest_vector(v, embeddings)
    md("The query = %s\n\n"%s)
    md("The closest source is %d ('%s') with a similarity of %.2f%%\n\n"%(i, sa[i][:30], similarity*100))
    # md("Closest vector is "+sa[i]+" with similarity "+"%f"%similarity)

The query = What is the first name of the person who bought Twitter?



The closest source is 3 ('Mustafa Kemal Atatürk died 85 ') with a similarity of 77.68%



#### CS3 Ends ####

In [23]:
TOC=["What is an embedding?", "How to measure distance between two embeddings?",  \
     "The optimal size for the text sections"
     ]
MD=mdx(Chapter, SECTION, title="EMBEDDINGS")
MD.toc(TOC,"2023")
#
# 
MD.write('This notebook is about embe\
ddings. I will use OpenAI API to generate embeddin\
gs. To run this notebook you need to have an OpenA\
I account.  Do not try to run this notebook without reading my first notebook, \
[`introduction,ipynb`](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb) in this series.\
The link to that notebook can be found on [27 November 2023 my blog post]\
(https://halimgur.substack.com/p/training-chatbots-to-become-professionals).  \
On that [post](https://halimgur.substack.com/p/training-chatbots-to-become-professionals) \
and in the [notebook](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb) `introduction.ipynb`, I show how to get an OpenAI account \
as well as a few other things.\n\n')
if RECREATE_EMBEDDINGS:
     MD.write("**Important** : EMbeddings were recreated and saved into the `data` folder. \
              Make RECREATE_EMBEDDINGS False if you do not want to recreate the embeddings in future RUN ALLs.  \n\n")
md(MD.out())

# EMBEDDINGS #

#### Table of Contents ####

_2023_

|Section|Title|
|:------|:-------|
|1|<a href="#What-is-an-embedding?">What is an embedding?</a>|
|2|<a href="#How-to-measure-distance-between-two-embeddings?">How to measure distance between two embeddings?</a>|
|3|<a href="#The-optimal-size-for-the-text-sections">The optimal size for the text sections</a>|


This notebook is about embeddings. I will use OpenAI API to generate embeddings. To run this notebook you need to have an OpenAI account.  Do not try to run this notebook without reading my first notebook, [`introduction,ipynb`](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb) in this series.The link to that notebook can be found on [27 November 2023 my blog post](https://halimgur.substack.com/p/training-chatbots-to-become-professionals).  On that [post](https://halimgur.substack.com/p/training-chatbots-to-become-professionals) and in the [notebook](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb) `introduction.ipynb`, I show how to get an OpenAI account as well as a few other things.







In [24]:
SECTION+=1
SECTION=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("An embedding is a vector of numbers that represents a text.  \
         For example, using the method explained further below, I generated the following embedding vector for the text string `%s`:\n\n\
         "%sa[0])
MD.write("\n\n|Dimension|Value|\n|--|--|\n")
combined_list = [i for i in range(0, 3)] + [i for i in range(1533, 1536)]
for i in combined_list:
    MD.write("|%d|%.3f|\n"%(i+1,v[i]))
    if i==2:
        MD.write("|...|...|\n")
MD.write("\n\n")
MD.write("The following image shows how I generate the embedding vector by running my python function `get_embedding()`\
         which you can find in CS3 at the beginning of this \
         notebook.  The _roller_ in the image is that function.  I feed the text string to the OpenAI API and it returns the embedding vector, \
         which is falling down off the roller like a spaghetti string.\n\n\
:::3|halembeds.jpg::\n\n")
MD.write('### Calling OpenAI to generate the embedding vector ###\n\n\
I use the `get_embedding()` function to generate the embedding vector.  The following lines in that function gets \
         the OpenAI API compute the embedding and the function returns it as a `numpy` array (`np` stands for `numpy`):\n\n\
```python\n\
    response = client.embeddings.create(\n\
        input=text,\n\
        model="text-embedding-ada-002"\n\
    )\n\
    return np.array(response.data[0].embedding)\n\
```\n\n')
MD.write("\n\n### The length of the embedding vector ###\n\n\
The length of an embedding vector is fixed for a given model.  For example, the model `text-embedding-ada-002` \
         generates an embedding vector of length 1536.  Whether the input text is only of two words (e.g. 'Hello World!' or whether it is the \
         entire text of one blog post (e.g. https://halimgur.substack.com/p/why-did-elon-musk-buy-twitter), \
         its embedding vector will be %d long when using %s \n\n\
         "%(len(v), LLM))
MD.write("\n\n### Visualisation of the embedding vector ###\n\n")
MD.write("Visualizing the embedding vector for large language models (LLMs) like\
 'text-embedding-ada-002' presents a challenge due to the high dimensionality of\
 the vector space (typically 1536 dimensions for text-embedding-ada-002). People have used \
    dimensionality reduction techniques like PCA and t-SNE to reduce the dimensionality.  I am \
    not going to use those techniques in this notebook.  I am interested in simple things like how the number of \
         negative and positive entries and magnitudes change.  This is not useful knowledge but I was curious. \
         I reshaped the embedding vector to a 32 by 48 matrix and plotted it by coloring the negatives red \
         and positives black. The following are the visualisation of the first three embeddings:\n\n\
:::3|embeddings_np.jpg::\n\n\
The first two are short strings and the last one is the preamble of the Australian Constitution.  \
         The number of positive and negative entries are in the table below:\n\n")
MD.write("\n\n|String|Positive|Negative|\n|--|--|--|\n")
for i in range(3):
    v=embeddings[i]
    A=map_vector_to_array(v, 32,48)
    MD.write("|%s|%d|%d|\n"%(sa[i][:30], np.sum(A > 0), np.sum(A < 0)))
MD.write("\n\n")
MD.write("As you can see, there is not a significant difference between the number of positive and negative entries.  \
         I then plotted the embeddings using the `viridis` colormap, which is a utility offered by \
         the `matplotlib` library. CS2 section above in this notebook has the function. The following are the results:\n\n\
:::3|embeddings_viridis.jpg::\n\n\
         ")
MD.write("\n\n")
MD.write("The color plots suggest that all three embeddings have their minimum member at the top left corner.  \
         The following table shows the index and the value of the minimum and maximum values for all six embeddings:\n\n")
MD.write("\n\n|String|String Length|Min Index|Min Value|Max Index|Max Value|\n|--|--|--|--|--|--|\n")
for i in range(6):
    v=embeddings[i]
    min_value = np.min(v)
    min_index = np.argmin(v)
    max_value = np.max(v)
    max_index = np.argmax(v)
    MD.write("|%s|%d|%d|%.3f|%d|%.3f|\n"%(sa[i][:30], len(sa[i]), min_index, min_value, max_index, max_value))
    # A=map_vector_to_array(v, 32,48)
    # min_index = np.unravel_index(np.argmin(A, axis=None), A.shape)
    # max_index = np.unravel_index(np.argmax(A, axis=None), A.shape)
    # MD.write("|%s|%d|%.3f|%d|%.3f|\n"%(sa[i][:30], min_index[0]*48+min_index[1], A[min_index], max_index[0]*48+max_index[1], A[max_index]))
MD.write("\n\n")
MD.write("This is interesting but not useful.  I am going to move on to more useful things.\n\n")
md(MD.out())

# What is an embedding? #

An embedding is a vector of numbers that represents a text.           For example, using the method explained further below, I generated the following embedding vector for the text string `Hello World!`:

         

|Dimension|Value|
|--|--|
|1|-0.002|
|2|-0.026|
|3|-0.009|
|...|...|
|1534|-0.010|
|1535|0.008|
|1536|-0.005|


The following image shows how I generate the embedding vector by running my python function `get_embedding()`         which you can find in CS3 at the beginning of this          notebook.  The _roller_ in the image is that function.  I feed the text string to the OpenAI API and it returns the embedding vector,          which is falling down off the roller like a spaghetti string.

![alt text](pics/halembeds.jpg 'halembeds.jpg')

<i>Figure 1.1. </i>



### Calling OpenAI to generate the embedding vector ###

I use the `get_embedding()` function to generate the embedding vector.  The following lines in that function gets          the OpenAI API compute the embedding and the function returns it as a `numpy` array (`np` stands for `numpy`):

```python
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return np.array(response.data[0].embedding)
```



### The length of the embedding vector ###

The length of an embedding vector is fixed for a given model.  For example, the model `text-embedding-ada-002`          generates an embedding vector of length 1536.  Whether the input text is only of two words (e.g. 'Hello World!' or whether it is the          entire text of one blog post (e.g. https://halimgur.substack.com/p/why-did-elon-musk-buy-twitter),          its embedding vector will be 1536 long when using text-embedding-ada-002 

         

### Visualisation of the embedding vector ###

Visualizing the embedding vector for large language models (LLMs) like 'text-embedding-ada-002' presents a challenge due to the high dimensionality of the vector space (typically 1536 dimensions for text-embedding-ada-002). People have used     dimensionality reduction techniques like PCA and t-SNE to reduce the dimensionality.  I am     not going to use those techniques in this notebook.  I am interested in simple things like how the number of          negative and positive entries and magnitudes change.  This is not useful knowledge but I was curious.          I reshaped the embedding vector to a 32 by 48 matrix and plotted it by coloring the negatives red          and positives black. The following are the visualisation of the first three embeddings:

![alt text](pics/embeddings_np.jpg 'embeddings_np.jpg')

<i>Figure 1.2. </i>



The first two are short strings and the last one is the preamble of the Australian Constitution.           The number of positive and negative entries are in the table below:



|String|Positive|Negative|
|--|--|--|
|Hello World!|763|773|
|One, two, three, four, five, s|730|806|
|The Australian Constitution ha|777|759|


As you can see, there is not a significant difference between the number of positive and negative entries.           I then plotted the embeddings using the `viridis` colormap, which is a utility offered by          the `matplotlib` library. CS2 section above in this notebook has the function. The following are the results:

![alt text](pics/embeddings_viridis.jpg 'embeddings_viridis.jpg')

<i>Figure 1.3. </i>



         

The color plots suggest that all three embeddings have their minimum member at the top left corner.           The following table shows the index and the value of the minimum and maximum values for all six embeddings:



|String|String Length|Min Index|Min Value|Max Index|Max Value|
|--|--|--|--|--|--|
|Hello World!|12|194|-0.691|954|0.231|
|One, two, three, four, five, s|73|194|-0.692|954|0.222|
|The Australian Constitution ha|2387|194|-0.639|954|0.179|
|Mustafa Kemal Atatürk died 85 |13428|194|-0.651|954|0.221|
|All those workshops, changed l|17325|194|-0.649|954|0.208|
|Pascal, whom I mentioned on my|7522|194|-0.640|954|0.221|


This is interesting but not useful.  I am going to move on to more useful things.







In [25]:
SECTION+=1
SECTION=2
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("The cosine similarity is a measure of similarity between two vectors.  The cosine similarity of two vectors \
         is the cosine of the angle between them.  It is calculated as follows:\n\n\
$$cosine\_similarity = \\frac{A \\cdot B}{||A|| \\cdot ||B||}$$\n\n\
where $A$ and $B$ are the two vectors and $||A||$ is the norm of $A$.\n\n")
MD.write("Those of you who know their linear algebra will know that the dot product of two vectors is the product of their \
         magnitudes and the cosine of the angle between them.  The cosine similarity is the dot product of the two vectors \
         divided by the product of their magnitudes.  The cosine similarity is a number between -1 and 1.  If the cosine \
         similarity is 1, then the two vectors are identical.  If the cosine similarity is -1, then the two vectors are \
         opposite to each other.  If the cosine similarity is 0, then the two vectors are orthogonal to each other.\n\n")
MD.write("The following are the cosine similarities of the first three embeddings with themselves:\n\n")
MD.write("\n\n|String|String Length|Cosine Similarity|\n|--|--|--|\n")
for i in range(6):
    v=embeddings[i]
    similarity = cosine_similarity(v, v)
    MD.write("|%s|%d|%.3f|\n"%(sa[i][:30], len(sa[i]), similarity))
MD.write("\n\n")
MD.write("As you can see, the cosine similarity of the embedding of a string with itself is 1.  \
         The following are the cosine similarities of the first three embeddings with the embedding of the \
         preamble of the Australian Constitution:\n\n")
MD.write("\n\n|String|String Length|Cosine Similarity|\n|--|--|--|\n")
for i in range(6):
    v=embeddings[i]
    similarity = cosine_similarity(v, embeddings[2])
    MD.write("|%s|%d|%.3f|\n"%(sa[i][:30], len(sa[i]), similarity))
query="When did Mustafa Kemal Ataturk die?"
THIS=False # Make this false if you do not want to keep creating the query embedding
if THIS:
    query_embedding=get_embedding(query)
MD.write("\n\nLet us compute the cosine similarity of the query `%s` with our six embeddings:\n\n"%query)
MD.write("\n\n|String|String Length|Cosine Similarity|\n|--|--|--|\n")
for i in range(6):
    v=embeddings[i]
    similarity = cosine_similarity(v, query_embedding)
    MD.write("|%s|%d|%.3f|\n"%(sa[i][:30], len(sa[i]), similarity))
MD.write("\n\nAs you can see it is favouring the fourth string, which is my blog post on Elon Musk.  I started that post \
         with a brief note on November 10th, which is the date Mustafa Kemal Ataturk died. But it is only a small part of \
         the post.  This is probably why the cosine similarity is not very high.\n\n")
MD.write("If you download the notebook, then you can run this cell with different \
         queries to see how the cosine similarity changes.\n\n")
MD.write("\n\n")
md(MD.out())


# How to measure distance between two embeddings? #

The cosine similarity is a measure of similarity between two vectors.  The cosine similarity of two vectors          is the cosine of the angle between them.  It is calculated as follows:

$$cosine\_similarity = \frac{A \cdot B}{||A|| \cdot ||B||}$$

where $A$ and $B$ are the two vectors and $||A||$ is the norm of $A$.

Those of you who know their linear algebra will know that the dot product of two vectors is the product of their          magnitudes and the cosine of the angle between them.  The cosine similarity is the dot product of the two vectors          divided by the product of their magnitudes.  The cosine similarity is a number between -1 and 1.  If the cosine          similarity is 1, then the two vectors are identical.  If the cosine similarity is -1, then the two vectors are          opposite to each other.  If the cosine similarity is 0, then the two vectors are orthogonal to each other.

The following are the cosine similarities of the first three embeddings with themselves:



|String|String Length|Cosine Similarity|
|--|--|--|
|Hello World!|12|1.000|
|One, two, three, four, five, s|73|1.000|
|The Australian Constitution ha|2387|1.000|
|Mustafa Kemal Atatürk died 85 |13428|1.000|
|All those workshops, changed l|17325|1.000|
|Pascal, whom I mentioned on my|7522|1.000|


As you can see, the cosine similarity of the embedding of a string with itself is 1.           The following are the cosine similarities of the first three embeddings with the embedding of the          preamble of the Australian Constitution:



|String|String Length|Cosine Similarity|
|--|--|--|
|Hello World!|12|0.709|
|One, two, three, four, five, s|73|0.706|
|The Australian Constitution ha|2387|1.000|
|Mustafa Kemal Atatürk died 85 |13428|0.730|
|All those workshops, changed l|17325|0.744|
|Pascal, whom I mentioned on my|7522|0.693|


Let us compute the cosine similarity of the query `When did Mustafa Kemal Ataturk die?` with our six embeddings:



|String|String Length|Cosine Similarity|
|--|--|--|
|Hello World!|12|0.709|
|One, two, three, four, five, s|73|0.717|
|The Australian Constitution ha|2387|0.692|
|Mustafa Kemal Atatürk died 85 |13428|0.863|
|All those workshops, changed l|17325|0.769|
|Pascal, whom I mentioned on my|7522|0.739|


As you can see it is favouring the fourth string, which is my blog post on Elon Musk.  I started that post          with a brief note on November 10th, which is the date Mustafa Kemal Ataturk died. But it is only a small part of          the post.  This is probably why the cosine similarity is not very high.

If you download the notebook, then you can run this cell with different          queries to see how the cosine similarity changes.









In [38]:
SECTION+=1
SECTION=3
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("According to Bard **and** ChatGPT, there is not an easy rule to determine the optimal size of the text sections.  \
          ChatGPT says that as a rough guideline, in many NLP applications, text sections ranging from a few sentences to a few \
         paragraphs (roughly 100 to 500 words) are commonly used, but this can vary significantly based on the specific \
         use case and requirements.\n\n\
Let us calculate the number of words in the first three strings:\n\n")
MD.write("\n\n|String|String Length|Number of Words|\n|--|--|--|\n")
for i in range(6):
    MD.write("|%s|%d|%d|\n"%(sa[i][:30], len(sa[i]), len(sa[i].split())))
MD.write("\n\nIt looks like some of my strings are too long.  It is easy to split them to smaller segments but \
         this may cause other problems.  For example, if I do the segmentation automatically, part of \
         an important sentence can be in one segment and the other part in another segment. \
         This may cause the loss of the information included in that segment.  \
         It is possible to have them overlapping each other.  The best option of course is \
         to obey the optimum segment size condition while preparing the corpus.  But this requires \
         knowing what that optimal number is.  This needs further thinking and probably some experimentation.\n\n")
MD.write("I asked Bard for some references in this area.  It gave me the following:\n\n")
MD.write("* `RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (202\
0)`:This paper introduces the RAG model and discusses the potential benefits of using different text \
section sizes for retrieval and generation.While the paper doesn't provide specific optimal sizes, i\
t emphasizes the importance of experimenting with different sizes based on the task and dataset.Link\
: https://arxiv.org/abs/2005.11401\n")
MD.write("* `Exploring the Impact of Text Chunk Size in Dense Retrievers by Chen et al. (2022)`:    This\
 paper investigates the impact of text chunk size on the performance of dense retrieval models, \
which are often used as the retrieval component in RAG models.Their findings suggest that a moderate\
 chunk size (e.g., 512 tokens) can achievea good balance between retrieval accuracy and computationa\
l efficiency.Link: https://arxiv.org/abs/2211.14876\n")
MD.write("* `Adaptive Text Chunk Size for Efficient Dense Retrieval by Wang et al. (2022)`:This paper \
proposes an adaptive text chunking approach that dynamically adjusts the text chunk size based on th\
e document length and complexity.Their results demonstrate that this approach can improve retrieval \
accuracy and efficiency compared to using a fixed chunk size.Link: https://arxiv.org/abs/2205.03284"\
)
MD.write("\n\n")
md(MD.out())


# The optimal size for the text sections #

According to Bard **and** ChatGPT, there is not an easy rule to determine the optimal size of the text sections.            ChatGPT says that as a rough guideline, in many NLP applications, text sections ranging from a few sentences to a few          paragraphs (roughly 100 to 500 words) are commonly used, but this can vary significantly based on the specific          use case and requirements.

Let us calculate the number of words in the first three strings:



|String|String Length|Number of Words|
|--|--|--|
|Hello World!|12|2|
|One, two, three, four, five, s|73|12|
|The Australian Constitution ha|2387|352|
|Mustafa Kemal Atatürk died 85 |13428|2178|
|All those workshops, changed l|17325|2762|
|Pascal, whom I mentioned on my|7522|1350|


It looks like some of my strings are too long.  It is easy to split them to smaller segments but          this may cause other problems.  For example, if I do the segmentation automatically, part of          an important sentence can be in one segment and the other part in another segment.          This may cause the loss of the information included in that segment.           It is possible to have them overlapping each other.  The best option of course is          to obey the optimum segment size condition while preparing the corpus.  But this requires          knowing what that optimal number is.  This needs further thinking and probably some experimentation.

I asked Bard for some references in this area.  It gave me the following:

* `RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020)`:This paper introduces the RAG model and discusses the potential benefits of using different text section sizes for retrieval and generation.While the paper doesn't provide specific optimal sizes, it emphasizes the importance of experimenting with different sizes based on the task and dataset.Link: https://arxiv.org/abs/2005.11401
* `Exploring the Impact of Text Chunk Size in Dense Retrievers by Chen et al. (2022)`:    This paper investigates the impact of text chunk size on the performance of dense retrieval models, which are often used as the retrieval component in RAG models.Their findings suggest that a moderate chunk size (e.g., 512 tokens) can achievea good balance between retrieval accuracy and computational efficiency.Link: https://arxiv.org/abs/2211.14876
* `Adaptive Text Chunk Size for Efficient Dense Retrieval by Wang et al. (2022)`:This paper proposes an adaptive text chunking approach that dynamically adjusts the text chunk size based on the document length and complexity.Their results demonstrate that this approach can improve retrieval accuracy and efficiency compared to using a fixed chunk size.Link: https://arxiv.org/abs/2205.03284





