#### CS1 - Install Libraries, define main variables, and some basic functions ####

In [15]:
# IMPORTANT :
# Before you run this cell, make sure you create a virtual environment.  See below for how.
# DO NOT INSTALL ANYTHING IN THE BASE ENVIRONMENT.
#
# The following are the libraries used in the notebook
%pip install --q melib
%pip install --q pylatex
%pip install --q openpyxl
%pip install --q setuptools
%pip install --q --upgrade openai
%pip install --q nbconvert
import math
import numpy as np
import melib
from melib.xt import mdx
from melib.excel import Xcel
from melib.library import steelprop
# The following are the variables used in the notebook

PIE=math.pi
SECTION=0
Chapter="Embeddings"
#
TextString1="There are four sides to a square."
TextString2="There are three sides to a triangle."
TextString3="There are five sides to a pentagon."
TextString4="In a footbal game, there are eleven players on each team."

# The following are the functions used in the notebook
from IPython.display import display, Markdown
def md(s):
    display(Markdown(s))

# The following will print the version of an installed package
def packageversion(package_name):
    # importlib.metadata is the standard-library replacement for the deprecated pkg_resources
    from importlib.metadata import version
    print(f"The version of {package_name} is: {version(package_name)}")

# Establish OpenAI API key (see below for how to get one)
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")


Note: you may need to restart the kernel to use updated packages.


#### CS1 ends ####

#### CS2 - OpenAI calls ####

In [16]:
from openai import OpenAI
client = OpenAI()

# response = client.embeddings.create(
#     input="Empty",
#     model="text-embedding-ada-002"
# )

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return np.array(response.data[0].embedding)

def get_similarity(text1, text2):
    # The embeddings endpoint only offers `create`, so the cosine similarity is computed locally.
    v1, v2 = get_embedding(text1), get_embedding(text2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def get_distance(text1, text2):
    # Euclidean distance between the two embedding vectors, also computed locally.
    return float(np.linalg.norm(get_embedding(text1) - get_embedding(text2)))

def get_embeddings(texts):
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002"
    )
    # One row per input text, in the order the texts were given.
    return np.array([item.embedding for item in response.data])



#### CS2 Ends

#### CS3 - Embedding vector computations.  The results are referred to in the text.

In [17]:
Emvectors=np.array([get_embedding(TextString1),get_embedding(TextString2),get_embedding(TextString3),get_embedding(TextString4)])
Magnitudes=np.linalg.norm(Emvectors,axis=1)
print(Magnitudes)

[0.99999998 1.00000003 0.99999998 1.00000004]


#### CS3 Ends ####
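
The magnitudes printed above are all 1 to within rounding, which suggests that the model returns unit-length vectors. The following is a minimal sketch of why that is convenient. It uses small made-up vectors rather than real embeddings, and the names `a`, `b` and `cosine_similarity` are illustrative only and not part of the notebook. For unit-length vectors, the cosine similarity is simply the dot product.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors standing in for real embeddings
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Scale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Both magnitudes are now 1 (to within rounding)
print(np.linalg.norm(a_unit), np.linalg.norm(b_unit))

# The general formula and the plain dot product of the unit vectors agree
print(cosine_similarity(a, b), float(np.dot(a_unit, b_unit)))
```
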

In [18]:
TOC=["Introduction", "Please Join Me", "Install Python", "Set up Virtual Environment", "Establish OpenAI Credentials", "What is an embedding?", "How to create an embedding vector"]
MD=mdx(Chapter, SECTION, title="TEXT EMBEDDINGS")
MD.toc(TOC,"2023")
#
# 
MD.write("Text Embeddings used in Large Language Models (LLMs)\n\n")
md(MD.out())

# TEXT EMBEDDINGS #

#### Table of Contents ####

_2023_

|Section|Title|
|:------|:-------|
|1|<a href="#Introduction">Introduction</a>|
|2|<a href="#Please-Join-Me">Please Join Me</a>|
|3|<a href="#Install-Python">Install Python</a>|
|4|<a href="#Set-up-Virtual-Environment">Set up Virtual Environment</a>|
|5|<a href="#Establish-OpenAI-Credentials">Establish OpenAI Credentials</a>|
|6|<a href="#What-is-an-embedding?">What is an embedding?</a>|
|7|<a href="#How-to-create-an-embedding-vector">How to create an embedding vector</a>|


Text Embeddings used in Large Language Models (LLMs)







In [19]:
SECTION+=1
SECTION=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
#
MD.write("In March 2023, [I wrote in my blog](https://halimgur.substack.com/p/competent-intelligence-is-here-will) that the state-of-the-art (SOA) LLMs were like high-school graduates:\n\n\
* They know how to read and write\n\
* They think they know everything\n\n")
MD.write("When you ask them a question, you always get an answer because, if they do not know the answer, they make it up.\n\n")
MD.write("In March 2023, I said that LLMs were not yet competent enough to deliver professional functions such as engineering. \
A competent intelligence would be equivalent to a college graduate.  GPT-4 or even a future GPT-5 would have to be further trained to get there.  \
Today, there are two paths for a general LLM, i.e. an LLM straight out of high school, to get 'higher education':\n\n\
* Fine Tuning\n\
* Retrieval Augmented Generation (RAG)\n\n:::3|LLMStudent.jpg::\n\n\
I am a retired university teacher.  One might argue that it was natural for me to become interested in providing 'university' training to the LLMs.  I mentioned \
this interest in March 2023 but was not sure yet how to go about it.  I have been watching the progress in the field since then.  A number of tools have been proposed \
but were lacking in one way or another. I did not want to invest my time in a tool \
that would not be around long.  The situation has now changed. \
I am happy to say that the OpenAI offerings of early November provide a path for people like me to develop tools to \
train LLMs to the competence level of a college graduate. They are not perfect but good enough to provide a starting point.\n\n\
I am planning to use Retrieval Augmented Generation (RAG) to do this. Before I justify this choice, I should give \
a very brief description of the two methods mentioned above.\n\n\
## Fine Tuning ##\n\n\
This method is a repetition of the initial training of the LLM.  Remember that the model is already trained on \
a large corpus of text.  The fine-tuning is done on a smaller corpus of text that is specific to the task at hand. \
In human education terms, this is similar to recitation learning like in Islamic madrasas.  The madrasa students keep \
reciting religious texts without necessarily understanding context.  Similarly, in fine-tuning, the LLM is \
given a body of text in a specific domain and is trained to tease out the probabilistic relations connecting \
different words to each other in this specific domain, which may be slightly different from the original relations \
developed using an entire corpus of general internet and other sources. As in its original training, there are \
no contextual knowledge relations here, just probabilities.\n\n\
Either the entire parameter set (weights and biases) of the LLM is fine-tuned or only the last layer is fine-tuned. \
The fine-tuning starts with unsupervised learning, which is usually reinforced by human feedback.\n\n\
I decided not to use fine-tuning because at the end the answer will still be probabilistic and not contextual. \
A probabilistic answer may be adequate in non-numerical fields such as law but it is totally unacceptable \
in, say, engineering, where categorical answers are needed and, if a numerical response is asked for, \
it should be accurate with reliabilities exceeding 99%.\n\n")
MD.write("In a way, it is good that there are good reasons for me not to pick fine-tuning as my method \
of choice because I do not have access to the computing power needed to do fine-tuning. It is possible \
to use the OpenAI API but this would be expensive.  I also think some heuristic methods need to be used in addition to \
running fine-tuning through the API and I do not think this would be possible when using OpenAI as a black box.\n\n")
MD.write("## Retrieval Augmented Generation (RAG) ##\n\n")
MD.write("If fine-tuning is like recitation learning, RAG is like contextual learning.  In RAG, there is \
a retriever program between the user and the LLM.  When the user asks a question, the retriever program \
retrieves the most relevant text from a corpus of text and feeds it to the LLM.  The LLM then generates \
a response.  The response is then fed back to the retriever program to determine if the response is \
relevant to the question.  If it is not, the retriever program retrieves another text from the corpus \
and the process is repeated until a relevant response is obtained.\n\n\
Most of the current interest in RAG is in building company chatbots where the corpus of text is the \
company's knowledge base.  The retriever program is usually a search engine. The corpus of text \
precedes RAG and is developed independent of the RAG effort.\n\n\
This is not my interest.  I am interested in developing a corpus of text that is specific to an \
area in which I have knowledge.  In other words, I approach RAG like writing a textbook \
for a course.  The difference is that writing a textbook for an LLM is probably different from writing one for human students.\n\n\
Let me give an example.  Suppose that I am an expert on Australian wildlife and I want to \
develop a chatbot that people can ask questions about Australian wildlife. I will first organise my knowledge \
in a series of text files:\n\n:::3|corpus.jpg::\n\n\
Then I will develop a retriever program that will retrieve the most relevant text file from the corpus \
and feed it to the LLM.  The LLM will then generate a response.  The response is then fed back to the \
retriever program to determine if the response is relevant to the question.  If it is not, the retriever \
program retrieves another text file from the corpus and the process is repeated until a relevant response \
is obtained.\n\n:::3|ragprocess.jpg::\n\n")
MD.write("### How does Retriever Program work? ###\n\n")
MD.write("We cannot rely on a simple text matching search because the user may not use the same words as in the corpus. \
Therefore, we need to use a semantic search.  To do semantic search, both the corpus text and the \
user's question must be converted to vectors.  The vectors are then compared to each other.  These vectors are \
called embeddings.  The embeddings are generated by the LLM.  The LLM is trained to generate embeddings \
that are similar to each other if the texts are similar in meaning.\n\n\
In summary, the starting point is the embeddings.  We have to understand how embeddings work so that we can \
try different styles in writing our corpus text that will generate the most relevant embeddings.\n\n")
MD.write("\n\n")
#
md(MD.out())

# Introduction #

In March 2023, [I wrote in my blog](https://halimgur.substack.com/p/competent-intelligence-is-here-will) that the state-of-the-art (SOA) LLMs were like high-school graduates:

* They know how to read and write
* They think they know everything

When you ask them a question, you always get an answer because, if they do not know the answer, they make it up.

In March 2023, I said that LLMs were not yet competent enough to deliver professional functions such as engineering. A competent intelligence would be equivalent to a college graduate.  GPT-4 or even a future GPT-5 would have to be further trained to get there.  Today, there are two paths for a general LLM, i.e. an LLM straight out of high school, to get 'higher education':

* Fine Tuning
* Retrieval Augmented Generation (RAG)

![alt text](pics/LLMStudent.jpg 'LLMStudent.jpg')

<i>Figure 1.1. </i>



I am a retired university teacher.  One might argue that it was natural for me to become interested in providing 'university' training to the LLMs.  I mentioned this interest in March 2023 but was not sure yet how to go about it.  I have been watching the progress in the field since then.  A number of tools have been proposed but were lacking in one way or another. I did not want to invest my time in a tool that would not be around long.  The situation has now changed. I am happy to say that the OpenAI offerings of early November provide a path for people like me to develop tools to train LLMs to the competence level of a college graduate. They are not perfect but good enough to provide a starting point.

I am planning to use Retrieval Augmented Generation (RAG) to do this. Before I justify this choice, I should give a very brief description of the two methods mentioned above.

## Fine Tuning ##

This method is a repetition of the initial training of the LLM.  Remember that the model is already trained on a large corpus of text.  The fine-tuning is done on a smaller corpus of text that is specific to the task at hand. In human education terms, this is similar to recitation learning like in Islamic madrasas.  The madrasa students keep reciting religious texts without necessarily understanding context.  Similarly, in fine-tuning, the LLM is given a body of text in a specific domain and is trained to tease out the probabilistic relations connecting different words to each other in this specific domain, which may be slightly different from the original relations developed using an entire corpus of general internet and other sources. As in its original training, there are no contextual knowledge relations here, just probabilities.

Either the entire parameter set (weights and biases) of the LLM is fine-tuned or only the last layer is fine-tuned. The fine-tuning starts with unsupervised learning, which is usually reinforced by human feedback.

I decided not to use fine-tuning because at the end the answer will still be probabilistic and not contextual. A probabilistic answer may be adequate in non-numerical fields such as law but it is totally unacceptable in, say, engineering, where categorical answers are needed and, if a numerical response is asked for, it should be accurate with reliabilities exceeding 99%.

In a way, it is good that there are good reasons for me not to pick fine-tuning as my method of choice because I do not have access to the computing power needed to do fine-tuning. It is possible to use the OpenAI API but this would be expensive.  I also think some heuristic methods need to be used in addition to running fine-tuning through the API and I do not think this would be possible when using OpenAI as a black box.

## Retrieval Augmented Generation (RAG) ##

If fine-tuning is like recitation learning, RAG is like contextual learning.  In RAG, there is a retriever program between the user and the LLM.  When the user asks a question, the retriever program retrieves the most relevant text from a corpus of text and feeds it to the LLM.  The LLM then generates a response.  The response is then fed back to the retriever program to determine if the response is relevant to the question.  If it is not, the retriever program retrieves another text from the corpus and the process is repeated until a relevant response is obtained.

Most of the current interest in RAG is in building company chatbots where the corpus of text is the company's knowledge base.  The retriever program is usually a search engine. The corpus of text precedes RAG and is developed independent of the RAG effort.

This is not my interest.  I am interested in developing a corpus of text that is specific to an area in which I have knowledge.  In other words, I approach RAG like writing a textbook for a course.  The difference is that writing a textbook for an LLM is probably different from writing one for human students.

Let me give an example.  Suppose that I am an expert on Australian wildlife and I want to develop a chatbot that people can ask questions about Australian wildlife. I will first organise my knowledge in a series of text files:

![alt text](pics/corpus.jpg 'corpus.jpg')

<i>Figure 1.2. </i>



Then I will develop a retriever program that will retrieve the most relevant text file from the corpus and feed it to the LLM.  The LLM will then generate a response.  The response is then fed back to the retriever program to determine if the response is relevant to the question.  If it is not, the retriever program retrieves another text file from the corpus and the process is repeated until a relevant response is obtained.

![alt text](pics/ragprocess.jpg 'ragprocess.jpg')

<i>Figure 1.3. </i>



### How does Retriever Program work? ###

We cannot rely on a simple text matching search because the user may not use the same words as in the corpus. Therefore, we need to use a semantic search.  To do semantic search, both the corpus text and the user's question must be converted to vectors.  The vectors are then compared to each other.  These vectors are called embeddings.  The embeddings are generated by the LLM.  The LLM is trained to generate embeddings that are similar to each other if the texts are similar in meaning.

In summary, the starting point is the embeddings.  We have to understand how embeddings work so that we can try different styles in writing our corpus text that will generate the most relevant embeddings.
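
The comparison step of a semantic search can be sketched in a few lines. This is only an illustration under stated assumptions: the three-dimensional vectors are made up (real embeddings have many more dimensions), the file names are hypothetical, and `top_k` is a helper name I invented for this sketch. It ranks the corpus vectors by their cosine similarity to the question vector and returns the best matches.

```python
import numpy as np

def top_k(query_vec, corpus_vecs, k=2):
    """Return the indices and scores of the k corpus vectors most similar to the query."""
    # Normalise so that the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Sort from most similar to least similar and keep the first k
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()

# Made-up "embeddings" for three hypothetical corpus files
corpus_files = ["kangaroo.txt", "koala.txt", "platypus.txt"]
corpus_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
query_vec = np.array([0.8, 0.3, 0.0])  # stands in for the embedded user question

indices, scores = top_k(query_vec, corpus_vecs, k=2)
for i, s in zip(indices, scores):
    print(corpus_files[i], round(s, 3))
```

In a real retriever, the made-up vectors would be replaced by embeddings obtained from the model, and the selected text would then be passed to the LLM together with the question.
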









In [25]:
SECTION+=1
SECTION=2
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("As you can see I am not trying to develop new programs to do RAG.  I will use existing programs but I will \
experiment with different styles of writing the corpus text to see which style generates the most relevant embeddings. \
I will also experiment with different ways of retrieving the most relevant text from the corpus.  I suspect there will be \
room for some heuristics here.\n\n")
MD.write("If you have an area of expertise which you would like to share with the world using this \
new technology, please join me.  I will be regularly (hopefully fortnightly) posting my progress \
on Substack.  I will also be posting my code on Github.  I will be using Python and the OpenAI API. In addition, I created \
an X (formerly known as Twitter) Community Group '[Building AI Tutors](https://twitter.com/i/communities/1727552258454474973)'. If you \
are interested, please join the group.  I will be posting my progress there as well.\n\n")
MD.write("You do not need to be an experienced Python programmer but some knowledge of Python will be helpful.  Below I list the steps \
you need to take to join me and explain how to go about it:\n\n\
* Install VS Code from [https://code.visualstudio.com/download](https://code.visualstudio.com/download).  \
This is a free code editor and development environment.  It is the tool I am using, so I should be \
able to help you if you have problems.\n\
* Install Python from [https://www.python.org/downloads/](https://www.python.org/downloads/).  \
At the time I started this notebook,\n\
  * I had `python 3.10.11` installed on my Windows computer.\n\
  * Python 3.11.1 on my Mac\n\n")
MD.write("\n\n")
#
md(MD.out())

# Please Join Me #

As you can see I am not trying to develop new programs to do RAG.  I will use existing programs but I will experiment with different styles of writing the corpus text to see which style generates the most relevant embeddings. I will also experiment with different ways of retrieving the most relevant text from the corpus.  I suspect there will be room for some heuristics here.

If you have an area of expertise which you would like to share with the world using this new technology, please join me.  I will be regularly (hopefully fortnightly) posting my progress on Substack.  I will also be posting my code on Github.  I will be using Python and the OpenAI API. In addition, I created an X (formerly known as Twitter) Community Group '[Building AI Tutors](https://twitter.com/i/communities/1727552258454474973)'. If you are interested, please join the group.  I will be posting my progress there as well.

You do not need to be an experienced Python programmer but some knowledge of Python will be helpful.  Below I list the steps you need to take to join me and explain how to go about it:

* Install VS Code from [https://code.visualstudio.com/download](https://code.visualstudio.com/download).  This is a free code editor and development environment.  It is the tool I am using, so I should be able to help you if you have problems.
* Install Python from [https://www.python.org/downloads/](https://www.python.org/downloads/).  At the time I started this notebook,
  * I had `python 3.10.11` installed on my Windows computer.
  * Python 3.11.1 on my Mac









In [21]:
SECTION+=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
#
MD.write("Download and install Python from [https://www.python.org/downloads/](https://www.python.org/downloads/) if you do not have it already.  At the time I started this notebook,\n\n\
* I had `python 3.10.11` installed on my Windows computer.\n\
* Python 3.11.1 on my Mac")
MD.write("\n\n")
#
md(MD.out())

# Install Python #

Download and install Python from [https://www.python.org/downloads/](https://www.python.org/downloads/) if you do not have it already.  At the time I started this notebook,

* I had `python 3.10.11` installed on my Windows computer.
* Python 3.11.1 on my Mac







In [22]:
SECTION+=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
#
MD.write("It is good practice to have a virtual environment dedicated to the project. This is to ensure that the project \
has all the required libraries and versions. The virtual environment must be created \
as the first action after creating the file before installing any libraries.\n\n")
MD.write("I am running this Jupyter notebook in VS Code.  If you are using another editor, \
your method will be different.  In VS Code, I create the virtual environment using `CTRL`+`SHIFT`+`P` $\\rightarrow$  \
`Python: Create Environment`$\\rightarrow$ Pick the `venv` option.\n\n\
This creates a `.venv` folder in the project folder.  The `.venv` folder is where the virtual environment is. \
Here are the contents of the `pyvenv.cfg` generated under `.venv` :\n\n\
\n\n\
### WINDOWS ###\n\n\
home = `C:\\Users\\e4hgurge\\AppData\\Local\\Microsoft\\WindowsApps\\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0`\n\n\
include-system-site-packages = false\n\n\
version = 3.10.11\n\n\
\n\n\
### MAC ###\n\n\
home = /Library/Frameworks/Python.framework/Versions/3.11/bin\n\n\
include-system-site-packages = false\n\n\
version = 3.11.1\n\n\
executable = /Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11\n\n\
command = /usr/local/bin/python3 -m venv /Users/Halim/openAI/openAI_first/.venv\n\n\
**IMPORTANT**\n\n\
Every time you start VS Code, you must make sure you are running in the virtual environment. The environment can be seen \
in the upper right corner of the VS Code window.  If you are not running in the virtual environment, click on the \
Python version displayed in the upper right corner \
and select the virtual environment.\n\n\
Sometimes, VS Code tells me that running under `venv` requires the `ipykernel` package. It will ask me whether to install it.  I always accept the offer and click on `Install`.\n\n\
")

md(MD.out())

# Set up Virtual Environment #

It is good practice to have a virtual environment dedicated to the project. This is to ensure that the project has all the required libraries and versions. The virtual environment must be created as the first action after creating the file before installing any libraries.

I am running this Jupyter notebook in VS Code.  If you are using another editor, your method will be different.  In VS Code, I create the virtual environment using `CTRL`+`SHIFT`+`P` $\rightarrow$  `Python: Create Environment`$\rightarrow$ Pick the `venv` option.

This creates a `.venv` folder in the project folder.  The `.venv` folder is where the virtual environment is. Here are the contents of the `pyvenv.cfg` generated under `.venv` :



### WINDOWS ###

home = `C:\Users\e4hgurge\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0`

include-system-site-packages = false

version = 3.10.11



### MAC ###

home = /Library/Frameworks/Python.framework/Versions/3.11/bin

include-system-site-packages = false

version = 3.11.1

executable = /Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11

command = /usr/local/bin/python3 -m venv /Users/Halim/openAI/openAI_first/.venv

**IMPORTANT**

Every time you start VS Code, you must make sure you are running in the virtual environment. The environment can be seen in the upper right corner of the VS Code window.  If you are not running in the virtual environment, click on the Python version displayed in the upper right corner and select the virtual environment.

Sometimes, VS Code tells me that running under `venv` requires the `ipykernel` package. It will ask me whether to install it.  I always accept the offer and click on `Install`.
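
If you are unsure whether a notebook cell is really running inside the virtual environment, a quick check from Python can help. The following is a minimal sketch, and the function name `in_virtual_environment` is my own choice for illustration. It relies on the fact that, inside an environment created with `venv`, `sys.prefix` points at the environment while `sys.base_prefix` points at the base installation. It does not detect other kinds of environments such as conda.

```python
import sys

def in_virtual_environment():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # In a venv, sys.prefix is the environment folder and sys.base_prefix is the base install
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

The interpreter path printed by the first line should point into the `.venv` folder of the project.
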







In [23]:
SECTION+=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
#
MD.write("You need an OpenAI account to make Python calls to the OpenAI platform (the OpenAI API).  This means OpenAI will charge you for the usage.  \
I have an OpenAI account where I set the monthly limit to 120 dollars (Australian).  The monthly limit is necessary.  Otherwise, a coding error could, for example, \
make your program call OpenAI 100 times a second and run up a huge bill.  So far, my monthly bill has not reached $1.  \
So for the stuff we are doing here, the charge is trivial.\n\n\
You must have your own account to run this notebook.\n\n\
")
MD.write("To get an openai account, go to [https://beta.openai.com/](https://beta.openai.com/) and follow the instructions.  \
Once you have an account, you need to get an API key.  To get the API key, click on your name on the upper right corner of the screen.  \
Then click on `My Account`.  Then click on `API Keys` on the left side of the screen.  Then click on `Create API Key` on the right side of the screen.  \
Then copy the API key and paste it as an environment variable into your configuration file as described below.\n\n")
#
MD.write("**_Setting OpenAI credentials on a Windows computer_**\n\n\
Define the environment variable `OPENAI_API_KEY` by using the `setx` command in Windows:\n\n\
```\n\
setx OPENAI_API_KEY \"my-api-key-here\"\n\
```\n\n\
Use `setx`, not `set`, so that `OPENAI_API_KEY` is stored persistently and is available in every new session.  With `set`, it would be available only in the current session.\n\n\
\n\n\
To check if it is set correctly:\n\n\
* Close the terminal window\n\
* Open a new terminal window\n\
* Enter the command: `echo %OPENAI_API_KEY%`.  This should display your OpenAI API key.  If it is not set, it will simply echo the string `%OPENAI_API_KEY%`\n\n\
**_On `macOS`_**\n\n\
* Open Terminal (use Spotlight, search for `Terminal.app`)\n\
* Edit the Z shell profile using `nano ~/.zshrc`\n\
  * `nano` is a command-line text editor.  It is a simple interface for editing text files in the terminal\n\
  * `~` is the shortcut for the user's home directory\n\
  * `.zshrc` is the configuration file for the Z shell, which is the default shell in macOS starting with Catalina (macOS 10.15); my Mac runs Sonoma 14.0\n\
* Add the line `export OPENAI_API_KEY='my-api-key-here'`\n\
* Use CTRL+O to save changes and CTRL+X to exit `nano`\n\
* Load the profile using `source ~/.zshrc`\n\
* Verify the set-up by typing `echo $OPENAI_API_KEY`\n\n\
")
MD.write("\n\n")
#
md(MD.out())

# Establish OpenAI Credentials #

You need an OpenAI account to make Python calls to the OpenAI platform (the OpenAI API).  This means OpenAI will charge you for the usage.  I have an OpenAI account where I set the monthly limit to 120 dollars (Australian).  The monthly limit is necessary.  Otherwise, a coding error could, for example, make your program call OpenAI 100 times a second and run up a huge bill.  So far, my monthly bill has not reached $1.  So for the stuff we are doing here, the charge is trivial.

You must have your own account to run this notebook.

To get an openai account, go to [https://beta.openai.com/](https://beta.openai.com/) and follow the instructions.  Once you have an account, you need to get an API key.  To get the API key, click on your name on the upper right corner of the screen.  Then click on `My Account`.  Then click on `API Keys` on the left side of the screen.  Then click on `Create API Key` on the right side of the screen.  Then copy the API key and paste it as an environment variable into your configuration file as described below.

**_Setting OpenAI credentials on a Windows computer_**

Define the environment variable `OPENAI_API_KEY` by using the `setx` command in Windows:

```
setx OPENAI_API_KEY "my-api-key-here"
```

Use `setx`, not `set`, so that `OPENAI_API_KEY` is stored persistently and is available in every new session.  With `set`, it would be available only in the current session.



To check if it is set correctly:

* Close the terminal window
* Open a new terminal window
* Enter the command: `echo %OPENAI_API_KEY%`.  This should display your OpenAI API key.  If it is not set, it will simply echo the string `%OPENAI_API_KEY%`

**_On `macOS`_**

* Open Terminal (use Spotlight, search for `Terminal.app`)
* Edit the Z shell profile using `nano ~/.zshrc`
  * `nano` is a command-line text editor.  It is a simple interface for editing text files in the terminal
  * `~` is the shortcut for the user's home directory
  * `.zshrc` is the configuration file for the Z shell, which is the default shell in macOS starting with Catalina (macOS 10.15); my Mac runs Sonoma 14.0
* Add the line `export OPENAI_API_KEY='my-api-key-here'`
* Use CTRL+O to save changes and CTRL+X to exit `nano`
* Load the profile using `source ~/.zshrc`
* Verify the set-up by typing `echo $OPENAI_API_KEY`
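
The same check can be made from inside the notebook, which is useful because VS Code only sees environment variables that existed when it was started. The following is a minimal sketch. The helper name `api_key_status` is hypothetical, and it deliberately reports only the length of the key so that the key itself never appears in the notebook output.

```python
import os

def api_key_status(env=os.environ, name="OPENAI_API_KEY"):
    """Describe whether the key is visible, without printing the key itself."""
    key = env.get(name, "")
    if not key:
        return f"{name} is not set in this session"
    return f"{name} is set ({len(key)} characters)"

print(api_key_status())
```

If the message says the variable is not set, restart VS Code after defining the variable and run the cell again.
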









In [24]:
SECTION+=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
#
MD.write("In Large Language Models (LLMs) an embedding is a list of numbers, which are between -1 and +1.  This list is referred to as the embedding vector. \
The embedding vector represents the features of the text string.  The embedding vector is created by a neural network.  The neural network is trained on a large \
corpus of text.\n\n")
#
MD.write("You can think of the embedding vector as a point in a high dimensional space.  The number of dimensions is the number of numbers in the embedding vector.\n\n")
MD.write("Let me use an example from everyday life to explain the concept of embedding.  Suppose you are a real estate agent.  You have a list of houses for sale.  \
Each house has a number of features such as number of bedrooms, number of bathrooms, size of the land, size of the house, etc.  You can think of each house as a point \
in a high dimensional space.  The number of dimensions is the number of features, and the numbers that describe the house are the \
coordinates of that point.  \
Taken together, these coordinates are the embedding vector of the house.\n\n")
MD.write("### Numerical Example ###\n\n")
MD.write("In another example, assume you are the CEO of a very large company with hundreds of branches around the country.  Your HR Department wants to give an award \
to the best branch in terms of the human resources.  \n\nHow do they do that?  They ask each branch to nominate four of their employees, \
who will write an essay in which \
they describe the mission of the company and their place in it, and who will also be interviewed by phone.\n\n\
The employees are assessed on a number of criteria including their education, customer references, the quality of their essay, and communication skills.\n\n\
From the Cairns branch, for example, the nominees are John, Pascal, Emin, and Rachel.  The HR Department will create an embedding vector for each nominee.  \n\n\
||John|Pascal|Emin|Rachel|Cairns|\n\
|--|--|--|--|--|--|\n\
|Highest Degree   |2|2|3|2|2.25|\n\
|Customer Reference quality|5|7|9|6|6.75|\n\
|Writing skills|6|7|9|7|7.25|\n\
|Understanding of the requirements|4|5|3|6|4.50|\n\
|Verbal communications|6|6|5|8|6.25|\n\n\
")
MD.write("We can refer to this table as our embeddings in this instance.  Each column is an embedding vector:\n\n\
* $\\overrightarrow{\\text{John}}=(2,5,6,4,6)$;\n\
* $\\overrightarrow{\\text{Pascal}}=(2,7,7,5,6)$;\n\
* $\\overrightarrow{\\text{Emin}}=(3,9,9,3,5)$;\n\
* $\\overrightarrow{\\text{Rachel}}=(2,6,7,6,8)$.\n\n\
The last column is the average for the Cairns branch.  We can refer to it as the embedding vector for Cairns:\n\n\
* $\\overrightarrow{\\text{Cairns}}=(2.25,6.75,7.25,4.50,6.25)$.")
MD.write("\n\n**Some observations on Embedding vectors:**\n\n\
* For a given embedding scheme, the number of dimensions is fixed.  In the example above, the number of dimensions is 5.\n\
* The embedding vector is a list of numbers.  In the example above, the embedding vector is a list of 5 numbers.\n\
* The embedding vector for one text string is different from the embedding vector for another text string.  In the example above, the embedding vector for John is different from the embedding vector for Pascal.\n\
* The embedding vector for a text string is the same every time it is created.  In the example above, the embedding vector for John is the same if another \
HR officer calculates it.  It will remain the same unless the rules (the model) change.\n\
* The embedding vector for one string and the embedding vector for an ensemble of strings have the same number of dimensions.\n\
* In the above example, the magnitude of the embedding vectors is not important.  What is important is the relative magnitude of the numbers in the embedding vector.\n\
* In the above example, the embedding vectors are not normalised.  Normalisation is not necessary for the embedding vectors to be useful.  However, normalisation \
makes the embedding vectors directly comparable: when every vector has unit length, the dot product of two vectors equals their cosine similarity.  In the above example, the embedding vectors would have to be normalised before being compared in this way.\n\
* The text embedding vectors used in this notebook are directly comparable because they are normalised to unit length.  The normalisation is done by the model that creates the embedding vectors.\n\n")
MD.write("### Embedding vectors in LLMs ###\n\n")
MD.write("As an example, I computed the embedding vectors and the magnitudes of those vectors for the following text strings:\n\n")
MD.write("|String|Embedding Vector Magnitude|Number of Dimensions|\n\
|--|--|---|\n")
for i,s in enumerate([TextString1, TextString2, TextString3, TextString4]):
    MD.write("|%s|%.5f|%d|\n"%(s, Magnitudes[i], len(Emvectors[i])))
MD.write("\n\n")
#
md(MD.out())

# What is an embedding? #

In Large Language Models (LLMs), an embedding is a list of numbers, each between -1 and +1.  This list is referred to as the embedding vector.  The embedding vector represents the features of the text string.  It is created by a neural network that has been trained on a large corpus of text.

You can think of the embedding vector as a point in a high dimensional space.  The number of dimensions is the number of numbers in the embedding vector.

Let me use an example from everyday life to explain the concept of embedding.  Suppose you are a real estate agent with a list of houses for sale.  Each house has a number of features, such as the number of bedrooms, the number of bathrooms, the size of the land, and the size of the house.  You can think of each house as a point in a high dimensional space, where the number of dimensions is the number of features.  The feature values are the coordinates of that point, and the list of coordinates is the embedding vector of the house.

### Numerical Example ###

In another example, assume you are the CEO of a very large company with hundreds of branches around the country.  Your HR Department wants to give an award to the best branch in terms of human resources.

How do they do that?  They ask each branch to nominate four of its employees, who each write an essay describing the mission of the company and their place in it, and who are also interviewed by phone.

The employees are assessed on a number of criteria, including their education, customer references, the quality of their essay, and their communication skills.

From the Cairns branch, for example, the nominees are John, Pascal, Emin, and Rachel.  The HR Department will create an embedding vector for each nominee.  

||John|Pascal|Emin|Rachel|Cairns|
|--|--|--|--|--|--|
|Highest Degree   |2|2|3|2|2.25|
|Customer Reference quality|5|7|9|6|6.75|
|Writing skills|6|7|9|7|7.25|
|Understanding of the requirements|4|5|3|6|4.50|
|Verbal communications|6|6|5|8|6.25|

We can refer to this table as our embeddings in this instance.  Each column is an embedding vector:

* $\overrightarrow{\text{John}}=(2,5,6,4,6)$;
* $\overrightarrow{\text{Pascal}}=(2,7,7,5,6)$;
* $\overrightarrow{\text{Emin}}=(3,9,9,3,5)$;
* $\overrightarrow{\text{Rachel}}=(2,6,7,6,8)$.

The last column is the average for the Cairns branch.  We can refer to it as the embedding vector for Cairns:

* $\overrightarrow{\text{Cairns}}=(2.25,6.75,7.25,4.50,6.25)$.
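
As a minimal sketch of the averaging step, the Cairns vector can be reproduced with NumPy by stacking the four nominee vectors and taking the mean over the nominees.  The variable names (`nominees`, `scores`, `cairns`) are illustrative only and do not come from the notebook.

```python
import numpy as np

# Scores from the table above, in row order: highest degree, customer reference
# quality, writing skills, understanding of the requirements, verbal communications.
nominees = {
    "John":   [2, 5, 6, 4, 6],
    "Pascal": [2, 7, 7, 5, 6],
    "Emin":   [3, 9, 9, 3, 5],
    "Rachel": [2, 6, 7, 6, 8],
}

# One row per nominee, one column per criterion.
scores = np.array(list(nominees.values()), dtype=float)

# Averaging over the nominees (axis 0) gives one number per criterion,
# so the branch vector has the same five dimensions as each nominee vector.
cairns = scores.mean(axis=0)
print(cairns)
```

The result has the same number of dimensions as each nominee vector, which is why an individual and a whole branch can be described in the same space.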

**Some observations on Embedding vectors:**

* For a given embedding scheme, the number of dimensions is fixed.  In the example above, the number of dimensions is 5.
* The embedding vector is a list of numbers.  In the example above, the embedding vector is a list of 5 numbers.
* The embedding vector for one text string is different from the embedding vector for another text string.  In the example above, the embedding vector for John is different from the embedding vector for Pascal.
* The embedding vector for a text string is the same every time it is created.  In the example above, the embedding vector for John is the same if another HR officer calculates it.  It will remain the same unless the rules (the model) change.
* The embedding vector for one string and the embedding vector for an ensemble of strings have the same number of dimensions.
* In the above example, the magnitude of the embedding vectors is not important.  What is important is the relative magnitude of the numbers in the embedding vector.
* In the above example, the embedding vectors are not normalised.  Normalisation is not necessary for the embedding vectors to be useful.  However, normalisation makes the embedding vectors directly comparable: when every vector has unit length, the dot product of two vectors equals their cosine similarity.  In the above example, the embedding vectors would have to be normalised before being compared in this way.
* The text embedding vectors used in this notebook are directly comparable because they are normalised to unit length.  The normalisation is done by the model that creates the embedding vectors.
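
To make the point about normalisation concrete, here is a minimal sketch that scales two of the HR vectors to unit length and compares them.  The helper names `normalise` and `cosine_similarity` are my own illustrative choices; they are not part of the notebook or of any library.

```python
import numpy as np

def normalise(v):
    """Scale a vector to unit length (magnitude 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; independent of their magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

john = [2, 5, 6, 4, 6]
pascal = [2, 7, 7, 5, 6]

john_unit = normalise(john)
pascal_unit = normalise(pascal)

print(np.linalg.norm(john))       # magnitude of the raw vector, sqrt(117)
print(np.linalg.norm(john_unit))  # magnitude after normalisation, 1 up to rounding

# For unit vectors the plain dot product and the cosine similarity agree.
print(np.dot(john_unit, pascal_unit))
print(cosine_similarity(john, pascal))
```

The last two printed values agree, which is the practical benefit of working with unit-length vectors: a dot product is enough to compare them.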

### Embedding vectors in LLMs ###

As an example, I computed the embedding vectors and the magnitudes of those vectors for the following text strings:

|String|Embedding Vector Magnitude|Number of Dimensions|
|--|--|---|
|There are four sides to a square.|1.00000|1536|
|There are three sides to a triangle.|1.00000|1536|
|There are five sides to a pentagon.|1.00000|1536|
|In a footbal game, there are eleven players on each team.|1.00000|1536|
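
The two numeric columns can be computed from any list of embedding vectors.  The sketch below is only an illustration under stated assumptions: it uses random stand-in vectors scaled to unit length instead of real model output, and the names `stand_in_vectors` and `table_rows` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (an assumption for illustration): four random vectors with
# 1536 dimensions, scaled to unit length.  Real embedding vectors would come
# from the embedding model instead.
stand_in_vectors = []
for _ in range(4):
    v = rng.normal(size=1536)
    stand_in_vectors.append(v / np.linalg.norm(v))

# Magnitude is the Euclidean norm; the number of dimensions is the number of entries.
table_rows = [(float(np.linalg.norm(v)), len(v)) for v in stand_in_vectors]

for magnitude, dims in table_rows:
    print("|%.5f|%d|" % (magnitude, dims))
```

Because every stand-in vector is scaled to unit length, each row reports a magnitude of 1 to five decimal places, matching the pattern in the table above.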






