Create and activate a virtual environment (Optional)
- `python -m venv openai-env`
- `source openai-env/bin/activate` (macOS/Linux)
- `openai-env\Scripts\activate` (Windows)

Once the virtual environment is set up, install the OpenAI Python library:
- `pip install --upgrade openai`

Alternatively, to install additional libraries that could be useful, including LangChain:
- `pip install -r requirements.txt`

In [1]:
from openai import OpenAI
import numpy as np
import pandas as pd
import tiktoken

The following cells define three string constants:

- OPENAI_API_KEY: The API key for OpenAI. It authenticates your application when it makes requests to the OpenAI API.

- EMBEDDING_MODEL: The name of the model used for text embedding. Text embedding converts text into a numeric vector that machine learning algorithms can process. In this case, the model is "text-embedding-ada-002".

- LLM: The name of the chat model used to generate responses. We recommend using `gpt-4o-mini`.
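
Pasting the key into a notebook makes it easy to leak when the file is shared. As a minimal sketch of an alternative, the key can be read from an environment variable. This assumes the key has been exported as `OPENAI_API_KEY`, which is also the variable the `openai` client reads by default; the helper name `load_api_key` is hypothetical and not part of the library.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return api_key
```

With this in place, `OPENAI_API_KEY = load_api_key()` could replace the literal string in the cell that follows.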

In [5]:
OPENAI_API_KEY = "your-api-key"

In [6]:
client = OpenAI(
    api_key=OPENAI_API_KEY
)

These are the models available to your API key that you can experiment with:

In [32]:
[model.id for model in client.models.list().data]

['gpt-4o',
 'tts-1',
 'tts-1-1106',
 'chatgpt-4o-latest',
 'gpt-4-turbo-preview',
 'gpt-3.5-turbo-0125',
 'gpt-3.5-turbo',
 'gpt-4o-realtime-preview-2024-10-01',
 'dall-e-3',
 'gpt-4o-realtime-preview',
 'gpt-4o-2024-08-06',
 'gpt-4o-mini',
 'gpt-4o-mini-2024-07-18',
 'tts-1-hd',
 'text-embedding-ada-002',
 'text-embedding-3-small',
 'text-embedding-3-large',
 'whisper-1']
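
The listing mixes chat, speech, image, and embedding models, so it can help to filter it by name. The sketch below does this on a hand-copied subset of the IDs shown above; `filter_models` is a hypothetical helper, and with a live client the `model_ids` list would instead come from `client.models.list().data`.

```python
# A subset of the model IDs from the listing above, copied by hand.
model_ids = [
    "gpt-4o",
    "tts-1",
    "dall-e-3",
    "gpt-4o-mini",
    "text-embedding-ada-002",
    "text-embedding-3-small",
    "text-embedding-3-large",
    "whisper-1",
]


def filter_models(ids, prefix):
    """Return the model IDs that start with `prefix`, sorted alphabetically."""
    return sorted(model_id for model_id in ids if model_id.startswith(prefix))


embedding_models = filter_models(model_ids, "text-embedding")
print(embedding_models)
# ['text-embedding-3-large', 'text-embedding-3-small', 'text-embedding-ada-002']
```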

In [8]:
EMBEDDING_MODEL = "text-embedding-ada-002"
LLM = "gpt-4o-mini"

This Python code uses the OpenAI API to create a chat completion. It uses the model stored in `LLM` (here `gpt-4o-mini`) to answer a user's question about Telenor.

Here's a breakdown of the code:

- `client.chat.completions.create`: This is a method from the OpenAI API that creates a chat completion. A chat completion is a conversation with the model where you provide a series of messages and the model returns a generated message as a response.

- `model=LLM`: This specifies the model to use for the chat completion, here `gpt-4o-mini`.

- `messages`: This is a list of messages to send to the model. Each message is a dictionary with two keys: 'role' and 'content'. 'role' can be 'system', 'user', or 'assistant', and 'content' is the text of the message. The 'system' role is used to set the behavior of the 'assistant', and the 'user' role is used to ask the assistant a question.

In this example, the system message sets the assistant's role as a helpful assistant that can answer questions about Telenor. The user message then asks, in Norwegian, "Hva driver Telenor med?" ("What does Telenor do?").
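
The structure of the `messages` list can be sketched without calling the API. The helper `build_messages` below is hypothetical (it is not part of the `openai` library) and shows how a system prompt, optional earlier turns, and the new user question are assembled in order.

```python
def build_messages(system_prompt, user_prompt, history=None):
    """Assemble a chat `messages` list (hypothetical helper).

    `history` is an optional list of (user_text, assistant_text) pairs
    from earlier turns of the conversation.
    """
    # The system message comes first and sets the assistant's behavior.
    messages = [{"role": "system", "content": system_prompt}]
    # Earlier turns alternate between the user and the assistant.
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The new question is always the last message.
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "You are a helpful assistant who can answer questions about Telenor.",
    "What does Telenor do?",
)
print([message["role"] for message in messages])  # ['system', 'user']
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`.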

In [9]:
chat_completion = client.chat.completions.create(
    model=LLM,
    messages=[
        # System prompt: "You are a helpful assistant who can answer questions about Telenor."
        {"role": "system", "content": "Du er en hjelpsom assistent som kan svare på spørsmål om Telenor."},
        # User question: "What does Telenor do?"
        {"role": "user", "content": "Hva driver Telenor med?"}
    ]
)
chat_completion.choices[0].message.content

'Telenor er et ledende telekommunikasjonsselskap som tilbyr ulike tjenester innen mobil- og bredbåndskommunikasjon. Selskapet er aktivt i flere markeder, både i Norge og internasjonalt. Telenor driver med følgende hovedområder:\n\n1. **Mobiltelefonitjenester**: Telenor tilbyr mobilabonnementer, både for privatkunder og bedrifter, inkludert tale, tekstmeldinger og data.\n\n2. **Bredbåndstjenester**: De leverer fastlinje-bredbånd med forskjellige hastigheter, samt fiberoptiske tjenester.\n\n3. **TV-tjenester**: Telenor tilbyr TV-løsninger, inkludert IPTV og streaming-tjenester.\n\n4. **IoT (Internet of Things)**: Selskapet jobber med IoT-løsninger for å koble opp enheter og systemer, noe som er særlig relevant for smarte byer og industri.\n\n5. **Finansielle tjenester**: Telenor har også hatt fokus på mobile betalingsløsninger og finansielle tjenester i flere markeder, inkludert samarbeid med lokale aktører.\n\n6. **Bedriftstjenester**: De tilbyr en rekke kommunikasjonstjenester og løsni

The following function uses the OpenAI API to create an embedding for a given text input.

Here's a breakdown of the code:

- `client.embeddings.create()`: This is a method from the OpenAI API that creates an embedding. An embedding is a vector representation of the input text, a form that machine learning algorithms can process. It also enables vector search, for example by comparing embeddings with cosine similarity.

- `model=model_name`: This specifies the model to use for creating the embedding. In this notebook, `EMBEDDING_MODEL` ("text-embedding-ada-002") is passed in.

- `input=text`: This is the text input for which the embedding will be created.

- `encoding_format="float"`: This specifies the format of the encoding for the embedding. In this case, it's set to "float", which means the embedding will be a list of floating-point numbers.

The function `get_embedding` wraps this call so that any text and any embedding model name can be passed in. For instance, it could embed the Norwegian text "Verdifull informasjon om Telenor som du vil bruke inn i språkmodellen." ("Valuable information about Telenor that you want to feed into the language model.") using the "text-embedding-ada-002" model and a floating-point encoding format.

In [33]:
def get_embedding(text: str, model_name: str):
    embedding = client.embeddings.create(
        model=model_name,
        input=text,
        encoding_format="float"
    )
    return embedding.data[0].embedding
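
Calling `get_embedding` once per text sends one request per document. The embeddings endpoint also accepts a list of strings as `input`, so several texts can be embedded in a single request. The sketch below assumes that behavior; the helper name `get_embeddings` is hypothetical, and the client is passed in explicitly so the function does not depend on a global.

```python
def get_embeddings(texts, model_name, client):
    """Embed several texts in one request (hypothetical helper).

    `client` is expected to be an `openai.OpenAI` instance.
    """
    response = client.embeddings.create(
        model=model_name,
        input=list(texts),
        encoding_format="float",
    )
    # Each returned item carries the index of the input it belongs to;
    # sorting by it keeps the vectors aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

A call such as `get_embeddings(example_data, EMBEDDING_MODEL, client)` would then return one vector per document, in the same order as the input list.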


The function `cos_sim` calculates the cosine similarity between two vectors `a` and `b`. If the vectors point in the same direction, the cosine similarity is 1. If they are orthogonal (i.e., share no common direction), it is 0, and if they point in opposite directions, it is -1.

In [11]:
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
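
The boundary cases of cosine similarity can be checked by hand on small vectors. The sketch below uses a dependency-free variant, `cos_sim_plain`, which is a hypothetical name introduced here; it computes the same formula as `cos_sim` using only the standard library.

```python
import math


def cos_sim_plain(a, b):
    """Cosine similarity using only the standard library (same formula as cos_sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction (one vector is a positive multiple of the other): similarity 1.
same_direction = cos_sim_plain([3.0, 4.0], [6.0, 8.0])
# Perpendicular vectors: similarity 0.
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity -1.
opposite = cos_sim_plain([3.0, 4.0], [-3.0, -4.0])

print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: `[3, 4]` and `[6, 8]` are treated as identical.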

In [36]:
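example_data = [
    # "Telia Norge is a leading technology company that builds critical societal infrastructure and delivers innovative products and services within TV, internet, mobile and smart-home technology."
    "Telia Norge er et ledende teknologiselskap som bygger samfunnskritisk infrastruktur og leverer innovative produkter og tjenester innen TV, internett, mobil og smart hjem-teknologi.",
    # "Telenor Norge is the country's largest digital service provider within mobile, broadband and TV services. Every day we work to lead the digitalisation of Norway and develop the best digital security services for our customers."
    "Telenor Norge er landets største digitale tjenesteleverandør innenfor mobil, bredbånd og TV-tjenester. Hver dag jobber vi for å lede an i digitaliseringen av Norge og utvikle de beste digitale sikkerhetstjenestene for våre kunder.",
    # "The apple is a fruit known since the Stone Age and exists today in an unknown number of varieties."
    "Eple er en frukt kjent allerede fra steinalderen og finnes i dag i et ukjent antall varianter."
]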

In [37]:
input_text = "Hva driver Telenor med?"  # "What does Telenor do?"

input_embedding = get_embedding(input_text, EMBEDDING_MODEL)
example_embedding = [get_embedding(ex, EMBEDDING_MODEL) for ex in example_data]

The cosine similarity between two embeddings indicates how semantically similar the user's input message is to each document.

In [38]:
similarity_scores = [cos_sim(input_embedding, ex_embedding) for ex_embedding in example_embedding]
similarity_scores

[0.8768048924494523, 0.8814451296547565, 0.7605974032552157]
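
In the scores above, the Telenor text (0.881) ranks just ahead of the Telia text (0.877), while the unrelated apple text is clearly lower (0.761). Turning scores into a ranking can be sketched as follows; `rank_by_score` is a hypothetical helper, and the scores are copied from the output above.

```python
# Scores copied from the output above, in the order of example_data:
# index 0 = Telia text, index 1 = Telenor text, index 2 = apple text.
similarity_scores = [0.8768048924494523, 0.8814451296547565, 0.7605974032552157]


def rank_by_score(scores, top_k=None):
    """Return document indices ordered from most to least similar."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order if top_k is None else order[:top_k]


ranking = rank_by_score(similarity_scores)
print(ranking)  # [1, 0, 2]
```

Passing `top_k=2` would keep only the two telecom texts, which is one way to select several passages for retrieval instead of a single best match.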

The API limits the number of input tokens per embedding input. The following function counts the tokens in a string using `tiktoken`; `cl100k_base` is the encoding used by `text-embedding-ada-002`:

In [16]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
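
When a text exceeds the limit, one option is to split its token sequence into chunks and embed each chunk separately. The sketch below works on an already encoded token list (for example the result of `encoding.encode(string)`). The helper `chunk_tokens` is hypothetical, and `max_tokens` is left as a parameter because the actual limit depends on the model.

```python
def chunk_tokens(tokens, max_tokens):
    """Split a token list into consecutive chunks of at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[start:start + max_tokens] for start in range(0, len(tokens), max_tokens)]


# Ten placeholder token IDs split into chunks of at most four tokens.
chunks = chunk_tokens(list(range(10)), max_tokens=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk can be turned back into text with `encoding.decode(chunk)` before it is sent to the embeddings endpoint. Splitting on token boundaries can cut a sentence in half, so splitting on paragraphs first may give better chunks.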

How to augment an LLM with "new" knowledge: retrieval-augmented generation

In [None]:
example_data_df = pd.DataFrame({"content": example_data})
example_data_df["embedding"] = example_data_df.apply(lambda x: get_embedding(x["content"], EMBEDDING_MODEL), axis=1)

In [None]:
similarity_scores = [cos_sim(input_embedding, ex_embedding) for ex_embedding in example_data_df['embedding']]
max_index = np.argmax(similarity_scores)

In [None]:
# np.argmax returns a position, so select the row by position rather than by label.
most_relevant_data = example_data_df['content'].iloc[max_index]

In [None]:
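# English: "You are a helpful assistant who can answer questions about Telenor. You must base your answers on the following information: ..."
system_message = f"""Du er en hjelpsom assistent som kan svare på spørsmål om Telenor.
                     Du skal basere svarene dine på følgende informasjon: {most_relevant_data}"""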

In [None]:
chat_completion = client.chat.completions.create(
    model=LLM,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Hva driver Telenor med?"}
    ]
)
chat_completion.choices[0].message.content
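
The cells above place only the single best-matching document in the system message. A common variation is to include several retrieved passages. The sketch below builds such a prompt; `build_rag_system_message` is a hypothetical helper, and the numbered "Context" layout is an assumption rather than a format the API requires.

```python
def build_rag_system_message(instructions, passages):
    """Combine instructions and retrieved passages into one system message."""
    # Number the passages so the model (and the reader) can tell them apart.
    numbered = "\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(passages, start=1)
    )
    return f"{instructions}\n\nContext:\n{numbered}"


system_message_example = build_rag_system_message(
    "You are a helpful assistant. Base your answers on the context below.",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(system_message_example)
```

The returned string can be used as the `content` of the system message in `client.chat.completions.create`, in place of `system_message`.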