# Generate embeddings

This notebook does the following:

1. Load synthetic data.
2. Generate embeddings with the model set in `emb_model_name`.
3. Define a function that dynamically selects few-shot examples based on input similarity.
4. Store the embeddings as JSON text in a SQLite database.

This code will later be turned into a Python script that performs these steps.

In [1]:
import environ
# from sklearn.metrics.pairwise import cosine_similarity
import json
from pathlib import Path
from openai import OpenAI
import sqlite3

# Load the OpenAI API key from the environment (.env file)
env = environ.Env()
environ.Env.read_env()
API_KEY = env("OPENAI_API_KEY")
client = OpenAI(api_key=API_KEY)



## Import json files

Load one of the files with examples generated with `gpt-3.5-turbo`.

In [2]:
# Folder with synthetic data
directory = Path('data_nb/')

# Select all the files with examples generated with GPT
matching_files = directory.glob('synthetic_data*gpt*.json')

# Select one of the files
for file_path in matching_files:
    print(f"Loading file: {file_path}")
    with open(file_path, 'r') as file:
        syn_data_gpt = json.load(file)
    break

Loading file: data_nb/synthetic_data_gpt.json


In [3]:
syn_data_gpt[:5]

[{'dysfunctional': "You always waste money on useless things! You're so irresponsible with our finances.",
  'functional': "I've noticed we have different spending habits. Can we discuss how we can better manage our finances together?"},
 {'dysfunctional': "If you don't pay child support on time, I'll make sure you regret it.",
  'functional': "It's important for both of us to fulfill our financial responsibilities. Can we find a way to ensure child support payments are made on time?"},
 {'dysfunctional': "You're a gold digger who only cares about money. I regret ever being with you.",
  'functional': "Let's have a calm discussion about our financial disagreements and how we can move forward positively."},
 {'dysfunctional': "I'll drain our joint account if you don't agree to my terms. You'll be left with nothing.",
  'functional': "Let's work together to find a fair solution that respects both of our financial needs and concerns."},
 {'dysfunctional': "You're always asking for more mo

Load one of the files with examples generated with `dolphin-mistral`.

In [4]:
# Folder with synthetic data
directory = Path('synthetic_data')

# Select all the files with examples generated with dolphin-mistral
matching_files = directory.glob('synthetic_data*dolphin*.json')

# Select one of the files
for file_path in matching_files:
    print(f"Loading file: {file_path}")
    with open(file_path, 'r') as file:
        syn_data_dolphin = json.load(file)
    break

Loading file: synthetic_data/synthetic_data_dolphin_2024-06-18_11-46.json
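
The loop above takes whichever matching file `glob` yields first, and that order is not guaranteed. Since the file names carry a timestamp, one option is to sort by name and take the last one. A minimal sketch, assuming the timestamps sort lexicographically; `pick_latest_file` is a hypothetical helper and the first file name in the demo is made up for illustration:

```python
import tempfile
from pathlib import Path


def pick_latest_file(directory, pattern):
    """Return the matching file whose name sorts last, or None if nothing matches."""
    candidates = sorted(Path(directory).glob(pattern), key=lambda p: p.name)
    return candidates[-1] if candidates else None


# Demo in a temporary folder so that no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "synthetic_data_dolphin_2024-06-17_09-30.json",  # made-up name
        "synthetic_data_dolphin_2024-06-18_11-46.json",
    ):
        (Path(tmp) / name).write_text("[]")
    latest = pick_latest_file(tmp, "synthetic_data*dolphin*.json")
    latest_name = latest.name

print(latest_name)
```

In the notebook this would be called as `pick_latest_file('synthetic_data', 'synthetic_data*dolphin*.json')`, with a check for `None` before opening the file.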


In [5]:
syn_data_dolphin[:5]

[{'dysfunctional': "You're such a bad money manager, always overspending and racking up debt.",
  'functional': "I'm concerned about our financial situation. Let's have an open conversation to find solutions."},
 {'dysfunctional': "If you don't start contributing more to the household expenses, I'll be forced to take legal action.",
  'functional': 'Can we discuss your contribution to our financial responsibilities and come up with an agreement?'},
 {'dysfunctional': "I can't believe you're trying to take my child support payments away from me. What a low thing to do!",
  'functional': "We need to address the issue of child support payments and come to an agreement that's fair for both parties."},
 {'dysfunctional': 'You never listen to me when we talk about money. You always have to be right and get your way.',
  'functional': "Let's try to compromise and work together to resolve our financial disagreements in a constructive manner."},
 {'dysfunctional': "You don't care about anything

## Generate embeddings

In [6]:
emb_model_name = "text-embedding-3-small"

In [7]:
def get_embedding(text: str, model: str) -> list[float]:
    """
    Generate an embedding for the input text using OpenAI's API.
    """
    # text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

In [8]:
text = "So long and thanks for all the fish!"
r1 = get_embedding(text=text, model=emb_model_name)

In [9]:
r1[:5]

[0.03120167925953865,
 0.010045701637864113,
 -0.016684768721461296,
 0.00215172884054482,
 0.01921393722295761]

In [10]:
len(r1)

1536

Get the embeddings of the first 20 examples in the data.

In [11]:
# Select 20 examples from the "syn_data_gpt" list 
syn_data_gpt_subset = syn_data_gpt[:20]

In [12]:
embeddings_gpt = [get_embedding(text=text["dysfunctional"], model=emb_model_name) for text in syn_data_gpt_subset]

In [13]:
print(f"N data: {len(embeddings_gpt)}")
print(f"Len vector: {len(embeddings_gpt[0])}")

N data: 20
Len vector: 1536
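
The list comprehension above sends one request per text. The embeddings endpoint also accepts a list of inputs, so the 20 texts can be embedded in a single call. A minimal sketch, assuming `client` is an `OpenAI` client like the one created at the top of the notebook; `get_embeddings_batch` is a hypothetical helper name, not part of the original code:

```python
def get_embeddings_batch(client, texts, model):
    """Embed several texts with a single embeddings request.

    `client` is expected to be an `openai.OpenAI` instance (assumption).
    Returns one embedding per input text, in input order.
    """
    response = client.embeddings.create(input=list(texts), model=model)
    # Every returned item carries the index of the input it belongs to;
    # sorting on it keeps the output aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the variables of this notebook, the call would look like `get_embeddings_batch(client, [ex["dysfunctional"] for ex in syn_data_gpt_subset], emb_model_name)`.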


## SQLite Database

### Option 1 (used)

Create a `.sql` file (the function below looks for `create_bd.sql` by default) with the following code:

```
CREATE TABLE examples (
    id INTEGER PRIMARY KEY,  -- SQLite only auto-assigns ids to INTEGER PRIMARY KEY columns; SERIAL would leave id NULL
    dysfunctional TEXT NOT NULL,
    embedding TEXT NOT NULL,
    functional TEXT NOT NULL
);
```

and run the functions below.

### Option 2

To create the SQLite database by hand, run the following commands. Step 1 runs in a shell; steps 2 and 3 run inside the `sqlite3` prompt.

1. Create a new database named "emb_examples.db"
    ```
    sqlite3 emb_examples.db
    ```

2. Create the table (the vector dimension must be equal to the model's embedding size, here **1536**)
    ```
    CREATE TABLE examples (
        id INTEGER PRIMARY KEY,  -- SQLite only auto-assigns ids to INTEGER PRIMARY KEY columns
        example_text TEXT NOT NULL,
        embedding VECTOR(1536)
    );
    ```

3. Exit this program
    ```
    .quit
    ``` 
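
Note that plain SQLite has no built-in vector type. It accepts `VECTOR(1536)` as a declared column type but does not enforce the dimension or give the column any vector semantics. The sketch below illustrates this with an in-memory database and a toy 3-element vector; it uses `INTEGER PRIMARY KEY` so that SQLite assigns the ids itself:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE examples ("
    " id INTEGER PRIMARY KEY,"
    " example_text TEXT NOT NULL,"
    " embedding VECTOR(1536))"
)
# A 3-element vector is accepted even though the column says 1536.
con.execute(
    "INSERT INTO examples (example_text, embedding) VALUES (?, ?)",
    ("toy example", json.dumps([0.1, 0.2, 0.3])),
)
row_id, stored_type = con.execute(
    "SELECT id, typeof(embedding) FROM examples"
).fetchone()
con.close()

print(row_id, stored_type)  # 1 text
```

The value ends up stored as ordinary text, which is effectively what Option 1 does explicitly with a `TEXT` column.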



In [15]:
def create_db(file_name_sql:str="create_bd.sql", file_name_bd:str="database/emb_examples.db"):
    """
    Create the SQLite database from a .sql script, unless the database file already exists.
    """
    file_path = Path(file_name_bd)

    if file_path.exists():
        print("The database file already exists!")
        return None

    print("Creating table...")
    # sqlite3.connect() does not create missing parent folders
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(file_name_sql, 'r') as sql_file:
        sql_script = sql_file.read()
    db = sqlite3.connect(file_name_bd)
    try:
        db.executescript(sql_script)
        db.commit()
    finally:
        db.close()

In [16]:
create_db()

Creating table...


In [17]:
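def insert_embeddings(examples:list, embeddings:list, path_db:str="database/emb_examples.db"):
    """
    Insert each example and its embedding (serialized as JSON text) into the `examples` table.
    """
    if len(examples) != len(embeddings):
        raise ValueError("examples and embeddings must have the same length")
    con = sqlite3.connect(path_db)
    # `with con` commits on success and rolls back on error; it does not close the connection
    with con:
        for ex, emb in zip(examples, embeddings):
            con.execute(
                "INSERT INTO examples (dysfunctional, embedding, functional) VALUES (?, ?, ?)",
                (ex["dysfunctional"], json.dumps(emb), ex["functional"])
            )
    con.close()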

In [19]:
insert_embeddings(
    examples=syn_data_gpt_subset,
    embeddings=embeddings_gpt
)
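
Once the rows are stored, the few-shot selection step from the introduction can be sketched as: read the rows back, decode each embedding with `json.loads`, and rank the examples by cosine similarity to the query embedding. This is only a sketch under assumptions, not the notebook's final function: `select_similar_examples` and `cosine_similarity` are hypothetical helpers, and the demo uses an in-memory database with toy 2-dimensional vectors instead of real 1536-dimensional embeddings.

```python
import json
import math
import sqlite3


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar_examples(con, query_embedding, k=3):
    """Return the k stored examples most similar to the query embedding."""
    rows = con.execute(
        "SELECT dysfunctional, functional, embedding FROM examples"
    ).fetchall()
    scored = []
    for dysfunctional, functional, emb_json in rows:
        score = cosine_similarity(query_embedding, json.loads(emb_json))
        scored.append((score, {"dysfunctional": dysfunctional, "functional": functional}))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:k]]


# Demo with toy data; the table layout mirrors Option 1.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE examples ("
    " id INTEGER PRIMARY KEY,"
    " dysfunctional TEXT NOT NULL,"
    " embedding TEXT NOT NULL,"
    " functional TEXT NOT NULL)"
)
toy_rows = [("d1", [1.0, 0.0], "f1"), ("d2", [0.0, 1.0], "f2"), ("d3", [0.8, 0.6], "f3")]
with con:
    con.executemany(
        "INSERT INTO examples (dysfunctional, embedding, functional) VALUES (?, ?, ?)",
        [(d, json.dumps(e), f) for d, e, f in toy_rows],
    )
top = select_similar_examples(con, [1.0, 0.0], k=2)
con.close()

print([ex["dysfunctional"] for ex in top])  # ['d1', 'd3']
```

For a real query, the query embedding would come from `get_embedding` with the same model that produced the stored vectors, since similarities between embeddings from different models are not meaningful.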