Instead of using fine-tuning, we'll use RAG to build our own "Commander Data" based on everything he ever said in the scripts.

To summarize the high-level approach:

- We'll first parse all of the scripts to extract every line from Data, much as we did in the fine-tuning example.
- Then we'll use the OpenAI embeddings API to compute an embedding vector for every one of his lines. Lines with similar meanings end up with vectors that are close together, which lets us measure how similar any two lines are.
- RAG calls for use of a vector database to store these lines with the associated embedding vectors. To keep things simple, we'll use a local database called vectordb. There are plenty of cloud-based vector database services out there as well.
- Then we'll make a little retrieval function that retrieves the N most-similar lines from the vector database for a given query.
- Those similar lines are then added as context to the prompt before it is handed off to the chat API.
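
To make "most similar" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are made up for illustration, and this is not necessarily the exact metric vectordb applies internally.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" (assumption: real ones have 128+ dimensions).
query = [1.0, 0.0, 1.0]
close_line = [0.9, 0.1, 1.1]
unrelated_line = [-1.0, 1.0, 0.0]

print(cosine_similarity(query, close_line))      # close to 1
print(cosine_similarity(query, unrelated_line))  # negative
```

A retrieval step then amounts to scoring every stored line against the query and keeping the N highest scores.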

I'm intentionally not using langchain or some other higher-level framework, because this is actually pretty simple without it.

First, let's install vectordb:

In [1]:
!pip install vectordb



Now we'll parse out all of the scripts and extract every line of dialog from "DATA". This is almost exactly the same code as the preprocessing script from our fine-tuning example. Note that you will first need to upload all of the script files into a tng folder within the sample_data folder in your Colab workspace.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Episodes"), but you could easily adapt this to read scripts for your favorite character from your favorite TV show or movie instead.

In [2]:
import os
import re

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
          lines = script_file.readlines()
        except UnicodeDecodeError:
          pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        strippedLine = line.strip()
        if (is_single_word_all_caps(strippedLine)):
            is_character_line = True
            current_character = strippedLine
        elif (line.strip() == '') and is_character_line:
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            # Compare against the requested character rather than a hard-coded name
            if (current_character == character_name and len(dialog_line)>0):
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += line.strip() + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)



In [3]:
process_directory("./sample_data/tng", 'DATA')

Let's do a little sanity check to make sure the lines imported correctly, and print out the first one.

In [4]:
print (dialogues[0])

The only permanent solution would be to re-liquefy the core.
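
To see what the parser keys on, here is a small sketch that applies the same logic to an in-memory string instead of files. The function name `extract_lines` and the sample script text are made up for illustration; the real scripts may contain formatting this toy sample does not cover.

```python
import re

def extract_lines(script_text, character_name):
    """Same state machine as extract_character_lines, but reading from a string."""
    found = []
    speaking = False
    speaker = ''
    current = ''
    for line in script_text.splitlines():
        stripped = line.strip()
        words = stripped.split()
        # A character cue is a single all-caps word containing no digits.
        is_name = (len(words) == 1 and words[0].isupper()
                   and not re.search(r'\d', words[0]))
        if is_name:
            speaking = True
            speaker = stripped
        elif stripped == '' and speaking:
            # A blank line ends the speech; drop (parentheticals) and tidy quotes.
            speaking = False
            text = re.sub(r'\(.*?\)', '', current).strip().replace('"', "'")
            if speaker == character_name and text:
                found.append(text)
            current = ''
        elif speaking:
            current += stripped + ' '
    return found

# Invented sample in the same layout as the scripts (not a real quote).
sample = """
        PICARD
    Mister Data, report.

        DATA
    (checking console)
    Sensors show nothing unusual,
    sir.

"""
print(extract_lines(sample, 'DATA'))
```

Note that a speech is only recorded once a blank line follows it, so a file whose last speech is not followed by a blank line would lose that speech.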


Now we'll set up what a document in our vector database looks like; it's just a string and its embedding vector. We're going with 128 dimensions in our embeddings to keep it simple, but you could go larger. The OpenAI model we're using, text-embedding-3-small, supports up to 1536.

In [5]:
from docarray import BaseDoc
from docarray.typing import NdArray

embedding_dimensions = 128

class DialogLine(BaseDoc):
  text: str = ''
  embedding: NdArray[embedding_dimensions]

It's time to start computing embeddings for each line in OpenAI, so let's make sure OpenAI is installed:

In [6]:
!pip install openai --upgrade



Let's initialize the OpenAI client, and test creating an embedding for a single line of dialog just to make sure it works.

You will need to provide your own OpenAI secret key here. To use this code as-is, click on the little key icon in Colab and add a "secret" named OPENAI_API_KEY that contains your secret key.

In [7]:
from google.colab import userdata

from openai import OpenAI
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

embedding_model = "text-embedding-3-small"

response = client.embeddings.create(
    input=dialogues[1],
    dimensions=embedding_dimensions,
    model= embedding_model
)

print(response.data[0].embedding)

[-0.012653457932174206, 0.1878061145544052, 0.06242372468113899, 0.05711544305086136, -0.029895080253481865, 0.07711408287286758, 0.04201359301805496, 0.06258831918239594, 0.015657367184758186, -0.11883962899446487, 0.07929500192403793, -0.02032783068716526, -0.02041012980043888, -0.11159732192754745, 0.13159595429897308, -0.07991224527359009, 0.10846996307373047, -0.11349020153284073, -0.09793570637702942, 0.08131132274866104, 0.02429875358939171, 0.059008318930864334, 0.02514231763780117, -0.01591455191373825, 0.014916677959263325, -0.04162267595529556, 0.08698994666337967, 0.09596052765846252, -0.001743708155117929, -0.0023995277006179094, 0.19356703758239746, -0.06081889569759369, 0.045346699655056, -0.0030656345188617706, 0.020379267632961273, -0.0014942395500838757, 0.017714841291308403, 0.19504842162132263, -0.0978534072637558, 0.0042306785471737385, 0.03197312727570534, -0.01794116199016571, -0.08888282626867294, 0.04567589610815048, 0.055551763623952866, -0.047033827751874924,

Let's double check that we do in fact have embeddings of 128 dimensions as we specified.

In [8]:
print(len(response.data[0].embedding))

128
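
The client setup above relies on Colab's secrets store. If you run this notebook elsewhere, one possible approach is to fall back to an environment variable. The helper name `get_openai_api_key` is hypothetical, and this is only a sketch of that fallback.

```python
import os

def get_openai_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"Set the {name} environment variable first.")
        return key

# Usage (assumption: the openai package is installed):
#   client = OpenAI(api_key=get_openai_api_key())
```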


OK, now let's compute embeddings for every line Data ever said. The OpenAI API currently can't handle computing them all at once, so we're breaking it up into 128 lines at a time here.

In [9]:
#Generate embeddings for everything Data ever said, 128 lines at a time.
embeddings = []

for i in range(0, len(dialogues), 128):
  dialog_slice = dialogues[i:i+128]
  slice_embeddings = client.embeddings.create(
    input=dialog_slice,
    dimensions=embedding_dimensions,
    model=embedding_model
  )

  embeddings.extend(slice_embeddings.data)

Let's check how many embeddings we actually got back in total.

In [10]:
print (len(embeddings))

6502
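
The slicing arithmetic in the loop can be checked without calling the API. This sketch uses a hypothetical `batches` helper and placeholder strings standing in for the 6502 lines counted above.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in data: 6502 placeholder strings, matching the count reported above.
fake_lines = [f"line {i}" for i in range(6502)]
sizes = [len(b) for b in batches(fake_lines, 128)]
print(len(sizes), sizes[-1])  # 51 batches, the last one holding 102 lines
```

Comparing `len(embeddings)` with `len(dialogues)` after the loop is a cheap way to confirm that no batch was dropped.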


Now let's insert every line and its embedding vector into our vector database. The syntax here will vary depending on which vector database you are using, but it's pretty simple.

With vectordb, we need to set up a workspace folder for it to use, so be sure to create that within your Colab drive first.

In [11]:
from docarray import DocList
import numpy as np
from vectordb import InMemoryExactNNVectorDB

# Specify your workspace path
db = InMemoryExactNNVectorDB[DialogLine](workspace='./sample_data/workspace')

# Index our list of documents
doc_list = [DialogLine(text=dialogues[i], embedding=embeddings[i].embedding) for i in range(len(embeddings))]
db.index(inputs=DocList[DialogLine](doc_list))




<DocList[DialogLine] (length=6502)>
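
As the class name suggests, an exact nearest-neighbor store compares the query against every stored vector. The sketch below is a hypothetical stand-in written with NumPy to illustrate that idea; it is not vectordb's implementation, and the two-dimensional vectors are invented.

```python
import numpy as np

class TinyExactIndex:
    """Hypothetical brute-force store (illustration only, not vectordb's code)."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def index(self, text, vector):
        v = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(v / np.linalg.norm(v))  # store unit-length vectors

    def search(self, vector, limit=10):
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q   # one similarity score per stored line
        best = np.argsort(-scores)[:limit]    # highest scores first
        return [(self.texts[i], float(scores[i])) for i in best]

index = TinyExactIndex()
index.index("about Lal", [1.0, 0.0])
index.index("about Spot", [0.0, 1.0])
index.index("about family", [0.8, 0.6])
print(index.search([1.0, 0.1], limit=2))
```

Scanning everything is fine for a few thousand lines; larger collections usually call for an approximate index instead.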

Let's try querying our vector database for lines similar to a query string.

First we need to compute the embedding vector for our query string, then we'll query the vector database for the top 10 matches based on the similarities encoded by their embedding vectors.

In [12]:
# Perform a search query
queryText = 'Lal, my daughter'
response = client.embeddings.create(
    input=queryText,
    dimensions=embedding_dimensions,
    model=embedding_model
)
query = DialogLine(text=queryText, embedding=response.data[0].embedding)
results = db.search(inputs=DocList[DialogLine]([query]), limit=10)

# Print out the matches
for m in results[0].matches:
  print(m)

DialogLine(
    id='e4002ce0ac1a555c5412551ebeca6b2e',
    text='That is Lal, my daughter.',
    embedding=NdArray([ 1.27129272e-01, -2.03528721e-03, -1.36807682e-02,
          1.59366392e-02,  5.45047633e-02, -2.07394622e-02,
          1.57365222e-02,  8.31033960e-02, -1.62277207e-01,
          1.60185061e-02,  3.74038033e-02, -1.14394508e-01,
         -4.85740043e-02, -2.47418150e-01,  3.91138978e-02,
          4.99566346e-02,  2.97811404e-02, -1.34697348e-01,
         -1.76540136e-01,  1.35570601e-01,  1.45394549e-01,
          3.51115465e-02,  6.64026663e-02, -7.42618293e-02,
          6.95317760e-02, -2.21380126e-0

DialogLine(
    id='432248aaa8a81ebdd28e6ff55ed9b157',
    text='Lal...',
    embedding=NdArray([ 0.12602141, -0.00787136,  0.07228519, -0.01194139,  0.08780899,
         -0.0861372 ,  0.07216577,  0.05365663, -0.12729517,  0.05286053,
          0.01232948, -0.15460114, -0.00170165, -0.08239556,  0.09370007,
          0.01691696,  0.02444004, -0.08008689, -0.08653524,  0.1749811 ,
          0.08995844,  0.09163023, -0.05389545, -0.01919578,  0.12530494,
          0.01854895,  0.09521265, -0.01378235,  0.04026237,  0.0171956 ,
         -0.00565723, -0.1099

DialogLine(
    id='efc111982ace31e2129f160676db9e39',
    text='What do you feel, Lal?',
    embedding=NdArray([ 0.12457579, -0.0550159 , -0.01717105,  0.04991408,  0.1378253 ,
         -0.00798111,  0.01482955, -0.00798111, -0.04306089,  0.10569144,
          0.00157885, -0.10675749, -0.10995565, -0.13287577,  0.00797635,
          0.00701976,  0.06773238, -0.10119879, -0.1110217 ,  0.06902687,
          0.11292537,  0.01653332, -0.07740299, -0.0203121 ,  0.11033639,
         -0.15670964,  0.09959972,  0.0350655 , -0.04949527,  0.0302873 ,
         -0.02922125

DialogLine(
    id='8eaa55be876073153d2f74df29595925',
    text='Yes, Lal. I am here.',
    embedding=NdArray([ 0.02412624,  0.00757259, -0.02155785, -0.08026613,  0.14104569,
          0.01711703, -0.00144782,  0.07794631, -0.12321614,  0.03400209,
         -0.04460702, -0.07304152, -0.00033891, -0.09637238, -0.03348841,
          0.08888265, -0.03315701,  0.06734136, -0.1063808 ,  0.17895834,
          0.05617304, -0.06793789, -0.03372039,  0.06747393,  0.08649653,
          0.02833507,  0.082122  , -0.00111642, -0.08603257, -0.0121294 ,
          0.01734902,

DialogLine(
    id='17b68bb7f65a59d4f1ff95840ae98a4e',
    text='Correct, Lal. We are a family.',
    embedding=NdArray([ 0.09101048,  0.05812611, -0.08708399,  0.09963474,  0.10328077,
         -0.01980775, -0.05731978,  0.05514618, -0.16827823, -0.01631948,
          0.04038678, -0.07572521, -0.07635625, -0.13819851, -0.00216154,
          0.0198954 ,  0.1148499 , -0.06636473, -0.05970372,  0.1113441 ,
          0.11456943, -0.08091379,  0.06348997, -0.10264973,  0.05973877,
         -0.10152787,  0.09774161,  0.00347731,  0.01418095,  0.01317303,
          0.0166

DialogLine(
    id='a92902b84176735aeab5d62d934eee91',
    text='No, Lal, this is a flower.',
    embedding=NdArray([ 8.73676017e-02,  5.02153523e-02, -1.07479610e-01,
         -8.29377864e-03,  6.96483403e-02, -1.26686245e-01,
          5.68439066e-02, -2.82319891e-03, -6.57035410e-02,
          1.08902320e-01,  4.80166115e-02, -1.52456788e-02,
         -3.77342664e-02, -1.50937065e-01, -4.88896407e-02,
          6.52508587e-02, -8.67855772e-02, -1.01077393e-01,
         -6.19204119e-02,  7.58565441e-02,  4.06766981e-02,
         -4.37161326e-02,  4.16467302e-02, -7.16611557e-03,
          1.15692548e-01, -1.24538041e-

DialogLine(
    id='39fd10a89668cc19757dd697b3d0e3a3',
    text='Lal, you used a verbal contraction.',
    embedding=NdArray([ 0.03570644,  0.08633868, -0.07582465,  0.01763676,  0.02867648,
          0.01408054, -0.03966466,  0.00656611, -0.03473751,  0.07013471,
         -0.06444477, -0.05665202,  0.02583151, -0.04125207,  0.07528865,
         -0.05632217,  0.00241204, -0.05100331, -0.22578347,  0.09705885,
          0.06675373,  0.06572294, -0.01705952,  0.0088493 ,  0.11932384,
          0.02853217,  0.08394725,  0.05034361, -0.0593733 ,  0.06687742,
          0

DialogLine(
    id='f04be8f5aaea639e10ebbc3a4faec51a',
    text='Yes, Wesley. Lal is my child.',
    embedding=NdArray([ 0.05781468, -0.01464175, -0.07966915,  0.03881896, -0.02053794,
         -0.04032357,  0.1007337 ,  0.08869682, -0.08606375, -0.02270082,
         -0.03359044, -0.09877771,  0.00246615, -0.07022772, -0.12292671,
          0.02352835, -0.01486744, -0.00778636, -0.11811196,  0.02397974,
          0.05254854,  0.09163081,  0.06992679,  0.04784663,  0.02582289,
         -0.07011487,  0.11713396, -0.15693092, -0.01711495, -0.01326879,
          0.09448

DialogLine(
    id='cbcea9eb18e0eda87d85b212269b5849',
    text="Lal's creation is entirely dependent on me. I am giving it knowledge and skills that are stored in my brain... its programming reflects mine in the same way a human child's genes reflect its parent's genes...",
    embedding=NdArray([ 1.25031352e-01,  1.18134087e-02,  7.98322260e-02,
          3.18448395e-02,  1.95765063e-01, -1.42494664e-01,
          1.15199082e-01,  5.82966059e-02, -1.33176014e-01,
          2.06367783e-02,  7.30083361e-02, -8.95177573e-02,
          1.24164612e-03, -1.91362545e-01, -4.14203070e-02,
         -6.44968078e-02, -1.25755367e-04, -1.83731526e-01,
          9.74973105e-03,

DialogLine(
    id='1df70a44dd1e95c431fc02be1c43935d',
    text='Perhaps. I created Lal because I wished to procreate. Despite what happened to her, I still have that wish.',
    embedding=NdArray([ 0.10522395,  0.02254518,  0.0648986 ,  0.15035369,  0.03493026,
         -0.10774428,  0.07438923,  0.09183467, -0.10939825, -0.09852931,
         -0.02859004,  0.04288506,  0.0204974 , -0.14507674, -0.04528725,
         -0.04674432,  0.02205292, -0.12270877, -0.08766037,  0.00166874,
          0.07891796, -0.02951548,  0.005528  , -0.14547053,  0.04130985,
         -0.02658165,  0.18981266, -
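
Printing whole documents buries the matched text under the embedding values. A possible alternative is to collect only the `text` field of each match, as in this sketch. The helper name `match_texts` is hypothetical, and the stand-in objects below only mimic the `.text` attribute of the real results.

```python
from types import SimpleNamespace

def match_texts(matches):
    """Collect the .text field of each match, in ranked order."""
    return [m.text for m in matches]

# Stand-in objects with a .text attribute, mimicking DialogLine results.
fake_matches = [
    SimpleNamespace(text="That is Lal, my daughter."),
    SimpleNamespace(text="Lal..."),
]
for text in match_texts(fake_matches):
    print(text)
```

With the real search results, the call would be `match_texts(results[0].matches)`.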

Let's put it all together! We'll write a generate_response function that:

- Computes an embedding for the query passed in
- Queries our vector database for the 10 most similar lines to that query (you could experiment with using more or fewer)
- Constructs a prompt that adds in these similar lines as context, to try to nudge ChatGPT in the right direction using our external data
- Feeds the augmented prompt into the chat completions API to get our response.

That's RAG!

In [13]:
import openai

def generate_response(question):
    # Search for similar dialogues in the vector DB
    response = client.embeddings.create(
        input=question,
        dimensions=embedding_dimensions,
        model=embedding_model
    )

    # Use the question passed in, not the global queryText from the earlier cell
    query = DialogLine(text=question, embedding=response.data[0].embedding)
    results = db.search(inputs=DocList[DialogLine]([query]), limit=10)

    # Extract relevant context from search results
    context = "\n"
    for result in results[0].matches:
      context += "\"" + result.text + "\"\n"

    prompt = f"Lt. Commander Data is asked: '{question}'. How might Data respond? Take into account Data's previous responses similar to this topic, listed here: {context}"
    print("PROMPT with RAG:\n")
    print(prompt)

    print("\nRESPONSE:\n")
    # Use OpenAI API to generate a response based on the context
    completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "You are Lt. Cmdr. Data from Star Trek: The Next Generation."},
        {"role": "user", "content": prompt}
      ]
    )

    return (completion.choices[0].message.content)
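
The prompt assembly is the "augmentation" part of RAG and can be tested without any API calls. Here is a minimal sketch that pulls it into its own function; the name `build_prompt` is hypothetical, and the wording of the prompt is copied from the cell above.

```python
def build_prompt(question, similar_lines):
    """Assemble the augmented prompt from a question and retrieved lines."""
    # Each retrieved line is quoted and placed on its own line, as in the cell above.
    context = "\n" + "".join(f'"{line}"\n' for line in similar_lines)
    return (
        f"Lt. Commander Data is asked: '{question}'. How might Data respond? "
        "Take into account Data's previous responses similar to this topic, "
        f"listed here: {context}"
    )

print(build_prompt("Tell me about your daughter, Lal.",
                   ["That is Lal, my daughter.", "Lal..."]))
```

Keeping this step separate makes it easy to try different prompt wordings or a different number of retrieved lines.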




Let's try it out! Note that the final response does seem to be drawing from the model's own training, but it is building upon the specific lines we gave it, allowing us to have some control over its output.

In [14]:
print(generate_response("Tell me about your daughter, Lal."))

PROMPT with RAG:

Lt. Commander Data is asked: 'Tell me about your daughter, Lal.'. How might Data respond? Take into account Data's previous responses similar to this topic, listed here: 
"That is Lal, my daughter."
"What do you feel, Lal?"
"Lal..."
"Correct, Lal. We are a family."
"Yes, Doctor. It is an experience I know too well. But I do not know how to help her. Lal is passing into sentience. It is perhaps the most difficult stage of development for her."
"Lal is realizing that she is not the same as the other children."
"Yes, Wesley. Lal is my child."
"That is precisely what happened to Lal at school. How did you help him?"
"This is Lal. Lal, say hello to Counselor Deanna Troi..."
"I am sorry I did not anticipate your objections, Captain. Do you wish me to deactivate Lal?"


RESPONSE:

"Lal is my daughter. She was created by Dr. Noonien Soong, the creator of my own positronic brain. She is an android just like myself, and I am proud to call her my child. Lal is currently undergoi