<a href="https://colab.research.google.com/github/DanielWarfield1/MLWritingAndResearch/blob/main/RAGFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG From Scratch
This notebook is a low-level conceptual exploration of retrieval augmented generation (RAG). We use a word vector encoder to embed words, calculate the mean vector of documents and prompts, and use Manhattan distance to measure how far apart those vectors are.

There are surely more efficient and more accurate ways to get this done, which I'll explore in future demos. For now, this covers the low-level fundamentals.

Note: The terms "embedding" and "encoding" are painfully interchangeable. Generally, "encode" is a verb and "embedding" is a noun, so you "encode words into an embedding", but it's also common to say you "embed words into an embedding". I tend to flip between the two depending on the context.

# Loading Word Space Encoder


In [3]:
"""Downloading a word encoder.
I was going to use word2vec, but GloVe downloads way faster. For our purposes
they're conceptually identical.
"""

import gensim.downloader

#downloading the encoder (25-dimensional GloVe vectors trained on Twitter data)
word_encoder = gensim.downloader.load('glove-twitter-25')

#getting the embedding for a word
word_encoder['apple']

array([ 0.85337  ,  0.011645 , -0.033377 , -0.31981  ,  0.26126  ,
        0.16059  ,  0.010724 , -0.15542  ,  0.75044  ,  0.10688  ,
        1.9249   , -0.45915  , -3.3887   , -1.2152   , -0.054263 ,
       -0.20555  ,  0.54706  ,  0.4371   ,  0.25194  ,  0.0086557,
       -0.56612  , -1.1762   ,  0.010479 , -0.55316  , -0.15816  ],
      dtype=float32)

# Embedding text
Embed either a document or a prompt by calculating the mean of its word vectors.
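
To make the averaging concrete, here is a minimal sketch that uses a hypothetical three-word vocabulary with made-up 2-dimensional vectors. The names `toy_vectors` and `mean_embedding` are illustrative only and are not part of the notebook or of gensim. It also shows a limitation of this approach: a mean ignores word order, so two sentences built from the same words get the same embedding.

```python
# Toy stand-in for a word encoder: made-up 2-D vectors, for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [2.0, 2.0],
}

def mean_embedding(sequence, vectors=toy_vectors):
    """Average the word vectors of a space-separated sequence, dimension by dimension."""
    words = sequence.split(' ')
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

a = mean_embedding("dog bites man")
b = mean_embedding("man bites dog")
print(a)
print(a == b)  # same words in a different order give the same mean
```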

In [23]:
"""defining a function for embedding an entire document to a single mean vector
"""

import numpy as np

def embed_sequence(sequence):
    vects = word_encoder[sequence.split(' ')]
    return np.mean(vects, axis=0)

embed_sequence('its a sunny day today')

array([-6.3483393e-01,  1.3683620e-01,  2.0645106e-01, -2.1831200e-01,
       -1.8181981e-01,  2.6023200e-01,  1.3276964e+00,  1.7272198e-01,
       -2.7881199e-01, -4.2115799e-01, -4.7215199e-01, -5.3013992e-02,
       -4.6326599e+00,  4.3883198e-01,  3.6487383e-01, -3.6672002e-01,
       -2.6924044e-03, -3.0394283e-01, -5.5415201e-01, -9.1787003e-02,
       -4.4997922e-01, -1.4819117e-01,  1.0654800e-01,  3.7024397e-01,
       -4.6688594e-02], dtype=float32)
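
`embed_sequence` as written assumes every word is in the encoder's vocabulary; in gensim, indexing with an unknown key raises a `KeyError`. Below is a minimal sketch of one way to skip unknown words before averaging. It uses a small dictionary (`toy_vectors`, made-up values) as a stand-in for `word_encoder` so that it runs without downloading anything, and `embed_known_words` is a hypothetical helper name. With the real encoder, the membership test `word in word_encoder` should play the same role.

```python
# Made-up 2-D vectors standing in for the real encoder.
toy_vectors = {
    "sunny": [1.0, 3.0],
    "day": [3.0, 1.0],
}

def embed_known_words(sequence, vectors=toy_vectors):
    """Mean vector over the words the encoder knows; None if it knows none of them."""
    known = [w for w in sequence.split(' ') if w in vectors]
    if not known:
        return None
    dims = len(vectors[known[0]])
    return [sum(vectors[w][d] for w in known) / len(known) for d in range(dims)]

print(embed_known_words("sunny day"))        # both words known
print(embed_known_words("sunny zzyzx day"))  # unknown word is skipped
print(embed_known_words("zzyzx"))            # nothing left to average
```

Returning `None` when no word is known is a design choice for this sketch; a caller would need to handle that case before computing a distance.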

# Defining distance calculation

In [56]:
from scipy.spatial.distance import cdist

def calc_distance(embedding1, embedding2):
    return cdist(np.expand_dims(embedding1, axis=0), np.expand_dims(embedding2, axis=0), metric='cityblock')[0][0]

print('similar phrases:')
print(calc_distance(embed_sequence('sunny day today')
                  , embed_sequence('rainy morning presently')))

print('different phrases:')
print(calc_distance(embed_sequence('sunny day today')
                  , embed_sequence('perhaps reality is painful')))

similar phrases:
8.496297497302294
different phrases:
11.832107525318861
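
For reference, the `cityblock` metric is the Manhattan (L1) distance: the sum of absolute differences across dimensions. Here is a minimal pure-Python sketch of the same calculation on made-up vectors (`manhattan_distance` is an illustrative name, not a SciPy function):

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

# |1 - 2| + |2 - 0| + |3 - 3| = 1 + 2 + 0
print(manhattan_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.0]))
```

A smaller distance means the two embeddings are closer, which is why the retrieval step later picks the document with the minimum distance.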


# Defining Documents

In [57]:
"""Defining documents
For simplicity's sake I only included words the embedder knows. You could instead
filter out any words the embedder doesn't know. After all, retrieval
is done on a mean of all embeddings, so a missing word or two is of little consequence.
"""
documents = {"menu": "ratatouille is a stew thats twelve dollars and fifty cents also gazpacho is a salad thats thirteen dollars and ninety eight cents also hummus is a dip thats eight dollars and seventy five cents also meat sauce is a pasta dish thats twelve dollars also penne marinera is a pasta dish thats eleven dollars also shrimp and linguini is a pasta dish thats fifteen dollars",
             "events": "on thursday we do karaoke and on tuesdays we do trivia",
             "allergens": "the only item on the menu common allergen is hummus which contain pine nuts",
             "info": "the resteraunt was founded by two brothers in two thousand and three"}

# Defining Retrieval

In [65]:
"""defining a function that retrieves the most relevant document
"""

def retreive_relevent(prompt, documents=documents):
    #embedding the prompt once, rather than once per document
    prompt_embedding = embed_sequence(prompt)

    min_dist = float('inf')
    r_docname = ""
    r_doc = ""

    #keeping whichever document is closest to the prompt
    for docname, doc in documents.items():
        dist = calc_distance(prompt_embedding, embed_sequence(doc))

        if dist < min_dist:
            min_dist = dist
            r_docname = docname
            r_doc = doc

    return r_docname, r_doc


prompt = 'what pasta dishes do you have'
print(f'finding relevant doc for "{prompt}"')
print(retreive_relevent(prompt))
print('----')
prompt = 'what events do you guys do'
print(f'finding relevant doc for "{prompt}"')
print(retreive_relevent(prompt))

finding relevant doc for "what pasta dishes do you have"
('menu', 'ratatouille is a stew thats twelve dollars and fifty cents also gazpacho is a salad thats thirteen dollars and ninety eight cents also hummus is a dip thats eight dollars and seventy five cents also meat sauce is a pasta dish thats twelve dollars also penne marinera is a pasta dish thats eleven dollars also shrimp and linguini is a pasta dish thats fifteen dollars')
----
finding relevant doc for "what events do you guys do"
('events', 'on thursday we do karaoke and on tuesdays we do trivia')
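
The retrieval above returns only the single closest document. A small variation, sketched here under the assumption that you may want the k closest documents instead, sorts all documents by distance. `rank_documents` is a hypothetical helper. The `embed` and `distance` functions are passed in as parameters so that the sketch runs on its own with toy stand-ins; the notebook's `embed_sequence` and `calc_distance` could presumably be passed in their place.

```python
def rank_documents(prompt, documents, embed, distance, k=2):
    """Return the k (distance, docname) pairs closest to the prompt, nearest first."""
    prompt_embedding = embed(prompt)
    scored = [(distance(prompt_embedding, embed(doc)), name)
              for name, doc in documents.items()]
    return sorted(scored)[:k]

# Toy stand-ins: "embed" a text as [word count, count of the letter 'a'].
def toy_embed(text):
    return [len(text.split(' ')), text.count('a')]

def toy_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

toy_docs = {
    "short": "a cat",
    "medium": "a cat sat on a mat",
    "long": "a cat sat on a mat and ate a banana",
}
print(rank_documents("a bat", toy_docs, toy_embed, toy_distance))
```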


# Defining Retrieval and Augmentation

In [78]:
"""Defining retrieval and augmentation
creating a function that does retrieval and augmentation,
the result can be passed straight to the model
"""
def retreive_and_agument(prompt, documents=documents):
    docname, doc = retreive_relevent(prompt, documents)
    return f"Answer the customer's prompt based on the following documents:\n==== document: {docname} ====\n{doc}\n====\n\nprompt: {prompt}\nresponse:"

prompt = 'what events do you guys do'
print(f'prompt for "{prompt}":\n')
print(retreive_and_agument(prompt))

prompt for "what events do you guys do":

Answer the customer's prompt based on the following documents:
==== document: events ====
on thursday we do karaoke and on tuesdays we do trivia
====

prompt: what events do you guys do
response:


# Defining RAG and prompting OpenAI's LLM

In [67]:
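#pinning the version: the cells below use openai.Completion, which is not supported in openai 1.0 and later
!pip install openai==0.28.1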

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.1


In [68]:
#reading the API key from a file on Google Drive
from google.colab import drive
drive.mount('/content/drive')

with open("/content/drive/My Drive/Colab Notebooks/Credentials/OpenAI-danielDemoKey.txt", "r") as myfile:
    #stripping whitespace so a trailing newline doesn't end up in the key
    OPENAI_API_TOKEN = myfile.read().strip()


Mounted at /content/drive
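
Reading the key from Google Drive works in Colab but ties the notebook to one storage layout. A hedged alternative, assuming you are free to choose where the key lives, is to check an environment variable first and fall back to a file. `load_api_key` and the default variable name `OPENAI_API_KEY` are choices made for this sketch, not something the notebook defines.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY", fallback_path=None):
    """Return the key from the environment, else from a file, with whitespace stripped."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback_path is not None:
        with open(fallback_path, "r") as key_file:
            return key_file.read().strip()
    raise RuntimeError(f"no API key found in {env_var} or a fallback file")
```

Either way, the key stays out of the notebook itself, which matters if the notebook is shared.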


In [89]:
"""Using RAG with OpenAI's GPT model.
This uses the legacy Completions interface from openai 0.28 (openai.Completion).
"""

import openai
openai.api_key = OPENAI_API_TOKEN

prompts = ['what pasta dishes do you have', 'what events do you guys do', 'oh cool what is karaoke']

for prompt in prompts:

    ra_prompt = retreive_and_agument(prompt)
    response = openai.Completion.create(model="gpt-3.5-turbo-instruct", prompt=ra_prompt, max_tokens=80).choices[0].text

    print(f'prompt: "{prompt}"')
    print(f'response: {response}')

prompt: "what pasta dishes do you have"
response:  We have a variety of pasta dishes including meat sauce for $12, penne marinera for $11, and shrimp and linguini for $15.
prompt: "what events do you guys do"
response:  On Thursdays, we do karaoke and on Tuesdays, we do trivia.
prompt: "oh cool what is karaoke"
response:  Karaoke is a fun event where people can sing along to their favorite songs while the lyrics are displayed on a screen. Our karaoke night is held on Thursdays, so make sure to come and join us for a great time!
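
The loop above combines three steps: retrieve, augment, generate. Here is a minimal sketch of that flow with the model call injected as a function, so the pipeline can be exercised without an API key. `rag_answer`, `fake_augment`, and `fake_complete` are hypothetical names for this illustration. In the notebook, `retreive_and_agument` and a small wrapper around the OpenAI call would presumably take their place.

```python
def rag_answer(prompt, augment, complete):
    """Build the augmented prompt, send it to the model, and return the trimmed reply."""
    augmented_prompt = augment(prompt)
    return complete(augmented_prompt).strip()

# Stand-ins so the sketch runs offline; neither one calls a real model.
def fake_augment(prompt):
    return f"document: on thursday we do karaoke\n\nprompt: {prompt}\nresponse:"

def fake_complete(augmented_prompt):
    # Reports the size of its input instead of generating text.
    return f"  (model reply to {len(augmented_prompt)} characters of prompt)  "

print(rag_answer("what events do you guys do", fake_augment, fake_complete))
```

Keeping the model call behind a plain function also makes it easier to swap providers or client library versions without touching the retrieval code.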
