# This notebook is a helper for testing the embedding process

Some code from this notebook was inspired by https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

In [5]:
import os
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

In [4]:
COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

In [8]:
openai.api_key = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hardcoding it

In [9]:
prompt = "Who won the 2020 Summer Olympics men's high jump?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL,
)["choices"][0]["text"].strip(" \n")

"Marcelo Chierighini of Brazil won the gold medal in the men's high jump at the 2020 Summer Olympics."

In [10]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

In [11]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'

In [12]:
df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)

3964 rows in the data.


| title | heading | content | tokens |
|---|---|---|---|
| Volleyball at the 2020 Summer Olympics – Men's African qualification | Qualification | Seven CAVB national teams which had not yet qu... | 99 |
| Athletics at the 2020 Summer Olympics – Women's triple jump | Competition format | The 2020 competition continued to use the two-... | 102 |
| Swimming at the 2020 Summer Olympics – Men's 200 metre individual medley | Competition format | The competition consists of three rounds: heat... | 67 |
| Botswana at the 2020 Summer Olympics | Boxing | Botswana entered two boxers into the Olympic t... | 102 |
| Poland at the 2020 Summer Olympics | Table tennis | Poland entered three athletes into the table t... | 62 |


In [13]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

In [14]:
def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps the index of each row to its embedding vector.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }
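
Recomputing the embeddings costs one API call per row, so it can be worth caching the result on disk. The following is a minimal sketch using the `pickle` module that the notebook already imports. The names `load_or_compute`, `cache_path`, and `fake_compute` are hypothetical and not part of the original notebook; `fake_compute` only stands in for `compute_doc_embeddings(df)` so the example runs without network access.

```python
import os
import pickle
import tempfile
from typing import Callable


def load_or_compute(cache_path: str, compute: Callable[[], dict]) -> dict:
    """Return the cached dict if cache_path exists, otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for compute_doc_embeddings(df); records how often it is called.
calls = []

def fake_compute() -> dict:
    calls.append(1)
    return {("title", "heading"): [0.1, 0.2, 0.3]}


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(cache_path, fake_compute)   # computes and writes the cache
second = load_or_compute(cache_path, fake_compute)  # served from disk
```

In the notebook itself, `compute` would presumably be something like `lambda: compute_doc_embeddings(df)`.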

In [15]:
def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }
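
To make the expected CSV layout concrete, here is a small sketch that parses the same column scheme (`title`, `heading`, then `0`, `1`, ...) from an in-memory string. It uses the standard-library `csv` module rather than pandas, and the function name `load_embeddings_from_text` and the toy 3-dimensional rows are assumptions for illustration only.

```python
import csv
import io

# Two toy rows with 3-dimensional vectors; real ada-002 vectors have 1536 columns.
csv_text = """title,heading,0,1,2
Doc A,Summary,0.1,0.2,0.3
Doc A,History,0.4,0.5,0.6
"""


def load_embeddings_from_text(text: str) -> dict:
    reader = csv.DictReader(io.StringIO(text))
    # Every column other than the two key columns is a vector dimension.
    dims = sorted(int(c) for c in reader.fieldnames if c not in ("title", "heading"))
    return {
        (row["title"], row["heading"]): [float(row[str(i)]) for i in dims]
        for row in reader
    }


toy_embeddings = load_embeddings_from_text(csv_text)
```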

In [16]:
document_embeddings = load_embeddings("https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv")

# ===== OR, uncomment the line below to recalculate the embeddings from scratch. ========

# document_embeddings = compute_doc_embeddings(df)

In [17]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [0.0037565305829048, -0.0061981128528714, -0.0087078781798481, -0.0071364338509738, -0.0025227521546185]... (1536 entries)


In [18]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))
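
The docstring's claim can be checked on toy vectors: cosine similarity divides the dot product by the two vector lengths, so when both lengths are 1 the division changes nothing. This sketch uses plain Python instead of NumPy, and the helper names `dot`, `cosine`, and `normalize` are illustrative only.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

# For unit vectors the denominator of the cosine is 1, so both values agree.
same = math.isclose(dot(u, v), cosine(u, v))

# For unnormalized vectors the two measures generally differ.
differs = not math.isclose(dot([3.0, 4.0], [1.0, 2.0]), cosine([3.0, 4.0], [1.0, 2.0]))
```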

In [19]:
def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

In [20]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

[(0.8848643084506063,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')),
 (0.8633938355935517,
  ("Athletics at the 2020 Summer Olympics – Men's pole vault", 'Summary')),
 (0.8616397305838509,
  ("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')),
 (0.8560523857031264,
  ("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')),
 (0.8469039130441239,
  ("Athletics at the 2020 Summer Olympics – Men's 110 metres hurdles",
   'Summary'))]

In [21]:
order_document_sections_by_query_similarity("Who won the women's high jump?", document_embeddings)[:5]

[(0.8726165220223292,
  ("Athletics at the 2020 Summer Olympics – Women's long jump", 'Summary')),
 (0.8682196158313356,
  ("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')),
 (0.8631915263706721,
  ("Athletics at the 2020 Summer Olympics – Women's pole vault", 'Summary')),
 (0.8609374262115408,
  ("Athletics at the 2020 Summer Olympics – Women's triple jump", 'Summary')),
 (0.8581515607285686,
  ("Athletics at the 2020 Summer Olympics – Women's 100 metres hurdles",
   'Summary'))]

In [22]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
ENCODING = "cl100k_base"  # encoding for text-embedding-ada-002

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

In [23]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the most relevant document sections for the question and assemble them into a prompt.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
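
The selection loop in `construct_prompt` is a greedy token budget: it walks the sections in relevance order and stops at the first one that would exceed `MAX_SECTION_LEN`. The sketch below isolates that logic on made-up token counts; `select_sections` and the section names and sizes are hypothetical.

```python
def select_sections(ranked, max_tokens, separator_tokens):
    """ranked: list of (name, token_count) pairs, most relevant first."""
    chosen, used = [], 0
    for name, tokens in ranked:
        used += tokens + separator_tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(name)
    return chosen


# Token counts here are invented for illustration.
ranked = [("high jump", 326), ("long jump", 160), ("pole vault", 40)]
chosen = select_sections(ranked, max_tokens=500, separator_tokens=3)

# Because the loop breaks instead of skipping, a small later section is
# never considered once a larger one has overflowed the budget.
ranked_with_gap = [("a", 450), ("b", 100), ("c", 20)]
chosen_with_gap = select_sections(ranked_with_gap, max_tokens=500, separator_tokens=3)
```

Replacing `break` with `continue` would let smaller, lower-ranked sections fill the remaining space, at the cost of possibly including less relevant text.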

In [24]:
prompt = construct_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations h

In [25]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [26]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [27]:
answer_query_with_context("Who won the 2020 Summer Olympics men's high jump?", df, document_embeddings)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')


'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal.'

In [28]:
query = "Why was the 2020 Summer Olympics originally postponed?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 1 document sections:
('Concerns and controversies at the 2020 Summer Olympics', 'Summary')

Q: Why was the 2020 Summer Olympics originally postponed?
A: The 2020 Summer Olympics were originally postponed due to the COVID-19 pandemic.


In [29]:
query = "In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 2 document sections:
('2020 Summer Olympics medal table', 'Summary')
('List of 2020 Summer Olympics medal winners', 'Summary')

Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The United States won the most medals overall, with 113, and the most gold medals, with 39.


In [30]:
query = "What was unusual about the men’s shotput competition?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's shot put", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's discus throw", 'Summary')

Q: What was unusual about the men’s shotput competition?
A: The same three competitors received the same medals in back-to-back editions of the same individual event.


In [31]:
query = "In the 2020 Summer Olympics, how many silver medals did Italy win?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 2 document sections:
('Italy at the 2020 Summer Olympics', 'Summary')
('San Marino at the 2020 Summer Olympics', 'Summary')

Q: In the 2020 Summer Olympics, how many silver medals did Italy win?
A: 10 silver medals.


In [34]:
query = "What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?\nLet's think step by step."
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 3 document sections:
('France at the 2020 Summer Olympics', 'Taekwondo')
('2020 Summer Olympics medal table', 'Medal count')
('Taekwondo at the 2020 Summer Olympics – Qualification', 'Qualification summary')

Q: What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?
Let's think step by step.
A: France entered two athletes into the taekwondo competition at the Games. We don't know how many medals France won, so we can't answer this question. I don't know.


In [3]:
import pdfplumber
from typing import List, Dict
import json

import pytube
import whisper

import openai
import numpy as np
from docx import Document


import config

In [5]:
class Extract:
    def __init__(self):
        self.text_pages: List[str] = []
        
    def extract_pages(self, file_or_link_or_str: str, str_type: str) -> List[str]:
        if str_type == "text":
            self.text_pages = self.text2text_pages(file_or_link_or_str)
        elif str_type == "pdf":
            self.text_pages = self.pdf2text(file_or_link_or_str)
        elif str_type == "mp3":
            self.text_pages = self.mp3_to_text(file_or_link_or_str)
        elif str_type == "mp4":
            self.text_pages = self.mp4_to_text(file_or_link_or_str)
        elif str_type == "youtube":
            self.text_pages = self.youtube2text(file_or_link_or_str)
        elif str_type == "github":
            self.text_pages = self.github2text(file_or_link_or_str)
        elif str_type == "docx":
            self.text_pages = self.word_office_to_text(file_or_link_or_str)
        
        self.reformat_pages()
        return self.text_pages
            
    def text2text_pages(self, text: str, threshold: int=700):
        for chunk in text.split('. '):
            if self.text_pages and len(chunk)+len(self.text_pages[-1]) < threshold:
                self.text_pages[-1] += ' '+chunk+'.'
            else:
                self.text_pages.append(chunk+'.')
        return self.text_pages

    def pdf2text(self, pdf_path):
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for pages without a text layer
                text = (page.extract_text() or "").replace("\t", " ").replace("\n", " ").replace("\xa0", " ")
                self.text_pages.append(text)
        return self.text_pages

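    def mp3_to_text(self, mp3_path):
        # NOTE: WHISPER_MODEL_NAME is not defined anywhere in this notebook; it must be set before calling this.
        model = whisper.load_model(WHISPER_MODEL_NAME)
        # Wrap the transcript in a list so text_pages stays a list of pages, as reformat_pages expects.
        self.text_pages = [model.transcribe(mp3_path, language='english')["text"]]
        return self.text_pages

    def mp4_to_text(self, mp4_path):
        model = whisper.load_model(WHISPER_MODEL_NAME)
        self.text_pages = [model.transcribe(mp4_path, language='english')["text"]]
        return self.text_pages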

    def youtube2text(self, youtube_link):
        data = pytube.YouTube(youtube_link)
        video = data.streams.get_highest_resolution()
        video_path = video.download()
        return self.mp4_to_text(video_path)

    def github2text(self, github_repo_link):
        raise NotImplementedError("github2text")

    def word_office_to_text(self, word_file_path):
        document = Document(word_file_path)
        for para in document.paragraphs:
            self.text_pages.append(para.text)
        return self.text_pages
    
    def reformat_pages(self):
        low_thresh = 150
        high_thresh = 750
        
        reformatted_pages = [""]
        for page in self.text_pages:
            words_in_last_page = len(reformatted_pages[-1].split())
            words_in_cur_page = len(page.split())

            #condition 1, add page to last page
            if (words_in_last_page < low_thresh) and (words_in_last_page + 1 + words_in_cur_page < high_thresh):
                reformatted_pages[-1] += f"\n{page}"
            
            #condition 2, page too big, split in two
            elif (words_in_cur_page > high_thresh):
                half_page_i = len(page)//2
                reformatted_pages.append(page[:half_page_i])
                reformatted_pages.append(page[half_page_i:])
            
            #condition 3, add the page to a new reformatting page
            else:
                reformatted_pages.append(page)
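        # Drop empty pages (the initial "" placeholder can survive) so they are never sent for embedding.
        self.text_pages = [p.strip() for p in reformatted_pages if p.strip()]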
            
    def get_dict(self):
        # Embeds every page in self.text_pages and returns a tuple (pages_text, pages_embeddings),
        # where pages_embeddings is a nested list with one embedding vector per page.
        
        def get_embedding(page: str):
            result = openai.Embedding.create(model=config.EMBEDDING_MODEL, input=page)
            return result["data"][0]["embedding"]

        arr = np.array([get_embedding(page) for page in self.text_pages])
        return self.text_pages, arr.tolist()
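
`reformat_pages` splits an oversized page at its character midpoint, which can cut a word in half and only ever produces two pieces. One possible alternative, sketched here under the assumption that collapsing whitespace is acceptable, is to split on word boundaries into chunks of at most `max_words` words. The function name `split_by_words` is hypothetical.

```python
def split_by_words(page: str, max_words: int) -> list[str]:
    """Split a page into chunks of at most max_words words, never cutting a word."""
    words = page.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = split_by_words("one two three four five", max_words=2)
```

Note that `str.split()` followed by `" ".join(...)` normalizes runs of whitespace, including newlines, to single spaces.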


In [7]:
import openai
extract = Extract()


In [13]:
pdf_file = r"Test\AIPosNegFactor.pdf"
pdf_file = r"C:\Users\Henri\Documents\GitHub\WiseUp\src\Test\phil.pdf"

In [14]:
text_pages = extract.extract_pages(pdf_file, "pdf")

In [17]:
config.EMBEDDING_MODEL = "text-embedding-ada-002"

In [20]:
import os
openai.api_key = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hardcoding it

text_pages, embedding_pages = extract.get_dict()

In [21]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

In [30]:
vector_similarity(embedding_pages[0], embedding_pages[1])

0.9565584663693785

In [23]:
def get_embedding(text: str) -> list[float]:
    result = openai.Embedding.create(
      model=config.EMBEDDING_MODEL,
      input=text
    )
    return result["data"][0]["embedding"]

def order_document_sections_by_query_similarity(query: str, contexts):
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([(vector_similarity(query_embedding, get_embedding(context)), context) for context in contexts
    ], reverse=True)
    
    return document_similarities
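
The function above calls `get_embedding` once per page on every query, even though `extract.get_dict()` has already returned `embedding_pages`. A sketch of ranking against precomputed vectors instead is shown below. The name `rank_pages` and the toy 2-dimensional vectors are assumptions; in the notebook, `query_embedding` would come from `get_embedding(query)` and the other two arguments from `text_pages` and `embedding_pages`.

```python
def rank_pages(query_embedding, pages, page_embeddings):
    """Score each page by dot product with the query and sort best-first."""
    scored = [
        (sum(q * p for q, p in zip(query_embedding, emb)), page)
        for page, emb in zip(pages, page_embeddings)
    ]
    return sorted(scored, reverse=True)


# Toy data: one embedding per page, in the same order as the pages.
pages = ["page about rain", "page about tennis"]
page_embeddings = [[1.0, 0.0], [0.0, 1.0]]

ranked = rank_pages([0.6, 0.8], pages, page_embeddings)
top_page = ranked[0][1]
```

With this shape, each query needs a single embedding call regardless of how many pages there are.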


In [43]:
a = order_document_sections_by_query_similarity("Bayesian probabilities", text_pages)

In [44]:
for i in a[:10]:
    print(i[0])
    print(i[1])

0.8401311861016638
this with a simple example. Suppose you are wondering how likely it is to rain during an  upcoming tennis match. The problem is that you don’t remember where the tennis match  will take place. You think it might be in New York or Boston or LA. Your credences are  as follows:      Cr (NY) = 0.48    Cr (Boston) = 0.48    Cr (LA) = 0.04    Of course, how likely it is to rain during the match depends on where it will take place.  You have the following conditional credences reflecting this:      Cr (rain|NY) = 0.7    Cr (rain|Boston) = 0.9    Cr (rain|LA) = 0.1    In order to correctly compute Cr(rain), you need to plug your conditional credences of  the form (rain|place of match) and your unconditional credences about where the match  happens into the total probability theorem:    Cr (rain) = Cr(rain|NY) Cr(NY) + Cr(rain|Boston) Cr(Boston) + Cr(rain|LA) Cr(LA)  Cr (rain) = 0.772    This computation could be simplified if you disregarded the possibility that the match  m