<a href="https://colab.research.google.com/github/Fuenfgeld/ChatGPTHackathon/blob/main/TextEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🚀 🌐 Get Ready for the DIA Hackathon Extravaganza! 🌐 🚀

Hello, innovative minds of the DIA unit! 🧠

We are excited to share this Jupyter Notebook, designed as a starting point for our epic journey at the DIA Hackathon on May 12th, 2023. This notebook will help you dive into the world of text embeddings using OpenAI's text-embedding-ada-002 model, providing you with the foundation for the project.

🎯 Our Mission: Use the awe-inspiring power of ChatGPT to streamline Clinnova's study protocol and data protection notice, enabling seamless Q&A sessions for study participants.

The notebook demonstrates how to:

- Install required packages
- Mount Google Drive and import necessary modules
- Load Clinnova's text from a file
- Define and use the get_embedding function to obtain text embeddings from the OpenAI API

With this foundation, you'll be well-equipped to tackle the challenges ahead, working with string embeddings and vector databases, and using the mighty OpenAI API services to revolutionize the world of clinical studies.

Get ready to network with the brightest minds, enjoy delicious snacks and drinks, and be part of a groundbreaking project! 🎉

We can't wait to see you on May 12th, 2023, and witness the incredible solutions you'll create!

Best regards,

The DIA Hackathon Team

In [None]:
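# openai is pinned below 1.0 because this notebook calls openai.Embedding.create, which was removed in openai 1.0
%pip install --upgrade "openai<1"
%pip install tiktoken tenacity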

In [None]:
from google.colab import drive
drive.mount('/content/drive')

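import sys
# Change 'your_folder' to the Drive folder that contains APIkey.py
sys.path.append('/content/drive/MyDrive/your_folder')
# APIkey.py is expected to define a variable API_key holding your OpenAI key
from APIkey import API_key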

import openai
import tiktoken
from tenacity import retry, wait_random_exponential, stop_after_attempt
import pandas as pd

In [None]:
with open('/content/drive/MyDrive/your_folder/ClinnovaDocs.txt') as f:  # change 'your_folder' to your Drive folder
    ClinnovaText = f.read()

In [None]:
openai.api_key = API_key

In [None]:
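# Retry up to 6 times with randomized exponential backoff (waits capped at 20 s)
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, model="text-embedding-ada-002") -> list[float]:
    """Return the embedding vector for text (uses the pre-1.0 openai package interface)."""
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]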

In [None]:
embedding = get_embedding("Hello World", model="text-embedding-ada-002")
print(len(embedding))
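
Once texts are embedded, a common way to use the vectors for Q&A is to rank document chunks by cosine similarity to the question's embedding. The sketch below is not part of the original notebook: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real embeddings returned by get_embedding.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, chunk_vecs):
    """Return chunk indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real embeddings (assumption for illustration)
query = [1.0, 0.0, 0.0]
chunks = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_similarity(query, chunks)
print(ranking)
```

In a real run, `query` would be the embedding of a participant's question and `chunks` the embeddings of passages from the Clinnova text.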

# More Information
[Embedding of long text](https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb)

The linked OpenAI Cookbook notebook demonstrates how to handle texts that are longer than an embedding model's maximum context length, using text-embedding-ada-002 as the example. The maximum context length is measured in tokens, and exceeding it causes an error. The notebook covers two main approaches for longer texts: truncation and chunking.

Truncation: The input text is truncated to the maximum allowed length. The notebook provides a function, truncate_text_tokens, that takes care of the tokenization and truncation process.
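
The truncation idea can be sketched roughly as follows. This is an illustration rather than the cookbook's exact code: the function names are hypothetical, and the 8191-token limit and the cl100k_base encoding are assumed values for text-embedding-ada-002 that you should check against the current model documentation.

```python
def truncate_tokens(tokens, max_tokens=8191):
    """Keep only the first max_tokens token ids (8191 is an assumed limit)."""
    return list(tokens[:max_tokens])

def truncate_text_to_tokens(text, encoding_name="cl100k_base", max_tokens=8191):
    """Tokenize text with tiktoken, then truncate the resulting token list."""
    import tiktoken  # imported here so truncate_tokens works without tiktoken
    encoding = tiktoken.get_encoding(encoding_name)
    return truncate_tokens(encoding.encode(text), max_tokens)

# Truncation on a plain list of token ids
print(truncate_tokens(list(range(10)), max_tokens=4))
```

If a string is needed rather than token ids, the truncated list can be turned back into text with the encoding's decode method.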

Chunking: The input text is divided into chunks and each chunk is embedded individually. The notebook provides a function, len_safe_get_embedding, which handles chunking and embedding. The embeddings can be returned as a weighted average or as a list of individual chunk embeddings.
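
As a rough sketch of the weighted-average option, the chunk embeddings can be averaged with each chunk weighted by its token count, then rescaled to unit length. The function name `combine_chunk_embeddings` is hypothetical, and the pure-Python arithmetic is a simplification chosen to keep the example dependency-free.

```python
import math

def combine_chunk_embeddings(chunk_embeddings, chunk_lens):
    """Average chunk embeddings weighted by chunk length, then normalize."""
    total = sum(chunk_lens)
    dim = len(chunk_embeddings[0])
    averaged = [
        sum(emb[i] * n for emb, n in zip(chunk_embeddings, chunk_lens)) / total
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in averaged))
    # Leave an all-zero vector untouched instead of dividing by zero
    return [x / norm for x in averaged] if norm else averaged

# Two toy chunk embeddings; the first chunk is three times longer
combined = combine_chunk_embeddings([[1.0, 0.0], [0.0, 1.0]], [3, 1])
print(combined)
```

Longer chunks pull the combined vector towards their own direction, which is why the first component dominates in this toy case.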

The notebook also includes the utility functions batched and chunked_tokens, which len_safe_get_embedding builds on to split the token sequence into chunks.
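
A minimal sketch of such a batching helper, working on a plain list of token ids, is shown below. `batched` follows the well-known itertools recipe; `chunk_token_ids` is a hypothetical name for illustration and is not the cookbook's chunked_tokens, which also performs the tokenization.

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive tuples of up to n items; the last one may be shorter."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

def chunk_token_ids(tokens, chunk_length):
    """Split a sequence of token ids into lists of at most chunk_length ids."""
    return [list(chunk) for chunk in batched(tokens, chunk_length)]

# Seven toy token ids split into chunks of three
print(chunk_token_ids([1, 2, 3, 4, 5, 6, 7], 3))
```

Each resulting chunk can then be embedded on its own and the results kept as a list or combined into a single vector.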