<table align="center">
  <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ogirimah/generative-ai-workshop/blob/main/workshop_vector_database.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ogirimah/generative-ai-workshop/blob/main/workshop_vector_database.ipynb">
        <img src="https://i.ibb.co/xfJbPmL/github.png"  height="70px" style="padding-bottom:5px;"  />View Source on GitHub</a></td>
</table>

In [None]:
!pip install -Uq \
  openai \
  langchain \
  pinecone-client \
  tiktoken \
  datasets

In [None]:
from openai import OpenAI
from langchain.chat_models import ChatOpenAI

In [None]:
from google.colab import userdata

# Read the OpenAI API key from Colab's secrets manager
openai_api_key = userdata.get('OPENAI_API_KEY')

In [None]:
client = OpenAI(api_key=openai_api_key)

chat_client = ChatOpenAI(
    openai_api_key = openai_api_key,
    model_name = 'gpt-4',
    temperature=0.0
)

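Temperature: Controls the randomness of the model's predictions. The higher the value, the more random and creative the response. The OpenAI API accepts values from 0 to 2, although values between 0 and 1 are the most common; the 0.0 used above makes the output as deterministic as possible.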
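
To make the effect of temperature concrete, here is a minimal sketch of the textbook formulation, in which the model's raw scores (logits) are divided by the temperature before a softmax. The function name `softmax_with_temperature` and the example logits are made up for illustration, and this describes the general technique, not OpenAI's actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this formula")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones flatten the distribution. A temperature of exactly 0 cannot be plugged into this formula; it is commonly treated as always picking the top token.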

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'  # OpenAI embedding model (1536-dimensional vectors)

embeder = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=openai_api_key
)

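text-embedding-ada-002 is an OpenAI embedding model that maps each text to a 1536-dimensional vector. When this notebook was written, it was among the cheapest and best-performing embedding models OpenAI offered.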

In [None]:
test_text = [
    'This is used by most OpenAI models',
    'it is also one of the cheapest and best performing'
]

# Embed both sentences; we expect 2 vectors of 1536 dimensions each
result = embeder.embed_documents(test_text)
len(result), len(result[0])

# Vector Database (Pinecone)

We first create a Pinecone account and then create an API key.

You could also experiment with other vector databases that can run on your local machine, e.g. LanceDB, FAISS, Chroma and Qdrant. Details here: https://python.langchain.com/docs/modules/data_connection/vectorstores/
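
Whichever store you pick, the core operation is the same: keep vectors with their metadata and return the ones closest to a query vector. A minimal in-memory sketch of that idea, scored by cosine similarity (the metric used for the Pinecone index below), might look like this. `ToyVectorIndex` and its methods are hypothetical names for illustration, not part of any of the libraries mentioned, and a brute-force scan like this only suits small collections.

```python
import math


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector database (brute-force search)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._vectors = {}  # id -> (values, metadata)

    def upsert(self, vectors):
        """Insert or overwrite (id, values, metadata) tuples."""
        for vec_id, values, metadata in vectors:
            if len(values) != self.dimension:
                raise ValueError("vector has the wrong dimension")
            self._vectors[vec_id] = (list(values), dict(metadata))

    def query(self, vector, top_k=1):
        """Return the top_k (score, id, metadata) tuples, best match first."""
        scored = [
            (self._cosine(vector, values), vec_id, metadata)
            for vec_id, (values, metadata) in self._vectors.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy = ToyVectorIndex(dimension=2)
toy.upsert([
    ("a", [1.0, 0.0], {"text": "first"}),
    ("b", [0.0, 1.0], {"text": "second"}),
])
best = toy.query([0.9, 0.1], top_k=1)
print(best[0][1], best[0][2]["text"])
```

A hosted service such as Pinecone replaces the linear scan with an approximate nearest-neighbour index, so queries stay fast as the collection grows.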

In [None]:
index_name = 'llm-workshop-retrieval-augmentation'

In [None]:
import pinecone


PINECONE_API_KEY = userdata.get('pinecone-llmrag')

# The environment is shown next to the API key in the Pinecone console
PINECONE_ENVIRONMENT = userdata.get('Pinecone-Environment')

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if index_name not in pinecone.list_indexes():
    # we create a new index if it does not exist
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(result[0])  # 1536 for text-embedding-ada-002
        # Reading the dimension from a real embedding avoids hard-coding it
    )

Connect to the index and view its characteristics

In [None]:
pinecone_index = pinecone.Index(index_name)
# pinecone.GRPCIndex offers better performance,
# but it requires installing pinecone-client[grpc] rather than plain pinecone-client

pinecone_index.describe_index_stats()

The Pinecone index should have no namespaces and a vector_count of zero. These values will be populated once we have added our vectors. Note that if you re-run this notebook after adding data, the count will not be zero.

# Load the dataset from Hugging Face Hub

We will use the Hugging Face Datasets library to load the dataset and view its first record.

In [None]:
from datasets import load_dataset

dataset = load_dataset("ogirimah/ask_herts")

dataset['train'][0]
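
The indexing loop further down carries a note about deriving a title from whatever follows the last `/` in a record's source. A possible sketch of that step is shown here, assuming `source` is a URL or path. The helper name `title_from_source` and the example URL are hypothetical, and the actual field contents should be checked against the dataset.

```python
import re


def title_from_source(source):
    """Return the text after the last '/' in a URL or path, ignoring trailing slashes."""
    match = re.search(r"([^/]+)/*$", source)
    return match.group(1) if match else ""


# Hypothetical URL, used only to show the behaviour
print(title_from_source("https://example.com/library/opening-hours"))
```

The result could then be stored as a `title` entry in each record's metadata.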

# Indexing



# Create a Vector Store

We will now use LangChain to create a vector store backed by the Pinecone index we created above.

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

In [None]:
# Create the token length function and test it
def token_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

token_len('This is just a sample text to test the token_len function. '
          'The token length of this text is found below')

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # maximum chunk length, measured with token_len
    chunk_overlap=20,  # number of tokens shared by neighbouring chunks
    length_function=token_len,
    separators=['\n\n', '\n', ' ', '']
)
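
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries the separators in order so that chunks break at natural boundaries); it splits on whitespace as a stand-in for tokens, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, so a sentence cut at a boundary still appears with some of its context in the next chunk.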

In [None]:
from langchain.vectorstores import Pinecone

index = pinecone.Index(index_name)
# 'text' is the metadata key under which each chunk's text is stored
vector_store = Pinecone(index, embeder, 'text')

In [None]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []
documents = dataset['train']

for i, record in enumerate(tqdm(documents)):
    # first get metadata fields for this record
    metadata = {
        'doc-id': str(record['id']),
        'source': record['source'],
        # 'title': record['title'] # Use regular expression to take the string after the last /
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embeder.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

# Embed and upsert any chunks left over after the last full batch
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embeder.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
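
The accumulate-and-flush pattern in the loop above can also be factored into a small generator. This is a sketch only: `batched` is a hypothetical helper, and unlike the loop above, which checks the limit after adding all chunks of a record and so may send slightly more than `batch_limit` items, it caps every batch at exactly `batch_size`.

```python
def batched(items, batch_size):
    """Yield lists of at most batch_size items, including a final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left after the loop
        yield batch


sizes = [len(b) for b in batched(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Each yielded batch could then be embedded and upserted in one call, as the loop above does.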

In [None]:
pinecone.describe_index(index_name)

In [None]:
query = 'What is the meaning of LRC'

vector_store.similarity_search(query, k=1)

# The vector store reads each chunk's text from the 'text' metadata key,
# which the indexing loop above stores alongside every vector
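
Since the index is meant for retrieval augmentation, the retrieved chunks would typically be placed into the prompt sent to the chat model. A rough sketch of that step follows. `build_augmented_prompt`, the prompt wording, and the `max_chars` budget are all assumptions for illustration, and the placeholder strings stand in for real retrieved text.

```python
def build_augmented_prompt(question, contexts, max_chars=2000):
    """Hypothetical helper: join retrieved chunks into a prompt within a rough character budget."""
    selected = []
    used = 0
    for text in contexts:
        if used + len(text) > max_chars:
            break  # stop once the next chunk would exceed the budget
        selected.append(text)
        used += len(text)
    context_block = "\n---\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_augmented_prompt(
    "What is the meaning of LRC",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],  # placeholders, not real results
)
print(prompt)
```

With the LangChain documents returned by `similarity_search`, the contexts could be collected as `[doc.page_content for doc in docs]`.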

In [None]:
from langchain. import H

# Other Models

Most of us are familiar with ChatGPT, but there are other options:

**Bing Chat**
  Integrated into the Microsoft Edge browser and available as a mobile app and online. It was also recently released in preview on Windows 11.

**Claude**
  Anthropic's LLM, an alternative to ChatGPT - https://claude.ai

# LLM Platforms

**NVIDIA** **NeMo**
  A toolkit for building conversational AI models, part of the NVIDIA AI platform - https://www.nvidia.com/en-us/ai-data-science/products/nemo/get-started/?nvid=nv-int-unbr-268853

**AWS** **SageMaker**

  Amazon platform for building, training and deploying machine learning (ML) models - https://aws.amazon.com/sagemaker/

**AWS** **Bedrock**

  Amazon platform for working with generative foundation models - https://aws.amazon.com/bedrock/

**AWS** **Partyrock**

  An Amazon Bedrock Playground - https://partyrock.aws/