**File** :  demo for LangChain + Gemini

**Date** : 6.11.2024

**Author** : King Hang, WONG

**Description** :

Update : LangChain will not be used until the online FM deployment.

Working towards a RAG system that retrieves UseGalaxy and Nextflow code.

Split text into chunks: Splitting the text into chunks is a crucial step in building a RAG system. How the data is divided directly affects the relevance and accuracy of the documents retrieved for a query, and therefore the quality of the final output. Ideally the split is semantic, so that each chunk preserves the meaning of the text it contains and stays contextually coherent. The code below extracts the text from a PDF document and splits it with a token-based splitter (fixed chunk size with overlap), which only approximates semantic splitting.

In [1]:
!conda list | grep "google" | awk '{print $1}' | xargs -n 1 conda remove -y

'grep' is not recognized as an internal or external command,
operable program or batch file.


In [7]:
# %pip install pypdf2
# %pip install tokenizers
# %pip install langchain
# %pip install transformers
# %pip install tiktoken

In [8]:
from PyPDF2 import PdfReader
from transformers import BertTokenizerFast
from langchain_text_splitters.base import TokenTextSplitter

local_path = r'C:\Users\kingw\Documents\Bioplatform\bioinformatics_platform\agent_scripts\CMU writing-research-statement.pdf'

pdfreader = PdfReader(local_path)

# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    # Guard against pages where no text is extracted
    content = page.extract_text() or ''
    content = content.replace("\n-", "")
    content = content.replace("\n", "")  # note: this joins words across line breaks
    content = content.replace("•", ' ')
    if content:
        raw_text += content
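
Removing `\n` outright fuses the last word of one line with the first word of the next, which is visible as `goalsUse` in the chunk output further down. Below is a minimal alternative sketch, assuming hyphenated line breaks appear as `-\n` in the extracted text. `clean_page_text` is a hypothetical helper, not part of PyPDF2.

```python
import re

def clean_page_text(text):
    """Normalise text extracted from one PDF page (hypothetical helper)."""
    text = text or ""
    # Assumption: a word hyphenated across a line break shows up as "-\n"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace bullet characters with a space, as in the cell above
    text = text.replace("•", " ")
    # Collapse newlines and repeated spaces into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_page_text("research goals\nUse the state-\nment • now"))
```

Note that rejoining on `-\n` also merges genuine compounds that happen to break at a line end (for example `short-\nterm`), so this is a trade-off rather than a complete fix.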

In [9]:
max_tokens = 100
chunk_overlap = 20
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=max_tokens, chunk_overlap=chunk_overlap)
chunks = splitter.split_text(raw_text)
print(chunks)

['A research statement is a one to three page document that may be required to apply for an  academic job or (less frequently) graduate school. The purpose of a research statement is to describe the trajectory of your research to a selection/search committee. A research statement allows you to  show that you can take on independent resear ch  demonstrate your writing ability, independence as a r esearcher, and ability to earngrant money state your short-term and long-term resear ch goalsUse the', ' ability to earngrant money state your short-term and long-term resear ch goalsUse the research statement only to describe your research. Your research statement is one of  a number of documents (e.g., personal statement, teaching statement, statement of diversity,  resume/cv, cover letter, etc.) describing your academic career, so be discriminating and strategic about the information you include. Be sure to keep the spotlight on your ideas—not on you as  a person (remember,', ' Be sure to ke

In [11]:
%store chunks

Stored 'chunks' (list)
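
The token-based splitter cuts at a fixed token count, so chunks can end mid-sentence (the second chunk above ends with `(remember,`). A sentence-aware alternative can be sketched as follows. This is an illustration only: the regex sentence splitter, the word-count budget, and the names `split_sentences` and `chunk_sentences` are assumptions, not the splitter used in this notebook.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_sentences(text, max_words=100, overlap_sentences=1):
    """Pack whole sentences into chunks of roughly max_words words.

    The last `overlap_sentences` sentences of a chunk are repeated at the
    start of the next chunk, playing the role of chunk_overlap.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        size = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and size > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(demo, max_words=6, overlap_sentences=1):
    print(chunk)
```

A single sentence longer than `max_words` still becomes its own oversized chunk, so a token-based fallback would be needed for such cases.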


#### Testing various spellcheckers

Conclusion : neither one repairs the words that PDF extraction split apart (see the outputs below).

1. TextBlob
2. NLTK

In [10]:
# %pip install textblob
from textblob import TextBlob

text = "g enerate a pple"
blob = TextBlob(text)
corrected_text = str(blob.correct())
print(corrected_text)  # Hoped for "generate apple"; the actual output is below

g generate a pale


In [11]:
from nltk.corpus import words
import nltk 
nltk.download('words')
english_words = set(words.words())

def correct_splits(text):
    tokens = text.split()
    corrected_tokens = []
    for i in range(len(tokens) - 1):
        combined = tokens[i] + tokens[i+1]
        if combined in english_words:
            print(combined)
            corrected_tokens.append(combined)
        else:
            # corrected_tokens.append(tokens[i])
            pass
    # corrected_tokens.append(tokens[-1])  # Add the last word
    return ' '.join(corrected_tokens)

text = "g enerate an a pple pie"
corrected_text = correct_splits(text)
print(corrected_text)  # Intended: "generate apple"; the actual output is below

generate
ana
apple
generate ana apple


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\kingw\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
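
`correct_splits` tests every overlapping pair of tokens and drops tokens that do not merge, which is why `an` + `a` becomes `ana` and `pie` disappears. A possible refinement is sketched below. It consumes both tokens after a merge, keeps unmatched tokens, and merges only when at least one fragment is not a word on its own. The small `vocabulary` set is a hypothetical stand-in for the NLTK word list, and `merge_split_words` is an illustrative name.

```python
def merge_split_words(text, vocabulary):
    """Rejoin word fragments such as 'g enerate' without overlapping merges."""
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            combined = left + right
            # Merge only if the joined form is a word and at least one
            # fragment is not a valid word by itself.
            fragment = left.lower() not in vocabulary or right.lower() not in vocabulary
            if combined.lower() in vocabulary and fragment:
                merged.append(combined)
                i += 2  # both tokens are consumed
                continue
        merged.append(tokens[i])
        i += 1
    return " ".join(merged)

# Hypothetical toy vocabulary; "ana" is included because the NLTK list contains it
vocabulary = {"generate", "apple", "pie", "an", "a", "ana"}
print(merge_split_words("g enerate an a pple pie", vocabulary))
```

This still depends entirely on the word list: a fragment that happens to be a dictionary word (or a merged form that is missing from the list) will be left unrepaired.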


### Convert into embedding and store in vector db

Convert the text into embeddings & store into a vector database: In this step, each chunk of text is transformed into a numerical vector that captures its semantic meaning. These embeddings are stored in a vector database, which serves as the knowledge repository and can be queried for the chunks whose vectors are most similar to the query vector. The cells below create the embeddings with Gemini and store them in Typesense.
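
Before the actual setup, here is a toy sketch of what "retrieval by semantic similarity" means. The three-dimensional vectors and the texts are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions, and a vector database would replace the plain Python list.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical store of (text, embedding) pairs with hand-made vectors
store = [
    ("Nextflow pipeline definition", [0.9, 0.1, 0.0]),
    ("Galaxy workflow export", [0.7, 0.3, 0.1]),
    ("Research statement advice", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```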

Set up Google Cloud API according to this:  

https://cloud.google.com/iam/docs/keys-create-delete

If a quota-related error appears later on, set a quota project for the Application Default Credentials (the project here is `ubi-agent`):

`gcloud auth application-default set-quota-project ubi-agent`

In [3]:
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=Plvpbo5VCK4O4ecOyFbVytBMliC2AH&access_type=offline&code_challenge=CbPo4M97yHnG0J0eNIMHIjglno5smze0lF3WjWgkuPY&code_challenge_method=S256


Credentials saved to file: [C:\Users\kingw\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\Roaming\gcloud\application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
Cannot find a quota project to add to ADC. You might receive a "quota exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quot

In [5]:
# get my credentials
import os
credential_path = r'C:\Users\kingw\Documents\Bioplatform\application_default_credentials.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

In [14]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Typesense

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004", task_type="retrieval_document")

docsearch = Typesense.from_texts(
    chunks,
    embeddings,
    typesense_client_params={
        "host": "localhost",  # Use xxx.a1.typesense.net for Typesense Cloud
        "port": "8108",  # Use 443 for Typesense Cloud
        "protocol": "http",  # Use https for Typesense Cloud
        "typesense_api_key": "xyz",
        "typesense_collection_name": "gemini-with-typesense",
    },
)

GoogleGenerativeAIError: Error embedding content: 403 Request had insufficient authentication scopes. [reason: "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
domain: "googleapis.com"
metadata {
  key: "service"
  value: "generativelanguage.googleapis.com"
}
metadata {
  key: "method"
  value: "google.ai.generativelanguage.v1beta.GenerativeService.BatchEmbedContents"
}
]
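
The 403 above says the access token does not carry a scope accepted by `generativelanguage.googleapis.com`. One common workaround, assuming an API key for the Gemini API is available, is to pass the key explicitly instead of relying on Application Default Credentials. The sketch below is untested against this setup. `get_gemini_api_key` and `build_embeddings` are hypothetical helpers, and `GOOGLE_API_KEY` is the environment variable assumed to hold the key.

```python
import os

def get_gemini_api_key(env=None):
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; export an API key first")
    return key

def build_embeddings(api_key):
    """Create the embedding client with an explicit API key."""
    # Imported lazily so the key check above works without the package installed
    from langchain_google_genai import GoogleGenerativeAIEmbeddings
    return GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004",
        task_type="retrieval_document",
        google_api_key=api_key,
    )

# Usage (requires a valid key and network access):
# embeddings = build_embeddings(get_gemini_api_key())
```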

Query relevant documents and pass them to the LLM:

In [None]:
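from langchain.chains.question_answering import load_qa_chain

# NOTE: "gemini" alone is not a model identifier; "gemini-1.5-flash" is an assumed
# choice here, substitute whichever Gemini model your account can access.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", convert_system_message_to_human=True)
# "stuff" puts all retrieved documents into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")

question = "What is Scaled Dot-Product Attention?"

# Retrieve the chunks most similar to the question from the Typesense store
retriever = docsearch.as_retriever()
docs = retriever.invoke(question)
chain.run(input_documents=docs, question=question)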
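
With `chain_type="stuff"`, the retrieved documents are concatenated into one prompt together with the question. Here is a rough sketch of that idea in plain Python. The prompt wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved chunks into a single prompt for the LLM."""
    # Skip empty chunks and separate the rest with blank lines
    context = "\n\n".join(doc.strip() for doc in documents if doc.strip())
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_stuff_prompt(["First chunk.", "Second chunk."], "What is a research statement?"))
```

Because everything goes into one prompt, the number and size of retrieved chunks must fit within the model's context window.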