# Q & A on my Substack Blog #
* This notebook can be used to query my blog posts
* The Large Language Model (LLM) is `llama 3.1`
* I access the LLM (**Llama 3**) through **ollama** and **langchain**
* My starting reference was  https://www.datacamp.com/tutorial/llama-3-1-rag

## Set up the environment
Make sure that we are running in a `venv`. Then install all the packages listed in the following cell.  I comment each line out after the package is installed so that VS Code will not try to reinstall the package the next time I run the notebook.  There is no harm in trying to reinstall a package that is already installed, but it wastes time.

The first time you run this notebook on your computer, uncomment all lines in the following cell and run only the following cell.  It will `pip install` the required python packages used in the rest of the notebook.
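
A minimal sketch of how one could check from inside the notebook that the kernel really runs in a `venv`. The helper name `in_virtualenv` is my own and is not used elsewhere in this notebook; the check relies only on the standard `sys.prefix` and `sys.base_prefix` attributes.

```python
import sys

def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Running in a venv:", in_virtualenv())
```

If this prints `False`, the `pip3 install` lines below would install into the system-wide Python rather than into the project environment.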

In [1]:
#!pip3 install langchain_community
#!pip3 install bs4
#!pip3 install chromadb
# !pip3 install matplotlib
# !ollama pull nomic-embed-text
# !pip3 install langchain
# !pip3 install scikit-learn
# !pip3 install langchain-ollama
# !pip3 install pandas
# !pip3 install pyarrow
#
#
# !pip3 install --upgrade langchain langchain-community pydantic

## The function `md()`
I like using the markdown function **md** instead of **print**.

In [2]:
from IPython.display import display, Markdown
def md(s):
    display(Markdown(s))

import numpy as np


## Utilities ##
The following cell includes the basic functions and parameters.

**PARAMETERS**
* `SEP` This is the separator string I use in my blog post pages.  It separates a post into sections, each with its own content.  Each blog post will typically have 3 to 5 sections.  All sections from all posts are stored in `Docs_List`.
* `EMBEDDER` I use a quick embedder. The **llama3** embedding takes several hours.  The **nomic-embed-text** takes several minutes.
* `LLMNAME` **llama3**
  
**FUNCTIONS**
* `sepstr` Separates a string (typically the entire contents of one post) into consecutive sections using `SEP` as the separator.  I remove the starting section (the part of the blog post before the first occurrence of `SEP`).
* I access `ollama` using `langchain`.  Using `langchain` is not necessary.  It is just a convenience.
* `embedstr` creates the embedding vector for a string using `EMBEDDER` as default
* `docs2vectors` takes an array of documents and generates the vector store `varr`.  **IMPORTANT** This function uses the embedder in `Ollama` directly, not through my `embedstr`. Therefore, make sure that both calls use the same `EMBEDDER`.
* `cosine_similarity` The cosine similarity between two embedding vectors.  Higher values mean that the vectors are closer.
* `most_similar` finds the `n` vectors in the vector store `varr` that are closest to the argument vector.  Returns the indices of those documents together with their similarities.  **IMPORTANT** Make sure that `varr[i]` is the embedding vector for `Docs_List[i]` all the time.  Erase the vector store file from the local disk when you add new post titles to the post directory `listfilename`.  This is a file I keep in my Google Drive with public access.  Download it to your local folder and use it from that folder if you want to change it.  Otherwise, you may keep using it from my Google Drive address.
* `save_vectorstore` saves the vectors to the file VECTORSTORE.  The next time you run this notebook, new embeddings will not be created but will be read from the file VECTORSTORE.  Delete VECTORSTORE manually to regenerate the embeddings.
* `load_vectorstore` Loads the embedding vectors from the file.  It does not check if they are the right ones for the documents read from `listfilename`.
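
Because the saved vectors are never checked against the documents, a stale file silently breaks the rule that `varr[i]` belongs to `Docs_List[i]`. The following is only a sketch of one way to detect that, assuming it is acceptable to store a fingerprint next to the vectors. The names `fingerprint`, `save_checked` and `load_checked` are hypothetical and are not part of this notebook.

```python
import hashlib
import os
import pickle
import tempfile

def fingerprint(texts, embedder):
    # Hash the embedder name and every section text, in order
    h = hashlib.sha256(embedder.encode("utf-8"))
    for t in texts:
        h.update(b"\x00")
        h.update(t.encode("utf-8"))
    return h.hexdigest()

def save_checked(vectors, texts, embedder, file_path):
    # Store the fingerprint together with the vectors
    payload = {"fingerprint": fingerprint(texts, embedder), "vectors": vectors}
    with open(file_path, "wb") as f:
        pickle.dump(payload, f)

def load_checked(texts, embedder, file_path):
    # Returns the vectors, or None when the file is missing or stale
    try:
        with open(file_path, "rb") as f:
            payload = pickle.load(f)
    except FileNotFoundError:
        return None
    if not isinstance(payload, dict):
        return None  # old-style file without a fingerprint
    if payload.get("fingerprint") != fingerprint(texts, embedder):
        return None
    return payload["vectors"]

# Toy usage with made-up vectors and texts
path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
save_checked([[1.0, 0.0]], ["section one"], "nomic-embed-text", path)
fresh = load_checked(["section one"], "nomic-embed-text", path)
stale = load_checked(["section one", "a new section"], "nomic-embed-text", path)
print(fresh, stale)
```

With a check like this, adding a post or switching `EMBEDDER` would return `None` and trigger a rebuild instead of pairing documents with the wrong vectors.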

In [3]:
# Utilities

SEP="-+-+-+-+"
EMBEDDER="nomic-embed-text"
# EMBEDDER="llama3"   # slower alternative; uncommenting this line overrides the quick embedder above
VECTORSTORE="data/"+EMBEDDER+".pkl"
LLMNAME="llama3"
LOGFILE=""

# Open log file
# The following function reads the list of files in the "log" folder
# These files are named as "log1.md", "log2.md", ...
# All other files in the folder are ignored
# The function determines the next file name and opens it in write mode
def open_log():
    global LOGFILE
    import os
    import re
    from datetime import datetime
    # Create the "log" folder on the first run
    os.makedirs("log", exist_ok=True)
    # Collect the numbers of the existing "log<N>.md" files
    numbers = [int(m.group(1)) for m in
               (re.fullmatch(r"log(\d+)\.md", f) for f in os.listdir("log")) if m]
    # Determine the next file name
    next_file = "log" + str(max(numbers, default=0) + 1) + ".md"
    # Open the file in write mode
    LOGFILE = open("log/" + next_file, "w")
    LOGFILE.write("Date and time: " + str(datetime.now()) + "\n\n")
    return LOGFILE

# The following function closes the log file
def close_log():
    LOGFILE.close()

# The following function writes a string to the log file
def log2file(s):
    LOGFILE.write(s + "\n")
    LOGFILE.flush()

#
# The following function takes a text string `text` and separates it to a list
# of strings using the separator `sep`
def sepstr(text, sep=SEP, remove=1):
    # Split the text to a list of strings
    lst=text.split(sep)
    # Remove the empty strings
    lst=[s for s in lst if s]
    # Remove the first string
    if remove>0:
        lst=lst[remove:]
    return lst

# I access ollama using langchain.  Using langchain is not necessary.  It is just a convenience.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
LLM = Ollama(model=LLMNAME)

# The following function takes a text string and returns its embedding
#  The second argument is the name of the embedder to be used
def embedstr(text, embedder=EMBEDDER):
    # Initialize the Ollama embeddings
    embeddings = OllamaEmbeddings(model=embedder)
    # Embed the text
    embedding = embeddings.embed_query(text)
    return embedding

# The following function takes a list of documents and returns a list of embeddings
# The second argument is the name of the embedder to be used
def docs2vectors(docs, embedder=EMBEDDER):
    # Initialize the Ollama embeddings
    embeddings = OllamaEmbeddings(model=embedder)
    # Embed the documents
    varr = embeddings.embed_documents(docs)
    return varr

# The following function returns the cosine similarity between two embeddings
# The value lies between -1 and +1; higher values mean that the embeddings are closer
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# The following function returns the indices of the top `n` most similar embeddings to the given embedding
# The `v` is the embedding and the `varr` is the array of embeddings
def most_similar(v, varr, n=5):
    # Calculate the cosine similarity between the given embedding and all the embeddings in the array
    sims = np.array([cosine_similarity(v, v2) for v2 in varr])
    # Get the indices of the top `n` most similar embeddings
    indices = np.argsort(sims)[::-1][:n]
    return indices, sims[indices]

# The following function returns the similarity between two texts using the EMBEDDER
# def similarity(text1, text2, embedder=EMBEDDER):
#     # Embed the two texts
#     v1 = embedstr(text1, embedder)
#     v2 = embedstr(text2, embedder)
#     # Calculate the cosine similarity between the two embeddings
#     return cosine_similarity(v1, v2)

# # The following function returns the start of the string s1 in the string s2:
# def startof(s1, s2):
#     return s2.find(s1)

# Vector store
import pickle
# The following function is used to save the vectorstore to a file
# The `vectorstore` is the array of embeddings and the `file_path` is the path to save the file
def save_vectorstore(vectorstore, file_path=VECTORSTORE):
    # Open the file in write-binary mode and save the vectorstore
    with open(file_path, 'wb') as f:
        pickle.dump(vectorstore, f)

# The following function is used to load the vectorstore from a file
# The `file_path` is the path to load the file
def load_vectorstore(file_path=VECTORSTORE):
    # Open the file in read-binary mode and load the vectorstore
    with open(file_path, 'rb') as f:
        vectorstore = pickle.load(f)
    return vectorstore

# The following function returns True if the vectorstore file exists
def vectorstore_exists(file_path=VECTORSTORE):
    try:
        with open(file_path, 'rb') as f:
            return True
    except FileNotFoundError:
        return False


## Convert web page contents to embedding vectors:
`read_page_list` Read the list of the page URLs and page titles

`Docs_List` A Document array to hold the page contents

`Varr` Embedding vectors
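
As an illustration of the splitting step, here is a small self-contained sketch that does not need `langchain` or a network connection. The `Section` class, the `split_post` helper and the toy page text are assumptions made for this example; the notebook itself stores `langchain` Document objects in `Docs_List`.

```python
from dataclasses import dataclass

SEP = "-+-+-+-+"  # same separator string as in the Utilities cell

@dataclass
class Section:
    source: str  # URL of the post
    index: int   # position of the section within the post
    text: str    # content of the section

def split_post(source, page_text, sep=SEP):
    # Split on the separator and drop empty pieces
    parts = [p for p in page_text.split(sep) if p]
    # Drop the part before the first separator, as sepstr does with remove=1
    return [Section(source, i, p) for i, p in enumerate(parts[1:])]

# Toy page: a preamble followed by two sections
toy_page = "Header and menu" + SEP + "First section" + SEP + "Second section"
sections = split_post("https://example.com/p/toy-post", toy_page)
for s in sections:
    print(s.index, s.text)
```

Each section keeps the URL of its post and its own index, which is the information the short list later prints as "Section #".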

In [4]:
def readfromdrive():
    import requests
    listfilename="https://drive.google.com/file/d/1LA9LV9WyvoRZgHL8R7DGFbFzfTAMcKbC/view?usp=drive_link"
    # Modify the URL to make it a direct download link
    file_id = listfilename.split('/d/')[1].split('/')[0]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
    response = requests.get(download_url)
    return response.text
def readlocal():
    with open("data/general_textlist.txt", "r") as f:
        return f.read()

# The following function reads a text file from my Google Drive.  The file is organised as follows:
# 1. The first line is the first URL
# 2. The second line is the post title on that URL
# 3. The third line is the second URL
# 4. The fourth line is the post title on that URL
# 5. etc.
# The function returns two lists: the first list contains the URLs and the second list contains the titles.
def read_page_list():
    text=readlocal()
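    # Skip blank lines so that a trailing newline does not break the URL/title pairing
    lines = [ln for ln in text.split('\n') if ln.strip()]
    page_list=[]
    title_list=[]
    for i in range(0,len(lines)-1,2):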
        page_list.append(lines[i].strip())
        title_list.append(lines[i+1].strip())
        if i<3:
            print(lines[i].strip(), lines[i+1].strip())
        if i==4:
            print("...")
    print(lines[-1].strip(), lines[-2].strip())
    return page_list, title_list
# (p,t)=read_page_list()

from langchain_community.document_loaders import WebBaseLoader
import copy
# Load the URLs and titles
(urls, titles)=read_page_list()
print("Finished reading %d page addresses" % len(urls))
print("Now load them one by one")
# Load documents from the URLs
docs = [WebBaseLoader(url).load() for url in urls]
print("Finished loading %d documents" % len(docs))
print("Now split them using the separator %s" % SEP)

Docs_List=[]
for d in docs:
    sections=sepstr(d[0].page_content)
    isplit=0
    print(d[0].metadata['source'], end="")
    for s in sections:
        d2 = copy.deepcopy(d[0])  # Create a deep copy of d[0]
        d2.page_content=s
        d2.id=isplit
        isplit+=1
        # d2.metadata['source']=d.metadata['source']
        Docs_List.append(d2)
print("Finished splitting %d documents to %d" % (len(docs), len(Docs_List)))
#
if vectorstore_exists():
    Varr=load_vectorstore()
    print("Loaded vectorstore")
else:
    print("Creating vectorstore")
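    # embed_documents expects plain strings, so pass the text of each section
    Varr=docs2vectors([d.page_content for d in Docs_List])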
    save_vectorstore(vectorstore=Varr)
    print("Created and saved vectorstore")


https://halimgur.substack.com/p/imaginary-conversation-reflections Imaginary Conversation: Reflections on Israel and the Middle East
https://halimgur.substack.com/p/the-unbearable-fakeness-of-politics The Unbearable Fakeness of Politics: A Foreign Perspective on the 2024 US Election
...
Retirement: A Journey of Continued Relevance and Learning https://halimgur.substack.com/p/retirement-a-journey-of-continued
Finished reading 57 page addresses
Now load them one by one
Finished loading 57 documents
Now split them using the separator -+-+-+-+
https://halimgur.substack.com/p/imaginary-conversation-reflectionshttps://halimgur.substack.com/p/the-unbearable-fakeness-of-politicshttps://halimgur.substack.com/p/living-in-a-simulation-how-real-ishttps://halimgur.substack.com/p/the-ai-reduxhttps://halimgur.substack.com/p/join-my-new-subscriber-chat-on-ourhttps://halimgur.substack.com/p/when-will-you-diehttps://halimgur.substack.com/p/the-art-of-restraint-why-the-parishttps://halimgur.substack.com/

## Something funny about the length of the context provided to the llm ##
I did not really pursue this, but sometimes it helps to feed the `llm` just the content that has the required information rather than a long text that includes the required information along with other things.

For example, for the question="What is the average U.S. life expectancy?", the answer is in the first document in the list of most similar documents.  However, if I give the entire `page_content` of that document, *llama3* fails to find the answer.

Try `context_length=None` to verify this.

When I have `context_length=5000`, it works.  When I have it as `6000`, it does NOT work.

**Llama 3.1** is supposed to have a context window length of 128K tokens.  It looks, however, as if its attention is diluted when the context exceeds about 5000 characters (`context_length` counts characters, not tokens).  This is true at least for this question.
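
Before blaming the model, it may be worth confirming that the sentence holding the answer actually falls inside the truncated context. The sketch below mirrors the slicing `ragtext[0:context_length]` used in `query`. The helper name `answer_survives` and the toy text are assumptions for illustration only.

```python
def answer_survives(ragtext, needle, context_length=None):
    # query() keeps ragtext[0:context_length], so the limit counts characters
    context = ragtext if context_length is None else ragtext[0:context_length]
    return needle in context

# Toy document: the phrase of interest starts after 5500 filler characters
toy_text = "x" * 5500 + " life expectancy " + "y" * 3000

for limit in (5000, 6000, None):
    print(limit, answer_survives(toy_text, "life expectancy", limit))
```

If the phrase is present at both 5000 and 6000 characters in the real document, truncation is ruled out and the extra 1000 characters of unrelated text are the more likely cause of the failure.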

## Query functions ##
You can query in three modes:
### Exact Reference ###
Locate the post that refers to a person or to a specific object (keyword):
* Person example "Who is Grigory Potemkin"
* Keyword example "Locate electrolyser"


The syntax is strict.  It should be
* "Who is " + PersonName or
* "Locate " + ObjectName

The `Short_List` is populated with ALL the documents that contain the required person or keyword.  `showlist(Short_List)` prints them as a list.
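
The strict syntax can be sketched as a prefix check followed by a plain substring search. The function names `parse_exact_reference` and `find_sections` and the toy section texts are hypothetical; they only illustrate the idea and are not the code used in the cells below.

```python
def parse_exact_reference(text):
    # Returns the name or keyword after a strict prefix, or None for other input
    for prefix in ("Who is ", "Locate "):
        if text.startswith(prefix):
            term = text[len(prefix):].strip().rstrip("?")
            return term or None
    return None

def find_sections(term, section_texts):
    # Indices of every section whose text contains the term (case-sensitive)
    return [i for i, t in enumerate(section_texts) if term in t]

# Toy section texts standing in for the page_content of Docs_List entries
toy_sections = [
    "A section that mentions Grigory Potemkin.",
    "A section that mentions an electrolyser.",
    "A section about something else.",
]

term = parse_exact_reference("Locate electrolyser")
print(term, find_sections(term, toy_sections))
```

The search is case-sensitive, so "Locate Electrolyser" would miss a section that only contains "electrolyser".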

### Directed query ###
* Use Short List Document x to answer Question y: "SxQy"
* `S=x` All future queries will be answered using Short List document x
* `S=None` Clear an earlier `S=x`
* `Q+...` append '...' to `Questions`

### Chatbot-like usage ###
Any text string that does not start with "Who is " or "Locate " is treated as a genuine chatbot query.  The function `similars` finds the NSIMILAR documents closest to the query text.  The function `respond` gets the `llm` to respond by going through the top NSIMILAR documents.

`respond` starts from the closest document and moves on to the next one only if the response is "I do not know". The LLM is instructed to respond "I do not know" if it cannot retrieve the requested information from the attached document (which is attached as `context`).
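
The fallback logic can be tried without a running model. In this sketch the `ask` argument and the `fake_llm` function are hypothetical stand-ins for the real LLM call, and the contexts are toy strings.

```python
def respond_with_fallback(question, contexts, ask):
    # Try the contexts in order of similarity and stop at the first real answer
    for rank, context in enumerate(contexts):
        answer = ask(question, context)
        if "I do not know" not in answer:
            return rank, answer
    return None, ""

def fake_llm(question, context):
    # Toy stand-in: "answers" only when the context mentions the keyword
    return context if "efficiency" in context else "I do not know"

rank, answer = respond_with_fallback(
    "What is the efficiency?",
    ["A section about politics.", "A section about efficiency."],
    fake_llm,
)
print(rank, answer)
```

Returning the rank as well as the answer shows which of the short-listed documents produced the response.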

In [5]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama

expanded_question=""
Short_List=[]
Short_Sims=[]
Questions=[]
Refer=None
Xpand=True

def query(question, ragtext, context_length=None):
    llm = Ollama(model=LLMNAME)
    if ragtext is None:
        response=llm.invoke("Write a paragraph to answer the following question : "+question)
    else:
        prompt = """Based ONLY on the following information, answer the question. Do not use any external knowledge. If the answer cannot be found in the provided text, say 'I do not know'
                    Context: {context}

                    Question: {question}

                    Answer:"""
        # context_length counts characters, not tokens
        if context_length is None:
            context=ragtext
        else:
            context=ragtext[0:context_length]
        response = llm.invoke(prompt.format(context=context, question=question)) # response is a string
    return response
#

def keyword_references(word):
    indices=[]
    for i in range(len(Docs_List)):
        if word in Docs_List[i].page_content:
            indices.append(i)
    return indices

def showlist(lst):
    log2file("\n**Short list**\n\n")
    if len(lst)==0:
        log2file("There is no short list")
        return
    for idx, doc in enumerate(lst, start=0):
        s=f"* {idx}. {doc.metadata['source']}"
        s+=f" Section #{doc.id}"
        s+=f" --> '{doc.page_content[0:20]}'"
        s+="\n"
        log2file(s)


NSIMILAR=3
# The following function populates Short_List[] with the most similar documents to the given question
# It returns True to continue with the LLM, or False to show the list of similar documents
def similars(question, n=NSIMILAR, show=False):
    global expanded_question, Short_List, Short_Sims
    askllm=False
    if Xpand:
        expanded_question=query(question, None)
        s=(f"Find {n} docs closest to the expanded text:\n{expanded_question}\n")
    else:
        expanded_question=question
        s=(f"Find {n} docs closest to the question:\n{question}\n")
    log2file(s)
    print(s)
    similardoc_indices, Short_Sims = most_similar(embedstr(expanded_question), Varr, n=n)
    askllm=True
    #
    Short_List = [Docs_List[i] for i in similardoc_indices]
    if show:
        showlist(Short_List)
    return askllm

#
def respond(question, similardocs):
    for i in range(len(similardocs)):
        response= query(question, similardocs[i].page_content, None)
        # s=similardocs[i].metadata['source']
        # s+=f" Section #{similardocs[i].id}"
        # s+=f" --> '{response[0:14]}'"
        # s+="\n\n"
        log2file(f"{i}. `{response}`\n")
        if 'I do not know' not in response:
            return response
    return ""
# The following function returns the summary of the page_content of a document
def summarise(doc):
    log2file("\nSummarise "+doc.metadata['source']+" #"+str(doc.id)+"\n\n")
    llm = Ollama(model="llama3")
    response = llm.invoke("Please summarise the following text: "+doc.page_content)
    log2file(response)
    return response

# The following function checks if the string argument is in the list Questions[].
# If it is not, it adds it to the Questions[] list.
def addquestion(question):
    if question not in Questions:
        Questions.append(question)
        return True
    return False

# The following function reads text from a file, creates a Doc object with page_content being the read text
# and appends it to Short_List
def read2short(file_name):
    global Short_List
    with open(f"data/{file_name}.txt", 'r') as f:
        text = f.read()
    d2 = copy.deepcopy(Short_List[-1])  # Clone the last short-listed document as a template (Short_List must not be empty)
    d2.page_content=text
    d2.id=0
    d2.metadata['source']=file_name+'.txt'
    Short_List.append(d2)

# String s is of the form "xxx abcde" where xxx is an integer and abcde is a string
# The function returns the integer xxx and the string abcde
def getiands(s):
    i=s.find(" ")
    if i==-1:
        return None, None
    return int(s[0:i]), s[i+1:]

# Check if the argument string is of the form "SxxxQyyy" where xxx and yyy are integers
# and return the values of xxx and yyy
def getindexes(s):
    if s[0]=="S":
        s=s[1:]
    if s[0]=="Q":
        s=s[1:]
    i=s.find("Q")
    if i==-1:
        return None, None
    return int(s[0:i]), int(s[i+1:])
# print(getindexes("S123Q456"))

## User interface ##
The user interface is not fancy.  I use the python `input()` function.  This appears as a window at the top of the notebook window in VS Code.

|Command|What it does|
|:---:|-----------|
|blank line|_Quit_|
|!ASxQy|_Answer the question #y using the short listed doc #x_|
|!AQy|_Answer question #y using the short-list document #x selected by the last `!ASxQy` or `!CSx` command_|
|!CSx|_Condense short-list doc #x_|
|!Ftext|_Find the documents that contain text_|
|!PS|_List all short-listed documents_|
|!PQ|_List all past questions_|
|!RS file_name|_Read `file_name`.txt into a document, and append to Short_List_|
|!SSx file_name|_Save Short List Document #x to `file_name`.txt_|
|!+Qtext|_Add text as a new question to `Questions`_|
|!XT|_Expand the question (stop expanding with !XF)_|
|text|_Find the most similar NSIMILAR documents and try to answer_|

The following cell gets the user input.  `getquery` returns:
* `Q` (quit) if the entry is blank
* `None` if the entry starts with `!`.  The rest of the entry is passed to `xqtcommand`, which executes the commands in the table above.
* The entry text for any other user input, after adding it to `Questions`.
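
The command handler in the next cell indexes the entry character by character and calls `int()` directly, so a malformed entry such as `!P` or `!CSx` raises an exception. The following is a sketch of validating a command with regular expressions first. It covers only three of the commands in the table, and the names `COMMAND_PATTERNS` and `parse_command` are hypothetical.

```python
import re

COMMAND_PATTERNS = [
    ("answer", re.compile(r"AS(\d+)Q(\d+)")),   # !ASxQy
    ("answer_last", re.compile(r"AQ(\d+)")),    # !AQy
    ("condense", re.compile(r"CS(\d+)")),       # !CSx
]

def parse_command(entry):
    # Returns (name, [integer arguments]) for a recognised command, else None
    if not entry.startswith("!"):
        return None
    body = entry[1:]
    for name, pattern in COMMAND_PATTERNS:
        m = pattern.fullmatch(body)
        if m:
            return name, [int(g) for g in m.groups()]
    return None

print(parse_command("!AS2Q10"))
print(parse_command("!CSx"))
```

A `None` result can then be logged as "unknown command" instead of stopping the input loop with a traceback.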

In [6]:
def xqtcommand(q):
    global Refer, Short_List
    # q=q.upper()
    if q[0]=='P':
        if q[1]=='S':
            showlist(Short_List)
        elif q[1]=='Q':
            # Use a separate loop variable so that the command string q is not overwritten
            for i, past in enumerate(Questions, start=0):
                log2file(f"{i:03d}. {past}"+"\n\n")
    elif q[0]=='A':
        if q[1]=='S':
            [x,y]=getindexes(q[1:])
            if x!=None and y!=None:
                s=f"Answer question {y} in the short-listed document {x}"
                log2file(s+"\n\n")
                if x>=0 and x<len(Short_List):
                    response=respond(Questions[y], [Short_List[x]])
                    log2file(response+"\n\n")
                    Refer=x
                else:
                    log2file("Index out of range")
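        elif q[1]=='Q':
            y=int(q[2:])
            if Refer is None:
                # No document has been selected by an earlier !ASxQy or !CSx command
                response="No short-list document selected"
            elif y>=0 and y<len(Questions):
                response=respond(Questions[y], [Short_List[Refer]])
            else:
                response="Question index out of range"
            log2file(response+"\n\n")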
    elif q[0]=='S':
        if q[1]=='S':
            [x,filename]=getiands(q[2:])
            if x!=None and filename!=None:
                with open(f"data/{filename}.txt", 'w') as file:
                    file.write(Short_List[x].page_content)
    elif q[0]=='+':
        if q[1]=='Q':
            addquestion(q[2:]) 
    elif q[0]=='C':
        if q[1]=='S':
            Refer=int(q[2:])
            if Refer>=0 and Refer<len(Short_List):
                summarise(Short_List[Refer])
            else:
                log2file("Index out of range")
    elif q[0]=='F':
        similardoc_indices=keyword_references(q[1:])
        Short_List = [Docs_List[i] for i in similardoc_indices]
        showlist(Short_List)
    elif q[0]=='R':
        if q[1]=='S':
            s=q[2:].lstrip()
            read2short(s)
    elif q[0]=='X':
        global Xpand
        if q[1]=='T':
            Xpand=True
        else:
            Xpand=False
            
                

def getquery():
    global Refer
    log2file("---")
    s=f"X={Xpand}; "
    if Refer!=None:
        s+=f"Ref={Refer}. "
    s+= "Input :"
    print(s, end="")
    question=input("")
    print(question)
    s+=question+"\n"
    log2file(s)
    if question=="":
        return 'Q'
    if question[0]=="!":
        xqtcommand(question[1:])
        return None
    addquestion(question)
    return question

## Keep calling `getquery` until the user enters a blank string ##
The following cell keeps calling `getquery` in a loop.  The output keeps adding up so you have access to the entire chat.  The previous questions and answers are not fed back to the LLM, so this is not a true chat.  It is relatively easy to add a memory that appends the previous questions and the LLM answers to the context for the new question, but I need to get more familiar with the way the LLM works before I do that.  I keep it simple at this stage.
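
For reference, a memory of this kind could be as simple as prepending the last few question and answer pairs to the prompt. This is only a sketch under that assumption. The function `build_prompt`, its `max_turns` parameter and the prompt wording are hypothetical and are not used in the loop below.

```python
def build_prompt(question, context, history, max_turns=3):
    # history is a list of (question, answer) pairs, oldest first
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [f"Context: {context}", ""]
    for q, a in recent:
        lines.append(f"Earlier question: {q}")
        lines.append(f"Earlier answer: {a}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

# Toy history with placeholder questions and answers
history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
prompt = build_prompt("q5", "ctx", history, max_turns=2)
print(prompt)
```

Limiting the history to a few turns keeps the prompt short, which matters here given the sensitivity to context length noted earlier in this notebook.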

In [7]:

# Open the log file
open_log()
question=getquery()
while question!="Q":
    if question is not None:
        print("Question : " + question, end="")
        askllm = similars(question, show=True)
        if askllm:
            response=respond(question, Short_List)
            # log2file(response)
            print(response)
        else:
            print("n/a")
    question=getquery()
close_log()


X=True; Input :What are typical electrolyser efficiencies?
Question : What are typical electrolyser efficiencies?Find 3 docs closest to the expanded text:
Typical electrolyzer efficiencies, also known as faradaic efficiency or current efficiency, vary depending on the type of electrolyzer and its specific design. In general, most alkaline electrolyzers have efficiencies ranging from 70% to 85%, while proton exchange membrane (PEM) electrolyzers typically achieve efficiencies between 80% and 90%. Solid oxide electrolyzers can reach efficiencies as high as 95% or more, although these are still relatively rare and typically used in specialized applications. Advanced technologies like high-temperature electrolyzers and liquid metal-based electrolyzers have also demonstrated efficiencies above 90%, but these are still in the early stages of development. Overall, while there is a range of efficiencies depending on the specific technology, most commercial-scale electrolyzers operate at effici