#### CS1 - Install Libraries, define main variables, and some basic functions ####
You can convert this notebook to HTML by entering the following command in the VS Code terminal window:

`jupyter nbconvert --no-input --to html retrievals.ipynb`

_Make sure that the terminal window is running in the same virtual environment.  On my Mac computer, if the terminal prompt is `(.venv) $ ` then I know it is running in the virtual environment.  Otherwise, I enter `source venv/bin/activate` to put it into the virtual environment_

**CS1, CS2, CS3, etc. are Code cells in the notebook.  The code does not appear in the HTML output.  You have to look into the notebook file (`retrievals.ipynb`) to see the code contained in CS1, CS2, etc.**

In [27]:
#%pip install ipywidgets

In [28]:
# IMPORTANT :
# I am running this in the local virtual environment venv.  The python libraries installed when running introduction.ipynb
# do not have to be re-installed here as they are already installed in the virtual environment.
#
# The following are the additional libraries used in the notebook
#
# I set my OPENAI_API_KEY in a .env file.  You can also set it in your environment variables.
# The following two lines read the .env file and set the environment variable.
import dotenv
dotenv.load_dotenv()
import os
print("Current Working Directory:", os.getcwd())
import math
import numpy as np
# import melib
# from melib.xt import mdx
# import the functions in the local python file ute.py
from ute import set_openai_key, rdweb, sepstr, get_embedding, SEP, cosine_similarity
# The following are the variables used in the notebook

PIE=math.pi
SECTION=0
MD=None
Chapter="retrievals"
#
# Define the md() function to display markdown text
from IPython.display import display, Markdown
def md(s):
    display(Markdown(s))

# Establish OpenAI API key (see below for how to get one)
# import os
# import openai
# from openai import OpenAI
# openai.api_key = os.getenv("OPENAI_API_KEY")
# client = OpenAI()
# LLM="text-embedding-ada-002"

Current Working Directory: /Users/Halim/git/probot


#### CS1 Ends ####

#### CS2 ####

Program flow control variables

In [29]:
RECREATE_SHORT=False
RECREATE_FULL=False

#### CS2 Ends ####

#### CS3 ####
Computations specific to this notebook.  The results are used in the text.

In [30]:
# The following function creates a list of all pages under the given url and saves it to a file
def create_page_list(url):
    import requests
    from bs4 import BeautifulSoup
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a')
    page_list=[]
    for link in links:
        s=link.get('href')
        print(s)
        if s is not None and not s.endswith('/comments') and s not in page_list:
            page_list.append(s)
    with open('page_list.txt', 'w') as f:
        for item in page_list:
            f.write("%s\n" % item)
    return page_list

In [31]:
# The following function reads the file `filename`. This file is organised as pairs
# of lines. The first line in each pair is the url address and the second line is the
# title of the page.  The function returns a list of tuples where each tuple is the
# url and the title.
def read_page_list(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    page_list=[]
    for i in range(0,len(lines)-1,2):  # stop before a trailing unpaired line
        page_list.append((lines[i].strip(),lines[i+1].strip()))
    return page_list
fullfile='data/general_list.txt'
full_list=read_page_list(fullfile)
shortfile='data/short_list.txt'
short_list=read_page_list(shortfile)
short_list_embedfile='data/short_list_embed'
full_list_embedfile='data/full_list_embed'
# for pair in page_list:
#     print(pair)

# The following function opens the text file `filename`. This file is organised as pairs
# of lines. The first line in each pair is the url address and the second line is the
# title of the page.  The function reads the `n`th pair of lines and returns the url and
# the title.
def read_page(filename,n):
    with open(filename, 'r') as f:
        lines = f.readlines()
    return lines[2*n].strip(),lines[2*n+1].strip()

# (url, title)=read_page(fullfile,8)
# print(url)
# print(title)

In [32]:
# Read the web pages in the short_list, split each page into sections at the
# separator SEP (imported from ute) and embed each section separately.  Create the embedding array.
# The embedding array is saved in the file 'data/short_list_embed.npy'
# The embedding array is a numpy array of shape (N, D+2) where N is the number of sections,
# D is the length of the vector returned by get_embedding, and the last two columns hold
# the page index and the segment index.

embeddings=[]

def headandtail(embeddings, MD):
    def writeone(v, MD):
        MD.write("%.5f, %.5f, %.5f, ..., %.5f, %3d, %3d\n\n"%(v[0], v[1], v[2], v[-3],v[-2],v[-1]))
    for (iv,v) in enumerate(embeddings[0:3]):
        writeone(v, MD)
    MD.write("...\n\n")
    for (iv,v) in enumerate(embeddings[-3:]):
        writeone(v, MD)

def create_embeddings(filename, listname):
    global embeddings
    embeddings=[]
    set_openai_key()
    for (itext,pair) in enumerate(listname):
        url=pair[0]
        text=rdweb(url, None)
        sa=sepstr(text, SEP)
        MD.write("%d. %s : "%(itext+1,pair[1]))
        for (isegment, s) in enumerate(sa):
            v=get_embedding(s)
            w=np.append(v,[itext,isegment])
            embeddings.append(w)
        # isegment is the last index, not the count, so report len(sa) instead
        MD.write("%d segments embedded to %d-long vectors"%(len(sa), len(v)))
        MD.write("\n")
    # Save the embeddings to a file
    np.save(filename,embeddings)
    MD.write("Embeddings saved in file %s.npy\n\n"%filename)
    MD.write("### Extra two members for the embedding vectors ###\n\n"
             "The last two members of each embedding vector are the index of the page and "
             "the index of the segment in the page.\n"
             "To demonstrate this, I print below the first three and last three members "
             "of the first three and last three embedding vectors:\n\n")
    headandtail(embeddings, MD)
    MD.write("\n\nWe have to remember to discard the last two numbers when we compare "
             "the query text embedding against these embedding vectors.\n\n")
    
def load_embeddings(filename):
    global embeddings
    embeddings=np.load(filename+".npy")
    MD.write("Embeddings were read from file %s.npy\n\n"%filename)
    MD.write("The following shows the first three and last three members of each embedding vector:\n\n")
    headandtail(embeddings, MD)
    MD.write("\n\nThe last two are the index to the page list and the index to the segment in the page.\n\n")


### CS3 ends ###

In [33]:
TOC=["Getting my posts ready for embedding", "Create embeddings for the short list",  \
     "Create embeddings for the long list", "Simple queries"
     ]
MD=mdx(Chapter, SECTION, title="MyBlogPosts")
MD.toc(TOC,"2023")
#
# 
MD.write("This notebook is about creating embeddings for my Substack blog posts. "
         "I will use the OpenAI API to generate embeddings. "
         "To run this notebook you need to have an OpenAI account.  "
         "Do not try to run this notebook without trying my earlier notebooks:\n\n"
         "* [`introduction.ipynb`](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb)\n"
         "* [`embeddings.ipynb`](https://github.com/Gurgenci/probot/blob/main/embeddings.ipynb)\n\n"
         "Also read my earlier posts under `AI Stuff`.\n\n")
if RECREATE_SHORT:
    MD.write("**Important** : Short List Embeddings were recreated and saved into the `data` folder. "
             "Make RECREATE_SHORT False if you do not want to recreate the short list embeddings in future RUN ALLs.  \n\n")
if RECREATE_FULL:
    MD.write("**Important** : Long List Embeddings were recreated and saved into the `data` folder. "
             "Make RECREATE_FULL False if you do not want to recreate the long list embeddings in future RUN ALLs.  \n\n")
md(MD.out())

# MyBlogPosts #

#### Table of Contents ####

_2023_

|Section|Title|
|:------|:-------|
|1|<a href="#Getting-my-posts-ready-for-embedding">Getting my posts ready for embedding</a>|
|2|<a href="#Create-embeddings-for-the-short-list">Create embeddings for the short list</a>|
|3|<a href="#Create-embeddings-for-the-long-list">Create embeddings for the long list</a>|
|4|<a href="#Simple-queries">Simple queries</a>|


This notebook is about creating embeddings for my Substack blog posts. I will use the OpenAI API to generate embeddings. To run this notebook you need to have an OpenAI account.  Do not try to run this notebook without trying my earlier notebooks:

* [`introduction.ipynb`](https://github.com/Gurgenci/probot/blob/main/introduction.ipynb)
* [`embeddings.ipynb`](https://github.com/Gurgenci/probot/blob/main/embeddings.ipynb)

Also read my earlier posts under `AI Stuff`.







In [34]:
SECTION+=1
SECTION=1
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("I think my posts are too long, so I divided them into segments using the separator `-+-+-+-+`.  "
         "I also considered a second level of separation using the string `-*-*-`, but decided that "
         "would be overkill.  I will use only the first-level separator to create embeddings.\n\n")
MD.write("The next task was to create a list of all my posts.  I wanted to create this list automatically "
         "and tried the function `create_page_list()` in CS3 above.\n\n")
MD.write("Unfortunately, this function retrieved only the most recent posts and missed "
         "the earlier pages.  I asked ChatGPT about it and its response was:\n\n")
MD.write(">Substack, as a popular blogging platform, typically organizes posts in a feed, "
         "which can use pagination to manage the display of a large number of posts. "
         "Pagination in web platforms like Substack is often implemented to improve "
         "load times and user experience by not loading all content at once.\n\n")
MD.write("Since I did not know what pagination method Substack uses, I could not "
         "create the list automatically and had to create it manually.\n\n")
MD.write("I created the list of all my posts in the file `general_list.txt` and read it into "
         "the list `full_list`.\n\n")
MD.write("The list `full_list` has the following entries:\n\n")
for pair in full_list:
    MD.write("* "+pair[1]+"\n")
# A blank line closes the bullet list so the next paragraph is not absorbed into it
MD.write("\nTo reduce the cost of debugging the embedding process, I created a shorter list of posts in the file `short_list.txt`.\n\n")
MD.write("The list `short_list` has the following entries:\n\n")
for pair in short_list:
    MD.write("* "+pair[1]+"\n")
MD.write("\nI will first use the shorter list in this notebook.\n\n")
MD.write("\n\n")
md(MD.out())

# Getting my posts ready for embedding #

I think my posts are too long, so I divided them into segments using the separator `-+-+-+-+`.  I also considered a second level of separation using the string `-*-*-`, but decided that would be overkill.  I will use only the first-level separator to create embeddings.
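
A minimal sketch of the segmentation step, for illustration only. The notebook itself calls `sepstr` and `SEP` from `ute.py`, whose code is not shown here; `split_segments` is a hypothetical stand-in, and it assumes that the separator is the `-+-+-+-+` string mentioned above and that empty segments should be dropped.

```python
def split_segments(text, sep="-+-+-+-+"):
    """Split text on sep and drop empty pieces (hypothetical stand-in for ute.sepstr)."""
    parts = [part.strip() for part in text.split(sep)]
    return [part for part in parts if part]

# A made-up post with two separators, giving three segments
post = "Intro paragraph.\n-+-+-+-+\nSecond part.\n-+-+-+-+\nThird part."
segments = split_segments(post)
for i, segment in enumerate(segments):
    print(i, repr(segment))
```

Each segment would then be embedded on its own, so a query can point to a specific part of a long post rather than to the post as a whole.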

The next task was to create a list of all my posts.  I wanted to create this list automatically and tried the function `create_page_list()` in CS3 above.

Unfortunately, this function retrieved only the most recent posts and missed the earlier pages.  I asked ChatGPT about it and its response was:

>Substack, as a popular blogging platform, typically organizes posts in a feed,          which can use pagination to manage the display of a large number of posts.          Pagination in web platforms like Substack is often implemented to improve          load times and user experience by not loading all content at once.

Since I did not know what pagination method Substack uses, I could not create the list automatically and had to create it manually.

I created the list of all my posts in the file `general_list.txt` and read it into the list `full_list`.
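
The list file holds pairs of lines: a URL followed by the title of that page. The sketch below parses that layout from an in-memory sample. It is a variant of `read_page_list`, not the notebook's own code: `parse_page_pairs` is a hypothetical name, the `example.com` URLs are invented, and unlike the original it skips blank lines and drops a trailing URL that has no title.

```python
def parse_page_pairs(lines):
    """Turn alternating url/title lines into a list of (url, title) tuples."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # zip pairs even-indexed lines (urls) with odd-indexed lines (titles);
    # a trailing url without a title is dropped.
    return list(zip(cleaned[0::2], cleaned[1::2]))

sample = [
    "https://example.com/p/first-post\n",
    "First post\n",
    "https://example.com/p/second-post\n",
    "Second post\n",
]
pairs = parse_page_pairs(sample)
for url, title in pairs:
    print(title, "->", url)
```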

The list `full_list` has the following entries:

* The Requiem for a Dream: Israel's Untaken Paths
* The importance of elite consensus
* OpenAI astounds us again
* Why did Elon Musk buy Twitter?
* Elon Musk: The Spaceman
* Let us talk about Elon Musk
* Despicable Acts - Part 2
* Despicable Deeds
* "ROGUE Age" & Climate Change: Unpredictable Global Transitions
* ROGUE Age Accessory #1 - Population
* ROGUE - Renaissance on globe with upheavals everywhere
* The Great Stagnation ends but for whom?
* When the rivers run dry
* The Voice Referendum in Australia
* Conspiracy Theories - Part 2
* Conspiracy Theories - Part 1
* How many more ruins in Great Britain?
* While watching Utopia on ABC
* Lying Oracles and the 'Anyone and No Holds Barred' war against Musk
* The one thing necessary for the triumph of ignorance is for wise men to stop talking to the uninformed
* Will there be a war?
* How do we select our information sources?
* Do we know why we know what we know?
* Does layering affect heat loss?
* Elections in Turkey
* Monster Wave soon to hit your shore
* Try counting carbons not calories to have a better control on your weight
* Bridge Fixture for Five Players
* The Attainment of Happiness
* Competent intelligence is here, will it do engineering?
* The Prodigal Bird returns
* Earthquake Thoughts and Facts
* Why were so many buildings destroyed?

To reduce the cost of debugging the embedding process, I created a shorter list of posts in the file `short_list.txt`.

The list `short_list` has the following entries:

* The Requiem for a Dream: Israel's Untaken Paths
* Conspiracy Theories - Part 2
* Conspiracy Theories - Part 1

I will first use the shorter list in this notebook.









In [35]:
SECTION+=1
SECTION=2
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("I will now read the web pages in the `short_list` and create embeddings for each section.  "
         "I will save the embeddings in a file in the `data` folder.\n\n")
if RECREATE_SHORT:
    create_embeddings(short_list_embedfile, short_list)
else:
    load_embeddings(short_list_embedfile)
MD.write("\n\n")
md(MD.out())

# Create embeddings for the short list #

I will now read the web pages in the `short_list` and create embeddings for each section.  I will save the embeddings in a file in the `data` folder.
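
The recreate-or-load choice controlled by `RECREATE_SHORT` and `RECREATE_FULL` can be sketched as a small caching helper. This is an illustration under assumptions, not the notebook's code: `load_or_create` and `fake_embeddings` are hypothetical names, and unlike the notebook the helper also rebuilds the array when the file is missing.

```python
import os
import tempfile
import numpy as np

def load_or_create(path, create_fn, recreate=False):
    """Return the array stored at path + '.npy', rebuilding it when asked or missing."""
    npy_path = path + ".npy"
    if recreate or not os.path.exists(npy_path):
        array = np.asarray(create_fn())
        np.save(path, array)  # np.save appends '.npy' when the name lacks it
        return array
    return np.load(npy_path)

calls = []
def fake_embeddings():
    # Stand-in for the paid embedding calls; records each invocation.
    calls.append(1)
    return np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "demo_embed")
    first = load_or_create(target, fake_embeddings)                 # file missing: computed
    second = load_or_create(target, fake_embeddings)                # read back from disk
    third = load_or_create(target, fake_embeddings, recreate=True)  # forced rebuild
    print("embedding function called", len(calls), "times")
```

Keeping the flag `False` between runs means a RUN ALL does not pay for the embedding calls again.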

Embeddings were read from file data/short_list_embed.npy

The following shows the first three and last three members of each embedding vector:

-0.00478, 0.01175, -0.00854, ..., -0.02872,   0,   0

-0.01794, -0.00800, -0.00239, ..., -0.02420,   0,   1

-0.01119, -0.02013, -0.01121, ..., -0.01077,   0,   2

...

-0.02132, -0.02397, -0.01467, ..., -0.01877,   2,   6

0.00630, -0.01048, 0.01402, ..., -0.01330,   2,   7

0.01223, -0.01027, -0.01069, ..., -0.02849,   2,   8



The last two are the index to the page list and the index to the segment in the page.
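
A small sketch of this layout, using made-up 4-dimensional vectors in place of real embeddings (all names and values here are illustrative). It shows the two index columns being appended and then separated again before any similarity calculation.

```python
import numpy as np

# Three made-up 4-dimensional "embeddings"; real ones are much longer.
vectors = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8],
                    [0.9, 1.0, 1.1, 1.2]])
page_index = np.array([0, 0, 1])
segment_index = np.array([0, 1, 0])

# Append the two indices as extra columns, as the notebook does per vector.
stored = np.column_stack([vectors, page_index, segment_index])

# Recover the pure embedding part and the integer metadata.
pure = stored[:, :-2]
meta = stored[:, -2:].astype(int)
print(stored.shape, pure.shape)
print(meta.tolist())
```

Storing the indices inside the same array keeps everything in one `.npy` file, at the cost of holding integer indices as floats and having to slice them off before every comparison.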









In [36]:
SECTION+=1
SECTION=3
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("I am happy with the way I created the embeddings for the short list.  "
         "I will now create embeddings for the long list.\n\n")
if RECREATE_FULL:
    create_embeddings(full_list_embedfile, full_list)
else:
    load_embeddings(full_list_embedfile)
MD.write("\n\n")
md(MD.out())

# Create embeddings for the long list #

I am happy with the way I created the embeddings for the short list.  I will now create embeddings for the long list.

Embeddings were read from file data/full_list_embed.npy

The following shows the first three and last three members of each embedding vector:

-0.00478, 0.01175, -0.00854, ..., -0.02872,   0,   0

-0.01794, -0.00800, -0.00239, ..., -0.02420,   0,   1

-0.01119, -0.02015, -0.01128, ..., -0.01072,   0,   2

...

-0.00341, -0.00585, -0.00933, ..., -0.02407,  32,   7

-0.00294, -0.00936, -0.00163, ..., -0.03766,  32,   8

0.00450, -0.00034, -0.02252, ..., -0.04635,  32,   9



The last two are the index to the page list and the index to the segment in the page.









In [37]:
SECTION+=1
SECTION=4
MD=mdx(Chapter, SECTION, TOC[SECTION-1])
MD.write("The last two cells of this notebook are to be used interactively to query the embeddings.\n\n")
MD.write("\n\n")
md(MD.out())

# Simple queries #

The last two cells of this notebook are to be used interactively to query the embeddings.









### Enter your query ###
The following cell lets the user enter a query and then retrieves the matching page(s) and segment(s).  You can run it repeatedly.
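
The retrieval step amounts to ranking the stored vectors by cosine similarity to the query embedding. The sketch below does this with NumPy on toy data. `top_matches` is a hypothetical helper, the 2-dimensional vectors are invented, and it assumes that `cosine_similarity` from `ute.py` computes the usual normalised dot product and that no stored vector is all zeros.

```python
import numpy as np

def top_matches(stored, query, k=10):
    """Return (row, page, segment, similarity) for the k most similar rows."""
    vectors = stored[:, :-2]                      # drop the two index columns
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / norms                # cosine similarity per row
    order = np.argsort(sims)[::-1][:k]            # highest similarity first
    return [(int(i), int(stored[i, -2]), int(stored[i, -1]), float(sims[i]))
            for i in order]

# Toy rows: two embedding values followed by page index and segment index
stored = np.array([[1.0, 0.0, 0, 0],
                   [0.0, 1.0, 0, 1],
                   [1.0, 1.0, 1, 0]])
query = np.array([1.0, 0.1])
result = top_matches(stored, query, k=2)
for row in result:
    print(row)
```

Computing all similarities in one matrix product replaces the explicit Python loop over the rows, which may matter as the number of embedded segments grows.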

In [38]:
import ipywidgets as widgets
from IPython.display import display

# The following function sorts the np array `a` from highest to lowest and returns the sorted indices
def sort_indices(a):
    return np.argsort(a)[::-1]

vsim=np.zeros(len(embeddings))

text = widgets.Text(
    value='',
    placeholder='What is your question ?',
    description='String:',
    disabled=False
)

button = widgets.Button(description="Click Me")

def on_button_clicked(b):
    global vsim
    query=text.value
    query_embedding=get_embedding(query)
    max_similarity=0
    for i in range(len(embeddings)):
        w=embeddings[i]
        v=w[:-2]
        similarity = cosine_similarity(v, query_embedding)
        vsim[i]=similarity
        if similarity>max_similarity:
            max_similarity=similarity
            max_index=i 
    print("You asked : %s"%query)
    print("The following are the top 10 matches:")
    sorted_indices=sort_indices(vsim)
    # print(sorted_indices)
    for x in sorted_indices[0:10]:
        jpage=int(embeddings[x][-2])
        jsegment=int(embeddings[x][-1])
        (url, title)=read_page(fullfile,jpage)
        pagetext=rdweb(url, None)
        paginatedtext=sepstr(pagetext, SEP)
        print("%d, %d, %.5f, %s: '%s ...'"%(jpage,jsegment,vsim[x], title, paginatedtext[jsegment][0:40]))
   
    # print("Your question is related to Page %d, Segment %d"%(embeddings[max_index][-2],embeddings[max_index][-1]))

button.on_click(on_button_clicked)

display(text, button)


Text(value='', description='String:', placeholder='What is your question ?')

Button(description='Click Me', style=ButtonStyle())

### Print segment text ###
The following cell can be used to print a segment of one of the pages listed in `fullfile`.  Change `jpage` and `jseg` to the values you are interested in. You can run it repeatedly.

In [39]:
jpage=16 # Entry in the full list
jseg=4
(url, title)=read_page(fullfile,jpage)
pagetext=rdweb(url, None)
paginatedtext=sepstr(pagetext, SEP)
print("Page %d, Segment %d"%(jpage,jseg))
md(paginatedtext[jseg])


Page 16, Segment 4


The cancellation of the 2026 Melbourne Commonwealth Games with impunity last month (and Alberta getting out of the 2020 Commonwealth Games) is another indication of the status of the former Empire.

Then there is Brexit.  Historians will explore the role of Brexit in Britain's decline and might be able to identify the factors that forced Brexit, which at the moment is imponderable for me. What I understood from the Brexit incident is that the ruling elites of British society are not able form a single unified prosperous future vision for the country. Stefan Dercon explains this issue of elite consensus requirement very well in his book Development Gambling. I may write about Dercon’s book some other time.

When the American revolutionaries won the Battle of Saratoga against the royal armies in 1777, a student approached Adam Smith, muttering; "Burgoyne is defeated. We're ruined." Smith replied: "My boy, there is a lot of ruin in a nation." With all the setbacks of the twentieth century and post-Brexit problems, how many ruins are still left in Britain, we shall see.

References

McCartney, S., and J. Stittle. 2017. ''A Very Costly Industry': The cost of Britain's privatised railway', _Critical perspectives on accounting_, 49: 1-17.

Steer-Davies-Gleave. 2016. "Study on the prices and quality of rail passenger services." In, edited by European Commission Directorate General for Mobility and Transport. European Commission Directorate General for Mobility and Transport.

Short Takes

Power Engineering International, 3 Ağustos 2023