In [None]:
# pip install python-dotenv
# pip install llama-index
# pip istall matplotlib
# pip install llama-index-embeddings-huggingface
# pip install llama-index-embeddings-instructor

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'

In [4]:
from dotenv import load_dotenv
import os
load_dotenv("../creds.env")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

In [44]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response.notebook_utils import display_source_node

<h4>HuggingFace Embedding</h4>

In [45]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embedding_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2", max_length=512)
Settings.embed_model = embedding_model

In [46]:
documents = SimpleDirectoryReader("data").load_data()
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, 
    breakpoint_percentile_threshold=95, 
    embed_model=Settings.embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

In [47]:
print(nodes[0].get_content())
print('-'*100)
print(nodes[1].get_content())



What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in t

In [50]:
vector_index = VectorStoreIndex(nodes)

In [51]:
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=2,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever
)

In [52]:
embeds = retriever.retrieve("Tell me about the author's programming journey through childhood to college")
for i in embeds:
    print(i, '-'*100)

Node ID: 5614ecf4-8f88-429b-9ee4-fb4aa100bdbd
Text: What I Worked On  February 2021  Before college the two main
things I worked on, outside of school, were writing and programming. I
didn't write essays. I wrote what beginning writers were supposed to
write then, and probably still are: short stories. My stories were
awful. They had hardly any plot, just characters with strong feelings,
which I ...
Score:  0.386
 ----------------------------------------------------------------------------------------------------
Node ID: 98ce35fe-5cdb-42bc-a9a4-88941fb978b0
Text: We had no idea how to be angel investors, and in Boston in 2005
there were no Ron Conways to learn from. So we just made what seemed
like the obvious choices, and some of the things we did turned out to
be novel.  There are multiple components to Y Combinator, and we
didn't figure them all out at once. The part we got first was to be an
angel fi...
Score:  0.382
 ---------------------------------------------------------------

<h4>OpenAI Embedding</h4>

In [11]:
documents = SimpleDirectoryReader("data").load_data()
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, 
    breakpoint_percentile_threshold=95, 
    embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

In [12]:
print(nodes[1].get_content())
print('-'*100)
print(nodes[2].get_content())

I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud p

In [13]:
vector_index = VectorStoreIndex(nodes)

In [21]:
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=2,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever
)

In [40]:
embeds = retriever.retrieve("Tell me about the author's programming journey through childhood to college")
for i in embeds:
    print(i, '-'*100)

Node ID: 212aadef-9e61-443b-9fcd-45cc6759ac9c
Text: I knew what I was going to do.  For my undergraduate thesis, I
reverse-engineered SHRDLU. My God did I love working on that program.
It was a pleasing bit of code, but what made it even more exciting was
my belief — hard to imagine now, but not unique in 1985 — that it was
already climbing the lower slopes of intelligence.  I had gotten into
a p...
Score:  0.344
 ----------------------------------------------------------------------------------------------------
Node ID: ee841b8e-8ad6-4777-9895-72dad0ebdb21
Text: My nice landlady let me leave my stuff in her attic. I had some
money saved from consulting work I'd done in grad school; there was
probably enough to last a year if I lived cheaply. Now all I had to do
was learn Italian.  Only stranieri (foreigners) had to take this
entrance exam. In retrospect it may well have been a way of excluding
them, bec...
Score:  0.336
 ---------------------------------------------------------------

In [42]:
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(response)

The author's programming journey started during childhood when they reverse-engineered the program SHRDLU for their undergraduate thesis. They later pursued Artificial Intelligence in college, choosing it as their major and focusing on Lisp programming. Despite realizing the limitations of AI during their first year of grad school, they continued to work on Lisp hacking and eventually wrote a book about it. The author's programming journey continued through grad school, where they were torn between their interest in building lasting software and their growing passion for art. This conflict led them to quickly write a dissertation to graduate and then transition to studying art at RISD and the Accademia di Belli Arti in Florence.


In [43]:
for n in response.source_nodes:
    # print(n)
    print(n.get_content())
    # display_source_node(n, source_length=20000)

I knew what I was going to do.

For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief — hard to imagine now, but not unique in 1985 — that it was already climbing the lower slopes of intelligence.

I had gotten into a program at Cornell that didn't make you choose a major. You could take whatever classes you liked, and choose whatever you liked to put on your degree. I of course chose "Artificial Intelligence." When I got the actual physical diploma, I was dismayed to find that the quotes had been included, which made them read as scare-quotes. At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover.

I applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I'd visited because Rich Draves went there, and was also home to Bill Woods, who'd invented the type of parser I u