2023-02-20-Using-OpenAI-GPT-to-search-your-org-files.html

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2023-02-20 Mon 20:18 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Using OpenAI GPT to search your org files</title>
<meta name="author" content="Matúš Goljer" />
<meta name="generator" content="Org Mode" />
<link rel="stylesheet" href="css/blog.css" />
</head>
<body>
<div id="preamble" class="status">

<div style="text-align: left; float: left;">
  <a href="./index.html">Home</a>
  <a href="https://github.com/Fuco1/">GitHub</a>
  <a href="https://www.patreon.com/user?u=3282358">Patreon</a>
</div>
<div style="text-align: right;">
  <a href="https://github.com/Fuco1/Fuco1.github.io/discussions">Discussions</a>
  <a href="https://fuco1.github.io/rss.xml">RSS</a>
  <a href="https://twitter.com/Fuco1337">Twitter</a>
</div>
<hr />
</div>
<div id="content" class="content">
<h1 class="title">Using OpenAI GPT to search your org files</h1>

<p>
As I wrote <a href="./2023-02-08-Visit-the-org-headline-from-the-attach-dired-buffer.html#org6dafc37">previously</a>, I store all my files under <code>~/data/org-attach</code> by
using the org mode file attachment feature.  To retrieve a file, I
usually use agenda search or other org packages which can go over your
org files and retrieve a headline.
</p>

<p>
But what for those rare moments when I only have a vague idea of what
I'm looking for and can't hit the query exactly?  Well&#x2026; we can use a
semantic search.
</p>

<p>
OpenAI recently opened a Large Language Model Embedding API.  Roughly
speaking, embedding takes a chunk of text and returns a fixed sized
vector in the model's "knowledge space" (OpenAI's ada model returns
1536 dimensional vector).  Two vectors which are close to each other
in this knowledge space should correspond to similar things.
</p>

<p>
The usual metric of judging closeness of vectors is the "angle"
between them.  This is hard to imagine in 1536 dimensions, but to keep
it short we can use the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> operation to compute the
"similarity".
</p>

<p>
So the idea here is simple:
</p>

<ol class="org-ol">
<li>Iterate over all headlines in my buffers.</li>
<li>Submit them to the Embedding API and cache the embedding vectors for each headline.</li>
<li>When querying for a headline, submit the query to the API, then
calculate similarity with all the cached embeddings and present the
closest ones as candidates.<sup><a id="fnr.1" class="footref" href="#fn.1" role="doc-backlink">1</a></sup></li>
</ol>

<p>
I wrote my own "org brain" clone which I called <a href="https://github.com/Fuco1/org-node-graph">org-graph</a> because I
wasn't happy with any available solution.  My biggest problem was that
most systems either prescribed one file per note, or only worked with
top level headlines, or had other artificial limitations.  So it was
logical this feature would live in that package as well.
</p>

<p>
But just after I got ready with all the preprocessing and data
preparation and API implementation, there came a shock.  Emacs math is
SLOOOOOOW.  So slow this "dot product" operation was taking ages and
the interface sucked.
</p>

<p>
But then I remembered about <a href="https://phst.eu/emacs-modules">dynamic modules</a>.  Not wanting to give up,
I decided to write a C module for some good old linear algebra.
</p>

<p>
You start with a header and some initialization:
</p>

<div class="org-src-container">
<pre class="src src-c"><span style="color: #ad7fa8;">#include</span> <span style="color: #e9b96e;">&lt;emacs-module.h&gt;</span>

<span style="color: #8cc4ff;">int</span> <span style="color: #fcaf3e;">plugin_is_GPL_compatible</span>;

<span style="color: #8cc4ff;">int</span>
<span style="color: #fce94f;">emacs_module_init</span> (<span style="color: #b4fa70;">struct</span> <span style="color: #8cc4ff;">emacs_runtime</span> *<span style="color: #fcaf3e;">runtime</span>)
{
  <span style="color: #8cc4ff;">emacs_env</span> *<span style="color: #fcaf3e;">env</span> = runtime-&gt;get_environment (runtime);

  <span style="color: #b4fa70;">return</span> 0;
}
</pre>
</div>

<p>
Then you can start implementing the functions.  I'm not going to
repeat everything here, you can find the full C source at <a href="https://github.com/Fuco1/org-node-graph/blob/master/dotproduct.c">GitHub</a>.  The
following function computes the dot product, which is really just a
lot of multiplication and addition.
</p>

<div class="org-src-container">
<pre class="src src-c"><span style="color: #b4fa70;">static</span> <span style="color: #8cc4ff;">emacs_value</span>
<span style="color: #fce94f;">dot_product</span> (<span style="color: #8cc4ff;">emacs_env</span> *<span style="color: #fcaf3e;">env</span>, <span style="color: #8cc4ff;">ptrdiff_t</span> <span style="color: #fcaf3e;">nargs</span>, <span style="color: #8cc4ff;">emacs_value</span> *<span style="color: #fcaf3e;">args</span>,
            <span style="color: #8cc4ff;">void</span> *<span style="color: #fcaf3e;">data</span>)
{
  assert (nargs == 2);
  <span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">a</span> = args[0];
  <span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">b</span> = args[1];

  <span style="color: #8cc4ff;">ptrdiff_t</span> <span style="color: #fcaf3e;">size</span> = env-&gt;vec_size (env, a);

  <span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">dp</span> = 0;
  <span style="color: #b4fa70;">for</span> (<span style="color: #8cc4ff;">int</span> <span style="color: #fcaf3e;">i</span> = 0; i &lt; size; i++) {
    <span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">first</span> = get_at(env, a, i);
    <span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">second</span> = get_at(env, b, i);
    dp += first * second;
  }

  <span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">result</span> = env-&gt;make_float (env, dp);
  <span style="color: #b4fa70;">return</span> result;
}
</pre>
</div>

<p>
With a simple Makefile
</p>

<div class="org-src-container">
<pre class="src src-makefile"><span style="color: #fce94f;">.PHONY</span>: all

<span style="color: #fce94f;">all</span>: dotproduct.so

<span style="color: #fce94f;">dotproduct.o</span>: dotproduct.c
    gcc -Wall -c dotproduct.c

<span style="color: #fce94f;">dotproduct.so</span>: dotproduct.o
    gcc -shared -o dotproduct.so dotproduct.o
</pre>
</div>

<p>
we can build the module
</p>

<div class="org-src-container">
<pre class="src src-bash">&gt; make
gcc -Wall -c dotproduct.c
gcc -shared -o dotproduct.so dotproduct.o
</pre>
</div>

<p>
Finally, we load the module into Emacs with:
</p>

<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>module-load <span style="color: #888a85;">(</span>expand-file-name <span style="color: #e9b96e;">"dotproduct.so"</span><span style="color: #888a85;">))</span>
</pre>
</div>

<p>
Armed with the now much faster math routines, I embedded about 2000
headers and now I can search my files by just giving very vague
queries&#x2014;it works surprisingly well, even across multiple different
natural languages (i.e. if I ask about Franz Kafka's Castle it will
return "Das Schloß" entry from my foreign language reading file).
</p>

<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>org-graph-openai-query <span style="color: #e9b96e;">"Book about journey to the center of the Earth"</span><span style="color: #888a85;">)</span>
</pre>
</div>

<p>
The top 10 results.  You can see it mixes the languages but all the
things are somewhat related.  However, it got the first hit exactly
right.
</p>

<pre class="example" id="orgdfd1ef5">
[8de3805f-8971-404a-98d2-84305db1a444] 87.43 Voyage au centre de la Terre
[f128f559-74a6-4435-a357-aa7d35976097] 85.43 Nicolai Klimii Iter Subterraneum
[2e0f3f7a-73bd-4b0f-a588-c757432607ca] 85.26 Endurance: Shackleton's Incredible Voyage
[d2828ffb-343c-431f-8657-521ef32de079] 85.11 Vesmír v orechovej škrupinke
[724b7e0d-f804-4d1d-8767-c4d166492472] 84.83 Oheň nad hlubinou - Pád Straumli
[5e2b17a4-902a-4e55-87f3-25a18f239346] 84.78 The Earthsea
[5e5906eb-7ab6-4b86-b41f-2bd007ff8eba] 84.65 Tolkien: Sur les rivages de la Terre du Milieu
[c33a670e-8ec2-4886-9e12-eee71891d2ea] 84.21 Vladimir Ulrich - Bis ans Ende der Welt - Ein Pilgerbuch
[d3c57fb4-9458-487d-bcf7-9852b9fec3cf] 84.09 The Lost World
[5d0d82de-df40-43f7-9868-3470d1bd376d] 83.95 Oheň nad hlubinou - Planeta spárů
</pre>

<p>
Here's another example which is purposefully silly description of the
<a href="https://en.wikipedia.org/wiki/Thus_Spoke_Zarathustra">expected</a> book's title:
</p>

<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>org-graph-openai-query <span style="color: #e9b96e;">"Philosophical book where someone spoke in a particular way"</span><span style="color: #888a85;">)</span>
</pre>
</div>

<p>
The top 10 results from my org files are:
</p>

<pre class="example" id="org853b35f">
[5d59a53e-fb81-4927-bce2-5096d9fb8417] 88.69 Thus Spoke Zarathustra
[15e61234-4d6b-413c-986d-391996f09a19] 87.82 Quotes
[7922112d-e871-419f-a4f2-68e892d5dad1] 87.04 Epistemology
[82e38b44-5ce1-416f-bd19-075aa70c7bf6] 86.96 Western Philosophy
[babcdf60-5840-4772-ba2b-314faa756997] 86.94 Nietzsche
[b192f472-d4f6-4596-8520-627b2c77e783] 86.90 Treatise on Human Nature
[b6f1d479-53fb-4e88-a474-024fbafab99e] 86.75 The Short History of Modern Philosophy
[1a366e71-eb03-472b-8c7c-d619494e9693] 86.62 The way of the bow
[2d738ef8-4a23-4b32-bcc1-0a1cc941c4be] 86.55 The Discourses
[2f00f564-af67-4d04-bf41-8cabe7764cdb] 86.36 The Life of Reason - Santayana
</pre>

<p>
Not bad, eh :)
</p>

<p>
To prepare the embeddings, you can run the function
<code>org-graph-compute-embeddings-for-buffer</code> in an org buffer.  Make sure
to set the environment variable <code>OPENAI_TOKEN</code>.  Also be aware that this
will add the <code>ID</code> property to every headline as this is a way to track
the cached embedding to the particular headline.  Make sure to backup
your org files before running this (you have them checked in to git
right&#x2026; right?!).  This will fire several requests to the API (about
20 headings per request), so be patient, it should take about a minute
for 500-600 headings.
</p>

<p>
You can discuss and ask questions on the <a href="https://github.com/Fuco1/Fuco1.github.io/discussions">discussions board on GitHub</a>.
</p>

<p>
This blog post was inspired by <a href="https://reasonabledeviations.com/2023/02/05/gpt-for-second-brain/">GPT for second brains</a>.
</p>
<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">

<div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1" role="doc-backlink">1</a></sup> <div class="footpara" role="doc-footnote"><p class="footpara">Now before we go
further, you need to register on <a href="https://platform.openai.com/">OpenAI platform</a> and the API costs
money.  The good news is that it is <b>extremely</b> cheap.  It will cost
you $1 to embed 2.4 MILLION tokens.  With a query being roughly 10
words which corresponds to 10-15 tokens, one query will cost you
about $0.0000006.  So it's pretty much free and you only need a
credit card to formally register.  You can also set monthly
spending limit to $0.01 and you would probably never run over the
limit. The step 2 will cost based on how much data you have.  I
have about 200000 lines of org files and so far I spent less than
$0.50 including all the experimenting.</p></div></div>


</div>
</div></div>
<div id="postamble" class="status">
<hr />
<div style="text-align: left; float: left;">Published at: 2023-02-20 20:18 Last updated at: 2023-02-20 20:18</div>
<div style="text-align: right;">Found a typo? <a href="https://github.com/Fuco1/fuco1.github.io/blob/master/posts/2023-02-20-Using-OpenAI-GPT-to-search-your-org-files.org">Edit on GitHub!</a></div>
</div>
</body>
</html>