/
2023-02-20-Using-OpenAI-GPT-to-search-your-org-files.html
286 lines (242 loc) · 12 KB
/
2023-02-20-Using-OpenAI-GPT-to-search-your-org-files.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2023-02-20 Mon 20:18 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Using OpenAI GPT to search your org files</title>
<meta name="author" content="Matúš Goljer" />
<meta name="generator" content="Org Mode" />
<link rel="stylesheet" href="css/blog.css" />
</head>
<body>
<div id="preamble" class="status">
<div style="text-align: left; float: left;">
<a href="./index.html">Home</a>
<a href="https://github.com/Fuco1/">GitHub</a>
<a href="https://www.patreon.com/user?u=3282358">Patreon</a>
</div>
<div style="text-align: right;">
<a href="https://github.com/Fuco1/Fuco1.github.io/discussions">Discussions</a>
<a href="https://fuco1.github.io/rss.xml">RSS</a>
<a href="https://twitter.com/Fuco1337">Twitter</a>
</div>
<hr />
</div>
<div id="content" class="content">
<h1 class="title">Using OpenAI GPT to search your org files</h1>
<p>
As I wrote <a href="./2023-02-08-Visit-the-org-headline-from-the-attach-dired-buffer.html#org6dafc37">previously</a>, I store all my files under <code>~/data/org-attach</code> by
using the org mode file attachment feature. To retrieve a file, I
usually use agenda search or other org packages which can go over your
org files and retrieve a headline.
</p>
<p>
But what for those rare moments when I only have a vague idea of what
I'm looking for and can't hit the query exactly? Well… we can use a
semantic search.
</p>
<p>
OpenAI recently opened a Large Language Model Embedding API. Roughly
speaking, embedding takes a chunk of text and returns a fixed sized
vector in the model's "knowledge space" (OpenAI's ada model returns
1536 dimensional vector). Two vectors which are close to each other
in this knowledge space should correspond to similar things.
</p>
<p>
The usual metric of judging closeness of vectors is the "angle"
between them. This is hard to imagine in 1536 dimensions, but to keep
it short we can use the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> operation to compute the
"similarity".
</p>
<p>
So the idea here is simple:
</p>
<ol class="org-ol">
<li>Iterate over all headlines in my buffers.</li>
<li>Submit them to the Embedding API and cache the embedding vectors for each headline.</li>
<li>When querying for a headline, submit the query to the API, then
calculate similarity with all the cached embeddings and present the
closest ones as candidates.<sup><a id="fnr.1" class="footref" href="#fn.1" role="doc-backlink">1</a></sup></li>
</ol>
<p>
I wrote my own "org brain" clone which I called <a href="https://github.com/Fuco1/org-node-graph">org-graph</a> because I
wasn't happy with any available solution. My biggest problem was that
most systems either prescribed one file per note, or only worked with
top level headlines, or had other artificial limitations. So it was
logical this feature would live in that package as well.
</p>
<p>
But just after I got ready with all the preprocessing and data
preparation and API implementation, there came a shock. Emacs math is
SLOOOOOOW. So slow this "dot product" operation was taking ages and
the interface sucked.
</p>
<p>
But then I remembered about <a href="https://phst.eu/emacs-modules">dynamic modules</a>. Not wanting to give up,
I decided to write a C module for some good old linear algebra.
</p>
<p>
You start with a header and some initialization:
</p>
<div class="org-src-container">
<pre class="src src-c"><span style="color: #ad7fa8;">#include</span> <span style="color: #e9b96e;"><emacs-module.h></span>
<span style="color: #8cc4ff;">int</span> <span style="color: #fcaf3e;">plugin_is_GPL_compatible</span>;
<span style="color: #8cc4ff;">int</span>
<span style="color: #fce94f;">emacs_module_init</span> (<span style="color: #b4fa70;">struct</span> <span style="color: #8cc4ff;">emacs_runtime</span> *<span style="color: #fcaf3e;">runtime</span>)
{
<span style="color: #8cc4ff;">emacs_env</span> *<span style="color: #fcaf3e;">env</span> = runtime->get_environment (runtime);
<span style="color: #b4fa70;">return</span> 0;
}
</pre>
</div>
<p>
Then you can start implementing the functions. I'm not going to
repeat everything here, you can find the full C source at <a href="https://github.com/Fuco1/org-node-graph/blob/master/dotproduct.c">GitHub</a>. The
following function computes the dot product, which is really just a
lot of multiplication and addition.
</p>
<div class="org-src-container">
<pre class="src src-c"><span style="color: #b4fa70;">static</span> <span style="color: #8cc4ff;">emacs_value</span>
<span style="color: #fce94f;">dot_product</span> (<span style="color: #8cc4ff;">emacs_env</span> *<span style="color: #fcaf3e;">env</span>, <span style="color: #8cc4ff;">ptrdiff_t</span> <span style="color: #fcaf3e;">nargs</span>, <span style="color: #8cc4ff;">emacs_value</span> *<span style="color: #fcaf3e;">args</span>,
<span style="color: #8cc4ff;">void</span> *<span style="color: #fcaf3e;">data</span>)
{
assert (nargs == 2);
<span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">a</span> = args[0];
<span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">b</span> = args[1];
<span style="color: #8cc4ff;">ptrdiff_t</span> <span style="color: #fcaf3e;">size</span> = env->vec_size (env, a);
<span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">dp</span> = 0;
<span style="color: #b4fa70;">for</span> (<span style="color: #8cc4ff;">int</span> <span style="color: #fcaf3e;">i</span> = 0; i < size; i++) {
<span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">first</span> = get_at(env, a, i);
<span style="color: #8cc4ff;">double</span> <span style="color: #fcaf3e;">second</span> = get_at(env, b, i);
dp += first * second;
}
<span style="color: #8cc4ff;">emacs_value</span> <span style="color: #fcaf3e;">result</span> = env->make_float (env, dp);
<span style="color: #b4fa70;">return</span> result;
}
</pre>
</div>
<p>
With a simple Makefile
</p>
<div class="org-src-container">
<pre class="src src-makefile"><span style="color: #fce94f;">.PHONY</span>: all
<span style="color: #fce94f;">all</span>: dotproduct.so
<span style="color: #fce94f;">dotproduct.o</span>: dotproduct.c
gcc -Wall -c dotproduct.c
<span style="color: #fce94f;">dotproduct.so</span>: dotproduct.o
gcc -shared -o dotproduct.so dotproduct.o
</pre>
</div>
<p>
we can build the module
</p>
<div class="org-src-container">
<pre class="src src-bash">> make
gcc -Wall -c dotproduct.c
gcc -shared -o dotproduct.so dotproduct.o
</pre>
</div>
<p>
Finally, we load the module into Emacs with:
</p>
<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>module-load <span style="color: #888a85;">(</span>expand-file-name <span style="color: #e9b96e;">"dotproduct.so"</span><span style="color: #888a85;">))</span>
</pre>
</div>
<p>
Armed with the now much faster math routines, I embedded about 2000
headers and now I can search my files by just giving very vague
queries—it works surprisingly well, even across multiple different
natural languages (i.e. if I ask about Franz Kafka's Castle it will
return "Das Schloß" entry from my foreign language reading file).
</p>
<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>org-graph-openai-query <span style="color: #e9b96e;">"Book about journey to the center of the Earth"</span><span style="color: #888a85;">)</span>
</pre>
</div>
<p>
The top 10 results. You can see it mixes the languages but all the
things are somewhat related. However, it got the first hit exactly
right.
</p>
<pre class="example" id="orgdfd1ef5">
[8de3805f-8971-404a-98d2-84305db1a444] 87.43 Voyage au centre de la Terre
[f128f559-74a6-4435-a357-aa7d35976097] 85.43 Nicolai Klimii Iter Subterraneum
[2e0f3f7a-73bd-4b0f-a588-c757432607ca] 85.26 Endurance: Shackleton's Incredible Voyage
[d2828ffb-343c-431f-8657-521ef32de079] 85.11 Vesmír v orechovej škrupinke
[724b7e0d-f804-4d1d-8767-c4d166492472] 84.83 Oheň nad hlubinou - Pád Straumli
[5e2b17a4-902a-4e55-87f3-25a18f239346] 84.78 The Earthsea
[5e5906eb-7ab6-4b86-b41f-2bd007ff8eba] 84.65 Tolkien: Sur les rivages de la Terre du Milieu
[c33a670e-8ec2-4886-9e12-eee71891d2ea] 84.21 Vladimir Ulrich - Bis ans Ende der Welt - Ein Pilgerbuch
[d3c57fb4-9458-487d-bcf7-9852b9fec3cf] 84.09 The Lost World
[5d0d82de-df40-43f7-9868-3470d1bd376d] 83.95 Oheň nad hlubinou - Planeta spárů
</pre>
<p>
Here's another example which is purposefully silly description of the
<a href="https://en.wikipedia.org/wiki/Thus_Spoke_Zarathustra">expected</a> book's title:
</p>
<div class="org-src-container">
<pre class="src src-elisp"><span style="color: #888a85;">(</span>org-graph-openai-query <span style="color: #e9b96e;">"Philosophical book where someone spoke in a particular way"</span><span style="color: #888a85;">)</span>
</pre>
</div>
<p>
The top 10 results from my org files are:
</p>
<pre class="example" id="org853b35f">
[5d59a53e-fb81-4927-bce2-5096d9fb8417] 88.69 Thus Spoke Zarathustra
[15e61234-4d6b-413c-986d-391996f09a19] 87.82 Quotes
[7922112d-e871-419f-a4f2-68e892d5dad1] 87.04 Epistemology
[82e38b44-5ce1-416f-bd19-075aa70c7bf6] 86.96 Western Philosophy
[babcdf60-5840-4772-ba2b-314faa756997] 86.94 Nietzsche
[b192f472-d4f6-4596-8520-627b2c77e783] 86.90 Treatise on Human Nature
[b6f1d479-53fb-4e88-a474-024fbafab99e] 86.75 The Short History of Modern Philosophy
[1a366e71-eb03-472b-8c7c-d619494e9693] 86.62 The way of the bow
[2d738ef8-4a23-4b32-bcc1-0a1cc941c4be] 86.55 The Discourses
[2f00f564-af67-4d04-bf41-8cabe7764cdb] 86.36 The Life of Reason - Santayana
</pre>
<p>
Not bad, eh :)
</p>
<p>
To prepare the embeddings, you can run the function
<code>org-graph-compute-embeddings-for-buffer</code> in an org buffer. Make sure
to set the environment variable <code>OPENAI_TOKEN</code>. Also be aware that this
will add the <code>ID</code> property to every headline as this is a way to track
the cached embedding to the particular headline. Make sure to backup
your org files before running this (you have them checked in to git
right… right?!). This will fire several requests to the API (about
20 headings per request), so be patient, it should take about a minute
for 500-600 headings.
</p>
<p>
You can discuss and ask questions on the <a href="https://github.com/Fuco1/Fuco1.github.io/discussions">discussions board on GitHub</a>.
</p>
<p>
This blog post was inspired by <a href="https://reasonabledeviations.com/2023/02/05/gpt-for-second-brain/">GPT for second brains</a>.
</p>
<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1" role="doc-backlink">1</a></sup> <div class="footpara" role="doc-footnote"><p class="footpara">Now before we go
further, you need to register on <a href="https://platform.openai.com/">OpenAI platform</a> and the API costs
money. The good news is that it is <b>extremely</b> cheap. It will cost
you $1 to embed 2.4 MILLION tokens. With a query being roughly 10
words which corresponds to 10-15 tokens, one query will cost you
about $0.0000006. So it's pretty much free and you only need a
credit card to formally register. You can also set monthly
spending limit to $0.01 and you would probably never run over the
limit. The step 2 will cost based on how much data you have. I
have about 200000 lines of org files and so far I spent less than
$0.50 including all the experimenting.</p></div></div>
</div>
</div></div>
<div id="postamble" class="status">
<hr />
<div style="text-align: left; float: left;">Published at: 2023-02-20 20:18 Last updated at: 2023-02-20 20:18</div>
<div style="text-align: right;">Found a typo? <a href="https://github.com/Fuco1/fuco1.github.io/blob/master/posts/2023-02-20-Using-OpenAI-GPT-to-search-your-org-files.org">Edit on GitHub!</a></div>
</div>
</body>
</html>