This repository has been archived by the owner on Nov 26, 2021. It is now read-only.
<!DOCTYPE html>
<html>
<head>
<title>Building Your Own Federated Search</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body {
font-family: 'Droid Serif';
}
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
}
img { width: 100%; }
.remark-slide-content {
color: white;
background-color: black;
}
a {
color: white;
}
.remark-code, .remark-inline-code {
font-family: 'Ubuntu Mono';
text-align: left;
}
</style>
<base target="_blank" />
</head>
<body>
<textarea id="source">
layout: true
class: center, middle
---
# Building Your Own Federated Search
https://trott.github.io/building-your-own-federated-search
???
Let's talk about building your own federated search.
---
# What
???
Before we build it, I guess we ought to know what it is.
---
<img alt="Not Metasearch" src="img/ghostbusted.svg" width="600" height="600">
???
For starters, it's not metasearch, which I guess means I better take ten seconds to distinguish between federated search and metasearch.
---
# Metasearch: one index, multiple resources
???
Metasearch operates on one consolidated index harvested from multiple sources; Summon and similar products are metasearch tools.
---
# Federated search: one index per resource
???
Federated search is more like running the same search on ten different search engines all at once and getting ten different result sets.
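
To make that concrete, here is a rough sketch of the fan-out idea, not how any particular product does it. The `federatedSearch` helper and the two stand-in sources are made up for illustration; real sources would query an actual search engine.

```javascript
// Hypothetical helper: send one search term to every source and keep each
// source's result set separate (federated, not merged into one index).
function federatedSearch(sources, searchTerm, done) {
  var names = Object.keys(sources);
  var results = [];
  var pending = names.length;
  if (pending === 0) { return done(null, results); }
  names.forEach(function (name, index) {
    sources[name](searchTerm, function (err, data) {
      // A failing source reports its own error without sinking the others.
      results[index] = err ? {name: name, error: err.message} : {name: name, data: data};
      pending -= 1;
      if (pending === 0) { done(null, results); }
    });
  });
}

// Two stand-in sources (assumptions, not real services).
var sources = {
  catalog: function (term, cb) { cb(null, [term + ' in the catalog']); },
  journals: function (term, cb) { cb(new Error('timed out')); }
};

federatedSearch(sources, 'medicine', function (err, results) {
  console.log(JSON.stringify(results));
});
```

One result set per source comes back, in the order the sources were listed, which is exactly the "ten different result sets" trade-off.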
---
# There are pros…
???
Reasons you might want to do this instead of metasearch include…
---
# <span style="color: green">$</span>
???
…higher cost for implementing metasearch…
---
# Extensibility
???
…and flexibility to include pretty much anything you can search rather than only things for which you have access to complete metadata.
---
# There are cons…
???
There are disadvantages to be aware of.
---
# Multiple result sets
???
Getting ten different result sets can be overwhelming compared to a single, sorted result set.
---
# But really, why?
???
Let's say you have dozens of online resources and each of these online resources has its own search tool.
---
# Here are the 387 URLs to try…
???
But users shouldn't be asked to search in dozens of different interfaces to find what they're looking for.
---
# Have we mentioned <span style="color: green">$</span>?
???
You're cash-constrained…
---
# I am a special snowflake.
???
…or you have a specialized collection set that needs searching that is not handled adequately by existing discovery products…
---
# Hi, I'm Cyndi Lauper, and I create amazing user experiences.
???
…or just want to have some fun creating something.
---
# Let's federate some search already!
???
Your focus is on federated search because it's frankly simpler and cheaper than metasearch.
---
# 3 Is A Magic Number
Yes it is. It's a magic number.
<small>Somewhere in the ancient mystic trinity, you get 3. That's a magic number.
<small>The past and the present and the future…
<small>…faith and hope and charity…
<small>…the heart and the brain and the body…
<small>…give you 3.
<small>That's a magic number.</small></small></small></small></small></small>
???
There are (at least) three ways to pull data out of other resources in real time, and here they are in descending order of desirability.
---
# #1
## Cool, they have an API for that!
<span class="blink">This almost never happens.</span>
???
Number 1: The content provider might publish an API, which is something that almost never happens.
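
When you do get a real API, the work mostly shrinks to building a query URL and picking fields out of JSON. A minimal sketch, assuming a made-up endpoint (`api.example.org`) and a made-up response shape with an `items` array; substitute whatever the provider actually documents.

```javascript
// ASSUMPTION: the endpoint and the response shape are hypothetical.
var API_BASE = 'https://api.example.org/search';

function buildSearchUrl(searchTerm) {
  // encodeURIComponent keeps spaces and punctuation from breaking the query.
  return API_BASE + '?q=' + encodeURIComponent(searchTerm);
}

function parseResults(body) {
  var parsed = JSON.parse(body);
  // Normalize every hit to the same {name, url} shape.
  return (parsed.items || []).map(function (item) {
    return {name: item.title, url: item.link};
  });
}

console.log(buildSearchUrl('heart disease'));
var sampleBody = '{"items":[{"title":"Cardiology Today","link":"https://example.org/ct"}]}';
console.log(parseResults(sampleBody));
```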
---
# Faux APIs will break your heart.
???
Often what you and I consider an API is not what the content providers consider an API.
---
# You keep using that word.
I do not think it means what you think it means.
???
If you give me a script tag that injects a widget into my page, and then you call it an API, that is not an API.
---
# #2
## Screen-scrape the #*%! out of it.
This is by far the most common scenario.
???
Then there's straightforward grab-the-HTML and scrape the information out of it.
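
Here is roughly what that looks like, as a dependency-free sketch. The markup pattern is invented for the example, and a real scraper would more likely use a proper HTML parser with jQuery-like selectors than a regular expression.

```javascript
// Naive scraper for a hypothetical results page whose hits look like
// <li class="hit"><a href="...">Title</a></li>.
function scrapeHits(html) {
  var pattern = /<li class="hit"><a href="([^"]+)">([^<]+)<\/a><\/li>/g;
  var hits = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    hits.push({url: match[1], name: match[2]});
  }
  return hits;
}

var sampleHtml = '<ul>' +
  '<li class="hit"><a href="/item/1">Anatomy Atlas</a></li>' +
  '<li class="hit"><a href="/item/2">Merck Manual</a></li>' +
  '</ul>';

console.log(scrapeHits(sampleHtml));
// If the site changes its markup, the scraper silently finds nothing:
console.log(scrapeHits('<div class="result">Anatomy Atlas</div>').length);
```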
---
# <span style="color: red">WARNING</span>
???
As we all know, this is brittle because the HTML can change and break your scraper.
---
# <span style="color: green">UPSIDES</span>
???
But it has two big upsides.
---
# Better than the alternative…
…which is nothing.
???
Number 1: It usually works well enough.
---
# You can actually do it.
???
And number 2: It is usually easy and fast to implement.
---
# Web New-dot-Oh
No JavaScript = No Content
???
Lastly, there's what you have to do when confronted with a site powered by front-end technologies where none of the content shows up unless you execute a bunch of JavaScript.
---
# Headless browsers to the rescue
???
For this, make friends with headless browsers like PhantomJS to scrape these sites.
---
# <span style="color: yellow">CAUTION</span>
???
This should be your last resort, not your first choice.
---
# Fast!
But not that fast.
???
Headless browsers are fast compared to Safari and Internet Explorer, but they're slow compared to curl or API calls.
---
# I have no opinion…
…except when I do.
???
I like to think I'm technology-agnostic and it's probably fine if you've concluded the way to do this is to code up a bunch of enterprise Java, compile it to a WAR file, and deploy that to your Tomcat server.
---
# JavaScript FTW
???
But it's impossible to deny that **"JavaScript has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than any other programming language."**
---
# JS & HTML
BFF
???
It has built-in DOM-handling, a million battle-tested libraries with simple and powerful jQuery-like selectors, and its sole reason for existing is the web and HTML.
---
# Node.js FTW
io.js too!
???
So we decided that it probably made sense to build our federated search server using Node.js.
---
# Browserify FTW
???
Or maybe even take it a step further and just put all the federated search code entirely in the browser with no intermediary server whatsoever.
---
# Amalgamatic
## https://github.com/ucsf-ckm/amalgamatic
???
First, we wrote a pluggable, extensible federated search tool called Amalgamatic.
---
### `npm install --save amalgamatic`
???
You install it with `npm`.
---
### `npm install --save amalgamatic-pubmed`
???
By itself, it doesn't do much, so you need to install plugins too, which we'll get to in a minute.
---
````javascript
// Load Amalgamatic
var amalgamatic = require('amalgamatic');
// Load some plugins to search PubMed and SFX.
var pubmed = require('amalgamatic-pubmed');
var sfx = require('amalgamatic-sfx');
// Add the plugins to Amalgamatic.
amalgamatic.add('sfx', sfx);
amalgamatic.add('pubmed', pubmed);
var callback = function (err, results) {
if (err) {
console.dir(err);
} else {
results.forEach( function (result) {
console.log('\nCollection name: ' + result.name);
console.dir(result.data);
});
}
};
// Do a search!
amalgamatic.search({searchTerm: 'medicine'}, callback);
````
???
Here's sample code for a minimal search script.
---
# Plugins
???
So, about those plugins…
---
### https://www.npmjs.org/browse/keyword/amalgamatic-plugin
???
The plugins are published via npm and tagged `amalgamatic-plugin` so you can find them at this URL or using `npm search`.
---
### https://github.com/ucsf-ckm/amalgamatic/wiki/How-to-write-an-Amalgamatic-plugin
???
Hopefully, there's a plugin for whatever you want to do, but if not, there's a short simple guide to writing plugins and, of course, you can just look at the source code for a similar plugin on GitHub.
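
For a feel of how small a plugin can be, here is a sketch. The contract shown, an object with `search(query, callback)` and an optional `setOptions()`, is my assumption based on how the plugins are used in these slides; the plugin guide is the authority. The data and the `maxResults` option are invented.

```javascript
// ASSUMPTION: a plugin is an object with search(query, callback) that calls
// back with {data: [...]}, plus an optional setOptions(). Check the plugin
// guide for the real contract before relying on this shape.
var staffPicks = [
  {name: 'Intro to Medicine', url: 'https://example.org/intro'},
  {name: 'Dental Anatomy', url: 'https://example.org/dental'}
];

var pluginOptions = {maxResults: 10}; // hypothetical option

var examplePlugin = {
  setOptions: function (newOptions) {
    Object.keys(newOptions).forEach(function (key) {
      pluginOptions[key] = newOptions[key];
    });
  },
  search: function (query, callback) {
    var term = String(query.searchTerm || '').toLowerCase();
    var matches = staffPicks.filter(function (item) {
      return term !== '' && item.name.toLowerCase().indexOf(term) !== -1;
    });
    callback(null, {data: matches.slice(0, pluginOptions.maxResults)});
  }
};

examplePlugin.search({searchTerm: 'medicine'}, function (err, result) {
  console.dir(result.data);
});
```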
---
# Our federated search
## http://search.library.ucsf.edu/
???
We stuck amalgamatic on an API server we run and created this interface to it.
---
# Browserified Demo
### http://trott.github.io/demo-amalgamatic-browserify/
???
But you don't need a server to handle the search federation, so here's a federated search example that runs via Browserify entirely within the browser.
---
````javascript
var sfx = require('amalgamatic-sfx');
sfx.setOptions({
url: 'http://cors-anywhere.herokuapp.com/ucelinks.cdlib.org:8888/sfx_ucsf/az'
});
````
???
For those of you who might be wondering about same-origin policy issues, the short answer is we used CORS where we could and a CORS proxy where we could not use CORS directly.
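
The proxy trick is just URL prefixing. A small sketch: the proxy base is the one on the slide, while the `viaCorsProxy` helper name is made up.

```javascript
// The proxy base matches the one on the slide; the helper is hypothetical.
var CORS_PROXY = 'http://cors-anywhere.herokuapp.com/';

function viaCorsProxy(targetUrl) {
  // The slide's URL passes host:port/path with no scheme, so a leading
  // "http://" is stripped here. Anything else is left untouched.
  return CORS_PROXY + targetUrl.replace(/^http:\/\//, '');
}

console.log(viaCorsProxy('http://ucelinks.cdlib.org:8888/sfx_ucsf/az'));
// prints "http://cors-anywhere.herokuapp.com/ucelinks.cdlib.org:8888/sfx_ucsf/az"
```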
---
````javascript
var options = {
searchTerm: searchTerm, // We snarfed this from the form earlier
pluginCallback: function (err, result) {
var elem = document.getElementById(result.name);
if (err) {
elem.textContent = err.message;
} else {
if (result.data.length) {
// Code that inserts results into the DOM
} else {
elem.innerHTML = 'No results. :-(';
}
}
}
};
amalgamatic.search(options);
````
???
And here's the code that does the search.
---
````javascript
amalgamatic.search(options);
````
???
This time, in the search call, we're not using the callback that is invoked when all the plugins have finished returning data.
---
````javascript
var options = {
searchTerm: searchTerm, // We snarfed this from the form earlier
pluginCallback: function (err, result) {
var elem = document.getElementById(result.name);
if (err) {
elem.textContent = err.message;
} else {
if (result.data.length) {
// Code that inserts results into the DOM
} else {
elem.innerHTML = 'No results. :-(';
}
}
}
};
````
???
Instead, in the options object, we're specifying a plugin callback that gets called each time a plugin returns data.
---
<img alt="Have these results while you wait for these other results." src="img/loading.png" width="748" height="279">
???
Using the plugin callback allows us to show results as they arrive rather than making the user wait for the slowest plugin to return before showing anything.
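
The "inserts results into the DOM" part was left as a comment on the slide. One way it might look, assuming each item in `result.data` has `name` and `url` properties (an assumption; plugins may return other fields):

```javascript
// ASSUMPTION: each item has `name` and `url` properties.
// `doc` is the page's `document`; passing it in keeps the function testable.
function buildResultList(doc, items) {
  var list = doc.createElement('ul');
  items.forEach(function (item) {
    var link = doc.createElement('a');
    link.href = item.url;
    // textContent, not innerHTML, so scraped text is never parsed as markup.
    link.textContent = item.name;
    var entry = doc.createElement('li');
    entry.appendChild(link);
    list.appendChild(entry);
  });
  return list;
}

// Inside pluginCallback, in place of the placeholder comment:
//   elem.textContent = '';
//   elem.appendChild(buildResultList(document, result.data));
```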
---
`# browserify -o bundle.js main.js`
???
Next, we take that JavaScript, which is in `main.js`, and we bundle it up with Browserify into `bundle.js`.
---
````html
<script src="bundle.js"></script>
````
???
Finally, we include the bundle in our HTML file.
---
# But wait! There's more!
???
Getting creative, we can add search—again, even without any server-side components—for small resources that don't even have a search interface or API.
---
<img src="img/cenic-header.png" alt="">
???
Let's take the CENIC conference website, which has a nice search box in the header that, when you enter some search terms and submit the form, does absolutely nothing whatsoever.
---
<img src="img/program.png" alt="">
???
And the program page is hard to use because all the information you care about—like almost everything aside from the talk title—is buried behind the Abstract links.
---
### https://github.com/ucsf-ckm/amalgamatic-cenic2015
???
So I wrote an Amalgamatic plugin for the CENIC 2015 program.
---
### https://trott.github.io/cenic2015
???
And a quick interface to handle the JSON data generated by the plugin.
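
Once the program is JSON, searching it in the browser is a simple filter. A sketch with invented sample data and field names (`title`, `speaker`, `abstract` are assumptions, not necessarily what the plugin emits):

```javascript
// Hypothetical program data; the field names are assumptions.
var talks = [
  {title: 'Federated Search on a Budget', speaker: 'Speaker One',
    abstract: 'Pulling results from many sources at once.'},
  {title: 'Campus Networking Update', speaker: 'Speaker Two',
    abstract: 'Fiber, routers, and the search for more bandwidth.'}
];

function searchTalks(allTalks, searchTerm) {
  var term = searchTerm.trim().toLowerCase();
  if (term === '') { return []; }
  return allTalks.filter(function (talk) {
    // Search the abstract too, so nothing stays buried behind a link.
    var haystack = [talk.title, talk.speaker, talk.abstract].join(' ').toLowerCase();
    return haystack.indexOf(term) !== -1;
  });
}

console.log(searchTalks(talks, 'bandwidth').map(function (talk) { return talk.title; }));
```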
---
# No WiFi? No problem!
???
It's mobile-friendly and works offline too.
---
# Thanks!!
### https://trott.github.io/building-your-own-federated-search
Rich Trott
@trott
UC San Francisco
</textarea>
<script src="js/remark.min.js">
</script>
<script type="text/javascript">
var slideshow = remark.create();
</script>
<script>
var blinks = document.querySelectorAll('.blink');
blinks.forEach(function (blink) {
setInterval(function() { blink.style.visibility = blink.style.visibility ? '' : 'hidden';}, 750);
});
</script>
</body>
</html>