index.html

<!DOCTYPE html>
<html>
<head>
  <title>Building Your Own Federated Search</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    @import url(http://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
    @import url(http://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
    @import url(http://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);

    body { 
      font-family: 'Droid Serif';
    }
    h1, h2, h3 {
      font-family: 'Yanone Kaffeesatz';
      font-weight: normal;
    }
    img { width: 100%; }
    .remark-slide-content {
      color: white;
      background-color: black;
    }
    a {
      color: white;
    }
    .remark-code, .remark-inline-code { 
      font-family: 'Ubuntu Mono';
      text-align: left;
    }
  </style>

  <base target="_blank" />
</head>
<body>
  <textarea id="source">

layout: true
class: center, middle

---

# Building Your Own Federated Search 

https://trott.github.io/building-your-own-federated-search

???

Let's talk about building your own federated search.

---

# What

???

Before we build it, I guess we ought to know what it is.

---

<img alt="Not Metasearch" src="img/ghostbusted.svg" width="600" height="600">

???

For starters, it's not metasearch, which I guess means I better take ten seconds to distinguish between federated search and metasearch.

---

# Metasearch: one index, multiple resources

???

Metasearch operates on a one consolidated index harvested from multiple sources; Summon and similar products are metasearch tools.

---

# Federated search: one index per resource

???

Federated search is more like running the same search on ten different search engines all at once and getting ten different result sets.

---

# There are pros&hellip;

???

Reasons you might want to do this instead of metasearch include&hellip;

---

# <span style="color: green">$</span>

???

&hellip;higher cost for implementing metasearch&hellip;

---

# Extensibility 

???

&hellip;and flexibility to include pretty much anything you can search rather than only things for which you have access to complete metadata.

---

# There are cons&hellip;

???

There are disadvantages to be aware of.

---

# Multiple result sets

???

Getting ten different result sets can be overwhelming compared to a single, sorted result set.

---

# But really, why?

???

Let's say you have dozens of online resources and each of these online resources has its own search tool.

---

# Here are the 387 URLs to try&hellip;

???

But users shouldn't be asked to search in dozens of different interfaces to find what they're looking for.

---

# Have we mentioned <span style="color: green">$</span>?

???

You're cash-constrained&hellip;

---

# I am a special snowflake.

???

&hellip;or you have a specialized collection set that needs searching that is not handled adequately by existing discovery products&hellip;

---

# Hi, I'm Cyndi Lauper, and I create amazing user experiences.

???

&hellip;or just want to have some fun creating something.

---

# Let's federate some search already!

???

Your focus is on federated search because it's frankly simpler and cheaper than metasearch.

---

# 3 Is A Magic Number

Yes it is. It's a magic number.

<small>Somewhere in the ancient mystic trinity, you get 3. That's a magic number.

<small>The past and the present and the future&hellip;

<small>&hellip;faith and hope and charity&hellip;

<small>&hellip;the heart and the brain and the body&hellip;

<small>&hellip;give you 3.

<small>That's a magic number.</small></small></small></small></small></small>

???

There are (at least) three ways to pull data out of other resources in real time, and here they are in descending order of desirability.

---

# #1 

## Cool, they have an API for that!

<span class="blink">This almost never happens.</span>

???

Number 1: The content provider might publish an API, which is something that almost never happens. 

---

# Faux APIs will break your heart.

???

Often what you and I consider an API is not what the content providers consider an API. 

---

# You keep using that word. 

I do not think it means what you think it means.

???

If you give me a script tag that injects a widget into my page, and then you call it an API, that is not an API.

---

# #2

## Screen-scrape the #*%! out of it.

This is by far the most common scenario.

???

Then there's straightforward grab-the-HTML and scrape the information out of it.

---

# <span style="color: red">WARNING</span>

???

As we all know, this brittle because the HTML can change and break your scraper. 

---

# <span style="color: green">UPSIDES</span>

???

But it has two big upsides.

---

# Better than the alternative&hellip;

&hellip;which is nothing.

???

Number 1: It usually works well enough.

---

# You can actually do it.

???

And number 2: It is usually easy and fast to implement.

---

# Web New-dot-Oh

No JavaScript = No Content

???

Lastly, there's what you have to do when confronted with a site powered by front-end technologies where none of the content shows up unless you execute a bunch of JavaScript.

---

# Headless browsers to the rescue

???

For this, make friends with headless browsers like PhantomJS to scrape these sites.

---

# <span style="color: yellow">CAUTION</span>

???

This should be your last resort, not your first choice. 

---

# Fast!

But not that fast.

???

Headless browsers are fast compared to Safari and Internet Explorer, but they're slow compared to curl or API calls.

---

# I have no opinion&hellip;

&hellip;except when I do.

???

I like to think I'm technology-agnostic and it's probably fine if you've concluded the way to do this is to code up a bunch of enterprise Java, compile it to a WAR file, and deploy that to your Tomcat server.

---

# JavaScript FTW

???

But it's impossible to deny that **&quot;JavaScript has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than any other programming language.&quot;**

---

# JS &amp; HTML

BFF

???

It has built-in DOM-handling, a million battle-tested libraries with simple and powerful jQuery-like selectors, and its sole reason for existing is the web and HTML.

---

# Node.js FTW

io.js too!

???

So we decided that it probably made sense to build our federated search server using Node.js.

---

# Browserify FTW

???

Or maybe even take it a step further and just put all the federated search code entirely in the browser with no intermediary server whatsoever.

---

# Amalgamatic

## https://github.com/ucsf-ckm/amalgamatic

???

First, we wrote a pluggable, extensible federated search tool called Amalgamatic.

---

### `npm install --save amalgamatic`

???

You install it with `npm`. 

---

### `npm install --save amalgamatic-pubmed`

???

By itself, it doesn't do much, so you need to install plugins too, which we'll get to in a minute.

---

````javascript
// Load Amalgamatic
var amalgamatic = require('amalgamatic');

// Load some plugins to search PubMed and SFX.
var pubmed = require('amalgamatic-pubmed');
var sfx = require('amalgamatic-sfx');

// Add the plugins to Amalgamatic.
amalgamatic.add('sfx', sfx);
amalgamatic.add('pubmed', pubmed);

var callback = function (err, results) {
  if (err) {
    console.dir(err);
  } else {
    results.forEach( function (result) {
      console.log('\nCollection name: ' + result.name);
      console.dir(result.data);
    });
  }
};

// Do a search!
amalgamatic.search({searchTerm: 'medicine'}, callback);
````

???

Here's sample code for a minimal server.

---

# Plugins

???

So, about those plugins&hellip;

---

### https://www.npmjs.org/browse/keyword/amalgamatic-plugin

???

The plugins are published via npm and tagged `amalgamatic-plugin` so you can find them at this URL or using `npm search`.

---

### https://github.com/ucsf-ckm/amalgamatic/wiki/How-to-write-an-Amalgamatic-plugin

???

Hopefully, there's a plugin for whatever you want to do, but if not, there's a short simple guide to writing plugins and, of course, you can just look at the source code for a similar plugin on GitHub.

---

# Our federated search

## http://search.library.ucsf.edu/

???

We stuck amalgamatic on an API server we run and created this interface to it.

---

# Browserified Demo

### http://trott.github.io/demo-amalgamatic-browserify/

???

But you don't need a server to handle the search federation, so here's a federated search example that runs via Browserify entirely within the browser.

---

````javascript
var sfx = require('amalgamatic-sfx');
sfx.setOptions({
    url: 'http://cors-anywhere.herokuapp.com/ucelinks.cdlib.org:8888/sfx_ucsf/az'
});
````

???

For those of you who might be wondering about same-origin policy issues, the short answer is we used CORS where we could and a CORS proxy where we could not use CORS directly.

---

````javascript
var options = {
  searchTerm: searchTerm, // We snarfed this from the form earlier
  
  pluginCallback: function (err, result) {
    var elem = document.getElementById(result.name);
    if (err) {
      elem.textContent = err.message;
    } else {
      if (result.data.length) {
        // Code that inserts results into the DOM
      } else {
        elem.innerHTML = 'No results. :-(';
      }
    }
  }

};

amalgamatic.search(options);
````
???

And here's the code that does the search.

---

````javascript
amalgamatic.search(options);
````

???

This time, in the search call, we're not using the callback that is invoked when all the plugins have finished returning data.

---

```javascript
var options = {
  searchTerm: searchTerm, // We snarfed this from the form earlier
  
  pluginCallback: function (err, result) {
    var elem = document.getElementById(result.name);
    if (err) {
      elem.textContent = err.message;
    } else {
      if (result.data.length) {
        // Code that inserts results into the DOM
      } else {
        elem.innerHTML = 'No results. :-(';
      }
    }
  }

};
````

???

Instead, in the options object, we're specifying a plugin callback that gets called each time a plugin returns data.

---

<img alt="Have these results while you wait for these other results." src="img/loading.png" width="748" height="279">

???

Using the plugin callback allows us to show results as they arrive rather than making the user wait for the slowest plugin to return before showing anything.

---

`# browserify -o bundle.js main.js`

???

Next, we take that JavaScript, which is in `main.js`, and we bundle it up with Browserify into `bundle.js`.

---

````html
<script src="bundle.js"></script>
````

???

Finally, we include the bundle in our HTML file.

---

# But wait! There's more!

???

Getting creative, we can add search&mdash;again, even without any server-side components&mdash;for small resources that don't even have a search interface or API.

---

<img src="img/cenic-header.png" alt="">

???

Let's take the CENIC conference website, which has a nice search box in the header that, when you enter some search terms and submit the form, does absolutely nothing whatsoever.

---

<img src="img/program.png" alt="">

???

And the program page is hard to use because all the information you care about&mdash;like almost everything aside from the talk title&mdash;is buried behind the Abstract links.

---

### https://github.com/ucsf-ckm/amalgamatic-cenic2015

???

So I wrote an Amalgamatic plugin for the CENIC 2015 program.

---

### https://trott.github.io/cenic2015

???

And a quick interface to handle the JSON data generated by the plugin.

---

# No WiFi? No problem! 

???

It's mobile-friendly and works offline too.

---

# Thanks!!

### https://trott.github.io/building-your-own-federated-search

Rich Trott  
@trott  
UC San Francisco  

  </textarea>
  <script src="js/remark.min.js">
  </script>
  <script type="text/javascript">
  var slideshow = remark.create();
  </script>
  <script>
  var blinks = document.querySelectorAll('.blink');
  blinks.forEach(function (blink) {
    setInterval(function() { blink.style.visibility = blink.style.visibility ? '' : 'hidden';}, 750);
  });
  </script>
</body>
</html>