# Real search engine

* Upgrade from our toy example to something real
* Test an industry-grade search engine
* Useful skill to have:
  * Work with your own corpora - powerful search for free
  * Custom search and indexing
  * Very often needed in the data science industry

# Which one?

* Apache Solr
* Elasticsearch
* Vespa.ai
* These are the main contenders
* We will go with Apache Solr for no strong reason; the main principles apply to all of them
* These are engines (the backend), not interfaces (the frontend)
  
## Solr

* Apache project
* Search **engine** - not so much the interface
  1. Index data
  2. Query via HTTP API
  3. Collect results in various formats
  4. Use them any way you need
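
The steps above boil down to plain HTTP requests. As a minimal sketch of step 2, assuming a local Solr on the default port and a hypothetical core called `mycore`, a query URL for the core's `/select` handler could be put together like this. The helper name `build_select_url` is made up for this illustration; `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_select_url(base_url, core, query, rows=10, fmt="json"):
    """Build the URL for a query against a core's /select handler."""
    # urlencode takes care of escaping characters such as ':' in the query
    params = urlencode({"q": query, "rows": rows, "wt": fmt})
    return f"{base_url}/solr/{core}/select?{params}"

# "mycore" is a placeholder core name, not one created in this lecture
url = build_select_url("http://127.0.0.1:8983", "mycore", "text:cat", rows=5)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, covers steps 2 and 3; what you do with the returned JSON is step 4.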

# Installing Solr

* For what it is, installing Solr is surprisingly easy
* Google "solr download" and download the latest version (a single .tgz which you unpack)
* Once unpacked, you can start it as follows:

In [None]:
# Tells Solr to start in the foreground (-f) and listen on port 8983 (-p; 8983 is the default)
# foreground -> occupies the command line and is killed when you exit it (please use!)
# background -> releases the command line, doesn't get killed when you exit it -> not what we want!
bin/solr start -f -p 8983

2017-03-13 11:02:47.824 INFO  (main) [   ] o.e.j.s.Server jetty-9.3.14.v20161028
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 6.4.2
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in standalone mode on port 16667
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /home/ginter/IR_Course/solr-6.4.2
2017-03-13 11:02:48.189 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2017-03-13T11:02:48.172Z
2017-03-13 11:02:48.209 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /home/ginter/IR_Course/solr-6.4.2/server/solr
2017-03-13 11:02:48.217 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading container configuration from /home/ginter/IR_Course/solr-6.4.2/server/solr/solr.xml
2017-03-13 11:02:48.598 INFO  (main) [   ] o.a.s.u.UpdateShardHandler Creating UpdateShardHa

* Once Solr is running, you can open its admin interface at http://127.0.0.1:8983 (use whichever port you started it on)

# Practical problem:

* Solr is impractical to run in the Colab environment
* You can run it locally
* Watch the video of the lecture to see how

# Solr cores

* Data stored in *cores*
* One core - one dataset
* (collections of cores not covered here)

# Creating a core 

* Cores are like projects for Solr
* One instance of Solr can have multiple cores running simultaneously
* Each core has its own data


In [None]:
bin/solr create_core -c mytest

http://127.0.0.1:8983/solr/#/mytest
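
Besides the admin page, the cores can also be listed programmatically through Solr's CoreAdmin API (`/solr/admin/cores?action=STATUS&wt=json`). The sketch below only parses such a response. The `sample` string is a hand-written, heavily trimmed stand-in for what Solr returns, so treat its exact shape as an assumption, and `core_names` is a helper name invented here.

```python
import json

def core_names(status_json):
    """Return the sorted core names found in a CoreAdmin STATUS response."""
    data = json.loads(status_json)
    # The response keeps one entry per core under the "status" key
    return sorted(data.get("status", {}))

# Hand-written sample, trimmed down; a real response carries many more fields
sample = '{"responseHeader": {"status": 0}, "status": {"mytest": {"name": "mytest"}}}'
print(core_names(sample))
```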

# Adding data

* bin/post -c mytest /dir/with/data
* ...wait for a while, then kill it
* curl 'http://127.0.0.1:8983/solr/mytest/update?commit=true' (commits whatever was sent before the kill)
* ...and enjoy the result...
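
The same upload-and-commit step can be done without `bin/post`. A minimal sketch, assuming the core accepts a JSON array of documents on its `/update` handler and that the fields `id` and `text` suit its schema (both field names are assumptions for illustration). The request is only built here, not sent, and `build_update_request` is a made-up helper name.

```python
import json
from urllib.request import Request

def build_update_request(base_url, core, docs, commit=True):
    """Prepare a POST request that sends documents to a core's /update handler."""
    url = f"{base_url}/solr/{core}/update"
    if commit:
        # Same effect as the curl call: make the documents searchable right away
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

# Field names "id" and "text" are illustrative assumptions about the schema
req = build_update_request("http://127.0.0.1:8983", "mytest",
                           [{"id": "1", "text": "hello"}])
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it to a running Solr
```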

# Adding data #2

* Let's try to index Wikiquote
* Do so programmatically to have full control
* And use that to learn a bit about the query language


In [None]:
# Assuming we are now in the solr-6.4.2 directory
# ir_course will be the name of the core, change it to whatever you want
# -p must match the port your Solr instance is listening on
bin/solr create_core -p 16667 -c ir_course -d ~/example_config

* and now we are ready to index data to our core

# Indexing data with Solr

* Using the data we extracted from wikidumps
* Each document will have a title and a quote

Reminder of what our data looked like

In [None]:
cat extracted.txt | head -n 50



###C:Title:Main Page




 

 



###C:Start Section
Wikiquote's sister projects

###C:Start Section
Wikiquote languages


###C:Title:Albert Einstein
A hundred times every day I remind myself that my inner and outer life are based on the labors of other men, living and dead, and that I must exert myself in order to give in the same measure as I have received and am still receiving...
A happy man is too satisfied with the present to dwell too much on the future.
Albert Einstein (14 March 1879 – 18 April 1955) was a theoretical physicist who published the special and general theories of relativity and contributed in other areas of physics. He won the Nobel Prize in physics for his explanation of the photoelectric effect.
See also
Albert Einstein and politics

###C:Start Section
Quotes
Unthinking respect for authority is the greatest enemy of truth.

###C:Start Section
1890s
 Un homme heureux est trop content de la présence [du présent] pour penser beauco

In [None]:
import pysolr

def get_quote(filename):
    """Yield (quote, title) pairs from the extracted Wikiquote text file."""
    # No page seen yet; also avoids an UnboundLocalError if text precedes the first title
    title = ""
    # Assumes the extracted file is UTF-8 encoded
    with open(filename, "r", encoding="utf-8") as text_file:
        iterator = iter(text_file)  # An iterator to go through the lines
        for line in iterator:
            if line.startswith("###"):
                # Title lines with more than two colons belong to pages we don't want, e.g. ###C:Title:User:X
                if "Title" in line and line.count(":") == 2:
                    title = line.split(":")[-1].strip()
                elif "Title" in line:
                    # Empty title for the pages we don't want, so all their lines are skipped below
                    title = ""
                else:
                    # Other ### lines, e.g. ###C:Start Section, are followed by a section heading, not a quote.
                    # The None default keeps the generator from failing if the file ends here.
                    next(iterator, None)
            # Skipping empty lines, lines that are too short and all lines when we don't have a title set
            elif not line or len(line) < 20 or not title:
                continue
            else:
                # Yielding a quote and a title for it
                yield (line.strip(), title)

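if __name__ == "__main__":

    # Connecting to Solr; adjust the port if your instance does not run on the default 8983
    solr = pysolr.Solr("http://localhost:8983/solr/ir_course")

    quotes = []
    for line, title in get_quote("extracted.txt"):
        quotes.append({"source": title, "text": line})

    print("Example quote:", quotes[0])
    print("Indexing quotes...")
    # commit=True makes the documents searchable right away; newer pysolr versions do not commit by default
    solr.add(quotes, commit=True)
    print("%d quotes indexed to Solr." % len(quotes))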



Example quote: {'text': 'A hundred times every day I remind myself that my inner and outer life are based on the labors of other men, living and dead, and that I must exert myself in order to give in the same measure as I have received and am still receiving...', 'source': 'Albert Einstein'}
Indexing quotes...
91713 quotes indexed to Solr.
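
Sending all ~90k documents in a single `add` call works here, but for larger corpora it may be gentler to send them in batches and commit once at the end. A minimal sketch of that idea follows. The helper names `batches` and `index_in_batches` and the batch size are choices made for this illustration, and `FakeSolr` is a stand-in that only records calls so the example runs without a Solr server; with a real connection you would pass the `pysolr.Solr` object instead.

```python
from itertools import islice

def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def index_in_batches(solr, docs, size=1000):
    """Add documents batch by batch and commit once at the end."""
    total = 0
    for batch in batches(docs, size):
        solr.add(batch, commit=False)  # postpone the commit
        total += len(batch)
    solr.commit()
    return total

class FakeSolr:
    """Stand-in for pysolr.Solr that records calls instead of talking to a server."""
    def __init__(self):
        self.calls = []
    def add(self, docs, commit=False):
        self.calls.append(len(docs))
    def commit(self):
        self.calls.append("commit")

fake = FakeSolr()
indexed = index_in_batches(fake, ({"text": str(i)} for i in range(2500)), size=1000)
print(indexed, fake.calls)
```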


# ...now that we have some real data...

* http://www.solrtutorial.com/solr-query-syntax.html
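* text:cat (documents whose text field contains the term cat)
* text:dep*tion (wildcard for any number of characters, e.g. deprivation, depletion)
* text:dog* (prefix search: dog, dogs, dogma, ...)
* text:dog? (wildcard for exactly one character, e.g. dogs)
* text:deprivation~ (fuzzy search: also matches terms spelled similarly)
* text:bitter~0.7 (fuzzy search with an explicit similarity threshold)
* text:"salt and" (phrase search: the words must appear next to each other, in this order)
* text:"cat dog"~7 (proximity search: both words within 7 positions of each other) http://evex.utu.fi/solr/ir_course/browse?q=text%3A%22cat+dog%22~7
* text:(dog AND (cat OR fish)) (boolean operators with grouping)
* +text:cat -text:dog (must contain cat, must not contain dog)
* +text:cat +source:Hamlet (both conditions required, across two fields)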
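
Because characters such as `*`, `?`, `~`, `:` and `"` carry meaning in the query language, text typed by a user should be escaped before it is placed into a query string. The sketch below backslash-escapes the special characters of the standard query syntax. The helper names `escape_term` and `field_query` are invented for this example, and it handles a single term only (no whitespace).

```python
import re

# Characters with a special meaning in the standard query syntax
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_term(text):
    """Backslash-escape query syntax characters in a single term."""
    return _SPECIAL.sub(r"\\\1", text)

def field_query(field, term):
    """Build a field:term query from untrusted input."""
    return f"{field}:{escape_term(term)}"

print(field_query("text", "what?"))
print(field_query("source", "AC/DC"))
```

The resulting string can then be passed as the query, for example to `solr.search(...)` on the pysolr connection used for indexing.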
