# Real search engine

* Upgrade from our toy example to something real
* Test an industry-grade search engine
* Useful skill to have:
  * Work with own corpora - powerful search for free
  * Custom search and indexing
  * Especially for full-text search **much** better than SQL
  
## Solr

* Apache project
* Pure Java (yikes)
* Search engine - not so much the interface
  1. Index data
  2. Query via HTTP API
  3. Collect results in various formats http://evex.utu.fi/solr/ENCOW/select?fl=stext&indent=on&q=+stext:spin&wt=json
* And you can use the built-in query page too: http://evex.utu.fi/solr/#/ENCOW/query
* And of course you can talk to Solr also in Python

In [4]:
import requests
solr_user="solr"
solr_password="xxxxxxxxxx" #password not shown
params={"fl":"stext", "q":"+stext:spin","indent":"on","wt":"json"}
result=requests.get("http://evex.utu.fi/solr/ENCOW/select",params=params,auth=(solr_user,solr_password)) # params= puts the query into the URL
print(result.text)

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"+stext:spin",
      "indent":"on",
      "fl":"stext",
      "wt":"json"}},
  "response":{"numFound":123691,"start":0,"docs":[
      {
        "stext":"spin spin spin spin spin ."},
      {
        "stext":"spin , spin , spin , spin , spin ..."},
      {
        "stext":"spin spin spin and more spin ."},
      {
        "stext":"spin spin spin ."},
      {
        "stext":"spin spin spin ."},
      {
        "stext":"spin spin spin ."},
      {
        "stext":"spin spin spin ..."},
      {
        "stext":"spin spin spin ."},
      {
        "stext":"spin spin spin ."},
      {
        "stext":"spin spin spin ."}]
  }}
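The JSON that comes back is easy to unpack in Python. Below is a minimal sketch, assuming the response has the shape shown above. The `sample` string is a shortened copy of that output, and `summarize` is a helper name made up for this example. With a live server you would pass `result.json()` instead of parsing `sample`.

```python
import json

# Shortened copy of the response shown above (only two of the ten docs kept)
sample = """
{
  "responseHeader": {"status": 0, "QTime": 0},
  "response": {"numFound": 123691, "start": 0, "docs": [
      {"stext": "spin spin spin spin spin ."},
      {"stext": "spin spin spin and more spin ."}]}
}
"""

def summarize(solr_response):
    """Return (total hit count, list of stext values) from a parsed select response."""
    response = solr_response["response"]
    texts = [doc["stext"] for doc in response["docs"]]
    return response["numFound"], texts

num_found, texts = summarize(json.loads(sample))
print(num_found, "hits in total,", len(texts), "returned")
for text in texts:
    print("-", text)
```

Note that `numFound` counts all matching documents, while `docs` holds only the page of results that was actually returned.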



# Installing Solr

* For what it is, installing Solr is surprisingly easy
* Googling "solr download"
* http://mirror.netinch.com/pub/apache/lucene/solr/6.4.2/solr-6.4.2.tgz

In [5]:
# Download
curl -O http://mirror.netinch.com/pub/apache/lucene/solr/6.4.2/solr-6.4.2.tgz
# Unpack
tar zxf solr-6.4.2.tgz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  140M  100  140M    0     0  24.8M      0  0:00:05  0:00:05 --:--:-- 27.3M


In [6]:
# List current directory
ls

01_classroom_and_computers.ipynb   README.md
01_Intro.ipynb                     solr-6.4.2
02_boolean_model_asignments.ipynb  solr-6.4.2.tgz
02_boolean_model.ipynb             SOLR-CoNLL-U.ipynb
04_solr.ipynb                      Untitled.ipynb
fiwiki-20140809-corpus.txt.gz


In [7]:
# Change to the directory solr-6.4.2
cd solr-6.4.2
ls

bin          contrib  docs     licenses     LUCENE_CHANGES.txt  README.txt
CHANGES.txt  dist     example  LICENSE.txt  NOTICE.txt          server


* ...and that's all there is to it
* let's run it, see what happens
* ...but first we will need a little detour...

# Internet addresses and ports

* Every computer on Internet has an IP address
* E.g. 193.166.24.207
* DNS is the service that maps these addresses to human-readable names, and names back to addresses:

In [1]:
# echo "**** Look up name for a given IP ****"
echo
dig -x 193.166.24.207
echo
echo
echo "**** Look up IP for a given name ****"
echo
dig vm0964.kaj.pouta.csc.fi



; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> -x 193.166.24.207
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14419
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;207.24.166.193.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
207.24.166.193.in-addr.arpa. 1800 IN	PTR	vm0964.kaj.pouta.csc.fi.

;; AUTHORITY SECTION:
24.166.193.in-addr.arpa. 86400	IN	NS	ns2.funet.fi.
24.166.193.in-addr.arpa. 86400	IN	NS	ns-secondary.funet.fi.
24.166.193.in-addr.arpa. 86400	IN	NS	ns.funet.fi.

;; ADDITIONAL SECTION:
ns.funet.fi.		339	IN	A	128.214.46.64
ns-secondary.funet.fi.	339	IN	A	128.214.248.132

;; Query time: 22 msec
;; SERVER: 130.232.202.139#53(130.232.202.139)
;; WHEN: Thu Mar 16 22:20:43 EET 2017
;; MSG SIZE  rcvd: 193



**** Look up IP for a given name ****


; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> vm0964.kaj.pouta.csc.fi

* Another address is 127.0.0.1 - local interface not accessible from the outside
* Many local services use this one (safety, among other reasons)

* Every interface has one address, but many more active services/connections
* Traffic must not get mixed up between these
* So every interface has numbered *ports*
* ssh to a server -> talk to it on port 22
* Get a web page from a server -> talk to it on port 80
* Standard services have pre-agreed port numbers in the 0-1023 range (serving on them needs admin access)
* Any other service can open itself a port (if free) and answer requests
* Jupyter Notebook tends to use 8888, Solr likes 8983, etc...
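To make the idea of ports concrete, here is a small sketch using Python's standard `socket` module. It opens a throwaway service on the local-only address 127.0.0.1, lets the operating system pick a free port, and then connects to it as a client would. The message and variable names are only illustrative.

```python
import socket

# Bind to port 0 on the local-only interface: the operating system
# then picks some free port for us.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()
print("Listening on %s port %d" % (host, port))

# A client reaches the service by naming both the address and the port
client = socket.create_connection((host, port))
connection, client_address = server.accept()
client.sendall(b"hello")
received = connection.recv(1024)
print("Server received:", received)

# Close everything so the port becomes free again
for s in (client, connection, server):
    s.close()
```

Because the service is bound to 127.0.0.1, it can only be reached from the same machine, which is exactly the situation the next sections deal with.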

In [8]:
# Tells solr to start in foreground and listen on port 16667
# foreground -> occupies the command line and is killed when you exit it (please use!)
# background -> releases the command line, doesn't get killed when you exit it -> not what we want!
bin/solr start -f -p 16667

2017-03-13 11:02:47.824 INFO  (main) [   ] o.e.j.s.Server jetty-9.3.14.v20161028
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 6.4.2
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in standalone mode on port 16667
2017-03-13 11:02:48.168 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /home/ginter/IR_Course/solr-6.4.2
2017-03-13 11:02:48.189 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2017-03-13T11:02:48.172Z
2017-03-13 11:02:48.209 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /home/ginter/IR_Course/solr-6.4.2/server/solr
2017-03-13 11:02:48.217 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading container configuration from /home/ginter/IR_Course/solr-6.4.2/server/solr/solr.xml
2017-03-13 11:02:48.598 INFO  (main) [   ] o.a.s.u.UpdateShardHandler Creating UpdateShardHa

* After start, I can visit solr at http://127.0.0.1:16667

# Port-forwarding

* But if our service runs on 127.0.0.1:someport on a server, how can we access it?
* ssh -L 56789:127.0.0.1:8983 someserver
* Forwards all traffic from my own port 56789 to localhost:8983 on someserver
* So now I can head to http://127.0.0.1:56789 and will see there what I would see if I went to 127.0.0.1:8983 on the server
* Makes sense?  - That is how we will access our own solrs on the vm0964.kaj.pouta.csc.fi machine
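The three parts of the `-L` argument are easy to mix up, so here is a tiny sketch that assembles the command from named pieces. The helper `forward_command` is made up for this illustration; it only builds the command, it does not run ssh.

```python
import shlex

def forward_command(local_port, remote_port, server, target_host="127.0.0.1"):
    """Build the ssh command forwarding local_port to target_host:remote_port, as seen from server."""
    spec = "%d:%s:%d" % (local_port, target_host, remote_port)
    return ["ssh", "-L", spec, server]

command = forward_command(56789, 8983, "someserver")
print(" ".join(shlex.quote(part) for part in command))
print("Then browse to http://127.0.0.1:%d" % 56789)
```

For a Solr started on port 16667 as above, the call would be `forward_command(56789, 16667, "vm0964.kaj.pouta.csc.fi")`, where 56789 is an arbitrary free port on your own machine.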

# Solr cores

* Data stored in *cores*
* One core - one dataset
* (collections not covered here)

# Creating a core 

* Cores are like projects for Solr
* One instance of Solr can have multiple cores running simultaneously
* Each core has its own data


In [None]:
bin/solr create_core -c mytest

http://127.0.0.1:16667/solr/#/mytest

# Adding data

* bin/post -p 16667 -c mytest /home/ginter
* .... wait for a while, kill
* curl http://127.0.0.1:16667/solr/mytest/update?commit=true
* ...and enjoy the result...

# Adding data #2

* Let's try to index Wiki Quotes
* Do so programmatically to have full control
* And use that to learn a bit about the query language

* Ready-made config files prepared for you:

In [None]:
curl -O http://bionlp-www.utu.fi/.avjves/config_files.tgz
tar -zxf config_files.tgz
mv conf example_config

In [None]:
# Assuming we are now in the solr-6.4.2 directory
# ir_course will be the name of the core, change it to whatever you want
bin/solr create_core -p 16667 -c ir_course -d ~/example_config

* and now we are ready to index data to our core

# Indexing data with Solr

* Using the data we extracted from wikidumps
* Each document will have a title and a quote

Reminder of what our data looked like

In [15]:
cat extracted.txt | head -n 50



###C:Title:Main Page




 

 



###C:Start Section
Wikiquote's sister projects

###C:Start Section
Wikiquote languages


###C:Title:Albert Einstein
A hundred times every day I remind myself that my inner and outer life are based on the labors of other men, living and dead, and that I must exert myself in order to give in the same measure as I have received and am still receiving...
A happy man is too satisfied with the present to dwell too much on the future.
Albert Einstein (14 March 1879 – 18 April 1955) was a theoretical physicist who published the special and general theories of relativity and contributed in other areas of physics. He won the Nobel Prize in physics for his explanation of the photoelectric effect.
See also
Albert Einstein and politics

###C:Start Section
Quotes
Unthinking respect for authority is the greatest enemy of truth.

###C:Start Section
1890s
 Un homme heureux est trop content de la présence [du présent] pour penser beauco

In [16]:
import pysolr

def get_quote(filename):
    title = "" # No title seen yet, so lines before the first title are skipped
    with open(filename, "r", encoding="utf-8") as text_file:
        iterator = iter(text_file) # An iterator to go through the lines
        for line in iterator:
            if line.startswith("###"):
                # Title lines with more than two colons are pages that we don't want, e.g. C:Title:User:X
                if "Title" in line and line.count(":") == 2:
                    title = line.split(":")[-1].strip()
                elif "Title" in line:
                    # No title for the pages we don't want, so we can easily skip all their lines
                    title = ""
                else:
                    # Skipping one extra line if no Title in the ### line, e.g. C:Start Section, as the next line is a section name, not a quote
                    next(iterator, None) # The default avoids an error if the file ends here
            # Skipping empty lines, lines that are too short and all lines when we don't have a title set
            elif not line or len(line) < 20 or not title:
                continue
            else:
                # Yielding a quote and a title for it
                yield (line.strip(), title)

if __name__ == "__main__":

    # Connecting to solr; use the port your Solr listens on (16667 if started as above)
    solr = pysolr.Solr("http://localhost:8983/solr/ir_course")

    quotes = []
    for line, title in get_quote("extracted.txt"):
        quotes.append({"source" : title, "text": line})
        
    print("Example quote:", quotes[0])
    print("Indexing quotes...")
    solr.add(quotes, commit=True) # Commit explicitly: recent pysolr versions do not commit on add by default
    print("%d quotes indexed to Solr." % len(quotes))



Example quote: {'text': 'A hundred times every day I remind myself that my inner and outer life are based on the labors of other men, living and dead, and that I must exert myself in order to give in the same measure as I have received and am still receiving...', 'source': 'Albert Einstein'}
Indexing quotes...
91713 quotes indexed to Solr.
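The script above sends all quotes in a single request. For larger corpora it may be gentler to send the documents in batches and commit once at the end. The following is a sketch of that idea, not the method used above: `batches`, `index_in_batches` and the batch size are made up for illustration, and `FakeSolr` is a stand-in so that the example runs without a server. With a real `pysolr.Solr` object, the `add(..., commit=False)` and `commit()` calls are the ones that matter.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def index_in_batches(solr, docs, batch_size=1000):
    """Send docs to Solr batch by batch, then commit once at the end."""
    sent = 0
    for batch in batches(docs, batch_size):
        solr.add(batch, commit=False)
        sent += len(batch)
    solr.commit()
    return sent

class FakeSolr:
    """Stand-in for pysolr.Solr that only records what it was asked to do."""
    def __init__(self):
        self.batch_sizes = []
        self.commits = 0

    def add(self, docs, commit=False):
        self.batch_sizes.append(len(docs))

    def commit(self):
        self.commits += 1

fake = FakeSolr()
docs = [{"source": "Example", "text": "quote number %d" % i} for i in range(2500)]
sent = index_in_batches(fake, docs)
print(sent, "documents sent in batches of", fake.batch_sizes)
```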


# ...now that we have some real data...

* http://www.solrtutorial.com/solr-query-syntax.html
* text:cat
* text:dep*tion
* text:dog*
* text:dog?
* text:deprivation~
* text:bitter~0.7
* text:"salt and"
* text:"cat dog"~7 http://evex.utu.fi/solr/ir_course/browse?q=text%3A%22cat+dog%22~7
* text:(dog AND (cat OR fish))
* +text:cat -text:dog
* +text:cat +source:Hamlet
