In [1]:
import time
start_time = time.time()

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

## SIGIR 2017 - Candidate Selection for Personalized Search and Recommender Systems

https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial

**Abstract**

Modern day social media search and recommender systems require complex query formulation that incorporates both user context and their explicit search queries. Users expect these systems to be fast and provide relevant results to their query and context. With millions of documents to choose from, these systems utilize a **multi-pass scoring function to narrow the results** and provide the most relevant ones to users. **Candidate selection** is required to sift through all the documents in the index and select a relevant few to be ranked by subsequent scoring functions. It becomes crucial to narrow down the document set while maintaining relevant ones in resulting set. In this tutorial we survey various candidate selection techniques and deep dive into case studies on a large scale social media platform. In the later half we provide hands-on tutorial where we explore building these candidate selection models on a real world dataset and see how to balance the tradeoff between relevance and latency.
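To make the multi-pass idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the tutorial's implementation: the tiny in-memory index, the function names (`build_inverted_index`, `select_candidates`, `rank`) and the length-normalized score are all assumptions made for this example. The first pass cheaply selects documents containing every query token, and the second pass scores only those candidates.

```python
from collections import defaultdict

# Three lowercased titles from the dataset used later in this notebook.
DOCS = {
    1: "fed official says weak data caused by weather, should not slow taper",
    2: "fed's charles plosser sees high bar for change in pace of tapering",
    3: "us open: stocks fall after fed official hints at accelerated tapering",
}


def build_inverted_index(docs):
    """Map each whitespace-separated token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def select_candidates(index, query):
    """First pass: keep only the documents that contain every query token."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def rank(docs, candidates, query):
    """Second pass: a toy score (matched tokens divided by document length)."""
    query_tokens = set(query.lower().split())

    def score(doc_id):
        tokens = docs[doc_id].split()
        matches = sum(1 for token in tokens if token in query_tokens)
        return matches / len(tokens)

    return sorted(candidates, key=score, reverse=True)


index = build_inverted_index(DOCS)
candidates = select_candidates(index, "fed official")
ranking = rank(DOCS, candidates, "fed official")
print(candidates, ranking)
```

Only the candidates reach the scoring function, which is where the latency saving comes from when the second pass is expensive.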

**Presenters**
Dhruv Arya, Ganesh Venkataraman, Aman Grover, Krishnaram Kenthapadi, Yiqun Liu

In [2]:
# mine
from nltk.tag import StanfordNERTagger  # they used this for NER
from nltk.tokenize import word_tokenize

text = """While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall 
Street Journal."""

tokenized_text = word_tokenize(text)

#################  try again later with the StanfordNERTagger ##########################

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

[(ent.text, ent.label_) for ent in nlp(text).ents]

[('France', 'GPE'),
 ('Christine Lagarde', 'PERSON'),
 ('the Wall \nStreet Journal', 'ORG')]

**Building blocks**

In [3]:
# In our tutorial we will be using the following dependencies:
import pandas as pd
import pysolr
import web
import nltk

import spacy
from nltk.tag import StanfordNERTagger  # they used this for NER later on

## Assignment 0 & 1

- Assignment 0: https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial/tree/master/assignments/assignment0
- Assignment 1: https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial/tree/master/assignments/assignment1

**Dataset**

We will be using the open source News Aggregator Dataset. It contains references to news pages collected from a web aggregator between 10 March 2014 and 10 August 2014. The resources are grouped into clusters that represent pages discussing the same story.

Full details about the dataset can be found at UCI Machine Learning Repository - News Aggregator Dataset

http://archive.ics.uci.edu/ml/datasets/News+Aggregator#
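As a quick sanity check of the collection period, the sketch below converts a TIMESTAMP value into a date. It assumes that the TIMESTAMP column holds Unix epoch time in milliseconds; the helper name `to_utc_datetime` is hypothetical, and the sample value is the timestamp of the first record shown later in this notebook.

```python
from datetime import datetime, timezone


def to_utc_datetime(timestamp_ms):
    """Convert an epoch timestamp in milliseconds (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(int(timestamp_ms) / 1000, tz=timezone.utc)


# TIMESTAMP of the first record in newsCorpora.csv
first_seen = to_utc_datetime("1394470370698")
print(first_seen.date())
```

Under that assumption, the first record falls on the first day of the stated period.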

#### get dataset

In [4]:
df = pd.read_csv('2pageSessions.csv', sep='\t', header=None, engine='python')
df.head()

Unnamed: 0,0,1,2,3
0,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,techcrunch.com,b,http://techcrunch.com/ http://techcrunch.com/2...
1,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,techcrunch.com,b,http://techcrunch.com/ecommerce/ http://techcr...
2,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,www.bnn.ca,b,http://www.bnn.ca/News/2014/ http://www.bnn.ca...
3,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,www.bnn.ca,b,http://www.bnn.ca/news http://www.bnn.ca/News/...
4,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,www.bnn.ca,b,http://www.bnn.ca/News/News-Listing.aspx?Secto...


In [5]:
df_2 = pd.read_csv('newsCorpora.csv', sep='delimiter', header=None, engine='python', error_bad_lines=False)
df_2.head()

Skipping line 45072: Expected 1 fields in line 45072, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 318471: Expected 1 fields in line 318471, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


Unnamed: 0,0
0,1\tFed official says weak data caused by weath...
1,2\tFed's Charles Plosser sees high bar for cha...
2,3\tUS open: Stocks fall after Fed official hin...
3,"4\tFed risks falling 'behind the curve', Charl..."
4,5\tFed's Plosser: Nasty Weather Has Curbed Job...


In [6]:
HEADERS = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]

# check data
df_news = pd.read_csv('newsCorpora.csv', sep='\t', header=None, engine='python', 
                      error_bad_lines=False, warn_bad_lines=False)

# shape
print(df_news.shape)

# display head
df_news.columns = HEADERS.copy()
display(df_news.head(3))

(421493, 8)


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550


#### setup solr

https://lucene.apache.org/solr/guide/6_6/getting-started.html

**Setup a solr instance and create index schema for the dataset.**

In [7]:
from __future__ import unicode_literals

from subprocess import call
import argparse
import csv
import pysolr

INDEX_NAME = 'simpleindex'
INDEX_MAP = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
SOLR_URL = 'http://localhost:8983/solr'

data_folder = """C:/Users/Admin/OneDrive/Ikari_Technology_Solutions/Tutorial_SIGIR 2017 - Candidate Selection for Personalized Search and Recommender Systems/"""

# folder where Solr is installed -> path to the command in the bin folder
solr_cmd = data_folder + "solr-8.6.0/bin/solr.cmd"

# data file name
df_news_csv = data_folder + 'newsCorpora.csv'

In [8]:
# function that builds the structure of a document to be inserted into the index
def create_document(record):
    """
    This function creates a representation for the document to be put in the solr index.
    """    
    #Write an iterator over the INDEX_MAP to fetch fields from the record and return a dictionary representing the document.
    document = {}
    for idx, field in enumerate(INDEX_MAP):
        if field.lower() == 'id':
            document[field.lower()] = record[idx]
        else:
            document["_news_%s" % (field.lower())] = record[idx].lower()
    return document

In [9]:
# solr object
print(pysolr.Solr(url="%s/%s" % (SOLR_URL, INDEX_NAME)), "\n")

# look at the first record only
for i in csv.reader(open(df_news_csv, encoding='utf-8'), delimiter='\t'):
    if i[0] == '2':
        break
    else:
        print(i)    

<pysolr.Solr object at 0x000001D093CFCDC8> 

['1', 'Fed official says weak data caused by weather, should not slow taper', 'http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\\?track=rss', 'Los Angeles Times', 'b', 'ddUyU0VZz0BRneMioxUPQVP6sIxvM', 'www.latimes.com', '1394470370698']


In [10]:
# document structure (example: the first record)
create_document(['1', 'Fed official says weak data caused by weather, should not slow taper', 'http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\\?track=rss', 'Los Angeles Times', 'b', 'ddUyU0VZz0BRneMioxUPQVP6sIxvM', 'www.latimes.com', '1394470370698'])

{'id': '1',
 '_news_title': 'fed official says weak data caused by weather, should not slow taper',
 '_news_url': 'http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\\?track=rss',
 '_news_publisher': 'los angeles times',
 '_news_category': 'b',
 '_news_story': 'dduyu0vzz0brnemioxupqvp6sixvm',
 '_news_hostname': 'www.latimes.com',
 '_news_timestamp': '1394470370698'}

In [11]:
# function that creates a core and inserts the documents into the index
def index(input_file, num_records):
    """
    Creates a representation of the document and puts the document in the solr index. 
    The index name is defined as a part of the url.
    """

    # create the solr core (manually: from the command line, in bin, "solr create -c simpleindex")
    call(["{}".format(solr_cmd), "create", "-c", INDEX_NAME])
    
    # Create a client instance
    solr_interface = pysolr.Solr(url="%s/%s" % (SOLR_URL, INDEX_NAME))
    
    # index data
    with open(input_file, encoding='utf-8') as csvfile:
        records = csv.reader(csvfile, delimiter='\t')
        batched_documents = []
        for idx, record in enumerate(records):
            if idx == num_records:
                break
            # Write code for creating the document and passing it for indexing
            else:
                batched_documents.append(create_document(record))
                print("Added document %d to the %s index" % (idx, INDEX_NAME))
    
    # and finally index the complete list (batched_documents)
    solr_interface.add(batched_documents)
            
    # Commit the changes to the index after adding the documents
    solr_interface.commit()
    print('Finished adding the documents to the solr index')
    return

In [12]:
%%time

# start the server (or, from the command line, in the folder where Solr is installed: bin\solr.cmd start )
call(["{}".format(solr_cmd), "start"])

Wall time: 1.73 s


1

In [13]:
%%time

# create the solr core and simple index
index(df_news_csv, num_records=15)

Added document 0 to the simpleindex index
Added document 1 to the simpleindex index
Added document 2 to the simpleindex index
Added document 3 to the simpleindex index
Added document 4 to the simpleindex index
Added document 5 to the simpleindex index
Added document 6 to the simpleindex index
Added document 7 to the simpleindex index
Added document 8 to the simpleindex index
Added document 9 to the simpleindex index
Added document 10 to the simpleindex index
Added document 11 to the simpleindex index
Added document 12 to the simpleindex index
Added document 13 to the simpleindex index
Added document 14 to the simpleindex index
Finished adding the documents to the solr index
Wall time: 6.79 s


In [14]:
# go to the chrome app
print(SOLR_URL)

# solr interface created
solr_interface = pysolr.Solr(url=SOLR_URL + "/{}".format(INDEX_NAME))

# results - search all (*); by default 10 results are returned
len(solr_interface.search("*"))

http://localhost:8983/solr


10

What we did above (we started by building a very basic document with the prefix _news_ and used pysolr for batch indexing the documents) corresponds to the following schema setup (see the tutorial at https://lucene.apache.org/solr/guide/8_6/solr-tutorial.html):


curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
     "name":"simple_indexed_text",
     "class":"solr.TextField",
     "positionIncrementGap":"100",
     "analyzer" : {
        "tokenizer":{ 
           "class":"solr.WhitespaceTokenizerFactory" }
      }}
}' http://localhost:8983/solr/simpleindex/schema

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
     "name":"_news_*",
     "type":"simple_indexed_text",
     "indexed":true,
     "stored":true }
}' http://localhost:8983/solr/simpleindex/schema
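The two curl commands above can also be issued from Python. The sketch below is one possible way to do it with the standard library only; the helper names `build_schema_request` and `apply_schema` are hypothetical, and the payloads are copied from the curl commands. The call that actually contacts Solr is left commented out because it needs a running server with the `simpleindex` core.

```python
import json
import urllib.request

SCHEMA_URL = "http://localhost:8983/solr/simpleindex/schema"

# Payloads copied from the curl commands above.
FIELD_TYPE = {
    "add-field-type": {
        "name": "simple_indexed_text",
        "class": "solr.TextField",
        "positionIncrementGap": "100",
        "analyzer": {"tokenizer": {"class": "solr.WhitespaceTokenizerFactory"}},
    }
}
DYNAMIC_FIELD = {
    "add-dynamic-field": {
        "name": "_news_*",
        "type": "simple_indexed_text",
        "indexed": True,
        "stored": True,
    }
}


def build_schema_request(payload, url=SCHEMA_URL):
    """Build (but do not send) a JSON POST request for the schema endpoint."""
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def apply_schema(payloads, url=SCHEMA_URL):
    """Send each payload in order; requires a running Solr instance."""
    for payload in payloads:
        with urllib.request.urlopen(build_schema_request(payload, url)) as response:
            print(response.status, response.read().decode("utf-8"))


# apply_schema([FIELD_TYPE, DYNAMIC_FIELD])  # uncomment once Solr is running
```

The field type has to be added before the dynamic field that refers to it, which is why the payloads are sent in that order.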

### Exploring the News Aggregator Dataset

The first step in building a search index is to understand the dataset and the fields that you want users to be able to search on. To do this we will read the first few records of the dataset into tabular form. This will allow us to understand what the data looks like.

This analysis allows us to understand to what degree the following tasks are needed:

- Tokenization and Segmentation
- Term Normalization
- Data transformation
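A minimal sketch of what these three tasks can look like on this dataset, in plain Python. The helper names (`tokenize`, `normalize`, `clean_url`) are hypothetical, and treating the backslash before `?` in some URLs as an escaping artifact is an assumption based on the records printed in this notebook.

```python
import re


def tokenize(text):
    """Tokenization: split on whitespace, like a whitespace tokenizer would."""
    return text.split()


def normalize(token):
    """Term normalization: lowercase and strip leading/trailing punctuation."""
    return re.sub(r"^\W+|\W+$", "", token.lower())


def clean_url(url):
    """Data transformation: drop the backslash that precedes '?' in some URLs."""
    return url.replace("\\?", "?")


title = "Fed official says weak data caused by weather, should not slow taper"
tokens = [normalize(token) for token in tokenize(title)]
print(tokens)
print(clean_url("http://www.latimes.com/example.story\\?track=rss"))
```

How much of this is needed depends on the field: titles benefit from normalization, while an identifier such as STORY is better left as a single token.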

We have provided you with a basic script that prints out the data in a tabular form along with some statistics about field values.

To run the script issue the following command or run it with arguments from PyCharm.

cd ~/workspace/candidate-selection-tutorial/assignments/assignment1/exercise/src 
python understand_data.py --input /home/sigir/workspace/candidate-selection-tutorial/finished-product/data/news-aggregator-dataset/newsCorpora.csv

In [15]:
# understand_data.py ---> does the following:

import csv
from prettytable import PrettyTable

num_records = 10

with open(df_news_csv) as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    pretty_table = PrettyTable()
    pretty_table.field_names = HEADERS
    for count, row in enumerate(csvreader):
        if count == num_records:
            break
        pretty_table.add_row(row)
    print(pretty_table)

# Category values
print('\nCategory values in the dataset:')
print(df_news.CATEGORY.value_counts())

# hostnames (most common)
print('\nMost common hostnames in the dataset:')
print(df_news.HOSTNAME.value_counts()[:10])

+----+--------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+----------+-------------------------------+------------------------------+---------------+
| ID |                                  TITLE                                   |                                                                              URL                                                                               |      PUBLISHER       | CATEGORY |             STORY             |           HOSTNAME           |   TIMESTAMP   |
+----+--------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+----------+------------

In [16]:
df_news.iloc[0]['URL'].split(',')  

['http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310',
 '0',
 '1312750.story\\?track=rss']

In [17]:
# records whose URL contains commas, i.e. extra information beyond the URL itself:
corrigir_url = [i for i in range(len(df_news)) if len(df_news.iloc[i]['URL'].split(',')) > 1]

# there are more of them..
print(len(corrigir_url)) # ~1%

4343


#### stanford_title_ner_tags_case_sensitive

In [18]:
# stanford_title_ner_tags_case_sensitive
stanford_title_NER = pd.read_csv('stanford_title_ner_tags_case_sensitive.csv', sep='\t', header=None, engine='python', 
                                 error_bad_lines=False, warn_bad_lines=False)

stanford_title_NER.head()

Unnamed: 0,0,1
0,1,"{""ORGANIZATION"": [""Fed""]}"
1,2,"{""PERSON"": [""Charles Plosser""]}"
2,3,"{""ORGANIZATION"": [""Fed""], ""LOCATION"": [""US""]}"
3,4,"{""PERSON"": [""Charles Plosser""]}"
4,5,{}


In [19]:
print(len(stanford_title_NER)); 
print(len(df_news))

422419
421493


In [20]:
text_ex = df_news['TITLE'][0]
tokenized_text = word_tokenize(text_ex)

#################  try again later with the StanfordNERTagger ##########################

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

print(text_ex)
[(ent.text, ent.label_) for ent in nlp(text_ex).ents]

Fed official says weak data caused by weather, should not slow taper


[('Fed', 'ORG')]

In [21]:
text_ex = df_news['TITLE'][2]
tokenized_text = word_tokenize(text_ex)
import spacy
nlp = spacy.load('en_core_web_sm')
print(text_ex)
[(ent.text, ent.label_) for ent in nlp(text_ex).ents]

US open: Stocks fall after Fed official hints at accelerated tapering


[('US', 'GPE'), ('Fed', 'ORG')]

In [22]:
text_ex = df_news['TITLE'][4]
tokenized_text = word_tokenize(text_ex)
import spacy
nlp = spacy.load('en_core_web_sm')
print(text_ex)
[(ent.text, ent.label_) for ent in nlp(text_ex).ents]  # ----> Stanford NER did not catch this one

Fed's Plosser: Nasty Weather Has Curbed Job Growth


[('Fed', 'ORG')]

### Query Rewriting and Searching

In this task we will connect our middle tier and the frontend to the index. We will accept the query from the search frontend, rewrite the query to search our index and send the results back to the frontend for display.

As the next step, open the file **frontend/app.py**. This is our middle tier, which serves the search requests and talks to the search backend. In this file you need to write the **GET** method of the **SearchSimpleIndex** class.

- **Match All Query** - This query will be useful for serving queries with no keywords. Refer to the Solr documentation on how to construct it.
- **Text Based Query** - For queries with keywords we will make use of the catch-all field in Solr. All content is indexed in a predefined "catch-all" "_text_" field, which enables single-field search across all fields' content. The query should have the form:

Query: la times <br>
Tokens: ['la', 'times'] <br>
Generated Query: _text_:la AND _text_:times
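The rewriting step above can be sketched as a small standalone function. This is a simplified illustration: it splits on whitespace instead of calling NLTK's `word_tokenize`, the function name `build_text_query` is hypothetical, and it does not escape Solr special characters.

```python
def build_text_query(query, field="_text_"):
    """Turn a keyword query into one clause per token, joined with AND."""
    tokens = query.split()
    if not tokens:
        # No keywords: fall back to Solr's match-all query.
        return "*:*"
    return " AND ".join("%s:%s" % (field, token) for token in tokens)


print(build_text_query("la times"))
print(build_text_query(""))
```

Keeping the rewriting in a pure function like this makes it easy to check the generated query string before wiring it into the GET handler.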

**Running the server** <br>
To see the search in action follow the commands below:

cd frontend<br>
python app.py <br>

This should run the simple index search server on http://0.0.0.0:8080. The page should look like the image below. Try out some queries to see if you are getting results back.

In [23]:
word_tokenize("la times")

['la', 'times']

In [24]:
import web
import pysolr
import json
from nltk.tokenize import word_tokenize

# CHANGED THE CODE - REMOVED THE draw PART AND DID NOT USE get_web_input

urls = ('/', 'SimpleIndexSearchPage', '/searchSimpleIndex', 'SearchSimpleIndex',)

CATEGORY = {'b': 'Business', 'e': 'Entertainment', 't': 'Science and Technology', 'm': 'Health'}
# render = web.template.render(data_folder + "candidate-selection-tutorial/assignments/assignment1/exercise/src/frontend/templates/",
#                              base='layout')
SOLR_SIMPLEINDEX = pysolr.Solr('http://localhost:8983/solr/simpleindex')


def get_web_input(web_input):
    draw = web_input['draw']
    query = web_input['search[value]']
    offset = web_input['start']
    count = web_input['length']
    return draw, query, offset, count


def search(query, offset, count, solr_endpoint):
    """
    This function is responsible for hitting the solr endpoint and returning the results back.
    """
    results = solr_endpoint.search(q=query, **{'start': int(offset), 'rows': int(count)})
    print("Saw {0} result(s) for query {1}.".format(len(results), query))
    
    formatted_hits = []
    for hit in results.docs:
        formatted_hits.append(
            [hit['_news_title'], hit['_news_publisher'], CATEGORY[hit['_news_category'][0]], hit['_news_url']])
    response = {'recordsFiltered': results.hits,
                'data': formatted_hits}
#     web.header('Content-Type', 'application/json')
    return json.dumps(response)


class SimpleIndexSearchPage:
    def GET(self):
        return render.simpleIndexSearchPage()


class SearchSimpleIndex:
    def GET(self):
        # get_web_input returns four values; draw is not needed here
        _draw, query, offset, count = get_web_input(web_input=web.input())
        # Handle the empty query (no keywords) with Solr's match-all query
        if len(query) == 0 or query == '*:*':
            return search(query='*:*', offset=offset, count=count, solr_endpoint=SOLR_SIMPLEINDEX)
        
        # Tokenize the search query and create a must clause for each token
        clauses = []
        for token in word_tokenize(query):
            clauses.append("+_text_:%s" % token)
        query = " AND ".join(clauses)
        return search(query=query, offset=offset, count=count, solr_endpoint=SOLR_SIMPLEINDEX)
        

In [25]:
search("_news_publisher:livemint", offset=0, count=10, solr_endpoint=SOLR_SIMPLEINDEX)

Saw 1 result(s) for query _news_publisher:livemint.


'{"recordsFiltered": 1, "data": [[["fed\'s charles plosser sees high bar for change in pace of tapering"], ["livemint"], "Business", ["http://www.livemint.com/politics/h2evwjsk2ve6of7ik1g3pp/feds-charles-plosser-sees-high-bar-for-change-in-pace-of-ta.html"]]]}'

In [26]:
search("_news_publisher:times", offset=0, count=10, solr_endpoint=SOLR_SIMPLEINDEX)

Saw 2 result(s) for query _news_publisher:times.


'{"recordsFiltered": 2, "data": [[["us jobs growth last month hit by weather:fed president charles plosser"], ["economic times"], "Business", ["http://economictimes.indiatimes.com/news/international/business/us-jobs-growth-last-month-hit-by-weatherfed-president-charles-plosser/articleshow/31788000.cms"]], [["fed official says weak data caused by weather, should not slow taper"], ["los angeles times"], "Business", ["http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\\\\?track=rss"]]]}'

In [27]:
search("_news_publisher:times AND _news_title:jobs", offset=0, count=10, solr_endpoint=SOLR_SIMPLEINDEX)

Saw 1 result(s) for query _news_publisher:times AND _news_title:jobs.


'{"recordsFiltered": 1, "data": [[["us jobs growth last month hit by weather:fed president charles plosser"], ["economic times"], "Business", ["http://economictimes.indiatimes.com/news/international/business/us-jobs-growth-last-month-hit-by-weatherfed-president-charles-plosser/articleshow/31788000.cms"]]]}'

In [28]:
%%time
# could adjust the function later (is search the same as doing .contains('x')?)
len([i for i in df_news['TITLE'][:10] if 'Fed' in i])

Wall time: 1e+03 µs


9

## Assignment 2

https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial/tree/master/assignments/assignment2

In this assignment we will build a better index with fields specific to the entities recognized in the title. We will make use of Stanford NER along with NLTK. In addition to building the index, we will use the entities in the incoming query to write a field-specific query that matches them against the index fields containing those entities.

### Building the Search Index

We will use what we learned in the previous assignment to build the entity-based search index. To save time, we have pregenerated the entity tags using the **Stanford NER** library and the english.all.3class.distsim.crf.ser.gz classifier. The classifier provides three tags, namely

PERSON <br>
ORGANIZATION <br>
LOCATION <br>

When building the index we will read through the tags and our dataset simultaneously. This will allow us to use the pregenerated tags when building the document to be indexed.
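Reading the two files side by side can be sketched as follows. The sample rows are trimmed to two columns for brevity, the helper name `pair_records_with_tags` is hypothetical, and the ID check assumes that the first column of the tags file is the record ID, which is what the rows shown earlier suggest.

```python
import csv
import io
import json


def pair_records_with_tags(records_file, tags_file):
    """Yield (record, tags) pairs, checking that both files stay aligned by ID."""
    records = csv.reader(records_file, delimiter="\t")
    tag_rows = csv.reader(tags_file, delimiter="\t")
    for record, tag_row in zip(records, tag_rows):
        if record[0] != tag_row[0]:
            raise ValueError("ID mismatch: %s vs %s" % (record[0], tag_row[0]))
        yield record, json.loads(tag_row[1])


# In-memory stand-ins for newsCorpora.csv and the pregenerated tags file.
records_tsv = (
    "1\tFed official says weak data caused by weather, should not slow taper\n"
    "2\tFed's Charles Plosser sees high bar for change in pace of tapering\n"
)
tags_tsv = (
    '1\t{"ORGANIZATION": ["Fed"]}\n'
    '2\t{"PERSON": ["Charles Plosser"]}\n'
)

pairs = list(pair_records_with_tags(io.StringIO(records_tsv), io.StringIO(tags_tsv)))
for record, tags in pairs:
    print(record[0], tags)
```

The explicit ID check is a design choice for this sketch: it fails loudly if a skipped or malformed line ever shifts one file relative to the other.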

Open the file **entity_aware_index.py**. You will need to implement the following parts:

- Writing the function for creating the document. Similar to Assignment 1 we will be building a dictionary with all the index fields. Our focus here will be to add additional title fields specifically for the NER tags. The specific fields **_news_title_person, _news_title_organization and _news_title_location** need to be added in addition to **_news_title**.

Your documents should have a structure similar to the one below -

{
	"_news_url": "http://www.ifamagazine.com/news/us-open-stocks-fall-after-fed-official-hints-at-accelerated-tapering-294436",<br>
	"_news_title_organization": "Fed",<br>
	"_news_title": "us open: stocks fall after fed official hints at accelerated tapering",<br>
	"_news_story": "dduyu0vzz0brnemioxupqvp6sixvm",<br>
	"_news_category": "b",<br>
	"_news_hostname": "www.ifamagazine.com",<br>
	"_news_publisher": "ifa magazine",<br>
	"_news_timestamp": "1394470371550",<br>
	"id": "3",<br>
	"_news_title_location": "US"<br>
}<br>

- The second task involves writing code for adding a document to the Solr index. You can reuse your code from Assignment 1 here.

To begin indexing, follow the commands listed below. You need to be in the assignment2 folder to run the commands.

In [29]:
from subprocess import call
import argparse
import json
import csv
import pysolr
import gzip

INDEX_MAP = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
SOLR_URL = 'http://localhost:8983/solr'

# Location, Time, Person, Organization, Money, Percent, Date (Stanford NER)

# Person, Norp (Nationalities or religious or political groups.), Facility, Org, GPE (Countries, cities, states.)
# Loc (Non GPE Locations ex. mountain ranges, water), Product (Objects, vehicles, foods, etc. (Not services.),
# EVENT (Named hurricanes, battles, wars, sports events, etc.), WORK_OF_ART (Titles of books, songs, etc), LANGUAGE
# Refer to https://spacy.io/docs/usage/entity-recognition (SPACY NER)


def create_document_ner(record, ner_tag):
    """
    This function creates a representation for the document to be put in the solr index.
    """
    document = {}
    for idx, field in enumerate(INDEX_MAP):
        if field.lower() == 'id':
            document[field.lower()] = record[idx]
        else:
            document["_news_%s" % (field.lower())] = record[idx].lower()
            
    # add _news_title_<NER tag> fields, e.g. title_person, title_organization, title_location
    for tag, entities in json.loads(ner_tag).items():
        document["_news_title_%s" % (tag.lower())] = entities

    return document


In [30]:
# create new index
INDEX_NAME = 'entityawareindex'

# function
def index_ner(input_file, ner_tags_filename, num_records):
    """
    Creates a representation of the document and puts the document in the solr index. 
    The index name is defined as a part of the url.
    """
    
    # create the solr core 
    call(["{}".format(solr_cmd), "create", "-c", INDEX_NAME])
    
    # Create a client instance
    solr_interface = pysolr.Solr(url="%s/%s" % (SOLR_URL, INDEX_NAME))
    
    # index data
    with open(input_file, encoding='utf-8') as csvfile, open(ner_tags_filename, encoding='utf-8') as ner_tags_file:
        records = csv.reader(csvfile, delimiter='\t')
        ner_tags = csv.reader(ner_tags_file, delimiter='\t')
        batched_documents = []
        for idx, (record, ner_tag) in enumerate(zip(records, ner_tags)):
            if idx == num_records:
                break
            
            else:
                # remember to use ner_tag[1] (the first column of the file is always a number)
                batched_documents.append(create_document_ner(record, ner_tag[1]))
                print("Added document %d to the %s index" % (idx, INDEX_NAME))
    
    # and finally index the complete list (batched_documents)
    solr_interface.add(batched_documents)
                
    # Commit the changes to the index after adding the documents
    solr_interface.commit()
    print('Finished adding the documents to the solr index')
    return


In [31]:
%%time

ner_tags_filename = 'stanford_title_ner_tags_case_sensitive.csv'

# create the solr core and simple index
index_ner(df_news_csv, ner_tags_filename, num_records=15)

Added document 0 to the entityawareindex index
Added document 1 to the entityawareindex index
Added document 2 to the entityawareindex index
Added document 3 to the entityawareindex index
Added document 4 to the entityawareindex index
Added document 5 to the entityawareindex index
Added document 6 to the entityawareindex index
Added document 7 to the entityawareindex index
Added document 8 to the entityawareindex index
Added document 9 to the entityawareindex index
Added document 10 to the entityawareindex index
Added document 11 to the entityawareindex index
Added document 12 to the entityawareindex index
Added document 13 to the entityawareindex index
Added document 14 to the entityawareindex index
Finished adding the documents to the solr index
Wall time: 4.74 s


In [32]:
# go to the chrome app
print(SOLR_URL)

# solr interface created
solr_interface = pysolr.Solr(url=SOLR_URL + "/{}".format(INDEX_NAME))

# results - search all (*); by default 10 results are returned
len(solr_interface.search("*"))

http://localhost:8983/solr


10

### Query Rewriting and Searching

In this task we will define a query that is matched with the document on specific fields. We will make use of our entity understanding and utilize the Stanford NER server at runtime to generate tags.

As the next step, open the file frontend/app.py. This is our middle tier, which serves the search requests and talks to the search backend. In this file you need to write the GET method of the SearchEntityAwareIndex class.

- **Match All Query** - Similar to Assignment 1 add the logic to serve results when the query is empty.
- **Entity & Field Based Query** - For queries with keywords we will make use of the catch-all field in Solr: all content is indexed in a predefined "catch-all" _text_ field, which enables single-field search across all fields' content. Entities recognized in the query are matched against their entity-specific fields. The query should have the form:

Query: cooperman paypal <br>
Tokens: ['cooperman', 'paypal'] <br>
NER Tags: {"ORGANIZATION": ["PayPal"], "PERSON": ["Cooperman"]} <br>
Generated Query: _news_title_organization:paypal AND _news_title_person:Cooperman <br>
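A minimal sketch of how the NER tags could be turned into such a field-specific query. The function name `build_entity_query` and the `TAG_TO_FIELD` mapping are assumptions made for this example; entity casing is left exactly as the tagger returned it, and multi-word entities are wrapped in quotes so that they are searched as a phrase.

```python
# Hypothetical mapping from Stanford NER tags to the index fields named above.
TAG_TO_FIELD = {
    "PERSON": "_news_title_person",
    "ORGANIZATION": "_news_title_organization",
    "LOCATION": "_news_title_location",
}


def build_entity_query(ner_tags):
    """Build one clause per recognized entity, joined with AND."""
    clauses = []
    for tag in sorted(ner_tags):
        field = TAG_TO_FIELD.get(tag)
        if field is None:
            continue  # skip tags without a dedicated field, e.g. "O"
        for entity in ner_tags[tag]:
            value = '"%s"' % entity if " " in entity else entity
            clauses.append("%s:%s" % (field, value))
    # No usable entities: fall back to Solr's match-all query.
    return " AND ".join(clauses) if clauses else "*:*"


print(build_entity_query({"ORGANIZATION": ["PayPal"], "PERSON": ["Cooperman"]}))
```

Whether the entity values should be lowercased before querying depends on how the target fields are analyzed at index time, so the sketch deliberately does not change their case.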

**Helper Code Snippets** 

- Calling the Stanford NER server to get NER tags. The accumulate_tags function in SearchEntityAwareIndex is provided for aggregating the NER tags.

entity_tags = STANFORD_NER_SERVER.get_entities(query) <br>
entity_tags = self.accumulate_tags(entity_tags)<br>

- Boosting Parameter - You can pass in an optional boosting parameter of the following form to boost matches in certain fields. Example:

qf = '_news_title_person^10 _news_title_organization^5 _news_title_location^100 _news_title^2.0 _news_publisher^10.0'
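Rather than editing the boost string by hand, it can be generated from a dictionary, which makes the weights easier to tune. This is a small sketch; the names `FIELD_BOOSTS` and `build_qf` are hypothetical, and the weights are the ones from the example above.

```python
# Field weights taken from the qf example above.
FIELD_BOOSTS = {
    "_news_title_person": 10,
    "_news_title_organization": 5,
    "_news_title_location": 100,
    "_news_title": 2.0,
    "_news_publisher": 10.0,
}


def build_qf(boosts):
    """Format a {field: boost} mapping as a space-separated field^boost string."""
    return " ".join("%s^%s" % (field, boost) for field, boost in boosts.items())


qf = build_qf(FIELD_BOOSTS)
print(qf)
```

The resulting string is what gets passed as the `qf` parameter of the search request when a DisMax-style query parser such as `edismax` is used.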

**Running the server**
To see the search in action follow the commands below:

cd frontend <br>
python app.py

In [33]:
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import web
import pysolr
import string
import json
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
from sner import Ner

urls = (
    '/', 'SimpleIndexSearchPage',
    '/entityAwareSearchPage', 'EntityAwareSearch',
    '/searchSimpleIndex', 'SearchSimpleIndex',
    '/searchEntityAwareIndex', 'SearchEntityAwareIndex'
)

CATEGORY = {'b': 'Business', 'e': 'Entertainment', 't': 'Science and Technology', 'm': 'Health'}
render = web.template.render('templates/', base='layout')
SOLR_SIMPLEINDEX = pysolr.Solr('http://localhost:8983/solr/simpleindex')
SOLR_ENTITYAWAREINDEX = pysolr.Solr('http://localhost:8983/solr/entityawareindex')
STANFORD_NER_SERVER = Ner(host='localhost', port=9199)

def get_web_input(web_input):
    # draw is optional here; .get avoids a KeyError when the frontend does not send it
    draw = web_input.get('draw')
    query = web_input['search[value]']
    if len(query) == 0:
        query = '*:*'
    offset = web_input['start']
    count = web_input['length']
    return draw, query, offset, count

############################################################################################

# function built in Assignment 1
def search(query, offset, count, solr_endpoint):
    """
    This function is responsible for hitting the solr endpoint and returning the results back.
    """
    results = solr_endpoint.search(q=query, **{'start': int(offset), 'rows': int(count)})
    print("Saw {0} result(s) for query {1}.".format(len(results), query))
    
    formatted_hits = []
    for hit in results.docs:
        formatted_hits.append(
            [hit['_news_title'], hit['_news_publisher'], CATEGORY[hit['_news_category'][0]], hit['_news_url']])
    response = {'recordsFiltered': results.hits,
                'data': formatted_hits}
#     web.header('Content-Type', 'application/json')
    return json.dumps(response)

############################################################################################

# new entity aware search function
def search_entity_aware_index(query, offset, count, qf, time_in_ms):
    """
    This function is responsible for hitting the solr endpoint and returning the results back.
    """
    results = SOLR_ENTITYAWAREINDEX.search(q=query, **{'start': int(offset), 'rows': int(count),
                                                       'segmentTerminatedEarly': 'true', 'timeAllowed': time_in_ms,
                                                       'cache': 'false', 'qf': qf, 'pf': qf, 'debugQuery': 'true',
                                                       'defType': 'edismax', 'ps': 10})
    print("Saw {0} result(s) for query {1}.".format(len(results), query))
    print(results.debug)

    formatted_hits = []
    for hit in results.docs:
        formatted_hits.append(
            [hit['_news_title'], hit['_news_publisher'], CATEGORY[hit['_news_category'][0]], hit['_news_url']])
    response = {'recordsFiltered': results.hits,
                'data': formatted_hits}
#     web.header('Content-Type', 'application/json')
    return json.dumps(response)
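
The request parameters used by `search_entity_aware_index` can be looked at in isolation. The block below is a minimal sketch, not part of the tutorial code: `edismax_params` is a hypothetical helper that only assembles the keyword arguments and never contacts Solr, and the comments give the general meaning of each Solr parameter as an aid to reading the call above.

```python
def edismax_params(offset, count, qf, time_in_ms, phrase_slop=10):
    """Assemble keyword arguments of the kind passed to the Solr search call (hypothetical helper)."""
    return {
        'start': int(offset),              # zero-based offset of the first hit to return
        'rows': int(count),                # number of hits per page
        'defType': 'edismax',              # select the Extended DisMax query parser
        'qf': qf,                          # query fields with their boosts
        'pf': qf,                          # the same fields reused for phrase boosting
        'ps': phrase_slop,                 # slop allowed when matching phrases in pf
        'timeAllowed': time_in_ms,         # search time budget in milliseconds
        'segmentTerminatedEarly': 'true',  # request early termination where Solr supports it
    }


# Example values only; the boosts mirror two of the fields used in this notebook.
params = edismax_params(offset='0', count='10',
                        qf='_news_title^2.0 _news_publisher^10.0', time_in_ms=250)
print(params)
```

Keeping the parameters in one place like this makes it easier to compare runs with different `qf` boosts or `timeAllowed` budgets when exploring the relevance versus latency tradeoff.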


In [34]:
class SimpleIndexSearchPage:
    def GET(self):
        return render.simpleIndexSearchPage()


class EntityAwareSearch:
    def GET(self):
        return render.entityAwareSearchPage()


class SearchSimpleIndex:
    def GET(self):
        draw, query, offset, count = get_web_input(web_input=web.input())

        if query == '*:*':
            return search(query=query, offset=offset, count=count, solr_endpoint=SOLR_SIMPLEINDEX)

        # Require every query token to appear in the catch-all _text_ field.
        clauses = []
        for token in word_tokenize(query):
            clauses.append("+_text_:%s" % token)
        query = " AND ".join(clauses)
        return search(query=query, offset=offset, count=count, solr_endpoint=SOLR_SIMPLEINDEX)


class SearchEntityAwareIndex:
    def accumulate_tags(self, list_of_tuples):
        tokens, entities = zip(*list_of_tuples)
        recognised = defaultdict(set)
        duplicates = defaultdict(list)

        for i, item in enumerate(entities):
            duplicates[item].append(i)

        for key, value in duplicates.items():
            for k, g in groupby(enumerate(value), lambda x: x[0] - x[1]):
                indices = list(map(itemgetter(1), g))
                recognised[key].add(' '.join(tokens[index] for index in indices))
        # recognised.pop('O', None)

        recognised = dict(recognised)
        ner_info = {}
        for key, value in recognised.items():
            ner_info[key] = list(value)
        return ner_info


    def get_synonyms(self, text):
        # Collect the lemma names of every WordNet synset found for the given text.
        syn_set = []
        for synset in wn.synsets(text):
            for item in synset.lemma_names():
                syn_set.append(item)
        return syn_set


    def tokenize_text(self, text):
        # title = unicode(query, "utf-8")
        stop = stopwords.words('english') + list(string.punctuation)
        return [i for i in word_tokenize(text) if i not in stop]


    def build_clauses(self, prefix, tagged_segments):
        clauses = []
        for tagged_segment in tagged_segments:
            tokens = self.tokenize_text(tagged_segment)
            if len(tokens) == 1:
                clauses.append("%s:%s" % (prefix, tokens[0]))
            else:
                clauses.append("%s:\"%s\"" % (prefix, " ".join(tokens)))
        return clauses


    def GET(self):
        draw, query, offset, count = get_web_input(web_input=web.input())

        if query == '*:*':
            return search_entity_aware_index(query=query, offset=offset, count=count,
                                             qf='_text_^1', time_in_ms=100)

        # Utilize entity tagger to give out entities and remove unwanted tags
        entity_tags = STANFORD_NER_SERVER.get_entities(query)
        entity_tags = self.accumulate_tags(entity_tags)
        print('Entity tags for query - %s, %s' % (query, entity_tags))

        # Route each tagged segment to the index field that matches its entity type.
        clauses = []
        for entity_tag, tagged_segments in entity_tags.items():
            if entity_tag == 'PERSON':
                clauses.extend(self.build_clauses("_news_title_person", tagged_segments))
            elif entity_tag == 'LOCATION':
                clauses.extend(self.build_clauses("_news_title_location", tagged_segments))
            elif entity_tag == 'ORGANIZATION':
                clauses.extend(self.build_clauses("_news_title_organization", tagged_segments))
                clauses.extend(self.build_clauses("_news_title_publisher", tagged_segments))
            else:
                clauses.extend(self.build_clauses("_news_title", tagged_segments))

        query = " AND ".join(clauses)
        qf = '_news_title_person^10 _news_title_organization^5 _news_title_location^100 _news_title^2.0 _news_publisher^10.0'

        return search_entity_aware_index(query=query, offset=offset, count=count, qf=qf, time_in_ms=250)
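
To see what the entity-aware handler does with a query, the tag grouping and clause building can be run without Solr or the NER server. The block below is a simplified sketch under stated assumptions: the tagger output in `sample` is made up for illustration, `build_query` and `FIELD_FOR_TAG` are hypothetical names, segments are split on whitespace instead of going through the NLTK stop-word filter, and the extra publisher clause for organizations is left out.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical mapping from entity tag to index field; other tags fall back to _news_title.
FIELD_FOR_TAG = {
    'PERSON': '_news_title_person',
    'LOCATION': '_news_title_location',
    'ORGANIZATION': '_news_title_organization',
}


def accumulate_tags(tagged_tokens):
    """Merge runs of adjacent tokens that share an entity tag into phrases."""
    tokens, tags = zip(*tagged_tokens)
    positions = defaultdict(list)
    for i, tag in enumerate(tags):
        positions[tag].append(i)

    phrases = defaultdict(set)
    for tag, idxs in positions.items():
        # Consecutive token positions share the same (rank - position) difference,
        # so groupby splits each tag's positions into contiguous runs.
        for _, run in groupby(enumerate(idxs), lambda pair: pair[0] - pair[1]):
            phrases[tag].add(' '.join(tokens[i] for i in map(itemgetter(1), run)))
    return {tag: sorted(values) for tag, values in phrases.items()}


def build_query(entity_phrases):
    """Turn tagged phrases into field-scoped clauses joined with AND."""
    clauses = []
    for tag, segments in entity_phrases.items():
        field = FIELD_FOR_TAG.get(tag, '_news_title')
        for segment in segments:
            if ' ' in segment:
                clauses.append('%s:"%s"' % (field, segment))  # multi-word: phrase query
            else:
                clauses.append('%s:%s' % (field, segment))    # single word: term query
    return ' AND '.join(clauses)


# Made-up tagger output for the query "christine lagarde visits france".
sample = [('christine', 'PERSON'), ('lagarde', 'PERSON'),
          ('visits', 'O'), ('france', 'LOCATION')]
tags = accumulate_tags(sample)
query = build_query(tags)
print(tags)
print(query)
```

Grouping adjacent tokens first matters because a multi-word name is then sent as a single phrase clause against the person field instead of as two unrelated terms.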


In [35]:
end_time = time.time()
total_time = end_time - start_time
print("""Time to run: {} minutes""".format(round(total_time/60, 1)))

Time to run: 2.6 minutes
