# Information Retrieval (Search Engine Optimization)

## Preparations
* Put all your imports and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [0]:
!pip install whoosh
!pip install pytrec_eval
!pip install wget

Collecting whoosh
  Downloading https://files.pythonhosted.org/packages/ba/19/24d0f1f454a2c1eb689ca28d2f178db81e5024f42d82729a4ff6771155cf/Whoosh-2.7.4-py2.py3-none-any.whl (468kB)

In [0]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh import qparser
from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.analysis import Filter

import os.path
from pathlib import Path
import tempfile
import subprocess
import pytrec_eval
import nltk
from nltk.stem import *
import wget
wget.download("https://github.com/MIE451-1513-2019/course-datasets/raw/master/government.zip", "government.zip")

'government.zip'

In [0]:
!unzip government.zip

Archive:  government.zip
   creating: government/
  inflating: government/topics-with-full-descriptions.txt  
  inflating: government/gov.topics   
  inflating: government/gov.qrels    
   creating: government/documents/
   creating: government/documents/61/
  inflating: government/documents/61/G00-61-2800209  
  inflating: government/documents/61/G00-61-1192048  
  inflating: government/documents/61/G00-61-1118212  
  inflating: government/documents/61/G00-61-0749882  
  inflating: government/documents/61/G00-61-2230501  
  inflating: government/documents/61/G00-61-0680698  
  inflating: government/documents/61/G00-61-0551387  
  inflating: government/documents/61/G00-61-2575433  
  inflating: government/documents/61/G00-61-0469713  
  inflating: government/documents/61/G00-61-0280746  
  inflating: government/documents/61/G00-61-2574316  
  inflating: government/documents/61/G00-61-3933997  
  inflating: government/documents/61/G00-61-3290635  
  inflating: government/documents/61/G0

In [0]:
DATA_DIR = "government"
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")

P at 5

P@5 (precision at 5) is the fraction of relevant documents among the top 5 retrieved documents. This measure was chosen because users searching government data need precise results at the top of the ranking.
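To make the measure concrete, here is a minimal sketch of how P@5 can be computed from a ranked list. The function name `precision_at_k` and the document IDs are hypothetical and used for illustration only; the notebook itself relies on `pytrec_eval` for the actual numbers.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=5):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k (not by the number of retrieved documents), so a run
    # that returns fewer than k documents is not rewarded for it.
    return hits / k


# Hypothetical ranking and relevance judgments, for illustration only
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d7"}

p_at_5 = precision_at_k(ranked, relevant, k=5)
print(p_at_5)  # two of the top five (d7, d1) are relevant
```

Note that `d2` is relevant but sits at rank 6, so it does not count towards P@5.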

In [0]:
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)
# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index in a temporary directory based on the new schema
INDEX_Q2 = createIndex(mySchema)
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2
def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if ((docNum + 1) % 1000 == 0):
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the writer
        writer.close()
        

In [0]:
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]
filesToIndex[:5]

['government/documents/80/G00-80-1408326',
 'government/documents/80/G00-80-1361789',
 'government/documents/80/G00-80-2913262',
 'government/documents/80/G00-80-1849469',
 'government/documents/80/G00-80-0939068']

In [0]:
addFilesToIndex(INDEX_Q2, filesToIndex)

done indexing.


In [0]:
 
QP_Q2 = QueryParser("file_content", schema=INDEX_Q2.schema)
SEARCHER_Q2 = INDEX_Q2.searcher()

In [0]:
def pyTrecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)

    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, pytrec_eval.supported_measures)

    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
        print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    for query_id, query_measures in results.items():
        for measure, value in query_measures.items():
            if measure == "runid":
              continue
            print_line(measure, query_id, value)
    for measure in query_measures.keys():
        if measure == "runid":
              continue
        print_line(
            measure,
            'all',
            pytrec_eval.compute_aggregated_measure(
                measure,
                [query_measures[measure]
                 for query_measures in results.values()]))

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2) 

1 Q0 G00-90-0342721 0 26.645398 test

2 Q0 G00-22-3396139 0 17.262139 test

2 Q0 G00-76-0415824 1 10.597055 test

2 Q0 G00-78-1531079 2 8.778648 test

2 Q0 G00-15-1718631 3 8.076860 test

2 Q0 G00-70-2787853 4 6.788751 test

2 Q0 G00-74-1394517 5 3.368380 test

4 Q0 G00-99-2247765 0 16.449155 test

4 Q0 G00-85-1525415 1 13.364613 test

4 Q0 G00-05-1218739 2 12.956314 test

4 Q0 G00-09-0774298 3 11.781349 test

4 Q0 G00-56-4151981 4 11.367248 test

4 Q0 G00-21-2229498 5 10.743958 test

4 Q0 G00-98-4068688 6 10.464865 test

4 Q0 G00-47-2117970 7 10.213356 test

4 Q0 G00-67-0152545 8 8.392871 test

4 Q0 G00-06-1757034 9 6.431556 test

4 Q0 G00-78-2551063 10 3.955775 test

4 Q0 G00-84-0274223 11 2.068438 test

6 Q0 G00-26-3134051 0 13.996502 test

6 Q0 G00-59-0786269 1 13.853934 test

6 Q0 G00-60-3914816 2 11.345260 test

6 Q0 G00-21-0649032 3 5.955903 test

6 Q0 G00-45-4032177 4 5.937137 test

7 Q0 G00-70-0954490 0 16.517930 test

7 Q0 G00-59-2927976 1 16.234779 test

7 Q0 G00-75-1015577 

In [0]:
myReader = INDEX_Q2.reader()

In [0]:
[(docnum, doc_dict) for (docnum, doc_dict) in myReader.iter_docs()][0:5]

[(0, {'file_path': 'government/documents/80/G00-80-1408326'}),
 (1, {'file_path': 'government/documents/80/G00-80-1361789'}),
 (2, {'file_path': 'government/documents/80/G00-80-2913262'}),
 (3, {'file_path': 'government/documents/80/G00-80-1849469'}),
 (4, {'file_path': 'government/documents/80/G00-80-0939068'})]

In [0]:
[term for term in myReader.field_terms("file_content")][10:15]

['0.004', '0.005', '0.007', '0.008', '0.009']

In [0]:
print(myReader.field_length("file_content"))

2165181


In [0]:
print("# docs with 'wireless'", myReader.doc_frequency("file_content", "wireless"))
print("# docs with 'the'", myReader.doc_frequency("file_content", "the"))
print("# docs with 'communications'", myReader.doc_frequency("file_content", "communications"))

# docs with 'wireless' 33
# docs with 'the' 3355
# docs with 'communications' 102


In [0]:
def printRelName(topicFile, qrelsFile, queryParser, searcher, id):
  with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()
  for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        if topic_id == id:
          print("---------------------------Topic_id and Topic_phrase----------------------------------")
          print(topic_id, topic_phrase)
          topicQuery = queryParser.parse(topic_phrase)
          topicResults = searcher.search(topicQuery, limit=None)
          print("---------------------------Return documents----------------------------------")
          for (docnum, result) in enumerate(topicResults):
              score = topicResults.score(docnum)
              print("%s Q0 %s %d %lf test" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
          print("---------------------------Relevant documents----------------------------------")
          with open(qrelsFile, 'r') as f_qrel:
            qrels = f_qrel.readlines()
            for i in qrels:
              qid, _, doc, rel = i.rstrip().split(" ")
              if qid == id and rel == "1":
                print(i.rstrip())

In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2, "4")


---------------------------Topic_id and Topic_phrase----------------------------------
4 wireless communications
---------------------------Return documents----------------------------------
4 Q0 G00-99-2247765 0 16.449155 test
4 Q0 G00-85-1525415 1 13.364613 test
4 Q0 G00-05-1218739 2 12.956314 test
4 Q0 G00-09-0774298 3 11.781349 test
4 Q0 G00-56-4151981 4 11.367248 test
4 Q0 G00-21-2229498 5 10.743958 test
4 Q0 G00-98-4068688 6 10.464865 test
4 Q0 G00-47-2117970 7 10.213356 test
4 Q0 G00-67-0152545 8 8.392871 test
4 Q0 G00-06-1757034 9 6.431556 test
4 Q0 G00-78-2551063 10 3.955775 test
4 Q0 G00-84-0274223 11 2.068438 test
---------------------------Relevant documents----------------------------------
4 0 G00-03-2855342 1
4 0 G00-36-1275993 1
4 0 G00-47-2117970 1
4 0 G00-65-0162935 1


<b>P@5 (all queries, baseline) - 0.0714</b>

<b>Bad</b>

1,2,4,6,7,9,10,16,28,19

<b>Good</b>

14,18,22,24,26



The chosen topic (query) is: <b>4 wireless communications</b>

This topic was chosen to illustrate the measure P@5 because its baseline results contain both <b>false positives (irrelevant documents ranked highly) and false negatives (relevant documents not ranked highly)</b>. It also shows an improvement from P@5=0 (baseline) to P@5=0.4 (Q3).

<b>(1) Documents highly ranked*:</b>

4 Q0 G00-99-2247765 0 16.449155 test

4 Q0 G00-85-1525415 1 13.364613 test

4 Q0 G00-05-1218739 2 12.956314 test

4 Q0 G00-09-0774298 3 11.781349 test

4 Q0 G00-56-4151981 4 11.367248 test

 <b>(2) Documents that should have been ranked highly*:</b> These documents are judged relevant, so they are expected to appear near the top of the ranking.

4 0 G00-03-2855342 1

4 0 G00-36-1275993 1

4 0 G00-47-2117970 1

4 0 G00-65-0162935 1

 <b>(3) False Positives</b>

The documents listed in (1) are examples of false positives: they are judged irrelevant but are ranked highly. Document <font color ='red'>G00-99-2247765</font> has <b>tf(wireless): 7 and tf(communications): 1</b>, which made it the top-ranked document.

<b>(4) False Negatives</b>

Document <font color="red">G00-47-2117970</font> is an example of a false negative: it is judged relevant but is not ranked among the top 5 retrieved documents.
Although the query terms occur frequently in this document, many occurrences were not counted because matching is case-sensitive, and stop words increased the document length. Both effects lowered its score.
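The classification above can be reproduced with plain set operations. The sketch below uses the top-5 baseline results and the relevance judgments for topic 4 exactly as printed by `printRelName()` above; the variable names are illustrative and not part of the notebook.

```python
# Top-5 baseline results for topic 4, copied from the output above
baseline_top5 = [
    "G00-99-2247765",
    "G00-85-1525415",
    "G00-05-1218739",
    "G00-09-0774298",
    "G00-56-4151981",
]

# Documents judged relevant for topic 4 in the qrels file
relevant = {
    "G00-03-2855342",
    "G00-36-1275993",
    "G00-47-2117970",
    "G00-65-0162935",
}

# False positives: ranked in the top 5 but not judged relevant
false_positives = [d for d in baseline_top5 if d not in relevant]

# False negatives: judged relevant but missing from the top 5
false_negatives = sorted(relevant - set(baseline_top5))

p_at_5 = (len(baseline_top5) - len(false_positives)) / 5

print("false positives:", false_positives)
print("false negatives:", false_negatives)
print("P@5:", p_at_5)
```

All five top-ranked documents are false positives and all four relevant documents are false negatives, which is why the baseline P@5 for this topic is 0.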

1. Punctuation is unnecessary for the analysis. The motivation is to remove punctuation and assess the basic performance.
2. There is no real benefit in assessing stop words. The motivation is to avoid processing them.
3. It is observed that uppercase tokens are not counted in text processing and term frequency calculations. The motivation is to lowercase all tokens so that every case variant of a token is counted.
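The effect of points 2 and 3 can be sketched in plain Python, independent of Whoosh. The sentence, the tiny stop list, and the helper names below are assumptions made for illustration; Whoosh's own `StopFilter` ships with a different, longer default list.

```python
import re
from collections import Counter

# Small illustrative stop list (hypothetical, not Whoosh's default list)
STOP_WORDS = {"the", "and", "a", "of"}


def tokenize(text):
    # \w+ keeps runs of word characters and drops punctuation
    return re.findall(r"\w+", text)


def term_counts(text, lowercase=False, remove_stop_words=False):
    """Return (term frequencies, document length in tokens)."""
    tokens = tokenize(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(tokens), len(tokens)


text = ("Wireless services: the Wireless Bureau and the FCC "
        "regulate wireless communications.")

raw_counts, raw_len = term_counts(text)
norm_counts, norm_len = term_counts(text, lowercase=True, remove_stop_words=True)

print("tf(wireless) raw:", raw_counts["wireless"], "length:", raw_len)
print("tf(wireless) normalized:", norm_counts["wireless"], "length:", norm_len)
```

With case-sensitive matching only one of the three occurrences of "wireless" is counted, and the stop words add to the document length. After normalization the term frequency rises and the length drops, both of which tend to raise a length-normalized score.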



<b>* document details obtained from printRelName()</b>





In [0]:
# we start with basic tokenizer
tokenizer = RegexTokenizer()
[token.text for token in tokenizer("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['The',
 'Wireless',
 'Telecommunications',
 'Bureau',
 'WTB',
 'handles',
 'nearly',
 'all',
 'FCC',
 'domestic',
 'wireless',
 'telecommunications',
 'programs',
 'and',
 'policies',
 'Wireless',
 'communications',
 'services',
 'include',
 'Amateur',
 'Cellular',
 'Paging',
 'PCS',
 'Public',
 'Safety',
 'and',
 'more']

In [0]:
 #We probably want to lower-case it so we add LowercaseFilter
LwrAnalyzer = RegexTokenizer() | LowercaseFilter() 
[token.text for token in LwrAnalyzer("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['the',
 'wireless',
 'telecommunications',
 'bureau',
 'wtb',
 'handles',
 'nearly',
 'all',
 'fcc',
 'domestic',
 'wireless',
 'telecommunications',
 'programs',
 'and',
 'policies',
 'wireless',
 'communications',
 'services',
 'include',
 'amateur',
 'cellular',
 'paging',
 'pcs',
 'public',
 'safety',
 'and',
 'more']

In [0]:
# we probably want to ignore words like "we", "are", "with" when we index files
# so we add StopFilter to filter stop words
LwrStpAnalyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() 
[token.text for token in LwrStpAnalyzer("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'telecommunications',
 'bureau',
 'wtb',
 'handles',
 'nearly',
 'all',
 'fcc',
 'domestic',
 'wireless',
 'telecommunications',
 'programs',
 'policies',
 'wireless',
 'communications',
 'services',
 'include',
 'amateur',
 'cellular',
 'paging',
 'pcs',
 'public',
 'safety',
 'more']

In [0]:
# Creating new schema based on new analyzer
mySchema3 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = LwrStpAnalyzer))


In [0]:
# create the index based on the new schema
INDEX_Q3 = createIndex(mySchema3)

In [0]:
addFilesToIndex(INDEX_Q3, filesToIndex)

done indexing.


In [0]:

QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema)
SEARCHER_Q3 = INDEX_Q3.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3) 

1 Q0 G00-90-0342721 0 22.803461 test

1 Q0 G00-55-3817584 1 13.672799 test

1 Q0 G00-69-2353421 2 6.653522 test

2 Q0 G00-37-1427392 0 19.090128 test

2 Q0 G00-22-3396139 1 19.085040 test

2 Q0 G00-78-1531079 2 16.820244 test

2 Q0 G00-92-0578141 3 16.051704 test

2 Q0 G00-67-0637954 4 15.360254 test

2 Q0 G00-91-1567424 5 15.360254 test

2 Q0 G00-94-1117794 6 13.872165 test

2 Q0 G00-76-0415824 7 12.874618 test

2 Q0 G00-15-1718631 8 11.736098 test

2 Q0 G00-90-3871013 9 11.715659 test

2 Q0 G00-70-2787853 10 9.116448 test

2 Q0 G00-27-2159399 11 8.939224 test

2 Q0 G00-74-1394517 12 2.988885 test

4 Q0 G00-36-1275993 0 17.452534 test

4 Q0 G00-47-2117970 1 16.803991 test

4 Q0 G00-00-1958915 2 15.419385 test

4 Q0 G00-28-2286602 3 14.834750 test

4 Q0 G00-99-2247765 4 14.503729 test

4 Q0 G00-21-2229498 5 14.410003 test

4 Q0 G00-74-4030396 6 14.228823 test

4 Q0 G00-85-1525415 7 14.211391 test

4 Q0 G00-05-1218739 8 14.014439 test

4 Q0 G00-69-0005329 9 14.005216 test

4 Q0 G00-46-1

In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3, "4")


---------------------------Topic_id and Topic_phrase----------------------------------
4 wireless communications
---------------------------Return documents----------------------------------
4 Q0 G00-36-1275993 0 17.452534 test
4 Q0 G00-47-2117970 1 16.803991 test
4 Q0 G00-00-1958915 2 15.419385 test
4 Q0 G00-28-2286602 3 14.834750 test
4 Q0 G00-99-2247765 4 14.503729 test
4 Q0 G00-21-2229498 5 14.410003 test
4 Q0 G00-74-4030396 6 14.228823 test
4 Q0 G00-85-1525415 7 14.211391 test
4 Q0 G00-05-1218739 8 14.014439 test
4 Q0 G00-69-0005329 9 14.005216 test
4 Q0 G00-46-1439567 10 13.715071 test
4 Q0 G00-84-3349019 11 13.661520 test
4 Q0 G00-16-0059045 12 13.259794 test
4 Q0 G00-44-1482914 13 13.111128 test
4 Q0 G00-71-3454228 14 12.795079 test
4 Q0 G00-02-1720397 15 12.714877 test
4 Q0 G00-09-0774298 16 11.918658 test
4 Q0 G00-07-3064254 17 11.811925 test
4 Q0 G00-67-0152545 18 10.938959 test
4 Q0 G00-05-1550998 19 10.415637 test
4 Q0 G00-56-4151981 20 9.826144 test
4 Q0 G00-59-3586444 21

<b>Modifications:</b>

 <b>Used</b>

1. RegexTokenizer() 

2. LowercaseFilter()

3. StopFilter()

<b>Improvements</b>

 P@5 before for query 4 - <font color='red'>0.0</font>

 P@5 now for query 4 - <font color='blue'> 0.4</font>

 P@5 before for all queries - <font color='red'> 0.0714</font>

 P@5 now for all queries - <font color='blue'>0.12</font>

 The false positive document <font color='red'>G00-99-2247765</font>, which was ranked first before the modifications, is now ranked fifth.

 The false negative documents <font color='blue'>G00-36-1275993 and G00-47-2117970</font> (relevant), which were not ranked in the top 5 before, are now ranked 1 and 2 respectively.



Yes, the modification was beneficial: it ranked the relevant documents higher than the irrelevant ones, thereby reducing false negatives.

It also improved P@5 over all queries from 0.0714 to 0.12, an absolute gain of about 0.05 (roughly a 68% relative increase) in the share of relevant documents among the top 5 retrieved documents.
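The two P@5 values reported above can be compared in absolute and in relative terms. A short worked calculation, using only the numbers from this notebook:

```python
baseline_p5 = 0.0714  # P@5 over all queries before the modifications
improved_p5 = 0.12    # P@5 over all queries after the modifications

# Absolute gain: difference between the two precision values
absolute_gain = improved_p5 - baseline_p5

# Relative gain: the difference expressed as a share of the baseline
relative_gain = absolute_gain / baseline_p5

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

The absolute gain is about 0.05 and the relative gain is about 68%, so it helps to state which of the two is meant when reporting an improvement.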

Only 1 query (22) of the 15 queries showed a reduced P@5.

We can try adding more filters to the analyzer used above and check whether the measures improve.

<b>Filters</b>

1. StemFilter()
2. IntraWordFilter()

Different stemmers or a lemmatizer can also be added to check whether the measured values improve.

In [0]:
# we probably want to use stemming
stmLwrStpAnalyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() |StemFilter()
[token.text for token in stmLwrStpAnalyzer("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'telecommun',
 'bureau',
 'wtb',
 'handl',
 'nearli',
 'all',
 'fcc',
 'domest',
 'wireless',
 'telecommun',
 'program',
 'polici',
 'wireless',
 'commun',
 'servic',
 'includ',
 'amateur',
 'cellular',
 'page',
 'pc',
 'public',
 'safeti',
 'more']

In [0]:
# we also probably want to break phrases like "Tele-communications" into "Tele" and "commun"
# so we add IntraWordFilter
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()
[token.text for token in stmLwrStpIntraAnalyzer("The Wireless Tele-communications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'tele',
 'commun',
 'bureau',
 'wtb',
 'handl',
 'nearli',
 'all',
 'fcc',
 'domest',
 'wireless',
 'telecommun',
 'program',
 'polici',
 'wireless',
 'commun',
 'servic',
 'includ',
 'amateur',
 'cellular',
 'page',
 'pc',
 'public',
 'safeti',
 'more']

In [0]:
mySchema3_1 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = stmLwrStpIntraAnalyzer))

In [0]:
INDEX_Q3_1 = createIndex(mySchema3_1)

In [0]:
addFilesToIndex(INDEX_Q3_1, filesToIndex)

done indexing.


In [0]:
QP_Q3_1 = QueryParser("file_content", schema=INDEX_Q3_1.schema)
SEARCHER_Q3_1 = INDEX_Q3_1.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_1, SEARCHER_Q3_1) 

1 Q0 G00-90-0342721 0 23.967111 test

1 Q0 G00-55-3817584 1 13.436291 test

1 Q0 G00-69-2353421 2 7.288464 test

2 Q0 G00-37-1427392 0 19.449214 test

2 Q0 G00-22-3396139 1 18.434193 test

2 Q0 G00-78-1531079 2 16.335289 test

2 Q0 G00-92-0578141 3 15.493780 test

2 Q0 G00-67-0637954 4 14.857957 test

2 Q0 G00-91-1567424 5 14.721797 test

2 Q0 G00-94-1117794 6 13.437583 test

2 Q0 G00-76-0415824 7 12.543086 test

2 Q0 G00-15-1718631 8 11.522913 test

2 Q0 G00-90-3871013 9 11.372035 test

2 Q0 G00-70-2787853 10 10.268646 test

2 Q0 G00-27-2159399 11 8.788647 test

2 Q0 G00-74-1394517 12 4.022868 test

4 Q0 G00-36-1275993 0 15.164268 test

4 Q0 G00-47-2117970 1 14.653076 test

4 Q0 G00-99-2247765 2 13.872368 test

4 Q0 G00-85-1525415 3 13.224219 test

4 Q0 G00-00-1958915 4 13.150225 test

4 Q0 G00-74-4030396 5 12.750312 test

4 Q0 G00-28-2286602 6 12.510099 test

4 Q0 G00-84-3349019 7 12.457022 test

4 Q0 G00-05-1218739 8 12.368941 test

4 Q0 G00-21-2229498 9 12.277661 test

4 Q0 G00-69-

In [0]:
# we'll compare two stemmers and a lemmatizer
lrStem = LancasterStemmer()
sbStem = SnowballStemmer("english")
wnLemm = WordNetLemmatizer()

In [0]:
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [0]:
# Example1: Whoosh filter for NLTK's LancasterStemmer
myFilter1 = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter() | CustomFilter(LancasterStemmer().stem)
[token.text for token in myFilter1("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'telecommun',
 'bureau',
 'wtb',
 'handl',
 'nearl',
 'al',
 'fcc',
 'domest',
 'wireless',
 'telecommun',
 'program',
 'polic',
 'wireless',
 'commun',
 'serv',
 'includ',
 'am',
 'cellul',
 'pag',
 'pc',
 'publ',
 'safet',
 'mor']

In [0]:
# define a Schema with the new analyzer
mySchema3_2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

In [0]:
INDEX_Q3_2 = createIndex(mySchema3_2)

In [0]:
addFilesToIndex(INDEX_Q3_2, filesToIndex)

done indexing.


In [0]:
QP_Q3_2 = qparser.QueryParser("file_content", schema=INDEX_Q3_2.schema)
SEARCHER_Q3_2 = INDEX_Q3_2.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_2, SEARCHER_Q3_2) 

1 Q0 G00-90-0342721 0 21.146062 test

1 Q0 G00-55-3817584 1 12.636908 test

1 Q0 G00-69-2353421 2 6.606388 test

2 Q0 G00-37-1427392 0 19.449214 test

2 Q0 G00-22-3396139 1 18.434193 test

2 Q0 G00-78-1531079 2 16.335289 test

2 Q0 G00-92-0578141 3 15.493780 test

2 Q0 G00-67-0637954 4 14.857957 test

2 Q0 G00-91-1567424 5 14.721797 test

2 Q0 G00-94-1117794 6 13.437583 test

2 Q0 G00-76-0415824 7 12.543086 test

2 Q0 G00-15-1718631 8 11.522913 test

2 Q0 G00-90-3871013 9 11.372035 test

2 Q0 G00-70-2787853 10 10.268646 test

2 Q0 G00-27-2159399 11 8.788647 test

2 Q0 G00-74-1394517 12 4.022868 test

4 Q0 G00-36-1275993 0 15.138705 test

4 Q0 G00-47-2117970 1 14.628534 test

4 Q0 G00-99-2247765 2 14.288935 test

4 Q0 G00-85-1525415 3 13.200385 test

4 Q0 G00-00-1958915 4 13.126524 test

4 Q0 G00-74-4030396 5 12.731603 test

4 Q0 G00-28-2286602 6 12.484379 test

4 Q0 G00-84-3349019 7 12.442307 test

4 Q0 G00-05-1218739 8 12.349475 test

4 Q0 G00-21-2229498 9 12.256268 test

4 Q0 G00-69-

In [0]:
#download required resources

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
# Example2: Whoosh filter for NLTK's WordNetLemmatizer
myFilter2 = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()| CustomFilter(WordNetLemmatizer().lemmatize)
[token.text for token in myFilter2("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'telecommun',
 'bureau',
 'wtb',
 'handl',
 'nearli',
 'all',
 'fcc',
 'domest',
 'wireless',
 'telecommun',
 'program',
 'polici',
 'wireless',
 'commun',
 'servic',
 'includ',
 'amateur',
 'cellular',
 'page',
 'pc',
 'public',
 'safeti',
 'more']

In [0]:
# define a Schema with the new analyzer
mySchema3_3 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter2))

In [0]:
INDEX_Q3_3 = createIndex(mySchema3_3)

In [0]:
addFilesToIndex(INDEX_Q3_3, filesToIndex)

done indexing.


In [0]:
QP_Q3_3 = qparser.QueryParser("file_content", schema=INDEX_Q3_3.schema)
SEARCHER_Q3_3 = INDEX_Q3_3.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_3, SEARCHER_Q3_3) 

1 Q0 G00-90-0342721 0 23.967111 test

1 Q0 G00-55-3817584 1 13.436291 test

1 Q0 G00-69-2353421 2 7.288464 test

2 Q0 G00-37-1427392 0 19.449214 test

2 Q0 G00-22-3396139 1 18.434193 test

2 Q0 G00-78-1531079 2 16.335289 test

2 Q0 G00-92-0578141 3 15.493780 test

2 Q0 G00-67-0637954 4 14.857957 test

2 Q0 G00-91-1567424 5 14.721797 test

2 Q0 G00-94-1117794 6 13.437583 test

2 Q0 G00-76-0415824 7 12.543086 test

2 Q0 G00-15-1718631 8 11.522913 test

2 Q0 G00-90-3871013 9 11.372035 test

2 Q0 G00-70-2787853 10 10.268646 test

2 Q0 G00-27-2159399 11 8.788647 test

2 Q0 G00-74-1394517 12 4.022868 test

4 Q0 G00-36-1275993 0 15.164268 test

4 Q0 G00-47-2117970 1 14.653076 test

4 Q0 G00-99-2247765 2 13.872368 test

4 Q0 G00-85-1525415 3 13.224219 test

4 Q0 G00-00-1958915 4 13.150225 test

4 Q0 G00-74-4030396 5 12.750312 test

4 Q0 G00-28-2286602 6 12.510099 test

4 Q0 G00-84-3349019 7 12.457022 test

4 Q0 G00-05-1218739 8 12.368941 test

4 Q0 G00-21-2229498 9 12.277661 test

4 Q0 G00-69-

In [0]:
# Example3: Whoosh filter for NLTK's WordNetLemmatizer for verbs
myFilter3 = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()| CustomFilter(WordNetLemmatizer().lemmatize, 'v')
[token.text for token in myFilter3("The Wireless Telecommunications Bureau (WTB) handles nearly all FCC domestic wireless telecommunications programs and policies. Wireless communications services include Amateur, Cellular, Paging, PCS, Public Safety, and more")]

['wireless',
 'telecommun',
 'bureau',
 'wtb',
 'handl',
 'nearli',
 'all',
 'fcc',
 'domest',
 'wireless',
 'telecommun',
 'program',
 'polici',
 'wireless',
 'commun',
 'servic',
 'includ',
 'amateur',
 'cellular',
 'page',
 'pc',
 'public',
 'safeti',
 'more']

In [0]:
# define a Schema with the new analyzer
mySchema3_4 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter3))

In [0]:
INDEX_Q3_4 = createIndex(mySchema3_4)

In [0]:
addFilesToIndex(INDEX_Q3_4, filesToIndex)

done indexing.


In [0]:
QP_Q3_4 = qparser.QueryParser("file_content", schema=INDEX_Q3_4.schema)
SEARCHER_Q3_4 = INDEX_Q3_4.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_4, SEARCHER_Q3_4) 

1 Q0 G00-90-0342721 0 23.967111 test

1 Q0 G00-55-3817584 1 13.436291 test

1 Q0 G00-69-2353421 2 7.288464 test

2 Q0 G00-37-1427392 0 19.449214 test

2 Q0 G00-22-3396139 1 18.434193 test

2 Q0 G00-78-1531079 2 16.335289 test

2 Q0 G00-92-0578141 3 15.493780 test

2 Q0 G00-67-0637954 4 14.857957 test

2 Q0 G00-91-1567424 5 14.721797 test

2 Q0 G00-94-1117794 6 13.437583 test

2 Q0 G00-76-0415824 7 12.543086 test

2 Q0 G00-15-1718631 8 11.522913 test

2 Q0 G00-90-3871013 9 11.372035 test

2 Q0 G00-70-2787853 10 10.268646 test

2 Q0 G00-27-2159399 11 8.788647 test

2 Q0 G00-74-1394517 12 4.022868 test

4 Q0 G00-36-1275993 0 15.164268 test

4 Q0 G00-47-2117970 1 14.653076 test

4 Q0 G00-99-2247765 2 13.872368 test

4 Q0 G00-85-1525415 3 13.224219 test

4 Q0 G00-00-1958915 4 13.150225 test

4 Q0 G00-74-4030396 5 12.750312 test

4 Q0 G00-28-2286602 6 12.510099 test

4 Q0 G00-84-3349019 7 12.457022 test

4 Q0 G00-05-1218739 8 12.368941 test

4 Q0 G00-21-2229498 9 12.277661 test

4 Q0 G00-69-

Of the analyzers tried above, <b>myFilter1</b> performed best, so it is used together with further modifications in the following iterations.


In [0]:
# define a Schema with the new analyzer
mySchema3_5 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

In [0]:
INDEX_Q3_5 = createIndex(mySchema3_5)

In [0]:
addFilesToIndex(INDEX_Q3_5, filesToIndex)

done indexing.


In [0]:
QP_Q3_5 = qparser.QueryParser("file_content", schema=INDEX_Q3_5.schema,group=qparser.OrGroup)
SEARCHER_Q3_5 = INDEX_Q3_5.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_5, SEARCHER_Q3_5) 

1 Q0 G00-90-0342721 0 21.146062 test

1 Q0 G00-07-1172041 1 20.610684 test

1 Q0 G00-86-3214229 2 20.128964 test

1 Q0 G00-21-2004003 3 19.476768 test

1 Q0 G00-48-3798484 4 18.645579 test

1 Q0 G00-42-1455285 5 18.273382 test

1 Q0 G00-26-0088644 6 18.031872 test

1 Q0 G00-50-2059900 7 18.006670 test

1 Q0 G00-23-3149835 8 17.856676 test

1 Q0 G00-27-2048511 9 17.571694 test

1 Q0 G00-32-2907392 10 17.115499 test

1 Q0 G00-73-3632837 11 17.035249 test

1 Q0 G00-02-0351712 12 16.916183 test

1 Q0 G00-31-1216640 13 16.912089 test

1 Q0 G00-34-1044519 14 16.887073 test

1 Q0 G00-94-0326199 15 16.672451 test

1 Q0 G00-24-4085400 16 16.648776 test

1 Q0 G00-74-1802348 17 16.274531 test

1 Q0 G00-98-3517069 18 15.760415 test

1 Q0 G00-10-3730888 19 15.653990 test

1 Q0 G00-01-2689026 20 15.125556 test

1 Q0 G00-08-0995170 21 15.106893 test

1 Q0 G00-27-2669897 22 14.801746 test

1 Q0 G00-27-3209717 23 14.364998 test

1 Q0 G00-00-1006224 24 14.293359 test

1 Q0 G00-02-1239993 25 14.194724 te

Next, the <font color='blue'>BM25F</font> scoring function is tuned on top of <b>myFilter1</b>: the values of b and k1 are chosen to maximize the evaluation measures.

In [0]:
# BM25F with tuned global parameters; B and K1 apply to every field, including "file_content"
w = scoring.BM25F(B=0.55, K1=2.78)


In [0]:
# define a Schema with the new analyzer
mySchema3_6= Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

In [0]:

INDEX_Q3_6 = createIndex(mySchema3_6)

In [0]:
addFilesToIndex(INDEX_Q3_6, filesToIndex)

done indexing.


In [0]:
QP_Q3_6 = qparser.QueryParser("file_content", schema=INDEX_Q3_6.schema,group=qparser.OrGroup)
SEARCHER_Q3_6 = INDEX_Q3_6.searcher(weighting=w)

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3_6, SEARCHER_Q3_6) 

1 Q0 G00-90-0342721 0 28.689865 test

1 Q0 G00-23-3149835 1 28.442654 test

1 Q0 G00-27-2048511 2 28.122707 test

1 Q0 G00-42-1455285 3 28.087868 test

1 Q0 G00-86-3214229 4 26.791805 test

1 Q0 G00-02-0351712 5 26.557437 test

1 Q0 G00-32-2907392 6 26.338438 test

1 Q0 G00-07-1172041 7 26.169660 test

1 Q0 G00-48-3798484 8 25.991045 test

1 Q0 G00-26-0088644 9 25.929722 test

1 Q0 G00-50-2059900 10 25.759484 test

1 Q0 G00-73-3632837 11 25.332217 test

1 Q0 G00-74-1802348 12 25.136753 test

1 Q0 G00-94-0326199 13 24.552580 test

1 Q0 G00-34-1044519 14 24.535725 test

1 Q0 G00-31-1216640 15 24.312010 test

1 Q0 G00-21-2004003 16 22.810863 test

1 Q0 G00-24-4085400 17 22.296775 test

1 Q0 G00-10-3730888 18 22.081206 test

1 Q0 G00-98-3517069 19 21.038023 test

1 Q0 G00-27-2669897 20 20.245510 test

1 Q0 G00-62-3289850 21 19.724694 test

1 Q0 G00-01-2689026 22 19.397236 test

1 Q0 G00-54-2231037 23 18.821899 test

1 Q0 G00-08-0995170 24 18.304341 test

1 Q0 G00-10-3849661 25 18.103571 te

After tuning the BM25F parameters, the overall MAP was found to be highest for <b>b=0.55 and k1=2.78</b>.
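
The notebook does not show how the values of b and k1 were searched. One simple option is an exhaustive grid search, sketched below. The `evaluate_map` callable is a hypothetical stand-in for "open a searcher with `scoring.BM25F(B=b, K1=k1)`, run the topics, and return the overall MAP". The toy function at the bottom exists only to make the sketch runnable and does not reflect real retrieval results.

```python
from itertools import product


def grid_search(evaluate_map, b_values, k1_values):
    """Return (best_map, best_b, best_k1) over every (b, k1) combination."""
    best = None
    for b, k1 in product(b_values, k1_values):
        score = evaluate_map(b, k1)
        if best is None or score > best[0]:
            best = (score, b, k1)
    return best


def toy_evaluate_map(b, k1):
    """Made-up smooth surface with a single peak; NOT real retrieval results."""
    return 1.0 - (b - 0.5) ** 2 - 0.01 * (k1 - 2.0) ** 2


b_grid = [0.25, 0.5, 0.75]
k1_grid = [1.0, 2.0, 3.0]
best_map, best_b, best_k1 = grid_search(toy_evaluate_map, b_grid, k1_grid)
print(best_b, best_k1)
```

Since the weighting is passed to `searcher(weighting=...)`, the index would not need to be rebuilt for each (b, k1) pair in such a search.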

In [0]:
# define a Schema with the new analyzer
mySchema4= Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

In [0]:
INDEX_Q4 = createIndex(mySchema4)

In [0]:
addFilesToIndex(INDEX_Q4, filesToIndex)

done indexing.


In [0]:
QP_Q4 = qparser.QueryParser("file_content", schema=INDEX_Q4.schema,group=qparser.OrGroup)
SEARCHER_Q4 = INDEX_Q4.searcher(weighting=w)

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q4, SEARCHER_Q4) 

1 Q0 G00-90-0342721 0 28.689865 test

1 Q0 G00-23-3149835 1 28.442654 test

1 Q0 G00-27-2048511 2 28.122707 test

1 Q0 G00-42-1455285 3 28.087868 test

1 Q0 G00-86-3214229 4 26.791805 test

1 Q0 G00-02-0351712 5 26.557437 test

1 Q0 G00-32-2907392 6 26.338438 test

1 Q0 G00-07-1172041 7 26.169660 test

1 Q0 G00-48-3798484 8 25.991045 test

1 Q0 G00-26-0088644 9 25.929722 test

1 Q0 G00-50-2059900 10 25.759484 test

1 Q0 G00-73-3632837 11 25.332217 test

1 Q0 G00-74-1802348 12 25.136753 test

1 Q0 G00-94-0326199 13 24.552580 test

1 Q0 G00-34-1044519 14 24.535725 test

1 Q0 G00-31-1216640 15 24.312010 test

1 Q0 G00-21-2004003 16 22.810863 test

1 Q0 G00-24-4085400 17 22.296775 test

1 Q0 G00-10-3730888 18 22.081206 test

1 Q0 G00-98-3517069 19 21.038023 test

1 Q0 G00-27-2669897 20 20.245510 test

1 Q0 G00-62-3289850 21 19.724694 test

1 Q0 G00-01-2689026 22 19.397236 test

1 Q0 G00-54-2231037 23 18.821899 test

1 Q0 G00-08-0995170 24 18.304341 test

1 Q0 G00-10-3849661 25 18.103571 te

(a) A clear list of all final modifications made.  
(b) Why each modification was made – how did it help?  
(c) The final MAP performance that these modifications attained.

(a)

*   Baseline --------------- No modification (raw data)
  
*   Q3 ------------------------ Used RegexTokenizer(), StopFilter(), LowercaseFilter()


*   First Iteration ------- Used StemFilter(), IntraWordFilter() in the Analyzer

*   Second Iteration -- Used the analyzer with all the above filters + LancasterStemmer()


*   Third Iteration ------ Used the analyzer with all the above filters + WordNetLemmatizer()

*   Fourth Iteration ---- Used the analyzer with all the above filters + WordNetLemmatizer() for verbs


*   Fifth Iteration ------- Modified query parser by adding <b>group=qparser.OrGroup</b>

*   Sixth Iteration ------ Modified query parser + scoring function BM25F

(b) 

*  <b> Baseline</b> --------------- No modification (raw data)
*   <b>Q3</b> ------------------------ Used RegexTokenizer(), StopFilter(), LowercaseFilter()   


*  <b> First Iteration</b> ------- Used StemFilter(), IntraWordFilter() in the Analyzer -> StemFilter() reduces words to their stems, and IntraWordFilter() splits tokens on intra-word delimiters such as special characters. On this dataset, together with the other filters, they improve the overall MAP (from 0.3811 to 0.4116), which is the primary objective for retrieval over the government data.

* <b>  Second Iteration</b> -- Used the analyzer with all the above filters + LancasterStemmer() -> Stemming reduces words to their stems. LancasterStemmer is an aggressive stemmer that collapses complex word forms into simpler ones, so more variants of a query term match the same index term, which helps document retrieval.


*  <b> Third Iteration </b>------ Used the analyzer with all the above filters + WordNetLemmatizer()
*  <b> Fourth Iteration </b>---- Used the analyzer with all the above filters + WordNetLemmatizer() for verbs


The third and fourth iterations used WordNetLemmatizer() with the above filters to explore possible improvements in the measures. These functions are <font color ='red'><b>not used</b> </font>in the final schema because they did not improve performance. 
Possible reason: WordNet lemmatization is based on the WordNet database (like a web of synonyms or a thesaurus), so tokens that do not match entries in the database cannot be mapped to proper lemmas.



*  <b> Fifth Iteration </b>------- Modified query parser by adding <b>group=qparser.OrGroup</b> -> With this setting, a document is retrieved if it contains any of the query tokens rather than all of them. The number of rel_retrieved documents increased after this iteration, improving the overall MAP to 0.3891.

* <b>  Sixth Iteration </b>------ Modified query parser + scoring function BM25F -> Scoring functions are used to rank the retrieved documents. BM25F has two main parameters, b and k1. 'b' controls document-length normalization: the larger b is, the more strongly a document's length relative to the average length affects its score. 'k1' controls term-frequency saturation: the contribution of tf to the score grows quickly while tf() ≤ k1 and more and more slowly once tf() > k1.
The default values are b=0.75 and k1=1.2.

 For the final iteration, which combines all filters, LancasterStemmer(), and the modified query parser, these parameters were tuned to b=0.55 and k1=2.78, which produced the best overall MAP across all queries.
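
To make the roles of b and k1 concrete, the sketch below computes the term-frequency part of the standard BM25 term weight, leaving out the idf factor. The function name and the example document lengths are illustrative assumptions, not values taken from this notebook.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25, without the idf factor."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: each extra occurrence adds less; the weight stays below k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(bm25_tf_weight(tf, 100, 100, k1=2.78, b=0.55), 3))

# Length normalization: the same tf counts for less in a longer document.
short_doc = bm25_tf_weight(3, 50, 100, k1=2.78, b=0.55)
long_doc = bm25_tf_weight(3, 200, 100, k1=2.78, b=0.55)
print(short_doc > long_doc)
```

Raising k1 from 1.2 to 2.78 lets repeated terms keep contributing for longer before saturating, while lowering b from 0.75 to 0.55 weakens the length penalty.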








(c) The overall MAP after all modifications, with the tuned scoring-function parameters, is <font color='yellow'><b>0.4116</b></font>.
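
For reference, MAP is the mean over queries of each query's average precision (AP). The sketch below computes it in plain Python on made-up rankings. The notebook itself relies on the `pyTrecEval` helper for this measure, whose handling of edge cases may differ. The document ids and relevance judgments here are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """AP for one query: sum of precision@k at each relevant hit, over all relevant docs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant documents penalizes relevant docs never retrieved.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """MAP: average of AP over every query that has relevance judgments."""
    scores = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical rankings and judgments, for illustration only.
toy_runs = {"1": ["a", "b", "c"], "2": ["x", "y"]}
toy_qrels = {"1": {"a", "c"}, "2": {"y", "z"}}
print(mean_average_precision(toy_runs, toy_qrels))
```

Because AP rewards relevant documents that appear early in the ranking, changes that only reorder the same retrieved set (such as tuning b and k1) can still move MAP.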

## Validation

In [0]:
# Run the following cells to make sure your code returns the correct value types

In [0]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [0]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [0]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [0]:
assert(isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert(isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
