# Assignment 2: IR

## Preparations
* Put all your imports and path constants in the next cells
* Make sure ***MATERIALS_DIR*** points to the directory where you extracted the Zip file.
* Make sure all your paths are **relative to** ***MATERIALS_DIR*** and **NOT hard-coded** in your code.

In [1]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os, os.path
import shutil

In [2]:
MATERIALS_DIR = r"C:\DSS_Fall2017_Assign2"
#
# Put other path constants here
#
DOCUMENTS_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents")
INDEX_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index1")
QUER_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\topics\gov.topics")
QRELS_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\qrels\gov.qrels")
OUTPUT_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres")
TREC_EVAL = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\trec_eval\trec_eval.exe")
INDEX_DIR2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index2")
OUTPUT_FILE2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres2")
OUTPUT_FILE3 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres3")

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): 
MAP

### Q1 (b): 
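MAP (mean average precision) is the mean, over all queries, of each query's average precision, where average precision averages the precision values measured at the rank of each relevant document. Because it rewards both retrieving the relevant documents and ranking them early, it gives a more comprehensive view than single-cutoff measures such as precision at 10.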
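
To make the definition concrete, here is a minimal sketch of the computation. The function names and the toy document ids are assumptions for illustration and are not part of the assignment code; the sketch also averages over every topic in the relevance judgments, which may differ from how trec_eval selects topics.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """runs: {topic: ranked list of ids}; qrels: {topic: set of relevant ids}."""
    per_topic = [average_precision(runs.get(topic, []), relevant)
                 for topic, relevant in qrels.items()]
    return sum(per_topic) / len(per_topic)


# Toy data (hypothetical ids): two topics with different ranking quality.
toy_runs = {"t1": ["d3", "d1", "d2"], "t2": ["d9", "d8"]}
toy_qrels = {"t1": {"d1", "d2"}, "t2": {"d9"}}
print(mean_average_precision(toy_runs, toy_qrels))
```

For a topic with a single relevant document, average precision reduces to 1/rank, so a relevant document in sixth position yields about 0.1667.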

## Question 2

### Q2 (a): Write your code below

In [3]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))
# if index exists - remove it
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
myIndex = index.create_in(INDEX_DIR, mySchema)

# first we build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [4]:
# open writer
myWriter = writing.BufferedWriter(myIndex, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding='utf8') as f:
            fileContent = f.read()
            myWriter.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [5]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = QueryParser("file_content", schema=myIndex.schema) # Replace None with your query parser for Q2
SEARCHER_Q2 = myIndex.searcher() # Replace None with your searcher for Q2

In [6]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q2.parse(topic_phrase)
    topicResults = SEARCHER_Q2.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
topicsFile.close()

In [7]:
!cat $QUER_FILE

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education


In [8]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


### Q2 (b): 
MAP for all is 0.1971

### Q2 (c): 
Topics 1, 2, 6, 7, 9, 16 and 28 did badly.
Topics 18 and 24 did very well.
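
One way to rank topics by performance is to pull the per-topic `map` rows out of the `trec_eval -q` output. The helper below is a minimal sketch, assuming the output has been captured as a string of whitespace-separated `metric topic value` rows like those printed above; the name `per_topic_metric` is hypothetical.

```python
def per_topic_metric(trec_eval_output, metric="map"):
    """Return {topic_id: value} for one metric from `trec_eval -q` text output."""
    scores = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        # Each row is expected to look like: <metric> <topic> <value>
        if len(parts) == 3 and parts[0] == metric:
            scores[parts[1]] = float(parts[2])
    return scores


# A few rows in the same layout as the output above.
sample_output = (
    "num_ret               \t1\t1\n"
    "map                   \t1\t0.0000\n"
    "map                   \t10\t0.1667\n"
    "map                   \tall\t0.1971\n"
)
topic_map = per_topic_metric(sample_output)
# Topics sorted from worst to best, leaving out the summary row.
worst_first = sorted((t for t in topic_map if t != "all"), key=topic_map.get)
print(worst_first)
```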

## Question 3

### Q3 (a): 
(1) Documents containing all words in the query and with high frequency of these words are highly ranked.

(2) Documents that are relevant should be highly ranked. Relevance is assessed relative to the information need.

(3) False positive: Documents that contain the word 'Shipwrecks' but are unrelated to the topic 'Shipwrecks' are retrieved (G00-57-0803089, G00-78-1392606, ...).

False negative: When the query is 'an Shipwreck', the document related to the topic 'Shipwrecks' is not retrieved (G00-07-0978415).

Suggested modification: Improve the analyzer by removing stop words such as "an" and adding a stemmer, so that 'Shipwreck' and 'Shipwrecks' in a query return the same results.
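
The effect of the suggested modification can be illustrated without Whoosh. This is a minimal sketch only: the tiny stop list and the plural-stripping rule are hypothetical stand-ins for a real stop filter and stemmer, chosen to show why both query variants would map to the same index term.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of"}


def naive_stem(token):
    """Crude plural stripping; a real stemmer or lemmatizer handles far more cases."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text):
    """Lower-case, tokenize on word characters, drop stop words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("an Shipwreck"))
print(normalize("Shipwrecks"))
```

Both calls produce the same term list, so a document indexed with the same normalization would match either query.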

In [9]:
sampleQuery = QP_Q2.parse("Shipwrecks")
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-07-0978415 0 11.989904787792897
G00-57-0803089 1 9.032380321265611
G00-78-1392606 2 9.032380321265611
G00-12-1397051 3 8.906919342450095
G00-71-2435350 4 8.850182315992956
G00-92-2525720 5 8.850182315992956
G00-98-3853828 6 8.657170860906144
G00-01-1928582 7 8.524380768716991
G00-40-0657116 8 8.524380768716991
G00-11-0184364 9 8.101855123719446
G00-40-2252497 10 8.031602143375721
G00-36-2625159 11 7.397293502138806
G00-01-1619611 12 6.712312620970865
G00-23-0479567 13 6.44754064083282
G00-49-1872970 14 5.539264984813576
G00-18-3177883 15 4.723243796102263
G00-26-3218156 16 2.4758944589490413


In [10]:
sampleQuery = QP_Q2.parse("an Shipwreck")
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-58-3804365 0 10.64015516531452


### Q3 (b): Write your code below

In [11]:
# Don't change this! Use it as-is in your code
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [12]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3
import nltk
from nltk.stem import *

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to C:\Windows\ServiceProfiles\
[nltk_data]     LocalService\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(WordNetLemmatizer().lemmatize)

mySchema2 = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = stmLwrStpIntraAnalyzer))
# if index exists - remove it
if os.path.isdir(INDEX_DIR2):
    shutil.rmtree(INDEX_DIR2)

# create the directory for the index
os.makedirs(INDEX_DIR2) 

In [14]:
INDEX_Q3 = index.create_in(INDEX_DIR2, mySchema2) # Replace None with your index for Q3

In [15]:
# open writer
myWriter2 = writing.BufferedWriter(INDEX_Q3, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding='utf8') as f:
            fileContent = f.read()
            myWriter2.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter2.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [16]:
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema) # Replace None with your query parser for Q3
SEARCHER_Q3 = INDEX_Q3.searcher() # Replace None with your searcher for Q3

In [17]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile2 = open(OUTPUT_FILE2, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q3.parse(topic_phrase)
    topicResults = SEARCHER_Q3.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile2.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile2.close()
topicsFile.close()

In [18]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE2

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.0000


In [19]:
sampleQuery = QP_Q3.parse("Shipwrecks")
sampleQueryResults = SEARCHER_Q3.search(sampleQuery, limit=None)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-07-0978415 0 11.17545895291104
G00-45-0636707 1 11.152228242041652
G00-12-1397051 2 11.093303360907743
G00-36-2625159 3 8.911145462371703
G00-01-1619611 4 8.066419949733076
G00-57-0803089 5 7.983023338754952
G00-78-1392606 6 7.983023338754952
G00-92-2525720 7 7.679670796862095
G00-01-1928582 8 7.60740104558958
G00-40-0657116 9 7.60740104558958
G00-98-3853828 10 7.60740104558958
G00-71-2435350 11 7.5465294844542425
G00-35-3362733 12 7.340941404777774
G00-40-2252497 13 7.27487887975228
G00-49-1872970 14 7.249979855914902
G00-11-0184364 15 7.200819970446401
G00-23-0825886 16 7.0571354465837395
G00-01-0409474 17 6.835493834027606
G00-22-1796180 18 6.537981460600386
G00-00-0809130 19 6.295558083433594
G00-58-3804365 20 5.970102513779342
G00-86-0344148 21 5.722567753864319
G00-23-0479567 22 5.642679311712423
G00-18-3212032 23 5.309152520894447
G00-43-0913072 24 5.225802309761936
G00-14-2161877 25 5.140355037693677
G00-28-0647082 26 5.0576571134733825
G00-11-3028759 27 4.973203388595745
G

In [20]:
sampleQuery = QP_Q3.parse("an Shipwreck")
sampleQueryResults = SEARCHER_Q3.search(sampleQuery, limit=None)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-07-0978415 0 11.17545895291104
G00-45-0636707 1 11.152228242041652
G00-12-1397051 2 11.093303360907743
G00-36-2625159 3 8.911145462371703
G00-01-1619611 4 8.066419949733076
G00-57-0803089 5 7.983023338754952
G00-78-1392606 6 7.983023338754952
G00-92-2525720 7 7.679670796862095
G00-01-1928582 8 7.60740104558958
G00-40-0657116 9 7.60740104558958
G00-98-3853828 10 7.60740104558958
G00-71-2435350 11 7.5465294844542425
G00-35-3362733 12 7.340941404777774
G00-40-2252497 13 7.27487887975228
G00-49-1872970 14 7.249979855914902
G00-11-0184364 15 7.200819970446401
G00-23-0825886 16 7.0571354465837395
G00-01-0409474 17 6.835493834027606
G00-22-1796180 18 6.537981460600386
G00-00-0809130 19 6.295558083433594
G00-58-3804365 20 5.970102513779342
G00-86-0344148 21 5.722567753864319
G00-23-0479567 22 5.642679311712423
G00-18-3212032 23 5.309152520894447
G00-43-0913072 24 5.225802309761936
G00-14-2161877 25 5.140355037693677
G00-28-0647082 26 5.0576571134733825
G00-11-3028759 27 4.973203388595745
G

### Q3 (c): Provide answer to Q3 (c) here 
I used the basic RegexTokenizer, LowercaseFilter to lower-case tokens, IntraWordFilter to split compound tokens into their parts, StopFilter to remove stop words, and NLTK's WordNetLemmatizer (wrapped in CustomFilter) to reduce words to their lemmas.

Overall performance improved: MAP over all topics increased from 0.1971 to 0.3402. False negatives decreased because more relevant documents are retrieved, but false positives did not improve because more nonrelevant documents are retrieved as well.
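
The trade-off can be read off the per-topic counts that trec_eval prints. The sketch below is a hypothetical helper, assuming that "false positive" means any retrieved document not judged relevant in the qrels; it uses the topic 10 counts from the two trec_eval outputs above.

```python
def precision_recall(num_ret, num_rel, num_rel_ret):
    """Set-based precision and recall from trec_eval's per-topic counts."""
    precision = num_rel_ret / num_ret if num_ret else 0.0
    recall = num_rel_ret / num_rel if num_rel else 0.0
    false_positives = num_ret - num_rel_ret  # retrieved but not judged relevant
    false_negatives = num_rel - num_rel_ret  # relevant but not retrieved
    return precision, recall, false_positives, false_negatives


# Topic 10: (num_ret, num_rel, num_rel_ret) for the Q2 and Q3 runs.
baseline = precision_recall(16, 1, 1)
improved = precision_recall(24, 1, 1)
print(baseline)
print(improved)
```

For this topic recall is unchanged while the number of retrieved nonrelevant documents grows, even though MAP improves because the relevant document is ranked earlier.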

### Q3 (d): Provide answer to Q3 (d) here 
Yes

### Q3 (e): Provide answer to Q3 (e) here
Yes

### Q3 (f): Provide answer to Q3 (f) here
The results support my idea. Search improved: the results now include not only documents that contain the exact query words, but also documents that contain those words in a different form.

## Question 4 (Graduate Students)

In [21]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here
The documents we are working on are mostly structured, consisting of specific fields (head, body, title, paragraphs, etc.). A scoring method like BM25F can calculate scores based on the structure of documents and weight term frequencies according to the importance of their field. So applying this scoring method should help improve Whoosh's performance.
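
To show what the `B` and `K1` parameters control, here is a minimal sketch of the textbook single-field BM25 term score. It is not Whoosh's implementation, and the function name and the sample numbers are assumptions for illustration. `B` sets how strongly document length is normalized and `K1` sets how quickly repeated occurrences of a term saturate.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, B=0.75, K1=1.2):
    """Textbook BM25 contribution of one query term to one document's score."""
    length_norm = (1 - B) + B * (doc_len / avg_doc_len)
    return idf * (tf * (K1 + 1)) / (tf + K1 * length_norm)


# Hypothetical document twice as long as the average, with three term occurrences.
args = dict(tf=3, idf=2.0, doc_len=200, avg_doc_len=100, K1=1.4)
weak_norm = bm25_term_score(B=0.4, **args)
full_norm = bm25_term_score(B=1.0, **args)
no_norm = bm25_term_score(B=0.0, **args)
print(weak_norm, full_norm, no_norm)
```

A larger `B` penalizes the long document more, while `B=0` ignores length entirely.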

### Q4 (b): Write your code below

In [87]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4
from whoosh import scoring
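# BM25F with a global B of 0.4 and K1 of 1.4.
# Note: content_B targets a field named "content"; the schema's text field is
# "file_content", so this per-field override should not take effect as written.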
w = scoring.BM25F(B=0.4, content_B=1.0, K1=1.4)

In [88]:
SEARCHER_Q4 = INDEX_Q3.searcher(weighting=w) # Replace None with your searcher for Q4

In [89]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile3 = open(OUTPUT_FILE3, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q3.parse(topic_phrase)
    topicResults = SEARCHER_Q4.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile3.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile3.close()
topicsFile.close()

In [90]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE3

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.0000


In [91]:
INDEX_Q4 = INDEX_Q3 # Replace None with your index for Q4
QP_Q4 = QP_Q3 # Replace None with your query parser for Q4
SEARCHER_Q4 = SEARCHER_Q4 # Replace None with your searcher for Q4

In [92]:
sampleQuery = QP_Q3.parse("Early Childhood Education")
sampleQueryResults = SEARCHER_Q3.search(sampleQuery, limit=None)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-75-2371200 0 19.793048488760938
G00-93-3702508 1 17.853611679192717
G00-48-1527977 2 17.603532379274643
G00-93-4160214 3 17.603532379274643
G00-99-2279811 4 17.51265710092938
G00-61-3894960 5 17.479350206361914
G00-31-0429249 6 17.163282736169126
G00-30-2788847 7 17.081401322540056
G00-78-2978026 8 17.002874217284575
G00-28-3705847 9 16.96901899684124
G00-50-3231467 10 16.937306199887683
G00-74-2972556 11 16.63819884150422
G00-93-1203370 12 16.269509164050984
G00-91-3997333 13 16.252070998222393
G00-77-3295130 14 16.21496844096906
G00-49-2602614 15 16.06760147251972
G00-02-0541868 16 15.62714276766039
G00-04-3016417 17 15.357458719361937
G00-16-2494170 18 15.187567871827632
G00-82-0211909 19 15.149894640182119
G00-76-4136817 20 15.056947668182074
G00-09-4172401 21 14.756608066199078
G00-54-2576117 22 14.121963976556184
G00-04-3271001 23 14.058893740077554
G00-75-1062837 24 12.825382646320639
G00-27-2159399 25 11.45773632406977
G00-65-4078383 26 10.921712755917326
G00-82-3144058 27 

In [93]:
sampleQuery = QP_Q4.parse("Early Childhood Education")
sampleQueryResults = SEARCHER_Q4.search(sampleQuery, limit=12)

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-75-2371200 0 21.45016181386964
G00-93-3702508 1 19.392673350957523
G00-93-1203370 2 17.10956220715707
G00-30-2788847 3 16.87560958872448
G00-99-2279811 4 16.54499222702476
G00-48-1527977 5 16.422736993849778
G00-93-4160214 6 16.422736993849778
G00-61-3894960 7 16.364973692990862
G00-28-3705847 8 16.35274642713174
G00-50-3231467 9 16.306094347974806
G00-74-2972556 10 16.238647029676752
G00-54-2576117 11 16.106828437085525


In [74]:
!grep $QRELS_FILE -e 'G00-74-2972556'
!grep $QRELS_FILE -e 'G00-54-2576117'

28 0 G00-74-2972556 0
28 0 G00-54-2576117 1


### Q4 (c): Provide answer to Q4 (c) here
I used the BM25F scoring method. MAP over all topics improved from 0.3402 in Q3 to 0.3411 here. The numbers of false positives and false negatives did not change, but some nonrelevant documents are ranked lower and some relevant documents are ranked higher, so overall performance improved.
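
Rank movement between the two searchers can be checked directly. The sketch below is a hypothetical helper; the positions are the 0-based ranks printed above for "Early Childhood Education" with SEARCHER_Q3 and SEARCHER_Q4, and the relevance labels come from the qrels lines shown by grep.

```python
def rank_changes(before, after):
    """Positions gained per document (positive means it moved toward the top)."""
    return {doc: before[doc] - after[doc] for doc in before if doc in after}


# 0-based ranks copied from the two result listings above.
ranks_q3 = {"G00-54-2576117": 22, "G00-74-2972556": 11}
ranks_q4 = {"G00-54-2576117": 11, "G00-74-2972556": 10}
relevance = {"G00-54-2576117": 1, "G00-74-2972556": 0}  # from the qrels file

for doc, gain in rank_changes(ranks_q3, ranks_q4).items():
    print(doc, "relevant" if relevance[doc] else "nonrelevant", gain)
```

Here the judged-relevant document gains eleven positions, which is the kind of movement that raises average precision without changing which documents are retrieved.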

### Q4 (d): Provide answer to Q4 (d) here
Yes

### Q4 (e): Provide answer to Q4 (e) here
Yes

### Q4 (f): Provide answer to Q4 (f) here
The results support my idea: BM25F does improve search over structured text by calculating scores based on the structure of documents and weighting term frequencies according to the importance of their field.

## Validation

In [94]:
# Run the following cells to make sure your code returns the correct value types

In [95]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Path Validation

In [96]:
assert "MATERIALS_DIR" in globals(), "variable MATERIALS_DIR does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR))), "MATERIALS_DIR folder does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2"))), "invalid folder structure"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents"))), "invalid folder structure"
print("Paths validated")

Paths validated


### Q2 Validation

In [97]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [98]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [99]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
