# Tests file

This notebook runs consistency and performance tests on the search engine.

In [1]:
import Globals.globals as glob
import time
import SearchAlgorithms.searchAlgorithms as algo
from Tokenization.tokenizer import createListOfTokens, replaceWordsByStem, replaceWordsByLemma, removeStopWords
from DocumentServer import documentServer

documentServer.foldername = "../latimes"
glob.loadDocID2Content()

## Consistency tests

### Impact of the word score on the top 10 documents

In this section, we run the naive algorithm on an inverted file built without stemming, lemmatization, or word embedding.
The same user query is used in both cases: "Chocolate and internet".

In [9]:
searchAlgorithm = algo.naiveAlgo
query = "Chocolate and internet"
query = createListOfTokens(query)
query = removeStopWords(query)
query = [[x, None] for x in query]
print(query)

[['chocolate', None], ['internet', None]]
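
The tokenizer helpers come from the project's `Tokenization` module, whose code is not shown in this notebook. As a rough illustration of what the preprocessing above does, here is a minimal stand-in; the function names, the regular expression, and the stop-word list are all assumptions, not the project's implementation.

```python
import re

# Hypothetical stop-word list; the project's real list is not shown here.
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "in", "to"}

def create_list_of_tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(create_list_of_tokens("Chocolate and internet"))
# Pair each token with a placeholder weight, as in the cell above.
query_pairs = [[t, None] for t in tokens]
print(query_pairs)  # [['chocolate', None], ['internet', None]]
```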


First, the word score is simply the number of occurrences of the word in the document.

In [10]:
vocabulary_filename = "Globals/nostemm_nolemm_notfidf/vocabulary.dict"
IF_filename = "Globals/nostemm_nolemm_notfidf/IF.dict"

glob.loadVocabulary(vocabulary_filename, IF_filename)

choco_PL = glob.voc2PostingList("chocolate")
internet_PL = glob.voc2PostingList("internet")

print("len(choco_PL) :", len(choco_PL))
print("len(internet_PL) :", len(internet_PL))

print("list(choco_PL.items())[:4] :", list(choco_PL.items())[:4])
print("list(internet_PL.items())[:4] :", list(internet_PL.items())[:4])

len(choco_PL) : 723
len(internet_PL) : 4
list(choco_PL.items())[:4] : [('321713', 38), ('145821', 27), ('321712', 25), ('111', 24)]
list(internet_PL.items())[:4] : [('85032', 8), ('85141', 6), ('105932', 1), ('254071', 1)]
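
The posting lists printed above appear to map document IDs to occurrence counts, ordered by descending count. The sketch below shows one way such a structure could be built; the toy corpus and the `build_inverted_file` helper are hypothetical and are not the project's indexing code.

```python
from collections import Counter, defaultdict

# Toy corpus keyed by document ID (made-up content, not the LA Times data).
documents = {
    "1": "chocolate cake with dark chocolate",
    "2": "internet cafe serving chocolate",
    "3": "slow internet connection",
}

def build_inverted_file(docs):
    """Map each term to a {doc_id: count} posting list, highest count first."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            postings[term][doc_id] = count
    # Sort by descending count; ties are broken by document ID for determinism.
    return {
        term: dict(sorted(pl.items(), key=lambda kv: (-kv[1], kv[0])))
        for term, pl in postings.items()
    }

inverted_file = build_inverted_file(documents)
print(list(inverted_file["chocolate"].items()))  # [('1', 2), ('2', 1)]
```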


In [22]:
result = searchAlgorithm(query)

# Pass the search result itself; the original cell referenced an undefined name
content_result = documentServer.serveDocuments(result)

for idx, doc in enumerate(content_result.keys()):
    print(idx + 1, "----------------------------------")
    print(content_result[doc]["metadata"])
print("----------------------------------")

ValueError: too many values to unpack (expected 2)

We repeat the same test with tf/idf as the word score.

In [None]:
vocabulary_filename = "Globals/nostemm_nolemm_tfidf/vocabulary.dict"
IF_filename = "Globals/nostemm_nolemm_tfidf/IF.dict"



The tf/idf score seems to be better than the raw number of occurrences. TO CHANGE !!!
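
To make the difference between the two word scores concrete, here is a minimal sketch of one common tf/idf formulation: the raw term count multiplied by log(N / df). The exact weighting used to build the project's `IF.dict` files is not shown in this notebook, so the formula, the toy corpus, and the name `tf_idf_scores` are assumptions.

```python
import math
from collections import Counter

# Toy corpus (made-up content).
docs = {
    "d1": "chocolate chocolate cake",
    "d2": "chocolate internet",
    "d3": "internet news today",
    "d4": "news today",
}

def tf_idf_scores(docs):
    """Return {term: {doc_id: tf * idf}} with tf = raw count and idf = log(N / df)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts.values() for term in c)
    return {
        term: {
            doc_id: c[term] * math.log(n_docs / df[term])
            for doc_id, c in counts.items()
            if term in c
        }
        for term in df
    }

scores = tf_idf_scores(docs)
print(scores["chocolate"])  # term present in 2 of 4 documents
print(scores["cake"])       # rarer term, present in 1 of 4 documents
```

With this weighting, a single occurrence of a rare term such as "cake" scores higher than a single occurrence of a more widespread term such as "chocolate", which a raw count cannot express.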

### Impact of the search algorithm on the top 10 documents

In this section, we use neither stemming/lemmatization nor word embedding. tf/idf has been chosen as the token score.
We also use the query "Chocolate and internet" for each algorithm.

First, the naive algorithm was already run in the previous section.
The results were:
....
....
....
....
....
....
....
....
....

We run the same query with Fagin's algorithm.
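
For reference, here is a minimal sketch of Fagin's algorithm on toy data. It assumes posting lists given as lists of `(doc_id, score)` pairs sorted by descending score, a sum as the aggregation function, and a score of 0 for a document missing from a list. The name `fagin_top_k` and its signature are hypothetical and do not describe `algo.faginAlgo`.

```python
import heapq

def fagin_top_k(posting_lists, k):
    """Return the k best (doc_id, total_score) pairs, best first."""
    lookup = [dict(pl) for pl in posting_lists]  # random-access tables
    seen_in = {}     # doc_id -> number of lists where it was met so far
    fully_seen = 0   # documents met in every list
    depth = 0
    max_depth = max((len(pl) for pl in posting_lists), default=0)

    # Phase 1: sorted access, one row at a time across all lists,
    # until k documents have been met in every list (or lists run out).
    while fully_seen < k and depth < max_depth:
        for pl in posting_lists:
            if depth < len(pl):
                doc_id = pl[depth][0]
                seen_in[doc_id] = seen_in.get(doc_id, 0) + 1
                if seen_in[doc_id] == len(posting_lists):
                    fully_seen += 1
        depth += 1

    # Phase 2: random access to get the full score of every document met.
    totals = {d: sum(t.get(d, 0.0) for t in lookup) for d in seen_in}
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

chocolate_pl = [("a", 5.0), ("b", 3.0), ("c", 1.0)]
internet_pl = [("b", 4.0), ("c", 2.0), ("a", 1.0)]
print(fagin_top_k([chocolate_pl, internet_pl], k=2))  # [('b', 7.0), ('a', 6.0)]
```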

We run the same query with the threshold algorithm.
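
As an illustration, the threshold algorithm can be sketched as follows, under the same assumptions as usual for top-k aggregation: posting lists sorted by descending score, a sum as the aggregation function, and a score of 0 for a document absent from a list. The name `threshold_top_k` is hypothetical and this is not the code behind `algo.threshold`.

```python
import heapq

def threshold_top_k(posting_lists, k):
    """Return the k best (doc_id, total_score) pairs, best first."""
    if k <= 0:
        return []
    lookup = [dict(pl) for pl in posting_lists]  # random-access tables
    totals = {}
    max_depth = max((len(pl) for pl in posting_lists), default=0)

    for depth in range(max_depth):
        threshold = 0.0
        for pl in posting_lists:
            if depth >= len(pl):
                continue  # an exhausted list contributes 0 to the threshold
            doc_id, score = pl[depth]
            threshold += score
            if doc_id not in totals:
                # Random access: fetch the document's score in every list.
                totals[doc_id] = sum(t.get(doc_id, 0.0) for t in lookup)
        best = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # No unseen document can score above the threshold, so stop
        # as soon as the k-th best total reaches it.
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

chocolate_pl = [("a", 5.0), ("b", 3.0), ("c", 1.0)]
internet_pl = [("b", 4.0), ("c", 2.0), ("a", 1.0)]
print(threshold_top_k([chocolate_pl, internet_pl], k=2))  # [('b', 7.0), ('a', 6.0)]
```

On this toy input the loop stops after the second row, without reading the last entry of either list.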

The ????? algorithm seems to be the best.

### Impact of stemming, lemmatization and word embedding

In this section, we will use Fagin's algorithm (TAKE THE BEST) with tf/idf scores on the same query as before: "Chocolate and internet".

If we don't use stemming, lemmatization or word embedding we obtain the same results as before:
DETAILS THE RESULTS

We will now add stemming processing on the inverted file and on the user query.
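
The point of stemming both sides is that an index term and a query term only match if they reduce to the same stem. The sketch below uses a deliberately crude suffix stripper to show the idea; `toy_stem` and its suffix list are hypothetical and far simpler than a real stemmer such as Porter's, and they are not the project's `replaceWordsByStem`.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_tokens(tokens):
    """Apply the same stemmer to every token (index side or query side)."""
    return [toy_stem(t) for t in tokens]

print(stem_tokens(["chocolates", "searching", "searched", "internet"]))
# ['chocolate', 'search', 'search', 'internet']
```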

We will now add the lemmatization procedure to tokens in the inverted file and in the query.
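
Unlike stemming, lemmatization maps each word to a dictionary form, which lets irregular forms match as well. The sketch below uses a tiny hand-written lookup table and merges the posting lists of terms that share a lemma by summing their scores. The table, the helper names, and the choice of summing are assumptions for illustration, not the project's `replaceWordsByLemma`.

```python
# Hypothetical lemma table; a real lemmatizer relies on a full dictionary.
LEMMAS = {"mice": "mouse", "ran": "run", "chocolates": "chocolate"}

def toy_lemma(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

def lemmatize_postings(postings):
    """Merge posting lists whose terms share a lemma, summing scores per document."""
    merged = {}
    for term, posting_list in postings.items():
        target = merged.setdefault(toy_lemma(term), {})
        for doc_id, score in posting_list.items():
            target[doc_id] = target.get(doc_id, 0) + score
    return merged

postings = {"chocolate": {"1": 2}, "chocolates": {"1": 1, "2": 3}}
print(lemmatize_postings(postings))  # {'chocolate': {'1': 3, '2': 3}}
```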

Finally, we will extend the query with 3 synonyms for each token using word embedding.
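
Query expansion with embeddings usually means adding, for each token, its nearest neighbours in the vector space. Here is a minimal sketch using cosine similarity over made-up 3-dimensional vectors; the `EMBEDDINGS` table and the `expand_query` helper are hypothetical, and the embedding model used by the project is not shown in this notebook.

```python
import math

# Made-up vectors; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "chocolate": (0.9, 0.1, 0.0),
    "cocoa": (0.8, 0.2, 0.0),
    "candy": (0.7, 0.3, 0.1),
    "dessert": (0.6, 0.4, 0.1),
    "internet": (0.0, 0.1, 0.9),
    "web": (0.1, 0.1, 0.8),
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_query(tokens, n_synonyms=3):
    """Append to each token its n_synonyms nearest neighbours in the embedding space."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        if token not in EMBEDDINGS:
            continue  # unknown tokens are kept but not expanded
        neighbours = sorted(
            (w for w in EMBEDDINGS if w != token),
            key=lambda w: cosine(EMBEDDINGS[token], EMBEDDINGS[w]),
            reverse=True,
        )
        expanded.extend(neighbours[:n_synonyms])
    return expanded

print(expand_query(["chocolate"]))  # ['chocolate', 'cocoa', 'candy', 'dessert']
```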

CONCLUSION ON STEM LEM EMBEDDING

## Performance tests

### Time to build the inverted file

In this section, we will use neither stemming/lemmatization nor word embedding.

First, we will build the inverted file over the whole data set in RAM and request one posting list from it.

Then, we will build the inverted file in memory and request one posting list.
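
A measurement of this kind can be sketched as below: time the construction of an in-memory inverted file, then time a single posting-list lookup. The synthetic corpus and the helper names `build_inverted_file` and `time_build_and_lookup` are hypothetical; the project's own build code is not shown in this notebook.

```python
import time
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """Map each term to a {doc_id: count} posting list held in memory."""
    inverted_file = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            inverted_file[term][doc_id] = count
    return dict(inverted_file)

def time_build_and_lookup(docs, term):
    """Return (build seconds, lookup seconds, posting list) for one term."""
    start = time.perf_counter()
    inverted_file = build_inverted_file(docs)
    build_seconds = time.perf_counter() - start

    start = time.perf_counter()
    posting_list = inverted_file.get(term, {})
    lookup_seconds = time.perf_counter() - start
    return build_seconds, lookup_seconds, posting_list

# Synthetic corpus: 10,000 short documents (made-up content).
docs = {str(i): "chocolate internet " * (i % 3 + 1) for i in range(10_000)}
build_s, lookup_s, posting_list = time_build_and_lookup(docs, "chocolate")
print(f"build: {build_s:.4f} s, lookup: {lookup_s:.6f} s, postings: {len(posting_list)}")
```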

CONCLUSION

### Time to run algorithm

In this section, we will use neither stemming/lemmatization nor word embedding. We will also use the query "Chocolate and internet" for all algorithms.

We run the naive algorithm on this query.

We run Fagin's algorithm on the same query.

CONCLUSION


In [16]:
def testTime(queries):
    """Print the total time each search algorithm takes over all queries."""
    algorithms = [
        ("naiveAlgo", algo.naiveAlgo),
        ("faginAlgo", algo.faginAlgo),
        ("threshold", algo.threshold),
    ]
    for name, run in algorithms:
        start_time = time.time()
        for query in queries:
            run(query)
        print("--- %s %s seconds ---" % (time.time() - start_time, name))
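
Single measurements taken with `time.time()` can be noisy, especially for runs that last only a few microseconds. A possible refinement, sketched below, uses `time.perf_counter()` and keeps the best of several repetitions. The helper `time_algorithm` and the stand-in `dummy_search` are hypothetical and are not part of the project.

```python
import time

def time_algorithm(search, queries, repeats=5):
    """Return the best wall-clock time in seconds over several full runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            search(query)
        best = min(best, time.perf_counter() - start)
    return best

def dummy_search(query):
    """Stand-in for a real search algorithm: just sorts the query terms."""
    return sorted(query)

sample_queries = [[("love", 3), ("chocolate", 3)], [("january", 3)]]
elapsed = time_algorithm(dummy_search, sample_queries)
print("--- %s dummy_search seconds ---" % elapsed)
```

Taking the minimum rather than the mean reduces the influence of other processes competing for the machine during a run.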


In [49]:
glob.loadVocabulary("./Globals/nostemm_nolemm_tf_idf/vocabulary.dict","./Globals/nostemm_nolemm_tf_idf/IF.dict")


In [50]:
oneWord = [
    [("daylight", 3)],
]

notExist = [
    [("fdadfdfewf", 3)],
    [("114rf4434", 3)],
    [("jdifjoiq2323", 3)],
]

queries = [
    [("love", 3), ("chocolate", 3)],
    [("january", 3)],
    [("narrow", 3)],
    [("today", 3), ("tomorrow", 3)],
]

queries1 = [
    [("love", 3), ("and", 3), ("chocolate", 3)],
    [("january", 3)],
    [("narrow", 3)],
    [("today", 3), ("and", 3), ("tomorrow", 3)],
]



We run the three algorithms on words that do not exist in the vocabulary:

In [57]:
testTime(notExist)

--- 4.124641418457031e-05 naiveAlgo seconds ---
--- 0.0005621910095214844 faginAlgo seconds ---
--- 0.00039196014404296875 threshold seconds ---



We run the three algorithms on a one-word query:

In [1]:
testTime(oneWord)

NameError: name 'testTime' is not defined

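We run the three algorithms on a set of queries made of arbitrary words.
Remark: Fagin's algorithm is much slower here (about 8.1 s, versus about 0.45 s for the naive algorithm and 0.44 s for the threshold algorithm) because it needs to go through every posting list.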

In [59]:
testTime(queries)

--- 0.45010828971862793 naiveAlgo seconds ---
--- 8.133760929107666 faginAlgo seconds ---
--- 0.44022607803344727 threshold seconds ---

