# Tests file

This notebook contains the consistency and performance tests.

In [87]:
#import sys
#!conda install --yes --prefix {sys.prefix} -c conda-forge gensim

import Globals.globals as glob
import time
import SearchAlgorithms.searchAlgorithms as algo
from QueryMaker.queryShell import processQueryString
from DocumentServer import documentServer

## Consistency tests

### Impact of the word score on the top 10 documents

In this section, we run the naive algorithm on an inverted file built without any stemming, lemmatization or word embedding.
The following user query is used in both cases: "Chocolate and internet".

First, the word score is simply the number of occurrences of the word in the document.

We then run the same test with tf/idf as the word score.
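To make the difference between the two word scores concrete, here is a minimal sketch on a toy corpus. The corpus, the helper names `count_score` and `tfidf_score`, and the `tf * log(N / df)` variant are assumptions made for illustration; the formula actually used to build the inverted file is not shown in this notebook and may differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): document id -> list of tokens.
toy_docs = {
    "d1": ["chocolate", "internet", "chocolate"],
    "d2": ["internet", "news"],
    "d3": ["chocolate", "cake"],
}

def count_score(token, doc_tokens):
    """Word score as the raw number of occurrences in the document."""
    return Counter(doc_tokens)[token]

def tfidf_score(token, doc_tokens, corpus):
    """Word score as tf * log(N / df), one common tf/idf variant (assumed here)."""
    tf = Counter(doc_tokens)[token]
    df = sum(1 for tokens in corpus.values() if token in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

for doc_id, tokens in toy_docs.items():
    print(doc_id,
          count_score("chocolate", tokens),
          round(tfidf_score("chocolate", tokens, toy_docs), 3))
```

With the raw count, a word that appears in almost every document weighs as much as a rare one; the idf factor lowers the weight of such frequent words.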

The tf/idf score seems to give better results than the raw number of occurrences. TO CHANGE !!!

### Impact of the search algorithm on the top 10 documents

In this section, we use neither stemming/lemmatization nor word embedding. tf/idf has been chosen as the token score.
We also use the query "Chocolate and internet" for each algorithm.

First, the naive algorithm was already run in the previous section.
The results were:
....
....
....
....
....
....
....
....
....

We run the same query with the Fagin algorithm.
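As a reminder of what this algorithm does, here is a minimal sketch of Fagin's algorithm on toy posting lists. It is not the project's `algo.faginAlgo`: the function name `fagin_top_k`, the data, the use of a sum as aggregation function and the convention that a document missing from a posting list scores 0 are all assumptions.

```python
import heapq

def fagin_top_k(posting_lists, k):
    """Top-k documents by summed score, following Fagin's algorithm.

    Each posting list is a list of (doc_id, score) pairs sorted by
    decreasing score, with each document appearing at most once.
    """
    # Random-access index for each list: doc_id -> score.
    lookups = [dict(pl) for pl in posting_lists]
    seen_in = {}      # doc_id -> number of lists in which it was seen
    fully_seen = 0    # number of documents seen in every list
    depth = 0
    max_depth = max((len(pl) for pl in posting_lists), default=0)

    # Phase 1: sorted access in parallel until k documents
    # have been seen in all lists (or the lists are exhausted).
    while depth < max_depth and fully_seen < k:
        for pl in posting_lists:
            if depth < len(pl):
                doc_id, _ = pl[depth]
                seen_in[doc_id] = seen_in.get(doc_id, 0) + 1
                if seen_in[doc_id] == len(posting_lists):
                    fully_seen += 1
        depth += 1

    # Phase 2: random access to complete the score of every seen document.
    totals = {doc: sum(lk.get(doc, 0.0) for lk in lookups) for doc in seen_in}
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical posting lists for two query tokens.
chocolate_pl = [("d1", 9.0), ("d3", 6.0), ("d2", 1.0)]
internet_pl = [("d2", 8.0), ("d1", 7.0), ("d4", 2.0)]
print(fagin_top_k([chocolate_pl, internet_pl], 2))
```

Note that phase 2 computes a full score for every document met during phase 1, which can be many more than k.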

We run the same query with the threshold algorithm.
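The threshold algorithm can be sketched in the same way. Again, this is not the project's `algo.threshold`: the function name `threshold_top_k`, the toy data, the sum aggregation and the assumption of non-negative scores (a missing document scores 0) are illustrative choices.

```python
import heapq

def threshold_top_k(posting_lists, k):
    """Top-k documents by summed score, following the threshold algorithm.

    Each posting list is a list of (doc_id, score) pairs sorted by
    decreasing score; scores are assumed to be non-negative.
    """
    lookups = [dict(pl) for pl in posting_lists]
    totals = {}  # doc_id -> full score, computed on first sight
    max_depth = max((len(pl) for pl in posting_lists), default=0)

    for depth in range(max_depth):
        # Threshold: best total score an unseen document could still reach.
        threshold = 0.0
        for pl in posting_lists:
            if depth >= len(pl):
                continue  # an exhausted list contributes 0
            doc_id, score = pl[depth]
            threshold += score
            if doc_id not in totals:
                # Random access to every list for the newly seen document.
                totals[doc_id] = sum(lk.get(doc_id, 0.0) for lk in lookups)
        best = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop as soon as k documents score at least the threshold.
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical posting lists for two query tokens.
chocolate_scores = [("d1", 9.0), ("d3", 6.0), ("d2", 1.0)]
internet_scores = [("d2", 8.0), ("d1", 7.0), ("d4", 2.0)]
print(threshold_top_k([chocolate_scores, internet_scores], 2))
```

Unlike Fagin's algorithm, the stopping test here depends on the scores themselves, so the scan can stop early when the best documents sit at the top of the lists.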

The ????? algorithm seems to be the best.

### Impact of stemming, lemmatization and word embedding

In this section, we use the Fagin algorithm with tf/idf scores on the query "Chocolate and feet".

Without stemming, lemmatization or word embedding, we obtain the same results as before:
DETAILS THE RESULTS

We now apply stemming to both the inverted file and the user query.

In [89]:
def applyFaginOnQuery(processedQuery):
    """Run the Fagin algorithm on a processed query and print each result's metadata."""
    queryResult = algo.faginAlgo(processedQuery)
    if queryResult:
        returnedDocuments = documentServer.serveDocuments(queryResult)
        print("\n")
        print("results:\n")
        for idx, doc in enumerate(returnedDocuments.keys()):
            print(idx+1,"----------------------------------")
            print(returnedDocuments[doc]["metadata"])
            print("----------------------------------")
    else:
        print("no result\n")

In [90]:
glob.loadVocabulary("./Globals/stemm_nolemm_tfidf/vocabulary.dict","./Globals/stemm_nolemm_tfidf/IF.dict")

query = "Chocolate and feet"

# Apply stemming on the query
processedQuery = processQueryString(query,stemming = True)
print(processedQuery)

# Apply fagin algorithm
applyFaginOnQuery(processedQuery)

[('chocol', 3), ('feet', 3)]
2
[('110992', 47.751000000000005)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817), ('30071', 35.199000000000005)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817), ('30071', 35.199000000000005), ('134434', 32.817)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817), ('30071', 35.199000000000005), ('134434', 32.817), ('323491', 38.556000000000004)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817), ('30071', 35.199000000000005), ('134434', 32.817), ('323491', 38.556000000000004), ('53702', 35.199000000000005)]
[('110992', 47.751000000000005), ('247462', 35.199000000000005), ('103552', 32.817), ('30071', 35.199000000000005), ('134434', 32.817), ('323491', 38.556000000000004),

TypeError: 'NoneType' object is not subscriptable

Result obtained:

    The vocabulary set has a size of  234118

    [('chocol', 3), ('feet', 3)]
    
    Top 10 :


We will now add the lemmatization procedure to tokens in the inverted file and in the query.

In [91]:
glob.loadVocabulary("./Globals/stemm_lemm_tfidf/vocabulary.dict","./Globals/stemm_lemm_tfidf/IF.dict")

query = "Chocolate and feet"

# Apply lemmatization on the query
processedQuery = processQueryString(query,lemmatization = True)
print(processedQuery)

# Apply fagin algorithm
applyFaginOnQuery(processedQuery)

Could not open/read file: ./Globals/stemm_lemm_tfidf/vocabulary.dict


SystemExit: 

Finally, we extend the query with 3 synonyms for each token using word embedding.

In [92]:
glob.loadVocabulary("./Globals/stemm_lemm_tfidf/vocabulary.dict","./Globals/stemm_lemm_tfidf/IF.dict")

import pickle

# Load the pickled word embedding model
with open('./Globals/embeddingModel', 'rb') as embeddingFile:
    model = pickle.load(embeddingFile)

query = "Chocolate and feet"

# Apply lemmatization and word embedding expansion on the query
processedQuery = processQueryString(query,lemmatization = True, embedding = True, embeddingModel = model, nbOfSynonyms = 3)
print(processedQuery)

# Apply fagin algorithm
applyFaginOnQuery(processedQuery)

Could not open/read file: ./Globals/stemm_lemm_tfidf/vocabulary.dict


SystemExit: 
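Since the cell above could not run, here is a minimal sketch of what extending a query with its nearest neighbours in an embedding space means. It does not reproduce `processQueryString`: the function `expand_query`, the 3-dimensional vectors and the vocabulary are hypothetical, and query tokens are plain strings here rather than the `(token, weight)` pairs used elsewhere in this notebook.

```python
import math

# Hypothetical embeddings; a real model would be trained on the corpus.
toy_vectors = {
    "chocolate": [0.9, 0.1, 0.0],
    "cocoa":     [0.8, 0.2, 0.1],
    "candy":     [0.7, 0.3, 0.0],
    "feet":      [0.0, 0.1, 0.9],
    "foot":      [0.1, 0.1, 0.8],
    "internet":  [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_query(tokens, vectors, nb_synonyms):
    """Append to each token its nb_synonyms nearest neighbours in the embedding space."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        if token not in vectors:
            continue  # unknown words are kept but not expanded
        neighbours = sorted(
            (word for word in vectors if word != token),
            key=lambda word: cosine(vectors[token], vectors[word]),
            reverse=True,
        )
        for word in neighbours[:nb_synonyms]:
            if word not in expanded:
                expanded.append(word)
    return expanded

print(expand_query(["chocolate", "feet"], toy_vectors, 1))
```

The nearest neighbours of a word are not always true synonyms, so a larger number of added words tends to trade precision for recall.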

CONCLUSION ON STEM LEM EMBEDDING

## Performance tests

### Time to build the inverted file

In this section, we will use neither stemming/lemmatization nor word embedding.

First, we build the inverted file over the whole data set in RAM and request one posting list from it.

Then, we build the inverted file in memory and request one posting list.
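The kind of measurement described here can be sketched as follows on a toy corpus. The builder `build_inverted_file`, the whitespace tokenizer, the raw-count score and the data are assumptions; the project's own builder is not shown in this notebook.

```python
import time
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """Map each token to its posting list of (doc_id, occurrence count)."""
    inverted = defaultdict(list)
    for doc_id, text in docs.items():
        for token, count in Counter(text.lower().split()).items():
            inverted[token].append((doc_id, count))
    # Keep each posting list sorted by decreasing score.
    for postings in inverted.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(inverted)

# Toy corpus (hypothetical data).
corpus = {
    "d1": "chocolate and internet and chocolate",
    "d2": "internet news",
    "d3": "chocolate cake",
}

start = time.perf_counter()
inverted_file = build_inverted_file(corpus)
build_seconds = time.perf_counter() - start

start = time.perf_counter()
chocolate_postings = inverted_file.get("chocolate", [])
lookup_seconds = time.perf_counter() - start

print(chocolate_postings)
print("--- %s seconds to build ---" % build_seconds)
print("--- %s seconds to request one posting list ---" % lookup_seconds)
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock, which matters for durations as short as a single dictionary lookup.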

In [16]:
def testTime(queries):
    start_time = time.time()
    for query in queries:
        algo.naiveAlgo(query)
    print("--- %s naiveAlgo seconds ---" % (time.time() - start_time))
    start_time = time.time()
    for query in queries:
        algo.faginAlgo(query)
    print("--- %s faginAlgo seconds ---" % (time.time() - start_time))
    start_time = time.time()
    for query in queries:
        algo.threshold(query)
    print("--- %s threshold seconds ---" % (time.time() - start_time))


In [49]:
glob.loadVocabulary("./Globals/nostemm_nolemm_tf_idf/vocabulary.dict","./Globals/nostemm_nolemm_tf_idf/IF.dict")


In [50]:
oneWord = [
    [("daylight",3)],
]

notExist = [
    [("fdadfdfewf",3)],
    [("114rf4434",3)],
    [("jdifjoiq2323",3)],
]

queries = [
    [("love",3), ("chocolate",3)],
    [("january",3)],
    [("narrow",3)],
    [("today",3), ("tomorrow",3)],
]

queries1 = [
    [("love",3), ("and",3), ("chocolate",3)],
    [("january",3)],
    [("narrow",3)],
    [("today",3), ("and",3), ("tomorrow",3)],
]



We run the three algorithms on words that do not exist in the vocabulary:

In [57]:
testTime(notExist)

--- 4.124641418457031e-05 naiveAlgo seconds ---
--- 0.0005621910095214844 faginAlgo seconds ---
--- 0.00039196014404296875 threshold seconds ---



We run the three algorithms on a single-word query:

In [58]:
testTime(oneWord)

--- 0.002106189727783203 naiveAlgo seconds ---
--- 0.004480123519897461 faginAlgo seconds ---
--- 0.0021691322326660156 threshold seconds ---



We run the three algorithms on queries made of random words.
Remark: the Fagin algorithm is much slower here (about 8.1 s, against about 0.45 s for each of the other two) because it needs to go through every posting list.

In [59]:
testTime(queries)

--- 0.45010828971862793 naiveAlgo seconds ---
--- 8.133760929107666 faginAlgo seconds ---
--- 0.44022607803344727 threshold seconds ---

