# Project on PyLucene (8.3.0)

The goal of this project was to index App Store comments from different applications and to build a retrieval tool that supports plain queries and query expansion using PyLucene. To run this file, make sure that PyLucene 8.3.0 (available at https://dist.apache.org/repos/dist/dev/lucene/pylucene/8.3.0-rc1/), gensim and the lupyne server are installed.

This file defines 3 classes:
1) Indexer (_indexer)
2) Searcher (searcher)
3) Vector model (vector_model)

The file is divided as follows: a first part on data retrieval, a second part on the indexer, a third part on the searcher (searcher_class and vector_model), and a last part with query examples.

In [7]:
import lucene as pl
import os
from java.nio.file import Paths
import csv

In [8]:
from java.io import StringReader
from org.apache.lucene.analysis.standard import StandardAnalyzer, StandardTokenizer
from org.apache.lucene.document import Document, Field, TextField, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig, DirectoryReader, FieldInfo, IndexOptions,MultiReader, Term
from org.apache.lucene.store import SimpleFSDirectory

In [None]:
#Init Virtual Machine
pl.initVM(vmargs=['-Djava.awt.headless=true'])

# Data retrieval

Given a tsv file (which, despite its extension, uses ";" as the separator), read each row into a list of nine fields.
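As a rough illustration of the reading step, the sketch below parses semicolon-separated text with the standard csv module and keeps only the rows that have all nine fields. The helper name read_rows, the constant N_FIELDS and the sample lines are hypothetical; unlike the function defined in this notebook, it skips a short row instead of stopping at it.

```python
import csv
import io

# Assumption: id, developer, app name, rating, n_comments, top_cat, price, title, body
N_FIELDS = 9

def read_rows(handle, n_fields=N_FIELDS):
    """Return (rows, skipped): rows with at least n_fields columns, and the count of short rows."""
    rows, skipped = [], 0
    for row in csv.reader(handle, delimiter=';'):
        if len(row) < n_fields:
            skipped += 1          # malformed line: not enough columns
            continue
        rows.append(row[:n_fields])
    return rows, skipped

# Hypothetical two-line sample: one complete row and one truncated row
sample = io.StringIO(
    "0;NianticInc;Pokemon GO;4.1;227.6K;6inRolePlaying;Free;Title;Body\n"
    "1;SomeDev;SomeApp\n"
)
rows, skipped = read_rows(sample)
print(len(rows), skipped)
```

Counting the skipped rows separately makes it easier to report how much of the file was actually imported.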

In [20]:
def data_reader(file_name):
    '''Read file with 9 fields with ";" separator'''
    data = []
    with open(file_name) as train:
        csv_reader = csv.reader(train, delimiter=';')
        print('Reading file:')
        count = 0
        try:
            for row in csv_reader:
                if (count<1):
                    print('###Example of input###')
                    print('-id: '+ row[0]+ ' -developer: ' + row[1] + ' -AppName: ' + row[2] 
                      + ' -rating: ' + row[3] + ' -n_comments:' + row[4] + ' -top_cat: ' + row[5]
                     + ' -price: ' + row[6] )
                    print('Comment title: ' + row[7])
                    print('Comment body: ' + row[8])
                    print('#####################')
                data.append([row[0],row[1],row[2],row[3],row[4],row[5],row[6],row[7],row[8]])
                count+=1
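        except Exception as err:
            #Reading stops at the first row that cannot be parsed (e.g. fewer than 9 fields)
            print('-------Error-----------')
            print('unable to retrieve line '+str(count+1)+': '+repr(err))
            print('-----------------------')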
    print(str(count) + ' rows imported out of a bit more than 8000 lines')
    return(data)

In [21]:
file= 'apps_info.tsv'
data = data_reader(file)

Reading file:
###Example of input###
-id: 0 -developer: NianticInc -AppName: ‎Pokémon GO  -rating: 4.1 -n_comments:227.6K -top_cat: 6inRolePlaying -price: Free
Comment title:   Mysteriously addictive....
Comment body: Seems some homework was done on how to build addictive entertainment. Every week I’m surprised at the age group range I see at raid hour in town, everything from 6 to 85 in wheelchairs and all forms in between. Have seen as many as around 200 people on community day flooding the sidewalks and nearly getting ran over.... never said they were all smart and mindful. But I can’t quite give it 5stars :/ Some things I could complain about, are actually what aids in keeping you playing, so I won’t on those things.. However, Shadow Pokémon are a more real issue.. They look cool, but you have to give up the looks to make them useful and that stinks! Collectors are getting gouged on them as you need to significantly increase your room to store them, at some $$ or large amounts of t

In [26]:
print('Example of list:')
print(data[417])

Example of list:
['417', 'PlaytikaLTD', '\u200eWorld Series of Poker - WSOP ', '4.4', '136.2K', '7inCasino', 'Free', '  Buy Buy Buy', 'This is the best poker game out there and it is near perfect I just have a few things that I think could make this truly great.  First, I would like some type of free roll tournament every week or month where people could enter for free and maybe earn some bit of real money.  Second, I think it would also be nice if there was just a little bit more time to make a decision with hands unless I have missed something.  Finally, it would be cool to have leagues of some sort both public and private where people could play like a season or something along those lines. Now to hit on a point that a lot of people bring up in their reviews about having to pay money.  Depends on how good you are in all honesty and how well you understand poker and how the people who make games make money.  If you aren’t good and you want the games bracelets/rings you are either goi

# Indexer

Here, we have to define nine field types:
1) ID
2) Developer Company
3) Application name
4) Average Rating
5) Number of ratings/comments
6) Ranking in category
7) Price of the app
8) Title of comment
9) Body of comment

An improvement would be to create these fields in a loop, using the type of the data to decide whether each field is tokenized and indexed. The problem with this approach is that some columns are not uniform. For example, in the ranking column, some rows hold a string (such as 6inRolePlaying) while others hold just -1 for no rank.
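One possible way to deal with the irregular ranking column is to normalise it before deciding on a field type. The sketch below assumes that ranking strings look like 6inRolePlaying (the rank, the literal "in", then the category), as in the sample rows, and that -1 means no rank. parse_rank is a hypothetical helper and is not part of the notebook.

```python
import re

# Assumed shape of a ranking string: digits, the literal "in", then the category name
RANK_PATTERN = re.compile(r'^(\d+)in(.+)$')

def parse_rank(raw):
    """Return (rank, category) for strings like '6inRolePlaying', or None when there is no rank."""
    raw = raw.strip()
    if raw == '-1':
        return None               # the dataset uses -1 for "no rank"
    match = RANK_PATTERN.match(raw)
    if match is None:
        return None               # unknown format: treat as missing rather than guess
    return int(match.group(1)), match.group(2)

print(parse_rank('6inRolePlaying'))
print(parse_rank('-1'))
```

With the rank and the category separated, each part could be given its own field type instead of one tokenized string.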

In [27]:
from org.apache.lucene.store import RAMDirectory
from org.apache.lucene.store import FSDirectory

In [30]:
class _indexer:
    def __init__(self,data,path,fields_name):
        self._fields = []
        self.data = data
        self.fields_n = fields_name
        self.writer_path = path
        self.init_fields()
        #StandardAnalyzer example.
        self.analyzer = StandardAnalyzer()
        
    def init_writer(self):
        '''Initializes the writer, the storage directory, and the configuration'''
        #The index is always written to the 'index' directory;
        #self.writer_path is stored but is not used here
        storeDir = Paths.get('index')
        #SimpleFSDirectory
        self.store = SimpleFSDirectory((storeDir))
        config = IndexWriterConfig(self.analyzer)
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)#_OR_APPEND)
        #Writer
        self.writer = IndexWriter(self.store,config)
        print(self.writer.getInfoStream())
        self.writer.commit()
        
    def init_fields(self):
        '''Initialize the fields in an array(the indices will be matching the data indices)
        For later use in the data writer'''
        fields=[]

        #Define ID field
        id_f = FieldType()
        id_f.setStored(True)
        id_f.setTokenized(False)
        fields.append(id_f)

        #Define developer field
        dev_f = FieldType()
        dev_f.setStored(True)
        dev_f.setTokenized(False)
        fields.append(dev_f)

        #Define application name field
        app_f = FieldType()
        app_f.setStored(True)
        app_f.setTokenized(True) 
        #app_f.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
        fields.append(app_f)

        #Define average review field
        avgr_f = FieldType()
        avgr_f.setStored(True)
        avgr_f.setTokenized(False)
        fields.append(avgr_f)

        #Define number of comments field
        nbcom_f = FieldType()
        nbcom_f.setStored(True)
        nbcom_f.setTokenized(False)
        fields.append(nbcom_f)

        #Define ranking in category
        rank_f = FieldType()
        rank_f.setStored(True)
        rank_f.setTokenized(True)
        fields.append(rank_f)

        #Define price field
        price_f = FieldType()
        price_f.setStored(True)
        price_f.setTokenized(False)
        fields.append(price_f)

        #Define comment title field
        comt_f = FieldType()
        comt_f.setStored(True)
        comt_f.setTokenized(True)
        comt_f.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
        fields.append(comt_f)

        #Define comment body field
        comb_f = FieldType()
        comb_f.setStored(True)
        comb_f.setTokenized(True)
        comb_f.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
        fields.append(comb_f)
        
        self.fields = fields
        return(fields)
    
    def close_writer(self):
        '''Close the writer'''
        try:
            self.writer.commit()
            self.writer.close()
        except:
            print('Writer already closed')
            
    def write_doc(self,doc):
        '''Add document into the writer'''
        self.writer.addDocument(doc)
    
    def write_data(self):
        '''Get an array of arrays with 9 fields
        Write data into the lucene writer'''
        #Idea was to commit regularly (every 10 docs) but commits are computationally expensive
        printo = True
        self.init_writer()
        for d in self.data:
            #Insert array fields into lucene fields
            #Remove the padding around the comment title without merging words
            d[7] = d[7].strip()
            #Build one lucene field per column and add it to a lucene document
            doc = Document()
            for name, value, field_type in zip(self.fields_n, d, self.fields):
                doc.add(Field(name, value, field_type))
            #Print and write the documents into the lucene index
            printo = self.print_doc(doc,printo)
            self.write_doc(doc)
        #Close writer
        self.get_fields_name()
        self.close_writer()
    
    
    def print_doc(self,doc,printo):
        '''Print and return document'''
        if(printo==True):
            print(doc)
        printo = False
        return (printo)
    
    def get_fields_name(self):
        '''Print and return fields names'''
        print(self.writer.getFieldNames())
        return(self.writer.getFieldNames())
    
    def get_info(self):
        '''Return the analyzer and the store directory (in that order)'''
        return ([self.analyzer,self.store])
        

# Searcher

In [31]:
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.queryparser.classic import QueryParser, MultiFieldQueryParser
from org.apache.lucene.index import IndexReader

In [None]:
#Comment line if gensim already installed on your machine
!pip install gensim

In [39]:
from gensim.test.utils import datapath

In [139]:
#import similarities
from org.apache.pylucene.search.similarities import PythonClassicSimilarity
from org.apache.lucene.search.similarities import BM25Similarity
from org.apache.lucene.search.similarities import TFIDFSimilarity

Here, we used the twitter_25 model with 50 dimensions found at https://github.com/RaRe-Technologies/gensim-data/tree/wiki-english-20171001 . We chose the smallest dimension because the model is heavy to load. Note that the pre-trained model is distributed in GloVe format while gensim needs the Word2Vec (W2V) format: for further work, and to reduce the computation load, choose a model that is directly available in W2V format.
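For context, the two text formats differ in their first line: a Word2Vec text file starts with a header giving the number of vectors and their dimension, while a GloVe text file starts directly with the first vector. The sketch below shows that conversion on an in-memory toy example. add_w2v_header and the two sample vectors are hypothetical; in practice gensim's glove2word2vec script does this job.

```python
import io

def add_w2v_header(glove_lines):
    """Prepend the '<vocab_size> <dimension>' header expected by the Word2Vec text format."""
    lines = [line.rstrip('\n') for line in glove_lines if line.strip()]
    if not lines:
        raise ValueError('empty GloVe input')
    # Each GloVe line is: word value_1 ... value_d (space separated)
    dimension = len(lines[0].split(' ')) - 1
    return ['{} {}'.format(len(lines), dimension)] + lines

# Toy GloVe-style input with two 3-dimensional vectors (made-up values)
glove_text = io.StringIO("game 0.1 0.2 0.3\nstar 0.4 0.5 0.6\n")
w2v_lines = add_w2v_header(glove_text)
print(w2v_lines[0])
```

Since the conversion only adds one line, starting from a file that already has the header avoids rewriting the whole model to a temporary file.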

In [38]:
#Reference to convert from GloVe to W2V:
#https://radimrehurek.com/gensim/scripts/glove2word2vec.html
#Accessed on 30/03/2020

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = ("vocab.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)

In [175]:
#Used to sort list based on second element of tuple
def sortSecond(val):
    return(val[1])

class searcher:
    def __init__(self,path,info):
        '''Initializes the searcher class'''
        self.path = path
        self.isearcher = None
        self.store = info[1]
        self.analyzer = info[0]
        self.init_searcher()
        self.similarities = ['bm25similarity','tfidfsimilarity']
    
    def init_searcher(self):
        '''Initializes the lucene searcher'''
        self.isearcher = IndexSearcher(DirectoryReader.open(self.store))
    
    def get_similarity(self):
        '''Print the similarities offered by the searcher'''
        print(self.similarities)
              
    def format_query(self,fields,params):
        '''Format query to string "field: param" '''
        query = str(fields) + ': ' + str(params)
        return(query)
    
    def query(self,field,param,option):
        if (option == "simple"):
            print("#EXECUTING SIMPLE QUERY#")
            result = self.run_query(field,param)
            self.print_results(result)
        elif(option == "expansion"):   
            print("#EXECUTING EXPANDED QUERY#")
            result2 = self.expand(field,param)
        elif(option == "multiple"):
            print("#EXECUTING MULTIPLE FIELD QUERY#")
            print('Under construction')
        
    def expand(self,field,param):
        """Get new words from the given word and call run query for each of them"""
        self.model = vector_model()
        words = self.model.get_similar_words(param)
        words = self.remove_words(words)
        #Add the original word at the beginning of the array
        words= [param,*words]
        print("Word selected to expand:" + str(words))
        results = []
        for w in words:
            results.append(list(self.run_query(field,w)))
        return(self.sort_results(results))
    
    def remove_words(self,words):
        """
        Cast result(w/similarity) to word
        Use a threshold of 0.75 to keep only the most similar words
        """
        sel_words = []
        for w in words:
            if w[1]>0.75:
                sel_words.append(w[0])
        return(sel_words)
        
    def sort_results(self,results):
        '''Flatten the per-query results into (result, result.score) tuples
        and sort them by score, highest first.
        Note: a document matched by several queries appears once per match'''
        temp_array=[]
        for res in results:
            for r in res:
                temp_array.append((r,r.score))
        temp_array.sort(key=sortSecond,reverse=True)
        sorted_results = [x[0] for x in temp_array]
        self.print_results(sorted_results)
        return sorted_results
    
    def mul_query(self,fields,params):
        pass
    
    def run_query(self,field,param):
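        '''Run a query with a given field and parameter'''
        #The searcher created in __init__ is reused: re-creating it here
        #would discard any similarity chosen with set_similarity
        arg = self.format_query(field,param)
        #'id' is only the default field; arg always names its field explicitly
        qp = QueryParser('id',self.analyzer)
        query = qp.parse( str(arg) ) #title:[* TO *]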
        print('###NEW QUERY: field='+field+'; param='+param+' ###')
        result = self.isearcher.search(query,100).scoreDocs
        return(result)

    def print_results(self,result):
        '''Display results in organized way'''
        print('Results size: ', len(result))
        for r in result[:10]:
            doc = self.isearcher.doc(r.doc)
            print("---------------")
            print('Comment ID:'+str(doc.get('id')) +'\t Application Name:' + str(doc.get('application_name')))
            print('Comment title: '+doc.get('comment_title'))
            print(doc.get('comment_body'))
            print(r.score)
            
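    def set_similarity(self,sim):
        '''Select the scoring model: 0 for BM25, 1 for classic TF-IDF'''
        #https://lucene.apache.org/core/7_0_0/core/index.html?org/apache/lucene/search/similarities/Similarity.html
        if (sim==0 ):
            self.isearcher.setSimilarity(BM25Similarity())
        elif (sim==1):
            #TFIDFSimilarity is abstract and cannot be instantiated;
            #ClassicSimilarity is its concrete implementation (not tested in this notebook)
            from org.apache.lucene.search.similarities import ClassicSimilarity
            self.isearcher.setSimilarity(ClassicSimilarity())
        else:
            print('Similarity wrong or not available')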
        

In [52]:
class vector_model:
    def __init__(self):
        self._model = model
    def get_vocab(self):
        print("Vector model dimension(word,dimension):")
        #model is already a KeyedVectors instance, so the deprecated .wv indirection is not needed
        print(self._model.vectors.shape)
    def get_similar_words(self,word):
        sim_words = self._model.similar_by_word(word = word)
        return(sim_words)
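
Because each expanded word is run as a separate query, a comment that matches several words is returned several times in the merged list. A possible refinement, sketched below with plain (doc_id, score) tuples standing in for Lucene's ScoreDoc objects, keeps only the best score per document before sorting. The function name merge_results and the sample scores are hypothetical.

```python
def merge_results(result_lists):
    """Merge several lists of (doc_id, score) hits, keeping the best score for each doc_id."""
    best = {}
    for hits in result_lists:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first; ties are broken by doc_id so the order is deterministic
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Made-up hits for two expanded words; document 179 matches both
hits_superstar = [(179, 2.5)]
hits_star = [(179, 1.0), (4589, 4.3)]
merged = merge_results([hits_superstar, hits_star])
print(merged)
```

With real results, doc_id and score would come from the doc and score attributes of each hit returned by run_query.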

# Test of the model

## Indexer

In [43]:
#Init_indexer
path = 'data'
fields_n = ['id','developer','application_name','rating',
            'number_comments','topcategory','price',
           'comment_title','comment_body']
indexer = _indexer(data,path,fields_n)
fields = indexer.init_fields()

In [44]:
indexer.write_data()

org.apache.lucene.util.InfoStream$NoOutput@7de62196
Document<stored<id:0> stored<developer:NianticInc> stored<application_name:‎Pokémon GO > stored<rating:4.1> stored<number_comments:227.6K> stored<topcategory:6inRolePlaying> stored<price:Free> stored,indexed,tokenized<comment_title:Mysteriously addictive....> stored,indexed,tokenized<comment_body:Seems some homework was done on how to build addictive entertainment. Every week I’m surprised at the age group range I see at raid hour in town, everything from 6 to 85 in wheelchairs and all forms in between. Have seen as many as around 200 people on community day flooding the sidewalks and nearly getting ran over.... never said they were all smart and mindful. But I can’t quite give it 5stars :/ Some things I could complain about, are actually what aids in keeping you playing, so I won’t on those things.. However, Shadow Pokémon are a more real issue.. They look cool, but you have to give up the looks to make them useful and that stinks! C

In [45]:
info = indexer.get_info()
print(info)

[<StandardAnalyzer: org.apache.lucene.analysis.standard.StandardAnalyzer@2b175c00>, <SimpleFSDirectory: SimpleFSDirectory@/home/docker/images/jupyter/Project/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@74287ea3>]


## Vector model

In [53]:
#Testing vector model with the example of "homework"
test = vector_model()
test.get_vocab()
print(test.get_similar_words('homework'))

Vector model dimension(word,dimension):
(400000, 50)
[('schoolwork', 0.793504536151886), ('chores', 0.758959174156189), ('figuring', 0.7507083415985107), ('basics', 0.7434030175209045), ('typing', 0.7010283470153809), ('tutoring', 0.6962698101997375), ('housekeeping', 0.6656022667884827), ('cramming', 0.6605523824691772), ('worksheets', 0.6596395373344421), ('chore', 0.6594257354736328)]
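
To see what the 0.75 threshold used by remove_words does to this output, the sketch below applies the same filter to the first five neighbours printed above, with similarities rounded to four decimals. keep_similar is a hypothetical stand-alone version of that method.

```python
def keep_similar(neighbours, threshold=0.75):
    """Keep the words whose similarity is strictly above the threshold."""
    return [word for word, similarity in neighbours if similarity > threshold]

# First five neighbours of "homework" from the output above, rounded
neighbours = [('schoolwork', 0.7935), ('chores', 0.7590), ('figuring', 0.7507),
              ('basics', 0.7434), ('typing', 0.7010)]
selected = keep_similar(neighbours)
print(selected)  # ['schoolwork', 'chores', 'figuring']
```

Only three of the ten neighbours pass the threshold here, so an expanded query on "homework" would run four searches in total, including the original word.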




## Searcher model

In [176]:
searcher_class = searcher(path,info)

In [177]:
#Search in comment_body as simple query
searcher_class.query('comment_body','superstar','simple')

#EXECUTING SIMPLE QUERY#
###NEW QUERY: field=comment_body; param=superstar ###
Results size:  1
---------------
Comment ID:179	 Application Name:‎NBA LIVE Mobile Basketball 
Comment title: Great game, but has much more potential
Overall, this is a really fun game. But it has way more potential to be an amazing game. But first of all, in the video previewing the game it shows them putting James Harden, a SG, in the SF player slot. But how come I can’t put players out of position in the real game? The training also looks different in the video. I think putting players out of position would be awesome, and it just doesn’t count for chemistry. Anyway, my first suggestion is to allow you to sub your bench players in in-game. Also allow for instant replays and at the end of season games/h2h quarters let you see the players’ stats. Also, for the programs, stay more consistent with players’ overalls. For example, the legends started as 89s and to do the set you had to get 4 worse legend cards 

In [178]:
#Search into comment_body with query expansion
searcher_class.query('comment_body','superstar',"expansion")

#EXECUTING EXPANDED QUERY#
Word selected to expand:['superstar', 'star', 'superstars', 'idol']
###NEW QUERY: field=comment_body; param=superstar ###
###NEW QUERY: field=comment_body; param=star ###
###NEW QUERY: field=comment_body; param=superstars ###
###NEW QUERY: field=comment_body; param=idol ###
Results size:  107
---------------
Comment ID:4589	 Application Name:‎Jacob Hit and Miss - Sartorius Endless Runner 
Comment title: Kinda weird
First of all I love Jacob Sartorius with all my heart. He is my idol and I am a sartorian for life. But anyways this game is kinda weird.
4.3032426834106445
---------------
Comment ID:179	 Application Name:‎NBA LIVE Mobile Basketball 
Comment title: Great game, but has much more potential
Overall, this is a really fun game. But it has way more potential to be an amazing game. But first of all, in the video previewing the game it shows them putting James Harden, a SG, in the SF player slot. But how come I can’t put players out of position in the rea

In [180]:
#Search into comment_title as simple query
searcher_class.query('comment_title','horrible',"simple")

#EXECUTING SIMPLE QUERY#
###NEW QUERY: field=comment_title; param=horrible ###
Results size:  36
---------------
Comment ID:873	 Application Name:‎Aaah! Make my nails beautiful! FREE- super fun beauty salon game for little flower girls 
Comment title: Horrible
This game won't do anything unless you pay $9.99.
3.1667425632476807
---------------
Comment ID:1146	 Application Name:‎Baby blocks memory match games without the wifi 
Comment title: Horrible
It doesn't even work when I first got on this app the screen was black for 20 minutes
3.1667425632476807
---------------
Comment ID:1176	 Application Name:‎Baby Car Driver - your toddler's first car 
Comment title: Horrible
Dumbest game ever. It doesn’t do anything. I paid 99 cents assuming there would be more to it. My 4 year old played on it for 30 seconds and put my phone down. I want my money back!
3.1667425632476807
---------------
Comment ID:1376	 Application Name:‎CAD Design 3D - for Interior Design & Floor Plan 
Comment title: Horri

In [181]:
#Search into comment_title with query expansion
searcher_class.query('comment_title','horrible',"expansion")

#EXECUTING EXPANDED QUERY#
Word selected to expand:['horrible', 'terrible', 'awful', 'dreadful', 'horrific', 'horrendous', 'horrifying', 'frightening', 'tragic', 'shocking', 'disgusting']
###NEW QUERY: field=comment_title; param=horrible ###
###NEW QUERY: field=comment_title; param=terrible ###
###NEW QUERY: field=comment_title; param=awful ###
###NEW QUERY: field=comment_title; param=dreadful ###
###NEW QUERY: field=comment_title; param=horrific ###
###NEW QUERY: field=comment_title; param=horrendous ###
###NEW QUERY: field=comment_title; param=horrifying ###
###NEW QUERY: field=comment_title; param=frightening ###
###NEW QUERY: field=comment_title; param=tragic ###
###NEW QUERY: field=comment_title; param=shocking ###
###NEW QUERY: field=comment_title; param=disgusting ###
Results size:  78
---------------
Comment ID:1178	 Application Name:‎Baby Car Driver - your toddler's first car 
Comment title: Awful
I want a refund!!!
3.6417582035064697
---------------
Comment ID:1549	 Applicati