In [1]:
from ProbabilisticModel import *

You can create an inverted index directly from a corpus...

In [2]:
corpus = read_LISA_corpus()
ir = IRsystem.from_corpus(corpus)

1000...2000...3000...4000...5000...6000...

... or you can save the inverted index to a pickle file, so that you don't need to rebuild it every time you run the program (you still need to re-read the corpus, though)

In [3]:
# run this only once
corpus = read_LISA_corpus()
IRsystem.create_pickle(corpus)

1000...2000...3000...4000...5000...6000...finished pickling


In [4]:
# run this every time you start the program
corpus = read_LISA_corpus()
ir = IRsystem.from_pickle(corpus)

finished reading
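
The build-once, load-later pattern in the two cells above can be sketched with the standard `pickle` module. This is only an illustration on a toy corpus: `build_index`, `save_index`, `load_index` and the file name are hypothetical helpers, not part of `ProbabilisticModel`, and the internals of `IRsystem.create_pickle` and `IRsystem.from_pickle` may well differ.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def save_index(index, path):
    """Serialize the index to disk (the expensive build happens only once)."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh)


def load_index(path):
    """Read a previously pickled index back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Toy corpus, made up for this sketch.
toy_docs = {1: "public libraries", 2: "children libraries", 3: "library automation"}

index_path = os.path.join(tempfile.mkdtemp(), "toy_index.pkl")
save_index(build_index(toy_docs), index_path)   # run once
restored = load_index(index_path)               # run on every later start
print(sorted(restored["libraries"]))
```

Only the index is stored in this sketch, which is consistent with the corpus still having to be read separately when titles and abstracts are needed.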


First of all, instantiate a query:

In [5]:
myq = Query("public libraries in children education", ir)

To see answers to the query:

In [6]:
myq.update_results()
myq.display_results()    # the number in parentheses is the score

[5703]	CHILDREN'S LIBRARY WORK AMONG ADULTS.
 (9.257754403736788)
[5629]	LIBRARY SERVICE TO IMMIGRANT CHILDREN IN DENMARK.
 (9.233650022543618)
[5061]	LIBRARIES AND THE PROMOTION OF READING HABITS IN CHILDREN.
 (9.13284978271449)
[4609]	REFERENCE SERVICES TO CHILDREN AND STUDENTS IN THE PUBLIC LIBRARY.
 (8.76749834321211)
[2624]	THE PUBLIC LIBRARY AND THE POPULATION.
 (8.641850122601383)
[99]	LIBRARIES FOR CHILDREN.
 (8.2915881656502)
[2666]	FILMS ON MENTAL RETARDATION.
 (8.197867918474893)
[483]	SCHOOLTEACHERS' VIEW OF CHILDREN'S BOOKS.
 (8.187676236284222)
[3504]	INTERNATIONALISM' AND THE CHILDREN'S LITERATURE COMMUNITY IN THE UNITED
STATES: A SECOND LOOK
 (8.162385568347947)
[4570]	CHILDREN'S LITERATURE AND LIBRARIES IN MALI' PROBLEMS AND DEVELOPMENT.
 (8.161015116878117)


You can look at the next pages of results:

In [7]:
myq.display_results(page=2)

[1982]	OUR OWN CHILDREN'S MAGAZINES: INSURING THEIR FUTURE.
 (8.096928349317885)
[4444]	MEDIA AND MICROFORMS.
 (8.061782140784882)
[2825]	THE UNION OF THE MUSES.
 (7.836053703172263)
[1075]	A CONTEMPLATION OF CHILDREN'S SERVICES IN PUBLIC LIBRARIES OF WISCONSIN.
 (7.672927984476142)
[925]	ECER ON BRS.
 (7.653470744716268)
[1577]	CHILDREN'S RIGHTS IN THE PUBLIC LIBRARY.
 (7.634246368411948)
[4206]	THE WILL TO SURVIVE.
 (7.594487976194877)
[4563]	LIBRARY SERVICE TO HEARING IMPAIRED CHILDREN.
 (7.557740721251987)
[1076]	YOUTHVIEW: SURVEY OF CHILDREN'S SERVICES IN MISSOURI PUBLIC LIBRARIES.
 (7.539294213005712)
[3617]	EDUCATIONAL CENTRES-A NEW CHALLENGE TO LIBRARIES.
 (7.53061110300381)


Retrieve a document from the corpus to see its abstract:

In [16]:
print(corpus[5703].abstract)

CONTRIBUTION TO AN ISSUE DEVOTED TO PUBLIC LIBRARY SERVICES IN SWEDEN. A
COUNTRY WITH A RICH PUBLICATION OF CHILDREN'S LITERATURE SHOULD ALSO HAVE
WELL-APPOINTED AND FLOURISHING CHILDREN'S LIBRARIES. CHILDREN'S LIBRARIES HAVE
EXISTED IN SWEDEN FOR 70 YEARS, BUT NOT UNTIL THE 1970S, WITH GREAT
EDUCATIONAL REFORMS, WAS PRIORITY ACCORDED TO LIBRARY ACTIVITIES FOR CHILDREN.
FROM THE LATE 1970S, LIBRARY CONSULTANTS' POSTS HAVE BEEN ESTABLISHED AT
ALMOST ALL THE COUNTY LIBRARIES. DESCRIBES THE CONSULTANT'S ROLE IN RELATION
TO THAT OF OTHERS CONCERNED WITH CHILD WELFARE.



You can try pseudo-relevance feedback to improve the results:

In [9]:
myq.iterative_pseudo_relevance()
myq.display_results()

[5703]	CHILDREN'S LIBRARY WORK AMONG ADULTS.
 (19.649967099668388)
[5629]	LIBRARY SERVICE TO IMMIGRANT CHILDREN IN DENMARK.
 (19.473302856571998)
[4609]	REFERENCE SERVICES TO CHILDREN AND STUDENTS IN THE PUBLIC LIBRARY.
 (19.192809812299053)
[5061]	LIBRARIES AND THE PROMOTION OF READING HABITS IN CHILDREN.
 (19.061365234975106)
[2624]	THE PUBLIC LIBRARY AND THE POPULATION.
 (18.416714333162087)
[4444]	MEDIA AND MICROFORMS.
 (17.788681720956486)
[2666]	FILMS ON MENTAL RETARDATION.
 (17.455458747338227)
[4102]	PUBLIC LIBRARY POLICY.
 (17.177305799403214)
[4570]	CHILDREN'S LITERATURE AND LIBRARIES IN MALI' PROBLEMS AND DEVELOPMENT.
 (16.925035128266103)
[483]	SCHOOLTEACHERS' VIEW OF CHILDREN'S BOOKS.
 (16.804755821825125)
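
As a rough idea of what pseudo-relevance feedback involves, here is a minimal sketch of the textbook procedure: rank the documents, assume the top few are relevant, re-estimate the term weights from that assumption, and repeat until the top set stops changing. It uses Robertson/Sparck Jones weights with 0.5 smoothing on a toy corpus. The function names, the `top_k` and `max_rounds` parameters and the data are assumptions made for illustration; what `iterative_pseudo_relevance` actually does inside `ProbabilisticModel` is not shown in this notebook.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Robertson/Sparck Jones weight of a term, with 0.5 smoothing.

    N: documents in the corpus, n: documents containing the term,
    R: documents assumed relevant, r: assumed-relevant documents containing the term.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))


def rank(query_terms, docs, relevant=frozenset()):
    """Order document ids by the summed weights of the query terms they contain."""
    N, R = len(docs), len(relevant)
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for t in query_terms:
            if t in terms:
                n = sum(1 for d in docs.values() if t in d)
                r = sum(1 for d in relevant if t in docs[d])
                score += rsj_weight(N, n, R, r)
        scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def pseudo_relevance(query_terms, docs, top_k=2, max_rounds=10):
    """Iterate: treat the current top_k as relevant, re-weight, and re-rank."""
    relevant = frozenset()
    ranking = rank(query_terms, docs)
    for _ in range(max_rounds):
        new_relevant = frozenset(ranking[:top_k])
        if new_relevant == relevant:   # the top set is stable, so stop
            break
        relevant = new_relevant
        ranking = rank(query_terms, docs, relevant)
    return ranking


# Toy corpus (made up): each document is represented by its set of terms.
toy_docs = {
    1: {"public", "libraries", "children"},
    2: {"children", "education", "libraries"},
    3: {"library", "automation"},
    4: {"public", "policy"},
    5: {"microcomputers"},
}
result = pseudo_relevance(["public", "libraries", "children", "education"], toy_docs)
print(result)
```

No user input is involved: the assumption that the top-ranked documents are relevant takes the place of explicit feedback.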


You can interact with the system to give relevance feedback: pass as an argument a list of the numbers shown in square brackets next to the titles of the documents you like

In [10]:
myq.give_feedback([4668, 4609, 5703, 5061, 2624, 4102])

And then see how the results change:

In [11]:
myq.update_results()
myq.display_results(how_many=15)   # see more results on a page

[4609]	REFERENCE SERVICES TO CHILDREN AND STUDENTS IN THE PUBLIC LIBRARY.
 (17.986774388845923)
[5703]	CHILDREN'S LIBRARY WORK AMONG ADULTS.
 (17.951316145695603)
[5629]	LIBRARY SERVICE TO IMMIGRANT CHILDREN IN DENMARK.
 (17.855745855522)
[5061]	LIBRARIES AND THE PROMOTION OF READING HABITS IN CHILDREN.
 (17.342861293101393)
[4102]	PUBLIC LIBRARY POLICY.
 (16.919462286072687)
[2624]	THE PUBLIC LIBRARY AND THE POPULATION.
 (16.865820350068013)
[4444]	MEDIA AND MICROFORMS.
 (16.549285063974516)
[2666]	FILMS ON MENTAL RETARDATION.
 (16.1184496054775)
[5685]	WHICH WAY FOR SCHOOL MEDIA SERVICES TO TURN?.
 (15.89761751891816)
[4570]	CHILDREN'S LITERATURE AND LIBRARIES IN MALI' PROBLEMS AND DEVELOPMENT.
 (15.471055620011501)
[925]	ECER ON BRS.
 (15.212494048040838)
[483]	SCHOOLTEACHERS' VIEW OF CHILDREN'S BOOKS.
 (15.095187403829446)
[4578]	DANISH SCHOOL LIBRARY ASSOCIATION ANNUAL MEETING 1981.
 (14.908108945613527)
[1075]	A CONTEMPLATION OF CHILDREN'S SERVICES IN PUBLIC LIBRARIES OF WISCONSI

You can repeat relevance feedback as many times as you want, until you are satisfied with the results; the system remembers your previous feedback

In [12]:
myq.give_feedback([5685, 1075, 99])
myq.update_results()
myq.display_results()

[5703]	CHILDREN'S LIBRARY WORK AMONG ADULTS.
 (16.81549405314066)
[5629]	LIBRARY SERVICE TO IMMIGRANT CHILDREN IN DENMARK.
 (16.68053717667079)
[5061]	LIBRARIES AND THE PROMOTION OF READING HABITS IN CHILDREN.
 (16.35487036941556)
[4609]	REFERENCE SERVICES TO CHILDREN AND STUDENTS IN THE PUBLIC LIBRARY.
 (16.34489465818766)
[2624]	THE PUBLIC LIBRARY AND THE POPULATION.
 (15.750148145957416)
[4444]	MEDIA AND MICROFORMS.
 (15.13173727345086)
[2666]	FILMS ON MENTAL RETARDATION.
 (14.929361882419478)
[4570]	CHILDREN'S LITERATURE AND LIBRARIES IN MALI' PROBLEMS AND DEVELOPMENT.
 (14.535997232268794)
[483]	SCHOOLTEACHERS' VIEW OF CHILDREN'S BOOKS.
 (14.457805980939579)
[4102]	PUBLIC LIBRARY POLICY.
 (14.438560565661428)


You can perform user relevance feedback after pseudo-relevance feedback, or vice versa (the system will reset after each); this is more efficient than instantiating a new query with the same text

If you want, you can delete all the feedback you have given so far:

In [13]:
myq.reset_feedback()

You can change the parameters of the probabilistic model:

In [14]:
help(Query.__init__)

Help on function __init__ in module ProbabilisticModel:

__init__(self, text, ir, k1_param=1.5, b_param=0.75, k3_param=1.5)
    Parameters
    ----------
    text: text of the query
    ir: IRsystem used to answer the query
    k1_param: parameter of the BM25 model which weights document term frequency (default 1.5)
    b_param: parameter of the BM25 model which weights document length normalisation (default 0.75)
    k3_param: parameter of the BM25 model which weights query term frequency (default 1.5)



In [15]:
myq = Query("automation in libraries", ir, k1_param=0, k3_param=0)
# this is the Binary Independence Model
myq.update_results()
myq.display_results()

[175]	ON ALLOCATIONS TO UNIVERSITY LIBRARIES IN THE STATE OF NORTH RHINE-WESTPHALIA
IN THE PERIOD FROM 1975 TO 1980.
 (2.664788898705279)
[181]	A FEW CONSIDERATIONS ON LIBRARY AUTOMATION.
 (2.664788898705279)
[185]	LIBRARIES AND NETWORKS IN TRANSITION: PROBLEMS AND PROSPECTS FOR THE 1980'S.
 (2.664788898705279)
[187]	MICRO COMPUTER SYSTEMS.
 (2.664788898705279)
[188]	THE ROLE OF MICROCOMPUTERS IN LIBRARIES.
 (2.664788898705279)
[254]	THE STATE SYSTEM OF SCIENTIFIC AND TECHNICAL INFORMATION: CURRENT STATE AND
PERSPECTIVES.
 (2.664788898705279)
[276]	DESCRIPTION AND ANALYSIS OF AUTOMATED DATA BANKS.
 (2.664788898705279)
[280]	NEDS NATIONAL EMISSIONS DATA SYSTEM INFORMATION.
 (2.664788898705279)
[282]	SHARING DEVELOPMENT INFORMATION.
 (2.664788898705279)
[283]	INVESTIGATION INTO USERS' REQUIREMENTS AS PART OF THE METHODOLOGICAL APPROACH
TO THE DESIGN OF AUTOMATED INFORMATION SYSTEMS.
 (2.664788898705279)
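
The identical scores above are what you would expect once `k1_param` and `k3_param` are set to 0: term frequencies no longer matter, so every document that matches the same query terms gets the same score. The sketch below shows this with the standard textbook form of the BM25 term weight. The function name `bm25_term_score`, the idf variant and all the numbers are assumptions made for illustration; the exact formula implemented in `ProbabilisticModel` is not shown in this notebook.

```python
import math


def bm25_term_score(tf, qtf, df, n_docs, doc_len, avg_doc_len,
                    k1=1.5, b=0.75, k3=1.5):
    """BM25 contribution of one query term occurring in a document (tf > 0, qtf > 0).

    tf: term frequency in the document, qtf: term frequency in the query,
    df: number of documents containing the term.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))      # assumed idf variant
    length_norm = (1 - b) + b * doc_len / avg_doc_len     # document length normalisation
    doc_part = (k1 + 1) * tf / (k1 * length_norm + tf)    # equals 1 when k1 == 0
    query_part = (k3 + 1) * qtf / (k3 + qtf)              # equals 1 when k3 == 0
    return idf * doc_part * query_part


# Made-up statistics: the same term in a short and in a long document.
stats = dict(qtf=1, df=300, n_docs=6000, avg_doc_len=100)
short_doc = dict(tf=1, doc_len=50)
long_doc = dict(tf=5, doc_len=200)

# With the default parameters the two documents score differently...
default_short = bm25_term_score(**short_doc, **stats)
default_long = bm25_term_score(**long_doc, **stats)

# ...but with k1 = k3 = 0 both collapse to the idf of the term.
bim_short = bm25_term_score(**short_doc, **stats, k1=0, k3=0)
bim_long = bm25_term_score(**long_doc, **stats, k1=0, k3=0)

print(default_short, default_long)
print(bim_short, bim_long)
```

In this form `b_param` has no effect once `k1_param` is 0, because the length normalisation term is multiplied by `k1`.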
