Group Members:
- Givari Akbar
- Iqbal Pandu Santoso
- Muhammad Hanif Zuhair
- Narrendra Setyawan Bahar

# Setup

## Install Library

In [1]:
%pip install pyserini

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [14]:
%pip install nltk



## Import Library

In [15]:
import nltk
from nltk.corpus import stopwords
import json
import os
from pyserini.search.lucene import LuceneSearcher

## Document Definitions

In [16]:
docs = [
  ("d1",  "The cat chased a small mouse into the garden."),
  ("d2",  "A friendly dog played fetch by the river."),
  ("d3",  "BM25 is a ranking function widely used in search engines."),
  ("d4",  "Boolean retrieval uses logical operators like AND and OR."),
  ("d5",  "TF-IDF weights terms by frequency and rarity."),
  ("d6",  "Neural retrieval uses dense embeddings for semantic search."),
  ("d7",  "The dog and the cat slept on the same couch."),
  ("d8",  "The library hosts a workshop on information retrieval."),
  ("d9",  "Students implemented BM25 and compared it with TF-IDF."),
  ("d10", "The chef roasted chicken with rosemary and garlic."),
  ("d11", "A black cat crossed the old stone bridge at night."),
  ("d12", "Dogs are loyal companions during long hikes."),
  ("d13", "The dataset contains fifteen short sentences for testing."),
  ("d14", "Reranking models reorder BM25 candidates using transformers."),
  ("d15", "The dog sniffed a cat but ignored the mouse.")
]

# Preprocessing

In [21]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to C:\Users\MyBook Hype
[nltk_data]     AMD\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\MyBook Hype
[nltk_data]     AMD\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\MyBook Hype
[nltk_data]     AMD\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [59]:
import re

def preprocessing(docs):
    """Lowercase, strip punctuation, tokenize, and remove English stopwords."""
    preprocessed_docs = []
    stop_words = set(stopwords.words('english'))

    for doc_id, doc in docs:
        # Convert to lowercase
        doc = doc.lower()

        # Remove punctuation (this also drops hyphens, so "tf-idf" becomes "tfidf")
        doc = re.sub(r'[^\w\s]', '', doc)

        # Tokenize
        tokens = nltk.word_tokenize(doc)

        # Remove stopwords
        filtered_tokens = [word for word in tokens if word not in stop_words]

        cleaned_doc = ' '.join(filtered_tokens)
        preprocessed_docs.append((doc_id, cleaned_doc))
    return preprocessed_docs

In [60]:
docs_res = preprocessing(docs)
for doc_id, doc in docs_res:
    print(f"{doc_id}: {doc}")

d1: cat chased small mouse garden
d2: friendly dog played fetch river
d3: bm25 ranking function widely used search engines
d4: boolean retrieval uses logical operators like
d5: tfidf weights terms frequency rarity
d6: neural retrieval uses dense embeddings semantic search
d7: dog cat slept couch
d8: library hosts workshop information retrieval
d9: students implemented bm25 compared tfidf
d10: chef roasted chicken rosemary garlic
d11: black cat crossed old stone bridge night
d12: dogs loyal companions long hikes
d13: dataset contains fifteen short sentences testing
d14: reranking models reorder bm25 candidates using transformers
d15: dog sniffed cat ignored mouse
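
Note that the punctuation step removes hyphens, which is why d5 and d9 above contain `tfidf` rather than `tf-idf`. The snippet below is a minimal sketch that isolates that single step using the same regular expression as the notebook; the helper name `strip_punctuation` is hypothetical and is not part of the notebook.

```python
import re

def strip_punctuation(text):
    """Lowercase text and drop every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

# The hyphen is removed, so "TF-IDF" collapses into the single token "tfidf".
print(strip_punctuation("TF-IDF weights terms by frequency and rarity."))
# tfidf weights terms by frequency and rarity
```

Any query term aimed at this text therefore has to be written as `tfidf` to match what is stored in the index.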


# Indexing Corpus + Stemming

## Saving the Preprocessing Results to JSONL

In [61]:
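jsonl_path = 'documents.jsonl'
documents = [{'id': doc_id, 'contents': doc} for doc_id, doc in docs_res]

# Create the collection directory if it does not exist yet
os.makedirs('docs', exist_ok=True)

# Write one JSON object per line (JSONL) with the 'id' and 'contents' fields
with open(os.path.join('docs', jsonl_path), 'w', encoding='utf-8') as f:
    for doc in documents:
        f.write(json.dumps(doc) + '\n')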
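
Before indexing, it can help to confirm that the file holds one valid JSON object per line, each with an `id` and a `contents` field. The following is a minimal sketch of such a check. The helper `load_jsonl` is hypothetical, and the demonstration writes two sample records to a temporary directory instead of touching `docs/documents.jsonl`.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Read a JSONL file and return its records as a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Two sample records in the same shape the notebook writes.
sample = [
    {'id': 'd1', 'contents': 'cat chased small mouse garden'},
    {'id': 'd2', 'contents': 'friendly dog played fetch river'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'documents.jsonl')
    with open(sample_path, 'w', encoding='utf-8') as f:
        for record in sample:
            f.write(json.dumps(record) + '\n')
    loaded = load_jsonl(sample_path)

# Every record should round-trip unchanged and carry both required keys.
assert loaded == sample
assert all({'id', 'contents'} <= record.keys() for record in loaded)
print(f"{len(loaded)} records verified")
```

In the notebook itself, the equivalent call would be `load_jsonl(os.path.join('docs', jsonl_path))`.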

## Running Indexing + Stemming

In [62]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input docs \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --stemmer porter \
  --storePositions --storeDocvectors --storeRaw

2025-08-26 19:45:28,962 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:205) - Setting log level to INFO
2025-08-26 19:45:28,967 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) - AbstractIndexer settings:
2025-08-26 19:45:28,967 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + DocumentCollection path: docs
2025-08-26 19:45:28,967 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + CollectionClass: JsonCollection
2025-08-26 19:45:28,968 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Index path: indexes/sample_collection_jsonl
2025-08-26 19:45:28,969 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Threads: 1
2025-08-26 19:45:28,970 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:214) -  + Optimize (merge segments)? false
2025-08-26 19:45:29,046 INFO  [main] index.IndexCollection (IndexCollection.java:246) - Using DefaultEnglishAnalyzer
2025-08-26 19:45:29,046 INFO  [main] index.IndexCo

Aug 26, 2025 7:45:29 PM org.apache.lucene.store.MMapDirectory lookupProvider


# Boolean Retrieval

In [82]:
from pyserini.search.lucene import querybuilder

searcher = LuceneSearcher('indexes/sample_collection_jsonl')

In [83]:
queries = [("q1", "dog AND cat"),
           ("q2", "dog OR cat"),
           ("q3", "dog AND NOT cat"),
           ("q4", "(bm25 OR tf-idf) AND retrieval"),
           ("q5", "NOT workshop AND library")]

# Define the term queries
term_dog = querybuilder.get_term_query("dog")
term_cat = querybuilder.get_term_query("cat")
term_bm25 = querybuilder.get_term_query("bm25")
# Preprocessing stripped the hyphen, so the indexed token is "tfidf", not "tf-idf"
term_tfidf = querybuilder.get_term_query("tfidf")
term_retrieval = querybuilder.get_term_query("retrieval")
term_workshop = querybuilder.get_term_query("workshop")
term_library = querybuilder.get_term_query("library")

# Define the Boolean operators
and_op = querybuilder.JBooleanClauseOccur['must'].value
or_op = querybuilder.JBooleanClauseOccur['should'].value
not_op = querybuilder.JBooleanClauseOccur['must_not'].value

#Query 1: dog AND cat
q1_builder = querybuilder.get_boolean_query_builder()
q1_builder.add(term_dog, and_op)
q1_builder.add(term_cat, and_op)
q1 = q1_builder.build()

#Query 2: dog OR cat
q2_builder = querybuilder.get_boolean_query_builder()
q2_builder.add(term_dog, or_op)
q2_builder.add(term_cat, or_op)
q2 = q2_builder.build()

#Query 3: dog AND NOT cat
q3_builder = querybuilder.get_boolean_query_builder()
q3_builder.add(term_dog, and_op)
q3_builder.add(term_cat, not_op)
q3 = q3_builder.build()

#Query 4: (bm25 OR tf-idf) AND retrieval
q4_inner_builder = querybuilder.get_boolean_query_builder()
q4_inner_builder.add(term_bm25, or_op)
q4_inner_builder.add(term_tfidf, or_op)
q4_inner = q4_inner_builder.build()
q4_builder = querybuilder.get_boolean_query_builder()
q4_builder.add(q4_inner, and_op)
q4_builder.add(term_retrieval, and_op)
q4 = q4_builder.build()

#Query 5: NOT workshop AND library
q5_builder = querybuilder.get_boolean_query_builder()
q5_builder.add(term_workshop, not_op)
q5_builder.add(term_library, and_op)
q5 = q5_builder.build()

boolean_queries = [("q1", q1),
                   ("q2", q2),
                   ("q3", q3),
                   ("q4", q4),
                   ("q5", q5)]


In [84]:
# Run each Boolean query and print its hits next to the original query text
for (query_id, query), (_, query_text) in zip(boolean_queries, queries):
    hits = searcher.search(query)
    print(f"\nQuery: {query_text}")
    if len(hits) == 0:
        print("No results found.")
        continue
    for i in range(len(hits)):
        cont_data = json.loads(searcher.doc(hits[i].docid).raw())
        print(f'{i+1:2}. {hits[i].docid:4} {hits[i].score:.5f}  {cont_data["contents"]}')


Query: dog AND cat
 1. d7   1.41170  dog cat slept couch
 2. d15  1.36290  dog sniffed cat ignored mouse

Query: dog OR cat
 1. d7   1.41170  dog cat slept couch
 2. d15  1.36290  dog sniffed cat ignored mouse
 3. d1   0.68150  cat chased small mouse garden
 4. d12  0.68150  dogs loyal companions long hikes
 5. d2   0.68150  friendly dog played fetch river
 6. d11  0.63740  black cat crossed old stone bridge night

Query: dog AND NOT cat
 1. d12  0.68150  dogs loyal companions long hikes
 2. d2   0.68150  friendly dog played fetch river

Query: (bm25 OR tf-idf) AND retrieval
No results found.

Query: NOT workshop AND library
No results found.
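
The result sets above can be cross-checked without Lucene by using plain set operations over an inverted index. The sketch below is an illustration under stated assumptions, not the notebook's method: the corpus is copied from the preprocessing output, and `normalize` is a crude, hypothetical stand-in for Porter stemming that only strips a trailing "s" (enough to map "dogs" to "dog" here). It reproduces which documents match, not the scores.

```python
# Preprocessed documents, copied from the preprocessing output.
corpus = {
    "d1": "cat chased small mouse garden",
    "d2": "friendly dog played fetch river",
    "d3": "bm25 ranking function widely used search engines",
    "d4": "boolean retrieval uses logical operators like",
    "d5": "tfidf weights terms frequency rarity",
    "d6": "neural retrieval uses dense embeddings semantic search",
    "d7": "dog cat slept couch",
    "d8": "library hosts workshop information retrieval",
    "d9": "students implemented bm25 compared tfidf",
    "d10": "chef roasted chicken rosemary garlic",
    "d11": "black cat crossed old stone bridge night",
    "d12": "dogs loyal companions long hikes",
    "d13": "dataset contains fifteen short sentences testing",
    "d14": "reranking models reorder bm25 candidates using transformers",
    "d15": "dog sniffed cat ignored mouse",
}

def normalize(token):
    """Assumption: approximate stemming by stripping one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

# Inverted index: normalized term -> set of document IDs containing it.
index = {}
for doc_id, text in corpus.items():
    for token in text.split():
        index.setdefault(normalize(token), set()).add(doc_id)

def postings(term):
    """Return the set of document IDs for a term (empty set if unseen)."""
    return index.get(normalize(term), set())

# AND is intersection, OR is union, AND NOT is set difference.
results = {
    "dog AND cat": postings("dog") & postings("cat"),
    "dog OR cat": postings("dog") | postings("cat"),
    "dog AND NOT cat": postings("dog") - postings("cat"),
    "(bm25 OR tfidf) AND retrieval":
        (postings("bm25") | postings("tfidf")) & postings("retrieval"),
    "library AND NOT workshop": postings("library") - postings("workshop"),
}

for query_text, doc_ids in results.items():
    print(f"{query_text}: {sorted(doc_ids, key=lambda d: int(d[1:]))}")
# dog AND cat: ['d7', 'd15']
# dog OR cat: ['d1', 'd2', 'd7', 'd11', 'd12', 'd15']
# dog AND NOT cat: ['d2', 'd12']
# (bm25 OR tfidf) AND retrieval: []
# library AND NOT workshop: []
```

The last two queries are empty because of the corpus itself: no document that mentions retrieval (d4, d6, d8) also mentions BM25 or TF-IDF, and d8 is the only document containing "library" and it also contains "workshop".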
