# INFO 4271 - Group Project

Issued: June 17, 2025

Due: July 21, 2025

Please submit a link to your code repository (with a branch that no longer changes after the submission deadline) and your 4-page report via email to carsten.eickhoff@uni-tuebingen.de by the due date. One submission per team.

---

# 1. Web Crawling & Indexing
Crawl the web to discover **English content related to Tübingen**. The crawled content should be stored locally. If interrupted at any point, your crawler should be able to restart and resume crawling from where it left off.
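
One way to meet the restart requirement is to checkpoint the crawl frontier and the set of visited URLs to disk. The following is only a minimal sketch using the standard library; the `CrawlState` class, its method names, and the JSON file layout are hypothetical and are not taken from the `project.Crawler` used below.

```python
import json
import os
from collections import deque


class CrawlState:
    """Frontier and visited set that can be checkpointed to a JSON file."""

    def __init__(self, path, seeds):
        self.path = path
        if os.path.exists(path):
            # Resume: reload the frontier and visited set from the last checkpoint.
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            self.frontier = deque(data["frontier"])
            self.visited = set(data["visited"])
        else:
            # Fresh start: begin from the seed URLs.
            self.frontier = deque(seeds)
            self.visited = set()

    def next_url(self):
        """Pop the next unvisited URL, or return None when the frontier is empty."""
        while self.frontier:
            url = self.frontier.popleft()
            if url not in self.visited:
                return url
        return None

    def mark_done(self, url, new_links=()):
        """Record a fetched URL and enqueue the links discovered on it."""
        self.visited.add(url)
        self.frontier.extend(link for link in new_links if link not in self.visited)

    def save(self):
        """Write to a temp file first, then swap it in, so an interrupt cannot corrupt the checkpoint."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"frontier": list(self.frontier), "visited": sorted(self.visited)}, f)
        os.replace(tmp, self.path)
```

Calling `save()` after each `mark_done()` (or every few pages) means that a crash between fetching and saving only repeats the page that was in flight, since it is still in the frontier of the last checkpoint.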

In [None]:
%load_ext autoreload
%autoreload 2

import project

start_urls = [
    r"https://duckduckgo.com/html/?q=tübingen%20english",
    # "https://en.wikipedia.org/wiki/T%C3%BCbingen",
    # 'https://www.mygermanyvacation.com/best-things-to-do-and-see-in-tubingen-germany/',
    # 'https://www.britannica.com/place/Tubingen-Germany',
    # 'https://www.germany.travel/en/cities-culture/tuebingen.html',
    # 'https://visit-tubingen.co.uk/',
]
crawler = project.Crawler(urls=start_urls, max_workers=30)
crawler.run()

In [None]:
filtered = {p for p, d in crawler.visited_pages.items() if d["keywords_found"] and d["is_english"]}
print(len(filtered))
filtered

# 2. Query Processing 
Process a textual query and return the 100 most relevant documents from your index. Please incorporate **at least one retrieval model innovation** that goes beyond BM25 or TF-IDF. Queries must be accepted either individually in an interactive user interface (see also Section 3 below) or via a batch file containing multiple queries at once. The batch file (see `queries.txt` for an example) has one query per line, listing the query number and the query text as tab-separated entries. An example of the batch file for the first two queries looks like this:

```
1   tübingen attractions
2   food and drinks
```
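
Reading such a batch file could look roughly like the sketch below. The function name `read_queries` is hypothetical, and the code assumes every non-blank line contains exactly one tab between the query number and the query text, as described above.

```python
def read_queries(path):
    """Return a list of (query_number, query_text) pairs from a tab-separated file."""
    queries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # Split on the first tab only, so the query text stays intact.
            number, text = line.split("\t", 1)
            queries.append((int(number), text.strip()))
    return queries
```

Opening the file explicitly as UTF-8 matters here because queries such as `tübingen attractions` contain non-ASCII characters.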

In [None]:
import random
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, losses, InputExample, evaluation
from torch.utils.data import DataLoader
import pandas as pd

# ===== PARAMETERS =====
# MODEL_NAME       = "sentence-transformers/all-MiniLM-L6-v2"
MODEL_NAME       = "rasyosef/SPLADE-BERT-Mini"
SUBSET_SIZE      = None  # set to e.g. 10_000 to train on a random subset
BATCH_SIZE       = 8
EPOCHS           = 3
WARMUP_STEPS     = 100
OUTPUT_PATH      = "student_bge_distilled"
EVAL_SUBSET_SIZE = 2_000

# ===== 1. LOAD DATA =====
ds_train = load_dataset("microsoft/ms_marco", "v1.1", split="train")
if SUBSET_SIZE:
    ds_train = ds_train.shuffle(seed=42).select(range(SUBSET_SIZE))
ds_dev = load_dataset("microsoft/ms_marco", "v1.1", split="validation")
if EVAL_SUBSET_SIZE:
    ds_dev = ds_dev.shuffle(seed=7).select(range(EVAL_SUBSET_SIZE))

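# ===== 2. PREPARE TRIPLETS =====
# Each triplet is (query, one selected passage, one randomly chosen unselected passage).
random.seed(42)  # make the negative sampling below reproducible
triplets = []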
for row in ds_train:
    query    = row["query"]
    passages = row["passages"]["passage_text"]
    flags    = row["passages"]["is_selected"]
    positives = [p for p, f in zip(passages, flags) if f == 1]
    negatives = [p for p, f in zip(passages, flags) if f == 0]
    if not positives or not negatives:
        continue
    triplets.append(InputExample(texts=[query, positives[0], random.choice(negatives)]))

dataloader = DataLoader(triplets, shuffle=True, batch_size=BATCH_SIZE)

# ===== 3. LOAD STUDENT MODEL =====
student = SentenceTransformer(MODEL_NAME)

# ===== 4. BUILD IR EVALUATOR =====
# Build the query, corpus, and relevance dicts the IR evaluator expects
queries = {str(i): row['query'] for i, row in enumerate(ds_dev)}
corpus = {}
relevant_docs = {}
for i, row in enumerate(ds_dev):
    pos_ids = set()
    for j, (text, sel) in enumerate(zip(row['passages']['passage_text'], row['passages']['is_selected'])):
        doc_id = f"{i}_{j}"
        corpus[doc_id] = text
        if sel == 1:
            pos_ids.add(doc_id)
    relevant_docs[str(i)] = pos_ids

# Initialize evaluator
evaluator = evaluation.InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    show_progress_bar=False,
    name='msmarco-dev'
)

# ===== 5. EVALUATE BEFORE TRAINING =====
print("Evaluating before training:")
stats_before = evaluator(student)
df = pd.DataFrame([
    {"Phase": "Before Training", **stats_before},
    # {"Phase": "After Training", **stats_after}
])
print(df.to_markdown(index=False))  # to_markdown requires the optional "tabulate" package

# ===== 6. SETUP LOSS =====
loss = losses.TripletLoss(model=student)
# Alternative: losses.MultipleNegativesRankingLoss(model=student)

# ===== 7. TRAIN =====
student.fit(
    train_objectives=[(dataloader, loss)],
    epochs=EPOCHS,
    warmup_steps=WARMUP_STEPS,
    evaluator=evaluator,
    optimizer_params={"lr": 1e-7},  # far below the library default of 2e-5
    evaluation_steps=100,
    output_path=OUTPUT_PATH
)

# ===== 8. EVALUATE AFTER TRAINING =====
print("Evaluating after training:")
stats_after = evaluator(student)
df = pd.DataFrame([
    {"Phase": "Before Training", **stats_before},
    {"Phase": "After Training", **stats_after}
])
print(df.to_markdown(index=False))

No sentence-transformers model found with name rasyosef/SPLADE-BERT-Mini. Creating a new one with mean pooling.
Some weights of BertModel were not initialized from the model checkpoint at rasyosef/SPLADE-BERT-Mini and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluating before training:
| Phase           |   msmarco-dev_cosine_accuracy@1 |   msmarco-dev_cosine_accuracy@3 |   msmarco-dev_cosine_accuracy@5 |   msmarco-dev_cosine_accuracy@10 |   msmarco-dev_cosine_precision@1 |   msmarco-dev_cosine_precision@3 |   msmarco-dev_cosine_precision@5 |   msmarco-dev_cosine_precision@10 |   msmarco-dev_cosine_recall@1 |   msmarco-dev_cosine_recall@3 |   msmarco-dev_cosine_recall@5 |   msmarco-dev_cosine_recall@10 |   msmarco-dev_cosine_ndcg@10 |   msmarco-dev_cosine_mrr@10 |   msmarco-dev_cosine_map@100 |
|:----------------|--------------------------------:|--------------------------------:|--------------------------------:|---------------------------------:|---------------------------------:|---------------------------------:|---------------------------------:|----------------------------------:|------------------------------:|------------------------------:|------------------------------:|-------------------------------:|---------------------------

                                                                     

Step,Training Loss,Validation Loss,Msmarco-dev Cosine Accuracy@1,Msmarco-dev Cosine Accuracy@3,Msmarco-dev Cosine Accuracy@5,Msmarco-dev Cosine Accuracy@10,Msmarco-dev Cosine Precision@1,Msmarco-dev Cosine Precision@3,Msmarco-dev Cosine Precision@5,Msmarco-dev Cosine Precision@10,Msmarco-dev Cosine Recall@1,Msmarco-dev Cosine Recall@3,Msmarco-dev Cosine Recall@5,Msmarco-dev Cosine Recall@10,Msmarco-dev Cosine Ndcg@10,Msmarco-dev Cosine Mrr@10,Msmarco-dev Cosine Map@100
100,No log,No log,0.195122,0.385573,0.496108,0.618059,0.195122,0.132157,0.104307,0.066321,0.183965,0.367324,0.477815,0.603875,0.384204,0.320314,0.322239
200,No log,No log,0.194603,0.382979,0.49507,0.614427,0.194603,0.131292,0.103996,0.065957,0.183446,0.364729,0.476258,0.600242,0.382369,0.319015,0.321123
300,No log,No log,0.192527,0.380903,0.493513,0.612351,0.192527,0.130773,0.103581,0.06575,0.181629,0.362913,0.474702,0.598166,0.38084,0.317475,0.319849
400,No log,No log,0.192527,0.379865,0.494032,0.612351,0.192527,0.130427,0.103684,0.065698,0.18137,0.362394,0.475221,0.597907,0.38048,0.317226,0.319391
500,5.328400,No log,0.192527,0.380384,0.492994,0.608718,0.192527,0.1306,0.103373,0.065335,0.18137,0.362913,0.473923,0.594274,0.379151,0.316635,0.318906
600,5.328400,No log,0.191489,0.380903,0.4904,0.608718,0.191489,0.130773,0.102854,0.065387,0.180332,0.363172,0.471328,0.594534,0.378735,0.316008,0.318085
700,5.328400,No log,0.190451,0.379865,0.487805,0.605605,0.190451,0.130427,0.102231,0.065023,0.179294,0.362135,0.468474,0.59168,0.377124,0.314793,0.317012
800,5.328400,No log,0.188895,0.380903,0.485729,0.603529,0.188895,0.130773,0.101713,0.064764,0.177737,0.363172,0.466269,0.589344,0.37559,0.313484,0.315781
900,5.328400,No log,0.187857,0.378827,0.485729,0.602491,0.187857,0.130081,0.101713,0.06466,0.176959,0.361616,0.466269,0.588566,0.374418,0.312092,0.314511
1000,5.316000,No log,0.186819,0.376232,0.483134,0.600415,0.186819,0.129043,0.101194,0.064453,0.175921,0.358761,0.463934,0.586836,0.372877,0.310634,0.313107
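
To make the training objective above more concrete, here is a small NumPy sketch of what a triplet loss computes. It assumes Euclidean distance and a margin of 5, which to my understanding are the defaults of `losses.TripletLoss`; treat both as assumptions and check them against the installed library version. The function `triplet_loss` is illustrative and is not the library's implementation.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Mean of max(d(anchor, positive) - d(anchor, negative) + margin, 0) over a batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()


# Toy 2-D "embeddings" for a single query (anchor).
query = np.array([[0.0, 0.0]])
relevant = np.array([[3.0, 4.0]])        # distance 5 from the query
far_negative = np.array([[6.0, 8.0]])    # distance 10 from the query
near_negative = np.array([[0.0, 1.0]])   # distance 1 from the query

print(triplet_loss(query, relevant, far_negative))   # 0.0: negative is already a full margin farther away
print(triplet_loss(query, relevant, near_negative))  # 9.0: negative is closer than the positive
```

Under these assumptions, a loss close to the margin means the positive and negative passages sit at roughly the same distance from the query, which would be consistent with the logged training loss of about 5.3 barely moving.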


# 3. Search Result Presentation
Once you have a result set, we want to return it to the searcher in two ways:
- a) In an interactive user interface. For this user interface, please think of **at least one innovation** that goes beyond the traditional 10-blue-links interface that most commercial search engines employ.
- b) As a text file used for batch performance evaluation. The text file should contain one ranked result per line, listing the query number, rank position, document URL, and relevance score as tab-separated entries.

An excerpt of such a text file looks like this:

```
1   1   https://www.tuebingen.de/en/3521.html   0.725
1   2   https://www.komoot.com/guide/355570/castles-in-tuebingen-district   0.671
1   3   https://www.unimuseum.uni-tuebingen.de/en/museum-at-hohentuebingen-castle   0.529
...
1   100 https://www.tuebingen.de/en/3536.html   0.178
2   1   https://www.tuebingen.de/en/3773.html   0.956
2   2   https://www.tuebingen.de/en/4456.html   0.797
...
```

In [None]:
# TODO: Implement an interactive user interface for part a of this exercise.

# Produce a text file with 100 results per query in the format specified above.
def batch(results):
    # TODO: Implement me.
    pass
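
As a starting point for the stub above, the result file could be written along the following lines. This is a sketch under assumptions: the function name `write_batch_results`, the `path` and `top_k` parameters, and the shape of `results` (a dict mapping each query number to a list of `(url, score)` pairs) are all hypothetical, since the assignment does not prescribe them.

```python
def write_batch_results(results, path, top_k=100):
    """Write one tab-separated line per result: query number, rank, URL, score.

    `results` is assumed to map query number -> list of (url, score) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        for query_number in sorted(results):
            # Rank by descending relevance score and keep the top_k entries.
            ranked = sorted(results[query_number], key=lambda pair: pair[1], reverse=True)
            for rank, (url, score) in enumerate(ranked[:top_k], start=1):
                f.write(f"{query_number}\t{rank}\t{url}\t{score:.3f}\n")
```

Scores are rounded to three decimals to match the example file; sorting inside the writer keeps the rank column consistent with the scores even if the caller passes unsorted results.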

# 4. Performance Evaluation 
We will evaluate the performance of our search systems on the basis of five queries. Two of them are available to you now for engineering purposes:
- `tübingen attractions`
- `food and drinks`

The remaining three queries will be given to you during our final session on July 22nd. Please be prepared to run your systems and produce a single result file for all five queries live in class. That means you should aim for processing times of no more than ~1 minute per query. We will ask you to send that file to carsten.eickhoff@uni-tuebingen.de.

# Grading
Your final projects will be graded along the following criteria:
- 25% Code correctness and quality (to be delivered on this sheet)
- 25% Report (4 pages, PDF, explanation and justification of your design choices)
- 25% System performance (based on how well your system performs on the 5 queries relative to the other teams in terms of nDCG)
- 15% Creativity and innovativeness of your approach (in particular with respect to your search system #2 and user interface #3 innovations)
- 10% Presentation quality and clarity
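
Since system performance is measured in nDCG, a small reference implementation can help when checking your own rankings. The sketch below uses the common linear-gain formulation with a log2 rank discount; how relevance labels are assigned and which nDCG variant and cutoff are used for grading are not specified here, so treat those as assumptions. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of one ranked list, top result first (1 = relevant, 0 = not relevant).
print(ndcg_at_k([1, 0, 0], k=3))  # 1.0: the only relevant document is ranked first
print(ndcg_at_k([0, 1], k=2))     # about 0.63: the relevant document is ranked second
```

Note that this version derives the ideal ranking only from the labels in the list it is given. A full evaluation would compute the ideal DCG from all relevant documents known for the query, including those the system failed to retrieve.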

# Permissible libraries
You can use any general-purpose ML and NLP libraries such as scipy, numpy, scikit-learn, spacy, nltk, but please stay away from dedicated web crawling or search engine toolkits such as scrapy, whoosh, lucene, terrier, galago and the like. Pretrained models are fine to use as part of your system, as long as they have not been built/trained for retrieval.
