# INFO 4271 - Group Project

Issued: June 11, 2024

Due: July 22, 2024

Please submit a link to your code base (ideally on a branch that no longer changes after the submission deadline) and your 4-page report via email to carsten.eickhoff@uni-tuebingen.de by the due date. One submission per team.

---

# 1. Web Crawling & Indexing
Crawl the web to discover **English content related to Tübingen**. The crawled content should be stored locally. If interrupted, your crawler should be able to re-start and pick up the crawling process at any time.

In [1]:
#Naive webcrawler approach

In [1]:
import sqlite3

#URL: Unique identifier for each page.
#Title: Useful for relevance and user interface display.
#Content: For search relevance and query matching.
#Outgoing Links: For PageRank calculation.
#Timestamp: For recency of content.
def setup_database(db_name="crawler.db"):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    cursor.execute('''
    CREATE TABLE IF NOT EXISTS frontier (
        url TEXT PRIMARY KEY,
        crawled INTEGER DEFAULT 0
    )''')
    cursor.execute('''
    CREATE TABLE IF NOT EXISTS documents (
        url TEXT PRIMARY KEY,
        title TEXT,
        content TEXT,
        outgoing_links TEXT,
        timestamp TEXT
    )''')
    conn.commit()
    conn.close()




In [15]:
#Add a document to the index. You need (at least) two parameters:
	#doc: The document to be indexed.
	#index_path: The location of the local index (SQLite file) storing the discovered documents.
def index_doc(doc, index_path):
    conn = sqlite3.connect(index_path)
    cursor = conn.cursor()
    cursor.execute('''
    INSERT OR IGNORE INTO documents (url, title, content, outgoing_links, timestamp)
    VALUES (?, ?, ?, ?, ?)
    ''', (doc['url'], doc['title'], doc['content'], ','.join(doc['outgoing_links']), doc['timestamp']))
    conn.commit()
    conn.close()




#Crawl the web. You need (at least) two parameters:
	#frontier: The frontier of known URLs to crawl. You will initially populate this with your seed set of URLs and later maintain all discovered (but not yet crawled) URLs here.
	#index: The location of the local index storing the discovered documents. 
#def crawl(frontier, index):
    #TODO: Implement me
	#pass
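`index_doc` stores `outgoing_links` as a comma-joined string, which cannot be split back reliably when a URL itself contains a comma. A minimal sketch of one alternative, assuming the column stays `TEXT`, is to serialize the list as JSON. The helper names `encode_links` and `decode_links` are hypothetical and not part of the original notebook.

```python
import json

def encode_links(links):
    """Serialize a list of URLs into a string that is safe to store in a TEXT column."""
    return json.dumps(list(links))

def decode_links(stored):
    """Restore the list of URLs; an empty or NULL column yields an empty list."""
    return json.loads(stored) if stored else []

# Round trip with a URL that contains a comma
links = ["https://example.org/map?bbox=9.0,48.5", "https://www.tuebingen.de/en/"]
restored = decode_links(encode_links(links))
```

With this encoding, a document without links is stored as `[]` instead of an empty string, so no empty URL is produced when the links are read back.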

In [None]:
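from selenium import webdriver
# Exceptions caught explicitly in crawl_page
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
)
from selenium.webdriver.common.by import By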
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
import datetime
from langdetect import detect, LangDetectException
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def crawl_page(url, index_path):
    options = Options()
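    # The `headless` attribute was removed in recent Selenium 4 releases; pass the Chrome flag instead
    options.add_argument("--headless=new")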
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

    try:
        driver.get(url)
        time.sleep(3)  # Wait for the page to load

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        title = soup.title.string if soup.title else ""
        content = soup.get_text(separator=' ', strip=True)

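        # Cheap pre-filter: skip pages that contain none of these common English words
        # (compare whole words, since a substring test would match 'is' inside 'this')
        words = set(content.lower().split())
        if not words & {'the', 'and', 'is', 'in'}:
            return None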

        # Language detection
        try:
            if detect(content) != 'en':
                return None
        except LangDetectException:
            return None

        # Keep only links whose URL mentions "tuebingen"
        hrefs = (link.get_attribute("href") for link in driver.find_elements(By.TAG_NAME, "a"))
        links = [href for href in hrefs if href and "tuebingen" in href.lower()]
        doc = {
            'url': url,
            'title': title,
            'content': content,
            'outgoing_links': links,
            'timestamp': datetime.datetime.now().isoformat()
        }

    except (StaleElementReferenceException, TimeoutException, NoSuchElementException) as e:
        print(f"Exception {type(e).__name__} encountered at {url}: {e}")
        return None
    except Exception as e:
        print(f"Unexpected exception encountered at {url}: {e}")
        return None
    finally:
        driver.quit()

    return doc

def crawl(frontier, index_path):
    conn = sqlite3.connect(index_path)
    cursor = conn.cursor()

    # Calculate total number of URLs to be crawled
    cursor.execute("SELECT COUNT(*) FROM frontier WHERE crawled = 0")
    total_to_crawl = cursor.fetchone()[0]

    with tqdm(total=total_to_crawl, desc="Crawling Progress", unit="page") as pbar:
        while True:
            cursor.execute("SELECT url FROM frontier WHERE crawled = 0 LIMIT 10")
            rows = cursor.fetchall()
            if not rows:
                break

            urls = [row[0] for row in rows]
            with ThreadPoolExecutor(max_workers=10) as executor:
                results = executor.map(lambda url: crawl_page(url, index_path), urls)

            for url, doc in zip(urls, results):
                if doc:
                    # INSERT OR IGNORE in index_doc already skips URLs that are indexed
                    index_doc(doc, index_path)
                    for link in doc['outgoing_links']:
                        cursor.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (link,))
                # Mark the URL as crawled even if the page was skipped, so a restart does not retry it
                cursor.execute("UPDATE frontier SET crawled = 1 WHERE url = ?", (url,))
                conn.commit()
                pbar.update(1)
    
    conn.close()

def initialize_frontier(initial_urls, db_name="crawler.db"):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    for url in initial_urls:
        cursor.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    conn.commit()
    conn.close()

def calculate_incoming_links(db_name="crawler.db"):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Create a (persistent) table to store incoming link counts
    cursor.execute('''
    CREATE TABLE IF NOT EXISTS incoming_links (
        url TEXT PRIMARY KEY,
        incoming_count INTEGER DEFAULT 0
    )''')

    # Clear any existing data
    cursor.execute("DELETE FROM incoming_links")

    # Retrieve all documents and their outgoing links
    cursor.execute("SELECT url, outgoing_links FROM documents")
    rows = cursor.fetchall()

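    for row in rows:
        url, outgoing_links = row
        # Drop empty entries: ''.split(',') yields [''] for documents without links
        outgoing_links_list = [link for link in outgoing_links.split(',') if link]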

        for link in outgoing_links_list:
            cursor.execute('''
            INSERT INTO incoming_links (url, incoming_count)
            VALUES (?, 1)
            ON CONFLICT(url) DO UPDATE SET incoming_count = incoming_count + 1
            ''', (link,))

    conn.commit()
    conn.close()

# Initialize and run the crawler
initial_urls = [
    "https://www.tuebingen.de/en/",
    "https://en.wikipedia.org/wiki/T%C3%BCbingen",
    "https://www.uni-tuebingen.de/en.html"
]

setup_database()
initialize_frontier(initial_urls)
crawl("crawler.db", "crawler.db")
calculate_incoming_links()


In [2]:
def print_first_titles(db_name="crawler.db", limit=20):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    cursor.execute("SELECT title FROM documents LIMIT ?", (limit,))
    rows = cursor.fetchall()
    for row in rows:
        print(row[0])

    conn.close()

# Print the titles of the first 20 indexed documents
print_first_titles()


Welcome to Tübingen - City of Tuebingen
Tübingen - Wikipedia
Welcome to Tübingen - City of Tuebingen
Immigration office - City of Tuebingen
European Elections 2024 - City of Tuebingen
Portrait of the City - City of Tuebingen
City and Guests - City of Tuebingen
Culture and Leisure - City of Tuebingen
Imprint - City of Tuebingen
Welcome to Tübingen - City of Tuebingen
File:Altstadt-tuebingen-1.jpg - Wikipedia
File:Wappen Tuebingen.svg - Wikipedia
File:TuebingenNeckar.jpg - Wikipedia
File:TuebingenNeckarfront3.jpg - Wikipedia
File:TuebingenStiftskirche.jpg - Wikipedia
uni-tuebingen.de
Tübingen climate: Weather Tübingen & temperature by month 
Sister Cities - Universitätsstadt Tübingen
Immigration office - City of Tuebingen
European Elections 2024 - City of Tuebingen
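
Several titles in the listing above appear more than once. One possible cause (an assumption, since distinct pages can also share a title) is that the same page was reached under URL variants that differ only in a fragment or in letter case, which the `url` primary key treats as different entries. A minimal sketch of a normalization step that could be applied before inserting into the frontier follows; `normalize_url` is a hypothetical helper and uses only the standard library.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Return a canonical form of `url` for frontier de-duplication."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()   # scheme is case-insensitive
    netloc = parts.netloc.lower()   # host name is case-insensitive
    path = parts.path or "/"        # treat an empty path as the root
    # Drop the fragment: it addresses a position inside the page, not a different page
    return urlunsplit((scheme, netloc, path, parts.query, ""))

canonical = normalize_url("HTTPS://www.Tuebingen.de/en/#content")
```

The query string is kept unchanged because it can select different content on some sites.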


# 2. Query Processing 
Process a textual query and return the 100 most relevant documents from your index. Please incorporate **at least one retrieval model innovation** that goes beyond BM25 or TF-IDF. Please allow for queries to be entered either individually in an interactive user interface (see also #3 below), or via a batch file containing multiple queries at once. The batch file will be formatted to have one query per line, listing the query number, and query text as tab-separated entries. An example of the batch file for the first two queries looks like this:

```
1   tübingen attractions
2   food and drinks
```
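
The batch format described above can be read with a few lines of standard-library code. This is a sketch under the stated format (query number, a tab, query text); the function name `read_queries` is an assumption and not part of the assignment template.

```python
def read_queries(path):
    """Read a batch file with one '<query number><TAB><query text>' entry per line."""
    queries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            number, separator, text = line.partition("\t")
            if not separator or not text.strip():
                continue  # skip blank or malformed lines
            queries.append((int(number), text.strip()))
    return queries
```

Calling `read_queries` on a file containing the two example lines would return `[(1, "tübingen attractions"), (2, "food and drinks")]`.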

In [3]:
%pip install -U sentence-transformers rank_bm25



In [4]:
import sqlite3
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

# Initialize the sentence transformers models
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256  # Truncate long passages to 256 tokens
top_k = 32  # Number of passages we want to retrieve with the bi-encoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Function to read data from SQLite database
def fetch_documents_from_db(db_name="crawler.db", limit=top_k):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    
    # Fetch limited number of documents
    cursor.execute("SELECT content FROM documents LIMIT ?", (limit,))
    rows = cursor.fetchall()
    
    passages = [row[0] for row in rows]
    
    conn.close()
    return passages

# Load the first `top_k` documents from the database as the search corpus
# (note: this caps the corpus at 32 documents; it does not rank them)
passages = fetch_documents_from_db(limit=top_k)

print("Number of passages:", len(passages))

# Encode all passages into the vector space
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)



Number of passages: 32


Batches: 100%|██████████| 1/1 [00:01<00:00,  1.04s/it]


In [5]:
from rank_bm25 import BM25Okapi
# Use the public location of the stop word list instead of the private `_stop_words` module
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm.autonotebook import tqdm
import numpy as np

# Tokenizer for BM25 as baseline: lowercase, strip punctuation, drop stop words
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)
        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

# Tokenize the corpus for BM25
tokenized_corpus = [bm25_tokenizer(passage) for passage in tqdm(passages)]
bm25 = BM25Okapi(tokenized_corpus)

100%|██████████| 32/32 [00:00<00:00, 7979.18it/s]


In [13]:
import torch

def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    # Indices of the (up to) 5 best BM25 scores; also works for corpora with fewer than 5 passages
    top_n = np.argsort(bm25_scores)[::-1][:5]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    
    # Check if CUDA is available
    if torch.cuda.is_available():
        question_embedding = question_embedding.cuda()
    
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit (sorting happens further below)
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))


In [10]:
def print_passages(passages):
    print("Passages:\n")
    for i, passage in enumerate(passages, start=1):
        print(f"Passage {i}:")
        print(f"{passage}\n")
        print("-" * 80)

# Example usage
print_passages(passages)


Passages:

Passage 1:
Welcome to Tübingen - City of Tuebingen Immigration office European Elections 2024 Portrait of the City City and Guests Culture and Leisure Bild: Universitätsstadt Tübingen Show search input Hide search input City Map Deutsch Welcome to Tübingen June 19th, 2024 Small steps, narrow alleys and pointed gables shape the silhouette of old Tübingen on the way up to its castle. The Swabian university town of about 91,000 inhabitants and 28,000 students combines the flair of a lovingly restored medieval centre with the colorful bustle and typical atmosphere of a young and cosmopolitan students' town. Numerous sidewalk cafes, wine taverns and cozy students' pubs, special shops, restaurants and taverns invite visitors to stroll around and to pause here and there. Taking a boat trip in a famous “Stocherkahn“ – the boat exclusive to Tübingen navigated by a long wooden pole – offers a scenic view of the picturesque Neckar waterfront with the famous Hölderlin Tower. Come, have 

In [15]:
search(query = "Tuebingen attractions")

Input question: Tuebingen attractions
Top-3 lexical search (BM25) hits
	1.260	City and Guests - City of Tuebingen Immigration office European Elections 2024 Portrait of the City City and Guests How to get there Accommodation Gastronomy Shopping Trips around Tübingen Culture and Leisure Bild: Alexander Gonschior Show search input Hide search input City Map Deutsch Welcome to Tübingen City and Guests City and Guests Of the roughly 100,000 annual visitors, about every fifth guest is visiting from abroad. It is quite likely that you will be able to get to know Tuebingen in your native language: City tours are offered in many languages, including Norwegian, Mandarin, and even Latin. There is a variety of tours ranging from classical trips through the old town to rather specialized ones that lead through museums or a distillery for Swabian whisky. Just visit www.tuebingen-info.de and choose your tour and language. Sightseeing flight Tübingen's most beautiful buildings: Enjoy the bird's eye v

In [None]:
#Retrieve documents relevant to a query. You need (at least) two parameters:
	#query: The user's search query
	#index: The location of the local index storing the discovered documents.
def retrieve(query, index):
    #TODO: Implement me
    pass
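
One way the lexical (BM25) and semantic (bi-encoder) rankings computed in `search` could be merged into a single ranked list for `retrieve` is reciprocal rank fusion (RRF). The sketch below is an illustrative option, not the method prescribed by the assignment. The function name and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list of (doc_id, score).

    Each document receives 1 / (k + rank) from every ranking it appears in,
    where rank starts at 1; documents are returned with the highest score first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy example: document ids ranked by two different retrieval models
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF only needs rank positions, so BM25 scores and cosine similarities do not have to be brought onto a common scale before combining them.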

# 3. Search Result Presentation
Once you have a result set, we want to return it to the searcher in two ways: a) in an interactive user interface. For this user interface, please think of **at least one innovation** that goes beyond the traditional 10-blue-links interface that most commercial search engines employ. b) as a text file used for batch performance evaluation. The text file should be formatted to produce one ranked result per line, listing the query number, rank position, document URL and relevance score as tab-separated entries. An example of the first three lines of such a text file looks like this:

```
1   1   https://www.tuebingen.de/en/3521.html   0.725
1   2   https://www.komoot.com/guide/355570/castles-in-tuebingen-district   0.671
1   3   https://www.unimuseum.uni-tuebingen.de/en/museum-at-hohentuebingen-castle   0.529
...
1   100 https://www.tuebingen.de/en/3536.html   0.178
2   1   https://www.tuebingen.de/en/3773.html   0.956
2   2   https://www.tuebingen.de/en/4456.html   0.797
...
```
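
A minimal sketch of how a result file in this format could be produced. It assumes the results are held in a dictionary that maps each query number to a list of `(url, score)` pairs sorted best first. The function name `write_results` and that data layout are assumptions, not part of the template.

```python
def write_results(results, path, max_results=100):
    """Write one '<query number>\t<rank>\t<url>\t<score>' line per ranked result."""
    with open(path, "w", encoding="utf-8") as handle:
        for query_number in sorted(results):
            ranked = results[query_number][:max_results]
            for rank, (url, score) in enumerate(ranked, start=1):
                handle.write(f"{query_number}\t{rank}\t{url}\t{score:.3f}\n")
```

Scores are written with three decimals to match the example above.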

In [None]:
#TODO: Implement an interactive user interface for part a of this exercise.

#Produce a text file with 100 results per query in the format specified above.
def batch(results):
    #TODO: Implement me.    
    pass

# 4. Performance Evaluation 
We will evaluate the performance of our search systems on the basis of five queries. Two of them are available to you now for engineering purposes:
- `tübingen attractions`
- `food and drinks`

The remaining three queries will be given to you during our final session on July 23rd. Please be prepared to run your systems and produce a single result file for all five queries live in class. That means you should aim for processing times of no more than ~1 minute per query. We will then ask you to send that file to carsten.eickhoff@uni-tuebingen.de.
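
To check the ~1 minute budget before the live session, the time per query can be measured with a small wrapper. This is only a sketch: `timed` is a hypothetical helper, and the built-in `sum` stands in for the actual retrieval function.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace `sum` with the retrieval function being measured
result, seconds = timed(sum, [1, 2, 3])
```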

# Grading
Your final projects will be graded along the following criteria:
- 25% Code correctness and quality (to be delivered on this sheet)
- 25% Report (4 pages, PDF, explanation and justification of your design choices)
- 25% System performance (based on how well your system performs on the 5 queries relative to the other teams in terms of nDCG)
- 15% Creativity and innovativeness of your approach (in particular with respect to your search system #2 and user interface #3 innovations)
- 10% Presentation quality and clarity
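
For reference, nDCG can be sketched as follows. The exact variant used for grading (gain function, cutoff, relevance scale) is not specified here, so this version is an assumption: it uses linear gain with a log2 discount, and it computes the ideal ranking only from the judgments of the returned list.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k=100):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])    # already in ideal order
swapped = ndcg([0, 1], k=2)  # the relevant document is ranked second
```

Here `relevances` lists the relevance label of each returned document in rank order, so a ranking that places relevant documents lower receives a smaller value.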

# Permissible libraries
You can use any general-purpose ML and NLP libraries such as scipy, numpy, scikit-learn, spacy, nltk, but please stay away from dedicated web crawling or search engine toolkits such as scrapy, whoosh, lucene, terrier, galago and the like. Pretrained models are fine to use as part of your system, as long as they have not been built/trained for retrieval.
