# INFO 4271 - Exercise 1 - Web Crawling

Issued: April 16, 2024

Due: April 22, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Duplicate Detection
When crawling large numbers of Web pages, we are likely to encounter a considerable number of duplicate documents. To avoid flooding our index with replicas of the same documents, we need a duplicate detection scheme.

a) Using Python's built-in hash() function, process the following documents in order of appearance and flag up any exact duplicates.

- **D1** "This is just some document"
- **D2** "This is another piece of text"
- **D3** "This is another piece of text"
- **D4** "This is just some documents"
- **D5** "Totally different stuff"

In [1]:
#Check a single document against an existing collection of previously seen documents for exact duplicates.
def check_exct(doc, docs):
    #Collect the hashes of all stored document texts (doc = [id, text])
    seen_hashes = {hash(d[1]) for d in docs}
    return hash(doc[1]) in seen_hashes #True if the document is an exact duplicate
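
The check above rehashes every stored document on each call. As a minimal sketch of an alternative, a set of digests can be kept alongside the crawl so that each lookup takes constant time on average. The names `fingerprint` and `find_exact_duplicates` are hypothetical, and `hashlib.sha256` is swapped in for `hash()` here because Python salts string hashes per process by default, so `hash()` values are not comparable across runs.

```python
import hashlib

def fingerprint(text: str) -> str:
    # Stable content digest; unlike hash(), identical across interpreter runs.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_exact_duplicates(crawl):
    seen = set()       # digests of the documents kept so far
    duplicates = []    # ids of documents whose text was seen before
    for doc_id, text in crawl:
        digest = fingerprint(text)
        if digest in seen:
            duplicates.append(doc_id)
        else:
            seen.add(digest)
    return duplicates

sample = [
    ("D1", "This is just some document"),
    ("D2", "This is another piece of text"),
    ("D3", "This is another piece of text"),
    ("D4", "This is just some documents"),
    ("D5", "Totally different stuff"),
]
print(find_exact_duplicates(sample))  # ['D3']
```

Only D3 is flagged, since D4 differs from D1 by one character and therefore produces a different digest.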

b) Going beyond exact duplicates, we want to also identify any near-duplicates that are very similar but not identical to previously seen content. Implement the SimHash method discussed in class and again process the five documents, this time flagging up exact and near duplicates.

In [2]:
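def create_simhash(words: list[str], bits: int = 64) -> int:
    #Note: hash() of str is salted per process by default, so fingerprints are only comparable within one run
    weights = {w: words.count(w) for w in set(words)} #weight = word frequency
    counts = [0] * bits
    for word, weight in weights.items():
        #Mask to a fixed width so that negative hashes become non-negative and all bit positions align
        h = hash(word) & ((1 << bits) - 1)
        for i in range(bits):
            #Add the weight where bit i is set, subtract it otherwise
            counts[i] += weight if (h >> i) & 1 else -weight
    #Bit i of the fingerprint is 1 if its weighted sum is positive
    return sum(1 << i for i, c in enumerate(counts) if c > 0)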

#Check a single document against an existing collection of previously seen documents for near duplicates
def check_simhash(doc, docs, max_distance=10):
    doc_simhash = create_simhash(doc[1].split(" "))
    for _, compare_doc in docs:
        compare_simhash = create_simhash(compare_doc.split(" "))
        #Hamming distance: number of bit positions in which the two fingerprints differ
        if bin(doc_simhash ^ compare_simhash).count("1") < max_distance:
            return True #Exact or near duplicate found
    return False
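
To make the fingerprint comparison reproducible across runs, the following sketch replaces `hash()` with a digest-based word hash. This is an illustration under stated assumptions, not the method from class: `stable_hash`, `simhash`, `hamming`, and the 64-bit width are hypothetical choices, and words are weighted by their frequency as in the cell above.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (an assumption; any fixed width works)

def stable_hash(word: str) -> int:
    # First 8 bytes of SHA-256 as an unsigned 64-bit integer; same value on every run.
    return int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest()[:8], "big")

def simhash(text: str) -> int:
    counts = [0] * BITS
    for word, weight in Counter(text.split()).items():
        h = stable_hash(word)
        for i in range(BITS):
            # Add the weight where bit i of the word hash is set, subtract it otherwise.
            counts[i] += weight if (h >> i) & 1 else -weight
    # Bit i of the fingerprint is 1 if its weighted sum is positive.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming(a: int, b: int) -> int:
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")

d1 = simhash("This is just some document")
d3 = simhash("This is another piece of text")
d4 = simhash("This is just some documents")
d5 = simhash("Totally different stuff")
print(hamming(d3, simhash("This is another piece of text")))  # 0 for identical texts
print(hamming(d1, d4), hamming(d1, d5))  # distances to compare against a threshold
```

Documents that share most of their words tend to agree in most fingerprint bits, so a small Hamming distance suggests a near duplicate. The cut-off is a tuning parameter that trades missed duplicates against false alarms.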

In [3]:
crawl = [['D1', 'This is just some document'], ['D2', 'This is another piece of text'], ['D3', 'This is another piece of text'], ['D4', 'This is just some documents'], ['D5', 'Totally different stuff']]

#Process raw crawled website content
def process(crawl):
    docs = []
    for doc in crawl:
        if check_simhash(doc, docs): #Can be exchanged for check_exct()
            print('DUPLICATE: '+doc[0])
        else:
            docs.append(doc)

process(crawl)

DUPLICATE: D3
DUPLICATE: D4


# 2. Focused Search Engines
Suppose you were to build a COVID-19 Web search engine for which you want to collect and eventually serve only COVID-19 information. The general web crawling process follows this scheme:

1. Create a seed set of known URLs (a.k.a. the frontier)
2. Pull a URL from the frontier and visit it
3. Save the page content for our search engine (indexing)
4. Once on the page, note down all URLs linked there
5. Put all encountered URLs in the queue
6. Repeat from Step 2 until the queue is empty
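
The six steps above can be sketched as a short loop. To keep the example runnable without network access, it crawls a hypothetical in-memory link graph (`TOY_WEB`), and `fetch` stands in for an HTTP request plus link extraction. The `seen` set is an addition not listed in the steps; without it, pages that link to each other would be queued forever.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page content, outgoing links).
TOY_WEB = {
    "a.example": ("page A", ["b.example", "c.example"]),
    "b.example": ("page B", ["a.example", "c.example"]),
    "c.example": ("page C", []),
}

def fetch(url):
    # Stand-in for downloading a page and extracting its links.
    return TOY_WEB.get(url, ("", []))

def crawl_all(seeds):
    frontier = deque(seeds)       # step 1: seed the frontier
    seen = set(seeds)             # URLs already queued, to avoid revisits
    index = {}
    while frontier:               # step 6: repeat until the queue is empty
        url = frontier.popleft()  # step 2: pull a URL and visit it
        content, links = fetch(url)
        index[url] = content      # step 3: save the page content
        for link in links:        # step 4: note all linked URLs
            if link not in seen:  # step 5: enqueue URLs not queued before
                seen.add(link)
                frontier.append(link)
    return index

print(list(crawl_all(["a.example"])))  # ['a.example', 'b.example', 'c.example']
```

A first-in, first-out queue gives breadth-first order, which treats every link as equally important.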

In this particular setting, how should the generic step-by-step crawling process be modified/extended? Discuss all relevant considerations:

- Seeding the frontier with medical and news pages should capture the most important information. Excluding unrelated pages entirely may be possible in principle but is very difficult, since almost any page could contain information on COVID-19 in some form.
- It would be best to use a priority queue for the frontier. The following signals could be used for ranking:
  - Rank links by the context in which they appear: links from a page with medical context, such as a post about COVID-19, rank higher than links with no such context. This also helps to avoid unhelpful links (a post about gaming will rarely link to information about COVID-19).
  - Give higher priority to links containing keywords like 'covid', 'pandemic', ... or to all links that appear in sentences with these keywords.
  - Give low priority to all web pages that were last updated before 2020. (A website that covers pandemics in general could still hold useful information about COVID-19, so filtering such pages out completely would not be good.)
  - Give medical sites such as hospitals and clinics higher priority. A list of hospitals or medical organizations would be helpful here.
- To avoid unnecessary crawling, pages could be scored on their relevance to COVID-19, and the URLs found on pages scoring below a certain threshold would not be added to the frontier. However, this could cause the crawler to miss interesting pages that are only reachable through irrelevant ones.
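
The ranking and threshold ideas above could be combined roughly as follows. This is a sketch under assumptions rather than a tuned design: the keyword list, the trusted host, the score weights, the `MIN_SCORE` cut-off, and the names `score_link` and `PriorityFrontier` are all hypothetical, and the example URLs are placeholders.

```python
import heapq

KEYWORDS = ("covid", "pandemic", "coronavirus")  # illustrative keyword list
TRUSTED_HOSTS = {"hospital.example"}             # placeholder for a curated list
MIN_SCORE = 0.5                                  # illustrative relevance cut-off

def score_link(url: str, anchor_text: str, last_updated_year: int) -> float:
    text = (url + " " + anchor_text).lower()
    score = float(sum(kw in text for kw in KEYWORDS))  # one point per keyword present
    if any(host in url for host in TRUSTED_HOSTS):
        score += 2.0   # boost medical sources
    if last_updated_year < 2020:
        score -= 0.5   # demote older pages without excluding them outright
    return score

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, anchor_text, last_updated_year):
        score = score_link(url, anchor_text, last_updated_year)
        if url in self._seen or score < MIN_SCORE:
            return False  # already queued, or below the relevance threshold
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best link first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)

frontier = PriorityFrontier()
frontier.push("https://news.example/covid-update", "latest pandemic figures", 2021)
frontier.push("https://hospital.example/covid", "vaccination info", 2022)
frontier.push("https://games.example/review", "new console review", 2023)
frontier.push("https://health.example/pandemic-history", "past outbreaks", 2015)
while frontier:
    print(frontier.pop())
```

With these weights the hospital page is visited first, then the news page, then the older pandemic page, while the gaming link never enters the frontier. Raising `MIN_SCORE` saves crawling effort at the cost of missing relevant pages that are only reachable through low-scoring links.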