# Self study 1

Self studies should be solved individually, or in small groups of 2-3 students. There is no hand-in of your solutins to the self studies. However, you can bring your solutions to the exam, and use them as the basis for your answers to the exam questions.

In this self-study we construct a simple crawler. Concretely, you should: 

* Select about 5 seed urls, e.g. homepages of universities, e-commerce sites, or similar

* Start crawling from these seeds. Define a strategy for selecting the next url to be crawled. What kind of prioritization (if any) is embodied in your strategy?

* Make sure you obey the robots.txt file, and make ensure that at least 2 seconds elapse between requests to the same host

* Stop when you have crawled approx. 1000 pages

* For each crawled page, save the url and the text string contained in the 'title' element of the document (we do not want to handle the full text of the pages at this point).

* You can repeat this several times, using different seed sets and/or prioritization strategies.

The following two self studies will extend the work that you do in this self study.

The following introduces a few helpful libraries and essential functions. You can use these methods, or use other tools that you are already familiar with and/or prefer to work with. 

A simple crawler implementation can be based on the 'requests' package [https://requests.readthedocs.io/en/master/](https://requests.readthedocs.io/en/master/) for retrieving html documents, and the BeautifulSoup parser https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the html.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
from datetime import datetime, timedelta
from itertools import count
import random


Let's start crawling at https://www.aau.dk/ . We first retrieve the robots.txt file and check whether we are allowed to crawl the top-level url:

In [2]:
rp=RobotFileParser()
rp.set_url("https://www.aau.dk")
rp.read()
print(rp.can_fetch("*","https://www.aau.dk"))

True


We can now get the html using the requests package, which returns a response object:

In [3]:
r=requests.get('https://www.aau.dk/')
print(type(r))

<class 'requests.models.Response'>


A basic view of the contents is accessible via the content attribute:

For serious parsing, we can use the BeautifulSoup html parser:

In [6]:
r_parse = BeautifulSoup(r.text, 'html.parser')

We can get the title:

In [7]:
print(r_parse.find('title'))
print(r_parse.find('title').string)

<title>AAU - Viden for verden - Aalborg Universitet</title>
AAU - Viden for verden - Aalborg Universitet


Importantly, we can get all the links on the page. The following also illustrates the sleep() function to implement time delays (the following will take a while to complete; use the "interrupt kernel" button to terminate early):

In [8]:
url = 'https://www.aau.dk/uddannelser/optagelse/kandidat/ledige-studiepladser-2022'
parsed = urlparse(url)
print(parsed)
newurl = parsed.scheme + '://' + parsed.netloc
print(newurl)

ParseResult(scheme='https', netloc='www.aau.dk', path='/uddannelser/optagelse/kandidat/ledige-studiepladser-2022', params='', query='', fragment='')
https://www.aau.dk


In [9]:
seeds = ['https://www.aau.dk', 'https://www.dr.dk', 'https://www.tv2.dk', 'https://www.bt.dk', 'https://www.mit.edu']
index_arr = {}
crawled_links = []
frontier = []
frontqueue = {
    'one' : [],
    'two' : [],
    'three' : []
}

back_queue = {}
prio_heap = {}

for url in seeds :
    prio_heap[url] = datetime.now()
    back_queue[url] = []
sleep(2)

def get_base_url(url):
    parsed = urlparse(url)
    baseUrl = parsed.scheme + '://' + parsed.netloc
    return baseUrl
def fill_back_queue():
    arr = []
    if(len(frontqueue['one']) != 0):
        arr = frontqueue['one']
        frontqueue['one'] = []
    elif(len(frontqueue['two']) != 0):
        arr = frontqueue['two']
        frontqueue['two'] = []
    elif(len(frontqueue['three']) != 0):
        arr = frontqueue['three']
        frontqueue['three'] = []

    for url in arr:
        if (get_base_url(url) in back_queue.keys()):
            back_queue[get_base_url(url)].append(url)
            prio_heap[get_base_url(url)] = datetime.now()
        else:
            back_queue[get_base_url(url)] = [url]
            prio_heap[get_base_url(url)] = datetime.now()

def get_url():
    #This should be based on a heap but :shrugeg:
    viable_urls = [key for (key, value) in prio_heap.items() if value <= datetime.now() + timedelta(seconds=2)]

    randomUrl = random.choice(viable_urls)
    url = ""
    if(len(back_queue[randomUrl]) != 0):
        url = back_queue[randomUrl].pop()
    else:
        fill_back_queue()
        return get_url()
    crawled_links.append(randomUrl)
    return url

def fetch(url):
    rp.set_url(get_base_url(url))
    rp.read()
    if (True):#rp.can_fetch("*", url)):
        r=requests.get(url)
        r_parse = BeautifulSoup(r.text, 'html.parser')
        return r_parse
    else:
        return 0

def index(doc, url):
    title = doc.find('title')
    if(title):
        if url not in index_arr.keys():
            print(title.string)
            index_arr[url] = title.string
    else:
        print('no title')

def extract_urls(doc, url):
    href_arr = [] 
    for a in doc.find_all('a', href=True):
        link = a['href']
        if(link.startswith('https://www') and not link.startswith('https://www.google.com')):
            if (link not in frontier and link not in href_arr and link not in crawled_links):
                href_arr.append(link)
        #else:
        #    comb_url = get_base_url(url) + link
        #    if (comb_url not in frontier and comb_url not in href_arr and comb_url not in crawled_links):
        #        href_arr.append(comb_url)
    return href_arr

def add_to_frontier(url_list):
    #To make some checks easier this is added
    for url in url_list:
        frontier.append(url)
        slash_count = url.count('/')
        if (slash_count > 5):
            frontqueue['three'].append(url)
        elif(slash_count > 3):  
            frontqueue['two'].append(url)
        else:
            frontqueue['one'].append(url)

add_to_frontier(seeds)

i = 0
while (len(back_queue) != 0):
    i += 1
    url = get_url()
    print(url)
    doc = fetch(url)
    if (doc):
        index(doc, url)
        add_to_frontier(extract_urls(doc, url))
    if(i > 100):
        break

https://www.mit.edu
MIT - Massachusetts Institute of Technology
https://www.aau.dk
AAU - Viden for verden - Aalborg Universitet
https://www.ansatte.aau.dk
for ansatte
https://www.design.aau.dk/
AAU Designguide
https://www.bt.dk
B.T. Nyheder | Læs nyhederne på bt.dk
https://www.adgangforalle.dk
Adgang for alle, online oplÃ¦sning 
https://www.aau.dk/pressen
For pressen - Aalborg Universitet
https://www.okonomi.aau.dk
Økonomiafdelingen
https://www.aau.dk/kontakt
Kontakt Aalborg Universitet (AAU) - Aalborg Universitet
https://www.en.okonomi.aau.dk/
finance and accounts department
https://www.aau.dk/om-cookies
Aalborg Universitets privatlivspolitik og cookiepolitik - Aalborg Universitet
https://www.youtube.com/aalborguniversitet
Inden du fortsætter til YouTube
https://www.tv2.dk
TV 2 - bedst pÃ¥ breaking og live
https://www.kvalitetssikring.aau.dk
Uddannelseskvalitet på Aalborg Universitet
https://www.instagram.com/aaustudieliv
Aalborg Universitet (@aaustudieliv) • Instagram photos and vide

In [10]:
import nltk
nltk.download('punkt')
from nltk.stem.snowball import SnowballStemmer
import codecs

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
for key in index_arr.keys():
    print(f"{key}: {index_arr[key]}")

https://www.mit.edu: MIT - Massachusetts Institute of Technology
https://www.aau.dk: AAU - Viden for verden - Aalborg Universitet
https://www.ansatte.aau.dk: for ansatte
https://www.design.aau.dk/: AAU Designguide
https://www.bt.dk: B.T. Nyheder | Læs nyhederne på bt.dk
https://www.adgangforalle.dk: Adgang for alle, online oplÃ¦sning 
https://www.aau.dk/pressen: For pressen - Aalborg Universitet
https://www.okonomi.aau.dk: Økonomiafdelingen
https://www.aau.dk/kontakt: Kontakt Aalborg Universitet (AAU) - Aalborg Universitet
https://www.en.okonomi.aau.dk/: finance and accounts department
https://www.aau.dk/om-cookies: Aalborg Universitets privatlivspolitik og cookiepolitik - Aalborg Universitet
https://www.youtube.com/aalborguniversitet: Inden du fortsætter til YouTube
https://www.tv2.dk: TV 2 - bedst pÃ¥ breaking og live
https://www.kvalitetssikring.aau.dk: Uddannelseskvalitet på Aalborg Universitet
https://www.instagram.com/aaustudieliv: Aalborg Universitet (@aaustudieliv) • Instagram 

In [12]:
import nltk
nltk.download('punkt')
from nltk.stem.snowball import SnowballStemmer
import codecs

dstemmer=SnowballStemmer("danish")
a_file = codecs.open("stopord.txt", "r", "utf-8")
file_contents = a_file.read()
contents_split = file_contents.splitlines()
stop_words = []
for word in contents_split:
    stop_words.append(dstemmer.stem(word))
extra_words = ["|", ",", ".", "-", "!"]
for word in extra_words:
    stop_words.append(word)

for word in stop_words:
    print(word)
a_file.close()

ad
af
aldr
alen
all
all
alligevel
alt
altid
and
and
andr
at
bag
bar
beg
bl.a.
bland
blev
bliv
bliv
burd
bør
ca.
da
de
dem
den
den
den
der
dereft
der
derfor
derfra
deri
dermed
derpå
derved
det
det
dig
din
din
dis
dit
dog
du
eft
egen
ej
ell
ell
en
end
endnu
ene
enest
enhv
ens
ent
er
et
f.eks.
far
fem
fik
fir
fler
flest
flest
for
foran
fordi
for
fra
fx
få
får
før
først
gennem
gjord
gjort
god
godt
gør
gør
gør
ham
han
han
har
havd
hav
hej
hel
hel
helt
hen
hend
hend
henov
her
hereft
heri
hermed
herpå
hos
hun
hvad
hvem
hver
hvilk
hvilk
hvilk
hvis
hvor
hvordan
hvoreft
hvorfor
hvorfra
hvorh
hvori
hvorimod
hvornår
hvorved
i
igen
igennem
ikk
imellem
imen
imod
ind
indtil
ing
int
ja
jeg
jer
jer
jo
kan
kom
kom
kom
kun
kun
lad
lang
lav
lav
lav
lidt
lig
ligesom
lil
læng
man
mand
mang
med
meg
mellem
men
men
mer
mest
mig
min
mindr
mindst
min
mit
mod
må
måsk
ned
nej
nem
ni
nog
nogensind
nog
nogl
nok
nu
ny
nyt
når
nær
næst
næst
og
også
okay
om
omkring
op
os
ott
over
overalt
pga.
på
sam
sam
se
sek
selv
sel

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
stemmed_index = {}
for key in index_arr.keys():
    tokens = nltk.word_tokenize(index_arr[key])
    local_tokens = []
    for token in tokens:
        if token not in stop_words:
            local_tokens.append(token)
    stemmed_index[key] = local_tokens

#used to give numbers
numbered_index = {}
i = 1
for url in stemmed_index.keys():
    stemmed_index[url] = {'tokens' : stemmed_index[url], 'id' : i}
    numbered_index[i] = url
    i += 1
for key in numbered_index.keys():
    print(f'{key}: {numbered_index[key]}')

1: https://www.mit.edu
2: https://www.aau.dk
3: https://www.ansatte.aau.dk
4: https://www.design.aau.dk/
5: https://www.bt.dk
6: https://www.adgangforalle.dk
7: https://www.aau.dk/pressen
8: https://www.okonomi.aau.dk
9: https://www.aau.dk/kontakt
10: https://www.en.okonomi.aau.dk/
11: https://www.aau.dk/om-cookies
12: https://www.youtube.com/aalborguniversitet
13: https://www.tv2.dk
14: https://www.kvalitetssikring.aau.dk
15: https://www.instagram.com/aaustudieliv
16: https://www.colourbox.dk/
17: https://www.skyfish.com/
18: https://www.facebook.com/InternationalOfficeAalborgUniversity
19: https://www.handbook.aau.dk/document?contentId=365956
20: https://www.update.aau.dk/
21: https://www.aau.dk/om-websitet
22: https://www.bt.dk/cookiedeklaration
23: https://www.facebook.com/
24: https://www.campusservice.aau.dk/campusomraader-bygninger/
25: https://www.alumni.aau.dk/english/
26: https://www.en.intern.aau.dk/
27: https://www.vacancies.aau.dk/
28: https://www.okonomi.aau.dk/organisati

In [14]:
for key in stemmed_index.keys():
    print(f"{key}: {stemmed_index[key]['tokens']} [{stemmed_index[key]['id']}]")

https://www.mit.edu: ['MIT', 'Massachusetts', 'Institute', 'of', 'Technology'] [1]
https://www.aau.dk: ['AAU', 'Viden', 'verden', 'Aalborg', 'Universitet'] [2]
https://www.ansatte.aau.dk: ['ansatte'] [3]
https://www.design.aau.dk/: ['AAU', 'Designguide'] [4]
https://www.bt.dk: ['B.T', 'Nyheder', 'Læs', 'nyhederne', 'bt.dk'] [5]
https://www.adgangforalle.dk: ['Adgang', 'alle', 'online', 'oplÃ¦sning'] [6]
https://www.aau.dk/pressen: ['For', 'pressen', 'Aalborg', 'Universitet'] [7]
https://www.okonomi.aau.dk: ['Økonomiafdelingen'] [8]
https://www.aau.dk/kontakt: ['Kontakt', 'Aalborg', 'Universitet', '(', 'AAU', ')', 'Aalborg', 'Universitet'] [9]
https://www.en.okonomi.aau.dk/: ['finance', 'accounts', 'department'] [10]
https://www.aau.dk/om-cookies: ['Aalborg', 'Universitets', 'privatlivspolitik', 'cookiepolitik', 'Aalborg', 'Universitet'] [11]
https://www.youtube.com/aalborguniversitet: ['Inden', 'fortsætter', 'YouTube'] [12]
https://www.tv2.dk: ['TV', '2', 'bedst', 'pÃ¥', 'breaking', 'l

In [15]:
simple_inverted_index = {}

for url in stemmed_index.keys():
    for token in stemmed_index[url]['tokens']:
        if token in simple_inverted_index.keys():
            if stemmed_index[url]['id'] not in simple_inverted_index[token]:
                simple_inverted_index[token].append(stemmed_index[url]['id'])
        else:
            simple_inverted_index[token] = [stemmed_index[url]['id']]

for key in simple_inverted_index.keys():
    print(f"{key}: {simple_inverted_index[key]}")
    

MIT: [1]
Massachusetts: [1]
Institute: [1]
of: [1, 88]
Technology: [1]
AAU: [2, 4, 9, 19, 20, 25, 29, 34, 36, 38, 46, 49, 51, 64, 68, 71, 73, 74, 77, 84, 85, 86]
Viden: [2]
verden: [2]
Aalborg: [2, 7, 9, 11, 14, 15, 21, 27, 33, 37, 38, 41, 43, 48, 49, 63, 64, 66, 69, 71, 74, 79, 82, 89, 91]
Universitet: [2, 7, 9, 11, 14, 15, 21, 33, 37, 41, 43, 48, 49, 63, 64, 66, 71, 74, 79, 89, 91]
ansatte: [3]
Designguide: [4]
B.T: [5, 52]
Nyheder: [5, 39, 52, 56]
Læs: [5, 52, 59]
nyhederne: [5, 52]
bt.dk: [5, 52]
Adgang: [6]
alle: [6, 57, 72]
online: [6, 17]
oplÃ¦sning: [6]
For: [7, 26, 41]
pressen: [7]
Økonomiafdelingen: [8, 28, 31]
Kontakt: [9, 62]
(: [9, 15, 33, 48, 75]
): [9, 15, 33, 48, 75]
finance: [10]
accounts: [10]
department: [10]
Universitets: [11]
privatlivspolitik: [11]
cookiepolitik: [11]
Inden: [12, 35]
fortsætter: [12, 35]
YouTube: [12, 35]
TV: [13, 39]
2: [13]
bedst: [13]
pÃ¥: [13]
breaking: [13]
live: [13]
Uddannelseskvalitet: [14, 89]
@: [15, 33, 48]
aaustudieliv: [15, 33]
•: [15

In [17]:
#Single word search

def convert_postings_to_urls(found_postings):
    found_urls = []
    for posting in found_postings:
        found_urls.append(numbered_index[posting])
    return found_urls

def single_word_search(search_word, want_postings = False):
    found_postings = []
    for token in simple_inverted_index.keys():
        if (token == search_word):
            found_postings = simple_inverted_index[token]
    if want_postings:
        return found_postings
    if found_postings:
        return convert_postings_to_urls(found_postings)
    return []
    

for url in single_word_search('DR'):
    print(url)

https://www.dr.dk
https://www.dr.dk/etik-og-rettelser


In [20]:
#OR search

def OR_merge(found_urls_one, found_urls_two):
    return list(set(found_urls_one + found_urls_two))

def OR_search(word_one, word_two):
    found_urls_one = single_word_search(word_one, want_postings=True)
    found_urls_two = single_word_search(word_two, want_postings=True)
    return convert_postings_to_urls(OR_merge(found_urls_one, found_urls_two))

for url in OR_search('DR', 'installation'):
    print(url)

https://www.adgangforalle.dk/default.efact?pid=7953
https://www.dr.dk/etik-og-rettelser
https://www.dr.dk


In [25]:
#AND search

def AND_merge(found_urls_one, found_urls_two):
    return list(set(found_urls_one).intersection(set(found_urls_two)))


def AND_search(word_one, word_two):
    found_urls_one = single_word_search(word_one, want_postings=True)
    found_urls_two = single_word_search(word_two, want_postings=True)
    return convert_postings_to_urls(AND_merge(found_urls_one, found_urls_two))

for url in AND_search('YouTube', 'fortsætter'):
    print(url)

https://www.youtube.com/mit
https://www.youtube.com/aalborguniversitet


In [33]:
#AND_NOT search

def AND_NOT_merge(found_urls_one, found_urls_two):
    a = set(found_urls_one)
    b = set(found_urls_two)    
    return list(a-b)

def AND_NOT_search(word_one, word_two):
    found_urls_one = single_word_search(word_one, want_postings=True)
    found_urls_two = single_word_search(word_two, want_postings=True)
    return convert_postings_to_urls(AND_NOT_merge(found_urls_one, found_urls_two))

for url in AND_NOT_search('Aalborg', 'Universitet'):
    print(url)

https://www.news.aau.dk/?page=1
https://www.vacancies.aau.dk/
https://www.news.aau.dk/
https://www.facebook.com/AlumniAAU
