# Self study 1

Self studies should be solved individually or in small groups of 2-3 students. There is no hand-in of your solutions to the self studies. However, you can bring your solutions to the exam and use them as the basis for your answers to the exam questions.

In this self-study we construct a simple crawler. Concretely, you should: 

* Select about 5 seed urls, e.g. homepages of universities, e-commerce sites, or similar

* Start crawling from these seeds. Define a strategy for selecting the next url to be crawled. What kind of prioritization (if any) is embodied in your strategy?

* Make sure you obey the robots.txt file, and ensure that at least 2 seconds elapse between requests to the same host

* Stop when you have crawled approx. 1000 pages

* For each crawled page, save the url and the text string contained in the 'title' element of the document (we do not want to handle the full text of the pages at this point).

* You can repeat this several times, using different seed sets and/or prioritization strategies.
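
One way to make the selection strategy explicit is to keep the frontier in a priority queue. The following is only a minimal sketch, not a prescribed solution: the class name `PriorityFrontier` and the scoring rule `url_depth` (prefer urls with fewer path segments) are assumptions chosen for illustration, and any other scoring function could be plugged in.

```python
import heapq
from itertools import count
from urllib.parse import urlsplit


def url_depth(url):
    """Hypothetical priority: number of non-empty path segments (lower is crawled first)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


class PriorityFrontier:
    """Frontier that always hands out the url with the smallest score."""

    def __init__(self, score=url_depth):
        self._heap = []
        self._seen = set()    # urls that were ever added, to avoid duplicates
        self._tie = count()   # insertion counter keeps FIFO order among equal scores
        self._score = score

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (self._score(url), next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


pf = PriorityFrontier()
for u in ["https://www.aau.dk/uddannelser/bachelor",
          "https://www.aau.dk/",
          "https://www.aau.dk/forskning/"]:
    pf.add(u)
print(pf.pop())  # the homepage has depth 0, so it is handed out first
```

With this particular scoring rule the crawler favours pages close to the root of each site, which resembles a breadth-first strategy. A constant score would reduce it to a plain FIFO queue.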

The following two self studies will extend the work that you do in this self study.

The following introduces a few helpful libraries and essential functions. You can use these methods, or use other tools that you are already familiar with and/or prefer to work with. 

A simple crawler implementation can be based on the 'requests' package [https://requests.readthedocs.io/en/master/](https://requests.readthedocs.io/en/master/) for retrieving html documents, and the BeautifulSoup parser https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the html.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from urllib.robotparser import RobotFileParser

Let's start crawling at https://www.aau.dk/ . We first retrieve the robots.txt file and check whether we are allowed to crawl the top-level url:

In [2]:
rp=RobotFileParser()
# set_url expects the location of the robots.txt file itself, not the site root
rp.set_url("https://www.aau.dk/robots.txt")
rp.read()
print(rp.can_fetch("*","https://www.aau.dk"))

True


We can now get the html using the requests package, which returns a response object:

In [3]:
r=requests.get('https://www.aau.dk/')
print(type(r))

<class 'requests.models.Response'>


The raw bytes of the response body are accessible via the content attribute (the decoded string is available as r.text):

In [4]:
r.content

b'<!DOCTYPE html>\r\n<html class="no-js" prefix="og: http://ogp.me/ns#">\r\n<head>\r\n<meta charset="utf-8" />\r\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n\r\n\r\n<meta name="description" content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden." />\r\n<title>AAU - Viden for verden</title>\r\n\r\n<!-- Cookie Information Consent -->\r\n<script id="CookieConsent" src="https://policy.app.cookieinformation.com/uc.js" data-culture="DA" type="text/javascript"></script>\r\n<!-- Remove no-js enable html5 elements -->\r\n<script type="text/javascript">\r\n    //Clear no-js\r\n    document.getElementsByTagName(\'html\')[0].className = document\r\n            .getElementsByTagName(\'html\')[0].className.replace(\'no-js\', \'\');\r\n    //Enable html5 elements in IE\r\n    \'article aside footer header nav section time\'.replace(/\\w+/g, function(n) {\

For serious parsing, we can use the BeautifulSoup html parser:

In [5]:
r_parse = BeautifulSoup(r.text, 'html.parser')
print(r_parse.prettify())

<!DOCTYPE html>
<html class="no-js" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
  <meta content="Aalborg Universitet - Problem- og projektbaseret forskning og uddannelse, der i samspil mellem AAU og omverdenen skaber viden, der forandrer verden." name="description">
   <title>
    AAU - Viden for verden
   </title>
   <!-- Cookie Information Consent -->
   <script data-culture="DA" id="CookieConsent" src="https://policy.app.cookieinformation.com/uc.js" type="text/javascript">
   </script>
   <!-- Remove no-js enable html5 elements -->
   <script type="text/javascript">
    //Clear no-js
    document.getElementsByTagName('html')[0].className = document
            .getElementsByTagName('html')[0].className.replace('no-js', '');
    //Enable html5 elements in IE
    'article aside footer header nav section time'.replace(/\w+/g, function(n) {
        document.createElement(n

We can get the title:

In [6]:
print(r_parse.find('title'))
print(r_parse.find('title').string)

<title>AAU - Viden for verden</title>
AAU - Viden for verden
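
Not every page has a usable title: `find('title')` returns `None` when the element is missing, and `.string` is `None` when the element is empty. The following minimal sketch wraps the lookup in a helper that handles both cases. The function name `page_title` is an assumption made for illustration.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the stripped title text, or None if the page has no usable title."""
    tag = BeautifulSoup(html, "html.parser").find("title")
    if tag is None:
        return None
    text = tag.get_text(strip=True)
    return text or None


print(page_title("<html><head><title> AAU - Viden for verden </title></head></html>"))
print(page_title("<html><body>No title here</body></html>"))
```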


Importantly, we can get all the links on the page. The loop below also uses the sleep() function to implement time delays (it will take a while to complete; use the "interrupt kernel" button to terminate early):

In [7]:
# href=True skips anchors without an href attribute, which would raise a KeyError
for a in r_parse.find_all('a', href=True):
    sleep(1)
    print(a['href'])

https://www.aau.dk/nyheder
https://www.aau.dk/arrangementer
https://www.aau.dk/kontakt
https://www.aau.dk/om-aau/organisation/campus
https://www.aau.dk/pressen
https://www.alumni.aau.dk/
#
https://www.aau.dk/nyheder
https://www.aau.dk/arrangementer
https://www.aau.dk/kontakt
https://www.aau.dk/om-aau/organisation/campus
https://www.aau.dk/pressen
https://www.alumni.aau.dk/
https://www.aau.dk/uddannelser/
https://www.aau.dk/forskning/
https://www.aau.dk/samarbejde/
https://www.aau.dk/om-aau/
https://www.stillinger.aau.dk/
https://www.intern.aau.dk/
https://www.search.aau.dk/?locale=da
https://www.en.aau.dk/
https://www.search.aau.dk/?locale=da
https://www.en.aau.dk/
https://www.aau.dk
#
https://www.aau.dk/uddannelser/
https://www.aau.dk/uddannelser/bachelor
https://www.aau.dk/uddannelser/kandidat
https://www.aau.dk/uddannelser/sidefag-tilvalgsfag
https://www.aau.dk/uddannelser/bliv-gymnasielaerer
https://www.evu.aau.dk
https://www.aau.dk/uddannelser/optagelse/bachelor
http://www.adgangs

KeyboardInterrupt: 
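
The output above contains fragment-only links such as `#`, and pages in general also use relative links. Before a link is added to the frontier it can be resolved against the url of the page it was found on. The following is a minimal sketch using only the standard library. The helper name `normalize_link` is an assumption made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_link(base_url, href):
    """Resolve `href` against `base_url`; return None for empty or non-HTTP(S) targets."""
    if not href:
        return None
    # urljoin makes the link absolute, urldefrag drops a trailing "#fragment"
    absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute


base = "https://www.aau.dk/uddannelser/"
for href in ["bachelor", "/kontakt", "#", "mailto:someone@example.org", "https://www.en.aau.dk/#top"]:
    print(repr(href), "->", normalize_link(base, href))
```

A link consisting only of `#` resolves to the page itself, so the usual check against already visited urls filters it out.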

In [2]:
from queue import SimpleQueue as Queue
import random
from datetime import datetime
from datetime import timedelta


# Keep a list of visited pages and a list of results (urls and titles)
visited = []
results = []

# Initialize frontier of 4 front queues that are assigned randomly
frontier = []
for i in range(4):
    frontier.append(Queue())

def enqueue(qlist, obj):
    q = random.choice(qlist)
    q.put(obj)



# Keep back queues as dictionary
backQ = {}

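# Extract the next url to crawl from the back queues
def get_crawl(qd):
    result = None
    keys = list(qd)
    i = 0
    sec2 = timedelta(seconds=2)  # minimum delay between requests to the same host
    # Search through the back queues, starting with the earliest timestamp
    keys.sort(key=lambda x: qd[x]['time'])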
    while result is None and i < len(keys):
        key = keys[i]
        if not qd[key]['queue'].empty() and datetime.now() > qd[key]['time']:
            # This queue is not empty and the timestamp permits
            result = qd[key]['queue'].get()
            # Update with new timestamp
            qd[key]['time'] = datetime.now() + sec2

        else:
            i += 1
    return result



# Approximate host domain
def extract_domain(url):
    s = url.split("/")
    return s[2]
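
Splitting on `/` works for absolute urls, but it keeps a port number, is case-sensitive, and raises an `IndexError` for relative urls. A minimal alternative sketch based on `urllib.parse` is shown here. The function name `extract_host` is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def extract_host(url):
    """Return the lower-cased host name of an absolute url, or None if there is none."""
    return urlsplit(url).hostname


print(extract_host("https://WWW.AAU.DK:443/om-aau/"))
print(extract_host("/relative/path"))
```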


In [None]:
# define start seed
seed = ["https://www.pinchofyum.com/", "https://www.loveandlemons.com/","https://www.imdb.com/", "https://www.gamegrumps.com/", "https://www.aau.dk/"]

# Put seeds in frontier
for s in seed:
    enqueue(frontier, s)

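while len(results) < 1000:
    next_url = get_crawl(backQ)
    # Test the url we already fetched: calling get_crawl a second time would
    # dequeue another url and silently discard it
    if next_url is None:
        # No back queue can deliver a url right now, so refill by emptying the front queues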
        for f in frontier:
            while not f.empty():
                url = f.get()
                domain = extract_domain(url)
                if domain not in backQ.keys():
                    # Add new back queue if one for this domain does not exist
                    backQ[domain] = {'time': datetime.now(), 'queue': Queue()}
                backQ[domain]['queue'].put(url)

    else:
        try:
            # Logic for crawling a page
            if next_url in visited or next_url is None:
                continue
            print('crawling at ' + next_url)
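            # Initialize robotfile parser with the robots.txt file of this url's host
            # (all urls in the frontier use https)
            rp=RobotFileParser()
            rp.set_url('https://' + extract_domain(next_url) + '/robots.txt')
            rp.read()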
            # Check if robots.txt allows
            if rp.can_fetch("*", next_url):
                r=requests.get(next_url)
                visited.append(next_url)
                #extract title
                r_parse = BeautifulSoup(r.text, 'html.parser')
                title = r_parse.find('title')
                if title is not None:
                    title = title.string
                    # save result
                    res = {'url': next_url, 'title': title}
                    results.append(res)
                    for a in r_parse.find_all('a'):
                        if 'href' in a.attrs:
                            l = a['href']
                            if l.startswith('https') and l not in visited:
                                enqueue(frontier, l)
            else:
                print('could not crawl at ' + next_url)
        except Exception:
            # Catch network and parsing errors, but let KeyboardInterrupt through
            print(f'woops at {next_url}')

crawling at https://www.loveandlemons.com/
crawling at https://www.imdb.com/
crawling at https://www.loveandlemons.com/easy-dinner-ideas/
crawling at https://www.pinterest.com/loveandlemons/
crawling at https://www.facebook.com/lovelemonsfood
crawling at https://instagram.com/imdb
crawling at https://help.imdb.com/imdb?ref_=cons_nb_hlp
crawling at https://youtube.com/imdb/
crawling at https://www.amazon.jobs/en/teams/imdb
crawling at https://m.imdb.com/feature/bestpicture/?ref_=nv_ch_osc
crawling at https://pro.imdb.com?ref_=cons_nb_hm&rf=cons_nb_hm
crawling at https://facebook.com/imdb
crawling at https://pro.imdb.com?ref_=cons_tf_pro&rf=cons_tf_pro
crawling at https://www.imdb.com/search/
crawling at https://help.imdb.com/article/imdb/general-information/why-do-i-need-to-enable-my-cookies-on-imdb/GWE3JQ8VUQDCFW3Q?ref_=helpsrall#
crawling at https://slyb.app.link/SKdyQ6A449
crawling at https://www.loveandlemons.com/thumbprint-cookies/
crawling at https://instagram.com/loveandlemons/
c
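
Once the crawl has finished, the collected `results` list (dictionaries with `url` and `title` keys, as built in the loop above) can be written to disk so that it is available for later work. The following is a minimal sketch using the standard `csv` module. The function name `save_results` and the file name `crawl_results.csv` are assumptions made for illustration.

```python
import csv


def save_results(results, path="crawl_results.csv"):
    """Write a list of {'url': ..., 'title': ...} dictionaries to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(results)


# Example with a single hand-written record
save_results([{"url": "https://www.aau.dk/", "title": "AAU - Viden for verden"}])
```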