# Self study 1

Self studies should be solved individually or in small groups of 2-3 students. There is no hand-in of your solutions to the self studies. However, you can bring your solutions to the exam and use them as the basis for your answers to the exam questions.

In this self-study we construct a simple crawler. Concretely, you should: 

* Select about 5 seed urls, e.g. homepages of universities, e-commerce sites, or similar

* Start crawling from these seeds. Define a strategy for selecting the next url to be crawled. What kind of prioritization (if any) is embodied in your strategy?

* Make sure you obey the robots.txt file, and ensure that at least 2 seconds elapse between requests to the same host

* Stop when you have crawled approx. 1000 pages

* For each crawled page, save the url and the text string contained in the 'title' element of the document (we do not want to handle the full text of the pages at this point).

* You can repeat this several times, using different seed sets and/or prioritization strategies.

The following two self studies will extend the work that you do in this self study.
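
One possible answer to the prioritization question in the task list is to prefer URLs that lie close to a site's root. The sketch below is only an illustration of that idea, not a reference solution: it keeps the frontier in a heap ordered by path depth. The class name `Frontier` and the depth heuristic are assumptions made for this example.

```python
import heapq
from itertools import count
from urllib.parse import urlparse


class Frontier:
    """Priority frontier: URLs with fewer path segments are returned first."""

    def __init__(self):
        self._heap = []
        self._seen = set()      # every URL ever added, to skip duplicates
        self._order = count()   # tie-breaker: equal depth means first in, first out

    def add(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        depth = len([part for part in urlparse(url).path.split('/') if part])
        heapq.heappush(self._heap, (depth, next(self._order), url))

    def pop(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


demo_frontier = Frontier()
for u in ['https://www.aau.dk/uddannelser/optagelse/kandidat',
          'https://www.dr.dk',
          'https://www.aau.dk/uddannelser',
          'https://www.dr.dk']:   # duplicate, ignored
    demo_frontier.add(u)

order = []
while (u := demo_frontier.pop()) is not None:
    order.append(u)
print(order)
```

Other orderings (breadth-first, random, or by number of in-links seen so far) fit the same interface by changing how the first tuple element is computed.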

The following introduces a few helpful libraries and essential functions. You can use these methods, or use other tools that you are already familiar with and/or prefer to work with. 

A simple crawler implementation can be based on the 'requests' package [https://requests.readthedocs.io/en/master/](https://requests.readthedocs.io/en/master/) for retrieving html documents, and the BeautifulSoup parser https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the html.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
from datetime import datetime, timedelta
from itertools import count
import pandas as pd
import numpy as np
import random


Let's start crawling at https://www.aau.dk/ . We first retrieve the robots.txt file and check whether we are allowed to crawl the top-level url:

In [2]:
rp=RobotFileParser()
rp.set_url("https://www.aau.dk/robots.txt")
rp.read()
print(rp.can_fetch("*","https://www.aau.dk"))

True
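
To see how the rules are applied without making a network request, the parser can also be fed the lines of a robots.txt file directly. The following is a small sketch; the rules and the example.org URLs are invented for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of downloaded.
robots_lines = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

demo_rp = RobotFileParser()
demo_rp.parse(robots_lines)

allowed_home = demo_rp.can_fetch("*", "https://example.org/index.html")
allowed_private = demo_rp.can_fetch("*", "https://example.org/private/data.html")
delay = demo_rp.crawl_delay("*")   # None if the file states no Crawl-delay

print(allowed_home, allowed_private, delay)
```

If a site states a crawl delay longer than 2 seconds, it seems reasonable to respect the larger value.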


We can now get the html using the requests package, which returns a response object:

In [3]:
r=requests.get('https://www.aau.dk/')
print(type(r))

<class 'requests.models.Response'>


A basic view of the contents is accessible via the text attribute, which holds the decoded string (the content attribute holds the raw bytes).

For serious parsing, we can use the BeautifulSoup html parser:

In [4]:
r_parse = BeautifulSoup(r.text, 'html.parser')

We can get the title:

In [5]:
print(r_parse.find('title'))
print(r_parse.find('title').string)

<title>AAU - Viden for verden - Aalborg Universitet</title>
AAU - Viden for verden - Aalborg Universitet


Importantly, we can get all the links on the page with find_all('a', href=True), and the sleep() function lets us implement time delays between requests.
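
A minimal sketch of link extraction with a delay, assuming a small inline HTML snippet (hypothetical) in place of a live page so that it runs without network access. Relative links are resolved against the page URL with urljoin.

```python
from time import sleep
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page content, used instead of a live request.
html = """
<html><body>
  <a href="https://www.aau.dk/uddannelser">Absolute link</a>
  <a href="/forskning">Relative link</a>
  <a name="top">Anchor without href</a>
</body></html>
"""
page_url = "https://www.aau.dk/"

soup = BeautifulSoup(html, "html.parser")
links = []
for a in soup.find_all("a", href=True):          # skips <a> tags without an href
    links.append(urljoin(page_url, a["href"]))   # resolve relative links

for link in links:
    print(link)
    sleep(0.1)   # shortened for the demo; a real crawler waits at least 2 seconds per host
```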

To enforce the delay per host, we need the scheme and host of a URL, which urlparse provides:

In [6]:
url = 'https://www.aau.dk/uddannelser/optagelse/kandidat/ledige-studiepladser-2022'
parsed = urlparse(url)
print(parsed)
newurl = parsed.scheme + '://' + parsed.netloc
print(newurl)

ParseResult(scheme='https', netloc='www.aau.dk', path='/uddannelser/optagelse/kandidat/ledige-studiepladser-2022', params='', query='', fragment='')
https://www.aau.dk
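
The scheme and host extracted this way can serve as a key for grouping URLs, so that each host gets its own queue. The sketch below illustrates the idea; the helper name `host_key` and the example URLs are assumptions for this example.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def host_key(url):
    """Return scheme and host, e.g. 'https://www.aau.dk'."""
    parsed = urlparse(url)
    return parsed.scheme + '://' + parsed.netloc


# Hypothetical URLs for illustration.
urls = [
    'https://www.aau.dk/uddannelser',
    'https://www.dr.dk/nyheder',
    'https://www.aau.dk/forskning',
]

queues = defaultdict(deque)   # one FIFO queue per host
for u in urls:
    queues[host_key(u)].append(u)

for host, queue in queues.items():
    print(host, list(queue))
```

With one queue per host, the delay between requests only has to be tracked per queue rather than per URL.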


In [7]:

seeds = ['https://www.aau.dk', 'https://www.dr.dk', 'https://www.tv2.dk', 'https://www.bt.dk']
index_arr = {}
crawled_links = []
frontier = []
frontqueue = {
    'one' : [],
    'two' : [],
    'three' : []
}

back_queue = {}
prio_heap = {}

for url in seeds :
    prio_heap[url] = datetime.now()
    back_queue[url] = []
sleep(2)

def get_base_url(url):
    parsed = urlparse(url)
    baseUrl = parsed.scheme + '://' + parsed.netloc
    return baseUrl
def fill_back_queue():
    arr = []
    if(len(frontqueue['one']) != 0):
        arr = frontqueue['one']
        frontqueue['one'] = []
    elif(len(frontqueue['two']) != 0):
        arr = frontqueue['two']
        frontqueue['two'] = []
    elif(len(frontqueue['three']) != 0):
        arr = frontqueue['three']
        frontqueue['three'] = []

    for url in arr:
        host = get_base_url(url)
        if host not in back_queue:
            back_queue[host] = []
            # New host: backdate the timestamp so it may be contacted right away.
            prio_heap[host] = datetime.now() - timedelta(seconds=2)
        # Known hosts keep their timestamp: it records the last request, not the last enqueue.
        back_queue[host].append(url)

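def get_url():
    # A heap ordered by earliest allowed fetch time would be more efficient;
    # a linear scan over the hosts is enough for this exercise.
    while True:
        if not any(back_queue.values()):
            fill_back_queue()
            if not any(back_queue.values()):
                return ''  # nothing left to crawl
        now = datetime.now()
        # A host is viable if it has queued URLs and was last contacted at least 2 seconds ago.
        viable_hosts = [host for (host, last) in prio_heap.items()
                        if back_queue.get(host) and last + timedelta(seconds=2) <= now]
        if viable_hosts:
            break
        sleep(0.1)  # every host with pending URLs was contacted too recently

    host = random.choice(viable_hosts)
    url = back_queue[host].pop()
    prio_heap[host] = datetime.now()  # record the time of this request to the host
    crawled_links.append(url)
    return url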

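robots_cache = {}  # one parsed robots.txt per host, so each file is downloaded only once

def fetch(url):
    host = get_base_url(url)
    if host not in robots_cache:
        robots = RobotFileParser()
        robots.set_url(host + '/robots.txt')
        robots.read()
        robots_cache[host] = robots
    # Only request the page if robots.txt allows it for a generic user agent.
    if robots_cache[host].can_fetch("*", url):
        r = requests.get(url, timeout=10)
        return BeautifulSoup(r.text, 'html.parser')
    return None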

def index(doc, url):
    title = doc.find('title')
    if(title):
        if url not in index_arr.keys():
            index_arr[url] = title.string
    else:
        print('no title')

def extract_urls(doc, url):
    href_arr = [] 
    for a in doc.find_all('a', href=True):
        link = a['href']
        if(link.startswith('https://www') and '.dk' in link and not link.startswith('https://www.google.com')):
            if (link not in frontier and link not in href_arr and link not in crawled_links):
                href_arr.append(link)
        #else:
        #    comb_url = get_base_url(url) + link
        #    if (comb_url not in frontier and comb_url not in href_arr and comb_url not in crawled_links):
        #        href_arr.append(comb_url)
    return href_arr

def add_to_frontier(url_list):
    #To make some checks easier this is added
    for url in url_list:
        frontier.append(url)
        slash_count = url.count('/')
        if (slash_count > 5):
            frontqueue['three'].append(url)
        elif(slash_count > 3):  
            frontqueue['two'].append(url)
        else:
            frontqueue['one'].append(url)

def loop_code():
    url = get_url()
    print(url)
    if not url == '':
        print(i)
        doc = fetch(url)
        if (doc):
            index(doc, url)
            add_to_frontier(extract_urls(doc, url))
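
add_to_frontier(seeds)

i = 0
while len(back_queue) != 0 and i < 1000:
    i += 1
    try:
        loop_code()
    except Exception as e:
        # Report the failure (network error, parse error, ...) and move on.
        # Catching Exception rather than everything keeps "interrupt kernel" working.
        print('error:', e)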



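# One row per crawled page, with the url and its title as columns.
index_frame = pd.DataFrame(list(index_arr.items()), columns=['url', 'title'])
index_frame.to_csv('index.csv', index=False)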


https://www.tv2.dk
1
https://www.dr.dk
2
error
https://www.aau.dk
3
error
https://www.bt.dk
4
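
The crawler above chooses among hosts by scanning all of them on every step. A common alternative is a heap keyed on the earliest time each host may be contacted again. The following is a rough sketch under that assumption; the function names `schedule` and `next_host` are made up for this example, and a fixed clock is used so the result is reproducible.

```python
import heapq
from datetime import datetime, timedelta

DELAY = timedelta(seconds=2)   # minimum gap between requests to one host


def schedule(hosts, start):
    """Return a heap of (earliest_allowed_time, host) entries."""
    heap = [(start, host) for host in hosts]
    heapq.heapify(heap)
    return heap


def next_host(heap, now):
    """Pop the host that may be contacted soonest and reschedule it.

    Returns (host, wait), where wait is how long the caller should sleep
    before sending the request.
    """
    allowed_at, host = heapq.heappop(heap)
    wait = max(allowed_at - now, timedelta(0))
    # The request goes out at now + wait; the next one must come DELAY later.
    heapq.heappush(heap, (now + wait + DELAY, host))
    return host, wait


t0 = datetime(2022, 1, 1, 12, 0, 0)   # fixed clock for a reproducible demo
heap = schedule(['https://www.aau.dk', 'https://www.dr.dk'], t0)

first = next_host(heap, t0)
second = next_host(heap, t0)
third = next_host(heap, t0 + timedelta(seconds=0.5))
print(first, second, third)
```

In a real crawler the caller would sleep for `wait.total_seconds()` before issuing the request, and hosts without pending URLs would be left out of the heap.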



In [1]:
link = 'tv.dk'

if '.dk' in link:
    print('öui')

öui
