# Crawler

This notebook contains the starter code structure for building a crawler on a single machine.

**Author:** Noshaba Nasir

**Date:** 9/4/2021

**Updated by:** Shehroz Ali   L17-6334   IR_A
    

### Note
#### Each module of the crawler (e.g. Fetcher, Parser) is written with OOP in mind. Each module has its own class and exposes specific APIs that the frontier uses to perform its task. Keeping the modules separate, instead of folding everything into one large frontier, makes it easier to debug and to add functionality.

In [1]:
import random
from bs4 import BeautifulSoup
import json
import requests as rq
import mimetypes
from urllib import parse, request, robotparser as botparser
from queue import Queue
import threading
from queue import PriorityQueue
from time import time, sleep
import re

In [2]:
BACKQUEUES = 3
THREADS = BACKQUEUES * 3
FRONTQUEUES = 5
WAITTIME = 15  # seconds to wait before fetching another URL from the same back queue
urls_fetched = 0
Filename = "URLs.Json" # File where all of the Raw Content will be dumped 

### Storage Class (Will handle content dumping to disk)

In [3]:
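# This class handles everything related to content storage
class Storage:
    def __init__(self, filename):
        self.filename = filename
        self.open = True  # tracks whether the file handle is still open
        self.lock = threading.Lock()  # several crawler threads share one Storage object
        self.filehandle = open(self.filename, "a")

    def PutContent(self, URL, Content):
        # Write one JSON object per line so the file can be read back record by record
        with self.lock:
            self.filehandle.write(json.dumps({URL: Content}) + "\n")
            self.filehandle.flush()

    def GetObjects(self):
        # The append handle is write-only, so reopen the file for reading
        with open(self.filename, "r") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def __del__(self):
        self.filehandle.close()
        self.open = False
        print("File Closed")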
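
Calling `json.dump` several times on the same handle writes objects back to back, and `json.load` cannot parse such a file as a single document. A minimal sketch of one way around this, assuming one JSON object per line (the JSON Lines convention). The helpers `append_record` and `read_records` and the demo file name are hypothetical and not part of the notebook:

```python
import json
import os
import tempfile


def append_record(path, url, content):
    """Append one {url: content} record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({url: content}) + "\n")


def read_records(path):
    """Return every record in the file, one dict per non-empty line."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage with a throwaway file (the path is illustrative only)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_urls.jsonl")
append_record(demo_path, "https://example.com/", "<html>a</html>")
append_record(demo_path, "https://example.com/b", "<html>b</html>")
records = read_records(demo_path)
```

Each line is parsed independently, so a crawl that is interrupted midway still leaves every completed record readable.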



### Fetcher 

In [4]:
class Fetcher:  # takes a URL and retrieves the page's raw content and all links found in it
    opener = None
    initialized = False

    def __init__(self):
        if not Fetcher.initialized:
            Fetcher.opener = request.build_opener()
            Fetcher.opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            request.install_opener(Fetcher.opener)
            Fetcher.initialized = True
    def fetch_Content(self, base_url):
        Request_Url = rq.get(base_url)
        soup = BeautifulSoup(Request_Url.text, "html.parser")
        links = []

        raw_content = Request_Url.text
        x = soup.findAll('a', attrs={'href': re.compile("^http://|^https://|^/")})

        # for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        for link in x:
            # Convert relative link to absolute link if any
            full_url = parse.urljoin(base_url, link.get('href'))
            links.append(full_url)

        return links, raw_content
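
The link-extraction step can also be tried without any network access. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser`, and `LinkCollector` and `extract_links` are hypothetical names. Unlike the regex filter used in `fetch_Content`, it also accepts relative links that do not start with `/`:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolves relative links
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def extract_links(base_url, html):
    """Return all absolute http(s) links found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/docs">Docs</a> '
    '<a href="https://example.org/x">X</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
links = extract_links("https://example.com/start/page.html", sample)
```

Here `links` holds the two web URLs, with `/docs` resolved against the base URL and the `mailto:` link dropped.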



### Parser

#### The Parser performs two operations. First, it checks whether a URL is a duplicate by searching the list of URLs it maintains. Second, it checks whether the page may be fetched, by reading the robots.txt file for the domain. The robots.txt parser object for a domain is kept in a cache (a list) and reused when another URL from the same domain arrives. For assignment 1 the Parser also performs a third operation: it checks that the URL's "content-type" is text/html. It first tries to determine the MIME type from the URL itself; only if that fails does it request JUST the headers, NOT the complete web page, to determine the MIME type.
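
The robots.txt caching idea can be sketched offline as follows. This is an illustration under stated assumptions, not the notebook's implementation: the cache is a dict keyed by domain rather than a list, the rules are supplied as text instead of being downloaded, and `robots_cache`, `register_rules` and `is_allowed` are hypothetical names.

```python
from urllib import robotparser
from urllib.parse import urlparse

robots_cache = {}  # domain -> RobotFileParser (hypothetical name)


def register_rules(domain, robots_txt):
    """Parse robots.txt text for a domain and cache the parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    robots_cache[domain] = rp
    return rp


def is_allowed(url, user_agent="*"):
    """Return True when no rules are cached or the cached rules permit the URL."""
    rp = robots_cache.get(urlparse(url).netloc)
    return True if rp is None else rp.can_fetch(user_agent, url)


register_rules("example.com", "User-agent: *\nDisallow: /private/\n")
allowed = is_allowed("https://example.com/public/page.html")
blocked = is_allowed("https://example.com/private/data.html")
unknown = is_allowed("https://other.example/anything")
```

A dict lookup finds the cached parser in one step, whereas a list of single-entry dicts has to be scanned on every check.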

In [5]:
class Parser:
    # The parser maintains a list of URLs it has seen (duplicate removal)
    # The parser also keeps a list of robots.txt parsers for the domains it has seen
    # It also checks whether a given URL is allowed to be fetched
    def __init__(self):
        self.URL_List = []  # a URL not yet in this list is added on first encounter
        self.Robots_List = []

    def Is_URL_Dup(self, URL):  # returns True if the URL is a duplicate, False otherwise
        if URL in self.URL_List:
            return True
        else:
            # Add the URL in the List and Return false
            self.URL_List.append(URL)
            return False

    def __Add_RobotTxt_Content__(self, URLDomain):
        RP = botparser.RobotFileParser()
        RP.set_url("http://" + URLDomain + "/robots.txt")

        try:
            RP.read()
        except Exception:
            return None

        # Add the RP object to our Robots_List
        self.Robots_List.append({URLDomain: RP})
        # ret = next((i for i,item in enumerate(self.Robots_List) if URLDomain in item),None)
        # return ret
        return RP

    def Is_URL_Allowed(self, URL):  # is URL Allowed to be fetched
        URL_Domain = parse.urlparse(URL).netloc

        # Each cache entry is a {domain: parser} dict, so pull out the parser itself
        P = next((d[URL_Domain] for d in self.Robots_List if URL_Domain in d), None)
        if P is None:
            P = self.__Add_RobotTxt_Content__(URL_Domain)
            if P is None:
                return True

        try:
            return P.can_fetch("*", URL)
        except Exception:
            return True

    def get_link_type(self, link, strict=True):
        link_type, _ = mimetypes.guess_type(link)
        if link_type is None and strict:
            try:
                u = rq.head(link)
                # u = request.urlopen(link)
                link_type = u.headers.get("content-type", '')
            except Exception:
                link_type = ""
        return link_type or ""  # never return None, so callers can safely use string methods
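
The first stage of the MIME check, guessing from the URL alone, can be tried without any network access. A small sketch, where `looks_like_html` is a hypothetical helper and the URLs are placeholders:

```python
import mimetypes


def looks_like_html(url):
    """Return True/False when the extension is decisive, None when it is not."""
    guessed, _ = mimetypes.guess_type(url)
    if guessed is None:
        return None  # a caller would fall back to a HEAD request here
    return guessed == "text/html"


html_guess = looks_like_html("https://example.com/index.html")
pdf_guess = looks_like_html("https://example.com/paper.pdf")
no_guess = looks_like_html("https://example.com/wiki/Machine_learning")
```

URLs without a file extension, which is the common case for wiki-style pages, give no guess, so those are the ones that cost an extra header request.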



### BackQueue

In [6]:
class Bqueue:  # back queue
    MAX_URLS = 1000

    def __init__(self):
        self.queue = Queue(Bqueue.MAX_URLS)
        self.domain = ''
        self.empty = True

    def get_domain(self):
        return self.domain

    def add_URL(self, URL):
        if not self.queue.full():
            # Check if queue is Empty to begin with
            if self.empty:
                self.empty = False
                self.domain = parse.urlparse(URL).netloc

            self.queue.put(URL)
            return True
        else:
            # Queue is full
            return False

    def get_URL(self):
        ret = self.queue.get()
        if self.queue.empty():
            self.empty = True
        return ret


### FrontQueue

In [7]:
class Fqueue:  # front Queue
    MAX_URLS = 1000

    def __init__(self, Priority):
        self.Priority = Priority
        self.queue = Queue(Fqueue.MAX_URLS)

    def get_Priority(self):
        return self.Priority

    def add_URL(self, URL):
        if not self.queue.full():
            self.queue.put(URL)
            return True
        else:
            return False

    def get_URL(self):
        if not self.queue.empty():
            return self.queue.get()
        else:
            return None


# FRONTIER

In [8]:
class frontier:
    # add the code for frontier here
    # should have functions __init__, get_URL, add_URLs, add_to_backqueue
    def __init__(self, FrontQueues, Backqueues,
                 filename):  # here FrontQueues and BackQueues are number of queues to be created
        self.URL_Count = 0
        # self.URL_List = []
        self.FQueues = []  # list that will contain FQueue Objects
        self.BQueues = []  # list that will contain Backqueue Objects
        self.fetcher = Fetcher()
        self.parser = Parser()
        self.storage = Storage(filename)
        self.TotalPriority = 0
        self.TimeHeap = PriorityQueue()

        for i in range(FrontQueues):
            self.FQueues.append(Fqueue(i + 1))  # our priority starts from 1

        for i in range(Backqueues):
            self.BQueues.append(Bqueue())

        # Initialize the time heap as well
        for i in range(Backqueues):
            self.TimeHeap.put((time(), i))  # Initially every back queue may be used right away
            # Each tuple has the format (earliest allowed fetch time, back queue index)

    def add_urls_list(self, URLLIST):
        # Takes a URL list, performs the MIME type, duplicate and permission checks, and adds each URL that
        # passes to a front queue. Returns the number of links added.
        i = 0
        for url in URLLIST:
            # perform the MIME type check first
            if (self.parser.get_link_type(url)).find("text/html") != -1:
                # check if its Duplicate
                if not self.parser.Is_URL_Dup(url):
                    # Check if allowed to fetch
                    if self.parser.Is_URL_Allowed(url):
                        self.add_URL(url)
                        i += 1
                        # print("Count: " +str(i) +" Url Added to Front Queues: " + url)
                        print("Url Added to Front Queues: " + url)
                    else:
                        print("url: " + url + " Rejected, Not allowed to access")
                else:
                    print("url: " + url + " Rejected, Duplicate URL")
            else:
                print("url: " + url + " Rejected, Invalid mime type (Not text/html)")
        return i

    def add_URL(self, URL):
        x = self.prioritizer()
        self.FQueues[x].add_URL(URL)

    def __Get_URL_From_Fqueues__(self):
        return self.FQueues[self.prioritizer()].get_URL()  # This will also remove the URL from the queue

    def __Are_All_FQueues_Empty(self):
        for x in (self.FQueues):
            if not x.queue.empty():
                return False
        return True

    def __has_domain__(self, URL):
        # Returns (True, queue) if a back queue for this URL's domain already exists, else (False, None)
        for x in self.BQueues:
            if not x.empty:
                if x.domain == parse.urlparse(URL).netloc:
                    return True, x

        return False, None

    def fill_Backqueue(self):
        # Fill the back queues until every back queue holds a URL or all front queues are emptied
        for q in self.BQueues:
            while q.empty:
                if self.__Are_All_FQueues_Empty():
                    return  # nothing left to move; avoid spinning forever
                U = self.FQueues[self.prioritizer()].get_URL()
                if U is None:
                    continue  # picked an empty front queue; try another

                # Check if U's domain already has a back queue allocated to it
                r, p = self.__has_domain__(U)
                if r:
                    p.add_URL(U)
                else:
                    q.add_URL(U)

    def get_URL(self,threadID):  # This function will be called by a thread to get a URL from the Back Queue
        ret = self.TimeHeap.get()
        if (time() - ret[0]) < 0:
            # we have to wait
            print("Thread " + str(threadID) + " waiting for " + str(round(ret[0]-time())) + " seconds")
            sleep(round(ret[0] - time()))

        url = self.BQueues[ret[1]].get_URL()  # get URL from Bqueue

        if self.BQueues[ret[1]].empty:
            self.fill_Backqueue()  # if the queue is empty fill up the queue

        self.TimeHeap.put((ret[0] + WAITTIME, ret[1]))  # Update the heap

        return url

  
    def prioritizer(self, f=None, URL=None):
        """
        Takes a URL and returns the index of the front queue to use, from 0 to F - 1.
        Right now it is a stub function:
        it returns a random index regardless of the inputs.
        """
        return random.randint(0, len(self.FQueues) - 1)
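
The politeness mechanism in `get_URL` keeps a heap of (earliest allowed fetch time, back queue index) pairs. A deterministic sketch of that idea, assuming a fake clock passed in as `now`; `next_ready` and `WAIT_SECONDS` are hypothetical names. One deliberate difference from the notebook: the sketch reschedules a queue from the moment it is actually used rather than from its previously stored time.

```python
import heapq

WAIT_SECONDS = 15  # mirrors WAITTIME in the notebook


def next_ready(heap, now):
    """Pop the back queue that becomes ready soonest.

    Returns (queue_index, seconds_to_wait) and reschedules that queue
    WAIT_SECONDS after the moment it is actually used.
    """
    ready_at, index = heapq.heappop(heap)
    wait = max(0.0, ready_at - now)
    heapq.heappush(heap, (max(ready_at, now) + WAIT_SECONDS, index))
    return index, wait


# Three back queues, all ready at t=0 (a fake clock keeps the example deterministic)
heap = [(0.0, 0), (0.0, 1), (0.0, 2)]
heapq.heapify(heap)
first = next_ready(heap, now=0.0)
second = next_ready(heap, now=1.0)
third = next_ready(heap, now=2.0)
fourth = next_ready(heap, now=3.0)
```

The fourth request arrives at t=3 while the soonest queue is not ready until t=15, so the caller is told to wait 12 seconds before fetching from queue 0 again.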


    

# Run Crawler

In [9]:
def thread_func(threadID, Frontier):  # The thread Function will take a frontier Object and a threadID as  arguments
    global urls_fetched
    k = 0
    while urls_fetched < 10:
        u = Frontier.get_URL(threadID)
        print("Thread " + str(threadID) + ": fetching url: " + u)

        URLList, content = Frontier.fetcher.fetch_Content(u)
        print("Thread " + str(threadID) + ": " + str(len(URLList)) + " Raw Urls fetched")
        print("Thread " + str(threadID) + ": Dumping Raw Content to Disk")
        Frontier.storage.PutContent(u, content)
        print("Thread " + str(threadID) + ": Testing Urls")
        # Drop duplicates, URLs that may not be fetched, and URLs whose MIME type is not text/html
        i = Frontier.add_urls_list(URLList)
        print("Thread " + str(threadID) + ": Added " + str(i) + " Urls to Fqueues after parsing ")
        urls_fetched += 1
        k += 1

    print("Thread " + str(threadID) + ": Total URLS fetched: " + str(k))
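
In `thread_func` the check `urls_fetched < 10` and the later `urls_fetched += 1` are separate steps, so several threads can pass the check before any of them increments the counter, and more than 10 pages may be fetched. A minimal sketch of one way to make the limit exact, assuming a lock-protected counter; `try_claim_fetch`, `fetch_count`, `count_lock` and `worker` are hypothetical names:

```python
import threading

MAX_FETCHES = 10  # same limit thread_func uses
fetch_count = 0
count_lock = threading.Lock()


def try_claim_fetch():
    """Atomically reserve one fetch slot; return False once the limit is reached."""
    global fetch_count
    with count_lock:
        if fetch_count >= MAX_FETCHES:
            return False
        fetch_count += 1
        return True


def worker(claimed):
    """Keep claiming slots until the shared limit is exhausted."""
    while try_claim_fetch():
        claimed.append(threading.get_ident())  # stand-in for the real fetch


claimed = []
workers = [threading.Thread(target=worker, args=(claimed,)) for _ in range(9)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Because the check and the increment happen under the same lock, exactly `MAX_FETCHES` slots are handed out no matter how the threads interleave.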


### Initialize Variables and Objects

In [10]:
f = frontier(FRONTQUEUES, BACKQUEUES, Filename)
f.add_URL("https://docs.oracle.com/en/")
f.add_URL("https://www.oracle.com/corporate/")
f.add_URL("https://en.wikipedia.org/wiki/Machine_learning")
f.add_URL("https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html")
f.add_URL("https://docs.oracle.com/middleware/jet210/jet/index.html")
f.add_URL("https://en.wikipedia.org/w/api.php")
f.add_URL("https://en.wikipedia.org/api/")
f.add_URL("https://en.wikipedia.org/wiki/Weka_(machine_learning)")
f.fill_Backqueue()


In [None]:
threads = []
for i in range(THREADS):
    t = threading.Thread(target=thread_func, args=[i, f])
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()


Thread 0: fetching url: https://en.wikipedia.org/wiki/Machine_learning
Thread 1: fetching url: https://docs.oracle.com/en/
Thread 2: fetching url: https://www.oracle.com/corporate/
Thread 3 waiting for 12 seconds
Thread 4 waiting for 12 seconds
Thread 5 waiting for 12 seconds
Thread 2: 41 Raw Urls fetched
Thread 2: Dumping Raw Content to Disk
Thread 2: Testing Urls
Thread 1: 0 Raw Urls fetched
Thread 1: Dumping Raw Content to Disk
Thread 1: Testing Urls
Thread 1: Added 0 Urls to Fqueues after parsing 
Thread 0: 1347 Raw Urls fetched
Thread 0: Dumping Raw Content to Disk
Thread 0: Testing Urls
Url Added to Front Queues: https://www.oracle.com/corporate/accessibility/
Thread 3: fetching url: https://en.wikipedia.org/w/api.php
Thread 5: fetching url: https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html
Thread 6 waiting for 15 seconds
Thread 7 waiting for 15 seconds
url: https://www.oracle.com/ Rejected, Invalid mime type (Not text/html)
Url Added to Front Queues: https://en.wikipedia.org/

#### Had to Terminate the Crawler myself as the supply of urls is limitless :)

## ------------------------------------------------------End of Notebook---------------------------------------------------