In [1]:
# ignore warnings
import warnings
import os
import fcntl
import threading
warnings.filterwarnings("ignore")

# Task 1: Crawling Documents
Each individual is responsible for writing their own crawler, and crawling from their own seed URLs.

Set up ElasticSearch with your teammates so that you all use the same cluster name and the same index name.

Your crawler will manage a frontier of URLs to be crawled. The frontier will initially contain just your seed URLs. URLs will be added to the frontier as you crawl, by finding the links on the web pages you crawl.

1. You should crawl at least 30,000 documents (10,000 documents for undergrads) individually, starting from the seed URLs. This will take several hours, so think carefully about how to adequately test your program without running it to completion in each debugging cycle.

2. You should choose the next URL to crawl from your frontier using a best-first strategy. See Frontier Management, below.
3. Your crawler must strictly conform to the politeness policy detailed in the section below. You will be consuming resources owned by the web sites you crawl, and many of them are actively looking for misbehaving crawlers to permanently block. Please be considerate of the resources you consume.

4. You should only crawl HTML documents. It is up to you to devise a way to ensure this. However, do not reject documents simply because their URLs don’t end in .html or .htm.

5. You should find all outgoing links on the pages you crawl, canonicalize them, and add them to your frontier if they are new. See the Document Processing and URL Canonicalization sections below for a discussion.

6. For each page you crawl, you should store the following fields with ElasticSearch: an id, the URL, the HTTP headers, the cleaned page contents (with term positions), the raw HTML, and a list of all known in-links and out-links for the page.
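
As a rough illustration of item 6, one stored record could be assembled as a plain Python dict before it is sent to the index. The helper `make_doc` and every field name below are assumptions made for this sketch, not names the assignment prescribes, and term positions are left to the index's own analyzer.

```python
def make_doc(doc_id, url, headers, text, raw_html, in_links, out_links):
    """Build one crawl record. All field names here are illustrative."""
    return {
        "id": doc_id,
        "url": url,
        "headers": dict(headers),            # HTTP response headers
        "text": text,                        # cleaned page contents
        "raw_html": raw_html,                # unmodified HTML
        "in_links": sorted(set(in_links)),   # known in-links, de-duplicated
        "out_links": sorted(set(out_links)), # out-links, de-duplicated
    }


doc = make_doc(
    "doc-1",
    "http://www.example.com/a.html",
    {"Content-Type": "text/html"},
    "cleaned page text",
    "<html><body><p>cleaned page text</p></body></html>",
    in_links=["http://www.example.com/", "http://www.example.com/"],
    out_links=["http://www.example.com/b.html"],
)
```

Storing the link lists de-duplicated and sorted keeps records from different crawl runs easy to compare.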

Once your crawl is done, you should get together with your teammates and figure out how to merge the indexes. ElasticSearch can do the merging itself if your computers are connected while indexing new documents, but you still have to manage the link graph. Alternatively, you can write a script to merge the individual indexes. Ultimately, all team members should end up with the merged index.
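
Managing the link graph during a merge mostly means taking the union of the link sets each teammate recorded for the same URL. The sketch below assumes each crawler exports a `{url: [links]}` mapping; the function name `merge_link_maps` and the sample data are hypothetical.

```python
def merge_link_maps(*link_maps):
    """Union several {url: iterable_of_links} maps into one {url: set}."""
    merged = {}
    for link_map in link_maps:
        for url, links in link_map.items():
            merged.setdefault(url, set()).update(links)
    return merged


# In-links seen by two teammates for overlapping pages (made-up data).
alice = {"http://www.example.com/a.html": ["http://www.example.com/"]}
bob = {
    "http://www.example.com/a.html": ["http://www.example.org/x.html"],
    "http://www.example.com/b.html": ["http://www.example.com/a.html"],
}
merged_in_links = merge_link_maps(alice, bob)
```

This only works if every crawler canonicalizes URLs the same way, since the URL string is the join key.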

### Politeness Policy
Your crawler must strictly observe this politeness policy at all times, including during development and testing. Violating these policies can harm the web sites you crawl, and cause the web site administrators to block the IP address from which you are crawling.

1. Make no more than one HTTP request per second from any given domain. You may crawl multiple pages from different domains at the same time, but be prepared to convince the TAs that your crawler obeys this rule. The simplest approach is to make one request at a time and have your program sleep between requests. The one exception to this rule is that if you make a HEAD request for a URL, you may then make a GET request for the same URL without waiting.
2. Before you crawl the first page from a given domain, fetch its robots.txt file and make sure your crawler strictly obeys the file. You should use a third party library to parse the file and tell you which URLs are OK to crawl.
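
Both rules can be sketched with the standard library alone. The class `DomainThrottle` below is a hypothetical helper that tracks the last request time per domain, and the robots.txt rules are parsed from an inline example rather than fetched, so the snippet makes no network calls. It does not model the HEAD-then-GET exception.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class DomainThrottle:
    """Track the last request per domain and report how long to wait."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_request = {}      # domain -> time of the last request

    def wait_time(self, url):
        domain = urlparse(url).netloc.lower()
        last = self.last_request.get(domain)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def record(self, url):
        self.last_request[urlparse(url).netloc.lower()] = self.clock()


# Rule 2: let a library decide which URLs robots.txt permits.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
allowed = rules.can_fetch("*", "http://www.example.com/a.html")
blocked = rules.can_fetch("*", "http://www.example.com/private/x.html")
```

A crawler would call `time.sleep(throttle.wait_time(url))` before each request and `throttle.record(url)` right after it.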

### Frontier Management
The frontier is the data structure you use to store pages you need to crawl. For each page, the frontier should store the canonicalized page URL and the in-link count to the page from other pages you have already crawled. After processing a batch of URLs, you should locally rearrange the frontier by some or all of the following criteria (choosing a proper data structure for the frontier can make a big difference):

1. Seed URLs should always be crawled first. You can add more seed URLs on the topic assigned to you.
2. Use the BFS "wave number" as the baseline graph traversal (the variations below are encouraged).
3. Prefer pages with higher in-link counts.
4. Prefer URLs with matching keywords in the link or in the anchor text.
5. Prefer URLs extracted from a relevant page.
6. Prefer certain domains.
7. Prefer recent URLs.
8. If multiple pages have maximal in-link counts, choose the option which has been in the queue the longest.
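
One possible way to combine criteria 1, 2, 3 and 8 is a heap ordered by a tuple key. The class `SimpleFrontier` below is a sketch under that assumption, not the only valid design; it treats seeds as wave 0 and ignores the keyword, relevance, domain and recency criteria.

```python
import heapq
from itertools import count


class SimpleFrontier:
    """Min-heap keyed by (wave, -in_link_count, arrival order)."""

    def __init__(self):
        self._heap = []
        self._arrival = count()   # ties go to the URL queued the longest

    def push(self, url, wave, in_link_count=0):
        key = (wave, -in_link_count, next(self._arrival))
        heapq.heappush(self._heap, (key, url))

    def pop(self):
        _key, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = SimpleFrontier()
frontier.push("http://b.example.com/x", wave=1, in_link_count=2)
frontier.push("http://c.example.com/y", wave=1, in_link_count=5)
frontier.push("http://a.example.com/seed", wave=0)
frontier.push("http://d.example.com/z", wave=1, in_link_count=5)
order = [frontier.pop() for _ in range(len(frontier))]
```

Because in-link counts change during the crawl, a real frontier would need to re-push or re-sort entries after each batch; this sketch fixes the key at push time.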

If the next page in the frontier is at a domain you have recently crawled a page from and you do not wish to wait, then you should crawl the next page from a different domain instead.
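
The skip-ahead rule can be sketched as a scan over the frontier in priority order. The function `next_ready_url` and the `busy_domains` set are hypothetical names; the sketch assumes the caller maintains the set of domains crawled within the last second.

```python
from urllib.parse import urlparse


def next_ready_url(ordered_urls, busy_domains):
    """Remove and return the first URL whose domain is not busy, else None."""
    for index, url in enumerate(ordered_urls):
        if urlparse(url).netloc.lower() not in busy_domains:
            return ordered_urls.pop(index)
    return None


pending = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.org/c.html",
]
chosen = next_ready_url(pending, busy_domains={"www.example.com"})
```

The skipped URLs stay in the list in their original order, so they are retried first once their domain is free again.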

In [None]:
import json
import math
import queue
import threading
import time
import urllib.robotparser
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from time import sleep
from urllib.error import HTTPError
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

def canonicalize_url(url):
    """
    Canonicalize a URL according to these rules:

    1. Convert the scheme and host to lower case:
       HTTP://www.Example.com/SomeFile.html → http://www.example.com/SomeFile.html
    2. Remove port 80 from HTTP URLs and port 443 from HTTPS URLs:
       http://www.example.com:80 → http://www.example.com
    3. Make relative URLs absolute:
       if you crawl http://www.example.com/a/b.html and find the URL ../c.html,
       it should canonicalize to http://www.example.com/c.html.
    4. Remove the fragment, which begins with #:
       http://www.example.com/a.html#anything → http://www.example.com/a.html
    5. Remove duplicate slashes:
       http://www.example.com//a.html → http://www.example.com/a.html
    """
    # use the urlparse to parse the URL
    parse_res = urlparse(url) # ParseResult(scheme='https', netloc='www.mphonline.org', path='/worst-pandemics-in-history/', params='', query='', fragment='')
    
    # 1. Convert scheme and host to lower case
    scheme = parse_res.scheme.lower()
    netloc = parse_res.netloc.lower()
 
    # 2. Remove default port 80 for HTTP and 443 for HTTPS
    if scheme == 'http' and parse_res.port == 80:
        netloc = parse_res.hostname
    elif scheme == 'https' and parse_res.port == 443:
        netloc = parse_res.hostname
 
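    path = parse_res.path
    # 3. Root the path. Resolving a relative link such as ../c.html needs the
    #    URL of the page it was found on, which this function does not receive.
    if not path.startswith('/'):
        path = '/' + path

    # 4. Remove the fragment: make the fragment empty
    fragment = ''

    # 5. Remove duplicate slashes, including runs of three or more
    while '//' in path:
        path = path.replace('//', '/')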

    black_list = (".jpg", ".svg", ".png", ".pdf", ".gif", "youtube", "amazon")
 
    if any(keyword in path for keyword in black_list):
        return "Invalid URL"
    
    # use urlunparse to reconstruct the URL
    canonicalized_url = urlunparse((scheme, netloc, path, parse_res.params, parse_res.query, fragment))
    return canonicalized_url

SEEDS = ["https://www.cdc.gov/flu/pandemic-resources/2009-h1n1-pandemic.html", 
        "https://en.wikipedia.org/wiki/2009_swine_flu_pandemic",
        "https://www.google.com/search?q=h1n1+swine+flu+pandemic&oq=H1N1+Swine+Flu+pandemic&aqs=chrome.0.0l3j0i22i30l2.1219j0j9&sourceid=chrome&ie=UTF-8"]

class RobotTimeout(urllib.robotparser.RobotFileParser):
    def __init__(self, url='', timeout=3):
        super().__init__(url)
        self.timeout = timeout

    def read(self):
        self.disallow_all = False
        self.allow_all = False
        try:
            with urlopen(self.url, timeout=self.timeout) as f:
                raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
        except HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400:
                self.allow_all = True

class Robots:
    def __init__(self, url):
        self.url = url
        # Download and parse robots.txt once per domain, then reuse the parser.
        self.parser = RobotTimeout(self.url)
        self.parser.read()
        self.delay = self.get_delay()

    def get_delay(self):
        delay = self.parser.crawl_delay(useragent="*")
        return delay if delay is not None else 1.0

    def fetch(self, new_url):
        # Answer from the cached rules; no extra request to the site is made.
        return self.parser.can_fetch("*", new_url)


class Frontier():
    def __init__(self, seeds):
        self.queue = queue.PriorityQueue()
        self.waves = {}
        self.frontier_entity_objects = {}

        self.waves[0] = set()
        for curr_link in seeds:
            frontier_entity = FrontierEntity(curr_link)
            frontier_entity.update_score()
            self.frontier_entity_objects[curr_link] = frontier_entity
            self.waves[0].add(curr_link)
            self.queue.put((0, frontier_entity.score, curr_link))
        
    def empty(self):
        return self.queue.empty()
        
    def pop(self):
        return self.queue.get()

    def push(self, frontier_entity, wave_number):
        self.frontier_entity_objects[frontier_entity.url] = frontier_entity
        # setdefault stores the new set; get() would discard it for unseen waves
        self.waves.setdefault(wave_number, set()).add(frontier_entity.url)
        self.queue.put((wave_number, frontier_entity.score, frontier_entity.url))
    
    def update_in_links(self, url, in_links):
        self.frontier_entity_objects[url].update_in_links(in_links)


class FrontierEntity():
    def __init__(self, url):
        self.key_words = ["pandemic", "flu", "h1n1", "swine"]
        self.in_links = set()
        self.score = 0
        self.url = url

    def update_score(self):
        for word in self.key_words:
            # Case insensitive
            if self.url.lower().count(word) > 0:
                self.score += 1
        keyword_match_score = math.exp(-self.score)
        inlink_score = math.exp(-len(self.in_links))
        self.score = keyword_match_score + inlink_score

        # Score should be quality + relevance
        # Quality : wave_number, in_links, topic_relevance
        
    
    def update_in_links(self, in_links):
        self.in_links.update(in_links)

import concurrent.futures
import logging
from tqdm import tqdm

# Configure logging
logging.basicConfig(filename='trial_run/crawler.log', level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()

class ParallelCrawler:
    def __init__(self, seed_urls):
        self.seed_urls = set(seed_urls)
        self.frontier = Frontier(seed_urls)
        self.canonicalizer = canonicalize_url

        self.processed_lins = set(seed_urls)
        self.crawled_links = set()
        self.redirects = {}

        self.counter = 0
        self.global_doc_no = 0
        self.crawl_limit = 30000
        
        self.out_links = {}
        
        self.robots_info = {}
        self.robots_delay = {}
        self.robots_timer = {}
        self.time_out = 3
        self.file_lock = threading.Lock()
        
        self.executor_for_domain = {}
    
    def update_robots_info(self, domain):
        robot = Robots("http://" + domain + "/robots.txt")
        self.robots_info[domain] = robot
        # Wait at least one second, and longer if robots.txt asks for it
        self.robots_delay[domain] = max(robot.delay, 1)
        self.robots_timer[domain] = datetime.now()

    def crawl(self):
        with concurrent.futures.ThreadPoolExecutor(max_workers=400) as executor:
            while True:
                if self.counter >= self.crawl_limit:
                    logger.info("Crawling finished")
                    logger.info(f"Number of URLs crawled: {self.counter}")
                    logger.info(f"Is empty:{self.frontier.empty()}")
                    self._write_inlinks()
                    # Stop submitting work once the crawl limit is reached
                    break
                wave_number, score, url = self.frontier.pop()
                executor.submit(self._process_url, url, wave_number)


    def _process_url(self, url, current_wave_number):
        curr_domain = urlparse(url).netloc
        if curr_domain not in self.robots_info:
            try:
                self.update_robots_info(curr_domain)
            except Exception as e:
                logger.info(f"Error : {str(e)} in updating robots.txt for domain: {curr_domain} when processing URL: {url}")
                return
        if not self.robots_info[curr_domain].fetch(url):
            logger.info(f"Cannot fetch URL as per the robots.txt for URL: {url}")
            return
        else:
            if curr_domain in self.executor_for_domain:
                domain_executor = self.executor_for_domain[curr_domain]
                domain_executor.submit(self._process_url_in_executor, url, current_wave_number)
                # Check for remaining work first: pop() blocks on an empty queue
                if not self.frontier.empty():
                    wave_number, score, next_url = self.frontier.pop()
                    self._process_url(next_url, wave_number)
            else:
                domain_executor = ThreadPoolExecutor(max_workers=1)
                self.executor_for_domain[curr_domain] = domain_executor
                domain_executor.submit(self._process_url_in_executor, url, current_wave_number)

    def _process_url_in_executor(self, url, current_wave_number):
        curr_domain = urlparse(url).netloc
        # total_seconds() keeps the fractional part; .seconds truncates to whole seconds
        time_difference = (datetime.now() - self.robots_timer[curr_domain]).total_seconds()
        if time_difference < self.robots_delay[curr_domain]:
            time.sleep(self.robots_delay[curr_domain] - time_difference)
        header_response = self._get_header(url)

        if header_response == "NA":
            logger.info(f"Error in getting header response for URL: {url}")
            return
        if "content-type" in header_response.headers:
            content_type = header_response.headers["content-type"]
        else:
            content_type = "text/html"

        if 'text/html' not in content_type:
            logger.info(f"Content type is not text/html for URL: {url}")
            return
        else:
            page_info = self._get_page_info(url, current_wave_number)
            if not page_info:
                return
            else:
                self.counter += 1

    def _get_header(self, url):
        headers = {"Connection": "close"}
        try:
            response = requests.head(url, headers=headers, timeout=self.time_out, allow_redirects=True)
            return response
        except requests.exceptions.RequestException as e:
            logger.info(str(e))
            return "NA"

    def _get_page_info(self, url, current_wave_number):
        headers = {"Connection": "close"}
        try:
            response = requests.get(url, headers=headers, timeout=self.time_out, allow_redirects=True)
        except requests.exceptions.RequestException as e:
            logger.info(str(e))
            return False
        bs = BeautifulSoup(response.text, "html.parser")

        lang = 'en'
        try:
            html_tag = bs.select("html")[0]
            if html_tag.has_attr("lang"):
                lang = html_tag["lang"]
        except Exception:
            # No usable <html> element; keep the default language
            pass
        
        # base_url is the url after redirection
        base_url = response.url
        domain = urlparse(url).netloc
        self.robots_timer[domain] = datetime.now()
        
        if not self.robots_info[domain].fetch(base_url):
            return False
        if not self._can_crawl(base_url, lang):
            return False
        if base_url in self.crawled_links:
            out_links = self.out_links.get(base_url, set())
            self.frontier.frontier_entity_objects[base_url].update_in_links(out_links)
            return False
        else:
            self.crawled_links.add(base_url)
            frontier_entity = FrontierEntity(base_url)
            frontier_entity.in_links = self.frontier.frontier_entity_objects[url].in_links
            frontier_entity.update_score()
            self.frontier.frontier_entity_objects[base_url] = frontier_entity
            self.redirects[url] = base_url

            out_links = self._get_outlinks(bs)
            processed_out_links = set()

            content = self._extract_text(bs)
            
            if len(set(content)) < 5 or len(content) < 50:
                return False
            # bs.title.string is None when <title> is empty or contains nested tags
            if bs.title and bs.title.string:
                if "Page Not Found" in bs.title.string:
                    return False
                
            number_of_outlinks = len(out_links)

            self._write_contents(base_url, content, current_wave_number,number_of_outlinks , bs.title)

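            for link in out_links:
                canonicalized_link = self.canonicalizer(link)
                # canonicalize_url returns this sentinel for blacklisted links
                if canonicalized_link == "Invalid URL":
                    continue
                processed_out_links.add(canonicalized_link)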
                if canonicalized_link not in self.processed_lins:
                    frontier_entity = FrontierEntity(canonicalized_link)
                    frontier_entity.update_in_links({base_url})
                    frontier_entity.update_score()

                    self.frontier.push(frontier_entity, current_wave_number + 1)
                    self.processed_lins.add(canonicalized_link)
                elif canonicalized_link in self.redirects:
                    self.frontier.update_in_links(self.redirects[canonicalized_link], {base_url})
                else:
                    self.frontier.update_in_links(canonicalized_link, {base_url})

        self._write_outlinks({base_url: list(processed_out_links)})
        if self.counter % 1000 == 0:
            self._write_inlinks()
        return True

    def _can_crawl(self, base_url, lang):
        black_list = (".jpg", ".svg", ".png", ".pdf", ".gif", "youtube", "amazon")
        if "en" not in lang.lower():
            return False
        elif any(keyword in base_url for keyword in black_list):
            return False
        return True
        
    def _extract_text(self, bs):
        content = ""
        text_lists = bs.find_all(["p"])
        for text in text_lists:
            content += text.get_text().replace("\n", " ").replace("\t", " ")
        return content

    def _get_outlinks(self, bs):
        out_links = set()
        for link in bs.find_all("a"):
            href = link.get("href")
            if href is not None and href.startswith(('http://', 'https://')):
                out_links.add(href)
        return out_links
        

    def _write_contents(self, url, text, wave_number, no_outlinks, title=None):
        with self.file_lock:
            with open("trial_run/documents.txt", "a", encoding="utf-8") as file:
                file.write("<DOC>\n")
                self.global_doc_no += 1
                file.write(f"<DOCNO>ANSON-{self.global_doc_no}:{url}</DOCNO>\n")
                # add number of inlinks and outlinks
                file.write(f"<WAVENO>{wave_number}</WAVENO>\n")
                file.write(f"<OUTLINKNO>{no_outlinks}</OUTLINKNO>\n")
                file.write(f"<TIME>{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</TIME>\n")
                if title:
                    file.write("<HEAD><TITLE>{}</TITLE></HEAD>\n".format(title.string))
                file.write("<TEXT>\n")
                file.write(text + "\n")
                file.write("</TEXT>\n")
                file.write("</DOC>\n")
                tmp_que_size = self.frontier.queue.qsize()
                logger.info(f"\nRemaining elements in the queue: {tmp_que_size}")
                logger.info(f"Crawled {self.global_doc_no} documents")
                print(f"Crawled {self.global_doc_no} documents")
    
    def _write_inlinks(self):
        with self.file_lock:
            inlinks = {}
            # Iterate over a snapshot; worker threads may add to the set meanwhile
            for url in list(self.crawled_links):
                inlinks[url] = list(self.frontier.frontier_entity_objects[url].in_links)
            with open("trial_run/inlinks.json", "w", encoding="utf-8") as file:
                json.dump(inlinks, file)
    
    def _write_outlinks(self, outlinks):
        with self.file_lock:
            with open("trial_run/outlinks.json", "a", encoding="utf-8") as file:
                # One JSON object per line, so the appended file stays parseable
                json.dump(outlinks, file)
                file.write("\n")


parallel_crawler = ParallelCrawler(SEEDS)
parallel_crawler.crawl()


Crawled 1 documents
Crawled 2 documents
Crawled 3 documents
Crawled 4 documents
Crawled 5 documents
Crawled 6 documents
Crawled 7 documents
Crawled 8 documents
Crawled 9 documents
Crawled 10 documents
Crawled 11 documents
Crawled 12 documents
Crawled 13 documents
Crawled 14 documents
Crawled 15 documents
Crawled 16 documents
Crawled 17 documents
Crawled 18 documents
Crawled 19 documents
Crawled 20 documents
Crawled 21 documents
Crawled 22 documents
Crawled 23 documents
Crawled 24 documents
Crawled 25 documents
Crawled 26 documents
Crawled 27 documents
Crawled 28 documents
Crawled 29 documents
Crawled 30 documents
Crawled 31 documents
Crawled 32 documents
Crawled 33 documents
Crawled 34 documents
Crawled 35 documents
Crawled 36 documents
Crawled 37 documents
Crawled 38 documents
Crawled 39 documents
Crawled 40 documents
Crawled 41 documents
Crawled 42 documents
Crawled 43 documents
Crawled 44 documents
Crawled 45 documents
Crawled 46 documents
Crawled 47 documents
Crawled 48 documents
C



Crawled 5508 documentsCrawled 5509 documents
Crawled 5510 documents
Crawled 5511 documents
Crawled 5512 documents
Crawled 5513 documents
Crawled 5514 documents
Crawled 5515 documents
Crawled 5516 documents
Crawled 5517 documentsCrawled 5518 documents
Crawled 5519 documents
Crawled 5520 documents
Crawled 5521 documents
Crawled 5522 documents
Crawled 5523 documents
Crawled 5524 documents
Crawled 5525 documents
Crawled 5526 documents
Crawled 5527 documents
Crawled 5528 documents
Crawled 5529 documents
Crawled 5530 documents
Crawled 5531 documents
Crawled 5532 documents
Crawled 5533 documents
Crawled 5534 documents
Crawled 5535 documents
Crawled 5536 documents
Crawled 5537 documents
Crawled 5538 documents
Crawled 5539 documents
Crawled 5540 documents
Crawled 5541 documentsCrawled 5542 documents
Crawled 5543 documents
Crawled 5544 documents
Crawled 5545 documents
Crawled 5546 documents
Crawled 5547 documents
Crawled 5548 documents
Crawled 5549 documents
Crawled 5550 documents
Crawled 5551 d



TypeError: _write_inlinks() takes 1 positional argument but 2 were given

Crawled 57843 documents
Crawled 57844 documents
Crawled 57845 documents
Crawled 57846 documents
Crawled 57847 documents
Crawled 57848 documents
Crawled 57849 documents
Crawled 57850 documents
Crawled 57851 documents
Crawled 57852 documents
Crawled 57853 documents
Crawled 57854 documents
Crawled 57855 documents
Crawled 57856 documents
Crawled 57857 documents
Crawled 57858 documents
Crawled 57859 documents
Crawled 57860 documents
Crawled 57861 documents
Crawled 57862 documents
Crawled 57863 documents
Crawled 57864 documents
Crawled 57865 documents
Crawled 57866 documents
Crawled 57867 documents
Crawled 57868 documents
Crawled 57869 documents
Crawled 57870 documents
Crawled 57871 documents
Crawled 57872 documents
Crawled 57873 documents
Crawled 57874 documents
Crawled 57875 documents
Crawled 57876 documents
Crawled 57877 documents
Crawled 57878 documentsCrawled 57879 documents
Crawled 57880 documents
Crawled 57881 documents
Crawled 57882 documents
Crawled 57883 documents
Crawled 57884 doc