# 1. Crawler

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [217]:
import argparse
import os
import re
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # start with http*, ends if ? or # appears (or none of)
        m = re.search("^http.*/([^/\?#]*)[\?#]?", url)
        filename = m.group(1)
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='download file.')
    parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
    parser.add_argument("url", type=str, default=None, help="Provide URL here")
    args = parser.parse_args()
    wget(args.url, args.filename)

usage: ipykernel_launcher.py [-h] [-O FILENAME] url
ipykernel_launcher.py: error: unrecognized arguments: -f


SystemExit: 2

### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [176]:
from urllib.parse import urlparse # for retrieving file extension from url
from hashlib import sha512 # for filename hashing
import requests

class Document:
    
    def __init__(self, url):
        self.url = url
        
        hashed_filename = sha512(url.encode()).hexdigest()
        extension = urlparse(url).path.split('.') 
        if len(extension) > 1: # if there is an extension
            self.name = f"{hashed_filename}.{extension[-1]}"
            self.ext = extension[-1]
        else: # if there is no extension treat it as html
            self.name = f"{hashed_filename}.html"
            self.ext = "html"
        
        os.makedirs("files", exist_ok=True) # create files directory if it does not exist
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        downloaded_content = requests.get(self.url, timeout=5) # get content from url with timeout of 5 seconds to avoid spending too much time on one file
        if downloaded_content.status_code == 200: # if status code is 200 (OK)
            self.content = downloaded_content.content # save content
            return True
        return False
    
    def persist(self):
        with open(f"files/{self.name}", 'wb') as f: # open file in write binary mode
            f.write(self.content) # write content to file
        pass
            
    def load(self):
        if os.path.exists(f"files/{self.name}"): # if file exists
            with open(f"files/{self.name}", 'rb') as f: # open file in read binary mode
                self.content = f.read() # read content from file
                return True
        return False

### 1.1.1. Tests ###

In [177]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.2. [10] Parse HTML
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
1. `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
2. `self.images` list of images met in a document. Again, links can be relative to current page.
3. `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [178]:
from bs4 import BeautifulSoup
import urllib.parse

class HtmlDocument(Document):
    
    def parse(self):
        soup = BeautifulSoup(self.content, 'html.parser') # parse html
        self.anchors = self.parse_anchors(soup=soup) 
        self.images = self.parse_images(soup=soup)
        self.text = self.parse_text(soup=soup)
        
    def parse_anchors(self, soup):
        anchors = []
        for link in soup.find_all('a'): 
            href = link.get('href') 
            if href: 
                anchors.append((link.text, urllib.parse.urljoin(self.url, href))) # add tuple of text and url to anchors
        return anchors

    def parse_images(self, soup):
        images = []
        for image in soup.find_all('img'):
            src = image.get('src') 
            if src:
                images.append(urllib.parse.urljoin(self.url, src)) # add url to images
        return images

    def parse_text(self, soup): 
        # since we are parsing html, we can use soup.find('body') to get body which will contain all plain texts without scripts, tags, comments and so on 
        # and it will also help us later on in the document analysis
        text = soup.find('body').text.strip() # get text from body
        return text

### 1.2.1. Tests

In [179]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

## 1.3. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria to succeed in the task**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained from inside `<body>` tag only.

In [180]:
#references: 
# https://www.w3tweaks.com/html-file-extensions.html#jspextension
# https://fileinfo.com/filetypes/web
# https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML#What_is_HTML
# https://www.w3schools.com/whatis/whatis_html.asp

from collections import Counter

class HtmlDocumentTextData:
    # Due to the class name I assume that it is responsible for text data extraction from ONLY html documents
    # So I decided to check the extension of the document and raise an error if it is not one of the most common html extensions ('html', 'htm', 'php', 'do', 'asp', 'aspx', 'jsp', 'cfm', 'shtml', 'xhtml')
    # but since there are too many html extensions, i decided to check if the extension is not wanted ('pdf', 'mp3', 'avi', 'mp4', 'txt') instead of checking if it is wanted
    # if it is not wanted then raise an error
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        if self.doc.ext in ('pdf', 'mp3', 'avi', 'mp4', 'txt'): # if extension is not an html extension
            raise ValueError("Only html documents are supported, got: " + self.doc.ext) # raise error
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        # since we already parsed the text from the body in the previous part, we can just split it by line breaks
        result = self.doc.text.splitlines() # split text by lines, because sentences are usually separated by line breaks
        return result
    
    def get_word_stats(self):
        result = Counter()
        for sentence in self.get_sentences(): # for each sentence
            for word in sentence.split(): # split sentence by words
                word = word.lower() # convert word to lowercase
                while word and not word[0].isalnum(): # remove non-alphanumeric characters from the beginning of the word
                    word = word[1:]
                while word and not word[-1].isalnum(): # remove non-alphanumeric characters from the end of the word
                    word = word[:-1]
                if word == '': # if word is empty string, skip it. It happens when there are only non-alphanumeric characters in the word
                    continue
                result[word] += 1 # add word to counter
        return result

### 1.3.1. Tests ###

In [181]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 42), ('в', 22), ('иннополис', 20), ('с', 13), ('на', 12), ('университет', 11), ('университета', 11), ('центр', 10), ('для', 9), ('по', 8)]


## 1.4. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [185]:
from queue import Queue

class Crawler:
    
    def crawl_generator(self, source, depth=1): 
        queue = Queue() # queue that will be used to store urls and their depth
        queue.put((source, 0)) 
        visited = set() # set that will be used to store visited urls
        while not queue.empty(): # while queue is not empty
            url, level = queue.get() # get url and level from queue
            if url in visited: # if url is already visited, skip it
                continue
            visited.add(url) # add url to visited set
            try: # try to get and parse document
                yield HtmlDocumentTextData(url) # yield document
                if level < depth: # if level is less than depth
                    site = HtmlDocumentTextData(url) 
                    for anchor in site.doc.anchors: # for each anchor
                        queue.put((anchor[1], level + 1)) # add url and level to queue
            except Exception as e: # if there is an error print it and continue
                print("Error processing {}: {}".format(url, e))
        

### 1.4.1. Tests ###

In [186]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    # since not all urls end with the extension i decided to make another if block to check if c.doc.ext is in the list of extensions
    # THE UPCOMING BLOCK IS ADDED BY ME
    if c.doc.ext in ('pdf', 'mp3', 'avi', 'mp4', 'txt'):
        print("Skipping", c.doc.url)
        continue
    # THE PREVIOUS BLOCK IS ADDED BY ME
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
334 distinct word(s) so far
https://innopolis.university/
889 distinct word(s) so far
https://apply.innopolis.university/en
1522 distinct word(s) so far
Error processing https://innopolis.university/proekty/activity/en": https://innopolis.university/proekty/activity/en"
https://media.innopolis.university/en
1592 distinct word(s) so far
https://innopolis.university/lk/
1604 distinct word(s) so far
https://innopolis.university/en/about/
1788 distinct word(s) so far
https://innopolis.university/en/board/
1867 distinct word(s) so far
https://innopolis.university/en/team/
1868 distinct word(s) so far
https://innopolis.university/en/team-structure/
1871 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
1872 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
1876 distinct word(s) so far
https://innopolis.university/en/faculty/
2776 distinct word(s) so far
https://career.innopolis.univer

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://summer.yonsei.ac.kr/home/
27098 distinct word(s) so far
https://summer.yonsei.ac.kr/home/program/courses.asp?cateNo=0
27259 distinct word(s) so far
https://drive.google.com/file/d/1EX9bJQuplXQQSApSEqQ5F0SCyfXBOCE6/view
27259 distinct word(s) so far
https://international.jnu.ac.kr/Inbound/SummerSession/OverView
27344 distinct word(s) so far
https://international.jnu.ac.kr/Inbound/SummerSession/Course
27452 distinct word(s) so far
Error processing https://international.ui.ac.id/wp-content/uploads/2020/02/CNUISS2020-Brochure_.pdf: Only html documents are supported, got: pdf
https://www.hufsiss.online/home
27572 distinct word(s) so far
https://www.hufsiss.online/about/course-info
27801 distinct word(s) so far
https://summer.skku.edu/summer/index.do
27829 distinct word(s) so far
https://summer.skku.edu/summer/program/Course_DATA.do
27845 distinct word(s) so far
https://summer.skku.edu/summer/registration/scholarships.do
27853 distinct word(s) so far
https://oidb.metu.edu.tr/en/summe