# 1. Download and persist #
Please complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- For a given URL, the `download()` method downloads binary data and stores it in `self.content`. It returns `True` on success and `False` otherwise.
- The `persist()` method saves `self.content` to disk, so that the same document is not downloaded more than once.
- The `load()` method loads the data from the hard drive. It returns `True` on success.
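
The caching idea behind these three methods can be sketched without any network access. The helper names `cache_path`, `persist_bytes` and `load_bytes`, the `directory` parameter and the `example.com` URL below are assumptions made for illustration; they are not part of the `Document` class. The file name is derived from the URL by percent-encoding it and replacing the remaining slashes.

```python
import os
import tempfile
from urllib.parse import quote


def cache_path(url, directory):
    """Map a URL to a flat file name inside `directory` (hypothetical helper)."""
    # quote() percent-encodes ':' but keeps '/', so slashes are replaced by hand.
    return os.path.join(directory, quote(url).replace('/', '_'))


def persist_bytes(url, content, directory):
    """Write downloaded bytes to the cache file for `url`."""
    with open(cache_path(url, directory), 'wb') as f:
        f.write(content)


def load_bytes(url, directory):
    """Return cached bytes for `url`, or None if nothing was cached yet."""
    try:
        with open(cache_path(url, directory), 'rb') as f:
            return f.read()
    except OSError:
        return None


url = 'http://example.com/data/iu.txt'   # hypothetical URL
cache_dir = tempfile.mkdtemp()           # throwaway directory for the demo

print(os.path.basename(cache_path(url, cache_dir)))  # http%3A__example.com_data_iu.txt
print(load_bytes(url, cache_dir))                    # None: nothing cached yet
persist_bytes(url, b'payload', cache_dir)
print(load_bytes(url, cache_dir))                    # b'payload'
```

A second request for the same URL can then be served from disk, which is the reason `get()` tries `load()` before `download()`.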

The tests check that your code works.

In [0]:
import requests
from urllib.parse import quote

class Document:
    
    def __init__(self, url):
        self.url = url
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
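    def download(self):
        # Fetch the raw bytes; any network or HTTP failure is reported as False.
        try:
            response = requests.get(self.url)
            if response.status_code == 200:
                self.content = response.content
                return True
            return False
        except Exception:
            # Catch Exception rather than using a bare except, so that
            # KeyboardInterrupt and SystemExit still propagate.
            return False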
    
    def persist(self):
        with open(quote(self.url).replace('/', '_'), 'wb') as f:
            f.write(self.content)
            
    def load(self):
        # Read a previously persisted copy from disk, if there is one.
        try:
            with open(quote(self.url).replace('/', '_'), 'rb') as f:
                self.content = f.read()
            return True
        except Exception:
            # Covers a missing file as well as URLs that cannot be turned into a file name.
            return False

## 1.1. Tests ##

In [0]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

# 2. Parse HTML #
`BeautifulSoup` is the de facto standard library for parsing XML and HTML documents in Python. Use it to complete the `parse()` method, which extracts the document contents. You should initialize:
- `self.anchors`: a list of `('text', 'url')` tuples for the links found in the document. Be aware that some links are relative; use `urllib.parse.urljoin()` to resolve them.
- `self.images`: a list of the images found in the document. Again, links can be relative.
- `self.text`: the plain text of the document, without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.
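
As a small illustration of how relative links are resolved, the sketch below calls `urllib.parse.urljoin()` on a few typical `href` values. The base URL and the link values are hypothetical, and `resolve` is an illustrative helper rather than a method of `HtmlDocument`.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the links were found on.
BASE = "http://example.com/docs/index.html"


def resolve(base, href):
    """Turn an href found in a page into an absolute URL (None stays None)."""
    if href is None:
        return None  # e.g. an <a> tag without an href attribute
    return urljoin(base, href)


print(resolve(BASE, "images/logo.png"))    # http://example.com/docs/images/logo.png
print(resolve(BASE, "/images/logo.png"))   # http://example.com/images/logo.png
print(resolve(BASE, "../about.html"))      # http://example.com/about.html
print(resolve(BASE, "https://other.example.org/x"))  # https://other.example.org/x
```

Note that `urljoin()` returns an absolute `href` unchanged, so it is safe to call it on every link, whether relative or not.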

In [0]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):
    
    def normalize(self, href):
        if href is not None and href[:4] != 'http':
            href = urllib.parse.urljoin(self.url, href)
        return href
    
    
    def parse(self):
        
        def tag_visible(element):
            if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
                return False
            if isinstance(element, Comment):
                return False
            return True
            
        
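        # Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
        model = BeautifulSoup(self.content, 'html.parser')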
        
        self.anchors = []
        a = model.find_all('a')
        for anchor in a:
            href = self.normalize(anchor.get('href'))
            text = anchor.text
            self.anchors.append((text, href))
                        
        self.images = []
        i = model.find_all('img')
        for img in i:
            href = self.normalize(img.get('src'))
            self.images.append(href)
        
        # find_all(string=True) returns every text node; findAll(text=True) is the deprecated spelling.
        texts = model.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        self.text = u" ".join(t.strip() for t in visible_texts)

## 2.1. Tests ##

In [0]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "тестирующий сервер codetest" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/phone.png" in doc.images, "Error parsing images"
assert any(p[1] == "http://university.innopolis.ru/" for p in doc.anchors), "Error parsing links"

# 3. Document analysis #
Complete the code for the `HtmlDocumentTextData` class. Implement word (and sentence) splitting. Your `get_word_stats()` method should return a `Counter` object. Don't forget to lowercase your words.
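
A minimal sketch of the two splitting steps, using only the standard library, might look as follows. The function names and the sample text are assumptions made for illustration, and the regular-expression sentence splitter is a deliberately naive stand-in, not a reproduction of what `nltk` does.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p.strip() for p in parts if p.strip()]


def word_stats(text):
    """Count lowercased words, skipping tokens that start with a digit."""
    words = re.findall(r'\w+', text.lower())
    return Counter(w for w in words if not w[0].isdigit())


sample = "Hello, hello world! 42 cats."
print(split_sentences(sample))            # ['Hello, hello world!', '42 cats.']
print(word_stats(sample).most_common(1))  # [('hello', 2)]
```

Lowercasing before counting is what makes `Hello` and `hello` fall into the same bucket.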

In [0]:
from collections import Counter
import nltk
import re

class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        nltk.download('punkt')
        sentences = nltk.tokenize.sent_tokenize(self.doc.text)
        result = []
        for sent in sentences:
            for line in sent.split('\n'):
                if line.strip():
                    result.append(line.strip())
        return result
    
    def get_word_stats(self):
        # Raw string: '\W' in a normal string literal is an invalid escape sequence.
        return Counter([word.lower() for word in re.split(r'\W', self.doc.text) if word and not word[0].isdigit()])

## 3.1. Tests ##

In [0]:
doc = HtmlDocumentTextData("https://university.innopolis.ru")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 52), ('в', 37), ('иннополис', 32), ('по', 30), ('ул', 25), ('на', 24), ('из', 18), ('ост', 16), ('университет', 15), ('ит', 14)]


# 4. Crawling #

The `crawl_generator()` method is given a starting URL (`source`) and the maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects, yielding each document as soon as it is downloaded and parsed. The Python `yield obj_name` construct is useful here. Use the `anchors` field of the parsed document (`HtmlDocumentTextData.doc.anchors`) to go deeper.
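
The control flow can be tried out on an in-memory link graph before touching the network. In the sketch below, the `LINKS` dictionary and the `crawl_links` function are hypothetical stand-ins for real pages and for `crawl_generator()`; they only show how a breadth-first queue, a visited set and `yield` fit together.

```python
from collections import deque

# Hypothetical site: each key is a page, each value lists the pages it links to.
LINKS = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],
    'c': ['d'],
    'd': [],
}


def crawl_links(links, source, depth=1):
    """Yield pages breadth-first, following links up to `depth` levels."""
    queue = deque([(source, 0)])
    visited = set()
    while queue:
        url, url_depth = queue.popleft()
        if url in visited:
            continue  # already yielded via another path
        visited.add(url)
        # Only enqueue outgoing links while the next level is still within depth.
        if url_depth + 1 < depth:
            for target in links.get(url, []):
                queue.append((target, url_depth + 1))
        yield url  # hand the page to the caller before processing the next one


print(list(crawl_links(LINKS, 'a', depth=1)))  # ['a']
print(list(crawl_links(LINKS, 'a', depth=2)))  # ['a', 'b', 'c']
print(list(crawl_links(LINKS, 'a', depth=3)))  # ['a', 'b', 'c', 'd']
```

Because the function is a generator, the caller can start processing the first page while later pages are still waiting in the queue.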

In [0]:
from queue import Queue

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        q = Queue()
        q.put((source, 0))
        visited = set()
        while not q.empty():
            url, url_depth = q.get()
            if url not in visited:
                visited.add(url)
                try:
                    doc = HtmlDocumentTextData(url)
                    for a in doc.doc.anchors:
                        if url_depth + 1 < depth:
                            q.put((a[1], url_depth + 1))
                    yield doc
                except FileNotFoundError:
                    print("Analyzing", url, "led to FileNotFoundError")

## 4.1. Tests ##

In [0]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://university.innopolis.ru/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

https://university.innopolis.ru/en/
395 distinct word(s) so far
https://university.innopolis.ru/
899 distinct word(s) so far
https://university.innopolis.ru/en/?special=Y
910 distinct word(s) so far
https://university.innopolis.ru/en/about/
1047 distinct word(s) so far
https://university.innopolis.ru/en/about/city
1112 distinct word(s) so far
https://university.innopolis.ru/en/about/board
1172 distinct word(s) so far
https://university.innopolis.ru/en/about/job
1430 distinct word(s) so far
https://university.innopolis.ru/en/about/structure
1571 distinct word(s) so far
https://university.innopolis.ru/en/about/teaching-composition/
1648 distinct word(s) so far
https://university.innopolis.ru/upload/iblock/026/IU_AR2018_eng.pdf
Skipping https://university.innopolis.ru/upload/iblock/026/IU_AR2018_eng.pdf
https://university.innopolis.ru/en/education/
1680 distinct word(s) so far
https://university.innopolis.ru/en/education/bachelor/
1709 distinct word(s) so far
https://university.innopolis.