# Crawler
## 1. Download and persist ##
Complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- for a given URL, the `download()` method downloads binary data and stores it in `self.content`. It returns `True` on success and `False` otherwise.
- the `persist()` method saves `self.content` to the file system. This avoids repeated downloads (in other words, it acts as a cache).
- the `load()` method loads the data from disk. It returns `True` on success.

The tests check that your code works.

In [86]:
import requests
from urllib.parse import quote

class Document:
    
    def __init__(self, url):
        self.url = url
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
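    def download(self):
        try:
            response = requests.get(self.url)
        except requests.RequestException:
            # Network-level failures (DNS, refused connection) also count as a failed download.
            return False
        if response.status_code == 200:
            self.content = response.content
            return True
        return False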
    
    def persist(self):
        with open(quote(self.url).replace('/', '_'), 'wb') as file:
            file.write(self.content)
            
    def load(self):
        try:
            with open(quote(self.url).replace('/', '_'), 'rb') as file:
                self.content = file.read()
            return True
        except OSError:
            # The file is missing or unreadable, so there is no cached copy.
            return False
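
The file names above are built with `quote(self.url)`, which writes into the current directory and can produce very long names. As a minimal sketch of an alternative, assuming a hashed file name is acceptable for the cache, the helpers below show the persist/load round trip without touching the network. `cache_path()`, `persist_bytes()` and `load_bytes()` are hypothetical names that are not part of the assignment, and the URL is a placeholder.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path(url, cache_dir):
    """Map a URL to a fixed-length file name inside cache_dir (hypothetical helper)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / digest


def persist_bytes(url, content, cache_dir):
    """Write the downloaded bytes to the cache file for this URL."""
    path = cache_path(url, cache_dir)
    path.write_bytes(content)
    return path


def load_bytes(url, cache_dir):
    """Return the cached bytes, or None when nothing is cached for this URL."""
    try:
        return cache_path(url, cache_dir).read_bytes()
    except OSError:
        return None


# Round trip in a throwaway directory; the URL is only a placeholder.
with tempfile.TemporaryDirectory() as cache_dir:
    url = "http://example.com/data.txt"
    print(load_bytes(url, cache_dir))        # nothing cached yet
    persist_bytes(url, b"cached payload", cache_dir)
    print(load_bytes(url, cache_dir))        # the bytes written above
```

A hash hides the original URL in the file name, so this trades readability of the cache directory for names that are always valid and short.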

### 1.1. Tests ###

In [87]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 2. Parse HTML ##
`BeautifulSoup` is the de facto standard library for parsing XML and HTML documents in Python. Use it to complete the `parse()` method, which extracts the document contents. You should initialize:
- `self.anchors`: a list of `('text', 'url')` tuples found in the document. Be aware that links can be relative (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to resolve them.
- `self.images`: a list of image URLs found in the document. Again, links can be relative to the current page.
- `self.text`: the plain text of the document, without scripts, tags, comments and so on. See [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.
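
To make the relative-link handling concrete, here is a small sketch of how `urllib.parse.urljoin()` resolves different kinds of links against a page URL. The base URL and the link targets are made-up placeholders, not pages used in this notebook.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
base = "http://example.com/a/b/page.html"

# A parent-relative link climbs one directory up from /a/b/.
print(urljoin(base, "../content/pic.jpg"))           # http://example.com/a/content/pic.jpg

# A plain relative link stays in the directory of the page.
print(urljoin(base, "pic.jpg"))                      # http://example.com/a/b/pic.jpg

# A root-relative link keeps only the scheme and host of the page.
print(urljoin(base, "/img/logo.png"))                # http://example.com/img/logo.png

# An absolute link is returned unchanged.
print(urljoin(base, "https://other.example.org/x"))  # https://other.example.org/x
```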

In [88]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):
    
    def parse(self):
        soup = BeautifulSoup(self.content, 'lxml')
        
        self.anchors = []
        self.images = []
        self.text = ""
        
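        links = soup.find_all('a')
        for link in links:
            href = link.get('href')
            if href is None:
                # Anchors without an href (e.g. named anchors) have no target URL.
                continue
            # Resolve relative links against the page URL, as the task requires.
            self.anchors.append((link.text, urllib.parse.urljoin(self.url, href)))
        
        imgs = soup.find_all('img')
        for img in imgs:
            src = img.get('src')
            if src is None:
                continue
            self.images.append(urllib.parse.urljoin(self.url, src))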
        
       
        self.text = self.text_from_html(self.content)
    
    def tag_visible(self, element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(self, body):
        soup = BeautifulSoup(body, 'html.parser')
        # `string=True` matches every text node; it replaces the older `text=True` spelling.
        texts = soup.find_all(string=True)
        visible_texts = filter(self.tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)
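
The filtering idea (keep text nodes, drop anything inside `script`, `style` and similar tags, drop comments) does not depend on a particular parser. As a rough sketch, the same rule can be written with the standard library's `html.parser.HTMLParser` instead of BeautifulSoup. `VisibleTextParser` and `visible_text()` are hypothetical names, and the sample HTML is invented for the demonstration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside script/style/head/title (illustrative only)."""

    # Only tags with a closing tag are listed, so the depth counter stays balanced.
    SKIPPED = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside at least one skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

    # Comments arrive through handle_comment(), which does nothing by default,
    # so they never reach the collected text.


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>T</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><!-- note -->"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))  # Hello world
```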
    

### 2.1. Tests ###

In [89]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "тестирующий сервер codetest" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/phone.png" in doc.images, "Error parsing images"
assert any(p[1] == "http://university.innopolis.ru/" for p in doc.anchors), "Error parsing links"

## 3. Document analysis ##
Complete the code for the `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you like). Your `get_word_stats()` method should return a `Counter` object. Don't forget to lowercase your words.
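
As a minimal sketch of the splitting and counting steps, assuming regular expressions are good enough for the text at hand, the example below uses only the standard library. `split_sentences()`, `word_stats()`, the tiny `STOP_WORDS` set and the sample sentence are all hypothetical and chosen for illustration. A real solution may prefer a proper tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"is", "in", "the"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_stats(text, stop_words=STOP_WORDS):
    """Count lowercased words, ignoring stop words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


sample = "Innopolis University is in Innopolis. The university is young."

print(split_sentences(sample))
# ['Innopolis University is in Innopolis.', 'The university is young.']

print(word_stats(sample).most_common(2))
# [('innopolis', 2), ('university', 2)]
```

The naive sentence rule breaks on abbreviations such as "e.g. this", which is one reason to reach for a trained tokenizer on real pages.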

In [90]:
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, RegexpTokenizer
nltk.download('punkt')
nltk.download('stopwords')
# Keep the stop words in a set under their own name so the `stopwords` module is not shadowed.
russian_stopwords = set(stopwords.words("russian"))

class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        # Split the plain text into sentences with NLTK's Punkt tokenizer.
        return sent_tokenize(self.doc.text, language="russian")
    
    def get_word_stats(self):
        # Return a Counter mapping {word -> count_in_doc} over lowercased words,
        # with Russian stop words excluded.
        words = RegexpTokenizer(r'\w+').tokenize(self.doc.text.lower())
        return Counter(word for word in words if word not in russian_stopwords)
    


### 3.1. Tests ###

In [91]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('иннополис', 21), ('университета', 10), ('университет', 10), ('лаборатория', 10), ('университете', 7), ('области', 7), ('технологий', 7), ('робототехники', 7), ('12', 7), ('2020', 7)]


## 4. Crawling ##

The `crawl_generator()` method is given a starting URL (`source`) and the maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects, yielding each document as soon as it is downloaded and parsed. Python's `yield obj_name` construct is useful here. Use the `anchors` field of the parsed document (`HtmlDocumentTextData.doc.anchors`) to go deeper.
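
The traversal itself can be sketched without any network access. The example below runs a depth-limited breadth-first search over an in-memory link graph and yields each page as soon as it is visited. `LINKS` and `crawl()` are hypothetical stand-ins for real pages and for `crawl_generator()`. The depth convention (the source page counts as depth 1) is an assumption made for this sketch.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def crawl(links, source, depth=1):
    """Yield pages in breadth-first order, going at most `depth` levels deep."""
    queue = deque([(source, 1)])
    visited = {source}
    while queue:
        page, level = queue.popleft()
        if level > depth:
            # BFS levels never decrease, so every remaining page is too deep.
            return
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, level + 1))
        yield page


print(list(crawl(LINKS, "a", depth=1)))  # ['a']
print(list(crawl(LINKS, "a", depth=2)))  # ['a', 'b', 'c']
print(list(crawl(LINKS, "a", depth=3)))  # ['a', 'b', 'c', 'd']

# A generator is lazy: only the first page is processed here.
first = next(crawl(LINKS, "a", depth=3))
print(first)  # a
```

Because the function is a generator, the caller can stop iterating at any point and no further pages are processed.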

In [92]:
from collections import deque

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        # Breadth-first traversal; each queue entry is (url, depth of that url).
        queue = deque([(source, 1)])
        visited = {source}
        while queue:
            cur_url, cur_depth = queue.popleft()
            if cur_depth > depth:
                # BFS depths never decrease, so every remaining URL is too deep.
                return
            try:
                document = HtmlDocumentTextData(cur_url)
            except Exception:
                # Skip pages that fail to download or parse.
                continue
            for _, node_url in document.doc.anchors:
                if node_url not in visited:
                    visited.add(node_url)
                    queue.append((node_url, cur_depth + 1))
            # Yield outside the try block so closing the generator is not swallowed.
            yield document

### 4.1. Tests ###

In [93]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

https://innopolis.university/en/
312 distinct word(s) so far
http://old.innopolis.university/en/
564 distinct word(s) so far
https://media.innopolis.university/en/
637 distinct word(s) so far
https://www.facebook.com/InnopolisU
701 distinct word(s) so far
https://vk.com/innopolisu
928 distinct word(s) so far
https://www.youtube.com/user/InnopolisU
952 distinct word(s) so far
https://habr.com/ru/users/t-fazullin/posts/
1653 distinct word(s) so far
https://apply.innopolis.university/en/
2369 distinct word(s) so far
https://corporate.innopolis.university/
2636 distinct word(s) so far
https://media.innopolis.university/
2883 distinct word(s) so far
https://innopolis.university/lk/
3014 distinct word(s) so far
https://career.innopolis.university/en/job/
3139 distinct word(s) so far
https://career.innopolis.university/en/
3468 distinct word(s) so far
https://apply.innopolis.university/en/postgraduate-study/
3631 distinct word(s) so far
https://apply.innopolis.university/en/bachelor/
3707 dis