# 1. Crawler

## 1.0. Related example

This code shows a `wget`-like tool written in Python. Run it from the console (`python wget.py`) and make it work. Then read the code, reuse it, and modify it for your needs.

In [3]:
import argparse
import os
import re
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # starts with http*, the name ends where ? or # appears (or at the end of the URL)
        m = re.search(r"^http.*/([^/?#]*)[?#]?", url)
        # the pattern may not match at all, e.g. for a non-http URL
        filename = m.group(1) if m else None
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser(description='download file.')
#     parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
#     parser.add_argument("url", type=str, default=None, help="Provide URL here")
#     args = parser.parse_args()
#     wget(args.url, args.filename)

### 1.0.1. How to parse a page?

If you build a crawler, you can follow one of two approaches:
1. Search for URLs anywhere in the page, treating it as plain text.
2. Search for URLs only in the places where they should appear: `<a href=...`, `<img src=...`, `<iframe src=...` and so on.

For the first approach, you can rely on a good regular expression, [like this one](https://stackoverflow.com/a/3809435).
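
A minimal sketch of the text-search approach. The pattern here is deliberately simplified for illustration and is not the one from the linked answer; `URL_PATTERN`, `find_urls` and the sample string are hypothetical names and data.

```python
import re

# Simplified, illustrative pattern: a scheme followed by any run of characters
# that are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(text):
    """Return every substring of `text` that looks like an http(s) URL."""
    return URL_PATTERN.findall(text)


sample = 'See <a href="http://example.com/a.html">this</a> or https://example.org/b?x=1 for details.'
print(find_urls(sample))
```

Note that this approach only sees absolute URLs written out in full; relative links such as `/about` are missed.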

For the second approach, read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).
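
The attribute-based approach can also be sketched without third-party packages. The example below uses the standard library's `html.parser` instead of BeautifulSoup, purely to stay self-contained; `LinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects URLs from the attributes where they normally appear."""

    # tag -> attribute that holds the URL
    URL_ATTRIBUTES = {"a": "href", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((tag, value))


collector = LinkCollector()
collector.feed('<p>Text with http://not-a-link.example</p>'
               '<a href="/about">About</a><img src="logo.png">')
print(collector.links)
```

Unlike the text search, this ignores URL-like strings in ordinary text and does pick up relative links, which then still have to be resolved against the page address.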

## 1.1. [15] Download and persist #
Please complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- `download()` downloads binary data from the given URL and stores it in `self.content`. It returns `True` on success and `False` otherwise.
- `persist()` saves `self.content` somewhere in the file system. This avoids repeated downloads (in other words, it is a cache).
- `load()` loads the data from the hard drive. It returns `True` on success.

The tests only check that your code works in basic cases.

**NB Passing the tests doesn't mean you have completed the task correctly.** These are the **criteria that have to be fulfilled**:
1. A URL is a unique identifier (URLs are a subset of URIs). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain overwrite the same file, URLs with similar endings are saved to the same file, etc.
2. A document can be a binary file, not only text. Make sure that if you download an `mp3` file, it can still be played. Hint: don't hurry to convert everything to text.
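
Both criteria can be illustrated with a small sketch. It assumes a hash of the full URL is used as the file name (SHA-256 here, as one possible choice) and that files are always opened in binary mode; `url_to_path` and the example URLs are hypothetical.

```python
import hashlib
import os
import tempfile


def url_to_path(url, directory):
    """Map a URL to a file path; the digest covers the whole URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(directory, digest)


with tempfile.TemporaryDirectory() as cache_dir:
    # Same ending, different URLs: the paths must differ.
    first = url_to_path("http://example.com/a/index.html", cache_dir)
    second = url_to_path("http://example.org/b/index.html", cache_dir)
    same_ending_differs = first != second

    payload = bytes(range(256))  # arbitrary binary data, not valid UTF-8 text
    with open(first, "wb") as f:
        f.write(payload)
    with open(first, "rb") as f:
        round_trip_ok = f.read() == payload

print(same_ending_differs, round_trip_ok)
```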

In [4]:
import requests
from urllib.parse import quote
import hashlib
import os
import re


class Document:
    
    def __init__(self, url):
        self.url = url
        self.dir = './files'
        self.create_directory()
        self.generate_path()

    def create_directory(self):
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)

    def generate_path(self):
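        # take the last ".xxx" in the URL as a candidate extension;
        # it may be the domain suffix, which the whitelist below filters out
        m = re.search(r".*([.]\w*)[?#]?", self.url)
        file_extension = m.group(1) if m else ''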

        filename = quote(self.url)
        self.filename = hashlib.md5(filename.encode("utf-8")).hexdigest()

        if file_extension in ['.pdf', '.mp3', '.avi', '.mp4', '.txt', '.png', '.jpg', '.html']:
            self.filename += file_extension

        self.path = os.path.join(self.dir, self.filename)
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        '''Downloads self.url content, stores it in self.content and returns True in case of success.
        '''
        try:
            resp = requests.get(self.url, allow_redirects=True, timeout=10)
            if resp.status_code != 200:
                print(resp.status_code, resp.reason, 'for', self.url)
                return False
            
            self.content = resp.content
            return True
        
        except requests.RequestException:
            # network-level failure: timeout, DNS error, refused connection, etc.
            return False
    
    def persist(self):
        '''Writes document content to hard drive.
        '''
        with open(self.path, 'wb') as f:
            f.write(self.content)
            # print(f"File saved in {self.dir} as {self.filename}")
            
    def load(self):
        '''Loads content from hard drive, stores it in self.content and returns True in case of success.
        '''
        try:
            with open(self.path, 'rb') as f:
                self.content = f.read()
            return True
        except FileNotFoundError:
            return False

In [5]:
pic = Document('https://i.pinimg.com/originals/9c/cf/9e/9ccf9ed7a40f29bef23dc20bf5fe13b5.jpg')
pic.get()

In [6]:
prot = Document('http://sprotasov.ru/data/iu.txt')
prot.get()

### 1.1.1. Tests ###

In [7]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.2. [10] Parse HTML
The `BeautifulSoup` library is the de facto standard for parsing XML and HTML documents in Python. Use it to complete the `parse()` method, which extracts the document contents. You should initialize:
1. `self.anchors`: a list of `('text', 'url')` tuples found in the document. Be aware that links can be relative (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to resolve them.
2. `self.images`: a list of image URLs found in the document. Again, links can be relative to the current page.
3. `self.text`: the plain text of the document, without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All 3 criteria must be fulfilled to get full points for the task.**
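
To show how `urllib.parse.urljoin()` handles the different kinds of links, here is a short sketch; `page_url` and the `hrefs` values are made-up examples.

```python
from urllib.parse import urljoin

page_url = "http://example.com/a/b/page.html"  # hypothetical page address

hrefs = ["../content/pic.jpg", "/img/logo.png", "other.html", "https://example.org/x"]

# Resolve each link against the address of the page it was found on.
resolved = {href: urljoin(page_url, href) for href in hrefs}

for href, absolute in resolved.items():
    print(href, "->", absolute)
```

Relative paths are resolved against the page's directory, a leading `/` restarts from the site root, and absolute URLs pass through unchanged.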

In [8]:
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.parse import urljoin


class HtmlDocument(Document):

    def get_anchors_from_html(self):
        all_hrefs = self.soup.find_all('a', href=True)
        anchors = set()

        for a in all_hrefs:
            url = urljoin(self.url, a["href"])
            if re.match(r"^http.*/[^/?#]*[?#]?", url):
                anchors.add((a.get_text(), url))

        anchors = list(anchors)
        return anchors
    
    def get_images_from_html(self):
        urls = []

        for image_url in self.soup.find_all('img'):
            try:
                src = image_url['src']
                src = urljoin(self.url, src)
                urls.append(src)
            except KeyError:
                pass

        urls = list(set(urls))
        return urls

    def tag_visible(self, element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True

    def get_text_from_html(self):
        # find_all(string=True) is the current spelling of the deprecated findAll(text=True)
        texts = self.soup.find_all(string=True)
        visible_texts = filter(self.tag_visible, texts)  
        return u" ".join(t.strip() for t in visible_texts)
    
    def parse(self):
        '''Extracts plain text, images and links from the document.
        '''
        self.soup = BeautifulSoup(self.content, 'html.parser')
        self.anchors = self.get_anchors_from_html()
        self.images = self.get_images_from_html()
        self.text = self.get_text_from_html()

### 1.2.1. Tests

In [9]:
doc = HtmlDocument("https://infatica.io/blog/scrape-images-with-python/")
doc.get()
doc.parse()
# doc.text
doc.images
# doc.anchors

['https://infatica.io/blog/assets/images/Logo-black.svg?v=8fcb0d6019',
 'https://infatica.io/blog/assets/images/blog/menu-1.svg?v=8fcb0d6019',
 'https://infatica.io/blog/content/images/2020/06/mikayla-alston-portrait.jpg',
 'https://infatica.io/blog/assets/images/Logo-white.svg?v=8fcb0d6019',
 'https://infatica.io/blog/content/images/2021/07/web-scraping-legal-disclaimer.png',
 'https://infatica.io/blog/assets/images/blog/menu-2.svg?v=8fcb0d6019',
 'https://infatica.io/blog/content/images/2022/12/how-to-set-up-proxies-on-android.png',
 'https://infatica.io/blog/content/images/2021/07/why-choose-python.png',
 'https://infatica.io/blog/content/images/2021/07/using-python-to-scrape-images-from-the-web.png',
 'https://infatica.io/blog/content/images/2023/01/web-crawlers-explained.png',
 'https://infatica.io/blog/content/images/2023/01/http-proxies-explained.png',
 'https://infatica.io/blog/content/images/2020/06/lucas-walker-portrait.jpg',
 'https://infatica.io/blog/assets/images/blog/menu

In [10]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

## 1.3. [10] Document analysis ##
Complete the code for the `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you like).

**Criteria to succeed in the task**:
1. Your `get_word_stats()` method should return a `Counter` object.
2. Don't forget to lowercase your words before counting.
3. Sentences should be obtained only from inside the `<body>` tag.
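
Since any splitting method is allowed, here is one naive possibility based only on regular expressions. It is a sketch, not the reference solution; `split_sentences`, `word_stats` and the sample text are hypothetical, and a tokenizer library will handle abbreviations and similar cases better.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_stats(text):
    """Count lowercased words (runs of letters, digits and underscores)."""
    return Counter(re.findall(r"\w+", text.lower()))


sample = "Crawlers download pages. Then crawlers parse them! Is that all?"
print(split_sentences(sample))
print(word_stats(sample).most_common(1))
```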

In [53]:
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation


class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self) -> list:
        '''Parses sentences.
        '''
        result = []

        sents = sent_tokenize(self.doc.text)
        for sent in sents:
            sents_ = sent.split('  ')
            for sent_ in sents_:
                if sent_ != '':
                    result.append(sent_)
                    
        return result
    
    def get_word_stats(self) -> Counter:
        '''Returns Counter object of the document, 
        containing mapping {`word` -> count_in_doc}.
        '''
        counter = Counter()

        for sent in self.get_sentences():
            words = word_tokenize(sent)
            for word in words:
                if word not in punctuation:
                    counter[word.lower()] += 1

        return counter

### 1.3.1. Tests ###

In [12]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 40), ('в', 24), ('иннополис', 21), ('по', 14), ('университет', 12), ('на', 12), ('центр', 11), ('с', 11), ('университета', 10), ('для', 10)]


## 1.4. [15] Crawling ##

The `crawl_generator()` method is given a starting URL (`source`) and a maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects (yield each document as soon as it is downloaded and parsed). You can benefit from the `yield obj_name` Python construct. Use the `anchors` field of the parsed document (`HtmlDocumentTextData.doc.anchors`) to go deeper.
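
The idea of a depth-limited crawl that yields results one at a time can be sketched without any network access. The example assumes a hypothetical in-memory link graph `LINKS` in place of real pages; `crawl` is an illustrative name, not part of the assignment.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> outgoing links.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def crawl(source, depth=1):
    """Yield pages breadth-first, at most `depth` links away from `source`."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, level = queue.popleft()
        yield page  # hand the page over before exploring further
        if level < depth:
            for target in LINKS.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, level + 1))


print(list(crawl("A", depth=1)))
print(list(crawl("A", depth=2)))
```

Because `crawl` is a generator, the caller receives the first page before the rest of the graph has been visited, which is what lets a real crawler process documents while it is still downloading others.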

In [20]:
from queue import Queue
from IPython.display import clear_output

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        queues = [Queue() for _ in range(depth + 1)]
        queues[0].put(source)
        processed = set()

        for i in range(depth + 1):
            while not queues[i].empty():
                url = queues[i].get()

                if url in processed:
                    continue
                else:
                    # clear_output(wait=True)
                    processed.add(url)
                    # print(len(processed))
                    # print(url)

                try:
                    document = HtmlDocumentTextData(url)

                    if i < depth:
                        for _, anchor in document.doc.anchors:
                            queues[i + 1].put(anchor)
                            
                    yield document
                            
                except FileNotFoundError:
                    print(url, 'was not found.')

### 1.4.1. Tests ###

In [57]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 1):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

https://innopolis.university/en/
341 distinct word(s) so far
https://vk.com/innopolisu
690 distinct word(s) so far
https://innopolis.university/en/international-relations-office/
1221 distinct word(s) so far
https://innopolis.university/en/research/
1300 distinct word(s) so far
https://media.innopolis.university/en
1360 distinct word(s) so far
403 Forbidden for http://www.campuslife.innopolis.ru
http://www.campuslife.innopolis.ru was not found.
https://apply.innopolis.university/en/
1846 distinct word(s) so far
https://apply.innopolis.university/en/master/
1900 distinct word(s) so far
https://career.innopolis.university/en/
2049 distinct word(s) so far


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://innopolis.university/en/team-structure/
2055 distinct word(s) so far
https://apply.innopolis.university/en/bachelor/
2147 distinct word(s) so far
https://media.innopolis.university/news/innopolis-university-extends-international-application-deadline-/
2302 distinct word(s) so far
https://university.innopolis.ru/en/about/ was not found.
https://minobrnauki.gov.ru/
2614 distinct word(s) so far
https://innopolis.university/en/form/
2745 distinct word(s) so far
https://media.innopolis.university/en/news/
2745 distinct word(s) so far
https://media.innopolis.university/news/registration-innopolis-open-2020/
2854 distinct word(s) so far
https://media.innopolis.university/news/self-driven-school/
2960 distinct word(s) so far
https://apply.innopolis.university/en/postgraduate-study/
3024 distinct word(s) so far