# Damir Nabiullin (d.nabiullin@innopolis.university) - Assignment 1

# 1. Crawler

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [1]:
# import argparse
# import os
# import re
# import requests


# def wget(url, filename):
#     # allow redirects - in case file is relocated
#     resp = requests.get(url, allow_redirects=True)
#     # this can also be 2xx, but for simplicity now we stick to 200
#     # you can also check for `resp.ok`
#     if resp.status_code != 200:
#         print(resp.status_code, resp.reason, 'for', url)
#         return
    
#     # just to be cool and print something
#     print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
#     print()
    
#     # try to extract filename from url
#     if filename is None:
#         # start with http*, ends if ? or # appears (or none of)
#         m = re.search("^http.*/([^/\?#]*)[\?#]?", url)
#         filename = m.group(1)
#         if not filename:
#             raise NameError(f"Filename neither given, nor found for {url}")

#     # what will you do in case 2 websites store file with the same name?
#     if os.path.exists(filename):
#         raise OSError(f"File {filename} already exists")
    
#     with open(filename, 'wb') as f:
#         f.write(resp.content)
#         print(f"File saved as {filename}")


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser(description='download file.')
#     parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
#     parser.add_argument("url", type=str, default=None, help="Provide URL here")
#     args = parser.parse_args()
#     wget(args.url, args.filename)

### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [1]:
import requests
from urllib.parse import quote
import hashlib
import re
import os

class Document:
    
    def __init__(self, url):
        # Initialize url and quoted_url
        self.url = url
        self.quoted_url = quote(url, safe='/')

        # Extract file type from url
        bare_link_regex = '^https?://[\w\-\.]+/?'
        type_pattern = re.compile('[.]\w+$')
        
        link_path_part = re.sub(bare_link_regex, '', self.url)
        # print(link_path_part)
        types = type_pattern.findall(link_path_part)
        file_type = types[0] if types else '.html'

        # Create file path
        folder_path = './downloaded_files/'
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        self.file_path = folder_path + hashlib.sha256(self.quoted_url.encode()).hexdigest() + file_type

        # print(file_type)
        # print(self.file_path)
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        #TODO download self.url content, store it in self.content and return True in case of success
        response = requests.get(self.url)
        if response.status_code == 200:
            self.content = response.content
            return True
        return False
    
    def persist(self):
        #TODO write document content to hard drive
        with open(self.file_path, 'wb') as persist_file:
            persist_file.write(self.content)
            
    def load(self):
        #TODO load content from hard drive, store it in self.content and return True in case of success
        try:
            with open(self.file_path, 'rb') as load_file:
                self.content = load_file.read()
                return True
        except FileNotFoundError:

            return False

### 1.1.1. Tests ###

In [2]:
doc = Document('https://video-moon.com/ar/')
# mp3_link_test = https://file-examples.com/storage/feeb72b10363daaeba4c0c9/2017/11/file_example_MP3_700KB.mp3

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

AssertionError: Document content error

## 1.2. [10] Parse HTML
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
1. `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
2. `self.images` list of images met in a document. Again, links can be relative to current page.
3. `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [4]:
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.parse import urljoin


class HtmlDocument(Document):
    
    def parse(self):
        #TODO extract plain text, images and links from the document
        self.anchors = []
        self.images = []
        self.text = None

        # Create a BeautifulSoup object
        soup = BeautifulSoup(self.content.decode('utf-8'), 'html.parser')

        # Exctract anchors
        for anchor in soup.find_all('a'):
            anchor_text = anchor.text
            anchor_bare_url = anchor.get('href')
            anchor_url = urljoin(self.url, anchor_bare_url)
            self.anchors.append((anchor_text, anchor_url))

        # Extract images
        for image in soup.find_all('img'):
            image_bare_url = image.get('src')
            image_url = urljoin(self.url, image_bare_url)
            self.images.append(image_url)

        # Extract text
        # I decided to use the filter that was provided to us, changing it a bit
        # https://stackoverflow.com/questions/1936466/how-to-scrape-only-visible-webpage-text-with-beautifulsoup/1983219#1983219
        all_texts = soup.findAll(text=True)
        filter_func = lambda x: x.parent.name not in ['style', 'script', '[document]', 'head', 'title'] and not isinstance(x, Comment)
        text_array = [el for el in all_texts if filter_func(el)]
        # print(text_array)
        self.text = ' '.join(text_array)
        
        # print(self.anchors)
        # print(self.images)
        # print(self.text)

### 1.2.1. Tests

In [5]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

## 1.3. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria to succeed in the task**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained from inside `<body>` tag only.

In [6]:
from collections import Counter

class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        #TODO implement sentence parser
        result = []

        process_text = self.doc.text

        # I decided to clear the text a bit
        # before splitting it into sentences
        # in case to avoid some problems with
        # charracters.
        # Moreover, it helps to debug the code
        # and to see the text itself clearly.

        # Patterns for cleaning text
        bad_symbols_pattern = r'\t|\r'
        bad_new_lines_pattern = r'\n([ ]\n)+'
        long_spaces_pattern = r'[ ]{2,}'
        spaces_pattern = r'^[ ]+|\n[ ]+|[ ]\n[ ]?'
        few_new_lines_pattern = r'\n{2,}'

        # Clean text
        process_text = re.sub(bad_symbols_pattern, '', process_text)
        process_text = re.sub(bad_new_lines_pattern, '\n', process_text)
        process_text = re.sub(long_spaces_pattern, ' ', process_text)
        process_text = re.sub(spaces_pattern, '\n', process_text)
        process_text = re.sub(few_new_lines_pattern, '\n', process_text)

        # print(process_text)

        # Pattern to split text into sentences
        sentences = re.split(r'[.?!][ \n]|\n', process_text)

        # Itterate through 
        for sentence in sentences:
            sentence = sentence.strip()

            if len(sentence) < 1:
                continue

            if re.match(r'^[a-zа-я]' , sentence) and len(result) > 0:
                result[-1] += (' ' + sentence)
                continue
            
            result.append(sentence)

        return result
    
    def get_word_stats(self):
        #TODO return Counter object of the document, containing mapping {`word` -> count_in_doc}
        words = []

        # Characters that split words
        split_pattern = r'[, :;]+'

        # Get sentences
        sentences = self.get_sentences()

        # Split sentences into words
        for sentence in sentences:
            lower_case_sentence = sentence.lower()
            words += re.split(split_pattern, lower_case_sentence)

        return Counter(words)

### 1.3.1. Tests ###

In [7]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 44), ('в', 22), ('иннополис', 19), ('с', 13), ('на', 12), ('университета', 11), ('университет', 10), ('центр', 10), ('для', 9), ('образование', 8)]


## 1.4. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [8]:
from queue import Queue

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        #TODO return real crawling results. Don't forget to process failures, 
        # exceptions, 3**, 4** codes

        # Create initial queue and set for urls
        to_visit = Queue()

        # I decided to use a set for urls
        # To avoid visiting the same url twice
        visited = set()

        to_visit.put((source, 0))

        while not to_visit.empty():
            url_tuple = to_visit.get()

            # Get url and depth
            url = url_tuple[0]
            cur_depth = url_tuple[1]
            
            try:
                text_document = HtmlDocumentTextData(url)
                yield text_document

                visited.add(url)

                if cur_depth < depth:
                    anchors = text_document.doc.anchors
                    
                    for anchor in anchors:
                        anchor_url = anchor[1]

                        if anchor_url not in visited:
                            to_visit.put((anchor_url, cur_depth + 1))
            except:
                pass

        # for i in range(3):
        #     yield HtmlDocumentTextData(source)

### 1.4.1. Tests ###

In [9]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", depth=2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
345 distinct word(s) so far
https://innopolis.university/
903 distinct word(s) so far
https://apply.innopolis.university/en
1623 distinct word(s) so far
https://media.innopolis.university/en
1696 distinct word(s) so far
https://innopolis.university/lk/
1709 distinct word(s) so far
https://innopolis.university/en/about/
1910 distinct word(s) so far
https://innopolis.university/en/board/
1994 distinct word(s) so far
https://innopolis.university/en/team/
1995 distinct word(s) so far
https://innopolis.university/en/team-structure/
1998 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
2001 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
2005 distinct word(s) so far
https://innopolis.university/en/faculty/
2995 distinct word(s) so far
https://innopolis.university/en/faculty/
2995 distinct word(s) so far
https://career.innopolis.university/en/job/
3581 distinct word(s) so far
https: