# 1. Crawler

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [2]:
import argparse
import os
import re
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # start with http*, ends if ? or # appears (or none of)
        m = re.search("^http.*/([^/\?#]*)[\?#]?", url)
        filename = m.group(1)
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    # the real wget will just download again with "filename.ext.<frequency>"
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser(description='download file.')
#     parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
#     parser.add_argument("url", type=str, default=None, help="Provide URL here")
#     args = parser.parse_args()
#     wget(args.url, args.filename)



### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [3]:
import os
if not os.path.exists('tmp'):
    os.mkdir('tmp') # for storing webpages

In [4]:
import requests

class Document:
    
    def __init__(self, url):
        if url[-1] == '/': # remove trailing slash from url
            url = url[:-1]
        self.url = url
        self.filename = None
        
    def get(self): # load file if cached, download and persist otherwise
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        r = requests.get(self.url, allow_redirects=True)
        if r.status_code == 200:
            self.content = r.content
            return True
        return False
    
    def persist(self):
        m = re.search("^http.*/([^/\?#]*)[\?#]?", self.url)
        self.filename = m.group(1)
        
        if self.filename == '':
            self.filename = 'index.html'
            
        with open('tmp/' + self.filename, 'wb') as f:
            f.write(self.content)
            print(f"File saved as tmp/{self.filename}")
            
    def load(self):
        if not self.filename:  # file not cached
            return False
            
        with open('tmp/' + self.filename, 'rb') as f:
            self.content = f.read()
        
        return True

### 1.1.1. Tests ###

In [5]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

File saved as tmp/iu.txt


## 1.2. [M][15] Account the caching policy

Sometimes remote documents (especially when we speak about static content like `js` or `gif`) can swear that they will not change for some time. This is done by setting [Cache-Control response header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).

In [6]:
import requests
requests.get('https://polyfill.io/v3/polyfill.min.js').headers['Cache-Control']

'public, s-maxage=31536000, max-age=604800, stale-while-revalidate=604800, stale-if-error=604800'

Please study the documentation and implement a descendant to a `Document` class, which will refresh the document in case of expired cache even if the file is already on the hard drive.

In [7]:
import time

class CachedDocument(Document):
    def load(self):
        # if the file is not cached, or max-age expired, return False so that a new copy is requested.
        max_age = re.search('max-age=(\d+)', requests.get(self.url).headers['Cache-Control'])
        if not self.filename or time.time() - self.ts > int(max_age.group(1)):
            self.ts = time.time()
            return False
        
        print(f'{self.filename} is cached and will not be requested again.')
        with open('tmp/' + self.filename, 'rb') as f:
            self.content = f.read()
        
        return True

### 1.2.1. Tests

Add logging in your code and show that your code behaves differently for documents with different caching policy.

In [8]:
import time

doc = CachedDocument('https://polyfill.io/v3/polyfill.min.js')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

doc = CachedDocument('https://google.com/') # beware bot detectors :)
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

File saved as tmp/polyfill.min.js
polyfill.min.js is cached and will not be requested again.
polyfill.min.js is cached and will not be requested again.
File saved as tmp/google.com
File saved as tmp/google.com
File saved as tmp/google.com


## 1.3. [10] Parse HTML ##
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
- `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
- `self.images` list of images met in a document. Again, links can be relative to current page.
- `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [9]:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.element import Comment
from urllib.parse import urljoin


class HtmlDocument(Document):
    
    def _tag_visible(self, element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True

    def _extract_text(self):
        soup = BeautifulSoup(self.content, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(self._tag_visible, texts)  
        return u" ".join(t.strip() for t in visible_texts)
    
    def _extract_images(self):
        images = []
            
        for link in BeautifulSoup(self.content, 'html.parser', parse_only=SoupStrainer('img')):
            if link.has_attr('src'):
                if not link['src'].startswith('http'): # relative link
                    images.append(urljoin(self.url, link['src']))
                else:
                    images.append(link['src'])
        return images
    
    def _extract_anchors(self):
        anchors = []
            
        for link in BeautifulSoup(self.content, 'html.parser', parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                link.contents.append('')
                if not link['href'].startswith('http'): # relative link
                    anchors.append((link.contents[0], urljoin(self.url, link['href'])))
                else:
                    anchors.append((link.contents[0], link['href']))
        return anchors
    
    def parse(self):
        self.text = self._extract_text()
        self.anchors = self._extract_anchors()
        self.images = self._extract_images()
        #print(self.text, self.anchors, self.images, sep='\n\n')

### 1.3.1. Tests ###

In [10]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

File saved as tmp/sprotasov.ru


## 1.4. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria of success**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained inside `<body>` tag only.

In [12]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.6.7-py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 646 kB/s eta 0:00:01
[?25hCollecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
[K     |████████████████████████████████| 306 kB 2.4 MB/s eta 0:00:01
[?25hCollecting tqdm
  Downloading tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
[K     |████████████████████████████████| 76 kB 2.6 MB/s eta 0:00:01
Collecting regex>=2021.8.3
  Downloading regex-2022.1.18-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (764 kB)
[K     |████████████████████████████████| 764 kB 3.2 MB/s eta 0:00:01
[?25hInstalling collected packages: joblib, tqdm, regex, nltk
Successfully installed joblib-1.1.0 nltk-3.6.7 regex-2022.1.18 tqdm-4.62.3


In [13]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ahmed/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/ahmed/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
from collections import Counter
from nltk.corpus import stopwords

class HtmlDocumentTextData:  # This class assumes english documents, multilingual version below
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        return nltk.tokenize.sent_tokenize(self.doc.text)
    
    def get_word_stats(self):
        wc = {}
        words = [w for w in nltk.tokenize.word_tokenize(self.doc.text) if (not w in stopwords.words('english')) and w.isalpha()]
        for word in words:
            wc[word.lower()] = wc.get(word.lower(), 0) + 1
        return Counter(wc)

### 1.4.1. Tests ###

In [15]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

File saved as tmp/innopolis.university
[('и', 59), ('в', 30), ('иннополис', 20), ('по', 17), ('на', 14), ('университет', 12), ('области', 10), ('с', 10), ('лаборатория', 10), ('университета', 9)]


## 1.5. [M][35] Languages
Maybe you heard, that there are multiple languages in the world. European languages, like Russian and English, use similar puctuation, but even in this family there is ¡Spanish!

Other languages can use different punctiation rules, like **Arabic or [Thai](http://www.thai-language.com/ref/breaking-words)**.

Your task is to support (at least) three languages (English, Arabic, and Thai) tokenization in your `HtmlDocumentTextData` class descendant.

What should you do:
1. Use any language dection techniques, e.g. [langdetect](https://pypi.org/project/langdetect/).
2. Use language-specific tokenization tools, e.g. for [Thai](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) and [Arabic](https://github.com/CAMeL-Lab/camel_tools).
3. Use these pages to test your code: [1](https://www.bangkokair.com/tha/baggage-allowance) and [2](https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86).

In [16]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[K     |████████████████████████████████| 981 kB 1.0 MB/s eta 0:00:01
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25ldone
[?25h  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=388ea8c5a8619a826cf69749780578f3a94d6db773439e098894e99e00641ca6
  Stored in directory: /home/ahmed/.cache/pip/wheels/13/c7/b0/79f66658626032e78fc1a83103690ef6797d551cb22e56e734
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [17]:
!pip install iso-639

Collecting iso-639
  Downloading iso-639-0.4.5.tar.gz (167 kB)
[K     |████████████████████████████████| 167 kB 502 kB/s eta 0:00:01
[?25hBuilding wheels for collected packages: iso-639
  Building wheel for iso-639 (setup.py) ... [?25ldone
[?25h  Created wheel for iso-639: filename=iso_639-0.4.5-py3-none-any.whl size=169061 sha256=b507358a18ddcdf304081b5ebd7f84c925e77e34b59d63f468b9bb3ba48184a7
  Stored in directory: /home/ahmed/.cache/pip/wheels/ed/ce/cc/1961a4de7090b2e92895fb087abfa0080a542a5706c5948bcc
Successfully built iso-639
Installing collected packages: iso-639
Successfully installed iso-639-0.4.5


In [18]:
!pip install stopwordsiso

Collecting stopwordsiso
  Downloading stopwordsiso-0.6.1-py3-none-any.whl (73 kB)
[K     |████████████████████████████████| 73 kB 1.1 MB/s eta 0:00:01
[?25hInstalling collected packages: stopwordsiso
Successfully installed stopwordsiso-0.6.1


In [19]:
from langdetect import detect
import stopwordsiso
from iso639 import languages
import string

class MultilingualHtmlDocumentTextData(HtmlDocumentTextData):
    def _get_document_lang(self):
        lang = languages.get(alpha2=detect(self.doc.text)).name.lower()
        return lang
    
    def get_sentences(self):
        try:  # use dedicated language tokenizer if available, use default otherwise.
            return nltk.tokenize.sent_tokenize(self.doc.text, language=self._get_document_lang())
        except LookupError:
            return nltk.tokenize.sent_tokenize(self.doc.text)
        
    def get_word_stats(self):
        lang = detect(self.doc.text)
        if stopwordsiso.has_lang(lang):
            sw = stopwordsiso.stopwords(lang)
        else:
            sw = []
        words = [w for w in nltk.tokenize.word_tokenize(self.doc.text) if (not w in sw) and w.isalpha()]
        wc = {}
        for word in words:
            wc[word.lower()] = wc.get(word.lower(), 0) + 1
        return Counter(wc)

### 1.5.1. Tests

In [20]:
doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
print(doc.get_word_stats().most_common(10))

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82")
print(doc.get_word_stats().most_common(10))

File saved as tmp/baggage-allowance
[('โซน', 11), ('x', 6), ('ภาษาไทย', 4), ('usd', 4), ('english', 2), ('繁體中文', 2), ('简体中文', 2), ('thb', 2), ('sgd', 2), ('myr', 2)]
File saved as tmp/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82
[('تعليق', 14), ('مشاهده', 11), ('الإمارات', 5), ('الفجر', 4), ('فن', 4), ('محمد', 3), ('دبي', 3), ('أخبار', 3), ('صحة', 3), ('أغسطس', 3)]


## 1.5. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [44]:
from queue import Queue

class Crawler:
    def __init__(self):
        self.visited = {}
    
    def remove_trailing_slash(self, url):
        if url[-1] == '/': # remove trailing slash from url
            url = url[:-1]
        return url
        
    def crawl_generator(self, source, depth=0):
        links = Queue()
        
        source = self.remove_trailing_slash(source)
        links.put((source, 0))
        
        while not links.empty():
            url, dep = links.get()
            url = self.remove_trailing_slash(url)
            self.visited[url] = True
            try:
                page = HtmlDocumentTextData(url)
                yield page
            
            except:
                print(f'[WARN] Failed to scrape {url}')
                continue
                
            if dep == depth: # max depth reached
                return
            
            for _, link in page.doc.anchors:
                link = self.remove_trailing_slash(link)
                if not self.visited.get(link):
                    links.put((link, dep + 1))
                    self.visited[link] = True

### 1.5. Tests ###

In [45]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")

print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

File saved as tmp/en
https://innopolis.university/en
281 distinct word(s) so far
File saved as tmp/en
https://apply.innopolis.university/en
906 distinct word(s) so far
File saved as tmp/en
https://corporate.innopolis.university/en
1050 distinct word(s) so far
File saved as tmp/en
https://media.innopolis.university/en
1092 distinct word(s) so far
File saved as tmp/lk
https://innopolis.university/lk
1436 distinct word(s) so far
File saved as tmp/about
https://innopolis.university/en/about
1533 distinct word(s) so far
File saved as tmp/board
https://innopolis.university/en/board
1605 distinct word(s) so far
File saved as tmp/team
https://innopolis.university/en/team
1606 distinct word(s) so far
File saved as tmp/team-structure
https://innopolis.university/en/team-structure
1609 distinct word(s) so far
File saved as tmp/education-academics
https://innopolis.university/en/team-structure/education-academics
1613 distinct word(s) so far
File saved as tmp/techcenters
https://innopolis.universi

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
File saved as tmp/index.html
https://apply.innopolis.university/en/?special=Y
6034 distinct word(s) so far
Done
[('university', 1131), ('innopolis', 582), ('research', 529), ('development', 521), ('lab', 517), ('education', 484), ('и', 483), ('science', 475), ('students', 467), ('software', 443), ('data', 431), ('robotics', 377), ('it', 355), ('the', 345), ('в', 340), ('engineering', 334), ('systems', 328), ('intelligence', 305), ('artificial', 302), ('computer', 300)]
