# Home Assignment 1 - Crawler

**Author:**    Danis Alukaev <br/>
**Email:**      d.alukaev@innopolis.university <br/>
**Group:**      B19-DS-01

## 0.0. Prerequisites
In the code I use five additional libraries: [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for html parsing, [NLTK](https://www.nltk.org) for english text processing, [langdetect](https://pypi.org/project/langdetect/) for language detection, [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools) for arabic text processing and [PyThaiNLP](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) for thai text processing. Please, install them in your environment by running the following cells. Also, I've left exactly the same cells before the actual usage of these packages.

In [None]:
%pip install beautifulsoup4

In [None]:
%pip install nltk

In [None]:
%pip install langdetect

In [None]:
%pip install camel-tools

In [None]:
%pip install pythainlp
%pip install epitran
%pip install python-crfsuite

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [None]:
import argparse
import os
import re
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # start with http*, ends if ? or # appears (or none of)
        m = re.search("^http.*/([^/\?#]*)[\?#]?", url)
        filename = m.group(1)
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='download file.')
    parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
    parser.add_argument("url", type=str, default=None, help="Provide URL here")
    args = parser.parse_args()
    wget(args.url, args.filename)

### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [1]:
import requests
from urllib.parse import quote
from os.path import exists, join 
import hashlib

class Document:
    
    def __init__(self, url):
        self.url = url
        self.content = None
        self.dir = "./data"
        if not exists(self.dir):
            os.makedirs(self.dir)
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        """
        Downloads binary data and stores it in the content attribute.

        :return: boolean whether the data was downloaded successfully.
        """
        response = requests.get(self.url, allow_redirects=True)
        if response.status_code == 200:
            self.content = response.content
            return True
        print(f"Download failed for {self.url}: {response.status_code} ({response.reason})")
        return False
    
    def _get_filepath(self):
        """
        Returns path to the file storing content of url.

        :return: filepath
        """
        filename = quote(self.url, '')
        filename = hashlib.md5(filename.encode("utf-8")).hexdigest()
        filepath = join(self.dir, filename)
        return filepath


    def persist(self):
        """
        Saves content in the local directory data (kinda caching mechanism).
        When invoked it overrides the file if it is already exist.
        Ref: https://t.me/c/1784134413/102

        :return: boolean whether the data was stored successfully.
        """
        try:
            filepath = self._get_filepath()
            with open(filepath, 'wb') as f:
                f.write(self.content)
        except Exception as e:
            print(e)
            return False
        return True
            
    def load(self):
        """
        Loads data from the local directory.

        :return: boolean whether the data was loades successfully.
        """
        filepath = self._get_filepath()
        if exists(filepath):
            try:
                with open(filepath, 'r') as f:
                    self.content = f.read()
            except UnicodeDecodeError:
                with open(filepath, 'rb') as f:
                    self.content = f.read()
            return True
        return False

**Note:**
During the lab I asked the question regarding `<a>` and `<A>` tags. The reply was sent to our [chat in telegram](https://t.me/c/1784134413/97). It appeared that Beatiful Soap v.4 handles old-fashioned (capitalized) tags automatically, and there is no need to lead them to lower-case. 

### 1.1.1. Tests ###

In [2]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

The following cell is the additional test for link to `mp3` file. The link leads to the Mozart's "Serenades and Divertimenti for Winds, no.5 Larghetto". Run and check the `data` folder.

In [3]:
doc = Document('https://docs.google.com/uc?export=download&id=1TQbsl1h3d-X7UO2fG0pXwOxCfWg_8VNR')
doc.get()

## 1.2. [M][15] Account the caching policy

Sometimes remote documents (especially when we speak about static content like `js` or `gif`) can swear that they will not change for some time. This is done by setting [Cache-Control response header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).

In [4]:
import requests
requests.get('https://polyfill.io/v3/polyfill.min.js').headers['Cache-Control']

'public, s-maxage=31536000, max-age=604800, stale-while-revalidate=604800, stale-if-error=604800'

Please study the documentation and implement a descendant to a `Document` class, which will refresh the document in case of expired cache even if the file is already on the hard drive.

In [5]:
from datetime import datetime, timedelta


class CachedDocument(Document):
    
    def __init__(self, url):
        super().__init__(url)
        self.expire_time = datetime(1, 1, 1)

    def get(self):
        """
        Overrides method get of super class.
        If the cache is expired, refresh the document.
        """
        request_time = datetime.now()
        if self._check_no_cache():
            if self._exists_new_version():
                self._update_document()
            else:
                self.load()
        elif self.expire_time < request_time:
            self._update_document()
            self._update_expire_time(request_time)
        else: 
            self.load()
    
    def _update_document(self):
        """
        Updates the content attribute and saves content locally.
        """
        if not self.download():
            raise FileNotFoundError(self.url)
        else:
            self.persist()

    def _update_expire_time(self, request_time):
        """
        Updates attribute expire_time based on the max-age and time of the last request.
        """
        max_age = self._get_max_age()
        self.expire_time = request_time + timedelta(seconds=max_age)

    def _get_max_age(self):
        """
        Retrieves max-age term from the cache-control vocabulary.
        Max-age stands for number of seconds document remains fresh.

        :return: max-age term
        """
        headers = requests.get(self.url).headers
        if 'Cache-Control' not in headers:
            return 0 
        vocabulary = headers['Cache-Control'].replace(' ','').split(sep=',')
        max_age = 0
        for term in vocabulary:
            if 'max-age' in term:
                max_age = int(term.split("=")[1])
        return max_age
    
    def _check_no_cache(self):
        """
        Checks whether the cache-control header contains no-cache term.

        :return: boolean whether there is no-cache
        """
        headers = requests.get(self.url).headers
        if 'Cache-Control' not in headers:
            return False
        if 'no-cache' in headers['Cache-Control']:
            return True
        return False
    
    def _exists_new_version(self):
        """
        Checks whether there exists new version of document.

        :return: boolean whether there is new version
        """
        response = requests.get(self.url, allow_redirects=True)
        if response.status_code != 200:
            return False
        current_version = response.content
        if self.content != current_version:
            return True
        return False

### 1.2.1. Tests

Add logging in your code and show that your code behaves differently for documents with different caching policy.

In [6]:
import time

doc = CachedDocument('https://polyfill.io/v3/polyfill.min.js')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

doc = CachedDocument('https://yandex.ru/')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

## 1.3. [10] Parse HTML ##
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
- `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
- `self.images` list of images met in a document. Again, links can be relative to current page.
- `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [None]:
%pip install beautifulsoup4

In [7]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
import re


class HtmlDocument(Document):
    
    def parse(self):
        self.soup = BeautifulSoup(self.content, 'html.parser')

        self.anchors = self._get_anchors()
        self.images = self._get_images()
        self.text = self._get_text()

    def _get_anchors(self):
        """
        Creates a list of anchor tuples ('text', 'url') met in a document.
        
        :return: list of anchor tuples.
        """
        absolute_link_marker = "//"
        anchors = self.soup.find_all('a')
        result = []
        for anchor in anchors:
            link = anchor.get('href')
            if not link:
                continue
            if absolute_link_marker not in anchor:
                link = urllib.parse.urljoin(self.url, anchor['href'])
            entry = (anchor.text.strip(), link)
            result.append(entry)
        return result
        
    
    def _get_images(self):
        """
        Creates a list of links to images met in document. 
        The duplicated links are omitted.

        :return: list of links to images.
        """
        absolute_link_marker = "//"
        images = self.soup.find_all("img")
        result = []
        for entry in images:
            result.append(entry)
            if absolute_link_marker not in entry:
                result[-1] = urllib.parse.urljoin(self.url, entry["src"])
        result = list(set(result))
        return result
    
    def _get_text(self):
        """
        Converts document content to plain text.
        
        :return: string with plain text.
        """
        texts = self.soup.findAll(text=True)
        containers = ['head', 'title', 'style', 'script', 'meta', '[document]']

        cleaned_texts = []
        for element in texts:
            parent = element.parent.name 
            if parent in containers or isinstance(element, Comment):
                continue
            cleaned_texts.append(element.strip())
        plain = re.sub('\s{2,}', ' ', " ".join(cleaned_texts).replace('\n', ' ')).strip()  
        
        return plain


### 1.3.1. Tests ###

In [8]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

Let's check the anchors, images, text attributes.

In [9]:
from pprint import pprint

print("Anchors:")
pprint(doc.anchors)
print()

print("Images:")
pprint(doc.images)
print()

print("Text:")
print(doc.text)

Anchors:
[('Curriculum vitae',
  'https://docs.google.com/document/d/e/2PACX-1vQqlsxmlbkwp7CypdNg5vcl9zEfE1w6EFppJ2iBbHpZrpOI0AIzFkeu21-Or1_PYlnq1ICyLR1qaNlu/pub'),
 ('Google Scholar', 'https://scholar.google.ru/citations?user=pDske8oAAAAJ'),
 ('GitHub', 'https://github.com/str-anger'),
 ('ResearchGate', 'https://www.researchgate.net/profile/Stanislav-Protasov'),
 ('Публикации в eLibrary',
  'http://elibrary.ru/author_items.asp?authorid=789317'),
 ('Facebook', 'https://www.facebook.com/stanislav.protasov'),
 ('LinkedIn', 'https://www.linkedin.com/pub/stanislav-protasov/28/651/b38'),
 ('Research with Stas telegram channel', 'https://t.me/iu_aml'),
 ('Подкаст "Происхождение видов": telegram', 'https://t.me/origin_of_species'),
 ('iTunes',
  'https://itunes.apple.com/ru/podcast/происхождение-видов/id1282666034'),
 ('RSS', 'http://sprotasov.ru/podcast/rss.xml'),
 ('Automatic testing system', 'http://code-test.ru/'),
 ('source code', 'https://bitbucket.org/str-anger/stick-rope'),
 ('Книга "

## 1.4. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria of success**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained inside `<body>` tag only.

In [None]:
%pip install nltk

In [10]:
import re
from string import punctuation
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize


class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        """
        Creates a list of sentences from the <body> tag of the content.

        :return: list of sentences.
        """
        body_strings = doc.doc.soup.body.strings
        body = re.sub('\s{2,}', ' ', " ".join(list(body_strings)).replace('\n', ' ')).strip()   
        result = sent_tokenize(body)
        return result
    
    def get_word_stats(self):
        """
        Creates a counter of each word in the content text.

        :return: counter.
        """
        counter = Counter()
        words = [word.lower() for word in word_tokenize(self.doc.text) if word not in list(punctuation)]
        for word in words:
            counter[word] += 1
        return counter

### 1.4.1. Tests ###

In [11]:
doc = HtmlDocumentTextData("https://innopolis.university/")
print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 59), ('в', 30), ('иннополис', 20), ('по', 17), ('на', 14), ('университет', 12), ('области', 10), ('с', 10), ('лаборатория', 10), ('университета', 9)]


The following cell displays the parsed from `<body>` tag sentences. Note that although `pprint` method writes each part of the string on different line (and surrond with quotation marks), it separates each string (sentence) with the comma.  

In [12]:
from pprint import pprint

pprint(doc.get_sentences())

['Все медиа Facebook Вконтакте Youtube Twitter Instagram habr Абитуриентам '
 'Бизнесу Медиа Личный кабинет Университет Об университете Органы управления '
 'Учредители Наблюдательный совет Команда университета Организационная '
 'структура Образовательные и научные подразделения Технологические центры '
 'Преподавательский состав Профессорско-преподавательский состав Вакантные '
 'должности ППС Работа в университете Карьера в университете Корпоративная '
 'жизнь Релокация в Иннополис Вакансии Кампус Кампус Информация о жилом, '
 'учебном и спортивном комплексах, медцентре, питании и досуге на территории '
 'города и Университета Иннополис.',
 'Ответы на часто задаваемые вопросы Сведения об образовательной организации '
 'Сведения об образовательной организации Информация об образовательной '
 'деятельности, приёмной кампании, структуре и органах управления '
 'университетом, финансово-хозяйственной деятельности Сувениры и одежда Мерч '
 'Университета Иннополис Университет Иннополис Сп

## 1.5. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [13]:
from queue import Queue 


class Crawler:
    
    def crawl_generator(self, source, depth=1):
        """
        Crawl the internet from the given source url and for a given depth.
        Keeps track of the processed links in order not to handle the same link.
        Saves web pages/files locally.

        :param source: starting for crawling url.
        :param depth: extent to which a crawlier indexes the pages. Ref: https://t.me/c/1784134413/75 
        :return: generator with all processed links.
        """
        queues = {i: Queue() for i in range(depth)}
        queues[0].put(source)
        processed = set()

        for i in range(depth):
            while not queues[i].empty():
                url = queues[i].get()
                processed.add(url)
                try: 
                    doc = HtmlDocumentTextData(url)
                    yield doc

                    if i == depth - 1:
                        continue
                    for _, link in doc.doc.anchors:
                        if link not in processed:
                            queues[i + 1].put(link)
                except Exception as e:
                    print(f"Error with processing {url}: {e}")
                    continue

### 1.5. Tests ###

In [14]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
343 distinct word(s) so far
https://apply.innopolis.university/en
1079 distinct word(s) so far
https://corporate.innopolis.university/en
1241 distinct word(s) so far
https://media.innopolis.university/en
1299 distinct word(s) so far
https://innopolis.university/lk/
1655 distinct word(s) so far
https://innopolis.university/en/about/
1791 distinct word(s) so far
https://innopolis.university/en/board/
1878 distinct word(s) so far
https://innopolis.university/en/team/
1879 distinct word(s) so far
https://innopolis.university/en/team-structure/
1883 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
1887 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
1889 distinct word(s) so far
https://innopolis.university/en/faculty/
2801 distinct word(s) so far
https://innopolis.university/en/faculty/
2801 distinct word(s) so far
https://career.innopolis.university/en/job/
3242 distinct word(s) 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Done
[('and', 4488), ('of', 4097), ('the', 3553), ('in', 2388), ('to', 1643), ('university', 1613), ('for', 1206), ('a', 1157), ('innopolis', 814), ('at', 738), ('it', 698), ('research', 684), ('lab', 671), ('you', 657), ('development', 647), ('education', 643), ('science', 636), ('и', 627), ('students', 580), ('software', 575)]


## 1.6. [M][35] Languages
Maybe you heard, that there are multiple languages in the world. European languages, like Russian and English, use similar puctuation, but even in this family there is ¡Spanish!

Other languages can use different punctiation rules, like **Arabic or [Thai](http://www.thai-language.com/ref/breaking-words)**.

Your task is to support (at least) three languages (English, Arabic, and Thai) tokenization in your `HtmlDocumentTextData` class descendant.

What should you do:
1. Use any language dection techniques, e.g. [langdetect](https://pypi.org/project/langdetect/).
2. Use language-specific tokenization tools, e.g. for [Thai](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) and [Arabic](https://github.com/CAMeL-Lab/camel_tools).
3. Use these pages to test your code: [1](https://www.bangkokair.com/tha/baggage-allowance) and [2](https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86).

In [None]:
%pip install langdetect

In [None]:
%pip install camel-tools

In [None]:
%pip install pythainlp
%pip install epitran
%pip install python-crfsuite

In [15]:
from langdetect import detect
from camel_tools.tokenizers.word import simple_word_tokenize
import pythainlp
import nltk

class MultilingualHtmlDocumentTextData(HtmlDocumentTextData):
    
    def __init__(self, url):
        super().__init__(url)
        self.language = detect(self.doc.text)

    def _arabian_sent_tokenize(self, string):
        string = re.sub("؟", "?", string)
        return nltk.sent_tokenize(string)

    def get_sentences(self):
        """
        Creates a list of sentences from the <body> tag of the content.
        Supported languages: English, Thai, Arabic.
        Applies special method for each language. 

        :return: list of sentences.
        """
        body_strings = doc.doc.soup.body.strings
        body = re.sub('\s{2,}', ' ', " ".join(list(body_strings)).replace('\n', ' ')).strip()   

        methods = {
            'th': pythainlp.sent_tokenize,
            'ar': self._arabian_sent_tokenize,
            'en': nltk.sent_tokenize
        }

        result = methods[self.language](body)
        return result
    
    def get_word_stats(self):
        """
        Creates a counter of each word in the content text.
        Supported languages: English, Thai, Arabic.
        Applies special method for each language.        

        :return: counter.
        """
        counter = Counter()

        methods = {
            'th': pythainlp.word_tokenize,
            'ar': simple_word_tokenize,
            'en': nltk.word_tokenize
        }
        words = [word.lower() for word in methods[self.language](self.doc.text) if word not in list(punctuation)]
        for word in words:
            if word == ' ':
                continue
            counter[word] += 1
        return counter


### 1.5.1. Tests

In [16]:
doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
print(doc.get_word_stats().most_common(10))

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82")
print(doc.get_word_stats().most_common(10))

[('สัมภาระ', 34), ('การ', 25), ('เรา', 24), ('และ', 22), ('ของ', 21), ('ที่', 21), ('กิโลกรัม', 21), ('เดินทาง', 17), ('เที่ยวบิน', 16), ('บริการ', 16)]
[('تعليق', 12), ('مشاهده', 10), ('الإمارات', 5), ('بن', 4), ('زايد', 4), ('في', 4), ('الفجر', 4), ('و', 4), ('فن', 4), ('أخبار', 3)]


Let's also check whether it works for sentences.

In [17]:
from pprint import pprint

doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
pprint(doc.get_sentences())

['ตรวจสอบรายละเอียด ที่นี่ ',
 'สำหรับข่าวประกาศต่างๆ เกี่ยวกับ COVID-19 '
 'รวมถึงตารางปรับลดเที่ยวบินของบางกอกแอร์เวย์ส × ',
 'เว็บไซต์นี้มีการใช้งานคุกกี้ (Cookies) '
 'เพื่อจัดการข้อมูลส่วนบุคคลและช่วยเพิ่มประสิทธ์ภาพในการใช้งานเว็บไซต์ของท่าน ',
 'ท่านสามารถศึกษารายละเอียดเพิ่มเติมและการตั้งค่าคุกกี้ได้ที่ '
 'นโยบายการใช้คุ้กกี้ โดย คลิกที่นี่ ภาษาไทย English ภาษาไทย 繁體中文 简体中文 '
 'สกุลเงิน : USD THB SGD MYR USD GBP EUR CNY JPY เข้าสู่ระบบ '
 'สมัครสมาชิกฟลายเออร์โบนัส ภาษาไทย English ภาษาไทย 繁體中文 简体中文 '
 'ประกาศระงับเที่ยวบินในประเทศทุกเส้นทาง ',
 'บางกอกแอร์เวย์สยกเลิกเที่ยวบินภายในประเทศทุกเส้นทางเป็นการชั่วคราว '
 'ระหว่างวันที่ 7-30 เมษายน 2563 '
 'ผู้โดยสารสามารถดำเนินการเพื่อขอรับเงินคืนผ่านทาง www.bangkokair.com/refund '
 'หรือติดต่อศูนย์บริการข้อมูลลูกค้า โทร 1771 (ตลอด 24 ชั่วโมง) 02 270 6699 '
 'หรือ สำนักงานออกบัตรโดยสารบางกอกแอร์เวย์สทั่วประเทศ ',
 'ในกรณีที่ผู้โดยสารออกบัตรโดยสารผ่านทางสำนักงานตัวแทนจำหน่าย ',
 'กรุณาติดต่อตัวแทนจำหน่ายฯ โดยตรง ',
 'ไม่ต้องแสดงสิ่งนี

In [18]:
from pprint import pprint

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82")
pprint(doc.get_sentences())

['JavaScript seems to be disabled in your browser.',
 'You must have JavaScript enabled in your browser to utilize the '
 'functionality of this website.',
 'جريده يومية - سياسية - مستقله تأسست عام 1975 تحميل ... أبو ظبى - المزيد '
 'الموافق : الأخبار : رئيس الدولة ونائبه ومحمد بن زايد يهنئون الملكة إليزابيث '
 'الثانية بمناسبة ذكرى اعتلاء جلالتها العرش محمد بن زايد يعزي هاتفيا ملك '
 'المغرب في وفاة الطفل ريان منصور بن زايد يعتمد لائحة إجراءات قانون الزواج '
 'والطلاق المدني للأجانب في إمارة أبوظبي الهلال الأحمر ويونيليفر جلف يوقعان '
 'اتفاقية تعاون في المجال الإنساني نافورة الإمارات أيقونة تنثر الفرح والبهجة '
 'في مهرجان الشيخ زايد Toggle navigation الرئيسية أخبار الأمارات أخبار عربية '
 'ودولية حوادث وقضايا مال وأعمال الفجر الرياضى صحة و تغذية معرض الصور الفيديو '
 'ملحق الجريدة مقالات الكتاب منوعات همس الفجر فن عربى ثقافة المرأة حول العالم '
 'محلية علوم طب فن اجنبى مجتمع الإمارات سياحة الأصدارات السابقة 1 يناير 1970 '
 'المصدر : تعليق مشاهدة طباعة التعليقات لا يوجد تعليقات اضف ت