# Crawlig Laban

- https://dict.laban.vn/

Curl for en-vi and the word train:

```
curl 'https://dict.laban.vn/find?type=1&query=train' --compressed -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Connection: keep-alive' -H 'Referer: https://dict.laban.vn/' -H 'Upgrade-Insecure-Requests: 1' -H 'Sec-Fetch-Dest: document' -H 'Sec-Fetch-Mode: navigate' -H 'Sec-Fetch-Site: same-origin' -H 'Sec-Fetch-User: ?1'
```

In [1]:
import requests
import os

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Connection': 'keep-alive',
    "Referer": "https://dict.laban.vn/",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
}

params = {
    "type": "1",
    "query": "train",
}

response = requests.get("https://dict.laban.vn/find", params=params, headers=headers)

if response.status_code != 200:
    print("Failed to get response")
    exit(1)
else:
    if "example_found.html" not in os.listdir():
        with open("example_found.html", "w+") as f:
            f.write(response.text)

## Parser

Potential elements to select by:

```html
<li class="slide_content " rel="0">
    <div id="content_selectable" class="content">
        <div class="bg-grey bold font-large m-top20"><span>Danh từ</span></div>
        <div class="green bold margin25 m-top15">xe lửa, (khẩu ngữ) tàu</div>

```

In [2]:
import bs4
from bs4 import BeautifulSoup

html_content = response.content

soup = BeautifulSoup(html_content, "html.parser")

In [3]:
li_elements = soup.select('li.slide_content:not([class*=" "])')

In [4]:
found_children = list(li_elements[0].children)

In [5]:
e = found_children[1]
e, type(e)

(<div class="content" id="content_selectable">
 <div class="bg-grey bold font-large m-top20"><span>Danh từ</span></div>
 <div class="green bold margin25 m-top15">xe lửa, (khẩu ngữ) tàu</div>
 <div class="color-light-blue margin25 m-top15"><a class="find_word" dict="1" href="javascript:void(0);" rel="a">a</a> <a class="find_word" dict="1" href="javascript:void(0);" rel="passenger">passenger</a> train</div>
 <div class="margin25">xe lửa hành khách</div>
 <div class="color-light-blue margin25 m-top15"><a class="find_word" dict="1" href="javascript:void(0);" rel="express">express</a> train</div>
 <div class="margin25">xe lửa tốc hành</div>
 <div class="color-light-blue margin25 m-top15"><a class="find_word" dict="1" href="javascript:void(0);" rel="if">if</a> <a class="find_word" dict="1" href="javascript:void(0);" rel="you">you</a> <a class="find_word" dict="1" href="javascript:void(0);" rel="miss">miss</a> <a class="find_word" dict="1" href="javascript:void(0);" rel="the">the</a> train <a

Processing:

Iterate through all children of the `slide-content` class div and look for divs with the following classes:

1. `bg-grey bold` are the word category of this definition (Noun, Verb etc.)
2. `green bold` are the translation
3. `color-light-blue` is an example in english
4. `margin25` is the same example in vietamese

We can iterate one by one and construct the following objects:

- definition
- translation
- examples:
  - first find english, then matching vietnamese
  - only if two exist at the same time, construct the object and append to a list
    - can pop a list for this

In [6]:
t = list(e.children)[1]

In [7]:
t.attrs["class"]

['bg-grey', 'bold', 'font-large', 'm-top20']

In [9]:
from typing import List


class Translation:
    def __init__(self):
        self.text: str = None
        self.examples: List[(str, str)] = []
        self.current_ex = (None, None)

    def current_ex_filled(self):
        return all(self.current_ex)

    def fill_current_ex(self, en_ex=None, vi_ex=None):
        if en_ex:
            self.current_ex = (en_ex, self.current_ex[1])

        if vi_ex:
            self.current_ex = (self.current_ex[0], vi_ex)

        if self.current_ex_filled():
            self.current_examples.append(self.current_ex)
            self.current_ex = (None, None)


class DictEntry:
    def __init__(self):
        self.category: str = None
        self.translations: List[Translation] = []

    def filled(self):
        return all([self.category, self.translations])

    def __str__(self) -> str:
        return f"{self.category}: {self.translations}"

    def __repr__(self) -> str:
        return self.__str__()


def get_div_type(div):
    category = ["bg-grey", "bold"]
    translation = ["green", "bold"]
    ex_en = ["color-light-blue"]
    ex_vi = ["margin25"]

    if all([c in div.attrs["class"] for c in category]):
        return "category"
    elif all([c in div.attrs["class"] for c in translation]):
        return "translation"
    elif all([c in div.attrs["class"] for c in ex_en]):
        return "ex_en"
    elif all([c in div.attrs["class"] for c in ex_vi]):
        return "ex_vi"
    else:
        return None

In [68]:
dict_entry = DictEntry()
cur_translation = Translation()
completed_entries = []
for c in e.children:
    if type(c) != bs4.element.Tag:
        continue

    print("processing", c.text, c.attrs["class"])

    div_type = get_div_type(c)

    if div_type == "category":
        # New entry, save the current one
        if dict_entry.filled():
            print(dict_entry.category)
            completed_entries.append(dict_entry)
            dict_entry = DictEntry()
        dict_entry.category = c.text
    elif div_type == "translation":
        cur_translation.text = c.text
    elif div_type == "ex_en":
        cur_translation.fill_current_ex(en_ex=c.text)
    elif div_type == "ex_vi":
        cur_translation.fill_current_ex(vi_ex=c.text)

# Check if the last entry is filled
if dict_entry.filled():
    print(dict_entry.category)
    completed_entries.append(dict_entry)

processing Danh từ ['bg-grey', 'bold', 'font-large', 'm-top20']
processing xe lửa, (khẩu ngữ) tàu ['green', 'bold', 'margin25', 'm-top15']
processing a passenger train ['color-light-blue', 'margin25', 'm-top15']
processing xe lửa hành khách ['margin25']


In [67]:
completed_entries

[Danh từ: đuôi dài lê thê (của áo phụ nữ), Động từ: uốn, nắn (cây, cảnh)]