<a href="https://colab.research.google.com/github/drew-chien/dictionary-crawler/blob/main/dictionary_crawler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enter the words to look up

In [217]:
# words to look up in the Cambridge English-Chinese (Traditional) dictionary
words = ['test', 'bombard', 'compensation']
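
Each word is appended directly to the dictionary URL in the data-processing cell, so it may help to tidy the list first. The following is a minimal sketch, not part of the original notebook: `clean_words` and `build_url` are hypothetical helpers, and the assumption is that stray whitespace, empty entries, and duplicates are unwanted and that the word should be percent-encoded before it goes into the URL.

```python
from urllib.parse import quote

# Base URL copied from the data-processing cell.
BASE_URL = 'https://dictionary.cambridge.org/dictionary/english-chinese-traditional/'

def clean_words(raw_words):
  """Strip whitespace, drop empty entries, and remove duplicates while keeping order."""
  stripped = (w.strip() for w in raw_words)
  return list(dict.fromkeys(w for w in stripped if w))

def build_url(word):
  """Percent-encode the word so spaces or non-ASCII characters cannot break the URL."""
  return BASE_URL + quote(word, safe='')

# Hypothetical input with a padded entry, an empty entry, and a duplicate.
cleaned = clean_words(['test', ' bombard ', '', 'test', 'compensation'])
urls = [build_url(w) for w in cleaned]
```

Whether the site accepts percent-encoded multi-word entries is not something the notebook confirms, so treat `build_url` as a starting point only.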

# Data processing

In [218]:
import requests
from bs4 import BeautifulSoup
import re
import docx
from google.colab import drive

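class Definition():
  def __init__(self, meaning):
    self.meaning = meaning.text
    print('meaning: ' + self.meaning)
    # the translation and examples sit in the element right after the definition header
    body = meaning.find_next_sibling()
    # chinese (guard against senses that have no translation span)
    chinese = body.find('span', {'lang': 'zh-Hant'}) if body else None
    self.chinese = chinese.text if chinese else ''
    print('chinese: ' + self.chinese)
    # examples (strip the single trailing newline from each one)
    self.examples = []
    for example in (body.find_all('div', {'class': 'examp'}) if body else []):
      self.examples.append(re.sub('\n$', '', example.text))
    print(self.examples)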


class WordData():
  def __init__(self, word):
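    # Request the dictionary page, sending a browser-like user agent
    r = requests.get('https://dictionary.cambridge.org/dictionary/english-chinese-traditional/' + word,
              headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'},
              timeout = 10)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    # name the parser explicitly; lxml is installed as a python-docx dependency
    soup = BeautifulSoup(r.content, 'lxml')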

    # get the title
    title = soup.find('title').text
    print('raw title: ' + title)
    self.title = re.sub(r' \|.*', '', title)  # raw string avoids the invalid '\|' escape
    print('clean title: ' + self.title)

    # get the pronunciations
    self.pronunciations = []
    for pron in soup.find_all('span', {'class': 'pron'}):
      print('pron: ' + pron.text)
      self.pronunciations.append(pron.text)
    print(self.pronunciations)

    # get the meaning
    self.definitions = []
    for meaning in soup.find_all('div', {'class': 'ddef_h'}):
      self.definitions.append(Definition(meaning))

  def ToDoc(self, doc):
    p = doc.add_paragraph(self.title, style='List Bullet')
    p.add_run('\n')
    first = True
    for pron in self.pronunciations:
      if not first:
        p.add_run(', ')
      first = False
      p.add_run(pron)
    
    for definition in self.definitions:
      p = doc.add_paragraph(definition.meaning, style='List Bullet 2')
      p.add_run('\n')
      p.add_run(definition.chinese)
      for example in definition.examples:
        doc.add_paragraph(example, style='List Bullet 3')

# look up each word and append its entry to the document
doc = docx.Document()
for word in words:
  wordData = WordData(word)
  wordData.ToDoc(doc)

# mount the drive and then output the file to the drive
drive.mount('/content/drive')
doc.save('/content/drive/MyDrive/helloWorld.docx')

raw title: test | translate to Traditional Chinese: Cambridge Dictionary
clean title: test
pron: /test/
pron: /test/
pron: /test/
pron: /test/
['/test/', '/test/', '/test/', '/test/']
meaning: A1 a way of discovering, by questions or practical activities, what someone knows, or what someone or something can do or is like 
chinese: 測驗，考查
[' The class are doing/having a spelling test today.\n今天班裡有一個拼寫測驗。', ' She had to take/do an aptitude test before she got the job.\n她先接受了能力測試後才得到這份工作。']
meaning: B1 a medical examination of part of your body in order to find out how healthy it is or what is happening with it 
chinese: （體格）檢查；化驗
[' a blood/urine test\n驗血／尿檢', ' an eye test\n視力檢查', ' a pregnancy test\n妊娠檢查', " The doctors have done some tests to try and find out what's wrong with her.\n醫生做了一些檢查，想查出她的問題出在哪裡。"]
meaning:  an act of using something to find out if it is working correctly or how effective it is 
chinese: 試驗
[' The new missiles are currently undergoing tests.\n新導彈目前正在進行試驗。']
mea
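
The output above shows the same pronunciation collected four times, and each example stored as an English sentence and its Chinese translation joined by a newline. A small post-processing sketch follows, assuming that repeated pronunciations are unwanted and that splitting each example into a pair is useful; `clean_title`, `unique_in_order`, and `split_example` are hypothetical helpers, not part of the notebook.

```python
import re

def clean_title(raw_title):
  """Keep only the headword, i.e. everything before the first ' |'."""
  return re.sub(r' \|.*', '', raw_title)

def unique_in_order(items):
  """Drop repeated entries while preserving first-seen order."""
  return list(dict.fromkeys(items))

def split_example(example):
  """Split an 'English<newline>Chinese' example into an (english, chinese) pair."""
  english, _, chinese = example.strip().partition('\n')
  return english.strip(), chinese.strip()

# Values copied from the output above.
title = clean_title('test | translate to Traditional Chinese: Cambridge Dictionary')
prons = unique_in_order(['/test/', '/test/', '/test/', '/test/'])
pair = split_example(' an eye test\n視力檢查')
```

With helpers like these, `ToDoc` could write the English and Chinese halves of an example as separate runs, though the notebook itself keeps them in one paragraph.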

# Download and install all required packages

In [216]:
# install python-docx, which is used to write the .docx file
!pip install python-docx
!pip show python-docx

Name: python-docx
Version: 0.8.10
Summary: Create and update Microsoft Word .docx files.
Home-page: https://github.com/python-openxml/python-docx
Author: Steve Canny
Author-email: python-docx@googlegroups.com
License: The MIT License (MIT)
Location: /usr/local/lib/python3.6/dist-packages
Requires: lxml
Required-by: 
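
Because this install cell has to run before the data-processing cell can `import docx`, a quick availability check may be handy. This is a minimal sketch; `is_installed` is a hypothetical helper, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

def is_installed(module_name):
  """Return True if a top-level module with this name can be found."""
  return importlib.util.find_spec(module_name) is not None

# python-docx is imported as `docx`, so that is the name to check.
if not is_installed('docx'):
  print('python-docx is missing; run the pip install cell first')
```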
