In [1]:
# Name of Creator
CREATOR_NAME = "Jingheng Wang"

This notebook scrapes the readings and meanings for the entire 20K word list from [Jisho](https://jisho.org). A .py file with temporary-saving features should sit in the same directory as this notebook. It does the same job but is more convenient if you want to run the program on a server. Scraping the whole word list takes **A LONG TIME**: by my calculation, more than 40K requests are made before the program terminates, and with the 0.5-second pause after each request the waiting alone adds up to more than five and a half hours.

Running all the cells (or the .py file) generates a "wikiword_table_new.csv" file containing the word/reading/meaning entries for the whole 20K word list.
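The temporary-saving idea can be sketched roughly as follows. This is not the code from the companion .py file: the helper names `save_checkpoint` and `load_checkpoint` and the file name `checkpoint.csv` are assumptions made for illustration, and only the standard library is used.

```python
import csv
import os

CHECKPOINT_FILE = "checkpoint.csv"  # hypothetical file name, not from the original project


def save_checkpoint(rows, path=CHECKPOINT_FILE):
    """Write all (word, reading, meaning) rows collected so far to `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "reading", "meaning"])
        writer.writerows(rows)
    # Replace the old checkpoint only after the new one is fully written.
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return previously saved rows, or an empty list if no checkpoint exists."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return [tuple(row) for row in reader]
```

Calling `save_checkpoint` every few hundred words inside the scraping loop would limit how much work an interruption can lose, and `load_checkpoint` would let a restarted run skip the words already collected.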

In [2]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
from time import sleep
from random import randint

In [3]:
df = pd.read_csv('wikitionary_wordlist.csv')
sr = df['word']

In [4]:
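# Kana that read as themselves when Jisho gives them no furigana.
# Fixes: the hiragana pa-row was written in katakana, and katakana オ was mistyped as ロ.
# The small kana っゃゅょ and ッ are added so that words containing them keep their full reading.
kamei_list = list("あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをんがぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽっゃゅょアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲンガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポッャュョー")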


In [5]:
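def get_furigana(soup):
    """Return (reading, word) for the first entry on a Jisho word page."""
    furi = soup.find('span', class_='furigana')
    # The word text sits two siblings after the furigana span,
    # skipping the node in between.
    txt = furi.next_sibling.next_sibling
    text_cont = list(str(txt.get_text()).strip())
    furistring = ""
    txtstring = ""
    text_loc = 0
    # Each child of the furigana span lines up with one character of the word.
    for f1 in furi.children:
        if f1 is not None and f1 != '\n':
            t1 = text_cont[text_loc]
            txtstring += t1
            if f1.get_text() == '' and t1 in kamei_list:
                # Kana have no furigana of their own, so they read as themselves.
                furistring += t1
            else:
                furistring += f1.get_text().strip()
            text_loc += 1
    return (furistring, txtstring)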
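
The alignment that `get_furigana` performs can be illustrated without any HTML. The sketch below assumes the furigana slots have already been extracted into a list of strings, one slot per character of the word and empty where a character carries no annotation. `build_reading` and `is_kana` are hypothetical names, and the Unicode range check stands in for the lookup in `kamei_list`.

```python
def is_kana(ch):
    """True for characters in the Unicode hiragana and katakana blocks (U+3040 to U+30FF)."""
    return "\u3040" <= ch <= "\u30ff"


def build_reading(slots, text):
    """Combine per-character furigana slots with the word's own characters.

    slots: one string per character of `text`; '' means no furigana is given.
    text:  the word as written, e.g. kanji mixed with kana.
    """
    reading = ""
    for slot, ch in zip(slots, text):
        if slot == "" and is_kana(ch):
            # Kana carry no annotation, so the character is its own reading.
            reading += ch
        else:
            reading += slot
    return reading


# The slot layout for this word is an assumed input, not scraped data.
print(build_reading(["みっ", ""], "三つ"))  # → みっつ
```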

In [6]:
def get_meanings(soup):
    """Return all numbered definitions on the page, each terminated by '$'."""
    meanings = ""
    raw_meanings = soup.find_all('div', class_='meaning-wrapper')
    for x in raw_meanings:
        flag = x.find_all('span', class_='meaning-definition-section_divider')
        if flag:
            # The divider span holds the counter ("1. "); the definition text follows it.
            tag = flag[0]
            meanings += tag.get_text() + tag.next_sibling.get_text() + '$'
    return meanings
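
Because `get_meanings` joins the definitions with `$` and leaves a trailing `$`, anyone reading the CSV later has to split the string again. A minimal sketch, where `split_meanings` is a hypothetical helper and the assumption is that no definition itself contains `$`:

```python
import re


def split_meanings(joined):
    """Split a '$'-terminated meaning string into a list of definitions.

    The leading 'N. ' counter is stripped from each item.
    """
    items = [part for part in joined.split("$") if part]
    return [re.sub(r"^\d+\.\s*", "", item) for item in items]


sample = ("1. topic marker particle$"
          "2. indicates contrast with another option (stated or unstated)$"
          "3. adds emphasis$4. Ha (kana)$")
print(split_meanings(sample))
```

The sample string is the entry for は from the captured output of the scraping cell.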

In [8]:
wordlist = []
furigana = []
meanings = []

url_base = 'https://jisho.org/word/'

requests = 0
i = 0

for word in sr:
    i = 0  # suffix counter: 0 requests the bare word, n requests "word-n"
    while (True):
        if (i == 0):
            url = url_base+word
        else:
            url = url_base+word+'-'+str(i)
            
        #print(url)
        
        response = get(url)
        requests += 1
        print("Requests Made: {}, status {}".format(requests, response.status_code))
        sleep(0.5)
        
        i += 1
        
        if (response.status_code != 200):
            if (response.status_code == 408):
                # Timed Out
                i -= 1
                continue
            if (response.status_code != 404):
                print("Error: code {} at word {}, i={}".format(response.status_code, word, i-1))
            break
        else:
            soup = BeautifulSoup(response.text, 'html.parser')
            furi, txt = get_furigana(soup)
            mean = get_meanings(soup)
            print(txt+' / '+furi+' / '+mean)
            wordlist.append(txt)
            furigana.append(furi)
            meanings.append(mean)
    
   
    

Requests Made: 1, status 200
の / の / 1. indicates possessive$2. nominalizes verbs and adjectives$3. substitutes for "ga" in subordinate phrases$4. (at sentence-end, falling tone) indicates a confident conclusion$5. (at sentence-end) indicates emotional emphasis$6. (at sentence-end, rising tone) indicates question$7. No (kana)$
Requests Made: 2, status 404
Requests Made: 3, status 200
に / に / 1. at (place, time); in; on; during$2. to (direction, state); toward; into$3. for (purpose)$4. because of (reason); for; with$5. by; from$6. as (i.e. in the role of)$7. per; in; for; a (e.g. "once a month")$8. and; in addition to$9. if; although$10. Ni (kana)$
Requests Made: 4, status 404
Requests Made: 5, status 404
Requests Made: 6, status 200
は / は / 1. topic marker particle$2. indicates contrast with another option (stated or unstated)$3. adds emphasis$4. Ha (kana)$
Requests Made: 7, status 404
Requests Made: 8, status 200
を / を / 1. indicates direct object of action$2. indicates subject of cau

KeyboardInterrupt: 
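
The loop above probes `https://jisho.org/word/<word>`, then `<word>-1`, `<word>-2`, and so on until a request fails. The sketch below isolates that URL sequence and adds a randomized pause in place of the fixed 0.5 s (the notebook imports `randint` but never uses it). `candidate_urls` and `polite_delay` are hypothetical names, and the delay bounds are arbitrary assumptions.

```python
from itertools import count
from random import uniform

URL_BASE = "https://jisho.org/word/"


def candidate_urls(word):
    """Yield the word, word-1, word-2, ... URLs in the order the loop requests them."""
    for i in count():
        yield URL_BASE + word if i == 0 else "{}{}-{}".format(URL_BASE, word, i)


def polite_delay(low=0.5, high=1.5):
    """Return a random pause in seconds between `low` and `high` (assumed bounds)."""
    return uniform(low, high)


urls = candidate_urls("三つ")
print(next(urls))  # → https://jisho.org/word/三つ
print(next(urls))  # → https://jisho.org/word/三つ-1
```

The generator never stops on its own; the caller is expected to break out once a request returns 404, as the loop above does.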

In [None]:
pd.DataFrame({
        'word': wordlist,
        'reading': furigana,
        'meaning': meanings
        }).to_csv('wikiword_table_new.csv')