In [1]:
import os
import numpy as np
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Function definition
Most of the commands in this function will be used and explained later.

In [2]:
def extract_html(url, save_dir, chapter_name):
    """
    Download one chapter page and save it as an html file.

    url: URL of the chapter page;
    save_dir: folder (named after the novel) where the file is written;
    chapter_name: chapter title taken from the contents page, used to name the file.
    """
    r = requests.get(url)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'html.parser')

    # Extract the chapter title from this ESJZone page.
    title_h2 = soup.find_all('h2')[0].get_text()
    # Extract the article body.
    content = soup.find('div', attrs={'class': 'forum-content mt-3'})
    # Create an html file and write the title and article into it.
    file_path = os.path.join(save_dir, '{}.html'.format(chapter_name))
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(title_h2)
        f.write(str(content))

# Use BeautifulSoup to extract the content

Paste the ESJZone URL here. It should be the URL of the target novel's contents page (the page that lists all chapters).

In [3]:
url = 'https://www.esjzone.cc/detail/1557379934.html'

Fetch the page with requests and parse it with Beautiful Soup.

In [4]:
r=requests.get(url)
r.encoding='utf-8'
soup = BeautifulSoup(r.text[10000:])
soup

<html><body><p>der&gt;

</p><div class="offcanvas-wrapper" id="">
<div class="page-title">
<div class="container">
<div class="column">
<h1>輕小說</h1>
</div>
<div class="column">
<ul class="breadcrumbs">
<li><a href="/">Home</a></li>
<li class="slash"> / </li>
<li>輕小說</li>
</ul>
</div>
</div>
</div>
<section class="container">
<div class="row">
<div class="col-xl-9 col-lg-8 p-r-30">
<div class="row mb-3">
<div class="col-md-3">
<div class="product-gallery text-center mb-3">
<a href="https://i.loli.net/2021/07/30/Ih9RzSaJVxt214W.jpg" target="_blank"><img alt="嘆息的亡靈想引退" src="https://i.loli.net/2021/07/30/Ih9RzSaJVxt214W.jpg"/></a>
</div>
<div class="text-center">
<div class="d-inline align-baseline display-3 mr-1">4.9</div>
<div class="rating-stars">
<i class="icon-star-s filled"></i><i class="icon-star-s filled"></i><i class="icon-star-s filled"></i><i class="icon-star-s filled"></i><i class="icon-star-s"></i> </div>
</div>
<a class="btn btn-secondary btn-block btn-load-modal-data" data-p

As can be seen, the page contains far more markup than we need, so only an excerpt of the output is shown above.<br/>
The next goal is to find the URL of every chapter and then visit each chapter page to back up the translated article.

# Create a folder named by the novel title

Create a folder named after the novel under ESJZone if it does not already exist.

In [5]:
novel_dict = soup.find('div', attrs={'class': 'product-gallery text-center mb-3'})
novel_title = novel_dict.find('img', alt=True)['alt']
# Remove characters that are not allowed in file names.
novel_title = ''.join(x for x in novel_title if x not in '\\/:*?"<>|')

save_dir = './ESJZone/{}/'.format(novel_title)
# makedirs also creates the parent ESJZone folder when it is missing.
os.makedirs(save_dir, exist_ok=True)
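
The folder-creation step above can be sketched as a small reusable helper. This is a minimal sketch, not the notebook's own code: `sanitize_name` and `make_novel_dir` are hypothetical names, and the set of forbidden characters is an assumption based on what Windows rejects in file names.

```python
import os
import tempfile

# Characters assumed to be disallowed in file names (the Windows set).
FORBIDDEN = '\\/:*?"<>|'

def sanitize_name(name):
    """Drop forbidden characters and surrounding whitespace from a name."""
    return ''.join(ch for ch in name if ch not in FORBIDDEN).strip()

def make_novel_dir(base_dir, novel_title):
    """Create base_dir/<sanitized title>, including any missing parents."""
    path = os.path.join(base_dir, sanitize_name(novel_title))
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path

# Usage with a temporary base folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = make_novel_dir(os.path.join(tmp, 'ESJZone'), 'A/B: Test?')
    demo_exists = os.path.isdir(demo_dir)
    demo_name = os.path.basename(demo_dir)
```

Calling `make_novel_dir` a second time with the same title is harmless, because `exist_ok=True` turns the "already exists" case into a no-op.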

Verify that this is the novel we want to download.

In [6]:
print(novel_title)

嘆息的亡靈想引退


Find the section that holds all chapters. `find` locates the `div` with `id="chapterList"`, which contains the title and the URL of every chapter listed on the contents page.

In [7]:
content = soup.find('div', attrs={'id': 'chapterList'})
content

<div id="chapterList"><p class="non">第一部</p>
<a data-title="1 成員招募 https://www.esjzone.cc/forum/1557379934/79352.html" href="https://www.esjzone.cc/forum/1557379934/79352.html" target="_blank"><p>1 成員招募 </p></a>
<a data-title="2 成員招募② https://www.esjzone.cc/forum/1557379934/79353.html" href="https://www.esjzone.cc/forum/1557379934/79353.html" target="_blank"><p>2 成員招募② </p></a>
<a data-title="3 成員招募③ https://www.esjzone.cc/forum/1557379934/79354.html" href="https://www.esjzone.cc/forum/1557379934/79354.html" target="_blank"><p>3 成員招募③ </p></a>
<a data-title="4 成員招募④ https://www.esjzone.cc/forum/1557379934/79355.html" href="https://www.esjzone.cc/forum/1557379934/79355.html" target="_blank"><p>4 成員招募④ </p></a>
<a data-title="5 帝都 https://www.esjzone.cc/forum/1557379934/79356.html" href="https://www.esjzone.cc/forum/1557379934/79356.html" target="_blank"><p>5 帝都 </p></a>
<a data-title="6 懲罰遊戲 https://www.esjzone.cc/forum/1557379934/79357.html" href="https://www.esjzone.cc/forum/155737993

Extract the links from the chapter list with find_all('a') and store them in a Python list called "links". The chapter titles go into a second Python list called "chapters"; they are used later to name the saved article html files.

In [8]:
links = content.find_all('a')
chapters = content.find_all('p', attrs={'class': None})
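
Because `links` and `chapters` come from two separate searches, they line up only if the chapter list contains exactly one class-less `<p>` per `<a>`. A minimal sketch of an alternative, assuming the chapter list has the structure shown above, reads the title and the URL from the same `<a>` tag. The sample HTML and the names `sample_soup`, `chapter_list`, and `chapter_pairs` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the structure of the chapter list.
sample_html = '''
<div id="chapterList"><p class="non">Part 1</p>
<a href="https://www.esjzone.cc/forum/1/10.html"><p>1 First </p></a>
<a href="https://www.esjzone.cc/forum/1/11.html"><p>2 Second </p></a>
</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')
chapter_list = sample_soup.find('div', attrs={'id': 'chapterList'})

# Read title and URL from the same tag so they cannot get out of step.
chapter_pairs = [
    (a.get_text().strip(), a.get('href'))
    for a in chapter_list.find_all('a', href=True)
]
```

Each entry of `chapter_pairs` is a `(title, url)` tuple, so a loop over it needs no index into a second list.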

# Loop over links and download the title and article in the link
### Loop over every URL stored in links, back up the article at that URL, and save it in an html file named after the chapter.

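Skip any links that do not point to ESJZone.<br/>
Sometimes the first few links in the list point to Baidu, Twitter, or some other site; simply skip them. For example, if the first three links are external, set num=3.<br/>
A "list index out of range" error is usually caused by such links.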
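
Instead of counting how many leading links to skip, the links could be filtered by host name. This is a sketch under the assumption that every chapter hosted on ESJZone has a URL whose host ends in `esjzone.cc`, as in the chapter list shown earlier. `is_esj_link` is a hypothetical helper, and the example URLs other than the first are made up.

```python
from urllib.parse import urlparse

def is_esj_link(url, domain='esjzone.cc'):
    """Return True if url points at the given domain or one of its subdomains."""
    host = urlparse(url or '').hostname or ''
    return host == domain or host.endswith('.' + domain)

example_urls = [
    'https://www.esjzone.cc/forum/1557379934/79352.html',
    'https://tieba.baidu.com/p/123',          # made-up external link
    'https://twitter.com/someone/status/1',   # made-up external link
    None,                                     # an <a> tag without href
]
kept = [u for u in example_urls if is_esj_link(u)]
```

Filtering this way also catches external links that appear in the middle of the list, which a fixed `num` cannot do.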

In [9]:
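num = 0  # number of leading links to skip (see above)
hashmap = {}  # chapter names already used, to detect duplicates

for i, link_i in enumerate(tqdm(links)):

    if i < num:
        continue
    url = link_i.get('href')
    chapter_name0 = chapters[i].get_text()
    # Remove characters that are not allowed in file names.
    chapter_name0 = ''.join(x for x in chapter_name0 if x not in '\\/:*?"<>|')
    if chapter_name0 in hashmap:
        # Duplicate title: append the link index to keep file names unique.
        chapter_name = chapter_name0 + '-{}'.format(i)
        hashmap[chapter_name0] += 1
    else:
        chapter_name = chapter_name0
        hashmap[chapter_name0] = 1

    extract_html(url, save_dir, chapter_name)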



100%|██████████| 289/289 [02:14<00:00,  2.16it/s]
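
The duplicate-name handling in the loop can be isolated into a small function, which makes it easy to check without downloading anything. A minimal sketch: `unique_names` is a hypothetical helper, and unlike the loop above it appends a running count rather than the link index.

```python
def unique_names(names):
    """Return names with '-N' appended to repeats, N = earlier occurrences."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else '{}-{}'.format(name, count))
    return result

demo_names = unique_names(['Prologue', 'Chapter 1', 'Prologue', 'Prologue'])
```

Here the first `Prologue` keeps its name and the two repeats become `Prologue-1` and `Prologue-2`, so no saved file overwrites another.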
