# Fiction Bot IV
### A notebook based automated fiction scraper + EPub generator
This notebook is able to scrape and download all chapters from a provided internet novel url (biquge.com.cn), then auto generate a well-formatted ePub ebook, with **Table Of Contents** of course!

In [1]:
from bs4 import BeautifulSoup
import requests
import os
import shutil

In [73]:
base_url = "xbiquge.so"
book_url = "https://www.xbiquge.so/book/53005/"

In [4]:
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
book_title = soup.h1.text
author = soup.find("meta", attrs={"property":"og:novel:author"})['content']

### Create Folder Structure for EPub
These two folders are necessary under the root directory
- META-INF
- OPS

Plus a file:
- mimetype

In [7]:
try:
    os.mkdir(f"./{book_title}")
except:
    print(f"Folder exists: ./{book_title}")
    pass

try:
    os.mkdir(f"./{book_title}/META-INF")
    os.mkdir(f"./{book_title}/OPS")
except:
    pass

Write `application/epub+zip` to the mimetype file

In [8]:
with open(f"./{book_title}/mimetype", "w") as tmp:
    tmp.write("application/epub+zip")

Create `container.xml` file

In [9]:
with open(f"./{book_title}/META-INF/container.xml", "w", encoding="utf-8") as tmp:
    tmp.write('''<?xml version="1.0" encoding="UTF-8" ?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>\n    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>\n  </rootfiles>
</container>
''')

In [10]:
opfcontent = '''<?xml version="1.0" encoding="UTF-8" ?>
<package version="2.0" unique-identifier="PrimaryID" xmlns="http://www.idpf.org/2007/opf">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
%(metadata)s
<meta name="cover" content="cover"/>
</metadata>
<manifest>
%(manifest)s
<item id="ncx" href="content.ncx" media-type="application/x-dtbncx+xml"/>
<item id="cover" href="cover.jpg" media-type="image/jpeg"/>
</manifest>
<spine toc="ncx">
%(ncx)s
</spine>
</package>
'''
dc = '<dc:%(tag)s>%(value)s</dc:%(tag)s>'
item = "<item id='%(id)s' href='%(url)s' media-type='application/xhtml+xml'/>"
itemref = "<itemref idref='%(id)s'/>"
metadata = '\n'.join([
    dc % {'tag': 'title', 'value': book_title},
    dc % {'tag': 'creator', 'value': author},
    dc % {'tag': 'decription', 'value': "本文档由Fiction Bot IV自动生成"},
])

In [11]:
ncxcontent = '''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">
<head>
  <meta name="dtb:uid" content=" "/>
  <meta name="dtb:depth" content="-1"/>
  <meta name="dtb:totalPageCount" content="0"/>
  <meta name="dtb:maxPageNumber" content="0"/>
</head>
 <docTitle><text>%(title)s</text></docTitle>
 <docAuthor><text>%(creator)s</text></docAuthor>
<navMap>
%(navpoints)s
</navMap>
</ncx>
'''
navpoint = '''<navPoint id='%s' class='level1' playOrder='%d'>
<navLabel> <text>%s</text> </navLabel>
<content src='%s'/></navPoint>'''

Fetch all HTML tags of the menu entries, store in `menu_raw`

In [39]:
menu_raw = soup.find_all('dd')
menu_raw = menu_raw[12:]

Then parse href and chapter titiles from each HTML tag, store the information in `menu_info`.

In [47]:
class MenuInfo:
    
    
    def __init__(self, url, chapter_title):
        self.url = url
        self.chapter_title = chapter_title
        self.id = None
        self.epub_link = None
    
    def get_title(self):
        return self.chapter_title
    
    def get_url(self):
        return self.get_url
    
    def __str__(self):
        return f"{self.id}: {self.chapter_title} - {self.url} - {self.epub_link}"

In [54]:
menu_info = []
for index, data in enumerate(menu_raw, 1):
    try:
        m = MenuInfo(url = data.a['href'], chapter_title = data.text)
        menu_info.append(m)
    except:
        pass

# menu_preprocessing.sort(key=lambda x: x.url)
for i in range(0, 10):
    print(str(menu_info[-1-i]))


None: 新书《终末的绅士》已发布 - 39728608.html - None
None: 新书《终末的绅士》及简介 - 39714241.html - None
None: 新书预告Part.2 - 39503263.html - None
None: 新书预告Part.1 - 39396290.html - None
None: 完本感言 - 39394738.html - None
None: 第二千一百六十七章 我的细胞监狱（大结局） - 39391185.html - None
None: 第二千一百六十六章 命运合同与混沌之道 - 39375344.html - None
None: 第二千一百六十五章 线与道路 - 39375089.html - None
None: 第二千一百六十四章 工作交接 - 39374077.html - None
None: 第二千一百六十三章 本质 - 39373711.html - None


In [55]:
for c, d in enumerate(menu_info, 1):
    try:
        d.id = c
        d.epub_link = f'chapter_{c}.html'
    except:
        print(d)
    
for i in range(0, 10):
    print(str(menu_info[i]))

1: 第一章 神秘的监狱 - 35935698.html - chapter_1.html
2: 第二章 韩东的发现 - 35935700.html - chapter_2.html
3: 第三章 从头开始 - 35935701.html - chapter_3.html
4: 第四章 上吊的青年 - 35935702.html - chapter_4.html
5: 第五章 便携式监狱 - 35935703.html - chapter_5.html
6: 第六章 祭典广场 - 35935704.html - chapter_6.html
7: 第七章 命运空间 - 35935705.html - chapter_7.html
8: 第八章 六人小队 - 35935706.html - chapter_8.html
9: 第九章 王婆 - 35935707.html - chapter_9.html
10: 第十章 噩梦 - 35935708.html - chapter_10.html


In [56]:
# {
#     'id': c, 
#     'link': f'chapter_{c}.html', 
#     'url':d.a['href'], 
#     'chapter':d.text
# }

manifest = []
ncx = []
navpoints = []
for m in menu_info:
    manifest.append(item % {'id': m.epub_link, 'url':m.epub_link})
    ncx.append(itemref % {'id': m.epub_link})
    navpoints.append(navpoint % (m.epub_link, m.id, m.chapter_title, m.epub_link))

In [57]:
manifest = '\n'.join(manifest)
ncx = '\n'.join(ncx)

In [58]:
with open(f'./{book_title}/OPS/content.opf', 'w', encoding="utf-8") as tmp:
    tmp.write(opfcontent % {
        'metadata': metadata,
        'manifest': manifest,
        'ncx': ncx,
    })

In [59]:
with open(f'./{book_title}/OPS/content.ncx', 'w', encoding="utf-8") as tmp:
    tmp.write(ncxcontent % {
        'title': book_title,
        'creator': author,
        'navpoints': '\n'.join(navpoints)
    })

## Download!

In [60]:
os.getcwd()

'C:\\Users\\jackz\\Developer\\FictionBot-IV'

In [61]:
cover_img = soup.find("img")
if cover_img:
    cover_img = cover_img['src']
    img = requests.get(cover_img, stream=True)
    if img.status_code == 200:
        with open(f"./{book_title}/OPS/cover.jpg", "wb") as f:
            shutil.copyfileobj(img.raw, f)
    del img

In [62]:
template = '''<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="css/main.css"/>
<title>%(title)s</title>
</head>
<body> c
<h2>%(title)s</h2>
<div>
%(content)s
</div>
</body>
</html>
'''

In [72]:
ch = menu_info[0]
print(ch.url)
source = requests.get(book_url + ch.url)
soup = BeautifulSoup(source.text, "html.parser")
sentences = soup.find("div", attrs={'id':'content'})
sentences

35935698.html


<div id="content" name="content">笔趣阁 www.xbiquge.so，最快更新我的细胞监狱 ！<br/><br/>    污水横流、菌斑肆掠。<br><br>     某一废弃的监狱深处……<br/><br/>     啪！<br/><br/>     一团由质膜包裹的细胞团，竟从某具尸体的表面分离了出来。<br/><br/>     细胞团体仅有人类食指大小，宏观外表就像一团白色鼻涕。<br/><br/>     这一细胞团似乎具备着独立的思维与行动能力。<br/><br/>     但由于神经系统的不完善性，并无五感。<br/><br/>     不过，在细胞团与外物接触时，能通过‘细胞间信号传递’的特别手段，获取物质的基础信息。<br/><br/>     分离出来的细胞没有过久的停留，开始移动了。<br/><br/>     转录与翻译。<br/><br/>     肌动蛋白产生。<br/><br/>     这团手指头大小的细胞群体开始进行极为缓慢的‘迁移运动’，大约与蜗牛的移动速度相当。<br/><br/>     “不行……不够完美，也不是我想要的。”<br/><br/>     细胞团似乎主动舍弃了这具看似完整且强壮的肉体，继续在监狱里展开搜寻……<br/><br/>     若当前能有一束火把提供照明。<br/><br/>     你会发现这一处巨型牢房中，由细胞团所舍弃的肉体达到上百具。<br/><br/>     …………<br/><br/>     细胞团并非自主形成。<br/><br/>     它有着自己的名字-韩东。<br/><br/>     华侨，意大利佛罗伦萨大学生命科学院的副教授，在成为这团细胞凝聚体前，他刚好31岁。<br/><br/>     于2018年7月21日，因肺癌死在佛罗伦萨中心医院的病床上。<br/><br/>     病房里摆满着鲜花，然而这些鲜花却来源于他的学生，而非家人。<br/><br/>     在死亡的一刻，病痛折磨消散一空，韩东反而感觉释然。<br/><br/>     没有升入天堂或是堕入地狱，也没有喝下孟婆汤、走上来奈何桥、经历所谓的轮回转世。<br/><br/>     迎接他的只有无尽黑暗而已。<br/><br/>     然而，他的意识却在这一过程中始终存

In [64]:
for ch in menu_info:
    t = ch.epub_link
    print("正在下载：" + t)
    source = requests.get(book_url + ch.url)
    soup = BeautifulSoup(source.text, "html.parser")
    sentences = soup.find("div", attrs={'id':'content'}).findAll(text=True)
    contents = []
    for s in sentences:
        tmp = s.replace('\xa0', '')
        contents.append(f'<p>{tmp}</p>')
    with open(f'./{book_title}/OPS/{t}', 'w', encoding="utf-8") as f:
        f.write(template % {
            'title': ch.chapter_title,
            'content': '\n'.join(contents)
        })

Processing chapter_1.html


AttributeError: 'NoneType' object has no attribute 'findAll'

## Pack the EPub Book!

In this step, we will zip the folder then turn it into a \*.epub package.

In [20]:
# Collect all files in the folder
file_paths = []
for root, directories, files in os.walk(f'./{book_title}'): 
    for filename in files: 
        # join the two strings in order to form the full filepath. 
        filepath = os.path.join(root, filename) 
        file_paths.append(filepath)

In [21]:
from zipfile import ZipFile
with ZipFile(f"./{book_title}.epub", "w") as z:
    for f in file_paths:
        z.write(f)
        
print(f"Congratulations, {book_title}.epub has been freshly made!")

Congratulations, 我的细胞监狱.epub has been freshly made!


*Reference: https://www.jianshu.com/p/75b993cd2f68*
## The End