# Fiction Bot IV
### A notebook-based automated fiction scraper + EPub generator
This notebook scrapes and downloads every chapter of a web novel from a given index URL (biquge.com.cn), then generates a well-formatted EPub ebook, complete with a **table of contents**.

In [2]:
from bs4 import BeautifulSoup
import requests
import os
import shutil

In [None]:
base_url = "https://www.biquge.com.cn/"
url = "https://www.biquge.com.cn/book/34885/"

In [None]:
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
book_title = soup.h1.text
author = soup.find("meta", attrs={"property":"og:novel:author"})['content']

### Create Folder Structure for EPub
An EPub needs these two folders under its root directory:
- META-INF
- OPS

Plus a file:
- mimetype

In [None]:
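# exist_ok=True makes this cell safe to re-run; makedirs also creates the root folder.
# (With the earlier try/except, an existing META-INF would stop OPS from being created.)
os.makedirs(f"./{book_title}/META-INF", exist_ok=True)
os.makedirs(f"./{book_title}/OPS", exist_ok=True)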

Write `application/epub+zip` to the mimetype file

In [None]:
with open(f"./{book_title}/mimetype", "w") as tmp:
    tmp.write("application/epub+zip")

Create `container.xml` file

In [None]:
with open(f"./{book_title}/META-INF/container.xml", "w") as tmp:
    tmp.write('''<?xml version="1.0" encoding="UTF-8" ?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
''')
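
Reading systems open `META-INF/container.xml` first and follow its `rootfile` entry to locate the package document. A minimal sketch of that lookup, using only the standard library. The helper name `find_opf_path` is hypothetical and not part of the notebook:

```python
import xml.etree.ElementTree as ET

# Namespace used by the container file written above.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = '''<?xml version="1.0" encoding="UTF-8" ?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
'''


def find_opf_path(container_xml):
    """Return the full-path of the first rootfile (hypothetical helper)."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(container_xml.encode("utf-8"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path")


print(find_opf_path(CONTAINER_XML))  # OPS/content.opf
```

If this lookup fails on the generated file, the reader would likely not find `content.opf` either.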

In [None]:
opfcontent = '''<?xml version="1.0" encoding="UTF-8" ?>
<package version="2.0" unique-identifier="PrimaryID" xmlns="http://www.idpf.org/2007/opf">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
%(metadata)s
<meta name="cover" content="cover"/>
</metadata>
<manifest>
%(manifest)s
<item id="ncx" href="content.ncx" media-type="application/x-dtbncx+xml"/>
<item id="cover" href="cover.jpg" media-type="image/jpeg"/>
</manifest>
<spine toc="ncx">
%(ncx)s
</spine>
</package>
'''
dc = '<dc:%(tag)s>%(value)s</dc:%(tag)s>'
item = "<item id='%(id)s' href='%(url)s' media-type='application/xhtml+xml'/>"
itemref = "<itemref idref='%(id)s'/>"
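metadata = '\n'.join([
    dc % {'tag': 'title', 'value': book_title},
    dc % {'tag': 'creator', 'value': author},
    dc % {'tag': 'description', 'value': "This document was generated automatically by Fiction Bot IV"},
    # OPF 2.0 also expects a language and an identifier; the identifier's id must
    # match unique-identifier="PrimaryID" on <package>. The index URL is used here.
    dc % {'tag': 'language', 'value': 'zh-CN'},
    f'<dc:identifier id="PrimaryID">{url}</dc:identifier>',
])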
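
Titles and author names scraped from a web page can contain `&`, `<`, or `>`, which would make the generated OPF ill-formed if substituted into the `dc` template unchanged. A minimal sketch of escaping values first with `xml.sax.saxutils.escape`. The helper `dc_element` and the sample title are invented for illustration:

```python
from xml.sax.saxutils import escape

DC_TEMPLATE = '<dc:%(tag)s>%(value)s</dc:%(tag)s>'


def dc_element(tag, value):
    """Render one Dublin Core element with the value XML-escaped (hypothetical helper)."""
    return DC_TEMPLATE % {"tag": tag, "value": escape(value)}


print(dc_element("title", "Swords & Sorcery <Vol. 1>"))
# <dc:title>Swords &amp; Sorcery &lt;Vol. 1&gt;</dc:title>
```

The same escaping would apply to chapter titles placed in the NCX.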

In [None]:
ncxcontent = '''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">
<head>
  <meta name="dtb:uid" content=" "/>
  <meta name="dtb:depth" content="-1"/>
  <meta name="dtb:totalPageCount" content="0"/>
  <meta name="dtb:maxPageNumber" content="0"/>
</head>
 <docTitle><text>%(title)s</text></docTitle>
 <docAuthor><text>%(creator)s</text></docAuthor>
<navMap>
%(navpoints)s
</navMap>
</ncx>
'''
navpoint = '''<navPoint id='%s' class='level1' playOrder='%d'>
<navLabel> <text>%s</text> </navLabel>
<content src='%s'/></navPoint>'''
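
The `navpoint` template is filled positionally, so the order of the tuple matters: id, play order, label, then target file. A small sketch that renders one entry and parses it back to confirm it is well-formed. The chapter values are hypothetical:

```python
import xml.etree.ElementTree as ET

NAVPOINT = '''<navPoint id='%s' class='level1' playOrder='%d'>
<navLabel> <text>%s</text> </navLabel>
<content src='%s'/></navPoint>'''

# Hypothetical chapter entry, shaped like the items stored in menu_info.
chapter = {"id": 1, "link": "chapter_1.html", "chapter": "Chapter 1"}

rendered = NAVPOINT % (chapter["link"], chapter["id"], chapter["chapter"], chapter["link"])

# Parsing the fragment raises ParseError if the substitution broke the XML.
node = ET.fromstring(rendered)
print(node.get("id"), node.get("playOrder"), node.find("navLabel/text").text)
# chapter_1.html 1 Chapter 1
```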

Fetch the `<dd>` tags that hold the menu entries and store them in `menu_raw`.

In [None]:
menu_raw = soup.find_all('dd')

Then parse the href and chapter title from each tag and store them in `menu_info`.

In [None]:
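menu_info = []
for c, d in enumerate(menu_raw, 1):
    menu_info.append({
        'id': c,                       # 1-based chapter number, reused as playOrder
        'link': f'chapter_{c}.html',   # local file name inside OPS/
        'url': d.a['href'],            # chapter link as found on the index page
        'chapter': d.text,             # chapter title shown in the table of contents
    })

menu_info[0]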

In [None]:
manifest = []
ncx = []
navpoints = []
for m in menu_info:
    manifest.append(item % {'id': m['link'], 'url':m['link']})
    ncx.append(itemref % {'id': m['link']})
    navpoints.append(navpoint % (m['link'], m['id'], m['chapter'], m['link']))

In [None]:
manifest = '\n'.join(manifest)
ncx = '\n'.join(ncx)

In [None]:
with open(f'./{book_title}/OPS/content.opf', 'w', encoding='utf-8') as tmp:
    tmp.write(opfcontent % {
        'metadata': metadata,
        'manifest': manifest,
        'ncx': ncx,
    })

In [None]:
with open(f'./{book_title}/OPS/content.ncx', 'w', encoding='utf-8') as tmp:
    tmp.write(ncxcontent % {
        'title': book_title,
        'creator': author,
        'navpoints': '\n'.join(navpoints)
    })

## Download!

In [3]:
os.getcwd()

'/private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs'

In [None]:
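# The first <img> on the index page is assumed to be the cover.
cover_img = soup.find("img")
if cover_img:
    cover_img = cover_img['src']
    img = requests.get(cover_img, stream=True)
    if img.status_code == 200:
        # Let urllib3 undo any gzip/deflate content encoding before the raw copy.
        img.raw.decode_content = True
        with open(f"./{book_title}/OPS/cover.jpg", "wb") as f:
            shutil.copyfileobj(img.raw, f)
    del img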
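
The cover is saved as `cover.jpg` and declared as `image/jpeg` in the manifest, whatever the server actually returns. If the site serves a PNG or GIF, some readers may not display it. A rough sketch of checking the file signature before trusting the extension. `sniff_image_type` is a hypothetical helper and covers only three common formats:

```python
def sniff_image_type(data):
    """Guess a media type from the leading bytes of an image (hypothetical helper)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return None  # unknown or not an image


# Made-up byte strings that only carry the signatures, not real images.
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # image/jpeg
print(sniff_image_type(b"<html>not an image</html>"))        # None
```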

In [None]:
template = '''<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="css/main.css"/>
<title>%(title)s</title>
</head>
<body>
<h2>%(title)s</h2>
<div>
%(content)s
</div>
</body>
</html>
'''
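
The `%(content)s` slot expects XHTML paragraphs. Text nodes pulled from a chapter page may be whitespace-only or contain `&` and `<`, which would give empty paragraphs or ill-formed markup. A minimal sketch of one way to build the paragraph block. `to_paragraphs` is a hypothetical helper and the sample strings are invented:

```python
from html import escape


def to_paragraphs(strings):
    """Turn raw text nodes into escaped <p> elements, skipping blank ones."""
    paragraphs = []
    for s in strings:
        text = s.replace("\xa0", "").strip()
        if text:  # drop nodes that held only whitespace or non-breaking spaces
            paragraphs.append(f"<p>{escape(text, quote=False)}</p>")
    return "\n".join(paragraphs)


sample = ["\xa0\xa0\xa0\xa0First line.", "\n", "Tom & Jerry"]
print(to_paragraphs(sample))
# <p>First line.</p>
# <p>Tom &amp; Jerry</p>
```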

In [None]:
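from urllib.parse import urljoin

for ch in menu_info:
    t = ch['link']
    print("Processing " + t)
    # Resolve the chapter href against the index page URL, as a browser would;
    # plain concatenation with base_url can produce a double slash.
    source = requests.get(urljoin(url, ch['url']))
    soup = BeautifulSoup(source.text, "html.parser")
    # find_all(string=True) is the current spelling of findAll(text=True).
    sentences = soup.find("div", attrs={'id': 'content'}).find_all(string=True)
    contents = []
    for s in sentences:
        tmp = s.replace('\xa0', '')
        contents.append(f'<p>{tmp}</p>')
    # The template declares charset=utf-8, so write the file as UTF-8 explicitly.
    with open(f'./{book_title}/OPS/{t}', 'w', encoding='utf-8') as f:
        f.write(template % {
            'title': ch['chapter'],
            'content': '\n'.join(contents)
        })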
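
Chapter links taken from the index page can be root-relative, page-relative, or absolute, and plain string concatenation handles only one of those forms. `urllib.parse.urljoin` resolves all three the way a browser would. A small sketch; the href values are invented examples, not taken from the site, and `chapter_url` is a hypothetical helper:

```python
from urllib.parse import urljoin

INDEX_URL = "https://www.biquge.com.cn/book/34885/"


def chapter_url(index_url, href):
    """Resolve a chapter href against the index page URL (hypothetical helper)."""
    return urljoin(index_url, href)


# Three hypothetical href styles that all resolve to the same address.
for href in ("/book/34885/1.html",
             "1.html",
             "https://www.biquge.com.cn/book/34885/1.html"):
    print(chapter_url(INDEX_URL, href))
# https://www.biquge.com.cn/book/34885/1.html  (printed three times)
```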

## Pack the EPub Book!

An EPub is a zip archive, so this step zips the folder's contents into a `*.epub` file.

In [None]:
# Collect all files in the folder
file_paths = []
for root, directories, files in os.walk(f'./{book_title}'): 
    for filename in files: 
        # join the two strings in order to form the full filepath. 
        filepath = os.path.join(root, filename) 
        file_paths.append(filepath)

In [None]:
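from zipfile import ZipFile, ZIP_DEFLATED, ZIP_STORED

book_dir = f"./{book_title}"
with ZipFile(f"{book_dir}.epub", "w") as z:
    # The mimetype entry goes first and uncompressed so readers can identify the file.
    z.write(f"{book_dir}/mimetype", "mimetype", compress_type=ZIP_STORED)
    for f in file_paths:
        # Store paths relative to the book folder so META-INF/ sits at the archive root.
        arcname = os.path.relpath(f, book_dir)
        if arcname != "mimetype":
            z.write(f, arcname, compress_type=ZIP_DEFLATED)

print(f"Congratulations, {book_title}.epub has been freshly made!")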
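
EPub readers generally expect `mimetype` to be the first, uncompressed entry and `META-INF/container.xml` to sit at the archive root. A hedged sketch of a quick layout check that can be run on the packed file. `check_epub_layout` is a hypothetical helper, not a full validator, and the demo archive is a throwaway built in a temporary folder:

```python
import os
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED, ZIP_STORED


def check_epub_layout(path):
    """Return a list of layout problems found in the archive (empty if none)."""
    problems = []
    with ZipFile(path) as z:
        infos = z.infolist()
        names = [info.filename for info in infos]
        if not infos or infos[0].filename != "mimetype":
            problems.append("mimetype is not the first entry")
        elif infos[0].compress_type != ZIP_STORED:
            problems.append("mimetype is compressed")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is not at the archive root")
    return problems


# Build a tiny archive with the expected layout and run the check on it.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.epub")
    with ZipFile(demo, "w") as z:
        z.writestr("mimetype", "application/epub+zip", compress_type=ZIP_STORED)
        z.writestr("META-INF/container.xml", "<container/>", compress_type=ZIP_DEFLATED)
    print(check_epub_layout(demo))  # []
```

An archive whose entries are all nested under the book folder name would be reported with both problems.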

*Reference: https://www.jianshu.com/p/75b993cd2f68*
## The End