This project is an entry for NaNoGenMo 2017. Its goal is to generate a novel that can fool human readers. Progress can be followed on the development blog (it's in Traditional Chinese :D). The project consists of four major parts:
- Web scraping - scrape free or public domain books of a given genre
- Text regularization - extract the regular content from the scraped text
- Text generation - use LSTM to generate sentences
- GAN novel generation - use GAN to build a novel generator
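As a rough illustration of the text regularization part listed above, here is a minimal sketch. It assumes the scraped book has already been extracted to plain text and that it carries the usual Project Gutenberg start/end marker lines; the function name `regularize` and the paragraph-joining rule are hypothetical and are not taken from this repository.

```python
import re

# Marker lines found in Project Gutenberg plain-text files, e.g.
# "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***" (older files use THIS, newer ones THE).
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def regularize(raw_text):
    """Hypothetical helper: keep only the book body and tidy its paragraphs."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    if finish < begin:
        # Markers in an unexpected order: fall back to keeping everything after the start.
        finish = len(raw_text)
    body = raw_text[begin:finish].replace("\r\n", "\n")
    # Split on blank lines, then join hard-wrapped lines inside each paragraph.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", body)]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = (
        "Header junk\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n\n"
        "It was a dark\nand stormy night.\n\n"
        "The end.\n\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "License text"
    )
    print(regularize(sample))
```

Books from other archives may not have these markers, in which case the sketch simply keeps the whole text and only normalizes the paragraphs.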
- BeautifulSoup (bs4)
- Requests (requests)
- ebooklib (ebooklib, forked and modified from ebooklib)
- Matplotlib (matplotlib)
Clone this project and its submodules:
git clone --recursive https://github.com/KodeWorker/NovelAlchemist.git
My favorite website option is Project Gutenberg. However, its terms of use clearly state:
The Project Gutenberg website is for human users only. Any real or perceived use of automated tools to access our site will result in a block of your IP address. This site utilizes cookies, captchas and related technologies to help assure the site is maximally available for human users only.
The second option is Feedbooks. This site has a similar term against scrapers, but the wording is kinda okay. I would try my luck there if things get desperate.
6.15 use any robot, spider, scraper, or other automated means to access the FeedBooks Website bypass any measures FeedBooks may use to prevent or restrict access to the FeedBooks Website
Finally, I found Manybooks. This site hosts books from Project Gutenberg and other internet archives. Most importantly, it has no rules against web scraping (or I am just too blind to find them).
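Besides reading the terms of use by hand, a site's robots.txt can be checked programmatically before scraping. The sketch below uses Python's standard `urllib.robotparser`; the helper `is_allowed`, the rules in `EXAMPLE_ROBOTS`, the user agent name and the example.com URLs are all made up for illustration and do not reflect the actual rules of any site mentioned here. It is not part of this project's scraper, and a permissive robots.txt does not replace the terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()

if __name__ == "__main__":
    agent = "NovelAlchemistBot"  # made-up user agent name
    print(is_allowed(EXAMPLE_ROBOTS, agent, "https://example.com/books/1"))
    print(is_allowed(EXAMPLE_ROBOTS, agent, "https://example.com/private/x"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of the in-memory lines used here.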
- Run the scraper:
The default selection is "English Sci-Fi Novels".
If you want to scrape a different genre or language, set the parameter select=None in the functions sel_genre and sel_language.
python ./web_scraping/novel_scraper.py
(under construction)
- 2017/07/19 - start building the web scraper
- 2017/07/20 - complete the web scraper
- 2017/07/21 - start building the text regularization
- Text generation
- GAN novel generation