GitHub - KodeWorker/NovelAlchemist: This project is for NaNoGenMo 2017.

Novel Alchemist

This project is for NaNoGenMo 2017. The goal is to generate a novel that can fool human readers. Progress can be followed on the development blog (it's in Traditional Chinese :D). The project has four major parts:

Web scraping - scrape free or public domain books of a given genre
Text regularization - extract the regular content from the scraped text
Text generation - use an LSTM to generate sentences
GAN novel generation - use a GAN to build a novel generator

Dependencies

BeautifulSoup (bs4)
Requests (requests)
ebooklib (ebooklib; forked and modified from ebooklib)
Matplotlib (matplotlib)

Clone this project and submodules

git clone --recursive https://github.com/KodeWorker/NovelAlchemist.git

Web scraping

Source

My favorite option is Project Gutenberg. However, its terms of use clearly state:

The Project Gutenberg website is for human users only. Any real or perceived use of automated tools to access our site will result in a block of your IP address. This site utilizes cookies, captchas and related technologies to help assure the site is maximally available for human users only.

The second option is Feedbooks. This site has a similar term to prevent scrapers, but the language is kinda okay. I would try my luck there if things get desperate.

6.15 use any robot, spider, scraper, or other automated means to access the FeedBooks Website bypass any measures FeedBooks may use to prevent or restrict access to the FeedBooks Website

Finally, I found Manybooks. This site contains books from Project Gutenberg and other internet archives. Most importantly, it has no rules against web scraping (or I am just too blind to find them).
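Besides reading the terms of use, a quick extra check is the site's robots.txt. The snippet below is only a sketch using Python's standard urllib.robotparser; the rules in `sample` and the example.com URLs are made up for illustration and are not taken from any of the sites above. A permissive robots.txt does not override a site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
sample = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample, "https://example.com/books/1"))
print(is_allowed(sample, "https://example.com/private/x"))
```

For a live site, the robots.txt content would first have to be downloaded (for example with requests) and then passed to `is_allowed` as text.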

Details

Run the scraper. The default selection is "English Sci-Fi Novels". To scrape a different genre or language, set the select parameter of the sel_genre and sel_language functions to None (select=None).

python ./web_scraping/novel_scraper.py
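To illustrate the idea behind select=None, here is a minimal sketch of how such a parameter could behave: a fixed index picks the default, while None falls back to asking the user. The function `choose_option`, its `ask` argument, and the genre names are hypothetical and are not the actual code of sel_genre or sel_language.

```python
def choose_option(options, select=0, ask=input):
    """Return one entry from `options`.

    select: index of the preset choice, or None to ask the user.
    ask: function used to read the answer (the built-in input by default).
    """
    if select is not None:
        return options[select]
    # List the choices so the user can pick one by number.
    for index, name in enumerate(options):
        print(f"[{index}] {name}")
    while True:
        answer = ask("Select a number: ").strip()
        try:
            index = int(answer)
        except ValueError:
            index = -1  # not a number, treat as invalid
        if 0 <= index < len(options):
            return options[index]
        print("Invalid choice, please try again.")


# Placeholder genre names, for illustration only.
genres = ["Science Fiction", "Fantasy", "Mystery"]
print(choose_option(genres, select=0))
```

Calling `choose_option(genres, select=None)` would print the numbered list and wait for input instead of returning the preset entry.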

Text Regularization

(under construction)
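Until this part is written, the following is only a rough sketch of what a first regularization pass might look like, assuming the scraped text is hard-wrapped plain text with blank lines between paragraphs. The function name `regularize` and the sample text are illustrative assumptions, not part of the repository.

```python
import re


def regularize(raw_text):
    """Collapse hard-wrapped lines into paragraphs separated by one blank line."""
    # Normalize Windows and old Mac line endings first.
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    # Paragraphs are assumed to be separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Join wrapped lines and squeeze runs of whitespace into single spaces.
        cleaned = " ".join(block.split())
        if cleaned:
            paragraphs.append(cleaned)
    return "\n\n".join(paragraphs)


sample = "It was a dark\nand stormy   night.\n\n\nThe rain fell\r\nin torrents."
print(regularize(sample))
```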

Development Records

2017/07/19 - start building the web scraper
2017/07/20 - complete the web scraper
2017/07/21 - start building the text regularization

To Do List

Text generation
GAN novel generation
