Corpus-DB

Corpus-DB is a textual corpus database for the digital humanities. This project aggregates public domain texts, enhances their metadata from sources like Wikipedia, and makes those texts available according to that metadata. This will make it easy to download subcorpora like:

Bildungsromans
Dickens novels
Poetry published in the 1880s
Novels set in London
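
As a rough illustration of selecting a subcorpus by metadata, here is a minimal Python sketch against an in-memory SQLite database. The table name, column names, the `subcorpus` helper, and the sample rows are all assumptions made for the example; the real Corpus-DB schema may look quite different.

```python
import sqlite3

# Hypothetical schema and illustrative rows only; not the real Corpus-DB layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
    "genre TEXT, year INTEGER, setting TEXT)"
)
conn.executemany(
    "INSERT INTO texts (title, author, genre, year, setting) VALUES (?, ?, ?, ?, ?)",
    [
        ("Great Expectations", "Dickens, Charles", "novel", 1861, "London"),
        ("Bleak House", "Dickens, Charles", "novel", 1853, "London"),
        ("Middlemarch", "Eliot, George", "novel", 1871, "Midlands"),
    ],
)


def subcorpus(conn, **filters):
    """Return titles whose metadata matches every column=value filter."""
    allowed = {"author", "genre", "year", "setting"}
    unknown = set(filters) - allowed
    if unknown:
        # Column names are interpolated into the SQL, so whitelist them.
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    where = " AND ".join(f"{col} = ?" for col in filters) or "1"
    rows = conn.execute(
        f"SELECT title FROM texts WHERE {where} ORDER BY title",
        tuple(filters.values()),
    )
    return [title for (title,) in rows]


dickens_novels = subcorpus(conn, author="Dickens, Charles", genre="novel")
london_novels = subcorpus(conn, genre="novel", setting="London")
```

Values are passed as bound parameters, and only whitelisted column names are ever placed into the query string.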

Corpus-DB has several components:

Scripts for aggregating metadata, written in Python
The database, currently a few SQLite databases
A REST API for querying the database, written in Haskell (currently in progress)
Analytic experiments, mostly in Python

Read more about the database at this introductory blog post. Scripts used to generate the database are in the gitenberg-experiments repo.

Contributing

I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.

Hacking

To build the website and API, you'll need Stack, the Haskell build tool. Then run:

stack build
cd src
export ENV=dev
stack runhaskell Main.hs

Setting ENV=dev points the database path at /data/dev.db, a 30-row subset of the main database. The main database is too big (16GB at the moment) to put on GitHub, so use this dev database for hacking. If you need the full database for some reason, let me know.
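
The idea of picking a database path from the ENV variable can be sketched in a few lines of Python. This is not the project's Haskell code. Only the dev path comes from this README; the `prod` entry, the `database_path` helper, and the fallback to `dev` when ENV is unset are assumptions for illustration.

```python
import os

# Only the dev path is stated in the README.
# The "prod" path below is a hypothetical placeholder.
DB_PATHS = {
    "dev": "/data/dev.db",
    "prod": "/data/full.db",  # hypothetical
}


def database_path(environ=os.environ):
    """Map the ENV variable to a database path (assumed to default to dev)."""
    env = environ.get("ENV", "dev")
    if env not in DB_PATHS:
        raise ValueError(f"unsupported ENV value: {env!r}")
    return DB_PATHS[env]


dev_path = database_path({"ENV": "dev"})
```

Passing the environment in as an argument keeps the lookup easy to test without touching the real process environment.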

Upcoming Changes

I'm rewriting Corpus-DB from scratch (see issues labeled 2.0). The goal is to make the whole toolchain repeatable, in case of data loss, and future-proof, so that it can ingest new texts from Project Gutenberg (PG) and other sources as they arrive. Feel free to help out with this! The planned steps:

Parse Project Gutenberg RDF/XML metadata, and put it into a database.
Mirror PG, using an rsync script.
Clean PG texts, and add them to that database. Also add HTML files.
Write an ORM-level database layer, using Persistent, for more native DB interactions and typesafe queries.
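
To give a feel for the first step, here is a minimal Python sketch that parses a Project Gutenberg-style RDF/XML record and stores it in SQLite. The sample record is heavily simplified, and the `metadata` table, its columns, and the helper names are assumptions for the example, not the planned 2.0 design.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Simplified stand-in for a real PG catalog record (real files have many more fields).
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Austen, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_ebook(rdf_text):
    """Extract the ebook id, title, and author names from one RDF record."""
    root = ET.fromstring(rdf_text.strip())
    ebook = root.find("pgterms:ebook", NS)
    about = ebook.get(f"{{{NS['rdf']}}}about")  # e.g. "ebooks/1342"
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    return {
        "id": int(about.rsplit("/", 1)[-1]),
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "authors": "; ".join(n.text for n in names),
    }


def store(conn, record):
    """Insert or overwrite one record, so re-running an import is harmless."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(id INTEGER PRIMARY KEY, title TEXT, authors TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (:id, :title, :authors)", record
    )


conn = sqlite3.connect(":memory:")
record = parse_ebook(SAMPLE_RDF)
store(conn, record)
store(conn, record)  # a second run leaves a single row
```

Using `INSERT OR REPLACE` keyed on the ebook id is one way to keep the import repeatable, which is the main aim of the rewrite.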
