maxsun/WikipediaSummaries

Wikipedia Text Corpus + Generator Code

This repo contains code for scanning Wikipedia and recording article summaries and links.

scan.py scrapes article summaries and saves the following data for each article, indexed by article URL:

{
    'title': str,
    'summary': str,
    'links': [str],
    'doa': datetime
}
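
To make the record layout concrete, here is a minimal sketch of one entry as a typed Python dictionary. The `ArticleRecord` and `corpus` names, the example URL, and all values are hypothetical and only illustrate the shape above; the meaning of `doa` is not documented here, so the comment treats it as a timestamp by assumption.

```python
from datetime import datetime
from typing import Dict, List, TypedDict


class ArticleRecord(TypedDict):
    """Shape of one saved article (hypothetical name, fields from the README)."""
    title: str
    summary: str
    links: List[str]
    doa: datetime  # assumed to be a timestamp; its exact meaning is not documented


# Illustrative entry only, not real scraped data. Records are keyed by article URL.
corpus: Dict[str, ArticleRecord] = {
    "https://en.wikipedia.org/wiki/Example": {
        "title": "Example",
        "summary": "A placeholder summary.",
        "links": ["https://en.wikipedia.org/wiki/Other_example"],
        "doa": datetime(2020, 1, 1),
    }
}
```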

wiki.json.gz contains the currently collected data. Most recently calculated stats:

python make_readme.py will rebuild this Readme and update the stats.
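
How make_readme.py computes its stats is not shown here. As a rough sketch, assuming wiki.json.gz is a gzip-compressed JSON object keyed by article URL (and that `doa` is stored as a string, since JSON has no datetime type), the data could be loaded and summarized as follows. The `load_corpus` and `basic_stats` helpers are hypothetical and are not part of the repo.

```python
import gzip
import json


def load_corpus(path="wiki.json.gz"):
    """Load the collected data, assuming a gzip-compressed JSON object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def basic_stats(corpus):
    """Count articles and outgoing links (illustrative stats only)."""
    total_links = sum(len(rec.get("links", [])) for rec in corpus.values())
    return {"articles": len(corpus), "links": total_links}
```

Calling `basic_stats(load_corpus())` from the repo root would then return the two counts for the current wiki.json.gz, provided the file matches the assumed layout.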

python make_textcorpus.py will create wiki.txt, a file with all of the summaries combined.
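
The combining step can be sketched as below. This is not the code in make_textcorpus.py; it assumes the corpus is a dictionary of records with a `summary` field and writes one summary per line, which is an arbitrary choice made for the example. The `build_text_corpus` name is hypothetical.

```python
def build_text_corpus(corpus, out_path="wiki.txt"):
    """Write every non-empty summary to out_path, one per line.

    Returns the number of summaries written.
    """
    stripped = (rec.get("summary", "").strip() for rec in corpus.values())
    summaries = [s for s in stripped if s]  # skip missing or blank summaries
    with open(out_path, "w", encoding="utf-8") as f:
        for s in summaries:
            f.write(s + "\n")
    return len(summaries)
```

A one-summary-per-line layout keeps the file easy to stream; a summary that itself contains newlines would span several lines in the output.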
