
WikiScraper

WikiScraper is a Python library for scraping Wikipedia articles. It can scrape a single page or recursively follow every linked article, and writes its output as .txt or .csv files.

Features

  • Scrape a single Wikipedia page or all linked articles recursively.
  • Supports .txt and .csv output formats.
  • Optionally add titles to scraped content.
  • Configurable logging: log only file saves, or log all actions.
  • Append all scraped articles into a single file or save separately.
  • Works with Wikipedia in any language and handles errors gracefully.
  • Polite crawling with configurable delay between requests.

Installation

pip install wikiscraper-py

Usage

Scrape a single page

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True)
scraper.scrape_one("https://en.wikipedia.org/wiki/Python")

Scrape all linked articles

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="txt", add_title=True, all_on_one_file=True, polite_time=2)
scraper.scrape_all("https://en.wikipedia.org/wiki/Python")

CSV Output Example

from wikiscraper import WikiScraper

scraper = WikiScraper(file_type="csv", add_title=True, all_on_one_file=True)
scraper.scrape_all("https://en.wikipedia.org/wiki/Ethiopia")

  • If add_title=True and the output is CSV:
    • The first column will contain the article title.
    • The second column will contain the article text.
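For example, the combined CSV can be read back with Python's standard csv module. This is a sketch assuming the default data/wikipedia_all.csv output path described under Directory Structure:

import csv

# Each row is [title, text] when add_title=True (per the columns described above)
with open("data/wikipedia_all.csv", newline="", encoding="utf-8") as f:
    for title, text in csv.reader(f):
        print(title, "-", len(text), "characters")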

Parameters

  • file_type: 'txt' or 'csv'. Default is 'txt'.
  • add_title: Add the article title at the top of the file or as the first CSV column. Default is False.
  • log_saving: Log only file saves. Default is True.
  • log_all: Log all actions including errors and skipped links. Default is False.
  • polite_time: Delay between requests in seconds. Default is 3.
  • all_on_one_file: Append all articles into a single file when scraping multiple pages. Default is True.
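Putting these together, a fully configured scraper might look like this (illustrative values only; every keyword comes from the list above):

from wikiscraper import WikiScraper

scraper = WikiScraper(
    file_type="csv",       # 'txt' or 'csv'
    add_title=True,        # title at the top of the file / first CSV column
    log_saving=True,       # log file saves
    log_all=False,         # set True to also log errors and skipped links
    polite_time=3,         # seconds to wait between requests
    all_on_one_file=True,  # append all articles to a single file
)
scraper.scrape_all("https://en.wikipedia.org/wiki/Python")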

Directory Structure

Scraped files are saved in the data/ folder created automatically in the working directory.

  • If all_on_one_file=True: all articles are appended to data/wikipedia_all.txt or .csv.
  • If all_on_one_file=False: each article is saved as a separate file with its title as filename.
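To see what was written, you can list the auto-created data/ folder with the standard library (a minimal sketch; the paths follow the rules above):

from pathlib import Path

# Print every file the scraper saved to data/, with its size
for path in sorted(Path("data").iterdir()):
    print(path.name, "-", path.stat().st_size, "bytes")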

Error Handling

  • Skips invalid Wikipedia URLs.
  • Logs network errors and pages without titles.
  • Automatically filters out non-article links (categories, special pages, user pages, etc.).
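For reference, filtering out non-article links usually means skipping namespaced Wikipedia pages. The snippet below is an illustrative sketch of that idea, not the library's actual code:

from urllib.parse import unquote, urlparse

# Wikipedia namespaces that are not regular articles
NON_ARTICLE_PREFIXES = ("Category:", "Special:", "User:", "File:",
                        "Help:", "Template:", "Talk:", "Portal:")

def is_article_link(url: str) -> bool:
    """Heuristic check that a URL points at a regular Wikipedia article."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("wikipedia.org"):
        return False
    if not parsed.path.startswith("/wiki/"):
        return False
    title = unquote(parsed.path[len("/wiki/"):])
    return not title.startswith(NON_ARTICLE_PREFIXES)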

License

MIT License

About

A Python-based, automated Wikipedia scraping tool.
