SCP Scraper

A small Python library designed for scraping data from the SCP wiki. Made with AI training (namely NLP models) and dataset collection (for things like categorization of SCPs for external projects) in mind, and has arguments to allow for ease of use in those applications.

Below you will find installation instructions, examples of how to use this library, and the ways in which you can utilize it. I hope you find this as useful as I have!

Sample Code

Installation

scpscraper can be installed via pip install. Here's the command I recommend using, so you consistently have the latest version.

pip3 install --upgrade scpscraper

The Basics

Importing the Library

# Before we begin, we obviously have to import scpscraper.
import scpscraper

Grabbing an SCP's Name

# Let's use 3001 (Red Reality) as an example.
name = scpscraper.get_scp_name(3001)

print(name) # Outputs "Red Reality"

Grabbing as many details as possible about an SCP

# Again using 3001 as an example
info = scpscraper.get_scp(3001)

print(info) # Outputs a dictionary with the
# name, object id, rating, page content by section, etc.

The Fun Stuff

Grabbing an SCP's `page-content` div HTML

For reference, the page-content div contains what the user actually wrote, without all the extra Wikidot external stuff.

# Once again, 3001 is the example
scp = scpscraper.get_single_scp(3001)

# Grab the page-content div specifically
content = scp.find_all('div', id='page-content')

print(content) # Outputs "<div id="page-content"> ... </div>"

Scraping HTML or information from multiple SCPs

# Grab info on SCPs 000-099
scpscraper.scrape_scps(0, 100)

# Same as above, but only grabbing Keter-class SCPs
scpscraper.scrape_scps(0, 100, tags=['keter'])

# Grab 000-099 in a format that can be used to train AI
scpscraper.scrape_scps(0, 100, ai_dataset=True)

# Scrape the page-content div's HTML from SCP-000 to SCP-099

# Only including this as an example, but scrape_scps_html() has
# all the same options as scrape_scps().
scpscraper.scrape_scps_html(0, 100)

Google Colaboratory Only Usage

Because of the google.colab module included in Google Colaboratory, we can do a few extra things there that we can't otherwise.

Mount your Google Drive to the Colaboratory VM

# Mounts it to the directory /content/drive/
scpscraper.gdrive.mount()

Scrape SCP info/HTML and copy to your Google Drive afterwards

# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.scrape_scps(0, 100, copy_to_drive=True)

scpscraper.scrape_scps_html(0, 100, copy_to_drive=True)

Copy other files to/from your Google Drive

# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.gdrive.copy_to_drive('example.txt')

scpscraper.gdrive.copy_from_drive('example.txt')

Planned Updates

Potential updates in the future to make scraping data from any website easy/viable, allowing for easy mass collection of data.

Link to GitHub Repo

Please consider checking it out! You can report issues, request features, contribute to this project, etc. in the GitHub Repo. That is the best way to reach me for issues/feedback relating to this project.

https://github.com/JaonHax/scpscraper/

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
.github		.github
scpscraper		scpscraper
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github

.github

scpscraper

scpscraper

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

setup.cfg

setup.cfg

setup.py

setup.py

Repository files navigation

SCP Scraper

Sample Code

Installation

The Basics

Importing the Library

Grabbing an SCP's Name

Grabbing as many details as possible about an SCP

The Fun Stuff

Grabbing an SCP's `page-content` div HTML

Scraping HTML or information from multiple SCPs

Google Colaboratory Only Usage

Mount your Google Drive to the Colaboratory VM

Scrape SCP info/HTML and copy to your Google Drive afterwards

Copy other files to/from your Google Drive

Planned Updates

Link to GitHub Repo

About

Releases 21

Languages

License

JadynHax/scpscraper

Folders and files

Latest commit

History

Repository files navigation

SCP Scraper

Sample Code

Installation

The Basics

Importing the Library

Grabbing an SCP's Name

Grabbing as many details as possible about an SCP

The Fun Stuff

Grabbing an SCP's page-content div HTML

Scraping HTML or information from multiple SCPs

Google Colaboratory Only Usage

Mount your Google Drive to the Colaboratory VM

Scrape SCP info/HTML and copy to your Google Drive afterwards

Copy other files to/from your Google Drive

Planned Updates

Link to GitHub Repo

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Grabbing an SCP's `page-content` div HTML