<a href="https://colab.research.google.com/github/JaonHax/scpscraper/blob/master/SCP_Scraper_Examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SCP Scraper Examples

## Install and import `scpscraper`

In [1]:
!pip install --user --upgrade scpscraper==0.2.8a2

Requirement already up-to-date: scpscraper==0.2.8a2 in /root/.local/lib/python3.6/site-packages (0.2.8a2)


In [2]:
import scpscraper

## General Use Cases

### Grab an SCP's name with `scpscraper`

In [3]:
name = scpscraper.get_scp_name(5000)

print(name)

Why?


### Grab as much info as possible on an SCP with `scpscraper`

In [4]:
original = scpscraper.get_scp(173)

# Print top-level fields as-is and indent the nested 'content' fields.
for key, value in original.items():
  if key == 'content':
    print(f'\n{key}:')
    for content_key, content_value in value.items():
      print(f'\t{content_key}: {content_value}')
    print('')
  else:
    print(f'{key}: {value}')

id: 173
rating: 6331
image: {'src': 'http://scp-wiki.wdfiles.com/local--files/scp-173/SCP-173.jpg', 'caption': 'SCP-173 in containment'}

content:
	Item #: SCP-173
	Object Class: Euclid
	Special Containment Procedures: Item SCP-173 is to be kept in a locked container at all times. When personnel must enter SCP-173's container, no fewer than 3 may enter at any time and the door is to be relocked behind them. At all times, two persons must maintain direct eye contact with SCP-173 until all personnel have vacated and relocked the container.
	Description: Moved to Site-19 1993. Origin is as of yet unknown. It is constructed from concrete and rebar with traces of Krylon brand spray paint. SCP-173 is animate and extremely hostile. The object cannot move while within a direct line of sight. Line of sight must not be broken at any time with SCP-173. Personnel assigned to enter container are instructed to alert one another before blinking. Object is reported to attack by snapping the neck at th
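
Because the article text is nested under `content`, it can help to flatten the record before writing it to a table or a JSON file. The following is a minimal sketch, assuming a real `get_scp` result has the same nesting as the output printed above. `sample` is a hand-copied, shortened stand-in rather than a live result, and `flatten_record` is a hypothetical helper that is not part of `scpscraper`.

```python
import json

# Hand-copied, shortened stand-in for the dict printed above (assumption:
# a real scpscraper.get_scp() result has the same nesting and value types).
sample = {
    'id': 173,
    'rating': 6331,
    'image': {
        'src': 'http://scp-wiki.wdfiles.com/local--files/scp-173/SCP-173.jpg',
        'caption': 'SCP-173 in containment',
    },
    'content': {
        'Item #': 'SCP-173',
        'Object Class': 'Euclid',
    },
}


def flatten_record(record, sep='.'):
    """Hypothetical helper: collapse nested dicts into one level of dotted keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse, then prefix each nested key with its parent key.
            for sub_key, sub_value in flatten_record(value, sep).items():
                flat[f'{key}{sep}{sub_key}'] = sub_value
        else:
            flat[key] = value
    return flat


flat = flatten_record(sample)
print(json.dumps(flat, indent=2))
```

With a live result, passing `original` instead of `sample` should work the same way, provided every value is JSON-serializable.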

### Scrape as much info as possible from *multiple* SCPs with `scpscraper`

In [5]:
scpscraper.scrape_scps(0, 5)

# List of output files.
filelist = [
  'scp-descrips.txt',
  'scp-conprocs.txt',
  'scp-titles.txt',
  'scp-addenda.txt',
]

# Just nice formatting again.
print('\n')
for data_file in filelist:
  print(f'File: {data_file}\n===============================')
  with open(data_file, 'r') as in_text:
    print(in_text.read())

Fetching skips... 100.00% |██████████████████████████████████████████████████████████████████████████████████████████|  [00:00 remaining,  7.65s/skip]


File: scp-descrips.txt
Description: SCP-002 resembles a tumorous, fleshy growth with a volume of roughly 60 m³ (or 2000 ft³). An iron valve hatch on one side leads to its interior, which appears to be a standard low-rent apartment of modest size. One wall of the room possesses a single window, though no such opening is visible from the exterior. The room contains furniture which, upon close examination, appears to be sculpted bone, woven hair, and various other biological substances produced by the human body. All matter tested thus far show independent or fragmented DNA sequences for each object in the room.
Refer to the Mulhausen Report [cross-ref:document00.023.603] for details related to object's discovery.

Description: SCP-003 consists of two related components of separate origin, referred to as SCP-003-1 and SCP-003-2.
SCP-003-1
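
To work with the scraped text one entry at a time, the file contents can be split back into records. This is a rough sketch, assuming each entry in `scp-descrips.txt` starts with a `Description:` line as in the output above. `split_records` is a hypothetical helper, and `sample_text` is a shortened stand-in for the real file contents.

```python
def split_records(text, prefix='Description:'):
    """Hypothetical helper: group lines into records that each start with `prefix`."""
    records = []
    for line in text.splitlines():
        if line.startswith(prefix):
            # A new record begins; drop the prefix itself.
            records.append([line[len(prefix):].strip()])
        elif records and line.strip():
            # A non-blank continuation line belongs to the current record.
            records[-1].append(line.strip())
    return ['\n'.join(lines) for lines in records]


# Shortened stand-in for the contents of scp-descrips.txt shown above.
sample_text = (
    'Description: SCP-002 resembles a tumorous, fleshy growth.\n'
    'Refer to the Mulhausen Report for details.\n'
    '\n'
    'Description: SCP-003 consists of two related components.\n'
    'SCP-003-1\n'
)

records = split_records(sample_text)
print(len(records))
for record in records:
    print(record.splitlines()[0])
```

The other output files may use a different leading label, so the `prefix` argument would need adjusting for them.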

### Scrape the HTML code of an SCP page with `scpscraper`

In [6]:
html = scpscraper.get_single_scp(3001)

# You can use the raw HTML of the whole page directly, if you want.
print('Straight HTML\n=================================', html, sep='\n')

# Or...

# Grab the page content (what the author actually wrote).
content = html.find("div", id="page-content")

# And use that instead for less clutter.
print('\n\nPage Content Div\n=================================', content, sep='\n')

# You can even prettify the HTML if you like.
print('\n\nPrettified Page Content\n=================================', content.prettify(), sep='\n')

Straight HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>SCP-3001 - SCP Foundation</title>
<script type="text/javascript">
var googletag = googletag || {};
googletag.cmd = googletag.cmd || [];
(function() {
var gads = document.createElement('script');
gads.async = true;
gads.type = 'text/javascript';
var useSSL = 'https:' == document.location.protocol;
gads.src = (useSSL ? 'https:' : 'http:') + 
'//www.googletagservices.com/tag/js/gpt.js';
var node = document.getElementsByTagName('script')[0];
node.parentNode.insertBefore(gads, node);
})();
</script>
<script type="text/javascript">
googletag.cmd.push(function() {
    // DEFINE DFP SLOTS
googletag.defineSlot('/1030917/wikidot_free_sites_bottom_300x250', [300, 250], 'div-gpt-ad-1410946564449-0').addService(googletag.pubads());

// googletag.pubads().enableSingleRequest();
google
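
As the output shows, the full page carries a lot of ad and script markup around the article. If only a small piece is needed, such as the page title, it can be pulled out of the markup as a string. The sketch below swaps in the standard library's `html.parser` instead of the object returned by `get_single_scp`, so that it runs without any third-party packages. `TitleParser` is a hypothetical helper, and `sample_html` is a shortened copy of the page head printed above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == 'title' and self.title is None:
            self._in_title = True
            self.title = ''

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Shortened copy of the page head printed above.
sample_html = (
    '<html lang="en"><head>'
    '<title>SCP-3001 - SCP Foundation</title>'
    '<script type="text/javascript">var googletag = googletag || {};</script>'
    '</head><body></body></html>'
)

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)
```

With a live page, `parser.feed(str(html))` should behave the same way, assuming `str(html)` yields the page markup.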

### Scrape the HTML code from *multiple* SCPs with `scpscraper`

Be warned: this cell prints a large amount of output.

In [7]:
scpscraper.scrape_scps_html(0, 5)

# Just nice formatting again.
print('\n')
with open('scp_html.txt', 'r') as in_text:
  # Iterate over the file directly instead of loading every line into a list.
  for line in in_text:
    print(line, end='')

Fetching skips... 100.00% |██████████████████████████████████████████████████████████████████████████████████████████|  [00:00 remaining,  1.23skip/s]


<div id="page-content">
<div style="text-align: right;"><div class="page-rate-widget-box"><span class="rate-points">rating: <span class="number prw54353">+1204</span></span><span class="rateup btn btn-default"><a href="javascript:;" onclick="WIKIDOT.modules.PageRateWidgetModule.listeners.rate(event, 1)" title="I like it">+</a></span><span class="ratedown btn btn-default"><a href="javascript:;" onclick="WIKIDOT.modules.PageRateWidgetModule.listeners.rate(event, -1)" title="I don't like it">–</a></span><span class="cancel btn btn-default"><a href="javascript:;" onclick="WIKIDOT.modules.PageRateWidgetModule.listeners.cancelVote(event)" title="Cancel my vote">x</a></span></div></div>
<p><strong>Ittëm #</strong> ŚČР-000</p>
<p><strong>ØbjectX_XClas§:</strong> #NULL</p>
<p><strong>SpecïÅl ςόЛţДĬЛΜ$%#ll to undefined function PROCEDURES():</st
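
Since `scp_html.txt` holds the page content of several articles back to back, simple values such as the rating can be pulled out with a regular expression. This is a rough sketch, assuming the rating widget markup always matches the `<span class="number ...">` line shown above. `extract_ratings` is a hypothetical helper, `sample` is a shortened copy of that line, and a regular expression is a fragile stand-in for real HTML parsing.

```python
import re

# Assumption: the rating widget markup matches the output shown above, i.e.
# <span class="number prw...">+1204</span>.
RATING_PATTERN = re.compile(r'<span class="number[^"]*">([+-]?\d+)</span>')


def extract_ratings(html_text):
    """Hypothetical helper: return every rating found in the text as an int."""
    return [int(match) for match in RATING_PATTERN.findall(html_text)]


# Shortened copy of the rating widget markup shown above.
sample = (
    '<div id="page-content">'
    '<span class="rate-points">rating: '
    '<span class="number prw54353">+1204</span></span>'
    '</div>'
)

ratings = extract_ratings(sample)
print(ratings)
```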