Implementing scrapers

Hamuko edited this page Jun 3, 2017 · 2 revisions

This page is a work in progress.

Table of Contents

  1. Overview
  2. Base classes
    1. BaseSeries
    2. BaseChapter
  3. Series
    1. Required methods and properties
  4. Chapter
    1. Initialisation
    2. Required methods and properties
  5. Regular expressions

1. Overview

cum scrapers are all located in the cum/scrapers/ directory, which is also where you should add your new scraper file. Each site has its own scraper. In addition, the directory contains some special files, including the all-important base scraper.

cum uses two kinds of objects when dealing with manga sites: series objects and chapter objects. A series object represents a single series on the site that can have one or more chapters attached to it. When a user runs cum follow, they supply the application with a URL for a series. A chapter object represents a single chapter on a site, usually belonging under a series (depending on the site's setup). Chapters are often handled in the software without a parent series, so they also include some basic information from their parent series.

2. Base classes

cum has two base classes for scrapers: BaseChapter and BaseSeries. All scrapers must inherit from these two classes when defining their own. Hence, (almost) all scrapers include the following import:

from cum.scrapers.base import BaseChapter, BaseSeries

The base classes add important functions and properties for the scraper objects and ensure that the classes conform to the standards.

i. BaseSeries

The base series is initialised with a URL pointing to the series on the site, as well as any number of keyword arguments. The base class is in charge of saving the URL on the series object as self.url and of handling any custom directories, so scraper classes do not have to worry about these properties.

__init__(self, url, **kwargs)
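The behaviour can be sketched roughly as follows. This is a simplified stand-in rather than the actual base class, and the directory keyword argument is shown only as an example of a handled property:

```python
class BaseSeriesSketch:
    """Simplified stand-in for BaseSeries, for illustration only."""

    def __init__(self, url, **kwargs):
        # The base class stores the URL so scrapers can reuse it later.
        self.url = url
        # Custom directories (and similar keyword arguments) are handled
        # here so individual scrapers do not have to.
        self.directory = kwargs.get('directory')
```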

ii. BaseChapter

3. Series

Each scraper must implement its own series object that inherits from BaseSeries. The series object represents one individual series on a site and is in charge of returning chapter information for itself.

i. Required methods and properties

def __init__(self, url, **kwargs):

Series objects use initialisation to request the chapters from the remote site and to save them under self.chapters. Because the base scraper needs access to some of the data passed to the series initialiser, you must call super().__init__(url, **kwargs) at the beginning of the method.
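Put together, a scraper's initialiser typically looks like this sketch. BaseSeries is replaced with a minimal stand-in here so the example is self-contained; a real scraper imports it from cum.scrapers.base, and ExampleSiteSeries is a hypothetical scraper:

```python
class BaseSeries:  # minimal stand-in for cum.scrapers.base.BaseSeries
    def __init__(self, url, **kwargs):
        self.url = url


class ExampleSiteSeries(BaseSeries):
    def __init__(self, url, **kwargs):
        # The base scraper must see the URL and keyword arguments first.
        super().__init__(url, **kwargs)
        # Fetch the chapter list and store it where cum expects it.
        self.chapters = self.get_chapters()

    def get_chapters(self):
        # A real scraper would scrape the site here.
        return []
```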

def get_chapters(self):

Series objects fetch the list of chapter objects using this method. Even though the rest of the application accesses the chapters through the chapters property, which is set in __init__, the chapter list is generated within this method for clarity and readability. The signature of this method doesn't matter, so you are free to add other parameters to it.
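As a rough sketch of what get_chapters might do, the following parses chapter links out of a series page with a regular expression. The markup and URLs are invented, and a real scraper would fetch the page over the network and build chapter objects rather than plain tuples:

```python
import re

# Invented markup standing in for a fetched series page.
SERIES_PAGE = '''
<a class="chapter" href="/chapter/1">Chapter 1</a>
<a class="chapter" href="/chapter/2">Chapter 2</a>
'''


def get_chapters(html):
    """Return (chapter URL, chapter number) pairs found on a series page."""
    chapter_re = re.compile(r'href="(/chapter/(\d+))"')
    return [(m.group(1), m.group(2)) for m in chapter_re.finditer(html)]
```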

@property
def name(self):

Returns the series name as a string. The string may contain special characters if they are present in the series name. This string is used by the base class to form the alias of the series, which is used in the cum UI, as well as to form the download path.
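A minimal sketch of the property, assuming the parsed name was stored during initialisation:

```python
class ExampleSiteSeries:
    """Sketch showing only the name property."""

    def __init__(self, parsed_name):
        self._name = parsed_name

    @property
    def name(self):
        # Return the name exactly as the site presents it; the base
        # class derives the alias and download path from this string.
        return self._name
```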

4. Chapter

Each scraper must implement its own chapter object that inherits from BaseChapter. The chapter object represents one individual chapter on a site and provides methods for downloading the chapter from the site.

i. Initialisation

Chapter objects can be created either by scraping the web (using a series or chapter URL) or by restoring stored entries from the local database. For this reason, the chapter class should not access the network during initialisation; cum can load hundreds of chapters at once from the database and turn them into the appropriate chapter objects for handling.

Chapters are initialised using various keyword arguments. The BaseChapter class already handles the most common properties for chapters: name, alias, chapter, title, url, groups, directory. If your chapter does not need to store additional data during initialisation, you should not override the default initialisation method inherited from BaseChapter. If you do need to store additional data, use the same signature as BaseChapter.__init__ and call its initialisation method first.

class ExampleSiteChapter(BaseChapter):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_key = kwargs.get('api_key')
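For illustration, restoring a chapter from stored data then involves only keyword arguments and no network access. BaseChapter is replaced with a stand-in here so the sketch is self-contained:

```python
class BaseChapterSketch:  # stand-in for cum.scrapers.base.BaseChapter
    COMMON_KEYS = ('name', 'alias', 'chapter', 'title',
                   'url', 'groups', 'directory')

    def __init__(self, *args, **kwargs):
        # The common properties are filled from keyword arguments alone,
        # so hundreds of chapters can be restored from the database
        # without touching the network.
        for key in self.COMMON_KEYS:
            setattr(self, key, kwargs.get(key))


chapter = BaseChapterSketch(name='Example Series', alias='example-series',
                            chapter='12',
                            url='https://www.example-manga.com/chapter/12')
```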

ii. Required methods and properties

def download(self):

This method is in charge of actually downloading the manga archive or pages from the site and saving them to disk.

def from_url(url):

5. Regular expressions

cum uses regular expressions to match URLs to the appropriate scrapers. Each scraper must define regular expressions for both its series and chapter classes so that cum can match URLs against them. Both classes define a class-level regular expression named url_re, pre-compiled with re.compile.

import re

class ExampleSiteSeries(BaseSeries):
    url_re = re.compile(r'https?://www\.example-manga\.com/series/')

class ExampleSiteChapter(BaseChapter):
    url_re = re.compile(r'https?://www\.example-manga\.com/chapter/')

Be mindful of supporting both HTTP and HTTPS where appropriate.
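For illustration, the https? in the patterns above makes the scheme's trailing s optional, so a single expression matches both schemes:

```python
import re

url_re = re.compile(r'https?://www\.example-manga\.com/series/')

# The "s?" makes the expression match both HTTP and HTTPS URLs.
print(bool(url_re.match('http://www.example-manga.com/series/1')))    # True
print(bool(url_re.match('https://www.example-manga.com/series/1')))   # True
print(bool(url_re.match('https://unrelated-site.example/series/1')))  # False
```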

Scrapers can (and often should) have additional pre-compiled regular expressions as long as they adhere to the style guide.