Under the hood

scrubadub consists of three separate components:

  • Filth objects identify the specific parts of a piece of dirty text that contain sensitive information, and they decide how that information is replaced in the cleaned text.
  • Detector objects are used to detect specific types of Filth.
  • The Scrubber is responsible for managing all of the Detector objects and resolving any conflicts that may arise between different Detector objects.

Filth

A Filth object marks a particular section of text as containing one type of filth, and it also knows how that section should be cleaned. Every type of Filth inherits from scrubadub.filth.base.Filth.

scrubadub.filth.base.Filth

There is also a convenience class, RegexFilth, which makes it easy to add new types of filth that can be identified with regular expressions:

scrubadub.filth.base.RegexFilth
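To make the two roles of a Filth concrete (marking a span and knowing its replacement), here is a minimal, self-contained sketch. It does not import scrubadub; the class names SketchFilth, SketchRegexFilth and SketchTicketFilth, the attribute names, and the `{{TYPE}}` placeholder format are all assumptions made for illustration and may differ from the real base classes.

```python
import re


class SketchFilth:
    """Hypothetical stand-in for a Filth: marks a span and knows its replacement."""

    type = "unknown"

    def __init__(self, beg, end, text):
        self.beg = beg    # index of the first character of the span
        self.end = end    # index one past the last character of the span
        self.text = text  # the sensitive text that was found

    def replace_with(self):
        # Assumed placeholder format; the real library may format this differently.
        return "{{" + self.type.upper() + "}}"


class SketchRegexFilth(SketchFilth):
    """Hypothetical convenience base: builds the span from a regex match."""

    regex = None

    def __init__(self, match):
        super().__init__(match.start(), match.end(), match.group())


class SketchTicketFilth(SketchRegexFilth):
    """A made-up filth type, defined by nothing more than a type name and a pattern."""

    type = "ticket"
    regex = re.compile(r"TICKET-\d+")


match = SketchTicketFilth.regex.search("See TICKET-123 for details")
filth = SketchTicketFilth(match)
```

In this sketch a new filth type needs only a name and a pattern, which is the kind of convenience a regex-based base class is meant to provide.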

Detectors

scrubadub includes several Detectors, which are responsible for identifying and iterating over the Filth found in a piece of text. Every type of Filth has a Detector that inherits from scrubadub.detectors.base.Detector:

scrubadub.detectors.base.Detector

For convenience, there is also a RegexDetector, which makes it easy to quickly add new types of Filth that can be identified from regular expressions:

scrubadub.detectors.base.RegexDetector
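The "identify and iterate" behaviour of a Detector can be sketched as a generator that yields one filth per match. This is a self-contained illustration rather than the library's implementation: SketchDetector, SketchRegexDetector, SketchTicketDetector, the SketchFilth tuple, and the `iter_filth` method name as used here are assumptions.

```python
import re
from collections import namedtuple

# Hypothetical, simplified filth record used only by this sketch.
SketchFilth = namedtuple("SketchFilth", ["type", "beg", "end", "text"])


class SketchDetector:
    """Hypothetical base: subclasses yield the filth they find in a text."""

    filth_type = "unknown"

    def iter_filth(self, text):
        raise NotImplementedError


class SketchRegexDetector(SketchDetector):
    """Hypothetical convenience base: one filth per regex match."""

    regex = None

    def iter_filth(self, text):
        for match in self.regex.finditer(text):
            yield SketchFilth(self.filth_type, match.start(), match.end(), match.group())


class SketchTicketDetector(SketchRegexDetector):
    """A made-up detector for ticket identifiers."""

    filth_type = "ticket"
    regex = re.compile(r"TICKET-\d+")


found = list(SketchTicketDetector().iter_filth("TICKET-1 blocks TICKET-22"))
```

Because `iter_filth` is a generator, a caller can consume matches lazily or collect them into a list, as done here.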

Scrubber

All of the Detectors are managed by the Scrubber. The main job of the Scrubber is to handle situations in which the same section of text contains different types of Filth.

scrubadub.scrubbers.Scrubber
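One plausible way to handle a section of text that several detectors flag is to merge the overlapping spans into a single span before replacing it. The sketch below illustrates that idea under stated assumptions; it is not taken from the Scrubber's source, and `merge_overlaps`, `clean`, SketchFilth, and the combined `{{EMAIL+NAME}}` placeholder are hypothetical.

```python
from collections import namedtuple

# Hypothetical, simplified filth record used only by this sketch.
SketchFilth = namedtuple("SketchFilth", ["type", "beg", "end"])


def merge_overlaps(filths):
    """Merge spans that overlap so each character is replaced at most once."""
    merged = []
    for filth in sorted(filths, key=lambda f: (f.beg, f.end)):
        if merged and filth.beg < merged[-1].end:
            last = merged[-1]
            # Combine the type names and extend the span to cover both filths.
            types = "+".join(sorted(set(last.type.split("+")) | {filth.type}))
            merged[-1] = SketchFilth(types, last.beg, max(last.end, filth.end))
        else:
            merged.append(filth)
    return merged


def clean(text, filths):
    """Replace every merged span with an assumed {{TYPE}} placeholder."""
    pieces = []
    cursor = 0
    for filth in merge_overlaps(filths):
        pieces.append(text[cursor:filth.beg])
        pieces.append("{{" + filth.type.upper() + "}}")
        cursor = filth.end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "mail john@example.com now"
# Two hypothetical detectors flag overlapping parts of the same address.
filths = [SketchFilth("email", 5, 21), SketchFilth("name", 5, 9)]
result = clean(text, filths)
```

Merging first avoids replacing the same characters twice, which would otherwise corrupt the offsets of later spans.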