Merge pull request #60 from LeapBeyond/feature/api_changes
Internal changes to allow for new features
thomasbird committed Oct 20, 2020
2 parents e1b2343 + 3b2c844 commit 0eec66e
Showing 76 changed files with 3,306 additions and 467 deletions.
10 changes: 8 additions & 2 deletions .travis.yml
@@ -8,13 +8,19 @@ python:
install:
- pip install -r requirements/python-dev
- python -m textblob.download_corpora
- pip install .
# - apt-get install curl autoconf automake libtool pkg-config
# - git clone https://github.com/openvenues/libpostal && cd libpostal && ./bootstrap.sh && ./configure && \
# make -j4 && make install

# commands to run the testing suite. if any of these fail, travis lets us know
# Enabling type checking with mypy, but only showing the warning messages
script:
- mypy --config-file setup.cfg scrubadub/
- nosetests --with-coverage --cover-package=scrubadub
- mypy --config-file setup.cfg scrubadub/ || true
- pycodestyle scrubadub/
- flake8 --config setup.cfg scrubadub/
- python3 ./tests/benchmark_accuracy.py
- python3 ./tests/benchmark_time.py
- cd docs && make html && cd -

# commands to run after the tests successfully complete
85 changes: 85 additions & 0 deletions docs/accuracy.rst
@@ -0,0 +1,85 @@
.. _comparison:

Accuracy
========

The most common question that people have about scrubadub is:

How accurately can scrubadub detect PII?

It's a great question that is hard, but essential, to answer.

It is straightforward to measure this on pseudo-data (fake data that is generated), but it's not clear how applicable this is to real-world applications.
It might be possible to use some open real-world datasets, but it's not clear if such things exist given the sensitivity of PII.

We show the precision and recall for each of the `Filth` types detected by the various `Detector`\ s.
`Wikipedia <https://en.wikipedia.org/wiki/Precision_and_recall>`_ has a good explanation, but these are defined as:

- **Precision:** Percentage of the `Filth` selected by the `Detector` that is true `Filth`

  - If this is low, there is lots of clean text incorrectly detected as `Filth`

- **Recall:** Percentage of the true `Filth` that is selected by the `Detector`

  - If this is low, there is lots of dirty text that is not detected as `Filth`
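
The two definitions above can be sketched in a few lines of plain Python.
This is only an illustration and not how scrubadub computes its report: it assumes each piece of `Filth` is a ``(start, end)`` character span and that a detection only counts when the span matches exactly.
The function name ``precision_recall`` and the example spans are hypothetical.

```python
def precision_recall(detected, known):
    """Return (precision, recall) for two collections of (start, end) spans.

    Assumption for this sketch: a detection is correct only if its span
    exactly matches a known span.
    """
    detected, known = set(detected), set(known)
    true_positives = len(detected & known)
    # Precision: how much of what was selected is really Filth
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: how much of the real Filth was selected
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical spans: three known pieces of Filth, four detections
known_spans = {(0, 4), (16, 20), (30, 47)}
detected_spans = {(0, 4), (16, 20), (50, 55), (60, 64)}

precision, recall = precision_recall(detected_spans, known_spans)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Here two of the four detections are correct (precision 50%) and two of the three known items are found (recall 67%).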

Pseudo-data performance
-----------------------

This section uses data created by the Faker package to test the effectiveness of the various detectors.
Here the detectors all generally perform very well (often 100%), but this is unlikely to be representative of performance on real data.

+----------------+----------------+-----------+-------------+-------------+
| Filth type     | Detector       | Locale    | Precision   | Recall      |
+================+================+===========+=============+=============+
| Address        | Address        | en_GB     | 100%        | 96%         |
+----------------+----------------+-----------+-------------+-------------+
| Address        | Address        | en_US     | 100%        | 74%         |
+----------------+----------------+-----------+-------------+-------------+
| Email          | Email          | N/A       | 100%        | 100%        |
+----------------+----------------+-----------+-------------+-------------+
| Name           | Name           | en_US     | 9%          | 100%        |
+----------------+----------------+-----------+-------------+-------------+
| Name           | Stanford NER   | en_US     | 95%         | 86%         |
+----------------+----------------+-----------+-------------+-------------+
| Phone Number   | Phone Number   | en_GB     | TODO        | TODO        |
+----------------+----------------+-----------+-------------+-------------+
| Phone Number   | Phone Number   | en_US     | 100%        | 100%        |
+----------------+----------------+-----------+-------------+-------------+
| Postal code    | Postal code    | en_GB     | 100%        | 74%         |
+----------------+----------------+-----------+-------------+-------------+
| SSN            | SSN            | en_US     | 100%        | 100%        |
+----------------+----------------+-----------+-------------+-------------+
| Twitter        | Twitter        | N/A       | 100%        | 100%        |
+----------------+----------------+-----------+-------------+-------------+
| URL            | URL            | N/A       | 100%        | 100%        |
+----------------+----------------+-----------+-------------+-------------+


Real data performance
---------------------

We are trying to find datasets that could be used to evaluate performance; if you know of any, let us know.
Stay tuned for more updates.

Measuring performance
---------------------

Read this section if you want to measure performance on your own data.

First, obtain data that contains PII and have the true PII tagged, usually by a human.
If you cannot get real data, you can generate fake data, although this is never as good; the function ``make_fake_document()`` below makes a fake document and provides the known filth items needed for the ``KnownFilthDetector``.

Once this is done you can add the ``KnownFilthDetector`` to your scrubber and provide it with your known true Filth.
Then you can use the ``get_filth_classification_report(filth_list)`` function to get a report containing the recall and precision of the detectors.
In addition to this classification report, there is also the ``get_filth_dataframe(filth_list)`` function that returns a pandas `DataFrame` that can be used to get more information on the types of `Filth` that were detected.

.. autofunction:: scrubadub.comparison.get_filth_classification_report

.. autofunction:: scrubadub.comparison.get_filth_dataframe

.. autofunction:: scrubadub.comparison.make_fake_document




81 changes: 0 additions & 81 deletions docs/advanced_usage.rst

This file was deleted.

68 changes: 58 additions & 10 deletions docs/api.rst
@@ -3,19 +3,42 @@
API
===

``scrubadub`` consists of three separate components:
``scrubadub`` consists of four separate components:

* The ``Scrubber`` is responsible for managing all of the ``Detector`` objects
  and resolving any conflicts that may arise between different ``Detector``
  objects.

* ``Detector`` objects are used to detect specific types of ``Filth``.

* ``Filth`` objects are used to identify specific parts of a piece of dirty
  dirty text that contain sensitive information and they are responsible for
  deciding how the resulting information should be replaced in the cleaned
  text.

* ``Detector`` objects are used to detect specific types of ``Filth``.

* ``PostProcessor`` objects are used to alter the found ``Filth``.
  This could be to replace the ``Filth`` with a hash or token.

* The ``Scrubber`` is responsible for managing the cleaning process.
  It keeps track of the ``Detector``, ``PostProcessor`` and ``Filth`` objects.
  It also resolves conflicts that may arise between different ``Detector``
  objects.


scrubadub
---------

There are several convenience functions to make using scrubadub quick and simple.
These functions either remove the Filth from the text (such as ``scrubadub.clean``) or
return a list of Filth objects that were found (such as ``scrubadub.list_filth``).
These functions either work on a single document in a string (such as ``scrubadub.clean``) or
work on a set of documents given in either a dictionary or list (such as ``scrubadub.clean_documents``).

.. autofunction:: scrubadub.clean

.. autofunction:: scrubadub.clean_documents

.. autofunction:: scrubadub.list_filth

.. autofunction:: scrubadub.list_filth_documents


Scrubber
--------
@@ -105,11 +128,36 @@ be cleaned. Every type of ``Filth`` inherits from `scrubadub.filth.base.Filth`.
    :undoc-members:
    :show-inheritance:

There is also a convenience class for ``RegexFilth``, which makes it easy to
quickly remove new types of filth that can be identified from regular
expressions:
PostProcessors
--------------

``PostProcessor``\ s generally can be used to process the detected ``Filth``
objects and make changes to them.

These are a new addition to scrubadub and at the moment only simple ones
exist that alter the replacement string.

.. autoclass:: scrubadub.post_processors.base.PostProcessor
    :members:
    :undoc-members:
    :show-inheritance:

.. autoclass:: scrubadub.post_processors.text_replacers.filth_type.FilthTypeReplacer
    :members:
    :undoc-members:
    :show-inheritance:

.. autoclass:: scrubadub.post_processors.text_replacers.hash.HashReplacer
    :members:
    :undoc-members:
    :show-inheritance:

.. autoclass:: scrubadub.post_processors.text_replacers.numeric.NumericReplacer
    :members:
    :undoc-members:
    :show-inheritance:

.. autoclass:: scrubadub.filth.base.RegexFilth
.. autoclass:: scrubadub.post_processors.text_replacers.prefix_suffix.PrefixSuffixReplacer
    :members:
    :undoc-members:
    :show-inheritance:
45 changes: 45 additions & 0 deletions docs/changelog.rst
@@ -11,6 +11,51 @@ latest changes in development for next release

.. THANKS FOR CONTRIBUTING; MENTION WHAT YOU DID IN THIS SECTION HERE!

2.0.0
-----

There have been some changes in the scrubadub API, but (almost) no breaking changes.
The changes include:

* Introduced the concept of a `PostProcessor`.
  This will allow more complex groupings of `Filth`\ s and new types of tokenization.
* Added ability to easily evaluate a `Detector`\ 's performance.
* The name of the detector has been separated from the type of filth found.
  This means multiple instances of the same detector (configured differently) can be in the same `Scrubber` instance and one `Detector` can return multiple types of `Filth`.
* A default set of `Detector`\ s is loaded instead of all `Detector`\ s.
  This is particularly useful for optional detectors with complex dependencies.


Scrubber
^^^^^^^^

* `Detector`\ s can be added and removed using a string containing their default name, their class or an instance.
* You can clean multiple documents with one `Scrubber().clean_documents(docs)` call.

Detectors
^^^^^^^^^

* Detectors now require a class instance variable called name, which should be unique within a `Scrubber` instance.
* Regular expressions used by the `RegexDetector` class have been moved from ``RegexFilth.regex`` to ``RegexDetector.regex``.

Filth
^^^^^

* Introduced two parameters in the constructor: `detector_name` and `document_name`.
  These keep track of the `Detector` that found the `Filth` and the document it came from.
  This results in `Filth` objects being passed additional parameters on initialisation.
  This is the one breaking change: a custom `Filth.__init__` should accept the `detector_name` and `document_name` keywords and call the base class constructor.
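
As a rough sketch of what the breaking change asks of custom `Filth` classes, the snippet below forwards the new keywords to the base class constructor.
To stay self-contained it defines a simplified stand-in ``Filth`` base class; the real `scrubadub.filth.base.Filth` has more parameters and behaviour, and ``EmployeeIdFilth`` with its ``department`` argument is a hypothetical example.

```python
from typing import Optional


class Filth:
    """Simplified stand-in for scrubadub.filth.base.Filth (an assumption)."""

    type = "unknown"

    def __init__(self, beg: int = 0, end: int = 0, text: str = "",
                 detector_name: Optional[str] = None,
                 document_name: Optional[str] = None, **kwargs):
        self.beg = beg
        self.end = end
        self.text = text
        # New in 2.0.0: which detector found this and in which document
        self.detector_name = detector_name
        self.document_name = document_name


class EmployeeIdFilth(Filth):
    """Hypothetical custom Filth with one extra argument of its own."""

    type = "employee_id"

    def __init__(self, *args, department: Optional[str] = None, **kwargs):
        # Accept detector_name and document_name via **kwargs and pass
        # them on to the base class constructor.
        super().__init__(*args, **kwargs)
        self.department = department


filth = EmployeeIdFilth(
    beg=0, end=6, text="E12345",
    detector_name="employee_id", document_name="report.txt",
    department="HR",
)
print(filth.detector_name, filth.document_name)
```

Accepting ``*args`` and ``**kwargs`` means the subclass does not need to change again if further keywords are added to the base constructor.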

PostProcessors
^^^^^^^^^^^^^^

* Introduction of simple `PostProcessors`:

  * `FilthTypeReplacer`: Replace the filth with the type of filth ``example@example.com -> EMAIL``
  * `HashReplacer`: Replace the filth with a configurable hash ``example@example.com -> 196aa39e9f8159ec``
  * `NumericReplacer`: Replace the filth with a monotonically increasing number for each unique piece of filth, optionally including the filth type ``example@example.com -> EMAIL-1``.
  * `PrefixSuffixReplacer`: Add a prefix and/or suffix onto the replacement ``EMAIL-1 -> {{EMAIL-1}}``

* It is envisioned that other more complex operations can be done here too such as grouping filth (e.g. "John", "John Doe" and "Mr. Doe" could be grouped together).
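
To make the four replacement strategies above concrete, here is a plain-Python sketch of the transformations they describe.
It does not use or reproduce the scrubadub classes: the helper names, the choice of SHA-256, the 16-character truncation, the optional salt and the single shared counter are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Tuple


def filth_type_replacement(filth_type: str) -> str:
    """Replace the filth with its type, e.g. 'email' -> 'EMAIL'."""
    return filth_type.upper()


def hash_replacement(text: str, salt: str = "", length: int = 16) -> str:
    """Replace the filth with a truncated hex digest (SHA-256 assumed here)."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:length]


class NumericLabeller:
    """Give each unique piece of filth a monotonically increasing number."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], int] = {}

    def label(self, filth_type: str, text: str) -> str:
        key = (filth_type, text.lower())
        if key not in self._seen:
            # First time this filth is seen: assign the next number
            self._seen[key] = len(self._seen) + 1
        return f"{filth_type.upper()}-{self._seen[key]}"


def add_prefix_suffix(replacement: str, prefix: str = "{{",
                      suffix: str = "}}") -> str:
    """Wrap a replacement string, e.g. 'EMAIL-1' -> '{{EMAIL-1}}'."""
    return f"{prefix}{replacement}{suffix}"


labeller = NumericLabeller()
first = labeller.label("email", "example@example.com")
again = labeller.label("email", "example@example.com")
other = labeller.label("email", "someone@example.org")
print(first, again, other, add_prefix_suffix(first))
```

Repeated occurrences of the same address receive the same label, which is what keeps the replacement consistent across a document.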

1.2.2
-----

33 changes: 21 additions & 12 deletions docs/index.rst
@@ -50,20 +50,12 @@ incorporating it into your python scripts like this:
# John may be a cat, but he doesn't want other people to know it.
>>> text = u"John is a cat"
# Replace names with {{NAME}} placeholder. This is the scrubadub default
# because it maximally omits any information about people.
>>> scrubadub.clean(text)
u"{{NAME}} is a cat"
# Replace names with {{NAME-ID}} anonymous, but consistent IDs.
>>> scrubadub.clean(text, replace_with='identifier')
>>> scrubadub.clean(text)
u"{{NAME-0}} is a cat"
>>> scrubadub.clean("John spoke with Doug.", replace_with='identifier')
u"{{NAME-0}} spoke with {{NAME-1}}."
.. # Replace names with random, gender-consistent names
>>> scrubadub.clean(text, replace_with='surrogate')
u"Billy is a cat"
>>> scrubadub.clean("John spoke with Doug.")
u"{{NAME-0}} spoke with {{NAME-1}}."
There are many ways to tailor the behavior of ``scrubadub`` using
@@ -72,6 +64,22 @@ There are many ways to tailor the behavior of ``scrubadub`` using
in which ``scrubadub`` cleans dirty dirty text.


Installation
------------

To install scrubadub using pip, simply type::

    pip install scrubadub

This package requires Python 3.5 or later.
For Python 2.7 support, see v1.2.2, which is the last version that supports it.

There are a few python dependencies, which can be seen in the
`requirements file <https://github.com/LeapBeyond/scrubadub/blob/master/requirements/python>`__,
but these should be installed automatically when installing the package via pip.

.. TODO: talk about the fact that extra detectors can be installed here with pip install scrubadub[stanford_ner] in the future.

Related work
------------

@@ -103,8 +111,9 @@ Contents
.. toctree::
:maxdepth: 2

advanced_usage
usage
api
accuracy
contributing
changelog
