Merge pull request #60 from LeapBeyond/feature/api_changes

Internal changes to allow for new features
LeapBeyond · Oct 20, 2020 · 0eec66e
2 parents e1b2343 + 3b2c844
commit 0eec66e

Showing 76 changed files with 3,306 additions and 467 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -8,13 +8,19 @@ python:
 install:
   - pip install -r requirements/python-dev
   - python -m textblob.download_corpora
+  - pip install .
+#  - apt-get install curl autoconf automake libtool pkg-config
+#  - git clone https://github.com/openvenues/libpostal && cd libpostal && ./bootstrap.sh && ./configure && \
+#    make -j4 && make install
 
 # commands to run the testing suite. if any of these fail, travis lets us know
 # Enabling type checking with mypy, but only showing the warning messages
 script:
+  - mypy --config-file setup.cfg scrubadub/
   - nosetests --with-coverage --cover-package=scrubadub
-  - mypy --config-file setup.cfg scrubadub/ || true
-  - pycodestyle scrubadub/
+  - flake8  --config setup.cfg scrubadub/
+  - python3 ./tests/benchmark_accuracy.py
+  - python3 ./tests/benchmark_time.py
   - cd docs && make html && cd -
 
 # commands to run after the tests successfully complete

diff --git a/docs/accuracy.rst b/docs/accuracy.rst
@@ -0,0 +1,85 @@
+.. _comparison:
+
+Accuracy
+========
+
+The most common question that people have about scrubadub is:
+
+    How accurately can scrubadub detect PII?
+
+It's a great question that is hard, but essential, to answer.
+
+It is straightforward to measure this on pseudo-data (fake data that is generated), but it's not clear how applicable this is to real-world applications.
+It might be possible to use some open real-world datasets, but it's not clear whether such datasets exist given the sensitivity of PII.
+
+We show the precision and recall for each of the `Filth` types detected by the various `Detector`\ s.
+`Wikipedia <https://en.wikipedia.org/wiki/Precision_and_recall>`_ has a good explanation; in short, these are defined as:
+
+- **Precision:** Percentage of true `Filth` detected out of all `Filth` selected by the `Detector`
+
+  - If this is low, there is lots of clean text incorrectly detected as `Filth`
+
+- **Recall:** Percentage of the true `Filth` that is selected by the `Detector`
+
+  - If this is low, there is lots of dirty text that is not detected as `Filth`
+
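A minimal sketch of how the two metrics defined above can be computed, kept independent of scrubadub. The function name `precision_recall` and the sample strings are hypothetical, and treating an empty denominator as 100% is an assumption of this sketch rather than documented scrubadub behaviour.

```python
def precision_recall(true_items, detected_items):
    """Return (precision, recall) for two collections of flagged items."""
    true_set = set(true_items)
    detected_set = set(detected_items)
    # Items that are both truly Filth and selected by the detector.
    true_positives = len(true_set & detected_set)
    # Assumption: with nothing detected (or nothing to find) report 1.0
    # instead of dividing by zero.
    precision = true_positives / len(detected_set) if detected_set else 1.0
    recall = true_positives / len(true_set) if true_set else 1.0
    return precision, recall


# Hypothetical example: four true items, five detections, three in common.
known = {"john@example.com", "555-0100", "Jane Doe", "10 Main St"}
detected = {"john@example.com", "555-0100", "Jane Doe", "Monday", "Paris"}

precision, recall = precision_recall(known, detected)
print("precision={:.0%} recall={:.0%}".format(precision, recall))
# prints: precision=60% recall=75%
```

Here the low precision comes from the two clean words flagged as Filth, and the imperfect recall from the address that was missed.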
+Pseudo-data performance
+-----------------------
+
+This section uses data created by the Faker package to test the effectiveness of the various detectors.
+Here the detectors generally perform very well (often 100%), but this is unlikely to be representative of performance on real data.
+
++----------------+----------------+-----------+-------------+-------------+
+| Filth type     | Detector       | Locale    | Precision   | Recall      |
++================+================+===========+=============+=============+
+| Address        | Address        | en_GB     | 100%        | 96%         |
++----------------+----------------+-----------+-------------+-------------+
+| Address        | Address        | en_US     | 100%        | 74%         |
++----------------+----------------+-----------+-------------+-------------+
+| Email          | Email          | N/A       | 100%        | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+| Name           | Name           | en_US     | 9%          | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+| Name           | Stanford NER   | en_US     | 95%         | 86%         |
++----------------+----------------+-----------+-------------+-------------+
+| Phone Number   | Phone Number   | en_GB     | TODO        | TODO        |
++----------------+----------------+-----------+-------------+-------------+
+| Phone Number   | Phone Number   | en_US     | 100%        | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+| Postal code    | Postal code    | en_GB     | 100%        | 74%         |
++----------------+----------------+-----------+-------------+-------------+
+| SSN            | SSN            | en_US     | 100%        | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+| Twitter        | Twitter        | N/A       | 100%        | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+| URL            | URL            | N/A       | 100%        | 100%        |
++----------------+----------------+-----------+-------------+-------------+
+
+
+Real data performance
+---------------------
+
+We are trying to find datasets that could be used to evaluate performance; if you know of any, let us know.
+Stay tuned for more updates.
+
+Measuring performance
+---------------------
+
+Read this section if you want to measure performance on your own data.
+
+First, you need data that contains PII, in which each piece of PII has been tagged as true PII, usually by a human.
+If you cannot get real data, you can generate fake data, but this is never as good; the function ``make_fake_document()`` below makes a fake document and provides the known filth items needed for the ``KnownFilthDetector``.
+
+Once this is done you can add the ``KnownFilthDetector`` to your scrubber and provide it with your known true Filth.
+Then you can use the ``get_filth_classification_report(filth_list)`` function to get a report containing the recall and precision of the detectors.
+In addition to this classification report, there is also the ``get_filth_dataframe(filth_list)`` function that returns a pandas `DataFrame` that can be used to get more information on the types of `Filth` that were detected.
+
+.. autofunction:: scrubadub.comparison.get_filth_classification_report
+
+.. autofunction:: scrubadub.comparison.get_filth_dataframe
+
+.. autofunction:: scrubadub.comparison.make_fake_document
+
+
+
+
diff --git a/docs/advanced_usage.rst b/docs/advanced_usage.rst
diff --git a/docs/api.rst b/docs/api.rst
@@ -3,19 +3,42 @@
 API
 ===
 
-``scrubadub`` consists of three separate components:
+``scrubadub`` consists of four separate components:
 
-* The ``Scrubber`` is responsible for managing all of the ``Detector`` objects
-  and resolving any conflicts that may arise between different ``Detector``
-  objects.
-
-* ``Detector`` objects are used to detect specific types of ``Filth``.
 
 * ``Filth`` objects are used to identify specific parts of a piece of dirty
   dirty text that contain sensitive information and they are responsible for
   deciding how the resulting information should be replaced in the cleaned
   text.
 
+* ``Detector`` objects are used to detect specific types of ``Filth``.
+
+* ``PostProcessor`` objects are used to alter the found ``Filth``.
+  This could be to replace the ``Filth`` with a hash or token.
+
+* The ``Scrubber`` is responsible for managing the cleaning process.
+  It keeps track of the ``Detector``, ``PostProcessor`` and ``Filth`` objects.
+  It also resolves conflicts that may arise between different ``Detector``
+  objects.
+
+
+scrubadub
+---------
+
+There are several convenience functions to make using scrubadub quick and simple.
+These functions either remove the Filth from the text (such as ``scrubadub.clean``) or
+return a list of Filth objects that were found (such as ``scrubadub.list_filth``).
+They work either on a single document given as a string (such as ``scrubadub.clean``) or
+on a set of documents given as either a dictionary or a list (such as ``scrubadub.clean_documents``).
+
+.. autofunction:: scrubadub.clean
+
+.. autofunction:: scrubadub.clean_documents
+
+.. autofunction:: scrubadub.list_filth
+
+.. autofunction:: scrubadub.list_filth_documents
+
 
 Scrubber
 --------
@@ -105,11 +128,36 @@ be cleaned. Every type of ``Filth`` inherits from `scrubadub.filth.base.Filth`.
     :undoc-members:
     :show-inheritance:
 
-There is also a convenience class for ``RegexFilth``, which makes it easy to
-quickly remove new types of filth that can be identified from regular
-expressions:
+PostProcessors
+--------------
+
+``PostProcessor``\ s can be used to process the detected ``Filth``
+objects and make changes to them.
+
+These are a new addition to scrubadub and at the moment only simple ones
+exist that alter the replacement string.
+
+.. autoclass:: scrubadub.post_processors.base.PostProcessor
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. autoclass:: scrubadub.post_processors.text_replacers.filth_type.FilthTypeReplacer
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. autoclass:: scrubadub.post_processors.text_replacers.hash.HashReplacer
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. autoclass:: scrubadub.post_processors.text_replacers.numeric.NumericReplacer
+    :members:
+    :undoc-members:
+    :show-inheritance:
 
-.. autoclass:: scrubadub.filth.base.RegexFilth
+.. autoclass:: scrubadub.post_processors.text_replacers.prefix_suffix.PrefixSuffixReplacer
     :members:
     :undoc-members:
     :show-inheritance:
diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -11,6 +11,51 @@ latest changes in development for next release
 
 .. THANKS FOR CONTRIBUTING; MENTION WHAT YOU DID IN THIS SECTION HERE!
 
+2.0.0
+-----
+
+There have been some changes in the scrubadub API, but (almost) no breaking changes.
+The changes include:
+
+* Introduced the concept of a `PostProcessor`.
+  This will allow more complex groupings of `Filth`\ s and new types of tokenization.
+* Added ability to easily evaluate a `Detector`\ 's performance.
+* The name of the detector has been separated from the type of filth found.
+  This means multiple instances of the same detector (configured differently) can be in the same `Scrubber` instance and one `Detector` can return multiple types of `Filth`.
+* A default set of `Detector`\ s is loaded instead of all `Detector`\ s.
+  This is particularly useful for optional detectors with complex dependencies.
+
+
+Scrubber
+^^^^^^^^
+
+* `Detector`\ s can be added and removed using a string containing their default name, their class or an instance.
+* You can clean multiple documents with one `Scrubber().clean_documents(docs)` call.
+
+Detectors
+^^^^^^^^^
+
+* Detectors now require a class instance variable called name, which should be unique within a `Scrubber` instance.
+* Regular expressions used by the `RegexDetector` class have been moved from ``RegexFilth.regex`` to ``RegexDetector.regex``.
+
+Filth
+^^^^^
+
+* Introduced two constructor parameters, `detector_name` and `document_name`.
+  These keep track of the `Detector` that found the `Filth` and the document it came from.
+  This results in `Filth` objects being passed additional parameters on initialisation.
+  This is the one breaking change: `Filth.__init__` should accept the `detector_name` and `document_name` keywords and call the base class constructor.
+
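A minimal sketch of what the breaking change above asks of a custom `Filth` subclass. The `Filth` class defined here is a simplified, hypothetical stand-in so the example runs on its own; it is not the real `scrubadub.filth.base.Filth`, whose constructor takes more arguments. `EmployeeIdFilth` and its `department` argument are likewise made up for illustration.

```python
class Filth:
    """Hypothetical, simplified stand-in for scrubadub's base Filth class."""

    type = "unknown"

    def __init__(self, beg=0, end=0, text="",
                 detector_name=None, document_name=None):
        self.beg = beg
        self.end = end
        self.text = text
        # The two new keywords described in the changelog entry above.
        self.detector_name = detector_name
        self.document_name = document_name


class EmployeeIdFilth(Filth):
    """Hypothetical custom Filth type with one extra argument of its own."""

    type = "employee_id"

    def __init__(self, *args, department=None, **kwargs):
        # Forward everything else, including detector_name and
        # document_name, to the base class constructor.
        super().__init__(*args, **kwargs)
        self.department = department


filth = EmployeeIdFilth(
    beg=0, end=5, text="E1234",
    detector_name="employee_id", document_name="doc1",
    department="HR",
)
print(filth.detector_name, filth.document_name)
```

Accepting `*args` and `**kwargs` and passing them through means the subclass keeps working if the base constructor gains further keywords later.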
+PostProcessors
+^^^^^^^^^^^^^^
+
+* Introduction of simple `PostProcessors`:
+   * `FilthTypeReplacer`: Replace the filth with the type of filth ``example@example.com -> EMAIL``
+   * `HashReplacer`: Replace the filth with a configurable hash ``example@example.com -> 196aa39e9f8159ec``
+   * `NumericReplacer`: Replace the filth with a monotonically increasing number for each unique piece of filth, optionally including the filth type ``example@example.com -> EMAIL-1``.
+   * `PrefixSuffixReplacer`: Add a prefix and/or suffix onto the replacement ``EMAIL-1 -> {{EMAIL-1}}``
+* It is envisioned that other, more complex operations can be done here too, such as grouping filth (e.g. "John", "John Doe" and "Mr. Doe" could be grouped together).
+
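The four replacement styles listed above can be illustrated with a standalone sketch. None of this is scrubadub's implementation: the function and class names are hypothetical, SHA-256 truncated to 16 hex characters is an assumed hashing scheme, and numbering per filth type starting at 0 is an assumption, so the outputs will not necessarily match the examples in the changelog.

```python
import hashlib


def filth_type_replacement(filth_type):
    """Replace the filth with its type, e.g. 'email' -> 'EMAIL'."""
    return filth_type.upper()


def hash_replacement(text, salt="", length=16):
    """Replace the filth with a truncated hex digest (assumed scheme)."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:length]


class NumericLabeller:
    """Give each unique piece of filth a stable number within its type."""

    def __init__(self):
        self._labels = {}  # (filth_type, text) -> assigned number
        self._next = {}    # filth_type -> next unused number

    def label(self, filth_type, text):
        key = (filth_type, text)
        if key not in self._labels:
            number = self._next.get(filth_type, 0)
            self._labels[key] = number
            self._next[filth_type] = number + 1
        return "{}-{}".format(filth_type.upper(), self._labels[key])


def add_prefix_suffix(replacement, prefix="{{", suffix="}}"):
    """Wrap an existing replacement string, e.g. 'EMAIL-1' -> '{{EMAIL-1}}'."""
    return prefix + replacement + suffix


labeller = NumericLabeller()
for name in ["John", "Doug", "John"]:
    # The same text always maps to the same label.
    print(add_prefix_suffix(labeller.label("name", name)))
```

Chaining the labeller with `add_prefix_suffix` mirrors the idea that the replacers each alter the replacement string in turn.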
 1.2.2
 -----
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -50,20 +50,12 @@ incorporating it into your python scripts like this:
     # John may be a cat, but he doesn't want other people to know it.
     >>> text = u"John is a cat"
 
-    # Replace names with {{NAME}} placeholder. This is the scrubadub default
-    # because it maximally omits any information about people.
-    >>> scrubadub.clean(text)
-    u"{{NAME}} is a cat"
-
     # Replace names with {{NAME-ID}} anonymous, but consistent IDs.
-    >>> scrubadub.clean(text, replace_with='identifier')
+    >>> scrubadub.clean(text)
     u"{{NAME-0}} is a cat"
-    >>> scrubadub.clean("John spoke with Doug.", replace_with='identifier')
-    u"{{NAME-0}} spoke with {{NAME-1}}."
 
-..    # Replace names with random, gender-consistent names
-    >>> scrubadub.clean(text, replace_with='surrogate')
-    u"Billy is a cat"
+    >>> scrubadub.clean("John spoke with Doug.")
+    u"{{NAME-0}} spoke with {{NAME-1}}."
 
 
 There are many ways to tailor the behavior of ``scrubadub`` using
@@ -72,6 +64,22 @@ There are many ways to tailor the behavior of ``scrubadub`` using
 in which ``scrubadub`` cleans dirty dirty text.
 
 
+Installation
+------------
+
+To install scrubadub using pip, simply type::
+
+    pip install scrubadub
+
+This package requires at least Python 3.5.
+For Python 2.7 support, see v1.2.2, which is the last version that supports it.
+
+There are a few python dependencies, which can be seen in the
+`requirements file <https://github.com/LeapBeyond/scrubadub/blob/master/requirements/python>`__,
+but these should be installed automatically when installing the package via pip.
+
+.. TODO: talk about the fact that extra detectors can be installed here with pip install scrubadub[stanford_ner] in the future.
+
 Related work
 ------------
 
@@ -103,8 +111,9 @@ Contents
 .. toctree::
    :maxdepth: 2
 
-   advanced_usage
+   usage
    api
+   accuracy
    contributing
    changelog