Build status Go Report Card MIT license


Presidio - Data Protection API

Context aware, pluggable and customizable data protection and PII anonymization service for text and images

Description

Presidio (from the Latin praesidium, ‘protection, garrison’) helps ensure that sensitive text is properly managed and governed. It provides fast analytics and anonymization for sensitive text such as credit card numbers, bitcoin wallets, names, locations, social security numbers, US phone numbers and financial data. Presidio analyzes the text using predefined analyzers to identify patterns, formats, and checksums with relevant context.

You can find a more detailed list here

⚠️ Presidio can help identify sensitive/PII data in structured and unstructured text. However, because Presidio uses trained ML models, there is no guarantee that it will find all sensitive information. Additional systems and protections should therefore be employed.

Features

Free text anonymization

[Image: free text anonymization example]

Text anonymization in images

[Image: text anonymization in images example]

  • Text analytics - Predefined analyzers with customizable fields.

  • Probability scores - Customize the sensitive text detection threshold.

  • Anonymization - Anonymize sensitive text and images.

  • Workflow and pipeline integration - Monitor your data with periodic scans or events from:

    1. Storage solutions
      • Azure Blob Storage
      • S3
      • Google Cloud Storage
    2. Databases
      • MySQL
      • PostgreSQL
      • SQL Server
      • Oracle
    3. Streaming platforms
      • Kafka
      • Azure Event Hubs

    and export the results for further analytics:

    1. Storage solutions
    2. Databases
    3. Streaming platforms

The Technology Stack

Presidio leverages:

The design document introduces Presidio concepts and architecture.

Quickstart

  1. Install Presidio
  2. Create a Presidio project
  3. Start using the Presidio analyze and anonymize services

Note: The examples below use HTTPie.

Sample 1

  1. Analyze text
    $ echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991", "analyzeTemplate":{"fields":[]}  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze
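The same request can also be assembled programmatically. The following is a minimal Python sketch, not part of Presidio itself: the helper name `build_analyze_request` and the `http://localhost:8080` address are hypothetical, and only the URL path and the `text`, `analyzeTemplate` and `fields` keys are taken from the HTTPie command above. It builds the URL and JSON body without sending anything.

```python
import json


def build_analyze_request(base_url, project, text, field_names=()):
    """Return (url, body) for an analyze call shaped like Sample 1.

    Hypothetical helper: only the URL path and the JSON keys come from
    the HTTPie example; the function itself is an illustration.
    """
    url = f"{base_url.rstrip('/')}/api/v1/projects/{project}/analyze"
    payload = {
        "text": text,
        # An empty list mirrors the sample's '"fields":[]'.
        "analyzeTemplate": {"fields": [{"name": n} for n in field_names]},
    }
    return url, json.dumps(payload).encode("utf-8")


url, body = build_analyze_request(
    "http://localhost:8080",  # assumed address; use your <api-service-address>
    "my-project",
    "John Smith lives in New York.",
)
print(url)
print(body.decode("utf-8"))
```

The returned bytes can then be sent as the body of a POST request with a `Content-Type: application/json` header, using whichever HTTP client you prefer.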

Sample 2

You can also create reusable templates.

  1. Create an analyzer template

    $ echo -n '{"fields":[]}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
  2. Analyze text

    $ echo -n '{"text":"my credit card number is 2970-84746760-9907 345954225667833 4961-2765-5327-5913", "AnalyzeTemplateId":"<my-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze

Sample 3

  1. Create an analyzer template

    $ echo -n '{"fields":[{"name":"PHONE_NUMBER"}, {"name":"LOCATION"}, {"name":"DATE_TIME"}]}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
  2. Analyze text

    $ echo -n '{"text":"We met yesterday morning in Seattle and his phone number is (212) 555 1234", "AnalyzeTemplateId":"<my-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze
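The two-step template flow in Samples 2 and 3 (store a template, then reference it by name) can be sketched in Python as follows. This is an illustration under assumptions: the function names, the `contact-info` template name and the `http://localhost:8080` address are hypothetical; the URL paths and the `fields`, `name`, `text` and `AnalyzeTemplateId` keys are copied from the commands above. Nothing is sent over the network.

```python
import json


def analyze_template_request(base_url, project, template_name, field_names):
    """Return (url, body) that creates a reusable analyze template."""
    url = (f"{base_url.rstrip('/')}/api/v1/templates/"
           f"{project}/analyze/{template_name}")
    body = {"fields": [{"name": n} for n in field_names]}
    return url, json.dumps(body)


def analyze_by_template_request(base_url, project, template_name, text):
    """Return (url, body) that analyzes text using a stored template."""
    url = f"{base_url.rstrip('/')}/api/v1/projects/{project}/analyze"
    body = {"text": text, "AnalyzeTemplateId": template_name}
    return url, json.dumps(body)


BASE = "http://localhost:8080"  # assumed address; use your <api-service-address>

tpl_url, tpl_body = analyze_template_request(
    BASE, "my-project", "contact-info",  # hypothetical template name
    ["PHONE_NUMBER", "LOCATION", "DATE_TIME"],
)
run_url, run_body = analyze_by_template_request(
    BASE, "my-project", "contact-info",
    "We met yesterday morning in Seattle and his phone number is (212) 555 1234",
)

print(tpl_url, tpl_body)
print(run_url, run_body)
```

Keeping the template name in one variable and passing it to both calls avoids the two requests drifting apart.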

Sample 4

  1. Create an anonymizer template (this template replaces PHONE_NUMBER values and redacts CREDIT_CARD values)

    $ echo -n '{"fieldTypeTransformations":[{"fields":[{"name":"PHONE_NUMBER"}],"transformation":{"replaceValue":{"newValue":"\u003cphone-number\u003e"}}},{"fields":[{"name":"CREDIT_CARD"}],"transformation":{"redactValue":{}}}]}' | http <api-service-address>/api/v1/templates/<my-project>/anonymize/<my-anonymize-template-name>
  2. Anonymize text

    $ echo -n '{"text":"my phone number is 057-555-2323 and my credit card is 4961-2765-5327-5913", "AnalyzeTemplateId":"<my-analyze-template-name>", "AnonymizeTemplateId":"<my-anonymize-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/anonymize
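The anonymizer template body above is easier to read when built from data. The sketch below is a hypothetical Python helper (`anonymize_template` is not a Presidio function) that produces the same JSON structure as Sample 4; the key names are copied from the command above, and only the two transformations shown there (`replaceValue` and `redactValue`) are covered.

```python
import json


def anonymize_template(replacements, redactions):
    """Build an anonymizer template body shaped like Sample 4.

    replacements: dict mapping a field name to its replacement string.
    redactions: iterable of field names whose values are redacted.
    """
    transformations = []
    for field, new_value in replacements.items():
        transformations.append({
            "fields": [{"name": field}],
            "transformation": {"replaceValue": {"newValue": new_value}},
        })
    for field in redactions:
        transformations.append({
            "fields": [{"name": field}],
            "transformation": {"redactValue": {}},
        })
    return {"fieldTypeTransformations": transformations}


template = anonymize_template(
    replacements={"PHONE_NUMBER": "<phone-number>"},
    redactions=["CREDIT_CARD"],
)
print(json.dumps(template))
```

Python's `json.dumps` writes `<` and `>` literally, whereas the sample uses the escapes `\u003c` and `\u003e`; both forms decode to the same string.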

Sample 5 (Image anonymization)

  1. Create an anonymizer image template (this template redacts values by filling them with black)

    $ echo -n '{"fieldTypeGraphics":[{"graphic":{"fillColorValue":{"blue":0,"red":0,"green":0}}}]}' | http <api-service-address>/api/v1/templates/<my-project>/anonymize-image/<my-anonymize-image-template-name>
  2. Anonymize image

    $ http -f POST <api-service-address>/api/v1/projects/<my-project>/anonymize-image detectionType='OCR' analyzeTemplateId='<my-analyze-template-name>' anonymizeImageTemplateId='<my-anonymize-image-template-name>' imageType='image/png' file@~/test-ocr.png > test-output.png
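To redact with a color other than black, only the `fillColorValue` entry of the template changes. The following Python sketch builds that template body; `fill_color_template` is a hypothetical helper, and the 0-255 range check assumes 8-bit RGB channels, which the sample (all zeros) suggests but does not state.

```python
import json


def fill_color_template(red, green, blue):
    """Build an anonymize-image template body shaped like Sample 5.

    The 0-255 range check is an assumption (8-bit RGB); the sample only
    shows the all-zero (black) case.
    """
    for name, value in (("red", red), ("green", green), ("blue", blue)):
        if not 0 <= value <= 255:
            raise ValueError(f"{name} must be in 0..255, got {value}")
    return {
        "fieldTypeGraphics": [
            {"graphic": {"fillColorValue":
                         {"red": red, "green": green, "blue": blue}}}
        ]
    }


black = fill_color_template(0, 0, 0)
print(json.dumps(black))
```

The printed JSON can be piped to the `anonymize-image` template endpoint in the same way as the `echo -n` command in step 1.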

Current Features Status

Module             Feature               Status
API                HTTP input
Scanner            MySQL
Scanner            MSSQL
Scanner            PostgreSQL
Scanner            Oracle
Scanner            Azure Blob Storage
Scanner            S3
Scanner            Google Cloud Storage
Streams            Kafka                 🔶
Streams            Azure Event Hub       🔶
Datasink (output)  MySQL
Datasink (output)  MSSQL
Datasink (output)  Oracle
Datasink (output)  PostgreSQL
Datasink (output)  Kafka
Datasink (output)  Azure Event Hub
Datasink (output)  Azure Blob Storage
Datasink (output)  S3
Datasink (output)  Google Cloud Storage
  • - Working
  • 🔶 - Partially working
  • - Not working yet but we are on it 😉

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.