<a href="https://colab.research.google.com/github/Blackfang2511/PHI_Deidentification/blob/main/presidio_PHI_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install Presidio (analyzer and anonymizer) and download the spaCy model used by default
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/presidio_notebook.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/presidio_notebook.ipynb)

In [None]:
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
import json
from pprint import pprint

# Analyze Text for PII Entities

Presidio Analyzer scans text to identify PII entities.
It ships with predefined entity recognizers and also lets you create custom ones.

The following code sample will:

- Set up the Analyzer engine: load the NLP module (a spaCy model by default) and the other PII recognizers
- Call the analyzer to get results for the `PHONE_NUMBER` entity type


In [None]:
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"

In [None]:
analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER"], language='en')

print(analyzer_results)



[type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]
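
Each result carries `start` and `end` character offsets into the analyzed string. A minimal sketch, assuming these behave as ordinary half-open Python slice bounds (which is consistent with the output above, where the 12-character phone number spans 46 to 58). `extract_span` is a hypothetical helper, not part of Presidio:

```python
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"

def extract_span(text: str, start: int, end: int) -> str:
    """Return the substring covered by a half-open [start, end) span."""
    return text[start:end]

# Offsets copied from the analyzer output above
phone = extract_span(text_to_anonymize, 46, 58)
print(phone)
```

With real analyzer results, the same slice can be written as `text_to_anonymize[result.start:result.end]`.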


## Create Custom PII Entity Recognizers

Presidio Analyzer comes with a pre-defined set of entity recognizers. It also allows adding new recognizers without changing the analyzer base code, **by creating custom recognizers**.
In the following example, we will create two new recognizers of type `PatternRecognizer` to identify titles and pronouns in the analyzed text.
A `PatternRecognizer` is a PII entity recognizer which uses regular expressions or deny-lists.

The following code sample will:
- Create custom recognizers
- Add the new custom recognizers to the analyzer
- Call analyzer to get results from the new recognizers
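
The steps above rely on deny-list matching. As a rough, library-free illustration of that idea (an assumption about the general technique, not Presidio's actual implementation), the sketch below finds whole-token occurrences of deny-list entries with a regular expression. The function `deny_list_spans` and the example title and pronoun lists are hypothetical. In Presidio itself, the corresponding step is constructing `PatternRecognizer(supported_entity=..., deny_list=[...])` and registering it with `analyzer.registry.add_recognizer(...)`.

```python
import re

def deny_list_spans(text, entity_type, deny_list):
    """Return (entity_type, start, end) for each whole-token deny-list hit."""
    # Try longer entries first so a longer entry wins over a shorter one
    # that shares its prefix
    ordered = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds stop matches inside longer words (e.g. "his" inside "this")
    regex = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return [(entity_type, m.start(), m.end()) for m in regex.finditer(text)]

text = "His name is Mr. Jones and his phone number is 212-555-5555"

# Illustrative deny lists (assumed values, not taken from the notebook)
titles = deny_list_spans(text, "TITLE", ["Mr.", "Mrs.", "Miss"])
pronouns = deny_list_spans(text, "PRONOUN", ["he", "He", "his", "His", "she", "She"])

print(titles)
print(pronouns)
```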

Call Presidio Analyzer to get results from all configured recognizers, both the defaults and the new custom ones:

In [None]:
analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')

analyzer_results

# Anonymize Text with Identified PII Entities

<br>Presidio Anonymizer iterates over the Presidio Analyzer results and anonymizes the identified text.
<br>The anonymizer provides five built-in operators: replace, redact, mask, hash, and encrypt. The default is **replace**.

<br>The following code sample will:
<ol>
<li>Set up the anonymizer engine</li>
<li>Create an anonymizer request: the text to anonymize, the operators to apply, and the results from the analyzer request</li>
<li>Anonymize the text</li>
</ol>

In [None]:
anonymizer = AnonymizerEngine()

anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={
        # Fallback for any entity type without its own operator
        "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
        # Mask the last 12 characters of each phone number with "*"
        "PHONE_NUMBER": OperatorConfig(
            "mask",
            {"type": "mask", "masking_char": "*", "chars_to_mask": 12, "from_end": True},
        ),
        # Remove titles entirely
        "TITLE": OperatorConfig("redact", {}),
    },
)

print(f"text: {anonymized_results.text}")
print("detailed response:")

pprint(json.loads(anonymized_results.to_json()))
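
The `operators` mapping above assigns one operator per entity type, with `DEFAULT` as the fallback. As a library-free sketch of that idea, the code below assumes each analyzer result is an `(entity_type, start, end)` tuple and that an operator is a plain function from the matched substring to its replacement. `apply_operators`, `replace_with`, and `mask_from_end` are hypothetical stand-ins, not Presidio's own classes, and the `PERSON` offsets are assumed for illustration. Spans are rewritten from the end of the text backwards so that earlier offsets stay valid after each substitution.

```python
def replace_with(new_value):
    """Operator: swap the whole match for a fixed placeholder."""
    return lambda matched: new_value

def mask_from_end(masking_char, chars_to_mask):
    """Operator: overwrite the last `chars_to_mask` characters of the match."""
    def operator(matched):
        n = min(chars_to_mask, len(matched))
        return matched[: len(matched) - n] + masking_char * n
    return operator

def apply_operators(text, spans, operators):
    """Apply one operator per span, falling back to operators["DEFAULT"]."""
    # Process right-to-left so offsets further left remain correct
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        operator = operators.get(entity_type, operators["DEFAULT"])
        text = text[:start] + operator(text[start:end]) + text[end:]
    return text

text = "His name is Mr. Jones and his phone number is 212-555-5555"
# Illustrative, non-overlapping spans; the PERSON offsets are an assumption
spans = [("PERSON", 16, 21), ("PHONE_NUMBER", 46, 58)]
operators = {
    "DEFAULT": replace_with("<ANONYMIZED>"),
    "PHONE_NUMBER": mask_from_end("*", 12),
}

result = apply_operators(text, spans, operators)
print(result)
```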

In [None]:
person1 = {"name": "Martin Smith",
           "special_value":"145A",
           "free_text": "Martin Smith, id 145A, likes playing basketball"}
person2 = {"name":"Deb Schmidt",
           "special_value":"256B",
           "free_text": "Deb Schmidt, id 256B likes playing soccer"}
person3 = {"name":"R2D2",
           "special_value":"X1T2",
           "free_text": "X1T2 is R2D2's special value"}

dataset = [person1, person2, person3]
dataset

[{'name': 'Martin Smith',
  'special_value': '145A',
  'free_text': 'Martin Smith, id 145A, likes playing basketball'},
 {'name': 'Deb Schmidt',
  'special_value': '256B',
  'free_text': 'Deb Schmidt, id 256B likes playing soccer'},
 {'name': 'R2D2',
  'special_value': 'X1T2',
  'free_text': "X1T2 is R2D2's special value"}]

In [None]:
for person in dataset:

    # Get the different known values
    name = person['name']
    special_val = person['special_value']

    # Get the free text to anonymize
    free_text = person['free_text']

    # Create ad-hoc recognizers
    ad_hoc_name_recognizer = PatternRecognizer(supported_entity="name", deny_list = [name])
    ad_hoc_id_recognizer = PatternRecognizer(supported_entity="special_value", deny_list = [special_val])

    # Run the analyze method with ad_hoc_recognizers:
    analyzer_results = analyzer.analyze(text=free_text,
                                        language="en",
                                        ad_hoc_recognizers=[ad_hoc_name_recognizer, ad_hoc_id_recognizer])

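    # Anonymize results (uses the `anonymizer` engine created earlier)
    anonymized = anonymizer.anonymize(text=free_text, analyzer_results=analyzer_results)

    # Store output in original dataset
    # (the key name below is an illustrative choice, not from the original notebook)
    person['anonymized_free_text'] = anonymized.text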

In [None]:
analyzer_results

[type: special_value, start: 0, end: 4, score: 1.0,
 type: name, start: 8, end: 12, score: 1.0,
 type: PERSON, start: 0, end: 4, score: 0.85]
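
The output above reports two entity types, `special_value` and `PERSON`, for the same span 0 to 4. One simple way to settle such overlaps before further processing is sketched below, under the assumption that the higher score should win. This is an illustrative policy, not a description of how Presidio resolves conflicts, and `keep_best_per_overlap` is a hypothetical helper.

```python
def keep_best_per_overlap(results):
    """Greedy filter: keep the highest-scoring span, drop anything it overlaps."""
    kept = []
    # Highest score first; ties fall back to the longer span
    ranked = sorted(results, key=lambda r: (-r["score"], -(r["end"] - r["start"])))
    for cand in ranked:
        overlaps = any(cand["start"] < k["end"] and k["start"] < cand["end"] for k in kept)
        if not overlaps:
            kept.append(cand)
    # Return survivors in reading order
    return sorted(kept, key=lambda r: r["start"])

# Values copied from the analyzer output above
results = [
    {"type": "special_value", "start": 0, "end": 4, "score": 1.0},
    {"type": "name", "start": 8, "end": 12, "score": 1.0},
    {"type": "PERSON", "start": 0, "end": 4, "score": 0.85},
]

resolved = keep_best_per_overlap(results)
print([r["type"] for r in resolved])
```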

In [None]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
import nltk

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

text = '''
Diego Estrada \n Design Thinking Assignment\n\nVisualization Tool\n\nChallenge & Selection
\n\nThe elderly were having a hard time adapting to the changes we brought in our bank.
As  a result of a poorly implemented linear solution, a more customer centric approach was  needed.
\n\nAfter learning about design thinking in this course, we decided to apply it to solve this  problem.
The visualization tool allowed the team to create a dynamic presentation using  diagrams, figures and drawings on the go that really resonated among the stakeholders.
  Previous to this change, none of our solutions seemed to be adequate for them, but the  new implementation created a different type of
   connection with them that helped them  understand the problem in the way the team and I did.\n\nApplication\n\nThe process starts in the prep time.
    The team uses a series of tools and software to  develop a presentation using the surveys gathered during research and the solutions we  created during
     the process. The use of graphs to quickly show statistics in a fully visual  way, rather than verbally was a game changer.
     \n\nAfter having a presentation prepared, the team hands an activity to the stakeholders,
     where the solutions discussed previously appear. Nonetheless, the solutions need more  work to them. After this.
      The stakeholders are asked to help complete the solutions  while the team and I create diagrams on a blackboard to represent how their
      suggestions would impact on this specific problem.\n\nThe use of a group activity strengthens the bond between the company and their  investors.
       It makes them feel like they take part and help solve the problems as well as  show how customer centric the solutions are. Every complaint and
       suggestion from  customers are read and evaluated using the graph shown in the course (Involving: can  we do it? Can we afford it? …).
       The finalization of this activity leaves the team and the  stakeholders on the same page. It allows them to completely understand and
        feel part  of the solution and also gives them the chance to ask better questions, which eases the  work of the team.
        \n\nInsight & Approach\n\nThe use of this method created a new workflow in the Design Team.
         It increased the  productivity and the success rate as well as the customer/stakeholders satisfaction.
          The  use of the visualization tool created an engaged group of people who work together
          to\n\nDiego Estrada\n\nfind a solution based on their customer satisfaction. This solution is later revised and
           tweaked with the help of the stakeholders who are deeply involved in the process.\n\nPresentations, graphics, and activities have added a huge increase in satisfaction. As a  company we also learnt that engaging different areas can be difficult because of the  varying levels of understanding, but when paired with the adequate process things just  flow.\n\n(This story is fictional and was created for solving the assignment)\n\n'
'''

# Tokenize, POS-tag, and chunk the text into named entities
nltk_results = ne_chunk(pos_tag(word_tokenize(text)))
for nltk_result in nltk_results:
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    if isinstance(nltk_result, Tree):
        name = ' '.join(leaf[0] for leaf in nltk_result.leaves())
        print('Type: ', nltk_result.label(), 'Name: ', name)

Type:  PERSON Name:  Diego 
Type:  PERSON Name:  Estrada Design Thinking Assignment 
Type:  ORGANIZATION Name:  Design Team 
Type:  PERSON Name:  Diego Estrada 


In [None]:
!pip install openai





In [None]:
import os

# Set the OPENAI_API_KEY environment variable
os.environ['OPENAI_API_KEY'] = 'sk-......'

In [None]:
s='   ISA*01*0000000000*01*0000000000*ZZ*ABCDEFGHIJKLMNO*ZZ*123456789012345*101127*1719*U*00400*000003438*0*P*>\n0   GS*HC*99999999999*888888888888*20111219*1340*1...                                                       \n1                              ST*837*0001*005010X222                                                       \n2               BHT*0019*00*565743*20110523*154959*CH                                                       \n3                   NM1*41*2*SAMPLE INC*****46*496103                                                       \n4   PER*IC*EDI DEPT*EM*FEEDBACK@1edisource.com*TE*...                                                       \n5                      NM1*40*2*PPO BLUE*****46*54771                                                       \n6                                          HL*1**20*1                                                       \n7                               PRV*BI*PXC*333600000X                                                       \n8      NM1*85*2*EDI SPECIALTY SAMPLE*****XX*123456789                                                       \n9                                 N3*1212 DEPOT DRIVE                                                       \n10                            N4*CHICAGO*IL*606930159                                                       \n11                                   REF*EI*300123456                                                       \n12                                        HL*2*1*22*1                                                       \n13                                    SBR*P********BL                                                       \n14         NM1*IL*1*CUSTOMER*KAREN****MI*YYX123456789                                                       \n15                            N3*228 PINEAPPLE CIRCLE                                                       \n16                                   N4*CORA*PA*15108                                                       \n17               
                   DMG*D8*19630625*M                                                       \n18                     NM1*PR*2*PPO BLUE*****PI*54771                                                       \n19                                    N3*PO BOX 12345                                                       \n20                              N4*CAMP HILL*PA*17089                                                       \n21                                        HL*3*2*23*0                                                       \n22                                             PAT*19                                                       \n23                             NM1*QC*1*CUSTOMER*COLE                                                       \n24                            N3*228 PINEAPPLE CIRCLE                                                       \n25                                   N4*CORA*PA*15108                                                       \n26                                  DMG*D8*19940921*M                                                       \n27              CLM*945405*5332.54***12>B>1*Y*A*Y*Y*P                                                       \n28                                         HI*BK>2533                                                       \n29                                               LX*1                                                       \n30                     SV1*HC>J2941*5332.54*UN*84***1                                                       \n31                      DTP*472*RD8*20110511-20110511                                                       \n32                                  REF*6R*1099999731                                                       \n33                     NTE*ADD*GENERIC 12MG CARTRIDGE                                                       \n34                                LIN**N4*00013264681                                                       \n35                                   
     CTP****7*UN                                                       \n36         NM1*DK*1*PATIENT*DEBORAH****XX*12345679030                                                       \n37                               N3*123 MAIN ST*APT B                                                       \n38                         N4*PITTSBURGH*PA*152181871                                                       \n39                                         SE*39*0001                                                       \n40                                          GE*1*1377                                                       \n41                                    IEA*1*000001377                                                       '

In [None]:
analyzer_results = analyzer.analyze(text=s, entities=["PERSON"], language='en')

print(analyzer_results)

[type: PERSON, start: 249, end: 255, score: 0.85, type: PERSON, start: 4044, end: 4086, score: 0.85]


In [None]:
s[4044:4086]

'NM1*DK*1*PATIENT*DEBORAH****XX*12345679030'
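
The span above covers an entire `NM1` segment rather than just the name. Because this text is delimited X12 data, a structural parse is another option. A minimal sketch, assuming `*` is the element separator (as in the sample) and the usual NM1 layout, in which element 2 is the entity type qualifier (`1` for a person), element 3 the last or organization name, element 4 the first name, and element 9 the identification code. `parse_nm1` is a hypothetical helper:

```python
def parse_nm1(segment, sep="*"):
    """Split one NM1 segment into the fields relevant for de-identification."""
    elements = segment.strip().split(sep)
    if elements[0] != "NM1":
        raise ValueError("not an NM1 segment")
    # Pad so optional trailing elements can be read safely
    elements += [""] * (10 - len(elements))
    return {
        "entity_code": elements[1],
        "is_person": elements[2] == "1",
        "last_or_org_name": elements[3],
        "first_name": elements[4],
        "id_code": elements[9],
    }

# Segment copied from the slice shown above
fields = parse_nm1("NM1*DK*1*PATIENT*DEBORAH****XX*12345679030")
print(fields)
```

Fields identified this way could then be passed as ad-hoc deny-list values, in the same manner as the `name` and `special_value` recognizers earlier in the notebook.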

In [None]:

# Run the same NLTK entity chunker on the EDI claim text
nltk_results = ne_chunk(pos_tag(word_tokenize(s)))
for nltk_result in nltk_results:
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    if isinstance(nltk_result, Tree):
        name = ' '.join(leaf[0] for leaf in nltk_result.leaves())
        print('Type: ', nltk_result.label(), 'Name: ', name)

Type:  ORGANIZATION Name:  ISA 
Type:  ORGANIZATION Name:  ABCDEFGHIJKLMNO 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  SAMPLE 
Type:  ORGANIZATION Name:  PER 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  PPO 
Type:  ORGANIZATION Name:  PRV 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  EDI 
Type:  ORGANIZATION Name:  CHICAGO 
Type:  ORGANIZATION Name:  REF 
Type:  ORGANIZATION Name:  SBR 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  CUSTOMER 
Type:  ORGANIZATION Name:  PINEAPPLE 
Type:  ORGANIZATION Name:  CORA 
Type:  ORGANIZATION Name:  DMG 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  CAMP 
Type:  ORGANIZATION Name:  PAT 
Type:  ORGANIZATION Name:  NM1 
Type:  ORGANIZATION Name:  CUSTOMER 
Type:  ORGANIZATION Name:  PINEAPPLE 
Type:  ORGANIZATION Name:  CORA 
Type:  ORGANIZATION Name:  DMG 
Type:  ORGANIZATION Name:  CLM 
Type:  ORGANIZATION Name:  SV1 
Type:  PERSON Name:  J2941 
Type:  ORGANIZATION Name:  DTP 


In [None]:
for i in