## Identify Sensitive Comments

This notebook outlines the procedure for identifying and removing comments that contain sensitive information. The finalized script is in the `src` directory.

Using the Python module `spacy`, we perform named entity recognition (NER) on all of the comments. NER attaches labels to words or n-grams in each comment. We are interested in text tagged `PERSON`, which could reveal sensitive information. In theory, NER should identify every name, but it can miss some, for example when someone writes a person's name entirely in lowercase letters. In a manual review of a sample of the data, we saw no instances of this, nor any other cases where spaCy failed to identify a person's name.
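
The lowercase-name gap described above can be probed with a simple fallback check. The sketch below is an assumption and not part of the finalized script: the helper `find_lowercase_name_candidates` and the tiny `known_names` set are hypothetical stand-ins for a real names list. It flags lowercase tokens whose capitalized form appears in the names set, so that a reviewer can inspect them manually.

```python
import re

def find_lowercase_name_candidates(text, known_names):
    """Return lowercase tokens whose capitalized form is a known name.

    Hypothetical helper: a manual-review aid, not a replacement for NER.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [tok for tok in tokens
            if tok.islower() and tok.capitalize() in known_names]

# Toy names set (assumption); a real list would be far larger.
known_names = {"Linda", "John"}

candidates = find_lowercase_name_candidates(
    "my supervisor linda never replies, but John does", known_names)
print(candidates)
```

With a large names list, a check like this would probably flag many ordinary words that double as names, so its output is best treated as a shortlist for manual review rather than a final answer.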

In [2]:
import pandas as pd
import spacy

In [3]:
# Read the Qualitative Data
df_raw = pd.read_excel("../data/raw/2018 WES Qual Coded - Final Comments and Codes.xlsx",
                       skiprows = 1)
comments = df_raw["2018 Comment"]

In [6]:
# Use spacy's library and apply NER
nlp = spacy.load("en_core_web_sm")
docs = [nlp(comment) for comment in comments]

In [7]:
# Grab the documents with an entity label PERSON
documents = []
person_text = []
raw_index = []

for index, doc in enumerate(docs):
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            raw_index.append(index)
            documents.append(doc.text)
            person_text.append(ent.text)          

df_persons = pd.DataFrame({'original_index': raw_index,
                           'documents': documents, 
                           'person_text': person_text})

df_persons.head()

| | original_index | documents | person_text |
|---|---|---|---|
| 0 | 2 | The problem with the BCSS is Linda Cavanaugh a... | Linda Cavanaugh |
| 1 | 2 | The problem with the BCSS is Linda Cavanaugh a... | Sheriffs |
| 2 | 2 | The problem with the BCSS is Linda Cavanaugh a... | JIBC |
| 3 | 2 | The problem with the BCSS is Linda Cavanaugh a... | JIBC |
| 4 | 6 | Administration people should have better oppor... | Admin |


An example comment with sensitive information is printed below:

In [8]:
comments[2]

"The problem with the BCSS is Linda Cavanaugh and the CSB. Sheriffs are minimized and trivialized. The ADM has zero law enforcement experience. It is sickening how we are treated lumped in with civilian employees. Until sheriffs are removed from the CSB nothing will change. BCSS management has no ability to make changes because the ADM has her own civilian agenda. When will Government listen to us about how the CSB is killing us? ADM for years denied we had staffing  and wage issues when it was patently untrue. Why does the JIBC have ANY say into who we hire as instructors? JIBC is NOT a gov entity and should not have say on panels or appointing PTO's. JIBC is essentially a secret society. Only those who play their game get to teach there"

NER tagged 893 words or phrases as `PERSON`:

In [9]:
len(person_text)

893

**CORRECTING THE FALSE POSITIVES AND FALSE NEGATIVES**:  
Several of the entities that NER tagged as `PERSON` are not people, so the comments containing them are not sensitive. An example is the word "Admin", which appears often in the comments but is not a person. To adjust for this, we cross-checked every `PERSON` entity against a database of ~90,000 names. Some false positives survive this check because they appear in the names list but are not sensitive; an example is "Langford", which NER tagged and which is in the names list. We iteratively built a list of such names and removed them from the names list. Conversely, names that are sensitive but missing from the names list (false negatives) are added to it.

In [10]:
# CREATE list of names to cross reference 

df_names = pd.read_csv("../references/data-dictionaries/NationalNames.csv")
names = df_names.Name.unique().tolist()

# Names that are in the names list and that NER labels as PERSON, but that are
# not actually sensitive, i.e. they are false positives
false_names = ['Sheriff', 'Law', 'Child', 'Warden', 'Care', 'Cloud', 'Honesty',
               'Maple', 'Marijuana', 'Parks', 'Ranger', 'Travel', 'Young', 'Branch',
               'Field', 'Langford', 'Surrey', 'Cap', 'Lean', 'Van', 'Case', 'Min',
               'Merit', 'Job', 'Win', 'Forest', 'Victoria']

# Drop the false_names from the names list; keeping a set makes later lookups fast
names = set(names).difference(false_names)

# Names that are not in the names list, but should be, i.e. false negatives
missing_names = ['Kristofferson']

# Add missing names
names.update(missing_names)

In [11]:
# Cross reference the "persons" with the name database
sensitive_person = []
person_index = []

for index, person in enumerate(person_text):
    for name in person.split(): 
        if name in names:
            sensitive_person.append(person)
            person_index.append(index)
            break       
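
The cross-referencing logic can be illustrated on toy data. The sketch below assumes a hypothetical three-name set in place of the ~90,000-name file, and the function name `filter_sensitive` is illustrative only. It keeps an entity when at least one of its whitespace-separated tokens is a known name, which mirrors the loop above.

```python
def filter_sensitive(person_text, names):
    """Keep entity strings in which at least one token is a known name.

    Returns the kept strings and their positions in person_text.
    """
    sensitive, positions = [], []
    for position, person in enumerate(person_text):
        if any(token in names for token in person.split()):
            sensitive.append(person)
            positions.append(position)
    return sensitive, positions

# Toy stand-ins (assumptions) for the names list and its adjustments.
names = {"Linda", "Victoria", "Ed"}
false_names = {"Victoria"}          # in the list, but not sensitive here
missing_names = {"Kristofferson"}   # absent from the list, but sensitive
names = (names - false_names) | missing_names

person_text = ["Linda Cavanaugh", "Admin", "Victoria", "Chief Ed John"]
sensitive_person, person_index = filter_sensitive(person_text, names)
print(sensitive_person)
print(person_index)
```

The match is case-sensitive and splits on whitespace only, so a token with attached punctuation (for example `Cavanaugh,`) or a lowercase name would not match.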

After accounting for words that are not actual names, the list of sensitive persons is reduced from 893 to 153.

In [12]:
len(sensitive_person)

153

Let's take a look at the `PERSON` entities that we considered not to be sensitive after cross-referencing the names data. We can see that these were correctly removed from the sensitive list. Only nine examples are shown below.

In [33]:
list(set(person_text).difference(set(sensitive_person)))[1:10]

['Adult',
 'Happier',
 'Service Level Agreements',
 'Wardens',
 'Teleworker',
 'Sad',
 'Limit',
 'Kamloops',
 'B.C.  ']

Finally, we can grab the index of the sensitive comments which can be used to remove them from the dataset.

In [14]:
sensitive_comment_indices = df_persons.original_index[person_index].tolist()
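
The removal step itself is not shown in this notebook. A minimal sketch of one way to do it follows, using plain Python lists as stand-ins for the pandas objects; the toy `comments` and index values are assumptions.

```python
# Toy comments (assumption) standing in for the real comment column.
comments = ["comment a", "comment naming a person", "comment c",
            "another comment naming a person"]

# An index can repeat when one comment contains several PERSON entities,
# so collapse the indices into a set before filtering.
sensitive_comment_indices = [1, 1, 3]
sensitive = set(sensitive_comment_indices)

remaining = [text for i, text in enumerate(comments) if i not in sensitive]
print(remaining)
```

With a pandas Series that has the default integer index, something like `comments.drop(index=sorted(sensitive))` should give the equivalent result.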

These are all the sensitive comments that have been identified and removed:

In [19]:
comments[sensitive_comment_indices].unique()

array(["The problem with the BCSS is Linda Cavanaugh and the CSB. Sheriffs are minimized and trivialized. The ADM has zero law enforcement experience. It is sickening how we are treated lumped in with civilian employees. Until sheriffs are removed from the CSB nothing will change. BCSS management has no ability to make changes because the ADM has her own civilian agenda. When will Government listen to us about how the CSB is killing us? ADM for years denied we had staffing  and wage issues when it was patently untrue. Why does the JIBC have ANY say into who we hire as instructors? JIBC is NOT a gov entity and should not have say on panels or appointing PTO's. JIBC is essentially a secret society. Only those who play their game get to teach there",
       '-Follow the recommendations of Chief Ed John and also the recommendations as per the residential school commission report.  -Prioritize funding to support social workers. Some of the caseloads are too big to manage so you get a lot of

## Tokenization Procedure

The remaining comments that have not been identified as sensitive are tokenized as shown below and subsequently fed into the LSTM model at a later step.

In [23]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

In [28]:
max_words = 12000
maxlen = 700

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(np.array(comments))

Example Comment:

In [25]:
comments[0]

'I would suggest having a developmental growth plan within the Hiring Centre. For example: providing training for internal staff that are currently hiring clerks, to working with intake and then becoming a hiring advisor. I believe this would be an additional option for filing advisor vacancies, as admin staff would already have knowledge of all of the systems that we use internally and the hiring processes that we currently follow.'

Tokenizing the above comment converts it into a sequence of integers, each representing a word in the tokenizer's vocabulary. `texts_to_sequences` expects a list of texts, so the single comment is wrapped in a list; a bare string would be iterated character by character, producing one padded row per character. With the list, the result is a single row of length `maxlen`, left-padded with zeros. Arrays of this form are what will be loaded into Google Colab for our LSTM model.

In [27]:
pad_sequences(tokenizer.texts_to_sequences([comments[0]]), maxlen=maxlen)
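
To make the tokenize-and-pad step concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the helpers `build_word_index`, `text_to_sequence`, and `pad_left` are hypothetical, punctuation handling is reduced to a regular expression, and words outside the vocabulary are simply dropped.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def text_to_sequence(text, word_index):
    """Convert one text to a list of word indices, skipping unknown words."""
    return [word_index[w] for w in re.findall(r"[a-z']+", text.lower())
            if w in word_index]

def pad_left(sequence, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["the hiring process works", "the process is slow"]
word_index = build_word_index(texts)
padded = [pad_left(text_to_sequence(t, word_index), maxlen=6) for t in texts]
print(padded)
```

Index 0 is reserved for padding, which is why word ranks start at 1. Each text yields exactly one fixed-length row, the shape an LSTM input layer typically expects.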