# Assessing free-form text fields

Because the dataset contains free-form text fields, we inspected their contents manually. This serves two purposes. First, we want to make sure that no sensitive or identifying information was entered in those fields, so that the database can be sanitized. Second, we hope to find information that has not been coded in the structured columns.

## Loading the data

In [1]:
import pandas as pd

In [31]:
df = pd.read_csv('full-310k - Copy.csv')

In [33]:
# list of free text columns, per data dictionary
columns = 'hwrep_dx_oth', 'hwrep_comment_final', 'hwrep_tx_oth', 's_oth_sympt_entered'

In [65]:
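import collections

corpus = pd.DataFrame({col: df[col] for col in columns})
# `comment` was not defined in the original cell; assumed here to be the
# final-comment column, which is the one inspected further down
comment = corpus['hwrep_comment_final']

# count whitespace-separated tokens over all non-missing comments
counters = collections.Counter()
for v in comment[pd.notna(comment)]:
    counters.update(v.split())

# rarest tokens first, most frequent last
pd.DataFrame(sorted((c, w) for w, c in counters.items()), columns=('occurrences', 'word'))

Unnamed: 0,occurrences,word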
0,1,!
1,1,#
2,1,&
3,1,(bacterial)so
4,1,(tachycardia
...,...,...
905,71,not
906,88,child
907,96,for
908,103,was


## Remarks
We did not find any patient names or identifiers in this data. In one instance, one of the doctors appears to have given out their own phone number. Since there might be undetected quasi-identifiers, the raw data should be considered private.

Potentially interesting features are:
- medical terms and treatment names not coded in the other columns (and their various alternative spellings across French and English)
- "MRDT was not done" (and other ways to phrase it)

In any case, each of those features would have only a couple hundred occurrences, which is of limited use given the size of the dataset. From a machine learning perspective, they are therefore not worth keeping.
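Should such a flag be wanted after all, a feature like "MRDT was not done" could be extracted roughly as follows. This is only a sketch: the pattern `MRDT_NOT_DONE` and the helper `mrdt_not_done` are hypothetical names, the list of phrasings is an assumption and not exhaustive, and French variants would have to be added after reviewing the actual comments.

```python
import re

# Assumed English phrasings only; extend after reviewing the real comments.
MRDT_NOT_DONE = re.compile(
    r"\bm?rdt\b[^.;]{0,30}?\bnot\s+(?:done|performed|available)\b"
    r"|\bno\s+m?rdt\b",
    re.IGNORECASE,
)

def mrdt_not_done(text):
    """Return True if the text appears to say that no (m)RDT was done."""
    # missing values (None / NaN) are not strings and yield False
    return isinstance(text, str) and MRDT_NOT_DONE.search(text) is not None

# made-up examples, not taken from the dataset
examples = [
    "MRDT was not done",
    "mRDT not performed due to stock-out",
    "no mRDT available",
    "RDT positive, treated with ACT",
    None,
]
flags = [mrdt_not_done(t) for t in examples]
print(flags)
```

On the notebook's data this could be applied with something like `corpus['hwrep_comment_final'].map(mrdt_not_done)`, which would give one boolean per row.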

In [None]:
# in this instance, the doctor gives their phone number
corpus['hwrep_comment_final'][1153]
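A manual read can miss entries, so a rough automated scan for phone-like digit runs could complement it. The sketch below is a heuristic under stated assumptions: `PHONE_LIKE` and `find_phone_like` are hypothetical names, and "eight or more digits, optionally separated by spaces, dots or dashes" is an assumed definition of a phone-like string that will produce both false positives and false negatives.

```python
import re

# Assumed heuristic: optional '+', then at least 8 digits that may be
# separated by single spaces, dots or dashes.
PHONE_LIKE = re.compile(r"\+?\d(?:[ .-]?\d){7,}")

def find_phone_like(texts):
    """Return (position, matched text) pairs for entries with a phone-like digit run."""
    hits = []
    for i, text in enumerate(texts):
        if not isinstance(text, str):
            continue  # skip missing values (None / NaN)
        m = PHONE_LIKE.search(text)
        if m:
            hits.append((i, m.group()))
    return hits

# made-up examples, not taken from the dataset
samples = [
    "fever for 3 days",
    None,
    "call me on 0712 345 678",
    "temp 38.5, weight 12 kg",
]
hits = find_phone_like(samples)
print(hits)
```

The returned positions are positional, so when the function is given a pandas column they correspond to `iloc` positions rather than index labels. Any hit still needs a human look before deciding that it is an identifier.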

## Conclusion
To protect the database against privacy breaches, the free-form fields should be removed and, optionally, replaced by features extracted from them with regular expressions. We chose not to extract any features.
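Removing the free-form fields could look like the sketch below. The column names come from the list used earlier in this notebook; the helper `drop_free_text`, the demo frame and its `age_months` column are hypothetical and stand in for the real dataset.

```python
import pandas as pd

# free-text columns, per data dictionary (same list as above)
FREE_TEXT_COLUMNS = [
    'hwrep_dx_oth',
    'hwrep_comment_final',
    'hwrep_tx_oth',
    's_oth_sympt_entered',
]

def drop_free_text(df, columns=FREE_TEXT_COLUMNS):
    """Return a copy of df without the free-text columns.

    Columns that are absent from df are ignored instead of raising.
    """
    return df.drop(columns=list(columns), errors='ignore')

# tiny made-up frame, not taken from the dataset
demo = pd.DataFrame({
    'age_months': [14, 30],
    'hwrep_comment_final': ['MRDT was not done', None],
    'hwrep_tx_oth': [None, 'paracetamol'],
})
sanitized = drop_free_text(demo)
print(list(sanitized.columns))
```

`DataFrame.drop` returns a new frame, so the original data stays untouched until the sanitized copy is written out.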