<a href="https://colab.research.google.com/github/Tavleen1203/LogAnonymization_Challenge/blob/main/CodingChallenge_LFX_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CODING CHALLENGE: ANONYMIZATION USING NLP**

**Approaches Used**

1. For this challenge I used two approaches. First, to build intuition for the problem, I took a simple non-NLP approach: I identified the columns that contain PII and overwrote their values with X symbols.

2. Next, with a sense of the implementation, I moved to the NLP code, where the technique I used was text masking. The mask_text() function holds a sensitive_info object that maps each tag to a list of known sensitive strings. While iterating over the data, mask_text() is called on every text value, and each occurrence of a listed string is replaced by the tag defined for it in sensitive_info.

Dataset Used: https://github.com/logpai/loghub/blob/master/OpenStack/OpenStack_2k.log_structured.csv

**1. ANONYMIZATION INTUITION**

In [14]:
import pandas as pd

# Load the dataset
data = pd.read_csv("data_log.csv", delimiter=",")

# Anonymize the PII fields by overwriting every value with a placeholder
data['User'] = 'XXXXX'
data['PID'] = 'XXXXX'
data['Address'] = 'XXXXX'


data.to_csv("anonymized_data_log.csv", index=False)




**2. TEXT MASKING**

In [16]:
import pandas as pd

def mask_text(text):
    # Leave NaN and other non-string cells untouched
    if not isinstance(text, str):
        return text

    # Known sensitive strings, grouped by the tag that replaces them
    sensitive_info = {
        'USER': ['calvisitor', 'authorMacBook-Pro'],
        'IP_ADDRESS': ['10.105.160.95', '10.105.162.105']
        # Add more sensitive information patterns as needed
    }

    # Replace every occurrence of each known string with its tag
    for token, patterns in sensitive_info.items():
        for pattern in patterns:
            text = text.replace(pattern, f'<{token}>')

    return text


data = pd.read_csv("data_log.csv", delimiter="\t")

# Masking
for column in data.select_dtypes(include='object'):
    data[column] = data[column].apply(mask_text)


data.to_csv("masked_data_log.csv", index=False)
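
The literal strings in sensitive_info only catch values that are known in advance. As a minimal sketch of extending the same idea, the version below swaps the literal lookup for regular expressions. The mask_patterns name and the IPv4 pattern are assumptions for illustration and are not part of the code above.

```python
import re

# Hypothetical tag -> regex table; the IPv4 pattern is a loose
# illustration and does not check that each octet is at most 255.
SENSITIVE_PATTERNS = {
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def mask_patterns(text):
    """Replace every regex match with its tag, e.g. <IP_ADDRESS>."""
    if not isinstance(text, str):
        return text  # leave NaN and other non-string cells untouched
    for tag, pattern in SENSITIVE_PATTERNS.items():
        text = pattern.sub(f"<{tag}>", text)
    return text

print(mask_patterns("Accepted connection from 10.105.160.95 port 22"))
```

A function like this could be passed to the same apply() call used for mask_text(), so addresses that were never listed by hand are still masked.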


**3. BUILDING AN ENSEMBLE MODEL TO TRY ANONYMIZATION**

In [22]:

import re
import spacy
import random


nlp = spacy.load("en_core_web_sm")

# sample text
text = "My name is Tavleen, I live in Delhi and my number is 9811264475."


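# Matches standalone runs of digits such as a phone number
number_regex = r'\b\d+\b'

def anonymize_numbers(text):
    # Replace every standalone number with a fixed placeholder
    return re.sub(number_regex, 'XXXXX', text)


def scramble_number(match):
    # Shuffle the digits of a matched number instead of masking it
    number = match.group(0)
    digits = list(number)
    random.shuffle(digits)
    return ''.join(digits)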

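def anonymize_entities(text):
    doc = nlp(text)
    anonymized_text = []
    for token in doc:
        # ent_type_ is the named-entity label of the token ("" if none)
        if token.ent_type_ == "PERSON":
            anonymized_text.append("XXXXX")
        # en_core_web_sm labels cities and countries as "GPE", not "LOC",
        # so "Delhi" is not matched by this branch
        elif token.ent_type_ == "LOC":
            anonymized_text.append("XXXXX")
        # "NUM" is a part-of-speech tag (token.pos_), not an entity label,
        # so this branch never runs; numbers are already masked by
        # anonymize_numbers() before this function is called
        elif token.ent_type_ == "NUM":
            anonymized_text.append(re.sub(number_regex, scramble_number, token.text))
        else:
            anonymized_text.append(token.text)
    # Joining tokens with spaces also puts a space before punctuation
    return " ".join(anonymized_text)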


def anonymize(text):
    text = anonymize_numbers(text)
    text = anonymize_entities(text)
    return text


anonymized_text = anonymize(text)
print(anonymized_text)


My name is XXXXX , I live in Delhi and my number is XXXXX .
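
The output above leaves "Delhi" in place and puts a space before each punctuation mark. One possible refinement is to replace entities by character offset instead of rebuilding the text from tokens, and to include the GPE label, which spaCy's English models use for cities and countries. A minimal sketch follows: mask_spans and MASKED_LABELS are hypothetical names, and the spans are written by hand here. With spaCy they could be built from doc.ents using ent.start_char, ent.end_char and ent.label_.

```python
import re

# Assumption: the set of entity labels to mask; adjust as needed.
MASKED_LABELS = {"PERSON", "GPE", "LOC"}
NUMBER_REGEX = re.compile(r"\b\d+\b")

def mask_spans(text, spans, placeholder="XXXXX"):
    """Mask labelled (start, end, label) spans, then remaining numbers."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in MASKED_LABELS or start < cursor:
            continue  # skip unmasked labels and overlapping spans
        pieces.append(text[cursor:start])  # keep the text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last span
    return NUMBER_REGEX.sub(placeholder, "".join(pieces))

text = "My name is Tavleen, I live in Delhi and my number is 9811264475."

# Hand-written spans standing in for NER output.
name_start = text.index("Tavleen")
city_start = text.index("Delhi")
spans = [
    (name_start, name_start + len("Tavleen"), "PERSON"),
    (city_start, city_start + len("Delhi"), "GPE"),
]

masked = mask_spans(text, spans)
print(masked)
```

Because the original string is sliced rather than re-joined, the spacing and punctuation of the input are preserved.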


# **QUESTIONS FROM THE CHALLENGE**

**Q: Is it possible to anonymize the dataset?**

A: Yes, a dataset can be anonymized by hiding its sensitive information. One observation I made while coding this: sensitive information should not reach anybody, so our aim should be to hide this data with no human intervention. This can be achieved with predictive analysis and Natural Language Processing.



**Q: Does it ‘successfully’ anonymize?**

A: For this process to be considered fully successful, the model needs to be trained on a very large set of data. Anonymization applies to numerous fields, such as banking, healthcare, the internet, online shopping and reviews. To be successful, the model must be trained on enough data to identify what could possibly be PII and then hide it (using hashing, regex, etc.).
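
One way to read the hashing option mentioned above is keyed pseudonymization: the same input always maps to the same token, so rows stay linkable without exposing the value. Below is a minimal sketch using Python's standard hmac and hashlib modules. The pseudonymize name, the key and the 12-character truncation are illustrative assumptions.

```python
import hashlib
import hmac

# Assumption: in practice the key would be kept secret and stored
# separately from the data; this value is only a placeholder.
SECRET_KEY = b"replace-with-a-secret-key"

def pseudonymize(value, length=12):
    """Return a stable, keyed token for a sensitive string."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

token_a = pseudonymize("calvisitor")
token_b = pseudonymize("calvisitor")
token_c = pseudonymize("10.105.160.95")

print(token_a == token_b)  # compares two tokens for the same input
print(token_a == token_c)  # compares tokens for two different inputs
```

Unlike overwriting with XXXXX, this keeps the ability to count or group by user, at the cost of having to protect the key.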

**Q: How easy is it to use NLP?**

A: NLP applications for text anonymization range from simple masking to the use of advanced libraries. Weighing effectiveness against ease, NLP is a good approach. Still, these are the potential issues we may encounter while working with it:

1. Slow performance while dealing with large datasets.
2. Steep learning curve due to the wide range of domain-specific applications.
3. Validation: we will need an automated method to validate how the model performs on test data.
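
The validation step in point 3 could be automated, for example, by checking the anonymized output against a hand-labelled list of PII strings. The sketch below is minimal, and the find_leaks helper and the sample rows are hypothetical and only illustrate the idea.

```python
def find_leaks(anonymized_rows, known_pii):
    """Return (row_index, pii) pairs where a known PII string survived."""
    leaks = []
    for index, row in enumerate(anonymized_rows):
        for pii in known_pii:
            if pii in row:
                leaks.append((index, pii))
    return leaks

# Hypothetical test data: the second row still contains an IP address.
rows = [
    "<USER> logged in from <IP_ADDRESS>",
    "<USER> logged in from 10.105.162.105",
]
known_pii = ["calvisitor", "10.105.160.95", "10.105.162.105"]

leaks = find_leaks(rows, known_pii)
# Count the rows in which no known PII string was found
leak_free = len(rows) - len({index for index, _ in leaks})
print(leaks)
print(f"{leak_free}/{len(rows)} rows are free of known PII")
```

A check like this only measures leaks of strings that are already known, so it complements manual review and does not replace it.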



**Q: Does it make sense to use NLP?**

A: Yes, NLP is a sound approach to text anonymization. Since sensitive data should not reach anybody, a Machine Learning approach removes the need for anyone to interact with a user's information, which can make this a highly secure method. Once the model is trained, it can handle the task autonomously.

**Q: Are the available libraries good enough?**

A: Some available libraries, such as spaCy, NLTK and regex, can be extended to the task of anonymization. Yet the available libraries do not fully align with this task.

Here are the problems I faced while working on this challenge:

1. Accuracy issues: I initially started with more complex NLP solutions such as spaCy and TextBlob, but I observed that they could not handle variations in the data, which required a lot of manual review.

2. Long processing chains: anonymization needs an ensemble of steps. For example, in a dataset that contains phone numbers and names, we need regex for the numbers, Named Entity Recognition (using spaCy, TextBlob, etc.) for the names, and then validation strategies to check the correctness of the output. Only then can we say that anonymization is done.