In [37]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/web-server-access-logs/access.log
/kaggle/input/web-server-access-logs/client_hostname.csv


# **Data pre-processing**

The log file comprises 3.3GB of web server logs extracted from zanbil.ir, an Iranian ecommerce platform, offering a comprehensive view of user interactions, crawler activities, and business trends. This log file, compiled by Zaker and Farzin in 2019, is available via Harvard Dataverse for research and analytical purposes.

#### **Loading the log file into a dataframe**

I extracted relevant information such as client IP, user ID, timestamp, HTTP method, request, status code, size, referer, and user agent from each log line. 

In [38]:
import pandas as pd
import re

# Define the log file path
log_file_path = '/kaggle/input/web-server-access-logs/access.log'

# Define the regex pattern to extract information from log lines
regex_pattern = r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[\w:/]+\s[+\-]\d{4})\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'

# Define the column names
columns = ['client', 'userid', 'datetime', 'method', 'request', 'status', 'size', 'referer', 'user_agent']

# Read the first 10000 rows of the log file into a list of dictionaries using regex pattern matching
log_data = []
with open(log_file_path, 'r') as file:
    for i, line in enumerate(file):
        if i >= 10000:
            break
        match = re.match(regex_pattern, line)
        if match:
            log_data.append({
                'client': match.group('client'),
                'userid': match.group('userid'),
                'datetime': match.group('datetime'),
                'method': match.group('method'),
                'request': match.group('request'),
                'status': match.group('status'),
                'size': match.group('size'),
                'referer': match.group('referer'),
                'user_agent': match.group('user_agent')
            })
        else:
            print("Error: Line does not match regex pattern:", line)

# Create DataFrame from the list of dictionaries
logs_df = pd.DataFrame(log_data, columns=columns)

In [39]:
# Diplaying the first 5 rows of the dataframe
logs_df.head()

Unnamed: 0,client,userid,datetime,method,request,status,size,referer,user_agent
0,54.36.149.41,-,22/Jan/2019:03:56:14 +0330,GET,/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C...,200,30577,-,Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:...
1,31.56.96.51,-,22/Jan/2019:03:56:16 +0330,GET,/image/60844/productModel/200x200,200,5667,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
2,31.56.96.51,-,22/Jan/2019:03:56:16 +0330,GET,/image/61474/productModel/200x200,200,5379,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
3,40.77.167.129,-,22/Jan/2019:03:56:17 +0330,GET,/image/14925/productModel/100x100,200,1696,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
4,91.99.72.15,-,22/Jan/2019:03:56:17 +0330,GET,/product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%...,200,41483,-,Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16...


#### **Understanding and processing the dataset**

In [40]:
# Checking the overview of the dataframe
logs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   client      10000 non-null  object
 1   userid      10000 non-null  object
 2   datetime    10000 non-null  object
 3   method      10000 non-null  object
 4   request     10000 non-null  object
 5   status      10000 non-null  object
 6   size        10000 non-null  object
 7   referer     10000 non-null  object
 8   user_agent  10000 non-null  object
dtypes: object(9)
memory usage: 703.2+ KB


In [41]:
from datetime import datetime
import pytz

In [42]:
# Function to parse the datetime (from the class session practice exercise)
def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('13/Nov/2015:11:45:42 +0000')`
        `datetime.datetime(2015, 11, 3, 11, 45, 4, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
    try:
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    except ValueError:
        return '-'


In [43]:
logs_df['status'] = logs_df['status'].astype(int)
logs_df['size'] = logs_df['size'].astype(int)
logs_df['datetime'] = logs_df['datetime'].apply(parse_datetime)

In [44]:
logs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype                               
---  ------      --------------  -----                               
 0   client      10000 non-null  object                              
 1   userid      10000 non-null  object                              
 2   datetime    10000 non-null  datetime64[ns, pytz.FixedOffset(33)]
 3   method      10000 non-null  object                              
 4   request     10000 non-null  object                              
 5   status      10000 non-null  int64                               
 6   size        10000 non-null  int64                               
 7   referer     10000 non-null  object                              
 8   user_agent  10000 non-null  object                              
dtypes: datetime64[ns, pytz.FixedOffset(33)](1), int64(2), object(6)
memory usage: 703.2+ KB


#### **Dropping the userid column**

This because it has one unique value which is just a hyphen

In [45]:
users = logs_df['userid'].unique()
print(users)

['-']


In [46]:
logs_df.drop(columns=['userid'], inplace=True)

#### **Dropping duplicates**



There were duplicates which are not adding value to the analysis.

In [47]:
# Count duplicates in the dataframe
duplicate_count = logs_df.duplicated().sum()

# Display the count of duplicates
print("Number of duplicates:", duplicate_count)

Number of duplicates: 49


In [48]:
# Drop the duplicates
logs_df = logs_df.drop_duplicates()

**The sample of processed data**

In [49]:
# Diplaying the first 5 rows of the dataframe
logs_df.head()

Unnamed: 0,client,datetime,method,request,status,size,referer,user_agent
0,54.36.149.41,2019-01-02 03:56:01+00:33,GET,/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C...,200,30577,-,Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:...
1,31.56.96.51,2019-01-02 03:56:01+00:33,GET,/image/60844/productModel/200x200,200,5667,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
2,31.56.96.51,2019-01-02 03:56:01+00:33,GET,/image/61474/productModel/200x200,200,5379,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
3,40.77.167.129,2019-01-02 03:56:01+00:33,GET,/image/14925/productModel/100x100,200,1696,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
4,91.99.72.15,2019-01-02 03:56:01+00:33,GET,/product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%...,200,41483,-,Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16...


# **Approach 1: USING Anonymization function**

In [50]:
from hashlib import sha256

# Function to anonymize IP addresses using SHA-256
def anonymize_ip(ip_address: str) -> str:
    return sha256(ip_address.encode()).hexdigest()

In [51]:
# Apply the anonymization function to the 'client' column
logs_df['client_anonymized'] = logs_df['client'].apply(anonymize_ip)

# Show the first few rows to verify anonymization
logs_df[['client', 'client_anonymized']].head()

Unnamed: 0,client,client_anonymized
0,54.36.149.41,b8cc5b23036ee3915151af26c2db92bcf7adde76b0dfca...
1,31.56.96.51,2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9...
2,31.56.96.51,2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9...
3,40.77.167.129,5e03b85dce17db2702e0ae381f4bf6b2b1318a04b28378...
4,91.99.72.15,b1c5a711088a94472abe21647784256fc55c493059653a...


In [52]:
logs_df.head()

Unnamed: 0,client,datetime,method,request,status,size,referer,user_agent,client_anonymized
0,54.36.149.41,2019-01-02 03:56:01+00:33,GET,/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C...,200,30577,-,Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:...,b8cc5b23036ee3915151af26c2db92bcf7adde76b0dfca...
1,31.56.96.51,2019-01-02 03:56:01+00:33,GET,/image/60844/productModel/200x200,200,5667,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...,2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9...
2,31.56.96.51,2019-01-02 03:56:01+00:33,GET,/image/61474/productModel/200x200,200,5379,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...,2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9...
3,40.77.167.129,2019-01-02 03:56:01+00:33,GET,/image/14925/productModel/100x100,200,1696,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...,5e03b85dce17db2702e0ae381f4bf6b2b1318a04b28378...
4,91.99.72.15,2019-01-02 03:56:01+00:33,GET,/product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%...,200,41483,-,Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16...,b1c5a711088a94472abe21647784256fc55c493059653a...


# **Approach 2: USING NER (Named Entity Recognition)**

To perform anonymization using Named Entity Recognition (NER), we will focus on identifying and anonymizing potential PII in the provided log entries. For this task, we can use the spaCy library, which includes a powerful NER model capable of identifying various entities in text. The goal is to anonymize IP addresses and any other identifiable information that NER might detect as an entity related to a person, location, organization, etc.

In [53]:
# Install spaCy and download the English model for NER
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample log entries for demonstration
log_entries = [
    "54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] \"GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1\" 200 30577 \"-\" \"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)\" \"-\"",
    "31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] \"GET /image/60844/productModel/200x200 HTTP/1.1\" 200 5667 \"https://www.zanbil.ir/m/filter/b113\" \"Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36\" \"-\"",
]

# Function to anonymize IP addresses in the log entry
def anonymize_ip(log_entry):
    return re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', lambda match: anonymize_text(match.group(0)), log_entry)

# Anonymize IPs in the sample logs
anonymized_logs = [anonymize_ip(entry) for entry in log_entries]

# Function to identify and anonymize potential PII using NER in parts like the User-Agent string
def anonymize_with_ner(log_entry):
    # Splitting the log entry to extract the User-Agent string (or any other potentially identifiable information)
    parts = log_entry.split('"')
    if len(parts) > 5:
        user_agent = parts[5]  # Assuming the User-Agent is always in this position
        doc = nlp(user_agent)
        anonymized_text = user_agent
        for ent in doc.ents:
            anonymized_text = anonymized_text.replace(ent.text, "[ANONYMIZED]")
        parts[5] = anonymized_text
    return '"'.join(parts)

# Apply NER-based anonymization to the already IP-anonymized logs
ner_anonymized_logs = [anonymize_with_ner(entry) for entry in anonymized_logs]

ner_anonymized_logs

  pid, fd = os.forkpty()


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m52.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


['b8cc5b23036ee3915151af26c2db92bcf7adde76b0dfca161256b630ac17ab0e - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"',
 '2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9c4316a3d3e248e05c7 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 ([ANONYMIZED]; [ANONYMIZED] [ANONYMIZED]; [ANONYMIZED]) AppleWebKit/537.36 ([ANONYMIZED], like [ANONYMIZED]) Chrome/6[ANONYMIZED].3359.158 [ANONYMIZED] Safari/537.36" "-"']

In [54]:
ner_anonymized_logs

['b8cc5b23036ee3915151af26c2db92bcf7adde76b0dfca161256b630ac17ab0e - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"',
 '2effc77aa35f95bcad29d64684c56f0e864ca09e8b52c9c4316a3d3e248e05c7 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 ([ANONYMIZED]; [ANONYMIZED] [ANONYMIZED]; [ANONYMIZED]) AppleWebKit/537.36 ([ANONYMIZED], like [ANONYMIZED]) Chrome/6[ANONYMIZED].3359.158 [ANONYMIZED] Safari/537.36" "-"']

#  **Approach 3: Using Prompt Engineering**

In [55]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


In [56]:
OPENAI_API_KEY = "XXX"

In [57]:
import os
from openai import OpenAI
# Sample log entry to anonymize
log_entry = """
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
"""

prompt = f"""
Given the web server access log entry below, identify and replace any personally identifiable information (PII) with generic placeholders while retaining the overall structure and content of the log entry for analysis purposes.

Log entry:
{log_entry}

Anonymize the log entry:
"""

client = OpenAI(
    # This is the default and can be omitted
    api_key="XXX",
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)

print(chat_completion.choices[0].message.content)


XX.XX.XX.XX - - [22/Jan/2019:03:56:14 +0330] "GET /filter/XX|XX%20XXXXXXX,XX|XXXXX%20XXXX%20XX%20XXXXXX,pXX HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"


# **Approach 4: Using NER + Transformer Model**

To anonymize text using a Named Entity Recognition (NER) and transformer-based approach, we can leverage the transformers library from Hugging Face, which provides access to a wide range of pre-trained NER models. For this task, we'll use a model that has been fine-tuned on a NER task, such as dbmdz/bert-large-cased-finetuned-conll03-english or any other model that is suitable for NER tasks available from Hugging Face's model repository.

In [58]:
from transformers import pipeline
import re

# Load the NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", tokenizer="dbmdz/bert-large-cased-finetuned-conll03-english")

# Sample log entries
log_entries = [
    "54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] \"GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1\" 200 30577 \"-\" \"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)\" \"-\"",
    # Add other log entries here
]

def anonymize_ip(log_entry):
    # Anonymize IP addresses
    return re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', '[ANONYMIZED_IP]', log_entry)

def anonymize_with_ner(log_entry, ner_pipeline):
    # Detect entities
    entities = ner_pipeline(log_entry)
    anonymized_entry = log_entry
    for entity in entities:
        # Replace entities with a placeholder
        entity_text = log_entry[entity['start']:entity['end']]
        anonymized_entry = anonymized_entry.replace(entity_text, '[ANONYMIZED]')
    return anonymized_entry

# Anonymize log entries
anonymized_logs = []
for entry in log_entries:
    anonymized_ip_entry = anonymize_ip(entry)
    anonymized_entry = anonymize_with_ner(anonymized_ip_entry, ner_pipeline)
    anonymized_logs.append(anonymized_entry)

# Print or return the anonymized logs
print(anonymized_logs)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['[ANONYMIZED_IP] - - [22/Jan/2[ANONYMIZED]19:[ANONYMIZED]3:[ANONYMIZED]6:14 +[ANONYMIZED]33[ANONYMIZED]] "GET /filter/27|13%2[ANONYMIZED]%D9%8[ANONYMIZED]%DA%AF%D8%A7%D9%[ANONYMIZED]E%D[ANONYMIZED]%8C%DA%A9%D8%[ANONYMIZED]3%D9%84,27|%DA%A9%D9%8[ANONYMIZED]%D8%AA%D8%[ANONYMIZED]1%2[ANONYMIZED]%D8%A7%D8%[ANONYMIZED]2%2[ANONYMIZED][ANONYMIZED]%2[ANONYMIZED]%D9%8[ANONYMIZED]%DA%AF%D8%A7%D9%[ANONYMIZED]E%D[ANONYMIZED]%8C%DA%A9%D8%[ANONYMIZED]3%D9%84,p[ANONYMIZED]3 HTTP/1.1" 2[ANONYMIZED][ANONYMIZED] 3[ANONYMIZED][ANONYMIZED]77 "-" "[ANONYMIZED][ANONYMIZED]/[ANONYMIZED].[ANONYMIZED] (compatible; [ANONYMIZED][ANONYMIZED][ANONYMIZED][ANONYMIZED]ot/6.1; +http://ah[ANONYMIZED][ANONYMIZED].com/robot/)" "-"']


# **Approach 5 : Using RagaLLM Hub**

In [59]:
pip install -q -U raga-llm-hub

Note: you may need to restart the kernel to use updated packages.


In [60]:
from raga_llm_hub import RagaLLMEval

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [61]:
evaluator = RagaLLMEval(
    api_keys={"OPENAI_API_KEY": "XXX"}
)

🌟 Welcome to RagaLLMHub! 🌟
The most comprehensive LLM (Large Language Models) testing library at your service.

Launching Evaluation: 'fde6fbd3-e169-4d17-bea3-89cd49468412'
Keep this identifier handy for tracking your progress!



### Anonymize

In [62]:
evaluator.add_test(
    test_names=["anonymize_guardrail"],
    data={
        "prompt": """54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] \"GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1\" 200 30577 \"-\" \"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)\" \"-\
""",
    },
).run()

evaluator.print_results()

🚀 Starting execution of 1 tests...

🔍 Test 1 of 1: anonymize_guardrail starts...


tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


✅ Test completed: anonymize_guardrail.
✨ All tests completed. Total tests executed: 1.
Test run details saved/updated under eval name 'fde6fbd3-e169-4d17-bea3-89cd49468412' with timestamp 2024-04-05T19:32:08.363404.

📊 Results:

Test Name: anonymize_guardrail

+---------------------+---------------------------+---------------------------+-------+--------+---------------------------+
|      Test Name      |           Prompt          |         Parameters        | Score | Result |      Sanitized Prompt     |
+---------------------+---------------------------+---------------------------+-------+--------+---------------------------+
| anonymize_guardrail |      54.36.149.41 - -     |     use_faker: False,     |  0.60 |   ❌   | [REDACTED_IP_ADDRESS_1] - |
|                     |   [22/Jan/2019:03:56:14   |  threshold: 0, use_onnx:  |       |        |  - [22/Jan/2019:03:56:14  |
|                     | +0330] "GET /filter/27|13 |           False           |       |        | +0330] "GET /filte

### Deanonymize

In [63]:
evaluator.add_test(
    test_names=["deanonymize_guardrail"],
    data={
        "prompt": """[REDACTED_IP_ADDRESS_1] - - [22/Jan/2019:03:56:14 +0330] \"GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1\" 200 30577 \"-\" \"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)\" \"-\
""",
    },
).run()

evaluator.print_results()

🚀 Starting execution of 1 tests...

🔍 Test 1 of 1: deanonymize_guardrail starts...
✅ Test completed: deanonymize_guardrail.
✨ All tests completed. Total tests executed: 1.
Test run details saved/updated under eval name 'fde6fbd3-e169-4d17-bea3-89cd49468412' with timestamp 2024-04-05T19:32:30.895326.

📊 Results:

Test Name: deanonymize_guardrail

+-----------------------+---------------------------+--------------------------+-------+--------+---------------------------+
|       Test Name       |           Prompt          |        Parameters        | Score | Result |      Sanitized Prompt     |
+-----------------------+---------------------------+--------------------------+-------+--------+---------------------------+
| deanonymize_guardrail | [REDACTED_IP_ADDRESS_1] - | matching_strategy: exact |  0.00 |   ✅   |      54.36.149.41 - -     |
|                       |  - [22/Jan/2019:03:56:14  |                          |       |        |   [22/Jan/2019:03:56:14   |
|                      

# Answers to the questions

## **Is it possible to anonymize the dataset?**


Yes, it is possible to anonymize the dataset using the discussed methods. Each approach provides a mechanism to identify and replace or remove personally identifiable information (PII) or sensitive data from text-based datasets, like web server logs or CSV files containing potentially identifiable information.

All the approaches discussed can be used to anonymize datasets, but their effectiveness varies based on the data's complexity and the types of PII present.

1. **Manual Anonymization and Regex-based Approaches**: Highly effective for structured and predictable data formats (like IP addresses, dates, etc.). Their precision is high for known patterns but requires manual updates for new data types or formats.
2. **NER and NER with Transformers**: These methods extend the capabilities to unstructured or semi-structured text, detecting a broader range of entity types based on context. They can recognize names, organizations, locations, and more, which are not easily captured by regex patterns.
3. **Prompt Engineering with LLMs**: Given a well-crafted prompt, LLMs can potentially anonymize a wide range of PII types, even in complex sentences. The model's understanding of context and ability to generate human-like text make it versatile. However, effectiveness heavily depends on the prompt's design and the model's training data.
4. **RagaLLMHub** : Assuming this approach leverages a tailored LLM for anonymization, it could offer a highly flexible and potentially effective solution, assuming the model is specifically optimized for identifying and anonymizing PII across diverse datasets.

## **Does it ‘successfully’ anonymize?**

The success of anonymization is measured by the thoroughness of PII removal and the preservation of data utility for further analysis or processing.

1. **Manual and Regex Approaches**: While successful for known patterns, they might fail to identify less obvious or new forms of PII, potentially leaving data partially anonymized.
2. **NER Approaches**: Success varies by the model's ability to recognize different entities. While they can anonymize more types of PII than regex approaches, they might miss entities that are model-specific or not well-represented in the training data.
3. **LLM-based Approaches**: Potentially the most comprehensive, as these models can understand and generate text based on complex instructions. However, their success is not guaranteed, as they may overlook subtle PII or misunderstand the anonymization task in certain contexts.
4. **RagaLLMHub**: The success would depend on the model's specific training for anonymization tasks. If well-trained, it could theoretically offer a high success rate across various PII types.

 ## **How easy is it to use NLP?**

The ease of use varies significantly across the different approaches, influenced by the user's technical expertise and the specific requirements of the anonymization task.

1. **Manual and Regex**: Easy to implement for users with basic programming skills but requires manual updates and can become complex for sophisticated patterns.
2. **NER (spaCy), NER + Transformers**: Moderate difficulty. Requires some understanding of NLP concepts and familiarity with the libraries. Setting up and customizing the models for specific needs can be challenging for beginners.
3. **LLMs and Prompt Engineering**: Varies from moderate to difficult. Crafting effective prompts is an art that requires understanding both the task and how the model interprets instructions. The technical setup is straightforward with APIs, but optimizing prompts for best performance may require trial and error.
4. **RagaLLMHub**: The ease of use would depend on how the solution is packaged and provided to the end-user. If it offers a simple API or interface for anonymization tasks, it could be relatively easy for users without deep NLP knowledge.

## **Does it make sense to use NLP?**


Yes, especially for complex datasets where PII can be embedded in natural language text or when PII is not strictly structured or predictable. NLP allows for the contextual identification of PII, which is particularly valuable in free-form text or when dealing with entities that simple pattern matching might not accurately identify.

## **Are the available libraries good enough?**


Generally, Yes. Libraries and models for NLP and anonymization tasks are continually improving, with active development communities and support. While no tool may be perfect—given the wide variety of PII and the nuances of natural language—the current ecosystem provides a strong foundation for most anonymization needs. Tools like spaCy for NER, transformers for working with state-of-the-art language models, and even specialized anonymization frameworks like Microsoft's Presidio offer robust solutions. The effectiveness and ease of use can vary, and there may be a learning curve for more advanced models or custom anonymization tasks

Comparing Approaches:
* **Manual/Regex-based Approaches**: Most straightforward, best for structured PII like IP addresses.
* **NER (spaCy), NER + Transformers**: Good for unstructured text, require some NLP knowledge.
* **Prompt Engineering with LLMs**: Flexible, potentially powerful, but requires careful prompt design and may incur costs for API use.
* **RagaLLMHub**: Success would depend on the model's training and the effectiveness of its implementation for anonymization tasks.