### Description

This notebook shows the experiments for the ideal MASK and BLANK filters.

- First, we load the data.
- Then we generate the camouflaged text, saving the metadata (the positions of the camouflaged tokens).
- Next, we apply the filters, replacing the camouflaged tokens with the MASK or BLANK token, and save the results to .spacy files.
- Finally, we evaluate the model on the filtered test data.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


# import before spacy to avoid conflicts with PyTorch GPU setup
from thinc.api import set_gpu_allocator, require_gpu 
set_gpu_allocator("pytorch")
require_gpu(0)

import spacy
spacy.require_gpu()

import sys
sys.path.insert(0, '../')

from pyleetspeak.pyleetspeak import modes, WordCamouflage_Augmenter, NER_data_generator
from transformers import AutoTokenizer

import warnings
warnings.filterwarnings("ignore")

from pathlib import Path
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()
import re
import emoji

home = str(Path.home())

# Create logger (sys is already imported above)
import logging

logger = logging.getLogger(__name__)

# Create handlers
c_handler = logging.StreamHandler(sys.stdout)
c_handler.setLevel(logging.DEBUG)


# Create custom formatter that sets the color of the log message based on its level
class ColoredFormatter(logging.Formatter):
    def format(self, record):
        if record.levelno == logging.DEBUG:
            color = "\x1b[34m"  # blue
        elif record.levelno == logging.INFO:
            color = "\x1b[32m"  # green
        elif record.levelno == logging.WARNING:
            color = "\x1b[33m"  # yellow
        else:
            color = "\x1b[31m"  # red
        message = super().format(record)
        message = color + message + "\x1b[0m"  # reset color
        return message


# Create the formatter (it includes the line number where the log call was made) and attach it to the handler
c_format = ColoredFormatter("%(asctime)s - %(name)s - %(levelname)s - %(lineno)d - %(message)s ")
c_handler.setFormatter(c_format)

# Set the logger level and attach the handler to the logger created above
logger.setLevel(logging.DEBUG)
logger.addHandler(c_handler)

import transformers
transformers.__version__ # '4.36.2'

'4.36.2'

In [3]:
!nvidia-smi

Mon Apr 22 08:41:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Quadro RTX 8000                Off | 00000000:37:00.0 Off |                  Off |
| 60%   78C    P2             259W / 260W |  13213MiB / 49152MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                Off | 00000000:86:00.0 Off |  

## Load OffenSemEval data

In [26]:
# offen_10_leet_data_path = f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/25_per_advanced_20/test.spacy"

import spacy
from spacy.tokens import DocBin


offen_10_leet_data_path = f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Ori_Data/dev.spacy"
nlp = spacy.blank("en")

# Load the data
doc_bin = DocBin().from_disk(offen_10_leet_data_path)
docs = list(doc_bin.get_docs(nlp.vocab))

# Convert to a DataFrame
# doc.cats looks like {'OFF': False, 'NOT': True}; keep the label whose value is True
data = [{
    'text': doc.text,
    'test_label': next((label for label, is_true in doc.cats.items() if is_true), None)
} for doc in docs]
offen_10_leet_df = pd.DataFrame(data)
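
The label extraction in the cell above relies on `doc.cats` being a dictionary that maps each category to a boolean. A minimal sketch of the same `next(...)` idiom on plain dictionaries follows; the helper name `label_from_cats` is hypothetical and not part of the notebook.

```python
def label_from_cats(cats):
    """Return the first category whose value is truthy, or None if there is none."""
    return next((label for label, is_true in cats.items() if is_true), None)


# A document labelled NOT, and a document with no positive category
print(label_from_cats({"OFF": False, "NOT": True}))
print(label_from_cats({"OFF": False, "NOT": False}))
```

The `None` default keeps the comprehension from raising `StopIteration` when a document has no positive category.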

In [27]:
offen_10_leet_df

Unnamed: 0,text,test_label
0,@USER @USER No clue where you get those numbers. We are the only country in the world with mass shootings and staggering death tolls from gun violence. How anyone is against gun control defies logic.,NOT
1,@USER @USER @USER @USER yes he is,NOT
2,@USER You are a beautiful model &amp; HWs were jealous of that. I agree that Kendall should stay how she is &amp; not have all that phony plastic surgery like her sisters. I don't event recognize Khloe anymore. Kylie needs to stop w/fillers. She's pretty on her own &amp; not overdue it,NOT
3,@USER @USER When he's not imparting these gems Michael Moore is stuffing his face.,OFF
4,// Rean's Arcane Gale is broken. If he is gonna be able to use that from scratch in Sen IV ( because of his demon form) it's gonna be cool using it in every battle!,OFF
...,...,...
1316,@USER @USER What really pisses me off about Asians in California voting for more gun control is that many should know better especially Koreans in LA. They all knew someone who had their businesses destroyed during the riots. At least this salty boi hasn’t forgotten: URL,NOT
1317,@USER @USER @USER He. Is. A. Sociopath.,OFF
1318,@USER He drew the saw because the marker was in his left hand which is connected to his right brain which is connected to his left eye which saw the saw. His right eye which is connected to his left brain saw the hammer. #PSYC1101,OFF
1319,@USER Does penn state produce criminals too?,NOT


## Generate Camouflage test data saving the metadata

In [9]:
import pandas as pd
from copy import deepcopy, copy
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
tqdm.pandas()
from sklearn.model_selection import train_test_split
import numpy as np
from codetiming import Timer
import re
import emoji

In [10]:
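def make_docs(data_tuples, nlp, labels):
    """Turn (text, label) tuples into spaCy Docs with text-classification categories.

    Args:
        data_tuples: sequence of (text, label) pairs.
        nlp: spaCy pipeline used to process the texts.
        labels: iterable with every label name (duplicates are ignored).

    Returns:
        list: Docs whose `cats` maps each label to True or False.
    """
    docs = []
    # Deduplicate the labels while keeping their first-seen order
    unique_labels = list(dict.fromkeys(labels))
    # With as_tuples=True, nlp.pipe yields (Doc, context) pairs, so the text is already processed
    for doc, label in tqdm(nlp.pipe(data_tuples, as_tuples=True), total=len(data_tuples), desc="Making docs"):
        # Every label must be set on every doc
        for l in unique_labels:
            doc.cats[l] = label == l

        docs.append(doc)

    return docs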

def leet_data(text, generator):
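    """Camouflage `text` with a data generator and return only the camouflaged text.

    Args:
        text (str): text to camouflage.
        generator: object whose `generate_data(sentence=...)` returns (NER_data, ori_data).

    Returns:
        str: the camouflaged text taken from the first entry of NER_data.
    """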
    NER_data, ori_data = generator.generate_data(
            sentence=text
            # important_kws = [r"\bpfizer\b", r"control\b", r"vacuna\b", r"vaccines\b"],
        )

    leet_text, _ =  NER_data[0]
    return leet_text


# Create a function to save pandas dataframe to spacy binary file
def pd_2_spacy(df_train, df_dev, df_test, train_output_path, dev_output_path, test_output_path, labels, lang="en"):
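    """Save up to three DataFrames (train, dev, test) as spaCy binary (.spacy) files.

    Each DataFrame must hold the text in its first column and the label in its second.
    A split is skipped when its DataFrame is None.

    Args:
        df_train, df_dev, df_test: DataFrames to convert, or None.
        train_output_path, dev_output_path, test_output_path: destination .spacy paths.
        labels: iterable with every label name.
        lang (str, optional): language code for the blank spaCy pipeline. Defaults to "en".
    """
    # Blank spaCy pipeline for the requested language
    nlp = spacy.blank(lang)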

    if df_train is not None:
        # tuple of tuples. Each nested tuple is (Tweet, Label)
        train_data_tuples = tuple(df_train.iloc[:, [0,1]].itertuples(index=False, name=None))
                
        # Make spacy DocBin
        train_docs = make_docs(train_data_tuples, nlp, labels)

        # Make outpath directory with Pathlib
        Path(train_output_path).parent.mkdir(parents=True, exist_ok=True)

        # save to binary file 
        train_doc_bin = DocBin(docs=train_docs)
        train_doc_bin.to_disk(train_output_path)
        print(f"Processed Train {len(train_data_tuples)} documents: {train_output_path}")        

    if df_dev is not None:
        dev_data_tuples = tuple(df_dev.iloc[:, [0,1]].itertuples(index=False, name=None))
        dev_docs = make_docs(dev_data_tuples, nlp, labels)
        Path(dev_output_path).parent.mkdir(parents=True, exist_ok=True)
        dev_doc_bin = DocBin(docs=dev_docs)
        dev_doc_bin.to_disk(dev_output_path)
        print(f"Processed Dev {len(dev_data_tuples)} documents: {dev_output_path}")

    if df_test is not None:
        test_data_tuples = tuple(df_test.iloc[:, [0,1]].itertuples(index=False, name=None))
        test_docs = make_docs(test_data_tuples, nlp, labels)
        Path(test_output_path).parent.mkdir(parents=True, exist_ok=True)   

        test_doc_bin = DocBin(docs=test_docs)
        test_doc_bin.to_disk(test_output_path)
        print(f"Processed Test {len(test_data_tuples)} documents: {test_output_path}")   


def leet_data_augmenter(text, augmenter):
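    """Camouflage `text` and return it together with the augmenter's annotations.

    Args:
        text (str): text to camouflage.
        augmenter: object whose `transform(text)` returns (camouflaged_text, annotations).

    Returns:
        tuple: (leet_text, ori_data).
    """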
    leet_text, ori_data = augmenter.transform(
            text
        )
    return (leet_text, ori_data)


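# Function to create a DataFrame in which a given fraction of the tweets is camouflaged
def create_leet_augmenter_df(df_ori, frac, augmenter, column_to_leet):
    """Camouflage a random fraction of the rows and keep the annotations.

    Args:
        df_ori (pd.DataFrame): original data.
        frac (float): fraction of rows to camouflage (0 to 1).
        augmenter: object passed on to `leet_data_augmenter`.
        column_to_leet (str): name of the text column to camouflage.

    Returns:
        pd.DataFrame: all rows, with the added columns "Camouflaged", "temp", "leet_text", and "annotations".
    """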
    # Create a copy of the original dataframe for saving the leeted version
    df_leeted = copy(df_ori)
    df_leeted["Camouflaged"] = False
    
    # Extract fraction to leet
    df_to_leet = df_leeted.sample(frac=frac, random_state=42)

    # Camouflage the sampled tweets. The augmenter returns a (leet_text, annotations)
    # tuple per tweet, which is first stored in a temporary column.
    df_to_leet["temp"] = df_to_leet[column_to_leet].progress_apply(leet_data_augmenter,  augmenter=augmenter)

    # Split the tuple into two separate columns
    df_to_leet['leet_text'] = df_to_leet['temp'].apply(lambda x: x[0])
    df_to_leet['annotations'] = df_to_leet['temp'].apply(lambda x: x[1])

    df_to_leet["Camouflaged"] = True

    # Count NaN values in the source text column of the sampled rows
    print(f"Nan values in '{column_to_leet}' column: ", df_to_leet[column_to_leet].isna().sum())

    # Prepare the non-camouflaged rows: add the new columns, filled with missing values
    df_not_leeted = df_leeted.drop(df_to_leet.index)
    for col in ['leet_text', 'annotations']:
        df_not_leeted[col] = pd.NA

    # Concatenate back together
    df_leeted = pd.concat([df_not_leeted, df_to_leet])

    display( df_leeted.groupby("Camouflaged").count() ) 
    
    return df_leeted
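
The row counts that `create_leet_augmenter_df` displays follow directly from the sampling fraction. A small sketch of that arithmetic for the 860 tweets of the test set used later in this notebook, assuming that `DataFrame.sample(frac=...)` draws `round(frac * n_rows)` rows; the helper `split_counts` is hypothetical.

```python
def split_counts(n_rows, frac):
    """Return (camouflaged, untouched) row counts for a sampling fraction."""
    n_camouflaged = round(frac * n_rows)
    return n_camouflaged, n_rows - n_camouflaged


# Fractions used in the experiments, applied to the 860 test tweets
for frac in (0.1, 0.25, 0.5, 0.75, 1):
    camouflaged, untouched = split_counts(860, frac)
    print(f"frac={frac}: camouflaged={camouflaged}, untouched={untouched}")
```

These counts can be compared against the `groupby("Camouflaged").count()` table printed for each fraction.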



#### Functions to apply the filters using the metadata

In [None]:
# Applying the mask using annotations

import pandas as pd

def extract_masking_indices(annotations):
    """Collect the (start, end) spans of the camouflaged words from the annotations.

    Spans are returned sorted by start offset in descending order, so that
    replacing them one by one does not shift the offsets of the spans still pending.
    """
    indices = []
    for ann in annotations.get("meta", []):
        if "leet_idxs" in ann:
            indices.append((ann["leet_idxs"][0], ann["leet_idxs"][1]))
    indices.sort(reverse=True, key=lambda x: x[0])
    return indices

def apply_masks(text, indices, mask='[MASK]'):
    """Apply multiple mask replacements based on provided indices."""
    for start, end in indices:
        if start < len(text) and end <= len(text) and start < end:
            text = text[:start] + mask + text[end:]
    return text
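
To see why the spans are processed from the end of the text backwards, here is a self-contained sketch of the two helpers above applied to a toy sentence. The sentence and its offsets are invented for illustration; only the `{"meta": [{"leet_idxs": [start, end]}]}` layout is taken from the code above.

```python
def extract_masking_indices(annotations):
    """Return (start, end) spans sorted by start offset, last span first."""
    spans = [
        (ann["leet_idxs"][0], ann["leet_idxs"][1])
        for ann in annotations.get("meta", [])
        if "leet_idxs" in ann
    ]
    return sorted(spans, key=lambda span: span[0], reverse=True)


def apply_masks(text, indices, mask="[MASK]"):
    """Replace each span with `mask`, in the order given."""
    for start, end in indices:
        if start < len(text) and end <= len(text) and start < end:
            text = text[:start] + mask + text[end:]
    return text


# Toy example (invented): two camouflaged words with known offsets
text = "get the v4cc1ne t0day"
annotations = {"meta": [{"leet_idxs": [8, 15]}, {"leet_idxs": [16, 21]}]}

spans = extract_masking_indices(annotations)
masked = apply_masks(text, spans)            # MASK filter
blanked = apply_masks(text, spans, mask="")  # BLANK filter
# Applying the spans first-to-last changes the text length before the
# later span is handled, so its offsets no longer line up
shifted = apply_masks(text, sorted(spans))

print(repr(masked))
print(repr(blanked))
print(repr(shifted))
```

Note that the BLANK filter removes only the span itself, so the whitespace around a removed word stays in the text.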

#### Original Test Data

In [11]:
# Load Test Data and Label
test_data_offen_path = Path(home).joinpath("work/WordCamouflage_Resiliance/code/tmp_data/Offen/testset-levela.tsv")
test_label_offen_path = Path(home).joinpath("work/WordCamouflage_Resiliance/code/tmp_data/Offen/labels-levela.csv")

df_test_data_offen = pd.read_csv( test_data_offen_path, sep = "\t")
df_test_label_offen = pd.read_csv( test_label_offen_path, sep = ",", header=None, names=["id", "test_label"])

# Merge Test Data and Test Label
df_test_offen = pd.merge(df_test_data_offen, df_test_label_offen, on="id", how="outer").drop(labels = ["id"], axis = 1)

print("Test")
display(df_test_offen.groupby("test_label").count())

Test


test_label,tweet
NOT,620
OFF,240


Create the camouflaged data. Because the filters are **ideal** (they correctly distinguish camouflaged tokens from non-camouflaged ones) regardless of the camouflage level, any level can be used to create the data. We use Level 1, but we create two versions that differ in the ratio of camouflaged words (v1 = 15% and v2 = 65%).
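
The cells for each level repeat the same two filters over five tweet fractions and write one .spacy file per combination. As a rough sketch of the resulting file layout, assuming the path pattern used in those cells (the function name `filtered_test_paths` and its `root` argument are hypothetical):

```python
from pathlib import PurePosixPath


def filtered_test_paths(root, level, percentages=(10, 25, 50, 75, 100)):
    """Map (percentage, filter name) to the .spacy path written for that combination."""
    base = PurePosixPath(root) / "Leet_Data" / level
    return {
        (pct, name): base / f"{pct}_per" / f"test_{name}_filter.spacy"
        for pct in percentages
        # "mask" is the MASK filter; "out" is the BLANK filter (empty replacement)
        for name in ("mask", "out")
    }


paths = filtered_test_paths("Spacy_Data/Offen_SemEval_2019", "Level_1.1")
for key, path in paths.items():
    print(key, path)
```

Each level therefore yields ten test files: five fractions times two filters.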

#### Level X.1

In [15]:
# Create test DataFrames with 10%, 25%, 50%, 75%, and 100% of the tweets camouflaged at Level 1.1
resiliance_level = "Level_1.1"
resiliance_easy = ["basic_leetspeak"]
augmenter = WordCamouflage_Augmenter.augmenter(
    extractor_type="yake",
    max_top_n=5,
    leet_punt_prb=0.9,
    leet_change_prb=0.8,
    leet_change_frq=0.8,
    leet_uniform_change=0.5,
    method=resiliance_easy,
    return_kws=True,
)

#### 10% Leet ####
df_test_offen_10_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.1, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_10_per_annot["leet_mask_filter"] = (
    df_test_offen_10_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_10_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/10_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_10_per_annot["test_label"].to_list(),
)


df_test_offen_10_per_annot["leet_out_filter"] = (
    df_test_offen_10_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_10_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/10_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_10_per_annot["test_label"].to_list(),
)


#### 25% Leet ####
df_test_offen_25_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.25, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_25_per_annot["leet_mask_filter"] = (
    df_test_offen_25_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_25_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/25_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_25_per_annot["test_label"].to_list(),
)


df_test_offen_25_per_annot["leet_out_filter"] = (
    df_test_offen_25_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_25_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/25_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_25_per_annot["test_label"].to_list(),
)


#### 50% Leet ####
df_test_offen_50_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.50, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_50_per_annot["leet_mask_filter"] = (
    df_test_offen_50_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_50_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/50_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_50_per_annot["test_label"].to_list(),
)


df_test_offen_50_per_annot["leet_out_filter"] = (
    df_test_offen_50_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_50_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/50_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_50_per_annot["test_label"].to_list(),
)


#### 75% Leet ####
df_test_offen_75_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.75, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_75_per_annot["leet_mask_filter"] = (
    df_test_offen_75_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_75_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/75_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_75_per_annot["test_label"].to_list(),
)


df_test_offen_75_per_annot["leet_out_filter"] = (
    df_test_offen_75_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_75_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/75_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_75_per_annot["test_label"].to_list(),
)


#### 100% Leet ####
df_test_offen_100_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=1, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_100_per_annot["leet_mask_filter"] = (
    df_test_offen_100_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_100_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/100_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_100_per_annot["test_label"].to_list(),
)


df_test_offen_100_per_annot["leet_out_filter"] = (
    df_test_offen_100_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_100_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/100_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_100_per_annot["test_label"].to_list(),
)

100%|██████████| 86/86 [00:00<00:00, 120.35it/s]

Nan values in 'tweet' column:  0





Camouflaged,tweet,test_label,leet_text,annotations,temp
False,774,774,0,0,0
True,86,86,86,86,86


100%|██████████| 860/860 [00:00<00:00, 50403.15it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1328.01it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/10_per/test_mask_filter.spacy


100%|██████████| 860/860 [00:00<00:00, 50955.68it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1315.68it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/10_per/test_out_filter.spacy


100%|██████████| 215/215 [00:01<00:00, 129.08it/s]

Nan values in 'tweet' column:  0





Camouflaged,tweet,test_label,leet_text,annotations,temp
False,645,645,0,0,0
True,215,215,215,215,215


100%|██████████| 860/860 [00:00<00:00, 44341.61it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1373.19it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/25_per/test_mask_filter.spacy


100%|██████████| 860/860 [00:00<00:00, 45020.11it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1365.91it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/25_per/test_out_filter.spacy


100%|██████████| 430/430 [00:03<00:00, 123.92it/s]

Nan values in 'tweet' column:  0





Camouflaged,tweet,test_label,leet_text,annotations,temp
False,430,430,0,0,0
True,430,430,430,430,430


100%|██████████| 860/860 [00:00<00:00, 37874.21it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1447.61it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/50_per/test_mask_filter.spacy


100%|██████████| 860/860 [00:00<00:00, 38218.11it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1457.72it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/50_per/test_out_filter.spacy


100%|██████████| 645/645 [00:05<00:00, 114.84it/s]


Nan values in 'tweet' column:  0


Camouflaged,tweet,test_label,leet_text,annotations,temp
False,215,215,0,0,0
True,645,645,645,645,645


100%|██████████| 860/860 [00:00<00:00, 27441.07it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1529.09it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/75_per/test_mask_filter.spacy


100%|██████████| 860/860 [00:00<00:00, 32882.41it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1533.43it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/75_per/test_out_filter.spacy


100%|██████████| 860/860 [00:06<00:00, 123.02it/s]


Nan values in 'tweet' column:  0


Camouflaged,tweet,test_label,leet_text,annotations,temp
True,860,860,860,860,860


100%|██████████| 860/860 [00:00<00:00, 27964.41it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1643.54it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/100_per/test_mask_filter.spacy


100%|██████████| 860/860 [00:00<00:00, 29009.52it/s]
Making docs: 100%|██████████| 860/860 [00:00<00:00, 1573.67it/s]


Processed Test 860 documents: /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/100_per/test_out_filter.spacy


#### Level X.2

In [None]:
resiliance_level = "Level_1.2"
resiliance_easy = ["basic_leetspeak"]
augmenter = WordCamouflage_Augmenter.augmenter(
        extractor_type="yake",
        max_top_n=20,
        leet_punt_prb=0.9,
        leet_change_prb=0.8,
        leet_change_frq=0.8,
        leet_uniform_change=0.5,
        method=resiliance_easy,
        return_kws = True
)

#### 10% Leet ####
df_test_offen_10_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.1, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_10_per_annot["leet_mask_filter"] = (
    df_test_offen_10_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_10_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/10_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_10_per_annot["test_label"].to_list(),
)


df_test_offen_10_per_annot["leet_out_filter"] = (
    df_test_offen_10_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_10_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/10_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_10_per_annot["test_label"].to_list(),
)


#### 25% Leet ####
df_test_offen_25_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.25, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_25_per_annot["leet_mask_filter"] = (
    df_test_offen_25_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_25_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/25_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_25_per_annot["test_label"].to_list(),
)


df_test_offen_25_per_annot["leet_out_filter"] = (
    df_test_offen_25_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_25_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/25_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_25_per_annot["test_label"].to_list(),
)


#### 50% Leet ####
df_test_offen_50_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.50, augmenter=augmenter, column_to_leet="tweet"
)


# Assuming df has the columns 'leet_text', 'annotations', and 'Camouflaged'
df_test_offen_50_per_annot["leet_mask_filter"] = (
    df_test_offen_50_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_50_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/50_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_50_per_annot["test_label"].to_list(),
)


df_test_offen_50_per_annot["leet_out_filter"] = (
    df_test_offen_50_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_50_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/50_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_50_per_annot["test_label"].to_list(),
)


#### 75% Leet ####
df_test_offen_75_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=0.75, augmenter=augmenter, column_to_leet="tweet"
)


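# MASK filter: expects the columns 'leet_text', 'annotations', 'Camouflaged' and 'tweet'.
# Camouflaged rows get their annotated spans masked; all other rows keep the original tweet.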
df_test_offen_75_per_annot["leet_mask_filter"] = (
    df_test_offen_75_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_75_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/75_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_75_per_annot["test_label"].to_list(),
)


# BLANK filter: same as above, but the annotated spans are replaced with an empty string
df_test_offen_75_per_annot["leet_out_filter"] = (
    df_test_offen_75_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_75_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/75_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_75_per_annot["test_label"].to_list(),
)


#### 100% Leet ####
df_test_offen_100_per_annot = create_leet_augmenter_df(
    df_ori=df_test_offen, frac=1, augmenter=augmenter, column_to_leet="tweet"
)


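# MASK filter: expects the columns 'leet_text', 'annotations', 'Camouflaged' and 'tweet'.
# Camouflaged rows get their annotated spans masked; all other rows keep the original tweet.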
df_test_offen_100_per_annot["leet_mask_filter"] = (
    df_test_offen_100_per_annot.progress_apply(
        lambda row: (
            apply_masks(row["leet_text"], extract_masking_indices(row["annotations"]))
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_100_per_annot.loc[:, ["leet_mask_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/100_per/test_mask_filter.spacy",
    lang="en",
    labels=df_test_offen_100_per_annot["test_label"].to_list(),
)


# BLANK filter: same as above, but the annotated spans are replaced with an empty string
df_test_offen_100_per_annot["leet_out_filter"] = (
    df_test_offen_100_per_annot.progress_apply(
        lambda row: (
            apply_masks(
                row["leet_text"], extract_masking_indices(row["annotations"]), ""
            )
            if row["Camouflaged"]
            else row["tweet"]
        ),
        axis=1,
    )
)

pd_2_spacy(
    df_train=None,
    df_dev=None,
    df_test=df_test_offen_100_per_annot.loc[:, ["leet_out_filter", "test_label"]],
    train_output_path=None,
    dev_output_path=None,
    test_output_path=f"{home}/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/{resiliance_level}/100_per/test_out_filter.spacy",
    lang="en",
    labels=df_test_offen_100_per_annot["test_label"].to_list(),
)
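
The percentage blocks above differ only in the fraction and the output folder, so they could be driven from a small job table. The sketch below only builds that table and does not call the notebook's helpers. `filter_jobs`, `FILTERS` and the `Level_X` folder are hypothetical names, and treating `None` as "use the default token of `apply_masks`" is an assumption.

```python
from pathlib import Path

# Hypothetical table: filter column -> replacement token for apply_masks.
# None is assumed to mean "use apply_masks' default token" (the MASK filter);
# "" is the empty replacement used by the BLANK ("out") filter above.
FILTERS = {"leet_mask_filter": None, "leet_out_filter": ""}


def filter_jobs(base_dir, fractions=(0.25, 0.50, 0.75, 1.0)):
    """Return one (frac, column, token, output_path) tuple per fraction and filter."""
    jobs = []
    for frac in fractions:
        folder = f"{round(frac * 100)}_per"  # 0.25 -> "25_per"
        for column, token in FILTERS.items():
            # "leet_mask_filter" -> "test_mask_filter.spacy"
            filename = column.replace("leet_", "test_", 1) + ".spacy"
            jobs.append((frac, column, token, Path(base_dir) / folder / filename))
    return jobs


# Placeholder base directory, for illustration only.
for frac, column, token, path in filter_jobs("Leet_Data/Level_X"):
    print(frac, column, repr(token), path)
```

Each tuple would then feed one `apply_masks` / `pd_2_spacy` round, with `create_leet_augmenter_df` called once per fraction.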

### Load spacy data

In [21]:
import spacy
from spacy.tokens import DocBin


offen_10_leet_data_path = "/home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/100_per/test_mask_filter.spacy"
nlp = spacy.blank("en")

# Load the data (note: despite the `offen_10` variable prefix, the path above points to the 100_per split)
doc_bin = DocBin().from_disk(offen_10_leet_data_path)
docs = list(doc_bin.get_docs(nlp.vocab))

# Convert to a DataFrame
# doc.cats looks like {'OFF': False, 'NOT': True}; keep the label whose value is true
data = [{
    'text': doc.text,
    'test_label': next((label for label, is_true in doc.cats.items() if is_true), None)
} for doc in docs]
offen_10_leet_mask_filter_df_v1 = pd.DataFrame(data)


# Same steps for Level_1.2 (spacy and DocBin are already imported above)


offen_10_leet_data_path = "/home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.2/100_per/test_mask_filter.spacy"
nlp = spacy.blank("en")

# Load the data
doc_bin = DocBin().from_disk(offen_10_leet_data_path)
docs = list(doc_bin.get_docs(nlp.vocab))

# Convert to a DataFrame
# doc.cats looks like {'OFF': False, 'NOT': True}; keep the label whose value is true
data = [{
    'text': doc.text,
    'test_label': next((label for label, is_true in doc.cats.items() if is_true), None)
} for doc in docs]
offen_10_leet_mask_filter_df_v2 = pd.DataFrame(data)


# Same steps for Level_3.2 (spacy and DocBin are already imported above)


offen_10_leet_data_path = "/home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_3.2/100_per/test_mask_filter.spacy"
nlp = spacy.blank("en")

# Load the data
doc_bin = DocBin().from_disk(offen_10_leet_data_path)
docs = list(doc_bin.get_docs(nlp.vocab))

# Convert to a DataFrame
# doc.cats looks like {'OFF': False, 'NOT': True}; keep the label whose value is true
data = [{
    'text': doc.text,
    'test_label': next((label for label, is_true in doc.cats.items() if is_true), None)
} for doc in docs]
offen_10_leet_mask_filter_df_3v2 = pd.DataFrame(data)
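
The label extraction used in the three blocks above can be isolated as a small pure-Python helper. This is a minimal sketch that takes a plain dictionary shaped like `doc.cats`; the name `label_from_cats` is hypothetical.

```python
def label_from_cats(cats):
    """Return the first label whose value is truthy, or None if there is none."""
    return next((label for label, is_true in cats.items() if is_true), None)


print(label_from_cats({"OFF": False, "NOT": True}))  # NOT
print(label_from_cats({"OFF": 0.0, "NOT": 0.0}))     # None
```

If more than one category is truthy, the first one in dictionary order wins, so this pattern only suits mutually exclusive labels such as `OFF` / `NOT`.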

In [22]:
pd.set_option('display.max_colwidth', None)

display(offen_10_leet_mask_filter_df_v1.head(2))
display(offen_10_leet_mask_filter_df_v2.head(2))
display(offen_10_leet_mask_filter_df_3v2.head(2))



Unnamed: 0,text,test_label
0,"#[MASK]. Don't [MASK] all of your [MASK]. If you are not saving [MASK] 15% of what you earn, [MASK] today. Look at your income and expenses, find out where you can [MASK] to cut back and get 15% you save.",NOT
1,[MASK] @[MASK] follows me!,NOT


Unnamed: 0,text,test_label
0,"#[MASK]. Don't [MASK] all of your [MASK]. If you are not [MASK] [MASK] 15% of what you [MASK], [MASK] [MASK]. Look at your [MASK] and [MASK], [MASK] out where you can [MASK] to [MASK] [MASK] and get 15% you [MASK].",NOT
1,[MASK] @[MASK] follows me!,NOT


Unnamed: 0,text,test_label
0,"#[MASK]. Don't [MASK] all of your [MASK]. If you are not [MASK] [MASK] 15% of what you [MASK], [MASK] [MASK]. Look at your [MASK] and [MASK], [MASK] out where you can [MASK] to [MASK] [MASK] and get 15% you [MASK].",NOT
1,[MASK] @[MASK] follows me!,NOT


In [66]:
offen_10_leet_mask_filter_df

Unnamed: 0,text,test_label
0,"#[MASK]. Don't [MASK] all of your [MASK]. If you are not [MASK] [MASK] 15% of what you [MASK], [MASK] [MASK]. Look at your [MASK] and [MASK], [MASK] out where you can [MASK] to [MASK] [MASK] and get 15% you [MASK].",NOT
1,[MASK] @[MASK] follows me!,NOT
2,@[MASK] @[MASK] @[MASK] I'm [MASK] by the way and I [MASK] the [MASK] [MASK] already in [MASK] if it means [MASK] like you don't ever [MASK] a [MASK] you are saying has nothing to do with this.,NOT
3,@[MASK] [MASK] considers [MASK] the [MASK] [MASK] and having [MASK] later in the [MASK] [MASK] she is a [MASK] [MASK],NOT
4,"#[MASK] #[MASK]☠️#[MASK] #[MASK] #[MASK] #[MASK] #[MASK] '... in an [MASK] where [MASK] [MASK] [MASK] [MASK] on all the [MASK] that [MASK] to them most — [MASK], [MASK] [MASK], [MASK] [MASK] and so on — [MASK] ... on the [MASK] [MASK]....'[MASK]",NOT
...,...,...
855,"#[MASK]: If you [MASK] in #[MASK], [MASK] wants to [MASK] your [MASK] on the [MASK] of a new [MASK] [MASK] [MASK]. She is [MASK] to [MASK] a new [MASK] [MASK] [MASK] in the [MASK] [MASK] [MASK] [MASK] [MASK] on [MASK] [MASK]. What do you think? [MASK] [MASK]",NOT
856,[MASK] [MASK] [MASK] #[MASK] [MASK],NOT
857,"#[MASK] I saw the [MASK] [MASK] , thank you for [MASK] us so much , just the way we [MASK] you [MASK] , you are [MASK] with [MASK] ♥️♥️",NOT
858,"#[MASK]: [MASK], 13, [MASK], [MASK] and [MASK] From [MASK] Over —[MASK] for it— #[MASK] [MASK] from [MASK] [MASK] [MASK] #[MASK] #[MASK] #[MASK] #[MASK] #[MASK] #[MASK] #[MASK]",NOT


#### CLI command to evaluate the models

```bash
python -m spacy evaluate path_model data_path_test_mask_filter.spacy --gpu-id 0 --output output_path_test_mask_filter_result_25.json
```

For example, to evaluate the `bert-base-uncased_naive` model on the 25% MASK-filtered test set of `Level_1.1`:

```bash
    python -m spacy evaluate /home/alvaro/work/WordCamouflage_Resiliance/code/models/bert-base-uncased_naive/model-best /home/alvaro/work/WordCamouflage_Resiliance/code/Spacy_Data/Offen_SemEval_2019/Leet_Data/Level_1.1/25_per/test_mask_filter.spacy  --gpu-id 0 --output /home/alvaro/work/WordCamouflage_Resiliance/code/models/bert-base-uncased_naive/model-best/Filter_mask/test_mask_filter_result_25.json
```

The metrics are written as JSON to the file passed to `--output`.
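
To avoid typing the command once per percentage, the argument list could be assembled in Python. The sketch below only builds and prints the commands. The helper name `evaluate_command` and all paths are placeholders, not the notebook's actual layout.

```python
import shlex


def evaluate_command(model_path, data_path, output_path, gpu_id=0):
    """Build the argument list for `python -m spacy evaluate`."""
    return [
        "python", "-m", "spacy", "evaluate",
        str(model_path), str(data_path),
        "--gpu-id", str(gpu_id),
        "--output", str(output_path),
    ]


# Placeholder paths, for illustration only.
for per in (25, 50, 75, 100):
    cmd = evaluate_command(
        "models/model-best",
        f"Leet_Data/{per}_per/test_mask_filter.spacy",
        f"results/test_mask_filter_result_{per}.json",
    )
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to run the evaluation.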

In [18]:
# Collect the evaluation results into a DataFrame
import pandas as pd
import json
import os

filter_name = "blank"  # named filter_name to avoid shadowing the built-in `filter`
directory = f"/home/alvaro/work/WordCamouflage_Resiliance/code/models/mbart-50-naive/model-best/Filter_{filter_name}"


# File names in the desired order
file_order = [
    f"test_{filter_name}_filter_result_10_v1.json",
    f"test_{filter_name}_filter_result_10_v2.json",
    f"test_{filter_name}_filter_result_25_v1.json",
    f"test_{filter_name}_filter_result_25_v2.json",
    f"test_{filter_name}_filter_result_50_v1.json",
    f"test_{filter_name}_filter_result_50_v2.json",
    f"test_{filter_name}_filter_result_75_v1.json",
    f"test_{filter_name}_filter_result_75_v2.json",
    f"test_{filter_name}_filter_result_100_v1.json",
    f"test_{filter_name}_filter_result_100_v2.json",
]

results = []
for filename in file_order:
    with open(os.path.join(directory, filename), "r") as f:
        results.append(json.load(f))

df_results = pd.DataFrame(results)
print(filter_name)
# Transpose so that each column is one result file, in the order of file_order
df_results.loc[:, "cats_macro_f"].to_frame().T

blank


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
cats_macro_f,0.742304,0.727541,0.719691,0.701188,0.691769,0.641577,0.668116,0.565613,0.640255,0.439124
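
Columns 0 to 9 in the output above follow the order of `file_order`, so column 0 is the 10% `v1` file and column 9 is the 100% `v2` file. As a small sketch, the scores can be paired with readable labels; the label format `"<percentage>_<version>"` is a choice made here, not something the notebook defines.

```python
percentages = [10, 25, 50, 75, 100]
versions = ["v1", "v2"]

# Same ordering as file_order: both versions of 10%, then both of 25%, ...
labels = [f"{p}_{v}" for p in percentages for v in versions]

# Scores copied from the output above (filter "blank").
scores = [0.742304, 0.727541, 0.719691, 0.701188, 0.691769,
          0.641577, 0.668116, 0.565613, 0.640255, 0.439124]

labelled = dict(zip(labels, scores))
for label, score in labelled.items():
    print(f"{label:>7}: {score:.6f}")
```

In the notebook, the same labels could be set with `df_results.index = labels` before transposing.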
