EXISTING CODE (REPRODUCTION CODE)

1.   CODE CELLS 1 AND 2 - PREDICTION
2.   REMAINING CELLS - ANALYSIS AND IMPLEMENTATION



NOTE: THE PROF_TABLE.TSV FILE SHOULD BE UPLOADED


In [6]:
import pandas as pd
import re

class Prof:

    def __init__(self):
        """
        Class initialization
        """

        self.prof_table = pd.read_csv("/prof_table.tsv", sep="\t")

    def obfuscate_string(self, text, lang="any"):
        """
        Obfuscates the profanities found in the input text
        @param text: text to be obfuscated
        @param lang: text's language
        @return: obfuscated text
        """

        if lang != "any":
            assert lang in ['EN', 'FR', 'DE', 'IT', 'ES']
            prof_table_lang = self.prof_table[self.prof_table['language'] == lang]
        else:
            prof_table_lang = self.prof_table
        replacements = prof_table_lang[['profanity', 'obfuscation']].to_numpy()
        for k,v in replacements:
            # Replace k only when it is not embedded in a longer alphabetic word
            text = re.sub(f'(?<![a-zA-Z]){k}(?![a-zA-Z])', v, text)

        return text

    def reveal_profanity(self, profanity_obfuscated, lang="any"):
        """
        Looks up the original profanity for an obfuscated form
        @param profanity_obfuscated: obfuscated profanity to be revealed
        @param lang: text's language
        @return: revealed profanity
        """

        if lang != "any":
            assert lang in ['EN', 'FR', 'DE', 'IT', 'ES']
            prof_table_lang = self.prof_table[self.prof_table['language'] == lang]
        else:
            prof_table_lang = self.prof_table

        profanity = prof_table_lang[prof_table_lang['obfuscation'] == profanity_obfuscated]['profanity']

        profanity = profanity.unique()[0]

        return profanity

In [49]:
prof = Prof()
obfuscator = prof

print(obfuscator.obfuscate_string("puta mierda"))

print(obfuscator.obfuscate_string("mother fucker"))

print(obfuscator.obfuscate_string("motherfucker"))

print(obfuscator.obfuscate_string("porca puttana","IT"))

print(obfuscator.obfuscate_string("porn"))


p*ta m*erda
mother fucker
motherfucker
p*rca p*ttana
porn


As can be observed, not all profanities are recognized. The algorithm only flags words that are stored in its dataset, so its coverage depends entirely on how complete that dataset is. **This is the main limitation of this profanity obfuscation algorithm**. **The other limitation is that it is not sensitive to the semantic and contextual meaning of an input sentence.**
Our proposed approach uses the concept of masking words, which is inspired by this project/research paper, and is designed as an improved version of it that addresses both limitations.
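
The dictionary-lookup limitation can be reproduced without the TSV file. The sketch below is only an illustration, assuming a small hypothetical two-entry table (`toy_table`) in place of prof_table.tsv; the function name `obfuscate` and the use of `re.escape` are our assumptions and not part of the original class.

```python
import re

# Hypothetical stand-in for prof_table.tsv: (profanity, obfuscation) pairs.
toy_table = [("puta", "p*ta"), ("mierda", "m*erda")]

def obfuscate(text, table):
    """Mask each listed word when it is not embedded in a longer word."""
    for word, masked in table:
        # Lookbehind/lookahead reject matches that touch another letter.
        pattern = rf"(?<![a-zA-Z]){re.escape(word)}(?![a-zA-Z])"
        text = re.sub(pattern, masked, text)
    return text

print(obfuscate("puta mierda", toy_table))   # both words are in the table
print(obfuscate("putas", toy_table))         # inflected form is not in the table
print(obfuscate("computadora", toy_table))   # substring of a longer word
```

Only the first call changes its input: any spelling, inflection, or word that is missing from the table passes through unmasked, whatever the surrounding sentence means.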

In [4]:
%%capture
!pip install contextualized-topic-models

In [5]:
%%capture
!pip install pyldavis

In [6]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt

In [7]:
!head -n 2 dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and


In [5]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

# NOTE: RESTART THE KERNEL

In [6]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk

In [7]:
from nltk.corpus import stopwords as stop_words

nltk.download('stopwords')

# Read the first 2000 lines and close the file afterwards
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f.readlines()[0:2000]]

stopwords = list(stop_words.words("english"))

sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
preprocessed_documents[:2]

['mid peninsula highway across peninsula canadian province ontario although highway fort south decades international study published ministry',
 'died march american photographer photography operated studio silver spring maryland later lived florida magazine photographer year']

In [9]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

In [10]:
tp.vocab[:10]

array(['ab', 'abbreviated', 'abroad', 'academic', 'academy', 'accepted',
       'access', 'according', 'accounting', 'achieved'], dtype=object)

In [11]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=20, num_epochs=10)
ctm.fit(training_dataset) # run the model

Epoch: [10/10]	 Seen Samples: [19840/20000]	Train Loss: 139.61832354145665	Time: 0:00:00.652538: : 10it [00:08,  1.16it/s]
100%|██████████| 32/32 [00:00<00:00, 52.68it/s]


In [12]:
ctm.get_topic_lists(5)

[['used', 'system', 'modern', 'chinese', 'related'],
 ['directed', 'french', 'drama', 'comic', 'history'],
 ['group', 'science', 'album', 'projects', 'use'],
 ['born', 'championships', 'new', 'national', 'season'],
 ['born', 'played', 'professional', 'footballer', 'football'],
 ['located', 'near', 'village', 'population', 'central'],
 ['various', 'released', 'well', 'peter', 'language'],
 ['island', 'population', 'village', 'kilometres', 'district'],
 ['house', 'general', 'council', 'party', 'united'],
 ['championship', 'american', 'world', 'women', 'summer'],
 ['kilometres', 'located', 'km', 'mi', 'village'],
 ['book', 'film', 'work', 'published', 'written'],
 ['released', 'production', 'rock', 'music', 'based'],
 ['located', 'south', 'state', 'within', 'district'],
 ['county', 'states', 'state', 'united', 'miles'],
 ['politician', 'member', 'served', 'party', 'house'],
 ['born', 'player', 'played', 'december', 'former'],
 ['football', 'played', 'league', 'professional', 'player'],
 [

In [13]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

100%|██████████| 32/32 [00:00<00:00, 32.47it/s]


In [14]:
preprocessed_documents[0] # see the text of our preprocessed document

'mid peninsula highway across peninsula canadian province ontario although highway fort south decades international study published ministry'

In [15]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [16]:
ctm.get_topic_lists(5)[15]

['politician', 'member', 'served', 'party', 'house']

In [17]:
ctm.get_topic_lists(5)[topic_number]

['house', 'general', 'council', 'party', 'united']
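
The lookup in the last cells picks the topic with the largest weight in a document's topic distribution. A minimal sketch of that step, using made-up values: the list `theta` and the three-topic `topic_words` are hypothetical, not output from the trained model, and the plain-Python argmax is written to behave like `np.argmax` on a 1-D array so the snippet runs without dependencies.

```python
# Hypothetical topic distribution for one document over three topics.
theta = [0.10, 0.65, 0.25]

# Hypothetical top words per topic, shaped like the list of lists
# that get_topic_lists returns.
topic_words = [
    ["film", "book"],
    ["highway", "province"],
    ["football", "league"],
]

# Index of the largest weight; on ties the first maximum wins.
best = max(range(len(theta)), key=theta.__getitem__)

print(best, topic_words[best])
```

The selected index is the single most probable topic; the remaining weights in `theta` are discarded, so a document spread evenly over several topics is still assigned to one of them.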