# CondBERT vocabulary compilation
This notebook reproduces the creation of the CondBERT vocabulary.

The files `positive-words.txt`, `negative-words.txt` and `toxic_words.txt` are not reproduced exactly because of our internal issues. 

However, all other files (`token_toxicities.txt` and `word2coef.pkl`) are reproduced accurately.

Important note: the notebook changes the working directory with a relative path (`os.chdir('..')`), so it should be run only once per kernel session. If you want to run it a second time, restart the kernel first.

In [1]:
import os

# Move up one directory so that the relative paths used below (models/, data/) resolve
os.chdir('..')
print(os.getcwd())

/home/leon/Projects/Programming/Study/Python/ML_Inno/PMLDL/PML_ASS_1
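
The run-once restriction comes from the relative `os.chdir('..')`: every repeated execution moves one more level up. A minimal sketch of a repeatable alternative is shown here, assuming the project root can be recognised by a `models` subdirectory. The helper name `find_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
import os


def find_project_root(start, marker="models"):
    """Walk up from `start` until a directory containing `marker` is found.

    `marker` is an assumption: any directory that exists only in the
    project root would work equally well.
    """
    current = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise FileNotFoundError(
                f"no directory containing {marker!r} at or above {start!r}"
            )
        current = parent


# Usage (left commented out so the sketch has no side effects):
# os.chdir(find_project_root(os.getcwd()))
```

Because the search starts from the current directory and stops at the first match, calling it again from the project root returns the same path, so the cell could be re-executed without restarting the kernel.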


In [None]:
import torch
import numpy as np

def manual_seed(seed):
    """
    Function to set the seed value for reproducibility
    :param seed: seed value
    :return: None
    """
    # PyTorch manual seed
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    # NumPy manual seed
    np.random.seed(seed)

# Set the seed value
seed = 42

# Call the manual seeding function
manual_seed(seed)

In [2]:
VOCAB_DIRNAME = 'models/Conbert/vocab/'

In [3]:
from models.Conbert.conbert import CondBertRewriter
from models.Conbert.choosers import EmbeddingSimilarityChooser
from models.Conbert.multiword.masked_token_predictor_bert import MaskedTokenPredictorBert

# Loading BERT

In [4]:
from transformers import BertTokenizer, BertForMaskedLM
import pickle
import os
from tqdm.auto import tqdm, trange

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [6]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

In [7]:
model = BertForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
model.to(device);

# Preparing the vocabularies

- negative-words.txt
- positive-words.txt
- word2coef.pkl
- token_toxicities.txt

These files should be prepared once.

In [33]:
tox_corpus_path = 'data/interm/train_toxic_corpus.txt'
norm_corpus_path = 'data/interm/train_normal_corpus.txt'

In [34]:
if not os.path.exists(VOCAB_DIRNAME):
    os.makedirs(VOCAB_DIRNAME)

### Preparing the DRG-like vocabularies

In [35]:
import os
import argparse
from tqdm import tqdm
from nltk import ngrams
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer



class NgramSalienceCalculator():
    """Calculates salience of ngrams in a corpus"""
    def __init__(self, tox_corpus, norm_corpus, use_ngrams=False):
        ngrams = (1, 3) if use_ngrams else (1, 1)
        self.vectorizer = CountVectorizer(ngram_range=ngrams)

        tox_count_matrix = self.vectorizer.fit_transform(tox_corpus)
        self.tox_vocab = self.vectorizer.vocabulary_
        self.tox_counts = np.sum(tox_count_matrix, axis=0)

        norm_count_matrix = self.vectorizer.fit_transform(norm_corpus)
        self.norm_vocab = self.vectorizer.vocabulary_
        self.norm_counts = np.sum(norm_count_matrix, axis=0)

    def salience(self, feature, attribute='tox', lmbda=0.5):
        """
        Calculates salience of a feature
        :param feature: input feature
        :param attribute: attribute to calculate salience for
        :param lmbda: smoothing parameter
        :return: salience of a feature
        """
        assert attribute in ['tox', 'norm']
        if feature not in self.tox_vocab:
            tox_count = 0.0
        else:
            tox_count = self.tox_counts[0, self.tox_vocab[feature]]

        if feature not in self.norm_vocab:
            norm_count = 0.0
        else:
            norm_count = self.norm_counts[0, self.norm_vocab[feature]]

        if attribute == 'tox':
            return (tox_count + lmbda) / (norm_count + lmbda)
        else:
            return (norm_count + lmbda) / (tox_count + lmbda)
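
The salience used above is a smoothed count ratio, `(count_in_attribute + lmbda) / (count_in_other + lmbda)`. The sketch below reproduces the same formula with plain `Counter` objects on two tiny invented corpora, so the effect of the smoothing term and of the threshold can be checked by hand. The function `toy_salience` and the toy sentences are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def toy_salience(word, tox_counts, norm_counts, attribute="tox", lmbda=0.5):
    """Smoothed count ratio, same formula as NgramSalienceCalculator.salience."""
    tox = tox_counts[word]    # Counter returns 0 for unseen words
    norm = norm_counts[word]
    if attribute == "tox":
        return (tox + lmbda) / (norm + lmbda)
    return (norm + lmbda) / (tox + lmbda)


# Invented toy corpora, already tokenised by whitespace
toy_tox = Counter("you idiot you are such idiot idiot idiot".split())
toy_norm = Counter("you are friend thank you friend".split())

threshold = 4
words = sorted(set(toy_tox) | set(toy_norm))
negative = [w for w in words if toy_salience(w, toy_tox, toy_norm, "tox") > threshold]
positive = [w for w in words if toy_salience(w, toy_tox, toy_norm, "norm") > threshold]

print(negative)  # ['idiot']   (4 + 0.5) / (0 + 0.5) = 9.0
print(positive)  # ['friend']  (2 + 0.5) / (0 + 0.5) = 5.0
```

Words that occur only once in a single corpus score `1.5 / 0.5 = 3.0`, which stays below the threshold of 4. This is why rare words do not automatically end up in either list.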


In [12]:
from collections import Counter
c = Counter()

for fn in [tox_corpus_path, norm_corpus_path]:
    with open(fn, 'r') as corpus:
        for line in corpus.readlines():
            for tok in line.strip().split():
                c[tok] += 1

print(len(c))

88645


In [13]:
# Keeping only words with more than 1 occurrence would roughly halve the vocabulary, but this size is manageable
vocab = {w for w, n in c.most_common() if n > 0}
print(len(vocab))

88645


In [14]:
with open(tox_corpus_path, 'r') as tox_corpus, open(norm_corpus_path, 'r') as norm_corpus:
    corpus_tox = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in tox_corpus.readlines()]
    corpus_norm = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in norm_corpus.readlines()]

In [15]:
neg_out_name = VOCAB_DIRNAME + '/negative-words.txt'
pos_out_name = VOCAB_DIRNAME + '/positive-words.txt'

In [16]:
threshold = 4

In [17]:
sc = NgramSalienceCalculator(corpus_tox, corpus_norm, False)
seen_grams = set()

with open(neg_out_name, 'w') as neg_out, open(pos_out_name, 'w') as pos_out:
    for gram in set(sc.tox_vocab.keys()).union(set(sc.norm_vocab.keys())):
        if gram not in seen_grams:
            seen_grams.add(gram)
            toxic_salience = sc.salience(gram, attribute='tox')
            polite_salience = sc.salience(gram, attribute='norm')
            if toxic_salience > threshold:
                neg_out.write(f'{gram}\n')
            elif polite_salience > threshold:
                pos_out.write(f'{gram}\n')

## Evaluating word toxicities with a logistic regression

In [18]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

In [19]:
X_train = corpus_tox + corpus_norm
y_train = [1] * len(corpus_tox) + [0] * len(corpus_norm)
pipe.fit(X_train, y_train);

In [20]:
coefs = pipe[1].coef_[0]
coefs.shape

(88519,)

In [21]:
word2coef = {w: coefs[idx] for w, idx in pipe[0].vocabulary_.items()}

In [22]:
import pickle
with open(VOCAB_DIRNAME + '/word2coef.pkl', 'wb') as f:
    pickle.dump(word2coef, f)

## Labelling BERT tokens by toxicity

In [23]:
from collections import defaultdict
# Counters start at 1 (add-one smoothing), so unseen tokens get a toxicity of 0.5
toxic_counter = defaultdict(lambda: 1)
nontoxic_counter = defaultdict(lambda: 1)

for text in tqdm(corpus_tox):
    for token in tokenizer.encode(text):
        toxic_counter[token] += 1
for text in tqdm(corpus_norm):
    for token in tokenizer.encode(text):
        nontoxic_counter[token] += 1

100%|██████████| 135390/135390 [00:57<00:00, 2351.64it/s]
100%|██████████| 135390/135390 [00:57<00:00, 2341.25it/s]


In [24]:
token_toxicities = [toxic_counter[i] / (nontoxic_counter[i] + toxic_counter[i]) for i in range(len(tokenizer.vocab))]
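
To make the smoothing concrete, here is a small self-contained sketch of the same computation on hypothetical token ids (5, 7, 9 and 11 are invented for illustration). Both counters start at 1, exactly as in the cell above.

```python
from collections import defaultdict

toy_toxic = defaultdict(lambda: 1)
toy_nontoxic = defaultdict(lambda: 1)

for token_id in [5, 5, 5, 7]:   # token ids seen in toxic texts
    toy_toxic[token_id] += 1
for token_id in [7, 7, 7, 9]:   # token ids seen in normal texts
    toy_nontoxic[token_id] += 1


def toy_toxicity(token_id):
    """Share of (smoothed) occurrences that come from the toxic corpus."""
    t = toy_toxic[token_id]
    n = toy_nontoxic[token_id]
    return t / (t + n)


print(toy_toxicity(5))    # 4 / (4 + 1) = 0.8
print(toy_toxicity(11))   # never seen: 1 / (1 + 1) = 0.5
```

Since both counts are at least 1, the ratio always lies strictly between 0 and 1, which keeps a later log-odds transform finite.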

In [25]:
with open(VOCAB_DIRNAME + '/token_toxicities.txt', 'w') as f:
    for t in token_toxicities:
        f.write(str(t))
        f.write('\n')

# Setting up the model

### Loading the vocabularies

In [26]:
with open(VOCAB_DIRNAME + "/negative-words.txt", "r") as f:
    negative_words = [line.rstrip('\n') for line in f]

with open(VOCAB_DIRNAME + "/positive-words.txt", "r") as f:
    positive_words = [line.rstrip('\n') for line in f]

In [27]:
import pickle
with open(VOCAB_DIRNAME + '/word2coef.pkl', 'rb') as f:
    word2coef = pickle.load(f)

In [28]:
token_toxicities = []
with open(VOCAB_DIRNAME + '/token_toxicities.txt', 'r') as f:
    for line in f.readlines():
        token_toxicities.append(float(line))
token_toxicities = np.array(token_toxicities)
token_toxicities = np.maximum(0, np.log(1/(1/token_toxicities-1)))   # log-odds log(p / (1 - p)), clipped at 0

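# discourage meaningless tokens
# tokenizer.encode(tok) returns the ids of [CLS], the token and [SEP], so take the middle id first;
# indexing the array with the whole list would assign into a temporary copy
for tok in ['.', ',', '-']:
    token_toxicities[tokenizer.encode(tok)[1]] = 3

for tok in ['you']:
    token_toxicities[tokenizer.encode(tok)[1]] = 0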
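
The transform applied to the loaded values, `np.log(1/(1/p - 1))`, is algebraically the log-odds `log(p / (1 - p))`, and `np.maximum(0, ...)` clips it at zero. A minimal scalar sketch with the standard `math` module follows; the helper name `clipped_log_odds` is hypothetical.

```python
import math


def clipped_log_odds(p):
    """log(p / (1 - p)) clipped at 0; p must lie strictly between 0 and 1."""
    return max(0.0, math.log(p / (1.0 - p)))


for p in (0.25, 0.5, 0.8):
    print(p, clipped_log_odds(p))
```

Tokens that are not more frequent in the toxic corpus than in the normal one (`p <= 0.5`) map to 0 and therefore receive no penalty. For `p = 0.8` the value is `log(4)`, roughly 1.39.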

In [29]:
def adjust_logits(logits, label=0):
    """
    Function that adjusts logits to make the model more sensitive to the label
    :param logits: logits from the model
    :param label: label to adjust logits to
    :return: adjusted logits
    """
    return logits - token_toxicities * 100 * (1 - 2 * label)

predictor = MaskedTokenPredictorBert(model, tokenizer, max_len=250, device=device, label=0, contrast_penalty=0.0, logits_postprocessor=adjust_logits)

editor = CondBertRewriter(
    model=model,
    tokenizer=tokenizer,
    device=device,
    neg_words=negative_words,
    pos_words=positive_words,
    word2coef=word2coef,
    token_toxicities=token_toxicities,
    predictor=predictor,
)
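
The effect of `adjust_logits` can be checked on a toy vocabulary of three tokens. This sketch uses plain Python lists and passes the toxicity scores and the coefficient as explicit arguments (an assumption made for self-containment; in the notebook the function reads the global `token_toxicities` array and the coefficient is hard-coded to 100).

```python
def toy_adjust_logits(logits, toxicities, label=0, scale=100):
    """Shift each logit by its toxicity score; the sign depends on the label."""
    sign = 1 - 2 * label          # +1 for label 0, -1 for label 1
    return [l - t * scale * sign for l, t in zip(logits, toxicities)]


toy_logits = [2.0, 5.0, 1.0]      # invented raw scores for three tokens
toy_tox_scores = [0.0, 1.5, 0.0]  # only the second token is marked as toxic

print(toy_adjust_logits(toy_logits, toy_tox_scores, label=0))  # [2.0, -145.0, 1.0]
print(toy_adjust_logits(toy_logits, toy_tox_scores, label=1))  # [2.0, 155.0, 1.0]
```

With `label=0` the toxic token drops from the highest score to the lowest, so the first token becomes the top prediction. With `label=1` the sign flips and the toxic token is boosted instead.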

The model below is used for reranking BERT hypotheses. It helps to increase semantic similarity by choosing the hypotheses whose embeddings are similar to those of the original words.

In [30]:
chooser = EmbeddingSimilarityChooser(sim_coef=10, tokenizer=tokenizer)

# Finally, the inference

Parallel application of the model to all tokens: fast, but dirty.

In [31]:
print(editor.translate('You are idiot!', prnt=False))

you are mistake !


Application of the model to all the tokens sequentially, in the multiword mode. 

In [32]:
print(editor.replacement_loop('You are stupid!', verbose=False, chooser=chooser, n_tokens=(1, 2, 3), n_top=10))

you are very beautiful !


Parameters that could be tuned:
* The coefficient in `adjust_logits` - the larger it is, the more the model avoids toxic words
* The coefficient in `EmbeddingSimilarityChooser` - the larger it is, the more the model tries to preserve content
* `n_tokens` - how many words can be generated from one
* `n_top` - how many BERT hypotheses are reranked
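
To illustrate how a similarity coefficient can trade fluency against content preservation, here is a rough sketch of reranking. It assumes the final score is the language-model score plus `sim_coef` times the cosine similarity to the original word's embedding. This is an illustrative assumption, not the actual implementation of `EmbeddingSimilarityChooser`; the words, scores and two-dimensional vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def rerank(hypotheses, original_vec, sim_coef):
    """Sort (word, lm_score, vector) triples by lm_score + sim_coef * similarity."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] + sim_coef * cosine(h[2], original_vec),
        reverse=True,
    )


original_vec = [1.0, 0.0]                   # invented embedding of the replaced word
hypotheses = [
    ("beautiful", 0.9, [0.0, 1.0]),         # fluent, but unrelated in meaning
    ("silly", 0.4, [1.0, 0.1]),             # less likely, but close in meaning
]

print([h[0] for h in rerank(hypotheses, original_vec, sim_coef=0)])
print([h[0] for h in rerank(hypotheses, original_vec, sim_coef=10)])
```

With `sim_coef=0` the ranking follows the language-model score alone, so `beautiful` comes first. With `sim_coef=10` the similarity term dominates and `silly` moves to the top.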