# **Models**
## Fine-tune `mrm8488/t5-small-finetuned-quora-for-paraphrasing`
## [Link for the model](https://huggingface.co/mrm8488/t5-small-finetuned-quora-for-paraphrasing)

### Preparation

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import torch
from torch.utils.data import Dataset, DataLoader

In [2]:
# Choose device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [3]:
# Read the filtered ParaNMT data from a tab-separated (TSV) file
data = pd.read_csv("filtered_paranmt/filtered.tsv", sep="\t", index_col=0)
data.head(7)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...,0.703185,0.206522,0.950956,0.035846
6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.,0.618866,0.230769,0.999492,0.000131


In [6]:
# Dataset that tokenizes toxic/neutral sentence pairs for the seq2seq model
class MyDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_source_length, max_target_length):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.dataframe)

    def priint(self, idx):
        # Debug helper: print the source ("Toxic") sentence at position idx
        row = self.dataframe.iloc[idx]
        print(row["Toxic"])

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        # Truncate so that no example exceeds the configured lengths
        inputs = self.tokenizer(
            row["Toxic"],
            max_length=self.max_source_length,
            truncation=True,
        )
        outputs = self.tokenizer(
            row["Neutral"],
            max_length=self.max_target_length,
            truncation=True,
        )
        # No padding here: examples are padded per batch by the data collator
        inputs["input_ids"] = torch.tensor(inputs["input_ids"])
        inputs["attention_mask"] = torch.tensor(inputs["attention_mask"])
        inputs["labels"] = torch.tensor(outputs["input_ids"])
        return inputs

In [187]:
# Load input data 
input_data = pd.read_csv("filtered_for_models.csv", index_col=0)
input_data

Unnamed: 0,Toxic,Neutral,Tox score
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983
1,youre becoming disgusting,now youre getting nasty,0.999039
2,well we can spare your life,well we could spare your life for one,0.985068
3,monkey you have to wake up,ah monkey youve got to snap out of it,0.994215
4,i have orders to kill her,ive got orders to put her down,0.999348
...,...,...,...
557503,you didnt know that estelle stole your fish fr...,you didnt know that estelle had stolen some fi...,0.949143
557504,itil suck the life out of you,youd be sucked out of your life,0.996124
557505,i cant fuckin take that bruv,i really cant take this,0.984538
557506,they called me a fucking hero the truth is i d...,they said i was a hero but i didnt care,0.991945


In [18]:
# Split the first 1,000 rows into train (80%) and validation (20%) sets
from sklearn.model_selection import train_test_split

train_set, val_set = train_test_split(input_data[:1000], test_size=0.2, random_state=42)


In [19]:
# %pip install sentencepiece

In [20]:
# %pip install --upgrade transformers

In [21]:
# %pip install protobuf

### Model fine-tuning

In [22]:
# Load the pretrained seq2seq model and its tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-small-finetuned-quora-for-paraphrasing", cache_dir=None)
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-small-finetuned-quora-for-paraphrasing", cache_dir=None)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [23]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer)
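
`MyDataset` returns sequences of different lengths, so each batch has to be padded before it reaches the model. The sketch below illustrates that idea on plain Python lists; it is not the library's code. `pad_batch` is a hypothetical helper, the token ids are made up, 0 is T5's padding token id, and -100 is the label value that the loss ignores.

```python
# Illustration only: mimic per-batch padding on plain lists (hypothetical helper).
PAD_TOKEN_ID = 0      # T5 padding token id
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss


def pad_batch(examples):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_lab = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in = max_in - len(e["input_ids"])
        n_lab = max_lab - len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [PAD_TOKEN_ID] * n_in)
        batch["attention_mask"].append([1] * len(e["input_ids"]) + [0] * n_in)
        batch["labels"].append(e["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch


# Made-up token ids, chosen only to show the resulting shapes
examples = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
batch = pad_batch(examples)
print(batch)
```

Padding per batch rather than to a fixed length keeps short batches small, which matters when training on CPU.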

In [24]:
# Wrap the train and validation splits in the dataset class
train_dataset = MyDataset(train_set, tokenizer, 128, 128)
val_dataset = MyDataset(val_set, tokenizer, 128, 128)

In [25]:
train_dataset.priint(0)

and im not just talking about hitting me for your boyfriend what a girl


In [186]:
# %pip install accelerate -U

In [27]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=124,
    per_device_eval_batch_size=124,
    num_train_epochs=2,
    logging_dir='./logs',
    save_strategy="steps",
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=300,
    logging_steps=50,
    learning_rate=1e-4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [29]:
trainer.train()

  0%|          | 0/14 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 14/14 [12:04<00:00, 51.73s/it]

{'train_runtime': 724.1687, 'train_samples_per_second': 2.209, 'train_steps_per_second': 0.019, 'train_loss': 3.4641502925327847, 'epoch': 2.0}





TrainOutput(global_step=14, training_loss=3.4641502925327847, metrics={'train_runtime': 724.1687, 'train_samples_per_second': 2.209, 'train_steps_per_second': 0.019, 'train_loss': 3.4641502925327847, 'epoch': 2.0})
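
The progress bar reports 14 optimizer steps. That figure follows from the split and batch size used above, assuming a single device and no gradient accumulation. The short calculation below reproduces it and compares the total with the step intervals set in `Seq2SeqTrainingArguments`.

```python
import math

n_rows = 1000        # input_data[:1000]
n_val = n_rows // 5  # test_size=0.2
n_train = n_rows - n_val
batch_size = 124     # per_device_train_batch_size
epochs = 2           # num_train_epochs

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, n_val, steps_per_epoch, total_steps)

# Compare the step-based intervals with the length of the run
intervals = {"logging_steps": 50, "eval_steps": 100, "save_steps": 300}
for name, every in intervals.items():
    status = "reached" if every <= total_steps else "never reached"
    print(f"{name}={every}: {status}")
```

All three intervals exceed the total number of steps, which is consistent with the output above containing only the final training summary and no intermediate loss or evaluation logs.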

In [59]:
torch.save(model.state_dict(), "t5smallparaph.pt")

### Paraphrase

In [102]:
def paraphrase(text, max_length=128):
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)

    # Beam search that returns the 5 highest-scoring candidates
    generated_ids = model.generate(
        input_ids=input_ids,
        num_return_sequences=5,
        num_beams=5,
        max_length=max_length,
        no_repeat_ngram_size=2,
        repetition_penalty=3.5,
        length_penalty=1.0,
        early_stopping=True,
    )

    preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
    return preds

preds = paraphrase("paraphrase: What is the best framework for dealing with a huge text dataset?")


In [170]:
test_sample = input_data.sample(n=10)
test_sample

Unnamed: 0,Toxic,Neutral,Tox score
535150,ill ill smash it myself,i broke it myself,0.971542
444270,the tits,those boobies,0.984656
208888,but hes fucking captain america,hes captain america,0.998577
478355,if he gets another chance im dead,if he ever got a second chance i was dead,0.96346
316261,basically these two are real criminals,actually those are the real criminals over there,0.986255
335957,how dare you shoot at my wife,how dare you shoot my wife,0.978353
387628,no wonder kurtz put a weed up commands ass,no wonder kurtz had screwed up the command,0.996818
73151,so these boobs,so the breasts,0.991037
420809,this is junk again near california state unive...,and here is the flotsam near california state ...,0.971545
537234,i cant go around telling guys not to ask amy o...,i cant ban guys from asking amy out because i ...,0.980846


In [171]:
# Get paraphrased sentences from the model
paraphrased = []
for sent in test_sample['Toxic']:
    paraph = paraphrase(sent)
    paraphrased.append(paraph)

In [172]:
paraphrased

[['Is it okay to smash it myself?',
  'Is it okay to smash it myself ill?',
  'I ill smash it myself.',
  'Is it okay to smash it myself.',
  'Is it okay to smash it myself ill.'],
 ['What are the tits?',
  'What are the tits and how do they work?',
  'What are the tits and how do they look like?',
  'What are the tits and how do they work out?',
  'What are the tits and how do they get rid of?'],
 ['But hes fucking captain america.',
  'Hes fucking captain america but hes not a big fan.',
  'Hes fucking captain america but hes not a good guy.',
  'Hes fucking captain america but hes not a bad guy.',
  'Hes fucking captain america but hes not a good captain.'],
 ['If he gets another chance im dead.',
  'if he gets another chance im dead.',
  'If he gets another chance im dead in the next 10 years, and his chances are good.',
  'If he gets another chance im dead in the next 10 years, and his chances are there.',
  'If he gets another chance im dead in the next 10 years, if his chances a

### Evaluate the model

In [173]:
# Load vocabulary
vocabframe = pd.read_csv("bestvocab.csv", index_col=0)
vocabframe

Unnamed: 0,key,translation
0,kissoon,24170
1,ripton,21371
2,loose-jointed,90238
3,skins,7531
4,seena,37526
...,...,...
113967,ﬁve,113966
113968,ﬂoat,113967
113969,ﬂoor,113969
113970,ﬂunkeys,113970


In [174]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\84907\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [175]:
# Data preparation for the metric model
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

PUNCT_TO_REMOVE = string.punctuation
ENGLISH_STOPWORDS = set(stopwords.words("english"))

def text_to_tensor(sent):
    sent = sent.lower()
    sent = sent.translate(str.maketrans('', '', PUNCT_TO_REMOVE))
    sent = " ".join([word for word in str(sent).split() if word not in ENGLISH_STOPWORDS])
    sent = word_tokenize(sent)

    words = []
    for word in sent:
        query = list(vocabframe.query("key == @word")['translation'])
        if len(query) > 0:
            words.append(query[0])
    return torch.tensor(words, dtype=torch.int64)
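
`text_to_tensor` runs one `DataFrame.query` per word, which scans the vocabulary table each time. A possible alternative, sketched below, is to build a plain dictionary once and look words up in it. The helper name `words_to_ids` and the three-entry vocabulary are illustrative only; with the real data the dictionary could be built as `dict(zip(vocabframe["key"], vocabframe["translation"]))`. If a key appears more than once, `query` picks the first row while `dict(zip(...))` keeps the last, so the two approaches would differ for duplicated keys.

```python
# Sketch: dictionary-based vocabulary lookup (names and entries are illustrative).
vocab = {"kissoon": 24170, "ripton": 21371, "skins": 7531}


def words_to_ids(words, vocab):
    """Map known words to ids and skip out-of-vocabulary words."""
    return [vocab[w] for w in words if w in vocab]


ids = words_to_ids(["skins", "unknownword", "ripton"], vocab)
print(ids)
```

The resulting list can be passed to `torch.tensor(ids, dtype=torch.int64)` exactly as in `text_to_tensor`.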

In [176]:
paraphrased[0][0]

'Is it okay to smash it myself?'

In [177]:
text_to_tensor(paraphrased[0][0])

tensor([ 151, 1445])

In [178]:
import torch.nn as nn

class TextRegressionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.dropout = nn.Dropout(0.4)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear1 = nn.Linear(hidden_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linearOut = nn.Linear(hidden_dim, 1)
        self.out = nn.Sigmoid()

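    def forward(self, text, offsets):
        # EmbeddingBag (default mode "mean") averages the token embeddings of each bag
        embedded = self.embedding(text, offsets)
        dout = self.dropout(embedded)
        lstm_out, (ht, ct) = self.lstm(dout)
        # linear1 is applied twice, so both hidden layers share the same weights
        out = self.linear1(lstm_out)
        out = self.relu(out)
        out = self.linear1(out)
        out = self.relu(out)
        # self.out (Sigmoid) is not applied, so the returned score is unbounded
        return self.linearOut(out)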

In [179]:
# Load the trained metric model weights
metric_model = TextRegressionModel(114506, 300, 200)
cpt = torch.load("best.pt", map_location="cpu")
metric_model.load_state_dict(cpt)

<All keys matched successfully>

In [180]:
def get_offset(sent):
    # A single sentence forms one bag, which starts at index 0
    return torch.tensor([0])
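
`nn.EmbeddingBag` takes a flat tensor of token ids plus an `offsets` tensor that marks where each bag starts. For the single sentence handled by `get_offset`, the only start position is 0. The sketch below shows, in plain Python, how the start positions would look if several sentences were concatenated into one flat tensor; `bag_offsets` is a hypothetical helper and not part of the notebook.

```python
from itertools import accumulate


def bag_offsets(lengths):
    """Start index of each sentence when all sentences are concatenated."""
    # Drop the last length: it marks the end of the final bag, not a start.
    return [0] + list(accumulate(lengths[:-1]))


single = bag_offsets([2])         # one sentence of 2 tokens
batched = bag_offsets([2, 3, 4])  # three sentences of 2, 3 and 4 tokens
print(single, batched)
```

Scoring all five candidates of a sentence in one call would need offsets of this batched form together with the concatenated token ids.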

In [181]:
def predict(model, sent, offset):
    model.eval()
    with torch.no_grad():
        output = model(sent, offset)
    # Cap the raw model output at 1.0 before rounding
    score = min(output.item(), 1.0)
    return round(score, 4)

In [182]:
# Score the paraphrased sentences and keep the lowest-scoring candidate of each set
predicted = []
pred_scores = []
for set5 in paraphrased:
    best_score = 1.1
    best = -1
    for i, sent in enumerate(set5):
        tokenized = text_to_tensor(sent)
        offset = get_offset(tokenized)
        score = predict(metric_model, tokenized, offset)
        if score < best_score:
            best_score = score
            best = i
    predicted.append(set5[best])
    pred_scores.append(best_score)
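
The loop above keeps, for each set of five candidates, the one with the lowest predicted score. The same selection can be written with `min`. In this sketch `toy_score` is a stand-in for the `text_to_tensor`, `get_offset` and `predict` pipeline, and its word list is invented purely for the demonstration; `pick_least_toxic` is likewise a hypothetical name.

```python
def toy_score(sentence):
    """Stand-in scorer: fraction of words found in a made-up flag list."""
    flagged = {"junk", "trash"}
    words = sentence.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)


def pick_least_toxic(candidates, score_fn):
    """Return (best_candidate, best_score); ties go to the earliest candidate."""
    scored = [(score_fn(c), i) for i, c in enumerate(candidates)]
    best_score, best_idx = min(scored)
    return candidates[best_idx], best_score


candidates = ["this is junk", "this is trash and junk", "this is fine"]
best, score = pick_least_toxic(candidates, toy_score)
print(best, score)
```

Pairing each score with its index keeps the tie-breaking identical to the strict `<` comparison in the loop, which also prefers the earliest candidate.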

In [183]:
print(predicted)

['Is it okay to smash it myself ill?', 'What are the tits?', 'Hes fucking captain america but hes not a bad guy.', 'If he gets another chance im dead.', 'Basically these two are real criminals and basically they are the same.', 'How dare you shoot at my wife?', 'No wonder kurtz put a weed up commands ass no wonder.', "So these boobs are so good that they don't need to be.", 'This is junk again near california state university long beach this is trash in the past.', 'I cant go around telling guys not to ask amy out because i like her and im too dumb to do anything about it.']


In [184]:
input_score = list(test_sample['Tox score'])
inputs = list(test_sample['Toxic'])

In [185]:
# Look at the scores
scores = pd.DataFrame(list(zip(inputs, input_score, pred_scores, predicted)), index=None, columns=['Toxic style', 'Before', 'After', 'Translation'])
scores

Unnamed: 0,Toxic style,Before,After,Translation
0,ill ill smash it myself,0.971542,0.5256,Is it okay to smash it myself ill?
1,the tits,0.984656,0.0423,What are the tits?
2,but hes fucking captain america,0.998577,0.0686,Hes fucking captain america but hes not a bad ...
3,if he gets another chance im dead,0.96346,0.2158,If he gets another chance im dead.
4,basically these two are real criminals,0.986255,0.9131,Basically these two are real criminals and bas...
5,how dare you shoot at my wife,0.978353,0.1601,How dare you shoot at my wife?
6,no wonder kurtz put a weed up commands ass,0.996818,0.0491,No wonder kurtz put a weed up commands ass no ...
7,so these boobs,0.991037,0.0404,So these boobs are so good that they don't nee...
8,this is junk again near california state unive...,0.971545,0.3608,This is junk again near california state unive...
9,i cant go around telling guys not to ask amy o...,0.980846,0.6521,I cant go around telling guys not to ask amy o...
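
One way to summarize the table in a single number is to compare the mean score before and after paraphrasing. The values below are copied from the `Before` and `After` columns above; with the `scores` DataFrame in memory, `scores[['Before', 'After']].mean()` should give the same figures. The sample holds only ten sentences, so the result is indicative at best.

```python
from statistics import mean

# Values copied from the "Before" and "After" columns of the table above
before = [0.971542, 0.984656, 0.998577, 0.96346, 0.986255,
          0.978353, 0.996818, 0.991037, 0.971545, 0.980846]
after = [0.5256, 0.0423, 0.0686, 0.2158, 0.9131,
         0.1601, 0.0491, 0.0404, 0.3608, 0.6521]

mean_before = mean(before)
mean_after = mean(after)
# Number of sentences whose score dropped after paraphrasing
n_lower = sum(a < b for a, b in zip(after, before))

print(round(mean_before, 4), round(mean_after, 4), n_lower)
```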
