# User Guide

STEP 1: Run all the cells below.

STEP 2: Define the text (paragraph) into which you want to insert the token <X>.


STEP 3: Run the function "predict(text)".

#### Example:

input : "In the second step of the IFM procedure, we made use of the Expectation--Maximisation algorithm of in order to deal with the markovian structure characterising the latent states. Further details about the employed estimation technique can be found in."

output : "In the second step of the IFM procedure, we made use of the Expectation--Maximisation algorithm of <X> in order to deal with the markovian structure characterising the latent states. Further details about the employed estimation technique can be found in <X>."

In [1]:
import os
import json
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
from tensorflow.keras.metrics import Precision, Recall

In [2]:
# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

In [3]:
model_dir = "output/"

In [4]:
# Define F1 score metric

class F1Score(tf.keras.metrics.Metric):
    def __init__(self, name='f1_score', **kwargs):
        super(F1Score, self).__init__(name=name, **kwargs)
        self.precision = Precision()
        self.recall = Recall()

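    def update_state(self, y_true, y_pred, sample_weight=None):
        # Forward sample_weight so weighted evaluation is not silently ignored
        self.precision.update_state(y_true, y_pred, sample_weight)
        self.recall.update_state(y_true, y_pred, sample_weight)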

    def result(self):
        precision_result = self.precision.result()
        recall_result = self.recall.result()
        return 2 * ((precision_result * recall_result) / (precision_result + recall_result + tf.keras.backend.epsilon()))

    def reset_state(self):
        self.precision.reset_state()
        self.recall.reset_state()
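
The F1 formula used in `result()` can be checked by hand with plain numbers. The following is a minimal sketch, not the Keras implementation: the helper name `f1_from_counts` and the counts are illustrative assumptions, and `eps` stands in for `tf.keras.backend.epsilon()` (1e-7 by default).

```python
def f1_from_counts(tp, fp, fn, eps=1e-7):
    """Harmonic mean of precision and recall; eps guards against 0/0."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Illustrative counts: 8 true positives, 2 false positives, 2 false negatives
score = f1_from_counts(tp=8, fp=2, fn=2)
print(round(score, 3))  # 0.8

# With no positives at all, the epsilon keeps the result at 0 instead of raising
empty_score = f1_from_counts(tp=0, fp=0, fn=0)
print(empty_score)  # 0.0
```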

In [5]:
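def break_paragraph(paragraph):
    # Split a paragraph into sentences on '. ' and restore each final period
    sentences_list = []
    split_list = paragraph.split('. ')
    for sentence in split_list:
        # Skip empty pieces (e.g. when the text ends with '. '); indexing
        # sentence[-1] on an empty string would raise IndexError
        if not sentence.strip():
            continue
        if sentence[-1] == '.':
            sentences_list.append(sentence)
        else:
            sentences_list.append(sentence + '.')
    return sentences_list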
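
Splitting on `'. '` is simple but treats every period followed by a space as a sentence boundary. The sketch below uses a compact stand-in, `split_on_period_space` (a hypothetical name, written to follow the same rule as `break_paragraph`), to show what happens with an abbreviation such as "Fig.".

```python
def split_on_period_space(paragraph):
    # Hypothetical stand-in following the same rule as break_paragraph:
    # split on ". " and put back the period that the split removed.
    pieces = [p for p in paragraph.split('. ') if p]
    return [p if p.endswith('.') else p + '.' for p in pieces]

sentences = split_on_period_space("See Fig. 2 for details. The fit is good.")
print(sentences)  # ['See Fig.', '2 for details.', 'The fit is good.']
```

Because the abbreviation is cut into its own "sentence", input text with many abbreviations may need to be pre-processed before calling `predict`.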

In [122]:
def clean_spaces(sentence):
    cleaned_sentence = sentence.replace("( ", "(").replace(" )", ")").replace(" -", "-").replace("- ", "-").replace(" !", "!").replace(" :", ":")
    return cleaned_sentence
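
To see what `clean_spaces` does, it can be run on strings that resemble decoded tokenizer output, where punctuation is often surrounded by spaces. The inputs below are made up for illustration; the function is repeated so the snippet runs on its own.

```python
def clean_spaces(sentence):
    # Same replacement chain as the cell above, repeated for a standalone run
    return (sentence.replace("( ", "(").replace(" )", ")")
                    .replace(" -", "-").replace("- ", "-")
                    .replace(" !", "!").replace(" :", ":"))

dashes = clean_spaces("expectation - - maximisation")
print(dashes)  # expectation--maximisation

brackets = clean_spaces("top - 1 results ( 75m parameters )")
print(brackets)  # top-1 results (75m parameters)
```

Note that every space next to a hyphen is removed, so a spaced dash such as "a - b" also becomes "a-b".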

In [146]:
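def format_sentence(org_sentence, sentence):
    # Re-insert the predicted <X> markers into the original sentence so that
    # its casing and punctuation are preserved. Both arguments are expected
    # to end with a period, which is stripped here and re-added below.
    tags = org_sentence[:-1].split(' ')
    for index, word in enumerate(sentence[:-1].split(' ')):
        if "<x>" == word:
            tags.insert(index, "<X>")
        elif "<x>," == word:
            # Move the comma from the preceding word onto the marker
            tags[index-1] = tags[index-1][:-1]
            tags.insert(index, "<X>,")
        elif "<x>:" == word:
            # Move the colon from the preceding word onto the marker
            tags[index-1] = tags[index-1][:-1]
            tags.insert(index, "<X>:")

    return " ".join(tags)+'.'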
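
`format_sentence` aligns the lowercased, decoded sentence with the original one word by word and copies the markers across. The sketch below repeats the function so it runs standalone; the two sentence pairs are made-up inputs, assuming the decoded text has the same words as the original apart from the `<x>` markers.

```python
def format_sentence(org_sentence, sentence):
    # Copy of the function above: walk the decoded words and insert the
    # marker at the same word position in the original sentence.
    tags = org_sentence[:-1].split(' ')
    for index, word in enumerate(sentence[:-1].split(' ')):
        if word == "<x>":
            tags.insert(index, "<X>")
        elif word == "<x>,":
            tags[index - 1] = tags[index - 1][:-1]  # drop the comma from the previous word
            tags.insert(index, "<X>,")
        elif word == "<x>:":
            tags[index - 1] = tags[index - 1][:-1]  # drop the colon from the previous word
            tags.insert(index, "<X>:")
    return " ".join(tags) + '.'

plain = format_sentence("The results in give the bound.",
                        "the results in <x> give the bound.")
print(plain)  # The results in <X> give the bound.

comma = format_sentence("While networks work well, they fail.",
                        "while networks work well <x>, they fail.")
print(comma)  # While networks work well <X>, they fail.
```

The alignment is purely positional, so it only holds when both sentences split into the same sequence of space-separated words.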

In [168]:
# Use this function to insert the token <X> into your text

def predict(text, threshold=0.5, max_len=60):
    # Note: the model and tokenizer are reloaded on every call
    model = tf.keras.models.load_model(model_dir + 'model/token_insertion_model.h5',
                                       custom_objects={'TFBertModel': TFBertModel, 'F1Score': F1Score, 'Precision': Precision, 'Recall': Recall})
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_tokens("<X>")

    sentences_with_mask = []
    sentences_list = break_paragraph(text)

    for sentence in sentences_list:
        input_dict = tokenizer(sentence, max_length=max_len, padding='max_length', truncation=True, return_tensors="tf")
        input_ids = input_dict['input_ids']
        attention_mask = input_dict['attention_mask']
        insertion_points = []
        
        prediction = model.predict([input_ids, attention_mask])
    
        binary_predictions = [1 if pred > threshold else 0 for pred in prediction[0]]
        
        for i, pred in enumerate(binary_predictions):
            if pred == 1:
                insertion_points.append(i+1)

        ids_with_mask = np.insert(input_ids, insertion_points, tokenizer.encode("<X>", add_special_tokens=False))
        decoded_sentence = tokenizer.decode(ids_with_mask, skip_special_tokens=True)
        cleaned_sentence = clean_spaces(decoded_sentence)
        formatted_sentence = format_sentence(sentence, cleaned_sentence)
        sentences_with_mask.append(formatted_sentence)
        
    output = " ".join(sentences_with_mask)
    
    return output
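
The thresholding and insertion step inside `predict` can be illustrated without the model. This is a pure-Python sketch under stated assumptions: `insert_marker` is a hypothetical helper, and the token ids, scores, and marker id are made up. It follows the same convention as `np.insert` with a list of indices, where each index refers to a position in the original sequence.

```python
def insert_marker(token_ids, scores, marker_id, threshold=0.5):
    # A score above the threshold at position i means "insert after token i",
    # i.e. before original index i + 1.
    points = [i + 1 for i, s in enumerate(scores) if s > threshold]
    out = list(token_ids)
    # Insert from the right so that earlier indices are not shifted
    for p in reversed(points):
        out.insert(p, marker_id)
    return out

ids = [101, 2179, 1999, 1012, 102]   # made-up token ids
scores = [0.1, 0.2, 0.9, 0.3, 0.0]   # made-up per-token probabilities
result = insert_marker(ids, scores, marker_id=30522)  # illustrative marker id
print(result)  # [101, 2179, 1999, 30522, 1012, 102]
```

Raising `threshold` makes the insertion more conservative; lowering it yields more markers.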


## Change "text" as needed, then run the function "predict(text)"

In [169]:
text_1 = '''In the second step of the IFM procedure, we made use of the Expectation--Maximisation algorithm of in order to deal with the markovian 
structure characterising the latent states. Further details about the employed estimation technique can be found in.'''

In [170]:
print(predict(text_1))

In the second step of the IFM procedure, we made use of the Expectation--Maximisation algorithm of <X> in order to deal with the markovian 
structure characterising the latent states. Further details about the employed estimation technique can be found in <X>.


In [171]:
text_2 = 'It is also worth noticing that the results in give the Hausdorff dimension of.'

In [172]:
print(predict(text_2))

It is also worth noticing that the results in <X> give the Hausdorff dimension of.


In [173]:
text_3 = '''While convolutional neural networks have resulted in many practical successes, they can be highly susceptible to adversarial examples. 
In one extreme case, the change of a single pixel within the input image can with high confidence change the output prediction of the network'''

In [174]:
print(predict(text_3))

While convolutional neural networks have resulted in many practical successes <X>, they can be highly susceptible to adversarial examples. 
In one extreme case, the change of a single pixel within the input image can with high confidence change the output prediction of the network <X>.


In [175]:
text_4 = '''We now discuss the junction conditions needed for the numerical integration of the oscillations' equations when a sharp 
interface due to a first order phase transition takes place inside a hybrid compact star. Such conditions are intrinsically related 
to the velocity of the phase transition near the surface splitting any two phases (see for further details).'''

In [176]:
print(predict(text_4))

We now discuss the junction conditions needed for the numerical integration of the oscillations' equations when a sharp 
interface due to a first order phase transition takes place inside a hybrid compact <X> star. Such conditions are intrinsically related 
to the velocity of the phase transition near the surface splitting any two phases (see <X> for further details).


In [177]:
text_5 = '''We evaluate CODEFUSION on NL-to-code for three different languages: Python, Bash, and conditional formatting rules in Microsoft Excel. 
Our results show that CODEFUSION's (75M parameters) top-1 results are comparable or better than much larger state-of-the-art systems 
(350M-175B parameters). In top-3 and top-5, CODEFUSION performs better than all baselines.'''

In [178]:
print(predict(text_5))

We evaluate CODEFUSION on NL-to-code for three different languages: Python <X>, Bash <X>, and conditional formatting rules in Microsoft Excel <X>. 
Our results show that CODEFUSION's (75M parameters) top-1 results are comparable or better than much larger state-of-the-art systems <X> 
(350M-175B parameters). In top-3 and top-5, CODEFUSION <X> performs better than all baselines.
