<a href="https://colab.research.google.com/github/LorenzoCorbinelli/MLSA-project/blob/main/Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers gdown tabulate



In [2]:
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
from tabulate import tabulate

In [3]:
# Download the model and tokenizer files from Google Drive into ./Model
!gdown --folder "https://drive.google.com/drive/folders/1-14DZR-ds0AZgeQqKtzTf5hNyGkNHj-1"

Retrieving folder contents
Processing file 1-3Gi1RfLXRr23mDMeHQD-zn-KIWhBSUc config.json
Processing file 1-8ySYaaguBwz9PKym5KRprAuSxETG4fC merges.txt
Processing file 1-5aTG7SM33yj8blDKpEFKA3Xw1ki-N0i model.safetensors
Processing file 1-JyQTIxMkpEomU39cGXA6Zubj-p8mdZs special_tokens_map.json
Processing file 1-MV-LuW38tPsqWorIAPTGsTTo_RdGx8O tokenizer_config.json
Processing file 1-EyCic4FIFKUhS_Glk_nvpMDexILo15i vocab.json
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1-3Gi1RfLXRr23mDMeHQD-zn-KIWhBSUc
To: /content/Model/config.json
100% 710/710 [00:00<00:00, 4.34MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-8ySYaaguBwz9PKym5KRprAuSxETG4fC
To: /content/Model/merges.txt
100% 456k/456k [00:00<00:00, 115MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1-5aTG7SM33yj8blDKpEFKA3Xw1ki-N0i
From (redirected): https://drive.google.com/uc?id=1-5aTG7SM33yj8b
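
Before loading the model, it may be worth confirming that the download produced every file named in the folder listing above. The following is a minimal sketch and not part of the original notebook: the helper name `missing_model_files` is hypothetical, and the expected file names are copied from the `gdown` output.

```python
from pathlib import Path

# File names copied from the gdown folder listing above.
EXPECTED_FILES = (
    "config.json",
    "merges.txt",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.json",
)


def missing_model_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names absent from `directory` (hypothetical helper)."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_model_files('/content/Model')` returns an empty list when all six files are present; any name it returns points to an incomplete download.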

In [4]:
# Load the model and its tokenizer from the downloaded folder
directory = '/content/Model'

model = RobertaForMaskedLM.from_pretrained(directory)
tokenizer = RobertaTokenizer.from_pretrained(directory)

In [5]:
def print_result(outputs):
    table_data = []
    for output in outputs:
        token_str = f'"{output["token_str"]}"'  # Preserve leading spaces by wrapping in quotes
        table_data.append([output['sequence'], token_str, output['score']])

    print("The suggested code completions are:")
    print(tabulate(table_data, headers=["Completion", "Predicted token", "Score"], tablefmt="grid", colalign=("left", "left", "center")) )
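
Wrapping a token in plain quotes preserves leading spaces, but a token that is itself a line break still splits its table cell across two rows. One possible alternative, sketched here as an assumption rather than as the notebook's approach, is to format the token with `repr()`, which escapes control characters. The helper name `visible_token` is hypothetical.

```python
def visible_token(token_str):
    """Return a quoted, escaped form of a token so whitespace stays visible (hypothetical helper)."""
    # repr() keeps leading spaces and escapes control characters such as a newline.
    return repr(token_str)


for token in [" in", "0", "\n"]:
    print(visible_token(token))
```

Note that `repr()` uses single quotes by default, so swapping it into `print_result` would change the look of the "Predicted token" column.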

In [6]:
def code_completion(code_example, iterations: int = 1):
    '''
    - code_example: snippet of code to complete. Do not include a <mask> token; it is appended automatically.
    - iterations: number of successive completions to generate.
                  Each completion after the first extends only the highest-scoring sequence from the previous step.
    Returns the predictions from the last iteration.
    '''
    if iterations < 1:
        # Without at least one iteration there would be no predictions to return
        raise ValueError("iterations must be at least 1")

    fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
    current_example = code_example + "<mask>"  # Mask the position right after the snippet

    for _ in range(iterations):
        outputs = fill_mask(current_example)
        print_result(outputs)

        # Take the top-scoring prediction and append <mask> to continue the completion
        best_prediction = outputs[0]["sequence"]
        current_example = best_prediction + " <mask>"
    return outputs
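
The control flow of `code_completion` can be followed without loading the model by swapping the pipeline for a stand-in. The sketch below is illustrative only: `fake_fill_mask` and `greedy_complete` are hypothetical names, the stub is not part of `transformers`, and its canned entries reuse sequences and scores from the `"for element "` run in this notebook. Real pipeline outputs also carry a `token_str` field and more candidates.

```python
def fake_fill_mask(text):
    """Hypothetical stand-in for the fill-mask pipeline, returning canned predictions."""
    # Candidates are listed by descending score, matching the notebook's tables.
    canned = {
        "for element <mask>": [
            {"sequence": "for element in", "score": 0.442312},
            {"sequence": "for element_", "score": 0.0344412},
        ],
        "for element in <mask>": [
            {"sequence": "for element in elements", "score": 0.466459},
            {"sequence": "for element in list", "score": 0.0745768},
        ],
    }
    return canned[text]


def greedy_complete(code_example, iterations=1, fill_mask=fake_fill_mask):
    """Extend `code_example` one token per iteration, always following the top prediction."""
    current = code_example + "<mask>"
    steps = []
    for _ in range(iterations):
        outputs = fill_mask(current)
        best = outputs[0]["sequence"]  # first candidate has the highest score
        steps.append(best)
        current = best + " <mask>"  # re-mask the end of the best sequence
    return steps


print(greedy_complete("for element ", 2))
```

Because only the top candidate is followed at each step, lower-ranked alternatives from earlier iterations are never explored; this is a greedy strategy, not a search over several candidates.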

In [7]:
result = code_completion("def is_zero(x): return x==")

Device set to use cuda:0


The suggested code completions are:
+--------------------------------+-------------------+------------+
| Completion                     | Predicted token   |   Score    |
+================================+===================+============+
| def is_zero(x): return x==0    | "0"               |  0.725383  |
+--------------------------------+-------------------+------------+
| def is_zero(x): return x== 0   | " 0"              |  0.178053  |
+--------------------------------+-------------------+------------+
| def is_zero(x): return x==x    | "x"               | 0.0419061  |
+--------------------------------+-------------------+------------+
| def is_zero(x): return x==zero | "zero"            |  0.011265  |
+--------------------------------+-------------------+------------+
| def is_zero(x): return x==1    | "1"               | 0.00624731 |
+--------------------------------+-------------------+------------+
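
Two of the candidates in this table, `"0"` and `" 0"`, differ only in leading whitespace and yield the same expression. A minimal sketch of one way to merge such variants follows, assuming it is acceptable to treat tokens as equivalent once whitespace is stripped (this does not hold in general, for example where indentation matters). The helper name `merge_whitespace_variants` is hypothetical, and the input reuses the tokens and scores from the table.

```python
from collections import defaultdict


def merge_whitespace_variants(outputs):
    """Sum scores of predictions whose tokens match after stripping whitespace (hypothetical helper)."""
    totals = defaultdict(float)
    for output in outputs:
        totals[output["token_str"].strip()] += output["score"]
    # Highest combined score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Tokens and scores copied from the table for "def is_zero(x): return x==".
outputs = [
    {"token_str": "0", "score": 0.725383},
    {"token_str": " 0", "score": 0.178053},
    {"token_str": "x", "score": 0.0419061},
    {"token_str": "zero", "score": 0.011265},
    {"token_str": "1", "score": 0.00624731},
]

for token, score in merge_whitespace_variants(outputs):
    print(f"{token!r}: {score:.6f}")
```

With these inputs the two `0` variants combine to a score of about 0.903, which shows that the model's preference for `0` is stronger than the top row alone suggests.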


In [8]:
result = code_completion("def add(a, b): return a+")

Device set to use cuda:0


The suggested code completions are:
+----------------------------+-------------------+------------+
| Completion                 | Predicted token   |   Score    |
+============================+===================+============+
| def add(a, b): return a+b  | "b"               |   0.9766   |
+----------------------------+-------------------+------------+
| def add(a, b): return a+a  | "a"               | 0.00803229 |
+----------------------------+-------------------+------------+
| def add(a, b): return a+ b | " b"              | 0.00514298 |
+----------------------------+-------------------+------------+
| def add(a, b): return a+1  | "1"               | 0.00164974 |
+----------------------------+-------------------+------------+
| def add(a, b): return a+2  | "2"               | 0.00139649 |
+----------------------------+-------------------+------------+


In [9]:
result = code_completion("def add(a, b): return a", 2)

Device set to use cuda:0


The suggested code completions are:
+---------------------------+-------------------+-----------+
| Completion                | Predicted token   |   Score   |
+===========================+===================+===========+
| def add(a, b): return a + | " +"              | 0.788156  |
+---------------------------+-------------------+-----------+
| def add(a, b): return a - | " -"              | 0.0473608 |
+---------------------------+-------------------+-----------+
| def add(a, b): return a   | "                 | 0.0304668 |
|                           | "                 |           |
+---------------------------+-------------------+-----------+
| def add(a, b): return a.  | "."               |  0.02006  |
+---------------------------+-------------------+-----------+
| def add(a, b): return a,  | ","               | 0.020017  |
+---------------------------+-------------------+-----------+
The suggested code completions are:
+-----------------------------+-------------------+-------------+
| Completion                  | Predicted token   |    S

In [10]:
result = code_completion("for element ", 2)

Device set to use cuda:0


The suggested code completions are:
+----------------+-------------------+-----------+
| Completion     | Predicted token   |   Score   |
+================+===================+===========+
| for element in | " in"             | 0.442312  |
+----------------+-------------------+-----------+
| for element_   | "_"               | 0.0344412 |
+----------------+-------------------+-----------+
| for elementIn  | "In"              | 0.0282872 |
+----------------+-------------------+-----------+
| for elementin  | "in"              | 0.0220934 |
+----------------+-------------------+-----------+
| for element()  | "()"              | 0.0185084 |
+----------------+-------------------+-----------+
The suggested code completions are:
+-------------------------+-------------------+-----------+
| Completion              | Predicted token   |   Score   |
+=========================+===================+===========+
| for element in elements | " elements"       | 0.466459  |
+-------------------------+-------------------+-----------+
| for element in list     | " list"           | 0.0745768 |
+---------------