<a href="https://colab.research.google.com/github/0xVolt/whats-up-doc/blob/main/src/experimental-notebooks/code_trans_t5_small_code_documentation_generation_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CodeTransT5-Small-ST-TF for Python

This notebook explores CodeTrans, a model based on the T5 architecture. This particular checkpoint uses the T5-Small skeleton and was trained on a single task (ST): generating documentation for Python code. The notebook first loads the model from Hugging Face and then attempts to fine-tune (FT) it on the [`python_code_instructions_18k_alpaca`](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca/viewer/default/train) dataset.

## TODO
- [ ] Get model fine-tuned
- [ ] Get predictions
- [ ] Compare scores
- [ ] Look into pre-processing dataset to improve results

---

In [1]:
%pip install -q --no-cache-dir transformers sentencepiece datasets


In [2]:
import tensorflow as tf
from transformers import AutoTokenizer, AutoModelWithLMHead, TFTrainer, TFTrainingArguments
import torch
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

In [3]:
dataset = load_dataset('iamtarun/python_code_instructions_18k_alpaca')
# dataset = load_dataset("flytech/llama-python-codes-30k")
dataset = dataset['train'].to_pandas()


In [4]:
dataset.columns

Index(['instruction', 'input', 'output', 'prompt'], dtype='object')

In [5]:
dataset.shape

(18612, 4)

In [6]:
dataset.head()

| | instruction | input | output | prompt |
|---|---|---|---|---|
| 0 | Create a function to calculate the sum of a se... | [1, 2, 3, 4, 5] | # Python code\ndef sum_sequence(sequence):\n ... | Below is an instruction that describes a task.... |
| 1 | Generate a Python code for crawling a website ... | website: www.example.com \ndata to crawl: phon... | import requests\nimport re\n\ndef crawl_websit... | Below is an instruction that describes a task.... |
| 2 | Create a Python list comprehension to get the ... | | [x*x for x in [1, 2, 3, 5, 8, 13]] | Below is an instruction that describes a task.... |
| 3 | Generate a python script to perform this action. | Given a string, remove all the consecutive dup... | def remove_duplicates(string): \n result = ... | Below is an instruction that describes a task.... |
| 4 | Write a python script to generates random numb... | | def generate_random_divisible_number():\n i... | Below is an instruction that describes a task.... |


In [7]:
y = list(dataset['instruction'])
y[:5]

['Create a function to calculate the sum of a sequence of integers.',
 'Generate a Python code for crawling a website for a specific type of data.',
 'Create a Python list comprehension to get the squared values of a list [1, 2, 3, 5, 8, 13].',
 'Generate a python script to perform this action.',
 'Write a python script to generates random numbers between 0 and 9 that are divisible by 3.']

In [8]:
X = list(dataset['output'])
X[5]

'def third_largest(lst):\n    if len(lst) < 3:\n        return\n    distinct = []\n    for i in lst:\n        if i not in distinct:\n            distinct.append(i)\n    distinct.sort(reverse=True)\n    return distinct[2]'

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=14)

In [10]:
len(X_train), len(X_test)

(12470, 6142)
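
The sizes above follow from `test_size=0.33` applied to the 18,612 rows. As a quick sanity check, the arithmetic can be reproduced with the standard library alone. This sketch assumes the test share is rounded up and the remainder goes to the training set, which matches the output above. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up, the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(18612, 0.33))  # (12470, 6142)
```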

In [11]:
X_train[5]

'import random \n  \n# Function to draw tic-tac-toe board \ndef drawBoard(board): \n    print("--- --- ---")\n    print("| " + board[0][0] + " | " + board[0][1] + " | " + board[0][2] + " |")\n    print("--- --- ---")\n    print("| " + board[1][0] + " | " + board[1][1] + " | " + board[1][2] + " |")\n    print("--- --- ---")\n    print("| " + board[2][0] + " | " + board[2][1] + " | " + board[2][2] + " |")\n    print("--- --- ---") \n  \n# Function to check if any player has won horizontally or vertically    \ndef checkWin(board): \n    # Win Horizontally \n    for x in range(0, 3): \n        if (board[x][0] == board[x][1] and board[x][1] == board[x][2] and board[x][0] != \' \'): \n            return board[x][0];'

In [12]:
class CodeTransForCDGPythonWrapper(tf.keras.Model):
    """Thin Keras wrapper that forwards tokenized inputs to the underlying CodeTrans model."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def call(self, inputs, training=None):
        # `inputs` must be a dict containing the 'input_ids' and 'attention_mask' keys.
        # No extra pre- or post-processing is done yet; the wrapper only forwards the call.
        outputs = self.model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'],
                             training=training)
        return outputs

    # Alternative call signatures, kept for reference:
    # def call(self, inputs, training=None):
    #     return self.model(inputs, training=training)
    #
    # def call(self, inputs, training=None, **kwargs):
    #     return self.model(inputs, training=training, **kwargs)

In [13]:
tokenizer = AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_small_code_documentation_generation_python",
                                          use_fast=False)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [14]:
modelMaxLength = 128
XTrainEncoded = tokenizer(X_train, truncation=True, max_length=modelMaxLength, padding=True)
XTestEncoded = tokenizer(X_test, truncation=True, max_length=modelMaxLength, padding=True)
yTrainEncoded = tokenizer(y_train, truncation=True, max_length=modelMaxLength, padding=True)
yTestEncoded = tokenizer(y_test, truncation=True, max_length=modelMaxLength, padding=True)
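
To make the effect of `truncation=True`, `max_length` and `padding=True` concrete, here is a minimal pure-Python sketch on made-up token IDs. It is a simplification rather than the tokenizer's actual implementation: it ignores special tokens such as the end-of-sequence marker, and `truncate_and_pad` is a hypothetical helper. The one detail taken from T5 is that the padding token ID is 0.

```python
PAD_ID = 0  # T5 tokenizers use 0 as the padding token ID


def truncate_and_pad(batch, max_length):
    """Mimic truncation=True, max_length=..., padding=True on lists of token IDs."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7, 8, 9, 11], [3, 4, 12]]  # made-up token IDs
encoded = truncate_and_pad(toy_batch, max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [3, 4, 12, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because padding goes only as far as the longest sequence in the batch, the padded length equals `max_length` only when at least one sequence reaches that limit.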

In [15]:
trainDataset = tf.data.Dataset.from_tensor_slices((
    dict(XTrainEncoded),
    dict(yTrainEncoded)
))

testDataset = tf.data.Dataset.from_tensor_slices((
    dict(XTestEncoded),
    dict(yTestEncoded)
))

In [16]:
trainDataset

<_TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(128,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(128,), dtype=tf.int32, name=None)}, {'input_ids': TensorSpec(shape=(128,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(128,), dtype=tf.int32, name=None)})>
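
The element spec above carries the targets as a second dictionary with its own `input_ids` and `attention_mask`. Hugging Face sequence-to-sequence models generally take the target token IDs under a `labels` key alongside the inputs, with padded positions set to -100 so the loss skips them. The sketch below shows that repackaging on plain Python lists. `build_seq2seq_features` is a hypothetical helper, and whether this layout resolves the training problem in this notebook has not been tested.

```python
IGNORE_INDEX = -100  # label value that Hugging Face loss functions skip
PAD_ID = 0           # T5 padding token ID


def build_seq2seq_features(source_enc, target_enc):
    """Merge source and target encodings into one feature dict with a 'labels' key."""
    # Replace padding in the targets so padded positions do not contribute to the loss.
    labels = [
        [tok if tok != PAD_ID else IGNORE_INDEX for tok in ids]
        for ids in target_enc["input_ids"]
    ]
    return {
        "input_ids": source_enc["input_ids"],
        "attention_mask": source_enc["attention_mask"],
        "labels": labels,
    }


# Made-up encodings standing in for the tokenizer output.
source = {"input_ids": [[5, 6, 7, 0]], "attention_mask": [[1, 1, 1, 0]]}
target = {"input_ids": [[8, 9, 0, 0]], "attention_mask": [[1, 1, 0, 0]]}
features = build_seq2seq_features(source, target)
print(features["labels"])  # [[8, 9, -100, -100]]
```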

In [21]:
trainingArguments = TFTrainingArguments(
    output_dir = './results',
    num_train_epochs = 2,
    evaluation_strategy = 'epoch',
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 8,
    warmup_steps = 100,
    weight_decay = 0.01,
    logging_dir = './logs',
    logging_steps = 1
)

In [22]:
with trainingArguments.strategy.scope():
    model = AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_small_code_documentation_generation_python")
    wrappedModel = CodeTransForCDGPythonWrapper(model)

In [23]:
trainer = TFTrainer(
    model=wrappedModel,
    args=trainingArguments,
    train_dataset=trainDataset,
    eval_dataset=testDataset
)

In [24]:
trainer.train()

TypeError: ignored

In [None]:
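results = trainer.evaluate(testDataset)
# No compute_metrics function was passed to the trainer, so the results
# contain no "accuracy" key; print whatever metrics were returned instead.
print(results)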
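
No accuracy metric is defined anywhere in this notebook. One simple candidate, assumed here purely for illustration, is token-level accuracy that skips padded positions. The sketch below works on plain lists of token IDs. `token_accuracy` is a hypothetical helper, and a text-generation metric such as BLEU or ROUGE may suit the "Compare scores" item better.

```python
def token_accuracy(predictions, references, pad_id=0):
    """Fraction of non-padding reference tokens matched position by position."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        for p, r in zip(pred, ref):
            if r == pad_id:
                continue  # padded positions do not count
            total += 1
            correct += int(p == r)
    return correct / total if total else 0.0


# Made-up token IDs: 4 of the 6 non-padding reference tokens match.
preds = [[5, 6, 7, 0], [3, 9, 1, 0]]
refs = [[5, 6, 8, 0], [3, 4, 1, 0]]
print(token_accuracy(preds, refs))
```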