# Environment Setup
We begin by installing the libraries this notebook relies on:

1. HuggingFace Transformers: provides the CodeT5+ (CodeT5p) model and its tokenizer.
2. HuggingFace Datasets: loads and preprocesses the dataset.
3. PyTorch Lightning: runs the training loop.
4. Weights and Biases: logs the training metrics.

In [1]:
!pip install -q datasets transformers

In [2]:
!pip install -q pytorch-lightning

In [3]:
from datasets import load_dataset

# Preprocessing Data

We load the "code_to_text" subset of the CodeXGLUE dataset, restricted to its Python examples.

In [4]:
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 251820
    })
    validation: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 13914
    })
    test: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 14918
    })
})

In [6]:
example = dataset["train"][0]

In [7]:
example

{'id': 0,
 'repo': 'ageitgey/face_recognition',
 'path': 'examples/face_recognition_knn.py',
 'func_name': 'train',
 'original_string': 'def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo=\'ball_tree\', verbose=False):\n    """\n    Trains a k-nearest neighbors classifier for face recognition.\n\n    :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n     (View in source code to see train_dir example tree structure)\n\n     Structure:\n        <train_dir>/\n        ├── <person1>/\n        │   ├── <somename1>.jpeg\n        │   ├── <somename2>.jpeg\n        │   ├── ...\n        ├── <person2>/\n        │   ├── <somename1>.jpeg\n        │   └── <somename2>.jpeg\n        └── ...\n\n    :param model_save_path: (optional) path to save model on disk\n    :param n_neighbors: (optional) number of neighbors to weigh in classification. Chosen automatically if not specified\n    :param knn_algo: (optional) underlying data structure 

In [8]:
example = dataset['train'][0]

print("Code:", example["code"])
print("Docstring:", example["docstring"])

Code: def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):
    """
    Trains a k-nearest neighbors classifier for face recognition.

    :param train_dir: directory that contains a sub-directory for each known person, with its name.

     (View in source code to see train_dir example tree structure)

     Structure:
        <train_dir>/
        ├── <person1>/
        │   ├── <somename1>.jpeg
        │   ├── <somename2>.jpeg
        │   ├── ...
        ├── <person2>/
        │   ├── <somename1>.jpeg
        │   └── <somename2>.jpeg
        └── ...

    :param model_save_path: (optional) path to save model on disk
    :param n_neighbors: (optional) number of neighbors to weigh in classification. Chosen automatically if not specified
    :param knn_algo: (optional) underlying data structure to support knn.default is ball_tree
    :param verbose: verbosity of training
    :return: returns knn classifier that was trained on the given data.
    

In [9]:
import torch

In [10]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-2b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              trust_remote_code=True).to(device)

encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-2b:
- configuration_codet5p.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-2b:
- modeling_codet5p.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


def print_hello_world():
    print("Hello world!"


In [11]:

test_example = dataset['test'][2]
print("Code:", test_example['code'])

Code: def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):
    """Downloads Sina videos by URL.
    """
    if 'news.sina.com.cn/zxt' in url:
        sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)
        return

    vid = match1(url, r'vid=(\d+)')
    if vid is None:
        video_page = get_content(url)
        vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')
        if hd_vid == '0':
            vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')
            vid = vids[-1]

    if vid is None:
        vid = match1(video_page, r'vid:"?(\d+)"?')
    if vid:
        #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')
        sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)
    else:
        vkey = match1(video_page, r'vkey\s*:\s*"([^"]+)"')
        if vkey is None:
            vid = match1(url, r'#(\d+)')
            sina_download_by_vid(vid, output_dir=out

In [17]:
# prepare for the model
input_ids = tokenizer(test_example["code"], return_tensors='pt').input_ids
input_ids  = input_ids.to(device)
# generate
outputs = model.generate(input_ids)
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated docstring: 
        return

    if vid:
        sina_download_by_vid


In [19]:
# prepare for the model
input_ids = tokenizer("encoded_text = tokenizer.encode(example_text, return_tensors='pt')", return_tensors='pt').input_ids
input_ids  = input_ids.to(device)
# generate
outputs = model.generate(input_ids)
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated docstring:   # type: ignore
tokenizer.tokenize(example_text)
tokenizer


In [21]:
# prepare for the model
# Tokenize the code
encoding = tokenizer("encoded_text = tokenizer.encode(example_text, return_tensors='pt')", return_tensors='pt', padding='max_length', truncation=True, max_length=512)

# Get input_ids and attention_mask from the encoding
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

# generate
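# Use the tokenizer's EOS id directly so this cell does not depend on a variable defined in another cell
outputs = model.generate(input_ids, attention_mask=attention_mask, pad_token_id=tokenizer.eos_token_id)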
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Generated docstring:   # type: ignore
tokenizer.tokenize(example_text)
tokenizer


In [13]:
eos_token_id = tokenizer.eos_token_id
pad_token_id = eos_token_id

In [15]:
# prepare for the model
# Tokenize the code
encoding = tokenizer(test_example['code'], return_tensors='pt', padding='max_length', truncation=True, max_length=512)

# Get input_ids and attention_mask from the encoding
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

# generate
outputs = model.generate(input_ids, attention_mask=attention_mask, pad_token_id=eos_token_id)
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))


Generated docstring: 
        return

    if vid:
        sina_download_by_vid


**Objective:** Train a model that generates a docstring for a given piece of code.

**Preparing Code-Docstring Pairs:**

1. **Tokenization:** Transformer models (such as BERT, BART, and T5) take integers rather than raw text as input. These integers, called `input_ids` in HuggingFace Transformers, are the indices of tokens in the model's vocabulary.

2. **Contextual Embedding Vectors:** The model learns a rich embedding vector for each token, which helps it produce quality results.

3. **Conversion to input_ids:** Both the code and the docstring must be converted to `input_ids`; the former becomes the model's input and the latter serves as the labels.

4. **Padding and Truncation:** Models are trained on batches, so all inputs in a batch must share one length, and likewise all label sequences. Short sequences are padded and long ones are truncated.

5. **Attention Mask:** An `attention_mask` is added as well, so that padding tokens are ignored when attention scores are computed.

6. **Preprocessing Function:** Finally, a `preprocess_examples` function is defined so that the entire dataset can be processed according to these requirements.
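
To make the padding, truncation, and attention-mask steps above concrete, here is a toy sketch in plain Python. It does not use the real tokenizer: the helper name `pad_or_truncate`, the token ids, and the padding id `0` are all made up for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration; a real tokenizer does this internally.
    """
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad  # padding: fill up to max_length
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 8, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 8, 13, 21, 34, 55], max_length=5)

print(short_ids, short_mask)  # [5, 8, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [5, 8, 13, 21, 34] [1, 1, 1, 1, 1]
```

Both sequences end up with length 5, so they can be stacked into one batch, and the mask records which positions hold real tokens.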

### To summarize:

1. **Input:** the code, which is turned into `input_ids` and an `attention_mask`.
2. **Output:** the docstring, which is turned into `labels` (the `input_ids` of the docstring).
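
The `preprocess_examples` function itself is not shown in this section. What follows is a minimal sketch of what it could look like, not the notebook's actual implementation. It assumes the tokenizer is passed in as an argument and has a padding token; the parameter names `max_input_length` and `max_target_length` and their default values are hypothetical. It also assumes the common convention of setting padded label positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default.

```python
def preprocess_examples(examples, tokenizer, max_input_length=256, max_target_length=128):
    """Turn a batch of code/docstring pairs into model-ready features.

    Sketch only: the length limits are illustrative, not tuned values.
    """
    # Encode the code; this yields input_ids and attention_mask.
    model_inputs = tokenizer(
        examples["code"],
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
    )

    # Encode the docstrings; their input_ids become the labels.
    targets = tokenizer(
        examples["docstring"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )

    # Assumption: mark padded label positions with -100 so the loss skips them.
    # The attention mask (not the pad id) decides what counts as padding, which
    # stays correct even when the pad token is the same as the EOS token.
    model_inputs["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(targets["input_ids"], targets["attention_mask"])
    ]
    return model_inputs
```

With the `dataset` and `tokenizer` objects loaded earlier, one way to apply it to every split would be `dataset.map(preprocess_examples, batched=True, fn_kwargs={"tokenizer": tokenizer})`.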