<a href="https://colab.research.google.com/github/Akash-an/LLM/blob/master/code_completion_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Code Completion

### **[Hugging Face Tutorial](https://huggingface.co/learn/nlp-course/chapter7/6?fw=tf)<br>**
#### In this notebook,
> I trained a transformer model and pushed it to the Hugging Face Hub,<br>
so we can load the model directly from the Hub and look at its predictions.<br>
The model completes Python code (mostly NumPy, pandas, and matplotlib).<br>
Tokenizing the dataset and training the model take a long time, so there is no need to repeat them.

#### Notebook Contents
> The first few cells are enough for a demo. The rest is the time-consuming training code; use it for reference only, as it is not worth reproducing. <br> Tokenization alone takes about 1.5 hours and produces a massive dataset. I stored the tokenized dataset on my Drive and reused it for further training; it will most likely be removed from my Drive at some point.<br>
Training on about 0.6% of the tokenized dataset (100_000 samples) takes about 1 hour. The whole dataset would take about 180 hours.
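
As a back-of-the-envelope check of these estimates, the sketch below computes what share of the tokenized training set a 100,000-sample run covers and how long one full pass would take. It assumes training time scales linearly with the number of samples, which is only an approximation; the row count is the one reported for the tokenized training split later in this notebook.

```python
# Figures taken from this notebook; linear scaling is an assumption.
TOTAL_TRAIN_CHUNKS = 16_702_061   # rows in the tokenized training split
SUBSET_SIZE = 100_000             # rows used in one training run
HOURS_PER_SUBSET = 1.0            # approximate time reported for the subset

# Share of the training data covered by one subset run
fraction = SUBSET_SIZE / TOTAL_TRAIN_CHUNKS

# Time for one pass over everything, if time grows linearly with rows
full_run_hours = HOURS_PER_SUBSET * TOTAL_TRAIN_CHUNKS / SUBSET_SIZE

print(f"subset share of training data: {fraction:.2%}")
print(f"estimated hours for one full pass: {full_run_hours:.0f}")
```

The result lands in the same range as the rough full-dataset figure quoted above.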

## Demo

In [None]:
%%capture
!pip install transformers
!pip install datasets

In [None]:
from transformers import pipeline
from transformers import TFGPT2LMHeadModel, AutoConfig
from transformers import AutoTokenizer

# Load the trained model and its tokenizer from the Hub
course_model = TFGPT2LMHeadModel.from_pretrained("akash0/py-code-complete")
course_tokenizer = AutoTokenizer.from_pretrained("akash0/py-code-complete")

# device=0 runs generation on the first GPU
pipe = pipeline(
    "text-generation", model=course_model, tokenizer=course_tokenizer, device=0
)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at akash0/py-code-complete.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [None]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
n_samples = X.astype([20, 3], dtype=


In [None]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
x = (x, yy)
y = np.


## Training

### Data loading and Tokenizing

In [None]:
# Load the raw dataset from the Hub (takes ~5 mins)
from datasets import load_dataset, DatasetDict

ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split='train')
ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split='validation')

raw_datasets = DatasetDict(
    {
        "train": ds_train,
        "valid": ds_valid
    }
)

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 606720
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 3322
    })
})

In [None]:
raw_datasets['train'][0]

In [None]:
raw_datasets["train"].column_names

['repo_name', 'path', 'copies', 'size', 'content', 'license']

In [None]:
# Load tokenizer from the hub (doesn't take a lot of time)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
context_length = 128


In [None]:
outputs = tokenizer(
    raw_datasets['train'][:2]['content'],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True
)

In [None]:
outputs

In [None]:
# Tokenization. Takes about 1.5 hours.

def tokenize(elements):
    # Split each file into chunks of at most context_length tokens
    outputs = tokenizer(
        elements['content'],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True
    )
    input_batch = []
    # Keep only full-length chunks; the shorter tail of each file is dropped
    for input_ids, length in zip(outputs['input_ids'], outputs['length']):
        if length == context_length:
            input_batch.append(input_ids)

    return {'input_ids': input_batch}

tokenized_dataset = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
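
To make the chunking step concrete, here is a minimal pure-Python sketch of what the `tokenize` function does to a single file. It is not the tokenizer's implementation: `chunk_token_ids` is a hypothetical helper, and it assumes non-overlapping windows (no stride) over token IDs that have already been produced.

```python
def chunk_token_ids(token_ids, context_length):
    """Split one tokenized file into full-length windows, dropping the remainder."""
    chunks = []
    for start in range(0, len(token_ids), context_length):
        window = token_ids[start:start + context_length]
        # Mirror the `length == context_length` filter: short tails are discarded
        if len(window) == context_length:
            chunks.append(window)
    return chunks

# Toy example: 10 token IDs, windows of 4 -> two full windows, tail of 2 dropped
toy_ids = list(range(10))
chunks = chunk_token_ids(toy_ids, context_length=4)
print(chunks)
```

This is why the tokenized dataset has far more rows than the raw one: every source file contributes one row per full 128-token window.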

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Save the tokenized dataset to Drive so that we can reuse it
tokenized_dataset.save_to_disk('/content/drive/MyDrive/py-tokenized')

In [None]:
from datasets import load_from_disk

In [None]:
# Load the tokenized dataset from Drive if you want to continue training the model.
tokenized_dataset = load_from_disk("/content/drive/MyDrive/py-tokenized")

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16702061
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 93164
    })
})

In [None]:
# Random subsets of the data. Note: despite the key names, these hold
# 100,000 training rows and 10,000 validation rows.
tokenized_dataset['train_10000'] = tokenized_dataset['train'].shuffle().select(range(100000))
tokenized_dataset['valid_1000'] = tokenized_dataset['valid'].shuffle().select(range(10000))
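
The shuffle-then-select pattern above draws a different random subset on every run because no seed is given. The sketch below illustrates the same idea on plain row indices with a fixed seed; `shuffled_subset` is a hypothetical helper, not part of `datasets`. In the notebook itself, passing a `seed` argument to `shuffle()` should give the same kind of reproducibility.

```python
import random

def shuffled_subset(num_rows, subset_size, seed):
    """Return subset_size distinct row indices, drawn by shuffling then slicing."""
    rng = random.Random(seed)      # local generator, so the seed fully fixes the result
    indices = list(range(num_rows))
    rng.shuffle(indices)           # corresponds to .shuffle()
    return indices[:subset_size]   # corresponds to .select(range(subset_size))

first = shuffled_subset(num_rows=1_000, subset_size=10, seed=42)
again = shuffled_subset(num_rows=1_000, subset_size=10, seed=42)
print(first == again)
```

Fixing the seed matters when training is resumed across sessions, since otherwise each session trains on a different 100,000 rows.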

In [None]:
# Load the GPT-2 config (the tokenizer must already be defined)

from transformers import TFGPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    'gpt2',
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
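
For a sense of scale, the sketch below estimates the number of parameters a GPT-2 model of this shape has. Several values are assumptions rather than facts from this notebook: a vocabulary of 50,000 entries (the notebook only uses `len(tokenizer)`), the GPT-2 small defaults of 768 hidden units, 12 layers, and 1024 positions, an output layer that shares weights with the token embedding, and float32 storage. `gpt2_parameter_count` is a hypothetical helper.

```python
def gpt2_parameter_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count weights and biases of a GPT-2 style decoder with tied output embeddings."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # One layer norm has a scale and a bias vector
    layer_norm = 2 * n_embd
    # Attention: fused query/key/value projection, then output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # All blocks plus the final layer norm
    return embeddings + n_layer * block + layer_norm

ASSUMED_VOCAB_SIZE = 50_000  # assumption; the notebook uses len(tokenizer)
total = gpt2_parameter_count(ASSUMED_VOCAB_SIZE)
size_bytes = total * 4       # 4 bytes per float32 parameter

print(f"{total / 1e6:.1f}M parameters, about {size_bytes / 1e6:.0f} MB in float32")
```

Note that `n_ctx=context_length` limits the sequence length used for training but does not change the size of the position embedding table in this estimate.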

In [None]:
# To train from scratch, define the model for the first time:
#
#   model = TFGPT2LMHeadModel(config)
#   model(model.dummy_inputs)  # builds the model
#   model.summary()
#
# Here we instead continue training the model loaded from the Hub
# (course_model and course_tokenizer come from the Demo section).
model = course_model
tokenizer = course_tokenizer


In [None]:
# Data collator: batches examples, pads them, and builds labels for causal language modeling
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="tf")
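
The sketch below shows, in plain Python, roughly what a collator does for causal language modeling with `mlm=False`: pad each sequence to the longest one in the batch, mark real tokens in an attention mask, and copy the inputs into the labels with padding positions set to an ignore value. It is a simplified illustration, not the library's code; `collate_causal_lm` is a hypothetical helper, and the ignore value of -100 is the convention assumed here.

```python
IGNORE_INDEX = -100  # label value assumed to be skipped by the loss

def collate_causal_lm(sequences, pad_token_id):
    """Pad a batch of token-ID lists and derive labels from the inputs."""
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        padding = longest - len(seq)
        input_ids = list(seq) + [pad_token_id] * padding
        attention_mask = [1] * len(seq) + [0] * padding
        # Labels are the inputs themselves; positions holding the pad ID are ignored
        labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append(attention_mask)
        batch["labels"].append(labels)
    return batch

batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch)
```

Two details are worth keeping in mind. Because the pad token is set to the end-of-sequence token, genuine end-of-sequence tokens are masked in the labels as well. And since every chunk in this notebook is exactly 128 tokens long, no padding should actually occur during training; the shift between inputs and targets is left to the model.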

In [None]:
# Convert the datasets to tf.data.Dataset objects

tf_train_dataset = model.prepare_tf_dataset(
    tokenized_dataset['train_10000'],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16
)

# Validation data does not need to be shuffled
tf_valid_dataset = model.prepare_tf_dataset(
    tokenized_dataset['valid_1000'],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16
)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [None]:
#if you want to push the model to hub

from huggingface_hub import notebook_login

notebook_login()

In [None]:
# Prepare for training: define an optimizer with warmup and a decaying learning rate.

from transformers import create_optimizer
import tensorflow as tf

# Number of batches in the training set, i.e. the optimizer steps in one epoch
num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=100,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01
)
model.compile(optimizer=optimizer)
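
To visualize the kind of schedule these settings describe, here is a small sketch of a learning rate that warms up linearly and then decays linearly to zero. It assumes this is the shape `create_optimizer` produces with its default settings; the exact behaviour depends on the installed `transformers` version. `learning_rate` is a hypothetical helper, and the step count assumes 100,000 training rows at batch size 16.

```python
def learning_rate(step, init_lr, num_warmup_steps, num_train_steps):
    """Linear warmup to init_lr, then linear decay to zero (assumed schedule shape)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

INIT_LR = 5e-5
WARMUP_STEPS = 100
TRAIN_STEPS = 100_000 // 16  # 6250 batches in one epoch

for step in (0, 50, 100, 3175, TRAIN_STEPS):
    lr = learning_rate(step, INIT_LR, WARMUP_STEPS, TRAIN_STEPS)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

With only 100 warmup steps out of 6,250, almost the entire epoch is spent in the decay phase.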


In [None]:
#Training. Takes ~1hr for 100_000 samples
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(output_dir="py-code-complete", tokenizer=tokenizer)

model.fit(tf_train_dataset, validation_data=tf_valid_dataset, callbacks=[callback])

Cloning https://huggingface.co/akash0/py-code-complete into local empty directory.



In [None]:
# Check that the trained model works by building a generation pipeline from it.
from transformers import pipeline

pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device=0
)