In [7]:
# pip install -q transformers==4.29.2
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

In [8]:
checkpoint = "bigcode/tiny_starcoder_py"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)



# About Model
**tiny_starcoder_py** is built on the **GPTBigCodeForCausalLM** architecture, a decoder-only transformer designed for code generation.

It consists of:
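- Embedding layers: token embedding (**wte**, 49,152 entries) and positional embedding (**wpe**, 8,192 positions), both with 768-dimensional outputs.
- 20 Transformer blocks (**GPTBigCodeBlock**): Each block has layer normalization, an attention layer with dropout set to 0.1, and a feed-forward network (expanding to 3072 dimensions + GELU activation). The attention is multi-query rather than standard multi-head: `c_attn` outputs 896 = 768 + 2 × 64 features, which is consistent with full-width queries plus a single shared key/value head of width 64.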
- Final layer norm (**ln_f**): Applied after the transformer layers.
- Language modeling head (**lm_head**): Maps the 768-dimensional hidden state back to the vocabulary of 49,152 tokens.

In [3]:
model

GPTBigCodeForCausalLM(
  (transformer): GPTBigCodeModel(
    (wte): Embedding(49152, 768)
    (wpe): Embedding(8192, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-19): 20 x GPTBigCodeBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTBigCodeSdpaAttention(
          (c_attn): Linear(in_features=768, out_features=896, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTBigCodeMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): PytorchGELUTanh()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, el
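
As a sanity check on the architecture summary, the layer shapes in the printout above can be turned into a rough parameter count. This is a minimal sketch in plain Python, not something taken from the model's repository. The numbers are copied from the printed module, and it rests on two assumptions: the language modeling head shares its weights with `wte` (so it adds no parameters), and `ln_f` has the same shape as the other layer norms, since the printout is cut off at that line. The helpers `linear` and `layer_norm` are illustrative names.

```python
# Shapes copied from the printed GPTBigCodeForCausalLM module.
VOCAB, POSITIONS, HIDDEN, FFN, BLOCKS = 49152, 8192, 768, 3072, 20
C_ATTN_OUT = 896  # out_features of attn.c_attn


def linear(n_in, n_out):
    """Weights plus bias of a Linear layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Scale plus shift of a LayerNorm with elementwise_affine=True."""
    return 2 * dim


embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN  # wte + wpe
block = (
    layer_norm(HIDDEN)              # ln_1
    + linear(HIDDEN, C_ATTN_OUT)    # attn.c_attn
    + linear(HIDDEN, HIDDEN)        # attn.c_proj
    + layer_norm(HIDDEN)            # ln_2
    + linear(HIDDEN, FFN)           # mlp.c_fc
    + linear(FFN, HIDDEN)           # mlp.c_proj
)
# Assumption: lm_head is tied to wte and therefore contributes nothing extra.
total = embeddings + BLOCKS * block + layer_norm(HIDDEN)  # last term: ln_f

# Width left over in c_attn after the 768 query features, split into key and value.
kv_width = (C_ATTN_OUT - HIDDEN) // 2

print(f"parameters per block: {block:,}")
print(f"total parameters:     {total:,}")
print(f"key/value width:      {kv_width}")
```

Under these assumptions the total comes to roughly 164 million parameters, and the leftover width in `c_attn` is 64 for the key and 64 for the value.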

# Testing Model
Using the examples from the creator of the model ([repo](https://github.com/the-crypt-keeper/tiny_starcoder/tree/main)), we can either replicate them to verify that the model behaves as expected, or run our own completions.

### Choosing configuration parameters
We can either use greedy decoding (selecting the most probable token at each step) or one of the other decoding strategies, such as beam search, top-k sampling, or top-p sampling; see the [HuggingFace](https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation) generation documentation for the available options.

The model's creator recommends the following parameters for default generation tasks:
- *max_new_tokens* (128): Limits the number of tokens generated to 128.
- *temperature* (0.2): Controls randomness: lower values make the output more deterministic, while higher values increase creativity.
- *top_k* (50): Restricts sampling to the top 50 most probable tokens at each step, limiting less likely options.
- *top_p* (0.1): Samples from the smallest set of tokens whose cumulative probability reaches 10%. This is a very tight nucleus, so together with the low temperature it keeps generation close to greedy decoding.
- *repetition_penalty* (1.17): Penalizes repeated tokens to reduce loops and ensure variety in generated output.
- *do_sample* (True): Enables sampling, introducing randomness into the generation instead of always choosing the most probable token.

We will experiment with these parameters later.
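
To make the effect of these settings concrete, here is a small pure-Python sketch of how temperature, top-k, and top-p narrow a next-token distribution. The logits are made up for illustration, `filtered_distribution` is a hypothetical helper rather than part of `transformers`, and the repetition penalty is left out. The actual implementation behind `generate` differs in its details.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    # Sort candidates from most to least probable.
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
        kept_mass = sum(p for p, _ in probs)
        probs = [(p / kept_mass, i) for p, i in probs]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


toy_logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
print(filtered_distribution(toy_logits))  # plain softmax, all four tokens survive
print(filtered_distribution(toy_logits, temperature=0.2, top_k=50, top_p=0.1))
```

With the recommended settings, the most probable toy token already exceeds the 10% threshold on its own, so it is the only candidate left. On these toy logits, sampling therefore behaves like greedy decoding.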

In [4]:
# Sane hyper-parameters
params = {
    'max_new_tokens': 128,
    'temperature': 0.2,
    'top_k': 50,
    'top_p': 0.1,
    'repetition_penalty': 1.17,
    'do_sample': True
}

In [45]:
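def format_middle_output(text):
    """Reassemble a decoded FIM generation as prefix + generated middle + suffix."""
    # re.DOTALL lets each part span multiple lines, including the generated middle
    prefix = re.search(r'<fim_prefix>(.*)<fim_suffix>', text, re.DOTALL).group(1)
    suffix = re.search(r'<fim_suffix>(.*)<fim_middle>', text, re.DOTALL).group(1)
    output = re.search(r'<fim_middle>(.*)', text, re.DOTALL).group(1).replace('<|endoftext|>', '')
    return prefix + output + suffix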

In [36]:
# Prompt Style 1: Function Signature
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'Prompt Style 1: Function Signature\n\033[96m {tokenizer.decode(outputs[0])} \033[00m\n\n')

# Prompt Style 2: A comment
inputs = tokenizer.encode("# a python function that says hello\n", return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'Prompt Style 2: A comment\n\033[96m {tokenizer.decode(outputs[0])} \033[00m\n\n')

# Prompt Style 3: A docstring
inputs = tokenizer.encode("\"\"\" a python function that says hello \"\"\"\n", return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'Prompt Style 3: A docstring\n\033[96m {tokenizer.decode(outputs[0])} \033[00m\n\n')

# Prompt Style 4: [ADVANCED] Fill in the middle
input_text = "<fim_prefix>def print_one_two_three():\n    print('one')\n    <fim_suffix>\n    print('three')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'Prompt Style 4: [ADVANCED] Fill in the middle (w/o processing)\n\033[96m {tokenizer.decode(outputs[0])}\n \033[00m')
print(f'Prompt Style 4: [ADVANCED] Fill in the middle (Processing)\n\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

Prompt Style 1: Function Signature
def print_hello_world():
    """Prints hello world"""

    print("Hello World!")


if __name__ == "__main__":
    main()
<|endoftext|>


Prompt Style 2: A comment
# a python function that says hello
def say_hello():
    print("Hello World!")


if __name__ == "__main__":
    say_hello()<|endoftext|>


Prompt Style 3: A docstring
""" a python function that says hello """
def say_hello():
    print("Hello World!")

<|endoftext|>


Prompt Style 4: [ADVANCED] Fill in the middle (w/o processing)
<fim_prefix>def print_one_two_three():
    print('one')
    <fim_suffix>
    print('three')<fim_middle>print('two')<|endoftext|>

Prompt Style 4: [ADVANCED] Fill in the middle (Processing)
def print_one_two_three():
    print('one')
    print('two')
    print('three')


### Own examples
As we can see, the model does fairly well on the creator's example prompts. Let's try our own examples. Since our task is to generate code from a prefix and a suffix, I will use Prompt Style 4.
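
The prompt layout used by Prompt Style 4 can be wrapped in two small helpers, which makes it harder to misplace a special token when writing many examples. This is only a sketch: `build_fim_prompt` and `merge_fim_output` are hypothetical names, the token strings are the ones that appear in this notebook's prompts and outputs, and the parsing uses `str.partition` instead of regular expressions as an alternative to `format_middle_output`.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"
EOS = "<|endoftext|>"


def build_fim_prompt(prefix, suffix):
    """Lay out prefix and suffix in the order used by Prompt Style 4."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def merge_fim_output(decoded):
    """Turn a decoded generation back into prefix + generated middle + suffix."""
    _, _, rest = decoded.partition(FIM_PREFIX)
    prefix, _, rest = rest.partition(FIM_SUFFIX)
    suffix, found, middle = rest.partition(FIM_MIDDLE)
    if not found:
        raise ValueError("decoded text does not contain the expected FIM tokens")
    # Drop the end-of-text token and anything the model produced after it.
    middle = middle.split(EOS, 1)[0]
    return prefix + middle + suffix


# Example with a hard-coded string, so no model is needed to try it out.
prompt = build_fim_prompt(
    "def print_one_two_three():\n    print('one')\n    ",
    "\n    print('three')",
)
decoded = prompt + "print('two')" + EOS  # stand-in for tokenizer.decode(outputs[0])
print(merge_fim_output(decoded))
```

In the cells below, `build_fim_prompt(prefix, suffix)` could replace the manual string concatenation, provided the prefix and suffix strings are written without the special tokens.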

#### Example 1
Initialize model using `from_pretrained`.

In [64]:
prefix_load_model = "<fim_prefix>base_model_id = 'microsoft/phi-2'\nmodel = AutoModelForCausalLM.from_pretrained("
suffix_load_model = "<fim_suffix>)\n"
input_text = prefix_load_model + suffix_load_model + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

base_model_id ='microsoft/phi-2'
model = AutoModelForCausalLM.from_pretrained(model_name)


#### Example 2
Initialize model using `from_pretrained` with additional comment.

In [65]:
prefix_load_model_comment = "<fim_prefix># Initialize model and set load_in_8bit to True\nbase_model_id = 'microsoft/phi-2'\nmodel = AutoModelForCausalLM.from_pretrained("
suffix_load_model_comment = "<fim_suffix>)\n"
input_text = prefix_load_model_comment + suffix_load_model_comment + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

# Initialize model and set load_in_8bit to True
base_model_id ='microsoft/phi-2'
model = AutoModelForCausalLM.from_pretrained(base_model_id)


#### Example 3
Tokenize labels within tokenization function.

In [66]:
prefix_tokenize_labels = """<fim_prefix>def tokenize(prompt): \nresult = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)\n"""
suffix_tokenize_labels = """<fim_suffix>\nresult["labels"] = labels["input_ids"] \n return result"""
input_text = prefix_tokenize_labels + suffix_tokenize_labels + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

def tokenize(prompt): 
result = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)
    print("Tokens: ", result.keys())
result["labels"] = labels["input_ids"] 
 return result


#### Example 4
Map tokenization function to the dataset.

In [67]:
prefix_map_tokenization = """<fim_prefix>def generate_and_tokenize_prompt(data_point):\n\treturn tokenize(data_point)\ntokenized_train_dataset = dataset.map("""
suffix_map_tokenization = "<fim_suffix>)\n"
input_text = prefix_map_tokenization + suffix_map_tokenization + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

def generate_and_tokenize_prompt(data_point):
	return tokenize(data_point)
tokenized_train_dataset = dataset.map(generate_and_tokenize_prompt, num_parallel_calls=4)


### Results analysis
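We can see that the model did well on the simple examples provided by the creator but produced several flawed completions on our own examples:
- Example 1: Passed `model_name`, a name that is never defined in the prompt, instead of `base_model_id`.
- Example 2: Used the correct `base_model_id` but ignored the instruction in the comment and did not set `load_in_8bit` to True.
- Example 3: Inserted a debugging `print` instead of the code that creates `labels`, which the suffix relies on. The prompt's inconsistent indentation may have contributed to this.
- Example 4: Passed the right function to `map`, but added `num_parallel_calls=4` without being asked to. That argument belongs to `tf.data.Dataset.map`; if `dataset` is a Hugging Face `datasets.Dataset`, the corresponding argument is `num_proc`.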
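
Part of this analysis can be automated. The sketch below uses Python's standard `ast` module to check whether a completed snippet parses and which names it reads without ever defining them. `undefined_names` is a hypothetical helper written for this notebook, and it is a rough, flow-insensitive heuristic rather than a real scope analysis. The `known` argument lists names assumed to come from earlier cells.

```python
import ast
import builtins


def undefined_names(code, known=()):
    """Return names that are read but never defined in `code`, or None if it does not parse."""
    try:
        tree = ast.parse(code)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return None
    defined = set(known) | set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update((a.asname or a.name).split(".")[0] for a in node.names)
    return used - defined


# Completions copied from Example 1 and Example 3.
example_1 = (
    "base_model_id ='microsoft/phi-2'\n"
    "model = AutoModelForCausalLM.from_pretrained(model_name)\n"
)
example_3 = (
    "def tokenize(prompt): \n"
    "result = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)\n"
    "    print(\"Tokens: \", result.keys())\n"
    "result[\"labels\"] = labels[\"input_ids\"] \n"
    " return result"
)

print(undefined_names(example_1, known=("AutoModelForCausalLM",)))  # {'model_name'}
print(undefined_names(example_3))                                   # None
```

The check flags `model_name` in Example 1 and reports that the Example 3 snippet does not parse at all, because the function body in the prompt itself is not indented. It cannot judge whether a completion follows an instruction, as in Example 2, so it complements the manual review rather than replacing it.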