# Info
If you haven't gone through the `explore_tiny_starcoder.ipynb` notebook yet, it is better to do so before reading this one, since it continues the experiments from there.


In [12]:
!pip install -q -U bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import re

Approximate VRAM usage with 8-bit loading is 3.5 GB.

In [41]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_name = "bigcode/starcoder2-3b"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config).to(device)

# About Model

**StarCoder2-3B** (`bigcode/starcoder2-3b`) is built on the **Starcoder2Model** architecture, designed for code generation.

It consists of:

- Embedding layer (**embed_tokens**):  Maps input tokens to a 3072-dimensional space with a vocabulary of 49,152 tokens.

- 30 Transformer blocks (**Starcoder2DecoderLayer**):
  Each block includes:
  - Self-attention (**self_attn**):
    - Linear layers for queries (**q_proj**) and output (**o_proj**) map 3072 to 3072 dimensions, while keys (**k_proj**) and values (**v_proj**) map 3072 to only 256 dimensions, which indicates that several query heads share each key/value head (grouped-query attention).
    - Rotary embedding (**rotary_emb**) for positional encoding.
  - Feed-forward network (**mlp**):
    - Expands dimensions (**c_fc**: 3072 to 12288) and projects back (**c_proj**: 12288 to 3072).
    - Activation function: Uses **PytorchGELUTanh**.
  - Layer normalization: Applied before attention (**input_layernorm**) and again after it, before the feed-forward network (**post_attention_layernorm**).

- Final layer normalization (**norm**): Normalizes the last hidden state before the language modeling head.

- Language modeling head (**lm_head**):
  A linear layer mapping the 3072-dimensional hidden state to 49,152 tokens for output generation.<br><br>
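The layer shapes listed above are enough for a rough parameter count. The following is a minimal sketch, not an official figure: it assumes that the final `norm` is a `LayerNorm` with weight and bias like the per-block ones, and that `lm_head` shares its weight with `embed_tokens` (neither is confirmed by the printout).

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a linear layer (weight matrix plus optional bias)."""
    return in_features * out_features + (out_features if bias else 0)

HIDDEN, KV_DIM, MLP_DIM, VOCAB, NUM_LAYERS = 3072, 256, 12288, 49152, 30

attention = (
    linear_params(HIDDEN, HIDDEN)      # q_proj
    + linear_params(HIDDEN, KV_DIM)    # k_proj
    + linear_params(HIDDEN, KV_DIM)    # v_proj
    + linear_params(HIDDEN, HIDDEN)    # o_proj
)
mlp = linear_params(HIDDEN, MLP_DIM) + linear_params(MLP_DIM, HIDDEN)  # c_fc + c_proj
layer_norms = 2 * (2 * HIDDEN)  # weight + bias for each of the two LayerNorms
per_layer = attention + mlp + layer_norms

embedding = VOCAB * HIDDEN
final_norm = 2 * HIDDEN  # assumption: same LayerNorm shape as inside the blocks
# Assumption: lm_head reuses the embed_tokens weight, so it is not counted again.
total = embedding + NUM_LAYERS * per_layer + final_norm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e9:.2f}B parameters)")
print(f"one byte per weight: ~{total / 1024**3:.2f} GiB")
```

Under these assumptions the count lands at about 3 billion parameters. At one byte per weight that is a little under 3 GiB, the same order of magnitude as the VRAM usage noted earlier; the difference plausibly comes from layers that are kept in higher precision and from runtime overhead.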

**Interesting note on the difference in positional embeddings between the two models** (StarCoder2-3B and tiny_starcoder_py):

Rotary embedding layers use a rotary mechanism to encode positional information, capturing relative positions and improving attention across varying input lengths. This enhances generalization for different sequence lengths.

In contrast, traditional absolute position embeddings assign a fixed vector to each absolute position, which is less flexible and may not generalize well to longer sequences.
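To make the "relative positions" point concrete, here is a small pure-Python sketch of the rotary idea. It uses the textbook formulation that rotates consecutive pairs of dimensions; the actual `Starcoder2RotaryEmbedding` implementation may pair dimensions differently, and the vectors and `base` value below are illustrative assumptions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same distance between query and key (2 positions), different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_far_along = dot(rope(q, 105), rope(k, 103))
print(score_near_start, score_far_along)
```

The two attention scores agree up to floating-point error, because rotating both vectors leaves only the difference of their angles in the dot product.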

In [62]:
model

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 3072)
    (layers): ModuleList(
      (0-29): 30 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2SdpaAttention(
          (q_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
          (k_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (v_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
          (rotary_emb): Starcoder2RotaryEmbedding()
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear8bitLt(in_features=3072, out_features=12288, bias=True)
          (c_proj): Linear8bitLt(in_features=12288, out_features=3072, bias=True)
          (act): PytorchGELUTanh()
        )
        (input_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=Tru

# Testing Model
Fill-in-the-middle generation with this model differs slightly from tiny_starcoder_py. The full discussion is in this [issue](https://github.com/bigcode-project/starcoder2/issues/10); in summary:
- The prefix and suffix are marked with the same tokens as for tiny_starcoder_py: `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>`.
- The model does not always emit `<|endoftext|>` to end generation; sometimes it emits `<file_sep>` instead. <br><br>
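The two points above can be sketched without loading the model. The helper names `build_fim_prompt` and `cut_at_stop_token` are hypothetical and only illustrate the prompt layout and the two possible terminators.

```python
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
STOP_TOKENS = ("<|endoftext|>", "<file_sep>")

def build_fim_prompt(prefix, suffix):
    """Wrap the code before and after the gap in the FIM control tokens."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def cut_at_stop_token(completion):
    """Keep only the text generated before the first stop token, if there is one."""
    positions = [completion.find(token) for token in STOP_TOKENS if token in completion]
    return completion[:min(positions)] if positions else completion

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
print(prompt)
print(cut_at_stop_token("a + b<file_sep>unrelated text"))
```

Handling both terminators in one place keeps the post-processing independent of which token the model happens to emit.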

The examples are similar to those in the previous notebook, but some are a little more complicated, since I want to see the model's actual capabilities.




In [55]:
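def format_middle_output(text):
    # Extract the prefix and suffix from the decoded sequence.
    prefix = re.search('<fim_prefix>(.*?)<fim_suffix>', text, re.DOTALL).group(1)
    suffix = re.search('<fim_suffix>(.*?)<fim_middle>', text, re.DOTALL).group(1)
    # The generated middle ends at <file_sep> if the model emitted it.
    match = re.search('<fim_middle>(.*?)<file_sep>', text, re.DOTALL)
    if match is None:
        # Otherwise take everything after <fim_middle>, including newlines.
        match = re.search('<fim_middle>(.*)', text, re.DOTALL)
    output = match.group(1).replace('<|endoftext|>', '')
    return prefix + output + suffix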

In [75]:
for i in range(7):
  print(tokenizer.decode(i))

<|endoftext|>
<fim_prefix>
<fim_middle>
<fim_suffix>
<fim_pad>
<repo_name>
<file_sep>
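Since both possible terminators are among the first seven token IDs, one option is to stop generation on either of them instead of cleaning up afterwards. The sketch below only builds the keyword arguments; it is not used in this notebook and assumes a `transformers` version in which `generate` accepts a list for `eos_token_id`.

```python
# IDs copied from the decoding loop output: index i printed tokenizer.decode(i).
SPECIAL_TOKEN_IDS = {
    "<|endoftext|>": 0,
    "<fim_prefix>": 1,
    "<fim_middle>": 2,
    "<fim_suffix>": 3,
    "<fim_pad>": 4,
    "<repo_name>": 5,
    "<file_sep>": 6,
}

# Stop on whichever terminator the model produces first.
stop_ids = [SPECIAL_TOKEN_IDS["<|endoftext|>"], SPECIAL_TOKEN_IDS["<file_sep>"]]

# Hypothetical extra keyword arguments for model.generate (an assumption, see above).
stop_kwargs = {
    "eos_token_id": stop_ids,
    "pad_token_id": SPECIAL_TOKEN_IDS["<|endoftext|>"],
}
print(stop_kwargs)
```

With a loaded model this would be passed as `model.generate(inputs, **stop_kwargs, **params)`, replacing the `pad_token_id` argument used in the example cells.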


### Choosing configuration parameters
The parameters are described in detail in the notebook for tiny_starcoder_py. I will use the same values here, since they seem to work well.


In [7]:
params = {
    'max_new_tokens': 128,
    'temperature': 0.2,
    'top_k': 50,
    'top_p': 0.1,
    'repetition_penalty': 1.17,
    'do_sample': True
}
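To get a feeling for how `temperature`, `top_k` and `top_p` interact, here is a simplified pure-Python sketch of the filtering steps. It is an approximation of what the sampling pipeline does, not the library's implementation, and the logits are made-up values.

```python
import math

def surviving_tokens(logits, temperature=0.2, top_k=50, top_p=0.1):
    """Return the token indices left after temperature scaling, top-k and top-p filtering."""
    scaled = [logit / temperature for logit in logits]
    # Top-k: keep the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the remaining tokens (shifted by the maximum for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, weight in zip(order, weights):
        kept.append(index)
        cumulative += weight / total
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.5, 0.3, -1.0]  # hypothetical scores for a 4-token vocabulary
print(surviving_tokens(toy_logits))              # settings from `params`
print(surviving_tokens(toy_logits, top_p=0.95))  # a looser nucleus for comparison
```

With a low temperature and `top_p` of 0.1, the most likely token alone usually covers the nucleus, so sampling behaves almost like greedy decoding even though `do_sample` is enabled.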

#### Example 1
Initialize model using `from_pretrained`.

In [60]:
prefix_load_model = "<fim_prefix>base_model_id = 'microsoft/phi-2'\nmodel = "
suffix_load_model = "<fim_suffix>tokenizer = AutoTokenizer.from_pretrained(model_name)\n"
input_text = prefix_load_model + suffix_load_model + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

base_model_id ='microsoft/phi-2'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3).to('cuda')

tokenizer = AutoTokenizer.from_pretrained(model_name)


#### Example 2
Initialize the model using `from_pretrained`, with an additional comment.

In [61]:
prefix_load_model_comment = "<fim_prefix># Initialize model and set load_in_8bit to True\nbase_model_id = 'microsoft/phi-2'\nmodel = "
suffix_load_model_comment = "<fim_suffix>tokenizer = AutoTokenizer.from_pretrained(model_name)\n"
input_text = prefix_load_model_comment + suffix_load_model_comment + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

# Initialize model and set load_in_8bit to True
base_model_id ='microsoft/phi-2'
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, load_in_8bit=True).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)


#### Example 3
Tokenize labels within the tokenization function.

In [76]:
prefix_tokenize_labels = """<fim_prefix>def tokenize(prompt): \nresult = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)\n"""
suffix_tokenize_labels = """<fim_suffix>\nresult["labels"] = labels["input_ids"] \n return result"""
input_text = prefix_tokenize_labels + suffix_tokenize_labels + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

def tokenize(prompt): 
result = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)
result.update({"input_ids": prompt[0], "attention_mask" : prompt[1]})
result["labels"] = labels["input_ids"] 
 return result


#### Example 4
Map tokenization function to the dataset.

In [77]:
prefix_map_tokenization = """<fim_prefix>def generate_and_tokenize_prompt(data_point):\n\treturn tokenize(data_point)\ntokenized_train_dataset = """
suffix_map_tokenization = "<fim_suffix>tokenized_train_dataset = tokenized_train_dataset.remove_columns(['prompt', 'function_name'])"
input_text = prefix_map_tokenization + suffix_map_tokenization + '<fim_middle>'

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, **params)
print(f'\033[96m {format_middle_output(tokenizer.decode(outputs[0]))} \033[00m')

def generate_and_tokenize_prompt(data_point):
	return tokenize(data_point)
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt, batched=True).filter(lambda x: len(x['input_ids'][0]) > 128 and len(x['input_ids'][0]) < 513 )

tokenized_train_dataset = tokenized_train_dataset.remove_columns(['prompt', 'function_name'])


### Results Analysis
The model performed well on the provided examples and generated coherent code, but there is room for improvement:
- Example 1: The model generates a valid `from_pretrained` initialization; everything is fine.
- Example 2: Same as Example 1; the model also follows the comment and passes `load_in_8bit=True`.
- Example 3: The model does not recognize that the `labels` variable is missing and that it needs to generate it.
- Example 4: The model correctly maps the function over the dataset and adds a filtering step, but the purpose of the filtering condition is unclear.
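The problem in Example 3 can also be spotted mechanically. The sketch below is a rough, scope-unaware check built on the standard `ast` module; the helper name `undefined_names` is hypothetical, and the completion has been re-indented by hand so that it parses.

```python
import ast

def undefined_names(source, known=()):
    """Return names that are read in `source` but never assigned, received as
    function arguments, or listed in `known`."""
    defined = set(known)
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)   # assignment target
            else:
                used.add(node.id)      # name being read
        elif isinstance(node, ast.arg):
            defined.add(node.arg)      # function parameter
        elif isinstance(node, ast.FunctionDef):
            defined.add(node.name)
    return used - defined

# Example 3 output, re-indented by hand so that it is valid Python.
example_3 = '''
def tokenize(prompt):
    result = tokenizer(prompt['prompt'], max_length=max_input_length, truncation=True, padding=True)
    result.update({"input_ids": prompt[0], "attention_mask": prompt[1]})
    result["labels"] = labels["input_ids"]
    return result
'''

# `tokenizer` and `max_input_length` are assumed to exist in the surrounding notebook.
missing = undefined_names(example_3, known={"tokenizer", "max_input_length"})
print(missing)
```

The only name reported is `labels`, which is exactly the variable the model was expected to create in the middle part.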