# Introduction:

The purpose of this notebook is to clarify the differences between four kinds of model: a general-purpose pretrained model, a custom pretrained model, a tiny instruct model, and a small instruct model.

The notebook tests how each model responds to the same prompt, which here is a piece of Python code to complete.

By the end of this notebook, you will understand what each kind of model can do and how they differ.

In [1]:
### import packages
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import ollama 
import warnings
import os
warnings.filterwarnings('ignore')


In [2]:
# Set a seed for reproducibility
def set_seed(seed = 123):
    torch.manual_seed(seed)

set_seed()

In [3]:
### set the value of some constants
MODEL = "gpt2"
DEVICE = "auto" if torch.cuda.is_available() else "cpu"
MODELS_PATH = os.path.join("..", "assets", "models")

## 1. Load a general pretrained model

This notebook uses small models that fit in memory and run on a CPU. **GPT-2 small** is a small decoder-only model with 137M parameters and a 1024-token context window (we will discuss these concepts later). You can find the model on the Hugging Face model library at [this link](https://huggingface.co/openai-community/gpt2).

You'll load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` in the `transformers` library
3. Load the tokenizer for the model from the same model path

In [4]:
gpt2_tokenizer = AutoTokenizer.from_pretrained(
    MODEL
)

gpt2_small_base = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map=DEVICE, # "auto" when a GPU is available, otherwise "cpu" (see DEVICE above)
    torch_dtype=torch.bfloat16,
    cache_dir = MODELS_PATH,
)

In [5]:
gpt2_tokenizer.pad_token_id = gpt2_tokenizer.eos_token_id

In [6]:
### test the tokenizer for input python code...
py_code = '''
def write_map_file(mapFNH, items, header):
    """
    Given a list of mapping items (in the form described by the parse_mapping_file method)
    and a header line, write each row to the given input file with fields separated by tabs.

    :type mapFNH: file or str
    :param mapFNH: Either the full path to the map file or an open file handle

    :type items: list
    :param item: The list of row entries to be written to the mapping file

    :type header: list or str
    :param header: The descriptive column names that are required as the first line of
                   the mapping file

    :rtype: None
    """
    if isinstance(header, list):
        header = "\t".join(header) + "\n"

    with file_handle(mapFNH, "w") as mapF:
        mapF.write(header)
        for row in items:
            mapF.write("\t".join(row)+"\n")
'''

In [7]:
### tokenize the given text chunk into tokens.
### testing the tokenizer is very important, and we will discuss this later...
py_code_tokens = gpt2_tokenizer.tokenize(py_code)
print(py_code_tokens)

['Ċ', 'def', 'Ġwrite', '_', 'map', '_', 'file', '(', 'map', 'F', 'NH', ',', 'Ġitems', ',', 'Ġheader', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'ĠGiven', 'Ġa', 'Ġlist', 'Ġof', 'Ġmapping', 'Ġitems', 'Ġ(', 'in', 'Ġthe', 'Ġform', 'Ġdescribed', 'Ġby', 'Ġthe', 'Ġparse', '_', 'm', 'apping', '_', 'file', 'Ġmethod', ')', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġand', 'Ġa', 'Ġheader', 'Ġline', ',', 'Ġwrite', 'Ġeach', 'Ġrow', 'Ġto', 'Ġthe', 'Ġgiven', 'Ġinput', 'Ġfile', 'Ġwith', 'Ġfields', 'Ġseparated', 'Ġby', 'Ġtabs', '.', 'ĊĊ', 'Ġ', 'Ġ', 'Ġ', 'Ġ:', 'type', 'Ġmap', 'F', 'NH', ':', 'Ġfile', 'Ġor', 'Ġstr', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ:', 'param', 'Ġmap', 'F', 'NH', ':', 'ĠEither', 'Ġthe', 'Ġfull', 'Ġpath', 'Ġto', 'Ġthe', 'Ġmap', 'Ġfile', 'Ġor', 'Ġan', 'Ġopen', 'Ġfile', 'Ġhandle', 'ĊĊ', 'Ġ', 'Ġ', 'Ġ', 'Ġ:', 'type', 'Ġitems', ':', 'Ġlist', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ:', 'param', 'Ġitem', ':', 'ĠThe', 'Ġlist', 'Ġof', 'Ġrow', 'Ġentries', 'Ġto', 'Ġbe', 'Ġwritten', 'Ġto', 'Ġthe', 'Ġmapping', 'Ġfile', 'ĊĊ', 'Ġ', 'Ġ', 'Ġ',

A problem with the GPT-2 tokenizer when tokenizing Python code:

- The tokenizer uses a few special symbols, such as Ġ and Ċ, which denote spaces and newlines.
- It is inefficient for code: it returns a separate token for each space, when it could group each indentation level into a single token.

This suggests that the model was not trained on much Python code, which points to weak performance even on simple autocomplete tasks.
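To get a rough feel for the cost of per-space tokens, the sketch below takes an excerpt of the token list printed above and collapses each run of bare `Ġ` tokens into one. The helper `merge_space_runs` is hypothetical and only counts tokens; it is not how a real tokenizer merges whitespace.

```python
from itertools import groupby

# Excerpt of the GPT-2 token output printed above (signature and first docstring line).
gpt2_tokens = [
    'Ċ', 'def', 'Ġwrite', '_', 'map', '_', 'file', '(', 'map', 'F', 'NH',
    ',', 'Ġitems', ',', 'Ġheader', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Ċ',
    'Ġ', 'Ġ', 'Ġ', 'ĠGiven', 'Ġa', 'Ġlist',
]

def merge_space_runs(tokens, space="Ġ"):
    """Collapse each run of consecutive single-space tokens into one token.

    Hypothetical helper for illustration only: it imitates what a tokenizer
    with dedicated indentation tokens would save, nothing more.
    """
    merged = []
    for token, run in groupby(tokens):
        run = list(run)
        if token == space:
            merged.append(space * len(run))  # one token per indentation run
        else:
            merged.extend(run)               # keep all other tokens unchanged
    return merged

merged_tokens = merge_space_runs(gpt2_tokens)
space_only = sum(1 for t in gpt2_tokens if t == "Ġ")

print(f"original: {len(gpt2_tokens)} tokens, {space_only} of them bare spaces")
print(f"merged:   {len(merged_tokens)} tokens")
```

On this short excerpt, about a fifth of the tokens are bare spaces. For deeply indented code that overhead consumes a noticeable share of the context window.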

## 2. Generate Python samples with pretrained general model

Use the model to write a Python function called `find_max()` that finds the maximum value in a list of numbers:

In [8]:
prompt =  "def find_max(numbers):"

inputs = gpt2_tokenizer(
    prompt, return_tensors="pt"
).to(gpt2_small_base.device)

streamer = TextStreamer(
    gpt2_tokenizer, 
    skip_prompt=True, # Set to false to include the prompt in the output
    skip_special_tokens=True
)

outputs = gpt2_small_base.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False, # greedy decoding; temperature is only used when sampling, so it is omitted
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 # This is the number of times a given string has been found. if len (strings) > 1: return False def get_string(s): """ Returns an array containing strings that are sorted by length, and returns True for all characters in it.""" s = [] while not strlen(s): print "Invalid character" else : try { fprintf(stderr, "%d
", String.format(str)) except ValueError: raise Exception("Failed to parse '%i' or %u".join() + "").strip().lower() elif n == 0: break } catch None,


As expected, the model generates garbage: it produces code-like tokens, but not the ones needed to complete the input prompt.

This shows why we need to pretrain the model on custom data.

More generally, if the pretrained model was not trained on data or tasks similar to yours, you need to continue pretraining it. For example:

GPT-2 small was not trained on Arabic, so if your task is to create an Arabic instruct model, you have to continue pretraining the model on Arabic text and then do instruction tuning.

In [9]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del gpt2_small_base
del gpt2_tokenizer
del streamer
del outputs
gc.collect()

31
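The bare integer printed after the cleanup cell is the return value of `gc.collect()`, which reports how many unreachable objects the collector found. A minimal standard-library sketch of that behavior, using a hypothetical `Node` class to build a reference cycle:

```python
import gc

class Node:
    """Tiny object used to build a reference cycle (illustration only)."""
    def __init__(self):
        self.other = None

gc.collect()              # clear any garbage left over from earlier work

a, b = Node(), Node()
a.other, b.other = b, a   # a and b now reference each other
del a, b                  # the names are gone, but the cycle keeps both objects alive

collected = gc.collect()  # returns the number of unreachable objects found
print(collected)
```

The exact number depends on the interpreter and on what else is pending collection, so the values in this notebook (such as 31) are not meaningful in themselves.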

## 3. Generate Python samples with finetuned Python model

This model has been fine-tuned on instruction code examples. You can find the model and information about the fine-tuning datasets on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-code-instruct).

You'll follow the same steps as above to load the model and use it to generate text.

In [17]:
MODEL1 = "upstage/TinySolar-248m-4k-code-instruct"

In [18]:
### code instruct model....
code_tiny_solar_instruct = AutoModelForCausalLM.from_pretrained(
    MODEL1,
    device_map="cpu",
    torch_dtype=torch.bfloat16,  
    cache_dir = MODELS_PATH,  
)

code_tiny_solar_instruct_tokenizer = AutoTokenizer.from_pretrained(
    MODEL1
)

In [19]:
inputs = code_tiny_solar_instruct_tokenizer(
    prompt, return_tensors="pt"
).to(code_tiny_solar_instruct.device)

streamer = TextStreamer(
    code_tiny_solar_instruct_tokenizer,
    skip_prompt=True, 
    skip_special_tokens=True
)

outputs = code_tiny_solar_instruct.generate(
    **inputs, streamer=streamer,
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False, 
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   if len(numbers) == 0:
       return "Invalid input"
   else:
       return max(numbers)
```

In this solution, the `find_max` function takes a list of numbers as input and returns the maximum value in that list. It then iterates through each number in the list and checks if it is greater than or equal to 1. If it is, it adds it to the `max` list. Finally, it returns the maximum value found so far.


As shown above, the model generates a good completion for the input and follows it with an explanation of the generated code. Note that the explanation is not fully accurate: the generated code simply calls `max()`; it does not iterate over the list or build a `max` list.

So the instruct model gives a more detailed answer: it not only completes the code but also tries to explain it.

This illustrates the difference between an instruct model and a base model:

- The base model generates tokens that plausibly continue the given context.
- The instruct model generates tokens that answer the given question.
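The distinction can also be seen in how prompts are usually built. Instruction-tuned models are typically trained on requests wrapped in a fixed template, while a base model just receives raw text to continue. The sketch below uses a hypothetical `### User / ### Assistant` template as an assumption. This notebook passes the raw prompt to every model, and the template that TinySolar actually expects is not stated here.

```python
def build_base_prompt(code_prefix: str) -> str:
    """A base model simply continues whatever text it is given."""
    return code_prefix

def build_instruct_prompt(request: str) -> str:
    """Wrap a request in a HYPOTHETICAL instruction template (an assumption)."""
    return f"### User:\n{request}\n\n### Assistant:\n"

code_prefix = "def find_max(numbers):"

print(repr(build_base_prompt(code_prefix)))
print(repr(build_instruct_prompt(f"Complete this Python function:\n{code_prefix}")))
```

For a real model, the tokenizer's own chat template (when it ships one) is the safer source for this format.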

In [20]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del code_tiny_solar_instruct_tokenizer
del code_tiny_solar_instruct
del streamer
del outputs
gc.collect()

87

## 4. Generate Python samples with pretrained Python model

Here you'll use a version of TinySolar-248m-4k that has been further pretrained (a process called **continued pretraining**) on a large selection of python code samples. You can find the model on Hugging Face at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-py).

You'll follow the same steps as above to load the model and use it to generate text.

In [10]:
MODEL2 = "upstage/TinySolar-248m-4k-py"

In [11]:
### pretrained model on python code....
py_tiny_solar = AutoModelForCausalLM.from_pretrained(
    MODEL2,
    device_map="cpu",
    torch_dtype=torch.bfloat16,  
    cache_dir = MODELS_PATH,  
)

py_tiny_solar_tokenizer = AutoTokenizer.from_pretrained(
    MODEL2
)

In [12]:
py_code_tokens2 = py_tiny_solar_tokenizer.tokenize(py_code)
print(py_code_tokens2)

['▁', '<0x0A>', 'def', '▁write', '_', 'map', '_', 'file', '(', 'map', 'FN', 'H', ',', '▁items', ',', '▁header', '):', '<0x0A>', '▁▁▁', '▁"""', '<0x0A>', '▁▁▁', '▁Given', '▁a', '▁list', '▁of', '▁mapping', '▁items', '▁(', 'in', '▁the', '▁form', '▁described', '▁by', '▁the', '▁parse', '_', 'mapping', '_', 'file', '▁method', ')', '<0x0A>', '▁▁▁', '▁and', '▁a', '▁header', '▁line', ',', '▁write', '▁each', '▁row', '▁to', '▁the', '▁given', '▁input', '▁file', '▁with', '▁fields', '▁separated', '▁by', '▁t', 'abs', '.', '<0x0A>', '<0x0A>', '▁▁▁', '▁:', 'type', '▁map', 'FN', 'H', ':', '▁file', '▁or', '▁str', '<0x0A>', '▁▁▁', '▁:', 'param', '▁map', 'FN', 'H', ':', '▁Either', '▁the', '▁full', '▁path', '▁to', '▁the', '▁map', '▁file', '▁or', '▁an', '▁open', '▁file', '▁handle', '<0x0A>', '<0x0A>', '▁▁▁', '▁:', 'type', '▁items', ':', '▁list', '<0x0A>', '▁▁▁', '▁:', 'param', '▁item', ':', '▁The', '▁list', '▁of', '▁row', '▁entries', '▁to', '▁be', '▁written', '▁to', '▁the', '▁mapping', '▁file', '<0x0A>', '<0
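Compared with the GPT-2 output earlier, this tokenizer groups indentation: three spaces become a single `▁▁▁` token instead of three separate `Ġ` tokens. The sketch below counts whitespace-only tokens in matching excerpts of the two printed outputs. The helper `whitespace_only_tokens` is hypothetical and the excerpts are copied by hand from the outputs in this notebook.

```python
# Matching excerpts of the two token outputs printed in this notebook
# (end of the signature line through the first word of the docstring).
gpt2_excerpt = ['):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'ĠGiven']
solar_excerpt = ['):', '<0x0A>', '▁▁▁', '▁"""', '<0x0A>', '▁▁▁', '▁Given']

def whitespace_only_tokens(tokens, space_markers=("Ġ", "▁")):
    """Return the tokens made up solely of space-marker characters."""
    return [t for t in tokens if t and all(ch in space_markers for ch in t)]

counts = {}
for name, tokens in [("gpt2", gpt2_excerpt), ("TinySolar", solar_excerpt)]:
    ws = whitespace_only_tokens(tokens)
    counts[name] = (len(tokens), len(ws))
    print(f"{name}: {len(tokens)} tokens, {len(ws)} whitespace-only")
```

The same source text costs 11 tokens with GPT-2 and 7 with the TinySolar tokenizer, and the difference comes entirely from how indentation is encoded.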

In [13]:
prompt = "def find_max(numbers):"

inputs = py_tiny_solar_tokenizer(
    prompt, return_tensors="pt"
).to(py_tiny_solar.device)

streamer = TextStreamer(
    py_tiny_solar_tokenizer,
    skip_prompt=True, 
    skip_special_tokens=True
)

outputs = py_tiny_solar.generate(
    **inputs, streamer=streamer,
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False, 
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



   """Find the maximum number of numbers in a list."""
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max



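The model above has been further pretrained on Python code (continued pretraining). It generates code that sensibly completes the given prompt, although initializing `max = 0` gives the wrong result for a list that contains only negative numbers. It is still a base model, however: it continues the code and does not explain it.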
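Generated code is worth a quick check before trusting it. The sketch below reproduces the completion above as `find_max_generated` and compares it with `find_max_fixed`, one possible correction written for this illustration (an assumption, not model output).

```python
def find_max_generated(numbers):
    """Find the maximum number of numbers in a list."""
    # Body copied from the model output above.
    max = 0  # shadows the built-in max() and assumes no value is below 0
    for num in numbers:
        if num > max:
            max = num
    return max

def find_max_fixed(numbers):
    """One possible fix (an assumption): start from the first element."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    largest = numbers[0]
    for num in numbers[1:]:
        if num > largest:
            largest = num
    return largest

print(find_max_generated([3, 7, 2]))     # 7
print(find_max_generated([-5, -2, -9]))  # 0, although the true maximum is -2
print(find_max_fixed([-5, -2, -9]))      # -2
```

The generated version also returns 0 for an empty list, where the fixed version raises an error. Which behavior is right depends on the caller.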

In [14]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del py_tiny_solar_tokenizer
del py_tiny_solar
del streamer
del outputs
gc.collect()

107

## 5. Generate Python samples with Qwen2 0.5B (Larger Instruct Model)

Here you'll use Qwen2 0.5B. You can find the model on Ollama at [this link](https://ollama.com/library/qwen2).

You'll use Ollama to generate text.

In [15]:
MODEL3 = "qwen2:0.5b"

In [16]:
### Instruct model ....
### use here ollama ...
stream = ollama.chat(
    model=MODEL3,
    messages=[{'role': 'user', 'content': prompt}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

To determine the maximum value in a Python list of numbers, you can use the built-in `max()` function. For example:

```python
numbers = [10, 23, 5, 78, 90]
if len(numbers) == 0:  # Handle case where we have an empty list
    max_number = None
else:
    max_number = max(numbers)
print(max_number)
```

This code snippet defines a function called `find_max` that takes in a variable number of arguments (in this case, three), which are a list of numbers. It then checks whether the length of the list is zero or not, and handles this by returning None if so.

If the list is not empty, it uses the `max()` function to find the maximum value in the list. The function returns the maximum value as an integer.

The code then prints the result to the console.

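This model generated a higher-quality response that is more detailed than the one from `TinySolar-248m-4k-code-instruct`. This reflects the effect of model size and training data. The explanation is still not fully accurate, though: the snippet never actually defines a `find_max` function.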

## Conclusion:

- General pretrained models may perform poorly on customized tasks.
- A general pretrained model needs further pretraining on your custom data before it can become a good model for your task.
- Pretraining is like reading hundreds or thousands of books to learn how to generate tokens.
- Instruction tuning is like preparing a student for an exam: the student learns how to answer the questions.