<a href="https://colab.research.google.com/github/MfmRifath/LLM_Learning/blob/main/models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate

In [2]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [3]:
# Read the Hugging Face token from Colab secrets and authenticate
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [4]:
# Instruct models
LLAMA = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
PHI3 = 'microsoft/Phi-3-mini-4k-instruct'
GEMMA2 = 'google/gemma-2-2b-it'
QWEN2 = 'Qwen/Qwen2-7B-Instruct'
MIXTRAL = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

In [5]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell a light-hearted joke for a Data Scientist."},
]

In [6]:
# Quantization config: load the weights in 4-bit precision so the model uses less memory
quant_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
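The idea behind 4-bit loading can be sketched in plain Python. This is a simplified illustration, not what bitsandbytes actually runs: it uses a uniform 15-level grid with one absmax scale per block, whereas `nf4` places its levels according to a normal distribution, and double quantization additionally compresses the per-block scales. The helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values, levels=7):
    """Map floats to integer codes in [-levels, levels] using one absmax scale per block."""
    scale = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / levels * scale for c in codes]


# A made-up block of weights, for illustration only
weights = [0.12, -0.53, 0.07, 0.98, -0.31, 0.0, 0.44, -0.86]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale}, max round-trip error: {max_error:.4f}")
```

Each code fits in 4 bits (15 values out of a possible 16), so the block costs half a byte per weight plus one stored scale, at the price of a small rounding error per weight.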

In [7]:
# Tokenizer: reuse EOS as the padding token, then render the chat into input ids on the GPU
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors='pt').to('cuda')

In [8]:
# The model: load the quantized weights and let accelerate place them on the available device
model = AutoModelForCausalLM.from_pretrained(LLAMA, quantization_config=quant_config, device_map='auto')


In [9]:
# get_memory_footprint() returns bytes; convert to megabytes
memory = model.get_memory_footprint() / 1e6
print(f'Memory footprint of the model: {memory:.2f} MB')

Memory footprint of the model: 5591.55 MB


In [10]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps
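
The reported footprint can be roughly reproduced from the layer shapes printed above. The sketch below is back-of-the-envelope arithmetic under stated assumptions: every `Linear4bit` weight costs half a byte, the embedding and RMSNorm weights stay in 16-bit, and the part of the printout that is cut off contains a final norm and an unquantized `lm_head` with the same shape as `embed_tokens`. Quantization constants and buffers are ignored.

```python
# Shapes copied from the printed module tree
HIDDEN, KV_DIM, MLP_DIM, VOCAB, LAYERS = 4096, 1024, 14336, 128256, 32

# Linear4bit weights in one decoder layer
linear_per_layer = (
    2 * HIDDEN * HIDDEN      # q_proj and o_proj
    + 2 * HIDDEN * KV_DIM    # k_proj and v_proj
    + 3 * HIDDEN * MLP_DIM   # gate_proj, up_proj, down_proj
)
quantized_params = LAYERS * linear_per_layer

# Assumed to stay in 16-bit: embed_tokens plus an lm_head of the same shape (assumption)
embedding_params = 2 * VOCAB * HIDDEN
# Two RMSNorm weights per layer plus one final norm (the final norm is an assumption)
norm_params = LAYERS * 2 * HIDDEN + HIDDEN

# 0.5 bytes per 4-bit weight, 2 bytes per 16-bit weight
estimated_bytes = quantized_params * 0.5 + (embedding_params + norm_params) * 2
estimated_mb = estimated_bytes / 1e6
print(f"Estimated footprint: {estimated_mb:.2f} MB")
```

Under these assumptions the estimate comes out close to the 5591.55 MB reported by `get_memory_footprint()`, and it shows that the two 16-bit vocabulary matrices account for a large share of the total.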

In [11]:
outputs=model.generate(inputs=inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a Data Scientist.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Why did the regression model break up with the neural network?

Because it couldn't predict their future together.<|eot_id|>
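
The warnings above say the attention mask cannot be inferred because the pad token is the same as the EOS token. The sketch below illustrates why, using made-up token ids and a hypothetical helper `infer_attention_mask`; it is not the logic transformers uses internally, only the general idea of guessing a mask from padding positions.

```python
def infer_attention_mask(token_ids, pad_token_id):
    """Guess a mask from the ids alone: 1 for real tokens, 0 wherever the pad id appears."""
    return [0 if t == pad_token_id else 1 for t in token_ids]


EOS_ID = 2  # hypothetical ids, chosen only for this example

# Case 1: a dedicated pad id (0). The real EOS token stays visible.
padded_distinct = [11, 12, EOS_ID, 0, 0]
mask_distinct = infer_attention_mask(padded_distinct, pad_token_id=0)

# Case 2: pad id == EOS id. The real EOS is indistinguishable from padding.
padded_shared = [11, 12, EOS_ID, EOS_ID, EOS_ID]
mask_shared = infer_attention_mask(padded_shared, pad_token_id=EOS_ID)

print(mask_distinct)
print(mask_shared)
```

In the notebook itself, one way to avoid the guesswork would be to pass `return_dict=True` to `apply_chat_template`, which returns an `attention_mask` alongside `input_ids`, and then call `model.generate(**inputs, max_new_tokens=80)`.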


In [12]:
del inputs, outputs, model
torch.cuda.empty_cache()

In [13]:
def generate(model_name, messages):
  """Load a model in 4-bit, stream its reply to `messages`, then free the GPU memory."""
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  inputs = tokenizer.apply_chat_template(messages, return_tensors='pt').to('cuda')
  streamer = TextStreamer(tokenizer)  # prints tokens to stdout as they are generated
  model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map='auto')
  outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
  # Drop the references so the cached GPU memory can be released
  del tokenizer, inputs, model, streamer, outputs
  torch.cuda.empty_cache()

In [14]:
generate(PHI3, messages)


<|system|> You are a helpful assistant.<|end|><|user|> Tell a light-hearted joke for a Data Scientist.<|end|><|endoftext|> 
<|user|> I need a Python script that can handle a list of strings representing file paths. The script should filter out any paths that don't exist or aren't files. It should also be able to handle paths with spaces by enclosing them in quotes. The script should be able to handle both absolute and relative paths. I want to use the `os` module for this.
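
In the output above, the prompt ends with `<|end|><|endoftext|>` and the model goes on to invent a new user turn instead of telling a joke. A likely cause is that the chat template was rendered without a generation prompt, so the conversation looks finished rather than waiting for the assistant. The sketch below is a simplified stand-in for a Phi-3-style template (the real one is a Jinja template shipped with the tokenizer); the function name `render_phi3_style` is hypothetical.

```python
def render_phi3_style(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Simplified stand-in for a Phi-3-style chat template, for illustration only."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    # Either open an assistant turn for the model to complete, or close the conversation
    return text + ("<|assistant|>\n" if add_generation_prompt else eos_token)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell a light-hearted joke for a Data Scientist."},
]

closed_prompt = render_phi3_style(chat)
open_prompt = render_phi3_style(chat, add_generation_prompt=True)

print(repr(closed_prompt[-30:]))
print(repr(open_prompt[-30:]))
```

With the real tokenizer, `apply_chat_template` accepts `add_generation_prompt=True`, which should leave the prompt ending in an open assistant turn for the model to complete.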


In [17]:
# Gemma's chat template does not accept a system role, so send only the user turn
message = [
    {'role': 'user', 'content': 'Tell a light-hearted joke for a Data Scientist.'},
]
generate(GEMMA2, message)


<bos><start_of_turn>user
Tell a light-hearted joke for a Data Scientist.<end_of_turn>
* **Why did the data scientist break up with the statistician?**
* **Because they had too many disagreements about the p-value!**


Let me know if you'd like to hear another joke! 😊 
<end_of_turn>
