<a href="https://colab.research.google.com/github/SaibalPatraDS/Hands-on-LLM/blob/main/Microsoft_Phi_3_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Microsoft `Phi-3` LLM

In [1]:
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"
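
After installing, it can be worth confirming that the environment actually meets the minimum versions named in the install command. The snippet below is a minimal sketch using only the standard library; the helper names `as_tuple` and `check_minimums` are hypothetical, and the version comparison is deliberately simplified (it only looks at the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the install command above.
MINIMUMS = {"transformers": "4.40.1", "accelerate": "0.27.2"}


def as_tuple(version_string):
    """Turn '4.40.1' (or '4.41.0.dev0') into a comparable tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version_string)[:3])


def check_minimums(minimums):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = (None, False)
            continue
        report[name] = (installed, as_tuple(installed) >= as_tuple(minimum))
    return report


for name, (installed, ok) in check_minimums(MINIMUMS).items():
    print(f"{name}: installed={installed} meets_minimum={ok}")
```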

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [6]:
# Load the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",       # place the model on the GPU
    torch_dtype="auto",      # use the dtype stored with the checkpoint
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct"
)

The cell above downloads two things:

1.   The generative model
2.   The tokenizer



In [7]:
# Size of the tokenizer's base vocabulary (tokens added on top are not counted)
print(f"Vocabulary size : {tokenizer.vocab_size}")

Vocabulary size : 32000


In [14]:
# Encode sample words; !r quotes each input so the whitespace-only strings stay visible
words = ["Hello", "World", "AI", "Learning", "A", "ai", "Ai", "aI", " ", "  ", "   "]
for word in words:
    print(f"Tokenized {word!r} : {tokenizer.encode(word)}")

Tokenized 'Hello' : [15043]
Tokenized 'World' : [2787]
Tokenized 'AI' : [319, 29902]
Tokenized 'Learning' : [29257]
Tokenized 'A' : [319]
Tokenized 'ai' : [7468]
Tokenized 'Ai' : [319, 29875]
Tokenized 'aI' : [263, 29902]
Tokenized ' ' : [259]
Tokenized '  ' : [1678]
Tokenized '   ' : [268]
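
The output above shows that some strings map to a single ID while others split into two, and that the same ID can recur (319 appears for both "A" and the first half of "AI"). The toy sketch below illustrates the general idea of subword splitting with a greedy longest-match rule. It is not the algorithm Phi-3's tokenizer uses; the function `toy_encode` is hypothetical, the IDs are copied from the output above, and which piece each ID stands for is inferred from that output (the real tokenizer's word-boundary marker is ignored).

```python
# Toy vocabulary: IDs copied from the output above; the piece-to-ID pairing is
# inferred from that output and ignores the real tokenizer's boundary marker.
TOY_VOCAB = {
    "Hello": 15043,
    "World": 2787,
    "A": 319,
    "I": 29902,
    "ai": 7468,
    "i": 29875,
    "a": 263,
}


def toy_encode(text, vocab):
    """Greedy longest-match: repeatedly take the longest vocabulary piece
    that prefixes the remaining text."""
    ids = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                ids.append(vocab[piece])
                position = end
                break
        else:
            # No piece, not even a single character, matched.
            raise ValueError(f"no vocabulary piece matches {text[position:]!r}")
    return ids


for word in ["Hello", "AI", "ai", "Ai", "aI"]:
    print(word, toy_encode(word, TOY_VOCAB))
```

With this small vocabulary the toy encoder reproduces the IDs recorded above for these five inputs, which is enough to see why "ai" stays whole while "AI", "Ai", and "aI" each split into two pieces.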


In [18]:
# Encode a sentence to token IDs, then decode the IDs back to text
sentence = "Myself Saibal Patra and currently I am working as an Analyst in a startup."
encoded_sentence = tokenizer.encode(sentence)
decoded_sentence = tokenizer.decode(encoded_sentence)
print(f"Actual Sentence : {sentence}")
print(f"Encoded Sentence : {encoded_sentence}")
print(f"Decoded Sentence : {decoded_sentence}")

Actual Sentence : Myself Saibal Patra and currently I am working as an Analyst in a startup.
Encoded Sentence : [28924, 761, 5701, 747, 284, 4121, 336, 322, 5279, 306, 626, 1985, 408, 385, 11597, 858, 297, 263, 20234, 29889]
Decoded Sentence : Myself Saibal Patra and currently I am working as an Analyst in a startup.


In [24]:
# Map each token ID back to the text piece it decodes to
tokens = [tokenizer.decode([token_id]) for token_id in encoded_sentence]
token_id_map = list(zip(encoded_sentence, tokens))

# Print each token ID next to its corresponding token
for token_id, token in token_id_map:
    print(f"Token-ID : {token_id} - token : {token}")

Token-ID : 28924 - token : Mys
Token-ID : 761 - token : elf
Token-ID : 5701 - token : Sa
Token-ID : 747 - token : ib
Token-ID : 284 - token : al
Token-ID : 4121 - token : Pat
Token-ID : 336 - token : ra
Token-ID : 322 - token : and
Token-ID : 5279 - token : currently
Token-ID : 306 - token : I
Token-ID : 626 - token : am
Token-ID : 1985 - token : working
Token-ID : 408 - token : as
Token-ID : 385 - token : an
Token-ID : 11597 - token : Anal
Token-ID : 858 - token : yst
Token-ID : 297 - token : in
Token-ID : 263 - token : a
Token-ID : 20234 - token : startup
Token-ID : 29889 - token : .
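
Decoding one ID at a time, as above, gives readable text but hides where words begin: "Mys" and "elf" look the same as "and" even though only some pieces start a new word. A possible complement is to look at the raw vocabulary pieces through `convert_ids_to_tokens`, which for SentencePiece-style tokenizers typically keep a leading marker on word-initial pieces. The helper name `token_table` is hypothetical; this is a sketch that assumes the `tokenizer` object loaded earlier.

```python
def token_table(tokenizer, text):
    """Pair each token ID with its raw vocabulary piece (not the decoded text)."""
    token_ids = tokenizer.encode(text)
    # convert_ids_to_tokens returns the stored pieces, including any
    # word-boundary marker the vocabulary uses.
    pieces = tokenizer.convert_ids_to_tokens(token_ids)
    return list(zip(token_ids, pieces))


# Usage with the objects defined in the notebook:
# for token_id, piece in token_table(tokenizer, sentence):
#     print(token_id, repr(piece))
```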


In [26]:
# Build a text-generation pipeline
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the newly generated text, not the prompt
    max_new_tokens=500,      # upper bound on the number of generated tokens
    do_sample=False,         # greedy decoding: always pick the most probable next token
)
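
To make the effect of `do_sample=False` concrete, the sketch below contrasts greedy selection with sampling on a single next-token step. The probability table is invented purely for illustration and does not come from the model; `pick_greedy` and `pick_sampled` are hypothetical helper names.

```python
import random

# Hypothetical next-token probabilities, invented for illustration only.
NEXT_TOKEN_PROBS = {"herd": 0.55, "flock": 0.30, "GPS": 0.15}


def pick_greedy(probs):
    """Mirror do_sample=False: always return the highest-probability token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Mirror do_sample=True: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


print(pick_greedy(NEXT_TOKEN_PROBS))

rng = random.Random(0)  # seeded so repeated runs draw the same sequence
print([pick_sampled(NEXT_TOKEN_PROBS, rng) for _ in range(5)])
```

Greedy selection returns the same token every time for the same probabilities, which is why a pipeline configured this way should give repeatable output for a fixed prompt; sampling can return any token with non-zero probability.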

In [36]:
# Chat-style prompt: a system message followed by the user's request
messages = [
    {"role": "system", "content": "You are standup comedian"},
    {"role": "user", "content": "Create a funny joke about Goats."},
]
# Generate a reply and print only the generated text
output = generator(messages)
print(output[0]["generated_text"])


 Why don't goats ever get lost? Because they always follow their herd instincts!
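
When the pipeline receives a list of role/content messages, it is expected to turn them into a single prompt string using the tokenizer's chat template before generating. To inspect that string directly, something like the following sketch could be used. The helper name `render_prompt` is hypothetical, and the snippet assumes the `tokenizer` and `messages` objects from the cells above.

```python
def render_prompt(tokenizer, messages):
    """Return the single prompt string the chat template builds from `messages`."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token IDs
        add_generation_prompt=True,  # end with the marker that cues the assistant's turn
    )


# Usage with the objects defined in the notebook:
# print(render_prompt(tokenizer, messages))
```

Printing the rendered prompt is a quick way to check that the system and user messages appear where the model expects them.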
