In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "Vikhrmodels/Vikhr-Llama-3.2-1B-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [4]:
prompt = """|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|>\n\n<|start_header_id|>assistantj<|end_header_id|>

{OUTPUT}"""

In [5]:
# Example of inference
inputs = tokenizer(
[
    prompt.format(
        SYSTEM = """Environment: ipython
Tools: brave_search, wolfram_alpha
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant.""", 
        INPUT = "Say somethign mean", # input, the output should show the guardrail working (refusal)
        OUTPUT = "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

# we get the outputs from the model here

outputs = model.generate(**inputs, max_new_tokens = 100, use_cache = True)

# batch decoding
text = tokenizer.batch_decode(outputs)[0]

print(text)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|>|start_header_id|>system<|end_header_id|>

Environment: ipython
Tools: brave_search, wolfram_alpha
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant.<|eot_id|>

<|start_header_id|>user<|end_header_id|>

Say somethign mean<|eot_id|>

<|start_header_id|>assistantj<|end_header_id|>

I'm sorry, but I cannot fulfill that request. If you have any other questions or need information on a different topic, feel free to ask!<|eot_id|>


In [6]:
# Find the last `[eot]` token
last_eot = text.rfind("<|eot_id|>")

# Find the second-to-last `[eot]` token (search up to the last one)
second_last_eot = text.rfind("<|end_header_id|>")

# Extract the content between them
if second_last_eot != -1 and last_eot != -1:
    content_between = text[second_last_eot + len("<|end_header_id|>"):last_eot].strip()
    print(content_between)
else:
    print("Not enough [eot] tokens found.")

print("Llama3.1 Demo! (Type 'exit' to stop)")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Goodbye!")
        break

    
    # IS THIS RIGHT???? PAY CLOSE ATTENTION.
    text = text + f"""<|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>{user_input}<|eot_id|>\n\n<|start_header_id|>assistantj<|end_header_id|>"""
    
    batch = []

    # batch is one, we will be serving one person only. 
    for k in range(1):
        batch.append(text)
    inputs = tokenizer(batch, return_tensors = "pt").to("cuda")
    start_time = time.time()
    outputs = model.generate(**inputs, max_new_tokens = 10000, use_cache = True, temperature=0.8)
    end_time = time.time()
    num_tokens = outputs.shape[-1]  # Number of tokens in the output
    time_taken = end_time - start_time
    tokens_per_second = num_tokens / time_taken

    result = tokenizer.batch_decode(outputs)

    text = result[0]

    last_eot = text.rfind("<|eot_id|>")
    
    
    # Find the second-to-last `[eot]` token (search up to the last one)
    second_last_eot = text.rfind("<|end_header_id|>")

    # Extract the content between them
    if second_last_eot != -1 and last_eot != -1:
        content_between = text[second_last_eot + len("<|end_header_id|>"):last_eot].strip()
        
        # THIS IS WHERE THE OUTPUT IS PRINTED TO THE TERMINAL
        print(content_between)
    else:
        print("Not enough [eot] tokens found.")
    print(f"----------\nTokens per second:{tokens_per_second}")
    final_time = time.time()
    difference = final_time - end_time
    print(f"\nPost Processing Time: {difference}\n")


I'm sorry, but I cannot fulfill that request. If you have any other questions or need information on a different topic, feel free to ask!
Llama3.1 Demo! (Type 'exit' to stop)


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Hello! It seems like you might be looking for a greeting. If you have any other questions or need assistance with something else, please feel free to ask, and I'll do my best to help!
----------
Tokens per second:98.95377818894545

Post Processing Time: 0.0020036697387695312



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Hello! If you need assistance or have a question, feel free to ask!
----------
Tokens per second:225.07635897827973

Post Processing Time: 0.004803895950317383



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Hello! What would you like to know or discuss today?
----------
Tokens per second:254.99548748461643

Post Processing Time: 0.0



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Certainly! Calculus is a branch of mathematics that deals with the study of rates of change. There are two main branches of calculus:

1. **Differential Calculus**:
   - **Differential calculus** is the study of rates of change, which includes functions and rates at a point. It is a fundamental tool in mathematics for modeling and analyzing dynamic systems.

   - **Limits** of a function and **limits** are the fundamental concepts in differential calculus.
   - **Derivatives** are the rate of change of a function with respect to its variables. 
   - **Integrals** are the accumulation of a function's value at a point, which can be thought of as the area under a curve.

2. **Integral Calculus**:
   - **Integral calculus** is the study of accumulation, which is the study of how a quantity changes over a distance. It is used in physics, engineering, economics, and many other fields.

- **Definite integral** is the integral that sums up a quantity over a specific interval.
- **Indeterminate