## LLM inference

We plan to load the `Llama-3.2-1B-Instruct` model from *HuggingFace* and run inference: we enter a prompt and the model generates a response.

We start the tutorial with the Python environment and packages. The Python environment refers to the setup in which we run our code, including the interpreter, the libraries, and the tools that make development easier.

### Environment and packages
The Python environment is like your toolbox. By itself, it only has a few basic tools. When you want to do more, such as calculating the sum, mean, or variance, you need to add new tools. These extra tools come in packages. A package is like a small toolbox dedicated to a certain task. By downloading and installing the corresponding packages, you equip your Python environment with everything needed to complete your project.
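
As a small illustration of the toolbox idea, the sketch below checks whether a package is already available in the current environment. The helper name `is_installed` is made up for this example; it relies only on the standard-library function `importlib.util.find_spec`.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# "statistics" ships with Python; "transformers" must be installed separately.
for name in ["statistics", "transformers"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, installing it with `pip` adds it to the environment.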

The basic package used for LLMs is `transformers`. You can run the following command to install it.

In [None]:
!pip install -U transformers

*HuggingFace* is an open-source platform, similar to GitHub, that hosts models, datasets, and papers. You need to sign up for an account and create a new access token on the [tokens page](https://huggingface.co/settings/tokens).

To load the model from *HuggingFace*, fill in the access request form on the [model page](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct). It may take a few minutes for the request to be processed and access to be granted. Then, log in with `huggingface-cli` using your own token.


In [None]:
!huggingface-cli login

### Now, you can load the model from *HuggingFace*

We load the tokenizer and model of `Llama-3.2-1B-Instruct`.

In [None]:
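from transformers import AutoModelForCausalLM, AutoTokenizer

# Model identifier on HuggingFace (requires approved access and a logged-in token)
model_name = "meta-llama/Llama-3.2-1B-Instruct"
# The tokenizer converts text to token indices and back
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The causal language model predicts the next token given the previous ones
model = AutoModelForCausalLM.from_pretrained(model_name)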

The inference procedure can be divided into three parts.

First, the natural language input needs to be converted into a form that the model understands. This is the job of the tokenizer: it splits sentences into tokens and maps them to the corresponding indices. For example, the sentence `"I want to learn more about AI"` might be converted into a list of numbers like `[42, 103, 88, 44]` (the numbers here are made up for illustration).

Second, we feed these token indices into the model and obtain the output using the `model.generate()` function. The model may produce another list of numbers, such as `[42, 103, 88, 44, 205, 77]`, which contains the input tokens followed by the newly predicted tokens.

Finally, we decode this output back into human-readable text using `tokenizer.decode()`. In this example, `[42, 103, 88, 44, 205, 77]` would be decoded into `"I want to learn more about AI and NLP."`.
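
The three steps above can be sketched with a toy example that runs without downloading any model. Everything here is hypothetical: the ten-word vocabulary, the `encode` and `decode` helpers, and `toy_generate`, which stands in for `model.generate()` by appending a fixed continuation. A real tokenizer works on subword units and has a much larger vocabulary.

```python
# Toy word-level vocabulary (hypothetical; not the Llama vocabulary).
vocab = ["I", "want", "to", "learn", "more", "about", "AI", "and", "NLP", "."]
token_to_id = {token: idx for idx, token in enumerate(vocab)}

def encode(text: str) -> list:
    """Step 1: split on whitespace and map each token to its index."""
    return [token_to_id[token] for token in text.split()]

def toy_generate(input_ids: list) -> list:
    """Step 2: stand-in for model.generate(); appends a fixed continuation."""
    continuation = [token_to_id["and"], token_to_id["NLP"], token_to_id["."]]
    return input_ids + continuation

def decode(ids: list) -> str:
    """Step 3: map indices back to tokens and join them with spaces."""
    return " ".join(vocab[i] for i in ids)

input_ids = encode("I want to learn more about AI")
output_ids = toy_generate(input_ids)

print(input_ids)
print(output_ids)
print(decode(output_ids))
```

As in the real pipeline, the output sequence starts with the input indices and continues with the newly added ones.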


In [None]:
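input_text = "Hello, I'm your TA. Welcome to CS4100 and welcome back to the campus!"
# Step 1: tokenize the prompt and return PyTorch tensors
inputs = tokenizer(input_text, return_tensors="pt")

# Step 2: generate up to 100 new tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=100)
# Step 3: decode the first (and only) sequence back into text
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated answer:", output_text)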

You can also call the model directly and read `.logits` to get the raw output logits, that is, the scores before the softmax layer. The `inputs` here contains two keys: `"input_ids"` and `"attention_mask"`. `"attention_mask"` is related to padding. We don't use any padding in this part, so we focus only on `"input_ids"`. It is an ordered list of token indices that represents the whole input sentence. These logits are what the model uses to infer the output in the response above.

In [None]:
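input_text = "Hello, I'm your TA. Welcome to CS4100 and welcome back to the campus!"
inputs = tokenizer(input_text, return_tensors="pt")

# A direct forward pass returns one score per vocabulary entry at each position
outputs = model(inputs["input_ids"])
logits = outputs.logits
# Shape: (batch_size, sequence_length, vocab_size)
print(logits.size())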
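
To illustrate how logits can be turned into a predicted token, here is a minimal sketch in plain Python. The four logit values are made up and stand for the scores at the last position over a tiny 4-token vocabulary. The sketch applies softmax and then picks the most probable index, which is known as greedy decoding. This is only one possible strategy; `model.generate()` may sample from the distribution instead, depending on its generation settings.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the last position over a 4-token vocabulary.
last_position_logits = [1.0, 3.0, 0.5, 2.0]

probs = softmax(last_position_logits)
# Greedy choice: the index with the highest probability.
next_token_id = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print("next token id:", next_token_id)
```

With the real tensor from the cell above, the analogous PyTorch calls would be `torch.softmax(logits[0, -1], dim=-1)` followed by `.argmax()`.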