# How to run an LLM locally

You can run an LLM locally either through code or through an application such as oobabooga or LM Studio. The model can be one you built and trained yourself or a downloaded pre-trained LLM. The instructions below run a pre-trained LLM through code.

### Download prerequisites

Each prerequisite can be downloaded from the linked site:
- [Jupyter Notebook](https://jupyter.org/install)
- [Python 3.7 or later](https://www.python.org/downloads/)
- [PyTorch](https://pytorch.org/)
- [pip](https://pypi.org/project/pip/)
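
Before continuing, it can help to confirm that the prerequisites are visible to the Python interpreter your notebook uses. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical, and the default package list simply mirrors the packages this tutorial imports.

```python
import importlib.util
import sys


def check_prerequisites(packages=("torch", "transformers")):
    """Report the Python version and whether each package can be found."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_prerequisites())
```

A `False` entry for `transformers` is expected until the install cell has been run.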

In [1]:
# Install the Hugging Face Transformers library
%pip install transformers

Note: you may need to restart the kernel to use updated packages.



### Import libraries

In [2]:
# Import the needed libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

### Download model

This tutorial runs GPT-2 locally. If you would like to run a different LLM, select a model from the [Hugging Face Transformers library documentation](https://huggingface.co/docs/transformers/index).

In [3]:
# Load a pre-trained LLM and tokenizer
model_name = "gpt2" # Replace with name of the LLM you want to run
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development



### Choose device for model

The following commands select whether your model runs on the CPU or on a GPU. PyTorch refers to NVIDIA GPUs as `cuda`, after NVIDIA's CUDA platform. Note that the CPU-only and CUDA builds of PyTorch are installed differently. If the code below raises any errors, check that you have the correct PyTorch installation.

In [4]:
# Define the device to run the model (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device) # Move the model's weights to the selected device
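
On an Apple Silicon Mac, PyTorch 1.12 and later also expose an `mps` backend. The sketch below extends the device choice above to try it. The helper name `pick_device` is hypothetical and the fallback order is an assumption. It returns a plain string so that it also works as a quick check when PyTorch is not installed.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so report the CPU
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no torch.backends.mps, hence the getattr guard
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string can be passed to `torch.device(...)` in place of the conditional expression used above.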

### Feed input to model

Create the input for your LLM and tokenize it. Tokenizing splits the text into “tokens” and maps each one to an integer ID that the model understands. The token tensors are then placed on your selected device (CPU or GPU).

In [5]:
input_text = "Once upon a time"
input_tokens = tokenizer(input_text, return_tensors="pt").to(device)
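
To see what the tokenizer produced, you can map the integer IDs back to their token strings. This is a minimal sketch, assuming `tokenizer` is the object loaded earlier. The helper name `describe_tokens` is hypothetical, while `convert_ids_to_tokens` is a standard Hugging Face tokenizer method.

```python
def describe_tokens(tokenizer, text):
    """Return (token_id, token_string) pairs for the given text."""
    # Calling the tokenizer without return_tensors gives plain Python lists
    token_ids = tokenizer(text)["input_ids"]
    token_strings = tokenizer.convert_ids_to_tokens(token_ids)
    return list(zip(token_ids, token_strings))


# Usage with the tokenizer loaded earlier in this tutorial:
# for token_id, token in describe_tokens(tokenizer, input_text):
#     print(token_id, repr(token))
```

With GPT-2, tokens that begin a new word are shown with a leading `Ġ`, which stands for the preceding space.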

Feed the input tokens into your model.

In [6]:
with torch.no_grad(): # Disables gradient tracking, which is only needed for training
    output_tokens = model.generate(input_tokens["input_ids"], num_return_sequences=1)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
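
The warnings above appear because only `input_ids` was passed to `generate`. One way to avoid them, sketched here under the assumption that `model`, `tokenizer`, and `device` are the objects defined earlier, is to unpack the full tokenizer output so that `attention_mask` is included, and to set `pad_token_id` explicitly. The helper name `generate_ids` is hypothetical.

```python
def generate_ids(model, tokenizer, prompt, device, **generate_kwargs):
    """Tokenize the prompt and generate with an explicit attention mask."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # GPT-2 has no pad token of its own, so reuse the end-of-sequence token
    generate_kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)
    # Unpacking inputs passes both input_ids and attention_mask
    return model.generate(**inputs, **generate_kwargs)


# Usage with the objects defined earlier in this tutorial:
# with torch.no_grad():
#     output_tokens = generate_ids(model, tokenizer, "Once upon a time", device)
```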


### Get model output

Decode the generated output tokens.

In [7]:
generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(generated_text)

Once upon a time, the world was a place of great beauty and great danger. The world was
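
The text stops mid-sentence because no length was given, so `generate` falls back to a short default limit. At the time of writing that default is `max_length=20` tokens including the prompt, but treat the number as an assumption for your installed version. The sketch below collects a few commonly used `generate` arguments in a dictionary. The specific values are illustrative assumptions to experiment with, not settings from the original run.

```python
# Illustrative generation settings; the values are assumptions to experiment with
generation_kwargs = {
    "max_new_tokens": 50,  # generate up to 50 tokens beyond the prompt
    "do_sample": True,     # sample instead of always taking the most likely token
    "top_k": 50,           # sample only from the 50 most likely next tokens
    "temperature": 0.8,    # values below 1.0 make the output less random
}

# Usage with the objects defined earlier in this tutorial:
# with torch.no_grad():
#     output_tokens = model.generate(**input_tokens, **generation_kwargs)
# print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
print(generation_kwargs)
```

Because `do_sample=True` draws tokens at random, each run can produce a different continuation.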
