# Running Llama locally

In this notebook we load the Llama 3.2 1B (1 billion parameter) model and use it for text completion.

In [7]:
import os

os.chdir('/home/matt/.llama/checkpoints/Llama3.2-1B')
os.getcwd()

'/home/matt/.llama/checkpoints/Llama3.2-1B'

In [8]:
os.chdir('/home/matt/.llama/checkpoints')

In [9]:
if False:  # Disabled: it would be great to run this, but it runs out of memory (OOM)

    from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM
    import torch

    model_path = 'Llama3.2-1B'
    # See https://stackoverflow.com/a/78911943

    # Load the tokenizer directly from the model path
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    print("tokenizer loaded")

    # Load model configuration from params.json
    config = LlamaConfig.from_json_file(f'{model_path}/params.json')
    print("config loaded")

    # Instantiate the model from the config (weights are randomly initialized at this point)
    model = LlamaForCausalLM(config=config)
    print("model loaded")

    # Load the weights of the model
    state_dict = torch.load(f'{model_path}/consolidated.00.pth', map_location=torch.device('cpu'))
    model.load_state_dict(state_dict)
    print("weights loaded")

    model.eval()
    print("eval called")
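
A rough estimate suggests why the disabled cell above may run out of memory. `LlamaForCausalLM(config=config)` allocates a full set of randomly initialized weights (float32 unless told otherwise), and `torch.load` then reads a second copy of the weights before `load_state_dict` copies them across. The sketch below only estimates the size of the weights themselves. It assumes roughly 1.24 billion parameters for the 1B model; that count and the helper name `estimate_weight_gib` are assumptions for illustration, not values read from the checkpoint. A further possible contributor: `params.json` uses Meta's native key names, which may not map onto `LlamaConfig` fields, in which case the config could fall back to its much larger defaults.

```python
# Bytes used by one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_weight_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: about 1.24e9 parameters for the 1B model.
NUM_PARAMS = 1_240_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimate_weight_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Activations, the tokenizer, and the Python process itself come on top of this, so the figures are a lower bound.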

## Converting from the default file download format

When downloaded from llama.com, the files look like this:

```
checklist.chk  config.json  consolidated.00.pth  params.json  tokenizer.model
```

We want them in the Hugging Face format. To convert them, I ran this script from the `transformers` package (included here for convenience):

```bash
python3 convert_llama_to_hf.py --input_dir /home/matt/.llama/checkpoints/Llama3.2-1B --model_size 1B --output_dir /home/matt/.llama/checkpoints/Llama3.2-1B-hf --llama_version 3.2
```

That then populates the output directory with the desired files, which look like:

```
config.json  generation_config.json  model.safetensors  special_tokens_map.json  tokenizer.json  tokenizer_config.json
```
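
Before loading the converted model, it can help to confirm that the conversion produced every file in the listing above. This is a minimal sketch: the helper name `missing_hf_files` is hypothetical, and the relative path assumes the working directory is the `checkpoints` folder, as set earlier in the notebook.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_HF_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_hf_files(model_dir: str) -> list[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_HF_FILES if not (root / name).is_file()]


missing = missing_hf_files("Llama3.2-1B-hf")
print("missing files:", missing or "none")
```

If the directory does not exist at all, every expected file is reported as missing rather than raising an error.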

In [10]:
from transformers import AutoTokenizer
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

model_path = "Llama3.2-1B-hf"

tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(model_path)
print("tokenizer loaded")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenizer loaded
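
Because the pad token falls back to the EOS token in this cell, padding and EOS share one id. The sketch below uses plain Python lists and made-up token ids to illustrate one consequence: an attention mask derived by comparing ids against the pad id would also hide a genuine EOS token, whereas a mask built from the known sequence length would not. The ids and both helper names are hypothetical.

```python
PAD_ID = EOS_ID = 2  # hypothetical id shared by pad and EOS


def mask_from_pad_id(ids, pad_id):
    """Mark every token equal to pad_id as padding (0), the rest as real (1)."""
    return [0 if tok == pad_id else 1 for tok in ids]


def mask_from_length(real_len, total_len):
    """Mark the first real_len positions as real (1), the rest as padding (0)."""
    return [1] * real_len + [0] * (total_len - real_len)


# A sequence that ends with a genuine EOS, followed by two padding tokens.
ids = [11, 12, 13, EOS_ID, PAD_ID, PAD_ID]

print(mask_from_pad_id(ids, PAD_ID))  # the genuine EOS at index 3 is masked too
print(mask_from_length(4, len(ids)))  # only the two padding positions are masked
```

The tokenizer returns its own `attention_mask` built from what it actually padded, which avoids this ambiguity.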


In [11]:
from transformers import AutoModelForCausalLM

model_path = "Llama3.2-1B-hf"

# Load the model in the dtype saved with the checkpoint instead of defaulting to float32
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",  # use the dtype recorded in the checkpoint
    low_cpu_mem_usage=True,  # avoid holding an extra full copy of the weights in RAM while loading
    device_map="auto",  # place layers on GPU/CPU based on available memory (requires `accelerate`)
)

print("model loaded")


model loaded


In [12]:
import torch

input_text = "hello how are you?"

# Tokenize; the tokenizer already returns an attention_mask that matches its padding
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
inputs = inputs.to(model.device)  # device_map="auto" may have placed the model on a GPU

with torch.no_grad():  # no gradients are needed for inference
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=150,  # total length in tokens, prompt included
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
    )

output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output)

hello how are you? i am ron. i work in a hotel. i want to travel. i am from kenya.
Hi I'm Raph, 20 and I live in Scotland. I was in school but have stopped to do my life now. I've been in Germany for a year now and I love it.
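
One detail of the `generate` call worth spelling out: `max_length` caps the prompt and the generated tokens together, so the room left for new text shrinks as the prompt grows (`max_new_tokens` is the alternative that counts only generated tokens). A small sketch of the arithmetic follows; the helper name and the prompt length of 6 tokens are assumptions for illustration, since the real count depends on the tokenizer.

```python
def new_token_budget(max_length: int, prompt_len: int) -> int:
    """Tokens left for generation when max_length caps prompt + output."""
    return max(0, max_length - prompt_len)


# Hypothetical prompt length; the real count depends on the tokenizer.
prompt_len = 6
print(new_token_budget(150, prompt_len))
```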
