https://github.com/meta-llama/llama-recipes

#### How to get Llama
- Request access directly from Meta
- Use the converted models on the Hugging Face Hub

## **Using the converted models on the Hugging Face Hub**
The Hugging Face Hub hosts Meta Llama weights that have been converted and uploaded as models. Users can download these converted models and use them right away.

#### Steps:
1. **Log in to the Hugging Face Hub**:
   Downloading the model from the Hugging Face Hub requires logging in to a Hugging Face account. The `huggingface_hub` library can be used for this.
   ```python
   from huggingface_hub import login
   login()  # Prompts for a Hugging Face access token when run
   ```

2. **Load the model and tokenizer**:
   Download a Meta Llama model that has been uploaded to Hugging Face. For example, the model named `meta-llama/Meta-Llama-3.1-8B-Instruct` can be loaded as follows:
   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer

   # Keep the repo id in its own variable so the tokenizer is not handed the model object.
   model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
   model = AutoModelForCausalLM.from_pretrained(model_id)
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   ```

3. **Set up a pipeline**:
   A text-generation pipeline can be set up to use the model:
   ```python
   from transformers import pipeline

   generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
   input_text = "Once upon a time,"
   outputs = generator(input_text, max_length=50, num_return_sequences=1)
   print(outputs)
   ```

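This approach is convenient because models uploaded to the Hugging Face Hub can be used directly, with no weight conversion step. However, the models available to the user are limited to the converted models that have been uploaded to the Hub.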

---

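### Comparison of the two methods

| Method | Pros | Cons |
| --- | --- | --- |
| **Download directly from Meta and convert** | Access to the latest model weights | Requires a weight download request; the conversion process is complex |
| **Use from the Hugging Face Hub** | Usable immediately without conversion; simple | Limited to models uploaded to the Hub; more restricted than Meta's own weight distribution |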



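## Fetching authorized Llama models uploaded to the Hugging Face Hub
- Separately from downloading directly from Meta, a model that has been uploaded to the Hugging Face Hub can be downloaded and used easily after logging in.
- The model must be uploaded to the Hugging Face Hub, and the user must have been granted access to it.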

In [1]:
%pip install --upgrade transformers       ## Llama 3 models require a recent transformers release
%pip install accelerate

Collecting transformers
  Using cached transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Using cached transformers-4.45.2-py3-none-any.whl (9.9 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.44.2
    Uninstalling transformers-4.44.2:
      Successfully uninstalled transformers-4.44.2
Successfully installed transformers-4.45.2

In [2]:
%pip install --upgrade huggingface_hub

In [3]:
from transformers import AutoTokenizer
import transformers
import torch

### Create a Hugging Face token
- Settings / Access Tokens / Create token

### Request access to the model
- On the Hugging Face Models page, find "Llama-3.1-8B-Instruct" and request access
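
For non-interactive runs, the token created above can be read from an environment variable instead of being typed into the `login()` prompt. The sketch below assumes the token is stored in `HF_TOKEN` (the name used by the Colab warning later in this notebook). `resolve_hf_token` and `login_from_env` are hypothetical helpers written for illustration, not part of `huggingface_hub`.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env(env=None):
    """Log in with the HF_TOKEN value; return False when no token is available."""
    token = resolve_hf_token(env)
    if token is None:
        return False
    # Imported lazily so the helpers can be used without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

If `login_from_env()` returns `False`, the interactive `login()` call in the next cell is still available as a fallback.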

In [10]:
from huggingface_hub import login
login()

In [5]:
model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [6]:
pipeline=transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]
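
The four shard sizes in the download log above add up to roughly what half-precision weights for an 8-billion-parameter model would occupy. The sketch below works through that arithmetic. The nominal parameter count of `8e9` is an assumption taken from the model name, 1 GB is treated as 1e9 bytes, and the estimate covers weights only (no activations or cache).

```python
# Shard sizes in GB, copied from the download log above.
SHARD_SIZES_GB = [4.98, 5.00, 4.92, 1.17]

# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_gb(num_params, dtype):
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


downloaded_gb = round(sum(SHARD_SIZES_GB), 2)
estimated_gb = weight_size_gb(8e9, "float16")  # 8e9 is an assumed nominal count

print(f"downloaded: {downloaded_gb} GB, estimated float16 weights: {estimated_gb} GB")
```

The same helper suggests that loading in `float32` would need about twice as much memory, which is one reason the cell above passes `torch_dtype=torch.float16`.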



In [7]:
sequences=pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Result: I have tomatoes, basil and cheese at home. What can I cook for dinner?
You can make a classic Bruschetta, a simple and delicious Italian appetizer or side dish. Here's a quick recipe:
Ingredients:
* 4-6 tomatoes, diced
* 1/4 cup fresh basil leaves, chopped
* 1/2 cup grated cheese (Parmesan or Mozzarella work well)
* 1 baguette, sliced into 1/2-inch thick rounds
* Salt and pepper to taste
* Olive oil for brushing

Instructions:
1. Preheat your oven to 400°F (200°C).
2. Brush the baguette slices with olive oil and toast in the oven for 5-7 minutes, or until lightly browned.
3. In a bowl, mix together the diced tomatoes, chopped basil, and grated cheese.
4. Remove the toasted bread from the oven and let it cool for a minute or two.
5. Top each bread slice with a spoonful of the tomato-basil mixture.
6. Season with salt and pepper to taste.
7. Serve immediately and enjoy!

You can also use this as a base and add some protein like grilled chicken or salmon to make it a more substant
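
The result above stops mid-word, most likely because `max_length=400` caps the prompt tokens and the generated tokens together, whereas `max_new_tokens` (used in a later cell) caps only the newly generated tokens. The sketch below illustrates the difference in budget. `generation_budget` is a hypothetical helper, and the prompt length of 18 tokens is a made-up figure for illustration, not the tokenizer's actual count.

```python
def generation_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Return how many new tokens may be generated under either limit."""
    if max_new_tokens is not None:
        # max_new_tokens ignores the prompt length entirely.
        return max_new_tokens
    if max_length is None:
        raise ValueError("set max_length or max_new_tokens")
    # max_length counts the prompt too, so the prompt eats into the budget.
    return max(max_length - prompt_tokens, 0)


assumed_prompt_tokens = 18  # illustrative value only
print(generation_budget(assumed_prompt_tokens, max_length=400))
print(generation_budget(assumed_prompt_tokens, max_new_tokens=256))
```

With a long prompt, `max_length` can leave little or no room for new text, so `max_new_tokens` is usually the easier limit to reason about.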

## Llama 3.2-1B

#########
Below is the code I received from Myeongsin

In [None]:
model2= "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model2)  # load the tokenizer that matches model2

In [None]:
pipeline=transformers.pipeline(
    "text-generation",
    model=model2,
    torch_dtype=torch.float16,
    device_map="auto",
)

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

In [None]:
sequences=pipeline(
    'Dogs are \n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=100,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

#########
Below is what I ran myself

In [8]:
from transformers import pipeline
pipe=pipeline("text-generation", model="meta-llama/Llama-3.2-1B")


config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [9]:
sequences=pipe(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Result: I have tomatoes, basil and cheese at home. What can I cook for dinner?
You can make a salad with tomatoes, basil, cheese and olive oil. Or you can make a pasta dish with tomatoes, basil, cheese and olive oil. Or you can make a sandwich with tomatoes, basil, cheese and olive oil. You can make a soup with tomatoes, basil, cheese and olive oil. You can make a pizza with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can make a cake with tomatoes, basil, cheese and olive oil. You can

## Llama 3.2-3B

In [17]:
import torch
from transformers import pipeline

model_id="meta-llama/Llama-3.2-3B-Instruct"
pipe=pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages=[
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
outputs=pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
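
When `pipe` receives a list of role/content messages, the tokenizer's chat template turns them into a single prompt string before generation. The sketch below is a simplified, hand-written approximation of the Llama 3 header layout, shown only to make that structure visible. `render_llama3_chat` is a hypothetical helper; the template shipped with the model may add further content to the system block, and `tokenizer.apply_chat_template` remains the authoritative way to build the prompt.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = render_llama3_chat(messages)
print(prompt)
```

This also explains why the cell above prints `outputs[0]["generated_text"][-1]`: the pipeline returns the whole conversation, and the last entry is the newly generated assistant turn.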


In [None]:
sequences=pipe(  # call the pipeline object created above; `pipeline` is now the factory function
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True, # Sample the next token instead of always taking the most likely one, which gives more varied responses
    top_k=10, # At each step, choose among the 10 most likely tokens
    num_return_sequences=1, # Return a single generated sequence
    eos_token_id=pipe.tokenizer.eos_token_id, # ID of the end-of-sequence token; generation stops when the model emits it
    truncation=True,
    max_length=150,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
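
The `do_sample=True` and `top_k=10` arguments in the call above select top-k sampling. The sketch below shows the idea in plain Python on a toy list of scores: keep the k highest logits, turn them into probabilities with a softmax, and draw one index. It is an illustration under those assumptions, not the `transformers` implementation, and `top_k_sample` is a hypothetical name.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # toy scores for a five-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among the two best can appear
```

Setting `k=1` reduces this to greedy decoding, since the single best token is always chosen.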

## Llama 3.2-11B-Vision

In [None]:
%pip install --upgrade transformers

In [None]:
###### The cell below was not run because of resource constraints

In [None]:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",      ## Automatically detect and assign the available GPUs
)
processor = AutoProcessor.from_pretrained(model_id) # Preprocesses input data into the form the model expects
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw) # Download the image from the URL and open it as a PIL.Image object

# This prompt nudges the model toward a poetic description of the image
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
inputs = processor(image, prompt, return_tensors="pt").to(model.device) # The processor converts the image and prompt into tensors the model can consume

output = model.generate(**inputs, max_new_tokens=30) # Generate up to 30 new tokens conditioned on the image and prompt
print(processor.decode(output[0])) # Decode the generated tokens into readable text and print it