# RLHF

### 0. Environment Setup

In [1]:
!pip install transformers torch stable-baselines3

Collecting stable-baselines3
  Downloading stable_baselines3-2.7.0-py3-none-any.whl.metadata (4.8 kB)
Downloading stable_baselines3-2.7.0-py3-none-any.whl (187 kB)
Installing collected packages: stable-baselines3
Successfully installed stable-baselines3-2.7.0


### 1. Loading the LLM and Generating Text

In [4]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

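# Generate one continuation of the prompt and return it as a string.
# Note: the warning in the next cell's output shows that max_new_tokens (=256)
# is also set and takes precedence over max_length in this transformers version.
def generate_text(prompt, max_length=150):
  response = generator(prompt, max_length=max_length, num_return_sequences=1)
  return response[0]['generated_text']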

Device set to use cuda:0


In [5]:
prompt = "This is sunny day, and"
print(generate_text(prompt))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


This is sunny day, and we are all on the bus.

There is a man in the group, and he has his arms folded behind his back. He is standing in front of the vehicle. He is wearing a grey uniform. He has an expression of a man in mourning, and his face is painted with a red cross. His head is covered with a red cross. He leans forward. I hear a voice.

I look up.

"Ya know that."

"I'm sure. I'm sure."

"You know, this is how it's always been."

"It's always been, and it always will be."

It's been so long since I last saw it. It's been so long since I had seen it that I have to stop.

It's been so long since I have had a chance to see it.

I'm going to call him.

We're going to be together.

I'm going to make him see it.

We'll be together.

I'm going to make him feel so safe that he can't let his guard down.

I'm going to make him feel so safe that he can't let his guard down.
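
The warning above says `max_new_tokens` overrides `max_length`, so the 150-token limit is not applied. A minimal sketch of one way to avoid the conflict is to pass `max_new_tokens` directly. Here `generate_text` takes the generator as an argument, and `fake_generator` is a hypothetical stand-in (not part of the notebook) so the snippet runs without downloading a model.

```python
def generate_text(generator, prompt, max_new_tokens=150):
    """Return one continuation, limiting only the newly generated tokens."""
    response = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # counts generated tokens, not the prompt
        num_return_sequences=1,
    )
    return response[0]["generated_text"]


# Hypothetical stand-in that mimics the pipeline's return shape.
def fake_generator(prompt, max_new_tokens, num_return_sequences):
    return [{"generated_text": prompt + " ..."}]


print(generate_text(fake_generator, "This is sunny day, and"))
```

With the real pipeline from the cell above, the call would be `generate_text(generator, prompt)`.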


### 2. Creating a Feedback Environment for Reinforcement Learning

In [6]:
!pip install 'shimmy>=2.0'

Collecting shimmy>=2.0
  Downloading Shimmy-2.0.0-py3-none-any.whl.metadata (3.5 kB)
Downloading Shimmy-2.0.0-py3-none-any.whl (30 kB)
Installing collected packages: shimmy
Successfully installed shimmy-2.0.0


In [9]:
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class ContentFeedbackEnv(gym.Env):
  """Toy environment that turns user feedback on generated content into rewards."""

  def __init__(self):
    super().__init__()
    self.action_space = gym.spaces.Discrete(3) # 0: dislike, 1: like, 2: report harmful content
    self.observation_space = gym.spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32)
    self.history = []  # feedback labels in the order received

  def step(self, action):
    # Map each feedback action to a scalar reward.
    if action == 1:
      reward = 1
      feedback = "Like"
    elif action == 2:
      reward = -2
      feedback = "Danger"
    else:
      reward = -1
      feedback = "Hate"

    self.history.append(feedback)

    # The observation is a constant placeholder; use float32 to match observation_space.
    obs = np.array([0.5], dtype=np.float32)
    terminated = False
    truncated = False
    info = {}

    return obs, reward, terminated, truncated, info

  def reset(self, seed=None, options=None):
    super().reset(seed=seed)
    return np.array([0.5], dtype=np.float32), {}
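
Because the observation is always `[0.5]`, the environment above behaves like a three-armed bandit: the reward depends only on the chosen action. The sketch below computes the expected per-step reward of a policy from the reward values in `step`. The names `REWARDS` and `expected_reward` are illustrative and do not appear in the notebook.

```python
# Reward for each action, copied from ContentFeedbackEnv.step:
# 0 = dislike, 1 = like, 2 = report harmful content.
REWARDS = {0: -1, 1: 1, 2: -2}


def expected_reward(policy):
    """Expected per-step reward for a list of action probabilities."""
    if len(policy) != len(REWARDS) or abs(sum(policy) - 1.0) > 1e-9:
        raise ValueError("policy must be 3 probabilities that sum to 1")
    return sum(p * REWARDS[a] for a, p in enumerate(policy))


uniform = [1 / 3, 1 / 3, 1 / 3]   # untrained policy, all actions equally likely
always_like = [0.0, 1.0, 0.0]     # policy that always picks action 1

print(expected_reward(uniform))      # close to -2/3
print(expected_reward(always_like))  # the maximum achievable, 1 per step
```

Under these rewards, a policy trained in this environment should drift toward always choosing action 1, regardless of the generated text.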

### 3. Creating and Training the PPO Model

In [10]:
env = ContentFeedbackEnv()
model = PPO("MlpPolicy", env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [11]:
past_feedback = [1, 0, 2, 1, 1, 0, 2, 1, 0, 1]

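# Replay earlier feedback through the environment. This only appends to
# env.history; the PPO policy itself is updated later by model.learn().
for action in past_feedback:
  env.step(action)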

In [12]:
# Train the PPO model
model.learn(total_timesteps=10000)
model.save('rlhf_content_model')

-----------------------------
| time/              |      |
|    fps             | 374  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 415         |
|    iterations           | 2           |
|    time_elapsed         | 9           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014977451 |
|    clip_fraction        | 0.368       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.09       |
|    explained_variance   | 5.96e-08    |
|    learning_rate        | 0.0003      |
|    loss                 | 3.87        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.047      |
|    value_loss           | 68          |
-----------------------------------------
----------------------------------
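
The `entropy_loss` of -1.09 in the log above can be checked by hand. Stable-Baselines3 reports `entropy_loss` as the negative mean policy entropy, and a uniform policy over three discrete actions has entropy ln 3. The sketch below works that number out; `categorical_entropy` is an illustrative helper, not a library function.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


n_actions = 3  # size of the Discrete(3) action space
uniform = [1 / n_actions] * n_actions

# Negative entropy, the sign convention used for entropy_loss.
entropy_loss_uniform = -categorical_entropy(uniform)
print(f"{entropy_loss_uniform:.4f}")  # -1.0986
```

A logged value of about -1.09 therefore suggests that after the first update the policy was still close to uniform over the three feedback actions.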

In [13]:
# Load the saved model
model = PPO.load('rlhf_content_model')

env = ContentFeedbackEnv()
model.set_env(env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [14]:
prompt = "This is windy day, so"
response = generate_text(prompt)

print(response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


This is windy day, so I was sitting on the back porch and took a picture. I was a little confused at first, but then I realized that I was really, really old! I was 6-foot-4 and my hair was just too long. I'd always assumed that my hair was cut like this, but this is a different hair type. I started feeling a bit more confident as I looked at my hair. I've always wondered if it's so I don't look like this. I'm 6-foot-4 and my hair is really big. I feel like a freak, but I know that I don't look this great.

How did you get your hair so big?

I guess it was the hair type of my parents' hair. I don't remember what it was. It was the hair type of my mom's hair. When my parents started to grow up, I would use the hair type of mine, but I think I got it because I looked amazing. I can't remember the definition of my hair. We were all just getting on so much.

It's hard to remember how long you're wearing a wig when you're in the shower.

I was wearing a hoodie when I was six years old. I


In [15]:
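# Record a "like" (action 1) as feedback for the generated text.
action = 1

# Stepping the environment directly only logs the feedback in env.history.
env.step(action)
# PPO collects whole rollouts (n_steps=2048 by default), so even
# total_timesteps=10 runs 2048 steps, as the log below shows.
model.learn(total_timesteps=10)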

-----------------------------
| time/              |      |
|    fps             | 612  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------


<stable_baselines3.ppo.ppo.PPO at 0x7ee0d94f3440>
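
The log above reports 2048 timesteps even though `total_timesteps=10` was requested. A minimal sketch of the arithmetic follows, assuming PPO's default `n_steps=2048`, a single environment, and a timestep counter that restarts at zero on each `learn` call. `timesteps_collected` is a hypothetical helper, not a Stable-Baselines3 function.

```python
import math


def timesteps_collected(total_timesteps, n_steps=2048, n_envs=1):
    """Timesteps actually run when training stops only after a full rollout."""
    rollout_size = n_steps * n_envs
    # Rollouts continue until the counter reaches total_timesteps,
    # so the result is rounded up to a whole number of rollouts.
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return n_rollouts * rollout_size


print(timesteps_collected(10))     # 2048, matching the log above
print(timesteps_collected(10000))  # 10240, i.e. five rollouts
```

Under the same assumptions, lowering `n_steps` when constructing `PPO` would be needed for a small `total_timesteps` to have a proportionally small effect.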