### 10. Advanced Fine-Tuning - Classifying Drugs

### Prerequisites
 * The Google Colab environment resets after a period of time, so the following steps must be repeated in every new session.
   * Create the chatgpt.env file.
     * Either edit the provided chatgpt.env and upload it, or look up your API_KEY and ORG_ID and create the file yourself.
   * Install the openai package with pip install openai.
   * Download the Kaggle dataset, then upload it to Colab.
     * https://www.kaggle.com/datasets/saratchendra/medicine-recommendation or https://www.kaggle.com/datasets/saratchendra/medicine-recommendation/download?datasetVersionNumber=1
     * File name: 'Medicine_description.xlsx'
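
The chatgpt.env file is read later in this notebook as plain `KEY=VALUE` lines holding `API_KEY` and `ORG_ID`. As a minimal sketch of that format, the helper below parses such lines into a dictionary. The name `parse_env_lines` and the placeholder values are assumptions for illustration, not part of the original notebook.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so a value may itself contain "=".
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values (assumptions); substitute your own key and organization ID.
example_env = ["API_KEY=sk-xxxx", "ORG_ID=org-xxxx", ""]
settings = parse_env_lines(example_env)
print(sorted(settings))
```

To create the file in Colab, writing those two lines to `chatgpt.env` with your real values should be enough for the `init_api()` function used in the test cells.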

### What You Will Learn
 * Reformatting data with pandas
 * Testing the fine-tuned model

In [None]:
!pip install openai

Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Installing collected packages: openai
Successfully installed openai-0.27.8


### Reformatting Data with pandas

In [None]:
# Import the pandas library
import pandas as pd

# Read the first n rows
n = 2000
df = pd.read_excel('Medicine_description.xlsx', sheet_name='Sheet1', header=0, nrows=n)

# Get the unique values in the 'Reason' column
reasons = df["Reason"].unique()
print(reasons)

# Assign a number to each Reason
reasons_dict = {reason : i for i, reason in enumerate(reasons)}

# Build the prompt: prefix each Drug_Name with "Drug : " and end it with a newline and "Malady:"
df["Drug_Name"] = "Drug : " + df["Drug_Name"] + "\n" + "Malady:"

# Replace each Reason with its class number, preceded by a space
df["Reason"] = " " + df["Reason"].apply(lambda x : str(reasons_dict[x]))

# Drop the 'Description' column
df.drop(["Description"], axis=1, inplace=True)

# Rename 'Drug_Name' to 'prompt' and 'Reason' to 'completion'
df.rename(columns={"Drug_Name" : "prompt" , "Reason": "completion"}, inplace=True)

# Convert the DataFrame to JSONL format
jsonl = df.to_json(orient="records", indent=0, lines=True)

# Write the JSONL to a file
with open("drug_malady_data_01.jsonl", "w") as f :
    f.write(jsonl)

['Acne' 'Adhd' 'Allergies' 'Alzheimer' 'Amoebiasis' 'Anaemia' 'Angina']
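
To make the resulting record shape concrete, here is a small sketch that applies the same transformation to two rows using only the standard library. The function name `to_records` is hypothetical, and the two rows are illustrative pairs taken from the drugs and classes named in the test cells of this notebook rather than read from the spreadsheet.

```python
import json


def to_records(rows):
    """Turn (drug_name, reason) pairs into prompt/completion dicts."""
    # Number the reasons in order of first appearance,
    # like enumerate(df["Reason"].unique()) in the pandas cell.
    reason_ids = {}
    for _, reason in rows:
        reason_ids.setdefault(reason, len(reason_ids))
    return [
        {
            "prompt": "Drug : " + name + "\nMalady:",
            # Leading space before the class number, as in the pandas cell.
            "completion": " " + str(reason_ids[reason]),
        }
        for name, reason in rows
    ]


# Illustrative rows (assumption: pairing follows the class comments in the test cells).
rows = [
    ("A CN Gel(Topical) 20gmA CN Soap 75gm", "Acne"),
    ("Addnok Tablet 20'S", "Adhd"),
]
for record in to_records(rows):
    print(json.dumps(record))
```

Each printed line is one JSONL record with a `prompt` and a `completion` key, which is the layout the `prepare_data` tool expects.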


In [None]:
### Converting the File Format

In [None]:
# Set the OpenAI API key (replace the placeholder with your own key).
import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"

In [None]:
!openai tools fine_tunes.prepare_data -f drug_malady_data_01.jsonl

Analyzing...

- Your file contains 2000 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- All prompts end with suffix `\nMalady:`
- All prompts start with prefix `Drug : `

No remediations found.
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `drug_malady_data_01_prepared_train.jsonl` and `drug_malady_data_01_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "drug_malady_data_01_prepared_train.jsonl" -v "drug_malady_data_01_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7
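
The value passed to `--classification_n_classes` should equal the number of distinct completions in the data. As a rough sanity check, the sketch below counts distinct `completion` values in JSONL text; `count_classes` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json


def count_classes(jsonl_text):
    """Return the number of distinct completion values in a JSONL string."""
    labels = set()
    for line in jsonl_text.splitlines():
        if line.strip():
            labels.add(json.loads(line)["completion"].strip())
    return len(labels)


# Made-up sample records: three rows, two distinct classes.
sample = "\n".join([
    json.dumps({"prompt": "Drug : A\nMalady:", "completion": " 0"}),
    json.dumps({"prompt": "Drug : B\nMalady:", "completion": " 1"}),
    json.dumps({"prompt": "Drug : C\nMalady:", "completion": " 0"}),
])
print(count_classes(sample))
```

Passing the contents of `drug_malady_data_01.jsonl` to the same function should give 7 for the first 2000 rows, matching the seven reasons printed earlier.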


### Fine-Tuning with the Data

In [None]:
!openai api fine_tunes.create -t "drug_malady_data_01_prepared_train.jsonl" -v "drug_malady_data_01_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7 -m ada --suffix "drug_data"

Upload progress: 100% 130k/130k [00:00<00:00, 92.4Mit/s]
Uploaded file from drug_malady_data_01_prepared_train.jsonl: file-JEAVasHGI3uQZ5BzR6ap0N0B
Upload progress: 100% 32.4k/32.4k [00:00<00:00, 66.4Mit/s]
Uploaded file from drug_malady_data_01_prepared_valid.jsonl: file-9mYIYUPjZXk3G8fuRT57W6fn
Created fine-tune: ft-vtRdskcACywoN7D81z6ujH0m
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-08-21 03:14:50] Created fine-tune: ft-vtRdskcACywoN7D81z6ujH0m
[2023-08-21 03:15:25] Fine-tune costs $0.05
[2023-08-21 03:15:25] Fine-tune enqueued. Queue number: 0



In [None]:
!openai api fine_tunes.follow -i ft-RDkiqx5nhawzladXN7pmIRSX

[2023-08-15 09:21:57] Created fine-tune: ft-RDkiqx5nhawzladXN7pmIRSX

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-RDkiqx5nhawzladXN7pmIRSX



### Testing the Fine-Tuned Model

In [None]:
import os
import openai

def init_api():
    # Load KEY=VALUE pairs from chatgpt.env into the environment.
    with open("chatgpt.env") as env:
        for line in env:
            if not line.strip():
                continue
            # Split on the first "=" only, in case the value contains "=".
            key, value = line.strip().split("=", 1)
            os.environ[key] = value
    openai.api_key = os.environ.get("API_KEY")
    openai.organization = os.environ.get("ORG_ID")

init_api()

# Set the model ID. Replace this with your own fine-tuned model ID.
model = "ada:ft-personal:drug-data-2023-08-15-09-58-51"

# Pick one drug from each class.
drugs = [
    "A CN Gel(Topical) 20gmA CN Soap 75gm",  # Class 0
    "Addnok Tablet 20'S",  # Class 1
    "ABICET M Tablet 10's",  # Class 2
]

# Get the predicted class for each drug.
for drug_name in drugs:
    # Use the same prompt format as the training data: "Drug : <name>\nMalady:"
    prompt = "Drug : {}\nMalady:".format(drug_name)
    response = openai.Completion.create(model=model, prompt=prompt, temperature=1, max_tokens=1)

    # Extract the generated text.
    drug_class = response.choices[0].text

    # The result should be one of 0, 1, or 2.
    print(drug_class)


 0
 1
 2
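
The `prepare_data` report earlier in this notebook states that every training prompt starts with `Drug : ` and ends with `\nMalady:`. A fine-tuned model generally behaves best when inference prompts follow the training format exactly, so a small helper can keep the two in sync. The name `build_prompt` is an assumption for illustration.

```python
def build_prompt(drug_name):
    """Format a drug name the way the training prompts were formatted."""
    return "Drug : " + drug_name + "\nMalady:"


prompt = build_prompt("Addnok Tablet 20'S")
# The prefix and suffix reported by prepare_data should both be present.
assert prompt.startswith("Drug : ")
assert prompt.endswith("\nMalady:")
print(repr(prompt))
```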


### Changing drugs and Testing Again

In [None]:
import os
import openai

def init_api():
    # Load KEY=VALUE pairs from chatgpt.env into the environment.
    with open("chatgpt.env") as env:
        for line in env:
            if not line.strip():
                continue
            # Split on the first "=" only, in case the value contains "=".
            key, value = line.strip().split("=", 1)
            os.environ[key] = value
    openai.api_key = os.environ.get("API_KEY")
    openai.organization = os.environ.get("ORG_ID")

init_api()

# Set the model ID. Replace this with your own fine-tuned model ID.
model = "ada:ft-personal:drug-data-2023-08-15-09-58-51"

# Pick one drug from each class, this time phrased as a question.
drugs = [
    "What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for?",  # Class 0
    "What is 'Addnok Tablet 20'S' used for?",  # Class 1
    "What is 'ABICET M Tablet 10's' used for?",  # Class 2
]

# Map class numbers back to the names of the maladies.
class_map = {
    0: "Acne",
    1: "Adhd",
    2: "Allergies",
    # ...
}

# Get the predicted class for each drug.
for drug_name in drugs:
    # Use the same prompt format as the training data: "Drug : <name>\nMalady:"
    prompt = "Drug : {}\nMalady:".format(drug_name)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=1,
        max_tokens=1,
    )

    response = response.choices[0].text

    try:
        print(drug_name + " is used for " + class_map[int(response)])
    except (ValueError, KeyError):
        # The completion was not a number, or not a class in class_map.
        print("I don't know what " + drug_name + " is used for.")

    print()



What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for? is used for Acne

What is 'Addnok Tablet 20'S' used for? is used for Adhd

What is 'ABICET M Tablet 10's' used for? is used for Allergies
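
Because the class numbers were assigned with `enumerate` over the unique reasons, the full mapping can be rebuilt from the list printed by the pandas cell. The sketch below does that and adds a small decoder for raw completions. `decode_class` is a hypothetical helper, and the reason order is assumed to match the printed output for the first 2000 rows.

```python
# Reasons in the order printed by df["Reason"].unique() for the first 2000 rows.
reasons = ["Acne", "Adhd", "Allergies", "Alzheimer", "Amoebiasis", "Anaemia", "Angina"]

# enumerate() reproduces the numbering used when building the training data.
class_map = dict(enumerate(reasons))


def decode_class(completion_text, class_map):
    """Map a raw completion such as ' 2' to a label, or None if it is not a known class."""
    try:
        return class_map.get(int(completion_text.strip()))
    except ValueError:
        # The completion was not a number at all.
        return None


print(decode_class(" 2", class_map))
print(decode_class(" x", class_map))
```

Returning `None` for unknown output keeps the calling loop free of a broad `try`/`except` and makes the fallback message an explicit branch.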

