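# Training and prediction with an LLM on a (Q) molecular structure + (R) reasoning + (A) property dataset
- Q&A: uses a melting point dataset
- R: the model generates its own reasoning, and the samples that reach the correct answer are used for training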

In [8]:
%reload_ext autoreload
%autoreload 2

import os
#os.environ["CUDA_VISIBLE_DEVICES"]="1"

from transformers import AutoTokenizer
import pandas as pd
import random
import copy
import glob
import json
from datetime import datetime
from llmchem.utils import mk_dir,clean_vram

#import clear_output

from IPython.display import clear_output

In [2]:
#dataset settings
n_test=5 #number of test examples
n_train_check=5 #number of training examples to evaluate (evaluating all of them takes too long, so only a subset is checked)
n_GPT_reasoning=30 #number of reasoning examples generated by GPT
n_generation_iterations=10 #number of trials for generating new self-reasoning data
max_generations=10 #number of train/evaluate/generate cycles

#model settings
#NOTE: the Mixtral settings come first but are overridden by the Llama-2 settings that follow
model_name="mistralai/Mixtral-8x7B-Instruct-v0.1"
target_modules= [
    "lm_head",
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate",
    "w1",
    "w2",
    "w3"
]

model_name="meta-llama/Llama-2-7b-chat-hf"
target_modules= [
    #"embed_tokens",
    "lm_head",
    #"q_proj",
    #"k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    #"down_proj",
]



#LoRA settings
r=32
lora_alpha=r
bit=16
#bit=8
#bit=4

#train settings
gradient_checkpointing =False
per_device_train_batch_size=1
epochs=3
lr=10**-5

#device settings
device_map="auto"

#dataset path
dataset_path="dataset/231225AutoReasoning/240117best_reason_record_30k.csv"

#project path
project_dir="results/projects/240117test"

#reasoning options
error_threshold=30  # if the absolute error is smaller than this, add the sample to the training data
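
The `error_threshold` setting decides which self-generated reasoning samples are kept. The internals of `llmchem.reasoning.self_reasoning` are not shown in this notebook, so the following is only a minimal sketch of such a filter. It assumes the generated text contains a line like `##Prediction: 125.0`, as in the evaluation output of this notebook; `parse_prediction` and `is_acceptable` are hypothetical helper names, not part of `llmchem`.

```python
import re
from typing import Optional

# Assumption: the model marks its final answer with a line such as "##Prediction: 125.0".
PREDICTION_PATTERN = re.compile(r"##\s*Prediction:\s*(-?\d+(?:\.\d+)?)")


def parse_prediction(generated_text: str) -> Optional[float]:
    """Return the first predicted value in the text, or None if there is no prediction line."""
    match = PREDICTION_PATTERN.search(generated_text)
    return float(match.group(1)) if match else None


def is_acceptable(generated_text: str, actual: float, error_threshold: float = 30) -> bool:
    """Hypothetical acceptance rule: keep the sample if |predicted - actual| < error_threshold."""
    predicted = parse_prediction(generated_text)
    if predicted is None:
        return False  # truncated or malformed output carries no usable prediction
    return abs(predicted - actual) < error_threshold


print(is_acceptable("##Prediction: 125.0", actual=96.0))   # error 29 -> True
print(is_acceptable("##Prediction: 105.0", actual=-80.0))  # error 185 -> False
```

Rejecting outputs without a prediction line is a design choice in this sketch: a generation that was cut off before the answer cannot be checked against the measured value.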

In [4]:
mk_dir(project_dir)
mk_dir(project_dir+"/eval")
mk_dir(project_dir+"/self_reasoning")
mk_dir(project_dir+"/train")

In [5]:
from llmchem.model import init_model
from llmchem.train import train_model
from llmchem.eval import eval_model
from llmchem.reasoning import self_reasoning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [6]:
#load base dataset

df=pd.read_csv(dataset_path)
dataset=df.to_dict(orient="records")
random.seed(0)
random.shuffle(dataset)

base_train_dataset=dataset[:-n_test]
train_check_dataset=base_train_dataset[-n_train_check:]
example_reasoning_dataset=base_train_dataset[:n_GPT_reasoning]
test_dataset=dataset[-n_test:]
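
How the four slices relate to each other is easy to misread, so here is a small sketch that repeats the same slicing on toy records. The records with an `id` field and the helper `ids` are hypothetical stand-ins for the rows of the real CSV.

```python
import random

# Toy stand-in for the records loaded from the CSV (hypothetical data).
dataset = [{"id": i} for i in range(100)]
random.seed(0)
random.shuffle(dataset)

n_test, n_train_check, n_GPT_reasoning = 5, 5, 30

base_train_dataset = dataset[:-n_test]                            # everything except the last n_test records
train_check_dataset = base_train_dataset[-n_train_check:]         # tail of the training pool
example_reasoning_dataset = base_train_dataset[:n_GPT_reasoning]  # head of the training pool
test_dataset = dataset[-n_test:]                                  # held-out records


def ids(records):
    """Collect the ids of a list of records as a set."""
    return {record["id"] for record in records}


print(len(base_train_dataset), len(test_dataset))                          # 95 5
print(ids(test_dataset).isdisjoint(ids(base_train_dataset)))               # True
print(ids(train_check_dataset).isdisjoint(ids(example_reasoning_dataset)))  # True
```

In this sketch the test records never overlap with the training pool. The check subset is taken from the tail of the pool and the GPT reasoning examples from its head, so the two stay disjoint as long as the pool holds at least `n_GPT_reasoning + n_train_check` records.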

In [10]:
#Loop: training, evaluation, data generation
for generation in range(max_generations):
    clear_output()
    #prepare train dataset

    ## reasoning data generated by GPT-4
    train_dataset=copy.deepcopy(example_reasoning_dataset)

    print(f"GPT-generated reasons: {len(train_dataset)}")

    ## reasoning data generated by the model itself
    for path in glob.glob(f"{project_dir}/self_reasoning/*.json"):
        with open(path) as f:
            train_dataset.append(json.load(f))

    print(f"All-generated reasons: {len(train_dataset)}")
    random.shuffle(train_dataset)

    #train model
    clean_vram()
    model=init_model(model_name, r, lora_alpha, target_modules, bit=bit,device_map=device_map)
    train_result=train_model(model,tokenizer,train_dataset,
                    project_dir=project_dir,
                    epochs=epochs,
                    lr=lr,
                    per_device_train_batch_size=per_device_train_batch_size,
                    gradient_checkpointing=gradient_checkpointing,
                    )

    #eval
    train_eval_result=eval_model(model,tokenizer,train_check_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"train_{generation}"
                                )

    test_eval_result=eval_model(model,tokenizer,test_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"test_{generation}"
                                )

    #generate additional training data by self-reasoning
    self_reasoning(model,tokenizer,base_train_dataset,
                example_reasoning_dataset,project_dir,generation=generation,
                n_iterations=n_generation_iterations,
                error_threshold=error_threshold)

GPT-generated reasons: 30
All-generated reasons: 30
Using fp16 mode


The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.15s/it]
Map: 100%|██████████| 30/30 [00:00<00:00, 2425.34 examples/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


  0%|          | 0/5 [00:00<?, ?it/s]

promlem 1 / 5


 20%|██        | 1/5 [01:02<04:09, 62.33s/it]

----


Heroin's melting point is challenging to predict due to its complex structure, which includes a phenanthrene core, a nitro functional group, and a methyl ester group. The phenanthrene core has a high melting point due to its planarity and the presence of a conjugated double bond. The nitro functional group can increase the melting point due to its electronegativity and the potential for hydrogen bonding. The methyl ester group can also contribute to the melting point due to its dipole moment and the potential for hydrogen bonding. The presence of these functional groups and their interactions with each other and the phenanthrene core can lead to a complex interplay of effects on the melting point.

To estimate the melting point of heroin, we need to consider the effects of each functional group and their interactions. The phenanthrene core is estimated to increase the melting point by +40 over a basic hydrocarbon backbone. The nitro functional group can increase the melting poin

 40%|████      | 2/5 [01:04<01:21, 27.08s/it]

----


##Prediction: 125.0


#Problem
actual:  96.0 predicted:  125.0
promlem 3 / 5


 60%|██████    | 3/5 [01:07<00:31, 15.88s/it]

----


##Prediction: 105.0


#Problem
actual:  -80.0 predicted:  105.0
promlem 4 / 5


 80%|████████  | 4/5 [01:09<00:10, 10.51s/it]

----


##Prediction: 105.0


#Problem
actual:  190.0 predicted:  105.0
promlem 5 / 5


100%|██████████| 5/5 [01:25<00:00, 17.14s/it]


----


Theophylline is a xanthine alkaloid with a melting point of 190-195 °C. The molecule has a planar, symmetrical structure with a nitrogen atom at the center of the ring. The oxygen atoms are also symmetrical and are bonded to the nitrogen through single bonds. The carbon atoms are bonded to the nitrogen and oxygen through single bonds. The hydrogen atoms are bonded to the carbon atoms through single bonds. The molecule has a high degree of symmetry, which can contribute to a higher melting point due to the increased molecular weight and the stabilizing effect of the symmetry.
##Prediction: 190.0


#Problem
actual:  272.0 predicted:  190.0
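
The `actual: ... predicted: ...` lines in the log above can be summarized with a few lines of Python. This is a sketch that parses only the four pairs visible in that log and reports their absolute errors and the mean absolute error. The regular expression is an assumption based on the printed format, not something provided by `llmchem`.

```python
import re

# Lines copied from the evaluation log above (only the four pairs that are visible there).
log_text = """
actual:  96.0 predicted:  125.0
actual:  -80.0 predicted:  105.0
actual:  190.0 predicted:  105.0
actual:  272.0 predicted:  190.0
"""

# Assumed line format: "actual:  <number> predicted:  <number>"
PAIR_PATTERN = re.compile(r"actual:\s*(-?\d+(?:\.\d+)?)\s+predicted:\s*(-?\d+(?:\.\d+)?)")

pairs = [(float(actual), float(predicted)) for actual, predicted in PAIR_PATTERN.findall(log_text)]
abs_errors = [abs(predicted - actual) for actual, predicted in pairs]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [29.0, 185.0, 85.0, 82.0]
print(mae)         # 95.25
```

Problems whose output contains no `actual`/`predicted` line are simply absent from this summary, so the mean covers only the examples for which a prediction was logged.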


  0%|          | 0/5 [00:00<?, ?it/s]

promlem 1 / 5


 20%|██        | 1/5 [00:38<02:34, 38.60s/it]

----


The functional groups in this molecule are a methoxy group, a nitro group, and a benzaldehyde group. The methoxy group is polar and hydrogen bonding capable, while the nitro group is highly electronegative and will increase the melting point due to intermolecular dipole-dipole interactions. The benzaldehyde group is a relatively small functional group, but it can still contribute to the melting point due to conjugation and increased polarity.

Based on the individual contributions of each functional group, we can estimate the melting point as follows:

- Methoxy group: 50 degrees Celsius (based on the average melting point of similar compounds)
- Nitro group: 40 degrees Celsius (based on the increased polarity and dipole-dipole interactions)
- Benzaldehyde group: 20 degrees Celsius (based on the conjugation effect and increased polarity)

Adding these contributions up, we get a predicted melting point of 110 degrees Celsius.
##Prediction: 110.0


#Problem
actual:  99.0 predicted