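# Training and prediction with an LLM on a (Q) molecular structure + (R) reasoning + (A) property dataset
- Q&A: uses a melting point dataset
- R: the model generates its own reasoning, and the records that lead to correct answers are used for training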

In [1]:
%reload_ext autoreload
%autoreload 2

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

from transformers import AutoTokenizer
import pandas as pd
import random
import copy
import glob
import json
from datetime import datetime
from llmchem.utils import mk_dir,clean_vram

#import clear_output

from IPython.display import clear_output



In [2]:
#dataset settings
n_test=50 #number of test records
n_train_check=50 #number of training records to evaluate (evaluating all of them takes too long, so only a subset is checked)
n_GPT_reasoning=100 #number of reasoning records generated by GPT
n_generation_iterations=100 #number of trials for generating new self-reasoning data
max_generations=1000 #upper bound on train/eval/generate loops

#model settings
model_name="mistralai/Mixtral-8x7B-Instruct-v0.1"
target_modules= [
    "lm_head",
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate",
    "w1",
    "w2",
    "w3"
]

#Llama-2 settings (these assignments override the Mixtral settings above)
model_name="meta-llama/Llama-2-7b-chat-hf"
target_modules= [
    #"embed_tokens",
    "lm_head",
    #"q_proj",
    #"k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    #"down_proj",
]



#LoRA settings
r=32
lora_alpha=r
bit=16
#bit=8
#bit=4

#train settings
gradient_checkpointing =False
per_device_train_batch_size=1
epochs=3
lr=10**-5

#device settings
device_map="auto"

#dataset path
dataset_path="dataset/231225AutoReasoning/240117best_reason_record_3k.csv"

#project path
project_dir="results/projects/240118llama7b_100"

#reasoning options
error_threshold=30  # if the absolute error is smaller than this, the generated reasoning is added to the training data
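
The `error_threshold` rule can be illustrated with a minimal sketch. The helpers `parse_prediction` and `is_accepted` below are hypothetical and are not part of `llmchem`; the sketch only assumes that generated text ends with a `##Prediction: <value>` line, as in the evaluation output of this notebook, and that a record is kept when the absolute error is below the threshold.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; llmchem.reasoning may work differently.
PREDICTION_PATTERN = re.compile(r"##Prediction:\s*(-?\d+(?:\.\d+)?)")


def parse_prediction(generated_text: str) -> Optional[float]:
    """Return the first value after '##Prediction:', or None if it is missing."""
    match = PREDICTION_PATTERN.search(generated_text)
    return float(match.group(1)) if match else None


def is_accepted(generated_text: str, actual: float, error_threshold: float = 30.0) -> bool:
    """Keep a generated reasoning only if its prediction is close to the actual value."""
    predicted = parse_prediction(generated_text)
    if predicted is None:
        # Truncated generations carry no prediction and are rejected.
        return False
    return abs(predicted - actual) < error_threshold


print(is_accepted("...\n##Prediction: 121.0", actual=117.75))  # small error
print(is_accepted("##Prediction: 105.0", actual=193.0))        # large error
```

Rejecting text with no parsable prediction is a design choice of this sketch; it keeps truncated reasoning out of the training data.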

In [3]:
mk_dir(project_dir)
mk_dir(project_dir+"/eval")
mk_dir(project_dir+"/self_reasoning")
mk_dir(project_dir+"/train")

In [4]:
from llmchem.model import init_model
from llmchem.train import train_model
from llmchem.eval import eval_model
from llmchem.reasoning import self_reasoning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [5]:
#load base dataset

df=pd.read_csv(dataset_path)
dataset=df.to_dict(orient="records")
random.seed(0)
random.shuffle(dataset)

base_train_dataset=dataset[:-n_test]
example_reasoning_dataset=base_train_dataset[:n_GPT_reasoning]
test_dataset=dataset[-n_test:]
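
The split above can be written as a small function to make its properties explicit. This is a sketch under assumptions: `split_dataset` is a hypothetical name, and it uses a local `random.Random` instance instead of the global seed, so it is a variation on the cell rather than a drop-in replacement.

```python
import random


def split_dataset(records, n_test, n_examples, seed=0):
    """Shuffle a copy of records and split it the same way as the cell above.

    Returns (base_train, examples, test). Assumes n_test > 0, because
    records[:-0] would be an empty list.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    base_train = shuffled[:-n_test]
    examples = base_train[:n_examples]     # few-shot / GPT reasoning subset
    test = shuffled[-n_test:]
    return base_train, examples, test


# Toy records standing in for the rows of the CSV file.
toy_records = [{"id": i} for i in range(200)]
base_train, examples, test = split_dataset(toy_records, n_test=50, n_examples=100)

train_ids = {r["id"] for r in base_train}
test_ids = {r["id"] for r in test}
print(len(base_train), len(examples), len(test))
print("train/test overlap:", len(train_ids & test_ids))
```

The example records are drawn from the training portion only, so the held-out test records never appear as prompt examples.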

In [6]:
#Loop: training, evaluation, data generation
for generation in range(max_generations):
    clear_output()
    print(f"Generation: {generation}")
    #prepare train dataset

    ## reasoning data generated by GPT-4
    train_dataset=copy.deepcopy(example_reasoning_dataset)

    print(f"GPT-generated reasons: {len(train_dataset)}")

    ## reasoning data generated by the model itself
    for path in glob.glob(f"{project_dir}/self_reasoning/*.json"):
        with open(path) as f:
            train_dataset.append(json.load(f))

    print(f"All-generated reasons: {len(train_dataset)}")
    random.shuffle(train_dataset)

    #train model
    clean_vram()
    model=init_model(model_name, r, lora_alpha, target_modules, bit=bit,device_map=device_map)
    train_result=train_model(model,tokenizer,train_dataset,
                    project_dir=project_dir,
                    epochs=epochs,
                    lr=lr,
                    per_device_train_batch_size=per_device_train_batch_size,
                    gradient_checkpointing=gradient_checkpointing,
                    )

    #eval

    train_check_dataset=copy.deepcopy(train_dataset[:n_train_check])
    random.shuffle(train_check_dataset)
    train_eval_result=eval_model(model,tokenizer,train_check_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"train_{generation}"
                                )

    test_eval_result=eval_model(model,tokenizer,test_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"test_{generation}"
                                )

    #generate additional training data by self-reasoning
    self_reasoning(model,tokenizer,base_train_dataset,
                example_reasoning_dataset,project_dir,
                generation=generation,
                n_iterations=n_generation_iterations,
                error_threshold=error_threshold,
                n_max_trials=2)
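
The evaluation step prints lines of the form `actual:  193.0 predicted:  105.0`. As a rough sketch, the mean absolute error can be recovered from such printed output. The function name `mean_absolute_error_from_log` is hypothetical, and whether `eval_model` already returns this metric is not shown here, so log parsing is used only for illustration.

```python
import re

# Matches lines such as "actual:  193.0 predicted:  105.0".
RESULT_PATTERN = re.compile(
    r"actual:\s*(-?\d+(?:\.\d+)?)\s+predicted:\s*(-?\d+(?:\.\d+)?)"
)


def mean_absolute_error_from_log(log_text):
    """Return (mae, n_pairs); mae is None when no result line is found."""
    pairs = [(float(a), float(p)) for a, p in RESULT_PATTERN.findall(log_text)]
    if not pairs:
        return None, 0
    mae = sum(abs(a - p) for a, p in pairs) / len(pairs)
    return mae, len(pairs)


# Three result lines in the format printed during evaluation.
sample_log = """
actual:  193.0 predicted:  105.0
actual:  142.0 predicted:  105.0
actual:  97.5 predicted:  170.0
"""
mae, n_pairs = mean_absolute_error_from_log(sample_log)
print(f"MAE over {n_pairs} problems: {mae:.2f}")
```

Problems whose generation was cut off before a `##Prediction:` line produce no `actual:` line, so they are silently excluded from this average.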

Generation: 0
GPT-generated reasons: 100
All-generated reasons: 100
Using fp16 mode


The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.36s/it]
Map: 100%|██████████| 100/100 [00:00<00:00, 5098.96 examples/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,1.835
200,1.4625
300,1.3417


  0%|          | 0/50 [00:00<?, ?it/s]

promlem 1 / 50


  2%|▏         | 1/50 [00:01<01:28,  1.80s/it]

----


##Prediction: 105.0


#Problem
actual:  193.0 predicted:  105.0
promlem 2 / 50


  4%|▍         | 2/50 [00:04<01:46,  2.23s/it]

----


##Prediction: 105.0


#Problem
actual:  142.0 predicted:  105.0
promlem 3 / 50


  6%|▌         | 3/50 [00:06<01:45,  2.25s/it]

----


##Prediction: 170.0


#Problem
actual:  97.5 predicted:  170.0
promlem 4 / 50


  8%|▊         | 4/50 [00:45<12:47, 16.69s/it]

----


The basic unit for comparison could be anthracene, which has a melting point around 216°C. The addition of a ketone group (-CO-CH2-), which is a polar functional group, generally increases the melting point due to the presence of a polar carbonyl group capable of dipole-dipole interactions, adding approximately +10°C. The presence of a benzene ring, which is known to contribute to the melting point due to the conjugated system and potential for hydrogen bonding, adds another +20°C. The additional functional groups, such as the hydroxyl (-OH) and amino (-NH2) groups, may also contribute to the melting point, although their effects are less predictable.

Therefore:
- Basic unit (anthracene): 216.0
- Ketone group: +10.0
- Benzene ring: +20.0
- Hydroxyl group: +5.0
- Amino group: +5.0
Combined estimated effect: 216.0 + 10.0 + 20.0 + 5.0 + 5.0 ≈ 266.0
##Prediction: 266.0


#Problem
actual:  170.0 predicted:  266.0
promlem 5 / 50


 10%|█         | 5/50 [01:37<22:03, 29.41s/it]

----


To predict the melting point of 5-chlorothiophene-2-carboxylic acid, we consider the known melting point influence of various functional groups relative to benzene, which has a melting point of 5.5°C. Thiophene rings have oxygen which can participate in hydrogen bonding and contribute additional ring structure, possibly increasing the melting point by about +40°C each. Chlorine is electron-withdrawing and can increase the melting point due to increased molecular weight and van der Waals interactions, let's estimate +20°C for the chlorine group. The carboxylic acid group can also increase the melting point due to increased molecular weight and van der Waals interactions, let's estimate +10°C for the carboxylic acid group. The presence of a chlorine atom can significantly increase the melting point because of its strong electronegative nature and the capability for intermolecular interactions, contributing approximately +30°C. Therefore, we calculate as follows: Benzene (5.5) + 2 

 12%|█▏        | 6/50 [01:40<14:53, 20.30s/it]

----


##Prediction: 105.0


#Problem
actual:  46.0 predicted:  105.0
promlem 7 / 50


 14%|█▍        | 7/50 [01:42<10:15, 14.32s/it]

----


##Prediction: 105.0


#Problem
actual:  -37.0 predicted:  105.0
promlem 8 / 50


 16%|█▌        | 8/50 [01:43<07:10, 10.25s/it]

----


##Prediction: -10.0


#Problem
actual:  147.0 predicted:  -10.0
promlem 9 / 50


 18%|█▊        | 9/50 [01:45<05:18,  7.76s/it]

----


##Prediction: 100.0


#Problem
actual:  93.0 predicted:  100.0
promlem 10 / 50


 20%|██        | 10/50 [01:47<03:59,  5.99s/it]

----


##Prediction: 120.0


#Problem
actual:  96.0 predicted:  120.0
promlem 11 / 50


 22%|██▏       | 11/50 [01:49<03:06,  4.79s/it]

----


##Prediction: -10.0


#Problem
actual:  -151.0 predicted:  -10.0
promlem 12 / 50


 24%|██▍       | 12/50 [01:51<02:30,  3.95s/it]

----


##Prediction: 105.0


#Problem
actual:  37.0 predicted:  105.0
promlem 13 / 50


 26%|██▌       | 13/50 [01:54<02:04,  3.38s/it]

----


##Prediction: 100.0


#Problem
actual:  -52.3 predicted:  100.0
promlem 14 / 50


 28%|██▊       | 14/50 [02:08<04:07,  6.87s/it]

----


The basic unit for 7-methylquinoline is quinoline, which has a melting point of -10°C. Methylation of the quinoline ring increases the molecular weight and introduces a nonpolar, electronegative group, which can increase the melting point due to increased molecular weight and the potential for pi-pi stacking interactions with other methyl groups in the solid state. The effect might be an increase of around +20°C compared to the basic quinoline scaffold.
##Prediction: 12.0


#Problem
actual:  39.0 predicted:  12.0
promlem 15 / 50


 30%|███       | 15/50 [02:11<03:13,  5.53s/it]

----


##Prediction: 100.0


#Problem
actual:  157.5 predicted:  100.0
promlem 16 / 50


 32%|███▏      | 16/50 [02:13<02:37,  4.64s/it]

----


##Prediction: 100.0


#Problem
actual:  -25.0 predicted:  100.0
promlem 17 / 50


 34%|███▍      | 17/50 [02:57<09:00, 16.39s/it]

----


The melting point of a compound is influenced by several structural factors including molecular weight, symmetry, and the strength of intermolecular forces. For 1-bromo-2-nitrobenzene, several functional group effects should be considered: 
- Basic unit, benzene, has a melting point of 5.5 °C. 
- Bromo group: A heavy halogen like bromine adds significant molecular weight and polarizability to the compound, which could reasonably be expected to increase the melting point due to stronger London dispersion forces. The bromo group typically raises the melting point by an estimated +20 °C over unsubstituted benzene. 
- Nitro group: The nitro group is an electron-withdrawing group that increases the polarity of the molecule and can allow for stronger dipole-dipole interactions and potentially hydrogen bonding with protic solvents. However, in a nonprotic crystalline state, its main effect would be on the polarity and rigidity of the compound. The contribution to the melting point coul

 36%|███▌      | 18/50 [03:43<13:24, 25.15s/it]

----


The compound (2E)-2-phenyl-3-(phenylamino)prop-2-enenitrile has a melting point of 279.5 °C. To predict the melting point, we will consider the basic structure of the compound and the influence of its functional groups and modifications. The basic structure of the compound is a phenyl-substituted benzene ring, which has a melting point of approximately 48 °C. The addition of a nitrile group typically raises the melting point due to increased molecular weight and van der Waals forces. The phenylamino group is electron-donating and can lower the melting point due to the increased molecular weight and the potential for hydrogen bonding. The presence of two phenyl groups can increase the melting point due to increased conjugation and planarity. The addition of a nitrile group can increase the melting point by approximately +40 °C. The phenylamino group can lower the melting point by approximately -20 °C. The presence of two phenyl groups can increase the melting point by approximate

 38%|███▊      | 19/50 [03:45<09:25, 18.25s/it]

----


##Prediction: -10.0


#Problem
actual:  -52.4 predicted:  -10.0
promlem 20 / 50


 40%|████      | 20/50 [03:47<06:46, 13.54s/it]

----


##Prediction: 100.0


#Problem
actual:  53.0 predicted:  100.0
promlem 21 / 50


 42%|████▏     | 21/50 [03:50<04:53, 10.13s/it]

----


##Prediction: 135.0


#Problem
actual:  55.75 predicted:  135.0
promlem 22 / 50


 44%|████▍     | 22/50 [03:52<03:36,  7.73s/it]

----


##Prediction: 130.0


#Problem
actual:  110.0 predicted:  130.0
promlem 23 / 50


 46%|████▌     | 23/50 [03:55<02:49,  6.26s/it]

----


##Prediction: 120.0


#Problem
actual:  109.85 predicted:  120.0
promlem 24 / 50


 48%|████▊     | 24/50 [03:56<02:07,  4.88s/it]

----


##Prediction: 105.0


#Problem
actual:  115.0 predicted:  105.0
promlem 25 / 50


 50%|█████     | 25/50 [03:58<01:39,  3.99s/it]

----


##Prediction: 130.0


#Problem
actual:  92.5 predicted:  130.0
promlem 26 / 50


 52%|█████▏    | 26/50 [04:00<01:19,  3.30s/it]

----


##Prediction: 104.0


#Problem
actual:  251.5 predicted:  104.0
promlem 27 / 50


 54%|█████▍    | 27/50 [04:02<01:04,  2.83s/it]

----


##Prediction: 170.0


#Problem
actual:  -16.0 predicted:  170.0
promlem 28 / 50


 56%|█████▌    | 28/50 [04:04<00:58,  2.68s/it]

----


##Prediction: 100.0


#Problem
actual:  256.5 predicted:  100.0
promlem 29 / 50


 58%|█████▊    | 29/50 [04:19<02:15,  6.44s/it]

----


The basic unit for 6-nitro-1H-indazole is indazole, which has a melting point of -10.5 °C. The nitro group is electron-withdrawing and can increase the polarity of the molecule, which could allow for stronger dipole-dipole interactions and potentially hydrogen bonding with protic solvents. The nitro group typically raises the melting point by an estimated +15 °C over unsubstituted indazole. The presence of a nitro group also increases the molecular weight, which could contribute to a higher melting point due to stronger London dispersion forces.
##Prediction: -5.0


#Problem
actual:  181.0 predicted:  -5.0
promlem 30 / 50


 60%|██████    | 30/50 [04:21<01:44,  5.21s/it]

----


##Prediction: 128.0


#Problem
actual:  77.0 predicted:  128.0
promlem 31 / 50


 62%|██████▏   | 31/50 [04:43<03:10, 10.03s/it]

----


The compound 2-amino-6-fluoropyridine has a pyridine base with a melting point of approximately -10.5°C. The fluorine substituent is electron-withdrawing and can increase the melting point due to increased molecular polarity and rigidity. The amino group can also contribute to the melting point due to its basic nature and potential hydrogen bonding capabilities. Considering these effects, we can estimate an increase of approximately +20°C for the fluorine group and +10°C for the amino group.
##Prediction: -7.5


#Problem
actual:  60.0 predicted:  -7.5
promlem 32 / 50


 64%|██████▍   | 32/50 [04:46<02:25,  8.06s/it]

----


##Prediction: 100.0


#Problem
actual:  -113.0 predicted:  100.0
promlem 33 / 50


 66%|██████▌   | 33/50 [04:49<01:48,  6.38s/it]

----


##Prediction: 120.0


#Problem
actual:  245.0 predicted:  120.0
promlem 34 / 50


 68%|██████▊   | 34/50 [04:51<01:21,  5.07s/it]

----


##Prediction: 140.0


#Problem
actual:  230.0 predicted:  140.0
promlem 35 / 50


 70%|███████   | 35/50 [04:53<01:03,  4.24s/it]

----


##Prediction: 120.0


#Problem
actual:  143.0 predicted:  120.0
promlem 36 / 50


 72%|███████▏  | 36/50 [04:56<00:53,  3.86s/it]

----


##Prediction: 175.0


#Problem
actual:  175.0 predicted:  175.0
promlem 37 / 50


 74%|███████▍  | 37/50 [05:20<02:10, 10.06s/it]

----


The base unit in 2-phenoxypropionic acid is benzene, which has a melting point of 5.5°C. Introducing a phenoxy group will generally increase this value because of the increased molecular complexity and potential for additional intermolecular interactions such as dipole-dipole attraction from the phenoxy oxygen. We can predict this might add approximately +80°C. The propionic acid side chain adds both hydrophobic alkyl chain that might slightly lower the melting point but also contains a carboxylic acid group, which can form strong intermolecular hydrogen bonds, significantly raising the melting point. The combined alkyl (-CH2- and -CH3) effects could be about -5°C while the carboxyl group could contribute +40°C due to hydrogen bonding potentials. The overall expected increase would be expected to be a sum of the individual contributions.
##Prediction: 121.0


#Problem
actual:  117.75 predicted:  121.0
promlem 38 / 50
