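# Training and prediction with an LLM on a (Q) molecular structure + (R) reasoning + (A) physical property dataset
- Q&A: uses a melting point dataset
- R: the model generates its own reasoning and is then trained on the data that reach the correct answer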

In [1]:
%reload_ext autoreload
%autoreload 2

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

from transformers import AutoTokenizer
import pandas as pd
import random
import copy
import glob
import json
from datetime import datetime
from llmchem.utils import mk_dir,clean_vram

#import clear_output

from IPython.display import clear_output



In [2]:
#dataset settings
n_test=50 #number of test samples
n_train_check=50 #number of training samples used for evaluation (evaluating all of them takes too long, so only a subset is checked)
n_GPT_reasoning=30 #number of reasoning samples written by GPT
n_generation_iterations=100 #number of trials per generation for producing new self-reasoning data
max_generations=50 #number of train/evaluate/generate rounds

#model settings
model_name="mistralai/Mixtral-8x7B-Instruct-v0.1"
target_modules= [
    "lm_head",
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate",
    "w1",
    "w2",
    "w3"
]

# Llama-2 settings: these override the Mixtral values above
model_name="meta-llama/Llama-2-7b-chat-hf"
target_modules= [
    #"embed_tokens",
    "lm_head",
    #"q_proj",
    #"k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    #"down_proj",
]



#LoRA settings
r=32
lora_alpha=r
bit=16
#bit=8
#bit=4

#train settings
gradient_checkpointing =False
per_device_train_batch_size=1
epochs=3
lr=10**-5

#device settings
device_map="auto"

#dataset path
dataset_path="dataset/231225AutoReasoning/240117best_reason_record_3k.csv"

#project path
project_dir="results/projects/240117llama7b"

#reasoning options
error_threshold=30  # if the absolute error is smaller than this, the sample is added to the training data
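
The `error_threshold` comment above implies an acceptance rule for self-generated reasoning. The sketch below shows one way such a rule could look. The function name `accept_reasoning` is hypothetical, and the actual logic inside `llmchem.reasoning.self_reasoning` is not shown in this notebook, so treat this as an assumption rather than the library's implementation.

```python
def accept_reasoning(actual, predicted, error_threshold=30):
    """Return True when a self-generated answer is close enough to keep.

    Hypothetical helper: mirrors the comment on `error_threshold`
    (absolute error strictly smaller than the threshold).
    """
    if predicted is None:
        # Assumption: a generation with no parsable prediction is rejected.
        return False
    return abs(actual - predicted) < error_threshold


# (actual, predicted) pairs in degrees Celsius, chosen for illustration
print(accept_reasoning(157.5, 170.0))  # True: error is 12.5
print(accept_reasoning(-52.3, 105.0))  # False: error is 157.3
print(accept_reasoning(42.0, None))    # False: no prediction
```

With a threshold of 30, only predictions within 30 degrees of the measured melting point would be added back to the training set.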

In [3]:
mk_dir(project_dir)
mk_dir(project_dir+"/eval")
mk_dir(project_dir+"/self_reasoning")
mk_dir(project_dir+"/train")

In [4]:
from llmchem.model import init_model
from llmchem.train import train_model
from llmchem.eval import eval_model
from llmchem.reasoning import self_reasoning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [5]:
#load base dataset

df=pd.read_csv(dataset_path)
dataset=df.to_dict(orient="records")
random.seed(0)
random.shuffle(dataset)

base_train_dataset=dataset[:-n_test]
example_reasoning_dataset=base_train_dataset[:n_GPT_reasoning]
test_dataset=dataset[-n_test:]
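
As a quick sanity check of the slicing above, the following sketch repeats the same split on toy records. The `id` field and the small sizes are placeholders for illustration, not the columns of the real CSV. It confirms that the test set is disjoint from the training pool and that the few-shot examples are a subset of that pool.

```python
import random

n_test = 5
n_GPT_reasoning = 3

# Toy stand-in for the records loaded from the CSV (the "id" key is hypothetical)
dataset = [{"id": i} for i in range(20)]
random.seed(0)
random.shuffle(dataset)

# Same slicing as in the cell above
base_train_dataset = dataset[:-n_test]
example_reasoning_dataset = base_train_dataset[:n_GPT_reasoning]
test_dataset = dataset[-n_test:]

train_ids = {d["id"] for d in base_train_dataset}
test_ids = {d["id"] for d in test_dataset}
example_ids = {d["id"] for d in example_reasoning_dataset}

# No test record leaks into the training pool
assert train_ids.isdisjoint(test_ids)
# The prompt examples are drawn from the training pool
assert example_ids <= train_ids

print(len(base_train_dataset), len(test_dataset), len(example_reasoning_dataset))  # 15 5 3
```

Because `example_reasoning_dataset` is a slice of `base_train_dataset`, the same records can appear both as prompt examples and as candidates for self-reasoning.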

In [6]:
#Loop: training, evaluation, data generation
for generation in range(max_generations):
    clear_output()
    print(f"Generation: {generation}")
    #prepare train dataset

    ## reasoning data written by GPT-4
    train_dataset=copy.deepcopy(example_reasoning_dataset)

    print(f"GPT-generated reasons: {len(train_dataset)}")

    ## reasoning data generated by the model itself
    for path in glob.glob(f"{project_dir}/self_reasoning/*.json"):
        with open(path) as f:
            train_dataset.append(json.load(f))

    print(f"All-generated reasons: {len(train_dataset)}")
    random.shuffle(train_dataset)

    #train model
    clean_vram()
    model=init_model(model_name, r, lora_alpha, target_modules, bit=bit,device_map=device_map)
    train_result=train_model(model,tokenizer,train_dataset,
                    project_dir=project_dir,
                    epochs=epochs,
                    lr=lr,
                    per_device_train_batch_size=per_device_train_batch_size,
                    gradient_checkpointing=gradient_checkpointing,
                    )

    #eval

    train_check_dataset=copy.deepcopy(train_dataset[:n_train_check])
    random.shuffle(train_check_dataset)
    train_eval_result=eval_model(model,tokenizer,train_check_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"train_{generation}"
                                )

    test_eval_result=eval_model(model,tokenizer,test_dataset,
                                f"{project_dir}/eval",
                                n_prompt_examples=3,
                                prompt_dataset=example_reasoning_dataset,
                                prefix=f"test_{generation}"
                                )

    #generate additional training data by self-reasoning
    self_reasoning(model,tokenizer,base_train_dataset,
                example_reasoning_dataset,project_dir,
                generation=generation,
                n_iterations=n_generation_iterations,
                error_threshold=error_threshold,
                n_max_trials=2)
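
The evaluation step prints lines of the form `actual:  X predicted:  Y`. A minimal sketch for summarising such output as a mean absolute error is given below. The helper names `parse_pairs` and `mean_absolute_error` are hypothetical and not part of `llmchem`, and the sketch assumes the printed format stays as shown. Truncated lines with no numbers are skipped.

```python
import re

# Matches "actual:  -52.3 predicted:  105.0" on a single line
PAIR = re.compile(
    r"actual:[ \t]*(-?\d+(?:\.\d+)?)[ \t]+predicted:[ \t]*(-?\d+(?:\.\d+)?)"
)


def parse_pairs(log_text):
    """Extract (actual, predicted) pairs; incomplete lines are ignored."""
    return [(float(a), float(p)) for a, p in PAIR.findall(log_text)]


def mean_absolute_error(pairs):
    """Average absolute difference between actual and predicted values."""
    if not pairs:
        raise ValueError("no (actual, predicted) pairs found")
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


sample_log = """
actual:  -52.3 predicted:  105.0
actual:  42.0 predicted:  120.0
actual:  
actual:  157.5 predicted:  170.0
"""

pairs = parse_pairs(sample_log)
print(len(pairs))                           # 3
print(round(mean_absolute_error(pairs), 2))  # 82.6
```

Counting how many lines fail to parse is also informative, since generations that are cut off before `##Prediction` produce no usable pair.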

Generation: 1
GPT-generated reasons: 30
All-generated reasons: 53
Using fp16 mode


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.22s/it]
Map: 100%|██████████| 53/53 [00:00<00:00, 4388.04 examples/s]


Step,Training Loss
100,1.678


  0%|          | 0/50 [00:00<?, ?it/s]

promlem 1 / 50


  2%|▏         | 1/50 [00:01<01:10,  1.45s/it]

----


##Prediction: 105.0


#Problem
actual:  -52.3 predicted:  105.0
promlem 2 / 50


  4%|▍         | 2/50 [00:02<01:11,  1.49s/it]

----


##Prediction: 120.0


#Problem
actual:  42.0 predicted:  120.0
promlem 3 / 50


  6%|▌         | 3/50 [00:05<01:36,  2.04s/it]

----


##Prediction: 105.0


#Problem
actual:  165.5 predicted:  105.0
promlem 4 / 50


  8%|▊         | 4/50 [00:08<01:49,  2.38s/it]

----


##Prediction: 170.0


#Problem
actual:  157.5 predicted:  170.0
promlem 5 / 50


 10%|█         | 5/50 [00:11<01:52,  2.50s/it]

----


##Prediction: 105.0


#Problem
actual:  15.0 predicted:  105.0
promlem 6 / 50


 12%|█▏        | 6/50 [00:53<11:40, 15.91s/it]

----


- Benzene ring: The base unit benzene has a melting point of 5.5 degrees Celsius.
- Cyclohexyl group: The cyclohexyl group has a melting point of -12.5 degrees Celsius.
- Carboxylic acid group: This functional group typically raises the melting point due to hydrogen bonding and increased polarity. A rough estimate for its contribution could be +50 degrees.
- Cumulative impact: The individual contributions of each group can be summed up, but it's also necessary to consider that the presence of multiple polar and hydrogen bonding functional groups in one molecule can lead to synergistic effects, amplifying the increase in melting point beyond the sum of individual contributions.
Based on the above considerations, a rough estimate would be 5.5 (for the benzene rings) + 12.5 (for the cyclohexyl group) + 50 (for the carboxylic acid) = 77.5 degrees Celsius. Adding a consideration for synergistic effects, a more refined estimate taking into account the cumulative polar and hydrogen bon

 14%|█▍        | 7/50 [00:55<08:17, 11.58s/it]

----


##Prediction: -50.0


#Problem
actual:  42.0 predicted:  -50.0
promlem 8 / 50


 16%|█▌        | 8/50 [00:58<06:00,  8.58s/it]

----


##Prediction: 140.0


#Problem
actual:  178.0 predicted:  140.0
promlem 9 / 50


 18%|█▊        | 9/50 [01:00<04:28,  6.54s/it]

----


##Prediction: 125.0


#Problem
actual:  147.0 predicted:  125.0
promlem 10 / 50


 20%|██        | 10/50 [01:19<07:03, 10.59s/it]

----


The compound has a base structure of a phenyl group, which has a melting point of approximately 100C. The presence of a nitrile group (-CN) can increase the melting point due to the polarity and electronegativity of the nitrogen atom. The phenylamino group (NH2) is also polar and can form hydrogen bonds, which can further increase the melting point. The presence of two phenyl groups adds significant mass and polarity to the molecule, which can also contribute to the melting point. The overall effect of these groups on the melting point is synergistic, rather than additive, due to the complex interplay of implied intermolecular interactions.
##Prediction: 120.0


#Problem
actual:  156.0 predicted:  120.0
promlem 11 / 50


 22%|██▏       | 11/50 [01:21<05:11,  7.98s/it]

----


##Prediction: 55.0


#Problem
actual:  95.0 predicted:  55.0
promlem 12 / 50


 24%|██▍       | 12/50 [01:24<03:59,  6.30s/it]

----


##Prediction: 105.5


#Problem
actual:  151.0 predicted:  105.5
promlem 13 / 50


 26%|██▌       | 13/50 [01:27<03:13,  5.24s/it]

----


##Prediction: 135.0


#Problem
actual:  247.0 predicted:  135.0
promlem 14 / 50


 28%|██▊       | 14/50 [01:29<02:39,  4.42s/it]

----


##Prediction: 125.0


#Problem
actual:  342.0 predicted:  125.0
promlem 15 / 50


 30%|███       | 15/50 [01:32<02:15,  3.88s/it]

----


##Prediction: 50.0


#Problem
actual:  -170.0 predicted:  50.0
promlem 16 / 50


 32%|███▏      | 16/50 [01:34<01:59,  3.52s/it]

----


##Prediction: 105.0


#Problem
actual:  65.0 predicted:  105.0
promlem 17 / 50


 34%|███▍      | 17/50 [01:37<01:45,  3.21s/it]

----


##Prediction: 120.0


#Problem
actual:  192.0 predicted:  120.0
promlem 18 / 50


 36%|███▌      | 18/50 [01:39<01:31,  2.87s/it]

----


##Prediction: 105.0


#Problem
actual:  95.0 predicted:  105.0
promlem 19 / 50


 38%|███▊      | 19/50 [02:35<09:39, 18.69s/it]

----


Salinomycin is a complex molecule with multiple functional groups that contribute to its melting point. The basic unit for comparison is the parent compound, salicylic acid, which has a melting point of 143 degrees Celsius. The introduction of additional functional groups can increase or decrease the melting point depending on their effect on intermolecular forces.

The carboxylic acid group (-COOH) is a common functional group that can increase the melting point due to the increased polarizability of the oxygen atoms and the electronegativity of the oxygen atoms. The effect of a carboxylic acid group on the melting point is estimated to be around +10 to +20 degrees Celsius.

The hydroxyl (-OH) group can also increase the melting point due to the increased hydrogen bonding capabilities. The effect of a hydroxyl group on the melting point is estimated to be around +50 to +70 degrees Celsius.

The methyl (-CH3) group can decrease the melting point due to the reduced polarizability

 40%|████      | 20/50 [02:37<06:52, 13.75s/it]

----


##Prediction: -105.0


#Problem
actual:  205.0 predicted:  -105.0
promlem 21 / 50


 42%|████▏     | 21/50 [02:39<04:58, 10.29s/it]

----


##Prediction: 140.0


#Problem
actual:  208.0 predicted:  140.0
promlem 22 / 50


 44%|████▍     | 22/50 [02:55<05:32, 11.89s/it]

----


The basic unit for comparison is aniline, which has a melting point of 10.5 degrees Celsius. The iodine atom adds a significant increase in melting point due to the polar and electronegative nature of iodine. The iodine atom also increases the molecular weight, which can contribute to a higher melting point. A rough estimate for the iodine atom would be +100 degrees Celsius.
##Prediction: 110.5


#Problem
actual:  55.75 predicted:  110.5
promlem 23 / 50


 46%|████▌     | 23/50 [03:32<08:46, 19.51s/it]

----


The compound 3,5-dichlorophenylhydrazine has a melting point of 155.0 degrees Celsius. The functional groups present in the molecule contribute to this value:
- Phenyl group: The presence of two chlorine atoms increases the melting point due to increased polarity and intermolecular interactions. A rough estimate for this effect could be around 25 degrees Celsius.
- Hydrazine group: Hydrazines can form hydrogen bonds, which tend to increase the melting point. A reasonable estimate for this effect could be around 50 degrees Celsius.
- Chlorine atoms: Chlorine atoms can increase the melting point due to increased polarity and intermolecular interactions. An estimated contribution for each chlorine atom could be around 10 degrees Celsius.
- Synergistic effect: The presence of multiple polar functional groups in close proximity can lead to synergistic effects, amplifying the increase in melting point beyond the sum of individual contributions.
##Prediction: 180.0


#Problem
actual:  

 48%|████▊     | 24/50 [03:35<06:19, 14.59s/it]

----


##Prediction: 145.0


#Problem
actual:  230.0 predicted:  145.0
promlem 25 / 50


 50%|█████     | 25/50 [03:37<04:31, 10.87s/it]

----


##Prediction: 100.0


#Problem
actual:  -52.4 predicted:  100.0
promlem 26 / 50


 52%|█████▏    | 26/50 [03:40<03:19,  8.31s/it]

----


##Prediction: 105.0


#Problem
actual:  193.0 predicted:  105.0
promlem 27 / 50


 54%|█████▍    | 27/50 [04:05<05:11, 13.52s/it]

----


The molecule has a benzene ring, which has a melting point of approximately -10°C. The methyl group (-CH3) adds steric bulk and increases molecular weight, which typically raises the melting point. The chlorine atom (-Cl) is polar and can form hydrogen bonds, thus increasing the melting point. The methoxy group (-OCH3) adds additional hydrogen bonding opportunities and increased molecular weight, which can further increase the melting point. The overall effect of these functional groups on the melting point is estimated to be approximately +30°C.
##Prediction: 40.0


#Problem
actual:  95.0 predicted:  40.0
promlem 28 / 50


 56%|█████▌    | 28/50 [04:08<03:48, 10.39s/it]

----


##Prediction: 105.0


#Problem
actual:  -24.15 predicted:  105.0
promlem 29 / 50


 58%|█████▊    | 29/50 [04:10<02:45,  7.88s/it]

----


##Prediction: 105.0


#Problem
actual:  115.0 predicted:  105.0
promlem 30 / 50


 60%|██████    | 30/50 [04:13<02:04,  6.20s/it]

----


##Prediction: 105.0


#Problem
actual:  82.5 predicted:  105.0
promlem 31 / 50


 62%|██████▏   | 31/50 [04:15<01:34,  4.97s/it]

----


##Prediction: 175.0


#Problem
actual:  46.0 predicted:  175.0
promlem 32 / 50


 64%|██████▍   | 32/50 [04:17<01:14,  4.12s/it]

----


##Prediction: 170.0


#Problem
actual:  259.0 predicted:  170.0
promlem 33 / 50


 66%|██████▌   | 33/50 [04:58<04:19, 15.28s/it]

----


The melting point of this compound is influenced by the presence of several functional groups, including a nitro group, a benzodioxine ring, and two ethylfuran groups. The nitro group is electron-withdrawing and can increase the polarity of the molecule, which can lead to stronger intermolecular forces and a higher melting point. The benzodioxine ring is a heterocycle with oxygen and nitrogen atoms, which can contribute to increased polarity and hydrogen bonding capabilities. The ethylfuran groups are polar and can participate in hydrogen bonding, which can also increase the melting point.

To predict the melting point of this compound, we will consider the individual contributions of each functional group based on known trends and compare those to the actual value provided:

- Nitro group: +20 °C
- Benzodioxine ring: +15 °C
- Ethylfuran groups: +10 °C each

Adding up these contributions, we predict a higher melting point than that of a simple aromatic ring due to the presence o

 68%|██████▊   | 34/50 [05:00<03:00, 11.30s/it]

----


##Prediction: 105.0


#Problem
actual:  142.0 predicted:  105.0
promlem 35 / 50


 70%|███████   | 35/50 [05:19<03:23, 13.59s/it]

----


The basic unit in this case is benzoic acid, which has a melting point of 121.5°C. The 2-hydroxyethyl ester group (C2H5) adds a hydrophilic head to the molecule, which can increase the melting point due to increased intermolecular hydrogen bonding. The hydroxyl group (OH) in the 2-hydroxyethyl ester also contributes to the melting point due to increased hydrogen bonding. The estimated effect of the 2-hydroxyethyl ester is +20°C, and the effect of the hydroxyl group is +10°C.
##Prediction: 141.5


#Problem
actual:  37.0 predicted:  141.5
promlem 36 / 50


 72%|███████▏  | 36/50 [05:22<02:24, 10.35s/it]

----


##Prediction: 105.0


#Problem
actual:  -100.0 predicted:  105.0
promlem 37 / 50


 74%|███████▍  | 37/50 [05:24<01:44,  8.02s/it]

----


##Prediction: 120.0


#Problem
actual:  110.0 predicted:  120.0
promlem 38 / 50


 76%|███████▌  | 38/50 [05:27<01:17,  6.48s/it]

----


##Prediction: 105.0


#Problem
actual:  167.0 predicted:  105.0
promlem 39 / 50


 78%|███████▊  | 39/50 [06:17<03:34, 19.49s/it]

----


The melting point of 2-amino-6-fluoropyridine can be predicted by considering the effects of the functional groups within the molecule. The pyridine ring itself has a melting point of around 100°C, which is a common value for simple aromatic rings. The fluorine atom adds a significant increase in melting point due to its electronegativity and the potential for dipole-dipole interactions. The amino group can also contribute to a higher melting point due to its ability to form hydrogen bonds. However, the fluorine atom's higher electronegativity and the potential for interference with hydrogen bonding may reduce the amino group's effect.

Based on these considerations, we can estimate the melting point of 2-amino-6-fluoropyridine as follows:

- Pyridine ring: 100°C
- Fluorine atom: +50°C (increased electronegativity and potential for dipole-dipole interactions)
- Amino group: +20°C (ability to form hydrogen bonds)

Combining these values, we predict a melting point for 2-amino-6-f

 80%|████████  | 40/50 [06:20<02:25, 14.53s/it]

----


##Prediction: 105.0


#Problem
actual:  -100.0 predicted:  105.0
promlem 41 / 50


 82%|████████▏ | 41/50 [06:49<02:48, 18.75s/it]

----


The base molecule, dibenzyl sulfoxide, has a melting point of approximately -20.5 °C. The presence of two benzene rings adds significant mass and induces strong dipole moments due to their electronegativity, which can increase the melting point due to stronger intermolecular forces (each benzene ring approximately +60 °C). The sulfoxide group is polar and can form hydrogen bonds, thus also increasing the melting point (approximately +40 °C). However, the presence of two non-equivalent functional groups (sulfur and benzene) in the para position relative to each other may slightly reduce symmetry and possibly decrease the overall effect on melting point when compared to a more symmetrically substituted molecule.
##Prediction: 160.0


#Problem
actual:  133.0 predicted:  160.0
promlem 42 / 50


 84%|████████▍ | 42/50 [06:52<01:51, 13.98s/it]

----


##Prediction: -100.0


#Problem
actual:  86.0 predicted:  -100.0
promlem 43 / 50


 86%|████████▌ | 43/50 [07:19<02:06, 18.10s/it]

----


- The carboxylic acid functional group (-COOH) is a common component of organic acids and can contribute to a higher melting point due to the increased polarity and possible hydrogen bonding. Estimated to increase melting point by +20.
- The methyl group (-CH3) adds rigidity to the molecule and can potentially form hydrogen bonds, which would increase the melting point by +10.
- The presence of multiple carbons in the molecule contributes to the molecule's rigidity and intermolecular forces, estimated to increase the melting point by +20.

Starting with a base melting point of around 10°C for simple carboxylic acids, these adjustments lead to a total estimation.
##Prediction: 85.0


#Problem
actual:  100.0 predicted:  85.0
promlem 44 / 50


 88%|████████▊ | 44/50 [07:22<01:20, 13.44s/it]

----


##Prediction: 105.0


#Problem
actual:  164.0 predicted:  105.0
promlem 45 / 50


 90%|█████████ | 45/50 [07:47<01:25, 17.03s/it]

----


The basic unit for comparison could be 4-methylaniline, which has a melting point of 107.4 °C. The presence of a thiopyran ring in the molecule increases the molecular weight and introduces a polar functional group, which can increase the melting point due to stronger London dispersion forces. The thiopyran ring also introduces a degree of aromaticity, which can contribute to the melting point. The presence of two phenyl groups adds further polarity and rigidity to the molecule, which can also increase the melting point.

Considering the individual effects, one might expect these functional groups to have mostly additive impacts on the melting point, but some deviation from strict additivity can occur due to the disruption of the aromatic system's symmetry by the substituents.
##Prediction: 130.0


#Problem
actual:  148.0 predicted:  130.0
promlem 46 / 50


 92%|█████████▏| 46/50 [08:03<01:06, 16.69s/it]

----


Chaulmoogric acid has a long hydrocarbon chain with many double bonds, which increases its melting point due to the increased molecular weight and the presence of conjugated double bonds, which can form pi-pi interactions. The hydroxyl (-OH) group at one end of the chain can also contribute to the melting point through hydrogen bonding. The presence of multiple hydroxyl groups can increase the melting point further.
##Prediction: 100.0


#Problem
actual:  68.5 predicted:  100.0
promlem 47 / 50


 94%|█████████▍| 47/50 [08:06<00:37, 12.46s/it]

----


##Prediction: 147.0


#Problem
actual:  147.0 predicted:  147.0
promlem 48 / 50
