<a href="https://colab.research.google.com/github/Takumi173/2023Test/blob/main/PseudoCodeToText_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Program

In [None]:
!wget https://github.com/Takumi173/2023Test/releases/download/20231114/TestData_Ver.0.4.csv
!wget https://github.com/Takumi173/2023Test/releases/download/20231114/TrainData_Ver.0.4.csv
!wget https://github.com/Takumi173/2023Test/releases/download/20231126/PreparedData.tsv

In [None]:
# Install packages
!pip install transformers
#!pip install git+https://github.com/huggingface/transformers -Uqq #for mistral
!pip install sentencepiece accelerate bitsandbytes sentence_transformers xformers

!pip install einops # for japanese-stablelm-instruct-alpha-7b

#AutoGPTQ
!pip install optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7

In [None]:
# Log in to Hugging Face
!huggingface-cli login

# Llama 2 chat format

In [None]:
from threading import Thread
from typing import Iterator

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

#model_id = 'meta-llama/Llama-2-7b-chat-hf'
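# Observation: 13B in 4-bit seems more accurate than 7B loaded at full precision.
# Knowledge appears to be more accurate the larger the model is.
# Does the bit width mainly affect how the output is written out?
model_id = 'meta-llama/Llama-2-13b-chat-hf'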
#model_id = 'codellama/CodeLlama-13b-Instruct-hf'
#model_id = "elyza/ELYZA-japanese-Llama-2-7b-instruct"
#model_id = 'pfnet/plamo-13b'
#model_id = 'mistralai/Mistral-7B-Instruct-v0.1'  # out of memory


tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

if torch.cuda.is_available():
    bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.float16,
    )
    if model_id in ('meta-llama/Llama-2-13b-chat-hf', 'pfnet/plamo-13b', 'codellama/CodeLlama-13b-Instruct-hf'):
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          torch_dtype=torch.float16,
          device_map='auto',
#          load_in_8bit=True,
          quantization_config=bnb_config
      )
    elif model_id == 'elyza/ELYZA-japanese-Llama-2-7b-instruct':
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          torch_dtype=torch.float16,
          device_map='auto',
      )
    elif model_id == 'mistralai/Mistral-7B-Instruct-v0.1':
      bnb_config  = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_use_double_quant=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16,
      )
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          trust_remote_code=True,
#          torch_dtype=torch.float16,
          device_map='auto',
          quantization_config=bnb_config
      )
    else:
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          torch_dtype=torch.float16,
          device_map='auto',
      )
else:
    model = None
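
The cell above loads the 13B models in 4-bit NF4 and the 7B ELYZA model in float16. As a rough check of why that trade-off can fit on a single GPU, the sketch below estimates the memory taken by the weights alone. The function name `weight_memory_gib` and the round parameter counts are assumptions for illustration; the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights only, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params, bits in [
    ("7B float16", 7e9, 16),
    ("13B float16", 13e9, 16),
    ("13B 4-bit", 13e9, 4),
]:
    print(f"{name}: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit 13B weights take less memory than the float16 7B weights, which is consistent with loading the larger model quantized.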




In [None]:
import gc
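def flush():
  """Release unused memory between generations."""
  gc.collect()
  # The CUDA calls are only meaningful when a GPU is present.
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()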

def GeneratePrediction(Code):
  flush()

  prompt = f"""<s>[INST] <<SYS>>
You are a helpful, respectful, honest and excellent programmer.  Always answer as helpfully as possible, while being safe.  Please answer the questions as accurately as possible based on the knowledge you have.  Take a deep breath, read the text carefully, and think step-by-step!
<</SYS>>

### CONTEXT
Here are examples of translations from CODE to TEXT.

Example 1:
CODE: [Overall Response] <> "PROGRESSIVE DISEASE" AND [Overall Response] IS NOT EMPTY AND [New Lesion Progression] = "UNEQUIVOCAL"
TEXT: The [Overall Response] is not "PROGRESSIVE DISEASE", yet [New Lesion Progression] is "UNEQUIVOCAL".  Please review.

Example 2:
CODE: [Stop Date] > [Date of First Study Drug Taken] -30
TEXT: The [Stop Date] is more than 30 days prior to the [Date of First Study Drug Taken].  Please review.

Example 3:
CODE: [Primary Study Drug Treatment Status] = "COMPLETED" and [Date of Last Dose of Study Drug] < [Study Day 90]
TEXT: The [Primary Study Drug Treatment Status] is "COMPLETED", yet the [Date of Last Dose of Study Drug] is prior to [Study Day 90].  Please review.

Example 4:
CODE: 18 > [Age]
TEXT: [Age] is less than 18 years.

Example 5:
CODE: [LBTEST] = "Leukocytes" AND ([LBORRES] < 13 or [LBORRES] > 294)
TEXT: The [LBTEST] is "Leukocytes" and the [LBORRES] is not within the expected range of 13-294. Please review.

Example 6:
CODE: [Do you consider that there is a reasonable possibility that the event may have been caused by study drug?] = "YES" AND [Start date] < [Date of First Study Drug Taken]
TEXT: The [Do you consider that there is a reasonable possibility that the event may have been caused by study drug?] is "YES" and the [Start Date] is prior to the [Date of First Study Drug Taken].  Please review.

Example 7:
CODE: Record count of [Subject ID] is more than 1
TEXT: Duplicate [Subject ID].  Please review.


### QUESTION

Step1: Referring to the CONTEXT examples, translate the following CODE into TEXT.  It is important to maintain the CODE and the TEXT square brackets or quotation marks around the text when translating.
CODE: {Code}

Step2: Check if the text enclosed in square brackets or quotation marks in the CODE and the TEXT match exactly.  Also check that the logic of the CODE and the TEXT match exactly.  If not, modify the TEXT to match.  Answer should start with "TEXT:" and end with "Please review."  No explanation required.


### ANSWER
TEXT: [/INST]"""


  #print(prompt)
  with torch.no_grad():
    # The prompt already starts with <s>, so special tokens are not added again.
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    # Skip prompts longer than 1500 tokens.
    if token_ids.size(1) > 1500:
      output = "Too much text to read"
      return prompt, output

    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=256,
        do_sample=False,  # greedy decoding
#        top_p=1,
#        top_k=1,
#        temperature=0.1,
#        pad_token_id=tokenizer.pad_token_id,
#        eos_token_id=tokenizer.eos_token_id,
    )
  # Decode only the newly generated tokens, not the echoed prompt.
  output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1) :], skip_special_tokens=True)
  return prompt, output

Code = '[If Pregnancy Test was not done, please provide reason] <> "PRE-PUBERTY" or [Age] < 10'

prompt, output = GeneratePrediction(Code)

print("*** OUTPUT ***")
print(output)
print("\n*** PROMPT ***")
print(prompt)
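
Step 2 of the prompt asks the model to confirm that every square-bracketed field and quoted literal in the CODE appears unchanged in the TEXT. The same check can be run outside the model as a sanity filter on the output. The sketch below is one possible way to do that; `extract_tokens` and `missing_tokens` are hypothetical helper names, not part of the original notebook, and the comparison is deliberately case-sensitive.

```python
import re


def extract_tokens(s: str) -> set[str]:
    """Return the [bracketed] field names and "quoted" literals found in s."""
    return set(re.findall(r'\[[^\]]*\]|"[^"]*"', s))


def missing_tokens(code: str, text: str) -> set[str]:
    """Return tokens that occur in code but not verbatim in text."""
    return extract_tokens(code) - extract_tokens(text)


code = '[Age] < 18 AND [Sex] = "FEMALE"'
text = 'The [Age] is less than 18 and the [Sex] is "Female".  Please review.'
print(missing_tokens(code, text))
```

An empty result means all fields and literals were carried over verbatim. A non-empty result, as in the example where the casing of the quoted value changed, flags a generation that needs manual review.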


In [None]:
import pandas as pd
df1 = pd.read_csv('TestData_Ver.0.4.csv')
df1["Origin"] = "TestData"
df2 = pd.read_csv('TrainData_Ver.0.4.csv')
df2["Origin"] = "TrainData"

# ignore_index avoids duplicate row labels in the combined frame.
df = pd.concat([df1, df2], axis=0, ignore_index=True)
df

In [None]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')

from tqdm.auto import tqdm

ResList = []
PredList = []

for t in tqdm(df['Code'].tolist()):
  _, Response = GeneratePrediction(t)
  PredictedText = Response.split("TEXT:")[-1].split("\n")[0].strip()

  print(f'Generated *** {PredictedText}')
  ResList.append(Response)
  PredList.append(PredictedText)

df_pred = df.copy()  # copy so the source DataFrame is not modified
df_pred['Response'] = ResList
df_pred['GeneratedText'] = PredList

df_pred
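
The loop above keeps only the first line after the last `TEXT:` marker. If the model puts a line break directly after the marker, that first line is empty and the prediction is lost. The sketch below shows a slightly more tolerant variant that returns the first non-empty line instead. `extract_generated_text` is a hypothetical helper name, and whether this case actually occurs in the responses is an assumption.

```python
def extract_generated_text(response: str) -> str:
    """Return the first non-empty line after the last 'TEXT:' marker."""
    tail = response.split("TEXT:")[-1]
    for line in tail.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


print(extract_generated_text("Sure!\nTEXT:\nDuplicate [Subject ID].  Please review."))
```

For responses that already start with the answer on the first line, this returns the same string as the original one-liner.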

In [None]:
df_pred.to_csv('PredData_llama2.csv')
#df_pred.to_csv('PredData_codellama.csv')