In [1]:
#libraries
!pip install transformers==4.31.0 
!pip install sentence-transformers==2.2.2 



In [2]:
#specific version to avoid error with vector store
!pip install chromadb==0.5.3
!pip install datasets



In [3]:
#loading training data for providing context to the model
import pandas as pd
train_data_df=pd.read_csv("C:/Users/USER/Documents/Projects/Instruction Dataset/LLM Training Data/train_data_split.csv")
#build one example string per row: instruction, input text and expected output
train_data_df['Instruction'] = (
    train_data_df['Instruction'] + ". Input Text: " + train_data_df['input_text']
    + "\n\nExtracted Properties:\n\n" + train_data_df['Output'].astype(str)
)
#collect the example strings in a dict (a layout compatible with Hugging Face Datasets)
train_dataset_dict = {
    'instruction': train_data_df['Instruction'].tolist()
}
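
The concatenation above produces one few-shot example string per row. As a minimal sketch, the same layout can be written as a plain function, which may be easier to check in isolation. `build_example` is a hypothetical helper that is not part of the original notebook, and the sample values are made up for illustration.

```python
def build_example(instruction: str, input_text: str, output: object) -> str:
    """Join one training row into the single-string layout used for the context documents."""
    # Mirrors the DataFrame expression: the output is cast to str, as with .astype(str)
    return (
        f"{instruction}. Input Text: {input_text}"
        f"\n\nExtracted Properties:\n\n{output}"
    )


# Illustrative row (assumed values, not taken from the training CSV)
example = build_example(
    "Extract the design type",
    "A resonant-phonon QCL is demonstrated.",
    "resonant-phonon",
)
print(example)
```

Keeping the layout in one place makes it less likely that the query built later drifts away from the format of the stored examples.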

In [4]:
#verify
#train_dataset_dict

In [5]:
#creating context document from the list
docs=train_dataset_dict['instruction']

In [6]:
#initializing the embedding model
from sentence_transformers import SentenceTransformer
import chromadb
#the model weights are cached locally in cache_folder
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', cache_folder = '/data/base_models')

In [7]:
#do the embedding
embeddings = embedding_model.encode(docs)

In [8]:
#verify the embeddings
embeddings.shape

(1040, 768)

In [9]:
#indexing: get_or_create_collection makes this cell safe to re-run
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="q-rag_gpt")

In [10]:
#adding the train data
collection.add(
    embeddings=embeddings.tolist(),  #plain lists instead of a NumPy array
    documents=docs,
    ids=[str(i) for i in range(len(docs))]
)

In [11]:
#retrieving the n_results documents closest to the query
def retrieve_vector_db(query, n_results=3):
    results = collection.query(
        query_embeddings=embedding_model.encode(query).tolist(),
        n_results=n_results
    )
    return results['documents']
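
Conceptually, `collection.query` ranks the stored embeddings by distance to the query embedding and returns the closest `n_results` documents. The toy sketch below shows that ranking step with plain Python lists and squared Euclidean distance, which is believed to correspond to Chroma's default `l2` space. The vectors, the `top_k` helper and the brute-force scan are illustrative assumptions; Chroma itself uses an approximate HNSW index rather than an exhaustive search.

```python
from typing import List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          docs: Sequence[str],
          k: int = 3) -> List[str]:
    """Return the k documents whose vectors are closest to query_vec."""
    ranked = sorted(range(len(docs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))
    return [docs[i] for i in ranked[:k]]


# Made-up 2-D "embeddings" standing in for the 768-dimensional ones
toy_docs = ["LO-phonon example", "bound-to-continuum example", "resonant-phonon example"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

nearest = top_k([1.0, 0.05], toy_vecs, toy_docs, k=2)
print(nearest)
```

The real call also returns a nested list (one inner list per query), which is why the notebook later indexes `retrieved_results[0]`.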

In [12]:
#sample query
#we can have different queries depending on the property we are interested in
#query=input ("Please Enter your query here: ") 
query = "Please extract the design type of the quantum cascade laser device. Please print none if the value is not in text and do not give any explanations. In the output, just include only the extracted property."
#input_text is the user-supplied text from which the property is extracted
#print("Enter the text containing the properties here")
input_text= "2.1 THz quantum cascade laser (QCL) based on a scattering-assisted injection and resonant-phonon depopulation design scheme is demonstrated. The QCL is based on a four-well period implemented in the GaAs/Al0.15Ga0.85As material system. The QCL operates up to a heat-sink temperature of 144 K in pulsed-mode, which is considerably higher than that achieved for previously reported THz QCLs operating around the frequency of 2 THz. At 46 K, the threshold current-density was measured as ~ 745 A/cm2 with a peak-power output of ~ 10 mW. Electrically stable operation in a positive differential-resistance regime is achieved by a careful choice of design parameters. The results validate the robustness of scattering-assisted injection schemes for development of low-frequency (ν < 2.5 THz) QCLs."
input_text2="We report the development of a quantum cascade laser, at l587.2 mm, corresponding to 3.44 THz or 14.2 meV photon energy. The GaAs/Al0.15Ga0.85As laser structure utilizes longitudinal-optical LO-phonon scattering for electron depopulation. Laser action is obtained in pulsed mode at temperatures up to 65 K, and at 50% duty cycle up to 29 K. Operating at 5 K in pulsed mode, the threshold current density is 840 A/cm2, and the peak power is approximately 2.5 mW. Based on the relatively high operating temperatures and duty cycles, we propose that direct LO-phonon-based depopulation is a robust method for achieving quantum cascade lasers at long-wavelength THz frequencies."
#append the input text to the instruction; here the second sample text is used
query = query + " Input Text: " + input_text2
retrieved_results = retrieve_vector_db(query)
print(query)

Please extract the design type of the quantum cascade laser device. Please print none if the value is not in text and do not give any explanations. In the output, just include only the extracted property. Input Text: We report the development of a quantum cascade laser, at l587.2 mm, corresponding to 3.44 THz or 14.2 meV photon energy. The GaAs/Al0.15Ga0.85As laser structure utilizes longitudinal-optical LO-phonon scattering for electron depopulation. Laser action is obtained in pulsed mode at temperatures up to 65 K, and at 50% duty cycle up to 29 K. Operating at 5 K in pulsed mode, the threshold current density is 840 A/cm2, and the peak power is approximately 2.5 mW. Based on the relatively high operating temperatures and duty cycles, we propose that direct LO-phonon-based depopulation is a robust method for achieving quantum cascade lasers at long-wavelength THz frequencies.


In [13]:
#viewing the 3 retrieved context examples for the user query
retrieved_results[0]

['From the following sentence, please extract the design type of the Quantum Cascade Semiconductor laser device. Print none if the value does not exist in the input text. Input Text: Given the above-normal operating temperatures and duty cycles, we assert that utilizing direct LO-phonon-based depopulation proves to be a robust method in the realization of long-wavelength THz quantum cascade lasers.\t\t\n\nExtracted Properties:\n\nLO-phonon',
 'From the following sentence, please extract the design type of the Quantum Cascade Semiconductor laser device. Print none if the value does not exist in the input text. Input Text: Considering the relatively high temperatures at which they operate and the frequency of their cycles, we propose that leveraging direct LO-phonon-based depopulation is a dependable means of obtaining quantum cascade lasers that emit long-wavelength THz frequencies.\t\t\n\nExtracted Properties:\n\nLO-phonon',
 'From the following sentence, please extract the design type

In [14]:
#assigning the context
context=retrieved_results[0]

In [15]:
#prompt template
prompt = f'''
[INST]
Problem Definition: Extraction of quantum cascade laser properties from text entails extracting properties from a given text description. This should be done without providing any other additional information or explanations. The output format should correspond to the one in the example sentences. Example sentences containing an instruction, input text and the extracted properties are given below:

Example Sentences: {context}

Instruction: {query}

[/INST]
'''
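
Because `context` is a Python list, the f-string inserts its `repr`, including the brackets, quotes and escaped `\n` sequences. One possible alternative, sketched here as an assumption and not what the notebook actually ran, is to join the examples into numbered plain text before building the prompt. `format_examples` is a hypothetical helper and the sample strings are made up.

```python
from typing import List


def format_examples(examples: List[str]) -> str:
    """Render retrieved examples as numbered plain-text blocks."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        # Drop trailing tabs/spaces on each line (e.g. leftovers from the CSV export)
        cleaned = "\n".join(line.rstrip() for line in ex.strip().splitlines())
        blocks.append(f"Example {i}:\n{cleaned}")
    return "\n\n".join(blocks)


# Illustrative stand-ins for retrieved_results[0]
sample_context = [
    "Extract the design type. Input Text: Direct LO-phonon depopulation is used.\t\t\n\nExtracted Properties:\n\nLO-phonon",
    "Extract the design type. Input Text: A bound-to-continuum design is reported.\n\nExtracted Properties:\n\nbound-to-continuum",
]

formatted = format_examples(sample_context)
print(formatted)
```

With this approach `{context}` in the template would be replaced by `format_examples(context)`; whether the cleaner layout changes the model's answers has not been tested here.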

In [16]:
#view prompt
print(prompt)


[INST]
Problem Definition: Extraction of quantum cascade laser properties from text entails extracting properties from a given text description. This should be done without providing any other additional information or explanations. The output format should correspond to the one in the example sentences. Example sentences containing an instruction, input text and the extracted properties are given below:

Example Sentences: ['From the following sentence, please extract the design type of the Quantum Cascade Semiconductor laser device. Print none if the value does not exist in the input text. Input Text: Given the above-normal operating temperatures and duty cycles, we assert that utilizing direct LO-phonon-based depopulation proves to be a robust method in the realization of long-wavelength THz quantum cascade lasers.\t\t\n\nExtracted Properties:\n\nLO-phonon', 'From the following sentence, please extract the design type of the Quantum Cascade Semiconductor laser device. Print none if

In [17]:
#passing the complete prompt to GPT
import openai
import os
from IPython.display import Markdown
#read the key from the OPENAI_API_KEY environment variable instead of hard-coding it
api_key = os.environ.get("OPENAI_API_KEY")

In [18]:
#helper function to send a prompt to GPT and return the reply text

# Set the API key loaded in the previous cell
openai.api_key = api_key

def chatWithGPT4(user_text, print_output=False):
    try:
        # Create a chat completion with the legacy `ChatCompletion` interface (requires openai<1.0)
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[{"role": "system", "content": "You are a helpful assistant."},
                      {"role": "user", "content": user_text}],
            max_tokens=3000
        )
        text_output = response['choices'][0]['message']['content'] if response['choices'] else "No response generated."
        if print_output:
            print(text_output)
        return text_output
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

In [19]:
# Extracting a property
result = chatWithGPT4(prompt)
print(result)

LO-phonon
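
Since the instruction asks the model to answer `none` when the property is absent, and `chatWithGPT4` returns `None` when the call fails, downstream code will probably want to map both cases to a single missing value. A minimal sketch under that assumption follows. `parse_property` is a hypothetical helper, and the exact spellings of the sentinel that the model emits have not been verified.

```python
from typing import Optional


def parse_property(raw: Optional[str]) -> Optional[str]:
    """Normalise a model reply: return the extracted value, or None if nothing was found."""
    if raw is None:
        # The API call failed and the helper returned None
        return None
    # Trim whitespace and any surrounding quotes the model may add
    value = raw.strip().strip("\"'").strip()
    # Treat an empty reply or "none"/"None." as a missing property
    if value.lower().rstrip(".") in {"", "none"}:
        return None
    return value


print(parse_property("LO-phonon\n"))
print(parse_property(" None. "))
```

Applied to the cell above, `parse_property(result)` would leave `LO-phonon` unchanged.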
