Code for exploratory data analysis (EDA) and for embedding columns of the dataset

In [None]:
import pandas as pd 
import numpy as np
import re



In [27]:
df = pd.read_csv('Data/DL1(5663).csv')

The Adverse Reactions section includes clinical trial data. This code extracts everything after the '6.1 Clinical Trials Experience' heading and places it in a separate column.

In [28]:
def split_clinical_trials(text):
    if isinstance(text, str) and '6.1 Clinical Trials Experience' in text:
        parts = text.split('6.1 Clinical Trials Experience', 1)
        return parts[0].strip(), parts[1].strip()  
    return text, None 

In [29]:
adverse_reactions_cleaned = []
clinical_trials_info = []

for text in df['Adverse Reactions']:
    adverse_part, clinical_part = split_clinical_trials(text)
    adverse_reactions_cleaned.append(adverse_part)
    clinical_trials_info.append(clinical_part)

df['Adverse Reactions'] = adverse_reactions_cleaned
df['Clinical Trials'] = clinical_trials_info
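As a quick sanity check of the splitting behaviour, the sketch below re-declares the function and runs it on made-up label text. The sample strings are illustrative and not taken from the dataset, and `str.partition` is used here as an assumed equivalent of the `split(..., 1)` call above.

```python
MARKER = '6.1 Clinical Trials Experience'

def split_clinical_trials(text):
    """Return (text before the marker, text after it); no marker -> (text, None)."""
    if isinstance(text, str) and MARKER in text:
        before, _, after = text.partition(MARKER)
        return before.strip(), after.strip()
    return text, None

# Hypothetical sample values, for illustration only
with_section = '6 ADVERSE REACTIONS Edema. 6.1 Clinical Trials Experience Because clinical trials vary...'
without_section = 'Stop use and ask a doctor if rash occurs.'

print(split_clinical_trials(with_section))
print(split_clinical_trials(without_section))
print(split_clinical_trials(float('nan')))  # non-string values pass through unchanged
```

Rows without the marker keep their full text and get `None` in the new column, which is why most over-the-counter products end up with an empty Clinical Trials field.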

In [30]:
df.drop(columns=['NDC', 'Brand Name', 'Clinical Studies'], inplace=True)
df.replace("Unknown", "", inplace=True)
df.replace(np.nan, "", inplace=True)
df.replace("None", "", inplace=True)

In [31]:
df.head(5)

Unnamed: 0,Generic Name,Indications,Purpose,Contraindications,Warnings,Boxed Warning,Adverse Reactions,Use in Specific Populations,Dosage and Administration,Clinical Trials
0,"ACETAMINOPHEN, GUAIFENESIN, AND PHENYLEPHRINE ...",Uses temporarily relieves these symptoms assoc...,,,,,,,Directions do not take more than directed (see...,
1,AMLODIPINE AND OLMESARTAN MEDOXOMIL,1 INDICATIONS AND USAGE Amlodipine and olmesar...,11 DESCRIPTION Amlodipine and olmesartan medox...,4 CONTRAINDICATIONS Do not co-administer alisk...,,WARNING: FETAL TOXICITY When pregnancy is dete...,6 ADVERSE REACTIONS Most common adverse reacti...,8 USE IN SPECIFIC POPULATIONS Lactation: Breas...,2 DOSAGE AND ADMINISTRATION The usual starting...,Because clinical studies are conducted under w...
2,"ACETAMINOPHEN, DEXTROMETHORPHAN HBR, DOXYLAMIN...",Uses temporarily relieves these symptoms due t...,,,,,,,Directions do not take more than the recommend...,
3,"OCTINOXATE, TITANIUM DIOXIDE",Helps prevent sunburn,,,,,,,Apply liberally 15 minutes before sun exposure...,
4,"ACONITED7, AGARICUS MUSD6, ANACARDIUMD7, GELSE...","Uses Temporarily relieves anxiousness, poor se...",,,,,,,How to Use Mix with half a cup of water 3-4 ti...,


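Each symptom listed in Adverse Reactions is followed by a cross-reference such as [see Warnings and Precautions]. This code removes those references for clarity. Note that the pattern also strips any text in parentheses or braces.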

In [32]:
def remove_brackets(text):
    if pd.isna(text) or text == "":
        return text  
    return re.sub(r"\[.*?\]|\(.*?\)|\{.*?\}", "", text).strip()  

df["Adverse Reactions"] = df["Adverse Reactions"].apply(remove_brackets)
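Because the pattern above also deletes parenthesised content (which can include figures such as patient counts), a narrower variant may be preferable. The following is a minimal sketch under that assumption: `remove_cross_references` and `CROSS_REF` are hypothetical names, the sample sentence is invented, and the extra whitespace clean-up is an addition not present in the original cell.

```python
import re

# Matches only bracketed cross-references such as "[see Warnings and Precautions (5.1)]"
CROSS_REF = re.compile(r"\s*\[\s*see\b[^\]]*\]", flags=re.IGNORECASE)

def remove_cross_references(text):
    """Drop '[see ...]' references and tidy the spacing left behind."""
    if not isinstance(text, str) or text == "":
        return text
    cleaned = CROSS_REF.sub("", text)
    cleaned = re.sub(r"\s+([.,;:])", r"\1", cleaned)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", cleaned).strip()    # collapse repeated whitespace

# Hypothetical sample text, for illustration only
sample = "Most common adverse reaction (incidence >=3%) is edema [see Warnings and Precautions (5.1)]. Report to FDA."
print(remove_cross_references(sample))
```

The parenthesised incidence figure survives here, whereas the broader pattern would remove it along with the cross-reference.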

Each field still begins with its section number and section name (for example, "6 ADVERSE REACTIONS"). This code removes that prefix for clarity.

In [33]:
def clean_column_text(text, column_name):
    if text: 
        column_name_regex = re.escape(column_name) 
        cleaned_text = re.sub(r"^\d+\s+" + column_name_regex + r"\s*", "", text, count=1)
        return cleaned_text
    return text
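To illustrate what the prefix pattern does and does not match, here is a small standalone check. The function is re-declared so the snippet runs on its own, and both input strings are made up for the example.

```python
import re

def clean_column_text(text, column_name):
    """Strip a leading '<number> <SECTION NAME>' prefix, if present."""
    if text:
        pattern = r"^\d+\s+" + re.escape(column_name) + r"\s*"
        return re.sub(pattern, "", text, count=1)
    return text

# Hypothetical inputs, for illustration only
numbered = "6 ADVERSE REACTIONS Most common adverse reaction is edema."
unnumbered = "Uses temporarily relieves cough"

print(clean_column_text(numbered, "ADVERSE REACTIONS"))
print(clean_column_text(unnumbered, "INDICATIONS AND USAGE"))
```

Only text that starts with a section number followed by the exact section name is changed, so labels that open with a plain "Uses ..." heading are left as they are.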

In [34]:
df["Adverse Reactions"] = df["Adverse Reactions"].apply(lambda x: clean_column_text(x, "ADVERSE REACTIONS"))
df["Indications"] = df["Indications"].apply(lambda x: clean_column_text(x, "INDICATIONS AND USAGE"))
df["Purpose"] = df["Purpose"].apply(lambda x: clean_column_text(x, "DESCRIPTION"))
df["Contraindications"] = df["Contraindications"].apply(lambda x: clean_column_text(x, "CONTRAINDICATIONS"))
df["Warnings"] = df["Warnings"].apply(lambda x: clean_column_text(x, "WARNINGS"))
df["Use in Specific Populations"] = df["Use in Specific Populations"].apply(lambda x: clean_column_text(x, "USE IN SPECIFIC POPULATIONS"))
df["Dosage and Administration"] = df["Dosage and Administration"].apply(lambda x: clean_column_text(x, "DOSAGE AND ADMINISTRATION"))
df.head(5)

Unnamed: 0,Generic Name,Indications,Purpose,Contraindications,Warnings,Boxed Warning,Adverse Reactions,Use in Specific Populations,Dosage and Administration,Clinical Trials
0,"ACETAMINOPHEN, GUAIFENESIN, AND PHENYLEPHRINE ...",Uses temporarily relieves these symptoms assoc...,,,,,,,Directions do not take more than directed (see...,
1,AMLODIPINE AND OLMESARTAN MEDOXOMIL,Amlodipine and olmesartan medoxomil tablets ar...,Amlodipine and olmesartan medoxomil provided a...,Do not co-administer aliskiren with amlodipine...,,WARNING: FETAL TOXICITY When pregnancy is dete...,Most common adverse reaction is edema . To r...,Lactation: Breastfeeding is not recommended ( ...,The usual starting dose of amlodipine and olme...,Because clinical studies are conducted under w...
2,"ACETAMINOPHEN, DEXTROMETHORPHAN HBR, DOXYLAMIN...",Uses temporarily relieves these symptoms due t...,,,,,,,Directions do not take more than the recommend...,
3,"OCTINOXATE, TITANIUM DIOXIDE",Helps prevent sunburn,,,,,,,Apply liberally 15 minutes before sun exposure...,
4,"ACONITED7, AGARICUS MUSD6, ANACARDIUMD7, GELSE...","Uses Temporarily relieves anxiousness, poor se...",,,,,,,How to Use Mix with half a cup of water 3-4 ti...,


The idea is that, depending on the user query, different context is given to the model, which works around the model's token-limit constraints.

For example, if the query asks about specific populations for whom the medication may be unsuitable (e.g. a medicine that shouldn't be used during pregnancy), we can return context about that.

Otherwise, if the query is more general, we can return a general summary based on the uses, warnings, and reactions.

The categories are: summary, clinical trials, use in specific populations, and dosage and administration.
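One way the routing described above could work is a simple keyword lookup that maps a query to one of the four categories. This is only a sketch: `route_query`, the `ROUTES` table, and the keyword lists are all hypothetical, and a real system would likely need something more robust than substring matching.

```python
# Hypothetical keyword lists; a real system would likely need a richer classifier
ROUTES = {
    "population": ("pregnan", "breastfeed", "lactat", "pediatric", "children", "elderly", "geriatric"),
    "dosage": ("dose", "dosage", "how much", "how often", "administ"),
    "clinical": ("clinical", "trial", "study", "studies"),
}

def route_query(query: str) -> str:
    """Return the first category whose keywords appear in the query, else 'summary'."""
    lowered = query.lower()
    for doc_type, keywords in ROUTES.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return "summary"

print(route_query("Can I take amlodipine while pregnant?"))
print(route_query("What is acetaminophen used for?"))
```

The returned label could then be used to restrict retrieval to one category of text, so that only the relevant slice of a drug label is passed to the model.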

In [35]:
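def process_column(df, column_name):
    # Prefix each non-empty value with the Generic Name from the same row.
    # Working row by row avoids looking the name up by matching on the text,
    # which returns the first match (possibly the wrong drug) for duplicated text.
    return df.apply(
        lambda row: f"Name: {row['Generic Name']} | {column_name}: {row[column_name]}"
        if pd.notna(row[column_name]) and row[column_name] != ""
        else row[column_name],
        axis=1,
    )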

In [None]:
df['Summary'] = 'Summary of ' + df['Generic Name'] + ' | Uses: ' + df['Indications'] + ' and ' + df['Purpose'] + ' | Warnings: ' + df['Boxed Warning'] + ' and ' + df['Warnings'] + ' | Reactions: ' + df['Adverse Reactions']
df['Clinical'] = process_column(df, 'Clinical Trials')
df['Specific Populations'] = process_column(df, 'Use in Specific Populations')
df['Dosage'] = process_column(df, 'Dosage and Administration')
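The plain string concatenation used for `Summary` leaves dangling separators when a field is empty (for example ` and  |  and  | Reactions: `). A possible alternative is sketched below; `build_summary` is a hypothetical helper, the row is invented, and the sketch assumes every field is a string (as it is after the empty-string replacements earlier).

```python
def build_summary(row: dict) -> str:
    """Assemble a summary string, skipping fields that are empty."""
    def join_nonempty(*keys):
        return " and ".join(row[k].strip() for k in keys if row.get(k, "").strip())

    sections = [
        ("Uses", join_nonempty("Indications", "Purpose")),
        ("Warnings", join_nonempty("Boxed Warning", "Warnings")),
        ("Reactions", join_nonempty("Adverse Reactions")),
    ]
    body = " | ".join(f"{label}: {text}" for label, text in sections if text)
    header = f"Summary of {row['Generic Name']}"
    return f"{header} | {body}" if body else header

# Hypothetical row, for illustration only
row = {"Generic Name": "ACETAMINOPHEN", "Indications": "Uses temporarily reduces fever",
       "Purpose": "", "Boxed Warning": "", "Warnings": "", "Adverse Reactions": ""}
print(build_summary(row))
```

It could be applied to the frame with something like `df.apply(lambda r: build_summary(r.to_dict()), axis=1)`, which keeps empty sections out of the text that is later embedded.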

In [37]:
vectordf = df[['Generic Name', 'Summary', 'Clinical', 'Specific Populations', 'Dosage']]
vectordf.head(5)

Unnamed: 0,Generic Name,Summary,Clinical,Specific Populations,Dosage
0,"ACETAMINOPHEN, GUAIFENESIN, AND PHENYLEPHRINE ...","Summary of ACETAMINOPHEN, GUAIFENESIN, AND PHE...",,,"Name: ACETAMINOPHEN, GUAIFENESIN, AND PHENYLEP..."
1,AMLODIPINE AND OLMESARTAN MEDOXOMIL,Summary of AMLODIPINE AND OLMESARTAN MEDOXOMIL...,Name: AMLODIPINE AND OLMESARTAN MEDOXOMIL | Cl...,Name: AMLODIPINE AND OLMESARTAN MEDOXOMIL | Us...,Name: AMLODIPINE AND OLMESARTAN MEDOXOMIL | Do...
2,"ACETAMINOPHEN, DEXTROMETHORPHAN HBR, DOXYLAMIN...","Summary of ACETAMINOPHEN, DEXTROMETHORPHAN HBR...",,,"Name: ACETAMINOPHEN, DEXTROMETHORPHAN HBR, DOX..."
3,"OCTINOXATE, TITANIUM DIOXIDE","Summary of OCTINOXATE, TITANIUM DIOXIDE | Uses...",,,"Name: OCTINOXATE, TITANIUM DIOXIDE | Dosage an..."
4,"ACONITED7, AGARICUS MUSD6, ANACARDIUMD7, GELSE...","Summary of ACONITED7, AGARICUS MUSD6, ANACARDI...",,,"Name: ACONITED7, AGARICUS MUSD6, ANACARDIUMD7,..."


In [38]:
vectordf.to_csv('Data/vectordb.csv', index=False)  # same path that the embedding step reads from

The embeddings, the FAISS index, and its metadata are generated here

In [26]:
import os

# Hide GPUs before importing libraries that may initialize CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import faiss
import numpy as np
import pandas as pd
from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5',
                                      query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                                      use_fp16=False,  # half precision targets GPU inference; this run is CPU-only
                                      devices='cpu')

In [2]:
df = pd.read_csv('Data/vectordb.csv')

In [3]:
summary_embeddings = model.encode(df['Summary'].to_list()) 

pre tokenize: 100%|██████████| 23/23 [00:02<00:00,  9.20it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 23/23 [22:40<00:00, 59.15s/it] 


In [7]:
clinical_text = df['Clinical'].dropna().astype(str).tolist()
clinical_embeddings = model.encode(clinical_text)

pre tokenize: 100%|██████████| 4/4 [00:01<00:00,  2.37it/s]
Inference Embeddings: 100%|██████████| 4/4 [08:46<00:00, 131.70s/it]


In [8]:
pop_text = df['Specific Populations'].dropna().astype(str).tolist()
pop_embeddings =  model.encode(pop_text)

pre tokenize: 100%|██████████| 6/6 [00:02<00:00,  2.87it/s]
Inference Embeddings: 100%|██████████| 6/6 [11:53<00:00, 118.85s/it]


In [9]:
dosage_text = df['Dosage'].dropna().astype(str).tolist()
dosage_embeddings = model.encode(dosage_text)

pre tokenize: 100%|██████████| 22/22 [00:02<00:00,  9.45it/s]
Inference Embeddings: 100%|██████████| 22/22 [25:23<00:00, 69.27s/it]


In [10]:
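# Stack in a fixed order (summary, clinical, population, dosage); FAISS expects float32 input
embeddings = np.vstack([summary_embeddings, clinical_embeddings, pop_embeddings, dosage_embeddings]).astype('float32')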
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, 'faiss_index.idx')

In [11]:
metadata = [{"type": "summary", "category": "general"} for _ in range(len(summary_embeddings))] + \
    [{"type": "clinical", "category": "clinical data"} for _ in range(len(clinical_embeddings))] + \
    [{"type": "population", "category": "specific population usage"} for _ in range(len(pop_embeddings))] + \
    [{"type": "dosage", "category": "dosage and administration"} for _ in range(len(dosage_embeddings))]
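The metadata list only works because its order mirrors the order in which the embedding blocks were stacked: row `i` of the index corresponds to `metadata[i]`. The sketch below illustrates that positional alignment with toy vectors. It uses a brute-force NumPy search as a stand-in for `faiss.IndexFlatL2` (which also performs exact L2 search); `search` and the toy arrays are assumptions for the example.

```python
import numpy as np

# Toy 2-D vectors standing in for the real model output
summary_vecs = np.array([[1.0, 0.0], [0.9, 0.1]], dtype="float32")
dosage_vecs = np.array([[0.0, 1.0]], dtype="float32")

# Same construction order for vectors and metadata keeps them aligned
vectors = np.vstack([summary_vecs, dosage_vecs])
metadata = [{"type": "summary"}] * len(summary_vecs) + [{"type": "dosage"}] * len(dosage_vecs)
assert len(metadata) == len(vectors)  # one metadata entry per indexed row

def search(query_vec, k=2):
    """Exact nearest neighbours by squared L2 distance, returned with their metadata type."""
    distances = ((vectors - query_vec) ** 2).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    return [(int(i), metadata[int(i)]["type"]) for i in nearest]

print(search(np.array([0.0, 0.9], dtype="float32")))
```

If the stacking order and the metadata order ever diverge, or if `dropna()` removes rows from one list but not the other, results will be labelled with the wrong type, so a length check like the one above is a cheap safeguard.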

In [None]:
from langchain.vectorstores import FAISS




In [None]:
text_embeddings = [
    (text, embedding)
    for text, embedding in zip(df['Summary'].to_list() + clinical_text + pop_text + dosage_text, embeddings)
]

vector_store = FAISS.from_embeddings(embedding=model, metadatas=metadata, text_embeddings=text_embeddings)
vector_store.save_local('my_vector_store')

Testing the vector DB and vector store

In [34]:
# Load the FAISS index and vector store
vector_store = FAISS.load_local('my_vector_store', embeddings=model, allow_dangerous_deserialization=True)

# Querying the vector store (find nearest neighbors)
query = "ACETAMINOPHEN uses"
embedded_query = model.encode([query])
results = vector_store.similarity_search_by_vector(embedded_query[0], k=3)
results

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


[Document(id='aee4e363-399e-4746-b59e-3a22251d7d02', metadata={'type': 'summary', 'category': 'general'}, page_content='Summary of ACETAMINOPHEN | Uses: Uses For the temporary relief of minor aches and pains associated with • headache • toothache • minor arthritis pain • muscular aches • common cold • menstrual cramps For the reduction of fever. and  |  and  | Reactions: '),
 Document(id='3fc85fc6-8aed-42e2-8689-745f01a9c628', metadata={'type': 'summary', 'category': 'general'}, page_content='Summary of ACETAMINOPHEN | Uses: Uses temporarily: • reduces fever • relieves minor aches and pains due to: • the common cold • flu • headache • sore throat • toothache and  |  and  | Reactions: '),
 Document(id='1f6688c3-daf1-49f7-982a-91cee986d21c', metadata={'type': 'summary', 'category': 'general'}, page_content='Summary of ACETAMINOPHEN | Uses: Uses temporarily: • reduces fever • relieves minor aches and pains due to: • the common cold • flu • headache • sore throat • toothache and  |  and  |

In [None]:
query = "AMLODIPINE clinical"
embedded_query = model.encode([query])
results = vector_store.similarity_search_by_vector(embedded_query[0], k=3)
results

[Document(id='a2d4f83f-3f96-4ad4-9c8f-22a5107aa038', metadata={'type': 'clinical', 'category': 'clinical data'}, page_content='Name: AMLODIPINE BESYLATE | Clinical Trials: Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in the clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice. Amlodipine has been evaluated for safety in more than 11,000 patients in U.S. and foreign clinical trials. In general, treatment with amlodipine was well-tolerated at doses up to 10 mg daily. Most adverse reactions reported during therapy with amlodipine were of mild or moderate severity. In controlled clinical trials directly comparing amlodipine (N=1730) at doses up to 10 mg to placebo (N=1250), discontinuation of amlodipine because of adverse reactions was required in only about 1.5% of patients and was not significantly different from placebo (about 1%).


In [None]:
query = "ACETAMINOPHEN uses"
embed_query = model.encode([query])

results = vector_store.similarity_search_by_vector(
    embed_query[0], 
    k=10, 
    filter={"type": "clinical"},  # "clinical" is stored under "type"; its "category" value is "clinical data"
)