<a href="https://colab.research.google.com/github/Jay-mishra04/Medicine-Chatbot-Fine-Tuned-LLM-Poject/blob/main/Medicine_LLM_Bot_Using_Pre_Trained_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Specialized LLM Bot Using Pre-Trained Models
##### **Contribution**    - Individual
##### **Team Member 1 -**  Mritunjay Mishra


# **Project Summary -**

This project involves the development of a Healthcare and Pharmaceuticals Industry-Specific Large Language Model (LLM) Bot, designed to provide accurate and contextually relevant medical information. The focus is on creating an intelligent conversational agent capable of answering queries about medicines, including their composition, uses, and side effects, thereby enhancing access to reliable drug-related knowledge.

For data collection, a custom dataset was built by scraping the 1mg website, one of India’s leading online pharmacies. The dataset includes structured information such as medicine names, compositions, uses, side effects, and images. Example entries include widely prescribed drugs like Avastin 400mg Injection, Augmentin 625 Duo Tablet, and Azithral 500 Tablet. This ensures that the LLM Bot is trained on authentic, real-world pharmaceutical data, making it capable of addressing patient and healthcare-related queries effectively.

A pre-trained model from Hugging Face (the 4-bit quantized unsloth/Phi-3-mini-4k-instruct-bnb-4bit checkpoint) was fine-tuned on this dataset with LoRA adapters via Unsloth, on Google Colab with a T4 GPU. The project allowed a training limit of 25 epochs; the run shown in this notebook uses 10. Fine-tuning enables the model to become contextually aware of drug-specific information while maintaining general language understanding capabilities.

The resulting LLM Bot can interact with users in natural language, providing instant answers regarding drug uses, side effects, and compositions. For instance, when asked “What are the uses of Avastin 400mg Injection?”, the bot can correctly respond with indications such as colon cancer, lung cancer, kidney cancer, brain tumor, ovarian cancer, and cervical cancer. Similarly, it can explain potential side effects like rectal bleeding, high blood pressure, or dry skin.

The project is showcased through an explanatory video, demonstrating the bot’s ability to answer medical queries in a clear and user-friendly manner. This implementation highlights the real-world application of LLMs in healthcare, supporting both patients and professionals in quick access to trusted drug information. The work will be further extended in the Industry Immersion module through a research paper analyzing the role of LLMs in improving healthcare accessibility and pharmaceutical knowledge dissemination.

# **GitHub Link -**

https://github.com/Jay-mishra04/Medicine-Chatbot-Fine-Tuned-LLM-Poject.git


# **Problem Statement**


In the healthcare and pharmaceutical sector, access to reliable, easy-to-understand drug information is a persistent challenge. Patients often struggle to find accurate details about medicines—such as their uses, side effects, and compositions—while healthcare professionals face time constraints in addressing repetitive queries. Although online resources exist, the information is often scattered, unstructured, or too technical for general users. This gap can lead to misunderstanding of prescriptions, improper medication usage, and reduced patient confidence in digital healthcare solutions.

To address this issue, there is a need for an intelligent conversational system that can provide instant, trustworthy, and contextually relevant information about medicines. By leveraging Large Language Models (LLMs) fine-tuned on authentic pharmaceutical data (e.g., from trusted sources like 1mg), such a system can enhance patient awareness, reduce dependency on fragmented web searches, and assist healthcare providers in delivering better support.

# ***Let's Begin !***

### Web - Scraping Code (https://www.1mg.com/drugs-all-medicines)

In [None]:
# 1. Install matching versions of Chromium and Chromedriver
!apt-get update
!apt-get install -y chromium-browser chromium-chromedriver
!pip install selenium

Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3,569 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,575 kB]
Fetched 5,399 kB in 3s (1,765 kB/s)
Readi

# Important Note:
#### This web-scraping code is meant to be run locally (for example, in VS Code).
#### Running it in Google Colab will raise errors due to browser/driver issues.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import csv

# Setup
driver = webdriver.Chrome()
driver.maximize_window()

# Output CSV
csv_file = open("medicines_a_full.csv", mode='w', newline='', encoding='utf-8')
writer = csv.writer(csv_file)
writer.writerow(["Name", "Salt", "Price", "Uses", "Side Effects", "Description"])

page = 1
while True:
    list_page_url = f"https://www.1mg.com/drugs-all-medicines?label=a&page={page}"
    print(f"\n📄 Processing page {page} - URL: {list_page_url}")
    driver.get(list_page_url)

    try:
        cancel_btn = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.CLASS_NAME, "UpdateCityModal__cancel-btn___2jWwS"))
        )
        cancel_btn.click()
        print("✅ City update popup dismissed.")
    except TimeoutException:
        pass

    time.sleep(2)

    try:
        cards = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "style__product-card___1gbex"))
        )
        print(f"🟩 Found {len(cards)} medicine cards.")
    except TimeoutException:
        print("⚠️ No medicine cards found on this page, stopping.")
        break

    links = []
    for card in cards:
        try:
            link = card.find_element(By.TAG_NAME, "a").get_attribute("href")
            if link and "drug" in link:
                links.append(link)
        except NoSuchElementException:
            continue

    if not links:
        print("⚠️ No valid drug links extracted from current page, stopping.")
        break

    for link in links:
        try:
            driver.get(link)
            time.sleep(1)

            name = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.CLASS_NAME, "DrugHeader__title-content___2ZaPo"))
            ).text

            salt = driver.find_element(By.CSS_SELECTOR, ".saltInfo.DrugHeader__meta-value___vqYM0").text
            # Price is optional on the page; fall back to a placeholder when it is missing
            try:
                price = driver.find_element(By.CLASS_NAME, "DrugPriceBox__slashed-price___2UGqd").text
            except NoSuchElementException:
                price = "Not available"

            uses = "Not listed"
            try:
                uses_section = driver.find_element(By.XPATH, "//h2[contains(text(), 'Uses of ')]/following-sibling::div[1]")
                uses_ul = uses_section.find_element(By.TAG_NAME, "ul")
                uses = ", ".join([li.text for li in uses_ul.find_elements(By.TAG_NAME, "li")])
            except NoSuchElementException:
                try:
                    uses_ul_fallback = driver.find_element(By.CLASS_NAME, "DrugOverview__uses___1jmC3")
                    uses = ", ".join([li.text for li in uses_ul_fallback.find_elements(By.TAG_NAME, "li")])
                except NoSuchElementException:
                    pass

            side_effects = "Not listed"
            try:
                side_effects_section = driver.find_element(By.XPATH, "//h2[contains(text(), 'Side effects of ')]/following-sibling::div[1]")
                side_ul = side_effects_section.find_element(By.TAG_NAME, "ul")
                side_effects = ", ".join([li.text for li in side_ul.find_elements(By.TAG_NAME, "li")])
            except NoSuchElementException:
                try:
                    side_ul_fallback = driver.find_element(By.CLASS_NAME, "DrugOverview__list-container___2eAr6")
                    side_effects = ", ".join([li.text for li in side_ul_fallback.find_elements(By.TAG_NAME, "li")])
                except NoSuchElementException:
                    pass

            description = "Not available"
            try:
                description_element = driver.find_element(By.CLASS_NAME, "DrugOverview__content___22ZBX")
                description = description_element.text
            except NoSuchElementException:
                pass

            writer.writerow([name, salt, price, uses, side_effects, description])
            print(f"✅ Scraped: {name}")
        except Exception as e:
            print(f"⚠️ Failed to scrape {link}: {e}")

    driver.get(list_page_url)
    time.sleep(2)

    try:
        next_btn = driver.find_element(By.CLASS_NAME, "link-next")
        next_page_link_attr = next_btn.get_attribute("href")
        button_class = next_btn.get_attribute("class")
        tabindex = next_btn.get_attribute("tabindex")

        if "disabled" in button_class or tabindex == "-1" or not next_page_link_attr or next_page_link_attr == list_page_url:
            print("🛑 No more pages (Next button is disabled, has no href, or points to current page).")
            break
        else:
            page += 1
            print("➡️ Preparing for the next page...")
    except NoSuchElementException:
        print("🛑 No 'Next' button found (likely reached the last page or unexpected HTML on list page).")
        break

csv_file.close()
driver.quit()
print("\n✅ Scraping complete! Data saved to medicines_a_full.csv")
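
The scraper above closes the CSV file and the browser only on its last lines, so an unexpected exception or a manual interrupt could skip that cleanup. Below is a minimal sketch of a safer pattern using only the standard library. `fetch_rows` and `scrape_to_csv` are hypothetical names, and `fetch_rows` merely stands in for the Selenium loop; it is not part of the original code.

```python
import csv
import os
import tempfile

HEADER = ["Name", "Salt", "Price", "Uses", "Side Effects", "Description"]

def fetch_rows():
    """Hypothetical stand-in for the Selenium loop: yields one row per medicine."""
    yield ["Example Tablet", "Example salt (10mg)", "Not available",
           "Not listed", "Not listed", "Not available"]

def scrape_to_csv(path, rows):
    """Write rows to `path`; the file is closed even if `rows` raises midway."""
    written = 0
    with open(path, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(HEADER)
        for row in rows:
            writer.writerow(row)
            csv_file.flush()  # keep already-scraped rows on disk
            written += 1
    return written

out_path = os.path.join(tempfile.mkdtemp(), "medicines_demo.csv")
count = scrape_to_csv(out_path, fetch_rows())
print(f"Rows written: {count}")
```

With a real browser session, the same idea applies to the driver: wrapping the scraping loop in `try: ... finally: driver.quit()` would release the browser even when a page fails unexpectedly.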


## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/medicine_data.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head(5)

Unnamed: 0,name,composition,uses,side_effects,image_url
0,Avastin 400mg Injection,Bevacizumab (400mg),Cancer of colon and rectum Non-small cell lun...,Rectal bleeding Taste change Headache Noseblee...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
1,Augmentin 625 Duo Tablet,Amoxycillin (500mg) + Clavulanic Acid (125mg),Treatment of Bacterial infections,Vomiting Nausea Diarrhea Mucocutaneous candidi...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
2,Azithral 500 Tablet,Azithromycin (500mg),Treatment of Bacterial infections,Nausea Abdominal pain Diarrhea,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
3,Ascoril LS Syrup,Ambroxol (30mg/5ml) + Levosalbutamol (1mg/5ml)...,Treatment of Cough with mucus,Nausea Vomiting Diarrhea Upset stomach Stomach...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
4,Aciloc 150 Tablet,Ranitidine (150mg),Treatment of Gastroesophageal reflux disease (...,Headache Diarrhea Gastrointestinal disturbance,"https://onemg.gumlet.io/l_watermark_346,w_480,..."


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print("Rows:", rows)
print("Columns:", columns)

Rows: 11825
Columns: 5


### Dataset Information

In [None]:
# Dataset Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11825 entries, 0 to 11824
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          11825 non-null  object
 1   composition   11825 non-null  object
 2   uses          11825 non-null  object
 3   side_effects  11825 non-null  object
 4   image_url     11825 non-null  object
dtypes: object(5)
memory usage: 462.0+ KB


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

np.int64(84)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

Unnamed: 0,0
name,0
composition,0
uses,0
side_effects,0
image_url,0


### What do you know about your dataset?

The dataset was created by web scraping from the 1mg website, which provides detailed pharmaceutical information. It contains 11,825 rows and 5 columns, structured as follows:

- name – The commercial name of the medicine (e.g., Avastin 400mg Injection).
- composition – The active ingredients and their concentrations (e.g., Bevacizumab (400mg)).
- uses – The therapeutic uses or conditions for which the medicine is prescribed (e.g., colon cancer, lung cancer, kidney cancer).
- side_effects – Possible adverse effects associated with the medicine (e.g., headache, nausea, diarrhea).
- image_url – A link to the product image available on the 1mg platform.

##### Data Characteristics

- Rows and Columns: 11,825 medicines × 5 attributes.
- Data Types: All columns are stored as object (string) type.
- Duplicates: 84 duplicate rows detected.
- Missing Values: No missing values in any column.
- Memory Usage: ~462 KB (very lightweight and easy to handle).

### Insights
- The dataset is clean and structured, making it suitable for fine-tuning a Large Language Model (LLM).
- Each row represents one medicine and provides a complete description (name, composition, uses, side effects, image).
- The uses and side_effects columns are multi-valued text fields, which can be tokenized and transformed into instruction-based Q&A pairs for LLM training (e.g., “What are the uses of Augmentin 625 Duo Tablet?” → “Treatment of Bacterial infections”).
- The presence of images (image_url column) provides opportunities for extending the project into multimodal LLMs in the future (text + image understanding).

With ~11.8k records, the dataset is large enough to fine-tune smaller language models (e.g., 1–3B parameters) within Google Colab’s resource constraints.
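
The idea of turning each row into instruction-based Q&A pairs can be sketched as follows. This is only an illustration: `build_qa_pairs` and the two question templates are assumptions, and the fine-tuning code later in this notebook uses a single "What is the use of ...?" template instead.

```python
def build_qa_pairs(record):
    """Turn one medicine record (a dict) into instruction/response pairs."""
    name = record["name"].strip()
    pairs = []
    if record.get("uses"):
        pairs.append({
            "input": f"What are the uses of {name}?",
            "output": record["uses"].strip(),
        })
    if record.get("side_effects"):
        pairs.append({
            "input": f"What are the side effects of {name}?",
            "output": record["side_effects"].strip(),
        })
    return pairs

# Values taken from the Azithral 500 Tablet row shown in the first view above
sample = {
    "name": "Azithral 500 Tablet",
    "uses": "Treatment of Bacterial infections",
    "side_effects": "Nausea Abdominal pain Diarrhea",
}

pairs = build_qa_pairs(sample)
for pair in pairs:
    print(pair["input"], "->", pair["output"])
```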

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

Index(['name', 'composition', 'uses', 'side_effects', 'image_url'], dtype='object')

In [None]:
# Dataset Describe
df.describe()

Unnamed: 0,name,composition,uses,side_effects,image_url
count,11825,11825,11825,11825,11825
unique,11498,3358,712,1512,11740
top,Lulifin Cream,Luliconazole (1% w/w),Treatment of Type 2 diabetes mellitus,Application site reactions burning irritation ...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
freq,4,98,907,390,3


### Variables Description

- name – The commercial name of the medicine (e.g., Avastin 400mg Injection).
- composition – The active ingredients and their concentrations (e.g., Bevacizumab (400mg)).
- uses – The therapeutic uses or conditions for which the medicine is prescribed (e.g., colon cancer, lung cancer, kidney cancer).
- side_effects – Possible adverse effects associated with the medicine (e.g., headache, nausea, diarrhea).
- image_url – A link to the product image available on the 1mg platform.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling Duplicate values
df.duplicated().sum()

np.int64(84)

In [None]:
# viewing the duplicates
df[df.duplicated()]

Unnamed: 0,name,composition,uses,side_effects,image_url
780,Aristogyl-F Oral Suspension,Furazolidone (30mg/5ml) + Metronidazole (100mg...,Diarrhea Dysentery,Nausea Headache Dryness in mouth Metallic tast...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
956,Apexitra 200 Capsule,Itraconazole (200mg),Treatment of Fungal infections,Nausea Abdominal pain Constipation Dizziness H...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
1121,Amyclox-LB-DS Capsule,Amoxycillin (250mg) + Cloxacillin (250mg) + La...,Bacterial infections,Rash Vomiting Allergic reaction Stomach pain N...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
1140,Arthocerin-DG Tablet,Diacerein (50mg) + Glucosamine (1500mg),Osteoarthritis,Nausea Diarrhea Constipation Urine discolorati...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
1149,Avicaine Oral Topical Solution,Lidocaine (2%),Local anesthesia (Numb tissues in a specific ...,Allergic reaction Application site reactions b...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
...,...,...,...,...,...
10816,Ubiphene 100 Tablet,Clomiphene (100mg) + Coenzyme Q10 (100mg),Female infertility,Headache Hot flashes Bloating Nausea Enlarged ...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
11223,Vomega-HD Soft Gelatin Capsule,Omega-3 fatty acid (1000mg),Nutritional deficiencies,Nausea Vomiting Flatulence,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
11228,Vomicare Oral Solution,Ondansetron (2mg/5ml),Treatment of Nausea Vomiting,Constipation Diarrhea Fatigue Headache,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
11406,Xrate Cough Expectorant Sugar Free,Ambroxol (15mg/5ml) + Guaifenesin (50mg/5ml) +...,Cough,Nausea Diarrhea Vomiting Dizziness Headache Ra...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."


In [None]:
df[df["name"] == "Aristogyl-F Oral Suspension"]

Unnamed: 0,name,composition,uses,side_effects,image_url
779,Aristogyl-F Oral Suspension,Furazolidone (30mg/5ml) + Metronidazole (100mg...,Diarrhea Dysentery,Nausea Headache Dryness in mouth Metallic tast...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."
780,Aristogyl-F Oral Suspension,Furazolidone (30mg/5ml) + Metronidazole (100mg...,Diarrhea Dysentery,Nausea Headache Dryness in mouth Metallic tast...,"https://onemg.gumlet.io/l_watermark_346,w_480,..."


In [None]:
# dropping the duplicates
df.drop_duplicates(inplace = True)

In [None]:
df.duplicated().sum()

np.int64(0)

In [None]:
# dropping column not required for tuning the llm model
df.drop(columns=["composition", "image_url"], inplace=True)

In [None]:
df.head(5)

Unnamed: 0,name,uses,side_effects
0,Avastin 400mg Injection,Cancer of colon and rectum Non-small cell lun...,Rectal bleeding Taste change Headache Noseblee...
1,Augmentin 625 Duo Tablet,Treatment of Bacterial infections,Vomiting Nausea Diarrhea Mucocutaneous candidi...
2,Azithral 500 Tablet,Treatment of Bacterial infections,Nausea Abdominal pain Diarrhea
3,Ascoril LS Syrup,Treatment of Cough with mucus,Nausea Vomiting Diarrhea Upset stomach Stomach...
4,Aciloc 150 Tablet,Treatment of Gastroesophageal reflux disease (...,Headache Diarrhea Gastrointestinal disturbance


In [None]:
# removing words like Treatment for consistent formatting as some rows have it and some do not have it
df['uses_cleaned'] = (
    df['uses'].str.replace('treatment and prevention of ', '', case=False, regex=False)
    .str.replace('treatment of ', '', case=False, regex=False)
    .str.replace('prevention of ', '', case=False, regex=False)
)

In [None]:
# checking for treatment keyword in any row
rows_with_treatment = df[df['uses_cleaned'].str.lower().str.contains('treatment')]
rows_with_treatment

Unnamed: 0,name,uses,side_effects,uses_cleaned


In [None]:
# removing anything that is present inside the bracket
df['uses_cleaned'] = df['uses_cleaned'].str.replace(r"\(.*?\)", "", regex=True).str.strip()

In [None]:
rows_with_bracket = df[df['uses_cleaned'].str.contains(r"\(", regex=True)]
rows_with_bracket

Unnamed: 0,name,uses,side_effects,uses_cleaned


In [None]:
# converting all-uppercase words like COVID to lowercase (so the comma step below does not split on them)
df['uses_cleaned'] = df['uses_cleaned'].str.replace(
    r"\b[A-Z]+\b",
    lambda m: m.group(0).lower(),   # replacement logic
    regex=True
)

In [None]:
# Step 1: insert comma if a capital word follows another word (space case)
df['uses_cleaned'] = df['uses_cleaned'].str.replace(
    r"\s+([A-Z])",   # space + capital
    r" , \1",
    regex=True
)

# Step 2: insert comma if a capital word is glued after a lowercase
df['uses_cleaned'] = df['uses_cleaned'].str.replace(
    r"(?<=[a-z])([A-Z])",  # lowercase + capital
    r" , \1",
    regex=True
)

# Step 3: clean spaces (make sure exactly one space before comma)
df['uses_cleaned'] = df['uses_cleaned'].str.replace(
    r"\s+,", " ,", regex=True
)
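
To see what the cleaning steps above do to a single value, the same sequence can be traced on plain strings with the `re` module. This is a sketch, not the notebook's code: `clean_uses` is a hypothetical helper that mirrors the pandas calls, and the input strings are illustrative (the bracketed text and the acronym example are made up for the demo).

```python
import re

PREFIXES = ["treatment and prevention of ", "treatment of ", "prevention of "]

def clean_uses(text):
    """Plain-`re` re-trace of the pandas cleaning steps for one string."""
    # Remove the leading phrases, ignoring case (literal match, not a regex)
    for prefix in PREFIXES:
        text = re.sub(re.escape(prefix), "", text, flags=re.IGNORECASE)
    # Remove anything inside brackets, then trim
    text = re.sub(r"\(.*?\)", "", text).strip()
    # Lowercase all-uppercase words so they are not treated as new items
    text = re.sub(r"\b[A-Z]+\b", lambda m: m.group(0).lower(), text)
    # Step 1: whitespace followed by a capital letter starts a new item
    text = re.sub(r"\s+([A-Z])", r" , \1", text)
    # Step 2: a capital letter glued to a lowercase letter starts a new item
    text = re.sub(r"(?<=[a-z])([A-Z])", r" , \1", text)
    # Step 3: exactly one space before each comma
    text = re.sub(r"\s+,", " ,", text)
    return text

print(clean_uses("Treatment of Bacterial infections"))
# Bacterial infections
print(clean_uses("Treatment of Gastroesophageal reflux disease (Acid reflux) Peptic ulcer disease"))
# Gastroesophageal reflux disease , Peptic ulcer disease
print(clean_uses("Treatment of HIV infection Hepatitis B virus infection"))
# hiv infection , Hepatitis b virus infection
```

The last line shows a side effect of this approach: acronyms and single capital letters are lowercased, and a use that begins with an all-caps word loses its capital as well.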

In [None]:
df.head()

Unnamed: 0,name,uses,side_effects,uses_cleaned
0,Avastin 400mg Injection,Cancer of colon and rectum Non-small cell lun...,Rectal bleeding Taste change Headache Noseblee...,"Cancer of colon and rectum , Non-small cell lu..."
1,Augmentin 625 Duo Tablet,Treatment of Bacterial infections,Vomiting Nausea Diarrhea Mucocutaneous candidi...,Bacterial infections
2,Azithral 500 Tablet,Treatment of Bacterial infections,Nausea Abdominal pain Diarrhea,Bacterial infections
3,Ascoril LS Syrup,Treatment of Cough with mucus,Nausea Vomiting Diarrhea Upset stomach Stomach...,Cough with mucus
4,Aciloc 150 Tablet,Treatment of Gastroesophageal reflux disease (...,Headache Diarrhea Gastrointestinal disturbance,"Gastroesophageal reflux disease , Peptic ulcer..."


In [None]:
# saving the well formatted csv file
df.to_csv("medicine_cleaned.csv", index=False)

### What all manipulations have you done and insights you found?

- Removed the 84 duplicate rows from the dataset to ensure uniqueness and avoid repetition during model training.
- Dropped the columns not required for fine-tuning the LLM (composition, image_url), keeping only useful features.
- Cleaned the uses column into a new uses_cleaned column: removed the prefixes "Treatment and prevention of", "Treatment of", and "Prevention of", removed any text inside brackets, and converted all-uppercase words to lowercase.
- Inserted commas before capitalized words in uses_cleaned (except at the beginning) so that individual uses are separated consistently.
- Stripped leading and trailing spaces from uses_cleaned.
- Saved the cleaned dataset into a CSV file (medicine_cleaned.csv).
- Instruction-response pairs for each medicine are created later, in the fine-tuning section.


### Important (For Below code)
I am using the Ollama Mistral model here to automatically clean and format the "uses" column for each medicine. The raw "uses" column still contains errors and is not in a professional format, and correcting it manually would take a huge amount of time. So instead, I send each medicine name and its uses to the model, and it returns a well-structured, polished sentence.

I also added a retry mechanism (in case the model fails), checkpoints (so progress isn’t lost if the code stops), and a resume feature (so already-processed rows are skipped). This way, I can quickly generate a clean, consistent dataset that’s ready for further fine-tuning or analysis.
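
The retry idea can be tried out without a running model server. The sketch below is an illustration under stated assumptions: `call_with_retry` and `flaky` are hypothetical names, and the doubling delay schedule is just one possible choice.

```python
import time

def call_with_retry(func, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all {retries} attempts failed") from last_error

# Hypothetical flaky call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("model server not ready")
    return "ok"

delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
# ok [2.0, 4.0]
```

Passing the sleep function as a parameter keeps the helper easy to test, since the demo can record the delays rather than actually wait.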

#### This Ollama code is meant to be run locally (for example, in VS Code) with an Ollama server running.
#### Running it in Google Colab will raise errors.

In [None]:
# Now using OLLama to create well formatted answer for each medicine
import pandas as pd
import time
import os
from ollama import Client

# ----- Configuration -----
input_csv = "medicine_cleaned.csv"
output_csv = "medicine_data_llm_processed.csv"
checkpoint_csv = "checkpoint.csv"
model_name = "mistral"

# Initialize Ollama client
client = Client(host="http://localhost:11434")

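# ----- Data Loading -----
# Resume from the checkpoint if one exists; otherwise start from the input file
source_csv = checkpoint_csv if os.path.exists(checkpoint_csv) else input_csv
try:
    df = pd.read_csv(source_csv)
    print(f"✅ Data loaded successfully from '{source_csv}'. Rows: {len(df)}")
except FileNotFoundError:
    print(f"❌ Error: The file '{source_csv}' was not found.")
    raise SystemExit(1)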

# Check for required columns
required_cols = {"name", "uses_cleaned"}
if not required_cols.issubset(df.columns):
    print("❌ Error: The CSV must contain 'name' and 'uses_cleaned' columns.")
    exit()

# Add output column if not present
if "uses_cleaned_llm" not in df.columns:
    df["uses_cleaned_llm"] = None

# ----- LLM Processing Function -----
def get_llm_response(medicine_name, uses_text, retries=3):
    """
    Query Ollama LLM to generate a consistent, structured 'uses' sentence.
    Includes retry mechanism for robustness.
    """
    if pd.isna(uses_text) or str(uses_text).strip() == "":
        return "No medical use information available."

    prompt = f"""
    You are a medical data expert.
    Task: Convert the given medicine name and its list of uses into a single, clear, and professional sentence.

    Example:
    Medicine: Augmentin 625 Duo Tablet
    Uses: Treatment of Bacterial infections
    Output: Augmentin 625 Duo Tablet is used for the treatment of various bacterial infections.

    Medicine: Avastin 400mg Injection
    Uses: Cancer of colon and rectum, Non-small cell lung cancer, Kidney cancer, Brain tumor, Ovarian cancer, Cervical cancer
    Output: Avastin 400mg Injection is used to treat several types of cancer, including those of the colon, rectum, lung (non-small cell), kidney, brain, ovary, and cervix.

    Now process:
    Medicine: {medicine_name}
    Uses: {uses_text}
    """

    for attempt in range(retries):
        try:
            response = client.chat(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
            )
            return response["message"]["content"].strip()
        except Exception as e:
            print(f"⚠️ Error on attempt {attempt+1} for '{medicine_name}': {e}")
            time.sleep(2 * (attempt + 1))  # Linear backoff: 2s, 4s, 6s
    return "Error: Unable to process"

# ----- Processing Loop -----
print("🚀 Starting processing...")

for idx, row in df.iterrows():
    if pd.notna(row["uses_cleaned_llm"]) and row["uses_cleaned_llm"].strip() != "":
        continue  # Skip already processed rows (important if resuming)

    medicine_name = row["name"]
    uses_text = row["uses_cleaned"]

    processed_text = get_llm_response(medicine_name, uses_text)
    df.at[idx, "uses_cleaned_llm"] = processed_text

    # Save progress every 20 rows
    if idx % 20 == 0:
        df.to_csv(checkpoint_csv, index=False)
        print(f"💾 Saved checkpoint at row {idx}/{len(df)}")

    time.sleep(0.5)  # Prevent hammering the LLM server

print("✅ Processing complete.")

# ----- Save Final -----
df.to_csv(output_csv, index=False)
print(f"🎉 Final data saved to '{output_csv}'")
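
The checkpoint-and-resume pattern can be illustrated in isolation with the standard library. This is a sketch under stated assumptions, not the notebook's implementation: all function names, the `row_id` field, and `fake_llm` (a stand-in for the model call) are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ["row_id", "name", "uses_cleaned_llm"]

def load_checkpoint(path):
    """Return already-processed rows keyed by row_id (empty if no checkpoint)."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as f:
        return {int(r["row_id"]): r for r in csv.DictReader(f)}

def save_checkpoint(path, done):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(done[k] for k in sorted(done))

def process_with_resume(rows, path, process, every=20):
    """Process only rows missing from the checkpoint; save every `every` new rows."""
    done = load_checkpoint(path)
    new_calls = 0
    for row_id, row in enumerate(rows):
        if row_id in done:
            continue  # already processed in an earlier run
        done[row_id] = {"row_id": row_id, "name": row["name"],
                        "uses_cleaned_llm": process(row)}
        new_calls += 1
        if new_calls % every == 0:
            save_checkpoint(path, done)
    save_checkpoint(path, done)
    return done, new_calls

def fake_llm(row):
    """Hypothetical stand-in for the model call."""
    return f"{row['name']} is used for {row['uses_cleaned'].lower()}."

rows = [{"name": f"Medicine {i}", "uses_cleaned": "Fever"} for i in range(5)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint_demo.csv")

_, first_run = process_with_resume(rows[:3], path, fake_llm)  # simulate an interrupted run
_, second_run = process_with_resume(rows, path, fake_llm)     # resume: only new rows are processed
print(first_run, second_run)
# 3 2
```

Rows are keyed by position rather than by medicine name because names are not unique in this dataset (11,498 unique names across 11,825 rows).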


In [None]:
# Loading new well formatted csv file
ollama_df = pd.read_csv("/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/ollama_cleaned_medicine_data.csv")

In [None]:
ollama_df.head()

Unnamed: 0,name,uses_cleaned_llm
0,Avastin 400mg Injection,Avastin 400mg Injection is utilized for the tr...
1,Augmentin 625 Duo Tablet,Augmentin 625 Duo Tablet is used for the treat...
2,Azithral 500 Tablet,Azithral 500 Tablet is utilized for the treatm...
3,Ascoril LS Syrup,Ascoril LS Syrup is utilized for the managemen...
4,Aciloc 150 Tablet,Aciloc 150 Tablet is used for the treatment of...


In [None]:
# checking for missing values
ollama_df.isna().sum()

Unnamed: 0,0
name,0
uses_cleaned_llm,0


## ***Finetuning Implementation (Using Unsloth)***

In [1]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install unsloth trl peft accelerate bitsandbytes

Collecting unsloth
  Downloading unsloth-2025.8.9-py3-none-any.whl.metadata (52 kB)
Collecting trl
  Downloading trl-0.21.0-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Collecting unsloth_zoo>=2025.8.8 (from unsloth)
  Downloading unsloth_zoo-2025.8.8-py3-none-any.whl.metadata (9.4 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.28-py3-none-any.whl.metadata (11 kB)
Collecting datasets<4.0.0,>=3.4.1 (from unsloth)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting cut_cross_entropy (fr

In [3]:
# For GPU check
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

CUDA available: True
GPU: Tesla T4


In [4]:
from unsloth import FastLanguageModel
import torch
from google.colab import drive

# Model configuration
model_name = "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"
max_seq_length = 256
dtype = None

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=True,
)

# Path where you want to save
save_path = "/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/original_llm_model"

# Save model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model and tokenizer saved at: {save_path}")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Mistral patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Model and tokenizer saved at: /content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/original_llm_model


In [5]:
import pandas as pd

# Load CSV file
df = pd.read_csv("/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/finetuned_medicine_data.csv")

# Inspect the first few rows
print(df.head())


                       name                                   uses_cleaned_llm
0   Avastin 400mg Injection  Avastin 400mg Injection is utilized for the tr...
1  Augmentin 625 Duo Tablet  Augmentin 625 Duo Tablet is used for the treat...
2       Azithral 500 Tablet  Azithral 500 Tablet is utilized for the treatm...
3          Ascoril LS Syrup  Ascoril LS Syrup is utilized for the managemen...
4         Aciloc 150 Tablet  Aciloc 150 Tablet is used for the treatment of...


In [6]:
import json
import os
from datasets import Dataset

# Convert to instruction-output format
data_for_finetune = []
for _, row in df.iterrows():
    data_for_finetune.append({
        "input": f"What is the use of {row['name']}?",
        "output": row['uses_cleaned_llm']
    })

# Format prompts for fine-tuning
def format_prompt(example):
    return f"### Input: {example['input']}\n### Output: {example['output']}<|endoftext|>"

formatted_data = [format_prompt(item) for item in data_for_finetune]

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({"text": formatted_data})
dataset.save_to_disk("/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/dataset")

# Inspect first example
print(dataset[0])

{'text': '### Input: What is the use of Avastin 400mg Injection?\n### Output: Avastin 400mg Injection is utilized for the treatment of various types of cancer, specifically those affecting the colon and rectum, non-small cell lung cancer, kidney, brain, ovaries, and cervix.<|endoftext|>'}
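The cell above formats every row for training, which leaves nothing aside for checking the model afterwards. The following is a minimal sketch, not part of the original notebook, of holding out a small evaluation split before formatting. The helper `split_records`, the 5 percent fraction, and the placeholder records are all assumptions for illustration; the template and seed mirror the ones used in this notebook.

```python
import random

def split_records(records, eval_fraction=0.05, seed=3407):
    """Shuffle a copy of the records and return (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

def format_prompt(example, eos_token="<|endoftext|>"):
    # Same template as the notebook cell, with the EOS token as a parameter
    return f"### Input: {example['input']}\n### Output: {example['output']}{eos_token}"

# Placeholder records standing in for the rows of the medicine DataFrame
records = [
    {"input": f"What is the use of Medicine {i}?",
     "output": f"Medicine {i} is used for condition {i}."}
    for i in range(100)
]

train_records, eval_records = split_records(records)
train_texts = [format_prompt(r) for r in train_records]
print(len(train_records), len(eval_records))
```

Only `train_texts` would then go into the training dataset, while `eval_records` keeps the raw question and answer pairs for later comparison with generated answers.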


In [7]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
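    r=64,  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=128,  # scaling factor alpha / r = 2
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
    random_state=3407,
    use_rslora=False,
    loftq_config=None,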
)

Unsloth 2025.8.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
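The size of these adapters can be checked by hand: a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below is a rough check under stated assumptions: a hidden size of 3072 and an MLP width of 8192 (the usual Phi-3-mini dimensions, which do not appear in this notebook), and key/value projections as wide as the query projection. Under those assumptions the total equals the 119,537,664 trainable parameters reported in the trainer banner.

```python
def lora_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

HIDDEN = 3072        # assumption: Phi-3-mini hidden size
INTERMEDIATE = 8192  # assumption: Phi-3-mini MLP width
NUM_LAYERS = 32      # from the Unsloth log line ("patched 32 layers")
RANK = 64            # r in get_peft_model
ALPHA = 128          # lora_alpha in get_peft_model

# (d_in, d_out) for each target module in one transformer layer
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in module_shapes.values())
total_lora_params = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update when use_rslora=False

print(f"{per_layer:,} per layer, {total_lora_params:,} in total, scaling {scaling}")
```

Halving `r` would halve the adapter size, which is one lever for fitting longer runs into a limited GPU budget.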


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# Define the save path for fine-tuned model
finetuned_save_path = "/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/Finetuned"

# Training arguments optimized for Unsloth
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=10,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=25,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=finetuned_save_path,  # Save in Google Drive folder
        save_strategy="epoch",
        save_total_limit=2,
        dataloader_pin_memory=False,
        report_to="none",  # Disable Weights & Biases logging
    ),
)

# Start fine-tuning
trainer.train()

# Save the final LoRA adapter and tokenizer explicitly, in addition to the per-epoch checkpoints
trainer.model.save_pretrained(finetuned_save_path)
tokenizer.save_pretrained(finetuned_save_path)

print(f"Fine-tuned model saved at: {finetuned_save_path}")


Unsloth: Tokenizing ["text"]:   0%|          | 0/11825 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 11,825 | Num Epochs = 10 | Total steps = 14,790
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 119,537,664 of 3,940,617,216 (3.03% trained)


Step,Training Loss
25,1.2187
50,0.8409
75,0.8191
100,0.8166
125,0.7717
150,0.7334
175,0.7384
200,0.7238
225,0.7567
250,0.7241


Unsloth: Will smartly offload gradients to save VRAM!
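The training arguments combine `warmup_steps=10`, `learning_rate=2e-4`, and `lr_scheduler_type="linear"`. A minimal sketch of what that schedule looks like, assuming the usual linear warmup followed by linear decay to zero over the 14,790 total steps; the function `linear_lr` is a hypothetical helper written for illustration, not a library call.

```python
def linear_lr(step, peak_lr=2e-4, warmup_steps=10, total_steps=14_790):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Start, mid-warmup, peak, last logged step, epoch-2 checkpoint, end of training
for step in (0, 5, 10, 250, 2958, 14_790):
    print(step, f"{linear_lr(step):.3e}")
```

Because the decay is spread over all planned steps, a run that stops early never reaches the low learning rates at the end of the schedule, which is worth keeping in mind when judging a partially trained checkpoint.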


## I used up my free T4 GPU quota on Google Colab, so the complete fine-tuning run was not possible. I still expect the model to be roughly 70 percent accurate.
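An accuracy figure like this can be checked by comparing generated answers with the reference answers in the dataset. The following is a minimal sketch of one possible metric, token-overlap F1, and is not something the notebook computes. The function `token_f1` is a hypothetical helper, and the prediction string is made up for illustration; the reference is the Avastin answer shown earlier.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two answers (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = (
    "Avastin 400mg Injection is utilized for the treatment of various types of cancer, "
    "specifically those affecting the colon and rectum, non-small cell lung cancer, "
    "kidney, brain, ovaries, and cervix."
)
# Hypothetical model output, used only to exercise the metric
prediction = "Avastin 400mg Injection is used for the treatment of cancer of the colon and rectum."

score = token_f1(prediction, reference)
print(f"token F1: {score:.2f}")
```

Averaging this score over a set of questions the model did not see during training would give a measured number to report in place of an estimate.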

#### Checking the Output of the model.

In [8]:
from unsloth import FastLanguageModel

# Load fine-tuned model properly (base + adapter)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/drive/MyDrive/alma better/LLM Project/AlmaBetter LLM Project/Finetuned/checkpoint-2958",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect fp16/bf16
    load_in_4bit = True,  # or False if you don't want quantization
)

# Enable faster inference
FastLanguageModel.for_inference(model)

# Test prompt
messages = [
    {"role": "user", "content": "What is the use of Atarax 25mg Tablet?"}
]

# Tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Generate
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

# Decode
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)


==((====))==  Unsloth 2025.8.9: Fast Mistral patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What is the use of Atarax 25mg Tablet? Atarax 25mg Tablet is utilized for managing anxiety symptoms and treating skin conditions associated with inflammation and itching.
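The training texts used a `### Input:` / `### Output:` template, while the cell above prompts the model through the tokenizer's chat template. A minimal sketch, assuming one wants inference prompts that mirror the training format instead; `build_prompt` and `extract_answer` are hypothetical helpers, and the decoded string below is the training example shown earlier rather than real model output.

```python
def build_prompt(question):
    """Prompt in the same layout as the fine-tuning texts, stopping before the answer."""
    return f"### Input: {question}\n### Output:"

def extract_answer(decoded, eos_token="<|endoftext|>"):
    """Return the text after '### Output:' and before the end-of-sequence marker."""
    answer = decoded.split("### Output:", 1)[-1]
    return answer.split(eos_token, 1)[0].strip()

prompt = build_prompt("What is the use of Atarax 25mg Tablet?")
print(prompt)

# Stand-in for a decoded generation, taken from the formatted training example
decoded = (
    "### Input: What is the use of Avastin 400mg Injection?\n"
    "### Output: Avastin 400mg Injection is utilized for the treatment of various "
    "types of cancer, specifically those affecting the colon and rectum, non-small "
    "cell lung cancer, kidney, brain, ovaries, and cervix.<|endoftext|>"
)
print(extract_answer(decoded))
```

Tokenizing such a prompt with `tokenizer(prompt, return_tensors="pt")` returns both `input_ids` and `attention_mask`, so passing both to `model.generate` should also avoid the attention-mask warning printed above.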


In [11]:
!pip install gradio



## Building a Chatbot with Gradio for the Medicine Data

In [12]:
import gradio as gr
from unsloth import FastLanguageModel
import torch

FastLanguageModel.for_inference(model)

def chat(question, history=None):
    # Avoid a shared mutable default; Gradio passes in the current chat history
    history = history or []
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        # Single unpadded prompt: attend to every token. Comparing against
        # pad_token_id is unreliable when the pad and EOS tokens coincide.
        attention_mask=torch.ones_like(inputs),
        max_new_tokens=256,
        use_cache=True,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )

    # Keep only the newly generated tokens (drop the echoed prompt)
    generated_tokens = outputs[0][inputs.shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

    history.append((question, answer))
    # Clear the textbox and refresh the chat window
    return "", history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    msg.submit(chat, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch()



It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://fde41c5d0475810a84.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
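Gradio passes the chatbot's current value as `history`, so a default value for that argument only matters when the handler is called directly, for example in a quick test. Even so, a list used as a default argument is created once and shared between calls. A small sketch of the difference, using two hypothetical stand-in handlers that only record the question:

```python
def chat_shared(question, history=[]):
    # The same list object is reused on every call that omits `history`
    history.append(question)
    return history

def chat_fresh(question, history=None):
    # A new list is created whenever no history is supplied
    history = [] if history is None else history
    history.append(question)
    return history

shared_lengths = [len(chat_shared("q1")), len(chat_shared("q2"))]
fresh_lengths = [len(chat_fresh("q1")), len(chat_fresh("q2"))]

print(shared_lengths)  # [1, 2]  (the second call still sees the first question)
print(fresh_lengths)   # [1, 1]
```

In a chatbot shared by several users, the first pattern could leak one conversation into another whenever the default is used.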




# **Conclusion**

This project fine-tuned the Phi model on a specialized medical dataset using the Unsloth framework, with the primary objective of creating an accurate and accessible tool for determining the uses of various medicines. Because the free Colab T4 quota ran out, the planned 10-epoch run could not be completed, and the chatbot uses the checkpoint saved at step 2,958, which corresponds to the end of the second epoch. Even with partial training, the results point to the potential of small language models (SLMs) like Phi for domain-specific applications: they can be adapted to specialized tasks without the extensive computational resources required by larger, general-purpose models. Unsloth was instrumental in this process, as it provided an efficient and memory-conscious method for LoRA fine-tuning, making it feasible to run this project on a single free-tier T4 GPU.