<a href="https://colab.research.google.com/github/Pauullamm/OpenAI_Pill_Checker/blob/main/OpenAI_Pill_Identifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description: Python scripts to fine-tune an OpenAI davinci-002 model as a text-based pill identifier

Have you ever wanted a pill identifier tool to check the name of a tablet or capsule from its description? This Jupyter notebook outlines the steps to fine-tune an OpenAI model for this purpose, using pharmaceutical and manufacturer data from the Electronic Medicines Compendium (EMC).

In [51]:
!pip install --upgrade openai -q

Step 1: Retrieve the URLs of all medicines starting with a particular letter from the Electronic Medicines Compendium

In [122]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from tqdm import tqdm


# Browse page listing all medicines that start with the letter B
letter_B_url = 'https://www.medicines.org.uk/emc/browse-medicines/B'

def get_elements_of_letter(url):
  """Gets the total number of drugs listed under a specific letter
  Args:
    url: URL of the browse page for that letter
  Returns:
    The total number of results as an int
  """
  r = requests.get(url)
  letter_soup = BeautifulSoup(r.text, 'html.parser')
  total_elements = letter_soup.find(class_='latest-updates-results-header-summary-total')
  total_elements = total_elements.text.replace(" ", "")
  total_elements = int(total_elements.replace("resultsfound", ""))
  return total_elements

def get_urls(num, link, show_progress=False):
  """Collects the product links of every tablet or capsule listed under a letter

  Args:
      num (int): total number of items listed under the letter
      link (str): URL of the browse page for that letter
      show_progress (bool): prints the item being processed to the screen, default value is False

  Returns:
      A set with the links for each tablet or capsule
  """
  output_urls = set()
  # Request the listing 50 items at a time
  for i in tqdm(range(1, num + 1, 50)):
    url_to_check = f'{link}?offset={i}&limit=50'
    response = requests.get(url_to_check)
    soup = BeautifulSoup(response.text, 'html.parser')

    url_title_links = soup.find_all(class_="search-results-product-info-title-link emc-link")
    for j in url_title_links:
      # "ablet"/"apsule" match both the capitalised and the lower-case spelling
      if "ablet" in j.text or "apsule" in j.text:
        if show_progress:
          print(f"Processing: {j.text}")
        href = 'https://www.medicines.org.uk/' + j.get('href')
        output_urls.add(href)

  return output_urls


B_urls = get_urls(num=get_elements_of_letter(letter_B_url), link=letter_B_url)


  8%|▊         | 1/12 [00:00<00:07,  1.49it/s]

Processing: 
                    Baclofen 10 mg Tablets
                
Processing: 
                    Baclofen 10mg tablets
                
Processing: 
                    Baclofen Tablets 10mg
                
Processing: 
                    Bambec Tablets 10mg
                
Processing: 
                    Baraclude 0.5 mg film coated tablets
                
Processing: 
                    Baraclude 1 mg film-coated tablets
                
Processing: 
                    Bedranol 160mg SR Capsules
                
Processing: 
                    Bedranol 80mg SR Capsules
                
Processing: 
                    Beechams All-In-One Tablets
                
Processing: 
                    Beechams Max Strength All in One Capsules, hard
                


 17%|█▋        | 2/12 [00:01<00:04,  2.09it/s]

Processing: 
                    Benadryl Allergy Liquid Release 10mg Capsules
                
Processing: 
                    Benadryl Allergy One A Day 10mg Tablets
                
Processing: 
                    Benadryl Allergy Relief Plus Decongestant Capsules
                
Processing: 
                    Bendroflumethiazide 2.5mg Tablets
                
Processing: 
                    Bendroflumethiazide 5mg Tablets
                
Processing: 
                    Benylin Chesty Cough & Cold Tablets
                
Processing: 
                    Benylin Cold & Flu Max Strength Capsules
                
Processing: 
                    Benylin Day & Night Tablets
                
Processing: 
                    Benylin Four Flu tablets
                
Processing: 
                    Benylin Mucus Cough & Cold All in One Relief Tablets
                


 25%|██▌       | 3/12 [00:01<00:05,  1.73it/s]

Processing: 
                    Besavar XL 10mg Tablets
                
Processing: 
                    Betahistine 16 mg tablets
                
Processing: 
                    Betahistine 16 mg Tablets
                
Processing: 
                    Betahistine 8 mg tablets
                
Processing: 
                    Betahistine 8 mg Tablets
                
Processing: 
                    Betahistine Dihydrochloride 16mg Tablets
                
Processing: 
                    Betahistine Dihydrochloride 8mg Tablets
                
Processing: 
                    Betamethasone 500 microgram Soluble Tablets
                
Processing: 
                    Betamethasone 500 microgram Soluble Tablets
                
Processing: 
                    Betmiga 25 mg prolonged-release tablets
                
Processing: 
                    Betmiga 50 mg prolonged-release tablets
                


 33%|███▎      | 4/12 [00:02<00:04,  1.96it/s]

Processing: 
                    Bexarotene 75 mg Soft Capsules
                
Processing: 
                    Bexarotene 75 mg Soft Capsules
                
Processing: 
                    Bicalutamide 150 mg film-coated tablets
                
Processing: 
                    Bicalutamide 50 mg film-coated tablets
                
Processing: 
                    BIJUVE 1mg/100mg Capsules, soft
                
Processing: 
                    Biktarvy 30 mg/120 mg/15 mg film-coated tablets




Processing: 
                    Biktarvy 50 mg/200 mg/25 mg film-coated tablets




Processing: 
                    Bimizza 150 microgram/20 microgram Tablets
                
Processing: 
                    Binosto 70 mg effervescent tablets
                
Processing: 
                    Biquelle XL 150mg prolonged-release tablets
                
Processing: 
                    Biquelle XL 200mg prolonged-release tablets
                
Processing: 
                    Biquelle

 42%|████▏     | 5/12 [00:02<00:03,  1.76it/s]

Processing: 
                    Bisoprolol fumarate 1.25 mg film-coated tablets
                
Processing: 
                    Bisoprolol fumarate 1.25 mg tablets
                
Processing: 
                    Bisoprolol Fumarate 10 mg Film-coated Tablets
                
Processing: 
                    Bisoprolol Fumarate 10 mg Film-coated Tablets
                
Processing: 
                    Bisoprolol Fumarate 10 mg Film-coated Tablets
                
Processing: 
                    Bisoprolol Fumarate 10mg film-coated tablets
                
Processing: 
                    Bisoprolol Fumarate 2.5 mg film-coated tablets
                
Processing: 
                    Bisoprolol Fumarate 2.5 mg Film-coated Tablets
                
Processing: 
                    Bisoprolol Fumarate 2.5 mg Film-coated Tablets
                
Processing: 
                    Bisoprolol fumarate 2.5 mg tablets
                
Processing: 
                    Bisoprolol Fumarate 3.75

 50%|█████     | 6/12 [00:03<00:03,  1.65it/s]

Processing: 
                    Boots Aspirin 300mg Dispersible Tablets
                
Processing: 
                    Boots Aspirin 300mg Tablets
                
Processing: 
                    Boots Aspirin 75 mg Gastro-Resistant Tablets (GSL)
                
Processing: 
                    Boots Aspirin 75 mg Gastro-Resistant Tablets (P)
                
Processing: 
                    Boots Black Cohosh Menopause Relief Tablets
                
Processing: 
                    Boots Blocked Nose & Headache Relief Capsules
                
Processing: 
                    Boots Cold & Flu Relief Capsules
                
Processing: 
                    Boots Cold & Flu Relief Echinacea Effervescent Tablets
                
Processing: 
                    Boots Constipation Relief 5mg Tablets Adult
                
Processing: 
                    Boots Decongestant Tablets
                
Processing: 
                    Boots Decongestant with Pain Relief Tablets
      

 58%|█████▊    | 7/12 [00:04<00:03,  1.57it/s]

Processing: 
                    Boots Heartburn & Indigestion Relief 75mg Tablets (P)
                
Processing: 
                    Boots Heartburn and Acid Reflux Control 20mg Gastro-Resistant Capsules
                
Processing: 
                    Boots Ibuprofen 200 mg Liquid Capsules
                
Processing: 
                    Boots Ibuprofen 200 mg Tablets (16 tablets) (GSL)
                
Processing: 
                    Boots Ibuprofen 200mg Tablets
                
Processing: 
                    Boots Ibuprofen 200mg Tablets (24, 48 or 96 tablets) (P)
                
Processing: 
                    Boots Ibuprofen 200mg Tablets (caplets)
                
Processing: 
                    Boots Ibuprofen 200mg Tablets (P & GSL)
                
Processing: 
                    Boots Ibuprofen 400 mg Tablets
                
Processing: 
                    Boots Ibuprofen and Codeine 200mg/12.8mg Film-Coated Tablets
                
Processing: 
              

 67%|██████▋   | 8/12 [00:04<00:02,  1.54it/s]

Processing: 
                    Boots One-a-Day Allergy Relief 10mg Tablets (P & GSL)
                
Processing: 
                    Boots Paracetamol & Codeine 500mg/8mg Effervescent Tablets
                
Processing: 
                    Boots Paracetamol & Codeine 500mg/8mg Tablets
                
Processing: 
                    Boots Paracetamol & Codeine Tablets
                
Processing: 
                    Boots Paracetamol 500 mg Capsules
                
Processing: 
                    Boots Paracetamol 500 mg Effervescent Tablets
                
Processing: 
                    Boots Paracetamol 500 mg tablets (P)
                
Processing: 
                    Boots Paracetamol 500mg Tablets (16 Caplets)
                
Processing: 
                    Boots Paracetamol 500mg Tablets (16)
                
Processing: 
                    Boots Paracetamol 500mg Tablets (32)
                
Processing: 
                    Boots Paracetamol and Codeine Extra 

 75%|███████▌  | 9/12 [00:05<00:01,  1.51it/s]

Processing: 
                    Bosentan 125 mg Film-coated Tablets
                
Processing: 
                    Bosentan 62.5 mg Film-coated Tablets
                
Processing: 
                    Bosentan Cipla 125 mg Film-coated Tablets
                
Processing: 
                    Bosentan Cipla 62.5 mg Film-coated Tablets
                
Processing: 
                    Bosentan Dr. Reddy's 125 mg Film-Coated Tablets
                
Processing: 
                    Bosentan Dr. Reddy's 62.5 mg Film-Coated Tablets
                
Processing: 
                    Bosentan Milpharm 125 mg film-coated tablets
                
Processing: 
                    Bosentan Milpharm 62.5 mg film-coated tablets
                
Processing: 
                    Bosentan Zentiva 125mg film-coated tablets
                
Processing: 
                    Bosulif 100 mg film-coated tablets
                
Processing: 
                    Bosulif 400 mg film-coated tablets
        

 83%|████████▎ | 10/12 [00:06<00:01,  1.50it/s]

Processing: 
                    Brilique 60 mg film coated tablets
                
Processing: 
                    Brilique 90 mg film coated tablets
                
Processing: 
                    Brilique 90 mg Orodispersible Tablets
                
Processing: 
                    Brimisol PR 100 mg Prolonged-Release Tablets
                
Processing: 
                    Brimisol PR 200mg Prolonged-release Tablet
                
Processing: 
                    Brintellix 10 mg film-coated tablets
                
Processing: 
                    Brintellix 20 mg film-coated tablets
                
Processing: 
                    Brintellix 5 mg film-coated tablets
                
Processing: 
                    Briviact 10 mg Film-coated Tablets
                
Processing: 
                    Briviact 100 mg Film-coated Tablets
                
Processing: 
                    Briviact 25 mg Film-coated Tablets
                
Processing: 
                    Brivi

 92%|█████████▏| 11/12 [00:07<00:00,  1.35it/s]

Processing: 
                    Bumetanide 1 mg Tablets
                
Processing: 
                    Bumetanide 5 mg Tablets
                
Processing: 
                    Bumetanide/Amiloride 1mg/5mg Tablets
                
Processing: 
                    Buprenorphine 2 mg sublingual tablets
                
Processing: 
                    Buprenorphine 2 mg sublingual tablets
                
Processing: 
                    Buprenorphine 8 mg sublingual tablets
                
Processing: 
                    Buprenorphine 8 mg sublingual tablets
                
Processing: 
                    Buscomint Peppermint oil 0.2ml gastro-resistant capsule, soft
                
Processing: 
                    Buscopan 10 mg Tablets
                
Processing: 
                    Buspirone Hydrochloride 10 mg Tablets
                
Processing: 
                    Buspirone hydrochloride 10mg tablets
                
Processing: 
                    Buspirone Hydrochlor

100%|██████████| 12/12 [00:07<00:00,  1.56it/s]

Processing: 
                    Bylvay 1200 micrograms hard capsules




Processing: 
                    Bylvay 200 micrograms hard capsules




Processing: 
                    Bylvay 400 micrograms hard capsules




Processing: 
                    Bylvay 600 micrograms hard capsules
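
The count parsing and paging used in Step 1 can be tried offline. The sketch below is a minimal illustration, not the notebook's own code: `parse_total_results` and `page_offsets` are hypothetical helper names, the sample text `" 588 results found "` is an assumed example of the summary string, and the regular expression is a swapped-in alternative to the chained `replace` calls.

```python
import re

def parse_total_results(summary_text):
    """Extract the first integer from text such as '588 results found'."""
    match = re.search(r"\d[\d,]*", summary_text)
    if match is None:
        raise ValueError(f"No result count in: {summary_text!r}")
    # Drop thousands separators before converting (assumed format)
    return int(match.group().replace(",", ""))

def page_offsets(total, page_size=50):
    """Offsets requested one page at a time: 1, 51, 101, ... up to total."""
    return list(range(1, total + 1, page_size))

# Hypothetical summary text; the real page content may differ
offsets = page_offsets(parse_total_results(" 588 results found "))
print(len(offsets), offsets[0], offsets[-1])  # 12 1 551
```

With a total between 551 and 600, this yields 12 requests, the same number of iterations the progress bar above reports.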









Step 2: Visit each drug link collected for the letter to obtain the drug description and manufacturer details

In [126]:
# Gather data from the URLs collected in Step 1

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm


def find_drug_description(url):
  """Scrapes the name, description and company of a medicine from its product page

  Args:
    url: URL of the product page

  Returns:
    name, description, company (empty strings when missing) and a flag that
    is False when the page has no product information section
  """
  r = requests.get(url)
  soup = BeautifulSoup(r.text, 'html.parser')

  # Name of the medicine; pages without it are discontinued or have no SPC
  has_spc = True
  title_tag = soup.find(id='PRODUCTINFO')
  try:
    title = title_tag.parent.find(class_='sectionWrapper').text
  except AttributeError:
    title = ""
    has_spc = False
    print(f"DISCONTINUED/NO SPC: {url}")

  # Description of the medicine: text of the last element next to the FORM anchor
  dsc_output = ""
  tag = soup.find(id='FORM')
  try:
    for desc in tag.parent.find_all(recursive=False):  # direct children only
      if desc != tag:  # skip the anchor element itself
        dsc_output = desc.text + "What is this drug?"
  except AttributeError:
    dsc_output = ""

  # Company name
  try:
    comp_name = soup.find(class_="product-header-company-name").text
  except AttributeError:
    comp_name = ""

  return title.replace("\n", ""), dsc_output.replace("\n", ""), comp_name.replace("\n", ""), has_spc


records = []
discontinued = []  # URLs of pages without product information
for i in tqdm(B_urls):
  name, description, company, has_spc = find_drug_description(i)
  if not has_spc:
    discontinued.append(i)
  records.append({"Name": name,
               "Description": description,
               "Company": company})

df = pd.DataFrame(records)
# Print the DataFrame (optional)
print(df.to_string())
df.to_csv('OSD(B).csv', index=False)



  3%|▎         | 6/211 [00:05<03:03,  1.12it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8330/pil


  7%|▋         | 14/211 [00:11<02:03,  1.59it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8274/pil


  7%|▋         | 15/211 [00:12<02:04,  1.57it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8335/pil


  8%|▊         | 17/211 [00:13<02:19,  1.39it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/11480/pil


  9%|▊         | 18/211 [00:14<02:22,  1.35it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/9343/pil


 11%|█▏        | 24/211 [00:19<02:35,  1.21it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8377/pil


 12%|█▏        | 26/211 [00:21<02:28,  1.24it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/10013/pil


 13%|█▎        | 27/211 [00:22<02:21,  1.30it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/1911/pil


 14%|█▍        | 30/211 [00:24<02:06,  1.43it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/13090/pil


 16%|█▌        | 33/211 [00:26<02:15,  1.32it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/3935/pil


 17%|█▋        | 35/211 [00:28<02:13,  1.32it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8333/pil


 20%|█▉        | 42/211 [00:32<01:45,  1.60it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8308/pil


 21%|██▏       | 45/211 [00:34<01:31,  1.82it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8336/pil


 22%|██▏       | 47/211 [00:35<01:40,  1.63it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8391/pil


 23%|██▎       | 49/211 [00:37<01:44,  1.54it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8284/pil


 24%|██▎       | 50/211 [00:37<01:46,  1.51it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8309/pil


 25%|██▍       | 52/211 [00:39<01:51,  1.43it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8086/pil


 27%|██▋       | 58/211 [00:44<01:58,  1.29it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8279/pil


 28%|██▊       | 59/211 [00:44<01:52,  1.35it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8458/pil


 28%|██▊       | 60/211 [00:45<01:33,  1.61it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/9540/pil


 30%|██▉       | 63/211 [00:47<01:58,  1.25it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8320/pil


 31%|███       | 65/211 [00:49<02:00,  1.21it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8319/pil


 31%|███▏      | 66/211 [00:50<01:51,  1.30it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8360/pil


 34%|███▎      | 71/211 [00:54<01:51,  1.26it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8275/pil


 36%|███▌      | 75/211 [00:56<01:32,  1.46it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8340/pil


 37%|███▋      | 78/211 [00:59<01:50,  1.21it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8339/pil


 37%|███▋      | 79/211 [01:00<01:42,  1.29it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/11510/pil


 39%|███▉      | 82/211 [01:02<01:49,  1.18it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/5912/pil


 40%|███▉      | 84/211 [01:04<01:34,  1.34it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8334/pil


 44%|████▍     | 93/211 [01:11<01:32,  1.28it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8371/pil


 45%|████▌     | 95/211 [01:13<01:36,  1.20it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8372/pil


 45%|████▌     | 96/211 [01:13<01:24,  1.36it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8337/pil


 47%|████▋     | 99/211 [01:15<01:11,  1.57it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8363/pil


 47%|████▋     | 100/211 [01:15<01:04,  1.72it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8366/pil


 48%|████▊     | 101/211 [01:16<00:57,  1.91it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/6568/pil


 49%|████▉     | 104/211 [01:18<00:59,  1.80it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8283/pil


 50%|█████     | 106/211 [01:20<01:23,  1.26it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8332/pil


 51%|█████     | 107/211 [01:20<01:16,  1.36it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8273/pil


 52%|█████▏    | 110/211 [01:22<01:05,  1.53it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/9542/pil


 53%|█████▎    | 111/211 [01:23<01:05,  1.53it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/9029/pil


 54%|█████▍    | 114/211 [01:25<01:00,  1.59it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/4079/pil


 55%|█████▍    | 115/211 [01:26<01:01,  1.56it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/6714/pil


 55%|█████▍    | 116/211 [01:26<01:03,  1.49it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8910/pil


 57%|█████▋    | 121/211 [01:29<00:52,  1.73it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8087/pil


 58%|█████▊    | 122/211 [01:30<00:46,  1.91it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/9539/pil


 58%|█████▊    | 123/211 [01:30<00:49,  1.77it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8387/pil


 60%|█████▉    | 126/211 [01:32<00:55,  1.54it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8342/pil


 61%|██████    | 129/211 [01:34<00:48,  1.70it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8455/pil


 62%|██████▏   | 130/211 [01:35<00:47,  1.70it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/1910/pil


 73%|███████▎  | 155/211 [01:51<00:37,  1.48it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8298/pil


 75%|███████▍  | 158/211 [01:53<00:30,  1.75it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/3286/pil


 79%|███████▊  | 166/211 [01:59<00:27,  1.63it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8329/pil


 81%|████████  | 170/211 [02:01<00:25,  1.64it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8389/pil


 82%|████████▏ | 174/211 [02:03<00:21,  1.76it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/13702/pil


 85%|████████▌ | 180/211 [02:07<00:18,  1.68it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/6567/pil


 86%|████████▌ | 181/211 [02:07<00:18,  1.63it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8322/pil


 86%|████████▋ | 182/211 [02:08<00:16,  1.75it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/14044/pil


 87%|████████▋ | 183/211 [02:08<00:16,  1.70it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8297/pil


 88%|████████▊ | 186/211 [02:10<00:13,  1.88it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/15027/pil


 95%|█████████▌| 201/211 [02:21<00:06,  1.44it/s]

DISCONTINUED/NO SPC: https://www.medicines.org.uk//emc/product/8338/pil


100%|██████████| 211/211 [02:28<00:00,  1.43it/s]
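
Scraped text often carries newlines and long runs of spaces, as the `Processing:` output of Step 1 shows, and many pages in Step 2 return no description at all. The sketch below is one possible way to tidy the fields before they reach the DataFrame. `clean_text` and `build_record` are hypothetical helpers that are not part of the notebook, and unlike the notebook's `replace("\n", "")` they join lines with a single space.

```python
def clean_text(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return " ".join(text.split())

def build_record(name, description, company):
    """Return a row dict, or None when the name or description is missing."""
    name, description, company = (
        clean_text(s) for s in (name, description, company)
    )
    if not name or not description:
        return None  # e.g. a discontinued product without an SPC
    return {"Name": name, "Description": description, "Company": company}

print(clean_text("\n                    Baclofen 10 mg Tablets\n                "))
```

Skipping empty records at this point would be an alternative to the "nan" filtering done in Step 4.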





Step 3: Convert the collected data to JSON format

In [166]:
import pandas as pd
import json

def df_to_training_data(df):
  """
  Converts a pandas DataFrame to a list of prompt/completion dictionaries.

  Args:
      df (pandas.DataFrame): DataFrame with "Name", "Description" and "Company" columns.

  Returns:
      list: A list of dictionaries in training_data format.
  """
  training_data = []
  for _, row in df.iterrows():
    # The completion is the drug name followed by the company name
    completion = f"{row['Name']} {str(row['Company']).strip()}"
    # The prompt is the scraped description; missing values become the string "nan"
    merged_prompt = f"{row['Description']}"

    # Create dictionary and append to training_data
    data_dict = {"prompt": merged_prompt, "completion": completion}
    training_data.append(data_dict)

  return training_data

# Load the CSV written in Step 2
df = pd.read_csv("OSD(B).csv")
training_data = df_to_training_data(df)

# Print or use the training_data list
print(json.dumps(training_data, indent=2))


[
  {
    "prompt": "Film-coated tablet.Bosulif 500 mg film-coated tabletsRed oval (width: 9.5 mm; length: 18.3 mm) biconvex, film-coated tablet debossed with \u201cPfizer\u201d on one side and \u201c500\u201d on the other side.What is this drug?",
    "completion": "Bosulif 500 mg film-coated tablets Pfizer Limited"
  },
  {
    "prompt": "Film-coated tablet (tablet)Brintellix 5 mg film-coated tabletsPink, almond-shaped (5 x 8.4 mm) film-coated tablet engraved with \u201cTL\u201d on one side and \u201c5\u201d on the other side.Brintellix 10 mg film-coated tabletsYellow, almond-shaped (5 x 8.4 mm) film-coated tablet engraved with \u201cTL\u201d on one side and \u201c10\u201d on the other side.Brintellix 15 mg film-coated tabletsOrange, almond-shaped (5 x 8.4 mm) film-coated tablet engraved with \u201cTL\u201d on one side and \u201c15\u201d on the other side.Brintellix 20 mg film-coated tabletsRed, almond-shaped (5 x 8.4 mm) film-coated tablet engraved with \u201cTL\u201d on one side an

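The same prompt/completion mapping can be illustrated without pandas. This is a sketch under stated assumptions rather than the notebook's method: it swaps in the standard-library `csv` module, `rows_to_training_data` is a hypothetical name, and the sample rows are invented placeholders in the three-column layout of `OSD(B).csv`.

```python
import csv
import io

# Hypothetical rows in the same three-column layout as OSD(B).csv
SAMPLE_CSV = (
    "Name,Description,Company\n"
    "Example 10 mg tablets,White round tablet.What is this drug?,Example Pharma Ltd\n"
    ",,Example Pharma Ltd\n"
)

def rows_to_training_data(rows):
    """Build prompt/completion pairs, skipping rows with no name or description."""
    training_data = []
    for row in rows:
        name = (row.get("Name") or "").strip()
        description = (row.get("Description") or "").strip()
        company = (row.get("Company") or "").strip()
        if not name or not description:
            continue  # incomplete row: nothing useful to train on
        completion = f"{name} {company}".strip()
        training_data.append({"prompt": description, "completion": completion})
    return training_data

pairs = rows_to_training_data(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(pairs)
```

Because `csv.DictReader` returns empty strings for empty cells, incomplete rows can be skipped here directly instead of appearing as "nan" entries.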
Step 4: Filter out incomplete entries and write the training dataset

In [163]:
import json

training_file_name = "training_dataB.jsonl"

def prepare_data(dictionary_data, final_file_name):
  with open(final_file_name, 'w') as outfile:

    for entry in dictionary_data:
      json.dump(entry, outfile)
      outfile.write('\n')

def remove_nan_dicts(data):
  """
  Removes dictionaries that have the string "nan" as one of their values.

  Args:
      data (list): The list of prompt/completion dictionaries to process.

  Returns:
      list: The entries that contain no "nan" value.
  """
  # Iterate through the list of dictionaries
  output_json = []
  for d in data:
    if 'nan' in d.values():
      print(d)  # show the entry being dropped
      continue
    output_json.append(d)

  return output_json



# Drop incomplete entries from the training_data list built in Step 3
clean_data = remove_nan_dicts(training_data)

prepare_data(clean_data, training_file_name)


# Print or use the clean_data object
print(json.dumps(clean_data, indent=2))


{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE BOOTS COMPANY PLC ->'}
{'prompt': 'nan', 'completion': 'nan THE
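
Before uploading, it can help to confirm that every line of the JSONL file parses and carries a non-empty prompt and completion. The checker below is a minimal sketch. `validate_jsonl` is a hypothetical helper, and the checks reflect the prompt/completion layout used in this notebook rather than any official validation rule.

```python
import json

def validate_jsonl(path):
    """Return (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc.msg}"))
                continue
            for key in ("prompt", "completion"):
                value = entry.get(key) if isinstance(entry, dict) else None
                if not isinstance(value, str) or not value.strip():
                    problems.append((n, f"missing or empty {key!r}"))
    return problems

# Usage (assumes the file from this step exists):
# print(validate_jsonl("training_dataB.jsonl"))
```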

Step 5: Upload the training data to the OpenAI API for fine-tuning

In [164]:
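from openai import OpenAI

api_key = "YOUR OPENAI API KEY"

client = OpenAI(api_key=api_key)

# Upload the JSONL file; the returned FileObject carries the file ID used in Step 6
with open(training_file_name, "rb") as f:
  training_file_id = client.files.create(
    file=f,
    purpose="fine-tune"
  )

print(f"Training File ID: {training_file_id}")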

Training File ID: FileObject(id='file-SKQuUvwkSufrXPNvmdcmTkJ4', bytes=50800, created_at=1710426265, filename='training_dataB.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)
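
Pasting the key into the notebook makes it easy to leak when the notebook is shared. A common alternative is to read it from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY`; `load_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os

def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage with the client from this step (requires the openai package):
# client = OpenAI(api_key=load_api_key())
```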


Step 6: Initiate model fine-tuning

In [165]:
response = client.fine_tuning.jobs.create(
  training_file=training_file_id.id,
  model="davinci-002",
  hyperparameters={
    "n_epochs": 15,
    "batch_size": 3,
    "learning_rate_multiplier": 0.3
  }
)
job_id = response.id
status = response.status

print(f'Fine-tuning model with job ID: {job_id}.')
print(f"Training Response: {response}")
print(f"Training Status: {status}")
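
A fine-tuning job runs asynchronously, so the status printed right after creation is usually not final. The sketch below shows one way to poll until the job reaches a terminal state. `wait_for_job` is a hypothetical helper, the polling interval and budget are arbitrary assumptions, and the status fetcher is passed in as a callable so the loop can be tried without network access.

```python
import time

# Terminal states of a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError("Job did not finish within the polling budget")

# Usage with the client and job_id from this step:
# final = wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job_id).status)
```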

Step 7: Query the fine-tuned model with a prompt about a tablet or capsule

In [167]:
from openai import OpenAI
client = OpenAI(api_key=api_key)

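result = client.fine_tuning.jobs.list()
# print(result)
# Model name of the first job returned by the API; this stays None until that job has succeeded
fine_tuned_model = result.data[0].fine_tuned_model

new_prompt = "Briviact 10 mg film-coated tablets"
answer = client.completions.create(
  # Model ID from an earlier run; pass fine_tuned_model instead to query the job listed above
  model='ft:babbage-002:personal::92fUVA3E',
  prompt=new_prompt,
  max_tokens=20
)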

print(answer.choices[0].text)


 are currently not approved for the treatment of frontotemporal lobar degeneration (FTLD)
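
The completion above does not name a product. One possible factor, offered as a guess rather than a diagnosis, is that the prompt is a product name, whereas the training prompts built in Step 2 are physical descriptions ending in "What is this drug?". The sketch below formats a query the same way as the training data. `build_prompt` is a hypothetical helper, and the example description is taken from the Step 3 output.

```python
PROMPT_SUFFIX = "What is this drug?"

def build_prompt(description):
    """Format a description like the training prompts: no newlines, fixed suffix."""
    text = description.replace("\n", "")
    if not text.endswith(PROMPT_SUFFIX):
        text += PROMPT_SUFFIX
    return text

new_prompt = build_prompt(
    "Red oval (width: 9.5 mm; length: 18.3 mm) biconvex, film-coated tablet "
    "debossed with \u201cPfizer\u201d on one side and \u201c500\u201d on the other side."
)
print(new_prompt)
```

Whether this improves the answers depends on the model and training run, so it is worth comparing both prompt styles.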
