# Dissolution Stability Machine Learning Project Using a Gaussian Mixture Model

## Project Overview – Analyzing Dissolution Test Data with Gaussian Mixture Models

**Goal:**  
Use FDA Dissolution Methods metadata + synthetic dissolution profiles to identify clusters of dissolution behaviors ("fast," "medium," "slow").  

**Steps:**  
1. Feature Engineering from FDA database  
2. Synthetic Profile Generation using kinetic models  
3. GMM clustering and visualization  
4. Interpretation & discussion of pharma relevance

### 1. Notebook Setup

Getting all the tools ready in the project's virtual environment.

In [1]:
# Setup matplotlib to plot inline (within the notebook)
%matplotlib inline

# Core libraries
import pandas as pd
import numpy as np
import re

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# # Modeling
# from sklearn.preprocessing import StandardScaler
# from sklearn.mixture import GaussianMixture
# from sklearn.metrics import silhouette_score

# Utility
from tqdm import tqdm

# Ensure pandas prints enough rows and columns without truncation
pd.set_option("display.max_rows", 200)        # show up to 100 rows
pd.set_option("display.max_columns", None)    # show all columns
pd.set_option("display.width", None)          # don't wrap to fit terminal width
pd.set_option("display.max_colwidth", None)   # show full cell content

### 2. Data Acquisition

The FDA Dissolution Methods Database contains:
- Dosage form
- Apparatus type
- Agitation speed (RPM)
- Medium type & volume
- Sampling times

These features can be parsed into a structured table for feature engineering.

Loading the dataset (if dataset is real and available).

In [2]:
# Import dataset from CSV file or URL
df = pd.read_csv("Dissolution Methods.csv")

# Quick check to view the data
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1568 entries, 0 to 1567
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Drug Name                             1568 non-null   object
 1   Dosage Form                           1568 non-null   object
 2   USP Apparatus                         805 non-null    object
 3   Speed (RPMs)                          800 non-null    object
 4   Medium                                1568 non-null   object
 5   Volume (mL)                           792 non-null    object
 6   Recommended Sampling Times (minutes)  803 non-null    object
 7   Date Updated                          1568 non-null   object
dtypes: object(8)
memory usage: 98.1+ KB


Unnamed: 0,Drug Name,Dosage Form,USP Apparatus,Speed (RPMs),Medium,Volume (mL),Recommended Sampling Times (minutes),Date Updated
0,Abacavir Sulfate,Tablet,,,"Refer to FDA's Dissolution Guidance, 2018",,,07/02/2020
1,Abacavir Sulfate/Dolutegravir Sodium/Lamivudine,Tablet,II (Paddle),85.0,"0.01 M Phosphate Buffer with 0.5% sodium dodecyl sulfate (SDS), pH 6.8",900,"Abacavir and lamivudine: 10, 15, 20, 30 and 45; Dolutegravir: 5,15, 25, 35 and 45.",05/28/2015
2,Abacavir Sulfate/Dolutegravir Sodium/Lamivudine,Tablet (For Suspension),II (Paddle),50.0,"0.01 M Phosphate Buffer with 0.5 mM EDTA, pH 6.8",500,"5, 10, 15, 30, 45 and 60",10/06/2023
3,Abacavir Sulfate/Lamivudine,Tablet,II (Paddle),75.0,0.1 N HCl,900,"10, 20, 30, and 45",01/03/2007
4,Abacavir Sulfate/Lamivudine/Zidovudine,Tablet,II (Paddle),75.0,0.1 N HCl,Acid Stage: 900 mL; Buffer Stage: 1000 mL,"5, 10, 15, 30 and 45",01/03/2007


Simulating the dataset (if dataset if not available). (OPTIONAL)

In [3]:
# # Simulating dissolution test metrics for 200 batches
# np.random.seed(42)
# fast_group = np.random.normal(
#     loc=90, scale=5, size=(100, 3))  # high % dissolved
# slow_group = np.random.normal(
#     loc=60, scale=5, size=(100, 3))  # lower % dissolved
# synthetic_data = np.vstack([fast_group, slow_group])

# df = pd.DataFrame(synthetic_data, columns=["% Dissolved @ 5min", "% Dissolved @ 10min", "% Dissolved @ 15min"])

### 3. Data Cleaning, Preprocessing, and Feature Engineering

Prepare the dataset for use within the model. Convert the FDA method metadata into ML-ready features:

- Apparatus → categorical → one-hot encode  
- Medium type → categorical → one-hot encode  
- RPM → numerical  
- Medium volume → numerical  
- Sampling times → numerical summary features (e.g., # of samples, max time)  

#### 3.1. Data Cleaning

Ensuring numerical features are clean and scaled.

In [4]:
print(len(df))

1568


In [5]:
df.head()

Unnamed: 0,Drug Name,Dosage Form,USP Apparatus,Speed (RPMs),Medium,Volume (mL),Recommended Sampling Times (minutes),Date Updated
0,Abacavir Sulfate,Tablet,,,"Refer to FDA's Dissolution Guidance, 2018",,,07/02/2020
1,Abacavir Sulfate/Dolutegravir Sodium/Lamivudine,Tablet,II (Paddle),85.0,"0.01 M Phosphate Buffer with 0.5% sodium dodecyl sulfate (SDS), pH 6.8",900,"Abacavir and lamivudine: 10, 15, 20, 30 and 45; Dolutegravir: 5,15, 25, 35 and 45.",05/28/2015
2,Abacavir Sulfate/Dolutegravir Sodium/Lamivudine,Tablet (For Suspension),II (Paddle),50.0,"0.01 M Phosphate Buffer with 0.5 mM EDTA, pH 6.8",500,"5, 10, 15, 30, 45 and 60",10/06/2023
3,Abacavir Sulfate/Lamivudine,Tablet,II (Paddle),75.0,0.1 N HCl,900,"10, 20, 30, and 45",01/03/2007
4,Abacavir Sulfate/Lamivudine/Zidovudine,Tablet,II (Paddle),75.0,0.1 N HCl,Acid Stage: 900 mL; Buffer Stage: 1000 mL,"5, 10, 15, 30 and 45",01/03/2007


Establishing known lists for reference in clean up.

In [23]:
# Known "Dosage Form" release types to help detect them outside parentheses
KNOWN_RELEASE_TYPES = [
    "Extended Release", "Delayed Release", "Orally Disintegrating", 
    "Immediate Release", "Controlled Release", "Sustained Release", 
    "Copackage" # , "For Suspension"
]

# # Known RPM units
# RPM_UNIT_MAP = {
#     "rpm": "RPM",
#     "cycles per min": "RPM",
#     "cycles (chews) per minute": "RPM",
#     "dpm": "DPM",
#     "dips/min": "DPM",
#     "dpm (dip rate per minute)": "DPM",
#     "cycles per min": "cycles/min",
#     "cycles/min": "cycles/min"
# }

# # Known RPM units
# RPM_UNIT_MAP = {
#     "RPM": ["RPM", "rpm", "cycles per min", "cycles (chews) per minute"],
#     "DPM": ["DPM", "dpm", "dips/min", "dpm (dip rate per minute)"],
#     "cycles/min": ["cycles/min", "cycles per min"],
# }

# Known RPM units and possible variations
RPM_UNIT_MAP = {
    "RPM": ["rpm", "RPM", "revs/min", "revolutions per minute"],
    "DPM": ["dpm", "DPM", "dips/min", "dpm (dip rate per minute)"],
    "CPM": ["cpm", "CPM", "cycles per minute", "cycles/minute",
            "cycles/min", "cycles per min", "cycles (chews) per minute"],
}

# Flatten unit variants and build a regex alternation of *only* known units
UNIT_VARIANTS = sorted({v for vs in RPM_UNIT_MAP.values() for v in vs}, key=len, reverse=True)
UNIT_ALT = "|".join(re.escape(v) for v in UNIT_VARIANTS)  # longest-first avoids partial matches

Defining cleanup functions.

In [7]:
# Defining function to cleanup "Dosage Form" data
def clean_dosage_form(value):
    if pd.isna(value):
        return ("Unknown", "Unknown")
    
    # Strip spaces
    val = str(value).strip()

    # # Look for release type in parentheses
    # m = re.match(r"^(.*?)\s*\((.*?)\)$", val)
    
    # Regex: capture base form + anything inside parentheses
    # e.g. "Tablet (Delayed Release, Orally Disintegrating)"
    m = re.match(r"^(.*?)\s*(?:\((.+)\))?$", val)
    # m = re.match(r"^([A-Za-z ]+?)(?:\s*\((.+)\))?$", val)
    if m:
        dosage_form = m.group(1).strip().title()  # normalize dosage form
        release_type = m.group(2).strip() if m.group(2) else None

        # Remove trailing commas from both
        dosage_form = dosage_form.rstrip(',').strip()
        dosage_form = re.sub(r"\s+", " ", dosage_form)
        if release_type:
            release_type = release_type.rstrip(',').strip() # Note: ODT refers to Orally Disintegrating Tablet
            release_type = re.sub(r"\s+", " ", release_type)
            if release_type and not release_type[0].isupper():
                release_type = release_type[0].upper() + release_type[1:]

        # If no parentheses captured and there’s a comma, check for trailing release type
        if not release_type and ',' in dosage_form:
            parts = [p.strip() for p in dosage_form.split(',')]
            # If last part looks like a known release type, separate it
            if parts[-1] in KNOWN_RELEASE_TYPES:
                release_type = parts[-1]
                dosage_form = ', '.join(parts[:-1]).rstrip(',').strip()
        
        return dosage_form, release_type
    else:
        # fallback: keep whole thing in base, no modifiers
        return val.rstrip(',').title(), None


# • Apparatus → categorical → one-hot encode
# • Medium type → categorical → one-hot encode
# • RPM → numerical
# • Medium volume → numerical
# • Sampling times → numerical summary features (e.g., # of samples, max time)

# Defining function to cleanup "USP Apparatus" data

# def clean_usp_apparatus(value):
#     if pd.isna(value):
#         return "Unknown"
    
#     # Strip spaces
#     val = str(value).strip().title()

#     # Normalize common names
#     val = re.sub(r"\b(USP|Apparatus|Dissolution)\b", "", val)

#     # Remove extra whitespace
#     val = re.sub(r"\s+", " ", val).strip()

#     return val if val else "Unknown"

# • Apparatus → categorical → one-hot encode
# • Medium type → categorical → one-hot encode
# • RPM → numerical
# • Medium volume → numerical
# • Sampling times → numerical summary features (e.g., # of samples, max time)


# Preparing for cleanup of "Speed (RPMs)" data
df.rename(columns={"Speed (RPMs)": "Speed (RPMs) (Raw)"}, inplace=True)


# # Defining function to cleanup "Speed (RPMs)" data
# def clean_speed_rpms(value):
#     if pd.isna(value):
#         return 0

#     # Extract numeric part
#     val = re.findall(r"[\d.]+", str(value))
#     return float(val[0]) if val else 0

# Defining function to cleanup "Medium" data

# Defining function to cleanup "Volume (mL)" data

# Defining function to cleanup "Recommended Sampling Times (minutes)" data

In [8]:
# df["Speed (RPMs) (Raw)"].value_counts().reset_index(name="count").rename(columns={"index": "Speed (RPMs) (Raw)"})
num_unique_rpms = len(df["Speed (RPMs) (Raw)"].value_counts())
print("Number of rows displayed:", num_unique_rpms)

# df["Speed (RPMs) (Raw)"].value_counts().to_frame().style.set_table_attributes('style="display:inline"').set_table_styles(
#     [{'selector': 'table', 'props': [('max-height', '400px'), ('overflow-y', 'scroll'), ('display', 'block')]}]
# )

Number of rows displayed: 63


In [9]:
# Replacing any Unicode non-breaking space characters ("\xa0") with spaces
val = str(df["Speed (RPMs) (Raw)"]).replace("\xa0", " ").strip()
results = []

# Split by semicolons for multiple measurements
parts = [p.strip() for p in val.split(";")]

# Define regex patterns to explore
rpm_regex_patterns = {
    "digits_only": r'^\d+$',               # pure digits only
    "rpm_values": r'^\d+\s?RPM$',          # e.g., '50 RPM'
    "dpm_values": r'^\d+\s?DPM$',          # e.g., '30 DPM'
    "unclear_values": r'(\d+)\s*([A-Za-z/ ]+)',
    "embedded_cycles": r'(\d+)\s*(rpm|dpm|cycles per min|cycles/min)',
    "drug_based": r'([\d, andmgMG ]+):\s*(\d+)\s*([A-Za-z/]+)?',
    "other": r'(.+?):\s*(\d+)\s*([A-Za-z/]+)?',
}

# Function to match each semicolon-separated part
def match_parts(entry, pattern):
    entry = str(entry).replace("\xa0", " ").strip()
    parts = [p.strip() for p in entry.split(";")]
    matched_parts = [p for p in parts if pattern.match(p)]
    return matched_parts if matched_parts else [None]

# Choose a pattern to display (change this key as needed)
selected_rpm_pattern_key = "embedded_cycles"
selected_rpm_pattern = rpm_regex_patterns[selected_rpm_pattern_key]

# Compile regex
pattern = re.compile(selected_rpm_pattern, flags=re.IGNORECASE)

# Filter rows that MATCH this pattern
mask = df["Speed (RPMs) (Raw)"].astype(str).str.match(pattern)

# Compute value counts DataFrame
vc = df.loc[mask, "Speed (RPMs) (Raw)"].value_counts().to_frame(name="count")

# Count how many rows will be displayed
row_count = len(vc)
print(f"Number of rows displayed for pattern '{selected_rpm_pattern_key}': {row_count}")

# Apply styling for display in Jupyter
filtered = (
    vc.style
    .set_table_attributes('style="display:inline"')
    .set_table_styles(
        [{'selector': 'table',
          'props': [('max-height', '400px'),
                    ('overflow-y', 'scroll'),
                    ('display', 'block')]}]
    )
)

filtered

Number of rows displayed for pattern 'embedded_cycles': 9


Unnamed: 0_level_0,count
Speed (RPMs) (Raw),Unnamed: 1_level_1
20 dpm,2
27 dpm,1
"50 RPM (12.5 mg, 25 mg and 100 mg); 75 RPM (150 mg and 200 mg)",1
75 rpm (for 75 mg strength); 150 rpm (for 300 mg strength),1
25 dpm,1
30 cycles per min,1
12 dpm,1
30 cycles per minute. amplitude of about 2m.,1
10 dpm (dip rate per minute),1


In [10]:
unique_rpm_values = pd.DataFrame(df["Speed (RPMs) (Raw)"].unique(), columns=["Unique RPMs"])

# Force display of all rows in this DataFrame
with pd.option_context("display.max_rows", None):
    display(unique_rpm_values)

Unnamed: 0,Unique RPMs
0,
1,85
2,50
3,75
4,55
5,180
6,100
7,60
8,Clarithromycin and Vonoprazan: 50
9,Vonoprazan: 50


In [11]:
# import re
# import pandas as pd

# def clean_speed(value):
#     if pd.isna(value):
#         return (None, None, None)  # Speed, Unit, Notes
    
#     val = str(value).strip().replace("\xa0", " ")  # clean NBSP
    
#     # Match simple RPM number
#     if re.fullmatch(r"\d+", val):
#         return (int(val), "RPM", None)
    
#     # Check for "dpm", "cycles per min", "dips/min", "mL/min"
#     units_map = {
#         "dpm": r"(\d+)\s*dpm",
#         "cycles/min": r"(\d+)\s*cycles(?:\s*\(.*\))?\s*per\s*min(?:ute)?",
#         "dips/min": r"(\d+)\s*dips?/min",
#         "mL/min": r"(\d+)\s*mL/min",
#         "RPM": r"(\d+)\s*rpm",
#     }
    
#     for unit, pattern in units_map.items():
#         m = re.search(pattern, val, flags=re.I)
#         if m:
#             return (int(m.group(1)), unit, val)  # keep full text as notes
    
#     # Fallback: return None for speed but preserve text as Notes
#     return (None, None, val)

# # Apply to DataFrame
# df[["Speed (RPMs) (Clean)", "Speed Unit", "Speed Notes"]] = df["Speed (RPMs) (Raw)"].apply(
#     lambda x: pd.Series(clean_speed(x))
# )

# df[df["Speed (RPMs) (Raw)"] == "50 RPM (12.5 mg, 25 mg and 100 mg); 75 RPM (150 mg and 200 mg)"]

In [12]:
# Define globally
rpm_case_tracking_dict = {
    "Case: empty entry": 0,
    "Case: pure number": 0,
    "Case: number + unit": 0,
    "Case: number with unit embedded": 0,
    "Case: \"X mg, Y mg and Z mg: N rpm\"": 0,
    "Case: \"<Note>: <number> [unit?]\"": 0,
    "Case: number + parenthetical note": 0,
    "Fallback: keep as note": 0,
}

# Setting for which cases to print (for debugging)
rpm_case_printing_dict = {
    "Case: empty entry": False,
    "Case: pure number": False,
    "Case: number + unit": True,
    "Case: number with unit embedded": True, # T
    "Case: \"X mg, Y mg and Z mg: N rpm\"": False, # T
    "Case: \"<Note>: <number> [unit?]\"": False, # T
    "Case: number + parenthetical note": False,
    "Fallback: keep as note": True, # T
}

def clean_speed_entry(entry):
    """
    Returns a list of tuples: (Speed, Unit, Notes)
    Handles multiple numbers per entry and colon/semicolon-separated cases.
    """
    if pd.isna(entry):
        rpm_case_tracking_dict["Case: empty entry"] += 1
        if rpm_case_printing_dict["Case: empty entry"] == True:
            print("Case: empty entry                    ", (None, None, None))
        return [(None, None, None)]

    # Replacing any Unicode non-breaking space characters ("\xa0") with spaces
    # val = str(entry).replace("\xa0", " ").strip()
    
    # Replace non-breaking spaces and collapse all whitespace in one go
    val = re.sub(r'\s+', ' ', str(entry).replace("\xa0", " ")).strip()

    results = []

    # Split by semicolons for multiple measurements
    parts = [p.strip() for p in val.split(";")]

    for part in parts:
        # Case: pure number
        if part.isdigit():
            results.append((int(part), "RPM", None))
            rpm_case_tracking_dict["Case: pure number"] += 1
            if rpm_case_printing_dict["Case: pure number"] == True:
                print("Case: pure number                    ", (int(part), "RPM", None))
            continue

        # Case: number + parenthetical note
        if re.search(r"\d+\s*\([^)]*\)", part):
            # Split on "&" inside this case, because it can contain multiple entries
            subparts = [sp.strip() for sp in re.split(r"&", part)]
            for sp in subparts:
                m = re.match(r"(\d+)\s*\(([^)]*)\)", sp)
                if m:
                    speed = int(m.group(1))
                    note = m.group(2).strip()
                    results.append((speed, "RPM", note))
                    rpm_case_tracking_dict["Case: number + parenthetical note"] += 1
                    if rpm_case_printing_dict.get("Case: number + parenthetical note", True):
                        print("Case: number + parenthetical note    ", (speed, "RPM", note))
            continue

        # Case: number + unit
        match = re.match(r"(\d+)\s*([A-Za-z/ ]+)", part)
        if match and match.group(2) != '':
            speed = int(match.group(1))
            unit = match.group(2).strip().replace("per minute", "/min")
            results.append((speed, unit, None))
            rpm_case_tracking_dict["Case: number + unit"] += 1
            if rpm_case_printing_dict["Case: number + unit"] == True:
                print("Case: number + unit      * * *       ", (speed, unit, None))
            if unit == "":
                print("Problematic row:")
                display(df[df["Speed (RPMs) (Raw)"] == entry].head())
            continue

        # # Case: number + unit
        # match = re.match(r"^(\d+)\s*([A-Za-z/ ]+)$", part)
        # if match:
        #     speed = int(match.group(1))
        #     raw_unit = match.group(2).strip().lower()
        
        #     # Only accept units we know
        #     if raw_unit in RPM_UNIT_MAP:
        #         unit = RPM_UNIT_MAP[raw_unit]
        #         results.append((speed, unit, None))
        #         rpm_case_tracking_dict["Case: number + unit"] += 1
        #         if rpm_case_printing_dict["Case: number + unit"]:
        #             print("Case: number + unit                  ", (speed, unit, None))
        #     else:
        #         # Otherwise treat the whole string as a note
        #         results.append((None, None, part))
        #         rpm_case_tracking_dict["Fallback: keep as note"] += 1
        #         if rpm_case_printing_dict["Fallback: keep as note"]:
        #             print("Fallback: keep as note               ", (None, None, part))
        #     continue
        
        # Case: number with unit embedded
        embedded = re.search(r"(\d+)\s*(rpm|dpm|cycles per min|cycles/min)", part, flags=re.I)
        if embedded:
            speed = int(embedded.group(1))
            unit = RPM_UNIT_MAP.get(embedded.group(2).lower(), embedded.group(2).upper())
            # Remove speed/unit from note
            note = re.sub(r"\d+\s*" + embedded.group(2), "", part, flags=re.I).strip(",; ")
            results.append((speed, unit, note if note else None))
            rpm_case_tracking_dict["Case: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case: number with unit embedded"] == True:
                print("Case: number with unit embedded   *  ", (speed, unit, note if note else None))
            continue
            
        # Case: "X mg, Y mg and Z mg: N rpm"
        multi_match = re.match(r"([\d, andmgMG ]+):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if multi_match:
            drugs = re.split(r",|and", multi_match.group(1))
            speed = int(multi_match.group(2))
            unit = multi_match.group(3).lower() if multi_match.group(3) else "rpm"
            unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
            for d in [d.strip() for d in drugs if d.strip()]:
                results.append((speed, unit, d))
            rpm_case_tracking_dict["Case: \"X mg, Y mg and Z mg: N rpm\""] += 1
            if rpm_case_printing_dict["Case: \"X mg, Y mg and Z mg: N rpm\""] == True:
                print("Case: \"X mg, Y mg and Z mg: N rpm\"")
            continue

        # Case: "<Note>: <number> [unit?]"
        colon_match = re.match(r"(.+?):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if colon_match:
            note = colon_match.group(1).strip()
            speed = int(colon_match.group(2))
            unit = colon_match.group(3).lower() if colon_match.group(3) else None
            if unit:
                unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
            results.append((speed, unit, note))
            rpm_case_tracking_dict["Case: \"<Note>: <number> [unit?]\""] += 1
            if rpm_case_printing_dict["Case: \"<Note>: <number> [unit?]\""] == True:
                print("Case: \"<Note>: <number> [unit?]\"     ", (speed, unit, note))
            continue

        # Fallback: keep as note
        results.append((None, None, part))
        rpm_case_tracking_dict["Fallback: keep as note"] += 1
        if rpm_case_printing_dict["Fallback: keep as note"] == True:
            print("Fallback: keep as note               ", (None, None, part))
            
    return results

# Apply and expand, safely overwriting existing columns
expanded_rows = df["Speed (RPMs) (Raw)"].apply(clean_speed_entry)
df_cleaned = pd.DataFrame(
    [t for sublist in expanded_rows for t in sublist],
    columns=["Speed (RPMs) (Clean)", "Unit", "Notes"]
)

# Overwrite the old columns safely
df[["Speed (RPMs) (Clean)", "Unit", "Notes"]] = df_cleaned

pd.DataFrame(list(rpm_case_tracking_dict.items()), columns=["Case", "Count"])

# df[df["Speed (RPMs) (Raw)"] == "50 RPM (12.5 mg, 25 mg and 100 mg); 75 RPM (150 mg and 200 mg)"]
# df[-200:]

Case: number + unit      * * *        (4, 'mL/min', None)
Fallback: keep as note                (None, None, 'Flow @ 8 mL/min')
Case: number + unit      * * *        (50, 'mg', None)
Case: number + unit      * * *        (400, 'mg', None)
Case: number + unit      * * *        (27, 'dpm', None)
Fallback: keep as note                (None, None, 'Flow @ 6 mL/min')
Case: number + unit      * * *        (50, 'RPM', None)
Case: number + unit      * * *        (75, 'RPM', None)
Case: number + unit      * * *        (75, 'rpm', None)
Case: number + unit      * * *        (150, 'rpm', None)
Case: number + unit      * * *        (25, 'dpm', None)
Case: number + unit      * * *        (30, 'cycles per min', None)
Case: number + unit      * * *        (12, 'dpm', None)
Case: number + unit      * * *        (30, 'dips/min', None)
Case: number + unit      * * *        (25, 'and', None)
Case: number + unit      * * *        (30, 'cycles /min', None)
Case: number with unit embedded   *   (205, 'RPM',

Unnamed: 0,Case,Count
0,Case: empty entry,768
1,Case: pure number,756
2,Case: number + unit,19
3,Case: number with unit embedded,7
4,"Case: ""X mg, Y mg and Z mg: N rpm""",0
5,"Case: ""<Note>: <number> [unit?]""",29
6,Case: number + parenthetical note,5
7,Fallback: keep as note,3


In [31]:
# Define globally
rpm_case_tracking_dict = {
    "Case 1: empty entry": 0,
    "Case 2: pure number": 0,
    "Case 3: number + parenthetical note": 0,
    "Case 4: \"<Note>: <number> [unit?]\"": 0,
    "Case 5: \"X mg, Y mg and Z mg: N rpm\"": 0,
    "Case 6: number with unit embedded": 0,
    "Case 7: number + unit": 0,
    "Case 8: multiple numbers without unit": 0,
    "Fallback: keep as note": 0,
}

# Setting for which cases to print (for debugging)
rpm_case_printing_dict = {
    "Case 1: empty entry": False,
    "Case 2: pure number": False,
    "Case 3: number + parenthetical note": False,
    "Case 4: \"<Note>: <number> [unit?]\"": True,
    "Case 5: \"X mg, Y mg and Z mg: N rpm\"": True,
    "Case 6: number with unit embedded": True,
    "Case 7: number + unit": True,
    "Case 8: multiple numbers without unit": True,
    "Fallback: keep as note": True,
}

def normalize_unit(raw_unit: str) -> str:
    raw_unit = raw_unit.strip().lower()
    for canonical, variations in RPM_UNIT_MAP.items():
        for variation in variations:
            if raw_unit == variation.lower():
                return canonical
    return raw_unit  # fallback if no match

def clean_speed_entry(entry):
    """
    Returns a list of tuples: (Speed, Unit, Notes)
    Handles multiple numbers per entry and colon/semicolon-separated cases.
    """
    # Case 1: empty entry
    if pd.isna(entry):
        rpm_case_tracking_dict["Case 1: empty entry"] += 1
        if rpm_case_printing_dict["Case 1: empty entry"] == True:
            print("Case 1: empty entry                  ", (None, None, None))
        return [(None, None, None)]

    # Replacing any Unicode non-breaking space characters ("\xa0") with spaces
    # val = str(entry).replace("\xa0", " ").strip()
    
    # Replace non-breaking spaces and collapse all whitespace in one go
    val = re.sub(r'\s+', ' ', str(entry).replace("\xa0", " ")).strip()

    results = []

    # Split by semicolons for multiple measurements
    parts = [p.strip() for p in val.split(";")]

    for part in parts:
        # Case 2: pure number
        if part.isdigit():
            results.append((int(part), "RPM", None))
            rpm_case_tracking_dict["Case 2: pure number"] += 1
            if rpm_case_printing_dict["Case 2: pure number"] == True:
                print("Case 2: pure number                  ", (int(part), "RPM", None))
            continue

        # Case 3: number + parenthetical note
        if re.search(r"\d+\s*\([^)]*\)", part):
            # Split on "&" inside this case, because it can contain multiple entries
            subparts = [subpart.strip() for subpart in re.split(r"&", part)]
            for subpart in subparts:
                match = re.match(r"(\d+)\s*\(([^)]*)\)", subpart)
                if match:
                    speed = int(match.group(1))
                    note = match.group(2).strip()
                    results.append((speed, "RPM", note))
                    rpm_case_tracking_dict["Case 3: number + parenthetical note"] += 1
                    if rpm_case_printing_dict.get("Case 3: number + parenthetical note", True):
                        print("Case 3: number + parenthetical note  ", (speed, "RPM", note))
            continue

        # # Case 4: "X mg, Y mg and Z mg: N rpm"
        # multi_match = re.match(r"([\d, andmgMG ]+):\s*(\d+)\s*([A-Za-z/]+)?", part)
        # if multi_match:
        #     drugs = re.split(r",|and", multi_match.group(1))
        #     speed = int(multi_match.group(2))
        #     unit = multi_match.group(3).lower() if multi_match.group(3) else "rpm"
        #     unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
        #     for d in [d.strip() for d in drugs if d.strip()]:
        #         results.append((speed, unit, d))
        #     rpm_case_tracking_dict["Case 4: \"X mg, Y mg and Z mg: N rpm\""] += 1
        #     if rpm_case_printing_dict["Case 4: \"X mg, Y mg and Z mg: N rpm\""] == True:
        #         print(part)
        #         print("Case 4: \"X mg, Y mg and Z mg: N rpm\" ", (speed, unit, d))
        #     continue

        # Case 4: "<Note>: <number> [unit?]"
        colon_match = re.match(r"(.+?):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if colon_match:
            note = colon_match.group(1).strip()
            speed = int(colon_match.group(2))
            unit = colon_match.group(3).lower() if colon_match.group(3) else None
            raw_unit = colon_match.group(3)
            
            if raw_unit:
                mapped_unit = RPM_UNIT_MAP.get(raw_unit.lower())
                if mapped_unit:  
                    # Normal case: recognized unit
                    unit = mapped_unit
                else:
                    # Fallback: treat as note addition
                    unit = "RPM"
                    extra_note = raw_unit.capitalize()
                    note = f"{note} ({extra_note})"
            else:
                unit = "RPM"

            # Remove any additional colons from the note
            note = note.replace(":", "").strip()
            
            results.append((speed, unit, note))
            rpm_case_tracking_dict["Case 4: \"<Note>: <number> [unit?]\""] += 1
            if rpm_case_printing_dict["Case 4: \"<Note>: <number> [unit?]\""]:
                print("Case 4: \"<Note>: <number> [unit?]\"   ", (speed, unit, note))
            continue

        # Case 5: "X mg, Y mg and Z mg: N rpm"
        multi_match = re.match(r"^\s*([^\:]+):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if multi_match:
            note = multi_match.group(1).strip()  # everything before the ":"
            note = re.sub(r':$', '', note).strip()  # drop trailing colon if present
            
            speed = int(multi_match.group(2))
            unit = multi_match.group(3).lower() if multi_match.group(3) else "rpm"
            unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
        
            results.append((speed, unit, note))
        
            rpm_case_tracking_dict["Case 5: \"X mg, Y mg and Z mg: N rpm\""] += 1
            if rpm_case_printing_dict["Case 5: \"X mg, Y mg and Z mg: N rpm\""]:
                print(part)
                print("Case 5: \"X mg, Y mg and Z mg: N rpm\" ", (speed, unit, note))
            continue

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*(rpm|dpm|cycles per min|cycles/min)", part, flags=re.I)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     unit = RPM_UNIT_MAP.get(embedded.group(2).lower(), embedded.group(2).upper())
        #     # Remove speed/unit from note
        #     note = re.sub(r"\d+\s*" + embedded.group(2), "", part, flags=re.I).strip(",; ")
        #     results.append((speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"] == True:
        #         print("Case 6: number with unit embedded  * ", (speed, unit, note if note else None))
        #     continue

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*([A-Za-z/ ]+)", part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip()
        #     mapped_unit = RPM_UNIT_MAP.get(raw_unit.lower())
            
        #     if mapped_unit:
        #         unit = mapped_unit  # recognized unit
        #     else:
        #         unit = "RPM"  # fallback
        #         raw_unit_clean = raw_unit.capitalize()
        #         note = f"{raw_unit_clean}"  # append to note
             
        #     # Remove the number + unit from the original string to make the note
        #     note = re.sub(r"\d+\s*" + re.escape(raw_unit), "", part, flags=re.I).strip(",; ")
        #     if 'note' in locals():  # if we added raw_unit above, merge
        #         note = f"{note} ({raw_unit_clean})" if note else f"{raw_unit_clean}"
            
        #     # Remove any extra colons
        #     note = note.replace(":", "").strip()
            
        #     results.append((speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", (speed, unit, note if note else None))
        #     continue

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*([A-Za-z/ ()]+)", part)
        # raw_unit = embedded.group(2).strip().lower()
        # if any(raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]): # recognized unit

        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip().lower()
            
        #     if raw_unit in [u.lower() for units in RPM_UNIT_MAP.values() for u in units]:
        #         # recognized RPM unit → keep as speed
        #         # find the standard mapping
        #         for key, units in RPM_UNIT_MAP.items():
        #             if raw_unit in [u.lower() for u in units]:
        #                 unit = key
        #                 break
        #         # Remove the number + unit from the original string to create note
        #         note = re.sub(r"\d+\s*" + re.escape(embedded.group(2)), "", part, flags=re.I).strip(",; ")
        #         note = note.replace(":", "").strip()
        #         results.append((speed, unit, note if note else None))
        #     else:
        #         # unrecognized unit → treat entire string as note
        #         results.append((None, None, part.strip()))
            
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Case 6: number with unit embedded
        # # First, check if part contains multiple numbers with no unit (e.g., "25 and 50")
        # multi_numbers = re.findall(r'\b\d+\b', part)
        # recognized_units_in_part = [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]
        # if multi_numbers and not any(unit in part.lower() for unit in recognized_units_in_part):
        #     for num in multi_numbers:
        #         results.append((int(num), "RPM", None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-len(multi_numbers):])
        #     continue
        
        # # Now look for a number + recognized unit anywhere in the string
        # embedded = re.search(r'(\d+)\s*([A-Za-z/ ()]+)', part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip().lower()
        
        #     # Check if the raw unit is recognized
        #     # if any(raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]):
        #     if raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]:

        #         # Find the standardized unit
        #         for key, units in RPM_UNIT_MAP.items():
        #             if raw_unit in [unit.lower() for unit in units]:
        #                 unit = key
        #                 break
        #         # Remove the number + unit and optional leading 'at' for clean note
        #         note = re.sub(r'(\bat\s+)?\d+\s*' + re.escape(embedded.group(2)), '', part, flags=re.I).strip(",; ")
        #         note = note.replace(":", "").strip()
        #         results.append((speed, unit, note if note else None))
        #     else:
        #         # Unrecognized unit → treat entire string as note
        #         results.append((None, None, part.strip()))
        
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Case 6: number with unit embedded
        # # Handle hyphenated ranges first (e.g. "30-60 CPM")
        # range_match = re.search(r'(\d+)\s*-\s*(\d+)\s*([A-Za-z/ ()]+)', part, flags=re.I)
        # if range_match:
        #     low, high = int(range_match.group(1)), int(range_match.group(2))
        #     avg_speed = (low + high) // 2
        #     raw_unit = range_match.group(3).strip().lower()
        #     unit = normalize_unit(raw_unit)
        #     if unit in RPM_UNIT_MAP:  # only accept known units
        #         note = re.sub(r'\d+\s*-\s*\d+\s*' + re.escape(range_match.group(3)), '', part, flags=re.I).strip(",;: ")
        #         results.append((avg_speed, unit, note if note else None))
        #     else:
        #         results.append((None, None, part.strip()))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        #     note = re.sub(r'\d+\s*-\s*\d+\s*' + re.escape(range_match.group(3)), '', part, flags=re.I).strip(",;: ")
        #     results.append((avg_speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Handle multiple numbers without explicit unit (e.g. "25 and 50")
        # multi_numbers = re.findall(r'\b\d+\b', part)
        # recognized_units_in_part = [u.lower() for units in RPM_UNIT_MAP.values() for u in units]
        # if multi_numbers and not any(unit in part.lower() for unit in recognized_units_in_part):
        #     for num in multi_numbers:
        #         results.append((int(num), "RPM", None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         for result in results[-len(multi_numbers):]:
        #             print("Case 6: number with unit embedded  * ", result)
        #     continue
        
        # # Now look for number + recognized unit anywhere in the string
        # embedded = re.search(r'(\d+)\s*([A-Za-z/ ()]+)', part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip().lower()
        #     unit = normalize_unit(raw_unit)

        #     # If the unit wasn't recognized, treat the whole part as a note
        #     if unit == raw_unit:  # normalize_unit returned unchanged
        #         results.append((None, None, part.strip()))
        #     else:
        #         # Strip the matched number + unit (and optional 'at') to form note
        #         note = re.sub(r'(\bat\s+)?\d+\s*' + re.escape(embedded.group(2)), '', part, flags=re.I).strip(",;: ")
        #         results.append((speed, unit, note if note else None))

        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Case 6: number with unit embedded
        # # Handle hyphenated ranges first (e.g. "30-60 CPM")
        # range_match = re.search(r'(\d+)\s*-\s*(\d+)\s*([A-Za-z/ ()]+)', part, flags=re.I)
        # if range_match:
        #     low, high = int(range_match.group(1)), int(range_match.group(2))
        #     avg_speed = (low + high) // 2
        #     raw_unit = range_match.group(3).strip().lower()
        #     unit = normalize_unit(raw_unit)
        #     note = re.sub(r'\d+\s*-\s*\d+\s*' + re.escape(range_match.group(3)), '', part, flags=re.I).strip(",;: ")
        #     results.append((avg_speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue
        #     if unit in RPM_UNIT_MAP:  # only accept known units
        #         note = re.sub(r'\d+\s*-\s*\d+\s*' + re.escape(range_match.group(3)), '', part, flags=re.I).strip(",;: ")
        #         results.append((avg_speed, unit, note if note else None))
        #     else:
        #         results.append((None, None, part.strip()))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Handle multiple numbers without explicit unit (e.g. "25 and 50")
        # multi_numbers = re.findall(r'\b\d+\b', part)
        # recognized_units = [u for units in RPM_UNIT_MAP.values() for u in units]
        # if multi_numbers and not any(u.lower() in part.lower() for u in recognized_units):
        #     for num in multi_numbers:
        #         results.append((int(num), "RPM", None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         for r in results[-len(multi_numbers):]:
        #             print("Case 6: number with unit embedded  * ", r)
        #     continue

        # # Now look for number + recognized unit anywhere in the string
        # embedded = re.search(r'(\d+)\s*([A-Za-z/ ()]+)', part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip()
        #     unit = normalize_unit(raw_unit.lower())

        #     if unit in RPM_UNIT_MAP:  # only accept known units
        #         # Remove the matched number + raw_unit (and optional "at") from note
        #         note = re.sub(r'(\bat\s+)?' + str(speed) + r'\s*' + re.escape(raw_unit), '', part, flags=re.I).strip(",;: ")
        #         results.append((speed, unit, note if note else None))
        #     else:
        #         # treat as note if not recognized
        #         results.append((None, None, part.strip()))

        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Case 6: number with unit embedded
        # # Handle hyphenated ranges first (e.g., "30-60 CPM")
        # range_match = re.search(r'(\d+)\s*-\s*(\d+)\s*([A-Za-z/ ()]+)', part, flags=re.I)
        # if range_match:
        #     low, high = int(range_match.group(1)), int(range_match.group(2))
        #     avg_speed = (low + high) // 2
        #     raw_unit = range_match.group(3).strip().lower()
        #     unit = normalize_unit(raw_unit)
        #     if unit in RPM_UNIT_MAP:  # only accept known units
        #         note = re.sub(r'\d+\s*-\s*\d+\s*' + re.escape(range_match.group(3)), '', part, flags=re.I).strip(",;: ")
        #         results.append((avg_speed, unit, note if note else None))
        #     else:
        #         results.append((None, None, part.strip()))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # # Handle multiple numbers without explicit unit (e.g., "25 and 50")
        # multi_numbers = re.findall(r'\b\d+\b', part)
        # recognized_units = [unit for units in RPM_UNIT_MAP.values() for unit in units]
        # if multi_numbers and not any(unit.lower() in part.lower() for unit in recognized_units):
        #     for num in multi_numbers:
        #         results.append((int(num), "RPM", None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         for result in results[-len(multi_numbers):]:
        #             print("Case 6: number with unit embedded  * ", result)
        #     continue

        # # Now look for number + recognized unit anywhere in the string
        # embedded = re.search(r'(\d+)\s*([A-Za-z/ ()]+)', part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip()
        #     unit = normalize_unit(raw_unit.lower())

        #     if unit in RPM_UNIT_MAP:  # only accept known units
        #         # Remove the matched number + raw_unit (and optional "at") from note
        #         note = re.sub(r'(\bat\s+)?' + str(speed) + r'\s*' + re.escape(raw_unit), '', part, flags=re.I).strip(",;: ")
        #         results.append((speed, unit, note if note else None))
        #     else:
        #         # treat as note if not recognized
        #         results.append((None, None, part.strip()))

        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue

        # Case 6: number with recognized unit anywhere in the string
        # 6a) Hyphen/and/to ranges with a recognized unit → average
        m_range = re.search(rf'\b(\d+)\s*(?:-|and|to)\s*(\d+)\s*({UNIT_ALT})\b', part, flags=re.I)
        if m_range:
            a, b = int(m_range.group(1)), int(m_range.group(2))
            unit_text = m_range.group(3)
            unit = normalize_unit(unit_text)  # canonical like "RPM"/"DPM"/"CPM"
            avg = (a + b) // 2 if (a + b) % 2 == 0 else (a + b) / 2
        
            # remove the matched segment (optionally preceded by "at")
            note = re.sub(
                rf'(\bat\s+)?\b{a}\s*(?:-|and|to)\s*{b}\s*{re.escape(unit_text)}\b',
                '',
                part,
                flags=re.I,
                count=1
            ).strip(",;: ")
        
            # tidy note: collapse "at for" → "for" and extra spaces
            note = re.sub(r'\bat\s+for\b', 'for', note, flags=re.I)
            note = re.sub(r'\s{2,}', ' ', note).strip() or None
        
            results.append((int(avg), unit, note))
            rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case 6: number with unit embedded"]:
                print("Case 6: number with unit embedded  * ", results[-1])
            continue
        
        # 6b) Single number + recognized unit
        m_single = re.search(rf'\b(\d+)\s*({UNIT_ALT})\b', part, flags=re.I)
        if m_single:
            speed = int(m_single.group(1))
            unit_text = m_single.group(2)
            unit = normalize_unit(unit_text)
        
            note = re.sub(
                rf'(\bat\s+)?\b{speed}\s*{re.escape(unit_text)}\b',
                '',
                part,
                flags=re.I,
                count=1
            ).strip(",;: ")
        
            note = re.sub(r'\bat\s+for\b', 'for', note, flags=re.I)
            note = re.sub(r'\s{2,}', ' ', note).strip() or None
        
            results.append((speed, unit, note))
            rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case 6: number with unit embedded"]:
                print("Case 6: number with unit embedded  * ", results[-1])
            continue

        # # Case 7: number + unit
        # match = re.match(r"(\d+)\s*([A-Za-z/ ]+)", part)
        # if match and match.group(2) != '':
        #     speed = int(match.group(1))
        #     unit = match.group(2).strip().replace("per minute", "/min")
        #     results.append((speed, unit, None))
        #     rpm_case_tracking_dict["Case 7: number + unit"] += 1
        #     if rpm_case_printing_dict["Case 7: number + unit"] == True:
        #         print("Case 7: number + unit     * * *      ", (speed, unit, None))
        #     if unit == "":
        #         print("Problematic row:")
        #         display(df[df["Speed (RPMs) (Raw)"] == entry].head())
        #     continue

        # Case 7: number + unit
        match = re.match(r"^(\d+)\s*([A-Za-z/ ]+)$", part)
        if match:
            speed = int(match.group(1))
            unit_raw = match.group(2).strip().lower()
        
            # Only accept units we know
            if unit_raw in RPM_UNIT_MAP:
                unit_norm = normalize_unit(unit_raw)
                # unit = RPM_UNIT_MAP[unit_raw]
                results.append((speed, unit_norm, None))
                rpm_case_tracking_dict["Case 7: number + unit"] += 1
                if rpm_case_printing_dict["Case 7: number + unit"]:
                    print("Case 7: number + unit     * * *      ", (speed, unit_norm, None))
            else:
                # Otherwise treat the whole string as a note
                results.append((None, None, part))
                rpm_case_tracking_dict["Fallback: keep as note"] += 1
                if rpm_case_printing_dict["Fallback: keep as note"]:
                    print("Fallback: keep as note               ", (None, None, part))
            continue

        # Case 8: multiple numbers with "and" or "-" but no unit
        m_multi = re.match(r'^\s*(\d+)\s*(?:and|-|to)\s*(\d+)\s*$', part, flags=re.I)
        if m_multi:
            a, b = int(m_multi.group(1)), int(m_multi.group(2))
            # emit two rows, one for each number
            results.append((a, "RPM", None))
            results.append((b, "RPM", None))
            rpm_case_tracking_dict["Case 8: multiple numbers without unit"] += 1
            if rpm_case_printing_dict["Case 8: multiple numbers without unit"]:
                print("Case 8: multiple numbers without unit", (a, "RPM", None))
                print("Case 8: multiple numbers without unit", (b, "RPM", None))
            continue

        # Fallback: keep as note
        results.append((None, None, part))
        rpm_case_tracking_dict["Fallback: keep as note"] += 1
        if rpm_case_printing_dict["Fallback: keep as note"] == True:
            print("Fallback: keep as note               ", (None, None, part))
            
    return results

# Apply and expand, safely overwriting existing columns
expanded_rows = df["Speed (RPMs) (Raw)"].apply(clean_speed_entry)
df_cleaned = pd.DataFrame(
    [t for sublist in expanded_rows for t in sublist],
    columns=["Speed (RPMs) (Clean)", "Unit", "Notes"]
)

# Overwrite the old columns safely
df[["Speed (RPMs) (Clean)", "Unit", "Notes"]] = df_cleaned

pd.DataFrame(list(rpm_case_tracking_dict.items()), columns=["Case", "Count"])

# df[df["Speed (RPMs) (Raw)"] == "50 RPM (12.5 mg, 25 mg and 100 mg); 75 RPM (150 mg and 200 mg)"]
# df[-200:]

Case 4: "<Note>: <number> [unit?]"    (50, 'RPM', 'Clarithromycin and Vonoprazan')
Case 4: "<Note>: <number> [unit?]"    (50, 'RPM', 'Vonoprazan')
Fallback: keep as note                (None, None, '4 mL/min')
Fallback: keep as note                (None, None, 'Flow @ 8 mL/min')
Case 4: "<Note>: <number> [unit?]"    (100, 'RPM', 'Met')
Case 4: "<Note>: <number> [unit?]"    (75, 'RPM', 'Canagliflozin')
Case 4: "<Note>: <number> [unit?]"    (50, 'RPM', 'Metformin')
Case 4: "<Note>: <number> [unit?]"    (50, 'RPM', 'Carbidopa and Levodopa')
Case 4: "<Note>: <number> [unit?]"    (125, 'RPM', 'Entacapone')
Case 4: "<Note>: <number> [unit?]"    (50, 'RPM', '50 mg, 100 mg and 200 mg (Rpm)')
Case 4: "<Note>: <number> [unit?]"    (75, 'RPM', '400 mg (Rpm)')
Case 6: number with unit embedded  *  (27, 'DPM', None)
Fallback: keep as note                (None, None, 'Flow @ 6 mL/min')
Case 6: number with unit embedded  *  (50, 'RPM', '(12.5 mg, 25 mg and 100 mg)')
Case 6: number with unit embedded 

Unnamed: 0,Case,Count
0,Case 1: empty entry,768
1,Case 2: pure number,756
2,Case 3: number + parenthetical note,5
3,"Case 4: ""<Note>: <number> [unit?]""",36
4,"Case 5: ""X mg, Y mg and Z mg: N rpm""",0
5,Case 6: number with unit embedded,17
6,Case 7: number + unit,0
7,Case 8: multiple numbers without unit,1
8,Fallback: keep as note,4


In [20]:
# Define globally
rpm_case_tracking_dict = {
    "Case 1: empty entry": 0,
    "Case 2: pure number": 0,
    "Case 3: number + parenthetical note": 0,
    "Case 4: \"X mg, Y mg and Z mg: N rpm\"": 0,
    "Case 5: \"<Note>: <number> [unit?]\"": 0,
    "Case 6: number with unit embedded": 0,
    "Case 7: number + unit": 0,
    "Fallback: keep as note": 0,
}

# Setting for which cases to print (for debugging)
rpm_case_printing_dict = {
    "Case 1: empty entry": False,
    "Case 2: pure number": False,
    "Case 3: number + parenthetical note": False,
    "Case 4: \"X mg, Y mg and Z mg: N rpm\"": True,
    "Case 5: \"<Note>: <number> [unit?]\"": False,
    "Case 6: number with unit embedded": True,
    "Case 7: number + unit": True,
    "Fallback: keep as note": True,
}

def normalize_unit(raw_unit: str) -> str:
    raw_unit = raw_unit.strip().lower()
    for canonical, variations in RPM_UNIT_MAP.items():
        for variation in variations:
            if raw_unit == variation.lower():
                return canonical
    return raw_unit  # fallback if no match

def clean_speed_entry(entry):
    """
    Returns a list of tuples: (Speed, Unit, Notes)
    Handles multiple numbers per entry and colon/semicolon-separated cases.
    """
    # Case 1: empty entry
    if pd.isna(entry):
        rpm_case_tracking_dict["Case 1: empty entry"] += 1
        if rpm_case_printing_dict["Case 1: empty entry"] == True:
            print("Case 1: empty entry                  ", (None, None, None))
        return [(None, None, None)]

    # Replacing any Unicode non-breaking space characters ("\xa0") with spaces
    # val = str(entry).replace("\xa0", " ").strip()
    
    # Replace non-breaking spaces and collapse all whitespace in one go
    val = re.sub(r'\s+', ' ', str(entry).replace("\xa0", " ")).strip()

    results = []

    # Split by semicolons for multiple measurements
    parts = [p.strip() for p in val.split(";")]

    for part in parts:
        # Case 2: pure number
        if part.isdigit():
            results.append((int(part), "RPM", None))
            rpm_case_tracking_dict["Case 2: pure number"] += 1
            if rpm_case_printing_dict["Case 2: pure number"] == True:
                print("Case 2: pure number                  ", (int(part), "RPM", None))
            continue

        # Case 3: number + parenthetical note
        if re.search(r"\d+\s*\([^)]*\)", part):
            # Split on "&" inside this case, because it can contain multiple entries
            subparts = [subpart.strip() for subpart in re.split(r"&", part)]
            for subpart in subparts:
                match = re.match(r"(\d+)\s*\(([^)]*)\)", subpart)
                if match:
                    speed = int(match.group(1))
                    note = match.group(2).strip()
                    results.append((speed, "RPM", note))
                    rpm_case_tracking_dict["Case 3: number + parenthetical note"] += 1
                    if rpm_case_printing_dict.get("Case 3: number + parenthetical note", True):
                        print("Case 3: number + parenthetical note  ", (speed, "RPM", note))
            continue

        # Case 4: "X mg, Y mg and Z mg: N rpm"
        multi_match = re.match(r"([\d, andmgMG ]+):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if multi_match:
            drugs = re.split(r",|and", multi_match.group(1))
            speed = int(multi_match.group(2))
            unit = multi_match.group(3).lower() if multi_match.group(3) else "rpm"
            unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
            for d in [d.strip() for d in drugs if d.strip()]:
                results.append((speed, unit, d))
            rpm_case_tracking_dict["Case 4: \"X mg, Y mg and Z mg: N rpm\""] += 1
            if rpm_case_printing_dict["Case 4: \"X mg, Y mg and Z mg: N rpm\""] == True:
                print("Case 4: \"X mg, Y mg and Z mg: N rpm\" ", (speed, unit, d))
            continue

        # Case 5: "<Note>: <number> [unit?]"
        colon_match = re.match(r"(.+?):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if colon_match:
            note = colon_match.group(1).strip()
            speed = int(colon_match.group(2))
            unit = colon_match.group(3).lower() if colon_match.group(3) else None
            raw_unit = colon_match.group(3)
            
            if raw_unit:
                mapped_unit = RPM_UNIT_MAP.get(raw_unit.lower())
                if mapped_unit:  
                    # Normal case: recognized unit
                    unit = mapped_unit
                else:
                    # Fallback: treat as note addition
                    unit = "RPM"
                    extra_note = raw_unit.capitalize()
                    note = f"{note} ({extra_note})"
            else:
                unit = "RPM"

            # Remove any additional colons from the note
            note = note.replace(":", "").strip()
            
            results.append((speed, unit, note))
            rpm_case_tracking_dict["Case 5: \"<Note>: <number> [unit?]\""] += 1
            if rpm_case_printing_dict["Case 5: \"<Note>: <number> [unit?]\""]:
                print("Case 5: \"<Note>: <number> [unit?]\"   ", (speed, unit, note))
            continue            

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*(rpm|dpm|cycles per min|cycles/min)", part, flags=re.I)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     unit = RPM_UNIT_MAP.get(embedded.group(2).lower(), embedded.group(2).upper())
        #     # Remove speed/unit from note
        #     note = re.sub(r"\d+\s*" + embedded.group(2), "", part, flags=re.I).strip(",; ")
        #     results.append((speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"] == True:
        #         print("Case 6: number with unit embedded  * ", (speed, unit, note if note else None))
        #     continue

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*([A-Za-z/ ]+)", part)
        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip()
        #     mapped_unit = RPM_UNIT_MAP.get(raw_unit.lower())
            
        #     if mapped_unit:
        #         unit = mapped_unit  # recognized unit
        #     else:
        #         unit = "RPM"  # fallback
        #         raw_unit_clean = raw_unit.capitalize()
        #         note = f"{raw_unit_clean}"  # append to note
             
        #     # Remove the number + unit from the original string to make the note
        #     note = re.sub(r"\d+\s*" + re.escape(raw_unit), "", part, flags=re.I).strip(",; ")
        #     if 'note' in locals():  # if we added raw_unit above, merge
        #         note = f"{note} ({raw_unit_clean})" if note else f"{raw_unit_clean}"
            
        #     # Remove any extra colons
        #     note = note.replace(":", "").strip()
            
        #     results.append((speed, unit, note if note else None))
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", (speed, unit, note if note else None))
        #     continue

        # # Case 6: number with unit embedded
        # embedded = re.search(r"(\d+)\s*([A-Za-z/ ()]+)", part)
        # raw_unit = embedded.group(2).strip().lower()
        # if any(raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]): # recognized unit

        # if embedded:
        #     speed = int(embedded.group(1))
        #     raw_unit = embedded.group(2).strip().lower()
            
        #     if raw_unit in [u.lower() for units in RPM_UNIT_MAP.values() for u in units]:
        #         # recognized RPM unit → keep as speed
        #         # find the standard mapping
        #         for key, units in RPM_UNIT_MAP.items():
        #             if raw_unit in [u.lower() for u in units]:
        #                 unit = key
        #                 break
        #         # Remove the number + unit from the original string to create note
        #         note = re.sub(r"\d+\s*" + re.escape(embedded.group(2)), "", part, flags=re.I).strip(",; ")
        #         note = note.replace(":", "").strip()
        #         results.append((speed, unit, note if note else None))
        #     else:
        #         # unrecognized unit → treat entire string as note
        #         results.append((None, None, part.strip()))
            
        #     rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
        #     if rpm_case_printing_dict["Case 6: number with unit embedded"]:
        #         print("Case 6: number with unit embedded  * ", results[-1])
        #     continue



        # Case 6: number with unit embedded
        # First, check if part contains multiple numbers with no unit (e.g., "25 and 50")
        multi_numbers = re.findall(r'\b\d+\b', part)
        recognized_units_in_part = [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]
        if multi_numbers and not any(unit in part.lower() for unit in recognized_units_in_part):
            for num in multi_numbers:
                results.append((int(num), "RPM", None))
            rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case 6: number with unit embedded"]:
                print("Case 6: number with unit embedded  * ", results[-len(multi_numbers):])
            continue
        
        # Now look for a number + recognized unit anywhere in the string
        embedded = re.search(r'(\d+)\s*([A-Za-z/ ()]+)', part)
        if embedded:
            speed = int(embedded.group(1))
            raw_unit = embedded.group(2).strip().lower()
        
            # Check if the raw unit is recognized
            # if any(raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]):
            if raw_unit in [unit.lower() for units in RPM_UNIT_MAP.values() for unit in units]:

                # Find the standardized unit
                for key, units in RPM_UNIT_MAP.items():
                    if raw_unit in [unit.lower() for unit in units]:
                        unit = key
                        break
                # Remove the number + unit and optional leading 'at' for clean note
                note = re.sub(r'(\bat\s+)?\d+\s*' + re.escape(embedded.group(2)), '', part, flags=re.I).strip(",; ")
                note = note.replace(":", "").strip()
                results.append((speed, unit, note if note else None))
            else:
                # Unrecognized unit → treat entire string as note
                results.append((None, None, part.strip()))
        
            rpm_case_tracking_dict["Case 6: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case 6: number with unit embedded"]:
                print("Case 6: number with unit embedded  * ", results[-1])
            continue

        # # Case 7: number + unit
        # match = re.match(r"(\d+)\s*([A-Za-z/ ]+)", part)
        # if match and match.group(2) != '':
        #     speed = int(match.group(1))
        #     unit = match.group(2).strip().replace("per minute", "/min")
        #     results.append((speed, unit, None))
        #     rpm_case_tracking_dict["Case 7: number + unit"] += 1
        #     if rpm_case_printing_dict["Case 7: number + unit"] == True:
        #         print("Case 7: number + unit     * * *      ", (speed, unit, None))
        #     if unit == "":
        #         print("Problematic row:")
        #         display(df[df["Speed (RPMs) (Raw)"] == entry].head())
        #     continue

        # Case 7: number + unit
        match = re.match(r"^(\d+)\s*([A-Za-z/ ]+)$", part)
        if match:
            speed = int(match.group(1))
            unit_raw = match.group(2).strip().lower()
        
            # Only accept units we know
            if unit_raw in RPM_UNIT_MAP:
                unit_norm = normalize_unit(unit_raw)
                # unit = RPM_UNIT_MAP[unit_raw]
                results.append((speed, unit_norm, None))
                rpm_case_tracking_dict["Case 7: number + unit"] += 1
                if rpm_case_printing_dict["Case 7: number + unit"]:
                    print("Case 7: number + unit     * * *      ", (speed, unit_norm, None))
            else:
                # Otherwise treat the whole string as a note
                results.append((None, None, part))
                rpm_case_tracking_dict["Fallback: keep as note"] += 1
                if rpm_case_printing_dict["Fallback: keep as note"]:
                    print("Fallback: keep as note               ", (None, None, part))
            continue

        # Fallback: keep as note
        results.append((None, None, part))
        rpm_case_tracking_dict["Fallback: keep as note"] += 1
        if rpm_case_printing_dict["Fallback: keep as note"] == True:
            print("Fallback: keep as note               ", (None, None, part))
            
    return results

# Apply and expand, safely overwriting existing columns
expanded_rows = df["Speed (RPMs) (Raw)"].apply(clean_speed_entry)
df_cleaned = pd.DataFrame(
    [t for sublist in expanded_rows for t in sublist],
    columns=["Speed (RPMs) (Clean)", "Unit", "Notes"]
)

# Overwrite the old columns safely
df[["Speed (RPMs) (Clean)", "Unit", "Notes"]] = df_cleaned

pd.DataFrame(list(rpm_case_tracking_dict.items()), columns=["Case", "Count"])

# df[df["Speed (RPMs) (Raw)"] == "50 RPM (12.5 mg, 25 mg and 100 mg); 75 RPM (150 mg and 200 mg)"]
# df[-200:]

Case 6: number with unit embedded  *  [(4, 'RPM', None)]
Case 6: number with unit embedded  *  [(8, 'RPM', None)]
Case 4: "X mg, Y mg and Z mg: N rpm"  (50, 'RPM', '200 mg')
Case 4: "X mg, Y mg and Z mg: N rpm"  (75, 'RPM', '400 mg')
Case 6: number with unit embedded  *  (27, 'DPM', None)
Case 6: number with unit embedded  *  [(6, 'RPM', None)]
Case 6: number with unit embedded  *  (None, None, '50 RPM (12.5 mg, 25 mg and 100 mg)')
Case 6: number with unit embedded  *  (None, None, '75 RPM (150 mg and 200 mg)')
Case 6: number with unit embedded  *  (None, None, '75 rpm (for 75 mg strength)')
Case 6: number with unit embedded  *  (None, None, '150 rpm (for 300 mg strength)')
Case 6: number with unit embedded  *  (25, 'DPM', None)
Case 6: number with unit embedded  *  (30, 'CPM', None)
Case 6: number with unit embedded  *  (12, 'DPM', None)
Case 6: number with unit embedded  *  (30, 'DPM', None)
Case 6: number with unit embedded  *  [(25, 'RPM', None), (50, 'RPM', None)]
Case 6: number w

Unnamed: 0,Case,Count
0,Case 1: empty entry,768
1,Case 2: pure number,756
2,Case 3: number + parenthetical note,5
3,"Case 4: ""X mg, Y mg and Z mg: N rpm""",2
4,"Case 5: ""<Note>: <number> [unit?]""",34
5,Case 6: number with unit embedded,22
6,Case 7: number + unit,0
7,Fallback: keep as note,0


Running functions to clean up dataset.

In [14]:
test = "75 "

# Define globally
rpm_case_tracking_dict = {
    "Case: empty entry": 0,
    "Case: pure number": 0,
    "Case: number + unit": 0,
    "Case: number with unit embedded": 0,
    "Case: \"X mg, Y mg and Z mg: N rpm\"": 0,
    "Case: \"<Note>: <number> [unit?]\"": 0,
    "Fallback: keep as note": 0,
}

# Setting for which cases to print
rpm_case_printing_dict = {
    "Case: empty entry": False,
    "Case: pure number": False,
    "Case: number + unit": True,
    "Case: number with unit embedded": True,
    "Case: \"X mg, Y mg and Z mg: N rpm\"": True,
    "Case: \"<Note>: <number> [unit?]\"": True,
    "Fallback: keep as note": True,
}

def clean_speed_entry(entry):
    """
    Returns a list of tuples: (Speed, Unit, Notes)
    Handles multiple numbers per entry and colon/semicolon-separated cases.
    """
    if pd.isna(entry):
        rpm_case_tracking_dict["Case: empty entry"] += 1
        if rpm_case_printing_dict["Case: empty entry"] == True:
            print("Case: empty entry                    ", (None, None, None))
        return [(None, None, None)]

    # Replacing any Unicode non-breaking space characters ("\xa0") with spaces
    # val = str(entry).replace("\xa0", " ").strip()
    
    # Replace non-breaking spaces and collapse all whitespace in one go
    val = re.sub(r'\s+', ' ', str(entry).replace("\xa0", " ")).strip()

    results = []

    # Split by semicolons for multiple measurements
    parts = [p.strip() for p in val.split(";")]

    for part in parts:
        # Case: pure number
        if part.isdigit():
            results.append((int(part), "RPM", None))
            rpm_case_tracking_dict["Case: pure number"] += 1
            if rpm_case_printing_dict["Case: pure number"] == True:
                print("Case: pure number                    ", (int(part), "RPM", None))
            continue

        # Case: number + unit
        match = re.match(r"(\d+)\s*([A-Za-z/ ]+)", part)
        if match and match.group(2) != '':
            speed = int(match.group(1))
            unit = match.group(2).strip().replace("per minute", "/min")
            results.append((speed, unit, None))
            rpm_case_tracking_dict["Case: number + unit"] += 1
            if rpm_case_printing_dict["Case: number + unit"] == True:
                print("Case: number + unit                  ", (speed, unit, None))
            continue

        # Case: number with unit embedded
        embedded = re.search(r"(\d+)\s*(rpm|dpm|cycles per min|cycles/min)", part, flags=re.I)
        if embedded:
            speed = int(embedded.group(1))
            unit = RPM_UNIT_MAP.get(embedded.group(2).lower(), embedded.group(2).upper())
            # Remove speed/unit from note
            note = re.sub(r"\d+\s*" + embedded.group(2), "", part, flags=re.I).strip(",; ")
            results.append((speed, unit, note if note else None))
            rpm_case_tracking_dict["Case: number with unit embedded"] += 1
            if rpm_case_printing_dict["Case: number with unit embedded"] == True:
                print("Case: number with unit embedded")
            continue
            
        # Case: "X mg, Y mg and Z mg: N rpm"
        multi_match = re.match(r"([\d, andmgMG ]+):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if multi_match:
            drugs = re.split(r",|and", multi_match.group(1))
            speed = int(multi_match.group(2))
            unit = multi_match.group(3).lower() if multi_match.group(3) else "rpm"
            unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
            for d in [d.strip() for d in drugs if d.strip()]:
                results.append((speed, unit, d))
            rpm_case_tracking_dict["Case: \"X mg, Y mg and Z mg: N rpm\""] += 1
            if rpm_case_printing_dict["Case: \"X mg, Y mg and Z mg: N rpm\""] == True:
                print("Case: \"X mg, Y mg and Z mg: N rpm\"")
            continue

        # Case: "<Note>: <number> [unit?]"
        colon_match = re.match(r"(.+?):\s*(\d+)\s*([A-Za-z/]+)?", part)
        if colon_match:
            note = colon_match.group(1).strip()
            speed = int(colon_match.group(2))
            unit = colon_match.group(3).lower() if colon_match.group(3) else None
            if unit:
                unit = RPM_UNIT_MAP.get(unit.lower(), unit.upper())
            results.append((speed, unit, note))
            rpm_case_tracking_dict["Case: \"<Note>: <number> [unit?]\""] += 1
            if rpm_case_printing_dict["Case: \"<Note>: <number> [unit?]\""] == True:
                print("Case: \"<Note>: <number> [unit?]\"")
            continue

        # Fallback: keep as note
        results.append((None, None, part))
        rpm_case_tracking_dict["Fallback: keep as note"] += 1
        if rpm_case_printing_dict["Fallback: keep as note"] == True:
            print("Fallback: keep as note               ", (None, None, part))
            
    return results

# Apply and expand, safely overwriting existing columns
expanded_test = clean_speed_entry(test)
expanded_test

[(75, 'RPM', None)]

In [15]:
# Apply and expand "Dosage Form" data into two new columns
df[["Dosage Form (Clean)", "Release Type"]] = df["Dosage Form"].apply(lambda x: pd.Series(clean_dosage_form(x)))
expanded_test
# # Check results
# print(df['DosageForm_clean'].value_counts())
# print(df['ReleaseType'].value_counts())


# # Example transformations
# # Performing dataset cleanup as needed before one-hot encoding
# df['num_sampling_times'] = df['SamplingTimes'].apply(lambda x: len(str(x).split(',')))
# df['max_sampling_time'] = df['SamplingTimes'].apply(lambda x: max([int(t) for t in str(x).split(',')]))

# # One-hot encode categorical variables
# df_encoded = pd.get_dummies(df[['Apparatus', 'MediumType', 'DosageForm']])

# # Combine numeric + encoded
# features = pd.concat([df[['RPM', 'MediumVolume', 'num_sampling_times', 'max_sampling_time']], df_encoded], axis=1)


# # Drop non-numeric columns if needed after feature engineering is complete (sanity check)
# df = df.select_dtypes(include=[np.number]).dropna()

# features.head()

[(75, 'RPM', None)]

In [16]:
df[df.columns[-2]].value_counts()

Dosage Form (Clean)
Tablet                                                971
Capsule                                               335
Suspension                                             67
Oral Suspension                                        17
Film, Transdermal                                      14
For Suspension                                         11
Ophthalmic Suspension                                   9
Film                                                    8
Transdermal                                             8
Injectable Suspension                                   8
Injectable                                              7
For Oral Suspension                                     6
Granule                                                 6
Implant                                                 6
Suppository                                             5
Gel                                                     4
Vaginal Insert                                      

In [17]:
df[df.columns[-1]].value_counts()

Release Type
Extended Release                           209
Delayed Release                             51
Chewable                                    26
Orally Disintegrating                       24
Sublingual                                  10
Copackage                                    8
Buccal                                       7
Soft-Gelatin                                 5
Orally Disintegrating (ODT)                  4
Vaginal                                      3
For Suspension                               2
Soft-Gelatin/Liquid Fill                     2
Delayed Release Pellets                      2
Delayed Release, Orally Disintegrating       2
Sprinkle                                     2
Liposomal                                    2
Extended Release, Orally Disintegrating      2
Effervescent                                 2
Dental                                       1
Chewable dispersible                         1
Pediatric                                    1


#### 3.2. Preprocessing

In this case, all of the columns except the target column are going to be used to predict the target column.

In other words, using a patient's medical and demographic data to predict whether or not they have heart disease.

#### 3.3. Feature Engineering

In this case, a selection of synthetic dissolution profiles will be simulated from the FDA metadata (dosage form, apparatus type, agitation speed (rpm), medium composition, medium volume, and sampling times). For instance, a first-order or Weibull kinetic model may be used to generate curves plotting the percentage of the drug dissolved over time. Parameters can parhaps be linked to method settings (e.g., higher RPM values could lead to faster rate constants).

After this, it will be possible to use GMMs within a Machine Learning workflow and a patient's medical and demographic data to predict whether or not they have heart disease.

In [18]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

NameError: name 'heart_disease' is not defined

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

1. Problem defintion
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experiments