# Dissolution Stability Machine Learning Project Using a Gaussian Mixture Model

## Project Overview – Analyzing Dissolution Test Data with Gaussian Mixture Models

**Goal:**  
Use FDA Dissolution Methods metadata + synthetic dissolution profiles to identify clusters of dissolution behaviors ("fast," "medium," "slow").  

**Steps:**  
1. Feature Engineering from FDA database  
2. Synthetic Profile Generation using kinetic models  
3. GMM clustering and visualization  
4. Interpretation & discussion of pharma relevance

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experiments
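
As a rough illustration of step 3 under **Steps**, the sketch below fits Gaussian mixtures with one to four components to two synthetic groups of percent-dissolved values and keeps the component count with the lowest BIC. The group means (90 and 60), the spread, and the use of BIC for model selection are assumptions made for illustration, not results from the FDA data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical data: % dissolved at three time points for two behavior groups
rng = np.random.default_rng(42)
fast = rng.normal(loc=90, scale=5, size=(100, 3))
slow = rng.normal(loc=60, scale=5, size=(100, 3))
X = StandardScaler().fit_transform(np.vstack([fast, slow]))

# Fit GMMs with 1 to 4 components and record the BIC of each
bic_by_k = {}
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_by_k[k] = gmm.bic(X)

# Lower BIC is better; refit with the selected number of components
best_k = min(bic_by_k, key=bic_by_k.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)
```

With real method metadata the clusters are unlikely to separate this cleanly, so the BIC curve and the cluster assignments should be inspected rather than trusted blindly.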

### 1. Notebook Setup

Getting all the tools ready in the project's virtual environment.

In [1]:
# Setup matplotlib to plot inline (within the notebook)
%matplotlib inline

# Core libraries
import pandas as pd
import numpy as np
import re

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Modeling
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Utility
from tqdm import tqdm

### 2. Data Acquisition

The FDA Dissolution Methods Database contains:
- Dosage form
- Apparatus type
- Agitation speed (RPM)
- Medium type & volume
- Sampling times

These features can be parsed into a structured table for feature engineering.
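
Fields such as speed and volume are stored as free text, so they need explicit conversion before they can serve as numerical features. A minimal sketch, assuming the first number in each string is the value of interest. The helper name `first_number` and the demo rows are hypothetical, and two-stage entries (an acid stage followed by a buffer stage) would need a more careful rule than "take the first number."

```python
import re

import numpy as np
import pandas as pd


def first_number(value):
    """Return the first number found in a free-text field, or NaN if none."""
    if pd.isna(value):
        return np.nan
    match = re.search(r"\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else np.nan


# Hypothetical rows mimicking the raw text columns
raw = pd.DataFrame({
    "Speed (RPMs)": ["85.0", "50", None],
    "Volume (mL)": ["900", "Acid Stage: 900 mL; Buffer Stage: 1000 mL", None],
})

# Apply the parser to every cell of every column
parsed = raw.apply(lambda col: col.map(first_number))
```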

Loading the dataset (if dataset is real and available).

In [2]:
# Import dataset from CSV file or URL
df = pd.read_csv("Dissolution Methods.csv")

# Quick check to view the data
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1568 entries, 0 to 1567
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Drug Name                             1568 non-null   object
 1   Dosage Form                           1568 non-null   object
 2   USP Apparatus                         805 non-null    object
 3   Speed (RPMs)                          800 non-null    object
 4   Medium                                1568 non-null   object
 5   Volume (mL)                           792 non-null    object
 6   Recommended Sampling Times (minutes)  803 non-null    object
 7   Date Updated                          1568 non-null   object
dtypes: object(8)
memory usage: 98.1+ KB


| | Drug Name | Dosage Form | USP Apparatus | Speed (RPMs) | Medium | Volume (mL) | Recommended Sampling Times (minutes) | Date Updated |
|---|---|---|---|---|---|---|---|---|
| 0 | Abacavir Sulfate | Tablet | | | Refer to FDA's Dissolution Guidance, 2018 | | | 07/02/2020 |
| 1 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet | II (Paddle) | 85.0 | 0.01 M Phosphate Buffer with 0.5% sodium dodec... | 900 | Abacavir and lamivudine: 10, 15, 20, 30 and 45... | 05/28/2015 |
| 2 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet (For Suspension) | II (Paddle) | 50.0 | 0.01 M Phosphate Buffer with 0.5 mM EDTA, pH 6.8 | 500 | 5, 10, 15, 30, 45 and 60 | 10/06/2023 |
| 3 | Abacavir Sulfate/Lamivudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | 900 | 10, 20, 30, and 45 | 01/03/2007 |
| 4 | Abacavir Sulfate/Lamivudine/Zidovudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | Acid Stage: 900 mL; Buffer Stage: 1000 mL | 5, 10, 15, 30 and 45 | 01/03/2007 |


Simulating the dataset (if the dataset is not available). (OPTIONAL)

In [3]:
# # Simulating dissolution test metrics for 200 batches
# np.random.seed(42)
# fast_group = np.random.normal(
#     loc=90, scale=5, size=(100, 3))  # high % dissolved
# slow_group = np.random.normal(
#     loc=60, scale=5, size=(100, 3))  # lower % dissolved
# synthetic_data = np.vstack([fast_group, slow_group])

# df = pd.DataFrame(synthetic_data, columns=["% Dissolved @ 5min", "% Dissolved @ 10min", "% Dissolved @ 15min"])
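
An alternative sketch for synthetic data generates whole profiles from a kinetic model instead of independent draws per time point. It assumes a first-order release model, `Q(t) = 100 * (1 - exp(-k * t))`, with hypothetical rate constants for "fast," "medium," and "slow" behavior, lognormal batch-to-batch variation, and Gaussian measurement noise. None of these values come from the FDA data.

```python
import numpy as np
import pandas as pd


def first_order_profile(k, times):
    """Percent dissolved at each time under first-order release with rate k (1/min)."""
    return 100.0 * (1.0 - np.exp(-k * np.asarray(times, dtype=float)))


rng = np.random.default_rng(42)
times = [5, 10, 15, 30, 45]  # sampling times in minutes
rate_constants = {"fast": 0.20, "medium": 0.08, "slow": 0.03}  # hypothetical, 1/min

rows = []
for group, k in rate_constants.items():
    for _ in range(50):
        # Perturb the rate constant to mimic batch-to-batch variation
        k_batch = k * rng.lognormal(mean=0.0, sigma=0.1)
        # Add measurement noise, then keep values within 0-100 %
        noisy = first_order_profile(k_batch, times) + rng.normal(0, 2, len(times))
        rows.append([group, *np.clip(noisy, 0, 100)])

columns = ["group"] + [f"% Dissolved @ {t}min" for t in times]
profiles = pd.DataFrame(rows, columns=columns)
```

The `group` column is kept only to check later whether the clustering recovers the simulated behavior; it would be dropped before fitting the model.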

### 3. Data Cleaning, Preprocessing, and Feature Engineering

Prepare the dataset for use within the model. Convert the FDA method metadata into ML-ready features:

- Apparatus → categorical → one-hot encode  
- Medium type → categorical → one-hot encode  
- RPM → numerical  
- Medium volume → numerical  
- Sampling times → numerical summary features (e.g., # of samples, max time)  
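
The conversions listed above can be sketched on a few made-up rows. This is an illustration under assumptions: the helper `sampling_time_features` is hypothetical, every number in the sampling-times text is treated as a time point in minutes, and entries that mix several analytes or use other units would need extra handling.

```python
import re

import pandas as pd


def sampling_time_features(text):
    """Return (number of time points, latest time point) parsed from free text."""
    if pd.isna(text):
        return (0, float("nan"))
    times = [float(t) for t in re.findall(r"\d+(?:\.\d+)?", str(text))]
    return (len(times), max(times)) if times else (0, float("nan"))


# Hypothetical rows in the style of the FDA table
demo = pd.DataFrame({
    "USP Apparatus": ["II (Paddle)", "I (Basket)", "II (Paddle)"],
    "Recommended Sampling Times (minutes)": [
        "10, 20, 30, and 45",
        "5, 10, 15, 30, 45 and 60",
        None,
    ],
})

# Numerical summary features from the sampling-times text
parsed = demo["Recommended Sampling Times (minutes)"].map(sampling_time_features)
demo["num_sampling_times"] = [n for n, _ in parsed]
demo["max_sampling_time"] = [m for _, m in parsed]

# One-hot encode the categorical apparatus column
encoded = pd.get_dummies(demo["USP Apparatus"], prefix="apparatus")
features = pd.concat(
    [demo[["num_sampling_times", "max_sampling_time"]], encoded], axis=1
)
```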

#### 3.1. Data Cleaning

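Inspecting the raw columns and normalizing inconsistent labels, starting with `Dosage Form`, so that the features are clean before encoding and scaling.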

In [4]:
df[df.columns[1]].value_counts()

Dosage Form
Tablet                          642
Capsule                         239
Tablet                           99
Tablet (Extended Release)        68
Suspension                       55
                               ... 
Implant (Intravitreal)            1
Powder for Injection              1
Tablet (for Oral Suspension)      1
Gel (Topical)                     1
Tablet, for Suspension            1
Name: count, Length: 176, dtype: int64
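
`Tablet` appears twice in the counts above, which suggests the raw labels differ in a way that does not show when printed. Trailing or repeated whitespace is one plausible cause (an assumption, not something verified against the file). A minimal sketch, using made-up labels, of how normalizing whitespace merges such variants:

```python
import pandas as pd

# Hypothetical labels: some carry leading or trailing spaces
labels = pd.Series(["Tablet", "Tablet ", "Capsule", "Tablet", " Capsule"])

# Before: "Tablet" and "Tablet " are counted as separate categories
print(labels.value_counts())

# Collapse internal runs of whitespace and trim both ends
normalized = labels.str.replace(r"\s+", " ", regex=True).str.strip()

# After: the variants merge into one category each
print(normalized.value_counts())
```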

In [5]:
df.head()

| | Drug Name | Dosage Form | USP Apparatus | Speed (RPMs) | Medium | Volume (mL) | Recommended Sampling Times (minutes) | Date Updated |
|---|---|---|---|---|---|---|---|---|
| 0 | Abacavir Sulfate | Tablet | | | Refer to FDA's Dissolution Guidance, 2018 | | | 07/02/2020 |
| 1 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet | II (Paddle) | 85.0 | 0.01 M Phosphate Buffer with 0.5% sodium dodec... | 900 | Abacavir and lamivudine: 10, 15, 20, 30 and 45... | 05/28/2015 |
| 2 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet (For Suspension) | II (Paddle) | 50.0 | 0.01 M Phosphate Buffer with 0.5 mM EDTA, pH 6.8 | 500 | 5, 10, 15, 30, 45 and 60 | 10/06/2023 |
| 3 | Abacavir Sulfate/Lamivudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | 900 | 10, 20, 30, and 45 | 01/03/2007 |
| 4 | Abacavir Sulfate/Lamivudine/Zidovudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | Acid Stage: 900 mL; Buffer Stage: 1000 mL | 5, 10, 15, 30 and 45 | 01/03/2007 |


In [None]:
# Establishing known lists for reference in clean up

# Known "Dosage Form" release types to help detect them outside parentheses
KNOWN_RELEASE_TYPES = [
    "Extended Release", "Delayed Release", "Orally Disintegrating", 
    "Immediate Release", "Controlled Release", "Sustained Release", 
    "Copackage" # , "For Suspension"
]

In [None]:
# Defining cleanup functions


# Defining function to cleanup "Dosage Form" data
def clean_dosage_form(value):
    if pd.isna(value):
        return ("Unknown", "Unknown")
    
    # Strip spaces
    val = str(value).strip()

    # # Look for release type in parentheses
    # m = re.match(r"^(.*?)\s*\((.*?)\)$", val)
    
    # Regex: capture base form + anything inside parentheses
    # e.g. "Tablet (Delayed Release, Orally Disintegrating)"
    m = re.match(r"^(.*?)\s*(?:\((.+)\))?$", val)
    # m = re.match(r"^([A-Za-z ]+?)(?:\s*\((.+)\))?$", val)
    if m:
        dosage_form = m.group(1).strip().title()  # normalize dosage form
        release_type = m.group(2).strip() if m.group(2) else None

        # Remove trailing commas from both
        dosage_form = dosage_form.rstrip(',').strip()
        dosage_form = re.sub(r"\s+", " ", dosage_form)
        if release_type:
            release_type = release_type.rstrip(',').strip() # Note: ODT refers to Orally Disintegrating Tablet
            release_type = re.sub(r"\s+", " ", release_type)
            if release_type and not release_type[0].isupper():
                release_type = release_type[0].upper() + release_type[1:]

        # If no parentheses captured and there’s a comma, check for trailing release type
        if not release_type and ',' in dosage_form:
            parts = [p.strip() for p in dosage_form.split(',')]
            # If last part looks like a known release type, separate it
            if parts[-1] in KNOWN_RELEASE_TYPES:
                release_type = parts[-1]
                dosage_form = ', '.join(parts[:-1]).rstrip(',').strip()
        
        return dosage_form, release_type
    else:
        # fallback: keep whole thing in base, no modifiers
        return val.rstrip(',').title(), None


# Defining function to cleanup "USP Apparatus" data

# def clean_usp_apparatus(value):
#     if pd.isna(value):
#         return "Unknown"
    
#     # Strip spaces
#     val = str(value).strip().title()

#     # Normalize common names
#     val = re.sub(r"\b(USP|Apparatus|Dissolution)\b", "", val)

#     # Remove extra whitespace
#     val = re.sub(r"\s+", " ", val).strip()

#     return val if val else "Unknown"


# Defining function to cleanup "Medium" data

# Defining function to cleanup "Volume (mL)" data

# Defining function to cleanup "Recommended Sampling Times (minutes)" data

# Defining function to cleanup "" data

In [14]:
print(len(df))

1568

In [15]:
df["Drug Name"].value_counts()

Drug Name
Ibuprofen                                 5
Methylphenidate HCl                       5
Carbidopa/Levodopa                        5
Minocycline HCl                           5
Carbamazepine                             5
                                         ..
Ethambutol HCl                            1
Ethinyl Estradiol                         1
Ethinyl Estradiol/Ethynodiol Diacetate    1
Ethinyl Estradiol/Etonogestrel            1
Zuranolone                                1
Name: count, Length: 1216, dtype: int64

In [None]:
# Running functions to clean up dataset

# Apply and expand "Dosage Form" data into two new columns
df[["Dosage Form (Clean)", "Release Type"]] = df["Dosage Form"].apply(lambda x: pd.Series(clean_dosage_form(x)))

# # Check results
# print(df["Dosage Form (Clean)"].value_counts())
# print(df["Release Type"].value_counts())


# # Example transformations
# # Performing dataset cleanup as needed before one-hot encoding
# df['num_sampling_times'] = df['SamplingTimes'].apply(lambda x: len(str(x).split(',')))
# df['max_sampling_time'] = df['SamplingTimes'].apply(lambda x: max([int(t) for t in str(x).split(',')]))

# # One-hot encode categorical variables
# df_encoded = pd.get_dummies(df[['Apparatus', 'MediumType', 'DosageForm']])

# # Combine numeric + encoded
# features = pd.concat([df[['RPM', 'MediumVolume', 'num_sampling_times', 'max_sampling_time']], df_encoded], axis=1)


# # Drop non-numeric columns if needed after feature engineering is complete (sanity check)
# df = df.select_dtypes(include=[np.number]).dropna()

# features.head()

In [7]:
df[df.columns[-2]].value_counts()

Dosage Form (Clean)
Tablet                             971
Capsule                            335
Suspension                          67
Oral Suspension                     17
Film, Transdermal                   14
                                  ... 
Vaginal Tablet                       1
Lozenges                             1
Tablet (Extended Release             1
Suspension/Drop                      1
Intra-Articular, For Suspension      1
Name: count, Length: 69, dtype: int64
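
The `Tablet (Extended Release` entry in the counts above shows that a label with a missing closing parenthesis passes through the cleanup unsplit. One possible way to tolerate that case is to split on the first opening parenthesis and treat the closing one as optional. The helper name `split_form_and_modifier` is hypothetical, and this sketch does not reproduce the comma-based release-type detection of `clean_dosage_form`.

```python
import re


def split_form_and_modifier(text):
    """Split 'Form (Modifier)' into parts, tolerating a missing closing parenthesis."""
    text = re.sub(r"\s+", " ", str(text)).strip()
    base, sep, rest = text.partition("(")
    base = base.strip().rstrip(",").strip().title()
    if not sep:
        # No parenthesis at all: the whole string is the dosage form
        return base, None
    rest = rest.strip()
    # Drop the closing parenthesis that pairs with the one we split on, if present
    if rest.endswith(")") and rest.count(")") > rest.count("("):
        rest = rest[:-1]
    rest = rest.strip().rstrip(",").strip()
    return base, (rest or None)


examples = [
    "Tablet (Extended Release",
    "Tablet (Extended Release)",
    "Tablet (Orally Disintegrating (ODT))",
    "Capsule",
]
results = [split_form_and_modifier(e) for e in examples]
```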

In [8]:
df[df.columns[-1]].value_counts()

Release Type
Extended Release                           209
Delayed Release                             51
Chewable                                    26
Orally Disintegrating                       24
Sublingual                                  10
Copackage                                    8
Buccal                                       7
Soft-Gelatin                                 5
Orally Disintegrating (ODT)                  4
Vaginal                                      3
For Suspension                               2
Soft-Gelatin/Liquid Fill                     2
Delayed Release Pellets                      2
Delayed Release, Orally Disintegrating       2
Sprinkle                                     2
Liposomal                                    2
Extended Release, Orally Disintegrating      2
Effervescent                                 2
Dental                                       1
Chewable dispersible                         1
Pediatric                                    1
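
The release-type counts above include near-duplicates (`Orally Disintegrating` and `Orally Disintegrating (ODT)`) and multi-valued labels (`Delayed Release, Orally Disintegrating`). One possible way to prepare them for encoding is sketched below. It assumes a trailing parenthesized abbreviation can be dropped and that comma-separated labels are independent modifiers; both choices would need checking against the full list of values.

```python
import pandas as pd

# A few release-type labels taken from the counts above, plus a missing value
release = pd.Series([
    "Extended Release",
    "Orally Disintegrating (ODT)",
    "Delayed Release, Orally Disintegrating",
    None,
])

# Drop a trailing parenthesized abbreviation such as "(ODT)"
normalized = release.str.replace(r"\s*\([^)]*\)\s*$", "", regex=True)

# Multi-hot encode: one indicator column per comma-separated modifier;
# rows with a missing release type get zeros in every column
multi_hot = normalized.str.get_dummies(sep=", ")
```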


#### 3.2. Preprocessing

Gaussian mixture clustering is unsupervised, so there is no target column to predict. Every engineered feature column is used as model input.

In other words, the method metadata (apparatus, medium, RPM, volume, and sampling-time summaries) forms the feature matrix `X`, and no label vector `y` is needed.

In [9]:
# Create X (all the engineered feature columns)
# Assumes the feature-engineering cell above has been uncommented and run,
# so that `features` exists; there is no y because clustering is unsupervised
X = features

In [None]:
# Optionally hold out part of the data to check how a fitted GMM generalizes
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X, random_state=42)

# View the data shapes
X_train.shape, X_test.shape