# Dissolution Stability Machine Learning Project Using a Gaussian Mixture Model

## Project Overview

This project aims to analyze dissolution test data using Gaussian Mixture Models (GMMs).

In pharmaceutical development and manufacturing, dissolution testing data allows one to predict how quickly and completely a drug will dissolve in a patient's body.

Dissolution methods, however, vary greatly (e.g., different media, apparatus, agitation speeds, sampling times), and real test results are often heterogeneous (e.g., some tablets dissolve quickly, others slowly). This project addresses the problem by building a computational workflow that can:
1) Translate Food and Drug Administration (FDA) method metadata into clearly structured features;
2) Use kinetic models to generate realistic synthetic dissolution profiles;
3) Apply a selection of Gaussian Mixture Models (GMMs) to cluster these profiles into meaningful subgroups (e.g., “fast,” “medium,” “slow”); and
4) Provide a framework that can be applied to real experimental data to detect unexpected subpopulations or anomalies.

This project aims to add value in the pharmaceutical and biotechnology sector by demonstrating how data analysis, machine learning, and statistical modeling can benefit method development, risk assessment, and quality control (QC) analysis.

The program takes two inputs. The first is metadata from the FDA's Dissolution Methods database: each drug's dosage form (tablet, capsule, etc.), apparatus type (basket, paddle, cylinder, etc.), agitation speed (rpm), medium composition (e.g., water, HCl, buffer with surfactant), medium volume (e.g., 500 mL, 900 mL), and sampling times (e.g., 5, 10, 20, 30, 60 min). The second is a set of synthetic dissolution profiles simulated from that metadata. For instance, a first-order or Weibull kinetic model may be used to generate curves of the percentage of drug dissolved over time, and the model parameters could be linked to method settings (e.g., a higher agitation speed could map to a larger rate constant).
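As a minimal sketch of the profile generation described above, the two kinetic models can be written as short NumPy functions. The function names, the linear RPM-to-rate-constant link, and all parameter values below are illustrative assumptions, not values taken from the FDA database.

```python
import numpy as np


def first_order_profile(t, k, f_max=100.0):
    """Percent dissolved under first-order kinetics: f_max * (1 - exp(-k * t))."""
    t = np.asarray(t, dtype=float)
    return f_max * (1.0 - np.exp(-k * t))


def weibull_profile(t, scale, shape, f_max=100.0):
    """Percent dissolved under a Weibull model: f_max * (1 - exp(-(t / scale) ** shape))."""
    t = np.asarray(t, dtype=float)
    return f_max * (1.0 - np.exp(-((t / scale) ** shape)))


def rate_constant_from_rpm(rpm, k_ref=0.05, rpm_ref=50.0):
    """Hypothetical link: the rate constant grows linearly with agitation speed."""
    return k_ref * (rpm / rpm_ref)


sampling_times = np.array([5, 10, 20, 30, 60])  # minutes

# Same kinetic model, two agitation speeds (assumed values)
slow_curve = first_order_profile(sampling_times, rate_constant_from_rpm(50))
fast_curve = first_order_profile(sampling_times, rate_constant_from_rpm(100))

# Weibull curve with assumed scale (minutes) and shape parameters
weibull_curve = weibull_profile(sampling_times, scale=15.0, shape=1.2)
```

Under this assumed link, doubling the agitation speed doubles the rate constant, so the 100 rpm curve lies above the 50 rpm curve at every sampling time. A real workflow would need to calibrate any such relationship against experimental data.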

The program aims to yield four outputs. First, it will cluster dissolution profiles into subgroups of methods/curves labeled as “fast,” “medium,” or “slow” dissolvers. Second, it will produce visualizations such as plots of dissolution curves by cluster and 2D/3D projections of the feature space color-coded by cluster. Third, it will provide cluster descriptors: statistical summaries of what differentiates each subgroup (e.g., Cluster 0 combines low RPM and short sampling times with fast release). Finally, it will generate probabilistic insights. Because a GMM assigns each profile a probability of belonging to each component, the likelihood that a new profile belongs to each subgroup is available directly, which supports uncertainty handling and model evaluation.
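The membership probabilities mentioned above correspond to scikit-learn's `GaussianMixture.predict_proba`. The sketch below fits a three-component GMM to simulated data and queries the probabilities for one new profile. The group means, the spread, and the choice of three time points are assumptions made only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical features: percent dissolved at 10, 20 and 30 min for three simulated groups
slow = rng.normal(loc=[20, 35, 50], scale=3.0, size=(60, 3))
medium = rng.normal(loc=[45, 65, 80], scale=3.0, size=(60, 3))
fast = rng.normal(loc=[75, 90, 95], scale=3.0, size=(60, 3))
profiles = np.vstack([slow, medium, fast])

gmm = GaussianMixture(n_components=3, covariance_type="full", n_init=5, random_state=0)
gmm.fit(profiles)

# Component indices are arbitrary, so rank components by their mean percent dissolved at 10 min
order = np.argsort(gmm.means_[:, 0])
names = {int(component): name for component, name in zip(order, ["slow", "medium", "fast"])}

# Soft assignment for a new, unseen profile
new_profile = np.array([[72.0, 90.0, 97.0]])
membership = gmm.predict_proba(new_profile)[0]
for component, probability in enumerate(membership):
    print(f"{names[component]}: {probability:.3f}")
```

Ranking components by their fitted means is one simple way to attach the “fast,” “medium,” and “slow” labels, since the component numbering returned by the model carries no meaning of its own.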

The project is intended to be useful in real-world pharmaceutical and biotechnology contexts at several stages: Research and Development (R&D), Process and Method Development, Manufacturing and Quality Assurance, and Regulatory and Reporting Operations.

During formulation development, the program could identify whether a drug batch behaves consistently or splits into subpopulations (e.g., two different release rates), helping scientists detect formulation robustness issues early. It could also be used to compare FDA-recommended methods across drugs and to find natural clusters of dissolution conditions (e.g., methods suited to immediate-release vs. extended-release products). During operations and quality checks, the best-performing model could be applied to real batch data: if some tablets unexpectedly cluster into a “slow dissolving” subgroup, this could flag a manufacturing deviation that should be investigated. It might also provide a systematic way to justify why certain dissolution conditions (RPM, media, times) are chosen. The program could thus add a data-driven layer on top of traditional method validation.
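One possible way to check whether a batch splits into subpopulations is to fit GMMs with different numbers of components and compare them with the Bayesian Information Criterion (BIC), where lower is better. The batch below is simulated, and the tablet counts, means, and candidate range of one to four components are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical batch: percent dissolved at 15 and 30 min for 120 tablets,
# 30 of which are simulated to release more slowly than the rest
typical = rng.normal(loc=[70.0, 95.0], scale=2.5, size=(90, 2))
slow = rng.normal(loc=[45.0, 70.0], scale=2.5, size=(30, 2))
batch = np.vstack([typical, slow])

# Fit mixtures with 1 to 4 components and keep the one with the lowest BIC
candidates = {
    k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(batch)
    for k in range(1, 5)
}
bic = {k: model.bic(batch) for k, model in candidates.items()}
best_k = min(bic, key=bic.get)

labels = candidates[best_k].predict(batch)
print(f"Components selected by BIC: {best_k}")
print(f"Tablets per component: {np.bincount(labels)}")
```

If a single component gives the lowest BIC, the batch looks homogeneous under this model. A preference for two or more components suggests subpopulations worth investigating, although BIC is only a heuristic and does not by itself establish a manufacturing deviation.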

## Analyzing Dissolution Test Data with Gaussian Mixture Models

**Goal:**  
Use FDA Dissolution Methods metadata + synthetic dissolution profiles to identify clusters of dissolution behaviors ("fast," "medium," "slow").  

**Steps:**  
1. Feature Engineering from FDA database  
2. Synthetic Profile Generation using kinetic models  
3. GMM clustering and visualization  
4. Interpretation & discussion of pharma relevance

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experiments

### 1. Notebook Setup

Getting all the tools ready in the project's virtual environment.

In [1]:
# Setup matplotlib to plot inline (within the notebook)
%matplotlib inline

# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Modeling
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Utility
from tqdm import tqdm

### 2. Data Acquisition

Loading the dataset (if the dataset is real and available).

In [2]:
# Import dataset from CSV file or URL
df = pd.read_csv("Dissolution Methods.csv")

# Quick check to view the data
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1568 entries, 0 to 1567
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Drug Name                             1568 non-null   object
 1   Dosage Form                           1568 non-null   object
 2   USP Apparatus                         805 non-null    object
 3   Speed (RPMs)                          800 non-null    object
 4   Medium                                1568 non-null   object
 5   Volume (mL)                           792 non-null    object
 6   Recommended Sampling Times (minutes)  803 non-null    object
 7   Date Updated                          1568 non-null   object
dtypes: object(8)
memory usage: 98.1+ KB


| | Drug Name | Dosage Form | USP Apparatus | Speed (RPMs) | Medium | Volume (mL) | Recommended Sampling Times (minutes) | Date Updated |
|---|---|---|---|---|---|---|---|---|
| 0 | Abacavir Sulfate | Tablet | | | Refer to FDA's Dissolution Guidance, 2018 | | | 07/02/2020 |
| 1 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet | II (Paddle) | 85.0 | 0.01 M Phosphate Buffer with 0.5% sodium dodec... | 900 | Abacavir and lamivudine: 10, 15, 20, 30 and 45... | 05/28/2015 |
| 2 | Abacavir Sulfate/Dolutegravir Sodium/Lamivudine | Tablet (For Suspension) | II (Paddle) | 50.0 | 0.01 M Phosphate Buffer with 0.5 mM EDTA, pH 6.8 | 500 | 5, 10, 15, 30, 45 and 60 | 10/06/2023 |
| 3 | Abacavir Sulfate/Lamivudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | 900 | 10, 20, 30, and 45 | 01/03/2007 |
| 4 | Abacavir Sulfate/Lamivudine/Zidovudine | Tablet | II (Paddle) | 75.0 | 0.1 N HCl | Acid Stage: 900 mL; Buffer Stage: 1000 mL | 5, 10, 15, 30 and 45 | 01/03/2007 |
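
The preview shows that speeds, apparatus names, and sampling times are stored as free text. A minimal sketch of turning such strings into structured features follows. It uses a small hand-copied sample of the rows above rather than the full file, and the helper `parse_sampling_times` and the output column names are hypothetical.

```python
import re

import pandas as pd

# A few values copied from the preview above, plus one row with missing entries
sample = pd.DataFrame({
    "USP Apparatus": ["II (Paddle)", "II (Paddle)", None],
    "Speed (RPMs)": ["50.0", "75.0", None],
    "Recommended Sampling Times (minutes)": [
        "5, 10, 15, 30, 45 and 60",
        "10, 20, 30, and 45",
        None,
    ],
})


def parse_sampling_times(text):
    """Return the integers found in a sampling-times string, or an empty list."""
    if not isinstance(text, str):
        return []
    return [int(value) for value in re.findall(r"\d+", text)]


times = sample["Recommended Sampling Times (minutes)"].apply(parse_sampling_times)

features = pd.DataFrame({
    # Text inside the parentheses, e.g. "Paddle"
    "apparatus": sample["USP Apparatus"].str.extract(r"\((\w+)\)", expand=False),
    # Unparseable or missing speeds become NaN
    "rpm": pd.to_numeric(sample["Speed (RPMs)"], errors="coerce"),
    "n_timepoints": times.apply(len),
    "last_timepoint_min": times.apply(lambda values: max(values) if values else float("nan")),
})
```

This simple digit-matching pattern would misread entries that mix several analytes, decimal values, or units in hours, so the full dataset would likely need additional rules.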


Simulating the dataset (if the dataset is not available). (OPTIONAL)

In [3]:
# # Simulating dissolution test metrics for 200 batches
# np.random.seed(42)
# fast_group = np.random.normal(
#     loc=90, scale=5, size=(100, 3))  # high % dissolved
# slow_group = np.random.normal(
#     loc=60, scale=5, size=(100, 3))  # lower % dissolved
# synthetic_data = np.vstack([fast_group, slow_group])

# df = pd.DataFrame(synthetic_data, columns=[
#                   "% Dissolved @ 5min", "% Dissolved @ 10min", "% Dissolved @ 15min"])

### 3. Data Cleaning & Preprocessing

Preparing the dataset for use within the model.

#### 3.1. Data Cleaning

Ensuring numerical features are clean and scaled.

In [None]:
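# Every column loads as object dtype, so select_dtypes finds no numeric columns.
# Coerce the numeric-looking columns first; unparseable entries become NaN.
numeric_cols = ["Speed (RPMs)", "Volume (mL)"]
numeric_df = df[numeric_cols].apply(pd.to_numeric, errors="coerce").dropna()

# Scale features for GMM
scaler = StandardScaler()
X_scaled = scaler.fit_transform(numeric_df)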
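
Scaling matters for a GMM because features on larger numeric ranges (such as volume in mL) would otherwise dominate the fitted covariances. The sketch below uses simulated method settings rather than the FDA data. It shows scaling, fitting, and then mapping the component means back to the original units so that cluster descriptors stay readable. The column names `rpm` and `volume_ml` and all simulated values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical method settings: two groups differing in speed and volume
methods = pd.DataFrame({
    "rpm": np.concatenate([rng.normal(50, 3, 80), rng.normal(100, 3, 80)]),
    "volume_ml": np.concatenate([rng.normal(500, 20, 80), rng.normal(900, 20, 80)]),
})

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(methods)

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X_scaled)
labels = gmm.predict(X_scaled)

# Map the component means back to rpm and mL for readable cluster descriptors
centers = pd.DataFrame(scaler.inverse_transform(gmm.means_), columns=methods.columns)

# Silhouette score on the hard labels as one rough check of cluster separation
score = silhouette_score(X_scaled, labels)
print(centers.round(1))
print(f"Silhouette score: {score:.2f}")
```

The silhouette score is computed on hard labels and therefore ignores the soft assignments that a GMM provides, so it is best read alongside likelihood-based criteria rather than on its own.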

#### 3.2. Preprocessing

Gaussian mixture modelling is unsupervised, so there is no target column to predict. The method metadata columns serve as the feature matrix `X`, and the cluster assignments are learned from them.

In [4]:
# Create X (all the feature columns)
# Assumption: the drug name and update date identify a record rather than describe the method
X = df.drop(columns=["Drug Name", "Date Updated"])

In [5]:
# Split the data into training and test sets so a fitted model can be checked on unseen rows
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X, random_state=42)

# View the data shapes
X_train.shape, X_test.shape