# Data analysis of a penicillin production process

## Goal

Potential projects: Raman spectrometer soft sensor development and early detection of faulty batches (before it is too late to intervene).

## About the data:

The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/stephengoldie/big-databiopharmaceutical-manufacturing/data). It consists of data for 100 batches of penicillin fermentation, generated by [IndPenSim](http://www.industrialpenicillinsimulation.com/).

![Variables and Parameters](IndPenSim_input_outputs_V2.png) 


## References:

Goldrick S., Stefan A., Lovett D., Montague G., Lennox B. (2015). The development of an industrial-scale fed-batch fermentation simulation. Journal of Biotechnology, 193:70-82.

Goldrick S., Duran-Villalobos C., Jankauskas K., Lovett D., Farid S. S., Lennox B. (2019). Modern day control challenges for industrial-scale fermentation processes. Computers and Chemical Engineering.

## 1 - Visualisation and cleaning

In [None]:
import pandas as pd
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

data = pd.read_csv('data/100_Batches_IndPenSim_V3.csv', usecols=range(39)) 
data_summary = pd.read_csv('data/100_Batches_IndPenSim_Statistics.csv')

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
# Fix mislabeled batch identifier columns
data = data.rename(columns={'2-PAT control(PAT_ref:PAT ref)': 'Batch reference', # Mislabeled in the source file
                            'Batch reference(Batch_ref:Batch ref)': 'Faulty batch', # Mislabeled; seems to indicate whether a batch is faulty
                            })

# Drop superfluous/unusable columns
data.drop([' 1-Raman spec recorded'], axis=1, inplace=True) # Seems to be a duplicate of 'Batch reference'
data.drop(['Batch ID'], axis=1, inplace=True) # Not actually a batch ID; closely follows penicillin concentration
data.drop(['Fault flag'], axis=1, inplace=True) # Not actually a binary flag; closely follows penicillin concentration

# Rename columns to shorter, more readable names
data = data.rename(columns={'0 - Recipe driven 1 - Operator controlled(Control_ref:Control ref)':'Operator controlled',
                            'Fault reference(Fault_ref:Fault ref)':'Faulty measure'})

# Correct data readability for a categorical (binary) variable:
data = data.rename(columns={'1- No Raman spec': 'Raman spec recorded'})
data['Raman spec recorded'] = data['Raman spec recorded'] - 1

In [None]:
variable_list = data.columns
variable_plot_selection = widgets.Dropdown(options=variable_list, value = 'Penicillin concentration(P:g/L)')
variable_plot_selection

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
bp = data.groupby('Batch reference').plot(x='Time (h)', y=variable_plot_selection.value, ax=ax, legend=False)
ax.set_title('Summary of Campaigns')
ax.set_xlabel('Time (h)')
ax.set_ylabel(variable_plot_selection.value)

Important observation: most batches last 230 h, but some are slightly longer or shorter.
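The batch durations can be checked numerically rather than read off the plot. The sketch below uses a small synthetic table in place of the real data; the durations are made up, and the 0.2 h sampling interval is an assumption inferred from 1150 rows per 230 h batch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table: three batches sampled every 0.2 h
# (hypothetical durations, not values from the dataset).
durations_h = {1: 230.0, 2: 230.0, 3: 226.0}
frames = []
for batch, duration in durations_h.items():
    n_samples = int(round(duration / 0.2))
    frames.append(pd.DataFrame({
        'Batch reference': batch,
        'Time (h)': np.arange(1, n_samples + 1) * 0.2,
    }))
toy = pd.concat(frames, ignore_index=True)

# One row per batch: number of samples and final time stamp
batch_length = toy.groupby('Batch reference')['Time (h)'].agg(['count', 'max'])
batch_length.columns = ['n_samples', 'duration_h']

# Batches whose length differs from the most common one
nominal = batch_length['n_samples'].mode().iloc[0]
odd_batches = batch_length[batch_length['n_samples'] != nominal]
print(batch_length)
print(odd_batches.index.tolist())
```

On the real data, the same `groupby` on `'Batch reference'` should list the batches that deviate from the nominal length.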

- Agitation: always 100 RPM, not useful
- Water for injection/dilution: same trajectory for all batches
- Air head pressure: same trajectory for all batches
- Dumped broth flow: same trajectory for all batches
- Ammonia shots: same trajectory for all batches
- PAA concentration offline: no data visible on the plot, because it is only measured every 4 hours
- NH3 concentration offline: no data visible on the plot, because it is only measured every 4 hours
- Penicillin concentration offline: no data visible on the plot, because it is only measured every 4 hours
- Biomass concentration offline: no data visible on the plot, because it is only measured every 4 hours
- Viscosity offline: no data visible on the plot, because it is only measured every 4 hours
- What is Batch ID? It seems to be a numerical value related to penicillin concentration
- What is Fault flag? It seems to be a numerical value related to penicillin concentration
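The offline variables look empty on a line plot because each valid sample is surrounded by NaN values, so there is no line segment to draw. A minimal sketch of one workaround, using a synthetic series (hypothetical values and column name) and the 4 h sampling interval noted above:

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): online grid every 0.2 h,
# offline assay available only every 4 h, NaN elsewhere.
time_h = np.round(np.arange(1, 101) * 0.2, 1)   # 0.2 ... 20.0 h
offline = np.full(time_h.shape, np.nan)
sampled = np.arange(19, 100, 20)                 # row indices of 4, 8, ..., 20 h
offline[sampled] = [1.0, 2.5, 4.0, 5.0, 5.5]
toy = pd.DataFrame({'Time (h)': time_h, 'Offline measurement': offline})

# Keep only the rows where the assay exists; these can be joined by a line
valid = toy.dropna(subset=['Offline measurement'])
print(valid)

# With matplotlib available, either of these would show the points:
# valid.plot(x='Time (h)', y='Offline measurement', marker='o')
# toy.plot(x='Time (h)', y='Offline measurement', marker='o', linestyle='none')
```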

Expected correlations, to be checked:

- Vessel weight and vessel volume should be highly correlated
- Generated heat, oxygen uptake rate and CO2% in outlet should be highly correlated
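Such expected correlations can be verified with a correlation matrix. The sketch below runs on synthetic data; the column names and the 0.95 threshold are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data (hypothetical): weight is volume times a roughly constant density,
# plus an unrelated variable for contrast.
volume = np.linspace(58_000, 80_000, 200)
toy = pd.DataFrame({
    'Vessel Volume': volume,
    'Vessel Weight': 1.05 * volume + rng.normal(0, 200, volume.size),
    'Unrelated': rng.normal(0, 1, volume.size),
})

corr = toy.corr()  # Pearson correlation by default

# List variable pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.95
pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

Highly correlated pairs found this way are candidates for dropping one of the two variables, although PCA already handles collinearity to a large extent.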


Some features are only available because the data comes from a bioprocess simulation; in a real plant, they cannot be measured reliably. These features are:
- Online penicillin concentration (we assume there is no Raman spectrometer in the tank)

In [None]:
# Drop columns that are constant or follow the same trajectory in every batch:
data.drop(['Agitator RPM(RPM:RPM)', 'Ammonia shots(NH3_shots:kgs)'], axis=1, inplace=True)

In [None]:
# Check for missing data:
missing_data = data.isnull().sum()
print(missing_data)

## 2 - Multiway PCA

We are going to apply PCA to visualize the data and see whether the "good" and the "bad" batches can be told apart before they end. To do this, we unfold the data batch-wise, so that there is one row per batch and one column for every variable at every time point. In other words, each variable at each time point is treated as its own feature, i.e. a separate axis of the feature space.
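To make the unfolding concrete, here is a minimal sketch on a tiny synthetic table (3 batches, 4 time points, 2 variables). The variable names and values are hypothetical; only the column names `'Batch reference'` and `'Time (h)'` mirror the notebook.

```python
import pandas as pd

# Toy long-format table (hypothetical): 3 batches x 4 time points x 2 variables
batches, times = [1, 2, 3], [0.2, 0.4, 0.6, 0.8]
rows = [
    {'Batch reference': b, 'Time (h)': t, 'Temperature': 298 + b + t, 'pH': 6.5 + 0.01 * b}
    for b in batches for t in times
]
long_df = pd.DataFrame(rows)

# Batch-wise unfolding: one row per batch, one column per (variable, time) pair
wide = long_df.pivot(index='Batch reference', columns='Time (h)',
                     values=['Temperature', 'pH'])
wide.columns = [f'{var}_t{time}' for var, time in wide.columns]

print(wide.shape)  # (3, 8): 3 batches, 2 variables x 4 time points
print(list(wide.columns)[:4])
```

With J variables and K time points, each batch becomes a single row of J x K features, which is why the unfolded matrix is very wide compared to the number of batches.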

In [None]:
# Only keep data from campaign 4 (same as in the paper): fixed duration, recipe driven
data = data[data['Operator controlled'] == 0]
data = data[data['Raman spec recorded'] == 0] # Raman spec controlled batches are also marked as "recipe driven"
grouped = data.groupby('Batch reference')
filtered_groups = [group for name, group in grouped if len(group) == 1150 # Only keep 230 h batches (1150 samples, one every 0.2 h)
                   and not (group['Faulty measure'] == 1).any()] # Only NOC (normal operating conditions) batches
data = pd.concat(filtered_groups)
del grouped

# Drop columns with discrete/categorical variables:
data.drop(['Faulty measure',
           'Operator controlled',
           'Raman spec recorded'],
            axis=1, inplace=True)

# Drop columns with delayed measurements (offline) and values provided by the Raman spectrometer:
data.drop(['PAA concentration offline(PAA_offline:PAA (g L^{-1}))',
           'Offline Penicillin concentration(P_offline:P(g L^{-1}))',
           'Offline Biomass concentratio(X_offline:X(g L^{-1}))',
           'Viscosity(Viscosity_offline:centPoise)',
           'NH_3 concentration off-line(NH3_offline:NH3 (g L^{-1}))',
           'Substrate concentration(S:g/L)',
           'Penicillin concentration(P:g/L)'], axis=1, inplace=True)

In [None]:
data.columns

In [None]:
data = data.sort_values(by=['Batch reference', 'Time (h)'])

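# Pivot the DataFrame
# Note: set('Time (h)') would build a set of single characters, so the key columns are excluded explicitly
variable_names = [col for col in data.columns if col not in ('Batch reference', 'Time (h)')]
data = data.pivot_table(index='Batch reference', columns='Time (h)', values=variable_names)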

# Flatten the MultiIndex columns
data.columns = [f'{var}_t{time}' for var, time in data.columns]

data

In [None]:
# Remove columns with a standard deviation of 0: they carry no information
# and would cause a division by zero during normalization.
# We are able to drop around 40% of the columns!
data = data.loc[:, data.std() != 0]

# Normalization (z-score per column)
data = (data - data.mean()) / data.std()

# PCA
pca = PCA()
pca_transformed_data = pca.fit_transform(data) # TODO: fit on NOC batches only, then transform both NOC and AOC batches

In [None]:
pca_transformed_data

# Fraction of the total variance explained by the first four principal components
pca.explained_variance_ratio_[0:4].sum()
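The planned change (fit the model on NOC batches only, then project both NOC and abnormal batches) could look roughly like the sketch below. It runs on random synthetic matrices, not on the IndPenSim data; the matrix sizes, the shift applied to the abnormal batches, the choice of 4 components and the helper `spe` are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy unfolded matrices (hypothetical): rows are batches, columns are features
X_noc = rng.normal(0.0, 1.0, size=(30, 50))         # normal operating conditions
X_aoc = rng.normal(0.0, 1.0, size=(5, 50)) + 3.0    # shifted, "abnormal" batches

# Scale with NOC statistics only, then reuse them for the abnormal batches
mean, std = X_noc.mean(axis=0), X_noc.std(axis=0, ddof=1)
Z_noc = (X_noc - mean) / std
Z_aoc = (X_aoc - mean) / std

# Fit on NOC, project both sets into the same score space
pca = PCA(n_components=4).fit(Z_noc)
scores_noc = pca.transform(Z_noc)
scores_aoc = pca.transform(Z_aoc)

def spe(Z, scores):
    """Squared prediction error: squared residual distance to the PCA model."""
    residual = Z - pca.inverse_transform(scores)
    return (residual ** 2).sum(axis=1)

print(spe(Z_noc, scores_noc).mean(), spe(Z_aoc, scores_aoc).mean())
```

Reusing the NOC mean and standard deviation for the abnormal batches matters: rescaling each set with its own statistics would hide exactly the shift that the model is supposed to reveal.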

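## 3 - Product concentration prediction (WIP)

Offline parameters are not measured at every time point; we fill the NaN values with the last available measurement. Values that precede the first measurement have nothing to carry forward and are set to 0.

In [None]:
data.ffill(inplace=True)
data.fillna(0, inplace=True)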
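One caveat with forward filling on a long-format table that stacks several batches: a global fill can carry the last value of one batch into the start of the next. The sketch below illustrates the difference on a tiny synthetic table (hypothetical values and column name `'Offline assay'`); it assumes the data is still in long format with a `'Batch reference'` column.

```python
import numpy as np
import pandas as pd

# Toy long-format data (hypothetical values): two batches, sparse offline assay
toy = pd.DataFrame({
    'Batch reference': [1, 1, 1, 1, 2, 2, 2, 2],
    'Time (h)':        [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8],
    'Offline assay':   [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan, 5.0, np.nan],
})

# A global ffill carries batch 1's last value (2.0) into the start of batch 2
global_fill = toy['Offline assay'].ffill()

# Filling within each batch keeps batches independent; samples taken before the
# first assay of a batch stay NaN and are then set to 0, as in the fill step above
per_batch = toy.groupby('Batch reference')['Offline assay'].ffill().fillna(0)

print(global_fill.tolist())
print(per_batch.tolist())
```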