# Assignment: Properties of Wound Sealing Powder

## Introduction
In this assignment, we explore the application of machine learning in pharmaceutical research and development (R&D) to predict the properties of a final product: a wound sealing powder used in surgery.

The production process uses a batch reactor in which the ingredients are mixed, heated, and then cooled. The ingredients are held constant because they follow a fixed formula. All experiments are conducted at room temperature (35°C).

## Learning Objectives
- Understand data exploration, cleaning, and preprocessing
- Learn to identify and handle logical errors in process data
- Practice feature selection with domain knowledge
- Develop and optimize machine learning models
- Learn to use model predictions for process optimization

## Data Description

- **Sample_Number:** Identifier of the sample (experiment run).
- **Heating_Temperature_C:** Heating temperature during the process (in degrees Celsius).
- **Heating_Duration_min:** Duration for which the mixture is heated (in minutes).
- **Heating_Rate_per_min:** Rate at which heat is transferred to the fluid (in kJ/min).
- **Cooling_Temperature_C:** Final cooling temperature after the process (in degrees Celsius).
- **Heat_Rejection_Rate_per_min:** Rate at which heat is rejected during cooling (in kJ/min).
- **Final_Product_Absorption_Capacity:** Measured absorption capacity of the final wound sealing powder (in ml/g). This is the target variable.
- **Start_Timestamp:** Experiment starting time.
- **End_Timestamp:** Experiment end time.
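
The two timestamp columns can be turned into an elapsed time and compared with the recorded durations. The sketch below uses a toy DataFrame with made-up values; the timestamp string format and the `Elapsed_min` column name are assumptions, not properties of the actual dataset.

```python
import pandas as pd

# Toy rows with made-up values; only the column names follow the data description.
toy = pd.DataFrame({
    "Heating_Duration_min": [30, 45],
    "Start_Timestamp": ["2024-01-01 08:00:00", "2024-01-01 10:00:00"],
    "End_Timestamp": ["2024-01-01 09:15:00", "2024-01-01 11:30:00"],
})

# Parse the timestamp strings into datetime values (format is an assumption).
for col in ["Start_Timestamp", "End_Timestamp"]:
    toy[col] = pd.to_datetime(toy[col])

# Elapsed experiment time in minutes (hypothetical helper column).
elapsed = toy["End_Timestamp"] - toy["Start_Timestamp"]
toy["Elapsed_min"] = elapsed.dt.total_seconds() / 60

print(toy[["Heating_Duration_min", "Elapsed_min"]])
```

A derived elapsed time like this may be useful later when looking for records whose timestamps and durations disagree.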



In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

## Part 1: Data Exploration

### Task 1.1: Load and Examine Data

In [None]:
# Load the dataset
df = pd.read_csv('wound_sealing_trial_data.csv')

# Display first few rows
print("First few rows of the dataset:")
display(df.head())

In [None]:
# Display basic information
print("\nDataset Info:")
# TODO

In [None]:
# Display summary statistics
print("\nSummary Statistics:")
# TODO

### Task 1.2: Data Visualization

In [None]:
def plot_feature_distributions(df):
    """Plot histograms for all numerical features"""
    # Keep numerical columns only (timestamps and other non-numeric columns are skipped)
    num_cols = df.select_dtypes(include='number').columns
    fig, axes = plt.subplots(3, 3, figsize=(15, 15))
    axes = axes.ravel()
    
    for idx, col in enumerate(num_cols):
        if idx < len(axes):
            sns.histplot(data=df, x=col, ax=axes[idx])
            axes[idx].set_title(f'Distribution of {col}')
    
    # Hide any axes that were not used
    for ax in axes[len(num_cols):]:
        ax.set_visible(False)
    
    plt.tight_layout()
    plt.show()

def plot_correlation_matrix(df):
    """Plot correlation matrix heatmap"""
    
    plt.figure(figsize=(10, 8))
    # numeric_only=True avoids errors when non-numeric columns are still present
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation Matrix')
    plt.tight_layout()
    plt.show()

def plot_feature_relationships(df, target='Final_Product_Absorption_Capacity'):
    """Plot relationships between features and target"""

    features = [col for col in df.columns if col != target]
    fig, axes = plt.subplots(3, 2, figsize=(15, 20))
    axes = axes.ravel()
    
    for idx, feature in enumerate(features):
        if idx < len(axes):
            sns.scatterplot(data=df, x=feature, y=target, ax=axes[idx])
            axes[idx].set_title(f'{feature} vs {target}')
    
    plt.tight_layout()
    plt.show()

# Prepare df for data visualization by dropping irrelevant features
# TODO
df

plot_feature_distributions(df)
plot_correlation_matrix(df)
plot_feature_relationships(df)

## Part 2: Data Cleaning and Logical Error Handling

### Task 2.1: Identify and Handle Logical Errors

Develop two functions: one to identify logical errors and another to fix them.
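
As a minimal sketch of the idea, the example below builds boolean masks for two checks on a toy DataFrame and drops the flagged rows. The values are made up, the `non_positive_duration` check is a hypothetical example, and dropping rows is only one possible fix; correcting or imputing the values may suit the real data better.

```python
import pandas as pd

# Toy data with made-up values; row 1 and row 2 each violate one rule.
toy = pd.DataFrame({
    "Heating_Temperature_C": [80.0, 60.0, 75.0],
    "Cooling_Temperature_C": [40.0, 70.0, 38.0],
    "Heating_Duration_min": [30.0, 25.0, -5.0],
})

# Each check is a boolean Series that is True where the rule is violated.
checks = {
    "cooling_temp_higher": toy["Cooling_Temperature_C"] > toy["Heating_Temperature_C"],
    "non_positive_duration": toy["Heating_Duration_min"] <= 0,  # hypothetical check
}

# A row is flagged if it violates at least one rule.
any_error = pd.concat(checks, axis=1).any(axis=1)

# One possible fix: keep only the rows that pass every check.
toy_clean = toy.loc[~any_error].reset_index(drop=True)
print(f"Flagged {any_error.sum()} of {len(toy)} rows")
```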


In [None]:
def check_logical_errors(df):
    """Check for logical errors in the data"""
    errors = {
        'cooling_temp_higher': df['Cooling_Temperature_C'] > df['Heating_Temperature_C'],
        # TODO
    }
    
    print("Logical Errors Found:")
    for error_type, mask in errors.items():
        print(f"{error_type}: {mask.sum()} instances")
    
    return errors

def fix_logical_errors(df):
    """Fix logical errors in the data"""
    # Work on a copy so the original DataFrame is left unchanged
    df_clean = df.copy()
    # TODO
    
    return df_clean

# Check for logical errors
errors = check_logical_errors(df)

# Fix logical errors
df_clean = fix_logical_errors(df)

# Verify fixes
print("\nAfter fixing logical errors:")
check_logical_errors(df_clean)

### Task 2.2: Handle Missing Values and Outliers
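
For the outlier part of this task, one common heuristic is the interquartile range (IQR) rule. The sketch below applies it to a made-up series; the 1.5 multiplier is a convention rather than a requirement, and whether a flagged value is a true outlier or a valid measurement is a judgment that depends on the process.

```python
import pandas as pd

# Made-up measurements; the last value is far from the rest.
values = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 50.0], name="toy_measurement")

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```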

In [None]:
def handle_missing_values(df):
    """Handle missing values in the dataset"""
    print("Missing values before handling:")
    display(df.isnull().sum())

    # Work on a copy so the original DataFrame is left unchanged
    df_clean = df.copy()

    # Implement logic for handling missing values
    # You can include multiple methods depending on each feature/case
    # TODO

    
    print("\nMissing values after handling:")
    display(df_clean.isnull().sum())
    
    return df_clean


# Handle missing values
df_clean = handle_missing_values(df_clean)

# Display summary of cleaned data
print("\nSummary of cleaned data:")
display(df_clean.describe())

## Part 3: Feature Selection and Engineering

### Task 3.1: Feature Analysis

In [None]:
def analyze_feature_importance(X, y):
    """Analyze feature importance using multiple methods"""
    # TODO
    
    return importance_df

# Prepare features and target
X = df_clean.drop(['Final_Product_Absorption_Capacity'], axis=1)
y = df_clean['Final_Product_Absorption_Capacity']

# Analyze feature importance
importance_df = analyze_feature_importance(X, y)

### Task 3.2: Feature Engineering (Optional)

Create engineered features as needed. After creating them, run the feature importance function you developed earlier to evaluate their importance.
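
As an illustration, the sketch below derives two candidate features from the process columns on a toy DataFrame. The values are made up, the new column names are hypothetical, and treating the heating rate as constant over the heating duration is an assumption that may not hold for the real process.

```python
import pandas as pd

# Toy process settings with made-up values.
toy_X = pd.DataFrame({
    "Heating_Temperature_C": [80.0, 90.0],
    "Heating_Duration_min": [30.0, 20.0],
    "Heating_Rate_per_min": [2.0, 3.5],
    "Cooling_Temperature_C": [40.0, 35.0],
})

toy_eng = toy_X.assign(
    # Total heat added, assuming a constant heating rate (kJ/min * min = kJ).
    Total_Heat_Input_kJ=toy_X["Heating_Rate_per_min"] * toy_X["Heating_Duration_min"],
    # Temperature difference between the heating and cooling stages.
    Temperature_Drop_C=toy_X["Heating_Temperature_C"] - toy_X["Cooling_Temperature_C"],
)

print(toy_eng[["Total_Heat_Input_kJ", "Temperature_Drop_C"]])
```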

In [None]:
def engineer_features(X):
    """Create engineered features"""
    # Work on a copy so the original features are left unchanged
    X_engineered = X.copy()
    # TODO
    
    return X_engineered

# Engineer features
X_engineered = engineer_features(X)

# Analyze importance of engineered features
importance_df_engineered = analyze_feature_importance(X_engineered, y)

### Task 3.3: Feature Selection

Select the relevant features that you want to include for model training.
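
One simple way to rank features, sketched below on synthetic data, is by absolute Pearson correlation with the target. The feature names, the data, and the choice of `k` are illustrative assumptions. Correlation captures only linear relationships, so it is best combined with domain knowledge and the importance analysis from Task 3.1.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on feat_a and feat_b but not on feat_noise.
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})
toy_y = 3.0 * toy_X["feat_a"] - 2.0 * toy_X["feat_b"] + rng.normal(scale=0.1, size=n)

# Rank features by the absolute value of their correlation with the target.
scores = toy_X.corrwith(toy_y).abs().sort_values(ascending=False)

# Keep the k highest-scoring features (k is an arbitrary choice here).
k = 2
top = scores.head(k).index.tolist()
toy_selected = toy_X[top]
print(scores)
```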

In [None]:
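# Select top features based on importance
# TODO: replace the placeholders below with your selection
top_features = ...  # e.g. a list of column names
X_selected = ...    # e.g. the corresponding columns of X or X_engineered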

print("Selected features:")
print(top_features)

## Part 4: Model Development and Training

### Task 4.1: Data Preparation and Model Setup
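
Since `SVR` is imported at the top of the notebook, one plausible setup is a grid search over its `C` and `gamma` parameters. The sketch below runs on synthetic data with an arbitrary small grid. It places the scaler inside a pipeline so that scaling is refit within each cross-validation fold, which is a design choice that differs from scaling once before the search.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(80, 2))
y_toy = 2.0 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(scale=0.1, size=80)

# make_pipeline names each step after its lowercased class name, hence "svr__".
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
toy_grid = {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1]}

# 5-fold cross-validated search; higher (less negative) scores are better.
search = GridSearchCV(pipe, toy_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
```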

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize your model
# TODO: replace the placeholder with your estimator
model = ...

# Define parameter grid for optimization
param_grid = {
# TODO
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
# TODO
)

### Task 4.2: Model Training and Evaluation

In [None]:
def train_and_evaluate_model(grid_search, X_train_scaled, X_test_scaled, y_train, y_test):
    """Train model and evaluate performance"""
    # Fit model
    grid_search.fit(X_train_scaled, y_train)
    
    # Get best model
    best_model = grid_search.best_estimator_
    
    # Make predictions
    y_pred_train = best_model.predict(X_train_scaled)
    y_pred_test = best_model.predict(X_test_scaled)
    
    # Calculate metrics
    metrics = {
        'train': {
            'r2': r2_score(y_train, y_pred_train),
            'rmse': np.sqrt(mean_squared_error(y_train, y_pred_train)),
            'mae': mean_absolute_error(y_train, y_pred_train)
        },
        'test': {
            'r2': r2_score(y_test, y_pred_test),
            'rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
            'mae': mean_absolute_error(y_test, y_pred_test)
        }
    }
    
    # Print results
    print("Best parameters:", grid_search.best_params_)
    print("\nModel Performance:")
    for dataset in ['train', 'test']:
        print(f"\n{dataset.capitalize()} Set:")
        print(f"R² Score: {metrics[dataset]['r2']:.4f}")
        print(f"RMSE: {metrics[dataset]['rmse']:.4f}")
        print(f"MAE: {metrics[dataset]['mae']:.4f}")
    
    return best_model, metrics, y_pred_train, y_pred_test

# Train and evaluate model
best_model, metrics, y_pred_train, y_pred_test = train_and_evaluate_model(
    grid_search, X_train_scaled, X_test_scaled, y_train, y_test
)

### Task 4.3: Model Visualization and Analysis

In [None]:
def plot_prediction_analysis(y_true, y_pred, title):
    """Create detailed prediction analysis plots"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Actual vs Predicted
    ax1.scatter(y_true, y_pred, alpha=0.5)
    ax1.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
    ax1.set_xlabel('Actual Values')
    ax1.set_ylabel('Predicted Values')
    ax1.set_title(f'{title} - Actual vs Predicted')
    
    # Residuals
    residuals = y_pred - y_true
    ax2.scatter(y_pred, residuals, alpha=0.5)
    ax2.axhline(y=0, color='r', linestyle='--')
    ax2.set_xlabel('Predicted Values')
    ax2.set_ylabel('Residuals')
    ax2.set_title(f'{title} - Residual Plot')
    
    plt.tight_layout()
    plt.show()

# Plot training set results
plot_prediction_analysis(y_train, y_pred_train, 'Training Set')

# Plot test set results
plot_prediction_analysis(y_test, y_pred_test, 'Test Set')

## Part 5: Process Parameter Optimization (Bonus)

Using the model you developed, determine which process parameters the researchers should try next.
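
One possible approach, sketched below, is to predict the target over a grid of candidate settings and rank them. The example uses synthetic data and a stand-in `RandomForestRegressor` rather than the model from Part 4; the parameter ranges and the synthetic optimum (85 °C, 40 min) are made up. With the notebook's own model, the candidates would also need to pass through the same feature selection and scaling as the training data.

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data with a made-up optimum near 85 C and 40 min.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Heating_Temperature_C": rng.uniform(60, 100, 300),
    "Heating_Duration_min": rng.uniform(10, 60, 300),
})
target = (
    -((train["Heating_Temperature_C"] - 85) / 10) ** 2
    - ((train["Heating_Duration_min"] - 40) / 15) ** 2
)

# Stand-in model; replace with the fitted model from Part 4.
stand_in = RandomForestRegressor(n_estimators=100, random_state=0).fit(train, target)

# Candidate settings restricted to the range covered by the training data,
# since predictions outside that range are unreliable.
candidates = pd.DataFrame(
    list(itertools.product(np.linspace(60, 100, 9), np.linspace(10, 60, 11))),
    columns=train.columns,
)
candidates["predicted"] = stand_in.predict(candidates[train.columns])

# The highest predicted value suggests the next setting to try.
best = candidates.sort_values("predicted", ascending=False).iloc[0]
print(best)
```

The top-ranked setting is a suggestion for the next experiment, not a guaranteed optimum; it is only as reliable as the model's test performance.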

In [None]:
# TODO

## Explanation and Final Comments

TODO: Edit this cell to add your comments