# Health Severity Index Implementation Using TabNet in Unsupervised Mode
 
This notebook builds a health severity index with TabNet in unsupervised mode. After preprocessing the patient data, we pretrain TabNet to learn representations of the tabular features without labels, cluster those representations, and map the clusters to a severity index intended to reflect different levels of health status among patients. SHAP is then used to interpret which features drive the model's behavior.

### Library Imports

In [5]:
# Import standard libraries
import numpy as np
import pandas as pd
import os
import joblib
import gc

# Machine learning libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# TabNet library
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetRegressor

# PyTorch
import torch

# Explainable AI libraries
import shap

# Disable warnings
import warnings
warnings.filterwarnings('ignore')




### Data Loading and Preprocessing

#### Load Patient Data With Sequences

In [6]:
# Load patient data with sequences
patient_data = pd.read_pickle('Data/patient_data_sequences.pkl')
print("Loaded patient data with sequences.")


Loaded patient data with sequences.


#### Load Code Mappings

In [None]:
# Load code mappings
code_mappings = pd.read_csv('Data/code_mappings.csv')
code_to_id = dict(zip(code_mappings['UNIQUE_CODE'], code_mappings['CODE_ID']))
id_to_code = dict(zip(code_mappings['CODE_ID'], code_mappings['UNIQUE_CODE']))
num_codes = len(code_to_id)
print(f"Number of unique codes: {num_codes}")


#### Aggregate Embeddings for Each Patient

In [None]:
# Define embedding dimension
embedding_dim = 128  # Adjust as needed

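# Initialize embeddings randomly (placeholder: these vectors carry no learned
# clinical meaning; swap in trained code embeddings when available)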
np.random.seed(42)
code_embeddings = np.random.normal(size=(num_codes, embedding_dim))

# Function to get embedding for a code ID
def get_code_embedding(code_id):
    return code_embeddings[code_id]

# Function to aggregate embeddings for a patient
def aggregate_patient_embeddings(visits):
    all_code_ids = [code_id for visit in visits for code_id in visit]
    if not all_code_ids:
        return np.zeros(embedding_dim)
    embeddings = np.array([get_code_embedding(code_id) for code_id in all_code_ids])
    mean_embedding = embeddings.mean(axis=0)
    return mean_embedding

# Aggregate embeddings for each patient
patient_embeddings = np.array([
    aggregate_patient_embeddings(visits) for visits in patient_data['SEQUENCE']
])

print("Aggregated embeddings for all patients.")
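
The aggregation above is mean pooling over every code a patient has ever received. As a minimal sketch on made-up data (the `mean_pool_codes` helper, the toy table, and the toy visits are hypothetical and not part of the project code), the same idea looks like this:

```python
import numpy as np

def mean_pool_codes(visits, embedding_table):
    """Average the embedding rows of every code ID across all visits."""
    code_ids = [code_id for visit in visits for code_id in visit]
    if not code_ids:
        # No codes recorded: fall back to a zero vector
        return np.zeros(embedding_table.shape[1])
    return embedding_table[code_ids].mean(axis=0)

# Toy embedding table: 4 codes, 3 dimensions (hypothetical values)
toy_table = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])

toy_visits = [[0, 1], [3]]  # two visits: codes 0 and 1, then code 3
pooled = mean_pool_codes(toy_visits, toy_table)
empty = mean_pool_codes([], toy_table)
print(pooled)
print(empty)
```

Mean pooling ignores the order of visits, and a code that appears in several visits is counted once per appearance, so frequent codes pull the patient vector toward their embedding.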


#### Prepare Demographic Features

In [None]:
# Select demographic features
demographic_features = ['Id', 'AGE', 'DECEASED', 'GENDER', 'RACE', 'ETHNICITY',
                        'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE', 'INCOME']

demographics = patient_data[demographic_features]
print("Prepared demographic features.")


#### Combine Features for TabNet
 
Since TabNet can handle categorical features natively, we'll label-encode the categorical variables instead of one-hot encoding them.

In [None]:
# Copy demographics to avoid modifying the original DataFrame
demographics_tabnet = demographics.copy()

# Label encode categorical variables
categorical_columns = ['GENDER', 'RACE', 'ETHNICITY']
for col in categorical_columns:
    le = LabelEncoder()
    demographics_tabnet[col] = le.fit_transform(demographics_tabnet[col].astype(str))

# Fill missing values if any
demographics_tabnet.fillna(0, inplace=True)

# Combine embeddings and demographics
features = pd.DataFrame(patient_embeddings)
features.columns = [f'embedding_{i}' for i in range(embedding_dim)]
features.reset_index(drop=True, inplace=True)
demographics_tabnet.reset_index(drop=True, inplace=True)
data = pd.concat([demographics_tabnet, features], axis=1)

print("Combined embeddings and demographics for TabNet.")


### Unsupervised Representation Learning with TabNet
#### 1. Prepare Feature Columns

In [None]:
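# Define feature columns (exclude 'Id')
feature_columns = data.columns.drop(['Id'])

# Identify categorical column indices relative to the feature matrix X
# (positions in data.columns are shifted by one because 'Id' is dropped)
categorical_columns_indices = [feature_columns.get_loc(col) for col in categorical_columns]

# Prepare data for TabNet as a float array
X = data[feature_columns].values.astype(np.float32)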
print("Prepared data for TabNet.")
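
Column positions looked up on the full DataFrame and on the feature matrix differ once `Id` is dropped, which matters for any index list handed to TabNet or used to slice `X`. The sketch below shows the offset on a toy frame; the frame and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy frame with an identifier column in front (hypothetical values)
toy = pd.DataFrame({
    "Id": ["a", "b"],
    "AGE": [40, 65],
    "GENDER": [0, 1],
    "RACE": [2, 0],
})
toy_feature_columns = toy.columns.drop(["Id"])
toy_categoricals = ["GENDER", "RACE"]

# Positions in the full frame (still counting 'Id')
idx_in_frame = [toy.columns.get_loc(c) for c in toy_categoricals]
# Positions in the feature matrix (after dropping 'Id')
idx_in_matrix = [toy_feature_columns.get_loc(c) for c in toy_categoricals]

toy_X = toy[toy_feature_columns].values
print(idx_in_frame)
print(idx_in_matrix)
print(toy_X[:, idx_in_matrix])
```

Only the second list points at the categorical columns of `toy_X`; the first one is shifted by one and its last entry falls outside the matrix.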


#### 2. Standardize Numerical Features
TabNet can handle raw numerical features, but standardizing can sometimes improve performance.

In [None]:
# Standardize numerical features (excluding categorical columns)
numerical_columns = [col for col in feature_columns if col not in categorical_columns]
# Indices must refer to positions in X, which has no 'Id' column
numerical_columns_indices = [feature_columns.get_loc(col) for col in numerical_columns]

scaler = StandardScaler()
X[:, numerical_columns_indices] = scaler.fit_transform(X[:, numerical_columns_indices])
print("Standardized numerical features.")
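
To illustrate what the scaling step does, the sketch below standardizes only selected columns of a toy matrix with plain NumPy and leaves a categorical column untouched. The array values and the `numeric_idx` list are made up for the example; `StandardScaler` divides by the population standard deviation, which is also what `np.std` computes by default.

```python
import numpy as np

# Toy matrix: columns 0 and 2 are numeric, column 1 is a label-encoded category
toy_X = np.array([
    [30.0, 0.0, 100.0],
    [50.0, 1.0, 300.0],
    [70.0, 2.0, 200.0],
])
numeric_idx = [0, 2]

# Per-column mean and population standard deviation (ddof=0)
col_mean = toy_X[:, numeric_idx].mean(axis=0)
col_std = toy_X[:, numeric_idx].std(axis=0)

# Overwrite only the numeric columns with their standardized values
toy_X[:, numeric_idx] = (toy_X[:, numeric_idx] - col_mean) / col_std
print(toy_X)
```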


#### 3. Initialize and Train TabNet Pretrainer

In [None]:
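# Initialize TabNet Pretrainer
# cat_idxs/cat_dims tell TabNet which columns of X are label-encoded categoricals
unsupervised_model = TabNetPretrainer(
    cat_idxs=[feature_columns.get_loc(col) for col in categorical_columns],
    cat_dims=[int(data[col].nunique()) for col in categorical_columns],
    cat_emb_dim=1,
    n_d=64,
    n_a=64,
    n_steps=5,
    gamma=1.5,
    n_independent=2,
    n_shared=2,
    lambda_sparse=1e-4,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',  # alternative: 'sparsemax'
    verbose=10,
)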

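# Train the model
# Note: eval_set below reuses the training data, so early stopping tracks the
# training reconstruction loss rather than a held-out loss
max_epochs = 1000
patience = 50  # Early stopping patience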

unsupervised_model.fit(
    X_train=X,
    eval_set=[X],
    pretraining_ratio=0.8,
    max_epochs=max_epochs,
    patience=patience,
    batch_size=1024,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)

print("Unsupervised pretraining completed.")


### Clustering and Severity Index Generation
#### 1. Extract Learned Representations

In [None]:
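# Extract encoder outputs; pytorch-tabnet has no public transform(), so this relies on network internals (assumption)
net = unsupervised_model.network.eval()
with torch.no_grad():
    steps_out, _ = net.encoder(net.embedder(torch.from_numpy(X.astype(np.float32)).to(unsupervised_model.device)))
embedded_representation = torch.stack(steps_out).sum(dim=0).cpu().numpy()
print(f"Extracted embedded representations with shape: {embedded_representation.shape}")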


#### 2. Perform Clustering on Learned Representations
We'll use KMeans clustering to cluster the representations. The number of clusters can be adjusted based on your requirements.

In [None]:
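# KMeans is already imported at the top of the notebook
n_clusters = 10

# n_init is set explicitly so results do not depend on the scikit-learn version default
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)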
cluster_labels = kmeans.fit_predict(embedded_representation)

# Add cluster labels to the data
data['cluster'] = cluster_labels
print("Performed clustering on embedded representations.")
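
One common way to choose the number of clusters, rather than fixing it up front, is to compare silhouette scores across candidate values. The sketch below does this on synthetic blobs; `toy_repr` stands in for the learned representations and the candidate range is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the learned representations: three well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_repr = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for each candidate k and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_repr)
    scores[k] = silhouette_score(toy_repr, labels)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real patient representations the scores are usually much closer together, so the result is best treated as one input alongside clinical judgment about how many severity levels are useful.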


#### 3. Map Clusters to Severity Index
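KMeans cluster labels are arbitrary identifiers, so their numeric order carries no inherent meaning. The mapping below treats a higher cluster label as higher severity purely as a placeholder; before interpreting the index, reorder the clusters using a clinically meaningful criterion based on your analysis.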

In [None]:
# Map clusters to severity index (placeholder: uses the arbitrary KMeans label order)
cluster_severity = {cluster: index for index, cluster in enumerate(sorted(data['cluster'].unique()))}
data['severity_index'] = data['cluster'].map(cluster_severity)

# Scale severity index to 0-10 range
# (min-max scaling alone is enough; standardizing first does not change the result)
severity_min = data['severity_index'].min()
severity_max = data['severity_index'].max()
data['severity_index_scaled'] = (data['severity_index'] - severity_min) / (severity_max - severity_min) * 10

print("Mapped clusters to severity index.")
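
Because cluster labels have no natural order, one option is to rank clusters by the mean of a proxy variable before assigning severity. The sketch below ranks toy clusters by their share of deceased patients; using `DECEASED` as the proxy is an assumption made for illustration, and the toy frame is hypothetical.

```python
import pandas as pd

# Toy data: cluster labels plus a binary proxy for severity (hypothetical values)
toy = pd.DataFrame({
    "cluster":  [0, 0, 1, 1, 2, 2],
    "DECEASED": [1, 1, 0, 0, 1, 0],
})

# Mean proxy value per cluster, sorted from least to most severe
proxy_by_cluster = toy.groupby("cluster")["DECEASED"].mean().sort_values()

# Rank 0 goes to the cluster with the lowest proxy mean
rank_of_cluster = {cluster: rank for rank, cluster in enumerate(proxy_by_cluster.index)}
toy["severity_index"] = toy["cluster"].map(rank_of_cluster)

# Rescale ranks to 0-10; max(..., 1) avoids dividing by zero with a single cluster
max_rank = max(len(rank_of_cluster) - 1, 1)
toy["severity_index_scaled"] = toy["severity_index"] / max_rank * 10
print(toy)
```

Any other proxy, or a weighted combination of several, can be substituted in the `groupby` step; the ranking logic stays the same.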


#### 4. Visualize Clusters

In [None]:
# Use t-SNE for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_results = tsne.fit_transform(embedded_representation)

# Create a DataFrame for visualization
visualization_df = pd.DataFrame()
visualization_df['tsne_1'] = tsne_results[:, 0]
visualization_df['tsne_2'] = tsne_results[:, 1]
visualization_df['cluster'] = data['cluster']

# Plot the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(
    x='tsne_1', y='tsne_2',
    hue='cluster',
    palette=sns.color_palette('hsv', n_clusters),
    data=visualization_df,
    legend='full',
    alpha=0.7
)
plt.title('t-SNE Visualization of Clusters')
plt.show()


### Explainable AI Integration with SHAP
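SHAP explains the output of a predictive function, and the pretrainer has no target to explain. As a workaround, we train a surrogate TabNetRegressor to predict each patient's reconstruction error and apply SHAP to that regressor. The resulting attributions show which features drive reconstruction error; they do not directly explain the severity index.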

#### 1. Train a Surrogate Regressor on Reconstruction Error

In [None]:
# For unsupervised TabNet, we can use the reconstruction error as the target
# We'll train a TabNetRegressor to predict reconstruction error for SHAP

# Compute reconstruction error
# TabNetPretrainer.predict returns both the reconstruction and the embedded
# inputs that the reconstruction is compared against
reconstructed_X, embedded_X = unsupervised_model.predict(X)
reconstruction_errors = np.mean((embedded_X - reconstructed_X) ** 2, axis=1)

# Train a TabNetRegressor on reconstruction errors
# TabNetRegressor expects 2D targets of shape (n_samples, n_outputs)
tabnet_regressor = TabNetRegressor()
tabnet_regressor.fit(
    X_train=X,
    y_train=reconstruction_errors.reshape(-1, 1),
    max_epochs=100,
    patience=20,
    batch_size=1024,
    virtual_batch_size=128,
)

print("Trained TabNetRegressor on reconstruction errors for SHAP analysis.")
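
The regression target here is a per-patient mean squared error between the input features and their reconstruction. A small worked example on made-up arrays shows the arithmetic and the reshape to the 2D target layout that TabNetRegressor requires; the array values are hypothetical.

```python
import numpy as np

# Two toy rows and their (made-up) reconstructions
original = np.array([
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
])
reconstructed = np.array([
    [1.0, 2.0, 5.0],
    [1.0, -1.0, 1.0],
])

# Mean squared error per row: average the squared differences across features
per_row_mse = np.mean((original - reconstructed) ** 2, axis=1)

# Reshape the 1D error vector into a single-column 2D target array
targets = per_row_mse.reshape(-1, 1)
print(per_row_mse)
print(targets.shape)
```

Rows that the pretrainer reconstructs poorly get larger targets, so the surrogate regressor learns which feature patterns are unusual relative to the rest of the cohort.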


#### 2. Compute SHAP Values

In [None]:
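# Initialize SHAP explainer
# Flatten the (n_samples, 1) regressor output so SHAP sees a single-output model
def predict_fn(x):
    return tabnet_regressor.predict(x).ravel()

explainer = shap.Explainer(predict_fn, X)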

# Compute SHAP values
shap_values = explainer(X)

# Save SHAP values
np.save('shap_values_unsupervised.npy', shap_values.values)
print("Computed and saved SHAP values.")


#### 3. Visualize Feature Importance 

In [None]:
# Plot summary of feature importance
shap.summary_plot(shap_values, feature_names=feature_columns, show=False)
plt.savefig('shap_summary_plot_unsupervised.png')
print("Generated SHAP summary plot.")


### Saving Results
#### 1. Save the Final DataFrame

In [None]:
# Save the data with severity index
data.to_csv('patient_severity_index_tabnet.csv', index=False)
print("Saved patient data with severity index to 'patient_severity_index_tabnet.csv'.")


#### 2. Save the Trained Models

In [None]:
# Save the unsupervised TabNet model
# (save_model appends the '.zip' extension itself, so the path is given without it)
unsupervised_model.save_model('tabnet_unsupervised_model')
print("Saved the unsupervised TabNet model.")

# Save the TabNetRegressor model
tabnet_regressor.save_model('tabnet_regressor_model')
print("Saved the TabNetRegressor model.")
