# Predicting Disease Recurrence Using SCNA Burden in Breast Cancer

This tutorial demonstrates how to build a machine learning model using somatic copy number alteration (SCNA) burden and clinical data to predict recurrence events in breast cancer. It uses a Random Forest classifier and includes steps for data cleaning, feature engineering, model training, and interpretation.

✅ **Key goals:**
- Make ML workflows usable in low-resource or real-world clinical settings.
- Enable others to swap in their own data with minimal effort.


## 📦 Requirements

Run this in a local Jupyter environment or [Google Colab](https://colab.research.google.com).

```bash
pip install pandas seaborn matplotlib scikit-learn
```
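
After installing, a quick check can confirm that the four packages are visible to the Python environment running the notebook. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as passed to pip in the install command above.
PACKAGES = ["pandas", "seaborn", "matplotlib", "scikit-learn"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before running the analysis cell.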


## 🧠 Use With Your Own Data

To use this notebook on your dataset:
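- Replace the provided TSV file with your own tab-separated file containing SCNA burden (`Fraction Genome Altered`) and recurrence data.
- Ensure the clinical target is binary and coded as Yes/No (e.g., `Disease Free Event`); after encoding, the code looks for a column named `Disease Free Event_Yes`.
- If your column names differ, update the `columns_needed` list in the cleaning step.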

💡 Tip: This pipeline runs on any laptop; no GPU or cloud is required.
- Training time: ~10 seconds
- Memory usage: <500 MB
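
The checklist above can be sketched as a small validation step to run before the main analysis. The helper `check_dataset`, its `rename_map` parameter, the short column name `FGA`, and every value in the toy table are hypothetical illustrations; only the required column names come from the cleaning step of this notebook.

```python
import pandas as pd

# Column names used by the cleaning step of this notebook.
REQUIRED_COLUMNS = [
    "Fraction Genome Altered", "Mutation Count", "Sex", "Race Category",
    "Ethnicity Category", "Breast Cancer Subtype", "Endocrine Therapy",
    "Clinical Benefit", "Disease Free Event",
]


def check_dataset(df, target="Disease Free Event", rename_map=None):
    """Hypothetical helper: rename columns, then verify the notebook's assumptions."""
    if rename_map:
        df = df.rename(columns=rename_map)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # The notebook expects a Yes/No target so that one-hot encoding
    # produces a column ending in '_Yes'.
    labels = set(df[target].dropna().unique())
    if labels != {"Yes", "No"}:
        found = sorted(map(str, labels))
        raise ValueError(f"Target must contain exactly 'Yes' and 'No'; found {found}")
    return df


# Toy table with made-up values, for illustration only.
toy = pd.DataFrame({
    "FGA": [0.12, 0.45, 0.30],  # hypothetical short name for the SCNA column
    "Mutation Count": [3, 17, 8],
    "Sex": ["Female", "Female", "Female"],
    "Race Category": ["A", "B", "A"],
    "Ethnicity Category": ["X", "X", "Y"],
    "Breast Cancer Subtype": ["S1", "S2", "S1"],
    "Endocrine Therapy": ["Yes", "No", "Yes"],
    "Clinical Benefit": ["Yes", "No", "Yes"],
    "Disease Free Event": ["No", "Yes", "No"],
})
checked = check_dataset(toy, rename_map={"FGA": "Fraction Genome Altered"})
print(checked.columns.tolist())
```

Failing early with a clear message is usually easier to debug than a `KeyError` raised partway through the analysis cell.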


In [None]:

# Breast Cancer SCNA Analysis Template for AMIA 2025 Submission
# Author: [Your Name]
# Objective: Predict clinical benefit or recurrence using SCNA burden and clinical data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import os

# Set output directory
output_dir = "analysis_outputs"
os.makedirs(output_dir, exist_ok=True)

# Load data (replace data_path with the location of your own TSV file)
data_path = r"C:\Users\ssulley\OneDrive - National Healthy Start Association\Research\merged_segment_clinical_data (1).tsv"
df = pd.read_csv(data_path, sep='\t', low_memory=False)

# Clean and subset
df = df.drop_duplicates()
columns_needed = [
    'Fraction Genome Altered', 'Mutation Count', 'Sex', 'Race Category', 'Ethnicity Category',
    'Breast Cancer Subtype', 'Endocrine Therapy', 'Clinical Benefit', 'Disease Free Event'
]
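# Subsetting can leave repeated rows (e.g., if the merged file has one row per
# copy-number segment), so deduplicate again to keep identical records from
# landing in both the training and test sets.
df = df[columns_needed].drop_duplicates()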
df = df[df['Clinical Benefit'].notna() & df['Disease Free Event'].notna()]

# Save clean data
df.to_csv(os.path.join(output_dir, "cleaned_data.csv"), index=False)

# Visualize SCNA burden vs disease event
plt.figure(figsize=(8,6))
sns.boxplot(x='Disease Free Event', y='Fraction Genome Altered', data=df)
plt.title("SCNA Burden vs Disease Recurrence")
plt.savefig(os.path.join(output_dir, "boxplot_SCNA_vs_recurrence.png"))
plt.close()

# Encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

# Print all column names for debugging
print("Encoded columns:")
print(df_encoded.columns.tolist())

# Automatically identify the encoded target column by matching its name
target_cols = [col for col in df_encoded.columns if 'Disease Free Event' in col and '_Yes' in col]
if not target_cols:
    raise ValueError("Target column for 'Disease Free Event_Yes' not found after encoding. Check data.")
target_col = target_cols[0]

# Define X and y
X = df_encoded.drop([target_col], axis=1)
y = df_encoded[target_col]

# Split data (stratified so train and test keep the same recurrence rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df.to_csv(os.path.join(output_dir, "classification_report.csv"))

roc_auc = roc_auc_score(y_test, y_prob)
fpr, tpr, _ = roc_curve(y_test, y_prob)

# Save ROC Curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], '--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Disease Free Event Prediction")
plt.legend()
plt.savefig(os.path.join(output_dir, "roc_curve.png"))
plt.close()

# Save feature importance
importances = model.feature_importances_
features = X.columns
feature_importance = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)
feature_importance.to_csv(os.path.join(output_dir, "feature_importance.csv"), index=False)

# Plot top 10 important features
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title("Top 10 Feature Importances")
plt.tight_layout()
plt.savefig(os.path.join(output_dir, "feature_importance_top10.png"))
plt.close()
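

The cell above reports AUC from a single 80/20 split, which can vary noticeably with small clinical cohorts. As a hedged complement, the sketch below estimates AUC with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`; the names `X_demo`, `y_demo`, and `model_cv` are assumptions for illustration, and the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the notebook's X and y (not real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_cv = RandomForestClassifier(n_estimators=100, random_state=42)

# One ROC AUC value per fold.
auc_scores = cross_val_score(model_cv, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("AUC per fold:", np.round(auc_scores, 3))
print(f"Mean AUC: {auc_scores.mean():.3f} (SD {auc_scores.std():.3f})")
```

To apply this to the notebook's data, pass `X` and `y` from the cell above in place of `X_demo` and `y_demo`. Reporting the mean and spread across folds gives a more stable picture than one test split.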


## 💡 Clinical & Research Relevance

This tool helps explore how genomic instability (SCNA burden) relates to clinical outcomes like recurrence or benefit from therapy. It can be extended to:
- Evaluate prognostic indicators
- Support decision-making in quality improvement (QI) settings
- Guide future biomarker development in low-resource settings
