# AI-Driven Cancer Target Identification and Drug Discovery

## Comprehensive Analysis Workflow

This notebook demonstrates key computational approaches for cancer target identification and drug discovery on synthetic data, including:
- Statistical analysis (t-tests with multiple-testing correction)
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Machine learning models for classification, as a stand-in for drug response prediction
- Feature importance analysis across models

**Reference Paper**: Signal Transduction and Targeted Therapy - https://www.nature.com/articles/s41392-022-00994-0



## 1. Import Libraries and Setup


In [None]:
# Core data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine learning
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, 
                            roc_auc_score, accuracy_score, roc_curve)
from sklearn.neural_network import MLPClassifier

# Dimensionality reduction
from sklearn.manifold import TSNE
import umap.umap_ as umap

# Statistical analysis
from scipy import stats
from scipy.stats import ttest_ind, f_oneway
import pingouin as pg

# Network analysis
import networkx as nx

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")



## 2. Generate Synthetic Gene Expression Data

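Because real TCGA data is not available here, we generate synthetic gene expression data that mimics cancer vs. normal samples: 200 genes, 100 samples per group, and 50 differentially expressed genes (25 upregulated and 25 downregulated in cancer). The data is used only to demonstrate the analysis workflow.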


In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Parameters
n_genes = 200  # Number of genes in our analysis
n_cancer_samples = 100
n_normal_samples = 100
n_differential_genes = 50  # Genes that are differentially expressed

# Generate gene names
gene_names = [f"Gene_{i+1}" for i in range(n_genes)]

# Generate synthetic data
# Normal samples: baseline expression with some variation
normal_data = np.random.normal(loc=5.0, scale=1.5, size=(n_normal_samples, n_genes))

# Cancer samples: some genes are upregulated, some downregulated
cancer_data = np.random.normal(loc=5.0, scale=1.5, size=(n_cancer_samples, n_genes))

# Create differential expression
differential_indices = np.random.choice(n_genes, n_differential_genes, replace=False)

for idx in differential_indices[:n_differential_genes//2]:
    # Upregulated genes
    cancer_data[:, idx] = np.random.normal(loc=8.0, scale=2.0, size=n_cancer_samples)

for idx in differential_indices[n_differential_genes//2:]:
    # Downregulated genes
    cancer_data[:, idx] = np.random.normal(loc=2.0, scale=1.0, size=n_cancer_samples)

# Combine data
data = np.vstack([normal_data, cancer_data])

# Create labels
labels = np.array(['Normal'] * n_normal_samples + ['Cancer'] * n_cancer_samples)

# Create DataFrame
df_gene_exp = pd.DataFrame(data, columns=gene_names)
df_gene_exp['Sample_Type'] = labels

print(f"Generated synthetic data:")
print(f"  - Total samples: {len(df_gene_exp)}")
print(f"  - Normal samples: {n_normal_samples}")
print(f"  - Cancer samples: {n_cancer_samples}")
print(f"  - Genes analyzed: {n_genes}")
print(f"  - Differentially expressed: {n_differential_genes}")
print(f"\nData shape: {df_gene_exp.shape}")
df_gene_exp.head()



## 3. Statistical Analysis: T-tests for Differential Expression

T-tests compare means between groups and are fundamental for identifying significantly differentially expressed genes.
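
As a minimal illustration of the idea on a single gene, the sketch below runs Welch's t-test (which does not assume equal variances) and computes Cohen's d on toy data. The arrays, group sizes, and random seed are hypothetical and chosen only for this example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical toy data: one "gene" measured in two groups of 30 samples.
normal = rng.normal(loc=5.0, scale=1.5, size=30)
cancer = rng.normal(loc=8.0, scale=2.0, size=30)

# Welch's t-test (equal_var=False) does not assume equal group variances.
# Passing cancer first makes a positive t mean "higher in cancer".
t_stat, p_value = ttest_ind(cancer, normal, equal_var=False)

# Cohen's d using the pooled standard deviation for equal group sizes.
pooled_std = np.sqrt((normal.var(ddof=1) + cancer.var(ddof=1)) / 2)
cohens_d = (cancer.mean() - normal.mean()) / pooled_std

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, d = {cohens_d:.2f}")
```

The p-value says whether the difference is unlikely under the null hypothesis, while Cohen's d says how large the difference is relative to the spread, so both are worth reporting.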


In [None]:
# Perform t-tests for each gene
t_test_results = []

for gene in gene_names:
    normal_exp = df_gene_exp[df_gene_exp['Sample_Type'] == 'Normal'][gene]
    cancer_exp = df_gene_exp[df_gene_exp['Sample_Type'] == 'Cancer'][gene]
    
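    # Welch's t-test (cancer vs. normal): the groups may have unequal variances,
    # and this argument order gives T_Statistic the same sign as Mean_Difference
    t_stat, p_value = ttest_ind(cancer_exp, normal_exp, equal_var=False)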
    
    # Calculate effect size (Cohen's d)
    mean_diff = cancer_exp.mean() - normal_exp.mean()
    pooled_std = np.sqrt((normal_exp.std()**2 + cancer_exp.std()**2) / 2)
    cohens_d = mean_diff / pooled_std
    
    t_test_results.append({
        'Gene': gene,
        'Mean_Normal': normal_exp.mean(),
        'Mean_Cancer': cancer_exp.mean(),
        'Mean_Difference': mean_diff,
        'T_Statistic': t_stat,
        'P_Value': p_value,
        'Cohen_D': cohens_d,
        'Std_Normal': normal_exp.std(),
        'Std_Cancer': cancer_exp.std()
    })

# Create results DataFrame
df_tests = pd.DataFrame(t_test_results)

# Add significance markers
df_tests['Significant'] = df_tests['P_Value'] < 0.05

# Multiple testing correction (Bonferroni). Note: Bonferroni controls the
# family-wise error rate, not the false discovery rate (FDR).
num_comparisons = len(df_tests)
df_tests['Bonferroni_Significant'] = df_tests['P_Value'] < 0.05 / num_comparisons

# Sort by P-value
df_tests = df_tests.sort_values('P_Value')

print("T-test Results Summary:")
print(f"Significant genes (p < 0.05): {df_tests['Significant'].sum()} out of {len(df_tests)}")
print(f"Significant after Bonferroni correction: {df_tests['Bonferroni_Significant'].sum()}")

# Display top results
print("\nTop 10 Most Significantly Differentially Expressed Genes:")
display(df_tests.head(10)[['Gene', 'Mean_Normal', 'Mean_Cancer', 'Mean_Difference', 
                            'P_Value', 'Cohen_D', 'Significant']])
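

The cell above applies a Bonferroni threshold, which controls the family-wise error rate and can be conservative when many genes are tested. A common alternative in expression studies is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a plain NumPy version; the function name `benjamini_hochberg` and the example p-values are hypothetical and not part of the original notebook.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Compare the k-th smallest p-value against (k / m) * alpha.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes.
        k_max = np.nonzero(below)[0].max()
        rejected[order[:k_max + 1]] = True
    return rejected

# Hypothetical p-values for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_mask = benjamini_hochberg(example_p, alpha=0.05)
bonf_mask = np.asarray(example_p) < 0.05 / len(example_p)

print("BH rejections:", int(bh_mask.sum()))
print("Bonferroni rejections:", int(bonf_mask.sum()))
```

In this notebook the same function could be applied to `df_tests['P_Value'].values` to obtain an FDR-based significance mask alongside the Bonferroni one.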



In [None]:
# Visualize t-test results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. P-value distribution
axes[0, 0].hist(df_tests['P_Value'], bins=50, alpha=0.7, color='steelblue', edgecolor='black')
axes[0, 0].axvline(0.05, color='red', linestyle='--', label='p = 0.05 threshold')
axes[0, 0].set_xlabel('P-Value', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)
axes[0, 0].set_title('Distribution of P-Values from T-tests', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Volcano plot
sig_threshold = 0.05
fold_change = df_tests['Mean_Difference']
log_p_values = -np.log10(df_tests['P_Value'])

colors = ['red' if (p < sig_threshold and abs(fc) > 1) else 'gray' 
          for p, fc in zip(df_tests['P_Value'], fold_change)]

axes[0, 1].scatter(fold_change, log_p_values, c=colors, alpha=0.6, s=30)
axes[0, 1].axhline(-np.log10(sig_threshold), color='red', linestyle='--', label='p = 0.05')
axes[0, 1].axvline(-1, color='blue', linestyle='--', alpha=0.5)
axes[0, 1].axvline(1, color='blue', linestyle='--', alpha=0.5)
axes[0, 1].set_xlabel('Mean Difference (Cancer - Normal)', fontsize=12)
axes[0, 1].set_ylabel('-log10(P-Value)', fontsize=12)
axes[0, 1].set_title('Volcano Plot: Differential Gene Expression', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# 3. Effect sizes (Cohen's d)
sig_genes = df_tests[df_tests['Significant']]
axes[1, 0].barh(range(min(20, len(sig_genes))), 
                sig_genes.head(20).sort_values('Cohen_D', ascending=True)['Cohen_D'],
                color='coral', alpha=0.8, edgecolor='black')
axes[1, 0].set_yticks(range(min(20, len(sig_genes))))
axes[1, 0].set_yticklabels(sig_genes.head(20).sort_values('Cohen_D', ascending=True)['Gene'], fontsize=8)
axes[1, 0].set_xlabel("Cohen's d (Effect Size)", fontsize=12)
axes[1, 0].set_title('Effect Sizes of the 20 Most Significant Genes', fontsize=14, fontweight='bold')
axes[1, 0].axvline(0, color='black', linestyle='-', linewidth=0.5)
axes[1, 0].grid(alpha=0.3, axis='x')

# 4. Summary statistics
summary_data = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Std', 'Min', 'Max'],
    'P_Value': [df_tests['P_Value'].mean(), df_tests['P_Value'].median(), 
                df_tests['P_Value'].std(), df_tests['P_Value'].min(), 
                df_tests['P_Value'].max()]
})

axes[1, 1].axis('tight')
axes[1, 1].axis('off')
table = axes[1, 1].table(cellText=summary_data.values,
                         colLabels=summary_data.columns,
                         cellLoc='center',
                         loc='center',
                         bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1, 2)
axes[1, 1].set_title('P-Value Summary Statistics', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\nStatistical analysis complete!")



## 4. Dimensionality Reduction: PCA, t-SNE, and UMAP

Dimensionality reduction is crucial for visualizing and understanding high-dimensional omics data.


In [None]:
# Prepare data for dimensionality reduction
X = df_gene_exp.drop('Sample_Type', axis=1).values
y = df_gene_exp['Sample_Type'].values

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 1. PCA - Linear dimensionality reduction
print("Performing PCA...")
pca = PCA(n_components=min(50, X_scaled.shape[1]))
X_pca = pca.fit_transform(X_scaled)

# Calculate explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

print(f"PCA: First 10 components explain {cumulative_variance[9]:.2%} of variance")
print(f"PCA: First 20 components explain {cumulative_variance[19]:.2%} of variance")

# 2. t-SNE - Non-linear dimensionality reduction
print("\nPerforming t-SNE (this may take a moment)...")
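# The iteration count is left at its default (1000); the keyword for it was
# renamed from n_iter to max_iter in newer scikit-learn releases
tsne = TSNE(n_components=2, random_state=42, perplexity=30)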
X_tsne = tsne.fit_transform(X_scaled)
print("t-SNE complete!")

# 3. UMAP - Non-linear dimensionality reduction
print("\nPerforming UMAP...")
umap_reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
X_umap = umap_reducer.fit_transform(X_scaled)
print("UMAP complete!")

print("\nDimensionality reduction complete!")
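

A frequent follow-up question is how many principal components are needed to reach a target fraction of explained variance. The sketch below shows two ways to answer it on a hypothetical toy matrix (`X_toy` is an assumption for illustration, not the notebook's data): reading the count off the cumulative variance curve, and letting scikit-learn pick it by passing a fraction as `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 60 samples x 20 features, for illustration only.
X_toy = rng.normal(size=(60, 20))
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit all components, then find the smallest count whose cumulative
# explained variance reaches the target.
pca_full = PCA().fit(X_toy_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# scikit-learn can also select the count directly from a variance fraction.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_toy_scaled)

print(n_for_95, pca_95.n_components_)
```

On uncorrelated toy data like this, many components are needed; real expression data with strong co-regulation typically concentrates variance in fewer components.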



In [None]:
# Visualize dimensionality reduction results
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Color mapping
colors = {'Normal': 'steelblue', 'Cancer': 'crimson'}
sample_colors = [colors[label] for label in y]

# PCA plots
# 1. Explained variance
axes[0, 0].bar(range(1, 21), explained_variance[:20], alpha=0.7, color='teal', edgecolor='black')
axes[0, 0].plot(range(1, 21), cumulative_variance[:20], 'ro-', markersize=6, label='Cumulative')
axes[0, 0].axhline(0.95, color='red', linestyle='--', alpha=0.5, label='95% variance')
axes[0, 0].set_xlabel('Principal Component', fontsize=12)
axes[0, 0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0, 0].set_title('PCA: Explained Variance by Component', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. PCA scatter plot
scatter1 = axes[0, 1].scatter(X_pca[:, 0], X_pca[:, 1], c=sample_colors, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
axes[0, 1].set_xlabel(f'PC1 ({explained_variance[0]:.2%} variance)', fontsize=12)
axes[0, 1].set_ylabel(f'PC2 ({explained_variance[1]:.2%} variance)', fontsize=12)
axes[0, 1].set_title('PCA: First Two Components', fontsize=14, fontweight='bold')
axes[0, 1].grid(alpha=0.3)
# Add legend
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[label], 
                               markersize=10, label=label) for label in ['Normal', 'Cancer']]
axes[0, 1].legend(handles=legend_elements, loc='best')

# 3. PCA components 3 and 4
scatter3 = axes[0, 2].scatter(X_pca[:, 2], X_pca[:, 3], c=sample_colors, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
axes[0, 2].set_xlabel(f'PC3 ({explained_variance[2]:.2%} variance)', fontsize=12)
axes[0, 2].set_ylabel(f'PC4 ({explained_variance[3]:.2%} variance)', fontsize=12)
axes[0, 2].set_title('PCA: Components 3 and 4', fontsize=14, fontweight='bold')
axes[0, 2].grid(alpha=0.3)
axes[0, 2].legend(handles=legend_elements, loc='best')

# t-SNE plot
scatter2 = axes[1, 0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=sample_colors, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
axes[1, 0].set_xlabel('t-SNE 1', fontsize=12)
axes[1, 0].set_ylabel('t-SNE 2', fontsize=12)
axes[1, 0].set_title('t-SNE Visualization', fontsize=14, fontweight='bold')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].legend(handles=legend_elements, loc='best')

# UMAP plot
scatter4 = axes[1, 1].scatter(X_umap[:, 0], X_umap[:, 1], c=sample_colors, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
axes[1, 1].set_xlabel('UMAP 1', fontsize=12)
axes[1, 1].set_ylabel('UMAP 2', fontsize=12)
axes[1, 1].set_title('UMAP Visualization', fontsize=14, fontweight='bold')
axes[1, 1].grid(alpha=0.3)
axes[1, 1].legend(handles=legend_elements, loc='best')

# Comparison text
axes[1, 2].axis('off')
comparison_text = '''
Dimensionality Reduction Comparison:

PCA (Principal Component Analysis):
• Linear transformation
• Preserves global structure
• Interpretable components
• Fast computation

t-SNE:
• Non-linear transformation
• Preserves local neighborhoods
• Good for visualization
• Computationally intensive

UMAP:
• Non-linear transformation
• Preserves local & global structure
• Fast and scalable
• Good for both visualization and ML
'''
axes[1, 2].text(0.1, 0.5, comparison_text, fontsize=12, family='monospace',
                verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\nVisualization complete!")



## 5. Machine Learning Models for Drug Response Prediction

We'll train several ML models to classify cancer vs. normal samples from gene expression features. This binary task serves as a stand-in for drug response prediction; no drug response labels are used.


In [None]:
# Prepare data for ML

# Encode labels. LabelEncoder sorts class names, so 'Cancer' -> 0 and 'Normal' -> 1
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Features: {X_train.shape[1]}")

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, C=1.0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=5),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
}

# Train and evaluate models
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train
    model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    # Column 1 is the probability of class 1 ('Normal' under the encoder's sorted order)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'auc': auc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  AUC-ROC: {auc:.4f}")
    print(f"  CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

print("\nModel training complete!")
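

In the cell above, `X_scaled` was standardized on the full dataset before splitting, so the held-out samples contribute to the scaling parameters. One way to avoid this, sketched below under stated assumptions, is to put the scaler inside a scikit-learn pipeline so it is refit within each cross-validation fold. The toy dataset from `make_classification` and the names `X_demo`, `y_demo`, and `pipeline` are hypothetical stand-ins for the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy dataset standing in for the expression matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=42
)

# The scaler is refit on the training portion of every fold, so the
# held-out fold never influences the scaling parameters.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

For standardization alone the effect of this leakage is usually small, but the pipeline pattern matters more once feature selection or other fitted preprocessing steps are added.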



In [None]:
# Create comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# 1. Model comparison bar plot
model_names = list(results.keys())
accuracies = [results[m]['accuracy'] for m in model_names]
aucs = [results[m]['auc'] for m in model_names]

x = np.arange(len(model_names))
width = 0.35

axes[0, 0].bar(x - width/2, accuracies, width, label='Accuracy', color='steelblue', alpha=0.8, edgecolor='black')
axes[0, 0].bar(x + width/2, aucs, width, label='AUC-ROC', color='crimson', alpha=0.8, edgecolor='black')
axes[0, 0].set_ylabel('Score', fontsize=12)
axes[0, 0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(model_names, rotation=45, ha='right', fontsize=10)
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3, axis='y')
axes[0, 0].set_ylim([0.5, 1.0])

# 2. ROC curves for all models
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
    axes[0, 1].plot(fpr, tpr, label=f'{name} (AUC={result["auc"]:.3f})', linewidth=2)

axes[0, 1].plot([0, 1], [0, 1], 'k--', label='Random Classifier')
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curves Comparison', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# 3. Confusion matrices for top 2 models
# Sort by AUC
sorted_models = sorted(results.items(), key=lambda x: x[1]['auc'], reverse=True)

for idx, (name, result) in enumerate(sorted_models[:2]):
    cm = confusion_matrix(y_test, result['y_pred'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, idx], 
                cbar_kws={'label': 'Count'}, square=True, linewidths=1, linecolor='black')
    axes[1, idx].set_xlabel('Predicted', fontsize=12)
    axes[1, idx].set_ylabel('Actual', fontsize=12)
    axes[1, idx].set_title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
    # Label ticks in the encoder's class order (0 = 'Cancer', 1 = 'Normal')
    axes[1, idx].set_xticklabels(le.classes_)
    axes[1, idx].set_yticklabels(le.classes_)

plt.tight_layout()
plt.show()

# Create results summary table
summary_df = pd.DataFrame({
    'Model': model_names,
    'Accuracy': [results[m]['accuracy'] for m in model_names],
    'AUC-ROC': [results[m]['auc'] for m in model_names],
    'CV Accuracy (Mean)': [results[m]['cv_mean'] for m in model_names],
    'CV Accuracy (Std)': [results[m]['cv_std'] for m in model_names]
})

print("\nDetailed Results Summary:")
print("="*80)
display(summary_df.round(4))
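

One detail worth checking when reading confusion matrices and ROC curves is which class each integer code stands for. `LabelEncoder` assigns codes in sorted order of the class names, so the order in which labels first appear in the data does not matter. The short check below uses a hypothetical label array (`labels_demo`) to make the mapping explicit.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, listed with 'Normal' first on purpose.
labels_demo = np.array(["Normal", "Normal", "Cancer", "Cancer"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels_demo)

# classes_ is sorted, so code 0 is 'Cancer' and code 1 is 'Normal'.
print(list(encoder.classes_))
print(encoded.tolist())

# Confusion-matrix rows and columns follow the same order, so tick
# labels can be taken directly from classes_.
tick_labels = list(encoder.classes_)
```

The same mapping determines what `predict_proba(...)[:, 1]` means: it is the predicted probability of the class with code 1.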



## 6. Feature Importance Analysis

Understanding which genes (features) are most important for model predictions provides biological insights.


In [None]:
# Extract feature importance from tree-based models
rf_model = results['Random Forest']['model']
gb_model = results['Gradient Boosting']['model']

# Get feature importances
rf_importance = pd.DataFrame({
    'Gene': gene_names,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

gb_importance = pd.DataFrame({
    'Gene': gene_names,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualize top features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Random Forest
top_n = 20
axes[0].barh(range(top_n), rf_importance.head(top_n).sort_values('Importance')['Importance'],
             color='steelblue', alpha=0.8, edgecolor='black')
axes[0].set_yticks(range(top_n))
axes[0].set_yticklabels(rf_importance.head(top_n).sort_values('Importance')['Gene'], fontsize=9)
axes[0].set_xlabel('Importance', fontsize=12)
axes[0].set_title('Random Forest: Top 20 Most Important Genes', fontsize=14, fontweight='bold')
axes[0].grid(alpha=0.3, axis='x')
# barh draws index 0 at the bottom, so the ascending sort places the most important gene at the top

# Gradient Boosting
axes[1].barh(range(top_n), gb_importance.head(top_n).sort_values('Importance')['Importance'],
             color='crimson', alpha=0.8, edgecolor='black')
axes[1].set_yticks(range(top_n))
axes[1].set_yticklabels(gb_importance.head(top_n).sort_values('Importance')['Gene'], fontsize=9)
axes[1].set_xlabel('Importance', fontsize=12)
axes[1].set_title('Gradient Boosting: Top 20 Most Important Genes', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3, axis='x')
# barh draws index 0 at the bottom, so the ascending sort places the most important gene at the top

plt.tight_layout()
plt.show()

# Compare with differential expression results
print("\nComparing Feature Importance with Differential Expression:")
print("="*80)

# Find common top features
rf_top_genes = set(rf_importance.head(20)['Gene'])
gb_top_genes = set(gb_importance.head(20)['Gene'])
de_top_genes = set(df_tests.head(20)['Gene'])

print(f"Genes in top 20 of Random Forest: {len(rf_top_genes)}")
print(f"Genes in top 20 of Gradient Boosting: {len(gb_top_genes)}")
print(f"Genes in top 20 of differential expression: {len(de_top_genes)}")
print(f"\nGenes in both RF and DE top 20: {len(rf_top_genes & de_top_genes)}")
print(f"Genes in both GB and DE top 20: {len(gb_top_genes & de_top_genes)}")
print(f"Genes in all three top 20s: {len(rf_top_genes & gb_top_genes & de_top_genes)}")

# Show overlap
print("\nCommon important genes across all methods:")
common_genes = rf_top_genes & gb_top_genes & de_top_genes
for gene in sorted(common_genes)[:10]:
    print(f"  - {gene}")
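

The impurity-based `feature_importances_` used above are computed on the training data and can favor features with many possible split points. Permutation importance on held-out data is a complementary check: it measures how much the score drops when one feature's values are shuffled. The sketch below uses a hypothetical toy dataset (`X_demo`, `y_demo`) rather than the notebook's expression matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical toy data: with shuffle=False the first 5 columns are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the accuracy drop.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

# Rank features from largest to smallest mean importance.
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")
```

Genes that rank highly under both impurity-based and permutation importance, and that are also differentially expressed, are stronger candidates for follow-up than genes flagged by only one method.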



## 7. Conclusion

This notebook demonstrated key computational approaches for AI-driven cancer target identification and drug discovery:

1. **Statistical Analysis**: T-tests for identifying differentially expressed genes
2. **Dimensionality Reduction**: PCA, t-SNE, and UMAP for visualization and feature reduction
3. **Machine Learning**: Multiple classifiers for cancer vs. normal classification, as a stand-in for drug response prediction
4. **Feature Analysis**: Understanding which genes drive predictions

These methods are fundamental to modern precision medicine and drug discovery workflows.
