# Getting Started with AL-FEP

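This notebook demonstrates basic usage of the AL-FEP framework, which applies active learning and reinforcement learning to molecular virtual screening. It walks through oracle setup, dataset creation, evaluation, and an active learning run; reinforcement learning is listed under Next Steps and is not covered here.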

## Overview

The AL-FEP framework provides:
- **Multiple Oracles**: FEP, Docking, and ML-FEP for molecular evaluation
- **Active Learning**: Selection strategies, such as uncertainty sampling, that choose which molecules to evaluate next
- **Reinforcement Learning**: Agent-based molecular discovery
- **Target-specific optimization**: Pre-configured for 7JVR (SARS-CoV-2 Main Protease)

In [None]:
# Import necessary libraries
import sys
import os

# Add src to path
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# AL-FEP imports
from al_fep import (
    FEPOracle, DockingOracle, MLFEPOracle,
    ActiveLearningPipeline,
    MolecularDataset,
    setup_logging, load_config
)

# Setup logging
setup_logging(level="INFO")

print("AL-FEP framework loaded successfully!")

## Configuration Setup

Load the configuration for the 7JVR target:

In [None]:
# Load configuration
config = load_config(
    '../config/targets/7jvr.yaml',
    '../config/default.yaml'
)

print("Configuration loaded for target:", config.get('target_info', {}).get('name'))
print("PDB ID:", config.get('target_info', {}).get('pdb_id'))
print("Binding site center:", config.get('binding_site', {}).get('center'))
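
`load_config` is called with a target file and a default file, but this notebook does not show how it resolves conflicts between them. The following is a minimal sketch of one common approach: a recursive merge in which target-specific values override defaults. The `deep_merge` helper and the example keys and values are hypothetical; they are not AL-FEP's implementation or its real configuration.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where values in `overrides` take precedence.

    Nested dicts are merged key by key; any other value is replaced whole.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for default.yaml and 7jvr.yaml after parsing.
default_cfg = {
    "binding_site": {"center": [0.0, 0.0, 0.0], "size": [20.0, 20.0, 20.0]},
    "logging": {"level": "INFO"},
}
target_cfg = {
    "target_info": {"name": "example-target", "pdb_id": "7JVR"},
    "binding_site": {"center": [1.0, 2.0, 3.0]},
}

config_sketch = deep_merge(default_cfg, target_cfg)
print(config_sketch["binding_site"])
```

In this sketch the target file replaces only `binding_site.center`, while `binding_site.size` and the logging level are inherited from the defaults. Neither input dict is modified.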

## Oracle Setup

Initialize the different oracles for molecular evaluation:

In [None]:
# Initialize oracles
print("Initializing oracles...")

# ML-FEP Oracle (fast, low-cost predictions)
ml_fep_oracle = MLFEPOracle(
    target="7jvr",
    config=config
)

# Note: FEP and Docking oracles require additional setup
# For this demo, we'll focus on ML-FEP

print(f"ML-FEP Oracle initialized: {ml_fep_oracle}")

## Molecular Dataset

Create a molecular dataset for evaluation:

In [None]:
# Example SMILES for drug-like molecules
example_smiles = [
    "CC(C)CC(NC(=O)C(NC(=O)OC(C)(C)C)C(C)C)C(=O)NC1CCCCC1",  # Peptidomimetic
    "CCN(CC)CCCC(C)NC1=C2N=CC=NC2=NC=N1",  # Pteridine derivative (fused pyrimidine and pyrazine rings)
    "CC1=CC=C(C=C1)C(=O)NC2=CC=C(C=C2)S(=O)(=O)N",  # Sulfonamide
    "COC1=CC=C(C=C1)C(=O)NC2=CC=CC=N2",  # Benzamide
    "CC(C)(C)OC(=O)NC1CCN(C1)C(=O)C2=CC=C(C=C2)F",  # Fluorinated compound
    "C1=CC=C(C=C1)C(=O)NC2=CC=C(C=C2)C(=O)O",  # Benzoic acid derivative
    "CC1=CC=C(C=C1)S(=O)(=O)NC2=CC=C(C=C2)C(=O)N",  # Sulfonamide
    "COC1=CC=C2C(=C1)N=CN2C3CCNCC3",  # Benzimidazole
    "CC(C)NC1=NC=NC2=C1C=CC=C2",  # Quinazoline
    "CC1=CC=C(C=C1)C(=O)NC2=CC=C(C=C2)OC",  # Methoxy benzamide
]

# Create molecular dataset
dataset = MolecularDataset(
    smiles=example_smiles,
    name="Example_7JVR_Molecules"
)

# Calculate molecular descriptors
dataset.calculate_descriptors()

print(f"Dataset created with {len(dataset)} molecules")
print("\nDataset preview:")
display(dataset.data.head())

## Oracle Evaluation

Evaluate molecules using the ML-FEP oracle:

In [None]:
# Evaluate molecules with ML-FEP oracle
print("Evaluating molecules with ML-FEP oracle...")

smiles_list = dataset.get_smiles()
results = ml_fep_oracle.evaluate(smiles_list)

# Create results DataFrame
results_df = pd.DataFrame(results)

print("\nEvaluation results:")
display(results_df[['smiles', 'score', 'ml_fep_score', 'uncertainty', 'confidence']].head())

# Show oracle statistics
print("\nOracle statistics:")
stats = ml_fep_oracle.get_statistics()
for key, value in stats.items():
    print(f"{key}: {value}")
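
The results table includes `score`, `uncertainty`, and `confidence` columns, but the notebook does not say how the oracle derives them. One common way to obtain such values is to use an ensemble of models: report the mean prediction as the score and the spread across members as the uncertainty. The sketch below assumes that approach. The `summarize_ensemble` helper, the `1 / (1 + uncertainty)` confidence mapping, and the numbers are all illustrative assumptions, not the ML-FEP oracle's method.

```python
from statistics import mean, pstdev


def summarize_ensemble(predictions):
    """Collapse per-model predictions for one molecule into summary fields."""
    score = mean(predictions)
    uncertainty = pstdev(predictions)       # spread across ensemble members
    confidence = 1.0 / (1.0 + uncertainty)  # illustrative mapping into (0, 1]
    return {"score": score, "uncertainty": uncertainty, "confidence": confidence}


# Hypothetical predictions (arbitrary units) from a three-model ensemble.
agreeing = summarize_ensemble([-7.0, -7.0, -7.0])
disagreeing = summarize_ensemble([-9.0, -7.0, -5.0])

print(agreeing)
print(disagreeing)
```

Both molecules have the same mean score, but the second has a larger spread and therefore lower confidence. This is the kind of molecule that uncertainty sampling would prioritize.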

## Visualization

Visualize the evaluation results:

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Score distribution
axes[0, 0].hist(results_df['score'].dropna(), bins=10, alpha=0.7)
axes[0, 0].set_title('ML-FEP Score Distribution')
axes[0, 0].set_xlabel('Score')
axes[0, 0].set_ylabel('Frequency')

# Uncertainty vs Score
axes[0, 1].scatter(results_df['uncertainty'], results_df['score'], alpha=0.7)
axes[0, 1].set_title('Uncertainty vs Score')
axes[0, 1].set_xlabel('Uncertainty')
axes[0, 1].set_ylabel('Score')

# Confidence distribution
axes[1, 0].hist(results_df['confidence'].dropna(), bins=10, alpha=0.7)
axes[1, 0].set_title('Confidence Distribution')
axes[1, 0].set_xlabel('Confidence')
axes[1, 0].set_ylabel('Frequency')

# Score vs Molecular Weight (from dataset)
merged_df = pd.merge(
    results_df[['smiles', 'score']], 
    dataset.data[['smiles', 'MolWt']], 
    on='smiles'
)
axes[1, 1].scatter(merged_df['MolWt'], merged_df['score'], alpha=0.7)
axes[1, 1].set_title('Score vs Molecular Weight')
axes[1, 1].set_xlabel('Molecular Weight')
axes[1, 1].set_ylabel('Score')

plt.tight_layout()
plt.show()

## Active Learning Demo

Run active learning with uncertainty sampling, selecting 3 molecules per iteration for up to 5 iterations:

In [None]:
# Create active learning pipeline
al_pipeline = ActiveLearningPipeline(
    oracles=[ml_fep_oracle],
    strategy="uncertainty_sampling",
    batch_size=3,
    max_iterations=5,
    config=config
)

# Load molecular pool
al_pipeline.load_molecular_pool(smiles_list)

print(f"Active learning pipeline created with {len(al_pipeline.molecular_pool)} molecules")

# Run active learning
print("\nRunning active learning...")
final_results = al_pipeline.run()

print("\nActive learning completed!")
print(f"Total iterations: {final_results['total_iterations']}")
print(f"Total evaluated: {final_results['total_evaluated']}")

# Show best molecules
print("\nTop 5 molecules found:")
best_molecules = final_results['best_molecules']
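for i, mol in enumerate(best_molecules[:5], 1):
    # The oracle results above use 'ml_fep_score'; fall back to the hyphenated key if present
    score = mol.get('ml_fep_score', mol.get('ml-fep_score', 0))
    smiles = mol['smiles']
    # Append an ellipsis only when the SMILES string is actually truncated
    shown = smiles if len(smiles) <= 50 else smiles[:50] + "..."
    print(f"{i}. Score: {score:.3f} - {shown}")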
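
The internals of `ActiveLearningPipeline.run()` are not shown in this notebook. As a rough illustration of the uncertainty sampling strategy named above, the sketch below repeatedly selects the most uncertain unevaluated molecules until either the iteration budget or the pool runs out. The `uncertainty_sampling` function, the molecule IDs, and the uncertainty values are hypothetical. A real pipeline would typically retrain its model after each batch, so the uncertainties would change between iterations.

```python
def uncertainty_sampling(pool, uncertainty_of, batch_size, max_iterations):
    """Repeatedly pick the `batch_size` most uncertain unevaluated items.

    `uncertainty_of` maps an item to a number; in a real pipeline the model
    would be retrained after each batch and these values would change.
    """
    remaining = list(pool)
    batches = []
    for _ in range(max_iterations):
        if not remaining:
            break  # pool exhausted before the iteration budget was used
        remaining.sort(key=uncertainty_of, reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        batches.append(batch)
    return batches


# Hypothetical molecule IDs with made-up uncertainty values.
toy_uncertainty = {f"mol_{i}": u for i, u in enumerate(
    [0.10, 0.90, 0.40, 0.75, 0.20, 0.60, 0.05, 0.85, 0.30, 0.50])}

batches = uncertainty_sampling(
    pool=toy_uncertainty, uncertainty_of=toy_uncertainty.get,
    batch_size=3, max_iterations=5,
)
for i, batch in enumerate(batches, 1):
    print(f"Iteration {i}: {batch}")
```

With 10 molecules, a batch size of 3, and a budget of 5 iterations, this sketch exhausts the pool after four rounds (3 + 3 + 3 + 1) and stops early. The same arithmetic applies to the pool loaded in the cell above.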

## Feature Importance Analysis

Analyze which molecular features are most important for the ML-FEP predictions:

In [None]:
# Get feature importance from ML-FEP oracle
feature_importance = ml_fep_oracle.get_feature_importance()

if feature_importance:
    # Create feature importance plot
    features = list(feature_importance.keys())
    importance = list(feature_importance.values())
    
    plt.figure(figsize=(10, 6))
    plt.barh(features, importance)
    plt.title('Feature Importance in ML-FEP Model')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()
    
    print("\nTop 5 most important features:")
    sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)
    # Use a distinct loop variable so the `importance` list above is not shadowed
    for feature, value in sorted_features[:5]:
        print(f"{feature}: {value:.4f}")
else:
    print("Feature importance not available")
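
What `get_feature_importance()` computes is not specified here, and it may depend on the underlying model. A model-agnostic alternative is permutation importance: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below illustrates that technique with a toy model and synthetic data. Every name in it (`permutation_importance`, `toy_model`, `feat_0` and so on) is hypothetical and unrelated to the ML-FEP oracle's internals.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Increase in mean squared error after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances


# Toy stand-in for a trained model: feature 0 matters most, feature 2 not at all.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1] + 0.0 * row[2]


data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets, n_features=3)
for name, score in zip(["feat_0", "feat_1", "feat_2"], scores):
    print(f"{name}: {score:.4f}")
```

Shuffling the unused feature leaves the error unchanged, while shuffling the heavily weighted feature degrades it the most. This gives a ranking that does not rely on model-specific importance attributes.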

## Conclusion

This notebook demonstrated:

1. **Oracle Setup**: Initialized ML-FEP oracle for fast molecular evaluation
2. **Molecular Dataset**: Created and processed molecular datasets
3. **Evaluation**: Evaluated molecules and analyzed uncertainty
4. **Active Learning**: Demonstrated uncertainty-based molecular selection
5. **Analysis**: Visualized results and feature importance

## Next Steps

- Set up FEP and Docking oracles with appropriate receptor files
- Implement reinforcement learning agents
- Use larger molecular databases (ChEMBL, ZINC)
- Optimize for specific 7JVR binding properties
- Compare different active learning strategies