# EMS Prediction: Score Your Variants

**Use this notebook to score YOUR own genetic variants for functional impact.** Simply provide your variant list and get back functional scores to prioritize variants for follow-up studies.

## What You Get
- **Input**: Your variant list as a TSV or CSV file with a `variant_id` column (convert VCF records to `chr:pos:ref:alt` first)
- **Output**: Same variants + EMS functional scores (0-1 scale) 
- **Result**: Ranked list of variants by likelihood to affect gene expression

## Quick Start Guide

**All you need:**
1. Your variant list (see format below)
2. Run this notebook 
3. Get scored variants ready for your research

**Variant format examples:**
```
variant_id
2:12345:A:T
2:67890:G:C
```
If you are starting from a VCF, first convert each record to `chr:pos:ref:alt` format (CHROM, POS, REF, and ALT joined by colons).
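
One way to do this conversion is sketched below, under a few assumptions: the VCF is plain text (not gzipped), multi-allelic records should become one row per ALT allele, symbolic alleles are skipped, and a leading `chr` is dropped to match the examples above. The helper name `vcf_to_variant_ids` is hypothetical and not part of the EMS code.

```python
import pandas as pd

def vcf_to_variant_ids(lines, strip_chr_prefix=True):
    """Build a variant_id table from an iterable of VCF text lines (hypothetical helper)."""
    ids = []
    for line in lines:
        # Skip meta lines (##...), the #CHROM header, and blank lines
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if strip_chr_prefix and chrom.startswith("chr"):
            chrom = chrom[3:]  # assumption: IDs use "2", not "chr2"
        for allele in alt.split(","):  # one row per ALT allele
            # Assumption: skip missing, spanning-deletion, and symbolic alleles
            if allele in {".", "*"} or allele.startswith("<"):
                continue
            ids.append(f"{chrom}:{pos}:{ref}:{allele}")
    return pd.DataFrame({"variant_id": ids})

# Small in-memory example; for a real file use: with open(path) as fh: vcf_to_variant_ids(fh)
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr2\t12345\t.\tA\tT\t.\tPASS\t.\n",
    "chr2\t67890\trs1\tG\tC,GA\t.\tPASS\t.\n",
]
variants_from_vcf = vcf_to_variant_ids(example_vcf)
print(variants_from_vcf)
```

The resulting DataFrame has the same single `variant_id` column as the toy example used later in this notebook.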

**No genomics expertise required** - the notebook builds the model's input features for you and fills any annotations you do not supply with default values.

## Step 1: Setup (Run Once)

*The model is pre-trained and ready to use - no training required on your end.*

In [None]:
import pandas as pd
import numpy as np
import joblib
import pickle
import yaml
import warnings
warnings.filterwarnings('ignore')

# Configure paths - CHANGE THESE for your setup
MODEL_PATH = "model_standard_subset_weighted_chr_chr2_NPR_10.joblib"  # Update this path
CONFIG_PATH = "data_config.yaml"  # Update this path

# Load trained model
print("Loading trained CatBoost model...")
trained_model = joblib.load(MODEL_PATH)
print(f"✅ Model loaded: {trained_model.__class__.__name__}")
print(f"Features required: {len(trained_model.feature_names_)}")

## Step 2: Model Object Preview

Quick inspection of the trained model object to understand its key properties.

In [None]:
# Model object summary
print("🔍 TRAINED MODEL OBJECT PREVIEW")
print("=" * 40)
print(f"Model type: {type(trained_model).__name__}")
print(f"Feature count: {len(trained_model.feature_names_)}")
print(f"Classes: {trained_model.classes_}")
print(f"Tree count: {trained_model.tree_count_}")
print(f"Learning rate: {getattr(trained_model, 'learning_rate_', 'N/A')}")

# Top 5 most important features (quick preview)
feature_names = trained_model.feature_names_
feature_importances = trained_model.feature_importances_
top_features = sorted(zip(feature_names, feature_importances), key=lambda x: x[1], reverse=True)[:5]

print(f"\nTop 5 features:")
for i, (feature, importance) in enumerate(top_features, 1):
    print(f"  {i}. {feature[:25]:<25} {importance:.3f}")

print(f"\nModel ready for prediction on {len(feature_names)} features")
print("For training details, see: EMS Training Tutorial")

## Step 3: Load YOUR Variant List

**Replace the toy example below with your own variant data.**

**Supported formats:**
- TSV/CSV with variant_id column
- VCF files (convert to chr:pos:ref:alt format first)
- Any delimiter (comma, tab, space), as long as you set `sep` in `pd.read_csv` to match your file
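
If you are unsure which delimiter your file uses, a small loader can try the common ones in turn. This is a minimal sketch, not part of the EMS code: the function name `load_variant_table` is hypothetical, and it assumes the file has a header row containing `variant_id`.

```python
import io
import pandas as pd

def load_variant_table(source, separators=("\t", ",", r"\s+")):
    """Try each separator; return the first parse that has a variant_id column."""
    for sep in separators:
        if hasattr(source, "seek"):
            source.seek(0)  # rewind file-like objects between attempts
        try:
            table = pd.read_csv(source, sep=sep, engine="python")
        except pd.errors.ParserError:
            continue  # this separator produced ragged rows; try the next one
        if "variant_id" in table.columns:
            return table
    raise ValueError("No 'variant_id' column found using tab, comma, or whitespace")

# Example with in-memory text; pass a file path instead for real data
comma_file = io.StringIO("variant_id,gene\n2:12345:A:T,GENE1\n")
space_file = io.StringIO("variant_id gene\n2:67890:G:C GENE2\n")
print(load_variant_table(comma_file))
print(load_variant_table(space_file))
```

Tab is tried first because the loading cell in this notebook reads TSV by default.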

In [None]:
# Option 1: Use toy example (for testing)
toy_variants = pd.DataFrame({
    'variant_id': [
        '2:12345:A:T',
        '2:67890:G:C', 
        '2:11111:T:A',
        '2:22222:C:G',
        '2:33333:A:G'
    ]
})

print("Toy example variants:")
print(toy_variants)

# Use toy variants for demonstration
user_variants = toy_variants.copy()

In [None]:
# REPLACE THIS with your own variant file
# YOUR_VARIANT_FILE = "my_variants.tsv"  # <-- Put your file path here
# user_variants = pd.read_csv(YOUR_VARIANT_FILE, sep='\t')
# print(f"✅ Loaded {len(user_variants)} of YOUR variants")

# For now, using toy example (DELETE when using your data)
user_variants = toy_variants.copy()
print("Using toy example - replace with your data above ⬆️")
print(f"🎯 Ready to score {len(user_variants)} variants")

## Step 4: Automatic Feature Processing

*This step derives variant-type features from `variant_id` and fills the remaining genomic annotations with placeholder defaults, so no manual annotation is required.*
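
The parsing in the next cell assumes every `variant_id` has exactly four colon-separated fields. A quick check beforehand can separate malformed IDs so they do not cause errors or silently produce wrong features. The sketch below is an illustration only: the pattern and the helper name `split_valid_variants` are assumptions (for example, it accepts an optional `chr` prefix and only the bases A, C, G, T, N).

```python
import pandas as pd

# Assumed ID shape: optional "chr", chromosome, position, ref bases, alt bases
VARIANT_ID_PATTERN = r"^(?:chr)?(?:\d+|X|Y|MT?):\d+:[ACGTN]+:[ACGTN]+$"

def split_valid_variants(df, id_column="variant_id"):
    """Return (valid_rows, invalid_rows) according to VARIANT_ID_PATTERN."""
    ids = df[id_column].astype(str).str.strip()
    is_valid = ids.str.match(VARIANT_ID_PATTERN)
    return df[is_valid].copy(), df[~is_valid].copy()

candidates = pd.DataFrame({
    "variant_id": ["2:12345:A:T", "2:67890:G", "chr2:11111:T:TA", "2-22222-C-G"]
})
valid, invalid = split_valid_variants(candidates)
print("Valid:", valid["variant_id"].tolist())
print("Needs fixing:", invalid["variant_id"].tolist())
```

Passing only the `valid` rows on to feature creation keeps the four-way split well defined.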

In [None]:
def create_variant_features(df):
    """Create basic variant features from variant_id"""
    df = df.copy()
    
    # Parse variant_id: chr:pos:ref:alt
    df[['chr','pos','ref','alt']] = df['variant_id'].str.split(':', expand=True)
    
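    # Calculate variant type features
    # NOTE: length_diff is len(ref) - len(alt), so a positive value means the
    # ref allele is longer, which is conventionally a deletion. The two flags
    # below use the opposite naming; check which convention the training
    # pipeline used before changing them.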
    df['length_diff'] = df['ref'].str.len() - df['alt'].str.len()
    df['is_SNP'] = (df['length_diff'] == 0).astype(int)
    df['is_indel'] = (df['length_diff'] != 0).astype(int)
    df['is_insertion'] = (df['length_diff'] > 0).astype(int)
    df['is_deletion'] = (df['length_diff'] < 0).astype(int)
    
    # Add placeholder genomic annotations (use training medians)
    df['gene_lof'] = -10.0  # Gene constraint score
    df['gnomad_MAF'] = 0.1  # Population allele frequency
    
    # Clean up
    df = df.drop(columns=['chr','pos','ref','alt'])
    
    return df

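# Create features
print("🔄 Processing your variants automatically...")
processed_variants = create_variant_features(user_variants)
# Report only the columns added by feature creation, wherever variant_id sits
new_columns = [c for c in processed_variants.columns if c not in user_variants.columns]
print(f"Features added: {new_columns}")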

In [None]:
def prepare_prediction_matrix(df, model):
    """Prepare full feature matrix for model prediction"""
    
    # Get required features from trained model
    required_features = model.feature_names_
    
    # Create prediction dataframe with all required features
    prediction_df = df.copy()
    
    # Add missing features with training-informed defaults
    for feature in required_features:
        if feature not in prediction_df.columns:
            if 'distance' in feature.lower() and 'log' in feature.lower():
                prediction_df[feature] = 8.5  # Median log distance from training
            elif 'abc_score' in feature.lower():
                prediction_df[feature] = 0.05  # Median ABC score
            elif 'diff' in feature.lower():
                prediction_df[feature] = 0.0  # Differential signals
            else:
                prediction_df[feature] = 0.0  # Default for other features
    
    # Select and order features to match training
    X = prediction_df[required_features]
    
    # Handle missing/invalid values
    X = X.replace([np.inf, -np.inf], 0)
    X = X.fillna(0)
    
    print(f"✅ Ready for scoring: {X.shape[0]} variants, {X.shape[1]} features")
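    # Count the required features that were absent from the input and had to be filled in
    n_imputed = sum(feature not in df.columns for feature in required_features)
    print(f"Note: {n_imputed} features imputed with training defaults")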
    
    return X

# Prepare prediction matrix
X_prediction = prepare_prediction_matrix(processed_variants, trained_model)

## Step 5: Get Your Scores! 

**This is where the magic happens** - your variants get scored for functional impact.

In [None]:
# Generate predictions
print("Generating EMS scores...")

# Get probability scores (0-1) and binary predictions
ems_scores = trained_model.predict_proba(X_prediction)[:, 1]
binary_predictions = trained_model.predict(X_prediction)

# Add results to original dataframe
results_df = user_variants.copy()
results_df['ems_score'] = ems_scores.round(4)
results_df['predicted_functional'] = binary_predictions

# Add confidence categories
results_df['confidence'] = pd.cut(ems_scores,
                                 bins=[0, 0.3, 0.7, 1.0],
                                 labels=['Low', 'Medium', 'High'],
                                 include_lowest=True)  # keep scores of exactly 0 in 'Low'

print(f"🎉 SUCCESS! Scored {len(results_df)} variants")
print("\nResults preview:")
print(results_df.to_string(index=False))

## Step 6: Understand Your Results

### What the scores mean:
- **Score 0.8-1.0**: 🔥 High priority - likely functional, investigate first
- **Score 0.5-0.8**: 📋 Medium priority - worth exploring  
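- **Score 0.0-0.5**: 📝 Lower priority - likely neutral

These bands are guidelines for reading the scores. The `confidence` column from Step 5 is cut at 0.3 and 0.7 instead, and Step 7 treats scores above 0.7 as high priority, so pick one set of thresholds and apply it consistently.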

### What to do next:
- **High scorers**: Design experiments, check literature, validate in lab
- **Medium scorers**: Look for supporting evidence, combine with other data
- **Low scorers**: Likely safe to deprioritize unless other evidence suggests otherwise
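
The score bands listed above can be turned into a column with `pd.cut`. This is a minimal sketch: the function name `assign_priority` is hypothetical, and the boundary handling (exactly 0.5 counts as Medium, exactly 0.8 as High) is an assumption, since the bands above do not say which side the edges fall on.

```python
import numpy as np
import pandas as pd

def assign_priority(scores):
    """Map EMS scores to Lower / Medium / High using the 0.5 and 0.8 cut points."""
    return pd.cut(
        scores,
        bins=[0.0, 0.5, 0.8, np.inf],
        labels=["Lower", "Medium", "High"],
        right=False,  # intervals are [0, 0.5), [0.5, 0.8), [0.8, inf)
    )

# Made-up scores chosen to sit on and around the boundaries
example_scores = pd.Series([0.0, 0.49, 0.5, 0.79, 0.8, 1.0])
print(assign_priority(example_scores).tolist())
```

With your own results, `results_df['priority'] = assign_priority(results_df['ems_score'])` would add the band next to each score.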

In [None]:
# Summary statistics
print("📊 PREDICTION SUMMARY")
print("=" * 30)
print(f"Total variants: {len(results_df)}")
print(f"Predicted functional: {sum(results_df['predicted_functional'])}")
print(f"Average EMS score: {results_df['ems_score'].mean():.3f}")
print(f"Score range: {results_df['ems_score'].min():.3f} - {results_df['ems_score'].max():.3f}")

print("\nConfidence distribution:")
print(results_df['confidence'].value_counts())

# Highlight top variants
if results_df['ems_score'].max() > 0.5:
    top_variants = results_df.nlargest(3, 'ems_score')[['variant_id', 'ems_score']]
    print("\n🎯 Top-scoring variants:")
    for _, row in top_variants.iterrows():
        print(f"   {row['variant_id']}: {row['ems_score']:.4f}")

## Step 7: Save Your Results

**Your scored variants are ready to use in your research!**
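
If you want the saved file to be a ranked list, sort by score before writing it out. The sketch below uses a hypothetical helper, `rank_by_score`, and a small table of made-up scores purely for illustration. With your data you would pass `results_df` instead.

```python
import pandas as pd

def rank_by_score(results, score_column="ems_score"):
    """Sort from highest to lowest score and add a 1-based rank column."""
    ranked = (
        results.sort_values(score_column, ascending=False, kind="mergesort")
        .reset_index(drop=True)
    )
    ranked.insert(0, "rank", range(1, len(ranked) + 1))
    return ranked

# Illustrative values only, not real EMS output
demo_results = pd.DataFrame({
    "variant_id": ["2:12345:A:T", "2:67890:G:C", "2:11111:T:A"],
    "ems_score": [0.12, 0.91, 0.47],
})
ranked_demo = rank_by_score(demo_results)
print(ranked_demo.to_string(index=False))
```

The stable `mergesort` keeps tied variants in their original order, which makes reruns reproducible.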

In [None]:
# Save ALL your results
output_file = "MY_scored_variants.tsv"  # Rename as needed
results_df.to_csv(output_file, sep='\t', index=False)

print(f"💾 All results saved: {output_file}")
print(f"\nOutput columns:")
for col in results_df.columns:
    print(f"   - {col}")

# Bonus: Save just the high-priority variants
high_confidence = results_df[results_df['ems_score'] > 0.7]
if len(high_confidence) > 0:
    priority_file = "HIGH_PRIORITY_variants.tsv"  
    high_confidence.to_csv(priority_file, sep='\t', index=False)
    print(f"🚀 High-priority variants saved: {priority_file}")
    print(f"   → {len(high_confidence)} variants need your attention first!")
else:
    print("ℹ️  No high-priority variants found (score > 0.7)")

## You're Done! 🎉

### What you accomplished:
✅ Loaded your variants  
✅ Automatically processed genomic features  
✅ Scored variants for functional impact  
✅ Saved prioritized results for your research  

### Next steps:
1. **Review high-scoring variants** in your saved files
2. **Cross-reference** with your existing data/hypotheses  
3. **Plan experiments** for top candidates
4. **Share results** with your team

### Need help?
- **File formats**: Try different delimiters if loading fails
- **Many variants**: This notebook handles thousands of variants  
- **Questions**: Refer to [EMS Training Tutorial](https://statfungen.github.io/xqtl-protocol/code/xqtl_modifier_score/ems_training.html) for technical details

**Happy variant hunting! 🧬**