## WARNING: Do not run this expecting a high score. This notebook is a case study in breaking down bad assumptions. For high competition scores, check other posts.

Many thanks to Seddik Turki (https://www.kaggle.com/code/seddiktrk/cafa-6-predictions) and VBS2004 (https://www.kaggle.com/code/vbs2004/cafa-6-protein-ensemble-silver-medal?scriptVersionId=271694892) for the public code releases.

### Disclaimer: Do not run this as your competition submission. This is for educational purposes only. Without significant fixes, this will crash!

In [None]:
import pandas as pd
import numpy as np

## Load the Prediction Files

We'll load both model predictions. Easy peasy - just read them all into memory at once!

In [None]:
# Load first model predictions
model1 = pd.read_csv('/kaggle/input/cafa-6-t5-embeddings-with-ensemble/submission.tsv', 
                      sep='\t', header=None, names=['protein', 'go_term', 'score'])

# Load second model predictions  
model2 = pd.read_csv('/kaggle/input/cafa-6-predictions/submission.tsv',
                      sep='\t', header=None, names=['protein', 'go_term', 'score'])

print(f"Model 1: {len(model1)} predictions")
print(f"Model 2: {len(model2)} predictions")

## Combine the Models

Now we merge them together. We'll use a simple average because all models are equally good, right? No need for fancy weights!

In [None]:
# Rename score columns so we can track them
model1 = model1.rename(columns={'score': 'score1'})
model2 = model2.rename(columns={'score': 'score2'})

# Merge on protein and go_term
combined = model1.merge(model2, on=['protein', 'go_term'], how='inner')

print(f"Combined predictions: {len(combined)}")

## Calculate Ensemble Score

Simple averaging - just add the scores and divide by 2! Math is fun! ➕➗

In [None]:
# Calculate the average score
combined['final_score'] = (combined['score1'] + combined['score2']) / 2

# Keep only what we need
result = combined[['protein', 'go_term', 'final_score']]

print(f"Final predictions: {len(result)}")
print(f"\nSample predictions:")
print(result.head())

## Save the Submission

Write it out to a TSV file. Done! 🎉

In [None]:
result.to_csv('submission.tsv', sep='\t', index=False, header=False)

print("\n Submission saved successfully!")
print(f"Total predictions in submission: {len(result)}")

---

# Problems that may be detrimental to the submission

### 1. **Memory Disaster** 
- Loading 31+ million predictions all at once into memory
- No chunking or streaming - will crash on most systems
- Original used `chunksize=10_000_000` for good reason!
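
As a rough illustration of the chunked reading described above, the sketch below streams a tab-separated file in pieces instead of loading it whole. The in-memory sample data, the tiny `chunksize=2`, and the chosen dtypes are assumptions for demonstration only; a real run would point `read_csv` at the submission file and use a much larger chunk size.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a submission file (tab-separated, no header).
# The rows are made up for illustration.
sample_tsv = io.StringIO(
    "P1\tGO:0000001\t0.9\n"
    "P1\tGO:0000002\t0.4\n"
    "P2\tGO:0000001\t0.7\n"
)

# Explicit dtypes avoid type inference; float32 halves the score column's size.
dtypes = {"protein": str, "go_term": str, "score": "float32"}

total_rows = 0
# Passing chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(
    sample_tsv,
    sep="\t",
    header=None,
    names=["protein", "go_term", "score"],
    dtype=dtypes,
    chunksize=2,  # illustrative; far larger in practice
):
    # Process each chunk here (filter, aggregate, write out), then discard it.
    total_rows += len(chunk)

print(f"Rows seen across all chunks: {total_rows}")
```

Only one chunk is held in memory at a time, so peak usage depends on the chunk size rather than the file size.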

### 2. **Wrong Merge Type** 
- Using `how='inner'` instead of `how='outer'`
- This throws away ALL predictions that don't appear in both models
- You lose most of your predictions! Original had ~31M, this might have ~5M
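
To make the difference concrete, here is a small sketch on made-up toy data (the proteins, GO terms, and scores are invented for illustration). It compares how many rows survive an inner merge versus an outer merge.

```python
import pandas as pd

# Toy predictions; only (P1, GO:0000001) appears in both models.
model1 = pd.DataFrame({
    "protein": ["P1", "P1", "P2"],
    "go_term": ["GO:0000001", "GO:0000002", "GO:0000001"],
    "score1": [0.9, 0.4, 0.7],
})
model2 = pd.DataFrame({
    "protein": ["P1", "P3"],
    "go_term": ["GO:0000001", "GO:0000003"],
    "score2": [0.8, 0.6],
})

keys = ["protein", "go_term"]

# Inner merge keeps only pairs present in BOTH frames.
inner = model1.merge(model2, on=keys, how="inner")

# Outer merge keeps every pair from EITHER frame.
outer = model1.merge(model2, on=keys, how="outer")

print(f"inner rows: {len(inner)}, outer rows: {len(outer)}")
```

In this toy case the inner merge keeps one of four distinct predictions, while the outer merge keeps all four.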

### 3. **No Handling of Missing Values** 
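- With an outer merge, a prediction that is missing from one model gets NaN in that model's score column
- Those NaNs should be filled with 0 (no confidence), but this notebook never does that
- The inner merge hides the problem by dropping such rows; switch to `how='outer'` without a `fillna(0)` and the final scores become NaN, which are invalid predictions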
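
The following sketch shows the effect on a hand-built frame that mimics the result of an outer merge (the values are invented). It treats a missing score as 0, which is the convention described above, not the only possible choice.

```python
import pandas as pd

# Stand-in for an outer-merged frame: None marks a prediction
# that one of the two models did not make.
merged = pd.DataFrame({
    "protein": ["P1", "P1", "P3"],
    "go_term": ["GO:0000001", "GO:0000002", "GO:0000003"],
    "score1": [0.9, 0.4, None],
    "score2": [0.8, None, 0.6],
})

# Averaging without filling: any NaN input gives a NaN output.
naive = (merged["score1"] + merged["score2"]) / 2

# Fill missing scores with 0.0 first, then average.
filled = merged[["score1", "score2"]].fillna(0.0)
safe = (filled["score1"] + filled["score2"]) / 2

print(f"NaN scores without fillna: {naive.isna().sum()}")
print(f"NaN scores with fillna:    {safe.isna().sum()}")
```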

### 4. **Equal Weights** 
- Original used `[0.2, 0.8]` weights because one model performs better
- This uses `[0.5, 0.5]` - treats bad and good models equally
- Drags down performance significantly
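
A weighted average can be sketched as below. The `[0.2, 0.8]` weights come from the description above, but which model gets which weight is an assumption here, and the scores are invented toy values with missing predictions already filled with 0.

```python
import numpy as np
import pandas as pd

# Toy scores after an outer merge and fillna(0).
scores = pd.DataFrame({
    "score1": [0.9, 0.4, 0.0],
    "score2": [0.8, 0.0, 0.6],
})

# Assumed order: 0.2 for model 1, 0.8 for model 2. Weights sum to 1,
# so the result stays in the same [0, 1] range as the inputs.
weights = np.array([0.2, 0.8])

# Row-wise weighted sum: 0.2 * score1 + 0.8 * score2.
final_score = scores[["score1", "score2"]].to_numpy() @ weights
scores["final_score"] = final_score

print(scores)
```

In practice the weights would be chosen from validation performance rather than fixed by hand.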

### 5. **No Data Validation** 
- Missing `dropna()` checks for malformed rows
- No dtype specifications in `read_csv`
- No handling of duplicate predictions
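
One possible shape for these checks is sketched below on invented data. Keeping the highest score among duplicates and treating [0, 1] as the valid score range are both assumptions for illustration, not rules taken from the original notebook.

```python
import pandas as pd

# Toy frame with a malformed row (missing protein), a missing score,
# and a duplicated (protein, go_term) pair.
raw = pd.DataFrame({
    "protein": ["P1", "P1", None, "P2", "P2"],
    "go_term": ["GO:0000001", "GO:0000001", "GO:0000002",
                "GO:0000003", "GO:0000004"],
    "score": [0.9, 0.7, 0.5, None, 0.3],
})

# 1. Drop rows missing any required field.
clean = raw.dropna(subset=["protein", "go_term", "score"])

# 2. Resolve duplicates: sort by score so the highest comes first,
#    then keep the first row per (protein, go_term). Assumed policy.
clean = (
    clean.sort_values("score", ascending=False)
         .drop_duplicates(subset=["protein", "go_term"], keep="first")
)

# 3. Keep only scores inside the assumed valid range.
clean = clean[clean["score"].between(0, 1)].reset_index(drop=True)

print(clean)
```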

### 6. **Inefficient Processing** 
- No temporary file system for large operations
- Keeps everything in memory all the time
- No progress tracking for long operations

### 7. **Lost Edge Cases** 
- Removed the clever `key` concatenation approach
- Doesn't handle proteins/GO terms with underscores properly
- Direct merge on two columns is less robust

### 8. **Score Distribution Issues** 
- Simple averaging can create weird score distributions
- No normalization or calibration
- Weighted ensemble preserves confidence levels better

## Performance Impact:

- **Original**: Processes all 31.5M predictions efficiently
- **This version**: 
  - Will crash due to memory on most systems
  - If it runs, loses 80%+ of predictions due to inner merge
  - Remaining predictions have equal-weight averaging 
  - Competition score would be bad

## What Makes the Original Good:

- Memory-efficient chunking for huge datasets
- Outer merge preserves ALL predictions from both models
- Proper filling of missing values (0 for absent predictions)
- Weighted ensemble based on model quality
- Robust handling of edge cases with key concatenation
- Progress tracking and temporary file management
- Data validation and error handling
