# EDA 03: Sequence Features & Baseline Model

In this notebook, we will:
1. Analyze the protein sequences themselves (Amino Acid Composition).
2. Explore K-mer frequencies (short subsequences).
3. Build a **Frequency Baseline Model** to establish a minimum performance benchmark.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Setup plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Paths
TRAIN_SEQ_PATH = '../Train/train_sequences.fasta'
TRAIN_TERMS_PATH = '../Train/train_terms.tsv'


## 1. Load Data
We'll load the protein sequences (FASTA) and the GO term annotations (TSV).


In [None]:
# Load Sequences
sequences = []
for record in SeqIO.parse(TRAIN_SEQ_PATH, "fasta"):
    sequences.append({
        'EntryID': record.id,
        'sequence': str(record.seq),
        'length': len(record.seq)
    })

df_seq = pd.DataFrame(sequences)
print(f"Loaded {len(df_seq)} sequences.")
df_seq.head()


In [None]:
# Load Terms
df_terms = pd.read_csv(TRAIN_TERMS_PATH, sep='\t')
print(f"Loaded {len(df_terms)} annotations.")
df_terms.head()


## 2. Amino Acid Composition
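Proteins are built from 20 standard amino acids, although sequence files can also contain ambiguity or rare-residue codes such as X, B, Z, or U. Let's look at the overall distribution.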
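
Before counting over the whole dataset, it can help to see the idea on a single sequence. The following is a minimal sketch, assuming sequences are plain strings of one-letter codes; `composition`, `STANDARD_AA`, and the toy sequence are hypothetical names made up for illustration and are not used elsewhere in this notebook. It separates the 20 standard residues from any other letters.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def composition(sequence):
    """Return (fractions of standard residues present, count of other letters)."""
    counts = Counter(sequence)
    total = len(sequence)
    # Fractions are relative to the full sequence length, including other letters.
    fractions = {aa: counts[aa] / total for aa in sorted(STANDARD_AA) if counts[aa]}
    non_standard = sum(n for aa, n in counts.items() if aa not in STANDARD_AA)
    return fractions, non_standard

# Hypothetical toy sequence, not taken from the dataset.
toy_seq = "MKKLLXA"
fractions, non_standard = composition(toy_seq)
print(fractions)
print("non-standard letters:", non_standard)
```

If the non-standard count is noticeable on the real data, those letters will also appear as extra bars in the global composition plot.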


In [None]:
# Count every residue across the full dataset.
# A single pass is fast enough for ~140k proteins; sample df_seq first if it becomes slow.
all_residues = Counter()

for seq in df_seq['sequence']:
    all_residues.update(seq)

# Convert to dataframe
aa_df = pd.DataFrame.from_dict(all_residues, orient='index', columns=['count']).reset_index()
aa_df.columns = ['Amino Acid', 'Count']
aa_df['Frequency'] = aa_df['Count'] / aa_df['Count'].sum()
aa_df = aa_df.sort_values('Frequency', ascending=False)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=aa_df, x='Amino Acid', y='Frequency', palette='viridis')
plt.title('Global Amino Acid Composition')
plt.show()

print("Top 5 Amino Acids:")
print(aa_df.head(5))


## 3. K-mer Analysis (3-mers)
K-mers are contiguous subsequences of length k, extracted with a sliding window. They can capture local motifs.
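
To make the sliding window concrete, here is a small sketch on a made-up 6-residue sequence. `sliding_kmers` and `toy_seq` are hypothetical names used only for this illustration.

```python
def sliding_kmers(sequence, k=3):
    """Return all overlapping substrings of length k (empty list if too short)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

toy_seq = "MKVLAA"  # hypothetical sequence, not from the dataset
kmers = sliding_kmers(toy_seq, k=3)
print(kmers)  # ['MKV', 'KVL', 'VLA', 'LAA']

# A sequence of length L yields L - k + 1 overlapping k-mers.
assert len(kmers) == len(toy_seq) - 3 + 1

# With 20 standard amino acids there are 20**k possible k-mers.
print(20 ** 3)  # 8000
```

The 8000 possible 3-mers are why we only plot the most common ones rather than the full distribution.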


In [None]:
def get_kmers(sequence, k=3):
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

# Analyze a random sample of sequences to save time (capped at the dataset size)
sample_seqs = df_seq['sequence'].sample(n=min(1000, len(df_seq)), random_state=42)
kmer_counts = Counter()

for seq in sample_seqs:
    kmer_counts.update(get_kmers(seq, k=3))

# Top 20 3-mers
top_kmers = pd.DataFrame(kmer_counts.most_common(20), columns=['K-mer', 'Count'])

plt.figure(figsize=(14, 6))
sns.barplot(data=top_kmers, x='K-mer', y='Count', palette='magma')
plt.title(f'Top 20 Most Common 3-mers (Sample n={len(sample_seqs)})')
plt.xticks(rotation=45)
plt.show()


## 4. Frequency Baseline Model
This is the simplest possible model. We will:
1. Split proteins into Train (80%) and Validation (20%).
2. Calculate the frequency of each GO term in the Train set.
3. For every protein in Validation, predict the same top N most frequent terms.
4. Calculate precision, recall, and F1 per protein, then average them over the Validation proteins.
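
The steps above can be sketched end to end on a tiny made-up dataset. This is only an illustration under stated assumptions: the toy frame reuses the `EntryID` and `term` column names from the terms file, but the protein IDs, GO labels, and the `toy_*` variables are hypothetical.

```python
import pandas as pd

# Hypothetical toy annotations; IDs and terms are made up for illustration.
toy_terms = pd.DataFrame({
    "EntryID": ["P1", "P1", "P2", "P2", "P2", "P3", "P4", "P4"],
    "term":    ["GO:A", "GO:B", "GO:A", "GO:B", "GO:C", "GO:A", "GO:A", "GO:C"],
})

# Step 1: split by protein ID (fixed by hand here instead of a random split).
toy_train_ids = ["P1", "P2", "P3"]
toy_val_id = "P4"

# Step 2: fraction of training proteins annotated with each term.
train_rows = toy_terms[toy_terms["EntryID"].isin(toy_train_ids)]
term_freq = train_rows["term"].value_counts() / len(toy_train_ids)

# Step 3: every validation protein receives the same top-N prediction.
toy_top_n = 2
predicted = set(term_freq.head(toy_top_n).index)

# Step 4: precision, recall and F1 for the single validation protein.
true = set(toy_terms.loc[toy_terms["EntryID"] == toy_val_id, "term"])
tp = len(predicted & true)
toy_precision = tp / len(predicted)
toy_recall = tp / len(true)
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(sorted(predicted), toy_precision, toy_recall, toy_f1)
```

Here the prediction is `GO:A` and `GO:B`, the true set is `GO:A` and `GO:C`, so one of two predictions is correct and one of two true terms is recovered.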


In [None]:
# 1. Split Data
# We need to split by Protein ID, not by row
unique_proteins = df_seq['EntryID'].unique()
train_ids, val_ids = train_test_split(unique_proteins, test_size=0.2, random_state=42)

print(f"Train proteins: {len(train_ids)}")
print(f"Val proteins: {len(val_ids)}")

# Filter terms dataframe
train_terms = df_terms[df_terms['EntryID'].isin(train_ids)]
val_terms = df_terms[df_terms['EntryID'].isin(val_ids)]


In [None]:
# 2. Train: Calculate Term Frequencies
# CAFA evaluates each ontology (BPO, CCO, MFO) separately, so per-ontology
# frequencies would be a natural refinement if the data has an 'aspect' column.
# For this naive baseline we use the global top terms across all ontologies.

# Check which columns are available (e.g. whether 'aspect' exists)
print(train_terms.columns)

# Calculate frequency
term_counts = train_terms['term'].value_counts()
term_probs = term_counts / len(train_ids) # Probability = Count / Total Training Proteins

print("Top 10 Most Frequent Terms:")
print(term_probs.head(10))
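
If the terms file does include an ontology column, the per-ontology variant mentioned in the comments could look roughly like the sketch below. It assumes a column named `aspect` with values such as `BPO` and `MFO`, which should be checked against the printed columns first. `top_terms_per_aspect` and the toy frame are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy frame; the 'aspect' column name and its values are assumptions.
toy = pd.DataFrame({
    "EntryID": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "term":    ["GO:1", "GO:7", "GO:1", "GO:8", "GO:2", "GO:7"],
    "aspect":  ["BPO", "MFO", "BPO", "MFO", "BPO", "MFO"],
})

def top_terms_per_aspect(terms_df, n_proteins, top_n):
    """Return {aspect: Series with the top_n term frequencies for that aspect}."""
    result = {}
    for aspect, group in terms_df.groupby("aspect"):
        # Frequency = annotated proteins / total proteins, as in the global baseline.
        freq = group["term"].value_counts() / n_proteins
        result[aspect] = freq.head(top_n)
    return result

per_aspect = top_terms_per_aspect(toy, n_proteins=3, top_n=1)
for aspect, freq in per_aspect.items():
    print(aspect, freq.to_dict())
```

Taking the top N within each ontology keeps one large ontology from crowding the others out of the prediction list.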


In [None]:
# 3. Predict & Evaluate
# For the baseline, we predict the same set of terms for EVERY protein.
# Let's pick the top 50 terms.

top_n = 50
top_terms = term_probs.head(top_n).index.tolist()

# Prepare Ground Truth for Validation
# This is a multi-label problem, so we use a simplified protein-centric evaluation:
# compute precision, recall, and F1 for each protein, then average over proteins.

def evaluate_baseline(val_ids, val_terms_df, predicted_terms):
    # Create a dictionary of true terms for fast lookup
    true_terms_dict = val_terms_df.groupby('EntryID')['term'].apply(set).to_dict()
    
    precisions = []
    recalls = []
    f1s = []
    
    pred_set = set(predicted_terms)
    
    for pid in val_ids:
        true_set = true_terms_dict.get(pid, set())
        
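        # Skip proteins with no annotations: there is no ground truth to score
        if not true_set:
            continue

        # Count true positives, false positives, and false negatives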
        tp = len(pred_set.intersection(true_set))
        fp = len(pred_set) - tp
        fn = len(true_set) - tp
        
        p = tp / (tp + fp) if (tp + fp) > 0 else 0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0
        
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
        
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)

precision, recall, f1 = evaluate_baseline(val_ids, val_terms, top_terms)

print(f"Baseline Results (Top {top_n} terms):")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
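
To build intuition for how the choice of N affects these numbers, here is a small worked sketch for a single protein. The ranked list, the true annotations, and the helper `prf` are all hypothetical. It mirrors the per-protein arithmetic used in the evaluation but is not the notebook's `evaluate_baseline` function.

```python
def prf(pred_set, true_set):
    """Precision, recall and F1 for one protein's predicted and true term sets."""
    tp = len(pred_set & true_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(true_set) if true_set else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Hypothetical ranked prediction list and true annotations (made-up term IDs).
ranked_terms = [f"GO:{i}" for i in range(50)]
true_terms = {"GO:0", "GO:1", "GO:3", "GO:999"}

for n in (5, 10, 50):
    p, r, f1 = prf(set(ranked_terms[:n]), true_terms)
    print(f"N={n:>2}  precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```

In this toy case recall stays at 0.75 while precision falls from 0.6 to 0.06 as N grows, so it may be worth trying several values of `top_n` on the real data.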


## Conclusion
This F1 score (the mean per-protein F1 when every protein receives the same top 50 terms) is the performance "floor". Any method we build (BLAST, CNN, ProtBERT) must beat it to be considered useful.

**Next Steps:**
1. Implement a BLAST-based baseline (usually much stronger).
2. Start building the actual Data Loaders for the deep learning pipeline.
