# Cell 1: Markdown
"""
# mRNA Classification Project
**Course:** MSI5001 - Introduction to AI  
**Team Members:** Lisa Mithani, Shawn Lee, Aishwarya Nair, and Kalyani Vijay  
**Dataset:** mRNA Classification (Medium difficulty)  
**Objective:** Classify RNA sequences as mRNA vs. other RNA types using machine learning models

---

## Table of Contents
1. Data Loading and Preparation
2. Feature Extraction
3. Class Balancing
4. Model Training
5. Classification Performance Evaluation
6. Advanced Model Experiments
"""

In [17]:
!pip install biopython




In [18]:
import pandas as pd
import numpy as np
from Bio import SeqIO
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")


✓ Libraries imported successfully!


---
# 1. Data Loading and Preparation
**Responsible:** Lisa Mithani

## Overview
This section loads the training sequences (FASTA) and their class labels (CSV), merges the two on the sequence ID, loads the test set, and separates features from labels for modeling.

## Tasks:
1. Load `training.fa` (FASTA) and `training_class.csv` → merge using inner join
2. Load `test.csv`
3. Separate features (X) and labels (y) for train/test sets

---


In [19]:
# ============================================================================
# STEP 1: Load training.fa and training_class.csv, then merge using inner join
# ============================================================================

# Function to load FASTA file
def load_fasta_to_dataframe(fasta_file):
    """
    Reads a FASTA file and converts it to a pandas DataFrame.
    
    Parameters:
    - fasta_file: path to .fa file
    
    Returns:
    - DataFrame with columns: ['sequence_id', 'sequence']
    """
    sequences = []
    for record in SeqIO.parse(fasta_file, "fasta"):
        sequences.append({
            'sequence_id': record.id,
            'sequence': str(record.seq)
        })
    return pd.DataFrame(sequences)

# Load training.fa (FASTA file)
print("Loading training.fa...")
fasta_df = load_fasta_to_dataframe('dataset/training.fa')
print(f"✓ Loaded {fasta_df.shape[0]} sequences from training.fa")

# Load training_class.csv (class labels keyed by 'name')
print("\nLoading training.csv...")
csv_df = pd.read_csv('dataset/training_class.csv')
print(f"✓ Loaded training.csv with shape {csv_df.shape}")

# Merge using inner join with DIFFERENT column names
# FASTA has 'sequence_id', CSV has 'name'
print("\nMerging datasets using inner join...")
training_data = pd.merge(
    fasta_df,
    csv_df,
    left_on='sequence_id',   # Column in FASTA
    right_on='name',          # Column in CSV (different name!)
    how='inner'
)

print(f"✓ Merged training data: {training_data.shape}")
print(f"  Columns: {list(training_data.columns)}")
print(f"\n✓ STEP 1 COMPLETE")

# Display first few rows
training_data.head(2)



Loading training.fa...
✓ Loaded 22867 sequences from training.fa

Loading training.csv...
✓ Loaded training.csv with shape (22867, 2)

Merging datasets using inner join...
✓ Merged training data: (14286, 4)
  Columns: ['sequence_id', 'sequence', 'name', 'class']

✓ STEP 1 COMPLETE


| | sequence_id | sequence | name | class |
|---|---|---|---|---|
| 0 | ENSDART00000138379 | TCAAANGGAAAATAATATGTCAGYTGTGATTTTTACTCGANTTAAT... | ENSDART00000138379 | 1 |
| 1 | ENSDART00000075994 | ATGTCTCTTTTTGAAATAAAAGACCTGNTTNGAGAAGGAAGCTATG... | ENSDART00000075994 | 1 |
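
The inner join above keeps 14,286 of the 22,867 rows, so roughly 8,600 records have no match on one side. One way to see where they go is an outer merge with `indicator=True`, which labels every row as present in both tables or in only one. The sketch below is illustrative only: `summarize_merge` is a hypothetical helper, and the toy frames use made-up IDs rather than the project data.

```python
import pandas as pd

def summarize_merge(fasta_df, csv_df, left_key="sequence_id", right_key="name"):
    """Count rows found in both tables, only in the FASTA, or only in the CSV."""
    merged = pd.merge(
        fasta_df, csv_df,
        left_on=left_key, right_on=right_key,
        how="outer", indicator=True,   # adds a '_merge' column
    )
    counts = merged["_merge"].value_counts()
    return {
        "both": int(counts.get("both", 0)),
        "fasta_only": int(counts.get("left_only", 0)),
        "csv_only": int(counts.get("right_only", 0)),
    }

# Toy frames with made-up IDs, for illustration only
toy_fasta = pd.DataFrame({"sequence_id": ["id1", "id2", "id3"],
                          "sequence": ["ACGT", "GGCA", "TTAG"]})
toy_csv = pd.DataFrame({"name": ["id2", "id3", "id4"], "class": [1, 0, 1]})

merge_summary = summarize_merge(toy_fasta, toy_csv)
print(merge_summary)
```

Calling `summarize_merge(fasta_df, csv_df)` on the real frames, and then inspecting a few unmatched IDs from each side, should show whether the loss comes from formatting differences in the IDs or from records that are genuinely missing.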


In [20]:
# ============================================================================
# STEP 2: Load test.csv
# ============================================================================

print("Loading testing.csv...")
testing_data = pd.read_csv('dataset/test.csv')

print(f"✓ Loaded testing.csv: {testing_data.shape}")
print(f"  Columns: {list(testing_data.columns)}")
print(f"\n✓ STEP 2 COMPLETE")

testing_data.head(2)


Loading testing.csv...
✓ Loaded testing.csv: (4416, 3)
  Columns: ['name', 'sequence', 'class']

✓ STEP 2 COMPLETE


| | name | sequence | class |
|---|---|---|---|
| 0 | TCONS_00059596 | CUAAUCCCCCCUCCUCCCGCUCCCGCACCAAAGAGUUGCGCCGCCU... | 1 |
| 1 | TCONS_00059678 | CUAUUCGGCGCAGUUGCUAUACGUACCCCAGCCUCGUACACAACGC... | 1 |
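
The sample rows suggest the two sets do not share an alphabet: the training sequences contain `T` together with ambiguity codes such as `N` and `Y`, while the test sequences contain `U`. Features built from raw characters would then differ between train and test for reasons unrelated to class. A minimal normalization sketch follows; `normalize_sequence` is a hypothetical helper, and collapsing every non-ACGU symbol to `N` is an assumption, not a project decision.

```python
import re

def normalize_sequence(seq, unknown="N"):
    """Uppercase, convert DNA-style T to U, and collapse other symbols to `unknown`."""
    seq = seq.upper().replace("T", "U")
    return re.sub(r"[^ACGU]", unknown, seq)

example = normalize_sequence("TCAAANGGAAAATAATATGTCAGYTG")
print(example)  # UCAAANGGAAAAUAAUAUGUCAGNUG
```

Applied with `X_train['sequence'].map(normalize_sequence)` and the same call on `X_test`, this would give both sets one consistent alphabet before feature extraction.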


In [23]:
# ============================================================================
# STEP 3: Drop class labels and store them in y_train and y_test
# ============================================================================

label_column = 'class'

# --- Training Set ---
print("Separating training data...")
y_train = training_data[label_column].copy()
# Keep 'sequence' column for next teammate to extract features
X_train = training_data.drop(columns=[label_column, 'sequence_id', 'name']).copy()

print(f"✓ X_train shape: {X_train.shape}")
print(f"  Columns: {list(X_train.columns)}")  # Should show ['sequence']
print(f"✓ y_train shape: {y_train.shape}")
print(f"  Class distribution:\n{y_train.value_counts()}")

# --- Testing Set ---
print("\nSeparating testing data...")
y_test = testing_data[label_column].copy()
# Keep 'sequence' column for next teammate
X_test = testing_data.drop(columns=[label_column, 'name']).copy()

print(f"✓ X_test shape: {X_test.shape}")
print(f"  Columns: {list(X_test.columns)}")  # Should show ['sequence']
print(f"✓ y_test shape: {y_test.shape}")
print(f"  Class distribution:\n{y_test.value_counts()}")

print(f"\n✓ STEP 3 COMPLETE")
print(f"\n⚠️ Note: X_train and X_test contain raw sequences.")
print(f"   Next step: Feature extraction (k-mer generation)")



Separating training data...
✓ X_train shape: (14286, 1)
  Columns: ['sequence']
✓ y_train shape: (14286,)
  Class distribution:
class
0    9224
1    5062
Name: count, dtype: int64

Separating testing data...
✓ X_test shape: (4416, 1)
  Columns: ['sequence']
✓ y_test shape: (4416,)
  Class distribution:
class
1    2208
0    2208
Name: count, dtype: int64

✓ STEP 3 COMPLETE

⚠️ Note: X_train and X_test contain raw sequences.
   Next step: Feature extraction (k-mer generation)


---
## Section 1 Complete ✓

**Outputs:**
- `X_train` (14,286 samples × 1 column): Raw RNA sequences
- `y_train` (14,286 labels): Class labels (0 or 1)
- `X_test` (4,416 samples × 1 column): Raw RNA sequences
- `y_test` (4,416 labels): Class labels (0 or 1)

**Note for Next Teammate:**
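The `X_train` and `X_test` DataFrames contain raw sequences in the `'sequence'` column. Please convert these to numeric features (e.g., k-mer frequencies) before model training. The sample rows above suggest the two sets use different alphabets: training sequences contain `T` plus ambiguity codes such as `N` and `Y`, while test sequences contain `U`. Normalize both to a single alphabet before extracting features.

**Next Section:** Feature Extraction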

---



---
# 2. Feature Extraction
**Responsible:** [Teammate Names TBD]

## Overview
This section converts raw RNA sequences into three types of numeric features for modeling.

We will extract:
1. Character Positional Embeddings
2. Character Tokenization
3. K-mer Frequency Features

Each feature type will be used to train separate models, then compared.

---


### Step 2.1: Character Positional Embedding
Convert RNA sequences to positional embeddings using learned vectors.


In [None]:
# ============================================================================
# STEP 2.1: Character Positional Embedding - TO BE COMPLETED
# ============================================================================

# TODO: Convert X_train['sequence'] and X_test['sequence'] to positional embeddings
# Output: X_train_embedding, X_test_embedding (numeric DataFrames)

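# Example approach:
# - Map each nucleotide to an integer index (treat T as U; send N, Y and
#   other ambiguity codes to a shared "unknown" index)
# - Look up an embedding vector for each character and add a position vector
# - Concatenate or average the vectors across the sequence length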

print("⚠️ Step 2.1 incomplete - waiting for implementation")
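
A rough sketch of the approach outlined in the comments, using NumPy only. Nothing is learned here: the character table is randomly initialized as a stand-in for learned vectors, the position vectors are fixed sine/cosine encodings, and mean pooling turns each sequence into one fixed-length row. The names `VOCAB`, `sinusoidal_positions`, and `embed_sequences`, and the defaults `dim=8` and `max_len=100`, are all assumptions for illustration.

```python
import numpy as np

VOCAB = {"A": 0, "U": 1, "G": 2, "C": 3, "N": 4}  # assumed alphabet; T is treated as U

def sinusoidal_positions(max_len, dim):
    """Fixed sine/cosine position encodings of shape (max_len, dim); dim must be even."""
    positions = np.arange(max_len)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((max_len, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def embed_sequences(sequences, dim=8, max_len=100, seed=42):
    """Return one mean-pooled (character + position) vector per sequence."""
    rng = np.random.default_rng(seed)
    char_table = rng.normal(size=(len(VOCAB), dim))   # stand-in for learned vectors
    pos_table = sinusoidal_positions(max_len, dim)
    features = np.zeros((len(sequences), dim))
    for i, seq in enumerate(sequences):
        seq = seq.upper().replace("T", "U")[:max_len]  # truncate long sequences
        if not seq:
            continue                                   # empty input stays all-zero
        ids = [VOCAB.get(ch, VOCAB["N"]) for ch in seq]
        features[i] = (char_table[ids] + pos_table[:len(ids)]).mean(axis=0)
    return features

toy_embedding = embed_sequences(["AUGGCU", "ATGNNYCA", ""])
print(toy_embedding.shape)  # (3, 8)
```

Mean pooling discards most positional detail, so this is mainly useful as a fixed-size input for the logistic regression baseline; a model that learns the embeddings would replace `char_table` with trainable parameters.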


---
### Step 2.2: Character Tokenization
Tokenize sequences into character-level representations.


In [None]:
# ============================================================================
# STEP 2.2: Character Tokenization - TO BE COMPLETED
# ============================================================================

# TODO: Tokenize X_train['sequence'] and X_test['sequence']
# Output: X_train_tokens, X_test_tokens (numeric DataFrames)

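# Example approach:
# - Normalize sequences first (training data uses T and ambiguity codes such as N, Y)
# - Create vocabulary, reserving 0 for padding: {'A': 1, 'U': 2, 'G': 3, 'C': 4, 'N': 5}
# - Convert each sequence to a list of token IDs
# - Pad (or truncate) sequences to the same length
# - Convert to a numeric array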

print("⚠️ Step 2.2 incomplete - waiting for implementation")
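
The tokenization steps listed in the comments could be sketched as follows. ID 0 is reserved for padding so that padded positions cannot be confused with a real nucleotide; that choice, the `N` catch-all token, and the names `PAD_ID`, `TOKEN_VOCAB`, and `tokenize_sequences` are assumptions made for this example.

```python
import numpy as np

PAD_ID = 0
TOKEN_VOCAB = {"A": 1, "U": 2, "G": 3, "C": 4, "N": 5}  # 0 is reserved for padding

def tokenize_sequences(sequences, max_len):
    """Encode sequences as integer IDs, right-padded or truncated to max_len."""
    tokens = np.full((len(sequences), max_len), PAD_ID, dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = seq.upper().replace("T", "U")[:max_len]
        ids = [TOKEN_VOCAB.get(ch, TOKEN_VOCAB["N"]) for ch in seq]
        tokens[i, :len(ids)] = ids
    return tokens

toy_tokens = tokenize_sequences(["AUGC", "TTNY", "GGCAUA"], max_len=5)
print(toy_tokens)
# [[1 2 3 4 0]
#  [2 2 5 5 0]
#  [3 3 4 1 2]]
```

For the real data, `max_len` would need to be chosen from the observed length distribution (for example a high percentile), since padding every row to the longest sequence can make the matrix very wide.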


---
### Step 2.3: K-mer Feature Extraction
Extract k-mer frequency features (k=3, 4, or 5).


In [None]:
# ============================================================================
# STEP 2.3: K-mer Feature Extraction - TO BE COMPLETED
# ============================================================================

# TODO: Extract k-mer features from X_train['sequence'] and X_test['sequence']
# Output: X_train_kmers, X_test_kmers (numeric DataFrames)

# OPTION A: Extract k-mers manually
# - Loop through sequences and extract overlapping k-mers
# - Count k-mer frequencies
# - Normalize to create feature vectors

# OPTION B: Load pre-computed features (FASTER)
# X_train_kmers = pd.read_csv('dataset/train_kmer_features_k3.csv')

print("⚠️ Step 2.3 incomplete - waiting for implementation")
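
Option A could look roughly like the sketch below: slide a window of length `k` over each sequence, count the overlapping k-mers, and divide by the total so that sequences of different lengths are comparable. The helper name `kmer_frequencies` is hypothetical, and skipping any k-mer that contains a symbol outside `ACGU` is one possible way of handling ambiguity codes, not the only one.

```python
from collections import Counter
from itertools import product

import numpy as np
import pandas as pd

def kmer_frequencies(sequences, k=3, alphabet="ACGU"):
    """Return a DataFrame of normalized overlapping k-mer frequencies."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    features = np.zeros((len(sequences), len(kmers)))
    for i, seq in enumerate(sequences):
        seq = seq.upper().replace("T", "U")
        counts = Counter(seq[s:s + k] for s in range(len(seq) - k + 1))
        for kmer, n in counts.items():
            if kmer in index:              # skip k-mers containing N, Y, etc.
                features[i, index[kmer]] = n
        total = features[i].sum()
        if total > 0:
            features[i] /= total           # counts -> relative frequencies
    return pd.DataFrame(features, columns=kmers)

toy_kmers = kmer_frequencies(["AUGAUG", "TTTNTT"], k=2)
print(toy_kmers.loc[0, ["AU", "UG", "GA"]].tolist())  # [0.4, 0.4, 0.2]
```

The feature count grows as 4^k (64 columns for k=3, 256 for k=4, 1,024 for k=5), which is worth keeping in mind when choosing k for a logistic regression baseline.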


---
## Section 2 Summary

**Expected Outputs:**
- `X_train_embedding` (14,286 samples × embedding_dim features)
- `X_test_embedding` (4,416 samples × embedding_dim features)
- `X_train_tokens` (14,286 samples × sequence_length features)
- `X_test_tokens` (4,416 samples × sequence_length features)
- `X_train_kmers` (14,286 samples × n_kmers features)
- `X_test_kmers` (4,416 samples × n_kmers features)

Once Steps 2.1 to 2.3 are implemented, all features will be numeric and ready for class balancing.

---


# 3. Class Balancing
**Responsible:** Shawn Lee

## Overview
Balance class distribution for each feature type separately.

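Current training class distribution:
- Class 0: 9,224 samples (64.6%)
- Class 1: 5,062 samples (35.4%)

The test set is already balanced (2,208 samples per class), so only the training data needs resampling.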

---


### Step 3.1: Balance Embedding Features
Apply class balancing to positional embedding features.


In [None]:
# ============================================================================
# STEP 3.1: Balance Embedding Features - TO BE COMPLETED
# ============================================================================

# TODO: Balance X_train_embedding with y_train
# Output: X_train_embedding_balanced, y_train_balanced

# Example approach using SMOTE:
# from imblearn.over_sampling import SMOTE
# smote = SMOTE(random_state=42)
# X_train_embedding_balanced, y_train_balanced = smote.fit_resample(X_train_embedding, y_train)

print("⚠️ Step 3.1 incomplete - waiting for implementation")
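
SMOTE, referenced in the comments above, synthesizes new minority samples by interpolating between neighbors and requires the `imbalanced-learn` package. Interpolation also produces non-integer values, which may not be meaningful for integer token IDs. A simpler alternative, sketched here with NumPy only, is random oversampling: duplicate randomly chosen minority rows until the classes match. This swaps in a different technique from SMOTE, and `random_oversample` is a hypothetical helper name.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate random minority-class rows until every class matches the largest one."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]                 # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)
            keep.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data for illustration: 4 samples of class 0, 2 of class 1
toy_X = np.arange(12).reshape(6, 2)
toy_y = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(toy_X, toy_y)
print(np.bincount(y_bal))  # [4 4]
```

Whichever method is used, resampling should be applied to the training data only, so that the test metrics still describe performance on untouched samples.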


---
### Step 3.2: Balance Token Features
Apply class balancing to tokenization features.


In [None]:
# ============================================================================
# STEP 3.2: Balance Token Features - TO BE COMPLETED
# ============================================================================

# TODO: Balance X_train_tokens with y_train
# Output: X_train_tokens_balanced, y_train_balanced

print("⚠️ Step 3.2 incomplete - waiting for implementation")


---
### Step 3.3: Balance K-mer Features
Apply class balancing to k-mer frequency features.


In [None]:
# ============================================================================
# STEP 3.3: Balance K-mer Features - TO BE COMPLETED
# ============================================================================

# TODO: Balance X_train_kmers with y_train
# Output: X_train_kmers_balanced, y_train_balanced

print("⚠️ Step 3.3 incomplete - waiting for implementation")


---
## Section 3 Summary

Once Steps 3.1 to 3.3 are implemented, all three feature types will be balanced and ready for modeling.

**Balanced Datasets:**
- Embedding features: X_train_embedding_balanced, y_train_balanced
- Token features: X_train_tokens_balanced, y_train_balanced
- K-mer features: X_train_kmers_balanced, y_train_balanced

---


---
# 4. Model Training
**Responsible:** [Teammate Names TBD]

## Overview
Train three separate logistic regression models using each feature type:
1. Model A: Using Positional Embedding features
2. Model B: Using Tokenization features
3. Model C: Using K-mer features

Each model will be evaluated independently to compare feature effectiveness.

---


### Step 4.1: Train Logistic Regression Model A (Embedding Features)
Train logistic regression on positional embedding features.


In [None]:
# ============================================================================
# STEP 4.1: Train Model A (Embedding Features) - TO BE COMPLETED
# ============================================================================

# TODO: Train logistic regression on X_train_embedding_balanced
# from sklearn.linear_model import LogisticRegression
# 
# model_A = LogisticRegression(max_iter=1000, random_state=42)
# model_A.fit(X_train_embedding_balanced, y_train_balanced)
# 
# # Predict on test set
# y_pred_A = model_A.predict(X_test_embedding)
#
# Output: model_A, y_pred_A

print("⚠️ Step 4.1 incomplete - Model A training pending")


---
### Step 4.2: Train Logistic Regression Model B (Token Features)
Train logistic regression on character tokenization features.


In [None]:
# ============================================================================
# STEP 4.2: Train Model B (Token Features) - TO BE COMPLETED
# ============================================================================

# TODO: Train logistic regression on X_train_tokens_balanced
# model_B = LogisticRegression(max_iter=1000, random_state=42)
# model_B.fit(X_train_tokens_balanced, y_train_balanced)
#
# y_pred_B = model_B.predict(X_test_tokens)
#
# Output: model_B, y_pred_B

print("⚠️ Step 4.2 incomplete - Model B training pending")


---
### Step 4.3: Train Logistic Regression Model C (K-mer Features)
Train logistic regression on k-mer frequency features.


In [None]:
# ============================================================================
# STEP 4.3: Train Model C (K-mer Features) - TO BE COMPLETED
# ============================================================================

# TODO: Train logistic regression on X_train_kmers_balanced
# model_C = LogisticRegression(max_iter=1000, random_state=42)
# model_C.fit(X_train_kmers_balanced, y_train_balanced)
#
# y_pred_C = model_C.predict(X_test_kmers)
#
# Output: model_C, y_pred_C

print("⚠️ Step 4.3 incomplete - Model C training pending")
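
Since Steps 4.1 to 4.3 repeat the same fit-and-predict pattern, a single helper can cover all three feature types. The sketch below is one possible shape for it, not the team's implementation: `train_and_predict` is a hypothetical name, the `StandardScaler` step is an added assumption (it often helps the solver converge), and the demo data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_and_predict(X_train, y_train, X_test):
    """Fit scaled logistic regression; return the model, hard labels, and class-1 scores."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, random_state=42),
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]   # needed later for ROC-AUC
    return model, y_pred, y_score

# Synthetic, linearly separable demo data (not the project data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model_demo, y_pred_demo, y_score_demo = train_and_predict(
    X_demo[:150], y_demo[:150], X_demo[150:]
)
demo_accuracy = (y_pred_demo == y_demo[150:]).mean()
print(f"Demo accuracy: {demo_accuracy:.2f}")
```

With the real features, the call would be, for instance, `train_and_predict(X_train_kmers_balanced, y_train_balanced, X_test_kmers)`. Because the scaler sits inside the pipeline, it is fitted on the training data only and then reused on the test data.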


---
## Section 4 Summary

**Trained Models:**
- Model A: Logistic Regression on Embedding Features
- Model B: Logistic Regression on Token Features
- Model C: Logistic Regression on K-mer Features

Once Steps 4.1 to 4.3 are implemented, all three models will be trained and ready for evaluation.

---


# 5. Classification Performance Evaluation
**Responsible:** Shawn Lee

## Overview
Evaluate and compare the performance of all three logistic regression models.

## Metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- ROC-AUC
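
To make the definitions concrete, the sketch below derives the threshold-based metrics from the four confusion-matrix counts and computes ROC-AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. It uses NumPy only, with toy labels and scores; `binary_metrics` and `roc_auc` are hypothetical helper names, and in practice the `sklearn.metrics` functions would normally be used instead.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]             # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

# Toy labels and scores for illustration
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_metrics = binary_metrics(toy_true, toy_pred)
toy_auc = roc_auc(toy_true, toy_score)
print(toy_metrics["confusion_matrix"], round(toy_metrics["f1"], 3), round(toy_auc, 3))
# [[2, 1], [1, 2]] 0.667 0.889
```

Note that ROC-AUC is computed from continuous scores such as predicted probabilities, not from hard 0/1 predictions, so each model needs to expose its class-1 probability for this metric.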

---


### Step 5.1: Evaluate All Models
Calculate performance metrics for Models A, B, and C.


In [None]:
# ============================================================================
# STEP 5.1: Evaluate All Models - TO BE COMPLETED
# ============================================================================

# TODO: Evaluate each model's performance
# from sklearn.metrics import (accuracy_score, precision_score, recall_score,
#                              f1_score, roc_auc_score, confusion_matrix)
#
# models = {
#     'Model A (Embedding)': (model_A, X_test_embedding, y_pred_A),
#     'Model B (Tokens)':    (model_B, X_test_tokens, y_pred_B),
#     'Model C (K-mers)':    (model_C, X_test_kmers, y_pred_C),
# }
#
# rows = []
# for name, (model, X_eval, y_pred) in models.items():
#     rows.append({
#         'Model': name,
#         'Accuracy': accuracy_score(y_test, y_pred),
#         'Precision': precision_score(y_test, y_pred),
#         'Recall': recall_score(y_test, y_pred),
#         'F1-Score': f1_score(y_test, y_pred),
#         # ROC-AUC needs class-1 probabilities, not hard labels
#         'ROC-AUC': roc_auc_score(y_test, model.predict_proba(X_eval)[:, 1]),
#     })
#
# # Create comparison DataFrame
# results = pd.DataFrame(rows)
# print(results)

print("⚠️ Step 5.1 incomplete - Model evaluation pending")


---
### Step 5.2: Compare Model Performance
Visualize and compare performance across all models.


In [None]:
# ============================================================================
# STEP 5.2: Compare Model Performance - TO BE COMPLETED
# ============================================================================

# TODO: Create comparison visualizations
# import matplotlib.pyplot as plt
# import seaborn as sns
#
# # Bar chart comparing metrics
# results.set_index('Model')[['Accuracy', 'Precision', 'Recall', 'F1-Score']].plot(kind='bar', figsize=(10,6))
# plt.title('Model Performance Comparison')
# plt.ylabel('Score')
# plt.legend(loc='lower right')
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()
#
# # Print confusion matrices for each model
# print("Model A Confusion Matrix:")
# print(confusion_matrix(y_test, y_pred_A))

print("⚠️ Step 5.2 incomplete - Performance comparison pending")


---
### Step 5.3: Select Best Model
Identify the best-performing model based on evaluation metrics.


In [None]:
# ============================================================================
# STEP 5.3: Select Best Model - TO BE COMPLETED
# ============================================================================

# TODO: Determine best model
# best_model_idx = results['F1-Score'].idxmax()
# best_model_name = results.loc[best_model_idx, 'Model']
# best_f1 = results.loc[best_model_idx, 'F1-Score']
#
# print(f"\n{'='*60}")
# print(f"BEST MODEL: {best_model_name}")
# print(f"F1-Score: {best_f1:.4f}")
# print(f"{'='*60}")

print("⚠️ Step 5.3 incomplete - Best model selection pending")


---
## Section 5 Summary

**Expected Outcomes (once Steps 5.1 to 5.3 are implemented):**
- All three models evaluated using standard classification metrics
- Best performing model identified
- Results ready for reporting

---


---
# 6. Advanced Model Experiments
**Team Contributions:** Individual advanced model implementations

## Overview
After the baseline logistic regression comparison, each team member explores one advanced model with its own preprocessing, aiming to improve on the baseline results.

## Team Member Assignments:
- **Lisa:** Random Forest Classifier
- **Shawn:** Recurrent Neural Network (RNN)
- **Aishwarya:** Long Short-Term Memory (LSTM)
- **Kalyani:** Transformer Model

Each subsection contains: Preprocessing → Model Training → Performance Evaluation (all in one cell).

---


## 6.1 Random Forest Classifier
**Responsible:** Lisa

### Approach
Use Random Forest with k-mer features and custom preprocessing.

This cell contains: Additional preprocessing → Model training → Performance evaluation.


---
## 6.2 Recurrent Neural Network (RNN)
**Responsible:** Shawn

### Approach
Use RNN with sequential token features to capture sequence patterns.

This cell contains: Additional preprocessing → Model training → Performance evaluation.


---
## 6.3 Long Short-Term Memory (LSTM)
**Responsible:** Aishwarya

### Approach
Use LSTM to capture long-range dependencies in RNA sequences.

This cell contains: Additional preprocessing → Model training → Performance evaluation.


---
## 6.4 Transformer Model
**Responsible:** Kalyani

### Approach
Use Transformer architecture with attention mechanism for sequence classification.

This cell contains: Additional preprocessing → Model training → Performance evaluation.


---
## 6.5 Comprehensive Model Comparison
Compare all models: Baseline Logistic Regression + Advanced Models


---
## Section 6 Summary

**Advanced Models Planned:**
- Random Forest (Lisa)
- RNN (Shawn)
- LSTM (Aishwarya)
- Transformer (Kalyani)
- Comprehensive comparison of all 7 models (3 baseline + 4 advanced)

**Key Findings:** [To be filled after implementation]

---
