# Cell 1: Markdown
"""
# mRNA Classification Project
**Course:** MSI5001 - Introduction to AI  
**Team Members:** Lisa Mithani, Shawn Lee, Aishwarya Nair, and Kalyani Vijay  
**Dataset:** mRNA Classification (Medium difficulty)  
**Objective:** Classify RNA sequences as mRNA vs. other RNA types using machine learning models

---

## Table of Contents
1. Data Loading and Preparation
2. Exploratory Data Analysis
3. Feature Extraction
4. Model Training
5. Evaluation & Insights
"""

In [17]:
# %pip installs into the environment of the running kernel
%pip install biopython




In [18]:
import pandas as pd
import numpy as np
from Bio import SeqIO
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")


✓ Libraries imported successfully!


---
# 1. Data Loading and Preparation
**Responsible:** Lisa Mithani

## Overview
This section loads the training and testing datasets, merges the training sequences (FASTA) with their class labels (CSV), and separates features from labels for modeling.

## Tasks:
1. Load `dataset/training.fa` (FASTA) and `dataset/training_class.csv` → merge using an inner join on the sequence ID
2. Load `dataset/test.csv`
3. Separate features (X) and labels (y) for train/test sets

---


In [19]:
# ============================================================================
# STEP 1: Load training.fa and training_class.csv, then merge using inner join
# ============================================================================

# Function to load FASTA file
def load_fasta_to_dataframe(fasta_file):
    """
    Reads a FASTA file and converts it to a pandas DataFrame.
    
    Parameters:
    - fasta_file: path to .fa file
    
    Returns:
    - DataFrame with columns: ['sequence_id', 'sequence']
    """
    sequences = []
    for record in SeqIO.parse(fasta_file, "fasta"):
        sequences.append({
            'sequence_id': record.id,
            'sequence': str(record.seq)
        })
    return pd.DataFrame(sequences)

# Load training.fa (FASTA file)
print("Loading training.fa...")
fasta_df = load_fasta_to_dataframe('dataset/training.fa')
print(f"✓ Loaded {fasta_df.shape[0]} sequences from training.fa")

# Load the training labels (the file on disk is training_class.csv)
print("\nLoading training.csv...")
csv_df = pd.read_csv('dataset/training_class.csv')
print(f"✓ Loaded training.csv with shape {csv_df.shape}")

# Merge using inner join with DIFFERENT column names
# FASTA has 'sequence_id', CSV has 'name'
print("\nMerging datasets using inner join...")
training_data = pd.merge(
    fasta_df,
    csv_df,
    left_on='sequence_id',   # Column in FASTA
    right_on='name',          # Column in CSV (different name!)
    how='inner'
)

print(f"✓ Merged training data: {training_data.shape}")
print(f"  Columns: {list(training_data.columns)}")
print("\n✓ STEP 1 COMPLETE")  # plain string: no placeholders, so no f-prefix needed

# Display first few rows
training_data.head(2)



Loading training.fa...
✓ Loaded 22867 sequences from training.fa

Loading training.csv...
✓ Loaded training.csv with shape (22867, 2)

Merging datasets using inner join...
✓ Merged training data: (14286, 4)
  Columns: ['sequence_id', 'sequence', 'name', 'class']

✓ STEP 1 COMPLETE


| | sequence_id | sequence | name | class |
|---|---|---|---|---|
| 0 | ENSDART00000138379 | TCAAANGGAAAATAATATGTCAGYTGTGATTTTTACTCGANTTAAT... | ENSDART00000138379 | 1 |
| 1 | ENSDART00000075994 | ATGTCTCTTTTTGAAATAAAAGACCTGNTTNGAGAAGGAAGCTATG... | ENSDART00000075994 | 1 |
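
Both input files contain 22,867 rows, but the inner join keeps only 14,286 of them, so a large share of IDs did not find a partner. The sketch below is one way to see which IDs fall out and whether any keys are duplicated. The helper name `summarize_id_overlap` and the toy IDs are made up for illustration; the cause of the mismatch in the real data is not known from the output above.

```python
from collections import Counter


def summarize_id_overlap(left_ids, right_ids):
    """Compare two ID collections and report what an inner join would keep or drop."""
    left_ids, right_ids = list(left_ids), list(right_ids)
    left_set, right_set = set(left_ids), set(right_ids)
    return {
        "matched": len(left_set & right_set),
        "left_only": sorted(left_set - right_set),
        "right_only": sorted(right_set - left_set),
        # Duplicate keys multiply rows in a merge, so they are worth flagging too.
        "left_duplicates": sorted(k for k, n in Counter(left_ids).items() if n > 1),
        "right_duplicates": sorted(k for k, n in Counter(right_ids).items() if n > 1),
    }


# Toy IDs for illustration only (hypothetical values, not from the dataset).
report = summarize_id_overlap(
    ["tx1", "tx2", "tx3", "tx3"],
    ["tx2", "tx3", "tx4"],
)
print(report["matched"], report["left_only"], report["right_only"])
print(report["left_duplicates"], report["right_duplicates"])
```

In the notebook, the same helper could be called as `summarize_id_overlap(fasta_df['sequence_id'], csv_df['name'])`. Looking at a handful of `left_only` and `right_only` IDs side by side is usually enough to tell whether the difference is systematic (for example, a formatting difference in the ID) or whether the sequences are simply absent from one file.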


In [20]:
# ============================================================================
# STEP 2: Load testing data (the file on disk is test.csv)
# ============================================================================

print("Loading testing.csv...")
testing_data = pd.read_csv('dataset/test.csv')

print(f"✓ Loaded testing.csv: {testing_data.shape}")
print(f"  Columns: {list(testing_data.columns)}")
print("\n✓ STEP 2 COMPLETE")  # plain string: no placeholders, so no f-prefix needed

testing_data.head(2)


Loading testing.csv...
✓ Loaded testing.csv: (4416, 3)
  Columns: ['name', 'sequence', 'class']

✓ STEP 2 COMPLETE


| | name | sequence | class |
|---|---|---|---|
| 0 | TCONS_00059596 | CUAAUCCCCCCUCCUCCCGCUCCCGCACCAAAGAGUUGCGCCGCCU... | 1 |
| 1 | TCONS_00059678 | CUAUUCGGCGCAGUUGCUAUACGUACCCCAGCCUCGUACACAACGC... | 1 |
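
The preview rows suggest the two datasets do not share an alphabet: the training rows shown are written with `T` and contain IUPAC ambiguity codes (`N`, `Y`), while the test rows shown are written with `U`. Only two rows of each set are visible, so this is an observation to verify, not a confirmed property of the full data. A minimal sketch for checking the alphabets and mapping both sets onto one convention follows; the helper names `alphabet_counts` and `normalize_sequence` are hypothetical, and converting `U` to `T` (instead of the reverse) is an arbitrary choice.

```python
from collections import Counter


def alphabet_counts(sequences):
    """Count how often each character occurs across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    return counts


def normalize_sequence(seq):
    """Upper-case a sequence and write it in the DNA alphabet (U -> T)."""
    return seq.upper().replace("U", "T")


# Prefixes of the first training and test rows displayed above.
train_example = "TCAAANGGAAAATAATATGTCAGYTG"
test_example = "CUAAUCCCCCCUCCUCCCGC"

print(sorted(alphabet_counts([train_example])))
print(sorted(alphabet_counts([test_example])))
print(normalize_sequence(test_example))
```

Applied to the real data, `alphabet_counts(training_data['sequence'])` and `alphabet_counts(testing_data['sequence'])` would show every symbol in use. If the alphabets differ, k-mer features built from the raw strings would not line up between train and test, so normalizing before feature extraction is likely worthwhile.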


In [23]:
# ============================================================================
# STEP 3: Drop class labels and store them in y_train and y_test
# ============================================================================

label_column = 'class'

# --- Training Set ---
print("Separating training data...")
y_train = training_data[label_column].copy()
# Keep 'sequence' column for next teammate to extract features
X_train = training_data.drop(columns=[label_column, 'sequence_id', 'name']).copy()

print(f"✓ X_train shape: {X_train.shape}")
print(f"  Columns: {list(X_train.columns)}")  # Should show ['sequence']
print(f"✓ y_train shape: {y_train.shape}")
print(f"  Class distribution:\n{y_train.value_counts()}")

# --- Testing Set ---
print("\nSeparating testing data...")
y_test = testing_data[label_column].copy()
# Keep 'sequence' column for next teammate
X_test = testing_data.drop(columns=[label_column, 'name']).copy()

print(f"✓ X_test shape: {X_test.shape}")
print(f"  Columns: {list(X_test.columns)}")  # Should show ['sequence']
print(f"✓ y_test shape: {y_test.shape}")
print(f"  Class distribution:\n{y_test.value_counts()}")

# Plain strings: none of these messages has placeholders, so no f-prefix is needed
print("\n✓ STEP 3 COMPLETE")
print("\n⚠️ Note: X_train and X_test contain raw sequences.")
print("   Next step: Feature extraction (k-mer generation)")



Separating training data...
✓ X_train shape: (14286, 1)
  Columns: ['sequence']
✓ y_train shape: (14286,)
  Class distribution:
class
0    9224
1    5062
Name: count, dtype: int64

Separating testing data...
✓ X_test shape: (4416, 1)
  Columns: ['sequence']
✓ y_test shape: (4416,)
  Class distribution:
class
1    2208
0    2208
Name: count, dtype: int64

✓ STEP 3 COMPLETE

⚠️ Note: X_train and X_test contain raw sequences.
   Next step: Feature extraction (k-mer generation)
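
The class counts above show that the training set is imbalanced (9,224 vs. 5,062) while the test set is balanced (2,208 each). One common way to account for this during training is to weight each class by `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight='balanced'`. The sketch below computes those weights from the printed counts; `balanced_class_weights` is a hypothetical helper, and whether weighting actually helps here would need to be checked against the models.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


train_counts = {0: 9224, 1: 5062}   # from y_train.value_counts() above
test_counts = {0: 2208, 1: 2208}    # from y_test.value_counts() above

train_weights = balanced_class_weights(train_counts)
test_weights = balanced_class_weights(test_counts)
print(train_weights)
print(test_weights)
```

The smaller class in the training set receives a weight above 1 and the larger class a weight below 1, whereas the balanced test set yields a weight of exactly 1 for both classes.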


---
## Section 1 Complete ✓

**Outputs:**
- `X_train` (14,286 samples × 1 column): Raw RNA sequences
- `y_train` (14,286 labels): Class labels (0 or 1)
- `X_test` (4,416 samples × 1 column): Raw RNA sequences
- `y_test` (4,416 labels): Class labels (0 or 1)

**Note for Next Teammate:**
The `X_train` and `X_test` DataFrames contain raw RNA sequences in the `'sequence'` column. Please convert these to numeric features (e.g., k-mer frequencies) before model training.
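
As a starting point for that conversion, here is a minimal sketch of k-mer frequency extraction. It is not the team's final feature pipeline: the function name `kmer_frequencies`, the default `k=3`, the `U` to `T` mapping, and the choice to skip windows containing ambiguity codes are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return relative frequencies of all k-mers over `alphabet`, in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper().replace("U", "T")
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:          # skip windows with N, Y, or other symbols
            counts[window] += 1
            total += 1
    # Sequences with no valid window get an all-zero vector.
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Toy sequence for illustration (hypothetical, not from the dataset).
features = kmer_frequencies("AUGGCNAUG", k=2)
print(len(features), round(sum(features), 6))
```

Each sequence maps to a vector of length `4**k` with a fixed column order, so the result can be stacked into a numeric matrix, for example via `X_train['sequence'].apply(kmer_frequencies)`. Using relative frequencies instead of raw counts keeps sequences of different lengths comparable, at the cost of discarding length information, which could be added back as a separate feature.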

**Next Section:** Exploratory Data Analysis, followed by Feature Extraction

---

