# Stanford RNA 3D Folding Competition Notebook

This notebook is designed for the "Stanford RNA 3D Folding" Kaggle competition.
It covers:

1. Data Exploration
2. Data Preprocessing
   - Sequence encoding
   - Label grouping and padding (with NaN handling)
3. Model Building using a fast CNN architecture
4. Model Training with early stopping
5. Prediction on test set and submission file generation

_Note: This notebook uses only the provided CSV files (no external internet access)._
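
The sequence encoding and padding steps listed above can be sketched as follows. This is a minimal illustration, not necessarily the encoding used later in the notebook: the vocabulary `VOCAB`, the indices `PAD_IDX` and `UNK_IDX`, and the helper `encode_sequences` are assumptions introduced here. Any symbol other than A, C, G, or U is mapped to a single "unknown" index.

```python
import numpy as np

# Hypothetical vocabulary: 0 is reserved for padding, 5 for any other symbol.
VOCAB = {"A": 1, "C": 2, "G": 3, "U": 4}
PAD_IDX = 0
UNK_IDX = 5

def encode_sequences(sequences, max_len):
    """Integer-encode RNA strings into a (n_sequences, max_len) array.

    Longer sequences are truncated; shorter ones are right-padded with PAD_IDX.
    """
    encoded = np.full((len(sequences), max_len), PAD_IDX, dtype=np.int64)
    for row, seq in enumerate(sequences):
        for col, base in enumerate(seq[:max_len]):
            encoded[row, col] = VOCAB.get(base, UNK_IDX)
    return encoded

example = encode_sequences(["AUGC", "GGX"], max_len=5)
print(example)
```

Reserving a separate padding index keeps padded positions distinguishable from real nucleotides, which matters if a model or loss needs to ignore them.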

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Data Loading and Exploration

We load the CSV files provided in the competition:
- `train_sequences.csv`
- `train_labels.csv`
- `validation_sequences.csv` & `validation_labels.csv`
- `test_sequences.csv`
- `sample_submission.csv`

**Important:** We fill missing values in the labels data with 0 to avoid NaN issues during training.
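
A filled value of 0 cannot be told apart from a genuine coordinate at the origin, so it may help to record which residues were missing before filling. The sketch below uses a toy frame; `toy_labels`, `coord_cols`, and `missing_mask` are illustrative names and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a labels frame; the real one is read from train_labels.csv.
toy_labels = pd.DataFrame({
    "ID": ["T_1", "T_2", "T_3"],
    "x_1": [1.0, np.nan, 3.0],
    "y_1": [4.0, np.nan, 6.0],
    "z_1": [7.0, np.nan, 9.0],
})
coord_cols = ["x_1", "y_1", "z_1"]

# True where every coordinate of a residue is missing.
missing_mask = toy_labels[coord_cols].isna().all(axis=1)
print("Residues with missing coordinates:", int(missing_mask.sum()))

# Fill only the coordinate columns and keep the mask for later use,
# for example to exclude these residues from a loss.
filled = toy_labels.copy()
filled[coord_cols] = filled[coord_cols].fillna(0)
```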

### 📄 **[train/validation/test]_sequences.csv** (RNA sequences dataset)

1. **`target_id`** (string) – A unique ID for the RNA sequence.
   - 📝 **Example:** `"4YBB_A"`
   - 🔍 **Meaning:** `"4YBB"` is an entry in the Protein Data Bank (PDB), and `"A"` is the chain ID.

2. **`sequence`** (string) – The actual RNA sequence (nucleotide bases).
   - 📝 **Example:** `"AUGCGUACG"`
   - 🔍 **Meaning:** The RNA molecule consists of the bases A (Adenine), U (Uracil), G (Guanine), and C (Cytosine). 

3. **`temporal_cutoff`** (string) – The date when the sequence was published.
   - 📝 **Example:** `"2022-05-10"`
   - 🔍 **Meaning:** This RNA sequence was made publicly available on May 10, 2022.

4. **`description`** (string) – Extra details about the RNA's origin.
   - 📝 **Example:** `"Ribosomal RNA with a small molecule ligand bound"`
   - 🔍 **Meaning:** This RNA is part of the ribosome and has a small molecule attached.

5. **`all_sequences`** (string) – FASTA-formatted sequences of all molecular chains in the experimental structure.
   - 📝 **Example:**
     ```
     >Chain A
     AUGCGUACG
     >Chain B
     CCGGAUAGU
     ```
   - 🔍 **Meaning:** The experimental structure contains multiple chains of RNA (A and B).
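
The `all_sequences` field can be split into (header, sequence) pairs with a few lines of Python. The sketch below assumes the simple layout shown in the example above: a header line starting with `>` followed by one or more sequence lines. `parse_fasta` is a hypothetical helper, not part of the competition tooling. The last lines show one way to split a `target_id` such as `"4YBB_A"` into its PDB entry and chain ID.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from a FASTA-formatted string."""
    if not isinstance(text, str):  # e.g. NaN for a missing entry
        return []
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

fasta_text = ">Chain A\nAUGCGUACG\n>Chain B\nCCGGAUAGU"
records = parse_fasta(fasta_text)
print(records)

# Split a target_id into PDB entry and chain ID at the last underscore.
pdb_id, chain_id = "4YBB_A".rsplit("_", 1)
print(pdb_id, chain_id)
```

The type check at the top is there because `all_sequences` is not populated for every row of the training data.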

---

### 📄 **[train/validation]_labels.csv** (RNA structure dataset)

1. **`ID`** (string) – A unique identifier for each RNA residue.
   - 📝 **Example:** `"4YBB_A_1"`
   - 🔍 **Meaning:** Residue **1** of chain **A** in the RNA structure from PDB entry `"4YBB"`.

2. **`resname`** (character) – The nucleotide at this position.
   - 📝 **Example:** `"A"`
   - 🔍 **Meaning:** This RNA residue is **Adenine (A)**.

3. **`resid`** (integer) – The residue number (position in the sequence).
   - 📝 **Example:** `1`
   - 🔍 **Meaning:** This is the **first** nucleotide in the RNA sequence.

4. **`x_1,y_1,z_1,x_2,y_2,z_2,…`** (float) – 3D coordinates (Angstroms) of the RNA structure.
   - 📝 **Example:**
     ```
     12.345, 34.567, 45.678, 13.456, 35.678, 46.789
     ```
   - 🔍 **Meaning:** The **C1' atom** of this nucleotide has been captured in two experimental conformations.
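
Since each row of a labels file describes one residue, per-target coordinate arrays can be recovered by stripping the residue number from `ID` and grouping. The sketch below runs on toy data and assumes IDs follow the `<target_id>_<resid>` pattern shown above; `toy_labels` and `coords_by_target` are illustrative names, and the coordinate values are made up.

```python
import numpy as np
import pandas as pd

toy_labels = pd.DataFrame({
    "ID": ["4YBB_A_1", "4YBB_A_2", "1SCL_A_1"],
    "resname": ["A", "U", "G"],
    "resid": [1, 2, 1],
    "x_1": [1.0, 4.0, 7.0],
    "y_1": [2.0, 5.0, 8.0],
    "z_1": [3.0, 6.0, 9.0],
})

# Drop the trailing "_<resid>" to recover the target_id.
toy_labels["target_id"] = toy_labels["ID"].str.rsplit("_", n=1).str[0]

# One (n_residues, 3) array per target, ordered by residue number.
coords_by_target = {
    target: group.sort_values("resid")[["x_1", "y_1", "z_1"]].to_numpy()
    for target, group in toy_labels.groupby("target_id")
}
print({target: arr.shape for target, arr in coords_by_target.items()})
```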

---

### 📄 **sample_submission.csv** (Predicted RNA structure)

- The same format as `train_labels.csv`, but you must provide **five** predicted 3D structures.
- Instead of `x_1,y_1,z_1,...`, you need:
  ```
  x_1,y_1,z_1, x_2,y_2,z_2, ..., x_5,y_5,z_5
  ```
- Each row corresponds to one residue and should contain **five predicted positions** for it, one per predicted structure.
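
The coordinate column names can be generated rather than typed out. The sketch below builds the column list and an all-zero placeholder row; the ID `"TARGET_1"` is a made-up example, and a real submission would take `ID`, `resname`, and `resid` from `sample_submission.csv`.

```python
N_STRUCTURES = 5

# x_1, y_1, z_1, ..., x_5, y_5, z_5
coord_cols = [f"{axis}_{k}" for k in range(1, N_STRUCTURES + 1) for axis in "xyz"]
columns = ["ID", "resname", "resid"] + coord_cols
print(len(columns))  # 3 identifier columns + 15 coordinate columns

# Placeholder row for a single residue (hypothetical ID).
row = {"ID": "TARGET_1", "resname": "G", "resid": 1, **{c: 0.0 for c in coord_cols}}
```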

---

### **Simplified Summary:**
- `sequences.csv` contains **RNA sequences** and metadata.
- `labels.csv` provides **experimental 3D coordinates** of RNA residues.
- `sample_submission.csv` is where you submit **five predicted 3D structures** for each sequence.

In [70]:
# Define file paths (local copies of the Kaggle competition CSV files)
TRAIN_SEQ_PATH = '../data/raw/train_sequences.csv'
TRAIN_LABELS_PATH = '../data/raw/train_labels.csv'
VALID_SEQ_PATH = '../data/raw/validation_sequences.csv'
VALID_LABELS_PATH = '../data/raw/validation_labels.csv'
TEST_SEQ_PATH  = '../data/raw/test_sequences.csv'
SAMPLE_SUB_PATH = '../data/raw/sample_submission.csv'

# Load CSV files
train_sequences = pd.read_csv(TRAIN_SEQ_PATH)
train_labels = pd.read_csv(TRAIN_LABELS_PATH)
valid_sequences = pd.read_csv(VALID_SEQ_PATH)
valid_labels = pd.read_csv(VALID_LABELS_PATH)
test_sequences = pd.read_csv(TEST_SEQ_PATH)
sample_submission = pd.read_csv(SAMPLE_SUB_PATH)

# Fill missing values in labels with 0
train_labels.fillna(0, inplace=True)
valid_labels.fillna(0, inplace=True)

In [71]:
print('Train sequences shape:', train_sequences.shape)
print('Train labels shape:', train_labels.shape)
print('Valid sequences shape:', valid_sequences.shape)
print('Valid labels shape:', valid_labels.shape)
print('Test sequences shape:', test_sequences.shape)
print('Sample submission shape:', sample_submission.shape)

Train sequences shape: (844, 5)
Train labels shape: (137095, 6)
Valid sequences shape: (12, 5)
Valid labels shape: (2515, 123)
Test sequences shape: (12, 5)
Sample submission shape: (2515, 18)


In [72]:
train_sequences.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 844 entries, 0 to 843
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   target_id        844 non-null    object
 1   sequence         844 non-null    object
 2   temporal_cutoff  844 non-null    object
 3   description      844 non-null    object
 4   all_sequences    839 non-null    object
dtypes: object(5)
memory usage: 33.1+ KB


In [73]:
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137095 entries, 0 to 137094
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   ID       137095 non-null  object 
 1   resname  137095 non-null  object 
 2   resid    137095 non-null  int64  
 3   x_1      137095 non-null  float64
 4   y_1      137095 non-null  float64
 5   z_1      137095 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 6.3+ MB


In [74]:
train_sequences.head(10)

Unnamed: 0,target_id,sequence,temporal_cutoff,description,all_sequences
0,1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,"THE SARCIN-RICIN LOOP, A MODULAR RNA",>1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus n...
1,1RNK_A,GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU,1995-02-27,THE STRUCTURE OF AN RNA PSEUDOKNOT THAT CAUSES...,>1RNK_1|Chain A|RNA PSEUDOKNOT|null\nGGCGCAGUG...
2,1RHT_A,GGGACUGACGAUCACGCAGUCUAU,1995-06-03,24-MER RNA HAIRPIN COAT PROTEIN BINDING SITE F...,>1RHT_1|Chain A|RNA (5'-R(P*GP*GP*GP*AP*CP*UP*...
3,1HLX_A,GGGAUAACUUCGGUUGUCCC,1995-09-15,P1 HELIX NUCLEIC ACIDS (DNA/RNA) RIBONUCLEIC ACID,>1HLX_1|Chain A|RNA (5'-R(*GP*GP*GP*AP*UP*AP*A...
4,1HMH_E,GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU,1995-12-07,THREE-DIMENSIONAL STRUCTURE OF A HAMMERHEAD RI...,">1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA..."
5,1RNG_A,GGCGCUUGCGUC,1995-12-07,SOLUTION STRUCTURE OF THE CUUG HAIRPIN: A NOVE...,>1RNG_1|Chain A|RNA (5'-R(*GP*GP*CP*GP*CP*UP*U...
6,1MME_D,GGCCGAAACUCGUAAGAGUCACCAC,1996-02-06,THE CRYSTAL STRUCTURE OF AN ALL-RNA HAMMERHEAD...,">1MME_1|Chains A, C|RNA HAMMERHEAD RIBOZYME|\n..."
7,1KAJ_A,GGCGCAGUGGGCUAGCGCCACUCAAAAGCCCG,1996-07-11,CONFORMATION OF AN RNA PSEUDOKNOT FROM MOUSE M...,>1KAJ_1|Chain A|RNA PSEUDOKNOT APK|\nGGCGCAGUG...
8,1SLO_A,UUACCCAAGUUUGAGGUAA,1996-12-07,FIRST STEM LOOP OF THE SL1 RNA FROM CAENORHABD...,>1SLO_1|Chain A|RNA (5'-R(*UP*UP*AP*CP*CP*CP*A...
9,1BIV_A,GGCUCGUGUAGCUCAUUAGCUCCGAGCC,1996-12-23,"BOVINE IMMUNODEFICIENCY VIRUS TAT-TAR COMPLEX,...",>1BIV_1|Chain A|TAR RNA|synthetic construct (3...


In [None]:
# Sequence length statistics:
sequence_length = train_sequences['sequence'].str.len()

print("============================")
print('Sequence length statistics:')
print("============================\n")

print('Min:', sequence_length.min())
print('Max:', sequence_length.max())
print('Mean:', sequence_length.mean())
print('Median:', sequence_length.median())

Sequence length statistics:

Min: 3
Max: 4298
Mean: 162.43483412322274
Median: 39.5


In [94]:
# Check whether every sequence contains only A, C, G, U
RNA_sequence = {
    'A': "Adenine",
    'C': "Cytosine",
    'G': "Guanine",
    'U': "Uracil",
    '-': "Missing",
    'X': "Unknown"
}

sequence_count = train_sequences['sequence'].apply(lambda x: set(x)).value_counts()
sequence_count

sequence
{G, A, C, U}       815
{G, A, C}           10
{G, C, U}            6
{U}                  3
{A}                  2
{A, U}               2
{U, A, -, C, G}      2
{U, A, X, C, G}      2
{G, C}               1
{G, A}               1
Name: count, dtype: int64

In [69]:
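# Name only the four canonical bases; RNA_sequence also holds '-' and 'X'
canonical_bases = ', '.join(RNA_sequence[base] for base in 'ACGU')
print(f"The majority of the sequences are composed of {canonical_bases} with {sequence_count.iloc[0]} sequences.")

The majority of the sequences are composed of Adenine, Cytosine, Guanine, Uracil with 815 sequences.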


In [100]:
# Nucleotide distribution:
# Split each sequence into characters and count them across all sequences
nucleotides = train_sequences['sequence'].apply(list).explode().value_counts()

print("========================")
print("Nucleotide distribution:")
print("========================\n")
for k, v in nucleotides.items():
    print(f"There are {v} {RNA_sequence[k]} ({k}) nucleotides in the sequences.")

Nucleotide distribution:

There are 41450 Guanine (G) nucleotides in the sequences.
There are 33937 Cytosine (C) nucleotides in the sequences.
There are 32524 Adenine (A) nucleotides in the sequences.
There are 29178 Uracil (U) nucleotides in the sequences.
There are 4 Missing (-) nucleotides in the sequences.
There are 2 Unknown (X) nucleotides in the sequences.
