# Fully Convolutional PyTorch Model to Predict Protein Secondary Structure

### Overview

Your overall goal is to write a Fully Convolutional PyTorch model that can input protein sequence data (often called the Protein Primary Structure ), or additionally using PSSM Profiles to predict the protein secondary structure (H = Helix, E = Extended Sheet, C = Coil symbols).

The PDB Database contains the protein structures of over 200,000 proteins. Each has a unique PDB_ID code such as 1A0S (the first one in the training data) which is the structure shown above (sucrose-specific porin of salmonella) which is used to transfer sucrose across the cell membrane of salmonella bacteria which causes food poisoning. The protein has a 3D Structure which shows that most of this protein is extended beta sheet (flat arrows) and coil (random lines).

The Data Tab on Kaggle will allow you to browse the available data used for training. You should use this Data Tab to browse through the data so you understand what it is like. You will find a seqs_train.csv file which is a CSV file that gives the PDB_ID (unique identifier) and the SEQUENCE of each protein. You will also find a train.zip file which contains a large collection of 'PDB_ID'_train.csv files containing residue number, amino acid and PSSM profiles for each residue in that particular protein. The labels_train.csv file contains the secondary structure labels for the different training proteins (given as H = Helix, E = Extended Sheet, C = Coil symbols). The seqs_test.csv and test.zip contain similar data for the test sequences for which you need to predict the secondary structure.

IN ADDITION - you will also need to submit your Jupyter Notebook that produces these outputs via the Moodle web page.

Please see the Moodle course site for further details about this coursework.
<br>
### Evaluation
The evaluation metric is the "Q3 Accuracy" which is used for assessing the three states within a protein structure prediction (H = Helix, E = Extended Sheet, C = Coil). <br>

### Submission File
For each PDB_ID in the test set, you must predict the secondary structure of each residue in that protein. The file should contain a header and have the following format:

(So columns give ID consisting of 'PDB_ID', then underscore 'residue number', followed by the predicted secondary structure label of that residue.)

ID,STRUCTURE <br>
2AIO_1_A_1, C <br>
2AIO_1_A_2, C <br>
2AIO_1_A_3, C <br>
2AIO_1_A_4, H <br>
2AIO_1_A_5, H <br>
etc. <br>

## Import necessary libraries and store file paths

In [1]:
import os
import re
import numpy as np
import pandas as pd

import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader, random_split

from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder



# Define store file paths
DATA_PATH = "./data/"
labels_train_path = DATA_PATH + "labels_train.csv"
sample_path = DATA_PATH + "sample.csv"
seqs_test_path = DATA_PATH + "seqs_test.csv"
seqs_train_path = DATA_PATH + "seqs_train.csv"
train_path = DATA_PATH + "train"
test_path = DATA_PATH + "test"

## Define a mapping from amino acid characters to integers

To enable model training and assessment, these mappings from amino acid characters to integers for encoding are necessary for converting categorical data into numerical representation: 
- `sec_struct_mapping`: A dictionary mapping secondary structure labels ('H' for Helix, 'E' for Extended Sheet, 'C' for Coil) to integer labels (0, 1, 2 respectively). Additional mappings can be added if there are more labels.
- `amino_acid_mapping`: A dictionary mapping amino acid characters to integer labels. Each amino acid is assigned a unique integer, with additional mappings provided for special cases such as unknown amino acids ('X'), ambiguous cases ('B', 'Z', 'J'), and gap or padding ('-').




In [2]:
# Define a mapping from amino acid characters to integers
sec_struct_mapping = {'H': 0, 'E': 1, 'C': 2}  # Add more mappings if there are more labels
amino_acid_mapping = {
    'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4,
    'G': 5, 'H': 6, 'I': 7, 'K': 8, 'L': 9,
    'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14,
    'S': 15, 'T': 16, 'V': 17, 'W': 18, 'Y': 19,
    'X': 20,  # Typically used for unknown amino acids
    'B': 21,  # Asparagine or Aspartic acid
    'Z': 22,  # Glutamine or Glutamic acid
    'J': 23,  # Leucine or Isoleucine
    '-': 24,  # Gap or padding
}

In [3]:
# def load_protein_data(csv_file, train_dir):
#     """Loads protein data from CSV and directory."""
#     seqs = pd.read_csv(csv_file)
#     protein_data = {}
#     for filename in os.listdir(train_dir):
#         if filename.endswith(".csv"):
#             protein_id = re.split(r'_train|_test', filename)[0]
#             protein_data[protein_id] = pd.read_csv(os.path.join(train_dir, filename))
#     return seqs, protein_data

# def load_labels(label_file):
#     """Loads labels from a CSV file."""
#     if label_file:
#         return pd.read_csv(label_file)
#     return None

In [4]:
# def encode_sequence(sequence):
#     """Encodes a protein sequence using one-hot encoding."""
#     encoded_sequence = np.zeros((len(sequence), len(amino_acid_mapping)), dtype=int)
#     for i, amino_acid in enumerate(sequence):
#         index = amino_acid_mapping.get(amino_acid, amino_acid_mapping['X'])
#         encoded_sequence[i, index] = 1
#     return encoded_sequence

# def normalize_pssm(pssm, normalize_method='min-max'):
#     """Normalizes a PSSM using the specified method."""
#     # Assuming the first two columns are non-numeric; adjust as necessary based on your actual data format
#     numeric_columns = pssm[:, 2:]  # Adjust this if your numeric data starts from a different column

#     # Convert to floats & handle any errors
#     try:
#         pssm_numeric = numeric_columns.astype(np.float32)
#     except ValueError as e:
#         raise ValueError(f"Error converting PSSM to float: {e}")

#     if normalize_method == 'min-max':
#         # Min-Max normalization
#         pssm_min = pssm_numeric.min(axis=0)
#         pssm_max = pssm_numeric.max(axis=0)
#         # Ensure no division by zero
#         pssm_range = np.where(pssm_max - pssm_min == 0, 1, pssm_max - pssm_min)
#         normalized_pssm = (pssm_numeric - pssm_min) / pssm_range
#     elif normalize_method == 'z-score':
#         # Z-Score normalization
#         pssm_mean = pssm_numeric.mean(axis=0)
#         pssm_std = pssm_numeric.std(axis=0)
#         # Avoid division by zero
#         pssm_std = np.where(pssm_std == 0, 1, pssm_std)
#         normalized_pssm = (pssm_numeric - pssm_mean) / pssm_std
#     else:
#         # If no normalization method provided, return the original PSSM
#         normalized_pssm = pssm_numeric

#     return normalized_pssm


# def prepare_data_point(idx, seqs, protein_data, label_file=None):
#     """Prepares a protein sample for training or inference."""
#     labels = load_labels(label_file)
#     protein_id = seqs.iloc[idx]['PDB_ID']
#     sequence = seqs.iloc[idx]['SEQUENCE']
#     encoded_sequence = encode_sequence(sequence)  # Encode the sequence
#     pssm = protein_data[protein_id].values  # Assuming you will process PSSM separately
#     normalized_pssm = normalize_pssm(pssm)  # Ensure this is uncommented to use normalized PSSM

#     if labels is not None:
#         label_seq = labels.iloc[idx]['SEC_STRUCT']
#         label_numeric = [sec_struct_mapping[char] for char in label_seq]
#         label_tensor = torch.tensor(label_numeric, dtype=torch.long)
#         return (
#             protein_id,
#             torch.tensor(encoded_sequence, dtype=torch.float32),
#             torch.tensor(normalized_pssm, dtype=torch.float32),
#             label_tensor
#         )

#     return (
#         protein_id,
#         torch.tensor(encoded_sequence, dtype=torch.float32),
#         torch.tensor(normalized_pssm, dtype=torch.float32)
#     )

In [5]:
# def encode_sequence(sequence, amino_acid_mapping):
#     # Convert each amino acid in the sequence to a one-hot encoded vector
#     encoded_sequence = np.zeros((len(sequence), len(amino_acid_mapping)), dtype=int)
#     for i, amino_acid in enumerate(sequence):
#         # Default to 'X' for unknown amino acids
#         index = amino_acid_mapping.get(amino_acid, amino_acid_mapping['X'])
#         encoded_sequence[i, index] = 1
#     return encoded_sequence

# def normalize_pssm(pssm, normalize_method='min-max'):
#     # Assuming the first two columns are non-numeric; adjust as necessary based on your actual data format
#     numeric_columns = pssm[:, 2:]  # Adjust this if your numeric data starts from a different column

#     # Convert to floats
#     try:
#         pssm_numeric = numeric_columns.astype(np.float32)
#     except ValueError as e:
#         # Handle or log the error if needed
#         raise ValueError(f"Error converting PSSM to float: {e}")

#     if normalize_method == 'min-max':
#         # Min-Max normalization
#         pssm_min = pssm_numeric.min(axis=0)
#         pssm_max = pssm_numeric.max(axis=0)
#         # Ensure no division by zero
#         pssm_range = np.where(pssm_max - pssm_min == 0, 1, pssm_max - pssm_min)
#         normalized_pssm = (pssm_numeric - pssm_min) / pssm_range
#     elif normalize_method == 'z-score':
#         # Z-Score normalization
#         pssm_mean = pssm_numeric.mean(axis=0)
#         pssm_std = pssm_numeric.std(axis=0)
#         # Avoid division by zero
#         pssm_std = np.where(pssm_std == 0, 1, pssm_std)
#         normalized_pssm = (pssm_numeric - pssm_mean) / pssm_std
#     else:
#         # If no normalization method provided, return the original PSSM
#         normalized_pssm = pssm_numeric

#     return normalized_pssm

# def protein_dataset(csv_file, train_dir, label_file=None, normalize_method='min-max'):
#     # Load the sequences
#     seqs = pd.read_csv(csv_file)

#     # Load the protein data from the directory
#     protein_data = {}
#     for filename in os.listdir(train_dir):
#         if filename.endswith(".csv"):  # Check if the file is a CSV
#             protein_id = re.split(r'_train|_test', filename)[0]
#             protein_data[protein_id] = pd.read_csv(os.path.join(train_dir, filename))

#     # Load the labels, if provided
#     if label_file:
#         labels = pd.read_csv(label_file)
#     else:
#         labels = None

#     # Amino acid mapping
#     amino_acid_mapping = {
#         'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4,
#         'G': 5, 'H': 6, 'I': 7, 'K': 8, 'L': 9,
#         'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14,
#         'S': 15, 'T': 16, 'V': 17, 'W': 18, 'Y': 19,
#         'X': 20,  # Typically used for unknown amino acids
#         'B': 21,  # Asparagine or Aspartic acid
#         'Z': 22,  # Glutamine or Glutamic acid
#         'J': 23,  # Leucine or Isoleucine
#         '-': 24,  # Gap or padding
#     }

# def get_item(idx):
#     protein_id = seqs.iloc[idx]['PDB_ID']
#     sequence = seqs.iloc[idx]['SEQUENCE']
#     encoded_sequence = encode_sequence(sequence, amino_acid_mapping)  # Encode the sequence
#     pssm = protein_data[protein_id].values  # Assuming you will process PSSM separately
#     normalized_pssm = normalize_pssm(pssm, normalize_method)  # Ensure this is uncommented to use normalized PSSM

#     if labels is not None:
#         label_seq = labels.iloc[idx]['SEC_STRUCT']
#         label_numeric = [sec_struct_mapping[char] for char in label_seq]
#         label_tensor = torch.tensor(label_numeric, dtype=torch.long)
#         return (
#             protein_id,
#             torch.tensor(encoded_sequence, dtype=torch.float32),
#             torch.tensor(normalized_pssm, dtype=torch.float32),
#             label_tensor
#         )

#     return (
#         protein_id,
#         torch.tensor(encoded_sequence, dtype=torch.float32),
#         torch.tensor(normalized_pssm, dtype=torch.float32)
#     )



In [6]:
class ProteinDataset(Dataset):
    def __init__(self, csv_file, train_dir, label_file=None, normalize_method='min-max'):

        # Load the sequences
        self.seqs = pd.read_csv(csv_file)

        # Load the protein data from the directory
        self.protein_data = {}
        for filename in os.listdir(train_dir):
            if filename.endswith(".csv"):  # Check if the file is a CSV
                protein_id = re.split(r'_train|_test', filename)[0]
                self.protein_data[protein_id] = pd.read_csv(os.path.join(train_dir, filename))

        # Load the labels, if provided
        if label_file:
            self.labels = pd.read_csv(label_file)
        else:
            self.labels = None

        # Amino acid mapping
        self.amino_acid_mapping = amino_acid_mapping
        self.normalize_method = normalize_method

    def encode_sequence(self, sequence):
        # Convert each amino acid in the sequence to a one-hot encoded vector
        encoded_sequence = np.zeros((len(sequence), len(self.amino_acid_mapping)), dtype=int)
        for i, amino_acid in enumerate(sequence):
            # Default to 'X' for unknown amino acids
            index = self.amino_acid_mapping.get(amino_acid, self.amino_acid_mapping['X'])
            encoded_sequence[i, index] = 1
        return encoded_sequence

    def normalize_pssm(self, pssm):
        # Assuming the first two columns are non-numeric; adjust as necessary based on your actual data format
        numeric_columns = pssm[:, 2:]  # Adjust this if your numeric data starts from a different column

        # Convert to floats
        try:
            pssm_numeric = numeric_columns.astype(np.float32)
        except ValueError as e:
            # Handle or log the error if needed
            raise ValueError(f"Error converting PSSM to float: {e}")

        if self.normalize_method == 'min-max':
            # Min-Max normalization
            pssm_min = pssm_numeric.min(axis=0)
            pssm_max = pssm_numeric.max(axis=0)
            # Ensure no division by zero
            pssm_range = np.where(pssm_max - pssm_min == 0, 1, pssm_max - pssm_min)
            normalized_pssm = (pssm_numeric - pssm_min) / pssm_range
        elif self.normalize_method == 'z-score':
            # Z-Score normalization
            pssm_mean = pssm_numeric.mean(axis=0)
            pssm_std = pssm_numeric.std(axis=0)
            # Avoid division by zero
            pssm_std = np.where(pssm_std == 0, 1, pssm_std)
            normalized_pssm = (pssm_numeric - pssm_mean) / pssm_std
        else:
            # If no normalization method provided, return the original PSSM
            normalized_pssm = pssm_numeric

        return normalized_pssm

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        protein_id = self.seqs.iloc[idx]['PDB_ID']
        sequence = self.seqs.iloc[idx]['SEQUENCE']
        encoded_sequence = self.encode_sequence(sequence)  # Encode the sequence
        pssm = self.protein_data[protein_id].values  # Assuming you will process PSSM separately
        normalized_pssm = self.normalize_pssm(pssm)  # Ensure this is uncommented to use normalized PSSM

        if self.labels is not None:
            label_seq = self.labels.iloc[idx]['SEC_STRUCT']
            label_numeric = [sec_struct_mapping[char] for char in label_seq]
            label_tensor = torch.tensor(label_numeric, dtype=torch.long)
            return (
                protein_id,
                torch.tensor(encoded_sequence, dtype=torch.float32),
                torch.tensor(normalized_pssm, dtype=torch.float32),
                label_tensor
            )

        return (
            protein_id,
            torch.tensor(encoded_sequence, dtype=torch.float32),
            torch.tensor(normalized_pssm, dtype=torch.float32)
        )


In [7]:
class FullyConvolutionalProteinModel(nn.Module):
    def __init__(self, num_classes=3, input_channels=20):  # 20 for amino acid one-hot, adjust if using PSSM
        super(FullyConvolutionalProteinModel, self).__init__()

        # Define convolutional layers
        self.conv1 = nn.Conv1d(in_channels=input_channels, out_channels=64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=3, padding=1)

        # Final layer that maps to the number of classes
        self.final_conv = nn.Conv1d(in_channels=256, out_channels=num_classes, kernel_size=1)

    def forward(self, x):
        # Apply convolutional layers with activation functions
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))

        # Apply final convolutional layer - no activation, as CrossEntropyLoss includes it
        x = self.final_conv(x)

        # No softmax here, as nn.CrossEntropyLoss applies it internally.
        # Transpose the output to match [batch_size, sequence_length, num_classes]
        # This makes it easier to calculate loss later
        x = x.transpose(1, 2)

        return x

In [8]:
class ProteinModelTrainer:
    def __init__(self, model, criterion, optimizer, train_dataset, val_dataset=None, test_dataset=None, batch_size=64):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size

        self.train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=self.collate_fn)
        if val_dataset:
            self.val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=self.collate_fn)
        else:
            self.val_loader = None

        if test_dataset:
            self.test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=self.collate_fn_without_labels)
        else:
            self.test_loader = None

    def collate_fn_without_labels(self, batch):
        id, sequences, pssms = zip(*batch)

        sequences_padded = pad_sequence([seq.clone().detach() for seq in sequences], batch_first=True)
        pssms_padded = pad_sequence([pssm.clone().detach() for pssm in pssms], batch_first=True)

        return id, sequences_padded, pssms_padded

    def collate_fn(self,batch):
        _, sequences, pssms, labels_list = zip(*batch)  # Unzip the batch

        # Pad sequences and PSSMs
        sequences_padded = pad_sequence([seq.clone().detach() for seq in sequences], batch_first=True)

        pssms_padded = pad_sequence([pssm.clone().detach() for pssm in pssms], batch_first=True)

        # Handling labels correctly
        if labels_list[0] is not None:  # Check if labels exist
            labels_padded = pad_sequence([label.clone().detach() for label in labels_list], batch_first=True)

        else:
            labels_padded = None

        return sequences_padded, pssms_padded, labels_padded
    
    def train(self, num_epochs):
        for epoch in range(num_epochs):
            self.train_epoch()
            if self.val_loader:
                self.validate()

    def train_epoch(self):
        self.model.train()
        running_loss = 0.0
        correct_preds = 0
        total_preds = 0

        for sequences, pssms, labels in self.train_loader:
            inputs = pssms.permute(0, 2, 1)

            self.optimizer.zero_grad()

            outputs = self.model(inputs)
            loss = self.criterion(outputs.transpose(1, 2), labels)

            loss.backward()
            self.optimizer.step()

            running_loss += loss.item() * inputs.size(0)

            _, predicted = torch.max(outputs, 2)
            correct_preds += (predicted == labels).sum().item()
            total_preds += labels.numel()

        epoch_loss = running_loss / len(self.train_dataset)
        epoch_acc = correct_preds / total_preds
        print(f'Train Loss: {epoch_loss:.4f}, Train Accuracy: {epoch_acc:.4f}')

    def validate(self):
        self.model.eval()
        running_loss = 0.0
        correct_preds = 0
        total_preds = 0

        with torch.no_grad():
            for sequences, pssms, labels in self.val_loader:
                inputs = pssms.permute(0, 2, 1)

                outputs = self.model(inputs)
                loss = self.criterion(outputs.transpose(1, 2), labels)

                running_loss += loss.item() * inputs.size(0)

                _, predicted = torch.max(outputs, 2)
                correct_preds += (predicted == labels).sum().item()
                total_preds += labels.numel()

        val_loss = running_loss / len(self.val_dataset)
        val_acc = correct_preds / total_preds
        print(f'Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_acc:.4f}')

    def test(self, output_file='./submission.csv'):
        if self.test_loader:
            self.test_model(self.test_loader, output_file)
        else:
            self.test_model_direct(self.test_dataset, output_file)

    def test_model(self, test_loader, output_file):
        self.model.eval()
        running_loss = 0.0
        correct_preds = 0
        total_preds = 0

        with torch.no_grad():
            for sequences, pssms, labels in test_loader:
                inputs = pssms.permute(0, 2, 1)

                outputs = self.model(inputs)
                loss = self.criterion(outputs.transpose(1, 2), labels)

                running_loss += loss.item() * inputs.size(0)

                _, predicted = torch.max(outputs, 2)
                correct_preds += (predicted == labels).sum().item()
                total_preds += labels.numel()

        test_loss = running_loss / len(test_loader.dataset)
        test_acc = correct_preds / total_preds
        print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}')

    def test_model_direct(self, test_dataset, output_file):
        self.model.eval()
        predictions = []

        with torch.no_grad():
            for i in range(len(test_dataset)):
                pdb_id, _, pssm = test_dataset[i]

                input_pssm = pssm.unsqueeze(0).permute(0, 2, 1)

                outputs = self.model(input_pssm)
                _, predicted = torch.max(outputs, 2)

                seq_len = pssm.shape[0]
                for j in range(seq_len):
                    residue_id = f"{pdb_id}_{j + 1}"
                    structure_label = ['H', 'E', 'C'][predicted[0, j].item()]
                    predictions.append([residue_id, structure_label])

        pd.DataFrame(predictions, columns=['ID', 'STRUCTURE']).to_csv(output_file, index=False)
        print(f'Submission file saved to {output_file}')

In [9]:
# def get_optimizer(optimizer_type, model, lr, weight_decay):
#     # Choose the optimizer based on the parameterization
#     if optimizer_type == "adam":
#         optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
#     elif optimizer_type == "sgd":
#         optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=weight_decay)
#     elif optimizer_type == "rmsprop":
#         optimizer = torch.optim.RMSprop(model.parameters(), lr=lr, weight_decay=weight_decay)
#     else:
#         raise ValueError("Unknown optimizer")

#     return optimizer


# def train_with_params(
#         lr=0.001,
#         batch_size=4,
#         hidden_layers=5,
#         dropout_rate=0.233246,
#         weight_decay=0.0,
#         optimizer='rmsprop',
#         normalization='min-max',
#         num_epochs=10,
#         output_file='submission.csv'
# ):
#     train_dataset = ProteinDataset(csv_file=seqs_train_path, train_dir=train_path, label_file=labels_train_path,
#                                    normalize_method=normalization)
#     train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

#     test_dataset = ProteinDataset(csv_file=seqs_test_path, train_dir=test_path, normalize_method=normalization)

#     # Splitting train_dataset into train and validation sets (adjust sizes as needed)
#     train_size = int(0.8 * len(train_dataset))
#     val_size = len(train_dataset) - train_size
#     train_subset, val_subset = random_split(train_dataset, [train_size, val_size])
#     val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

#     model = FullyConvolutionalProteinModel(hidden_layers_number=hidden_layers, dropout_rate=dropout_rate)
#     criterion = nn.CrossEntropyLoss()
#     optimizer = get_optimizer(optimizer, model, lr, weight_decay)

#     train_model(model, criterion, optimizer, train_dataloader, num_epochs)
#     validate_model(model, criterion, val_loader)
#     test_model_direct(model, test_dataset, output_file)


In [10]:
# # Load training data
# seqs_train, protein_data_train = load_protein_data(seqs_train_path, train_path)
# labels_train = load_labels(labels_train_path)

# # Prepare training samples (assuming collate_fn handles batching and shuffling)
# train_samples = []
# for idx in range(len(seqs_train)):
#     protein_id = seqs_train.iloc[idx]['PDB_ID']
#     sequence = seqs_train.iloc[idx]['SEQUENCE']
#     pssm = protein_data_train[protein_id].values
#     sample = prepare_protein_sample(
#         protein_id, sequence, pssm, labels_train and labels_train.iloc[idx],  # Include label if exists
#         amino_acid_mapping, normalize_method, sec_struct_mapping
#     )
#     train_samples.append(sample)

# # Load testing data
# seqs_test, protein_data_test = load_protein_data(seqs_test_path, test_path)

# # Prepare testing samples (assuming collate_fn handles batching)
# test_samples = []
# for idx in range(len(seqs_test)):
#     protein_id = seqs_test.iloc[idx]['PDB_ID']
#     sequence = seqs_test.iloc[idx]['SEQUENCE']
#     pssm = protein_data_test[protein_id].values
#     sample = prepare_protein_sample(
#         protein_id, sequence, pssm, labels_train.empty and None or labels_train.iloc[idx],  # Include label if exists
#         amino_acid_mapping, normalize_method, sec_struct_mapping
#     )

#     test_samples.append(sample)

# # Create dataloaders (assuming collate_fn remains the same)
# train_dataloader = DataLoader(train_samples, batch_size=4, collate_fn=collate_fn)
# test_dataloader = DataLoader(test_samples, batch_size=4, collate_fn=collate_fn)


In [11]:
# seqs, protein_data = load_protein_data(seqs_train_path, train_path)
# train_dataset = [prepare_data_point(idx, seqs, protein_data, labels_train_path) for idx in range(len(seqs))]
# train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)
# test_dataset = [prepare_data_point(idx, seqs, protein_data, label_file=None) for idx in range(len(seqs))]

# # Model definition and training...
# model = FullyConvolutionalProteinModel()
# criterion = torch.nn.CrossEntropyLoss()
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, weight_decay=0.0)
# num_epochs = 25

# # Train and Test model on test dataset and create submission file
# train_model(model, criterion, optimizer, train_dataloader, num_epochs)
# test_model_direct(model, test_dataset, output_file='./data/submission.csv')


In [12]:
# Initialize datasets
train_dataset = ProteinDataset(csv_file=seqs_train_path, train_dir=train_path, label_file=labels_train_path)
test_dataset = ProteinDataset(csv_file=seqs_test_path, train_dir=test_path)

# Initialize model, criterion, and optimizer
model = FullyConvolutionalProteinModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, weight_decay=0.0)

# Create an instance of the ProteinModelTrainer
trainer = ProteinModelTrainer(model, criterion, optimizer, train_dataset, test_dataset=test_dataset, batch_size=64)

# Train the model
trainer.train(num_epochs=10)

# Test the model
trainer.test(output_file='submission.csv')

Train Loss: 0.2448, Train Accuracy: 0.8963
Train Loss: 0.2041, Train Accuracy: 0.9183
Train Loss: 0.1914, Train Accuracy: 0.9238
Train Loss: 0.1850, Train Accuracy: 0.9268
Train Loss: 0.1780, Train Accuracy: 0.9301
Train Loss: 0.1713, Train Accuracy: 0.9323
Train Loss: 0.1718, Train Accuracy: 0.9319
Train Loss: 0.1703, Train Accuracy: 0.9332
Train Loss: 0.1695, Train Accuracy: 0.9336
Train Loss: 0.1680, Train Accuracy: 0.9345


RuntimeError: Given groups=1, weight of size [64, 20, 3], expected input[64, 25, 696] to have 20 channels, but got 25 channels instead