# Pfam protein domain classification

In this notebook, I use the Pfam seed dataset to classify protein domains. The dataset contains protein domain sequences and their respective Pfam family labels. The Pfam database is a large collection of protein families. Each family is defined by a curated seed alignment and a profile hidden Markov model (HMM) built from that alignment. The database also contains additional information such as full multiple sequence alignments and family annotation.

The dataset is available at https://www.kaggle.com/googleai/pfam-seed-random-split

The dataset contains the following columns:

- `sequence`: The amino-acid sequence of the domain, represented as a string. This is the usual model input. Each entry covers a single domain region rather than a whole protein, so there is no signal peptide to strip before training. Sequence lengths vary considerably across families. The alphabet consists of the 20 standard amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); a few uncommon or ambiguous codes (X, U, B, O, Z) also occur occasionally.
- `family_accession`: The Pfam family accession. This is a unique identifier of a Pfam family and the usual prediction label. It has the form `PFxxxxx.y`, where `PFxxxxx` is the accession and `y` is the version number. For example, the accession of the "7tm_1" family is "PF00001", followed by a version suffix. All families in this dataset are manually curated Pfam families, since they are defined by curated seed alignments.
- `sequence_name`: The name of the sequence, in the form `<UniProt accession ID>/<start>-<end>`, where the two indices give the position of the domain within its source protein. Several domains can come from the same source protein, in which case their names share the same accession ID but differ in the index range. The sequence name is mainly for human inspection and can be ignored during model training.
- `aligned_sequence`: The sequence as it appears in the seed multiple sequence alignment of its family, with gap characters retained. Because of the gaps, it is at least as long as `sequence`, and all aligned sequences of one family share the same length. The aligned sequence is informative for training models that rely on multiple sequence alignments.
- `family_id`: The short, human-readable name of the Pfam family, for example "7tm_1" for the family with accession "PF00001". It identifies the same family as `family_accession` but uses a different naming convention. Identifiers starting with "CL" followed by digits denote Pfam clans (groups of related families), not family IDs.
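
As a rough sketch of how the identifier columns could be unpacked, the helpers below split an accession into its base and version and a sequence name into its parts. They assume accessions look like `PFxxxxx.y` and sequence names look like `<id>/<start>-<end>`, as described in the dataset's Kaggle documentation. The function names and the example values are made up for illustration.

```python
import re

# Assumed sequence-name layout: "<id>/<start>-<end>".
_NAME_RE = re.compile(r"^(?P<id>[^/]+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_family_accession(accession):
    """Split an accession such as 'PF00001.21' into ('PF00001', 21).

    The version is None when the accession has no '.y' suffix.
    """
    base, _, version = accession.partition(".")
    return base, int(version) if version else None


def parse_sequence_name(name):
    """Split a name such as 'EXAMPLE_HUMAN/10-120' into (id, start, end)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected sequence name format: {name!r}")
    return match.group("id"), int(match.group("start")), int(match.group("end"))


# Hypothetical values, not taken from the dataset.
example_accession = parse_family_accession("PF00001.21")
example_name = parse_sequence_name("EXAMPLE_HUMAN/10-120")
```

Stripping the version this way may be useful when labels need to be compared across Pfam releases, where the version suffix can differ for the same family.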

The task is to predict the `family_accession` of a protein sequence.
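
Before a model can consume the sequences, they usually have to be turned into fixed-length numeric inputs. The sketch below shows one possible encoding, not necessarily the one used later in this notebook: each of the 20 standard amino acids gets an integer index, any other character maps to a shared "unknown" index, and sequences are truncated or padded to `max_len`. The names `encode_sequence`, `PAD`, `UNK`, and `max_len` are assumptions introduced for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

PAD = 0  # index used for padding positions
UNK = 1  # index used for any character outside AMINO_ACIDS

# Standard amino acids start at index 2, after the two special indices.
AA_TO_INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len):
    """Map a sequence string to a list of exactly max_len integer indices."""
    # Truncate long sequences, then look up each residue.
    ids = [AA_TO_INDEX.get(aa, UNK) for aa in sequence[:max_len]]
    # Right-pad short sequences so every output has the same length.
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_sequence("ACX", 5)  # A -> 2, C -> 3, X -> UNK, then padding
```

Reserving index 0 for padding keeps it easy to mask padded positions later, for example in an embedding layer.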

In [16]:
import pandas as pd
import os
import matplotlib.pyplot as plt

In [17]:
DIR_PATH = 'data/'

In [19]:
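def read_data(sub_dir, base_dir=DIR_PATH):
    """Read every file in base_dir/sub_dir as CSV and concatenate them into one DataFrame."""
    split_dir = os.path.join(base_dir, sub_dir)
    # Sort the file names so the row order is reproducible across runs.
    frames = [pd.read_csv(os.path.join(split_dir, name))
              for name in sorted(os.listdir(split_dir))]
    # Reset the index so row labels stay unique across the concatenated files.
    return pd.concat(frames, ignore_index=True)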


df_train = read_data('train/')
df_test = read_data('test/')
df_val = read_data('dev/')
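
After loading the three splits, one simple sanity check is to measure how many evaluation rows carry a family label that also occurs in the training split. The helper below is a minimal sketch of that check. `label_coverage` is a hypothetical name, and it works on any iterable of labels, so it can be tried on plain lists as well as on DataFrame columns.

```python
def label_coverage(train_labels, eval_labels):
    """Return the fraction of eval_labels entries that also occur in train_labels."""
    seen = set(train_labels)
    eval_labels = list(eval_labels)
    if not eval_labels:
        return 0.0  # avoid dividing by zero on an empty split
    covered = sum(1 for label in eval_labels if label in seen)
    return covered / len(eval_labels)


# Toy labels for illustration: three of the four evaluation rows are covered.
toy_coverage = label_coverage(["a", "b"], ["a", "a", "c", "b"])
```

With the DataFrames loaded above, the same function could be called as `label_coverage(df_train['family_accession'], df_test['family_accession'])`.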