# Protein Modification

Find this notebook at `EpyNN/nnlive/ptm_protein/prepare_dataset.ipynb`.
Regular Python code at `EpyNN/nnlive/ptm_protein/prepare_dataset.py`.

This notebook is part of the series on preparing data for Neural Network regression with EpyNN. It deals with a real-world problem and therefore focuses on the problem itself rather than on the basics, which were reviewed along with the preparation of the dummy datasets [with Boolean](../dummy_boolean/prepare_dataset.ipynb), [with string](../dummy_string/prepare_dataset.ipynb), [with time-series (numerical)](../dummy_time/prepare_dataset.ipynb) and [with image (numerical)](../dummy_image/prepare_dataset.ipynb).

## Post-Translational Modification (PTM) of Proteins



## Prepare a set of peptides

### Imports

In [1]:
# EpyNN/nnlive/ptm_protein/prepare_dataset.ipynb
# Standard library imports
import tarfile
import random
import os

# Related third party imports
import wget

# Local application/library specific imports
from nnlibs.commons.library import read_file
from nnlibs.commons.logs import process_logs

### Seeding

In [2]:
random.seed(1)
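
Seeding matters here because the dataset preparation relies on `random.shuffle()`. The short sketch below illustrates the effect: seeding the global generator with the same value before shuffling yields the same order each time. The helper name `shuffled_copy` and the placeholder strings are assumptions made for illustration, not part of EpyNN.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items after seeding the global generator."""
    random.seed(seed)
    out = list(items)
    random.shuffle(out)
    return out


# Placeholder strings, not real peptides
peptides = ['AAA', 'CCC', 'DDD', 'EEE', 'FFF']

first = shuffled_copy(peptides, seed=1)
second = shuffled_copy(peptides, seed=1)

print(first == second)  # True: same seed, same order
```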

### Download sequences

In [3]:
def download_sequences():
    """Download a set of peptide sequences.
    """
    data_path = os.path.join('.', 'data')

    if not os.path.exists(data_path):

        # Download @url with wget
        url = 'https://synthase.s3.us-west-2.amazonaws.com/ptm_prediction_data.tar'
        fname = wget.download(url)

        # Extract archive (the context manager closes the file afterwards)
        with tarfile.open(fname) as tar:
            tar.extractall('.')
        process_logs('Make: ' + fname, level=1)

        # Clean-up
        os.remove(fname)

    return None

Retrieve the data as follows.

In [4]:
download_sequences()
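
For readers who prefer not to depend on the third-party `wget` package, the same download-once logic can be sketched with the standard library alone. This is a sketch under stated assumptions, not EpyNN's implementation: the name `fetch_and_extract` and its parameters are hypothetical, and `urllib.request.urlretrieve` stands in for `wget.download`.

```python
import os
import tarfile
import urllib.request


def fetch_and_extract(url, dest='.', data_dir='data'):
    """Download a tar archive from url and extract it into dest.

    Hypothetical helper: does nothing if dest/data_dir already exists.
    Returns True if the archive was fetched, False otherwise.
    """
    if os.path.exists(os.path.join(dest, data_dir)):
        return False

    # Save the archive next to the extraction target
    fname = os.path.join(dest, os.path.basename(url))
    urllib.request.urlretrieve(url, fname)

    # Extract; the context manager closes the archive afterwards
    with tarfile.open(fname) as tar:
        tar.extractall(dest)

    # Clean-up
    os.remove(fname)

    return True
```

Calling `fetch_and_extract()` with the URL used in `download_sequences()` would mirror that function, assuming the archive unpacks into a `data` directory as the paths in this notebook suggest.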

### Prepare dataset

In [5]:
def prepare_dataset(N_SAMPLES=100):
    """Prepare a set of labeled peptides.

    :param N_SAMPLES: Number of peptide samples to retrieve, defaults to 100.
    :type N_SAMPLES: int

    :return: Set of peptides.
    :rtype: tuple[list[str]]

    :return: Set of single-digit peptides label.
    :rtype: tuple[int]
    """
    # Single-digit positive and negative labels
    p_label = 1
    n_label = 0

    # Positive data are Homo sapiens O-GlcNAcylated peptide sequences from oglcnac.mcw.edu
    path_positive = 'data/21_positive.dat'

    # Negative data are peptide sequences presumably not O-GlcNAcylated
    path_negative = 'data/21_negative.dat'

    # Read text files, each containing one sequence per line
    positive = [[list(x), p_label] for x in read_file(path_positive).splitlines()]
    negative = [[list(x), n_label] for x in read_file(path_negative).splitlines()]

    # Shuffle data to remove any ordering previously applied
    random.shuffle(positive)
    random.shuffle(negative)

    # Truncate to prepare a balanced dataset
    negative = negative[:len(positive)]

    # Prepare a balanced dataset
    dataset = positive + negative

    # Shuffle dataset
    random.shuffle(dataset)

    # Truncate dataset to N_SAMPLES
    dataset = dataset[:N_SAMPLES]

    # Separate X-Y pairs
    X_features, Y_label = zip(*dataset)

    return X_features, Y_label
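
Note that `prepare_dataset()` truncates to `N_SAMPLES` only after the final shuffle, so a small sample is not guaranteed to contain as many positives as negatives. If an exactly balanced sample is needed, one possible variation is to draw half of the samples from each class before merging. The sketch below is an alternative under stated assumptions, not the notebook's method: `balanced_sample` is a hypothetical helper and the feature lists are placeholders rather than real peptides.

```python
import random


def balanced_sample(positive, negative, n_samples):
    """Return n_samples shuffled items, half from each class.

    Hypothetical variation on prepare_dataset(); n_samples is assumed even
    and no larger than twice the size of the smaller class.
    """
    half = n_samples // 2

    # random.sample draws without replacement and leaves inputs untouched
    dataset = random.sample(positive, half) + random.sample(negative, half)

    # Shuffle so that the two classes are interleaved
    random.shuffle(dataset)

    return dataset


random.seed(1)

# Placeholder [features, label] pairs, not real peptides
positive = [[['P', str(i)], 1] for i in range(6)]
negative = [[['N', str(i)], 0] for i in range(9)]

sample = balanced_sample(positive, negative, 4)
X_features, Y_label = zip(*sample)

print(sum(Y_label), len(Y_label))  # 2 4
```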

Let's check the function.

In [6]:
X_features, Y_label = prepare_dataset(N_SAMPLES=10)

for peptide, label in zip(X_features, Y_label):
    print(''.join(peptide), label)

TAAMRNTKRGSWYIEALAQVF 0
NKKLAPSSTPSNIAPSDVVSN 1
RGAGSSAFSQSSGTLASNPAT 1
TDNDWPIYVESGEENDPAGDD 0
GQERFRSITQSYYRSANALIL 0
SINTGCLNACTYCKTKHARGN 0
NKASLPPKPGTMAAGGGGPAP 1
ASVQDQTTVRTVASATTAIEI 1
ASLEGKKIKDSTAASRATTLS 1
RRQPVGGLGLSIKGGSEHNVP 1
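
A quick sanity check on the sample printed above can be sketched as follows. The ten peptides and labels are copied from that output; the helper name `summarize` is hypothetical.

```python
from collections import Counter

# Peptides and labels copied from the output above
sample = [
    ('TAAMRNTKRGSWYIEALAQVF', 0),
    ('NKKLAPSSTPSNIAPSDVVSN', 1),
    ('RGAGSSAFSQSSGTLASNPAT', 1),
    ('TDNDWPIYVESGEENDPAGDD', 0),
    ('GQERFRSITQSYYRSANALIL', 0),
    ('SINTGCLNACTYCKTKHARGN', 0),
    ('NKASLPPKPGTMAAGGGGPAP', 1),
    ('ASVQDQTTVRTVASATTAIEI', 1),
    ('ASLEGKKIKDSTAASRATTLS', 1),
    ('RRQPVGGLGLSIKGGSEHNVP', 1),
]


def summarize(sample):
    """Return the set of peptide lengths and the per-label counts."""
    lengths = {len(peptide) for peptide, _ in sample}
    counts = Counter(label for _, label in sample)
    return lengths, counts


lengths, counts = summarize(sample)

print(lengths)               # {21}
print(counts[1], counts[0])  # 6 4
```

All ten peptides are 21 residues long, and this particular draw holds six positives and four negatives, which illustrates that truncating after the final shuffle does not guarantee an exactly balanced sample.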


## What comes next?

The function `prepare_dataset()` presented here, which returns labeled peptide sequences, is used in the following working examples:

XXX