# Data Collection

## Selection criteria
From *UniProt*, select the positive and negative datasets according to these criteria:

> - **Both sets:** evidence at protein level, protein length of at least 40 residues, reviewed entries, eukaryotes only (taxonomy ID 2759), no fragments  
> - **Positive set:** experimental evidence of signal peptide  
> - **Negative set:** no signal peptide (any evidence) + experimental evidence for non-SP compartments

This pipeline retrieves protein data from the UniProt database through its REST API and filters the positive entries, keeping only those with an annotated cleavage site and a signal peptide (SP) of at least 14 residues.
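
The search URLs used by the pipeline are long percent-encoded strings, which makes the selection criteria hard to read. As a minimal sketch, the positive-set URL could be assembled from readable clauses with the standard library. The names `BASE_URL`, `POSITIVE_CLAUSES` and `build_search_url` are illustrative and not part of the pipeline.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

# One clause per selection criterion of the positive set
POSITIVE_CLAUSES = [
    "(fragment:false)",      # no fragments
    "(taxonomy_id:2759)",    # eukaryotes only
    "(length:[40 TO *])",    # at least 40 residues
    "(reviewed:true)",       # reviewed entries
    "(existence:1)",         # evidence at protein level
    "(ft_signal_exp:*)",     # signal peptide with experimental evidence
]

def build_search_url(clauses, size=500):
    """Join the clauses with AND and percent-encode the whole query."""
    query = "(" + " AND ".join(clauses) + ")"
    params = {"format": "json", "query": query, "size": size}
    return BASE_URL + "?" + urlencode(params)

url = build_search_url(POSITIVE_CLAUSES)

# Decode the query again to check that it is still human-readable
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

The resulting URL decodes to the same `format`, `query` and `size` parameters as `url_positive`; the only difference is that `urlencode` also escapes `*` as `%2A`.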

The results are saved to TSV and FASTA files, which are described below.

## Output files

We exported the datasets in **two formats**:

> - **FASTA files** contain only the amino acid sequences of the selected proteins.  
> - **TSV files** contain metadata associated with each protein entry.

### .TSV files content

**Positive dataset**
1. UniProt accession  
2. Organism name  
3. Eukaryotic kingdom (Metazoa, Fungi, Viridiplantae, Other)  
4. Protein length  
5. Position of the signal peptide cleavage site  

**Negative dataset**
1. UniProt accession  
2. Organism name  
3. Eukaryotic kingdom (Metazoa, Fungi, Viridiplantae, Other)  
4. Protein length  
5. Presence of a transmembrane helix in the first 90 residues (True/False)  


In [68]:
# imports and setup
import requests   # to query UniProt REST API
from requests.adapters import HTTPAdapter, Retry   # to handle failed requests
import json     # To parse and handle data in JSON format from API responses
import re      # To use regular expressions to find patterns in text

In [69]:
url_positive = "https://rest.uniprot.org/uniprotkb/search?format=json&query=%28%28fragment%3Afalse%29+AND+%28taxonomy_id%3A2759%29+AND+%28length%3A%5B40+TO+*%5D%29+AND+%28reviewed%3Atrue%29+AND+%28existence%3A1%29+AND+%28ft_signal_exp%3A*%29%29&size=500"
url_negative = "https://rest.uniprot.org/uniprotkb/search?format=json&query=%28%28reviewed%3Atrue%29+AND+%28fragment%3Afalse%29+AND+%28taxonomy_id%3A2759%29+AND+%28length%3A%5B40+TO+*%5D%29+AND+%28existence%3A1%29+NOT+%28ft_signal%3A*%29+OR+%28cc_scl_term_exp%3ASL-0191%29+OR+%28cc_scl_term_exp%3ASL-0204%29+OR+%28cc_scl_term_exp%3ASL-0039%29+OR+%28cc_scl_term_exp%3ASL-0091%29+OR+%28cc_scl_term_exp%3ASL-0209%29+OR+%28cc_scl_term_exp%3ASL-0173%29%29&size=500"
col_positive = ["Accession", "Organism", "Kingdom", "Sequence length", "SP cleavage"]
col_negative = ["Accession", "Organism", "Kingdom", "Sequence length", "N-term transmembrane"]

In [70]:
def get_next_link(headers):
  """
  Return the URL of the next results page, read from the "Link" header of a
  response, or None if there is no further page.
  """

  # Extract the next page URL from the "Link" header (if present)
  if "Link" in headers:

      # Use the expression re_next_link that matches the "next" URL
      match = re_next_link.match(headers["Link"])
      if match:
          return match.group(1)

def get_batch(batch_url):
  """
  Generator that queries the UniProt REST API page by page, starting from
  batch_url, and yields each response with the total number of results.
  """

  # Retrieve one page at a time until there is no next link
  while batch_url:
      response = session.get(batch_url)

      # Raise an exception if the request failed (4xx or 5xx status)
      response.raise_for_status()

      # Get the total number of entries in the search
      total = response.headers["x-total-results"]
      yield response, total

      # Update with the next URL (from the Link header)
      batch_url = get_next_link(response.headers)
      print(batch_url)


In [71]:
# Finds the URL for the next page in the dataset in a response header
re_next_link = re.compile(r'<(.+)>; rel="next"')

# Retry failed requests (HTTP 500, 502, 503, 504) up to five times, with increasing backoff
retries = Retry(total=5, backoff_factor=0.25, status_forcelist=[500, 502, 503, 504])

# To handle multiple requests efficiently
session = requests.Session()

# Applies the retry logic defined by retries to the session
session.mount("https://", HTTPAdapter(max_retries=retries))
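
Pagination relies on the `Link` response header, from which `re_next_link` extracts the URL tagged `rel="next"`. The sketch below applies the same regular expression to a sample header. The header value, including its cursor token, is made up for illustration, and `next_link` is a hypothetical stand-in for `get_next_link`.

```python
import re

# Same pattern as in the pipeline
re_next_link = re.compile(r'<(.+)>; rel="next"')

# Hypothetical header value; the cursor token is a placeholder
sample_headers = {
    "Link": '<https://rest.uniprot.org/uniprotkb/search?query=reviewed%3Atrue&cursor=abc123&size=500>; rel="next"'
}

def next_link(headers):
    """Return the URL tagged rel="next", or None when there is no next page."""
    match = re_next_link.match(headers.get("Link", ""))
    return match.group(1) if match else None

next_url = next_link(sample_headers)   # URL of the following page
last_page = next_link({})              # no Link header: pagination stops
print(next_url)
print(last_page)
```

`requests` also exposes parsed `Link` headers through `response.links`, which could be used instead of the regular expression.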

In [72]:
# Look at the JSON structure of the first protein only
for response, total in get_batch(url_positive):
  data = response.json()

  entries = data["results"]
  if entries:
    # use the first entry
    first_entry = entries[0]

    # look only at the features
    features = first_entry["features"]
    print(json.dumps(features, indent=2))
  break

[
  {
    "type": "Signal",
    "location": {
      "start": {
        "value": 1,
        "modifier": "EXACT"
      },
      "end": {
        "value": 21,
        "modifier": "EXACT"
      }
    },
    "description": "",
    "evidences": [
      {
        "evidenceCode": "ECO:0000269",
        "source": "PubMed",
        "id": "15340161"
      },
      {
        "evidenceCode": "ECO:0000269",
        "source": "PubMed",
        "id": "9571159"
      }
    ]
  },
  {
    "type": "Chain",
    "location": {
      "start": {
        "value": 22,
        "modifier": "EXACT"
      },
      "end": {
        "value": 401,
        "modifier": "EXACT"
      }
    },
    "description": "Tumor necrosis factor receptor superfamily member 11B",
    "featureId": "PRO_0000034587"
  },
  {
    "type": "Repeat",
    "location": {
      "start": {
        "value": 24,
        "modifier": "EXACT"
      },
      "end": {
        "value": 62,
        "modifier": "EXACT"
      }
    },
    "description": "T

In [73]:
def get_kingdom(entry):
  """
  Return the kingdom of the entry (Metazoa, Viridiplantae, Fungi or Other),
  based on the taxonomic lineage stored in the JSON structure.
  """

  if "Metazoa" in entry["organism"]["lineage"]:
    k = "Metazoa"
  elif "Viridiplantae" in entry["organism"]["lineage"]:
    k = "Viridiplantae"
  elif "Fungi" in entry["organism"]["lineage"]:
    k = "Fungi"
  else:
    k = "Other"
  return k

def filter_entry_positive(entry):
  """
  Keep a positive entry only if it meets both criteria:
  - the SP cleavage site is present (the end position of the SP is defined)
  - the SP is at least 14 residues long
  """

  try:
      # The SP is assumed to be the first annotated feature of the entry
      e_pos = int(entry["features"][0]["location"]["end"]["value"])

      # If description != "" there is no cleavage site
      if entry["features"][0]["description"] != "":
          return False

      # Discard entries whose SP is shorter than 14 aa
      if e_pos <= 13:
          return False

      return True

  except (KeyError, IndexError, ValueError, TypeError):
      return False


def filter_entry_negative(entry):
    return True


def json_to_tsv_positive(entry):
  """
  Return the fields of a positive entry to be written to the .tsv file.
  """
  return (entry["primaryAccession"],
          entry["organism"]["scientificName"],
          get_kingdom(entry),
          entry["sequence"]["length"],
          entry["features"][0]["location"]["end"]["value"])

def json_to_tsv_negative(entry):
  """
  Return the fields of a negative entry to be written to the .tsv file.
  It also checks for a transmembrane helix starting within the first 90 aa,
  in order to discriminate SPs from transmembrane helices and reduce the
  false positive rate (FPR).
  """

  tm_evidence = False
  for f in entry["features"]:
      if f["type"] == "Transmembrane":
        if re.search("Helical", f["description"]):
          if f["location"]["start"]["value"] <= 90:
            tm_evidence = True
            break

  return (entry["primaryAccession"],
          entry["organism"]["scientificName"],
          get_kingdom(entry),
          entry["sequence"]["length"],
          tm_evidence)


def get_dataset(search_url, filter_function, json_to_tsv_function, columns, tsv_file, fasta_file):
  """
  Process all batches, write the entries accepted by filter_function to
  tsv_file and fasta_file, and print the total and filtered number of entries.
  """
  n_total, n_filtered = 0, 0

  with open(tsv_file, 'w') as tsv, open(fasta_file, 'w') as fasta:
    print(*columns, sep="\t", file=tsv)

    for batch, total in get_batch(search_url):
        data = json.loads(batch.text)["results"]

        # Loop through each individual entry in the 'results' list of the JSON data
        for entry in data:
          n_total += 1

          # Check if the entry meets the criteria defined by filter_function
          if filter_function(entry):
            n_filtered += 1
            fields = json_to_tsv_function(entry)
            print(*fields, sep="\t", file=tsv)
            print(">", entry["primaryAccession"], sep="", file=fasta)
            print(entry["sequence"]["value"], file=fasta)

  print(f"Total: {n_total}\nFiltered: {n_filtered}")
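
`filter_entry_positive` and `json_to_tsv_positive` read the signal peptide from `entry["features"][0]`, which relies on the SP being the first feature of every entry. A more defensive variant, sketched here under the assumption that entries follow the feature layout printed earlier, looks the feature up by its `type` instead. The helper names `find_signal_feature` and `sp_end_position` are hypothetical, and the toy entry is not real data.

```python
def find_signal_feature(entry):
    """Return the first feature of type "Signal", or None if there is none."""
    for feature in entry.get("features", []):
        if feature.get("type") == "Signal":
            return feature
    return None

def sp_end_position(entry):
    """Return the SP end position as an int, or None if it is not available."""
    signal = find_signal_feature(entry)
    if signal is None:
        return None
    value = signal.get("location", {}).get("end", {}).get("value")
    return int(value) if value is not None else None

# Toy entry with the Signal feature deliberately placed after the Chain
toy_entry = {
    "features": [
        {"type": "Chain",
         "location": {"start": {"value": 22}, "end": {"value": 401}}},
        {"type": "Signal", "description": "",
         "location": {"start": {"value": 1}, "end": {"value": 21}}},
    ]
}

cleavage = sp_end_position(toy_entry)            # found regardless of feature order
missing = sp_end_position({"features": []})      # None when no SP is annotated
print(cleavage, missing)
```

Whether this lookup changes the number of retained entries depends on how often the SP is not the first feature, which has not been checked here.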

In [74]:
if __name__ == "__main__":
  print('Positive entries:')
  get_dataset(url_positive, filter_entry_positive, json_to_tsv_positive, col_positive, "positive.tsv", "positive.fasta")
  print('\nNegative entries:')
  get_dataset(url_negative, filter_entry_negative, json_to_tsv_negative, col_negative, "negative.tsv", "negative.fasta")

Positive entries:
https://rest.uniprot.org/uniprotkb/search?format=json&query=%28%28fragment%3Afalse%29%20AND%20%28taxonomy_id%3A2759%29%20AND%20%28length%3A%5B40%20TO%20%2A%5D%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28existence%3A1%29%20AND%20%28ft_signal_exp%3A%2A%29%29&cursor=c9bacmxsqhkqgdxgnaquhsybhmvuelm5fxu0j&size=500
https://rest.uniprot.org/uniprotkb/search?format=json&query=%28%28fragment%3Afalse%29%20AND%20%28taxonomy_id%3A2759%29%20AND%20%28length%3A%5B40%20TO%20%2A%5D%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28existence%3A1%29%20AND%20%28ft_signal_exp%3A%2A%29%29&cursor=28m7xk8oeejl5mhhil7voukl953i9phd2w397hu&size=500
https://rest.uniprot.org/uniprotkb/search?format=json&query=%28%28fragment%3Afalse%29%20AND%20%28taxonomy_id%3A2759%29%20AND%20%28length%3A%5B40%20TO%20%2A%5D%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28existence%3A1%29%20AND%20%28ft_signal_exp%3A%2A%29%29&cursor=28ndf240hnvwz75t65klyjzcujbd3y9nn3i5gky&size=500
https://rest.uniprot.org/uniprotkb/search?for

In [75]:
import pandas as pd

negative_set = pd.read_csv("negative.tsv", sep="\t")
positive_set = pd.read_csv("positive.tsv", sep="\t")

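print(len(positive_set))   # number of positive entries
print(len(negative_set))   # number of negative entries
print(len(negative_set[negative_set["N-term transmembrane"] == True]))   # negatives with an N-terminal transmembrane helix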


2932
20615
2465
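
The counts above are computed from the TSV files only. Since `get_dataset` writes each accepted entry to both output files in the same order, a quick consistency check could compare the accessions found in the two formats. The following is a minimal sketch; `read_fasta_ids` and `read_tsv_ids` are hypothetical helpers, and the in-memory data are placeholders rather than real entries.

```python
import csv
import io

def read_fasta_ids(handle):
    """Collect the accession from every FASTA header line (">ACCESSION")."""
    return [line[1:].strip() for line in handle if line.startswith(">")]

def read_tsv_ids(handle, column="Accession"):
    """Collect the values of one column from a tab-separated file with a header."""
    return [row[column] for row in csv.DictReader(handle, delimiter="\t")]

# Toy in-memory files; accessions and sequences are placeholders
toy_fasta = io.StringIO(">ACC001\nMKTLLA\n>ACC002\nMSSVQP\n")
toy_tsv = io.StringIO("Accession\tOrganism\nACC001\tOrg one\nACC002\tOrg two\n")

fasta_ids = read_fasta_ids(toy_fasta)
tsv_ids = read_tsv_ids(toy_tsv)

# Both files should list the same accessions in the same order
same_order = fasta_ids == tsv_ids
print(same_order)
```

With the real outputs, the same functions could be called on `open("positive.fasta")` and `open("positive.tsv")`.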
