# Overview

This notebook was used to prepare the UniProt `features_file.tsv` for Cluster Mode of the ProteinCartography pipeline.

To use this notebook, clone the [ProteinCartography GitHub repository](https://github.com/Arcadia-Science/ProteinCartography), install ProteinCartography, and activate the `cartography` conda environment.

The original list of actins was generated in our [Defining Actin](https://research.arcadiascience.com/pub/idea-defining-actin/release/4?readingCollection=9a516d32) pub and the corresponding [Actin Prediction](https://github.com/Arcadia-Science/2022-actin-prediction) pipeline, where we did a protein BLAST search for 50,000 matches to [human beta-actin](https://www.uniprot.org/uniprotkb/P60709/entry). The list can be found in this [Zenodo archive](https://zenodo.org/records/7384393).


## Setup

Import dependencies.


In [16]:
import os
import pandas as pd
import sys

sys.path.append("../ProteinCartography/")

os.chdir("../ProteinCartography")

Prepare directories. Once these are prepared, place the `2022-actin-prediction-blastoutputs1.txt` and `2022-actin-prediction-blastoutputs2.txt` files in the `Actin/prep` folder.


In [25]:
for directory in ("Actin/prep", "Actin/output"):
    os.makedirs(directory, exist_ok=True)

## Map RefSeq IDs

BLAST returns a list of RefSeq IDs, but ProteinCartography and the AlphaFold database identify proteins by their UniProt accessions (UniProt IDs). This step converts the RefSeq IDs to UniProt accessions, which are then used to download the required metadata file. We split the input into two batches because a single batch was too large and repeatedly failed.


In [None]:
# Map RefSeq IDs to UniProt IDs
map_refseqids = "ProteinCartography/map_refseqids.py"

# Batch 1
os.system(
    f"python {map_refseqids} -i Actin/prep/2022-actin-prediction-blastoutputs1.txt -o Actin/prep/blastoutput_uniprot1.txt"
)

# Batch 2
os.system(
    f"python {map_refseqids} -i Actin/prep/2022-actin-prediction-blastoutputs2.txt -o Actin/prep/blastoutput_uniprot2.txt"
)
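The notebook does not record how the two batch files were produced. As a minimal sketch, assuming the BLAST output is a plain text file with one RefSeq ID per line, a split into roughly equal batches could look like this. The function name `split_id_file` and its arguments are hypothetical and not part of ProteinCartography.

```python
from pathlib import Path


def split_id_file(input_path, n_batches, output_template):
    """Split a one-ID-per-line text file into n_batches roughly equal files.

    Hypothetical helper: output_template must contain "{}" for the batch number.
    """
    lines = Path(input_path).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    batch_size = -(-len(ids) // n_batches)  # ceiling division
    output_paths = []
    for i in range(n_batches):
        batch = ids[i * batch_size : (i + 1) * batch_size]
        output_path = Path(output_template.format(i + 1))
        output_path.write_text("\n".join(batch) + "\n")
        output_paths.append(output_path)
    return output_paths


# Example call (the unsplit file name is an assumption):
# split_id_file(
#     "Actin/prep/2022-actin-prediction-blastoutputs.txt",
#     2,
#     "Actin/prep/2022-actin-prediction-blastoutputs{}.txt",
# )
```

Raising `n_batches` is the obvious adjustment if an individual batch still fails.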

## Download UniProt metadata

After obtaining UniProt IDs, we use them to download metadata from UniProt. This data will be the bulk of the UniProt features file.


In [None]:
fetch_uniprot_metadata = "ProteinCartography/fetch_uniprot_metadata.py"

# Batch 1
os.system(
    f"python {fetch_uniprot_metadata} -i Actin/prep/blastoutput_uniprot1.txt -o Actin/prep/uniprot_features1.tsv"
)

# Batch 2
os.system(
    f"python {fetch_uniprot_metadata} -i Actin/prep/blastoutput_uniprot2.txt -o Actin/prep/uniprot_features2.tsv"
)
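`os.system` returns the exit status but does not raise when a script fails, so a failed batch can go unnoticed. The sketch below shows one possible alternative using `subprocess.run` with `check=True`. The helper names `build_command` and `run_checked` are hypothetical; the `-i` and `-o` flags are the ones used in the calls above, and `sys.executable` stands in for `python`.

```python
import subprocess
import sys


def build_command(script, input_path, output_path):
    """Build the argument list for a script that takes -i and -o."""
    return [sys.executable, script, "-i", input_path, "-o", output_path]


def run_checked(command):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(command, check=True)


# Example loop over both batches (not executed here):
# for batch in (1, 2):
#     run_checked(
#         build_command(
#             "ProteinCartography/fetch_uniprot_metadata.py",
#             f"Actin/prep/blastoutput_uniprot{batch}.txt",
#             f"Actin/prep/uniprot_features{batch}.tsv",
#         )
#     )
```

Passing the arguments as a list also avoids shell quoting issues if a path ever contains spaces.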

In [None]:
# Merge batched files
uf1 = pd.read_csv("Actin/prep/uniprot_features1.tsv", sep="\t")
uf2 = pd.read_csv("Actin/prep/uniprot_features2.tsv", sep="\t")
uf1 = uf1.drop_duplicates(subset="protid", keep="first")
uf2 = uf2.drop_duplicates(subset="protid", keep="first")
ufc = uf1.merge(uf2, how="outer")

# Save uniprot_features file
ufc.to_csv("Actin/prep/uniprot_features_combined.tsv", sep="\t", index=None)
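An outer merge with no `on` argument joins on every column the two tables share, so a `protid` that appears in both batches with any differing field would remain as two rows. As a hedged alternative, not the method used above, the batches can be stacked with `pd.concat` and deduplicated afterwards. The toy tables and the `Length` values below are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two batch files; the values are illustrative only.
batch1 = pd.DataFrame({"protid": ["P1", "P2", "P2"], "Length": [375, 376, 376]})
batch2 = pd.DataFrame({"protid": ["P2", "P3"], "Length": [376, 377]})

# Stack the batches, then keep the first row seen for each protid.
combined = (
    pd.concat([batch1, batch2], ignore_index=True)
    .drop_duplicates(subset="protid", keep="first")
    .reset_index(drop=True)
)

print(combined["protid"].tolist())  # ['P1', 'P2', 'P3']
```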

## Filter UniProt hits

We filtered UniProt hits based on fragment status and whether or not the UniProt entry was active using the `filter_uniprot_hits.py` script from [ProteinCartography](https://github.com/Arcadia-Science/ProteinCartography).


In [9]:
filter_uniprot_hits = "ProteinCartography/filter_uniprot_hits.py"
os.system(
    f"python ProteinCartography/filter_uniprot_hits.py -i Actin/prep/uniprot_features_combined.tsv -o Actin/prep/features_filter.tsv"
)

0

In [80]:
# Read the headerless list of filtered IDs and name its only column protid
ff2 = pd.read_csv("Actin/prep/features_filter.tsv", sep="\t", header=None)
ff2.columns = ["protid"]

# Overwrite the filter file so that it includes the protid header
ff2.to_csv("Actin/prep/features_filter.tsv", sep="\t", index=None)
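The column can also be named while the file is read, using the `names` argument of `pd.read_csv`. A minimal sketch, assuming the filter file holds one ID per line with no header; the in-memory file and the IDs are stand-ins for illustration.

```python
import io

import pandas as pd

# Stand-in for a headerless, one-ID-per-line filter file; the IDs are illustrative.
headerless = io.StringIO("P60709\nP68133\n")

# names= assigns the column label while reading, so no separate rename is needed.
filter_ids = pd.read_csv(headerless, sep="\t", header=None, names=["protid"])

print(filter_ids.columns.tolist())  # ['protid']
```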

In [81]:
# Apply filter
ufc = pd.read_csv("Actin/prep/uniprot_features_combined.tsv", sep="\t")
ff = pd.read_csv("Actin/prep/features_filter.tsv", sep="\t")
uff = ufc.merge(ff, on="protid", how="inner")

# Save filtered uniprot_features file
uff.to_csv("Actin/prep/uniprot_features_filtered.tsv", sep="\t", index=None)
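The inner merge keeps only rows whose `protid` appears in both tables. The toy example below, with made-up IDs and values, illustrates that behavior and adds an optional check for filter IDs that have no metadata row; the check is a suggestion, not a step from the original analysis.

```python
import pandas as pd

# Toy stand-ins for the combined features table and the filter list.
features = pd.DataFrame({"protid": ["P1", "P2", "P3"], "Length": [375, 376, 120]})
keep = pd.DataFrame({"protid": ["P1", "P2", "P9"]})

# Inner merge keeps only rows whose protid appears in both tables.
filtered = features.merge(keep, on="protid", how="inner")

# IDs in the filter list that have no metadata row are worth inspecting.
missing = sorted(set(keep["protid"]) - set(features["protid"]))

print(filtered["protid"].tolist())  # ['P1', 'P2']
print(missing)  # ['P9']
```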

## Reformat features file

The features file from UniProt must be reformatted slightly to work well with ProteinCartography. We reformatted the file to fit with guidelines listed [here](https://github.com/Arcadia-Science/ProteinCartography#feature-file-main-columns).


In [None]:
# Read in raw features file
uff = pd.read_csv("Actin/prep/uniprot_features_filtered.tsv", sep="\t")


# Reformat lineage column: split on ", " and drop the " (rank)" suffix from each entry
def lineage_string_splitter(lineage_string):
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]


uff["Lineage"] = uff["Taxonomic lineage"].apply(lineage_string_splitter)

# Save updated uniprot_features file
uff.to_csv("Actin/output/uniprot_features.tsv", sep="\t", index=None)
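To make the lineage reformatting concrete, here is a small standalone sketch of the same splitting logic applied to a sample string. The sample lineage is illustrative, and the guard for missing values is an added assumption that the original cell does not include.

```python
import ast

import pandas as pd


def split_lineage(lineage_string):
    """Return rank names without their ' (rank)' suffix; missing values become []."""
    if pd.isna(lineage_string):
        return []
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]


example = "cellular organisms (no rank), Eukaryota (superkingdom), Metazoa (kingdom)"
print(split_lineage(example))  # ['cellular organisms', 'Eukaryota', 'Metazoa']

# A list written by to_csv is stored as its string representation,
# so it has to be parsed back into a list after reloading the file.
round_tripped = ast.literal_eval(str(split_lineage(example)))
```

Without such a guard, a row with an empty `Taxonomic lineage` value would raise an `AttributeError`, because pandas reads empty fields as floating-point `NaN`.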