# Overview

This notebook was used to prepare the UniProt `features_file.tsv` for Cluster Mode of the ProteinCartography pipeline.

To use this notebook, clone the [ProteinCartography GitHub repository](https://github.com/Arcadia-Science/ProteinCartography), install ProteinCartography, and activate the `cartography` conda environment.

The original list of actins was generated in our [Defining Actin](https://research.arcadiascience.com/pub/idea-defining-actin/release/4?readingCollection=9a516d32) pub and the corresponding [Actin Prediction](https://github.com/Arcadia-Science/2022-actin-prediction) pipeline, where we did a protein BLAST search for 50,000 matches to [human beta-actin](https://www.uniprot.org/uniprotkb/P60709/entry). The list can be found in this [Zenodo archive](https://zenodo.org/records/7384393).


## Setup

Import dependencies.


In [None]:
import os
import pandas as pd
import sys

# Path to the ProteinCartography scripts; adjust to match your local clone
PC_path = "/Users/brae/ProteinCartography/ProteinCartography/"
sys.path.append(PC_path)

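Prepare directories. Once these are prepared, place the `2022-actin-prediction-blastoutputs1.txt` and `2022-actin-prediction-blastoutputs2.txt` files in the `../input/` folder (relative to this notebook), which is where the cells below read them from.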


In [None]:
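# Create the folders that the cells below read from and write to
os.makedirs("../input/", exist_ok=True)
os.makedirs("../output/", exist_ok=True)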

## Map RefSeq IDs

The output of BLAST is a list of RefSeq IDs, but ProteinCartography and the AlphaFold database reference proteins by their UniProt accessions. This step converts the RefSeq IDs to UniProt accessions, which the next step uses to download the required metadata. Note that we split the input into 2 batches because a single batch was too large and repeatedly failed.


In [None]:
# Map RefSeq IDs to UniProt IDs

map_refseqids = os.path.join(PC_path, "map_refseqids.py")

# Batch 1
os.system(
    f"python {map_refseqids} -i ../input/2022-actin-prediction-blastoutputs1.txt -o ../input/blastoutput_uniprot1.txt"
)

# Batch 2
os.system(
    f"python {map_refseqids} -i ../input/2022-actin-prediction-blastoutputs2.txt -o ../input/blastoutput_uniprot2.txt"
)
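
The batching described above can be sketched as follows. This is only an illustration of splitting an ID list into near-equal chunks, not the procedure actually used to produce the two input files; the helper name `split_into_batches` and the example IDs are hypothetical.

```python
def split_into_batches(ids, n_batches):
    """Split a list of IDs into at most n_batches contiguous, near-equal chunks."""
    # Ceiling division so every ID lands in a batch; at least 1 to avoid a zero step
    batch_size = max(1, -(-len(ids) // n_batches))
    return [ids[i : i + batch_size] for i in range(0, len(ids), batch_size)]


# Made-up RefSeq-style IDs, for illustration only
example_ids = [
    "XP_000001.1",
    "XP_000002.1",
    "XP_000003.1",
    "XP_000004.1",
    "XP_000005.1",
]
batches = split_into_batches(example_ids, 2)
```

Each batch could then be written to its own text file, one ID per line, and passed to the mapping script separately.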


## Download UniProt metadata

After obtaining UniProt IDs, we use them to download metadata from UniProt. This data will be the bulk of the UniProt features file.


In [None]:
fetch_uniprot_metadata = os.path.join(PC_path, "fetch_uniprot_metadata.py")

# Batch 1
os.system(
    f"python {fetch_uniprot_metadata} -i ../input/blastoutput_uniprot1.txt -o ../input/uniprot_features1.tsv"
)

# Batch 2
os.system(
    f"python {fetch_uniprot_metadata} -i ../input/blastoutput_uniprot2.txt -o ../input/uniprot_features2.tsv"
)
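
`os.system` returns the exit status of the command but does not raise an error, so a failed batch can go unnoticed in a notebook. One possible alternative, sketched here under the assumption that the scripts exit with a non-zero status on failure, is to call them through `subprocess.run` with `check=True`. The helper name `run_script` is hypothetical and not part of ProteinCartography.

```python
import subprocess
import sys


def run_script(script_path, *args):
    """Run a Python script and raise CalledProcessError if it exits non-zero."""
    # sys.executable keeps the call inside the currently active environment
    command = [sys.executable, str(script_path), *args]
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Usage sketch, mirroring the cell above (not executed here):
# run_script(
#     fetch_uniprot_metadata,
#     "-i", "../input/blastoutput_uniprot1.txt",
#     "-o", "../input/uniprot_features1.tsv",
# )
```

Passing the arguments as a list also avoids shell quoting problems if a path contains spaces.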


In [None]:
# Merge batched files
uniprot_features1 = pd.read_csv("../input/uniprot_features1.tsv", sep="\t")
uniprot_features2 = pd.read_csv("../input/uniprot_features2.tsv", sep="\t")
uniprot_features1 = uniprot_features1.drop_duplicates(subset="protid", keep="first")
uniprot_features2 = uniprot_features2.drop_duplicates(subset="protid", keep="first")
uniprot_features_combined = pd.concat([uniprot_features1, uniprot_features2])

# Save uniprot_features file
uniprot_features_combined.to_csv(
    "../input/uniprot_features_combined.tsv", sep="\t", index=None
)
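
Deduplicating each batch on its own does not remove a `protid` that appears in both batches. Whether that happens in this dataset is not stated here; the sketch below uses toy data with made-up accessions and an illustrative `Length` column to show how a cross-batch check could look.

```python
import pandas as pd

# Toy batches with made-up accessions; "B0B000" appears in both
batch1 = pd.DataFrame({"protid": ["A0A000", "B0B000"], "Length": [375, 376]})
batch2 = pd.DataFrame({"protid": ["B0B000", "C0C000"], "Length": [376, 377]})

# Concatenate, then count how many protids are repeated across batches
combined = pd.concat([batch1, batch2], ignore_index=True)
n_repeated = int(combined["protid"].duplicated().sum())

# Keep the first occurrence of each protid
deduplicated = combined.drop_duplicates(subset="protid", keep="first")
```

If `n_repeated` is zero for the real files, the per-batch deduplication above is already sufficient.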

## Filter UniProt hits

We filtered UniProt hits based on fragment status and whether or not the UniProt entry was active using the `filter_uniprot_hits.py` script from [ProteinCartography](https://github.com/Arcadia-Science/ProteinCartography).


In [None]:
filter_uniprot_hits = os.path.join(PC_path, "filter_uniprot_hits.py")

os.system(
    f"python {filter_uniprot_hits} -i ../input/uniprot_features_combined.tsv -o ../input/features_filter.tsv"
)


In [None]:
# The filter output has no header row; name its single column protid
features_filtered = pd.read_csv("../input/features_filter.tsv", sep="\t", header=None)
features_filtered.columns = ["protid"]

# Overwrite the filter list with the header added
features_filtered.to_csv("../input/features_filter.tsv", sep="\t", index=None)


In [None]:
# Apply filter
uniprot_features_combined = pd.read_csv(
    "../input/uniprot_features_combined.tsv", sep="\t"
)
features_filtered = pd.read_csv("../input/features_filter.tsv", sep="\t")
uniprot_features_filtered = uniprot_features_combined.merge(
    features_filtered, on="protid", how="inner"
)

# Save uniprot_features file
uniprot_features_filtered.to_csv(
    "../input/uniprot_features_filtered.tsv", sep="\t", index=None
)
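
The inner merge above acts as a filter: only rows whose `protid` appears in the filter list survive. A small sketch on toy data (made-up IDs and names) shows the effect, alongside an equivalent `isin` selection for comparison.

```python
import pandas as pd

# Toy metadata table and filter list, for illustration only
metadata = pd.DataFrame(
    {
        "protid": ["P1", "P2", "P3"],
        "Protein names": ["actin-like 1", "actin fragment", "actin-like 3"],
    }
)
keep = pd.DataFrame({"protid": ["P1", "P3"]})

# An inner merge keeps only protids present in both tables
filtered = metadata.merge(keep, on="protid", how="inner")

# Equivalent row selection without a merge
selected = metadata[metadata["protid"].isin(keep["protid"])]
```

The two approaches select the same rows as long as the filter list has no repeated `protid` values; a repeated value would duplicate rows in the merge but not in the `isin` selection.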

## Reformat features file

The features file from UniProt must be reformatted slightly to work well with ProteinCartography. We reformatted the file to follow the [feature file guidelines](https://github.com/Arcadia-Science/ProteinCartography#feature-file-main-columns) in the ProteinCartography README.


In [None]:
# Read in raw features file

uniprot_features_filtered = pd.read_csv(
    "../input/uniprot_features_filtered.tsv", sep="\t"
)

# Reformat lineage column

def lineage_string_splitter(lineage_string):
    # Drop the " (rank)" suffix from each comma-separated lineage entry
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]

uniprot_features_filtered["Lineage"] = uniprot_features_filtered[
    "Taxonomic lineage"
].apply(lineage_string_splitter)

# Saves updated uniprot_features file

uniprot_features_filtered.to_csv("../output/uniprot_features.tsv", sep="\t", index=None)
uniprot_features_filtered.to_csv(
    "../../ProteinCartography/Actin/output/uniprot_features.tsv", sep="\t", index=None
)
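
To make the lineage reformatting concrete, here is the same splitting logic applied to a single string. The input is an illustrative value in the `Name (rank)` format that the splitter assumes, not a row taken from the actual file, and the function name `strip_ranks` is used only for this example.

```python
def strip_ranks(lineage_string):
    """Return the taxon names from a 'Name (rank), Name (rank)' string."""
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]


# Illustrative lineage string, not taken from the real features file
example = "Eukaryota (superkingdom), Metazoa (kingdom), Chordata (phylum)"
lineage = strip_ranks(example)
# lineage is ['Eukaryota', 'Metazoa', 'Chordata']
```

Because the `Lineage` column holds Python lists, `to_csv` writes each one as its string representation, for example `['Eukaryota', 'Metazoa', 'Chordata']`.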