# Overview

This notebook prepares the UniProt `features_file.tsv` used as input for Cluster Mode of the ProteinCartography pipeline.

This notebook uses scripts from the [ProteinCartography repo](https://github.com/Arcadia-Science/ProteinCartography/tree/v0.4.2) (v0.4.2). To use this notebook, you must download the ProteinCartography repo and activate the conda environment created from the [`envs/web_apis.yml`](https://github.com/Arcadia-Science/ProteinCartography/blob/v0.4.2/envs/web_apis.yml) file in that repo. After setting up and activating the environment, install `ipykernel` so the environment can be used from Jupyter notebooks, with the following command:
```
conda install -n web_apis ipykernel --update-deps --force-reinstall
```


The proteins analyzed here were found by searching [UniProt](https://www.uniprot.org) for the species of interest (in this case *Ornithodoros turicata*). This notebook lets you download the associated metadata and prepare it for analysis with ProteinCartography. To download the *Ornithodoros* protein information, use the file `metadata/ornithodoros.txt`, found in this repo and in the [Zenodo](https://doi.org/10.5281/zenodo.12796464) repo.


## Setup


Import dependencies.

In [1]:
import os
import pandas as pd
import sys

PC_path = "./../../ProteinCartography/ProteinCartography"
sys.path.append(PC_path)
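
Because `PC_path` is a relative path, it can silently point at the wrong place if the notebook is started from a different directory. The sketch below is one way to check up front that a script exists under that path. The helper name `require_script` is hypothetical and is not part of ProteinCartography.

```python
from pathlib import Path


def require_script(repo_dir, script_name):
    """Return the resolved path to script_name inside repo_dir, or raise if it is missing.

    Hypothetical helper for illustration; not part of ProteinCartography.
    """
    script_path = Path(repo_dir).expanduser().resolve() / script_name
    if not script_path.is_file():
        raise FileNotFoundError(
            f"Could not find {script_name} in {script_path.parent}; "
            "check that the ProteinCartography repo is where PC_path expects it."
        )
    return script_path
```

For example, `require_script(PC_path, "fetch_uniprot_metadata.py")` would raise a `FileNotFoundError` immediately rather than letting a later cell fail.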

Prepare directories. 

In [None]:
# Create the directory that the final features file is written to
os.makedirs("../output/", exist_ok=True)

## Download UniProt metadata

After obtaining UniProt IDs, we use them to download metadata from UniProt with ProteinCartography's `fetch_uniprot_metadata.py` script. These metadata make up the bulk of the UniProt features file.


In [None]:
# Set paths

fetch_uniprot_metadata = os.path.join(PC_path, "fetch_uniprot_metadata.py")
input_uniprot_metadata_download = "ornithodoros.txt"
output_uniprot_metadata_download = "uniprot_features1.tsv"

os.system(
    f"python {fetch_uniprot_metadata} -i {input_uniprot_metadata_download} -o {output_uniprot_metadata_download}"
)
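
`os.system` returns an exit status that is easy to ignore, and a path containing spaces would break the command string. As a possible alternative, the sketch below builds the same command as an argument list and runs it with `subprocess.run(..., check=True)`, which raises an error if the script fails. The `-i` and `-o` flags are taken from the cell above. The function names `build_fetch_command` and `run_fetch` are hypothetical.

```python
import subprocess
import sys


def build_fetch_command(script_path, input_path, output_path):
    """Build the argument list for the metadata download script.

    Uses the interpreter running this notebook so the active conda
    environment is the one that executes the script.
    """
    return [
        sys.executable,
        str(script_path),
        "-i",
        str(input_path),
        "-o",
        str(output_path),
    ]


def run_fetch(script_path, input_path, output_path):
    """Run the download and raise CalledProcessError on a non-zero exit."""
    command = build_fetch_command(script_path, input_path, output_path)
    return subprocess.run(command, check=True)
```

With the variables defined in the cell above, the call would be `run_fetch(fetch_uniprot_metadata, input_uniprot_metadata_download, output_uniprot_metadata_download)`.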


## Reformat features file

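The features file from UniProt must be reformatted slightly to work well with ProteinCartography. We reformatted the file to follow the guidelines listed [here](https://github.com/Arcadia-Science/ProteinCartography#feature-file-main-columns): the first column is renamed to `protid`, and a `Lineage` column is added that lists the taxon names from the `Taxonomic lineage` column without their rank annotations.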
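
As a rough illustration, a few sanity checks can be run on a features table before handing it to the pipeline. The checks below (a `protid` column that exists, has no missing values, and has no duplicates) are assumptions about what a usable identifier column looks like, not a restatement of the official guidelines. The function name `check_features_table` is hypothetical.

```python
import pandas as pd


def check_features_table(features, id_column="protid"):
    """Return a list of human-readable problems found in a features table.

    Assumed checks for illustration only; consult the ProteinCartography
    guidelines for the authoritative requirements.
    """
    problems = []
    if id_column not in features.columns:
        # Without the identifier column, the remaining checks cannot run
        return [f"missing '{id_column}' column"]
    if features[id_column].isna().any():
        problems.append(f"'{id_column}' contains missing values")
    if features[id_column].dropna().duplicated().any():
        problems.append(f"'{id_column}' contains duplicate identifiers")
    return problems


# Made-up identifiers, used only to show the expected shape of the table
example = pd.DataFrame({"protid": ["ID_1", "ID_2"], "Length": [120, 342]})
print(check_features_table(example))
```

An empty list means none of these assumed checks failed.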


In [None]:
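# Read in the raw features file and rename the first column to "protid"

uniprot_features = pd.read_csv("uniprot_features1.tsv", sep="\t")
uniprot_features = uniprot_features.rename(
    columns={uniprot_features.columns[0]: "protid"}
)

# Reformat the lineage column: keep each taxon name and drop its "(rank)" suffix


def lineage_string_splitter(lineage_string):
    # Entries without a lineage are read in as NaN; return an empty list for them
    if not isinstance(lineage_string, str):
        return []
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]


uniprot_features["Lineage"] = uniprot_features["Taxonomic lineage"].apply(
    lineage_string_splitter
)

# Save the updated features file

uniprot_features.to_csv("../output/uniprot_features.tsv", sep="\t", index=False)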
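
To make the lineage reformatting concrete, the sketch below applies the same splitting logic to a single string. The example lineage is made up for illustration and assumes the `Taxonomic lineage` column uses the `name (rank), name (rank), ...` layout that the splitter expects. The function name `split_lineage` is hypothetical.

```python
def split_lineage(lineage_string):
    """Split a 'name (rank), name (rank)' string into a list of names."""
    return [rank.split(" (")[0] for rank in lineage_string.split(", ")]


# Made-up example string in the assumed "name (rank)" layout
example_lineage = (
    "cellular organisms (no rank), Eukaryota (superkingdom), "
    "Metazoa (kingdom), Arthropoda (phylum)"
)

print(split_lineage(example_lineage))
# → ['cellular organisms', 'Eukaryota', 'Metazoa', 'Arthropoda']
```

Because each `Lineage` cell holds a Python list, `to_csv` writes its string form (for example `['Eukaryota', 'Metazoa']`) into the TSV.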