# How to use Cartography
This is a pre-alpha version of Cartography for use by the Annotation team.  
It's broken and buggy in lots of places, but we're working on making it more robust!  

These instructions are for running the code in a JupyterLab environment that has already been set up to run Cartography.  
If you want to set up your own environment locally, see **[Local Setup](#Local-Setup)**.

---
## Vanilla Mode
Starts with input protein FASTA files, runs Foldseek and BLAST searches, and builds a space.

![vanilla rulegraph](rulegraph.png)

#### 1. Open up a new terminal.  
- Click the blue "+" button on the top left.  
- Under "Other", click "Terminal".  

#### 2. Activate the conda environment.  
- Run this command in the Terminal:  
    `conda activate cartography`
    
#### 3. Add query files to the `input/` folder.
- The pipeline will search these proteins against the BLAST NR database and the Foldseek afdb50, afdb-proteome, and afdb-swissprot databases.
- You can manually add FASTA files for sequences shorter than 400 aa, one file per peptide. The pipeline runs ESMFold to generate a predicted structure.
- For longer sequences, use [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).
- You can also pull down proteins with a Uniprot accession number and AlphaFold structure using the following command.  
    Replace `{accession}` with your Uniprot accession (e.g. [`P24928`](https://www.uniprot.org/uniprotkb/P24928/entry)).  
    
    `python ProteinCartography/fetch_accession.py -a {accession} -o input -f fasta pdb`
    
    - This saves a FASTA file from Uniprot and a PDB file from AlphaFold to the `input/` folder.
- Sequences longer than 700 aa currently cause memory failures for remote BLAST.
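
If you want to check sequence lengths before starting a run, a minimal sketch like the one below may help. It is not part of the pipeline. It assumes one record per file and a `.fasta` file extension (both assumptions), and the function names are hypothetical. The 400 aa and 700 aa thresholds are the ones quoted above.

```python
from pathlib import Path

# Limits quoted in this README; adjust them if the pipeline changes.
ESMFOLD_MAX_AA = 400  # at or above this, supply a ColabFold structure instead
BLAST_MAX_AA = 700    # above this, remote BLAST currently runs out of memory


def read_fasta_length(fasta_path):
    """Return the number of residues in a single-record FASTA file."""
    length = 0
    for line in Path(fasta_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and the ">" header line.
        if line and not line.startswith(">"):
            length += len(line)
    return length


def classify_length(n_residues):
    """Map a sequence length to a short note based on the limits above."""
    if n_residues > BLAST_MAX_AA:
        return "too long for remote BLAST"
    if n_residues >= ESMFOLD_MAX_AA:
        return "needs a ColabFold structure"
    return "ok"


def check_input_folder(input_dir="input"):
    """Return {filename: (length, note)} for every .fasta file in input_dir."""
    report = {}
    for fasta in sorted(Path(input_dir).glob("*.fasta")):
        n_residues = read_fasta_length(fasta)
        report[fasta.name] = (n_residues, classify_length(n_residues))
    return report
```

Calling `check_input_folder("input")` from the repo root returns one entry per FASTA file, so you can spot problem sequences before running `snakemake`.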

#### 4. Run the pipeline.
- Run this command in the Terminal:  
    `snakemake --cores 16`
- You can also give a nickname to the analysis using:
    `snakemake --cores 16 --config analysis_name={my_name}`

---
## From-Folder Mode
Starts with a folder of PDB files and a `uniprot_features.tsv` file and builds a space.

![from folder rulegraph](rulegraph_ff.png)

#### 1. Put all the PDB files in a folder
- You can do this on your personal computer.
- Each protein should have a unique ID as the filename prefix, referred to as the "protid". For example, a protein with the protid `P00000` should have a PDB file with the name `P00000.pdb`.
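
To see which protids the folder currently contains, something like this sketch could work. It assumes the protid is everything before the `.pdb` extension, as in the `P00000.pdb` example. The function name `list_protids` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def list_protids(pdb_dir):
    """Return the sorted protids, taken as the filename stem of each .pdb file."""
    return sorted(path.stem for path in Path(pdb_dir).glob("*.pdb"))
```

For a folder containing `P00000.pdb`, `list_protids` would include `"P00000"` in its result.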

#### 2. Make a `uniprot_features.tsv` file.
- For each protid in the dataset, you should gather features from Uniprot into a `uniprot_features.tsv` file.
- The first column of the file should be the `protid`, e.g. `P00000` above, with `protid` as the column name.
- You should also have additional columns with the following values for each protein:

| column name | data type | description |
| :---------- | :-------- | :---------- |
| `proteinDescription.recommendedName.fullName.value` | `str` | a short human-readable description of the protein |
| `organism.commonName` | `str` | the common name of the organism |
| `organism.scientificName` | `str` | the scientific name of the organism |
| `sequence.length` | `int` | the number of amino acids in the protein |
| `annotationScore` | `int` | the [Uniprot Annotation Score](https://www.uniprot.org/help/annotation_score) for the protein. Should be 1 if it's a new protein. |
| `organism.lineage` | `list` of `str` | the taxonomic lineage of the organism. Write each taxonomic group as a quoted string, separate the groups with commas, and wrap the whole list in square brackets so that it is converted to a Python list when read with the [`eval()`](https://docs.python.org/3/library/functions.html#eval) built-in function. |

> For `organism.lineage`, an example for a mouse protein might look like this:  
> `['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Glires', 'Rodentia', 'Myomorpha', 'Muroidea', 'Muridae', 'Murinae', 'Mus', 'Mus']`

- You should add this file to the same folder as the PDBs.
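
Before uploading, you may want to sanity-check the file. The sketch below is not part of the pipeline and `validate_features` is a hypothetical helper. It checks that the columns listed above are present, that each `organism.lineage` value parses as a list of strings, and that a matching `{protid}.pdb` sits in the same folder. It uses `ast.literal_eval` rather than `eval()` for the check, since it only accepts literals. This is a swapped-in choice for validation, not a description of how the pipeline reads the file.

```python
import ast
import csv
from pathlib import Path

# Column names taken from the table above.
REQUIRED_COLUMNS = [
    "protid",
    "proteinDescription.recommendedName.fullName.value",
    "organism.commonName",
    "organism.scientificName",
    "sequence.length",
    "annotationScore",
    "organism.lineage",
]


def validate_features(tsv_path):
    """Return a list of human-readable problems found in the TSV (empty if none)."""
    tsv_path = Path(tsv_path)
    problems = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        if not header or header[0] != "protid":
            problems.append("first column must be named 'protid'")
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            problems.append(f"missing columns: {missing}")
            return problems
        for row in reader:
            protid = row["protid"]
            # The lineage cell should look like ['Eukaryota', 'Metazoa', ...].
            try:
                lineage = ast.literal_eval(row["organism.lineage"])
            except (ValueError, SyntaxError, TypeError):
                lineage = None
            is_str_list = isinstance(lineage, list) and all(
                isinstance(item, str) for item in lineage
            )
            if not is_str_list:
                problems.append(f"{protid}: organism.lineage is not a list of strings")
            # The TSV is expected to live in the same folder as the PDB files.
            if not (tsv_path.parent / f"{protid}.pdb").is_file():
                problems.append(f"{protid}: no matching PDB file next to the TSV")
    return problems
```

An empty list from `validate_features("my_folder/uniprot_features.tsv")` means none of these checks failed. It does not guarantee the pipeline will accept the file.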

#### 3. Upload the files to this compute environment.
- You might want to zip the folder.
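
If you prefer to zip the folder from Python on your own computer, a minimal sketch using the standard library could look like this. The function name `zip_folder` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_folder(folder):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        str(folder),                  # archive path without the .zip suffix
        "zip",
        root_dir=str(folder.parent),  # paths inside the zip are relative to this
        base_dir=folder.name,         # keep the folder itself as the top-level entry
    )
    return Path(archive)
```

Unzipping the result recreates the folder with the PDB files and `uniprot_features.tsv` inside it.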

#### 4. Update the `config_from_folder.yml` file.
- Change the value of `input_dir` to be the folder of the PDBs that you uploaded.

#### 5. Activate the conda environment.  
- Run this command in the Terminal:  
    `conda activate cartography`

#### 6. Run the pipeline.
- Run this command in the Terminal:  
    `snakemake --snakefile Snakefile_from_folder --cores 16`
- You can also give a nickname to the analysis using:
    `snakemake --snakefile Snakefile_from_folder --cores 16 --config analysis_name={my_name}`

---
## Local Setup

If you'd like to run Cartography on your local computer instead, you can try cloning the repo from GitHub and setting up the conda environment yourself.

#### 1. Clone from GitHub
- `git clone https://github.com/Arcadia-Science/gene-family-cartography.git`

#### 2. Set up the conda environment
- `cd` into the GitHub repo (e.g. `cd gene-family-cartography`)
- `conda env create -n cartography -f envs/cartography.yml`
- This can be slow, so you could install `mamba` first and then run the above conda creation using `mamba env create` instead.
    - `conda install mamba`
    - `mamba env create -n cartography -f envs/cartography.yml`

#### 3. Get Started
- See **[Vanilla Mode](#Vanilla-Mode)** or **[From-Folder Mode](#From-Folder-Mode)** above for your use case.

--- 
## Known Bugs

#### Foldseek Connection Failure
The first time we run the pipeline, we often get an error from the rule `run_foldseek`:  
`requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))`

This has something to do with the speed of our requests.  
Cancelling the pipeline with <kbd>Control</kbd> + <kbd>C</kbd> and then rerunning the `snakemake --cores 16` command seems to get around this.  
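
If you would rather not rerun by hand, a rough sketch of an automatic retry is shown below. It assumes that the failed run exits with a non-zero status, which may not hold if Snakemake keeps working on other jobs. `run_with_retries` is a hypothetical helper, and the default wait time is an arbitrary choice.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=3, wait_seconds=30):
    """Run command until it exits with status 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Pause before retrying, since the failure seems related to request speed.
            time.sleep(wait_seconds)
    raise RuntimeError(f"command failed {max_attempts} times: {command}")
```

For example, `run_with_retries(["snakemake", "--cores", "16"])` reruns the same command as above up to three times.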

#### BLAST dies
Sometimes BLAST will take forever to respond or will die with an error like `CPU usage limit was exceeded, resulting in SIGXCPU (24)`.  
You can wait a few minutes and try again later. Make sure to delete the `blastresults/` folder in `output/` so Snakemake knows to re-run that part of the pipeline.
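
The cleanup step can be sketched in Python as follows. The `output/blastresults/` path comes from the note above. If your output ends up somewhere else (for example when using `analysis_name`), pass that folder instead. `clear_blast_results` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def clear_blast_results(output_dir="output"):
    """Delete <output_dir>/blastresults if it exists; return True if it was removed."""
    target = Path(output_dir) / "blastresults"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

After calling `clear_blast_results()`, rerun the `snakemake` command so the BLAST step is regenerated.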


