# 4 Assign GAIN-GRN
#### This notebook constructs the complete GAIN-GRN indexing for a collection of GAIN domains.

The completed indexing objects are provided in the ../data folder as stal_indexing.pkl, human_indexing.pkl and pkd_indexing.pkl.

Requirements:
> - GESAMT binary
> - STRIDE set of files (one for each entry in the dataset, here we use float-modified STRIDE files for the outliers)
> - A Folder of template PDBs
> - template_data.json with all information about the template elements and centers

The main challenge in constructing a good indexing for each GAIN domain is detecting the start and end of each segment. Often, a stretch is assigned as one continuous Helix or Strand although it actually consists of two or more segments (e.g. kinks separate Helix 4, 5 and 6, but all residues are assigned to a single helical segment).

*To run this notebook, you will need to download all PDB models and the GAIN-GRN data from the zenodo repository:* 

https://dx.doi.org/10.5281/zenodo.12515545/gaingrn_data.tgz : download via `gaingrn.scripts.io.download_data()` into `path/to/GAIN-GRN/data`

https://dx.doi.org/10.5281/zenodo.12515545/agpcr_gains.tgz : download via `gaingrn.scripts.io.download_pdbs(target_directory=my/dir/to/pdbs)` and specify the `PDB_DIR` variable.

In [None]:
# DEPENDENCIES
import os
import pandas as pd
import pickle 

# LOCAL IMPORTS
import gaingrn.scripts.io
import gaingrn.scripts.assign
import gaingrn.scripts.alignment_utils
import gaingrn.scripts.bb_angle_tools
import gaingrn.scripts.indexing_utils
from gaingrn.scripts.indexing_classes import StAlIndexing


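# Use the GESAMT binary named in the GESAMT_BIN environment variable; fall back to a local default path if it is unset.
# os.environ.get() returns None for a missing variable instead of raising, so no try/except is needed.
GESAMT_BIN = os.environ.get('GESAMT_BIN')
if GESAMT_BIN is None:
    GESAMT_BIN = "/home/hildilab/lib/xtal/ccp4-8.0/ccp4-8.0/bin/gesamt"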

PDB_DIR = "../../all_pdbs"

To tackle the issue of a very broad assignment of some elements, and to help with splitting them into their respective elements, there is a two-fold modification of STRIDE files in place:

1. Every residue outside of 2 sigma of the mean (keep in mind, this is circular statistics) is assigned a lower-case letter as its SSE descriptor, which allows element ambiguities to be resolved.
2. The outlier's multiple of sigma is written into columns 66-70 of the STRIDE file. If this exceeds a defined threshold (usually 5.0), the element is always truncated at this residue.
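
The two rules above can be illustrated with a minimal sketch. It is not the `gaingrn.scripts.bb_angle_tools` implementation: the function names, the reference mean of -40° and the sigma of 10° are hypothetical values chosen only for illustration. The point it shows is that deviations have to be wrapped onto the circle before they are divided by sigma.

```python
def angular_deviation(angle_deg, mean_deg):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    # Wrap the signed difference into [-180, 180) so that 179 and -179 are 2 degrees apart.
    diff = (angle_deg - mean_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def sigma_multiple(angle_deg, mean_deg, sigma_deg):
    """Deviation from the reference mean, expressed in multiples of sigma."""
    return angular_deviation(angle_deg, mean_deg) / sigma_deg


def relabel(sse_letter, multiple, soft_cutoff=2.0):
    """Rule 1: lower-case the SSE letter of any residue beyond the soft cutoff."""
    return sse_letter.lower() if multiple > soft_cutoff else sse_letter


def truncation_points(multiples, outlier_cutoff=5.0):
    """Rule 2: indices at which an element would always be truncated."""
    return [i for i, m in enumerate(multiples) if m > outlier_cutoff]


# Hypothetical helix angles (degrees) compared against an assumed mean/sigma.
angles = [-42.0, -38.0, -65.0, 175.0, -41.0]
multiples = [sigma_multiple(a, mean_deg=-40.0, sigma_deg=10.0) for a in angles]
print("".join(relabel("H", m) for m in multiples))  # HHhhH
print(truncation_points(multiples))                 # [3]
```

Here the third residue (2.5 sigma) is only marked as lower-case, while the fourth (14.5 sigma) also exceeds the 5.0 cutoff and marks a truncation point.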

In [None]:
# This is already done in your set, just for further reference.

# import glob
# import gaingrn.scripts.bb_angle_tools
# gaingrn.scripts.bb_angle_tools.stride_file_processing(stride_files = glob.glob("/home/hildilab/projects/agpcr_nom/sigmas/sigma_2/*"), outfolder = "../data/gain_strides")

Load the GainCollection objects to be indexed. Here, we have the full set of 14435 structures (valid_collection) and the set of 31 structures (human_collection).

In [None]:
valid_collection = pd.read_pickle("../data/valid_collection.pkl")
human_collection = pd.read_pickle("../data/human_collection.pkl")

#### For testing, in this cell an individual indexing can be constructed. 

We have implemented a hierarchy of "split" modes, which disambiguate a continuous SSE that contains multiple segment centers --> **split_modes**

Setting _debug=True_ will result in a large amount of information being printed, enabling the tracing of errors and irregularities during the assignment process. 
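
As a rough illustration of such a hierarchy, the sketch below tries progressively "worse" criteria to find a split residue between two segment centers that fall into one continuous SSE. It is an assumption-laden toy, not the logic of `gaingrn.scripts.assign.assign_indexing`: the function `find_split`, the use of `"C"` for coil, the interpretation of `hard_cut` as an offset from the first center, and the omission of the anchor-priority mode are all hypothetical.

```python
def find_split(sse, seq, center_a, center_b, hard_cut=None):
    """Return (split_mode, residue_index) for two centers inside one SSE.

    sse: per-residue SSE letters, lower-case marks an outlier residue
    seq: one-letter amino acid sequence of the same length
    hard_cut: hypothetical fixed offset after center_a, used as a last resort
    """
    between = range(center_a + 1, center_b)
    # Criteria ordered from best (1) to worst (3); the first hit wins.
    criteria = [
        (1, lambda i: sse[i] == "C"),     # coiled residue
        (2, lambda i: sse[i].islower()),  # disordered (outlier) residue
        (3, lambda i: seq[i] in "PG"),    # proline or glycine
    ]
    for mode, matches in criteria:
        for i in between:
            if matches(i):
                return mode, i
    if hard_cut is not None:
        return 4, center_a + hard_cut     # hard cut at a fixed offset
    return 0, None                        # no split found


# A lower-case residue between the centers gives a mode-2 split.
print(find_split("HHHHHHhHHHHH", "AAAAAAAAAAAA", 2, 10))  # (2, 6)
# Without outliers, a proline is used instead (mode 3).
print(find_split("HHHHHHHHHHHH", "AAAAAPAAAAAA", 2, 10))  # (3, 5)
```

Because the criteria are tried in order, the returned mode number directly reflects how far down the hierarchy the search had to go.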

In [None]:
# Specify a Uniprot identifier here.
uniprot = "Q8IZF6"

for i, gain in enumerate(human_collection.collection):
    if uniprot not in gain.name: 
        continue
    file_prefix = f"../test_stal_indexing/f3_test{gain.name}" # a temp folder where calculations and outputs will be stored.
    print("_"*30, f"\n{i} {gain.name}")
    element_intervals, element_centers, residue_labels, unindexed_elements, params = gaingrn.scripts.assign.assign_indexing(gain, 
                                                                                                file_prefix=file_prefix, 
                                                                                                gain_pdb=gaingrn.scripts.io.find_pdb(gain.name, PDB_DIR), 
                                                                                                template_dir='../data/template_pdbs/',
                                                                                                template_json='../data/template_data.json',
                                                                                                outlier_cutoff=5.0,
                                                                                                gesamt_bin=GESAMT_BIN,
                                                                                                debug=False, # If you want ALL that is happening
                                                                                                create_pdb=False,
                                                                                                hard_cut={"S2":7,"S7":3,"H5":3},
                                                                                                patch_gps=True
                                                                                                )
    # This dictionary denotes the priority order of splitting. Only the lowest-hierarchy split is indicated, i.e. the higher the number, the "worse" the split.
    split_modes = {
        0:"No Split.",
        1:"Split by coiled residue.",
        2:"Split by disordered residue.",
        3:"Split by Proline/Glycine",
        4:"Split by hard cut.",
        5:"Overwrite by anchor priority."
    }
    print(gain.name, gain.subdomain_boundary)
    if params["split_mode"] > 0:
        print(params["split_mode"], split_modes[params["split_mode"]])
    #print(element_intervals, element_centers, residue_labels, unindexed_elements, sep="\n")
    print(unindexed_elements, sep="\n")

#### Here, the full __StAlIndexing__ may be constructed test-wise, or by default the pickle of the Indexing is loaded. 
Keep in mind that within this Jupyter notebook, due to its handling of multiprocessing.Pool, the number of threads is limited to 1, so constructing the indexing for the full set takes a while.
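
Since the single-threaded construction is slow, one common pattern is to build an object only when no pickle exists yet. The sketch below shows this with the standard library; `load_or_build` and its arguments are hypothetical helpers and not part of the gaingrn package.

```python
import os
import pickle


def load_or_build(pickle_path, build):
    """Load a pickled object if the file exists, otherwise build and save it.

    build: zero-argument callable that performs the expensive construction.
    """
    if os.path.isfile(pickle_path):
        with open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    result = build()
    with open(pickle_path, "wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

In this notebook, `build` could be a small function wrapping the `StAlIndexing(...)` call, so that re-running the cell reuses the saved pickle. Only load pickles from sources you trust, since unpickling can execute arbitrary code.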

In [None]:
accessions = [gain.name.split("-")[0].split("_")[0] for gain in valid_collection.collection]
sequences = ["".join(gain.sequence) for gain in valid_collection.collection]

fasta_offsets = gaingrn.scripts.alignment_utils.find_offsets("../data/seq_aln/all_query_sequences.fasta",
                                 accessions, 
                                 sequences)

# Pseudocenter cases: cases, where the segment center is NOT part of the segment in question. Therefore, an alternative indexing method is applied,
# using a "pseudocenter", a residue which matches the segment, but is not the .50 residue.
ps_file = "../data/pseudocenters.csv"
with open(ps_file, "w") as ps_out:
    ps_out.write("GAIN,res,elem\n")  # write the CSV header and close the file

# Careful when running, this takes a lot of time to calculate on single-thread. use run_indexing.py for fast multithreaded calculation.
stal_indexing = StAlIndexing(valid_collection.collection, 
                             prefix="../test_stal_indexing/test20", 
                             pdb_dir=f'{PDB_DIR}/',
                             template_json='../data/template_data.json',
                             gesamt_bin=GESAMT_BIN, 
                             template_dir='../data/template_pdbs/', 
                             fasta_offsets=fasta_offsets,
                             n_threads=1,
                             #pseudocenters=ps_file,
                             debug=False)
#with open("../data/stal_indexing.pkl","wb") as save:
#    pickle.dump(stal_indexing, save)

#### Make a file containing all segment starts, ends and center residues for each GAIN domain. This is the basis for the GPCRdb implementation.

In [None]:
# Load the pre-calculated STAL indexing, which saves some time.
stal_indexing = pd.read_pickle("../data/stal_indexing.pkl")

header, matrix = stal_indexing.construct_data_matrix(unique_sse=False)
stal_indexing.data2csv(header, matrix, "../data/gaingrn_indexing.csv")
header, matrix = stal_indexing.construct_data_matrix(unique_sse=True)
stal_indexing.data2csv(header, matrix, "../data/gaingrn_indexing.unique.csv")

#### Here, we construct the indexing for the human set with the modified STRIDE files.

In [None]:
human_collection = pd.read_pickle("../data/human_collection.pkl")

human_accessions = [gain.name.split("-")[0].split("_")[0] for gain in human_collection.collection]
human_sequences = ["".join(gain.sequence) for gain in human_collection.collection]

human_fasta_offsets = gaingrn.scripts.alignment_utils.find_offsets("../data/seq_aln/all_query_sequences.fasta", 
                                 human_accessions, 
                                 human_sequences)

stal_human_indexing  = StAlIndexing(human_collection.collection, 
                             prefix="../../test_stal_indexing/test", 
                             pdb_dir=f'{PDB_DIR}/',  
                             template_dir='../data/template_pdbs/', 
                             template_json = '../data/template_data.json',
                             outlier_cutoff=5.0,
                             fasta_offsets=human_fasta_offsets,
                             gesamt_bin=GESAMT_BIN,
                             n_threads=1,
                             debug=False)

with open("../data/human_indexing.pkl", "wb") as humanfile:
    pickle.dump(stal_human_indexing, humanfile, -1)

header, matrix = stal_human_indexing.construct_data_matrix(overwrite_gps=True, unique_sse=False)
stal_human_indexing.data2csv(header, matrix, "../data/human_indexing.csv")

# Also include unique Helices in a separate file.
header, matrix = stal_human_indexing.construct_data_matrix(overwrite_gps=True, unique_sse=True)
stal_human_indexing.data2csv(header, matrix, "../data/human_indexing.unique.csv")

#### The Indexing for the whole set is best constructed via multithreaded execution by _stal\_indexing.py_ and saved in a pickle.