### Research question 5: Are transmembrane domains enriched in the SPP vs. NAT peptides?

#### Transmembrane domains

* Transmembrane domains are the portions of a protein which are embedded within a cell membrane or an organelle membrane
* In the case of SPP, the membrane in question is that of the rough endoplasmic reticulum (ER), which allows for cleaving of signal peptides
* As with the previous questions, the intent is to compare SPP against NAT in terms of the individual peptide sequences which cross the membranes, to determine which (if any) are more prevalent in SPP

#### Overall method of solution
The provided data set contains peptide sequences, which are submitted to an external tool (DeepTMHMM, run through BioLib) for analysis. The results are then recorded and compared.

#### Purpose of this code segment
To generate FASTA files from the source .xlsx and send the data off for TMHMM processing while recording the results.

#### Method of solution

* Import pybiolib
  * A Python client for BioLib, used here to run DeepTMHMM analysis
  * Returns predictions of where each part of the peptide sequence lies in reference to a membrane
* Import lab data from .xlsx
* Create files fitting the FASTA format in preparation for processing
  * FASTA format
    * Plain-text file
    * Description line (starting with `>`) giving documentation/general identification on the first line
    * Peptide sequence on the following line to be read
* Send each individual file for TMHMM processing
  * The full data set of 20,000 records requires about 100 hours of processing time
* Record results in a local directory "transmembrane_local_directory"
  * Each data collection is recorded in a subdirectory noting the record ranges
  * Each record is in its own subdirectory corresponding to the record number
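
The FASTA layout described above can be sketched as a small helper. This is a minimal illustration rather than the notebook's own `make_file`: the function name `format_fasta_record`, the `width` parameter, and the sample description and sequence are all hypothetical placeholders.

```python
def format_fasta_record(description, sequence, width=60):
    """Return one FASTA record: a '>' description line, then the sequence."""
    # A FASTA description must fit on one line, so collapse embedded whitespace
    header = ">" + " ".join(str(description).split())
    # Strip whitespace from the sequence and normalise it to upper case
    seq = "".join(str(sequence).split()).upper()
    # Wrap long sequences at a fixed width (a common convention, not a requirement)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder values, not taken from the lab data
record = format_fasta_record("Type: SPP ||| P12345 ||| 0.42", "mkt ayiakq")
print(record, end="")
```

Ending the record with a newline keeps the file well-formed if several records are later concatenated into one FASTA file.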
 
NOTE: Since the data set is too large to process all at once, data analysis must be done several hundred records at a time. The provided code includes globals which can be changed to vary which records to process.
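
As a rough sketch of how the record ranges could be generated rather than edited by hand, the helper below yields `(start_index, end_index)` pairs with an exclusive end, matching how `DataFrame.iloc` slices. The names `batch_ranges` and `results_dir` and the batch size of 800 are assumptions for illustration; only the directory naming pattern is taken from the code in this notebook.

```python
def batch_ranges(total_records, batch_size):
    """Yield (start_index, end_index) pairs covering all records; end is exclusive."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_records, batch_size):
        yield start, min(start + batch_size, total_records)


def results_dir(start_index, end_index, root="transmembrane_local_directory"):
    """Build the results path using the same naming pattern as the notebook."""
    return f"{root}/transmembrane_results_local_{start_index}_{end_index - 1}"


# Hypothetical example: 2000 records in batches of 800
batches = list(batch_ranges(2000, 800))
print(batches)
print(results_dir(1200, 2000))
```

Each pair can be assigned to the `start_index` and `end_index` globals before running the processing cell.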


In [1]:
!pip3 install -qU pybiolib
import biolib
import pandas as pd
import os
import re
deeptmhmm = biolib.load('DTU/DeepTMHMM')
biolib.login()


2023-06-06 17:52:59,023 | INFO : Loaded project DTU/DeepTMHMM:1.0.24
2023-06-06 17:52:59,025 | INFO : Already signed in


In [27]:
import re
import os
deeptmhmm = biolib.load('DTU/DeepTMHMM')
biolib.utils.STREAM_STDOUT = True
accession_only = True

# Modify this as needed to collect new data sets from TMHMM
start_index = 1200
end_index = 2000

def make_file(file_name, pep_seq):
    with open(file_name, "w") as file:
        sample_category = pep_seq['Biological sample category']
        protein_name = pep_seq['Protein name']
        normalized_percent = pep_seq['Normalized protein percentage']
        peptide_sequence = pep_seq['Peptide sequence']
        accession_number = pep_seq['Protein accession number']
        
        pattern = r"OX=\d+"
        match = re.search(pattern, protein_name)

        pattern2 = r"GN=\w+"
        match2 = re.search(pattern2, protein_name)

        if not accession_only and match and match2:
            ox_value = match.group()
            gn_value = match2.group()
            oxgn_value = str(ox_value) + " " + str(gn_value)
        else:
            oxgn_value = str(accession_number)            
        
        header = ">Type: " + str(sample_category) + " ||| " + oxgn_value + " ||| " + str(normalized_percent) + "\n"
        file.write(header)
        # Cast to str and end with a newline so the record is well-formed FASTA
        file.write(str(peptide_sequence) + "\n")


# Only a certain number of records can be analyzed at a given time
data = pd.read_excel('output.xlsx', header=0)

data = data.iloc[start_index:end_index]

directory = 'transmembrane_local_directory/transmembrane_results_local_'+str(start_index)+'_'+str(end_index-1)
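# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs(directory, exist_ok=True)
# make_file() writes into fasta_files/, so make sure that directory exists
os.makedirs('fasta_files', exist_ok=True)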


fasta_list = []

# Generate fasta format files for deepTMHMM processing
for index, peptide in data.iterrows():
    
    make_file("fasta_files/query"+str(index)+".fasta", peptide)
    fasta_list.append("fasta_files/query"+str(index)+".fasta")

print("List writing complete")

# Run DeepTMHMM on each FASTA file (machine='local' uses a local compute node) and save the results
index = start_index
with biolib.Experiment('TMHMM_analysis'):
    for fasta_file in fasta_list:
        result = deeptmhmm.cli(args='--fasta '+fasta_file, machine='local')
        result.save_files(directory+'/peptide_row'+str(index))
        index+=1

print("done!")



2023-06-07 14:05:22,747 | INFO : Loaded project DTU/DeepTMHMM:1.0.24
List writing complete
2023-06-07 14:05:27,687 | INFO : Job "272f9499-583f-45c1-947c-00016c36727f" is starting...
2023-06-07 14:05:27,993 | INFO : Started compute node
2023-06-07 14:05:28,702 | INFO : Compute Node: Initializing
2023-06-07 14:05:29,145 | INFO : Job "272f9499-583f-45c1-947c-00016c36727f" running...
2023-06-07 14:05:30,205 | INFO : Compute Node: Pulling images...
2023-06-07 14:05:30,205 | INFO : Compute Node: Computing...
Running DeepTMHMM on 1 sequence...
Step 1/4 | Loading transformer model...

Step 2/4 | Generating embeddings for sequences...
Generating embeddings: 100% 1/1 [00:00<00:00,  3.70seq/s]

Step 3/4 | Predicting topologies for sequences in batches of 1...
Topology prediction: 100% 1/1 [00:00<00:00,  8.32seq/s]

Step 4/4 | Generating output...
2023-06-07 14:05:39,227 | INFO : Compute Node: Computation finished
2023-06-07 14:05:39,228 | INFO : Compute Node: Result Ready
2023-06-07 14:05:39,825 

[2023-06-07 15:42:25 -0700] [28918] [ERROR] Connection in use: ('127.0.0.1', 43674)
[2023-06-07 15:42:25 -0700] [28918] [ERROR] Retrying in 1 second.
[2023-06-07 15:42:26 -0700] [28918] [ERROR] Connection in use: ('127.0.0.1', 43674)
[2023-06-07 15:42:26 -0700] [28918] [ERROR] Retrying in 1 second.
[2023-06-07 15:42:27 -0700] [28918] [ERROR] Connection in use: ('127.0.0.1', 43674)
[2023-06-07 15:42:27 -0700] [28918] [ERROR] Retrying in 1 second.
[2023-06-07 15:42:28 -0700] [28918] [ERROR] Connection in use: ('127.0.0.1', 43674)
[2023-06-07 15:42:28 -0700] [28918] [ERROR] Retrying in 1 second.
[2023-06-07 15:42:29 -0700] [28918] [ERROR] Connection in use: ('127.0.0.1', 43674)
[2023-06-07 15:42:29 -0700] [28918] [ERROR] Retrying in 1 second.
[2023-06-07 15:42:30 -0700] [28918] [ERROR] Can't connect to ('127.0.0.1', 43674)


2023-06-07 15:42:45,054 | ERROR : Compute failed with: Could not connect to local compute node


BioLibError: Could not connect to local compute node
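
When a run stops partway through with an error like the one above, it may help to work out which records still need processing before restarting. The sketch below assumes that a completed record leaves a non-empty `peptide_row<N>` subdirectory, following the naming used in the processing cell. The helper name `pending_rows` and the placeholder file written in the demonstration are hypothetical, and the check does not inspect DeepTMHMM's actual output files.

```python
import os
import tempfile


def pending_rows(directory, start_index, end_index):
    """Return row numbers in [start_index, end_index) with no saved result yet."""
    missing = []
    for row in range(start_index, end_index):
        row_dir = os.path.join(directory, "peptide_row" + str(row))
        # Treat a missing or empty subdirectory as "not yet processed"
        if not os.path.isdir(row_dir) or not os.listdir(row_dir):
            missing.append(row)
    return missing


# Demonstration on a throwaway directory with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    done = os.path.join(tmp, "peptide_row1200")
    os.makedirs(done)
    with open(os.path.join(done, "placeholder.txt"), "w") as fh:
        fh.write("placeholder output\n")
    os.makedirs(os.path.join(tmp, "peptide_row1201"))  # empty, so still pending
    remaining = pending_rows(tmp, 1200, 1203)

print(remaining)
```

The returned row numbers could then be used to rebuild `fasta_list` so that only unfinished records are resubmitted.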