## Table of Contents
- [Research question 5 continued: Are transmembrane domains enriched in the SPP vs. NAT peptides?](#section-one)
- [Code Segment 2 for creation of a new file](#section-two)
- [Return to Project Table of Contents](project_overview.ipynb)

<a id="section-one"></a>
### Research question 5 continued: Are transmembrane domains enriched in the SPP vs. NAT peptides?

#### Purpose of this file

Once the full data set has been processed through the deepTMHMM computations (transmembrane-domain-comparison), the following segments compile the individual data collections into one directory so that they can be processed and inserted into the data .xlsx file.

#### Method of solution
* Locate the .3line files in the processed data
* Compile the files into one common directory with corresponding identification subdirectories
* Remove Jupyter duplicates
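The steps above can be previewed before any file is copied. The following is a minimal sketch, not the notebook's own method: `plan_copies` and `CHECKPOINT_DIR` are hypothetical names introduced here, and the sketch assumes that every Jupyter duplicate lives somewhere under a `.ipynb_checkpoints` directory. It only lists the planned source and destination pairs.

```python
from pathlib import Path

# Assumption: Jupyter duplicates always sit under a directory with this name
CHECKPOINT_DIR = ".ipynb_checkpoints"


def plan_copies(source_root, dest_root, suffix=".3line"):
    """Return sorted (source, destination) pairs without copying anything."""
    pairs = []
    for src in Path(source_root).rglob(f"*{suffix}"):
        if not src.is_file():
            continue
        # Skip anything stored at any depth below a checkpoint directory
        if CHECKPOINT_DIR in src.parts:
            continue
        # The immediate parent directory name serves as the identification subdirectory
        dst = Path(dest_root) / src.parent.name / src.name
        pairs.append((src, dst))
    return sorted(pairs)
```

Calling `plan_copies('transmembrane_local_directory', 'transmembrane_local_compiled')` and printing the result gives a dry run of the compilation. Unlike Code Segment 1, which only checks the immediate parent directory, this sketch rejects a checkpoint directory anywhere in the path.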

#### Code Segment 1

Run this segment to collect the individual data runs in 'transmembrane_local_directory', skim off unnecessary files, and deposit the results in 'transmembrane_local_compiled'.

[Return to Table of Contents](#Table-of-Contents)

In [9]:
import re
import os
import shutil

source_directory = 'transmembrane_local_directory'
destination_directory = 'transmembrane_local_compiled'

for root, dirs, files in os.walk(source_directory):
    for file in files:        
        if file.endswith('.3line'):
            # Name of the immediate parent directory (independent of the OS path separator)
            destination_sub = os.path.basename(root)
            
            if destination_sub != ".ipynb_checkpoints":
                origin = os.path.join(root, file)
                final_sub = os.path.join(destination_directory, destination_sub)
                final_path = os.path.join(final_sub, file)

                # Create the identification subdirectory if it does not exist yet
                os.makedirs(final_sub, exist_ok=True)

                shutil.copy(origin, final_path)
                
            
            


![3line](images/compiled_files.png)

![3line](images/gSpreadsheet.png)

<a id="section-two"></a>
### Code Segment 2

After compiling the peptide information in Code Segment 1, the following can be used to take in the source Excel file, make a copy, and add two additional columns representing the new information.

[Return to Table of Contents](#Table-of-Contents)
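As a reference for the file layout this segment relies on, here is a minimal sketch of parsing a single record from a string. It assumes the three lines are a header, the amino acid sequence, and the per-residue topology string, and that the prediction label is the last pipe-separated field of the header. `parse_3line` and the example record are hypothetical and made up for illustration.

```python
def parse_3line(text):
    """Split one three-line record into (identifier, label, sequence, topology)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 3:
        raise ValueError("expected header, sequence, and topology lines")
    header, sequence, topology = lines[:3]

    # Assumption: the label (TM, GLOB, SP, ...) is the last pipe-separated field
    identifier, sep, label = header.lstrip(">").rpartition("|")
    if not sep:
        # No pipe in the header: mirror the "ERR" fallback used in the notebook
        identifier, label = header.lstrip(">"), "ERR"

    # Each residue should carry exactly one topology character
    if len(sequence) != len(topology):
        raise ValueError("sequence and topology lengths differ")
    return identifier.strip(), label.strip(), sequence, topology


# Hypothetical record, not taken from the project data
example = ">peptide_row12 | TM\nMKTLLVAG\nIIMMMMOO\n"
identifier, label, sequence, topology = parse_3line(example)
```

Taking the last field differs from the regular expression in the cell below, which captures the text after the first pipe. The two agree when the header contains a single pipe.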

In [13]:
import re
import os
import pandas as pd

# Function to read the contents of a TMHMM 3line file
# Returns a tuple: (prediction tag such as TM/GLOB/SP, topology string made of I, O, M, S labels)
def read_text_file(file_path):
    with open(file_path, 'r') as file:
        
        lines = file.read().splitlines()
        result = re.search(r'\|\s*(\S+)', lines[0])
        if result:
            tag = result.group(1)
        else:
            tag = "ERR"
     
        a_acid_category = lines[2]
        
        return tag, a_acid_category
        
# Directory containing the text files
directory = 'transmembrane_local_compiled'

# Path to the existing Excel file
excel_file_path_source = 'output.xlsx'
excel_file_path_destination = 'post_transmembrane.xlsx'

# Create a list to store the contents of all text files
extracted_info = []

# Loop through each text file in the directory
for root, dirs, files in os.walk(directory):
    for file in files:
        
        if file.endswith('.3line'):
            
            file_path = os.path.join(root, file)

            # Skip Jupyter checkpoint duplicates
            if re.search(r'checkpoints', file_path):
                continue

            tag, amino_acid_category = read_text_file(file_path)

            # Row index comes from the "_row<N>" part of the path; paths without it fall back to row 0
            match = re.search(r"_row(\d+)", file_path)
            row_num = int(match.group(1)) if match else 0

            # Record layout: [row_num, TM/GLOB/SP/etc., I/M/O/etc.]
            extracted_info.append([row_num, tag, amino_acid_category])

# Read the existing Excel file into a DataFrame
new_headers = ["Tag", "Transmembrane Tag"]               
df = pd.read_excel(excel_file_path_source)

# Place each extracted record at its row index; rows without a .3line result keep the "not recorded" placeholder
tags_column = ["not recorded"] * len(df)
transmembrane_column = ["not recorded"] * len(df)

for record in extracted_info:
    tags_column[int(record[0])] = record[1]
    transmembrane_column[int(record[0])] = record[2]

df[new_headers[0]] = tags_column
df[new_headers[1]] = transmembrane_column

# Save the updated DataFrame to an Excel file
df.to_excel(excel_file_path_destination, index=False)
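The loop that fills `tags_column` and `transmembrane_column` indexes the lists directly, so a file whose row number is not smaller than `len(df)` raises an `IndexError`. A guarded variant could look like the sketch below. `fill_by_row` is a hypothetical helper, and the sketch assumes, as the cell's indexing implies, that row numbers are zero-based positions in the DataFrame.

```python
def fill_by_row(records, n_rows, placeholder="not recorded"):
    """Build one column of length n_rows from (row, value) pairs.

    Returns (column, skipped), where skipped holds the pairs whose row
    falls outside the range 0 .. n_rows - 1.
    """
    column = [placeholder] * n_rows
    skipped = []
    for row, value in records:
        row = int(row)  # accept row numbers stored as strings
        if 0 <= row < n_rows:
            column[row] = value
        else:
            skipped.append((row, value))
    return column, skipped


# Illustrative input only: row 7 does not exist in a three-row table
column, skipped = fill_by_row([(0, "TM"), ("2", "GLOB"), (7, "SP")], n_rows=3)
```

Inspecting `skipped` after a run shows which files did not map to a spreadsheet row, instead of stopping at the first one.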