This notebook covers HMM data preprocessing. To run it, uncomment all the cells and press the 'Run all' button.

In [1]:
# Set the path and input parameters
import os
directory = os.getcwd() # the main directory of the project

# The credentials for the remote cluster
name = 'alina'
server = 'ecate'

In [2]:
# Importing the libraries
from functions import *

## HMM: Data preparation
In this part we build HMMs from the MSAs, retrieve the `hmmsearch` results, and then match them against Pfam. We start by loading the `disordered` dataframe and counting the proteins and disordered regions it contains.

In [3]:
# Set the maximum width of the columns
pd.set_option('display.max_colwidth', 20)

In [4]:
# Load the disordered and curated_disprot dataframes
disordered_df = pd.read_csv('{}/disordered_df.csv'.format(directory))
curated_disprot_df = pd.read_csv('{}/curated_disprot.csv'.format(directory))

print('The number of rows with the disordered regions: {}'.format(len(disordered_df)))

The number of rows with the disordered regions: 7393


In [5]:
# Collect the disordered regions in a set (duplicates are dropped automatically)
dis_regs = set()

for i, row in disordered_df.iterrows():
    dis_id = row.iloc[0]  # positional access to the first column
    matching_row = curated_disprot_df[curated_disprot_df['acc'] == dis_id]
    if not matching_row.empty:
        region = matching_row['region']
        dis_regs.update(region)

# Convert the set to a list once, after the loop, so it is always defined
dis_regs_list = list(dis_regs)
        
# Define an array of disordered regions ids
disprot_ids = disordered_df['query_id'].unique()

print('The number of proteins with disordered regions: {}'.format(len(disprot_ids)))
print('The number of disordered regions in the proteins: {}'.format(len(dis_regs_list)))

The number of proteins with disordered regions: 39
The number of disordered regions in the proteins: 53


The results above show how many proteins with disordered regions the provided data contains and how many disordered regions those proteins have.

## 1. hmmbuild
We build an HMM for each disordered region, using the trimmed MSA as input.

In [6]:
# Set the paths to the MSA (input) and HMM (output) files - ClustalOmega
msa_dir_clustal = '{}/results/alignments/output_files/disordered/clustal'.format(directory)
hmmbuild_dir_clustal = '{}/results/hmms/hmmbuild/clustal'.format(directory)

# Create directories if they don't exist
create_directory(msa_dir_clustal)
create_directory(hmmbuild_dir_clustal)

Directory /Users/alina/HMM/results/alignments/output_files/disordered/clustal already exists.
Directory /Users/alina/HMM/results/hmms/hmmbuild/clustal already exists.


In [8]:
# # Build HMM for ClustalOmega
# for filename in os.listdir(msa_dir_clustal):
#     if filename.endswith('.fasta'):
#         input_file = os.path.join(msa_dir_clustal, filename)
#         output_file = os.path.join(hmmbuild_dir_clustal, os.path.splitext(filename)[0] + '.hmm')

#         subprocess.run(['hmmbuild', output_file, input_file])
#         print('hmmbuild completed for {}'.format(filename))

## 2. hmmsearch

After building the models, we assess whether the profiles have matches in the sequence database (Reference Proteome 75%). We retrieve the most significant sequences, using the default E-value threshold of 0.01.

As before, we do this separately for BLAST and ClustalOmega.
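
To make the threshold concrete, here is a minimal sketch of how one `hmmsearch` call could be assembled. It assumes the 0.01 threshold mentioned above is HMMER's sequence inclusion threshold (`--incE`). The helper name `build_hmmsearch_command` and all file names are hypothetical, and the actual `start_array_jobs.sh` script may be organised differently. The command is only built and printed, not executed.

```python
import shlex

def build_hmmsearch_command(hmm_file, seq_db, out_file, inc_evalue=0.01):
    """Return the argument list for one hmmsearch run (nothing is executed)."""
    return [
        'hmmsearch',
        '--incE', str(inc_evalue),  # inclusion E-value threshold for sequences
        '-o', out_file,             # write the human-readable report to this file
        hmm_file,                   # profile built by hmmbuild
        seq_db,                     # target sequence database
    ]

# Hypothetical file names, used only for illustration
cmd = build_hmmsearch_command('O43791_169-178.hmm',
                              'rp75.fasta',
                              'hmmsearch_clustal_O43791_169-178.txt')
print(shlex.join(cmd))
```

Keeping the command as a list means it can be passed directly to `subprocess.run` without shell quoting issues.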

In [10]:
# # Copy the HMMs from the ClustalOmega folder
# for filename in os.listdir(hmmbuild_dir_clustal):
#     source_path = os.path.join(hmmbuild_dir_clustal, filename)
#     destination_path = '{}@{}:~/hmm/clustal/hmmbuild/{}'.format(name, server, filename)
#     subprocess.run(['scp', source_path, destination_path])
#     print('File {} copied to {}'.format(filename, destination_path))

After copying the files, we run the `hmmsearch` command.
This is done remotely with `sbatch start_array_jobs.sh <ali_type>`,
where `<ali_type>` is either `blast` or `clustal`.

In [13]:
# Construct the source and the target directories for ClustalOmega
clustal_in_dir_rem = '/home/{}/hmm/clustal/hmmbuild'.format(name)
clustal_out_dir_rem = '/home/{}/hmm/clustal/hmmsearch'.format(name)

# # List files in the source directory on the remote machine
# ls_command_clustal = 'ssh {}@{} "ls {}"'.format(name, server, clustal_in_dir_rem)
# file_list_clustal = get_ipython().getoutput(ls_command_clustal)

# Then run sbatch start_array_jobs.sh clustal on the remote computer

In [14]:
# # Copy ClustalOmega results to the local folder
# hmmsearch_dir_clustal = '{}/results/hmms/hmmsearch/clustal'.format(directory)

# for filename in file_list_clustal:
#     filename = filename.strip()
#     source_path = '{}/{}'.format(clustal_in_dir_rem, filename)
#     dest_filename = os.path.splitext(filename)[0]
    
#     # Copy files
#     !scp {name}@{server}:~/hmm/clustal/hmmsearch/hmmsearch_clustal_{dest_filename}.txt {hmmsearch_dir_clustal}

In [16]:
# Preprocess hmmsearch results - ClustalOmega
hmmsearch_dir_clustal = '{}/results/hmms/hmmsearch/clustal'.format(directory)
hmmsearch_dir_clustal_prepr_output = '{}/results/hmms/hmmsearch/clustal/hmmsearch_df'.format(directory)
create_directory(hmmsearch_dir_clustal)
create_directory(hmmsearch_dir_clustal_prepr_output)

clustal_results_list = []

for filename in os.listdir(hmmsearch_dir_clustal):
    if filename.endswith('.txt'):
        file_path = os.path.join(hmmsearch_dir_clustal, filename)
        parts = filename.split('_')
        protein_id = '{}_{}'.format(parts[2], parts[3].split(".")[0])
        try:
            clustal_stat = process_hmmsearch_file(file_path) 
            clustal_reg = extract_table_from_output(file_path)
            clustal_result = pd.merge(clustal_stat, clustal_reg, left_on='Sequence', right_on='id', how='inner')
            clustal_result = clustal_result.drop(columns=['Description', 'id'])
            
            # Save individual DataFrames to a list
            clustal_results_list.append(clustal_result)
        
            # Save the DataFrame to a CSV file
            output_file = os.path.join(hmmsearch_dir_clustal_prepr_output, 'hmmsearch_clustal_df_{}.csv'.format(protein_id))
            clustal_result.to_csv(output_file, index=False)
            
        except ValueError as e:
            print('Error processing file {}: {}'.format(filename, e))
            
# Merge all DataFrames into one
hmmsearch_results_clustal = pd.concat(clustal_results_list, ignore_index=True)

# Save the merged DataFrame to a CSV file
merged_output_file = os.path.join('{}/results/hmms/hmmsearch/'.format(directory), 'hmmsearch_results_clustal.csv')
hmmsearch_results_clustal.to_csv(merged_output_file, index=False)
print('The number of retrieved hmmsearch results for ClustalOmega: {}'.format(len(hmmsearch_results_clustal)))
hmmsearch_results_clustal[:10]

Directory /Users/alina/HMM/results/hmms/hmmsearch/clustal already exists.
Directory /Users/alina/HMM/results/hmms/hmmsearch/clustal/hmmsearch_df already exists.
Error processing file hmmsearch_clustal_J8TM36_236-249.txt: '  ------ inclusion threshold ------\n' is not in list
Error processing file hmmsearch_clustal_O15922_315-324.txt: '  ------ inclusion threshold ------\n' is not in list
Error processing file hmmsearch_clustal_Q5T4W7_108-120.txt: '  ------ inclusion threshold ------\n' is not in list
The number of retrieved hmmsearch results for ClustalOmega: 211312


Unnamed: 0,E-value,score,bias,E-value.1,score.1,bias.1,exp,N,Sequence,hmm_from,hmm_to,hmm_length,ali_from,ali_to,ali_length,env_from,env_to,env_length
0,3.8e-32,122.0,13.1,4.4e-32,121.8,13.1,1.0,1,A0A8C7BFT3,1,61,61,9,69,61,9,70,62
1,2.4e-31,119.5,13.6,9.9e-31,117.5,13.6,2.2,1,A0A3Q7NSQ2,1,62,62,352,413,62,352,413,62
2,3.0000000000000003e-31,119.2,13.6,1.4000000000000001e-30,117.0,13.6,2.3,1,A0A2U3ZCX9,1,62,62,451,512,62,451,512,62
3,3.0000000000000003e-31,119.1,13.6,1.4000000000000001e-30,117.0,13.6,2.2,1,A0A2U3VQT6,1,62,62,454,515,62,454,515,62
4,3.1e-31,119.1,13.6,1.4000000000000001e-30,117.0,13.6,2.2,1,A0A3Q7P1S3,1,62,62,451,512,62,451,512,62
5,3.3e-31,119.0,13.6,1.4000000000000001e-30,117.0,13.6,2.2,1,A0A384CTJ6,1,62,62,452,513,62,452,513,62
6,3.3e-31,119.0,13.6,1.4000000000000001e-30,117.0,13.6,2.2,1,A0A452RB54,1,62,62,452,513,62,452,513,62
7,3.3e-31,119.0,13.6,1.4000000000000001e-30,117.0,13.6,2.2,1,A0A7N5P5S9,1,62,62,451,512,62,451,512,62
8,7.199999999999999e-31,117.9,13.1,3.3e-30,115.8,13.1,2.2,1,A0A2Y9JGP3,1,61,61,450,510,61,450,511,62
9,1.2000000000000001e-30,117.2,14.1,5.2000000000000004e-30,115.2,14.1,2.2,1,G1RG64,1,62,62,446,507,62,446,507,62


We can see that 3 protein regions do not have HMM matches in the RP 75% database: **O15922_315-324, J8TM36_236-249, Q5T4W7_108-120**.

A possible reason is the size of the regions: all of them are short, about 10-14 amino acids. Other short disordered regions, such as **O43791_169-178** (10 amino acids long), have results reported only below the inclusion threshold.
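
The error text in the output above is the message `list.index` produces when the marker line is missing from a report. The sketch below illustrates that behaviour with a tolerant lookup. It is not the implementation of `process_hmmsearch_file`; the helper name `find_inclusion_threshold` and the toy report lines are hypothetical.

```python
# Marker line exactly as it appears in the error messages above
MARKER = '  ------ inclusion threshold ------\n'

def find_inclusion_threshold(lines):
    """Return the index of the marker line, or None if the report has no marker."""
    try:
        return lines.index(MARKER)
    except ValueError:
        # list.index raises ValueError("... is not in list") when the marker is absent
        return None

# Toy reports: the first contains the marker, the second does not
report_with_marker = ['hit line A\n', MARKER, 'hit line B\n']
report_without_marker = ['hit line A\n']

print(find_inclusion_threshold(report_with_marker))
print(find_inclusion_threshold(report_without_marker))
```

Returning `None` lets the caller decide how to treat a report without the marker instead of failing on it.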

For each region we are interested in three pairs of fields:
- `hmm_from`-`hmm_to`: the endpoints of the match on the HMM profile. In our case the start is usually 1, because the input is not the initial alignment but the separate disordered regions, whose positions mostly start from 1 (when there are no gaps in the subject sequences).
- `ali_from`-`ali_to`: the endpoints of the alignment on the target sequence. They are obtained by matching the HMM to the sequence.
- `env_from`-`env_to`: the envelope of the domain's location on the target sequence. It is usually a bit wider than the alignment.
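
As a small illustration of these fields, the `*_length` columns in the table above are consistent with closed, 1-based intervals, so each length is `to - from + 1`. The sketch below checks this on the first row of the table. The helper name `span_length` is hypothetical.

```python
def span_length(start, end):
    """Length of a closed, 1-based interval [start, end]."""
    return end - start + 1

# Endpoints taken from the first row of the hmmsearch table above (A0A8C7BFT3)
hit = {'hmm_from': 1, 'hmm_to': 61,
       'ali_from': 9, 'ali_to': 69,
       'env_from': 9, 'env_to': 70}

lengths = {name: span_length(hit[name + '_from'], hit[name + '_to'])
           for name in ('hmm', 'ali', 'env')}
print(lengths)  # {'hmm': 61, 'ali': 61, 'env': 62}
```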

## 3. Pfam
In this part we take the proteins from the dataframes with the `hmmsearch` results and look them up in the Protein2ipr database to find which HMMs already exist in Pfam.

In [17]:
# # Copy the files with the hmmsearch statistics to the remote computer
# !scp {directory}/results/hmms/hmmsearch/hmmsearch_results_clustal.csv {name}@{server}:~/stats/hmmsearch_results_clustal.csv

The file `filtered.tsv.gz` is a shorter archive derived from the `protein2ipr` database. **It contains only the domains from InterPro**. We check the `hmmsearch` results for overlaps with these domains.
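
The lookup can be pictured with the following sketch, which streams a gzipped TSV and keeps only Pfam rows for a set of accessions. It is not the `protein2ipr.py` script used here. The six-column layout (accession, InterPro ID, name, signature, start, end), the helper name `pfam_rows_for`, and the toy rows are assumptions made for illustration.

```python
import csv
import gzip
import io

def pfam_rows_for(accessions, gz_stream):
    """Yield (accession, ipr_id, pfam_id, start, end) for wanted accessions.

    Assumes six tab-separated columns per line:
    accession, InterPro ID, name, signature, start, end.
    """
    with gzip.open(gz_stream, mode='rt', newline='') as handle:
        for acc, ipr_id, _name, signature, start, end in csv.reader(handle, delimiter='\t'):
            # Keep only rows for requested proteins that carry a Pfam signature
            if acc in accessions and signature.startswith('PF'):
                yield acc, ipr_id, signature, int(start), int(end)

# Toy archive built in memory; names and the non-Pfam row are placeholders
toy_tsv = (
    'A0A010Q304\tIPR011766\tplaceholder name\tPF02775\t499\t646\n'
    'A0A010Q304\tIPR011766\tplaceholder name\tOTHER_SIG\t10\t20\n'
    'X00000\tIPR000000\tplaceholder name\tPF00000\t1\t5\n'
)
toy_gz = io.BytesIO(gzip.compress(toy_tsv.encode('utf-8')))

rows = list(pfam_rows_for({'A0A010Q304'}, toy_gz))
print(rows)
```

Streaming line by line avoids loading the whole archive into memory, which matters for a file of this size.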

In [18]:
# # Check the overlaps with Interpro domains (ClustalOmega)
# !ssh {name}@{server} '/home/alina/protein2ipr.py /home/alina/stats/clustal /home/alina/filtered.tsv.gz protein2ipr_clustal.tsv'

In [19]:
# # Copy the files to the local folder
# !scp {name}@{server}:~/protein2ipr_clustal.tsv {directory}/results/pfam

## 4. Results preprocessing

In [20]:
# Keep only entries with a Pfam ID whose regions intersect the curated_disprot instances - ClustalOmega
filename_clustal = '{}/results/pfam/protein2ipr_clustal.tsv'.format(directory)
pfam_clustal = pfam_processing(filename_clustal)

print('The number of retrieved Pfam instances for ClustalOmega: {}'.format(len(pfam_clustal)))
pfam_clustal.head()

The number of retrieved Pfam instances for ClustalOmega: 1073417


Unnamed: 0,uniprot_id,pfam_id,ipr_id,start_pfam,end_pfam,length_pfam
0,A0A010Q304,PF02775,IPR011766,499,646,148
1,A0A010Q304,PF00205,IPR012000,289,434,146
2,A0A010Q304,PF02776,IPR012001,91,205,115
3,A0A010Q7P7,PF00018,IPR001452,435,481,47
4,A0A010Q7P7,PF03114,IPR004148,113,226,114


In [22]:
# Merge hmmsearch results with Pfam - ClustalOmega
pfam_clustal = pd.merge(pfam_clustal, 
                        hmmsearch_results_clustal[['Sequence', 
                                                   'hmm_from', 'hmm_to', 'hmm_length',
                                                   'ali_from', 'ali_to', 'ali_length',
                                                   'env_from', 'env_to', 'env_length']], 
                        left_on='uniprot_id', right_on='Sequence', how='left')

pfam_clustal = pfam_clustal.dropna(axis=0)
pfam_clustal = pfam_clustal.drop(columns=['Sequence'])
print('The number of retrieved Pfam instances for ClustalOmega: {}'.format(len(pfam_clustal)))
pfam_clustal[:10]

The number of retrieved Pfam instances for ClustalOmega: 1110208


Unnamed: 0,uniprot_id,pfam_id,ipr_id,start_pfam,end_pfam,length_pfam,hmm_from,hmm_to,hmm_length,ali_from,ali_to,ali_length,env_from,env_to,env_length
0,A0A010Q304,PF02775,IPR011766,499,646,148,1,16,16,583,598,16,583,598,16
1,A0A010Q304,PF02775,IPR011766,499,646,148,1,40,40,650,689,40,650,690,41
2,A0A010Q304,PF00205,IPR012000,289,434,146,1,16,16,583,598,16,583,598,16
3,A0A010Q304,PF00205,IPR012000,289,434,146,1,40,40,650,689,40,650,690,41
4,A0A010Q304,PF02776,IPR012001,91,205,115,1,16,16,583,598,16,583,598,16
5,A0A010Q304,PF02776,IPR012001,91,205,115,1,40,40,650,689,40,650,690,41
6,A0A010Q7P7,PF00018,IPR001452,435,481,47,10,65,56,428,485,58,424,487,64
7,A0A010Q7P7,PF03114,IPR004148,113,226,114,10,65,56,428,485,58,424,487,64
8,A0A010QBJ0,PF00018,IPR001452,1080,1125,46,18,68,51,1082,1132,51,1073,1135,63
9,A0A010QBJ0,PF00063,IPR001609,44,702,659,18,68,51,1082,1132,51,1073,1135,63


Then we add the calculated overlaps to the dataframe using the function `pfam_hmm_overlap`.

- `overl_len`: the length of the region shared by the DisProt-HMM match and the Pfam-HMM match. We take the minimum of the two end positions and the maximum of the two start positions; their difference gives the overlap.
- `overl_pfam`: % of overlap relative to Pfam only. We divide `overl_len` by the length of the Pfam region.
- `overl_ali`: % of overlap relative to the HMM match only. We divide `overl_len` by the length of the HMM match.
- `non_overl_len`: the non-covered region, i.e. the rest of the Pfam and HMM regions.
- `overl_perc`: % of covered region. We divide `overl_len` by the whole length covered by both Pfam and HMM.
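
The definitions above can be sketched as follows. This is an illustration under stated assumptions, not the code of `pfam_hmm_overlap`: intervals are treated as closed and 1-based, the "whole length" is taken as the union of the two regions, and the helper name `interval_overlap` is hypothetical. `non_overl_len` and `overlap_sym` are left out because their exact formulas are not spelled out here.

```python
def interval_overlap(start_pfam, end_pfam, ali_from, ali_to):
    """Overlap statistics for two closed, 1-based intervals."""
    # Shared positions; clamped to 0 when the intervals are disjoint
    overl_len = max(0, min(end_pfam, ali_to) - max(start_pfam, ali_from) + 1)
    len_pfam = end_pfam - start_pfam + 1
    len_ali = ali_to - ali_from + 1
    union_len = len_pfam + len_ali - overl_len  # assumed meaning of "whole length"
    return {
        'overl_len': overl_len,
        'overl_pfam': round(100 * overl_len / len_pfam, 2),
        'overl_ali': round(100 * overl_len / len_ali, 2),
        'overl_perc': round(100 * overl_len / union_len, 2),
    }

# First two rows of the merged table: PF02775 (499-646) against two HMM matches
nested = interval_overlap(499, 646, 583, 598)
disjoint = interval_overlap(499, 646, 650, 689)
print(nested)    # {'overl_len': 16, 'overl_pfam': 10.81, 'overl_ali': 100.0, 'overl_perc': 10.81}
print(disjoint)  # {'overl_len': 0, 'overl_pfam': 0.0, 'overl_ali': 0.0, 'overl_perc': 0.0}
```

The first call covers a match fully contained in the Pfam region, the second a match that lies entirely after it.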

In [25]:
# Add the overlaps to the dataframe - ClustalOmega
overlap_pfam_hmm = []

for index_pfam, row_pfam in pfam_clustal.iterrows():
    overl_len, overl_pfam, overl_ali, non_overl_len, overl_perc, overlap_sym = pfam_hmm_overlap(row_pfam)
    overl_pfam = round(overl_pfam, 2)
    overl_ali = round(overl_ali, 2)
    overl_perc = round(overl_perc, 2)
    overlap_sym = round(overlap_sym, 2)
    overlap_pfam_hmm.append((overl_len, overl_pfam, overl_ali, non_overl_len, overl_perc, overlap_sym))

# Unpack the overlap tuples into separate dataframe columns
pfam_clustal['overl_len'], pfam_clustal['overl_pfam'], pfam_clustal['overl_ali'], pfam_clustal['non_overl_len'], pfam_clustal['overl_perc'], pfam_clustal['overlap_sym'] = zip(*overlap_pfam_hmm)
pfam_clustal.to_csv('{}/results/pfam/pfam_overlap/pfam_clustal.csv'.format(directory), index=False)
pfam_clustal.head()

Unnamed: 0,uniprot_id,pfam_id,ipr_id,start_pfam,end_pfam,length_pfam,hmm_from,hmm_to,hmm_length,ali_from,...,ali_length,env_from,env_to,env_length,overl_len,overl_pfam,overl_ali,non_overl_len,overl_perc,overlap_sym
0,A0A010Q304,PF02775,IPR011766,499,646,148,1,16,16,583,...,16,583,598,16,16,10.81,100.0,133,10.81,19.51
1,A0A010Q304,PF02775,IPR011766,499,646,148,1,40,40,650,...,40,650,690,41,0,0.0,0.0,195,0.0,0.0
2,A0A010Q304,PF00205,IPR012000,289,434,146,1,16,16,583,...,16,583,598,16,0,0.0,0.0,459,0.0,0.0
3,A0A010Q304,PF00205,IPR012000,289,434,146,1,40,40,650,...,40,650,690,41,0,0.0,0.0,617,0.0,0.0
4,A0A010Q304,PF02776,IPR012001,91,205,115,1,16,16,583,...,16,583,598,16,0,0.0,0.0,886,0.0,0.0
