# Keyword_Frequency_SP_PDB

Overview of the required input files:
##### Frequency_#1: 
    "filename"_FL_against_pdb_nosymbol_hits.tsv
##### Extract_Keywords_HomologyTable_#2: 
    "filename2".xlsx 
##### Subtraction_Keywords_#3: 
    "filename"_freq.tsv (Output1 Frequency_#1)
    "filename"_freq-1.tsv (Output2 Frequency_#1)
    "filename2"_keywords.tsv (Output Extract_Keywords_HomologyTable_#2)
                             
##### File_Merging_#4: 
    "filename"_freq.tsv (Output1 Frequency_#1)
    "filename"_freq-withoutHF.tsv (Output1 Subtraction_Keywords_#3)
    "filename"_freq-1-withoutHF.tsv (Output2 Subtraction_Keywords_#3)

## Keyword_Frequency_SP_PDB: Frequency_#1

### Description:

Frequency_#1 (the Word Frequency Analysis Tool) is a Python script that counts how often each word occurs in the text associated with every virus protein domain. It is intended for TSV (Tab-Separated Values) files containing domain and word information: the script reads the domain name from the first column and the words from the third column (parts[2] in the code).

For each domain, the script strips the ".pdb" extension from the name, splits the text into words, skips the excluded words, and counts the rest. The resulting frequency data is then saved to new TSV files for further analysis.

Exclusion of Common and Irrelevant Words: The script excludes a predefined list of common English words, names of Greek letters, and specific user-defined words. These exclusions help focus the analysis on more meaningful content.

Case Insensitivity and Punctuation Handling: The script lowercases every word, so words are counted regardless of their capitalization. It also removes commas, double quotes, parentheses, and colons from each word so that they do not interfere with word recognition.

Exclusion of Single Letters and Short Numbers: The tool automatically excludes words consisting of a single letter, a single digit, or a two-digit number, so that the analysis focuses on meaningful terms.

Excluded Words: The list of words to be excluded can be customized by modifying the exclude_words list in main(). Entries must be lowercase, because each word is lowercased before it is compared with the list.
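
The filtering rules described above can be illustrated with a minimal, self-contained sketch. It is not the script itself: the EXCLUDE set is a deliberately tiny stand-in for the real exclude_words list, and the helper names clean and count_words are assumptions made for this example.

```python
from collections import Counter

# Tiny stand-in for the script's much longer exclude_words list (assumption).
EXCLUDE = {"", "crystal", "structure", "of", "with", "bound"}

def clean(token):
    """Lowercase a token and strip commas, double quotes, parentheses and colons."""
    for ch in ',"():':
        token = token.replace(ch, "")
    return token.lower()

def is_valid_word(word):
    """Reject single letters, single digits and two-digit numbers."""
    if len(word) == 1 and (word.isalpha() or word.isdigit()):
        return False
    return not (len(word) == 2 and word.isdigit())

def count_words(descriptions):
    """Count the remaining words over a list of description strings."""
    counts = Counter()
    for text in descriptions:
        for token in text.split():
            word = clean(token)
            if word not in EXCLUDE and is_valid_word(word):
                counts[word] += 1
    return counts

counts = count_words([
    "Crystal structure of HLA DR52c",
    'NEONATAL FC RECEPTOR, PH 6.5"',
    "Immune Receptor",
])
print(counts.most_common(3))
```

Here "receptor" is counted twice because "RECEPTOR," and "Receptor" reduce to the same word after cleaning, while "crystal", "structure" and "of" are dropped by the exclusion list.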

Example input:

    HCMV_UL18_domain_0.pdb Crystal structure of HLA DR52c 3c5j_B 7.74E-10
    HCMV_UL18_domain_0.pdb Structure of chicken CD1-2 with bound fatty acid 3dbx_A 4.10E-11
    HCMV_UL18_domain_0.pdb TCR complex 2wbj_F 7.32E-10
    HCMV_UL18_domain_0.pdb NEONATAL FC RECEPTOR, PH 6.5" 3fru_C 6.58E-12
    HCMV_UL18_domain_0.pdb HLA-DR1 with GMF Influenza PB1 Peptide 6qza_BBB 1.14E-09
    HCMV_UL18_domain_0.pdb Crystal structure of FcRn bound to UCB-84 6c98_C 6.23E-12
    HCMV_UL18_domain_0.pdb immune receptor complex 6v15_B 8.18E-10
    HCMV_UL18_domain_0.pdb Immune Receptor 4mdi_B 3.02E-10
    [...]

##### Output (XY_freq.tsv and XY_freq-1.tsv): The script writes two TSV files with domain-specific word frequencies. XY_freq.tsv keeps every counted word, and XY_freq-1.tsv drops words that occur only once within a domain. Both files are inputs to Subtraction_Keywords_#3, and XY_freq.tsv is also an input to File_Merging_#4.

Example output: HCMV_UL18_domain_0 human(169), mhc(161), tcr(128), receptor(66), epitope(46), histocompatibility(45), hla(44), hiv-1(43), h-2db(40), chicken(39), binding(38), murine(36), virus(35), hla-dr1(34), major(32), [...]
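
For downstream analysis or plotting, a line in this output format can be turned back into a dictionary. The following is a small sketch, assuming the "domain, tab, word(count) list" layout shown in the example output; parse_freq_line is a hypothetical helper and is not part of the scripts.

```python
def parse_freq_line(line):
    """Split 'domain<TAB>word(count), word(count), ...' into (domain, {word: count})."""
    domain, _, cell = line.rstrip("\n").partition("\t")
    counts = {}
    for item in cell.split(", "):
        if not item:
            continue  # a domain may have no words left
        # Split at the last "(" so that hyphenated words such as "hiv-1" stay intact
        word, _, number = item.rpartition("(")
        counts[word] = int(number.rstrip(")"))
    return domain, counts

domain, counts = parse_freq_line("HCMV_UL18_domain_0\thuman(169), mhc(161), hiv-1(43)\n")
print(domain, counts)
```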

In [11]:
import os

# Define the file path manually
file_path = r'<filename>.tsv'

def is_valid_word(word):
    return not (len(word) == 1 and (word.isalpha() or word.isdigit())) and not (len(word) == 2 and word.isdigit())

def extract_word_frequency(filename, exclude_words, exclude_single_count=True):
    domain_word_freq = {}
    with open(filename, 'r') as file:
        for line in file:
            parts = line.strip().split('\t')
            if len(parts) < 3:
                continue  # Skip blank or incomplete lines instead of raising an IndexError
            virus_protein_domain = parts[0].replace('.pdb', '')  # Domain name without the ".pdb" extension
            words = parts[2].split()
            
            if virus_protein_domain not in domain_word_freq:
                domain_word_freq[virus_protein_domain] = {}
            
            for word in words:
                # Lowercase the word for case insensitivity (entries in exclude_words must be lowercase)
                word = word.lower().replace(',', '').replace('"', '').replace('(', '').replace(')', '').replace(':', '')  # Remove commas, double quotes, parentheses and colons
                if word not in exclude_words and is_valid_word(word):
                    domain_word_freq[virus_protein_domain][word] = domain_word_freq[virus_protein_domain].get(word, 0) + 1

    # Exclude words with only one count if exclude_single_count is True
    if exclude_single_count:
        for domain in domain_word_freq:
            domain_word_freq[domain] = {word: count for word, count in domain_word_freq[domain].items() if count > 1}

    return domain_word_freq

def save_frequency_tsv(input_path, domain_word_freq, suffix="_freq"):
    output_folder, input_filename = os.path.split(input_path)
    output_filename = os.path.splitext(input_filename)[0] + f"{suffix}.tsv"
    output_path = os.path.join(output_folder, output_filename)

    with open(output_path, 'w') as output_file:
        sorted_domains = sorted(domain_word_freq.keys())  # Sort domains alphabetically

        for domain in sorted_domains:
            word_freq = domain_word_freq[domain]
            sorted_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
            result = [f'{key}({value})' for key, value in sorted_freq]
            output_file.write(f'{domain}\t{", ".join(result)}\n')

def main():

    if file_path:
        # Add your list of excluded words here
        exclude_words = ["and", "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta", "iota", 
         "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho", "sigma", "tau", "upsilon", 
         "phi", "chi", "psi", "omega", "a", "an", "the", "in", "on", "to", "with", "for", "of", 
         "as", "by", "at", "ii", "cell", "protein", "probable", "motif", "chain", "type", "molecule",
         "uncharacterized", "c-x-c", "c-c", "basic", "structure", "complex", "crystal", "bound", "fab", "domain", 
         "antibody", "mutant", "peptide", "cryo-em", "fragment", "complexed", "the", "compound", "atomic", "structural",
         "by", "coli", "cryoem", "resolution", "its" , "form", "e.", "x-ray", "class", "structures", "-", "class",
         "therapeutic", "deletion", "variant", "site", "functional", "reveal", "c1", "implications", "monoclonal",
         "different", "allosteric", "ligand", "c-terminal", "low-ph", "escherichia", "e.coli", "between", "complexes.",
         "(crystal", "engineered", "full", "xfel", "natural", "angstroms", "allosteric", "design", "discovery", "open",
         "edition", "mode", "opposite", "novel", "region", "ligand-binding", "apo", "holoenzyme", "high", "full-length",
         "loop", "angstrom", "template-primer", "based", "using", "from", "open", "conserved", "significance", "unit",
         "loader", "neutralizing", "specificity", "substrate", "one", "inhibitor", "inhibitors", "after", "dehydration",
         "de", "novo", "symmetry", "element", "iii", "ion", "", "herpes", "virus", "herpesvirus", ]  
        
        # Save with "_freq-1" suffix (exclude single count)
        domain_word_freq = extract_word_frequency(file_path, exclude_words)
        save_frequency_tsv(file_path, domain_word_freq, suffix="_freq-1")
        print(f'Frequency data saved to: {os.path.splitext(file_path)[0]}_freq-1.tsv')

        # Save with "_freq" suffix (do not exclude single count)
        domain_word_freq_no_exclusion = extract_word_frequency(file_path, exclude_words, exclude_single_count=False)
        save_frequency_tsv(file_path, domain_word_freq_no_exclusion, suffix="_freq")
        print(f'Frequency data saved to: {os.path.splitext(file_path)[0]}_freq.tsv')
    else:
        print("No input file selected.")

if __name__ == "__main__":
    main()


Frequency data saved to: C:\Users\agbosse\Desktop\Test_File\herpes_FL_against_pdb_nosymbol_hits_freq-1.tsv
Frequency data saved to: C:\Users\agbosse\Desktop\Test_File\herpes_FL_against_pdb_nosymbol_hits_freq.tsv


## Keyword_Frequency_SP_PDB: Extract_Keywords_HomologyTable_#2

### Description:

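This Python script extracts keywords from an Excel file that lists virus proteins together with their abbreviations and descriptions. It expects columns named "Abbreviation" and "Description" and treats every column from the third onward as a virus column: the part of the column header before the first underscore is used as the virus name, and the first word of each non-empty cell is used as the protein name. The two are combined into a key of the form Virus_Protein, and the keywords are the lowercased, punctuation-stripped words from the "Abbreviation" and "Description" cells of the same row. The output is a TSV (Tab-Separated Values) file in which each row holds one virus protein and its set of keywords.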

Example Input Structure (Excel file):
<filename2>.xlsx

Example Output Structure (TSV file):
EBV_Protein1_Name      abb1, desc1, keyword1, keyword2, ...
HSV_Protein1_Name      abb2, desc2, keyword3, keyword4, ...
...

##### The output file is used as input for Subtraction_Keywords_#3.
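
The extraction logic can be sketched without pandas on a single table row. This is an illustration under stated assumptions, not the script itself: the row is a plain dictionary, the column headers "EBV_protein" and "HSV1_protein" and the cell contents are made up, and keywords_for_row is a hypothetical helper.

```python
def clean(text):
    """Lowercase and strip commas, double quotes, parentheses and colons."""
    for ch in ',"():':
        text = text.replace(ch, "")
    return text.lower()

def keywords_for_row(row):
    """Map each Virus_Protein key found in one table row to the row's keyword set."""
    keywords = set(clean(row["Abbreviation"]).split()) | set(clean(row["Description"]).split())
    result = {}
    for column in list(row)[2:]:        # virus columns start at the third column
        cell = row[column]
        if not cell:                    # empty cell: no protein listed for this virus
            continue
        virus = column.split("_")[0]    # header part before the first underscore
        protein = cell.split()[0]       # only the first word of the cell
        result[f"{virus}_{protein}"] = keywords
    return result

# Hypothetical row; headers and values are illustrative only.
row = {
    "Abbreviation": "UDG, UNG",
    "Description": "Uracil-DNA glycosylase",
    "EBV_protein": "BKRF3",
    "HSV1_protein": "",
}
result = keywords_for_row(row)
print(result)
```

The empty "HSV1_protein" cell produces no entry, which mirrors how the script skips cells without a protein name.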

In [3]:
import os
import pandas as pd

# Define the file path manually
input_file_path = r'<filename2>.xlsx'
output_file_path = r'<filename2>_keywords.tsv'

def extract_keywords_from_excel(filename):
    domain_keywords = {}
    
    # Read Excel file into a DataFrame
    df = pd.read_excel(filename, header=0)

    # Iterate through rows in the DataFrame
    for index, row in df.iterrows():
        virus_name = None
        
        for col in df.columns[2:]:
            virus_name = col.split('_')[0]
            protein_name = row[col]

            if pd.notna(protein_name):
                # Take only the first part of the protein name (str() guards against numeric cells)
                protein_name = str(protein_name).split()[0].replace(',', '').replace('"', '').replace('(', '').replace(')', '').replace(':', '')

                # Assign virus name to the protein name
                virus_protein_domain = f"{virus_name}_{protein_name}"
                
                if virus_protein_domain not in domain_keywords:
                    domain_keywords[virus_protein_domain] = set()
                
                # Assign abbreviation and description; empty cells become '' instead of the string "nan"
                abbreviation = '' if pd.isna(row['Abbreviation']) else str(row['Abbreviation']).lower().replace(',', '').replace('"', '').replace('(', '').replace(')', '').replace(':', '')
                description = '' if pd.isna(row['Description']) else str(row['Description']).lower().replace(',', '').replace('"', '').replace('(', '').replace(')', '').replace(':', '')
                
                # Split keywords if separated by a space
                keywords = abbreviation.split() + description.split()
                
                domain_keywords[virus_protein_domain].update(keywords)

    return domain_keywords

def save_keywords_tsv(output_path, domain_keywords):
    with open(output_path, 'w') as output_file:
        # Sort the domain_keywords alphabetically based on the first column
        sorted_domains = sorted(domain_keywords.keys())
        
        for domain in sorted_domains:
            keywords_str = ', '.join(sorted(domain_keywords[domain]))  # Sort the set for a reproducible order
            output_file.write(f'{domain}\t{keywords_str}\n')

def main():

    if input_file_path:
        domain_keywords = extract_keywords_from_excel(input_file_path)

        # Save the extracted keywords to a TSV file
        save_keywords_tsv(output_file_path, domain_keywords)
        print(f'Extracted keywords saved to: {output_file_path}')
    else:
        print("No input file selected.")

if __name__ == "__main__":
    main()


Extracted keywords saved to: C:\Users\agbosse\Desktop\Test_File\HerpesFolds_v6_9HHV_no-links_2023-11-30_TS_SAS_keywords.tsv


## Keyword_Frequency_SP_PDB: Subtraction_Keywords_#3

### Description:

Script #3 processes two kinds of TSV (Tab-Separated Values) files: 
the "Herpesfolds Homolog Table" TSV, which lists virus proteins and their associated keywords (output from script #2), 
and the "Keyword Freq" TSVs, which list each virus protein domain with its keywords and their counts (outputs from script #1). 

The goal is to subtract keywords associated with virus proteins in the "Herpesfolds Homolog Table" TSV from the corresponding entries in the "Keyword Freq" TSV. The output is a new TSV file containing the virus proteins along with the remaining keywords and their counts.

#### Example Input Structure ("Herpesfolds Homolog Table" TSV - output from script #2):
EBV_BKRF3     udg, glycosylase, uracil-dna, ung
EBV_BALF5     dna, dpol, polymerase

#### Example Input Structure ("Keyword Freq" TSV - output from script #1):
EBV_BKRF3_domain_0     uracil-dna(391), glycosylase(391)
EBV_BALF5_domain_0     dna(82), polymerase(82), subunit(53), catalytic(51), polb(2)

#### Example Output Structure:
EBV_BKRF3_domain_0     
EBV_BALF5_domain_0     subunit(53), catalytic(51), polb(2)

In the output, keywords associated with virus proteins from the "Herpesfolds" TSV are subtracted from the corresponding entries in the "Freq" TSV. A "Herpesfolds" entry is applied to every "Freq" row whose domain name contains its key, so EBV_BKRF3 applies to EBV_BKRF3_domain_0. The remaining keywords keep their original counts. The output has the same format as the "Freq" TSV, with each line representing a virus protein domain and its remaining keywords and counts. "Freq" rows without a matching "Herpesfolds" entry are not written to the output.

##### The output files are used as input for File_Merging_#4.
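
The subtraction can be reproduced on the example data above with a short sketch. The helper names subtract and matches are assumptions made for this example. matches uses a prefix test on the domain name, which assumes the Virus_Protein_domain_N naming shown in the examples; the key HCMV_UL1 in the last line is hypothetical and only shows that such a key would not be applied to a UL18 domain.

```python
def subtract(freq_cell, table_keywords):
    """Drop every 'word(count)' item whose word is in table_keywords."""
    kept = []
    for item in freq_cell.split(", "):
        word = item.rpartition("(")[0]   # text before the last "("
        if item and word not in table_keywords:
            kept.append(item)            # keep the item with its count unchanged
    return ", ".join(kept)

def matches(table_key, domain):
    """True if domain equals table_key or is table_key followed by '_...'."""
    return domain == table_key or domain.startswith(table_key + "_")

table = {
    "EBV_BKRF3": {"udg", "glycosylase", "uracil-dna", "ung"},
    "EBV_BALF5": {"dna", "dpol", "polymerase"},
}
freq = {
    "EBV_BKRF3_domain_0": "uracil-dna(391), glycosylase(391)",
    "EBV_BALF5_domain_0": "dna(82), polymerase(82), subunit(53), catalytic(51), polb(2)",
}
result = {
    domain: subtract(cell, keywords)
    for domain, cell in freq.items()
    for key, keywords in table.items()
    if matches(key, domain)
}
print(result)
print(matches("HCMV_UL1", "HCMV_UL18_domain_0"))  # hypothetical key, prints False
```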

In [12]:
import os
import csv

# Define the input file paths manually
freq_tsv_path_1 = r'<filename>_freq.tsv'
freq_tsv_path_2 = r'<filename>_freq-1.tsv'
herpesfolds_tsv_path = r'<filename2>_keywords.tsv'

# Output file paths
output_tsv_path_1 = r'<filename>_freq-withoutHF.tsv'
output_tsv_path_2 = r'<filename>_freq-1-withoutHF.tsv'

def subtract_keywords(herpesfolds_row, freq_row):
    # Extract keywords from Herpesfolds row and format them
    herpesfolds_keywords = set(herpesfolds_row[1].split(', '))
    
    # Extract keywords and counts from Freq row and format them
    freq_keywords_with_counts = [word.split('(') if '(' in word else (word, '') for word in freq_row[1].split(', ')]
    
    # List to store subtracted keywords
    subtracted_keywords = []

    # Iterate over keywords and counts in Freq row
    for word, count in freq_keywords_with_counts:
        # Check if the keyword is not present in Herpesfolds keywords
        if word not in herpesfolds_keywords and word != '':
            # Append the keyword and its count to the subtracted_keywords list
            subtracted_keywords.append(f'{word}({count})')

    # Return the original virus_protein combination and the subtracted keywords
    return freq_row[0], ', '.join(subtracted_keywords)

def process_files(freq_tsv_path, herpesfolds_tsv_path, output_tsv_path):
    # Dictionary to store unique rows based on their keys
    output_dict = {}

    # Read Freq TSV file into a list of rows
    with open(freq_tsv_path, 'r', newline='', encoding='utf-8') as freq_file:
        freq_reader = csv.reader(freq_file, delimiter='\t')
        freq_rows = list(freq_reader)

    # Read Herpesfolds TSV file into a list of rows
    with open(herpesfolds_tsv_path, 'r', newline='', encoding='utf-8') as herpesfolds_file:
        herpesfolds_reader = csv.reader(herpesfolds_file, delimiter='\t')
        herpesfolds_rows = list(herpesfolds_reader)

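    # Iterate over Herpesfolds and Freq rows to perform subtraction
    for herpesfolds_row in herpesfolds_rows:
        for freq_row in freq_rows:
            if len(herpesfolds_row) < 2 or len(freq_row) < 2:
                continue  # Skip blank or incomplete rows
            # Match the full protein name so that a key ending in "UL1" does not also match "UL18"
            if freq_row[0] == herpesfolds_row[0] or freq_row[0].startswith(herpesfolds_row[0] + '_'):
                # Call subtract_keywords function and store the result in the dictionary
                key, value = subtract_keywords(herpesfolds_row, freq_row)
                output_dict[key] = value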

    # Convert the dictionary to a list before sorting and writing to the output file
    output_rows_list = sorted([(key, value) for key, value in output_dict.items()], key=lambda x: x[0])

    # Write the sorted output list to the output TSV file
    with open(output_tsv_path, 'w', newline='', encoding='utf-8') as output_file:
        output_writer = csv.writer(output_file, delimiter='\t')
        
        # Iterate over output rows and write them to the file
        for key, value in output_rows_list:
            # Correctly format the output to avoid double closing parenthesis
            output_writer.writerow([key, value.replace('))', ')')])

    # Print the output file path when saved
    print(f'Output data saved to: {output_tsv_path}')

if __name__ == "__main__":

    # Call the process_files function for the first input
    process_files(freq_tsv_path_1, herpesfolds_tsv_path, output_tsv_path_1)

    # Call the process_files function for the second input
    process_files(freq_tsv_path_2, herpesfolds_tsv_path, output_tsv_path_2)

Output data saved to: C:\Users\agbosse\Desktop\Test_File\herpes_FL_against_pdb_nosymbol_hits_freq-withoutHF.tsv
Output data saved to: C:\Users\agbosse\Desktop\Test_File\herpes_FL_against_pdb_nosymbol_hits_freq-1-withoutHF.tsv


## Keyword_Frequency_SP_PDB: File_Merging_#4

The Python script merge_columns.py merges specific columns from three TSV files. It extracts the first two columns from the first file, the second column from the second file, and the second column from the third file. Rows are combined by line position, not by domain name, so the three files must list the same domains in the same order. The merged data is saved with headers in a new TSV file named after the first input file, suffixed "_HF_HF-1.tsv".

Input Example Files:

    <filename>_freq.tsv: Extracts first two columns.
    <filename>_freq-withoutHF.tsv: Extracts the second column.
    <filename>_freq-1-withoutHF.tsv: Extracts the second column.

Output Example File:

    Merged data with headers saved as <filename>_freq_HF_HF-1.tsv.

Columns in Example Output File:

    Virus_protein: First column, virus protein information.
    Keywords: Second column, keywords from the first file.
    Keywords without HomologyTable: Third column, keywords from the second file.
    Keywords without HomologyTable and single keywords: Fourth column, keywords from the third file.
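
If the three files do not contain exactly the same domains, pairing rows by position can shift keywords onto the wrong domain. A possible alternative is to merge on the domain name. The sketch below is an assumption-laden illustration rather than the script's behavior: read_column and merge_by_domain are hypothetical helpers, and the in-memory file contents are invented for the example.

```python
import csv
import io

def read_column(handle):
    """Return {first column: second column} for a tab-separated file object."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: (row[1] if len(row) > 1 else "") for row in reader if row}

def merge_by_domain(freq, without_hf, without_hf_1):
    """Build merged rows keyed on the domain name; missing entries become ''."""
    rows = []
    for domain in sorted(freq):
        rows.append([domain, freq[domain],
                     without_hf.get(domain, ""), without_hf_1.get(domain, "")])
    return rows

# In-memory stand-ins for the three TSV files (illustrative contents).
freq = read_column(io.StringIO("A_domain_0\tdna(3), subunit(1)\nB_domain_0\tkinase(2)\n"))
without_hf = read_column(io.StringIO("B_domain_0\tkinase(2)\n"))
without_hf_1 = read_column(io.StringIO("B_domain_0\tkinase(2)\n"))

merged = merge_by_domain(freq, without_hf, without_hf_1)
for row in merged:
    print(row)
```

In this example A_domain_0 is missing from the two "withoutHF" inputs, so its third and fourth columns stay empty instead of receiving the keywords of B_domain_0.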

In [13]:
import os
import csv

# Paths to input and output files
file1_path = r'<filename>_freq.tsv'
file2_path = r'<filename>_freq-withoutHF.tsv'
file3_path = r'<filename>_freq-1-withoutHF.tsv'
    
def merge_columns(file1_path, file2_path, file3_path, output_path):
    data = []

    # Read the first two columns from the first file
    with open(file1_path, 'r', newline='', encoding='utf-8') as file1:
        reader1 = csv.reader(file1, delimiter='\t')
        data = [row[:2] for row in reader1]

    # Read the second column from the second file and append it as the third output column
    with open(file2_path, 'r', newline='', encoding='utf-8') as file2:
        reader2 = csv.reader(file2, delimiter='\t')
        for i, row in enumerate(reader2):
            if i < len(data):
                data[i].append(row[1])

    # Read the second column from the third file and append it as the fourth output column
    with open(file3_path, 'r', newline='', encoding='utf-8') as file3:
        reader3 = csv.reader(file3, delimiter='\t')
        for i, row in enumerate(reader3):
            if i < len(data):
                data[i].append(row[1])

    # Insert headers
    headers = ["Virus_protein", "Keywords", "Keywords without HomologyTable", "Keywords without HomologyTable and single keywords"]
    data.insert(0, headers)

    # Save the merged data to the output file
    with open(output_path, 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.writer(output_file, delimiter='\t')
        writer.writerows(data)

    # Print the output file path when saved
    print(f'Merged data with headers saved to: {output_path}')

if __name__ == "__main__":

    # Generate the output file path based on the first file
    output_path = os.path.splitext(file1_path)[0] + "_HF_HF-1.tsv"

    # Call the merge_columns function with the provided paths
    merge_columns(file1_path, file2_path, file3_path, output_path)


Merged data with headers saved to: C:\Users\agbosse\Desktop\Test_File\herpes_FL_against_pdb_nosymbol_hits_freq_HF_HF-1.tsv
