This notebook retrieves and parses the functional information associated with the protein sequences of six model organisms. The final goal is to produce tables in a format that Tableau can load (e.g. tsv) and to explore, at the functional level, the orthology relationships among inferred transcriptional regulatory interactions.

We'll create (see below) three folders to save our results: *InterproResults*, containing the output of interproscan; *InterproParsedXML*, containing the parsed outputs we'll create; and *InterproFullMerge*, containing the final tables with all the information.

In [1]:
%%bash

# If the folders don't exist, create them

if [ ! -d InterproResults ]; then
    mkdir InterproResults
fi

if [ ! -d InterproParsedXML ]; then 
    mkdir InterproParsedXML
fi

if [ ! -d InterproFullMerge ]; then
    mkdir InterproFullMerge
fi
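
The same folder setup can also be sketched in Python. This is only an illustrative alternative to the bash cell above, not part of the original pipeline; the helper name `ensure_result_folders` and its `base_dir` parameter are assumptions introduced here.

```python
import os

# Folder names taken from the description above
RESULT_FOLDERS = ("InterproResults", "InterproParsedXML", "InterproFullMerge")

def ensure_result_folders(base_dir="."):
    """Create the three result folders under base_dir if they are missing."""
    paths = []
    for name in RESULT_FOLDERS:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call a no-op when the folder already exists
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_result_folders()` twice is harmless, which mirrors the `if [ ! -d ... ]` checks.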

Next, we import the Python libraries:

In [2]:
import pandas as pd
import numpy as np
import os

## Preprocessing

To retrieve functional information about the proteins and domains in the sequences, interproscan (only available for Linux-based systems) was run for the six organisms. Basically, interproscan was run as shown below:

In [None]:
%%bash

cd InterproResults
for fasta in ../genomes/*faa; do

        # Uncomment the following line to run interproscan and replace the path with the
        # location of your own interproscan installation
        #./../../../../Descargas/Interpro/interproscan-5.60-92.0/interproscan.sh --verbose -i "${fasta}" -f xml,tsv -appl PANTHER,Pfam,FunFam,SUPERFAMILY,CDD --goterms
        # To give the opportunity to release memory before the following fasta
        sleep 5
done

cd ..

Five methods were used (PANTHER, Pfam, FunFam, SUPERFAMILY, CDD) and two output formats were requested: a tsv file containing the structure we are interested in, and an xml file containing a richer description of the matches of the sequences. Each file is named after the fasta file plus the respective extension. For instance, if the file with the protein sequences of _E.coli_ is **GCF_000005845.2_E_coli_K12_genomic.faa**, the interproscan outputs will be **GCF_000005845.2_E_coli_K12_genomic.faa.tsv** and **GCF_000005845.2_E_coli_K12_genomic.faa.xml**.

We are going to add headers to the tsv files. The description of each column can be found in the [interproscan documentation](https://interproscan-docs.readthedocs.io/en/latest/OutputFormats.html).

In [21]:
%%bash

for tsv in InterproResults/*tsv; do
    
    # The file is modified in place below, so first check whether the header is already there
    HEADER=$(head -n 1 "${tsv}")
    EXPECTED_HEADER=$(printf "id_protein\tmd5\tseq_len\tanalysis\tid_analysis\tdesc_analysis\tstart\tstop\tscore\tstatus\tdate\tid_inter\tdesc_inter\tgo\n")
    
    echo "Checking headers in ${tsv}"
    
    if [ "${HEADER}" != "${EXPECTED_HEADER}" ]; then
        printf "\tAdding header\n"
        sed -i "1s/^/${EXPECTED_HEADER}\n/" "${tsv}"
    fi

done

Checking headers in 
	Adding header
Checking headers in 
	Adding header
Checking headers in 
	Adding header
Checking headers in 
	Adding header
Checking headers in 
	Adding header
Checking headers in 
	Adding header
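
If modifying the files in place is not desirable, the column names can instead be supplied when the file is loaded. The following is a minimal sketch, assuming a header-less interproscan tsv with the 14 columns named above; the constant `INTERPRO_COLUMNS` and the one-row sample (with a shortened, made-up md5) are illustrative only.

```python
import io
import pandas as pd

# Same names as EXPECTED_HEADER in the bash cell above
INTERPRO_COLUMNS = [
    "id_protein", "md5", "seq_len", "analysis", "id_analysis", "desc_analysis",
    "start", "stop", "score", "status", "date", "id_inter", "desc_inter", "go",
]

# One header-less row standing in for a raw interproscan tsv file
raw_tsv = (
    "NP_416254.1\tea37cb76\t275\tPfam\tPF02540\tNAD synthase\t23\t265\t"
    "1.6e-81\tT\t08-02-2023\tIPR022310\tNAD/GMP synthase\t-\n"
)

# header=None tells pandas the file has no header row; names= supplies one
df = pd.read_csv(io.StringIO(raw_tsv), sep="\t", header=None, names=INTERPRO_COLUMNS)
print(df[["id_protein", "analysis", "id_analysis"]])
```

With this approach the original interproscan files stay untouched, at the cost of repeating the column list wherever the files are read.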


We will cross the information of both files for each genome later. For now, an overview of both files is shown in the following cells.

In [22]:
path_interpro_results = "InterproResults"
all_files_names_results = os.listdir(path_interpro_results)

# Filtering only the files ending with "faa.tsv"
tsv_files_names_results = [f for f in all_files_names_results if f.endswith("faa.tsv") ]

In [23]:
tsv_files_names_results.sort()
tsv_files_names_results

['GCF_000005845.2_E_coli_K12_genomic.faa.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic.faa.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic.faa.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic.faa.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic.faa.tsv']

In [26]:
# We'll load the files later, here I just show one of them
pd.read_csv(os.path.join(path_interpro_results, "GCF_000005845.2_E_coli_K12_genomic.faa.tsv"),
            sep="\t",
            nrows=5)

Unnamed: 0,id_protein,md5,seq_len,analysis,id_analysis,desc_analysis,start,stop,score,status,date,id_inter,desc_inter,go
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,PANTHER,PTHR23090,NH 3 /GLUTAMINE-DEPENDENT NAD + SYNTHETASE,11,242,7.6e-32,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435
2,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,FunFam,G3DSA:3.40.50.620:FF:000015,NH(3)-dependent NAD(+) synthetase,1,274,0.0,T,08-02-2023,-,-,
3,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,SUPERFAMILY,SSF52402,Adenine nucleotide alpha hydrolases-like,3,273,1.45e-86,T,08-02-2023,-,-,
4,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,Pfam,PF02540,NAD synthase,23,265,1.6000000000000002e-81,T,08-02-2023,IPR022310,NAD/GMP synthase,-


In [27]:
%%bash

head -n 20 "InterproResults/GCF_000005845.2_E_coli_K12_genomic.faa.xml"

<?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.60-92.0">
  <protein>
    <sequence md5="ea37cb76b651f90db9b8d4dbf6861145">MTLQQQIIKALGAKPQINAEEEIRRSVDFLKSYLQTYPFIKSLVLGISGGQDSTLAGKLCQMAINELRLETGNESLQFIAVRLPYGVQADEQDCQDAIAFIQPDRVLTVNIKGAVLASEQALREAGIELSDFVRGNEKARERMKAQYSIAGMTSGVVVGTDHAAEAITGFFTKYGDGGTDINPLYRLNKRQGKQLLAALACPEHLYKKAPTADLEDDRPSLPDEVALGVTYDNIDDYLEGKNVPQQVARTIENWYLKTEHKRRPPITVFDDFWKK</sequence>
    <xref id="NP_416254.1" name="NP_416254.1&#9;nadE|NAD synthetase, NH3-dependent|Escherichia coli str. K-12 substr. MG1655"/>
    <matches>
      <hmmer3-match evalue="0.0" score="522.3">
        <signature ac="G3DSA:3.40.50.620:FF:000015" desc="NH(3)-dependent NAD(+) synthetase">
          <signature-library-release library="FUNFAM" version="4.3.0"/>
        </signature>
        <model-ac>3.40.50.620-FF-000015</model-ac>
        <locations>
          <hmmer3-location env-end="274" 

We need to transform the XML file into something easier to manipulate and cross with the tsv file. As an initial approach, I'll use regular expressions to filter the following tags: **xref**, containing the sequence ID; **signature ac**, the ID of the method and job used; **entry ac**, general information; **protein**, which marks a protein or group of proteins with the same results; and **go-xref category**, the descriptions of GO terms.

First, using bash, we'll build a reference table of the GO terms and their descriptions and keep it as a separate table. Each GO term has, in theory, a unique ID, which is why this should work.

In addition, we'll add a row of "NaN" values to group all the terms with no information, plus the header.

In [28]:
%%bash

# Empty file
printf "" > GO_terms_tmp.tsv

for XML in InterproResults/*xml; do

    # Retrieve go terms; remove attribute names; remove leading spaces and 
    # replace the first two blank spaces by tabs
    PRE_CLEANED_GO=$(grep -E "<go-xref\s+category" "${XML}" |
                         sed -r 's/(<go-xref category=|db="GO"\s+id=|name=|\/>|")//g' |
                         sed -r 's/^\s+//g' |
                         sed -r 's/\s+/\t/1' |
                         sed -r 's/\s+/\t/2')
    
    echo "${PRE_CLEANED_GO}" >> GO_terms_tmp.tsv

done

# Removing repeated GO terms
sort GO_terms_tmp.tsv | uniq > GO_terms.tsv && rm GO_terms_tmp.tsv

# Adding NaN at the end and headers at the beginning
printf "NaN\tNaN\tNaN\n" >> GO_terms.tsv
sed -i "1s/^/category\tid_go\tdescription_go\n/" GO_terms.tsv

column -s$'\t' -t GO_terms.tsv | head

category            id_go       description_go
BIOLOGICAL_PROCESS  GO:0000041  transition metal ion transport
BIOLOGICAL_PROCESS  GO:0000103  sulfate assimilation
BIOLOGICAL_PROCESS  GO:0000105  histidine biosynthetic process
BIOLOGICAL_PROCESS  GO:0000160  phosphorelay signal transduction system
BIOLOGICAL_PROCESS  GO:0000162  tryptophan biosynthetic process
BIOLOGICAL_PROCESS  GO:0000256  allantoin catabolic process
BIOLOGICAL_PROCESS  GO:0000271  polysaccharide biosynthetic process
BIOLOGICAL_PROCESS  GO:0000272  polysaccharide catabolic process
BIOLOGICAL_PROCESS  GO:0000413  protein peptidyl-prolyl isomerization


As you can see, the new file named *GO_terms.tsv* has three columns: the first with the category of the GO term, the second with the GO ID and the last with the description of that term.
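
The same extraction can be sketched in Python with the `re` module. This is a hedged illustration, not the method used in this notebook: it assumes, like the `sed` commands above, that each `go-xref` tag carries its attributes in the order `category`, `db`, `id`, `name`. The sample text and the function name `go_terms_table` are made up for the example.

```python
import re
import pandas as pd

# Assumes the attribute order category, db, id, name seen in the sed pipeline above
GO_XREF = re.compile(
    r'<go-xref\s+category="([^"]*)"\s+db="GO"\s+id="([^"]*)"\s+name="([^"]*)"'
)

def go_terms_table(xml_text):
    """Return a de-duplicated, sorted table of (category, id_go, description_go)."""
    rows = sorted(set(GO_XREF.findall(xml_text)))
    return pd.DataFrame(rows, columns=["category", "id_go", "description_go"])

# Hand-written fragment standing in for an interproscan XML file
sample_xml = '''
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0003952" name="NAD+ synthase (glutamine-hydrolyzing) activity"/>
<go-xref category="CELLULAR_COMPONENT" db="GO" id="GO:0005737" name="cytoplasm"/>
<go-xref category="CELLULAR_COMPONENT" db="GO" id="GO:0005737" name="cytoplasm"/>
'''

go_table = go_terms_table(sample_xml)
print(go_table)
```

Note that this sketch does not decode XML entities (such as `&amp;`) that may appear inside `name`.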

Now it's time to process the remaining tags. Because of the complexity of the XML files (e.g. many nested tags), libraries such as ```xml.etree.ElementTree``` in Python do not seem appropriate here, both for the number of for loops needed and perhaps for readability (although I'll use regex, haha). The same goes for ```pandas```, so again I'll do this in bash. Feel free to find another way to accomplish this.

In [81]:
%%bash

for XML in InterproResults/*xml; do

    OUTPUT_NAME=$(basename "${XML}" | sed -r 's/$/\.tsv/g')

    # Retrieve all tags and attributes; retrieve only important attributes; clean attribute names 
    # and quote "
    PRE_CLEANED_XML=$(grep -P "(<xref |<signature |<entry ac|<protein>)" "${XML}" |
                          grep -Po '(<protein>|<xref id="[^"]+"|<signature ac="[^"]+"|type="[^"]+")' |
                          sed -r 's/(^<\w+ |")//g')

    # merge multiples group of proteins into a single row separated by |;
    # merge type with method ID into a single row separated by tab;
    # clean attributes names
    PRE_FORMATED_XML=$(echo "${PRE_CLEANED_XML}" |
                           sed ':r;$!{N;br};s/\nid/\|id/g' |
                           sed ':r;$!{N;br};s/\ntype/\ttype/g' |
                           sed -r 's/<protein>\|//g' |
                           sed -r 's/ac=|type=//g')

    # Parse the protein ID (clean format) and produce the tabular form
    echo "${PRE_FORMATED_XML}" |
        perl -ne 'if($_ =~ /id=/){
                        chomp($_);
                        $header= ($_ =~ s/id=//gr);
                } else{
                        chomp($_);
                        $new_string=$header . "\t" . $_;
                        $len = 3 - scalar(split("\t", $new_string));
                        print $new_string,"\tNULL_TYPE"x$len,"\n"
                }' > InterproParsedXML/$OUTPUT_NAME
    
    # Adding header
    sed -i '1s/^/id_protein\tid_analysis\ttype\n/' InterproParsedXML/$OUTPUT_NAME

done

# Ecoli file
head InterproParsedXML/GCF_000005845.2_E_coli_K12_genomic.faa.xml.tsv

id_protein	id_analysis	type
NP_416254.1	G3DSA:3.40.50.620:FF:000015	NULL_TYPE
NP_416254.1	PF02540	DOMAIN
NP_416254.1	PTHR23090	FAMILY
NP_416254.1	cd00553	FAMILY
NP_416254.1	SSF52402	NULL_TYPE
NP_416630.1	PF07694	DOMAIN
NP_416630.1	PF06580	DOMAIN
NP_416630.1	G3DSA:3.30.450.40:FF:000013	NULL_TYPE
NP_416630.1	PTHR34220	NULL_TYPE


With this file, we can load the parsed XML file and the tsv from interproscan for every organism and map them by protein ID and method ID. We'll take _E.coli_ as an example of the logic to follow:

In [136]:
# Read the tsv file from interproscan and the tsv file obtained from processing
# the xml file into two DataFrames

path_interpro_parsed = "InterproParsedXML/"

xml_parsed_df = pd.read_csv(os.path.join(path_interpro_parsed,"GCF_000005845.2_E_coli_K12_genomic.faa.xml.tsv"),
                            sep="\t")

tsv_df = pd.read_csv(os.path.join(path_interpro_results, "GCF_000005845.2_E_coli_K12_genomic.faa.tsv"),
                     sep="\t", )

# Replace the null markers ("NULL_TYPE" in the parsed xml, "-" in the tsv) with np.nan
for col in xml_parsed_df:
    xml_parsed_df[col] = xml_parsed_df[col].replace("NULL_TYPE", np.nan)

for col in tsv_df:
    tsv_df[col] = tsv_df[col].replace("-", np.nan)
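
As a possible shortcut, `pd.read_csv` can treat these markers as missing values at load time through its `na_values` parameter. A minimal sketch on a toy, in-memory tsv (the rows are made up for illustration):

```python
import io
import pandas as pd

# Toy tsv using "-" as the null marker, as interproscan does
toy_tsv = (
    "id_protein\tid_inter\tgo\n"
    "P1\tIPR003694\tGO:0003952|GO:0004359\n"
    "P1\t-\t-\n"
)

# na_values adds "-" to the strings that pandas already reads as NaN
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values=["-"])
print(toy_df.isna())
```

The same idea would apply to the parsed XML table with `na_values=["NULL_TYPE"]`.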

In [137]:
xml_parsed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20195 entries, 0 to 20194
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id_protein   20195 non-null  object
 1   id_analysis  20195 non-null  object
 2   type         11561 non-null  object
dtypes: object(3)
memory usage: 473.4+ KB


In [138]:
tsv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20615 entries, 0 to 20614
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id_protein     20615 non-null  object 
 1   md5            20615 non-null  object 
 2   seq_len        20615 non-null  int64  
 3   analysis       20615 non-null  object 
 4   id_analysis    20615 non-null  object 
 5   desc_analysis  20529 non-null  object 
 6   start          20615 non-null  int64  
 7   stop           20615 non-null  int64  
 8   score          20615 non-null  float64
 9   status         20615 non-null  object 
 10  date           20615 non-null  object 
 11  id_inter       11921 non-null  object 
 12  desc_inter     11921 non-null  object 
 13  go             5540 non-null   object 
dtypes: float64(1), int64(3), object(10)
memory usage: 2.2+ MB


The *go* column of the interproscan tsv and the *id_protein* column of the parsed xml file can both hold multiple values in a single row. To connect the table we are creating with the **GO_terms.tsv** table for each protein, we'll convert each DataFrame to a "long" format, expanding the multiple values found in *go* and *id_protein* into multiple rows that describe the same result but carry only one GO term and one ID each, as exemplified below:

In [139]:
ids = [0,1,"2|3"]
go_terms = ["GO:1|GO:2",np.nan,"GO:3"]

df_shown = pd.DataFrame({"ids":ids, "go":go_terms})
df_shown

Unnamed: 0,ids,go
0,0,GO:1|GO:2
1,1,
2,2|3,GO:3


In [140]:
ids = np.array([0,0,1,2,3])
go_terms = ["GO:1","GO:2", np.nan, "GO:3", "GO:3"]

df_expected = pd.DataFrame({"ids":ids, "go":go_terms})
df_expected.index = [0,0,1,2,2]
df_expected

Unnamed: 0,ids,go
0,0,GO:1
0,0,GO:2
1,1,
2,2,GO:3
2,3,GO:3


Note that keeping the index is important, since it will be used to map back to the original table. Let's do it using only the *go* column:

In [142]:
# Convert the GO terms into a DataFrame with as many columns as the maximum number of GO terms per row
go_tsv = tsv_df["go"].str.split("|").apply(pd.Series)

# "Merge" all the columns that share an index into multiple rows of a single column
go_tsv = go_tsv.copy().T
go_tsv_long = go_tsv.melt(var_name="index_tmp", value_name="id_go").drop_duplicates()

# Masking to remove duplicated index with np.NaN value
mask = ~(go_tsv_long["index_tmp"].duplicated(keep=False) & go_tsv_long["id_go"].isna())
go_tsv_long = go_tsv_long[mask].copy().set_index("index_tmp")

# Left join with the original table
tsv_df = tsv_df.merge(go_tsv_long, how="left", right_index=True, left_index=True).copy()
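
For comparison, pandas also offers `DataFrame.explode`, which produces the same long format while keeping the original index. The sketch below applies it to the toy `df_shown` frame from the example above; the helper name `split_to_rows` is hypothetical, and this is an alternative, not the code used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

df_shown = pd.DataFrame({"ids": [0, 1, "2|3"], "go": ["GO:1|GO:2", np.nan, "GO:3"]})

def split_to_rows(df, column, sep="|"):
    """Split the strings in `column` on `sep` and give each piece its own row."""
    out = df.copy()
    # Only strings are split; numbers and NaN are left untouched
    out[column] = out[column].map(lambda v: v.split(sep) if isinstance(v, str) else v)
    # explode repeats the other columns and the index for every list element
    return out.explode(column)

long_df = split_to_rows(split_to_rows(df_shown, "go"), "ids")
print(long_df)
```

Values that came from a split string stay strings (for instance `"2"` and `"3"` in `ids`), so a type conversion may be needed afterwards.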

In [144]:
tsv_df.head(10)

Unnamed: 0,id_protein,md5,seq_len,analysis,id_analysis,desc_analysis,start,stop,score,status,date,id_inter,desc_inter,go,id_go
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0003952
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0004359
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0005737
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0009435
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,PANTHER,PTHR23090,NH 3 /GLUTAMINE-DEPENDENT NAD + SYNTHETASE,11,242,7.6e-32,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0003952
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,PANTHER,PTHR23090,NH 3 /GLUTAMINE-DEPENDENT NAD + SYNTHETASE,11,242,7.6e-32,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0004359
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,PANTHER,PTHR23090,NH 3 /GLUTAMINE-DEPENDENT NAD + SYNTHETASE,11,242,7.6e-32,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0005737
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,PANTHER,PTHR23090,NH 3 /GLUTAMINE-DEPENDENT NAD + SYNTHETASE,11,242,7.6e-32,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952|GO:0004359|GO:0005737|GO:0009435,GO:0009435
2,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,FunFam,G3DSA:3.40.50.620:FF:000015,NH(3)-dependent NAD(+) synthetase,1,274,0.0,T,08-02-2023,,,,
3,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,SUPERFAMILY,SSF52402,Adenine nucleotide alpha hydrolases-like,3,273,1.45e-86,T,08-02-2023,,,,


We have correctly split the *go* column! We don't need the original column anymore.

In [145]:
tsv_df.drop("go", axis=1, inplace=True)
tsv_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25644 entries, 0 to 20614
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id_protein     25644 non-null  object 
 1   md5            25644 non-null  object 
 2   seq_len        25644 non-null  int64  
 3   analysis       25644 non-null  object 
 4   id_analysis    25644 non-null  object 
 5   desc_analysis  25522 non-null  object 
 6   start          25644 non-null  int64  
 7   stop           25644 non-null  int64  
 8   score          25644 non-null  float64
 9   status         25644 non-null  object 
 10  date           25644 non-null  object 
 11  id_inter       16950 non-null  object 
 12  desc_inter     16950 non-null  object 
 13  id_go          10569 non-null  object 
dtypes: float64(1), int64(3), object(10)
memory usage: 2.9+ MB


We would now have to do the same with the *id_protein* column of the parsed xml file, and we will, but first I want to explain the next step: joining the tables from both files. To do this, we build a shared column that serves as the key to join both datasets, based on the ID of the protein and of the method used. Repeated columns will be deleted.

In [146]:
# shared_index column will be used as key in both files
tsv_df["shared_index"] = tsv_df["id_protein"] + "\t" + tsv_df["id_analysis"]
xml_parsed_df["shared_index"] = xml_parsed_df["id_protein"] + "\t" + xml_parsed_df["id_analysis"]

# Left join
full_tsv = tsv_df.merge(xml_parsed_df, on="shared_index", how="left")
full_tsv.head(3)

Unnamed: 0,id_protein_x,md5,seq_len,analysis,id_analysis_x,desc_analysis,start,stop,score,status,date,id_inter,desc_inter,id_go,shared_index,id_protein_y,id_analysis_y,type
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952,NP_416254.1\tcd00553,NP_416254.1,cd00553,FAMILY
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0004359,NP_416254.1\tcd00553,NP_416254.1,cd00553,FAMILY
2,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0005737,NP_416254.1\tcd00553,NP_416254.1,cd00553,FAMILY


In [147]:
full_tsv.drop(["shared_index","id_protein_y","id_analysis_y"], axis=1, inplace=True)
full_tsv.rename({"id_protein_x":"id_protein"}, inplace=True, axis=1)
full_tsv.head(3)

Unnamed: 0,id_protein,md5,seq_len,analysis,id_analysis_x,desc_analysis,start,stop,score,status,date,id_inter,desc_inter,id_go,type
0,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0003952,FAMILY
1,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0004359,FAMILY
2,NP_416254.1,ea37cb76b651f90db9b8d4dbf6861145,275,CDD,cd00553,NAD_synthase,17,262,1.85755e-84,T,08-02-2023,IPR003694,NAD(+) synthetase,GO:0005737,FAMILY
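
As a side note, `DataFrame.merge` accepts a list of columns in `on`, so the same join can be written without a helper key and without the `_x`/`_y` duplicates. A minimal sketch with toy frames (all values are made up):

```python
import pandas as pd

tsv_toy = pd.DataFrame({
    "id_protein": ["P1", "P1", "P2"],
    "id_analysis": ["PF1", "cd1", "PF2"],
    "score": [1.0, 2.0, 3.0],
})
xml_toy = pd.DataFrame({
    "id_protein": ["P1", "P1"],
    "id_analysis": ["PF1", "cd1"],
    "type": ["DOMAIN", "FAMILY"],
})

# Joining on both columns at once; the key columns appear only once in the result
merged = tsv_toy.merge(xml_toy, on=["id_protein", "id_analysis"], how="left")
print(merged)
```

Rows of the left table without a match, such as `P2`/`PF2` here, keep a missing value in `type`.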


Remember that we have not split the *id_protein* column yet, so this DataFrame is not entirely valid.

Now we need to do the same for every organism and also split *id_protein*. I'll wrap the code shown above into functions that do all the hard work. In addition, a short piece of code will save the resulting tables to files.

In [206]:
def LoadFiles(file_1:str, file_2:str, keydict:dict=None) -> (pd.DataFrame, pd.DataFrame):
    
    # This function reads a tsv file and a parsed xml file obtained from interproscan
    # into two pandas DataFrames
    
    print(f"Reading files into DataFrames: {file_1}")
    print(f"                               {file_2}")
    
    # keydict defaults to None, which cannot be unpacked with **, so fall back to {}
    keydict = keydict or {}
    df_1 = pd.read_csv(file_1, **keydict)
    df_2 = pd.read_csv(file_2, **keydict)
    
    return df_1, df_2

def ReplaceInFrame(data:pd.DataFrame, to_replace:str, by=np.nan) -> pd.DataFrame:
    
    # This function replaces one value with another in every column of a DataFrame
    print(f"Replacing {to_replace} by {by}")
    
    for col in data:
        data[col] = data[col].replace(to_replace, by)
    
    return data

def ParseMultiValuesFrames(data:pd.DataFrame, col_target, sep_target,
                           value_name:str, keydict:dict) -> pd.DataFrame:
    
    # This function splits the multiple values found in a single cell of a column
    # into multiple rows, each containing only one value
    
    print(f"Parsing multivalues in column {col_target}")
    
    df_splitted = data[col_target].str.split(sep_target).apply(pd.Series)

    # "Merge" all the columns that share an index into multiple rows of a single column
    df_splitted_T = df_splitted.copy().T
    df_splitted_T = df_splitted_T.melt(var_name="index_tmp", value_name="column_tmp").drop_duplicates()

    # Masking to remove duplicated index with np.NaN value
    mask = ~(df_splitted_T["index_tmp"].duplicated(keep=False) & df_splitted_T["column_tmp"].isna())
    df = df_splitted_T[mask].copy().set_index("index_tmp")
    full_df = data.merge(df, **keydict)
    full_df = full_df.drop(col_target, axis=1)
    
    return full_df.rename({"column_tmp": value_name}, axis=1)
    
def CreateSharedIndex(df_1:pd.DataFrame, df_1_cols:list,
                      df_2:pd.DataFrame, df_2_cols:list,
                      shared_index:str, sep="\t") -> (pd.DataFrame, pd.DataFrame):
    
    # This function creates a shared index between two DataFrames by concatenating
    # two columns of each one
    
    print(f"Creating shared_index = {shared_index}")
    
    df_1[shared_index] = df_1[df_1_cols[0]] + sep + df_1[df_1_cols[1]]
    df_2[shared_index] = df_2[df_2_cols[0]] + sep + df_2[df_2_cols[1]]
    
    return df_1, df_2

def JoinFrames(left:pd.DataFrame, right:pd.DataFrame, keydict:dict, drop:list=None) -> pd.DataFrame:
    
    # This function do a join operation between two dataframes
    
    print(f"Joining frames with parameters: {keydict}")
    df = left.merge(right, **keydict)
    
    if drop:
        df.drop(drop, axis=1, inplace=True)
    
    return df

The names of the tsv files are already listed; now we need the names of the xml.tsv files we created.

In [207]:
all_files_names_parsed = os.listdir(path_interpro_parsed)

# Filter only the files ending with "xml.tsv"
xml_tsv_files_names = [f for f in all_files_names_parsed if f.endswith("xml.tsv") ]
xml_tsv_files_names.sort()
xml_tsv_files_names

['GCF_000005845.2_E_coli_K12_genomic.faa.xml.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.xml.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic.faa.xml.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic.faa.xml.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic.faa.xml.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic.faa.xml.tsv']

In [208]:
tsv_files_names_results

['GCF_000005845.2_E_coli_K12_genomic.faa.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic.faa.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic.faa.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic.faa.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic.faa.tsv']

In [214]:
# Arguments for the functions and folder to save files
output_dir = "InterproFullMerge"
arguments_load_function = {"sep":"\t"}
arguments_parse_function = {"how":"left", "right_index":True, "left_index":True}
arguments_join_function = {"on":"shared_index", "how":"left"}

# Doing the same as shown above but using all organisms
for t_f, x_f in zip(tsv_files_names_results, xml_tsv_files_names):
    
    tsv_path = os.path.join(path_interpro_results, t_f)
    xml_path = os.path.join(path_interpro_parsed, x_f)
    xml_df, tsv_df = LoadFiles(xml_path, tsv_path, arguments_load_function)
    
    xml_df = ReplaceInFrame(xml_df, to_replace="NULL_TYPE")
    tsv_df = ReplaceInFrame(tsv_df, to_replace="-")
    
    xml_df = ParseMultiValuesFrames(xml_df, col_target="id_protein",
                                    sep_target="|", value_name="id_protein",
                                    keydict=arguments_parse_function)
    
    tsv_df = ParseMultiValuesFrames(tsv_df, col_target="go", 
                                    sep_target="|", value_name="id_go",
                                    keydict=arguments_parse_function)
    
    xml_df, tsv_df = CreateSharedIndex(df_1=xml_df, df_1_cols=["id_protein", "id_analysis"],
                                       df_2=tsv_df, df_2_cols=["id_protein", "id_analysis"],
                                       shared_index="shared_index")
    
    full_df = JoinFrames(tsv_df, xml_df,
                         arguments_join_function, ["shared_index","id_protein_y","id_analysis_y"])
    
    full_df.rename({"id_protein_x":"id_protein","id_analysis_x":"id_analysis"},
                   inplace=True, axis=1)
    
    # Saving frame into a file ending with ".full.tsv"
    output_file_path = os.path.join(output_dir, t_f + ".full.tsv")
    print(f"Saving frame into {output_file_path}\n")
    full_df.to_csv(output_file_path, sep="\t", header=True, index=False, na_rep="NaN")

print("All done")

Reading files into DataFrames: InterproParsedXML/GCF_000005845.2_E_coli_K12_genomic.faa.xml.tsv
                               InterproResults/GCF_000005845.2_E_coli_K12_genomic.faa.tsv
Replacing NULL_TYPE by nan
Replacing - by nan
Parsing multivalues in column id_protein
Parsing multivalues in column go
Creating shared_index = shared_index
Joining frames with parameters: {'on': 'shared_index', 'how': 'left'}
Saving frame into InterproFullMerge/GCF_000005845.2_E_coli_K12_genomic.faa.tsv.full.tsv

Reading files into DataFrames: InterproParsedXML/GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.xml.tsv
                               InterproResults/GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.tsv
Replacing NULL_TYPE by nan
Replacing - by nan
Parsing multivalues in column id_protein
Parsing multivalues in column go
Creating shared_index = shared_index
Joining frames with parameters: {'on': 'shared_index', 'how': 'left'}
Saving frame into InterproFullMerge/GCF_000006765.1_A