## <center>Preparing MAPLE output for Taxonium visualization (metadata extraction)</center>


| **Label** | **start time** | **finish time** | **last modified** |
|:--------------:|:-----------:|:-----------:|:----------------:|
|   Project 2_convert_4_Taxonium   |  2023-05-03 |  2023-05-09 |   2023-05-09     |

## TODO

- Encapsulate the extraction steps below in a reusable function

##### INPUT: MAPLE_support_exampleMAT_100_MATfromGivenTree_nexusTree.tree
##### OUTPUT: Metadata_4_taxonium.tsv
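
For orientation, here is a minimal sketch of how the tree string could be pulled out of NEXUS text that is already in memory. The sample content, the helper name `extract_newick`, and the `[&R]` prefix are assumptions made for illustration; they are modeled on the fields parsed in this notebook and are not copied from the real MAPLE file.

```python
import re

# Hypothetical, heavily shortened NEXUS content (the real file is much larger).
SAMPLE_NEXUS = """#NEXUS
begin trees;
    tree tree1 = [&R] (A[&support=1.0,mutations={C8782T:1.0},Ns={1-17}]:0.0,B:1e-05)in1[&support=1.0];
end;
"""

def extract_newick(nexus_text):
    """Return the first tree string (ending in ';') found after 'begin trees;'."""
    # Case-insensitive search, so 'Begin Trees;' would be accepted as well.
    block = re.search(r'begin\s+trees\s*;(.*)', nexus_text,
                      flags=re.IGNORECASE | re.DOTALL)
    if block is None:
        raise ValueError("no 'begin trees;' block found")
    body = block.group(1)
    # Keep the text from the first '(' up to and including the next ';'.
    # This assumes that no annotation contains a ';'.
    start = body.index('(')
    end = body.index(';', start)
    # Remove any whitespace left inside the tree string.
    return re.sub(r'\s+', '', body[start:end + 1])

newick = extract_newick(SAMPLE_NEXUS)
print(newick)
```

Starting at the first `(` leaves out the `tree tree1 = [&R]` prefix, so only the parenthesized tree reaches the node-matching step.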

In [3]:
input_file = '/nfs/research/goldman/zihao/code/EBI_INTER/A_Datas/MAPLE_support_exampleMAT_100_MATfromGivenTree_nexusTree.tree'
output_file = 'Metadata_4_taxonium_MAPLE0.3.2_nexusTree_1000samples_simulations_repl1.tree.tsv'

In [4]:
import re
import pandas as pd
from decimal import Decimal

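# Regular expressions for the fields carried by each annotated node
name_re = r'([\w/\-\.\|]+)(?:\[&|)'  # node name, up to an optional "[&" annotation
support_re = r'support=([0-9.]+)'  # support value
alt_re = r'alternativePlacements=({.*?})'  # alternative placements, "{name:value,...}"
mut_re = r'mutations=({.*?})'  # mutations, "{mutation:value,...}"
ns_re = r'Ns=({.*?})'  # position ranges listed under Ns
length_re = r':([0-9.e-]+)'  # branch length (defined but not used below)
country_date_re = r'(\w+)/.*\|.*?(\d{4}-\d{2}-\d{2})'  # country and date from names like "Country/...|...|YYYY-MM-DD"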

with open(input_file) as f:
    # Skip the NEXUS header up to the start of the trees block
    for line in f:
        if line.strip().lower() == 'begin trees;':
            break

    # Collect the remaining lines of the trees block
    data = ''
    for line in f:
        if line.strip() == ';':
            break
        data += line

# Strip all whitespace, then keep only the text up to the first ';' (the first tree)
data = re.sub(r'\s+', '', data)
data = data.split(';')[0] + ';'

# Match every "name[&annotation]:length" token (the annotation part is optional)
node_list = re.findall(r'([\w/\-\.\|]+(?:\[&.*?\]|)):([0-9.e-]+)', data)

# Convert node information list to a DataFrame
df = pd.DataFrame(node_list, columns=['node', 'length'])

# Use regular expressions to extract the fields of node information
df['name'] = df['node'].apply(lambda x: re.findall(name_re, x)[0] if re.findall(name_re, x) else x)
df['support'] = df['node'].apply(lambda x: re.findall(support_re, x))
df['support'] = df['support'].apply(lambda x: x[0] if len(x) > 0 else 'NA')
# Keep only the names of the alternative placements (an empty "{}" yields an empty string)
df['alternativePlacements'] = df['node'].apply(lambda x: re.findall(alt_re, x))
df['alternativePlacements'] = df['alternativePlacements'].apply(lambda x: ','.join(re.findall(r'([A-Za-z_0-9]+):', x[0])) if len(x) > 0 and '{' in x[0] else 'NA')
# Drop the ":1.0" value attached to each mutation
df['mutations'] = df['node'].apply(lambda x: re.findall(mut_re, x))
df['mutations'] = df['mutations'].apply(lambda x: re.sub(r':1\.0', '', x[0]) if len(x) > 0 else 'NA')
df['Ns'] = df['node'].apply(lambda x: re.findall(ns_re, x))
df['Ns'] = df['Ns'].apply(lambda x: x[0] if len(x) > 0 else 'NA')
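# Write branch lengths in plain decimal notation; str(Decimal(x)) would still use an exponent for values below 1e-6
df['length'] = df['length'].apply(lambda x: format(Decimal(x), 'f'))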

# Extract country and date from the name in a single pass; 'NA' when the name does not match
country_date = df['name'].str.extract(country_date_re)
df['country'] = country_date[0].fillna('NA')
df['date'] = country_date[1].fillna('NA')

# Rearrange columns
df = df[['name', 'country', 'date', 'support', 'alternativePlacements', 'mutations', 'Ns', 'length']]

# Preview the table, then write it as tab-separated values
print(df.head())
df.to_csv(output_file, index=False, sep='\t')

             name country date support alternativePlacements  \
0  EPI_ISL_482423      NA   NA      NA                    NA   
1  EPI_ISL_776270      NA   NA     1.0                         
2  EPI_ISL_498658      NA   NA     1.0                         
3            in99      NA   NA     1.0                         
4  EPI_ISL_418243      NA   NA     1.0                         

                                           mutations  \
0                                                 NA   
1                                  {G22468T,G28878A}   
2  {A8081G,C17747T,A17858G,C18060T,T23287C,A24694...   
3                                   {C8782T,T28144C}   
4                                  {C23707T,T27384C}   

                               Ns                  length  
0                              NA                     0.0  
1              {1-55,29838-29891}  0.00006713773603169003  
2  {1-17,19293-19551,29870-29870}  0.00023634542535825445  
3              {1-17,29870-29870}  0.0
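
The "Encapsulate" item in the TODO list could be approached along the following lines. This is a sketch using only the standard library, not the notebook's actual code: the names `extract_metadata`, `_field`, and `NODE_RE` and the sample tree string are hypothetical. The length pattern is also slightly wider than the one in the cell above (it accepts `E` and `+`).

```python
import re
from decimal import Decimal

# One node: a name, an optional [&...] annotation, then ':' and a branch length.
NODE_RE = re.compile(r'([\w/\-.|]+)(\[&.*?\])?:([0-9.eE+-]+)')
COUNTRY_DATE_RE = re.compile(r'(\w+)/.*\|.*?(\d{4}-\d{2}-\d{2})')

def _field(pattern, text):
    """Return the first capture of pattern in text, or 'NA' if it is absent."""
    m = re.search(pattern, text)
    return m.group(1) if m else 'NA'

def extract_metadata(newick):
    """Return one dict per node that has both a name and a branch length."""
    rows = []
    for name, annot, length in NODE_RE.findall(newick):
        alt = _field(r'alternativePlacements=({.*?})', annot)
        mut = _field(r'mutations=({.*?})', annot)
        cd = COUNTRY_DATE_RE.search(name)
        rows.append({
            'name': name,
            'country': cd.group(1) if cd else 'NA',
            'date': cd.group(2) if cd else 'NA',
            'support': _field(r'support=([0-9.]+)', annot),
            # Keep only the node names of the alternative placements.
            'alternativePlacements': 'NA' if alt == 'NA'
                else ','.join(re.findall(r'(\w+):', alt)),
            # Drop the ':1.0' value attached to each mutation.
            'mutations': 'NA' if mut == 'NA' else mut.replace(':1.0', ''),
            'Ns': _field(r'Ns=({.*?})', annot),
            # 'f' formatting avoids exponent notation for very small lengths.
            'length': format(Decimal(length), 'f'),
        })
    return rows

# Hypothetical three-node tree, used only to exercise the function.
sample = ('(A[&support=1.0,alternativePlacements={in5:0.2,in7:0.1},'
          'mutations={C8782T:1.0,T28144C:1.0},Ns={1-17}]:1e-07,'
          'England/X1/2020|EPI_ISL_1|2020-03-01:0.0)in1[&support=1.0]:0.0;')
rows = extract_metadata(sample)
for row in rows:
    print(row['name'], row['support'], row['length'])
```

The returned list of dicts can be passed to `pd.DataFrame(rows)` and written with `to_csv(..., sep='\t', index=False)`, as in the cell above. Returning plain rows keeps the parsing step testable without reading a file.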