## <center>Preparing MAPLE output for Taxonium visualization (metadata extraction)</center>


| **Label** | **start time** | **finish time** | **last modified** |
|:--------------:|:-----------:|:-----------:|:----------------:|
|   Project 2_convert_4_Taxonium   |  2023-05-03 |  2023-05-09 |   2023-05-09     |

## TODO

- Encapsulate the metadata extraction code in a reusable function.

##### INPUT: MAPLE_support_exampleMAT_100_MATfromGivenTree_nexusTree.tree
##### OUTPUT: Metadata_4_taxonium.tsv

```python
input_file = '../A_Datas/MAPLE0.3.2_nexusTree_1000samples_simulations_repl1.tree'
output_file = 'Metadata_4_taxonium_MAPLE0.3.2_nexusTree_1000samples_simulations_repl1.tree.tsv'
```
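The output name above is the input file name with a `Metadata_4_taxonium_` prefix and a `.tsv` suffix. A minimal sketch of deriving it automatically, so the two paths cannot drift apart; `metadata_path` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path


def metadata_path(tree_path, prefix='Metadata_4_taxonium_'):
    """Build the metadata TSV name from a tree file path (hypothetical helper)."""
    # Keep only the file name, then add the prefix and the .tsv suffix
    return f"{prefix}{Path(tree_path).name}.tsv"


input_file = '../A_Datas/MAPLE0.3.2_nexusTree_1000samples_simulations_repl1.tree'
output_file = metadata_path(input_file)
print(output_file)
```

The TSV is written to the current working directory, as in the original cell.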

```python
import re
import pandas as pd
from decimal import Decimal

# Regular expressions for the fields stored in each node annotation
name_re = r'([\w/\-\.\|]+)(?:\[&|)'
support_re = r'support=([0-9.]+)'
alt_re = r'alternativePlacements=({.*?})'
mut_re = r'mutations=({.*?})'
ns_re = r'Ns=({.*?})'
country_date_re = r'(\w+)/.*\|.*?(\d{4}-\d{2}-\d{2})'
# A node: name, optional [&...] annotation, then ':' and the branch length
node_re = r'([\w/\-\.\|]+(?:\[&.*?\]|)):([0-9.e-]+)'


def first_match(pattern, text, default='NA'):
    """Return the first match of pattern in text, or default if there is none."""
    found = re.findall(pattern, text)
    return found[0] if found else default


def alternative_placements(node):
    """Return the comma-separated node names in alternativePlacements, or 'NA'."""
    found = re.findall(alt_re, node)
    return ','.join(re.findall(r'([A-Za-z_0-9]+):', found[0])) if found else 'NA'


def country_and_date(name):
    """Return (country, date) parsed from a sample name, or ('NA', 'NA')."""
    found = re.findall(country_date_re, name)
    return found[0] if found else ('NA', 'NA')


with open(input_file) as f:
    # Skip the NEXUS header up to the start of the trees block
    # (NEXUS keywords are case-insensitive)
    for line in f:
        if line.strip().lower() == 'begin trees;':
            break

    # Collect the remaining lines, stopping at a line that holds only ';'
    data = ''
    for line in f:
        if line.strip() == ';':
            break
        data += line

# Remove all whitespace and keep only the first statement (the tree)
data = re.sub(r'\s+', '', data)
data = data.split(';')[0] + ';'

# Match every 'name[&annotation]:length' item and store the pairs in a list
node_list = re.findall(node_re, data)

# Convert the list of (node, length) pairs to a DataFrame
df = pd.DataFrame(node_list, columns=['node', 'length'])

# Extract the individual fields from each node string
df['name'] = df['node'].apply(lambda x: first_match(name_re, x, default=x))
df['support'] = df['node'].apply(lambda x: first_match(support_re, x))
df['alternativePlacements'] = df['node'].apply(alternative_placements)
# Drop the ':1.0' probability attached to each mutation
df['mutations'] = df['node'].apply(lambda x: re.sub(r':1\.0', '', first_match(mut_re, x)))
df['Ns'] = df['node'].apply(lambda x: first_match(ns_re, x))
# Write branch lengths in fixed-point notation (no exponent)
df['length'] = df['length'].apply(lambda x: format(Decimal(x), 'f'))

# Extract country and date from the sample name
df['country'] = df['name'].apply(lambda x: country_and_date(x)[0])
df['date'] = df['name'].apply(lambda x: country_and_date(x)[1])

# Rearrange columns
df = df[['name', 'country', 'date', 'support', 'alternativePlacements', 'mutations', 'Ns', 'length']]

print(df.head())
df.to_csv(output_file, index=False, sep='\t')
```
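One possible way to address the "Encapsulate" TODO is to move the per-node field extraction into a single function that needs only the standard library. The sketch below reuses the regular expressions from the cell above; `parse_node` and the example annotation string are illustrative assumptions, not part of the original notebook or of MAPLE.

```python
import re

NAME_RE = re.compile(r'([\w/\-\.\|]+)(?:\[&|)')
SUPPORT_RE = re.compile(r'support=([0-9.]+)')
ALT_RE = re.compile(r'alternativePlacements=({.*?})')
MUT_RE = re.compile(r'mutations=({.*?})')
NS_RE = re.compile(r'Ns=({.*?})')
COUNTRY_DATE_RE = re.compile(r'(\w+)/.*\|.*?(\d{4}-\d{2}-\d{2})')


def parse_node(node, length):
    """Turn one 'name[&annotation]' string and its branch length into a row dict."""
    def first(pattern, default='NA'):
        m = pattern.search(node)
        return m.group(1) if m else default

    name = first(NAME_RE, default=node)
    alt = ALT_RE.search(node)
    country_date = COUNTRY_DATE_RE.search(name)
    return {
        'name': name,
        'country': country_date.group(1) if country_date else 'NA',
        'date': country_date.group(2) if country_date else 'NA',
        'support': first(SUPPORT_RE),
        # Keep only the node names of the alternative placements
        'alternativePlacements': (
            ','.join(re.findall(r'([A-Za-z_0-9]+):', alt.group(1))) if alt else 'NA'
        ),
        # Drop the ':1.0' probability attached to each mutation
        'mutations': re.sub(r':1\.0', '', first(MUT_RE)),
        'Ns': first(NS_RE),
        'length': length,
    }


# Made-up annotation string in the style of the tree file (assumption)
example = ('Wales/PHWC-26E0A/2020|2020-03-29'
           '[&support=0.85,alternativePlacements={in1:0.15},'
           'mutations={T2350C:1.0,G2409T:1.0},Ns={}]')
row = parse_node(example, '0.00016')
print(row)
```

With such a function, the DataFrame could be built in one step, for example `pd.DataFrame([parse_node(n, l) for n, l in node_list])`.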

| Unnamed: 0 | name | country | date | support | alternativePlacements | mutations | Ns | length |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | Scotland/CVR65/2020\|2020-03-06 | Scotland | 2020-03-06 |  |  |  |  | 0.0 |
| 1 | Wales/PHWC-26E0A/2020\|2020-03-29 | Wales | 2020-03-29 | 0.8546038249510233 | in1 | {T2350C,G2409T,T3834C,G20083A,T29068C} | {} | 0.0001645498048891425 |
| 2 | in961 |  |  |  |  |  |  | 0.0 |
| 3 | WIV06\|GWHABKN00000001\|2019-12-30 |  |  |  |  |  |  | 0.0 |
| 4 | in960 |  |  |  |  |  |  | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1986 | G24626T |  |  |  |  |  |  | 0.9958117759030256 |
| 1987 | T26613C |  |  |  |  |  |  | 0.4145513539963248 |
| 1988 | A27821C |  |  |  |  |  |  | 0.6282737569595008 |
| 1989 | G29755A |  |  |  |  |  |  | 0.8872991979636166 |
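The last rows of the output (`G24626T`, `T26613C`, ...) have the form of single-nucleotide mutations rather than sample or internal-node names. This suggests, though the output alone cannot confirm it, that the node regular expression also matched `mutation:probability` pairs inside an annotation. Assuming that reading is right, a minimal sketch for flagging such rows before the file is passed to Taxonium; `is_mutation_label` is a hypothetical helper:

```python
import re

# Assumption: a name such as 'G24626T' (base, position, base) is a mutation
# label captured from an annotation, not a node of the tree.
MUTATION_NAME_RE = re.compile(r'[ACGT]\d+[ACGT]')


def is_mutation_label(name):
    """Return True if the whole name looks like a single-nucleotide mutation."""
    return MUTATION_NAME_RE.fullmatch(name) is not None


# Names taken from the output shown above
names = [
    'Scotland/CVR65/2020|2020-03-06',
    'in961',
    'WIV06|GWHABKN00000001|2019-12-30',
    'G24626T',
    'T26613C',
]
kept = [n for n in names if not is_mutation_label(n)]
print(kept)
```

On the DataFrame, the same check could be applied with `df = df[~df['name'].map(is_mutation_label)]`. A sample whose name happens to have this exact form would also be dropped, so the filtered rows are worth inspecting first.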
