## <center>Taxonium visualization preparation of MAPLE output (extraction of metadata)</center>


| **Label** | **start time** | **finish time** | **last modified** |
|:--------------:|:-----------:|:-----------:|:----------------:|
|   Project 2   |  2023-05-03 |  2023-05-04 |   2023-05-04     |

## TODO
- Slice according to tree1, tree2
- Encapsulate
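
One possible shape for the "Encapsulate" item is sketched below. It wraps the node-matching and field-extraction steps of the notebook cell in a single function, using the same regular expressions. The function name `parse_nodes`, the constant names, and the two-tip example tree are illustrative assumptions, not part of the original notebook, and only a subset of the fields is extracted to keep the sketch short.

```python
import re
from decimal import Decimal

# Same patterns as the notebook cell; the names below are illustrative.
NODE_RE = re.compile(r'([\w/\-\.\|]+\[&.*?\]):([0-9.e-]+)')
NAME_RE = re.compile(r'([\w/\-\.\|]+)\[&')
SUPPORT_RE = re.compile(r'support=([0-9.]+)')
COUNTRY_DATE_RE = re.compile(r'(\w+)/.*\|(\d{4}-\d{2}-\d{2})')


def parse_nodes(newick):
    """Return one dict per annotated node in a whitespace-free Newick string."""
    records = []
    for node, length in NODE_RE.findall(newick):
        name_match = NAME_RE.search(node)
        name = name_match.group(1) if name_match else node
        support_match = SUPPORT_RE.search(node)
        cd_match = COUNTRY_DATE_RE.search(name)
        records.append({
            'name': name,
            'country': cd_match.group(1) if cd_match else 'NA',
            'date': cd_match.group(2) if cd_match else 'NA',
            'support': support_match.group(1) if support_match else 'NA',
            # Decimal expands scientific notation such as 3.2e-05
            'length': str(Decimal(length)),
        })
    return records


# Made-up two-tip tree, used only to exercise the function
example = ('(Wales/X1/2020|2020-03-29[&support=0.85]:0.0001,'
           'Scotland/X2/2020|2020-03-31[&support=1.0]:3.2e-05)'
           'in1[&support=1.0]:0.0;')
records = parse_nodes(example)
```

Passing `records` to `pd.DataFrame(records)` would give a table with the same column layout as a subset of the one built in the cell below.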

##### INPUT: MAPLE_support_exampleMAT_100_MATfromGivenTree_nexusTree.tree
##### OUTPUT: Metadata_4_taxonium.tsv

In [68]:
import re
import pandas as pd
from decimal import Decimal

# Define regular expressions
name_re = r'([\w/\-\.\|]+)\[&'
support_re = r'support=([0-9.]+)'
alt_re = r'alternativePlacements=({.*?})'
mut_re = r'mutations=({.*?})'
ns_re = r'Ns=({.*?})'
length_re = r':([0-9.e-]+)'
country_date_re = r'(\w+)/.*\|(\d{4}-\d{2}-\d{2})'

with open('/homes/zihao/EBI_INTER/A_Datas/MAPLE0.3.2_nexusTree_1000samples_simulations_repl1.tree') as f:
    for line in f:
        if line.strip() == 'begin trees;':
            break

    # Read data lines
    data = ''
    for line in f:
        if line.strip() == ';':
            break
        data += line

# Remove all whitespace so the tree is one continuous string
data = re.sub(r'\s+', '', data)
# Keep only the first tree statement (everything up to the first ';')
data = data.split(';')[0] + ';'

# Match all node information and store it in a list
node_list = re.findall(r'([\w/\-\.\|]+\[&.*?\]):([0-9.e-]+)', data)

# Convert node information list to a DataFrame
df = pd.DataFrame(node_list, columns=['node', 'length'])

# Use regular expressions to extract the fields of node information
df['name'] = df['node'].apply(lambda x: re.findall(name_re, x)[0] if re.findall(name_re, x) else x)
df['support'] = df['node'].apply(lambda x: re.findall(support_re, x))
df['support'] = df['support'].apply(lambda x: x[0] if len(x) > 0 else 'NA')
df['alternativePlacements'] = df['node'].apply(lambda x: re.findall(alt_re, x))
df['alternativePlacements'] = df['alternativePlacements'].apply(lambda x: ','.join(re.findall(r'([A-Za-z_0-9]+):', x[0])) if len(x) > 0 and '{' in x[0] else 'NA')
df['mutations'] = df['node'].apply(lambda x: re.findall(mut_re, x))
df['mutations'] = df['mutations'].apply(lambda x: re.sub(r':1\.0', '', x[0]) if len(x) > 0 else 'NA')
df['Ns'] = df['node'].apply(lambda x: re.findall(ns_re, x))
df['Ns'] = df['Ns'].apply(lambda x: x[0] if len(x) > 0 else 'NA')
df['length'] = df['length'].apply(lambda x: str(Decimal(x)))

# Extract country and date
df['country'] = df['name'].apply(lambda x: re.findall(country_date_re, x)[0][0] if re.findall(country_date_re, x) else 'NA')
df['date'] = df['name'].apply(lambda x: re.findall(country_date_re, x)[0][1] if re.findall(country_date_re, x) else 'NA')

# Rearrange columns
df = df[['name', 'country', 'date', 'support', 'alternativePlacements', 'mutations', 'Ns', 'length']]

df

Unnamed: 0,name,country,date,support,alternativePlacements,mutations,Ns,length
0,Wales/PHWC-26E0A/2020|2020-03-29,Wales,2020-03-29,0.8546038249510233,in1,"{T2350C,G2409T,T3834C,G20083A,T29068C}",{},0.0001645498048891425
1,Scotland/CVR1339/2020|2020-03-31,Scotland,2020-03-31,1.0,,{G23568T},"{1-39,29837-29879}",0.000032907579649191664
2,USA/CO-CDPHE-2004230072/2020|OK557358.1|2020-0...,USA,2020-04-21,1.0,,"{G4807A,G21848T}","{1-54,29837-29879}",0.00006636060630025907
3,USA/TG547366/2020|MZ906875.1|2020-07-07,USA,2020-07-07,0.9998932453172774,,"{C13501T,G19204T,T19281C,A19761T,C21207T,G2340...","{1-38,7299-7299,11288-11296,21765-21770,21991-...",0.00023486416835291933
4,England/QEUH-9F39B1/2020|OA992969.1|2020-09-24,England,2020-09-24,0.9997957717116446,,"{C794T,G7693T,G8872T,G9116A,C12823T,G13651T,C1...","{1-54,29768-29879}",0.00033656734387543433
...,...,...,...,...,...,...,...,...
1153,England/PHEC-1BCA4/2020|2020-03-03,England,2020-03-03,0.9896527437388796,in2,"{G10877T,C13181T,G16897T,G18674A,G25540A}","{1-21,29832-29858}",0.00015704254582052127
1154,England/PRIN-254BB58/2020|2020-04-20,England,2020-04-20,0.9896252899842426,in2,"{G11774T,G13771T,C25678T}","{1-54,10738-10873,10888-11022,11288-11296,2176...",0.00009938909399086092
1155,in4,,,0.8553865516037018,in1,{A16044G},"{1-21,29837-29858}",0.00003932770412099634
1156,in2,,,1.0,,"{T8648G:0.004183417557926692,A13193G:0.1126450...",{},0.000049395613418929594


In [69]:
df[df['name']=='Scotland/CVR10244/2021|2021-08-26']

Unnamed: 0,name,country,date,support,alternativePlacements,mutations,Ns,length
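
The exact-match lookup above returns no rows. When the full node name is uncertain, a substring search on the sample identifier can help locate near matches. The sketch below uses a small stand-in DataFrame (`names_demo`, a hypothetical name with made-up rows) rather than the real `df`; `regex=False` is set because node names contain `|`, which would otherwise be read as regex alternation.

```python
import pandas as pd

# Stand-in for the real table; rows are illustrative only
names_demo = pd.DataFrame({
    'name': ['Scotland/CVR1339/2020|2020-03-31', 'in4'],
})

# Literal substring match on the sample identifier
hits = names_demo[names_demo['name'].str.contains('CVR1339', regex=False)]
```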


In [65]:
df.to_csv('Metadata_4_taxonium_repeat1_1000samples_simulationsBranchSupportSubsamples_MAT.tsv',index=False,sep='\t')
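
A point to keep in mind when this TSV is read back with pandas: by default `pd.read_csv` treats the literal string `NA` as a missing value, so the `'NA'` placeholders written above come back as `NaN`. The sketch below shows the difference on a small made-up table written to an in-memory buffer; `df_demo` and the buffer are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the exported table (values are made up)
df_demo = pd.DataFrame({'name': ['in2', 'tipA'], 'country': ['NA', 'Wales']})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, sep='\t')

# Default parsing turns the string 'NA' into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer, sep='\t')

# keep_default_na=False keeps 'NA' as a literal string
buffer.seek(0)
literal_read = pd.read_csv(buffer, sep='\t', keep_default_na=False)
```

Whether the placeholder should survive as text or become a true missing value depends on how the downstream tool handles empty metadata fields.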