# Pseudolabels analysis for *Escherichia coli* data

This notebook provides a preliminary analysis of the pseudolabels determined for the 54 *E. coli* samples to be used in the GraSSPlas project.

## Obtaining pseudolabels

In this analysis, we focus on two types of pseudolabels: plasmid and chromosome. Plasmid labels were obtained by mapping short-read contigs against known plasmid gene databases. Chromosome labels, on the other hand, were obtained from contig length and read depth thresholds.

The databases required for the plasmid pseudolabels were downloaded manually on 14/05/2024 by following the instructions at https://github.com/phac-nml/mob-suite. From the downloaded zip archive, we use four FASTA files, each containing a set of genes or proteins found in plasmids. The two gene databases hold replicon DNA sequences (rep.dna.fas) and origin of transfer DNA sequences (orit.dna.fas). The two protein databases hold mobility proteins (mob.proteins.faa) and mate-pair formation proteins (mpf.proteins.faa). We use the Basic Local Alignment Search Tool (BLAST) to map the genes and proteins from these databases onto the short-read contigs. A contig is assigned the plasmid label if it is at least 500 bp long and the mapping between a gene or protein and the contig meets the quality threshold (0.8).
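The plasmid rule described above can be sketched as a filter over mapping hits. This is a minimal illustration, not the project's labelling code: the column names `contig`, `ctg_len`, and `score` are hypothetical, and `score` stands in for whichever mapping-quality measure the 0.8 threshold applies to, which the text does not specify.

```python
import pandas as pd

MIN_PLASMID_CTG_LEN = 500  # bp, from the text
MIN_MAPPING_SCORE = 0.8    # threshold from the text; the underlying metric is an assumption

def label_plasmid(hits: pd.DataFrame) -> pd.Series:
    """Return one boolean per contig: True if any hit passes both thresholds."""
    passes = (hits['ctg_len'] >= MIN_PLASMID_CTG_LEN) & (hits['score'] >= MIN_MAPPING_SCORE)
    return passes.groupby(hits['contig']).any()

# Toy hits table (made-up values, hypothetical column names)
hits = pd.DataFrame({
    'contig': ['c1', 'c1', 'c2', 'c3'],
    'ctg_len': [1200, 1200, 450, 3000],
    'score': [0.95, 0.60, 0.99, 0.75],
})
plasmid_labels = label_plasmid(hits)
print(plasmid_labels)
```

In this toy table only `c1` is labelled: `c2` is shorter than 500 bp and the single hit on `c3` falls below the threshold.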

To assign the chromosome label, the mean ($m$) and standard deviation ($sd$) of the normalized read depths are obtained from the sample GFA files. A contig is labelled chromosome if it is at least 200,000 bp long and has a read depth of at *most* $m + 2sd$.
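The chromosome rule can be sketched in the same way. The column names `length` and `depth` and the toy values are hypothetical, and the text does not say how depths are normalized or whether $sd$ is the sample or population standard deviation; the sketch assumes pandas' default sample standard deviation (`ddof=1`).

```python
import pandas as pd

MIN_CHR_CTG_LEN = 200_000  # bp, from the text
MAX_SD_ABOVE_MEAN = 2      # depth cutoff is m + 2*sd, from the text

def label_chromosome(contigs: pd.DataFrame) -> pd.Series:
    """Return one boolean per contig: long enough and depth at most m + 2*sd."""
    m = contigs['depth'].mean()
    sd = contigs['depth'].std()  # sample sd (ddof=1); an assumption
    depth_ok = contigs['depth'] <= m + MAX_SD_ABOVE_MEAN * sd
    return (contigs['length'] >= MIN_CHR_CTG_LEN) & depth_ok

# Toy contigs (made-up values): nine at depth 1.0 and one long, high-depth contig
contigs = pd.DataFrame(
    {'length': [900_000, 600_000, 300_000, 250_000, 150_000,
                80_000, 40_000, 10_000, 5_000, 300_000],
     'depth': [1.0] * 9 + [10.0]},
    index=[f'ctg{i}' for i in range(1, 11)],
)
chr_labels = label_chromosome(contigs)
print(chr_labels)
```

Here `ctg1` to `ctg4` are labelled, `ctg5` to `ctg9` are too short, and `ctg10` is long enough but its depth exceeds $m + 2sd$.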

In [9]:
import os
import pandas as pd

In [8]:
PLSNESS_DIR = '../data/plasmidness'
LABELS_DIR = '../data/pseudolabels'

In [14]:
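# Count plasmid- and chromosome-labelled contigs for each sample
labels_dict = {}
for sample in os.listdir(LABELS_DIR):
    labels_file = os.path.join(LABELS_DIR, sample, 'pseudolabels.tsv')
    sample_df = pd.read_csv(labels_file, sep='\t')
    # Summing a label column gives the number of contigs carrying that label
    pls_ctgs = sample_df['plasmid'].sum()
    chr_ctgs = sample_df['chromosome'].sum()
    labels_dict[sample] = {'plasmids': pls_ctgs, 'chromosomes': chr_ctgs}
# One row per sample, one column per label type
labels_df = pd.DataFrame.from_dict(labels_dict, orient='index')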

In [22]:
labels_df.sum()

plasmids       229
chromosomes    376
dtype: int64

In [15]:
labels_df

| | plasmids | chromosomes |
|---|---|---|
| EC_E13DN_1_E | 3 | 7 |
| EC_0038_3S2_D | 3 | 8 |
| EC_0205_3S1_D | 6 | 9 |
| EC_9619_1H1_D | 1 | 4 |
| EC_9226_1H3_D | 4 | 9 |
| EC_E12F_2_E | 1 | 6 |
| EC_9503_1S2_D | 4 | 10 |
| EC_6245_2L1_D | 5 | 9 |
| EC_6245_1H1_D | 5 | 9 |
| EC_6245_C4_H | 5 | 4 |


In [17]:
# Record, per contig, which of the four plasmid databases produced a mapping
pls_labels = {}
mappings = ['matepair', 'mobility', 'origintransfer', 'replicon']
for sample in os.listdir(PLSNESS_DIR):
    for mapping in mappings:
        mapping_file = os.path.join(PLSNESS_DIR, sample, mapping + '_mapping.tsv')
        # Skip empty files (no mappings of this type for the sample)
        if os.path.getsize(mapping_file) > 0:
            mapping_df = pd.read_csv(mapping_file, sep='\t')
            for _, row in mapping_df.iterrows():
                # Column 0 is read as the contig ID and column 3 as the contig length;
                # .iloc avoids the deprecated positional row[0] lookup
                sample_ctg_id, ctg_len = sample + '_' + str(row.iloc[0]), int(row.iloc[3])
                if sample_ctg_id not in pls_labels:
                    pls_labels[sample_ctg_id] = {'length': 0, 'matepair': 0,
                                                 'mobility': 0, 'origintransfer': 0, 'replicon': 0}
                pls_labels[sample_ctg_id]['length'] = ctg_len
                pls_labels[sample_ctg_id][mapping] = 1
pls_labels_df = pd.DataFrame.from_dict(pls_labels, orient='index')

In [18]:
pls_labels_df

| | length | matepair | mobility | origintransfer | replicon |
|---|---|---|---|---|---|
| EC_E13DN_1_E_35 | 38992 | 1 | 1 | 1 | 1 |
| EC_E13DN_1_E_38 | 23307 | 0 | 0 | 0 | 1 |
| EC_E13DN_1_E_48 | 12343 | 0 | 0 | 0 | 1 |
| EC_0038_3S2_D_19 | 89983 | 1 | 1 | 1 | 1 |
| EC_0038_3S2_D_38 | 6995 | 0 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| EC_0012_C1_H_65 | 10303 | 1 | 0 | 1 | 0 |
| EC_0012_C1_H_77 | 5569 | 0 | 1 | 1 | 1 |
| EC_0012_C1_H_99 | 2235 | 0 | 0 | 0 | 1 |
| EC_4957_C1_H_21 | 79861 | 1 | 1 | 1 | 1 |


In [20]:
min_len = pls_labels_df['length'].min()
max_len = pls_labels_df['length'].max()
mean_len = pls_labels_df['length'].mean()
print("Minimum length of plasmid labelled contig: ", min_len)
print("Maximum length of plasmid labelled contig: ", max_len)
print("Average length of plasmid labelled contig: ", mean_len)

Minimum length of plasmid labelled contig:  502
Maximum length of plasmid labelled contig:  366504
Average length of plasmid labelled contig:  27919.329268292684


In [21]:
pls_labels_df[['matepair', 'mobility', 'origintransfer', 'replicon']].sum()

matepair           91
mobility          103
origintransfer    110
replicon          194
dtype: int64
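
The column sums above count each evidence type separately. A possible follow-up, sketched here as a suggestion rather than as part of the original analysis, is to count how many evidence types each contig carries. The example rebuilds only the first five rows of the table shown earlier, so the counts are illustrative and do not describe the full dataset; the names `example_df`, `n_evidence`, and `evidence_counts` are introduced for this sketch.

```python
import pandas as pd

evidence_cols = ['matepair', 'mobility', 'origintransfer', 'replicon']

# First five rows of the table displayed above, rebuilt by hand for illustration
example_df = pd.DataFrame(
    {'length': [38992, 23307, 12343, 89983, 6995],
     'matepair': [1, 0, 0, 1, 0],
     'mobility': [1, 0, 0, 1, 1],
     'origintransfer': [1, 0, 0, 1, 1],
     'replicon': [1, 1, 1, 1, 1]},
    index=['EC_E13DN_1_E_35', 'EC_E13DN_1_E_38', 'EC_E13DN_1_E_48',
           'EC_0038_3S2_D_19', 'EC_0038_3S2_D_38'],
)

# Number of evidence types per contig, then how many contigs have each count
n_evidence = example_df[evidence_cols].sum(axis=1)
evidence_counts = n_evidence.value_counts().sort_index()
print(evidence_counts)
```

Applied to the full `pls_labels_df`, the same two lines would show how many contigs are supported by a single database versus several.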