# 01_Ingest_data

The summary statistics sheet of Maffei.xlsx reports which phage causes lysis in which host strain. Each row represents a phage and each column a host. The classification is binary (1 = lysis, 0 = no lysis). For this project, an infection counts as successful only when there is lysis.

### Import libraries

In [1]:
import pandas as pd
import os

# Import user defined libraries
import sys
sys.path.append("../src")
from fastaLoad import fasta

## 1. Maffei
Let's load the data from Maffei. The Excel sheet is not clean for import: some rows and columns need to be skipped, and some headers need to be renamed.\
We only care about the phage identifiers and the lysis data, so we keep only those columns.\
The resulting maffei dataset has four leading columns identifying the virus: bas, family, genus and phage.\
The remaining columns are host strains and contain binary lysis values (1 = lysis, 0 = no lysis).

In [2]:
# Maffei raw data
RAW_PATH = os.path.join('..', 'data', 'raw')

maffei = pd.read_excel(os.path.join(RAW_PATH, 'Maffei.xlsx'), sheet_name='summary statistics', skiprows=1, header=[0,1,2])
maffei.drop(columns=maffei.columns[0], inplace=True)

# Store the first level of the first 4 headers in a list
header = [col[0] for col in maffei.columns.tolist()[:4]]
header.append('lysis observed (yes=1, no=0)')

# Keep only selected headers
maffei = maffei.loc[:, header]

# Drop the columns at odd positions 1, 3, 5 and 7
maffei.drop(columns=maffei.columns[1:9:2], inplace=True)

# Remove the first two rows of headers
maffei.columns = maffei.columns.droplevel([0,1])

# Substitute the first 4 headers
maffei = maffei.set_axis(header[:4] + maffei.columns[4:].tolist(), axis=1)

# Drop the last column (which was kept for an unknown reason)
maffei = maffei.iloc[:, :10]

# Rename columns
change_names = {'ICTV family (subfamily)': 'family',
                'ICTV genus': 'genus',
                'Bas#': 'bas',
                'Phage (name)': 'phage'}
maffei.rename(columns=change_names, inplace=True)
del change_names
maffei.columns = maffei.columns.str.replace(' ', '_')
maffei.columns = maffei.columns.str.replace('.', '', regex=False)  # literal dot, not the regex wildcard

# Move bas column to first position
bas = maffei.pop('bas')
maffei.insert(0, 'bas', bas)
del bas

# Define schema
header = maffei.columns[:4].tolist()
maffei[header] = maffei[header].astype('string')
maffei.iloc[:, 4:] = maffei.iloc[:, 4:].astype('int64') # Could be bool or categorical, but int64 is more convenient for machine learning purposes
del header

display(maffei)
maffei.dtypes


Unnamed: 0,bas,family,genus,phage,E_coli_UTI89,E_coli_CFT073,E_coli_55989,S_e_Typhimurium_12023s,S_e_Typhimurium_SL1344,E_coli_B_REL606
0,Bas01,Drexlerviridae; Braunvirinae,Guelphvirus,Escherichia phage AugustePiccard,0,0,0,0,0,1
1,Bas02,Drexlerviridae; Braunvirinae,Rtpvirus,Escherichia phage JeanPiccard,0,0,0,0,0,1
2,Bas03,Drexlerviridae; Braunvirinae,Rtpvirus,Escherichia phage JulesPiccard,0,0,0,0,0,1
3,Bas04,Drexlerviridae; Tempevirinae,Warwickvirus,Escherichia phage FritzSarasin,0,1,0,0,0,1
4,Bas05,Drexlerviridae; Tempevirinae,Warwickvirus,Escherichia phage PeterMerian,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
73,phage N4,Schitoviridae; Enquatrovirinae,Enquatrovirus,n.a.,1,0,1,0,0,0
74,Bas69,Schitoviridae; Enquatrovirinae,Enquatrovirus,Escherichia phage AlfredRasser,1,0,1,0,0,0
75,lambdavir,Siphoviridae,Lambdavirus,n.a.,0,0,0,0,0,1
76,P1vir,Myoviridae,Punavirus,n.a.,0,0,0,0,0,1


bas                       string[python]
family                    string[python]
genus                     string[python]
phage                     string[python]
E_coli_UTI89                       int64
E_coli_CFT073                      int64
E_coli_55989                       int64
S_e_Typhimurium_12023s             int64
S_e_Typhimurium_SL1344             int64
E_coli_B_REL606                    int64
dtype: object

There are 78 entries.\
Notice that the last few entries of phage are n.a.
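As a rough illustration of how the wide lysis matrix can be checked and reshaped, the sketch below uses a two-row toy frame (`toy`, with values copied from the Bas01 and Bas04 rows shown above, restricted to three hosts). The names `toy`, `id_cols`, `host_cols`, `long` and `hosts_lysed` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned maffei table (two phages, three hosts)
toy = pd.DataFrame({
    'bas': ['Bas01', 'Bas04'],
    'family': ['Drexlerviridae; Braunvirinae', 'Drexlerviridae; Tempevirinae'],
    'genus': ['Guelphvirus', 'Warwickvirus'],
    'phage': ['Escherichia phage AugustePiccard', 'Escherichia phage FritzSarasin'],
    'E_coli_UTI89': [0, 0],
    'E_coli_CFT073': [0, 1],
    'E_coli_B_REL606': [1, 1],
})

id_cols = ['bas', 'family', 'genus', 'phage']
host_cols = [c for c in toy.columns if c not in id_cols]

# Sanity check: every lysis value is either 0 or 1
assert toy[host_cols].isin([0, 1]).all().all()

# Long format: one row per (phage, host) pair
long = toy.melt(id_vars=id_cols, value_vars=host_cols,
                var_name='host', value_name='lysis')

# Number of hosts lysed by each phage
hosts_lysed = toy.set_index('bas')[host_cols].sum(axis=1)
```

The long format is often easier to feed into a model that takes a phage-host pair as input, while the wide format stays convenient for inspection.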

## 2. Basel receptors
Basel receptors describes the receptors of each phage in Maffei's paper.

In [3]:
receptors = pd.read_csv(os.path.join(RAW_PATH, 'basel_receptors.csv'), sep=';')

# Rename columns
change_names = {'ICTV genus': 'genus',
                'Bas##': 'bas',
                'Phage (name)': 'phage',
                'closest relative (BLASTN total score)': 'closest relative'} 
receptors.rename(columns=change_names, inplace=True)
del change_names
receptors.columns = receptors.columns.str.replace(' ', '_')

# Move genus and phage columns to second and third position
genus = receptors.pop('genus')
phage = receptors.pop('phage')
receptors.insert(1, 'genus', genus)
receptors.insert(2, 'phage', phage)
del genus, phage

# Define schema
receptors = receptors.astype('string')

display(receptors)
receptors.dtypes

Unnamed: 0,bas,genus,phage,morphotype,closest_relative,primary_receptor,terminal_receptor,receptor
0,Bas01,Rtpvirus,Escherichia phage AugustePiccard,siphovirus,RTP (AM156909.1),LPS / O-antigen?,LptD,n.a.
1,Bas02,Guelphvirus,Escherichia phage JeanPiccard,siphovirus,CEB_EC3a (NC_047812.1),LPS / O-antigen?,LptD,n.a.
2,Bas03,Guelphvirus,Escherichia phage JulesPiccard,siphovirus,CEB_EC3a (NC_047812.1),LPS / O-antigen?,FhuA,n.a.
3,Bas04,Warwickvirus,Escherichia phage FritzSarasin,siphovirus,tonnikala (NC_049817.1),LPS / O-antigen?,BtuB,n.a.
4,Bas05,Warwickvirus,Escherichia phage PeterMerian,siphovirus,XY3 (MN781674.1),LPS / O-antigen?,FhuA,n.a.
...,...,...,...,...,...,...,...,...
73,n.a.,Teseptimavirus,Escherichia phage T7 (reference),podovirus,n.a.,n.a.,n.a.,rough LPS
74,n.a.,Enquatrovirus,Escherichia phage N4 (reference),podovirus,n.a.,ECA,NfrA?,n.a.
75,n.a.,Lambdavirus,Escherichia phage lambdavir (reference),siphovirus,n.a.,none,LamB,n.a.
76,n.a.,Punavirus,Escherichia phage P1vir (reference),myovirus,n.a.,n.a.,n.a.,rough LPS


bas                  string[python]
genus                string[python]
phage                string[python]
morphotype           string[python]
closest_relative     string[python]
primary_receptor     string[python]
terminal_receptor    string[python]
receptor             string[python]
dtype: object

In [4]:
# Display receptors where bas is missing (stored as the string 'n.a.')
receptors[receptors.bas == 'n.a.']

Unnamed: 0,bas,genus,phage,morphotype,closest_relative,primary_receptor,terminal_receptor,receptor
69,n.a.,Tequintavirus,Escherichia phage T5 (reference),siphovirus,n.a.,LPS / O-antigen,FhuA,n.a.
70,n.a.,Tequatrovirus,Escherichia phage T2 (reference),myovirus,n.a.,FadL,LPS (deep core),n.a.
71,n.a.,Tequatrovirus,Escherichia phage T4D (reference),myovirus,n.a.,OmpC,LPS (deep core),n.a.
72,n.a.,Tequatrovirus,Escherichia phage T6 (reference),myovirus,n.a.,Tsx,LPS (deep core),n.a.
73,n.a.,Teseptimavirus,Escherichia phage T7 (reference),podovirus,n.a.,n.a.,n.a.,rough LPS
74,n.a.,Enquatrovirus,Escherichia phage N4 (reference),podovirus,n.a.,ECA,NfrA?,n.a.
75,n.a.,Lambdavirus,Escherichia phage lambdavir (reference),siphovirus,n.a.,none,LamB,n.a.
76,n.a.,Punavirus,Escherichia phage P1vir (reference),myovirus,n.a.,n.a.,n.a.,rough LPS
77,n.a.,Peduovirus,Escherichia phage P2vir (reference),myovirus,n.a.,n.a.,n.a.,rough LPS


The file has 78 entries, as expected.\
Notice that the last few bas values are n.a.\
We might join the information from maffei and receptors to fill the gaps.
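One possible way to do such a join is sketched below on toy frames built from rows shown above. It assumes that the reference phages can be matched through a short name: in maffei the short name sits in bas (sometimes with a "phage " prefix), while in receptors it sits inside the phage string before "(reference)". The `key` column, the regular expression and the toy frames are assumptions for illustration only; the real data may need extra handling.

```python
import pandas as pd

maffei_toy = pd.DataFrame({
    'bas': ['Bas04', 'phage N4', 'lambdavir', 'P1vir'],
    'phage': ['Escherichia phage FritzSarasin', 'n.a.', 'n.a.', 'n.a.'],
})
receptors_toy = pd.DataFrame({
    'bas': ['Bas04', 'n.a.', 'n.a.', 'n.a.'],
    'phage': ['Escherichia phage FritzSarasin',
              'Escherichia phage N4 (reference)',
              'Escherichia phage lambdavir (reference)',
              'Escherichia phage P1vir (reference)'],
    'terminal_receptor': ['BtuB', 'NfrA?', 'LamB', 'n.a.'],
})

# Hypothetical join key on the maffei side: bas without the "phage " prefix
maffei_toy['key'] = maffei_toy['bas'].str.replace('phage ', '', regex=False)

# On the receptors side: use bas when present, else the short reference name
ref_name = receptors_toy['phage'].str.extract(
    r'^Escherichia phage (\S+) \(reference\)$')[0]
receptors_toy['key'] = receptors_toy['bas'].where(
    receptors_toy['bas'] != 'n.a.', ref_name)

merged = maffei_toy.merge(
    receptors_toy[['key', 'phage', 'terminal_receptor']],
    on='key', how='left', suffixes=('', '_receptors'),
    validate='one_to_one')

# Fill the missing phage names in maffei with the names from receptors
merged['phage'] = merged['phage'].where(
    merged['phage'] != 'n.a.', merged['phage_receptors'])
```

The `validate='one_to_one'` argument makes the merge fail loudly if a key is duplicated on either side, which is a cheap guard when the key is derived from free text.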

## 3. Phage protein labels
Phage protein labels is a list of protein sequence IDs (FASTA) for the viruses in Maffei's paper. It gives the categorization of these proteins according to the following neural networks:

        PhaNNs
        PhageRBPdetect
        ESM-based method developed by Yumeng

Interpreting the labels:

        Columns 1-11 present the label scores for PhaNNs, with values approaching 10 signifying high confidence.
        Column 12 provides PhaNNs' confidence level.
        Column 13 indicates PhageRBPdetect predictions: 1 denotes an RBP prediction, while 0 signifies otherwise.
        Column 14 offers PhageRBPdetect scores, with values close to 1 signifying strong confidence.
        Column 15 presents the ESM-based label.
        Column 16 features 1 if the label in Column 15 is "tail_fiber."
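
To make the column layout concrete, here is a small sketch of how these labels might be used. The first two toy rows reuse score values from the table below; the third row (`prot_c`) is invented to show a positive case. The derived column `phanns_label` and the frame `rbp_consensus` are hypothetical names, not part of the dataset.

```python
import pandas as pd

phanns_cols = ['major_capsid', 'minor_capsid', 'baseplate', 'major_tail',
               'minor_tail', 'portal', 'tail_fiber', 'tail_shaft', 'collar',
               'HTJ', 'other']
other_cols = ['confidence', 'PhageRBPdetect_prediction', 'PhageRBPdetect_score',
              'ESM_based_label', 'ESM_based_fiber_prediction']

toy = pd.DataFrame(
    [
        ['prot_a', 0.22, 0.84, 0.18, 0.21, 0.34, 0.85, 0.49, 0.24, 0.46, 0.38, 5.79,
         0.98, 0, 2.4e-05, 'other', 0],
        ['prot_b', 0.02, 0.17, 0.05, 0.11, 0.12, 0.14, 0.09, 0.04, 0.10, 5.88, 3.28,
         0.80, 0, 5.3e-08, 'other', 0],
        # Invented row: a protein that both tools call a tail fiber / RBP
        ['prot_c', 0.01, 0.02, 0.10, 0.03, 0.05, 0.01, 9.50, 0.04, 0.02, 0.03, 0.19,
         1.00, 1, 0.99, 'tail_fiber', 1],
    ],
    columns=['seqID'] + phanns_cols + other_cols,
)

# PhaNNs label = the class with the highest score among columns 1-11
toy['phanns_label'] = toy[phanns_cols].idxmax(axis=1)

# Proteins flagged by both PhageRBPdetect and the ESM-based method
rbp_consensus = toy[(toy['PhageRBPdetect_prediction'] == 1)
                    & (toy['ESM_based_fiber_prediction'] == 1)]

# Column 16 should mirror column 15 being "tail_fiber"
mirror = (toy['ESM_based_label'] == 'tail_fiber').astype('int64')
assert (toy['ESM_based_fiber_prediction'] == mirror).all()
```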

In [5]:
protein_NN = pd.read_csv(os.path.join(RAW_PATH, 'phage_proteins_labels.csv'), sep=',')

# Rename columns
protein_NN.columns = protein_NN.columns.str.replace('.', '_', regex=False)  # literal dot, not the regex wildcard
protein_NN.rename(columns=lambda x: x.lower() if protein_NN.columns.get_loc(x) < 10 else x, inplace=True)
protein_NN.rename(columns=lambda x: x.lower() if protein_NN.columns.get_loc(x) in [11, 12] else x, inplace=True)
protein_NN.rename(columns={'name': 'seqID'}, inplace=True)
protein_NN.rename(columns={'PhageRBDdetect_score': 'PhageRBPdetect_score'}, inplace=True)

# Define schema
protein_NN['seqID'] = protein_NN['seqID'].astype('string')
protein_NN.iloc[:, 1:13] = protein_NN.iloc[:, 1:13].astype('float64')
protein_NN['PhageRBPdetect_prediction'] = protein_NN['PhageRBPdetect_prediction'].astype('int64')
protein_NN['PhageRBPdetect_score'] = protein_NN['PhageRBPdetect_score'].astype('float64')
protein_NN['ESM_based_label'] = protein_NN['ESM_based_label'].str.strip().astype('string')  # Remove leading and trailing whitespaces
protein_NN['ESM_based_fiber_prediction'] = protein_NN['ESM_based_fiber_prediction'].astype('int64')

# Display the full table of protein labels
display(protein_NN)
protein_NN.shape
protein_NN.dtypes

Unnamed: 0,seqID,major_capsid,minor_capsid,baseplate,major_tail,minor_tail,portal,tail_fiber,tail_shaft,collar,HTJ,other,confidence,PhageRBPdetect_prediction,PhageRBPdetect_score,ESM_based_label,ESM_based_fiber_prediction
0,lcl|MZ501051.1_prot_QXV76132.1_1,0.22,0.84,0.18,0.21,0.34,0.85,0.49,0.24,0.46,0.38,5.79,0.98,0,2.403247e-05,other,0
1,lcl|MZ501051.1_prot_QXV76133.1_2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,10.00,1.00,0,3.570606e-08,other,0
2,lcl|MZ501051.1_prot_QXV76134.1_3,0.35,1.01,0.71,0.26,1.30,0.97,0.68,0.25,0.24,0.73,3.51,0.97,0,1.817806e-06,other,0
3,lcl|MZ501051.1_prot_QXV76135.1_4,0.01,0.23,0.02,0.01,0.03,0.10,0.09,0.01,0.16,0.03,9.32,1.00,0,1.184212e-05,other,0
4,lcl|MZ501051.1_prot_QXV76136.1_5,0.05,0.07,0.06,0.05,0.13,0.09,0.03,0.07,0.08,0.07,9.28,1.00,0,8.151848e-05,other,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35411,lcl|HQ259105.1_prot_AEL79679.1_75,0.41,1.24,0.25,0.09,0.18,0.86,0.43,1.48,0.21,0.63,4.22,0.97,0,8.595828e-07,other,0
35412,lcl|HQ259105.1_prot_AEL79680.1_76,0.02,0.17,0.05,0.11,0.12,0.14,0.09,0.04,0.10,5.88,3.28,0.80,0,5.343628e-08,other,0
35413,lcl|HQ259105.1_prot_AEL79681.1_77,0.01,0.07,0.07,0.00,0.05,0.01,0.06,0.00,0.01,0.46,9.25,1.00,0,2.539302e-05,other,0
35414,lcl|HQ259105.1_prot_AEL79682.1_78,0.00,0.00,0.00,0.00,0.01,0.01,0.02,0.00,0.02,0.00,9.94,1.00,0,4.384229e-07,other,0


seqID                         string[python]
major_capsid                         float64
minor_capsid                         float64
baseplate                            float64
major_tail                           float64
minor_tail                           float64
portal                               float64
tail_fiber                           float64
tail_shaft                           float64
collar                               float64
HTJ                                  float64
other                                float64
confidence                           float64
PhageRBPdetect_prediction              int64
PhageRBPdetect_score                 float64
ESM_based_label               string[python]
ESM_based_fiber_prediction             int64
dtype: object

## 4. Basel proteome
basel_proteome.fasta includes the full proteome of each phage mentioned in the Maffei et al. paper. This covers a total of 248 phages that have been sequenced and annotated, among them the 69 phages of the Basel collection.

Let's load the FASTA file into a dataframe, with each tag of the header as a column, plus a column for the sequence.

In [6]:
# Load fasta file
basel_proteome = fasta(os.path.join(RAW_PATH, 'basel_proteome.fasta'), 'phage')

# Add bas column to basel_proteome (capitalized)
basel_proteome.df['bas'] = basel_proteome.df['locus_tag'].str.split('_', expand=True)[0]
basel_proteome.df['bas'] = basel_proteome.df['bas'].str.capitalize()

In [7]:
display(basel_proteome.df.head())
basel_proteome.df.shape

Unnamed: 0,seqID_phage,locus_tag,protein,protein_id,location,gbkey,sequence,gene,db_xref,frame,partial,exception,bas
0,lcl|MZ501051.1_prot_QXV76132.1_1,bas01_0001,terminase small subunit,QXV76132.1,1..507,CDS,MSKAALKMGEGNFKALYNKKYGDIAMVAINRKYTPEEVFDFAVRYF...,,,,,,Bas01
1,lcl|MZ501051.1_prot_QXV76133.1_2,bas01_0002,hypothetical protein,QXV76133.1,526..621,CDS,MKGFIKLFIWYYLLTSISLCVFMLVVKLWLI,,,,,,Bas01
2,lcl|MZ501051.1_prot_QXV76134.1_3,bas01_0003,terminase large subunit,QXV76134.1,609..2177,CDS,MANLIWEEMTSQEKLAVKAISEHSFEGFLRCWFSITQGERYIPNWH...,,,,,,Bas01
3,lcl|MZ501051.1_prot_QXV76135.1_4,bas01_0004,putative homing endonuclease,QXV76135.1,2321..2716,CDS,MVAGSLSGNGYLHIRIGDRRVKNHLIIWEMHNGRIPEGMEIDHINH...,,,,,,Bas01
4,lcl|MZ501051.1_prot_QXV76136.1_5,bas01_0005,hypothetical protein,QXV76136.1,2788..3168,CDS,MTKKSKAVYLGNTEGEYYGFTVGNEYDVHNYETEDNFIGTFGDDGG...,,,,,,Bas01


(35416, 13)
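
The `fasta` loader used above is user-defined and its source is not shown here. The sketch below is only a guess at how such a loader could work, assuming NCBI-style headers of the form `>ID [tag=value] [tag=value] ...`. The function `parse_fasta`, the pattern `TAG_RE` and the example record are hypothetical; the header layout is inferred from the column names in the output above.

```python
import io
import re
import pandas as pd

# Matches "[tag=value]" pairs in a FASTA header (assumed NCBI-style layout)
TAG_RE = re.compile(r'\[(\w+)=([^\]]*)\]')

def parse_fasta(handle, id_col='seqID'):
    """Parse FASTA text into a dataframe: one row per record, one column per tag."""
    records, current, seq = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Close the previous record before starting a new one
            if current is not None:
                current['sequence'] = ''.join(seq)
                records.append(current)
            seq_id = line[1:].split(' ', 1)[0]
            current = {id_col: seq_id, **dict(TAG_RE.findall(line))}
            seq = []
        else:
            seq.append(line)
    if current is not None:
        current['sequence'] = ''.join(seq)
        records.append(current)
    return pd.DataFrame(records)

example = io.StringIO(
    ">lcl|MZ501051.1_prot_QXV76133.1_2 [locus_tag=bas01_0002] "
    "[protein=hypothetical protein] [protein_id=QXV76133.1] "
    "[location=526..621] [gbkey=CDS]\n"
    "MKGFIKLFIWYYLL\n"
    "TSISLCVFMLVVKLWLI\n"
)
df = parse_fasta(example, id_col='seqID_phage')

# Same derivation of the bas column as in the cell above
df['bas'] = df['locus_tag'].str.split('_', expand=True)[0].str.capitalize()
```

Records whose header lacks a given tag simply get a missing value in that column, which would explain the empty gene, db_xref, frame, partial and exception cells above.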

In [8]:
# Display basel_proteome where bas is nan, grouped by seqID_phage, considering the string before the '.'
basel_proteome.df[basel_proteome.df.bas.isna()].groupby(basel_proteome.df.seqID_phage.str.split('.', expand=True)[0]).count()

0,seqID_phage,locus_tag,protein,protein_id,location,gbkey,sequence,gene,db_xref,frame,partial,exception,bas
lcl|AM156909,75,0,75,75,75,75,75,75,75,0,0,0,0
lcl|AY247822,50,0,50,50,50,50,50,0,0,0,0,0,0
lcl|EU547803,55,0,55,55,55,55,55,55,0,0,0,3,0
lcl|EU877232,140,0,140,140,140,140,140,140,0,0,0,0,0
lcl|HM997020,273,0,273,273,273,273,273,0,0,0,0,0,0
lcl|JF770475,74,0,74,74,74,74,74,0,0,0,0,0,0
lcl|JN202312,268,0,268,268,268,268,268,234,0,0,0,0,0
lcl|JN672684,83,0,83,83,83,83,83,0,0,1,0,0,0
lcl|JX560968,136,0,136,136,136,136,136,0,0,0,0,1,0
lcl|KF208315,271,0,271,271,271,271,271,72,0,0,0,0,0


## 5. K12 proteome

In [9]:
k12_proteome = fasta(os.path.join(RAW_PATH, 'K12.fasta'), 'k12')
display(k12_proteome.df.head())
k12_proteome.df.shape

Unnamed: 0,seqID_k12,name,OS,OX,GN,PE,SV,sequence
0,sp|A5A616|MGTS_ECOLI,Small protein MgtS,Escherichia coli (strain K12),83333,mgtS,1,1,MLGNMNVFMAVLGIILFSGFLAAYFSHKWDD
1,sp|O32583|THIS_ECOLI,Sulfur carrier protein ThiS,Escherichia coli (strain K12),83333,thiS,1,1,MQILFNDQAMQCAAGQTVHELLEQLDQRQAGAALAINQQIVPREQW...
2,sp|P00350|6PGD_ECOLI,"6-phosphogluconate dehydrogenase, decarboxylat...",Escherichia coli (strain K12),83333,gnd,1,2,MSKQQIGVVGMAVMGRNLALNIESRGYTVSIFNRSREKTEEVIAEN...
3,sp|P00363|FRDA_ECOLI,Fumarate reductase flavoprotein subunit,Escherichia coli (strain K12),83333,frdA,1,3,MQTFQADLAIVGAGGAGLRAAIAAAQANPNAKIALISKVYPMRSHT...
4,sp|P00370|DHE4_ECOLI,NADP-specific glutamate dehydrogenase,Escherichia coli (strain K12),83333,gdhA,1,1,MDQTYSLESFLNHVQKRDPNQTEFAQAVREVMTTLWPFLEQNPKYR...


(4403, 8)

## 6. Save clean datasets in <code>data/interim/clean</code>

In [10]:
CLEAN_PATH = os.path.join('..', 'data', 'interim', 'clean')

os.makedirs(CLEAN_PATH, exist_ok=True)

# Save dataframes to csv
maffei.to_csv(os.path.join(CLEAN_PATH, '1_maffei.csv'), index=False)
receptors.to_csv(os.path.join(CLEAN_PATH, '1_receptors.csv'), index=False)
protein_NN.to_csv(os.path.join(CLEAN_PATH, '1_proteins.csv'), index=False)
basel_proteome.df.to_csv(os.path.join(CLEAN_PATH, '1_basel_proteome.csv'), index=False)
k12_proteome.df.to_csv(os.path.join(CLEAN_PATH, '1_k12_proteome.csv'), index=False)

# Save dataframes to pickle
maffei.to_pickle(os.path.join(CLEAN_PATH, '1_maffei.pkl'))
receptors.to_pickle(os.path.join(CLEAN_PATH, '1_receptors.pkl'))
protein_NN.to_pickle(os.path.join(CLEAN_PATH, '1_proteins.pkl'))
basel_proteome.df.to_pickle(os.path.join(CLEAN_PATH, '1_basel_proteome.pkl'))
k12_proteome.df.to_pickle(os.path.join(CLEAN_PATH, '1_k12_proteome.pkl'))
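
A note on the two formats: CSV stores only text, so column dtypes are inferred again when the file is read, whereas pickle keeps the dataframe schema as defined above. The sketch below shows this on a toy frame written to a temporary directory; the file names and the frame `toy` are hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'bas': ['Bas01', 'Bas02'], 'E_coli_B_REL606': [1, 1]})
toy['bas'] = toy['bas'].astype('string')

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pkl_path = os.path.join(tmp, 'toy.pkl')
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    # Pickle round trip: values and dtypes come back unchanged
    from_pickle = pd.read_pickle(pkl_path)

    # CSV round trip: the schema has to be declared again when reading
    from_csv_typed = pd.read_csv(csv_path, dtype={'bas': 'string'})
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code; the CSV copies remain the portable, human-readable version.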