# 01 - Data Exploration and labelling

The `summary statistics` sheet of `Maffei.xlsx` reports which phage causes lysis in which *E. coli* strain. Each row represents a phage, and each host column represents a strain. The classification is binary (1 = lysis, 0 = no lysis). For this project, an infection counts as successful only when lysis is observed.

## Import libraries

In [None]:
import pandas as pd
import os

# Import user defined libraries
import sys
sys.path.append("../src")
from fastaLoad import fasta

IndentationError: unexpected indent (fastaLoad.py, line 60)

## 1. Load data


### 1.1 Maffei
Let's load the data from Maffei. The Excel sheet is not clean for import: some rows and columns need to be skipped, and some headers need to be renamed.\
We only care about the phage identifiers and the lysis data, so we keep only those columns.\
The result is a `maffei` dataset whose first 4 columns identify the virus: bas, family, genus and phage.\
The remaining columns are host strains and hold binary lysis values (1 = lysis, 0 = no lysis).
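
To make the target layout concrete, here is a minimal sketch on a toy table. The phage names, host names and values are made up for illustration and are not taken from `Maffei.xlsx`; only the four identifier column names come from the description above.

```python
import pandas as pd

# Toy table in the target layout; every value here is a hypothetical placeholder.
toy = pd.DataFrame({
    'bas': ['Bas01', 'Bas02', 'Bas03'],
    'family': ['family_A', 'family_A', 'family_B'],
    'genus': ['genus_X', 'genus_Y', 'genus_Z'],
    'phage': ['phage_1', 'phage_2', 'phage_3'],
    'host_1': [1, 0, 1],
    'host_2': [1, 0, 0],
})

id_cols = ['bas', 'family', 'genus', 'phage']
host_cols = [c for c in toy.columns if c not in id_cols]

# Sanity check: the host columns should contain only 0 or 1
assert toy[host_cols].isin([0, 1]).all().all()

# Number of host strains lysed by each phage
host_range = toy.set_index('bas')[host_cols].sum(axis=1)
print(host_range)
```

The same two checks (binary values only, per-phage row sums) could be run on the real `maffei` dataframe once it is loaded.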

In [None]:
# Maffei raw data
RAW_PATH = os.path.join('..', 'data', 'raw')

maffei = pd.read_excel(os.path.join(RAW_PATH, 'Maffei.xlsx'), sheet_name='summary statistics', skiprows=1, header=[0,1,2])
maffei.drop(columns=maffei.columns[0], inplace=True)

# Store the first level of the first 4 headers in a list
header = [col[0] for col in maffei.columns.tolist()[:4]]
header.append('lysis observed (yes=1, no=0)')

# Keep only selected headers
maffei = maffei.loc[:, header]

# Drop odd columns up to 7
maffei.drop(columns=maffei.columns[1:9:2], inplace=True)

# Remove the first two rows of headers
maffei.columns = maffei.columns.droplevel([0,1])

# Substitute the first 4 headers
maffei = maffei.set_axis(header[:4] + maffei.columns[4:].tolist(), axis=1)

# Drop last column (which was kept for some mysterious reason)
maffei = maffei.iloc[:, :10]

# Rename columns
change_names = {'ICTV family (subfamily)': 'family',
                'ICTV genus': 'genus',
                'Bas#': 'bas',
                'Phage (name)': 'phage'}
maffei.rename(columns=change_names, inplace=True)
del change_names
maffei.columns = maffei.columns.str.replace(' ', '_')
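# regex=False: in pandas < 2.0 the default treats '.' as a regex matching any character
maffei.columns = maffei.columns.str.replace('.', '', regex=False)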

# Move bas column to first position
bas = maffei.pop('bas')
maffei.insert(0, 'bas', bas)
del bas

# Define schema
header = maffei.columns[:4].tolist()
maffei[header] = maffei[header].astype('string')
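# Could be bool or categorical, but int64 is more convenient for machine learning purposes.
# Assign by column label: assigning through .iloc may keep the old dtype.
lysis_cols = maffei.columns[4:].tolist()
maffei[lysis_cols] = maffei[lysis_cols].astype('int64')
del header, lysis_cols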

display(maffei.head())
maffei.dtypes



### 1.2 Basel receptors
Basel receptors describes the receptors of each phage in Maffei's paper.

In [None]:
receptors = pd.read_csv(os.path.join(RAW_PATH, 'basel_receptors.csv'), sep=';')

# Rename columns
change_names = {'ICTV genus': 'genus',
                'Bas##': 'bas',
                'Phage (name)': 'phage',
                'closest relative (BLASTN total score)': 'closest relative'} 
receptors.rename(columns=change_names, inplace=True)
del change_names
receptors.columns = receptors.columns.str.replace(' ', '_')

# Move genus and phage columns to second and third position
genus = receptors.pop('genus')
phage = receptors.pop('phage')
receptors.insert(1, 'genus', genus)
receptors.insert(2, 'phage', phage)
del genus, phage

# Define schema
receptors = receptors.astype('string')

display(receptors.head())
receptors.dtypes


### 1.3 Phage protein labels
Phage protein labels is a list of protein sequence IDs (FASTA) for the viruses in Maffei's paper. It gives the categorization of these proteins according to the following neural networks:

        PhANNs
        PhageRBPdetect
        ESM-based method developed by Yumeng

Interpreting the labels:

        Columns 1-11 present the label scores for PhANNs, with values approaching 10 signifying high confidence.
        Column 12 provides PhANNs' confidence level.
        Column 13 indicates PhageRBPdetect predictions: 1 denotes an RBP prediction, while 0 signifies otherwise.
        Column 14 offers PhageRBPdetect scores, with values close to 1 signifying strong confidence.
        Column 15 presents the ESM-based label.
        Column 16 features 1 if the label in Column 15 is "tail_fiber".
In [None]:
proteins = pd.read_csv(os.path.join(RAW_PATH, 'phage_proteins_labels.csv'), sep=',')

# Rename columns
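# regex=False: in pandas < 2.0 the default treats '.' as a regex matching any character
proteins.columns = proteins.columns.str.replace('.', '_', regex=False)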
proteins.rename(columns=lambda x: x.lower() if proteins.columns.get_loc(x) < 10 else x, inplace=True)
proteins.rename(columns=lambda x: x.lower() if proteins.columns.get_loc(x) in [11, 12] else x, inplace=True)
proteins.rename(columns={'name': 'seqID'}, inplace=True)

# Define schema
proteins['seqID'] = proteins['seqID'].astype('string')
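# Assign by column label: assigning through .iloc may keep the old dtype
score_cols = proteins.columns[1:13].tolist()
proteins[score_cols] = proteins[score_cols].astype('float64')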
proteins['PhageRBPdetect_prediction'] = proteins['PhageRBPdetect_prediction'].astype('int64')
proteins['PhageRBDdetect_score'] = proteins['PhageRBDdetect_score'].astype('float64')
proteins['ESM_based_label'] = proteins['ESM_based_label'].astype('string')
proteins['ESM_based_fiber_prediction'] = proteins['ESM_based_fiber_prediction'].astype('int64')

display(proteins.head())
proteins.dtypes


### 1.4 Basel proteome
basel_proteome.fasta includes the full proteome of each phage mentioned in the Maffei et al. paper. This covers a total of 248 phages that have been sequenced and annotated. Specifically, this includes 69 phages from the Basel collection.

Let's load the FASTA file into a dataframe, with one column per header tag plus one column for the sequence.
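
The user-defined `fasta` class from `fastaLoad` is not shown here, so the following is only a minimal sketch of the idea, not that class. The function name `read_fasta_to_df`, the `|` header delimiter and the `tag_0`, `tag_1`, ... column names are all assumptions made for illustration; the real header layout of `basel_proteome.fasta` may differ.

```python
import io
import pandas as pd

def read_fasta_to_df(handle, sep='|'):
    """Parse FASTA text into a DataFrame with one row per record.

    Header fields are split on `sep` (an assumed delimiter) into
    tag_0, tag_1, ... columns; the sequence goes into 'sequence'.
    """
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))

    rows = []
    for header, sequence in records:
        row = {f'tag_{i}': tag.strip() for i, tag in enumerate(header.split(sep))}
        row['sequence'] = sequence
        rows.append(row)
    return pd.DataFrame(rows)

# Hypothetical two-record FASTA, with the first sequence wrapped over two lines
example = io.StringIO(">phage_1|protein_a\nMKT\nLLV\n>phage_2|protein_b\nMSQ\n")
df = read_fasta_to_df(example)
print(df)
```

Multi-line sequences are joined into a single string, so each row holds the full protein sequence.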

In [None]:
basel_proteome = fasta(os.path.join(RAW_PATH, 'basel_proteome.fasta'), 'phage')
display(basel_proteome.df.head())
display(basel_proteome.df.dtypes)
basel_proteome.df.shape


### 1.5 K12 proteome

In [None]:
k12_proteome = fasta(os.path.join(RAW_PATH, 'K12.fasta'), 'bacteria')
display(k12_proteome.df.head())


### 1.6 Save clean datasets in <code>data/interim/clean</code>

In [None]:
CLEAN_PATH = os.path.join('..', 'data', 'interim', 'clean')

# Create the output folder if it does not exist yet
os.makedirs(CLEAN_PATH, exist_ok=True)

maffei.to_csv(os.path.join(CLEAN_PATH, 'maffei.csv'), index=False)
receptors.to_csv(os.path.join(CLEAN_PATH, 'receptors.csv'), index=False)
proteins.to_csv(os.path.join(CLEAN_PATH, 'proteins.csv'), index=False)
basel_proteome.df.to_csv(os.path.join(CLEAN_PATH, 'basel_proteome.csv'), index=False)



## 2. Data exploration

Some of the Basel receptors are terminal (i.e. secondary) and some are primary. Both columns contain a mix of LPS and proteins. We only care about proteins, so it might be a good idea to add a `receptor_protein` column that picks the protein receptor from the two. Also, the `receptor` column contains no proteins, so it is probably not useful for further analysis.
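
One possible way to build such a column is sketched below on a toy table. The column names `primary_receptor` and `terminal_receptor`, the receptor values, and the rule "anything that does not mention LPS is a protein" are assumptions for illustration and would need to be adapted to the actual contents of `receptors`. When both receptors are proteins, this sketch arbitrarily prefers the terminal one.

```python
import pandas as pd

# Toy receptor table; column names and values are hypothetical.
toy_receptors = pd.DataFrame({
    'bas': ['Bas01', 'Bas02', 'Bas03'],
    'primary_receptor': ['LPS', 'FhuA', 'LPS'],
    'terminal_receptor': ['BtuB', 'LPS', 'LPS'],
})

def is_protein(series):
    # Assumption: a present value that does not mention LPS is a protein
    return series.notna() & ~series.str.contains('LPS', case=False, na=False)

primary = toy_receptors['primary_receptor']
terminal = toy_receptors['terminal_receptor']

# Terminal receptor if it is a protein, else primary if it is a protein, else missing
toy_receptors['receptor_protein'] = terminal.where(
    is_protein(terminal),
    primary.where(is_protein(primary)),
)
print(toy_receptors)
```

Phages with no protein receptor in either column end up with a missing value, which makes them easy to filter out later.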