## 2_Transform_data

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import os

### Load clean data

In [2]:
CLEAN_PATH = os.path.join('..', 'data', 'interim', 'clean')

if not os.path.exists(CLEAN_PATH):
    raise Exception('Clean data path does not exist. Did you run the notebook 01_clean_data.ipynb?')

# Import pickle clean data
# maffei = pd.read_pickle(os.path.join(CLEAN_PATH, '0_maffei.pkl'))
receptors = pd.read_pickle(os.path.join(CLEAN_PATH, '0_receptors.pkl'))
protein_NN = pd.read_pickle(os.path.join(CLEAN_PATH, '0_proteins.pkl'))
basel_proteome = pd.read_pickle(os.path.join(CLEAN_PATH, '0_basel_proteome.pkl'))
k12_proteome = pd.read_pickle(os.path.join(CLEAN_PATH, '0_k12_proteome.pkl'))


In [3]:
# basel_proteome['bas'] = basel_proteome['locus_tag'].str.split('_', expand=True)[0]
# basel_protein_names = basel_proteome.df['protein'].unique().tolist()
# display(basel_proteome.head())

## 1. Protein receptors
Some of the BASEL receptors are terminal (also called secondary) and some are primary, and both columns contain a mix of LPS and protein receptors. We only care about proteins, so we add a `receptor_protein` column that holds whichever of the two is a protein (the primary one if both are). The `receptor` column contains no proteins, so it is not useful for further analysis. \
We keep only the values that match a gene name in the K-12 gene set.\
NB: some values carry a question mark. Since we are aiming for labelling that is as strict as possible, these values are not included.
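
The selection rule described above can be sketched on toy data. The table contents and the names `toy_k12`, `toy_receptors`, `primary_hit` and `terminal_hit` are illustrative assumptions, not values from the real dataset. The sketch shows that LPS entries, missing values and question-marked entries all fail the match and are therefore left out.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables (all values here are illustrative only)
toy_k12 = pd.DataFrame({'GN': ['fhuA', 'btuB', 'lptD']})
toy_receptors = pd.DataFrame({
    'bas': ['BasA', 'BasB', 'BasC', 'BasD'],
    'primary_receptor': ['FhuA', 'LPS', 'BtuB?', 'LPS'],
    'terminal_receptor': ['LPS', 'LptD', 'LPS', np.nan],
})

gene_names = set(toy_k12['GN'].str.lower())

# Case-insensitive membership tests; NaN, 'LPS' and 'BtuB?' do not match any gene name
primary_hit = toy_receptors['primary_receptor'].str.lower().isin(gene_names)
terminal_hit = toy_receptors['terminal_receptor'].str.lower().isin(gene_names)
keep = primary_hit | terminal_hit

# Keep rows with at least one protein receptor, preferring the primary one
selected = toy_receptors[keep].copy()
selected['receptor_protein'] = np.where(
    primary_hit[keep],
    selected['primary_receptor'],
    selected['terminal_receptor'],
)
print(selected[['bas', 'receptor_protein']])
```

Only `BasA` and `BasB` survive here: `BasC` is dropped because of the question mark, and `BasD` because neither column names a protein.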

In [4]:
# Lower-cased K-12 gene names, for case-insensitive matching
GN_set = set(k12_proteome['GN'].str.lower())

# Flag rows whose primary or terminal receptor matches a K-12 gene name
primary_is_protein = receptors['primary_receptor'].str.lower().isin(GN_set)
terminal_is_protein = receptors['terminal_receptor'].str.lower().isin(GN_set)
keep = primary_is_protein | terminal_is_protein

# Keep matching rows only; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
receptors = receptors[keep].copy()
# Prefer the primary receptor when it is a protein, otherwise use the terminal one
receptors['receptor_protein'] = np.where(primary_is_protein[keep], receptors['primary_receptor'], receptors['terminal_receptor'])
receptors = receptors.drop(columns=['primary_receptor', 'terminal_receptor', 'receptor'])


display(receptors.head())
display(receptors.shape)

index,bas,genus,phage,morphotype,closest_relative,receptor_protein
0,Bas01,Rtpvirus,Escherichia phage AugustePiccard,siphovirus,RTP (AM156909.1),LptD
1,Bas02,Guelphvirus,Escherichia phage JeanPiccard,siphovirus,CEB_EC3a (NC_047812.1),LptD
2,Bas03,Guelphvirus,Escherichia phage JulesPiccard,siphovirus,CEB_EC3a (NC_047812.1),FhuA
3,Bas04,Warwickvirus,Escherichia phage FritzSarasin,siphovirus,tonnikala (NC_049817.1),BtuB
4,Bas05,Warwickvirus,Escherichia phage PeterMerian,siphovirus,XY3 (MN781674.1),FhuA


(50, 6)

Filtering reduced the number of samples from 78 to 50.\
Let's now check that the gene names in the K-12 proteome are unique.

In [5]:
# Count occurrences of each (lower-cased) GN in k12_proteome
GN_count = k12_proteome['GN'].str.lower().value_counts().to_frame()

display(GN_count.head())
display(GN_count.shape)

# Display when it is higher than 1
display(GN_count[GN_count['count'] != 1])

GN,count
mgts,1
rpnb,1
xdhc,1
yger,1
ynje,1


(4403, 1)

GN,count


The second table is empty, which confirms that there are no duplicates among the K-12 gene names.
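
As a cross-check, the same uniqueness question can be asked with `Series.is_unique` and `Series.duplicated`. This is a minimal sketch on a toy frame (the `toy` table and its values are assumptions for illustration) that also shows why the comparison is done on lower-cased names.

```python
import pandas as pd

# Toy gene-name column: 'fhuA' and 'FhuA' differ only by case (illustrative values)
toy = pd.DataFrame({'GN': ['fhuA', 'btuB', 'FhuA']})

lowered = toy['GN'].str.lower()

# True only if no lower-cased gene name occurs more than once
print(lowered.is_unique)

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[lowered.duplicated(keep=False)]
print(dupes)
```

On the real data, `k12_proteome['GN'].str.lower().is_unique` would be expected to return `True`, in line with the empty table above.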

## 2. Match receptor and sequence
Let's match each phage in the receptors dataframe with the sequence of its associated receptor, contained in k12_proteome.\
We expect that many entries in receptors can match the same gene in k12_proteome, but not vice versa. \
We check this by passing validate='many_to_one' to the merge, which raises an error if the merge keys are not unique in k12_proteome.
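
To illustrate what `validate='many_to_one'` guards against, here is a small sketch with toy frames (the names `left`, `right_ok`, `right_dup` and all values are assumptions for illustration). The merge succeeds when the right-hand keys are unique and raises `pandas.errors.MergeError` when they are not.

```python
import pandas as pd
from pandas.errors import MergeError

# Two phages pointing at the same receptor gene (many rows on the left)
left = pd.DataFrame({'bas': ['BasA', 'BasB'], 'key': ['fhua', 'fhua']})

# Right-hand table with unique keys: a valid many-to-one relationship
right_ok = pd.DataFrame({'key': ['fhua', 'btub'], 'sequence': ['MARS', 'MIKK']})

# Right-hand table with a duplicated key: violates many-to-one
right_dup = pd.DataFrame({'key': ['fhua', 'fhua'], 'sequence': ['MARS', 'MARX']})

merged = left.merge(right_ok, on='key', how='inner', validate='many_to_one')
print(merged)

try:
    left.merge(right_dup, on='key', how='inner', validate='many_to_one')
    raised = False
except MergeError as err:
    raised = True
    print(f'Merge rejected: {err}')
```

Without the `validate` argument, the second merge would silently duplicate each phage row, which is the kind of error the check is meant to catch.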

In [8]:
# Add a lower case column of receptor_protein as common identifier between receptors and k12_proteome
receptors['receptor_protein_lower'] = receptors['receptor_protein'].str.lower()
k12_proteome['GN_lower'] = k12_proteome['GN'].str.lower()

# Merge datasets
receptors_k12 = receptors.merge(
    k12_proteome,
    how='inner',
    left_on='receptor_protein_lower',
    right_on='GN_lower',
    suffixes=('_phage', '_k12'),
    validate='many_to_one',
)

# Drop GN_lower and receptor_protein_lower and order by bas
receptors_k12.drop(columns=['GN_lower', 'receptor_protein_lower'], inplace=True)
receptors_k12.sort_values(by=['bas'], inplace=True)

display(receptors_k12.head())
display(receptors_k12.shape)

index,bas,genus,phage,morphotype,closest_relative,receptor_protein,seqID_k12,name,OS,OX,GN,PE,SV,sequence
0,Bas01,Rtpvirus,Escherichia phage AugustePiccard,siphovirus,RTP (AM156909.1),LptD,sp|P31554|LPTD_ECOLI,LPS-assembly protein LptD,Escherichia coli (strain K12),83333,lptD,1,2,MKKRIPTLLATMIATALYSQQGLAADLASQCMLGVPSYDRPLVQGD...
1,Bas02,Guelphvirus,Escherichia phage JeanPiccard,siphovirus,CEB_EC3a (NC_047812.1),LptD,sp|P31554|LPTD_ECOLI,LPS-assembly protein LptD,Escherichia coli (strain K12),83333,lptD,1,2,MKKRIPTLLATMIATALYSQQGLAADLASQCMLGVPSYDRPLVQGD...
7,Bas03,Guelphvirus,Escherichia phage JulesPiccard,siphovirus,CEB_EC3a (NC_047812.1),FhuA,sp|P06971|FHUA_ECOLI,Ferrichrome outer membrane transporter/phage r...,Escherichia coli (strain K12),83333,fhuA,1,2,MARSKTAQPKHSLRKIAVVVATAVSGMSVYAQAAVEPKEDTITVTA...
17,Bas04,Warwickvirus,Escherichia phage FritzSarasin,siphovirus,tonnikala (NC_049817.1),BtuB,sp|P06129|BTUB_ECOLI,Vitamin B12 transporter BtuB,Escherichia coli (strain K12),83333,btuB,1,2,MIKKASLLTACSVTAFSAWAQDTSPDTLVVTANRFEQPRSTVLAPT...
8,Bas05,Warwickvirus,Escherichia phage PeterMerian,siphovirus,XY3 (MN781674.1),FhuA,sp|P06971|FHUA_ECOLI,Ferrichrome outer membrane transporter/phage r...,Escherichia coli (strain K12),83333,fhuA,1,2,MARSKTAQPKHSLRKIAVVVATAVSGMSVYAQAAVEPKEDTITVTA...


(50, 14)

## 3. Select interacting proteins

Next, for each Bas (from receptors), we need to select which of its proteins (from basel_proteome) is expected to interact with that kind of receptor. It might be a good idea to define a dictionary to do this.

After selecting the involved protein(s) for each Bas, we'll have to link those proteins with the corresponding protein in k12_proteome. These pairs will be labeled as interacting, and all the others as non-interacting.
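
One possible shape for this step is sketched below on toy data. Everything in it is an assumption made for illustration: the dictionary `RECEPTOR_TO_PHAGE_PROTEIN`, the placeholder protein names, the `bas` and `protein` column names in the phage table, and the choice to pair each phage protein only with its own phage's receptor. The real mapping between receptors and phage proteins still has to be defined.

```python
import pandas as pd

# Hypothetical mapping: lower-cased receptor gene name -> phage proteins assumed to bind it
RECEPTOR_TO_PHAGE_PROTEIN = {
    'fhua': ['placeholder_rbp_1'],
    'btub': ['placeholder_rbp_2'],
}

# Toy stand-in for receptors_k12: one receptor gene per phage (illustrative values)
toy_receptors_k12 = pd.DataFrame({
    'bas': ['BasA', 'BasB'],
    'GN': ['fhuA', 'btuB'],
})

# Toy stand-in for basel_proteome: several proteins per phage (illustrative values)
toy_phage_proteome = pd.DataFrame({
    'bas': ['BasA', 'BasA', 'BasB', 'BasB'],
    'protein': ['placeholder_rbp_1', 'capsid', 'placeholder_rbp_2', 'placeholder_rbp_1'],
})

# Pair every phage protein with the receptor of the phage it belongs to
pairs = toy_phage_proteome.merge(
    toy_receptors_k12, on='bas', how='inner', validate='many_to_one'
)

def is_interacting(row):
    """Return True if the dictionary lists this protein for this receptor."""
    candidates = RECEPTOR_TO_PHAGE_PROTEIN.get(row['GN'].lower(), [])
    return row['protein'] in candidates

pairs['interacting'] = pairs.apply(is_interacting, axis=1)
print(pairs)
```

In this sketch a protein is labelled interacting only when the dictionary lists it for the receptor of its own phage; every other pair is labelled non-interacting.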