# LPD to RS to WP Mapping

### Methods
<ol>
    <li>Load the LPD to RS mapping from the supplementary information of Henson et al. (2018)</li>
    <li>Load the text file that contains gene information about the Broad Institute's R. opacus genome assembly</li>
    <li>Create a dictionary with RS gene IDs as keys and WP gene IDs as values</li>
    <li>Add the WP gene IDs to the gene mapping data frame</li>
    <li>Save the data frame as .csv</li>
</ol>

### Imports

In [1]:
import pandas as pd

### Load LPD to RS data from the supplement of Henson et al. (2018)
[Multi-omic elucidation of aromatic catabolism in adaptively evolved Rhodococcus opacus](https://www.sciencedirect.com/science/article/pii/S1096717618300910?via%3Dihub)

In [37]:
LPD_to_RS_df = pd.read_csv('LPD_to_RS_gene_mappping.csv')
# Remove the unneeded '\t' from the end of the LPD gene IDs
LPD_to_RS_df['Gene ID (LPD)'] = [val.split('\t')[0] for val in LPD_to_RS_df['Gene ID (LPD)']]
LPD_to_RS_df.head()


Unnamed: 0,Gene ID (RS),Gene ID (LPD),Annotation
0,PD630_RS00005,Pd630_LPD00001,chromosomal replication initiator protein DnaA
1,PD630_RS00010,Pd630_LPD00002,DNA polymerase III subunit beta
2,PD630_RS00015,Pd630_LPD00003,6-phosphogluconate dehydrogenase
3,PD630_RS00020,Pd630_LPD00004,DNA replication and repair protein RecF
4,PD630_RS00025,Pd630_LPD00005,hypothetical protein


### Load the R. opacus PD630 GTF file
The GTF file can be downloaded from the [R. opacus NCBI assembly page](https://www.ncbi.nlm.nih.gov/assembly/GCF_000234335.1).

In [35]:
# Read the whole GTF file; the context manager closes the file handle afterwards
with open("rhodococcus_opacus_pd630_gtf.txt", "r") as f:
    full_text = f.read()
full_text[:1000]

'#gtf-version 2.2\n#!genome-build ASM23433v1\n#!genome-build-accession NCBI_Assembly:GCF_000234335.1\n#!annotation-date 11/08/2020 19:47:09\n#!annotation-source NCBI RefSeq \nNZ_JH377097.1\tRefSeq\tgene\t677\t1252\t.\t+\t.\tgene_id "OPAG_RS42320"; transcript_id ""; gbkey "Gene"; gene_biotype "pseudogene"; locus_tag "OPAG_RS42320"; partial "true"; pseudo "true"; \nNZ_JH377097.1\tProtein Homology\tCDS\t677\t1249\t.\t+\t0\tgene_id "OPAG_RS42320"; transcript_id "unknown_transcript_1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010843886.1"; locus_tag "OPAG_RS42320"; note "frameshifted; incomplete; partial in the middle of a contig; missing C-terminus"; partial "true"; product "alpha/beta hydrolase"; pseudo "true"; transl_table "11"; \nNZ_JH377097.1\tProtein Homology\tstart_codon\t677\t679\t.\t+\t0\tgene_id "OPAG_RS42320"; transcript_id "unknown_transcript_1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010843886.1"; locus_tag "OPAG_RS423

### Create dictionary to map RS gene IDs to WP gene IDs

In [31]:
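# Split the GTF text into chunks, one per 'gene_id' attribute
gene_list = full_text.split('gene_id')
gene_dictionary = {}
for gene in gene_list:
    # Keep only chunks that carry both an RS locus tag and a RefSeq WP accession
    if 'OPAG_RS' in gene and 'RefSeq:WP_' in gene:
        # The GTF uses the 'OPAG_' prefix; the mapping table uses 'PD630_'
        RS_name = 'PD630_' + gene.split('OPAG_')[1].split('"')[0]
        # The WP accession follows 'RefSeq:' inside the inference attribute
        WP_name = gene.split('RefSeq:')[1].split('"')[0]
        gene_dictionary[RS_name] = WP_name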
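
The string splitting above depends on where `gene_id` happens to fall in the text. As a rough alternative, the same mapping could be built line by line with regular expressions. This is only a sketch: the function name `build_rs_to_wp` and both patterns are hypothetical, and they assume the attribute layout visible in the excerpt above (a `locus_tag "OPAG_RS..."` attribute and an inference string containing `RefSeq:WP_...`).

```python
import re

# Hypothetical patterns, assuming the attribute layout shown in the GTF excerpt.
LOCUS_TAG_RE = re.compile(r'locus_tag "OPAG_(RS\d+)"')
WP_ID_RE = re.compile(r'RefSeq:(WP_\d+\.\d+)')


def build_rs_to_wp(gtf_text):
    """Return a dict mapping PD630_RS gene IDs to WP protein accessions."""
    mapping = {}
    for line in gtf_text.splitlines():
        if line.startswith('#'):
            continue  # skip the header/comment lines
        fields = line.split('\t')
        # GTF rows have nine tab-separated fields; use only CDS rows here.
        if len(fields) < 9 or fields[2] != 'CDS':
            continue
        locus = LOCUS_TAG_RE.search(fields[8])
        wp = WP_ID_RE.search(fields[8])
        if locus and wp:
            # Swap the GTF's 'OPAG_' prefix for the 'PD630_' prefix of the table.
            mapping['PD630_' + locus.group(1)] = wp.group(1)
    return mapping
```

Calling `build_rs_to_wp(full_text)` would be expected to yield a dictionary comparable to `gene_dictionary`, though that has not been checked against the full file.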

### Add WP Annotations to the gene mapping dataframe

In [38]:
# Look up each RS ID; genes without a WP match get the placeholder 'none'
WP_annotations = [gene_dictionary.get(RS_name, 'none') for RS_name in LPD_to_RS_df['Gene ID (RS)']]
LPD_to_RS_df.insert(2, 'Gene ID (WP)', WP_annotations)

LPD_to_RS_df

Unnamed: 0,Gene ID (RS),Gene ID (LPD),Gene ID (WP),Annotation
0,PD630_RS00005,Pd630_LPD00001,none,chromosomal replication initiator protein DnaA
1,PD630_RS00010,Pd630_LPD00002,WP_005569241.1,DNA polymerase III subunit beta
2,PD630_RS00015,Pd630_LPD00003,WP_007296166.1,6-phosphogluconate dehydrogenase
3,PD630_RS00020,Pd630_LPD00004,WP_005237760.1,DNA replication and repair protein RecF
4,PD630_RS00025,Pd630_LPD00005,WP_005237761.1,hypothetical protein
5,PD630_RS00030,Pd630_LPD00006,WP_005256534.1,metal-dependent hydrolase
6,PD630_RS00035,Pd630_LPD00007,WP_005569233.1,ferredoxin
7,PD630_RS00040,Pd630_LPD00008,WP_005569232.1,hydrolase
8,PD630_RS00045,Pd630_LPD00009,WP_016884547.1,DNA topoisomerase IV subunit B
9,PD630_RS00050,Pd630_LPD00011,WP_009477017.1,ATP-binding protein
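
The first row shows that some RS IDs have no WP match and receive the placeholder `'none'`. A minimal sketch of the same lookup using `Series.map`, which also makes it easy to count the unmapped genes, is given below. The names `toy_df`, `toy_dictionary` and `n_unmapped` are illustrative stand-ins, and the values are copied from the output above rather than read from the real files.

```python
import pandas as pd

# Toy stand-ins for LPD_to_RS_df and gene_dictionary (values from the output above).
toy_df = pd.DataFrame({
    'Gene ID (RS)': ['PD630_RS00005', 'PD630_RS00010'],
    'Gene ID (LPD)': ['Pd630_LPD00001', 'Pd630_LPD00002'],
})
toy_dictionary = {'PD630_RS00010': 'WP_005569241.1'}

# Series.map returns NaN for keys missing from the dict; fillna then supplies
# the same 'none' placeholder used in the notebook.
wp_ids = toy_df['Gene ID (RS)'].map(toy_dictionary).fillna('none')
toy_df.insert(2, 'Gene ID (WP)', wp_ids)

# Count how many RS IDs were left without a WP match.
n_unmapped = int((toy_df['Gene ID (WP)'] == 'none').sum())
```

Checking the unmapped count before saving is one way to notice if the prefix swap or the GTF parsing silently missed a large share of genes.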


In [39]:
LPD_to_RS_df.to_csv('LPD_RS_WP_gene_mapping.csv')
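
The `Unnamed: 0` column in the tables above is what pandas typically produces when a CSV that was written with its row index is read back with default settings. Since `to_csv` writes the index by default, the saved file will probably gain another such column on reload. The sketch below illustrates the difference with `index=False` on a toy frame (`toy_df` is a hypothetical stand-in); whether downstream code expects the index column is an assumption to check before changing the call above.

```python
import io

import pandas as pd

# Toy frame standing in for the real mapping table.
toy_df = pd.DataFrame({'Gene ID (RS)': ['PD630_RS00005'],
                       'Gene ID (WP)': ['none']})

# Default behaviour: the row index is written as an unlabeled first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```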