<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Manual-Curation-of-non-Aligned-Locus-Tags" data-toc-modified-id="Manual-Curation-of-non-Aligned-Locus-Tags-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Manual Curation of non-Aligned Locus Tags</a></span><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Get-information-from-GFF-file" data-toc-modified-id="Get-information-from-GFF-file-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Get information from GFF file</a></span><ul class="toc-item"><li><span><a href="#Convert-GFF-to-Pandas-DataFrame" data-toc-modified-id="Convert-GFF-to-Pandas-DataFrame-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span>Convert GFF to Pandas DataFrame</a></span></li></ul></li></ul></li><li><span><a href="#Manual-Curation-Starts-Here" data-toc-modified-id="Manual-Curation-Starts-Here-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Manual Curation Starts Here</a></span><ul class="toc-item"><li><span><a href="#Merge-and-Curate-along-1-limiting-parameter-(Start-vs-End)" data-toc-modified-id="Merge-and-Curate-along-1-limiting-parameter-(Start-vs-End)-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Merge and Curate along 1 limiting parameter (Start vs End)</a></span><ul class="toc-item"><li><span><a href="#Merge-values-by-Start-only" data-toc-modified-id="Merge-values-by-Start-only-1.2.1.1"><span class="toc-item-num">1.2.1.1&nbsp;&nbsp;</span>Merge values by Start only</a></span></li><li><span><a href="#Merge-values-by-End-only" data-toc-modified-id="Merge-values-by-End-only-1.2.1.2"><span class="toc-item-num">1.2.1.2&nbsp;&nbsp;</span>Merge values by End only</a></span></li><li><span><a href="#Find-and-Curate-Intersection" data-toc-modified-id="Find-and-Curate-Intersection-1.2.1.3"><span class="toc-item-num">1.2.1.3&nbsp;&nbsp;</span>Find and Curate Intersection</a></span></li><li><span><a href="#Curate-those-with-only-'Start'-or-'End'-Matches" data-toc-modified-id="Curate-those-with-only-'Start'-or-'End'-Matches-1.2.1.4"><span class="toc-item-num">1.2.1.4&nbsp;&nbsp;</span>Curate those with only 'Start' or 'End' Matches</a></span></li><li><span><a href="#Curate-remainder-(non-matching)-loci" data-toc-modified-id="Curate-remainder-(non-matching)-loci-1.2.1.5"><span class="toc-item-num">1.2.1.5&nbsp;&nbsp;</span>Curate remainder (non-matching) loci</a></span></li></ul></li></ul></li><li><span><a href="#Save-curated-file-for-further-manual-curation" data-toc-modified-id="Save-curated-file-for-further-manual-curation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Save curated file for further manual curation</a></span></li></ul></li></ul></div>

# Manual Curation of non-Aligned Locus Tags

The purpose of this notebook is to join the NCBI `DF_annot` and the Prokka `DF_annot` DataFrames into one unified annotation table. Further manual curation using current literature follows, as described at the end.

___
## Setup

In [1]:
import sys
sys.path.append('..')

In [2]:
from pymodulon.gene_util import *
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np
import os
from Bio import SeqIO

In [3]:
org_dir = '/Users/siddharth/PycharmProjects/modulome_saci'
kegg_organism_code = 'sai'
seq_dir = os.path.join(org_dir,'sequence_files')
sacid_seq_dir = os.path.join(org_dir,'Sacid_prokka')

### Get information from GFF file

#### Convert GFF to Pandas DataFrame

In [4]:
annot_list = []
for filename in os.listdir(seq_dir):
    if filename.endswith('.gff3'):
        gff = os.path.join(seq_dir,filename)
        annot_list.append(gff2pandas(gff))
keep_cols = ['refseq','start','end','strand','gene_name','locus_tag','old_locus_tag','gene_product','ncbi_protein']
DF_annot = pd.concat(annot_list)[keep_cols]
DF_annot = DF_annot.drop_duplicates('locus_tag')
DF_annot.set_index('locus_tag',drop=True,inplace=True)

In [5]:
annot_list = []
for filename in os.listdir(sacid_seq_dir):
    if filename.endswith('.gff'):
        gff = os.path.join(sacid_seq_dir,filename)
        annot_list.append(gff2pandas(gff))
keep_cols = ['refseq','start','end','strand','gene_name','locus_tag','old_locus_tag','gene_product','ncbi_protein']
DF_annot_sacid = pd.concat(annot_list)[keep_cols]
DF_annot_sacid = DF_annot_sacid.drop_duplicates('locus_tag')
DF_annot_sacid.set_index('locus_tag',drop=True,inplace=True)

In [6]:
tpm_file = os.path.join(org_dir,'data','log_tpm.csv')
DF_log_tpm = pd.read_csv(tpm_file,index_col=0)

Check that the genes are the same in the expression dataset as in the annotation dataframe.
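
The same check can be made programmatically rather than by comparing the `head()` outputs by eye. A minimal sketch, assuming two small toy frames that stand in for `DF_log_tpm` and `DF_annot` (the names `log_tpm` and `annot` and the expression values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for DF_log_tpm and DF_annot (illustrative values only)
log_tpm = pd.DataFrame({'sample_1': [0.0, 8.1, 11.3]},
                       index=['SACI_RS00005', 'SACI_RS00010', 'SACI_RS00015'])
annot = pd.DataFrame({'start': [101, 1294]},
                     index=['SACI_RS00005', 'SACI_RS00010'])

# Genes present in one table but missing from the other
only_in_expression = log_tpm.index.difference(annot.index)
only_in_annotation = annot.index.difference(log_tpm.index)

print('Only in expression data:', list(only_in_expression))
print('Only in annotation:', list(only_in_annotation))
```

If both lists are empty, the two tables describe the same set of genes.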

In [7]:
DF_log_tpm.head()

Geneid,ERX1518397,ERX1518398,ERX1518399,ERX3018360,ERX3018361,ERX3018362,ERX3018363,SRX2548838,SRX2548839,SRX2548840,...,SRX5653264,SRX5653265,SRX5653266,SRX5653267,SRX5653268,SRX5653269,SRX6762909,SRX6762910,SRX6762911,SRX6762912
SACI_RS00005,0.0,0.0,0.0,8.903589,8.430008,8.961871,8.59526,8.448594,8.228731,8.368168,...,6.848788,7.185195,7.376186,7.685027,6.601357,7.652005,8.639968,8.791053,8.445832,8.605442
SACI_RS00010,0.0,0.0,0.0,8.103548,7.884489,8.137422,7.954576,7.012592,8.351356,8.608128,...,6.114308,6.546701,6.703395,7.01319,5.926302,6.883948,8.100742,8.554694,8.134199,8.34369
SACI_RS00015,11.275116,11.282262,9.90887,10.824914,11.150282,10.843054,11.12013,9.559263,6.69793,7.179276,...,10.860233,10.735659,10.796574,10.874813,11.073584,10.869027,10.799551,11.091902,10.519661,10.883387
SACI_RS00020,6.535285,0.0,0.0,4.920237,5.854611,5.760206,5.885891,6.917147,5.140496,3.635002,...,5.309888,5.338677,5.418699,5.533618,5.420223,5.384989,5.386592,5.732974,5.931306,5.798582
SACI_RS00025,7.016261,0.0,0.0,7.904266,8.076969,7.998144,7.867476,7.082449,7.023425,6.744682,...,8.011013,8.092458,8.111934,8.131836,8.261467,8.238378,7.956496,7.930298,7.902762,7.961625


In [8]:
DF_annot.head()

locus_tag,refseq,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
SACI_RS00005,NC_007181.1,101,1261,+,,Saci_0001,AAA family ATPase,WP_011276932.1
SACI_RS00010,NC_007181.1,1294,1629,+,,Saci_0002,hypothetical protein,WP_011276933.1
SACI_RS00015,NC_007181.1,1665,2504,+,,Saci_0003,hypothetical protein,WP_011276934.1
SACI_RS00020,NC_007181.1,2553,3056,-,,Saci_0004,hypothetical protein,WP_015385334.1
SACI_RS00025,NC_007181.1,3049,3768,-,,Saci_0005,hypothetical protein,WP_011276936.1


In [9]:
DF_annot_sacid.head()

locus_tag,refseq,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
Sacid_00002,NC_007181.1,1294.0,1629.0,+,,,hypothetical protein,
Sacid_00003,NC_007181.1,1665.0,2504.0,+,,,hypothetical protein,
Sacid_00004,NC_007181.1,2553.0,3110.0,-,,,hypothetical protein,
Sacid_00005,NC_007181.1,3049.0,3768.0,-,,,hypothetical protein,
Sacid_00006,NC_007181.1,3801.0,4052.0,-,,,hypothetical protein,


###### Annotations differ between 'SACI' and 'Sacid'; the two must be mapped to each other to build a unified DF_annot table
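
Before committing to an inner merge, it can help to see which loci match exactly and which do not. A minimal sketch, assuming two toy frames built from the coordinates shown in the `head()` outputs above; the frame names, the `suffixes`, and the use of `indicator=True` are illustrative choices, not part of the notebook's workflow:

```python
import pandas as pd

# Two NCBI loci and the Prokka calls at the same start positions
ncbi = pd.DataFrame({'locus_tag': ['SACI_RS00010', 'SACI_RS00020'],
                     'refseq': ['NC_007181.1'] * 2,
                     'start': [1294, 2553], 'end': [1629, 3056],
                     'strand': ['+', '-']})
prokka = pd.DataFrame({'locus_tag': ['Sacid_00002', 'Sacid_00004'],
                       'refseq': ['NC_007181.1'] * 2,
                       'start': [1294, 2553], 'end': [1629, 3110],
                       'strand': ['+', '-']})

# A left merge with indicator=True keeps every NCBI locus and flags whether it matched
merged = ncbi.merge(prokka, how='left',
                    on=['refseq', 'start', 'end', 'strand'],
                    suffixes=('_ncbi', '_prokka'), indicator=True)

exact = merged.loc[merged['_merge'] == 'both', 'locus_tag_ncbi'].tolist()
unmatched = merged.loc[merged['_merge'] == 'left_only', 'locus_tag_ncbi'].tolist()
print('Exact matches:', exact)
print('Needs curation:', unmatched)
```

Here `SACI_RS00020` is left unmatched because the Prokka call shares its start but ends at a different position, which is the kind of case handled in the manual curation steps.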

In [10]:
# Initial Merge (Based on refseq, start, end, and strand columns)

DF_fin = DF_annot.reset_index().merge(DF_annot_sacid.reset_index(),
                                      how='inner',
                                      on=['refseq', 'start', 'end', 'strand'])


## Rename columns in new merged DataFrame

rename_dict = {'locus_tag_x': 'locus_tag',
               'locus_tag_y': 'prokka_locus_tag',
               'old_locus_tag_x': 'old_locus_tag',
               'ncbi_protein_x': 'ncbi_protein'}

DF_fin.rename(columns=rename_dict,
              inplace=True)

## Drop columns with only None values
DF_fin.drop(columns=['old_locus_tag_y', 'ncbi_protein_y'],
            inplace=True)

DF_fin['gene_name'] = None

In [11]:
def gene_name_cmp(x, y):
    """Combine two gene names: keep whichever is set, or join both if they differ."""
    out = None
    
    if x is None and y is not None:
        out = y
    elif x is not None and y is None:
        out = x
    
    elif x is not None and y is not None:
        if x == y:
            out = x
        
        else:
            out = str(x) + ', ' + str(y)
    
    return out


# Merge gene name columns
for idx in tqdm(DF_fin.index):
    DF_fin.loc[idx, 'gene_name'] = gene_name_cmp(DF_fin.loc[idx, 'gene_name_x'], DF_fin.loc[idx, 'gene_name_y'])

DF_fin.drop(columns=['gene_name_x', 'gene_name_y'],
            inplace=True)

DF_fin.set_index('locus_tag', inplace=True)

In [12]:
DF_fin

locus_tag,refseq,start,end,strand,old_locus_tag,gene_product_x,ncbi_protein,prokka_locus_tag,gene_product_y,gene_name
SACI_RS00010,NC_007181.1,1294,1629,+,Saci_0002,hypothetical protein,WP_011276933.1,Sacid_00002,hypothetical protein,
SACI_RS00015,NC_007181.1,1665,2504,+,Saci_0003,hypothetical protein,WP_011276934.1,Sacid_00003,hypothetical protein,
SACI_RS00025,NC_007181.1,3049,3768,-,Saci_0005,hypothetical protein,WP_011276936.1,Sacid_00005,hypothetical protein,
SACI_RS00030,NC_007181.1,3801,4052,-,Saci_0006,winged helix-turn-helix transcriptional regulator,WP_011276937.1,Sacid_00006,hypothetical protein,
SACI_RS11965,NC_007181.1,4138,4419,-,Saci_0007,hypothetical protein,WP_061972215.1,Sacid_00007,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...
SACI_RS11460,NC_007181.1,2217234,2218298,-,Saci_2370,aspartate-semialdehyde dehydrogenase,WP_011279150.1,Sacid_02368,Aspartate-semialdehyde dehydrogenase,"asd, asd_2"
SACI_RS11465,NC_007181.1,2218618,2219355,-,Saci_2371,class I SAM-dependent methyltransferase,WP_011279151.1,Sacid_02369,2-methoxy-6-polyprenyl-1%2C4-benzoquinol methy...,COQ5_5
SACI_RS11470,NC_007181.1,2219468,2220394,+,Saci_2372,ornithine cyclodeaminase family protein,WP_011279152.1,Sacid_02370,Delta(1)-pyrroline-2-carboxylate reductase,
SACI_RS11475,NC_007181.1,2220381,2220989,-,Saci_2373,cob(I)yrinic acid a%2Cc-diamide adenosyltransf...,WP_011279153.1,Sacid_02371,Cobalamin adenosyltransferase,cobO


In [13]:
# Unified DF_annot DataFrame generated and partially filled
DF_annot_union = pd.DataFrame(data=DF_fin, index=DF_annot.index)
DF_annot_union

locus_tag,refseq,start,end,strand,old_locus_tag,gene_product_x,ncbi_protein,prokka_locus_tag,gene_product_y,gene_name
SACI_RS00005,,,,,,,,,,
SACI_RS00010,NC_007181.1,1294.0,1629.0,+,Saci_0002,hypothetical protein,WP_011276933.1,Sacid_00002,hypothetical protein,
SACI_RS00015,NC_007181.1,1665.0,2504.0,+,Saci_0003,hypothetical protein,WP_011276934.1,Sacid_00003,hypothetical protein,
SACI_RS00020,,,,,,,,,,
SACI_RS00025,NC_007181.1,3049.0,3768.0,-,Saci_0005,hypothetical protein,WP_011276936.1,Sacid_00005,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...
SACI_RS11465,NC_007181.1,2218618.0,2219355.0,-,Saci_2371,class I SAM-dependent methyltransferase,WP_011279151.1,Sacid_02369,2-methoxy-6-polyprenyl-1%2C4-benzoquinol methy...,COQ5_5
SACI_RS11470,NC_007181.1,2219468.0,2220394.0,+,Saci_2372,ornithine cyclodeaminase family protein,WP_011279152.1,Sacid_02370,Delta(1)-pyrroline-2-carboxylate reductase,
SACI_RS11475,NC_007181.1,2220381.0,2220989.0,-,Saci_2373,cob(I)yrinic acid a%2Cc-diamide adenosyltransf...,WP_011279153.1,Sacid_02371,Cobalamin adenosyltransferase,cobO
SACI_RS11480,NC_007181.1,2221039.0,2224263.0,-,Saci_2374,S8 family serine peptidase,WP_011279154.1,Sacid_02372,hypothetical protein,


___
## Manual Curation Starts Here

### Merge and Curate along 1 limiting parameter (Start vs End)

#### Merge values by Start only

In [14]:
extra = set(DF_annot.index) - set(DF_fin.index)
# Index with a list: recent pandas versions reject sets as indexers
DF_extra_start = DF_annot.loc[list(extra)].reset_index().merge(DF_annot_sacid.reset_index(),
                                                   how='inner', on=['refseq', 'start', 'strand'])
DF_extra_start

,locus_tag_x,refseq,start,end_x,strand,gene_name_x,old_locus_tag_x,gene_product_x,ncbi_protein_x,locus_tag_y,end_y,gene_name_y,old_locus_tag_y,gene_product_y,ncbi_protein_y
0,SACI_RS09200,NC_007181.1,1710351,1711319,-,,Saci_1910,hypothetical protein,WP_015385714.1,Sacid_01884,1711343.0,,,hypothetical protein,
1,SACI_RS06690,NC_007181.1,1192841,1193146,-,,Saci_1399,hypothetical protein,WP_015385608.1,Sacid_01369,1193155.0,,,hypothetical protein,
2,SACI_RS06075,NC_007181.1,1082226,1083023,-,,Saci_1272,hypothetical protein,WP_015385588.1,Sacid_01249,1083035.0,,,hypothetical protein,
3,SACI_RS05450,NC_007181.1,954542,955027,-,,Saci_1143,hypothetical protein,WP_080504013.1,Sacid_01119,954898.0,,,hypothetical protein,
4,SACI_RS06335,NC_007181.1,1131771,1133401,-,,,DEAD/DEAH box helicase,,Sacid_01300,1133174.0,,,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,SACI_RS00080,NC_007181.1,11673,12182,-,,Saci_0017,methyltransferase domain-containing protein,WP_080504000.1,Sacid_00017,12041.0,,,hypothetical protein,
146,SACI_RS09000,NC_007181.1,1649867,1650301,-,,Saci_1871,hypothetical protein,WP_015385698.1,Sacid_01845,1650409.0,,,hypothetical protein,
147,SACI_RS09735,NC_007181.1,1828174,1829049,-,cas4a,,type I-A CRISPR-associated protein Cas4/Csa1,,Sacid_02007,1828665.0,,,hypothetical protein,
148,SACI_RS06295,NC_007181.1,1123028,1123243,-,,Saci_1319,hypothetical protein,WP_011278156.1,Sacid_01292,1123267.0,,,hypothetical protein,


#### Merge values by End only

In [15]:
# Index with a list: recent pandas versions reject sets as indexers
DF_extra_stop = DF_annot.loc[list(extra)].reset_index().merge(DF_annot_sacid.reset_index(),
                                                   how='inner', on=['refseq', 'end', 'strand'])
DF_extra_stop

,locus_tag_x,refseq,start_x,end,strand,gene_name_x,old_locus_tag_x,gene_product_x,ncbi_protein_x,locus_tag_y,start_y,gene_name_y,old_locus_tag_y,gene_product_y,ncbi_protein_y
0,SACI_RS10105,NC_007181.1,1909666,1910067,+,,Saci_2092,hypothetical protein,WP_015385768.1,Sacid_02088,1909657.0,,,hypothetical protein,
1,SACI_RS10360,NC_007181.1,1970572,1971624,+,,Saci_2142,amidohydrolase family protein,WP_015385780.1,Sacid_02140,1970614.0,,,hypothetical protein,
2,SACI_RS06385,NC_007181.1,1139910,1140488,+,,Saci_1336,TATA-box-binding protein,WP_080504018.1,Sacid_01310,1139955.0,,,hypothetical protein,
3,SACI_RS06825,NC_007181.1,1218041,1218610,+,,Saci_1426,aminodeoxychorismate/anthranilate synthase com...,WP_011278258.1,Sacid_01396,1218008.0,pabA,,Aminodeoxychorismate/anthranilate synthase com...,
4,SACI_RS07255,NC_007181.1,1297720,1298100,+,,Saci_1520,50S ribosomal protein L7ae,WP_011278341.1,Sacid_01487,1297711.0,,,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,SACI_RS02045,NC_007181.1,358904,359723,+,,,aldolase,,Sacid_00415,359304.0,lsrF,,3-hydroxy-5-phosphonooxypentane-2%2C4-dione th...,
156,SACI_RS09550,NC_007181.1,1788954,1790069,+,,Saci_1976,hypothetical protein,WP_011278780.1,Sacid_01965,1788969.0,,,hypothetical protein,
157,SACI_RS06045,NC_007181.1,1077222,1078061,+,sucD,Saci_1266,succinate--CoA ligase subunit alpha,WP_080504015.1,Sacid_01243,1077291.0,sucD,,Succinate--CoA ligase [ADP-forming] subunit alpha,
158,SACI_RS04415,NC_007181.1,739876,741549,+,,Saci_0924,methylmalonyl-CoA mutase family protein,WP_015385519.1,Sacid_00900,739870.0,mcm,,Methylmalonyl-CoA mutase,


#### Find and Curate Intersection

In this intersection, multiple Sacid loci map to a single SACI locus (one matching its start, another its end), so these loci are mapped manually.
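
To make the situation concrete, here is a minimal sketch using the coordinates of `SACI_RS02475` and its two Prokka fragments as they appear in the outputs of this section. The toy frames `ncbi` and `prokka` are stand-ins for `DF_annot` and `DF_annot_sacid`, and the variable names are illustrative:

```python
import pandas as pd

# One NCBI locus that Prokka calls as two shorter genes
ncbi = pd.DataFrame({'locus_tag': ['SACI_RS02475'], 'refseq': ['NC_007181.1'],
                     'start': [423985], 'end': [425845], 'strand': ['+']})
prokka = pd.DataFrame({'locus_tag': ['Sacid_00507', 'Sacid_00508'],
                       'refseq': ['NC_007181.1'] * 2,
                       'start': [423985, 424334], 'end': [424263, 425845],
                       'strand': ['+', '+']})

# Start-only and end-only merges each pick up a different Prokka fragment
by_start = ncbi.merge(prokka, how='inner', on=['refseq', 'start', 'strand'])
by_end = ncbi.merge(prokka, how='inner', on=['refseq', 'end', 'strand'])

# NCBI loci found by both merges are the ones split across several Prokka loci
common = sorted(set(by_start['locus_tag_x']) & set(by_end['locus_tag_x']))
fragments = by_start['locus_tag_y'].tolist() + by_end['locus_tag_y'].tolist()
print(common, '->', ', '.join(fragments))
```

Neither Prokka fragment alone spans the NCBI locus, which is why both tags end up in `prokka_locus_tag` for such genes.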

In [16]:
cmn_genes = set(DF_extra_start.locus_tag_x).intersection(set(DF_extra_stop.locus_tag_x))

DF_extra_start.set_index('locus_tag_x').loc[list(cmn_genes)].head()

locus_tag_x,refseq,start,end_x,strand,gene_name_x,old_locus_tag_x,gene_product_x,ncbi_protein_x,locus_tag_y,end_y,gene_name_y,old_locus_tag_y,gene_product_y,ncbi_protein_y
SACI_RS02475,NC_007181.1,423985,425845,+,,,DUF87 domain-containing protein,,Sacid_00507,424263.0,,,hypothetical protein,
SACI_RS11865,NC_007181.1,1611244,1612227,-,,,cyclase family protein,,Sacid_01812,1611591.0,,,hypothetical protein,
SACI_RS09915,NC_007181.1,1868239,1868964,-,,,CRISPR-associated RAMP protein,,Sacid_02045,1868706.0,,,hypothetical protein,
SACI_RS09730,NC_007181.1,1827072,1827638,-,cas4,,CRISPR-associated protein Cas4,,Sacid_02005,1827473.0,,,hypothetical protein,
SACI_RS05165,NC_007181.1,883177,883904,-,,,hypothetical protein,,Sacid_01058,883632.0,,,hypothetical protein,


In [17]:
DF_extra_stop.set_index('locus_tag_x').loc[list(cmn_genes)].head()

locus_tag_x,refseq,start_x,end,strand,gene_name_x,old_locus_tag_x,gene_product_x,ncbi_protein_x,locus_tag_y,start_y,gene_name_y,old_locus_tag_y,gene_product_y,ncbi_protein_y
SACI_RS02475,NC_007181.1,423985,425845,+,,,DUF87 domain-containing protein,,Sacid_00508,424334.0,,,hypothetical protein,
SACI_RS11865,NC_007181.1,1611244,1612227,-,,,cyclase family protein,,Sacid_01813,1611697.0,,,hypothetical protein,
SACI_RS09915,NC_007181.1,1868239,1868964,-,,,CRISPR-associated RAMP protein,,Sacid_02046,1868785.0,,,hypothetical protein,
SACI_RS09730,NC_007181.1,1827072,1827638,-,cas4,,CRISPR-associated protein Cas4,,Sacid_02006,1827522.0,,,hypothetical protein,
SACI_RS05165,NC_007181.1,883177,883904,-,,,hypothetical protein,,Sacid_01059,883722.0,,,hypothetical protein,


In [18]:
def gpy_cmp(pgstart, pgstop):
    hp = 'hypothetical protein'
    out = hp
    
    if pgstart == hp and pgstop != hp:
        out = pgstop
    
    elif pgstart != hp and pgstop == hp:
        out = pgstart
    
    elif pgstart != hp and pgstop != hp:        
        if pgstart == pgstop:
            out = pgstart
        
        else:
            out = pgstart + ', '+ pgstop
    
    return out

# gene_name, gene_product_y merge
for gene in cmn_genes:
    pgstart = DF_extra_start.set_index('locus_tag_x').loc[gene, 'gene_product_y']
    pgstop = DF_extra_stop.set_index('locus_tag_x').loc[gene, 'gene_product_y']
    
    start_name = gene_name_cmp(DF_extra_start.set_index('locus_tag_x').loc[gene, 'gene_name_x'],
                               DF_extra_start.set_index('locus_tag_x').loc[gene, 'gene_name_y'])
    
    stop_name = gene_name_cmp(DF_extra_stop.set_index('locus_tag_x').loc[gene, 'gene_name_x'],
                              DF_extra_stop.set_index('locus_tag_x').loc[gene, 'gene_name_y'])
    
    DF_annot_union.loc[gene, 'gene_product_y'] = gpy_cmp(pgstart, pgstop).replace('%2C', ',')
    DF_annot_union.loc[gene, 'gene_name'] = gene_name_cmp(start_name, stop_name)

In [19]:
# Use list indexers (recent pandas versions reject sets) and index each merge result once
cmn_list = list(cmn_genes)
start_by_tag = DF_extra_start.set_index('locus_tag_x')
stop_by_tag = DF_extra_stop.set_index('locus_tag_x')

# refseq, strand, ncbi_protein, old_locus_tag, gene_product_x
DF_annot_union.loc[cmn_list, 'refseq'] = 'NC_007181.1'

DF_annot_union.loc[cmn_list, 'old_locus_tag'] = DF_annot.loc[cmn_list, 'old_locus_tag']

DF_annot_union.loc[cmn_list, 'strand'] = start_by_tag.loc[cmn_list, 'strand']

DF_annot_union.loc[cmn_list, 'ncbi_protein'] = start_by_tag.loc[cmn_list, 'ncbi_protein_x']

DF_annot_union.loc[cmn_list, 'gene_product_x'] = start_by_tag.loc[cmn_list, 'gene_product_x']

# start, end
sstart = start_by_tag.loc[cmn_list]
DF_annot_union.loc[sstart.index, 'start'] = sstart.start
DF_annot_union.loc[sstart.index, 'end'] = sstart.end_x


# prokka_locus_tag (SACI_RS06870 maps to three Prokka loci and is handled separately)
no_SACI_RS06870 = list(set(cmn_genes) - {'SACI_RS06870'})

prokka1 = start_by_tag.loc[no_SACI_RS06870, 'locus_tag_y']
prokka2 = stop_by_tag.loc[no_SACI_RS06870, 'locus_tag_y']

DF_annot_union.loc[no_SACI_RS06870, 'prokka_locus_tag'] = prokka1 + ', '+ prokka2

# Manually curate prokka_locus_tag of SACI_RS06870
DF_annot_union.loc['SACI_RS06870', 'prokka_locus_tag'] = 'Sacid_01405, Sacid_01406, Sacid_01407'

In [20]:
DF_annot_union.loc[list(cmn_genes)]

locus_tag,refseq,start,end,strand,old_locus_tag,gene_product_x,ncbi_protein,prokka_locus_tag,gene_product_y,gene_name
SACI_RS02475,NC_007181.1,423985.0,425845.0,+,,DUF87 domain-containing protein,,"Sacid_00507, Sacid_00508",hypothetical protein,
SACI_RS11865,NC_007181.1,1611244.0,1612227.0,-,,cyclase family protein,,"Sacid_01812, Sacid_01813",hypothetical protein,
SACI_RS09915,NC_007181.1,1868239.0,1868964.0,-,,CRISPR-associated RAMP protein,,"Sacid_02045, Sacid_02046",hypothetical protein,
SACI_RS09730,NC_007181.1,1827072.0,1827638.0,-,,CRISPR-associated protein Cas4,,"Sacid_02005, Sacid_02006",hypothetical protein,cas4
SACI_RS05165,NC_007181.1,883177.0,883904.0,-,,hypothetical protein,,"Sacid_01058, Sacid_01059",hypothetical protein,
SACI_RS06870,NC_007181.1,1225387.0,1227549.0,-,,malto-oligosyltrehalose synthase,,"Sacid_01405, Sacid_01406, Sacid_01407",Maltooligosyl trehalose synthase,"treY, treY, treY_2"
SACI_RS10510,NC_007181.1,2007741.0,2009515.0,+,,S9 family peptidase,,"Sacid_02171, Sacid_02172",hypothetical protein,
SACI_RS02045,NC_007181.1,358904.0,359723.0,+,,aldolase,,"Sacid_00414, Sacid_00415","2-amino-3,7-dideoxy-D-threo-hept-6-ulosonate s...","aroA', lsrF"
SACI_RS09735,NC_007181.1,1828174.0,1829049.0,-,,type I-A CRISPR-associated protein Cas4/Csa1,,"Sacid_02007, Sacid_02008",hypothetical protein,cas4a
SACI_RS03755,NC_007181.1,626583.0,627977.0,+,,FAD-dependent oxidoreductase,,"Sacid_00763, Sacid_00764",hypothetical protein,


#### Curate those with only 'Start' or 'End' Matches

In [21]:
DF_extra_start.rename(columns=rename_dict, inplace=True)
DF_extra_start.set_index('locus_tag', inplace=True)

DF_extra_start.drop(index=cmn_genes, inplace=True)
DF_extra_start.drop(columns=['old_locus_tag_y', 'ncbi_protein_y'],
                    inplace=True)


# gene_name merge
for gene in DF_extra_start.index:
    gene_name_x = DF_extra_start.loc[gene, 'gene_name_x']
    gene_name_y = DF_extra_start.loc[gene, 'gene_name_y']
    
    DF_annot_union.loc[gene, 'gene_name'] = gene_name_cmp(gene_name_x, gene_name_y)


# refseq, strand, ncbi_protein, old_locus_tag, gene_product_x, gene_product_y
DF_annot_union.loc[DF_extra_start.index, 'refseq'] = 'NC_007181.1'
DF_annot_union.loc[DF_extra_start.index, 'old_locus_tag'] = DF_annot.loc[DF_extra_start.index, 'old_locus_tag']
DF_annot_union.loc[DF_extra_start.index, 'strand'] = DF_extra_start['strand']
DF_annot_union.loc[DF_extra_start.index, 'ncbi_protein'] = DF_extra_start['ncbi_protein']
DF_annot_union.loc[DF_extra_start.index, 'gene_product_x'] = DF_extra_start['gene_product_x']
DF_annot_union.loc[DF_extra_start.index, 'gene_product_y'] = DF_extra_start['gene_product_y']

# start, end
DF_annot_union.loc[DF_extra_start.index, 'start'] = DF_extra_start['start']
DF_annot_union.loc[DF_extra_start.index, 'end'] = DF_extra_start['end_x']

# prokka_locus_tag
DF_annot_union.loc[DF_extra_start.index, 'prokka_locus_tag'] = DF_extra_start['prokka_locus_tag']

In [22]:
DF_annot_union.loc[DF_extra_start.index]

locus_tag,refseq,start,end,strand,old_locus_tag,gene_product_x,ncbi_protein,prokka_locus_tag,gene_product_y,gene_name
SACI_RS09200,NC_007181.1,1710351.0,1711319.0,-,Saci_1910,hypothetical protein,WP_015385714.1,Sacid_01884,hypothetical protein,
SACI_RS06690,NC_007181.1,1192841.0,1193146.0,-,Saci_1399,hypothetical protein,WP_015385608.1,Sacid_01369,hypothetical protein,
SACI_RS06075,NC_007181.1,1082226.0,1083023.0,-,Saci_1272,hypothetical protein,WP_015385588.1,Sacid_01249,hypothetical protein,
SACI_RS05450,NC_007181.1,954542.0,955027.0,-,Saci_1143,hypothetical protein,WP_080504013.1,Sacid_01119,hypothetical protein,
SACI_RS06335,NC_007181.1,1131771.0,1133401.0,-,,DEAD/DEAH box helicase,,Sacid_01300,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...
SACI_RS00625,NC_007181.1,103217.0,103636.0,-,Saci_0132,hypothetical protein,WP_015385365.1,Sacid_00127,hypothetical protein,
SACI_RS00080,NC_007181.1,11673.0,12182.0,-,Saci_0017,methyltransferase domain-containing protein,WP_080504000.1,Sacid_00017,hypothetical protein,
SACI_RS09000,NC_007181.1,1649867.0,1650301.0,-,Saci_1871,hypothetical protein,WP_015385698.1,Sacid_01845,hypothetical protein,
SACI_RS06295,NC_007181.1,1123028.0,1123243.0,-,Saci_1319,hypothetical protein,WP_011278156.1,Sacid_01292,hypothetical protein,


In [23]:
DF_extra_stop.rename(columns=rename_dict, inplace=True)
DF_extra_stop.set_index('locus_tag', inplace=True)

DF_extra_stop.drop(index=cmn_genes, inplace=True)
DF_extra_stop.drop(columns=['old_locus_tag_y', 'ncbi_protein_y'],
                   inplace=True)


# gene_name merge
for gene in DF_extra_stop.index:
    gene_name_x = DF_extra_stop.loc[gene, 'gene_name_x']
    gene_name_y = DF_extra_stop.loc[gene, 'gene_name_y']
    
    DF_annot_union.loc[gene, 'gene_name'] = gene_name_cmp(gene_name_x, gene_name_y)


# refseq, strand, ncbi_protein, old_locus_tag, gene_product_x, gene_product_y
DF_annot_union.loc[DF_extra_stop.index, 'refseq'] = 'NC_007181.1'
DF_annot_union.loc[DF_extra_stop.index, 'old_locus_tag'] = DF_annot.loc[DF_extra_stop.index, 'old_locus_tag']
DF_annot_union.loc[DF_extra_stop.index, 'strand'] = DF_extra_stop['strand']
DF_annot_union.loc[DF_extra_stop.index, 'ncbi_protein'] = DF_extra_stop['ncbi_protein']
DF_annot_union.loc[DF_extra_stop.index, 'gene_product_x'] = DF_extra_stop['gene_product_x']
DF_annot_union.loc[DF_extra_stop.index, 'gene_product_y'] = DF_extra_stop['gene_product_y']

# start, end
DF_annot_union.loc[DF_extra_stop.index, 'start'] = DF_extra_stop['start_x']
DF_annot_union.loc[DF_extra_stop.index, 'end'] = DF_extra_stop['end']

# prokka_locus_tag
DF_annot_union.loc[DF_extra_stop.index, 'prokka_locus_tag'] = DF_extra_stop['prokka_locus_tag']

In [24]:
DF_annot_union.loc[DF_extra_stop.index]

locus_tag,refseq,start,end,strand,old_locus_tag,gene_product_x,ncbi_protein,prokka_locus_tag,gene_product_y,gene_name
SACI_RS10105,NC_007181.1,1909666.0,1910067.0,+,Saci_2092,hypothetical protein,WP_015385768.1,Sacid_02088,hypothetical protein,
SACI_RS10360,NC_007181.1,1970572.0,1971624.0,+,Saci_2142,amidohydrolase family protein,WP_015385780.1,Sacid_02140,hypothetical protein,
SACI_RS06385,NC_007181.1,1139910.0,1140488.0,+,Saci_1336,TATA-box-binding protein,WP_080504018.1,Sacid_01310,hypothetical protein,
SACI_RS06825,NC_007181.1,1218041.0,1218610.0,+,Saci_1426,aminodeoxychorismate/anthranilate synthase com...,WP_011278258.1,Sacid_01396,Aminodeoxychorismate/anthranilate synthase com...,pabA
SACI_RS07255,NC_007181.1,1297720.0,1298100.0,+,Saci_1520,50S ribosomal protein L7ae,WP_011278341.1,Sacid_01487,hypothetical protein,
...,...,...,...,...,...,...,...,...,...,...
SACI_RS02390,NC_007181.1,412079.0,412927.0,+,Saci_0495,hypothetical protein,WP_011277403.1,Sacid_00488,hypothetical protein,
SACI_RS10450,NC_007181.1,1994871.0,1995689.0,+,Saci_2161,DUF2079 domain-containing protein,WP_015385783.1,Sacid_02159,hypothetical protein,
SACI_RS09550,NC_007181.1,1788954.0,1790069.0,+,Saci_1976,hypothetical protein,WP_011278780.1,Sacid_01965,hypothetical protein,
SACI_RS06045,NC_007181.1,1077222.0,1078061.0,+,Saci_1266,succinate--CoA ligase subunit alpha,WP_080504015.1,Sacid_01243,Succinate--CoA ligase [ADP-forming] subunit alpha,sucD


#### Curate remainder (non-matching) loci

The remaining 56 non-matching loci have no intersecting (or even approximate) matches with Prokka tags, so they are filled in using only the NCBI data from DF_annot.
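
The fill-in step amounts to selecting the still-empty rows with a boolean mask and copying the NCBI columns across. A minimal sketch on toy data, assuming hypothetical frames `ncbi` and `union` that stand in for `DF_annot` and `DF_annot_union`, with a hypothetical `col_map` dictionary pairing each target column with its NCBI source column:

```python
import pandas as pd

# Toy stand-ins: SACI_RS00005 has no Prokka match, SACI_RS00010 was already merged
ncbi = pd.DataFrame({'refseq': ['NC_007181.1', 'NC_007181.1'],
                     'start': [101, 1294],
                     'gene_product': ['AAA family ATPase', 'hypothetical protein']},
                    index=['SACI_RS00005', 'SACI_RS00010'])
union = pd.DataFrame({'refseq': [None, 'NC_007181.1'],
                      'start': [float('nan'), 1294.0],
                      'gene_product_x': [None, 'hypothetical protein'],
                      'prokka_locus_tag': [None, 'Sacid_00002']},
                     index=ncbi.index)

# Rows that no merge has filled yet
remainder = union['refseq'].isna()

# Copy each NCBI column into its counterpart, touching only the unfilled rows
col_map = {'refseq': 'refseq', 'start': 'start', 'gene_product_x': 'gene_product'}
for union_col, ncbi_col in col_map.items():
    union.loc[remainder, union_col] = ncbi.loc[remainder, ncbi_col]

# Prokka-derived columns stay empty for these loci
print(union.loc[remainder])
print('Rows still unfilled:', int(union['refseq'].isna().sum()))
```

Computing the mask once, before any column is filled, matters: recomputing it after `refseq` has been written would select no rows.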

In [25]:
remainder = DF_annot_union['refseq'].isna()

# Match through DF_annot (ncbi seq)
DF_annot_union.loc[remainder, 'refseq'] = DF_annot.loc[remainder, 'refseq']
DF_annot_union.loc[remainder, 'start'] = DF_annot.loc[remainder, 'start']
DF_annot_union.loc[remainder, 'end'] = DF_annot.loc[remainder, 'end']
DF_annot_union.loc[remainder, 'strand'] = DF_annot.loc[remainder, 'strand']
DF_annot_union.loc[remainder, 'gene_name'] = DF_annot.loc[remainder, 'gene_name']
DF_annot_union.loc[remainder, 'old_locus_tag'] = DF_annot.loc[remainder, 'old_locus_tag']
DF_annot_union.loc[remainder, 'gene_product_x'] = DF_annot.loc[remainder, 'gene_product']
DF_annot_union.loc[remainder, 'ncbi_protein'] = DF_annot.loc[remainder, 'ncbi_protein']

# No match via prokka, these are left as None
DF_annot_union.loc[remainder, 'prokka_locus_tag'] = None
DF_annot_union.loc[remainder, 'gene_product_y'] = None

In [26]:
DF_annot_union = DF_annot_union[['refseq', 'start', 'end', 'strand',
                                 'gene_name', 'old_locus_tag', 'prokka_locus_tag',
                                 'gene_product_x', 'gene_product_y', 'ncbi_protein']]


# Decode percent-encoded commas ('%2C') in the product descriptions; missing values stay missing
for col in ['gene_product_x', 'gene_product_y']:
    DF_annot_union[col] = DF_annot_union[col].str.replace('%2C', ',', regex=False)


DF_annot_union.rename(columns={'gene_product_x': 'gene_product', 'gene_product_y': 'gene_product_prokka'},
                      inplace=True)

DF_annot_union

locus_tag,refseq,start,end,strand,gene_name,old_locus_tag,prokka_locus_tag,gene_product,gene_product_prokka,ncbi_protein
SACI_RS00005,NC_007181.1,101.0,1261.0,+,,Saci_0001,,AAA family ATPase,,WP_011276932.1
SACI_RS00010,NC_007181.1,1294.0,1629.0,+,,Saci_0002,Sacid_00002,hypothetical protein,hypothetical protein,WP_011276933.1
SACI_RS00015,NC_007181.1,1665.0,2504.0,+,,Saci_0003,Sacid_00003,hypothetical protein,hypothetical protein,WP_011276934.1
SACI_RS00020,NC_007181.1,2553.0,3056.0,-,,Saci_0004,Sacid_00004,hypothetical protein,hypothetical protein,WP_015385334.1
SACI_RS00025,NC_007181.1,3049.0,3768.0,-,,Saci_0005,Sacid_00005,hypothetical protein,hypothetical protein,WP_011276936.1
...,...,...,...,...,...,...,...,...,...,...
SACI_RS11465,NC_007181.1,2218618.0,2219355.0,-,COQ5_5,Saci_2371,Sacid_02369,class I SAM-dependent methyltransferase,"2-methoxy-6-polyprenyl-1,4-benzoquinol methyla...",WP_011279151.1
SACI_RS11470,NC_007181.1,2219468.0,2220394.0,+,,Saci_2372,Sacid_02370,ornithine cyclodeaminase family protein,Delta(1)-pyrroline-2-carboxylate reductase,WP_011279152.1
SACI_RS11475,NC_007181.1,2220381.0,2220989.0,-,cobO,Saci_2373,Sacid_02371,"cob(I)yrinic acid a,c-diamide adenosyltransferase",Cobalamin adenosyltransferase,WP_011279153.1
SACI_RS11480,NC_007181.1,2221039.0,2224263.0,-,,Saci_2374,Sacid_02372,S8 family serine peptidase,hypothetical protein,WP_011279154.1


___
## Save curated file for further manual curation

Genes with no name will take the `old_locus_tag` value. If that is also empty, they will take the `locus_tag` value.

A `synonyms` column will be added for genes with multiple names.

Additional manual curation will also be performed based on genes/TFs reported in recent literature.
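
The naming fallback described above could be sketched as follows. This is only an illustration on a three-row toy frame (`annot` is a hypothetical stand-in for the curated table), covering the fallback chain but not the `synonyms` column or the literature-based curation:

```python
import pandas as pd

# Toy rows: one named gene, one with only an old locus tag, one with neither
annot = pd.DataFrame({'gene_name': ['cobO', None, None],
                      'old_locus_tag': ['Saci_2373', 'Saci_0001', None]},
                     index=pd.Index(['SACI_RS11475', 'SACI_RS00005', 'SACI_RS06335'],
                                    name='locus_tag'))

# Fallback chain: gene_name -> old_locus_tag -> locus_tag (the index)
annot['gene_name'] = (annot['gene_name']
                      .fillna(annot['old_locus_tag'])
                      .fillna(annot.index.to_series()))

print(annot['gene_name'].tolist())
```

Chaining `fillna` keeps the priority order explicit: each step only touches names that are still missing after the previous one.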

In [27]:
DF_annot_union.to_csv('../data/DF_annot_curated_1.tsv', sep='\t')