# General Genomic Structure of Commensalibacter and Bombella

This notebook contains the Python functions used for the associated report.
Genomes used:
    - Bombella 360 PacBio assembly: 'Bombella_PacBio_genes_localization.txt'
    - Bombella ESL0368: 'Ga0216350.gff'
Gene lists used for the ESL0368 genome:
    - Metabolism: 'metabo_368'
    - Defense: 'defense_368'
    - Housekeeping: 'Housekeeping_news.txt'
    - Conserved: 'conserved_position_in_Pacbio'

## - Get_real_pos:

Takes a gene ID and returns its start and end positions in the new PacBio gene localization file.

In [4]:
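def get_real_pos(gene_id):
    # Scan the PacBio localization file for the gene ID (column 1)
    # and return its start and end (columns 9 and 10) as strings.
    with open('Bombella_PacBio_genes_localization.txt') as file:
        for line in file:
            tmp=line.split()
            if len(tmp)>9 and tmp[0]==gene_id:
                return (tmp[8],tmp[9])
    return None  # gene ID not found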
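
Because `get_real_pos` rescans the localization file for every gene ID, a long gene list reads the file once per gene. The sketch below shows one possible alternative that reads the lines once and builds a lookup table. It assumes the same whitespace-separated layout (ID in column 1, start and end in columns 9 and 10); `build_position_index` is a hypothetical helper that is not part of the original notebook, and the example lines are made up.

```python
def build_position_index(lines):
    """Map gene ID -> (start, end) from localization lines.

    Assumes whitespace-separated columns with the gene ID in column 1
    and start/end in columns 9 and 10, as read by get_real_pos.
    """
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        # keep the first occurrence of each ID, as get_real_pos does
        index.setdefault(fields[0], (fields[8], fields[9]))
    return index


# Made-up example lines, for illustration only
example_lines = [
    "geneA c2 c3 c4 c5 c6 c7 c8 100 250",
    "geneB c2 c3 c4 c5 c6 c7 c8 900 400",
]
position_index = build_position_index(example_lines)
```

An open file handle can be passed in place of `example_lines`, since the function only iterates over lines.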

## - Get_product_name:

Takes a gene ID and returns its product name from the ESL0368 annotation.


In [5]:
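def get_product_name(gene_id):
    product_name=None  # returned as-is if the gene ID is not found
    with open('Ga0216350.gff') as file:
        for line in file:
            tmp=line.split()
            if len(tmp)>8:  # tmp[8] holds the attributes, starting with 'ID='
                id2=tmp[8][3:13]
                if id2==gene_id:
                    if tmp[2]=='tRNA':
                        product_name='tRNA'
                    else:
                        # everything after 'product=' up to the end of the line
                        product_name=line.strip()[line.find('product=')+8:]
    return product_name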
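
Product names can contain spaces, which is why the function above slices the raw line instead of relying on the whitespace split. As a rough sketch of another way to read the same fields, the attribute column can be parsed into a dictionary. This assumes the file follows the usual GFF3 layout (tab-separated columns, `key=value` pairs separated by `;` in column 9); `parse_gff_attributes` is a hypothetical helper and the example line, including its ID, is made up.

```python
def parse_gff_attributes(line):
    """Return the attributes of one GFF line as a dict.

    Assumes tab-separated GFF3 with 'key=value' pairs separated by ';'
    in column 9. Returns an empty dict for comment or short lines.
    """
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 9:
        return {}
    attributes = {}
    for pair in fields[8].split(";"):
        key, sep, value = pair.partition("=")
        if sep:  # ignore fragments without '='
            attributes[key.strip()] = value.strip()
    return attributes


# Made-up GFF line, for illustration only
example = (
    "contig1\tsrc\tCDS\t10\t90\t.\t+\t0\t"
    "ID=0000000001;product=DNA polymerase III subunit beta"
)
attrs = parse_gff_attributes(example)
```

With this approach a missing `product` key can be detected with `attrs.get('product')` rather than by slicing at a fixed offset.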

## - Get_pos_product:

Uses the two previous functions to return the location of a given gene in the PacBio assembly and its product name as annotated in ESL0368. The smaller coordinate is always written first.


Outputs a tab-delimited file following the structure: "gene_start \t gene_end \t product_name"

In [6]:
def get_pos_product(liste_genes_id, output_txt):
    # read the gene IDs, one per line, skipping blank lines
    with open(liste_genes_id) as file:
        liste=[line.strip() for line in file if line.strip()]

    # the with-block closes the output file, so every row reaches the disk
    with open(output_txt,'w') as txt_file:
        for gene_id in liste:
            pos=get_real_pos(gene_id)
            if pos is None:
                continue  # gene ID absent from the PacBio file
            prod=get_product_name(gene_id)
            # write the smaller coordinate first
            start,end=sorted((int(pos[0]),int(pos[1])))
            txt_file.write(str(start)+'\t'+str(end)+'\t'+str(prod)+'\n')
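
To make the output format concrete, here is a small sketch of how one row is built from a start, an end, and a product name. `format_row` is a hypothetical helper, not part of the original notebook, and the values are made up.

```python
def format_row(start, end, product):
    """Build one 'gene_start<TAB>gene_end<TAB>product_name' row.

    Coordinates are converted to integers and ordered so that the
    smaller one comes first, whatever the strand of the hit.
    """
    low, high = sorted((int(start), int(end)))
    return f"{low}\t{high}\t{product}\n"


# Made-up values: a gene whose start is larger than its end
row = format_row("900", "400", "hypothetical protein")
```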

# - Extract information to plot genes in BRIG

### Metabolism genes, defense genes, housekeeping genes:

In [7]:
get_pos_product('metabo_368', 'V4_metabo_368.txt')
get_pos_product('defense_368', 'V4_defense_368.txt')


# clean the housekeeping file before use
# note: strip('v=') removes any leading or trailing 'v' and '=' characters,
# not only a literal 'v=' prefix
with open('Housekeeping_news.txt') as file, open('housekeeping_id_v2.txt','w') as txt_file:
    for line in file:
        txt_file.write(line.strip('v='))


get_pos_product('housekeeping_id_v2.txt', 'V4_Housekeeping_368.txt')
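
The cleaning step relies on `str.strip`, which treats its argument as a set of characters rather than as a prefix. The sketch below contrasts it with an explicit prefix removal. It assumes each line of the housekeeping file begins with a literal `v=`, which the notebook does not state; `drop_prefix` is a hypothetical helper and the ID is made up.

```python
def drop_prefix(text, prefix="v="):
    """Remove one literal prefix from text, leaving the rest untouched."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


# Made-up ID that itself starts with 'v' and ends with '='
raw = "v=vgene01="

with_strip = raw.strip("v=")    # strips every leading/trailing 'v' or '='
with_prefix = drop_prefix(raw)  # strips only the literal 'v=' prefix
```

The two results differ only when an ID starts or ends with `v` or `=`, so the original call is safe as long as no ID does.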

### Conserved genes
Positions of conserved domains were extracted from genoPlotR and stored in the file V3_syntenic_genes.txt.


In [9]:
# load the conserved intervals once, as integer (start, end) pairs
with open('conserved_position_in_Pacbio') as positions:
    intervals=[(int(t[0]),int(t[1])) for t in (l.split() for l in positions) if len(t)>1]

conserved_genes=[]

# 1) get gene_id in PacBio
with open('Bombella_PacBio_genes_localization.txt') as pacbio_file, open('V4_Syntenic_genes.txt','w') as txt_file:
    for line in pacbio_file:
        tmp=line.split()
        gene_id=tmp[0]
        # compare coordinates as integers; string comparison is lexicographic
        start=int(tmp[8])
        end=int(tmp[9])

        for low,high in intervals:
            if low<start<high:
                conserved_genes.append(gene_id) # store gene id to test overlap with housekeeping genes
                prod=get_product_name(gene_id) # get product name in ESL0368
                if start>end:
                    w=str(end)+'\t'+str(start)+'\t'+str(prod)+'\n'
                else:
                    w=str(start)+'\t'+str(end)+'\t'+str(prod)+'\n'
                txt_file.write(w)
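
Coordinates read from a text file are strings, and Python compares strings character by character, so an interval test on raw fields can give a different answer from the numeric one. A minimal sketch of the difference, using made-up coordinates and a hypothetical helper `in_interval`:

```python
def in_interval(position, interval_start, interval_end):
    """Return True if position lies strictly inside the interval.

    All three arguments may be strings or integers; they are
    converted to integers before comparison.
    """
    return int(interval_start) < int(position) < int(interval_end)


# Made-up coordinates: 9000 lies between 100 and 20000
as_strings = "100" < "9000" < "20000"               # lexicographic comparison
as_integers = in_interval("9000", "100", "20000")   # numeric comparison
```

Here the string comparison fails because `"9"` sorts after `"2"`, even though 9000 is numerically inside the interval.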
