# Make Synthetic Genome
Author : Mathieu Giguere \
Date : 02/07/2024 \
Brief : Strings together the genes of interest of several species into a single synthetic genome. \
Dependencies : re, pandas

## Plan

### 1. Read the genome files (FASTA) -> one big string
### 2. Read the GFF files -> extract the coordinates of the genes of interest
### 3. For each genome, look up the base subsequence of each gene of interest and append it followed by a spacer 'NNNNNNNNNN'
### 4. Append all the strings into one synthetic genome with a suitable 'N+' spacer
### 5. Write the synthetic genome file in FASTA format
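
Steps 1 and 5 of the plan can be sketched with two small helpers. This is only an illustration under stated assumptions: `read_fasta` and `format_fasta` are hypothetical names that do not appear in the notebook, the record name `synthetic_genome` is made up, and the toy sequences stand in for real genome files.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}.

    The id is taken as the first word after '>' (an assumption about the headers).
    """
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Join the wrapped lines of each record into one continuous string.
    return {name: "".join(chunks) for name, chunks in records.items()}


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(wrapped) + "\n"


# Toy input with two records (hypothetical data).
example = ">chr1 toy record\nACGT\nACGT\n>chr2\nGGCC\n"
genome = read_fasta(example.splitlines())

# Join the two toy sequences with a 10-N spacer and format them as one record.
record = format_fasta("synthetic_genome", genome["chr1"] + "N" * 10 + genome["chr2"], width=10)
print(record)
```

Keeping one string per sequence id, rather than one string per file, preserves the per-chromosome coordinate system that GFF files use.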

In [1]:
import re
import pandas as pd

In [2]:
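# NOTE: the file is opened as plain text, so despite its .xlsx extension it must be
# tab-delimited text; a real Excel workbook would need pd.read_excel instead.
with open("genes-species.xlsx") as xlsx:
    xlsx_content = xlsx.read()

# Convert the tab-delimited table to comma-separated values.
csv = xlsx_content.replace("\t", ",")

# Write the converted table; the context manager closes the file.
with open("gene-species.csv", "w") as new_csv:
    new_csv.write(csv)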
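
As an alternative sketch, a tab-delimited table can be loaded directly with `pd.read_csv(..., sep="\t")`, which avoids the intermediate CSV and cannot be broken by a comma inside a field. The table content below is a toy stand-in, not the real gene table, and `tab_text` and `table` are hypothetical names.

```python
import io

import pandas as pd

# Toy stand-in for the tab-delimited gene table (hypothetical content).
tab_text = (
    "Species\tGene ID\n"
    "Candida albicans\tC5_00660C_A\n"
    "Candida auris\tB9J08_001448\n"
)

# sep="\t" tells pandas that fields are separated by tabs.
table = pd.read_csv(io.StringIO(tab_text), sep="\t")
print(table)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.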

In [3]:
df = pd.read_csv("gene-species.csv")

df = df.groupby('Species')['Gene ID'].apply(list).reset_index()

df

,Species,Gene ID
0,Aspergillus fumigatus,"[AFUA_4G06890, AFUA_6G05140, AFUA_2G00320, AFU..."
1,Candida albicans,"[C5_00660C_A, C1_04770C_A, C1_02420C_A, CR_008..."
2,Candida auris,"[B9J08_001448, B9J08_003737, B9J08_000964, B9J..."
3,Candida parapsilosis,"[CPAR2_303740, CPAR2_105550, CPAR2_106400, CPA..."
4,Candida tropicalis,"[CTRG_05283, CTRG_04480, CTRG_04661, CTRG_0268..."
5,Cryptococcus neoformans,"[CNA00300, CNN02320, CNE02100, CNA05950, CNF04..."
6,Nasakeomyces glabrata,"[CAGL0E04334g, CAGL0F01793g, CAGL0G01034g, CAG..."
7,Pichia kudriavzevii,"[JL09_g2508, JL09_g3074, JL09_g1956, JL09_g200..."


In [4]:

species_genome_file_list = ["fasta_files/unzipped/Aspergillus_fumigatus.ASM265v1.dna.toplevel.fa",
                           "fasta_files/unzipped/C_albicans_SC5314_version_A22-s07-m01-r195_chromosomes.fasta",
                           "fasta_files/unzipped/C_auris_B8441_version_s01-m03-r08_chromosomes.fasta",
                           "fasta_files/unzipped/C_glabrata_CBS138_version_s05-m03-r06_chromosomes.fasta",
                           "fasta_files/unzipped/C_parapsilosis_CDC317_version_s01-m06-r03_chromosomes.fasta",
                           "fasta_files/unzipped/Candida_tropicalis.GCA000006335v3.dna.toplevel.fa",
                           "fasta_files/unzipped/Cryptococcus_neoformans.ASM9104v1.dna.toplevel.fa",
                           "fasta_files/unzipped/Pichia_kudriavzevii_gca_000764455.ASM76445v1.dna.toplevel.fa"]

species_list = ["Aspergillus fumigatus", "Candida albicans", "Candida auris", "Nasakeomyces glabrata", "Candida parapsilosis", "Candida tropicalis", "Cryptococcus neoformans", "Pichia kudriavzevii"]

gff_files_list = ["gff_files/Aspergillus_fumigatus.ASM265v1.59.gff3",
                 "gff_files/C_albicans_SC5314_version_A22-s07-m01-r195.gff",
                 "gff_files/C_auris_B8441_version_s01-m03-r08.gff",
                 "gff_files/C_glabrata_CBS138_version_s05-m03-r06.gff",
                 "gff_files/C_parapsilosis_CDC317_version_s01-m06-r03.gff",
                 "gff_files/Candida_tropicalis.GCA000006335v3.59.gff3",
                 "gff_files/Cryptococcus_neoformans.ASM9104v1.59.gff3",
                 "gff_files/Pichia_kudriavzevii_gca_000764455.ASM76445v1.59.gff3"]

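
The three lists above are matched by position, so a missing or misordered entry would silently pair a species with the wrong files. A minimal sketch of one way to guard against that follows; `SpeciesInput` and `pair_inputs` are hypothetical names, and the short paths are placeholders rather than the real file names.

```python
from typing import List, NamedTuple


class SpeciesInput(NamedTuple):
    """One species together with its genome and annotation files."""
    species: str
    fasta_path: str
    gff_path: str


def pair_inputs(species: List[str], fasta_paths: List[str], gff_paths: List[str]) -> List[SpeciesInput]:
    """Zip the three parallel lists after checking that they have the same length."""
    if not (len(species) == len(fasta_paths) == len(gff_paths)):
        raise ValueError(
            f"list lengths differ: {len(species)} species, "
            f"{len(fasta_paths)} fasta files, {len(gff_paths)} gff files"
        )
    return [SpeciesInput(*row) for row in zip(species, fasta_paths, gff_paths)]


# Placeholder inputs (hypothetical paths).
inputs = pair_inputs(
    ["Candida albicans", "Candida auris"],
    ["albicans.fasta", "auris.fasta"],
    ["albicans.gff", "auris.gff"],
)
for item in inputs:
    print(item.species, item.fasta_path, item.gff_path)
```

The length check catches a missing entry, but not two entries swapped within a list, so the order still has to be verified by eye.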
In [5]:
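synth = ""
spacer = "NNNNNNNNN"

for i, file in enumerate(species_genome_file_list):
    species = species_list[i]
    print(species)
    with open(file) as genome:
        genome_content = genome.read()

    # Replace every FASTA header line with a single space. Line breaks inside the
    # sequence are kept at this point, so they still count as positions in `dna`.
    dna = re.sub(">.*\\n", " ", genome_content)

    gff_file = gff_files_list[i]
    with open(gff_file) as gff:
        gff_content = gff.read()

    gene_list = df['Gene ID'][df['Species'] == species].values[0]

    for g in gene_list:
        print(g)

        # Find the GFF feature line for this gene. re.escape keeps IDs that contain
        # regex metacharacters (e.g. the "." in B9J08_001055.75) literal.
        y = re.findall(f"gene\\t.*?{re.escape(g)};", gff_content)
        # The first two numbers after the feature type are the start and end columns.
        coordinates = re.findall("[0-9]+", y[0])
        begin = int(coordinates[0])
        end = int(coordinates[1])

        # Slice the concatenated text of the whole file, then drop line breaks.
        # NOTE: GFF coordinates are 1-based, inclusive and relative to a single
        # sequence, so this slice is not guaranteed to match the annotated gene.
        gene_coor = dna[begin:end]
        gene = re.sub("\\n", "", gene_coor)
        # Each gene is followed by the spacer and a literal "W".
        synth += gene + spacer + "W"

    # Each species is followed by two spacers and a literal "S".
    synth += spacer + spacer + "S"

# Write the synthetic genome; the context manager closes the file.
with open("my_synthetic_genome.txt", "w") as my_synth_genome:
    my_synth_genome.write(synth)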

Aspergillus fumigatus
AFUA_4G06890
AFUA_6G05140
AFUA_2G00320
AFUA_6G12400
AFUA_5G05460
AFUA_1G05050
AFUA_2G15130
AFUA_4G12560
AFUA_2G03700
AFUA_3G05760
AFUA_5G06070
AFUA_1G10910
AFUA_5G10370
AFUA_7G03740
Candida albicans
C5_00660C_A
C1_04770C_A
C1_02420C_A
CR_00850C_A
C5_03390C_A
C6_00620W_A
C3_05920W_A
C3_05220W_A
C5_01840C_A
C1_08460C_A
C3_04890W_A
C3_07860C_A
C6_03170C_A
C1_08590C_A
C3_02220W_A
C1_00800C_A
C1_03780C_A
C3_06850W_A
CR_08780W_A
CR_08800W_A
C1_00710C_A
Candida auris
B9J08_001448
B9J08_003737
B9J08_000964
B9J08_001020
B9J08_004076
B9J08_005397
B9J08_000164
B9J08_004061
B9J08_004819
B9J08_000270
B9J08_003224
B9J08_001232
B9J08_001055.75
B9J08_001055.25
B9J08_003245
Nasakeomyces glabrata
CAGL0E04334g
CAGL0F01793g
CAGL0G01034g
CAGL0K04037g
CAGL0H09064g
CAGL0D01562g
CAGL0A00451g
CAGL0I07733g
CAGL0M01760g
CAGL0I07755g
CAGL0F07865g
CAGL0L11506g
CAGL0L13392r
CAGL0L13376r
CAGL0K12650g
Candida parapsilosis
CPAR2_303740
CPAR2_105550
CPAR2_106400
CPAR2_804030
CPAR2_502030
CPAR2_602

### It works, but with some problems:

Candida albicans and Candida tropicalis each have one unrecognisable gene, C1_00710C_A and CTRG_02689 respectively. *Solved !* The problem was an erroneous trailing space (" ") after the Gene ID in the xlsx file.
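
A stray space of this kind can also be detected and removed in pandas instead of editing the source file by hand. The sketch below uses a toy frame that reproduces the two affected IDs with a trailing space; `genes` and `has_padding` are hypothetical names.

```python
import pandas as pd

# Toy frame reproducing the problem: both IDs carry a trailing space.
genes = pd.DataFrame({
    "Species": ["Candida albicans", "Candida tropicalis"],
    "Gene ID": ["C1_00710C_A ", "CTRG_02689 "],
})

# Flag the IDs that change when surrounding whitespace is stripped.
has_padding = genes["Gene ID"] != genes["Gene ID"].str.strip()
print(genes.loc[has_padding, "Gene ID"].tolist())

# Normalise the column so later lookups in the GFF text match exactly.
genes["Gene ID"] = genes["Gene ID"].str.strip()
```

Running the strip before the `groupby` step would make the lookup robust to this kind of typo in the input table.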

Candida tropicalis' gff file is broken. *Solved !* I replaced the gff and fasta files with the top-level fasta and the "normal" gff3.

Are they finding the right sequences? I'm not sure. There is a large NNN sequence in Cryptococcus ??
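
One way to check the extracted sequences is to slice each gene from its own chromosome, using the GFF convention that coordinates are 1-based and inclusive, and to measure the longest run of N in the result. The sketch below rests on assumptions: the helper names (`find_gene`, `extract`, `longest_n_run`) are hypothetical, the `ID=` attribute is assumed to hold the gene ID with an optional `gene:` prefix, the sequence ids in the GFF are assumed to match the FASTA ids, and the strand column is ignored, so minus-strand genes come back as forward-strand sequence.

```python
import re
from typing import Dict, Optional, Tuple


def find_gene(gff_text: str, gene_id: str) -> Optional[Tuple[str, int, int]]:
    """Return (seqid, start, end) of the first '*gene' feature with this ID, else None."""
    # Assumption: the ID attribute looks like "ID=<id>" or "ID=gene:<id>".
    pattern = re.compile(rf"ID=(?:gene:)?{re.escape(gene_id)}(?:;|$)")
    for line in gff_text.splitlines():
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or not fields[2].endswith("gene"):
            continue
        if pattern.search(fields[8]):
            return fields[0], int(fields[3]), int(fields[4])
    return None


def extract(sequences: Dict[str, str], seqid: str, start: int, end: int) -> str:
    """Slice a 1-based, inclusive GFF interval from the named sequence."""
    return sequences[seqid][start - 1:end]


def longest_n_run(sequence: str) -> int:
    """Length of the longest stretch of consecutive N bases (0 if there is none)."""
    return max((len(run) for run in re.findall(r"N+", sequence.upper())), default=0)


# Toy data (hypothetical): two chromosomes and two annotated genes.
sequences = {"chr1": "AAAACCCCGGGG", "chr2": "TTTTNNNNNNAC"}
gff_text = (
    "##gff-version 3\n"
    "chr1\ttoy\tgene\t5\t8\t.\t+\t.\tID=GENE_A;Name=a\n"
    "chr2\ttoy\tgene\t3\t12\t.\t+\t.\tID=gene:GENE_B;biotype=protein_coding\n"
)

results = {}
for gene_id in ["GENE_A", "GENE_B"]:
    hit = find_gene(gff_text, gene_id)
    if hit is None:
        print(gene_id, "not found")
        continue
    seq = extract(sequences, *hit)
    results[gene_id] = (seq, longest_n_run(seq))
    print(gene_id, hit, seq, longest_n_run(seq))
```

Comparing the length of each extracted sequence with `end - start + 1` gives a quick sanity check, and a long N run inside a gene points at an assembly gap in the reference rather than at the extraction itself.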