# Contaminant mapping to genome

This notebook and its associated scripts use the OMArk report and a genome annotation file (GFF3) to identify stretches of the genome that are rich in contaminant genes.
Three parameters tune the trade-off between specificity and sensitivity:

* Minimum proportion of "contaminant" genes in a contaminant stretch. Set it higher for more specificity, lower for more sensitivity. Default: 0.5
* Minimum number of contaminant genes in a stretch that is not a full chromosome. Set it higher for more specificity, lower for more sensitivity. Default: 5
* Whether contaminant stretches may only be whole contigs or contig extremities. Default: False

The result can then be used to create a FASTA file from which the proteins in those stretches have been removed.

## Import dependencies

In [1]:
import gffutils
from Bio import SeqIO
import contamination_chromosome_filtering as ccf

## Input files
Set your own values for the genome of your choice.

* OMARK_FOLDER: the output folder from OMArk
* GFF: a GFF3 file giving the positions of genes in the genome. This code was tested with files from NCBI and Ensembl.
* OG_FASTA: the original FASTA file of the proteome
* FILTER_FASTA: the output FASTA file, with possible contaminants filtered out
* REPORT: the output file indicating which parts of the genome were considered contaminant, with the list of corresponding genes and proteins

* THRESHOLD: float between 0 and 1. The proportion of contaminant genes in a "contaminant stretch" must pass this threshold for the stretch to be considered valid.
* MIN_NUMBER_GENES_IN_STRETCH: integer. The minimum number of genes a contaminant stretch must contain to be considered valid. If a chromosome has fewer genes than this, all of its genes must be contaminant for it to be considered a stretch.
* ONLY_EXTREMITIES: boolean. Indicates whether stretches are restricted to chromosome extremities or may also lie in the middle. When True, stretches are forced to be at the start or end of a chromosome.
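
To make the interplay between THRESHOLD and MIN_NUMBER_GENES_IN_STRETCH concrete, here is a minimal sketch of the validity rule as described above. It is not the code used by `contamination_chromosome_filtering`: the function name `stretch_is_valid` and its arguments are hypothetical, and it assumes that "passing" the threshold means greater than or equal to, and that the minimum applies to the number of genes in the stretch.

```python
def stretch_is_valid(flags, contig_gene_count, threshold=0.5, min_genes=5):
    """Illustrative check of one candidate stretch (hypothetical helper).

    flags: list of bool, one per gene in the stretch, True if contaminant.
    contig_gene_count: total number of genes on the contig or chromosome.
    """
    if not flags:
        return False
    n_contaminant = sum(flags)
    if contig_gene_count < min_genes:
        # Small contig: the stretch must be the whole contig,
        # and every gene on it must be a contaminant.
        return len(flags) == contig_gene_count and n_contaminant == len(flags)
    # Assumption: the threshold is inclusive (>=).
    proportion = n_contaminant / len(flags)
    return len(flags) >= min_genes and proportion >= threshold
```

With the defaults, a six-gene stretch with four contaminant genes on a large contig would be accepted under these assumptions, while a three-gene stretch on the same contig would be rejected for being too short.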

In [2]:
OMARK_FOLDER = ''
GFF = ''
OG_FASTA = ''
FILTER_FASTA = ''
REPORT = ''

THRESHOLD = 0.5
MIN_NUMBER_GENES_IN_STRETCH = 5
ONLY_EXTREMITIES = False
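
Since all paths default to empty strings, it may help to check the settings before running the next cells. The helper below is a hypothetical convenience, not part of `contamination_chromosome_filtering`; it only uses the standard library and checks the constraints stated in the list above.

```python
import os

def check_inputs(omark_folder, gff, og_fasta, threshold, min_genes):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if not os.path.isdir(omark_folder):
        problems.append(f"OMARK_FOLDER is not a directory: {omark_folder!r}")
    for name, path in (("GFF", gff), ("OG_FASTA", og_fasta)):
        if not os.path.isfile(path):
            problems.append(f"{name} is not a file: {path!r}")
    if not 0 <= threshold <= 1:
        problems.append(f"THRESHOLD must be between 0 and 1, got {threshold}")
    if not isinstance(min_genes, int) or min_genes < 1:
        problems.append(
            f"MIN_NUMBER_GENES_IN_STRETCH must be a positive integer, got {min_genes}"
        )
    return problems
```

It could be called as `check_inputs(OMARK_FOLDER, GFF, OG_FASTA, THRESHOLD, MIN_NUMBER_GENES_IN_STRETCH)`, printing any returned messages before continuing.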

## Stretches identification

In [3]:
# Get the contaminant proteins reported in OMARK_FOLDER
contaminant = ccf.get_contaminants(OMARK_FOLDER)
# Get the positions of the contaminants and of all genes from the GFF file
cont_pos, all_pos, gene_to_prot = ccf.get_position_conta(contaminant, GFF)
# Infer the contaminant stretches with the selected parameters
contaminant_stretches = ccf.infer_contaminant_genome_stretches(
    cont_pos, all_pos, THRESHOLD, MIN_NUMBER_GENES_IN_STRETCH, ONLY_EXTREMITIES
)


## Manual validation (optional)

In [4]:
# Uncomment this cell to review each stretch interactively and keep only the confirmed ones.
#selected_stretches = []
#for x in contaminant_stretches:
#    print(x)
#    correct_input = False
#    while not correct_input:
#        keep_input = input("Select as contaminant? (Y or N) ")
#        if keep_input in ('Y', 'N'):
#            correct_input = True
#        if keep_input == 'Y':
#            selected_stretches.append(x)
#contaminant_stretches = selected_stretches

## Filter proteins in stretches and write outputs

In [5]:
# List the proteins and genes located in the contaminant stretches
protein_to_remove, gene_to_remove = ccf.get_genes_in_cont_stretches(contaminant_stretches, all_pos, gene_to_prot)
# Write a filtered copy of the FASTA file with the contaminant proteins removed
ccf.filter_proteins(OG_FASTA, FILTER_FASTA, protein_to_remove)
# Write a text report listing the removed genes
ccf.write_report(contaminant_stretches, gene_to_remove, protein_to_remove, gene_to_prot, outfile=REPORT)
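
As a quick sanity check after filtering, the number of records in the two FASTA files can be compared. The sketch below is an optional addition, not part of `contamination_chromosome_filtering`; the helper names are hypothetical, and it simply counts header lines starting with `>` rather than parsing the sequences.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting its '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def removed_record_count(original_fasta, filtered_fasta):
    """Number of records present in the original file but not in the filtered one."""
    return count_fasta_records(original_fasta) - count_fasta_records(filtered_fasta)
```

Calling `removed_record_count(OG_FASTA, FILTER_FASTA)` should give a number close to the number of proteins in `protein_to_remove`, assuming that object is a sized collection of identifiers; the two may differ if some identifiers are absent from the original FASTA file.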