## Task 1 - Explain ciprofloxacin resistance

**A common mutation causing ciprofloxacin resistance is found in the *gyrA* gene and results in the amino acid change S91F. Can you confirm whether this is present in the sequence data, using the mapped data, and find another resistance determinant?**

This relies on using Python to locate the gene in the whole genome and convert its DNA sequence to a protein sequence, so that we can compare the reference genome with our genome of interest and look for differences.

### Tip 1

Find the location of the *gyrA* gene in the reference file using the NCBI website - https://www.ncbi.nlm.nih.gov/nuccore/NC_011035.1?report=graph - use the find function on the webpage

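- Coordinates of the *gyrA* gene on [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_011035.1?report=graph): 1051396 to 1054146 (numbered from one, both ends included).
- In Python (numbered from zero): indices 1051395 to 1054145, which is the slice `[1051395:1054146]` because the end of a slice is excluded.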
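The conversion between the two numbering schemes can be sketched on a toy sequence. The helper name `ncbi_to_slice` and the toy string are made up for illustration; only the arithmetic is the point.

```python
def ncbi_to_slice(start, end):
    """Convert 1-based inclusive coordinates to a Python slice.

    Hypothetical helper, not part of Biopython.
    """
    # Subtract 1 from the start; keep the end, because slice ends are exclusive.
    return slice(start - 1, end)


toy_genome = "AAACCCGGGTTT"                   # toy sequence, not real data
toy_gene = toy_genome[ncbi_to_slice(4, 9)]    # positions 4..9, 1-based inclusive
print(toy_gene)                               # CCCGGG

gyrA_slice = ncbi_to_slice(1051396, 1054146)
print(gyrA_slice.start, gyrA_slice.stop)      # 1051395 1054146
print(gyrA_slice.stop - gyrA_slice.start)     # 2751 bases, i.e. 917 codons
```

A quick sanity check is that the length of the extracted gene is a multiple of three, as it is here.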

![NCBI Screenshot](task1_tip1.png)

### Tip 2

Use Biopython's `SeqIO.read()` function to read in the mapped FASTA file `cdt-tutorial/super_gonorrhoea/super_gc_mapped.fa` (https://biopython.org/wiki/SeqIO)

**Change the path below to point to your copy of the file.**

In [9]:
from Bio import SeqIO
super_gc_mapped_fa = SeqIO.read('../super_gonorrhoea/super_gc_mapped.fa', 'fasta')

### Tip 3

You will need to extract the DNA sequence of the gene using its coordinates within the whole genome (remember that positions are numbered from zero in Python, but from one on the NCBI website!)

In [7]:
super_gc_GyrA_extract = super_gc_mapped_fa[1051395:1054146]
print(super_gc_GyrA_extract)

ID: OxfordGC_R00000419_R00000419
Name: OxfordGC_R00000419_R00000419
Description: OxfordGC_R00000419_R00000419 Software:/home/compass/PIPELINE/mmmPipeline/compass/g4_basecall.py Version:1.0.1
Number of features: 0
Seq('ATGACCGACGCAACCATCCGCCACGACCACAAATTCGCCCTCGAAACCCTGCCC...TGA', SingleLetterAlphabet())


### Tip 4

You will need to translate the DNA sequence to a protein sequence using `seq.translate()` in Biopython. For *gyrA* the gene runs in the same direction as the DNA is numbered; if it did not, you would need to generate the reverse complement sequence first. If this works, you should end up with a string of amino acids represented as single letters and ending with a stop codon shown as an asterisk.

In [8]:
super_gc_GyrA = super_gc_GyrA_extract.translate()
print(super_gc_GyrA)

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('MTDATIRHDHKFALETLPVSLEDEMRKSYLDYAMSVIVGRALPDVRDGLKPVHR...EN*', HasStopCodon(ExtendedIUPACProtein(), '*'))
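
For a gene on the opposite strand, the slice would have to be reverse-complemented before translation. Here is a minimal sketch of that step in plain Python on a made-up nine-base fragment. With Biopython the same step is the `reverse_complement()` method of a `Seq` or `SeqRecord`, called before `translate()`. The function below is a hypothetical stand-in that only handles upper-case A, C, G and T.

```python
# Complement table for the four unambiguous bases (upper case only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    # str.translate here swaps characters; it is not protein translation.
    return dna.translate(COMPLEMENT)[::-1]


# Toy forward-strand slice covering a gene encoded on the reverse strand.
forward_slice = "TTATTTCAT"
coding = reverse_complement(forward_slice)
print(coding)   # ATGAAATAA: start codon ATG, then AAA, then stop codon TAA
```

Read in the forward direction the toy slice starts with TTA and has no start codon; only the reverse complement reads as a gene.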


### Tip 5

Compare the amino acid sequence from the reference (`cdt-tutorial/super_gonorrhoea/reference.fa`) with the sequence from the case using Python. Does the reference share any mutations, or have any different mutations, in *gyrA*? (Hint: the reference sequence is also ciprofloxacin resistant.)

**Change the path below to point to your copy of the file.**

In [19]:
# amino acid 91 is at index 90, because Python counts from zero
super_gc_GyrA[90]

'F'

We have 'F' at position 91 (index 90 when counting from zero), which confirms the amino acid change S91F.

In [27]:
# reference

reference_mapped_fa = SeqIO.read('../super_gonorrhoea/reference.fa', 'fasta')
# same gyrA coordinates as for the case; the variable name must match the line above
reference_GyrA = reference_mapped_fa[1051395:1054146].translate()
print(reference_GyrA)

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('MTDATIRHDHKFALETLPVSLEDEMRKSYLDYAMSVIVGRALPDVRDGLKPVHR...EN*', HasStopCodon(ExtendedIUPACProtein(), '*'))


In [53]:
import numpy as np

inconsistent_idxs = np.array((reference_GyrA)) != np.array((super_gc_GyrA))
print(sum(inconsistent_idxs), 'inconsistent index(es)')

(np.arange(1, len(reference_GyrA)+1)[inconsistent_idxs],
np.array(super_gc_GyrA)[inconsistent_idxs],
np.array(reference_GyrA)[inconsistent_idxs])

1 inconsistent index(es)


(array([95]), array(['A'], dtype='<U1'), array(['G'], dtype='<U1'))

In [54]:
for i, (aa1, aa2) in enumerate(zip(super_gc_GyrA, reference_GyrA)):
    if aa1!=aa2:
        print(i+1, aa1, aa2)

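95 A G

The only difference is at position 95, where the case has A and the reference has G. Since the case has F at position 91 and nothing else differs, the reference also has F there: the two sequences share S91F and differ only at position 95.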
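
The comparison can also be reported in the usual substitution notation (reference residue, position counted from one, observed residue). The sketch below uses plain strings so that it runs on its own. `list_substitutions` is a hypothetical helper, and the two short sequences are invented rather than taken from *gyrA*.

```python
def list_substitutions(reference, query):
    """Return substitutions such as 'G95A' between two equal-length proteins.

    Hypothetical helper: assumes the sequences line up position by
    position (no insertions or deletions).
    """
    if len(reference) != len(query):
        raise ValueError("sequences must have the same length")
    return [
        f"{ref_aa}{position}{query_aa}"
        for position, (ref_aa, query_aa) in enumerate(zip(reference, query), start=1)
        if ref_aa != query_aa
    ]


# Invented toy sequences for illustration only.
toy_reference = "MTDSG"
toy_query = "MTDFA"
print(list_substitutions(toy_reference, toy_query))   # ['S4F', 'G5A']
```

With the Biopython records from the cells above, the call would look something like `list_substitutions(str(reference_GyrA.seq), str(super_gc_GyrA.seq))`.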


## Task 2 - Explain the tetracycline resistance
**Tetracycline resistance is conferred by gain of a new gene, *tetM*, carried on a plasmid. Is this gene present?**  
To do this we will use a tool called BLAST. This is an optimised algorithm for searching for one or more sequences (the query) within a database of sequences. In this example we will treat the gene we are trying to find as the query and the database will be the contigs that make up the assembly of our genome of interest.

I suggest you read the Process and Algorithm sections of the Wikipedia article on BLAST before continuing, as this will help with parsing the output generated: https://en.wikipedia.org/wiki/BLAST_(biotechnology).
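
As a rough illustration of what parsing BLAST output can look like, the sketch below reads BLAST's tabular format (`-outfmt 6`). Its twelve default columns are query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value and bit score. This assumes the tabular format was requested; the tutorial may use a different one. The function name, the sequence ids and the example line are all invented for illustration and are not real results.

```python
import csv
import io

# Default column names for BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(text):
    """Parse tab-separated BLAST hits into a list of dicts (hypothetical helper)."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        # Convert the numeric columns from strings.
        for key in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Invented example line, NOT real output from this dataset.
example = "query_gene\tcontig_7\t99.50\t200\t1\t0\t1\t200\t501\t700\t1e-100\t364\n"
hits = parse_blast_tabular(example)
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["length"])
```

Each hit says which contig matched the query, how similar the match is and where it lies on both sequences, which is the information needed to decide whether a gene is present in the assembly.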