### Data structure diagram:
The structure is centered around **edge** objects. Each edge holds four properties: its label (string), whether it is reversed (bool), its tail connection (Nothing or another edge object), and its head connection (Nothing or another edge object).

### Genomes:
Example: {{at}, {ah, ct}, {ch, dh}, {dt}, {bh, et}, {eh, bt}, {ft}, {fh, gt}, {gh}}

This example is composed of **three** paths:
- {at}, {ah, ct}, {ch, dh}, {dt}
    - This is a **linear** path
    - Gene A is **positively** oriented (from left to right goes **tail** to **head**)
    - Gene C is **positively** oriented (from left to right goes **tail** to **head**)
    - Gene D is **negatively** oriented (from left to right goes **head** to **tail**)
- {bh, et}, {eh, bt}
    - This is a **cyclic** path
    - Gene B is **positively** oriented (the cycle wraps around: {eh, bt} enters B at its **tail** and {bh, et} leaves from its **head**)
    - Gene E is **positively** oriented (from left to right goes **tail** to **head**)
- {ft}, {fh, gt}, {gh}
    - This is a **linear** path
    - Gene F is **positively** oriented (from left to right goes **tail** to **head**)
    - Gene G is **positively** oriented (from left to right goes **tail** to **head**)
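
The linear/cyclic split above can be checked mechanically. The following is only a minimal sketch, not the notebook's own method: it assumes the lowercase `{ah, ct}` notation used in this section, and the helper names `parse_genes` and `classify_paths` are hypothetical. It groups gene labels that share a pair of brackets and calls a group linear when it contains at least one single-entry gene (a free end), cyclic otherwise.

```julia
# Hypothetical helper: split a genome string into genes, each a list of entries such as "ah".
function parse_genes(genome::AbstractString)
    return [String.(strip.(split(m.captures[1], ","))) for m in eachmatch(r"\{([^{}]*)\}", genome)]
end

# Hypothetical helper: group labels into paths and tag each path :linear or :cyclic.
function classify_paths(genome::AbstractString)
    neighbors = Dict{String, Set{String}}()   # label => labels it shares brackets with
    telomeric = Set{String}()                 # labels that own at least one free end
    for gene in parse_genes(genome)
        labels = [String(chop(entry)) for entry in gene]   # drop the trailing h/t
        for label in labels
            get!(() -> Set{String}(), neighbors, label)
        end
        if length(labels) == 1
            push!(telomeric, labels[1])
        else
            push!(neighbors[labels[1]], labels[2])
            push!(neighbors[labels[2]], labels[1])
        end
    end

    paths = Vector{Tuple{Vector{String}, Symbol}}()
    visited = Set{String}()
    for start in sort(collect(keys(neighbors)))
        start in visited && continue
        component = String[]
        stack = [start]
        while !isempty(stack)               # depth-first walk over one path
            label = pop!(stack)
            label in visited && continue
            push!(visited, label)
            push!(component, label)
            append!(stack, neighbors[label])
        end
        kind = any(in(telomeric), component) ? :linear : :cyclic
        push!(paths, (sort(component), kind))
    end
    return paths
end

example = "{{at}, {ah, ct}, {ch, dh}, {dt}, {bh, et}, {eh, bt}, {ft}, {fh, gt}, {gh}}"
for (labels, kind) in classify_paths(example)
    println(join(labels, ", "), " => ", kind)
end
```

The sketch ignores orientation entirely; it only answers which genes belong together and whether their path has free ends.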

### Data structure to genome relation
These edges relate to the genome diagram as follows:

Each gene in the diagram is structured as {HEAD slot, TAIL slot}. So, in the gene {ah, ct}:
- ah is in the head slot
- ct is in the tail slot

Taking a more complete genome, {at} {ah, ct} {ch}, we convert it to **two** edge objects.
Edge 1:
- Label: a
- Reversal: false
- Tail connection: Nothing
- Head connection: Edge 2 (c)

Edge 2:
- Label: c
- Reversal: false
- Tail connection: Edge 1 (a)
- Head connection: Nothing

This introduces **telomeres**: edges with only one connection. A telomere is an external node of the graph, while internal nodes have edges in both connection slots.
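
As a small illustration, the two edges for {at} {ah, ct} {ch} can be wired up by hand. This is a sketch under assumptions: `DemoEdge` and `demo_is_telomere` are hypothetical names chosen so the block stands on its own, and they simply mirror the four properties described above.

```julia
# DemoEdge mirrors the four edge properties described above; the name is hypothetical.
mutable struct DemoEdge
    label::String
    reversed::Bool
    tail_conn::Union{DemoEdge, Nothing}
    head_conn::Union{DemoEdge, Nothing}
end

# An edge counts as a telomere when at least one connection slot is empty.
demo_is_telomere(edge::DemoEdge) = edge.tail_conn === nothing || edge.head_conn === nothing

# {at} {ah, ct} {ch}: a's head meets c's tail, and the two outer ends stay open.
a = DemoEdge("a", false, nothing, nothing)
c = DemoEdge("c", false, nothing, nothing)
a.head_conn = c
c.tail_conn = a

println("a is a telomere: ", demo_is_telomere(a))
println("c is a telomere: ", demo_is_telomere(c))
```

Both edges are telomeres here because the path has only two edges; an edge sitting between them would have both slots filled and would be an internal node.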

The conversion works as follows:
Each gene contains either one or two entries, and the h or t suffix of an entry determines which slot it fills. A gene with an entry in only one slot marks a telomere. In the example above, {at} is the first gene we look at. It creates edge A with no tail connection, because no other edge shares a gene with the tail of A. {ah, ct} then tells us that edge C goes in the head connection of edge A, because ah sits in the same brackets as an entry of another edge.

One important case has been left out so far: a genome that looks like {at} {ah, ch} {ct}.

Here both entries of the middle gene carry the h suffix, which indicates that one of the edges is negatively oriented. By convention, we say the C edge is negatively oriented.
We turn this into the following edges:
Edge 1:
- Label: a
- Reversal: false
- Tail connection: Nothing
- Head connection: Edge 2 (c)

Edge 2:
- Label: c
- Reversal: true
- Tail connection: Nothing
- Head connection: Edge 1 (a)

By convention, when we hit a gene of the form {ah, ch} (or {at, ct}), we take whichever entry is in its "correct" slot to be the positively oriented gene. Looking at our gene diagram, ah is in the correct slot, as it has the suffix h and sits in the head slot. So edge C is reversed. Thus, when we create the C edge, we must put edge A in the head connection of the C edge, as that is where its head connects to the A edge.
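
The convention can be written as a small rule on the two slots. The function below is a hypothetical sketch (the name `reversed_labels` is not part of the notebook) and assumes the lowercase h/t suffixes used in the prose: an entry is in its correct slot when the head slot holds an h entry or the tail slot holds a t entry, and any other entry marks a reversed edge.

```julia
# Hypothetical helper: given the entry in the head slot and the entry in the tail slot
# of one gene, return the labels of the edges that must be marked as reversed.
function reversed_labels(head_entry::AbstractString, tail_entry::AbstractString)
    reversed = String[]
    # The head slot expects an `h` suffix; a `t` there means that edge is reversed.
    endswith(head_entry, "t") && push!(reversed, String(chop(head_entry)))
    # The tail slot expects a `t` suffix; an `h` there means that edge is reversed.
    endswith(tail_entry, "h") && push!(reversed, String(chop(tail_entry)))
    return reversed
end

println(reversed_labels("ah", "ct"))   # both entries sit in their correct slots
println(reversed_labels("ah", "ch"))   # ch sits in the tail slot, so c is reversed
println(reversed_labels("at", "ct"))   # at sits in the head slot, so a is reversed
```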

In [70]:
using Parameters
using Random
import Base: show



In [1]:
mutable struct Edge
    label::String
    reversed::Bool
    tail_conn::Union{Edge, Nothing}
    head_conn::Union{Edge, Nothing}
end

mutable struct Path
    root::Edge
    all_edges::Set{Edge}
end

mutable struct Genome
    input_str::String
    root_edges::Vector{Edge}
end

In [35]:
function is_telomere(edge::Edge)
    # A telomere has at least one empty connection slot.
    return edge.tail_conn === nothing || edge.head_conn === nothing
end

function Base.show(io::IO, edge::Edge)
    tail_label = edge.tail_conn === nothing ? "none" : edge.tail_conn.label
    head_label = edge.head_conn === nothing ? "none" : edge.head_conn.label
    reversed_str = edge.reversed ? "reversed" : "normal"
    print(io, "Edge(label: $(edge.label), reversed: $reversed_str, tail_conn: $tail_label, head_conn: $head_label)")
end

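function display_genome(genome::Genome)
    for root in genome.root_edges
        # root_edges stores Edge objects, not Paths, so first collect every edge
        # reachable from this root, then reuse the two-argument display_path.
        edges = Set{Edge}()
        stack = Edge[root]
        while !isempty(stack)
            edge = pop!(stack)
            edge in edges && continue
            push!(edges, edge)
            edge.tail_conn === nothing || push!(stack, edge.tail_conn)
            edge.head_conn === nothing || push!(stack, edge.head_conn)
        end
        display_path(root, edges)
        print("  ")
    end
end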

function display_path(path::Path)
    root = path.root
    edges = path.all_edges
    display_path(root, edges)
end

function display_path(root::Edge, edges::Set{Edge})
    edge_set = edges
    telos = Set{Edge}()
    for edge in edge_set
        if is_telomere(edge)
            push!(telos, edge)
        end
    end

    trail = root
    println(trail)
    println("edges:")
    for edge in edge_set
        println(edge)
    end
    println("telos:")
    for telo in telos
        println(telo)
    end
    println("")
    seen = Set{Edge}()
    output = ""
    # Walk the path from the root; `seen` stops the walk if an edge repeats.
    while trail !== nothing && !(trail in seen)
        push!(seen, trail)
        println(trail)
        # Starting at a telomere (for now only linear paths are supported).
        # A reversed edge is entered through its head and left through its tail.
        head_slot = trail.reversed ? trail.head_conn : trail.tail_conn
        tail_slot = trail
        next_slot = trail.reversed ? trail.tail_conn : trail.head_conn
        println("Trail: $(trail.label), head slot: $(head_slot === nothing ? nothing : head_slot.label), tail slot: $(tail_slot === nothing ? nothing : tail_slot.label), next slot = $(next_slot === nothing ? nothing : next_slot.label)")
        # Emit the gene that leads into this edge: an opening telomere when nothing
        # precedes it, otherwise the {previous edge, this edge} pair.
        output *= get_formatted_gene(head_slot, tail_slot) * "  "
        if next_slot === nothing
            # Last edge of the path: also emit its closing telomere. This also covers
            # a path made of a single edge, which previously lost its closing telomere.
            output *= get_formatted_gene(trail, nothing)
        end
        trail = next_slot
    end
    println(output)
end

display_path (generic function with 2 methods)

In [36]:
function get_formatted_gene(head_slot::Union{Nothing, Edge}, tail_slot::Union{Nothing, Edge})
    if head_slot === nothing
        tail_string = tail_slot.label * "T"
        if tail_slot.reversed
            tail_string = tail_slot.label * "H"
        end
        return "{$tail_string}"
    end
    if tail_slot === nothing
        head_string = head_slot.label * "H"
        if head_slot.reversed
            head_string = head_slot.label * "T"
        end
        return "{$head_string}"
    end
    head_string = head_slot.label * "H"
    if head_slot.reversed
        head_string = head_slot.label * "T"
    end
    tail_string = tail_slot.label * "T"
    if tail_slot.reversed
        tail_string = tail_slot.label * "H"
    end
    return "{$head_string, $tail_string}"
end

get_formatted_gene (generic function with 1 method)

In [37]:
function convert_string_to_genome(input::String)
    split_input = split(input, ",")
    seen_labels = Dict{String, Edge}()
    root = nothing
    for gene in split_input
        println(gene)
        curr_string = ""
        components = Vector{String}()
        for ch in gene
            curr_string *= string(ch)
            if ch == 'H' || ch == 'T'
                push!(components, curr_string)
                curr_string = ""
            end
        end
        if length(components) == 1
            # Telomere
            no_suffix = components[1][1:end-1]
            suffix = components[1][end]
            if !haskey(seen_labels, no_suffix)
                # Need to make a new edge. If the edge already exists, this entry would
                # only set a connection to nothing, so there is nothing to do.
                # A lone H entry seen first means the path starts at this edge's head,
                # so the edge is created as reversed.
                tail = seen_labels[no_suffix] = Edge(no_suffix, suffix == 'H', nothing, nothing)
            end
        else
            # Full gene
            head_no_suffix = components[1][1:end-1]
            head_suffix = components[1][end]
            tail_no_suffix = components[2][1:end-1]
            tail_suffix = components[2][end]
            # The function form of get! only constructs a new Edge when the label is missing.
            head = get!(() -> Edge(head_no_suffix, false, nothing, nothing), seen_labels, head_no_suffix)
            tail = get!(() -> Edge(tail_no_suffix, false, nothing, nothing), seen_labels, tail_no_suffix)

            if head_suffix == 'T'
                head.reversed = true
                head.tail_conn = tail
            else
                head.head_conn = tail
            end
            if tail_suffix == 'H'
                tail.reversed = true
                tail.head_conn = head
            else
                tail.tail_conn = head
            end
        end
        if root === nothing
            root = tail
        end
    end
    edges = Set{Edge}(value for (key, value) in seen_labels)
    graph = Path(root, edges)
    return graph
end

convert_string_to_genome (generic function with 1 method)

In [40]:
test_convert_string = convert_string_to_genome("aT,aHcT,cHdT,dHeH,eTfH,fT")
display_path(test_convert_string)

aT
aHcT
cHdT
dHeH
eTfH
fT
Edge(label: a, reversed: normal, tail_conn: none, head_conn: c)
edges:
Edge(label: e, reversed: reversed, tail_conn: f, head_conn: d)
Edge(label: c, reversed: normal, tail_conn: a, head_conn: d)
Edge(label: d, reversed: normal, tail_conn: c, head_conn: e)
Edge(label: f, reversed: reversed, tail_conn: none, head_conn: e)
Edge(label: a, reversed: normal, tail_conn: none, head_conn: c)
telos:
Edge(label: f, reversed: reversed, tail_conn: none, head_conn: e)
Edge(label: a, reversed: normal, tail_conn: none, head_conn: c)

Edge(label: a, reversed: normal, tail_conn: none, head_conn: c)
Trail: a, head slot: nothing, tail slot: a, next slot = c
Edge(label: c, reversed: normal, tail_conn: a, head_conn: d)
Trail: c, head slot: a, tail slot: c, next slot = d
Edge(label: d, reversed: normal, tail_conn: c, head_conn: e)
Trail: d, head slot: c, tail slot: d, next slot = e
Edge(label: e, reversed: reversed, tail_conn: f, head_conn: d)
Trail: e, head slot: d, tail slot: e, n