# Indexing

An overview of how the indexing code works in cfgrib.jl.

In [None]:
using DataStructures  # For the OrderedDict type
using Dates
using GRIB

## `FileIndex` Type

In [None]:
mutable struct FileIndex
    allowed_protocol_version::VersionNumber

    grib_path::String
    index_path::String

    index_keys::Array{String, 1}
    offsets::Array
    message_lengths::Array{Int, 1}
    header_values::OrderedDict{String, Array}

    filter_by_keys::Dict

    FileIndex() = new(v"0.0.0") #  Here we define an inner constructor which will always create an object with a set `allowed_protocol_version`
end

This type has the same fields as the Python version; however, it stores the path to the GRIB file instead of a file stream.
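As a minimal sketch of what the inner constructor does, the hypothetical `DemoIndex` type below (not part of cfgrib.jl) uses the same incomplete-initialisation pattern: `new` is called with fewer arguments than there are fields, so the remaining fields stay unassigned until they are set.

```julia
# Hypothetical two-field stand-in for `FileIndex`, for illustration only.
mutable struct DemoIndex
    allowed_protocol_version::VersionNumber
    grib_path::String

    # `new` with a single argument sets only the first field
    DemoIndex() = new(v"0.0.0")
end

demo = DemoIndex()
println(isdefined(demo, :grib_path))  # checks whether the field has been assigned

demo.grib_path = "example.grib"       # made-up path
println(isdefined(demo, :grib_path))
```

Reading an unassigned field such as `grib_path` before it is set throws an `UndefRefError`, so fields should be assigned before they are read.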

## `FileIndex` Constructors

Constructors are functions which construct new objects. By default you can make an object by calling the type itself and providing all of the fields; however, because `FileIndex` defines its own inner constructor, that default is replaced. Instead, a special outer constructor (a constructor defined outside the type definition) is used for convenience.

In [None]:
function FileIndex(grib_path::String, index_keys::Array{String, 1})
    fileindex = FileIndex() #  `FileIndex` is a mutable type, so we can initialise an empty version here, which we then fill in
    fileindex.grib_path = grib_path
    fileindex.index_keys = index_keys

    #  Here we compute the index hash, and set `index_path` in a similar way to the python version
    index_keys_hash = hash(
        join([fileindex.index_keys..., fileindex.allowed_protocol_version])
    )
    index_keys_hash = string(index_keys_hash, base=16)
    fileindex.index_path = ".$(fileindex.grib_path).$index_keys_hash.idx"

    if isfile(fileindex.index_path)
        #  Now if the index path exists, we load from it
        from_indexfile!(fileindex)
    else
        #  Otherwise we load from the grib file and then get the header values
        from_gribfile!(fileindex)
        get_header_values!(fileindex)
    end

    return fileindex
end
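The index path computation above can be tried on its own. This is a minimal sketch with made-up inputs (the file name and key list are hypothetical), showing what `join`, `hash` and `string(...; base=16)` each contribute.

```julia
# Hypothetical inputs, chosen only for illustration.
grib_path = "era5-levels.grib"
index_keys = ["shortName", "typeOfLevel", "level"]
protocol_version = v"0.0.0"

# `join` prints each element in turn, so the version is appended as "0.0.0"
joined = join([index_keys..., protocol_version])

# `hash` returns an unsigned integer; `string(...; base=16)` renders it as hex text
index_keys_hash = string(hash(joined), base=16)

index_path = ".$(grib_path).$(index_keys_hash).idx"
println(joined)
println(index_path)
```

One caveat, which is an observation about `Base.hash` rather than something cfgrib.jl addresses: as far as I know the hash value is not documented as stable across Julia versions, so an index file written under one version may not be found under another.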

By convention, a Julia function whose name ends with `!` modifies one or more of its arguments instead of returning a modified copy. So `get_header_values!` gets the header values and adds them to the existing `FileIndex`, whereas a `get_header_values` function would return the header values themselves (if such a function existed).

Another example is `filter` and `filter!`: the former returns a filtered copy of the index, whereas the latter filters the index in place.
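The same naming convention can be seen with `sort` and `sort!` from Base. A small sketch:

```julia
numbers = [3, 1, 2]

sorted_copy = sort(numbers)   # returns a new sorted array
after_sort = copy(numbers)    # snapshot showing `numbers` was left alone
println(after_sort)
println(sorted_copy)

sort!(numbers)                # sorts `numbers` in place
println(numbers)
```

The `!` is only a naming convention; nothing in the language enforces it.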

## Extending Interfaces

In Python, if you want to get items via indexing, you add a `__getitem__` dunder method to your class. As Julia is based on multiple dispatch, we instead extend `Base.getindex` so that it knows how to access our object:

In [None]:
Base.getindex(obj::FileIndex, key) = obj.header_values[key]

We want indexing into... the index to return the corresponding header value, so we define `getindex` to do just that.
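To try this without a GRIB file, here is a minimal sketch using a hypothetical `DemoHeaders` wrapper (not part of cfgrib.jl) in place of `FileIndex`:

```julia
# Hypothetical wrapper type standing in for `FileIndex`.
struct DemoHeaders
    header_values::Dict{String, Vector{Int}}
end

# `obj[key]` is lowered to `getindex(obj, key)`, so adding this method
# is all that is needed for square-bracket access to work
Base.getindex(obj::DemoHeaders, key) = obj.header_values[key]

headers = DemoHeaders(Dict("level" => [500, 850]))
println(headers["level"])
```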

## Loading Index from GRIB File

This is effectively the `from_filestream` classmethod of the Python `FileIndex` class:

In [None]:
function from_gribfile!(index::FileIndex)
    offsets = OrderedDict()
    count_offsets = Dict{Int, Int}()

    index_keys = index.index_keys
    index_key_count = length(index_keys)
    index_key_symbols = Tuple(Symbol.(index_keys))
    HeaderTuple = NamedTuple{index_key_symbols}

    #  TODO: Time function to see if it is worth optimising
    #  based on gribfile.nmessages w/ known-length arrays
    #  more, or if I/O overhead too large
    GribFile(index.grib_path) do f
        message_lengths = Array{Int, 1}(undef, f.nmessages)
        for (nmessage, message) in enumerate(f)
            header_values = Array{Any}(undef, index_key_count)
            for (i, key) in enumerate(index_keys)
                value = haskey(message, key) ? message[key] : missing
                value = value isa Array ? Tuple(value) : value
                #  TODO: use dispatch to do this via GRIB
                value = key == "time" ? from_grib_date_time(message) : value

                header_values[i] = value
            end

            offset = Int(message["offset"])
            if offset in keys(count_offsets)
                count_offsets[offset] += 1
                offset_field = (offset, count_offsets[offset])
            else
                count_offsets[offset] = 0
                offset_field = offset
            end

            message_lengths[nmessage] = Int(message["totalLength"])
            offsets[HeaderTuple(header_values)] = offset_field
        end
        index.message_lengths = message_lengths
    end

    index.offsets = collect(pairs(offsets))
end
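The bookkeeping for repeated offsets in `from_gribfile!` can be isolated from the GRIB reading. The sketch below runs the same logic over a made-up list of offsets: the first time an offset is seen it is stored bare, and each repeat is stored as an `(offset, count)` tuple.

```julia
# Hypothetical byte offsets, with 120 appearing three times.
message_offsets = [0, 120, 120, 120, 400]

count_offsets = Dict{Int, Int}()
offset_fields = Any[]

for offset in message_offsets
    if haskey(count_offsets, offset)
        # Repeated offset: store it together with its running count
        count_offsets[offset] += 1
        push!(offset_fields, (offset, count_offsets[offset]))
    else
        # First occurrence: store the bare offset
        count_offsets[offset] = 0
        push!(offset_fields, offset)
    end
end

println(offset_fields)
```

Because the stored value is sometimes an `Int` and sometimes a tuple, code that reads it back has to cope with both shapes.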

**notes on implementation**:

- Currently does not handle exceptions
- Does not do the header_values_cache as... I don't really get what it is
- Not clear to me how the date values are converted in cfgrib.py, and how this should be implemented
- I'd like to find out more about:
  - from_grib_date_time
  - to_grib_date_time
  - from_grib_step
  - to_grib_step
  - from_grib_month
  - build_valid_time

Currently the time conversion is done by a few functions:

In [None]:
DEFAULT_EPOCH = DateTime(1970, 1, 1, 0, 0)


function from_grib_date_time(date::Int, time::Int; epoch=DEFAULT_EPOCH)
    hour = time ÷ 100
    minute = time % 100
    year = date ÷ 10000
    month = date ÷ 100 % 100
    day = date % 100

    data_datetime = DateTime(year, month, day, hour, minute)

    return Dates.value(Dates.Second(data_datetime - epoch))
end

function from_grib_date_time(
        message::GRIB.Message, date_key="dataDate",
        time_key="dataTime", epoch=DEFAULT_EPOCH
    )
    date = message[date_key]
    time = message[time_key]

    #  Pass the epoch on, otherwise a non-default epoch is silently ignored
    return from_grib_date_time(date, time; epoch=epoch)
end


#  TODO: This probably won't work translated directly from python
#  check cases where time and step are effectively missing
function build_valid_time(time::Int, step::Int)
    step_s = step * 3600

    data = time + step_s
    dims = ()

    return dims, data
end

function build_valid_time(time::Array{Int, 1}, step::Int)
    step_s = step * 3600

    data = time .+ step_s
    dims = ("time", )

    return dims, data
end

function build_valid_time(time::Int, step::Array{Int, 1})
    step_s = step * 3600

    data = time .+ step_s
    dims = ("step", )

    return dims, data
end

function build_valid_time(time::Array{Int, 1}, step::Array{Int, 1})
    step_s = step * 3600

    if length(time) == 1 && length(step) == 1
        return build_valid_time(time[1], step[1])
    end

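    #  Broadcasting a column of times against a row of steps gives a matrix
    #  of size (length(time), length(step)), matching the order of `dims`.
    #  TODO: Julia is column major, numpy is row major, check whether the
    #  memory layout matters downstream
    data = time .+ step_s'
    dims = ("time", "step")
    return dims, data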

end

#  TODO: implement other conversion methods, but some seem unused, should these
#  be implemented as well:
#   - to_grib_date_time
#   - from_grib_step
#   - to_grib_step
#   - from_grib_month


## Getting Header Values

In [None]:
function get_header_values!(index::FileIndex)
    header_values = OrderedDict{String, Array}()
    for key in index.index_keys
        header_values[key] = unique([
            offset[1][Symbol(key)]
            for offset
            in index.offsets
        ])
    end

    index.header_values = header_values
end
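What `get_header_values!` computes can be seen on a small, made-up set of index entries. This sketch uses a plain `Dict` to avoid the DataStructures dependency, so unlike the `OrderedDict` above it does not remember key order; the entries themselves are hypothetical.

```julia
# Hypothetical index entries: header NamedTuple => offset
offsets = [
    (shortName = "t", level = 500) => 0,
    (shortName = "t", level = 850) => 120,
    (shortName = "u", level = 500) => 240,
]

index_keys = ["shortName", "level"]

header_values = Dict{String, Array}()
for key in index_keys
    # `entry[1]` is the header NamedTuple; `unique` keeps first occurrences in order
    header_values[key] = unique([entry[1][Symbol(key)] for entry in offsets])
end

println(header_values["shortName"])
println(header_values["level"])
```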

## `getone`

In [None]:
function getone(index::FileIndex, item)
    values = index[item]

    if length(values) != 1
        error("Expected 1 value for $(item), found $(length(values)) instead")
    end

    return values[1]
end

## `first`

In [None]:
function Base.first(index::FileIndex)
    GribFile(index.grib_path) do file
        first_offset = index.offsets[1][2][1]
        #  There is a discrepancy between how offsets are defined and used
        #  in cfgrib with the GRIB file seek method and in the Julia GRIB
        #  package: in Julia, seek moves through whole messages rather than
        #  to the actual offset values. Here we use the cumulative sum of the
        #  message lengths to work out which message an offset value is in.
        #
        #  TODO: This is probably due to me making a mistake, don't know
        #  enough about GRIB spec to figure out how this should be done, get
        #  ECMWF help with this
        message_length_cumsum = cumsum(index.message_lengths)
        offset_message_index = findfirst(message_length_cumsum .> first_offset) - 1
        seek(file, offset_message_index)
        return Message(file)
    end
end

**notes on implementation**:

This escapes me a bit, as I'm still not familiar with GRIB files or ecCodes.

As far as I can tell, `seek` in GRIB.jl moves through whole messages, whereas cfgrib.py seeks to byte offset values.
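The cumulative-sum trick used in `first` can be checked on made-up numbers. This sketch assumes the messages sit back to back starting at byte 0, which is an assumption of the example rather than something verified against the GRIB spec; `message_index` is a hypothetical helper.

```julia
# Hypothetical message lengths in bytes, in file order.
message_lengths = [100, 250, 80]

# Byte position just past the end of each message: [100, 350, 430]
message_ends = cumsum(message_lengths)

# Zero-based position of the message that contains a given byte offset.
# Hypothetical helper, not part of cfgrib.jl or GRIB.jl.
message_index(offset) = findfirst(message_ends .> offset) - 1

println(message_index(0))
println(message_index(100))
println(message_index(350))
```

If the offset lies beyond the last message, `findfirst` returns `nothing` and the subtraction throws, so real code would need to handle that case.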

## Filtering

Similar to Python, Julia has a way to collect arbitrary arguments. In Python you can write `func(*args)` to collect positional arguments and `func(**kwargs)` to collect keyword arguments. In Julia, keyword arguments come after a semicolon, and instead of an asterisk you use an ellipsis to collect arguments, so:

In [None]:
test_args(args...; kwargs...) = (println("args: $(args)"); println("kwargs: $(kwargs)"))

In [None]:
test_args(1,2,"potato"; a=3, b=2.9, c="carrot")

`filter` is defined by collecting keyword arguments into a `query` variable:

In [None]:
function filter_offsets(index::FileIndex; query...)
    filtered_offsets = Array{Pair{Any,Any},1}()

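    for (header_values, offset_values) in index.offsets
        #  Keep an entry only if every key in the query matches; `isequal`
        #  is used so that `missing` header values compare as a plain Bool
        if all(isequal(header_values[k], v) for (k, v) in query)
            push!(filtered_offsets, Pair(header_values, offset_values))
        end
    end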

    return filtered_offsets
end

function Base.filter(index::FileIndex; query...)
    filtered_offsets = filter_offsets(index; query...)

    filtered_index = deepcopy(index)
    filtered_index.offsets = filtered_offsets
    filtered_index.filter_by_keys = query

    get_header_values!(filtered_index)

    return filtered_index
end

function Base.filter!(index::FileIndex; query...)
    filtered_offsets = filter_offsets(index; query...)

    index.offsets = filtered_offsets
    index.filter_by_keys = query

    get_header_values!(index)
end
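To see the intended filtering behaviour without a GRIB file, here is a minimal sketch over made-up index entries. `matching_offsets` is a hypothetical helper, and it assumes that an entry should be kept only when every keyword in the query matches.

```julia
# Hypothetical index entries: header NamedTuple => offset
offsets = [
    (shortName = "t", level = 500) => 0,
    (shortName = "t", level = 850) => 120,
    (shortName = "u", level = 500) => 240,
]

# Hypothetical helper: keep entries whose headers match every keyword given
function matching_offsets(offsets; query...)
    kept = eltype(offsets)[]
    for entry in offsets
        # `query` iterates as `key => value` pairs with Symbol keys
        if all(isequal(entry[1][k], v) for (k, v) in query)
            push!(kept, entry)
        end
    end
    return kept
end

println(matching_offsets(offsets; shortName = "t"))
println(matching_offsets(offsets; shortName = "t", level = 500))
```

With no keywords at all, `all` over an empty query is `true`, so every entry is kept.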

**notes on implementation**:

`filter_by_keys` exists and is used quite often; however, it hasn't been exposed in the user-callable functions yet, as I may change how it works in the future. GRIB.jl has its own `Index` type, indexing functionality, and index filtering. At the start of the project I had no clue what any of that was for, so I just copied the Python implementation from cfgrib, but now that I vaguely understand it I want to replace my indexing and filtering with the GRIB.jl implementation where possible.

From the GRIB.jl readme, filtering is done like this:

```
Index(filename, "shortName", "typeOfLevel", "level") do index
    select!(index, "shortName", "t")
    select!(index, "typeOfLevel", "isobaricInhPa")
    select!(index, "level", 500)
    for msg in index
        # Do things with msg
    end
end
```