# Text Search Engine in Julia

Refactored by [Ismael Venegas](https://twitter.com/ismael_vc) from [*Searching for syntactic sugar, man*](http://a-coda.tumblr.com/post/149265834291/searching-for-syntactic-sugar-man) by [**Jason Trenouth**](https://twitter.com/JasonTrenouth).

***

## Importing functions that are going to be extended

> I’m going to *define*¹ some new functions on some built-in operators so I need to import them. I wouldn’t need to do this if I was just using the operators.

```julia
import Base.&, Base.|, Base.-, Base.!
```

Or use the fully qualified name at the moment of definition:

```julia
Base.:&(hits1::Set, hits2::Set) = intersect(hits1, hits2)  # v0.5+ syntax; on v0.4 write `Base.(:&)(...)`
```

Alternatively, several names can be imported at once with the colon syntax:

In [1]:
import Base: &, |, -, !

1. The *julian* term is *to extend* a generic function.
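As a minimal sketch of what extending a generic function looks like, the snippet below adds methods for `&` and `|` to a small wrapper type. The `Hits` type is hypothetical (it does not appear in the original code) and is used only so the example does not add methods for Base's own types; the syntax assumes Julia 1.x (`struct` instead of the v0.4 `immutable`).

```julia
import Base: &, |

# Hypothetical wrapper type, introduced only for this illustration.
struct Hits
    files::Set{String}
end

# Extend the existing generic functions `&` and `|` with methods for `Hits`.
(&)(a::Hits, b::Hits) = Hits(intersect(a.files, b.files))
(|)(a::Hits, b::Hits) = Hits(union(a.files, b.files))

a = Hits(Set(["foo.jl", "bar.jl"]))
b = Hits(Set(["bar.jl", "baz.jl"]))

common = (a & b).files   # only the file present in both
merged = (a | b).files   # every file present in either
```

Without the `import` line (or a `Base.`-qualified definition), these definitions would create new, unrelated functions named `&` and `|` in the current module instead of adding methods to the built-in ones.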

## Cross-version compatibility with [Compat](https://github.com/JuliaLang/Compat.jl)

Just as an example of optional typing, I'm going to restrict operations to `String`s:

* In **v0.3**, `String` is an abstract type with `ASCIIString` and `UTF8String` as subtypes; in **v0.4** it was renamed to `AbstractString`.
* In **v0.5** `String` returns as the only type of string in Julia.
* Compat defines `String` as `Union{ASCIIString, UTF8String}`.

Also, `Compat` provides the `walkdir` function introduced in **v0.5**, so defining the following function is not necessary:

```julia
function walk_directory(fn, directory)
    for file in readdir(directory)
        path = joinpath(directory, file)
        if isdir(path)
            walk_directory(fn, path)
        else
            fn(path)
        end
    end
end
```

In [2]:
import Compat: String, walkdir 
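To illustrate what `walkdir` yields, here is a small sketch that walks a throwaway directory tree. It assumes Julia 1.x, where `walkdir` lives in Base and no `Compat` import is needed; the file names are made up for the example.

```julia
# Build a small temporary tree so the example does not depend on any real path.
root = mktempdir()
mkdir(joinpath(root, "sub"))
write(joinpath(root, "a.jl"), "x = 1")
write(joinpath(root, "sub", "b.jl"), "y = 2")

# Each iteration yields a directory, its subdirectories, and its files.
found = String[]
for (dir, subdirs, files) in walkdir(root)
    for file in files
        push!(found, joinpath(dir, file))
    end
end

sort!(found)
names = map(basename, found)
```

Unlike the hand-written `walk_directory` above, `walkdir` is an iterator rather than a function taking a callback, so filtering and early exit can be done with ordinary loop constructs.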

## Export the desired API functionality

This package uses [NBInclude](https://github.com/stevengj/NBInclude.jl), which provides an `nbinclude` function similar to `include`, in order to include the code in this [IJulia](https://github.com/JuliaLang/IJulia.jl) notebook.

In this case only these three objects are exported (`Base: &, |, -, !` don't need to be exported since they extend already existing functions):

In [3]:
export index_file, index_directory, @S_str

## Global variables

> I introduce “index” as a module variable because I want it to be the context for later operations. This is obviously a bit of a hack.

When a variable is declared in global scope, it's best to also declare it `const`. For a mutable container such as a `Dict`, this fixes the binding and therefore the **type**, not the **contents**: entries can still be added or removed. Because a non-constant global can change type at any point, `const` helps the compiler infer the variable's type and generate optimized code.

It is also a good idea to name such variables in all-uppercase style, and since this is a non-exported object we could also prefix it with `_`.

Note that I'm also adding more optional typing here to show its use:

In [4]:
const _INDEX = Dict{String, Set{String}}()

Dict{ByteString,Set{ByteString}} with 0 entries
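The following sketch shows that a `const` global holding a `Dict` can still be filled in after it is declared. The name `_DEMO_INDEX` is hypothetical and merely mirrors `_INDEX`; the snippet assumes Julia 1.x and must run at top level, since `const` is not allowed inside functions.

```julia
# Hypothetical stand-in for `_INDEX`, declared the same way.
const _DEMO_INDEX = Dict{String, Set{String}}()

# Mutating the contents is allowed: the binding and its type stay the same.
_DEMO_INDEX["julia"] = Set(["notes.txt"])
push!(_DEMO_INDEX["julia"], "readme.md")

entries = length(_DEMO_INDEX["julia"])   # number of files recorded for "julia"
```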

## One line function definitions

> I can define simple functions as one-line expressions without much syntactic overhead.

In [5]:
_tokenize(text::String) = map(lowercase, split(text))
_all() = reduce(union, values(_INDEX))    # `all` already exists in Julia!

Here I show another bit of syntactic sugar in the form of infix binary operators `∩` (`\cap<TAB>`) for `intersect` and `∪` (`\cup<TAB>`) for `union`. Also notice the use of `hits₁` (`hits\_1<TAB>`) and `hits₂` (`hits\_2<TAB>`) instead of `hits1` and `hits2` respectively:

In [6]:
# if `&` not in parenthesis: `ERROR: syntax: invalid assignment location`
(&)(hits₁::Set{String}, hits₂::Set{String}) = hits₁ ∩ hits₂
|(hits₁::Set{String}, hits₂::Set{String}) = hits₁ ∪ hits₂
-(hits₁::Set{String}, hits₂::Set{String}) = setdiff(hits₁, hits₂)
!(hits::Set{String}) = _all() - hits;
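To make the intended meaning of the four operators concrete, here is a small sketch using only the built-in set functions on two toy sets. The file names and the `everything` variable (which plays the role of `_all()`) are made up for illustration.

```julia
a = Set(["foo.jl", "bar.jl", "baz.jl"])
b = Set(["bar.jl", "baz.jl", "qux.jl"])
everything = a ∪ b                 # plays the role of `_all()`

both   = a ∩ b                     # what `&` computes
either = a ∪ b                     # what `|` computes
only_a = setdiff(a, b)             # what `-` computes
not_a  = setdiff(everything, a)    # what `!` computes
```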

## `do` blocks

> The second trick is a bit of sugar (“->”) for defining an anonymous function.

In this example I use the `do` syntax instead of the explicit anonymous function.

> This is a Ruby-style do-block which is syntactic sugar for making an anonymous function out of the body of the block

In [7]:
function _lookup(terms)
    mapreduce(intersect, _tokenize(terms)) do token
        get(_INDEX, token, Set{String}())
    end
end;
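A `do` block passes its body as the first argument of the call, so the two forms in the sketch below should be equivalent. The toy `index` and `tokens` values are assumptions made for the example, not data from the package.

```julia
index = Dict("a" => Set(["1.txt", "2.txt"]), "b" => Set(["2.txt", "3.txt"]))
tokens = ["a", "b"]

# Explicit anonymous function as the first argument...
explicit = mapreduce(token -> get(index, token, Set{String}()), intersect, tokens)

# ...and the same call written with a `do` block.
with_do = mapreduce(intersect, tokens) do token
    get(index, token, Set{String}())
end
```

Here the mapping step looks up each token's set of files, and `intersect` reduces those sets to the files that contain every token.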

## Indexing functions

In [8]:
function index_file(file::String)
    for word in _tokenize(open(readall, file))
        push!(get!(_INDEX, word, Set{String}()), file)
    end
end

function index_directory(directory, extension = "")
    for (root, dirs, files) in walkdir(directory)
        for file in files
            if endswith(file, extension)
                index_file(joinpath(root, file))
            end
        end
    end
end;
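The `push!(get!(...))` idiom in `index_file` can be tried without touching the file system. The sketch below builds the same kind of inverted index from in-memory strings; the `docs` dictionary and the local `index` are hypothetical stand-ins for real files and `_INDEX`.

```julia
# Hypothetical documents: file name => contents.
docs = Dict("a.txt" => "Julia is fast", "b.txt" => "julia is fun")
index = Dict{String, Set{String}}()

for (name, text) in docs
    for word in map(lowercase, split(text))
        # `get!` returns the set already stored for `word`,
        # or stores and returns a new empty one.
        push!(get!(index, word, Set{String}()), name)
    end
end

julia_hits = index["julia"]   # files containing "julia" in any letter case
fast_hits  = index["fast"]    # files containing "fast"
```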

## Non standard string literals

Here I use `S_str` instead of `s_str`, since starting with **v0.4** `s_str` is used to construct `SubstitutionString{String}`s:

In [9]:
macro S_str(terms)
    _lookup(terms)
end
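As a minimal sketch of the mechanism, a macro named `@X_str` is what makes the literal `X"..."` available. The `up` prefix below is a hypothetical example unrelated to the package; the snippet assumes Julia 1.x and top-level evaluation.

```julia
# Hypothetical literal: `up"text"` invokes `@up_str` with the raw string.
macro up_str(text)
    # This body runs at macro-expansion time; its result replaces the literal.
    uppercase(text)
end

shouted = up"hello"
```

The same applies to `S_str`: the lookup runs when the literal is expanded, so the index needs to be populated before an `S"..."` expression is evaluated.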

## Create package

Simply use `Pkg.generate`:

```julia
julia> Pkg.generate("TextSearch", "MIT")
INFO: Initializing TextSearch repo: C:\Users\Ismael\.julia\v0.4\TextSearch
INFO: Generating LICENSE.md
INFO: Generating README.md
INFO: Generating src/TextSearch.jl
INFO: Generating test/runtests.jl
INFO: Generating REQUIRE
INFO: Generating .travis.yml
INFO: Generating appveyor.yml
INFO: Generating .gitignore
INFO: Committing TextSearch generated files
```

I also added an `examples` directory after the fact, and populated `test/runtests.jl` with the usage examples below and `src` with this notebook. Since we are using `NBInclude`, the contents of `src/TextSearch.jl` are simply:

```julia
module TextSearch

using NBInclude

nbinclude("text_search.ipynb")

end # module

```

The contents of `REQUIRE` are like this:

```julia
julia 0.4
Compat
NBInclude
```

## Usage


```julia
julia> using TextSearch

julia> cd(Pkg.dir("TextSearch"))

julia> index_directory("examples", ".jl")

julia> TextSearch._INDEX
Dict{ByteString,Set{ByteString}} with 83 entries:
  "∩"                        => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "push!(get!(_index,"       => Set(Union{ASCIIString,UTF8String}["examples\\foobar.…
  "!"                        => Set(Union{ASCIIString,UTF8String}["examples\\quz.jl"…
  "="                        => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "parenthesis:"             => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "import"                   => Set(Union{ASCIIString,UTF8String}["examples\\quzqux.…
  "_all()"                   => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "index_file(file::string)" => Set(Union{ASCIIString,UTF8String}["examples\\foobar.…
  "mapreduce(intersect,"     => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "else"                     => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "hits₁"                    => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "split(text))"             => Set(Union{ASCIIString,UTF8String}["examples\\baz.jl"…
  "#"                        => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "assignment"               => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "file)"                    => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "compat:"                  => Set(Union{ASCIIString,UTF8String}["examples\\quzqux.…
  "end;"                     => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "-"                        => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "_tokenize(terms))"        => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "setdiff(hits₁,"           => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "set{string}())"           => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "|(hits₁::set{string},"    => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "location`"                => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "path"                     => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "token,"                   => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "values(_index));"         => Set(Union{ASCIIString,UTF8String}["examples\\baz.jl"…
  "_tokenize(text::string)"  => Set(Union{ASCIIString,UTF8String}["examples\\baz.jl"…
  "(&)(hits₁::set{string},"  => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "in"                       => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "index_directory(director… => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "macro"                    => Set(Union{ASCIIString,UTF8String}["examples\\qux.jl"…
  "extension"                => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "hits₂)"                   => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "do"                       => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "-,"                       => Set(Union{ASCIIString,UTF8String}["examples\\quz.jl"…
  "endswith(file,"           => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "not"                      => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "-(hits₁::set{string},"    => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "set{string}()),"          => Set(Union{ASCIIString,UTF8String}["examples\\foobar.…
  "walkdir(directory)"       => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "`error:"                  => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "for"                      => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "map(lowercase,"           => Set(Union{ASCIIString,UTF8String}["examples\\baz.jl"…
  "∪"                        => Set(Union{ASCIIString,UTF8String}["examples\\bazbar.…
  "get(_index,"              => Set(Union{ASCIIString,UTF8String}["examples\\foo.jl"…
  "function"                 => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "|,"                       => Set(Union{ASCIIString,UTF8String}["examples\\quz.jl"…
  "if"                       => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  "(root,"                   => Set(Union{ASCIIString,UTF8String}["examples\\foobarb…
  "isdir(path)"              => Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"…
  ⋮                          => ⋮

julia> S"∪" & S"∩"
Set(Union{ASCIIString,UTF8String}["examples\\bazbar.jl"])

julia> S"∪" | S"end"
Set(Union{ASCIIString,UTF8String}["examples\\bazbar.jl","examples\\bar.jl","examples\\foo.jl","examples\\foobar.jl","examples\\qux.jl","examples\\foobarbaz.jl"])

julia> S"function" - S"="
Set(Union{ASCIIString,UTF8String}["examples\\foo.jl","examples\\foobar.jl"])

julia> S"else" & !S"#"
Set(Union{ASCIIString,UTF8String}["examples\\bar.jl"])
```