# No-Support-Julia
## Introduction
There's no wifi on the airplane, but I've got three hours to spare, so I'm just going to bang my head against the Julia stdlib. Let's see what happens.


## Part 1 - Let's load some data
Let's load a short text file that I wrote in about five seconds.


In [1]:
test_file = readstring(open("data/test.txt"))

"This is some text you will need to parse using Julia.\nHow can it be parsed? Using the standard library functions.\nNotice how the lines do not all have the same length, and the values are not consistent.\nFurthermore, there are some miscellaneous details that are irrelevant to the summarization effort\nYou will need to extract these as well.\n"

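A note for anyone following along on Julia 1.0 or later: readstring was removed there, and read with a String type argument does the same job. Here is a minimal sketch, using a throwaway temp file in place of data/test.txt so it runs anywhere (the sample text and the variable names are made up for illustration):

```julia
# Create a temporary file so the example does not depend on data/test.txt.
path, io = mktemp()
write(io, "This is some text.\nHow can it be parsed?\n")
close(io)

# Julia 1.x replacement for readstring(open(path)):
# read the whole file into a single String.
sample_text = read(path, String)
println(repr(sample_text))
```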
We have some data now, so let's try splitting it into words first.

In [2]:
words = split(test_file)


59-element Array{SubString{String},1}:
 "This"         
 "is"           
 "some"         
 "text"         
 "you"          
 "will"         
 "need"         
 "to"           
 "parse"        
 "using"        
 "Julia."       
 "How"          
 "can"          
 ⋮              
 "to"           
 "the"          
 "summarization"
 "effort"       
 "You"          
 "will"         
 "need"         
 "to"           
 "extract"      
 "these"        
 "as"           
 "well."        

How about plotting this stuff in a histogram?


First, let's sanitize our words to remove punctuation.

In [3]:
sanitized = map(a -> lowercase(join(filter(x -> isalnum(x), a))), words)

59-element Array{String,1}:
 "this"         
 "is"           
 "some"         
 "text"         
 "you"          
 "will"         
 "need"         
 "to"           
 "parse"        
 "using"        
 "julia"        
 "how"          
 "can"          
 ⋮              
 "to"           
 "the"          
 "summarization"
 "effort"       
 "you"          
 "will"         
 "need"         
 "to"           
 "extract"      
 "these"        
 "as"           
 "well"         

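On Julia 1.0 and later, isalnum is gone as well; combining isletter and isnumeric is the usual substitute. Here is a sketch under that assumption, with a hypothetical helper name (sanitize_word) and a made-up three-word sample:

```julia
# Keep only letters and digits, then lowercase the result.
# filter on a string returns a new String, so no join is needed.
sanitize_word(w) = lowercase(filter(c -> isletter(c) || isnumeric(c), w))

sample_words = split("Julia. parsed? Furthermore,")
println(map(sanitize_word, sample_words))
```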
Now we need to create a frequency table.

In [4]:
counter = Dict{String, Int}()
for word in sanitized
    counter[word] = get(counter, word, 0) + 1
end
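The cell above prints nothing, so a quick way to check the table is to sort the word/count pairs and print the biggest ones first. This is a rough sketch using only Base functions; the sample list and the names counts and ranked are stand-ins, not the notebook's actual data:

```julia
# Hypothetical sample standing in for the sanitized word list.
sample = ["the", "to", "the", "need", "to", "the"]

# Same counting loop as above.
counts = Dict{String, Int}()
for word in sample
    counts[word] = get(counts, word, 0) + 1
end

# collect turns the Dict into a Vector of word => count pairs;
# sort them by count, largest first.
ranked = sort(collect(counts), by = last, rev = true)
for (word, n) in ranked
    println(rpad(word, 6), n)
end
```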


Okay, now I'm lost. I've got no idea how to plot this.
Let's drop the idea of a histogram and move on to some cooler things then.

In [5]:
idmap = Dict{String, Int}()
currId = 0
for word in sanitized
    id = get(idmap, word, -1)
    if id == -1
        idmap[word] = currId
        currId += 1
    end
end


Now every distinct word has its own ID, so we can represent each word as a one-hot vector with 44 entries, one per distinct word.
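To make the vector idea concrete, here is a small sketch of turning an ID map into one-hot vectors. The helper names build_idmap and one_hot are hypothetical, and the five-word sample is made up. Since the IDs start at 0 and Julia arrays start at 1, the sketch shifts each ID by one:

```julia
# Build the word => ID map inside a function so it behaves the same
# in a script or a notebook. IDs start at 0, as in the cell above.
function build_idmap(words)
    ids = Dict{String, Int}()
    for word in words
        # get! inserts the next free ID only if the word is new.
        get!(ids, word, length(ids))
    end
    ids
end

# One-hot vector: all zeros except a 1.0 at the word's slot.
# Julia arrays are 1-based, so ID k lands at index k + 1.
function one_hot(word, ids)
    v = zeros(Float64, length(ids))
    v[ids[word] + 1] = 1.0
    v
end

sample = ["this", "is", "some", "text", "is"]
ids = build_idmap(sample)
println(one_hot("some", ids))
```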

Hey, I've got an idea: let's try word2vec on this.

Now, I read an article on this, but I only skimmed it. That was enough to get the big picture, but I'm missing the details.

From what I understood, word2vec works by running an autoencoder network against the words, but the article also mentioned some other things it does, like using the previous word to predict the current one to improve accuracy. I'm not 100% sure, but that sounds about right.

Alright, nice. I've got an idea of what to do now. First things first, we need to set up an autoencoder.

To do this, we'll need a sequence of layers that gradually reduces the input vector of size 44 to a small hidden vector (the encoding), followed by a sequence of layers that gradually grows the vector from that compact encoding back to the original size.

Let's start by defining a struct to represent our autoencoder.


In [28]:
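struct AutoEncoder
    encoder::Array
    decoder::Array
    input_size::UInt
    hidden_layer_size::UInt
    layers::UInt
    function AutoEncoder(inp, hidd, layers)
        # Per-layer change in width; negative when hidd < inp.
        # convert throws an InexactError unless layers divides (hidd - inp) evenly.
        step = convert(Int, (hidd - inp)/layers)
        encoder = Array{Array{Float64}}(layers)
        decoder = Array{Array{Float64}}(layers)

        # Encoder: shrink from inp down to hidd, one random weight matrix per layer.
        current_size = inp
        for i in 1:layers
            encoder[i] = rand(current_size, current_size + step)
            current_size += step
        end
        # Decoder: reverse the step and grow from hidd back up to inp.
        step *= -1
        for i in 1:layers
            decoder[i] = rand(current_size, current_size + step)
            current_size += step
        end
        new(encoder, decoder, inp, hidd, layers)
    end
end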


Notice that we've defined an inner constructor which, given the input size, hidden layer size, and number of layers, initializes the encoder and decoder weight matrices with random values.

LoadError: syntax: unexpected "end"

In [35]:
enc = AutoEncoder(100,10,5)

AutoEncoder(Array{Float64,N} where N[[0.379269 0.0453297 … 0.311252 0.986605; 0.305363 0.772192 … 0.644362 0.652224; … ; 0.781224 0.140815 … 0.280981 0.921464; 0.528331 0.948057 … 0.505355 0.545835], [0.491154 0.18796 … 0.256367 0.930093; 0.795206 0.166115 … 0.816345 0.26615; … ; 0.924861 0.738504 … 0.205056 0.213574; 0.237844 0.689262 … 0.717254 0.287205], [0.828815 0.116469 … 0.962458 0.538443; 0.752039 0.433587 … 0.383888 0.824494; … ; 0.123673 0.705926 … 0.608085 0.474822; 0.260323 0.739131 … 0.584931 0.703986], [0.695669 0.704979 … 0.980679 0.170187; 0.870905 0.665081 … 0.134856 0.963135; … ; 0.972461 0.458998 … 0.900291 0.420887; 0.340354 0.694829 … 0.662717 0.46091], [0.21244 0.283266 … 0.157289 0.358607; 0.660753 0.231501 … 0.392178 0.0814015; … ; 0.404754 0.526115 … 0.0947717 0.825928; 0.90473 0.440946 … 0.158078 0.538893]], Array{Float64,N} where N[[0.967083 0.780214 … 0.838402 0.953968; 0.642957 0.655513 … 0.010003 0.872708; … ; 0.159979 0.806286 … 0.26883 0.0445287; 0.47889
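That wall of numbers is hard to read, so here is a sketch that just computes the layer widths for the same arguments (100, 10, 5) and builds the encoder matrices the Julia 1.x way, where an uninitialized array needs undef. The helper name layer_widths is hypothetical, and the sketch assumes the step divides evenly, as the constructor does:

```julia
# Widths of successive layers when moving from inp to hidd in equal steps.
# Assumes layers divides (hidd - inp) evenly, as the constructor does.
function layer_widths(inp, hidd, layers)
    step = div(hidd - inp, layers)
    [inp + k * step for k in 0:layers]
end

widths = layer_widths(100, 10, 5)
println(widths)  # encoder widths; the decoder walks them in reverse

# On Julia 1.x, an uninitialized vector of matrices needs `undef`.
encoder = Vector{Matrix{Float64}}(undef, 5)
for i in 1:5
    encoder[i] = rand(widths[i], widths[i + 1])
end
println(size.(encoder))
```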

Now let's define a function to train the object on a given input without supervision.

First we need an error function. We'll use the Euclidean distance for now (squared, since the function below skips the square root).

In [30]:
function euclidean_distance(a, b)
    distance = 0.0
    for i in 1:length(a)
        a_i = a[i]
        b_i = b[i]
        distance += (a_i - b_i) * (a_i - b_i)
    end
    distance
end
        

euclidean_distance (generic function with 1 method)

In [31]:
euclidean_distance([1;1;1],[1;2;1])

1.0
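For comparison, the same sum of squares can be written in one line with broadcasting, and taking the square root of it gives the actual Euclidean distance. The names squared_distance and true_distance are made up for this sketch:

```julia
# Sum of squared differences, written with broadcasting.
squared_distance(a, b) = sum(abs2, a .- b)

# Taking the square root gives the actual Euclidean distance.
true_distance(a, b) = sqrt(squared_distance(a, b))

println(squared_distance([1, 1, 1], [1, 2, 1]))
println(true_distance([0.0, 0.0], [3.0, 4.0]))
```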

Now we can define the train function, with euclidean_distance as the default error function.

In [55]:
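function train(p::AutoEncoder, input::Array{Float64}, loss_function=euclidean_distance)
    result = input
    # Work in progress: this only walks the encoder layers. It does not yet
    # apply each layer to result, run the decoder, or call loss_function.
    for i in 1:first(size(p.encoder))
        result = p.encoder[i]
    end
end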

train (generic function with 2 methods)

In [56]:
train(enc, [1.0,2.0,3.0])

0
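The train function never got as far as pushing the input through the layers. As a guess at the next step, here is a sketch of a forward pass followed by a reconstruction error. It is only an assumption about where this was heading: there are no activation functions, no biases, and no weight updates, and the forward helper and the tiny 4 -> 2 -> 4 network are made up for illustration.

```julia
# Push a column vector through a list of weight matrices.
# Each matrix has shape (in, out), as in the constructor above,
# so its transpose maps an in-length vector to an out-length vector.
function forward(weights, x)
    for W in weights
        x = W' * x
    end
    x
end

# Tiny hypothetical network: 4 -> 2 -> 4.
encoder = [rand(4, 2)]
decoder = [rand(2, 4)]

input = [1.0, 0.0, 0.0, 0.0]
code = forward(encoder, input)
output = forward(decoder, code)

# Reconstruction error: sum of squared differences.
loss = sum(abs2, output .- input)
println(length(code), " ", length(output), " ", loss)
```

Actual training would then adjust the matrices to shrink that loss, which is the part this notebook has not reached yet.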