# Introduction to Packages

Julia code is organized in packages, and package management is built into the Julia language.

Packages are assumed to be developed with `git`, and Julia clones the whole repository when installing a package.

Users can have their packages registered on a special GitHub repository: [METADATA.jl](https://github.com/JuliaLang/METADATA.jl). Dependencies are tracked in the `REQUIRE` file.

In [None]:
# update the local copy of METADATA
Pkg.update()

# install a registered package
Pkg.add("DataFrames")

# install any other package
#Pkg.clone("https://github.com/leethargo/PipeLayout.jl")

# checkout a branch of a package (default: master)
#Pkg.checkout("PipeLayout")

# list installed packages with versions
Pkg.status()

# Creating an index fund

The goal of this project is to define an index fund that follows the Dow Jones. That is, we want to select a few stocks of the index, together with weights, that behave similarly to the overall index.

We start with price data of all the Dow Jones stocks from 2016. From the average prices, we define the weights of the stocks used in the index.

## Loading the price data

The data is provided as a file of comma-separated values with three columns:

In [None]:
# ; runs shell command
;head dowjones2016.csv

Julia provides a function to read csv files into arrays:

In [None]:
?readcsv

In [None]:
data = readcsv("dowjones2016.csv")
data[1:5,:]

But we will use the DataFrames package for easier processing.

In [None]:
using DataFrames

In [None]:
df = readtable("dowjones2016.csv")
df[1:4, :]

We can now access the columns by name:

In [None]:
df[:price]

Let's compute mean prices for the stocks, using a groupby-and-aggregate approach.

In [None]:
# anonymous function: ->
avg = by(df, :symbol, d -> DataFrame(avgprice = mean(d[:price])))
avg[1:4, :]
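
To make the groupby-and-aggregate step concrete, here is a minimal sketch of the same idea in plain Julia, without DataFrames. The data and all `toy_` names are hypothetical and only serve as illustration; this is not how `by` is implemented internally.

```julia
# Hypothetical toy data in long format: one (symbol, price) pair per row.
toy_symbols = ["AAA", "BBB", "AAA", "BBB", "AAA"]
toy_values  = [10.0, 20.0, 12.0, 22.0, 14.0]

# Split: collect the prices that belong to each symbol.
toy_groups = Dict{String,Vector{Float64}}()
for (s, p) in zip(toy_symbols, toy_values)
    push!(get!(toy_groups, s, Float64[]), p)
end

# Apply and combine: reduce each group to its mean price.
toy_avg = Dict(s => sum(ps) / length(ps) for (s, ps) in toy_groups)
```

The result maps each symbol to one number, which is what the `avgprice` column holds per row.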

We can now use these averages to compute weights.

In [None]:
weights = DataFrame(symbol = avg[:symbol],
                    weight = avg[:avgprice] / sum(avg[:avgprice]))
weights[1:4, :]
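
In formula form, writing $\bar{p}_s$ for the average price of stock $s$ (a symbol introduced here for illustration), the cell above computes price-proportional weights that sum to one:

\begin{align*}
w_s = \frac{\bar{p}_s}{\sum_{s'} \bar{p}_{s'}}, \qquad \sum_s w_s = 1
\end{align*}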

We can also _pivot_ the table into a two-way format.

In [None]:
# original dataframe
df[1:4, :]

In [None]:
# two-way table with symbols as columns
#                    rows   columns  data
prices = unstack(df, :date, :symbol, :price)
prices[1:4, 1:4]
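
As a rough sketch of what the pivot does, the following plain-Julia snippet turns long-format records into a two-way matrix. The records and all `toy_` names are hypothetical, and `unstack` itself may work differently internally.

```julia
# Hypothetical long-format records: (date, symbol, price).
toy_records = [("2016-01-04", "AAA", 10.0),
               ("2016-01-04", "BBB", 20.0),
               ("2016-01-05", "AAA", 12.0),
               ("2016-01-05", "BBB", 22.0)]

# Row and column labels of the two-way table.
toy_dates = sort(unique([r[1] for r in toy_records]))
toy_cols  = sort(unique([r[2] for r in toy_records]))

# Lookup tables from label to matrix position.
toy_row = Dict(d => i for (i, d) in enumerate(toy_dates))
toy_col = Dict(s => j for (j, s) in enumerate(toy_cols))

# Fill a dates-by-symbols matrix; cells without a record stay NaN.
toy_wide = fill(NaN, length(toy_dates), length(toy_cols))
for (date, sym, price) in toy_records
    toy_wide[toy_row[date], toy_col[sym]] = price
end
```

Each row of `toy_wide` is one date and each column one symbol, mirroring the layout of `prices`.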

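To value the index on each day, we join the weights onto the long price table, multiply each price by its weight, and sum the contributions per date.

In [None]: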
joined = join(df, weights, on=:symbol)
joined[1:4, :]

In [None]:
joined[:contribution] = joined[:weight] .* joined[:price]
joined[1:4, :]

In [None]:
index = by(joined, :date, d -> DataFrame(value = sum(d[:contribution])))
index[1:4, :]
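
In formula form, with $w_s$ the weight of stock $s$ and $P_{d,s}$ its price on day $d$, the join, multiply, and sum steps above compute the index value $I_d$ for each day:

\begin{align*}
I_d = \sum_s w_s P_{d,s}
\end{align*}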

## Visualizing the time series

In [None]:
using Plots      # general plotting
pyplot()         # backend, based on Python's matplotlib

In [None]:
x = cumsum(randn(10, 3))

Plots will interpret the *columns* of the data as *series* to be plotted independently:

In [None]:
plot(x)

In [None]:
plot(x')

You can also add to existing plots, using the call `plot!`.

In [None]:
plot(x, color=[:red :green])
# plot! adds to the last plot
plot!(x + 3, color=:black, alpha=0.5)

In [None]:
using StatPlots  # for DataFrames integration

We can set common attributes for several plots using the `with` wrapper:

In [None]:
with(grid=false, legend=false, xticks=false, ylim=(0,300)) do
    plot(df, :date, :price, group=:symbol, color=:grey, alpha=0.4)
    plot!(index, :date, :value, linewidth=2)
end

In [None]:
bar(weights, :symbol, :weight, xrotation=50, color=:weight, grid=false)

## Picking stocks

We now come to the decision problem, where we want to pick a small subset of the stocks together with some weights, such that this portfolio behaves similarly to our overall Dow Jones index.

The model is based on a linear regression over the time series, but we minimize the loss using the L1-norm (absolute value), and allow only a fixed number of weights to take nonzero values.

A high-level mathematical model might look like this ($w$: weights, $P$: prices, $I$: value of index):

\begin{align*}
\text{minimize}   \quad & \lVert P w - I \rVert_1 \\
\text{subject to} \quad & \lVert w \rVert_0 \le K
\end{align*}

For the curious: this can be expressed as a [Mixed-Integer Linear Program](https://en.wikipedia.org/wiki/Integer_programming) in the following form:

\begin{align*}
\text{minimize}   \quad & \sum_d \left( \Delta^+_d + \Delta^-_d \right) & \\
\text{subject to} \quad & \sum_s P_{d,s} w_s = I_d + \Delta^+_d - \Delta^-_d & (\forall d) \\
                        & w_s \le p_s & (\forall s) \\
                        & \sum_s p_s \le K & \\
                        & w_s \ge 0, \quad p_s \in \{0,1\} & (\forall s) \\
                        & \Delta^+_d \ge 0, \quad \Delta^-_d \ge 0 &  (\forall d)
\end{align*}
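
The two slack variables are a common way to linearize the absolute value. As a short sketch of the reasoning, write $r_d$ for the daily residual (a symbol introduced here for illustration) and split it into a nonnegative positive and negative part, as the `@constraint` in `find_fund` does:

\begin{align*}
r_d &= \sum_s P_{d,s} w_s - I_d = \Delta^+_d - \Delta^-_d, \qquad \Delta^+_d, \Delta^-_d \ge 0 \\
\lvert r_d \rvert &= \max(r_d, 0) + \max(-r_d, 0) = \min \left\{ \Delta^+_d + \Delta^-_d \right\}
\end{align*}

Since the objective minimizes $\Delta^+_d + \Delta^-_d$, at most one of the two is nonzero at an optimum, so their sum equals $\lvert r_d \rvert$ and the objective equals the L1 norm of the residuals.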

Several Julia packages are devoted to this kind of optimization, such as [JuMP](https://github.com/JuliaOpt/JuMP.jl) and [Convex](https://github.com/JuliaOpt/Convex.jl) for modeling, solver backends like [Cbc](https://github.com/JuliaOpt/Cbc.jl) or [SCIP](https://github.com/SCIP-Interfaces/SCIP.jl), and [MathProgBase](https://github.com/JuliaOpt/MathProgBase.jl) as glue. See [JuliaOpt](http://www.juliaopt.org/) for an overview.

In [None]:
using JuMP # modeling
using Cbc  # solver backend

In [None]:
# preparing data for indexing
syms = [Symbol(s) for s in weights[:symbol]]
days = 1:length(prices[:date])

@show size(syms) size(days);

We will formulate a model that should look quite close to the mathematical notation above.

Note the heavy use of Julia macros to define variables and constraints. The expressions are used as parsed by the Julia language and directly translated to the solver's internal form.

In [None]:
function find_fund(maxstocks; timelimit=10.0, gaplimit=0.01, lastday=200)
    days = 1:lastday

    fund = Model(solver=CbcSolver(seconds=timelimit, ratioGap=gaplimit))

    # decisions
    @variable(fund, pick[syms], Bin)    # is stock included?
    @variable(fund, weight[syms] ≥ 0)   # what part of the portfolio

    # auxiliary variables
    @variable(fund, Δ⁺[days] ≥ 0) # positive slack
    @variable(fund, Δ⁻[days] ≥ 0) # negative slack

    # fit to Dow Jones index
    for d in days
        @constraint(fund, sum(prices[d,s] * weight[s] for s in syms) == index[d, :value] + Δ⁺[d] - Δ⁻[d])
    end

    # can only use stock if picked
    for s in syms
        @constraint(fund, weight[s] ≤ pick[s])
    end
                
    # few stocks allowed
    @constraint(fund, sum(pick[s] for s in syms) ≤ maxstocks)
                          
    # minimize the absolute violation (L1 norm)
    @objective(fund, :Min, sum(Δ⁺[d] + Δ⁻[d] for d in days))

    status = solve(fund)
    @show status
    
    getvalue(weight)
end

In [None]:
trainingdays = 100
sol = find_fund(3, timelimit=6, lastday=trainingdays)

In [None]:
solfund = sum(sol[s] * prices[:, s] for s in syms);
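
The expression for `solfund` is a weighted sum of price columns, which amounts to a matrix-vector product. The following sketch shows this on hypothetical toy numbers (all `toy_` names are made up for illustration) and also evaluates the L1 tracking error that the model minimizes.

```julia
# Hypothetical toy data: 3 days (rows) by 2 stocks (columns).
toy_P = [10.0 20.0;
         12.0 22.0;
         14.0 21.0]
toy_w = [0.5, 0.25]          # hypothetical fund weights
toy_I = [10.0, 12.0, 12.0]   # hypothetical index values per day

# Fund value per day: weighted sum over the stock columns.
toy_fund = toy_P * toy_w

# L1 tracking error: sum of absolute daily deviations from the index.
toy_l1 = sum(abs.(toy_fund - toy_I))
```

Here the fund matches the index on the first day and deviates by 0.5 and 0.25 on the other two, giving an L1 error of 0.75.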

In [None]:
with(xticks=[0, trainingdays, length(days)], yticks=[]) do
    plot(index, :date, :value, label="Dow Jones")
    plot!(solfund, label="Index Fund")
end

In [None]:
errors = abs.(index[:value] - solfund)

with(bins=20) do
    histogram(errors[trainingdays+1:end], label="later", color=:red)
    histogram!(errors[1:trainingdays], alpha=0.8, label="training", color=:green)
end