In [None]:
using Plots; gr()
using Interact

## Function parameters

In this notebook, we'll work again with the sigmoid function

$$\sigma(x) := \frac{1}{1 + \exp(-x)}.$$

Julia allows us to define this function with the following simple syntax:

In [None]:
σ(x) = 1 / (1 + exp(-x))

Instead of working with a single function, we can work with a whole class (set) of functions that look similar but differ in the value of a **parameter**. Let's make a new function that uses the previous $\sigma$ function but also has a parameter, $w$:

$$f_w(x) = f(x; w) = \sigma(w \, x).$$

Mathematically speaking, we can think of $f_w$ as a different function for each different value of the parameter $w$.

In Julia, this becomes

In [None]:
f(x, w) = σ(w * x)

Note that Julia just treats parameters as additional arguments to the function.


We can now investigate the effect of $w$ interactively. To do so, we need a way of writing in Julia "the function of one variable $x$ that we obtain when we fix the value of $w$". We write this as an "anonymous function", as we saw in the last notebook:

    x -> f(x, w)
    
We can read this as "the function that maps $x$ to the value of $f(x, w)$".
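
As a minimal sketch of the same idea outside a plot, an anonymous function can be bound to a name and then called like any other function. The helper `fix_w` and the names `f_half` and `f_two` are illustrative choices made here, not part of the notebook.

```julia
σ(x) = 1 / (1 + exp(-x))
f(x, w) = σ(w * x)

# Hypothetical helper: returns the one-variable function x -> f(x, w) for a fixed w.
fix_w(w) = x -> f(x, w)

f_half = fix_w(0.5)   # the function f_w with w = 0.5
f_two = fix_w(2.0)    # the function f_w with w = 2

f_half(0.0)           # every f_w takes the value 0.5 at x = 0, because σ(0) = 0.5
f_two(1.0) == f(1.0, 2.0)   # calling f_two is the same as calling f with w = 2
```

Each call to `fix_w` produces a different function of `x`, which matches the mathematical view of $f_w$ as one function per value of $w$.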

In [None]:
@manipulate for w in -2:0.01:2
    plot(x->f(x, w), -5, 5, ylims=(0, 1))
end

#### Exercise

Try writing your own function that takes a parameter. Start by copying and executing

```julia
square(x) = x^2
```

Then use `square` to declare a new function `square_and_scale` that takes two inputs, `x` and `a`, such that

$$\mathrm{square\_and\_scale}(x; a) := a \cdot x^2$$

Once you have declared `square_and_scale`, uncomment the code below and see how the parameter `a` scales the function `square`:

In [None]:
# x = -10:10
# @manipulate for a in 0:0.01:10
#     plot(x, square.(x), label="x^2")
#     plot!(x, square_and_scale.(x, a), ls=:dash, label="ax^2")
# end

## Fitting a function to data

Suppose we are given a single data point $(x_0, y_0) = (2, 0.8)$. We can try to "fit" a function $f_w$ by adjusting the parameter $w$ until the function passes through the data:

In [None]:
x0, y0 = 2, 0.8
@manipulate for w in -2:0.01:2
    plot(x->f(x, w), -5, 5, ylims=(0, 1))
    scatter!([x0], [y0])
end

We can quantify how far we are from the goal by measuring the vertical distance from the curve to the point.
Here we will use a function $C$ of the parameter $w$ that gives the square of this distance:

$$C(w) = (y_0 - f(x_0, w))^2.$$

In [None]:
@manipulate for w in -2:0.01:2
    plot(x->f(x, w), -5, 5, ylims=(0, 1))

    x0, y0 = 2, 0.8
    
    plot!([x0, x0], [y0, f(x0, w)])
    scatter!([x0], [y0])
    C(w) = (y0 - f(x0, w))^2
    title!("Distance^2 = C(w) = $(C(w))")

end

Let's draw $C(w)$ as a function of the parameter $w$:

In [None]:
x0, y0 = 2.0, 0.8
C(w) = (y0 - f(x0, w))^2

plot(w -> C(w), -3, 3, xlabel="w", ylabel="C(w)", ylims=(0, 0.7))

We see that there is a special value of $w$ where the function $C$ reaches $0$, since for this value of $w$, the graph of $f$ does pass exactly through the point $(x_0, y_0)$. We could find the place $w^*$ where the function hits $0$ by zooming in on that piece of the graph, or with the help of the red vertical line in the next example.
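
For a single data point, $C(w) = 0$ exactly when $f(x_0, w) = y_0$, and this equation can also be solved by hand. A sketch of the derivation, assuming $x_0 \neq 0$ and $0 < y_0 < 1$:

$$\sigma(w \, x_0) = y_0 \iff 1 + \exp(-w \, x_0) = \frac{1}{y_0} \iff \exp(-w \, x_0) = \frac{1 - y_0}{y_0} \iff w^* = \frac{1}{x_0} \ln \frac{y_0}{1 - y_0}.$$

A closed form like this is only available because there is one data point and one parameter; the graphical approach carries over to cases where no such formula exists.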

In [None]:
x0, y0 = 2.0, 0.8
@manipulate for w in -2:0.1:2
    plot(w->(y0 - f(x0, w))^2, -3, 3, xlabel="w", ylabel="C(w)", ylims=(0, 0.7))
    vline!([w])
    title!("Vertical line at w = $w")
end

#### Exercise

For what value of `w` does `C(w)` reach 0?

**Why did we use such a complicated function $C$ with those squares inside?** We could just take the distance (instead of the distance squared) using the absolute value function:

In [None]:
x0, y0 = 2.0, 0.8

plot(w->abs(y0 - f(x0, w)), -3, 3, xlabel="w", ylabel="C_abs(w)", ylims=(0, 0.7))

Now we see why squares are generally preferred: using the absolute value gives a cost function that is *not smooth*, since it has a kink at its minimum. This makes it difficult to use methods from calculus to find the minimum. Nonetheless, non-smooth functions are widely used in practice nowadays.
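
To make the difference concrete, we can compare the derivatives of the two costs with respect to $w$. This is a sketch that writes $r(w) = y_0 - f(x_0, w)$ for the residual, a symbol introduced here for convenience:

$$\frac{d}{dw} \, r(w)^2 = 2 \, r(w) \, r'(w), \qquad \frac{d}{dw} \, |r(w)| = \operatorname{sign}(r(w)) \, r'(w) \quad \text{for } r(w) \neq 0.$$

The first expression is continuous and goes to $0$ as $r(w) \to 0$. The second jumps from $-|r'(w^*)|$ to $+|r'(w^*)|$ as $w$ increases through the point $w^*$ where $r$ changes sign, so the derivative does not exist there.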

## More data

Suppose there are now two data points to fit, the previous $(x_0, y_0)$ together with $(x_1, y_1) = (-3, 0.3)$.
We will calculate the cost function, `C(w)`, as the sum of the squared vertical distances from the graph to each data point:

$$C(w) = \sum_i [y_i - f_w(x_i)]^2$$
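
As a quick sanity check of this formula, here is a sketch that evaluates the cost at $w = 0$, where $f(x, 0) = \sigma(0) = 0.5$ for every $x$. The names `residuals` and `C_at_zero` are chosen here for illustration.

```julia
σ(x) = 1 / (1 + exp(-x))
f(x, w) = σ(w * x)

xs = [2, -3]
ys = [0.8, 0.3]

# f.(xs, 0.0) applies f to each element of xs with w = 0, giving 0.5 each time.
residuals = ys .- f.(xs, 0.0)      # approximately [0.3, -0.2]

# Sum of the squared residuals: 0.3^2 + (-0.2)^2, approximately 0.13
C_at_zero = sum(abs2, residuals)
```

The dots in `.-` and `f.` broadcast the operation over the elements of the arrays, which is what lets one short expression handle any number of data points.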

Let's try to minimize `C(w)`.

In [None]:
xs = [2, -3]
ys = [0.8, 0.3]

@manipulate for w in -2:0.01:2
    plot(x->f(x, w), -5, 5, ylims=(0, 1))
    
    scatter!(xs, ys)

    for i in 1:2
        plot!([xs[i], xs[i]], [ys[i], f(xs[i], w)])
    end
    
    C(w) = sum(abs2, ys .- f.(xs, w))
    
    title!("Distance^2 = C(w) =  $(C(w))")

end

After playing with this for a while, it becomes clear that no value of $w$ makes the function pass through both data points. In other words, our cost function, `C(w)`, is never zero.

#### Exercise

What is the minimum value of `C(w)` that you can find by altering `w`? What is the corresponding value of `w`?

In generating the above plot, we used the `sum` function. `sum` can add together all the elements of a collection or range, or it can add together the outputs of applying a function to all the elements of a collection or range. 

Look up the docs for `sum` via

```julia
?sum
```
if you need more information.
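
The two forms of `sum` described above can be sketched as follows. The inputs and variable names are arbitrary examples chosen here.

```julia
# Form 1: add the elements of a collection or range.
total = sum(1:4)                      # 1 + 2 + 3 + 4 = 10

# Form 2: apply a function to each element, then add the results.
squares = sum(abs2, [1, -2, 3])       # 1 + 4 + 9 = 14

# The function can also be anonymous.
doubled = sum(x -> 2x, 1:3)           # 2 + 4 + 6 = 12
```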

#### Exercise

Use `sum` to add together all integers in the range 1 to 16, inclusive. What is the result?

#### Exercise

What is the sum of the absolute values of all integers between -3 and 3? Use `sum` and the `abs` function.

In our last attempt to minimize `C(w)` by varying `w`, we saw that `C(w)` was always greater than 0. 

Now let's plot `C(w)` as a function of `w`.

In [None]:
C(w) = sum(abs2, ys .- f.(xs, w))

plot(C, -1, 1, xlabel="w", ylabel="C(w)", ylims=(0, 1.2))

We see that $C$ has a minimum, with a small but nonzero value, at a special value of $w$; let's call it $w^*$. From the graph, we can see that $w^* \simeq 0.4$. We could again zoom in on that region of the graph to estimate it more precisely.
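
One way to "zoom in" numerically is a brute-force grid search: evaluate $C$ on a fine grid of $w$ values and pick the smallest result. This is only a sketch; the grid spacing and the names `w_grid`, `C_min`, and `w_best` are assumptions made here.

```julia
σ(x) = 1 / (1 + exp(-x))
f(x, w) = σ(w * x)

xs = [2, -3]
ys = [0.8, 0.3]
C(w) = sum(abs2, ys .- f.(xs, w))

# Evaluate C at every grid point and locate the smallest value.
w_grid = -1:0.001:1
C_min, i = findmin(C.(w_grid))   # findmin returns (minimum value, its index)
w_best = w_grid[i]

println("w ≈ ", w_best, ", C(w) ≈ ", C_min)
```

The answer is only as precise as the grid spacing, and the number of evaluations grows quickly as more parameters are added.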

Isn't there a better way of using the computer to find this value of $w^*$?

## More parameters

If we add more parameters to a function, we may be able to improve how it fits to data. For example, we could define a new function $g$ with another parameter, a shift or **bias**:

$$g(x; w, b) := \sigma(w \, x) + b$$

In [None]:
g(x, w, b) = σ(w*x) + b

*Note*: In the last notebook, we added parameters to a sigmoid function to get the form $$\sigma(w \, x + b),$$ and here we are working with the form $$\sigma(w \, x) + b.$$ Both of these are valid ways to apply a bias to a function! In machine learning terminology, the first is the usual form of a single *neuron*, where the bias is added before the sigmoid is applied. The second adds the bias after the sigmoid, so it shifts the whole curve up or down.
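
A small sketch can make the difference between the two forms visible. The name `h` for the bias-inside form and the sample values `w = 1.0`, `b = 0.5` are choices made here for illustration.

```julia
σ(x) = 1 / (1 + exp(-x))

g(x, w, b) = σ(w * x) + b     # bias added after the sigmoid (the form used here)
h(x, w, b) = σ(w * x + b)     # bias added inside the sigmoid (hypothetical name `h`)

x_grid = -10:0.1:10

# extrema returns the (smallest, largest) value over the grid.
g_lo, g_hi = extrema(g.(x_grid, 1.0, 0.5))   # shifted up by b, so g_hi exceeds 1
h_lo, h_hi = extrema(h.(x_grid, 1.0, 0.5))   # always stays between 0 and 1
```

Adding the bias outside the sigmoid moves the curve vertically, so `g` can take values outside the interval from 0 to 1, whereas `h` only shifts the curve horizontally.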

Let's try to fit this to the data:

In [None]:
xs = [2, -3]
ys = [0.8, 0.3]

@manipulate for w in -2:0.01:2, b in -2:0.01:2
    plot(x->g(x, w, b), -5, 5, ylims=(0, 1))
    
    scatter!(xs, ys)

    for i in 1:2
        plot!([xs[i], xs[i]], [ys[i], g(xs[i], w, b)])
    end
    
    C2D(w, b) = sum(abs2, ys .- g.(xs, w, b))
    
    title!("Distance^2 = C2D(w, b) = $(C2D(w, b))")

end

You should be able to convince yourself that we can now make the curve pass through both points simultaneously. 

#### Exercise

For what values of `w` and `b` does the curve pass through both points?

Let's plot the cost function, taking both parameters into account. To do this, let's first write the cost function as $C_{2D}$:

In [None]:
C2D(w, b) = sum(abs2, ys .- g.(xs, w, b))

To plot the cost as a function of *two* variables (the weight *and* the bias), we will make use of the `surface` function.

In [None]:
ws = -2:0.05:2
bs = -2:0.05:2

surface(ws, bs, C2D, alpha=0.8, zlims=(0,3))

Let's make a new version of this plot using the `plotlyjs` backend so that it becomes a bit more interactive. This will allow us to rotate the surface within our notebook and get a better sense of what it looks like.

In [None]:
plotlyjs()

In [None]:
ws = -2:0.05:2
bs = -2:0.05:2

surface(ws, bs, C2D, alpha=0.8, zlims=(0,3))

If we rotate the surface around, we can see that indeed there is a unique point $(w^*, b^*)$ where the function $C_{2D}$ attains its minimum.
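
The same minimum can be estimated without a plot by evaluating the cost on a grid of $(w, b)$ pairs and taking the smallest entry. This is a rough sketch; the matrix layout and the names `Z`, `idx`, `w_best`, and `b_best` are assumptions made here, and the two-point data set is repeated so the block is self-contained.

```julia
σ(x) = 1 / (1 + exp(-x))
g(x, w, b) = σ(w * x) + b

xs = [2, -3]
ys = [0.8, 0.3]
C2D(w, b) = sum(abs2, ys .- g.(xs, w, b))

ws = -2:0.05:2
bs = -2:0.05:2

# Z[i, j] holds the cost at (ws[i], bs[j]).
Z = [C2D(w, b) for w in ws, b in bs]

# For a matrix, findmin returns the minimum and a CartesianIndex (row, column).
C_min, idx = findmin(Z)
w_best, b_best = ws[idx[1]], bs[idx[2]]
```

Because the grid has spacing 0.05, `w_best` and `b_best` are only accurate to about that resolution.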

If we add more data, however, we will again not be able to fit all of the data; we will only be able to attain a "best fit".

Let's create `xs` and `ys` with more data:

In [None]:
xs = [2, -3, -1, 1]
ys = [0.8, 0.3, 0.4, 0.4]

and now we can try to plot our best fit, given all this data:

In [None]:
@manipulate for w in -2:0.01:2, b in -2:0.01:2
    plot(x->g(x, w, b), -5, 5, ylims=(-0.2, 1))
    
    scatter!(xs, ys)

    for i in 1:length(xs)
        plot!([xs[i], xs[i]], [ys[i], g(xs[i], w, b)])
    end
    
    C2D(w, b) = sum(abs2, ys .- g.(xs, w, b))
    
    title!("Distance^2 = C2D(w, b) = $(C2D(w, b))")

end

Let's define the cost function that we're using above so that we can plot it as a function of the parameters `w` and `b`.

#### Exercise

We've seen the cost function, $$C_{2D}(w, b) = \sum_i [y_i - g(x_i; w, b)]^2,$$ written in Julia as

```julia
C2D(w, b) = sum(abs2, ys .- g.(xs, w, b))
```

a few times now. To ensure you understand what this function is doing, implement your own cost function using the commented code below. Do this without using `sum`, `abs2`, or broadcasting dot syntax (for example, `.-`). Hint: you'll want to use a `for` loop to do this.

In [None]:
# function myC2D(w, b)
#     cost = 0
    
#     return cost
# end

Now that you've defined `myC2D`, we can plot it using `surface`!

In [None]:
ws = -2:0.05:2
bs = -2:0.05:2

surface(ws, bs, myC2D, alpha=0.8, zlims=(0,3))