# Julia Things

### Environment

First things first. Let us set up the environment with the required packages for this notebook:

In [3]:
for p in ("Knet", "Plots", "Plotly.jl", "DataFrames")
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using Knet, Plots, DataFrames
gr()

Knet.gpu(0); # set the desired GPU to use
atype = Array{Float32}; # atype = KnetArray{Float32} for gpu usage, Array{Float32} for cpu. 

println("OS: ", Sys.KERNEL)
println("Julia: ", VERSION)
println("Knet: ", Pkg.installed("Knet"))
println("GPU: ", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))

OS: Linux
Julia: 0.6.0
Knet: 0.8.5+
GPU: NVS 310
TITAN X (Pascal)



### New Stuff

In this notebook we introduce the following Julia/Knet packages and functions:

* Julia's package [DataFrames](https://en.wikibooks.org/wiki/Introducing_Julia/DataFrames)
* Julia's (technically `JuliaDB`) function [loadtable](http://juliadb.org/latest/api/io.html#JuliaDB.loadtable)

# Binary classification with logistic regression

In the last tutorial we worked through how to implement a [linear regression model](https://github.com/moralesq/Knet-the-Julia-dope/blob/master/chapter02_supervised-learning/linear-regression.ipynb).

Regression is the hammer we reach for when we want to answer *how much?* or *how many?* questions. If you want to predict the price at which a house will be sold, or the number of wins a baseball team might have, or the number of days that a patient will remain hospitalized before being discharged, then you're probably looking for a regression model.

Based on our experience, in industry, we're more often interested in making categorical assignments. *Does this email belong in the spam folder or the inbox?* *How likely is this customer to sign up for a subscription service?* When we're interested in either assigning datapoints to categories or assessing the *probability* that a category applies, we call this task *classification*. 

The simplest kind of classification problem is *binary classification*, when there are only two categories, so let's start there. Let's call our two categories the positive class $y_i=1$ and the negative class $y_i = 0$ (another common way of defining the labels is $y_i=\pm1$). Even with just two categories, and even confining ourselves to linear models, there are many ways we might approach the problem. For example, we might try to draw a line that best separates the points:

![](../img/linear-separator.png)

A whole family of algorithms called support vector machines pursue this approach.
The main idea here is to choose a line that maximizes the margin to the closest data points on either side of the decision boundary. In these approaches, only the points closest to the decision boundary (the support vectors) actually influence the choice of the linear separator.

With neural networks, we usually approach the problem differently. Instead of just trying to separate the points, we train *probabilistic classifiers* which estimate, for each data point $\boldsymbol{x}$, the *conditional probability* $\mathbb{P}(y|\boldsymbol{x})$ that it belongs to class $y$. 

Recall that in linear regression, we made predictions of the form

$$ \hat{y} = \boldsymbol{w}^T \boldsymbol{x} + b, $$

where $\hat{y},b\in\mathbb{R}$ and $\boldsymbol{w},\boldsymbol{x}\in\mathbb{R}^d$. We are interested in asking the question *"what is the probability that example $\boldsymbol{x}$ belongs to the positive class?"* A regular linear model is a poor choice here because it can output values greater than $1$ or less than $0$. To coerce reasonable answers from our model,  we're going to modify it slightly, by running the linear function through a sigmoid activation function $\sigma$:

$$ \hat{y} =\sigma(\boldsymbol{w}^T \boldsymbol{x} + b). $$

The sigmoid function $\sigma$, sometimes called a squashing function or a *logistic* function (hence the name logistic regression), maps a real-valued input to the range 0 to 1. Indeed, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability: $\sigma(z)\in (0,1)$ for all $z$, with $\sigma(z)\rightarrow 1$ as $z\rightarrow \infty$ and $\sigma(z)\rightarrow 0$ as $z\rightarrow -\infty$, and $\sigma(-z)=1-\sigma(z)$. If we pick the labels $y\in\{0,1\}$ we may assign  

\begin{equation}
\begin{aligned}
\mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\
\mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\
\end{aligned}
\end{equation}


which can be written more compactly as $\mathbb{P}(y|z)  =\sigma(z)^y(1-\sigma(z))^{1-y}$. Let us define and visualize this function:

In [4]:
sigmoid(z) = 1 ./ (1 + exp.(-z))
plot(-5:0.1:5, sigmoid(-5:0.1:5), xlabel=:z, ylabel="sigmoid(z)", title="Logistic Function", legend=false, size=(400,200))

Note that an input of $0$ gives a value of $.5$. 
So in the common case, where we want to predict positive whenever the probability is greater than $.5$
and negative whenever the probability is less than $.5$,
we can just look at the sign of $\boldsymbol{w}^T \boldsymbol{x} + b$. Formally, in (binary) classification problems one aims at finding a classification rule (also called a decision rule), which is a binary-valued function on the input space $c:\mathcal{X}\rightarrow\{0,1\}$ (in this example $\mathcal{X}=\mathbb{R}^d$). However, direct minimization of the classification error is not computationally feasible, mostly because the classification loss is not convex (to be discussed later). In practice, one looks for a real-valued (rather than binary-valued) function $f:\mathcal{X}\rightarrow \mathbb{R}$ and replaces the loss function with some convex loss. A classification rule is then obtained by taking the sign. 
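To make the decision rule concrete, here is a minimal sketch in plain Julia. The weights `w`, the bias `b` and the inputs `xs` are hypothetical values picked only for illustration; the point is that thresholding the probability at $.5$ and checking the sign of the linear score give the same answer.

```julia
# Logistic function for a scalar input
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration
w  = [2.0, -1.0]
b  = 0.5
xs = [[1.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [3.0, 0.0]]

for x in xs
    z = sum(w .* x) + b                 # linear score w'x + b
    by_probability = sigmoid(z) > 0.5   # threshold the probability at 0.5
    by_sign        = z > 0              # look only at the sign of the score
    @assert by_probability == by_sign
    println("z = ", z, "  p = ", sigmoid(z), "  positive? ", by_sign)
end
```

Because the sigmoid is strictly increasing and $\sigma(0)=.5$, the two rules can never disagree.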

## Binary cross-entropy loss

Now that we've got a model that outputs probabilities,
we need to choose a loss function.
When we wanted to predict *how much* we used squared error $(y-\hat{y})^2$
as our measure of our model's performance. 

Since now we're thinking about outputting probabilities,
one natural objective is to say that we should choose the weights (or parameters $\theta$ )
that give the actual labels in the training data highest probability. For $n$ samples $\{x_i,y_i\}$ we want to maximize

$$\max_{\theta} \mathbb{P}_{\theta}\big( y_1,\dots,y_n \big|\,\boldsymbol{x}_1,\dots,\boldsymbol{x}_n \big)$$

Because each example is independent of the others, and each label depends only on the features of the corresponding example, we can rewrite the above as

$$\max_\theta \prod_{i=1}^n\mathbb{P}_\theta(y_i| \boldsymbol{x}_i)=\max_{\theta} \mathbb{P}_{\theta}\big(y_1|\boldsymbol{x}_1\big)\mathbb{P}_{\theta}\big(y_2|\boldsymbol{x}_2\big)\cdots\mathbb{P}_{\theta}\big(y_n|\boldsymbol{x}_n\big)$$

This function is a product over the examples, but in general, because we want to train by stochastic gradient descent, it's a lot easier to work with a loss function that breaks down as a sum over the training examples. 

$$\max_\theta \log\Big(\prod_{i=1}^n\mathbb{P}(y_i|\boldsymbol{x}_i)\Big)= \max_\theta \sum_{i=1}^n\log\big(\mathbb{P}(y_i|\boldsymbol{x}_i)\big)=\max_\theta \Big[\log\big(\mathbb{P}(y_1|\boldsymbol{x}_1)\big)+\cdots+\log\big(\mathbb{P}(y_n|\boldsymbol{x}_n)\big)\Big]$$
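Besides splitting the objective into a sum, taking the logarithm also helps numerically. This is an extra motivation, not one stated above. The sketch below uses 2000 made-up probabilities to show that a raw product can underflow in floating point while the sum of logs stays well behaved.

```julia
# 2000 hypothetical per-example probabilities, all equal to 0.5
p = fill(0.5, 2000)

likelihood     = prod(p)        # 0.5^2000 underflows to 0.0 in Float64
log_likelihood = sum(log.(p))   # 2000 * log(0.5), easily representable

println("product of probabilities: ", likelihood)
println("sum of log-probabilities: ", log_likelihood)
```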

Because we typically express our objective as a *loss* we can just flip the sign, giving us the *negative log probability:*

$$  \min_\theta \Big(- \sum_{i=1}^n\log\big(\mathbb{P}(y_i|\boldsymbol{x}_i)\big)\Big)$$

Recall that we can write $\mathbb{P}_\theta(y_i|z_i)$ compactly as

$$\mathbb{P}_\theta(y_i|z_i) =\sigma(z_i)^{y_i}(1-\sigma(z_i))^{1-y_i},$$

where $\hat{y}_i = \sigma(z_i) = \sigma(\boldsymbol{w}^T \boldsymbol{x}_i + b)$ and $\theta=\{\boldsymbol{w},b\}$ are the parameters to be optimized. Let us work through this expression. With the (important) relation $\sigma(-z) = 1-\sigma(z)$ we have

\begin{equation}
\begin{aligned}
\log\big(\mathbb{P}_\theta(y|z)\big)&=
\log\big(\sigma(z)^{y}(1-\sigma(z))^{1-y}\big)\\
&=y\log\sigma(z) + (1-y)\log(1-\sigma(z))\\
&=y\big(\log\sigma(z)-\log\sigma(-z)\big) + \log\sigma(-z)\\
&=y\log \frac{\sigma(z)}{\sigma(-z)} + \log\sigma(-z)\\
&=y\log\Big( \frac{1+e^{z}}{1+e^{-z}} \Big) + \log\sigma(-z)\\
&=y\log\Big( \frac{e^{z}(e^{-z}+1)}{1+e^{-z}} \Big) + \log\sigma(-z)\\
&=yz + \log\sigma(-z)
\end{aligned}
\end{equation}

Therefore we take the negative of this expression, use $\log\sigma(-z)=-\log(1+e^{z})$, and minimize the objective function 

$$l = \sum_{i=1}^n \Big(-y_iz_i + \log(1+e^{z_i})\Big)$$
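As a quick sanity check of the derivation, the sketch below compares the negative log-likelihood written directly from the Bernoulli form with the negative of the last line of the derivation, $-yz+\log(1+e^{z})$, on a handful of arbitrary test values. The function names `nll_direct` and `nll_simplified` are ours, introduced only for this check.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Negative log-likelihood written directly from the Bernoulli form
nll_direct(y, z) = -(y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z)))

# Negative of the last line of the derivation: -(y*z + log(sigmoid(-z)))
nll_simplified(y, z) = -y * z + log(1 + exp(z))

for y in (0, 1), z in (-3.0, -0.5, 0.0, 0.5, 3.0)
    @assert isapprox(nll_direct(y, z), nll_simplified(y, z); atol=1e-12)
end
println("both forms agree")
```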

Whew! What a bunch of math! If you're learning machine learning for the first time, that might have been too much information too quickly. Let's take a look at this loss function and break down what's going on more slowly.

We started with the expression 

$$\mathbb{P}_\theta(y_i|z_i) =\sigma(z_i)^{y_i}(1-\sigma(z_i))^{1-y_i},$$

where $\hat{y}_i = \sigma(z_i) = \sigma(\boldsymbol{w}^T \boldsymbol{x}_i + b)$. This is the conditional probability. 

We then found that the loss function depended on two terms:

* $y_i\log \hat{y}_i$
* $(1-y_i)\log (1-\hat{y}_i)$

But recall that we are interpreting $\hat{y}_i=\sigma(z_i)$ as a probability that $x_i$ has a given label, namely $\mathbb{P}(y_i=1|z_i)=\sigma(z_i)$ and $\mathbb{P}(y_i=0|z_i)=1-\sigma(z_i)$. Because $y_i$ only takes values $0$ or $1$, for any given data point, one of these terms disappears. 
When $y_i$ is $1$, this loss says that we should maximize $\log \hat{y}_i$, giving higher probability to the *correct* answer. 
When $y_i$ is $0$, the only remaining term is $\log (1-\hat{y}_i)$. That says that we should maximize the value $1-\hat{y}_i$, which we already know is the probability assigned to $\boldsymbol{x}_i$ belonging to the negative class.


Note that this loss function is commonly called *log loss* and also commonly referred to as *binary cross entropy*. It is a special case of negative log [likelihood](https://en.wikipedia.org/wiki/Likelihood_function). And it is a special case of [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy), which can apply to the multi-class ($>2$) setting. 

**If instead we were to use the labels $y_i=\pm1$, the loss function has to be modified to $\log(1+e^{-y_iz_i})$. This usually leads to a lot of confusion as to why there exist two versions of logistic regression. See [here](https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression/279698#279698) for more information on the topic.**
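The two label conventions describe the same loss. As a small sketch (the names `loss01` and `losspm` are ours), we can map a $\{0,1\}$ label $y$ to a $\pm1$ label $t=2y-1$ and check that $-yz+\log(1+e^{z})$ and $\log(1+e^{-tz})$ coincide:

```julia
# Per-example loss for labels y in {0, 1}
loss01(y, z) = -y * z + log(1 + exp(z))

# Per-example loss for labels t in {-1, +1}
losspm(t, z) = log(1 + exp(-t * z))

for y in (0, 1), z in (-2.0, 0.0, 1.5)
    t = 2y - 1                       # map {0, 1} to {-1, +1}
    @assert isapprox(loss01(y, z), losspm(t, z); atol=1e-12)
end
println("both conventions give the same loss")
```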

## The Adult Dataset

We'll use the Adult dataset taken from the [UCI repository](http://archive.ics.uci.edu/ml/datasets.html). The dataset was constructed by Barry Becker from 1994 census data. In its original form, the dataset contained $14$ features, including age, education, occupation, sex, native-country, among others. In this version, hosted by [National Taiwan University](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), the data have been re-processed to $123$ binary features each representing quantiles among the original features. The label is a binary indicator of whether the person corresponding to each row made more ($y_i = 1$) or less ($y_i = 0$) than \$50,000 of income in 1994. The dataset we're working with contains 30,296 training examples and 1,605 examples set aside for testing. We can download and read the datasets into main memory like so:

In [5]:
url_train = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a2a.t"
url_test  = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a"

if !isfile("../datasets/adult.train")
    rawdata_train = readtable(download(url_train, "../datasets/adult.train"), header=false);
else
    rawdata_train = readtable("../datasets/adult.train", header=false);
end

if !isfile("../datasets/adult.test")
    rawdata_test  = readtable(download(url_test, "../datasets/adult.test"), header=false);
else
    rawdata_test  = readtable("../datasets/adult.test", header=false);    
end

@sprintf "Training size = %d    Testing size = %d" size(rawdata_train, 1) size(rawdata_test, 1)

"Training size = 30296    Testing size = 1605"

Notice that we set `header=false` in `readtable` since in this case the first row of the data does not contain the column names. Let's take a look at the data:

In [6]:
rawdata_train[1][1:5]

5-element DataArrays.DataArray{String,1}:
 "-1 5:1 7:1 14:1 19:1 39:1 40:1 51:1 63:1 67:1 73:1 74:1 76:1 78:1 83:1"
 "-1 3:1 6:1 17:1 22:1 36:1 41:1 53:1 64:1 67:1 73:1 74:1 76:1 80:1 83:1"
 "-1 5:1 6:1 17:1 21:1 35:1 40:1 53:1 63:1 71:1 73:1 74:1 76:1 80:1 83:1"
 "-1 2:1 6:1 18:1 19:1 39:1 40:1 52:1 61:1 71:1 72:1 74:1 76:1 80:1 95:1"
 "-1 3:1 6:1 18:1 29:1 39:1 40:1 51:1 61:1 67:1 72:1 74:1 76:1 80:1 83:1"

The data consists of lines like the following:

-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1

The first entry in each row is the value of the label. The following tokens are the indices of the non-zero features. The number $1$ here is redundant. But we don't always have control over where our data comes from, so we might as well get used to mucking around with weird file formats. Let's write a simple script to process our dataset.

In [7]:
function nonzeroindex(sample)

    s      = split(sample, ":1 ") 
    val    = parse(Int64, split(sample)[1])
    s[1]   = split(s[1])[2]
    s[end] = split(s[end], ":")[1]
    
    output = zeros(Float32, 1, 124)
    output[parse.(Int64, s)] = 1
    output[end] = val
    return output
end

function processdata(rawdata; atype=Array{Float32})
    data = map(atype, [vcat(nonzeroindex.(rawdata[1])...)'])[1];
    # change label from {-1,1} to {0,1}
    x, y = map(atype, [data[1:end-1, :], (data[end:end, :] + 1) / 2])
    return x, y
end

processdata (generic function with 1 method)
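To see what the parsing does on a single line, here is a self-contained sketch written for Julia 1.x. `parse_libsvm_line` is a hypothetical helper of ours, not part of the notebook; it follows the format described above (label first, then `index:value` tokens) and builds a dense vector of the $123$ binary features.

```julia
# Hypothetical helper (not part of the notebook): parse one line of the form
# "<label> <index>:<value> <index>:<value> ..." into a label and feature indices.
function parse_libsvm_line(line)
    tokens  = split(line)                          # split on whitespace
    label   = parse(Int, tokens[1])                # first token is the label
    indices = [parse(Int, split(t, ":")[1]) for t in tokens[2:end]]
    return label, indices
end

line = "-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1"
label, indices = parse_libsvm_line(line)

x = zeros(Float32, 123)      # dense vector with the 123 binary features
x[indices] .= 1              # switch on the non-zero features
y = (label + 1) / 2          # map the label from {-1, 1} to {0, 1}

println("label = ", y, ", active features = ", indices)
```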

As always, we will try to write generic functions that can run with either `Array` or `KnetArray` data types, symbolized here as `atype`.

In [8]:
atype=KnetArray{Float32}; Knet.gpu(0)
xtrn, ytrn  = processdata(rawdata_train, atype=atype);
xtst, ytst  = processdata(rawdata_test, atype=atype);

We can also check the fraction of positive examples in our training and test sets. This will give us one nice (necessary but insufficient) sanity check that our training and test data really are drawn from the same distribution.

In [9]:
sum(ytrn) / length(ytrn), sum(ytst) / length(ytst)

(0.23993267f0, 0.24610592f0)

Let's get a [minibatch](http://denizyuret.github.io/Knet.jl/latest/reference.html#Knet.minibatch) and define our model

In [30]:
dtrn = minibatch(xtrn, ytrn, 64, shuffle=true);

In [31]:
pred(w, x) = w[1] * x .+ w[2];

function loss(w, x, y)
    yhat = sigm.(pred(w, x))
    return -sum(y .* log.(yhat) + (1-y) .* log.(1-yhat))
end

lossgradient  = grad(loss)

(::gradfun) (generic function with 1 method)
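Knet's `grad` builds the gradient function for us, so we never have to write the derivative by hand. For this model the gradient also has a simple closed form: since $\partial l/\partial z_i=\hat{y}_i-y_i$, we get $\partial l/\partial \boldsymbol{w}=\sum_i(\hat{y}_i-y_i)\boldsymbol{x}_i$ and $\partial l/\partial b=\sum_i(\hat{y}_i-y_i)$. The sketch below is plain Julia without Knet. The names `bce` and `bce_gradient` and the tiny dataset are ours, and the closed form is checked against central finite differences.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Summed binary cross-entropy for a weight vector w, bias b,
# a d-by-n feature matrix x and a vector y of n labels in {0, 1}
function bce(w, b, x, y)
    yhat = sigmoid.(x' * w .+ b)
    return -sum(y .* log.(yhat) .+ (1 .- y) .* log.(1 .- yhat))
end

# Closed-form gradient: the residual (yhat - y) weighted by the inputs
function bce_gradient(w, b, x, y)
    r = sigmoid.(x' * w .+ b) .- y
    return x * r, sum(r)
end

# Tiny hypothetical problem: d = 2 features, n = 3 examples
x = [1.0 0.0 1.0;
     0.0 1.0 1.0]
y = [1.0, 0.0, 1.0]
w = [0.3, -0.2]
b = 0.1

gw, gb = bce_gradient(w, b, x, y)

# Central finite differences as an independent check
h = 1e-6
e(i) = Float64.((1:2) .== i)      # i-th unit vector
fd_w = [(bce(w .+ h .* e(i), b, x, y) - bce(w .- h .* e(i), b, x, y)) / (2 * h) for i in 1:2]
fd_b = (bce(w, b + h, x, y) - bce(w, b - h, x, y)) / (2 * h)

println("analytic: ", gw, "  finite differences: ", fd_w)
```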

In [32]:
function train(w, dtrn; lr=1e-6, epochs=5)
    tloss = []
    for epoch = 1:epochs
        eloss = 0
        for (x,y) in dtrn
            eloss += loss(w, x, y)
            g = lossgradient(w, x, y)
            for i = 1:length(w)
                w[i] -= lr * g[i]
            end
        end
        push!(tloss, eloss/length(dtrn))
    end
    
    return w, tloss
end

train (generic function with 1 method)

While the negative log likelihood gives us a sense of how well the predicted probabilities agree with the observed labels, it's not the only way to assess the performance of our classifiers. For example, at the end of the day, we'll often want to apply a threshold to the predicted probabilities in order to make hard predictions. For example, if we were building a spam filter, we'll need to either send the email to the spam folder or to the inbox. In these cases, we might not care about negative log likelihood, but instead we want to know how many errors our classifier makes. Let's code up a simple script that calculates the accuracy of our classifier.

In [40]:
Accuracy(w, x, ygold) = sum((sign.(pred(w, x)) + 1) / 2 .== ygold) / length(ygold)

Accuracy (generic function with 1 method)

In [46]:
w = map(atype, Any[randn(1, size(xtrn, 1)), zeros(Float32,1,1) ]);
Accuracy(w, xtst, ytst)

0.2753894f0

In [47]:
w, Loss = train(w, dtrn; epochs=30, lr=1e-2);

In [48]:
Accuracy(w, xtrn, ytrn)

0.8489239f0

In [49]:
scatter(Loss)

Let us now check the accuracy on the test set: 

In [50]:
Accuracy(w, xtst, ytst)

0.83925235f0

This isn't too bad! A naive classifier would predict that nobody had an income greater than \$50k (the majority class). This classifier would achieve an accuracy of roughly 75%. By contrast, our classifier gets an accuracy of .84 (results may vary a small amount on each run owing to random initializations and random sampling of the batches).
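The majority-class baseline follows directly from the fraction of positive labels computed earlier. A minimal sketch, with the two numbers copied from the outputs above:

```julia
# Fraction of positive labels in the test set, as reported earlier
positive_fraction = 0.24610592

# Always predicting the majority class is right exactly when the label is the majority label
baseline_accuracy = max(positive_fraction, 1 - positive_fraction)

model_accuracy = 0.83925235          # test accuracy reported above
println("baseline: ", baseline_accuracy, "  model: ", model_accuracy)
```

Comparing against this baseline is a useful habit whenever the classes are imbalanced, since raw accuracy alone can look better than it is.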

By now you should have some feeling for the two most fundamental tasks in supervised learning: regression and classification. In the following chapters we'll go deeper into these problems, exploring more complex models, loss functions, optimizers, and training schemes. We'll also look at more interesting datasets. And finally, in the following chapters we'll also look at more advanced problems where we want, for example, to predict more structured objects.

## Medical Appointment Data

We will now use a different dataset, the Medical Appointment No Shows [dataset](https://www.kaggle.com/joniarroba/noshowappointments). This dataset is RAW, meaning that binary features have not yet been extracted. The raw medical appointments dataset consists of 13 features and 1 binary label. The labels are show or no-show for a scheduled medical appointment, and the features include gender, age, neighbourhood, etc. 

We will use Julia's package `DataFrames` and the type `DataFrame` to load and manipulate the data. Let us load the data and explore the type and content of the output:

In [303]:
rawdata = readtable("../datasets/KaggleV2-May-2016.csv");
typeof(rawdata)

DataFrames.DataFrame

As you can see, `rawdata` is of type `DataFrame`; `readtable` automatically reads the csv file into a data frame. By default it assumes that the names of the columns are on the first row. There are MANY ways of manipulating this data and you should [become familiar](https://en.wikibooks.org/wiki/Introducing_Julia/DataFrames) with them. For example, we can access one particular feature with its name:

In [304]:
rawdata[2000:2005, :Gender], rawdata[2000:2005, :No_show]

(String["F", "F", "F", "F", "F", "F"], String["Yes", "No", "Yes", "Yes", "Yes", "No"])

Our first obstacle is to decide how to encode some of the features. For example, there are multiple ways to [encode the month](https://stats.stackexchange.com/questions/108977/whats-the-optimal-way-to-encode-a-month-feature) for when the appointment was scheduled and for the date of the appointment. We could encode a single variable with 12 discrete values {Month = 1, 2, ..., 12} or equivalent. This might be OK if you really want to treat each month separately. However, there is a potential problem of modulo distance, i.e. if you use numbers 1, ... , 12 then each month differs by 1 from the previous month, except for Dec=12 & Jan=1. Let us take a look at this feature to see what we're dealing with. 
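One commonly suggested remedy for the modulo problem is to place the months on a circle, so that December and January end up as close as any other pair of neighbours. This cyclic encoding is only an illustration and is not used in the rest of this notebook; `cyclic` and `dist` are names we introduce here.

```julia
# Map month m in 1..12 to a point on the unit circle
cyclic(m) = (sin(2pi * m / 12), cos(2pi * m / 12))

# Euclidean distance between two encoded months
dist(a, b) = sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2)

println("integer encoding: |12 - 1| = ", abs(12 - 1), ", |2 - 1| = ", abs(2 - 1))
println("cyclic encoding:  d(Dec, Jan) = ", dist(cyclic(12), cyclic(1)),
        ", d(Jan, Feb) = ", dist(cyclic(1), cyclic(2)))
```

The price is that one feature becomes two, and a linear model can no longer give each month its own independent weight, which is what the binary encoding used later allows.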

First we define the feature encoding functions and we split the date into the relevant components:

In [305]:
gender(x)            = 1.0(x .== "M") # one for male 
appointment_month(x) = parse(Int64, x[2]);
appointment_date(x)  = parse(Int64, split(x[3], 'T')[1]);
appointment_hour(x)  = parse(Int64, split(split(x[3], 'T')[2], ':')[1]);

In [306]:
s = split.(rawdata[:, :ScheduledDay], "-");

smonth = appointment_month.(s);
sdate  = appointment_date.(s);
shour  = appointment_hour.(s);

Let us redefine the labels from `:No_show=[Yes, No]` to `:No_show=[1, 0]`, such that a subject with a label of `1` did not show up to the appointment.  

In [307]:
rawdata[:, :No_show] = map(Float64, rawdata[:, :No_show] .== "Yes");
rawdata[2000:2005, :No_show]

6-element DataArrays.DataArray{Float64,1}:
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0

You can verify that this agrees with our original labels. We can now plot the conditional probabilities for the hour and month for all subjects:

In [88]:
p1 = histogram(shour[rawdata[:, :No_show] .== 1],  bins=0.5:1:23.5, norm=true, label="y=No_show", title="P(y|hour)", color=:blue)
     histogram!(shour[rawdata[:, :No_show] .== 0], bins=0.5:1:23.5, norm=true, opacity=0.8, label="y=show", title="P(y|hour)", color=:orange);

p2 = histogram(smonth[rawdata[:, :No_show] .== 1],  bins=0.5:11.5, norm=true, label="y=No_show", title="P(y|month)", color=:blue)
     histogram!(smonth[rawdata[:, :No_show] .== 0], bins=0.5:11.5, norm=true, opacity=0.8, label="y=show", title="P(y|month)", color=:orange);

In [95]:
plot(p1,p2, size=(600,200))

As you can see, we don't have any data after June, so for now it would be safe to simply encode the month like this. In addition, you can verify that almost the entire dataset has the same year (2016) and therefore for simplicity we will ignore it. We have a similar situation with the hour, since most appointments were made early in the day, and therefore we could take the same approach. Nevertheless, for comparison to our previous dataset we will perform binary encoding. For example, each month will be a separate feature equal to 1 if the appointment occurs during that month and 0 otherwise. 

Now, let's examine what these plots are saying. On the left we have the conditional probability $\mathbb{P}(y|hour)$, i.e. what is the probability that the subject will show up given the time of the day *at which the appointment was scheduled* (discretized in hours). (Strictly speaking, each histogram is normalized within its own class, so the bars show how the scheduling hours are distributed given the label; comparing the two at a given hour shows which outcome is over-represented there.) The results are quite interesting! This plot says that it's more probable that the subject will show up if the appointment was scheduled before 9am (peaking at 7am), and more probable that it will be a no-show if the appointment was scheduled after 9am (peaking at 2pm). Hmm, perhaps early birds are more reliable? 

On the right we have $\mathbb{P}(y|month)$. In this case we find that a subject is more likely to show up if the appointment was scheduled after summer (peaking in May), and more likely to be a no-show before that. Perhaps those that schedule early during the year assume they will be able to re-schedule at a later month? Or is it a New Year's resolution syndrome?

OK, time to get serious. Let us pre-process this dataset now. Notice that

* The year will be ignored since it does not vary 
* The appointment hour is ignored since it is not shown

We will start by handling the neighbourhood. First we collect all the unique neighbourhoods (which should be 81) in a list called `dict`. We then create 81 features $f_i$ such that $f_i=1$ if the subject lives in the corresponding neighbourhood and $f_i=0$ otherwise. Let us start by creating an empty dataset and filling in these features:

In [331]:
x = DataFrame();

In [332]:
dict = unique(rawdata[:, :Neighbourhood]);
for key in dict
    x[:, [Symbol(key)]] = 1.0(rawdata[:, :Neighbourhood] .== key)
end
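The same one-hot idea on a toy column, without DataFrames, looks like this. It is a sketch with a made-up four-row column that reuses three of the neighbourhood names from the data; the variable names are ours.

```julia
# Toy categorical column (three of the neighbourhood names from the data)
neighbourhood = ["CENTRO", "BONFIM", "CENTRO", "HORTO"]

# One indicator vector per distinct category, in order of first appearance
categories = unique(neighbourhood)
onehot = Dict(c => Float64.(neighbourhood .== c) for c in categories)

println(categories)
println(onehot["CENTRO"])
```

Every row has exactly one active indicator, which is a handy invariant to check on the real data as well.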

We can verify that our encoding is correct

In [333]:
x[1:5, :]

Unnamed: 0,JARDIM DA PENHA,MATA DA PRAIA,PONTAL DE CAMBURI,REPÚBLICA,GOIABEIRAS,ANDORINHAS,CONQUISTA,NOVA PALESTINA,DA PENHA,TABUAZEIRO,BENTO FERREIRA,SÃO PEDRO,SANTA MARTHA,SÃO CRISTÓVÃO,MARUÍPE,GRANDE VITÓRIA,SÃO BENEDITO,ILHA DAS CAIEIRAS,SANTO ANDRÉ,SOLON BORGES,BONFIM,JARDIM CAMBURI,MARIA ORTIZ,JABOUR,ANTÔNIO HONÓRIO,RESISTÊNCIA,ILHA DE SANTA MARIA,JUCUTUQUARA,MONTE BELO,MÁRIO CYPRESTE,SANTO ANTÔNIO,BELA VISTA,PRAIA DO SUÁ,SANTA HELENA,ITARARÉ,INHANGUETÁ,UNIVERSITÁRIO,SÃO JOSÉ,REDENÇÃO,SANTA CLARA,CENTRO,PARQUE MOSCOSO,DO MOSCOSO,SANTOS DUMONT,CARATOÍRA,ARIOVALDO FAVALESSA,ILHA DO FRADE,GURIGICA,JOANA D´ARC,CONSOLAÇÃO,PRAIA DO CANTO,BOA VISTA,MORADA DE CAMBURI,SANTA LUÍZA,SANTA LÚCIA,BARRO VERMELHO,ESTRELINHA,FORTE SÃO JOÃO,FONTE GRANDE,ENSEADA DO SUÁ,SANTOS REIS,PIEDADE,JESUS DE NAZARETH,SANTA TEREZA,CRUZAMENTO,ILHA DO PRÍNCIPE,ROMÃO,COMDUSA,SANTA CECÍLIA,VILA RUBIM,DE LOURDES,DO QUADRO,DO CABRAL,HORTO,SEGURANÇA DO LAR,ILHA DO BOI,FRADINHOS,NAZARETH,AEROPORTO,ILHAS OCEÂNICAS DE TRINDADE,PARQUE INDUSTRIAL
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can do something similar for the scheduling month (`smonth`) and the appointment month (`amonth`):

In [334]:
smonth = appointment_month.(split.(rawdata[:, :ScheduledDay], "-"));
sdate  = appointment_date.(split.(rawdata[:, :ScheduledDay], "-"));
shour  = appointment_hour.(split.(rawdata[:, :ScheduledDay], "-"));
amonth = appointment_month.(split.(rawdata[:, :AppointmentDay], "-"));
adate  = appointment_date.(split.(rawdata[:, :AppointmentDay], "-"));

dict = Dict("Jan"  => 1, "Feb" => 2, "Mar"  => 3,"Apr" => 4,  "May" => 5, "June" => 6,
            "July" => 7, "Aug" => 8, "Sept" => 9,"Oct" => 10, "Nov" => 11, "Dec" => 12);
for (key, val) in dict
    skey = @sprintf "s%s" key
    akey = @sprintf "a%s" key
    y1 = 1.0(smonth .== val)
    y2 = 1.0(amonth .== val)
    if sum(y1) > 0
        x[:, [Symbol(skey)]] = y1
    end
    if sum(y2) > 0
        x[:, [Symbol(akey)]] = y2
    end
end

The date and the time are simpler since they range from 1 to 31 and 1 to 24:

In [335]:
for date =1:31
    skey = @sprintf "sdate_%d" date
    akey = @sprintf "adate_%d" date
    y1 = 1.0(sdate .== date)
    y2 = 1.0(adate .== date)
    
    if sum(y1) > 0
        x[:, [Symbol(skey)]] = y1
    end
    if sum(y2) > 0
        x[:, [Symbol(akey)]] = y2
    end
end

for hour = 1:24
    skey = @sprintf "shour%d" hour
    y = 1.0(shour .== hour)
    if sum(y) > 0
        x[:, [Symbol(skey)]] = y
    end
end

Finally, we can extract the last features which are binary. Notice that for the age we simply normalize to ensure the range goes from 0 to 1:

In [336]:
x[:, :age]    = (rawdata[:, :Age]- minimum(rawdata[:, :Age])) / (maximum(rawdata[:, :Age]) - minimum(rawdata[:, :Age]))
x[:, :Scholarship]  = rawdata[:, :Scholarship]
x[:, :Hipertension] = rawdata[:, :Hipertension]
x[:, :Diabetes]     = rawdata[:, :Diabetes]
x[:, :Alcoholism]   = rawdata[:, :Alcoholism]
x[:, :Handcap]      = rawdata[:, :Handcap]
x[:, :SMS_received] = rawdata[:, :SMS_received];
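The age rescaling used above is ordinary min-max normalization. A minimal standalone sketch, where `minmax_scale` is our name and the ages are invented for illustration:

```julia
# Rescale a numeric vector linearly so that its minimum maps to 0 and its maximum to 1
minmax_scale(v) = (v .- minimum(v)) ./ (maximum(v) - minimum(v))

ages   = [0, 23, 58, 115]        # hypothetical ages, for illustration only
scaled = minmax_scale(ages)
println(scaled)
```

Note that this divides by zero when all values are equal, so a constant column would need separate handling.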

All in all we end up with 170 features:

In [337]:
size(x)

(110527, 170)

We can now add the final column: the labels. Verify that the last column corresponds to the label:

In [338]:
x[:, :label] = rawdata[:, :No_show];

In [339]:
x[1:2, :]

Unnamed: 0,JARDIM DA PENHA,MATA DA PRAIA,PONTAL DE CAMBURI,REPÚBLICA,GOIABEIRAS,ANDORINHAS,CONQUISTA,NOVA PALESTINA,DA PENHA,TABUAZEIRO,BENTO FERREIRA,SÃO PEDRO,SANTA MARTHA,SÃO CRISTÓVÃO,MARUÍPE,GRANDE VITÓRIA,SÃO BENEDITO,ILHA DAS CAIEIRAS,SANTO ANDRÉ,SOLON BORGES,BONFIM,JARDIM CAMBURI,MARIA ORTIZ,JABOUR,ANTÔNIO HONÓRIO,RESISTÊNCIA,ILHA DE SANTA MARIA,JUCUTUQUARA,MONTE BELO,MÁRIO CYPRESTE,SANTO ANTÔNIO,BELA VISTA,PRAIA DO SUÁ,SANTA HELENA,ITARARÉ,INHANGUETÁ,UNIVERSITÁRIO,SÃO JOSÉ,REDENÇÃO,SANTA CLARA,CENTRO,PARQUE MOSCOSO,DO MOSCOSO,SANTOS DUMONT,CARATOÍRA,ARIOVALDO FAVALESSA,ILHA DO FRADE,GURIGICA,JOANA D´ARC,CONSOLAÇÃO,PRAIA DO CANTO,BOA VISTA,MORADA DE CAMBURI,SANTA LUÍZA,SANTA LÚCIA,BARRO VERMELHO,ESTRELINHA,FORTE SÃO JOÃO,FONTE GRANDE,ENSEADA DO SUÁ,SANTOS REIS,PIEDADE,JESUS DE NAZARETH,SANTA TEREZA,CRUZAMENTO,ILHA DO PRÍNCIPE,ROMÃO,COMDUSA,SANTA CECÍLIA,VILA RUBIM,DE LOURDES,DO QUADRO,DO CABRAL,HORTO,SEGURANÇA DO LAR,ILHA DO BOI,FRADINHOS,NAZARETH,AEROPORTO,ILHAS OCEÂNICAS DE TRINDADE,PARQUE INDUSTRIAL,sMay,aMay,sDec,sApr,aApr,sFeb,sMar,sJune,aJune,sJan,sNov,sdate_1,adate_1,sdate_2,adate_2,sdate_3,adate_3,sdate_4,adate_4,sdate_5,adate_5,sdate_6,adate_6,sdate_7,adate_7,sdate_8,adate_8,sdate_9,adate_9,sdate_10,adate_10,sdate_11,adate_11,sdate_12,adate_12,sdate_13,adate_13,sdate_14,adate_14,sdate_15,sdate_16,adate_16,sdate_17,adate_17,sdate_18,adate_18,sdate_19,adate_19,sdate_20,adate_20,sdate_21,sdate_22,sdate_23,sdate_24,adate_24,sdate_25,adate_25,sdate_26,sdate_27,sdate_28,sdate_29,adate_29,sdate_30,adate_30,sdate_31,adate_31,shour6,shour7,shour8,shour9,shour10,shour11,shour12,shour13,shour14,shour15,shour16,shour17,shour18,shour19,shour20,shour21,age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,label
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.5431034482758621,0,1,0,0,0,0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.4913793103448275,0,0,0,0,0,0,0.0


Now, let us split the data into training and testing sets and use Knet's [minibatch](http://denizyuret.github.io/Knet.jl/latest/reference.html#Knet.minibatch) function as a data provider:

In [340]:
atype = KnetArray{Float32}
test  = 0.9;
n     = round(Int, (1 - test) * size(x, 1));
r     = randperm(size(x, 1));

In [341]:
xtrn, ytrn = map(atype, [Array(x[r[1:n],     1:end-1])', Array(x[r[1:n],     end:end])']);
xtst, ytst = map(atype, [Array(x[r[n+1:end], 1:end-1])', Array(x[r[n+1:end], end:end])']);
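The split logic is easier to follow on a tiny example. The sketch below targets Julia 0.7 or later, where `randperm` lives in the `Random` standard library (in Julia 0.6 it is available without the import). The sizes and variable names are ours.

```julia
using Random   # needed for randperm and seed! on Julia 0.7 and later

Random.seed!(1)                  # fixed seed so the split is reproducible
nsamples      = 10               # hypothetical dataset size
test_fraction = 0.9              # same role as `test` in the cell above

ntrain = round(Int, (1 - test_fraction) * nsamples)
perm   = randperm(nsamples)      # random ordering of the sample indices

train_idx = perm[1:ntrain]       # first ntrain shuffled indices go to training
test_idx  = perm[ntrain+1:end]   # the remaining indices go to testing

println("train: ", train_idx, "  test: ", test_idx)
```

Because both index sets are slices of one permutation, every sample lands in exactly one of the two sets.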

Again, let us perform a sanity check and ensure our training and test data are drawn from the same distribution:

In [342]:
sum(ytrn) / length(ytrn), sum(ytst) / length(ytst)

(0.20229802f0, 0.20189196f0)

In [343]:
size(xtrn), size(xtst)

((170, 11053), (170, 99474))

We're ready to train!

In [344]:
dtrn = minibatch(xtrn, ytrn, 50; shuffle=true);

In [347]:
w = map(atype, Any[randn(1, size(xtrn, 1)), zeros(Float32,1,1) ]);
Accuracy(w, xtst, ytst)

0.3773046f0

In [348]:
w, Loss = train(w, dtrn; epochs=30, lr=1e-2);

In [354]:
scatter(Loss, size=(400,200))

In [351]:
Accuracy(w, xtst, ytst)

0.79141283f0

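Not bad, considering that we're only using 10% of the data for training! Keep in mind, though, that only about 20% of the appointments are no-shows, so a naive classifier that always predicts *show* would already reach roughly 80% accuracy.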

### Exercises

* Plot the accuracy on the medical appointment dataset as a function of `test`, where `test` is the fraction of the data used for testing. That is, a value of `test=0.1` implies 10% of the data is used for testing. What does this plot say about the training process and our algorithm? 

## Next:
[Softmax regression](softmax-regression.ipynb)

For whinges or inquiries, [open an issue on GitHub](https://github.com/moralesq/Knet-the-Julia-dope)