In [None]:
using DataFrames
using FreqTables
using Plots
using StatPlots
using RDatasets
using Distributions
using DecisionTree

plotly()

# IL027 Core Lecture 3 part 2 - Data Analysis

### James Kermode

### School of Engineering

## Overview

- Reading datasets
- Visualisation
- Clustering and classification
- Missing data
- Feature engineering

# Iris Dataset

## Loading Data and Initial Exploration

We start with the classic [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), from a 1936 paper by Ronald Fisher. We can load this from the `RDatasets` package.

In [None]:
iris = dataset("datasets", "iris")
head(iris) # print the first few rows

### Size and Shape of Data

The `DataFrames` package provides a number of functions to describe the data.

In [None]:
@show size(iris)
@show nrow(iris)
@show ncol(iris)
@show names(iris)
@show eltypes(iris);

In [None]:
describe(iris[:Species])

We see there are three possible values for the species in this dataset. The `levels()` function identifies what these are:

In [None]:
species = levels(iris[:Species])

## Visualisation

### Scatter Plots

We start with a 2D scatter plot to show the relationship between two variables, e.g. sepal width and sepal length.

In [None]:
scatter(iris[:SepalWidth], iris[:SepalLength], group=iris[:Species], 
        markershape=:xcross)

There is a special syntax for plotting dataframes which saves a bit of typing, e.g. to look now at petal width vs. petal length. This uses the `@df` macro (similar to `@show` that we saw earlier).

In [None]:
@df iris scatter(:PetalWidth, :PetalLength, group=:Species,
                 markershape=:xcross)

### Histogram

Histograms are useful for visualising the distribution of individual variables.

In [None]:
p1 = @df iris histogram(:PetalLength, bins=30, label="PetalLength", c=1)
p2 = @df iris histogram(:PetalWidth, bins=30, label="PetalWidth", c=2)

plot(p1, p2)

### Marginal histograms

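A marginal histogram plot shows the joint distribution of two variables alongside the distribution of each variable on its own, which allows the correlation between them to be assessed.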

In [None]:
@df iris marginalhist(:PetalLength, :PetalWidth, bins=30)

### Correlation plot

A correlation plot combines histograms for each variable (diagonal) with two-dimensional histograms (above diagonal) and scatter plots for each pair (below diagonal).

**Lecture Question** Which variables would you expect to be correlated with one another? Does this match what you see here?

In [None]:
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth], bins=20, grid=true)

## Clustering and Classification

Now we've played around with our data a little, let's try to do some more detailed analysis. We would like to learn the relationship between the four variables and the species of iris. 

We can do this using clustering (*unsupervised learning*, i.e. only the features without labels are used) or classification (*supervised learning*, i.e. we provide the labels for a training set).

**K-means clustering** is a classic method for clustering that produces a fixed number $K$ of clusters, based on solving the optimisation problem

$$
\underset{\boldsymbol{\mu},\, z}{\mathrm{minimize}} \ \sum_{i=1}^{N} \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2
$$

where $N$ is the number of data points, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ indicates the cluster for $\mathbf{x}_i$. The implementation is fairly straightforward; here is an unoptimised version. See the `Clustering` package for faster code.

In [None]:
function kmeans(data, K; means=nothing, update_means=true)
    N, M = size(data)
    if means === nothing
        # Initialise centres randomly within the range of the data (K×M matrix)
        means = hcat([rand(Uniform(dmin, dmax), K) for (dmin, dmax) 
                        in zip(minimum(data,1), maximum(data,1)) ]...)
    end
    assign, oldassign = zeros(Int, N), zeros(Int, N)
    while true
        for n in 1:N  # E-step - update the assignment
            d = data[n, :] .- means' # offset of point n from each centre (M×K)
            dmin, kmin = findmin([norm(d[:, i]) for i=1:K])
            assign[n] = kmin # assign point to closest centre
        end
        oldassign == assign && break # if nothing changed, we're done
        oldassign = copy(assign)

        if update_means
            means[:] = 0.0 # M-step - update the centres
            for k in 1:K
                # an empty cluster keeps its centre at zero
                any(assign .== k) && (means[k,:] = mean(data[assign .== k, :], 1))
            end
        end
    end
    return (assign, means)
end
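
As a check on the optimisation problem stated above, the objective can be evaluated directly for any set of centres and assignments. The following is a minimal sketch using only Base Julia; the helper name `kmeans_objective` and the toy data are assumptions made for illustration, and the array layout (one row per point in `data`, one row per centre in `means`) is chosen to match the `kmeans` function above.

```julia
# Sum of squared distances from each point to its assigned centre.
# data is N×M (one row per point), means is K×M (one row per centre),
# assign[n] is the cluster index of row n.
function kmeans_objective(data, means, assign)
    total = 0.0
    for n in 1:size(data, 1)
        k = assign[n]
        for m in 1:size(data, 2)
            total += (data[n, m] - means[k, m])^2
        end
    end
    return total
end

# Toy example (made-up numbers): two well-separated clusters
toy_data   = [0.0 0.0; 0.0 2.0; 10.0 0.0; 10.0 2.0]
toy_means  = [0.0 1.0; 10.0 1.0]
toy_assign = [1, 1, 2, 2]

J_good = kmeans_objective(toy_data, toy_means, toy_assign)    # every point is 1 unit from its centre
J_bad  = kmeans_objective(toy_data, toy_means, [2, 1, 2, 2])  # first point sent to the far centre
```

Evaluating this after each pass of the loop is one way to confirm that the value never increases from one iteration to the next.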

### Applying K-means to the Iris Dataset

In [None]:
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, :Species])

assign, means = kmeans(features, 3); # K=3 clusters

We can compare the true assignment with the one we get from K-means using scatter plots

In [None]:
cols = [1, 2]
p1 = scatter(iris[:,cols[1]], iris[:,cols[2]], 
             group=labels, marker=:xcross, ms=3, title="True Assignment",
             xlabel=names(iris)[cols[1]], ylabel=names(iris)[cols[2]])
p2 = scatter(iris[:,cols[1]], iris[:,cols[2]], group=assign, 
             marker=:xcross, ms=3, title="K-means",
             xlabel=names(iris)[cols[1]], ylabel=names(iris)[cols[2]])
scatter!(means[:,cols[1]], means[:,cols[2]], ms=5, 
         label="Centers", color=[:blue, :red, :green])
plot(p1, p2, layout=(1,2))

### Confusion matrix

The confusion matrix counts the results for each pair of classes, with true labels on rows and predictions on columns. This means we can see at a glance how many assignments are correct (diagonal entries) and how many are wrong (off-diagonal). Because the assignments are based on distance, $K$-means only works well for roughly spherical clusters.

We're using the `confusion_matrix` function from `DecisionTree`, but it would be easy to code up by hand. Since the clustering was unsupervised, we first have to match up the labels and assignments by hand:

In [None]:
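# NB: cluster numbering depends on the random initialisation in kmeans(),
# so check this mapping against the plots above and adjust it after each run
function assign2predict(g)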
    if g == 3
        return "setosa"
    elseif g == 2
        return "versicolor"
    elseif g == 1
        return "virginica"
    end
end
predictions = assign2predict.(assign)
cm = confusion_matrix(labels, predictions)
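
To illustrate how a confusion matrix could be coded up by hand, here is a minimal sketch in Base Julia. The function name `confusion_by_hand` and the toy labels are assumptions for illustration only; the orientation (rows for true classes, columns for predicted classes) is a choice made in this sketch.

```julia
# Count how often each (true, predicted) pair of classes occurs.
# Orientation chosen here: rows = true class, columns = predicted class.
function confusion_by_hand(actual, predicted)
    classes = sort(unique(vcat(actual, predicted)))
    index = Dict(c => i for (i, c) in enumerate(classes))
    cm = zeros(Int, length(classes), length(classes))
    for (a, p) in zip(actual, predicted)
        cm[index[a], index[p]] += 1
    end
    return classes, cm
end

# Toy labels (made up for illustration)
toy_actual    = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
toy_predicted = ["setosa", "setosa", "virginica",  "virginica", "versicolor"]
toy_classes, toy_cm = confusion_by_hand(toy_actual, toy_predicted)
```

Each row of `toy_cm` sums to the number of samples whose true class is the corresponding entry of `toy_classes`.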

**Lecture Question** Write an expression to compute the accuracy, which is given by the ratio of the number of correct predictions to the total number of predictions. Compare it to the answer given by the `confusion_matrix()` function.

## Decision Tree Classifier

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/Manual_decision_tree.jpg/220px-Manual_decision_tree.jpg" align="right" width="25%">

This is based on the idea of a tree of decisions, which you may be familiar with from conditional probability.

Here, the target value can take a discrete set of values, so this is called a *classification tree*. Leaves represent labels and branches represent combinations of features that lead to those class labels. Decision trees can also be used for *regression*, that is, where the target variable is continuous.

This is an example of a *supervised learning* approach. We will use the implementation in the `DecisionTree` package.
First we need to split our dataset into ~2/3 training and ~1/3 test sets:

In [None]:
train = rand(nrow(iris)) .< 2/3
test = .!train 
sum(train), sum(test)
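
The random mask above gives training and test sets whose sizes vary from run to run. If fixed sizes and a reproducible split are preferred, one possible alternative is to shuffle the row indices with a seeded random number generator. This is a sketch only: it assumes Julia 1.x, where `shuffle` and `MersenneTwister` live in the `Random` standard library, and the helper name `split_indices` is hypothetical.

```julia
using Random

# Shuffle 1:n and cut the permutation into a training part and a test part.
function split_indices(rng, n, train_fraction)
    perm = shuffle(rng, 1:n)
    ntrain = round(Int, train_fraction * n)
    return perm[1:ntrain], perm[ntrain+1:end]
end

# 150 rows, as in the iris dataset; the seed 42 is arbitrary
train_idx, test_idx = split_indices(MersenneTwister(42), 150, 2/3)
```

The index vectors can be used in the same way as the boolean masks, e.g. `features[train_idx, :]`.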

In [None]:
model = build_tree(labels[train], features[train,:])
print_tree(model, 4)
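
To give a feel for how a single branch of a classification tree might be chosen, the sketch below searches for the threshold on one feature that leaves the two resulting groups as pure as possible. It uses the Gini impurity, which is one common criterion; this is an assumption for illustration and not necessarily what `DecisionTree` does internally. The function names `gini` and `best_threshold` and the toy data are hypothetical.

```julia
# Gini impurity: 0 for a pure group, larger for a more mixed group.
function gini(labels)
    n = length(labels)
    n == 0 && return 0.0
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return 1.0 - sum((c / n)^2 for c in values(counts))
end

# Try each observed value as a threshold and keep the one with the lowest
# size-weighted impurity of the two groups (x < t and x >= t).
function best_threshold(x, labels)
    best_t, best_score = NaN, Inf
    for t in sort(unique(x))
        left  = labels[x .< t]
        right = labels[x .>= t]
        (isempty(left) || isempty(right)) && continue
        score = (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
        if score < best_score
            best_t, best_score = t, score
        end
    end
    return best_t, best_score
end

# Toy data (made-up values, loosely modelled on petal length)
toy_length  = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1, 6.0, 5.9]
toy_species = ["setosa", "setosa", "setosa", "versicolor",
               "versicolor", "versicolor", "virginica", "virginica"]
t, score = best_threshold(toy_length, toy_species)
```

Here the best first split separates the three short-petalled samples from the rest; a full tree would repeat the search within each group and over all features.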

We first check for consistency on the training set:

In [None]:
predictions = apply_tree(model, features[train,:])
cm = confusion_matrix(labels[train], predictions)

Now we apply to the test set:

In [None]:
predictions = apply_tree(model, features[test, :])
cm = confusion_matrix(labels[test], predictions)

### Remark: supervised learning with K-means

We could adapt the K-means algorithm to do supervised learning by initially allowing the means to move when clustering from the training data, and then fixing the means before predicting on the test data. 

In [None]:
assign_train, means = kmeans(features[train,:], 3, means=means, update_means=true) 
assign_test, means = kmeans(features[test,:], 3, means=means, update_means=false)

predict_test = assign2predict.(assign_test)
cm = confusion_matrix(labels[test], predict_test)

K-means is surprisingly good on this dataset for such a simple algorithm. However, note we had to choose the number of clusters and map the clusters to labels by hand.
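
The mapping from clusters to labels could also be automated when labels are available for the training data, for instance by giving each cluster the most common true label among its members. The following is a minimal sketch of that majority-vote idea in Base Julia; the function name `cluster_to_label` and the toy data are assumptions for illustration. Ties are broken arbitrarily, and nothing prevents two clusters from receiving the same label.

```julia
# For each cluster k, pick the most common true label among its members.
function cluster_to_label(assign, labels, K)
    mapping = Dict{Int,eltype(labels)}()
    for k in 1:K
        counts = Dict{eltype(labels),Int}()
        for (a, l) in zip(assign, labels)
            a == k && (counts[l] = get(counts, l, 0) + 1)
        end
        isempty(counts) && continue   # empty cluster: nothing to map
        best_label, best_count = first(counts)
        for (l, c) in counts
            if c > best_count
                best_label, best_count = l, c
            end
        end
        mapping[k] = best_label
    end
    return mapping
end

# Toy example (made-up assignments and labels)
toy_assign = [3, 3, 1, 1, 1, 2, 2]
toy_labels = ["setosa", "setosa", "virginica", "virginica",
              "versicolor", "versicolor", "versicolor"]
mapping = cluster_to_label(toy_assign, toy_labels, 3)
toy_predictions = [mapping[a] for a in toy_assign]
```

Because this uses the true labels, it only makes sense on the training set; the resulting mapping can then be applied unchanged to the test assignments.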

# Titanic Dataset

We now move on to a more challenging dataset, taken from the passenger records for the *RMS Titanic*, which sank on 15 April 1912 with the loss of more than 1500 lives. The task is to use data such as the age, sex and fare paid by a passenger to predict whether they survived or not. As before, we will use a portion of the dataset to train a predictive model, and then assess it using the remaining test data.

In [None]:
titanic = readtable("titanic.csv")
head(titanic)

In [None]:
describe(titanic[:Survived])

Counting values is such a common operation that there's a special function for it, `countmap`. 

`freqtable` from the `FreqTables` package does the same thing, but produces easier to read output.

In [None]:
countmap(titanic[:Survived])

In [None]:
freqtable(titanic, :Survived)

We can also group by multiple values:

In [None]:
freqtable(titanic, :Survived, :Sex)

To make things a bit more readable, let's replace 0 and 1 with the labels `Dead` and `Survived`.

In [None]:
@enum SurvivedType Dead=0 Survived=1
titanic[:Survived] = SurvivedType.(titanic[:Survived]);

In [None]:
freqtable(titanic, :Survived, :Sex)

## Dealing with missing data

If we try to make a table of the `Embarked` column we get an error:

In [None]:
freqtable(titanic, :Embarked)

We can see why by describing this column:

In [None]:
describe(titanic[:Embarked])

There are two missing values, denoted `NA` for "not available". We can neglect these with the `dropna()` function, which returns a copy of the data with missing values removed.

In [None]:
freqtable(dropna(titanic[:Embarked]))

Once we've dropped the missing data, we see there are three different embarkation points: Cherbourg (C), Queenstown (Q), or Southampton (S). Southampton is the most popular (modal value), so let's fill in the missing values with that:

In [None]:
embarked_mode = mode(dropna(titanic[:Embarked]))
titanic[isna.(titanic[:Embarked]), :Embarked] = embarked_mode
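
In later versions of Julia and `DataFrames`, `NA` is replaced by the built-in `missing` value, with `skipmissing` and `ismissing` playing the roles of `dropna` and `isna`. Under that assumption, the same fill-with-the-mode step can be sketched on a plain vector using only Base Julia. The helper name `fill_with_mode` and the toy data are hypothetical.

```julia
# Replace missing entries with the most frequent non-missing value.
function fill_with_mode(values)
    present = collect(skipmissing(values))
    counts = Dict{eltype(present),Int}()
    for v in present
        counts[v] = get(counts, v, 0) + 1
    end
    best = present[1]
    for (v, c) in counts
        c > counts[best] && (best = v)
    end
    return [ismissing(v) ? best : v for v in values]
end

# Toy column (made-up values)
toy_embarked = ["S", "C", missing, "S", "Q", "S", missing]
toy_filled = fill_with_mode(toy_embarked)
```

The function returns a new vector and leaves its input untouched, so the original column is still available for comparison.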

We have a similar problem with `Age`, where 20% of the data are missing. 

**Lecture Question** We will simply delete those rows for now, but can you think of a better solution?

In [None]:
describe(titanic[:Age])

In [None]:
titanic = titanic[.!isna.(titanic[:Age]), :];

In [None]:
describe(titanic[:Age])

## Visualisation

Let's start digging into the data with some pie charts.

In [None]:
male = titanic[titanic[:Sex] .== "male", :]
female = titanic[titanic[:Sex] .== "female", :]
pie(["Dead", "Survived"], freqtable(male, :Survived), title="Male")

### Histogram

As before, we can use histograms to get a better picture of the distribution of a variable.

In [None]:
@df titanic histogram(:Age, xlabel="Age", ylabel="Frequency", bins=20, legend=false)

### Density plots

A density plot is similar to a histogram, but drawn with a line interpolating the bars. This makes it easier to overlay multiple plots, e.g. to compare the age distributions for passengers who did and did not survive.

In [None]:
@df titanic density(:Age, groups=:Survived, lw=3, xlabel="Age", ylabel="Frequency")

In [None]:
@df titanic density(:Age, groups=:Sex, lw=3, xlabel="Age", ylabel="Frequency")
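
Smooth density curves like these are typically obtained by kernel density estimation, where each observation contributes a small bump and the bumps are summed. The sketch below does this by hand with a Gaussian kernel in Base Julia. The bandwidth `h = 5.0`, the function names `gaussian` and `kde`, and the toy ages are assumptions for illustration; the `density` recipe chooses its own bandwidth and uses a faster implementation.

```julia
# Standard normal density, used as the kernel
gaussian(u) = exp(-u^2 / 2) / sqrt(2π)

# Kernel density estimate at each point of xs, with bandwidth h
function kde(samples, xs, h)
    n = length(samples)
    return [sum(gaussian((x - s) / h) for s in samples) / (n * h) for x in xs]
end

# Toy ages (made-up values)
toy_ages = [2.0, 4.0, 22.0, 25.0, 28.0, 30.0, 35.0, 40.0, 54.0, 60.0]
xs = collect(-40.0:0.5:120.0)
dens = kde(toy_ages, xs, 5.0)

# A density should integrate to about one; check with a simple Riemann sum
area = sum(dens) * 0.5
```

A smaller bandwidth follows the data more closely but gives a bumpier curve, while a larger one smooths out features such as the bump at young ages.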

## Feature Engineering

Let's make a new column `Child` for people under 13, based on the bump in the age distributions. This feature could be used as one of the inputs into a decision tree or other predictive model (cf. bonus question in the assignment).

In [None]:
@enum ChildType Child=0 Adult=1
function classify_by_age(x)
  if x < 13
    return Child
  else
    return Adult
  end
end

titanic[:Child] = classify_by_age.(titanic[:Age])

In [None]:
freqtable(titanic, :Child, :Survived)