In [5]:
using DataFrames #NB: we are using pinned version 0.9.1 to match JuliaBox
using FreqTables
using Plots
using StatPlots
using Distributions
using DecisionTree

gr()
#Plots.scalefontsizes(1.5)

Plots.GRBackend()

# IL027 Core Lecture 3 part b - Data Analysis

### James Kermode

### School of Engineering

## Overview

- Reading datasets
- Visualisation
- Clustering and classification
- Missing data
- Feature engineering

# Iris Dataset

## Loading Data and Initial Exploration

We start with the classic [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), from a 1936 paper by Ronald Fisher. We can load this from the `"iris.csv"` file provided.

In [14]:
iris = readtable("iris.csv");

In [15]:
?iris

No documentation found.

`iris` is of type `DataFrames.DataFrame`.

**Summary:**

```
mutable struct DataFrames.DataFrame <: DataFrames.AbstractDataFrame
```

**Fields:**

```
columns  :: Array{T,1} where T
colindex :: DataFrames.Index
```


In [16]:
head(iris) # print the first few rows

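| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |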


### Size and Shape of Data

The `DataFrames` package provides a number of functions to describe the data.

In [17]:
@show size(iris)
@show nrow(iris)
@show ncol(iris)
@show names(iris)
@show eltypes(iris);

size(iris) = (150, 5)
nrow(iris) = 150
ncol(iris) = 5
names(iris) = Symbol[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth, :Species]
eltypes(iris) = Type[Float64, Float64, Float64, Float64, String]


In [18]:
describe(iris[:Species])

Summary Stats:
Length:         150
Type:           String
Number Unique:  3
Number Missing: 0
% Missing:      0.000000


We see there are three possible values for the species in this dataset. The `levels()` function identifies what these are:

In [19]:
species = levels(iris[:Species])

3-element DataArrays.DataArray{String,1}:
 "setosa"    
 "versicolor"
 "virginica" 

## Visualisation

### Scatter Plots

We start with a 2D scatter plot, to show the relationship between two variables, e.g. sepal width and sepal length

In [20]:
scatter(iris[:SepalWidth], iris[:SepalLength], group=iris[:Species])

There is a special syntax for plotting dataframes which saves a bit of typing, e.g. to look now at petal width vs. petal length. This uses the `@df` macro (similar to `@show` that we saw earlier).

In [21]:
@df iris scatter(:PetalWidth, :PetalLength, group=:Species)

### Histogram

Histograms are useful to visualise the distribution of individual variables

In [22]:
p1 = @df iris histogram(:PetalLength, bins=30, label="PetalLength", c=1)
p2 = @df iris histogram(:PetalWidth, bins=30, label="PetalWidth", c=2)

plot(p1, p2)

### Marginal histograms

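A marginal histogram plot shows the joint distribution of two variables together with the histogram of each variable along the margins, so the correlation between them can be assessed (ignore the warning message from the `GR` backend).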

In [23]:
@df iris marginalhist(:PetalLength, :PetalWidth, bins=30)



### Correlation plot

A correlation plot combines histograms for each variable (diagonal) with 2D histograms for each pair (above diagonal) and scatter plots for each pair (below diagonal).

In [24]:
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth], bins=20, grid=true)

## Clustering and Classification

Now we've played around with our data a little, let's try to do some more detailed analysis. We would like to learn the relationship between the four variables and the species of iris. 

We can do this using clustering (*unsupervised learning*, i.e. only the features without labels are used) or classification (*supervised learning*, i.e. we provide the labels for a training set).

**K-means clustering** is a classic method for clustering that produces a fixed number $K$ of clusters, based on solving the optimisation problem

$$
\mathrm{minimize} \ \sum_{i=1}^{N} \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \quad
\mathrm{with\ respect\ to} \ (\boldsymbol{\mu}, z)
$$

where $N$ is the number of data points, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ indicates the cluster for $\mathbf{x}_i$. The implementation is fairly straightforward; here is an unoptimised version (see the `Clustering` package for faster code).

In [25]:
function kmeans(data, K; means=nothing, update_means=true)
    N, M = size(data)  # N data points, M features
    if means === nothing
        # Initialise centers randomly within the range of each feature (K×M matrix)
        means = hcat([rand(Uniform(dmin, dmax), K) for (dmin, dmax)
                        in zip(minimum(data,1), maximum(data,1)) ]...)
    end
    assign = zeros(Int, N)
    oldassign = zeros(Int, N)
    while true
        for n in 1:N  # E-step - update the assignment
            d = data[n, :] .- means' # displacement of point n from each mean (M×K)
            dmin, kmin = findmin([norm(d[:, i]) for i=1:K])
            assign[n] = kmin # assign point to closest centre
        end
        oldassign == assign && break # if nothing changed, we're done
        oldassign = copy(assign)

        if update_means
            means[:] = 0.0 # M-step - update the centers
            for k in 1:K
                # clusters with no members keep a center at the origin
                any(assign .== k) && (means[k,:] = mean(data[assign .== k, :], 1))
            end
        end
    end
    return (assign, means)
end

kmeans (generic function with 1 method)

### Applying K-means to the Iris Dataset

We create features and labels from the first four and last columns of the dataset, respectively (ignore deprecation warning)

In [28]:
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, :Species])

assign, means = kmeans(features, 3); # K=3 clusters


We can compare the true assignment with the one we get from K-means using scatter plots

In [29]:
cols = [1, 3]
p1 = scatter(iris[:,cols[1]], iris[:,cols[2]], 
             group=labels, title="True Assignment", marker=:o,
             xlabel=names(iris)[cols[1]], ylabel=names(iris)[cols[2]])
p2 = scatter(iris[:,cols[1]], iris[:,cols[2]], group=assign, 
             title="K-means",
             xlabel=names(iris)[cols[1]], ylabel=names(iris)[cols[2]])
scatter!(means[:,cols[1]], means[:,cols[2]], ms=10, marker=:s,
         label="Centers", color=[:blue, :red, :green])
plot(p1, p2, layout=(1,2))

### Confusion matrix

The confusion matrix gives the number of results in each class, with true labels on rows and predictions on columns. This means we can see at a glance how many assignments are correct (diagonal entries) and how many are wrong (off-diagonal).

We're using the `confusion_matrix` function from `DecisionTree`, but it would be easy to code up by hand. Since the clustering was unsupervised, we first have to match up the labels and assignments by hand:

In [31]:
function assign2predict(g)
    if g == 2
        return "setosa"
    elseif g == 3
        return "versicolor"
    elseif g == 1
        return "virginica"
    end
end

predictions = assign2predict.(assign)
cm = confusion_matrix(labels, predictions)

Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 50   0   0
  0  47   3
  0  14  36
Accuracy: 0.8866666666666667
Kappa:    0.8300000000000001

Accuracy is estimated from the number of correct predictions divided by the total number of predictions. Because K-means assigns points based on distance to the cluster centers, it only works well for roughly spherical clusters.

In [32]:
accuracy = sum(diag(cm.matrix))/sum(cm.matrix) * 100.
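
As a rough illustration of coding the confusion matrix up by hand, the sketch below counts (true, predicted) pairs and derives the accuracy and kappa from the counts. The function names `confusion_counts`, `accuracy_from_counts` and `kappa_from_counts` are hypothetical, not part of any package, and the kappa formula assumes the reported `Kappa` is Cohen's kappa, which is consistent with the numbers in the output above. Only Base functions are used.

```julia
# Count how often each (true, predicted) pair occurs.
# Rows index the true class and columns the predicted class.
function confusion_counts(truth, predicted, classes)
    index = Dict(c => i for (i, c) in enumerate(classes))
    counts = zeros(Int, length(classes), length(classes))
    for (t, p) in zip(truth, predicted)
        counts[index[t], index[p]] += 1
    end
    return counts
end

# Fraction of all predictions that lie on the diagonal.
function accuracy_from_counts(counts)
    n = size(counts, 1)
    return sum(counts[i, i] for i in 1:n) / sum(counts)
end

# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from the row and column totals.
function kappa_from_counts(counts)
    n = size(counts, 1)
    total = sum(counts)
    observed = accuracy_from_counts(counts)
    expected = sum(sum(counts[i, :]) * sum(counts[:, i]) for i in 1:n) / total^2
    return (observed - expected) / (1 - expected)
end

# The K-means confusion matrix from the output above
km_counts = [50 0 0; 0 47 3; 0 14 36]
acc = accuracy_from_counts(km_counts)
kap = kappa_from_counts(km_counts)
```

With the matrix above, the expected chance agreement is 1/3 because each true class has 50 members, so kappa rescales the accuracy from the range [1/3, 1] to [0, 1].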

## Decision Tree Classifier

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/Manual_decision_tree.jpg/220px-Manual_decision_tree.jpg" align="right" width="25%">

This is based on the idea of a tree of decisions, which you may be familiar with from conditional probability.

Here, the target value can take a discrete set of values, so this is called a *classification tree*. Leaves represent labels and branches represent combinations of features that lead to those class labels. Decision trees can also be used for *regression*, that is where the target variable is continuous.

This is an example of a *supervised learning* approach. We will use the implementation in the `DecisionTree` package.
First we need to split our dataset into ~2/3 training and ~1/3 test sets:

In [33]:
train = rand(nrow(iris)) .< 2/3
test = .!train 
sum(train), sum(test)

(94, 56)
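
The random mask above gives only approximately a 2/3 to 1/3 split (94 and 56 rows here). If an exact split is preferred, one option is to shuffle the row indices and cut them at a fixed position. This is a minimal sketch using only Base functions; `split_indices` is a hypothetical helper name.

```julia
# Split the indices 1:n at random into a training part containing
# round(train_fraction * n) indices and a test part with the rest.
function split_indices(n, train_fraction)
    order = sortperm(rand(n))   # sorting random numbers yields a random permutation of 1:n
    ntrain = round(Int, train_fraction * n)
    return order[1:ntrain], order[ntrain+1:end]
end

train_idx, test_idx = split_indices(150, 2/3)
```

The index vectors can be used in the same way as the Boolean masks, e.g. `features[train_idx, :]` and `labels[train_idx]`.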

In [34]:
model = build_tree(labels[train], features[train,:])
print_tree(model, 4)

Feature 3, Threshold 3.3
L-> setosa : 32/32
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 26/26
        R-> Feature 2, Threshold 2.7
            L-> virginica : 2/2
            R-> versicolor : 1/1
    R-> virginica : 33/33
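
Each line of the tree is a threshold test on one feature. To give a feel for how such a threshold might be chosen, the sketch below scans candidate thresholds for a single feature and keeps the one that minimises the size-weighted Gini impurity of the two groups. This is one common criterion and is not necessarily what `DecisionTree` uses internally; `gini` and `best_threshold` are hypothetical names, and the six data points are made up for illustration.

```julia
# Gini impurity of a collection of labels: 1 - sum of squared class fractions.
function gini(labels)
    isempty(labels) && return 0.0
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    n = length(labels)
    return 1.0 - sum((c / n)^2 for c in values(counts))
end

# Try each observed value of one feature as a threshold and keep the one
# with the lowest size-weighted impurity of the two resulting groups.
function best_threshold(x, labels)
    best_t, best_score = x[1], Inf
    n = length(x)
    for t in sort(unique(x))
        left = labels[x .< t]     # points that go down the left branch
        right = labels[x .>= t]   # points that go down the right branch
        score = (length(left) * gini(left) + length(right) * gini(right)) / n
        if score < best_score
            best_t, best_score = t, score
        end
    end
    return best_t, best_score
end

# Illustrative petal lengths and species (made-up sample, not the full dataset)
x = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]
t, score = best_threshold(x, y)
```

On this small sample the chosen threshold separates all the setosa points from the rest, which mirrors the first split in the tree printed above.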


We first check for consistency on the training set:

In [37]:
predictions = apply_tree(model, features[train,:])
cm = confusion_matrix([String(L) for L in labels[train]], predictions)

Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 32   0   0
  0  27   0
  0   0  35
Accuracy: 1.0
Kappa:    1.0

Now we apply to the test set:

In [38]:
predictions = apply_tree(model, features[test, :])
cm = confusion_matrix([String(L) for L in labels[test]], predictions)
@show cm
cm.matrix

cm = Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 18   0   0
  1  21   1
  0   3  12
Accuracy: 0.9107142857142857
Kappa:    0.8632144601856375

3×3 Array{Int64,2}:
 18   0   0
  1  21   1
  0   3  12

### Remark: supervised learning with K-means

We could adapt the K-means algorithm to do supervised learning by initially allowing the means to move when clustering from the training data, and then fixing the means before predicting on the test data. 

In [39]:
assign_train, means = kmeans(features[train,:], 3, means=means, update_means=true) 
assign_test, means = kmeans(features[test,:], 3, means=means, update_means=false)
predict_test = assign2predict.(assign_test)
# compare against the K-means predictions (predict_test), not the decision tree's `predictions`
cm = confusion_matrix([String(L) for L in labels[test]], predict_test)
@show cm
cm.matrix

K-means is surprisingly good on this dataset for such a simple algorithm. However, note we had to choose the number of clusters and map the clusters to labels by hand.
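
When labels are available for the training set, the mapping from clusters to labels need not be done by hand: each cluster can be given the most common true label among its members. The sketch below is one way to do this with Base functions only; `majority_labels` is a hypothetical helper, and the small `assign_demo` and `labels_demo` vectors are made up for illustration. Note that using the labels in this way means the procedure is no longer purely unsupervised.

```julia
# Map each cluster index to the most common true label among its members.
# Empty clusters are left out of the mapping; ties are broken arbitrarily.
function majority_labels(assign, labels, K)
    mapping = Dict{Int,eltype(labels)}()
    for k in 1:K
        members = labels[assign .== k]
        isempty(members) && continue
        counts = Dict{eltype(labels),Int}()
        for l in members
            counts[l] = get(counts, l, 0) + 1
        end
        best, bestcount = first(members), 0
        for (l, c) in counts
            if c > bestcount
                best, bestcount = l, c
            end
        end
        mapping[k] = best
    end
    return mapping
end

# Made-up example: cluster 2 contains one point whose label disagrees with the majority
assign_demo = [1, 1, 2, 2, 2, 3]
labels_demo = ["a", "a", "b", "b", "a", "c"]
mapping = majority_labels(assign_demo, labels_demo, 3)
predicted_demo = [mapping[k] for k in assign_demo]
```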

# Titanic Dataset

We now move on to a more challenging dataset, taken from the passenger records for the *RMS Titanic*, which sank on the 15th April 1912 with the loss of more than 1500 lives. The task is to use data such as the age, sex and fare paid by a passenger to predict whether they survived or not. As before, we will use a portion of the dataset to train a predictive model, and then assess it using the remaining test data.

In [40]:
titanic = readtable("titanic.csv")
head(titanic)

| Row | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |


In [41]:
describe(titanic[:Survived])

Summary Stats:
Mean:           0.383838
Minimum:        0.000000
1st Quartile:   0.000000
Median:         0.000000
3rd Quartile:   1.000000
Maximum:        1.000000
Length:         891
Type:           Int64
Number Missing: 0
% Missing:      0.000000


To make things a bit more readable, let's first replace 0 and 1 with labels `Dead` and `Survived`

In [42]:
@enum SurvivedType Dead=0 Survived=1
titanic[:Survived] = SurvivedType.(titanic[:Survived]);

The `freqtable` function from the `FreqTables` package can now be used to print a nicely formatted table counting how many values are in each category

In [43]:
freqtable(titanic, :Survived)

2-element Named Array{Int64,1}
Survived  │ 
──────────┼────
Dead      │ 549
Survived  │ 342

We can also group by multiple values:

In [44]:
freqtable(titanic, :Survived, :Sex)

2×2 Named Array{Int64,2}
Survived ╲ Sex │ female    male
───────────────┼───────────────
Dead           │     81     468
Survived       │    233     109
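
The counts in this table can be turned into survival rates by dividing the number of survivors by the column total. The short sketch below copies the four counts from the table into a plain matrix (the variable names are illustrative only) and uses Base functions alone.

```julia
# Counts copied from the table above: rows are Dead, Survived; columns are female, male
sex_counts = [81 468; 233 109]

# Survival rate for each column = survivors / (dead + survivors)
survival_rate = [sex_counts[2, j] / (sex_counts[1, j] + sex_counts[2, j]) for j in 1:2]
```

This gives roughly 74% for female passengers and 19% for male passengers, which suggests that `Sex` is a strong predictor of survival.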

## Dealing with missing data

If we try to make a table of the `Embarked` column we see there are missing values:

In [45]:
freqtable(titanic, :Embarked)

LoadError: MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.

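We can also see this by describing the column. (The cells in this part of the notebook were run out of order, so the output below was produced after the missing values had been filled in and reports none missing.)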

In [54]:
describe(titanic[:Embarked])

Summary Stats:
Length:         891
Type:           String
Number Unique:  3
Number Missing: 0
% Missing:      0.000000


There are two missing values. We can neglect these with the `dropna()` function

In [53]:
freqtable(dropna(titanic[:Embarked]))

3-element Named Array{Int64,1}
Dim1  │ 
──────┼────
C     │ 168
Q     │  77
S     │ 646

Once we've dropped the missing data, we see there are three different embarkation points: Cherbourg (C), Queenstown (Q), or Southampton (S). Southampton is the most popular (modal value), so let's fill in the missing values with that:

In [52]:
embarked_mode = mode(dropna(titanic[:Embarked]))
titanic[isna.(titanic[:, :Embarked]), :Embarked] = embarked_mode

"S"

We have a similar problem with `Age`, where 20% of the data are missing. We can simply delete those rows for now, but perhaps you can think of a better solution.

In [55]:
describe(titanic[:Age])

Summary Stats:
Mean:           29.699118
Minimum:        0.420000
1st Quartile:   20.125000
Median:         28.000000
3rd Quartile:   38.000000
Maximum:        80.000000
Length:         714
Type:           Float64
Number Missing: 177
% Missing:      19.865320


In [56]:
titanic = titanic[.!isna.(titanic[:Age]), :];

In [57]:
describe(titanic[:Age])

Summary Stats:
Mean:           29.699118
Minimum:        0.420000
1st Quartile:   20.125000
Median:         28.000000
3rd Quartile:   38.000000
Maximum:        80.000000
Length:         714
Type:           Float64
Number Missing: 0
% Missing:      0.000000
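
One possible alternative to deleting the rows with a missing `Age` is to fill the gaps with the median of the observed ages. The sketch below is written for Julia 1.x, where missing values are represented by `missing` rather than the `NA` of `DataArrays` used in this notebook, so it is an illustration of the idea rather than a drop-in replacement. `impute_median` is a hypothetical helper and the `ages` vector is made up.

```julia
using Statistics  # standard library in Julia 1.x; provides `median`

# Replace each `missing` entry with the median of the observed values.
function impute_median(values)
    fill_value = median(skipmissing(values))
    return [ismissing(v) ? fill_value : v for v in values]
end

# Made-up ages with two gaps
ages = [22.0, 38.0, missing, 35.0, missing, 54.0]
filled = impute_median(ages)
```

Median imputation keeps all the rows but narrows the spread of the age distribution, so whether it is an improvement depends on what the column is used for afterwards.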


## Visualisation

Let's start digging into the data with some pie charts

In [58]:
male = titanic[titanic[:Sex] .== "male", :]
female = titanic[titanic[:Sex] .== "female", :]

pie(["Dead", "Survived"], freqtable(male, :Survived), layout=(1, 2), title="Male")
pie!(["Dead", "Survived"], freqtable(female, :Survived), layout=(1, 2), subplot=2, title="Female")

### Histogram

As before, we can use histograms to get a better picture of the distribution of a variable

In [59]:
@df titanic histogram(:Age, xlabel="Age", ylabel="Frequency", bins=20, legend=false)

### Density plots

A density plot is similar to a histogram, but shows a smoothed estimate of the distribution drawn as a continuous curve rather than as bars. This makes it easier to overlay multiple plots, e.g. to compare the age distributions for passengers who did and did not survive.

In [60]:
@df titanic density(:Age, groups=:Survived, lw=3)

In [61]:
@df titanic density(:Age, groups=:Sex, lw=3)

## Feature Engineering

Let's make a new column `Child` for people under 13, based on the bump in the age distributions:

In [62]:
@enum ChildType Child=0 Adult=1

function classify_by_age(x)
  if x < 13
    Child
  else
    Adult
  end
end

titanic[:Child] = classify_by_age.(titanic[:Age]);

In [63]:
freqtable(titanic, :Child, :Survived)

2×2 Named Array{Int64,2}
Child ╲ Survived │     Dead  Survived
─────────────────┼───────────────────
Child            │       29        40
Adult            │      395       250

This feature could be used as one of the inputs into a decision tree or other predictive model (cf. bonus question in the assignment!)