# Kaggle Titanic Dataset Using Julia 0.4.5

General Notes:

1) Plotting libraries have not reached the same level of maturity as matplotlib/seaborn/plotly.

2) Cleaning data is more difficult because aggregate functions are more limited; applying a function row-wise requires a loop.

3) Documentation is a bit harder to find.

In [98]:
using DataFrames;
using StatPlots;
using Plots;

In [134]:
pyplot()
# Plots uses pyplot by default, but the backend can be switched to one of several others

Plots.PyPlotBackend()

In [3]:
train = readtable("train.csv")
head(train)

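| Row | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S |
| 4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NA | S |
| 6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q |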


A quick way to see the number of NA values in every column:

In [4]:
showcols(train)

891×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype     │ Missing │
├───────┼─────────────┼────────────┼─────────┤
│ 1     │ PassengerId │ Int64      │ 0       │
│ 2     │ Survived    │ Int64      │ 0       │
│ 3     │ Pclass      │ Int64      │ 0       │
│ 4     │ Name        │ UTF8String │ 0       │
│ 5     │ Sex         │ UTF8String │ 0       │
│ 6     │ Age         │ Float64    │ 177     │
│ 7     │ SibSp       │ Int64      │ 0       │
│ 8     │ Parch       │ Int64      │ 0       │
│ 9     │ Ticket      │ UTF8String │ 0       │
│ 10    │ Fare        │ Float64    │ 0       │
│ 11    │ Cabin       │ UTF8String │ 687     │
│ 12    │ Embarked    │ UTF8String │ 2       │

0 - Did not survive

1 - Survived

In [137]:
survival_count = by(train, :Survived, nrow)
bar(survival_count[:Survived],survival_count[:x1], xticks=[0,1], bar_width=0.5, legend=:none)
# Alternate method: plot(survival_count[:Survived],survival_count[:x1], line=:bar)

In [6]:
countmap(train[:Survived])

Dict{Union{DataArrays.NAtype,Int64},Int64} with 2 entries:
  0 => 549
  1 => 342

There are 233 female survivors but only 109 male survivors!

In [7]:
gender_survive = by(train, [:Survived,:Sex], nrow)

| Row | Survived | Sex | x1 |
|-----|----------|-----|----|
| 1 | 0 | female | 81 |
| 2 | 0 | male | 468 |
| 3 | 1 | female | 233 |
| 4 | 1 | male | 109 |
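
Raw survivor counts are easier to compare once they are turned into rates per group. The following is a minimal sketch in plain Julia (no packages), with the counts copied by hand from the table above; the variable names `died`, `survived`, and `rates` are illustrative and not part of the notebook.

```julia
# Counts copied from the by(train, [:Survived, :Sex], nrow) output above.
died     = Dict("female" => 81,  "male" => 468)
survived = Dict("female" => 233, "male" => 109)

rates = Dict()
for sex in ["female", "male"]
    total = died[sex] + survived[sex]      # everyone of this sex in train.csv
    rates[sex] = survived[sex] / total     # fraction of that group that survived
    println(sex, ": ", survived[sex], " of ", total, " survived")
end
```

That works out to 233 of 314 women (about 74%) and 109 of 577 men (about 19%).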


Many 3rd class passengers did not survive. Was this simply because there were more 3rd class passengers, or did they have a statistically significant higher mortality rate?

In [8]:
class_survive = by(train, [:Survived, :Pclass], nrow)

| Row | Survived | Pclass | x1 |
|-----|----------|--------|----|
| 1 | 0 | 1 | 80 |
| 2 | 0 | 2 | 97 |
| 3 | 0 | 3 | 372 |
| 4 | 1 | 1 | 136 |
| 5 | 1 | 2 | 87 |
| 6 | 1 | 3 | 119 |
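
One way to approach the question above is to compare survival rates per class and then compute a chi-square statistic for independence between class and survival. This is a rough sketch in plain Julia, not something the notebook itself runs. The counts are copied by hand from the table above, and the helper name `chi_square` is made up for illustration.

```julia
# Counts copied from the by(train, [:Survived, :Pclass], nrow) output above,
# ordered as 1st, 2nd, 3rd class.
died     = [80, 97, 372]
survived = [136, 87, 119]

# Survival rate per class: survivors divided by everyone in that class.
rates = Float64[]
for k in 1:3
    push!(rates, survived[k] / (died[k] + survived[k]))
end
println(rates)

# Pearson chi-square statistic for a 2x3 table: the sum of
# (observed - expected)^2 / expected over all six cells.
function chi_square(died, survived)
    class_total = died + survived
    n = sum(class_total)
    stat = 0.0
    for k in 1:length(class_total)
        for (observed, row_total) in ((died[k], sum(died)), (survived[k], sum(survived)))
            expected = row_total * class_total[k] / n   # count expected under independence
            stat += (observed - expected)^2 / expected
        end
    end
    return stat
end

println(chi_square(died, survived))
```

The rates come out to roughly 63%, 47%, and 24% for 1st, 2nd, and 3rd class. The statistic is roughly 103, far above the 5.99 critical value for 2 degrees of freedom at the 5% level, which suggests the higher 3rd class mortality is not just a matter of there being more 3rd class passengers.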


Creating a plot of age distributions on the Titanic

In [9]:
age_dist = by(train, :Age, nrow);
deleterows!(age_dist,find(isna(age_dist[:,symbol("Age")])));

In [129]:
bar(age_dist[:Age],age_dist[:x1])
# Maybe just use KDE plot instead

In [11]:
sibling_spouse = by(train, :SibSp, nrow);
bar(sibling_spouse[:SibSp],sibling_spouse[:x1],bar_width=0.5, xticks=sibling_spouse[:SibSp])
bar!(xlabel="Number of Spouses or Siblings", ylabel="Count", title="Passengers With Given Number of Siblings/Spouses")

In [12]:
plot(train[:Fare], line=:histogram, nbins=40)
plot!(xlabel="Price", ylabel="Frequency")

In [122]:
class_age = by(train, [:Pclass,:Age], nrow);
deleterows!(class_age,find(isna(class_age[:,symbol("Age")])));
boxplot(class_age[:Pclass],class_age[:Age],marker=(0.6,:orange,stroke(2)))
boxplot!(yticks=collect(0:10:100), xlabel="Passenger Class", ylabel="Age")

The boxplot above shows that age differs by class, so compute the mean age for each Pclass with a loop.

In [14]:
pclass_mean = []
for i = 1:3
    push!(pclass_mean, mean(dropna(train[train[:Pclass] .== i, :][:Age])))
end
pclass_mean

3-element Array{Any,1}:
 38.2334
 29.8776
 25.1406

If Age is NA, it is filled in with the mean age of that passenger's Pclass.

In [15]:
#by(train[[:Age,:Pclass]], [:Pclass,:Age], nrow)
for i in 1:3
    train[(isna(train[:Age])) & (train[:Pclass] .== i), :Age] = pclass_mean[i]
end
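
The same fill-with-group-mean idea can be sketched without DataFrames. The example below is plain Julia on made-up toy data, with `NaN` standing in for NA; the function name `fill_with_class_mean!` is hypothetical and the values are not taken from train.csv.

```julia
# Toy data, for illustration only. NaN marks a missing age.
pclass = [1, 1, 2, 2, 3, 3, 3]
age    = [38.0, NaN, 30.0, 28.0, 22.0, NaN, 26.0]

# Replace each missing age with the mean of the known ages in the same class.
# If a class had no known ages at all, its missing entries would stay NaN.
function fill_with_class_mean!(age, pclass)
    for class in unique(pclass)
        total = 0.0
        n = 0
        for i in 1:length(age)
            if pclass[i] == class && !isnan(age[i])
                total += age[i]
                n += 1
            end
        end
        class_mean = total / n
        for i in 1:length(age)
            if pclass[i] == class && isnan(age[i])
                age[i] = class_mean
            end
        end
    end
    return age
end

fill_with_class_mean!(age, pclass)
println(age)
```

After the call, the missing 1st class age becomes 38.0 and the missing 3rd class age becomes 24.0 (the mean of 22.0 and 26.0).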

The NA values in the Age column of the DataFrame are now filled in.

In [16]:
showcols(train)

891×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype     │ Missing │
├───────┼─────────────┼────────────┼─────────┤
│ 1     │ PassengerId │ Int64      │ 0       │
│ 2     │ Survived    │ Int64      │ 0       │
│ 3     │ Pclass      │ Int64      │ 0       │
│ 4     │ Name        │ UTF8String │ 0       │
│ 5     │ Sex         │ UTF8String │ 0       │
│ 6     │ Age         │ Float64    │ 0       │
│ 7     │ SibSp       │ Int64      │ 0       │
│ 8     │ Parch       │ Int64      │ 0       │
│ 9     │ Ticket      │ UTF8String │ 0       │
│ 10    │ Fare        │ Float64    │ 0       │
│ 11    │ Cabin       │ UTF8String │ 687     │
│ 12    │ Embarked    │ UTF8String │ 2       │

The Cabin column is not used in the analysis, so it is dropped. Note that delete! modifies the DataFrame in place, whereas selecting columns returns a copy.

In [17]:
delete!(train, :Cabin);

Remove the two rows that still contain NA values (in the Embarked column) using complete_cases().

In [71]:
clean_train = train[complete_cases(train),:]
showcols(clean_train)

889×11 DataFrames.DataFrame
│ Col # │ Name        │ Eltype     │ Missing │
├───────┼─────────────┼────────────┼─────────┤
│ 1     │ PassengerId │ Int64      │ 0       │
│ 2     │ Survived    │ Int64      │ 0       │
│ 3     │ Pclass      │ Int64      │ 0       │
│ 4     │ Name        │ UTF8String │ 0       │
│ 5     │ Sex         │ UTF8String │ 0       │
│ 6     │ Age         │ Float64    │ 0       │
│ 7     │ SibSp       │ Int64      │ 0       │
│ 8     │ Parch       │ Int64      │ 0       │
│ 9     │ Ticket      │ UTF8String │ 0       │
│ 10    │ Fare        │ Float64    │ 0       │
│ 11    │ Embarked    │ UTF8String │ 0       │

In [131]:
training_data = clean_train[[:Survived, :Pclass,:Sex, :Age, :SibSp, :Parch, :Fare]];
head(training_data)

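| Row | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|-----|----------|--------|-----|-----|-------|-------|------|
| 1 | 1 | 1 | male | 22.0 | 1 | 0 | 7.25 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 5 | 1 | 1 | male | 35.0 | 0 | 0 | 8.05 |
| 6 | 1 | 1 | male | 25.14061971830986 | 0 | 0 | 8.4583 |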


# Logistic Regression using GLM

 * Syntax nearly identical to R
 * Weird bug on variable name after pooling?
 * See libraries in http://juliastats.github.io/

In [20]:
using GLM

INFO: Precompiling module GLM...

Convert categorical variables from DataArray to PooledDataArray so that GLM can build the model.

In [133]:
pool!(training_data, [:Sex])
levels(training_data[:Sex])

2-element Array{UTF8String,1}:
 "female"
 "male"  

In [82]:
model = glm(Survived~Pclass+Age+Sex, training_data, Binomial(), LogitLink())

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Survived ~ 1 + Pclass + Age + Sex

Coefficients:
              Estimate  Std.Error  z value Pr(>|z|)
(Intercept)    4.96863   0.477237  10.4112   <1e-24
Pclass        -1.23628   0.125213 -9.87344   <1e-22
Age          -0.037006 0.00765846 -4.83204    <1e-5
Sex: male     -2.59889    0.18716 -13.8859   <1e-43
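
The coefficient table can be read as a formula for the predicted survival probability: add up the intercept and the weighted features, then apply the inverse of the logit link. Below is a minimal sketch in plain Julia with the coefficients copied by hand from the output above. The function names `logistic` and `survival_probability` are illustrative and not part of GLM.

```julia
# Coefficients copied from the glm output above.
intercept = 4.96863
b_pclass  = -1.23628
b_age     = -0.037006
b_male    = -2.59889    # applies only when Sex is male; female is the baseline

# Inverse of the logit link: maps log-odds to a probability between 0 and 1.
logistic(x) = 1 / (1 + exp(-x))

function survival_probability(pclass, age, is_male)
    log_odds = intercept + b_pclass * pclass + b_age * age + b_male * (is_male ? 1 : 0)
    return logistic(log_odds)
end

println(survival_probability(1, 30.0, false))   # 30-year-old woman, 1st class
println(survival_probability(3, 30.0, true))    # 30-year-old man, 3rd class
```

With these coefficients, the first passenger gets a probability of about 0.93 and the second about 0.08. All three coefficients are negative, so a higher class number, a higher age, and being male each lower the predicted chance of survival.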


In [114]:
test = readtable("test.csv");
pool!(test, [:Sex])
# complete_cases drops every row with an NA in any column, not only in the model's features
test_data = test[complete_cases(test), :];

Predict survival for the rows of test.csv and round each predicted probability to either 0 or 1.

In [118]:
results = predict(model, test_data);
test_data[:Survived] = round(results);
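
Rounding the predicted probabilities amounts to classifying with a cutoff of 0.5. A small sketch of that step in plain Julia, using made-up probabilities rather than the model's actual output; the explicit `>= 0.5` threshold is an assumption chosen for illustration.

```julia
# Made-up probabilities, for illustration only.
probs = [0.93, 0.08, 0.51, 0.49]

# Label a passenger as a survivor (1) when the probability is at least 0.5.
labels = [p >= 0.5 ? 1 : 0 for p in probs]
println(labels)
```

Here the labels are 1, 0, 1, 0. Writing the threshold out makes it easy to try a different cutoff later.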

In [119]:
predicted_gender_survive = by(test_data, [:Survived,:Sex], nrow)

| Row | Survived | Sex | x1 |
|-----|----------|-----|----|
| 1 | 0.0 | male | 33 |
| 2 | 1.0 | female | 44 |
| 3 | 1.0 | male | 10 |


Using the Pclass, Age, and Sex features, the logistic regression model predicts that all 44 females survive but only 10 of the 43 males do, among the 87 test rows that remain after complete_cases().