# Exploratory Data Analysis of Titanic Dataset in Haskell

In [61]:
-- {-# LANGUAGE DataKinds, FlexibleContexts, TemplateHaskell #-}
{-# LANGUAGE DataKinds, OverloadedStrings, FlexibleContexts, FlexibleInstances, MultiParamTypeClasses, QuasiQuotes, TemplateHaskell, TypeOperators, UndecidableInstances #-}
import Frames
import Control.Foldl as L hiding (mapM_, map, length, genericLength)
import Lens.Micro
import Lens.Micro.Extras
import Control.Monad
import Data.Foldable as F
import Frames.CSV (readTableOpt, rowGen, RowGen(..))
import Pipes hiding (Proxy)
import Pipes.Prelude as P hiding (print, take, mapM_, filter, map, zipWith)
import Data.List (genericLength)
import Data.Proxy (Proxy)
import Data.Functor.Const
import Data.Monoid ((<>), First(..))
import Data.Vinyl (Rec, rmap, RecApplicative, rapply)
import Data.Vinyl.Functor (Lift(..))

In Haskell we have `Frames` as a `Pandas` analog. Of course, the way you think about programming in Haskell is quite different and that will become apparent shortly.

First order of business is to handle the data. Enabling TemplateHaskell allows us to use the `tableTypes` function to define our row types and generate lenses for the aspects (columns).

In [9]:
:t tableTypes
tableTypes "Passenger" "data/train.csv"
-- :i Passenger will let you look at what's been defined for our new Passenger type

Now we need to grab the data. We'll do this with `readTable` and `inCoreAoS`, which have the following type signatures.

In [10]:
:t readTable
:t inCoreAoS

In [283]:
rowStream' :: MonadSafe m => Producer Passenger m ()
rowStream' = readTable "data/train.csv"

loadRows' :: IO (Frame Passenger)
loadRows' = inCoreAoS rowStream'

passengers' :: IO [Passenger]
passengers' = F.toList <$> loadRows'

A natural first question is "How big is the dataset?"

In [159]:
length <$> passengers'

714

Well that poses a problem... There is apparently some data missing. There should be 891 rows in this data set.

It seems that our `readTable` function drops any row that has a missing field. We could ignore this for now and plug in the complete data set later, but real-world data often comes with holes, so it is a worthwhile digression to deal with the problem before moving on. We will follow the method prescribed by https://github.com/acowley/Frames/blob/master/demo/MissingData.hs: create a `Default` typeclass that defines what value to report for each missing field.

In [15]:
class Default a where
  def :: a

In [20]:
instance Default ("Age" :-> Double) where def = Col 0.0
instance Default ("Name" :-> Text) where def = Col "Unnamed"
instance Default ("Cabin" :-> Text) where def = Col ""
instance Default ("Survived" :-> Bool) where def = Col False
instance Default ("PassengerId" :-> Int) where def = Col (-1)
instance Default ("Embarked" :-> Text) where def = Col ""
instance Default ("Pclass" :-> Int) where def = Col 0
instance Default ("Sex" :-> Text) where def = Col ""
instance Default ("SibSp" :-> Int) where def = Col 0
instance Default ("Parch" :-> Int) where def = Col 0
instance Default ("Ticket" :-> Text) where def = Col mempty
instance Default ("Fare" :-> Double) where def = Col 0
instance (Applicative f, AllConstrained Default ts, RecApplicative ts)
  => Default (Rec f ts) where
     def = reifyDict [pr|Default|] (pure def)

In [13]:
:t readTableMaybe
:t (>->)
:t P.map
:t rmap
:t rapply
:t getFirst
:t Lift
:t First
:t recMaybe
holefill :: Rec Maybe (RecordColumns Passenger) -> Maybe Passenger
holefill = undefined

fromJust = maybe (error "a") id
:t fromJust
:t P.map (fromJust . holefill)

In [21]:
holesFilled :: MonadSafe m => Producer Passenger m ()
holesFilled = readTableMaybe "data/train.csv" >-> P.map (fromJust . holeFiller)
  where holeFiller :: Rec Maybe (RecordColumns Passenger) -> Maybe Passenger
        holeFiller = recMaybe
                   . rmap getFirst
                   . rapply (rmap (Lift . flip (<>)) def)
                   . rmap First
fromJust = maybe (error "Frames holesFilled failure") id
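
To see what `holeFiller` does to each column, here is a minimal base-only sketch of the same idea on a single field. It is an illustration rather than the Frames code: `fillWith` and `fillAge` are hypothetical names introduced here, and the record-wide plumbing (`rmap`, `rapply`, `recMaybe`) is left out. Wrapping a parsed `Maybe` in `First` and appending a default on the right keeps the parsed value when it exists and falls back to the default otherwise.

```haskell
import Data.Monoid (First(..), (<>))

-- Hypothetical helper (not part of Frames): fill one possibly-missing field.
-- First keeps the leftmost Just, so the default is only used when the
-- parsed value is Nothing.
fillWith :: a -> Maybe a -> Maybe a
fillWith fallback parsed = getFirst (First parsed <> First (Just fallback))

-- The same 0.0 default we chose for the "Age" column above.
fillAge :: Maybe Double -> Maybe Double
fillAge = fillWith 0.0
```

Because every column gets a default, every field ends up as a `Just`, which is why `recMaybe` can then assemble a complete `Passenger` for each row.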

Now let's check the length of the Frame.

In [22]:
loadRows :: IO (Frame Passenger)
loadRows = inCoreAoS holesFilled

passengers :: IO [Passenger]
passengers = F.toList <$> loadRows

In [24]:
Prelude.length <$> passengers

891

Yay! We got them all! So let's turn that into a function of the file path rather than a value hard-wired to one file.

In [36]:
holeFiller :: Rec Maybe (RecordColumns Passenger) -> Maybe Passenger
holeFiller = recMaybe
           . rmap getFirst
           . rapply (rmap (Lift . flip (<>)) def)
           . rmap First
           
fromJust = maybe (error "Couldn't fill holes") id

readSmart :: MonadSafe m => FilePath -> Producer Passenger m ()
readSmart filepath = readTableMaybe filepath >-> P.map (fromJust . holeFiller)

In [345]:
:i Passenger

In [66]:
:t passengerId
:t survived
:t pclass
:t name
:t sex
:t age
:t sibSp
:t parch
:t ticket
:t fare
:t cabin
:t embarked

These lenses are functions that can be used as getters and setters for looking through our passengers table. Before moving forward I will quickly handle the test set for later (at the end).

In [39]:
testPassengers :: IO [Passenger]
testPassengers = F.toList <$> inCoreAoS (readSmart "data/test.csv")

In [40]:
Prelude.length <$> testPassengers

418

In [46]:
mapM_ print =<< take 5 <$> testPassengers

{PassengerId :-> 892, Survived :-> False, Pclass :-> 0, Name :-> "male", Sex :-> "34.5", Age :-> 0.0, SibSp :-> 0, Parch :-> 330911, Ticket :-> "7.8292", Fare :-> 0.0, Cabin :-> "Q", Embarked :-> ""}
{PassengerId :-> 893, Survived :-> False, Pclass :-> 0, Name :-> "female", Sex :-> "47", Age :-> 1.0, SibSp :-> 0, Parch :-> 363272, Ticket :-> "7", Fare :-> 0.0, Cabin :-> "S", Embarked :-> ""}
{PassengerId :-> 894, Survived :-> False, Pclass :-> 0, Name :-> "male", Sex :-> "62", Age :-> 0.0, SibSp :-> 0, Parch :-> 240276, Ticket :-> "9.6875", Fare :-> 0.0, Cabin :-> "Q", Embarked :-> ""}
{PassengerId :-> 895, Survived :-> False, Pclass :-> 0, Name :-> "male", Sex :-> "27", Age :-> 0.0, SibSp :-> 0, Parch :-> 315154, Ticket :-> "8.6625", Fare :-> 0.0, Cabin :-> "S", Embarked :-> ""}
{PassengerId :-> 896, Survived :-> False, Pclass :-> 0, Name :-> "female", Sex :-> "22", Age :-> 1.0, SibSp :-> 1, Parch :-> 3101298, Ticket :-> "12.2875", Fare :-> 0.0, Cabin :-> "S", Embarked :-> ""}

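Hmm... Okay, so loading the test data is not quite so straightforward. The values have landed in the wrong columns (`Name` holds the sex, `Sex` holds the age, and so on), presumably because `test.csv` has no `Survived` column while our `Passenger` row type expects one. We'll deal with that later.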

Now let's see what our training data looks like. We'll take just the first 5 rows to get a rough idea.

In [134]:
mapM_ print =<< take 5 <$> passengers'

{PassengerId :-> 1, Survived :-> False, Pclass :-> 3, Name :-> "Braund, Mr. Owen Harris", Sex :-> "male", Age :-> 22.0, SibSp :-> 1, Parch :-> 0, Ticket :-> "A/5 21171", Fare :-> 7.25, Cabin :-> "", Embarked :-> "S"}
{PassengerId :-> 2, Survived :-> True, Pclass :-> 1, Name :-> "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", Sex :-> "female", Age :-> 38.0, SibSp :-> 1, Parch :-> 0, Ticket :-> "PC 17599", Fare :-> 71.2833, Cabin :-> "C85", Embarked :-> "C"}
{PassengerId :-> 3, Survived :-> True, Pclass :-> 3, Name :-> "Heikkinen, Miss. Laina", Sex :-> "female", Age :-> 26.0, SibSp :-> 0, Parch :-> 0, Ticket :-> "STON/O2. 3101282", Fare :-> 7.925, Cabin :-> "", Embarked :-> "S"}
{PassengerId :-> 4, Survived :-> True, Pclass :-> 1, Name :-> "Futrelle, Mrs. Jacques Heath (Lily May Peel)", Sex :-> "female", Age :-> 35.0, SibSp :-> 1, Parch :-> 0, Ticket :-> "113803", Fare :-> 53.1, Cabin :-> "C123", Embarked :-> "S"}
{PassengerId :-> 5, Survived :-> False, Pclass :-> 3, Name :-> "All

So to set our baseline model we ought to see what the overall survival rate was. This will require us to start using those lenses.

In [146]:
:t view
:t survived
:t view survived
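
For readers new to lenses, the following is a small standalone sketch of how a getter and a setter can fall out of a single lens definition. It does not use Frames or `Lens.Micro`: the `Traveller` record and the names `SimpleLens`, `ageL`, `viewL` and `setL` are all hypothetical and chosen to avoid clashing with the real `view` and `age`. The lens type has the same shape as the `Lens'` type from `Lens.Micro`.

```haskell
{-# LANGUAGE RankNTypes #-}
import Data.Functor.Const (Const(..))
import Data.Functor.Identity (Identity(..))

-- A van Laarhoven lens focusing on a field of type a inside a structure s.
type SimpleLens s a = forall f. Functor f => (a -> f a) -> s -> f s

-- Hypothetical stand-in for a row of the passengers table.
data Traveller = Traveller { tName :: String, tAge :: Double }
  deriving (Eq, Show)

-- A lens onto the age field.
ageL :: SimpleLens Traveller Double
ageL f t = (\a -> t { tAge = a }) <$> f (tAge t)

-- Getter: run the lens with Const, which remembers the focused value.
viewL :: SimpleLens s a -> s -> a
viewL l = getConst . l Const

-- Setter: run the lens with Identity, which rebuilds the structure.
setL :: SimpleLens s a -> a -> s -> s
setL l a = runIdentity . l (const (Identity a))
```

With these definitions `viewL ageL` plays the role that `view age` plays for a `Passenger`, and `setL ageL 23` returns a copy of the record with only the age changed.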

In [52]:
survival = map (view survived) <$> passengers

In [53]:
saved :: Fractional a => IO a
saved = genericLength . filter id <$> survival   -- number of survivors

lost :: Fractional a => IO a
lost = genericLength . filter not <$> survival   -- number who perished

total :: Fractional a => IO a
total = genericLength <$> passengers             -- number of passengers

-- (total, saved, lost)
lrate = liftM2 (/) saved total   -- survival rate
drate = liftM2 (/) lost total    -- death rate

lrate
drate

0.3838383838383838

0.6161616161616161

Okay cool. So as our baseline model, if we always guess that any given passenger died, we will be correct about $62\%$ of the time. Since we're calling this our baseline model we should recast it as such. In general a model is a function $M: X\rightarrow Y$ where $X$ and $Y$ represent our observables and classification/value, respectively. Our baseline model represents the case where no observables are used: $M_{baseline}$ ignores its input and returns a constant in $Y = \{0,1\}$, where $0$ and $1$ represent having perished in or survived the calamity, respectively. Of course there are only two such constant functions, $M \equiv 0$ and $M \equiv 1$, and since most passengers died we choose $M_{baseline} \equiv 0$.

In [54]:
baseline :: Passenger -> Bool
baseline = const False -- read this as the response to "Did they survive?"
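
The accuracy quoted for this baseline can be written out explicitly. As a sketch, take $N$ passengers with observables $x_i$ and true labels $y_i \in \{0,1\}$; the indicator notation $\mathbf{1}[\cdot]$ is introduced here for convenience and equals 1 when its condition holds and 0 otherwise.

$$
\mathrm{acc}(M) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[M(x_i) = y_i\right]
$$

For the constant model $M(x) = 0$, a prediction is correct exactly when $y_i = 0$, so the accuracy is just the fraction of passengers who perished, using the survival rate computed above:

$$
\mathrm{acc}(M) = 1 - \frac{1}{N}\sum_{i=1}^{N} y_i \approx 1 - 0.3838 = 0.6162
$$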

Now as above we'll formalize the notion that the baseline model performs with an accuracy of about $62\%$. We'll do this in a simple way: run the model over our passenger list and count how many predictions match the true labels.

In [55]:
runModel :: (Passenger -> Bool) -> IO [Passenger] -> IO [Bool]
runModel = (<$>).(<$>)
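
The point-free definition above is two `fmap`s composed: the outer one reaches inside `IO`, the inner one reaches inside the list. A minimal sketch of the same pattern for any pair of functors, assuming nothing beyond base (`mapInner` and `alwaysFalse` are hypothetical names used only for this illustration):

```haskell
-- Apply a function two functor layers deep, e.g. to the elements of an IO [a].
mapInner :: (Functor f, Functor g) => (a -> b) -> f (g a) -> f (g b)
mapInner = fmap . fmap

-- Stand-in for a constant model that ignores its input.
alwaysFalse :: a -> Bool
alwaysFalse = const False
```

Specialising `f` to `IO` and `g` to lists gives back the type of `runModel`, so `mapInner alwaysFalse (pure "abc")` is an `IO` action producing three `False` values.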

In [58]:
basePreds = runModel baseline passengers
basePreds

[False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,Fal

Nifty! So `runModel` is literally a one-liner! Okay, so now we've got to see how well this list of `False` did :P

In [59]:
:t view survived
survival = (view survived <$>) <$> passengers
survival

[False,True,True,True,False,False,False,False,True,True,True,True,False,False,False,True,False,True,False,True,False,True,True,True,False,True,False,False,True,False,False,True,True,False,False,False,True,False,False,True,False,False,False,True,True,False,False,True,False,False,False,False,True,True,False,True,True,False,True,False,False,True,False,False,False,True,True,False,True,False,False,False,False,False,True,False,False,False,True,True,False,True,True,False,True,True,False,False,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,True,False,False,False,False,True,False,False,True,False,False,False,False,True,True,False,False,False,True,False,False,False,False,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,True,False,False,False,False,False,True,False,False,False,Fa

So evidently not everyone died. How well does the model perform?

In [62]:
len :: Fractional a => [b] -> a
len = genericLength
accuracy :: (Fractional b, Eq a) => IO [a] -> IO [a] -> IO b
accuracy preds true = liftM2 (flip (/)) total $ len <$> filter id <$> liftM2 (zipWith (==)) preds true
accuracy basePreds survival

0.6161616161616161

Not the prettiest accuracy calculation, but perhaps I'll clean that up another time. Regardless, there we have it! The model is shit, as expected, but remember its purpose as a baseline. We must demand that any future model do better than this to be considered (not terribly mighty a feat, haha). But now, back to Exploratory Data Analysis before getting carried away.
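
One possible cleanup of the accuracy calculation, offered as a sketch rather than the final word: keep the arithmetic in a pure function over plain lists and lift it into `IO` only at the call site. The name `accuracyPure` is hypothetical, and the function assumes both lists have the same length; it divides by the length of the true labels instead of relying on the global `total`.

```haskell
import Data.List (genericLength)

-- Fraction of positions where the prediction equals the true label.
-- Assumes preds and truth have equal length; an empty truth list gives NaN
-- for floating-point result types.
accuracyPure :: (Eq a, Fractional b) => [a] -> [a] -> b
accuracyPure preds truth = hits / genericLength truth
  where
    -- Count the positions where prediction and label agree.
    hits = genericLength (filter id (zipWith (==) preds truth))
```

In the notebook this could be applied as `accuracyPure <$> basePreds <*> survival`, which should agree with the value reported by `accuracy` above.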