## Exploratory Data Analysis

But first, let's import the BeakerX charting classes and define a few plotting helpers:

In [2]:
(import '[com.twosigma.beakerx.chart.xychart Plot]
        '[com.twosigma.beakerx.chart.xychart.plotitem Line Bars Points]
        'com.twosigma.beakerx.chart.Color
        '[com.twosigma.beakerx.chart.histogram Histogram])

class com.twosigma.beakerx.chart.histogram.Histogram

In [3]:
(defn bar-plot
    ;; Shorter arities delegate upward and fill in the defaults
    ;; (yellow bars of width 0.3).
    ([x y]
     (bar-plot x y Color/yellow))
    ([x y color]
     (bar-plot x y color 0.3))
    ([x y color width]
     (doto (Bars.)
         (.setX x)
         (.setY y)
         (.setColor color)
         (.setWidth width)))
    ;; `show` is a callback that receives the finished plot item,
    ;; typically to wrap it in a Plot via show!.
    ([x y color width show]
     (show (bar-plot x y color width))))

(defn line-plot
    ([x y]
     (line-plot x y Color/red))
    ([x y color]
     (doto (Line.)
         (.setX x)
         (.setY y)
         (.setColor color)))
    ([x y color show]
     (show (line-plot x y color))))

(defn scatter-plot
    ([x y]
     (scatter-plot x y Color/red))
    ([x y color]
     (doto (Points.)
         (.setX x)
         (.setY y)
         (.setColor color)))
    ([x y color show]
     (show (scatter-plot x y color))))

(defn show!
    ([plot]
     (.add (Plot.) plot))
    ([plot title]
     (doto (Plot.)
         (.setTitle title)
         (.add plot)))
    ([plot title x-label y-label]
     (doto (Plot.)
         (.setTitle title)
         (.setXLabel x-label)
         (.setYLabel y-label)
         (.add plot))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/show!

In [4]:
(def enhanced-titanic (read-string (slurp "titanic-enhanced-data.txt")))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/enhanced-titanic

In [5]:
(take 1 enhanced-titanic)

In [6]:
(defn mean
    [k coll is-valid]
    (when (is-valid k)
        (float (/ (reduce + 0 (map #(read-string (str (get % k))) coll)) (count coll)))))

(defn calculate-std
    [[m & ms] [v & vs]]
    (when-let [_ v]
        (lazy-seq
         (if m
             (let [READ (map (comp read-string str) v)
                   squared (mapv #(Math/pow (- % m) 2) READ)
                   sum-of-sq (reduce + 0 squared)
                   variance (/ sum-of-sq ((comp dec count) v))
                   std      (Math/sqrt variance)]
                 (cons std
                       (calculate-std ms vs)))
             (cons nil (calculate-std ms vs))))))

(defn minimum
    [k coll is-valid]
    (when (is-valid k)
        (let [read-fn (comp read-string str #(get % k))
              col     (map read-fn coll)]
            (apply min col))))

(defn read-n*
    [n]
    (cond
        (int? n) n
        (float? n) n
        (double? n) n
        (string? n) (if (read-n* (read-string n)) (read-string n) n)))

(defn percentile
    [pth k coll is-valid]
    (when (is-valid k)
        (let [k-va (sort-by read-n* (map #(get % k) coll))
              size (count k-va)]
            (apply (fn [n]
                       (if ((complement int?) n)
                           (nth k-va (Math/round (float n)))
                           (/ (+ (nth k-va n) (nth k-va (inc n))) 2)))
             [(* (/ pth 100) (inc size))]))))

(defn maximum
    [k coll is-valid]
    (when (is-valid k)
        (let [read-fn (comp read-string str #(get % k))
              col     (map read-fn coll)]
            (apply max col))))

(defn make-template
    [ks]
    (let [pad (apply str (map (constantly "  %11s |") ks))]
        (str (str "         |" pad)
             (str "\n   count |" pad)
             (str "\n    mean |" pad)
             (str "\n     std |" pad)
             (str "\n     min |" pad)
             (str "\n     25%% |" pad)
             (str "\n     50%% |" pad)
             (str "\n     75%% |" pad)
             (str "\n     max |" pad))))

(defn table-show
    [templ elements]
    (apply format templ
     (map (fn [x]
              (cond
                  (nil? x) "nil"
                  (float? x) (format "%.2f" x)
                  :else (str x)))
          elements)))

(defn output-describe-table
    [ks [cnt mean std mn _25% _50% _75% mx :as all]]
    (table-show (make-template ks) 
     (concat ks cnt mean std mn _25% _50% _75% mx)))

(defn describe
    [coll]
    (let [ks (keys (first coll))
          ;; Columns that are summarised numerically; all others report nil.
          numeric? #{"Age" "SibSp" "Parch" "Survived" "Pclass"}
          column (fn [k] (map #(get % k) coll))
          count-row (map (comp count column) ks)
          mean-row (map #(mean % coll numeric?) ks)
          std-row (calculate-std mean-row (map column ks))
          min-row (map #(minimum % coll numeric?) ks)
          _25%-per (map #(percentile 25 % coll numeric?) ks)
          _50%-per (map #(percentile 50 % coll numeric?) ks)
          _75%-per (map #(percentile 75 % coll numeric?) ks)
          max-row (map #(maximum % coll numeric?) ks)]
        (output-describe-table ks [count-row mean-row std-row min-row _25%-per _50%-per _75%-per max-row])))

(describe enhanced-titanic)

         |          Age |  PassengerId |        SibSp |        Parch |          Sex |     Survived |     Embarked |       Pclass |         Name |
   count |         1199 |         1199 |         1199 |         1199 |         1199 |         1199 |         1199 |         1199 |         1199 |
    mean |        30.67 |          nil |         0.53 |         0.36 |          nil |         0.46 |          nil |         2.02 |          nil |
     std |        13.82 |          nil |         1.02 |         0.77 |          nil |         0.50 |          nil |         0.89 |          nil |
     min |         0.42 |          nil |            0 |            0 |          nil |            0 |          nil |            1 |          nil |
     25% |           23 |          nil |            0 |            0 |          nil |            0 |          nil |            1 |          nil |
     50% |           28 |          nil |            0 |            0 |          nil |            0 |          nil |         

Next, we look at the pairwise correlation between the columns.

In [7]:
(defn corr
    [x y]
    (let [x-all (reduce + 0 x)
          y-all (reduce + 0 y)
          x-2 (map #(Math/pow % 2) x)
          x-2-all (reduce + 0 x-2)
          y-2 (map #(Math/pow % 2) y)
          y-2-all (reduce + 0 y-2)
          xy (map * x y)
          xy-all (reduce + 0 xy)
          n (count x)]
        (/ (- (* n xy-all) (* x-all y-all))
           (Math/sqrt (* (- (* n x-2-all) (Math/pow x-all 2)) 
                         (- (* n y-2-all) (Math/pow y-all 2)))))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/corr
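For reference, `corr` above appears to implement the computational form of Pearson's correlation coefficient. Written out, with $n$ the number of paired observations:

```latex
r_{xy} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}
              {\sqrt{\left(n \sum_i x_i^2 - \left(\sum_i x_i\right)^2\right)
                     \left(n \sum_i y_i^2 - \left(\sum_i y_i\right)^2\right)}}
```

The coefficient is undefined when either column is constant, because the denominator is then zero.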

In [8]:
(defn make-corr-template
    [ks]
    (let [templ (for [row (range (inc (count ks)))]
                    (list* "%11s |"
                          (concat (for [col (range (count ks))]
                                      "%11s |") ["\n"])))]
        (apply str (mapcat identity templ))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/make-corr-template

In [9]:
(defn corr*
    [x y]
    (if (or (some (comp string? read-n*) x) (some (comp string? read-n*) y))
        "nil"
        (format "%.2f" (corr (map read-n* x) (map read-n* y)))))

(defn make-corr-rows
    [ks coll]
    (let [kvs (map (fn [k] (map #(get % k) coll)) ks)]
        (mapcat #(list* %1 %2) ks
         (for [i kvs]
             (for [a kvs]
                 (corr* i a))))))

(defn corr-table
    [coll]
    (let [ks (distinct (mapcat keys coll))
          rows (make-corr-rows ks coll)
          templ (make-corr-template ks)]
        (apply format templ (concat [" "] ks rows))))

(corr-table enhanced-titanic)

            |        Age |PassengerId |      SibSp |      Parch |        Sex |   Survived |   Embarked |     Pclass |       Name |
        Age |       1.00 |       0.28 |      -0.19 |      -0.17 |        nil |      -0.01 |        nil |      -0.41 |        nil |
PassengerId |       0.28 |       1.00 |      -0.03 |      -0.05 |        nil |       0.14 |        nil |      -0.52 |        nil |
      SibSp |      -0.19 |      -0.03 |       1.00 |       0.37 |        nil |      -0.02 |        nil |       0.06 |        nil |
      Parch |      -0.17 |      -0.05 |       0.37 |       1.00 |        nil |       0.09 |        nil |       0.03 |        nil |
        Sex |        nil |        nil |        nil |        nil |        nil |        nil |        nil |        nil |        nil |
   Survived |      -0.01 |       0.14 |      -0.02 |       0.09 |        nil |       1.00 |        nil |      -0.39 |        nil |
   Embarked |        nil |        nil |        nil |        nil |        nil |     

Let's plot Age against Pclass.

In [10]:
(defn by-ks
    ([data k1 k2]
     (let [x (map #(read-string (get % k1)) data)
           y (map #(read-string (get % k2)) data)]
         #(% x y Color/red)))
    ([data k1 k2 color]
     ((by-ks data k1 k2) (fn [x y _] (fn [f] (f x y color)))))
    ([data k1 k2 color plot]
     ((by-ks data k1 k2 color) (fn [x y color] (fn [f] (plot x y color f)))))
    ([data k1 k2 color plot show]
     ((by-ks data k1 k2 color plot) show)))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/by-ks

In [11]:
(by-ks enhanced-titanic "Pclass" "Age" Color/yellow scatter-plot (fn [p] (show! p "Pclass vs Age" "Pclass" "Age")))

In [12]:
(by-ks enhanced-titanic 
       "Pclass" 
       "Age" 
       0.5 
       (fn [x y width f] (bar-plot x y Color/green width f))
       (fn [p] (show! p "Pclass vs Age" "Pclass" "Age")))

The three passenger classes do not differ much with respect to Age, but the oldest passengers are in third class, <br>
followed by first and then second class.
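A visual reading like this can be backed up with numbers by summarising Age per class. The following is a minimal sketch, not part of the original notebook: `toy-age-rows` is made-up data in the same shape as the notebook's rows (string keys and string values), and `age-summary-by-class` is a hypothetical helper.

```clojure
;; Hypothetical toy rows; the real notebook would pass enhanced-titanic instead.
(def toy-age-rows
    [{"Pclass" "1" "Age" "38"}
     {"Pclass" "1" "Age" "54"}
     {"Pclass" "2" "Age" "27"}
     {"Pclass" "3" "Age" "22"}
     {"Pclass" "3" "Age" "74"}])

(defn age-summary-by-class
    "Returns a sorted map of Pclass -> {:n :min :mean :max} for the Age column."
    [rows]
    (into (sorted-map)
          (for [[pclass group] (group-by #(get % "Pclass") rows)
                :let [ages (map #(Double/parseDouble (get % "Age")) group)]]
              [pclass {:n    (count ages)
                       :min  (apply min ages)
                       :mean (/ (reduce + ages) (count ages))
                       :max  (apply max ages)}])))

(age-summary-by-class toy-age-rows)
;; => {"1" {:n 2, :min 38.0, :mean 46.0, :max 54.0},
;;     "2" {:n 1, :min 27.0, :mean 27.0, :max 27.0},
;;     "3" {:n 2, :min 22.0, :mean 48.0, :max 74.0}}
```

Comparing the `:max` and `:mean` entries per class gives a direct check of which class holds the oldest passengers. The sketch assumes every row has a parseable Age.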

In [13]:
(def female #(= (get % "Sex") "female"))

(def male #(= (get % "Sex") "male"))

(def female-passengers (filter female enhanced-titanic))

(def male-passengers (filter male enhanced-titanic))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/male-passengers

In [14]:
(defn hist-plot
    ([x y]
     (hist-plot x y Color/red))
    ([x y color]
     (doto (Histogram.)
         (.setNames x)
         (.setData y)
         (.setColor color)))
    ([x y color xlabel]
     (doto (hist-plot x y color)
         (.setXLabel xlabel)))
    ;; doto always returns the histogram itself, so the higher arities
    ;; do not depend on what the setters return.
    ([x y color xlabel ylabel]
     (doto (hist-plot x y color xlabel)
         (.setYLabel ylabel)))
    ([x y color xlabel ylabel title]
     (doto (hist-plot x y color xlabel ylabel)
         (.setTitle title))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/hist-plot

In [15]:
(bar-plot [0 1] 
          [((comp count filter) #{1} (map #(read-n* (get % "Survived")) female-passengers))
           ((comp count filter) #{1} (map #(read-n* (get % "Survived")) male-passengers))]
          [Color/yellow Color/green]
          0.3
          (fn [p] (show! p "Survived by Sex" "Female vs. Male" "Survived")))

In [16]:
(defn calculate-survived-percentage
    [passengers surv-count]
    (float (* (/ surv-count (count passengers)) 100)))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/calculate-survived-percentage

In [17]:
(calculate-survived-percentage female-passengers ((comp count filter) #{"1"} (map #(get % "Survived") female-passengers)))

81.28898

81.29% of the female passengers survived

In [18]:
(calculate-survived-percentage male-passengers ((comp count filter) #{"1"} (map #(get % "Survived") male-passengers)))

21.72702

~22% of the male passengers survived
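The two percentages above can also be obtained in a single pass by grouping on the column of interest. This is a sketch under assumptions: `toy-passengers` is made-up data, and `survival-rate` and `survival-rate-by` are hypothetical helpers that are not part of the original notebook.

```clojure
;; Hypothetical toy rows in the notebook's shape (string values).
(def toy-passengers
    [{"Sex" "female" "Survived" "1"}
     {"Sex" "female" "Survived" "1"}
     {"Sex" "female" "Survived" "0"}
     {"Sex" "female" "Survived" "1"}
     {"Sex" "male"   "Survived" "0"}
     {"Sex" "male"   "Survived" "1"}
     {"Sex" "male"   "Survived" "0"}
     {"Sex" "male"   "Survived" "0"}])

(defn survival-rate
    "Percentage of rows whose \"Survived\" value is \"1\". Assumes rows is non-empty."
    [rows]
    (let [survivors (filter #(= "1" (get % "Survived")) rows)]
        (double (* 100 (/ (count survivors) (count rows))))))

(defn survival-rate-by
    "Groups rows by column k and returns a map of group -> survival percentage."
    [k rows]
    (into {}
          (for [[group members] (group-by #(get % k) rows)]
              [group (survival-rate members)])))

(survival-rate-by "Sex" toy-passengers)
;; => {"female" 75.0, "male" 25.0}
```

Since `group-by` never produces an empty group, the division inside `survival-rate` is safe here. The same helper works for any other column, for example `(survival-rate-by "Pclass" rows)`.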

In [19]:
(hist-plot nil
           (keep #(when (= "1" (get % "Survived")) ((comp int read-string get) % "Age")) male-passengers)
           Color/red
           "Age"
           "Frequency"
           "Most common ages amongst the male survivors of the titanic")

The histogram peaks at age 34, with a count of 28. Beyond 34 the frequency starts to decrease, reaching its minimum at 58.

In [20]:
(defn pick-based-on-key
    ([coll k pred]
     (keep (pred k) coll))
    ([coll k pred further-analysis]
     (further-analysis (pick-based-on-key coll k pred))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/pick-based-on-key

Zooming in on the histogram above, we find that ages from 0 up to 8 have a frequency of 19.

In [21]:
(pick-based-on-key male-passengers 
                   "Age" 
                   #(fn [m] 
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 0) (< raed 8)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

65.51724

Male passengers aged 0 to 8 had a 65.52% survival rate.

In [22]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 8) (< raed 16)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

52.63158

Ages 8 to 16: 52.63% survival.

In [23]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 16) (< raed 24)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

11.570248

Ages 16 to 24: 11.57% survival.

In [24]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 24) (< raed 32)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

17.624521

Ages 24 to 32: 17.62% survival.

In [25]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 32) (< raed 40)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

25.0

Ages 32 to 40: 25% survival.

In [26]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 40) (< raed 48)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

23.170732

Ages 40 to 48: 23.17% survival.

In [27]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 48) (< raed 56)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

30.76923

Ages 48 to 56: 30.77% survival.

In [28]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 56) (< raed 64)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

15.625

Ages 56 to 64: 15.63% survival.

In [29]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (and (>= raed 72) (< raed 80)) m)))
                   (fn [kept]
                       (let [survd (filter #(= "1" (get % "Survived")) kept)]
                           (float (* 100 (/ (count survd) (count kept)))))))

0.0

Ages 72 to 80: 0% survival. However, that range excludes passengers aged exactly 80, so we check them separately:

In [30]:
(pick-based-on-key male-passengers
                   "Age"
                   #(fn [m]
                        (let [age (get m %)
                              raed (read-string age)]
                            (when (= raed 80) m)))
                   #(get (first %) "Survived"))

null
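The age-range cells above all repeat the same filter with different bounds. As a sketch under assumptions (the helper names `age-bin` and `survival-by-age-bin` and the toy rows are hypothetical, not part of the original notebook), the same breakdown could be computed in one pass by assigning each passenger to an 8-year bin:

```clojure
;; Hypothetical toy rows; the notebook would pass male-passengers instead.
(def toy-males
    [{"Age" "4"  "Survived" "1"}
     {"Age" "7"  "Survived" "0"}
     {"Age" "17" "Survived" "0"}
     {"Age" "19" "Survived" "0"}
     {"Age" "21" "Survived" "0"}
     {"Age" "22" "Survived" "1"}
     {"Age" "80" "Survived" "1"}])

(defn age-bin
    "Lower bound of the half-open bin [lo, lo + width) that contains age."
    [width age]
    (* width (long (Math/floor (/ age (double width))))))

(defn survival-by-age-bin
    "Sorted map of bin lower bound -> survival percentage within that bin."
    [width rows]
    (into (sorted-map)
          (for [[bin members] (group-by #(age-bin width (Double/parseDouble (get % "Age"))) rows)
                :let [survivors (filter #(= "1" (get % "Survived")) members)]]
              [bin (double (* 100 (/ (count survivors) (count members))))])))

(survival-by-age-bin 8 toy-males)
;; => {0 50.0, 16 25.0, 80 100.0}
```

Because the bins are half-open, a passenger aged exactly 80 lands in the bin starting at 80 rather than in the 72 to 80 bin, which matches the bounds used in the cells above.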

Now let's list the male passengers aged 24 to 32:

In [31]:
(filter #(let [age (read-string (get % "Age"))] (and (>= age 24) (<= age 32))) male-passengers)

In [32]:
(let [pred (fn [m] (let [surv (get m "Survived")] (if (= surv "1") (get m "Pclass"))))]
    (hist-plot nil
               (keep pred male-passengers)
               Color/green
               "Pclass"
               "Frequency"
               "Survivors per Pclass"))

First class has the most survivors. Interestingly, third class comes second, followed by second class.

In [33]:
(let [survd (filter #(= "1" (get % "Survived")) male-passengers)
      ports (group-by #(get % "Embarked") survd)
      ks (keys ports)
      counts (map count (vals ports))
      form-string (map (constantly "%20s") (range (count ks)))]
    (bar-plot (map #(identity %2) ks (range))
              counts
              [Color/red Color/blue Color/yellow Color/green]
              0.5
              (fn [p] (show! p "Survivors per port" (apply format (apply str form-string) ks) "Surv count"))))

Let's break down the Southampton passengers by Pclass.

In [34]:
(let [south (filter #(and (= "Southampton" (get % "Embarked"))
                          ;; count survivors only, as the plot title states
                          (= "1" (get % "Survived")))
                    male-passengers)
      by-class (group-by #(get % "Pclass") south)
      ks (keys by-class)
      counts (map count (vals by-class))
      form-string (apply str (repeat (count ks) "%20s"))]
    (bar-plot (map #(identity %2) ks (range))
              counts
              [Color/green Color/red Color/blue]
              0.4
              (fn [p] (show! p "Survivors(count) by Pclass in Southampton" (apply format form-string ks) "Survived"))))

In [35]:
(filter #(= (get % "Survived") "1")
        (filter #(= "Southampton" (get % "Embarked")) male-passengers))

In [36]:
(defn query-data
    ([queries coll]
     (lazy-seq
      (if-let [s (seq queries)]
          (let [f (first s)]
              (filter f (query-data (rest s) coll)))
          coll)))
    ([queries coll do-after]
     (do-after (query-data queries coll))))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/query-data
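`query-data` applies each predicate as a nested `filter`, so a row is kept only if every predicate returns a truthy value. As a sketch of the same idea (the name `query-all` and the toy rows are hypothetical and not part of the original notebook), the predicates can be combined with `every-pred` and applied in a single `filter`:

```clojure
(defn query-all
    "Keeps the rows of coll for which every predicate in preds returns truthy.
     With no predicates, coll is returned unchanged."
    [preds coll]
    (if (seq preds)
        ;; every-pred needs at least one predicate, hence the seq check.
        (filter (apply every-pred preds) coll)
        coll))

;; Hypothetical toy rows in the notebook's shape.
(def toy-query-rows
    [{"Pclass" "3" "Survived" "1" "Embarked" "Southampton"}
     {"Pclass" "3" "Survived" "0" "Embarked" "Southampton"}
     {"Pclass" "1" "Survived" "1" "Embarked" "Cherbourg"}])

(query-all [#(= "3" (get % "Pclass"))
            #(= "1" (get % "Survived"))]
           toy-query-rows)
;; => ({"Pclass" "3", "Survived" "1", "Embarked" "Southampton"})
```

For pure predicates the order in which they are listed does not change the result, only how early a row is rejected.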

In [37]:
(query-data [#(= "3" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers)

In [38]:
(query-data [#(= "3" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers
            (fn [data] (reduce (fn [acc [k v]] (assoc acc k (count v)))
                               {}
                               (group-by #(read-n* (get % "Age")) data))))

Most of the third-class survivors are aged around 30 or younger. Let's look at the other two classes.
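The age counts built with `reduce` over `group-by` can also be expressed with `frequencies`, and a claim about ages relative to 30 can be turned into a number. The sketch below uses made-up rows; `age-counts` and `share-at-or-below` are hypothetical helpers, not part of the original notebook.

```clojure
;; Hypothetical toy rows; the notebook would pass the query-data result instead.
(def toy-survivors
    [{"Age" "22"} {"Age" "29"} {"Age" "22"} {"Age" "45"}])

(defn age-counts
    "Sorted map of age -> number of rows with that age."
    [rows]
    (into (sorted-map)
          (frequencies (map #(Double/parseDouble (get % "Age")) rows))))

(defn share-at-or-below
    "Percentage of rows whose Age is less than or equal to limit."
    [limit rows]
    (let [ages (map #(Double/parseDouble (get % "Age")) rows)
          young (filter #(<= % limit) ages)]
        (double (* 100 (/ (count young) (count ages))))))

(age-counts toy-survivors)
;; => {22.0 2, 29.0 1, 45.0 1}

(share-at-or-below 30 toy-survivors)
;; => 75.0
```

Both helpers assume a non-empty collection in which every row has a parseable Age.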

In [39]:
(query-data [#(= "1" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers
            (fn [data] (reduce (fn [acc [k v]] (assoc acc k (count v)))
                               {}
                               (group-by #(read-n* (get % "Age")) data))))

In [40]:
(query-data [#(= "1" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers
            (fn [data] 
                (let [mp (reduce (fn [acc [k v]] 
                                     (assoc acc k (count v))) 
                                 {} 
                                 (group-by #(int (read-n* (get % "Age"))) data))]
                    (query-data nil
                     (reduce (fn [acc [k v]]
                                 (assoc acc k (map key v)))
                             {}
                             (group-by val mp))
                      (fn [data] (reduce (fn [acc [k v]]
                                             (assoc acc (apply list v) (* k (count v))))
                                         {}
                                         data))))))

In first class we begin to see ages well above 30, although some survivors are still around 30 or younger, as in third class.

In [41]:
(query-data [#(= "2" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers
            (fn [data] 
                (let [mp (reduce (fn [acc [k v]] 
                                     (assoc acc k (count v))) 
                                 {} 
                                 (group-by #(int (read-n* (get % "Age"))) data))]
                    (query-data nil
                     (reduce (fn [acc [k v]]
                                 (assoc acc k (map key v)))
                             {}
                             (group-by val mp))
                      (fn [data] (reduce (fn [acc [k v]]
                                             (assoc acc (apply list v) (* k (count v))))
                                         {}
                                         data))))))

In second class the ages move closer to 30 again.

In [42]:
(query-data [#(= "2" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers)

It seems that the vast majority of second-class survivors were travelling with family.

In [43]:
(query-data [#(= "1" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Southampton" (get % "Embarked"))]
            male-passengers)

Not many first-class survivors were travelling with family.

In [44]:
(let [south (query-data [#(= "Southampton" (get % "Embarked"))]
                        male-passengers)
      by-class (group-by #(get % "Pclass") south)
      ks (keys by-class)
      percents (map (fn [v]
                        (let [surv (filter #(= "1" (get % "Survived")) v)]
                            (float (* 100 (/ (count surv) (count v)))))) (vals by-class))
      form-s (apply str (repeat (count ks) "%20s"))]
    (bar-plot (map #(identity %2) ks (range))
              percents
              [Color/green Color/blue Color/red]
              0.3
              (fn [p] (show! p "Survival percent per Pclass" (apply format form-s ks) "Percent %"))))

First class has the highest survival rate, followed by second class, where most survivors were travelling with <br>
family, and then third class, where most survivors were aged around 30 or younger.

In [45]:
(let [cher (filter #(and (= "Cherbourg" (get % "Embarked"))
                         ;; count survivors only, as the plot title states
                         (= "1" (get % "Survived")))
                   male-passengers)
      by-class (group-by #(get % "Pclass") cher)
      ks (keys by-class)
      counts (map count (vals by-class))
      form-string (apply str (repeat (count ks) "%20s"))]
    (bar-plot (map #(identity %2) ks (range))
              counts
              [Color/green Color/red Color/blue]
              0.4
              (fn [p] (show! p "Survivors by Pclass in Cherbourg" (apply format form-string ks) "Survived"))))

In [46]:
(query-data [#(= "1" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers
            (fn [data] (reduce (fn [acc [k v]] (assoc acc k (count v)))
                               {}
                               (group-by #(read-n* (get % "Age")) data))))

Among the first-class survivors who boarded at Cherbourg, more ages are close to 20 or below 30.

In [47]:
(query-data [#(= "3" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers
            (fn [data] (reduce (fn [acc [k v]] (assoc acc k (count v)))
                               {}
                               (group-by #(read-n* (get % "Age")) data))))

All third class passenger survivors that boarded in Cherbourg had ages below 30

In [48]:
(query-data [#(= "2" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers
            (fn [data] (reduce (fn [acc [k v]] (assoc acc k (count v)))
                               {}
                               (group-by #(read-n* (get % "Age")) data))))

Only 2 second-class passengers survived. Note also that as the number of survivors decreases, the share of <br>
survivors younger than 30 increases.

In [49]:
(query-data [#(= "1" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers)

We see passengers with family on board in first class

In [50]:
(query-data [#(= "3" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers)

50% of the third-class survivors had a sibling, spouse, child, or parent on board.

In [51]:
(query-data [#(= "2" (get % "Pclass")) #(= "1" (get % "Survived")) #(= "Cherbourg" (get % "Embarked"))]
            male-passengers)

One of the two second-class survivors had family on board.

In [52]:
(let [cher (query-data [#(= "Cherbourg" (get % "Embarked"))]
                        male-passengers)
      by-class (group-by #(get % "Pclass") cher)
      ks (keys by-class)
      percents (map (fn [v]
                        (let [surv (filter #(= "1" (get % "Survived")) v)]
                            (float (* 100 (/ (count surv) (count v)))))) (vals by-class))
      form-s (apply str (repeat (count ks) "%20s"))]
    (bar-plot (map #(identity %2) ks (range))
              percents
              [Color/green Color/blue Color/red]
              0.3
              (fn [p] (show! p "Survival percent per Pclass" (apply format form-s ks) "Percent %"))))

In this case, ranking the classes by number of survivors gives the same order as ranking them by survival rate: first class, then third class, then second class.

In [53]:
(let [with-family (remove #(and (= "0" (get % "Parch")) (= "0" (get % "SibSp"))) male-passengers)
      survived (filter #(= "1" (get % "Survived")) male-passengers)
      by-num (group-by #(+ (read-n* (get % "SibSp")) (read-n* (get % "Parch"))) with-family)
      ks (keys by-num)
      perc (map (fn [v]
                    (let [surv (filter #(= "1" (get % "Survived")) v)]
                        (float (* 100 (/ (count surv) (count survived)))))) (vals by-num))]
    (println (map vector ks perc))
    (scatter-plot ks
                  perc
                  Color/green
                  (fn [p] (show! p "Distribution of percentage of total survived by family number" "Family" "Percentage %"))))

([1 17.307692] [4 0.64102566] [6 0.64102566] [5 0.0] [2 19.23077] [7 0.0] [3 5.1282053] [10 0.0])


Interestingly, the passengers with the most family members account for the smallest share of all male survivors of the Titanic,
while those with 1 or 2 family members account for the largest share.
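Two different percentages are easy to confuse here: the share of all survivors that a family-size group accounts for, and the survival rate within that group. The sketch below computes both side by side on made-up rows. `family-size-stats` and its helpers are hypothetical and not part of the original notebook, and unlike the cell above the sketch also keeps passengers travelling alone (family size 0).

```clojure
;; Hypothetical toy rows in the notebook's shape (string values).
(def toy-family-rows
    [{"SibSp" "0" "Parch" "0" "Survived" "0"}
     {"SibSp" "1" "Parch" "0" "Survived" "1"}
     {"SibSp" "1" "Parch" "0" "Survived" "0"}
     {"SibSp" "1" "Parch" "1" "Survived" "1"}
     {"SibSp" "4" "Parch" "2" "Survived" "0"}])

(defn survived? [row]
    (= "1" (get row "Survived")))

(defn family-size
    "Number of family members on board: siblings/spouses plus parents/children."
    [row]
    (+ (Long/parseLong (get row "SibSp"))
       (Long/parseLong (get row "Parch"))))

(defn family-size-stats
    "Sorted map of family size -> {:share-of-survivors :survival-rate}, both in percent.
     Assumes at least one survivor in rows."
    [rows]
    (let [total-survivors (count (filter survived? rows))]
        (into (sorted-map)
              (for [[size members] (group-by family-size rows)
                    :let [n (count (filter survived? members))]]
                  [size {:share-of-survivors (double (* 100 (/ n total-survivors)))
                         :survival-rate      (double (* 100 (/ n (count members))))}]))))

(family-size-stats toy-family-rows)
;; => {0 {:share-of-survivors 0.0,  :survival-rate 0.0},
;;     1 {:share-of-survivors 50.0, :survival-rate 50.0},
;;     2 {:share-of-survivors 50.0, :survival-rate 100.0},
;;     6 {:share-of-survivors 0.0,  :survival-rate 0.0}}
```

A small group can have a high survival rate while still contributing little to the total number of survivors, so the two columns answer different questions.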

In [54]:
(def get-n (comp read-n* get))

#'beaker_clojure_shell_ed045675-d4fe-4cdd-9de3-4eda4c147f44/get-n

In [55]:
(filter #(= 6 (+ (get-n % "Parch") (get-n % "SibSp"))) male-passengers)

In [56]:
(query-data [#(= 6 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data] (let [size (count data)] (* 100 (/ (count (filter #(= 1 (get-n % "Survived")) data)) size)))))

25

25% survival for individuals with six family members.

In [57]:
(filter #(= 4 (+ (get-n % "Parch") (get-n % "SibSp"))) male-passengers)

In [58]:
(query-data [#(= 4 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data]
                (let [size (count data)
                      surv (filter #(= 1 (get-n % "Survived")) data)]
                    (float (* 100 (/ (count surv) size))))))

20.0

20% survival for individuals with four family members.

In [59]:
(query-data [#(= 3 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers)

In [60]:
(query-data [#(= 3 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data]
                (let [size (count data)
                      surv (filter #(= 1 (get-n % "Survived")) data)]
                    (float (* 100 (/ (count surv) size))))))

47.058823

47.06% survival for individuals with three family members. We also start to find more first-class passengers here, along with some second-class and three third-class passengers.

In [None]:
(query-data [#(= 2 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers)

In [None]:
(query-data [#(= 2 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data]
                (let [size (count data)
                      surv (filter #(= 1 (get-n % "Survived")) data)]
                    (float (* 100 (/ (count surv) size))))))

39.473682

39.47% survival rate. Worth noting: with a couple of exceptions, many passengers in this group were no older than 10, and <br>
many of them were in 2nd or 3rd class. Many of the 1st-class passengers in this group, by contrast, were not younger than 10. 1st class had the most survivors, <br>
followed by 2nd and then 3rd class.

In [None]:
(query-data [#(= 1 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers)

In [64]:
(query-data [#(= 1 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data]
                (let [size (count data)
                      surv (filter #(= 1 (get-n % "Survived")) data)]
                    (float (* 100 (/ (count surv) size))))))

28.125

In [65]:
(query-data [#(= 1 (+ (get-n % "Parch") (get-n % "SibSp")))]
            male-passengers
            (fn [data]
                (let [size (count data)
                      ;; Rows of this group per class; no survival filter is applied here.
                      by-class (map (fn [pclass]
                                        [pclass (filter #(= pclass (get-n % "Pclass")) data)])
                                    [1 2 3])]
                    (map (fn [[pclass rows]] [pclass (float (* 100 (/ (count rows) size)))]) by-class))))

[[1, 52.083332], [2, 18.75], [3, 29.166666]]

Class breakdown of the male passengers with one family member (shares of the group, not survival rates):

* __1st:__ 52.08%
* __3rd:__ 29.17%
* __2nd:__ 18.75%

__Conclusion about male passengers on the Titanic:__ The age ranges with the lowest survival rates are 72 to 80 (0%), 16 to 24 (11.57%), 56 to 64 (15.63%), and 24 to 32 (17.62%). Only one passenger was aged 80 (Pclass 1), and he survived. Most survivors embarked at Southampton, followed by Cherbourg and then Queenstown. Younger 2nd-class passengers were highly likely to survive, followed by younger 3rd-class passengers. 1st-class passengers, whether older or younger, were more likely to survive than 2nd- and 3rd-class passengers of comparable age.

In [66]:
(query-data [#(= 1 (get-n % "Survived"))]
            female-passengers
            (fn [data]
                (let [per #(float (* 100 (/ (count %) (count data))))
                      by-age (group-by #(int (get-n % "Age")) data)
                      sorted (sort-by first by-age)]
                    (line-plot (map first sorted)
                               (map (comp per second) sorted) 
                               Color/green
                               (fn [p] (show! p "Survival rate per Age" "Age" "Survival"))))))

About 10.75% of the female passengers who survived were about 23 years old.
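
The peak can also be read off programmatically rather than from the plot. Below is a minimal sketch, assuming the ages of the survivors are available as a plain sequence of numbers. The `survivor-ages` values and the function names are invented for illustration.

```clojure
;; Made-up ages; in the notebook these would come from the survivors' rows.
(def survivor-ages [23.0 23.5 31.0 4.0 23.9 31.2])

(defn age-shares
  "Maps each integer age to its percentage of all given ages."
  [ages]
  (let [total (count ages)]
    (into (sorted-map)
          (for [[age n] (frequencies (map int ages))]
            [age (* 100.0 (/ n total))]))))

(defn peak-age
  "Returns the [age share] entry with the largest share.
  Assumes `ages` is non-empty."
  [ages]
  (apply max-key val (age-shares ages)))
```

Here `(key (peak-age survivor-ages))` is 23, since three of the six made-up ages truncate to 23.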

In [67]:
(query-data [#(= 1 (get-n % "Survived"))]
            female-passengers
            (fn [data]
                (let [form-string (apply str (repeat 3 "%20s"))]
                    (query-data [#(= 23 (int (get-n % "Age")))]
                                data
                                (fn [dta]
                                    (let [per #(* 100 (/ (count %) (count dta)))
                                          pclass (group-by #(get-n % "Pclass") dta)
                                          sorted (sort-by first pclass)]
                                        (bar-plot (map first sorted)
                                                  (map (comp float per second) sorted)
                                                  [Color/yellow Color/green Color/red]
                                                  0.2
                                                  (fn [p]
                                                      (let [title "Pclass % in female passengers of age ~23"]
                                                          (show! p title (apply format form-string (map first sorted)) "%"))))))))))

In [68]:
(query-data [#(= 1 (get-n % "Survived")) #(= 23 (int (get-n % "Age")))]
            female-passengers)

Let's see which port most of the surviving female passengers embarked from.

In [69]:
(query-data [#(= 1 (get-n % "Survived"))]
            female-passengers
            (fn [data]
                (let [port (group-by #(get-n % "Embarked") data)
                      sorted (sort-by first port)
                      pad (repeat (count port) "%20s")
                      form-s (apply str pad)]
                    (bar-plot (range (count sorted))
                              (map (comp count second) sorted)
                              (take (count sorted) (cycle [Color/red Color/green Color/blue]))
                              0.4
                              (fn [p]
                                  (show! p
                                         "Survivor count per port embarked"
                                         (apply format form-s (map first sorted))
                                         "Surv. count"))))))

Many more of the surviving female passengers boarded at Southampton. That would explain the high number of third class survivors, since there were many more female passengers in third class.

In [70]:
(query-data [#(not= 0 (+ (get-n % "Parch") (get-n % "SibSp")))]
            female-passengers
            (fn [data]
                (let [fam (group-by #(+ (get-n % "Parch") (get-n % "SibSp")) data)
                      sorted (sort-by first fam)]
                    ((fn [with-percent]
                         (let [title "Survival rate of female passengers per number of family members on board"]
                             (line-plot (map first with-percent)
                                        (map second with-percent)
                                        Color/red
                                        (fn [p] (show! p title "Num. family mem." "Survival")))))
                     (reduce (fn [vc [k v]]
                                 (let [surv-s (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          count)]
                                     (conj vc [k (float (* 100 (/ surv-s (count v))))]))) [] sorted)))))

As with the male passengers of the Titanic, the number of family members on board with the highest survival chance is between 1 and 3. Beyond that, the survival rate starts to decrease.

In [71]:
(query-data nil
            female-passengers
            (fn [data]
                (let [by-age (group-by #(int (get-n % "Age")) data)
                      sorted (sort-by first by-age)]
                    ((fn [pers]
                         (let [title "Survival rate of female passengers per Age"]
                             (bar-plot (map first pers)
                                       (map second pers)
                                       (take (count pers) (cycle [Color/blue Color/red Color/green]))
                                       0.5
                                       (fn [p] (show! p title "Age" "Survival (%)")))))
                     (reduce (fn [vc [k v]]
                                 (let [surv-p (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          #(float (* 100 (/ (count %) (count v)))))]
                                     (conj vc [k surv-p])))
                             []
                             sorted)))))

In [72]:
(query-data nil
            female-passengers
            (fn [data]
                (let [by-age (group-by #(int (get-n % "Age")) data)
                      sorted (sort-by first by-age)]
                    ((fn [pers]
                         (let [by-per (group-by second pers)]
                             (reduce (fn [acc [k v]]
                                         (assoc acc k (map first v)))
                                     (sorted-map)
                                     by-per)))
                     (reduce (fn [vc [k v]]
                                 (let [surv-p (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          #(float (* 100 (/ (count %) (count v)))))]
                                     (conj vc [k surv-p])))
                             []
                             sorted)))))
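
The cell above inverts the age-to-rate pairs so that each survival rate maps to the ages that share it. Below is a minimal standalone sketch of that inversion, assuming the rates have already been computed. The `rate-by-age` pairs and the name `ages-by-rate` are hypothetical.

```clojure
;; Hypothetical [age rate] pairs, standing in for the computed rates.
(def rate-by-age [[2 100.0] [5 100.0] [23 75.0] [40 75.0] [60 0.0]])

(defn ages-by-rate
  "Groups ages under their survival rate, with rates in ascending order."
  [pairs]
  (reduce (fn [acc [age rate]]
            ;; fnil supplies an empty vector the first time a rate is seen
            (update acc rate (fnil conj []) age))
          (sorted-map)
          pairs))

;; (ages-by-rate rate-by-age)
;; => {0.0 [60], 75.0 [23 40], 100.0 [2 5]}
```

Because the keys are floating-point rates, two ages end up in the same bucket only when their rates are exactly equal. Rounding the rates first would be one option if near-equal rates should be grouped together.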

In [73]:
(query-data nil
            female-passengers
            (fn [data]
                (let [by-pclass (group-by #(get-n % "Pclass") data)
                      sorted (sort-by first by-pclass)]
                    ((fn [pers]
                         (let [title "Survival rate of female passengers per Pclass"]
                             (bar-plot (map first pers)
                                       (map second pers)
                                       [Color/blue Color/red Color/green]
                                       0.5
                                       (fn [p] (show! p title "Pclass" "Rate (%)")))))
                     (reduce (fn [vc [k v]]
                                 (let [surv-p (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          #(* 100 (/ (count %) (count v))))]
                                     (conj vc [k (float surv-p)]))) [] sorted)))))

In [74]:
(query-data nil
            male-passengers
            (fn [data]
                (let [by-age (group-by #(get-n % "Age") data)
                      sorted (sort-by first by-age)]
                    ((fn [pers]
                         (let [title "Survival rate of male passengers per Age"]
                             (bar-plot (map first pers)
                                       (map second pers)
                                       (take (count pers) (cycle [Color/blue Color/red Color/green]))
                                       0.5
                                       (fn [p] (show! p title "Age" "Survival (%)")))))
                     (reduce (fn [vc [k v]]
                                 (let [surv-p (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          #(float (* 100 (/ (count %) (count v)))))]
                                     (conj vc [k surv-p])))
                             []
                             sorted)))))

__Conclusion:__ Overall, female passengers were more likely to survive the Titanic, and young male passengers were highly likely to survive. More third class female passengers survived in absolute terms; nevertheless, third class had the lowest female survival rate of the three classes, with 1st and 2nd class having almost the same probability of survival. Many ages had a 100% survival rate, some young and some older. Unlike male passengers, for whom survival was more likely the younger they were, high survival rates for female passengers were spread across the ages: some ages had lower rates, many had high rates, and the rest had the maximum possible.

In [75]:
(query-data nil
            enhanced-titanic
            (fn [data]
                (let [by-fam (group-by #(+ (get-n % "Parch") (get-n % "SibSp")) data)
                      sorted (sort-by first by-fam)]
                    ((fn [per]
                         (let [title "Survival rate of passengers per family members on board"]
                             (line-plot (map first per)
                                        (map second per)
                                        Color/green
                                        (fn [p] (show! p title "Num." "Survival Rate %")))))
                     (reduce (fn [vc [k v]]
                                 (let [surv-p (query-data [#(= 1 (get-n % "Survived"))]
                                                          v
                                                          #(* 100 (/ (count %) (count v))))]
                                     (conj vc [k (float surv-p)]))) [] sorted)))))

There seems to be a positive correlation between the number of family members and the survival rate up to 3 family members; beyond that, the rate decreases until it reaches 0.
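
A rate computed from a very small group can reach 0% or 100% on the strength of a single passenger, so it may help to carry the group size along with each rate. Below is a minimal sketch, assuming rows are plain maps with numeric values. The `family-rows` data and the helper names are made up for illustration and are not part of this notebook.

```clojure
;; Hypothetical rows; the keys mirror the notebook's column names.
(def family-rows
  [{"Parch" 0 "SibSp" 0 "Survived" 0}
   {"Parch" 0 "SibSp" 0 "Survived" 1}
   {"Parch" 1 "SibSp" 0 "Survived" 1}
   {"Parch" 2 "SibSp" 1 "Survived" 1}
   {"Parch" 4 "SibSp" 3 "Survived" 0}])

(defn family-size
  "Number of family members on board: Parch plus SibSp."
  [row]
  (+ (get row "Parch") (get row "SibSp")))

(defn rate-with-size
  "Maps each group key to {:n group-size :rate survival-percent}."
  [group-fn rows]
  (into (sorted-map)
        (for [[k group] (group-by group-fn rows)
              :let [n (count group)
                    survived (count (filter #(= 1 (get % "Survived")) group))]]
          [k {:n n :rate (* 100.0 (/ survived n))}])))
```

For the made-up rows, `(rate-with-size family-size family-rows)` reports a rate of 0.0 for family size 7 with `:n 1`, which makes it visible that the rate rests on a single row.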

In [76]:
(query-data [#(= 1 (get-n % "Survived"))]
            enhanced-titanic
            (fn [data]
                (let [by-fam (group-by #(+ (get-n % "Parch") (get-n % "SibSp")) data)
                      sorted (sort-by first by-fam)]
                    ((fn [per]
                         (let [title "Percent of total survived by family members"]
                             (line-plot (map first per)
                                        (map second per)
                                        Color/red
                                        (fn [p] (show! p title "Num." "%")))))
                     (reduce (fn [vc [k v]]
                                 (let [percent (* 100 (/ (count v) (count data)))]
                                     (conj vc [k (float percent)])))
                             []
                             sorted)))))

Among the survivors, the more family members a passenger had on board, the smaller the share of all survivors that group represents.

In [77]:
(def parch-plot (atom nil))

(def sibsp-plot (atom nil))

(query-data [#(= 1 (get-n % "Survived"))]
            enhanced-titanic
            (fn [data]
                (let [by-parch (group-by #(get-n % "Parch") data)
                      by-sibsp (group-by #(get-n % "SibSp") data)
                      sort-parch (sort-by first by-parch)
                      sort-sibsp (sort-by first by-sibsp)]
                    ((fn [parch-per sibsp-per]
                         (reset! parch-plot
                          (bar-plot (map first parch-per)
                                    (map second parch-per)
                                    (take (count parch-per) (cycle [Color/red Color/yellow Color/green]))
                                    0.3
                                    (fn [p] (show! p "Total % of survivors by Parch" "Parch" "%"))))
                         (reset! sibsp-plot
                          (bar-plot (map first sibsp-per)
                                    (map second sibsp-per)
                                    (take (count sibsp-per) (cycle [Color/red Color/yellow Color/green]))
                                    0.3
                                    (fn [p] (show! p "Total % of survivors by SibSp" "SibSp" "%"))))
                         nil)
                     (reduce (fn [acc [k v]]
                                 (let [percent (* 100 (/ (count v) (count data)))]
                                     (conj acc [k (float percent)]))) [] sort-parch)
                     (reduce (fn [acc [k v]]
                                 (let [percent (* 100 (/ (count v) (count data)))]
                                     (conj acc [k (float percent)]))) [] sort-sibsp)))))

null

In [78]:
@parch-plot

Passengers with no parents or children had the highest percent of survivors.

In [79]:
@sibsp-plot

Passengers with 0 siblings or spouses also had the highest percentage. Nevertheless, passengers with 1 sibling or spouse made up a higher percentage of survivors than passengers with 1 parent or child.

In [80]:
(query-data nil
            enhanced-titanic
            (fn [data]
                (let [grouped (group-by #(+ (get-n % "Parch") (get-n % "SibSp")) data)
                      sorted (sort-by first grouped)]
                    ((fn [per]
                         (bar-plot (map first per)
                                   (map second per)
                                   (take (count per) (cycle [Color/blue Color/red Color/green]))
                                   0.4
                                   (fn [p] (show! p "Percent of passengers per family members on board" "Num." "percent (%)"))))
                     (reduce (fn [acc [k v]]
                                 (let [percent (* 100 (/ (count v) (count data)))]
                                     (conj acc [k (float percent)])))
                             []
                             sorted)))))

There were many more passengers without family on board overall.

__Conclusion:__ A passenger with 3 family members on board had the highest chance of survival; having 0, 1 or 2 also gave a high chance. Having 0 parents or children is not the same as having 0 siblings or spouses: the chance of survival was higher in the former case. Overall, however, having more siblings came with a higher survival rate than having more children. All of this might be because parents preferred to stay behind so that their children could get on a lifeboat.

**Up-next:** Data Visualization