# Data Wrangling

Let's extract the contents of titanic.txt.

In [1]:
(def titanic-dot-txt (slurp "titanic.txt"))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/titanic-dot-txt

Recall that the content is a seq of maps. Let's take the first 10 elements.

In [185]:
(take 10 titanic-dot-txt)

[(, {, ", A, g, e, ",  , ", 2]

Something went wrong! The slurp function returns the contents of a file as a string, so take gave us its first 10 characters. Fortunately, Clojure has the read-string function, which parses a string into Clojure data.
For example:

In [186]:
(= 2 (read-string "2"))

true

or

In [187]:
(= (list 1 2 3) (read-string "(1 2 3)"))

true
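As an aside: since the file only contains data, clojure.edn/read-string is a safer way to parse it, because it never evaluates code embedded in the text. A minimal sketch, where sample-text is a made-up stand-in for the contents of titanic.txt:

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical two-row sample in the same shape as titanic.txt (a list of maps).
(def sample-text
  "({\"Age\" \"22\" \"Sex\" \"male\"} {\"Age\" \"\" \"Sex\" \"female\"})")

;; edn/read-string parses the text into plain Clojure data without evaluating anything.
(def sample-rows (edn/read-string sample-text))

;; Look at the "Age" of the first row.
(get (first sample-rows) "Age")
```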

So

In [2]:
(alter-var-root #'titanic-dot-txt #(read-string %))

Now let's try it again.

In [3]:
(take 10 titanic-dot-txt)

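Note: we used alter-var-root, which expects a var and a fn that computes the var's new value from its current one. Since titanic-dot-txt now holds parsed data rather than a string, running the above cell again would throw an exception, because read-string expects a string.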

In [190]:
(try (alter-var-root #'titanic-dot-txt #(read-string %))
    (catch Exception e
        (str "Exception caught")))

Exception caught

First, let's write a function that returns a fn expecting a single argument, f. That argument is the function we pass in to explore the data.

In [4]:
(defn return-fn
    [data]
    (fn [f]
        (f data)))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/return-fn
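To see what return-fn buys us, here is a small sketch on made-up rows (toy-data is hypothetical, not part of the notebook). The returned fn closes over the data, and every exploration is just a function handed to it:

```clojure
(defn return-fn
  [data]
  (fn [f]
    (f data)))

;; Hypothetical rows standing in for the parsed file.
(def toy-data (return-fn [{"Age" "22"} {"Age" ""}]))

;; Count the rows.
(toy-data count)

;; Collect every "Age" value.
(toy-data (fn [rows] (map #(get % "Age") rows)))
```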

Now

In [5]:
(def titanic-data (return-fn titanic-dot-txt))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/titanic-data

Let's count the missing values.

In [193]:
(titanic-data
 (fn [data]
     (count
      (filter empty?
              (mapcat vals
                      data)))))

866

We will replace those missing values. First, let's see which keys have them.

In [194]:
(titanic-data
 (fn [data]
     (let [kys (distinct (mapcat keys data))]
         (keep (fn [k]
                   (let [values (map #(get % k) data)]
                       (when (some empty? values)
                           k))) kys))))

[Age, Embarked, Cabin]

We have 3 keys with missing values. Time to load the other data.

In [6]:
(def first-class-dot-text (read-string (slurp "First class.txt")))

(def second-class-dot-text (read-string (slurp "Second class.txt")))

(def third-class-dot-text (read-string (slurp "Third class.txt")))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/third-class-dot-text

In [7]:
(def first-class-data (fn [f] (f first-class-dot-text)))

(def second-class-data (fn [f] (f second-class-dot-text)))

(def third-class-data (fn [f] (f third-class-dot-text)))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/third-class-data

Let's see if we can find some of the missing data in any of the other datasets.

In [8]:
(defn search-names-in-other-data
    [main-data search-data search-fn]
    (let [missing-row-fn (fn [dta] 
                             (keep #(when (some empty? (vals %)) (get % "Name")) dta))
          names-with-missing (main-data missing-row-fn)]
        (reduce (fn [acc s-data]
                    (let [ready (search-fn names-with-missing)]
                        (lazy-cat acc (s-data ready))))
                (lazy-seq)
                search-data)))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/search-names-in-other-data

In [9]:
(search-names-in-other-data titanic-data
                            [first-class-data second-class-data third-class-data]
                            (fn [names-with-missing]
                                (fn [data]
                                    ((fn iter [[n & names]]
                                         (lazy-seq
                                          (when-let [n n]
                                              (let [x (some #(and (get % n) %) data)]
                                                  (if x
                                                      (cons x (iter names))
                                                      (iter names)))))) names-with-missing))))

[]

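The search returns nothing. Note that it looks each name up as a key of the row maps, (get % n), rather than comparing it with the row's 'Name' value, so an empty result does not by itself prove that no passenger appears in the other datasets. Either way, we will fill in the missing 'Age' values with the average age for the passenger's sex.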

In [10]:
(defn round
    [n]
    (read-string
     (format "%.2f" n)))

(defn get-age
    [dataset kont]
    (kont
     (dataset
      (fn [data]
          (loop [[m & ms] data avg {:male [] :female []}]
              (if (not m)
                  (zipmap '(:male :female) (map #(str (round (/ (apply + %) (count %)))) (vals avg)))
                  (let [sex? (get m "Sex")
                        age? (get m "Age")]
                      (cond
                          (empty? age?) (recur ms (update avg (keyword sex?) conj 0))
                          (= sex? "male") (recur ms (update avg :male conj (read-string age?)))
                          (= sex? "female") (recur ms (update avg :female conj (read-string age?)))))))))))
 

(get-age titanic-data 
         (fn [sex-avg-age]
             (alter-var-root #'titanic-data
                             (fn [FN]
                                 (let [data (FN identity)]
                                     (fn [f]
                                         (f ((fn recur-iter [[m & ms]]
                                                 (lazy-seq
                                                  (when-let [m m]
                                                      (if (empty? (get m "Age"))
                                                          (let [k (if (= (get m "Sex") "female")
                                                                      :female
                                                                      :male)]
                                                              (cons (assoc m "Age" (k sex-avg-age))
                                                                    (recur-iter ms)))
                                                          (cons m (recur-iter ms)))))) data))))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$eval219$fn__220$fn__221$fn__222@55e2c1fc
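One thing to keep in mind: get-age above conj's a 0 for every missing age, so those rows pull the averages down. A sketch of an average that skips missing ages instead (mean-age-by-sex is a hypothetical helper and the rows are made up):

```clojure
(defn mean-age-by-sex
  "Mean age per sex, computed only from rows whose \"Age\" is not empty."
  [rows]
  (let [known (remove #(empty? (get % "Age")) rows)]
    (into {}
          (for [[sex group] (group-by #(get % "Sex") known)]
            (let [ages (map #(Double/parseDouble (get % "Age")) group)]
              [sex (/ (reduce + ages) (count ages))])))))

;; Hypothetical rows: one male passenger has no age recorded.
(mean-age-by-sex [{"Sex" "male"   "Age" "20"}
                  {"Sex" "male"   "Age" ""}
                  {"Sex" "male"   "Age" "30"}
                  {"Sex" "female" "Age" "40"}])
```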

In [200]:
(titanic-data
 (fn [data]
     (let [vs (mapcat vals data)]
         (count
          (filter empty? vs)))))

689

In [201]:
(titanic-data
 (fn [data]
     (let [ks (distinct (mapcat keys data))]
         (distinct 
          (mapcat (fn [k]
                      (keep #(when (empty? (get % k))
                                 k) data)) ks)))))

[Embarked, Cabin]

How many missing values do the remaining keys have? Well...

In [12]:
(titanic-data
 (fn [data]
     (let [ks (distinct (mapcat keys data))]
         (reduce (fn [m [k v]]
                     (assoc m k (count v)))
                 {}
                 (group-by identity
                           (mapcat (fn [k]
                                       (keep #(when (empty? (get % k))
                                                  k) data)) ks))))))
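The same kind of tally can be sketched more compactly with for and frequencies. missing-per-key is a hypothetical helper, and unlike the cell above it only counts keys that are present with an empty value (a key absent from a row is not counted):

```clojure
(defn missing-per-key
  "Map from key to the number of rows in which that key has an empty value."
  [rows]
  (frequencies
   (for [row rows
         [k v] row
         :when (empty? v)]
     k)))

;; Hypothetical rows.
(missing-per-key [{"Age" ""   "Cabin" "C1"}
                  {"Age" ""   "Cabin" ""}
                  {"Age" "30" "Cabin" "B2"}])
```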

Now we will iterate through each individual map. If a map has no missing value in 'Embarked', we leave it as is. When 'Embarked' is missing,
we take the 'Fare' paid in that map and compare it against the maximum fare per port of embarkation, going from the lowest maximum to the highest. The first port
whose maximum is above the fare is the one that replaces our missing value.
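The rule just described can be sketched on its own. guess-port and the maxima below are hypothetical, made up only to show the comparison order; note that the sketch returns nil when the fare is not below any port's maximum:

```clojure
;; Hypothetical maximum fare per port (not taken from the dataset).
(def max-fare-by-port {"Q" 90.0 "S" 263.0 "C" 512.3})

(defn guess-port
  "Walk the ports from the lowest maximum fare to the highest and return
   the first one whose maximum is above `fare`, or nil if there is none."
  [max-fares fare]
  (some (fn [[port max-fare]]
          (when (< fare max-fare) port))
        (sort-by val max-fares)))

(guess-port max-fare-by-port 100.0)
```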

In [13]:
(defn compute-max-value-embarked-and-return-fn
    [dataset create-max-embarked-fn]
    (dataset
     (fn [data]
         (let [embarked (distinct (filter not-empty (map #(get % "Embarked") data)))
               maxes-embarked (create-max-embarked-fn embarked data)
               ports (keys maxes-embarked)
               maxes (vals maxes-embarked)]
             (fn [f] (f ports 
                        maxes 
                        data
                        (fn [[p & ps][m & ms] mp]
                            (when-let [_ (and p m)]
                                (let [fare? (read-string (get mp "Fare"))]
                                    (if (< fare? m)
                                        (assoc mp "Embarked" p)
                                        (recur ps ms mp)))))))))))

(defn- calculate-max
       [mp]
       (let [result (zipmap (keys mp)
                            (map #(apply max %) (vals mp)))]
           (into (sorted-map-by #(compare (get result %1) (get result %2)))
                 result)))

(defn iterate-through-maxes
    [ports maxes data iterator]
    (when-let [s (seq data)]
        (lazy-seq
         (loop [[m & ms] s]
             (let [embarked? (get m "Embarked")]
                 (if (empty? embarked?)
                     (let [new-m (iterator ports maxes m)]
                         (cons new-m (iterate-through-maxes ports maxes ms iterator)))
                     (cons m (iterate-through-maxes ports maxes ms iterator))))))))

(defn return
    [FN final-op]
    (FN final-op))


(return
 (compute-max-value-embarked-and-return-fn
  titanic-data
  (fn [ks data]
      (let [MAP (zipmap ks (repeat (count ks) []))]
          (calculate-max
           (reduce (fn [acc m]
                       (let [embarked? (get m "Embarked")
                             fare? (read-string (get m "Fare"))]
                           (if (empty? embarked?)
                               acc
                               (update acc embarked? conj fare?))))
                   MAP data)))))
 (fn [& args]
     (alter-var-root #'titanic-data
                     (fn [_]
                         (fn [f]
                             (f (apply iterate-through-maxes args)))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$eval309$fn__314$fn__315$fn__316@4a742b4f

Check missing values again

In [14]:
(titanic-data
  (fn [data]
      (let [ks (distinct (mapcat keys data))]
          (reduce (fn [m [k v]]
                      (assoc m k (count v)))
                  {}
                  (group-by identity
                            (mapcat (fn [k]
                                        (keep #(when (empty? (get % k))
                                                   k) data)) ks))))))

**About Cabin:**

**Pros:**
* Some cabins were probably evacuated before others, and we could check whether that was the case for a particular cabin (not guaranteed).

**Cons:**
* Lots of missing cabins in the dataset.
* Each cabin is unique (no cabin number appears twice on the ship), whether several passengers shared it or not.
* We could probably check which pclass a cabin belongs to, but that gives no guarantee about whether its occupant survived or died.

After all, it is just the cabin a passenger was in, so the most it can give us is the pclass of its occupants. There isn't even a ship section
to which it belongs. Of course we could try to find ship plans, and then evacuation plans. But how do you know which passenger was in which cabin? This is important,
and a lot of cabins are omitted in this dataset, so I think it's best, at least for now, to drop the 'Cabin' column entirely.

So

In [15]:
(alter-var-root
 #'titanic-data
 (fn [FN]
     (let [data (FN identity)]
         (fn [f]
             (f (map #(dissoc % "Cabin") data))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$eval337$fn__338$fn__339@4735ba05

Now

In [16]:
(titanic-data
  (fn [data]
      (let [ks (distinct (mapcat keys data))]
          (reduce (fn [m [k v]]
                      (assoc m k (count v)))
                  {}
                  (group-by identity
                            (mapcat (fn [k]
                                        (keep #(when (empty? (get % k))
                                                   k) data)) ks))))))

Now we will process the remaining datasets

In [17]:
(defn retrieve-missing-map
    [data]
    (let [ks (distinct (mapcat keys data))]
        (reduce (fn [m [k v]]
                    (assoc m k (count v)))
                {}
                (group-by identity
                          (mapcat (fn [k]
                                      (keep #(when (empty? (get % k "not-present"))
                                                 k) data)) ks)))))

(defn keys-with-missing
    [datasets cont]
    (let [dataset (first datasets)]
        (if dataset
            (let [missing ((:data dataset) retrieve-missing-map)]
                (recur (rest datasets) #(cont (assoc % (:dset dataset) missing))))
            (cont {}))))


(keys-with-missing [{:dset :first-class-data :data first-class-data}
                    {:dset :second-class-data :data second-class-data}
                    {:dset :third-class-data :data third-class-data}]
                   identity)

Something to consider: the third-class data has 8 keys instead of 7, so we are going to focus on the first two datasets first.

https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#Third_class_2

The missing values of Boarded and Hometown can be replaced by the most common value of each.

In [18]:
(defn replace-in-data
    ([replace-with data]
     (data replace-with))
    ([replace-with data1 data2]            ;; replace-in-data expects that replace-with is a function with side-effects
     (do (data1 replace-with)
         (data2 replace-with)
         nil))
    ([replace-with data1 data2 & data]
     (when data1
         (do (replace-in-data replace-with data1 data2)
             (apply replace-in-data replace-with data)))))

(defn find-max-occuring-value
    [k data]
    (let [s (map #(get % k) data)
          grouped (group-by identity s)
          maxed (group-by count (vals grouped))
          MAX (apply max (keys maxed))]
        ((fn [MAX] (ffirst (get maxed MAX))) MAX)))

(defn make-fn
    [data k]
    (fn [dta]
        (alter-var-root
         data
         (fn [_]
             #(%
               (let [pad (find-max-occuring-value k dta)]
                   (map (fn [m] (if (empty? (get m k)) (assoc m k pad) m)) dta)))))))

(defn replace-in-k-with
    [datas ks]
    (let [k-fns (map make-fn datas ks)]
        (dorun (map #(replace-in-data %2 %1) datas k-fns))))


(replace-in-k-with [#'first-class-data #'second-class-data] ["Hometown" "Boarded"])

null
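The idea of padding with the most common value can also be sketched with frequencies. most-common and fill-missing are hypothetical helpers that work on plain seqs of maps; unlike find-max-occuring-value, this version ignores empty values when looking for the most common one:

```clojure
(defn most-common
  "Most frequent non-empty value of key k across rows, or nil if there is none."
  [k rows]
  (let [freqs (frequencies (remove empty? (map #(get % k) rows)))]
    (when (seq freqs)
      (key (apply max-key val freqs)))))

(defn fill-missing
  "Replace empty values of key k with the most common non-empty value."
  [k rows]
  (let [pad (most-common k rows)]
    (map #(if (empty? (get % k)) (assoc % k pad) %) rows)))

;; Hypothetical rows.
(fill-missing "Boarded" [{"Boarded" "Southampton"}
                         {"Boarded" ""}
                         {"Boarded" "Southampton"}
                         {"Boarded" "Cherbourg"}])
```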

In [19]:
(keys-with-missing [{:dset :first-class-data :data first-class-data}
                    {:dset :second-class-data :data second-class-data}
                    {:dset :third-class-data :data third-class-data}]
                   identity)

The missing values of Destination in the second-class-data can also be replaced by the most common value for it.

In [20]:
(replace-in-k-with [#'second-class-data] ["Destination"])

null

In [21]:
(keys-with-missing [{:dset :first-class-data :data first-class-data}
                    {:dset :second-class-data :data second-class-data}
                    {:dset :third-class-data :data third-class-data}]
                   identity)

**About Lifeboat and Body:** for rows where the lifeboat is known, we can add a new column (Survived) containing 1 (survived), or 0 (did not survive) if the body is known instead.
We would then drop these two columns.

In [22]:
(defn dissoc*
    [m [k & ks]]
    (if k
        (recur (dissoc m k) ks)
        m))

(defn survived?
    [m]
    (let [lifeboat (get m "Lifeboat")
          body (get m "Body")]
        (cond
            (and (empty? body) (not-empty lifeboat)) (assoc (dissoc* m ["Lifeboat" "Body"]) "Survived" "1")
            (and (not-empty body) (empty? lifeboat)) (assoc (dissoc* m ["Lifeboat" "Body"]) "Survived" "0")
            :else m)))

(defn iterate-and-change
    [data FN]
    (let [data-fn (deref data)]
        (data-fn (fn [dta] (alter-var-root data (fn [_] #(% (map FN dta))))))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/iterate-and-change
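To make the rule concrete, here is a self-contained sketch on made-up rows. with-survived is a hypothetical variant of survived? that relies on dissoc accepting several keys at once, so it does not need dissoc*:

```clojure
(defn with-survived
  "Derive \"Survived\" from \"Lifeboat\"/\"Body\" and drop both columns.
   Rows where neither or both are known are returned unchanged."
  [row]
  (let [lifeboat (get row "Lifeboat")
        body     (get row "Body")
        flag     (cond
                   (and (seq lifeboat) (empty? body)) "1"
                   (and (seq body) (empty? lifeboat)) "0")]
    (if flag
      (-> row
          (dissoc "Lifeboat" "Body")
          (assoc "Survived" flag))
      row)))

;; Hypothetical rows: one with a lifeboat, one with a body number, one with neither.
(map with-survived [{"Name" "A" "Lifeboat" "4" "Body" ""}
                    {"Name" "B" "Lifeboat" ""  "Body" "135"}
                    {"Name" "C" "Lifeboat" ""  "Body" ""}])
```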

In [23]:
(iterate-and-change #'second-class-data survived?)

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$iterate_and_change$fn__417$fn__418$fn__419@7a379dca

In [24]:
(keys-with-missing [{:dset :first-class-data :data first-class-data}
                    {:dset :second-class-data :data second-class-data}
                    {:dset :third-class-data :data third-class-data}]
                   identity)

Same for first-class-data.

In [25]:
(iterate-and-change #'first-class-data survived?)

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$iterate_and_change$fn__417$fn__418$fn__419@2467308f

In [27]:
(keys-with-missing [{:dset :first-class-data :data first-class-data}
                    {:dset :second-class-data :data second-class-data}
                    {:dset :third-class-data :data third-class-data}]
                   identity)

In [28]:
(require '[clojure.string :as cs])

null

We will now add a 'Sex' column to all rows in the first- and second-class data, inferred from the title in each name.

In [29]:
(def add-sex (fn [m] 
                 (let [nme (get m "Name")] 
                     (cond 
                         (.contains nme "Miss") (assoc m "Sex" "female") 
                         (.contains nme "Mrs.") (assoc m "Sex" "female")
                         :else (assoc m "Sex" "male")))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/add-sex

In [30]:
(defn alterate
    [datas FN]
    (run! #(iterate-and-change % FN) datas)) ;;function to iterate and change through various data at once

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/alterate

In [31]:
(alterate [#'first-class-data #'second-class-data]
          add-sex)

null

In [32]:
(first-class-data (fn [data] (take 5 data)))

In [33]:
(second-class-data (fn [data] (take 5 data)))

In [34]:
(let [pred (fn [data] ((comp count filter) #(not (get % "Survived")) data))]
    (mapv #(% pred) [first-class-data second-class-data]))

[89, 131]

**first-class-data** has **89** rows without a **Survived** column and **second-class-data** has **131**.

In [35]:
(defn read-age
    [age-s]
    (cond
        (.contains age-s "mo.") (if (.contains age-s "12") 1 0)
        :else (read-string age-s)))

(defn make-intervals
    [coll minify maxify interval-fn & do-afters]
    (let [help #(partial apply %)
          [MIN MAX] ((juxt (help min) (help max)) coll)
          intervals (interval-fn (minify MIN) (maxify MAX))
          taken (take-while #(not (nil? %)) intervals)]
        ((apply comp do-afters) (partition 2 1 taken))))

(def prepare-MAX (fn [MAX]
                     (cond
                         (< MAX 10) (+ MAX (- 10 MAX))
                         (int? (/ MAX 10)) MAX
                         :else (let [s (str MAX)
                                     f (first s)
                                     tnth (str f \0)]
                                   (+ 10 (read-string tnth))))))

(titanic-data
 (fn [data]
     (let [ages (map (comp read-age #(get % "Age")) data)
           decrement-last #(vector (first %) (dec (last %)))
           decrement* #(map decrement-last %&)]
         (make-intervals ages
                         int
                         prepare-MAX
                         (fn [MIN MAX] (iterate (fn [i] (when-not (> (+ 10 i) MAX) (+ 10 i))) MIN))
                         (fn iter [[s1 s2 & Ss]] (lazy-seq (when s1 (concat (decrement* s1 s2) (iter Ss)))))
                         #(do (def intervals) (alter-var-root #'intervals (fn [_] %)))))))
                          
                          

[[0, 9], [10, 19], [20, 29], [30, 39], [40, 49], [50, 59], [60, 69], [70, 79]]
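For comparison, decade intervals like these can also be sketched directly with range. decade-intervals is a hypothetical helper; note that, unlike the result above, it also emits the decade that contains the maximum age:

```clojure
(defn decade-intervals
  "Inclusive [low high] decade intervals covering every age in `ages`."
  [ages]
  (let [lo (* 10 (quot (long (apply min ages)) 10))
        hi (* 10 (quot (long (apply max ages)) 10))]
    (for [start (range lo (+ hi 10) 10)]
      [start (+ start 9)])))

;; Hypothetical ages.
(decade-intervals [3 25 41])
```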

Now that we have the intervals, we will build a map that has the intervals as keys and, as values, maps keyed by sex. get-in can then
retrieve the most common 'Survived' value for a particular age interval and sex.
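A tiny sketch of the lookup we are aiming for. The map values and the decade-key helper are hypothetical, made up only to show the shape of the data:

```clojure
;; Hypothetical map: interval -> sex -> most common "Survived" value.
(def survival-by-decade
  {[20 29] {"male" 0 "female" 1}
   [30 39] {"male" 0 "female" 1}})

(defn decade-key
  "Inclusive decade interval that an integer age falls into."
  [age]
  (let [lo (* 10 (quot age 10))]
    [lo (+ lo 9)]))

;; Most common outcome for a 25-year-old woman in the made-up map.
(get-in survival-by-decade [(decade-key 25) "female"])
```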

In [36]:
(defn find-most
    [coll]
    (let [by-identity (vals (group-by identity coll))
          by-count (group-by count by-identity)
          [m v] ((juxt identity #(apply max (keys %))) by-count)]
        (ffirst (get m v))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/find-most

In [37]:
(defn make-map
    [intervals sexes helper-fn ensure-fn cont]
    (let [fns (map helper-fn intervals)
          [sx1 sx2] sexes
          sx1 (group-by (fn [m] (some #(% m) fns)) sx1)
          sx2 (group-by (fn [m] (some #(% m) fns)) sx2)
          pad (repeat (count intervals) {"male" nil "female" nil})
          reader #(read-string (get % "Survived" 0))]
        (loop [[sx & sxs] [["male" sx1]["female" sx2]] m (zipmap (map #(apply list %) intervals) pad)]
            (if-not sx
                (cont (reduce ensure-fn {} m))
                (let [[k-sx sx] sx]
                    (recur sxs (reduce (fn [acc [k v]]
                                           (let [most (find-most (map reader v))]
                                               (update-in acc [(vec k) k-sx] (fn [_] most)))) m sx)))))))

(titanic-data
 (fn [data]
     (let [sexes (vals (group-by #(get % "Sex") data))]
         (make-map intervals
                   sexes
                   (fn [[l h :as all]]
                       #(let [age (read-age (get % "Age"))]
                            (and (>= age l) (<= age h) all)))
                   (fn [acc [k v]]
                       (let [m (get v "male")
                             f (get v "female")]
                           (cond
                               (not m) (assoc acc k (assoc v "male" "0"))
                               (not f) (assoc acc k (assoc v "female" "0"))
                               :else (assoc acc k v))))
                   #(do (def map-with-pads) (alter-var-root #'map-with-pads (fn [_] %)))))))

In [38]:
map-with-pads

Now we will add the survived column.

In [39]:
(defn between-Ns
    [number]
    (let [low (* 10 (quot (long number) 10))]
        (list low (+ low 9)))) ;; Maps an age to its interval key in map-with-pads, e.g. 6 becomes (0 9) and 25 becomes (20 29)

(defn assoc-survived
    [m]
    (let [age #(read-age (get m "Age"))
          sex #(get m "Sex")]
        (if (get m "Survived")
            m
            (dissoc* (assoc m "Survived" (get-in map-with-pads [(between-Ns (age)) (sex)]))
                     ["Lifeboat" "Body"]))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/assoc-survived

In [40]:
(alterate [#'first-class-data #'second-class-data]
          assoc-survived)

null

In [41]:
(let [pred (fn [data] ((comp count filter) #(not (get % "Survived")) data))]
    (mapv #(% pred) [first-class-data second-class-data]))

[0, 0]

Next, we need to add 'SibSp' and 'Parch' columns.

In [42]:
(defn assoc*
    "Like assoc but can add
     multiple keys and values at once"
    [m [k & ks] [v & vs]]
    (if k
        (assoc (assoc* m ks vs) k v)
        m))

(defn chain-split
    "Like clojure.string/split but will split
     a string as long as there is a regex available."
    [string [rex & regexes]]
    (if rex
        (recur (apply str (interpose " " (cs/split string rex))) regexes)
        (remove #{""} (cs/split string #" "))))

(defn extract-surname
    [NAME]
    (let [name? (chain-split NAME [#"," #"\."])
          f (first name?)]
        (if (= f "and")
            (last name?)
            f)))

(defn sort-by-dec
    [FN coll]
    (sort-by FN #(compare %2 %1) coll))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/sort-by-dec
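For names of the form "Surname, Title Given names", taking everything before the first comma is a simpler way to get the surname. A sketch with a hypothetical surname helper; unlike extract-surname it keeps multi-word surnames whole and does not treat a leading "and" specially:

```clojure
(require '[clojure.string :as cs])

(defn surname
  "Everything before the first comma, trimmed."
  [full-name]
  (cs/trim (first (cs/split full-name #","))))

;; Illustrative names in the same format as the datasets.
(map surname ["Allison, Mr. Hudson Creighton"
              "del Carlo, Mr. Sebastiano"])
```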

In [43]:
(defn assoc-family
    [data]
    (fn [m]
        (let [nme (get m "Name")
              surname (extract-surname nme)
              family (filter #(= (extract-surname (get % "Name")) surname) data)
              sorted-fam (sort-by-dec #(read-age (get % "Age")) family)
              [x1 x2 & rst] sorted-fam
              nme-x1 (get x1 "Name")
              nme-x2 (get x2 "Name")]
            (if rst
                (if (and (or (.contains nme-x1 "Mr.") (.contains nme-x1 "Dr."))
                         (.contains nme-x2 "Mrs.") (<= (read-age (get x1 "Age")) (+ 20 (read-age (get x2 "Age")))))
                    (concat (map #(assoc* % ["SibSp" "Parch"] ["1" (str (count rst))]) [x1 x2])
                            (map #(assoc* % ["SibSp" "Parch"] [(str (dec (count rst))) "2"]) rst))
                    (if (<= (read-age (get x1 "Age")) (+ 20 (read-age (get x2 "Age"))))
                        (map #(assoc* % ["SibSp" "Parch"] [(str (inc (count rst))) "0"]) (concat [x1 x2] rst))
                        (cons (assoc* x1 ["SibSp" "Parch"] ["0" (str (count rst))])
                              (map #(assoc* % ["SibSp" "Parch"] [(str (count rst)) "1"]) rst))))
                (if x2
                    (if (and (.contains nme-x1 "Mr.") (.contains nme-x2 "Mrs."))
                        (if (<= (read-age (get x1 "Age")) (+ 20 (read-age (get x2 "Age"))))
                            (map #(assoc* % ["SibSp" "Parch"] ["1" "0"]) [x1 x2])
                            (map #(assoc* % ["SibSp" "Parch"] ["0" "1"]) [x1 x2]))
                        (if (<= (read-age (get x1 "Age")) (+ 20 (read-age (get x2 "Age"))))
                            (map #(assoc* % ["SibSp" "Parch"] ["1" "0"]) [x1 x2])
                            (map #(assoc* % ["SibSp" "Parch"] ["0" "1"]) [x1 x2])))
                    (assoc* x1 ["SibSp" "Parch"] ["0" "0"]))))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/assoc-family

In [44]:
(defn alterate-data
    [[data & rst] FN]
    (when data
        (let [map-modifier ((deref data) #(FN %))]
            (alterate [data] map-modifier)                 ;; Useful when you want to alterate different data with the same function
            (recur rst FN))))                               ;;FN must be a function that returns another function to operate on each individual element

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/alterate-data

In [45]:
(alterate-data [#'first-class-data #'second-class-data]
               assoc-family)

null

In [50]:
(first-class-data
 (fn [data]
     (println (take 5 data))))

({Hometown St Louis, Missouri, US, Age 29, SibSp 0, Parch 0, Destination St Louis, Sex female, Survived 1, Boarded Southampton, Name Allen, Miss Elizabeth Walton} ({Hometown Montreal, Quebec, Canada, Age 30, SibSp 1, Parch 2, Destination Montreal, Quebec, Canada, Sex male, Survived 0, Boarded Southampton, Name Allison, Mr. Hudson Creighton} {Hometown Montreal, Quebec, Canada, Age 25, SibSp 1, Parch 2, Destination Montreal, Quebec, Canada, Sex female, Survived 1, Boarded Southampton, Name Allison, Mrs. Bessie Waldo (née Daniels)} {Hometown Montreal, Quebec, Canada, Age 2, SibSp 1, Parch 2, Destination Montreal, Quebec, Canada, Sex female, Survived 1, Boarded Southampton, Name Allison, Miss Helen Loraine} {Hometown Montreal, Quebec, Canada, Age 11 mo., SibSp 1, Parch 2, Destination Montreal, Quebec, Canada, Sex male, Survived 1, Boarded Southampton, Name Allison, Master Hudson Trevor}) {Hometown New York City, Age 19, SibSp 0, Parch 0, Destination Montreal, Quebec, Canada, Sex male, Surv

null

In [48]:
(second-class-data
 (fn [data]
     (take 5 data)))

[[{Hometown=Russia, Age=30, SibSp=1, Parch=0, Destination=New York, New York, US, Sex=male, Survived=0, Boarded=Cherbourg, Name=Abelson, Mr. Samuel}, {Hometown=Russia, Age=28, SibSp=1, Parch=0, Destination=New York, New York, US, Sex=female, Survived=1, Boarded=Cherbourg, Name=Abelson, Mrs. Anna (née Wizosky?)}], [{Hometown=Russia, Age=30, SibSp=1, Parch=0, Destination=New York, New York, US, Sex=male, Survived=0, Boarded=Cherbourg, Name=Abelson, Mr. Samuel}, {Hometown=Russia, Age=28, SibSp=1, Parch=0, Destination=New York, New York, US, Sex=female, Survived=1, Boarded=Cherbourg, Name=Abelson, Mrs. Anna (née Wizosky?)}], [{Hometown=Redruth, Cornwall, England, Age=30, SibSp=1, Parch=0, Destination=Houghton, Michigan, US, Sex=male, Survived=0, Boarded=Southampton, Name=Andrew, Mr. Frank Thomas}, {Hometown=San Ambrosio, Córdoba, Argentina, Age=17, SibSp=1, Parch=0, Destination=Trenton, New Jersey, US, Sex=male, Survived=0, Boarded=Southampton, Name=Andrew, Mr. Edgar Samuel}], [{Hometown=R
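The outputs above are nested: assoc-family returns the whole updated family for each row that has relatives, so mapping it over the data yields families (repeated once per member) mixed with single rows. One possible way to get back to a flat seq of rows is sketched below; flatten-families is a hypothetical helper, and because it relies on distinct it would also collapse two genuinely identical rows:

```clojure
(defn flatten-families
  "Turn a seq that mixes single row maps and seqs of row maps into a flat
   seq of row maps, dropping the duplicates produced per family member."
  [rows]
  (distinct (mapcat #(if (map? %) [%] %) rows)))

;; Hypothetical input in the same nested shape as the outputs above.
(flatten-families [{"Name" "A"}
                   [{"Name" "B"} {"Name" "C"}]
                   [{"Name" "B"} {"Name" "C"}]])
```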

In [51]:
(defn create-map
    [data field1 field2]
    (fn [f]
        (f (data (fn [dta] (into {} (map vector (field1 dta) (field2 dta))))))))

(defn await-fn
    [& args]
    (fn [FN]
        #(% (apply map FN args))))

(defn chain-fn-appl
    [& [fn1 & fns]]
    (if-let [f (first fns)]
        (recur (cons (fn1 f) (rest fns)))
        fn1))      

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/chain-fn-appl

###### Third-class-data

We are going to join the 'Hometown' and 'Home country' entries into a single 'Hometown' entry, then apply all the transformations that we applied to the other
data. After that, we only need to include the third-class data in the vector of datasets we pass when checking the replacements.

In [52]:
(alter-var-root #'third-class-data
                (fn [FN]
                    (let [data (FN identity)]
                        (chain-fn-appl
                         (return-fn (map #(assoc % "Hometown" (str (get % "Hometown") ", " (get % "Home country"))) data))
                         (fn [data] (return-fn (map #(dissoc % "Home country") data)))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@1cf9fed9

In [53]:
(third-class-data
 (fn [data]
     (take 10 data)))

Now the transformations

In [54]:
(defn replace-k-times
    [VAR]
    (fn [ks]
        (replace-in-k-with (vec (repeat (count ks) VAR)) ks)))

(do ((replace-k-times #'third-class-data) ["Destination" "Boarded" "Hometown"])
    (iterate-and-change #'third-class-data survived?))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$iterate_and_change$fn__417$fn__418$fn__419@74ec2fa4

In [55]:
(third-class-data #(take 10 %))

The 'age' key differs from the 'Age' key used in the other two datasets, so we are going to rename it.

In [56]:
(alter-var-root #'third-class-data
                (fn [FN]
                    (let [data (FN (fn [data]
                                       (map #(dissoc (assoc % "Age" (get % "age")) "age") data)))]
                        #(% data))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$eval735$fn__736$fn__741@280f25a4
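The same rename can be sketched with clojure.set/rename-keys, which leaves a map untouched when the old key is absent. rename-age is a hypothetical helper working on a plain seq of maps:

```clojure
(require '[clojure.set :as cset])

(defn rename-age
  "Rename the \"age\" key to \"Age\" in every row."
  [rows]
  (map #(cset/rename-keys % {"age" "Age"}) rows))

;; Hypothetical rows: the second one already uses "Age".
(rename-age [{"age" "40" "Name" "X"}
             {"Age" "39" "Name" "Y"}])
```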

In [57]:
(third-class-data (fn [data] (map #(get % "Age") data)))

[40, 39, 16, 14, 18, 16, 25, 20, 18, 30, 26, 40, 21, 10 mo., 26, 23, 19, 24, 25, 35, 15, 22, 33, 19, 39, 39, 11, 9, 6, 4, 2, 17, 38, 26, 20, 26, 25, 18, 24, 35, 40, 38, 13, 9, 5, 5, 3, 23, 45, 23, 30, 17, 23, 15, 20, 32, 33, 18, 40, 26, 31, 24, 5, 4, 9 mo., 45, 18, 26, 22, 18, 26, 22, 20, 21, 18, 26, 42, 32, 40, 20, 22, 20, 29, 22, 22, 35, 21, 22, 40, 9, 7, 19, 17, 24, 4, 2, 18, 38, 30, 21, 17, 17, 22, 21, 21, 28, 18, 24, 24, 47, 28, 24, 32, 28, 32, 29, 26, 18, 20, 18, 24, 24, 36, 31, 31, 35, 22, 61, 43, 35, 27, 19, 30, 16, 36, 9, 3, 59, 19, 44, 16, 17, 28, 45, 22, 19, 30, 29, 34, 28, 4 mo., 27, 25, 22, 24, 21, 17, 32, 34, 36, 36, 36, 16, 25, 32, 1, 2 mo., 25, 30, 26, 22, 19, 17, 42, 43, 21, 43, 18, 22, 31, 24, 33, 24, 19, 65, 23, 22, 18, 16, 45, 29, 15, 17, 47, 6, 37, 39, 38, 25, 34, 18, 22, 28, 42, 19, 20, 22, 48, 20, 18, 16, 7, 28, 29, 21, 21, –, 22, 17, 19, 33, 31, 9, 41, 42, 43, 16, 14, 13, 12, 10, 1, 40, 32, 32, 19, 37, 28, 19, 24, 28, 19, 28, 24, 19, 27, 18, 35, 41, 45, 26, 21, 

In [58]:
(third-class-data
 (fn [data]
     (map #(read-string (get % "Age")) data)))

[40, 39, 16, 14, 18, 16, 25, 20, 18, 30, 26, 40, 21, 10, 26, 23, 19, 24, 25, 35, 15, 22, 33, 19, 39, 39, 11, 9, 6, 4, 2, 17, 38, 26, 20, 26, 25, 18, 24, 35, 40, 38, 13, 9, 5, 5, 3, 23, 45, 23, 30, 17, 23, 15, 20, 32, 33, 18, 40, 26, 31, 24, 5, 4, 9, 45, 18, 26, 22, 18, 26, 22, 20, 21, 18, 26, 42, 32, 40, 20, 22, 20, 29, 22, 22, 35, 21, 22, 40, 9, 7, 19, 17, 24, 4, 2, 18, 38, 30, 21, 17, 17, 22, 21, 21, 28, 18, 24, 24, 47, 28, 24, 32, 28, 32, 29, 26, 18, 20, 18, 24, 24, 36, 31, 31, 35, 22, 61, 43, 35, 27, 19, 30, 16, 36, 9, 3, 59, 19, 44, 16, 17, 28, 45, 22, 19, 30, 29, 34, 28, 4, 27, 25, 22, 24, 21, 17, 32, 34, 36, 36, 36, 16, 25, 32, 1, 2, 25, 30, 26, 22, 19, 17, 42, 43, 21, 43, 18, 22, 31, 24, 33, 24, 19, 65, 23, 22, 18, 16, 45, 29, 15, 17, 47, 6, 37, 39, 38, 25, 34, 18, 22, 28, 42, 19, 20, 22, 48, 20, 18, 16, 7, 28, 29, 21, 21, {name=–, namespace=null}, 22, 17, 19, 33, 31, 9, 41, 42, 43, 16, 14, 13, 12, 10, 1, 40, 32, 32, 19, 37, 28, 19, 24, 28, 19, 28, 24, 19, 27, 18, 35, 41, 45, 2

The output contains two symbols where numbers should be, so we retrieve their indexes to find out which rows they belong to.

In [59]:
(third-class-data
 (fn [data]
     (let [ages (mapv #(read-string (get % "Age")) data)]
         (reduce (fn [idxs i] (if (symbol? (ages i)) (conj idxs i) idxs))
                 []
                 (range (count ages))))))

[216, 339]
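The same lookup can also be sketched with `keep-indexed`, which avoids walking an index range by hand. This is only an illustration: `sample-ages` and `symbol-indexes` are hypothetical names, and the ages are made up rather than taken from the data set.

```clojure
;; Hypothetical sample of raw "Age" strings; "–" stands in for a missing age.
(def sample-ages ["40" "39" "–" "16" "–"])

(defn symbol-indexes
  "Returns the indexes of the entries that read-string parses as a symbol."
  [ages]
  (vec (keep-indexed (fn [i age]
                       (when (symbol? (read-string age)) i))
                     ages)))

(symbol-indexes sample-ages)
```

`keep-indexed` drops the `nil` results, so only the positions of the unparseable ages remain.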

We retrieve the corresponding rows.

In [60]:
(third-class-data
 (fn [data]
     (map #(nth data %) [216 339])))

Sort the ages of all passengers who boarded at Cherbourg, in descending order.

In [61]:
(third-class-data
 (fn [data]
     (let [cherbourg (filter #(.contains (get % "Boarded") "Cherbourg") data)]
         (sort-by-dec identity
                      (remove symbol? (map #(read-string (get % "Age")) cherbourg))))))

[45, 45, 45, 40, 40, 40, 38, 37, 35, 35, 34, 33, 33, 33, 33, 31, 30, 30, 30, 30, 30, 29, 29, 28, 28, 27, 27, 27, 27, 26, 26, 26, 25, 25, 25, 25, 25, 25, 24, 24, 24, 22, 22, 22, 22, 22, 22, 21, 20, 20, 20, 20, 20, 20, 20, 20, 20, 19, 19, 19, 19, 18, 18, 18, 18, 18, 18, 17, 17, 17, 17, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 12, 11, 10, 9, 9, 9, 8, 7, 7, 5, 5, 4, 4, 4, 2, 1]

In [62]:
(defn rand-long
    ([n]
     (long (rand n)))
    ([start end]
     (let [number (rand-long end)]
         (if (< number start)
             (recur start end)
             number))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/rand-long

In [63]:
(rand-long 20 68)

65

`rand-long` will be used to pick a random long between the lowest and highest ages of the passengers who boarded at Cherbourg. That value replaces the ages that could not be parsed.
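As an aside, a random long in a range can also be drawn directly, without retrying until the value is large enough. The function below is a minimal sketch under that assumption; `rand-long-between` is a hypothetical name and is not used elsewhere in this notebook.

```clojure
(defn rand-long-between
  "Hypothetical alternative to rand-long: returns a random long in
   [start, end), i.e. start is inclusive and end is exclusive."
  [start end]
  ;; (rand n) returns a double in [0, n); truncating it and shifting by
  ;; start gives a long in the requested range in a single draw.
  (+ start (long (rand (- end start)))))

(rand-long-between 20 68)
```

Like `rand-long`, this never returns the upper bound itself.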

In [64]:
(alter-var-root #'third-class-data
                (fn [FN]
                    (let [data #(FN (fn [data] (map % data)))]
                        (return-fn
                         (data (fn [m]
                                   (let [age (read-string (get m "Age"))]
                                       (if (symbol? age)
                                           (assoc m "Age" (str (rand-long 0 45)))
                                           m))))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@5b1e390e

In [65]:
(third-class-data
 (fn [data]
     (map #(nth data %) [216 339])))

We add the **Sex** column.

In [66]:
(alterate [#'third-class-data]
          add-sex)

null

We add the **Survived** column

In [67]:
(alterate [#'third-class-data]
          assoc-survived)

null

In [68]:
(third-class-data
 (fn [data]
     (take 10 data)))

Add the **SibSp** and **Parch** columns.

In [69]:
(alterate-data [#'third-class-data]
               assoc-family)

null

In [71]:
(third-class-data
 (fn [data]
     (println (take 10 data))))

({Hometown Cincinnati, Ohio, US, Age 40, SibSp 0, Parch 0, Destination Cincinnati, Ohio, US, Sex male, Survived 0, Boarded Southampton, Name Abbing, Mr. Anthony} ({Hometown East Providence, Rhode Island, US, Age 39, SibSp 0, Parch 1, Destination East Providence, Rhode Island, US, Sex female, Survived 1, Boarded Southampton, Name Abbott, Mrs. Rhoda Mary (née Hunt)} {Hometown East Providence, Rhode Island, US, Age 14, SibSp 1, Parch 1, Destination East Providence, Rhode Island, US, Sex male, Survived 0, Boarded Southampton, Name Abbott, Mr. Eugene Joseph}) ({Hometown East Providence, Rhode Island, US, Age 39, SibSp 0, Parch 1, Destination East Providence, Rhode Island, US, Sex female, Survived 1, Boarded Southampton, Name Abbott, Mrs. Rhoda Mary (née Hunt)} {Hometown East Providence, Rhode Island, US, Age 14, SibSp 1, Parch 1, Destination East Providence, Rhode Island, US, Sex male, Survived 0, Boarded Southampton, Name Abbott, Mr. Eugene Joseph}) ({Hometown East Providence, Rhode Island

null

In [72]:
(titanic-data #(take 1 %))

**first-class-data** = **Pclass 1**

**second-class-data** = **Pclass 2**

**third-class-data** = **Pclass 3**

In [73]:
(defn only-maps
    [[m & ms]]
    (lazy-seq
     (when m
         (if (map? m)
             (cons m (only-maps ms))
             (concat (only-maps m) (only-maps ms))))))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/only-maps

In [74]:
(alter-var-root #'first-class-data #(return-fn (% only-maps)))

(alter-var-root #'second-class-data #(return-fn (% only-maps))) ;; We need to do this since right now we have a lazy-seq of lazy-seq of maps
                                                                 ; where each lazy-seq is a family or a single passenger
(alter-var-root #'third-class-data #(return-fn (% only-maps)))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@1f6a5cb1
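To see what the flattening does, here is a small sketch on made-up rows. The definition of `only-maps` is repeated from the cell above only so that the example runs on its own; `nested` is a hypothetical sample, not part of the data.

```clojure
;; Same definition as above, repeated so the sketch is self-contained.
(defn only-maps
  [[m & ms]]
  (lazy-seq
   (when m
     (if (map? m)
       (cons m (only-maps ms))
       (concat (only-maps m) (only-maps ms))))))

;; Made-up input: a single passenger followed by a family of two.
(def nested
  (list {"Name" "A"}
        (list {"Name" "B"} {"Name" "C"})))

(only-maps nested)
```

The nested family is spliced into the outer sequence, leaving one flat sequence of maps.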

In [75]:
(do
    (alter-var-root #'first-class-data (fn [FN] (return-fn (FN (fn [dta] (map #(assoc % "Pclass" "1") dta))))))
    
    (alter-var-root #'second-class-data (fn [FN] (return-fn (FN (fn [dta] (map #(assoc % "Pclass" "2") dta))))))
    
    (alter-var-root #'third-class-data (fn [FN] (return-fn (FN (fn [dta] (map #(assoc % "Pclass" "3") dta)))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@e7dd9cb

In [76]:
(first-class-data #(take 3 %))

In [77]:
(second-class-data #(take 3 %))

In [78]:
(third-class-data #(take 3 %))

In [79]:
(do (alter-var-root #'first-class-data #(return-fn (% distinct)))
    (alter-var-root #'second-class-data #(return-fn (% distinct))) ;; Remove repeated rows
    (alter-var-root #'third-class-data #(return-fn (% distinct))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@6200c25c

In [80]:
(second-class-data #(take 3 %))

In [81]:
(alter-var-root #'titanic-data (fn [FN] (return-fn (FN (fn [dta] (map #(dissoc* % ["Fare" "Ticket"]) dta))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@676b20b7

In [82]:
(titanic-data #(take 5 %))

Rename the **Boarded** key to **Embarked**.

In [83]:
(def Boarded->Embarked (fn [m] (zipmap (map #(if (= % "Boarded") "Embarked" %) (keys m)) (vals m))))

(alterate [#'first-class-data #'second-class-data #'third-class-data]
          Boarded->Embarked)

null
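For a single map, the same rename could also be written with `clojure.set/rename-keys` from the standard library. This is a sketch of an alternative, not what the cell above does; `row` is a made-up sample.

```clojure
(require '[clojure.set :as set])

;; Sample row for illustration only.
(def row {"Name" "Abbing, Mr. Anthony" "Boarded" "Southampton"})

;; Keys listed in the second map are renamed; all other keys are kept as-is.
(set/rename-keys row {"Boarded" "Embarked"})
```

A map without a `"Boarded"` key passes through unchanged.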

In [84]:
(def remove-hometown-dest (fn [m] (dissoc* m ["Hometown" "Destination"])))

(alterate [#'first-class-data  #'second-class-data #'third-class-data]
          remove-hometown-dest)

null

titanic-data has no **Hometown** or **Destination** column, which is why we removed them from the class data above.

In [85]:
(def first-s #(str (first %)))

(def catenate* (fn [[d1 d2 d3]] (lazy-cat (d1 identity) (d2 identity) (d3 identity))))

(def datas [first-class-data second-class-data third-class-data])

(defn full-port-name
    [ports]
    (fn [port]
        (some #(and (= (str port) (first-s %)) %) ports)))

(defn port-name
    [TITANIC-DATA do-with-data]
    (let [ports (distinct (map #(get % "Embarked") (catenate* datas)))
          return-port (full-port-name ports)]
        ((do-with-data TITANIC-DATA) return-port)))

#'beaker_clojure_shell_87f17984-683a-40e1-9c29-ed7da36932d5/port-name

We are going to replace the single capital letters that denote the port of embarkation in titanic-data with the full port names. The full names are taken from **first-class-data**, **second-class-data**, and **third-class-data**.
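The letter-to-name lookup can be sketched in isolation as follows. The port list here is an assumption made for the example (in the notebook it is collected from the **Embarked** values of the three class data sets), and `expand-port` is a hypothetical helper name.

```clojure
;; Assumed port names, for illustration only.
(def ports ["Southampton" "Cherbourg" "Queenstown"])

(defn expand-port
  "Returns the first port whose initial letter equals letter, or nil."
  [ports letter]
  (some #(when (= (str letter) (str (first %))) %) ports))

(map #(expand-port ports %) ["S" "C" "Q" ""])
```

An empty or unknown letter yields `nil`, so rows with a missing port stay without a usable value.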

In [86]:
(port-name #'titanic-data
           (fn [TITANIC-DATA]
               (let [data ((deref TITANIC-DATA) identity)]
                   (fn [f]
                       (alter-var-root TITANIC-DATA
                                       (fn [_]
                                           (return-fn (map #(let [p (f (get % "Embarked"))] (assoc % "Embarked" p)) data))))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@2194854c

In [87]:
(def max-ns-needed (fn [d1 d2 d3] (+ (d1 count) (d2 count) (d3 count))))

(alter-var-root #'titanic-data
                (fn [FN]
                    (let [size (FN count)
                          datas-size (apply max-ns-needed datas)
                          n-seq (range size datas-size)
                          catted (catenate* datas)]
                        (return-fn (FN (fn [data]
                                           (lazy-cat data
                                                     (map #(assoc (Boarded->Embarked %1) "PassengerId" %2)
                                                          catted
                                                          n-seq))))))))

beaker_clojure_shell_87f17984_683a_40e1_9c29_ed7da36932d5$return_fn$fn__163@7dea4a4f

We now have the original titanic-data plus all of the first-class, second-class and third-class data. We'll store this as **titanic-enhanced-data** and then proceed to the Exploratory Data Analysis notebook.

In [117]:
(defn keep-if-name-already
    "Returns true when some row in coll of the same sex has the same name as row,
     ignoring titles and the order of the characters; otherwise nil."
    [row coll]
    (let [nme (get row "Name")
          male? (= "male" (get row "Sex"))
          ;; Titles to strip. The "." is unescaped, so it matches any character.
          titles (cond
                     male? [#"Mr." #"Mr" #"Master." #"Master"]
                     (.contains nme "Mrs") [#"Mrs." #"Mrs"]
                     :else [#"Miss." #"Miss"])
          ;; Strip the titles and sort the remaining characters, so that two
          ;; names compare equal regardless of word order.
          normalize (fn [m] (sort (reduce #(cs/replace %1 %2 "") (get m "Name") titles)))
          target (normalize row)]
        (some #(= target (normalize %))
              (filter #(= (get % "Sex") (if male? "male" "female")) coll))))

(keep-if-name-already {"Name" "Allison, Master. Hudson Trevor" "Sex" "male"} (titanic-data identity))

true

In [124]:
(defn store-seq
    [nme sq]
    ;; Concatenate the printed form of every element and wrap the result in
    ;; parentheses, so the file holds a single list of rows.
    (spit nme (format "(%s)" (apply str sq))))

(titanic-data
 (fn [data]
     (let [!-repeated-passengers (reduce (fn [acc idx]
                                             (let [row (nth data idx)
                                                   rm (nthrest data (inc idx))]
                                                 (if (keep-if-name-already row rm)
                                                     acc
                                                     (lazy-cat acc [row])))) '() (range (count data)))]
         (store-seq "titanic-enhanced-data.txt" !-repeated-passengers))))

null

Now some EDA