# Data Collection

Clojure is a dialect of Lisp; it is dynamic and runs on the Java platform. This is the Data Collection notebook, in which we will use Clojure to collect data.

In [1]:
%classpath add jar ./data.csv-1.1.0.jar

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

null

In [2]:
(defn read-contents
    [path]
    (csv/read-csv (io/reader path)))

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data)
            repeat)
       (rest csv-data)))

(defn store-seq
    [file]
    (let [nam (apply str (take-while #(not= % \.) file))]
        (spit (str nam ".txt")
              (apply list (csv-data->maps (read-contents file))))
        "done"))

#'beaker_clojure_shell_5741a78b-609f-468e-b823-64137e74c63c/store-seq
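
To see what csv-data->maps does without needing the CSV file, here is a minimal sketch on a hand-written header row and two data rows. The names sample-rows and rows->maps are invented for this illustration, and the sample values are hypothetical; the body of rows->maps is the same zipmap/repeat idea as above.

```clojure
;; Hypothetical sample: a header row followed by two data rows,
;; in the shape read-csv returns (vectors of strings).
(def sample-rows
  [["PassengerId" "Survived" "Sex"]
   ["1" "0" "male"]
   ["2" "1" "female"]])

(defn rows->maps
  "Pairs the header row with every remaining row."
  [rows]
  (map zipmap (repeat (first rows)) (rest rows)))

(rows->maps sample-rows)
;; => ({"PassengerId" "1", "Survived" "0", "Sex" "male"}
;;     {"PassengerId" "2", "Survived" "1", "Sex" "female"})
```

Every value stays a string, including the numeric-looking ones; converting them is left for a later step.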

First, let's read the titanic.csv dataset, downloaded from https://github.com/datasciencedojo/datasets/blob/master/titanic.csv, using read-contents.

In [3]:
(read-contents "titanic.csv")

[[PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked], [1, 0, 3, Braund, Mr. Owen Harris, male, 22, 1, 0, A/5 21171, 7.25, , S], [2, 1, 1, Cumings, Mrs. John Bradley (Florence Briggs Thayer), female, 38, 1, 0, PC 17599, 71.2833, C85, C], [3, 1, 3, Heikkinen, Miss. Laina, female, 26, 0, 0, STON/O2. 3101282, 7.925, , S], [4, 1, 1, Futrelle, Mrs. Jacques Heath (Lily May Peel), female, 35, 1, 0, 113803, 53.1, C123, S], [5, 0, 3, Allen, Mr. William Henry, male, 35, 0, 0, 373450, 8.05, , S], [6, 0, 3, Moran, Mr. James, male, , 0, 0, 330877, 8.4583, , Q], [7, 0, 1, McCarthy, Mr. Timothy J, male, 54, 0, 0, 17463, 51.8625, E46, S], [8, 0, 3, Palsson, Master. Gosta Leonard, male, 2, 3, 1, 349909, 21.075, , S], [9, 1, 3, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), female, 27, 0, 2, 347742, 11.1333, , S], [10, 1, 2, Nasser, Mrs. Nicholas (Adele Achem), female, 14, 1, 0, 237736, 30.0708, , C], [11, 1, 3, Sandstrom, Miss. Marguerite Rut, female, 4, 1, 1

read-contents produces a lazy seq, which means that items are produced on demand. Let's try to grab only the first 10 lines with clojure.core/take.

In [4]:
(take 10 (read-contents "titanic.csv"))

[[PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked], [1, 0, 3, Braund, Mr. Owen Harris, male, 22, 1, 0, A/5 21171, 7.25, , S], [2, 1, 1, Cumings, Mrs. John Bradley (Florence Briggs Thayer), female, 38, 1, 0, PC 17599, 71.2833, C85, C], [3, 1, 3, Heikkinen, Miss. Laina, female, 26, 0, 0, STON/O2. 3101282, 7.925, , S], [4, 1, 1, Futrelle, Mrs. Jacques Heath (Lily May Peel), female, 35, 1, 0, 113803, 53.1, C123, S], [5, 0, 3, Allen, Mr. William Henry, male, 35, 0, 0, 373450, 8.05, , S], [6, 0, 3, Moran, Mr. James, male, , 0, 0, 330877, 8.4583, , Q], [7, 0, 1, McCarthy, Mr. Timothy J, male, 54, 0, 0, 17463, 51.8625, E46, S], [8, 0, 3, Palsson, Master. Gosta Leonard, male, 2, 3, 1, 349909, 21.075, , S], [9, 1, 3, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), female, 27, 0, 2, 347742, 11.1333, , S]]
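
One way to observe "produced on demand" is to count how many elements actually get computed. The following is a small sketch, not part of the original notebook: numbered-lines is a hypothetical stand-in for a slow source, and computed is a counter added only for the demonstration.

```clojure
;; Counts how many elements have actually been computed.
(def computed (atom 0))

(defn numbered-lines
  "Hypothetical stand-in for a slow source: an infinite lazy seq of
   line labels that bumps the counter each time an element is realized."
  []
  (map (fn [i] (swap! computed inc) (str "line " i))
       (iterate inc 1)))

;; doall forces the three elements that take asks for, and no more.
(def first-three (doall (take 3 (numbered-lines))))

first-three ;; => ("line 1" "line 2" "line 3")
@computed   ;; => 3, although the source is infinite
```

The same property has a practical consequence for read-contents: rows are read from the reader only when they are consumed, so the reader has to remain open until then.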

In [5]:
; A lazy seq could be implemented like this


(defn cons
    [head tail]
    (try (clojure.core/cons head tail)
        (catch Exception e
            (tail head))))

(defn first
    [coll]
    (try (clojure.core/first coll)
        (catch Exception e
            (coll true))))

(defn rest
    [coll]
    (try (clojure.core/rest coll)
        (catch Exception e
            (coll false)))) 

; cons, first and rest are redefined above so that, when the clojure.core
; versions fail, they fall back to calling our lazy seq fn

(defmacro my-lazy-seq
    ([body]
     `(~(fn constructor [head tail]
          (fn inner [in]
              (cond
                  (= true in) (if head head (when tail (if (ifn? tail) (if (tail) (first (tail)) nil) tail)))
                  (= false in) (if head (when tail (if (ifn? tail) (if (tail) (tail) nil) tail)) (rest (tail)))
                  :else (constructor in #(constructor head tail))))) nil (fn ~'thunk [] ~body))))
        

(defn my-r
    "An implementation of range,
     but using my-lazy-seq instead."
    ([n]
     (my-lazy-seq
      (cons n (my-r (inc n)))))
    ([start end]
     (my-lazy-seq
      (if (= start end)
          []
          ;; cons start onto the lazily built rest of the range
          (cons start (my-r (inc start) end))))))

#_(loop [coll (my-r 0) t 0 acc []]
    (if (= t 10)
        acc
        (recur (rest coll) (inc t) (conj acc (first coll)))))

(defn my-take
    "An implementation of take"
    [n coll]
    (my-lazy-seq
     (when-not (zero? n)
         (let [f (first coll)]
             (cons f (my-take (dec n) (rest coll)))))))

(defn realize
    [lzy]
    (loop [x lzy acc []]
        (let [f (first x)]
            (if f
                (recur (rest x) (conj acc f))
                (seq acc)))))



#'beaker_clojure_shell_5741a78b-609f-468e-b823-64137e74c63c/realize
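
The core idea in the cell above is that a lazy sequence pairs a known head with a tail whose computation is deferred. A shorter way to sketch that idea, assuming a plain map holding a delay instead of the closure-based constructor used above (lazy-cell, ints-from and cell-take are names invented here, not part of the notebook):

```clojure
(defn lazy-cell
  "A head plus a tail that is only computed when it is forced."
  [head tail-fn]
  {:head head :tail (delay (tail-fn))})

(defn ints-from
  "Infinite sequence n, n+1, n+2, ... built from lazy cells."
  [n]
  (lazy-cell n #(ints-from (inc n))))

(defn cell-take
  "Eagerly collects the first n heads into a vector,
   forcing a tail only when another element is still needed."
  [n cell]
  (loop [n n, cell cell, acc []]
    (if (or (zero? n) (nil? cell))
      acc
      (recur (dec n)
             (when (> n 1) @(:tail cell))
             (conj acc (:head cell))))))

(cell-take 5 (ints-from 0)) ;; => [0 1 2 3 4]
```

delay also caches its result, so a tail that has been forced once is not recomputed.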

In [6]:
(my-take 10 (read-contents "titanic.csv"))

beaker_clojure_shell_5741a78b_609f_468e_b823_64137e74c63c$my_lazy_seq$constructor__258$inner__259@65679473

Unfortunately, my-lazy-seq is still pretty basic: if a recursive process ended in an integer instead of a sequence, it would not
throw an exception, as Clojure's lazy-seq does. You also need to pass the result to realize in order to display it.
Anyway:

In [7]:
(realize (my-take 10 (read-contents "titanic.csv")))

[[PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked], [1, 0, 3, Braund, Mr. Owen Harris, male, 22, 1, 0, A/5 21171, 7.25, , S], [2, 1, 1, Cumings, Mrs. John Bradley (Florence Briggs Thayer), female, 38, 1, 0, PC 17599, 71.2833, C85, C], [3, 1, 3, Heikkinen, Miss. Laina, female, 26, 0, 0, STON/O2. 3101282, 7.925, , S], [4, 1, 1, Futrelle, Mrs. Jacques Heath (Lily May Peel), female, 35, 1, 0, 113803, 53.1, C123, S], [5, 0, 3, Allen, Mr. William Henry, male, 35, 0, 0, 373450, 8.05, , S], [6, 0, 3, Moran, Mr. James, male, , 0, 0, 330877, 8.4583, , Q], [7, 0, 1, McCarthy, Mr. Timothy J, male, 54, 0, 0, 17463, 51.8625, E46, S], [8, 0, 3, Palsson, Master. Gosta Leonard, male, 2, 3, 1, 349909, 21.075, , S], [9, 1, 3, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), female, 27, 0, 2, 347742, 11.1333, , S]]

Meaning that the two implementations agree:

In [8]:
(=
 (seq (take 10 (read-contents "titanic.csv"))) ;; We use seq because the output of take is a LazySeq
 
 (realize (my-take 10 (read-contents "titanic.csv")))) ;; The output is a seq

true
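
The difference mentioned earlier, that Clojure's own lazy-seq rejects a body that is not a sequence, can be checked directly. This is a small sketch; realization-error is a helper name made up for the check.

```clojure
(defn realization-error
  "Returns the class of the exception thrown while realizing coll,
   or nil when realization succeeds."
  [coll]
  (try
    (dorun coll)
    nil
    (catch Exception e
      (class e))))

(realization-error (lazy-seq [1 2 3])) ;; => nil
(realization-error (lazy-seq 1))       ;; => java.lang.IllegalArgumentException
```

The exception only appears when the sequence is realized, not when (lazy-seq 1) is created, which is itself a consequence of laziness.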

Let's group it as a seq of maps.

In [9]:
(println (csv-data->maps (read-contents "titanic.csv")))

({Age 22, Fare 7.25, PassengerId 1, SibSp 1, Parch 0, Sex male, Survived 0, Ticket A/5 21171, Embarked S, Cabin , Pclass 3, Name Braund, Mr. Owen Harris} {Age 38, Fare 71.2833, PassengerId 2, SibSp 1, Parch 0, Sex female, Survived 1, Ticket PC 17599, Embarked C, Cabin C85, Pclass 1, Name Cumings, Mrs. John Bradley (Florence Briggs Thayer)} {Age 26, Fare 7.925, PassengerId 3, SibSp 0, Parch 0, Sex female, Survived 1, Ticket STON/O2. 3101282, Embarked S, Cabin , Pclass 3, Name Heikkinen, Miss. Laina} {Age 35, Fare 53.1, PassengerId 4, SibSp 1, Parch 0, Sex female, Survived 1, Ticket 113803, Embarked S, Cabin C123, Pclass 1, Name Futrelle, Mrs. Jacques Heath (Lily May Peel)} {Age 35, Fare 8.05, PassengerId 5, SibSp 0, Parch 0, Sex male, Survived 0, Ticket 373450, Embarked S, Cabin , Pclass 3, Name Allen, Mr. William Henry} {Age , Fare 8.4583, PassengerId 6, SibSp 0, Parch 0, Sex male, Survived 0, Ticket 330877, Embarked Q, Cabin , Pclass 3, Name Moran, Mr. James} {Age 54, Fare 51.8625, Pa

null

Now we will store it.

In [10]:
; Note: store-seq already takes care of reading the contents
;       and grouping them as a seq of maps. It then writes a
;       .txt file with the seq as a string, named after the
;       dataset (the part of the file name before the first dot)

(store-seq "titanic.csv")

done
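
Since the stored text is just printed Clojure data, it can be read back later. The sketch below shows a round trip under a few assumptions: it uses pr-str (which always prints strings with their quotes) rather than store-seq itself, writes to a temporary file, and uses hypothetical rows; base-name, sample-maps, out-file and restored are names invented here.

```clojure
(require '[clojure.edn :as edn])

(defn base-name
  "Everything before the first dot, computed the same way store-seq does."
  [file]
  (apply str (take-while #(not= % \.) file)))

(def sample-maps
  ;; Hypothetical rows, already grouped as maps.
  [{"Name" "Ann" "Age" "30"}
   {"Name" "Bob" "Age" ""}])

(def out-file
  (java.io.File/createTempFile (base-name "sample.csv") ".txt"))

(spit out-file (pr-str sample-maps))
(def restored (edn/read-string (slurp out-file)))

(= sample-maps restored) ;; => true
```

One thing to keep in mind with this way of deriving the name: (base-name "./titanic.csv") returns an empty string, because the path itself starts with a dot.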

# Web Scraping with Clojure

Clojure is designed to interoperate with its host language, so you can use a Java library directly from Clojure. I picked jsoup for web scraping.

In [11]:
%classpath add jar ./jsoup-1.18.1.jar

(import 'org.jsoup.Jsoup)

class org.jsoup.Jsoup
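
For readers new to interop, the three call forms used in the rest of this section can be tried on classes that ship with the JDK, so no extra jar is needed. A minimal sketch (the var names are only for illustration):

```clojure
;; Static method: ClassName/method
(def parsed (Integer/parseInt "42"))

;; Instance method: (.method object args)
(def shouted (.toUpperCase "titanic"))

;; Constructor: (ClassName. args)
(def sb (StringBuilder. "RMS "))
(.append sb "Titanic")

parsed   ;; => 42
shouted  ;; => "TITANIC"
(str sb) ;; => "RMS Titanic"
```

(Jsoup/connect url) in the next cell is a static call, and (.get ...) is an instance call on the connection it returns.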

Parse the HTML:

In [12]:
(defn parse-html
    [url]
    (.get (Jsoup/connect url)))

(def parsed-html (parse-html "https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#Passenger_list"))

#'beaker_clojure_shell_5741a78b-609f-468e-b823-64137e74c63c/parsed-html

In [13]:
parsed-html

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
 <head> 
  <meta charset="UTF-8" /> 
  <title>Passengers of the Titanic - Wikipedia</title> 
  <script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width

The parsed HTML is an instance of the org.jsoup.nodes.Document class.

In [14]:
(class parsed-html)

class org.jsoup.nodes.Document

Let's see its methods:

In [15]:
(.getMethods (.getClass parsed-html))

[public java.lang.Object org.jsoup.nodes.Document.clone() throws java.lang.CloneNotSupportedException, public org.jsoup.nodes.Document org.jsoup.nodes.Document.clone(), public org.jsoup.nodes.Element org.jsoup.nodes.Document.clone(), public org.jsoup.nodes.Node org.jsoup.nodes.Document.clone(), public org.jsoup.nodes.Element org.jsoup.nodes.Document.head(), public org.jsoup.nodes.Element org.jsoup.nodes.Document.text(java.lang.String), public org.jsoup.nodes.Element org.jsoup.nodes.Document.body(), public java.lang.String org.jsoup.nodes.Document.title(), public void org.jsoup.nodes.Document.title(java.lang.String), public java.lang.String org.jsoup.nodes.Document.outerHtml(), public java.lang.String org.jsoup.nodes.Document.nodeName(), public org.jsoup.nodes.Document org.jsoup.nodes.Document.normalise(), public org.jsoup.nodes.Document$QuirksMode org.jsoup.nodes.Document.quirksMode(), public org.jsoup.nodes.Document org.jsoup.nodes.Document.quirksMode(org.jsoup.nodes.Document$QuirksMo
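
The raw array of Method objects is hard to scan. A possible way to make it more readable, sketched here on java.lang.String so that it runs without jsoup (method-names is a helper invented for this example), is to keep only the distinct method names:

```clojure
(defn method-names
  "Sorted, de-duplicated names of the public methods of class c."
  [^Class c]
  (->> (.getMethods c)
       (map #(.getName ^java.lang.reflect.Method %))
       distinct
       sort))

(take 5 (method-names String))
```

The same helper could be applied to (class parsed-html) to get a shorter overview of what a Document offers.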

In [16]:
(class (.getElementsByTag parsed-html "tbody"))

class org.jsoup.select.Elements

We want the three Wikipedia tables found under:

*  https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#First_class_2

*  https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#Second_class_2

*  https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#Third_class_2

In [17]:
(defn select-only-tables
    "Gets all elements with a tag of 'tbody' from parsed
     (an org.jsoup.select.Elements) and returns only
     those that contain no 'img'. keep applies the fn
     to each tbody and leaves out the nil results."
    [parsed]
    (let [tbodies (.getElementsByTag parsed "tbody")]
        (keep (fn [tbody]
                  (when (empty? (.select tbody "img"))
                      tbody)) 
              tbodies)))

(defn select-table-rows
    "Returns a lazy sequence of selecting
     the 'tr' elements from tables"
    [tables]
    (map #(.select % "tr") tables))

(defn select-table-trs
    [coll]
    (keep (fn [tr] (when (empty? (.select tr "ul")) tr))
          coll))

(defn partition-by-headers
    [coll]
    (partition-by #(when (not-empty (.select % "th")) %)
                  coll))

(defn read-number
    "First tries to parse an integer from
     the resulting string from concatenating
     a lazy-seq of all chars that are digits
     from string. Catches the exception thrown
     and tries to do the same after dropping
     leading zeroes."
    [string]
    (let [read-n #(Integer/parseInt (apply str (re-seq #"\d" %)))
          drop-leading-zeroes #(read-n (apply str (drop-while (partial = \0) %)))]
        (try (read-n string)
            (catch Exception _
                (drop-leading-zeroes string)))))

(defn extract-rowspan
    "Parses the number in rowspan=\"XXX\"."
    [string]
    ;; Capture the digits of the rowspan attribute,
    ;; wherever it appears in the tag.
    (read-number (second (re-find #"rowspan=\"(\d+)\"" string))))


(defn insert
    [sq index value]
    (let [[head tail] (split-at (dec index) sq)]
        (lazy-cat head [value] tail)))

(defn textify
    "Extracts the '>XXX</' text from value."
    [value]
    (if (string? value)
        (let [f (rest (drop-while #(not= % \>) value))
              s (take-while #(not= % \<) f)]
            (apply str s))
        (.text value)))


(defn prep
    "Iterates through tds. When a td
     has a rowspan greater than 1,
     inserts a copy of it (same text,
     rowspan decremented) at the same
     position in the first row of
     remaining, so that the value
     carries over to the next row.
     Returns remaining once all idxs
     have been iterated over."
    [tds remaining]
    (let [idxs (range (count tds))]
        (loop [[i & is] idxs remaining remaining]
            (if (not i)
                remaining
                (let [value (nth tds i)]
                    (if (.contains (str value) "rowspan")
                        (let [extracted-rowspan (extract-rowspan (str value))]
                            (if (= extracted-rowspan 1)
                                (recur is remaining)
                                (let [[nxt & rmng] remaining
                                      n extracted-rowspan
                                      rowspan #(format "<td rowspan=\"%s\">%s</td>" %1 %2)
                                      inserted (insert nxt (inc i) (rowspan (dec n) (textify value)))]
                                    (recur is (cons inserted rmng)))))
                        (recur is remaining)))))))

(defn iter
    "Returns a lazy-seq that conses the
     text values of the first row's tds
     onto (iter MAX prepared). The row is
     first forced to MAX columns: if it is
     shorter, it is padded with
     (repeat (- MAX (count row)) \"\");
     if it is longer, its last element is
     dropped with butlast; otherwise it
     is used as is."
    [MAX trs]
    (lazy-seq
     (when-first [tds trs]
         (let [remaining (rest trs)
               prepared (prep tds remaining)
               row (map #(textify %) tds)]
             (#(cons % (iter MAX prepared))
               (cond
                   (> (count row) MAX) (butlast row)
                   (< (count row) MAX) (concat row (repeat (- MAX (count row)) ""))
                   :else row))))))

(defn go-through
    "Turns an org.jsoup.select.Elements
     of rows into a lazy-seq of seqs of
     tds. This is then passed to iter."
    [MAX vs]
    (iter MAX (map #(map identity (.select % "td")) vs)))


(defn html->seq-of-seq-of-maps
    "Retrieves a lazy-seq of
     lazy-seqs of maps."
    [html]
    (letfn [(helper
             [part]
             (map (fn [f]
                      (let [headers (map #(.text %) (.select (first f) "th"))
                            values (go-through (count headers) (rest (first f)))]
                          (map #(zipmap headers %) values))) part))]
        (helper (partition-by-headers
                 (select-table-trs
                  (select-table-rows
                   (select-only-tables html)))))))

#'beaker_clojure_shell_5741a78b-609f-468e-b823-64137e74c63c/html->seq-of-seq-of-maps
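
The trickiest part of the cell above is carrying a rowspan cell down into the rows it covers. The sketch below shows that idea on plain data, without jsoup. It is not the notebook's method: instead of injecting a synthetic td string into the next row as prep does, it keeps a map of pending cells per column. The cell shape {:text ... :rowspan ...} and the name expand-rowspans are assumptions made for this example.

```clojure
(defn expand-rowspans
  "rows is a seq of rows; each row is a vector of cells shaped like
   {:text \"...\" :rowspan n} (:rowspan is optional and defaults to 1).
   Returns a vector of rows, each a vector of exactly width strings."
  [width rows]
  (loop [rows rows, pending {}, out []]
    (if-let [row (first rows)]
      (let [[cells next-pending]
            (loop [col 0, cells row, acc [], pend pending]
              (cond
                (= col width)
                [acc pend]

                ;; A cell from an earlier row still spans this column.
                (pend col)
                (let [{:keys [text left]} (pend col)]
                  (recur (inc col) cells (conj acc text)
                         (if (> left 1)
                           (assoc pend col {:text text :left (dec left)})
                           (dissoc pend col))))

                ;; Otherwise consume the next cell of the current row,
                ;; padding with \"\" when the row has run out of cells.
                :else
                (let [{:keys [text rowspan] :or {rowspan 1}} (first cells)]
                  (recur (inc col) (rest cells) (conj acc (or text ""))
                         (if (> rowspan 1)
                           (assoc pend col {:text text :left (dec rowspan)})
                           pend)))))]
        (recur (rest rows) next-pending (conj out cells)))
      out)))

;; Hypothetical table: the surname spans two rows.
(expand-rowspans 2 [[{:text "Smith" :rowspan 2} {:text "John"}]
                    [{:text "Jane"}]
                    [{:text "Doe"} {:text "Ann"}]])
;; => [["Smith" "John"] ["Smith" "Jane"] ["Doe" "Ann"]]
```

Unlike iter, this version is eager and returns vectors, which keeps the example short.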

In [18]:
(def seq-of-seq-of-maps (html->seq-of-seq-of-maps parsed-html))

#'beaker_clojure_shell_5741a78b-609f-468e-b823-64137e74c63c/seq-of-seq-of-maps
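
Each map in seq-of-seq-of-maps comes from zipping a table's headers with a row that has first been forced to the header count. A small sketch of that last step on plain strings, assuming hypothetical helper names fit-row and row->map and made-up sample values (note that fit-row trims with take, whereas iter drops a single trailing element with butlast):

```clojure
(defn fit-row
  "Pads row with empty strings, or trims it, to exactly width items."
  [width row]
  (let [n (count row)]
    (cond
      (> n width) (take width row)
      (< n width) (concat row (repeat (- width n) ""))
      :else row)))

(defn row->map
  "Zips headers with a row that has been fitted to the header count."
  [headers row]
  (zipmap headers (fit-row (count headers) row)))

(row->map ["Name" "Age" "Hometown"] ["Doe, Jane" "29"])
;; => {"Name" "Doe, Jane", "Age" "29", "Hometown" ""}
```

Without the padding, zipmap would silently stop at the shorter input and the map would simply lack the "Hometown" key.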

Next, we pass the titles of the tables we were looking for to *store*. It takes '(count titles)' elements from the seq of seqs of maps above and, for each title,
writes a .txt file named after that title, with the corresponding seq of maps as its content.

In [19]:
(defn store
    [titles sq]
    (dorun (map (fn [title data]
                    (spit (str title ".txt") (apply list data)))
                titles
                (take (count titles) sq)))
    "done")

(store ["First class" "Second class" "Third class"] 
       seq-of-seq-of-maps)

done
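
A variation on store, sketched under a few assumptions: it writes into a directory passed as an argument (here a folder under the system temp directory), uses mapv so that no dorun is needed, and returns the names of the files written. store-in and out-dir are names invented for this example, and the tables are hypothetical.

```clojure
(require '[clojure.java.io :as io])

(defn store-in
  "Writes each table to <dir>/<title>.txt and returns the file names.
   mapv is eager and stops at the shorter of titles and tables."
  [dir titles tables]
  (mapv (fn [title table]
          (let [f (io/file dir (str title ".txt"))]
            (spit f (pr-str (doall table)))
            (.getName f)))
        titles
        tables))

;; Hypothetical data: three tables, but only two titles are requested.
(def out-dir
  (doto (io/file (System/getProperty "java.io.tmpdir") "titanic-sketch")
    (.mkdirs)))

(store-in out-dir
          ["First class" "Second class"]
          [[{"Name" "A"}] [{"Name" "B"}] [{"Name" "C"}]])
;; => ["First class.txt" "Second class.txt"]
```

Because map and mapv stop at the shortest collection, pairing titles with tables already limits the output to (count titles) files.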

Now on to the second part of the Data Science methodology, "Data Wrangling", also done with **Clojure**.