# Clojure Decision Tree
- A01173359 - Mario Emilio Jiménez Vizcaíno
- A01656159 - Juan Sebastián Rodríguez Galarza
- A01656257 - Kevin Torres Martínez

We want to predict the quality of red wine from 6 independent, non-linear variables. A decision tree is well suited to this problem because our input dataset is labeled and the prediction depends on several continuous variables.

We use the [Red Wine Quality dataset on Kaggle](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

# Columns

In [8]:
(require '[cemerick.pomegranate :refer [add-dependencies]])
(add-dependencies :coordinates '[[org.clojure/data.csv "0.1.2"]])
(require '[clojure.data.csv :as csv])
(require '[clojure.spec.alpha :as s])

nil

In [9]:
(def wineQuality
  (with-open [in-file (clojure.java.io/reader "winequality-red.csv")]
    (doall (csv/read-csv in-file))))

#'decision-tree.core/wineQuality

In [10]:
(defrecord Wine [fixed_acidity volatile_acidity citrid_acid chlorides sulphates alcohol quality])

decision_tree.core.Wine

In [19]:
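(defn vectorToWine [v]
    ;; Columns are picked by their index in the CSV row.
    (Wine.
        (Double. (nth v 0))   ; fixed acidity
        (Double. (nth v 1))   ; volatile acidity
        (Double. (nth v 2))   ; citric acid
        (Double. (nth v 4))   ; chlorides
        (Double. (nth v 9))   ; sulphates
        (Double. (nth v 10))  ; alcohol
        ;; Scores of 6 or more are labeled good quality, the rest bad quality.
        (if (< 5.5 (Integer. (nth v 11))) "buena calidad" "mala calidad")))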

#'decision-tree.core/vectorToWine

In [20]:
;; Skip the header row and convert every remaining row into a Wine record.
(def data (map vectorToWine (rest wineQuality)))
(take 5 data)

(#decision_tree.core.Wine{:fixed_acidity 7.4, :volatile_acidity 0.7, :citrid_acid 0.0, :chlorides 0.076, :sulphates 0.56, :alcohol 9.4, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 7.8, :volatile_acidity 0.88, :citrid_acid 0.0, :chlorides 0.098, :sulphates 0.68, :alcohol 9.8, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 7.8, :volatile_acidity 0.76, :citrid_acid 0.04, :chlorides 0.092, :sulphates 0.65, :alcohol 9.8, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 11.2, :volatile_acidity 0.28, :citrid_acid 0.56, :chlorides 0.075, :sulphates 0.58, :alcohol 9.8, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 7.4, :volatile_acidity 0.7, :citrid_acid 0.0, :chlorides 0.076, :sulphates 0.56, :alcohol 9.4, :quality "mala calidad"})

# Dataset split

The dataset contains 1,599 rows in total. The first 1,279 (about 80%) are used to train the machine learning model, and the remaining 320 (about 20%) are used for testing.
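
The 1,279 cutoff can be derived from the ratio instead of being hard-coded. The following is only a minimal sketch: `split-at-ratio` is a hypothetical helper that is not part of the original notebook, and it assumes the rows are taken in file order, without shuffling.

```clojure
;; Hypothetical helper, not part of the original notebook.
(defn split-at-ratio
  "Splits coll into [head tail], where head holds the first
  floor(ratio * n) items and tail holds the rest."
  [ratio coll]
  (let [n (long (Math/floor (* ratio (count coll))))]
    [(take n coll) (drop n coll)]))

;; With 1,599 rows and a 0.8 ratio the cutoff is floor(1279.2) = 1279.
(let [[train-rows test-rows] (split-at-ratio 0.8 (range 1599))]
  [(count train-rows) (count test-rows)])
;; => [1279 320]
```

Calling `(split-at-ratio 0.8 data)` would give the same two groups as the `take` and `drop` calls used in this notebook.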

In [21]:
(def trainingData (take 1279 data))
(take 5 trainingData)

(#decision_tree.core.Wine{:fixed_acidity 7.4, :volatile_acidity 0.7, :citrid_acid 0.0, :chlorides 0.076, :sulphates 0.56, :alcohol 9.4, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 7.8, :volatile_acidity 0.88, :citrid_acid 0.0, :chlorides 0.098, :sulphates 0.68, :alcohol 9.8, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 7.8, :volatile_acidity 0.76, :citrid_acid 0.04, :chlorides 0.092, :sulphates 0.65, :alcohol 9.8, :quality "mala calidad"} #decision_tree.core.Wine{:fixed_acidity 11.2, :volatile_acidity 0.28, :citrid_acid 0.56, :chlorides 0.075, :sulphates 0.58, :alcohol 9.8, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 7.4, :volatile_acidity 0.7, :citrid_acid 0.0, :chlorides 0.076, :sulphates 0.56, :alcohol 9.4, :quality "mala calidad"})

In [22]:
(def testingData (drop 1279 data))
(take 5 testingData)

(#decision_tree.core.Wine{:fixed_acidity 9.8, :volatile_acidity 0.3, :citrid_acid 0.39, :chlorides 0.062, :sulphates 0.57, :alcohol 11.5, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 7.1, :volatile_acidity 0.46, :citrid_acid 0.2, :chlorides 0.077, :sulphates 0.64, :alcohol 10.4, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 7.1, :volatile_acidity 0.46, :citrid_acid 0.2, :chlorides 0.077, :sulphates 0.64, :alcohol 10.4, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 7.9, :volatile_acidity 0.765, :citrid_acid 0.0, :chlorides 0.084, :sulphates 0.68, :alcohol 10.9, :quality "buena calidad"} #decision_tree.core.Wine{:fixed_acidity 8.7, :volatile_acidity 0.63, :citrid_acid 0.28, :chlorides 0.096, :sulphates 0.63, :alcohol 10.2, :quality "buena calidad"})

In [23]:
(defn- update-key
  [from-key to-key map]
  {:pre [(s/valid? keyword? from-key)
         (s/valid? keyword? to-key)
         (s/valid? map? map)]
   :post [(s/valid? map? %)]}
  (-> map
      (assoc to-key (from-key map))
      (dissoc from-key)))

(defn- update-value
  [map key updated-value]
  {:pre [(s/valid? map? map)
         (s/valid? keyword? key)]
   :post [(s/valid? map? %)]}
  (assoc map key updated-value))


(s/def ::objective-variable string?)
(s/def ::objective-variable-vector (s/coll-of ::objective-variable))

(defn- gini-impurity
  [y]
  {:pre [(s/valid? ::objective-variable-vector y)]
   :post [(s/valid? (s/and number? #(<= 0 % 1)) %)]}
  (let [number-of-each-data (->> (group-by identity y)
                                 (map (comp count val)))
        sum-of-number-of-data (apply + number-of-each-data)]
    (->> number-of-each-data
         (map #(/ % sum-of-number-of-data))
         (map #(Math/pow % 2))
         (apply +)
         (- 1))))

(defn information-gain [node-data leaf1-data leaf2-data]
  (let [node-number-of-data (count node-data)
        node-gini-impurity (gini-impurity (map :Classes node-data))
        leaf1-number-of-data (count leaf1-data)
        leaf1-gini-impurity (gini-impurity (map :Classes leaf1-data))
        leaf2-number-of-data (count leaf2-data)
        leaf2-gini-impurity (gini-impurity (map :Classes leaf2-data))]
    (- node-gini-impurity
       (+ (* (/ leaf1-number-of-data node-number-of-data) leaf1-gini-impurity)
          (* (/ leaf2-number-of-data node-number-of-data) leaf2-gini-impurity)))))


(s/def ::feature (s/or :nil nil?
                       :keyword keyword?))
(s/def ::threshold (s/or :nil nil?
                         :number number?))
(s/def ::data (s/coll-of map?))
(s/def ::left (s/or :nil nil?
                    :map? ::node))
(s/def ::right (s/or :nil nil?
                     :map? ::node))
(s/def ::node  (s/keys :req-un [::feature ::threshold ::data ::left ::right]))

(defn- count-number-of-kinds-of-objective-variables
  [data key-of-objective-variable]
  {:pre [(s/valid? ::data data)
         (s/valid? keyword? key-of-objective-variable)]
   :post [(s/valid? int? %)]}
  (->> data
       (map key-of-objective-variable)
       set
       count))

(defn- get-explanatory-variables-from
  [data key-of-objective-variable]
  {:pre [(s/valid? ::data data)
         (s/valid? keyword? key-of-objective-variable)]
   :post [(s/valid? (s/coll-of keyword?) %)]}
  (->> (first data)
       keys
       (filter (partial not= key-of-objective-variable))))


(defn- get-threshold-point-candidates
  [data]
  {:pre [(s/valid? ::data data)]
   :post [(s/valid? (s/coll-of map?) %)]}
  (let [features (get-explanatory-variables-from data :Classes)
        get-threshold-point-candidates-one-feature (fn [data feature] (map #(select-keys %1 [feature]) data))]
    (->> (map (partial get-threshold-point-candidates-one-feature data) features)
         flatten)))

(defn- create-node-from-data
  [data]
  {:pre [(s/valid? ::data data)]
   :post [(s/valid? ::node %)]}
  {:feature nil :threshold nil :data data :right nil :left nil})

(defn- split-one-node
  [node threshold key]
  {:pre [(s/valid? ::node node)
         (s/valid? ::threshold threshold)
         (s/valid? ::feature key)]
   :post [(s/valid? ::node %)]}
  (let [left-data (filter #(> (key %1) threshold) (:data node))
        left-node (create-node-from-data left-data)
        right-data (filter #(<= (key %1) threshold) (:data node))
        right-node (create-node-from-data right-data)]
    (-> node
        (update-value :feature key)
        (update-value :threshold threshold)
        (update-value :left left-node)
        (update-value :right right-node))))

(defn- calculate-information-gains
  [node]
  {:pre  [(s/valid? ::node node)]
   :post [(s/valid? (s/coll-of map?) %)]}
  (->> (get-threshold-point-candidates (:data node))
       (map #(split-one-node node ((comp first vals) %1) ((comp first keys) %1)))
       (map #(hash-map
               :information-gain (information-gain (:data %) (:data (:left %)) (:data (:right %)))
               :threshold (:threshold %)
               :feature (:feature %)))))



(defn- stop-split?
  [node max-depth]
  (or (= max-depth 0)
      (<= (count (:data node)) 1)
      (<= (count-number-of-kinds-of-objective-variables (:data node) :Classes) 1)))

(defn- get-maximum-information-gain-splitter
  [node]
  (->> (calculate-information-gains node)
       ;; max-key returns the last element when several share the maximum value,
       ;; so we shuffle first to pick one of the tied candidates at random.
       shuffle
       (apply max-key :information-gain)))

(defn- get-most-popular-objective-variable-values
  [data]
  (->> data
       (map :Classes)
       (frequencies)
       (apply max-key val)
       key))

(defn- split-node
  [node threshold key max-depth]
  {:pre [(s/valid? ::node node)
         (s/valid? ::threshold threshold)
         (s/valid? ::feature key)]
   :post [(s/valid? ::node %)]}
  (if (stop-split? node max-depth)
    (assoc node :predict (get-most-popular-objective-variable-values (:data node)))
    (let [split (split-one-node node threshold key)
          left-node (:left split)
          right-node (:right split)
          left-splitter (get-maximum-information-gain-splitter left-node)
          right-splitter (get-maximum-information-gain-splitter right-node)]
      (-> node
          (update-value :feature key)
          (update-value :threshold threshold)
          (update-value :left (split-node left-node (:threshold left-splitter) (:feature left-splitter) (dec max-depth)))
          (update-value :right (split-node right-node (:threshold right-splitter) (:feature right-splitter) (dec max-depth)))))))

(defn make-decision-tree
  [train-data max-depth key-of-objective-variable]
  {:pre [(s/valid? ::data train-data)]
   :post [(s/valid? ::node %)]}
  (let [train-data (map (partial update-key key-of-objective-variable :Classes) train-data)
        node (create-node-from-data train-data)
        splitter (get-maximum-information-gain-splitter node)]
    (split-node node (:threshold splitter) (:feature splitter) max-depth)))

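(defn predict
  [tree data]
  (cond
    ;; Leaf node: return the stored majority class.
    (some? (:predict tree)) (:predict tree)
    ;; Values above the threshold go left, matching split-one-node.
    (> ((:feature tree) data) (:threshold tree)) (predict (:left tree) data)
    :else (predict (:right tree) data)))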

#'decision-tree.core/predict
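
The `gini-impurity` and `information-gain` functions in the cell above compute impurity as 1 minus the sum of squared class proportions, and gain as the parent impurity minus the size-weighted impurity of the two children. A small worked sketch may make the numbers easier to follow. The names `gini` and `gain` are illustrative stand-ins introduced here, simplified to omit the spec checks, and are not part of the original code.

```clojure
;; Illustrative stand-ins for gini-impurity and information-gain.
(defn gini
  "1 minus the sum of squared class proportions of labels."
  [labels]
  (let [n (count labels)]
    (->> (vals (frequencies labels))
         (map #(let [p (/ % n)] (* p p)))
         (reduce +)
         (- 1))))

(defn gain
  "Parent impurity minus the size-weighted impurity of both children."
  [parent left right]
  (let [n (count parent)
        weighted (fn [part] (* (/ (count part) n) (gini part)))]
    (- (gini parent) (+ (weighted left) (weighted right)))))

(def mixed ["buena calidad" "buena calidad" "mala calidad" "mala calidad"])

(gini mixed)                            ;; => 1/2 (two classes, evenly mixed)
(gini ["mala calidad" "mala calidad"])  ;; => 0   (a pure group)
(gain mixed
      ["buena calidad" "buena calidad"]
      ["mala calidad" "mala calidad"])  ;; => 1/2 (a split into pure groups)
```

A split that separates the classes perfectly recovers all of the parent impurity, which is why the tree prefers the candidate with the largest gain.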

In [25]:
(def tree (make-decision-tree trainingData 3 :quality))

#'decision-tree.core/tree

In [26]:
(predict tree (first testingData))

"buena calidad"

In [27]:
(first testingData)

#decision_tree.core.Wine{:fixed_acidity 9.8, :volatile_acidity 0.3, :citrid_acid 0.39, :chlorides 0.062, :sulphates 0.57, :alcohol 11.5, :quality "buena calidad"}

In [41]:
(defn test_TestingData []
  ;; Count how many test rows the tree classifies correctly.
  (let [predictData (map #(predict tree %) testingData)
        realData (map :quality testingData)]
    (count (filter true? (map = predictData realData)))))

#'decision-tree.core/test_TestingData

In [42]:
(test_TestingData)

228
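
The count of correct predictions can be read as an accuracy by dividing it by the 320 rows in the test set. A minimal sketch, where `accuracy` is a hypothetical helper that is not part of the original notebook:

```clojure
;; Hypothetical helper, not part of the original notebook.
(defn accuracy
  "Fraction of correct predictions, as a double."
  [correct total]
  (double (/ correct total)))

(accuracy 228 320)
;; => 0.7125
```

With 228 correct predictions out of 320, the depth-3 tree classifies roughly 71% of the test rows correctly.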