<h1 style="font-size: 1.6rem; font-weight: bold">ITO 5047: Fundamentals of Artificial Intelligence</h1>
<h1 style="font-size: 1.6rem; font-weight: bold">Topic 2 - Machine Learning: Decision Trees</h1>
<p style="margin-top: 5px; margin-bottom: 5px;">Monash University Australia</p>
<p style="margin-top: 5px; margin-bottom: 5px;">Jupyter Notebook by: Tristan Sim Yook Min</p>

---

### **Decision Trees**

Decision Trees are a **Supervised Machine-Learning** method that uses a **Tree-Structure** to predict the value of objects. The **Decision Nodes** are the **Non-Leaf Nodes**, which split the data on **explanatory attributes**, whilst the **Leaf Nodes** represent the **Classes**.

Each Decision Node (starting from the Root Node) tests one **Object Attribute** against an **Attribute Value or Range** in order to classify an object. The Path from the Root Node to a Leaf Node gives the **Class** of an Object.


### **Decision Tree Example: Weather and Playing Ball**

The goal is to predict whether to go outside and play ball using daily weather information. The dataset contains weather attribute vectors along with the corresponding target variable indicating the actual decision made.

**Data Structure:**
| Day | **Outlook** | **Temperature** | **Humidity** | **Wind** | **Play Ball** |
|-----|-------------|-----------------|--------------|----------|---------------|
| D1  | Sunny       | Hot             | High         | Weak     | No            |
|     | ← **Input (x)** → | ← **Input (x)** → | ← **Input (x)** → | ← **Input (x)** → | **Output (y)** |

**The Training Data can be Divided into:**
- **Input (x)**: Vectors of Explanatory Attributes - Weather attributes [Outlook, Temperature, Humidity, Wind]
- **Output (y)**: Corresponding Target Values - Decision to play ball (Yes/No)

---

### **Training Dataset** 

*Play-ball data set (2023) courtesy of Ethan Wills from Monash University*

In this example, we have an entire training data set with data for 14 days.

| Day | Outlook | Temperature | Humidity | Wind | Play Ball |
|-----|---------|-------------|----------|------|-----------|
| D1  | Sunny   | Hot         | High     | Weak | No        |
| D2  | Sunny   | Hot         | High     | Strong | No      |
| D3  | Overcast| Hot         | High     | Weak | Yes       |
| D4  | Rain    | Mild        | High     | Weak | Yes       |
| D5  | Rain    | Cool        | Normal   | Weak | Yes       |
| D6  | Rain    | Cool        | Normal   | Strong | No      |
| D7  | Overcast| Cool        | Normal   | Strong | Yes     |
| D8  | Sunny   | Mild        | High     | Weak | No        |
| D9  | Sunny   | Cool        | Normal   | Weak | Yes       |
| D10 | Rain    | Mild        | Normal   | Weak | Yes       |
| D11 | Sunny   | Mild        | Normal   | Strong | Yes     |
| D12 | Overcast| Mild        | High     | Strong | Yes     |
| D13 | Overcast| Hot         | Normal   | Weak | Yes       |
| D14 | Rain    | Mild        | High     | Strong | No      |
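
The table can be encoded directly in Python, which makes the split into inputs (x) and outputs (y) concrete. This is a minimal sketch: the names `ATTRIBUTES`, `ROWS`, `X` and `y` are chosen here for illustration and are not part of the original data set.

```python
# Column order follows the table above.
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

# One tuple per day, D1 to D14: (Outlook, Temperature, Humidity, Wind, Play Ball).
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

# Split every row into an input vector x and a target value y.
X = [dict(zip(ATTRIBUTES, row[:4])) for row in ROWS]
y = [row[4] for row in ROWS]

print(X[0], "->", y[0])
print("Yes:", y.count("Yes"), "No:", y.count("No"))
```

The final line counts the class labels, which gives the 9 'Yes' and 5 'No' examples used in the entropy calculations later in this notebook.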

#### **The decision tree constructed from this data follows this structure:**

![image.png](attachment:image.png)

<br>

#### **Testing the Model**

Suppose we have a new day: D15, with the following conditions. Should we play ball?

| Day | Outlook | Temperature | Humidity | Wind | Play Ball |
|-----|---------|-------------|----------|------|-----------|
| D15 | Sunny   | Hot         | High     | Weak | ?         |

**Decision Process:**
1. Check the outlook → Sunny
2. Since outlook is sunny, check humidity → High
3. Since humidity is high, the tree says **No**, we should not play ball.

![image-2.png](attachment:image-2.png)
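
The decision process above can be sketched in code. The nested-dictionary layout (`{attribute: {value: subtree}}`, with plain strings as leaves) and the names `TREE` and `classify` are assumptions made for this example; the branches mirror the tree described in this notebook.

```python
# Internal nodes map one attribute to its branches; leaves are class labels.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow one root-to-leaf path and return the class stored at the leaf."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node

d15 = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(TREE, d15))  # No
```

Note that Temperature is never consulted: only the attributes on the path from the root to the leaf affect the prediction.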

---


### **Decision Stump: Simplest Tree** 

The simplest decision tree is a Decision Stump, where values are split on a single attribute from one decision node.

![image-2.png](attachment:image-2.png)
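
As a rough illustration, a stump can be fitted by predicting the majority class for each value of one attribute. The sketch below assumes the stump splits on Outlook (the attribute shown in the figure may differ) and uses the Outlook and Play Ball columns from the training table; `fit_stump` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Outlook and Play Ball columns for D1 to D14, copied from the training table.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def fit_stump(values, labels):
    """Map each attribute value to the most common class among its examples."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

stump = fit_stump(outlook, play)
print(stump)
```

Because a stump has only one decision node, it misclassifies every example that disagrees with the majority of its branch, such as the two sunny days on which the answer was 'Yes'.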

### **Splitting Attributes** 

![image.png](attachment:image.png)


---

### **Entropy**

Entropy serves as a mathematical tool for quantifying the level of uncertainty or randomness within a probability distribution. For discrete random variables, entropy uses a specific mathematical formula that combines probabilities with logarithmic functions.

For a discrete random variable $X$ that can take values $x_1, x_2, \dots, x_n$ with corresponding probabilities $\Pr(x_1), \Pr(x_2), \dots, \Pr(x_n)$, the entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i=1}^{n} \Pr(x_i) \log_2 \Pr(x_i)$$

Key properties of entropy:
- The result is always non-negative: H(X) ≥ 0
- It represents the average amount of information needed to describe the random variable

### **Example Problem: Entropy**

The entropy curve demonstrates how uncertainty changes based on probability distributions. The graph shows entropy values for a binary classification problem (purple vs green shapes) plotted against the probability of one outcome.

![image.png](attachment:image.png)

**Maximum Entropy (Peak Uncertainty)**
- Occurs when probabilities are equal: Pr(blue) = Pr(orange) = 0.5
- At this point, entropy reaches its maximum value of 1.0
- This represents **maximum uncertainty** - you cannot predict the outcome better than random guessing

**Minimum Entropy (Perfect Certainty)**
- Occurs at the extremes: when Pr(blue) = 0 or Pr(blue) = 1 (equivalently, Pr(orange) = 1 or Pr(orange) = 0)
- At these points, entropy equals 0
- This represents **perfect certainty** - you know exactly what the outcome will be

Consider a random variable S that can have values 'a' or 'b'. Three different probability scenarios are presented:

1. **Scenario 1**: Pr(a) = 1, Pr(b) = 0
2. **Scenario 2**: Pr(a) = 0.9, Pr(b) = 0.1  
3. **Scenario 3**: Pr(a) = 0.5, Pr(b) = 0.5

**Question**: Which probability assignment results in the highest entropy H(S)?

Based on the entropy curve, Scenario 3 (equal probabilities) will have the highest entropy, representing maximum uncertainty about the outcome.
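
The answer can be checked numerically. This is a small sketch in which the helper name `entropy` and the `scenarios` dictionary are chosen for illustration; terms with probability 0 are skipped, which matches the convention 0 × log₂(0) = 0.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

scenarios = {
    "Scenario 1": [1.0, 0.0],
    "Scenario 2": [0.9, 0.1],
    "Scenario 3": [0.5, 0.5],
}

for name, probs in scenarios.items():
    print(f"{name}: H(S) = {entropy(probs):.3f}")

# Pick the scenario whose distribution has the largest entropy.
best = max(scenarios, key=lambda name: entropy(scenarios[name]))
print("Highest entropy:", best)
```

The three values come out at roughly 0.000, 0.469 and 1.000, so entropy rises as the two probabilities move closer together.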

### **Example Problem: Entropy is Bad, Homogeneity is Good**

In machine learning and classification tasks, entropy serves as a measure of impurity or disorder within a dataset. When we say "entropy is bad," we mean that high entropy indicates mixed, uncertain data that's difficult to classify cleanly.

**High Entropy = Bad (Mixed Data)**
- Indicates a dataset with mixed positive and negative examples
- Makes prediction uncertain and classification difficult
- Represents impurity in the data

**Low Entropy = Good (Homogeneous Data)**
- Indicates a dataset with mostly similar examples
- Makes prediction more certain and reliable
- Represents purity in the data

Consider a dataset S with examples labeled as positive (+) and negative (-). We use the convention that 0 × log₂(0) = 0 when calculating entropy.

Let's examine a dataset with:
- 9 positive examples
- 5 negative examples
- Total: 14 examples

**Step 1: Calculate Probabilities**
- Pr(P) = 9/14 (probability of positive examples)
- Pr(N) = 5/14 (probability of negative examples)

**Step 2: Apply Entropy Formula**

$$\text{Entropy}(S) = -\Pr(P) \log_2 \Pr(P) - \Pr(N) \log_2 \Pr(N)$$

**Step 3: Substitute Values**

$$\text{Entropy}([9+, 5-]) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14}$$
$$\text{Entropy}([9+, 5-]) \approx 0.94$$

Interpreting the result, the entropy value of 0.94 indicates:
- **High impurity**: The value is close to the two-class maximum of 1, so the dataset is far from pure; the 9-to-5 imbalance keeps it just below a completely even mix
- **Room for improvement**: A perfectly homogeneous dataset would have entropy = 0
- **Classification challenge**: The mixed nature makes prediction less certain
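
The three steps can be reproduced in a few lines of Python. This is a sketch only: `entropy_from_counts` is a name introduced here, and it takes class counts rather than probabilities so that Step 1 happens inside the function.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from class counts, treating 0 * log2(0) as 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# [9+, 5-]: 9 positive and 5 negative examples.
print(round(entropy_from_counts([9, 5]), 3))

# A perfectly homogeneous dataset for comparison.
print(entropy_from_counts([14, 0]))
```

The first value rounds to 0.94, matching the hand calculation, while the homogeneous case gives 0.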

---

### **Information Gain**

Information gain is the fundamental metric used to construct decision trees effectively. It measures how much a particular attribute reduces uncertainty (entropy) when we use it to split our dataset. Information gain quantifies the expected reduction in entropy that occurs when we partition a dataset based on a specific attribute. It helps us determine which attribute provides the most valuable split at each node of a decision tree. 

Information gain compares the entropy before and after splitting, measuring the improvement in data homogeneity.

The information gain for splitting dataset S using attribute A is calculated as:

$$IG(S,A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times H(S_v)$$

Where:
- **H(S)** = entropy of the original dataset S
- **Values(A)** = all possible values that attribute A can take
- **$S_v$** = subset of examples where attribute A has value v
- **$|S_v|$** = number of examples in subset $S_v$
- **|S|** = total number of examples in the original dataset
- **$H(S_v)$** = entropy of subset $S_v$

### Information Gain Example: Wind and Play Ball

In this example, using the play-ball data, we want to find the information gained about 'play-ball' by splitting on the 'wind' attribute. We take $S$ to be the play-ball data set and $A$ to be 'wind', which can be either 'strong' or 'weak'. The table shows that, unsplit, play-ball has 9 'Yes' and 5 'No' values. We can also see the number of 'Yes' and 'No' values for play-ball when the data is split into 'weak' and 'strong' wind. If we place these values into the information gain formula and calculate the entropy of $S_w$ and $S_s$, we find the information gain to be 0.048.

| Day | Wind   | Play Ball |
|-----|--------|-----------|
| D1  | Weak   | No        |
| D2  | Strong | No        |
| D3  | Weak   | Yes       |
| D4  | Weak   | Yes       |
| D5  | Weak   | Yes       |
| D6  | Strong | No        |
| D7  | Strong | Yes       |
| D8  | Weak   | No        |
| D9  | Weak   | Yes       |
| D10 | Weak   | Yes       |
| D11 | Strong | Yes       |
| D12 | Strong | Yes       |
| D13 | Weak   | Yes       |
| D14 | Strong | No        |

**Original Dataset (Unsplit):**
- Total examples: 14
- Play Ball = Yes: 9 examples
- Play Ball = No: 5 examples

**After Splitting on Wind:**

**Weak Wind Subset ($S_w$):**
- Examples: [6+, 2-] (6 Yes, 2 No)
- Total: 8 examples

**Strong Wind Subset ($S_s$):**
- Examples: [3+, 3-] (3 Yes, 3 No)
- Total: 6 examples

<br>

**Using the information gain formula:**

$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times H(S_v)$$

**Step 1: Calculate original entropy H(S)**

**Step 1a: Calculate Probabilities**
- Pr(P) = $\frac{9}{14}$ (probability of positive examples)
- Pr(N) = $\frac{5}{14}$ (probability of negative examples)

**Step 1b: Apply Entropy Formula**

$$\text{Entropy}(S) = -\Pr(P) \log_2 \Pr(P) - \Pr(N) \log_2 \Pr(N)$$

**Step 1c: Substitute Values**

$$\text{Entropy}([9+, 5-]) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14}$$
$$\text{Entropy}([9+, 5-]) \approx 0.94$$

**Step 2: Calculate entropy for each subset**

**Step 2a: Calculate entropy for Weak Wind subset ($S_w$)**
- $S_w$ = [6+, 2-] (6 Yes, 2 No out of 8 total)
- $\Pr(\text{Yes}) = \frac{6}{8} = 0.75$
- $\Pr(\text{No}) = \frac{2}{8} = 0.25$

$H(S_w) = -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8}$

$H(S_w) = -0.75 \times \log_2(0.75) - 0.25 \times \log_2(0.25)$

$H(S_w) = -0.75 \times (-0.415) - 0.25 \times (-2.0)$

$H(S_w) = 0.311 + 0.5 = 0.811$

**Step 2b: Calculate entropy for Strong Wind subset ($S_s$)**
- $S_s$ = [3+, 3-] (3 Yes, 3 No out of 6 total)
- Pr(Yes) = 3/6 = 0.5
- Pr(No) = 3/6 = 0.5

$H(S_s) = -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6}$

$H(S_s) = -0.5 \times \log_2(0.5) - 0.5 \times \log_2(0.5)$

$H(S_s) = -0.5 \times (-1.0) - 0.5 \times (-1.0)$

$H(S_s) = 0.5 + 0.5 = 1.0$

**Step 3: Apply the information gain formula**

$$\begin{aligned}
IG(S, A) &= H(S) - \frac{|S_w|}{|S|} H(S_w) - \frac{|S_s|}{|S|} H(S_s) \\
&= H(S) - \frac{8}{14} H(S_w) - \frac{6}{14} H(S_s) \\
&= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.000 \\
&= 0.940 - 0.463 - 0.429 \\
&= 0.048
\end{aligned}$$

**Information Gain = 0.048**

This relatively low information gain indicates that:
- Splitting on 'wind' provides only a small reduction in entropy
- The 'wind' attribute is not particularly effective for predicting play-ball outcomes
- Other attributes might provide better splits for building the decision tree

The weak information gain suggests that wind conditions alone don't strongly correlate with the decision to play ball in this dataset.
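
The worked example can be verified with a short script. This is a sketch under stated assumptions: the helper names `entropy` and `counts` are introduced here, and the two lists are the Wind and Play Ball columns copied from the table above.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits from class counts, with 0 * log2(0) taken as 0."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# Wind and Play Ball columns for D1 to D14.
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def counts(value=None):
    """Return (yes, no) counts, optionally restricted to one wind value."""
    labels = [p for w, p in zip(wind, play) if value is None or w == value]
    return labels.count("Yes"), labels.count("No")

h_s = entropy(*counts())                # Step 1: entropy of the unsplit data
h_weak = entropy(*counts("Weak"))       # Step 2a: weak wind subset
h_strong = entropy(*counts("Strong"))   # Step 2b: strong wind subset
n_weak, n_strong = sum(counts("Weak")), sum(counts("Strong"))

# Step 3: subtract the size-weighted subset entropies from H(S).
gain = h_s - n_weak / len(play) * h_weak - n_strong / len(play) * h_strong

print(f"H(S)={h_s:.3f}  H(S_w)={h_weak:.3f}  H(S_s)={h_strong:.3f}  IG={gain:.3f}")
```

Running it prints the same intermediate values as the hand calculation (0.940, 0.811 and 1.000) and an information gain of 0.048.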



![image.png](attachment:image.png)

---

### **Decision Tree Algorithm Details**

**The Recursive Splitting Process**

The algorithm's power lies in the recursive Split function, which performs top-down recursive splitting of attributes based on information gain. Here's how it works:

**Step 1: Check Stopping Condition**

- If all data points in the current subset belong to the same class, create a leaf node and return.
- This represents a pure subset where no further splitting is needed.

**Step 2: Attribute Evaluation**

- For each remaining attribute A, calculate the information gain that would result from splitting on A.
- Information gain measures how much the split reduces entropy (uncertainty).

**Step 3: Select Best Split**

- Choose the attribute with the highest information gain.
- This attribute provides the most valuable division of the data.

**Step 4: Partition and Recurse**

- Split the data into subsets based on the chosen attribute's values.
- Recursively apply the Split function to each subset.
- Continue until all subsets are pure (same class) or no more attributes remain.
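
The four steps can be sketched as a short recursive function and run on the play-ball data. This is an illustrative implementation, not a definitive one: the names `split`, `partition` and `information_gain` are chosen here, the tree is returned as nested dictionaries, and falling back to a majority vote when no attributes remain is an assumption the steps above do not spell out.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play Ball"]
TARGET = "Play Ball"
DATA = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]

def entropy(rows):
    """Entropy in bits of the target labels in rows."""
    counts = Counter(row[TARGET] for row in rows)
    return sum(-(n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def partition(rows, attribute):
    """Group rows by their value for one attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets

def information_gain(rows, attribute):
    subsets = partition(rows, attribute).values()
    return entropy(rows) - sum(len(s) / len(rows) * entropy(s) for s in subsets)

def split(rows, attributes):
    labels = Counter(row[TARGET] for row in rows)
    # Step 1: stop on a pure subset; majority vote if attributes run out (assumption).
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]
    # Steps 2 and 3: evaluate every remaining attribute and keep the best one.
    best = max(attributes, key=lambda a: information_gain(rows, a))
    remaining = [a for a in attributes if a != best]
    # Step 4: partition on the chosen attribute and recurse into each subset.
    return {best: {value: split(subset, remaining)
                   for value, subset in partition(rows, best).items()}}

tree = split(ROWS, COLUMNS[:-1])
print(tree)
```

On this data the sketch selects Outlook at the root, then Humidity under 'Sunny' and Wind under 'Rain', while the 'Overcast' branch is already pure and becomes a leaf.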

---

### **Overfitting and Decision Trees**

Decision trees can become overly specialized to training examples, learning from random variations and anomalies, which decreases their performance on new data. This problem intensifies as tree complexity increases.

![image.png](attachment:image.png)

### **Three main approaches exist to combat overfitting:**

* **Pre-pruning** - Halt tree growth early when a quality metric drops below a set threshold
* **Post-pruning** - Build the complete tree, then trim it back to the configuration that achieved peak performance on validation data
* **Regularization** - Include a complexity cost in the performance calculation, such as penalizing additional nodes (e.g., Complexity = Node count)

Post-pruning involves monitoring model performance on validation data throughout training. Once training completes, the tree is reduced to its state when validation performance was optimal. This approach uses unseen validation data to determine the ideal tree size for generalization. The diagram below shows the tree cut back when the performance on the validation data was at its highest.

![image-2.png](attachment:image-2.png)
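
The selection logic behind post-pruning and regularization can be sketched with a few lines of Python. All numbers in `history` are hypothetical values invented for illustration, as are the function names and the `penalty` parameter; the sketch only shows how a tree size would be chosen, not how the tree itself is trimmed.

```python
# Hypothetical growth history: (node_count, training_accuracy, validation_accuracy).
history = [
    (1, 0.64, 0.62),
    (3, 0.78, 0.75),
    (5, 0.86, 0.81),
    (9, 0.93, 0.79),
    (15, 1.00, 0.72),
]

def best_by_validation(history):
    """Post-pruning view: keep the tree size with the highest validation accuracy."""
    return max(history, key=lambda step: step[2])[0]

def best_by_penalized_score(history, penalty=0.02):
    """Regularization view: training accuracy minus a fixed cost per node."""
    return max(history, key=lambda step: step[1] - penalty * step[0])[0]

print(best_by_validation(history))       # 5
print(best_by_penalized_score(history))  # 5
```

In this made-up history, training accuracy keeps rising with tree size while validation accuracy peaks at 5 nodes, which is the overfitting pattern described above. The complexity penalty happens to pick the same size here without using validation data at all; with a different `penalty` it could choose a different one.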


### **Improvements to Decision Tree Construction**

Decision trees can be enhanced in multiple ways. While we've focused on categorical attributes so far, trees can also process **continuous-valued attributes**.
* Input: Apply threshold values for splitting decisions.
* Output: Predict a numeric value at each leaf node, for example the mean of the training targets that reach the leaf, or a linear function fitted to them.
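
One common way to choose a threshold for a continuous input is to try the midpoints between consecutive sorted values and keep the one with the highest information gain. The sketch below assumes that approach; the humidity readings are hypothetical numbers, not values from the play-ball table, and `best_threshold` is a name introduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels).values()
    return sum(-(n / len(labels)) * math.log2(n / len(labels)) for n in counts)

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split value <= threshold."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - weighted > best[1]:
            best = (threshold, base - weighted)
    return best

# Hypothetical humidity readings in percent with their play-ball labels.
humidity = [65, 70, 72, 80, 85, 90]
play = ["Yes", "Yes", "Yes", "No", "No", "No"]

print(best_threshold(humidity, play))  # (76.0, 1.0)
```

The chosen threshold then acts like a two-valued categorical attribute ("humidity <= 76" versus "humidity > 76"), so the rest of the tree-building procedure is unchanged.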

The algorithm can be modified to manage **missing categorical values** by substituting the mode or selecting randomly based on the attribute's distribution where values are absent.

We can also manage **missing continuous values** by using the mean value.
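
Both substitution strategies can be sketched with the standard library. The helper `fill_missing`, the use of `None` to mark a missing value, and the sample lists are assumptions made for this example.

```python
from statistics import mean, mode

def fill_missing(values, strategy):
    """Replace None entries with the mode or the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else mean(observed)
    return [fill if v is None else v for v in values]

# Categorical attribute: fill the gap with the most common value.
outlook = ["Sunny", None, "Rain", "Sunny", "Overcast"]
print(fill_missing(outlook, "mode"))

# Continuous attribute: fill the gap with the mean (hypothetical temperatures).
temperature_c = [30, 24, None, 18]
print(fill_missing(temperature_c, "mean"))
```

The missing outlook becomes "Sunny" and the missing temperature becomes 24, the mean of the three observed readings. Sampling from the attribute's distribution, the other option mentioned above, could be done with `random.choice(observed)` instead of `mode(observed)`.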

Additionally, we've demonstrated how overfitting can be minimized through techniques like post-pruning.

### **Rule Extraction from Decision Trees**

Decision trees offer excellent interpretability, allowing us to convert the learned structure into **IF-THEN** rules that humans can easily understand.

The diagram below shows that each path from root to leaf generates one rule, where attribute-value pairs along the path create conjunctions, and the leaf node contains the class prediction.

* **IF** outlook="sunny" **AND** humidity="high" **THEN** play-ball="NO"
* **IF** outlook="sunny" **AND** humidity="normal" **THEN** play-ball="YES"
* **IF** outlook="overcast" **THEN** play-ball="YES"
* **IF** outlook="rain" **AND** wind="strong" **THEN** play-ball="NO"
* **IF** outlook="rain" **AND** wind="weak" **THEN** play-ball="YES"

![image.png](attachment:image.png)
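
Rule extraction amounts to walking every root-to-leaf path and joining the tests along it. The sketch below assumes the tree is stored as nested dictionaries (`{attribute: {value: subtree}}` with string leaves); that layout and the name `extract_rules` are illustrative choices.

```python
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf: join the attribute-value pairs collected on the way down.
        premise = " AND ".join(f'{a.lower()}="{v.lower()}"' for a, v in conditions)
        return [f'IF {premise} THEN play-ball="{node.upper()}"']
    attribute, branches = next(iter(node.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

for rule in extract_rules(TREE):
    print(rule)
```

The output is the same five rules listed above, one for each leaf of the tree.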

---