# RIPPER

*03-20-2021 (updated 03-08-2022)*

This page contains my reading notes on 

- [**Fast Effective Rule Induction**](https://www.semanticscholar.org/paper/Fast-Effective-Rule-Induction-Cohen/6665e03447f989c9bdb3432d93e89b516b9d18a7)

Some of the material also comes from:

- [**Incremental Reduced Error Pruning**](https://www.semanticscholar.org/paper/Incremental-Reduced-Error-Pruning-F%C3%BCrnkranz-Widmer/e37790eae6a0ed842c7260df39aab9161c4d1aa1)
- [**Fast Effective Rule Induction Overview**](https://static.aminer.org/pdf/PDF/000/334/623/fast_effective_rule_induction.pdf)

## Introduction
1. In this paper, Cohen first implements his own version of [**IREP**](https://www.semanticscholar.org/paper/Incremental-Reduced-Error-Pruning-F%C3%BCrnkranz-Widmer/e37790eae6a0ed842c7260df39aab9161c4d1aa1) (Incremental Reduced Error Pruning) with some minor differences and adds support for multi-class problems and missing attributes.
1. Then he proposes several major changes to IREP and names the improved version **IREP\***. 
1. Finally, based on IREP*, he proposes a new rule mining algorithm called **RIPPER** (Repeated Incremental Pruning to Produce Error Reduction).

## IREP (Cohen version)

#### IREP algorithm 

IREP has two main characteristics:
1. **Separate and conquer**: the covered instances in the training set are removed after a rule is found; thus in the next iteration, a new rule will be learned on the training instances that have not been covered by the previously found rules. 
1. **Integration of pre-pruning and post-pruning**:
    1. Pre-pruning: a rule is learned such that it deliberately does not cover certain training instances.
    1. Post-pruning: after a rule is learned, some literals are deleted to avoid over-fitting.

> **Input**: the training set $\mathcal{D}$ with binary labels and all possible features $\mathcal{F}$.  
> **Output**: the learned rule set $\mathcal{R}$.
> 1. Initialize an empty rule set $\mathcal{R}$.
> 1. While there are still positive instances in $\mathcal{D}$:
>     1. Randomly choose 2/3 of $\mathcal{D}$ as the growing set $\mathcal{G}$; the remaining 1/3 becomes the pruning set $\mathcal{P}$.
>     1. $R$ = **GrowRule($\mathcal{G}$)**
>     1. $R$ = **PruneRule($\mathcal{P}$, $R$)**
>     1. If the accuracy of $R$ on $\mathcal{P}$ is below 0.5: break
>     1. Add $R$ to $\mathcal{R}$.
>     1. Remove instances that are covered by $R$ from $\mathcal{D}$.
> 1. Return $\mathcal{R}$
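
The loop above can be sketched in Python as follows. This is only a minimal illustration, not Cohen's implementation: `irep`, `grow_rule`, `prune_rule`, `covers` and `seed` are hypothetical names, the grow and prune steps are passed in as callables, and the accuracy check is read as the share of positives among the pruning instances that the rule covers, which is one possible interpretation.

```python
import random

def irep(data, grow_rule, prune_rule, covers, seed=0):
    """Learn a rule list from (features, is_positive) pairs by separate-and-conquer."""
    rng = random.Random(seed)
    remaining = list(data)
    rules = []
    while any(y for _, y in remaining):
        # Random split: 2/3 growing set, 1/3 pruning set.
        shuffled = remaining[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        grow_set, prune_set = shuffled[:cut], shuffled[cut:]

        rule = prune_rule(prune_set, grow_rule(grow_set))

        # Assumption: accuracy = fraction of covered pruning instances that are positive.
        # A rule that covers nothing in the pruning set also stops the loop.
        covered = [y for x, y in prune_set if covers(rule, x)]
        if not covered or sum(covered) / len(covered) < 0.5:
            break

        rules.append(rule)
        # Separate-and-conquer: drop every instance the new rule covers.
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules
```

Because an accepted rule always covers at least one pruning instance, the remaining data shrinks on every iteration, so the loop terminates.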

The changes that Cohen made to the original version are:
1. Stopping condition: the original IREP stops when the accuracy of the learned rule is lower than the accuracy of the empty rule, whereas Cohen's version stops when the accuracy is below 50\%.
1. The PruneRule algorithm, which is detailed below.

#### GrowRule

In each iteration, the feature $f$ with value $v$ that has the **maximum FOIL score** is added to the rule, and the iterations terminate when **the rule doesn't cover any negative instances** in the growing set.

> **Input**: the growing set $\mathcal{G}$ with binary labels and all possible features $\mathcal{F}$.  
> **Output**: the unpruned rule $R$.
> 1. Initialize an empty rule $R$. 
> 1. Until all instances in $\mathcal{G}$ that satisfy $R$ are positive (accuracy of $R$ is 1 in $\mathcal{G}$) or there is no feature to add: 
>     1. For every feature $f \in \mathcal{F}$ not in $R$ and every possible value $v \in \mathcal{V}(f)$:
>         1. Create a temp rule $R_{t}$ by copying current $R$. 
>         1. Add $(f, v)$ to $R_{t}$.
>         1. Calculate FOIL's information gain of $R_{t}$: $\mathrm{Foil}(R, R_{t})$ based on $\mathcal{G}$.
>     1. Get the $R_{t}^{max}$ with the maximum value of $\mathrm{Foil}(R, R_{t})$.
>     1. $R=R_{t}^{max}$.
> 1. Return $R$.

**[Support for categorical and continuous features]**: the definition of $\mathcal{V}(f)$ differs for categorical and numerical features.
1. For a categorical feature $f_{c}$, $\mathcal{V}(f_{c})$ is the collection of all possible values that $f_{c}$ can take.
1. For a numerical feature $f_{n}$, $\mathcal{V}(f_{n})$ is the Cartesian product of $\{\leq, \geq\}$ and all values of $f_{n}$ that appear in the training set. For example, if the values that appear in the training set for the feature age are $\{10, 20, 30\}$, then $\mathcal{V}(\text{age})$ is $\{\leq 10, \geq 10, \leq 20, \geq 20, \leq 30, \geq 30\}$.
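
A small sketch of how $\mathcal{V}(f)$ could be enumerated from one feature column. The function name `candidate_values` and the `(operator, value)` tuple encoding are hypothetical, and deciding "numerical" by Python type is an assumption made for brevity.

```python
def candidate_values(column):
    """Return the candidate tests V(f) for one feature column of the training set."""
    distinct = sorted(set(column))
    # Assumption: a column of ints/floats (but not bools) is a numerical feature.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if is_numeric:
        # Cartesian product of {<=, >=} with every observed value.
        return [(op, v) for v in distinct for op in ("<=", ">=")]
    # Categorical feature: one equality test per observed value.
    return [("==", v) for v in distinct]
```

For the age example, `candidate_values([30, 10, 20, 10])` yields the six threshold tests listed above, in ascending order of the threshold.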

**[FOIL's information gain]**: it measures how much closer $R_{new}$ is than $R_{old}$ to covering only positive instances, weighted by the number of positive instances that $R_{new}$ still covers.

$$ \mathrm{Foil}(R_{old}, R_{new}) = P(R_{new}) \left(\log_{2}\frac{P(R_{new})}{P(R_{new}) + N(R_{new})} - \log_{2}\frac{P(R_{old})}{P(R_{old}) + N(R_{old})}\right) $$

where $P(R)$ is the number of positive instances covered by $R$ and $N(R)$ is the number of negative instances covered by $R$.
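
Below is a minimal sketch of GrowRule driven by FOIL's gain, restricted to categorical features. It uses the standard FOIL form, in which the gain is the change in $\log_2$ precision from the old rule to the new one, weighted by the positives the new rule still covers. The names `covers`, `counts`, `foil_gain` and `grow_rule`, and the dict encoding of a rule, are assumptions made for illustration.

```python
import math

def covers(rule, x):
    """A rule is a dict {feature: value}; it covers x if every literal matches."""
    return all(x.get(f) == v for f, v in rule.items())

def counts(rule, data):
    """Return (P(R), N(R)): positives and negatives covered by the rule."""
    covered = [y for x, y in data if covers(rule, x)]
    pos = sum(covered)
    return pos, len(covered) - pos

def foil_gain(old, new, data):
    """FOIL gain of specializing `old` into `new` on `data`."""
    p_old, n_old = counts(old, data)
    p_new, n_new = counts(new, data)
    if p_new == 0:
        # A rule that covers no positives is never worth choosing.
        return float("-inf")
    return p_new * (
        math.log2(p_new / (p_new + n_new)) - math.log2(p_old / (p_old + n_old))
    )

def grow_rule(grow_set):
    """Greedily add the literal with maximum FOIL gain until no negatives are covered."""
    rule = {}
    features = sorted({f for x, _ in grow_set for f in x})
    while counts(rule, grow_set)[1] > 0:
        candidates = [
            {**rule, f: v}
            for f in features if f not in rule
            for v in sorted({x[f] for x, _ in grow_set if f in x})
        ]
        if not candidates:
            break  # no feature left to add
        rule = max(candidates, key=lambda c: foil_gain(rule, c, grow_set))
    return rule
```

On a toy growing set where every red item is positive and every blue item is negative, the first literal chosen is `color == red`, and growth stops immediately because the rule then covers no negatives.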

#### PruneRule

PruneRule considers deleting **any final sequence of conditions** from the rule and chooses the deletion that **maximizes the Rule-Value metric**.

TODO: the algorithm below is not accurate.
> **Input**: the pruning set $\mathcal{P}$ and the unpruned rule $R$.  
> **Output**: the pruned rule $R$.
> 1. For all $(f=v)_i \in R$ starting from the last added one to the first one:
>     1. Get $R_{p}$ by removing $(f=v)_i$ from $R$.
>     1. If $\mathrm{Value}(R_{p}) \geq \mathrm{Value}(R)$: then $R=R_{p}$.

**[IREP Rule-value metric]**: 

$$ \mathrm{Value}(R) = \frac{P(R) + (N - N(R))}{P + N} $$

where $P$ and $N$ are the total numbers of positive and negative instances in the pruning set, $P(R)$ is the number of positive instances covered by $R$, and $N(R)$ is the number of negative instances covered by $R$.
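
As a sketch of the "delete any final sequence of conditions" idea, the code below scores every prefix of the rule with the IREP rule-value metric on the pruning set and keeps the best one. The names `covers`, `irep_value` and `prune_rule` are hypothetical; keeping at least one condition and preferring the shorter rule on ties are assumptions, and the pruning set is assumed to be non-empty.

```python
def covers(rule, x):
    """A rule is an ordered list of (feature, value) literals."""
    return all(x.get(f) == v for f, v in rule)

def irep_value(rule, prune_set):
    """(P(R) + (N - N(R))) / (P + N), measured on the pruning set."""
    total_pos = sum(1 for _, y in prune_set if y)
    total_neg = len(prune_set) - total_pos
    covered_pos = sum(1 for x, y in prune_set if y and covers(rule, x))
    covered_neg = sum(1 for x, y in prune_set if not y and covers(rule, x))
    return (covered_pos + (total_neg - covered_neg)) / (total_pos + total_neg)

def prune_rule(prune_set, rule):
    """Drop the final sequence of literals whose removal maximizes the rule value."""
    if not rule:
        return rule
    # Every prefix corresponds to deleting one final sequence of conditions.
    # Shortest first, so that max() prefers the shorter rule on ties.
    prefixes = [rule[:k] for k in range(1, len(rule) + 1)]
    return max(prefixes, key=lambda r: irep_value(r, prune_set))
```

In the example exercised by the test, the rule `color == red AND shape == round` is cut back to `color == red`, because the second literal only removes covered positives from the pruning set.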

## IREP* as an improved version of IREP

Support for multi-class problems and missing values allows IREP to be applied to a wide range of benchmarks. Cohen further improves his implementation of IREP by changing the stopping condition and the pruning metric.

## RIPPER
RIPPER is an improved version of IREP and the changes are:
1. Use a different equation (RIPPER Rule value) to replace IREP's rule value equation.
2. Replace the stopping condition (the accuracy check in step 2 of the IREP pseudocode above) with the following logic on the dataset $\mathcal{P}$ ($d$ is a hyperparameter with the default value of 64):
    1. Calculate the *total description length* of $R$: $\mathrm{MDL}(R)$.
    2. If $\mathrm{MDL}(R) > \mathrm{MDL_{min}} + d$: break
    3. If $\mathrm{MDL}(R) < \mathrm{MDL_{min}}$: $\mathrm{MDL_{min} = MDL}(R)$
3. The version of IREP with the 2 changes above is called IREP*. After we get a rule set $\mathcal{R}$ from one run of IREP*, we perform the following on the dataset $\mathcal{P}$:
    1. For each $R_{i} \in \mathcal{R}$:
        1. Get the *replacement* $\hat{R}_{i}$ of $R_{i}$ by growing from an empty rule until all instances that satisfy $\hat{R}_{i}$ are positive.
        2. Prune $\hat{R}_{i}$ by minimizing the error of the entire rule set $\hat{\mathcal{R}} = \{R_{1}, \ldots, \hat{R}_{i}, \ldots, R_{n}\}$ (with $\hat{R}_{i}$ in place of $R_{i}$) on $\mathcal{P}$.
        3. Get the *revision* $\bar{R}_{i}$ of $R_{i}$ by growing from $R_{i}$ until all instances that satisfy $\bar{R}_{i}$ are positive.
        4. Prune $\bar{R}_{i}$ by minimizing the error of the entire rule set $\bar{\mathcal{R}} = \{R_{1}, \ldots, \bar{R}_{i}, \ldots, R_{n}\}$ (with $\bar{R}_{i}$ in place of $R_{i}$) on $\mathcal{P}$.
        5. $\mathcal{R} = \underset{\mathcal{R}' \in \{ \mathcal{R}, \hat{\mathcal{R}}, \bar{\mathcal{R}} \}}{\mathrm{argmin}} \sum_{R \in \mathcal{R}'} \mathrm{MDL}(R)$, that is, keep whichever of the three rule sets has the smallest total description length.
4. Run IREP* again to cover remaining positive instances. 
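
The description-length stopping rule in step 2 above can be illustrated in isolation. The sketch assumes the total description length after each added rule has already been computed; `stop_index` is a hypothetical helper, and the input numbers in the usage note are made up.

```python
def stop_index(description_lengths, d=64):
    """Return the index of the first rule rejected by the MDL stopping rule.

    description_lengths[i] is the total description length after adding rule i.
    Returns len(description_lengths) if the rule never triggers.
    """
    mdl_min = float("inf")
    for i, mdl in enumerate(description_lengths):
        if mdl > mdl_min + d:
            return i  # more than d bits above the best length seen so far
        if mdl < mdl_min:
            mdl_min = mdl
    return len(description_lengths)
```

With lengths `[100, 90, 120, 160, 95]` and the default `d = 64`, the minimum so far is 90 when the fourth value arrives, and 160 exceeds 90 + 64, so rule learning stops at index 3.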

**[RIPPER Rule value]**: 

$$ \mathrm{Value}(R) = \frac{P(R) - N(R)}{P(R) + N(R)} $$

where $P(R)$ is the number of positive instances covered by $R$ and $N(R)$ is the number of negative instances covered by $R$.
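
One way to see how the two metrics differ is to evaluate both on two candidate rules. The counts below are illustrative only, and the totals $P = N = 3000$ are an assumption; `irep_value` and `ripper_value` are hypothetical helper names.

```python
def irep_value(p_r, n_r, p_total, n_total):
    """IREP metric: (P(R) + (N - N(R))) / (P + N)."""
    return (p_r + (n_total - n_r)) / (p_total + n_total)

def ripper_value(p_r, n_r):
    """RIPPER metric: (P(R) - N(R)) / (P(R) + N(R))."""
    return (p_r - n_r) / (p_r + n_r)

# Rule A covers 2000 positives and 1000 negatives; rule B covers 1000 and 1.
rule_a = (2000, 1000)
rule_b = (1000, 1)

irep_a = irep_value(*rule_a, 3000, 3000)   # 4000 / 6000
irep_b = irep_value(*rule_b, 3000, 3000)   # 3999 / 6000
ripper_a = ripper_value(*rule_a)           # 1000 / 3000
ripper_b = ripper_value(*rule_b)           # 999 / 1001
```

Under these assumed counts the IREP metric slightly prefers rule A even though a third of what it covers is negative, while the RIPPER metric clearly prefers the far more precise rule B.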

**[Total description length]**: MDL is composed of model description length and exceptions description length. 

$$ \mathrm{MDL}(R) = \mathrm{MDL_{M}}(R) + \mathrm{MDL_{E}}(R) $$

Model description length evaluates the rule itself:

$$ \mathrm{MDL_{M}}(R) = 0.5(k\log_{2}\frac{1}{r} + (n-k) \log_2\frac{1}{1-r} + \lVert k \rVert) $$

where $k$ is the number of features in the rule, $n$ is the number of all features, $r = \frac{k}{n}$, and $\lVert k \rVert = \log_{2}(k)$ is taken as the number of bits needed to encode $k$. The 0.5 factor accounts for possible redundancies.

Exceptions description length evaluates the performance of the rule on the dataset:

$$ \mathrm{MDL_{E}}(R) = \log_{2}{C(R) \choose \mathit{FP}(R)} + \log_{2}{U(R) \choose \mathit{FN}(R)}$$ 

where $C(R) = P(R) + N(R)$ is the number of instances covered by $R$, $U(R)$ is the number of instances not covered by $R$, $\mathit{FP}(R)$ is the number of false positives (negative instances covered by $R$), and $\mathit{FN}(R)$ is the number of false negatives (positive instances not covered by $R$).
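
The model description length above is easy to check numerically. The sketch below follows that formula with $\lVert k \rVert = \log_{2}(k)$; the function names are hypothetical, and `log2_binomial` is only a generic helper for the $\log_{2}{a \choose b}$ terms of the exceptions length.

```python
import math

def model_description_length(k, n):
    """0.5 * (k*log2(1/r) + (n-k)*log2(1/(1-r)) + log2(k)) with r = k/n."""
    if not 0 < k <= n:
        raise ValueError("expected 0 < k <= n")
    r = k / n
    bits = k * math.log2(1 / r) + math.log2(k)
    if k < n:
        # When k == n the (n - k) term vanishes; skipping it avoids dividing by zero.
        bits += (n - k) * math.log2(1 / (1 - r))
    return 0.5 * bits

def log2_binomial(total, chosen):
    """log2 of 'total choose chosen', assuming 0 <= chosen <= total."""
    return math.log2(math.comb(total, chosen))
```

For a rule that uses 2 of 4 features, $r = 0.5$ and the three terms are 2, 2 and 1 bits, so the model description length is $0.5 \times 5 = 2.5$ bits.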