# WEEK 8 - ENTROPY, RULE LEARNING

**Student Name:** Tran Thi Hong Phuong <br>
**Student ID:** s3623386

# Introduction

scikit-learn (`sklearn`) does not include a rule learner, so in this lab you will implement a simplified CN2 algorithm. The algorithm only constructs pre-conditions that contain a single term (or test); that is, a rule pre-condition never contains a conjunction. You will implement functions in Python using simple loops and if-statements. If you are unfamiliar with these, first revise the Python tutorials from Week 1.

# Datasets
You will be looking at two data sets for this lab which you have seen before:

- Sailing days
- Zoo (animal) classification

**Load these data sets as `sailData` and `zooData` respectively, and remove the unnecessary `name` column from the zoo data.**

In [1]:
import pandas as pd

sailData = pd.read_csv('sailing-custom-python.tab', delimiter='\t')
sailData.head()

Unnamed: 0,Outlook,Company,Sailboat,Sail
0,rainy,big,big,yes
1,rainy,big,small,yes
2,rainy,med,big,no
3,rainy,med,small,no
4,sunny,big,big,yes


In [2]:
zooData = pd.read_csv('zoo-python.tab', delimiter='\t')
zooData.drop(['name'], axis=1, inplace=True)
zooData.head()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,Yes,No,No,Yes,No,No,Yes,Yes,Yes,Yes,No,No,4.0,No,No,Yes,mammal
1,Yes,No,No,Yes,No,No,No,Yes,Yes,Yes,No,No,4.0,Yes,No,Yes,mammal
2,No,No,Yes,No,No,Yes,Yes,Yes,Yes,No,No,Yes,0.0,Yes,No,No,fish
3,Yes,No,No,Yes,No,No,Yes,Yes,Yes,Yes,No,No,4.0,No,No,Yes,mammal
4,Yes,No,No,Yes,No,No,Yes,Yes,Yes,Yes,No,No,4.0,Yes,No,Yes,mammal


# Simple Rule Learner

You will develop the simple rule learner over three parts:

1. Entropy calculation function
2. Majority class calculation function
3. Rule learner

## Entropy function

**First you will need a function that calculates the entropy of a data set.** This function takes two parameters, (1) the data set, and (2) the column name of the output/target class. The function should return the entropy of the data set. As a reminder, entropy is:

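$$\mathrm{entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $c$ is the number of distinct classes in $S$ and $p_i$ is the proportion of instances in $S$ that belong to class $i$.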
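
As a quick sanity check of the formula, the following sketch computes entropy directly from a list of class labels using only the standard library. The helper name `label_entropy` and the toy label lists are illustrative assumptions, not part of the lab's required code.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of a non-empty sequence of class labels (hypothetical helper)."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p_i * log2(p_i) over the classes that actually occur
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

balanced = label_entropy(['yes', 'no', 'yes', 'no'])  # two equally likely classes
pure = label_entropy(['yes', 'yes', 'yes'])           # a single class
print(balanced, pure)
```

A balanced two-class list carries one bit of entropy, while a pure list carries none, which is why the rule learner can use an entropy of 0 as its test for purity.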

In [3]:
import math

def entropy(data, target):
    """Return the entropy (in bits) of the `target` column of `data`."""
    etp = 0
    # Unique values of the target column
    values = data[target].value_counts().index
    # For each unique target value
    for value in values:
        # Get instances having the considered target value
        matching = data.loc[data[target] == value]
        # Calculate p_i, the proportion of instances with this value
        p_i = matching.shape[0] / data.shape[0]
        # Accumulate p_i * log2(p_i); the sign is flipped on return
        etp += p_i * math.log(p_i, 2)
    return -etp

In [4]:
print('Entropy of Sail Data:', entropy(sailData, 'Sail'))
print('Entropy of Zoo Data:', entropy(zooData, 'type'))

Entropy of Sail Data: 0.9975025463691153
Entropy of Zoo Data: 2.390559682294039


## Majority Class

**Secondly, you will need to implement a function that returns the most frequent value of the target column, i.e. the majority class.** This code should be very similar to the entropy calculation. Use the following as the definition for your function:

In [5]:
# Using loop
def majority_class(data, target):
    majority = 0
    class_name = ''
    for value in list(data[target].value_counts().index):
        count = data[data[target] == value].shape[0]
        if count > majority:
            majority = count
            class_name = value
    return class_name

In [6]:
print('Majority class of Sail Data:', majority_class(sailData, 'Sail'))
print('Majority class of Zoo Data:', majority_class(zooData, 'type'))

Majority class of Sail Data: yes
Majority class of Zoo Data: mammal


Alternatively, you can investigate how to use `idxmax()`, a method of pandas Series (and DataFrames) that returns the index label of the maximum value.

In [7]:
# Using idxmax()
def majority_class(data, target):
    return data[target].value_counts().idxmax()

print('Majority class of Sail Data:', majority_class(sailData, 'Sail'))
print('Majority class of Zoo Data:', majority_class(zooData, 'type'))

Majority class of Sail Data: yes
Majority class of Zoo Data: mammal
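
When two classes are equally frequent, which one counts as the majority depends on tie-breaking. The sketch below is a standard-library illustration, with `majority_label` as a hypothetical name. `Counter.most_common` returns tied elements in the order they were first encountered, which may differ from how the pandas-based versions above resolve ties.

```python
from collections import Counter

def majority_label(labels):
    """Most frequent label; ties go to the label seen first (hypothetical helper)."""
    # most_common(1) returns a list holding one (label, count) pair
    return Counter(labels).most_common(1)[0][0]

clear_winner = majority_label(['yes', 'no', 'yes'])
tie = majority_label(['no', 'yes'])  # both labels occur once
print(clear_winner, tie)
```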


## Rule Learner

Given the above two functions, it is now possible to implement a simple propositional rule learner. The features of this rule learner are:

- The pre-condition of each rule contains a single term (or test)
- All attributes are treated as categorical, so each test compares an attribute against a single value
- The rules are going to be printed to the command line

In [12]:
def simpler_rule_learner(data, target):
    # Candidate attributes exclude the target column itself
    attributes = [col for col in data.columns if col != target]
    while data.shape[0] > 0:
        best_entropy = entropy(data, target)
        best_attribute = None
        best_value = None
        best_data = data
        # Search for the single test whose subset has the lowest entropy;
        # nothing is selected when the data is already pure
        for attribute in attributes:
            for value in data[attribute].value_counts().index:
                sub_set = data[data[attribute] == value]
                sub_entropy = entropy(sub_set, target)
                if sub_entropy < best_entropy:
                    best_entropy = sub_entropy
                    best_attribute = attribute
                    best_value = value
                    best_data = sub_set
        if best_attribute is None:
            # Pure data, or no test lowers the entropy: print the default rule and stop
            print('otherwise =>', majority_class(data, target))
            break
        print(best_attribute, '=', best_value, '=>', majority_class(best_data, target))
        # Remove the instances covered by the new rule
        data = data.loc[data[best_attribute] != best_value]

In [13]:
simpler_rule_learner(sailData, 'Sail')

Company = big => yes
Outlook = rainy => no
Company = med => yes
Sailboat = big => no
otherwise => yes
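
The printed rules form an ordered list: each rule was learned only on the rows that earlier rules did not cover, so to classify a new day the first rule whose test matches decides the outcome, and `otherwise` is the fallback. A minimal sketch of that reading, with the rules above copied by hand into a hypothetical `SAIL_RULES` list and a hypothetical `classify` helper:

```python
# Rules copied from the output above, in the order they were learned
SAIL_RULES = [
    ('Company', 'big', 'yes'),
    ('Outlook', 'rainy', 'no'),
    ('Company', 'med', 'yes'),
    ('Sailboat', 'big', 'no'),
]
DEFAULT_CLASS = 'yes'  # the "otherwise" rule

def classify(instance, rules=SAIL_RULES, default=DEFAULT_CLASS):
    """Return the class of the first matching rule, else the default."""
    for attribute, value, label in rules:
        if instance.get(attribute) == value:
            return label
    return default

# A row taken from sailData.head(): rainy, med, big
prediction = classify({'Outlook': 'rainy', 'Company': 'med', 'Sailboat': 'big'})
print(prediction)
```

Order matters here: the example row matches both `Outlook = rainy` and `Company = med`, and the earlier rule wins.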


In [14]:
simpler_rule_learner(zooData, 'type')

feathers = Yes => bird
milk = Yes => mammal
hair = Yes => insect
airborne = Yes => insect
fins = Yes => fish
legs = 8.0 => invertebrate
eggs = No => reptile
breathes = No => invertebrate
aquatic = Yes => amphibian
predator = Yes => reptile
backbone = Yes => reptile
legs = 0.0 => invertebrate
otherwise => insect


To demonstrate your understanding of the rule learner, please **write a short paragraph to explain how the implemented algorithm works, i.e., how does it use the two functions `entropy()` and `majority_class()` to come up with the outputs as shown above.**

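> *ANSWER:* The learner builds an ordered list of rules by repeatedly covering part of the data. While rows remain, it computes `entropy()` of the current data set. If the entropy is 0, the data set is pure, so it prints `otherwise => <class>` using `majority_class()` and stops. Otherwise, for every attribute and every value of that attribute, it selects the subset of rows where the attribute equals that value and computes the subset's entropy, keeping the test whose subset has the lowest entropy. It prints that test as a rule whose conclusion is the `majority_class()` of the subset, removes the rows covered by the rule, and repeats the procedure on the remaining rows.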
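
The covering loop can also be sketched without pandas, returning the rules as a list instead of printing them. This is an illustrative variant under stated assumptions, not the lab's required solution: the names `rows_entropy`, `rows_majority` and `learn_rules`, the list-of-dicts data layout, and the toy rows (including the made-up `Wind` attribute) are all hypothetical.

```python
import math
from collections import Counter

def rows_entropy(rows, target):
    """Entropy in bits of the target values in a non-empty list of dicts."""
    total = len(rows)
    counts = Counter(row[target] for row in rows)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def rows_majority(rows, target):
    """Most frequent target value among the rows."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def learn_rules(rows, target):
    """Return ordered (attribute, value, label) rules; the default rule uses None, None."""
    rules = []
    while rows:
        best = None
        best_entropy = rows_entropy(rows, target)
        for attribute in [a for a in rows[0] if a != target]:
            # dict.fromkeys keeps first-seen order, so ties break deterministically
            for value in dict.fromkeys(row[attribute] for row in rows):
                subset = [row for row in rows if row[attribute] == value]
                subset_entropy = rows_entropy(subset, target)
                if subset_entropy < best_entropy:
                    best_entropy = subset_entropy
                    best = (attribute, value, rows_majority(subset, target))
        if best is None:
            # Pure data, or no single test lowers the entropy: default rule
            rules.append((None, None, rows_majority(rows, target)))
            break
        rules.append(best)
        # Keep only the rows not covered by the new rule
        rows = [row for row in rows if row[best[0]] != best[1]]
    return rules

# Made-up toy data, not the lab's sailing data set
toy_rows = [
    {'Outlook': 'sunny', 'Wind': 'low', 'Sail': 'yes'},
    {'Outlook': 'sunny', 'Wind': 'high', 'Sail': 'yes'},
    {'Outlook': 'rainy', 'Wind': 'low', 'Sail': 'yes'},
    {'Outlook': 'rainy', 'Wind': 'high', 'Sail': 'no'},
]
toy_rules = learn_rules(toy_rows, 'Sail')
for rule in toy_rules:
    print(rule)
```

Returning the rules as data rather than printing them makes it possible to reuse them later, for example to classify unseen instances.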