# Probability and Entropy

## 1 Imports and preparation

### 1.1 Imports

In [189]:
import numpy as np
import pandas as pd
from typing import List

### 1.2 Feature class 

In [190]:
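class Feature: 
    """A single column (feature) of a DataFrame, with helpers for level-based queries."""

    def __init__(self, D: pd.DataFrame, f: str):
        self._D = D  # the underlying DataFrame
        self._f = f  # the column name of this feature
    
    def filter(self, l) -> "Feature": 
        # Keep only the rows where this feature has level l
        return Feature(self._D[self._D[self._f] == l], self._f)
    
    def IsHomogenous(self) -> bool:
        # True when every row has the same level
        return self._D[self._f].nunique() == 1
    
    @property
    def Mode(self) -> str:
        # Most frequent level (the first one returned on ties)
        return self._D[self._f].mode().iloc[0]
    
    @property
    def levels(self) -> List[str]:
        # Distinct levels, in order of first appearance
        return self._D[self._f].unique().tolist()
    
    @property
    def rows(self) -> int:
        return self._D.shape[0]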

### 1.3 DataSet class

In [191]:
class DataSet:
    def __init__(self, D: pd.DataFrame, t: str):
        self._D = D
        self._t = t

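    def feature(self, feature) -> Feature: 
        # Wrap any column of the dataset as a Feature
        return Feature(self._D, feature)
    
    def desc(self, feature) -> Feature: 
        # Alias of feature(), named for descriptive (non-target) features
        return self.feature(feature)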
    
    def target(self) -> Feature: 
        return Feature(self._D, self._t)
    
    def filter(self, f, l): 
        return DataSet(self._D[self._D[f] == l], self._t)
    
    @property
    def rows(self) -> int:
        return self._D.shape[0]

    @property
    def df(self): 
        return self._D 

## 2 Examples

### 2.1 Basic probability

Probability is the likelihood of an event. There are several variants of probability: 
- *Probability:* P(x) = |x| / |all events|
- *Joint probability:* P(x, y) = |x & y| / |all events|
- *Conditional probability:* P(x|y) = P(x, y) / P(y)

A joint probability can be recovered from a conditional probability: P(x, y) = P(x|y) × P(y)
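
The three definitions can be checked by counting cards. The following is a minimal sketch, not part of the notebook's own classes: the helper `prob` and the predicates `is_ace` and `is_heart` are names assumed here for illustration, and `Fraction` is used so the results stay exact.

```python
from fractions import Fraction
from itertools import product

values = ['Ace', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven',
          'Eight', 'Nine', 'Ten', 'Jack', 'Queen', 'King']
suits = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
deck = list(product(values, suits))  # 52 (value, suit) tuples


def prob(event) -> Fraction:
    """P(event) = |cards matching the event| / |all cards| (hypothetical helper)."""
    matches = [card for card in deck if event(card)]
    return Fraction(len(matches), len(deck))


def is_ace(card) -> bool:
    return card[0] == 'Ace'


def is_heart(card) -> bool:
    return card[1] == 'Hearts'


p_heart = prob(is_heart)                                         # probability
p_ace_and_heart = prob(lambda c: is_ace(c) and is_heart(c))      # joint probability
p_ace_given_heart = p_ace_and_heart / p_heart                    # conditional probability

print(p_heart, p_ace_and_heart, p_ace_given_heart)  # 1/4 1/52 1/13

# The joint probability is recovered from the conditional one
assert p_ace_given_heart * p_heart == p_ace_and_heart
```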


In [192]:
# How does probability work, e.g. probability per playing card
print(f"Chance of getting a single card: {1 / 52}")
print(f"Chance of getting a specific value: {1 / 13}")

Chance of getting a single card: 0.019230769230769232
Chance of getting a specific value: 0.07692307692307693


### 2.2 Calculating entropy

Entropy measures the level of impurity or uncertainty in a dataset. It is the characteristic of features that ML algorithms will try to minimize. 

Terminology:
- **f** (feature) a column in the dataset
- **t** (target feature) the feature we're targeting
- **l** (level) a specific value in the feature, like 'Clubs' in Category
- **D** (DataSet) the full dataset

Functions:
- **P** (Probability) the probability function
- **H** (Entropy) the entropy function
- **REM** (Remaining entropy) the entropy left after splitting on a feature



In [193]:
# Uniform distribution: each of the 52 cards is equally likely
probability_distribution = np.full(52, 1 / 52)

shannon_entropy = -np.sum(probability_distribution * np.log2(probability_distribution))
print(f'Total entropy: {round(shannon_entropy, 3)} bits')

Total entropy: 5.7 bits


To calculate entropy from data rather than from a hand-written distribution, we create a dataset containing every card in a deck of cards.

In [194]:
values = ['Ace', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Ten', 'Jack', 'Queen', 'King']
suits = ['Hearts', 'Diamonds', 'Clubs', 'Spades']

# One [value, suit] row per card, grouped by suit: 13 values x 4 suits = 52 rows
data = [[value, suit] for suit in suits for value in values]

playing_cards = pd.DataFrame(data, columns=['Value', 'Suit'])
t = 'Suit'

D = DataSet(playing_cards, t)

#### Probability
The probability function P(l, f) calculates the probability of a level l within a feature f, for example the probability that the suit is 'Clubs'. 

In [195]:
def P(l, f: Feature) -> float:
    return f.filter(l).rows / f.rows

# Probability per Suit
f = D.feature('Suit')
print(f"Probability of Clubs is: { P('Clubs', f) }")

Probability of Clubs is: 0.25


The entropy function uses the probability function to calculate the entropy of a column. A lower entropy means a purer, more predictable column, so we want our entropy to be as low as possible. 

$$
H(t, D) = -\sum_{l \in \{\clubsuit, \diamondsuit, \heartsuit, \spadesuit\}} P(t = l) \times \log_2 P(t = l)
$$
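
To get a feel for the formula, it helps to compare a few distributions. This is a minimal standalone sketch: the function name `entropy` and the skewed distribution are assumptions made for illustration, not part of the notebook.

```python
import math
from typing import Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy in bits; levels with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


uniform_suits = entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely suits
skewed_suits = entropy([0.7, 0.1, 0.1, 0.1])       # one suit dominates (made-up numbers)
single_suit = entropy([1.0])                       # only one suit left

print(uniform_suits)           # 2.0
print(round(skewed_suits, 2))  # lower than the uniform case
print(single_suit)             # 0.0
```

The uniform distribution has the highest entropy, and the entropy drops to zero once only a single level remains.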


In [196]:
def H(f: str, D: DataSet) -> float:
    t = D.feature(f)    

    # probability of each level, computed once per level
    probabilities = [ P(l, t) for l in t.levels ]

    # entropy per level: P(level) x log2(P(level))
    entropy_per_level = [ p * np.log2(p) for p in probabilities ]
    
    return -sum(entropy_per_level)

# Entropy based on a descriptive feature
print(f"The calculated entropy of 'Suit' is { H('Suit', D) }")

The calculated entropy of 'Suit' is 2.0


#### Remaining entropy
The REM function calculates the entropy that remains after the dataset is split on a specific descriptive feature. The ID3 algorithm uses this function to determine which features are most beneficial to put high in the tree. 

$$
\mathrm{REM}(d, D) = \sum_{l \in \mathrm{levels}(d)} \frac{|D_{d=l}|}{|D|} \times H(t, D_{d=l})
$$
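
A feature that does reduce the entropy makes the formula easier to read. The sketch below is standalone and uses plain dictionaries instead of the notebook's classes. It adds a hypothetical 'Color' column (red for hearts and diamonds, black for clubs and spades), which is an assumption for illustration only; the helper names are also made up here.

```python
import math
from collections import Counter

values = ['Ace', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven',
          'Eight', 'Nine', 'Ten', 'Jack', 'Queen', 'King']
color_of = {'Hearts': 'Red', 'Diamonds': 'Red', 'Clubs': 'Black', 'Spades': 'Black'}
rows = [{'Value': v, 'Suit': s, 'Color': c} for s, c in color_of.items() for v in values]


def entropy(rows, target):
    """Entropy (in bits) of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    return sum(-(n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())


def remaining_entropy(rows, target, descriptive):
    """Weighted sum of the target entropy inside each level of the descriptive column."""
    rem = 0.0
    for level in {row[descriptive] for row in rows}:
        partition = [row for row in rows if row[descriptive] == level]
        rem += len(partition) / len(rows) * entropy(partition, target)
    return rem


rem_color = remaining_entropy(rows, 'Suit', 'Color')
rem_value = remaining_entropy(rows, 'Suit', 'Value')

print(rem_color)            # 1.0: each color still leaves two possible suits
print(round(rem_value, 6))  # 2.0: the value tells us nothing about the suit
```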

In [197]:
def REM(target_feature: str, desc_feature: str, D: DataSet) -> float: 
    d = D.feature(desc_feature)
    
    # sum ( weight of level * entropy of the target within that level )
    rem_entropy = 0
    for l in d.levels: 
        rem_entropy += P(l, d) * H(target_feature, D.filter(desc_feature, l))
    
    return rem_entropy

print(f"The calculated REM after 'Value' is: { REM('Suit', 'Value', D) }")

The calculated REM after 'Value' is: 1.9999999999999996


#### Information Gain
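Finally, the information gain (IG) is the difference between the entropy of the full dataset and the entropy that remains after splitting on a specific descriptive feature: IG(d, D) = H(t, D) - REM(d, D). For the deck of cards, knowing the 'Value' says nothing about the 'Suit', so the gain is zero apart from floating-point rounding.

**Drawback**: IG favors features with many levels, because they split the data into many very small subsets that tend to look pure simply because they contain few rows.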
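
The drawback can be made concrete with a feature that has a distinct level for every row. The following standalone sketch assumes a hypothetical `card_id` column (one unique number per card) and uses plain lists instead of the notebook's classes; the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    """Entropy of the target minus the weighted entropy inside each feature level."""
    partitions = defaultdict(list)
    for level, label in zip(feature, target):
        partitions[level].append(label)
    rem = sum(len(p) / len(target) * entropy(p) for p in partitions.values())
    return entropy(target) - rem


suit = [s for s in ('Hearts', 'Diamonds', 'Clubs', 'Spades') for _ in range(13)]
value = list(range(13)) * 4  # rank index 0..12, repeated for each suit
card_id = list(range(52))    # hypothetical unique identifier per row

ig_value = information_gain(value, suit)
ig_card_id = information_gain(card_id, suit)

print(round(ig_value, 6))  # 0.0
print(ig_card_id)          # 2.0
```

The identifier reaches the maximum possible gain because every partition holds exactly one card, yet it would be useless for predicting the suit of an unseen card.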

In [198]:
def IG(target_feature: str, desc_feature: str, D: DataSet) -> float:  
    entropy = H(target_feature, D)
    rem = REM(target_feature, desc_feature, D)
    return round(entropy - rem, 16)

print(f"The Information Gain of 'Value' is: {IG('Suit', 'Value', D):.16f}")

The Information Gain of 'Value' is: 0.0000000000000004
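
The tiny non-zero result is floating-point residue from summing thirteen weighted entropies, not real information. One way to handle this, sketched here under the assumption that a tolerance of 1e-9 is acceptable for this data, is to compare against zero with `math.isclose` instead of testing for exact equality. The helper name `is_effectively_zero` is made up for this example.

```python
import math

entropy = 2.0
rem = 1.9999999999999996  # the REM value printed earlier in this notebook


def is_effectively_zero(x: float, tolerance: float = 1e-9) -> bool:
    """True when x differs from zero by no more than the (assumed) tolerance."""
    return math.isclose(x, 0.0, abs_tol=tolerance)


gain = entropy - rem
print(gain == 0.0)                # False: exact comparison is misleading here
print(is_effectively_zero(gain))  # True
```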


### 2.3 Alternative impurity metrics

#### Information Gain Ratio
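The information gain ratio divides the information gain of a feature by the amount of information used to determine the value of that feature, which is the entropy of the descriptive feature itself. This penalizes features with many levels.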

$$
GR(d, D) = \frac{IG(d, D)}{-\sum_{l \in \mathrm{levels}(d)} P(d = l) \times \log_2 P(d = l)}
$$
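
The effect of the denominator shows up when a many-level feature is compared with a two-level one. This standalone sketch assumes two hypothetical columns that are not in the notebook's dataset: `card_id` (unique per row) and `color` (red or black). The function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    partitions = defaultdict(list)
    for level, label in zip(feature, target):
        partitions[level].append(label)
    rem = sum(len(p) / len(target) * entropy(p) for p in partitions.values())
    return entropy(target) - rem


def gain_ratio(feature, target):
    """Information gain divided by the entropy of the descriptive feature."""
    split_info = entropy(feature)
    if split_info == 0:  # a single-level feature cannot split the data
        return 0.0
    return information_gain(feature, target) / split_info


suit = [s for s in ('Hearts', 'Diamonds', 'Clubs', 'Spades') for _ in range(13)]
color = ['Red'] * 26 + ['Black'] * 26  # hypothetical: hearts and diamonds are red
card_id = list(range(52))              # hypothetical unique identifier per row

gr_color = gain_ratio(color, suit)
gr_card_id = gain_ratio(card_id, suit)

print(round(gr_color, 3))    # 1.0
print(round(gr_card_id, 3))  # 0.351
```

Although the identifier has the larger information gain (2.0 against 1.0), its split information of log2(52) bits pulls its ratio well below that of the color.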

In [199]:
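def GR(target_feature: str, desc_feature: str, D: DataSet) -> float:  
    ig = IG(target_feature, desc_feature, D)
    # split information: the entropy of the descriptive feature itself
    split_info = H(desc_feature, D)
    # a feature with a single level has zero split information and cannot split the data
    if split_info == 0:
        return 0.0
    return ig / split_info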

print(f"The IGR of 'Value' is: {GR('Suit', 'Value', D):.16f}")

The IGR of 'Value' is: 0.0000000000000001


#### Gini index
The Gini index is the expected fraction of misclassified instances if predictions were made based only on the distribution of the target levels in the dataset. 

For example, four target levels with equal likelihood give a Gini index of 1 - 4 × 0.25² = 0.75, i.e. a 75% chance of a mismatch.

$$
\mathrm{GINI}(t, D) = 1 - \sum_{l \in \mathrm{levels}(t)} P(t = l)^2
$$

In [200]:
def GINI(f: str, D: DataSet) -> float:
    t = D.feature(f) 
    # squared probability of each level of the feature
    squared_probabilities = [ P(l, t) ** 2 for l in t.levels ]
    return 1 - sum(squared_probabilities)
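
The 0.75 figure can be verified with a few lines that do not depend on the classes above. This is a minimal sketch; the function name `gini` and the sample label lists are assumptions for illustration.

```python
from collections import Counter


def gini(labels) -> float:
    """Gini index: 1 minus the sum of squared level probabilities."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


four_equal = gini(['Hearts', 'Diamonds', 'Clubs', 'Spades'])  # four equally likely levels
two_equal = gini(['Red', 'Black'])                            # two equally likely levels
pure = gini(['Hearts'] * 4)                                   # a single level

print(four_equal)  # 0.75
print(two_equal)   # 0.5
print(pure)        # 0.0
```

Like entropy, the Gini index is zero for a pure set and grows as the levels become more evenly mixed.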