<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 5: Probability and Statistics Applications</a>
## <a name="0">Lab 5.3: Entropy</a>

 1. <a href="#1">Entropy's definition</a> 
 2. <a href="#2">Entropy in Decision Trees</a> 
 
In this short lab we'll recall the definition of Entropy and will see in practice how Entropy is used for training a decision tree classifier.

In [None]:
# Standard libraries
# Upgrade dependencies
#!pip install --upgrade pip
#!pip install --upgrade scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import entropy
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

import warnings
warnings.filterwarnings("ignore")

## <a name="1">1. Entropy's definition</a> 
(<a href="#0">Go to top</a>)

For any random variable $X$ that follows a probability distribution $P$ with a probability density function (PDF) or a probability mass function (PMF) $p(x)$, we measure the expected amount of information through *entropy* (or *Shannon entropy*) as follows:

$$H(X) = - E_{x \sim P} [\log p(x)].$$

To be specific, if $X$ is discrete, $$H(X) = - \sum_i p_i \log p_i \textrm{, where } p_i = P(X = x_i).$$

Otherwise, if $X$ is continuous, we also refer to entropy as *differential entropy*:

$$H(X) = - \int_x p(x) \log p(x) \; dx.$$

We can calculate entropy based on this definition.

In [None]:
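def calc_entropy(p):
    # Shannon entropy (in bits) of a discrete distribution given as an array of probabilities
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention 0 * log(0) = 0, so zero-probability outcomes are skipped
    return -np.sum(p * np.log2(p))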

calc_entropy(np.array([0.1, 0.5, 0.1, 0.3]))

We can also use the <code>scipy.stats.entropy</code> function as follows.

In [None]:
entropy(np.array([0.1, 0.5, 0.1, 0.3]), base=2)
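
The continuous case can be checked numerically in the same spirit. The sketch below is only an illustration: the helper name `gaussian_differential_entropy`, the choice of a zero-mean Gaussian, and the integration grid are our assumptions, not part of the lab. It approximates the integral $-\int p(x) \log p(x) \, dx$ with a Riemann sum and compares it with the known closed form $\frac{1}{2} \log(2 \pi e \sigma^2)$ (in nats) and with the value reported by <code>scipy.stats.norm</code>.

```python
import numpy as np
from scipy.stats import norm

def gaussian_differential_entropy(sigma, num_points=200_001):
    """Approximate -integral of p(x) * ln p(x) for N(0, sigma^2) with a Riemann sum (nats)."""
    # Hypothetical helper: the grid covers +/- 10 standard deviations, where the tails are negligible
    x = np.linspace(-10 * sigma, 10 * sigma, num_points)
    dx = x[1] - x[0]
    p = norm.pdf(x, scale=sigma)
    log_p = norm.logpdf(x, scale=sigma)
    return float(-np.sum(p * log_p) * dx)

sigma = 2.0
h_numeric = gaussian_differential_entropy(sigma)
h_closed_form = float(0.5 * np.log(2 * np.pi * np.e * sigma**2))
h_scipy = float(norm(scale=sigma).entropy())
print(f"numeric: {h_numeric:.6f}, closed form: {h_closed_form:.6f}, scipy: {h_scipy:.6f}")

# Unlike discrete entropy, differential entropy can be negative for narrow distributions
h_narrow = float(norm(scale=0.1).entropy())
print(f"differential entropy for sigma=0.1: {h_narrow:.6f}")
```

Note that these values are in nats (natural logarithm), whereas the discrete examples above use <code>base=2</code> and are therefore in bits.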

### Interpretation

You may be curious: in the entropy definition, why do we use an expectation of a negative logarithm? Here are some intuitions.

First, why do we use a *logarithm* function $\log$? Suppose that $p(x) = f_1(x) f_2(x) \ldots f_n(x)$, where the component functions $f_i(x)$ are independent of each other. This means that each $f_i(x)$ contributes independently to the total information obtained from $p(x)$. We want the entropy formula to be additive over independent random variables. Luckily, $\log$ naturally turns a product of probability distributions into a sum of the individual terms.
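
As a quick sanity check of this additivity, here is a minimal sketch with two independent discrete variables. The two distributions are arbitrary choices made for illustration. Because the variables are independent, the joint PMF is the outer product of the marginals, and the joint entropy should equal the sum of the individual entropies.

```python
import numpy as np
from scipy.stats import entropy

# Two illustrative marginal distributions (arbitrary values chosen for this sketch)
p_x = np.array([0.5, 0.5])
p_y = np.array([0.1, 0.5, 0.1, 0.3])

# Independence means p(x, y) = p(x) * p(y), i.e. the outer product of the marginals
joint = np.outer(p_x, p_y)

h_x = entropy(p_x, base=2)
h_y = entropy(p_y, base=2)
h_xy = entropy(joint.ravel(), base=2)

print(f"H(X) + H(Y) = {h_x + h_y:.6f}")
print(f"H(X, Y)     = {h_xy:.6f}")
```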

Next, why do we use a *negative* $\log$? Intuitively, more frequent events should contain less information than less common events, since we often gain more information from an unusual case than from an ordinary one. However, $\log$ is monotonically increasing in the probability, and it is non-positive for all values in $(0, 1]$. We need a monotonically decreasing relationship between the probability of events and the information they carry, which should ideally be non-negative. Hence, we add a negative sign in front of the $\log$ function.
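
The effect of the sign flip is easy to see numerically. In this small sketch (the probability values are arbitrary), $-\log_2 p$ is non-negative and shrinks as the event becomes more likely, reaching zero for a certain event.

```python
import numpy as np

# Illustrative event probabilities, from rare to certain
probs = np.array([0.01, 0.1, 0.5, 0.9, 1.0])

# -log2(p) written as log2(1 / p)
self_info = np.log2(1.0 / probs)

for p, info in zip(probs, self_info):
    print(f"p = {p:4.2f}  ->  -log2(p) = {info:.3f} bits")
```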

Last, where does the *expectation* function come from? Consider a random variable $X$. We can interpret the self-information $I(x) = -\log p(x)$ as the amount of *surprise* we have at seeing a particular outcome. Indeed, as the probability approaches zero, the surprise becomes infinite. Similarly, we can interpret the entropy as the average amount of surprise from observing $X$. For example, imagine that a slot machine emits statistically independent symbols $\{s_1, \ldots, s_k\}$ with probabilities $\{p_1, \ldots, p_k\}$ respectively. Then the entropy of this system equals the average self-information from observing each output, i.e.,

$$H(S) = \sum_i {p_i \cdot I(s_i)} = - \sum_i {p_i \cdot \log p_i}.$$
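
To make the "average surprise" reading concrete, the following sketch simulates such a machine. The symbol probabilities, the sample size and the random seed are assumptions chosen for illustration. It draws many independent symbols, averages the self-information of the observed outcomes, and compares the result with the entropy computed from the formula.

```python
import numpy as np

# Assumed symbol probabilities for an illustrative 4-symbol machine
p = np.array([0.1, 0.5, 0.1, 0.3])

# Self-information of each symbol in bits, and the exact entropy as its expectation
self_information = np.log2(1.0 / p)
exact_entropy = float(np.sum(p * self_information))

# Simulate independent draws and average the surprise of what was observed
rng = np.random.default_rng(seed=0)
samples = rng.choice(len(p), size=100_000, p=p)
average_surprise = float(self_information[samples].mean())

print(f"Exact entropy:               {exact_entropy:.4f} bits")
print(f"Average surprise (simulated): {average_surprise:.4f} bits")
```

The two numbers should agree up to sampling noise, which shrinks as the number of draws grows.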

### Exercise

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Compute the entropy of the probability distribution explored in Lecture 3: $X \sim \text{sum of rolling two dice}$. The probability distribution is given below.</p>
                    <table>
  <thead>
    <tr>
      <th scope="col">$x$</th>
      <th scope="col">$PMF$</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>$2$</td>
      <td>$\displaystyle{\frac{1}{36}}$</td>
    </tr>
          <tr>
      <td>$3$</td>
      <td>$\displaystyle{\frac{2}{36}}$</td>
    </tr>
          <tr>
      <td>$4$</td>
      <td>$\displaystyle{\frac{3}{36}}$</td>
    </tr>
          <tr>
      <td>$5$</td>
      <td>$\displaystyle{\frac{4}{36}}$</td>
    </tr>
          <tr>
      <td>$6$</td>
      <td>$\displaystyle{\frac{5}{36}}$</td>
    </tr>
          <tr>
      <td>$7$</td>
      <td>$\displaystyle{\frac{6}{36}}$</td>
    </tr>
          <tr>
      <td>$8$</td>
      <td>$\displaystyle{\frac{5}{36}}$</td>
    </tr>
          <tr>
      <td>$9$</td>
      <td>$\displaystyle{\frac{4}{36}}$</td>
    </tr>
          <tr>
      <td>$10$</td>
      <td>$\displaystyle{\frac{3}{36}}$</td>
    </tr>
          <tr>
      <td>$11$</td>
      <td>$\displaystyle{\frac{2}{36}}$</td>
    </tr>
          <tr>
      <td>$12$</td>
      <td>$\displaystyle{\frac{1}{36}}$</td>
    </tr>
  </tbody>
      </table>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab53_ex1_solutions.txt

## <a name="2">2. Entropy in Decision Trees</a> 
(<a href="#0">Go to top</a>)

Entropy is used in practice as one of the criteria to train decision trees. We will demonstrate this via a simple example based on a small loans data set of $14$ records.

In [None]:
# Load the dataset and list its columns
df = pd.read_csv('../../data/MATH_Lecture_5_Loans.csv')
df.columns

In [None]:
df.head(5)

This loans data set has a number of features, and the label is the *Risk* column, which can take the values high, medium and low.
First, let's do some data cleaning to deal with missing data and to convert the categorical features into numerical ones, since the <code>sklearn</code> decision tree classifiers currently accept only numerical data.

In [None]:
# Data Cleansing / Preparation
df["Risk"] = df["Risk"].map({"high":2, "medium":1, "low":0})
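# Note: "bad" and "good" both map to 1, so this feature only records whether the credit history is known
df["Credit History"] = df["Credit History"].map({"bad":1, "unknown":0, "good":1})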
df["Income_high"] = df["Income"].map({"< NOK 15K":0,"NOK 15K – NOK 35K":0,"> NOK 35K":1})
df.drop(["No", "Income"], axis=1, inplace=True)
df = pd.get_dummies(data=df,columns=["Credit History", "Debt", "Collateral"], drop_first=True)

Now we have transformed all features as well as the label into numerical values.

In [None]:
df.head(3)

In [None]:
df.rename(columns={"Credit History_1": "Credit_History"}, inplace=True)

We will now prepare the data set of features (matrix $X$) and the labels (vector $y$) to train a decision tree classifier from <code>sklearn</code> to predict the *Risk* based on all the other features.
Note that we specify <code>criterion='entropy'</code> while instantiating the <code>DecisionTreeClassifier</code> object to ensure the decision tree training uses the Entropy metric. There are other alternatives available such as the Gini impurity, which is the default criterion.
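
To get a feel for how the two criteria compare, here is a minimal sketch that evaluates both on the same class distributions. The helper names `entropy_bits` and `gini_impurity` are ours, and the Gini impurity is taken in its standard form $1 - \sum_i p_i^2$. The first distribution is the *Risk* label distribution of this data set ($5/14$, $3/14$, $6/14$), the second is a pure node.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def gini_impurity(p):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

risk_distribution = np.array([5, 3, 6]) / 14   # label distribution used in this lab
pure_node = np.array([1.0, 0.0, 0.0])          # all records in one class

for name, dist in [("Risk labels", risk_distribution), ("pure node", pure_node)]:
    print(f"{name}: entropy = {entropy_bits(dist):.3f}, gini = {gini_impurity(dist):.3f}")
```

Both measures are zero for a pure node and grow as the classes become more mixed, which is why either can serve as a splitting criterion.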

In [None]:
X = df.drop(['Risk'], axis=1, inplace=False)
y = df['Risk']

dtree=DecisionTreeClassifier(criterion='entropy')

In [None]:
dtree.fit(X,y)

The training is complete. Let's now visualise the decision tree.

In [None]:
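fig = plt.figure(figsize=(20, 10))
# Passing a plain list avoids a parameter validation error raised by some scikit-learn versions
tree.plot_tree(dtree, feature_names=list(X.columns), filled=True)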
plt.show()

To demonstrate the application of the Entropy measure in practice, we can look at the first node of the tree. At this point all features are candidates for splitting and *Income_high* is chosen. This choice is due to the fact that *Income_high* yields the highest **Information Gain** compared to the other features *Credit_History*, *Debt_low* and *Collateral_yes*.

When training decision trees, Information Gain is computed for every candidate split as the difference between the entropy of the node before the split and the weighted average entropy of the child nodes after the split. Let's see how this calculation works.
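
Written as a formula, and using notation that we introduce here only for illustration ($S$ for the set of records at a node, $A$ for a candidate splitting feature, and $S_v$ for the subset of records where $A$ takes the value $v$), the information gain of splitting $S$ by $A$ is

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, H(S_v),$$

where $H(\cdot)$ is the entropy of the label distribution within the given set of records.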

In [None]:
# Adding a dummy column of ones for counting purposes
df['dummy_col'] = np.ones(df['Risk'].size)

In [None]:
df

In [None]:
for c in X.columns:
    gr_df = df[[c, 'Risk', 'dummy_col']].groupby([c, 'Risk']).sum().reset_index()

    wh = 0
    print(gr_df)
    for v in gr_df[c].unique():
        w = gr_df[gr_df[c]==v]['dummy_col']/gr_df[gr_df[c]==v]['dummy_col'].sum()
        h = entropy(w, base=2)
        print(f"The entropy value for {c}={v} is {h}")
        wh += h * (gr_df[gr_df[c]==v]['dummy_col'].sum()/14)
    
    print(f"The weighted entropy value for {c} is {wh}")
    
    # Entropy before the split minus weighted average of the entropy for each value of the feature
    ig = entropy([5/14, 3/14, 6/14], base=2) - wh
    
    print(f"The information gain splitting by {c} is {ig}")
    print('\n \n')

The highest information gain at the first split is obtained by using *Income_high* as the splitting feature. Based on the *Risk* label distribution alone, the entropy before the split is $\text{entropy}{([\displaystyle{\frac{5}{14}}, \displaystyle{\frac{3}{14}}, \displaystyle{\frac{6}{14}}], \text{base}=2)=1.531}$, while the entropy after each split is computed as the weighted average of the individual entropies for each value of the splitting feature, as explained in the slides and implemented in the above code snippet.
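
The same calculation can be wrapped into a small reusable helper. The sketch below is one possible way to do it, not the method used by <code>sklearn</code> internally. The function names `label_entropy` and `information_gain` and the toy arrays are hypothetical and chosen so that the results are easy to verify by hand.

```python
import numpy as np
from scipy.stats import entropy

def label_entropy(labels):
    """Entropy (bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    # scipy.stats.entropy normalises the counts into probabilities
    return float(entropy(counts, base=2))

def information_gain(feature, labels):
    """Parent entropy minus the weighted average entropy over the feature's values."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted_entropy = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of records that take this feature value
        weighted_entropy += mask.mean() * label_entropy(labels[mask])
    return label_entropy(labels) - weighted_entropy

# Toy data (hypothetical): two balanced classes
toy_labels = np.array([0, 0, 1, 1])
perfect_feature = np.array([0, 0, 1, 1])   # separates the classes completely
useless_feature = np.array([0, 1, 0, 1])   # leaves both children as mixed as the parent

ig_perfect = information_gain(perfect_feature, toy_labels)
ig_useless = information_gain(useless_feature, toy_labels)
print(f"Information gain (perfect feature): {ig_perfect:.3f}")
print(f"Information gain (useless feature): {ig_useless:.3f}")
```

A feature that separates the classes perfectly recovers the full parent entropy as gain, while a feature that leaves the label mix unchanged yields no gain at all.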

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 5.3: Entropy of Lecture 5: Probability and Statistics Applications of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>