# Project 1 - Information measures

The goal of this first project is to get accustomed to information and uncertainty measures. We ask you to write a brief report (PDF format) collecting your answers to the different questions. All code must be written in Python inside this Jupyter Notebook. No other code file will be accepted. Note that you cannot change the content of locked cells or import any Python library other than the ones already imported (numpy and pandas).

## Implementation

In this project, you will need to use information measures to answer several questions. Therefore, in this first part, you are asked to write several functions that implement some of the main measures seen in the first theoretical lectures. Remember that you need to fill in this Jupyter Notebook to answer these questions. Pay particular attention to the required output format of each function.

In [17]:
# [Locked Cell] You can not import any extra Python library in this Notebook.
import numpy as np
import pandas as pd

### Question 1

Write a function entropy that computes the entropy $\mathcal{H(X)}$ of a random variable $\mathcal{X}$ from its probability distribution $P_\mathcal{X} = (p_1, p_2, \dots, p_n)$. Give the mathematical formula that you are using and explain the key parts of your implementation. Intuitively, what is measured by the entropy?

In [18]:
def entropy(Px):
    """
    Computes the entropy from the marginal probability distribution. 
    Arguments:
    ----------
    - Px :  Marginal probability distribution of the random 
            variable X in a numpy array where Px[i]=P(X=i)
    Return:
    -------
    - The entropy of X (H(X)) as a number (integer, float or double).
    """
    # convert probabilities to a numpy array
    Px = np.array(Px)
    
    # set 0*log(0) to 0
    Px[ Px==0 ] = 1
    
    # calculate entropy
    entropy = -np.sum(Px * np.log(Px)) / np.log(2)
    
    return entropy
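
The formula $\mathcal{H(X)} = -\sum_i p_i \log_2 p_i$ can be checked on distributions whose entropy is known in closed form. The following is a minimal sketch, not the graded implementation: `entropy_bits` is a hypothetical helper that drops zero-probability terms with a mask instead of overwriting them.

```python
import numpy as np

def entropy_bits(p):
    """Hypothetical helper: H(X) = -sum_i p_i * log2(p_i), in bits."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero terms: 0*log(0) is taken as 0
    # adding 0.0 turns a possible -0.0 into 0.0
    return float(-np.sum(nz * np.log2(nz))) + 0.0

h_coin = entropy_bits([0.5, 0.5])       # fair coin
h_uniform4 = entropy_bits([0.25] * 4)   # uniform over 4 values
h_certain = entropy_bits([1.0, 0.0])    # deterministic outcome
print(h_coin, h_uniform4, h_certain)
```

A uniform distribution over $n$ values should give $\log_2 n$ bits, and a deterministic variable should give 0 bits.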

### Question 2

Write a function joint_entropy that computes the joint entropy $\mathcal{H(X,Y)}$ of two discrete random variables $\mathcal{X}$ and $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. Compare the entropy and joint_entropy functions (and their corresponding formulas). What do you notice?

In [19]:
import numpy as np


def joint_entropy(Pxy):
    """
    Computes the joint entropy from the joint probability distribution.  
    Arguments:
    ----------
    - Pxy:  joint probability distribution of X and Y 
            in a 2-D numpy array where Pxy[i][j]=P(X=i,Y=j)
    Return:
    -------
    - The joint entropy H(X,Y) as a number (integer, float or double).
    """
    # flatten a copy of the joint distribution, so that the
    # caller's array is not modified by the assignment below
    Pxy = np.array(Pxy, dtype=float).ravel()

    # set 0*log(0) to 0
    Pxy[Pxy == 0] = 1

    # calculate joint entropy
    joint_entropy = -np.sum(Pxy * np.log(Pxy)) / np.log(2)

    return joint_entropy
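
One property that may help when comparing the two functions: the joint entropy is the same sum taken over every cell of the joint table, and for independent variables it splits into $\mathcal{H(X)} + \mathcal{H(Y)}$. A small sketch under that independence assumption (the helper name and the example distributions are illustrative only):

```python
import numpy as np

def entropy_bits(p):
    """Hypothetical helper: entropy in bits of a table of any dimension."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]  # 0*log(0) is taken as 0
    return float(-np.sum(nz * np.log2(nz)))

px = np.array([0.5, 0.5])
py = np.array([0.25, 0.75])
pxy_indep = np.outer(px, py)   # P(X=i, Y=j) = P(X=i) * P(Y=j)

h_joint = entropy_bits(pxy_indep)
h_sum = entropy_bits(px) + entropy_bits(py)
print(h_joint, h_sum)
```

When the variables are dependent, the joint entropy is smaller than the sum of the marginal entropies.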

### Question 3

Write a function conditional_entropy that computes the conditional entropy $\mathcal{H(X|Y)}$ of a discrete random variable $\mathcal{X}$ given another discrete random variable $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. Describe an equivalent way of computing that quantity.

In [20]:
import numpy as np

def conditional_entropy(Pxy):
    """
    Computes the conditional entropy from the joint probability distribution.
    Arguments:
    ----------
    - Pxy:  joint probability distribution of X and Y 
            in a 2-D numpy array where Pxy[i][j]=P(X=i,Y=j)
    Return:
    -------
    - The conditional entropy H(X|Y) as a number (integer, float or double)
    """
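    Pxy = np.asarray(Pxy, dtype=float)

    # marginal distribution of Y: sum over X (axis 0)
    Py = np.sum(Pxy, axis=0)

    # avoid 0/0 in columns where P(Y=j) = 0 (all Pxy[i][j] are 0 there)
    Py[Py == 0] = 1

    # P(X=i|Y=j); zero cells get ratio 1 so that 0*log2(0) counts as 0
    ratio = np.where(Pxy > 0, Pxy / Py, 1.0)

    # H(X|Y) = -sum_{i,j} P(i,j) * log2(P(i,j) / P(j))
    conditional_entropy = -np.sum(Pxy * np.log2(ratio))

    return conditional_entropy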
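
A possible equivalent computation uses the chain rule, $\mathcal{H(X|Y)} = \mathcal{H(X,Y)} - \mathcal{H(Y)}$. The sketch below compares it with the direct formula on a small joint table that contains a zero cell. Both helper names and the table are assumptions made for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Hypothetical helper: entropy in bits of a table of any dimension."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def cond_entropy_direct(pxy):
    """Hypothetical helper: -sum P(i,j) * log2(P(i,j) / P(j))."""
    pxy = np.asarray(pxy, dtype=float)
    py = np.broadcast_to(pxy.sum(axis=0), pxy.shape)
    mask = pxy > 0  # P(j) > 0 wherever P(i,j) > 0
    return float(-np.sum(pxy[mask] * np.log2(pxy[mask] / py[mask])))

pxy = np.array([[0.25, 0.25],
                [0.00, 0.50]])   # rows: X, columns: Y

h_direct = cond_entropy_direct(pxy)
h_chain = entropy_bits(pxy) - entropy_bits(pxy.sum(axis=0))
print(h_direct, h_chain)
```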

### Question 4

Write a function mutual_information that computes the mutual information $\mathcal{I(X;Y)}$ between two discrete random variables $\mathcal{X}$ and $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. What can you deduce from the mutual information $\mathcal{I(X;Y)}$ on the relationship between $\mathcal{X}$ and $\mathcal{Y}$? Discuss.

In [21]:
def mutual_information(Pxy):
    """
    Computes the mutual information I(X;Y) from joint probability distribution
    
    Arguments:
    ----------
    - Pxy:  joint probability distribution of X and Y 
            in a 2-D numpy array where Pxy[i][j]=P(X=i,Y=j)
    Return:
    -------
    - The mutual information I(X;Y) as a number (integer, float or double)
    """
    Px = np.sum(Pxy, axis=1)
    Py = np.sum(Pxy, axis=0)
    
    PxPy = np.outer(Px,Py)

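    # only cells with Pxy > 0 contribute: 0*log2(0) counts as 0,
    # and PxPy > 0 wherever Pxy > 0, so the ratio is well defined there
    Pxy = np.asarray(Pxy, dtype=float)
    safe_PxPy = np.where(PxPy > 0, PxPy, 1.0)
    ratio = np.where(Pxy > 0, Pxy / safe_PxPy, 1.0)

    Ixy = np.sum(Pxy * np.log2(ratio))
    return Ixy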
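
To build intuition for what the mutual information says about two variables, it can help to evaluate it on two extreme cases through the identity $\mathcal{I(X;Y)} = \mathcal{H(X)} + \mathcal{H(Y)} - \mathcal{H(X,Y)}$. This is a sketch with hypothetical helper names and toy distributions, not the function requested above.

```python
import numpy as np

def entropy_bits(p):
    """Hypothetical helper: entropy in bits of a table of any dimension."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def mi_bits(pxy):
    """Hypothetical helper: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy_bits(pxy.sum(axis=1)) + entropy_bits(pxy.sum(axis=0)) - entropy_bits(pxy)

pxy_indep = np.outer([0.5, 0.5], [0.25, 0.75])  # X and Y independent
pxy_copy = np.diag([0.5, 0.5])                  # Y is a copy of X

mi_indep = mi_bits(pxy_indep)
mi_copy = mi_bits(pxy_copy)
print(mi_indep, mi_copy)
```

Independence should give a value of (numerically) zero, while a variable that fully determines the other should give $\mathcal{H(X)}$, here 1 bit.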

### Question 5

Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be three discrete random variables. Write the functions cond_joint_entropy and cond_mutual_information that respectively compute $\mathcal{H(X,Y|Z)}$ and $\mathcal{I(X;Y|Z)}$ of two discrete random variables $\mathcal{X}$, $\mathcal{Y}$ given another discrete random variable $\mathcal{Z}$ from their joint probability distribution $P_\mathcal{X,Y,Z}$. Give the mathematical formulas that you are using and explain the key parts of your implementation.
Suggestion: Observe the mathematical definitions of these quantities and think about how you could derive them from the joint entropy and the mutual information.

In [22]:
def cond_joint_entropy(Pxyz):
    """
    Computes the conditional joint entropy of X, Y knowing Z 
    from the joint probability distribution Pxyz
    Arguments:
    ----------
    - Pxyz: joint probability distribution of X, Y and Z
            in a 3-D array where Pxyz[i][j][k]=P(X=i,Y=j,Z=k)
    Return:
    -------
    - The conditional joint entropy H(X,Y|Z) as a number (integer, float or double)
    """

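    # marginal distribution of Z: sum out X (axis 0), then Y
    PYZ = np.sum(Pxyz, axis=0)
    PZ = np.sum(PYZ, axis=0)
    HZ = entropy(PZ)

    # H(X,Y,Z): accumulate -sum p*log2(p) over each slice Z=k
    HXYZ = 0
    for k in range(Pxyz.shape[2]):
        Pxy_k = Pxyz[:,:,k]
        HXYZ += joint_entropy(Pxy_k.reshape(Pxy_k.shape[0], -1))

    # chain rule: H(X,Y|Z) = H(X,Y,Z) - H(Z)
    return HXYZ - HZ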


def cond_mutual_information(Pxyz):
    """
    Computes the conditional mutual information of X, Y knowing Z 
    from joint probability distribution Pxyz
    Arguments:
    ----------
    - Pxyz: joint probability distribution of X, Y and Z
            in a 3-D array where Pxyz[i][j][k]=P(X=i,Y=j,Z=k)
    Return:
    -------
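    - I(X;Y|Z): The conditional mutual information as a number (integer, float or double)
    
    """

    # P(Y=j,Z=k) and P(X=i,Z=k): sum out X (axis 0) and Y (axis 1)
    PYZ = np.sum(Pxyz, axis=0)
    PXZ = np.sum(Pxyz, axis=1)

    # H(X|Y,Z) = -sum P(i,j,k) * log2(P(i,j,k) / P(j,k));
    # zero cells are skipped because 0*log2(0) counts as 0
    HX_YZ = 0
    for i in range(Pxyz.shape[0]):
        for k in range(Pxyz.shape[2]):
            for j in range(Pxyz.shape[1]):
                if Pxyz[i][j][k] != 0:
                    HX_YZ -= Pxyz[i][j][k] * np.log2(Pxyz[i][j][k] / PYZ[j][k])

    # I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
    IXYZ = conditional_entropy(PXZ) - HX_YZ
    return IXYZ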
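
Both conditional quantities can also be written purely in terms of entropies: $\mathcal{H(X,Y|Z)} = \mathcal{H(X,Y,Z)} - \mathcal{H(Z)}$ and $\mathcal{I(X;Y|Z)} = \mathcal{H(X,Z)} + \mathcal{H(Y,Z)} - \mathcal{H(X,Y,Z)} - \mathcal{H(Z)}$. The sketch below evaluates them on a toy case chosen for illustration, two independent fair bits with $\mathcal{Z}$ equal to their XOR; `entropy_bits` is a hypothetical helper.

```python
import numpy as np

def entropy_bits(p):
    """Hypothetical helper: entropy in bits of a table of any dimension."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# X and Y are independent fair bits and Z = X xor Y (illustrative example)
pxyz = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        pxyz[x, y, x ^ y] = 0.25

h_xyz = entropy_bits(pxyz)
h_z = entropy_bits(pxyz.sum(axis=(0, 1)))
h_xz = entropy_bits(pxyz.sum(axis=1))   # sum out Y
h_yz = entropy_bits(pxyz.sum(axis=0))   # sum out X

cond_joint = h_xyz - h_z                # H(X,Y|Z)
cond_mi = h_xz + h_yz - h_xyz - h_z     # I(X;Y|Z)

pxy = pxyz.sum(axis=2)                  # sum out Z
mi = entropy_bits(pxy.sum(axis=1)) + entropy_bits(pxy.sum(axis=0)) - entropy_bits(pxy)
print(cond_joint, cond_mi, mi)
```

In this example the unconditional mutual information is zero while the conditional one is a full bit, which shows that conditioning can increase mutual information as well as decrease it.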

In [23]:
# [Locked Cell] Evaluation of your functions by the examiner. 
# You don't have access to the evaluation, this will be done by the examiner.
# Therefore, this cell will return nothing for the students.
import os
if os.path.isfile("private_evaluation.py"):
    from private_evaluation import unit_tests
    unit_tests(entropy, joint_entropy, conditional_entropy, mutual_information, cond_joint_entropy, cond_mutual_information)

### Football outcome

You may create cells below to answer the different questions related to football outcome. Unlike in the first part (Implementation), you are free to define as many cells as you need below to answer the different questions. Try to be structured and clear in your code (comment it if necessary). Note that you have to answer the questions in the pdf report, including the numbers you get!

### Predicting the outcome of a football game

Let's assume that the coach of a football team has kept track of previous match data to improve his team's performance through statistical analysis. This database is composed of 13 discrete variables described in Table 1. Note that these variables have different cardinalities (i.e., the number of possible values they can take). Using the database provided with this assignment, where each sample corresponds to a set of 13 values related to a previous game, answer the following questions. Include all your code below the last cell of the Jupyter notebook (you may create several cells for better readability). Note that you have to answer the questions in the pdf report, including the numbers you get in the Notebook! The data is available on the website (data.csv).

In [24]:
df = pd.read_csv('data.csv')
df.head(5)

Unnamed: 0,outcome,previous_outcome,day,time,month,wind_speed,weather,location,capacity,stadium_state,injury,match_type,opponent_strength
0,loss,loss,saturday,morning,september,no_wind,sunny,home,medium,dry,no,friendly,average
1,win,win,sunday,morning,march,low,cloudy,away,small,dry,no,competitive,weak
2,win,loss,monday,evening,october,high,rainy,away,medium,wet,no,friendly,average
3,tie,tie,wednesday,evening,september,low,sunny,home,medium,dry,no,competitive,strong
4,win,tie,thursday,evening,march,low,cloudy,home,medium,dry,yes,friendly,weak


### Question 6

Compute and report the entropy of each variable, and compare each value with its corresponding variable cardinality. What do you notice? Justify theoretically.

In [25]:
for col in df.columns:
    print("H(" + col + ") = ", entropy(df[col].value_counts(normalize=True)))

H(outcome) =  1.3348792529653017
H(previous_outcome) =  1.4830023175176483
H(day) =  2.806559657184559
H(time) =  0.9325249591116449
H(month) =  3.5826314181996297
H(wind_speed) =  1.5847100094439006
H(weather) =  1.7640836027093239
H(location) =  0.9999389442845601
H(capacity) =  1.5339149966727514
H(stadium_state) =  0.6395467715690346
H(injury) =  0.9998419902704185
H(match_type) =  0.9999029333006982
H(opponent_strength) =  1.5843542880706156
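
For the comparison with cardinality, the relevant bound is $\mathcal{H(X)} \le \log_2 |\mathcal{X}|$, with equality for a uniform distribution. The following sketch computes both sides on toy label lists that are not taken from data.csv; `entropy_and_bound` is a hypothetical helper, and a column could presumably be passed to it as `df[col].to_numpy()`.

```python
import numpy as np

def entropy_and_bound(labels):
    """Hypothetical helper: empirical entropy (bits) and log2(cardinality)."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    h = float(-np.sum(p * np.log2(p)))   # every p is > 0 by construction
    return h, float(np.log2(len(counts)))

# toy columns, not taken from data.csv
location = ["home", "away", "home", "away"]
stadium_state = ["dry", "dry", "dry", "wet"]

h_loc, max_loc = entropy_and_bound(location)
h_state, max_state = entropy_and_bound(stadium_state)
print(h_loc, max_loc)
print(h_state, max_state)
```

A balanced variable reaches its bound, while a skewed one stays below it. The same pattern appears in the values reported above: location is close to 1 bit and stadium_state is well below 1 bit.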


In [26]:
print(df.shape[0])
print(df[df['stadium_state'] == 'dry'].shape[0])
print(df[df['injury'] == 'yes'].shape[0])

5000
4189
2537


### Question 7

Compute and report the conditional entropy of outcome given each of the other variables. Considering the variable descriptions, what do you notice when the conditioning variable is (a) wind_speed and (b) previous_outcome?

In [27]:
out_cond_entr = {}

to_cond = df.columns[0]

for col in df.columns:
    if col != to_cond:
        cond_entr = conditional_entropy(pd.crosstab(df[to_cond], df[col], normalize=True).to_numpy())
        print("H(" + to_cond + "|" + col + ") = ", cond_entr)
        out_cond_entr[col] = cond_entr

H(outcome|previous_outcome) =  1.1814755551974467
H(outcome|day) =  1.3334941322458804
H(outcome|time) =  1.3338032007184812
H(outcome|month) =  1.3303613323938615
H(outcome|wind_speed) =  1.334727799689012
H(outcome|weather) =  1.33383591648553
H(outcome|location) =  1.3335129925165217
H(outcome|capacity) =  1.3320215182015702
H(outcome|stadium_state) =  1.3343243002115326
H(outcome|injury) =  1.330242767276265
H(outcome|match_type) =  1.3348306627351736
H(outcome|opponent_strength) =  0.9386104485077148


### Question 8

Compute the mutual information between the variables month and capacity. What can you deduce about the relationship between these two variables? What about the variables day and time?

In [28]:
print("I(month, capacity) = ", mutual_information(pd.crosstab(df['month'], df['capacity'], normalize=True).to_numpy()))
print("I(day, time) = ", mutual_information(pd.crosstab(df['day'], df['time'], normalize=True).to_numpy()))

I(month, capacity) =  0.006068927667434791
I(day, time) =  0.5046071260288234


### Question 9

Let's assume that you have decided to place a bet on the outcome of the match, but the data is now only available through a paid service. With limited funds, you must choose a single variable to invest in. Based on the mutual information, which variable would you keep? Would you make another choice if it was based on the conditional entropy?

In [29]:
out_mut_inf = {}

to_mut = df.columns[0]

for col in df.columns:
    if col != to_mut:
        mut_inf = mutual_information(pd.crosstab(df[to_mut], df[col], normalize=True).to_numpy())
        print("I(" + to_mut + "," + col + ") = ", mut_inf)
        out_mut_inf[col] = mut_inf

max_mut_inf = max(out_mut_inf, key=out_mut_inf.get)
print("The maximum mutual information is between " + to_mut + " and " + max_mut_inf)

min_cond_entr = min(out_cond_entr, key=out_cond_entr.get)
print("The minimum conditional entropy is between " + to_cond + " and " + min_cond_entr)

I(outcome,previous_outcome) =  0.15340369776785479
I(outcome,day) =  0.0013851207194213906
I(outcome,time) =  0.0010760522468203503
I(outcome,month) =  0.004517920571440142
I(outcome,wind_speed) =  0.00015145327628988243
I(outcome,weather) =  0.001043336479771722
I(outcome,location) =  0.001366260448780049
I(outcome,capacity) =  0.0028577347637310526
I(outcome,stadium_state) =  0.0005549527537693491
I(outcome,injury) =  0.0046364856890366846
I(outcome,match_type) =  4.859023012811908e-05
I(outcome,opponent_strength) =  0.39626880445758694
The maximum mutual information is between outcome and opponent_strength
The minimum conditional entropy is between outcome and opponent_strength
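
The two criteria agree because $\mathcal{I(X;Y)} = \mathcal{H(X)} - \mathcal{H(X|Y)}$: with $\mathcal{H(X)}$ fixed, maximising the mutual information is the same as minimising the conditional entropy. The sketch below checks this identity on a subset of the values printed in this notebook (the numbers are copied from the outputs above; the variable names are illustrative).

```python
# values copied from the outputs reported in this notebook
h_outcome = 1.3348792529653017
cond_entropy = {
    "previous_outcome": 1.1814755551974467,
    "month": 1.3303613323938615,
    "injury": 1.330242767276265,
    "opponent_strength": 0.9386104485077148,
}
mutual_info = {
    "previous_outcome": 0.15340369776785479,
    "month": 0.004517920571440142,
    "injury": 0.0046364856890366846,
    "opponent_strength": 0.39626880445758694,
}

# I(outcome; V) should match H(outcome) - H(outcome | V)
gaps = {v: abs(h_outcome - cond_entropy[v] - mutual_info[v]) for v in mutual_info}
best_by_mi = max(mutual_info, key=mutual_info.get)
best_by_ce = min(cond_entropy, key=cond_entropy.get)
print(gaps)
print(best_by_mi, best_by_ce)
```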


### Question 10

With the outcome of previous matches against the same opponent now being available for free, would you change your answer? What can you say about the amount of information provided by this variable? Compare this value with previous results.

In [30]:
Y = np.array(df.columns.difference(['outcome', 'previous_outcome']))
Z = 'previous_outcome'
X = 'outcome'

out_cond_mut_inf = {}

for y in Y:
    temp = pd.crosstab(df[X], [df[y], df[Z]], normalize=True).to_numpy()
    Pxyz = np.zeros((len(df[X].unique()), len(df[y].unique()), len(df[Z].unique())))
    for i in range(len(df[X].unique())):
        Pxyz[i] = temp[i].reshape(len(df[y].unique()), len(df[Z].unique()))
    cond_mut_inf = cond_mutual_information(Pxyz)

    print("I(" + X + ";" + y + "|" + Z + ") = ", cond_mut_inf)
    out_cond_mut_inf[y] = cond_mut_inf

max_cond_mut_inf = max(out_cond_mut_inf, key=out_cond_mut_inf.get)
print("The maximum conditional mutual information is between " + X + " and " + max_cond_mut_inf)

I(outcome;capacity|previous_outcome) =  0.004953992513573757
I(outcome;day|previous_outcome) =  0.004737405783874049
I(outcome;injury|previous_outcome) =  0.008997458391027058
I(outcome;location|previous_outcome) =  0.002128508451425759
I(outcome;match_type|previous_outcome) =  0.0005145649247342288
I(outcome;month|previous_outcome) =  0.013689670050333058
I(outcome;opponent_strength|previous_outcome) =  0.2445855018361326
I(outcome;stadium_state|previous_outcome) =  0.0006226110403606544
I(outcome;time|previous_outcome) =  0.0034734328084082833
I(outcome;weather|previous_outcome) =  0.0030334910764067136
I(outcome;wind_speed|previous_outcome) =  0.002120983416720623
The maximum conditional mutual information is between outcome and opponent_strength
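
The reshape used above relies on every (y, previous_outcome) pair occurring in the data; it would presumably fail if some pair never occurred, assuming `pd.crosstab` then omits that column. One alternative is to count samples directly into a 3-D array. This is only a sketch: `joint_distribution` is a hypothetical helper, the label lists are toy data, and columns could presumably be passed as `df[col].to_numpy()`.

```python
import numpy as np

def joint_distribution(*columns):
    """Hypothetical helper: empirical joint distribution of label arrays."""
    codes, sizes = [], []
    for col in columns:
        # map each label to an integer code (labels are sorted by np.unique)
        values, inverse = np.unique(np.asarray(col), return_inverse=True)
        codes.append(inverse.ravel())
        sizes.append(len(values))
    counts = np.zeros(sizes)
    np.add.at(counts, tuple(codes), 1)   # add 1 to one cell per sample
    return counts / counts.sum()

# toy data, not taken from data.csv
x = ["win", "loss", "win", "tie"]
y = ["home", "away", "home", "home"]
z = ["weak", "weak", "strong", "weak"]

pxyz = joint_distribution(x, y, z)
print(pxyz.shape, pxyz.sum())
```

Combinations that never occur simply keep a probability of zero, so the array always has one axis per variable with the full cardinality.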


### Question 11

Using information theory, discover the particularity of the stadium of the home team, in particular using the stadium_state and weather variables. Justify with computations.

In [31]:
df_home = df[df['location'] == 'home']

print("Number of rows: ", len(df_home))
print("Number of 'stadium_state' dry matches: ", len(df_home[df_home['stadium_state'] == 'dry']))
print("Number of 'weather' snowy matches: ", len(df_home[df_home['weather'] == 'snowy']))
print("Number of 'weather' rainy matches: ", len(df_home[df_home['weather'] == 'rainy']))
print("Number of 'weather' sunny matches: ", len(df_home[df_home['weather'] == 'sunny']))
print("Number of 'weather' cloudy matches: ", len(df_home[df_home['weather'] == 'cloudy']))
print("H(stadium_state) = ", entropy(df_home['stadium_state'].value_counts(normalize=True)))
print("H(weather) = ", entropy(df_home['weather'].value_counts(normalize=True)))
print("I(stadium_state, weather) = ", mutual_information(pd.crosstab(df_home['stadium_state'], df_home['weather'], normalize=True).to_numpy()))

Number of rows:  2477
Number of 'stadium_state' dry matches:  2477
Number of 'weather' snowy matches:  150
Number of 'weather' rainy matches:  426
Number of 'weather' sunny matches:  1011
Number of 'weather' cloudy matches:  890
H(stadium_state) =  -0.0
H(weather) =  1.740025152921622
I(stadium_state, weather) =  0.0
