# Project 1 - Information measures

The goal of this first project is to get accustomed to information and uncertainty measures. We ask you to write a brief report (PDF format) collecting your answers to the different questions. All code must be written in Python inside this Jupyter Notebook. No other code file will be accepted. Note that you cannot change the content of locked cells or import any Python library other than the ones already imported (numpy and pandas).

## Implementation

In this project, you will need to use information measures to answer several questions. Therefore, in this first part, you are asked to write several functions that implement some of the main measures seen in the first theoretical lectures. Remember that you need to fill in this Jupyter Notebook to answer these questions. Pay particular attention to the required output format of each function.

In [48]:
# [Locked Cell] You can not import any extra Python library in this Notebook.
import numpy as np
import pandas as pd

### Question 1

Write a function entropy that computes the entropy $\mathcal{H(X)}$ of a random variable $\mathcal{X}$ from its probability distribution $P_\mathcal{X} = (p_1, p_2, \ldots, p_n)$. Give the mathematical formula that you are using and explain the key parts of your implementation. Intuitively, what is measured by the entropy?

In [49]:

def entropy(Px):
    
    n=len(Px)
    H=0
    for i in range(n):
        if Px[i]>0:
            H+=-Px[i]*np.log2(Px[i])
    return H

-8.0
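
The function above implements $H(X) = -\sum_{i} p_i \log_2 p_i$, skipping zero-probability terms (convention $0 \log 0 = 0$). As a rough cross-check, the same quantity can be sketched in vectorized NumPy. The name `entropy_vec` is hypothetical and not part of the assignment, and the sketch uses the equivalent form $\sum_i p_i \log_2(1/p_i)$.

```python
import numpy as np

def entropy_vec(Px):
    """Shannon entropy in bits (hypothetical vectorized counterpart of entropy)."""
    p = np.asarray(Px, dtype=float).ravel()
    p = p[p > 0]                                  # convention: 0 * log2(0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

# Distributions whose entropy is known in closed form
h_coin = entropy_vec([0.5, 0.5])                  # fair coin
h_certain = entropy_vec([1.0, 0.0])               # deterministic outcome
h_uniform8 = entropy_vec(np.full(8, 1 / 8))       # uniform over 8 outcomes
print(h_coin, h_certain, h_uniform8)
```

The three test cases illustrate the intuition: entropy is zero when the outcome is certain and reaches $\log_2 n$ for a uniform distribution over $n$ outcomes.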


### Question 2

Write a function joint_entropy that computes the joint entropy $\mathcal{H(X,Y)}$ of two discrete random variables $\mathcal{X}$ and $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. Compare the entropy and joint_entropy functions (and their corresponding formulas). What do you notice?

In [50]:
def joint_entropy(Pxy):
    
    (m,n)=np.shape(Pxy)
    H=0
    for i in range (m):
        for j in range (n):
            if Pxy[i][j]>0:
                H+=-Pxy[i][j]*np.log2(Pxy[i][j])
    return H
    

0.6621649927296975
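
Regarding the comparison asked for in this question: the joint entropy $H(X,Y) = -\sum_{i,j} p_{ij} \log_2 p_{ij}$ is the entropy formula applied to the flattened joint table. The sketch below illustrates this on a made-up 2x2 table and checks the bounds $\max(H(X), H(Y)) \le H(X,Y) \le H(X) + H(Y)$. The helper `entropy_bits` and the table values are assumptions introduced for illustration only.

```python
import numpy as np

def entropy_bits(P):
    """Entropy in bits of a probability table of any shape (hypothetical helper)."""
    p = np.asarray(P, dtype=float).ravel()        # flatten: joint entropy = entropy of all cells
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Made-up joint distribution: rows index X, columns index Y
Pxy = np.array([[0.25, 0.25],
                [0.50, 0.00]])

h_xy = entropy_bits(Pxy)
h_x = entropy_bits(Pxy.sum(axis=1))               # marginal of X
h_y = entropy_bits(Pxy.sum(axis=0))               # marginal of Y
print(h_x, h_y, h_xy)
print(max(h_x, h_y) <= h_xy <= h_x + h_y)
```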


### Question 3

Write a function conditional_entropy that computes the conditional entropy $\mathcal{H(X|Y)}$ of a discrete random variable $\mathcal{X}$ given another discrete random variable $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. Describe an equivalent way of computing that quantity.

In [51]:
def conditional_entropy(Pxy):
    m,n=np.shape(Pxy)
    Py=np.zeros(n)
    for i in range (n):
        for j in range (m):
            Py[i]+=Pxy[j][i]
    return joint_entropy(Pxy) - entropy(Py)


12.0
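
The function above uses the chain rule $H(X|Y) = H(X,Y) - H(Y)$. An equivalent way of computing the same quantity is the weighted average $H(X|Y) = \sum_y P(y)\, H(X \mid Y=y)$. The sketch below compares both on a made-up table; `entropy_bits` and `conditional_entropy_avg` are hypothetical names, and columns are assumed to index $Y$ as in the notebook's implementation.

```python
import numpy as np

def entropy_bits(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def conditional_entropy_avg(Pxy):
    """H(X|Y) as the P(y)-weighted average of the entropies of P(X | Y=y)."""
    Pxy = np.asarray(Pxy, dtype=float)
    Py = Pxy.sum(axis=0)                          # marginal of Y (columns)
    h = 0.0
    for j, py in enumerate(Py):
        if py > 0:                                # columns with P(y)=0 contribute nothing
            h += py * entropy_bits(Pxy[:, j] / py)
    return h

Pxy = np.array([[0.25, 0.25],
                [0.50, 0.00]])                    # made-up example
h_chain = entropy_bits(Pxy) - entropy_bits(Pxy.sum(axis=0))
h_avg = conditional_entropy_avg(Pxy)
print(h_chain, h_avg)
```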

### Question 4

Write a function mutual_information that computes the mutual information $\mathcal{I(X;Y)}$ between two discrete random variables $\mathcal{X}$ and $\mathcal{Y}$ from the joint probability distribution $P_\mathcal{X,Y}$. Give the mathematical formula that you are using and explain the key parts of your implementation. What can you deduce from the mutual information $\mathcal{I(X;Y)}$ on the relationship between $\mathcal{X}$ and $\mathcal{Y}$? Discuss.

In [52]:
def mutual_information(Pxy):
    
    m,n=np.shape(Pxy)
    Px=np.sum(Pxy, axis=(1))
    Py=np.sum(Pxy, axis=(0))
    return entropy(Px)+entropy(Py)-joint_entropy(Pxy)


0.04643934467101518
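
The function above uses $I(X;Y) = H(X) + H(Y) - H(X,Y)$. The same quantity can also be written as $I(X;Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$, which makes the interpretation explicit: it is zero exactly when the joint table factorizes (independence) and equals $H(X)$ when $Y$ determines $X$. A minimal sketch with a hypothetical function name and made-up tables:

```python
import numpy as np

def mutual_information_kl(Pxy):
    """I(X;Y) in bits via the ratio P(x,y) / (P(x) P(y)) (hypothetical helper)."""
    Pxy = np.asarray(Pxy, dtype=float)
    Px = Pxy.sum(axis=1, keepdims=True)           # shape (m, 1)
    Py = Pxy.sum(axis=0, keepdims=True)           # shape (1, n)
    mask = Pxy > 0                                # zero cells contribute nothing
    ratio = Pxy[mask] / (Px @ Py)[mask]           # Px @ Py is the product of marginals
    return float(np.sum(Pxy[mask] * np.log2(ratio)))

independent = np.outer([0.3, 0.7], [0.6, 0.4])    # P(x,y) = P(x) P(y)
identical = np.diag([0.5, 0.5])                   # Y is a copy of X
mi_independent = mutual_information_kl(independent)
mi_identical = mutual_information_kl(identical)
print(mi_independent, mi_identical)
```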

### Question 5

Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be three discrete random variables. Write the functions cond_joint_entropy and cond_mutual_information that respectively compute $\mathcal{H(X,Y|Z)}$ and $\mathcal{I(X;Y|Z)}$ of two discrete random variables $\mathcal{X}$, $\mathcal{Y}$ given another discrete random variable $\mathcal{Z}$ from their joint probability distribution $P_\mathcal{X,Y,Z}$. Give the mathematical formulas that you are using and explain the key parts of your implementation.
Suggestion: Observe the mathematical definitions of these quantities and think how you could derive them from the joint entropy and the mutual information.

In [53]:
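def multi_joint_entropy(Pxyz):
    # Joint entropy H(X,Y,Z) in bits of a 3-D probability table
    (m,n,o)=np.shape(Pxyz)
    H=0
    for i in range (m):
        for j in range (n):
            # k must differ from o: reusing o as the loop variable overwrites the size of the last axis
            for k in range (o):
                if Pxyz[i][j][k]>0:
                    H+=-Pxyz[i][j][k]*np.log2(Pxyz[i][j][k])
    return H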

def cond_joint_entropy(Pxyz):
    Pxz=np.sum(Pxyz, axis=(1))
    return conditional_entropy(Pxz)+multi_joint_entropy(Pxyz) -joint_entropy(Pxz)
    

-0.24162751467959487

In [54]:
def cond_mutual_information(Pxyz):
    Pxz=np.sum(Pxyz, axis=(1))
    Pyz=np.sum(Pxyz, axis=(0))
    return conditional_entropy(Pxz)-(multi_joint_entropy(Pxyz)-joint_entropy(Pyz))

1.1422578251107915
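
Both conditional quantities reduce to plain entropies of marginals: $H(X,Y|Z) = H(X,Y,Z) - H(Z)$ and $I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$, which is an expansion of $H(X|Z) - H(X|Y,Z)$. The sketch below checks them on a case with a known answer: $X$ and $Z$ are independent fair bits and $Y = X \oplus Z$, so both quantities equal 1 bit. Function names ending in `_sketch` are hypothetical, and the axis order (X, Y, Z) is assumed to match the notebook.

```python
import numpy as np

def entropy_bits(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def cond_joint_entropy_sketch(Pxyz):
    Pxyz = np.asarray(Pxyz, dtype=float)
    return entropy_bits(Pxyz) - entropy_bits(Pxyz.sum(axis=(0, 1)))   # H(X,Y,Z) - H(Z)

def cond_mutual_information_sketch(Pxyz):
    Pxyz = np.asarray(Pxyz, dtype=float)
    h_xz = entropy_bits(Pxyz.sum(axis=1))         # marginalise Y
    h_yz = entropy_bits(Pxyz.sum(axis=0))         # marginalise X
    h_z = entropy_bits(Pxyz.sum(axis=(0, 1)))     # marginalise X and Y
    return h_xz + h_yz - entropy_bits(Pxyz) - h_z

# X, Z independent fair bits, Y = X XOR Z
P = np.zeros((2, 2, 2))
for x in (0, 1):
    for z in (0, 1):
        P[x, x ^ z, z] = 0.25

h_xy_given_z = cond_joint_entropy_sketch(P)
i_xy_given_z = cond_mutual_information_sketch(P)
print(h_xy_given_z, i_xy_given_z)
```

Two properties are useful when testing any implementation: $H(X,Y|Z) \ge 0$ and $0 \le I(X;Y|Z) \le \min(H(X|Z), H(Y|Z))$.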

In [55]:
# [Locked Cell] Evaluation of your functions by the examiner. 
# You don't have access to the evaluation, this will be done by the examiner.
# Therefore, this cell will return nothing for the students.
import os
if os.path.isfile("private_evaluation.py"):
    from private_evaluation import unit_tests
    unit_tests(entropy, joint_entropy, conditional_entropy, mutual_information, cond_joint_entropy, cond_mutual_information)

## Weather forecasting

You may create cells below to answer the different questions related to weather forecasting. Unlike in the first part (Implementation), you are free to define as many cells as you need below to answer the different questions. Try to be structured and clear in your code (comment it if necessary). Note that you have to answer the questions in the pdf report, including the numbers you get!

In [56]:
# Write your code here or in other cells below (you may delete this comment)
data=pd.read_csv("weather_data.csv")
data

Unnamed: 0,temperature,air_pressure,same_day_rain,next_day_rain,relative_humidity,wind_direction,wind_speed,cloud_height,cloud_density,month,day,daylight,lightning,air_quality
0,cold,increasing,dry,dry,low,south,no_wind,low,low,january,tuesday,sunny,no_lightning,bad
1,medium,decreasing,deluge,deluge,high,north,high,high,high,september,monday,cloudy,no_lightning,medium
2,cold,increasing,dry,dry,high,east,high,no_cloud,no_cloud,october,sunday,sunny,no_lightning,bad
3,medium,increasing,dry,dry,low,west,high,high,high,january,tuesday,cloudy,no_lightning,bad
4,high,decreasing,deluge,deluge,high,north,high,high,high,july,monday,cloudy,low,bad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,cold,decreasing,deluge,deluge,high,north,high,high,low,february,wednesday,sunny,no_lightning,bad
4996,high,decreasing,drizzle,drizzle,low,north,low,high,high,july,thursday,cloudy,no_lightning,bad
4997,medium,increasing,dry,dry,low,east,low,high,low,december,tuesday,cloudy,no_lightning,bad
4998,cold,increasing,drizzle,dry,high,south,low,low,high,april,friday,cloudy,no_lightning,bad


In [57]:
#Question 6

def list_features(Px):
    # Distinct values of a column, in order of first appearance
    values=[]
    for w in Px:
        if w not in values:
            values.append(w)
    return values

def count_strings(Px):  
    return Px.value_counts()

    
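def display_entropy(data):
    # Maps each feature to [entropy in bits, number of distinct values]
    dico_final=dict()
    for feature in data.columns:
        values=list_features(data[feature])
        C=count_strings(data[feature])
        # Normalise by the actual number of rows and pass a plain array,
        # since entropy() indexes its argument by integer position
        P=(C/len(data)).to_numpy()
        D={feature:[entropy(P),len(values)]}
        dico_final.update(D)
    return dico_final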

print(count_strings(data["temperature"]))
display_entropy(data)

medium    2182
cold      1850
high       968
Name: temperature, dtype: int64


{'temperature': [1.5113935187221061, 3],
 'air_pressure': [0.9999971146079947, 2],
 'same_day_rain': [1.475468797174184, 3],
 'next_day_rain': [1.5686562064046452, 3],
 'relative_humidity': [0.9997963972977278, 2],
 'wind_direction': [1.9995507337173037, 4],
 'wind_speed': [1.5848180054843541, 3],
 'cloud_height': [1.5846220675718725, 3],
 'cloud_density': [1.5844638106709676, 3],
 'month': [3.5834131970628738, 12],
 'day': [2.806398967708293, 7],
 'daylight': [0.9986283124374025, 2],
 'lightning': [0.3249678887742185, 3],
 'air_quality': [0.5358803475890053, 3]}
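
Each entropy above is bounded by $\log_2$ of the feature's cardinality, with equality for a uniform distribution. For instance, month (12 values) has entropy 3.5834 bits, close to $\log_2 12 \approx 3.585$, whereas lightning (3 values) is far below $\log_2 3 \approx 1.585$, which indicates a strongly skewed distribution. A minimal sketch of the empirical computation, assuming a hypothetical helper `empirical_entropy` and made-up labels:

```python
import numpy as np

def empirical_entropy(labels):
    """Return (entropy in bits, cardinality) of the empirical distribution of labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()                     # relative frequencies, all > 0
    return float(np.sum(p * np.log2(1.0 / p))), len(counts)

toy_temperature = ["cold", "medium", "cold", "high"]   # made-up sample, not the real data
h_toy, card_toy = empirical_entropy(toy_temperature)
print(h_toy, card_toy, h_toy <= np.log2(card_toy))
```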

In [61]:
#Question 7

def joint_probability(data, feature1, feature2):
    dataX=data[feature1]
    dataY=data[feature2]
    X=list_features(dataX)
    Y=list_features(dataY)
    size_x=len(X)
    size_y=len(Y)
    cond_list=np.zeros(size_x*size_y)
    list=[]
    for x in (X):
        for y in (Y):
            list.append((x,y))
    for i in range (len(data)):
        xi=dataX.iloc[i]
        yi=dataY.iloc[i]
        j=list.index((xi,yi))
        cond_list[j]+=1
    cond_list=cond_list/len(data)
    cond_list=cond_list.reshape(size_x, size_y)
    return cond_list
            

# joint_entropy returns H(next_day_rain, feature); use conditional_entropy for H(next_day_rain | feature)
for feature in data.columns:
    print ('joint entropy of next_day_rain and', feature,":", joint_entropy(joint_probability(data, "next_day_rain", feature)))


joint entropy of next_day_rain and temperature : 3.079494527678028
joint entropy of next_day_rain and air_pressure : 1.9399722725568471
joint entropy of next_day_rain and same_day_rain : 2.8649543482685877
joint entropy of next_day_rain and next_day_rain : 1.5686562064046452
joint entropy of next_day_rain and relative_humidity : 2.300851644397622
joint entropy of next_day_rain and wind_direction : 3.567366069231546
joint entropy of next_day_rain and wind_speed : 3.1525850932421506
joint entropy of next_day_rain and cloud_height : 3.1513850965695323
joint entropy of next_day_rain and cloud_density : 3.1510536954135615
joint entropy of next_day_rain and month : 5.148292946285451
joint entropy of next_day_rain and day : 4.373555777610448
joint entropy of next_day_rain and daylight : 2.5668875001271063
joint entropy of next_day_rain and lightning : 1.8932004636474211
joint entropy of next_day_rain and air_quality : 2.103761481744547
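
The values printed above are joint entropies $H(X,Y)$. The conditional entropy follows from them as $H(X|Y) = H(X,Y) - H(Y)$. The sketch below builds the joint table from two label columns without a Python-level search per row and then applies that identity. The function names and the toy labels are assumptions for illustration; rows and columns come out in sorted order rather than order of first appearance, which does not affect any entropy.

```python
import numpy as np

def entropy_bits(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def joint_table(x_labels, y_labels):
    """Empirical joint distribution of two equally long label sequences."""
    x_vals, xi = np.unique(np.asarray(x_labels), return_inverse=True)
    y_vals, yi = np.unique(np.asarray(y_labels), return_inverse=True)
    counts = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(counts, (xi, yi), 1)                # accumulate one count per row of data
    return counts / counts.sum()

def cond_entropy_from_labels(x_labels, y_labels):
    Pxy = joint_table(x_labels, y_labels)
    return entropy_bits(Pxy) - entropy_bits(Pxy.sum(axis=0))   # H(X,Y) - H(Y)

# Made-up labels: pressure determines rain, day is unrelated to rain
rain = ["dry", "dry", "deluge", "deluge"]
pressure = ["increasing", "increasing", "decreasing", "decreasing"]
day = ["monday", "tuesday", "monday", "tuesday"]
h_rain_given_pressure = cond_entropy_from_labels(rain, pressure)
h_rain_given_day = cond_entropy_from_labels(rain, day)
print(h_rain_given_pressure, h_rain_given_day)
```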


In [62]:
#Question 8

P_humid_speed=joint_probability(data, "relative_humidity", "wind_speed")
P_month_temp=joint_probability(data, "month", "temperature")
print("mutual information between relative_humidity and wind_speed :", mutual_information(P_humid_speed))
print("mutual information between month and temperature :", mutual_information(P_month_temp))

mutual information between relative_humidity and wind_speed : 0.00012439598067359725
mutual information between month and temperature : 0.5753467937246404


In [63]:
#Question 9

def max_mutual_info(data):
    list=[]
    for feature in data.columns:
        if feature != "next_day_rain":
            list.append((mutual_information(joint_probability(data, "next_day_rain", feature)), feature))
    return max(list)

print(max_mutual_info(data))

def min_conditional_entropy(data):
    list=[]
    for feature in data.columns:
        if feature != "next_day_rain":
            list.append((conditional_entropy(joint_probability(data, "next_day_rain", feature)), feature))
    return min(list)

print(min_conditional_entropy(data))

(0.6286810484557928, 'air_pressure')
(0.9399751579488526, 'air_pressure')
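
Both criteria select the same feature because $H(X|Y) = H(X) - I(X;Y)$: with $H(X)$ fixed, maximizing the mutual information is the same as minimizing the conditional entropy. The short check below plugs in the numbers printed in this notebook; only the variable names are introduced here.

```python
# Values copied from the outputs of Questions 6 and 9 above
h_next_day_rain = 1.5686562064046452        # H(next_day_rain)
mi_air_pressure = 0.6286810484557928        # I(next_day_rain; air_pressure)
h_cond_air_pressure = 0.9399751579488526    # H(next_day_rain | air_pressure)

# H(X|Y) should equal H(X) - I(X;Y) up to floating-point rounding
gap = abs((h_next_day_rain - mi_air_pressure) - h_cond_air_pressure)
print(gap < 1e-12)
```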


In [64]:
#Question 10
data_modif=data[data['next_day_rain'].isin(["drizzle","deluge"])]
data_modif

Unnamed: 0,temperature,air_pressure,same_day_rain,next_day_rain,relative_humidity,wind_direction,wind_speed,cloud_height,cloud_density,month,day,daylight,lightning,air_quality
1,medium,decreasing,deluge,deluge,high,north,high,high,high,september,monday,cloudy,no_lightning,medium
4,high,decreasing,deluge,deluge,high,north,high,high,high,july,monday,cloudy,low,bad
5,cold,decreasing,dry,drizzle,low,east,low,no_cloud,no_cloud,february,sunday,sunny,no_lightning,bad
6,cold,decreasing,dry,deluge,high,north,no_wind,no_cloud,no_cloud,march,saturday,sunny,no_lightning,bad
8,high,decreasing,deluge,deluge,high,east,high,low,low,july,thursday,sunny,no_lightning,bad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4989,medium,decreasing,deluge,drizzle,low,west,low,low,high,december,sunday,cloudy,no_lightning,bad
4992,cold,decreasing,drizzle,drizzle,low,north,low,high,high,april,saturday,cloudy,no_lightning,bad
4995,cold,decreasing,deluge,deluge,high,north,high,high,low,february,wednesday,sunny,no_lightning,bad
4996,high,decreasing,drizzle,drizzle,low,north,low,high,high,july,thursday,cloudy,no_lightning,bad


In [65]:
#Question 10 (continued)

print(max_mutual_info(data_modif))
print(min_conditional_entropy(data_modif))

(0.4391920975475516, 'relative_humidity')
(0.5601193454280589, 'relative_humidity')


In [66]:
#Question 11

def multi_joint_probability(data, feature1, feature2, feature3):
    dataX=data[feature1]
    dataY=data[feature2]
    dataZ=data[feature3]
    X=list_features(dataX)
    Y=list_features(dataY)
    Z=list_features(dataZ)
    size_x=len(X)
    size_y=len(Y)
    size_z=len(Z)
    cond_list=np.zeros(size_x*size_y*size_z)
    list=[]
    for x in (X):
        for y in (Y):
            for z in (Z):
                list.append((x,y,z))
    for i in range (len(data)):
        # positional access, so filtered frames such as data_modif also work
        xi=dataX.iloc[i]
        yi=dataY.iloc[i]
        zi=dataZ.iloc[i]
        j=list.index((xi,yi, zi))
        cond_list[j]+=1
    cond_list=cond_list/len(data)
    cond_list=cond_list.reshape(size_x, size_y, size_z)
    return cond_list

def max_cond_mutual_info(data):
    list=[]
    for feature in data.columns:
        if feature != "next_day_rain":
            list.append((cond_mutual_information(multi_joint_probability(data, feature,"next_day_rain", "temperature")), feature))
    return max(list)

print(max_cond_mutual_info(data))

(5.641582994808372, 'month')
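
A quick sanity check for any conditional mutual information is the chain of bounds $I(X;Y|Z) \le H(Y|Z) \le H(Y) \le \log_2 |\mathcal{Y}|$. In the cell above, $Y$ is next_day_rain, which takes 3 values and has entropy 1.5687 bits according to Question 6. The sketch below compares the printed value against that bound; a violation suggests that the three-variable helpers are worth re-checking and the cell re-running. Variable names are introduced here for illustration.

```python
import math

h_next_day_rain = 1.5686562064046452    # H(next_day_rain), from the Question 6 output
reported_cmi = 5.641582994808372        # value printed by the Question 11 cell

# I(X;Y|Z) <= H(Y|Z) <= H(Y) <= log2(number of values of Y)
upper_bound = min(h_next_day_rain, math.log2(3))
within_bound = reported_cmi <= upper_bound
print(upper_bound, within_bound)
```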


# Part 3

In [None]:
################# code for section 3
import math 
import itertools
english_letters = ['a','z','e','r','t','y','u','i','o','p','q','s','d','f','g','h','j','k','l','m','w','x','c','v','b','n']

def count_occurrences_in_position(words,pos):
    counter = {english_letters[i]: 0 for i in range(len(english_letters))}
    for word in words:
        counter[word[pos]] = counter[word[pos]] + 1 
    return counter

def compute_entropy_of_cell(nb_words,counter):

    entropy = 0
    for letter in counter:
        if counter[letter] == 0:
            continue
        proba_of_letter = counter[letter]/nb_words
        entropy -= proba_of_letter * math.log2(proba_of_letter)

    return entropy

def print_entropy_of_cells(words):
    if not words :
        return 0

    print('Entropy of each cell')
    for i in range(len(words[0])):
        counter = count_occurrences_in_position(words,i)
        print('Cell ' + str(i) + ' = ' + str(compute_entropy_of_cell(len(words),counter)))

def make_all_possible_5_words():
    english_letters.sort()
    all_words = []
    for item in itertools.product(english_letters, repeat=5):
        all_words.append(item)
        
    return all_words
    
def eliminate_words_by_letter_in_pos(words, letter, pos):
    rest = []

    for word in words:
        if word[pos] != letter:
            rest.append(word)

    return rest

def eliminate_words_by_letters_in_pos(words,letters,pos):
    rest = words

    for letter in letters:
        rest = eliminate_words_by_letter_in_pos(rest,letter,pos)

    return rest

def eliminate_words_by_letters_different_to_letter(words,letter,pos):

    left_letters = []
    for l in english_letters:
        if l != letter:
            left_letters.append(l)

    return eliminate_words_by_letters_in_pos(words,left_letters,pos)

def compute_entropy_of_each_cell():
    return


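def green_letters(words,letters,pos):
    # Keeps only the words that have letters[i] at position pos[i] for every i
    rest = words
    for i in range(len(letters)):
        # filter the running result, not the original list, so that all constraints accumulate
        rest = eliminate_words_by_letters_different_to_letter(rest,letters[i],pos[i])
    return rest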

def grey_letters(words,letters):
    # Eliminates words if letters are present
    return [elements for elements in words if all(ch not in elements for ch in letters)] 

def orange_letters(words,letters,pos):
    rest = words
    rest = [elements for elements in words if all(ch in elements for ch in letters)] # Eliminate words if letters not present

    for i in range(len(letters)):
        rest = eliminate_words_by_letter_in_pos(rest,letters[i],pos[i])
    return rest

words = make_all_possible_5_words()
print("Number of different words at the start = " + str(len(words))) 
print("Entropy = " + str(math.log2(len(words))))

### After the first guess: t, b, l, e are not in the word; the letter at index 1 is a
words = grey_letters(words,['t','b','l','e'])
words = green_letters(words,['a'],[1])

print("Number of different words after first guess = " + str(len(words))) 
print("Entropy = " + str(math.log2(len(words))))

### After the second guess: r, o, u, h are not in the word; g is in the word but not at index 3
words = grey_letters(words, ['r','o','u','h'])
words = orange_letters(words, ['g'],[3])

print("Number of different words after second guess = " + str(len(words))) 
print("Entropy = " + str(math.log2(len(words))))
print_entropy_of_cells(words)
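
Since every 5-letter string is treated as equally likely, the entropy after each guess is $\log_2$ of the number of remaining candidates, and that number can be counted without enumerating all $26^5 = 11{,}881{,}376$ tuples. The sketch below does so for the constraints used above and cross-checks the closed form with a brute-force loop over the four free positions only. It is an independent illustration under the same constraints, not a replacement for the cell above; all names are introduced here.

```python
import itertools
import math
import string

excluded = set("tble") | set("rouh")                      # grey letters from both guesses
allowed = [c for c in string.ascii_lowercase if c not in excluded]
k = len(allowed)                                          # letters still possible per cell

# After the first guess only t, b, l, e are excluded and index 1 is fixed to 'a'
after_first = (26 - 4) ** 4

# After the second guess: index 1 is 'a', index 3 is not 'g' (k - 1 choices),
# and 'g' must appear at least once among indices 0, 2, 4 (k^3 - (k-1)^3 choices)
closed_form = (k - 1) * (k ** 3 - (k - 1) ** 3)

# Brute-force check over the free positions 0, 2, 3, 4
brute_force = 0
for p0, p2, p3, p4 in itertools.product(allowed, repeat=4):
    word = (p0, "a", p2, p3, p4)
    if "g" in word and word[3] != "g":
        brute_force += 1

print(after_first, math.log2(after_first))
print(closed_form, brute_force, math.log2(closed_form))
```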