# Lab 4 - Information Theory in Machine Learning

## Part 1: Entropy and Information Gain in Decision Trees

## Step 1: Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

## Step 2: Load and Explore the Iris Dataset

In [2]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

## Step 3: Calculate Entropy

In [3]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

What is your observation about the calculated entropy?

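The function implements the Shannon entropy formula, $H(y) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of samples with class label $i$. It assumes discrete class labels, so it applies to both binary and multiclass problems. Higher entropy means the classes are more evenly mixed (more disorder); lower entropy means the set is more homogeneous, and the entropy is 0 when every sample has the same label.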
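As a quick sanity check of these properties, here is a minimal sketch that evaluates the same formula on three small label vectors. The helper name `entropy_from_counts` and the label vectors are hypothetical, introduced only for illustration; the last vector mimics the Iris class balance of 50 samples per class.

```python
import numpy as np

def entropy_from_counts(y):
    # Same formula as calculate_entropy, written with value counts instead of a loop
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log2(probabilities)))

# Hypothetical label vectors chosen to show the extremes
pure = np.array([0, 0, 0, 0])          # a single class
balanced = np.array([0, 0, 1, 1])      # two classes, evenly mixed
three_way = np.repeat([0, 1, 2], 50)   # three classes, 50 samples each (as in Iris)

entropy_pure = entropy_from_counts(pure)
entropy_balanced = entropy_from_counts(balanced)
entropy_three_way = entropy_from_counts(three_way)
print(entropy_pure, entropy_balanced, entropy_three_way)
```

The pure set has zero entropy, the balanced binary set has 1 bit, and three equally likely classes give $\log_2 3 \approx 1.585$ bits, which is the maximum for a three-class problem.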

## Step 4: Calculate Information Gain

In [4]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before splitting
    total_entropy = calculate_entropy(df[target_name])
    # One subset per distinct attribute value, weighted by its share of the rows
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain

Discuss your findings here.

The function computes information gain as the entropy of the whole dataset minus the weighted average entropy of the subsets obtained by splitting on the attribute, with one subset per distinct attribute value. A higher information gain indicates a better attribute for splitting. The function assumes a non-empty dataset.
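To make the two extremes concrete, the following sketch applies the same calculation to a toy dataset. Everything here is hypothetical and for illustration only: the array-based helpers `entropy` and `information_gain`, and the two made-up attributes, one that separates the classes perfectly and one that is independent of the target.

```python
import numpy as np

def entropy(y):
    # Shannon entropy (in bits) of a vector of discrete labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    # Parent entropy minus the size-weighted entropy of each branch
    values, counts = np.unique(x, return_counts=True)
    weighted = sum((c / len(x)) * entropy(y[x == v]) for v, c in zip(values, counts))
    return entropy(y) - weighted

# Hypothetical toy data: 8 samples with a balanced binary target
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
perfect = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # separates the classes
useless = np.array(["a", "b", "a", "b", "a", "b", "a", "b"])  # independent of the target

gain_perfect = information_gain(perfect, y)
gain_useless = information_gain(useless, y)
print(gain_perfect, gain_useless)
```

The perfect attribute recovers the full 1 bit of target entropy, while the uninformative attribute yields a gain of 0, because each of its branches is as mixed as the parent.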

## Part 2: Apply Entropy and Information Gain on a different dataset

## Task 1: Implement Entropy and Information Gain

In [5]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer['target'] = cancer.target

In [6]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

In [7]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before splitting
    total_entropy = calculate_entropy(df[target_name])
    # One subset per distinct attribute value, weighted by its share of the rows
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain

In [8]:
info_gain_radius_mean = calculate_information_gain(df_cancer, 'mean radius', 'target')
info_gain_texture_mean = calculate_information_gain(df_cancer, 'mean texture', 'target')

print("Information Gain for 'mean radius':", info_gain_radius_mean)
print("Information Gain for 'mean texture':", info_gain_texture_mean)

Information Gain for 'mean radius': 0.8607815854835991
Information Gain for 'mean texture': 0.8357118798482908


## Task 2: Discuss your findings in detail

Provide detailed explanation and discussion about your findings.

The 'mean radius' Information Gain (0.8608):
Interpretation: 'mean radius' is a highly useful feature for distinguishing between malignant and benign tumors.
Implication: In a decision tree, 'mean radius' would likely be chosen early for splitting nodes due to its high information gain.

The 'mean texture' Information Gain (0.8357):
Interpretation: 'mean texture' is also valuable for classifying tumors as malignant or benign.
Implication: It is likely to be an important feature in decision tree nodes, contributing to effective class separation.

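Comparison:
Both features are highly informative, but 'mean radius' has a slightly higher information gain (0.8608 vs. 0.8357).
In decision tree construction, both features are expected to play key roles in creating more homogeneous subsets.
Caveat: both features are continuous, and the function creates one subset per distinct value. Most subsets therefore contain only a few samples and are nearly pure, which pushes the gain toward the total entropy of the target for any feature with many distinct values. The two scores should be read with this in mind; decision trees such as `DecisionTreeClassifier` split continuous features on thresholds instead.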

In summary, both 'mean radius' and 'mean texture' are valuable for classifying breast tumors, with 'mean radius' standing out slightly in this analysis. These insights guide decision tree construction for effective classification.
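Because 'mean radius' and 'mean texture' are continuous, a common alternative to one branch per distinct value is to score binary splits of the form `x <= threshold`. The sketch below is one minimal way to do that, not the method used in this lab or scikit-learn's exact procedure; the helper `best_threshold_gain` and the sample arrays are hypothetical.

```python
import numpy as np

def entropy(y):
    # Shannon entropy (in bits) of a vector of discrete labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold_gain(x, y):
    # Candidate thresholds: midpoints between consecutive distinct values
    distinct = np.unique(x)
    thresholds = (distinct[:-1] + distinct[1:]) / 2
    parent = entropy(y)
    best_gain, best_threshold = 0.0, None
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_gain, best_threshold

# Hypothetical measurements, not taken from the breast cancer data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
gain, threshold = best_threshold_gain(x, y)
print(gain, threshold)
```

On the lab data, the same helper could be called as `best_threshold_gain(df_cancer['mean radius'].to_numpy(), df_cancer['target'].to_numpy())`. The resulting scores may rank the two features differently from the per-value gains reported above.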