# Lab 4 - Information Theory in Machine Learning

Welcome to this week's lab on Information Theory! This week, we will dive into the fascinating world of Information Theory as applied to Machine Learning. Specifically, we will focus on two key concepts: Entropy and Information Gain. These principles are fundamental in understanding how decision trees make split decisions to organize data effectively.

### Entropy
- Entropy, in the context of information theory, measures the level of uncertainty or disorder within a set of data.
- In machine learning, particularly in decision trees, entropy helps to determine how a dataset should be split. A high entropy means more disorder, indicating that our dataset is varied. Conversely, low entropy suggests more uniformity in the data.
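
To make the idea concrete, entropy for class proportions `p_i` is the sum of `p_i * log2(1 / p_i)`, measured in bits. The snippet below is a minimal sketch on made-up label lists; the helper name `entropy_bits` is an assumption for illustration and is not part of the lab's provided code.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    # Sum of p * log2(1/p) over the classes that actually occur
    return float(np.sum(probabilities * np.log2(1.0 / probabilities)))

pure = ["a", "a", "a", "a"]      # one class only: no uncertainty
mixed = ["a", "b", "a", "b"]     # two equally likely classes
skewed = ["a", "a", "a", "b"]    # mostly one class

print(entropy_bits(pure))    # 0.0
print(entropy_bits(mixed))   # 1.0
print(entropy_bits(skewed))  # about 0.811
```

A uniform set scores 0 bits, an even two-class mix scores 1 bit, and anything in between falls between those values.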

### Information Gain
- Information Gain measures the reduction in entropy after the dataset is split on an attribute.
- It is crucial in building decision trees as it helps to decide the order of attributes the tree will use for splitting the data. The attribute with the highest Information Gain is chosen as the splitting attribute at each node.
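
As a small illustration, the sketch below computes Information Gain for two attributes of a four-row toy table. The table, its column names, and the helpers `entropy_bits` and `info_gain` are all made up for this example.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Shannon entropy (base 2) of a column of class labels."""
    p = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def info_gain(frame, attribute, target):
    """Parent entropy minus the size-weighted entropy of each partition."""
    parent = entropy_bits(frame[target])
    weighted_children = 0.0
    for _, subset in frame.groupby(attribute):
        weighted_children += len(subset) / len(frame) * entropy_bits(subset[target])
    return parent - weighted_children

# Hypothetical toy data: "outlook" determines "play", "windy" does not
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy":   ["yes", "no", "yes", "no"],
    "play":    ["no", "no", "yes", "yes"],
})

print(info_gain(toy, "outlook", "play"))  # 1.0
print(info_gain(toy, "windy", "play"))    # 0.0
```

Splitting on `outlook` removes all uncertainty about `play`, so a tree would choose it first; splitting on `windy` leaves each partition as mixed as the parent.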

## Part 1: Entropy and Information Gain in Decision Trees
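Decision Trees use these concepts to create branches. At each node, the tree picks the split that maximizes Information Gain, which is equivalent to minimizing the weighted entropy of the resulting child nodes. Repeating this step produces progressively purer partitions of the data and, in turn, a more accurate classifier. Regression trees follow the same idea but typically measure impurity with variance rather than entropy.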
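
For reference, scikit-learn's `DecisionTreeClassifier` can be told to use entropy as its split criterion. The following is a minimal sketch on the Iris data; the choice of `max_depth=2` and `random_state=0` is an assumption made only to keep the printed tree small and reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="entropy" makes the tree score candidate splits by information gain
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text rendering of the learned splits and the class at each leaf
print(export_text(clf, feature_names=list(iris.feature_names)))
```

Each line of the printed tree is a binary threshold test on one feature, which is how scikit-learn handles continuous attributes.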

### Step 1: Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

### Step 2: Load and Explore the Iris Dataset

In [2]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
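
The cell above loads the data but does not yet explore it. A short sketch of the kind of checks that are useful before computing entropy, using only standard pandas calls:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

print(df.shape)                      # (150, 5): four features plus the target
print(df.head())                     # first few rows
print(df["target"].value_counts())   # 50 rows for each of the three classes
print(dict(enumerate(iris.target_names.tolist())))  # class index -> species name
```

The class counts matter most here, because the entropy of the target depends only on how the rows are distributed across classes.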

### Step 3: Calculate Entropy
To calculate the `entropy` we need to:
- First, extract the target variable `y` from your dataset (like the 'target' column in the Iris dataset).
- Then, call `calculate_entropy(y)` to get the entropy.

This function calculates the entropy of a given target variable `y`. It first finds the unique classes in `y`, then computes the proportion of samples in each class, and finally sums `-p * log2(p)` over the classes. The result, measured in bits, quantifies the disorder or uncertainty in the dataset.

In [7]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

Calculate the entropy for the target variable. What is your observation about the calculated entropy?

In [5]:
target_entropy = calculate_entropy(df['target'])
print(f"Entropy of the target variable: {target_entropy}")

# The entropy of about 1.585 bits equals log2(3), the maximum possible for three classes. This is the expected value for perfectly balanced 3-class data: Iris has 50 samples of each species, so a randomly drawn label is as uncertain as it can be. High entropy at the root does not by itself show that the data is easy to classify; it only means that a split has the most uncertainty to remove, which is why Information Gain on the features is examined next.

Entropy of the target variable: 1.584962500721156
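
A quick way to interpret this number: for `k` equally likely classes, entropy reaches its maximum of `log2(k)`. The check below is a small sketch comparing the printed value with `log2(3)`; the variable names are chosen for illustration.

```python
import numpy as np

n_classes = 3
max_entropy = np.log2(n_classes)         # upper bound for a 3-class target
observed_entropy = 1.584962500721156     # value printed above

print(max_entropy)
print(np.isclose(observed_entropy, max_entropy))  # True
```

The two values agree, which confirms that the Iris target is perfectly balanced.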


### Step 4: Calculate Information Gain
There are three steps for calculating the Information Gain:
1. Compute Overall Entropy: Use the entropy function from Step 3 on the entire target dataset.
2. Calculate Weighted Entropy for Each Attribute: For each unique value in the attribute, partition the dataset and calculate its entropy. Then calculate the weighted sum of these entropies, where the weights are the proportions of instances in each partition.
3. Compute Information Gain: Subtract the weighted entropy of the split from the original entropy.

The attribute with the highest Information Gain is generally chosen for splitting, as it provides the most significant reduction in uncertainty. This step is critical in constructing an effective decision tree, as it directly influences the structure and depth of the tree.

**Use the provided function to calculate the information gain for each of the features in the dataset.**

In [6]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before any split
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    # Weighted average of the target entropy within each partition of `attribute`
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain
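
The instruction above asks for the gain of every feature, so here is one possible way to run that loop. This sketch repeats the two functions so it runs on its own, and it selects each partition with `df.loc`, which on Iris (no missing values) gives the same rows as the `where(...).dropna()` form.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

def calculate_information_gain(df, attribute, target_name):
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    return total_entropy - weighted_entropy

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Information gain of each feature, highest first
gains = {
    feature: calculate_information_gain(df, feature, "target")
    for feature in iris.feature_names
}
for feature, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature}: {gain:.4f}")
```

Because the function creates one partition per distinct value, continuous features with many distinct measurements tend to receive high gains, which is worth keeping in mind when comparing them.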


Discuss your findings here.

1. Compute the overall entropy by applying the Step 3 entropy function to the entire target column.
2. Compute the weighted entropy for each attribute: partition the dataset by the attribute's values, measure the entropy of each partition, and combine the partition entropies into a single weighted sum using the partition proportions as weights.
3. Compute the Information Gain by subtracting the weighted entropy of the split from the original entropy.

The feature with the highest Information Gain is selected for splitting because it reduces uncertainty about the target the most. This choice of split criterion is central to building a good decision tree because it shapes both the tree's structure and its depth.

## Part 2: Apply Entropy and Information Gain on a different dataset

Your task is to choose a new dataset and implement what you learned in `Part 1` on this new dataset.

### Task 1: Implement Entropy and Information Gain

In [36]:
from collections import Counter
import numpy as np
import pandas as pd

df = pd.read_csv("fitness_tracker.csv")

def entropy(y):
    """Calculates the entropy of a dataset."""
    class_counts = Counter(y)
    total_samples = len(y)
    entropy_value = 0.0

    for count in class_counts.values():
        probability = count / total_samples
        entropy_value -= probability * np.log2(probability)

    return entropy_value

def information_gain(df, attribute, target_name):
    """Calculates the information gain of an attribute."""
    total_entropy = entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    # Weighted average of the target entropy within each partition of `attribute`
    weighted_entropy = sum(
        (count / counts.sum()) * entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    gain = total_entropy - weighted_entropy
    return gain, None  # Returning None for split, as it's not calculated here

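# NOTE: 'ID' looks like a row identifier rather than a class label. If every ID is
# unique, any feature with all-distinct values gets a gain of log2(number of rows),
# so a categorical outcome column would make a more meaningful target.
target = 'ID'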
features = ['Workout Duration (mins)', 'Calories Burned', 'Step Count', 'Heart Rate']

# Calculate information gain for each feature
ig_results = {}
for feature in features:
    gain, split = information_gain(df, feature, target)
    ig_results[feature] = (gain, split)

# Display results
for feature, (gain, split) in ig_results.items():
    print(f"{feature}:")
    print(f"  Max Information Gain = {gain:.4f}")
    print(f"  Best Split Value = {split if split is not None else 'N/A'}\n")

Workout Duration (mins):
  Max Information Gain = 6.5879
  Best Split Value = N/A

Calories Burned:
  Max Information Gain = 8.9658
  Best Split Value = N/A

Step Count:
  Max Information Gain = 8.9418
  Best Split Value = N/A

Heart Rate:
  Max Information Gain = 6.7247
  Best Split Value = N/A



### Task 2: Discuss your findings in detail
Provide detailed explanation and discussion about your findings.

## 1. Most Informative Features:

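Calories Burned and Step Count have the highest Information Gain values (8.9658 and 8.9418 bits, respectively), so by this measure they separate the target values most completely.

These numbers should be read with caution. The target here is `ID`, and 8.9658 matches log2(500) to four decimal places, which is the entropy of a target that takes 500 distinct, equally frequent values. If that is the case, a gain equal to the full target entropy means every distinct value of Calories Burned corresponds to a single `ID`, which is what happens when a continuous feature is partitioned into one group per value. The ranking therefore most likely reflects how many distinct values each feature has rather than a predictive relationship with a meaningful outcome.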
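
One caveat is worth checking before relying on these values: when the target is a unique identifier, any feature with all-distinct values scores a gain of `log2(n_rows)` regardless of what it measures. The sketch below shows this on synthetic data. The row count of 500 is an assumption (chosen because 8.9658 is close to `log2(500)`), and the columns `unique_feature` and `coarse_feature` are hypothetical and unrelated to the fitness dataset.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Shannon entropy (base 2) of a column of labels."""
    p = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def info_gain(frame, attribute, target):
    """Parent entropy minus the size-weighted entropy of each partition."""
    parent = entropy_bits(frame[target])
    children = sum(
        len(subset) / len(frame) * entropy_bits(subset[target])
        for _, subset in frame.groupby(attribute)
    )
    return parent - children

n_rows = 500  # assumed row count, not taken from the real file
synthetic = pd.DataFrame({
    "ID": np.arange(n_rows),                                           # unique per row
    "unique_feature": np.random.default_rng(0).permutation(n_rows),    # all distinct, random
    "coarse_feature": np.arange(n_rows) % 100,                         # 100 distinct values
})

print(np.log2(n_rows))                                # about 8.9658
print(info_gain(synthetic, "unique_feature", "ID"))   # about 8.9658
print(info_gain(synthetic, "coarse_feature", "ID"))   # about 6.6439, i.e. log2(100)
```

Here the random column gets the maximum possible gain even though it carries no information about any real outcome, and the gain depends only on the number of distinct values.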

## 2. Moderately Informative Features:

Workout Duration and Heart Rate have lower Information Gain values (6.5879 and 6.7247 bits, respectively), so they rank below Calories Burned and Step Count.

With one partition per distinct value, a lower gain is what a feature with fewer distinct values (more rows sharing each value) would produce. These features may still be useful and should not be ruled out on this ranking alone.

## 3. Best Split Value = N/A:

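The N/A for Best Split Value across all features says nothing about the data. It appears because `information_gain` always returns `None` as its second value: the function partitions on every distinct value of a feature and never searches for a threshold.

For continuous features such as these, a threshold-based split (for example, `Heart Rate <= t`) is usually more appropriate. Finding a best split value would require evaluating candidate thresholds and keeping the one with the highest Information Gain.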
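
A split value could be obtained by searching over thresholds instead of returning `None`. The following is a minimal sketch of one common approach, which tries the midpoints between consecutive distinct values; the function name `best_threshold` and the example heart-rate and intensity values are made up for illustration.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def best_threshold(values, labels):
    """Return (gain, threshold) for the best binary split `values <= threshold`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    parent = entropy_bits(labels)
    distinct = np.unique(values)
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = (distinct[:-1] + distinct[1:]) / 2.0
    best_gain, best_split = 0.0, None
    for threshold in candidates:
        left = labels[values <= threshold]
        right = labels[values > threshold]
        children = (len(left) * entropy_bits(left)
                    + len(right) * entropy_bits(right)) / len(labels)
        gain = parent - children
        if gain > best_gain:
            best_gain, best_split = gain, float(threshold)
    return best_gain, best_split

# Hypothetical data: low heart rates go with "low" intensity, high with "high"
heart_rate = [60, 65, 70, 120, 130, 140]
intensity = ["low", "low", "low", "high", "high", "high"]

print(best_threshold(heart_rate, intensity))  # (1.0, 95.0)
```

With a binary threshold, the gain can never exceed the entropy of the target, and the returned split value is directly usable as a decision rule.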

## 4. Implications for Model Building:

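Calories Burned and Step Count rank highest here, but that ranking should be confirmed with a categorical target and threshold-based splits before these features are prioritized in a decision tree model.

Workout Duration and Heart Rate should be kept as candidate features; their lower scores in this experiment do not show that they contribute less to the model's performance.

Because no split value is calculated here, a practical model would use a tree learner that chooses binary thresholds on continuous features, such as scikit-learn's `DecisionTreeClassifier`.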

## Submission
Submit your completed Jupyter Notebook file through the submission link in Blackboard.