# Lab 4 - Information Theory in Machine Learning

Welcome to this week's lab on Information Theory! This week, we will dive into the fascinating world of Information Theory as applied to Machine Learning. Specifically, we will focus on two key concepts: Entropy and Information Gain. These principles are fundamental in understanding how decision trees make split decisions to organize data effectively.

### Entropy
- Entropy, in the context of information theory, measures the level of uncertainty or disorder within a set of data.
- In machine learning, particularly in decision trees, entropy helps to determine how a dataset should be split. A high entropy means more disorder, indicating that our dataset is varied. Conversely, low entropy suggests more uniformity in the data.
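
For reference, the standard Shannon entropy can be written as below. The symbols are notation chosen here for illustration: $S$ is a set of labelled samples, $k$ is the number of classes, and $p_i$ is the fraction of samples in $S$ that belong to class $i$.

$$
H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
$$

A set containing a single class has $H(S) = 0$, and a set split evenly across $k$ classes reaches the maximum, $\log_2 k$.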

### Information Gain
- Information Gain measures the reduction in entropy after the dataset is split on an attribute.
- It is crucial in building decision trees as it helps to decide the order of attributes the tree will use for splitting the data. The attribute with the highest Information Gain is chosen as the splitting attribute at each node.
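
Written as a formula, with notation chosen here for illustration: $H(\cdot)$ denotes entropy, $S$ is the set of samples at a node, $A$ is the attribute being tested, and $S_v$ is the subset of $S$ in which $A$ takes the value $v$.

$$
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
$$

The sum is the weighted entropy of the partitions, so the gain is zero when a split leaves every subset as mixed as the parent and equals $H(S)$ when every subset is pure.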

## Part 1: Entropy and Information Gain in Decision Trees
Decision Trees use these concepts to create branches. By choosing splits that maximize Information Gain (equivalently, minimize the weighted entropy of the resulting subsets), a decision tree can categorize data effectively, leading to better classification models.

### Step 1: Import Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

### Step 2: Load and Explore the Iris Dataset

In [None]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

### Step 3: Calculate Entropy
To calculate the `entropy` we need to:
- First, extract the target variable `y` from your dataset (like the 'target' column in the Iris dataset).
- Then, call `calculate_entropy(y)` to get the entropy.

The function below calculates the entropy of a given target variable `y`. It first determines the unique classes in `y`, then computes the proportion of each class, and uses these proportions to calculate the entropy. This is a crucial step in quantifying the disorder or uncertainty in the dataset, a fundamental concept in information theory.

In [None]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

What is your observation about the calculated entropy?

The entropy function measures the uncertainty in the target labels by summing `-p * log2(p)` over the unique class labels, where `p` is the proportion of samples carrying each label. Higher entropy occurs when the labels are more evenly distributed, indicating greater uncertainty in classification. Conversely, lower entropy suggests that one class dominates, making the data more predictable. For the Iris dataset, a perfectly pure subset has an entropy of 0, while the entire dataset, with three equally represented species (50 samples each), has an entropy of log2(3), about 1.585.
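
The two reference values above can be checked with a short sketch. It uses only the standard library so it runs outside the notebook; `entropy_bits` is a hypothetical helper that mirrors `calculate_entropy`, and the label lists are built by hand on the assumption of 50 samples per species rather than loaded from scikit-learn.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


# A pure subset: every sample has the same label.
pure = ["setosa"] * 50

# Same class balance as the full Iris target: 50 samples per species.
balanced = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

print(entropy_bits(pure))      # no uncertainty
print(entropy_bits(balanced))  # equals log2(3) for three equal classes
```

The second value is the largest entropy any three-class target can have, which is why it exceeds 1.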

### Step 4: Calculate Information Gain
There are three steps for calculating the Information Gain:
1. Compute Overall Entropy: Use the entropy function from Step 3 on the entire target dataset.
2. Calculate Weighted Entropy for Each Attribute: For each unique value in the attribute, partition the dataset and calculate the entropy of the target within that partition. Then calculate the weighted sum of these entropies, where the weights are the proportions of instances in each partition.
3. Compute Information Gain: Subtract the weighted entropy of the split from the original entropy.

The attribute with the highest Information Gain is generally chosen for splitting, as it provides the most significant reduction in uncertainty. This step is critical in constructing an effective decision tree, as it directly influences the structure and depth of the tree.
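
The three steps can be traced on a tiny hand-made table. This is a minimal sketch using only the standard library; the helper names (`entropy_bits`, `information_gain`) and the toy rows are invented for illustration and are not part of the lab's code.

```python
import math
from collections import Counter, defaultdict


def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Information gain of `attribute` with respect to `target`; rows are dicts."""
    # Step 1: entropy of the whole target column.
    total = entropy_bits([r[target] for r in rows])
    # Step 2: partition the target by attribute value and weight each
    # partition's entropy by its share of the rows.
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    weighted = sum(len(g) / len(rows) * entropy_bits(g) for g in groups.values())
    # Step 3: reduction in entropy achieved by the split.
    return total - weighted


# Made-up data: "outlook" separates the classes perfectly, "windy" not at all.
toy = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

gain_outlook = information_gain(toy, "outlook", "play")
gain_windy = information_gain(toy, "windy", "play")
print(gain_outlook, gain_windy)
```

Here `outlook` recovers the full parent entropy of 1 bit, while `windy` leaves both partitions as mixed as the parent and gains nothing, so a tree would split on `outlook`.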

In [None]:
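def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before any split
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    # Weighted entropy after partitioning on each unique attribute value.
    # Boolean indexing selects the matching rows without dropping rows
    # that have missing values in unrelated columns.
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain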


Discuss your findings here.

The information gain function measures how well an attribute splits the data by comparing the total entropy before the split with the weighted entropy after the split. Attributes with higher information gain create more homogeneous subsets, reducing uncertainty about the target class. In decision trees, the attribute with the highest information gain is selected for splitting at each node, leading to more accurate classification. This ensures the tree focuses on the most informative features, improving prediction efficiency.
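
One caveat when applying this to Iris: its four features are continuous measurements, and a function that partitions on every distinct value treats each measurement as its own category. Decision tree implementations such as scikit-learn's evaluate binary threshold splits on numeric features instead. The sketch below illustrates that idea under stated assumptions: the measurements are made up, the helper names (`entropy_bits`, `best_threshold_gain`) are hypothetical, and it is not scikit-learn's actual algorithm.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold_gain(values, labels):
    """Return (threshold, gain) for the best split of the form value <= threshold."""
    total = entropy_bits(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) / n) * entropy_bits(left) + (len(right) / n) * entropy_bits(right)
        gain = total - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Made-up lengths (cm) and two made-up class labels, for illustration only.
lengths = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = ["a", "a", "a", "b", "b", "b"]

threshold, gain = best_threshold_gain(lengths, species)
print(threshold, gain)
```

In this toy case the midpoint between 1.5 and 4.5 separates the two classes completely, so the gain equals the parent entropy of 1 bit.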

## Part 2: Apply Entropy and Information Gain on a different dataset

Your task is to choose a new dataset and implement what you learned in `Part 1` on this new dataset.

For this part, I chose the Titanic dataset, which I downloaded from Kaggle. The goal is to predict whether a passenger survived based on features like gender, class, and fare.

In [8]:
# Load the dataset
import pandas as pd
file_path = "C:\\Users\\apoor\\Downloads\\titanic dataset\\Titanic-Dataset.csv"
df = pd.read_csv(file_path)
df.head()


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Task 1: Implement Entropy and Information Gain

In [12]:
# Implement the entropy function
import numpy as np
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy
# Calculating entropy of the 'Survived' column
target_entropy = calculate_entropy(df['Survived'])
print(f"Entropy of the target (Survived): {target_entropy}")


Entropy of the target (Survived): 0.9607079018756469


In [14]:
# Implement the information gain function
def calculate_information_gain(df, attribute, target_name='Survived'):
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    # Calculate the weighted entropy of the partitions
    weighted_entropy = sum(
        (counts[i] / sum(counts)) * calculate_entropy(
            df[df[attribute] == values[i]][target_name]
        )
        for i in range(len(values))
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain
# Calculating Information Gain for 'Sex'
info_gain_sex = calculate_information_gain(df, 'Sex')
print(f"Information Gain for 'Sex': {info_gain_sex}")
# Calculating Information Gain for 'Pclass'
info_gain_pclass = calculate_information_gain(df, 'Pclass')
print(f"Information Gain for 'Pclass': {info_gain_pclass}")


Information Gain for 'Sex': 0.2176601066606142
Information Gain for 'Pclass': 0.0838310452960116


### Task 2: Discuss your findings in detail
Provide a detailed explanation and discussion of your findings.

1. Entropy of the Target (Survived)
The entropy of the 'Survived' column reflects the uncertainty in the survival outcomes.
In our case, the entropy is about 0.96, close to the maximum of 1 for a two-class target, indicating that the survival outcomes are fairly mixed (not entirely predictable).
This suggests that the dataset contains both survivors and non-survivors in somewhat balanced proportions.
2. Information Gain for 'Sex'
The information gain for 'Sex' is relatively high (~0.218), meaning that gender is a strong predictor of survival.
This aligns with the historical fact that women had higher survival rates during the Titanic disaster (more females survived than males).
3. Information Gain for 'Pclass'
The information gain for 'Pclass' is lower (~0.084), but it still provides some predictive power.
This indicates that passenger class also affects survival: passengers in higher classes had better survival chances, but the effect is weaker than that of gender.
4. Conclusion
Of the two attributes evaluated, gender (Sex) is the more informative for predicting survival, with an information gain roughly 2.6 times that of passenger class.
This suggests that a decision tree using the entropy criterion and choosing between these two attributes would split on Sex at the root before using Pclass. Other attributes, such as Fare or Age, were not evaluated here.
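
To put the entropy in point 1 in context, the sketch below tabulates the entropy of a two-class target for several class proportions. It is an illustration only: `binary_entropy` is a hypothetical helper, not part of the notebook above, and `reported_entropy` is copied from the printed output of the entropy cell.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class target where one class has proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Value printed by the entropy cell for the 'Survived' column.
reported_entropy = 0.9607079018756469

for p in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"p = {p:.1f}  entropy = {binary_entropy(p):.3f}")
```

The reported entropy falls between the values for p = 0.4 and p = 0.3, so the smaller of the two outcome classes makes up between 30% and 40% of the passengers: mixed, but not an even split.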


## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.