# Lab 4 - Information Theory in Machine Learning

Welcome to this week's lab on Information Theory! This week, we will dive into the fascinating world of Information Theory as applied to Machine Learning. Specifically, we will focus on two key concepts: Entropy and Information Gain. These principles are fundamental in understanding how decision trees make split decisions to organize data effectively.

### Entropy
- Entropy, in the context of information theory, measures the level of uncertainty or disorder within a set of data.
- In machine learning, particularly in decision trees, entropy helps to determine how a dataset should be split. A high entropy means more disorder, indicating that our dataset is varied. Conversely, low entropy suggests more uniformity in the data.

### Information Gain
- Information Gain measures the reduction in entropy after the dataset is split on an attribute.
- It is crucial in building decision trees as it helps to decide the order of attributes the tree will use for splitting the data. The attribute with the highest Information Gain is chosen as the splitting attribute at each node.

## Part 1: Entropy and Information Gain in Decision Trees
Decision Trees use these concepts to create branches. By choosing splits that maximize Information Gain (or equivalently minimize entropy), a decision tree can effectively categorize data, leading to better classification or regression models.

### Step 1: Import Necessary Libraries

In [49]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

### Step 2: Load and Explore the Iris Dataset

In [2]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

In [7]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


### Step 3: Calculate Entropy
To calculate the `entropy` we need to:
- First, extract the target variable `y` from your dataset (like the 'target' column in the Iris dataset).
- Then, call `calculate_entropy(y)` to get the entropy.

This function calculates the entropy of a given target variable `y`. It works by first determining the unique classes in `y`, then computes the probability of each class, and uses this probability to calculate the entropy. This is a crucial step in understanding the disorder or uncertainty in the dataset, a fundamental concept in information theory.

In [2]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy

What is your observastion about the calculated Entropy?

In [5]:
calculate_entropy(df['target'])

1.584962500721156

The entropy in the iris data indicates high uncertainty and therefore a decision tree is required to split the data for better classification model

### Step 4: Calculate Information Gain
There are three steps for calculating the Information Gain:
1. Compute Overall Entropy: Use the entropy function from Step 3 on the entire target dataset.
2. Calculate Weighted Entropy for Each Attribute: For each unique value in the attribute, partition the dataset and calculate its entropy. Then calculate the weighted sum of these entropies, where the weights are the proportions of instances in each partition.
3. Compute Information Gain: Subtract the weighted entropy of the split from the original entropy.

The attribute with the highest Information Gain is generally chosen for splitting, as it provides the most significant reduction in uncertainty. This step is critical in constructing an effective decision tree, as it directly influences the structure and depth of the tree.

In [3]:
def calculate_information_gain(df, attribute, target_name):
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = sum((counts[i] / sum(counts)) * calculate_entropy(df.where(df[attribute] == values[i]).dropna()[target_name]) for i in range(len(values)))
    information_gain = total_entropy - weighted_entropy
    return information_gain


Discuss your findings here.

In [11]:
calculate_information_gain(df,"sepal length (cm)","target")

0.8769376208910578

In [12]:
calculate_information_gain(df,"sepal width (cm)","target")

0.5166428756804977

In [13]:
calculate_information_gain(df,"petal length (cm)","target")

1.4463165236458

In [14]:
calculate_information_gain(df,"petal width (cm)","target")

1.4358978386754417

When using the decision tree, petal length is chosen for splitting the data with the highest information gains among all other attributes, petal width also obtained high information gain. These two attributes provide massive reduction in uncertainty of the data, which then reduce the depth of the decision tree to make the model simpler.

## Part 2: Apply Entropy and Information Gain on a different dataset

Your task is to choose a new dataset and implement what you learned in `Part 1` on this new dataset.

### Task 1: Implement Entropy and Information Gain

In [60]:
# Your code goes here

# This is the cleaned data of mushrooms based on their characteristics and whether they are edible 
mushroom=pd.read_csv("mushroom_cleaned.csv")
mushroom.head()

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
0,1372,2,2,10,3.807467,1545,11,1.804273,1
1,1461,2,2,10,3.807467,1557,11,1.804273,1
2,1371,2,2,10,3.612496,1566,11,1.804273,1
3,1261,6,2,10,3.787572,1566,11,1.804273,1
4,1305,6,2,10,3.711971,1464,11,0.943195,1


In [5]:
# Check for missing value
mushroom.isnull().sum()

cap-diameter       0
cap-shape          0
gill-attachment    0
gill-color         0
stem-height        0
stem-width         0
stem-color         0
season             0
class              0
dtype: int64

In [6]:
# Check for data types
mushroom.dtypes

cap-diameter         int64
cap-shape            int64
gill-attachment      int64
gill-color           int64
stem-height        float64
stem-width           int64
stem-color           int64
season             float64
class                int64
dtype: object

In [32]:
# check the unique values of class, stem color, gill attachment, gill color, season, and cap shape variables to see if it's the classifier variable
mushroom['class'].unique()

array([1, 0], dtype=int64)

In [61]:
mushroom['gill-color'].unique()

array([10,  5,  7,  9,  0,  3, 11,  8,  1,  6,  4,  2], dtype=int64)

In [62]:
mushroom['stem-color'].unique()

array([11, 12,  6, 10,  0,  5,  9,  8,  1,  4,  3,  7,  2], dtype=int64)

In [63]:
mushroom['gill-attachment'].unique()

array([2, 0, 1, 5, 6, 4, 3], dtype=int64)

In [25]:
mushroom['cap-shape'].unique()

array([2, 6, 4, 0, 1, 5, 3], dtype=int64)

In [26]:
mushroom['season'].unique()

array([1.80427271, 0.94319455, 0.88845029, 0.02737213])

In [64]:
# Then convert them to categorical variables
mushroom['gill-color']=mushroom['gill-color'].astype('category')
mushroom['stem-color']=mushroom['stem-color'].astype('category')
mushroom['stem-color']=mushroom['season'].astype('category')
mushroom['stem-color']=mushroom['cap-shape'].astype('category')
mushroom['stem-color']=mushroom['gill-attachment'].astype('category')

In [52]:
# Calculate entropy
calculate_entropy(mushroom['class'])

0.993009580583416

In [53]:
# calculate information gain for season
calculate_information_gain(mushroom,"season","class")

0.01601829610359451

In [54]:
# calculate information gain for stem color
calculate_information_gain(mushroom,"stem-color","class")

0.026061249199076597

In [55]:
# calculate information gain for stem width
calculate_information_gain(mushroom,"stem-width","class")

0.1490404348729285

In [32]:
# calculate information gain for stem height
calculate_information_gain(mushroom,"stem-height","class")

0.08370862312457261

In [56]:
# calculate information gain for gill color
calculate_information_gain(mushroom,"gill-color","class")

0.032025445676068576

In [57]:
# calculate information gain for gill attachment
calculate_information_gain(mushroom,"gill-attachment","class")

0.026061249199076597

In [58]:
# calculate information gain for cap shape
calculate_information_gain(mushroom,"cap-shape","class")

0.03966888709859595

In [59]:
# calculate information gain for cap diameter
calculate_information_gain(mushroom,"cap-diameter","class")

0.06756970497036341

### Task 2: Discuss your findings in detail
Provide detailed explanation and discussion about your findings.

The entropy for the y variable (class of edible or poisonous) in the mushroom dataset is 0.99, which is lower than the entropy in part 1, indicating more uniformity in the mushroom data. The information gain for all attributes related to the mushroom are much lower than the ones in part 1, with the highest being the stem width which obtained 0.149 and chosen for splitting the decision tree. In this case, the decision tree is not as appropriate to determine whether the mushrooms are edible or not.

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.