# Lab 9: Continuous attribute splitting (part 2)

## Introduction

Welcome to part 2 of this lab in our Data Mining course! In this lab, we'll dive into a crucial step in data analysis and preprocessing: finding the best split point for a continuous attribute. This step is fundamental to data mining tasks such as decision tree construction and feature selection, and it can improve the accuracy of our models.

### Objective

Our primary objective in this lab is to learn how to determine the best split point for a continuous variable in a dataset. We'll be focusing on the concept of **Information Gain**, a measure used in decision tree algorithms to split a node. Information gain helps us understand which split will most effectively help us classify the data.

### Dataset

We'll be working with the 'diabetes.csv' dataset, a widely used dataset in machine learning. This dataset presents a set of variables measured in individuals, some of whom have diabetes.

### Key Takeaways

By the end of this lab, you will be able to:
- Understand the concept of information gain and its importance in data mining.
- Implement a method to calculate the best split for continuous variables based on information gain.
- Apply these concepts to the 'diabetes.csv' dataset to gain practical experience.

Let's get started and explore how to make informed decisions using data!


In [2]:
import pandas as pd
import numpy as np

## Dataset Description

In this lab, we will be working with the `diabetes.csv` dataset. This dataset is commonly used in the field of data mining and machine learning to illustrate various techniques and algorithms.

### Overview of the Dataset

The `diabetes.csv` dataset comprises several medical predictor variables and one target variable, `Outcome`. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

### Features of the Dataset

The dataset contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration 2 hours into an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skin fold thickness (mm)
- `Insulin`: 2-Hour serum insulin (mu U/ml)
- `BMI`: Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1), where 1 denotes an individual with diabetes

_Note: Ensure that you have the `diabetes.csv` dataset available in your working environment before starting the lab exercises._


In [3]:
df = pd.read_csv('diabetes.csv')
print(df.head())  # head is a method; call it to print the first five rows

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50 

## Entropy Function Explanation

The `entropy` function is a crucial part of our information gain calculation. It measures the impurity or randomness in the data, which helps in deciding how a dataset should be split. The entropy of a dataset is calculated using the formula:

$$ \text{Entropy} = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) $$
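
As a small worked example (the counts are made up for illustration and are not taken from `diabetes.csv`): here $p(x_i)$ is the fraction of rows belonging to class $x_i$ and $n$ is the number of distinct classes. A target column with 6 rows of class 1 and 2 rows of class 0 would give

$$ \text{Entropy} = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.311 + 0.500 = 0.811 $$

A perfectly balanced two-class column (4 rows of each class) reaches the maximum of 1, while a column containing a single class has an entropy of 0.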

In [46]:
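def entropy(target_col):
    # Count how often each class label occurs in the target column
    elements, counts = np.unique(target_col, return_counts=True)
    # One possible completion of the "Fix Me": proportions, then the formula above
    probabilities = counts / counts.sum()
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy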

## Understanding Information Gain

### What is Information Gain?

Information Gain is a measure used in decision trees and other machine learning algorithms to determine how well a feature splits a set of data. It is the splitting criterion of ID3, and related criteria appear in C4.5 (gain ratio) and CART (which typically uses the Gini index). Information Gain helps in selecting the feature that partitions the data most effectively, dividing the dataset into groups that are as pure as possible.

The formula for Information Gain is:

$$ \text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{k} \frac{N_k}{N} \, \text{Entropy}(\text{Child}_k) $$

- **Entropy (Parent)**: This is the measure of uncertainty or disorder before the split. It quantifies how much variation there is in the class label.

- **Weighted entropy of the children**: This term represents the amount of disorder or uncertainty after the split. The entropy of each subset is calculated and then weighted by the proportion $N_k / N$ of the parent's rows that fall into that subset.

The goal is to maximize the information gain, which implies a greater reduction in uncertainty after the split.
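
To make the formula concrete, here is a minimal sketch on a hand-made toy table. The column names `x` and `y`, the helper `label_entropy`, and the threshold of 3.5 are all assumptions chosen for illustration; none of them come from `diabetes.csv` or from the lab's own functions.

```python
import numpy as np
import pandas as pd


def label_entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log2(probabilities)))


# Toy data (hypothetical): 8 rows, 4 of each class
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                    "y": [0, 0, 0, 1, 0, 1, 1, 1]})

threshold = 3.5
lower = toy[toy["x"] <= threshold]   # 3 rows, all class 0
upper = toy[toy["x"] > threshold]    # 5 rows, four of class 1 and one of class 0

parent_entropy = label_entropy(toy["y"])
# Each child's entropy is weighted by its share of the parent's rows
children_entropy = (len(lower) / len(toy)) * label_entropy(lower["y"]) \
    + (len(upper) / len(toy)) * label_entropy(upper["y"])
toy_gain = parent_entropy - children_entropy

print(parent_entropy)       # 1.0
print(round(toy_gain, 4))   # 0.5488
```

The lower group is pure, so it contributes no entropy; all of the remaining uncertainty comes from the single class-0 row left in the upper group.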

### Parameters of the InfoGain Function

- `data`: The dataset on which the information gain calculation is to be performed. It should include both the feature columns and the target column.

- `split_column`: The name of the column in the dataset based on which the split is to be calculated. This column is the feature on which the decision to split will be based.

- `split_value`: The value in the `split_column` that is used to divide the data into two groups. Data where the value in `split_column` is less than or equal to `split_value` forms one group, while the rest forms another.

- `target_name`: The name of the target variable in the dataset (default `"Outcome"`). This is the variable for which the purity increase (or entropy decrease) is measured.


In [47]:
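def InfoGain(data, split_column, split_value, target_name="Outcome"):
    # Entropy of the target before the split
    total_entropy = entropy(data[target_name])
    # Rows at or below the split value, and rows above it (boolean indexing;
    # equivalent to where(...).dropna() when the data has no missing values)
    lower_group = data[data[split_column] <= split_value]
    upper_group = data[data[split_column] > split_value]
    # One possible completion of the "Fix Me" lines: each child's entropy,
    # weighted by its share of the rows
    lower_entropy = len(lower_group) / len(data) * entropy(lower_group[target_name])
    upper_entropy = len(upper_group) / len(data) * entropy(upper_group[target_name])
    weighted_entropy = lower_entropy + upper_entropy
    information_gain = total_entropy - weighted_entropy
    return information_gain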

## Midpoint Splitting for Any Continuous Variable

Midpoint splitting is a technique used in data mining to determine optimal split points for continuous variables.

### Process

- Define `split_column` as the continuous variable of interest.
- Sort its unique values and calculate midpoints between consecutive pairs.
- These midpoints serve as potential split points.
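
The midpoint step can also be sketched with NumPy alone. The array below is toy data invented for illustration (not values from `diabetes.csv`), and `toy_midpoints` is a hypothetical name.

```python
import numpy as np

# Toy values (hypothetical); note the duplicate 4.0
values = np.array([4.0, 1.0, 7.0, 4.0, 2.0])

# np.unique returns the distinct values in sorted order: 1, 2, 4, 7
unique_sorted = np.unique(values)

# Average each value with its right-hand neighbour
toy_midpoints = (unique_sorted[:-1] + unique_sorted[1:]) / 2

print(toy_midpoints.tolist())  # [1.5, 3.0, 5.5]
```

Four distinct values yield three candidate split points. Duplicates collapse before the midpoints are taken, so no candidate ever coincides with an observed value.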


In [48]:
# Assuming 'Outcome' is your class variable
# Replace 'BMI' with the column name you want to evaluate
split_column = 'BMI'
unique_values = sorted(df[split_column].unique())
midpoints = [(unique_values[i] + unique_values[i+1]) / 2 for i in range(len(unique_values)-1)]

## Finding the Optimal Split Point

The code aims to find the best split point in a dataset for a specified continuous variable, using information gain as the metric.

### Steps

1. Initialize `best_split` and `max_info_gain`.
2. Iterate over each midpoint calculated from the continuous variable.
3. Calculate the information gain for each midpoint using the `InfoGain` function.
4. Compare each information gain with the current maximum. If it's higher, update `max_info_gain` and `best_split`.
5. Finally, output the best split point and its corresponding information gain.


In [51]:
best_split = None
max_info_gain = -1

for midpoint in midpoints:
    # One possible completion of the "Fix Me": gain from splitting at this midpoint
    current_info_gain = InfoGain(df, split_column, midpoint)
    #print("midpoint = ",midpoint, "information gain = ",current_info_gain)
    if current_info_gain > max_info_gain:
        max_info_gain = current_info_gain
        best_split = midpoint

print(f"Best split for {split_column} is at {best_split} with an information gain of {max_info_gain}")

Best split for BMI is at 27.85 with an information gain of 0.07489856064647493
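
The same search can be wrapped in a function so that it is easy to repeat for other columns. The following is a self-contained sketch under stated assumptions rather than the lab's reference solution: `label_entropy` and `find_best_split` are hypothetical helper names, and the toy table is invented so that the expected answer is obvious.

```python
import numpy as np
import pandas as pd


def label_entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log2(probabilities)))


def find_best_split(data, column, target):
    """Return (best_midpoint, best_gain) for one continuous column."""
    values = np.unique(data[column])              # sorted unique values
    candidates = (values[:-1] + values[1:]) / 2   # midpoints between neighbours
    parent_entropy = label_entropy(data[target])

    best_midpoint, best_gain = None, -1.0
    for midpoint in candidates:
        lower = data.loc[data[column] <= midpoint, target]
        upper = data.loc[data[column] > midpoint, target]
        # Children's entropies weighted by their share of the rows
        weighted = (len(lower) * label_entropy(lower)
                    + len(upper) * label_entropy(upper)) / len(data)
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_midpoint, best_gain = float(midpoint), gain
    return best_midpoint, best_gain


# Toy data (hypothetical): the classes separate cleanly between x = 3 and x = 4
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                    "y": [0, 0, 0, 1, 1, 1, 1, 1]})
split, gain = find_best_split(toy, "x", "y")
print(split, round(gain, 4))  # 3.5 0.9544
```

Because the split at 3.5 leaves both children pure, the gain equals the parent's entropy, which is the largest value any split of this table can achieve. With the real dataset, a call such as `find_best_split(df, "Glucose", "Outcome")` would follow the same pattern.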


## Exercise: Implementing Gini Index, Misclassification Error Rate, and Gain Calculation

### Objective

Your task is to implement the Gini index, Misclassification Error Rate, and Gain calculations at home. This exercise will enhance your understanding of different metrics used to evaluate the effectiveness of dataset splits in decision tree algorithms.

### Instructions

1. **Implement the Gini Index Function**:
   - Write a function to calculate the Gini index for a given dataset split.
   - The function should accept groups of data and the list of classes and return the Gini index.

2. **Implement the Misclassification Error Rate Function**:
   - Similarly, develop a function to calculate the Misclassification Error Rate for dataset splits.
   - This function should also take groups of data and the list of classes as inputs.

3. **Implement the Gain Calculation**:
   - Create a function to calculate the Gain, using either the Gini index or the Misclassification Error Rate.
   - This function should help compare the impurity of the dataset before and after the split.

4. **Test Your Implementations**:
   - Use a simple dataset to test your functions for both Gini index and Misclassification Error Rate.
   - Ensure the correctness of your implementations by cross-checking with theoretical values or using established libraries.
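
For cross-checking, the standard textbook definitions are sketched below. The notation is an assumption of this note rather than something fixed by the lab: $p_i$ is the fraction of rows in class $i$ within a node, $N_k$ is the number of rows in child $k$, $N$ is the number of rows in the parent, and $I$ stands for whichever impurity measure is chosen.

$$ \text{Gini} = 1 - \sum_{i=1}^{n} p_i^2 $$

$$ \text{Misclassification Error} = 1 - \max_{i} p_i $$

$$ \text{Gain} = I(\text{Parent}) - \sum_{k} \frac{N_k}{N} \, I(\text{Child}_k) $$

For example, a node with 6 rows of one class and 2 of the other has a Gini index of $1 - (0.75^2 + 0.25^2) = 0.375$ and a misclassification error of $1 - 0.75 = 0.25$.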

This exercise will deepen your practical understanding of key metrics in decision tree algorithms and their impact on model performance.
