<a href="https://colab.research.google.com/github/AppleBoiy/intro-to-machine-learning/blob/main/preprocess/discretization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Discretization?

Discretization is the process of converting continuous data into discrete categories or intervals. This technique is often used in data preprocessing, especially when algorithms require or work better with categorical data instead of continuous numerical features.


## Why Use Discretization?

1. Improves Model Performance:
   * Some machine learning algorithms expect categorical inputs (e.g., categorical naive Bayes or association rule mining), and simple models such as linear classifiers can sometimes capture non-linear effects better when input features are discretized.
2. Simplifies the Model:
   * Converting continuous features to categories can smooth out small, noisy fluctuations and reduce model complexity, making patterns easier to detect.
3. Handles Outliers:
   * By grouping data into intervals, discretization can reduce the impact of outliers: an extreme value simply falls into the first or last interval, making the model more robust.
4. Makes Interpretation Easier:
   * Discretized features may be easier to interpret as meaningful categories (e.g., "low," "medium," "high") than as raw continuous values.
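
As a rough illustration of the outlier point, the sketch below bins a small made-up sample into three equal-frequency intervals with `pandas.qcut`. The values and the labels `low`, `medium`, and `high` are invented for this example only.

```python
import pandas as pd

# Hypothetical sample with one extreme value (500)
values = pd.Series([10, 12, 13, 15, 16, 18, 19, 21, 500])

# Three equal-frequency bins with illustrative labels
labels = pd.qcut(values, q=3, labels=["low", "medium", "high"])

# The outlier lands in the "high" bin together with 19 and 21
print(pd.DataFrame({"value": values, "bin": labels}))
```

After binning, 500 is just another "high" value, so its magnitude no longer dominates. Equal-width bins would behave differently on this sample: the outlier stretches the range, and every other point ends up sharing the first interval.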

## Methods of Discretization

1. Equal Width Discretization:
   * The range of the data is divided into intervals of equal width.

     Formula:

     $$
     \text{Interval Width} = \frac{\max(X) - \min(X)}{\text{Number of Intervals}}
     $$

   * Every interval spans the same range of values, but the number of data points per interval may vary.
2. Equal Frequency Discretization:
   * The data is divided into intervals such that each interval contains approximately the same number of data points.
   * This gives each interval a roughly equal share of the data, although tied values or a sample size that is not divisible by the number of intervals can prevent an exact split.
3. Clustering-Based Discretization (e.g., k-means):
   * A clustering algorithm groups the continuous values into clusters, and each cluster becomes a discrete category.
4. Decision Tree-Based Discretization:
   * A decision tree is trained on the feature, and the split thresholds it learns (those that best separate the target) are used as bin boundaries.
5. Custom Binning:
   * Domain knowledge is used to define specific intervals (e.g., age ranges: 0-18, 19-35, 36-50, etc.).
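
The first, second, and fifth methods can be sketched with pandas alone. This is a minimal illustration on a made-up `ages` series; the data, the choice of four bins, and the `51+` label are assumptions for the example, not part of the original text.

```python
import pandas as pd

# Hypothetical ages, invented for illustration
ages = pd.Series([3, 17, 22, 25, 31, 38, 44, 47, 52, 68])
n_bins = 4

# Equal width: every interval spans (max - min) / n_bins
width = (ages.max() - ages.min()) / n_bins
equal_width = pd.cut(ages, bins=n_bins, labels=False)

# Equal frequency: cut points are quantiles, so counts are roughly equal
equal_freq = pd.qcut(ages, q=n_bins, labels=False)

# Custom binning: edges come from domain knowledge; intervals are (left, right]
custom = pd.cut(
    ages,
    bins=[0, 18, 35, 50, 120],
    labels=["0-18", "19-35", "36-50", "51+"],
)

print("interval width:", width)
print(pd.DataFrame({
    "age": ages,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
    "custom": custom,
}))
```

With ten values and four bins, the equal-frequency counts cannot all match, which is why that method only guarantees approximately equal bin sizes.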


## When to Use Discretization?

*	When you need to transform continuous data into categorical data for specific algorithms.
*	When the relationship between variables is non-linear and discretization helps capture patterns.
*	When simplifying the analysis or improving interpretability of features is important.
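
When the goal is a supervised task, the decision tree-based method can pick cut points that follow the target rather than the feature's range. The following is a minimal sketch, assuming the Iris petal length as the feature and a limit of three leaves (both choices are illustrative); it reads the learned thresholds from scikit-learn's fitted `tree_` attribute and uses them as bin edges.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2]]  # petal length (cm), kept 2-D for scikit-learn
y = iris.target

# Three leaves means two split thresholds, i.e. three bins
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X, y)

# Internal nodes have feature index >= 0; leaves are marked with a negative index
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# right=True mirrors the tree's "value <= threshold goes left" rule
bins = np.digitize(X[:, 0], thresholds, right=True)

print("cut points:", thresholds)
print("bin counts:", np.bincount(bins))
```

Because the cut points are learned from the labels, they should be fitted on training data only and then reused on validation or test data.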


# Example of Discretization Using Python

Let’s demonstrate **equal width discretization** using the Iris dataset.

## Step 1: Import Libraries

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
```

## Step 2: Load and Inspect Data

```python
# Load the Iris dataset into a DataFrame with named feature columns
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
```

```python
# Inspect the first few rows
df.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


## Step 3: Discretize the Data Using Equal Width Binning

```python
# Initialize the KBinsDiscretizer with equal width discretization
# (strategy='uniform'); encode='ordinal' returns the bin index per value
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

# Apply discretization to the data (each column is binned independently)
df_discretized = discretizer.fit_transform(df)

# Convert the result into a DataFrame for better readability
df_discretized = pd.DataFrame(df_discretized, columns=df.columns)
```

```python
# Show the discretized data
df_discretized.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 1.0 | 3.0 | 0.0 | 0.0 |
| 1 | 0.0 | 2.0 | 0.0 | 0.0 |
| 2 | 0.0 | 2.0 | 0.0 | 0.0 |
| 3 | 0.0 | 2.0 | 0.0 | 0.0 |
| 4 | 0.0 | 3.0 | 0.0 | 0.0 |


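In the output:
* Each feature has been discretized independently into 5 bins, labeled 0 through 4, and every continuous value has been replaced by the index of the bin it falls into.
* For example, sepal length ranges from 4.3 to 7.9 in this dataset, so each bin is (7.9 - 4.3) / 5 = 0.72 wide. The value 5.1 falls in the second bin, [5.02, 5.74), and is therefore encoded as 1.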
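
To see where the cut points actually fall, a fitted `KBinsDiscretizer` exposes them through its `bin_edges_` attribute. The sketch below, which uses only sepal length to keep the output short (an illustrative choice), also compares the `uniform` strategy used above with the `quantile` and `kmeans` strategies, which correspond to equal frequency and clustering-based discretization.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

# Sepal length only, kept 2-D because scikit-learn expects a feature matrix
X = load_iris().data[:, [0]]

edges = {}
for strategy in ["uniform", "quantile", "kmeans"]:
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    disc.fit(X)
    # bin_edges_ holds one array of edges per feature
    edges[strategy] = disc.bin_edges_[0]
    print(f"{strategy:>8}: {np.round(edges[strategy], 2)}")
```

Only the `uniform` edges are evenly spaced; the other two strategies adapt the edges to where the data is concentrated. Depending on the installed scikit-learn version, the `quantile` strategy may emit a warning about changing defaults, which does not affect this illustration.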

## Conclusion

Discretization transforms continuous data into discrete categories. It can improve the performance of some machine learning algorithms, simplify models, and make the data more interpretable. However, it should be used carefully: too few bins or poorly placed cut points can discard valuable information or distort the relationships in the data.