# **Discretization Techniques**

In the previous topic we briefly mentioned a number of univariate discretization techniques. These techniques can be used to categorize a continuous variable or simply to smooth it. In this topic we expand on several of the binning techniques.

# **Simple Discretization Methods: Binning**

The following two binning techniques are the simplest. In the first, the bins are of equal width and the number of data points differs from bin to bin. This can be very useful when trying to understand the density of your data.
In the second, the bin widths vary but each bin holds roughly the same number of data points. This can be useful for understanding possible outliers and the spread of your data.

* Equal-width (distance) partitioning:
> * It divides the range into N intervals of equal size (a uniform grid).
> * If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N.
> * It is the most straightforward method.
> * Outliers may dominate the presentation.
> * Skewed data is not handled well.

* Equal-depth (frequency) partitioning:
> * It divides the range into N intervals, each containing approximately the same number of samples.
> * Good data scaling
> * Managing categorical attributes can be tricky.

The diagram below shows how the two methods work; the resulting bins are typically drawn as histograms. Neither method checks whether any predictive power is lost, because neither takes account of the effect of binning on the outcome variable.

![Simple discretization methods](https://www.computing.dcu.ie/~amccarren/mcm_images/Simple_discritization_methods.jpg)
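
As a rough illustration of the two partitioning schemes, the sketch below bins a small price list (the same twelve values used in the smoothing example in this topic) in both ways. The helper names `equal_width_bins` and `equal_depth_bins` are made up for this example, and the code assumes the values are not all identical (otherwise the width W would be zero).

```python
# Price data reused from the smoothing example in this topic.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3


def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n.

    Hypothetical helper; assumes max(values) > min(values).
    """
    low, high = min(values), max(values)
    width = (high - low) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value would fall just past the last interval,
        # so clamp its index into the final bin.
        index = min(int((v - low) / width), n - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split values into n bins holding (nearly) the same number of points."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one more point when the
        # number of values is not divisible by n.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


width_bins = equal_width_bins(prices, n_bins)
depth_bins = equal_depth_bins(prices, n_bins)

print([len(b) for b in width_bins])  # [3, 3, 6]
print([len(b) for b in depth_bins])  # [4, 4, 4]
```

With N = 3 the equal-width intervals are each 10 dollars wide but hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each and differ in width. For real datasets, `pandas.cut` and `pandas.qcut` perform equal-width and quantile-based binning respectively.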

# **Binning Methods for Data Smoothing**

In this approach we simply use binning to smooth our data: the bins are used to calculate new values that replace the old ones. Let's see how this works for the following data:

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Step 1: Partition into (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34

* Step 2: Smooth by bin means, rounded to the nearest dollar (you could use the median instead):
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29

* Or smooth by bin boundaries, replacing each value with the closer of its bin's minimum and maximum:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
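
The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch: the function names are illustrative, the means are rounded to the nearest dollar to match the figures above, and ties in boundary smoothing are sent to the upper boundary (an assumption; the price data contains no ties).

```python
from statistics import median

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bin_size = 4

# Step 1: equi-depth bins of four consecutive values each.
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]


def smooth_by_mean(values):
    """Replace every value with the bin mean, rounded to the nearest dollar."""
    m = round(sum(values) / len(values))
    return [m] * len(values)


def smooth_by_median(values):
    """Replace every value with the bin median."""
    return [median(values)] * len(values)


def smooth_by_boundaries(values):
    """Replace every value with the closer of the bin minimum and maximum."""
    low, high = min(values), max(values)
    # Assumption: a value exactly halfway between goes to the upper boundary.
    return [low if v - low < high - v else high for v in values]


by_mean = [smooth_by_mean(b) for b in bins]
by_median = [smooth_by_median(b) for b in bins]
by_boundaries = [smooth_by_boundaries(b) for b in bins]

print(by_mean)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
print(by_median)
```

The unrounded bin means are 9, 22.75 and 29.25, which is why bins 2 and 3 appear as 23 and 29.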


The following code example from [GeeksforGeeks](https://www.geeksforgeeks.org/python-binning-method-for-data-smoothing/) applies the three smoothing methods above (bin mean, bin boundaries and bin median) to one column of the Iris dataset, using 30 equi-depth bins of five values each.

Try it out for the remaining columns of the Iris dataset.


In [None]:
import numpy as np
from sklearn.datasets import load_iris

# load iris data set
dataset = load_iris()
a = dataset.data

# take the second of the four columns (index 1, sepal width) and sort it
b = np.sort(a[:, 1])
print('b is', b)
# create one 30x5 output array per smoothing method (30 bins of 5 values each)
bin1=np.zeros((30,5))
bin2=np.zeros((30,5))
bin3=np.zeros((30,5))

# Bin mean
for i in range (0,150,5):
    k=int(i/5)
    mean=(b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4])/5
    for j in range(5):
        bin1[k,j]=mean
print("Bin Mean: \n",bin1)

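# Bin boundaries: replace each value with the closer of the bin's
# minimum b[i] and maximum b[i+4]; ties go to the maximum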
for i in range (0,150,5):
    k=int(i/5)
    for j in range (5):
        if (b[i+j]-b[i]) < (b[i+4]-b[i+j]):
            bin2[k,j]=b[i]
        else:
            bin2[k,j]=b[i+4]
print("Bin Boundaries: \n",bin2)

# Bin median: each bin is sorted, so its middle element b[i+2] is the median
for i in range (0,150,5):
    k=int(i/5)
    for j in range (5):
        bin3[k,j]=b[i+2]
print("Bin Median: \n",bin3)