# Data Discretization
Data discretization is the process of transforming continuous data or numerical features into discrete buckets or intervals.

This is useful for:

- Simplifying models (e.g., decision trees)
- Handling non-linear relationships
- Improving interpretability
- Using algorithms that require categorical input

### 1. Binning (General Concept)
Binning is the most basic form of discretization where continuous values are grouped into intervals (bins).

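**Equal-width vs. equal-frequency:** equal-width binning splits the value range into intervals of the same width, while equal-frequency binning picks the edges so that each bin holds roughly the same number of values.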

In [1]:
import pandas as pd

# Example data
data = pd.DataFrame({'Age': [22, 25, 27, 29, 35, 40, 50, 60, 70, 85]})

# Equal-width binning (3 bins)
data['EqualWidth'] = pd.cut(data['Age'], bins=3)

# Equal-frequency binning (quantile-based)
data['EqualFreq'] = pd.qcut(data['Age'], q=3)

print(data)

   Age      EqualWidth       EqualFreq
0   22  (21.937, 43.0]  (21.999, 29.0]
1   25  (21.937, 43.0]  (21.999, 29.0]
2   27  (21.937, 43.0]  (21.999, 29.0]
3   29  (21.937, 43.0]  (21.999, 29.0]
4   35  (21.937, 43.0]    (29.0, 50.0]
5   40  (21.937, 43.0]    (29.0, 50.0]
6   50    (43.0, 64.0]    (29.0, 50.0]
7   60    (43.0, 64.0]    (50.0, 85.0]
8   70    (64.0, 85.0]    (50.0, 85.0]
9   85    (64.0, 85.0]    (50.0, 85.0]
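
To make the difference concrete, the sketch below recomputes the bin edges by hand and counts how many ages land in each bin. It is only an illustration: the variable names (`ages`, `width_edges`, `freq_edges`) are not part of the original notebook. Note that `pd.cut` slightly lowers the first edge (21.937 instead of 22) so that the minimum value falls inside the first interval.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 35, 40, 50, 60, 70, 85], name="Age")

# Equal-width: split [min, max] into 3 intervals of identical width (21 years each).
width_edges = np.linspace(ages.min(), ages.max(), num=4)

# Equal-frequency: place the inner edges at the 1/3 and 2/3 quantiles.
freq_edges = ages.quantile([0, 1/3, 2/3, 1]).to_numpy()

# labels=False returns integer bin codes, which are easy to count.
width_counts = np.bincount(pd.cut(ages, bins=3, labels=False))
freq_counts = np.bincount(pd.qcut(ages, q=3, labels=False))

print("equal-width edges:", width_edges, "counts:", width_counts)
print("equal-freq edges: ", freq_edges, "counts:", freq_counts)
```

The equal-width bins are uneven in size (most ages fall in the first bin), whereas the quantile-based bins are close to balanced.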


### 2. Quantile-based Binning
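Also known as equal-frequency binning. Instead of dividing the value range into equal-width intervals, it places the bin edges at quantiles so that each bin contains roughly the same number of values. With 10 values and 4 bins, for example, the bins cannot all be the same size.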

In [4]:
data['Quantile_Bin'] = pd.qcut(data['Age'], q=4)
print(data)

   Age      EqualWidth       EqualFreq    Quantile_Bin
0   22  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]
1   25  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]
2   27  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]
3   29  (21.937, 43.0]  (21.999, 29.0]    (27.5, 37.5]
4   35  (21.937, 43.0]    (29.0, 50.0]    (27.5, 37.5]
5   40  (21.937, 43.0]    (29.0, 50.0]    (37.5, 57.5]
6   50    (43.0, 64.0]    (29.0, 50.0]    (37.5, 57.5]
7   60    (43.0, 64.0]    (50.0, 85.0]    (57.5, 85.0]
8   70    (64.0, 85.0]    (50.0, 85.0]    (57.5, 85.0]
9   85    (64.0, 85.0]    (50.0, 85.0]    (57.5, 85.0]
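
As a rough check, the quartile edges can be reproduced with `Series.quantile`, and the per-bin counts show that the bins are only approximately equal. The second half of the sketch uses a small made-up series (`tied`) to illustrate one practical caveat: heavily repeated values can produce duplicate edges, which `pd.qcut` rejects unless `duplicates='drop'` is passed.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 35, 40, 50, 60, 70, 85], name="Age")

# Quartile edges: the 0%, 25%, 50%, 75%, and 100% quantiles.
quartile_edges = ages.quantile([0, 0.25, 0.5, 0.75, 1]).to_numpy()

# Integer bin codes (0..3) make the bin sizes easy to count.
quartile_counts = np.bincount(pd.qcut(ages, q=4, labels=False))

print("edges: ", quartile_edges)
print("counts:", quartile_counts)

# Hypothetical data with many ties: several quantiles coincide at 1,
# so the duplicate edges are dropped and fewer than 4 bins remain.
tied = pd.Series([1, 1, 1, 1, 2, 3])
tied_codes = pd.qcut(tied, q=4, labels=False, duplicates="drop")
print("bins after dropping duplicate edges:", tied_codes.nunique())
```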


### 3. Domain-Driven Grouping (Semantic Binning)
Discretization based on domain expertise or business rules.
The bin edges are defined manually rather than derived from the distribution of the data.

In [5]:
bins = [0, 12, 19, 59, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']
data['Domain_Group'] = pd.cut(data['Age'], bins=bins, labels=labels)

In [6]:
print(data)

   Age      EqualWidth       EqualFreq    Quantile_Bin Domain_Group
0   22  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]        Adult
1   25  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]        Adult
2   27  (21.937, 43.0]  (21.999, 29.0]  (21.999, 27.5]        Adult
3   29  (21.937, 43.0]  (21.999, 29.0]    (27.5, 37.5]        Adult
4   35  (21.937, 43.0]    (29.0, 50.0]    (27.5, 37.5]        Adult
5   40  (21.937, 43.0]    (29.0, 50.0]    (37.5, 57.5]        Adult
6   50    (43.0, 64.0]    (29.0, 50.0]    (37.5, 57.5]        Adult
7   60    (43.0, 64.0]    (50.0, 85.0]    (57.5, 85.0]       Senior
8   70    (64.0, 85.0]    (50.0, 85.0]    (57.5, 85.0]       Senior
9   85    (64.0, 85.0]    (50.0, 85.0]    (57.5, 85.0]       Senior
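
With hand-picked edges it is worth checking what happens at the boundaries. By default `pd.cut` builds right-closed intervals such as `(0, 12]`, so a value equal to the lowest edge, or any value outside the edges, becomes `NaN`. The sketch below uses made-up boundary ages (`edge_ages`) to show this, along with `include_lowest=True` as one possible way to keep the lowest edge.

```python
import pandas as pd

bins = [0, 12, 19, 59, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']

# Hypothetical ages chosen to sit on or just past each bin edge.
edge_ages = pd.Series([0, 12, 13, 19, 20, 59, 60, 100, 101])

# Default: intervals are (0, 12], (12, 19], (19, 59], (59, 100].
default_groups = pd.cut(edge_ages, bins=bins, labels=labels)

# include_lowest=True also accepts the lowest edge (age 0) in the first bin.
inclusive_groups = pd.cut(edge_ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'Age': edge_ages,
    'Default': default_groups,
    'IncludeLowest': inclusive_groups,
}))
```

Age 101 stays unassigned in both cases because it lies above the last edge; widening the final edge is a domain decision rather than a technical one.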


### 4. Concept Hierarchies
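A concept hierarchy organizes values into levels of abstraction, from specific to general (for a date: month, quarter, year). Hierarchies like this are common in OLAP, data cubes, and knowledge discovery, where data is summarized at different levels of detail.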

In [7]:
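# Assign one fixed date per row (monthly, starting 2023-01-01), then extract hierarchy levels.
# 'MS' (month start) is used because the 'M' alias is deprecated in recent pandas versions;
# the derived Month, Quarter, and Year values are the same either way.
data['Sample_Date'] = pd.date_range(start='2023-01-01', periods=len(data), freq='MS')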
data['Month'] = data['Sample_Date'].dt.month
data['Quarter'] = data['Sample_Date'].dt.quarter
data['Year'] = data['Sample_Date'].dt.year

In [9]:
print(data[['Age', 'EqualWidth', 'Quantile_Bin', 'Domain_Group', 'Month', 'Quarter', 'Year']])

   Age      EqualWidth    Quantile_Bin Domain_Group  Month  Quarter  Year
0   22  (21.937, 43.0]  (21.999, 27.5]        Adult      1        1  2023
1   25  (21.937, 43.0]  (21.999, 27.5]        Adult      2        1  2023
2   27  (21.937, 43.0]  (21.999, 27.5]        Adult      3        1  2023
3   29  (21.937, 43.0]    (27.5, 37.5]        Adult      4        2  2023
4   35  (21.937, 43.0]    (27.5, 37.5]        Adult      5        2  2023
5   40  (21.937, 43.0]    (37.5, 57.5]        Adult      6        2  2023
6   50    (43.0, 64.0]    (37.5, 57.5]        Adult      7        3  2023
7   60    (43.0, 64.0]    (57.5, 85.0]       Senior      8        3  2023
8   70    (64.0, 85.0]    (57.5, 85.0]       Senior      9        3  2023
9   85    (64.0, 85.0]    (57.5, 85.0]       Senior     10        4  2023
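
Once the hierarchy levels exist as columns, moving up a level (a "roll-up" in OLAP terms) amounts to grouping by a coarser level. The sketch below is one possible illustration, not part of the original notebook: it rebuilds a small frame (`df`) with month-start dates, which is an assumption, and averages `Age` per quarter and per year.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 27, 29, 35, 40, 50, 60, 70, 85]})

# One date per row, at the start of each month (assumed for illustration).
df['Sample_Date'] = pd.date_range(start='2023-01-01', periods=len(df), freq='MS')
df['Quarter'] = df['Sample_Date'].dt.quarter
df['Year'] = df['Sample_Date'].dt.year

# Roll up from rows to quarters, then from quarters to the whole year.
by_quarter = df.groupby(['Year', 'Quarter'])['Age'].mean()
by_year = df.groupby('Year')['Age'].mean()

print(by_quarter)
print(by_year)
```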
