# 1) Encoding Numerical Data

| Feature         | Values |
| --------------- | ------ |
| Age             | 21–60  |
| YearsExperience | 0–35   |
| MonthlySalary   | 10k–5L (10,000–500,000) |


Two techniques are mainly used:

### Discretization (Binning) & Binarization

Binning (Discretization): discretization is the process of transforming a continuous variable
into a discrete variable by creating a set of contiguous intervals that span the range of
the variable's values. Discretization is also called binning, where "bin" is an alternative name
for an interval.

Why use discretization:
1) To handle outliers
2) To improve the value spread of skewed variables

Binning types:
1) Unsupervised binning (equal width, equal frequency, k-means)
2) Supervised binning
3) Custom binning

### Equal Width / Uniform Binning
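Equal-width binning splits the range of a variable into `n_bins` intervals of identical width, `(max - min) / n_bins`. The snippet below is a minimal sketch of that arithmetic using `pandas.cut`; the sample values (including the outlier `1000`) and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: eleven small values and one large outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000])
n_bins = 4

# Every bin has the same width: (max - min) / n_bins
width = (values.max() - values.min()) / n_bins
edges = values.min() + width * np.arange(n_bins + 1)

# pd.cut with an integer `bins` builds equal-width intervals over the data range;
# labels=False returns the bin index of each value
codes = pd.cut(values, bins=n_bins, labels=False)
counts = np.bincount(codes, minlength=n_bins)

print("width :", width)
print("edges :", edges)
print("counts:", counts)
```

In this sample the single outlier stretches the range, so almost every value lands in the first bin and the middle bins stay empty.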

### Equal Frequency / Quantile Binning

Each interval contains the same share (x%) of the total observations.
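As a rough illustration, `pandas.qcut` places the bin edges at quantiles so that each bin receives about the same number of observations. The sample values and variable names below are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: eleven small values and one large outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000])
n_bins = 4

# q=4 puts the edges at the 25th, 50th and 75th percentiles;
# labels=False returns bin indices, retbins=True also returns the edges
codes, edges = pd.qcut(values, q=n_bins, labels=False, retbins=True)
counts = np.bincount(codes, minlength=n_bins)

print("edges :", edges)
print("counts:", counts)
```

Here every bin ends up with three observations, and the outlier only widens the last bin instead of distorting the others.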

| Aspect               | Equal Width | Quantile      |
| -------------------- | ----------- | ------------- |
| Bin width            | Fixed       | Variable      |
| Observations per bin | Unequal     | Roughly equal |
| Outlier effect       | High        | Low           |
| Empty bins           | Possible    | ❌             |
| Skew handling        | Poor        | ✅             |


### KMeans Binning

In k-means binning, each bin is defined by a cluster centroid rather than by a fixed interval, and every value goes into the bin of its nearest centroid:
1) A centroid is the mean (center) of all points assigned to a cluster.
2) Each cluster has exactly one centroid.
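The idea can be sketched with a toy one-dimensional k-means in NumPy. This is an illustration only, not sklearn's implementation: the function name `kmeans_bins_1d`, the evenly spaced starting centroids, and the sample values are all assumptions made for the example.

```python
import numpy as np

def kmeans_bins_1d(values, n_bins, n_iter=100):
    """Toy 1-D k-means (Lloyd's algorithm); returns centroids and bin labels."""
    values = np.asarray(values, dtype=float)
    # Assumption: start from centroids evenly spaced over the data range
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    centroids = (edges[:-1] + edges[1:]) / 2
    for _ in range(n_iter):
        # Assign each value to its nearest centroid
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if the cluster is empty)
        updated = np.array([
            values[labels == k].mean() if np.any(labels == k) else centroids[k]
            for k in range(n_bins)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, labels

income = [1, 2, 3, 24, 25, 26, 50, 51, 52]  # hypothetical values
centroids, labels = kmeans_bins_1d(income, n_bins=3)

# The boundary between two neighbouring bins lies halfway between their centroids
boundaries = (centroids[:-1] + centroids[1:]) / 2

print("centroids :", centroids)
print("labels    :", labels)
print("boundaries:", boundaries)
```

The bin boundaries are a by-product of the centroids: moving a centroid moves the boundary with it.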

sklearn implementation:

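```python
from sklearn.preprocessing import KBinsDiscretizer

# X is assumed to be a pandas DataFrame with a numeric 'income' column
kb = KBinsDiscretizer(
    n_bins=3,            # number of bins to create
    strategy='kmeans',   # bin edges derived from 1-D k-means clusters
    encode='ordinal'     # return the bin index (0, 1, 2) for each row
)

# fit_transform returns a 2-D array of shape (n_samples, 1); flatten it for a single column
X['income_bin'] = kb.fit_transform(X[['income']]).ravel()
```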

| strategy     | Binning Type    |
| ------------ | --------------- |
| `'uniform'`  | Equal width     |
| `'quantile'` | Equal frequency |
| `'kmeans'`   | K-Means         |


| encode           | Output        | Best For       |
| ---------------- | ------------- | -------------- |
| `'ordinal'`      | Integers      | Tree models    |
| `'onehot'`       | Sparse matrix | Linear models  |
| `'onehot-dense'` | Dense matrix  | Small datasets |
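To see what each `encode` option returns, the sketch below fits the same discretizer three times on a small made-up column. The data and variable names are hypothetical, and `strategy='uniform'` is chosen only so the bin edges are easy to check by hand.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy column; uniform edges over 0-30 with 3 bins are [0, 10, 20, 30]
values = np.array([[0.0], [5.0], [10.0], [15.0], [20.0], [30.0]])

outputs = {}
for encode in ("ordinal", "onehot", "onehot-dense"):
    kb = KBinsDiscretizer(n_bins=3, strategy="uniform", encode=encode)
    outputs[encode] = kb.fit_transform(values)

ordinal = outputs["ordinal"]      # shape (6, 1): one bin index per row
onehot = outputs["onehot"]        # SciPy sparse matrix, shape (6, 3)
dense = outputs["onehot-dense"]   # NumPy array, shape (6, 3)

print(ordinal.ravel())
print(type(onehot).__name__, onehot.shape)
print(dense)
```

The ordinal output keeps one column per feature, while both one-hot variants create one column per bin.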


## Custom / Domain Binning

### What is Custom (Domain) Binning?
**Custom binning** (also called **domain-driven binning**) is a discretization technique where bin boundaries are defined using **expert knowledge, business rules, or real-world standards**, rather than purely statistical methods.

Instead of letting the algorithm decide bin edges, **humans decide what ranges are meaningful**.

---

### Why use Custom Binning?
- Improves **interpretability**
- Aligns features with **business logic**
- Makes models easier to **explain to stakeholders**
- Often required in **regulated industries** (finance, healthcare)

---

### Key Characteristics
- Bins are **manually defined**
- Does **not** rely on data distribution
- Target variable may or may not be used
- Highly **interpretable**, sometimes less optimal statistically
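A minimal sketch of manually defined bins with `pandas.cut` follows. The age edges and group labels are hypothetical values chosen for illustration; in practice they would come from domain knowledge or business rules.

```python
import pandas as pd

ages = pd.Series([21, 30, 31, 45, 60], name="Age")

# Hypothetical, hand-picked edges and labels (not derived from the data)
edges = [20, 30, 45, 60]
labels = ["early-career", "mid-career", "late-career"]

# By default intervals are right-closed: (20, 30], (30, 45], (45, 60]
age_group = pd.cut(ages, bins=edges, labels=labels)

print(age_group.tolist())
```

Values outside the outermost edges would become `NaN`, so the edges should cover the full expected range of the feature.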

---

### Examples

#### Finance (Credit Scoring)
```text
Credit Score:
< 580      → Poor
580 – 669  → Fair
670 – 739  → Good
≥ 740      → Excellent
```



## Binarization

### What is Binarization?
**Binarization** is a feature transformation technique where a numerical variable is converted into a **binary feature (0 or 1)** based on a chosen threshold.

It answers a **yes / no** question instead of preserving exact numerical magnitude.

---

### Why use Binarization?
- Simplifies numerical features
- Improves **interpretability**
- Useful when **presence matters more than magnitude**
- Works well with **linear models** and rule-based systems

---

### How Binarization Works
A threshold \( t \) is selected and the transformation is applied as:

\[
x' =
\begin{cases}
1 & \text{if } x > t \\
0 & \text{otherwise}
\end{cases}
\]
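The rule above can be sketched in a few lines of NumPy. The function name `binarize` and the sample incomes are made up for this example; note that a value exactly equal to the threshold maps to 0 because the comparison is strict.

```python
import numpy as np

def binarize(values, threshold):
    """Return 1 where a value is strictly greater than the threshold, else 0."""
    return (np.asarray(values) > threshold).astype(int)

income = [30_000, 50_000, 50_001, 120_000]  # hypothetical values
high_income = binarize(income, threshold=50_000)

print(high_income)
```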

---

### Examples

#### Finance
```text
Income > 50,000 → 1 (High Income)
Else            → 0 (Low Income)
```


sklearn example:
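```python
from sklearn.preprocessing import Binarizer

# X is assumed to be a pandas DataFrame with a numeric 'income' column
binz = Binarizer(threshold=50000)  # values strictly greater than 50000 become 1, the rest 0

# fit_transform returns a 2-D array of shape (n_samples, 1); flatten and cast to int
X['high_income'] = binz.fit_transform(X[['income']]).ravel().astype(int)
```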