
**Introduction**<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Suppose a shopkeeper gives a discount to customers who buy more than 10 items. The shopkeeper does not look at the exact number of items bought, only at whether it is more than 10. This is a case for discretization.
<br>
Here the shopkeeper splits the data into two categories:
<br>
1. 10 items or fewer (no discount)<br>
2. More than 10 items (discount)<br>
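
As a minimal sketch of the shopkeeper's rule, the exact item count can be reduced to one of two categories. The helper name `discount_category` and the sample purchase counts are made up for illustration.

```python
# Hypothetical helper: maps an exact item count to one of two categories.
def discount_category(item_count, threshold=10):
    """Return 'discount' only when more than `threshold` items are bought."""
    return "discount" if item_count > threshold else "no discount"

# Sample purchases (illustrative values, not from a real dataset)
purchases = [3, 10, 11, 25]
categories = [discount_category(n) for n in purchases]
print(categories)
```

Note that a purchase of exactly 10 items falls in the "no discount" category, because the rule is "more than 10".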



**Discretization (Binning):** <br>
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Discretization is the process of converting continuous data and continuous variables into discrete counterparts.

        

**Steps in Discretization**<br>
**Step 1:**<br>
Identify the continuous attribute, such as temperature, blood pressure, age, or income.<br>
**Step 2:**<br>
Determine the number of intervals (bins), such as 10-20, 20-30, 30-40, and so on.<br>
**Step 3:**<br>
Select a category label for each bin, such as young, middle-aged, and old for age groups.<br>
**Step 4:**<br>
Apply the cut points to the data.


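In [None]:
# Example code:
import pandas as pd

age = [18, 25, 26, 38, 44, 56, 80]
parts = [0, 30, 60, 80]                    # cut points: (0, 30], (30, 60], (60, 80]
names = ['young', 'middle-aged', 'old']    # one label per bin
# right=True makes each bin include its right edge, so 80 falls in 'old'
age_groups = pd.cut(age, bins=parts, labels=names, right=True)
print(age_groups)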


['young', 'young', 'young', 'middle-aged', 'middle-aged', 'middle-aged', 'old']
Categories (3, object): ['young' < 'middle-aged' < 'old']


**Advantages of Discretization**<br>
1. Can improve the performance of some machine learning algorithms.<br>
2. Makes data easier to understand and interpret.<br>
3. Can handle outliers by placing them in a "low" or "high" bin.<br>
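
The third point can be illustrated with open-ended outer bins. This is only a sketch: the sample values, the cut points 20 and 60, and the labels are assumptions chosen for the demonstration.

```python
import pandas as pd

# Illustrative values; 300 is an outlier far above the rest
values = [2, 25, 47, 61, 300]

# Infinite outer edges make the first and last bins open-ended,
# so any extreme value still lands in 'low' or 'high'
edges = [float("-inf"), 20, 60, float("inf")]
groups = pd.cut(values, bins=edges, labels=["low", "medium", "high"])
print(list(groups))
```

The outlier 300 ends up in the same "high" bin as 61, so its extreme magnitude no longer affects anything downstream.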


**Methods in discretization:**<br>
**1.Equal-Width binning technique**<br>
Equal-width binning divides the range of a continuous variable into intervals (bins) that all have the same width.
<br>
width = (max - min) / k &nbsp; &nbsp; &nbsp;(k = number of bins)
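
The formula can be worked through by hand in plain Python. This sketch uses the ten ages from the equal-width example in this section with k = 4. The variable names are illustrative, and the hand-rolled index uses left-closed intervals, whereas `pd.cut` with `right=True` uses right-closed ones. For this data the two agree, because no age lies exactly on an interior edge.

```python
# Ages from the equal-width example; k is the number of bins
ages = [5, 12, 25, 36, 45, 52, 63, 75, 85, 95]
k = 4

lo, hi = min(ages), max(ages)
width = (hi - lo) / k                      # (95 - 5) / 4 = 22.5
edges = [lo + i * width for i in range(k + 1)]

# Bin index of each value; the maximum is clamped into the last bin
bin_index = [min(int((x - lo) // width), k - 1) for x in ages]

print(width)
print(edges)
print(bin_index)
```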


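In [None]:
import pandas as pd

ages = [5, 12, 25, 36, 45, 52, 63, 75, 85, 95]   # width = (95 - 5) / 4 = 22.5
# With bins=4, pandas computes four equal-width bins over the range 5 to 95 itself.
# The edges are roughly 5, 27.5, 50, 72.5, 95; the lowest edge is extended
# slightly below 5 so that the minimum value is included.
labels = ["young", "adult", "middle", "old"]
age_groups = pd.cut(ages, bins=4, labels=labels, right=True)
print(age_groups)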


['young', 'young', 'young', 'adult', 'adult', 'middle', 'middle', 'old', 'old', 'old']
Categories (4, object): ['young' < 'adult' < 'middle' < 'old']


In [None]:
import pandas as pd
ages=[25,30,35,40,45,50,55,60,73,80]
df=pd.DataFrame(ages,columns=['age'])
#printing original data
print("Original ages\n")
print(df)
#printing discretized data
print("\nDiscretized ages\n")
bin_edges=[20,40,60,80]
labels=["young","middle","old"]
df['age_group']=pd.cut(df['age'],bins=bin_edges,labels=labels,right=True)
print(df)

Original ages

   age
0   25
1   30
2   35
3   40
4   45
5   50
6   55
7   60
8   73
9   80

Discretized ages

   age age_group
0   25     young
1   30     young
2   35     young
3   40     young
4   45    middle
5   50    middle
6   55    middle
7   60    middle
8   73       old
9   80       old


**Advantages:**<br>
1. Simple and fast to implement.<br>
2. Easy to interpret and explain.<br>
3. Works well when the data is uniformly distributed.<br>
**Disadvantages:**<br>
1. Sensitive to outliers: extreme values stretch the range, and with it the width of every bin.<br>
2. May produce uneven bin frequencies (some bins may hold many values and others very few).<br>
3. Does not consider the data distribution, so it is not ideal for skewed data.<br>
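
The first two disadvantages can be seen on a tiny example. This is a sketch with invented values (five small numbers and one outlier), not data from the notebook.

```python
import pandas as pd

# Illustrative data: five small values and one outlier
values = [1, 2, 3, 4, 5, 100]

# labels=False returns the integer bin index of each value
codes = pd.cut(values, bins=3, labels=False)

# Count how many values fall in each of the three equal-width bins
counts = [int((codes == i).sum()) for i in range(3)]
print(counts)
```

The outlier stretches the range to 1-100, so each bin is about 33 units wide. Five of the six values share the first bin and the middle bin stays empty.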

**2.Equal-Frequency Binning:**<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Equal-frequency binning (also known as quantile binning) is a data preprocessing technique used to divide a numeric variable into bins (or intervals) such that each bin contains approximately the same number of data points.<br>
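
To see what "approximately the same number of data points" means on skewed data, here is a small sketch. The six sample values are invented for illustration.

```python
import pandas as pd

# Illustrative skewed data: five small values and one outlier
skewed = [1, 2, 3, 4, 5, 100]

# q=3 places the cut points at the 1/3 and 2/3 quantiles of the data;
# labels=False returns the integer bin index of each value
quantile_codes = pd.qcut(skewed, q=3, labels=False)

# Count how many values fall in each quantile bin
quantile_counts = [int((quantile_codes == i).sum()) for i in range(3)]
print(quantile_counts)
```

Because the cut points follow the data rather than the range, each bin receives two values even though the last bin spans a much wider interval than the others.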


In [None]:
import pandas as pd

marks = [18, 22, 25, 27, 29, 33, 35, 40, 42, 45, 47, 50, 52, 55, 60,80,88,90,99,100]
df = pd.DataFrame(marks,columns=['marks'])
# q=5 cuts at the 20th, 40th, 60th and 80th percentiles, so each grade gets 4 of the 20 marks;
# retbins=True also returns the bin edges that qcut computed
df['grade'], edges = pd.qcut(df['marks'], q=5, labels=["Ok","Good","very Good","Excellent","Outstanding"], retbins=True)
print(df)
print([round(float(e), 1) for e in edges])

    marks        grade
0      18           Ok
1      22           Ok
2      25           Ok
3      27           Ok
4      29         Good
5      33         Good
6      35         Good
7      40         Good
8      42    very Good
9      45    very Good
10     47    very Good
11     50    very Good
12     52    Excellent
13     55    Excellent
14     60    Excellent
15     80    Excellent
16     88  Outstanding
17     90  Outstanding
18     99  Outstanding
19    100  Outstanding
[18.0, 28.6, 41.2, 50.8, 81.6, 100.0]


**3.Custom/Domain Based Binning:**<br>
Custom (or domain-based) binning means you manually define bin boundaries based on your domain knowledge, business rules, or logical thresholds — rather than relying on statistical quantiles or equal widths.<br>


In [None]:
import pandas as pd

marks = [18, 22, 25, 27, 29, 33, 35, 40, 42, 45, 47, 50, 52, 55, 60, 80, 88, 90, 99, 100]
df = pd.DataFrame({'marks': marks})

# Custom / domain-based bins
bins = [0, 40, 60, 75, 90, 100]  # manually chosen cutoffs
labels = ['Fail', 'Pass', 'Average', 'Good', 'Excellent']

df['grade'] = pd.cut(df['marks'], bins=bins, labels=labels, right=True)

print(df)


    marks      grade
0      18       Fail
1      22       Fail
2      25       Fail
3      27       Fail
4      29       Fail
5      33       Fail
6      35       Fail
7      40       Fail
8      42       Pass
9      45       Pass
10     47       Pass
11     50       Pass
12     52       Pass
13     55       Pass
14     60       Pass
15     80       Good
16     88       Good
17     90       Good
18     99  Excellent
19    100  Excellent
