#**3-4-5 Rule for Attribute Discretization**

The 3-4-5 rule can be used to segment numerical data into relatively uniform, “natural” intervals.
> If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals: equal-width for 3, 6, and 9, and grouped 2-3-2 for 7.

> If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals.

> If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals.

#**Example**

1. Suppose that the profit values of a company for the year 2017 range from -3,51,976 to +4,70,00,896.

2. To limit the influence of noise, extremely high and extremely low values are not considered, so we first trim the data. Let’s discard the bottom 5% and the top 5% of the values.

3. Suppose that after discarding these values the new bounds are LOW = -159876 and HIGH = 1838761.

4. The most significant digit (MSD) is at the million position: it is the leading 1 of HIGH = 1838761.

5. The next step is to round LOW down and HIGH up at the MSD, that is, at the million position.

6. So LOW = -1000000 and HIGH = 2000000: -1000000 is the nearest million below -159876, and 2000000 is the nearest million above 1838761.

7. Now let’s identify the range of this interval.

8. Range = HIGH – LOW = 2000000 – (-1000000) = 3000000. We consider only the MSD here, which is 3, so the range covers 3 distinct values at the million position.

9. Now that we know the MSD of the range is 3, we can apply rule #1.
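The arithmetic in steps 4 to 8 can be checked with a few lines of Python. This is only a sketch of the worked example: the variable names are illustrative, and it assumes the MSD position is taken from the bound with the larger absolute value, as the example does.

```python
import math

# Trimmed bounds from step 3 of the example
low_raw, high_raw = -159876, 1838761

# Power of ten of the most significant digit (assumption: taken from the
# bound with the larger absolute value, as in the example)
msd_power = math.floor(math.log10(max(abs(low_raw), abs(high_raw))))
unit = 10 ** msd_power

# Round LOW down and HIGH up at the MSD position
low = math.floor(low_raw / unit) * unit
high = math.ceil(high_raw / unit) * unit

# Number of distinct values covered at the MSD position
count = (high - low) // unit

print(low, high, count)
```

With the example's bounds, `count` is 3, which is why rule #1 applies.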

Rule #1 says that we can divide this interval into three equal-width intervals:

Interval 1 : -1000000 to 0
Interval 2 : 0 to 1000000
Interval 3 : 1000000 to 2000000

You may be wondering how 0 can be part of more than one interval. You’re right, it cannot! We should represent the intervals as follows:

Interval 1 : (-1000000 … 0]
Interval 2 : (0 … 1000000]
Interval 3 : (1000000 … 2000000]
Here (a … b] denotes a range that excludes a but includes b; ( , ] is the notation for a half-open interval.
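To see how the half-open convention assigns boundary values, here is a minimal sketch using the standard-library `bisect` module. The helper name `interval_of` is hypothetical; `bisect_left` is used because it places a value equal to an edge in the interval that ends at that edge, which matches (a … b].

```python
from bisect import bisect_left

# Interval boundaries from the example
edges = [-1000000, 0, 1000000, 2000000]

def interval_of(x):
    """Return (a, b) such that a < x <= b, or None if x lies outside the edges."""
    i = bisect_left(edges, x)
    if i == 0 or i == len(edges):
        # x <= lowest edge (excluded by the open left end) or x > highest edge
        return None
    return (edges[i - 1], edges[i])

for x in (-159876, 0, 1, 1000000, 1838761):
    print(x, interval_of(x))
```

The value 0 lands only in (-1000000 … 0], and 1000000 lands only in (0 … 1000000], so no value belongs to two intervals.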



# **Exercise**

Write a Python script that uses the 3-4-5 rule to determine the intervals for the following measurements:

231,12,4500,-110, 24.5,673.1, 2100.23, -2, -99.2, 1999, 2410,-112,-45,1101.78, 2567.5,6.1, 109.4, 4.5, -456.6, 1.231, 3152

Alter your data to test your code.

# **Please do not show this code until they have attempted the exercise**

In [None]:
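import math
import numpy as np
import pandas as pd

# Apply the 3-4-5 rule: given the number of distinct values covered at the
# most significant digit, return the interval boundaries from low to high.
def rule(interval, high, low):
  split = []

  if interval in (3, 6, 9):
    # 3 equal-width intervals
    gap = (high - low) / 3
    split = [low, low + gap, low + gap * 2, high]
  elif interval == 7:
    # 3 intervals grouped 2-3-2
    gap = (high - low) / 7
    split = [low, low + gap * 2, low + gap * 5, high]
  elif interval in (2, 4, 8):
    # 4 equal-width intervals
    gap = (high - low) / 4
    split = [low, low + gap, low + gap * 2, low + gap * 3, high]
  elif interval in (1, 5, 10):
    # 5 equal-width intervals
    gap = (high - low) / 5
    split = [low, low + gap, low + gap * 2, low + gap * 3, low + gap * 4, high]

  # An empty list means the count matched none of the rules
  return split

X = pd.Series([231, 12, 4500, -110, 24.5, 673.1, 2100.23, -2, -99.2, 1999, 2410, -112, -45, 1101.78, 2567.5, 6.1, 109.4, 4.5, -456.6, 1.231, 3152])
X = X.sort_values()

# Positions that bound the data left after trimming the bottom and top 5%
b5 = math.floor(len(X) * 0.05)
t5 = math.floor(len(X) * 0.95)
print(len(X), b5, t5)
# .iloc makes the positional slice explicit (the index labels are unsorted)
XS = X.iloc[b5:(t5 + 1)]
print(XS.max())
print(XS.min())

# Power of ten of the most significant digit
j = math.floor(math.log10(np.abs(XS).max()))

# Round HIGH up and LOW down at the most significant digit
# (named high/low so the built-in max and min are not shadowed)
high = math.ceil(XS.max() / (10**j)) * (10**j)
low = math.floor(XS.min() / (10**j)) * (10**j)

# Number of distinct values covered at the most significant digit
no_intervals = round((high - low) / (10**j))
s = rule(no_intervals, high, low)
print(s)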
