## Topic: Measures of Position


### OUTCOMES

- 1. Percentile

- 2. Quartiles

- 3. IQR

- 4. Z-score

- 5. Example

In [4]:
import numpy as np
import pandas as pd

### 1. Percentile

- Definition:
    - A percentile indicates the value below which a given percentage of observations fall.

    - eg: The 75th percentile means 75% of data values lie below that point.

- Formula: location of the P-th percentile in the sorted data
    - PL = (P/100) * (N + 1)

    - where
        - PL => location (position) of the desired percentile in the sorted data
        - P => percentile (expressed as a percentage)
        - N => total number of observations

    - eg: P = 75 and N = 10 students
        - PL = (75/100) * (10 + 1)
            = 8.25
        - The 75th percentile lies at position 8.25, i.e. between the 8th and 9th sorted marks.

- Formula: Percentile rank (find the percentile of a given value)
    - (number of values below x / total number of values) * 100

    - eg: 78, 82, 84, 91, 93, 94, 96, 98, 99
    - Find the percentile rank of 84.

    - Calculation:
        - 2 of the 9 values (78 and 82) lie below 84
        - (2/9) * 100
        - ≈ 22.2%
        - So about 22% of the students scored below 84.


- Remember while calculating percentiles:
    - 1. The data must be sorted in ascending order.
    - 2. We are essentially finding the location of an observation.


- Use in ML:
    - To detect Outliers
    - To normalize data 
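The two formulas above can be sketched in plain Python. The function names `percentile_location` and `percentile_rank` are made up for this illustration, and the list is a copy of the dataset used in the cells that follow; treat this as a sketch, not a library API.

```python
def percentile_location(p, n):
    """1-based position of the p-th percentile among n sorted values."""
    return (p / 100) * (n + 1)


def percentile_rank(values, x):
    """Percentage of values that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


# Copy of the notebook's dataset (already sorted in ascending order).
data = [10, 12, 13, 15, 18, 21, 23, 25, 28, 30, 34, 35, 37, 40]

print(percentile_location(75, 10))  # 8.25 -> between the 8th and 9th values
print(percentile_rank(data, 25))    # 50.0 -> 7 of the 14 values are below 25
```

A fractional location such as 8.25 means the percentile value is usually obtained by interpolating between the two neighbouring sorted values.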
        

In [6]:
# dataset
data = [10, 12, 13, 15, 18, 21, 23, 25, 28, 30, 34, 35, 37, 40]
print("Data: ", data)

Data:  [10, 12, 13, 15, 18, 21, 23, 25, 28, 30, 34, 35, 37, 40]


In [4]:
# using numpy
data_25_percentile = np.percentile(data, 25)
data_50_percentile = np.percentile(data, 50)
data_75_percentile = np.percentile(data, 75)


print("25th Percentile: ", data_25_percentile)
print("50th Percentile: ", data_50_percentile)
print("75th Percentile: ", data_75_percentile)


25th Percentile:  15.75
50th Percentile:  24.0
75th Percentile:  33.0


In [None]:
# using pandas
# convert list to df
df = pd.DataFrame({"Data": data})
df

Unnamed: 0,Data
0,10
1,12
2,13
3,15
4,18
5,21
6,23
7,25
8,28
9,30


In [11]:
# find the 25th, 50th, and 75th percentiles using pandas

data_25_per = df['Data'].quantile(0.25)
data_50_per = df['Data'].quantile(0.50)
data_75_per = df['Data'].quantile(0.75)


print("25th Percentile: ", data_25_per)
print("50th Percentile: ", data_50_per)
print("75th Percentile: ", data_75_per)


25th Percentile:  15.75
50th Percentile:  24.0
75th Percentile:  33.0


In [None]:
# using numpy for df

percentile = np.percentile(df['Data'], [25, 50, 60, 75])

print("Percentile: \n", percentile) # (25, 50, 60, 75)percentile

Percentile: 
 [15.75 24.   27.4  33.  ]
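Note that the values printed by NumPy and pandas come from linear interpolation over (N - 1) gaps, which is their default, not from the (N + 1) location formula given earlier. Python's standard `statistics.quantiles` happens to offer both conventions, so a small sketch can show the difference on the same data (the list below is a copy of the notebook's dataset):

```python
import statistics

data = [10, 12, 13, 15, 18, 21, 23, 25, 28, 30, 34, 35, 37, 40]

# "inclusive": positions spread over (N - 1) gaps, the same convention as
# the NumPy / pandas defaults used in the cells above.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

# "exclusive": positions follow (P/100) * (N + 1), the location formula.
exclusive = statistics.quantiles(data, n=4, method="exclusive")

print("inclusive:", inclusive)  # [15.75, 24.0, 33.0]
print("exclusive:", exclusive)  # [14.5, 24.0, 34.25]
```

Both results are valid percentiles; they simply use different conventions, so it helps to state which one is in use when comparing hand calculations with library output.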


### 2. Quartiles

- Definition:
    - Quartiles divide the sorted data into four equal parts.

    - Q1 => (25th Percentile)
    - Q2 => (50th Percentile) [Median]
    - Q3 => (75th Percentile)


- Use in ML:
    - Helps visualize data spread using boxplots.

    - Used in feature scaling and outlier handling.

    - eg: when preprocessing numerical features, we can remove values below the 1st percentile or above the 99th percentile to reduce the effect of extreme data points (outliers).
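The percentile-based preprocessing step mentioned above could look roughly like this. The `values` array is made-up illustration data (not from this notebook), and the sketch shows two possible choices: dropping the extreme rows or capping them.

```python
import numpy as np

# Hypothetical feature column: 98 moderate values plus two extremes.
values = np.array([-500.0] + list(range(1, 99)) + [10_000.0])

# 1st and 99th percentiles of the column.
low, high = np.percentile(values, [1, 99])

# Option 1: drop rows outside the [1st, 99th] percentile band.
trimmed = values[(values >= low) & (values <= high)]

# Option 2: keep every row but cap extremes at the band edges.
clipped = np.clip(values, low, high)

print("band:", low, high)
print("rows kept after trimming:", len(trimmed))
print("min/max after clipping:", clipped.min(), clipped.max())
```

Trimming changes the number of rows, while clipping keeps the row count and only limits the extreme values; which one fits depends on the dataset.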

In [21]:
df.head()

Unnamed: 0,Data
0,10
1,12
2,13
3,15
4,18


In [25]:
# quartiles using np.percentile()

p25 = np.percentile(df['Data'], 25)

p50 = np.percentile(df['Data'], 50)

p75 = np.percentile(df['Data'], 75)

print("Percentile 25: ", p25)
print("Percentile 50: ", p50)
print("Percentile 75: ", p75)


Percentile 25:  15.75
Percentile 50:  24.0
Percentile 75:  33.0


In [None]:
# quartiles using pandas quantile()

Q1 = df['Data'].quantile(0.25)
Q2 = df['Data'].quantile(0.50)
Q3 = df['Data'].quantile(0.75)


print("Q1: ", Q1, "\nQ2: ", Q2, "\nQ3: ", Q3)

Q1:  15.75 
Q2:  24.0 
Q3:  33.0


In [28]:
df

Unnamed: 0,Data
0,10
1,12
2,13
3,15
4,18
5,21
6,23
7,25
8,28
9,30


### 3. Interquartile Range (IQR)

- Definition:
    - IQR measures the spread of the middle 50% of the data.

- Formula:
    - IQR = Q3 - Q1
    - where
        - Q3 => 75th percentile
        - Q1 => 25th percentile

    - IQR value =>
        - Smaller: the middle 50% of the data points are tightly packed.
            - eg: 1.2, 1.5, 2, 1.3, 2.1, 25

        - Larger: the middle 50% of the data points are widely spread (high variability).
            - eg: 10, 50, 100, 105, 300

- Outlier Detection Rule:
    - min_value = Q1 - (1.5 * IQR)
    - max_value = Q3 + (1.5 * IQR)

    - Values inside the range [min_value, max_value] are treated as normal (not outliers).
    - Values below min_value or above max_value are considered outliers.


- ML Connection:
    - Used for outlier removal.

    - Prevents extreme values from skewing results.

    - Used to draw boxplots.
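A minimal sketch of the outlier rule, using the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The helper name `iqr_outliers` and the extra value 120 are assumptions added for illustration; the rest of the list is the notebook's dataset.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) for the k * IQR rule."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Anything outside the two fences is reported as an outlier.
    outliers = arr[(arr < lower) | (arr > upper)]
    return lower, upper, outliers


# Notebook dataset plus one made-up extreme value (120).
sample = [10, 12, 13, 15, 18, 21, 23, 25, 28, 30, 34, 35, 37, 40, 120]
lower, upper, outliers = iqr_outliers(sample)

print("fences:", lower, upper)
print("outliers:", outliers)
```

The multiplier `k` is exposed as a parameter because 1.5 is a convention rather than a fixed law; a larger value makes the rule more tolerant.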


In [30]:
# Example of IQR (Interquartile Range)
# IQR = Q3 - Q1

Q1 = df['Data'].quantile(0.25)

Q3 = df['Data'].quantile(0.75)

IQR = Q3 - Q1

print("IQR: ", IQR)

IQR:  17.25


- IQR = 17.25 means the middle 50% of the data values are spread over a range of 17.25 units.

- Limitation:
    - If the dataset contains several large values at the upper end, Q3 itself is pulled upward by them, so the IQR is inflated and the upper limit moves out of reach.

    - eg: [10, 12, 14, 15, 12, 16, 80, 90, 98]
    - here 80, 90, 98 => affect Q3
    - Therefore the IQR is inflated, and the rule cannot detect these outliers properly.
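As a quick check of this limitation, the example list above can be run through the rule directly. This is only a sketch with NumPy's default percentile method; other percentile conventions may give slightly different quartiles.

```python
import numpy as np

values = np.array([10, 12, 14, 15, 12, 16, 80, 90, 98])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Upper fence:", upper_fence)
print("Flagged:", values[values > upper_fence])
```

Because three of the nine values are large, Q3 lands on 80 and the upper fence ends up far above 98, so none of the large values is flagged.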

In [23]:
# limitation of the IQR method for detecting outliers

lst = [22, 25, 27, 29, 35, 40, 42, 100, 110, 115]

df1 = pd.DataFrame({'Income':lst })

df1

Unnamed: 0,Income
0,22
1,25
2,27
3,29
4,35
5,40
6,42
7,100
8,110
9,115


In [None]:
# find IQR
Q1 = df1['Income'].quantile(0.25)

Q3 = df1['Income'].quantile(0.75)

IQR = Q3 - Q1

# outlier detection
# (a value is an outlier if value < lower_val or value > upper_val)

lower_val = Q1 - (1.5 * IQR)

upper_val = Q3 + (1.5 * IQR)


outlier_iqr = df1[
    (df1['Income'] < lower_val) | (df1['Income'] > upper_val)
    ]

print("IQR: ", IQR)

print("Outlier: ", outlier_iqr)

IQR:  58.0
Outlier:  Empty DataFrame
Columns: [Income]
Index: []


### 4. Z-Score

- Definition:
    - The Z-score (or Standard Score) tells how far a data point is from the mean in terms of standard deviations.

- Math formula:
    - z = (X - μ)/σ
    - where
        - z => z-score
        - X => each data value
        - μ => mean
        - σ => standard deviation

- Code formula:
    - Z-score = (data - mean)/std


- Interpretation:
    - z = 0 (value equals the mean)
    - z = 1 (value is 1 standard deviation above the mean)
    - z = -1 (value is 1 standard deviation below the mean)

- Use in ML:
    - Feature scaling (standardization)
    - Outlier detection
        - condition [not an outlier]: -3 <= Z-score <= +3
        - condition [outlier]: abs(Z) > 3
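A small sketch of the code formula with NumPy only. One detail worth knowing: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `std()` defaults to the population version (`ddof=0`), so the two libraries give slightly different z-scores unless `ddof` is set explicitly. The `income` array is a copy of the Income values used in this notebook.

```python
import numpy as np

income = np.array([22, 25, 27, 29, 35, 40, 42, 100, 110, 115], dtype=float)

mean = income.mean()

# Sample standard deviation (ddof=1): what pandas' std() uses by default.
z_sample = (income - mean) / income.std(ddof=1)

# Population standard deviation (ddof=0): what np.std() uses by default.
z_population = (income - mean) / income.std(ddof=0)

print(np.round(z_sample, 6))
print(np.round(z_population, 6))
```

The difference shrinks as the number of observations grows, but on a 10-row sample it is visible in the second decimal place.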


In [6]:
# Example of z-score
df1

Unnamed: 0,Income
0,22
1,25
2,27
3,29
4,35
5,40
6,42
7,100
8,110
9,115


In [16]:
# z-score = (data - mean)/std

# mean
mean_val = df1['Income'].mean()

std_val = df1['Income'].std()

Z_scores = (df1["Income"] - mean_val)/std_val

print(Z_scores)



0   -0.858756
1   -0.779486
2   -0.726640
3   -0.673793
4   -0.515254
5   -0.383137
6   -0.330291
7    1.202258
8    1.466491
9    1.598607
Name: Income, dtype: float64


In [17]:
# Another way
z_score = (df1 - df1.mean()) / df1.std()

z_score

Unnamed: 0,Income
0,-0.858756
1,-0.779486
2,-0.72664
3,-0.673793
4,-0.515254
5,-0.383137
6,-0.330291
7,1.202258
8,1.466491
9,1.598607


In [None]:
# Detect outliers using the z-score
# add a column named Z-score
df1['Z-score'] = (df1['Income'] - df1['Income'].mean()) / df1['Income'].std()

df1

Unnamed: 0,Income,Z-score
0,22,-0.858756
1,25,-0.779486
2,27,-0.72664
3,29,-0.673793
4,35,-0.515254
5,40,-0.383137
6,42,-0.330291
7,100,1.202258
8,110,1.466491
9,115,1.598607


In [None]:
# Outlier condition: abs(z-score) > 3

outliers = df1[np.abs(df1['Z-score']) > 3]

print("Outliers: ", outliers)

# No rows are flagged, even though 100, 110, and 115 look like outliers:
# these large values inflate the mean and standard deviation, which shrinks every z-score.

Outliers:  Empty DataFrame
Columns: [Income, Z-score]
Index: []
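There is also a sample-size reason why nothing is flagged here. With the sample standard deviation (`ddof=1`), the largest possible |z| in a sample of n values is (n - 1) / sqrt(n), a consequence of Samuelson's inequality. The sketch below checks this for n = 10 with a made-up, deliberately extreme sample.

```python
import math
import statistics

n = 10  # number of rows in the Income data
bound = (n - 1) / math.sqrt(n)

# Made-up sample: nine zeros and one enormous value.
extreme = [0.0] * 9 + [1_000_000.0]
mean = statistics.mean(extreme)
std = statistics.stdev(extreme)  # sample standard deviation (ddof=1)
largest_z = max(abs(x - mean) / std for x in extreme)

print(round(bound, 3))      # 2.846
print(round(largest_z, 3))  # 2.846
```

So with only 10 rows, the abs(z) > 3 condition can never be met, no matter how extreme a value is; the threshold of 3 only becomes usable on larger samples.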


### Modified Z-score method (robust) to detect outliers

- Formula:
    - modifiedZ = 0.6745 * (xi - median(X)) / MAD

- where

    - MAD (Median Absolute Deviation)
        - formula: MAD = median(abs(xi - median(X)))

        - where, xi => each individual data point
        - X => the entire dataset


- Intuitive meaning:
    - modified Z = 0 => value equals the median
    - positive modified Z => value is above the median.
    - negative modified Z => value is below the median.

    - outlier condition:
        - abs(modified Z) > 3.5


- Why is this powerful?
    - The regular Z-score uses the mean and standard deviation, which are affected by extreme values.

    - The modified Z-score uses the median and MAD, which stay stable even if there are extreme outliers.
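The formula can be wrapped in a small reusable function. The name `modified_z_scores` is made up for this sketch, and raising an error when MAD is zero is just one possible way to handle that edge case (it occurs when at least half of the values equal the median).

```python
import numpy as np


def modified_z_scores(values):
    """Modified Z-scores: 0.6745 * (x - median) / MAD."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    if mad == 0:
        # The formula would divide by zero; raising is one possible choice.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return 0.6745 * (arr - median) / mad


# Copy of the Income values used in this notebook.
income = [22, 25, 27, 29, 35, 40, 42, 100, 110, 115]
scores = modified_z_scores(income)

# Apply the abs(modified Z) > 3.5 condition.
flagged = [x for x, s in zip(income, scores) if abs(s) > 3.5]
print(flagged)  # [100, 110, 115]
```

On this data the median is 37.5 and the MAD is 11.5, so the three large incomes stand out clearly even though the regular Z-score missed them.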

In [None]:
# Detect outliers using the modified Z-score

df1

Unnamed: 0,Income,Z-score
0,22,-0.858756
1,25,-0.779486
2,27,-0.72664
3,29,-0.673793
4,35,-0.515254
5,40,-0.383137
6,42,-0.330291
7,100,1.202258
8,110,1.466491
9,115,1.598607


In [None]:
# apply Modified z-score
# formula: modifiedZ = 0.6745 x (xi - median)/MAD
# MAD = median(abs(xi - median(X)))

# entire dataset median(X)
median = df1['Income'].median()

# MAD
mad = np.median(np.abs(df1['Income'] - median))

# modified z-score

df1['mod_z'] = 0.6745 * (df1['Income'] - median)/mad

# Outliers detection

outliers_mz = df1[np.abs(df1['mod_z']) > 3.5]

df1

Unnamed: 0,Income,Z-score,mod_z
0,22,-0.858756,-0.909109
1,25,-0.779486,-0.733152
2,27,-0.72664,-0.615848
3,29,-0.673793,-0.498543
4,35,-0.515254,-0.14663
5,40,-0.383137,0.14663
6,42,-0.330291,0.263935
7,100,1.202258,3.665761
8,110,1.466491,4.252283
9,115,1.598607,4.545543


In [33]:
print("Outliers: ", outliers_mz)

Outliers:     Income   Z-score     mod_z
7     100  1.202258  3.665761
8     110  1.466491  4.252283
9     115  1.598607  4.545543


### Conclusion:
- Percentiles and quartiles describe where values sit within the data and how the data is spread.

- IQR shows the range of the central 50% of the data and helps detect outliers.

- Z-score standardizes data and helps identify extreme values.

- These tools are fundamental for data preprocessing, feature scaling, and robust ML model performance.