# Calculating the Measures of Central Tendency and Dispersion


In [1]:
%store -r boxcox_df
%store -r quantitative_features


**Measures of Central Tendency**
Measures of central tendency describe the central or typical value of a dataset, that is, the value around which the data points tend to cluster. The commonly used measures of central tendency are:

1. **Mean**: The mean is the sum of all values divided by the total number of values in the dataset. It represents the arithmetic average of the data.
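2. **Median**: The median is the middle value of the sorted dataset (or the average of the two middle values when the number of observations is even). It divides the data into two halves, with about 50% of the values at or below it and about 50% at or above it.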
3. **Mode**: The mode represents the value(s) that occur most frequently in the dataset. A dataset can have one or more modes, or it may have no mode if no value is repeated.
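
The three definitions above can be checked on a tiny example. This is a minimal sketch using pandas; the `sample` values are made up for illustration and are not taken from the notebook's dataset.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
sample = pd.Series([2, 3, 3, 5, 7, 10])

mean_value = sample.mean()      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median_value = sample.median()  # even count: average of the two middle values, 3 and 5
modes = sample.mode()           # Series holding every most-frequent value

print(mean_value)      # 5.0
print(median_value)    # 4.0
print(modes.tolist())  # [3]
```

Note that `Series.mode()` returns a Series, because a dataset can have several modes. When no value is repeated, pandas returns every value as a mode instead of reporting "no mode", which is worth keeping in mind for continuous features.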

**Measures of Dispersion (Variability)**
Measures of dispersion describe how widely the data points are spread around the central value. The commonly used measures of dispersion are:

1. **Range**: The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of the total spread of the data.
2. **Variance**: The variance is the average squared deviation of the data points from the mean. Note that pandas computes the sample variance by default, dividing by n - 1 rather than n (`ddof=1`).
3. **Standard Deviation**: The standard deviation is the square root of the variance. It can be read as a typical distance between a data point and the mean (strictly, the root-mean-square deviation, not the average absolute distance). It is widely used because it is expressed in the same units as the original data.
4. **Interquartile Range (IQR)**: The IQR is the range between the 25th and 75th percentiles of the dataset. It measures the spread of the middle 50% of the data and is less affected by extreme values than the range.
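
The four measures above can be illustrated on a small sample. This is a minimal sketch using pandas; the `sample` values are hypothetical. It also shows the difference between the sample variance (pandas' default, `ddof=1`) and the population variance (`ddof=0`).

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration (mean is 5)
sample = pd.Series([2, 4, 4, 6, 9])

value_range = sample.max() - sample.min()            # 9 - 2
sample_variance = sample.var()                       # squared deviations sum to 28, divided by n - 1 = 4
population_variance = sample.var(ddof=0)             # same sum, divided by n = 5
std_dev = sample.std()                               # square root of the sample variance
iqr = sample.quantile(0.75) - sample.quantile(0.25)  # 6 - 4 with default linear interpolation

print(value_range)          # 7
print(sample_variance)      # 7.0
print(population_variance)  # 5.6
print(iqr)                  # 2.0
```

The single extreme value 9 accounts for more than half of the squared deviations, while the IQR ignores it entirely, which is why the IQR is described as less sensitive to extreme values.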


In [2]:
# Summary statistics for the Box-Cox transformed data
disc = boxcox_df.describe()
disc


statistic,Age,Diastolic BP,Poverty index,Race,Red blood cells,Sedimentation rate,Serum Albumin,Serum Cholesterol,Serum Iron,Serum Magnesium,Serum Protein,Sex,Systolic BP,TIBC,TS,White blood cells,BMI,Pulse pressure,death
count,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0
mean,8.952745,21.778062,19.744257,1.171062,6.464346,3.987047,3.750688,22.316061,26.406975,0.663062,2.015137,1.597883,2.220451,5.351574,13.324657,2.763455,2.72918,5.711914,0.347325
std,1.534053,1.983321,5.695996,0.407858,0.258175,1.839072,0.347005,2.240542,5.685689,0.120478,0.065468,0.490371,0.024423,0.115964,3.536027,0.4718,0.11954,0.649528,0.476164
min,6.45589,15.825606,0.800095,1.0,5.523133,0.0,2.829415,14.737637,8.178013,0.325251,1.842137,1.0,2.138099,4.953312,2.558651,1.069052,2.321031,3.188769,0.0
25%,7.609171,20.476726,15.778364,1.0,6.288502,2.797466,3.524685,20.769396,22.528184,0.585446,1.974659,1.0,2.204022,5.27264,10.903828,2.439336,2.645313,5.274387,0.0
50%,8.886755,21.539725,19.955079,1.0,6.458033,3.977324,3.758926,22.252486,26.152807,0.661689,2.020488,2.0,2.219988,5.348325,13.286609,2.751028,2.728999,5.667157,0.0
75%,10.546131,23.243277,23.876263,1.0,6.635405,5.322908,3.994325,23.821001,30.195462,0.747042,2.064477,2.0,2.23619,5.430192,15.73693,3.103883,2.811677,6.163557,1.0
max,11.228807,27.049004,32.808171,3.0,7.350761,7.93994,4.70703,27.940557,41.267149,1.000659,2.186769,2.0,2.278387,5.631659,22.123776,3.934029,3.01294,7.220139,1.0
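
The range and the IQR can be read directly off a `describe()` table: the range is `max - min` and the IQR is `75% - 25%`. For `Age` in the table above, that gives 11.228807 - 6.455890 = 4.772917 for the range and 10.546131 - 7.609171 ≈ 2.93696 for the IQR. A minimal sketch of the same arithmetic, using a hypothetical one-column `demo_df` rather than `boxcox_df`:

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself uses boxcox_df
demo_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

desc = demo_df.describe()

# Each row of describe() is indexed by its statistic name
range_from_desc = desc.loc["max"] - desc.loc["min"]
iqr_from_desc = desc.loc["75%"] - desc.loc["25%"]

print(range_from_desc["x"])  # 4.0
print(iqr_from_desc["x"])    # 2.0
```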


In [3]:
def calculate_mode(x):
    # Helper function: return the mode of a feature.
    # Series.mode() returns all modes in sorted order; keep the first (smallest).
    return x.mode().iat[0]


def calculate_range(y):
    # Helper function: return the range (max - min) of a feature
    return y.max() - y.min()


def calculate_IQR(y):
    # Helper function: return the IQR (75th percentile - 25th percentile) of a feature
    return y.quantile(0.75) - y.quantile(0.25)


In [4]:
# Let's calculate our measures of central tendency and dispersion
measures = []
for feature in quantitative_features:
    measures.append(boxcox_df[feature].agg(
        ["mean", "median", calculate_mode, "var", "std", calculate_range, calculate_IQR]))

for item in measures:
    item.rename({"calculate_mode": "mode", "var": "variance", "std": "standard deviation",
                "calculate_range": "range", "calculate_IQR": "IQR"}, inplace=True)
    print(item)
    print("--------------------------------")


mean                   8.952745
median                 8.886755
mode                  10.546131
variance               2.353318
standard deviation     1.534053
range                  4.772917
IQR                    2.936959
Name: Age, dtype: float64
--------------------------------
mean                  21.778062
median                21.539725
mode                  21.539725
variance               3.933561
standard deviation     1.983321
range                 11.223399
IQR                    2.766551
Name: Diastolic BP, dtype: float64
--------------------------------
mean                  19.744257
median                19.955079
mode                  21.510369
variance              32.444369
standard deviation     5.695996
range                 32.008077
IQR                    8.097899
Name: Poverty index, dtype: float64
--------------------------------
mean                  6.464346
median                6.458033
mode                  6.435906
variance              0.066654
standard
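
As an alternative to printing one Series per feature, the same measures can be collected into a single table with one row per feature. This is only a sketch under assumptions: `demo_df` is a hypothetical stand-in for `boxcox_df[quantitative_features]`, and the measures are computed column-wise instead of through the helper functions above.

```python
import pandas as pd

# Hypothetical stand-in for boxcox_df[quantitative_features]
demo_df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 2.0, 3.0, 7.0],
    "feature_b": [10.0, 10.0, 20.0, 30.0, 40.0],
})

# Every expression below returns a Series indexed by column name,
# so the resulting DataFrame has one row per feature.
summary = pd.DataFrame({
    "mean": demo_df.mean(),
    "median": demo_df.median(),
    "mode": demo_df.mode().iloc[0],  # first (smallest) mode of each column
    "variance": demo_df.var(),       # sample variance (ddof=1)
    "standard deviation": demo_df.std(),
    "range": demo_df.max() - demo_df.min(),
    "IQR": demo_df.quantile(0.75) - demo_df.quantile(0.25),
})

print(summary)
```

A single table makes it easier to compare features side by side, and it can be stored or exported in one step.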

In [5]:
%store disc

Stored 'disc' (DataFrame)
