## **Exercise: Measures of Central Tendency and Dispersion**

**Objective:** Apply measures of central tendency **(mean, median, and mode)** and dispersion **(range, variance, standard deviation, and coefficient of variation)** to analyze two datasets: ages and weights.

### **📊 Dataset 1: Ages Analysis** 

#### **🔹 Task 1: Collect the Data**
We will analyze a dataset containing the ages of a group of people:

```
    13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
     21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25, 
     29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
     32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
     36, 36, 36, 40
```

In [150]:
import numpy as np
import matplotlib.pyplot as plt

In [151]:
ages = np.array(
    [13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
     21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25, 
     29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
     32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
     36, 36, 36, 40
    ]
)

#### **🔹 Task 2: Measures of Central Tendency**  
 **Calculate the mean, median, and mode of the ages.**  
   - Which of these measures best represents the data?  
   - What does the mode indicate about the most common age?  

In [152]:
def mean(arr: np.ndarray) -> np.float64:
    # Arithmetic mean, rounded to 3 decimal places
    return np.round(arr.sum() / arr.size, 3)

In [153]:
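def median(arr: np.ndarray) -> np.float64:
    # Sort a copy so the caller's array is not modified in place
    ordered = np.sort(arr)
    length = ordered.size
    mid = length // 2
    if length % 2 != 0:
        # Odd length: the middle element
        return ordered[mid]
    # Even length: average of the two middle elements
    return np.mean([ordered[mid - 1], ordered[mid]])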

In [171]:
def mode(arr: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Return every value tied for the highest count, along with those counts
    values, counts = np.unique(arr, return_counts=True)
    is_mode = counts == np.max(counts)
    return values[is_mode], counts[is_mode]

In [155]:
ages_mean = mean(ages)
ages_median = median(ages)
age_mod_values,_ = mode(ages)
print(f'Mean: {ages_mean}\nMedian: {ages_median}\nModes: {age_mod_values}')

Mean: 25.517
Median: 26.0
Modes: [21 24 26 27 30 36]


- Which of these measures best represents the data?
  - **ans**: The `mean` (25.52) and the `median` (26.0) are nearly equal, so either one describes the center of the data well. The mode is less useful here because six values tie for it.
- What does the mode indicate about the most common age?
  - **ans**: There are 6 modes (21, 24, 26, 27, 30, and 36), so the distribution is `multimodal` and no single age is the most common.
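As a cross-check, the hand-written helpers can be compared against Python's standard `statistics` module. This is only a minimal sketch, assuming Python 3.8 or later (for `statistics.multimode`); the `check_*` names are illustrative and not part of the original notebook.

```python
import statistics

ages = [13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
        21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25,
        29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
        32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
        36, 36, 36, 40]

# Mean and median from the standard library
check_mean = round(statistics.mean(ages), 3)
check_median = statistics.median(ages)

# multimode returns every value tied for the highest count,
# in order of first appearance, so sort it for a stable display
check_modes = sorted(statistics.multimode(ages))

print(f"Mean: {check_mean}\nMedian: {check_median}\nModes: {check_modes}")
```

If the three values agree with the output of `mean`, `median`, and `mode`, the custom implementations behave as expected on this dataset.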

#### **🔹 Task 3: Frequency Table**  
**Create a frequency table by grouping the data into intervals.**  
   - Which interval has the highest frequency?  

In [172]:
import numpy as np

class FrequencyTable:
    def __init__(self, data: np.ndarray):
        self.data = data
        self.min_value = np.min(data)
        self.max_value = np.max(data)
        self.range = self.max_value - self.min_value
        # Target number of classes from Sturges' rule
        self.k = self.sturges_rule(len(data))
        # Class width (amplitude), rounded up to a whole number
        self.amplitude = np.ceil(self.range / self.k)
        self.intervals = self.get_intervals()
        self.abs_frequency, self.edges = self.calculate_absolute_frequency()
        self.midpoints = self.calculate_midpoints()
        self.xf = self.midpoints * self.abs_frequency
        self.rel_frequency = self.abs_frequency / len(data)
        self.cum_frequency = np.cumsum(self.abs_frequency)
        self.mean = np.mean(data)
        self.median = np.median(data)
        # mode() is the helper defined earlier; it returns (values, counts)
        self.mode = mode(data)
        self.variance = np.var(data)  # population variance (ddof=0)
        self.std_dev = np.sqrt(self.variance)

    def sturges_rule(self, total_data: int) -> int:
        # Sturges' rule: k = 1 + log2(n); a non-integer k is rounded
        # to the adjacent odd integer
        k = 1 + np.log2(total_data)
        return int(np.ceil(k) if int(k) % 2 == 0 else np.floor(k))

    def get_intervals(self) -> np.ndarray:
        # Class edges from the minimum in steps of the amplitude,
        # closed off with the maximum value
        intervals = np.arange(self.min_value, self.max_value, self.amplitude)
        return np.append(intervals, self.max_value)

    def calculate_absolute_frequency(self):
        # np.histogram bins are half-open [a, b), except the last bin,
        # which also includes its right edge
        return np.histogram(self.data, bins=self.intervals)

    def calculate_midpoints(self) -> np.ndarray:
        return np.array([(self.edges[i] + self.edges[i + 1]) / 2 for i in range(len(self.edges) - 1)])

    def print_table(self):
        print(f"{'Interval':<25}{'Midpoint':<10}{'f':<10}{'xf':<10}{'fr':<10}{'F':<10}")
        print("=" * 85)
        for i in range(len(self.edges) - 1):
            print(f"[{self.edges[i]:.2f}, {self.edges[i+1]:.2f})".ljust(25) +
                  f"{self.midpoints[i]:<10.2f}{self.abs_frequency[i]:<10}" +
                  f"{self.xf[i]:<10.2f}{self.rel_frequency[i]:<10.3f}{self.cum_frequency[i]:<10}")

    def summary(self):
        print(f"\n **Summary Statistics**")
        print(f"Mean: {self.mean:.2f}, Median: {self.median:.2f}, Mode: {self.mode}")
        print(f"Variance: {self.variance:.2f}, Std Dev: {self.std_dev:.2f}")
        # self.k is the Sturges target; the table has len(self.edges) - 1 rows,
        # which can differ once the amplitude has been rounded up
        print(f"Total Intervals: {int(self.k)}, Amplitude: {self.amplitude:.2f}")

In [173]:
age_table = FrequencyTable(ages)
age_table.print_table()
age_table.summary()

Interval                 Midpoint  f         xf        fr        F         
[10.00, 15.00)           12.50     5         62.50     0.083     5         
[15.00, 20.00)           17.50     9         157.50    0.150     14        
[20.00, 25.00)           22.50     12        270.00    0.200     26        
[25.00, 30.00)           27.50     15        412.50    0.250     41        
[30.00, 35.00)           32.50     11        357.50    0.183     52        
[35.00, 40.00)           37.50     8         300.00    0.133     60        

 **Summary Statistics**
Mean: 25.52, Median: 26.00, Mode: (array([21, 24, 26, 27, 30, 36]), array([4, 4, 4, 4, 4, 4]))
Variance: 54.88, Std Dev: 7.41
Total Intervals: 7, Amplitude: 5.00


- Which interval has the highest frequency?
  - **ans**: The 4th interval, `[25.00, 30.00)`, has the highest frequency (15 ages, 25% of the data). The `median` (26) also falls in this interval, so about 50 percent of the ages are less than or equal to `26`.
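The frequency column can also be reproduced directly with `np.histogram`. The sketch below is illustrative only: it assumes fixed class edges of width 5 from 10 to 40 (the edges shown in the table output), and the names `edges`, `freq`, and `grouped_mean` are introduced here for the example.

```python
import numpy as np

ages = np.array([13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
                 21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25,
                 29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
                 32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
                 36, 36, 36, 40])

# Sturges' rule suggests roughly 1 + log2(n) classes
k = 1 + np.log2(ages.size)

# Explicit class edges: 10, 15, ..., 40 (assumed to match the table)
edges = np.arange(10, 45, 5)
freq, _ = np.histogram(ages, bins=edges)

# Class midpoints give an estimate of the mean from grouped data
midpoints = (edges[:-1] + edges[1:]) / 2
grouped_mean = (midpoints * freq).sum() / freq.sum()

# Index of the class with the highest absolute frequency
modal_class = int(np.argmax(freq))

print(f"Sturges k: {k:.2f}")
print(f"Frequencies: {freq}")
print(f"Modal class: [{edges[modal_class]}, {edges[modal_class + 1]})")
print(f"Grouped mean: {grouped_mean:.2f}")
```

The grouped mean is computed from midpoints rather than raw values, so it only approximates the exact mean of 25.52.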

#### **🔹 Task 4: Measures of Dispersion**  
**Calculate the range, variance, and standard deviation.**  
   - Are the data widely spread out or concentrated?  
   - If the variance is high, what does it mean in terms of age? 

In [175]:
data_range = np.ptp(ages)  # named data_range to avoid shadowing the built-in range()
variance = np.var(ages)    # population variance (ddof=0)
standard_deviation = np.std(ages)
cv = (standard_deviation / ages_mean) * 100
print('#### DISPERSION MEASUREMENTS #####')
print(f'Range: {data_range}\nVariance: {variance}\nStandard Deviation: {standard_deviation}\nCoefficient of variation: {cv}')

#### DISPERSION MEASUREMENTS #####
Range: 30
Variance: 54.88305555555556
Standard Deviation: 7.40830989872559
Coefficient of variation: 29.032840454307284


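- Are the data widely spread out or concentrated?
  - **ans**: The **CV** is about *29%*, which indicates moderate variability: the ages are neither tightly concentrated nor widely scattered around the mean.
- If the variance is high, what does it mean in terms of age? 
  - **ans**: A high variance would mean that the ages differ greatly from the mean. Here the variance is moderate (54.88, a standard deviation of about 7.41 years), so there is some diversity in ages, but they are not extremely different from the mean.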
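Note that `np.var` and `np.std` divide by `n` by default, which treats the 60 ages as the whole population. If the ages were instead a sample from a larger group, the usual choice would be `ddof=1` (dividing by `n - 1`). The following sketch shows both; which one applies is an assumption about how the data were collected, and the names `pop_var` and `sample_var` are illustrative.

```python
import numpy as np

ages = np.array([13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
                 21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25,
                 29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
                 32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
                 36, 36, 36, 40])

# Population variance: divides by n (ddof=0 is NumPy's default)
pop_var = np.var(ages)

# Sample variance: divides by n - 1 (Bessel's correction)
sample_var = np.var(ages, ddof=1)

print(f"Population variance: {pop_var:.4f}")
print(f"Sample variance:     {sample_var:.4f}")
print(f"Population std dev:  {np.sqrt(pop_var):.4f}")
print(f"Sample std dev:      {np.sqrt(sample_var):.4f}")
```

With 60 observations the two versions differ by a factor of 60/59, so the conclusion of moderate variability holds either way.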

---

### **📊 Dataset 2: Weights Analysis**  
📌 **Description:** We will analyze a dataset containing the weights (kg) of a group of people.

```
    66.53, 50.98, 63.42, 98.16, 95.43, 43.31, 36.32, 80.75, 94.39,
    39.77, 78.47, 51.37, 72.91, 56.4, 75.45, 46.06, 59.53, 61.14,
    80.28, 93.24, 75.14, 57.01, 38.58, 68.55, 64.42, 48.66, 53.27,
    39.38, 56.1, 38.45, 65.69, 99.24, 83.3, 62.37, 85.76, 56.31,
    72.34, 81.63, 47.07, 73.54, 42.61, 82.61, 36.79, 36.25, 87.61,
    59.82, 49.96, 67.66, 62.18, 95.53
```

#### **🔹 Task 1: Descriptive Statistics**  
**Calculate the mean, median, and mode of the weights.**  
   - Which of these measures best represents the data?  
   - Does the distribution appear symmetrical or skewed?  

In [183]:
weights = np.array([66.53, 50.98, 63.42, 98.16, 95.43, 43.31, 36.32, 80.75, 94.39,
       39.77, 78.47, 51.37, 72.91, 56.4 , 75.45, 46.06, 59.53, 61.14,
       80.28, 93.24, 75.14, 57.01, 38.58, 68.55, 64.42, 48.66, 53.27,
       39.38, 56.1 , 38.45, 65.69, 99.24, 83.3 , 62.37, 85.76, 56.31,
       72.34, 81.63, 47.07, 73.54, 42.61, 82.61, 36.79, 36.25, 87.61,
       59.82, 49.96, 67.66, 62.18, 95.53])

In [179]:
weights_mean = mean(weights)
weights_median = median(weights)
weights_mod_values,_ = mode(weights)
print(f'Mean: {weights_mean}\nMedian: {weights_median}\nModes: {weights_mod_values}')

Mean: 64.635
Median: 62.894999999999996
Modes: [36.25 36.32 36.79 38.45 38.58 39.38 39.77 42.61 43.31 46.06 47.07 48.66
 49.96 50.98 51.37 53.27 56.1  56.31 56.4  57.01 59.53 59.82 61.14 62.18
 62.37 63.42 64.42 65.69 66.53 67.66 68.55 72.34 72.91 73.54 75.14 75.45
 78.47 80.28 80.75 81.63 82.61 83.3  85.76 87.61 93.24 94.39 95.43 95.53
 98.16 99.24]


In [185]:
# Grouped-data estimates (computed from the class midpoints x, not from the raw values)
# Mean: 65.2294
# Variance: 317.61417864000003
# INTERVAL                       x                  f             xf            (x-X̄)^2      (x-X̄)^2 * f       fr             F         
# ----------------------------------------------------------------------------------------------------------------------------------
# [36.25, 45.25)                 40.75              9             366.75        599.24        5393.17         0.180           9         
# [45.25, 54.25)                 49.75              7             348.25        239.61        1677.28         0.140           16        
# [54.25, 63.25)                 58.75              9             528.75        41.98         377.84          0.180           25        
# [63.25, 72.25)                 67.75              6             406.50        6.35          38.12           0.120           31        
# [72.25, 81.25)                 76.75              8             614.00        132.72        1061.79         0.160           39        
# [81.25, 90.25)                 85.75              5             428.75        421.10        2105.48         0.100           44        
# [90.25, 99.24)                 94.75              6             568.47        871.17        5227.02         0.120           50        

- Which of these measures best represents the data?
  - **ans**: The **median 62.89** is likely the best representation because it is less affected by extreme values (outliers). The mean **64.64** is slightly higher, which suggests that some higher values might be pulling it up. The mode is not informative here: every weight occurs exactly once, so all 50 values are returned as modes.
- Does the distribution appear symmetrical or skewed?  
  - **ans**: Since the **mean > median**, the distribution appears to be slightly **positively skewed** (right-skewed).
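For continuous data such as weights, a modal class is usually more informative than a mode of raw values. The sketch below groups the weights into classes of width 9 starting at the minimum, which are the class edges used in the grouped table in this section; the variable names are illustrative.

```python
import numpy as np

weights = np.array([66.53, 50.98, 63.42, 98.16, 95.43, 43.31, 36.32, 80.75, 94.39,
                    39.77, 78.47, 51.37, 72.91, 56.4, 75.45, 46.06, 59.53, 61.14,
                    80.28, 93.24, 75.14, 57.01, 38.58, 68.55, 64.42, 48.66, 53.27,
                    39.38, 56.1, 38.45, 65.69, 99.24, 83.3, 62.37, 85.76, 56.31,
                    72.34, 81.63, 47.07, 73.54, 42.61, 82.61, 36.79, 36.25, 87.61,
                    59.82, 49.96, 67.66, 62.18, 95.53])

# Class edges of width 9 from the minimum; the last edge is the maximum
edges = np.append(np.arange(weights.min(), weights.max(), 9.0), weights.max())
freq, _ = np.histogram(weights, bins=edges)

# More than one class can tie for the highest frequency
modal_idx = np.flatnonzero(freq == freq.max())

print(f"Frequencies: {freq}")
for i in modal_idx:
    print(f"Modal class: [{edges[i]:.2f}, {edges[i + 1]:.2f}) with f = {freq[i]}")
```

Two classes tie for the highest frequency, which is consistent with a fairly flat distribution that has no dominant peak.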

In [186]:
from scipy.stats import skew
skewness = skew(weights)

print(f"Skewness: {skewness}")

Skewness: 0.18842404093416057
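With its default arguments, `scipy.stats.skew` returns the biased Fisher-Pearson coefficient, that is, the third central moment divided by the second central moment raised to the power 3/2. A NumPy-only sketch of that formula follows; the names `m2`, `m3`, and `g1` are introduced here for illustration.

```python
import numpy as np

weights = np.array([66.53, 50.98, 63.42, 98.16, 95.43, 43.31, 36.32, 80.75, 94.39,
                    39.77, 78.47, 51.37, 72.91, 56.4, 75.45, 46.06, 59.53, 61.14,
                    80.28, 93.24, 75.14, 57.01, 38.58, 68.55, 64.42, 48.66, 53.27,
                    39.38, 56.1, 38.45, 65.69, 99.24, 83.3, 62.37, 85.76, 56.31,
                    72.34, 81.63, 47.07, 73.54, 42.61, 82.61, 36.79, 36.25, 87.61,
                    59.82, 49.96, 67.66, 62.18, 95.53])

# Deviations from the mean
deviations = weights - weights.mean()

# Second and third central moments (population versions, dividing by n)
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)

# Biased Fisher-Pearson coefficient of skewness
g1 = m3 / m2 ** 1.5

print(f"Skewness (manual): {g1}")
```

A small positive value like this one points to a mild right skew, in line with the mean being slightly above the median.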


#### **🔹 Task 2: Variability Analysis**  
2️⃣ **Calculate the coefficient of variation (CV).**  
   - Does the CV indicate high variability in the weights?  

📌 **Formula:**  
$$
CV = \left( \frac{\text{Standard Deviation}}{\text{Mean}} \right) \times 100
$$

In [180]:
data_range = np.ptp(weights)  # named data_range to avoid shadowing the built-in range()
variance = np.var(weights)    # population variance (ddof=0)
standard_deviation = np.std(weights)
cv = (standard_deviation / weights_mean) * 100
print('#### DISPERSION MEASUREMENTS #####')
print(f'Range: {data_range}\nVariance: {variance}\nStandard Deviation: {standard_deviation}\nCoefficient of variation: {cv}')

#### DISPERSION MEASUREMENTS #####
Range: 62.989999999999995
Variance: 338.46007295999993
Standard Deviation: 18.397284390909434
Coefficient of variation: 28.463347088898328


- Does the CV indicate high variability in the weights? 
  - **ans**: The CV is about *28.5%*, which is below *30%* and indicates moderate variability.

#### **🎯 Final Thoughts: Analysis and Conclusions**  
📌 After completing both analyses, answer:  
- Which dataset shows greater dispersion and variability?  
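  - **ans**: In absolute terms, the weight dataset has a much **higher variance (338.46 vs. 54.88)** and **higher standard deviation (18.40 vs. 7.41)**. These figures are in different units (kg vs. years), so they are not directly comparable. In relative terms, the two datasets are almost equally variable: the CV is about **29.0% for ages** and **28.5% for weights**.
- What differences exist between the age and weight distributions?
  - **ans**: The **age distribution is more clustered**: a quarter of the ages fall in `[25, 30)` and several ages repeat (six modes). The weights are more **evenly spread** across a wider range, and no weight value repeats.
  - **Weights have a larger range (62.99 vs. 30 for ages)**, although the units differ.
  - The ages are roughly symmetric (mean 25.52, median 26.0), while the weights are slightly right-skewed (mean 64.64, median 62.89).
- Which statistical measures were the most useful in each case? 
  - **ans**: **Variance and standard deviation** were useful in determining the absolute spread.
  - The **coefficient of variation (CV)** was important to compare relative dispersion, since ages and weights have different units.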
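The relative comparison can be sketched with a small helper that applies the CV formula to both datasets. `coefficient_of_variation` is a hypothetical function written for this example, and it uses the unrounded mean, so the digits differ slightly from the cells that divide by the rounded `ages_mean` and `weights_mean`.

```python
import numpy as np

ages = np.array([13, 12, 14, 10, 12, 16, 16, 16, 15, 19, 18, 17, 18, 18,
                 21, 24, 21, 23, 23, 20, 21, 23, 24, 24, 24, 21, 27, 25,
                 29, 26, 28, 29, 25, 26, 28, 27, 27, 26, 27, 26, 28, 30,
                 32, 34, 30, 30, 33, 32, 31, 31, 32, 30, 36, 38, 38, 39,
                 36, 36, 36, 40])

weights = np.array([66.53, 50.98, 63.42, 98.16, 95.43, 43.31, 36.32, 80.75, 94.39,
                    39.77, 78.47, 51.37, 72.91, 56.4, 75.45, 46.06, 59.53, 61.14,
                    80.28, 93.24, 75.14, 57.01, 38.58, 68.55, 64.42, 48.66, 53.27,
                    39.38, 56.1, 38.45, 65.69, 99.24, 83.3, 62.37, 85.76, 56.31,
                    72.34, 81.63, 47.07, 73.54, 42.61, 82.61, 36.79, 36.25, 87.61,
                    59.82, 49.96, 67.66, 62.18, 95.53])


def coefficient_of_variation(data: np.ndarray) -> float:
    """Population standard deviation as a percentage of the mean."""
    return float(np.std(data) / np.mean(data) * 100)


cv_ages = coefficient_of_variation(ages)
cv_weights = coefficient_of_variation(weights)

print(f"CV ages:    {cv_ages:.2f}%")
print(f"CV weights: {cv_weights:.2f}%")
```

Because the CV is unitless, it allows years and kilograms to be compared on the same scale, which the variance and standard deviation cannot do.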