### **Exercise: Monthly Income Analysis**

**Objective:** Apply descriptive statistics concepts to analyze a sample of monthly incomes in a community.

#### **Step 1: Collect the Data**
Imagine you have the following monthly incomes (in COP) from a sample of 100 people in a community:

```
       11654950.29,  9971081.28, 11273992.69,  2861346.03,  2582670.88,
       11888504.48,  8726121.58, 10837270.77, 11470281.3 ,  8557203.55,
        4635899.65, 11043709.38,  6349800.38,  4323161.56,  3048335.62,
        2412571.86,  1745902.92,  1988633.44, 10964236.12, 12014430.89,
        3183807.08,  9839627.91,  2429435.56,  6452005.39,  4036398.71,
        3717501.87,  3984322.57,  3337302.79,  6436620.64,  9513158.72,
       10283633.29,  6411711.02,  3302846.9 , 11912103.19,  8552094.72,
       12514545.83,  1936603.14,  8971357.23, 10041493.07,  2172989.33,
        5368011.25,  4771829.24,  7285612.35,  3051332.98,  7813967.81,
        6644630.33,  5758413.61,  8154393.07,  2447156.32, 11752212.91,
        8074621.85, 12531034.79,  4126223.82,  1720036.97,  8671911.18,
        6887645.18, 11228313.28,  6703808.44,  7575595.72,  6285031.13,
        3945769.93,  8065980.29,  5007651.63, 10361360.83,  9436583.01,
        2728355.7 , 11611662.56,  7842398.44,  6523517.23,  3761886.42,
        1910856.57,  6841450.36,  9624154.66, 12081254.77, 11474200.07,
        3816361.55,  2968327.21, 12279056.33, 12495371.15,  6325754.98,
        4447851.82,  3291363.82,  8961625.08,  7413539.25,  2234083.16,
        7869924.71,  5888912.51,  8455810.07,  6035226.29,  7625690.51,
        9550061.14,  8954896.71,  7011638.4 ,  7257551.94,  4147618.45,
        8814156.85, 12234370.52,  3787806.12,  8139101.68,  2845207.64
```

In [1]:
import math
import numpy as np # type: ignore

In [2]:
# Load the data into a NumPy array
monthly_incomes = np.array([11654950.29,  9971081.28, 11273992.69,  2861346.03,  2582670.88,
 11888504.48,  8726121.58, 10837270.77, 11470281.3 ,  8557203.55,
  4635899.65, 11043709.38,  6349800.38,  4323161.56,  3048335.62,
  2412571.86,  1745902.92,  1988633.44, 10964236.12, 12014430.89,
  3183807.08,  9839627.91,  2429435.56,  6452005.39,  4036398.71,
  3717501.87,  3984322.57,  3337302.79,  6436620.64,  9513158.72,
 10283633.29,  6411711.02,  3302846.9 , 11912103.19,  8552094.72,
 12514545.83,  1936603.14,  8971357.23, 10041493.07,  2172989.33,
  5368011.25,  4771829.24,  7285612.35,  3051332.98,  7813967.81,
  6644630.33,  5758413.61,  8154393.07,  2447156.32, 11752212.91,
  8074621.85, 12531034.79,  4126223.82,  1720036.97,  8671911.18,
  6887645.18, 11228313.28,  6703808.44,  7575595.72,  6285031.13,
  3945769.93,  8065980.29,  5007651.63, 10361360.83,  9436583.01,
  2728355.7 , 11611662.56,  7842398.44,  6523517.23,  3761886.42,
  1910856.57,  6841450.36,  9624154.66, 12081254.77, 11474200.07,
  3816361.55,  2968327.21, 12279056.33, 12495371.15,  6325754.98,
  4447851.82,  3291363.82,  8961625.08,  7413539.25,  2234083.16,
  7869924.71,  5888912.51,  8455810.07,  6035226.29,  7625690.51,
  9550061.14,  8954896.71,  7011638.4 ,  7257551.94,  4147618.45,
  8814156.85, 12234370.52,  3787806.12,  8139101.68,  2845207.64])

In [3]:
monthly_incomes.size

100

In [4]:
print(monthly_incomes.size.__class__)

<class 'int'>


#### **Step 2: Determine the Range of the Data**
Use the **range**, the difference between the maximum and minimum values, to understand the spread of incomes in the sample.

In [5]:
max_value = monthly_incomes.max()
min_value = monthly_incomes.min()
R = max_value - min_value
print(f'Max value: {max_value}\nMin value: {min_value}\nRange: {R}')

Max value: 12531034.79
Min value: 1720036.97
Range: 10810997.819999998
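
As a cross-check, NumPy also provides `np.ptp` ("peak to peak"), which returns the maximum minus the minimum in a single call. The sketch below uses a small hypothetical array (`sample`), not the exercise data:

```python
import numpy as np

# Hypothetical incomes, used only for illustration
sample = np.array([2500000.0, 4100000.0, 9800000.0, 1200000.0])

# np.ptp(x) is equivalent to x.max() - x.min()
sample_range = np.ptp(sample)
print(sample_range)
```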


#### **Step 3: Create the Intervals**
Using the **range** and the **number of classes (intervals)**, divide the data into intervals. Use **Sturges' rule** to determine the number of classes:
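
For reference, this is the form of Sturges' rule used in this exercise, with $n$ the sample size and $k$ the number of classes, worked for $n = 100$:

$$
k = 1 + 3.322 \log_{10}(n) = 1 + 3.322 \log_{10}(100) = 1 + 3.322 \times 2 = 7.644
$$

The result is not an integer, so it has to be rounded; here it is rounded down to 7 classes.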


In [6]:
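def sturgers_distribution(total_data: int) -> int:
    # Sturges' rule: k = 1 + 3.322 * log10(n)
    k = 1 + 3.322 * np.log10(total_data)
    # Prefer an odd number of classes: round up when the integer part
    # is even, round down when it is odd
    if int(k) % 2 == 0:
        k = math.ceil(k)
    else:
        k = math.floor(k)
    return k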

In [7]:
k = sturgers_distribution(monthly_incomes.size)
print(f'Number of classes: {k}')

Number of classes: 7


#### **Step 4: Calculate the Frequencies**
Compute the class width (amplitude) by dividing the range by the number of classes, use it to build the interval edges, and then count how many data points fall within each interval.

In [8]:
# amplitude (class width) = range / number of classes
a = R / k
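
With the values obtained in the previous steps, $R = 10810997.82$ and $k = 7$, the amplitude works out to:

$$
a = \frac{R}{k} = \frac{10810997.82}{7} \approx 1544428.26
$$

so the upper edge of the first interval is $1720036.97 + 1544428.26 = 3264465.23$.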

In [9]:
def get_intervals_manually(a, k, min_value, max_value) -> np.ndarray:
    # Build the k lower edges: start at min_value and step by the amplitude a
    bins = 0
    intervals = []
    for i in range(k):
        if i == 0:
            bins += min_value
        else:
            bins += a
        intervals.append(bins)
    # Close the last interval at max_value
    intervals.append(max_value)
    return np.array(intervals)

In [10]:
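def get_intervals_numpy(min_value, max_value, amplitude) -> np.ndarray:
    # Lower edges from min_value in steps of amplitude; np.arange stops before
    # max_value (with a float step, rounding can affect the number of edges)
    intervals = np.arange(min_value, max_value, amplitude)
    # Append max_value as the upper edge of the last interval
    return np.append(intervals, max_value)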

In [11]:
print(get_intervals_manually(a, k, min_value, max_value))
intervals = get_intervals_numpy(min_value, max_value, a)
print(intervals)

[ 1720036.97  3264465.23  4808893.49  6353321.75  7897750.01  9442178.27
 10986606.53 12531034.79]
[ 1720036.97  3264465.23  4808893.49  6353321.75  7897750.01  9442178.27
 10986606.53 12531034.79]
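
Because `np.arange` with a floating-point step can occasionally return one element more or fewer than expected, an alternative is to request the number of edges directly with `np.linspace`. The following is a minimal sketch, not a replacement for the functions above. It hard-codes the minimum, maximum, and class count found earlier in this exercise:

```python
import numpy as np

# Values taken from the earlier steps of this exercise
min_value = 1720036.97
max_value = 12531034.79
k = 7  # number of classes

# k classes need k + 1 edges; linspace includes both endpoints
edges = np.linspace(min_value, max_value, k + 1)
print(edges)
```

This form always yields exactly `k + 1` evenly spaced edges, with the first and last edges at the minimum and maximum.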


#### **Step 5: Build the Frequency Table**
Construct the frequency table, which should include:

- The intervals.
- The absolute frequency.
- The relative frequency.
- The cumulative frequency.

Example table (illustrative values for a sample of 20, not the exercise data):

| Interval  | Absolute Frequency | Relative Frequency | Cumulative Frequency |
|-----------|--------------------|--------------------|----------------------|
| 1500-2000 | 6                  | 0.30               | 6                    |
| 2001-2500 | 7                  | 0.35               | 13                   |
| 2501-3000 | 5                  | 0.25               | 18                   |
| 3001-3500 | 2                  | 0.10               | 20                   |
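
The relative and cumulative columns of the example table can be reproduced from the absolute frequencies alone. A minimal sketch, assuming the four illustrative counts shown in the table (the variable names `f`, `fr`, and `F` are chosen here to match the usual column symbols):

```python
import numpy as np

# Absolute frequencies from the example table (illustrative values)
f = np.array([6, 7, 5, 2])

fr = f / f.sum()   # relative frequency: share of the sample in each interval
F = np.cumsum(f)   # cumulative frequency: running total of the counts

print(fr)
print(F)
```

The relative frequencies add up to 1 and the last cumulative frequency equals the sample size, which makes a quick consistency check for any frequency table.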

In [32]:
def abs_frqcy(intervals: np.ndarray, data: np.ndarray):
    # Count how many data points fall between each pair of consecutive edges
    occurrences, edges = np.histogram(data, bins=intervals)
    return occurrences, edges

In [33]:
def mindpoint(edges) -> np.ndarray:
    # Class mark: the midpoint of each pair of consecutive edges
    return np.array([np.mean([edges[i], edges[i + 1]]) for i in range(edges.size - 1)])

In [34]:
def rel_frqcy(abs_frequency:np.array, total_data:int) -> np.array:
    return abs_frequency / total_data

In [35]:
def cum_frqcy(abs_frequency:np.array) -> np.array:
    return np.cumsum(abs_frequency)

In [78]:
def print_frqcy_table(edges, mindpoints, abs_frequency, rel_frequency, cum_frequency):
    # Header row with wide column spacing
    print("{:<38} {:<18} {:<10} {:<15} {:<10}".format("INTERVAL", "X", "f", "fr", "F"))
    print("-" * 90)  # Separator line

    for i in range(edges.size - 1):
        # Interval label; note that np.histogram closes the last bin,
        # so the final interval actually includes its upper edge
        bins = f"[{edges[i]:.2f}, {edges[i+1]:.2f})"
        print("{:<38} {:<18.2f} {:<10} {:<15.4f} {:<10}".format(
            bins, mindpoints[i], abs_frequency[i], rel_frequency[i], cum_frequency[i]
        ))

In [59]:
abs_frequency, edges = abs_frqcy(intervals, monthly_incomes)
mindpoints = mindpoint(edges)
rel_frequency = rel_frqcy(abs_frequency,monthly_incomes.size)
cum_frequnecy = cum_frqcy(abs_frequency)

In [79]:
print_frqcy_table(edges,mindpoints,abs_frequency,rel_frequency,cum_frequnecy)

INTERVAL                               X                  f          fr              F         
------------------------------------------------------------------------------------------
[1720036.97, 3264465.23)               2492251.10         18         0.1800          18        
[3264465.23, 4808893.49)               4036679.36         16         0.1600          34        
[4808893.49, 6353321.75)               5581107.62         8          0.0800          42        
[6353321.75, 7897750.01)               7125535.88         17         0.1700          59        
[7897750.01, 9442178.27)               8669964.14         14         0.1400          73        
[9442178.27, 10986606.53)              10214392.40        10         0.1000          83        
[10986606.53, 12531034.79)             11758820.66        17         0.1700          100       
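
One detail of the table above is worth checking: `np.histogram` treats every bin as half-open, `[a, b)`, except the last one, which is closed, `[a, b]`. That is why the maximum income is counted and the frequencies add up to 100. A minimal sketch with hypothetical data whose maximum sits exactly on the last edge:

```python
import numpy as np

# Hypothetical data: the largest value equals the last bin edge
data = np.array([1, 2, 3, 4])
edges = np.array([1, 2, 3, 4])

counts, _ = np.histogram(data, bins=edges)
print(counts)  # [1 1 2]: the value 4 is counted in the last bin, [3, 4]
```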


#### **Step 7: Distribution Analysis**
Analyze the data distribution:

- Where do the incomes tend to cluster?
- Is there any trend in the incomes?

#### Where do the incomes tend to cluster?

The frequency table shows that incomes cluster in three intervals: the first, **[1720036.97, 3264465.23)**, with 18 observations; the middle one, **[6353321.75, 7897750.01)**, with 17; and the last, **[10986606.53, 12531034.79)**, with 17. The second interval follows closely with 16.

#### Is there any trend in the incomes?

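The data suggest a polarized income distribution rather than a single central peak: the two lowest intervals hold 34% of the sample and the highest interval holds 17%, while the interval **[4808893.49, 6353321.75)** holds only 8%.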
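
One rough way to support this reading is to compare each class count with the count expected if incomes were spread evenly across the classes. This is only an illustrative sketch, not a formal test. The frequencies are copied from the frequency table above, and `expected` and `above` are names introduced here:

```python
import numpy as np

# Absolute frequencies copied from the frequency table above
f = np.array([18, 16, 8, 17, 14, 10, 17])

expected = f.sum() / f.size           # count per class if incomes were spread evenly
above = np.flatnonzero(f > expected)  # zero-based indices of the fuller classes

print(round(expected, 2))
print(above)
```

Classes whose index appears in `above` hold more observations than an even spread would give them.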
