<center>
    <h1 id='numerical-features-statistic-ii' style='color:#7159c1'>🧮 Numerical Features Statistic II 🧮</h1>
    <i>Exploring Numerical Features - Dispersion Metrics</i>
</center>

```
- Interquartile Range and Outliers
- Variance
- Standard Deviation
- Mean Absolute Deviation (MAD)
- Coefficient of Variation (CV)
- Skewness
- Kurtosis
```

In [1]:
import pandas as pd # pip install pandas
from functools import reduce # standard library, no installation needed

df = pd.read_csv('./datasets/students.csv')
df.head()

Unnamed: 0,name,main_breed,position,favorite_color,age
0,goku,sayan,low,purple,45
1,vegeta,sayan,high,purple,73
2,broly,sayan,high,purple,23
3,gohan,sayan,low,black,64
4,granolah,alien,low,purple,56


<p id='0-interquartile-range-and-outliers' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Interquartile Range and Outliers</p>

The `Interquartile Range (IQR)` tells how spread out the middle 50% of a numerical variable is: it is the difference between the third quartile (Q3) and the first quartile (Q1). Do not confuse it with the plain range, which is the difference between the maximum and the minimum values. For instance, a variable that goes from 17 to 45 has a range equal to 28; if, say, its first quartile is 24 and its third quartile is 38, its IQR is equal to 14.

With this metric in hand, we can identify outliers. There are two types of outliers: `down-bound and upper-bound`. The first are values far below the bulk of the data, whereas the second are values far above it. These are the equations for the two cutoffs:

$$
\text{down_bound} = \text{Q1} - (1.5 \cdot \text{IQR})
$$

$$
\text{upper_bound} = \text{Q3} + (1.5 \cdot \text{IQR})
$$

where:

- Down Bound: lower cutoff. Values below it are considered outliers;
- Upper Bound: upper cutoff. Values above it are considered outliers;
- Q1: first quartile;
- Q3: third quartile;
- IQR: interquartile range.

<br />

Let's see these metrics in practice using the `age` variable from our Dragon Ball Dataset.

In [2]:
# ---- Range and Interquartile Range ----
age_range = df.age.max() - df.age.min() # range: maximum - minimum
age_iqr = df.age.quantile(0.75) - df.age.quantile(0.25) # IQR: Q3 - Q1
print(f'- Age Range: {age_range}')

- Age Range: 55


In [3]:
# ---- Outliers: Down and Upper Bounds ----
down_bound = df.age.quantile(0.25) - (1.5 * age_iqr)
upper_bound = df.age.quantile(0.75) + (1.5 * age_iqr)

down_bound_outliers = df.loc[df.age < down_bound]
upper_bound_outliers = df.loc[df.age > upper_bound]

print(f'- Number of Down Bound Outliers: {down_bound_outliers.shape[0]}')
print(f'- Number of Upper Bound Outliers: {upper_bound_outliers.shape[0]}')

- Number of Down Bound Outliers: 0
- Number of Upper Bound Outliers: 0
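
Since the `age` variable has no outliers, the rule is easier to see on data that does. The block below is a minimal sketch on a hypothetical list of ages (made up for illustration, not taken from the dataset) that computes Q1, Q3, the IQR and both bounds, then filters the values that fall outside them.

```python
import pandas as pd

# Hypothetical ages, made up for illustration (not the notebook's dataset)
ages = pd.Series([20, 22, 23, 25, 26, 28, 30, 95])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the values

down_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [down_bound, upper_bound] is flagged as an outlier
outliers = ages[(ages < down_bound) | (ages > upper_bound)]

print(f'- Q1: {q1} | Q3: {q3} | IQR: {iqr}')
print(f'- Bounds: {down_bound} to {upper_bound}')
print(f'- Outliers: {outliers.tolist()}')
```

Note that the plain range of this list (95 - 20 = 75) is dominated by the single extreme value, while the IQR is not, which is why the IQR is the metric used in the bounds.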


<p id='1-variance' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Variance</p>

Suppose we want to guess the age of a new character added to the dataset. For this task, we could use the Arithmetic Mean or the Median, right? But guessing a single value is likely to miss. What if we guess an interval of values instead? Rather than saying the new character is 35 years old, saying that they are between 25 and 45 years old gives us a better chance of being right, don't you think?

This interval is built around the mean using a dispersion metric. As a first attempt, let's use the `Variance`:

$$
mean - variance < guess < mean + variance
$$

<br />

So, if the mean is equal to 35 and the variance is equal to 10, our interval looks like this:

$$
25 < guess < 45
$$

<br />

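To calculate the population variance, we apply the equation below. Keep in mind that pandas' `var()` defaults to the sample variance (`ddof=1`), which divides by $n - 1$ instead of $n$: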

$$
\sigma^2 = \frac{\sum_{i=1}^{n}{(x_i - \overline{x})^2}} {n}
$$

where:

$\sigma^2 \text{: variance}$

$x_i \text{: each element in the list}$

$\overline{x} \text{: list's arithmetic mean}$

$n \text{: number of elements in the list}$

In [4]:
# ---- Variance ----
age_variance = round(df.age.var(), 4)
print(f'- Age Variance: {age_variance} squared years-old')

- Age Variance: 361.4333 squared years-old
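
To connect the equation with the pandas call, the sketch below computes the variance step by step on a small hypothetical list (made up for illustration) and compares it with `Series.var()`. It assumes only that `ddof=0` divides by $n$ and the default `ddof=1` divides by $n - 1$, which is how pandas documents the parameter.

```python
import pandas as pd

# Hypothetical values, made up for illustration (not the notebook's dataset)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

n = len(values)
mean = values.sum() / n
squared_deviations = (values - mean) ** 2

population_variance = squared_deviations.sum() / n        # divides by n
sample_variance = squared_deviations.sum() / (n - 1)      # divides by n - 1

print(f'- Manual population variance: {population_variance}')
print(f'- pandas var(ddof=0): {values.var(ddof=0)}')
print(f'- Manual sample variance: {round(sample_variance, 4)}')
print(f'- pandas var() default: {round(values.var(), 4)}')
```

The two versions get closer as the number of elements grows, so the distinction matters mostly for small datasets.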


<p id='2-standard-deviation' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | Standard Deviation</p>

As you noticed, the variance has two problems:

1. the value is way too large for ages: it's impossible for a human to live more than 150 years;
2. the variance is expressed in squared units and, let's be honest, talking about ages in squared years is weird, isn't it?

To solve these two issues, we use the `Standard Deviation`. This metric takes the square root of the variance, which brings the result back to the variable's original unit and, consequently, gives us a more plausible range of values. Its equation is literally the variance's square root.

$$
\sigma = \sqrt{\sigma^2}
$$

that is:

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{n}{(x_i - \overline{x})^2}} {n}}
$$

In [5]:
# ---- Standard Deviation ----
age_std = round(df.age.std(), 4)
age_mean = round(df.age.mean(), 4)

print(f'- Age Standard Deviation: {age_std} years-old')
print(f'- The guess goes from {age_mean - age_std} to {age_mean + age_std} years-old')

- Age Standard Deviation: 19.0114 years-old
- The guess goes from 32.0886 to 70.1114 years-old
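
The sketch below, again on hypothetical values made up for illustration, shows the two ideas of this section together: the standard deviation is the square root of the variance, and the interval of one standard deviation around the mean covers a good share of the values. The population version (`ddof=0`) is used here to match the equation above.

```python
import math
import pandas as pd

# Hypothetical values, made up for illustration (not the notebook's dataset)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()
population_variance = ((values - mean) ** 2).mean()  # divides by n
population_std = math.sqrt(population_variance)      # back to the original unit

# Interval of one standard deviation around the mean
lower, upper = mean - population_std, mean + population_std
share_inside = values.between(lower, upper).mean()   # bounds are inclusive

print(f'- Standard Deviation: {population_std}')
print(f'- Interval: {lower} to {upper}')
print(f'- Share of values inside the interval: {share_inside:.0%}')
```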


<p id='3-mean-absolute-deviation-mad' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>3 | Mean Absolute Deviation (MAD)</p>

Both Variance and Standard Deviation are used when the Arithmetic Mean is the recommended metric to summarize the variable. However, when the Median is the recommended metric, we switch to the `Mean Absolute Deviation (MAD)` in order to find the range of values.

Its equation is given as:

$$
\text{mad} = \frac{\sum_{i=1}^{n}{| x_i - \overline{x} |}} {n}
$$

where:

$\text{mad : mean absolute deviation}$

$x_i \text{: each element in the list}$

$\overline{x} \text{: list's arithmetic mean}$

$n \text{: number of elements in the list}$

---

By the way, use the Mean Absolute Deviation (MAD) instead of Standard Deviation in the following scenarios:

- **Robustness to Outliers** - `MAD is less sensitive to extreme values or outliers compared to Standard Deviation. If your data contains outliers that you don't want to heavily influence the measure of dispersion, MAD is a better choice`;

- **Simpler Interpretation** - `MAD is conceptually simpler since it measures the average absolute distance from the mean, whereas Standard Deviation squares the differences, making it harder to interpret in some cases`;

- **Non-Normal Distributions** - `If your data does not follow a normal distribution, MAD can sometimes provide a more accurate measure of variability. Standard Deviation is easiest to interpret for roughly normal data, so it may be less informative for data that is skewed or has heavy tails`;

- **Ordinal Data** - `When dealing with ordinal data (data where the order matters but the difference between values is not necessarily consistent), MAD is sometimes preferred because it does not square the differences. It still relies on distances between values, though, so treat the result with care`;

- **Emphasis on Median** - `In cases where the median is a better measure of central tendency than the mean (e.g., with skewed data), MAD is more aligned with the median, making it more consistent in those contexts`.

In [6]:
# ---- Mean Absolute Deviation ----
age_mad = round(
    sum(
        df.age.apply(lambda age: abs(age - age_mean))
    ) / df.age.shape[0]
, 4)

print(f'- Age Mean Absolute Deviation: {age_mad} years-old')
print(f'- The guess goes from {age_mean - age_mad} to {age_mean + age_mad} years-old')

- Age Mean Absolute Deviation: 15.3 years-old
- The guess goes from 35.8 to 66.4 years-old
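
To illustrate the robustness point from the list above, the sketch below compares how much the MAD and the Standard Deviation grow when a single extreme value replaces the largest element. The data and the helper name `mean_absolute_deviation` are hypothetical, chosen only for this example.

```python
import pandas as pd

def mean_absolute_deviation(values: pd.Series) -> float:
    """Average absolute distance from the arithmetic mean."""
    return (values - values.mean()).abs().mean()

# Hypothetical data: same list, with the last value swapped for an outlier
base = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
with_outlier = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

mad_growth = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(base)
std_growth = with_outlier.std(ddof=0) / base.std(ddof=0)

print(f'- MAD grew {mad_growth:.2f}x')
print(f'- Standard Deviation grew {std_growth:.2f}x')
```

Both metrics react to the outlier, but the Standard Deviation reacts more, because squaring the differences gives extra weight to the largest ones.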


<p id='4-coefficient-of-variation-cv' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>4 | Coefficient of Variation (CV)</p>

`Coefficient of Variation (CV)` is commonly used to compare the dispersion between two different samples of the dataset. For instance, suppose you and I each took a different sample of the Dragon Ball Dataset.

In my sample, I got a mean equal to 30 and a standard deviation equal to 5. You got a mean equal to 28 and a standard deviation equal to 10. So, who is correct? You or I? Neither of us is wrong!! I'm right about my sample and you are right about yours. This happens because these metrics are calculated on the sample we are working with, and not on the whole population.

Still, there is a way to compare our results: the Coefficient of Variation, which is given by:

$$
\text{CV} = \frac{\sigma}{\overline{x}} \cdot 100
$$

where:

$\text{CV: Coefficient of Variation}$

$\sigma \text{: standard deviation}$

$\overline{x} \text{: arithmetic mean}$

<br />

The result is a percentage. It usually falls between 0 and 100, but it can exceed 100 when the standard deviation is larger than the mean. A common rule of thumb for reading it is given below:

> **Low CV (0-20%)** - `a low coefficient of variation indicates low relative variability. This suggests that the values in the dataset are relatively close to the mean, and there is less spread or dispersion in the data`;

> **Moderate CV (20-50%)** - `a moderate coefficient of variation suggests a moderate level of relative variability. The values in the dataset are moderately spread around the mean`;

> **High CV (50% and above)** - `a high coefficient of variation indicates high relative variability. This suggests that the values in the dataset are widely spread around the mean, and there is a significant degree of dispersion`.

In [7]:
# ---- Coefficient of Variation ----
sample_cv = round((age_std / age_mean) * 100, 4)
print(f'- Coefficient of Variation of the Age Variable for the present Dataset: {sample_cv}%')

- Coefficient of Variation of the Age Variable for the present Dataset: 37.2043%
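
The two samples from the example at the start of this section can be compared the same way. The sketch below uses a hypothetical helper, `coefficient_of_variation`, written only for this illustration; it also guards against a mean of zero, where the CV is undefined.

```python
def coefficient_of_variation(std: float, mean: float) -> float:
    """Return the CV as a percentage (undefined when the mean is zero)."""
    if mean == 0:
        raise ValueError('CV is undefined for a mean of zero')
    return std / mean * 100

# Values taken from the two-sample example in the text
my_cv = coefficient_of_variation(std=5, mean=30)
your_cv = coefficient_of_variation(std=10, mean=28)

print(f'- My sample CV: {my_cv:.2f}%')
print(f'- Your sample CV: {your_cv:.2f}%')
```

With the thresholds above, my sample (about 16.67%) shows low relative variability, while yours (about 35.71%) shows moderate relative variability, even though the two means are close.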


<p id='5-skewness' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>5 | Skewness</p>

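`Skewness` tells how asymmetric the variable's distribution is: negative values indicate a longer left tail, positive values indicate a longer right tail, and values close to 0 indicate a roughly symmetric distribution. Its equation is given by this big one: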

$$
\text{Skewness} = \frac{n}{(n - 1) \cdot (n - 2)}
\cdot \sum_{i=1}^{n}{\left(\frac{x_i - \overline{x}}{\sigma}\right)^3}
$$

where:

$\text{n: number of elements in the list}$

$x_i \text{: each individual element in the list}$

$\overline{x} \text{: sample's arithmetic mean}$

$\sigma \text{: sample's standard deviation}$

In [8]:
# ---- Skewness ----
age_skewness = round(df.age.skew(), 4)
print(f'- Age Skewness: {age_skewness}')

- Age Skewness: -0.2409
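
As a sanity check on the equation, the sketch below applies it term by term to a small hypothetical list (made up for illustration) and compares the result with `Series.skew()`. It assumes $\sigma$ is the sample standard deviation (`ddof=1`, the pandas default), as stated in the symbol list above.

```python
import pandas as pd

# Hypothetical right-skewed values, made up for illustration
values = pd.Series([2, 3, 5, 8, 13, 21])

n = len(values)
mean = values.mean()
sample_std = values.std()  # ddof=1 by default

# Sum of the cubed standardized deviations
standardized_cubes = ((values - mean) / sample_std) ** 3
manual_skewness = n / ((n - 1) * (n - 2)) * standardized_cubes.sum()

print(f'- Manual skewness: {round(manual_skewness, 4)}')
print(f'- pandas skew(): {round(values.skew(), 4)}')
```

The result is positive because the few large values stretch the right tail of this list.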


<p id='6-kurtosis' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>6 | Kurtosis</p>

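`Kurtosis` tells how heavy the tails of the variable's distribution are. The equation below gives the excess kurtosis, measured relative to a normal distribution (whose excess kurtosis is 0), which is also what pandas' `kurt()` returns: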

$$
\text{Kurtosis} = \frac{n \cdot (n + 1)}{(n - 1) \cdot (n - 2) \cdot (n - 3)}
\cdot \sum_{i=1}^{n}{(\frac{x_i - \overline{x}}{\sigma})^4}
- \frac{3 \cdot (n - 1)^2}{(n - 2) \cdot (n - 3)}
$$

where:

$\text{n: number of elements in the list}$

$x_i \text{: each individual element}$

$\overline{x} \text{: sample's arithmetic mean}$

$\sigma \text{: sample's standard deviation}$

In [9]:
# ---- Kurtosis ----
age_kurtosis = round(df.age.kurt(), 4)
print(f'- Age Kurtosis: {age_kurtosis}')

- Age Kurtosis: -1.0057
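
The same kind of check works for kurtosis. The sketch below evaluates the equation on a hypothetical list (made up for illustration) and compares it with `Series.kurt()`, again assuming $\sigma$ is the sample standard deviation (`ddof=1`). The formula needs at least four elements, since it divides by $n - 3$.

```python
import pandas as pd

# Hypothetical values, made up for illustration
values = pd.Series([2, 3, 5, 8, 13, 21])

n = len(values)
mean = values.mean()
sample_std = values.std()  # ddof=1 by default

# Sum of the standardized deviations raised to the fourth power
fourth_powers = ((values - mean) / sample_std) ** 4

scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
manual_kurtosis = scale * fourth_powers.sum() - correction

print(f'- Manual kurtosis: {round(manual_kurtosis, 4)}')
print(f'- pandas kurt(): {round(values.kurt(), 4)}')
```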


<p id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</p>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).