<a href="https://colab.research.google.com/github/AhMedDa1/Statistics/blob/main/descriptive_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics)

## Center of Distribution

The mean and the median both measure the "central tendency" of a data set: each gives an idea of a "typical" value in the data. The mean is more commonly used, but the median is often preferred when the data are skewed or contain outliers, because extreme values affect it much less.
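As a small illustration of when the median can be the better summary, the sketch below compares both measures on a made-up list of salaries before and after adding one extreme value. The numbers are hypothetical and chosen only for the example.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands), invented for this example.
salaries = [30, 32, 35, 38, 40]
# The same data with one extreme value appended.
with_outlier = salaries + [400]

print(mean(salaries), median(salaries))          # 35 35
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 95.8, median is 36.5
```

A single outlier moves the mean far from the bulk of the data, while the median barely changes.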

### Mean

For a data set, the arithmetic mean, also known as the arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.

If the data set is based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is called the sample mean (denoted $\bar{x}$) to distinguish it from the mean, or expected value, of the underlying distribution, which is the population mean (denoted $\mu$ or $\mu_x$).

- Population mean: $\mu$ or $\mu_x$
- Sample mean: $\bar x$

In [1]:
import numpy as np
from scipy import stats

In [7]:
rng = np.random.default_rng(20)
arr = rng.integers(low=0, high=11, size=11)
arr

array([9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3])

In [8]:
mu = np.mean(arr)
mu

4.090909090909091
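The definition (sum of the values divided by the number of values) can also be checked by hand. This is a minimal sketch in plain Python; `values` is simply the contents of `arr` written out as a list.

```python
# Same values as `arr`, written out as a plain Python list.
values = [9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]

# Arithmetic mean: sum of the values divided by the number of values.
manual_mean = sum(values) / len(values)
print(manual_mean)  # 45 / 11, about 4.0909
```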

### Median

The **median** is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value.

In [9]:
median = np.median(arr)
median

3.0
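To make "the middle value" concrete, here is a hand-written sketch of the usual rule: sort the data, then take the middle element, or the average of the two middle elements when the count is even. The helper name `manual_median` is made up for this example, and it assumes a non-empty input.

```python
def manual_median(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(manual_median([9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]))  # 3
print(manual_median([1, 2, 3, 4]))                        # 2.5
```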

### Mode

The **mode** is the value that appears most often in a set of data values. If _X_ is a discrete random variable, the mode is the value _x_ (i.e., _X_ = _x_) at which the probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled.

In [10]:
mode = stats.mode(arr)[0]
mode

array([2])
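The mode can also be found by counting occurrences directly. The sketch below uses `collections.Counter` and returns every value that ties for the highest count, which matters for multimodal data where a single reported mode hides the tie. The helper name `all_modes` is an assumption made for this example.

```python
from collections import Counter

def all_modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)
    top_count = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top_count)

print(all_modes([9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]))  # [2]
print(all_modes([1, 1, 2, 2, 3]))                     # [1, 2]
```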

---

## Data Patterns

### Percentile

The **_k_-th percentile** (**percentile score** or **centile**) is a score below which a given percentage _k_ of scores in its frequency distribution falls (exclusive definition) or a score _at or below which_ a given percentage falls (inclusive definition). For example, the 50th percentile (the median) is the score below which (exclusive) or at or below which (inclusive) 50% of the scores in the distribution may be found. Percentiles are expressed in the same unit of measurement as the input scores; for example, if the scores refer to human weight, the corresponding percentiles will be expressed in kilograms or pounds.

The 25th percentile is also known as the first quartile ($Q_1$), the 50th percentile as the median or second quartile ($Q_2$), and the 75th percentile as the third quartile ($Q_3$).

In [13]:
np.percentile(arr, 25, method="lower")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

2

In [14]:
np.percentile(arr, 25, method="midpoint")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

2.0

In [15]:
np.percentile(arr, 25, method="higher")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

2

In [16]:
np.percentile(arr, 75, method="lower")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

5

In [17]:
np.percentile(arr, 75, method="midpoint")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

5.0

In [18]:
np.percentile(arr, 75, method="higher")  # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)

5
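For `arr`, the `lower`, `midpoint`, and `higher` options all return the same value, because the two neighbouring sorted values are equal at both quartiles. The pure-Python sketch below shows how the options differ on other data. It assumes the percentile sits at the fractional position `(n - 1) * q / 100` in the sorted data, which is the convention of NumPy's default linear method; the function itself is a hypothetical helper, not NumPy's implementation.

```python
import math

def percentile(values, q, method="linear"):
    """q-th percentile (0 <= q <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    # Fractional position of the percentile in the sorted data.
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    if method == "lower":
        return ordered[lo]
    if method == "higher":
        return ordered[hi]
    if method == "midpoint":
        return (ordered[lo] + ordered[hi]) / 2
    if method == "linear":
        # Interpolate between the two neighbours by the fractional part.
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    raise ValueError(f"unknown method: {method!r}")

data = [1, 2, 3, 4]
for m in ("lower", "midpoint", "higher", "linear"):
    print(m, percentile(data, 25, m))
# lower 1, midpoint 1.5, higher 2, linear 1.75
```

The 25th percentile of `[1, 2, 3, 4]` falls at position 0.75, between the first two values, so each option picks a different answer.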

### [IQR](https://en.wikipedia.org/wiki/Interquartile_range)

The **interquartile range** (**IQR**) is a measure of statistical dispersion, that is, of the spread of the data or observations. The IQR may also be called the **midspread**, **middle 50%**, or **H‑spread**. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts, via linear interpolation. These quartiles are denoted by $Q_1$ (also called the lower quartile), $Q_2$ (the median), and $Q_3$ (also called the upper quartile). The lower quartile corresponds to the 25th percentile and the upper quartile corresponds to the 75th percentile, so $\mathrm{IQR} = Q_3 - Q_1$.

In [19]:
stats.iqr(arr)

3.0

In [20]:
stats.iqr(arr, interpolation="midpoint")

3.0
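The relation $Q_3 - Q_1$ can be reproduced with the standard library. This sketch uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between ranks and should agree with the default behaviour of `stats.iqr` for this data; `values` is again the contents of `arr` as a plain list.

```python
from statistics import quantiles

values = [9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]

# n=4 returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 2.0 3.0 5.0 3.0
```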

## Data Variability

### Variance

**Variance** is the expectation of the squared deviation of a random variable from its mean (the population mean, or the sample mean when working with a sample). Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself. It is often represented by $\sigma^2$, $s^2$, $\operatorname{Var}(X)$, $V(X)$, or $\mathbb{V}(X)$.

- Population Variance: $\sigma^2$
- Sample Variance: $s^2$
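Written out for a finite data set, the two quantities are usually defined as follows, where $N$ is the population size, $n$ is the sample size, and the $n - 1$ divisor is Bessel's correction. The index notation $x_i$ is introduced here only for the formulas.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
$$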

In [22]:
# Population Variance
variance = np.var(arr)
variance

6.8099173553719

In [23]:
# Sample Variance
variance = np.var(arr, ddof=1)
variance

7.49090909090909
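The only difference between the two results is the divisor: `ddof=0` (the default of `np.var`) divides the sum of squared deviations by the number of values, while `ddof=1` divides by one less. A minimal hand-written sketch, with `values` holding the contents of `arr`:

```python
values = [9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]

n = len(values)
mean = sum(values) / n
# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in values)

population_variance = squared_deviations / n        # corresponds to ddof=0
sample_variance = squared_deviations / (n - 1)      # corresponds to ddof=1
print(population_variance, sample_variance)
```

Both values should match the outputs of the two cells above up to floating-point rounding.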

### Standard Deviation

The **standard deviation** is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Standard deviation may be abbreviated **SD**, and is most commonly represented in mathematical texts and equations by the lower case Greek letter sigma **σ**, for the population standard deviation, or the Latin letter **s**, for the sample standard deviation.

- Population SD: $\sigma$
- Sample SD: $s$

In [24]:
# Population SD
sd = np.std(arr)
sd

2.609581835346786

In [25]:
# Sample SD
sd = np.std(arr, ddof=1)
sd

2.7369525189358126
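Because the standard deviation is the square root of the variance, it is expressed in the same units as the data. The sketch below checks that relation with the standard library, where `pstdev`/`pvariance` are the population versions and `stdev`/`variance` the sample versions.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

values = [9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]

population_sd = pstdev(values)
sample_sd = stdev(values)

# Each standard deviation is the square root of the matching variance.
print(math.isclose(population_sd, math.sqrt(pvariance(values))))  # True
print(math.isclose(sample_sd, math.sqrt(variance(values))))       # True
```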

## Describe

In [28]:
def describe(arr):
    print("mean:", np.mean(arr))
    print("Population Variance:", np.var(arr))
    print("Sample Variance:", np.var(arr, ddof=1))
    print("Population SD:", np.std(arr))
    print("Sample SD:", np.std(arr, ddof=1))
    print("median:", np.median(arr))
    
describe(arr)

mean: 4.090909090909091
Population Variance: 6.8099173553719
Sample Variance: 7.49090909090909
Population SD: 2.609581835346786
Sample SD: 2.7369525189358126
median: 3.0


---

# Modeling Data Distribution

## [Z-Scores](https://en.wikipedia.org/wiki/Standard_score)

The **standard score** is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.

Standard scores are most commonly called **_z_-scores**; other equivalent terms in use include **z-values**, **normal scores**, and **standardized variables**.

Computing a z-score requires knowledge of the mean and standard deviation.

In [30]:
# Z-Score calculation for single data point
def zscore(dp, mean, sd):
    z = (dp - mean) / sd
    return z

zscore(71, 76, 3)

-1.6666666666666667

In [None]:
# Z-Scores calculation for every item in an array
stats.zscore(arr)

array([ 1.88117914, -0.41803981, -0.80124297,  0.34836651,  1.88117914,
       -1.18444612, -0.80124297,  0.34836651, -0.80124297, -0.03483665,
       -0.41803981])
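The array version applies the same formula as `zscore` to every element, using the mean and standard deviation of the array itself. By default `stats.zscore` uses the population standard deviation (`ddof=0`). The sketch below mirrors that in plain Python and checks the defining property of standardized data: mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

values = [9, 3, 2, 5, 9, 1, 2, 5, 2, 4, 3]

mu = mean(values)
sigma = pstdev(values)  # population SD, matching ddof=0

# One z-score per input value, in the original order.
z_scores = [(x - mu) / sigma for x in values]

print(len(z_scores))               # 11
print(round(mean(z_scores), 12))   # 0.0 up to rounding
print(round(pstdev(z_scores), 12)) # 1.0
```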

**References**

* https://en.wikipedia.org/wiki/Statistics
* https://en.wikipedia.org/wiki/Statistical_dispersion
* https://en.wikipedia.org/wiki/Mean
* https://en.wikipedia.org/wiki/Expected_value
* https://en.wikipedia.org/wiki/Standard_deviation

