## What Is Statistics?

**Statistics** is the science of **collecting, organizing, and analyzing data**.

**Data:**  
Facts or pieces of information.

---

## Types of Statistics

### 1. Descriptive Statistics  
Descriptive statistics consist of **organizing and summarizing data**.

**Examples:**
- Measures of central tendency: **Mean, Median, Mode**
- Measures of dispersion: **Variance, Standard Deviation**
- Different types of data distributions
- **Histogram**
- **PDF (Probability Density Function)**
- **PMF (Probability Mass Function)**

**Question Type:**  
- What is the most common age in your statistics class?

---

### 2. Inferential Statistics  
Inferential statistics consist of **using measured data to form conclusions** about a population.

**Examples:**
- **Z-test**
- **T-test**
- **Chi-square test**
- **ANOVA**
- Hypothesis testing: **H₀ (null hypothesis), H₁ (alternative hypothesis)**
- **p-value**
- **Level of significance**

**Question Type:**  
- Are the ages of students in the classroom similar to the ages of students in the university?


## Population and Sample

**Population:**  
The entire group that we are interested in studying.  
- Its size is denoted by **N**

**Sample:**  
A subset of the population selected for study.  
- Its size is denoted by **n**

*The goal of sampling is to create a sample that is representative of the entire population.*

---

## Types of Sampling

### 1. Simple Random Sampling  
When performing simple random sampling, **every member of the population (N)** has an **equal chance** of being selected for the sample (n).
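
A minimal sketch of simple random sampling with Python's standard library. The population of 100 member IDs, the sample size of 10, and the seed are assumptions made up for illustration.

```python
import random

# Hypothetical population: member IDs 1..100, so N = 100
population = list(range(1, 101))

# A fixed seed is used only so the sketch is reproducible
rng = random.Random(42)

# Random.sample draws without replacement; each member is equally likely to be chosen
sample = rng.sample(population, k=10)  # n = 10

print(len(sample), sorted(sample))
```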

---

### 2. Stratified Sampling  
- The population is divided into **non-overlapping groups** called **strata**  
- A random sample is then drawn from **each stratum**, so every group is represented  
- "Stratified" means **layered**
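
A possible sketch of stratified sampling, assuming the usual second step of drawing a simple random sample from each stratum in proportion to its size. The function `stratified_sample`, the stratum names, and the 10% fraction are hypothetical choices for illustration.

```python
import random
from collections import defaultdict

# Hypothetical population of (member_id, stratum) pairs
population = (
    [(i, "undergrad") for i in range(60)]
    + [(i, "postgrad") for i in range(60, 100)]
)

def stratified_sample(members, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member_id, stratum in members:
        strata[stratum].append(member_id)  # group members into non-overlapping strata
    chosen = {}
    for stratum, ids in strata.items():
        k = max(1, round(len(ids) * fraction))  # at least one member per stratum
        chosen[stratum] = rng.sample(ids, k)
    return chosen

sample = stratified_sample(population, fraction=0.1)
print({stratum: len(ids) for stratum, ids in sample.items()})
```

Proportional allocation is only one option; equal-sized samples per stratum are also common.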

---

### 3. Systematic Sampling  
- Members are selected at **regular intervals** from the population list (for example, every k-th member after a random starting point)
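
A short sketch of systematic sampling, assuming an ordered population list and a random starting point inside the first interval. The population, the sample size, and the seed are made up for illustration.

```python
import random

population = list(range(1, 101))  # hypothetical ordered list, N = 100
n = 10                            # desired sample size
k = len(population) // n          # sampling interval: take every k-th member

rng = random.Random(7)            # fixed seed only for reproducibility
start = rng.randrange(k)          # random starting index within the first interval

sample = population[start::k][:n]
print(sample)
```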

---

### 4. Convenience Sampling / Voluntary Response Sampling  
- Samples are chosen based on **ease of access** or **self-selection**


## Types of Variables

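A **variable** is a property that can take different values from one observation to the next.

> For any single observation, a variable holds a **single value**.  
> Example: `Ages = [1, 5, 3, 6]` is a list of observed values (data), **not** a variable; the variable is `Age`.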

---


### 1. Quantitative Variables (Numerical)

#### a. Discrete Variable  
- Takes **countable** values, typically **whole numbers**  
- **Example:**  
  - Number of children = 2, 3

#### b. Continuous Variable  
- Can take **any real value** within a range  
- **Example:**  
  - Height = 175.24, 180.9

---

### 2. Qualitative / Categorical Variables

#### a. Nominal  
- No **order** in the data  
- No category is greater than another  
- **Example:**  
  - Gender = Male, Female

#### b. Ordinal  
- There **is an order** in the data  
- One category is better or higher than another  
- **Example:**  
  - Grade = Poor, Good, Excellent


## Measure of Central Tendency

**Definition:**  
A measure of central tendency is a single value that describes the **center** of a data distribution.

**Examples:**  
- Mean  
- Median  
- Mode  

---

### Mean (Average)

#### Population Data (N)
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

#### Sample Data (n)
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
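
As a quick check of the sample formula, here is a minimal sketch that computes the mean by hand and with the standard library. The `ages` list is hypothetical data.

```python
from statistics import mean

ages = [21, 23, 22, 25, 24]  # hypothetical sample, n = 5

# Direct translation of the formula: sum of the values divided by their count
x_bar = sum(ages) / len(ages)

print(x_bar)        # 23.0
print(mean(ages))   # library equivalent of the same calculation
```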

---

### Median

**Steps to calculate the median:**
1. Sort all the numbers in ascending order  
2. Find the central element  

**Cases:**
- **Odd length:** middle value  
- **Even length:** average of the two middle values  
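
The steps and the two cases above can be sketched as a small function. The name `median` and the input lists are illustrative; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Return the median by sorting and picking the central element(s)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)          # step 1: sort in ascending order
    n = len(ordered)
    mid = n // 2                      # step 2: locate the center
    if n % 2 == 1:
        return ordered[mid]           # odd length: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even length: average of the two middle values

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```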

---

### Mode

- The **most frequently occurring** element in the dataset

---

### Notes
- Use the **median** when the data contains **outliers**, since extreme values shift it far less than they shift the mean  
- Use **mode** for **categorical features**
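
A small sketch of these two notes, using hypothetical data: one extreme value pulls the mean a long way but barely moves the median, and the mode works on non-numeric categories.

```python
from statistics import mean, median, mode

salaries = [30, 32, 35, 31, 33]      # hypothetical salaries, in thousands
with_outlier = salaries + [500]      # same data plus one extreme value

# The mean jumps from about 32 to above 110; the median only moves from 32 to 32.5
print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))

# The mode needs no arithmetic, so it suits categorical features
colors = ["red", "blue", "red", "green"]
print(mode(colors))  # red
```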


## Measure of Dispersion

Measures of dispersion describe how **spread out** the data values are.

### Types of Measure of Dispersion
1. Variance  
2. Standard Deviation  

---

## Variance

**Variance** measures the **spread of the data** around the mean.

### For Population (N)
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

### For Sample (n)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

**Note:**  
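- $n - 1$ is used for sample data due to **Bessel’s correction**: estimating the mean with $\bar{x}$ uses up one **degree of freedom**, and dividing by $n - 1$ removes the bias that dividing by $n$ would introduce.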
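
To make the difference between the two denominators concrete, here is a minimal sketch of both formulas. The function names and the `data` list are illustrative; `statistics.pvariance` and `statistics.variance` are the standard-library equivalents.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

def population_variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)          # divide by N

def sample_variance(xs):
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)  # divide by n - 1

print(population_variance(data))  # 4.0
print(sample_variance(data))      # about 4.57, slightly larger because of n - 1
```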

---

## Standard Deviation

- Standard Deviation is the **square root of variance**.

$$\text{S.D.} = \sqrt{\text{Variance}}$$

<img src="https://image4.slideserve.com/254161/standard-deviation-l.jpg" alt="Variance vs Standard Deviation" width="300" style="background-color:white;">

### Interpretation
- When **variance is large**, standard deviation is large → **spread increases**
- When **variance is small**, standard deviation is small → **data is more concentrated**
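
The interpretation above can be illustrated with two hypothetical datasets that share the same mean but differ in spread. This sketch uses the population versions from the standard library.

```python
import math
from statistics import mean, pstdev, pvariance

tight = [9, 10, 10, 11]  # hypothetical values clustered around 10
wide = [1, 5, 15, 19]    # hypothetical values with the same mean, spread far apart

for name, values in (("tight", tight), ("wide", wide)):
    var = pvariance(values)
    sd = math.sqrt(var)  # standard deviation is the square root of the variance
    print(name, mean(values), var, round(sd, 3))
```

Because the standard deviation is in the same units as the data, it is usually easier to read than the variance.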

---

## Visualization

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Comparison_standard_deviations.svg/960px-Comparison_standard_deviations.svg.png" alt="Comparison of distributions with different standard deviations" width="300" style="background-color:white;">


## Coefficient of Variation (CV)

The **Coefficient of Variation (CV)** is a **relative measure of dispersion**.  
It expresses the **standard deviation as a percentage of the mean**.

---

## Formula

Since CV is a ratio, the formula is the same for both **population** and **sample**, as long as the corresponding **mean** and **standard deviation** are used.

### Standard Formula

$$CV = \left( \frac{\sigma}{\mu} \right) \times 100\%$$

**Where:**
* $\sigma$ or $s$: Standard Deviation
* $\mu$ or $\bar{x}$: Mean 
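
A minimal sketch of the formula, assuming a helper named `coefficient_of_variation` (a hypothetical name) that switches between the sample and population standard deviation. The `data` list is made up.

```python
from statistics import mean, pstdev, stdev

def coefficient_of_variation(xs, sample=True):
    """Return the CV as a percentage of the mean."""
    m = mean(xs)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    sd = stdev(xs) if sample else pstdev(xs)  # s for a sample, sigma for a population
    return sd / m * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, population SD 2

print(coefficient_of_variation(data, sample=False))  # about 40 (percent)
print(coefficient_of_variation(data))                # a little higher with the sample SD
```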

---

## Key Properties

- **Unitless:**  
  Units cancel out, allowing comparison between datasets with different units  
  *(e.g., weight in kg vs. height in cm)*

- **Relative Variability:**  
  Measures how much **noise** (standard deviation) exists relative to the **signal** (mean)

- **Scale Invariant:**  
  Multiplying all data points by a positive constant does **not** change the CV  
  *(unlike standard deviation, which scales with the constant)*
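
The scale-invariance property can be checked numerically. This sketch converts hypothetical heights from centimetres to metres and compares the standard deviation and CV before and after; the helper `cv` is an illustrative name.

```python
import math
from statistics import mean, stdev

def cv(xs):
    return stdev(xs) / mean(xs) * 100

heights_cm = [150.0, 160.0, 170.0, 180.0]   # hypothetical heights
heights_m = [h / 100 for h in heights_cm]   # same data, rescaled by a positive constant

print(stdev(heights_cm), stdev(heights_m))  # standard deviation changes with the unit
print(cv(heights_cm), cv(heights_m))        # CV stays the same up to rounding error
```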

---

## Interpretation

- **Low CV:**  
  Data is more **consistent** and **stable** relative to its size

- **High CV:**  
  Data is more **volatile** or **inconsistent** relative to its size

---

## Comparing Two Series

| Aspect        | Lower CV          | Higher CV        |
|--------------|-------------------|------------------|
| Consistency  | More consistent   | Less consistent  |
| Stability    | More stable       | Less stable      |
| Risk (Finance) | Less risky      | More risky       |

---

## When to Use CV vs. Standard Deviation

### Use **Standard Deviation**
- To understand the spread **within a single dataset**
- When **units matter**  
  *Example: “The error margin is 5 cm”*

### Use **Coefficient of Variation (CV)**
- To compare **two or more datasets**
- When datasets have **different means or different units**  
  *Example: “Which is more volatile: the price of Gold or the price of Milk?”*
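
A rough sketch of the gold-versus-milk comparison. All prices below are invented for illustration and say nothing about real markets; the point is only that the ranking by standard deviation and the ranking by CV can disagree.

```python
from statistics import mean, stdev

def cv(xs):
    return stdev(xs) / mean(xs) * 100

gold = [1900, 1950, 2000, 2050, 2100]  # hypothetical prices per ounce
milk = [0.8, 1.0, 1.2, 1.0, 1.0]       # hypothetical prices per litre

# Gold has by far the larger standard deviation in absolute terms...
print(round(stdev(gold), 2), round(stdev(milk), 2))

# ...but relative to its mean, milk varies more in this made-up data
print(round(cv(gold), 1), round(cv(milk), 1))
```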

---

## Important Note

- CV is meaningful **only for ratio-scale data** (data with a true zero), such as:
  - Height
  - Weight
  - Income

- CV is **not useful** for data such as temperature in Celsius ($^\circ\text{C}$) or Fahrenheit ($^\circ\text{F}$), where zero is arbitrary.
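
To see why an arbitrary zero is a problem, this sketch computes the CV of the same hypothetical temperatures in both units. The readings are made up; the two results differ even though the underlying measurements are identical.

```python
import math
from statistics import mean, stdev

def cv(xs):
    return stdev(xs) / mean(xs) * 100

celsius = [10.0, 15.0, 20.0, 25.0]               # hypothetical readings
fahrenheit = [c * 9 / 5 + 32 for c in celsius]   # same temperatures, different zero point

# The +32 offset shifts the mean without changing the relative spread of the data,
# so the two CVs disagree
print(round(cv(celsius), 1), round(cv(fahrenheit), 1))
```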