# Descriptive Statistics

-----------

## 1. Measures of Central Tendency

--------------

#### **1. Mean (Arithmetic Average)**  
**Definition**:  
- The mean is the sum of all values divided by the number of values.  
- It's a way of summarizing a set of values with one representative number.

-----------

![image.png](attachment:3d40a84a-6fea-477c-aad0-467d3f32777f.png)

**Key Properties**:  
- Sensitive to **outliers** (extreme values can skew the mean).  
- Most representative when data is **symmetrically distributed**.  
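
A minimal sketch of the definition and of the outlier sensitivity noted above. The function name `arithmetic_mean` and the exam scores are hypothetical, invented only for illustration:

```python
def arithmetic_mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical exam scores, invented for illustration only.
scores = [70, 72, 75, 78, 80]
scores_with_outlier = scores + [200]  # one extreme value added

print(arithmetic_mean(scores))               # mean of the original scores
print(arithmetic_mean(scores_with_outlier))  # mean pulled upward by the outlier
```

A single extreme score moves the mean well above every typical value in the list, which is why the mean suits symmetric data best.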

---------

![image.png](attachment:9237e36c-a38c-4c56-8389-ad07c8fd8a82.png)

-------------

**When to Use**:  
- For normally distributed data (e.g., heights, exam scores).  
- When all data points are equally important.  

************
************

#### **2. Median (Middle Value)**  
**Definition**:  
The median is the middle value in an ordered dataset.  

--------------
**Steps to Find Median**:  
1. Arrange data in **ascending or descending order**.  
2. If **odd** number of observations:  
   - Median = Middle value.  
3. If **even** number of observations:  
   - Median = Average of two middle values.  
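
The steps above can be sketched in plain Python. The function name `median_of` and the sample lists are assumptions made for illustration; in practice `statistics.median` from the standard library does the same job:

```python
def median_of(values):
    """Median following the steps in the notes: sort, then pick the middle."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)   # Step 1: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # Step 2: odd count, take the middle value
        return ordered[mid]
    # Step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 3, 9, 1, 5]    # hypothetical values
even_data = [7, 3, 9, 1]      # hypothetical values

print(median_of(odd_data))    # middle of [1, 3, 5, 7, 9]
print(median_of(even_data))   # average of 3 and 7
```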

--------------
**Key Properties**:  
- **Resistant to outliers** (a robust measure): a few extreme values barely move it.  
- Better for **skewed distributions** (e.g., income, house prices).  

--------------------

![image.png](attachment:afd7a07b-894b-4f20-b7be-b6c555a73b7a.png)

------------

**When to Use**:  
- When data has **outliers** or is **skewed**.  
- For ordinal data (e.g., survey ratings).

----------------
-----------------

#### **3. Mode (Most Frequent Value)**  
**Definition**:  
The mode is the value that appears **most frequently** in a dataset.  

-----------

**Key Properties**:  
- A dataset can have **one mode (unimodal)**, **two modes (bimodal)**, **several modes (multimodal)**, or **no mode**.  
- Used for **categorical data** (e.g., colors, brands).  

-----------------

**Example**:  
- **Shoe Sizes Sold**: [7, 8, 8, 9, 9, 9, 10] → Mode = **9**  
- **No Mode**: [1, 2, 3, 4] (all values appear once).  
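
A small sketch that reproduces the examples above. The helper `modes` is hypothetical; it follows the convention used in these notes that a dataset whose values all appear once has no mode (the standard library's `statistics.multimode` would instead return every value in that case):

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if every value appears once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values appear once: "no mode" per the notes
    return [value for value, count in counts.items() if count == top]

shoe_sizes = [7, 8, 8, 9, 9, 9, 10]
print(modes(shoe_sizes))       # one mode (unimodal)
print(modes([1, 2, 3, 4]))     # no mode
print(modes(["red", "blue", "red", "blue", "green"]))  # two modes (bimodal)
```

Because it only counts occurrences, the same helper works for categorical values such as colors.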

-------------------

**When to Use**:  
- For **nominal data** (categories like gender, car models).  
- Identifying **most common** preferences (e.g., favorite ice cream flavor).

----------------
----------------



### **Comparison Summary**  
| Measure  | Definition          | Sensitive to Outliers? | Best Used For                  | Example Use Case               |  
|----------|--------------------|-----------------------|-------------------------------|--------------------------------|  
| **Mean**  | Average value      | Yes                   | Symmetric data (normal dist.)  | Average exam score             |  
| **Median**| Middle value       | No                    | Skewed data                   | Median household income        |  
| **Mode**  | Most frequent value| No                    | Categorical data              | Most sold product size         |  
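
To see the three measures side by side, here is a short sketch using Python's standard `statistics` module. The income figures (in thousands) are invented for illustration and include one deliberately extreme value:

```python
from statistics import mean, median, mode

# Hypothetical household incomes in thousands; the last value is an outlier.
incomes = [30, 32, 35, 35, 40, 45, 500]

income_mean = mean(incomes)      # pulled upward by the 500
income_median = median(incomes)  # middle of the sorted values
income_mode = mode(incomes)      # most frequent value

print(f"mean={income_mean:.1f}, median={income_median}, mode={income_mode}")
```

The median and mode stay near the typical incomes while the mean lands far above them, matching the "Sensitive to Outliers?" column of the table.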



----------

### **Real-Life Observations to Learn These Concepts**  
1. **Mean**:  
   - Calculate your **average monthly phone usage** (total data / months).  
   - Observe how weather reports give **average temperatures**.  

2. **Median**:  
   - Check **real estate listings**: Why do they report median home prices instead of mean?  
   - Compare salaries in a company (CEO salary may skew the mean).  

3. **Mode**:  
   - Notice **best-selling products** on e-commerce sites (mode = most bought item).  
   - Identify the **most common** blood type in your family.  

------------
------------

### **Key Takeaways**  
- **Mean** = Best for symmetric data, but distorted by outliers.  
- **Median** = Robust for skewed data (e.g., income, age).  
- **Mode** = The only one of the three that works for nominal (unordered categorical) data (e.g., election results).  

----
----

# 2. Measures of Dispersion

Measures of dispersion describe how spread out or varied a dataset is. They help us understand the variability in data beyond just the central tendency (mean, median, mode). The key measures include:

1. **Range**  
2. **Variance**  
3. **Standard Deviation**  
4. **Interquartile Range (IQR)**  

## **1. Range**
### **Concept**
- The simplest measure of dispersion.
- Defined as the difference between the maximum and minimum values in a dataset.
  
**Formula:**  

![image.png](attachment:2d8d1770-5e3c-4c35-a7bf-30eeb8e1155a.png)


### **Usage**
- Quick and easy to compute.
- Useful for getting a rough idea of spread in small datasets.

### **Limitations**
- **Highly sensitive to outliers** (a single extreme value can distort the range).
- **Does not consider all data points**, only the extremes.
- **Not reliable for skewed distributions**.
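
A minimal sketch of the range (maximum minus minimum) and of how one outlier distorts it. The function name `value_range` and the temperature readings are hypothetical:

```python
def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)

# Hypothetical daily temperatures, invented for illustration.
temps = [18, 20, 21, 22, 24]
temps_with_outlier = temps + [45]  # one faulty or extreme reading

print(value_range(temps))               # spread of the typical readings
print(value_range(temps_with_outlier))  # spread dominated by the outlier
```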

### **When to Use?**
- For a quick initial assessment of variability.
- When the dataset is small and free of extreme outliers.

---
---


## **2. Variance**
### **Concept**
- Measures how far the data points lie from the mean overall.
- Defined as the average of the squared differences from the mean.

**Formula (Population Variance):** 

![image.png](attachment:e0cfa3d3-d30d-472a-8ea4-129e0161735a.png)
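
As a rough sketch, the population variance described above (the average of the squared differences from the mean) can be computed by hand and checked against the standard library. The function name `population_variance` and the data are illustrative assumptions. Note, as a point not covered in these notes, that the sample variance divides by n − 1 instead of n; `statistics.variance` uses that convention:

```python
from statistics import pvariance, variance

def population_variance(values):
    """Average of the squared differences from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

# Hypothetical data, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # manual calculation
print(pvariance(data))            # standard-library population variance
print(variance(data))             # sample variance (divides by n - 1)
```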

### **Usage**
- Used in statistical tests (ANOVA, regression).
- Helps in understanding data spread around the mean.

### **Limitations**
- **Squared units** make interpretation difficult (e.g., if data is in meters, variance is in m²).
- **Sensitive to outliers** (squaring amplifies extreme values).

### **When to Use?**
- When working with parametric statistical models.
- When the mean is the primary measure of central tendency.

---
---

## **3. Standard Deviation (SD)**
### **Concept**
- The square root of variance.
- Provides dispersion in the same units as the original data.

![image.png](attachment:2376ac7c-8803-415b-98dc-52cf488119df.png)
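
A short sketch of the relationship stated above: the standard deviation is the square root of the variance. The function name `population_std_dev` and the data are hypothetical; `statistics.pstdev` is the standard-library equivalent for a population:

```python
import math
from statistics import pstdev

def population_std_dev(values):
    """Square root of the population variance, in the data's own units."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    return math.sqrt(variance)

# Hypothetical measurements in meters; the variance would be in m².
lengths_m = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std_dev(lengths_m))  # manual calculation, in meters
print(pstdev(lengths_m))              # standard-library check
```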


### **Usage**
- Most widely used measure of dispersion.
- Helps in calculating confidence intervals and hypothesis testing.

### **Limitations**
- Still **affected by outliers**, though less than variance.
- Easiest to interpret when data is roughly normal; many procedures that rely on it **assume a normal distribution**.

### **When to Use?**
- When the data is normally distributed.
- When comparing variability across datasets measured in the same units and on a similar scale.

---
---

## **4. Interquartile Range (IQR)**
### **Concept**
- Measures the spread of the middle 50% of data.
- Calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

**Formula:**  

![image.png](attachment:34d32695-7e48-4883-a2d3-d616fb73b52a.png)
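
A minimal sketch of IQR = Q3 − Q1 using the standard library. Quartile conventions differ between textbooks and tools, so the `method="inclusive"` choice here is an assumption rather than the only correct option, and the data and the name `interquartile_range` are invented for illustration:

```python
from statistics import quantiles

def interquartile_range(values):
    """Q3 minus Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data, invented for illustration.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)                 # 25th, 50th, and 75th percentiles
print(interquartile_range(data))  # spread of the middle 50%
```

Other tools may return slightly different quartiles for the same data, especially on small datasets.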


### **Limitations**
- **Ignores data outside Q1 and Q3**.
- **Less intuitive** than SD for normally distributed data.

### **When to Use?**
- For skewed distributions or datasets with outliers.
- When median is the preferred measure of central tendency.

---
---

## **Comparison & When to Use Which Measure**

| Measure          | Best Used When... | Sensitive to Outliers? | Units | Works Best With |
|------------------|-------------------|-----------------------|-------|----------------|
| **Range**        | Quick estimate    | **Highly sensitive**  | Same as data | Small, outlier-free data |
| **Variance**     | Statistical models | **Sensitive**         | Squared units | Normally distributed data |
| **Standard Deviation** | General use | **Moderately sensitive** | Same as data | Normally distributed data |
| **IQR**          | Skewed data       | **Robust**            | Same as data | Non-normal or outlier-prone data |
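
The outlier column of the table can be checked with a small experiment. This is a sketch under stated assumptions: the helper `dispersion_summary`, the two datasets, and the "inclusive" quartile convention are all illustrative choices:

```python
from statistics import pstdev, pvariance, quantiles

def dispersion_summary(values):
    """Compute the four dispersion measures for one dataset."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "std_dev": pstdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: identical except for the last value.
clean = [10, 12, 14, 16, 18, 20, 22, 24, 26]
with_outlier = [10, 12, 14, 16, 18, 20, 22, 24, 260]

clean_stats = dispersion_summary(clean)
outlier_stats = dispersion_summary(with_outlier)

for name in clean_stats:
    print(f"{name:>8}: clean={clean_stats[name]:.2f}  outlier={outlier_stats[name]:.2f}")
```

In this example the range, variance, and standard deviation all grow sharply when the outlier is introduced, while the IQR does not change.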

---

## **Key Takeaways**
- **For normally distributed data**: Use **Standard Deviation** (uses every data point and keeps the original units).
- **For skewed data or outliers**: Use **IQR** (more robust).
- **For quick assessment**: Use **Range** (but be cautious of outliers).
- **For statistical modeling**: Use **Variance** (foundational for many tests).

-------
-------

# 3. Probability Distributions


### **1. Statistics and Statistical Models**  

#### **A. Descriptive Statistics**  
1. **Measures of Central Tendency**  
   - Mean, Median, Mode  
2. **Measures of Dispersion**  
   - Range, Variance, Standard Deviation, IQR  
3. **Probability Distributions**  
   - Discrete vs. Continuous Distributions  
   - Gaussian (Normal) Distribution  
   - Skewness and Kurtosis  
4. **Regression Analysis**  
   - Linear Regression  
   - Non-Linear Regression  
   - Goodness of Fit (R², Adjusted R²)  
5. **Normality and Model Assumptions**  
   - Normality Tests (Shapiro-Wilk, KS Test)  
   - Homoscedasticity  
6. **Analysis of Variance (ANOVA)**  
   - One-Way ANOVA  
   - Two-Way ANOVA  

#### **B. Inferential Statistics**  
1. **Hypothesis Testing**  
   - Null & Alternative Hypotheses  
   - Type I & Type II Errors  
   - p-values & Significance Levels  
2. **Parametric Tests**  
   - **t-Tests**  
     - One-Sample t-Test  
     - Independent (Two-Sample) t-Test  
     - Paired t-Test  
   - **z-Test**  
   - **ANOVA** (One-Way, Two-Way)  
3. **Non-Parametric Tests**  
   - Chi-Square Test (Goodness-of-Fit, Independence)  
4. **Data Types in Testing**  
   - Continuous Data (t-Test, z-Test, ANOVA)  
   - Categorical Data (Chi-Square, Logistic Regression)  
