# **Descriptive Statistics**

Descriptive statistics help summarize and understand datasets using numerical measures. The two main types are:

1. **Measures of Central Tendency** – Describe where the center of the data lies.
2. **Measures of Dispersion (Variability)** – Describe how spread out the data is.

---

## **1. Measures of Central Tendency**
These measures indicate the typical value in a dataset.

### **a) Mean (Average)**
- The sum of all values divided by the number of values.
- Formula:  
  $$
  \text{Mean} = \frac{\sum X}{N}
  $$
- Example: If the exam scores are **[80, 85, 90, 75, 95]**,  
  $$
  \text{Mean} = \frac{80+85+90+75+95}{5} = 85
  $$
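
As a quick check of the arithmetic above, here is a minimal Python sketch. It uses only the standard library; the variable names are illustrative and not part of the original notes.

```python
from statistics import mean

scores = [80, 85, 90, 75, 95]

# Sum of all values divided by the number of values
manual_mean = sum(scores) / len(scores)

# The standard library helper should agree with the manual formula
library_mean = mean(scores)

print(manual_mean, library_mean)
```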

### **b) Median**
- The middle value when data is sorted.
- If **N** (number of values) is:
  - **Odd** → Median = Middle value.
  - **Even** → Median = Average of two middle values.
- Example:  
  - **Odd dataset:** [10, 20, 30] → **Median = 20**  
  - **Even dataset:** [10, 20, 30, 40] →  
    $$
    \text{Median} = \frac{20+30}{2} = 25
    $$
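
The odd/even rule can be sketched in a few lines of Python. The helper name `median_of` is hypothetical; in practice `statistics.median` or `numpy.median` would normally be used.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # the rule only works on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd N: the single middle value
    # even N: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median_of([10, 20, 30])
even_median = median_of([30, 10, 40, 20])   # unsorted input is fine
print(odd_median, even_median)
```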

### **c) Mode**
- The most frequently occurring value in the dataset.
- Example: [1, 2, 2, 3, 4, 4, 4, 5]  
  - **Mode = 4** (appears most times)  
- A dataset may have:
  - **No mode** (if no value repeats).
  - **One mode** (Unimodal).
  - **Multiple modes** (Bimodal or Multimodal).  
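
The three cases (no mode, one mode, several modes) can be illustrated with a small sketch. The function `modes` is a hypothetical helper written for this note; it returns an empty list when no value repeats, which is one possible convention.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([1, 2, 2, 3, 4, 4, 4, 5])
no_mode = modes([1, 2, 3])
bimodal = modes([1, 1, 2, 2, 3])
print(unimodal, no_mode, bimodal)
```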

---

## **2. Measures of Dispersion**
These measures show how much data varies around the central value.

### **a) Range**
- The difference between the maximum and minimum values.
- Formula:  
  $$
  \text{Range} = \text{Max} - \text{Min}
  $$
- Example: If scores are **[50, 60, 70, 80, 90]**,  
  $$
  \text{Range} = 90 - 50 = 40
  $$

### **b) Variance**
- Measures the average squared deviation from the mean.
- Formula for **population variance**:  
  $$
  \sigma^2 = \frac{\sum (X - \mu)^2}{N}
  $$
- Formula for **sample variance**:  
  $$
  s^2 = \frac{\sum (X - \bar{X})^2}{N-1}
  $$
- Example (treating the data as the full population, so dividing by **N**): If data is **[10, 20, 30]**,  
  - Mean = **20**  
  - Variance:  
    $$
    \sigma^2 = \frac{(10-20)^2 + (20-20)^2 + (30-20)^2}{3}
    $$  
    $$
    = \frac{100 + 0 + 100}{3} = 66.67
    $$
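
To make the difference between the two formulas concrete, the sketch below computes both for the same three values. It is a minimal illustration using the standard library's `pvariance` (divide by N) and `variance` (divide by N-1) as cross-checks.

```python
from statistics import pvariance, variance

data = [10, 20, 30]
mean_value = sum(data) / len(data)
squared_devs = [(x - mean_value) ** 2 for x in data]

# Population variance: divide by N
population_var = sum(squared_devs) / len(data)

# Sample variance: divide by N - 1
sample_var = sum(squared_devs) / (len(data) - 1)

print(round(population_var, 2), sample_var)
print(pvariance(data), variance(data))   # library cross-check
```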

### **c) Standard Deviation**
- Square root of variance.
- Formula:  
  $$
  \sigma = \sqrt{\sigma^2}, \quad s = \sqrt{s^2}
  $$
- Lower standard deviation → Data is **closer to the mean** (less spread).  
- Higher standard deviation → Data is **more spread out**.  
- Example: If variance is **66.67**,  
  $$
  \sigma = \sqrt{66.67} \approx 8.16
  $$
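
The same example in Python, as a small sketch: `pstdev` is the population standard deviation and `stdev` the sample version, both from the standard library.

```python
import math
from statistics import pstdev, stdev

data = [10, 20, 30]

population_sd = pstdev(data)   # square root of the population variance
sample_sd = stdev(data)        # square root of the sample variance

# Taking the square root of the variance by hand gives the same number
manual_sd = math.sqrt(sum((x - 20) ** 2 for x in data) / len(data))

print(round(population_sd, 2), sample_sd)
```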

### **d) Interquartile Range (IQR)**
- Measures the spread of the middle **50%** of the data.
- Formula:  
  $$
  \text{IQR} = Q3 - Q1
  $$
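- Steps (median-of-halves method):
  1. Arrange data in **ascending order**.
  2. **Q1 (First Quartile)** = Median of the lower half (values below the overall median).
  3. **Q3 (Third Quartile)** = Median of the upper half (values above the overall median).
  4. **IQR = Q3 - Q1**.
- Several quartile conventions exist. Libraries such as NumPy interpolate between data points by default, so their Q1 and Q3 can differ slightly from the values obtained with these steps.
- Example: [10, 20, 30, 40, 50, 60, 70]  
  - Median = **40**, lower half = [10, 20, 30], upper half = [50, 60, 70]  
  - Q1 = **20**, Q3 = **60**  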
  - $$
    \text{IQR} = 60 - 20 = 40
    $$
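
The steps above can be sketched as follows. This is one possible implementation of the median-of-halves convention, in which the overall median is left out of both halves when N is odd; the function name is hypothetical.

```python
from statistics import median

def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd N, skip the middle element so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_of_halves([10, 20, 30, 40, 50, 60, 70])
print(q1, q3, iqr)
```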

---

## **Key Takeaways**
| Measure  | Purpose |
|----------|---------|
| **Mean** | Shows the "average" value. |
| **Median** | Shows the "middle" value. |
| **Mode** | Shows the most frequent value. |
| **Range** | Shows the total spread of data. |
| **Variance** | Shows the average squared deviation from the mean. |
| **Standard Deviation** | Shows the typical deviation from the mean, in the data's original units. |
| **IQR** | Shows the spread of the middle 50% of data. |

Understanding these concepts is essential for **data preprocessing, feature scaling, and outlier detection** in Machine Learning. 🚀


# **Why is Sample Variance Divided by (N-1)?**

In statistics, when we calculate variance, we use different formulas for **population variance** and **sample variance**.

## **1. Population Variance (σ²)**
When we have the entire population, variance is calculated as:

$$
\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}
$$

where:
- \( X_i \) are the data points,
- \( \mu \) is the population mean,
- \( N \) is the total population size.

Since we have all data points, this formula correctly represents the actual variance.

---

## **2. Sample Variance (s²)**
In most real-world scenarios, we don't have the entire population, so we estimate variance from a **sample**.

Instead of dividing by \( N \), we divide by **(N-1)**:

$$
s^2 = \frac{\sum (X_i - \bar{X})^2}{N-1}
$$

where:
- \( \bar{X} \) is the sample mean,
- \( N \) is the sample size.

---

## **3. Why (N-1)? (Bessel’s Correction)**
Dividing by **(N-1)** instead of **N** is known as **Bessel’s correction**. The reason is:

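1. **Sample Mean (\( \bar{X} \)) is an Estimate**:  
   - The sample mean \( \bar{X} \) is only an **approximation** of the population mean \( \mu \).
   - Because \( \bar{X} \) is computed from the same sample, it is the value that minimizes the sum of squared deviations. So \( \sum (X_i - \bar{X})^2 \) is never larger than \( \sum (X_i - \mu)^2 \), and is usually **smaller**.
   - The deviations \( X_i - \bar{X} \) must sum to zero, so only **N-1** of them are free to vary (the degrees of freedom).

2. **Unbiased Estimation**:  
   - If we divide by **N**, the variance estimate will be **biased**: on average it underestimates the true population variance by a factor of \( (N-1)/N \).
   - Dividing by **(N-1)** corrects this bias, making \( s^2 \) an **unbiased estimator** of \( \sigma^2 \).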
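
The bias can be checked exactly on a tiny example. The sketch below assumes a three-value population and lists every equally likely sample of size 2 drawn with replacement. It then averages the two estimators over all samples. The setup is illustrative, not a proof.

```python
from itertools import product

population = [10, 20, 30]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))   # all 9 possible samples

# Average each estimator over every possible sample
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(round(sigma2, 2), round(avg_div_n, 2), round(avg_div_n_minus_1, 2))
```

Dividing by N averages out to \( \sigma^2 (N-1)/N \), while dividing by N-1 averages out to \( \sigma^2 \) itself.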

---




## **1. Skewness**
Skewness measures the **asymmetry** of data distribution.  

### **Types of Skewness:**
1. **Positive Skew (Right-Skewed):**  
   - Tail is longer on the right side.
   - Typically Mean > Median > Mode (a rule of thumb that can fail, for example with multimodal or discrete data).
   - Example: Income distribution in most countries.

2. **Negative Skew (Left-Skewed):**  
   - Tail is longer on the left side.
   - Typically Mean < Median < Mode (same caveat).
   - Example: Exam scores where most students score high.

3. **Zero Skewness (Symmetric):**  
   - For a symmetric, unimodal distribution, Mean = Median = Mode.
   - Example: Normal distribution.
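
A common way to quantify skewness is the moment-based coefficient \( m_3 / m_2^{3/2} \), where \( m_k \) is the k-th central moment. The sketch below implements that definition in plain Python on three small made-up datasets. To my understanding this is the same quantity `scipy.stats.skew` returns with its default settings.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 12]      # long tail on the right
left_skewed = [88, 97, 98, 98, 99]   # long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, d in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(d), 3), "mean:", mean(d), "median:", median(d))
```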


---

## **2. Kurtosis**
Kurtosis measures the **tailedness** of the distribution.

### **Types of Kurtosis:**
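1. **Leptokurtic (Kurtosis > 3):**  
   - Heavy tails (often pictured as a sharp peak, but kurtosis is driven mainly by the tails).
   - More extreme outliers.
   - Example: Stock market returns, which include occasional crashes.

2. **Mesokurtic (Kurtosis ≈ 3):**  
   - Tails similar to the normal distribution.
   - Example: Heights of people (approximately normal).

3. **Platykurtic (Kurtosis < 3):**  
   - Light tails (often pictured as a flat peak).
   - Fewer outliers.
   - Example: Uniform distribution.

Many libraries report **excess kurtosis** (kurtosis − 3) by default, so the normal distribution scores 0 instead of 3.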
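
As an illustration, kurtosis can be computed from central moments as \( m_4 / m_2^2 \). The sketch below uses the evenly spaced `Age` values from the example dataset later in this notebook. Evenly spaced values behave like a uniform distribution, so the result should be platykurtic.

```python
def kurtosis_moment(values):
    """Moment-based kurtosis: m4 / m2 ** 2 (normal distribution = 3)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - m) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

k = kurtosis_moment(ages)
excess = k - 3          # excess kurtosis (normal distribution = 0)

print(round(k, 4), round(excess, 4))
```

A value below 3 (negative excess) indicates light tails.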


---

## **3. Percentiles and Quartiles**
### **Percentile:**
A percentile indicates the value **below which a certain percentage of observations fall**.

- **25th percentile (Q1):** 25% of data is below this.
- **50th percentile (Q2):** Median (50% of data below).
- **75th percentile (Q3):** 75% of data is below this.

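Formula for the **k-th percentile** (nearest-rank method, with the data sorted in ascending order):

$$
P_k = X_{\left( \left\lceil \frac{k}{100} \times N \right\rceil \right)}
$$

where \( X_{(i)} \) is the i-th smallest value and \( \lceil \cdot \rceil \) rounds up to the next integer. Other conventions interpolate between neighbouring values, so results can differ slightly on small datasets.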
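
A minimal sketch of the nearest-rank convention, in which the rank \( \frac{k}{100} \times N \) is rounded up to the next whole position in the sorted data. The function name and the sample values are illustrative. Interpolating methods, such as NumPy's default, can return different numbers.

```python
import math

def percentile_nearest_rank(values, k):
    """k-th percentile (0 < k <= 100) using the nearest-rank method."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank, rounded up to the next whole position
    rank = math.ceil(k * len(ordered) / 100)
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
p25 = percentile_nearest_rank(data, 25)
p50 = percentile_nearest_rank(data, 50)
p100 = percentile_nearest_rank(data, 100)
print(p25, p50, p100)
```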


### **Quartiles:**
Quartiles divide data into **4 equal parts**:
- **Q1 (25th percentile)** → Lower Quartile
- **Q2 (50th percentile)** → Median
- **Q3 (75th percentile)** → Upper Quartile

---

## **4. Five Number Summary**
The **Five Number Summary** provides a quick overview of a dataset:

1. **Minimum (Min)** → Smallest value.
2. **Q1 (25th percentile)** → Lower quartile.
3. **Median (Q2, 50th percentile)** → Middle value.
4. **Q3 (75th percentile)** → Upper quartile.
5. **Maximum (Max)** → Largest value.

📌 **Used in Boxplots** to visualize data spread and detect outliers.
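
A short sketch of the five-number summary using the standard library. It assumes `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points, and reuses the `Age` values from the example dataset later in this notebook.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with linearly interpolated quartiles."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
summary = five_number_summary(ages)
print(summary)
```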

---

## **5. Covariance**
Covariance measures **how two variables change together**.

### **Formula:**
$$
Cov(X, Y) = \frac{\sum (X_i - \bar{X}) (Y_i - \bar{Y})}{N-1}
$$


📌 **Interpretation:**
- \( Cov(X, Y) > 0 \) → Variables move **in the same direction**.
- \( Cov(X, Y) < 0 \) → Variables move **in opposite directions**.
- \( Cov(X, Y) = 0 \) → No **linear** relationship (a non-linear relationship may still exist).

⚠ **Problem:** Covariance is **not standardized**, making it hard to interpret. That’s why we use **correlation**.
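
The sign rules can be checked with a small sketch of the sample covariance formula. The data are made up for illustration. The last case shows a perfectly dependent pair (y = x²) whose covariance is still zero, because the relationship is not linear.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the N - 1 denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with N >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [-2, -1, 0, 1, 2]
same_direction = sample_covariance(x, [2 * v for v in x])    # y rises with x
opposite = sample_covariance(x, [-2 * v for v in x])         # y falls as x rises
curved = sample_covariance(x, [v ** 2 for v in x])           # y = x**2

print(same_direction, opposite, curved)
```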

---

## **6. Correlation**
Correlation measures **both direction and strength** of a relationship.

### **Formula:**
$$
r = \frac{Cov(X, Y)}{s_X s_Y}
$$

where:
- \( Cov(X, Y) \) is the covariance,
- \( s_X, s_Y \) are the standard deviations of X and Y.

📌 **Interpretation:**
- \( r = 1 \) → Perfect **positive** correlation.
- \( r = -1 \) → Perfect **negative** correlation.
- \( r = 0 \) → No **linear** correlation.

📊 **Example (illustrative values, not measured data):**
- **Height vs Weight:** \( r = 0.8 \) (Strong positive).
- **Hours Studied vs Exam Score:** \( r = 0.9 \) (Very strong positive).
- **Ice Cream Sales vs Temperature:** \( r = 0.75 \) (Positive).
- **Number of Accidents vs Rainfall:** \( r = -0.6 \) (Negative correlation).
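
To show why correlation is easier to compare than covariance, the sketch below computes both for the `Age` and `Salary` columns of the example dataset later in this notebook, then repeats the calculation with salary expressed in thousands. The helper names are hypothetical.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with the N - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # Cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
salaries = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000]
salaries_k = [s / 1000 for s in salaries]        # same data, different unit

cov_raw = sample_covariance(ages, salaries)
cov_k = sample_covariance(ages, salaries_k)
r_raw = pearson_r(ages, salaries)
r_k = pearson_r(ages, salaries_k)

print(cov_raw, cov_k)   # covariance changes with the unit
print(r_raw, r_k)       # correlation does not
```

Salary is an exact linear function of age in this toy dataset, so r comes out at (numerically) 1 regardless of the unit.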

---

📌 **Key Takeaway:**  
- **Use correlation instead of covariance** when comparing the strength of relationships, because correlation is unit-free and always lies between -1 and 1.  
- **Skewness and kurtosis** help analyze data distribution.  
- **Percentiles and quartiles** summarize spread effectively.  




# Example

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis

# Sample dataset (Modify as needed)
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR', 'IT']
}

df = pd.DataFrame(data)
df.head()


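|   | Age | Salary | Gender | Department |
|---|-----|--------|--------|------------|
| 0 | 25 | 3000 | Male | HR |
| 1 | 30 | 4000 | Female | IT |
| 2 | 35 | 5000 | Male | HR |
| 3 | 40 | 6000 | Female | IT |
| 4 | 45 | 7000 | Male | HR |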


In [3]:
# Separate numerical and categorical columns
num_cols = df.select_dtypes(include=['number']).columns
cat_cols = df.select_dtypes(include=['object']).columns

print("Numerical Columns:", list(num_cols))
print("Categorical Columns:", list(cat_cols))


Numerical Columns: ['Age', 'Salary']
Categorical Columns: ['Gender', 'Department']


In [7]:
# Compute numerical summary statistics
df.describe().T



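|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| Age | 10.0 | 47.5 | 15.138252 | 25.0 | 36.25 | 47.5 | 58.75 | 70.0 |
| Salary | 10.0 | 7500.0 | 3027.650354 | 3000.0 | 5250.0 | 7500.0 | 9750.0 | 12000.0 |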


In [8]:
# Compute categorical summary statistics
df.describe(include=['object'])


|   | Gender | Department |
|---|--------|------------|
| count | 10 | 10 |
| unique | 2 | 2 |
| top | Male | HR |
| freq | 5 | 5 |


In [9]:
df.describe(include='all')

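|   | Age | Salary | Gender | Department |
|---|-----|--------|--------|------------|
| count | 10.0 | 10.0 | 10 | 10 |
| unique | NaN | NaN | 2 | 2 |
| top | NaN | NaN | Male | HR |
| freq | NaN | NaN | 5 | 5 |
| mean | 47.5 | 7500.0 | NaN | NaN |
| std | 15.138252 | 3027.650354 | NaN | NaN |
| min | 25.0 | 3000.0 | NaN | NaN |
| 25% | 36.25 | 5250.0 | NaN | NaN |
| 50% | 47.5 | 7500.0 | NaN | NaN |
| 75% | 58.75 | 9750.0 | NaN | NaN |
| max | 70.0 | 12000.0 | NaN | NaN |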


In [10]:
num_stats = {}

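for col in num_cols:
    num_stats[col] = {
        'Mean': np.mean(df[col]),
        'Median': np.median(df[col]),
        # mode() returns every tied value in sorted order; [0] keeps the smallest.
        # In this dataset each value occurs once, so the reported "mode" is just the minimum.
        'Mode': df[col].mode()[0]
    }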

pd.DataFrame(num_stats)


|   | Age | Salary |
|---|-----|--------|
| Mean | 47.5 | 7500.0 |
| Median | 47.5 | 7500.0 |
| Mode | 25.0 | 3000.0 |


In [11]:
range_stats = {}

for col in num_cols:
    range_stats[col] = {
        'Min': np.min(df[col]),
        'Max': np.max(df[col]),
        'Range': np.max(df[col]) - np.min(df[col])
    }

pd.DataFrame(range_stats)


|   | Age | Salary |
|---|-----|--------|
| Min | 25 | 3000 |
| Max | 70 | 12000 |
| Range | 45 | 9000 |


In [12]:
var_sd_stats = {}

for col in num_cols:
    var_sd_stats[col] = {
        'Variance': np.var(df[col], ddof=1),
        'Standard Deviation': np.std(df[col], ddof=1)
    }

pd.DataFrame(var_sd_stats)


|   | Age | Salary |
|---|-----|--------|
| Variance | 229.166667 | 9166667.0 |
| Standard Deviation | 15.138252 | 3027.65 |


In [13]:
skew_kurt_stats = {}

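for col in num_cols:
    skew_kurt_stats[col] = {
        'Skewness': skew(df[col]),
        # scipy.stats.kurtosis defaults to fisher=True, i.e. excess kurtosis
        # (normal distribution = 0, not 3)
        'Kurtosis': kurtosis(df[col])
    }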

pd.DataFrame(skew_kurt_stats)



|   | Age | Salary |
|---|-----|--------|
| Skewness | 0.0 | 0.0 |
| Kurtosis | -1.224242 | -1.224242 |


In [14]:
iqr_stats = {}

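for col in num_cols:
    # np.percentile interpolates linearly between data points by default,
    # so quartiles can differ from the median-of-halves method
    q1 = np.percentile(df[col], 25)
    q2 = np.percentile(df[col], 50)
    q3 = np.percentile(df[col], 75)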
    
    iqr_stats[col] = {
        'Q1 (25th Percentile)': q1,
        'Median (50th Percentile)': q2,
        'Q3 (75th Percentile)': q3,
        'Interquartile Range (IQR)': q3 - q1,
        'Five-Number Summary': (np.min(df[col]), q1, q2, q3, np.max(df[col]))
    }

pd.DataFrame(iqr_stats)


|   | Age | Salary |
|---|-----|--------|
| Q1 (25th Percentile) | 36.25 | 5250.0 |
| Median (50th Percentile) | 47.5 | 7500.0 |
| Q3 (75th Percentile) | 58.75 | 9750.0 |
| Interquartile Range (IQR) | 22.5 | 4500.0 |
| Five-Number Summary | (25, 36.25, 47.5, 58.75, 70) | (3000, 5250.0, 7500.0, 9750.0, 12000) |
