
# 📌 Descriptive Statistics – Commonly Asked Interview Q&A

---

### **Q1. What are the key measures of central tendency?**

**Answer:**

* **Mean** → arithmetic average.
* **Median** → middle value when data is sorted.
* **Mode** → most frequently occurring value.
  👉 The median is more robust than the mean in the presence of outliers.
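
To make the robustness point concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are made up for illustration; the point is only that one extreme value moves the mean far more than the median.

```python
import statistics

# Hypothetical salaries (in thousands); the values are illustrative only.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]  # add a single extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("Mean:  ", mean_before, "->", round(mean_after, 2))
print("Median:", median_before, "->", median_after)
```

The mean jumps from 44.8 to about 120.67, while the median only moves from 45 to 46.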

---

### **Q2. What are measures of dispersion (spread) in data?**

**Answer:**

* Range = Max – Min
* Variance = Average squared deviation from the mean
* Standard Deviation = Square root of variance
* Interquartile Range (IQR) = Q3 – Q1 (middle 50% of data)
* Coefficient of Variation (CV) = (SD / Mean) × 100

---

### **Q3. When would you use Mean vs Median vs Mode?**

**Answer:**

* **Mean** → symmetric, continuous data without outliers.
* **Median** → skewed data or when outliers exist (e.g., salaries).
* **Mode** → categorical data (e.g., most purchased product).
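
For categorical data the mean and median are not defined, so the mode is the natural summary. A small sketch, assuming a made-up purchase log (the product names are hypothetical):

```python
from collections import Counter
import statistics

# Hypothetical purchase log; product names are invented for illustration.
purchases = ["laptop", "phone", "phone", "tablet", "phone", "laptop"]

most_purchased = statistics.mode(purchases)  # works on non-numeric data
counts = Counter(purchases)                  # full frequency table

print("Mode:", most_purchased)
print("Counts:", counts.most_common())
```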

---

### **Q4. What is the difference between Variance and Standard Deviation?**

**Answer:**

* **Variance** = average squared deviation from the mean (expressed in squared units).
* **Standard deviation** = square root of the variance (expressed in the same units as the data).
  👉 SD is more interpretable in real-world terms.
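
Written out as formulas (the symbols are notation introduced here for illustration: $n$ observations $x_i$ with mean $\bar{x}$), the population versions of the two quantities are:

$$
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \sigma = \sqrt{\sigma^2}
$$

When the data is a sample used to estimate a population, the divisor is usually $n - 1$ instead of $n$ (Bessel's correction), which is what `ddof=1` selects in NumPy.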

---

### **Q5. Explain Skewness. What does positive and negative skew mean?**

**Answer:**

* **Skewness** measures the asymmetry of a distribution.

  * **Positive skew (right-skewed):** long tail to the right; the mean is usually greater than the median.
  * **Negative skew (left-skewed):** long tail to the left; the mean is usually less than the median.
    👉 Example: Income distribution is typically **right-skewed**.
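
The sketch below illustrates the sign convention on a small made-up income list. The `skewness` helper is a hypothetical function written for this example; it uses the moment-based definition m3 / m2^1.5, which is also what `scipy.stats.skew` computes by default.

```python
import statistics

def skewness(values):
    """Moment-based (biased) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes (in thousands): mostly modest, a few very large.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 90, 150]
mirrored = [-x for x in incomes]  # flipping the sign mirrors the distribution

print("Mean:", statistics.mean(incomes), "Median:", statistics.median(incomes))
print("Skewness (right-skewed):", round(skewness(incomes), 2))
print("Skewness (mirrored):    ", round(skewness(mirrored), 2))
```

The long right tail pulls the mean (47.1) above the median (31), and mirroring the data flips the sign of the skewness.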

---

### **Q6. What is Kurtosis?**

**Answer:**

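* **Kurtosis** measures the "tailedness" of a distribution, i.e. how prone it is to extreme values.

  * **Leptokurtic (Kurtosis > 3):** heavier tails than a normal distribution, more extreme values.
  * **Platykurtic (Kurtosis < 3):** lighter tails than a normal distribution, fewer extreme values.
  * **Mesokurtic (≈3):** tails similar to a normal distribution.
    👉 Important in finance/risk analysis for extreme events. Many libraries (e.g. `scipy.stats.kurtosis` by default) report **excess kurtosis** = kurtosis – 3, so the reference value becomes 0.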
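
Because the "3" threshold and the "0" threshold are easy to mix up, here is a small sketch that computes both conventions by hand. `pearson_kurtosis` is a hypothetical helper written for this example; it uses the moment-based definition m4 / m2^2.

```python
def pearson_kurtosis(values):
    """Fourth standardized moment m4 / m2**2 (a normal distribution gives 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

data = [2, 3, 5, 6, 9, 10, 10, 11, 12, 14, 18, 20]

pearson = pearson_kurtosis(data)  # compare against 3
excess = pearson - 3              # compare against 0

print("Pearson kurtosis:", round(pearson, 2))
print("Excess kurtosis: ", round(excess, 2))
```

For this dataset the Pearson value is about 2.26 and the excess value about -0.74, so both conventions agree that it is platykurtic.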

---

### **Q7. What is the difference between Percentiles and Quartiles?**

**Answer:**

* **Percentiles** divide data into 100 equal parts.
* **Quartiles** divide data into 4 equal parts.

  * Q1 = 25th percentile
  * Q2 = 50th percentile (median)
  * Q3 = 75th percentile
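
As a quick check of the quartile/percentile relationship, the sketch below uses `statistics.quantiles` from the standard library. Choosing `method="inclusive"` is an assumption made here so that the interpolation matches NumPy's default linear method; other methods can give slightly different cut points.

```python
import statistics

data = [7, 8, 5, 6, 3, 4, 5, 9, 12, 15, 18, 21]

# n=4 returns the three quartile cut points;
# n=100 would return the 99 percentile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
percentiles = statistics.quantiles(data, n=100, method="inclusive")

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("Q2 equals the median:", q2 == statistics.median(data))
print("25th percentile equals Q1:", percentiles[24] == q1)
```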

---

### **Q8. What is the Interquartile Range (IQR)? Why is it useful?**

**Answer:**

* **IQR = Q3 – Q1** (spread of middle 50% of data).
* Useful because it is **robust to outliers**, unlike variance.
  👉 Often used in **boxplots** and **outlier detection** (Tukey’s rule).

---

### **Q9. What are Outliers? How do you detect them?**

**Answer:**

* **Outliers** = data points that lie far from the bulk of the distribution.
* Detection methods:

  * Z-score (|Z| > 3).
  * IQR Rule (values below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR).
  * Visualization (Boxplots, Scatterplots).
    👉 Outliers may indicate noise, fraud, or genuine rare events.
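
The Z-score rule deserves one caveat: extreme values inflate the very mean and SD used to compute Z, so on small samples the rule can miss points that the IQR rule flags. A minimal sketch, assuming the population SD (`statistics.pstdev`) is used:

```python
import statistics

data = [100, 102, 98, 95, 105, 110, 250, 300, 102, 99]

mean = statistics.mean(data)
std = statistics.pstdev(data)  # population SD (divides by n)

z_scores = [(x - mean) / std for x in data]
z_outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
largest_abs_z = max(abs(z) for z in z_scores)

print("Largest |z|:", round(largest_abs_z, 2))
print("Flagged by |z| > 3:", z_outliers)
```

Here the largest |Z| is only about 2.33, so nothing is flagged, even though 250 and 300 are clearly unusual for this data.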

---

### **Q10. What is Coefficient of Variation (CV)?**

**Answer:**

$$
CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100
$$

* CV is a **relative measure of variability** (unitless).
* Useful for comparing datasets with different units/scales.

---

### **Q11. How would you summarize a dataset for EDA?**

**Answer:**

* Central Tendency → Mean, Median, Mode.
* Spread → Variance, SD, IQR, Range.
* Shape → Skewness, Kurtosis.
* Visualization → Histogram, Boxplot, Scatterplot.

---

### **Q12. How do you handle outliers in ML datasets?**

**Answer:**

* Remove them (if they are errors).
* Transform data (log, sqrt).
* Winsorization (capping extreme values).
* Use robust models (Decision Trees, Random Forests).
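
Two of these options can be sketched in a few lines. Capping at the Tukey fences is an assumption made for this example; classic winsorization caps at chosen percentiles (e.g. the 5th and 95th) instead.

```python
import math
import statistics

data = [100, 102, 98, 95, 105, 110, 250, 300, 102, 99]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: cap extreme values at the fences instead of dropping them.
capped = [min(max(x, lower), upper) for x in data]

# Option 2: log-transform to compress the right tail (needs positive values).
logged = [round(math.log(x), 2) for x in data]

print("Fences:", lower, upper)
print("Capped:", capped)
print("Log-transformed:", logged)
```

Capping keeps the row count unchanged, which matters when other columns in the same rows are still valid.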

---

### **Q13. Why is Standard Deviation important in ML?**

**Answer:**

* Used in **feature scaling (z-score normalization)**.
* Helps measure **model stability and variance in predictions**.
* Used in **confidence intervals, Gaussian assumptions** in models.

---

### **Q14. Explain Boxplot and what insights it gives.**

**Answer:**

* Boxplot shows:

  * Median (Q2), Q1, Q3.
  * IQR spread.
  * Outliers (points more than 1.5 × IQR below Q1 or above Q3).
    👉 Quick way to check **spread, skewness, and outliers**.




# 📌 Descriptive Statistics – Hands-On with Answers

---

### **Q1. Compute the mean, median, and mode of a dataset.**

Dataset: `[12, 15, 12, 18, 19, 12, 25, 30, 18]`

```python
import numpy as np
import pandas as pd
from scipy import stats

data = [12, 15, 12, 18, 19, 12, 25, 30, 18]

mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True)  # keepdims needs SciPy >= 1.9; keeps array output so [0] indexing works

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode.mode[0], "Count:", mode.count[0])
```

✅ **Answer:**

* Mean ≈ 17.89
* Median = 18
* Mode = 12 (appears 3 times)

---

### **Q2. Calculate variance and standard deviation.**

Dataset: `[10, 12, 23, 23, 16, 23, 21, 16]`

```python
data = [10, 12, 23, 23, 16, 23, 21, 16]

variance = np.var(data, ddof=1)   # sample variance
std_dev = np.std(data, ddof=1)

print("Variance:", variance)
print("Standard Deviation:", std_dev)
```

✅ **Answer:**

* Variance ≈ 27.43
* Standard Deviation ≈ 5.24
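
The choice of `ddof` changes the result noticeably on small datasets. As a cross-check, the standard library exposes both conventions under separate names:

```python
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

pop_var = statistics.pvariance(data)    # divides by n      (like ddof=0)
sample_var = statistics.variance(data)  # divides by n - 1  (like ddof=1)
pop_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

print("Population variance:", pop_var, " SD:", round(pop_sd, 2))
print("Sample variance:    ", round(sample_var, 2), " SD:", round(sample_sd, 2))
```

For this dataset the population variance is 24 (SD ≈ 4.90) and the sample variance is about 27.43 (SD ≈ 5.24).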

---

### **Q3. Find the range, quartiles, and interquartile range (IQR).**

Dataset: `[7, 8, 5, 6, 3, 4, 5, 9, 12, 15, 18, 21]`

```python
data = sorted([7, 8, 5, 6, 3, 4, 5, 9, 12, 15, 18, 21])

data_range = max(data) - min(data)
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)  # median
q3 = np.percentile(data, 75)
iqr = q3 - q1

print("Range:", data_range)
print("Q1:", q1, "Q2 (Median):", q2, "Q3:", q3)
print("IQR:", iqr)
```

✅ **Answer:**

* Range = 18
* Q1 = 5
* Median = 7.5
* Q3 = 12.75
* IQR = 7.75

---

### **Q4. Detect outliers using the IQR method.**

Dataset: `[100, 102, 98, 95, 105, 110, 250, 300, 102, 99]`

```python
data = [100, 102, 98, 95, 105, 110, 250, 300, 102, 99]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_bound or x > upper_bound]

print("Outliers:", outliers)
```

✅ **Answer:**
Outliers = `[250, 300]`

---

### **Q5. Calculate skewness and kurtosis.**

Dataset: `[2, 3, 5, 6, 9, 10, 10, 11, 12, 14, 18, 20]`

```python
data = [2, 3, 5, 6, 9, 10, 10, 11, 12, 14, 18, 20]

skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)

print("Skewness:", skewness)
print("Kurtosis:", kurtosis)
```

✅ **Answer:**

* Skewness ≈ 0.30 (slightly right-skewed)
* Kurtosis ≈ -0.74 (excess kurtosis; below 0, so lighter tails than a normal distribution → platykurtic)

---

### **Q6. Calculate the coefficient of variation (CV).**

Dataset: `[40, 50, 60, 70, 80, 90, 100]`

```python
data = [40, 50, 60, 70, 80, 90, 100]

mean = np.mean(data)
std_dev = np.std(data, ddof=1)
cv = (std_dev / mean) * 100

print("CV (%):", cv)
```

✅ **Answer:**
CV ≈ 30.86%

---

### **Q7. Summarize a dataset with `describe()` in Pandas.**

Using the `sepal length (cm)` column of the Iris dataset.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame

print(df['sepal length (cm)'].describe())
```

✅ **Answer (sample output):**

```
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
```

---

### **Q8. Create a boxplot and interpret it.**

Dataset: `[10, 12, 15, 18, 19, 21, 25, 30, 40, 100]`

```python
import matplotlib.pyplot as plt

data = [10, 12, 15, 18, 19, 21, 25, 30, 40, 100]
pd.Series(data).plot(kind='box', vert=False)
plt.show()
```

✅ **Answer:**

* Median = 20
* Box (IQR) spans Q1 ≈ 15.75 to Q3 ≈ 28.75
* Outlier detected at **100**
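
The values read off the plot can be checked numerically. This sketch assumes the common boxplot convention (also Matplotlib's default): whiskers end at the most extreme data points still within 1.5 × IQR of the box, and anything beyond is drawn as an outlier.

```python
import statistics

data = [10, 12, 15, 18, 19, 21, 25, 30, 40, 100]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers stop at real data points inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```

Note that 40 sits inside the upper fence (48.25), so it is the end of the right whisker rather than an outlier.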

---

### **Q9. Normalize a dataset using Z-score (Standardization).**

Dataset: `[5, 7, 9, 10, 15, 20, 25]`

```python
data = np.array([5, 7, 9, 10, 15, 20, 25])

z_scores = (data - np.mean(data)) / np.std(data)  # np.std defaults to ddof=0 (population SD)
print("Z-scores:", z_scores)
```

✅ **Answer (approx):**
`[-1.18, -0.88, -0.59, -0.44, 0.29, 1.03, 1.77]`

---

### **Q10. Compare variability between two datasets using CV.**

* Dataset A: `[5, 10, 15, 20, 25]`
* Dataset B: `[100, 200, 300, 400, 500]`

```python
data_a = [5, 10, 15, 20, 25]
data_b = [100, 200, 300, 400, 500]

cv_a = (np.std(data_a, ddof=1) / np.mean(data_a)) * 100
cv_b = (np.std(data_b, ddof=1) / np.mean(data_b)) * 100

print("CV of Dataset A:", cv_a)
print("CV of Dataset B:", cv_b)
```

✅ **Answer:**

* CV of A ≈ 52.70%
* CV of B ≈ 52.70%
  👉 Both datasets have the **same relative variability**, even though their values differ in scale.

