

# **Checking Normality of Data**

## **Shapiro-Wilk Test**
- **Purpose:** Test if a dataset comes from a normal distribution.
- **Hypotheses:**
  - H0: Data is normally distributed
  - H1: Data is not normally distributed
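- **Typical usage:** Works well for **small to medium datasets** (roughly n < 2000). SciPy's documentation for `shapiro` notes that for n > 5000 the W statistic is accurate but the p-value may not be.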

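**Important:**  
The usual significance level is **0.05**, not 0.5. If **p-value > 0.05**, you fail to reject H0: the data is consistent with a normal distribution, which is not proof that it is normal. If **p-value ≤ 0.05**, reject H0.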
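A minimal sketch of the decision rule above, using `scipy.stats.shapiro`. The sample `x` is synthetic and the variable names are chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # fixed seed for reproducibility
x = rng.normal(loc=50, scale=5, size=200)    # synthetic sample, assumed for illustration

w_stat, p_value = stats.shapiro(x)           # W statistic and p-value

alpha = 0.05                                 # significance level (0.05, not 0.5)
if p_value > alpha:
    verdict = "fail to reject H0 (consistent with normality)"
else:
    verdict = "reject H0 (evidence against normality)"

print(f"W = {w_stat:.4f}, p = {p_value:.4f}: {verdict}")
```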

---

## **Why Shapiro-Wilk Fails for Large Datasets**
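- The test's power grows with sample size, so larger samples detect smaller deviations.
- **Large datasets (n > 2000):** Even tiny, practically irrelevant deviations from normality can produce a **very small p-value**, so the test rejects H0 even when the data looks normal.
- Therefore, a small p-value at large n shows that a deviation is detectable, not that it is large enough to matter.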
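The effect can be illustrated with a sketch like the one below. It draws two samples from the same mildly non-normal population (a Student's t distribution with 10 degrees of freedom, chosen here only as an example) and runs Shapiro-Wilk on each. The small sample will often pass, while the large one is typically rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Student's t with 10 degrees of freedom: bell-shaped,
# with slightly heavier tails than a normal distribution.
population = stats.t(df=10)

p_values = {}
for n in (50, 4000):
    sample = population.rvs(size=n, random_state=rng)
    _, p = stats.shapiro(sample)
    p_values[n] = p
    print(f"n = {n:>4}: Shapiro-Wilk p-value = {p:.4g}")
```

Both samples deviate from normality in the same way; only the sample size changes how strongly the test reacts.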

---

## **Alternatives for Large Datasets**
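### **A. D’Agostino’s K-squared Test**
- Combines sample skewness and kurtosis into a single statistic; a common choice for large sample sizes.
```python
from scipy import stats
# x: 1-D array-like of observations (assumed to be defined earlier)
stat, p = stats.normaltest(x)  # stat is K^2; p < 0.05 suggests non-normality
```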
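To see what `normaltest` measures, the sketch below rebuilds the K-squared statistic from SciPy's separate skewness and kurtosis tests. The sample is synthetic and used only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=3000)            # synthetic sample, assumed for illustration

z_skew, _ = stats.skewtest(x)        # z-score for skewness
z_kurt, _ = stats.kurtosistest(x)    # z-score for kurtosis
k2, p = stats.normaltest(x)          # omnibus statistic and its p-value

# K^2 is the sum of the two squared z-scores, compared against
# a chi-squared distribution with 2 degrees of freedom.
k2_manual = z_skew**2 + z_kurt**2
p_manual = stats.chi2.sf(k2_manual, df=2)

print(f"K^2 = {k2:.4f}, rebuilt = {k2_manual:.4f}")
print(f"p   = {p:.4f}, rebuilt = {p_manual:.4f}")
```

Because the statistic depends only on skewness and kurtosis, it targets asymmetry and tail weight rather than every possible departure from normality.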

### **B. Anderson-Darling Test**

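* Compares the test statistic to critical values at fixed significance levels instead of returning a single p-value. If the statistic exceeds the critical value for a level, H0 is rejected at that level.

```python
# Tests against the normal distribution by default (dist="norm")
result = stats.anderson(x)
```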
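A sketch of how the result can be read, assuming the classic result fields `statistic`, `critical_values`, and `significance_level` (levels are given in percent). The sample is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=1000)            # synthetic sample, assumed for illustration

result = stats.anderson(x, dist="norm")
print(f"A^2 statistic = {result.statistic:.4f}")

# Compare the statistic with the critical value at each significance level.
for level, crit in zip(result.significance_level, result.critical_values):
    decision = "reject H0" if result.statistic > crit else "fail to reject H0"
    print(f"  {level:>4}% level: critical value = {crit:.3f} -> {decision}")
```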

### **C. Visual Checks**

* **Histogram with KDE:** symmetry and shape
* **Q-Q Plot:** points near diagonal indicate normality

```python
import scipy.stats as stats
import matplotlib.pyplot as plt
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```
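The histogram with KDE can be sketched with Matplotlib and `scipy.stats.gaussian_kde`. The helper `plot_hist_kde` is hypothetical, written only for this example; it also overlays a normal curve fitted from the sample mean and standard deviation so the shapes can be compared.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_hist_kde(x, ax=None, bins=30):
    """Histogram of x with a KDE curve and a fitted normal curve (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if ax is None:
        _, ax = plt.subplots()
    grid = np.linspace(x.min(), x.max(), 200)
    # density=True puts the histogram on the same scale as the curves
    ax.hist(x, bins=bins, density=True, alpha=0.4, label="histogram")
    ax.plot(grid, stats.gaussian_kde(x)(grid), label="KDE")
    ax.plot(grid, stats.norm.pdf(grid, loc=x.mean(), scale=x.std(ddof=1)),
            linestyle="--", label="fitted normal")
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    ax.legend()
    return ax

rng = np.random.default_rng(3)
ax = plot_hist_kde(rng.normal(size=500))   # synthetic sample
```

In a script, call `plt.show()` afterwards. A KDE that tracks the dashed normal curve closely supports what the Q-Q plot shows.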

---

## **Practical Advice**

* **Small/medium datasets:** The Shapiro-Wilk test is a reliable choice.
* **Large datasets:** Combine **visual inspection + D’Agostino test**.
* Never rely on a statistical test alone; always **look at the data**.
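
The advice above could be wrapped in a small helper like the sketch below. `check_normality` and its `large_n` parameter are hypothetical names; the 2000 cutoff simply follows the rule of thumb used in these notes. The result is meant to be read next to the plots, not instead of them.

```python
import numpy as np
from scipy import stats

def check_normality(x, alpha=0.05, large_n=2000):
    """Pick a normality test by sample size (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                    # drop missing values
    if x.size < large_n:
        name = "shapiro"
        stat, p = stats.shapiro(x)         # small/medium samples
    else:
        name = "dagostino"
        stat, p = stats.normaltest(x)      # large samples
    return {
        "test": name,
        "n": int(x.size),
        "statistic": float(stat),
        "p_value": float(p),
        "reject_h0": bool(p < alpha),
    }

rng = np.random.default_rng(4)
small = check_normality(rng.normal(size=100))         # synthetic normal sample
large = check_normality(rng.exponential(size=5000))   # synthetic skewed sample
print(small)
print(large)
```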

