# Notebook: 01_eda_preprocessing.ipynb

## 📖 Purpose
This notebook performs **exploratory data analysis (EDA)** and **preprocessing setup** for the *PneumoDetect* project, a deep learning workflow for pneumonia detection on chest X-rays.

It focuses on understanding the dataset's structure, class balance, image quality, and metadata before any model training begins.

---

## Objectives
1. **Load and inspect** the dataset (sample or full RSNA Pneumonia Detection subset).  
2. **Visualize class distribution** (pneumonia vs. normal).  
3. **Display random montages** of X-rays for sanity checks.  
4. **Compute descriptive statistics** for image dimensions and intensity values.  
5. **Document observations** that guide later preprocessing and model design.  

---

## Key Steps
| Step | Description |
|------|--------------|
| **1. Load labels** | Reads `train_labels_subset.csv` into a pandas DataFrame. |
| **2. Data overview** | Displays data types, missing values, and sample rows. |
| **3. Label distribution** | Uses `seaborn.countplot()` to visualise pneumonia vs normal counts. |
| **4. Random montage** | Loads and plots a 3×3 grid of random X-rays using `pydicom` and `matplotlib`. |
| **5. Image statistics** | Samples image shapes and pixel intensity ranges. |
| **6. Documentation** | Summarises findings in Markdown: class imbalance, size variance, grayscale range. |
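
Steps 1 to 3 could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `load_labels` and `summarise_labels` are hypothetical, and the label column is assumed to be called `Target` (as in the original RSNA labels file), which may not hold for the subset CSV.

```python
from pathlib import Path

import pandas as pd

# Assumed location, based on the "Expected Input" layout in this README.
LABELS_CSV = Path("data/rsna_subset/train_labels_subset.csv")

# Assumption: the label column is named "Target" (0 = normal, 1 = pneumonia).
# Adjust this if the subset CSV uses a different column name.
LABEL_COL = "Target"


def load_labels(csv_source=LABELS_CSV):
    """Step 1: read the label file into a pandas DataFrame."""
    return pd.read_csv(csv_source)


def summarise_labels(df, label_col=LABEL_COL):
    """Steps 2-3: collect dtypes, missing values, and class counts/shares."""
    counts = df[label_col].value_counts().sort_index()
    shares = df[label_col].value_counts(normalize=True).sort_index().round(3)
    return {
        "n_rows": len(df),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "class_counts": counts.to_dict(),
        "class_share": shares.to_dict(),
    }
```

In a notebook cell this might be used as `df = load_labels()` followed by `summarise_labels(df)`; the count plot itself would then be a call such as `seaborn.countplot(data=df, x="Target")`.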

---

## Observations (Typical)
- **Class ratio:** 0 (normal) ≈ 75 %, 1 (pneumonia) ≈ 25 % (imbalanced; requires a weighted loss).  
- **Image size:** Ranges from 1024×1024 to 2048×2048 pixels.  
- **Pixel intensity:** 12-bit grayscale (0–4095). Normalize to `[0, 1]`.  
- **Visual quality:** Pneumonia cases show local opacities, whereas normal lungs appear uniformly translucent.
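
To make the normalization and class-weighting remarks concrete, here is a small sketch. It assumes 12-bit data whose largest possible value is 4095, and it uses the common inverse-frequency ("balanced") heuristic for class weights. Both function names are hypothetical, and the weighting scheme is one reasonable option rather than the one the project necessarily adopts.

```python
import numpy as np

# Assumption: 12-bit data, so the largest possible pixel value is 2**12 - 1.
MAX_12BIT = 2**12 - 1  # 4095


def normalise_12bit(pixels):
    """Scale raw 12-bit intensities to float32 values in [0, 1]."""
    arr = np.asarray(pixels, dtype=np.float32)
    # Clip in case a file contains values outside the assumed 12-bit range.
    return np.clip(arr / MAX_12BIT, 0.0, 1.0)


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}
```

With a 75/25 split, this heuristic gives a weight of about 0.67 for class 0 and 2.0 for class 1, so errors on the minority class count roughly three times as much.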

---

## Dependencies
Ensure these libraries are available in your Conda environment:
```bash
pandas
matplotlib
seaborn
pydicom
opencv-python
numpy
````

If any are missing, install them:

```bash
pip install pandas matplotlib seaborn pydicom opencv-python numpy
```
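
A quick way to confirm the environment is complete before running the notebook is to check that each package can be found. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical. Note that the `opencv-python` package is imported as `cv2`.

```python
from importlib.util import find_spec

# Import names for the packages listed above. The PyPI package
# `opencv-python` is imported under the module name `cv2`.
REQUIRED_MODULES = ["pandas", "matplotlib", "seaborn", "pydicom", "cv2", "numpy"]


def missing_modules(module_names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]
```

Calling `missing_modules()` in the first notebook cell returns an empty list when everything is installed.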

---

## Expected Input

```
data/
├── rsna_subset/
│   ├── train_images/          # 2,000 sampled .dcm files
│   └── train_labels_subset.csv
```
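
The layout above can be checked up front so that a missing folder or file fails early with a clear message. This is a minimal sketch using `pathlib`; the function name `check_input_layout` is hypothetical and the paths simply mirror the tree shown above.

```python
from pathlib import Path


def check_input_layout(data_root="data"):
    """Return a list of problems found in the expected rsna_subset layout."""
    subset = Path(data_root) / "rsna_subset"
    images_dir = subset / "train_images"
    labels_csv = subset / "train_labels_subset.csv"

    problems = []
    if not images_dir.is_dir():
        problems.append(f"missing directory: {images_dir}")
    elif not any(images_dir.glob("*.dcm")):
        problems.append(f"no .dcm files in: {images_dir}")
    if not labels_csv.is_file():
        problems.append(f"missing file: {labels_csv}")
    return problems
```

An empty list means the expected inputs are in place; otherwise each entry names what is missing.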

---

## Output Artifacts

| Output                           | Description                                             |
| -------------------------------- | ------------------------------------------------------- |
| `figures/label_distribution.png` | Class distribution bar chart                            |
| `figures/random_montage.png`     | Montage of random X-rays                                |
| Markdown cell summary            | Recorded class ratio, image size stats, intensity range |
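
One possible way to produce the two figure files is sketched below. The function names, figure sizes, and the `Target` column name are assumptions made for illustration. The montage helper takes 2-D pixel arrays, which in the notebook would typically come from `pydicom.dcmread(path).pixel_array`.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def save_label_distribution(df, out_dir="figures", label_col="Target"):
    """Save a bar chart of class counts as label_distribution.png."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(4, 3))
    sns.countplot(data=df, x=label_col, ax=ax)
    ax.set_title("Label distribution")
    out_path = out_dir / "label_distribution.png"
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path


def save_montage(images, out_dir="figures", grid=3):
    """Save a grid x grid montage of 2-D arrays as random_montage.png."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # squeeze=False keeps `axes` a 2-D array even when grid == 1.
    fig, axes = plt.subplots(grid, grid, figsize=(2 * grid, 2 * grid), squeeze=False)
    cells = axes.ravel()
    for ax in cells:
        ax.axis("off")
    # Cells without a matching image are simply left blank.
    for ax, img in zip(cells, images):
        ax.imshow(np.asarray(img), cmap="gray")
    out_path = out_dir / "random_montage.png"
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

Both helpers create the output folder if it does not exist and close the figure after saving, which keeps memory use flat when cells are re-run.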

---

## Notes

* Use this notebook **before model training** to ensure preprocessing is grounded in actual data properties.
* Save all visualisations in a `figures/` folder for later inclusion in reports.
* Findings here inform choices for resizing, normalization, and class weighting in subsequent notebooks.

---

**Author:** Adrian Adewunmi  
**Project:** AI-Assisted Pneumonia Detection (`PneumoDetect`)  
**Date:** Week 1, Day 2 (W1-D2)
