# Repository Structure  
**Project: Lung Cancer Classification Pipeline (TCIA + ResNet + YOLOv8 + LLMs)**  

---

## 1. General Repository Structure

```plaintext
project-root/
│
├── data/
│   ├── raw/                # Raw data (do NOT commit to Git)
│   ├── processed/          # Preprocessed data (small metadata)
│   ├── external/           # Links, download scripts (TCIA, etc.)
│   └── README.md           # Explanation of data management
│
├── models/
│   ├── checkpoints/        # Trained weights (add to .gitignore)
│   ├── resnet/             # ResNet models
│   ├── yolo/               # YOLOv8 weights
│   └── llm/                # Integration scripts for Gemini/GPT
│
├── src/
│   ├── preprocessing/
│   │   ├── dicom_loader.py
│   │   ├── normalize.py
│   │   ├── segmentation.py
│   │   └── utils_preprocessing.py
│   │
│   ├── training/
│   │   ├── train_resnet.py
│   │   ├── train_yolo.py
│   │   └── optimization.py
│   │
│   ├── inference/
│   │   ├── inference_resnet.py
│   │   ├── inference_yolo.py
│   │   └── llm_inference.py
│   │
│   ├── visualization/
│   │   ├── plot_metrics.py
│   │   ├── gradcam.py
│   │   └── dashboards.py
│   │
│   └── utils/
│       ├── config.py
│       ├── metrics.py
│       └── helpers.py
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_training.ipynb
│   ├── 04_evaluation.ipynb
│   └── 05_llm_analysis.ipynb
│
├── scripts/
│   ├── download_tcia.py
│   ├── convert_formats.py
│   ├── run_pipeline.sh
│   └── evaluate.sh
│
├── tests/
│   ├── test_preprocessing.py
│   ├── test_models.py
│   └── test_utils.py
│
├── .gitignore
├── environment.yml
├── requirements.txt
├── README.md
└── LICENSE
```

---
### **1. TCIA data kept out of Git**
- Keep the DICOM images local only.  
- The dataset stays reproducible through the download scripts.  
- `.gitignore` prevents large files from being committed.

### **2. Modules separated by function**
- `preprocessing/` → normalization, lung segmentation, image preparation.  
- `modeling/` → independent models:
  - ResNet (classification),
  - YOLOv8 (detection),
  - LLMs (TNM protocols, clinical reports).  

### **3. YAML configurations**
- Experimental reproducibility.  
- Clear separation between code and parameters.

### **4. Notebooks organized by scientific workflow**
- Exploration → Preprocessing → Experiments → LLM reports.

### **5. `results/` directory**
- Curves, confusion matrices, TNM metrics.  

---

## .gitignore
- `data/raw/`
- `data/processed/*.npy`
- `models/checkpoints/`
- `*.ckpt`
- `*.pt`
- `*.pth`
- `*.h5`
- `__pycache__/`
- `.ipynb_checkpoints/`
- `.env`


---

# FUNCTION → SCRIPT ASSIGNMENT

Each script below lists **exactly which of your functions belong inside it**.

---

## **src/utils/extract_number.py**

### Contains:
- `extract_number()`

---

## **src/utils/subject_utils.py**

### Contains:
- `count_unique_subject_ids()`
- `patient_count_by_group()`

---

## **src/utils/file_utils.py**

### Contains:
- `load_dataset()`
- simple helper functions involving filesystem checks  
- functions using *os, glob, shutil* that are generic (not domain-specific)

---

## **src/utils/metrics_utils.py**

### Contains:
- **empty for now**; reserved for ML metrics, cross-model comparison, etc.

---

# PREPROCESSING MODULE

## **src/preprocessing/sampling.py**

### Contains:
- `relaxed_stratified_sample()`
- `add_more_patients()`
- `get_target_sample()`
- `adjust_sample_size()`
- `calculate_stage_distribution()`
- `compare_distributions()`
- `check_balance()`

These functions perform **metadata-level sampling and cohort balancing**.
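
As a rough illustration of what metadata-level balancing can look like, the sketch below computes per-stage fractions for a full cohort and a sample, then reports the largest gap between them. The helper names `stage_distribution` and `max_distribution_gap` and the toy stage labels are assumptions made for this example; they are not the project's `calculate_stage_distribution()` or `compare_distributions()`.

```python
from collections import Counter


def stage_distribution(stages):
    """Return the fraction of patients in each stage (hypothetical helper)."""
    counts = Counter(stages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stage: n / total for stage, n in counts.items()}


def max_distribution_gap(full, sample):
    """Largest absolute difference in stage fraction between two cohorts."""
    stages = set(full) | set(sample)
    return max(
        (abs(full.get(s, 0.0) - sample.get(s, 0.0)) for s in stages),
        default=0.0,
    )


# Toy metadata: one stage label per patient (illustrative values only).
full_cohort = ["I", "I", "II", "II", "III", "III", "IV", "IV"]
sampled = ["I", "II", "III", "IV"]

full_dist = stage_distribution(full_cohort)
sample_dist = stage_distribution(sampled)
gap = max_distribution_gap(full_dist, sample_dist)
print(full_dist, sample_dist, gap)
```

A small gap suggests the sample preserves the stage mix of the full cohort; what counts as "small enough" is a project decision.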

---

## **src/preprocessing/dicom_io.py**

### Contains:
- `read_dicom_image()`

---

## **src/preprocessing/uid_mapping.py**

### Contains:
- `getUID_path()`

---

## **src/preprocessing/xml_parsing.py**

### Contains:
- `extract_bounding_boxes()`
- XML → bounding-box extraction utilities
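
A minimal sketch of XML-to-bounding-box extraction is shown below. It assumes the annotations follow a Pascal VOC-style layout (`<object>` elements containing `<name>` and `<bndbox>`), which should be checked against the actual annotation files. The function name `parse_voc_boxes` and the sample XML are hypothetical and do not reproduce the project's `extract_bounding_boxes()`.

```python
import xml.etree.ElementTree as ET


def parse_voc_boxes(xml_text):
    """Parse Pascal VOC-style XML into (label, xmin, ymin, xmax, ymax) tuples.

    Assumption: each <object> has a <name> and a <bndbox> with four
    coordinate tags. Objects without a <bndbox> are skipped.
    """
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name", default="unknown")
        bnd = obj.find("bndbox")
        if bnd is None:
            continue
        coords = tuple(
            int(float(bnd.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


# Illustrative annotation, not taken from the dataset.
SAMPLE_XML = """
<annotation>
  <size><width>512</width><height>512</height></size>
  <object>
    <name>A</name>
    <bndbox><xmin>100</xmin><ymin>150</ymin><xmax>200</xmax><ymax>250</ymax></bndbox>
  </object>
</annotation>
"""

boxes = parse_voc_boxes(SAMPLE_XML)
print(boxes)
```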

---

## **src/preprocessing/dataset_building.py**

### Contains:
- `create_dataset()`
- `get_images_by_patient_id()`

Functions that create **Python datasets from DICOM + XML**.

---

## **src/preprocessing/yolo_conversion.py**

### Contains:
- `preprocess_images()`

This is your function converting **DICOM + XML → YOLO format**.
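
The core of the conversion is the coordinate change from pixel corners to the normalized `class x_center y_center width height` lines that YOLO label files use. The sketch below shows only that arithmetic, assuming the XML supplies `(xmin, ymin, xmax, ymax)` in pixels; `voc_to_yolo` and `yolo_label_line` are hypothetical names, and reading the DICOM pixel data is left out.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to normalized YOLO (cx, cy, w, h)."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h


def yolo_label_line(class_id, box, img_w, img_h):
    """Format one YOLO label line for a (xmin, ymin, xmax, ymax) box."""
    cx, cy, w, h = voc_to_yolo(*box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Illustrative box on a 512x512 image.
line = yolo_label_line(0, (128, 128, 384, 256), 512, 512)
print(line)  # 0 0.500000 0.375000 0.500000 0.250000
```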

---

## **src/preprocessing/image_preprocessing.py**

### Contains:
- resizing
- normalization logic
- future transforms

No functions live here yet, but **keep the file for future expansion**.

---

# VISUALIZATION MODULE

## **src/visualization/plot_images.py**

### Contains:
- simple image plotting helpers

---

## **src/visualization/plot_bboxes.py**

### Contains:
- `visualize_image_with_bboxes()`
- `visualize_image_with_bboxes_legend()`

---

## **src/visualization/patient_visualization.py**

### Contains:
- `visualize_image_by_uid()`

---

## **src/visualization/yolo_visualization.py**

### Contains:
- `visualize_yolo_images()`

---

## **src/visualization/plot_distributions.py**

### Contains:
- functions to plot class distribution, stage distribution, etc.  
(You have metadata functions but no plots yet; this file is reserved for them.)

---

# AUGMENTATION MODULE

## **src/augmentation/yolo_augment_train.py**

### Contains:
- `augment_yolo_images_train()`

---

## **src/augmentation/yolo_augment_val.py**

### Contains:
- `augment_yolo_images_val()`

---

## **src/augmentation/augmentation_utils.py**

### Contains:
- reusable augmentation pipelines  
- bbox transformations  
- helper functions
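
As one example of a bbox transformation that could live here, the sketch below mirrors a normalized YOLO box when the image is flipped horizontally. It assumes boxes are stored as `(cx, cy, w, h)` in the 0 to 1 range; `hflip_yolo_box` is a hypothetical helper, not an existing function in this project.

```python
def hflip_yolo_box(box):
    """Mirror a normalized YOLO box (cx, cy, w, h) across the vertical axis.

    Only the x-center changes; width, height, and y-center are preserved.
    """
    cx, cy, w, h = box
    return (1.0 - cx, cy, w, h)


# A box centered in the left quarter moves to the right quarter.
flipped = hflip_yolo_box((0.25, 0.375, 0.5, 0.25))
print(flipped)  # (0.75, 0.375, 0.5, 0.25)
```

Applying the same flip to the image and to every box keeps labels aligned with the augmented pixels.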

---

# DATA SPLITTING

## **src/splitting/dataset_splitting.py**

### Contains:
- `split_data()`

This is your **train/val/test split with patient grouping**.
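
Patient grouping means that all images from one patient land in the same split, so no patient leaks between train, validation, and test. The sketch below illustrates the idea with the standard library only; `split_by_patient`, the 70/15/15 ratios, the seed, and the toy file names are assumptions for this example and do not describe how `split_data()` is implemented.

```python
import random
from collections import defaultdict


def split_by_patient(image_patient_pairs, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split (image, patient_id) pairs so each patient appears in one split only."""
    by_patient = defaultdict(list)
    for image, patient in image_patient_pairs:
        by_patient[patient].append(image)

    # Shuffle patients, not images, so a patient is never divided across splits.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * ratios[0])
    n_val = int(len(patients) * ratios[1])
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [(img, p) for p in ids for img in by_patient[p]]
        for name, ids in groups.items()
    }


# Toy data: 10 patients with 3 images each (illustrative names only).
pairs = [(f"p{p:02d}_img{i}.png", f"p{p:02d}") for p in range(10) for i in range(3)]
splits = split_by_patient(pairs)
for name, items in splits.items():
    print(name, len(items), "images,", len({p for _, p in items}), "patients")
```

This sketch does not stratify by class or stage; combining grouping with stratification would need extra logic.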

---

# STATISTICS MODULE

## **src/statistics/stage_distributions.py**

### Contains:
- any future plotting of stage distribution

---

## **src/statistics/cohort_balance.py**

### Contains:
- already covered by `preprocessing/sampling.py`; mirror here if needed

---

## **src/statistics/label_statistics.py**

### Contains:
- `count_labels_by_class_and_source()`
- `count_labels_by_class()`
- `count_images_labels_patients_by_class()`

---

# SUMMARY TABLE

| Function | Assigned Script |
|---------|-----------------|
| extract_number | utils/extract_number.py |
| count_unique_subject_ids | utils/subject_utils.py |
| relaxed_stratified_sample | preprocessing/sampling.py |
| add_more_patients | preprocessing/sampling.py |
| get_target_sample | preprocessing/sampling.py |
| adjust_sample_size | preprocessing/sampling.py |
| calculate_stage_distribution | preprocessing/sampling.py |
| compare_distributions | preprocessing/sampling.py |
| check_balance | preprocessing/sampling.py |
| read_dicom_image | preprocessing/dicom_io.py |
| getUID_path | preprocessing/uid_mapping.py |
| extract_bounding_boxes | preprocessing/xml_parsing.py |
| create_dataset | preprocessing/dataset_building.py |
| get_images_by_patient_id | preprocessing/dataset_building.py |
| visualize_image_with_bboxes | visualization/plot_bboxes.py |
| visualize_image_by_uid | visualization/patient_visualization.py |
| visualize_image_with_bboxes_legend | visualization/plot_bboxes.py |
| augment_yolo_images_train | augmentation/yolo_augment_train.py |
| augment_yolo_images_val | augmentation/yolo_augment_val.py |
| count_labels_by_class | statistics/label_statistics.py |
| count_labels_by_class_and_source | statistics/label_statistics.py |
| count_images_labels_patients_by_class | statistics/label_statistics.py |
| split_data | splitting/dataset_splitting.py |
| preprocess_images | preprocessing/yolo_conversion.py |
| patient_count_by_group | utils/subject_utils.py |
| load_dataset | utils/file_utils.py |
| visualize_yolo_images | visualization/yolo_visualization.py |

---

# 🎯 Final Result

You now have:
- a **professional**, scalable codebase layout
- every function assigned to a single, well-defined module
- a clear separation of preprocessing, augmentation, visualization, splitting, and utils
- a structure ready for **GitHub**, **MLflow**, **DVC**, or **production**

