# Repository Structure  
**Project: Lung Cancer Classification Pipeline (TCIA + ResNet + YOLOv8 + LLMs)**  

---

## 1. Overall Repository Structure (Updated and Corrected)

```plaintext
project-root/
│
├── data/
│   ├── raw/                    # Raw data (do NOT commit to Git)
│   ├── processed/              # Pre-processed data
│   ├── external/               # Download scripts (TCIA)
│   └── README.md               # Data management instructions
│
├── models/
│   ├── checkpoints/            # Trained weights (git-ignored)
│   ├── resnet/
│   ├── yolo/
│   └── llm/
│
├── src/
│   ├── utils/
│   │   ├── extract_number.py
│   │   ├── subject_utils.py
│   │   ├── file_utils.py       # load_dataset, loadFileInformation
│   │   └── metrics_utils.py
│   │
│   ├── preprocessing/
│   │   ├── sampling.py             # Stratified sampling + balancing
│   │   ├── dicom_io.py             # read_dicom_image
│   │   ├── uid_mapping.py          # getUID_path
│   │   ├── xml_parsing.py          # extract_bounding_boxes
│   │   ├── dataset_building.py     # create_dataset, get_images_by_patient_id
│   │   ├── yolo_conversion.py      # preprocess_images (DICOM → YOLO)
│   │   └── image_preprocessing.py  # Reserved for normalization/transforms
│   │
│   ├── visualization/
│   │   ├── plot_images.py
│   │   ├── plot_bboxes.py
│   │   ├── patient_visualization.py
│   │   ├── yolo_visualization.py
│   │   └── plot_distributions.py
│   │
│   ├── augmentation/
│   │   ├── augmentation_yolo.py        # train + val augmentation
│   │   └── augmentation_utils.py
│   │
│   ├── splitting/
│   │   └── dataset_splitting.py
│   │
│   └── statistics/
│       ├── stage_distributions.py
│       ├── cohort_balance.py
│       └── patient_label_statistics.py
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_training.ipynb
│   ├── 04_evaluation.ipynb
│   └── 05_llm_analysis.ipynb
│
├── scripts/
│   ├── download_tcia.py
│   ├── convert_formats.py
│   ├── run_pipeline.sh
│   └── evaluate.sh
│
├── tests/
│   ├── test_preprocessing.py
│   ├── test_models.py
│   └── test_utils.py
│
├── .gitignore
├── environment.yml
├── requirements.txt
├── README.md
└── LICENSE
```

---
### **1. TCIA data kept out of Git**
- Keep the DICOM images local only.  
- The dataset stays reproducible through the download scripts.  
- `.gitignore` prevents large files from being committed.

### **2. Modules separated by function**
- `preprocessing/` → normalization, lung segmentation, image preparation.  
- `modeling/` → independent models:
  - ResNet (classification),
  - YOLOv8 (detection),
  - LLMs (TNM protocols, clinical reports).  

### **3. YAML configuration**
- Experimental reproducibility.  
- Clear separation between code and parameters.

### **4. Notebooks organized by scientific workflow**
- Exploration → Preprocessing → Experiments → LLM reports.

### **5. `results/` directory**
- Curves, confusion matrices, TNM metrics.  

---

## .gitignore
- `data/raw/`
- `data/processed/*.npy`
- `models/checkpoints/`
- `*.ckpt`
- `*.pt`
- `*.pth`
- `*.h5`
- `__pycache__/`
- `.ipynb_checkpoints/`
- `.env`


---

# FUNCTION → SCRIPT ASSIGNMENT

Each script below lists **exactly which of your functions belong in it**.

---

## **src/utils/extract_number.py**

### Contains:
- `extract_number()`
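
The document does not show what `extract_number()` does, so the following is only a minimal sketch under one plausible assumption: that it pulls the first integer out of a filename so slices can be sorted numerically. The signature and the `None` fallback are hypothetical.

```python
import re
from typing import Optional


def extract_number(name: str) -> Optional[int]:
    """Return the first run of digits in `name` as an int, or None if there is none.

    Assumed behaviour; the real helper may differ.
    """
    match = re.search(r"\d+", name)
    return int(match.group()) if match else None


# Sort slice files numerically rather than lexicographically.
files = ["slice_10.dcm", "slice_2.dcm", "slice_1.dcm"]
ordered = sorted(files, key=lambda f: extract_number(f) or 0)
```

With a plain `sorted(files)`, `slice_10.dcm` would come before `slice_2.dcm`; keying on the extracted integer avoids that.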

---

## **src/utils/subject_utils.py**

### Contains:
- `count_unique_subject_ids()`
- `patient_count_by_group()`

---

## **src/utils/file_utils.py**

### Contains:
- `load_dataset()`
- simple helper functions involving filesystem checks  
- functions using *os, glob, shutil* that are generic (not domain-specific)

---

## **src/utils/metrics_utils.py**

### Contains:
- **empty for now**; reserved for ML metrics, cross-model comparison, etc.

---

# PREPROCESSING MODULE

## **src/preprocessing/sampling.py**

### Contains:
- `relaxed_stratified_sample()`
- `add_more_patients()`
- `get_target_sample()`
- `adjust_sample_size()`
- `calculate_stage_distribution()`
- `compare_distributions()`
- `check_balance()`

These functions perform **metadata-level sampling and cohort balancing**.
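
To illustrate what metadata-level stratified sampling can look like, here is a small standard-library sketch. It is not the project's `relaxed_stratified_sample()`; the function names, the record layout (a list of dicts with a `stage` key) and the "at least one patient per stratum" rule are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` (0 < fraction <= 1) of the records from each stratum.

    Every stratum contributes at least one record so rare stages are not lost.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for _, group in sorted(strata.items()):
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample


def stage_distribution(records, key="stage"):
    """Return the share of records per stage, as fractions that sum to 1."""
    counts = Counter(rec[key] for rec in records)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()}
```

Comparing `stage_distribution()` of the full cohort with that of the sample is one simple way to check that sampling preserved the stage balance.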

---

## **src/preprocessing/dicom_io.py**

### Contains:
- `read_dicom_image()`

---

## **src/preprocessing/uid_mapping.py**

### Contains:
- `getUID_path()`

---

## **src/preprocessing/xml_parsing.py**

### Contains:
- `extract_bounding_boxes()`
- XML → bounding box extraction utilities
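
The annotation schema is not given here. Assuming Pascal VOC-style XML (`<object>` elements with a `<name>` and a `<bndbox>` holding `xmin`/`ymin`/`xmax`/`ymax`), bounding-box extraction could be sketched as follows. `parse_voc_boxes` is a hypothetical name, not the project's `extract_bounding_boxes()`.

```python
import xml.etree.ElementTree as ET


def parse_voc_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples.

    Assumes VOC-style tags; objects without a <bndbox> are skipped.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), *coords))
    return boxes


SAMPLE_XML = """
<annotation>
  <object>
    <name>A</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>200</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""
boxes = parse_voc_boxes(SAMPLE_XML)
```

For files on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(xml_text)`.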

---

## **src/preprocessing/dataset_building.py**

### Contains:
- `create_dataset()`
- `get_images_by_patient_id()`

Functions that create **Python datasets from DICOM + XML**.

---

## **src/preprocessing/yolo_conversion.py**

### Contains:
- `preprocess_images()`

This is your function converting **DICOM + XML → YOLO format**.
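
The core of any conversion to YOLO format is turning pixel-space corner coordinates into a normalized `class cx cy w h` line. The sketch below shows only that arithmetic; the helper names are hypothetical, and DICOM reading and image export are left out.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corners to YOLO's normalized (cx, cy, w, h)."""
    cx = (xmin + xmax) / 2 / img_w
    cy = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h


def yolo_label_line(class_id, box, img_w, img_h):
    """Format one annotation as a line of a YOLO .txt label file."""
    cx, cy, w, h = voc_to_yolo(*box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 256x128 px box in a 512x512 image.
line = yolo_label_line(0, (128, 128, 384, 256), 512, 512)
```

Each image gets one `.txt` file with one such line per bounding box.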

---

## **src/preprocessing/image_preprocessing.py**

### Contains:
- resizing
- normalization logic
- future transforms

No functions live here yet, but **keep the file as a placeholder for future expansion**.

---

# VISUALIZATION MODULE

## **src/visualization/plot_images.py**

### Contains:
- simple image plotting helpers

---

## **src/visualization/plot_bboxes.py**

### Contains:
- `visualize_image_with_bboxes()`
- `visualize_image_with_bboxes_legend()`

---

## **src/visualization/patient_visualization.py**

### Contains:
- `visualize_image_by_uid()`

---

## **src/visualization/yolo_visualization.py**

### Contains:
- `visualize_yolo_images()`

---

## **src/visualization/plot_distributions.py**

### Contains:
- functions to plot class distribution, stage distribution, etc.

(You have metadata functions but no plots yet, so this file is reserved for them.)

---

# AUGMENTATION MODULE

## **src/augmentation/augmentation_yolo.py**

### Contains:
- `augment_yolo_images_train()`
- `augment_yolo_images_val()`

---


## **src/augmentation/augmentation_utils.py**

### Contains:
- reusable augmentation pipelines  
- bbox transformations  
- helper functions
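
As an example of a bbox transformation that could live here, consider a horizontal flip: the image columns are reversed and each normalized YOLO box has its x-center mirrored. This is a dependency-free sketch with hypothetical names; the project's augmentation functions may rely on a dedicated library instead.

```python
def hflip_image(image):
    """Flip a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def hflip_yolo_box(box):
    """Mirror a normalized YOLO box (class_id, cx, cy, w, h) horizontally.

    Only the x-center changes; width, height and y-center are unaffected.
    """
    class_id, cx, cy, w, h = box
    return (class_id, 1.0 - cx, cy, w, h)


image = [[1, 2, 3],
         [4, 5, 6]]
flipped_image = hflip_image(image)
flipped_box = hflip_yolo_box((0, 0.25, 0.4, 0.1, 0.2))
```

Keeping the image and label transforms side by side makes it harder for the two to drift out of sync.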

---

# DATA SPLITTING

## **src/splitting/dataset_splitting.py**

### Contains:
- `split_data()`

This is your **train/val/test split with patient grouping**.
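
Patient grouping means all images of one patient land in the same split, which avoids leakage between train and test. The following is a minimal sketch of that idea, not the actual `split_data()`; the function name, the `(patient_id, item)` input format and the default ratios are assumptions.

```python
import random
from collections import defaultdict


def split_by_patient(samples, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split (patient_id, item) pairs into train/val/test without splitting a patient.

    The ratios apply to the number of patients, not the number of images.
    """
    by_patient = defaultdict(list)
    for patient_id, item in samples:
        by_patient[patient_id].append(item)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * ratios[0])
    n_val = int(len(patients) * ratios[1])
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [(pid, item) for pid in pids for item in by_patient[pid]]
        for name, pids in groups.items()
    }
```

Because the split is made on patients, the image-level proportions can deviate from the nominal ratios when patients have very different numbers of slices.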

---

# STATISTICS MODULE

## **src/statistics/stage_distributions.py**

### Contains:
- any future plotting of stage distribution

---

## **src/statistics/cohort_balance.py**

### Contains:
- cohort-balance checks already live in `preprocessing/sampling.py`; mirror them here if needed

---

## **src/statistics/patient_label_statistics.py**

### Contains:
- `count_labels_by_class_and_source()`
- `count_labels_by_class()`
- `count_images_labels_patients_by_class()`
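
The signatures of these functions are not shown. As a rough illustration of per-class label counting over YOLO label files, here is a sketch that assumes the label text is already loaded into a `{filename: text}` mapping; the function name and that input format are hypothetical.

```python
from collections import Counter


def count_yolo_labels_by_class(label_files):
    """Count bounding boxes per class id across YOLO label files.

    `label_files` maps a filename to the text of its .txt label file,
    where each non-empty line starts with the integer class id.
    """
    counts = Counter()
    for text in label_files.values():
        for line in text.splitlines():
            parts = line.split()
            if parts:
                counts[int(parts[0])] += 1
    return counts


labels = {
    "A0001_01.txt": "0 0.5 0.5 0.2 0.2\n0 0.3 0.3 0.1 0.1\n",
    "B0002_01.txt": "1 0.4 0.6 0.2 0.3\n",
    "G0003_01.txt": "",
}
class_counts = count_yolo_labels_by_class(labels)
```

Grouping the same counts by source or by patient would only require keying the counter on a tuple such as `(source, class_id)`.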

---

# SUMMARY TABLE

| Function                              | Assigned Script                        |
| ------------------------------------- | -------------------------------------- |
| extract_number                        | utils/extract_number.py                |
| count_unique_subject_ids              | utils/subject_utils.py                 |
| patient_count_by_group                | utils/subject_utils.py                 |
| load_dataset                          | utils/file_utils.py                    |
| relaxed_stratified_sample             | preprocessing/sampling.py              |
| add_more_patients                     | preprocessing/sampling.py              |
| get_target_sample                     | preprocessing/sampling.py              |
| adjust_sample_size                    | preprocessing/sampling.py              |
| calculate_stage_distribution          | preprocessing/sampling.py              |
| compare_distributions                 | preprocessing/sampling.py              |
| check_balance                         | preprocessing/sampling.py              |
| read_dicom_image                      | preprocessing/dicom_io.py              |
| getUID_path                           | preprocessing/uid_mapping.py           |
| extract_bounding_boxes                | preprocessing/xml_parsing.py           |
| create_dataset                        | preprocessing/dataset_building.py      |
| get_images_by_patient_id              | preprocessing/dataset_building.py      |
| preprocess_images                     | preprocessing/yolo_conversion.py       |
| visualize_image_with_bboxes           | visualization/plot_bboxes.py           |
| visualize_image_with_bboxes_legend    | visualization/plot_bboxes.py           |
| visualize_image_by_uid                | visualization/patient_visualization.py |
| visualize_yolo_images                 | visualization/yolo_visualization.py    |
| augment_yolo_images_train             | augmentation/augmentation_yolo.py      |
| augment_yolo_images_val               | augmentation/augmentation_yolo.py      |
| count_labels_by_class                 | statistics/patient_label_statistics.py |
| count_labels_by_class_and_source      | statistics/patient_label_statistics.py |
| count_images_labels_patients_by_class | statistics/patient_label_statistics.py |
| split_data                            | splitting/dataset_splitting.py         |



