<h1 align='center'>Combining data sources into a unified dataset</h1>

This chapter covers
- [ ] loading and processing raw data files
- [ ] implementing a Python class to represent our data
- [ ] converting our data into a format usable by PyTorch
- [ ] visualizing the training and validation data 

<div align='center' ><img src='assets/dataloading.png'/></div>

Our CT data comes in two files:

- a `.mhd` file containing metadata header information, and
- a `.raw` file containing the raw bytes that make up the 3D array.
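
To make the split between header and data concrete, here is a minimal sketch of reading a MetaImage-style header, which is plain text made of `Key = Value` lines. The header text below is a made-up example, not taken from the LUNA16 files, and `parse_mhd_header` is a hypothetical helper; in practice a library such as SimpleITK would typically read both files for you.

```python
import math

# Made-up header text in the MetaImage "Key = Value" style (illustrative only).
EXAMPLE_HEADER = """\
ObjectType = Image
NDims = 3
DimSize = 512 512 133
ElementType = MET_SHORT
ElementSpacing = 0.7 0.7 2.5
ElementDataFile = 1.2.3.raw
"""

# Bytes per voxel for the element types this sketch knows about.
BYTES_PER_ELEMENT = {"MET_UCHAR": 1, "MET_SHORT": 2, "MET_FLOAT": 4}


def parse_mhd_header(text):
    """Parse 'Key = Value' lines into a dict of strings (hypothetical helper)."""
    header = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        header[key.strip()] = value.strip()
    return header


header = parse_mhd_header(EXAMPLE_HEADER)
dim_size = tuple(int(n) for n in header["DimSize"].split())
expected_bytes = math.prod(dim_size) * BYTES_PER_ELEMENT[header["ElementType"]]

print(dim_size)                   # voxel counts listed in the header
print(header["ElementDataFile"])  # the .raw file the header points to
print(expected_bytes)             # size the .raw file should have
```

The header tells us how to interpret the `.raw` bytes: without the dimensions and element type, the raw file is just an undifferentiated byte stream.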
    
Each file's name starts with a unique identifier for the CT scan in question, called the series UID (the name comes from the Digital Imaging and Communications in Medicine, or **DICOM**, nomenclature).
For example, for series UID 1.2.3, there would be two files: `1.2.3.mhd` and `1.2.3.raw`.
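
Because both files share the series UID as their stem, they can be paired by file name alone. The following is a small sketch of that idea; `pair_ct_files` is a hypothetical helper, and the empty files it creates in a temporary directory only stand in for real scans.

```python
import tempfile
from pathlib import Path


def pair_ct_files(directory):
    """Map each series UID to its (.mhd, .raw) paths (hypothetical helper).

    Only UIDs that have both files are returned.
    """
    directory = Path(directory)
    pairs = {}
    for mhd_path in sorted(directory.glob("*.mhd")):
        series_uid = mhd_path.stem            # "1.2.3.mhd" -> "1.2.3"
        raw_path = mhd_path.with_suffix(".raw")
        if raw_path.exists():
            pairs[series_uid] = (mhd_path, raw_path)
    return pairs


# Demo with empty placeholder files; "4.5.6" has no .raw file on purpose.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.2.3.mhd", "1.2.3.raw", "4.5.6.mhd"):
        (Path(tmp) / name).touch()
    pairs = pair_ct_files(tmp)
    paired_uids = sorted(pairs)
    paired_names = [path.name for path in pairs["1.2.3"]]

print(paired_uids)
print(paired_names)
```

Skipping UIDs with a missing `.raw` file is a design choice of this sketch; raising an error instead may be preferable if an incomplete download should stop the pipeline.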

```py
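import torch


class ct:
    """Outline of the dataset interface this chapter builds toward."""

    def __init__(self):
        # Data sources to combine:
        #   annotations.csv
        #   .raw files
        #   .mhd files
        pass

    def __getitem__(self, index):
        # Fields of one sample, shown here as types rather than values.
        return {
            '3D Array': torch.Tensor,
            'isNodule': bool,
            'series_uid': str,
            'candidate_location': tuple,
        }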

```
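
The outline above relies on the two methods a map-style dataset is expected to provide: `__len__` and `__getitem__`. As a minimal sketch of that protocol, the class below uses plain Python lists in place of real CT arrays; `ToyCtDataset` and its sample values are hypothetical, and a real implementation would subclass `torch.utils.data.Dataset` and return tensors.

```python
class ToyCtDataset:
    """Stand-in dataset: one sample per candidate (hypothetical example)."""

    def __init__(self, candidates):
        # candidates: list of (is_nodule, series_uid, center_xyz) tuples
        self.candidates = list(candidates)

    def __len__(self):
        return len(self.candidates)

    def __getitem__(self, index):
        is_nodule, series_uid, center_xyz = self.candidates[index]
        return {
            "3D Array": [[[0.0]]],  # placeholder for the cropped CT volume
            "isNodule": is_nodule,
            "series_uid": series_uid,
            "candidate_location": center_xyz,
        }


# Made-up candidates (illustrative values only).
dataset = ToyCtDataset([
    (True, "1.2.3", (10.0, 20.0, 30.0)),
    (False, "1.2.3", (-5.0, 0.0, 12.5)),
])
sample = dataset[0]
print(len(dataset), sample["isNodule"], sample["series_uid"])
```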

In [1]:
import torch
import torchvision
import pandas as pd 
import os
from matplotlib import pyplot as plt

In [2]:
DIRPATH: str = r"C:\Users\muthu\GitHub\DATA 📁"
LUNADir: str = os.path.join(DIRPATH, "Luna16")

print(f"List \n{os.listdir(LUNADir)}")

List 
['annotations.csv', 'candidates.csv', 'candidates_V2.zip', 'evaluationScript.zip', 'seg-lungs-LUNA16.zip', 'subset0.zip', 'subset1.zip', 'subset2.zip', 'subset3.zip', 'subset4.zip']


In [16]:
# Load the annotation and candidate data that we will unify
annotations = pd.read_csv(os.path.join(LUNADir, 'annotations.csv'))
candidates = pd.read_csv(os.path.join(LUNADir, 'candidates.csv'))

In [15]:
print(f"seriesuid:: {candidates.iloc[0,0]}")
annotations.loc[annotations['seriesuid']==candidates.iloc[0,0]]

seriesuid:: 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860


|   | seriesuid | coordX | coordY | coordZ | diameter_mm |
|---|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -128.699421 | -175.319272 | -298.387506 | 5.651471 |
| 1 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | 103.783651 | -211.925149 | -227.12125 | 4.224708 |
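
One way to unify the two files is to group the annotations by `seriesuid` and then check, for each candidate, whether an annotated nodule lies close to it. The sketch below uses the standard `csv` module on small made-up rows. The `class` column of `candidates.csv`, the `match_diameter` helper, and the "within a quarter of the diameter on every axis" rule are assumptions of this example rather than something shown in the notebook.

```python
import csv
import io
from collections import defaultdict

# Made-up rows that mimic the column layout (illustrative values only).
ANNOTATIONS_CSV = """\
seriesuid,coordX,coordY,coordZ,diameter_mm
1.2.3,10.0,20.0,30.0,8.0
1.2.3,100.0,-50.0,25.0,4.0
"""
CANDIDATES_CSV = """\
seriesuid,coordX,coordY,coordZ,class
1.2.3,10.5,19.0,31.0,1
1.2.3,60.0,60.0,60.0,0
"""

# Group annotated nodules by series UID: uid -> [(center_xyz, diameter_mm)].
annotations_by_uid = defaultdict(list)
for row in csv.DictReader(io.StringIO(ANNOTATIONS_CSV)):
    center = (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"]))
    annotations_by_uid[row["seriesuid"]].append((center, float(row["diameter_mm"])))


def match_diameter(series_uid, center_xyz):
    """Return the diameter of a nearby annotated nodule, or 0.0 if none."""
    for ann_center, diameter in annotations_by_uid.get(series_uid, []):
        tolerance = diameter / 4  # assumed matching rule for this sketch
        if all(abs(a - c) <= tolerance for a, c in zip(ann_center, center_xyz)):
            return diameter
    return 0.0


# Build one record per candidate, enriched with the matched diameter.
unified = []
for row in csv.DictReader(io.StringIO(CANDIDATES_CSV)):
    center = (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"]))
    unified.append({
        "series_uid": row["seriesuid"],
        "center_xyz": center,
        "is_nodule": row["class"] == "1",
        "diameter_mm": match_diameter(row["seriesuid"], center),
    })

for entry in unified:
    print(entry)
```

Grouping the annotations first means each candidate is compared only against nodules from its own scan, which matters because the candidate file is typically much larger than the annotation file.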


In [17]:
from collections import namedtuple

In [23]:
candidateInfoTuple = namedtuple(typename='CandidateInfoTuple',field_names=['isNodule_bool','diameter_mm','series_uid','center_xyz'])

In [25]:
# Placeholder values, just to check that the fields line up
candidateInfoTuple(isNodule_bool='yes',diameter_mm='232',series_uid='222',center_xyz='32323')

CandidateInfoTuple(isNodule_bool='yes', diameter_mm='232', series_uid='222', center_xyz='32323')
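
A named tuple like the one above still compares like an ordinary tuple, field by field, so a list of candidates can be ordered with a plain sort. The sketch below illustrates this with typed values instead of placeholder strings; the field names and the candidate values are assumptions of this example.

```python
from collections import namedtuple

# Field names are assumptions of this sketch.
CandidateInfoTuple = namedtuple(
    "CandidateInfoTuple",
    ["isNodule_bool", "diameter_mm", "series_uid", "center_xyz"],
)

# Made-up candidates (illustrative values only).
candidates = [
    CandidateInfoTuple(False, 0.0, "1.2.3", (60.0, 60.0, 60.0)),
    CandidateInfoTuple(True, 4.0, "1.2.3", (100.0, -50.0, 25.0)),
    CandidateInfoTuple(True, 8.0, "4.5.6", (10.0, 20.0, 30.0)),
]

# Tuples compare field by field, so a reverse sort puts nodules first,
# and among nodules the largest diameter first.
candidates.sort(reverse=True)

first = candidates[0]
print(first.series_uid, first.diameter_mm)
print([c.isNodule_bool for c in candidates])
```

Putting `isNodule_bool` and `diameter_mm` first in the field list is what makes this ordering fall out of the default tuple comparison.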