# DLToolboxImg: Part 2
A set of helper functions that one repeatedly needs to construct a dataset from raw images, visualise the performance of a neural network while it is being trained, and evaluate the performance of a model after training is complete.

As a running example, I will apply the functions to the LIDC dataset.

# Table of Contents
- [Generate Dataset](#generatedata)
    - [Generate Negative Examples](#neg)
    - [Generate Positive Examples](#pos)


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib nbagg

In [2]:
from imports import *

<a id="generatedata"></a>
## Generate Dataset 

### Load the indices for train, valid, and test sets

In [3]:
root_dir="drive/"
interm_dir=root_dir+"interm5/"
filename=interm_dir+"scan_id_split"
with open(filename, 'rb') as f:  # pickle files must be opened in binary mode
    scan_id_train,scan_id_valid,scan_id_test=pickle.load(f)

In [4]:
scan_id_train,scan_id_valid,scan_id_test

([15, 16, 17, 20, 22, 23], [19], [18, 21, 14])
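For readers who do not have the split file at hand, here is a minimal sketch of how such a file could be written and read back with `pickle`. The temporary path and the `no_overlap` check are my own additions for illustration; the scan ids simply mirror the output above.

```python
import os
import pickle
import tempfile

# Hypothetical split, mirroring the structure printed above.
split = ([15, 16, 17, 20, 22, 23], [19], [18, 21, 14])

# Write to a throwaway location so the real split file is not touched.
path = os.path.join(tempfile.mkdtemp(), "scan_id_split")
with open(path, "wb") as f:
    pickle.dump(split, f)

# Read it back the same way the notebook does.
with open(path, "rb") as f:
    scan_id_train, scan_id_valid, scan_id_test = pickle.load(f)

# The three sets should not share any scan.
all_ids = scan_id_train + scan_id_valid + scan_id_test
no_overlap = len(all_ids) == len(set(all_ids))
```

Checking for overlap right after loading is a cheap guard against leaking a scan from the training set into validation or test.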

### Generate Negative Examples
<a id="neg"></a>

In [5]:
!mkdir /home/mas/x110/data/
!mkdir /home/mas/x110/data/pos
!mkdir /home/mas/x110/data/neg

mkdir: cannot create directory ‘/home/mas/x110/data/’: File exists
mkdir: cannot create directory ‘/home/mas/x110/data/pos’: File exists
mkdir: cannot create directory ‘/home/mas/x110/data/neg’: File exists
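The "File exists" messages are harmless, but they can be avoided. A minimal sketch using `os.makedirs` with `exist_ok=True` follows; the temporary base directory is a stand-in I chose so that the example has no side effects.

```python
import os
import tempfile

# Stand-in for /home/mas/x110/data; a temp dir keeps the sketch side-effect free.
data_dir = os.path.join(tempfile.mkdtemp(), "data")

for sub in ("pos", "neg"):
    # exist_ok=True makes the call a no-op when the directory already exists;
    # the intermediate directory (data/) is created as needed.
    os.makedirs(os.path.join(data_dir, sub), exist_ok=True)

# Running it a second time raises no error.
for sub in ("pos", "neg"):
    os.makedirs(os.path.join(data_dir, sub), exist_ok=True)

created = sorted(os.listdir(data_dir))
```

In a shell cell, `mkdir -p` has the same effect.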


In [7]:
interm_dir2='/home/mas/x110/data/pos'
interm_dir3='/home/mas/x110/data/neg'

I will be taking small cubes from the CT scan volume; each cube is 52x52x52. I could serially decompose a, say, 512x512x300 CT scan volume into 52x52x52 cubes, but that approach yields many "uninteresting" cubes, such as cubes that are entirely black. As an alternative, I will first create a lung mask, pick random points that lie inside the mask, and extract the 52x52x52 cube centered on each random point. As a final check, I will make sure that the cube contains no nodule, because we are now generating negative examples. In summary:

1. Get a scan.
2. Apply the lung mask.
3. Find the range of z slices where the lung occupies more than 2% of the slice area.
4. Select a random z location, zc, in that range.
5. On that z slice, apply the lung mask.
6. Select a random point (xc, yc) that lies inside the lung mask.
7. Extract a cube centered at (xc, yc, zc) with side N=52.
8. Sum the nodule mask over the newly extracted cube to ensure that it does not include a nodule.
9. Name the file `data_N_{scan_id}_{cx}_{cy}_{cz}.pkl`.
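The steps above can be sketched on synthetic data. This is an illustration under stated assumptions, not the notebook's implementation: `extract_cube` and `sample_candidate` are hypothetical helpers (the notebook uses its own `get_cube_from_img`), the arrays are random stand-ins for a scan, and the cube side is shrunk to 8 so the toy volume stays small.

```python
import numpy as np

def extract_cube(volume, cz, cy, cx, n):
    """Return the n*n*n cube around (cz, cy, cx), or None if it leaves the volume."""
    h = n // 2
    z0, y0, x0 = cz - h, cy - h, cx - h
    if min(z0, y0, x0) < 0:
        return None
    if z0 + n > volume.shape[0] or y0 + n > volume.shape[1] or x0 + n > volume.shape[2]:
        return None
    return volume[z0:z0 + n, y0:y0 + n, x0:x0 + n]

def sample_candidate(image, lung, nodule, n, rng):
    """Draw one random lung point and report whether its cube is nodule-free."""
    # Step 3: slices where the lung covers more than 2% of the slice area.
    area = lung.reshape(lung.shape[0], -1).mean(axis=1)
    valid_z = np.flatnonzero(area > 0.02)
    # Step 4: random slice among the valid ones.
    cz = int(rng.choice(valid_z))
    # Steps 5-6: random (row, column) inside the lung mask on that slice.
    ys, xs = np.nonzero(lung[cz])
    j = int(rng.integers(len(ys)))
    cy, cx = int(ys[j]), int(xs[j])
    # Step 7: cut the same cube from the image and from the nodule mask.
    cube = extract_cube(image, cz, cy, cx, n)
    label = extract_cube(nodule, cz, cy, cx, n)
    # Step 8: a negative example must contain no nodule voxel.
    is_negative = cube is not None and int(label.sum()) == 0
    return (cz, cy, cx), cube, is_negative

# Synthetic stand-ins, indexed (z, y, x).
rng = np.random.default_rng(313)
image = rng.random((40, 64, 64))
lung = np.zeros(image.shape, dtype=bool)
lung[10:30, 16:48, 16:48] = True
nodule = np.zeros(image.shape, dtype=bool)
nodule[12:15, 20:23, 20:23] = True

scan_id = 15  # hypothetical id, used only to build the file name
center, cube, is_negative = sample_candidate(image, lung, nodule, 8, rng)
cz, cy, cx = center
# Step 9: file name for the candidate.
name = f"data_N_{scan_id}_{cx}_{cy}_{cz}.pkl"
```

Using `np.nonzero` on the 2D mask returns row and column indices directly, so no manual conversion from a flattened index is needed.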

In [48]:
random.seed(313)  #seeds Python's random only; np.random.randint below is not seeded
for scan_id in scan_id_train:#[xx+1:]:
    scan_1 = ctscan(scan_id) 
    S,B=get_segmented_lungs2(scan_1.image_resampled, plot=False)

    #fraction of each slice that is covered by the lung mask
    T = B.shape[1]*B.shape[2]
    Areas=[np.sum(b)/T for b in B]
    ind2=[i for i,a in enumerate(Areas) if a>.02]
    z1,z2=ind2[0],ind2[-1]

    for k in range(10):
        zz=np.random.randint(z1,z2)

        Bf=B[zz].flatten()
        #In that slice, find the elements that are true
        Cs=[i for i,e in enumerate(Bf) if e]
        #randomly select an element from Cs
        i = random.choice(Cs)
        #from i get the original row and column of that element in B[zz]
        a=B.shape[2]  #row length of the flattened slice
        r = i//a
        c=i-a*r

        #Thus, we have successfully selected a random point that resides inside the lung area
        #we would like to extract a 52x52x52 patch from the ctscan volume.
        #The patch is centered at the conditioned random point we have generated
        m = 32
        cz,cy,cx =[zz,r,c]
        #grab the volume
        image=scan_1.image_normalized #zxy
        cube_img,corner0 = get_cube_from_img(image, cx, cy, cz, m)
        cube_label,corner1 = get_cube_from_img(scan_1.Z2, cx, cy, cz, m)
        if np.sum(cube_label)==0:
            #save file
            filename=interm_dir3+'/data_N_'+str(scan_id)+"_"+str(cx)+"_"+str(cy)+"_"+str(cz)+".pkl"
            with open(filename, 'wb') as f:  # Python 3: open(..., 'wb')
                pickle.dump([cube_img,cube_label.astype(bool)], f)
        #cubes that overlap a nodule are skipped: reassigning k inside a for loop
        #does not repeat the iteration, so a scan may yield fewer than 10 cubes

#x=[i for i,j in enumerate(scan_id_train) if j==scan_id]
#xx=x[0]

Loading dicom files ... This may take a moment.
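Note that in the loop above a rejected cube is not retried: changing `k` inside `for k in range(10)` does not affect the iteration, so a scan can contribute fewer than 10 negatives. If exactly 10 per scan are wanted, one option is a `while` loop with an attempt cap. The sketch below is generic; `collect` and `toy_sampler` are hypothetical names, and the sampler is a stand-in that randomly rejects draws.

```python
import random

def collect(n_wanted, try_sample, max_attempts=1000):
    """Call try_sample() until n_wanted results are accepted or attempts run out."""
    accepted = []
    attempts = 0
    while len(accepted) < n_wanted and attempts < max_attempts:
        attempts += 1
        sample = try_sample()
        # None signals a rejected draw (for example, a cube that contains a nodule).
        if sample is not None:
            accepted.append(sample)
    return accepted, attempts

rng = random.Random(313)

def toy_sampler():
    # Hypothetical stand-in: rejects roughly 30% of draws.
    x = rng.random()
    return x if x > 0.3 else None

samples, attempts = collect(10, toy_sampler)
```

The attempt cap keeps the loop from running forever when valid samples are rare, for instance in a scan whose lung region is mostly covered by nodules.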


#### Read Negative Examples
It is handy to create a CSV file that lists each file name together with its class and some other features.
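As a portable alternative to parsing `ls` output, the file names can be read with `os.listdir` and split into fields. This is a minimal sketch under my own assumptions: `parse_cube_filename` is a hypothetical helper, and the temporary directory with empty files only mimics the `data_N_..._.pkl` names that this notebook writes.

```python
import os
import tempfile

def parse_cube_filename(filename):
    """Split 'data_N_<scan_id>_<x>_<y>_<z>.pkl' into a record (label 0 = negative)."""
    stem, _ext = os.path.splitext(filename)
    scan_id, x, y, z = (int(p) for p in stem.split("_")[-4:])
    return {"scan_id": scan_id, "x": x, "y": y, "z": z,
            "label": 0, "filename": filename}

# Throwaway directory with empty files that mimic the naming convention.
neg_dir = tempfile.mkdtemp()
for name in ("data_N_15_181_261_164.pkl", "data_N_15_178_183_233.pkl", "notes.txt"):
    open(os.path.join(neg_dir, name), "w").close()

# Keep only the cube files and parse each name into a row.
rows = [parse_cube_filename(f) for f in sorted(os.listdir(neg_dir))
        if f.startswith("data_N_") and f.endswith(".pkl")]
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, and the coordinates come out as integers rather than strings.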

In [8]:
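if True:  #set to False to reload the saved index instead of rebuilding it
    temp=!ls {interm_dir3} -irlat #>> myfiles2.csv
    #keep only the lines that mention a data file
    temp1=[t for t in temp if "data" in t]

    #the file name is the last space-separated field of each line
    temp2=[t.split(" ")[-1] for t in temp1]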

    df=pd.DataFrame([t.split(".")[0].split("_")[-4:] for t in temp2],columns=['scan_id','x','y','z'])

    df['label']=0
    df['filename']=temp2
    df.to_csv(interm_dir3+"df_neg_scanid_centroid.csv")
else:
    df=pd.read_csv(interm_dir3+"df_neg_scanid_centroid.csv",index_col=0)
print(df.shape)
df.head(20)

(58, 6)


Unnamed: 0,scan_id,x,y,z,label,filename
0,15,181,261,164,0,data_N_15_181_261_164.pkl
1,15,178,183,233,0,data_N_15_178_183_233.pkl
2,15,262,276,77,0,data_N_15_262_276_77.pkl
3,15,102,246,130,0,data_N_15_102_246_130.pkl
4,15,282,226,237,0,data_N_15_282_226_237.pkl
5,15,183,182,131,0,data_N_15_183_182_131.pkl
6,15,160,188,205,0,data_N_15_160_188_205.pkl
7,15,296,223,117,0,data_N_15_296_223_117.pkl
8,15,297,202,95,0,data_N_15_297_202_95.pkl
9,15,324,220,158,0,data_N_15_324_220_158.pkl
