# Determining when lesions in the lungs are cancerous

In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival.

One year ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade's worth of progress in cancer prevention, diagnosis, and treatment in just 5 years.

Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, the goal is to develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients.

## Data Description

In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. The number of 2D slices varies from image to image, depending on the machine taking the scan and on the patient.

The DICOM files have a header that contains the necessary information about the patient ID, as well as scan parameters such as the slice thickness.

## Importing the modules and data


In [8]:
%matplotlib inline

import numpy as np
import pydicom
import os 
import pandas as pd

import scipy.ndimage
import matplotlib.pyplot as plt

#from skimage import measure, morphology
from mpl_toolkits.mplot3d.art3d import Poly3DCollection


data_dir = 'stage1'
patients = os.listdir(data_dir)
labels_df = pd.read_csv('stage1_labels.csv', index_col=0)

labels_df.head()

id,cancer
0015ceb851d7251b8f399e39779d1e7d,1
0030a160d58723ff36d73f41b170ec21,0
003f41c78e6acfa92430a057ac0b306e,0
006b96310a37b36cccb2ab48d10b49a3,1
008464bb8521d09a42985dd8add3d0d2,1


## Visualising the raw data

In [9]:
for patient in patients[:1]:
    label = labels_df.at[patient, 'cancer']
    path = data_dir +'/' + patient
    

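    # dcmread is the current pydicom reader; read_file is its deprecated alias
    slices = [pydicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    # sort on the float z-coordinate; int() would truncate and can misorder close slices
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))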
    print(len(slices),label)
    print(slices[0])

195 1
(0008, 0000) Group Length                        UL: 358
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0016) SOP Class UID                       UI: CT Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.7009.9004.321555830121981826540353244716
(0008, 0060) Modality                            CS: 'CT'
(0008, 103e) Series Description                  LO: 'Axial'
(0010, 0000) Group Length                        UL: 64
(0010, 0010) Patient's Name                      PN: '0015ceb851d7251b8f399e39779d1e7d'
(0010, 0020) Patient ID                          LO: '0015ceb851d7251b8f399e39779d1e7d'
(0010, 0030) Patient's Birth Date                DA: '19000101'
(0018, 0060) KVP                                 DS: ''
(0020, 0000) Group Length                        UL: 390
(0020, 000d) Study Instance UID                  UI: 2.25.51820907428519808061667399603379702974102486079290552633235
(0020, 000e) Series Instance UID     

In [10]:
for patient in patients[:3]:
    label = labels_df.at[patient, 'cancer']
    path = data_dir +'/' + patient
    

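    # dcmread is the current pydicom reader; read_file is its deprecated alias
    slices = [pydicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    # sort on the float z-coordinate; int() would truncate and can misorder close slices
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))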

    print(slices[0].pixel_array.shape, len(slices))


(512, 512) 195
(512, 512) 265
(512, 512) 233


## Number of examples

In [11]:
len(patients)

1595

## PreProcessing

- 1) Adding the pixel size in the Z direction, because it is missing. This attribute is called the slice thickness.

  The function takes a patient's scan directory, loads all of its slices, sorts them by ImagePositionPatient, and then tries to calculate the slice thickness. It adds this value as a new attribute on each slice of the patient and returns the sorted slices.

  ### How is the slice thickness calculated?
  The Z coordinates (ImagePositionPatient[2]) of two consecutive slices are subtracted, and the absolute difference is used as the thickness of every slice in the scan. If that fails, SliceLocation is used instead.

In [14]:
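def load_scan(path):
    """Load every slice in a patient's directory, sorted along the z-axis."""
    # The module is imported as pydicom above; dcmread is its current reader
    slices = [pydicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        # Thickness = distance between the z-positions of the first two slices
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except AttributeError:
        # Fall back to SliceLocation when ImagePositionPatient is unavailable
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    # Store the computed thickness on every slice of the scan
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices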
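
The `load_scan` function above derives the thickness from the first two slices only. As a hedged alternative, the sketch below takes the median gap across all consecutive slices, which is less sensitive to a single missing or duplicated slice. It is not part of the original notebook: `estimate_slice_spacing` and the sample z-coordinates are hypothetical, and it works on plain numbers rather than DICOM objects.

```python
from statistics import median


def estimate_slice_spacing(z_positions):
    """Estimate slice spacing (mm) as the median gap between sorted z-coordinates."""
    if len(z_positions) < 2:
        raise ValueError("need at least two slices to estimate spacing")
    zs = sorted(float(z) for z in z_positions)
    # Differences between each pair of neighbouring slices
    gaps = [b - a for a, b in zip(zs, zs[1:])]
    return median(gaps)


# Hypothetical z-coordinates (mm): unsorted, with one slice missing at -2.5
z = [0.0, -10.0, -7.5, -5.0, 2.5, 5.0]
print(estimate_slice_spacing(z))  # 2.5
```

With real data, the input would be `[s.ImagePositionPatient[2] for s in slices]`. The median ignores the single 5.0 mm gap left by the missing slice, whereas a first-pair calculation would be thrown off if that gap happened to fall between the first two slices.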

- 2) Converting pixel values to Hounsfield Units (HU)
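
A minimal sketch of this step, assuming the usual linear DICOM rescale, HU = raw * RescaleSlope + RescaleIntercept. `RescaleSlope` and `RescaleIntercept` are standard DICOM header attributes; the function name `to_hounsfield`, the sample pixel values, and the slope and intercept used here are illustrative assumptions, not values read from this dataset.

```python
import numpy as np


def to_hounsfield(raw, slope, intercept):
    """Apply the linear rescale HU = raw * slope + intercept and return int16."""
    # Work in float so a non-integer slope is applied without truncation
    hu = raw.astype(np.float64) * slope + intercept
    return hu.astype(np.int16)


# Hypothetical 2x2 raw slice; slope and intercept are assumed example values
raw = np.array([[0, 1024], [24, 2024]], dtype=np.uint16)
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
print(hu.tolist())  # [[-1024, 0], [-1000, 1000]]
```

On the Hounsfield scale, water is 0 HU and air is about -1000 HU, so values near those marks are a quick sanity check after conversion. For a real slice `s`, the inputs would be `s.pixel_array`, `s.RescaleSlope`, and `s.RescaleIntercept`.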