# Understand Data
*final goal: a basic, common-sense understanding of the data variables, i.e. what data you are given*

## References:
"In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient."

"The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness."

"The images in this dataset come from many sources and will vary in quality. For example, older scans were imaged with less sophisticated equipment. You should expect the stage 2 data to be, on the whole, more recent and higher quality than the stage 1 data (generally having thinner slice thickness). Ideally, your algorithm should perform well across a range of image quality."

"Each patient id has an associated directory of DICOM files. The patient id is found in the DICOM header and is identical to the patient name. The exact number of images will differ from case to case, varying according in the number of slices. Images were compressed as .7z files due to the large size of the dataset.
* stage1.7z - contains all images for the first stage of the competition, including both the training and test set
* stage1_labels.csv - contains the cancer ground truth for the stage 1 training set images"

([kaggle-dsb data](https://www.kaggle.com/c/data-science-bowl-2017/data))

## Useful Kernels:
*these kernels help you understand the dataset better*
### [Anokas' Exploratory Data Analysis](https://www.kaggle.com/anokas/data-science-bowl-2017/exploratory-data-analysis)
shows you how to open DICOM files and inspect the information in their headers, which includes:
  1. patient id
  2. scan parameters (z-position, slice thickness, pixel spacing, etc.)
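
As a rough illustration of the header fields listed above, the sketch below collects a few of them from one slice. It assumes the `pydicom` package is installed; `summarize_header` and `read_header` are hypothetical helper names, not taken from the kernel. Some scans may lack individual attributes, so missing values come back as `None`.

```python
def summarize_header(ds):
    """Collect a few header fields from a pydicom-style dataset object.

    Missing attributes are returned as None instead of raising.
    """
    position = getattr(ds, "ImagePositionPatient", None)
    return {
        "patient_id": getattr(ds, "PatientID", None),
        "slice_thickness": getattr(ds, "SliceThickness", None),
        "pixel_spacing": getattr(ds, "PixelSpacing", None),
        # For axial slices, the third component of ImagePositionPatient
        # is the z coordinate of the slice.
        "z_position": float(position[2]) if position is not None else None,
    }


def read_header(path):
    """Read the header of one DICOM file (assumes pydicom is installed)."""
    import pydicom  # imported lazily so summarize_header has no dependency

    # stop_before_pixels reads only the header and skips the image data.
    return summarize_header(pydicom.dcmread(path, stop_before_pixels=True))
```

Calling `read_header` on any single file in a patient's folder should be enough to see the patient id, since the id is the same for every slice of that patient.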

### [Zuidhof's Full Preprocessing Tutorial](https://www.kaggle.com/gzuidhof/data-science-bowl-2017/full-preprocessing-tutorial)

## Opening Data Files

In [1]:
import pandas as pd

pd.read_csv('../../../../data/dsb/stage1_labels.csv')

Unnamed: 0,id,cancer
0,0015ceb851d7251b8f399e39779d1e7d,1
1,0030a160d58723ff36d73f41b170ec21,0
2,003f41c78e6acfa92430a057ac0b306e,0
3,006b96310a37b36cccb2ab48d10b49a3,1
4,008464bb8521d09a42985dd8add3d0d2,1
5,0092c13f9e00a3717fdc940641f00015,0
6,00986bebc45e12038ef0ce3e9962b51a,0
7,00cba091fa4ad62cc3200a657aeb957e,0
8,00edff4f51a893d80dae2d42a7f45ad1,1
9,0121c2845f2b7df060945b072b2515d7,0


In [2]:
import pandas as pd

pd.read_csv('../../../../data/dsb/stage1_sample_submission.csv')

Unnamed: 0,id,cancer
0,026470d51482c93efc18b9803159c960,0.5
1,031b7ec4fe96a3b035a8196264a8c8c3,0.5
2,03bd22ed5858039af223c04993e9eb22,0.5
3,06a90409e4fcea3e634748b967993531,0.5
4,07b1defcfae5873ee1f03c90255eb170,0.5
5,0b20184e0cd497028bdd155d9fb42dc9,0.5
6,12db1ea8336eafaf7f9e3eda2b4e4fef,0.5
7,159bc8821a2dc39a1e770cb3559e098d,0.5
8,174c5f7c33ca31443208ef873b9477e5,0.5
9,1753250dab5fc81bab8280df13309733,0.5


## Summary:
### stage1.7z
* contains folders; each folder is a separate patient
* in each patient's folder are their CT scan image files
* CT scan image file
  * header contains information (such as patient id, slice thickness, etc.)
  * different number of image files for each patient (depending on the scanning machine and the patient)
  * many different image sources, and varying image quality (due to different equipment used)
  * each image for a patient is a 2D slice; combine all 2D slices (ordered by z-position) to get a 3D representation of the patient's lungs
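
The slice-stacking idea in the last bullet can be sketched with toy data. This is a minimal illustration using only the standard library, not the preprocessing used in any kernel: `Slice` and `stack_slices` are hypothetical names, and the 2x2 "images" stand in for real pixel arrays. The median gap between consecutive z positions is used here as an estimate of slice spacing.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Slice:
    z: float                  # z position of the slice, in mm
    pixels: List[List[int]]   # 2D image as rows of pixel values


def stack_slices(slices):
    """Sort slices by z position and return (volume, spacing)."""
    ordered = sorted(slices, key=lambda s: s.z)
    volume = [s.pixels for s in ordered]  # indexed as volume[z][row][col]
    gaps = [b.z - a.z for a, b in zip(ordered, ordered[1:])]
    spacing = median(gaps) if gaps else None
    return volume, spacing


# Toy 2x2 slices, deliberately given out of order.
toy = [
    Slice(z=5.0, pixels=[[3, 3], [3, 3]]),
    Slice(z=0.0, pixels=[[1, 1], [1, 1]]),
    Slice(z=2.5, pixels=[[2, 2], [2, 2]]),
]
volume, spacing = stack_slices(toy)
print(len(volume), volume[0][0][0], spacing)  # prints: 3 1 2.5
```

With real scans the same ordering step would apply to the pixel arrays read from each DICOM file, typically stacked into a single array rather than nested lists.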

### stage1_labels.csv
* contains the cancer ground truth labels for the stage 1 training set images (the correct answer for each patient id)
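
To get a feel for the labels, one might count how many patients fall in each class. The sketch below uses only the standard library and the ten rows displayed earlier in this notebook, so the resulting rate describes that small sample only, not the full file. It assumes the CSV has `id` and `cancer` columns; `label_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# The ten rows shown in the notebook output; the real file is longer.
SAMPLE = """id,cancer
0015ceb851d7251b8f399e39779d1e7d,1
0030a160d58723ff36d73f41b170ec21,0
003f41c78e6acfa92430a057ac0b306e,0
006b96310a37b36cccb2ab48d10b49a3,1
008464bb8521d09a42985dd8add3d0d2,1
0092c13f9e00a3717fdc940641f00015,0
00986bebc45e12038ef0ce3e9962b51a,0
00cba091fa4ad62cc3200a657aeb957e,0
00edff4f51a893d80dae2d42a7f45ad1,1
0121c2845f2b7df060945b072b2515d7,0
"""


def label_counts(handle):
    """Count rows per value of the 'cancer' column in an open CSV file."""
    reader = csv.DictReader(handle)
    return Counter(int(row["cancer"]) for row in reader)


counts = label_counts(io.StringIO(SAMPLE))
positive_rate = counts[1] / sum(counts.values())
print(counts[0], counts[1], positive_rate)  # prints: 6 4 0.4
```

The same function should work on the real file by passing `open(path, newline="")` instead of the in-memory sample.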

### stage1_sample_submission.csv
* need to predict a cancer probability for each of the 198 patients in the test set
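
The sample submission simply assigns 0.5 to every test patient. A file in that shape could be written as sketched below; this assumes the expected columns are `id` and `cancer`, as in the sample shown earlier, and `write_baseline` is a hypothetical helper name. Only two of the test ids are used here to keep the example short.

```python
import csv
import io


def write_baseline(patient_ids, handle, probability=0.5):
    """Write one row per patient with the same predicted probability."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "cancer"])
    for pid in patient_ids:
        writer.writerow([pid, probability])


# Two ids taken from the sample submission output shown earlier.
ids = [
    "026470d51482c93efc18b9803159c960",
    "031b7ec4fe96a3b035a8196264a8c8c3",
]
buf = io.StringIO()
write_baseline(ids, buf)
print(buf.getvalue())
```

A real model would replace the constant with a per-patient prediction while keeping the same two-column layout.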

### Data Variables:
1. the images of the patient's lungs (2D slices combined into a 3D representation)
2. slice thickness (a rough indicator of equipment quality; thinner slices generally come from newer, higher-quality scans)