# Exploratory Data Analysis
Get a general understanding of the provided dataset and identify potential issues to handle.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
data_dir = Path.cwd().parent

In [3]:
candidates_df = pd.read_csv(data_dir / 'candidates_V2.csv')

In [4]:
candidates_df.head()

,seriesuid,coordX,coordY,coordZ,class
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,68.42,-74.48,-288.7,0
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-95.209361,-91.809406,-377.42635,0
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-24.766755,-120.379294,-273.361539,0
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-63.08,-65.74,-344.24,0
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,52.946688,-92.688873,-241.067872,0


In [5]:
candidates_df.shape

(754975, 5)

There are 754975 candidates, i.e. potentially malignant lumps, in the dataset.

---
Columns:
- `seriesuid`: the UID of the scan, in DICOM format,
- `coordX`, `coordY`, `coordZ`: the XYZ coordinates of the candidate,
- `class`: the nodule status, a boolean value (0 for a candidate that is not an actual nodule, 1 for a nodule, either malignant or benign).

In [6]:
candidates_df['class'].unique()

array([0, 1], dtype=int64)

In [7]:
# Counts for each class
candidates_df['class'].value_counts()

0    753418
1      1557
Name: class, dtype: int64

Of all the candidates, 1557 are actual nodules (class 1), whether malignant or benign. The nodules are then annotated in `annotations.csv`.
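The counts above show a strong class imbalance: 1557 of 754975 candidates, roughly 0.2%, are class 1. A minimal sketch of how the counts and fractions could be tabulated together is shown below. The helper `class_balance` and the toy frame are illustrative assumptions, not part of the original notebook; in practice the function would be called on `candidates_df`.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "class") -> pd.DataFrame:
    """Return the count and fraction of rows for each label value."""
    counts = df[label_col].value_counts()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Toy stand-in for candidates_df (made-up rows, not the real data).
toy_df = pd.DataFrame({"class": [0, 0, 0, 0, 1]})
balance = class_balance(toy_df)
print(balance)
```

An imbalance this large is worth keeping in mind when choosing metrics and sampling strategies later on.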

In [10]:
candidates_df.seriesuid.nunique()

888
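With 754975 candidates spread over 888 scans, the average is about 850 candidates per scan. A possible next step is to look at how that number varies from scan to scan. The sketch below assumes a hypothetical helper `candidates_per_scan` and uses a toy frame in place of `candidates_df`.

```python
import pandas as pd


def candidates_per_scan(df: pd.DataFrame, uid_col: str = "seriesuid") -> pd.Series:
    """Summary statistics of the number of candidate rows per scan."""
    return df.groupby(uid_col).size().describe()


# Toy stand-in for candidates_df (made-up rows, not the real data).
toy_df = pd.DataFrame({"seriesuid": ["a", "a", "a", "b"], "class": [0, 1, 0, 0]})
summary = candidates_per_scan(toy_df)
print(summary)
```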

In [11]:
# Load annotations.csv
annot_df = pd.read_csv(data_dir / 'annotations.csv')

In [12]:
annot_df.head()

,seriesuid,coordX,coordY,coordZ,diameter_mm
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-128.699421,-175.319272,-298.387506,5.651471
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,103.783651,-211.925149,-227.12125,4.224708
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793...,69.639017,-140.944586,876.374496,5.786348
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,-24.013824,192.102405,-391.081276,8.143262
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,2.441547,172.464881,-405.493732,18.54515


In [13]:
annot_df.shape

(1186, 5)

Ensure that these nodules are fairly balanced between the training and validation sets.
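One way such a balance might be achieved is sketched below, under two assumptions that the notebook does not state: the split is done per scan (`seriesuid`), so that no scan appears in both sets, and scans are stratified on whether they contain at least one class-1 candidate. The function `split_series`, its parameters, and the toy data are hypothetical; only pandas and NumPy are used.

```python
import numpy as np
import pandas as pd


def split_series(df: pd.DataFrame, val_fraction: float = 0.2, seed: int = 0):
    """Split rows into train/validation by scan, stratified on nodule presence.

    Assumes columns 'seriesuid' and 'class' (0/1), as in candidates_df.
    """
    # 1 if the scan has at least one class-1 candidate, else 0.
    has_nodule = df.groupby("seriesuid")["class"].max()
    rng = np.random.default_rng(seed)

    val_uids = []
    # Sample the same fraction of scans from each stratum.
    for _, group in has_nodule.groupby(has_nodule):
        uids = rng.permutation(group.index.to_numpy())
        n_val = int(round(len(uids) * val_fraction))
        val_uids.extend(uids[:n_val])

    is_val = df["seriesuid"].isin(val_uids)
    return df[~is_val], df[is_val]


# Toy stand-in for candidates_df: 10 scans, 3 candidates each,
# the first 5 scans contain one nodule (made-up rows, not the real data).
toy_df = pd.DataFrame({
    "seriesuid": [f"scan{i}" for i in range(10) for _ in range(3)],
    "class": [1 if (i < 5 and j == 0) else 0 for i in range(10) for j in range(3)],
})
train_df, val_df = split_series(toy_df, val_fraction=0.2)
print(train_df["class"].value_counts())
print(val_df["class"].value_counts())
```

Splitting by scan rather than by row is a design choice in this sketch: it keeps candidates from the same scan out of both sets at once, at the cost of only approximately matching the nodule fraction.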