# Assignment nº1
## Lung Cancer Classification using Computerized Tomography (CT) Data


##### Work assembled by Alejandro Gonçalves, Francisca Mihalache, João Sousa and Vitor Ferreira.


## Table of contents <a name="contents"></a>

1. [Introduction](#introduction)
2. [Data](#data)

   - 2.1 [Data cleaning](#cleaning)



## Introduction <a name="introduction"></a>
[[go back to the top]](#contents)

Lung cancer is the leading cause of cancer-related deaths, with survival rates heavily dependent on early detection. While CT imaging is a valuable non-invasive tool for detecting lung nodules, only 16% of cases are diagnosed early. 

Computer-Aided Diagnosis (CAD) systems can help assess malignancy risk, but variability in human interpretation and large data volumes pose challenges. Radiomics, by extracting quantitative data from medical images, offers the potential for non-invasive diagnostic tools. 

Previous studies using technologies like CNNs (convolutional neural networks), PyRadiomics, and machine learning algorithms have shown promise, but ethical and legal challenges persist. This project aims to develop an advanced lung nodule malignancy prediction system using radiomics and deep learning to improve early diagnosis and reduce variability.

As suggested by the professor's guidelines, we will implement the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology for this project. This well-established framework consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Since deployment is outside the scope of this academic project, we will focus on the first five phases.

### Imports
[[go back to the top]](#contents)

Here are the imports we need.

In [70]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import pydicom

import radiomics
from radiomics import featureextractor, getFeatureClasses
import pylidc as pl







## Data <a name="data"></a>
[[go back to the top]](#contents)

The data was obtained from the LIDC-IDRI collection, a database of thoracic computed tomography (CT) images from a total of 1010 patients, accompanied by expert radiologist annotations. The dataset is available at https://www.cancerimagingarchive.net/collection/lidc-idri/.


### Data cleaning <a name="cleaning"></a> 
[[go back to the topic]](#data)

In this section we will clean up the data.

In [60]:
df = pd.read_csv("LIDC-IDRI_MetaData.csv")
df_sorted = df.sort_values(by='Subject ID')
df_sorted


Unnamed: 0,Subject ID,Study UID,Study Description,Study Date,Series ID,Series Description,Number of images,File Size (Bytes),Collection Name,Modality,Manufacturer
348,LIDC-IDRI-0001,1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636...,,133,70018838,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
186,LIDC-IDRI-0001,1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818...,,2,16357620,LIDC-IDRI,DX,GE MEDICAL SYSTEMS
1100,LIDC-IDRI-0002,1.3.6.1.4.1.14519.5.2.1.6279.6001.116951808801...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.493562949900...,,1,6909958,LIDC-IDRI,DX,GE MEDICAL SYSTEMS
1141,LIDC-IDRI-0002,1.3.6.1.4.1.14519.5.2.1.6279.6001.490157381160...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.619372068417...,,261,137396696,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
189,LIDC-IDRI-0003,1.3.6.1.4.1.14519.5.2.1.6279.6001.202063331127...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.142026812390...,,5,38580794,LIDC-IDRI,DX,GE MEDICAL SYSTEMS
...,...,...,...,...,...,...,...,...,...,...,...
990,LIDC-IDRI-1008,1.3.6.1.4.1.14519.5.2.1.6279.6001.339975625902...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.332510758903...,,115,60517648,LIDC-IDRI,CT,TOSHIBA
1248,LIDC-IDRI-1009,1.3.6.1.4.1.14519.5.2.1.6279.6001.849069697860...,CT THORAX W/CONTRAST,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.855232435861...,Recon 2: CHEST,125,65830674,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
1120,LIDC-IDRI-1010,1.3.6.1.4.1.14519.5.2.1.6279.6001.145373944605...,CT ANGIO CHEST (NON-CO,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.550599855064...,BOTTOM TO TOP,252,132720966,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
749,LIDC-IDRI-1011,1.3.6.1.4.1.14519.5.2.1.6279.6001.287560874054...,CT THORAX W/CONTRAST,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.272123398257...,CHEST,133,70030058,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
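
The preview shows that a patient can have both CT and DX (digital radiography) series. Since this project works with CT data, one plausible cleaning step is to keep only the rows whose `Modality` is `CT`. The sketch below uses a tiny hand-made frame with values copied from the preview; the names `meta` and `ct_only` are hypothetical, and the filter itself is an assumption about the cleaning rather than a step taken from the notebook.

```python
import pandas as pd

# Tiny stand-in for the metadata table (values copied from the preview above).
meta = pd.DataFrame({
    "Subject ID": ["LIDC-IDRI-0001", "LIDC-IDRI-0001",
                   "LIDC-IDRI-0002", "LIDC-IDRI-0002"],
    "Modality": ["CT", "DX", "DX", "CT"],
    "Number of images": [133, 2, 1, 261],
})

# Keep only the CT series and renumber the rows.
ct_only = meta[meta["Modality"] == "CT"].reset_index(drop=True)
print(ct_only)
```

Applied to the real table, the same boolean mask would be built from `df["Modality"]`.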


In [61]:
df1 = pd.read_excel('tcia-diagnosis-data-2012-04-20.xls')
df1_sorted = df1.sort_values(by='TCIA Patient ID')
df1_sorted

# pip install xlrd (needed so that pandas can read the .xls Excel file)

Codes used in the columns below (the spreadsheet repeats this legend inside every column header, so the headers are shortened here):
- Diagnosis: 0 = unknown, 1 = benign or non-malignant disease, 2 = malignant, primary lung cancer, 3 = malignant metastatic
- Diagnosis method: 0 = unknown, 1 = review of radiological images to show 2 years of stable nodule, 2 = biopsy, 3 = surgical resection, 4 = progression or response

Unnamed: 0,TCIA Patient ID,Diagnosis at the Patient Level,Diagnosis Method,Primary tumor site for metastatic disease,Nodule 1 Diagnosis at the Nodule Level,Nodule 1 Diagnosis Method at the Nodule Level,Nodule 2 Diagnosis at the Nodule Level,Nodule 2 Diagnosis Method at the Nodule Level,Nodule 3 Diagnosis at the Nodule Level,Nodule 3 Diagnosis Method at the Nodule Level,Nodule 4 Diagnosis at the Nodule Level,Nodule 4 Diagnosis Method at the Nodule Level,Nodule 5 Diagnosis at the Nodule Level,Nodule 5 Diagnosis Method at the Nodule Level
0,LIDC-IDRI-0068,3,4,Head & Neck Cancer,3.0,4.0,,,,,,,,
1,LIDC-IDRI-0071,3,1,Head & Neck,1.0,1.0,,,,,,,,
2,LIDC-IDRI-0072,2,4,Lung Cancer,1.0,4.0,,,,,,,,
3,LIDC-IDRI-0088,3,0,Uterine Cancer,0.0,0.0,,,,,,,,
4,LIDC-IDRI-0090,2,3,NSCLC,2.0,3.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,LIDC-IDRI-0994,2,3,LUL Large cell CA,2.0,3.0,,,,,,,,
153,LIDC-IDRI-1002,2,2,non-small cell carcinoma,,,,,,,,,,
154,LIDC-IDRI-1004,2,3,LUL NSCLC,2.0,3.0,,,,,,,,
155,LIDC-IDRI-1010,0,0,lymphoma,0.0,0.0,,,,,,,,
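
Because the diagnosis columns store integer codes, a lookup table makes them easier to read. The sketch below maps the patient-level code to its label using the legend from the spreadsheet header and derives a binary malignancy flag. The short column name `patient_diagnosis`, the derived columns, and the choice to treat codes 2 and 3 as malignant while leaving code 0 missing are assumptions made for illustration, not steps taken from the notebook.

```python
import pandas as pd

# Code legend copied from the spreadsheet header.
DIAGNOSIS_LABELS = {
    0: "unknown",
    1: "benign or non-malignant disease",
    2: "malignant, primary lung cancer",
    3: "malignant metastatic",
}

# Stand-in rows copied from the preview; `patient_diagnosis` is a
# hypothetical short name for the long patient-level diagnosis column.
diag = pd.DataFrame({
    "TCIA Patient ID": ["LIDC-IDRI-0068", "LIDC-IDRI-0071",
                        "LIDC-IDRI-0072", "LIDC-IDRI-1010"],
    "patient_diagnosis": [3, 1, 2, 0],
})

# Human-readable label for each code.
diag["diagnosis_label"] = diag["patient_diagnosis"].map(DIAGNOSIS_LABELS)

# Assumed binary target: codes 2 and 3 are malignant, 1 is benign,
# and 0 (unknown) is not in the mapping, so it becomes a missing value.
diag["malignant"] = diag["patient_diagnosis"].map({1: False, 2: True, 3: True})
print(diag)
```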


In [62]:
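df2 = pd.read_excel('lidc-idri-nodule-counts-6-23-2015.xlsx')
# 'TCIA Patent ID' (sic) is the column name as spelled in the spreadsheet
df2_sorted = df2.sort_values(by='TCIA Patent ID')
df2_sorted
# pip install openpyxl (needed so that pandas can read the .xlsx file)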

Unnamed: 0,TCIA Patent ID,Total Number of Nodules*,Number of Nodules >=3mm**,Number of Nodules <3mm***,Unnamed: 4,Unnamed: 5
0,LIDC-IDRI-0001,4,1,3,,
1,LIDC-IDRI-0002,12,1,11,,*total number of lesions that received either ...
2,LIDC-IDRI-0003,4,4,0,,"**total number of lesions that received a ""nod..."
3,LIDC-IDRI-0004,4,1,3,,"***total number of lesions that received a ""no..."
4,LIDC-IDRI-0005,9,3,6,,
...,...,...,...,...,...,...
1014,LIDC-IDRI-1009,2,1,1,,
1015,LIDC-IDRI-1010,10,1,9,,
1016,LIDC-IDRI-1011,4,4,0,,
1017,LIDC-IDRI-1012,1,1,0,,
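
The two `Unnamed` columns only hold the footnotes of the spreadsheet, so they carry no per-patient information. A minimal sketch of how they could be dropped and the remaining headers shortened is shown below. The stand-in rows are copied from the preview (truncated footnote text is kept as printed), and the new column names are hypothetical choices, not names used elsewhere in the notebook.

```python
import pandas as pd

# Stand-in for the nodule-count table (first rows of the preview above).
counts = pd.DataFrame({
    "TCIA Patent ID": ["LIDC-IDRI-0001", "LIDC-IDRI-0002", "LIDC-IDRI-0003"],
    "Total Number of Nodules*": [4, 12, 4],
    "Number of Nodules >=3mm**": [1, 1, 4],
    "Number of Nodules <3mm***": [3, 11, 0],
    "Unnamed: 4": [None, None, None],
    "Unnamed: 5": [None,
                   "*total number of lesions that received either ...",
                   '**total number of lesions that received a "nod..."'],
})

# Drop every column whose name starts with "Unnamed" (footnote spill-over).
counts = counts.loc[:, ~counts.columns.str.startswith("Unnamed")]

# Shorter, hypothetical column names that are easier to type.
counts = counts.rename(columns={
    "TCIA Patent ID": "patient_id",
    "Total Number of Nodules*": "n_nodules",
    "Number of Nodules >=3mm**": "n_nodules_ge_3mm",
    "Number of Nodules <3mm***": "n_nodules_lt_3mm",
})
print(counts)
```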


## pylidc


In [63]:
scans = pl.query(pl.Scan).all()
print(scans[0])
print(len(scans))

Scan(id=1,patient_id=LIDC-IDRI-0078)
1018
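
The query returns 1018 scans, while the collection covers 1010 patients, so some patients must have more than one scan. One way to find them is to count scans per patient ID. The sketch below uses a made-up list, `patient_ids`, as a stand-in for `[s.patient_id for s in scans]`; the IDs in it are placeholders and not real LIDC-IDRI patients.

```python
from collections import Counter

# Hypothetical stand-in for [s.patient_id for s in scans].
patient_ids = ["patient-A", "patient-B", "patient-A", "patient-C"]

# Count how many scans each patient has.
scans_per_patient = Counter(patient_ids)

# Keep only the patients that appear more than once.
repeated = {pid: n for pid, n in scans_per_patient.items() if n > 1}
print(repeated)
```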


In [64]:
print(scans[0].patient_id,
      scans[0].pixel_spacing,
      scans[0].slice_thickness,
      scans[0].slice_spacing)

LIDC-IDRI-0078 0.65 3.0 3.0
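
These values are in millimetres: each voxel of this scan covers 0.65 mm by 0.65 mm in the slice plane and 3.0 mm between slices, so the voxels are far from cubic. As a small worked example (not a step from the notebook), the volume of one voxel follows directly from the printed values and comes out at roughly 1.27 mm³.

```python
# Values printed above for scans[0], in millimetres.
pixel_spacing = 0.65   # in-plane spacing (same along rows and columns)
slice_spacing = 3.0    # distance between consecutive slices

# Volume of a single voxel in cubic millimetres.
voxel_volume_mm3 = pixel_spacing * pixel_spacing * slice_spacing

# Ratio between the through-plane and in-plane resolution.
anisotropy = slice_spacing / pixel_spacing

print(round(voxel_volume_mm3, 4), round(anisotropy, 2))
```

Spacing differs between scans, so any measurement expressed in voxels should be converted with the spacing of the scan it came from.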


In [65]:
print(len(scans[0].annotations))

13


In [72]:
annotations = pl.query(pl.Annotation).all()  # annotations are part of the pylidc library
# Display all annotations
#for ann in annotations:
    #print(ann)

In [67]:
scans[0].annotations

[Annotation(id=1,scan_id=1),
 Annotation(id=2,scan_id=1),
 Annotation(id=3,scan_id=1),
 Annotation(id=4,scan_id=1),
 Annotation(id=5,scan_id=1),
 Annotation(id=6,scan_id=1),
 Annotation(id=7,scan_id=1),
 Annotation(id=8,scan_id=1),
 Annotation(id=9,scan_id=1),
 Annotation(id=10,scan_id=1),
 Annotation(id=11,scan_id=1),
 Annotation(id=12,scan_id=1),
 Annotation(id=13,scan_id=1)]

In [69]:
nods = scans[0].cluster_annotations()
print("%s has %d nodules." % (scans[0], len(nods)))

for i,nod in enumerate(nods):
    print("Nodule %d has %d annotations." % (i+1, len(nods[i])))

AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
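
The traceback indicates that code reached through `cluster_annotations()` still refers to `np.int`, an alias that NumPy removed in version 1.24. Two common workarounds are to install a NumPy release older than 1.24, or to restore the alias before calling the library. The sketch below shows the second option; it is an assumption about how to get past the error, not something the notebook does, and it should run before the cell that calls `cluster_annotations()`.

```python
import numpy as np

# Workaround (an assumption, not part of the original notebook): re-create
# the removed alias so that older library code calling `np.int` keeps working.
# On NumPy versions that still ship the alias, nothing is changed.
if not hasattr(np, "int"):
    np.int = int

# After the shim, `np.int(...)` behaves like the builtin `int(...)`.
print(np.int("42") + 1)
```

This only patches the one alias named in the error message; if other removed aliases turn up, pinning NumPy is likely the simpler route.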