# API TCIA - Cancer Imaging Archive

#### CT image classification for lung cancer diagnosis and prevention
**Objective**: In this first phase of the project, the goal is to collect as many images as possible so that a model can later be trained on them using computer vision and deep learning.

In [1]:
# Import the libraries
import requests  # To download data, images, or files, or to send data to an API via a URL
import pandas as pd  # To manipulate, analyze, and clean data in DataFrames
from IPython.display import HTML  # To embed HTML content directly in the Jupyter Notebook

In [2]:
# Specify the API endpoint of the site
url = "https://cancerimagingarchive.net/api/v1/collections/"

In [3]:
# Request the endpoint and get a response:
response = requests.get(url)
print(response)  # expected output: <Response [200]>, meaning the request succeeded

<Response [200]>
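
Printing the response only confirms the status by eye. A minimal sketch of a stricter fetch follows, assuming a hypothetical helper `fetch_json` that is not part of the notebook. It takes the HTTP function as a parameter so it can be exercised without network access; `raise_for_status()` and `json()` are the standard `requests.Response` methods.

```python
from typing import Any, Callable


def fetch_json(url: str, get: Callable[..., Any], timeout: float = 30.0) -> Any:
    """Fetch url with the supplied get function and return the parsed JSON body.

    fetch_json is a hypothetical helper; get is expected to behave like
    requests.get, returning an object with raise_for_status() and json().
    """
    response = get(url, timeout=timeout)  # a timeout avoids waiting forever
    response.raise_for_status()           # raises on 4xx/5xx instead of continuing
    return response.json()
```

In the notebook this could be called as `data = fetch_json(url, requests.get)`.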


In [4]:
# Parse the response body as JSON
data = response.json()

In [5]:
# Convert the JSON into a DataFrame so we can inspect all the information:
df = pd.DataFrame(data)

# List the column names to see which ones are of interest
df.columns

Index(['id', 'date', 'date_gmt', 'guid', 'modified', 'modified_gmt', 'slug',
       'status', 'type', 'link', 'title', 'featured_media', 'template',
       'yoast_head', 'yoast_head_json', 'cancer_types', 'citations',
       'collection_doi', 'collection_download_info', 'collection_downloads',
       'versions', 'additional_resources', 'cancer_locations',
       'collection_page_accessibility', 'publications_related',
       'version_change_log_archived', 'collection_status',
       'publications_using', 'related_analysis_results', 'species',
       'version_number', 'collection_title', 'date_updated',
       'related_collection', 'subjects', 'analysis_results',
       'collection_short_title', 'data_types', 'version_change_log',
       'collection_browse_title', 'detailed_description', 'supporting_data',
       'collection_featured_image', 'collection_summary',
       'collection_acknowledgements', 'collection_funding',
       'hide_from_browse_table', 'program', '_links'],
      dtyp

In [8]:
# The fields we need are 'id', 'title', 'cancer_types', 'cancer_locations', and the 'guid' inside 'collection_featured_image'
# So we drop every other column from df

df.drop(columns=['date','link','date_gmt', 'guid', 'modified', 'modified_gmt', 'slug',
       'status', 'type', 'featured_media', 'template',
       'yoast_head', 'yoast_head_json', 'citations',
       'collection_doi', 'collection_download_info', 'collection_downloads',
       'versions', 'additional_resources', 'collection_title', 'species',
       'collection_page_accessibility', 'publications_related',
       'version_change_log_archived', 'collection_status',
       'publications_using', 'related_analysis_results',
       'version_number', 'date_updated',
       'related_collection', 'subjects', 'analysis_results',
       'collection_short_title', 'data_types', 'version_change_log',
       'collection_browse_title', 'detailed_description', 'supporting_data', 'collection_summary',
       'collection_acknowledgements', 'collection_funding',
       'hide_from_browse_table', 'program', '_links'], inplace=True)
df

Unnamed: 0,id,title,cancer_types,cancer_locations,collection_featured_image
0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],"{'ID': '49531', 'post_author': '24', 'post_dat..."
1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Ca...",[Bone],"{'ID': '49513', 'post_author': '22', 'post_dat..."
2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],"{'ID': '9095', 'post_author': '29', 'post_date..."
3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],"{'ID': '6787', 'post_author': '29', 'post_date..."
4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],"{'ID': '6787', 'post_author': '29', 'post_date..."
5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung C...",[Lymph Node],"{'ID': '48947', 'post_author': '37', 'post_dat..."
6,48455,{'rendered': 'DFCI-BCH-BWH-PEDs-HGG'},"[High Grade Glioma, Diffuse Midline Glioma]",[Brain],False
7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],"{'ID': '48941', 'post_author': '20', 'post_dat..."
8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],"{'ID': '48945', 'post_author': '11', 'post_dat..."
9,48465,{'rendered': 'HNC-IMRT-70-33'},[Head and Neck Cancer],[Head-Neck],False
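
Dropping more than forty columns by name is easy to get wrong; an alternative is to name only the fields to keep. A minimal sketch on plain dictionaries follows, assuming `data` is the list of records returned by `response.json()`. The `KEEP` tuple and the `keep_fields` helper are hypothetical, and the sample record is abbreviated from the output above, with an illustrative `slug` value.

```python
KEEP = ("id", "title", "cancer_types", "cancer_locations", "collection_featured_image")


def keep_fields(records, fields=KEEP):
    """Return new records containing only the wanted keys (missing keys become None)."""
    return [{key: record.get(key) for key in fields} for record in records]


# Abbreviated, illustrative record; real records carry many more keys.
sample = [{
    "id": 43451,
    "slug": "illustrative-slug",  # stands in for the many keys that get discarded
    "title": {"rendered": "RIDER-LUNG-CT"},
    "cancer_types": ["Lung Cancer"],
    "cancer_locations": ["Chest"],
    "collection_featured_image": {"ID": "48941", "post_author": "20"},
}]

trimmed = keep_fields(sample)
print(sorted(trimmed[0]))
```

With the DataFrame already built, `df[list(KEEP)]` selects the same columns.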


In [10]:
# The 'guid' key inside each 'collection_featured_image' dictionary holds the link to the image we want, so we extract it into a new column.
# The isinstance check avoids an error when a cell holds True, False, or NaN (null) instead of a dictionary.
df["images_link"] = df["collection_featured_image"].apply(lambda x: x.get("guid") if isinstance(x, dict) else None)
df

Unnamed: 0,id,title,cancer_types,cancer_locations,collection_featured_image,images_link
0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],"{'ID': '49531', 'post_author': '24', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Ca...",[Bone],"{'ID': '49513', 'post_author': '22', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],"{'ID': '9095', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],"{'ID': '6787', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],"{'ID': '6787', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung C...",[Lymph Node],"{'ID': '48947', 'post_author': '37', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
6,48455,{'rendered': 'DFCI-BCH-BWH-PEDs-HGG'},"[High Grade Glioma, Diffuse Midline Glioma]",[Brain],False,
7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],"{'ID': '48941', 'post_author': '20', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],"{'ID': '48945', 'post_author': '11', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
9,48465,{'rendered': 'HNC-IMRT-70-33'},[Head and Neck Cancer],[Head-Neck],False,
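
The `title` column still holds a `{'rendered': ...}` dictionary, and the featured image is `False` for collections without one. A small sketch of flattening both in one pass follows, assuming the record shapes shown in the output above. `flatten_record` is a hypothetical helper and the URL in the sample is a placeholder.

```python
def flatten_record(record):
    """Pull the plain title and the image URL out of one collection record."""
    title = record.get("title")
    image = record.get("collection_featured_image")
    return {
        "id": record.get("id"),
        # title arrives as {'rendered': '<name>'}; otherwise keep the raw value
        "title": title.get("rendered") if isinstance(title, dict) else title,
        # the image field is a dict when present and False when absent
        "images_link": image.get("guid") if isinstance(image, dict) else None,
    }


with_image = {
    "id": 43451,
    "title": {"rendered": "RIDER-LUNG-CT"},
    "collection_featured_image": {"ID": "48941", "guid": "https://example.org/image.jpg"},
}
without_image = {
    "id": 48465,
    "title": {"rendered": "HNC-IMRT-70-33"},
    "collection_featured_image": False,
}

print(flatten_record(with_image))
print(flatten_record(without_image))
```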


In [12]:
# Drop the rows whose last column (images_link) is null
df_clean = df.dropna(subset=['images_link'])
df_clean

Unnamed: 0,id,title,cancer_types,cancer_locations,collection_featured_image,images_link
0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],"{'ID': '49531', 'post_author': '24', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Ca...",[Bone],"{'ID': '49513', 'post_author': '22', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],"{'ID': '9095', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],"{'ID': '6787', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],"{'ID': '6787', 'post_author': '29', 'post_date...",https://stage.cancerimagingarchive.net/wp-cont...
5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung C...",[Lymph Node],"{'ID': '48947', 'post_author': '37', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],"{'ID': '48941', 'post_author': '20', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...
8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],"{'ID': '48945', 'post_author': '11', 'post_dat...",https://www.cancerimagingarchive.net/wp-conten...


In [14]:
# Remove the collection_featured_image column to leave a cleaner DataFrame
df2 = df_clean.drop(columns=['collection_featured_image'])
df2

Unnamed: 0,id,title,cancer_types,cancer_locations,images_link
0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],https://www.cancerimagingarchive.net/wp-conten...
1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Ca...",[Bone],https://www.cancerimagingarchive.net/wp-conten...
2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],https://stage.cancerimagingarchive.net/wp-cont...
3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],https://stage.cancerimagingarchive.net/wp-cont...
4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],https://stage.cancerimagingarchive.net/wp-cont...
5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung C...",[Lymph Node],https://www.cancerimagingarchive.net/wp-conten...
7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],https://www.cancerimagingarchive.net/wp-conten...
8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],https://www.cancerimagingarchive.net/wp-conten...


In [16]:
# Define a function that lets us view the images
# It wraps the image link in an HTML <img> tag

def show_image(url):
    return f'<img src="{url}" width="10000"/>'

# Apply the function and create a new column holding the image tag
df2['image'] = df2['images_link'].apply(show_image)

# Display the DataFrame with images using HTML
df_clean_images = HTML(df2.to_html(escape=False))
df_clean_images

Unnamed: 0,id,title,cancer_types,cancer_locations,images_link,image
0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],https://www.cancerimagingarchive.net/wp-content/uploads/tcia_coverimage.png,
1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Cancer, Colon Cancer, Kidney Cancer, Lung Cancer, Prostate Cancer, Soft-tissue Sarcoma, Skin Cancer]",[Bone],https://www.cancerimagingarchive.net/wp-content/uploads/Spine-Mets-CT-SEG_selected_image.png,
2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],https://stage.cancerimagingarchive.net/wp-content/uploads/BRATS_banner_noCaption.png,
3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],https://stage.cancerimagingarchive.net/wp-content/uploads/NIH-Cancer-Moonshot-logo.png,
4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],https://stage.cancerimagingarchive.net/wp-content/uploads/NIH-Cancer-Moonshot-logo.png,
5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung Cancer, Hodgkin Lymphoma, Small Cell Lung Cancer, Thyroid Cancer, Adenocarcinoma, Melanoma, Head and Neck Cancer, Prostate Cancer, Mesothelioma, Ovarian Cancer, Colon Cancer]",[Lymph Node],https://www.cancerimagingarchive.net/wp-content/uploads/Mediastinal-Image.jpg,
7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],https://www.cancerimagingarchive.net/wp-content/uploads/RiderLungCT-Image2.jpg,
8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],https://www.cancerimagingarchive.net/wp-content/uploads/41597_2024_3510_Fig1_HTML.webp,
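
Because `to_html(escape=False)` writes cell contents into the page verbatim, the URL placed in the `src` attribute is best escaped first. A minimal sketch using the standard library's `html.escape` follows. The `image_tag` helper and its default width of 200 pixels are assumptions for illustration, not the notebook's values.

```python
from html import escape


def image_tag(url, width=200):
    """Build an <img> tag with the URL escaped for use inside an attribute."""
    return f'<img src="{escape(url, quote=True)}" width="{int(width)}"/>'


print(image_tag("https://example.org/a.png?x=1&y=2"))
```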


In [18]:
type(df_clean_images)

IPython.core.display.HTML

In [20]:
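from io import StringIO  # Recent pandas versions expect a file-like object rather than a literal HTML string

# df_clean_images is an IPython.core.display.HTML object, so we take its HTML source as text
html_data = df_clean_images.data

# Read the HTML back into a DataFrame with pandas
df_convert = pd.read_html(StringIO(html_data))[0]  # [0] because read_html returns a list of DataFrames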
df_convert


Unnamed: 0.1,Unnamed: 0,id,title,cancer_types,cancer_locations,images_link,image
0,0,47755,{'rendered': 'RPA-Head-and-Neck-Lymph-Nodes'},[Head and Neck Cancer],[Head-Neck],https://www.cancerimagingarchive.net/wp-conten...,
1,1,48003,{'rendered': 'Spine-Mets-CT-SEG'},"[Metastatic disease, Bladder Cancer, Breast Ca...",[Bone],https://www.cancerimagingarchive.net/wp-conten...,
2,2,48983,{'rendered': 'BRATS-AFRICA'},[Brain Cancer],[Brain],https://stage.cancerimagingarchive.net/wp-cont...,
3,3,49073,{'rendered': 'CMB-OV'},[Ovarian Cancer],[Ovary],https://stage.cancerimagingarchive.net/wp-cont...,
4,4,49063,{'rendered': 'CMB-BRCA'},[Breast Invasive Carcinoma],[Breast],https://stage.cancerimagingarchive.net/wp-cont...,
5,5,48767,{'rendered': 'MEDIASTINAL-LYMPH-NODE-SEG'},"[Various, Breast Cancer, Non-small Cell Lung C...",[Lymph Node],https://www.cancerimagingarchive.net/wp-conten...,
6,7,43451,{'rendered': 'RIDER-LUNG-CT'},[Lung Cancer],[Chest],https://www.cancerimagingarchive.net/wp-conten...,
7,8,42513,{'rendered': 'HISTOLOGYHSI-GB'},[Glioblastoma],[Brain],https://www.cancerimagingarchive.net/wp-conten...,


In [22]:
# Filter the images that correspond to lung cancer
# by selecting the rows where the cancer location is the chest.
# After read_html the column holds text, which is why .str.contains works here.
df_convert_lung = df_convert[df_convert['cancer_locations'].str.contains("Chest", case=False, na=False)]

# Select the URL of the image of interest
lung_image_url = df_convert_lung['images_link'].iloc[0]

# Request the image
response = requests.get(lung_image_url)  # expected output: <Response [200]>
response

<Response [200]>
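
The `.str.contains` filter relies on the column being text after the HTML round trip. On the original list values, a membership test does the same job. A minimal sketch follows, assuming records shaped like the rows above; `has_location` is a hypothetical helper and the sample rows are abbreviated.

```python
def has_location(locations, wanted):
    """True if wanted equals one of the location names, ignoring case."""
    if not isinstance(locations, (list, tuple)):
        return False  # covers None, NaN and other non-list values
    return any(str(loc).lower() == wanted.lower() for loc in locations)


# Abbreviated rows taken from the table above.
rows = [
    {"title": "RIDER-LUNG-CT", "cancer_locations": ["Chest"]},
    {"title": "Spine-Mets-CT-SEG", "cancer_locations": ["Bone"]},
    {"title": "BRATS-AFRICA", "cancer_locations": ["Brain"]},
]

chest_rows = [row for row in rows if has_location(row["cancer_locations"], "chest")]
print([row["title"] for row in chest_rows])
```

On the DataFrame with list values, the same predicate could be applied as `df2[df2['cancer_locations'].apply(has_location, wanted='Chest')]`.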

In [26]:
from PIL import Image  # Image module from Pillow (PIL), used to open, modify, and save images
from io import BytesIO  # BytesIO lets us treat binary data as an in-memory file,
# which is useful when handling files as bytes, such as images downloaded from the web.

# Wrap the binary content in a file-like object that Image.open can read as an image
img = Image.open(BytesIO(response.content))

# Save the image to disk
img.save("lung_cancer_image.png")
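
Opening the download with Pillow and saving it as PNG re-encodes the file (the source image here is a `.jpg`). If the original bytes are wanted unchanged, they can be written straight to disk under a name taken from the URL. A minimal sketch using only the standard library follows; `filename_from_url`, `save_bytes`, and the `images` directory are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url, default="image.bin"):
    """Return the last path component of the URL, or default if there is none."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default


def save_bytes(content, url, directory="images"):
    """Write raw bytes to directory/<name taken from url> and return the path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename_from_url(url)
    target.write_bytes(content)
    return target
```

In the notebook this could be called as `save_bytes(response.content, lung_image_url)`.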

##### CONCLUSION: We were only able to obtain one image from this API.
#### Next steps: We will look for other data sources from which to collect more images of healthy and diseased lungs to train the model later.
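
The tables above contain exactly ten collections. The field names in the response (`yoast_head`, `guid`, `title.rendered`, `_links`) suggest a WordPress-style REST endpoint, and such endpoints commonly return ten items per page by default. If that assumption holds here, which has not been verified against this API, requesting further pages might surface more collections. A sketch of a page loop follows; `iter_records` and `fetch_page` are hypothetical names.

```python
from typing import Any, Callable, Iterator


def iter_records(fetch_page: Callable[[int], Any], max_pages: int = 50) -> Iterator[dict]:
    """Yield records page by page until a page is empty or is not a list.

    fetch_page(page_number) is supplied by the caller and should return the
    parsed JSON for that page. A non-list result (for example an error object
    returned past the last page) ends the loop.
    """
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not isinstance(records, list) or not records:
            break
        yield from records


# Offline demonstration with a fake source: two pages of data, then an empty page.
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: []}
all_records = list(iter_records(lambda page: fake_pages.get(page, [])))
print(len(all_records))
```

With `requests`, `fetch_page` could be `lambda page: requests.get(url, params={"per_page": 100, "page": page}, timeout=30).json()`. The `per_page` and `page` parameter names follow the WordPress convention and are an assumption about this endpoint.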