# Brain tumor detection
This notebook builds on the Kaggle project [Brain Tumor Detection v1.0 || CNN, VGG-16](https://www.kaggle.com/code/ruslankl/brain-tumor-detection-v1-0-cnn-vgg-16). The goal is to carry out a topological data analysis (TDA) of the MRI scans used in that project, with the aim of improving on the results of the model obtained there.

## Dataset description
The [dataset](https://www.kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection) consists of 253 MRI scans of different brains, split into two classes: scans that show at least one visible brain tumor and scans that show none. There are 155 scans in which tumors are visible and 98 in which they are not.
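
To put the class imbalance in numbers, the short calculation below uses only the counts quoted above. The majority-class baseline is added here as a reference point for reading later accuracy figures; it is not part of the original project.

```python
# Class counts as stated in the dataset description
n_tumor, n_no_tumor = 155, 98
n_total = n_tumor + n_no_tumor

share_tumor = n_tumor / n_total
share_no_tumor = n_no_tumor / n_total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(share_tumor, share_no_tumor)

print(f"total={n_total}, tumor={share_tumor:.1%}, no tumor={share_no_tumor:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.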

For this analysis, the images were first converted to grayscale numpy arrays and resized to a common size of 224x224 pixels. Since every image is forced to the same size, we can expect some of them to have been distorted in the process.
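
The source of the `get_images` helper is not shown in this notebook, so the following is only a sketch of the kind of conversion described above, under stated assumptions: luma weights for the grayscale step and nearest-neighbour sampling for the resize. The names `to_grayscale`, `resize_nearest` and `aspect_distortion` are hypothetical. It also quantifies the distortion introduced when a non-square image is forced into a square.

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted sum of the RGB channels (ITU-R BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights

def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2D array to (height, width)."""
    height, width = size
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows[:, None], cols]

def aspect_distortion(original_shape, size):
    """Horizontal scale factor divided by vertical scale factor (1.0 = no distortion)."""
    scale_y = size[0] / original_shape[0]
    scale_x = size[1] / original_shape[1]
    return scale_x / scale_y

# Synthetic 300x200 RGB array standing in for a real scan
rng = np.random.default_rng(0)
fake_mri = rng.integers(0, 256, size=(300, 200, 3))

gray = to_grayscale(fake_mri)
resized = resize_nearest(gray, (224, 224))
print(resized.shape, aspect_distortion(gray.shape, (224, 224)))
```

A distortion factor far from 1.0 means the brain is visibly stretched along one axis, which changes the shapes that later topological steps operate on.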

In [None]:
from data.image_converter import get_images

image_size = (224, 224)
tumor_positives, tumor_negatives = get_images(image_size=image_size)

In [None]:
from plotly.graph_objects import Figure

fig = Figure()
fig.add_bar(x=['Tumor positive', 'Tumor negative'], y=[len(tumor_positives), len(tumor_negatives)],
            name='Items per class', marker={'color':['#007bff', '#ff7f00']})
fig.update_layout(title='Number of items per class')

The brains in the images vary in brightness, shape, level of detail, and so on. A closer look at the data would also reveal that the tumors vary in size and that the scans without tumors do not necessarily come from completely healthy brains. In addition, some scans carry overlays such as arrows pointing at tumors or markers indicating the view from which the scan was taken.

Irregularities such as eyes appearing in the scan or very pronounced folds could hurt the performance of a model based on topological data analysis.

In [None]:
import numpy as np
from plotly.subplots import make_subplots
from gtda.plotting import plot_heatmap

seed = 123
np.random.seed(seed)

# randint excludes the upper bound, so len(...) already covers every valid index
random_positives_idx = np.random.randint(0, len(tumor_positives), size=3)
random_negatives_idx = np.random.randint(0, len(tumor_negatives), size=3)

fig = make_subplots(rows=3, cols=2, subplot_titles=['MRI with tumors', 'MRI without tumors'], vertical_spacing=0.02)
fig.update_layout(width=600, height=700)

for row, idx in enumerate(random_positives_idx):
    fig.add_trace(plot_heatmap(tumor_positives[idx])['data'][0], row=row+1, col=1)
for row, idx in enumerate(random_negatives_idx):
    fig.add_trace(plot_heatmap(tumor_negatives[idx])['data'][0], row=row+1, col=2)
fig
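
The cell above picks its examples with `np.random.randint`, which treats the upper bound as exclusive and samples with replacement, so the same scan can appear more than once in the grid. A possible alternative, sketched here with a stand-in collection size (`n_images` is hypothetical), is to draw distinct indices with `Generator.choice`:

```python
import numpy as np

n_images = 155  # stand-in for len(tumor_positives)
rng = np.random.default_rng(123)

# replace=False guarantees three different indices in [0, n_images)
sample_idx = rng.choice(n_images, size=3, replace=False)
print(sample_idx)
```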

## Data augmentation

In [None]:
from data.data_split import train_val_test_split

train_val_test_split((70, 20, 10))
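
`train_val_test_split` is a project helper whose source is not shown here. As a rough illustration of what a 70/20/10 split per class could look like, the sketch below shuffles indices and cuts them by percentage. The function name `split_indices`, the rounding rule (floor, with the remainder going to the test set) and the seed are all assumptions, so the actual helper may produce slightly different sizes.

```python
import random

def split_indices(n_items, percentages=(70, 20, 10), seed=123):
    """Shuffle range(n_items) and cut it into train/val/test index lists."""
    if sum(percentages) != 100:
        raise ValueError("percentages must add up to 100")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * percentages[0] // 100
    n_val = n_items * percentages[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test

# Class sizes taken from the dataset description
for label, n_items in [("yes", 155), ("no", 98)]:
    train, val, test = split_indices(n_items)
    print(label, len(train), len(val), len(test))
```

Splitting each class separately, as done here, keeps the class proportions roughly equal across the three subsets.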

In [None]:
from os import getcwd, listdir
from os.path import join

fig = Figure()
# Build the path with join so it does not depend on the Windows separator
data_path = join(getcwd(), 'data', 'brain_tumor_dataset', 'split_data')
splits = ['train', 'val', 'test']

# One bar group per class, counting the files in each split folder
fig.add_bar(x=['Training', 'Validation', 'Test'],
            y=[len(listdir(join(data_path, split, 'yes'))) for split in splits],
            name='With tumors', marker={'color': '#007bff'})
fig.add_bar(x=['Training', 'Validation', 'Test'],
            y=[len(listdir(join(data_path, split, 'no'))) for split in splits],
            name='Without tumors', marker={'color': '#ff7f00'})
fig

In [None]:
%%time
from os.path import join
from data.mkdir_augmented import mk_augmented_dir
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import preprocess_input

mk_augmented_dir()

# Random geometric and brightness perturbations applied to the training images
train_datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    brightness_range=[0.5, 1.5],
    horizontal_flip=True,
    vertical_flip=True,
    preprocessing_function=preprocess_input,
    fill_mode='constant',
    cval=0
)

train_copies = 1  # number of augmented batches to generate per class
val_copies = 1    # not used in this cell
folders = ['yes', 'no']

for folder in folders:
    generator = train_datagen.flow_from_directory(
        directory=join(data_path, 'train'),
        save_to_dir=join(data_path, 'train_augmented', folder),
        classes=[folder],
        color_mode='rgb',
        target_size=image_size,
        batch_size=32,
        class_mode='binary',
        seed=seed,
    )
    # Each batch drawn from the generator is written to save_to_dir as a side effect;
    # stop after train_copies batches (at most 32 images each)
    for i, batch in enumerate(generator, start=1):
        if i >= train_copies:
            break
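
To make a few of the augmentation parameters above more concrete, here is a hand-written numpy sketch of random flips and a brightness change in the range `[0.5, 1.5]`. It is not how Keras implements `ImageDataGenerator`; rotations, shifts and shear are left out, and the function name `augment` is hypothetical.

```python
import numpy as np

def augment(image, rng, brightness_range=(0.5, 1.5)):
    """Randomly flip a 2D image and rescale its brightness."""
    out = image.astype(float)
    if rng.random() < 0.5:
        out = np.fliplr(out)  # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)  # vertical flip
    factor = rng.uniform(*brightness_range)
    # Keep the result inside the usual 8-bit intensity range
    return np.clip(out * factor, 0, 255)

rng = np.random.default_rng(123)
image = rng.integers(0, 256, size=(8, 8)).astype(float)
augmented = augment(image, rng)
print(augmented.shape, augmented.min(), augmented.max())
```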

In [None]:
from os.path import join

fig = Figure()
data_path = join(getcwd(), 'data', 'brain_tumor_dataset', 'split_data')
# Only the training set is augmented; validation and test stay untouched
splits = ['train_augmented', 'val', 'test']

fig.add_bar(x=['Training', 'Validation', 'Test'],
            y=[len(listdir(join(data_path, split, 'yes'))) for split in splits],
            name='With tumors', marker={'color': '#007bff'})
fig.add_bar(x=['Training', 'Validation', 'Test'],
            y=[len(listdir(join(data_path, split, 'no'))) for split in splits],
            name='Without tumors', marker={'color': '#ff7f00'})
fig

In [None]:
train_yes, train_no = get_images(image_size=image_size, augmented=True)

In [None]:
%%time
from gtda.images import Binarizer

rows = 3
cols = 10

np.random.seed(seed)
# randint excludes the upper bound, so len(...) already covers every valid index
random_positives_idx = np.random.randint(0, len(train_yes), size=rows)
random_negatives_idx = np.random.randint(0, len(train_no), size=rows)

fig = make_subplots(rows=rows, cols=cols)
fig.update_layout(width=1300, height=600, title='Binarized MRI scans with tumors')

for threshold in range(1, cols):
    binarizer = Binarizer(threshold=threshold/10, n_jobs=-1)
    for row, idx in enumerate(random_positives_idx):
        binarized_yes = binarizer.fit_transform(train_yes[idx].reshape(-1, train_yes[idx].shape[0], train_yes[idx].shape[1]))
        fig.add_trace(binarizer.plot(binarized_yes)['data'][0], row=row+1, col=threshold)
fig

In [None]:
%%time
fig = make_subplots(rows=rows, cols=cols)
fig.update_layout(width=1300, height=600, title='Binarized MRI scans without tumors')

for threshold in range(1, cols):
    binarizer = Binarizer(threshold=threshold/10, n_jobs=-1)
    # Index train_no with the indices sampled for the negative class
    for row, idx in enumerate(random_negatives_idx):
        binarized_no = binarizer.fit_transform(train_no[idx].reshape(-1, train_no[idx].shape[0], train_no[idx].shape[1]))
        fig.add_trace(binarizer.plot(binarized_no)['data'][0], row=row+1, col=threshold)
fig
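
The two grids above sweep the binarization threshold from 0.1 to 0.9. As a rough mental model, and assuming the threshold is interpreted as a fraction of the maximum pixel value seen during fitting (the exact behaviour of `Binarizer` should be checked against the giotto-tda documentation), the operation can be sketched as below. The function name `binarize` is hypothetical.

```python
import numpy as np

def binarize(image, threshold):
    """Mark as foreground every pixel above threshold * max intensity."""
    return image > threshold * image.max()

# Random grayscale image standing in for a scan
rng = np.random.default_rng(0)
image = rng.random((32, 32))

# Fraction of foreground pixels for thresholds 0.1, 0.2, ..., 0.9
fractions = [float(binarize(image, t / 10).mean()) for t in range(1, 10)]
print([round(f, 2) for f in fractions])
```

The foreground can only shrink as the threshold grows, which is why the rightmost columns of the grids keep only the brightest structures.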

## Analysis of a single MRI scan

In [None]:
np.random.seed(seed)
# randint excludes the upper bound, so len(...) already covers every valid index
im_yes_idx = np.random.randint(0, len(train_yes))
im_no_idx = np.random.randint(0, len(train_no))

im_yes = train_yes[im_yes_idx]
im_yes_plot = plot_heatmap(im_yes)

im_no = train_no[im_no_idx]
im_no_plot = plot_heatmap(im_no)

fig = make_subplots(rows=1, cols=2, subplot_titles=['With tumor', 'Without tumor'])
fig.update_layout(width=600, height=350)
fig.add_trace(im_yes_plot['data'][0], row=1, col=1)
fig.add_trace(im_no_plot['data'][0], row=1, col=2)
fig

In [None]:
from gtda.images import RadialFiltration

radial_filtration = RadialFiltration(center=np.array([196, 196]), n_jobs=-1)
im_yes = im_yes[None, :, :]
im_no = im_no[None, :, :]

im_yes_radial = radial_filtration.fit_transform(im_yes)
im_yes_radial_plot = radial_filtration.plot(im_yes_radial, colorscale='jet')

im_no_radial = radial_filtration.fit_transform(im_no)
im_no_radial_plot = radial_filtration.plot(im_no_radial, colorscale='jet')

fig = make_subplots(rows=1, cols=2, subplot_titles=['With tumor', 'Without tumor'])
fig.update_layout(width=600, height=350)
fig.add_trace(im_yes_radial_plot['data'][0], row=1, col=1)
fig.add_trace(im_no_radial_plot['data'][0], row=1, col=2)
fig
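
The idea behind a radial filtration is to replace each foreground pixel by its distance to a chosen center, so that structures are "born" in the order in which an expanding disc reaches them. The sketch below illustrates this on a binary image, assuming Euclidean distance and a sentinel value (largest distance plus one) for background pixels. The actual `RadialFiltration` implementation may differ in details such as the sentinel and radius handling, and `radial_filtration_sketch` is a hypothetical name.

```python
import numpy as np

def radial_filtration_sketch(binary_image, center):
    """Distance to `center` for foreground pixels, a sentinel for background."""
    rows, cols = np.indices(binary_image.shape)
    distances = np.hypot(rows - center[0], cols - center[1])
    sentinel = distances.max() + 1  # larger than any real distance
    return np.where(binary_image, distances, sentinel)

# 5x5 image that is all foreground except the top-left corner
binary = np.ones((5, 5), dtype=bool)
binary[0, 0] = False

filtered = radial_filtration_sketch(binary, center=(2, 2))
print(np.round(filtered, 2))
```

Moving the center changes which parts of the image enter the filtration first, which is why the pipeline later in this notebook combines several centers.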

In [None]:
from gtda.homology import CubicalPersistence
from gtda.diagrams import Scaler

cubical_pesistence = CubicalPersistence(n_jobs=-1)
scaler = Scaler(n_jobs=-1)

im_yes_cubical = cubical_pesistence.fit_transform(im_yes_radial)
im_yes_cubical_scaled = scaler.fit_transform(im_yes_cubical)
im_yes_cubical_scaled_plot = scaler.plot(im_yes_cubical_scaled, 
                                         plotly_params={'layout':{'title':'Persistence diagram of a brain with a tumor'}})

im_no_cubical = cubical_pesistence.fit_transform(im_no_radial)
im_no_cubical_scaled = scaler.fit_transform(im_no_cubical)
im_no_cubical_scaled_plot = scaler.plot(im_no_cubical_scaled, 
                                       plotly_params={'layout':{'title':'Persistence diagram of a brain without tumors'}})

im_yes_cubical_scaled_plot.show()
im_no_cubical_scaled_plot.show()
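
giotto-tda stores each persistence diagram as an array of `(birth, death, homology dimension)` triples. The sketch below uses a small made-up diagram (the values are hypothetical, not taken from the scans) to show how lifetimes per homology dimension can be read off such an array. Points far from the diagonal, that is, with long lifetimes, are usually read as the more robust features.

```python
import numpy as np

# Hypothetical diagram for one image: rows are (birth, death, homology dimension)
diagram = np.array([
    [0.0, 0.8, 0],
    [0.1, 0.3, 0],
    [0.2, 0.9, 1],
    [0.4, 0.5, 1],
])

def lifetimes_by_dimension(diagram):
    """Map each homology dimension to the array of death - birth values."""
    result = {}
    for dim in np.unique(diagram[:, 2]).astype(int):
        points = diagram[diagram[:, 2] == dim]
        result[int(dim)] = points[:, 1] - points[:, 0]
    return result

lifetimes = lifetimes_by_dimension(diagram)
most_persistent = {dim: float(values.max()) for dim, values in lifetimes.items()}
print(most_persistent)
```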

In [None]:
from gtda.diagrams import HeatKernel

heat_kernel = HeatKernel(sigma=0.15, n_bins=60, n_jobs=-1)

im_yes_cubical_heat = heat_kernel.fit_transform(im_yes_cubical_scaled)
im_no_cubical_heat = heat_kernel.fit_transform(im_no_cubical_scaled)

im_yes_cubical_heat_plot = heat_kernel.plot(im_yes_cubical_heat, homology_dimension_idx=1, colorscale='jet')
im_no_cubical_heat_plot = heat_kernel.plot(im_no_cubical_heat, homology_dimension_idx=1, colorscale='jet')

fig = make_subplots(rows=1, cols=2, subplot_titles=['With tumor', 'Without tumor'])
fig.add_trace(im_yes_cubical_heat_plot['data'][0], row=1, col=1)
fig.add_trace(im_no_cubical_heat_plot['data'][0], row=1, col=2)
fig.update_layout(width=800, height=400)
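
The heat-kernel representation turns a diagram into an image by placing a Gaussian on every point and subtracting the same Gaussian mirrored across the diagonal. The sketch below shows that construction in plain numpy. `sigma` and `n_bins` mirror the cell above, while the grid bounds, the lack of normalisation and the name `heat_image` are assumptions, so the values will not match `HeatKernel` exactly.

```python
import numpy as np

def heat_image(points, sigma=0.15, n_bins=60, lower=0.0, upper=1.0):
    """Sum of Gaussians at (birth, death) minus their mirror images at (death, birth)."""
    axis = np.linspace(lower, upper, n_bins)
    births, deaths = np.meshgrid(axis, axis, indexing='ij')
    image = np.zeros((n_bins, n_bins))
    for birth, death in points:
        image += np.exp(-((births - birth) ** 2 + (deaths - death) ** 2) / (2 * sigma ** 2))
        # Mirrored copy across the diagonal, subtracted so the diagonal itself is zero
        image -= np.exp(-((births - death) ** 2 + (deaths - birth) ** 2) / (2 * sigma ** 2))
    return image

# Hypothetical (birth, death) pairs of a single homology dimension
points = np.array([[0.2, 0.9], [0.4, 0.5]])
image = heat_image(points)
print(image.shape, float(np.abs(np.diag(image)).max()))
```

Subtracting the mirrored Gaussian makes points close to the diagonal (short-lived features) nearly cancel out, so they contribute little to the final image.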

In [None]:
%%time
fig = make_subplots(rows=3, cols=10)

for threshold in range(1, 10):
    binarizer = Binarizer(threshold=threshold/10, n_jobs=-1)
    # im_yes already has shape (1, height, width) after the earlier cell, so it can be passed as is
    im_yes_binarized = binarizer.fit_transform(im_yes)
    im_yes_binarized_plot = binarizer.plot(im_yes_binarized)

    im_yes_binarized_radial = radial_filtration.fit_transform(im_yes_binarized)
    im_yes_binarized_radial_plot = radial_filtration.plot(im_yes_binarized_radial, colorscale='jet')
    
    im_yes_cubical = cubical_pesistence.fit_transform(im_yes_binarized_radial)
    im_yes_cubical_scaled = scaler.fit_transform(im_yes_cubical)
    im_yes_cubical_heat = heat_kernel.fit_transform(im_yes_cubical_scaled)
    im_yes_cubical_heat_plot = heat_kernel.plot(im_yes_cubical_heat, homology_dimension_idx=1, colorscale='jet')
    
    fig.add_trace(im_yes_binarized_plot['data'][0], row=1, col=threshold)
    fig.add_trace(im_yes_binarized_radial_plot['data'][0], row=2, col=threshold)
    fig.add_trace(im_yes_cubical_heat_plot['data'][0], row=3, col=threshold)
    
fig.update_layout(width=1300, height=600)

In [None]:
%%time
fig = make_subplots(rows=3, cols=10)

for threshold in range(1, 10):
    binarizer = Binarizer(threshold=threshold/10, n_jobs=-1)
    # im_no already has shape (1, height, width) after the earlier cell, so it can be passed as is
    im_no_binarized = binarizer.fit_transform(im_no)
    im_no_binarized_plot = binarizer.plot(im_no_binarized)

    im_no_binarized_radial = radial_filtration.fit_transform(im_no_binarized)
    im_no_binarized_radial_plot = radial_filtration.plot(im_no_binarized_radial, colorscale='jet')

    im_no_cubical = cubical_pesistence.fit_transform(im_no_binarized_radial)
    im_no_cubical_scaled = scaler.fit_transform(im_no_cubical)
    im_no_cubical_heat = heat_kernel.fit_transform(im_no_cubical_scaled)
    im_no_cubical_heat_plot = heat_kernel.plot(im_no_cubical_heat, homology_dimension_idx=1, colorscale='jet')

    fig.add_trace(im_no_binarized_plot['data'][0], row=1, col=threshold)
    fig.add_trace(im_no_binarized_radial_plot['data'][0], row=2, col=threshold)
    fig.add_trace(im_no_cubical_heat_plot['data'][0], row=3, col=threshold)
    
fig.update_layout(width=1300, height=600)

## Designing a pipeline

In [None]:
from itertools import product
from sklearn.pipeline import make_pipeline, make_union
from sklearn import set_config
from gtda.diagrams import PersistenceEntropy, Amplitude

threshold_iter = np.arange(0.1, 1, 0.1)
#direction_iter = product({1, -1, 0}, {1, -1, 0})
# Tuples (rather than sets) keep the order of the centers explicit
center_iter = product((28, 112, 196), repeat=2)

binarizers = [Binarizer(threshold=threshold, n_jobs=-1) for threshold in threshold_iter]
radial_filtrations = [RadialFiltration(center=np.array(center), n_jobs=-1) for center in center_iter]

steps = [
    [
        binarizer,
        radial_filtration,
        #HeatKernel(sigma=0.15, n_bins=60, n_jobs=-1),
        CubicalPersistence(n_jobs=-1),
        Scaler(n_jobs=-1)
    ] for radial_filtration in radial_filtrations
    for binarizer in binarizers
]

metric_iter = [
    {'metric':'bottleneck', 'metric_params':{}},
    {'metric':'wasserstein', 'metric_params':{'p':1}},
    {'metric':'wasserstein', 'metric_params':{'p':2}},
    {'metric':'landscape', 'metric_params':{'p':1, 'n_layers':1, 'n_bins':100}},
    {'metric':'landscape', 'metric_params':{'p':1, 'n_layers':2, 'n_bins':100}},
    {'metric':'landscape', 'metric_params':{'p':2, 'n_layers':1, 'n_bins':100}},
    {'metric':'landscape', 'metric_params':{'p':2, 'n_layers':2, 'n_bins':100}},
    {'metric':'betti', 'metric_params':{'p':1, 'n_bins':100}},
    {'metric':'betti', 'metric_params':{'p':2, 'n_bins':100}},
    {'metric':'heat', 'metric_params':{'p':1, 'sigma':1.6, 'n_bins':100}},
    {'metric':'heat', 'metric_params':{'p':1, 'sigma':3.2, 'n_bins':100}},
    {'metric':'heat', 'metric_params':{'p':2, 'sigma':1.6, 'n_bins':100}},
    {'metric':'heat', 'metric_params':{'p':2, 'sigma':3.2, 'n_bins':100}}
]
amplitudes = [Amplitude(**metric, n_jobs=-1) for metric in metric_iter]
# One entropy feature plus one amplitude per metric, computed on the same diagrams
amplitudes_union = make_union(PersistenceEntropy(nan_fill_value=-1), *amplitudes)

tda_pipeline = make_union(
    *[make_pipeline(*step, amplitudes_union) for step in steps], n_jobs=-1
)

set_config(display='diagram')
tda_pipeline
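
Besides the amplitudes, the pipeline extracts a persistence entropy from each diagram. As a minimal sketch, persistence entropy can be computed as the Shannon entropy of the lifetimes normalised to sum to one. The base-2 logarithm and the handling of empty diagrams below are assumptions, and `PersistenceEntropy` may normalise differently. The name `persistence_entropy` is hypothetical.

```python
import numpy as np

def persistence_entropy(births, deaths):
    """Shannon entropy (base 2) of the normalised lifetimes of a diagram."""
    lifetimes = np.asarray(deaths, dtype=float) - np.asarray(births, dtype=float)
    lifetimes = lifetimes[lifetimes > 0]  # ignore points on the diagonal
    if lifetimes.size == 0:
        return float('nan')  # nothing to measure
    probabilities = lifetimes / lifetimes.sum()
    return float(-(probabilities * np.log2(probabilities)).sum())

# Four equally persistent features give the maximum entropy for four points
print(persistence_entropy([0, 0, 0, 0], [1, 1, 1, 1]))
# One dominant feature gives a much lower entropy
print(persistence_entropy([0, 0, 0, 0], [1, 0.01, 0.01, 0.01]))
```

The `nan` case corresponds to diagrams without any off-diagonal point, which is the situation the `nan_fill_value=-1` argument in the cell above replaces with a fixed value.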

In [None]:
%%time
im_yes_aux = im_yes
im_yes_pipeline = tda_pipeline.fit_transform(im_yes_aux)
im_yes_pipeline.shape
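
The number of columns reported above can be checked by hand from the pipeline definition: 9 thresholds, 3 x 3 centers, 13 amplitude metrics and one entropy. The calculation below assumes that `CubicalPersistence` is used with two homology dimensions (0 and 1) and that each `Amplitude` and `PersistenceEntropy` returns one value per dimension. If either assumption does not hold for the installed giotto-tda version, the total will differ.

```python
n_thresholds = 9          # 0.1, 0.2, ..., 0.9
n_centers = 3 * 3         # every pair from (28, 112, 196)
n_amplitude_metrics = 13  # entries in metric_iter
n_entropy_features = 1
n_homology_dimensions = 2  # assumption: dimensions 0 and 1

n_pipelines = n_thresholds * n_centers
features_per_pipeline = (n_entropy_features + n_amplitude_metrics) * n_homology_dimensions
n_features = n_pipelines * features_per_pipeline

print(n_pipelines, features_per_pipeline, n_features)
```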

In [None]:
%%time
train_yes_pipeline = tda_pipeline.fit_transform(train_yes)
train_yes_pipeline.shape

In [None]:
%%time
train_no_pipeline = tda_pipeline.fit_transform(train_no)
train_no_pipeline.shape

In [None]:
train_pipeline = np.concatenate((train_no_pipeline, train_yes_pipeline))
train_pipeline.shape

In [None]:
from sklearn.model_selection import train_test_split

X = np.concatenate((tumor_positives, tumor_negatives))
y = np.concatenate((np.ones(tumor_positives.shape[0]), np.zeros(tumor_negatives.shape[0])))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
X_train.shape, X_test.shape

In [None]:
%%time
X_train_pipeline = tda_pipeline.fit_transform(X_train)
X_train_pipeline.shape
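
The cells above call `fit_transform` on each subset separately. For steps that learn parameters from the data (the pipeline includes a `Scaler`, for instance), refitting on every subset can make the resulting features not directly comparable. A common alternative is to fit once on the training data and only call `transform` on everything else. The toy transformer below is hypothetical and merely illustrates the difference; it is not part of giotto-tda or scikit-learn.

```python
import numpy as np

class MaxScaler:
    """Toy transformer: divides by the largest absolute value seen during fit."""

    def fit(self, X):
        self.scale_ = float(np.abs(X).max())
        return self

    def transform(self, X):
        return X / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

X_train_demo = np.array([[1.0, 4.0], [2.0, 8.0]])
X_test_demo = np.array([[4.0, 2.0]])

scaler_demo = MaxScaler()
train_features = scaler_demo.fit_transform(X_train_demo)
# Reuses the scale learned on the training data (8.0)
test_features = scaler_demo.transform(X_test_demo)
# Refitting on the test data learns a different scale (4.0)
refit_features = MaxScaler().fit_transform(X_test_demo)

print(test_features, refit_features)
```

With the pipeline from this notebook, the analogous pattern would be `tda_pipeline.fit_transform(X_train)` followed by `tda_pipeline.transform(X_test)`.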