### Per-image preprocessing (patient -> image)

The goal is to convert the initial dataset, in which the information is organized by patient (two images per entry), into a format with one image per row. For the disease annotations to be correct for each image, the keywords annotated for each eye must be taken into account. For each disease, we need to identify which keywords are used and assign the diagnosis to each eye based on the presence of those keywords.
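
The idea can be sketched on a toy table before working with the real data. This is only an illustration under stated assumptions: the column names follow the ones used later in this notebook, while the file names, the keyword values and the single-entry keyword set `d_keywords` are invented for the example.

```python
import re
import pandas as pd

# Toy patient-level table; all values are invented for illustration only.
patients = pd.DataFrame({
    "Left-Fundus": ["0_left.jpg", "1_left.jpg"],
    "Right-Fundus": ["0_right.jpg", "1_right.jpg"],
    "Left-Diagnostic Keywords": ["normal fundus", "normal fundus"],
    "Right-Diagnostic Keywords": ["mild nonproliferative retinopathy", "normal fundus"],
    "D": [1, 0],
})

# Hypothetical keyword set for disease D (one entry, just for the sketch).
d_keywords = {"mild nonproliferative retinopathy"}


def has_d_keyword(text):
    """True if any comma-separated keyword in `text` belongs to d_keywords."""
    return any(kw in d_keywords for kw in re.split(r"[,\uFF0C]", text))


frames = []
for side in ("Left", "Right"):
    frames.append(pd.DataFrame({
        "Fundus": patients[f"{side}-Fundus"],
        # An eye is labelled 1 only if the patient has D and this eye
        # carries one of the D keywords.
        "D_eye": [
            int(d == 1 and has_d_keyword(text))
            for d, text in zip(patients["D"], patients[f"{side}-Diagnostic Keywords"])
        ],
    }))

# One row per image instead of one row per patient.
eyes = pd.concat(frames, ignore_index=True)
print(eyes)
```

In this toy case the first patient has D, but only the right eye carries a D keyword, so only that image receives the label 1.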

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency

In [2]:
df = pd.read_excel('data.xlsx')

#### Column 'D'

We look for the keywords that are associated with disease D.

In [3]:
# most common keywords for diagnostic 'D' == 1

In [3]:
# make columns with the list of keywords for each eye (left/right)
# split on either the fullwidth (Chinese) comma or the ASCII comma
df['left_keywords'] = df['Left-Diagnostic Keywords'].str.split(r'[,\uFF0C]', regex=True)
df['right_keywords'] = df['Right-Diagnostic Keywords'].str.split(r'[,\uFF0C]', regex=True)
# make a column with list of keywords in left and right
df['keywords'] = df['left_keywords'] + df['right_keywords']
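
As a small illustration of the split above, the following sketch applies the same pattern to a few toy strings. The strings are invented. The example only shows that the character class matches both comma variants and that `+` on two Series of lists concatenates the lists row by row.

```python
import pandas as pd

# Toy keyword strings (invented); the second one uses the fullwidth comma U+FF0C.
raw = pd.Series([
    "laser spot,moderate non proliferative retinopathy",
    "normal fundus\uFF0Clens dust",
    "normal fundus",
])

# Same pattern as in the cell above: split on either comma variant.
keyword_lists = raw.str.split(r"[,\uFF0C]", regex=True)

# Element-wise `+` on two Series of lists concatenates the lists row by row,
# which is how the combined left + right keyword column is built.
combined = keyword_lists + keyword_lists

print(keyword_lists.tolist())
print(combined.tolist())
```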

Of the four most common keywords, all except 'normal fundus' are associated with D.

In [4]:
# keywords that appear with 'D'==1
# top 10 most frequent list
keywords_D = df[df['D'] == 1]['keywords'].explode().value_counts()
print(keywords_D.head(10))
# select keywords with more than 100 occurrences (related to D == 1)
keywords_D_lst = list(keywords_D.index[keywords_D > 100])
# remove 'normal fundus'
keywords_D_lst.remove('normal fundus')

keywords
moderate non proliferative retinopathy    997
mild nonproliferative retinopathy         552
normal fundus                             257
severe nonproliferative retinopathy       161
hypertensive retinopathy                   85
laser spot                                 81
macular epiretinal membrane                67
epiretinal membrane                        65
diabetic retinopathy                       56
lens dust                                  48
Name: count, dtype: int64
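
The selection logic (explode, count, threshold, drop 'normal fundus') can be sketched on toy data as follows. The rows, the 'cataract' keyword and the threshold `min_count` are invented for the example; the notebook itself keeps keywords with more than 100 occurrences.

```python
import pandas as pd

# Toy data (invented): one list of keywords per patient plus the D label.
toy = pd.DataFrame({
    "keywords": [
        ["mild nonproliferative retinopathy", "normal fundus"],
        ["mild nonproliferative retinopathy", "mild nonproliferative retinopathy"],
        ["normal fundus", "cataract"],
        ["laser spot", "normal fundus"],
    ],
    "D": [1, 1, 0, 1],
})

# One row per keyword occurrence among D == 1 patients, then count each keyword.
counts = toy.loc[toy["D"] == 1, "keywords"].explode().value_counts()

# Keep keywords seen more than `min_count` times, then drop 'normal fundus',
# which is frequent but describes a healthy eye.
min_count = 1
selected = [kw for kw in counts.index[counts > min_count] if kw != "normal fundus"]
print(selected)
```

Filtering with a list comprehension instead of `list.remove` avoids a `ValueError` if 'normal fundus' happens not to pass the threshold.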


Six other keywords are also associated with D. They are added to the list `keywords_D_lst`.

In [5]:
# check for patients D==1 and without keywords_D described above
df['keywords_D'] = df['keywords'].apply(lambda x: any(item in keywords_D_lst for item in x))
# parentheses around the comparison are required: `&` binds more tightly than `==`
print(f"Number patients with D annotated as 1 but without keyword: {df[~df['keywords_D'] & (df['D'] == 1)].shape[0]}")
df[~df['keywords_D'] & (df['D'] == 1)][['left_keywords','right_keywords']].head(5)

Number patients with D annotated as 1 but without keyword: 47


| | left_keywords | right_keywords |
|---|---|---|
| 64 | [diabetic retinopathy] | [macular epiretinal membrane, diabetic retinop... |
| 67 | [diabetic retinopathy] | [diabetic retinopathy] |
| 71 | [diabetic retinopathy] | [wet age-related macular degeneration, diabeti... |
| 108 | [retinal pigment epithelium atrophy, diabetic ... | [normal fundus] |
| 120 | [proliferative diabetic retinopathy, hypertens... | [proliferative diabetic retinopathy, hypertens... |
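
One possible way to find which keywords are still missing is to count the keywords of the patients that remain unmatched. The sketch below does this on toy data; the rows and the one-element list `known` are invented, and this is not necessarily how the additional keywords were identified in this notebook.

```python
import pandas as pd

# Toy data (invented): keyword lists per patient plus the D label.
toy = pd.DataFrame({
    "keywords": [
        ["mild nonproliferative retinopathy", "normal fundus"],
        ["diabetic retinopathy", "normal fundus"],
        ["normal fundus", "normal fundus"],
        ["diabetic retinopathy", "proliferative diabetic retinopathy"],
    ],
    "D": [1, 1, 0, 1],
})

# Hypothetical list of keywords already known to indicate D.
known = ["mild nonproliferative retinopathy"]

has_known = toy["keywords"].apply(lambda kws: any(kw in known for kw in kws))

# Parentheses are needed: `&` binds more tightly than `==`, so without them
# the expression is parsed as (~has_known & toy["D"]) == 1.
unmatched = toy[~has_known & (toy["D"] == 1)]

# Keywords of the unmatched D == 1 patients, most frequent first.
leftover = unmatched["keywords"].explode().value_counts()
print(leftover)
```

The most frequent leftover keywords are candidates to review by hand before adding them to the list.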


In [5]:
# add the extra keywords that are also related to D == 1
keywords_D_lst.append('diabetic retinopathy')
keywords_D_lst.append('proliferative diabetic retinopathy')
keywords_D_lst.append('severe proliferative diabetic retinopathy')
keywords_D_lst.append('suspicious diabetic retinopathy')
keywords_D_lst.append('suspected diabetic retinopathy')
keywords_D_lst.append('suspected moderate non proliferative retinopathy')
# Total list of keywords
print(keywords_D_lst)
df['keywords_D'] = df['keywords'].apply(lambda x: any(item in keywords_D_lst for item in x))
print(f"Number patients with D annotated as 1 but without keyword: {df[~df['keywords_D'] & (df['D'] == 1)].shape[0]}")

['moderate non proliferative retinopathy', 'mild nonproliferative retinopathy', 'severe nonproliferative retinopathy', 'diabetic retinopathy', 'proliferative diabetic retinopathy', 'severe proliferative diabetic retinopathy', 'suspicious diabetic retinopathy', 'suspected diabetic retinopathy', 'suspected moderate non proliferative retinopathy']
Number patients with D annotated as 1 but without keyword: 0


The original dataset is then converted: instead of being organized by patient, the data are organized by image. The dataframe with the required information is then saved to disk.

In [6]:
# dataframe by eyes images with annotation D_eye 0/1
def patient_to_eye(df, image_col, keywords_col, disease_col, keywords):
    """
    Converts patient-level data to eye-level data based on keywords and disease status.
    
    Args:
        df (pd.DataFrame): Input DataFrame containing patient data.
        image_col (str): Name of the column containing fundus/image data.
        keywords_col (str): Name of the column with keyword lists
        disease_col (str): Name of the column indicating disease presence
        keywords (list): List of keywords related to disease to match

    Returns:
        pd.DataFrame: Transformed DataFrame with columns:
            - 'Patient Sex' (original)
            - 'Patient Age' (original)
            - 'Fundus' (renamed from image_col)
            - '[disease_col]_eye' (binary column, 1 = disease + keyword match)
    """
    columns = ['Patient Sex', 'Patient Age', image_col, keywords_col, disease_col]
    df_sel = df[columns].copy()
    disease_col_eye = disease_col + '_eye'
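    # 1 only when the patient is annotated with the disease AND this eye has a matching keyword
    has_keyword = df_sel[keywords_col].apply(lambda kws: any(kw in keywords for kw in kws))
    df_sel[disease_col_eye] = ((df_sel[disease_col] == 1) & has_keyword).astype(int)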
    df_sel = df_sel.drop([keywords_col, disease_col], axis=1)
    df_sel.columns = ['Patient Sex', 'Patient Age', 'Fundus', disease_col_eye]
    return df_sel
    
# keywords associated with D (same nine entries as printed above, without repeats)
keywords_D_lst = ['moderate non proliferative retinopathy', 'mild nonproliferative retinopathy', 
                  'severe nonproliferative retinopathy', 'diabetic retinopathy', 
                  'proliferative diabetic retinopathy', 'severe proliferative diabetic retinopathy', 
                  'suspicious diabetic retinopathy', 'suspected diabetic retinopathy', 
                  'suspected moderate non proliferative retinopathy']
df_left = patient_to_eye(df, 'Left-Fundus', 'left_keywords', 'D', keywords_D_lst)
df_right = patient_to_eye(df, 'Right-Fundus', 'right_keywords', 'D', keywords_D_lst)
df_eye = pd.concat([df_left, df_right])
#print(df_eye)
df_eye.to_csv("data_eye_D.csv")
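
A note on the index of the concatenated frame: `pd.concat` keeps the index of each input by default, so the patient row numbers appear twice (once per eye), and `to_csv` writes that index as an unnamed first column. The sketch below, on invented toy frames, shows the default behaviour next to the `ignore_index=True` and `index=False` alternatives. Whether the repeated patient index is wanted depends on how the file is used later.

```python
import pandas as pd

# Toy per-eye frames (invented values).
left = pd.DataFrame({"Fundus": ["0_left.jpg", "1_left.jpg"], "D_eye": [0, 0]})
right = pd.DataFrame({"Fundus": ["0_right.jpg", "1_right.jpg"], "D_eye": [1, 0]})

# Default: each frame keeps its own index, so patient row numbers repeat.
kept = pd.concat([left, right])

# Alternative: renumber the rows consecutively from zero.
renumbered = pd.concat([left, right], ignore_index=True)

# With index=False the CSV contains only the data columns.
csv_text = renumbered.to_csv(index=False)

print(kept.index.tolist())
print(renumbered.index.tolist())
print(csv_text)
```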
    