This notebook will examine and process the metadata in preparation for modeling.

In [1]:
import numpy as np
import pandas as pd

In [2]:
PATH = '../data/train.csv'
df = pd.read_csv(PATH)
df.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,diagnosis,benign_malignant,target
0,ISIC_2637011,IP_7279968,male,45.0,head/neck,unknown,benign,0
1,ISIC_0015719,IP_3075186,female,45.0,upper extremity,unknown,benign,0
2,ISIC_0052212,IP_2842074,female,50.0,lower extremity,nevus,benign,0
3,ISIC_0068279,IP_6890425,female,45.0,head/neck,unknown,benign,0
4,ISIC_0074268,IP_8723313,female,55.0,upper extremity,unknown,benign,0


## Examination

There are three columns with missing data:

In [3]:
# Count missing values per column once, then keep only columns that have any
missing = df.isnull().sum()
missing[missing > 0]

sex                               65
age_approx                        68
anatom_site_general_challenge    527
dtype: int64

The missing sex data is confined to just two patients, neither of whom has a sex recorded on any image:

In [4]:
missing_sex = df.loc[df['sex'].isnull(), 'patient_id'].value_counts()
missing_sex

IP_5205991    48
IP_9835712    17
Name: patient_id, dtype: int64

In [5]:
for patient in missing_sex.index:
    print(f'\n{patient}\n', df.loc[df['patient_id'] == patient, ['image_name','sex']].count())


IP_5205991
 image_name    48
sex            0
dtype: int64

IP_9835712
 image_name    17
sex            0
dtype: int64


The missing age data is similarly confined to three patients, none of whom has an age recorded on any image:

In [6]:
missing_age = df.loc[df['age_approx'].isnull(), 'patient_id'].value_counts()
missing_age

IP_5205991    48
IP_9835712    17
IP_0550106     3
Name: patient_id, dtype: int64

In [7]:
for patient in missing_age.index:
    print(f'\n{patient}\n', df.loc[df['patient_id'] == patient, ['image_name','age_approx']].count())


IP_5205991
 image_name    48
age_approx     0
dtype: int64

IP_9835712
 image_name    17
age_approx     0
dtype: int64

IP_0550106
 image_name    3
age_approx    0
dtype: int64


As such, the missing sex and age values cannot be imputed from the same patient's other images.
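One way to confirm that missingness is all-or-nothing per patient is to compute, for each patient, the share of rows with a missing value. The sketch below runs on a small invented frame rather than `df`, and the helper name `partially_missing` is hypothetical. A patient with a share strictly between 0 and 1 would have values that could be filled from that patient's other images.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata; the IDs and values are invented for illustration.
toy = pd.DataFrame({
    'patient_id': ['IP_A', 'IP_A', 'IP_B', 'IP_B', 'IP_C'],
    'sex': [np.nan, np.nan, 'male', 'male', 'female'],
    'age_approx': [np.nan, np.nan, 45.0, np.nan, 50.0],
})

def partially_missing(frame, column):
    """Return patients for whom some, but not all, values of `column` are missing."""
    # Share of missing rows per patient: 0.0 = none missing, 1.0 = all missing.
    share = frame[column].isnull().groupby(frame['patient_id']).mean()
    return share[(share > 0) & (share < 1)].index.tolist()

partial_sex = partially_missing(toy, 'sex')         # IP_A is fully missing, so no partial cases
partial_age = partially_missing(toy, 'age_approx')  # IP_B has one of two ages missing
print(partial_sex, partial_age)
```

An empty list for a column means no patient's missing values can be recovered from their own records, which is the situation described above.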

It is difficult to argue for a modal imputation for sex, because the two categories are nearly balanced:

In [8]:
df['sex'].value_counts()

male      17080
female    15981
Name: sex, dtype: int64

However, age is roughly bell-shaped around 45 to 50, and can therefore be imputed at the mean, rounded down to the dataset's 5-year bins:

In [9]:
df['age_approx'].value_counts().sort_index(ascending=False)

90.0      80
85.0     149
80.0     419
75.0     981
70.0    1968
65.0    2527
60.0    3240
55.0    3824
50.0    4270
45.0    4466
40.0    3576
35.0    2850
30.0    2358
25.0    1544
20.0     655
15.0     132
10.0      17
0.0        2
Name: age_approx, dtype: int64

In [10]:
# Floor the mean to a multiple of 5 so it matches the existing 5-year age bins
df['age_approx'].mean() // 5 * 5

45.0
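Note that `// 5 * 5` rounds down to a multiple of 5 rather than to the nearest one. The sketch below, on invented ages, shows how the two can differ. Which is preferable here is a judgment call, and the variable names are illustrative only.

```python
import pandas as pd

# Invented ages in 5-year bins, for illustration only.
ages = pd.Series([40.0, 45.0, 50.0, 55.0, 55.0])

mean_age = float(ages.mean())        # 245 / 5 = 49.0

floored = mean_age // 5 * 5          # rounds down to the bin below the mean
nearest = round(mean_age / 5) * 5    # rounds to the closest bin

print(mean_age, floored, nearest)
```

On these values the floored result is 45 while the nearest bin is 50, so the choice can shift the imputed value by a full bin.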

The missing location data is difficult to impute from the other features, because the affected rows are spread across sex, age, and target:

In [11]:
for column in ['sex','age_approx', 'target']:
    print(f'\n{column}\n', df.loc[df['anatom_site_general_challenge'].isnull(), column].value_counts())


sex
 male      292
female    235
Name: sex, dtype: int64

age_approx
 45.0    180
35.0     48
50.0     46
40.0     45
30.0     40
55.0     39
60.0     39
65.0     27
25.0     21
70.0     19
20.0     13
75.0      7
80.0      3
Name: age_approx, dtype: int64

target
 0    518
1      9
Name: target, dtype: int64


It can, however, be imputed with the mode. There are more torso images than images of all other sites combined (16,845 versus 15,754):

In [12]:
df['anatom_site_general_challenge'].value_counts()

torso              16845
lower extremity     8417
upper extremity     4983
head/neck           1855
palms/soles          375
oral/genital         124
Name: anatom_site_general_challenge, dtype: int64
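The majority claim can be checked directly from the counts printed above. This sketch rebuilds them as a Series (the variable names are illustrative) and computes each site's share of the labeled images.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
site_counts = pd.Series({
    'torso': 16845,
    'lower extremity': 8417,
    'upper extremity': 4983,
    'head/neck': 1855,
    'palms/soles': 375,
    'oral/genital': 124,
})

site_share = site_counts / site_counts.sum()   # fraction of labeled images per site
mode_site = site_counts.idxmax()               # most frequent site
others_total = site_counts.drop(mode_site).sum()

print(mode_site, round(site_share[mode_site], 3), others_total)
```

On the real frame, `df['anatom_site_general_challenge'].value_counts(normalize=True)` gives the same shares without rebuilding the counts by hand.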

## Conclusions

* It is reasonable to impute all missing location data as "torso."
* It is reasonable to impute all missing age data at the mean, rounded down to the 5-year bin (45).
* The missing sex data will be given its own separate category.
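
A minimal sketch of how these conclusions might be applied, using a toy frame in place of `df`. The category label `'unknown'` for missing sex is an assumption, since the notebook does not name the category.

```python
import numpy as np
import pandas as pd

# Toy rows with one missing value per column; invented for illustration.
toy = pd.DataFrame({
    'sex': ['male', np.nan, 'female'],
    'age_approx': [45.0, np.nan, 50.0],
    'anatom_site_general_challenge': ['torso', 'head/neck', np.nan],
})

# One fill value per column, following the conclusions above.
fill_values = {
    'sex': 'unknown',                          # assumed label for the separate category
    'age_approx': 45.0,                        # floored mean age
    'anatom_site_general_challenge': 'torso',  # modal site
}

imputed = toy.fillna(fill_values)
print(imputed)
```

Passing a dict to `fillna` keeps the three imputation choices in one place, and it returns a new frame rather than modifying the original.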