In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import eda_functions
from scipy import stats as ss
%matplotlib inline

### EDA preprocessing

In [2]:
PATH = '../data/eda.csv'
df = eda_functions.eda_preprocessing(PATH)
df.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,target,blue_range,blue_iqr,blue_skew,blue_kurtosis,...,green_skew,green_kurtosis,green_mean,green_median,red_range,red_iqr,red_skew,red_kurtosis,red_mean,red_median
0,ISIC_2637011,IP_7279968,male,45.0,head/neck,0,285.0,98.5,0.851258,0.505571,...,0.851258,0.505571,70.3125,62.0,285.0,98.5,0.851258,0.505571,70.3125,62.0
1,ISIC_0015719,IP_3075186,female,45.0,upper extremity,0,225.0,135.25,0.529849,-1.06992,...,0.529849,-1.06992,70.3125,52.0,225.0,135.25,0.529849,-1.06992,70.3125,52.0
2,ISIC_0052212,IP_2842074,female,50.0,lower extremity,0,125.0,41.0,1.313852,0.491882,...,1.313852,0.491882,21.9375,0.0,125.0,41.0,1.313852,0.491882,21.9375,0.0
3,ISIC_0068279,IP_6890425,female,45.0,head/neck,0,111.0,29.0,1.372237,1.365879,...,1.372237,1.365879,21.9375,16.0,111.0,29.0,1.372237,1.365879,21.9375,16.0
4,ISIC_0074268,IP_8723313,female,55.0,upper extremity,0,339.0,107.25,1.123784,-0.064303,...,1.123784,-0.064303,70.3125,17.5,339.0,107.25,1.123784,-0.064303,70.3125,17.5


In [3]:
melanoma = df[df['target'] == 1]
no_melanoma = df[df['target'] == 0]

We first examined the red, green, and blue channel pixel-intensity histogram statistics by anatomical location. In the ANOVA models, every channel showed significant differences between patients depending on where on the body the image was taken. ANOVAs of the range, IQR, and mean of all three color channel intensities all produced significant F values, suggesting that patients differ from one another. This is expected, since individual patients vary.
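
As a rough illustration of the location-level comparison described above, the sketch below runs a one-way ANOVA with `scipy.stats.f_oneway` on a single channel statistic grouped by anatomical site. The helper `anova_by_location` and the toy values are hypothetical and only stand in for the real table; the column names are taken from the `df.head()` output, and this is not necessarily how `eda_functions` computes its results.

```python
import pandas as pd
from scipy import stats


def anova_by_location(frame, value_col,
                      group_col="anatom_site_general_challenge"):
    """One-way ANOVA of value_col across the levels of group_col.

    Hypothetical helper for illustration; not part of eda_functions.
    """
    # One array of observations per location, ignoring missing values.
    samples = [
        grp[value_col].dropna().to_numpy()
        for _, grp in frame.groupby(group_col)
    ]
    # Keep only groups with at least two observations.
    samples = [s for s in samples if len(s) > 1]
    if len(samples) < 2:
        return float("nan"), float("nan")
    f_value, p_value = stats.f_oneway(*samples)
    return float(f_value), float(p_value)


# Toy data (made-up values) with the same column names as the real table.
toy = pd.DataFrame({
    "anatom_site_general_challenge":
        ["torso"] * 4 + ["head/neck"] * 4 + ["lower extremity"] * 4,
    "green_mean": [60.0, 62.0, 61.0, 63.0,
                   70.0, 72.0, 71.0, 73.0,
                   20.0, 22.0, 21.0, 23.0],
})

f_value, p_value = anova_by_location(toy, "green_mean")
print(f"F value: {f_value:.3g}")
print(f"p value: {p_value:.3g}")
```

The same call could be repeated for each channel and each statistic (for example `red_range` or `blue_iqr`) to cover the combinations mentioned in the text.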

Insignificant p-values began to appear when we ran the ANOVA between patient sexes, split by location. For example, the mean of the green channel in the head/neck region gave an insignificant p-value of 0.695.

_Thanks to @ayhan for his [StackOverflow answer](https://stackoverflow.com/a/44066097/7287543)._
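
To make the "between sexes, split by location" comparison concrete, here is a minimal sketch that runs one ANOVA per location, comparing the values for each sex within it. The function `anova_within_groups` and the toy data are assumptions made for illustration; the actual `eda_functions.anova_report` may be implemented differently.

```python
import pandas as pd
from scipy import stats


def anova_within_groups(frame, value_col, grouping, comparison):
    """Run a one-way ANOVA across `comparison` levels inside each `grouping` level.

    Hypothetical helper for illustration; not the eda_functions implementation.
    Returns {grouping_level: (F value, p value)}.
    """
    results = {}
    for level, subset in frame.groupby(grouping):
        # One array per comparison level (e.g. per sex) within this location.
        samples = [
            grp[value_col].dropna().to_numpy()
            for _, grp in subset.groupby(comparison)
        ]
        samples = [s for s in samples if len(s) > 1]
        if len(samples) < 2:
            continue  # nothing to compare at this location
        f_value, p_value = stats.f_oneway(*samples)
        results[level] = (float(f_value), float(p_value))
    return results


# Toy data (made-up values): sexes differ on the torso but not on the head/neck.
toy = pd.DataFrame({
    "anatom_site_general_challenge": ["torso"] * 6 + ["head/neck"] * 6,
    "sex": ["male", "male", "male", "female", "female", "female"] * 2,
    "green_mean": [10.0, 11.0, 12.0, 30.0, 31.0, 32.0,
                   50.0, 51.0, 52.0, 50.0, 51.0, 52.0],
})

results = anova_within_groups(
    toy,
    value_col="green_mean",
    grouping="anatom_site_general_challenge",
    comparison="sex",
)
for location, (f_value, p_value) in results.items():
    print(f"Location: {location}")
    print(f"     F value: {f_value:.3g}")
    print(f"     p value: {p_value:.3g}")
```

A p-value above the chosen threshold (commonly 0.05) at a given location means the test found no evidence of a difference between sexes there; it does not show that the groups are the same.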

In [4]:
eda_functions.anova_report(no_melanoma,
                           grouping='anatom_site_general_challenge',
                           comparison='sex',
                           aggregator='mean')


Location: head/neck
  Channel: Blue
     F value: 0.154
     p value: 0.695
  Channel: Green
     F value: 0.154
     p value: 0.695
  Channel: Red
     F value: 0.154
     p value: 0.695

Location: lower extremity
  Channel: Blue
     F value: 2.1e+02
     p value: 4.91e-47
  Channel: Green
     F value: 2.1e+02
     p value: 4.91e-47
  Channel: Red
     F value: 2.1e+02
     p value: 4.91e-47

Location: oral/genital
  Channel: Blue
     F value: 2.13
     p value: 0.147
  Channel: Green
     F value: 2.13
     p value: 0.147
  Channel: Red
     F value: 2.13
     p value: 0.147

Location: palms/soles
  Channel: Blue
     F value: 0.729
     p value: 0.394
  Channel: Green
     F value: 0.729
     p value: 0.394
  Channel: Red
     F value: 0.729
     p value: 0.394

Location: torso
  Channel: Blue
     F value: 3.76
     p value: 0.0525
  Channel: Green
     F value: 3.76
     p value: 0.0525
  Channel: Red
     F value: 3.76
     p value: 0.0525

Location: upper extremity
  Channel