*This is a work in progress*

Melanoma is a type of skin cancer that is dangerous because it can spread quickly to other organs if left untreated. If caught early, however, it can be treated with minor surgery.

In this contest, you take an image of a lesion and predict the probability that it shows a malignant tumor. Each image is then given a label.

0 is benign; 1 is malignant.

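Submissions are evaluated on the area under the receiver operating characteristic (ROC) curve.

To draw the curve, the true positive rate (the probability of proper detection) is plotted against the false positive rate (the probability of false alarm) at each classification threshold.

The true positive rate (TPR) is equal to: **1 - miss rate** (the miss rate is the false negative rate)

The false positive rate (FPR) is equal to: **1 - correct rejection rate** (the correct rejection rate is the true negative rate, or specificity)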
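
As a rough illustration of the definitions above, the sketch below computes TPR and FPR at one threshold and the area under the ROC curve in plain Python. The helper names `rates_at_threshold` and `roc_auc`, the 0.5 threshold, and the toy labels and scores are assumptions made for this example; they are not part of the contest code.

```python
def rates_at_threshold(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are called malignant."""
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1          # hit
        elif label == 1:
            fn += 1          # miss
        elif predicted == 1:
            fp += 1          # false alarm
        else:
            tn += 1          # correct rejection
    tpr = tp / (tp + fn)     # equals 1 - miss rate
    fpr = fp / (fp + tn)     # equals 1 - correct rejection rate
    return tpr, fpr


def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (0 = benign, 1 = malignant) and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

tpr, fpr = rates_at_threshold(y_true, y_score, 0.5)
auc = roc_auc(y_true, y_score)
print("TPR:", tpr, "FPR:", fpr, "AUC:", auc)
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair traces the ROC curve; the pairwise-ranking form used in `roc_auc` gives the same area without building the curve explicitly.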


In [None]:
import numpy as np
import pandas as pd
import os
from os import listdir
from os.path import isfile, join
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import pydicom
from sklearn.impute import SimpleImputer
print("Complete")

# 1. The Data

File Paths:

In [None]:
train_jpeg_dir = '../input/siim-isic-melanoma-classification/jpeg/train/'
train_jpeg = [f for f in listdir(train_jpeg_dir) if isfile(join(train_jpeg_dir, f))]

test_jpeg_dir = '../input/siim-isic-melanoma-classification/jpeg/test/'
test_jpeg = [f for f in listdir(test_jpeg_dir) if isfile(join(test_jpeg_dir, f))]

In [None]:
train_dcm_dir = '../input/siim-isic-melanoma-classification/train/'
train_dcm = [f for f in listdir(train_dcm_dir) if isfile(join(train_dcm_dir, f))]

test_dcm_dir = '../input/siim-isic-melanoma-classification/test/'
test_dcm = [f for f in listdir(test_dcm_dir) if isfile(join(test_dcm_dir, f))]

In [None]:
train = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')
test = pd.read_csv('../input/siim-isic-melanoma-classification/test.csv')

Train and Test Information

In [None]:
train.head()

In [None]:
test.head()

Train: jpeg and dcm images

In [None]:
fig = plt.figure(figsize=(15, 10))
columns = 4
rows = 3
# Show the first rows*columns training JPEGs in a grid
for i in range(columns * rows):
    path = join(train_jpeg_dir, train_jpeg[i])
    fig.add_subplot(rows, columns, i + 1)  # subplot indices start at 1
    plt.imshow(mpimg.imread(path))

In [None]:
fig = plt.figure(figsize=(15, 10))
columns = 4
rows = 3
# Show the first rows*columns training DICOM images in a grid
for i in range(columns * rows):
    ds = pydicom.dcmread(join(train_dcm_dir, train_dcm[i]))
    fig.add_subplot(rows, columns, i + 1)  # subplot indices start at 1
    plt.imshow(ds.pixel_array, cmap=plt.cm.bone)

Test: jpeg and dcm images

In [None]:
fig = plt.figure(figsize=(15, 10))
columns = 4
rows = 3
# Show the first rows*columns test JPEGs in a grid
for i in range(columns * rows):
    path = join(test_jpeg_dir, test_jpeg[i])
    fig.add_subplot(rows, columns, i + 1)  # subplot indices start at 1
    plt.imshow(mpimg.imread(path))

In [None]:
fig = plt.figure(figsize=(15, 10))
columns = 4
rows = 3
# Show the first rows*columns test DICOM images in a grid
for i in range(columns * rows):
    ds = pydicom.dcmread(join(test_dcm_dir, test_dcm[i]))
    fig.add_subplot(rows, columns, i + 1)  # subplot indices start at 1
    plt.imshow(ds.pixel_array, cmap=plt.cm.bone)

# 2. Basic Graphs Representing Distribution of Data

In [None]:
features_first = ["sex", "age_approx", "anatom_site_general_challenge"]
features_train = ["diagnosis", "benign_malignant", "target"]
features = features_first + features_train

sns.set(style="ticks", color_codes=True)
fig = plt.gcf()
fig.set_size_inches(15, 10)

In [None]:
sns.set(font_scale=0.6)
for i in features_first:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=train)
    g.set(title="Count of " + i + " train")

    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=test)
    g.set(title="Count of " + i + " test")

In [None]:
sns.set(font_scale=0.6)
for i in features_train:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=train)
    g.set(title="Count of " + i + " train")

From these graphs, we can see several things:
- the sex distribution differs between the train group and the test group
- the age distributions of the train and test groups are similar, with slight differences
- most of the lesions in the train group are on the torso
- the test group has a similar distribution of lesion sites to the train group
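
The comparisons above are read off the count plots; a numeric side-by-side view can make them easier to check. The sketch below is one possible way to do that with pandas. The helper `compare_proportions` and the toy frames are hypothetical and only stand in for `train` and `test`.

```python
import pandas as pd


def compare_proportions(train_df, test_df, column):
    """Share of each category in train and test, side by side (NaN rows are ignored)."""
    return pd.concat(
        [
            train_df[column].value_counts(normalize=True),
            test_df[column].value_counts(normalize=True),
        ],
        axis=1,
        keys=["train", "test"],
    ).fillna(0.0)  # a category missing from one set gets a share of 0


# Toy frames with made-up values, used only to show the output shape
toy_train = pd.DataFrame({"sex": ["male", "male", "female", "male"]})
toy_test = pd.DataFrame({"sex": ["female", "female", "male", "female"]})

sex_share = compare_proportions(toy_train, toy_test, "sex")
print(sex_share)
```

With the real data, a call such as `compare_proportions(train, test, "anatom_site_general_challenge")` would give the per-site shares for both sets in one table.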

In [None]:
sns.set(font_scale=0.7)
for i in features_first:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", hue="benign_malignant", palette="ch:.25", data=train)
    g.set(title="benign_malignant for " + i)

In [None]:
sns.set(font_scale=0.7)
for i in features_first:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", hue="target", palette="ch:.25", data=train)
    g.set(title="target for " + i)

In [None]:
sns.set(font_scale=0.7)
for i in features_first:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", hue="diagnosis", palette="ch:.25", data=train)
    g.set(title="diagnosis for " + i)

Missing Values:

In [None]:
print('Train Set')
print(train.info())

In [None]:
print('Test Set')
print(test.info())

Fill null values with scikit-learn's `SimpleImputer`, replacing each missing entry with the most frequent value in its column (https://gist.github.com/wmlba/07a36758096b9462431b3e7daca3ad41)
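
To make the idea concrete, here is a minimal pandas sketch of counting missing values and filling them with each column's most frequent value. It is an illustration on a made-up frame, not the notebook's own imputation step; the `toy` frame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are made up for illustration
toy = pd.DataFrame({
    "sex": ["male", np.nan, "male", "female"],
    "age_approx": [45.0, 50.0, np.nan, 50.0],
})

missing_per_column = toy.isnull().sum()  # number of NaN entries in each column
most_frequent = toy.mode().iloc[0]       # most common value of each column
toy_filled = toy.fillna(most_frequent)   # replace each NaN with its column's mode

print(missing_per_column)
print(toy_filled)
```

Filling with the mode keeps every row but inflates the count of the most common category, which is worth keeping in mind when comparing plots before and after filling.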

In [None]:
# Despite the variable name, the strategy is the most frequent value, not the mean
imp_mean_train = SimpleImputer(strategy='most_frequent')
train_no_null = pd.DataFrame(imp_mean_train.fit_transform(train),
                             columns=train.columns, index=train.index)
train_no_null.head()

In [None]:
# Use the test imputer here (the train imputer was being refit on the test set)
imp_mean_test = SimpleImputer(strategy='most_frequent')
test_no_null = pd.DataFrame(imp_mean_test.fit_transform(test),
                            columns=test.columns, index=test.index)
test_no_null.head()

Graphs of Columns with Missing Values Filled Compared To Those Without

In [None]:
sns.set(font_scale=0.6)
for i in features:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=train)
    g.set(title="Count of " + i + " without filling missing values")

    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=train_no_null)
    g.set(title="Count of " + i + " filling missing values")

In [None]:
sns.set(font_scale=0.6)
for i in features_first:
    # catplot creates its own figure, so set the title on the grid it returns
    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=test)
    g.set(title="Count of " + i + " without filling missing values")

    g = sns.catplot(x=i, kind="count", palette="ch:.25", data=test_no_null)
    g.set(title="Count of " + i + " filling missing values")


**Attribution:**

The ISIC 2020 Challenge Dataset https://doi.org/10.34970/2020-ds01 (c) by ISDIS, 2020

Creative Commons Attribution-Non Commercial 4.0 International License.

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School.

You should have received a copy of the license along with this work.

If not, see https://creativecommons.org/licenses/by-nc/4.0/legalcode.txt.


Information about Melanoma:
https://www.skincancer.org/skin-cancer-information/melanoma/

**Inspiration**:

https://www.kaggle.com/nxrprime/siim-eda-augmentations-model-seresnet-unet/?#six

https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter