# Cancer ID Model using XGBoost

This notebook trains a classifier that predicts a skin lesion's diagnosis type, and therefore whether it is cancerous, from patient metadata (age, sex, and lesion location).

## STEP 1: Import Libraries

We will be using XGBoost, scikit-learn, NumPy, pandas, and other tools to aid us through the model development process.

The model that we will use is XGBoost's gradient-boosted tree classifier (`XGBClassifier`).

In [2]:
# ----- IMPORT NECESSARY LIBRARIES -----
import sklearn as sk                # Other Models and Data Manipulation
import sklearn.metrics              # Makes sk.metrics available
import sklearn.model_selection      # Makes sk.model_selection available
import sklearn.preprocessing        # Makes sk.preprocessing available
import numpy as np                  # Math and Calculations
import pandas as pd                 # Data Science and Storage
import matplotlib.pyplot as plt     # Data Visualization and Graphs
import kagglehub                    # Dataset Download
from xgboost import XGBClassifier   # XGBoost Model


## STEP 2: Initial Data

This step loads the data into the project and prepares it for the model to use.

In [3]:
# Step 2.1: Load Dataset
dataset_path = kagglehub.dataset_download("farjanakabirsamanta/skin-cancer-dataset") # Get Data from Kaggle Database
print("Data set path: ", dataset_path)

df = pd.read_csv(dataset_path + "/HAM10000_metadata.csv")
df.head() # See data in csv form before processing

Data set path:  C:\Users\leyan\.cache\kagglehub\datasets\farjanakabirsamanta\skin-cancer-dataset\versions\1


|   | lesion_id | image_id | dx | dx_type | age | sex | localization |
|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | bkl | histo | 80.0 | male | scalp |
| 1 | HAM_0000118 | ISIC_0025030 | bkl | histo | 80.0 | male | scalp |
| 2 | HAM_0002730 | ISIC_0026769 | bkl | histo | 80.0 | male | scalp |
| 3 | HAM_0002730 | ISIC_0025661 | bkl | histo | 80.0 | male | scalp |
| 4 | HAM_0001466 | ISIC_0031633 | bkl | histo | 75.0 | male | ear |
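
Before processing the data, it can help to check for missing values and to see how the diagnosis classes are distributed. The sketch below does this on a small stand-in frame: the column names match the metadata file, but the rows are hypothetical and only illustrate the calls. On the real data the same methods would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; only the column names come from the metadata file
sample = pd.DataFrame({
    "dx": ["bkl", "bkl", "nv", "nv", "nv", "mel"],
    "age": [80.0, 75.0, np.nan, 35.0, 40.0, 60.0],
    "sex": ["male", "male", "female", "unknown", "male", "female"],
    "localization": ["scalp", "ear", "back", "trunk", "back", "face"],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows per diagnosis class
class_counts = sample["dx"].value_counts()

# Number of rows that a filter on sex == "unknown" would remove
unknown_sex_rows = int((sample["sex"] == "unknown").sum())

print(missing_per_column)
print(class_counts)
print("Rows with unknown sex:", unknown_sex_rows)
```

A heavily skewed `class_counts` is worth knowing about before training, because it affects how an accuracy score should be read.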


In [5]:
# Step 2.2: Processing Sample Data
# Drop rows where sex is recorded as "unknown"
df = df.drop(df[df.sex == "unknown"].index)

# Encode the categorical columns as integers. LabelEncoder assigns codes in
# sorted order of the values. The encoder is refit for each column, so after
# the last call it holds the mapping for "dx" only.
label_encoder = sk.preprocessing.LabelEncoder()
df["localization"] = label_encoder.fit_transform(df["localization"])
df["sex"] = label_encoder.fit_transform(df["sex"])
df["dx"] = label_encoder.fit_transform(df["dx"])

# Resulting "dx" codes:
#   0 = akiec  (Actinic Keratoses / Bowen's Disease)
#   1 = bcc    (Basal Cell Carcinoma)
#   2 = bkl    (Benign Keratosis-like Lesions)
#   3 = df     (Dermatofibroma)
#   4 = mel    (Melanoma)
#   5 = nv     (Melanocytic Nevi)
#   6 = vasc   (Vascular Lesions)

# Split data: the features are age, sex, and localization; the target is dx
X = df[["age", "sex", "localization"]]
y = df["dx"]

# By default, train_test_split holds out 25% of the rows for testing
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X, y)

df.head()


|   | lesion_id | image_id | dx | dx_type | age | sex | localization |
|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | 2 | histo | 80.0 | 1 | 11 |
| 1 | HAM_0000118 | ISIC_0025030 | 2 | histo | 80.0 | 1 | 11 |
| 2 | HAM_0002730 | ISIC_0026769 | 2 | histo | 80.0 | 1 | 11 |
| 3 | HAM_0002730 | ISIC_0025661 | 2 | histo | 80.0 | 1 | 11 |
| 4 | HAM_0001466 | ISIC_0031633 | 2 | histo | 75.0 | 1 | 4 |
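
The integer codes in the table come from `LabelEncoder`, which sorts the distinct values and numbers them from 0. The following minimal sketch shows that behavior on the seven diagnosis codes alone and how to turn predicted integers back into names. The variable names (`encoder`, `code_to_label`, `decoded`) are illustrative and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the seven diagnosis codes in arbitrary order;
# LabelEncoder stores them sorted in classes_
encoder = LabelEncoder()
encoder.fit(["nv", "bkl", "mel", "vasc", "akiec", "df", "bcc"])

# Map each integer code to its original label
code_to_label = {int(i): str(label) for i, label in enumerate(encoder.classes_)}
print(code_to_label)

# Convert integer codes (for example, model predictions) back to labels
decoded = encoder.inverse_transform([2, 5])
print(list(decoded))
```

Keeping a separate fitted encoder per column makes it possible to decode `sex` and `localization` the same way.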


## STEP 3: Train Model

In this step, we will train an XGBoost classifier on the training split and measure its accuracy on the held-out test split.

In [6]:
# Training
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

# Predicting
predictions = xgb.predict(X_test)

# Determine Model Accuracy
print("Model Accuracy: ", sk.metrics.accuracy_score(y_test, predictions))

Model Accuracy:  0.704417670682731
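
A single accuracy number is easier to interpret next to a trivial baseline. The following is a minimal sketch, not part of the original notebook: it builds hypothetical imbalanced labels (`y_demo`) with three random features (`X_demo`), makes a stratified and seeded split, and scores scikit-learn's `DummyClassifier`, which always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 7 classes, class 5 strongly dominant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.repeat(np.arange(7), [30, 50, 110, 10, 110, 670, 20])

# stratify keeps class proportions equal in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Baseline that ignores the features and predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_predictions = baseline.predict(X_te)
baseline_accuracy = accuracy_score(y_te, baseline_predictions)

print("Baseline Accuracy: ", baseline_accuracy)
```

Running the same comparison on `X_train`, `X_test`, `y_train`, and `y_test` would show how much the XGBoost model gains over always guessing the majority class.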


## Visualizing Accuracy

After training the model, it helps to look at individual predictions next to the true labels to see how the model is performing.

In [7]:
results = pd.DataFrame({
    'Age':  X_test.age,
    'Sex':  X_test.sex,
    'Localization': X_test.localization,
    'Output':   predictions,
    'Actual':   y_test
})

results.head(50)

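|   | Age | Sex | Localization | Output | Actual |
|---|---|---|---|---|---|
| 8333 | 35.0 | 0 | 9 | 5 | 5 |
| 1161 | 40.0 | 1 | 9 | 5 | 3 |
| 2609 | 70.0 | 1 | 2 | 5 | 1 |
| 9144 | 55.0 | 1 | 14 | 5 | 5 |
| 9352 | 35.0 | 1 | 2 | 5 | 5 |
| 5825 | 45.0 | 1 | 14 | 5 | 5 |
| 8589 | 50.0 | 1 | 9 | 5 | 5 |
| 5766 | 60.0 | 0 | 2 | 5 | 5 |
| 9809 | 45.0 | 1 | 5 | 2 | 0 |
| 69 | 70.0 | 0 | 0 | 5 | 2 |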
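
Scanning rows one by one gets tedious, so a cross-tabulation of actual against predicted classes is a compact alternative. The sketch below is an illustration rather than part of the notebook, and it uses only the ten rows displayed above, so its counts describe that excerpt and not the full test set. The names `shown`, `confusion`, and `matches` are illustrative.

```python
import pandas as pd

# Output and Actual columns copied from the ten rows displayed above
shown = pd.DataFrame({
    "Output": [5, 5, 5, 5, 5, 5, 5, 5, 2, 5],
    "Actual": [5, 3, 1, 5, 5, 5, 5, 5, 0, 2],
})

# Rows are the true classes, columns are the predicted classes
confusion = pd.crosstab(shown["Actual"], shown["Output"])
print(confusion)

# Number of rows where the prediction equals the true class
matches = int((shown["Output"] == shown["Actual"]).sum())
print("Correct predictions:", matches, "of", len(shown))
```

On the full results, the same call would be `pd.crosstab(results["Actual"], results["Output"])`, which shows which classes the model rarely or never predicts.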
