# Cancer ID Model using XGBoost

This script predicts a lesion's diagnosis type (`dx`), which indicates whether it is cancerous, from patient metadata: age, sex, and lesion localization.

## STEP 1: Import Libraries

We will be using XGBoost, scikit-learn, pandas, and other tools to aid us through the model development process.

The model that we will use is XGBoost's gradient-boosted tree classifier, `XGBClassifier`.

In [9]:
# ----- IMPORT NECESSARY LIBRARIES -----
import numpy as np                  # Math and Calculations
import pandas as pd                 # Data Science and Storage
import matplotlib.pyplot as plt     # Data Visualization and Graphs
import kagglehub                    # Dataset download from Kaggle
import sklearn as sk                # Data Manipulation and Metrics
import sklearn.preprocessing        # Explicit submodule imports so that sk.preprocessing,
import sklearn.model_selection      # sk.model_selection, and sk.metrics resolve reliably
import sklearn.metrics
from xgboost import XGBClassifier   # XGBoost Model

## STEP 2: Initial Data

This step loads the data into the project and prepares it for the XGBoost model.

In [14]:
# Step 2.1: Load Dataset
dataset_path = kagglehub.dataset_download("farjanakabirsamanta/skin-cancer-dataset") # Get Data from Kaggle Database
print("Data set path: ", dataset_path)

df = pd.read_csv(dataset_path + "/HAM10000_metadata.csv")
df.head() # See data in csv form before processing

df["dx"].unique()

Data set path:  C:\Users\leyan\.cache\kagglehub\datasets\farjanakabirsamanta\skin-cancer-dataset\versions\1


array(['bkl', 'nv', 'df', 'mel', 'vasc', 'bcc', 'akiec'], dtype=object)
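
Before encoding, it can help to check how the diagnosis labels are distributed and whether any column has missing values. The sketch below is illustrative only: `toy_df` is a hypothetical frame with invented values that reuses the metadata column names, so its counts say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the metadata frame; all values are invented.
toy_df = pd.DataFrame({
    "dx": ["nv", "nv", "nv", "mel", "bkl", "nv"],
    "age": [40.0, 65.0, np.nan, 50.0, 80.0, 35.0],
    "sex": ["male", "female", "male", "unknown", "male", "female"],
    "localization": ["back", "trunk", "face", "scalp", "scalp", "back"],
})

# Share of each diagnosis label; a dominant class makes raw accuracy harder to interpret.
class_share = toy_df["dx"].value_counts(normalize=True)
print(class_share)

# Number of missing values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same two checks on `df` would show the real label shares and any missing entries before the rows are filtered and encoded.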

In [11]:
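# Step 2.2: Processing Sample Data
# Remove rows where sex is recorded as "unknown"
df = df.drop(df[df.sex == "unknown"].index)

# Encode string columns as integers, one encoder per column so each mapping can be inverted later
localization_encoder = sk.preprocessing.LabelEncoder()
sex_encoder = sk.preprocessing.LabelEncoder()
dx_encoder = sk.preprocessing.LabelEncoder()
df["localization"] = localization_encoder.fit_transform(df["localization"])
df["sex"] = sex_encoder.fit_transform(df["sex"])
df["dx"] = dx_encoder.fit_transform(df["dx"])

# Split data into features and target
X = df[["age", "sex", "localization"]]
y = df["dx"]

# Default split is 75% train / 25% test; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X, y)

df.head()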


| | lesion_id | image_id | dx | dx_type | age | sex | localization |
|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | 2 | histo | 80.0 | 1 | 11 |
| 1 | HAM_0000118 | ISIC_0025030 | 2 | histo | 80.0 | 1 | 11 |
| 2 | HAM_0002730 | ISIC_0026769 | 2 | histo | 80.0 | 1 | 11 |
| 3 | HAM_0002730 | ISIC_0025661 | 2 | histo | 80.0 | 1 | 11 |
| 4 | HAM_0001466 | ISIC_0031633 | 2 | histo | 75.0 | 1 | 4 |
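
The integer codes in the table come from `LabelEncoder`, which numbers the class names in sorted order. A minimal sketch of how the codes can be turned back into readable labels is shown here, assuming one encoder is kept per column; `toy_df` and the `encoders` dictionary are hypothetical names used only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "dx": ["nv", "mel", "bkl"],
})

# One encoder per column, kept in a dict so each mapping stays available.
encoders = {}
for column in ["sex", "dx"]:
    encoders[column] = LabelEncoder()
    toy_df[column] = encoders[column].fit_transform(toy_df[column])

# classes_ lists the original strings; the position of each string is its integer code.
print(list(encoders["dx"].classes_))

# Decode the integer codes back into the original strings.
decoded_dx = encoders["dx"].inverse_transform(toy_df["dx"])
print(list(decoded_dx))
```

Applied to the seven labels printed in Step 2.1, sorted order gives `akiec`=0, `bcc`=1, `bkl`=2, `df`=3, `mel`=4, `nv`=5, `vasc`=6, which matches the `dx` value of 2 in the rows above.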


## STEP 3: Train Model

In this step, we train the model and measure its accuracy on the held-out test set.

In [12]:
# Training
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

# Predicting
predictions = xgb.predict(X_test)

# Determine Model Accuracy
print("Model Accuracy: ", sk.metrics.accuracy_score(y_test, predictions))

X_test.head()

Model Accuracy:  0.706425702811245


| | age | sex | localization |
|---|---|---|---|
| 4163 | 40.0 | 1 | 2 |
| 9091 | 65.0 | 1 | 6 |
| 5866 | 50.0 | 0 | 12 |
| 307 | 40.0 | 1 | 9 |
| 4654 | 70.0 | 1 | 10 |
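
A single accuracy figure is easier to judge next to a trivial baseline, because a model can score well simply by predicting the most common class. The sketch below uses invented labels (`y_toy`, `X_toy` are hypothetical and not the split used above) to show how a majority-class baseline and balanced accuracy could be computed with scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented, imbalanced labels: class 5 makes up 70% of the samples.
y_toy = np.array([5] * 70 + [2] * 20 + [4] * 10)
X_toy = np.zeros((len(y_toy), 3))  # placeholder features; the baseline ignores them

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_predictions = baseline.predict(X_toy)

# Plain accuracy rewards the majority guess; balanced accuracy averages recall per class.
baseline_accuracy = accuracy_score(y_toy, baseline_predictions)
baseline_balanced = balanced_accuracy_score(y_toy, baseline_predictions)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline balanced accuracy:", baseline_balanced)
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would show how much of the accuracy printed above comes from the features rather than from class frequency.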


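## STEP 4: Visualizing Accuracy

After training the model, it helps to see how it performs on individual samples. The table below places each prediction (`Output`) next to the true label (`Actual`) for rows of the test set.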

In [13]:
results = pd.DataFrame({
    'Age':  X_test.age,
    'Sex':  X_test.sex,
    'Localization': X_test.localization,
    'Output':   predictions,
    'Actual':   y_test
})

results.head(25)

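| | Age | Sex | Localization | Output | Actual |
|---|---|---|---|---|---|
| 4163 | 40.0 | 1 | 2 | 5 | 5 |
| 9091 | 65.0 | 1 | 6 | 5 | 5 |
| 5866 | 50.0 | 0 | 12 | 5 | 5 |
| 307 | 40.0 | 1 | 9 | 5 | 2 |
| 4654 | 70.0 | 1 | 10 | 2 | 5 |
| 2649 | 85.0 | 1 | 5 | 1 | 1 |
| 5730 | 50.0 | 0 | 12 | 5 | 5 |
| 8907 | 80.0 | 1 | 11 | 2 | 5 |
| 192 | 65.0 | 0 | 5 | 2 | 2 |
| 4381 | 60.0 | 0 | 12 | 5 | 5 |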
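
Reading rows one by one only goes so far; a cross-tabulation of `Actual` against `Output` summarizes which classes get confused with which. The following is a minimal sketch on invented label codes (`toy_results` is hypothetical and not the model's real output).

```python
import pandas as pd

# Invented label codes for illustration only.
toy_results = pd.DataFrame({
    "Actual": [5, 5, 5, 2, 5, 1, 2],
    "Output": [5, 5, 5, 5, 2, 1, 2],
})

# Rows are true labels, columns are predicted labels, cells are counts.
confusion = pd.crosstab(toy_results["Actual"], toy_results["Output"])
print(confusion)

# Fraction of each true class that was predicted correctly (per-class recall).
per_class_recall = (
    toy_results["Actual"].eq(toy_results["Output"])
    .groupby(toy_results["Actual"])
    .mean()
)
print(per_class_recall)
```

The same two calls on the `results` frame would give a compact view of where the classifier's errors concentrate.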
