You are provided with a breast cancer dataset (Breast_Cancer_Data.csv), originally taken from the UCI data repository:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).

This dataset contains approximately 683 patient records, each with 10 features and 1 class label indicating whether the patient has cancer. Each row describes one patient, and the class column indicates whether the patient's tumor is benign (label = 2) or malignant (label = 4). For this dataset, build all the classification models listed below (using Python and scikit-learn; no visualization is needed) and tabulate the accuracy and confusion matrix obtained for each. Split the dataset so that the test set is 25% of the total dataset.

8. XGBoost

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# Load the dataset from the UCI repository (the raw file has no header row, so column names are supplied)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
column_names = ['id', 'clump_thickness', 'cell_size_uniformity', 'cell_shape_uniformity', 'marginal_adhesion', 
                'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']
data = pd.read_csv(url, header=None, names=column_names)


In [3]:
# Drop the 'id' column
data.drop('id', axis=1, inplace=True)

# Replace missing values denoted by '?' with NaN and then drop them
data.replace('?', pd.NA, inplace=True)
data.dropna(inplace=True)

# Convert all columns to numeric
data = data.apply(pd.to_numeric)
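
The cleaning step above can also be done in a single pass. The following is a minimal sketch, assuming the only non-numeric entries are the '?' placeholders: `pd.to_numeric` with `errors='coerce'` turns anything it cannot parse into NaN, and `dropna` then removes those rows. The toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw file: 'bare_nuclei' is read as
# strings because '?' marks a missing value.
toy = pd.DataFrame({
    'clump_thickness': [5, 3, 8, 1],
    'bare_nuclei': ['1', '?', '10', '2'],
    'class': [2, 2, 4, 2],
})

# Coerce every column to numeric; unparseable entries such as '?' become NaN.
cleaned = toy.apply(pd.to_numeric, errors='coerce')

# Drop rows that contained a missing value and report how many were removed.
rows_before = len(cleaned)
cleaned = cleaned.dropna()
print(f"Dropped {rows_before - len(cleaned)} of {rows_before} rows")
```

Note that coercion silently converts any unparseable value to NaN, not just '?', so it is worth checking the number of dropped rows against what is expected.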

Split the Dataset

In [10]:
# Map the target to 0/1 (benign 2 -> 0, malignant 4 -> 1); XGBClassifier expects class labels starting at 0
data['class'] = data['class'].map({2: 0, 4: 1})

# Separate features and target variable
X = data.drop('class', axis=1)
y = data['class']

# Split the dataset into training and testing sets (75% training, 25% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
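
As a quick sanity check on the split, the expected set sizes can be worked out by hand. This is a sketch under two assumptions: that 683 rows remain after dropping missing values (the count given in the task description), and that scikit-learn rounds the test count up when `test_size` is a float and `train_size` is left unset.

```python
import math

n_samples = 683    # assumed row count after dropping rows with missing values
test_size = 0.25

# Round the test share up; the remainder goes to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

Comparing these numbers with `X_train.shape` and `X_test.shape` is a cheap way to confirm that the cleaning and split behaved as intended.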


Build and Evaluate XGBoost

In [14]:
from xgboost import XGBClassifier

# Initialize the XGBoost model with a binary classification objective.
# 'logloss' is the binary log-loss metric ('mlogloss' is its multiclass counterpart).
XGBoost = XGBClassifier(eval_metric='logloss', objective='binary:logistic')

# Train the model
XGBoost.fit(X_train, y_train)

# Make predictions
y_pred = XGBoost.predict(X_test)

# Map predictions and true labels back to the original labels (2 = benign, 4 = malignant)
label_map = {0: 2, 1: 4}
y_pred = pd.Series(y_pred).map(label_map)
y_true = y_test.map(label_map)

# Evaluate the model
accuracy = accuracy_score(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)

# Print the results
print(f"XGBoost Accuracy: {accuracy}")
print(f"XGBoost Confusion Matrix:\n{conf_matrix}")

XGBoost Accuracy: 0.9532163742690059
XGBoost Confusion Matrix:
[[102   1]
 [  7  61]]
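
The reported accuracy can be recovered directly from the confusion matrix, which also yields per-class rates that accuracy alone hides. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, sorted as 2 then 4) and treats malignant (4) as the positive class; the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true label (2 = benign, 4 = malignant); columns: predicted label.
conf_matrix = [[102, 1],
               [7, 61]]

(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all test cases classified correctly
sensitivity = tp / (tp + fn)        # share of malignant cases detected
specificity = tn / (tn + fp)        # share of benign cases correctly cleared

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Here 163 of 171 test cases are correct, which matches the printed accuracy; the 7 false negatives are malignant tumors predicted as benign.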
