# Assignment: XGBoost Model

## Data: Use the Breast Cancer Wisconsin (Diagnostic) dataset 

### Steps:
- Load the Breast Cancer Wisconsin (Diagnostic) data into a Pandas DataFrame.
- Split the data into a training set and a test set.
- Create an XGBoost model.
- Fit the model to the training set.
- Evaluate the model on the test set.
- Report the accuracy of the model.

Load the data into a Pandas DataFrame.
   - Import the necessary libraries: `numpy`, `pandas`, `sklearn.datasets`, and `xgboost`.
   - Load the dataset using the `load_breast_cancer()` function from `sklearn.datasets`.
   - Store the dataset in a variable called `data`.
   - Create a Pandas DataFrame named `df` from the feature matrix in `data.data`; the target labels in `data.target` are added as a separate column afterwards.
   - Assign the feature names as column names for the DataFrame using `data.feature_names`.
   - Append the target labels as a new column named 'target' in the DataFrame.

In [36]:
!pip3 install xgboost

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.head())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             
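As a quick sanity check on the loaded data, the sketch below rebuilds `df` the same way and inspects its shape, class balance, and missing values. The names `label_names`, `class_counts`, and `n_missing` are illustrative and not part of the assignment.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 569 samples; 30 feature columns plus the appended 'target' column
print(df.shape)

# Map integer labels to their names (0 = malignant, 1 = benign)
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
class_counts = df['target'].map(label_names).value_counts()
print(class_counts)

# Count missing values across the whole DataFrame
n_missing = int(df.isna().sum().sum())
print('Missing values:', n_missing)
```

The classes are moderately imbalanced (more benign than malignant cases), which is worth keeping in mind when reading a single accuracy number.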



Split the data into a training set and a test set.
   - Import the `train_test_split` function from `sklearn.model_selection`.
   - Separate the features (`df.drop('target', axis=1)`) and target labels (`df['target']`) for the split.
   - Call the `train_test_split()` function with the feature and target data, specifying a test size of 25% (`test_size=0.25`) and a random state of 42 (`random_state=42`).
   - The function returns four sets of data: `X_train`, `X_test`, `y_train`, and `y_test`, representing the features and target labels for the training and test sets.

In [39]:
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
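
To see what the split produces, the sketch below repeats it on the raw arrays and prints the resulting shapes. The second call adds `stratify=y`, an optional variation that is not part of the assignment; it keeps the class ratio roughly equal in both subsets. The `_s` suffixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# Same split as in the assignment: 25% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)

# Optional variation (an assumption, not required by the assignment):
# stratify=y preserves the proportion of each class in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
print('Share of class 1 overall:', y.mean())
print('Share of class 1 in stratified test set:', y_test_s.mean())
```

With 569 samples and `test_size=0.25`, the test set holds 143 rows and the training set 426.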


Create an XGBoost classifier model.
   - Import the `XGBClassifier` class from `xgboost`.
   - Create an instance of the `XGBClassifier` model and assign it to the variable `xgb_model`.

In [42]:
# eval_metric='logloss' evaluates with binary log loss; random_state fixes the seed for reproducibility
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)


Train the model on the training set.
   - Call the `fit()` method of the `XGBClassifier` model, passing the training features (`X_train`) and corresponding target labels (`y_train`).
   - During training, XGBoost builds an ensemble of decision trees sequentially, with each new tree fitted to reduce the errors of the trees before it.

In [45]:
xgb_model.fit(X_train, y_train)

Make predictions on the test set.
   - Use the `predict()` method of the `XGBClassifier` model to make predictions on the test features (`X_test`).
   - `predict()` returns one class label per test row; in this dataset, 0 means malignant and 1 means benign.

In [48]:
y_pred_xgb = xgb_model.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print('XGBoost Accuracy:', accuracy_xgb)

XGBoost Accuracy: 0.965034965034965


Calculate the accuracy of the model.
   - Import the `accuracy_score` function from `sklearn.metrics`.
   - Call the `accuracy_score()` function, passing the true target labels (`y_test`) and the predicted labels for the test set (`y_pred_xgb`).
   - The function returns the fraction of test samples whose predicted label matches the true label.
   - Assign the accuracy score to the variable `accuracy_xgb`.
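
To make the accuracy calculation concrete, here is a minimal sketch on a small set of made-up labels (`y_true` and `y_hat` are illustrative, not taken from the dataset). It computes the same quantity as `accuracy_score`: the number of matching labels divided by the number of samples.

```python
import numpy as np

# Hypothetical true and predicted labels for eight samples
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Accuracy = correct predictions / total predictions
n_correct = int((y_true == y_hat).sum())
accuracy = n_correct / len(y_true)
print(n_correct, 'of', len(y_true), 'correct; accuracy =', accuracy)

# Assuming the 143-row test set produced by the 25% split, an accuracy
# of about 0.9650 corresponds to 138 correct predictions
print(138 / 143)
```

Under that assumption, the XGBoost accuracy reported in this notebook is consistent with 5 misclassified test samples.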


In [51]:
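from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a Decision Tree baseline on the same training set for the later comparison
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Accuracy of the XGBoost predictions on the test set
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)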

Print the accuracy of the model.
   - Print the accuracy score to the console using `print('XGBoost Accuracy:', accuracy_xgb)`.

In [54]:
print('XGBoost Accuracy:', accuracy_xgb)

XGBoost Accuracy: 0.965034965034965


Compare the accuracy of XGBoost and the Decision Tree.
   - Compute the Decision Tree accuracy on the same test set, then compare the two scores to determine which model performed better.

In [57]:
# Decision Tree predictions and accuracy
y_pred_dt = dt_model.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print('Decision Tree Accuracy:', accuracy_dt)

# Compare the two
if accuracy_xgb > accuracy_dt:
    print(f"XGBoost outperforms Decision Tree by {accuracy_xgb - accuracy_dt:.4f} accuracy points.")
elif accuracy_dt > accuracy_xgb:
    print(f"Decision Tree outperforms XGBoost by {accuracy_dt - accuracy_xgb:.4f} accuracy points.")
else:
    print("Both models achieve the same accuracy.")

Decision Tree Accuracy: 0.951048951048951
XGBoost outperforms Decision Tree by 0.0140 accuracy points.
