# Classification

In this notebook, we demonstrate classification techniques using the [Breast Cancer Wisconsin (Diagnostic) dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data). The dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The objective is to classify each breast mass as either malignant (1) or benign (0).

## Dataset Description

| Feature Column Name   | Description                                           |
|---------------|-------------------------------------------------------|
| id            | ID number                                             |
| radius_mean   | Mean of distances from the center to points on the perimeter |
| texture_mean  | Standard deviation of gray-scale values               |
| perimeter_mean | Mean perimeter of the core tumor                     |
| area_mean     | Mean area of the core tumor                           |
| smoothness_mean | Mean of local variation in radius lengths           |
| compactness_mean | Mean of perimeter^2 / area - 1.0                   |
| concavity_mean | Mean of severity of concave portions of the contour  |
| concave points_mean | Mean number of concave portions of the contour |
| symmetry_mean | Mean symmetry                                         |
| fractal_dimension_mean | Mean fractal dimension                        |
| radius_se     | Standard error for the mean of distances from center to points on the perimeter |
| ...           | ...                                                   |
| fractal_dimension_worst | Worst or largest fractal dimension           |

|Target Column Name  | Description|
|---------------|-------------------------------------------------------|
| diagnosis     | The diagnosis of breast tissues (M = malignant, B = benign) |

*Note: In addition to the mean, the dataset contains the standard error (`_se`) and the worst (largest) value (`_worst`) of each of the ten measurements, for 30 feature columns in total.*
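
The column naming scheme can be sketched in a few lines. This is only an illustration: the base names are spelled as they appear in the CSV header displayed in this notebook (note the space in `concave points`), and the variable names `base_features`, `suffixes`, and `feature_columns` are our own.

```python
# Base measurements, spelled as in the CSV header shown in this notebook
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Each measurement is reported as a mean, a standard error, and a worst value
suffixes = ["mean", "se", "worst"]

# Build the full list of feature column names, grouped by suffix
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))
print(feature_columns[:3])
```

A list like this can be handy for checking that a loaded DataFrame contains the expected columns.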


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

pd.options.display.max_columns = None
pd.options.display.max_colwidth = 1000
# Load the dataset
data = pd.read_csv('breast-cancer.csv')

# Display the first few rows of the dataset
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Data Prep

In [2]:
# Drop the non-predictive identifier column; drop() returns a new DataFrame, so reassign
data = data.drop('id', axis=1)

# Encode the categorical target (B -> 0, M -> 1)
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])

# Display the first few rows of the dataset after encoding
data.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
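
To see why `M` becomes 1 and `B` becomes 0, it helps to look at how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order. The sketch below uses a handful of toy labels (not rows from the dataset) to show the mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only, not rows from the dataset
labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is "B" and position 1 is "M"
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

Because malignant ends up as the positive class (1), metrics such as recall are computed with respect to malignant cases by default.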


## Classification Models

We will apply the following classification models to the Breast Cancer Wisconsin (Diagnostic) dataset:

1. Logistic Regression: It's a simple linear model for classification, making it a good baseline model. It's easy to interpret and works well when the relationship between features and the target variable is approximately linear.
2. XGBoost: XGBoost is a gradient boosted decision tree algorithm. It can handle a mix of feature types and scales, and is effective at identifying complex patterns.
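3. Random Forest: A Random Forest is an ensemble of decision trees trained on bootstrapped samples. Averaging many trees makes it robust to outliers and noise and less prone to overfitting than a single tree. Class imbalance may still need explicit handling, for example through the `class_weight` parameter.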
4. Support Vector Machine (SVC): The SVM classifier is effective when there is a clear margin of separation between classes. It works well for high-dimensional data.
5. K-Nearest Neighbors: KNN is a simple instance-based learning algorithm that can be effective when there are clear clusters of classes in the feature space.
6. Gaussian Naive Bayes: GaussianNB assumes that the features are normally distributed and are conditionally independent given the class. It works well when these assumptions hold.
7. Decision Tree: Decision trees are easy to interpret and can capture complex patterns in the data. They may overfit if not properly tuned.

We will use GridSearchCV to optimize the hyperparameters of each model and evaluate their performance.
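
SVC and K-Nearest Neighbors rely on distances between samples, and the features here span very different ranges (for example, `area_worst` is in the thousands while `smoothness_se` is around 0.005). One common way to account for this is to standardize the features inside a `Pipeline`, so the scaler is fitted only on the training part of each cross-validation fold. The sketch below is an illustration under assumptions, not the method used in this notebook: it uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV, and a deliberately small grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled copy of the dataset.
# It encodes malignant as 0, so flip it to match this notebook (malignant = 1).
bunch = load_breast_cancer()
X, y = bunch.data, (bunch.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is part of the pipeline, so each CV fold is scaled
# using statistics from its own training portion only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to `KNeighborsClassifier`; tree-based models such as Random Forest and XGBoost are not affected by feature scale.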


In [3]:
# Split the data into features (X) and target (y) columns.
# 'id' is an identifier, not a feature; errors="ignore" keeps this safe if it was already dropped.
X = data.drop(columns=["diagnosis", "id"], errors="ignore")
y = data["diagnosis"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (455, 30)
X_test shape: (114, 30)


In [6]:
#!pip install xgboost
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import warnings

# Create a dictionary to store the classifiers and their respective hyperparameters for GridSearchCV
classifiers = {
#    "LogisticRegression": {
#        "model": LogisticRegression(),
#        "params": {
#            "C": [0.1, 1, 10, 100],
#            "solver": ["lbfgs", "liblinear", "sag", "saga"] #"newton-cg",
#        }
#    },
    "XGBClassifier": {
        "model": XGBClassifier(),
        "params": {
            "learning_rate": [0.01, 0.1, 0.2],
            "n_estimators": [50, 100, 200],
            "max_depth": [3, 5, 7]
        }
    },
    "RandomForestClassifier": {
        "model": RandomForestClassifier(),
        "params": {
            "n_estimators": [10, 50, 100, 200],
            "max_depth": [None, 10, 20, 30],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 4]
        }
    },
    "SVC": {
        "model": SVC(),
        "params": {
            "C": [0.1, 1, 10, 100],
            "kernel": ["linear", "rbf", "poly", "sigmoid"],
            "degree": [2, 3, 4, 5]  # only affects the "poly" kernel; ignored by the others
        }
    },
    "KNeighborsClassifier": {
        "model": KNeighborsClassifier(),
        "params": {
            "n_neighbors": [3, 5, 7, 9],
            "weights": ["uniform", "distance"],
            "metric": ["euclidean", "manhattan", "minkowski"]  # "minkowski" with the default p=2 equals "euclidean"
        }
    },
    "GaussianNB": {
        "model": GaussianNB(),
        "params": {
            "var_smoothing": [1e-9, 1e-8, 1e-7]
        }
    },
    "DecisionTreeClassifier": {
        "model": DecisionTreeClassifier(),
        "params": {
            "criterion": ["gini", "entropy"],
            "max_depth": [None, 10, 20, 30],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 4]
        }
    }
}
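
Before running the search, it is worth estimating how much work each grid implies, since `GridSearchCV` fits every parameter combination once per fold. The sketch below counts the combinations with `ParameterGrid`. The grids are copied from the `classifiers` dictionary above; `param_grids`, `cv_folds`, and `fit_counts` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grids copied from the classifiers dictionary above
param_grids = {
    "XGBClassifier": {
        "learning_rate": [0.01, 0.1, 0.2],
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 7],
    },
    "RandomForestClassifier": {
        "n_estimators": [10, 50, 100, 200],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "SVC": {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "rbf", "poly", "sigmoid"],
        "degree": [2, 3, 4, 5],
    },
    "KNeighborsClassifier": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "minkowski"],
    },
    "GaussianNB": {"var_smoothing": [1e-9, 1e-8, 1e-7]},
    "DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
}

cv_folds = 5  # matches cv=5 in the search
fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(ParameterGrid(grid))
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fit_counts[name]} fits")

print("Total fits:", sum(fit_counts.values()))
```

The Random Forest grid alone accounts for 720 of the 1,670 fits, which helps explain why the search can take a while.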

In [None]:

# Perform GridSearchCV for each classifier
warnings.filterwarnings("ignore")
results = {}
for classifier_name, classifier_info in classifiers.items():
    grid = GridSearchCV(estimator=classifier_info["model"],
                        param_grid=classifier_info["params"],
                        scoring="recall",
                        cv=5)
    grid.fit(X_train, y_train)
    results[classifier_name] = {
        "best_score": grid.best_score_,
        "best_params": grid.best_params_
    }
    print(f"Done with {classifier_name}.")

# Display the results in a DataFrame
results_df = pd.DataFrame(results).T
results_df

Done with XGBClassifier.
Done with RandomForestClassifier.
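
The search above reports cross-validated recall on the training data only. A natural follow-up is to check the selected model on the held-out test set. The sketch below shows one way to do that. It is an illustration under assumptions: it uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV, a single decision tree with a reduced grid to keep it fast, and a fixed `random_state` that the notebook's own models do not set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled copy of the dataset.
# It encodes malignant as 0, so flip it to match this notebook (malignant = 1).
bunch = load_breast_cancer()
X, y = bunch.data, (bunch.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [None, 10], "min_samples_leaf": [1, 2, 4]},
    scoring="recall",
    cv=5,
)
grid.fit(X_train, y_train)

# With refit=True (the default), the best configuration is retrained on all of
# X_train, and grid.predict delegates to that refitted model.
y_pred = grid.predict(X_test)
test_recall = recall_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(f"CV recall: {grid.best_score_:.3f}  test recall: {test_recall:.3f}")
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

A large gap between the cross-validated score and the test score would suggest that the selected hyperparameters overfit the training folds.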


In [None]:
results_df

## Real-World Scenarios Where Recall Is Used as an Evaluation Metric

Recall measures the fraction of actual positive cases that a model correctly identifies, which makes it an important metric for judging the effectiveness of machine learning models in many real-world settings. It should generally be considered together with precision, because improving one often comes at the expense of the other. Medical diagnosis is a typical example.

In medical diagnosis, recall should be as high as possible, because false negatives can be life-threatening. A lower recall means more false negatives: patients who actually have the disease are told that they do not. Such patients may be falsely reassured and take no further action, which allows the disease to progress and can ultimately prove fatal.

Consider the detection of breast cancer through mammography screening. ML models can be trained on large datasets of mammography images to assist radiologists in interpreting them. A high recall is important in this scenario because it indicates that the model identifies nearly all actual cases of breast cancer, including those that may be difficult for a human radiologist to detect. A model with low recall may miss some cases, leading to delayed diagnosis and potentially worse outcomes for patients.
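
To make the definition concrete, recall is TP / (TP + FN), where TP counts malignant cases the model flags and FN counts malignant cases it misses. The sketch below works this out by hand on a small set of hypothetical labels (not taken from the dataset) and compares the result with scikit-learn's `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Count the outcomes that matter for recall and precision
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # malignant, flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # malignant, missed
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # benign, flagged

recall = tp / (tp + fn)      # 3 / (3 + 1) = 0.75
precision = tp / (tp + fp)   # 3 / (3 + 1) = 0.75

print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
```

Here one of the four malignant cases is missed, so recall is 0.75 even though 8 of the 10 predictions are correct.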