# **Classification Task: *Breast Cancer Classification***

#### This project builds a machine learning classifier to distinguish benign from malignant breast masses. It uses the classic ***Breast Cancer Wisconsin (Diagnostic) Data Set*** for binary classification, a task that can support early and accurate cancer diagnosis.

## **Step 1: Import Libraries**
#### *The basic libraries that are required for data analysis and machine learning.*

In [None]:
import pandas as pd              # for handling data
import numpy as np               # for numerical operations
import matplotlib.pyplot as plt  # for data visualization
import seaborn as sns
#import sklearn                   # scikit-learn library (machine learning tools)
from sklearn.model_selection import train_test_split # split dataset into training and testing sets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression  # machine learning algorithm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, roc_auc_score, precision_recall_curve,
                             ConfusionMatrixDisplay, auc)

import warnings
warnings.filterwarnings('ignore')  # silences all warnings, including scikit-learn's ConvergenceWarning
sns.set(style='whitegrid')       # Set style for better visualization

## **Step 2: Load and Check Data**
#### *Load the dataset and look at the first few rows to understand the structure.*

In [None]:
classify = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

In [None]:
# Drop the id column (an identifier, not a feature) and the stray
# 'Unnamed: 32' column; errors='ignore' skips any column that is absent.
classify = classify.drop(columns=['id', 'Unnamed: 32'], errors='ignore')

print("\nTarget value counts:")
display(classify['diagnosis'].value_counts())

In [None]:
classify.info()  # column names, dtypes, and non-null counts

In [None]:
classify.head() #first 5 rows of the dataset

In [None]:
classify.tail() #last 5 rows of the dataset

In [None]:
classify.shape  # number of rows and columns in the dataset

In [None]:
classify.describe() #statistical summary of numerical columns

## **Step 3: Split the data into Training and Testing Sets**
#### *We separate the features (X) and the target (y).*
#### *- Encode target (M=malignant, B=benign)*  
#### *- Check missing values*
#### *- Create feature matrix X and target y*

In [None]:
# Encode diagnosis to 0/1: B=0, M=1
classify['target'] = classify['diagnosis'].map({'B':0, 'M':1})

# Drop original diagnosis column
classify = classify.drop('diagnosis', axis=1)

# Check missing
print("Missing values per column:\n", classify.isnull().sum().sort_values(ascending=False).head())

# Prepare X,y
X = classify.drop('target', axis=1)
y = classify['target']

print("Features:", X.shape[1])
print("Rows:", X.shape[0])

### **Split into 80% training and 20% testing data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42, stratify=y)

In [None]:
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Train class distribution:\n", y_train.value_counts(normalize=True))
print("Test class distribution:\n", y_test.value_counts(normalize=True))
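
To make the effect of `stratify=y` concrete, here is a minimal standard-library sketch of a stratified split. The helper `stratified_split` and the toy labels are hypothetical and only illustrate the idea of splitting each class at the same ratio; `train_test_split` remains the tool used in this notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 60 benign (0) and 40 malignant (1)
toy_labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(toy_labels)

train_share = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(f"Malignant share, train: {train_share:.2f}, test: {test_share:.2f}")
```

Because each class is split separately, the malignant share is the same in both subsets, which is what the class-distribution printout above checks for the real data.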

In [None]:
X

In [None]:
y

## **Step 4: Train - Model 1: *Logistic Regression***
#### *We create the model and fit it (train it) on the training data.*

In [None]:
# max_iter=1000 raises the solver's iteration cap (default 100); on unscaled
# features this helps but does not guarantee convergence.
model_log = LogisticRegression(max_iter=1000)
model_log.fit(X_train.fillna(0), y_train)
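
`StandardScaler` and `Pipeline` are imported in Step 1 but not used above, and logistic regression tends to converge more easily on standardized features. The sketch below shows one way they could be combined. It runs on scikit-learn's bundled copy of the same dataset (`load_breast_cancer`) so that it is self-contained; the `_demo`, `Xd_`, and `yd_` names are assumptions made for this example and do not replace `model_log`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy encodes malignant as 0 and benign as 1;
# flip it so that malignant = 1, matching the encoding used in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training data only, then applied to the test data.
scaled_log = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scaled_log.fit(Xd_train, yd_train)

demo_score = scaled_log.score(Xd_test, yd_test)
print("Scaled pipeline test accuracy:", round(demo_score, 3))
```

Keeping the scaler inside the pipeline means the test set never influences the scaling statistics, which avoids a subtle form of data leakage.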

## **Step 5: Make Predictions - Model 1: *Logistic Regression***
#### *Use the trained model to predict on the test set.*

In [None]:
y_pred_log = model_log.predict(X_test.fillna(0))

y_prob_log = model_log.predict_proba(X_test.fillna(0))[:, 1]  # for ROC/AUC

## **Step 6: Evaluate Model - Model 1: *Logistic Regression*** 
#### *Accuracy is the fraction of test predictions that were correct.*
#### *The confusion matrix counts correct and incorrect predictions for each class.*

In [None]:
accuracy_log = accuracy_score(y_test, y_pred_log)
cm_log = confusion_matrix(y_test, y_pred_log)

In [None]:
print("Accuracy of the model:", round(accuracy_log, 3))
print("\nConfusion Matrix:")
print(cm_log)
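
The printed matrix is easier to read once its layout is spelled out. The sketch below uses hypothetical counts (not this notebook's output) arranged the way scikit-learn arranges a binary confusion matrix: rows are actual classes, columns are predicted classes, with class 0 (benign) first.

```python
# Hypothetical counts for illustration only, not this notebook's output.
# Layout follows scikit-learn: rows = actual class, columns = predicted class.
toy_cm = [
    [70, 2],   # actual benign:    70 predicted benign, 2 predicted malignant
    [3, 39],   # actual malignant:  3 predicted benign, 39 predicted malignant
]

(tn, fp), (fn, tp) = toy_cm
total = tn + fp + fn + tp
toy_accuracy = (tn + tp) / total   # correct predictions lie on the diagonal

print("Missed malignant cases (false negatives):", fn)
print("Benign cases flagged as malignant (false positives):", fp)
print("Accuracy:", round(toy_accuracy, 3))
```

In a diagnostic setting the two error types are not equally costly, so the off-diagonal cells deserve as much attention as the accuracy figure.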

## **Step 7: Display Classification Report - Model 1: *Logistic Regression***
#### *Evaluate the Logistic Regression model performance with precision, recall, f1-score for each class.*

In [None]:
print("\nClassification Report:")    # Printing detailed evaluation metrics
print(classification_report(y_test, y_pred_log))
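
For reference, the three per-class numbers in the report can be derived from confusion-matrix counts. This is a minimal sketch with hypothetical counts for the malignant class; the `toy_` names are introduced only for this example.

```python
# Hypothetical counts for the malignant (positive) class, for illustration only.
tp, fp, fn = 39, 2, 3

# Of the tumors predicted malignant, what fraction really were malignant?
toy_precision = tp / (tp + fp)
# Of the tumors that really were malignant, what fraction did the model catch?
toy_recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(f"precision={toy_precision:.3f} recall={toy_recall:.3f} f1={toy_f1:.3f}")
```

Recall on the malignant class is the share of cancers the model catches, so it is usually the number to watch first for this task.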

## **Step 8: Visualize - Model 1: *Logistic Regression***

## -------- 8.1. *Confusion Matrix*
#### *Heatmap helps understand prediction correctness visually.*

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(cm_log, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.title("Confusion Matrix Heatmap")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()

## -------- 8.2. *ROC Curve*
#### *Visualize trade-off between True Positive Rate and False Positive Rate*

In [None]:
y_prob_log = model_log.predict_proba(X_test.fillna(0))[:,1]  # probabilities for positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob_log)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='navy', linestyle='--')
plt.title('ROC Curve - Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()
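
One way to interpret the AUC value in the legend: it equals the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one. The sketch below computes that directly by comparing every positive-negative pair. `pairwise_auc` is a hypothetical helper written for this example, and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("Pairwise AUC:", toy_auc)
```

The pairwise loop is quadratic, so it is only meant for intuition; `roc_curve` with `auc` (or `roc_auc_score`) is the practical choice.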

## -------- 8.3. *Precision-Recall Curve*
#### *Visualize trade-off between Precision and Recall*

In [None]:
precision, recall, thresholds = precision_recall_curve(y_test, y_prob_log)

plt.figure(figsize=(6,4))
plt.plot(recall, precision, color='green')
plt.title('Precision-Recall Curve - Logistic Regression')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
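
Each point on the curve corresponds to one decision threshold. The sketch below recomputes precision and recall at two thresholds on made-up scores to show the trade-off. `precision_recall_at` is a hypothetical helper, and scores at or above the threshold are treated as positive.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]

for threshold in (0.8, 0.4):
    p, r = precision_recall_at(toy_labels, toy_scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold catches more malignant cases (higher recall) at the cost of more false alarms (lower precision), which is often the preferred direction for a screening task.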

## **Step 9: Compare Training vs Testing Score - Model 1: *Logistic Regression***
#### *Check how well the Logistic Regression model generalizes on unseen data.*
#### *(Check if the model is overfitting or underfitting.)*

In [None]:
train_score = model_log.score(X_train.fillna(0), y_train)
test_score = model_log.score(X_test.fillna(0), y_test)

print("Training Accuracy:", round(train_score, 3))
print("Testing Accuracy:", round(test_score, 3))
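
A single train/test comparison gives one estimate of generalization. As an optional complement, k-fold cross-validation shows how much the score varies across splits. The sketch below is not part of the original workflow: it assumes scikit-learn's bundled copy of the dataset and a scaled logistic regression, and the `_cv` names are introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold only.
cv_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=folds, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print("Mean:", round(cv_scores.mean(), 3), "Std:", round(cv_scores.std(), 3))
```

A small spread across folds suggests the test accuracy above is not an artifact of one lucky split.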

## **Step 10: Train - Model 2: *Random Forest Classifier***
#### *Tree-based models are insensitive to feature scaling, so the raw features are used; probabilities for the ROC curve are computed in Step 14.*

In [None]:
# Initialize the Random Forest Classifier
model_ran = RandomForestClassifier(n_estimators=200, random_state=42)

# Train the model
model_ran.fit(X_train.fillna(0), y_train)
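
To illustrate how a forest combines its trees: scikit-learn's `RandomForestClassifier` averages the per-tree probability estimates rather than counting one hard vote per tree. The sketch below contrasts the two approaches on hypothetical per-tree probabilities; none of these numbers come from `model_ran`.

```python
# Hypothetical malignant-class probabilities from three trees for two samples.
tree_probs = [
    [0.90, 0.45],  # tree 1: sample A, sample B
    [0.60, 0.45],  # tree 2
    [0.75, 0.90],  # tree 3
]

n_trees = len(tree_probs)
n_samples = len(tree_probs[0])

# Soft voting: average the per-tree probabilities, then threshold at 0.5.
mean_probs = [sum(tree[i] for tree in tree_probs) / n_trees for i in range(n_samples)]
soft_preds = [int(p > 0.5) for p in mean_probs]

# Hard voting, shown for contrast: each tree casts one vote.
hard_votes = [sum(int(tree[i] > 0.5) for tree in tree_probs) for i in range(n_samples)]
hard_preds = [int(v > n_trees / 2) for v in hard_votes]

print("Mean probabilities:", [round(p, 3) for p in mean_probs])
print("Soft-vote predictions:", soft_preds)
print("Hard-vote predictions:", hard_preds)
```

Sample B shows the difference: one very confident tree outweighs two mildly doubtful ones under averaging, while a plain majority vote would go the other way.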

## **Step 11: Make Predictions - Model 2: *Random Forest Classifier***

In [None]:
y_pred_ran = model_ran.predict(X_test.fillna(0))

## **Step 12: Evaluate Model - Model 2: *Random Forest Classifier*** 

In [None]:
# Calculate accuracy
accuracy_ran = accuracy_score(y_test, y_pred_ran)

# Generate confusion matrix
cm_ran = confusion_matrix(y_test, y_pred_ran)

# Print evaluation metrics (the classification report follows in Step 13)
print("Random Forest Model Accuracy:", round(accuracy_ran, 3))
print("\nConfusion Matrix:")
print(cm_ran)

## **Step 13: Display Classification Report - Model 2: *Random Forest Classifier***
#### *Evaluate the Random Forest model performance with precision, recall, f1-score for each class.*

In [None]:
print("\nClassification Report:")    # Printing detailed evaluation metrics
print(classification_report(y_test, y_pred_ran))

## **Step 14: Visualize - Model 2: *Random Forest Classifier***

## -------- 14.1. *Confusion Matrix*
#### *Heatmap helps understand prediction correctness visually.*

In [None]:
# confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm_ran, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

## -------- 14.2. *ROC Curve*
#### *Visualize trade-off between True Positive Rate and False Positive Rate*

In [None]:
y_prob_ran = model_ran.predict_proba(X_test.fillna(0))[:,1]  # probabilities for positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob_ran)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='navy', linestyle='--')
plt.title('ROC Curve - Random Forest Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()

## -------- 14.3. *Precision-Recall Curve*
#### *Visualize trade-off between Precision and Recall*

In [None]:
precision, recall, thresholds = precision_recall_curve(y_test, y_prob_ran)

plt.figure(figsize=(6,4))
plt.plot(recall, precision, color='green')
plt.title('Precision-Recall Curve - Random Forest Classifier')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

## **Step 15: Compare Training vs Testing Score - Model 2: *Random Forest Classifier***
#### *Check how well the Random Forest model generalizes on unseen data.*
#### *(Check if the model is overfitting or underfitting.)*

In [None]:
train_score = model_ran.score(X_train.fillna(0), y_train)
test_score = model_ran.score(X_test.fillna(0), y_test)

print("Training Accuracy:", round(train_score, 3))
print("Testing Accuracy:", round(test_score, 3))

## **Step 16: Compare Model Performance**

## -------- 16.1. *Accuracy*

In [None]:
# Compare accuracies
print("Logistic Regression Accuracy:", round(accuracy_log, 3))
print("Random Forest Accuracy:", round(accuracy_ran, 3))

## -------- 16.2. *Horizontal Bar Plot for Accuracy*

In [None]:
# Define models and their corresponding accuracies
models = ['Logistic Regression', 'Random Forest']
accuracies = [accuracy_log, accuracy_ran]

plt.figure(figsize=(6,4))

# Horizontal bar plot
plt.barh(models, accuracies, color=['skyblue', 'lightgreen'])
plt.xlim(0, 1)
plt.title('Model Accuracy Comparison')
plt.xlabel('Accuracy')
for i, acc in enumerate(accuracies):
    plt.text(acc + 0.005, i, str(round(acc,3)), va='center')  # annotate bars with accuracy
plt.show()
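
When two accuracies are close, it helps to know how much a single test prediction moves the score. The arithmetic below assumes the 569 rows stated in the summary and the 20% test fraction from Step 3; it does not use the models' actual results.

```python
import math

n_rows = 569                        # dataset size stated in the summary
n_test = math.ceil(0.2 * n_rows)    # train_test_split rounds the test size up
per_sample = 1 / n_test             # accuracy change caused by one prediction

print("Test samples:", n_test)
print(f"One prediction shifts accuracy by {per_sample:.4f}")
```

With a test set this small, a gap of one or two hundredths in accuracy can come down to a couple of samples, so small differences between the models should be read with caution.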

## **Summary**

In this classification task, my goal was to build machine learning models that classify breast tumors as benign or malignant using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset contains 569 samples, each with 30 numerical features extracted from digitized images of fine needle aspirates. These features describe characteristics of the cell nuclei, and the target variable indicates whether the tumor is benign (0) or malignant (1), which makes this a binary classification problem.

I started by loading the dataset into a pandas DataFrame, dropping the id column and the stray Unnamed: 32 column, and reviewing its structure with functions such as .info(), .describe(), .head(), and .tail(). This exploration confirmed that the remaining columns had no missing values and that all the features were numerical, which simplified the preprocessing stage. I then encoded the diagnosis as 0 (benign) or 1 (malignant), separated the features (X) from the target variable (y), and created a stratified 80:20 training-testing split so I could evaluate how well the models would perform on unseen data.

The first model I implemented was Logistic Regression, which is a widely used method for binary classification. I trained the model on the training set and made predictions on the test set. To evaluate its performance, I looked at accuracy, the confusion matrix, and visualizations such as the ROC curve and Precision-Recall curve. These metrics and plots helped me understand how well the model balanced true positives and false positives, and the overall accuracy showed that Logistic Regression performed strongly on this dataset.

To compare results and potentially improve performance, I then used a Random Forest Classifier. This algorithm builds multiple decision trees and combines their outputs, which helps capture more complex patterns and reduces overfitting relative to a single decision tree. After training the Random Forest model, I evaluated it using accuracy, the classification report, the confusion matrix, and a comparison of training and testing scores. The Random Forest Classifier slightly outperformed Logistic Regression, showing better predictive strength and good generalization.

I also used visualizations such as heatmaps, ROC curves, and Precision-Recall curves to better interpret the results. These plots made it easier to compare both models and understand the strengths of each one. Logistic Regression offered simplicity and interpretability, while the Random Forest model proved more robust and accurate overall.

This task helped me practice the full pipeline for a classification problem: exploring the dataset, preparing the features, choosing suitable models, training them, evaluating the results, and using visualizations to interpret performance. Both models worked well, but the Random Forest classifier showed slightly better predictive ability. Overall, this exercise strengthened my understanding of supervised classification and highlighted the importance of using multiple evaluation metrics when working with real-world datasets.