# Breast Cancer Classification Machine – Software Artefact

## Introduction
This artefact documents the full machine learning workflow used to build a breast cancer classification model. The goal is to predict whether a tumour is *Benign* or *Malignant* using the 30 numerical tumour features of the Breast Cancer Wisconsin (Diagnostic) dataset.

This notebook includes:
- Data loading and inspection  
- Cleaning and preprocessing  
- Feature selection and scaling  
- Model training  
- Evaluation  
- Exporting the trained model for deployment in a Flask web app  


## Import Libraries



In [13]:
# Import all required libraries for data handling,
# preprocessing, model training, and evaluation.



import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier



## Load Dataset



In [6]:
# Load the Breast Cancer Wisconsin Diagnostic dataset. 
# This dataset contains 30 numerical tumour features 
# and a binary diagnosis label (M = malignant, B = benign).

import pandas as pd
df = pd.read_csv("breast_cancer_diagnostic.csv")

# Display the shape and the first few rows to confirm successful loading
df.shape, df.head()



((569, 32),
          id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
 0    842302         M        17.99         10.38          122.80     1001.0   
 1    842517         M        20.57         17.77          132.90     1326.0   
 2  84300903         M        19.69         21.25          130.00     1203.0   
 3  84348301         M        11.42         20.38           77.58      386.1   
 4  84358402         M        20.29         14.34          135.10     1297.0   
 
    smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
 0          0.11840           0.27760          0.3001              0.14710   
 1          0.08474           0.07864          0.0869              0.07017   
 2          0.10960           0.15990          0.1974              0.12790   
 3          0.14250           0.28390          0.2414              0.10520   
 4          0.10030           0.13280          0.1980              0.10430   
 
    ...  radius_worst  texture_worst

In [7]:
# Display all column names, including the non-feature 'id' and 'diagnosis' columns.
df.columns


Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [21]:
# Show the list of feature names in X (X is defined in the "Define Features and Target" step).
X.columns


Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

## Inspecting Data Quality

In [None]:
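# Check dataset shape, column types, summary statistics, and missing values.
# In a notebook cell, only df.info() (which prints) and the final expression are shown.

df.shape             # (rows, columns); not displayed unless printed
df.info()            # column dtypes and non-null counts
df.describe()        # summary statistics; not displayed unless printed
df.isnull().sum()    # verify there are no missing values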


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

## Clean the Data

In [None]:
# Remove any rows containing missing values. 
# The diagnostic dataset normally has no NaNs, but this step 
# ensures the dataframe is clean if additional data is added. 
# (Alternatively, imputation could be used instead of dropping.)

df = df.dropna()  # or impute if needed

# Remove duplicate rows to prevent data leakage and ensure 
# the model is trained on unique, non-redundant samples.
df = df.drop_duplicates()
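
The comment above mentions imputation as an alternative to dropping rows. A minimal sketch of median imputation follows, assuming a toy frame whose column names mirror the dataset but whose missing value is made up for illustration (the real dataset reports no missing values).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the NaN is illustrative, not from the dataset
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 11.42, np.nan, 20.57],
    "texture_mean": [10.38, 20.38, 14.34, 17.77],
})

# Median of each numeric column; 'diagnosis' is skipped because it is not numeric
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column using the value under the matching name
imputed = toy.fillna(medians)

print(imputed["radius_mean"].tolist())
```

In a real pipeline the medians would normally be computed from the training split only and then reused on the test split, for the same reason the scaler is fitted on the training data alone.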


## Define Features and Target

In [None]:
# Define the target variable and remove non‑predictive columns. 
# 'id' is removed because it does not contain clinical information.

TARGET = "diagnosis"

X = df.drop(["id", TARGET], axis=1)     # 30 numerical features
y = df[TARGET]                          # Target labels (M/B)

# Confirm feature count
X.shape, y.shape




((569, 30), (569,))

## Train/Test Split


In [20]:
# Split the dataset into training and testing sets.
# test_size=0.2 holds out 20% of the rows (114 of 569) for evaluation.
# random_state=42 fixes the shuffle so the split is reproducible.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
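
The split above shuffles rows without regard to class. An optional variant, not used in this notebook, passes `stratify=` so that both splits keep the same benign/malignant ratio. The sketch below assumes a toy label array with the dataset's commonly cited class counts (357 benign, 212 malignant) and a dummy feature column; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 569 labels with assumed class counts and a dummy one-column feature matrix
y_demo = np.array(["B"] * 357 + ["M"] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

overall_m = np.mean(y_demo == "M")
train_m = np.mean(y_tr == "M")
test_m = np.mean(y_te == "M")
print(f"malignant share: overall={overall_m:.3f}, train={train_m:.3f}, test={test_m:.3f}")
```

Switching the notebook to a stratified split would change which rows land in the test set, so the reported metrics would need to be regenerated.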


## Scale the Data

In [22]:
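# Standardize numerical features so they have mean=0 and std=1.
# The scaler is fitted on the training set only, so no test-set statistics leak into training.
# Tree-based models such as Random Forest are largely insensitive to feature scale,
# so this step mainly matters if a scale-sensitive model is substituted later.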

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
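
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand on the five `radius_mean` values shown in the `df.head()` output. It illustrates the z-score formula and is not part of the training pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# radius_mean values from the first five rows of the dataset, as a single column
radius_mean = np.array([[17.99], [20.57], [19.69], [11.42], [20.29]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
mean = radius_mean.mean(axis=0)
std = radius_mean.std(axis=0)            # NumPy's default ddof=0 matches StandardScaler
manual = (radius_mean - mean) / std

# Same computation through scikit-learn
scaled = StandardScaler().fit_transform(radius_mean)

print(np.allclose(manual, scaled))       # True
```

In the notebook the mean and standard deviation come from `X_train` only; `scaler.transform(X_test)` reuses those training statistics rather than recomputing them.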


## Train the Model

In [17]:
# Train a RandomForestClassifier on the scaled training data.
# Random Forest averages many decision trees, handles non-linear patterns,
# and often performs well on small tabular datasets such as this one.

model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)



RandomForestClassifier(random_state=42)
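
A fitted Random Forest also exposes per-feature importance scores, which can help explain what the model relies on. The sketch below is an illustration only: it assumes scikit-learn's bundled copy of the Wisconsin Diagnostic data (`load_breast_cancer`) as a stand-in for `breast_cancer_diagnostic.csv`, fits on all rows rather than the training split, and uses that copy's feature names (for example `mean radius` rather than `radius_mean`).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled dataset stands in for the CSV used in this notebook
data = load_breast_cancer()
X_demo, y_demo = data.data, data.target

# Fitted on all rows for illustration; the notebook fits on the training split only
forest = RandomForestClassifier(random_state=42)
forest.fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order[:5]:
    print(f"{data.feature_names[idx]:<25} {forest.feature_importances_[idx]:.3f}")
```

Impurity-based importances can be split across correlated features (radius, perimeter, and area are closely related), so the ranking is best read as a rough guide.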

## Evaluate the Model

In [18]:
# Generate predictions on the test set and evaluate accuracy.
# Also display confusion matrix and classification report 
# for deeper performance analysis.

y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))



Accuracy: 0.9649122807017544

Confusion Matrix:
 [[70  1]
 [ 3 40]]

Classification Report:
               precision    recall  f1-score   support

           B       0.96      0.99      0.97        71
           M       0.98      0.93      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
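
The figures in the classification report can be recomputed directly from the confusion matrix, which makes the three missed malignant cases easier to see. The sketch below assumes scikit-learn's default layout (rows are true labels, columns are predictions, labels sorted as B then M) and treats M as the positive class.

```python
# Confusion matrix reported above: rows = true label, columns = predicted label (B, M)
cm = [[70, 1],
      [3, 40]]

tn, fp = cm[0]   # true benign: 70 predicted B, 1 predicted M
fn, tp = cm[1]   # true malignant: 3 predicted B, 40 predicted M

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_m = tp / (tp + fp)   # of tumours predicted malignant, the share that are malignant
recall_m = tp / (tp + fn)      # of malignant tumours, the share the model caught
f1_m = 2 * precision_m * recall_m / (precision_m + recall_m)

print(f"accuracy    = {accuracy:.4f}")     # 0.9649
print(f"precision M = {precision_m:.4f}")  # 0.9756
print(f"recall M    = {recall_m:.4f}")     # 0.9302
print(f"f1 M        = {f1_m:.4f}")         # 0.9524
```

In a screening context the false negatives (malignant tumours predicted as benign) are usually the costlier error, so recall for M is arguably the number to watch alongside overall accuracy.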



In [15]:
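# Consolidated version of the loading, splitting, scaling, training,
# and evaluation steps above, runnable as a single cell.
import pandas as pd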
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
df = pd.read_csv("breast_cancer_diagnostic.csv")

# Prepare data
TARGET = "diagnosis"
X = df.drop(["id", TARGET], axis=1)
y = df[TARGET]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))




## Save the Model for Flask

In [16]:
# Save the trained model and scaler so they can be loaded 
# by the Flask web application for real-time predictions.

import joblib

joblib.dump(model, "breast_cancer_unified_model.pkl")
joblib.dump(scaler, "scaler.pkl")



['scaler.pkl']
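
On the Flask side, the two saved files would be loaded and applied in the same order as in training: scale first, then predict. The sketch below shows one way this might look. `predict_diagnosis` and `LABELS` are hypothetical names, not part of the actual web app, and the stand-in model is trained on random data only so the example runs without the real `.pkl` files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

LABELS = {"B": "Benign", "M": "Malignant"}   # hypothetical display names

def predict_diagnosis(features, model_path, scaler_path):
    """Hypothetical helper: load the saved artefacts and classify one sample."""
    model = joblib.load(model_path)
    scaler = joblib.load(scaler_path)
    row = np.asarray(features, dtype=float).reshape(1, -1)   # one sample, n features
    label = model.predict(scaler.transform(row))[0]
    return LABELS[str(label)]

# Stand-in artefacts fitted on random data (assumption: 30 features, labels B/M)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 30))
y_demo = np.array(["B", "M"] * 20)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(random_state=42).fit(
    demo_scaler.transform(X_demo), y_demo
)

# Save and reload through a temporary directory to mimic the deployment round trip
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "breast_cancer_unified_model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)
    result = predict_diagnosis(X_demo[0], model_path, scaler_path)

print(result)
```

The 30 input values must arrive in the same column order as `X.columns`. Because the notebook fitted the scaler on a DataFrame, recent scikit-learn versions may warn about missing feature names when it is given a plain array; passing a DataFrame with the same column names avoids this.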

## Conclusion

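This artefact demonstrates the full machine learning workflow used to create the Breast Cancer Classification Machine. On the held-out test set of 114 samples, the Random Forest model reached an accuracy of about 96.5%, with 3 of the 43 malignant tumours misclassified as benign. The trained model was exported and integrated into a Flask web application, allowing real-time predictions through a user-friendly interface.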
