# Breast Cancer Prediction using PyTorch

This project builds a classifier that predicts whether a breast tumor is malignant or benign from features extracted from images of cell nuclei. It is a binary classification task with applications in early cancer detection.

### Objective
The main objective is to use the Breast Cancer Wisconsin (Diagnostic) Dataset to train and evaluate a neural network model using PyTorch, classifying breast tumors as benign or malignant with high accuracy.


## Step 1: Install Necessary Libraries

Install PyTorch and the other libraries needed for data processing and model building by running the cell below.


In [1]:
!pip install torch torchvision pandas scikit-learn

Defaulting to user installation because normal site-packages is not writeable


## Step 2: Import Libraries

After installation, import the libraries required for data processing, model building, and evaluation.


In [2]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


## Step 3: Load and Explore the Dataset

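The Breast Cancer Wisconsin (Diagnostic) dataset is loaded from the UCI Machine Learning Repository. It contains 569 samples, each with 30 numeric features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and a diagnosis label: benign (B) or malignant (M).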


In [3]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
columns = ["ID", "Diagnosis"] + [f"feature_{i}" for i in range(1, 31)]
data = pd.read_csv(url, header=None, names=columns)
data.drop("ID", axis=1, inplace=True)  # Drop the ID column as it is not useful for prediction
data.head()

| | Diagnosis | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | ... | feature_21 | feature_22 | feature_23 | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
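
The generic names `feature_1` to `feature_30` hide what each column measures. According to the dataset's documentation, the 30 columns are ten measurements of the cell nuclei, each reported as a mean, a standard error, and a "worst" value, in that order. The sketch below builds descriptive column names on that basis; the naming scheme (`radius_mean` and so on) and the variable names are assumptions made for illustration, not part of the original notebook.

```python
# Ten measurements per cell nucleus, in the order the dataset documentation lists them.
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension",
]
# Each measurement appears three times: mean, standard error, and "worst".
statistic_kinds = ["mean", "se", "worst"]

# Features 1-10 are means, 11-20 standard errors, 21-30 worst values.
feature_names = [f"{m}_{s}" for s in statistic_kinds for m in measurements]
descriptive_columns = ["ID", "Diagnosis"] + feature_names

print(len(descriptive_columns))
print(feature_names[0], feature_names[3], feature_names[20])
```

Passing `descriptive_columns` as `names=` to `pd.read_csv` in place of `columns` should label the table accordingly. The later steps only refer to the `ID` and `Diagnosis` columns by name, so they would not need to change.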


## Step 4: Data Preprocessing

In this step, we preprocess the data by mapping the diagnosis labels to binary values (M to 1, B to 0), standardizing the features, and splitting the data into training (70%), validation (15%), and test (15%) sets.


In [4]:
# Convert Diagnosis to binary values and separate features and labels
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
X = data.drop("Diagnosis", axis=1).values
y = data['Diagnosis'].values

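# Split dataset into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Standardize the features using statistics from the training set only,
# so that no information from the validation or test sets leaks into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)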
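
To make the preprocessing concrete, the plain-Python sketch below shows what standardization does to a single feature, z = (x - mean) / std, and how the two successive splits divide 569 samples. The feature values are made up, the helper `standardize` is hypothetical, and the sketch follows the common convention of computing the mean and standard deviation from the training rows only. The rounding rule for split sizes is our understanding of scikit-learn's behavior and is labeled as an assumption.

```python
import math
import statistics

# Made-up values for a single feature (illustrative only, not taken from the dataset).
train_values = [10.0, 12.0, 14.0, 16.0]
val_values = [11.0, 18.0]

# StandardScaler uses the population standard deviation (ddof=0), i.e. statistics.pstdev.
mean = statistics.fmean(train_values)
std = statistics.pstdev(train_values)

def standardize(values):
    """Apply z = (x - mean) / std using the training statistics."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_values)
val_scaled = standardize(val_values)

# Split sizes for 569 samples with test_size=0.3, then test_size=0.5 on the remainder.
# Assumption: scikit-learn rounds a fractional test size up.
n_samples = 569
n_temp = math.ceil(0.3 * n_samples)
n_train = n_samples - n_temp
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test
print(n_train, n_val, n_test)
```

After scaling, the training values have mean 0 and standard deviation 1, while the validation values are merely close to that range, which is expected because they were not used to compute the statistics.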

## Step 5: Create PyTorch Dataset and DataLoader

We create a custom PyTorch Dataset class to wrap our data and a DataLoader for efficient batch processing.


In [5]:
class BreastCancerDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Create datasets
train_dataset = BreastCancerDataset(X_train, y_train)
val_dataset = BreastCancerDataset(X_val, y_val)
test_dataset = BreastCancerDataset(X_test, y_test)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)
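
As a rough picture of what the `DataLoader` does with these settings, the plain-Python sketch below groups sample indices into batches of 32, optionally shuffling them first. The helper `iterate_batches` is hypothetical and only mimics the index grouping; the real `DataLoader` also stacks samples into tensors and offers options such as worker processes. The training-set size of 398 is an assumption based on a 70% split of 569 samples.

```python
import math
import random

def iterate_batches(n_items, batch_size, shuffle, seed=0):
    """Yield lists of indices, roughly how a DataLoader groups samples into batches."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]

# Assumed training-set size: 398 samples (70% of 569).
batches = list(iterate_batches(n_items=398, batch_size=32, shuffle=True))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Because 398 is not a multiple of 32, the last batch is smaller than the others. `DataLoader` behaves the same way by default, since `drop_last` is `False` unless set otherwise.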

## Step 6: Define the Neural Network Model

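We define a simple feedforward neural network using PyTorch's `nn.Module`. It has two hidden layers (64 and 32 units) with ReLU activations and a 2-unit output layer that produces one raw score (logit) per class.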


In [6]:
class BreastCancerModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(30, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 2)  # 2 output classes

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = BreastCancerModel()
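
To get a feel for the size of this network, the short sketch below counts its trainable parameters by hand from the three layer shapes. The helper `linear_param_count` is a hypothetical name introduced for illustration.

```python
# Layer sizes (inputs, outputs) of the three nn.Linear layers in the model.
layer_shapes = [(30, 64), (64, 32), (32, 2)]

def linear_param_count(n_in, n_out):
    """A fully connected layer holds an n_out x n_in weight matrix plus n_out biases."""
    return n_in * n_out + n_out

per_layer = [linear_param_count(n_in, n_out) for n_in, n_out in layer_shapes]
total_params = sum(per_layer)
print(per_layer, total_params)
```

With `model` in scope, `sum(p.numel() for p in model.parameters())` should report the same total.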

## Step 7: Define the Loss Function and Optimizer

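For this classification task, we use `nn.CrossEntropyLoss`, which takes the model's raw logits together with integer class labels, and the Adam optimizer with a learning rate of 0.001.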


In [7]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
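
To show what the loss measures, the plain-Python sketch below computes the cross-entropy for a single sample by hand: a softmax over the logits followed by the negative log of the probability assigned to the true class. The logits are made up, and the function `cross_entropy` is a hypothetical stand-in for illustration, not PyTorch's implementation.

```python
import math

def cross_entropy(logits, target):
    """Softmax followed by negative log-likelihood for one sample."""
    # Subtract the max logit for numerical stability; this does not change the result.
    shifted = [z - max(logits) for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    log_probs = [z - log_sum_exp for z in shifted]
    return -log_probs[target]

# Made-up logits for [benign, malignant]; the true class is 1 (malignant).
confident_right = cross_entropy([-2.0, 3.0], target=1)
confident_wrong = cross_entropy([3.0, -2.0], target=1)
uninformed = cross_entropy([0.0, 0.0], target=1)
print(confident_right, confident_wrong, uninformed)
```

A confident correct prediction gives a loss near zero, a confident wrong one gives a large loss, and equal logits give ln 2 (about 0.693). Since `nn.CrossEntropyLoss` applies the softmax step itself, the model's `forward` method returns the output of `fc3` without a softmax.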

## Step 8: Train the Model

We train the model for 20 epochs and evaluate it on the validation set after each epoch, reporting the average training loss, validation loss, and validation accuracy.


In [8]:
num_epochs = 20

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            val_correct += (predicted == labels).sum().item()
            val_total += labels.size(0)

    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {running_loss/len(train_loader):.4f}, Validation Loss: {val_loss/len(val_loader):.4f}, Validation Accuracy: {100 * val_correct / val_total:.2f}%")

Epoch [1/20], Train Loss: 0.5947, Validation Loss: 0.5169, Validation Accuracy: 75.29%
Epoch [2/20], Train Loss: 0.4303, Validation Loss: 0.3452, Validation Accuracy: 94.12%
Epoch [3/20], Train Loss: 0.2814, Validation Loss: 0.2117, Validation Accuracy: 95.29%
Epoch [4/20], Train Loss: 0.1870, Validation Loss: 0.1376, Validation Accuracy: 96.47%
Epoch [5/20], Train Loss: 0.1281, Validation Loss: 0.1041, Validation Accuracy: 98.82%
Epoch [6/20], Train Loss: 0.1008, Validation Loss: 0.0873, Validation Accuracy: 98.82%
Epoch [7/20], Train Loss: 0.0847, Validation Loss: 0.0797, Validation Accuracy: 98.82%
Epoch [8/20], Train Loss: 0.0741, Validation Loss: 0.0758, Validation Accuracy: 98.82%
Epoch [9/20], Train Loss: 0.0729, Validation Loss: 0.0749, Validation Accuracy: 98.82%
Epoch [10/20], Train Loss: 0.0684, Validation Loss: 0.0743, Validation Accuracy: 98.82%
Epoch [11/20], Train Loss: 0.0560, Validation Loss: 0.0749, Validation Accuracy: 98.82%
Epoch [12/20], Train Loss: 0.0521, Valida
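
In the log above, the validation loss flattens out around epoch 8 and rises slightly at epoch 11 while the training loss keeps falling. One possible use of the per-epoch validation numbers is to track which epoch performed best, for example to decide when to stop training or which checkpoint to keep. The sketch below does this for the eleven complete lines of the log; the helper names `best_epoch` and `epochs_without_improvement` are hypothetical and not part of the original notebook.

```python
# Validation losses for epochs 1-11, copied from the training log.
val_losses = [0.5169, 0.3452, 0.2117, 0.1376, 0.1041, 0.0873,
              0.0797, 0.0758, 0.0749, 0.0743, 0.0749]

def best_epoch(losses):
    """Return (1-based epoch, loss) of the lowest validation loss; the earliest epoch wins ties."""
    index = min(range(len(losses)), key=lambda i: losses[i])
    return index + 1, losses[index]

def epochs_without_improvement(losses):
    """Count the epochs that have passed since the best validation loss was seen."""
    epoch, _ = best_epoch(losses)
    return len(losses) - epoch

print(best_epoch(val_losses), epochs_without_improvement(val_losses))
```

An early-stopping rule would typically halt training once `epochs_without_improvement` exceeds a chosen patience value.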

## Step 9: Test the Model

Finally, we measure the accuracy of the trained model on the test set.

In [9]:
model.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)

print(f"Test Accuracy: {100 * test_correct / test_total:.2f}%")

Test Accuracy: 100.00%
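
The test loop picks the class with the larger logit for each sample and counts matches with the labels. The plain-Python sketch below reproduces that step on made-up logits and, as an optional extension, breaks the result down into recall and precision for the malignant class, since accuracy alone does not say which kind of error was made. The data and the helper `argmax` are illustrative assumptions, not results from the model.

```python
def argmax(values):
    """Index of the largest value, the per-row counterpart of torch.max(outputs, 1)[1]."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits and labels for six samples (0 = benign, 1 = malignant); illustrative only.
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4], [0.2, 0.1], [-1.1, 2.3]]
labels = [0, 1, 1, 0, 1, 1]

predictions = [argmax(row) for row in logits]

# Confusion-matrix counts, treating malignant (1) as the positive class.
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)        # share of malignant cases that were caught
precision = tp / (tp + fp)     # share of malignant predictions that were right
print(predictions, accuracy, recall, precision)
```

In this toy example one malignant sample is predicted as benign, which lowers recall to 0.75 even though precision stays at 1.0.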
