# Breast Cancer Prediction System - Model Development

**Name:** Olubadejo Folajuwon  
**Matric Number:** 23CG034128  
**Project:** 5 – Breast Cancer Prediction System

## Objective
Develop a machine learning model to predict whether a tumor is benign or malignant using the Breast Cancer Wisconsin (Diagnostic) dataset.

## Selected Features
1. radius_mean
2. texture_mean
3. perimeter_mean
4. area_mean
5. smoothness_mean

## Algorithm
Logistic Regression
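
As a rough illustration of what logistic regression computes: a weighted sum of the scaled features is passed through the sigmoid function, which yields a probability for class 1. The weights, bias, and feature values below are made-up numbers for illustration only, not values from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and scaled feature values (illustration only)
weights = np.array([-1.2, -0.8, -1.0, -1.1, -0.6])
bias = 0.5
x_scaled = np.array([0.3, -0.1, 0.25, 0.2, 0.0])

score = np.dot(weights, x_scaled) + bias     # linear part of the model
prob_class_1 = sigmoid(score)                # probability of class 1
predicted_class = int(prob_class_1 >= 0.5)   # default 0.5 decision threshold

print(f"P(class 1) = {prob_class_1:.3f}, predicted class = {predicted_class}")
```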

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import joblib
import os

### 1. Load Dataset

In [None]:
# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['diagnosis'] = data.target  # sklearn encoding: 0 = malignant, 1 = benign

# Confirm the mapping: the position of each name is its numeric label
print("Target Names:", data.target_names)

# Malignant is often treated as the positive class (1) in detection tasks.
# This notebook keeps sklearn's default encoding (0 = malignant, 1 = benign),
# so metrics that assume a positive class refer to benign unless stated otherwise.
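
If malignant should be the positive class, the labels can be flipped before training. A minimal sketch; the variable name `y_malignant` is an assumption and is not used elsewhere in this notebook, which keeps sklearn's default encoding.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print("Target names:", list(data.target_names))  # index 0 = malignant, 1 = benign

# Flip the labels so that 1 = malignant and 0 = benign
y_malignant = 1 - data.target

print("Malignant samples:", int(y_malignant.sum()))
print("Benign samples:", int((y_malignant == 0).sum()))
```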

### 2. Feature Selection

In [None]:
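# sklearn names these columns 'mean radius', 'mean texture', and so on.
# Rename them to the project's names so the selection below does not raise a KeyError.
feature_map = {
    'mean radius': 'radius_mean',
    'mean texture': 'texture_mean',
    'mean perimeter': 'perimeter_mean',
    'mean area': 'area_mean',
    'mean smoothness': 'smoothness_mean',
}
df = df.rename(columns=feature_map)

selected_features = list(feature_map.values())
X = df[selected_features]
y = df['diagnosis']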

# Check for missing values
print("Missing values:\n", X.isnull().sum())
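
Radius, perimeter, and area are geometrically related, so these three features can be expected to carry overlapping information. The sketch below checks this with a correlation matrix. It reloads the data and uses sklearn's own column names (`mean radius`, etc.) so that it runs on its own; `df_check` and `cols` are names chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df_check = pd.DataFrame(data.data, columns=data.feature_names)

# sklearn's names for the five selected features
cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']

# Pairwise Pearson correlation between the selected features
corr = df_check[cols].corr()
print(corr.round(2))
```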

### 3. Data Preprocessing

In [None]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

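# Feature scaling: standardize each feature to zero mean and unit variance.
# Essential for distance-based models and helpful for Logistic Regression
# (regularization and solver convergence). Fit on the training set only.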
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
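
An alternative way to keep scaling and the model together is a scikit-learn `Pipeline`, which fits the scaler on the training data only and applies it automatically at prediction time. This is a sketch, not the approach used in this notebook; `stratify=y_all` is an added assumption that keeps the class ratio the same in both splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all[cols], y_all, test_size=0.2, random_state=42, stratify=y_all
)

# The scaler is fitted inside the pipeline, on the training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```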

### 4. Model Training (Logistic Regression)

In [None]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
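
Training produces one coefficient per feature plus an intercept. The sketch below shows how to inspect them. For brevity it fits on the full dataset rather than the training split, so the numbers will differ from the notebook's model. Because several of these features are strongly related, individual coefficients should be read with caution.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']

# Standardize, then fit (full data used here only to keep the example short)
X_std = StandardScaler().fit_transform(X_all[cols])
clf = LogisticRegression().fit(X_std, y_all)

# The sign of each coefficient is relative to class 1 (benign)
coefs = pd.Series(clf.coef_[0], index=cols)
print(coefs.round(3))
print("Intercept:", round(float(clf.intercept_[0]), 3))
```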

### 5. Model Evaluation

In [None]:
y_pred = model.predict(X_test_scaled)

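# precision, recall, and F1 default to pos_label=1, which is benign in sklearn's encoding.
# Pass pos_label=0 to score the malignant class instead.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)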

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
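
For a diagnostic task, missed malignant cases are usually the costliest error, so it can help to look at the confusion matrix and at recall for the malignant class specifically. A self-contained sketch under the same split settings as this notebook (`test_size=0.2`, `random_state=42`); the variable names are chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all[cols], y_all, test_size=0.2, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows = true class, columns = predicted class, ordered [malignant (0), benign (1)]
cm = confusion_matrix(y_te, pred, labels=[0, 1])
print(cm)

# Recall for the malignant class: share of malignant tumors the model catches
malignant_recall = recall_score(y_te, pred, pos_label=0)
print(f"Malignant recall: {malignant_recall:.4f}")
```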

### 6. Save Model

In [None]:
# Save the model and the scaler
# It is crucial to save the scaler to transform new input data in the app
model_filename = 'breast_cancer_model.pkl'
scaler_filename = 'scaler.pkl'

joblib.dump(model, model_filename)
joblib.dump(scaler, scaler_filename)

print(f"Model saved to {model_filename}")
print(f"Scaler saved to {scaler_filename}")

### 7. Test Loading

In [None]:
loaded_model = joblib.load(model_filename)
loaded_scaler = joblib.load(scaler_filename)

# Test with a sample
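# Double brackets keep a one-row DataFrame, so the scaler sees the same
# feature names it was fitted with (a bare NumPy array triggers a warning)
sample = X_test.iloc[[0]]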
sample_scaled = loaded_scaler.transform(sample)
prediction = loaded_model.predict(sample_scaled)
print(f"Prediction for sample: {data.target_names[prediction[0]]}")
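
In the app, prediction could be wrapped in a small helper that takes the five raw measurements and returns a label with its probability. The function `predict_tumor` below is hypothetical, not part of this project. To stay runnable on its own, the example fits a stand-in model and scaler instead of loading the `.pkl` files. The column names passed to the scaler must match the ones it was fitted with.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
CLASS_NAMES = ['malignant', 'benign']  # sklearn's encoding: 0, 1

def predict_tumor(model, scaler, values):
    """Hypothetical helper: classify one tumor from five raw measurements."""
    if len(values) != len(FEATURES):
        raise ValueError(f"Expected {len(FEATURES)} values, got {len(values)}")
    row = pd.DataFrame([values], columns=FEATURES)
    row_scaled = scaler.transform(row)
    label = int(model.predict(row_scaled)[0])
    prob = float(model.predict_proba(row_scaled)[0, label])
    return CLASS_NAMES[label], prob

# Stand-in model and scaler; in the app these would come from joblib.load(...)
X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
scaler = StandardScaler().fit(X_all[FEATURES])
model = LogisticRegression().fit(scaler.transform(X_all[FEATURES]), y_all)

label, prob = predict_tumor(model, scaler, list(X_all[FEATURES].iloc[0]))
print(f"Prediction: {label} (probability {prob:.3f})")
```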