# BREAST CANCER DIAGNOSIS & PROGNOSIS USING MACHINE LEARNING
# PROJECT REPORT
## 1. BUSINESS UNDERSTANDING
### Overview

Breast cancer is a major global health concern, and effective care relies on early detection and on understanding how aggressive a tumor may be. Manual examination of histopathology images is slow, subjective, and prone to error. Breast cancer also has several molecular subtypes (such as Luminal A, Luminal B, HER2-enriched, and Basal-like) that influence tumor growth, treatment response, and patient outcomes.
These subtypes are commonly identified using the PAM50 assay, a genomic test that is often costly and not widely accessible.

This project addresses these challenges by using Convolutional Neural Networks (CNNs) to automatically classify breast tissue images as benign (non-cancerous) or malignant (cancerous), and by using machine learning to predict molecular subtypes and survival outcomes. The system offers a practical, data-driven tool to support diagnosis, treatment decisions, and prognosis.

## 2. PROBLEM STATEMENT

Breast cancer diagnosis faces two key challenges:

* Manual examination of tissue images is time-consuming, error-prone, and requires specialist expertise.
* Molecular subtypes, which guide treatment, are usually determined through an expensive genomic test, the PAM50 assay, which is not widely accessible.

This project solves these challenges by building models that classify tissue images as benign or malignant and predict molecular subtypes and survival outcomes. The goal is to enable earlier detection, data-driven treatment planning, and improved patient outcomes.

## 3. BUSINESS OBJECTIVES

This project aims to deliver practical, AI-driven tools for breast cancer diagnosis and prognosis:

### Image Classification
Develop a deep learning model that distinguishes benign from malignant breast tissue using the BreakHis dataset, improving accuracy and diagnosis speed.

### Molecular Subtype Prediction
Predict PAM50 + Claudin-low subtypes (Luminal A, Luminal B, HER2-enriched, Basal-like) to provide a digital alternative to costly genomic testing and support personalized treatment.

### Survival Prediction
Build models that predict patient outcomes as:

* Binary: Living vs Deceased
* Multi-class: Living, Died of Disease, Died of Other Causes

### Deployment
Deploy all models as user-friendly web applications for easy clinical use.

## 4. DATA UNDERSTANDING
### 4.1 BreakHis Image Dataset

* Contains microscopic breast tissue images labeled as benign or malignant at multiple magnifications (40×, 100×, 200×, 400×).
* Supports image-based tumor classification.

### 4.2 METABRIC Clinical & Genomic Dataset

* Data from 2,509 patients, with 39 clinical/genomic features.
* Contains patient demographics, biomarkers, tumor characteristics, treatment indicators, and survival outcomes.

**Key Features Include:**

* PAM50 + Claudin-low subtype
* Age at diagnosis, tumor stage, tumor size
* ER/PR/HER2 biomarkers
* Treatment indicators (chemotherapy, hormone therapy, radiotherapy)
* Histologic grade
* Overall survival and relapse-free status

**Data Quality**

* Missing numerical values were filled with the column median.
* Missing categorical values were filled with the column mode.
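
As an illustration of the median/mode imputation described above, here is a minimal standard-library sketch. The function name `impute`, the column names, and the toy records are hypothetical and are not taken from the project's code or from METABRIC; a real pipeline would more likely use a dataframe library.

```python
from statistics import median, mode


def impute(rows, numeric_cols, categorical_cols):
    """Return a copy of rows with None replaced by the column median or mode."""
    fill = {}
    for col in numeric_cols:
        # Median of the observed (non-missing) values in this column
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        # Most frequent observed category in this column
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical toy records, for illustration only
patients = [
    {"tumor_size": 22.0, "er_status": "Positive"},
    {"tumor_size": None, "er_status": "Positive"},
    {"tumor_size": 30.0, "er_status": None},
    {"tumor_size": 10.0, "er_status": "Negative"},
]
cleaned = impute(patients, ["tumor_size"], ["er_status"])
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed clinical measurements such as tumor size.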

**Target Variables/Columns**

* PAM50 + Claudin-low subtype - Multi-class target for molecular subtype prediction.
* Overall Survival Status - Binary target for survival prediction (Living / Deceased).
* Patient’s Vital Status - Multi-class target for survival outcome classification (Living / Died of Disease / Died of Other Causes).
  
**Together the datasets enable both image-based tumor classification and clinical prognosis prediction.**

## 5. METHODS
### Image Classification Workflow

* Load the dataset and map the correct image paths.
* Check for and remove duplicate and corrupted images.
* Split the data: 70% training, 30% validation/test.
* Handle class imbalance with balanced class weights.
* Prepare training images with normalization (scaling) and augmentation (flip, rotate, zoom).
* Prepare validation/test images with normalization only.
* Build CNN and MobileNetV2 models.
* Evaluate using accuracy, precision, recall, and F1 score.
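
The class-weight step in the workflow above can be sketched as follows. This assumes the common "balanced" heuristic, where each class weight is `n_samples / (n_classes * class_count)`; the report does not state the exact formula used, and the label counts below are hypothetical, not the BreakHis counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label list, for illustration only
labels = ["benign"] * 30 + ["malignant"] * 70
weights = balanced_class_weights(labels)
```

With these weights the rarer class contributes more per sample to the loss, so both classes carry the same total weight during training.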

### Clinical Data Workflow

* Data cleaning: handled missing values by filling numerical columns with the mean and categorical columns with the mode, and removed duplicates.
* Performed Exploratory Data Analysis (EDA) to check distributions, relationships, and patterns among the features.
* Ran a correlation analysis to detect multicollinearity, that is, features that are too closely related to each other.
* Split the data into 80% for training and 20% for testing/validation.
* Label-encoded the target column to convert it into a numerical format suitable for model training, and one-hot encoded all categorical features so that machine learning algorithms can use them.
* Applied SMOTE and SMOTENC to address class imbalance.
* Built models with Random Forest, XGBoost, and Logistic Regression.
* Evaluated models using accuracy and F1 score.
* Ran a feature importance analysis to determine which features contribute most to the model’s predictions.
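
To make the oversampling step above more concrete, here is a minimal standard-library sketch of the core SMOTE idea: a synthetic minority sample is created by interpolating between a minority point and one of its nearest minority neighbours. This is an illustration only, not the project's implementation; the function name `smote_sketch`, its parameters, and the toy points are assumptions. SMOTENC additionally handles categorical columns and is not shown here.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-feature minority-class points, for illustration only
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_points = smote_sketch(minority_points, n_new=5)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.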

## 6. EXPLORATORY DATA ANALYSIS (KEY FINDINGS)

**Subtype vs Survival**
* Luminal A showed the best survival outcomes.

**Subtype Response to Chemotherapy**
* Basal and HER2 subtypes respond best to chemotherapy.
* Luminal A showed the highest resistance.

**Subtype Response to Hormone Therapy**
* Luminal A and Luminal B respond strongly to hormone-based treatment.

**ER/PR Status vs Survival**
* ER+ / PR+ patients have the highest survival rates.
* ER- / PR- patients show poor prognosis.

**Tumor Stage vs Survival**
* Earlier cancer stages (0 and 1) show far better survival than later stages (2 and 3).

**Age vs Survival**
* Older patients have lower survival.
* Middle-aged women (45–65) show the best outcomes.

## 7. MODELING RESULTS
### 7.1 Tumor Image Classification

Models built: CNN and MobileNetV2

Performance on the malignant class:

* CNN: Precision 0.88, Recall 0.86, F1 = 0.87
* MobileNetV2: Precision 0.95, Recall 0.69, F1 = 0.80

**Conclusion**
The CNN is preferred for medical use because of its higher recall on the malignant class (0.86 vs 0.69), which means fewer missed cancer cases.
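
As a quick consistency check, the reported F1 values can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_pr` is only illustrative.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Malignant-class precision and recall as reported in this section
cnn_f1 = f1_from_pr(0.88, 0.86)
mobilenet_f1 = f1_from_pr(0.95, 0.69)

print(f"CNN F1: {cnn_f1:.2f}")                  # prints "CNN F1: 0.87"
print(f"MobileNetV2 F1: {mobilenet_f1:.2f}")    # prints "MobileNetV2 F1: 0.80"
```

MobileNetV2 has the higher precision, but its low recall pulls its F1 below the CNN's, which matches the preference stated in the conclusion.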

### 7.2 Molecular Subtype Prediction

Models built: Random Forest, XGBoost

* Random Forest achieved a weighted F1 score of 70%, with more balanced performance across classes than XGBoost.

### 7.3 Survival Status Prediction

Models built: Random Forest, Logistic Regression

* Random Forest: Accuracy 77%, ROC-AUC 83%
* Logistic Regression: Accuracy 73%, ROC-AUC 79%
* Both models predict the Deceased class better, which is important for identifying high-risk cases.
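
For readers unfamiliar with ROC-AUC, the sketch below shows what the metric measures: the probability that a randomly chosen positive case (here, Deceased) receives a higher predicted score than a randomly chosen negative case (Living), with ties counted as half. The function name `roc_auc` and the labels and scores are hypothetical and are not the project's predictions.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = Deceased, 0 = Living) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

Unlike accuracy, this measure does not depend on a single decision threshold, which is why the two metrics are reported side by side.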

## 8. KEY INSIGHTS

* CNNs capture detailed patterns in tissue images, enabling accurate tumor classification.
* Image augmentation improves generalization and reduces overfitting.
* Clinical features such as tumor stage, ER/PR/HER2 biomarkers, lymph node involvement, and age at diagnosis strongly influence subtype and survival predictions.
* Predicting molecular subtypes provides a practical solution for situations where genomic testing cannot be accessed.

## 9. CONCLUSIONS
* The CNN model demonstrates strong potential to support early detection of breast cancer by accurately distinguishing benign and malignant tissue, helping identify high-risk patients earlier.
* Predicting molecular subtypes enables more personalized and targeted treatment planning, improving the likelihood of effective clinical outcomes.
* Survival prediction models provide valuable insights for prioritizing follow-up care and allocating clinical attention to patients with the highest risk.

## 10. RECOMMENDATIONS
* For health centers: Integrate these predictive tools to support faster, more informed diagnostic and treatment decisions.
* For personalized care: Adapt treatments based on predicted subtype and risk profile.
* Clinical prioritization: Provide closer monitoring and follow-up for high-risk patients, including older individuals, those diagnosed at later stages, and patients with aggressive tumors or multiple positive lymph nodes.
* Public health: Strengthen awareness initiatives, screening programs, and routine check-ups to promote early detection and improve overall outcomes.