# BREAST CANCER DIAGNOSIS & PROGNOSIS USING MACHINE LEARNING
# PROJECT REPORT
## 1. BUSINESS UNDERSTANDING
### 1.1 Overview

Breast cancer is a major global health concern, and effective care relies on early detection and understanding how aggressive a tumor may be. Manual examination of histopathology images is slow, subjective, and prone to error. Breast cancer also has several molecular subtypes such as Luminal A, Luminal B, HER2-enriched, and Basal-like that influence the rate of tumor growth, treatment response, and patient outcomes.
These subtypes are commonly identified using the PAM50 assay, a genomic test that is often costly and not widely accessible.

This project addresses these challenges by using Convolutional Neural Networks (CNNs) to automatically classify breast tissue images as malignant (cancerous) or benign (non-cancerous), and by using machine learning to predict molecular subtypes and survival outcomes. It offers a practical, data-driven tool that supports diagnosis, treatment decisions, and prognosis.

## 2. PROBLEM STATEMENT

Breast cancer diagnosis faces the following challenges:

* Manual examination of tissue images is time-consuming, prone to error, and requires specialist expertise.
* Molecular subtypes, which guide treatment decisions, are usually determined using the PAM50 assay, a costly genomic test that is not widely accessible.
* Patients often seek clear information about their prognosis, yet survival predictions rely on multiple clinical and genomic factors that require advanced analytical skills.

This project addresses these challenges by building models that classify breast cancer tissue images as benign or malignant, predict molecular subtypes, and estimate survival outcomes. The goal is to support earlier detection, data-driven treatment planning, and improved patient outcomes.

## 3. BUSINESS OBJECTIVES

This project aims to deliver practical, AI-driven tools for breast cancer diagnosis and prognosis:

### 3.1 Image Classification
Develop a deep learning model that distinguishes cancerous and non-cancerous breast tissue images using the BreakHis dataset, improving accuracy and diagnosis speed.

### 3.2 Molecular Subtype Prediction
Predict PAM50 + Claudin-low subtypes (Luminal A, Luminal B, HER2-enriched, Basal-like) to provide a digital alternative to genomic testing and support personalized treatment.

### 3.3 Survival Prediction
Predict each patient's overall survival outcome as either living or deceased.

### 3.4 Deployment
Deploy all models in a user-friendly web application for easy clinical use.

## 4. DATA UNDERSTANDING
### 4.1 BreakHis Image Dataset

Contains microscopic breast tissue images labeled as benign or malignant at multiple magnification levels (40×, 100×, 200×, 400×) to support image-based tumor classification.

### 4.2 METABRIC Clinical & Genomic Dataset

Contains data from 2,509 patients with 39 clinical and genomic features, including patient demographics, biomarkers, tumor characteristics, treatment indicators, and survival outcomes.

**Key Features Include:**

* PAM50 + Claudin-low subtype
* Age at diagnosis, tumor stage, tumor size
* ER/PR/HER2 biomarkers
* Treatment indicators (chemo, hormone therapy, radiotherapy)
* Lymph nodes examined positive
* Overall survival

**Data Quality**

Missing numerical values were filled with the column median.

Missing categorical values were filled with the column mode.
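
The imputation rule above can be sketched without any libraries as follows. This is a minimal illustration rather than the project's actual code: the helper `impute_column`, the `kind` labels, and the sample values are all hypothetical, and in practice a dataframe library would typically do the same job.

```python
from statistics import median, mode


def impute_column(values, kind):
    """Fill None entries with the median (numerical) or mode (categorical).

    `kind` is "numerical" or "categorical"; both labels are illustrative.
    """
    observed = [v for v in values if v is not None]
    if kind == "numerical":
        fill = median(observed)
    elif kind == "categorical":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return [fill if v is None else v for v in values]


# Hypothetical sample values, not taken from METABRIC.
tumor_size = [22.0, None, 30.0, 15.0, None]
er_status = ["Positive", "Positive", None, "Negative"]

filled_size = impute_column(tumor_size, "numerical")
filled_er = impute_column(er_status, "categorical")
print(filled_size)
print(filled_er)
```

The median is less sensitive to extreme values than the mean, which is one reason it is a common choice for skewed clinical measurements.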

**Target Variables/Columns**

* PAM50 + Claudin-low subtype - Multi-class target for molecular subtype prediction.
* Overall Survival Status - Binary target for survival prediction (Living / Deceased).
* Patient’s Vital Status - Multi-class target for classifying survival outcomes as (Living / Died of Disease / Died of Other Causes).
  
**Together the datasets enable both image-based tumor classification and clinical prognosis prediction.**

## 5. METHODS
### 5.1 Image Classification Workflow

* Load the dataset from the correct image path.
* Check for and remove duplicate and corrupted images.
* Split the data: 70% training, 30% validation/test.
* Handle class imbalance by applying balanced class weights.
* Prepare training images with normalization and augmentation (flip, rotate, zoom).
* Prepare validation/test images with normalization only.
* Build CNN and MobileNetV2 models.
* Evaluate using recall, accuracy, F1 score, and a confusion matrix.
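
One common way to balance class weights is to weight each class inversely to its frequency, using w_c = N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. This is the rule behind scikit-learn's `class_weight="balanced"` option. The sketch below assumes that rule, since the report does not state the exact formula used; the function name and the image counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical training labels: 0 = benign, 1 = malignant.
train_labels = [0] * 300 + [1] * 700
class_weights = balanced_class_weights(train_labels)
print(class_weights)
```

The minority class receives the larger weight, so each class contributes equally to the total loss. If the models are trained with Keras (an assumption, as the report does not name the framework), a dictionary of this form can be passed through the `class_weight` argument of `Model.fit`.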

### 5.2 Clinical Data Workflow

* Data cleaning: handle missing values by filling numerical columns with the mean and categorical columns with the mode, and remove duplicates.
* Perform Exploratory Data Analysis (EDA) to check distributions, relationships, and patterns among the features.
* Run correlation analysis to detect multicollinearity, and remove features that are too closely related to each other.
* Split the data: 80% for training and 20% for testing/validation.
* Label-encode the target column to convert it into a numerical format suitable for model training.
* One-hot encode all categorical features into a numeric format that machine learning models can use.
* Apply SMOTE and SMOTENC to handle class imbalance.
* Build models such as Random Forest, XGBoost, and Logistic Regression.
* Evaluate using accuracy, F1-score, and a confusion matrix.
* Run feature importance analysis to determine which features contribute most to the model’s predictions.
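
SMOTE addresses class imbalance by creating synthetic minority-class samples: it picks a minority sample, picks one of its nearest minority neighbours, and interpolates a new point on the line between them. The sketch below is a simplified, dependency-free illustration of that idea for numeric features only. It is not the imbalanced-learn implementation, it does not cover SMOTENC (which additionally handles categorical columns), and the function name, parameters, and data are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding the base point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-feature minority-class samples from the training split.
minority_train = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority_train, n_new=6)
print(len(minority_train) + len(new_points))
```

Resampling of this kind is normally applied to the training split only, so that the test set still reflects the original class distribution.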

## 6. EXPLORATORY DATA ANALYSIS (KEY FINDINGS)

**Tumor Stage vs Survival**

* Earlier cancer stages (0 & 1) show far better survival compared to late stages (2 & 3).

**Breast Cancer Subtype vs Survival**

* Luminal A showed the best survival outcomes.

**Breast Cancer Subtype Response to Chemotherapy**

* Basal and HER2 subtypes respond best to chemo.
* Luminal A showed the highest resistance to chemo.

**Breast Cancer Subtype Response to Hormone Therapy**

* Luminal A and Luminal B respond strongly to hormone-based treatment.

**ER/PR Status vs Survival**

* ER+ / PR+ patients show the best survival rates.
* ER– / PR– patients show poor prognosis.

**Age vs Survival**

* Older patients show poorer survival status.
* Middle-aged women (45–65) show the best survival outcomes.
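
Comparisons like these come down to computing a survival rate per group. The sketch below shows one way to do that with plain Python; the records, field names, and function name are hypothetical and do not reproduce the METABRIC data or the figures behind the findings above.

```python
from collections import defaultdict


def survival_rate_by_group(records, group_key):
    """Return {group: fraction of patients with status 'Living'}."""
    totals = defaultdict(int)
    living = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row["status"] == "Living":
            living[row[group_key]] += 1
    return {group: living[group] / totals[group] for group in totals}


# Hypothetical patient records, for illustration only.
records = [
    {"stage": 1, "status": "Living"},
    {"stage": 1, "status": "Living"},
    {"stage": 1, "status": "Deceased"},
    {"stage": 3, "status": "Living"},
    {"stage": 3, "status": "Deceased"},
    {"stage": 3, "status": "Deceased"},
]
rates = survival_rate_by_group(records, "stage")
print(rates)
```

The same function works for any categorical column, such as subtype or ER status, by changing `group_key`.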

## 7. MODELING RESULTS
### 7.1 Tumor Image Classification

Models built: CNN and MobileNetV2

Performance on the malignant class:

* CNN: Precision 0.88, Recall 0.86, F1 = 0.87
* MobileNetV2: Precision 0.95, Recall 0.69, F1 = 0.80

**Conclusion**

The CNN is preferred for medical use because its higher recall means it misses fewer cancer cases.
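
The F1 scores reported for the malignant class follow directly from precision and recall, since F1 is their harmonic mean. The short check below recomputes them and translates recall into the approximate number of malignant cases missed per 100; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def missed_per_100(recall):
    """Approximate number of malignant cases missed out of 100 true malignant cases."""
    return round(100 * (1 - recall))


# Precision and recall for the malignant class, as reported in this section.
cnn_f1 = f1_from(0.88, 0.86)
mobilenet_f1 = f1_from(0.95, 0.69)

print(round(cnn_f1, 2), round(mobilenet_f1, 2))      # 0.87 0.8
print(missed_per_100(0.86), missed_per_100(0.69))    # 14 31
```

MobileNetV2 is more precise when it does flag a tumor, but at these recall values it would miss roughly twice as many malignant cases as the CNN.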

### 7.2 Molecular Subtype Prediction

Models built: Random Forest, XGBoost

* Random Forest achieved a weighted F1 of 72%, with more balanced performance across classes.
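
A weighted F1 score averages the per-class F1 scores, weighting each class by its number of test samples (its support). The sketch below shows the calculation next to the unweighted macro average. The per-class scores and supports are hypothetical, chosen only for illustration; the report gives just the aggregate figure.

```python
def weighted_f1(per_class_f1, support):
    """Average per-class F1 scores, weighted by each class's sample count."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total


# Hypothetical per-class F1 scores and test-set counts, for illustration only.
per_class_f1 = {"LumA": 0.80, "LumB": 0.70, "Her2": 0.60, "Basal": 0.75}
support = {"LumA": 140, "LumB": 95, "Her2": 45, "Basal": 40}

score = weighted_f1(per_class_f1, support)
macro = sum(per_class_f1.values()) / len(per_class_f1)
print(round(score, 3), round(macro, 3))
```

Because large classes dominate the weighted average, it is worth reading it alongside the per-class scores when subtypes are imbalanced.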

### 7.3 Survival Status Prediction

Models built: Random Forest, Logistic Regression

* Random Forest: Accuracy 77%, ROC-AUC 83%
* Logistic Regression: Accuracy 73%, ROC-AUC 79%

* The Random Forest model performed better. It is also better at predicting deceased patients, which is important for identifying high-risk patients who may need closer monitoring.
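
ROC-AUC can be read as the probability that a randomly chosen positive case (here, a deceased patient) receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition by comparing every positive/negative pair. The labels and scores are hypothetical, and the pairwise approach is meant for illustration on small inputs rather than as an efficient implementation.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = deceased, 0 = living) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(y_true, y_score)
print(round(auc, 3))
```

Unlike accuracy, this measure does not depend on a single decision threshold, which is why the two metrics are reported side by side.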

## 8. KEY INSIGHTS

* CNNs capture detailed patterns in tissue images, enabling accurate tumor classification.
* Predicting molecular subtypes and estimating survival status help guide treatment planning and provide patients with valuable prognostic information.
* Clinical features such as tumor stage, ER/PR/HER2 biomarkers, lymph node involvement, and age at diagnosis strongly influence subtype and survival predictions.

## 9. CONCLUSIONS
* The CNN model demonstrates strong potential to support **early detection** of breast cancer by accurately distinguishing benign and malignant tissue.
* Predicting molecular subtypes enables more personalized and targeted treatment planning, improving the likelihood of effective clinical outcomes.
* The survival status prediction model provides valuable insights into which patients (high-risk individuals) may need more careful follow-up and extra attention.

## 10. RECOMMENDATIONS
* For health centers: Integrate these predictive tools to support faster, more informed diagnostic and treatment decisions.
* For personalized care: Adapt treatments based on predicted subtype and risk profile.
* For clinical prioritization: Provide closer monitoring and follow-up for high-risk patients, including older individuals, those diagnosed at later stages, and patients with aggressive tumors or multiple positive lymph nodes.
* For public health: Strengthen awareness initiatives, screening programs, and routine check-ups to promote early detection and improve overall outcomes.