# **Classification of Breast Cancer Molecular Subtypes Using Gene Expression Profiles and Machine Learning Models**

## **Introduction**

Breast cancer is the most frequently diagnosed malignancy and a leading cause of cancer-related mortality among women worldwide. According to the World Health Organization, more than 2.3 million people were diagnosed with breast cancer in 2020, and approximately 685,000 died of the disease globally. The burden of breast cancer is not only a medical challenge but also a socioeconomic concern, particularly in regions with limited access to diagnostic and treatment resources.

One of the critical complexities in managing breast cancer lies in its heterogeneity. Breast tumors are not uniform; they exhibit significant variations at the morphological, molecular, and clinical levels. Based on molecular features, breast cancer is broadly categorized into distinct subtypes: Luminal A, Luminal B, HER2-enriched, and Basal-like (commonly associated with triple-negative breast cancer, TNBC). Each subtype is associated with unique biological pathways, treatment responses, and patient prognoses. Accurate subtype classification is essential because it directly informs the therapeutic approach, whether hormone therapy, HER2-targeted treatment, or chemotherapy.

Historically, breast cancer subtyping has relied on histopathological analysis and immunohistochemistry (IHC) techniques, which assess the presence of estrogen receptors (ER), progesterone receptors (PR), and HER2 expression. While these methods are widely used in clinical practice, they depend on human interpretation, are subject to inter-observer variability, and often fail to capture the full molecular complexity of tumors.

The advent of high-throughput technologies, particularly RNA sequencing (RNA-Seq), has enabled comprehensive measurement of gene expression levels across thousands of genes simultaneously. These gene expression profiles provide a more nuanced and objective molecular view of tumors, making them a promising tool for precise classification. However, the high dimensionality and complexity of gene expression data pose significant analytical challenges, necessitating the application of advanced computational techniques.

In this context, machine learning (ML) emerges as a powerful approach to uncover meaningful patterns within complex datasets. ML models can learn intricate relationships between gene expression patterns and cancer subtypes, offering potential improvements in classification accuracy and biological interpretability. While various ML techniques have been proposed in cancer genomics, there is still a need for comparative evaluation of classical algorithms like Random Forest and Support Vector Machines (SVM) alongside modern deep learning models such as Convolutional Neural Networks (CNNs).

This study aims to develop, compare, and evaluate multiple machine learning models for the classification of breast cancer subtypes using gene expression data derived from the TCGA-BRCA (The Cancer Genome Atlas – Breast Invasive Carcinoma) dataset. The research further seeks to identify gene signatures that contribute significantly to classification, which may aid future studies in biomarker discovery and clinical applications.


## **Problem Statement**
Accurate classification of breast cancer subtypes is essential for effective treatment planning and prognosis. However, current clinical approaches primarily rely on histopathological assessment and IHC testing, which, although valuable, are limited in their ability to reflect the molecular heterogeneity of tumors. These methods are prone to subjective interpretation and do not fully capture the dynamic and complex gene expression profiles that distinguish cancer subtypes.

RNA-Seq technologies offer a high-resolution molecular snapshot of gene activity within tumors, enabling data-driven classification of subtypes. Nevertheless, the scale and complexity of RNA-Seq datasets make them difficult to analyze: the number of genes measured typically far exceeds the number of tumor samples. Traditional statistical techniques are often inadequate for extracting meaningful insights from such high-dimensional data, highlighting the need for robust computational methods.
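One common way to tame this dimensionality before modeling is to log-transform the expression values and keep only the most variable genes. The following is a minimal sketch under stated assumptions, not the study's actual preprocessing: the toy matrix, the gene names, and the helper functions `log_transform` and `top_variance_genes` are all hypothetical and chosen only for illustration.

```python
import math
from statistics import pvariance


def log_transform(counts):
    """Apply log2(x + 1) to every value to compress the dynamic range."""
    return [[math.log2(value + 1) for value in sample] for sample in counts]


def top_variance_genes(matrix, gene_names, k):
    """Return the names of the k genes with the highest variance across samples."""
    columns = list(zip(*matrix))  # one tuple of values per gene
    variances = [pvariance(column) for column in columns]
    ranked = sorted(zip(gene_names, variances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy matrix: 4 samples (rows) x 3 genes (columns). Values are made up.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
counts = [
    [0, 100, 7],
    [1023, 110, 7],
    [3, 90, 7],
    [255, 105, 7],
]

logged = log_transform(counts)
selected = top_variance_genes(logged, gene_names, k=2)
print(selected)
```

In this toy example `GENE_C` is constant across samples, so it carries no information for separating subtypes and is the gene that gets dropped. A real pipeline would apply the same idea to thousands of genes and would fit the filter on training samples only.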

Although machine learning models have demonstrated potential in the classification of cancer subtypes, there remains a lack of systematic evaluation of both classical and deep learning approaches specifically applied to breast cancer gene expression data. Moreover, few studies provide interpretable outputs that highlight the biological significance of specific genes contributing to classification outcomes.

Thus, there is a clear need for a study that not only applies and compares multiple machine learning algorithms to breast cancer subtype classification but also integrates interpretability techniques to identify the most relevant gene expression markers. Addressing this gap is critical for advancing precision medicine and supporting oncologists in making informed treatment decisions.

## **Objectives**
### **General Objective**
To develop and evaluate machine learning models for the accurate classification of breast cancer molecular subtypes based on gene expression profiles from the TCGA-BRCA dataset.

### **Specific Objectives**
1. To acquire and preprocess RNA-Seq gene expression data and clinical subtype labels from the TCGA-BRCA dataset.
2. To perform exploratory data analysis (EDA) to understand the structure, quality, and distribution of the gene expression data.
3. To develop classification models using Random Forest, Support Vector Machine (SVM), and Convolutional Neural Network (CNN) algorithms.
4. To compare model performance based on classification metrics including accuracy, precision, recall, F1-score, and ROC-AUC.
5. To identify and interpret the most informative genes contributing to subtype classification using model-specific feature importance techniques.
6. To assess the applicability of the developed models in supporting clinical breast cancer diagnosis and subtype stratification.
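
The comparison metrics named in the objectives above can be made concrete with a small worked example. The sketch below computes accuracy and per-subtype precision, recall, and F1-score, then macro-averages F1 so that rare subtypes count as much as common ones. It is illustrative only: the labels and predictions are invented, the helper `subtype_metrics` is hypothetical, and macro-averaging is an assumption rather than a choice stated in this study. ROC-AUC is omitted because it requires predicted scores rather than hard labels.

```python
from collections import Counter


def subtype_metrics(y_true, y_pred):
    """Return (accuracy, macro_f1, per_class) for multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this subtype, but it was wrong
            fn[truth] += 1  # missed the true subtype

    per_class = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return accuracy, macro_f1, per_class


# Invented labels for eight tumors; two Luminal A/B samples are confused.
y_true = ["LumA", "LumA", "LumA", "LumA", "LumB", "LumB", "HER2", "Basal"]
y_pred = ["LumA", "LumA", "LumA", "LumB", "LumB", "LumA", "HER2", "Basal"]

accuracy, macro_f1, per_class = subtype_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

Here six of eight tumors are classified correctly, yet the Luminal B F1-score is only 0.5. This is why per-subtype metrics matter alongside overall accuracy when subtype frequencies are imbalanced.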

## **Significance of the Study**
The accurate classification of breast cancer subtypes plays a pivotal role in determining optimal treatment pathways and predicting patient outcomes. Misclassification can lead to suboptimal therapeutic interventions, unnecessary side effects, and poorer prognoses. Enhancing the accuracy and objectivity of subtype classification through machine learning has the potential to transform clinical practice and contribute to the advancement of precision oncology.

This study is significant for several reasons:

- **Clinical Relevance:** The models developed may complement existing diagnostic protocols by offering data-driven, reproducible, and scalable alternatives for molecular subtyping.
- **Technological Advancement:** The research explores the use of both traditional machine learning and modern deep learning models, offering insights into the comparative strengths of each in handling gene expression data.
- **Biological Insight:** By identifying key genes that influence classification, the study contributes to the understanding of subtype-specific molecular pathways, potentially guiding future biomarker discovery.
- **Research Contribution:** The project addresses an important gap in the literature by systematically evaluating multiple models on the same dataset and emphasizing both predictive performance and biological interpretability.

By leveraging computational methods to enhance breast cancer subtype prediction, this study aims to support more accurate, personalized, and effective cancer care.