# COGS 108 - Project Proposal

## Authors

- Richard Wang: Background research, Writing - original draft, question, data, datasets
- Alena Lee: Background research, Writing - original draft, background, question, datasets
- Steven Ngo: Background research, Writing - original draft, timeline, team ex., hypotheses
- Justin Suh: Background research, Writing - original draft, question, ethics, team ex.
- Jeff Lin: Background research, Writing - original draft, background, hypotheses, question

## Research Question

Which genes contribute most to a classification model that distinguishes breast cancer subtypes? Which genes are more highly expressed in the Basal subtype than in the other subtypes (e.g., Luminal A, Luminal B, HER2-enriched)?

## Background and Prior Work

**Breast cancer** is a disease involving uncontrolled proliferation of cells resulting from DNA mutations acquired throughout a person’s lifetime. It is the most frequently diagnosed cancer among women in the United States (excluding skin cancers), as well as one of the leading causes of cancer-related death. In 2026, an estimated 300,000+ cases of invasive breast cancer will be diagnosed, and roughly 1 in 8 women in the United States will develop breast cancer during their lifetime.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Gene expression profiling is a method that aims to distinguish cancerous from normal samples by analyzing “molecular signatures”: specific patterns of genes that are switched on or off in cancer cells, but not in healthy tissue. In our question, we want to narrow down which genes contribute the most to specific types of breast cancer. The subtypes of breast cancer differ primarily by the presence or absence of hormone receptors. These markers allow us to categorize tumors into four main molecular subtypes: Basal-like, HER2-enriched, Luminal A, and Luminal B. Our Kaggle dataset contains these four subtypes plus a `cell_line` class; cell lines are immortalized in vitro models used in research, whereas the four main subtypes are clinical, patient-derived classifications. Our analysis will focus only on the four main subtypes.

After researching which gene mutations are known to cause or significantly increase the risk of developing breast cancer, we were able to narrow down hypotheses for each subtype of breast cancer.

Previous research has explored a similar question in a published NIH article titled “Identification of breast cancer subtypes and drug response prediction through forward and reverse translation”.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) In this study, the authors predicted subtype-specific therapeutic drug response by connecting patient tumor data with cancer screening data. They trained NMF-based models on DepMap cell-line data to predict CDK4 vs. CDK6 dependency from gene expression (CDK4 and CDK6 are the targets of CDK4/6 inhibitors, a class of drugs used in combination with hormone therapy) and applied those models to TCGA tumors. The results suggested that Luminal A tumors skewed toward a CDK4 dependency, while Luminal B tumors skewed toward a CDK6 dependency. Additionally, some subgroups showed higher estrogen-driven expression, while others showed high cell-cycle programs. The core difference between this study and our project is that the authors focused on using patterns across thousands of genes to discover and describe subgroups, whereas we want to determine which specific genes are most responsible for distinguishing one breast cancer subtype from another.

Another study, published on ScienceDirect, takes an approach closer to ours, using feature selection strategies to “identify statistically significant genes and accurately classify cancer types from RNA-seq data”.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Essentially, this study aims to create a master list of the specific genes found in tumor tissue. However, it takes a general view across cancer types rather than narrowing in on breast cancer, whereas our project aims to link named genes to a specific subtype of breast cancer. In their analysis, the authors used Ridge Regression and Lasso to select the best features from the data. Ridge Regression helped identify dominant genes among their 800+ cancer tissue samples, which are bound to contain a lot of noise, and it is well suited to high-dimensional genomic datasets. Lasso, in contrast, is a regularization technique that can shrink the coefficients of uninformative features to exactly zero, which makes it particularly useful when only a subset of features is informative. The study produced a refined list of the top ~100 genes associated with various tumor types. Furthermore, the authors compared the highest and lowest expression of the top 50 genes using a color-coded visualization, expanding their conclusions to also consider co-expressed genes and condition-specific patterns.[<sup>3</sup>](#cite_note-3)
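
To make the contrast between the two penalties concrete, here is a minimal sketch on synthetic data. It is not the study's code or data: it assumes scikit-learn is available, and the sample count, gene count, planted coefficients, and `alpha` values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in for an expression matrix: 100 samples x 50 "genes".
rng = np.random.default_rng(0)
n_samples, n_genes = 100, 50
X = rng.normal(size=(n_samples, n_genes))

# Assumption for illustration: only the first three genes carry signal.
true_coef = np.zeros(n_genes)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Ridge (L2 penalty) shrinks coefficients but keeps every gene in the model.
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso (L1 penalty) can set uninformative coefficients to exactly zero.
lasso = Lasso(alpha=0.2).fit(X, y)

ridge_nonzero = int(np.sum(ridge.coef_ != 0))
lasso_selected = np.flatnonzero(lasso.coef_)

print("Genes kept by Ridge:", ridge_nonzero)
print("Genes kept by Lasso:", lasso_selected.tolist())
```

In a sketch like this, Ridge keeps a nonzero weight on every gene while Lasso keeps only a small subset, which is why Lasso is often used as a feature selection step.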

1. <a name="cite_note-1"></a> [^](#cite_ref-1) American Cancer Society. (2026, January 13). *Key statistics for breast cancer: How common is breast cancer?* Retrieved February 4, 2026, from https://www.cancer.org/cancer/types/breast-cancer/about/how-common-is-breast-cancer.html

2. <a name="cite_note-2"></a> [^](#cite_ref-2) Karam, J., Rejto, P. A., Bienkowska, J. R., Mu, X. J., & Roh, W. (2025). Identification of breast cancer subtypes and drug response prediction through forward and reverse translation. *NPJ Precision Oncology, 9*, Article 267. https://doi.org/10.1038/s41698-025-01062-w

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Akter, S., Adesola, R. O., & Basnet, S. (2025). Machine learning approach to identify significant genes and classify cancer types from RNA-seq data. *Global Medical Genetics, 12*(4), 100079. https://doi.org/10.1016/j.gmg.2025.100079


## Hypothesis


Based on our research, we hypothesize that Basal-type breast cancer will be highly correlated with KRT5 and KRT14 expression, while HER2-enriched breast cancer will be most strongly associated with ERBB2 and GRB7. We also hypothesize that hormone receptor genes such as ESR1 and PGR will be the most characteristic highly expressed genes in Luminal A-type breast cancer, and that Luminal B-type breast cancer will be characterized by expression of ESR1 and MKI67.

## Data

### Part 1

To answer our research question, the ideal dataset would consist of gene expression profiles from tumor samples with clearly defined cancer subtype labels. The primary variables would include normalized gene expression measurements for a large number of genes per sample, with a categorical variable indicating the cancer subtype (e.g., Luminal A, Luminal B, Basal). These gene expression features can act as the input to a classification model, while the subtype labels can act as the target variable. Preferably, data related to the patient, such as age or tumor characteristics, would be included as control variables to account for potential sources of biological variability.

Ideally, the dataset would include several hundred to a few thousand tumor samples, with reasonably balanced representation of each subtype to ensure that our model has enough statistical power and stable performance. These data would usually be collected from tumor biopsy samples taken from cancer patients in clinical or research facilities. Gene expression levels would be measured using standardized genomic profiling techniques such as microarray analysis or RNA sequencing. Cancer subtype labels would be assigned based on established clinical or molecular classification criteria, and all samples would undergo consistent preprocessing and normalization to ensure the data are comparable across observations.

Finally, the data would be stored in a structured table format where each row represents an individual tumor sample and each column represents either a gene expression value or a relevant clinical or demographic variable (such as the age and tumor characteristics mentioned above), with subtype labels stored as a separate target variable. Gene identifiers would also be standardized, and the dataset would be organized to support a machine learning workflow, including feature selection, model training, and interpretation of gene importance scores.
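
As a rough illustration of this layout, the sketch below builds a tiny table with pandas and separates the features from the target. All sample IDs, ages, and expression values are made-up placeholders, and the column names (`subtype`, `age`) are assumptions rather than fields from any of our datasets.

```python
import pandas as pd

# Toy table: one row per tumor sample, one column per gene or metadata field.
# Every value here is a made-up placeholder, not real data.
data = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4"],
        "subtype": ["Basal", "HER2-enriched", "Luminal A", "Luminal B"],
        "age": [52, 61, 47, 58],
        "ESR1": [2.1, 3.0, 9.4, 8.7],
        "ERBB2": [4.2, 11.3, 5.0, 5.6],
        "KRT5": [10.8, 3.9, 2.7, 3.1],
    }
).set_index("sample_id")

# Keep clinical/demographic columns apart from the gene expression features.
metadata_cols = ["subtype", "age"]
X = data.drop(columns=metadata_cols)  # gene expression features (model input)
y = data["subtype"]                   # subtype labels (model target)

print(X.shape)
print(y.value_counts())
```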


### Part 2

Dataset 1: Breast_GSE45827 (https://sbcb.inf.ufrgs.br/cumida)

This dataset comes from the Curated Microarray Database (CuMiDa), which is a publicly available collection of curated cancer gene expression datasets created by the Structural Bioinformatics and Computational Biology Lab and can be accessed without special permission. The important variables in this dataset include normalized gene expression levels for thousands of genes measured across tumor samples. There is also a categorical variable indicating the cancer subtype or class for each sample. The gene expression values can serve as predictor variables in the classification model, while the subtype label can be used as the response variable. The curated and standardized preprocessing applied in CuMiDa can help reduce technical variability, which makes it suitable for identifying genes that contribute most to distinguishing cancer subtypes.

Dataset 2: https://www.kaggle.com/datasets/waalbannyantudre/gene-expression-cancer-rna-seq-donated-on-682016?select=data.csv

This dataset comes from the RNA-Seq (HiSeq) PANCAN dataset through the UCI Machine Learning Repository, which is publicly available and can be accessed directly without any special permission. It contains two files: one includes the gene expression measurements, and the other includes the corresponding cancer type labels for each sample. The most important variables in this dataset are the normalized gene expression values for over 20,000 genes, which can serve as the predictor variables, and the categorical cancer type label, which covers tumors such as breast, kidney, colon, and lung and can serve as the response variable. Combined, these variables allow us to train supervised classification models and, after filtering to focus on breast cancer, analyze which genes contribute the most to distinguishing between cancer types.

Dataset 3: https://www.kaggle.com/datasets/saurabhshahane/gene-expression-profiles-of-breast-cancer?select=BC-TCGA

This dataset is publicly available through Mendeley Data and can be accessed directly without requesting special permission. It includes four separate sub-datasets used in prior cancer classification research and a simulated dataset, each containing gene expression measurements for thousands of genes across breast cancer-related samples. The most important variables are the 10,000+ normalized gene expression values for each sub-dataset and the binary/categorical outcome labels (e.g., cancer vs. normal tissue, recurrence vs. non-recurrence, treatment response). These variables make the dataset suitable for supervised classification tasks and for analyzing which genes are most informative in distinguishing breast cancer.

  

## Ethics

### A. Data Collection
- [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?


> The tissue samples were collected with consent for research purposes and deposited in the Gene Expression Omnibus (GEO), a publicly accessible database. 


- [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?


> The dataset was collected from a single study and lacks demographic information such as race, ethnicity, and age. This limits our ability to assess whether certain populations are underrepresented and whether our findings generalize across different groups. We likely won’t be able to take steps to address this, as the data was initially anonymized.


- [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?


> The dataset we chose already contains no personally identifiable information. The dataset only contains numerical gene expression values and cancer subtype labels.


- [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?


### B. Data Storage
- [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
- [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
- [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?


### C. Analysis
- [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
- [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?


> The dataset has some class imbalance, with some subtypes of breast cancer appearing more than others. This may lead to models performing better on larger classes. We will report per-class metrics and exclude cell line/normal samples as they may not accurately represent patient tumors.


- [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?


> All of the visualizations, summary statistics, and reports that we will create will be solely generated from the values in this dataset. We will clearly label sample sizes, report statistical significance where appropriate, and avoid making any claims that don’t correspond with our data.


- [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?


> Our initial dataset contains no PII. All analysis uses only anonymized gene expression values and subtype labels.


- [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?


> All code will be documented in Jupyter notebooks. The dataset is also publicly available, so our results should be easily reproducible.


### D. Modeling
- [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
- [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
- [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
- [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
- [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?


> Our analysis is purely educational and explanatory. We will clearly state that findings should not be used for clinical decision making without validation on larger, more diverse datasets.


### E. Deployment
- [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
- [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
- [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
- [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

## Team Expectations 

- Our main form of communication will be iMessage. We believe this is the most convenient way to reach everyone. On most weekdays, we will reply within 36 hours.
- Meetings: 2x/week (one planning meeting + one progress/check-in), 45–60 minutes, mostly virtual; in-person as needed.
- Meeting norms: Come prepared with updates, blockers, and next steps. We’ll keep brief notes and action items after each meeting.
- We commit to being respectful with our critique and assume feedback is well-meant.
- We will actively include quieter members (round-robin check-ins, asking for opinions directly, leaving space before moving on).
- We aim for consensus, but if needed we’ll use a majority vote after discussion.
- If an urgent decision is required and someone is non-responsive past the response window, the facilitator + section lead can decide and document it, then we revisit at the next meeting if needed.
- We will use a shared Google Doc with clear outlines of tasks and deliverables.
- Roles can rotate, but we will assign clear owners for each deliverable (wrangling, EDA/viz, modeling, interpretation, writing/editing).
- Everyone must contribute to question/dataset decisions, code (wrangling/EDA/modeling), written narrative, and editing/review.
- Any code merged must be commented, readable, and reviewed by at least one teammate.
- We will maintain a living project timeline with internal deadlines at least 48–72 hours before course deadlines. This can be a bit flexible, as long as that is communicated.
- If you’re stuck or falling behind, try to figure out the problem yourself, then tell the team if you are truly stuck (no later than 48 hours after realizing).
- The team will respond by splitting the task and adjusting scope while preserving core requirements.
- We will address issues early and respectfully: describe the problem, impact, and a proposed fix.
- If needed, we’ll revisit expectations and reassign tasks to match availability and strengths.
- If someone repeatedly misses deadlines/meetings or is not contributing, we will: notify them in written form with specific missing items and a clear 1-week improvement plan, redistribute tasks as needed to protect the project, and if there is no improvement, contact the professor (by Week 8 at the latest) with documented specifics.

## Project Timeline Proposal

## Outline

### Week 5 – Project Proposal
- **We are proposing to use a high dimensional dataset that may include advanced visualization and machine learning techniques not covered in this class. However, we believe that we have the ability to overcome most challenges in this project, and will happily use resources given to us from the teaching staff.**

### Week 6-7 – Data cleaning, wrangling and exploratory data analysis
- Inspect data: check for missing values and potential confounding variables, and normalize the data
- Aggregate analysis (groupby operations)
- Perform a left join (or an inner join, if unmapped probes should be dropped) on the Affymetrix gene annotation dataset (ID column) to map uninterpretable feature names (Affymetrix ID column names) from the Kaggle/CuMiDa dataset to more interpretable feature values
    - Choose 1-2: Sequence Type, Target Description, Gene Title, Gene Symbol
    - Convert the Affymetrix gene dataset .txt file to .csv
    - Transpose one of the datasets (if necessary)
- OR look at the Affymetrix gene dataset and handpick/narrow down genes to use (filter)
- Perform visualization operations (dimensionality reduction via PCA/t-SNE, clustering via DBSCAN, line plots, scatter plots, box plots, heatmaps)
    - Use seaborn (sns)
    - Explore different subproblems
    - Visualize different genes and clusters
    - Interpret variance in gene expression and document early results
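
The ID-mapping step in this list could look roughly like the following pandas sketch. The probe IDs (`probe_a`, etc.), the `type` label column, and all values are hypothetical placeholders, and the real annotation file may name its columns differently from the `ID` and `Gene Symbol` assumed here.

```python
import pandas as pd

# Hypothetical expression table: one row per sample, probe IDs as columns.
expr = pd.DataFrame(
    {
        "type": ["basal", "HER", "luminal_A", "luminal_B"],
        "probe_a": [10.8, 3.9, 2.7, 3.1],
        "probe_b": [4.2, 11.3, 5.0, 5.6],
        "probe_c": [6.0, 6.1, 5.9, 6.2],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Hypothetical annotation table; probe_c is deliberately left unannotated.
annot = pd.DataFrame(
    {"ID": ["probe_a", "probe_b"], "Gene Symbol": ["KRT5", "ERBB2"]}
)

# Transpose so each probe becomes a row that can be joined on the ID column.
probes = expr.drop(columns="type").T.rename_axis("ID").reset_index()

# A left join keeps every probe; indicator=True flags probes with no match.
mapped = probes.merge(annot, on="ID", how="left", indicator=True)
unmapped = mapped.loc[mapped["_merge"] == "left_only", "ID"].tolist()

# Rename only the probes that have a gene symbol; unmapped probes keep their ID.
symbol_by_probe = mapped.dropna(subset=["Gene Symbol"]).set_index("ID")["Gene Symbol"]
expr_named = expr.rename(columns=symbol_by_probe.to_dict())

print("Unmapped probes:", unmapped)
print(expr_named.columns.tolist())
```

A left join is used here so that unannotated probes are reported instead of silently dropped; switching to `how="inner"` would discard them.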

### Week 8 – Statistical Analysis
- Frame statistical question(s) and choose appropriate groups (distributions) of interest
- Perform statistical tests to compare sample vs. population or multiple groups
    - Tests include t-tests, ANOVA, etc.
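
As one possible shape for this step, the sketch below runs a one-way ANOVA per gene across subtypes and applies a Bonferroni correction. It assumes SciPy is available and uses synthetic data in which one gene is deliberately shifted for the Basal group; the group sizes, gene count, and significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic expression values: 30 samples per subtype, 5 "genes".
rng = np.random.default_rng(0)
subtypes = ["Basal", "HER2", "Luminal A", "Luminal B"]
n_per_group, n_genes = 30, 5
groups = {
    name: rng.normal(loc=5.0, scale=1.0, size=(n_per_group, n_genes))
    for name in subtypes
}
# Assumption for illustration: gene 0 is strongly elevated in Basal samples.
groups["Basal"][:, 0] += 5.0

# One-way ANOVA per gene: does mean expression differ across the subtypes?
p_values = np.array(
    [
        f_oneway(*(values[:, gene] for values in groups.values())).pvalue
        for gene in range(n_genes)
    ]
)

# Bonferroni correction: divide the significance level by the number of tests.
alpha = 0.05
significant = p_values < alpha / n_genes

for gene, (p, sig) in enumerate(zip(p_values, significant)):
    print(f"gene {gene}: p = {p:.3g}, significant after correction: {sig}")
```

With thousands of genes, a correction of this kind (or a less conservative alternative) matters because many uncorrected tests would produce false positives by chance.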

### Week 9 – Feature Engineering, Predictive Modeling
- Engineer new features if needed 
    - Nonlinear combinations (less interpretable)
    - Normalize data
    - Treat columns (genes) as vectors
- Filter cancer types
- Choose features and train classification models (multi-class) on different combinations of features
- Interpret weights and biases, strength of metrics (accuracy, precision, recall)
- Optimize the bias-variance tradeoff (Random Forests, NNs)
    - Visualize train-test accuracy over model complexity
- Handle class imbalances
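
A minimal sketch of how these modeling steps might fit together, assuming scikit-learn is available. The data are synthetic, with imbalanced class sizes and one planted marker gene per subtype; the class counts, gene count, and hyperparameters are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: gene k is shifted upward for subtype k.
rng = np.random.default_rng(0)
subtypes = ["Basal", "HER2", "Luminal A", "Luminal B"]
counts = [60, 20, 40, 30]
n_genes = 20

X_parts, y_parts = [], []
for marker_idx, (name, n) in enumerate(zip(subtypes, counts)):
    block = rng.normal(size=(n, n_genes))
    block[:, marker_idx] += 3.0  # planted marker gene for this subtype
    X_parts.append(block)
    y_parts.extend([name] * n)
X = np.vstack(X_parts)
y = np.array(y_parts)

# Stratify so every subtype appears in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Per-class precision/recall/F1 instead of a single accuracy number.
report = classification_report(y_test, clf.predict(X_test), output_dict=True)
per_class_recall = {name: report[name]["recall"] for name in subtypes}

# Rank genes by impurity-based importance and keep the top four.
top_genes = np.argsort(clf.feature_importances_)[::-1][:4]

print("Per-class recall:", per_class_recall)
print("Macro F1:", report["macro avg"]["f1-score"])
print("Top gene indices:", top_genes.tolist())
```

Reporting per-class metrics alongside macro-F1 makes it easier to see whether smaller subtypes are being misclassified, which overall accuracy can hide.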

### Week 10 – Write Discussion, Additional Results, Finalize everything
- Build conclusions, interpret results
- Map column names (Affymetrix IDs) to gene names to determine which of our chosen genes contribute most (or least) to distinguishing breast cancer subtypes, based on their expression levels and predictive power
- Finalize ethics sections
- Finish discussion section
- Fix bugs and errors

### Finals Week – Video, Team Eval Survey


## Timeline Table

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/4 | 2 PM | NA | Edit, finalize, and submit proposal;<br>Search for datasets / confirm CuMiDa context;<br>Assign Week 6–7 tasks + set up workflow (repo, board, channels) |
| 2/9 | 2 PM | **Import & Wrangle Data (Justin, Steven, Richard; Reviewed by Alena, Jeff):** load data, check shape/labels, basic cleaning/formatting, inspect missingness/confounders;<br>**EDA (All):** class counts, initial PCA/t-SNE, basic distribution checks;<br>Locate Affymetrix annotation file + start `.txt → .csv` conversion | Review/edit wrangling + EDA outputs;<br>Finalize cleaning decisions (normalization, outliers, label handling);<br>Plan/confirm Affymetrix ID → gene mapping approach + which fields to keep;<br>Assign next visuals + early-writeup tasks |
| 2/16 | 2 PM | Finalize wrangling + EDA (cleaned dataset saved, mapped IDs or partial mapping);<br>Expanded visuals (PCA/t-SNE/heatmaps/boxplots) + brief interpretations;<br>**Begin Analysis (Alena, Jeff; Reviewed by TBD):** define statistical question(s), pick groups, draft test plan (t-test/ANOVA assumptions/alternatives) | Edit/approve final EDA + interpretations;<br>Confirm statistical tests + comparisons that answer the research question;<br>Decide Week 9 modeling plan (baseline models, CV, feature selection);<br>Progress check + rebalance tasks if needed |
| 2/23 | 2 PM | Run statistical analysis (tests + effect sizes + p-values, note multiple comparisons if needed);<br>Start modeling (baseline classifier(s), initial feature selection, first metrics);<br>**Draft Results/Discussion outline (Jeff, Alena; Reviewed by TBD):** section skeleton + figure placeholders | Review stats results for correctness + interpretation;<br>Debug modeling + discuss confusion points (e.g., luminal A vs B);<br>Choose final model(s) + tuning plan;<br>Assign writing owners (Methods/Results/Discussion/Ethics) + editing reviewers |
| 3/2 | 2 PM | Complete modeling + finalize figures (CV metrics, confusion matrix, macro-F1, performance vs complexity);<br>Finalize feature importance/top genes (if mapped);<br>**Draft Results/Conclusion/Discussion (All):** full draft + references + captions | Full-project edit session (clarity, structure, citations);<br>Connect findings to prior work + biology;<br>Finalize ethics section;<br>Identify remaining gaps/bugs + assign final polish + video plan |
| 3/9 | 2 PM | Full draft complete (all sections written, citations added, code cleaned/commented, figures finalized);<br>Video script/outline ready; peer-review pass completed | Final QA: run notebook top-to-bottom, fix bugs, rubric check;<br>Finalize video recording roles + timeline;<br>Confirm submission checklist + team eval expectations |
| 3/16 | Before EOD | NA | Turn in Final Project & Group Project Surveys |
