# <span style="color: green">Breast Cancer and related Genes</span>

Breast cancer is a complex disease with both hereditary and environmental risk factors. While mutations in high-penetrance genes such as BRCA1, BRCA2, PTEN, TP53, CDH1, and STK11 account for up to 25% of hereditary breast cancer cases, a significant proportion of genetic predisposition remains unexplained. Advances in genomic research have identified moderate-risk genes and low-penetrance alleles, but their individual contributions to breast cancer susceptibility remain small and challenging to interpret.

# <span style="color: green">Types of breast cancer</span>

### In situ and invasive cancer (common types):

* **Ductal Carcinoma In Situ (DCIS):** DCIS is a non-invasive breast cancer where abnormal cells are confined within the milk ducts and have not spread to surrounding breast tissue. While not life-threatening, DCIS can increase the risk of developing invasive breast cancer if left untreated.

* **Invasive Ductal Carcinoma (IDC):** Breast cancers that have spread into surrounding breast tissue are known as invasive breast cancers. IDC begins in the milk ducts and invades nearby breast tissue, with the potential to spread to other parts of the body.

* **Invasive Lobular Carcinoma (ILC):** ILC starts in the milk-producing lobules and can spread to surrounding breast tissue and beyond. It is the second most common type of breast cancer.

### Special type

Some invasive breast cancers have special features or develop in different ways that influence their treatment and outlook. These cancers are less common but can be more serious than other types of breast cancer.

### ...


# <span style="color: green">Cancer Classification for various breast cancer mutations using HMM</span>

## Goal:
The purpose of this project is to use Hidden Markov Models (HMMs) to classify cancer from gene expression profiles. Each tumor type is modeled with its own HMM, so that a sample can be assigned to the type whose model best explains its expression profile.

The approach integrates gene ranking methods with HMMs. It first selects the most informative genes using five ranking criteria: the t-test, entropy, the receiver operating characteristic (ROC) curve, the Wilcoxon test, and the signal-to-noise ratio.

1. Understanding Gene Expression & Why It Matters in Cancer Classification

**What is Gene Expression?**

Every cell in your body contains the same DNA, but different genes are turned on or off depending on the cell type and its function. The process of turning genes on/off is called gene expression.

# Methods of measuring gene expression

![alt text](/123.png)

2. Protein-Based Methods

mRNA levels do not always correlate directly with protein levels, so these methods measure protein abundance.

(a) Western Blot

* Uses antibodies to detect specific proteins in a sample.
* Advantages: Specific, qualitative/semi-quantitative.
* Disadvantages: Low-throughput, only detects known proteins.

(b) Enzyme-Linked Immunosorbent Assay (ELISA)

* Uses antibodies to quantify proteins in a liquid sample.
* Advantages: Highly sensitive and specific.
* Disadvantages: Limited to known proteins, cannot detect post-translational modifications.

(c) Mass Spectrometry-Based Proteomics

* Identifies and quantifies proteins in a sample using liquid chromatography-mass spectrometry (LC-MS/MS).
* Advantages: High-throughput, detects post-translational modifications.
* Disadvantages: Requires expertise, expensive.

3. Functional Methods

Instead of measuring mRNA or protein abundance, these methods assess gene expression by measuring biological activity.

(a) Reporter Gene Assays

* Inserts a reporter gene (e.g., GFP, Luciferase) under a promoter of interest.
* Advantages: Measures real-time activity.
* Disadvantages: Artificial system, may not reflect natural gene regulation.

(b) Ribosome Profiling (Ribo-Seq)

* Measures actively translated mRNA fragments (ribosome footprints).
* Advantages: Provides direct evidence of translation.
* Disadvantages: Technically complex, requires deep sequencing.


**Why is Gene Expression Important in Cancer?**

Cancer occurs when specific genes (oncogenes) become overactive or when tumor suppressor genes stop functioning properly. Identifying which genes behave abnormally in cancer cells allows researchers to classify cancer types and potentially guide treatments.

2. Gene Selection: Choosing the Most Important Genes

**Problem: Too Many Genes, Not All Relevant**

* A DNA microarray (gene chip) can measure the expression levels of thousands of genes at once, but not all of them are useful for classification.
* If we include all genes, the model becomes too complex and inefficient.
* The goal is to filter out irrelevant genes and identify the most informative ones.

**Solution: Gene Selection Methods**

The article introduces several ranking techniques to select key genes that differentiate between cancerous and normal tissues. These techniques include:

**t-test:** Finds genes that show significant differences in expression levels between cancerous and normal cells.

**Entropy test:** Measures disorder in gene expression; genes with high entropy provide better class separation.

**Receiver Operating Characteristic (ROC) Curve:** Selects genes with strong discriminatory power.

**Wilcoxon test:** A non-parametric, rank-based test that ranks genes by how strongly the ranks of their expression values differ between the two classes.

**Signal-to-Noise Ratio (SNR):** Compares the difference in mean expression between the classes with the standard deviations within the classes.
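
Two of the ranking criteria above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the gene names, the toy expression values, and the helper names `welch_t`, `signal_to_noise`, and `rank_genes` are all hypothetical, and the SNR is assumed to be the difference in class means divided by the sum of the class standard deviations.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


def signal_to_noise(a, b):
    """Difference in means divided by the sum of standard deviations (assumed form)."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def rank_genes(expr, labels, score):
    """Sort gene names from most to least discriminative by |score|.

    expr maps a gene name to one value per sample; labels holds 1 (tumor) or 0 (normal).
    """
    scores = {}
    for gene, values in expr.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(score(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three tumor samples followed by three normal samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "GENE_A": [9.1, 8.7, 9.4, 2.0, 2.3, 1.8],  # strongly different between classes
    "GENE_B": [5.0, 5.2, 4.9, 5.1, 4.8, 5.3],  # nearly identical in both classes
    "GENE_C": [6.5, 7.5, 5.9, 4.0, 5.2, 4.6],  # moderately different
}

t_rank = rank_genes(expr, labels, welch_t)
snr_rank = rank_genes(expr, labels, signal_to_noise)
print("t-test ranking:", t_rank)
print("SNR ranking:   ", snr_rank)
```

On this toy data both criteria place `GENE_A` first and `GENE_B` last; on real microarray data the criteria often disagree, which is why the rankings need to be combined.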

## Modified Analytic Hierarchy Process (AHP) for Gene Selection

Traditional AHP is a decision-making method used for prioritizing factors based on expert judgment.
The authors modified AHP to integrate the rankings from the above five methods automatically.
Instead of relying on human experts, this method uses statistical rankings to create a robust and stable set of key genes for classification.
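
To make the idea of combining several rankings concrete, the sketch below averages each gene's position across the ranked lists. This is a plain mean-rank aggregation used as a stand-in; it is not the authors' modified AHP, whose details are not given here. The gene names, the example rankings, and the function name `aggregate_rankings` are hypothetical.

```python
def aggregate_rankings(rankings):
    """Combine ranked gene lists by average position (lower average = better).

    Every list is assumed to contain the same genes. Ties are broken by gene name.
    """
    genes = rankings[0]
    avg_pos = {
        gene: sum(r.index(gene) for r in rankings) / len(rankings)
        for gene in genes
    }
    return sorted(genes, key=lambda gene: (avg_pos[gene], gene))


# Hypothetical rankings, e.g. from the t-test, SNR, and the Wilcoxon test.
rankings = [
    ["GENE_A", "GENE_C", "GENE_B", "GENE_D"],
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    ["GENE_C", "GENE_A", "GENE_D", "GENE_B"],
]

consensus = aggregate_rankings(rankings)
print(consensus)
```

The top entries of `consensus` would then be the genes passed on to the classifier.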

3. Using Hidden Markov Models (HMMs) for Cancer Classification

**Why HMMs?**

* Cancer develops in stages, similar to how states change in an HMM.
* HMMs are well suited to analyzing sequential patterns and capturing the transitions between different gene expression states.

**How HMMs Work in Cancer Classification**

* Each cancer type gets its own HMM.
* If we are classifying between normal and cancerous tissues, we train two separate HMMs:
    * One for normal gene expression patterns.
    * One for cancer gene expression patterns.

### Training the HMM

* The selected genes from the AHP method serve as input features.
* The model learns probability distributions for these genes from labeled training data.

### Classifying a New Sample

When a new patient’s gene expression data is provided, it is scored against both trained HMMs.
The HMM that assigns the higher probability to the observed gene expression determines the classification (cancer or normal).
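
The classification rule can be sketched with the forward algorithm, which computes the probability of an observation sequence under an HMM. This is a minimal illustration under stated assumptions: expression values of the selected genes are assumed to be discretized into 0 (low), 1 (medium), and 2 (high) and read in a fixed gene order, and the two-state model parameters below are made up for the example rather than learned from data.

```python
from math import log


def log_likelihood(obs, start, trans, emit):
    """Forward algorithm with per-step scaling; returns log P(obs | model)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    total = sum(alpha)
    loglik = log(total)
    alpha = [a / total for a in alpha]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
        total = sum(alpha)
        loglik += log(total)
        alpha = [a / total for a in alpha]  # rescale to avoid underflow
    return loglik


def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical parameters: (start, transition, emission) for each class.
models = {
    "normal": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]],  # mostly low/medium expression
    ),
    "cancer": (
        [0.5, 0.5],
        [[0.6, 0.4], [0.3, 0.7]],
        [[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]],  # mostly high expression
    ),
}

print(classify([2, 2, 1, 2, 2], models))  # sample dominated by high expression
print(classify([0, 0, 1, 0, 1], models))  # sample dominated by low expression
```

In a real pipeline the start, transition, and emission probabilities would be estimated from labeled training samples instead of being written by hand.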

4. Performance Evaluation and Results


# <span style="color: green">Data Collection</span>


The National Center for Biotechnology Information (NCBI) is a part of the National Library of Medicine at the National Institutes of Health (NIH). It provides access to biological information and data to advance science and health.

What does NCBI do?

* Maintains databases of biological information, including DNA sequences, genes, proteins, and more 
* Provides tools for analyzing biological data 
* Creates resources to help researchers understand the relationship between genes and health 
* Produces resources to help researchers understand how pathogens spread and how to prevent foodborne disease 
* Creates resources to help researchers understand how diseases affect the body at the molecular and cellular level 

What resources does NCBI provide?
* GenBank: A database of publicly available DNA sequences 
* PubMed: A database of citations and abstracts for published life science journals 
* Entrez: A database retrieval system that integrates data from multiple databases 
* BLAST: A tool for searching for local alignments in biological sequences 
* ClinicalTrials.gov: A database of clinical studies funded by the public and private sectors 
* NCBI Bookshelf: A collection of books that cover topics like molecular biology, genetics, and disease states 

In [23]:
from Bio import Entrez
import ssl

# Disable SSL certificate verification (development only; do not use in production)
ssl._create_default_https_context = ssl._create_unverified_context
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Search for cancer-related gene expression studies in GEO
# Query term: HER2 in the dataset title, restricted to human samples
query = "HER2[Title] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="gds", term=query)  # "gds" is the GEO DataSets database
record = Entrez.read(handle)
handle.close()

# Get GEO dataset IDs (esearch returns only the first 20 matches by default)
geo_ids = record["IdList"]
print("Found GEO datasets:", geo_ids)

Found GEO datasets: ['200264264', '200245132', '200270021', '200254188', '200283522', '200283272', '200267855', '200263177', '200248460', '200188653', '200261815', '200261380', '200268427', '200262825', '200274220', '200267921', '200267920', '200267919', '200252799', '200241458']


In [32]:
record

{'Count': '2016', 'RetMax': '20', 'RetStart': '0', 'IdList': ['200264264', '200245132', '200270021', '200254188', '200283522', '200283272', '200267855', '200263177', '200248460', '200188653', '200261815', '200261380', '200268427', '200262825', '200274220', '200267921', '200267920', '200267919', '200252799', '200241458'], 'TranslationSet': [{'From': 'Homo sapiens[Organism]', 'To': '"Homo sapiens"[Organism]'}], 'TranslationStack': [{'Term': 'HER2[Title]', 'Field': 'Title', 'Count': '2294', 'Explode': 'N'}, {'Term': '"Homo sapiens"[Organism]', 'Field': 'Organism', 'Count': '4158885', 'Explode': 'Y'}, 'AND'], 'QueryTranslation': 'HER2[Title] AND "Homo sapiens"[Organism]'}
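
The record above reports a `Count` of 2016 but a `RetMax` of 20, so only the first 20 IDs were returned. A possible way to collect all of them is to page through the results with the `retstart` and `retmax` parameters of `esearch`. The sketch below is an assumption about how this could be done, not part of the original notebook: the helper names `page_offsets` and `fetch_all_ids` and the page size of 500 are hypothetical, and `fetch_all_ids` needs network access and a previously set `Entrez.email`.

```python
def page_offsets(total_count, page_size):
    """Return the retstart offsets needed to cover total_count records."""
    return list(range(0, total_count, page_size))


def fetch_all_ids(query, db="gds", page_size=500):
    """Collect every UID matching a query by paging through esearch (needs network)."""
    from Bio import Entrez  # imported here so page_offsets works without Biopython

    # First request only asks for the total number of matches.
    handle = Entrez.esearch(db=db, term=query, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_offsets(total, page_size):
        handle = Entrez.esearch(db=db, term=query, retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids


# Offsets needed for the 2016 matches reported above, 500 IDs per request.
offsets = page_offsets(2016, 500)
print(offsets)
```

Calling `fetch_all_ids("HER2[Title] AND Homo sapiens[Organism]")` would then issue one counting request followed by five paged requests.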

In [24]:
data_dic = {"IDs": [], "record": []}
for geo_id in geo_ids:
    # Fetch the summary record for each GEO dataset ID
    handle = Entrez.esummary(db="gds", id=geo_id)
    record_extracted = Entrez.read(handle)
    handle.close()
    data_dic["IDs"].append(geo_id)
    data_dic["record"].append(record_extracted)

In [25]:
print(data_dic)

{'IDs': ['200264264', '200245132', '200270021', '200254188', '200283522', '200283272', '200267855', '200263177', '200248460', '200188653', '200261815', '200261380', '200268427', '200262825', '200274220', '200267921', '200267920', '200267919', '200252799', '200241458'], 'record': [[{'Item': [], 'Id': '200264264', 'Accession': 'GSE264264', 'GDS': '', 'title': 'RUNX1-PDGFBB-AKT pathway mediated CDK4/6 inhibitor resistance in HR+/HER2- breast cancer', 'summary': 'Cyclin-dependent kinases 4 and 6 (CDK4/6) are essential drivers of the cell cycle and are also critical for the initiation and progression of diverse malignancies. Pharmacological inhibitors targeting CDK4/6 have demonstrated significant activity against various tumor types such as breast cancer. However, resistance to CDK4/6 inhibitors (CDK4/6i) (such as palbociclib) remain an immense obstacle in clinical and the underlying mechanisms have not been fully understood. Using Conditional medium co-culture, quantitative high-throughpu

In [26]:
import pandas as pd
df=pd.DataFrame(data_dic)
df.head()

Unnamed: 0,IDs,record
0,200264264,"[{'Item': [], 'Id': '200264264', 'Accession': ..."
1,200245132,"[{'Item': [], 'Id': '200245132', 'Accession': ..."
2,200270021,"[{'Item': [], 'Id': '200270021', 'Accession': ..."
3,200254188,"[{'Item': [], 'Id': '200254188', 'Accession': ..."
4,200283522,"[{'Item': [], 'Id': '200283522', 'Accession': ..."
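
In the table above, each `record` cell still holds a nested list with one dictionary, which is awkward to filter or search. One option is to flatten the fields of interest into plain rows first. The sketch below assumes the structure shown in the printed output (a one-element list per ID, with keys such as `Accession` and `title`); the function name `flatten_summaries` and the choice of fields are hypothetical.

```python
def flatten_summaries(data, fields=("Accession", "title")):
    """Turn {'IDs': [...], 'record': [...]} into a list of flat row dictionaries."""
    rows = []
    for uid, record in zip(data["IDs"], data["record"]):
        summary = record[0]  # each esummary result is a one-element list
        row = {"uid": uid}
        for field in fields:
            row[field] = summary.get(field, "")  # empty string if a field is missing
        rows.append(row)
    return rows


# Example input shaped like the first entry printed earlier in this notebook.
example = {
    "IDs": ["200264264"],
    "record": [[{
        "Item": [],
        "Id": "200264264",
        "Accession": "GSE264264",
        "GDS": "",
        "title": "RUNX1-PDGFBB-AKT pathway mediated CDK4/6 inhibitor "
                 "resistance in HR+/HER2- breast cancer",
    }]],
}

rows = flatten_summaries(example)
print(rows[0]["Accession"], "|", rows[0]["title"])
```

Passing the resulting list to `pd.DataFrame(rows)` gives one column per field instead of a single nested `record` column.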


In [27]:
data_dic['record'][0]

[{'Item': [], 'Id': '200264264', 'Accession': 'GSE264264', 'GDS': '', 'title': 'RUNX1-PDGFBB-AKT pathway mediated CDK4/6 inhibitor resistance in HR+/HER2- breast cancer', 'summary': 'Cyclin-dependent kinases 4 and 6 (CDK4/6) are essential drivers of the cell cycle and are also critical for the initiation and progression of diverse malignancies. Pharmacological inhibitors targeting CDK4/6 have demonstrated significant activity against various tumor types such as breast cancer. However, resistance to CDK4/6 inhibitors (CDK4/6i) (such as palbociclib) remain an immense obstacle in clinical and the underlying mechanisms have not been fully understood. Using Conditional medium co-culture, quantitative high-throughput combinational screen (qHTCS), and genomic sequencing, we report that the RUNX1-PDGFBB-AKt pathway was significantly elevated in palbociclib-resistance cells. Inhibition of this axis can enhance the therapeutic efficacy of Palbociclib and surmount Palbociclib resistance both in v

**GPL:** GPL stands for GEO Platform. A GEO Platform (GPLxxx) represents the specific microarray or sequencing technology used in an experiment and defines the features or elements that it measures.

**GSE:** GSE stands for GEO Series. A GEO Series (GSExxx) represents a collection of related samples that together form a single experiment or study. It provides a summary and description of the entire research project.

**GSM:** In the Gene Expression Omnibus (GEO), Samples (GSMxxx) are the individual biological samples analyzed in a functional genomics experiment, such as gene expression profiling, RNA sequencing, or other high-throughput assays. Each Sample record contains detailed information about the sample, including its source, the protocols used to analyze it, and the resulting expression data.

In [11]:
!pip install GEOparse

Collecting GEOparse
  Downloading GEOparse-2.0.4-py3-none-any.whl.metadata (6.5 kB)
Downloading GEOparse-2.0.4-py3-none-any.whl (29 kB)
Installing collected packages: GEOparse
Successfully installed GEOparse-2.0.4


In [28]:
import GEOparse

# Load the GEO dataset
gse = GEOparse.get_GEO("GSE264264")

# Get the associated GPL
for gpl in gse.gpls:
    print("GPL ID:", gpl)

06-Feb-2025 01:16:01 DEBUG utils - Directory ./ already exists. Skipping.
06-Feb-2025 01:16:01 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE264nnn/GSE264264/soft/GSE264264_family.soft.gz to ./GSE264264_family.soft.gz
100%|██████████| 2.70k/2.70k [00:00<00:00, 6.03kB/s]
06-Feb-2025 01:16:02 DEBUG downloader - Size validation passed
06-Feb-2025 01:16:02 DEBUG downloader - Moving /var/folders/pd/wynthhk510v_lw3l_4q234gh0000gn/T/tmpj7p1sv7p to /Users/rav007/Documents/SeniorYearProject/CancerClassification/GSE264264_family.soft.gz
06-Feb-2025 01:16:02 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE264nnn/GSE264264/soft/GSE264264_family.soft.gz
06-Feb-2025 01:16:02 INFO GEOparse - Parsing ./GSE264264_family.soft.gz: 
06-Feb-2025 01:16:02 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:16:02 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:16:02 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:16:02 DEBUG GEOparse - SAMPLE: GS

GPL ID: GPL20301


In [29]:
import GEOparse

# Parse the locally downloaded, gzip-compressed SOFT file
gse = GEOparse.get_GEO(filepath='GSE264264_family.soft.gz')

# Access the data
print(gse.gsms)  # Access GSMs (samples)
print(gse.gpls)  # Access GPLs (platforms)

06-Feb-2025 01:16:18 INFO GEOparse - Parsing GSE264264_family.soft.gz: 
06-Feb-2025 01:16:18 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:16:18 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:16:18 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215181
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215182
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215183
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215184
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215185
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215186


{'GSM8215181': <SAMPLE: GSM8215181>, 'GSM8215182': <SAMPLE: GSM8215182>, 'GSM8215183': <SAMPLE: GSM8215183>, 'GSM8215184': <SAMPLE: GSM8215184>, 'GSM8215185': <SAMPLE: GSM8215185>, 'GSM8215186': <SAMPLE: GSM8215186>}
{'GPL20301': <d: GPL20301>}


In [31]:

gse_id = "GSE264264"
gse = GEOparse.get_GEO(geo=gse_id, destdir="./")

# List all samples
samples = gse.gsms
print(samples.keys())  # List all GSM sample IDs

# Check metadata for a sample
gsm_id = list(samples.keys())[0]  # Pick the first sample
sample_metadata = gse.gsms[gsm_id].metadata
print(f"Metadata for {gsm_id}:")
for key, value in sample_metadata.items():
    print(f"{key}: {value}")

# Get expression data for a sample
expression_table = gse.gsms[gsm_id].table
print(expression_table.head())  # Preview the first few rows


06-Feb-2025 01:18:08 DEBUG utils - Directory ./ already exists. Skipping.
06-Feb-2025 01:18:08 INFO GEOparse - File already exist: using local version.
06-Feb-2025 01:18:08 INFO GEOparse - Parsing ./GSE264264_family.soft.gz: 
06-Feb-2025 01:18:08 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:18:08 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:18:08 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215181
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215182
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215183
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215184
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215185
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215186


dict_keys(['GSM8215181', 'GSM8215182', 'GSM8215183', 'GSM8215184', 'GSM8215185', 'GSM8215186'])
Metadata for GSM8215181:
title: ['MCF-7 PR-1 cells']
geo_accession: ['GSM8215181']
status: ['Public on Jan 30 2025']
submission_date: ['Apr 17 2024']
last_update_date: ['Jan 30 2025']
type: ['SRA']
channel_count: ['1']
source_name_ch1: ['Breast']
organism_ch1: ['Homo sapiens']
taxid_ch1: ['9606']
characteristics_ch1: ['tissue: Breast', 'cell line: MCF-7 PR', 'cell type: Breast cancer cells', 'genotype: PR', 'treatment: Palbociclib']
growth_protocol_ch1: ['All cells were maintained in DMEM supplemented with 10% FBS at 37 °C with 5% CO2.']
molecule_ch1: ['total RNA']
extract_protocol_ch1: ['Total RNA was harvested using Rneasy mini plus kit (Qiagen). 2ug of the total RNA was used for the construction of sequencing libraries', 'Total RNA was extracted and prepared for cDNA libraries within Illumina TruSeq Stranded mRNA sample preparation kit']
data_processing: ['GLC Genomics Workbench v 11.0.1'
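
The `characteristics_ch1` field printed above is a list of `key: value` strings, which is easier to work with as a dictionary, for example when grouping samples by treatment. A minimal sketch follows; the helper name `parse_characteristics` is hypothetical, and entries without a colon are simply skipped.

```python
def parse_characteristics(entries):
    """Convert ['key: value', ...] strings into a {'key': 'value'} dictionary."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition(":")  # split on the first colon only
        if sep:  # ignore entries that contain no colon
            parsed[key.strip()] = value.strip()
    return parsed


# Values copied from the metadata of GSM8215181 shown above.
characteristics = [
    "tissue: Breast",
    "cell line: MCF-7 PR",
    "cell type: Breast cancer cells",
    "genotype: PR",
    "treatment: Palbociclib",
]

sample_info = parse_characteristics(characteristics)
print(sample_info["cell line"], "|", sample_info["treatment"])
```

In the notebook this could be applied as `parse_characteristics(sample_metadata["characteristics_ch1"])` for each sample.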

# Terminologies

**Benign vs malignant:** Benign tumors are noncancerous, while malignant tumors are cancerous.

# References

### Genes involved in breast cancer
https://pmc.ncbi.nlm.nih.gov/articles/PMC4478970/

### Types of breast cancer
https://www.cancer.org/cancer/types/breast-cancer/about/types-of-breast-cancer.html

### HMM and Cancer Classification

https://www.sciencedirect.com/science/article/pii/S0020025515002807