![AltText](cover.png)

# <span style="color: green">Breast Cancer and related Genes</span>

Breast cancer is the most frequently diagnosed cancer in women and one of the leading causes of cancer death worldwide. It develops primarily in the cells lining the milk ducts, which carry milk, and the lobules, which produce it. Most cases occur in women, but breast cancer also develops in men.

![breastcancer](BreastCancer.png)

In the United States, an estimated 310,720 new cases of breast cancer in women and 2,790 in men are predicted for 2024, so men account for about 0.9% of cases and women for about 99.1%. The disease is initially confined to breast tissue, then invades neighboring tissue and can reach the lymph nodes and distant parts of the body through the bloodstream or lymphatic system. Early detection and correct subtype assignment are essential because different breast cancer subtypes call for different treatments, and matching the treatment to the subtype improves survival outcomes.


# <span style="color: green">Common Types of Breast Cancer</span>

### In situ and invasive cancer (common types):

* **Ductal Carcinoma In Situ (DCIS):** DCIS is a non-invasive breast cancer where abnormal cells are confined within the milk ducts and have not spread to surrounding breast tissue. While not life-threatening, DCIS can increase the risk of developing invasive breast cancer if left untreated.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE281307

* **Invasive Ductal Carcinoma (IDC):** Breast cancers that have spread into surrounding breast tissue are known as invasive breast cancers. IDC begins in the milk ducts and invades nearby breast tissue, with the potential to spread to other parts of the body.

* **Invasive Lobular Carcinoma (ILC):** ILC starts in the milk-producing lobules and can spread to surrounding breast tissue and beyond. It is the second most common type of breast cancer.

### Special type

Some invasive breast cancers have special features or develop in different ways that influence their treatment and outlook. These cancers are less common but can be more serious than other types of breast cancer.

# Molecular level

At the molecular level, breast cancer is associated with alterations in human epidermal growth factor receptor 2 (HER2, encoded by ERBB2), in the hormone receptors (estrogen and progesterone receptors), and with mutations in the BRCA1 and BRCA2 genes.




# <span style="color: green">Cancer Classification for Various Breast Cancer Mutations Using HMM</span>

## Goal:
The purpose of this project is to use Hidden Markov Models (HMMs) for cancer classification from gene expression profiles. Each tumor type is modeled with its own HMM, and a sample is classified according to the model that best explains its expression profile.
The approach integrates gene ranking methods with HMMs: the most informative genes are selected using the t-test, entropy, the receiver operating characteristic (ROC) curve, the Wilcoxon test, and the signal-to-noise ratio.
## <span style="color: green">Broader impact</span>

1. Cancer Subtype Classification

* HMMs can help differentiate between cancer subtypes by modeling gene expression patterns.
* We can discover hidden states that correspond to different cancer progression stages or molecular subtypes.
2. Identification of Biomarkers for Cancer Progression
* By analyzing state transitions, we can identify genes that play a crucial role in cancer progression.
* Genes consistently associated with high-expression hidden states can be potential biomarkers for diagnosis or prognosis.
3. Gene Regulatory Network Inference
* HMMs can reveal gene co-expression patterns and identify groups of genes that are regulated together.
* This insight helps in understanding the underlying biological mechanisms of cancer development.
4. Detection of Aberrant Gene Expression States
* Some genes may transition between normal and cancerous expression states.
* Tracking these transitions can help identify early warning signs of cancer or predict relapse risk.
5. Evolutionary Insights into Cancer Development
* HMMs can model cancer as a sequential process, showing how gene expression evolves over time.
* This can help predict tumor progression, drug resistance, or metastatic potential.
6. Personalized Treatment and Therapy Response Prediction
* If different patients have different HMM states, these states could be used to predict responses to specific treatments.
* Certain expression patterns may indicate resistance or sensitivity to chemotherapy or targeted therapies.
7. Identification of Key Pathways Driving Cancer
* By linking gene expression states to biological pathways, HMMs can highlight which cellular pathways are disrupted in cancer.
* This insight is valuable for developing targeted therapies.
8. Survival Analysis & Prognosis
* Patients with similar HMM-derived expression states can be grouped to predict overall survival or recurrence risk.
* This can aid in risk stratification and clinical decision-making.


**Process:**

1. Understanding Gene Expression & Why It Matters in Cancer Classification

**What is Gene Expression?**

Every cell in your body contains the same DNA, but different genes are turned on or off depending on the cell type and its function. The process of turning genes on/off is called gene expression.

**Fold enrichment** is a statistical measure that compares the frequency of a specific gene set in a list of interest to its background frequency among all genes. It is used in bioinformatics, for example in the analysis of next-generation sequencing (NGS) data.
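As a rough illustration, fold enrichment can be computed as the fraction of a gene list that falls in the gene set, divided by the fraction of the background that falls in the same set. The function name and the example numbers below are hypothetical and only meant as a sketch.

```python
def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Ratio of a gene set's frequency in a list of interest to its background frequency."""
    if min(list_size, background_size, hits_in_background) <= 0:
        raise ValueError("sizes and background hits must be positive")
    observed = hits_in_list / list_size              # frequency in the list of interest
    expected = hits_in_background / background_size  # frequency among all genes
    return observed / expected


# Hypothetical numbers: 20 of 200 selected genes belong to a pathway
# that contains 400 of 20,000 background genes.
example = fold_enrichment(20, 200, 400, 20000)
print(example)
```

A value above 1 suggests the gene set is over-represented in the list; whether that is statistically meaningful needs a separate test.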

# Methods of measuring gene expression

![alt text](./123.png)

2. Protein-Based Methods

mRNA levels do not always correlate directly with protein levels, so these methods measure protein abundance.

(a) Western Blot
* Uses antibodies to detect specific proteins in a sample.
* Advantages: Specific, qualitative/semi-quantitative.
* Disadvantages: Low-throughput, only detects known proteins.

(b) Enzyme-Linked Immunosorbent Assay (ELISA)
* Uses antibodies to quantify proteins in a liquid sample.
* Advantages: Highly sensitive and specific.
* Disadvantages: Limited to known proteins, cannot detect post-translational modifications.

(c) Mass Spectrometry-Based Proteomics
* Identifies and quantifies proteins in a sample using liquid chromatography-mass spectrometry (LC-MS/MS).
* Advantages: High-throughput, detects post-translational modifications.
* Disadvantages: Requires expertise, expensive.

3. Functional Methods

Instead of measuring mRNA or protein abundance, these methods assess gene expression by measuring biological activity.

(a) Reporter Gene Assays
* Inserts a reporter gene (e.g., GFP, Luciferase) under a promoter of interest.
* Advantages: Measures real-time activity.
* Disadvantages: Artificial system, may not reflect natural gene regulation.

(b) Ribosome Profiling (Ribo-Seq)
* Measures actively translated mRNA fragments (ribosome footprints).
* Advantages: Provides direct evidence of translation.
* Disadvantages: Technically complex, requires deep sequencing.


**Why is Gene Expression Important in Cancer?**

Cancer occurs when specific genes (oncogenes) become overactive or when tumor suppressor genes stop functioning properly. Identifying which genes behave abnormally in cancer cells allows researchers to classify cancer types and potentially guide treatments.

2. Gene Selection: Choosing the Most Important Genes

**Problem: Too Many Genes, Not All Relevant**
* A DNA microarray (gene chip) can measure the expression levels of thousands of genes at once, but not all of them are useful for classification.
* If we include all genes, the model becomes too complex and inefficient.
* The goal is to filter out irrelevant genes and identify the most informative ones.

**Solution: Gene Selection Methods**
* The article introduces several ranking techniques to select key genes that differentiate between cancerous and normal tissues. These techniques include:

**t-test:** Finds genes that show significant differences in mean expression levels between cancerous and normal cells.

**Entropy test:** Measures disorder in gene expression; genes with high entropy provide better class separation.

**Receiver Operating Characteristic (ROC) curve:** Selects genes with strong discriminatory power between the two classes.

**Wilcoxon test:** A non-parametric test that ranks genes based on their median expression difference.

**Signal-to-Noise Ratio (SNR):** Compares the difference in mean expression levels between classes with their standard deviations.
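Two of these scores are simple enough to sketch with the Python standard library. The example below ranks genes by the signal-to-noise ratio and by Welch's t-statistic. The gene names, the toy expression values and the helper names are hypothetical, and a real analysis would more likely use `scipy.stats` on the full expression matrix.

```python
from math import sqrt
from statistics import mean, stdev


def signal_to_noise(cancer, normal):
    """SNR: difference of class means over the sum of class standard deviations."""
    return (mean(cancer) - mean(normal)) / (stdev(cancer) + stdev(normal))


def welch_t(cancer, normal):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    var_c = stdev(cancer) ** 2 / len(cancer)
    var_n = stdev(normal) ** 2 / len(normal)
    return (mean(cancer) - mean(normal)) / sqrt(var_c + var_n)


def rank_genes(expression, score):
    """Return gene names sorted by decreasing absolute score."""
    scores = {gene: score(c, n) for gene, (c, n) in expression.items()}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical toy data: gene -> (cancer samples, normal samples)
toy = {
    "GENE_A": ([9.0, 10.0, 11.0], [1.0, 2.0, 3.0]),  # strongly separated
    "GENE_B": ([5.0, 6.0, 7.0], [4.0, 6.0, 8.0]),    # overlapping
    "GENE_C": ([2.0, 3.0, 4.0], [6.0, 7.0, 8.0]),    # separated, lower in cancer
}
snr_ranking = rank_genes(toy, signal_to_noise)
t_ranking = rank_genes(toy, welch_t)
print(snr_ranking, t_ranking)
```

Ranking by absolute value keeps genes that are strongly down-regulated in cancer as well as those that are up-regulated. Genes with constant expression in both classes would need special handling, since the denominators become zero.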

## Modified Analytic Hierarchy Process (AHP) for Gene Selection

Traditional AHP is a decision-making method used for prioritizing factors based on expert judgment.
The authors modified AHP to integrate the rankings from the above five methods automatically.
Instead of relying on human experts, this method uses statistical rankings to create a robust and stable set of key genes for classification.
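
The modified AHP itself is not reproduced here. As a much simpler stand-in, the sketch below combines several gene rankings by their average rank position (a Borda-style aggregation). This is a swapped-in technique for illustration only, not the authors' AHP procedure, and all names and rankings are hypothetical.

```python
def aggregate_rankings(rankings):
    """Combine rankings (lists of gene names, best first) by mean rank position."""
    genes = set(rankings[0])
    if any(set(r) != genes for r in rankings):
        raise ValueError("all rankings must contain the same genes")
    mean_rank = {
        g: sum(r.index(g) for r in rankings) / len(rankings) for g in genes
    }
    # Sort by mean rank, breaking ties alphabetically for a stable result
    return sorted(genes, key=lambda g: (mean_rank[g], g))


# Hypothetical rankings produced by three of the scoring methods
by_t_test = ["GENE_A", "GENE_C", "GENE_B"]
by_snr = ["GENE_A", "GENE_B", "GENE_C"]
by_wilcoxon = ["GENE_C", "GENE_A", "GENE_B"]
consensus = aggregate_rankings([by_t_test, by_snr, by_wilcoxon])
print(consensus)
```

The top entries of such a consensus list would then serve as the selected genes passed on to the classifier.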

3. Using Hidden Markov Models (HMMs) for Cancer Classification

**Why HMMs?**

* Cancer develops in stages, similar to how states change in an HMM.
* HMMs are well suited to analyzing sequential patterns and capturing the transitions between different gene expression states.

**How HMMs Work in Cancer Classification**

* Each cancer type gets its own HMM.
* If we are classifying between normal and cancerous tissues, we train two separate HMMs:
    * One for normal gene expression patterns.
    * One for cancer gene expression patterns.

### Training the HMM
The selected genes from the AHP method serve as input features.
The model learns probability distributions for these genes from labeled training data.

**Classifying a New Sample**

When a new patient's gene expression data is provided, it is scored against both trained HMMs.
The HMM with the higher probability of generating the observed gene expression determines the classification (cancer or normal).
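
The decision rule can be sketched with a small discrete-emission HMM and the forward algorithm. This is a minimal sketch under stated assumptions: the parameters below are set by hand and are hypothetical (in practice they would be learned from labeled data, for example with Baum-Welch), and expression values are assumed to be discretized into 0 = low and 1 = high.

```python
from math import log


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-emission HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    total = sum(alpha)
    loglik = log(total)
    alpha = [a / total for a in alpha]  # rescale to avoid numerical underflow
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
        total = sum(alpha)
        loglik += log(total)
        alpha = [a / total for a in alpha]
    return loglik


def classify(obs, models):
    """Return the label of the HMM that assigns the highest likelihood to obs."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))


# Hypothetical hand-set parameters: (start, transition, emission) per class.
models = {
    "normal": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.7, 0.3]]),
    "cancer": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.4, 0.6]]),
}
sample = [1, 1, 0, 1, 1]  # mostly high expression across the selected genes
print(classify(sample, models))
```

Working in log space with rescaling keeps the computation stable when the observation sequence covers many genes.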

4. Performance Evaluation and Results

---

# <span style="color: green">Data Collection</span>


The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine at the National Institutes of Health (NIH). It provides access to biological information and data to advance science and health.

What does NCBI do?

* Maintains databases of biological information, including DNA sequences, genes, proteins, and more 
* Provides tools for analyzing biological data 
* Creates resources to help researchers understand the relationship between genes and health 
* Produces resources to help researchers understand how pathogens spread and how to prevent foodborne disease 
* Creates resources to help researchers understand how diseases affect the body at the molecular and cellular level 

What resources does NCBI provide?
* GenBank: A database of publicly available DNA sequences 
* PubMed: A database of citations and abstracts for published life science journals 
* Entrez: A database retrieval system that integrates data from multiple databases 
* BLAST: A tool for searching for local alignments in biological sequences 
* ClinicalTrials.gov: A database of clinical studies funded by the public and private sectors 
* NCBI Bookshelf: A collection of books that cover topics like molecular biology, genetics, and disease states 

![Illumina sequencing steps](Illumina-Sequencing-Steps-1644x2048.webp)

Steps/Process of Illumina Sequencing
1. Nucleic Acid Extraction
The first step in Illumina sequencing is isolating the genetic material from samples of interest. The extraction process is important because the quality of the nucleic acids extracted will directly affect the sequencing results. After extraction, a quality control check is usually performed to ensure the nucleic acids are pure and accurately quantified. UV spectrophotometry is typically used to check the purity, while fluorometric methods are preferred for measuring nucleic acid concentration.

2. Library Preparation
After nucleic acids are isolated, they are prepared for sequencing by creating a library which is a collection of adapter-ligated DNA fragments that can be read by the sequencer. The process starts with DNA fragmentation, where the sample is broken into smaller fragments using methods like mechanical shearing, enzymatic digestion, or transposon-based fragmentation. These fragments undergo end repair and A-tailing to prepare for the attachment of short specific DNA sequences called adapters to both ends of the fragments. These adapters contain sequences that help bind the DNA to the sequencing flow cell. They also include barcode sequences that allow multiple samples to be sequenced simultaneously and distinguished later in the analysis.

3. Cluster Generation by Bridge Amplification
The DNA library is loaded onto a flow cell containing small lanes where amplification and sequencing occur. The DNA fragments bind to complementary primers attached to the solid surface of the flow cell and undergo bridge amplification. In bridge PCR, each DNA strand bends over to form a bridge on the flow cell surface, helped by the forward and reverse primers attached there. Each bridge is amplified, creating a cluster of identical copies at each spot. Cluster generation finishes when each spot has enough copies to produce a strong, clear signal.

4. Sequencing by Synthesis (SBS)
Once clusters are generated, the SBS process begins. Fluorescently labeled nucleotides are added one at a time to the growing DNA strand, and each nucleotide emits a fluorescent signal as it is incorporated. The specific color emitted allows the system to identify the nucleotide. The sequence of each DNA fragment is determined over multiple cycles.

5. Data Analysis
Once sequencing is complete, the reads are processed and analyzed using bioinformatics tools. Images collected from each cycle are converted into base sequences by analyzing the fluorescent signals. Bioinformatics tools then clean up and organize the data so that the sequences are ready for analysis. Next, the sequences are aligned to a reference genome, or assembled if no reference is available. This helps identify sequence variants, map gene locations, and enable downstream analyses. Finally, the data are interpreted to analyze pathways, identify potential biomarkers, or predict gene functions, translating raw sequencing data into meaningful biological insights. Some Illumina instruments have built-in, easy-to-use analysis software that can help researchers without bioinformatics expertise.



In [3]:
from Bio import Entrez
import ssl

# Disable SSL certificate verification (development only; do not use in production)
ssl._create_default_https_context = ssl._create_unverified_context
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Search for cancer-related gene expression studies in GEO
#Type: DCIS
query = "*Ductal Carcinoma In Situ[Title] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="gds", term=query)  # "gds" is the GEO DataSets database
record = Entrez.read(handle)
handle.close()

# Get GEO dataset IDs
geo_ids = record["IdList"]
print("Found GEO datasets:", geo_ids)

Found GEO datasets: ['200281307', '200281303', '200231984', '200231969', '200230193', '200196208', '200169393', '200113795', '200148548', '200148547', '200148546', '200113909', '200100503', '200092697', '200059248', '200059247', '200059246', '200069994', '200069993', '200069240']


In [32]:
record

{'Count': '2016', 'RetMax': '20', 'RetStart': '0', 'IdList': ['200264264', '200245132', '200270021', '200254188', '200283522', '200283272', '200267855', '200263177', '200248460', '200188653', '200261815', '200261380', '200268427', '200262825', '200274220', '200267921', '200267920', '200267919', '200252799', '200241458'], 'TranslationSet': [{'From': 'Homo sapiens[Organism]', 'To': '"Homo sapiens"[Organism]'}], 'TranslationStack': [{'Term': 'HER2[Title]', 'Field': 'Title', 'Count': '2294', 'Explode': 'N'}, {'Term': '"Homo sapiens"[Organism]', 'Field': 'Organism', 'Count': '4158885', 'Explode': 'Y'}, 'AND'], 'QueryTranslation': 'HER2[Title] AND "Homo sapiens"[Organism]'}

In [24]:
data_dic = {"IDs": [], "record": []}
for info in geo_ids:
    # Fetch the summary record for each GEO DataSets UID
    handle = Entrez.esummary(db="gds", id=info)
    record_extracted = Entrez.read(handle)
    handle.close()
    data_dic["IDs"].append(info)
    data_dic["record"].append(record_extracted)

In [25]:
print(data_dic)

{'IDs': ['200264264', '200245132', '200270021', '200254188', '200283522', '200283272', '200267855', '200263177', '200248460', '200188653', '200261815', '200261380', '200268427', '200262825', '200274220', '200267921', '200267920', '200267919', '200252799', '200241458'], 'record': [[{'Item': [], 'Id': '200264264', 'Accession': 'GSE264264', 'GDS': '', 'title': 'RUNX1-PDGFBB-AKT pathway mediated CDK4/6 inhibitor resistance in HR+/HER2- breast cancer', 'summary': 'Cyclin-dependent kinases 4 and 6 (CDK4/6) are essential drivers of the cell cycle and are also critical for the initiation and progression of diverse malignancies. Pharmacological inhibitors targeting CDK4/6 have demonstrated significant activity against various tumor types such as breast cancer. However, resistance to CDK4/6 inhibitors (CDK4/6i) (such as palbociclib) remain an immense obstacle in clinical and the underlying mechanisms have not been fully understood. Using Conditional medium co-culture, quantitative high-throughpu

In [26]:
import pandas as pd
df=pd.DataFrame(data_dic)
df.head()

| | IDs | record |
|---|---|---|
| 0 | 200264264 | [{'Item': [], 'Id': '200264264', 'Accession': ... |
| 1 | 200245132 | [{'Item': [], 'Id': '200245132', 'Accession': ... |
| 2 | 200270021 | [{'Item': [], 'Id': '200270021', 'Accession': ... |
| 3 | 200254188 | [{'Item': [], 'Id': '200254188', 'Accession': ... |
| 4 | 200283522 | [{'Item': [], 'Id': '200283522', 'Accession': ... |


In [27]:
data_dic['record'][0]

[{'Item': [], 'Id': '200264264', 'Accession': 'GSE264264', 'GDS': '', 'title': 'RUNX1-PDGFBB-AKT pathway mediated CDK4/6 inhibitor resistance in HR+/HER2- breast cancer', 'summary': 'Cyclin-dependent kinases 4 and 6 (CDK4/6) are essential drivers of the cell cycle and are also critical for the initiation and progression of diverse malignancies. Pharmacological inhibitors targeting CDK4/6 have demonstrated significant activity against various tumor types such as breast cancer. However, resistance to CDK4/6 inhibitors (CDK4/6i) (such as palbociclib) remain an immense obstacle in clinical and the underlying mechanisms have not been fully understood. Using Conditional medium co-culture, quantitative high-throughput combinational screen (qHTCS), and genomic sequencing, we report that the RUNX1-PDGFBB-AKt pathway was significantly elevated in palbociclib-resistance cells. Inhibition of this axis can enhance the therapeutic efficacy of Palbociclib and surmount Palbociclib resistance both in v

**GPL** stands for GEO Platform. A GEO Platform (GPLxxx) represents the specific microarray or sequencing technology used in an experiment, essentially defining the features or elements measured.

**GSE** stands for GEO Series. A GEO Series (GSExxx) represents a collection of related samples that together form a single experiment or study. It provides a summary and description of the entire research project.

**GSM** stands for GEO Sample. In the Gene Expression Omnibus (GEO), Samples (GSMxxx) are individual biological samples that have been analyzed in a functional genomics experiment, such as gene expression profiling, RNA sequencing, or other high-throughput assays. Each Sample record contains detailed information about the sample, including its source, the protocols used to analyze it, and the resulting expression data.
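
The summary records above pair the UID `200264264` with the accession `GSE264264`. The helper below sketches that mapping under the assumption that a nine-digit GEO DataSets UID consists of a type digit (1 for GPL, 2 for GSE, 3 for GSM) followed by a zero-padded accession number. The function name is hypothetical, and the `Accession` field of the summary record remains the authoritative source.

```python
def gds_uid_to_accession(uid):
    """Map a GEO DataSets UID to an accession string, or None if the pattern does not fit.

    Assumes the convention: first digit 1 -> GPL, 2 -> GSE, 3 -> GSM,
    followed by the accession number padded with zeros to eight digits.
    """
    prefixes = {"1": "GPL", "2": "GSE", "3": "GSM"}
    uid = str(uid)
    if len(uid) != 9 or not uid.isdigit() or uid[0] not in prefixes:
        return None  # other record types (e.g. short UIDs) are not handled here
    return prefixes[uid[0]] + str(int(uid[1:]))


print(gds_uid_to_accession("200264264"))
```

This can save one `esummary` call per record when only the accession is needed, for example before passing it to `GEOparse.get_GEO`.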

In [11]:
!pip install GEOparse

Collecting GEOparse
  Downloading GEOparse-2.0.4-py3-none-any.whl.metadata (6.5 kB)
Downloading GEOparse-2.0.4-py3-none-any.whl (29 kB)
Installing collected packages: GEOparse
Successfully installed GEOparse-2.0.4


In [28]:
import GEOparse

# Load the GEO dataset
gse = GEOparse.get_GEO("GSE264264")

# Get the associated GPL
for gpl in gse.gpls:
    print("GPL ID:", gpl)

06-Feb-2025 01:16:01 DEBUG utils - Directory ./ already exists. Skipping.
06-Feb-2025 01:16:01 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE264nnn/GSE264264/soft/GSE264264_family.soft.gz to ./GSE264264_family.soft.gz
100%|██████████| 2.70k/2.70k [00:00<00:00, 6.03kB/s]
06-Feb-2025 01:16:02 DEBUG downloader - Size validation passed
06-Feb-2025 01:16:02 DEBUG downloader - Moving /var/folders/pd/wynthhk510v_lw3l_4q234gh0000gn/T/tmpj7p1sv7p to /Users/rav007/Documents/SeniorYearProject/CancerClassification/GSE264264_family.soft.gz
06-Feb-2025 01:16:02 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE264nnn/GSE264264/soft/GSE264264_family.soft.gz
06-Feb-2025 01:16:02 INFO GEOparse - Parsing ./GSE264264_family.soft.gz: 
06-Feb-2025 01:16:02 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:16:02 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:16:02 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:16:02 DEBUG GEOparse - SAMPLE: GS

GPL ID: GPL20301


In [29]:
import GEOparse

# Parse the previously downloaded, gzip-compressed SOFT file directly
gse = GEOparse.get_GEO(filepath='GSE264264_family.soft.gz')

# Access the data
print(gse.gsms)  # Access GSMs (samples)
print(gse.gpls)  # Access GPLs (platforms)

06-Feb-2025 01:16:18 INFO GEOparse - Parsing GSE264264_family.soft.gz: 
06-Feb-2025 01:16:18 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:16:18 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:16:18 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215181
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215182
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215183
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215184
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215185
06-Feb-2025 01:16:18 DEBUG GEOparse - SAMPLE: GSM8215186


{'GSM8215181': <SAMPLE: GSM8215181>, 'GSM8215182': <SAMPLE: GSM8215182>, 'GSM8215183': <SAMPLE: GSM8215183>, 'GSM8215184': <SAMPLE: GSM8215184>, 'GSM8215185': <SAMPLE: GSM8215185>, 'GSM8215186': <SAMPLE: GSM8215186>}
{'GPL20301': <d: GPL20301>}


In [31]:

gse_id = "GSE264264"
gse = GEOparse.get_GEO(geo=gse_id, destdir="./")

# List all samples
samples = gse.gsms
print(samples.keys())  # List all GSM sample IDs

# Check metadata for a sample
gsm_id = list(samples.keys())[0]  # Pick the first sample
sample_metadata = gse.gsms[gsm_id].metadata
print(f"Metadata for {gsm_id}:")
for key, value in sample_metadata.items():
    print(f"{key}: {value}")

# Get expression data for a sample
expression_table = gse.gsms[gsm_id].table
print(expression_table.head())  # Preview the first few rows


06-Feb-2025 01:18:08 DEBUG utils - Directory ./ already exists. Skipping.
06-Feb-2025 01:18:08 INFO GEOparse - File already exist: using local version.
06-Feb-2025 01:18:08 INFO GEOparse - Parsing ./GSE264264_family.soft.gz: 
06-Feb-2025 01:18:08 DEBUG GEOparse - DATABASE: GeoMiame
06-Feb-2025 01:18:08 DEBUG GEOparse - SERIES: GSE264264
06-Feb-2025 01:18:08 DEBUG GEOparse - PLATFORM: GPL20301
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215181
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215182
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215183
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215184
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215185
06-Feb-2025 01:18:08 DEBUG GEOparse - SAMPLE: GSM8215186


dict_keys(['GSM8215181', 'GSM8215182', 'GSM8215183', 'GSM8215184', 'GSM8215185', 'GSM8215186'])
Metadata for GSM8215181:
title: ['MCF-7 PR-1 cells']
geo_accession: ['GSM8215181']
status: ['Public on Jan 30 2025']
submission_date: ['Apr 17 2024']
last_update_date: ['Jan 30 2025']
type: ['SRA']
channel_count: ['1']
source_name_ch1: ['Breast']
organism_ch1: ['Homo sapiens']
taxid_ch1: ['9606']
characteristics_ch1: ['tissue: Breast', 'cell line: MCF-7 PR', 'cell type: Breast cancer cells', 'genotype: PR', 'treatment: Palbociclib']
growth_protocol_ch1: ['All cells were maintained in DMEM supplemented with 10% FBS at 37 °C with 5% CO2.']
molecule_ch1: ['total RNA']
extract_protocol_ch1: ['Total RNA was harvested using Rneasy mini plus kit (Qiagen). 2ug of the total RNA was used for the construction of sequencing libraries', 'Total RNA was extracted and prepared for cDNA libraries within Illumina TruSeq Stranded mRNA sample preparation kit']
data_processing: ['GLC Genomics Workbench v 11.0.1'

In [6]:
###Experiment with HER2

import numpy as np
import pandas as pd

# np.genfromtxt parses non-numeric cells (such as a header row or gene ID column) as nan
data_array = np.genfromtxt('GSE264264_raw_counts.csv', delimiter=',')
data_array

array([[       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan, 0.0000e+00, 1.0000e+00, ..., 0.0000e+00, 1.0000e+00,
               nan],
       [       nan, 2.3000e+02, 1.6700e+02, ..., 3.4200e+02, 2.1500e+02,
               nan],
       ...,
       [       nan, 1.3000e+01, 9.0000e+00, ..., 6.0000e+00, 9.0000e+00,
               nan],
       [       nan, 3.9100e+02, 3.8200e+02, ..., 2.9400e+02, 6.0200e+02,
               nan],
       [       nan, 3.3673e+04, 4.2540e+04, ..., 4.2507e+04, 3.6005e+04,
               nan]])

In [2]:
import pandas as pd
data_df=pd.read_csv("GSE183947_fpkm.csv")
data_df.head()

Unnamed: 0.1,Unnamed: 0,CA.102548,CA.104338,CA.105094,CA.109745,CA.1906415,CA.1912627,CA.1924346,CA.1926760,CA.1927842,...,CAP.2040686,CAP.2046297,CAP.2046641,CAP.348981,CAP.354300,CAP.359448,CAP.94377,CAP.98389,CAP.98475,CAP.99145
0,TSPAN6,0.93,1.97,0.0,5.45,4.52,4.75,3.96,3.58,6.41,...,6.66,8.35,8.94,6.33,5.94,6.35,3.74,4.84,10.46,4.54
1,TNMD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23,0.39,...,0.12,0.17,1.08,0.29,0.0,0.07,9.19,1.18,0.09,0.39
2,DPM1,0.0,0.43,0.0,3.43,8.45,8.53,7.8,7.62,6.4,...,4.93,7.47,5.72,4.96,9.28,9.15,4.77,3.75,7.31,2.77
3,SCYL3,5.78,5.17,8.76,4.58,7.2,6.03,9.05,5.37,5.92,...,8.02,6.0,5.28,4.98,4.45,7.0,4.14,5.51,7.45,2.33
4,C1orf112,2.83,6.26,3.37,6.24,5.16,13.69,6.69,5.28,7.65,...,7.91,4.61,8.35,9.84,7.68,5.62,2.81,7.08,7.28,5.39


In [3]:
# ahm analysis feature or expressive gene selection
len(data_df.columns)

61

* 30 --> CT
* 30 --> NCT

Test/Val/Train
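
One way to divide the 60 samples is a stratified split, so that each subset keeps the same CT to NCT balance. The sketch below assumes a 60/20/20 train/validation/test split. The sample names, the fractions and the function name are hypothetical choices, not values taken from the project.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample names into train/val/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for name, label in labels.items():
        by_class.setdefault(label, []).append(name)
    train, val, test = [], [], []
    for names in by_class.values():
        names = sorted(names)   # fixed starting order for reproducibility
        rng.shuffle(names)
        n_train = round(fractions[0] * len(names))
        n_val = round(fractions[1] * len(names))
        train += names[:n_train]
        val += names[n_train:n_train + n_val]
        test += names[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical sample names standing in for the 30 CT and 30 NCT columns
labels = {f"CT_{i}": "CT" for i in range(30)}
labels.update({f"NCT_{i}": "NCT" for i in range(30)})
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing gene selection methods on the same held-out samples.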

# Broader impact:

Beyond classification, HMM reveals the hidden states of cancer progression, distinguishing between
benign, pre-cancerous, and malignant phases with probabilistic precision. It uncovers key genetic
markers and regulatory pathways, highlighting genes whose expression shifts significantly as cancer
evolves.

# Terminologies

**Benign vs malignant:** Benign tumors are noncancerous, while malignant tumors are cancerous.

# References

### Genes involved in breast cancer:
https://pmc.ncbi.nlm.nih.gov/articles/PMC4478970/

### Types of breast cancer
https://www.cancer.org/cancer/types/breast-cancer/about/types-of-breast-cancer.html

### HMM and Cancer Classification

https://www.sciencedirect.com/science/article/pii/S0020025515002807

In [7]:
from Bio import Entrez
import subprocess

# Set your email address
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Define the SRA Experiment accession number
sra_experiment = "SRX26911921"

# Search for the SRA Experiment
search_handle = Entrez.esearch(db="sra", term=sra_experiment)
search_results = Entrez.read(search_handle)
search_handle.close()

# Check if any IDs were found
if not search_results['IdList']:
    print(f"No records found for {sra_experiment}")
else:
    # Fetch the SRA Run information
    sra_id = search_results['IdList'][0]
    fetch_handle = Entrez.efetch(db="sra", id=sra_id, rettype="runinfo", retmode="text")
    run_info = fetch_handle.read()
    fetch_handle.close()

    # efetch may return bytes or str depending on the Biopython version
    run_info_str = run_info.decode('utf-8') if isinstance(run_info, bytes) else run_info

    # Split the string into lines
    lines = run_info_str.splitlines()
    if len(lines) > 1:
        header = lines[0].split(',')
        run_index = header.index('Run')
        for line in lines[1:]:
            run_id = line.split(',')[run_index]
            print(f"Found Run ID: {run_id}")

            # Use fasterq-dump to download and convert the SRA file to FASTQ format
            try:
                subprocess.run(["fasterq-dump", run_id], check=True)
                print(f"Successfully downloaded and converted {run_id} to FASTQ format")
            except subprocess.CalledProcessError as e:
                print(f"Error during download and conversion: {e}")
    else:
        print(f"No run information found for {sra_experiment}")

Found Run ID: SRR31545697


FileNotFoundError: [Errno 2] No such file or directory: 'fasterq-dump'
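
The `FileNotFoundError` above means the `fasterq-dump` executable, which ships with the SRA Toolkit, was not found on the system `PATH`. The `try`/`except` in the cell only catches `subprocess.CalledProcessError`, so a missing tool is not handled. A minimal sketch of a guard is shown below; the helper name `run_if_available` is hypothetical.

```python
import shutil
import subprocess


def run_if_available(command):
    """Run an external command only if its executable is on PATH.

    Returns True on success, False if the tool is missing or exits with an error.
    """
    executable = command[0]
    if shutil.which(executable) is None:
        print(f"'{executable}' was not found on PATH; install it and retry.")
        return False
    try:
        subprocess.run(command, check=True)
    except subprocess.CalledProcessError as err:
        print(f"{executable} failed: {err}")
        return False
    return True


# Usage with the run ID found above (requires the SRA Toolkit to be installed):
# run_if_available(["fasterq-dump", "SRR31545697"])
```

Checking with `shutil.which` first gives a clear message instead of a traceback when the toolkit is not installed.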

In [11]:
from Bio import Entrez
import ssl
import Bio
import xml.etree.ElementTree as ET
# Disable SSL verification (for development purposes)
ssl._create_default_https_context = ssl._create_unverified_context
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Search for DCIS-related sequencing records in SRA
#Type: DCIS
query = "DCIS[Title] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="sra", term=query)  # "sra" is the Sequence Read Archive database
record = Entrez.read(handle)
handle.close()

# Get the matching SRA record IDs (the variable name is kept from the GEO cell)
geo_ids = record["IdList"]
print("Found GEO datasets:", geo_ids)

Found GEO datasets: ['23625149', '23624959', '23624931', '23624737', '23624736', '23624735', '23624734', '23624733', '23624731', '23624730', '23624729', '23624728', '23624727', '23624726', '23624725', '23624724', '23624723', '23624722', '23624721', '23624720']


In [13]:
from Bio import Entrez
import ssl

# Disable SSL verification (for development purposes)
ssl._create_default_https_context = ssl._create_unverified_context

# Set your email address
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Search for DCIS-related studies in the SRA database
query = "DCIS[Title] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="sra", term=query)
record = Entrez.read(handle)
handle.close()

# Get SRA dataset IDs
sra_ids = record["IdList"]
print("Found SRA datasets:", sra_ids)

# Fetch and parse each SRA dataset summary
for sra_id in sra_ids:
    print(f"\nFetching SRA dataset with ID: {sra_id}")
    fetch_handle = Entrez.esummary(db="sra", id=sra_id, rettype="xml")
    data = Entrez.read(fetch_handle)
    fetch_handle.close()
    
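    # Extract and display relevant information.
    # Note: SRA summary records typically expose fields such as "ExpXml" and "Runs"
    # rather than "Title", "Study", or "Design", so the defaults below are printed.
    dataset = data[0]
    title = dataset.get("Title", "No title available")
    study = dataset.get("Study", "No study information available")
    design = dataset.get("Design", "No design information available")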
    print(f"Title: {title}")
    print(f"Study: {study}")
    print(f"Design: {design}")


Found SRA datasets: ['23625149', '23624959', '23624931', '23624737', '23624736', '23624735', '23624734', '23624733', '23624731', '23624730', '23624729', '23624728', '23624727', '23624726', '23624725', '23624724', '23624723', '23624722', '23624721', '23624720']

Fetching SRA dataset with ID: 23625149
Title: No title available
Study: No study information available
Design: No design information available

Fetching SRA dataset with ID: 23624959
Title: No title available
Study: No study information available
Design: No design information available

Fetching SRA dataset with ID: 23624931
Title: No title available
Study: No study information available
Design: No design information available

Fetching SRA dataset with ID: 23624737
Title: No title available
Study: No study information available
Design: No design information available

Fetching SRA dataset with ID: 23624736
Title: No title available
Study: No study information available
Design: No design information available

Fetching SRA datas
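
Every record above falls back to the default text, which suggests the SRA summary does not carry `Title`, `Study`, or `Design` as top-level keys. A minimal sketch follows, assuming the details sit in an `ExpXml` field holding an XML fragment with a `Summary/Title` element and a `Study` element with a `name` attribute. The helper `parse_expxml` and the example fragment are hypothetical and only illustrate the parsing step; no request is sent to NCBI.

```python
import xml.etree.ElementTree as ET


def parse_expxml(fragment):
    """Parse an ExpXml-style fragment into a small dict (assumed layout)."""
    # The fragment has several top-level elements, so wrap it in one root.
    root = ET.fromstring(f"<root>{fragment}</root>")
    study = root.find("Study")
    return {
        "title": root.findtext("Summary/Title", default="No title available"),
        "study": (
            study.get("name", "No study information available")
            if study is not None
            else "No study information available"
        ),
    }


# Hypothetical fragment for illustration; real records will differ.
example = (
    "<Summary><Title>Example DCIS RNA-seq run</Title></Summary>"
    '<Study acc="SRP000000" name="Example DCIS study"/>'
)
info = parse_expxml(example)
print(info)
```

In the loop above, this would be called as `parse_expxml(dataset["ExpXml"])`, provided that key is present in the returned summary.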

In [14]:
import os
import time
from Bio import Entrez

# Set your email for NCBI Entrez
Entrez.email = "your_email@example.com"

# List of GEO dataset IDs related to invasive vs non-invasive breast cancer
dataset_ids = ["GSE1456", "GSE15852", "GSE35023"]  # Add more GEO IDs if needed

# Directory to save downloaded files
download_dir = "geo_datasets"
os.makedirs(download_dir, exist_ok=True)


def fetch_geo_data(geo_id):
    """Download GEO dataset metadata and raw data."""
    try:
        print(f"Fetching data for {geo_id}...")
        
        # Search GEO dataset
        handle = Entrez.esearch(db="gds", term=geo_id)
        record = Entrez.read(handle)
        handle.close()
        
        if not record["IdList"]:
            print(f"No records found for {geo_id}")
            return
        
        geo_uid = record["IdList"][0]
        
        # Fetch full dataset summary
        handle = Entrez.esummary(db="gds", id=geo_uid)
        summary = Entrez.read(handle)
        handle.close()
        
        print(f"Dataset Title: {summary[0]['title']}")
        
        # Fetch dataset file links
        handle = Entrez.efetch(db="gds", id=geo_uid, rettype="full", retmode="text")
        geo_data = handle.read()
        handle.close()
        
        # Save metadata
        metadata_file = os.path.join(download_dir, f"{geo_id}_metadata.txt")
        with open(metadata_file, "w") as f:
            f.write(geo_data)
        print(f"Metadata saved: {metadata_file}")
        
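        # Download the series matrix file for this GEO series
        ftp_link = f"https://ftp.ncbi.nlm.nih.gov/geo/series/{geo_id[:-3]}nnn/{geo_id}/matrix/{geo_id}_series_matrix.txt.gz"
        raw_data_file = os.path.join(download_dir, f"{geo_id}_series_matrix.txt.gz")
        # os.system returns wget's exit status without raising, so the message
        # below is printed even when the download fails (e.g. HTTP 404)
        os.system(f"wget -O {raw_data_file} {ftp_link}")
        print(f"Downloaded raw data: {raw_data_file}")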
    except Exception as e:
        print(f"Error fetching {geo_id}: {e}")
    
    # Wait between requests to avoid getting blocked by NCBI
    time.sleep(3)


# Loop through datasets and download them
for geo_id in dataset_ids:
    fetch_geo_data(geo_id)

print("All downloads completed.")


Fetching data for GSE1456...
Dataset Title: Two New Stromal Signatures Stratify Breast Cancers with Different Prognosis
Metadata saved: geo_datasets/GSE1456_metadata.txt


--2025-02-08 13:38:39--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1456/matrix/GSE1456_series_matrix.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::7, 2607:f220:41e:250::31, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::7|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-02-08 13:38:39 ERROR 404: Not Found.



Downloaded raw data: geo_datasets/GSE1456_series_matrix.txt.gz
Fetching data for GSE15852...
Dataset Title: Expression data from human breast tumors and their paired normal tissues
Metadata saved: geo_datasets/GSE15852_metadata.txt


--2025-02-08 13:38:44--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15852/matrix/GSE15852_series_matrix.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::31, 2607:f220:41e:250::11, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6920473 (6.6M) [application/x-gzip]
Saving to: ‘geo_datasets/GSE15852_series_matrix.txt.gz’

     0K .......... .......... .......... .......... ..........  0%  288K 23s
    50K .......... .......... .......... .......... ..........  1%  629K 17s
   100K .......... .......... .......... .......... ..........  2% 51.0M 11s
   150K .......... .......... .......... .......... ..........  2% 6.07M 9s
   200K .......... .......... .......... .......... ..........  3%  667K 9s
   250K .......... .......... .......... .......... ..........  4% 17.0M 7s
   300K .......... .......... .......... ...

Downloaded raw data: geo_datasets/GSE15852_series_matrix.txt.gz
Fetching data for GSE35023...
Dataset Title: Differential allelic expression in normal breast tissue
Metadata saved: geo_datasets/GSE35023_metadata.txt


--2025-02-08 13:38:50--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE35nnn/GSE35023/matrix/GSE35023_series_matrix.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::11, 2607:f220:41e:250::12, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20888506 (20M) [application/x-gzip]
Saving to: ‘geo_datasets/GSE35023_series_matrix.txt.gz’

     0K .......... .......... .......... .......... ..........  0%  281K 73s
    50K .......... .......... .......... .......... ..........  0%  552K 55s
   100K .......... .......... .......... .......... ..........  0%  129M 36s
   150K .......... .......... .......... .......... ..........  0%  587K 36s
   200K .......... .......... .......... .......... ..........  1%  148M 29s
   250K .......... .......... .......... .......... ..........  1% 48.3M 24s
   300K .......... .......... .......... 

Downloaded raw data: geo_datasets/GSE35023_series_matrix.txt.gz
All downloads completed.
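
In the log above, the request for GSE1456 returns 404, yet the "Downloaded raw data" line is still printed. One possible cause of the 404 is that the matrix file for that series is stored under a different name than the pattern assumes. The sketch below keeps the same URL pattern but reports failure explicitly. The helper names `series_matrix_url` and `download` are hypothetical, and the standard-library `urllib` is swapped in for `wget`.

```python
import shutil
import urllib.error
import urllib.request

GEO_BASE = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def series_matrix_url(geo_id):
    """Build the series-matrix URL with the same pattern as the cell above."""
    return f"{GEO_BASE}/{geo_id[:-3]}nnn/{geo_id}/matrix/{geo_id}_series_matrix.txt.gz"


def download(url, dest):
    """Save url to dest; return False instead of claiming success on an HTTP error."""
    try:
        with urllib.request.urlopen(url) as response:
            with open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
    except urllib.error.HTTPError as err:
        print(f"Download failed ({err.code}): {url}")
        return False
    return True


# Only the URL is built here; call download(url, path) to fetch the file.
url = series_matrix_url("GSE1456")
print(url)
```

Because `urlopen` raises before the destination file is opened, a failed request leaves no empty file behind.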


In [15]:
import pandas as pd
data_df = pd.read_csv("GSE183947_fpkm.csv")
data_df.head()

|   | Unnamed: 0 | CA.102548 | CA.104338 | CA.105094 | CA.109745 | CA.1906415 | CA.1912627 | CA.1924346 | CA.1926760 | CA.1927842 | ... | CAP.2040686 | CAP.2046297 | CAP.2046641 | CAP.348981 | CAP.354300 | CAP.359448 | CAP.94377 | CAP.98389 | CAP.98475 | CAP.99145 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TSPAN6 | 0.93 | 1.97 | 0.0 | 5.45 | 4.52 | 4.75 | 3.96 | 3.58 | 6.41 | ... | 6.66 | 8.35 | 8.94 | 6.33 | 5.94 | 6.35 | 3.74 | 4.84 | 10.46 | 4.54 |
| 1 | TNMD | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.23 | 0.39 | ... | 0.12 | 0.17 | 1.08 | 0.29 | 0.0 | 0.07 | 9.19 | 1.18 | 0.09 | 0.39 |
| 2 | DPM1 | 0.0 | 0.43 | 0.0 | 3.43 | 8.45 | 8.53 | 7.8 | 7.62 | 6.4 | ... | 4.93 | 7.47 | 5.72 | 4.96 | 9.28 | 9.15 | 4.77 | 3.75 | 7.31 | 2.77 |
| 3 | SCYL3 | 5.78 | 5.17 | 8.76 | 4.58 | 7.2 | 6.03 | 9.05 | 5.37 | 5.92 | ... | 8.02 | 6.0 | 5.28 | 4.98 | 4.45 | 7.0 | 4.14 | 5.51 | 7.45 | 2.33 |
| 4 | C1orf112 | 2.83 | 6.26 | 3.37 | 6.24 | 5.16 | 13.69 | 6.69 | 5.28 | 7.65 | ... | 7.91 | 4.61 | 8.35 | 9.84 | 7.68 | 5.62 | 2.81 | 7.08 | 7.28 | 5.39 |


In [16]:
data_df.columns

Index(['Unnamed: 0', 'CA.102548', 'CA.104338', 'CA.105094', 'CA.109745',
       'CA.1906415', 'CA.1912627', 'CA.1924346', 'CA.1926760', 'CA.1927842',
       'CA.1933414', 'CA.1940640', 'CA.2004407', 'CA.2005288', 'CA.2006047',
       'CA.2008260', 'CA.2009329', 'CA.2009381', 'CA.2009850', 'CA.2017611',
       'CA.2039179', 'CA.2040686', 'CA.2045012', 'CA.2046297', 'CA.348981',
       'CA.354300', 'CA.359448', 'CA.94377', 'CA.98389', 'CA.98475',
       'CA.99145', 'CAP.102548', 'CAP.104338', 'CAP.105094', 'CAP.109745',
       'CAP.1906415', 'CAP.1912627', 'CAP.1924346', 'CAP.1926760',
       'CAP.1927842', 'CAP.1933414', 'CAP.1940640', 'CAP.2004407',
       'CAP.2005288', 'CAP.2006047', 'CAP.2008260', 'CAP.2009329',
       'CAP.2009381', 'CAP.2009850', 'CAP.2017611', 'CAP.2039179',
       'CAP.2040686', 'CAP.2046297', 'CAP.2046641', 'CAP.348981', 'CAP.354300',
       'CAP.359448', 'CAP.94377', 'CAP.98389', 'CAP.98475', 'CAP.99145'],
      dtype='object')
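
The column listing shows gene symbols in `Unnamed: 0` and two families of sample columns, prefixed `CA.` and `CAP.`. What the two prefixes denote is not stated here, so the sketch below only groups columns by prefix. The helper `split_by_prefix` is hypothetical, and the toy frame reuses a few values from the `head()` output above so the example runs without the CSV file.

```python
import pandas as pd

# Toy frame with values copied from the head() output (two genes, four samples).
toy = pd.DataFrame({
    "Unnamed: 0": ["TSPAN6", "TNMD"],
    "CA.102548": [0.93, 0.0],
    "CA.104338": [1.97, 0.0],
    "CAP.98475": [10.46, 0.09],
    "CAP.99145": [4.54, 0.39],
})


def split_by_prefix(df, gene_col="Unnamed: 0"):
    """Index by gene symbol and return (CA.* columns, CAP.* columns)."""
    expr = df.rename(columns={gene_col: "gene"}).set_index("gene")
    ca_cols = [c for c in expr.columns if c.startswith("CA.")]
    cap_cols = [c for c in expr.columns if c.startswith("CAP.")]
    return expr[ca_cols], expr[cap_cols]


ca, cap = split_by_prefix(toy)
# Per-gene mean within each group of samples
print(ca.mean(axis=1))
print(cap.mean(axis=1))
```

The same call should work on `data_df` as loaded above, since the trailing dot in `"CA."` keeps `CAP.` columns out of the first group.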

In [25]:
from Bio.Affy import CelFile

with open('GSM1116238_BC16_ARN0137_s1h1s1_U133p2.CEL', 'rb') as f:
    parser = CelFile.Parser()
    data = parser.parse(f)

print(data)  # Print to check the parsed CEL file


AttributeError: module 'Bio.Affy.CelFile' has no attribute 'Parser'

In [30]:
!pip install pyaffy

Collecting pyaffy
  Using cached pyaffy-0.3.2.tar.gz (26 kB)
  Preparing metadata (setup.py) ... done
Collecting future<1,>=0.16 (from pyaffy)
  Using cached future-0.18.3.tar.gz (840 kB)
  Preparing metadata (setup.py) ... done
Collecting scipy<1,>=0.15.1 (from pyaffy)
  Using cached scipy-0.19.1.tar.gz (14.1 MB)
  Preparing metadata (setup.py) ... done
Collecting cython<1,>=0.23.4 (from pyaffy)
  Using cached Cython-0.29.37-py2.py3-none-any.whl.metadata (3.1 kB)
Collecting genometools<0.3,>=0.2 (from pyaffy)
  Using cached genometools-0.2.7-py3-none-any.whl.metadata (5.3 kB)
Collecting configparser<4,>=3.5 (from pyaffy)
  Using cached configparser-3.8.1-py2.py3-none-any.whl.metadata (10 kB)
Collecting ftputil<4,>=3.3.1 (from genometools<0.3,>=0.2->pyaffy)
  Using cached ftputil-3.4.tar.gz (141 kB)
  Preparing metadata (setup.py) ... done
Collecting pandas<1,>=0.18 (from genometools<0.3,>=0.2->pyaffy)
  Using cached pandas-0.25.3.tar.gz (12.6 MB)
  

In [34]:
import pyaffy

# Load CEL file
cel_data = pyaffy.CelFile('GSM1116238_BC16_ARN0137_s1h1s1_U133p2.CEL')

# Print header information
print("Header Info:")
print(cel_data.header)

# Extract and print probe intensities
print("\nFirst 10 Probe Intensities:")
for probe in list(cel_data.data)[:10]:  # Extract first 10 probes
    print(probe)


ModuleNotFoundError: No module named 'pyaffy'

In [5]:
from Bio.Affy import CelFile

# Define the path to your .CEL file
cel_file_path = 'GSM519443.CEL'

# Open and parse the .CEL file
with open(cel_file_path, 'r') as file:
    cel_data = CelFile.read(file)

In [6]:
cel_data

<Bio.Affy.CelFile.Record at 0x16feb74c0>

In [8]:
print(dir(cel_data))

['Algorithm', 'AlgorithmParameters', 'DatHeader', 'GridCornerLL', 'GridCornerLR', 'GridCornerUL', 'GridCornerUR', 'NumberCells', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'intensities', 'mask', 'modified', 'ncols', 'nmask', 'nmodified', 'noutliers', 'npix', 'nrows', 'outliers', 'stdevs', 'version']
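
The attribute listing includes `intensities`, `stdevs`, `npix`, `nrows`, and `ncols`. A minimal sketch of a first sanity check on the parsed record follows, assuming `intensities` is a two-dimensional numeric array. The helper `summarize_intensities` is hypothetical, and a small stand-in object is used in place of a real CEL record so the example runs without the file.

```python
from types import SimpleNamespace

import numpy as np


def summarize_intensities(record):
    """Return the shape and basic statistics of record.intensities."""
    values = np.asarray(record.intensities, dtype=float)
    return {
        "shape": values.shape,
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
    }


# Hypothetical stand-in for a parsed record; replace with cel_data.
fake_record = SimpleNamespace(intensities=np.array([[100.0, 200.0], [300.0, 400.0]]))
summary = summarize_intensities(fake_record)
print(summary)
```

Calling `summarize_intensities(cel_data)` on the record parsed above would show whether the array dimensions agree with `nrows` and `ncols`.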


In [2]:
import requests
import json

# GDC API endpoint
GDC_API = "https://api.gdc.cancer.gov/files"

# Define query parameters
params = {
    "filters": {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]}}
        ]
    },
    "format": "json",
    "size": 10  # Limit results
}

# Send request (POST with a JSON body) and stop early on an HTTP error
response = requests.post(GDC_API, json=params)
response.raise_for_status()

# Print JSON structure
data = response.json()
print(json.dumps(data, indent=2))


{
  "data": {
    "hits": [
      {
        "id": "27016c1c-f696-4a54-8c4a-6f9ba8a1c9e3",
        "data_format": "VCF",
        "access": "controlled",
        "file_name": "TCGA_BRCA.113f7fb4-ecf0-438e-8df1-2bbcd0f0e575.wxs.MuSE.somatic_annotation.vcf.gz",
        "submitter_id": "6df4379a-0b00-4725-9430-ea4d082e77d3",
        "data_category": "Simple Nucleotide Variation",
        "acl": [
          "phs000178"
        ],
        "type": "annotated_somatic_mutation",
        "platform": "Illumina",
        "file_size": 159679,
        "created_datetime": "2022-02-07T12:25:44.243581-06:00",
        "md5sum": "19069cea20586c18eec3dd921c9cb4c6",
        "updated_datetime": "2024-07-30T19:12:34.673297-05:00",
        "file_id": "27016c1c-f696-4a54-8c4a-6f9ba8a1c9e3",
        "data_type": "Annotated Somatic Mutation",
        "state": "released",
        "experimental_strategy": "WXS",
        "version": "2",
        "data_release": "32.0 - 42.0"
      },
      {
        "id": "776f51f8-6

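The first hit in the response is marked `"access": "controlled"` and the response carries every default field. A sketch of a narrower query follows, assuming the endpoint accepts an additional `access` filter and a comma-separated `fields` string in the same JSON body. The helper `build_files_query` is hypothetical, and the example only builds and prints the payload without sending it.

```python
import json


def build_files_query(project_id, access=None, size=10, fields=None):
    """Build a JSON payload for the files endpoint (no request is sent)."""
    content = [
        {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}}
    ]
    if access is not None:
        # Optional extra filter, e.g. "open" or "controlled"
        content.append({"op": "in", "content": {"field": "access", "value": [access]}})
    payload = {
        "filters": {"op": "and", "content": content},
        "format": "json",
        "size": size,
    }
    if fields:
        payload["fields"] = ",".join(fields)
    return payload


payload = build_files_query(
    "TCGA-BRCA",
    access="open",
    fields=["file_id", "file_name", "data_category"],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` to the same `requests.post` call used in the cell above.
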
In [5]:
!pip install ace-tools

Collecting ace-tools
  Downloading ace_tools-0.0-py3-none-any.whl.metadata (300 bytes)
Downloading ace_tools-0.0-py3-none-any.whl (1.1 kB)
Installing collected packages: ace-tools
Successfully installed ace-tools-0.0


In [None]:
import pandas as pd
import ace_tools as tools
# Sample JSON response (Replace with your actual data)
data = {
    "data": {
        "hits": [
            {
                "file_id": "46be7b4e-9297-41fd-ba04-99235fc30723",
                "data_format": "IDAT",
                "access": "open",
                "file_name": "b7434746-57e5-4c90-87cc-fb384b12bcdf_noid_Red.idat",
                "data_category": "DNA Methylation",
                "type": "masked_methylation_array",
                "platform": "Illumina Human Methylation 450",
                "experimental_strategy": "Methylation Array"
            }
        ]
    }
}

# Extract files related to Illumina genotyping/methylation arrays
illumina_files = [
    {
        "file_id": f["file_id"],
        "file_name": f["file_name"],
        "data_format": f["data_format"],
        "data_category": f["data_category"],
        "type": f["type"],
        "platform": f["platform"],
        "experimental_strategy": f["experimental_strategy"],
        "access": f["access"]
    }
    for f in data["data"]["hits"]
    if "Illumina" in f["platform"]
]

# Convert to DataFrame and display
df_illumina = pd.DataFrame(illumina_files)


tools.display_dataframe_to_user(name="Illumina Array Files", dataframe=df_illumina)


ModuleNotFoundError: No module named 'ace_tools'
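
The error above shows that `ace_tools` cannot be imported in this kernel. The installed `ace-tools` wheel is version 0.0 and about 1.1 kB, so it may not provide `display_dataframe_to_user` at all. A sketch using only pandas follows. It reuses the sample hit from the cell above, and reads keys with `.get()` on the assumption that some hits may lack a `platform` entry.

```python
import pandas as pd

# Sample hit copied from the cell above (replace with the real API response).
hits = [
    {
        "file_id": "46be7b4e-9297-41fd-ba04-99235fc30723",
        "data_format": "IDAT",
        "access": "open",
        "file_name": "b7434746-57e5-4c90-87cc-fb384b12bcdf_noid_Red.idat",
        "data_category": "DNA Methylation",
        "type": "masked_methylation_array",
        "platform": "Illumina Human Methylation 450",
        "experimental_strategy": "Methylation Array",
    }
]

wanted = [
    "file_id", "file_name", "data_format", "data_category",
    "type", "platform", "experimental_strategy", "access",
]

# Keep hits whose platform mentions Illumina; missing keys become None.
rows = [
    {key: hit.get(key) for key in wanted}
    for hit in hits
    if "Illumina" in hit.get("platform", "")
]
df_illumina = pd.DataFrame(rows, columns=wanted)
print(df_illumina.to_string(index=False))
```

In a notebook, leaving `df_illumina` as the last expression of the cell renders the same table without any extra package.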