# Cancer Classification for Various Lung-Related Genetic Mutations Using HMM

## Goal:
The purpose of this project is to use Hidden Markov Models (HMMs) to classify cancer from gene expression profiles. Each tumor type is modeled with its own HMM, so that different types of cancer can be told apart by which model best explains a sample.
The approach integrates gene ranking methods with HMMs. It first selects the most informative genes using five ranking methods: t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon test, and signal-to-noise ratio.

1. Understanding Gene Expression & Why It Matters in Cancer Classification

**What is Gene Expression?**

Every cell in your body contains the same DNA, but different genes are turned on or off depending on the cell type and its function. The process of turning genes on/off is called gene expression.

**Why is Gene Expression Important in Cancer?**

Cancer occurs when specific genes (oncogenes) become overactive or when tumor suppressor genes stop functioning properly. Identifying which genes behave abnormally in cancer cells allows researchers to classify cancer types and potentially guide treatments.

2. Gene Selection: Choosing the Most Important Genes

**Problem: Too Many Genes, Not All Relevant**

* A DNA microarray (gene chip) can measure the expression levels of thousands of genes at once, but not all of them are useful for classification.
* If we include all genes, the model becomes too complex and inefficient.
* The goal is to filter out irrelevant genes and identify the most informative ones.

**Solution: Gene Selection Methods**

The article introduces several ranking techniques to select key genes that differentiate between cancerous and normal tissues. These techniques include:

**t-test:** Finds genes that show significant differences in expression levels between cancerous and normal cells.

**Entropy test:** Measures disorder in gene expression; genes with high entropy provide better class separation.

**Receiver Operating Characteristic (ROC) Curve:** Selects genes with strong discriminatory power.

**Wilcoxon test:** A non-parametric, rank-based test that ranks genes by how strongly their expression values differ between the two classes.

**Signal-to-Noise Ratio (SNR):** Compares the difference in mean expression levels between the classes with the standard deviations within each class.
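
To make the ranking idea concrete, here is a minimal sketch of three of the scores above on toy data. It assumes the common definitions (SNR as the difference of class means divided by the sum of class standard deviations, Welch's form of the t statistic, and ROC AUC computed by pairwise comparison). The gene names, expression values, and function names are hypothetical and are not taken from the article.

```python
import math
from statistics import mean, stdev

def snr_score(a, b):
    """Signal-to-noise ratio: difference of class means over the sum of class standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def welch_t(a, b):
    """Welch's t statistic (does not assume equal variances)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

def auc_separation(a, b):
    """Distance of the ROC AUC from 0.5 (0 = no separation, 0.5 = perfect separation)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return abs(wins / (len(a) * len(b)) - 0.5)

def rank_genes(expr, scorer):
    """Sort gene names by decreasing absolute score.

    expr maps gene name -> (tumor values, normal values).
    """
    return sorted(expr, key=lambda gene: abs(scorer(*expr[gene])), reverse=True)

# Hypothetical toy data: four tumor and four normal samples per gene.
toy_expr = {
    "GENE_A": ([8.1, 7.9, 8.4, 8.0], [5.0, 5.2, 4.9, 5.1]),
    "GENE_B": ([6.0, 6.5, 5.5, 6.2], [6.1, 5.9, 6.3, 6.0]),
    "GENE_C": ([3.0, 3.4, 2.8, 3.1], [4.0, 4.5, 3.9, 4.2]),
}

snr_ranking = rank_genes(toy_expr, snr_score)
t_ranking = rank_genes(toy_expr, welch_t)
print(snr_ranking)
print(t_ranking)
```

Genes whose class means are far apart relative to their spread (here `GENE_A`) rise to the top, while genes with overlapping distributions (`GENE_B`) sink to the bottom.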

## Modified Analytic Hierarchy Process (AHP) for Gene Selection

Traditional AHP is a decision-making method used for prioritizing factors based on expert judgment.
The authors modified AHP to integrate the rankings from the above five methods automatically.
Instead of relying on human experts, this method uses statistical rankings to create a robust and stable set of key genes for classification.
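
The idea of merging several rankings into one can be illustrated with a much simpler stand-in. The sketch below uses plain mean-rank aggregation, which is not the authors' modified AHP; it only shows how per-method rankings can be combined without expert input. The function name and the example rankings are hypothetical.

```python
def aggregate_rankings(rankings):
    """Combine ranked gene lists (best first) by average rank position.

    Assumes every list contains the same genes. Ties are broken alphabetically.
    """
    genes = rankings[0]
    total = {gene: 0 for gene in genes}
    for ranking in rankings:
        for position, gene in enumerate(ranking, start=1):
            total[gene] += position
    return sorted(genes, key=lambda gene: (total[gene] / len(rankings), gene))

# Hypothetical rankings produced by three different scoring methods.
rankings = [
    ["GENE_A", "GENE_C", "GENE_B"],
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_C", "GENE_A", "GENE_B"],
]

consensus = aggregate_rankings(rankings)
print(consensus)
```

A gene that ranks consistently well across methods ends up ahead of one that a single method favors, which is the stability property the combined ranking is meant to provide.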

3. Using Hidden Markov Models (HMMs) for Cancer Classification

**Why HMMs?**

* Cancer develops in stages, similar to how states change in an HMM.
* HMMs are well suited to analyzing sequential patterns and capturing the transitions between different gene expression states.

**How HMMs Work in Cancer Classification**

* Each cancer type gets its own HMM.
* If we are classifying between normal and cancerous tissues, we train two separate HMMs:
    * One for normal gene expression patterns.
    * One for cancer gene expression patterns.

### Training the HMM
The selected genes from the AHP method serve as input features.
The model learns probability distributions for these genes from labeled training data.

### Classifying a New Sample

When a new patient’s gene expression data is provided, it is scored against both trained HMMs.
The HMM that assigns the higher likelihood to the observed gene expression determines the classification (cancer or normal).
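
The classification rule can be sketched with the forward algorithm for a discrete HMM. This is an illustration under stated assumptions, not the article's implementation: expression values of the selected genes are assumed to be discretized to 0 (low) or 1 (high), each model has two hidden states, and all probabilities below are made-up placeholders rather than trained parameters.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | model) for a discrete HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)                 # P(obs[t] | obs[:t])
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob

def classify(obs, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))

# Hypothetical parameters: (start, transition, emission) per class.
models = {
    "normal": ([0.8, 0.2], [[0.9, 0.1], [0.3, 0.7]], [[0.9, 0.1], [0.4, 0.6]]),
    "cancer": ([0.2, 0.8], [[0.6, 0.4], [0.1, 0.9]], [[0.7, 0.3], [0.1, 0.9]]),
}

sample = [1, 1, 0, 1, 1]  # discretized expression of five selected genes
print(classify(sample, models))
```

In practice the parameters would be learned from labeled training data (for example with Baum-Welch), and continuous expression values would typically call for continuous emission distributions instead of the two-symbol alphabet used here.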

4. Performance Evaluation and Results


# Data Collection


The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine at the National Institutes of Health (NIH). It provides access to biological information and data to advance science and health.

What does NCBI do?

* Maintains databases of biological information, including DNA sequences, genes, proteins, and more 
* Provides tools for analyzing biological data 
* Creates resources to help researchers understand the relationship between genes and health 
* Produces resources to help researchers understand how pathogens spread and how to prevent foodborne disease 
* Creates resources to help researchers understand how diseases affect the body at the molecular and cellular level 

What resources does NCBI provide?
* GenBank: A database of publicly available DNA sequences 
* PubMed: A database of citations and abstracts for published life science journals 
* Entrez: A database retrieval system that integrates data from multiple databases 
* BLAST: A tool for searching for local alignments in biological sequences 
* ClinicalTrials.gov: A database of clinical studies funded by the public and private sectors 
* NCBI Bookshelf: A collection of books that cover topics like molecular biology, genetics, and disease states 

In [3]:
from Bio import Entrez
import ssl
import Bio
import xml.etree.ElementTree as ET
# Disable SSL verification (for development purposes)
ssl._create_default_https_context = ssl._create_unverified_context
Entrez.email = "ashkan.nikfarjam@sjsu.edu"

# Search GEO for human gene expression studies with BRCA1 in the title
query = "BRCA1[Title] AND Homo sapiens[Organism]"
# "gds" is the GEO DataSets database; esearch returns only the first 20 IDs by default
handle = Entrez.esearch(db="gds", term=query)
record = Entrez.read(handle)
handle.close()

# Get GEO dataset IDs
geo_ids = record["IdList"]
print("Found GEO datasets:", geo_ids)

Found GEO datasets: ['200277160', '200268963', '200249247', '200249246', '200249245', '200246226', '200246225', '200246224', '200229874', '200229005', '200223886', '200244027', '200239907', '200243966', '200243963', '200243956', '200237142', '200234017', '200202723', '200186733']


In [5]:
record

{'Count': '1375', 'RetMax': '20', 'RetStart': '0', 'IdList': ['200277160', '200268963', '200249247', '200249246', '200249245', '200246226', '200246225', '200246224', '200229874', '200229005', '200223886', '200244027', '200239907', '200243966', '200243963', '200243956', '200237142', '200234017', '200202723', '200186733'], 'TranslationSet': [{'From': 'Homo sapiens[Organism]', 'To': '"Homo sapiens"[Organism]'}], 'TranslationStack': [{'Term': 'BRCA1[Title]', 'Field': 'Title', 'Count': '1612', 'Explode': 'N'}, {'Term': '"Homo sapiens"[Organism]', 'Field': 'Organism', 'Count': '4157106', 'Explode': 'Y'}, 'AND'], 'QueryTranslation': 'BRCA1[Title] AND "Homo sapiens"[Organism]'}

In [7]:
# Fetch the summary record for each GEO dataset ID
data_dic = {"IDs": [], "record": []}
for geo_id in geo_ids:
    handle = Entrez.esummary(db="gds", id=geo_id)
    record_extracted = Entrez.read(handle)
    handle.close()
    data_dic["IDs"].append(geo_id)
    data_dic["record"].append(record_extracted)

In [10]:
print(data_dic)

{'IDs': ['200277160', '200268963', '200249247', '200249246', '200249245', '200246226', '200246225', '200246224', '200229874', '200229005', '200223886', '200244027', '200239907', '200243966', '200243963', '200243956', '200237142', '200234017', '200202723', '200186733'], 'record': [[{'Item': [], 'Id': '200277160', 'Accession': 'GSE277160', 'GDS': '', 'title': 'Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [RNA-seq]', 'summary': 'RNA interactome studies have revealed that hundreds of zinc finger proteins (ZFPs) are candidate RNA-binding proteins (RBPs), despite being annotated as DNA-interacting transcription factors. The RNA substrates of ZFPs and the functional significance of their potential roles as DNA- and RNA-binding proteins (DRBPs) remain largely uncharacterized. Here we present a systematic multi-omics analysis of the DNA and RNA binding targets and regulatory roles of more than 100 ZFPs representing 37 zinc finger families. We show tha

In [12]:
import pandas as pd
df=pd.DataFrame(data_dic)
df.head()

| | IDs | record |
|---|---|---|
| 0 | 200277160 | [{'Item': [], 'Id': '200277160', 'Accession': ... |
| 1 | 200268963 | [{'Item': [], 'Id': '200268963', 'Accession': ... |
| 2 | 200249247 | [{'Item': [], 'Id': '200249247', 'Accession': ... |
| 3 | 200249246 | [{'Item': [], 'Id': '200249246', 'Accession': ... |
| 4 | 200249245 | [{'Item': [], 'Id': '200249245', 'Accession': ... |


In [19]:
def get_accession(record):
    # Each esummary result is a ListElement holding one dict-like summary,
    # so take the first item and read its 'Accession' field.
    return str(record[0]['Accession'])

accessions = list(map(get_accession, data_dic['record']))
accessions[:10]

In [14]:
type(data_dic['record'][0])

Bio.Entrez.Parser.ListElement
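
Because each record is a list wrapping a dict-like summary, the nested structure can be flattened into one row per dataset before building a DataFrame. The sketch below is a hedged illustration: `summaries_to_rows` is a hypothetical helper, and it assumes each summary exposes the `Accession` and `title` keys seen in the printed output above. The mock record uses plain lists and dicts in place of the Biopython objects.

```python
def summaries_to_rows(ids, records):
    """Flatten esummary results into a list of plain dicts, one per summary."""
    rows = []
    for uid, record in zip(ids, records):
        for summary in record:  # each record is list-like and holds dict-like summaries
            rows.append({
                "id": uid,
                "accession": str(summary.get("Accession", "")),
                "title": str(summary.get("title", "")),
            })
    return rows

# Mock input shaped like the output printed above (values copied from it).
mock_ids = ["200277160"]
mock_records = [[{
    "Item": [],
    "Id": "200277160",
    "Accession": "GSE277160",
    "title": "Integrated multi-omics analysis of zinc finger proteins "
             "uncovers roles in RNA regulation [RNA-seq]",
}]]

rows = summaries_to_rows(mock_ids, mock_records)
print(rows[0]["accession"])
```

With the real data, `summaries_to_rows(data_dic["IDs"], data_dic["record"])` could then be passed to `pd.DataFrame` to get one readable column per field instead of a single column of nested records.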

# Data Collection method:
I searched the NCBI GEO DataSets database (`gds`) through Entrez for human studies with BRCA1 in the title.
## Filter and wrapper methods:
* Filter methods rank all features by their goodness, scoring each gene's relation to the class label individually with a univariate metric.

# References

### Lung Cancer types:

https://cancer.org/cancer/types/lung-cancer/about/what-is.html

### HMM and Cancer Classification

https://www.sciencedirect.com/science/article/pii/S0020025515002807?casa_token=Jtoir6gnzQwAAAAA:2t-gmz2xFtE0Q5VdBsRz33maQgC2UomwfskesDtgtAoqlLeiLO92No2JC_tP6yd-qmwUXpbbI2Mt