## Sanyukta Chapagain
## Machine Learning in Bioinformatics
## Homework 2
## Programming 2 

#### In this project, we work with the BRCA_TCGA_PUB2015 dataset from The Cancer Genome Atlas (TCGA), which covers breast cancer patients. The dataset was obtained as a compressed file, brca_tcga_pub2015.tar.gz. After extraction, it provides multiple data types, including:
Mutation data (data_mutations.txt)

Clinical data (data_clinical_sample.txt)


In [3]:
import zipfile
import os

In [4]:
import tarfile
import os

# Path to the .tar.gz file
tarPath = "brca_tcga_pub2015.tar.gz"

# Open and extract
if tarfile.is_tarfile(tarPath):
    with tarfile.open(tarPath, "r:gz") as tar:
        tar.extractall("brca_tcga_pub2015")
        print(" Extracted .tar.gz successfully!")
else:
    print(" Not a valid tar.gz file.")


 Extracted .tar.gz successfully!
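
Before extracting an archive it can help to list its members first. The following is a minimal sketch on a small stand-in archive built on the fly; the helper name `list_and_extract`, the archive name `demo.tar.gz`, and its contents are hypothetical and are not part of the TCGA download.

```python
import io
import os
import tarfile
import tempfile


def list_and_extract(tar_path, dest):
    """List the archive members, extract them to dest, and return the names."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Build a tiny stand-in archive (hypothetical contents, not the TCGA data)
workdir = tempfile.mkdtemp()
demo_tar = os.path.join(workdir, "demo.tar.gz")
payload = b"Hugo_Symbol\tTumor_Sample_Barcode\n"
with tarfile.open(demo_tar, "w:gz") as tar:
    info = tarfile.TarInfo(name="demo/data_mutations.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

members = list_and_extract(demo_tar, os.path.join(workdir, "out"))
print(members)
```

Listing the members first makes it clear that the archive unpacks into a nested folder, which is why the data path used later repeats the folder name.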


In [5]:
# Check what's inside the extracted folder
unzippedPath = "brca_tcga_pub2015/brca_tcga_pub2015/"
os.listdir(unzippedPath)


['case_lists',
 'data_clinical_patient.txt',
 'data_clinical_sample.txt',
 'data_cna.txt',
 'data_cna_hg19.seg',
 'data_gistic_genes_amp.txt',
 'data_gistic_genes_del.txt',
 'data_linear_cna.txt',
 'data_methylation_hm27.txt',
 'data_methylation_hm450.txt',
 'data_mrna_agilent_microarray.txt',
 'data_mrna_agilent_microarray_zscores_ref_all_samples.txt',
 'data_mrna_agilent_microarray_zscores_ref_diploid_samples.txt',
 'data_mrna_seq_v2_rsem.txt',
 'data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt',
 'data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt',
 'data_mutations.txt',
 'data_mutsig.txt',
 'data_rppa.txt',
 'data_rppa_zscores.txt',
 'LICENSE',
 'meta_clinical_patient.txt',
 'meta_clinical_sample.txt',
 'meta_cna.txt',
 'meta_cna_hg19_seg.txt',
 'meta_linear_cna.txt',
 'meta_methylation_hm27.txt',
 'meta_methylation_hm450.txt',
 'meta_mrna_agilent_microarray.txt',
 'meta_mrna_agilent_microarray_zscores_ref_all_samples.txt',
 'meta_mrna_agilent_microarray_zscores_ref_diploid_sa

## File preprocessing


In [6]:
#Import libraries

#Basic data handling libraries
import pandas as pd 
import numpy as np
import os
import matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set(rc={'axes.facecolor':'lightblue'})

#Scikitlearn libraries
import sklearn
from sklearn import metrics
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, train_test_split, RepeatedStratifiedKFold, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.impute import KNNImputer
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


# Statistics import
from statistics import mean, stdev


# Importing statsmodels
import statsmodels.api as sm

# Importing 'variance_inflation_factor'
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing warnings
# ------------------
import warnings
warnings.filterwarnings('ignore')



In [7]:
dataPath = "brca_tcga_pub2015/brca_tcga_pub2015/"

In [8]:
# Load clinical data
df_clinicalsample = pd.read_csv(dataPath + "data_clinical_sample.txt", sep="\t", skiprows=4)
df_clinicalsample.replace('[Not Available]', np.nan, inplace=True)

In [9]:
df_clinicalsample.head()

Unnamed: 0,PATIENT_ID,SAMPLE_ID,OTHER_SAMPLE_ID,DAYS_TO_COLLECTION,IS_FFPE,OCT_EMBEDDED,PATHOLOGY_REPORT_FILE_NAME,SURGICAL_PROCEDURE_FIRST,FIRST_SURGICAL_PROCEDURE_OTHER,SURGERY_FOR_POSITIVE_MARGINS,...,METASTATIC_TUMOR_INDICATOR,PROJECT_CODE,TISSUE_SOURCE_SITE,TUMOR_TISSUE_SITE,CANCER_TYPE,CANCER_TYPE_DETAILED,ONCOTREE_CODE,SAMPLE_TYPE,SOMATIC_STATUS,TMB_NONSYNONYMOUS
0,TCGA-LQ-A4E4,TCGA-LQ-A4E4-01,8C8D4BD4-3AA9-4AD1-9715-2376EE540A0C,414,NO,True,TCGA-LQ-A4E4.C13D16C9-BAC7-4CA4-B062-D925466F9...,Modified Radical Mastectomy,,,...,,,LQ,Breast,Invasive Breast Carcinoma,Breast Invasive Lobular Carcinoma,ILC,Primary,Matched,2.3
1,TCGA-A2-A3KC,TCGA-A2-A3KC-01,8318A9A4-E78E-4DB8-97C7-2CC733D0C512,299,NO,True,TCGA-A2-A3KC.593AF241-8F84-4BA0-8878-7C1CE72A4...,Simple Mastectomy,,,...,NO,,A2,Breast,Invasive Breast Carcinoma,Breast Invasive Lobular Carcinoma,ILC,Primary,Matched,0.766667
2,TCGA-A2-A3KD,TCGA-A2-A3KD-01,401D075C-D443-4592-8C03-E675CAAD2B50,700,NO,True,TCGA-A2-A3KD.3E4717BA-E9AE-45F9-9696-F8928D9FC...,Lumpectomy,,Mastectomy NOS,...,NO,,A2,Breast,Invasive Breast Carcinoma,Invasive Breast Carcinoma,BRCA,Primary,Matched,0.266667
3,TCGA-A7-A0D9,TCGA-A7-A0D9-01,c144ae50-ed29-4e27-bbee-fa81e79ac7db,173,NO,False,TCGA-A7-A0D9.7D5763F7-7284-4AF5-BBFC-5D8B5CB7F...,Simple Mastectomy,,,...,,,A7,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,0.9
4,TCGA-A7-A0DA,TCGA-A7-A0DA-01,4f441e61-6bea-4a12-841d-def270804bbe,177,NO,False,TCGA-A7-A0DA.69AC5937-3FFD-40FB-9922-79DB3CED7...,Lumpectomy,,,...,,,A7,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,4.233333


##### The clinical dataset contains detailed information about each patient's tumor sample, including patient ID, sample type, surgery details, cancer subtype, and tumor mutational burden (TMB). This metadata is essential for linking clinical outcomes with mutation data and serves as the basis for our classification labels: Ductal vs. Lobular breast cancer.

## Load mutation data

In [10]:
# Load mutation data
df_mutations = pd.read_csv(dataPath + "data_mutations.txt", sep="\t")

In [11]:
# Preview
df_mutations.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,DOMAINS,MOTIF_SCORE_CHANGE,PolyPhen,ENSP,Amino_acids,CCDS,EA_MAF,Allele,cDNA_position,PUBMED
0,PTGER3,5733,genome.wustl.edu;unc.edu,GRCh37,1,71512366,71512366,+,"missense_variant,splice_region_variant",Missense_Mutation,...,"Transmembrane_helices:Tmhmm,Pfam_domain:PF0000...",,probably_damaging(0.997),ENSP00000349003,L/V,CCDS655.1,,C,1106/1943,
1,FLG,2312,genome.wustl.edu;unc.edu,GRCh37,1,152285981,152285981,+,missense_variant,Missense_Mutation,...,"Low_complexity_(Seg):Seg,PROSITE_profiles:PS50324",,probably_damaging(0.988),ENSP00000357789,R/W,CCDS30860.1,,A,1417/12747,
2,GPR52,9293,genome.wustl.edu,GRCh37,1,174417411,174417411,+,synonymous_variant,Silent,...,"Transmembrane_helices:Tmhmm,Prints_domain:PR00...",,,ENSP00000356658,I,CCDS30941.1,,A,200/1472,
3,SLC35F3,148641,genome.wustl.edu;unc.edu,GRCh37,1,234452419,234452419,+,synonymous_variant,Silent,...,Pfam_domain:PF06027,,,ENSP00000355577,S,CCDS1600.1,,T,1045/2891,
4,OR2T3,343173,genome.wustl.edu,GRCh37,1,248636826,248636826,+,missense_variant,Missense_Mutation,...,"Pfam_domain:PF00001,Pfam_domain:PF10320,PROSIT...",,benign(0.001),ENSP00000352604,R/C,CCDS31117.1,,T,200/1008,


#####  The mutation dataset contains detailed information about somatic mutations found in each tumor sample. Each row represents a specific mutation in a gene, including columns for gene name (Hugo_Symbol), mutation type (Variant_Classification), chromosome position, amino acid change, predicted effect (e.g., PolyPhen score), and more. 

## Extracting Patient ID

In [12]:
# Create PATIENT_ID by removing the last 3 characters (e.g., '-01')
df_mutations['PATIENT_ID'] = df_mutations['Tumor_Sample_Barcode'].apply(lambda x: x[:-3])
df_mutations.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,MOTIF_SCORE_CHANGE,PolyPhen,ENSP,Amino_acids,CCDS,EA_MAF,Allele,cDNA_position,PUBMED,PATIENT_ID
0,PTGER3,5733,genome.wustl.edu;unc.edu,GRCh37,1,71512366,71512366,+,"missense_variant,splice_region_variant",Missense_Mutation,...,,probably_damaging(0.997),ENSP00000349003,L/V,CCDS655.1,,C,1106/1943,,TCGA-B6-A0IG
1,FLG,2312,genome.wustl.edu;unc.edu,GRCh37,1,152285981,152285981,+,missense_variant,Missense_Mutation,...,,probably_damaging(0.988),ENSP00000357789,R/W,CCDS30860.1,,A,1417/12747,,TCGA-B6-A0IG
2,GPR52,9293,genome.wustl.edu,GRCh37,1,174417411,174417411,+,synonymous_variant,Silent,...,,,ENSP00000356658,I,CCDS30941.1,,A,200/1472,,TCGA-B6-A0IG
3,SLC35F3,148641,genome.wustl.edu;unc.edu,GRCh37,1,234452419,234452419,+,synonymous_variant,Silent,...,,,ENSP00000355577,S,CCDS1600.1,,T,1045/2891,,TCGA-B6-A0IG
4,OR2T3,343173,genome.wustl.edu,GRCh37,1,248636826,248636826,+,missense_variant,Missense_Mutation,...,,benign(0.001),ENSP00000352604,R/C,CCDS31117.1,,T,200/1008,,TCGA-B6-A0IG
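
Slicing off the last three characters works when every barcode ends in a two-digit sample code such as `-01`. As a hedged alternative, the sketch below keeps the first three dash-separated fields instead, which does not depend on the suffix length. The helper name `patient_id_from_barcode` is hypothetical, and the barcode with the longer `-01A` suffix is an assumed example, not a value from this dataset.

```python
def patient_id_from_barcode(barcode: str) -> str:
    """Return the first three dash-separated fields of a TCGA-style barcode."""
    return "-".join(barcode.split("-")[:3])


sample_barcodes = ["TCGA-B6-A0IG-01", "TCGA-A2-A3KC-01"]
patient_ids = [patient_id_from_barcode(b) for b in sample_barcodes]
print(patient_ids)  # ['TCGA-B6-A0IG', 'TCGA-A2-A3KC']

# A longer suffix (assumed example) still yields the same patient ID,
# whereas slicing with [:-3] would leave a trailing "-".
print(patient_id_from_barcode("TCGA-B6-A0IG-01A"))  # TCGA-B6-A0IG
```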


In [13]:
# Merge mutation and clinical data on PATIENT_ID
df_merged = pd.merge(df_mutations, df_clinicalsample, on='PATIENT_ID', how='left')
df_merged.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,METASTATIC_TUMOR_INDICATOR,PROJECT_CODE,TISSUE_SOURCE_SITE,TUMOR_TISSUE_SITE,CANCER_TYPE,CANCER_TYPE_DETAILED,ONCOTREE_CODE,SAMPLE_TYPE,SOMATIC_STATUS,TMB_NONSYNONYMOUS
0,PTGER3,5733,genome.wustl.edu;unc.edu,GRCh37,1,71512366,71512366,+,"missense_variant,splice_region_variant",Missense_Mutation,...,,,B6,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,1.366667
1,FLG,2312,genome.wustl.edu;unc.edu,GRCh37,1,152285981,152285981,+,missense_variant,Missense_Mutation,...,,,B6,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,1.366667
2,GPR52,9293,genome.wustl.edu,GRCh37,1,174417411,174417411,+,synonymous_variant,Silent,...,,,B6,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,1.366667
3,SLC35F3,148641,genome.wustl.edu;unc.edu,GRCh37,1,234452419,234452419,+,synonymous_variant,Silent,...,,,B6,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,1.366667
4,OR2T3,343173,genome.wustl.edu,GRCh37,1,248636826,248636826,+,missense_variant,Missense_Mutation,...,,,B6,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,IDC,Primary,Matched,1.366667
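
A left merge silently keeps mutation rows that have no clinical match and duplicates rows if a patient appears more than once in the clinical table. The sketch below shows one way to check both, using the `validate` and `indicator` arguments of `pd.merge` on toy data. The `TP53` row and the patient `TCGA-XX-0000` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the mutation and clinical tables (values partly hypothetical)
mutations = pd.DataFrame({
    "Hugo_Symbol": ["PTGER3", "FLG", "TP53"],
    "PATIENT_ID": ["TCGA-B6-A0IG", "TCGA-B6-A0IG", "TCGA-XX-0000"],
})
clinical = pd.DataFrame({
    "PATIENT_ID": ["TCGA-B6-A0IG"],
    "CANCER_TYPE_DETAILED": ["Breast Invasive Ductal Carcinoma"],
})

# validate="many_to_one" raises if PATIENT_ID is duplicated in the clinical table;
# indicator=True adds a "_merge" column that flags unmatched mutation rows.
merged = pd.merge(mutations, clinical, on="PATIENT_ID", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "PATIENT_ID"].unique()
print(len(merged), list(unmatched))
```

In this notebook, unmatched rows end up with a missing CANCER_TYPE_DETAILED and are removed by the subtype filter further down.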


In [14]:
# Keep only the relevant columns
cancer = df_merged[['Hugo_Symbol','PATIENT_ID','Variant_Classification',
                    'Variant_Type','Transcript_ID','Feature','Gene',
                    'CANCER_TYPE','CANCER_TYPE_DETAILED']]
cancer.head()


Unnamed: 0,Hugo_Symbol,PATIENT_ID,Variant_Classification,Variant_Type,Transcript_ID,Feature,Gene,CANCER_TYPE,CANCER_TYPE_DETAILED
0,PTGER3,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000356595,ENST00000356595,ENSG00000050628,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
1,FLG,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000368799,ENST00000368799,ENSG00000143631,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
2,GPR52,TCGA-B6-A0IG,Silent,SNP,ENST00000367685,ENST00000367685,ENSG00000203737,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
3,SLC35F3,TCGA-B6-A0IG,Silent,SNP,ENST00000366618,ENST00000366618,ENSG00000183780,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
4,OR2T3,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000359594,ENST00000359594,ENSG00000196539,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma


##### Here, we selected only the relevant columns from the merged mutation and clinical dataset. These include the gene name (Hugo_Symbol), mutation type (Variant_Classification, Variant_Type), transcript and gene IDs, and cancer subtype information (CANCER_TYPE_DETAILED). This filtered data keeps only the necessary details needed for our classification task and simplifies further processing.

## PREPROCESSING


#### Filtering Cancer types:

In [15]:
# Keep only these two types
cancer = cancer[
    (cancer['CANCER_TYPE_DETAILED'] == 'Breast Invasive Ductal Carcinoma') |
    (cancer['CANCER_TYPE_DETAILED'] == 'Breast Invasive Lobular Carcinoma')
]
print(" Filtered shape:", cancer.shape)
cancer.head()

 Filtered shape: (56933, 9)


Unnamed: 0,Hugo_Symbol,PATIENT_ID,Variant_Classification,Variant_Type,Transcript_ID,Feature,Gene,CANCER_TYPE,CANCER_TYPE_DETAILED
0,PTGER3,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000356595,ENST00000356595,ENSG00000050628,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
1,FLG,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000368799,ENST00000368799,ENSG00000143631,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
2,GPR52,TCGA-B6-A0IG,Silent,SNP,ENST00000367685,ENST00000367685,ENSG00000203737,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
3,SLC35F3,TCGA-B6-A0IG,Silent,SNP,ENST00000366618,ENST00000366618,ENSG00000183780,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma
4,OR2T3,TCGA-B6-A0IG,Missense_Mutation,SNP,ENST00000359594,ENST00000359594,ENSG00000196539,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma


In [16]:
#Create a label mapping

In [17]:
# Unique patient → label mapping
df_target = cancer[['PATIENT_ID', 'CANCER_TYPE_DETAILED']].drop_duplicates()
df_target.head()

Unnamed: 0,PATIENT_ID,CANCER_TYPE_DETAILED
0,TCGA-B6-A0IG,Breast Invasive Ductal Carcinoma
45,TCGA-BH-A0HQ,Breast Invasive Ductal Carcinoma
84,TCGA-BH-A18G,Breast Invasive Ductal Carcinoma
1857,TCGA-A1-A0SD,Breast Invasive Ductal Carcinoma
1891,TCGA-A1-A0SF,Breast Invasive Ductal Carcinoma


### One-hot encode Hugo_Symbol (mutated gene column)

In [18]:
# Turn each gene into a binary column
dummies = pd.get_dummies(cancer["Hugo_Symbol"], drop_first=True)

# Convert boolean True/False to 1/0
dummies = dummies.astype(int)

# Add cancer type back
dummies = pd.concat([dummies, cancer["CANCER_TYPE_DETAILED"]], axis=1)
dummies.head()


Unnamed: 0,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,AADACL2,AADACL4,AADAT,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,CANCER_TYPE_DETAILED
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma


##### We transformed the Hugo_Symbol column (which contains gene names) into binary features using one-hot encoding. Each gene becomes a separate column; because every row of the mutation table describes a single mutation, a row has a 1 in the column of its own gene and 0 everywhere else. Note that drop_first=True removes the column of the first gene symbol in sorted order, so that gene is not represented as a feature. We then added the CANCER_TYPE_DETAILED column back so that each row stays associated with its cancer subtype. This converts the categorical gene names into a numeric format that machine learning models can use.
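
The one-hot encoding and the per-patient aggregation can also be expressed in a single step. The following is a minimal sketch using `pd.crosstab` on toy data; the patient IDs `P1` to `P3` are hypothetical. It is an alternative formulation, not the approach used in this notebook.

```python
import pandas as pd

# Toy mutation table: one row per mutation (patient IDs are hypothetical)
toy = pd.DataFrame({
    "PATIENT_ID": ["P1", "P1", "P1", "P2", "P3"],
    "Hugo_Symbol": ["TP53", "TP53", "CDH1", "PIK3CA", "CDH1"],
})

# Count mutations per patient and gene, then reduce counts to presence/absence
counts = pd.crosstab(toy["PATIENT_ID"], toy["Hugo_Symbol"])
binary = (counts > 0).astype(int)
print(binary)
```

This avoids materializing one dummy row per mutation and keeps every gene as a column, since nothing is dropped.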

### Combine Patient ID with Mutations and Group

In [19]:
# Combine with patient ID
df = pd.concat([cancer[['PATIENT_ID']], dummies], axis=1)

# Group by patient and take max (1 if gene mutated at least once)
sparse_df = df.groupby("PATIENT_ID")[dummies.columns].max()

# Just in case: remove any duplicated columns
sparse_df = sparse_df.loc[:, ~sparse_df.columns.duplicated()]

display(sparse_df.head())


PATIENT_ID,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,AADACL2,AADACL4,AADAT,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,CANCER_TYPE_DETAILED
TCGA-A1-A0SD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
TCGA-A1-A0SE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Lobular Carcinoma
TCGA-A1-A0SF,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
TCGA-A1-A0SI,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma
TCGA-A1-A0SJ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Breast Invasive Ductal Carcinoma


### Reset Index and Prepare X and y

In [20]:
# Reset index so PATIENT_ID is a column again
df_logir = sparse_df.copy()
df_logir.reset_index(inplace=True)

# Define X (features)
x = df_logir.drop(['CANCER_TYPE_DETAILED', 'PATIENT_ID'], axis=1)

# Define y (labels)
y = df_logir['CANCER_TYPE_DETAILED']

print(" X shape:", x.shape)
print(" y shape:", y.shape)


 X shape: (617, 15301)
 y shape: (617,)


In [21]:
y.value_counts()


CANCER_TYPE_DETAILED
Breast Invasive Ductal Carcinoma     490
Breast Invasive Lobular Carcinoma    127
Name: count, dtype: int64

#### We reset the index to bring PATIENT_ID back as a regular column and then prepared the input (X) and target (y) for our classification task.

X contains binary features representing the presence or absence of mutations in 15,301 genes for each patient.

y contains the corresponding cancer subtype label for each patient — either ductal or lobular.

We have a total of 617 patients, with 490 cases of ductal carcinoma (about 79%) and 127 cases of lobular carcinoma (about 21%), making this a binary classification problem with imbalanced classes.

### Cross-Validation

In [22]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

print(" x_scaled shape:", x_scaled.shape)

 x_scaled shape: (617, 15301)


#### Set evaluation metrics and 5-fold CV


In [23]:
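# Scoring metrics (macro-averaged, matching the metrics computed in the loops below;
# the default binary averaging would fail on the string class labels)
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='macro'),
    'recall': make_scorer(recall_score, average='macro'),
    'f1_score': make_scorer(f1_score, average='macro')
}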

# 5-fold cross-validation (shuffled KFold; folds are not stratified by class)
skf = KFold(n_splits=5, shuffle=True, random_state=42)
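
With imbalanced classes, a stratified split keeps the class ratio the same in every fold, and putting the scaler inside a pipeline means it is refit on each training split only. The following is a minimal sketch of that setup on synthetic data; the arrays `X_toy` and `y_toy` and the 80/20 class ratio are assumptions for illustration. It is a variant of the setup above, not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary features with an 80/20 class imbalance (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 20)).astype(float)
y_toy = np.array(["Ductal"] * 80 + ["Lobular"] * 20)

# The scaler lives inside the pipeline, so it never sees the test fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    model, X_toy, y_toy, cv=cv,
    scoring=["accuracy", "f1_macro", "precision_macro", "recall_macro"],
)
print(results["test_f1_macro"])
```

Each test fold here contains exactly 4 of the 20 minority samples, whereas a plain KFold split can leave some folds with noticeably fewer.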

### Logistic regression


In [24]:
logir_ovr = LogisticRegression(random_state=0, multi_class='ovr', max_iter=1000)

In [25]:
# Evaluation metrics
logit_accuracies, logit_precision, logit_recall, logit_f1scores = [], [], [], []

### Perform cross-validation

In [26]:
# Perform cross-validation
for train_idx, test_idx in skf.split(x_scaled, y):
    x_train, x_test = x_scaled[train_idx], x_scaled[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    logir_ovr.fit(x_train, y_train)
    y_pred = logir_ovr.predict(x_test)
    
    logit_accuracies.append(logir_ovr.score(x_test, y_test))
    logit_f1scores.append(metrics.f1_score(y_test, y_pred, average='macro'))
    logit_precision.append(metrics.precision_score(y_test, y_pred, average='macro'))
    logit_recall.append(metrics.recall_score(y_test, y_pred, average='macro'))

# Show results
print("Logistic Regression F1 Scores:", logit_f1scores)


Logistic Regression F1 Scores: [0.44144144144144143, 0.4439461883408072, 0.4434389140271493, 0.4533333333333333, 0.4305555555555556]


### In this step, we performed 5-fold cross-validation using a logistic regression model to evaluate how well it can classify the two breast cancer subtypes based on mutation profiles.

For each fold:

The data was split into training and testing sets.

The model was trained on the training set and tested on the test set.

We recorded key evaluation metrics: accuracy, precision, recall, and F1 score (macro-averaged).

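The macro F1 scores across the 5 folds are consistent, ranging from about 0.43 to 0.45, but they are low. At this class ratio, a classifier that labels every test patient as ductal gets a macro F1 of roughly 0.44 (an F1 of about 0.88 for ductal and 0 for lobular, averaged), so the scores are consistent with the model rarely or never predicting the lobular class. Class imbalance and the very high dimensionality (15,301 features for 617 patients) are likely contributors.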
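
To put macro-averaged F1 scores in context on imbalanced data, it helps to compute what a classifier that always predicts the majority class would score. The sketch below does this arithmetic directly. The helper name is hypothetical, and the 98/26 fold composition is an assumed example of a 124-patient test fold, not a split taken from the notebook output.

```python
def macro_f1_majority_baseline(n_majority: int, n_minority: int) -> float:
    """Macro F1 when every sample is predicted as the majority class."""
    # Majority class: TP = n_majority, FP = n_minority, FN = 0
    f1_majority = 2 * n_majority / (2 * n_majority + n_minority)
    # Minority class is never predicted, so its F1 is 0
    f1_minority = 0.0
    return (f1_majority + f1_minority) / 2


# Assumed test fold of 124 patients: 98 ductal, 26 lobular
print(macro_f1_majority_baseline(98, 26))    # about 0.4414

# Whole dataset: 490 ductal, 127 lobular
print(macro_f1_majority_baseline(490, 127))  # about 0.4426
```

Any macro F1 close to these values should be read as "no better than always predicting ductal".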

## Naive Bayes classification

In [27]:
# Model
naive_bayes = BernoulliNB()

In [28]:
# Evaluation metric lists
naive_accuracies, naive_precision, naive_recall, naive_f1scores = [], [], [], []

In [29]:
# 5-fold CV
for train_idx, test_idx in skf.split(x_scaled, y):
    x_train, x_test = x_scaled[train_idx], x_scaled[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    naive_bayes.fit(x_train, y_train)
    y_pred = naive_bayes.predict(x_test)
    
    naive_accuracies.append(naive_bayes.score(x_test, y_test))
    naive_f1scores.append(metrics.f1_score(y_test, y_pred, average='macro'))
    naive_precision.append(metrics.precision_score(y_test, y_pred, average='macro'))
    naive_recall.append(metrics.recall_score(y_test, y_pred, average='macro'))

In [30]:
print("Naive Bayes F1 Scores:", naive_f1scores)

Naive Bayes F1 Scores: [0.4976851851851852, 0.5070336391437309, 0.48806881243063266, 0.4722707770538694, 0.45394006659267483]


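#### We trained a Bernoulli Naive Bayes classifier using 5-fold cross-validation. This model is well-suited for binary input data, like our mutation presence/absence matrix. Although the model is fitted on the standardized matrix, BernoulliNB binarizes its input at a threshold of 0.0 by default, so positive standardized values are treated as mutated and all other values as not mutated.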

For each fold:

The model was trained on the training split and predictions were made on the test split.

We calculated accuracy, precision, recall, and macro-averaged F1 score.

The macro F1 scores ranged from about 0.45 to 0.51, somewhat higher than those of logistic regression. This suggests that Naive Bayes handled the high-dimensional sparse mutation data slightly better in this case, although the scores remain low in absolute terms.

## LDA

In [31]:
# Initialize LDA model
lda = LinearDiscriminantAnalysis()

# Metric lists
lda_accuracies, lda_precision, lda_recall, lda_f1scores = [], [], [], []

# Cross-validation loop
for train_idx, test_idx in skf.split(x_scaled, y):
    x_train, x_test = x_scaled[train_idx], x_scaled[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    lda.fit(x_train, y_train)
    y_pred = lda.predict(x_test)
    
    lda_accuracies.append(lda.score(x_test, y_test))
    lda_f1scores.append(metrics.f1_score(y_test, y_pred, average='macro'))
    lda_precision.append(metrics.precision_score(y_test, y_pred, average='macro'))
    lda_recall.append(metrics.recall_score(y_test, y_pred, average='macro'))

# Show results
print("LDA F1 Scores:", lda_f1scores)

LDA F1 Scores: [0.44144144144144143, 0.3721312553429342, 0.41380718954248363, 0.47435897435897434, 0.4305555555555556]


###  We applied Linear Discriminant Analysis (LDA) to classify the breast cancer subtypes using 5-fold cross-validation.

LDA attempts to find a linear combination of features that best separates the two classes. With far more features (15,301) than patients (617), however, the covariance structure it relies on cannot be estimated reliably, so it is not expected to perform well on high-dimensional sparse data like gene mutations.

The macro F1 scores varied across folds, ranging from about 0.37 to 0.47.

This performance was generally lower and more variable than that of Naive Bayes, suggesting LDA struggled with the feature space or the class imbalance.
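
One common option when features outnumber samples is to regularize the covariance estimate. The following is a minimal sketch of LDA with shrinkage on synthetic data; `X_toy` and `y_toy` are made up, and this variant was not evaluated in this notebook, so no claim is made about how it would score on the TCGA data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary data with more features (200) than samples (60)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(60, 200)).astype(float)
y_toy = np.array([0] * 45 + [1] * 15)

# shrinkage="auto" uses the Ledoit-Wolf estimate; it requires the
# "lsqr" or "eigen" solver rather than the default "svd"
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda_shrunk.fit(X_toy, y_toy)
preds = lda_shrunk.predict(X_toy)
print(preds.shape)
```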

In [33]:
def average(scores):
    return sum(scores) / len(scores)

# Store model names and their average F1 scores
model_names = ['Logistic Regression', 'Naive Bayes', 'LDA']
avg_f1s = [
    average(logit_f1scores),
    average(naive_f1scores),
    average(lda_f1scores)
]

# Print averages
print(" Average F1 Scores:")
for name, score in zip(model_names, avg_f1s):
    print(f"{name}: {round(score, 4)}")

# Find best model
best_index = avg_f1s.index(max(avg_f1s))
print("\n Best performing model based on F1 score:")
print(f" {model_names[best_index]} with F1 score: {round(avg_f1s[best_index], 4)}")


 Average F1 Scores:
Logistic Regression: 0.4425
Naive Bayes: 0.4838
LDA: 0.4265

 Best performing model based on F1 score:
 Naive Bayes with F1 score: 0.4838


### Comparison
### Based on the average macro F1 scores from 5-fold cross-validation, Naive Bayes achieved the best performance with a score of 0.4838, followed by Logistic Regression at 0.4425, and LDA at 0.4265.

### This suggests that Naive Bayes handled our high-dimensional, sparse binary data (gene mutation presence/absence) somewhat better than the other two models, possibly because its feature-independence assumption keeps the number of estimated parameters small. The margins are narrow, however: all three averages are below 0.5, and with about 79% of patients in the ductal class, a macro F1 near 0.44 is what a model scores when it predicts ductal for every patient. Logistic Regression sits at that level, and LDA showed the weakest and most variable performance, likely because it is poorly suited to sparse data with far more features than samples.
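
The averages above can be complemented with the spread across folds. The sketch below recomputes the mean and sample standard deviation from the per-fold macro F1 scores printed earlier in this notebook; the dictionary name `f1_by_model` is introduced here for illustration.

```python
from statistics import mean, stdev

# Per-fold macro F1 scores copied from the outputs above
f1_by_model = {
    "Logistic Regression": [0.44144144144144143, 0.4439461883408072,
                            0.4434389140271493, 0.4533333333333333,
                            0.4305555555555556],
    "Naive Bayes": [0.4976851851851852, 0.5070336391437309,
                    0.48806881243063266, 0.4722707770538694,
                    0.45394006659267483],
    "LDA": [0.44144144144144143, 0.3721312553429342,
            0.41380718954248363, 0.47435897435897434,
            0.4305555555555556],
}

# Mean and sample standard deviation per model
summary = {name: (mean(s), stdev(s)) for name, s in f1_by_model.items()}
for name, (m, s) in summary.items():
    print(f"{name}: {m:.4f} +/- {s:.4f}")
```

With only five folds the standard deviations are rough, but they show that LDA varies most across folds and Logistic Regression least.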