### Step 3: Model Training and Evaluation

We applied Logistic Regression, Random Forest, and Support Vector Machine models to distinguish normal from cancerous samples. The dataset was split into 80% training and 20% testing sets using stratified sampling. Accuracy, precision, recall, and F1-score were used to assess model performance.


# Model Training and Evaluation

This notebook builds ML models and evaluates performance.

## Steps Involved in Model Training and Evaluation


##  Data Loading  
**Load the merged and labeled dataset containing cfDNA methylation and miRNA features.**



##  Data Standardization  
**Standardize all feature values using Z-score normalization to ensure uniform scale for ML algorithms.**



## Train-Test Split  
**Split the dataset into 80% training and 20% testing sets using stratified sampling to preserve class distribution.**



## Feature Selection  
**Use ANOVA F-score to select the top 100 most relevant features contributing to classification.**



## Model Training – Logistic Regression  
**Train a Logistic Regression model using the selected features to classify tumor vs normal samples.**



##  Model Training – Random Forest  
**Train a Random Forest model as a robust ensemble classifier for comparison and improved accuracy.**



##  Evaluation  
**Evaluate model performance using confusion matrix, precision, recall, F1-score, and accuracy metrics.**



##  Save Outputs  
**Export selected feature list and model performance reports for documentation and future reference.**
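
The feature selection step in the outline can be sketched with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). This is a minimal illustration on synthetic data, not the notebook's own code: `X_demo`, `y_demo`, and the `feature_{i}` column names are hypothetical stand-ins, and only `k=100` comes from the outline.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in (assumption): 60 samples, 300 features, balanced binary labels.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(60, 300)),
    columns=[f"feature_{i}" for i in range(300)],
)
y_demo = np.repeat([0, 1], 30)

# Shift the first five features for class 1 so the F-test has signal to find.
X_demo.iloc[y_demo == 1, :5] += 2.0

# Score every feature with the ANOVA F-test and keep the 100 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=100)
X_selected = selector.fit_transform(X_demo, y_demo)

# Recover the names of the retained columns for later export.
selected_features = X_demo.columns[selector.get_support()].tolist()

print("Shape after selection:", X_selected.shape)
print("First selected features:", selected_features[:5])
```

The `selected_features` list is the kind of object the Save Outputs step would write to disk. To avoid leaking test information, the selector would normally be fitted on the training split only.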


###  Load Prepared Dataset for Modeling

In this step, we load the final, labeled, and lightweight version of the dataset that was previously created by merging miRNA and methylation features. This dataset will now be used for training and evaluating machine learning models. We also preview the shape and structure to ensure it has loaded correctly.


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the lightweight merged dataset
df = pd.read_csv(r"C:/Users/sanja/cfDNA_LungCancer_ML/data/processed/merged_labeled_light.csv", index_col=0)

print(" Dataset loaded successfully.")
print(" Shape of dataset:", df.shape)
df.head()



 Dataset loaded successfully.
 Shape of dataset: (450, 1001)


| | hsa-let-7a-1 | hsa-let-7a-2 | hsa-let-7a-3 | hsa-let-7b | hsa-let-7c | hsa-let-7d | hsa-let-7e | hsa-let-7f-1 | hsa-let-7f-2 | hsa-let-7g | ... | NUP107_cg00036115 | A2BP1_cg00036119 | GALP_cg00036137 | ADCY9_cg00036258 | KIF26B_cg00036263 | GRIN2A_cg00036299 | ARL6IP4_cg00036328 | ATXN7_cg00036369 | MOGS_cg00036386 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA.05.4384 | 13.8766 | 14.8745 | 13.8822 | 13.8259 | 10.6177 | 8.7119 | 10.8698 | 5.3122 | 15.1357 | 10.3183 | ... | -0.3378 | -0.2096 | 0.2053 | 0.3476 | 0.4723 | -0.1074 | -0.4301 | 0.3398 | -0.472 | 0 |
| TCGA.05.4390 | 11.7425 | 12.7576 | 11.7578 | 13.0601 | 7.608 | 8.6168 | 10.4833 | 3.4069 | 12.4367 | 9.3119 | ... | -0.3536 | -0.2863 | 0.2775 | 0.3677 | 0.4791 | -0.0471 | -0.4155 | 0.3944 | -0.4646 | 0 |
| TCGA.05.4396 | 14.0194 | 15.0255 | 14.0367 | 14.5902 | 11.1171 | 9.8454 | 11.4738 | 4.3995 | 14.3723 | 9.7934 | ... | -0.283 | -0.2535 | 0.3093 | 0.3364 | 0.4603 | 0.01 | -0.4004 | 0.3917 | -0.4539 | 0 |
| TCGA.05.4405 | 12.9428 | 13.9327 | 12.9499 | 14.217 | 11.1093 | 8.4836 | 10.3909 | 3.1985 | 12.5092 | 8.4956 | ... | -0.3241 | -0.1369 | 0.2914 | 0.3384 | 0.4605 | -0.0564 | -0.4108 | 0.3659 | -0.4704 | 0 |
| TCGA.05.4410 | 12.715 | 13.7157 | 12.7252 | 13.7465 | 10.3613 | 8.736 | 10.0696 | 3.9421 | 13.0051 | 9.0249 | ... | -0.3825 | -0.1144 | 0.2545 | 0.3771 | 0.4682 | -0.0941 | -0.4134 | 0.4214 | -0.4599 | 0 |
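
Beyond the shape, a quick structural check can confirm that both feature types and the label column are present. The sketch below rebuilds a tiny frame from a few of the values previewed above. It assumes, based only on the visible columns, that miRNA features start with `hsa-` and methylation features contain `_cg`; the name `demo` and that naming rule are illustrative assumptions.

```python
import pandas as pd

# Two previewed rows and three previewed feature columns, copied from the output above.
demo = pd.DataFrame(
    {
        "hsa-let-7a-1": [13.8766, 11.7425],
        "hsa-let-7b": [13.8259, 13.0601],
        "NUP107_cg00036115": [-0.3378, -0.3536],
        "Label": [0, 0],
    },
    index=["TCGA.05.4384", "TCGA.05.4390"],
)

# Group feature columns by the naming pattern seen in the preview (assumption).
feature_cols = demo.columns.drop("Label")
mirna_cols = [c for c in feature_cols if c.startswith("hsa-")]
methylation_cols = [c for c in feature_cols if "_cg" in c]

print("miRNA features:", len(mirna_cols))
print("Methylation features:", len(methylation_cols))
print("Label counts:", demo["Label"].value_counts().to_dict())
print("Rows with missing values:", int(demo.isna().any(axis=1).sum()))
```

Run against the full `df`, the same checks would show the class balance and the number of incomplete rows before any modeling starts.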


###  Split Features and Labels, Standardize, and Prepare for Training

In this section, we separate the features (X) and labels (y) from the merged dataset. To ensure uniform scaling across features, we apply standardization using `StandardScaler`. Then, we split the dataset into training and testing sets (80% train, 20% test) while maintaining class distribution using stratified sampling.


In [5]:
# Split into features and labels
X = df.drop(columns=['Label'])
y = df['Label']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42)

print(" Data split and scaled successfully.")
print(" X_train:", X_train.shape)
print(" X_test :", X_test.shape)


 Data split and scaled successfully.
 X_train: (360, 1000)
 X_test : (90, 1000)
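
The cell above fits `StandardScaler` on all samples before splitting, so the test rows contribute to the mean and variance used for scaling. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. All `*_demo` and `*_raw` names are hypothetical, and this is an illustration rather than the notebook's procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 450 samples, 20 features, balanced labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(450, 20))
y_demo = np.repeat([0, 1], 225)

# 1) Split the raw, unscaled data with stratification.
X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# 2) Learn the scaling parameters from the training rows only.
scaler_demo = StandardScaler()
X_train_demo = scaler_demo.fit_transform(X_train_raw)

# 3) Apply the same parameters to the test rows without refitting.
X_test_demo = scaler_demo.transform(X_test_raw)

print("Train:", X_train_demo.shape, "Test:", X_test_demo.shape)
print("Test-set class counts:", np.bincount(y_test_demo))
```

With this ordering the training features have zero mean and unit variance, while the test features are only approximately standardized, which mirrors how the model would treat unseen samples.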


###  Handling Missing and Infinite Values for Model Readiness

Before training models, it's critical to ensure data quality. This cell checks the training set for missing (NaN) and infinite (Inf) values. It then drops every row that contains a NaN using `dropna()`, restandardizes the features, and repeats the stratified 80/20 split. Note that the cleaning runs unconditionally, regardless of what the checks report, and that `dropna()` removes NaNs only, not Infs. In this run the check finds NaNs in `X_train` and no Infs, and dropping the incomplete rows reduces the training set from 360 to 336 samples.


In [6]:
print(" Any NaNs in X_train?", np.isnan(X_train).any())
print("Any NaNs in y_train?", np.isnan(y_train).any())

print(" Any Infs in X_train?", np.isinf(X_train).any())
print(" Shapes - X_train:", X_train.shape, " y_train:", y_train.shape)

# Drop rows containing NaNs, then repeat the scaling and the train-test split
df_cleaned = df.dropna()

# Separate features and labels again
X = df_cleaned.drop(columns=['Label'])
y = df_cleaned['Label']

# Scale again
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split again
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42)

print(" Cleaned and ready. X_train shape:", X_train.shape)




 Any NaNs in X_train? True
Any NaNs in y_train? False
 Any Infs in X_train? False
 Shapes - X_train: (360, 1000)  y_train: (360,)
 Cleaned and ready. X_train shape: (336, 1000)
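
The cleaning step can also be written so that it runs only when needed and covers infinite values as well as NaNs. The helper below is a hypothetical sketch, not part of the notebook: `drop_incomplete_rows` and the toy `demo` frame are illustrative names.

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Treat +/-Inf as missing, then drop rows that contain any missing value."""
    numeric = frame.replace([np.inf, -np.inf], np.nan)
    if not numeric.isna().any().any():
        return frame  # nothing to clean, return the input unchanged
    return numeric.dropna()


# Toy frame (assumption): one row with a NaN, one row with an Inf.
demo = pd.DataFrame({
    "feat_a": [0.1, np.nan, 0.3, 0.4],
    "feat_b": [1.0, 2.0, np.inf, 4.0],
    "Label": [0, 1, 0, 1],
})

cleaned = drop_incomplete_rows(demo)
print("Shape before:", demo.shape, "after:", cleaned.shape)
print("Label counts after cleaning:", cleaned["Label"].value_counts().to_dict())
```

Printing the label counts after cleaning is a cheap way to confirm that dropping rows has not skewed the class balance.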


###  Logistic Regression Model Training and Evaluation

This section fits a Logistic Regression model on the training data and evaluates its performance on the test set. The classification report provides key metrics like precision, recall, F1-score, and support, helping assess how well the model distinguishes between normal and cancer samples.


In [7]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

print(" Logistic Regression Report:")
print(classification_report(y_test, y_pred_lr))


 Logistic Regression Report:
              precision    recall  f1-score   support

           0       0.65      0.52      0.58        42
           1       0.60      0.71      0.65        42

    accuracy                           0.62        84
   macro avg       0.62      0.62      0.62        84
weighted avg       0.62      0.62      0.62        84



###  Random Forest Model Training and Evaluation

In this step, we train a Random Forest classifier with 100 decision trees to capture complex patterns in the data. After training, the model's performance on the test set is evaluated using a classification report, offering insights into its ability to classify lung cancer and normal cases accurately.


In [8]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print(" Random Forest Report:")
print(classification_report(y_test, y_pred_rf))


 Random Forest Report:
              precision    recall  f1-score   support

           0       0.61      0.60      0.60        42
           1       0.60      0.62      0.61        42

    accuracy                           0.61        84
   macro avg       0.61      0.61      0.61        84
weighted avg       0.61      0.61      0.61        84



###  Support Vector Machine (SVM) Model Training and Evaluation

We use a Support Vector Machine with an RBF (Radial Basis Function) kernel to handle non-linear relationships in the data. This model is trained on the standardized training set, and its classification performance is assessed using precision, recall, and F1-score metrics on the test data.


In [9]:
svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

print(" SVM Report:")
print(classification_report(y_test, y_pred_svm))


 SVM Report:
              precision    recall  f1-score   support

           0       0.67      0.57      0.62        42
           1       0.62      0.71      0.67        42

    accuracy                           0.64        84
   macro avg       0.65      0.64      0.64        84
weighted avg       0.65      0.64      0.64        84



###  Model Accuracy Comparison

After evaluating all three models (Logistic Regression, Random Forest, and Support Vector Machine), we compare their classification accuracies side by side. This gives a quick overview of how well each model predicts cancer status from the integrated cfDNA methylation and miRNA features. On the 84-sample test set, SVM scores highest (0.6429), followed by Logistic Regression (0.6190) and Random Forest (0.6071).


In [16]:
print("Model Accuracy Comparison:")
print(f"Logistic Regression: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Random Forest      : {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"SVM                : {accuracy_score(y_test, y_pred_svm):.4f}")


Model Accuracy Comparison:
Logistic Regression: 0.6190
Random Forest      : 0.6071
SVM                : 0.6429
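
With only 84 test samples, these accuracies differ by one to three correctly classified samples. A rough way to gauge the uncertainty is a normal-approximation 95% interval, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$. In the sketch below, the correct-prediction counts (52, 51, 54) are inferred by multiplying each reported accuracy by 84. They are an assumption, not notebook output, and the variable names are illustrative.

```python
import math

n_test = 84  # test-set size reported in the classification reports

# Correct-prediction counts inferred from the reported accuracies (assumption).
correct = {"Logistic Regression": 52, "Random Forest": 51, "SVM": 54}

intervals = {}
for name, k in correct.items():
    acc = k / n_test
    # Normal-approximation (Wald) half-width for a binomial proportion.
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n_test)
    intervals[name] = (acc, acc - half_width, acc + half_width)
    print(f"{name:<20} {acc:.4f}  95% CI [{acc - half_width:.3f}, {acc + half_width:.3f}]")
```

The intervals overlap substantially, so this single split does not clearly separate the three models. Repeated stratified cross-validation would give a steadier comparison.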
