### Step 1: Data Preprocessing

We first inspect the miRNA and methylation datasets for missing values, then apply standard scaling so that each feature has zero mean and unit variance across samples. This keeps features measured on different scales from dominating model training.
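
As a minimal sketch of what this step could look like, the snippet below counts missing values per feature and then z-scores each column of a small toy table. The toy values, the sample IDs, and the `zscore` helper are illustrative assumptions and are not taken from the notebook; the sketch also assumes samples are rows and features are columns.

```python
import pandas as pd

# Toy table (hypothetical values): rows are samples, columns are features.
toy = pd.DataFrame(
    {"feat_a": [1.0, 2.0, 3.0], "feat_b": [10.0, None, 30.0]},
    index=["s1", "s2", "s3"],
)

# Number of missing values in each feature.
missing_per_feature = toy.isna().sum()


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Center each column to mean 0 and scale it to unit standard deviation."""
    # ddof=0 (population std) is the convention scikit-learn's StandardScaler uses.
    # Missing entries are skipped when computing mean/std and stay missing.
    return (df - df.mean()) / df.std(ddof=0)


scaled = zscore(toy)
print(missing_per_feature)
print(scaled.round(3))
```

A column with zero variance would produce a division by zero here, so constant features would need to be removed or handled separately before scaling.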


# Data Preprocessing

This notebook loads, cleans, and formats the raw cfDNA methylation and miRNA expression datasets to prepare them for integration and modeling.

## Steps Involved in Data Preprocessing

1. **Import required libraries**  
   Load essential Python libraries such as `pandas`, `numpy`, and `os`.

2. **Load datasets**  
   Read the miRNA and methylation datasets from `.csv` or `.tsv` format.

3. **Inspect dataset structure**  
   Use `.head()`, `.shape`, and `.info()` to understand data organization.

4. **Clean column names**  
   Remove leading/trailing whitespace and correct formatting issues.

5. **Handle missing values**  
   Drop empty rows/columns or apply imputation as needed.

6. **Transpose datasets (if required)**  
   Ensure samples are represented as rows and features as columns.

7. **Set sample IDs as index**  
   Assign sample identifiers (e.g., TCGA codes) as row indices.

8. **Save cleaned datasets**  
   Export the cleaned and properly formatted datasets to the `data/processed/` folder for further use.
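
The loading and saving steps in the list above can be sketched with two small helpers. The names `load_table` and `save_processed`, the toy file, and the choice of `pathlib` over `os` are assumptions made for illustration; only the `.csv`/`.tsv` formats and the `data/processed/` folder come from the steps themselves.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path) -> pd.DataFrame:
    """Read a .csv or .tsv file, choosing the separator from the extension."""
    path = Path(path)
    sep = "\t" if path.suffix.lower() == ".tsv" else ","
    return pd.read_csv(path, sep=sep)


def save_processed(df: pd.DataFrame, name: str, out_dir="data/processed") -> Path:
    """Write df to out_dir/name as CSV, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / name
    df.to_csv(out_path)  # the index (sample IDs) becomes the first column
    return out_path


# Demo in a temporary folder so nothing in the real project is touched.
tmp = Path(tempfile.mkdtemp())
(tmp / "toy.tsv").write_text("sample_id\tfeat_1\ns1\t0.1\ns2\t0.2\n")

toy = load_table(tmp / "toy.tsv").set_index("sample_id")
out_path = save_processed(toy, "toy_clean.csv", out_dir=tmp / "processed")

# Reading back with index_col=0 restores the sample IDs as the index.
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded)
```

Relative paths such as `data/processed` keep the notebook portable across machines, which is one reason to prefer them over absolute paths.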


### Data Preprocessing – Loading and Previewing Datasets

In this step, we load the preprocessed miRNA and cfDNA methylation datasets using pandas. These datasets contain miRNA expression and methylation values for lung cancer and normal samples. Previewing their shapes and contents helps confirm that both datasets loaded correctly and are ready for downstream processing.



In [4]:
# Data Preprocessing

import pandas as pd

# Load miRNA dataset
mirna_path = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\miRNA_TCGA_LUAD.csv"
mirna_df = pd.read_csv(mirna_path)

# Load methylation dataset
meth_path = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\methylation_TCGA_LUAD.csv"
meth_df = pd.read_csv(meth_path)

# Preview
print("miRNA dataset shape:", mirna_df.shape)
print("Methylation dataset shape:", meth_df.shape)

mirna_df.head()


miRNA dataset shape: (809, 451)
Methylation dataset shape: (336284, 459)


| Unnamed: 0 | attrib_name | TCGA.05.4384 | TCGA.05.4390 | TCGA.05.4396 | TCGA.05.4405 | TCGA.05.4410 | TCGA.05.4415 | TCGA.05.4417 | TCGA.05.4424 | TCGA.05.4425 | ... | TCGA.NJ.A4YG | TCGA.NJ.A4YI | TCGA.NJ.A4YP | TCGA.NJ.A4YQ | TCGA.NJ.A55A | TCGA.NJ.A55O | TCGA.NJ.A55R | TCGA.NJ.A7XG | TCGA.O1.A52J | TCGA.S2.AA1A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hsa-let-7a-1 | 13.8766 | 11.7425 | 14.0194 | 12.9428 | 12.715 | 13.0099 | 12.151 | 12.9538 | 13.7344 | ... | 13.1164 | 13.0787 | 12.9371 | 11.5804 | 12.6737 | 13.726 | 12.3826 | 12.6324 | 11.9579 | 13.2691 |
| 1 | hsa-let-7a-2 | 14.8745 | 12.7576 | 15.0255 | 13.9327 | 13.7157 | 14.0169 | 13.1524 | 13.9443 | 14.7439 | ... | 14.1031 | 14.0725 | 13.9439 | 12.596 | 13.6404 | 14.7255 | 13.3917 | 13.6234 | 12.9604 | 14.2701 |
| 2 | hsa-let-7a-3 | 13.8822 | 11.7578 | 14.0367 | 12.9499 | 12.7252 | 13.0417 | 12.1721 | 12.9644 | 13.7445 | ... | 13.1158 | 13.0935 | 12.947 | 11.5914 | 12.6664 | 13.7416 | 12.3986 | 12.6361 | 11.9678 | 13.2772 |
| 3 | hsa-let-7b | 13.8259 | 13.0601 | 14.5902 | 14.217 | 13.7465 | 12.6094 | 13.1777 | 14.0479 | 14.5261 | ... | 13.7042 | 13.5944 | 14.3297 | 12.3857 | 13.3484 | 13.8372 | 12.8563 | 13.4904 | 12.5495 | 14.037 |
| 4 | hsa-let-7c | 10.6177 | 7.608 | 11.1171 | 11.1093 | 10.3613 | 9.2237 | 9.483 | 10.6913 | 10.8448 | ... | 9.9743 | 10.2977 | 9.5997 | 8.9539 | 11.6456 | 10.6305 | 9.2932 | 10.1604 | 10.7978 | 10.9685 |
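
Beyond `.head()` and `.shape`, a few structural facts are worth checking right after loading. The sketch below runs on a toy frame that mimics the layout of the preview above (a leftover `Unnamed: 0` counter column, a feature-name column, then one column per sample). The `summarize` helper, the sample names, and all values are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the previewed layout (values are made up).
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "attrib_name": ["feat-1", "feat-2", "feat-3"],
        "SAMPLE.A": [13.1, 14.2, 10.5],
        "SAMPLE.B": [11.7, 12.8, 7.6],
    }
)


def summarize(df: pd.DataFrame) -> dict:
    """Collect a few structural facts worth checking after loading."""
    return {
        "shape": df.shape,
        # Columns pandas did not parse as numbers (e.g. feature names).
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        # Columns whose header was empty in the file, typically a saved row index.
        "unnamed_columns": [c for c in df.columns if str(c).startswith("Unnamed")],
        "total_missing": int(df.isna().sum().sum()),
    }


summary = summarize(toy)
print(summary)
```

An `Unnamed: 0` column usually means the file was written with its row index included; passing `index_col=0` to `pd.read_csv` is one way to avoid carrying it around as data.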


### Cleaning Column Names and Removing Empty Data

To ensure consistency, we remove any leading or trailing whitespace from the column names in both datasets. We also drop columns that contain only missing values, as they contribute no information to the analysis.


In [6]:
# Strip whitespace from column names
mirna_df.columns = mirna_df.columns.str.strip()
meth_df.columns = meth_df.columns.str.strip()

# Drop columns in which every value is missing (if any)
mirna_df.dropna(axis=1, how='all', inplace=True)
meth_df.dropna(axis=1, how='all', inplace=True)
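
Dropping all-missing columns leaves partially missing features untouched. If imputation is needed, one possible approach is to drop features above a missingness cutoff and fill the rest with the per-feature median. This is a sketch on made-up values, not the notebook's method: the 50% cutoff, the median strategy, and the toy table are assumptions, and the table is laid out with samples as rows and features as columns.

```python
import numpy as np
import pandas as pd

# Toy samples-by-features table (made-up values).
toy = pd.DataFrame(
    {
        "f1": [0.2, np.nan, 0.4, 0.6],
        "f2": [np.nan, np.nan, np.nan, 0.9],
        "f3": [1.0, 2.0, 3.0, 4.0],
    }
)

max_missing = 0.5  # hypothetical cutoff: drop features missing in >50% of samples

# Fraction of missing values per feature.
missing_fraction = toy.isna().mean()

# Keep only features at or below the cutoff.
kept = toy.loc[:, missing_fraction <= max_missing]

# Fill the remaining gaps with each feature's median.
imputed = kept.fillna(kept.median())

print(missing_fraction)
print(imputed)
```

For modeling, the medians would normally be computed on the training split only and then reused for the test split, to avoid leaking information between the two.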


### Transposing the DataFrames

To prepare the data for machine learning, we transpose both the miRNA and methylation datasets so that each **row represents a patient/sample** and each **column represents a feature** (miRNA or CpG site). Most ML libraries expect this orientation, with samples as rows and features as columns.


In [3]:
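# Transpose so that rows become samples.
# Note: columns[0] is used as the index before transposing. In the miRNA preview
# above, columns[0] is "Unnamed: 0", so "attrib_name" is carried along as a row.
mirna_df = mirna_df.set_index(mirna_df.columns[0]).T
meth_df = meth_df.set_index(meth_df.columns[0]).T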

print("Transposed miRNA shape:", mirna_df.shape)
print("Transposed methylation shape:", meth_df.shape)


Transposed miRNA shape: (450, 809)
Transposed methylation shape: (458, 336284)
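
In the miRNA preview, the first column is `Unnamed: 0` (a saved row counter) and the feature names sit in `attrib_name`. Because the cell above indexes on the first column, `attrib_name` travels along as an extra row after transposing, which is consistent with 450 rows coming from 451 original columns. One possible alternative, sketched here on a toy frame with made-up values and hypothetical sample names, is to drop the counter column and index on the feature names before transposing. The notebook does not show the methylation file's layout, so applying the same fix there is an assumption.

```python
import pandas as pd

# Toy frame mimicking the previewed miRNA layout (values are made up).
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "attrib_name": ["feat-1", "feat-2"],
        "SAMPLE.A": [13.1, 14.2],
        "SAMPLE.B": [11.7, 12.8],
        "SAMPLE.C": [14.0, 15.0],
    }
)

# Indexing on the first column keeps the feature names as data,
# so they show up as an "attrib_name" row after transposing.
naive = toy.set_index(toy.columns[0]).T

# Alternative: drop the counter column, index on feature names, then transpose.
fixed = toy.drop(columns=["Unnamed: 0"]).set_index("attrib_name").T
fixed.index.name = "sample_id"
fixed.columns.name = None

print(naive.shape)  # (4, 2): 3 samples plus the attrib_name row
print(fixed.shape)  # (3, 2): 3 samples, 2 features
print(fixed.dtypes)
```

With the feature names used as the index, the transposed columns stay numeric, whereas mixing names and values in one table forces every column to a generic object dtype.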
