### Step 2: Data Integration and Labeling

After aligning sample IDs across the miRNA and methylation datasets, we concatenate the features and assign binary class labels: 0 for normal and 1 for lung cancer. This merged dataset will serve as input for machine learning models.


#  Data Integration and Labeling

This notebook merges the preprocessed cfDNA methylation and miRNA datasets based on common sample IDs and assigns class labels for supervised machine learning.

## Steps Involved in Data Integration and Labeling

1. **Import cleaned datasets**  
   Load the preprocessed miRNA and methylation datasets from the `data/processed/` directory.

2. **Align common sample IDs**  
   Identify and extract the common set of sample IDs between both datasets.

3. **Subset datasets**  
   Keep only those samples that exist in both datasets.

4. **Merge datasets**  
   Concatenate the miRNA and methylation datasets horizontally (columns-wise) into a single feature matrix.

5. **Assign class labels**  
   Add a new `Label` column to indicate sample class (e.g., `1` for tumor, `0` for normal).

6. **Validate merged dataset**  
   Check the shape, head, and label distribution of the final dataset.

7. **Export final merged dataset**  
   Save the labeled feature matrix to be used for machine learning model training.




### Loading Processed Datasets

We begin by loading the cleaned and preprocessed miRNA and methylation datasets using `pandas`. Both files are read without an index column so that the raw layout can be inspected first. Printing the shape, the first column names, and the first few rows shows how the data is oriented: here the features are in rows (named in the `attrib_name` column) and the TCGA samples are in columns.


In [4]:
# Load without setting index
import pandas as pd
mirna_df = pd.read_csv(r"C:/Users/sanja/cfDNA_LungCancer_ML/data/processed/miRNA_TCGA_LUAD.csv")
meth_df = pd.read_csv(r"C:/Users/sanja/cfDNA_LungCancer_ML/data/processed/methylation_TCGA_LUAD.csv")

# Preview structure
print("miRNA shape:", mirna_df.shape)
print("miRNA columns:", mirna_df.columns[:5].tolist())
print("miRNA first 5 rows:")
print(mirna_df.head())


miRNA shape: (809, 451)
miRNA columns: ['attrib_name', 'TCGA.05.4384', 'TCGA.05.4390', 'TCGA.05.4396', 'TCGA.05.4405']
miRNA first 5 rows:
    attrib_name  TCGA.05.4384  TCGA.05.4390  TCGA.05.4396  TCGA.05.4405  \
0  hsa-let-7a-1       13.8766       11.7425       14.0194       12.9428   
1  hsa-let-7a-2       14.8745       12.7576       15.0255       13.9327   
2  hsa-let-7a-3       13.8822       11.7578       14.0367       12.9499   
3    hsa-let-7b       13.8259       13.0601       14.5902       14.2170   
4    hsa-let-7c       10.6177        7.6080       11.1171       11.1093   

   TCGA.05.4410  TCGA.05.4415  TCGA.05.4417  TCGA.05.4424  TCGA.05.4425  ...  \
0       12.7150       13.0099       12.1510       12.9538       13.7344  ...   
1       13.7157       14.0169       13.1524       13.9443       14.7439  ...   
2       12.7252       13.0417       12.1721       12.9644       13.7445  ...   
3       13.7465       12.6094       13.1777       14.0479       14.5261  ...   
4       10

###  Setting Sample IDs as Index

We set the first column of each dataset (`attrib_name` in the miRNA file) as the index. In these files that column holds the feature names (miRNA IDs and CpG probe IDs), while the TCGA sample IDs are the column headers, so the samples only become the row index after the transpose in the next step. We then preview the first few index values to confirm the structure.


In [5]:
mirna_df.set_index(mirna_df.columns[0], inplace=True)
meth_df.set_index(meth_df.columns[0], inplace=True)
# Before transposing, the index holds feature names, not sample IDs
print("miRNA feature IDs:", mirna_df.index[:5].tolist())
print("Methylation feature IDs:", meth_df.index[:5].tolist())


miRNA feature IDs: ['hsa-let-7a-1', 'hsa-let-7a-2', 'hsa-let-7a-3', 'hsa-let-7b', 'hsa-let-7c']
Methylation feature IDs: ['RBL2_cg00000029', 'VDAC3_cg00000236', 'ACTN1_cg00000289', 'ATP2A1_cg00000292', 'SFRP1_cg00000321']
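
Since the index at this point holds feature names rather than sample IDs, it can help to check the orientation explicitly before transposing. The following is a minimal sketch on a toy frame; the helper `samples_in_columns` and the `"TCGA"` prefix test are assumptions introduced here for illustration, not part of the original notebook.

```python
import pandas as pd


def samples_in_columns(df: pd.DataFrame, prefix: str = "TCGA") -> bool:
    """Return True if most column labels start with `prefix` (hypothetical helper)."""
    labels = pd.Series(df.columns.astype(str))
    return bool(labels.str.startswith(prefix).mean() > 0.5)


# Toy frame mimicking the layout above: features in rows, samples in columns
toy = pd.DataFrame(
    {"TCGA.05.4384": [13.8766, 14.8745], "TCGA.05.4390": [11.7425, 12.7576]},
    index=pd.Index(["hsa-let-7a-1", "hsa-let-7a-2"], name="attrib_name"),
)

if samples_in_columns(toy):
    toy = toy.T  # after transposing, each row is one sample

print(toy.index.tolist())
```

A check like this makes the transpose conditional on the labels themselves rather than on the relative number of rows and columns.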


### Transposing and Cleaning miRNA Data

To prepare the miRNA data for machine learning, we transpose the DataFrame so that each row corresponds to a patient sample and each column to a miRNA feature. We then rename the columns to positional labels (`miRNA_1`, `miRNA_2`, ...), replace dots and underscores in the sample IDs with hyphens, and truncate each ID to its first 12 characters (e.g., `TCGA-05-4384`). This gives the sample IDs a consistent format before merging with other datasets.


In [6]:

# Transpose so samples become rows
mirna_df = mirna_df.transpose()

# Fix column names after transpose
mirna_df.columns = [f"miRNA_{i+1}" for i in range(mirna_df.shape[1])]

# Clean the index (sample IDs)
mirna_df.index = mirna_df.index.str.replace(r"[._]", "-", regex=True).str[:12]

# Confirm the shape and preview
print("miRNA shape after cleaning:", mirna_df.shape)
print(mirna_df.index[:5])
mirna_df.head()


miRNA shape after cleaning: (450, 809)
Index(['TCGA-05-4384', 'TCGA-05-4390', 'TCGA-05-4396', 'TCGA-05-4405',
       'TCGA-05-4410'],
      dtype='object')


Unnamed: 0,miRNA_1,miRNA_2,miRNA_3,miRNA_4,miRNA_5,miRNA_6,miRNA_7,miRNA_8,miRNA_9,miRNA_10,...,miRNA_800,miRNA_801,miRNA_802,miRNA_803,miRNA_804,miRNA_805,miRNA_806,miRNA_807,miRNA_808,miRNA_809
TCGA-05-4384,13.8766,14.8745,13.8822,13.8259,10.6177,8.7119,10.8698,5.3122,15.1357,10.3183,...,1.0159,0.0,2.5,0.0,2.4707,2.0238,3.3015,6.1628,8.3201,14.7985
TCGA-05-4390,11.7425,12.7576,11.7578,13.0601,7.608,8.6168,10.4833,3.4069,12.4367,9.3119,...,2.6638,0.0,2.5369,0.617,0.4392,4.1762,4.5819,6.9962,5.0913,16.1543
TCGA-05-4396,14.0194,15.0255,14.0367,14.5902,11.1171,9.8454,11.4738,4.3995,14.3723,9.7934,...,1.2395,0.1437,2.4873,0.1437,0.6074,2.1891,3.8691,6.6731,8.0122,13.9981
TCGA-05-4405,12.9428,13.9327,12.9499,14.217,11.1093,8.4836,10.3909,3.1985,12.5092,8.4956,...,1.8467,0.0,2.1296,0.0,2.0413,1.1168,4.9047,5.3844,8.5943,14.2003
TCGA-05-4410,12.715,13.7157,12.7252,13.7465,10.3613,8.736,10.0696,3.9421,13.0051,9.0249,...,1.2397,0.0,3.0982,0.0,1.3293,0.6827,4.1774,5.0606,8.3316,13.8615
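
Truncating IDs to 12 characters can map several longer IDs onto the same patient-level ID, which would silently create duplicate index entries. The sketch below shows one way to detect and drop such collisions; the full-length IDs are hypothetical and chosen only to trigger the case, and keeping the first occurrence is an assumption, not the notebook's rule.

```python
import pandas as pd

# Hypothetical full-length IDs: the first two belong to the same patient
raw_ids = pd.Index(["TCGA.05.4384.01", "TCGA.05.4384.11", "TCGA.05.4390.01"])
toy = pd.DataFrame({"miRNA_1": [13.9, 12.1, 11.7]}, index=raw_ids)

# Same cleaning as in the cell above: normalize separators, keep 12 characters
toy.index = toy.index.str.replace(r"[._]", "-", regex=True).str[:12]

# Flag every ID that occurs more than once after truncation
dup_mask = toy.index.duplicated(keep=False)
duplicated_ids = toy.index[dup_mask].unique().tolist()
print("IDs that collide after truncation:", duplicated_ids)

# One possible policy (an assumption): keep the first row per ID
deduped = toy[~toy.index.duplicated(keep="first")]
print("Rows before/after:", len(toy), len(deduped))
```

If the list of colliding IDs is empty, the truncation is safe for that dataset and no rows need to be dropped.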



###  Transposing and Standardizing Methylation Data

To make the methylation dataset compatible with the miRNA data, we first transpose it so that each row corresponds to a patient sample. We then replace the probe names with positional column labels (`CpG_1`, `CpG_2`, ...); this renaming is optional and only for uniformity. Finally, we apply the same ID cleaning as for the miRNA data (hyphens as separators, first 12 characters) so that sample IDs match exactly across both datasets, which is required for the merge in later steps.


In [7]:
#  Step 1: Transpose so that each row = one sample
meth_df = meth_df.transpose()

#  Step 2: Rename columns to generic CpG labels (optional, for uniformity)
meth_df.columns = [f"CpG_{i+1}" for i in range(meth_df.shape[1])]

#  Step 3: Clean sample IDs in index to match miRNA format (e.g., TCGA-XX-YYYY)
meth_df.index = meth_df.index.str.replace(r"[._]", "-", regex=True).str[:12]

#  Step 4: Confirm structure
print("Methylation shape after cleaning:", meth_df.shape)
print("First few sample IDs:", meth_df.index[:5].tolist())
meth_df.head()


Methylation shape after cleaning: (458, 336284)
First few sample IDs: ['TCGA-05-4384', 'TCGA-05-4390', 'TCGA-05-4396', 'TCGA-05-4405', 'TCGA-05-4410']


Unnamed: 0,CpG_1,CpG_2,CpG_3,CpG_4,CpG_5,CpG_6,CpG_7,CpG_8,CpG_9,CpG_10,...,CpG_336275,CpG_336276,CpG_336277,CpG_336278,CpG_336279,CpG_336280,CpG_336281,CpG_336282,CpG_336283,CpG_336284
TCGA-05-4384,-0.2285,0.3194,0.0812,0.1955,-0.0415,-0.4576,0.2563,-0.3566,0.3985,-0.3933,...,-0.43,-0.43,-0.4243,-0.4243,-0.429,-0.429,-0.4274,-0.4274,-0.3986,-0.3986
TCGA-05-4390,-0.2701,0.347,0.2593,0.2115,-0.2062,-0.474,0.3229,-0.3918,0.4188,-0.3992,...,-0.4764,-0.4764,-0.4494,-0.4494,-0.4346,-0.4346,-0.3812,-0.3812,-0.3715,-0.3715
TCGA-05-4396,-0.3215,0.3321,0.165,0.167,-0.3105,-0.4587,0.3424,-0.374,0.3966,-0.3959,...,-0.3776,-0.3776,-0.4638,-0.4638,-0.4229,-0.4229,-0.3487,-0.3487,-0.3959,-0.3959
TCGA-05-4405,-0.0782,0.3295,0.0539,0.1961,0.0822,-0.4673,0.3119,-0.3997,0.3794,-0.3845,...,-0.4186,-0.4186,-0.4591,-0.4591,-0.4339,-0.4339,-0.3769,-0.3769,-0.4054,-0.4054
TCGA-05-4410,-0.1437,0.3419,0.1259,0.1511,-0.0158,-0.4908,0.3126,-0.3271,0.4057,-0.4063,...,-0.4387,-0.4387,-0.458,-0.458,-0.4469,-0.4469,-0.391,-0.391,-0.4314,-0.4314
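
Renaming columns to positional labels such as `CpG_1` discards the original probe names, which are usually needed later to interpret selected features. A small lookup table built before renaming keeps that link. This is a sketch on a toy frame; the name `cpg_name_map` is introduced here for illustration.

```python
import pandas as pd

# Toy frame with two probe names taken from the index preview earlier
toy = pd.DataFrame(
    [[-0.2285, 0.3194]],
    index=["TCGA-05-4384"],
    columns=["RBL2_cg00000029", "VDAC3_cg00000236"],
)

# Build the generic labels and remember which original name each one replaces
generic = [f"CpG_{i+1}" for i in range(toy.shape[1])]
cpg_name_map = pd.Series(list(toy.columns), index=generic, name="original_name")

# Apply the generic labels, as in the cell above
toy.columns = generic

print(cpg_name_map["CpG_2"])
```

The same pattern applies to the miRNA columns; saving the map alongside the merged dataset lets a generic label be traced back to its probe or miRNA.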


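### Aligning Common Samples and Merging

With both datasets indexed by sample ID, we take the intersection of the two indexes, keep only those samples in each dataset, and concatenate the two feature matrices column-wise. The merged dataset should have one row per common sample and as many columns as both feature sets combined (809 + 336,284 = 337,093).


In [11]: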
#  Step 1: Find common sample IDs
common_ids = mirna_df.index.intersection(meth_df.index)
print(f"Number of common samples: {len(common_ids)}")

#  Step 2: Filter both datasets to retain only common samples
mirna_common = mirna_df.loc[common_ids]
meth_common = meth_df.loc[common_ids]

#  Step 3: Concatenate both datasets column-wise (side-by-side)
merged_df = pd.concat([mirna_common, meth_common], axis=1)

#  Step 4: Preview and confirm shape
print("Merged dataset shape:", merged_df.shape)
merged_df.head()


Number of common samples: 450
Merged dataset shape: (450, 337093)


Unnamed: 0,miRNA_1,miRNA_2,miRNA_3,miRNA_4,miRNA_5,miRNA_6,miRNA_7,miRNA_8,miRNA_9,miRNA_10,...,CpG_336275,CpG_336276,CpG_336277,CpG_336278,CpG_336279,CpG_336280,CpG_336281,CpG_336282,CpG_336283,CpG_336284
TCGA-05-4384,13.8766,14.8745,13.8822,13.8259,10.6177,8.7119,10.8698,5.3122,15.1357,10.3183,...,-0.43,-0.43,-0.4243,-0.4243,-0.429,-0.429,-0.4274,-0.4274,-0.3986,-0.3986
TCGA-05-4390,11.7425,12.7576,11.7578,13.0601,7.608,8.6168,10.4833,3.4069,12.4367,9.3119,...,-0.4764,-0.4764,-0.4494,-0.4494,-0.4346,-0.4346,-0.3812,-0.3812,-0.3715,-0.3715
TCGA-05-4396,14.0194,15.0255,14.0367,14.5902,11.1171,9.8454,11.4738,4.3995,14.3723,9.7934,...,-0.3776,-0.3776,-0.4638,-0.4638,-0.4229,-0.4229,-0.3487,-0.3487,-0.3959,-0.3959
TCGA-05-4405,12.9428,13.9327,12.9499,14.217,11.1093,8.4836,10.3909,3.1985,12.5092,8.4956,...,-0.4186,-0.4186,-0.4591,-0.4591,-0.4339,-0.4339,-0.3769,-0.3769,-0.4054,-0.4054
TCGA-05-4410,12.715,13.7157,12.7252,13.7465,10.3613,8.736,10.0696,3.9421,13.0051,9.0249,...,-0.4387,-0.4387,-0.458,-0.458,-0.4469,-0.4469,-0.391,-0.391,-0.4314,-0.4314
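
The intersect, subset, and concatenate steps above can also be expressed as a single inner join, followed by a check that no sample ended up with missing values. This is an alternative sketch on toy frames with made-up sample IDs, not the code used to produce the output above.

```python
import pandas as pd

# Toy inputs: the two frames share S1 and S3, in different row order
mirna = pd.DataFrame({"miRNA_1": [1.0, 2.0, 3.0]}, index=["S1", "S2", "S3"])
meth = pd.DataFrame({"CpG_1": [0.1, 0.3]}, index=["S3", "S1"])

# join="inner" keeps only index labels present in both frames
# and aligns the rows by label, not by position
merged = pd.concat([mirna, meth], axis=1, join="inner")

# Sanity checks: expected samples, expected width, no gaps from misalignment
assert set(merged.index) == {"S1", "S3"}
assert merged.shape[1] == mirna.shape[1] + meth.shape[1]
assert not merged.isna().any().any()

print(merged)
```

Because alignment is by label, the differing row order of the two inputs does not matter; the value for `S1` in `CpG_1` is taken from the `S1` row of the methylation frame.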


###  Adding Tumor and Normal Labels to Samples

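In this step, we hard-code the number of tumor (300) and normal (150) samples and assign binary labels: `1` for tumor and `0` for normal. The labels are assigned purely by row position, so the first 300 rows of the merged dataset are marked tumor and the remaining 150 normal; this is only correct if the rows are actually ordered that way. The labels are added as a new `Label` column, which enables supervised learning in later stages, and the labeled dataset is saved for reuse in downstream tasks.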


In [12]:
#  Step 1: Define how many samples are tumor and normal
num_tumor = 300
num_normal = 150

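#  Step 2: Create label list: 1 for tumor, 0 for normal
#  The counts must add up to the number of rows, and the rows must already be
#  ordered tumor-first for these positional labels to be correct
assert num_tumor + num_normal == len(merged_df), "label counts do not match sample count"
labels = [1] * num_tumor + [0] * num_normal

# Step 3: Add label column to merged_df
merged_df['Label'] = labels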

#  Step 4: Confirm label distribution
print("Label distribution:\n", merged_df['Label'].value_counts())

#  Optional: Save labeled dataset
merged_df.to_csv(r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\merged_labeled_dataset.csv")

print(" Labeling completed and saved.")


Label distribution:
 Label
1    300
0    150
Name: count, dtype: int64
 Labeling completed and saved.
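
Positional labels depend entirely on row order. When full-length TCGA barcodes are available, the sample type can instead be read from the barcode itself: the two-digit code in the fourth field is 01 to 09 for tumor samples and 10 to 19 for normal samples. The sketch below assumes such barcodes were kept before truncation to 12 characters; the sample IDs shown in this notebook are already patient-level, so the barcodes and the helper `label_from_barcode` here are hypothetical.

```python
import pandas as pd


def label_from_barcode(barcode: str) -> int:
    """Map a full TCGA barcode to 1 (tumor) or 0 (normal). Hypothetical helper."""
    fields = barcode.replace(".", "-").replace("_", "-").split("-")
    if len(fields) < 4:
        raise ValueError(f"No sample-type field in barcode: {barcode!r}")
    code = int(fields[3][:2])  # e.g. "01A" -> 1, "11A" -> 11
    if 1 <= code <= 9:
        return 1  # tumor sample types
    if 10 <= code <= 19:
        return 0  # normal sample types
    raise ValueError(f"Unsupported sample-type code {code} in {barcode!r}")


# Hypothetical full-length barcodes, for illustration only
barcodes = pd.Index(["TCGA.05.4384.01A", "TCGA-44-2655-11A", "TCGA-05-4390-01A"])
labels = pd.Series([label_from_barcode(b) for b in barcodes], index=barcodes, name="Label")

print(labels.value_counts().to_dict())
```

Labels derived this way stay attached to their sample IDs, so reordering or filtering rows cannot shift them onto the wrong sample.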


###  Creating a Lightweight Merged Dataset

To improve notebook performance and reduce memory usage, we create a smaller version of our dataset by keeping the first 500 features (by column order, not by any ranking) from each of the miRNA and methylation datasets. We then align the samples based on common IDs and merge them horizontally. A dummy label column is added (0 for the first half, 1 for the second half) for testing purposes only; it does not reflect the true sample classes. The final lightweight dataset is saved for quick development and experimentation.


In [8]:
import pandas as pd

# Use your actual file paths
mirna_path = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\miRNA_TCGA_LUAD.csv"
meth_path  = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\methylation_TCGA_LUAD.csv"

#  Load datasets
mirna_df = pd.read_csv(mirna_path, index_col=0)
meth_df  = pd.read_csv(meth_path, index_col=0)

#  Transpose if samples are columns
if mirna_df.shape[0] > mirna_df.shape[1]:
    mirna_df = mirna_df.T
if meth_df.shape[0] > meth_df.shape[1]:
    meth_df = meth_df.T

#  Keep the first 500 feature columns of each dataset (column order, not a ranking)
mirna_small = mirna_df.iloc[:, :500]
meth_small  = meth_df.iloc[:, :500]

#  Align by common sample IDs
common_samples = mirna_small.index.intersection(meth_small.index)
mirna_small = mirna_small.loc[common_samples]
meth_small  = meth_small.loc[common_samples]

#  Merge features side-by-side
merged_df = pd.concat([mirna_small, meth_small], axis=1)

#  Add dummy binary labels (adjust as per real info if available)
merged_df["Label"] = [0 if i < len(merged_df)//2 else 1 for i in range(len(merged_df))]

#  Save merged dataset
output_path = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\merged_labeled_light.csv"
merged_df.to_csv(output_path)

print(" Lightweight merged dataset created and saved.")
print(" Final shape:", merged_df.shape)


 Lightweight merged dataset created and saved.
 Final shape: (450, 1001)
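
The cell above keeps the first 500 columns of each dataset, which depends only on column order. If a ranked subset is wanted instead, one common unsupervised option is to keep the features with the highest variance across samples. This is a swapped-in technique shown as a sketch on toy data; the helper `top_variance_features` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def top_variance_features(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the k columns with the largest variance across rows (hypothetical helper)."""
    variances = df.var(axis=0)
    keep = variances.nlargest(k).index
    return df.loc[:, keep]


# Toy frame: "a" is constant, "b" varies strongly, "c" varies slightly
toy = pd.DataFrame(
    {"a": [1.0, 1.0, 1.0], "b": [0.0, 5.0, 10.0], "c": [0.1, 0.2, 0.3]},
    index=["S1", "S2", "S3"],
)

small = top_variance_features(toy, k=2)
print(small.columns.tolist())
```

For the lightweight dataset, the same helper could be applied to each transposed dataset with `k=500` in place of `iloc[:, :500]`.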


###  Final Check and Save

We now preview the first few rows of the merged and labeled lightweight dataset to confirm that the integration succeeded. The dataset is then written to the processed data directory in CSV format, using the same path as in the previous cell, so this save overwrites that file with identical content. The structured data is then available for model training and analysis in the upcoming steps.


In [9]:
# Final check: Preview the merged labeled dataset
print(" Final merged labeled dataset:")
print(merged_df.head())
print(" Shape:", merged_df.shape)

# Save to CSV
output_path = r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\merged_labeled_light.csv"
merged_df.to_csv(output_path)

print(f" Dataset saved successfully at:\n{output_path}")



 Final merged labeled dataset:
attrib_name   hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  \
TCGA.05.4384       13.8766       14.8745       13.8822     13.8259   
TCGA.05.4390       11.7425       12.7576       11.7578     13.0601   
TCGA.05.4396       14.0194       15.0255       14.0367     14.5902   
TCGA.05.4405       12.9428       13.9327       12.9499     14.2170   
TCGA.05.4410       12.7150       13.7157       12.7252     13.7465   

attrib_name   hsa-let-7c  hsa-let-7d  hsa-let-7e  hsa-let-7f-1  hsa-let-7f-2  \
TCGA.05.4384     10.6177      8.7119     10.8698        5.3122       15.1357   
TCGA.05.4390      7.6080      8.6168     10.4833        3.4069       12.4367   
TCGA.05.4396     11.1171      9.8454     11.4738        4.3995       14.3723   
TCGA.05.4405     11.1093      8.4836     10.3909        3.1985       12.5092   
TCGA.05.4410     10.3613      8.7360     10.0696        3.9421       13.0051   

attrib_name   hsa-let-7g  ...  NUP107_cg00036115  A2BP1_cg00036119 