# Project Overview:

---


Recent advances in cancer research have highlighted the role of microRNAs (miRNAs) as key regulators of cancer cell behavior, including epithelial-mesenchymal transition (EMT)—a process strongly associated with metastasis. As sequencing technologies become more affordable and accessible, miRNA expression profiling is emerging as a promising tool for clinical decision-making.

This project uses machine learning to distinguish primary colorectal tumors that have initiated metastasis from those that have not, based solely on the expression levels of selected miRNAs known to be associated with tumor progression.

By focusing on biologically meaningful features, the project aims to contribute toward non-invasive, expression-based classifiers that can support early detection of metastatic risk in colorectal cancer, enabling more precise and timely interventions.


---



# Building the necessary cohorts on the GDC web page:
The first step is to apply the filters on the NIH Genomic Data Commons (GDC) data portal to download the miRNA transcriptome files for colorectal cancer patients.

To download all the files related to colorectal cancer:

Primary Site: colon, rectosigmoid junction, rectum

Tissue or Organ of Origin: appendix; ascending colon; cecum; colon, nos; descending colon; hepatic flexure of colon; overlapping lesion of colon; rectosigmoid junction; rectum, nos; sigmoid colon; splenic flexure of colon; transverse colon

To download files of pathological stages from I to II:
Ajcc Pathologic Stage: stage i, stage ia, stage ii, stage iia, stage iib, stage iic.

To download files of pathological stages from III to IV:
Ajcc Pathologic Stage: stage iii, stage iiia, stage iiib, stage iiic, stage iv, stage iva, stage ivb, stage ivc.
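
The two stage filters above amount to a lookup from AJCC pathologic stage to cohort label. The following is a minimal sketch of that mapping; the names `STAGE_TO_COHORT` and `cohort_for_stage` are hypothetical and not part of the original workflow, which relies on the portal filters and separate download folders instead.

```python
# Stage values copied from the two filter lists above.
STAGES_I_II = ["stage i", "stage ia", "stage ii", "stage iia", "stage iib", "stage iic"]
STAGES_III_IV = ["stage iii", "stage iiia", "stage iiib", "stage iiic",
                 "stage iv", "stage iva", "stage ivb", "stage ivc"]

# Map every listed stage to the label later used as the target variable.
STAGE_TO_COHORT = {stage: "Stage I-II" for stage in STAGES_I_II}
STAGE_TO_COHORT.update({stage: "Stage III-IV" for stage in STAGES_III_IV})

def cohort_for_stage(ajcc_stage):
    """Hypothetical helper: return the cohort label, or None for unlisted stages."""
    return STAGE_TO_COHORT.get(ajcc_stage.strip().lower())

print(cohort_for_stage("Stage IIA"))
print(cohort_for_stage("stage ivb"))
print(cohort_for_stage("stage 0"))
```

Returning `None` for unlisted stages makes it easy to spot cases that belong to neither cohort.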

To download miRNA transcriptomic files:

Experimental Strategy: miRNA seq

Data Category: Transcriptome profiling

Data Type: Isoform expression quantification

Tissue type: Tumor

Tissue descriptor: Primary

---






# Data collection from miRNA transcriptomic files:
The second step involves extracting the counts per million (CPM) of the miRNAs related to metastatic cancer progression from the downloaded files; the values are read from the `reads_per_million_miRNA_mapped` column of each file. To do so, we need a list of the miRNAs with their correct nomenclature.

miRNA list: hsa-mir-29a, hsa-mir-125b-1, hsa-mir-125b-2, hsa-mir-145, hsa-mir-149, hsa-mir-607-5p, hsa-mir-1246, hsa-mir-4488, hsa-mir-6777-5p, hsa-mir-492, hsa-mir-200a, hsa-mir-338, hsa-mir-29c, hsa-mir-101, hsa-mir-148a, hsa-mir-92a, hsa-mir-424, hsa-mir-210.
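
Because each file reports isoform-level rows, one miRNA can appear on several lines, and its CPM is the sum over those lines. The following is a minimal sketch on an invented in-memory table. The columns `miRNA_ID` and `reads_per_million_miRNA_mapped` match those used in this notebook; the other columns and all values are assumptions made up for illustration.

```python
import io
import pandas as pd

# Toy isoform table (values invented for illustration, not real GDC data).
toy_file = io.StringIO(
    "miRNA_ID\tisoform_coords\tread_count\treads_per_million_miRNA_mapped\n"
    "hsa-mir-29a\tchr7:1-22\t10\t100.0\n"
    "hsa-mir-29a\tchr7:2-23\t5\t50.0\n"
    "hsa-mir-145\tchr5:1-22\t3\t30.0\n"
    "hsa-let-7a-1\tchr9:1-22\t8\t80.0\n"
)
toy = pd.read_csv(toy_file, sep="\t")

wanted = ["hsa-mir-29a", "hsa-mir-145", "hsa-mir-210"]

# Sum the per-isoform values for each miRNA, then fill absent miRNAs with 0.
cpm = (
    toy[toy["miRNA_ID"].isin(wanted)]
    .groupby("miRNA_ID")["reads_per_million_miRNA_mapped"]
    .sum()
    .reindex(wanted, fill_value=0.0)
)
print(cpm.to_dict())
```

Here `hsa-mir-29a` sums to 150.0 from its two isoform rows, while `hsa-mir-210`, which is absent from the toy table, is filled with 0.0.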

The code will extract the files from a drive folder built by the user:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
!pip install numpy==2.2.0
!pip install pandas==2.2.3
!pip install scikit-learn==1.6.0
!pip install matplotlib==3.9.3
!pip install seaborn==0.13.2

Collecting numpy==2.2.0
  Downloading numpy-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Downloading numpy-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.0 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.0 wh

# Collecting miRNA CPM from stage I-II cohort files:

In [None]:
# Step 1: Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

import os
import pandas as pd

# Step 2: Define the top-level Drive folder
base_folder = '/content/drive/MyDrive/.../.../'

# Step 3: List of miRNAs to extract
miRNA_list = [
    'hsa-mir-29a',
    'hsa-mir-125b-1', 'hsa-mir-125b-2',
    'hsa-mir-145', 'hsa-mir-149', 'hsa-mir-607-5p', 'hsa-mir-1246',
    'hsa-mir-4488', 'hsa-mir-6777-5p', 'hsa-mir-492', 'hsa-mir-200a',
    'hsa-mir-338', 'hsa-mir-29c', 'hsa-mir-101', 'hsa-mir-148a',
    'hsa-mir-92a', 'hsa-mir-424', 'hsa-mir-210'
]

# Step 4: Function to extract CPMs from one .txt file
def extract_miRNA_cpm(file_path, miRNAs):
    """
    Reads a .txt file and extracts 'reads_per_million_miRNA_mapped' for the specified miRNAs.
    Returns a dict with all miRNAs, filling missing ones with 0.
    """
    try:
        df = pd.read_csv(file_path, sep='\t')

        # Ensure required columns are present
        if 'miRNA_ID' not in df.columns or 'reads_per_million_miRNA_mapped' not in df.columns:
            raise ValueError("Missing required columns")

        df_filtered = df[df['miRNA_ID'].isin(miRNAs)]
        grouped = df_filtered.groupby('miRNA_ID')['reads_per_million_miRNA_mapped'].sum()
        return {miRNA: float(grouped.get(miRNA, 0.0)) for miRNA in miRNAs}

    except Exception as e:
        print(f"Skipping file {file_path}: {e}")
        return None  # Skip invalid files

# Step 5: Recursively find valid .txt files
txt_file_paths = []
for root, _, files in os.walk(base_folder):
    for f in files:
        if f.endswith('.txt') and 'annotation' not in f.lower():
            txt_file_paths.append(os.path.join(root, f))

# Step 6: Process each file
data_rows = []
file_ids = []

for file_path in txt_file_paths:
    result = extract_miRNA_cpm(file_path, miRNA_list)
    if result is not None:
        data_rows.append(result)
        file_ids.append(os.path.basename(file_path))  # Change to file_path for full path as ID

# Step 7: Create final DataFrame
df_final = pd.DataFrame(data_rows, index=file_ids)
df_final.index.name = 'File_ID'

# Preview
df_final.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


File_ID,hsa-mir-29a,hsa-mir-125b-1,hsa-mir-125b-2,hsa-mir-145,hsa-mir-149,hsa-mir-607-5p,hsa-mir-1246,hsa-mir-4488,hsa-mir-6777-5p,hsa-mir-492,hsa-mir-200a,hsa-mir-338,hsa-mir-29c,hsa-mir-101,hsa-mir-148a,hsa-mir-92a,hsa-mir-424,hsa-mir-210
3cd62167-7962-44ea-8923-6e9c7fc97807.mirbase21.isoforms.quantification.txt,6433.474746,96.445011,100.828873,2455.025573,3.704676,0.0,0.0,0.0,0.0,0.061745,1066.081631,147.754741,772.671464,0.0,17596.706501,0.0,258.401,290.508173
ef4cd175-6f73-4360-b2c6-71b424d64f53.mirbase21.isoforms.quantification.txt,6301.933109,79.897088,79.762807,1111.710899,5.908356,0.0,0.268562,0.0,0.0,0.0,2083.501186,975.415869,323.885335,0.0,120874.62469,0.0,522.889509,1422.839555
98694eb1-1282-4426-8fb2-001ac8190323.mirbase21.isoforms.quantification.txt,5633.065554,87.078831,91.506569,2537.58534,1.967884,0.0,0.491971,0.0,0.0,0.491971,3693.224731,139.719706,453.105099,0.0,62440.932759,0.0,175.141602,719.753268
a6f1d4ee-b216-4b96-95a6-5705662254d7.mirbase21.isoforms.quantification.txt,17290.15572,151.468017,158.525667,3408.301859,54.56106,0.0,0.0,0.0,0.0,0.0,3575.242417,388.713621,1074.934321,0.0,95548.36037,0.0,130.837964,69.490703
e3f4c57a-45e8-4dd6-96b1-e12ba2bdb415.mirbase21.isoforms.quantification.txt,9060.427026,85.35494,81.798484,1135.93201,4.267747,0.0,0.0,0.0,0.0,0.0,2634.622522,522.087724,1532.832487,0.0,180015.705306,0.0,103.848509,396.189185
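
Several columns in this preview (for example `hsa-mir-101` and `hsa-mir-92a`) are 0.0 in every row shown. Because the extraction fills unmatched IDs with 0, an all-zero column can mean either that the miRNA is not expressed or that its name never appears in the `miRNA_ID` column under that exact spelling. The following is a minimal sketch of a check for such columns; `all_zero_columns` is a hypothetical helper and the demo values are invented.

```python
import pandas as pd

def all_zero_columns(df):
    """Return the names of numeric columns whose values are all zero."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if (numeric[col] == 0).all()]

# Toy frame standing in for df_final (values invented for illustration).
demo = pd.DataFrame(
    {
        "hsa-mir-29a": [6433.5, 6301.9],
        "hsa-mir-101": [0.0, 0.0],
        "hsa-mir-492": [0.06, 0.0],
    },
    index=["file_a.txt", "file_b.txt"],
)
zero_cols = all_zero_columns(demo)
print(zero_cols)
```

Running `all_zero_columns(df_final)` on the full data frame would list the features worth comparing against the IDs actually present in the downloaded files.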


In [None]:
# Let's check whether the data frame has all the features and the correct number of rows
df_final.shape

(325, 18)

In [None]:
# Now we add the target variable column: one 'Stage I-II' label per row of df_final.
# The loop length is taken from df_final so it always matches the number of rows.
stages = []
for _ in range(len(df_final)):
  stages.append('Stage I-II')
# Then we transform the stages list into a data frame
df_list = pd.DataFrame(stages, columns=['Stages'])
df_list.head()

In [None]:
# Finally, we give the stages data frame the same index as the main data frame so that the rows align during concatenation
df_list.index = df_final.index
df_stages_I_II = pd.concat([df_final, df_list], axis=1)
df_stages_I_II['Stages'].head()
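
The label column can also be added in one step. The following is a minimal sketch using pandas `DataFrame.assign`, shown on an invented stand-in for `df_final`; it is an alternative to the loop-and-concatenate approach above, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for df_final (values invented for illustration).
df_demo = pd.DataFrame(
    {"hsa-mir-29a": [6433.5, 6301.9, 5633.1]},
    index=pd.Index(["a.txt", "b.txt", "c.txt"], name="File_ID"),
)

# Assigning a scalar broadcasts it to every row, so the number of labels
# always matches the number of rows and the index is preserved.
df_labeled = df_demo.assign(Stages="Stage I-II")
print(df_labeled.shape)
```

Since no separate list is built, there is no row count to keep in sync with the data frame.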

In [None]:
# Download the stages I to II data frame
df_stages_I_II.to_csv('df_stages_I_II.csv')
from google.colab import files

files.download('df_stages_I_II.csv')


# Collecting miRNA CPM from stage III-IV cohort files:

In [None]:
# Step 1: Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

import os
import pandas as pd

# Step 2: Define the top-level Drive folder
base_folder = '/content/drive/MyDrive/.../.../'

# Step 3: List of miRNAs to extract
miRNA_list = [
    'hsa-mir-29a',
    'hsa-mir-125b-1', 'hsa-mir-125b-2',
    'hsa-mir-145', 'hsa-mir-149', 'hsa-mir-607-5p', 'hsa-mir-1246',
    'hsa-mir-4488', 'hsa-mir-6777-5p', 'hsa-mir-492', 'hsa-mir-200a',
    'hsa-mir-338', 'hsa-mir-29c', 'hsa-mir-101', 'hsa-mir-148a',
    'hsa-mir-92a', 'hsa-mir-424', 'hsa-mir-210'
]

# Step 4: Function to extract CPMs from one .txt file
def extract_miRNA_cpm(file_path, miRNAs):
    """
    Reads a .txt file and extracts 'reads_per_million_miRNA_mapped' for the specified miRNAs.
    Returns a dict with all miRNAs, filling missing ones with 0.
    """
    try:
        df = pd.read_csv(file_path, sep='\t')

        # Ensure required columns are present
        if 'miRNA_ID' not in df.columns or 'reads_per_million_miRNA_mapped' not in df.columns:
            raise ValueError("Missing required columns")

        df_filtered = df[df['miRNA_ID'].isin(miRNAs)]
        grouped = df_filtered.groupby('miRNA_ID')['reads_per_million_miRNA_mapped'].sum()
        return {miRNA: float(grouped.get(miRNA, 0.0)) for miRNA in miRNAs}

    except Exception as e:
        print(f"Skipping file {file_path}: {e}")
        return None  # Skip invalid files

# Step 5: Recursively find valid .txt files
txt_file_paths = []
for root, _, files in os.walk(base_folder):
    for f in files:
        if f.endswith('.txt') and 'annotation' not in f.lower():
            txt_file_paths.append(os.path.join(root, f))

# Step 6: Process each file
data_rows = []
file_ids = []

for file_path in txt_file_paths:
    result = extract_miRNA_cpm(file_path, miRNA_list)
    if result is not None:
        data_rows.append(result)
        file_ids.append(os.path.basename(file_path))  # Change to file_path for full path as ID

# Step 7: Create final DataFrame
df_final2 = pd.DataFrame(data_rows, index=file_ids)
df_final2.index.name = 'File_ID'

# Preview
df_final2.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


File_ID,hsa-mir-29a,hsa-mir-125b-1,hsa-mir-125b-2,hsa-mir-145,hsa-mir-149,hsa-mir-607-5p,hsa-mir-1246,hsa-mir-4488,hsa-mir-6777-5p,hsa-mir-492,hsa-mir-200a,hsa-mir-338,hsa-mir-29c,hsa-mir-101,hsa-mir-148a,hsa-mir-92a,hsa-mir-424,hsa-mir-210
b2c7bf42-2861-4989-8527-4d463dc13580.mirnaseq.isoforms.quantification.txt,3566.804206,447.145348,511.023255,2484.332649,174.369421,0.0,0.0,0.0,0.0,0.0,1173.972343,53.519328,117.397235,0.0,9761.234744,0.0,96.680075,433.333908
cdd8a243-9f28-46f4-a8dd-94880a2f76c9.mirbase21.isoforms.quantification.txt,8039.412112,42.274099,45.581253,1480.456207,6.182947,0.0,0.287578,0.287578,0.0,0.0,3424.920978,562.21676,1310.209492,0.0,54509.143783,0.0,552.007708,1137.805941
90613574-2797-4a09-a42c-dab0092a293b.mirbase21.isoforms.quantification.txt,13455.454529,62.512068,58.04692,1645.853424,5.358177,0.0,0.178606,0.0,0.0,0.0,2437.256194,893.743954,951.076451,0.0,28920.225495,0.0,80.194051,417.937822
310f8d96-22f7-4a08-a79d-1e798033daac.mirbase21.isoforms.quantification.txt,7196.164299,265.098049,273.097069,5921.276576,23.040663,0.0,0.086946,0.0,0.0,0.0,1355.138788,144.764923,592.101575,0.0,38185.508641,0.0,416.992534,634.705064
4a065215-4eee-46f1-a66f-ed2b2d90bb22.mirbase21.isoforms.quantification.txt,3062.318671,550.310967,574.589393,3916.109951,74.453837,0.0,0.0,0.0,0.0,0.0,1991.64013,730.78059,642.568979,0.0,29279.780523,0.0,145.67055,1260.050257


In [None]:
# Let's check whether the data frame has all the features and the correct number of rows
df_final2.shape

(410, 18)
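
The stage III-IV cell repeats the stage I-II steps with a different folder and output name. The following is a minimal sketch of how both could share one function; `build_cohort_frame` is a hypothetical helper, and the demo writes a single invented file to a temporary folder instead of reading from Google Drive.

```python
import os
import tempfile
import pandas as pd

VALUE_COL = "reads_per_million_miRNA_mapped"

def build_cohort_frame(folder, mirnas, label):
    """Hypothetical helper: one labeled row of summed CPM values per .txt file."""
    rows, ids = [], []
    for root, _, filenames in os.walk(folder):
        for name in sorted(filenames):
            if not name.endswith(".txt") or "annotation" in name.lower():
                continue
            df = pd.read_csv(os.path.join(root, name), sep="\t")
            if not {"miRNA_ID", VALUE_COL}.issubset(df.columns):
                continue  # skip files without the expected columns
            sums = df[df["miRNA_ID"].isin(mirnas)].groupby("miRNA_ID")[VALUE_COL].sum()
            rows.append({m: float(sums.get(m, 0.0)) for m in mirnas})
            ids.append(name)
    frame = pd.DataFrame(rows, index=pd.Index(ids, name="File_ID"), columns=mirnas)
    return frame.assign(Stages=label)

# Demo on a temporary folder with one invented file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.isoforms.quantification.txt"), "w") as fh:
        fh.write(f"miRNA_ID\t{VALUE_COL}\nhsa-mir-29a\t100.0\nhsa-mir-29a\t50.0\n")
    demo_frame = build_cohort_frame(tmp, ["hsa-mir-29a", "hsa-mir-145"], "Stage III-IV")
print(demo_frame)
```

With such a helper, each cohort would need only one call with its own folder and label, which keeps the two extraction steps from drifting apart.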

In [None]:
# Now we add the target variable column: one 'Stage III-IV' label per row of df_final2.
# The loop length is taken from df_final2 so it always matches the number of rows.
list_stages_III_IV = []
for _ in range(len(df_final2)):
  list_stages_III_IV.append('Stage III-IV')
# Then we transform the stages list into a data frame
df_list_stages_III_IV = pd.DataFrame(list_stages_III_IV, columns=['Stages'])
df_list_stages_III_IV.head()

In [None]:
# Finally, we give the stages data frame the same index as the main data frame so that the rows align during concatenation
df_list_stages_III_IV.index = df_final2.index
df_stages_III_IV = pd.concat([df_final2, df_list_stages_III_IV], axis=1)
df_stages_III_IV.head()

File_ID,hsa-mir-29a,hsa-mir-125b-1,hsa-mir-125b-2,hsa-mir-145,hsa-mir-149,hsa-mir-607-5p,hsa-mir-1246,hsa-mir-4488,hsa-mir-6777-5p,hsa-mir-492,hsa-mir-200a,hsa-mir-338,hsa-mir-29c,hsa-mir-101,hsa-mir-148a,hsa-mir-92a,hsa-mir-424,hsa-mir-210,Stages
b2c7bf42-2861-4989-8527-4d463dc13580.mirnaseq.isoforms.quantification.txt,3566.804206,447.145348,511.023255,2484.332649,174.369421,0.0,0.0,0.0,0.0,0.0,1173.972343,53.519328,117.397235,0.0,9761.234744,0.0,96.680075,433.333908,Stage III-IV
cdd8a243-9f28-46f4-a8dd-94880a2f76c9.mirbase21.isoforms.quantification.txt,8039.412112,42.274099,45.581253,1480.456207,6.182947,0.0,0.287578,0.287578,0.0,0.0,3424.920978,562.21676,1310.209492,0.0,54509.143783,0.0,552.007708,1137.805941,Stage III-IV
90613574-2797-4a09-a42c-dab0092a293b.mirbase21.isoforms.quantification.txt,13455.454529,62.512068,58.04692,1645.853424,5.358177,0.0,0.178606,0.0,0.0,0.0,2437.256194,893.743954,951.076451,0.0,28920.225495,0.0,80.194051,417.937822,Stage III-IV
310f8d96-22f7-4a08-a79d-1e798033daac.mirbase21.isoforms.quantification.txt,7196.164299,265.098049,273.097069,5921.276576,23.040663,0.0,0.086946,0.0,0.0,0.0,1355.138788,144.764923,592.101575,0.0,38185.508641,0.0,416.992534,634.705064,Stage III-IV
4a065215-4eee-46f1-a66f-ed2b2d90bb22.mirbase21.isoforms.quantification.txt,3062.318671,550.310967,574.589393,3916.109951,74.453837,0.0,0.0,0.0,0.0,0.0,1991.64013,730.78059,642.568979,0.0,29279.780523,0.0,145.67055,1260.050257,Stage III-IV


In [None]:
# Download the stages III to IV data frame
from google.colab import files

df_stages_III_IV.to_csv('df_stages_III_IV.csv')
files.download('df_stages_III_IV.csv')
