**Table of contents**<a id='toc0_'></a>    
- [1. The project](#toc1_1_)    
  - [2. Datasets](#toc1_2_)    
    - [2.1 Data cleaning](#toc1_2_1_)    
      - [2.1.1 MRI features](#toc1_2_1_1_)    
      - [2.1.2 Target feature](#toc1_2_1_2_)    
    - [2.2 Join image features with target variable](#toc1_2_2_)    
  - [2.2 Split datasets into train and test sets](#toc1_3_)    
    - [2.1.1 Missing values](#toc1_3_1_)    
  - [3. Data split train and test sets](#toc1_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import sys
import jupyter_black
import warnings

jupyter_black.load()
from IPython.display import display

warnings.filterwarnings("ignore")

%load_ext autoreload
%autoreload 2

In [2]:
import datetime

import numpy as np
import pandas as pd
import plotly.express as px
import polars as pl
import polars.selectors as cs
import os

from modeling import data_preprocessing

In [3]:
os.chdir("..")

## <a id='toc1_1_'></a>[1. The project](#toc0_)

Saha and colleagues collected a large set of breast MRI cases that
were used to detect breast cancer. The dataset also contains detailed
information about the characteristics of the patients and the subtypes of
the tumors that were eventually diagnosed. This information can be used to
determine the best treatment.

Basic image processing was performed to extract features from the MRI
cases, and these features were used to predict the molecular tumor subtypes.
In this assignment, we will try to predict the estrogen receptor (ER) status
from the provided MRI image features.

## <a id='toc1_2_'></a>[2. Datasets](#toc0_)

### <a id='toc1_2_1_'></a>[2.1 Data cleaning](#toc0_)

According to the research paper, the features fall into ten categories. Grouped by the region they describe:

- Category 1 captures the properties of the breast as a whole.
- Categories 2, 4, 7, 8 and 10 quantify characteristics of the tumour.
- Categories 1, 3, 6 and 9 express characteristics of the fibroglandular tissue (FGT).
- Category 5 captures the properties of tumour and FGT enhancement simultaneously.

Grouped by the image processing they require:

- Categories 1 and 2 do not require voxel intensity values once the corresponding masks have been extracted.
- Categories 3, 4 and 5 use voxel intensity values to capture tumour/tissue enhancement, but do not explore the spatial relationship of the voxels.
- Categories 6, 7 and 8 exploit the spatial relationship of the voxels while quantifying enhancement.
- Categories 9 and 10 use variation in enhancement over time or in values, but do not use the spatial relationship.
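
The grouping above can be written down as two small lookup tables, which is handy later when selecting feature subsets by region or by processing type. This is only a sketch: the names `CATEGORY_REGION`, `CATEGORY_PROCESSING` and `categories_for` are illustrative and do not come from the paper or the dataset, and category 5 is listed under both tumour and FGT because it describes both at once.

```python
# Region(s) described by each feature category, as summarized in the text.
# All names here are illustrative, not part of the dataset or the paper.
CATEGORY_REGION = {
    1: {"breast", "FGT"},
    2: {"tumour"},
    3: {"FGT"},
    4: {"tumour"},
    5: {"tumour", "FGT"},  # tumour and FGT enhancement simultaneously
    6: {"FGT"},
    7: {"tumour"},
    8: {"tumour"},
    9: {"FGT"},
    10: {"tumour"},
}

# How each category uses the voxel intensity values.
CATEGORY_PROCESSING = {
    **dict.fromkeys((1, 2), "masks only, no voxel intensities"),
    **dict.fromkeys((3, 4, 5), "enhancement, spatial relationship not explored"),
    **dict.fromkeys((6, 7, 8), "enhancement, spatial relationship exploited"),
    **dict.fromkeys((9, 10), "variation in enhancement, no spatial relationship"),
}


def categories_for(region):
    """Return the sorted category numbers that describe the given region."""
    return sorted(c for c, regions in CATEGORY_REGION.items() if region in regions)


print(categories_for("tumour"))
print(categories_for("FGT"))
```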

#### <a id='toc1_2_1_1_'></a>[2.1.1 MRI features](#toc0_)
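
The loading cell in this section turns `Patient ID` into an integer by keeping the last three characters of the string and casting them. The plain-Python sketch below shows the same idea with an explicit check. It assumes IDs shaped like `Breast_MRI_001`, which is an assumption about the raw file and not something shown in this notebook, and `patient_number` is a hypothetical helper name.

```python
def patient_number(patient_id: str, width: int = 3) -> int:
    """Return the trailing `width` digits of a patient ID as an integer.

    Assumes IDs end in a fixed-width numeric suffix, e.g. "Breast_MRI_001".
    """
    suffix = patient_id[-width:]
    # Fail loudly instead of silently producing a wrong ID.
    if len(suffix) != width or not (suffix.isascii() and suffix.isdigit()):
        raise ValueError(f"unexpected patient ID format: {patient_id!r}")
    return int(suffix)


print(patient_number("Breast_MRI_001"))
print(patient_number("Breast_MRI_922"))
```

A fixed width of three is enough here because the dataset has 922 patients; a four-digit ID would be truncated by this approach.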

In [6]:
## load the MRI features dataset
data_raw_mri_features = pl.read_excel("dataset/raw/Imaging_Features.xlsx")

### convert Patient ID to an integer: keep the last three characters and cast them
data_raw_mri_features = data_raw_mri_features.with_columns(
    pl.col("Patient ID").str.slice(-3, 3).cast(pl.Int32)
)
print(data_raw_mri_features)

shape: (922, 530)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ Patient   ┆ F1_DT_POS ┆ F1_DT_POS ┆ F1_DT_POS ┆ … ┆ WashinRat ┆ WashinRat ┆ WashinRat ┆ WashinRa │
│ ID        ┆ TCON (T11 ┆ TCON (T11 ┆ TCON (T11 ┆   ┆ e_map_mea ┆ e_map_std ┆ e_map_ske ┆ te_map_k │
│ ---       ┆ =0.05,T12 ┆ =0.05,T12 ┆ =0.02,T12 ┆   ┆ n_tissue_ ┆ _dev_tiss ┆ wness_tis ┆ urtosis_ │
│ i32       ┆ =0.5)     ┆ =0.1)     ┆ =0.5)     ┆   ┆ PostC…    ┆ ue_Po…    ┆ sue_P…    ┆ tissue_P │
│           ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ …        │
│           ┆ f64       ┆ f64       ┆ f64       ┆   ┆ f64       ┆ f64       ┆ f64       ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 1         ┆ 1.0       ┆ 0.120721  ┆ 0.530395  ┆ … ┆ 14.517894 ┆ 20.3475

In [7]:
data_preprocessing.group_by_col_type(data_raw_mri_features)

shape: (3, 2)
┌─────────┬───────┐
│ Type    ┆ Count │
│ ---     ┆ ---   │
│ str     ┆ u32   │
╞═════════╪═══════╡
│ Int32   ┆ 1     │
│ Int64   ┆ 2     │
│ Float64 ┆ 527   │
└─────────┴───────┘


All 530 columns are numeric (1 `Int32`, 2 `Int64` and 527 `Float64`), so no type conversion is needed at this stage.

#### <a id='toc1_2_1_2_'></a>[2.1.2 Target feature](#toc0_)

In [12]:
### load the clinical features with pandas, reading every column as a string, as it handles some of the values more efficiently
data_raw_clinic_features = pd.read_excel(
    "dataset/raw/Clinical_and_Other_Features.xlsx",
    dtype=str,
)

### convert to polars dataframe
data_raw_clinic_features = pl.from_pandas(data_raw_clinic_features)

### convert Patient ID to an integer: keep the last three characters and cast them
data_raw_clinic_features = data_raw_clinic_features.with_columns(
    pl.col("Patient ID").str.slice(-3, 3).cast(pl.Int32)
)

Task 3 asks us to predict the ER status from the image features, so I select only two columns from this dataset: `Patient ID` and `ER`.

ER Status (Estrogen Receptor Status) refers to whether a breast cancer tumor has estrogen receptors (ER) on its cells. These receptors are proteins that bind to estrogen, a hormone that can promote the growth of some breast cancers.

Types of ER Status: 
<br>ER-Positive (ER+):
- The cancer cells have estrogen receptors and depend on estrogen to grow.
- They tend to grow more slowly than ER-negative tumors.

<br>ER-Negative (ER-):
- The cancer cells do not have estrogen receptors and do not rely on estrogen for growth.
- These tumors may be more aggressive and more likely to require chemotherapy.

Predicting ER status can help in identifying tumor subtypes based on imaging and clinical features.

In [13]:
### select "Patient ID" and "ER" columns
target_ER = data_raw_clinic_features.select(["Patient ID", "ER"])

### Convert columns containing only numeric strings to integers
target_ER = data_preprocessing.convert_str_to_int(target_ER)

In [14]:
print(target_ER)

shape: (922, 2)
┌────────────┬─────┐
│ Patient ID ┆ ER  │
│ ---        ┆ --- │
│ i64        ┆ i64 │
╞════════════╪═════╡
│ 1          ┆ 0   │
│ 2          ┆ 0   │
│ 3          ┆ 1   │
│ 4          ┆ 1   │
│ 5          ┆ 1   │
│ …          ┆ …   │
│ 918        ┆ 1   │
│ 919        ┆ 1   │
│ 920        ┆ 1   │
│ 921        ┆ 1   │
│ 922        ┆ 1   │
└────────────┴─────┘


### <a id='toc1_2_2_'></a>[2.2 Join image features with target variable](#toc0_)

In [15]:
### Join two DataFrames on a specified column.
total_features = data_preprocessing.join_datasets(
    target_ER, data_raw_mri_features, join_col="Patient ID", method="inner"
)

In [16]:
print(total_features)

shape: (922, 531)
┌────────────┬────────────┬────────────┬────────────┬───┬────────────┬───────────┬───────────┬─────┐
│ Patient ID ┆ F1_DT_POST ┆ F1_DT_POST ┆ F1_DT_POST ┆ … ┆ WashinRate ┆ WashinRat ┆ WashinRat ┆ ER  │
│ ---        ┆ CON (T11=0 ┆ CON (T11=0 ┆ CON (T11=0 ┆   ┆ _map_std_d ┆ e_map_ske ┆ e_map_kur ┆ --- │
│ i32        ┆ .05,T12=0. ┆ .05,T12=0. ┆ .02,T12=0. ┆   ┆ ev_tissue_ ┆ wness_tis ┆ tosis_tis ┆ i64 │
│            ┆ 5)         ┆ 1)         ┆ 5)         ┆   ┆ Po…        ┆ sue_P…    ┆ sue_P…    ┆     │
│            ┆ ---        ┆ ---        ┆ ---        ┆   ┆ ---        ┆ ---       ┆ ---       ┆     │
│            ┆ f64        ┆ f64        ┆ f64        ┆   ┆ f64        ┆ f64       ┆ f64       ┆     │
╞════════════╪════════════╪════════════╪════════════╪═══╪════════════╪═══════════╪═══════════╪═════╡
│ 1          ┆ 1.0        ┆ 0.120721   ┆ 0.530395   ┆ … ┆ 20.347506  ┆ 1.62587   ┆ 11.406955 ┆ 0   │
│ 2          ┆ 1.0        ┆ 0.129546   ┆ 0.485217   ┆ … ┆ 83.909561  ┆ 0.

In [None]:
### Save dataset as excel file
total_features.write_excel("dataset/intermediate/dataset_MRI_ER_target.xlsx")

<xlsxwriter.workbook.Workbook at 0x74c1a21965d0>