# Transformation for Liver Cancer Prediction
---

This script performs feature selection and transformation on a dataset related to liver cancer prediction.

The goal is to identify the most relevant variables using different feature selection techniques, such as Mutual Information, Pearson, and Spearman correlation, and apply optimal transformations to improve model performance.

The dataset originates from the PLCO study, and transformations are applied to both the training and test sets to maintain data consistency for machine learning models.

**Key steps in this script:**
- **Feature importance evaluation** using different selection methods.
- **Transforming selected features** to enhance model interpretability.
- **Ensuring consistency** between training and test datasets.

Author: Juan Armario  
Date: 2024

# Importing libraries
---

In [37]:
import pandas as pd
import numpy as np

## Others
import warnings
import sys
from collections import Counter

## Plot
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Custom functions
sys.path.append("../../0. Scripts")
import data_analysing_functions as daf
import model_metrics_functions as mmf
import model_transformation_functions as mtf

## Stats
import scipy.stats as stats
from scipy.stats import pearsonr, spearmanr

## Feature selection
from sklearn.feature_selection import mutual_info_classif

## Preprocessing
from sklearn.preprocessing import scale

## Selection Models
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, RandomizedSearchCV, cross_val_score, cross_validate

## Models
from sklearn.ensemble import RandomForestClassifier

## Metrics
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix, accuracy_score
from sklearn.metrics import classification_report, balanced_accuracy_score, make_scorer
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, roc_auc_score

# Loading Data
---

In [39]:
## Imputed dataset
imputed_train_df = pd.read_csv('../../0. Data/3. Imputed/mean_median_imputed_train_df.csv')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Setting to avoid scientific notation
pd.set_option('display.float_format', lambda x: '%.8f' % x)

In [40]:
target = imputed_train_df.liver_cancer
target.info()

<class 'pandas.core.series.Series'>
RangeIndex: 119495 entries, 0 to 119494
Series name: liver_cancer
Non-Null Count   Dtype
--------------   -----
119495 non-null  int64
dtypes: int64(1)
memory usage: 933.7 KB


# Getting most important variables
---
In this section, we identify the most relevant features for liver cancer prediction using different feature selection techniques. Selecting the right variables is crucial to improving model efficiency and interpretability while reducing noise and overfitting.

**Key aspects of this process:**
- **Evaluation of feature importance** using methods like Mutual Information, Pearson, and Spearman correlation.
- **Filtering out less relevant variables** to enhance model performance.
- **Comparison of selection methods** to ensure the best subset of predictors.

This step allows us to focus on the most informative features, optimizing the dataset for the subsequent modeling process.
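As an illustration of how the three criteria named above can be computed side by side, here is a minimal sketch on synthetic data. It is not the implementation of `mmf.calculate_feature_importance` (which is a custom helper whose code is not shown here); the function name `score_features` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_classif


def score_features(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.DataFrame:
    """Score every column of X against a binary target with three criteria (hypothetical helper)."""
    # Mutual information, treating all columns as continuous (an assumption of this sketch)
    mi = mutual_info_classif(X, y, discrete_features=False, random_state=random_state)
    rows = []
    for col, mi_value in zip(X.columns, mi):
        r, _ = pearsonr(X[col], y)      # linear association
        rho, _ = spearmanr(X[col], y)   # monotonic (rank) association
        rows.append({
            "Feature": col,
            "mutual_info": mi_value,
            "pearson": abs(r),
            "spearman": abs(rho),
        })
    # Rank by mutual information, highest first
    return pd.DataFrame(rows).sort_values("mutual_info", ascending=False).reset_index(drop=True)


# Toy data: one informative column and one pure-noise column
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.integers(0, 2, size=500), name="label")
X_toy = pd.DataFrame({
    "informative": y_toy + rng.normal(0, 0.3, size=500),
    "noise": rng.normal(0, 1, size=500),
})
scores = score_features(X_toy, y_toy)
print(scores)
```

Taking absolute values for the two correlations puts strongly negative and strongly positive associations on the same footing, which is usually what matters for ranking.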

In [42]:
# Get the most important variables according to the selected metric (default mutual_info)
importance_removing_below_mean_train_df = mmf.calculate_feature_importance(imputed_train_df, target=target, method='mutual_info', remove_below_mean=True)
importance_removing_below_mean_train_df

| | Feature | Importance |
|---|---|---|
| 0 | liver_exitstat | 0.03201947 |
| 1 | filtered_f | 0.01823899 |
| 2 | race7 | 0.01773162 |
| 3 | bmi_20c | 0.01729875 |
| 4 | in_TGWAS_population | 0.01515494 |
| 5 | sex | 0.01474745 |
| 6 | arm | 0.01410353 |
| 7 | mortality_exitstat | 0.01206001 |
| 8 | smokea_f | 0.0110061 |
| 9 | cig_stat | 0.01070765 |


In [43]:
importance_train_df = mmf.calculate_feature_importance(imputed_train_df, target=target, method='mutual_info', remove_below_mean=False)
importance_train_df

| | Feature | Importance |
|---|---|---|
| 0 | liver_exitstat | 0.0316122 |
| 1 | race7 | 0.01866012 |
| 2 | filtered_f | 0.01855865 |
| 3 | bmi_20c | 0.01738824 |
| 4 | in_TGWAS_population | 0.01549923 |
| 5 | arm | 0.01417874 |
| 6 | sex | 0.01413578 |
| 7 | mortality_exitstat | 0.01199807 |
| 8 | smokea_f | 0.01194858 |
| 9 | cig_stat | 0.0115006 |


# Transforming variables
---
Feature transformation is a crucial step in data preprocessing that helps improve model performance by enhancing relationships between variables and making data distributions more suitable for Machine Learning algorithms.

**Key aspects of this process:**
- **Applying mathematical transformations** such as logarithm, square root, exponentiation, and power transformations.
- **Identifying the best transformation for each variable** based on Mutual Information, Pearson, or Spearman correlation with the target variable.
- **Enhancing feature interpretability** and ensuring better data scaling for model training.

By optimizing variable transformations, we aim to improve model accuracy, stability, and robustness in predicting liver cancer.
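The idea of picking, for each variable, the transformation that relates most strongly to the target can be sketched as follows. This is not the code of `mtf.mejorTransf`; the helper name `best_transformation`, the use of absolute Pearson correlation as the criterion, and the continuous toy target are assumptions made for illustration. The `log` offset of `0.0001` is also an assumption, chosen because it reproduces values such as `-9.21034037` for zero inputs in the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

EPS = 1e-4  # assumed offset so that log(0) is defined: log(0 + 1e-4) = -9.21034037

# Candidate transformations, keyed by the labels that appear in this notebook's outputs
CANDIDATES = {
    "ident": lambda x: x,
    "log": lambda x: np.log(x + EPS),
    "exp": lambda x: np.exp(x),
    "sqrt": lambda x: np.sqrt(x),
    "sqr": lambda x: x ** 2,
    "cuarta": lambda x: x ** 4,
    "raiz4": lambda x: x ** 0.25,
}


def best_transformation(x: pd.Series, y: pd.Series) -> str:
    """Return the label of the candidate with the highest |Pearson r| against y (hypothetical helper)."""
    scores = {}
    for name, func in CANDIDATES.items():
        transformed = func(x.astype(float))
        # Skip candidates that produce NaN/inf or a constant column
        if not np.isfinite(transformed).all() or transformed.std() == 0:
            continue
        scores[name] = abs(np.corrcoef(transformed, y)[0, 1])
    return max(scores, key=scores.get)


# Toy example: the target depends on the square of x, so "sqr" should win
rng = np.random.default_rng(1)
x_toy = pd.Series(rng.uniform(0.5, 3.0, size=300))
y_toy = x_toy ** 2 + rng.normal(0, 0.05, size=300)
best = best_transformation(x_toy, y_toy)
print(best)
```

Note that all of these candidates are monotonic on non-negative inputs, so a rank-based criterion such as Spearman correlation would score them identically; Pearson correlation is used in the sketch because it can tell them apart.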

In [45]:
# Create a DataFrame that keeps only the variables listed in importance_removing_below_mean_train_df
df_copy1 = imputed_train_df.copy()

columns_to_remove_1 = importance_removing_below_mean_train_df['Feature'].tolist()
imputed_removing_below_mean_train_df = df_copy1[columns_to_remove_1]

# Create a DataFrame that keeps only the variables listed in importance_train_df
df_copy2 = imputed_train_df.copy()

columns_to_remove_2 = importance_train_df['Feature'].tolist()
imputed_removing_not_important_variables_train_df = df_copy2[columns_to_remove_2]

## Removing not important variables below the mean
---
In this step, we work with the dataset from which variables with importance below the average feature importance score were removed, and find the best transformation for each remaining variable.
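The below-mean filter can be illustrated with a small importance table. This is a minimal sketch, assuming the table has `Feature` and `Importance` columns like the outputs above; the helper name `keep_above_mean` and the choice to keep scores equal to the mean are assumptions, since the behavior of `remove_below_mean=True` in the custom helper is not shown.

```python
import pandas as pd


def keep_above_mean(importance_df: pd.DataFrame, score_col: str = "Importance"):
    """Keep rows whose score is at or above the mean score (hypothetical helper)."""
    threshold = importance_df[score_col].mean()
    kept = importance_df[importance_df[score_col] >= threshold]
    return kept.reset_index(drop=True), threshold


# Toy importance table with made-up feature names and scores
toy_importance = pd.DataFrame({
    "Feature": ["a", "b", "c", "d"],
    "Importance": [0.030, 0.018, 0.002, 0.001],
})
kept_df, threshold = keep_above_mean(toy_importance)

# The surviving names can then be used to subset the original DataFrame
selected_columns = kept_df["Feature"].tolist()
print(threshold, selected_columns)
```

Because the mean is pulled up by a few high scores, this rule tends to keep a minority of the variables when importances are skewed.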

In [47]:
# Finding the best transformations for the training set
transformations = mtf.mejorTransf(imputed_removing_below_mean_train_df, target, tipo='mutual_info')
transformations

{'liver_exitstat': 'sqrt',
 'filtered_f': 'log',
 'race7': 'log',
 'bmi_20c': 'raiz4',
 'in_TGWAS_population': 'ident',
 'sex': 'sqr',
 'arm': 'ident',
 'mortality_exitstat': 'log',
 'smokea_f': 'raiz4',
 'cig_stat': 'log',
 'ssmokea_f': 'ident',
 'bmi_50c': 'raiz4',
 'fh_cancer': 'ident',
 'bmi_curc': 'raiz4',
 'preg_f': 'log',
 'agelevel': 'log',
 'fmenstr': 'log'}

In [48]:
# Applying the best transformation
transformed_removing_below_mean_train_df = mtf.apply_best_transformations(imputed_removing_below_mean_train_df, transformations, target)

In [49]:
transformed_removing_below_mean_train_df.head()

Unnamed: 0,liver_exitstat_sqrt,filtered_f_log,race7_log,bmi_20c_raiz4,sex_sqr,mortality_exitstat_log,smokea_f_raiz4,cig_stat_log,bmi_50c_raiz4,bmi_curc_raiz4,preg_f_log,agelevel_log,fmenstr_log
0,2.82842712,0.0001,0.0001,1.0,1,0.69319718,2.05976714,-9.21034037,1.18920712,1.18920712,-9.21034037,0.0001,-9.21034037
1,2.44948974,0.0001,0.0001,1.18920712,4,0.0001,2.21336384,0.0001,1.31607401,1.41421356,0.0001,-9.21034037,0.69319718
2,2.82842712,0.0001,0.0001,1.18920712,4,1.09864562,2.05976714,-9.21034037,1.41421356,1.41421356,0.0001,-9.21034037,0.0001
3,2.23606798,0.0001,0.0001,1.18920712,1,0.0001,2.05976714,-9.21034037,1.31607401,1.18920712,-9.21034037,0.69319718,-9.21034037
4,2.82842712,0.0001,0.0001,1.18920712,4,0.0001,2.11474253,0.69319718,1.18920712,1.18920712,0.0001,1.09864562,0.69319718


## Removing not important variables
---
To further refine our dataset, we remove additional variables that do not contribute meaningfully to the predictive power of the model. By leveraging feature importance metrics such as Mutual Information, we discard features that show little to no statistical dependence on the target variable.

In [51]:
# Finding the best transformations for the training set
transformations2 = mtf.mejorTransf(imputed_removing_not_important_variables_train_df, target, tipo='mutual_info')
transformations2

{'liver_exitstat': 'sqrt',
 'race7': 'log',
 'filtered_f': 'log',
 'bmi_20c': 'raiz4',
 'in_TGWAS_population': 'ident',
 'arm': 'sqr',
 'sex': 'sqr',
 'mortality_exitstat': 'log',
 'smokea_f': 'raiz4',
 'cig_stat': 'exp',
 'bmi_50c': 'raiz4',
 'ssmokea_f': 'sqrt',
 'bmi_curc': 'raiz4',
 'fh_cancer': 'cuarta',
 'preg_f': 'log',
 'agelevel': 'log',
 'fmenstr': 'log',
 'menstrs': 'log',
 'sisters': 'log',
 'arthrit_f': 'log',
 'center': 'raiz4',
 'brothers': 'log',
 'hyperten_f': 'log',
 'horm_f': 'log',
 'rndyear': 'raiz4',
 'urinate_f': 'log',
 'bcontr_f': 'log',
 'height_f': 'sqrt',
 'asppd': 'log',
 'mortality_exitage': 'raiz4',
 'liver_exitage': 'log',
 'ibuppd': 'log',
 'hyster_f': 'log',
 'pipe': 'log',
 'cigar': 'log',
 'miscar': 'log',
 'vasect_f': 'log',
 'gallblad_f': 'log',
 'bbd': 'log',
 'enlpros_f': 'log',
 'tuballig': 'log',
 'benign_ovcyst': 'log',
 'uterine_fib': 'log',
 'ph_any_trial': 'log',
 'diabetes_f': 'log',
 'trypreg': 'log',
 'liver_comorbidity': 'log',
 'hearta

In [52]:
# Applying the best transformation
transformed_removing_not_important_variables_train_df = mtf.apply_best_transformations(imputed_removing_not_important_variables_train_df, transformations2, target)

In [53]:
transformed_removing_not_important_variables_train_df.head()

Unnamed: 0,liver_exitstat_sqrt,race7_log,filtered_f_log,bmi_20c_raiz4,arm_sqr,sex_sqr,mortality_exitstat_log,smokea_f_raiz4,cig_stat_exp,bmi_50c_raiz4,ssmokea_f_sqrt,bmi_curc_raiz4,fh_cancer_cuarta,preg_f_log,agelevel_log,fmenstr_log,menstrs_log,sisters_log,arthrit_f_log,center_raiz4,brothers_log,hyperten_f_log,horm_f_log,rndyear_raiz4,urinate_f_log,bcontr_f_log,height_f_sqrt,asppd_log,mortality_exitage_raiz4,liver_exitage_log,ibuppd_log,hyster_f_log,pipe_log,cigar_log,miscar_log,vasect_f_log,gallblad_f_log,bbd_log,enlpros_f_log,tuballig_log,benign_ovcyst_log,uterine_fib_log,ph_any_trial_log,diabetes_f_log,trypreg_log,liver_comorbidity_log,hearta_f_log
0,2.82842712,0.0001,0.0001,1.0,4,1,0.69319718,2.05976714,1.0,1.18920712,6.52143406,1.18920712,1.0,-9.21034037,0.0001,-9.21034037,-9.21034037,0.0001,-9.21034037,1.68179283,-9.21034037,-9.21034037,-9.21034037,6.68573057,0.0001,-9.21034037,8.30662386,1.79177614,3.01834948,4.30406644,1.79177614,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037
1,2.44948974,0.0001,0.0001,1.18920712,4,4,0.0001,2.21336384,2.71828183,1.31607401,6.52143406,1.41421356,0.0,0.0001,-9.21034037,0.69319718,0.0001,0.0001,-9.21034037,1.41421356,1.38631936,0.0001,0.0001,6.68573057,-9.21034037,0.0001,7.74596669,1.60945791,2.90278311,4.14313631,1.79177614,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037
2,2.82842712,0.0001,0.0001,1.18920712,1,4,1.09864562,2.05976714,1.0,1.41421356,6.52143406,1.41421356,0.0,0.0001,-9.21034037,0.0001,0.0001,0.69319718,0.0001,1.56508458,0.0001,0.0001,0.0001,6.68405684,-9.21034037,0.0001,8.0,1.60945791,2.91295063,4.23410795,0.69319718,-9.21034037,-9.21034037,-9.21034037,0.69319718,-9.21034037,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037
3,2.23606798,0.0001,0.0001,1.18920712,1,1,0.0001,2.05976714,1.0,1.31607401,6.52143406,1.18920712,1.0,-9.21034037,0.69319718,-9.21034037,-9.21034037,0.69319718,-9.21034037,1.73205081,0.0001,0.0001,-9.21034037,6.68405684,0.69319718,-9.21034037,7.93725393,-9.21034037,2.9813075,4.36944912,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037
4,2.82842712,0.0001,0.0001,1.18920712,1,4,0.0001,2.11474253,7.3890561,1.18920712,6.32455532,1.18920712,1.0,0.0001,1.09864562,0.69319718,0.69319718,-9.21034037,0.0001,1.56508458,0.69319718,-9.21034037,0.0001,6.68740305,-9.21034037,-9.21034037,8.0,-9.21034037,3.07147866,4.38202788,-9.21034037,0.0001,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037,-9.21034037


# Transformation comparison
---
After applying different transformations to the selected variables, it is essential to evaluate their impact on model performance. In this step, we compare the effectiveness of the transformations by training and testing Random Forest classifiers on both transformed datasets.

**Key objectives of this comparison:**
- Assess how different transformations affect model accuracy and predictive power.
- Compare evaluation metrics such as Accuracy, F1-Score, Recall, and AUC-ROC to determine which transformation yields better results.
- Identify whether reducing the number of variables (while applying transformations) leads to an optimized and efficient model.

By conducting this analysis, we ensure that the best possible transformation approach is selected, leading to a well-performing and generalizable model for liver cancer detection.
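For a target as imbalanced as this one, two details of the comparison are worth illustrating: a stratified split keeps the share of positive cases the same in both partitions, and AUC-ROC is normally computed from predicted probabilities rather than from hard class labels. The sketch below shows both on synthetic data generated with `make_classification`; the dataset and all variable names are illustrative and not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), standing in for the real dataset
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)

# stratify=y_toy preserves the class ratio in the train and test partitions
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for F1 and recall
y_score = model.predict_proba(X_test)[:, 1]  # positive-class probabilities for AUC-ROC

metrics = {
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "auc_roc": roc_auc_score(y_test, y_score),
}
print(metrics)
```

With very few positive cases, accuracy alone says little, so recall and F1 on the positive class are the more informative numbers to compare between models.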

In [55]:
y = target

X1_train, X1_test, y1_train, y1_test = train_test_split(transformed_removing_below_mean_train_df, y, test_size=0.2, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(transformed_removing_not_important_variables_train_df, y, test_size=0.2, random_state=42)

In [69]:
daf.nulls_percentage(transformed_removing_below_mean_train_df)

liver_exitstat_sqrt , 0.0% nulls , 7 unique values, float64
filtered_f_log , 0.0% nulls , 3 unique values, float64
race7_log , 0.0% nulls , 7 unique values, float64
bmi_20c_raiz4 , 0.0% nulls , 4 unique values, float64
sex_sqr , 0.0% nulls , 2 unique values, int64
mortality_exitstat_log , 0.0% nulls , 4 unique values, float64
smokea_f_raiz4 , 0.0% nulls , 63 unique values, float64
cig_stat_log , 0.0% nulls , 3 unique values, float64
bmi_50c_raiz4 , 0.0% nulls , 4 unique values, float64
bmi_curc_raiz4 , 0.0% nulls , 4 unique values, float64
preg_f_log , 0.0% nulls , 3 unique values, float64
agelevel_log , 0.0% nulls , 4 unique values, float64
fmenstr_log , 0.0% nulls , 6 unique values, float64


In [71]:
daf.nulls_percentage(transformed_removing_not_important_variables_train_df)

liver_exitstat_sqrt , 0.0% nulls , 7 unique values, float64
race7_log , 0.0% nulls , 7 unique values, float64
filtered_f_log , 0.0% nulls , 3 unique values, float64
bmi_20c_raiz4 , 0.0% nulls , 4 unique values, float64
arm_sqr , 0.0% nulls , 2 unique values, int64
sex_sqr , 0.0% nulls , 2 unique values, int64
mortality_exitstat_log , 0.0% nulls , 4 unique values, float64
smokea_f_raiz4 , 0.0% nulls , 63 unique values, float64
cig_stat_exp , 0.0% nulls , 3 unique values, float64
bmi_50c_raiz4 , 0.0% nulls , 4 unique values, float64
ssmokea_f_sqrt , 0.0% nulls , 68 unique values, float64
bmi_curc_raiz4 , 0.0% nulls , 4 unique values, float64
fh_cancer_cuarta , 0.0% nulls , 2 unique values, float64
preg_f_log , 0.0% nulls , 3 unique values, float64
agelevel_log , 0.0% nulls , 4 unique values, float64
fmenstr_log , 0.0% nulls , 6 unique values, float64
menstrs_log , 0.0% nulls , 5 unique values, float64
sisters_log , 0.0% nulls , 8 unique values, float64
arthrit_f_log , 0.0% nulls , 2 uniq

In [73]:
# Model 1: Random Forest on df1 (variables with below-mean importance removed, then transformed)
model1 = RandomForestClassifier(random_state=42)
model1.fit(X1_train, y1_train)
y1_pred = model1.predict(X1_test)

# Model 2: Random Forest on df2 (non-important variables removed, then transformed)
model2 = RandomForestClassifier(random_state=42)
model2.fit(X2_train, y2_train)
y2_pred = model2.predict(X2_test)

In [74]:
# Model 1 evaluation
accuracy1 = accuracy_score(y1_test, y1_pred)
f1_1 = f1_score(y1_test, y1_pred)
precision1 = precision_score(y1_test, y1_pred)
recall1 = recall_score(y1_test, y1_pred)
# Note: AUC-ROC is computed here from hard class predictions, not from predicted probabilities
auc_roc1 = roc_auc_score(y1_test, y1_pred)

# Model 2 evaluation
accuracy2 = accuracy_score(y2_test, y2_pred)
f1_2 = f1_score(y2_test, y2_pred)
precision2 = precision_score(y2_test, y2_pred)
recall2 = recall_score(y2_test, y2_pred)
auc_roc2 = roc_auc_score(y2_test, y2_pred)

# Results
print("Model 1 (df1) - Accuracy:", accuracy1)
print("Model 1 (df1) - F1-Score:", f1_1)
print("Model 1 (df1) - Recall:", recall1)
print("Model 1 (df1) - AUC-ROC:", auc_roc1)
print("Model 1 (df1) - Confusion matrix:\n", confusion_matrix(y1_test, y1_pred))

print("\nModel 2 (df2) - Accuracy:", accuracy2)
print("Model 2 (df2) - F1-Score:", f1_2)
print("Model 2 (df2) - Recall:", recall2)
print("Model 2 (df2) - AUC-ROC:", auc_roc2)
print("Model 2 (df2) - Confusion matrix:\n", confusion_matrix(y2_test, y2_pred))

Model 1 (df1) - Accuracy: 1.0
Model 1 (df1) - F1-Score: 1.0
Model 1 (df1) - Recall: 1.0
Model 1 (df1) - AUC-ROC: 1.0
Model 1 (df1) - Confusion matrix:
 [[23861     0]
 [    0    38]]

Model 2 (df2) - Accuracy: 0.999958157245073
Model 2 (df2) - F1-Score: 0.9866666666666666
Model 2 (df2) - Recall: 0.9736842105263158
Model 2 (df2) - AUC-ROC: 0.986842105263158
Model 2 (df2) - Confusion matrix:
 [[23861     0]
 [    1    37]]


In [75]:
# The results are nearly identical (Model 2 misses a single positive case). The dataset with fewer variables would be preferable. However, I will save both to compare them in later steps.

In [76]:
transformed_removing_below_mean_train_df.to_csv("../../0. Data/4. Transformed/transformed_removing_below_mean_train_df.csv", index=False)
transformed_removing_not_important_variables_train_df.to_csv("../../0. Data/4. Transformed/transformed_removing_not_important_variables_train_df.csv", index=False)

In [80]:
transformed_removing_below_mean_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119495 entries, 0 to 119494
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   liver_exitstat_sqrt     119495 non-null  float64
 1   filtered_f_log          119495 non-null  float64
 2   race7_log               119495 non-null  float64
 3   bmi_20c_raiz4           119495 non-null  float64
 4   sex_sqr                 119495 non-null  int64  
 5   mortality_exitstat_log  119495 non-null  float64
 6   smokea_f_raiz4          119495 non-null  float64
 7   cig_stat_log            119495 non-null  float64
 8   bmi_50c_raiz4           119495 non-null  float64
 9   bmi_curc_raiz4          119495 non-null  float64
 10  preg_f_log              119495 non-null  float64
 11  agelevel_log            119495 non-null  float64
 12  fmenstr_log             119495 non-null  float64
dtypes: float64(12), int64(1)
memory usage: 11.9 MB


# Applying same transformation to test_df
---
To ensure consistency in our model evaluation, we apply the same transformations identified in the training dataset to the test dataset. This step is crucial for maintaining the integrity of our predictive model and preventing data leakage.

**Key objectives of this process:**
- Apply the previously selected transformations to the test dataset to align it with the training data.
- Ensure that only the relevant features, as determined during the training phase, are retained in the test set.
- Maintain the same data preprocessing steps to guarantee that model performance is evaluated under the same conditions as training.

This ensures that our trained model can make fair and unbiased predictions on unseen data, ultimately improving its reliability for liver cancer detection.
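The key point is that the transformation chosen for each variable is decided once on the training data and then reused unchanged on the test data. A minimal sketch of that pattern follows; it is not the code of `mtf.apply_best_transformations`, and the helper name `apply_transformations`, the toy columns, and the `0.0001` log offset are assumptions for illustration.

```python
import numpy as np
import pandas as pd

EPS = 1e-4  # assumed offset so that log(0) is defined

# A subset of the transformation labels used in this notebook
TRANSFORMS = {
    "ident": lambda x: x,
    "log": lambda x: np.log(x + EPS),
    "sqrt": lambda x: np.sqrt(x),
}


def apply_transformations(df: pd.DataFrame, chosen: dict) -> pd.DataFrame:
    """Apply a fixed {column: label} mapping and name outputs '<column>_<label>' (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    for col, name in chosen.items():
        out[f"{col}_{name}"] = TRANSFORMS[name](df[col].astype(float))
    return out


# Mapping decided on the training set only (toy values)
chosen = {"a": "log", "b": "sqrt"}

train_toy = pd.DataFrame({"a": [0, 1, 2], "b": [4, 9, 16], "extra": [1, 1, 1]})
test_toy = pd.DataFrame({"a": [1, 0], "b": [1, 25], "extra": [0, 0]})

train_out = apply_transformations(train_toy, chosen)
test_out = apply_transformations(test_toy, chosen)

# Both outputs must expose exactly the same columns, in the same order
print(list(train_out.columns) == list(test_out.columns))
```

Iterating over the mapping instead of over the DataFrame's columns also drops any column that was not selected during training, such as `extra` in the toy data.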

In [82]:
## Imputed dataset
imputed_test_df = pd.read_csv('../../0. Data/3. Imputed/mean_median_imputed_test_df.csv')
imputed_test_copy1_df = imputed_test_df.copy()
imputed_test_copy2_df = imputed_test_df.copy()

In [83]:
imputed_removing_below_mean_test_df = imputed_test_copy1_df[columns_to_remove_1]
imputed_removing_not_important_variables_test_df = imputed_test_copy2_df[columns_to_remove_2]

In [84]:
# Applying the best transformation
transformed_removing_below_mean_test_df = mtf.apply_best_transformations(imputed_removing_below_mean_test_df, transformations, target)

In [85]:
# Applying the best transformation
transformed_removing_not_important_variables_test_df = mtf.apply_best_transformations(imputed_removing_not_important_variables_test_df, transformations2, target)

In [86]:
transformed_removing_below_mean_test_df.to_csv("../../0. Data/4. Transformed/transformed_removing_below_mean_test_df.csv", index=False)
transformed_removing_not_important_variables_test_df.to_csv("../../0. Data/4. Transformed/transformed_removing_not_important_variables_test_df.csv", index=False)