# Disjointification Example

- Demonstrates feature selection through disjointification
- The example data is gene expression measured on patients
- Can be used to automate tuning of the hyperparameters correlation_threshold and min_num_features

## defs/imports/loads

In [1]:
import disjointification
from disjointification import load_gene_expression_data, Disjointification
from utils.utils import get_dt_in_fmt
from pathlib import Path
from pprint import pprint
import pandas as pd
import numpy as np

## Survey the dataset & decide on model parameters
### Survey the data

In [2]:
# Load the dataframes only once per session; re-running the cell reuses them.
if 'labels_df' not in locals() or 'features_df' not in locals():
    print("Dataframes not loaded. Loading.")
    ge_data = load_gene_expression_data()
    features_df = ge_data["features"]
    labels_df = ge_data["labels"]
    print(f"features_df loaded with shape {features_df.shape}")
    print(f"labels_df loaded with shape {labels_df.shape}")
print(f"labels df shape: {labels_df.shape}")
print(f"features df shape: {features_df.shape}")

Dataframes not loaded. Loading.
features_df loaded with shape (3069, 9266)
labels_df loaded with shape (3069, 8)
labels df shape: (3069, 8)
features df shape: (3069, 9266)


### Set model parameters
- load_last_save_point, last_save_point - enable loading a previously saved model and give the path to its .pkl file
- min_num_features - disjointification stops once the best N features have been found
- correlation_threshold - disjointification only selects a feature whose correlation with the previously selected features is below this value
- select_num_features, select_num_instances - shrink the dataset to a given size (int) or fraction (float), primarily for debugging
- alert selection, debug print - print a message when a feature is selected via disjointification and when various actions are taken, for debugging
- model_save_folder - root path under which the different models are saved
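
To make the roles of correlation_threshold and min_num_features concrete, here is a minimal sketch of a greedy selection loop of the kind described above. It is an illustration only, not the Disjointification class's actual implementation: the function name `greedy_disjoint_selection`, the use of Pearson correlation for both steps, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd


def greedy_disjoint_selection(features, label, correlation_threshold=0.4,
                              min_num_features=500):
    """Hypothetical sketch, not the library's code.

    Rank features by |correlation with the label|, then keep a feature only if
    its |correlation| with every already-kept feature is below the threshold.
    Stops once min_num_features features have been kept.
    """
    # Best candidates first; features with undefined correlation are dropped.
    relevance = features.corrwith(label).abs().dropna().sort_values(ascending=False)
    selected = []
    for name in relevance.index:
        if len(selected) >= min_num_features:
            break
        if selected:
            # Largest |correlation| between the candidate and any kept feature.
            redundancy = features[selected].corrwith(features[name]).abs().max()
            if redundancy >= correlation_threshold:
                continue  # too similar to something already kept
        selected.append(name)
    return selected


# Toy data (assumed): "a_copy" is a near-duplicate of "a", "b" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
noise = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base + 0.01 * rng.normal(size=200),
    "b": noise,
})
target = pd.Series(base + 0.5 * noise)

chosen = greedy_disjoint_selection(toy, target, correlation_threshold=0.4,
                                   min_num_features=2)
print(chosen)
```

In this toy run only one of the two near-duplicate columns is kept, and the independent column "b" fills the second slot. A lower correlation_threshold rejects more candidates, which is why the threshold and min_num_features need to be tuned together.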

In [3]:
load_last_save_point = False
# last_save_point = r"model\06_24_2023__10_58_52\06_24_2023__10_59_03_(3069, 9260).pkl"

# shrink the dataset for debugging
select_num_features = 1.0
select_num_instances = 1.0
model_save_folder = r"\model"
min_num_features = 500
correlation_threshold = 0.4
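
The values of 1.0 above keep the full dataset. As a rough sketch of how an "int or fraction" size argument could be interpreted, the helper below treats an int as an absolute count and a float in (0, 1] as a fraction of the total. The name `resolve_size` and its rounding and clamping rules are assumptions for illustration; the library may resolve these arguments differently.

```python
def resolve_size(requested, total):
    """Hypothetical helper: turn an int count or a float fraction into a row/column count."""
    if isinstance(requested, bool):
        raise TypeError("requested must be an int or a float, not a bool")
    if isinstance(requested, int):
        if requested < 1:
            raise ValueError("an int size must be at least 1")
        count = requested
    elif isinstance(requested, float) and 0.0 < requested <= 1.0:
        # Fraction of the total, rounded to the nearest whole item.
        count = int(round(requested * total))
    else:
        raise ValueError("expected an int >= 1 or a float in (0, 1]")
    # Never return fewer than 1 item or more than are available.
    return max(1, min(count, total))


print(resolve_size(1.0, 9266))   # full set of features
print(resolve_size(100, 3069))   # first 100 instances, for a quick debug run
```

Note that under this convention 1 (int) and 1.0 (float) mean very different things: a single item versus the whole dataset.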

## Iterate over correlation threshold values and run the model

In [4]:
correlation_thresholds = [0.4]

for correlation_threshold in correlation_thresholds:
    iter_time = get_dt_in_fmt()
    print(f"\n{iter_time}: correlation threshold set to {correlation_threshold}. Initializing disjointification process")
    disj = Disjointification(
        features_file_path=None,
        labels_file_path=None,
        features_df=features_df,
        labels_df=labels_df,
        select_num_features=select_num_features,
        select_num_instances=select_num_instances,
        root_save_folder=model_save_folder,
        do_set=False,
        correlation_threshold=correlation_threshold,
        min_num_features=min_num_features,
    )
    disj.set()
    disj.describe()
    start_time = get_dt_in_fmt()
    print(f"\n{start_time} Running Disjointificatioin\n")
    disj.run_disjointification()
    end_time = get_dt_in_fmt()
    
    n = disj.get_num_features_selected_for_regression()
    m = disj.get_num_features_selected_for_classification()
    print(f"number of features selected: regression {n}, classification {m}")
    
    print(f"\n{end_time}: ended disjointification.\n\n\n")
    
    

print("\nDone running all disjointifications!")


09_13_2023__01_12_39: correlation threshold set to 0.4. Initializing disjointification process
saving model...
saved model to C:\model\09_13_2023__01_12_39\09_13_2023__01_12_39.pkl
Disjointification Test Description
features data: (3069, 9259)
labels data: (3069, 2)
regression label: Lympho
classification label: ER
correlation method regression: pearson
correlation method regression: <function point_bi_serial_r_correlation at 0x00000155ABABAE60>
min num of features to keep in disjointification: 500
correlation threshold: 0.4
last save point: \model\09_13_2023__01_12_39\09_13_2023__01_12_39.pkl
number of features kept in disjointification: lin 0, log 0

09_13_2023__01_13_07 Running Disjointificatioin

09_13_2023__01_13_07 : Running both regression and classification disjointification.


09_13_2023__01_13_07 : Running regression disjointification.


09_13_2023__01_18_10 - after 100 iterations, found 0 features!
09_13_2023__01_22_14 - after 200 iterations, found 0 features!
09_13_2023__0