# Disjointification Example

- Demonstrates feature selection through disjointification
- The example data is gene expression measured on patients
- Can be used to automate tuning of the hyperparameters correlation_threshold and min_num_features
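
As a rough illustration of what automated tuning of these two hyperparameters could look like, the sketch below enumerates a small grid and keeps the best-scoring combination. The helper `grid_search`, the grid values, and the scoring function `toy_score` are all hypothetical; in practice `toy_score` would be replaced by a callable that builds a model with the given parameters, runs disjointification, and returns a validation score.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score).

    param_grid maps a parameter name to a list of candidate values.
    evaluate is a user-supplied callable: dict of params -> score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a real validation score; it simply peaks at
    # correlation_threshold=0.4 and min_num_features=100.
    return -(abs(params["correlation_threshold"] - 0.4)
             + abs(params["min_num_features"] - 100) / 100)


grid = {
    "correlation_threshold": [0.2, 0.4, 0.6],
    "min_num_features": [50, 100, 200],
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

An exhaustive grid is the simplest option; since each run in this notebook takes minutes, a coarse grid is probably the practical choice.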

## defs/imports/loads

In [19]:
import disjointification
from disjointification import load_gene_expression_data, Disjointification
from pathlib import Path
from pprint import pprint
import pandas as pd
import numpy as np

## Survey the dataset & decide on model parameters
### Survey the data

In [20]:
if 'labels_df' not in locals() or 'features_df' not in locals():
    print("Dataframes not loaded. Loading.")
    ge_data = load_gene_expression_data()
    features_df = ge_data["features"]
    labels_df = ge_data["labels"]
    print(f"features_df loaded with shape {features_df.shape}")
    print(f"labels_df loaded with shape {labels_df.shape}")
print(f"labels df shape: {labels_df.shape}")
print(f"features df shape: {features_df.shape}")

labels df shape: (3069, 6)
features df shape: (3069, 9259)


### Set model parameters
- load_last_save_point, last_save_point - when load_last_save_point is True, a previously saved model is loaded from the .pkl file at the path last_save_point
- min_num_features - disjointification stops once this many of the best features have been selected
- correlation_threshold - a candidate feature is selected only if its correlation with the previously selected features is below this value
- select_num_features, select_num_instances - shrink the dataset to a given size (int) or fraction (float), primarily for debugging
- alert_selection, debug_print - print a message when a feature is selected via disjointification and when various actions are taken, for debugging
- model_save_folder - root path under which different models are saved
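
To make the roles of correlation_threshold and min_num_features concrete, here is a minimal sketch of a greedy selection loop that matches the descriptions above. It is not the library's implementation. Ranking candidates by their absolute correlation with the label is an assumption, and the names `pearson`, `greedy_disjoint_selection`, and the toy genes are hypothetical. It uses only the standard library so it runs without the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def greedy_disjoint_selection(features, label, correlation_threshold, min_num_features):
    """Select features that are relevant to the label but weakly correlated with each other.

    features maps a feature name to its list of values; label is a list of the same length.
    """
    # Assumption: candidates are visited from most to least correlated with the label.
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], label)),
                    reverse=True)
    selected = []
    for name in ranked:
        if len(selected) >= min_num_features:
            break  # enough features found
        # Accept only if the candidate is weakly correlated with every kept feature.
        if all(abs(pearson(features[name], features[kept])) < correlation_threshold
               for kept in selected):
            selected.append(name)
    return selected


toy_label = [1, 2, 3, 4, 5, 6]
toy_features = {
    "gene_a": [1, 2, 3, 4, 5, 7],    # strongly tracks the label
    "gene_b": [1, 2, 3, 4, 5, 8],    # nearly redundant with gene_a
    "gene_c": [1, -1, 1, -1, 1, -1], # weakly related to both
}
selected = greedy_disjoint_selection(toy_features, toy_label,
                                     correlation_threshold=0.5, min_num_features=2)
print(selected)
```

In the toy data, gene_b is skipped because it is almost a copy of gene_a, which is the redundancy that the threshold is meant to filter out.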

In [21]:
load_last_save_point = False
# last_save_point = r"model\06_24_2023__10_58_52\06_24_2023__10_59_03_(3069, 9260).pkl"

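# shrink the dataset for debugging: keep all features and 20% of the instances
select_num_features = 1.0
select_num_instances = 0.2
alert_selection = True
debug_print = False
# a leading backslash makes this relative to the root of the current drive (C:\model in the output below)
model_save_folder = r"\model"
min_num_features = 200
correlation_threshold = 0.2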
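
The int-or-fraction convention for select_num_features and select_num_instances can be sketched as below. The helper `resolve_size` is hypothetical and the rounding rule is an assumption, although rounding down does reproduce the 613 instances (20% of 3069) reported by describe() in this notebook.

```python
def resolve_size(requested, total):
    """Turn an int (absolute count) or a float (fraction of total) into a count."""
    if isinstance(requested, bool):
        raise TypeError("requested must be an int or a float, not a bool")
    if isinstance(requested, int):
        # Absolute size, capped at what is available.
        return min(requested, total)
    if isinstance(requested, float):
        if not 0.0 < requested <= 1.0:
            raise ValueError("a fractional size must lie in (0, 1]")
        # Assumption: fractions are rounded down.
        return int(total * requested)
    raise TypeError("requested must be an int or a float")


num_instances = resolve_size(0.2, 3069)
num_features = resolve_size(1.0, 9259)
print(num_instances, num_features)
```

Under this convention 1.0 (float) means "everything" while 1 (int) would mean a single feature, so the type of the literal matters.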

### Create model

In [22]:
if load_last_save_point:
    # resume from a previously saved model
    print(f"loading model from last save point {last_save_point}")
    test = disjointification.from_file(last_save_point)
else:
    # build a new model from the in-memory dataframes; do_set=False, so set() is called explicitly below
    test = Disjointification(features_file_path=None, labels_file_path=None, features_df=features_df,
                             labels_df=labels_df, select_num_features=select_num_features, select_num_instances=select_num_instances,
                             root_save_folder=model_save_folder, do_set=False, alert_selection=alert_selection,
                             correlation_threshold=correlation_threshold, min_num_features=min_num_features)
    test.set()
test.describe()

saving model...
saved model to C:\model\09_08_2023__18_46_09\09_08_2023__18_46_09.pkl
Disjointification Test Description
features data: (613, 9259)
labels data: (613, 2)
regression label: Lympho
classification label: ER
correlation method regression: pearson
correlation method regression: <function point_bi_serial_r_correlation at 0x000001CD296DDE10>
min num of features to keep in disjointification: 200
correlation threshold: 0.2
last save point: \model\09_08_2023__18_46_09\09_08_2023__18_46_09.pkl
number of features kept in disjointification: lin 0, log 0
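
The description lists Pearson correlation for the regression label and a point-biserial function for what appears to be the classification label (the second "correlation method" line). The library's `point_bi_serial_r_correlation` is not shown here, so the sketch below uses the textbook definition instead, with hypothetical helper names and toy values. It also shows that the point-biserial coefficient is simply Pearson correlation applied to a 0/1-coded label.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def point_biserial(values, binary_labels):
    """Textbook point-biserial correlation between a continuous variable and a 0/1 label."""
    n = len(values)
    group1 = [v for v, b in zip(values, binary_labels) if b == 1]
    group0 = [v for v, b in zip(values, binary_labels) if b == 0]
    mean_all = sum(values) / n
    # Population standard deviation (divide by n, not n - 1).
    std_pop = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / std_pop * sqrt(len(group1) * len(group0) / n ** 2)


expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy expression values for one gene
er_status = [0, 0, 0, 1, 1, 1]               # toy binary label

r_pb = point_biserial(expression, er_status)
r_pearson = pearson(expression, er_status)
print(r_pb, r_pearson)
```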


### Create a save point

In [23]:
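# path of the .pkl file written when the model was set up
last_save_point = test.last_save_point_file
print('last save point:')
print(last_save_point)
# reload the model from disk to confirm the save point is usable
test = disjointification.from_file(last_save_point)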

last save point:
\model\09_08_2023__18_46_09\09_08_2023__18_46_09.pkl
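
For readers unfamiliar with save points, the following is a minimal sketch of a timestamped pickle round trip that mirrors the folder layout seen in the output above. How the library actually serializes models is not shown in this notebook, so `save_point`, `load_point`, and the example `state` dictionary are hypothetical. The timestamp format is inferred from names such as 06_24_2023__10_58_52.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path

# Inferred from the folder names in this notebook: month_day_year__hour_minute_second.
TIMESTAMP_FORMAT = "%m_%d_%Y__%H_%M_%S"


def save_point(obj, root_folder):
    """Pickle obj to <root_folder>/<timestamp>/<timestamp>.pkl and return the path."""
    stamp = datetime.now().strftime(TIMESTAMP_FORMAT)
    folder = Path(root_folder) / stamp
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{stamp}.pkl"
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_point(path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip inside a temporary folder so nothing is left behind.
state = {"min_num_features": 200, "correlation_threshold": 0.2, "selected": []}
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_point(state, tmp)
    restored = load_point(saved_path)
print(saved_path.name, restored == state)
```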


In [24]:
test.describe()

Disjointification Test Description
features data: (613, 9259)
labels data: (613, 2)
regression label: Lympho
classification label: ER
correlation method regression: pearson
correlation method regression: <function point_bi_serial_r_correlation at 0x000001CD296DDE10>
min num of features to keep in disjointification: 200
correlation threshold: 0.2
last save point: \model\09_08_2023__18_46_09\09_08_2023__18_46_09.pkl
number of features kept in disjointification: lin 0, log 0


### Run Disjointification

In [25]:
start_time = disjointification.utils.get_dt_in_fmt()
print(f"{start_time} Running Disjointification")
test.run_disjointification()

09_08_2023__18_46_14 Running Disjointification
09_08_2023__18_46_14 : Running both regression and classification disjointification.


09_08_2023__18_46_14 : Running regression disjointification.


09_08_2023__18_46_48 - after 100 iterations, found 0 features!
09_08_2023__18_47_22 - after 200 iterations, found 0 features!
09_08_2023__18_47_55 - after 300 iterations, found 0 features!
09_08_2023__18_48_36 - after 400 iterations, found 0 features!
09_08_2023__18_49_13 - after 500 iterations, found 0 features!
09_08_2023__18_49_49 - after 600 iterations, found 0 features!
09_08_2023__18_50_24 - after 700 iterations, found 0 features!
09_08_2023__18_51_00 - after 800 iterations, found 0 features!
09_08_2023__18_51_33 - after 900 iterations, found 0 features!
09_08_2023__18_52_05 - after 1000 iterations, found 0 features!
09_08_2023__18_52_39 - after 1100 iterations, found 0 features!
09_08_2023__18_53_14 - after 1200 iterations, found 0 features!
09_08_2023__18_53_47 - after 1300 iteration

In [26]:
test.describe()

Disjointification Test Description
features data: (613, 9259)
labels data: (613, 2)
regression label: Lympho
classification label: ER
correlation method regression: pearson
correlation method regression: <function point_bi_serial_r_correlation at 0x000001CD296DDE10>
min num of features to keep in disjointification: 200
correlation threshold: 0.2
last save point: \model\09_08_2023__18_46_09\09_08_2023__18_46_09.pkl
number of features kept in disjointification: lin 200, log 200
