In [None]:
pip install -e ../.

In [None]:
import gSELECT.io as gsio
import gSELECT.feature_selection as gsfs
import gSELECT.classification as gsc
import gSELECT.visualization as gsv

In [None]:
filepath = "your/path/here"
output_path = "output"

## Explore the Dataset

Use `explore_h5ad()` to preview the structure of your input file. This helps you choose filters or column names.

In [None]:
gsio.explore_h5ad(filepath)

## Load Gene Expression Data

You can load data in one of two ways:

---

### Option 1: Load from `.h5ad` (AnnData)

Use this when working with `.h5ad` files, which often contain metadata like cell types or experimental conditions.

To load only a subset of the data (e.g., a specific tissue or condition), you must specify:

- `filter_column`: the column in `.obs` to filter by (e.g., `"cell_type"`).
- `filter_values`: the values within that column to keep (e.g., `["T cells", "B cells"]`).

```python
filter_column = "your_filtercolumn"  # e.g., "cell_type"
filter_values = ["value_zero", "value_one"]

genes, data = gsio.load_h5ad(
    filepath,
    filter_column=filter_column,
    filter_values=filter_values,
)
```

In [None]:
filter_column = "your_filtercolumn"
filter_values = ['value_zero', 'value_one']

genes, data = gsio.load_h5ad(filepath, filter_column=filter_column, filter_values=filter_values)

### Option 2: Load from CSV
No filtering is required; just provide the file path.

In [None]:
genes, data = gsio.load(filepath)

## Optional: Create a Final Hold-Out Test Set

This optional step prevents information leakage and avoids circularity in the analysis.

- The dataset is first transposed so that **samples are rows** and **genes are columns**.
- 80% of the samples are randomly selected as the **training set**.
- The remaining 20% are set aside as a **final test set**, which will not be used during mutual information (MI) calculation, gene selection, or model training.
- Both sets are then transposed back to the original format (**genes × samples**).

Why take this extra step?

By removing the test set before computing mutual information and training the model, you ensure that no information from those samples influences feature selection or model development. This eliminates circularity and enables an unbiased final evaluation of model performance.

Only the training data should be used for mutual information scoring, gene selection, and training. The final test data should be reserved exclusively for the last evaluation step.


In [None]:
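# Transpose to samples x genes so that whole samples (rows) can be drawn.
data_total = data.transpose()
# Randomly draw 80% of the samples for training (pass random_state=<int> for a reproducible split).
training_data = data_total.sample(frac=0.8)
# The remaining 20% form the final hold-out test set.
test_data = data_total.drop(training_data.index)
# Transpose both sets back to the original genes x samples layout.
training_data = training_data.transpose()
test_data = test_data.transpose()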
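
To make the split reproducible and to confirm that no sample ends up in both sets, one option is to seed `DataFrame.sample` and check the column sets afterwards. The sketch below runs on a small made-up genes × samples table; the names `toy`, `gene_a`, `s0` and the seed `random_state=0` are illustrative assumptions, not part of gSELECT.

```python
import pandas as pd

# Toy genes x samples matrix (hypothetical values: 3 genes, 10 samples).
toy = pd.DataFrame(
    [[i + j for j in range(10)] for i in range(3)],
    index=["gene_a", "gene_b", "gene_c"],
    columns=[f"s{j}" for j in range(10)],
)

samples_as_rows = toy.transpose()                          # samples x genes
train = samples_as_rows.sample(frac=0.8, random_state=0)   # seeded 80% draw
test = samples_as_rows.drop(train.index)                   # remaining 20%

toy_training_data = train.transpose()                      # back to genes x samples
toy_test_data = test.transpose()

# Sanity check: the two sets must not share any sample.
overlap = set(toy_training_data.columns) & set(toy_test_data.columns)
print(toy_training_data.shape, toy_test_data.shape, len(overlap))
```

With a fixed seed, rerunning the notebook yields the same hold-out samples, which keeps later accuracy figures comparable between runs.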

## Compute Mutual Information Scores

This step calculates **mutual information (MI)** between each gene and the class labels in the training data.  
The MI score reflects how informative each gene is for distinguishing between classes and will be used to rank genes for classification.

You can also provide an **optional exclusion list** of genes (e.g., housekeeping or control genes) that you want to exclude from evaluation and ranking.
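
To build intuition for what an MI score expresses, here is a minimal sketch that computes mutual information (in bits) between a discrete gene signal and class labels. It assumes already-discretized (here binary) expression values; how gSELECT discretizes expression and estimates MI is not specified here, and `mutual_information` is a hypothetical helper, not a gSELECT function.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)             # marginal counts of gene values
    py = Counter(ys)             # marginal counts of class labels
    pxy = Counter(zip(xs, ys))   # joint counts
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative_gene = [0, 0, 0, 0, 1, 1, 1, 1]    # on/off state mirrors the class
uninformative_gene = [0, 1, 0, 1, 0, 1, 0, 1]  # independent of the class

print(mutual_information(informative_gene, labels))    # 1.0
print(mutual_information(uninformative_gene, labels))  # 0.0
```

A gene that perfectly separates two equally sized classes reaches 1 bit, while a gene that is independent of the labels scores 0, so ranking by MI puts the most discriminative genes first.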

In [None]:
exclusion_list = [
    "example_gene"
]
mutual_info = gsfs.compute_mutual_information(
    genes,
    training_data,
    output_folder=output_path,
    exclusion_list=exclusion_list,
)

## Classification Step – Running MLP-based Gene Expression Classifiers

This section runs the core classification logic using a **Multilayer Perceptron (MLP)** neural network. The classifier is trained using different gene selection strategies and evaluated across multiple randomized sweeps to assess stability and generalization.

You can control various modes of operation through the provided functions:

---

### Main Options

- **Number of sweeps**: Run repeated training/evaluation cycles to get stable performance estimates (default = 10).
- **Test mode**:
  - Provide an explicit `test_data`, *or*
  - Let the code split the training data into train/test subsets internally (used when `test_data` is `None` or in specific sweeps).
- **Gene selection mode**:
  - `0`: Use the selected gene list directly.
  - `1`: Random genes (same size as selected list, used for baseline comparison).
  - `2`: All non-constant genes (automatically selected).
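
The three gene selection modes can be pictured with a small stand-alone sketch. It is only an illustration of the idea: `pick_genes` and the toy `expression` dictionary are hypothetical, and whether gSELECT draws its random baseline from all genes or only from non-constant ones is an assumption made here.

```python
import random

# Toy expression table: gene name -> values across four samples (hypothetical).
expression = {
    "gene_a": [1, 5, 2, 7],
    "gene_b": [3, 3, 3, 3],   # constant across samples, carries no information
    "gene_c": [0, 1, 0, 1],
    "gene_d": [4, 2, 6, 1],
}
selected = ["gene_a", "gene_c"]


def pick_genes(mode, expression, selected, rng):
    """Return the gene names a classifier would be trained on for a given mode."""
    if mode == 0:   # use the selected gene list directly
        return list(selected)
    if mode == 1:   # random baseline of the same size as the selected list
        return rng.sample(sorted(expression), k=len(selected))
    if mode == 2:   # every gene whose values are not all identical
        return [gene for gene, values in expression.items() if len(set(values)) > 1]
    raise ValueError(f"unknown gene selection mode: {mode}")


rng = random.Random(0)
for mode in (0, 1, 2):
    print(mode, pick_genes(mode, expression, selected, rng))
```

Comparing mode `0` against mode `1` shows whether the selected genes beat a random set of equal size, and mode `2` gives a reference that uses everything informative.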

---

### Functional Entry Points

- `run_selected_genes`: Use top `N` ranked genes based on mutual information.
- `run_multiple_gene_selections`: Evaluate performance across multiple gene counts (e.g., `[1, 2, 5, 10, 100]`).
- `run_with_custom_gene_set`: Run classification with a user-defined list of gene names.
- `run_explorative_gene_selections`: Automatically evaluate all combinations of the top `N` genes (uses exhaustive or greedy search depending on size).
- `run_explorative_gene_selections_with_custom_set`: Same as above, but restricted to a custom set of genes.
- `run_all_genes`: Uses **all non-constant genes** for classification (can be used as an upper-bound or reference model).
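
To see why the explorative functions may need to fall back from exhaustive to greedy search, it helps to count how many gene combinations exist. The sketch below enumerates every non-empty subset of a hypothetical top-4 list with `itertools.combinations`; whether gSELECT enumerates subsets in exactly this way, and at which size it switches strategy, is not stated here and is an assumption.

```python
from itertools import combinations

top_genes = ["g1", "g2", "g3", "g4"]  # hypothetical top-4 genes by MI

# Every non-empty subset, from single genes up to the full set.
subsets = [
    combo
    for size in range(1, len(top_genes) + 1)
    for combo in combinations(top_genes, size)
]
print(len(subsets))  # 15, i.e. 2**4 - 1

# The count doubles with each extra gene, so it grows quickly.
for n in (5, 11, 20):
    print(n, 2**n - 1)
```

For `top_n_genes=11` this already means 2047 candidate subsets, each of which would be trained for every sweep, so keeping `N` and `number_sweeps` small is a sensible starting point.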

---

### Outputs

Each function returns a collection of results:
- **Test accuracy** (balanced)
- **Train accuracy**
- **Gene selection mode**
- **Number of misclassified test samples**

These outputs can be passed directly to the matching `gsv` plotting function (`plot_results()`, `plot_multiple_gene_selections()`, or `plot_explorative_gene_selections()`) for visualization and comparison.
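
The "balanced" test accuracy listed above matters when classes differ in size. The sketch below assumes the common definition, the mean of per-class recall (as used by scikit-learn's `balanced_accuracy_score`); the exact computation inside gSELECT is not shown here, and `balanced_accuracy` is a hypothetical helper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10   # a classifier that always predicts the majority class

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)  # 0.8 0.5
```

A majority-class predictor looks good under plain accuracy but scores only 0.5 under the balanced metric, which is the chance level for two classes.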

---

### Note
Make sure that the mutual information scores and gene indices passed to these functions are consistent with your data matrix format (`genes × cells`).


In [None]:
results = gsc.run_selected_genes(training_data, mutual_info, test_data=test_data, number_sweeps=10, top_n_genes=5)
gsv.plot_results(results, output_path)

In [None]:
gene_selection = [1, 2, 5, 10, 100]
results = gsc.run_multiple_gene_selections(training_data, mutual_info, test_data=test_data, number_sweeps=3, gene_selection=gene_selection)
gsv.plot_multiple_gene_selections(results)

In [None]:
gene_list = ["gene1", "gene2", "gene3"] # all genes have to exist in the data set
results = gsc.run_with_custom_gene_set(training_data, gene_list, mutual_info, test_data=test_data, number_sweeps=10)
gsv.plot_results(results, output_path)

In [None]:
results = gsc.run_explorative_gene_selections(training_data, mutual_info, test_data=test_data, number_sweeps=2, top_n_genes=11)
gsv.plot_explorative_gene_selections(results, output_folder=output_path)

In [None]:
gene_list = ["gene1", "gene2"] # all genes have to exist in the data set
results = gsc.run_explorative_gene_selections_with_custom_set(training_data, gene_list, mutual_info, test_data=test_data, number_sweeps=2)
gsv.plot_explorative_gene_selections(results, top_n=10, output_folder=output_path)

In [None]:
results = gsc.run_all_genes(training_data, mutual_info, test_data=test_data, number_sweeps=3)
gsv.plot_results(results, output_path)