# 1. Basic usage

 Before working with binary landscapes, let's set up our environment and import the necessary libraries.

---

## 1.1. Importing the required packages

In this tutorial, we will use:

- **`numpy`** — for numerical operations and array handling.  
- **`pandas`** — to manage and visualize tabular data.  
- **`matplotlib`** — for simple visualizations and plots.  
- **`epistasia`** — the main package used for representing and analyzing binary landscapes.

We will also define a small helper `header` function for formatted section headers, and adjust the Python path so the interpreter can find the `epistasia` package locally.


In [11]:
###########################
#         IMPORTS         #
###########################

import os
import sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

###########################
#         HELPERS         #
###########################

# Simple header function for clean console output
def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

#############################################
#     LOAD BINARY LANDSCAPES DEPENDENCE     #
#############################################

# Include your local path to the library here
base_path = os.path.expanduser("~/FunEcoLab_IBFG Dropbox/")
sys.path.insert(1, base_path)

# Import the main package
import epistasia as ep

## 1.2. Loading a Binary Landscape from file (recommended)

In most real applications, landscape data is stored on disk as a table (e.g. CSV or TSV), where each row corresponds to a binary state and columns contain experimental replicates or measurements.

Epistasia provides a small set of input/output utilities that load such files robustly and convert them directly into a `Landscape` object (described in Section 1.3). This is the **recommended** way to construct landscapes from empirical data.
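To make the expected file layout concrete, here is a minimal sketch that parses such a table with the standard library only. It is not Epistasia's loader: the column names (`g0`, `g1`, `rep_1`, `rep_2`) and the rule that replicate columns are the ones prefixed with `rep_` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file content: two binary columns followed by two replicate columns.
raw = """g0,g1,rep_1,rep_2
0,0,1.00,1.05
0,1,1.10,
1,0,1.30,1.28
1,1,1.50,1.48
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Assumption for this sketch: replicate columns are prefixed with "rep_".
rep_cols = [c for c in rows[0] if c.startswith("rep_")]
state_cols = [c for c in rows[0] if c not in rep_cols]

# Binary states as tuples of ints; empty cells become NaN.
states = [tuple(int(r[c]) for c in state_cols) for r in rows]
values = [[float(r[c]) if r[c] else float("nan") for c in rep_cols] for r in rows]

print(state_cols, rep_cols)
print(states)
```

Each row yields one binary state and one list of replicate measurements, which is the same split between states and values that a `Landscape` stores.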

In [12]:
path = os.path.expanduser("~/FunEcoLab_IBFG Dropbox/Noise/Datasets/")

L = ep.landscape_from_file(
    os.path.join(path, "Complete_landscape_Sabela.csv"),
)
# The loaded object reports Landscape(N=10, R=5, shape=(1024, 5))

# Call the method (note the parentheses) to get the table as a DataFrame
display(L.to_dataframe())

                  C10  C9  C8  C7  C6  C5  C4  C1  C12  C14     rep_1     rep_2     rep_3     rep_4     rep_5
state      Order                                                                                             
0000000000 0        0   0   0   0   0   0   0   0    0    0  0.000000  0.000000  0.000000  0.000000  0.000000
0000000001 1        0   0   0   0   0   0   0   0    0    1  0.048467  0.038793  0.133346  0.046642  0.144859
0000000010 1        0   0   0   0   0   0   0   0    1    0  0.215349  0.030650  0.212454  0.166285  0.199879
0000000011 2        0   0   0   0   0   0   0   0    1    1  0.162641  0.163804  0.130344  0.141865  0.180170
0000000100 1        0   0   0   0   0   0   0   1    0    0  0.160771  0.154306  0.164416  0.111621  0.099257
0000000101 2        0   0   0   0   0   0   0   1    0    1  0.170876  0.139765  0.253409  0.152870  0.192046
0000000110 2        0   0   0   0   0   0 

## 1.3. Building a Binary Landscape from a DataFrame

For illustration purposes, it is often convenient to construct a `Landscape` directly from a pandas `DataFrame`. This is useful for
tutorials, synthetic examples, or programmatically generated data. In real applications, however, landscapes are typically loaded from disk using `epistasia.io.landscape_from_file` (see Section 1.2).

The **core module** defines the class `Landscape`, which represents an empirical (binary) landscape dataset. Each row corresponds to one observed binary state (for example, a genotype or a presence/absence combination), and each column in `values` corresponds to a replicate or experimental measurement. 

We can create a demo `DataFrame` to illustrate the basic utilities of `epistasia`. It includes a few missing values to mimic a common situation in empirical data.

In [13]:
# --- Demo DataFrame ---
df = pd.DataFrame({
    "g0": [0, 0, 0, 1, 1, 1, 0, 1],
    "g1": [0, 0, 1, 0, 1, 1, 1, 0],
    "g2": [0, 1, 0, 0, 0, 1, 1, 1],
    "rep_1": [1.00, 1.10, 1.20, 1.30, np.nan, 1.50, 1.60, 1.70],
    "rep_2": [1.05, 1.12, np.nan, 1.28, 1.38, 1.48, 1.58, 1.68],
    "rep_3": [0.95, 1.08, 1.22, 1.33, 1.41, np.nan, 1.61, 1.69],
})

header("Demo DataFrame (first rows)")
display(df)



Demo DataFrame (first rows)


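| | g0 | g1 | g2 | rep_1 | rep_2 | rep_3 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1.0 | 1.05 | 0.95 |
| 1 | 0 | 0 | 1 | 1.1 | 1.12 | 1.08 |
| 2 | 0 | 1 | 0 | 1.2 | NaN | 1.22 |
| 3 | 1 | 0 | 0 | 1.3 | 1.28 | 1.33 |
| 4 | 1 | 1 | 0 | NaN | 1.38 | 1.41 |
| 5 | 1 | 1 | 1 | 1.5 | 1.48 | NaN |
| 6 | 0 | 1 | 1 | 1.6 | 1.58 | 1.61 |
| 7 | 1 | 0 | 1 | 1.7 | 1.68 | 1.69 |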


From a tabular `DataFrame`, you can construct a `Landscape` object using the class method `from_dataframe()`:

In [14]:
# --- Build a Landscape from the DataFrame ---
# N (number of binary variables, e.g. species) and R (number of replicates)
# are inferred automatically from the columns
L = ep.Landscape.from_dataframe(df)

## 1.4. Inspecting Basic Properties

We can easily check the main properties of the constructed landscape:

In [15]:

# Show basic properties of the dataset
header("Landscape summary")
print(f"N (dimensions):      {L.N}")       # Number of binary variables
print(f"R (replicates):      {L.R}")       # Number of experimental replicates
print(f"order:               {L.order}")   # State ordering ('lex' = lexicographic)
print(f"M (observed states): {L.M}")       # Number of observed states (rows)
print(f"Feature names:       {L.feature_names}")  # List of feature names
print(L.states)                            # Binary matrix (M × N)


Landscape summary
N (dimensions):      3
R (replicates):      3
order:               lex
M (observed states): 8
Feature names:       ['g0', 'g1', 'g2']
[[0 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [1 1 0]
 [1 1 1]
 [0 1 1]
 [1 0 1]]
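For reference, a landscape over N binary variables has 2^N possible states. The sketch below enumerates them in lexicographic order with the standard library and counts the ones in each state, which appears to be what the `Order` column in the table of Section 1.2 reports. It is an illustration only and does not rely on Epistasia's internals.

```python
from itertools import product

N = 3  # number of binary variables, as in the demo landscape

# All 2**N states in lexicographic order: (0,0,0), (0,0,1), ..., (1,1,1)
all_states = list(product([0, 1], repeat=N))

for s in all_states:
    label = "".join(map(str, s))   # e.g. "101"
    order = sum(s)                 # number of ones in the state
    print(label, order)
```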


## 1.5. Useful Methods

The `Landscape` class provides several basic utility methods for **data cleaning**, **summarization**, and **subsetting**.  
These allow you to easily filter, inspect, or select specific parts of your binary landscape.

In this section, you will learn how to:
- Compute the **mean value per state** across replicates (ignoring missing data).  
- **Select** a subset of replicate columns to focus on specific conditions or experiments.  
- **Select** a subset of states (rows) based on indices, boolean masks, or biological patterns.
- **Access replicate values** corresponding to a specific binary state (for interactive inspection).

The `mean_over_replicates()` method calculates the average value for each binary state across all available replicates, automatically ignoring missing (`NaN`) entries.

In [16]:
# Compute mean value per state, ignoring NaNs
mean = L.mean_over_replicates()
mean_df = pd.DataFrame({
    "State index": range(L.M),
    "Mean across replicates": mean
})
display(mean_df)

| | State index | Mean across replicates |
|---|---|---|
| 0 | 0 | 1.0 |
| 1 | 1 | 1.1 |
| 2 | 2 | 1.21 |
| 3 | 3 | 1.303333 |
| 4 | 4 | 1.395 |
| 5 | 5 | 1.49 |
| 6 | 6 | 1.596667 |
| 7 | 7 | 1.69 |
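As a cross-check, the same per-state averages can be reproduced with plain NumPy. This sketch assumes the method behaves like `np.nanmean` along the replicate axis, which agrees with the output above but has not been checked against the package source.

```python
import numpy as np

# Replicate values of the demo DataFrame (rows = states, columns = replicates)
values = np.array([
    [1.00, 1.05, 0.95],
    [1.10, 1.12, 1.08],
    [1.20, np.nan, 1.22],
    [1.30, 1.28, 1.33],
    [np.nan, 1.38, 1.41],
    [1.50, 1.48, np.nan],
    [1.60, 1.58, 1.61],
    [1.70, 1.68, 1.69],
])

# Average over the replicate axis, skipping NaN entries
mean = np.nanmean(values, axis=1)
print(np.round(mean, 6))
```

A state with one missing replicate is averaged over the remaining two, so state 4 gives (1.38 + 1.41) / 2 = 1.395.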


The method `select_replicates()` allows you to extract only a subset of replicate measurements, for instance when comparing specific experimental conditions.

In [17]:
# --- Select a subset of replicate columns (keeps all states) ---
L_subset = L.select_replicates([0, 2])  # Select only replicates 0 and 2

# Convert to DataFrame for inspection
subset_df = pd.DataFrame(
    np.hstack([L_subset.states, L_subset.values]),
    columns=[f"s{i}" for i in range(L_subset.N)] +
            [f"rep_{j}" for j in range(L_subset.R)]
)

print("Subset with replicate columns 0 and 2:")
display(subset_df)

Subset with replicate columns 0 and 2:


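| | s0 | s1 | s2 | rep_0 | rep_1 |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.95 |
| 1 | 0.0 | 0.0 | 1.0 | 1.1 | 1.08 |
| 2 | 0.0 | 1.0 | 0.0 | 1.2 | 1.22 |
| 3 | 1.0 | 0.0 | 0.0 | 1.3 | 1.33 |
| 4 | 1.0 | 1.0 | 0.0 | NaN | 1.41 |
| 5 | 1.0 | 1.0 | 1.0 | 1.5 | NaN |
| 6 | 0.0 | 1.0 | 1.0 | 1.6 | 1.61 |
| 7 | 1.0 | 0.0 | 1.0 | 1.7 | 1.69 |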
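In terms of the underlying array, selecting replicates amounts to picking columns. The NumPy sketch below illustrates this on a small array and is not the package's implementation. Note that the display code above relabels the kept columns with `range(L_subset.R)`, so `rep_1` in that table holds the original replicate 2. Labelling by the original indices, as done here, avoids the ambiguity.

```python
import numpy as np

values = np.array([
    [1.00, 1.05, 0.95],
    [1.10, 1.12, 1.08],
    [1.20, np.nan, 1.22],
])

keep = [0, 2]              # original replicate indices to keep
subset = values[:, keep]   # all rows, selected columns

# Keep the original indices in the labels to avoid ambiguity
labels = [f"rep_{j}" for j in keep]
print(labels)
print(subset)
```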


The `select_states()` method subsets the landscape by rows, using any of:

- A list of indices (e.g. `[0, 2, 5]`)
- A boolean mask (e.g. to keep only states with valid data)
- A slice (e.g. `0:4` for the first four states)

In [18]:
# --- Select a subset of states (rows) ---
# Example 1: Select by explicit indices
L_sel_idx = L.select_states([0, 3, 5])

# Example 2: Select only rows without NaNs in the first replicate
mask_valid = ~np.isnan(L.values[:, 0])
L_sel_mask = L.select_states(mask_valid)

# Convert one of them to DataFrame for display
sel_df = pd.DataFrame(
    np.hstack([L_sel_idx.states, L_sel_idx.values]),
    columns=[f"s{i}" for i in range(L_sel_idx.N)] +
            [f"rep_{j}" for j in range(L_sel_idx.R)]
)

print("Subset with selected state indices [0, 3, 5]:")
display(sel_df)

Subset with selected state indices [0, 3, 5]:


| | s0 | s1 | s2 | rep_0 | rep_1 | rep_2 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.05 | 0.95 |
| 1 | 1.0 | 0.0 | 0.0 | 1.3 | 1.28 | 1.33 |
| 2 | 1.0 | 1.0 | 1.0 | 1.5 | 1.48 | NaN |
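The three selector types mirror standard NumPy row indexing. A minimal sketch on a plain array, assuming `select_states()` follows the same semantics:

```python
import numpy as np

values = np.array([
    [1.00, 1.05, 0.95],
    [1.10, 1.12, 1.08],
    [1.20, np.nan, 1.22],
    [1.30, 1.28, 1.33],
    [np.nan, 1.38, 1.41],
])

by_index = values[[0, 3]]         # list of row indices
mask = ~np.isnan(values[:, 0])    # True where replicate 0 was measured
by_mask = values[mask]            # boolean mask
by_slice = values[0:4]            # slice: first four rows

print(by_index.shape, by_mask.shape, by_slice.shape)
```

The mask and the slice happen to keep the same four rows here, because only the last row lacks a value in replicate 0.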


The `get_values()` method lets you directly access the replicate values for a given binary configuration.
This is particularly useful when working from the terminal or when you only need to inspect the measurements for one state.

In [19]:
# --- Retrieve replicate values for a specific state ---
# Example 1: Pass the binary state as a list
vals = L.get_values([1, 0, 1])
print("Replicate values for state [1, 0, 1]:")
print(vals)

# Example 2: Retrieve only some replicates
vals_sub = L.get_values([1, 0, 1], replicates=[0, 1])
print("\nOnly first two replicates:")
print(vals_sub)

# Example 3: Return a labeled DataFrame
vals_df = L.get_values([1, 0, 1], as_dataframe=True)
display(vals_df)

# Example 4: Retrieve using integer encoding (binary 101 = 5)
print("\nBy integer encoding (5):")
print(L.get_values(5))


Replicate values for state [1, 0, 1]:
[[1.7  1.68 1.69]]

Only first two replicates:
[[1.7  1.68]]


| | rep_0 | rep_1 | rep_2 |
|---|---|---|---|
| 0 | 1.7 | 1.68 | 1.69 |



By integer encoding (5):
[[1.7  1.68 1.69]]
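The integer encoding in Example 4 reads the state as a binary number, so `[1, 0, 1]` becomes `0b101 = 5`. The sketch below converts in both directions, assuming the first feature is the most significant bit. The palindromic example cannot confirm that bit order, so check it against the package before relying on it. The helper names are hypothetical and not part of Epistasia.

```python
def state_to_int(bits):
    """Read a sequence of 0/1 values as a binary number (first bit most significant)."""
    return int("".join(str(b) for b in bits), 2)


def int_to_state(k, n):
    """Inverse conversion: integer k to a list of n bits."""
    return [int(c) for c in format(k, f"0{n}b")]


print(state_to_int([1, 0, 1]))   # 5
print(int_to_state(5, 3))        # [1, 0, 1]
```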


## 1.6. Filtering strategies for replicate NaNs

We’ll look at three ways of handling missing data in the same dataset:

- **`drop_rows_with_any_nan()`**  
  Removes rows that contain *any* missing replicate.  
  Useful when you need a strictly complete dataset where all replicates were measured.

- **`drop_rows_with_all_nan()`**  
  Removes only those rows where *all* replicate values are missing.  
  Keeps partially measured states that still have usable information.

- **`missing_states()`**  
  Lists binary configurations that do not appear in the dataset at all.  
  Useful to evaluate experimental coverage or identify unmeasured combinations.
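The first two strategies differ only in the row mask they apply. A NumPy sketch of both masks, given as an illustration of the logic rather than the package's code:

```python
import numpy as np

values = np.array([
    [1.00, 1.05, 0.95],        # complete
    [1.20, np.nan, 1.22],      # partially measured
    [np.nan, np.nan, np.nan],  # never measured
])

nan = np.isnan(values)

strict = values[~nan.any(axis=1)]      # keep rows with no NaN at all
permissive = values[~nan.all(axis=1)]  # keep rows with at least one value

print(strict.shape[0], permissive.shape[0])
```

Strict filtering keeps only the complete row, while permissive filtering also keeps the partially measured one.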


In [20]:
###############################################################
# 0 Create new demo dataset with three types of missing data  #
###############################################################

#State [1 1 1] has been removed
df_missing = pd.DataFrame({
    "g0": [0, 0, 0, 1, 1, 0, 1],
    "g1": [0, 0, 1, 0, 1, 1, 0],
    "g2": [0, 1, 0, 0, 0, 1, 1], 
    "rep1": [1.00, 1.10, 1.20, 1.30, np.nan, 1.60, 1.70],
    "rep2": [1.05, 1.12, np.nan, 1.28, np.nan, 1.58, 1.68],
    "rep3": [0.95, 1.08, 1.22, 1.33, np.nan,  1.61, 1.69],
})


header("Demo DataFrame with missing data")
display(df_missing)

#Load as a Landscape object
L_missing = ep.Landscape.from_dataframe(df_missing) 

#################################################
# 1 Strict filtering: remove rows with ANY NaN  #
#################################################

L_complete = L_missing.drop_rows_with_any_nan()
clean_df = pd.DataFrame(
    np.hstack([L_complete.states, L_complete.values]),
    columns=[f"g{i}" for i in range(L_complete.N)] +
            [f"rep_{j}" for j in range(L_complete.R)]
)


print("1-STRICT FILTERING: no missing replicates")
display(clean_df)

#####################################################
# 2 Permissive filtering: remove rows with ALL NaNs #
#####################################################

# State [1 1 0] is NaN in every replicate, so it is the only row dropped here
L_partial = L_missing.drop_rows_with_all_nan()
partial_df = pd.DataFrame(
    np.hstack([L_partial.states, L_partial.values]),
    columns=[f"g{i}" for i in range(L_partial.N)] +
            [f"rep_{j}" for j in range(L_partial.R)]
)

print("2-PERMISSIVE FILTERING: REMOVE ROWS WITH ALL NaNS")
display(partial_df)

###########################################################
# 3 Missing states: configurations not present in dataset #
###########################################################

print("3-MISSING STATES: CONFIGURATIONS NOT PRESENT IN DATASET")

missing = L_missing.missing_states()
print(f"\nMissing {len(missing)} states out of {2**L_missing.N} total:")
display(pd.DataFrame(missing, columns=[f"g{i}" for i in range(L_missing.N)]))




Demo DataFrame with missing data


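| | g0 | g1 | g2 | rep1 | rep2 | rep3 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1.0 | 1.05 | 0.95 |
| 1 | 0 | 0 | 1 | 1.1 | 1.12 | 1.08 |
| 2 | 0 | 1 | 0 | 1.2 | NaN | 1.22 |
| 3 | 1 | 0 | 0 | 1.3 | 1.28 | 1.33 |
| 4 | 1 | 1 | 0 | NaN | NaN | NaN |
| 5 | 0 | 1 | 1 | 1.6 | 1.58 | 1.61 |
| 6 | 1 | 0 | 1 | 1.7 | 1.68 | 1.69 |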


1-STRICT FILTERING: no missing replicates


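| | g0 | g1 | g2 | rep_0 | rep_1 | rep_2 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.05 | 0.95 |
| 1 | 0.0 | 0.0 | 1.0 | 1.1 | 1.12 | 1.08 |
| 2 | 1.0 | 0.0 | 0.0 | 1.3 | 1.28 | 1.33 |
| 3 | 0.0 | 1.0 | 1.0 | 1.6 | 1.58 | 1.61 |
| 4 | 1.0 | 0.0 | 1.0 | 1.7 | 1.68 | 1.69 |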


2-PERMISSIVE FILTERING: REMOVE ROWS WITH ALL NaNS


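| | g0 | g1 | g2 | rep_0 | rep_1 | rep_2 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.05 | 0.95 |
| 1 | 0.0 | 0.0 | 1.0 | 1.1 | 1.12 | 1.08 |
| 2 | 0.0 | 1.0 | 0.0 | 1.2 | NaN | 1.22 |
| 3 | 1.0 | 0.0 | 0.0 | 1.3 | 1.28 | 1.33 |
| 4 | 0.0 | 1.0 | 1.0 | 1.6 | 1.58 | 1.61 |
| 5 | 1.0 | 0.0 | 1.0 | 1.7 | 1.68 | 1.69 |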


3-MISSING STATES: CONFIGURATIONS NOT PRESENT IN DATASET

Missing 1 states out of 8 total:


| | g0 | g1 | g2 |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
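Conceptually, the missing states are the set difference between all 2^N configurations and the observed ones. A dependency-free sketch using the states of `df_missing`, not the package's implementation:

```python
from itertools import product

# States present in df_missing (one tuple per row: g0, g1, g2)
observed = {
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0),
    (1, 1, 0), (0, 1, 1), (1, 0, 1),
}
N = 3

# Every configuration of N bits that never appears in the table
missing = [s for s in product([0, 1], repeat=N) if s not in observed]
print(missing)   # [(1, 1, 1)]
```

State `(1, 1, 0)` counts as observed even though all its replicates are NaN, because the row exists in the table. Only `(1, 1, 1)` is absent.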
