In [1]:
%pip install ccrvam --upgrade


Note: you may need to restart the kernel to use updated packages.


> Make sure the latest version of `ccrvam` is installed via `pip`. Details about the latest release are available at https://pypi.org/project/ccrvam/

In [2]:
import numpy as np
from ccrvam import (
    best_subset_ccram,
    all_subsets_ccram
)
from ccrvam import DataProcessor

# 2-Dimensional Case 

### Create Sample Contingency Table

For a 2D contingency table:

- `axis=0`: First variable ($X_1$) with 5 categories
- `axis=1`: Second variable ($X_2$) with 3 categories

The axis indexing follows NumPy's convention, starting from the outermost dimension. The variables are ordered such that:

- $X_1$ corresponds to the rows (`axis=0`)
- $X_2$ corresponds to the columns (`axis=1`)

This ordering matters because regression association measures are directional: predicting $X_2$ from $X_1$ is not the same as predicting $X_1$ from $X_2$.
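As a quick check of the axis convention, the sketch below uses plain NumPy (no `ccrvam` calls) on the same table that is defined in the next cell. The names `table`, `x1_marginal`, and `x2_marginal` are illustrative only.

```python
import numpy as np

# Same counts as the sample table: rows are X1 (axis=0), columns are X2 (axis=1).
table = np.array([
    [0, 0, 20],
    [0, 10, 0],
    [20, 0, 0],
    [0, 10, 0],
    [0, 0, 20],
])

n_x1, n_x2 = table.shape         # 5 categories of X1, 3 categories of X2
x1_marginal = table.sum(axis=1)  # sum over columns: counts per X1 category
x2_marginal = table.sum(axis=0)  # sum over rows: counts per X2 category

print(table.shape)   # (5, 3)
print(x1_marginal)   # [20 10 20 10 20]
print(x2_marginal)   # [20 20 40]
```

Note that the functions used below refer to variables by 1-based position (`response=2` means $X_2$), while NumPy axes are 0-based.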

In [3]:
contingency_table = np.array([
    [0, 0, 20],
    [0, 10, 0],
    [20, 0, 0],
    [0, 10, 0],
    [0, 0, 20]
])

### Calculate All Subsets CCRAM and Find Best Subset

For a 2D contingency table, we can calculate CCRAM for predicting $X_2$ (response) from $X_1$ (predictor). Since there's only one possible predictor, this demonstrates the basic functionality.

In [4]:
# Calculate all subsets CCRAM for predicting X2 (variable 2, axis=1) from X1 (variable 1, axis=0)
result_2d = all_subsets_ccram(
    contingency_table,
    response=2,  # Predict X2 (columns)
    scaled=False  # Use CCRAM (not scaled)
)

print("=== All Subsets CCRAM Results (2D) ===")
result_2d.results_df


=== All Subsets CCRAM Results (2D) ===


Unnamed: 0,k,predictors,response,ccram
0,1,"(1,)",2,0.84375
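The value 0.84375 can be reproduced by hand. The sketch below assumes that CCRAM is 12 times the variance of the checkerboard copula regression of the response score on the predictor; this is an assumption about the definition, not a copy of the library's code, and all names are illustrative.

```python
import numpy as np

table = np.array([
    [0, 0, 20],
    [0, 10, 0],
    [20, 0, 0],
    [0, 10, 0],
    [0, 0, 20],
], dtype=float)

p = table / table.sum()   # joint pmf of (X1, X2)
p_x1 = p.sum(axis=1)      # marginal pmf of the predictor X1
p_x2 = p.sum(axis=0)      # marginal pmf of the response X2

# Checkerboard scores of X2: midpoints (F(j-1) + F(j)) / 2 of the marginal CDF.
cdf_x2 = np.cumsum(p_x2)
scores = cdf_x2 - p_x2 / 2

# Regression of the score on X1: E[score | X1 = i] for each row i.
cond = p / p_x1[:, None]  # conditional pmf of X2 given X1
regression = cond @ scores

# Assumed definition: 12 * sum_i P(X1 = i) * (E[score | X1 = i] - 1/2)^2
ccram = 12 * np.sum(p_x1 * (regression - 0.5) ** 2)
print(round(ccram, 5))    # 0.84375
```

Under this assumed definition, the hand computation matches the `ccram` column reported by `all_subsets_ccram` for this table.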


In [5]:
print("\n=== Summary Statistics ===")
result_2d.summary()


=== Summary Statistics ===


Unnamed: 0,k,max_ccram,mean_ccram,min_ccram,n_subsets
0,1,0.84375,0.84375,0.84375,1


### Find the Best Subset

The `best_subset_ccram` function returns the predictor subset with the highest CCRAM value (or SCCRAM value when `scaled=True`).


In [6]:
# Find the best subset for predicting X2
best_2d = best_subset_ccram(
    contingency_table,
    response=2,
    scaled=False
)

print("=== Best Subset Result (2D) ===")
best_2d


=== Best Subset Result (2D) ===


BestSubsetCCRAMResult(
  Optimal Predictors: (X1)
  Response: X2
  CCRAM: 0.843750
  Number of Predictors (k): 1
  Rank within k=1 subsets: 1/1
)

In [7]:
print("=== Best Subset Summary DataFrame ===")
best_2d.summary_df()

=== Best Subset Summary DataFrame ===


Unnamed: 0,metric,value
0,Optimal Predictors,(X1)
1,Response,X2
2,CCRAM,0.843750
3,Number of Predictors (k),1
4,Rank within k,1
5,Total subsets with k predictors,1


# 4-Dimensional Case (Real Data Analysis)

### Load Sample Data in Cases / Frequency Form

This example demonstrates how one can analyze relationships between multiple categorical variables in a clinical dataset of back pain treatments using the `DataProcessor`.

The dataset contains 4 categorical variables from a medical study:

| Variable | Description | Categories |
|----------|-------------|------------|
| X₁ | Length of Previous Attack | 1=Short, 2=Long |
| X₂ | Pain Change | 1=Better, 2=Same, 3=Worse |
| X₃ | Lordosis | 1=Absent/Decreasing, 2=Present/Increasing |
| X₄ (Pain) | Back Pain Outcome | 1=worse (W), 2=same (S), 3=slight.improvement (SI), 4=moderate.improvement (MODI), 5=marked.improvement (MARI), 6=complete.relief (CR) |

Loading the data takes three steps:

1. Define the variable names and the dimension tuple (the number of categories of each variable, in the same order as the variable names).
2. (Optional) If any variable has category names that are not integers, create a category mapping for it (required, for instance, for the `pain` variable in the dataset above).
3. Load the data into a contingency table, either from a case-form or frequency-form file (path passed as an argument) or from a table typed in directly, applying the mappings.

Citation for the above dataset:
- J. A. Anderson, Regression and ordered categorical variables, Journal of the Royal Statistical Society: Series B (Methodological) 46 (1984) 1–22.
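To make the two input layouts in step 3 concrete, here is a minimal sketch on hypothetical toy data. It assumes that case form stores one row per observation and that frequency form stores one row per category combination followed by a count; the exact file layout expected by `DataProcessor.load_data` may differ in details. The helper functions are illustrative and not part of `ccrvam`.

```python
import numpy as np

# Hypothetical toy data: two variables with 2 and 3 categories (1-based codes).
dimension = (2, 3)

# Case form: one row per observation.
case_rows = [(1, 1), (1, 3), (2, 2), (2, 2), (1, 3)]

# Frequency form: one row per distinct combination; the last entry is the count.
freq_rows = [(1, 1, 1), (1, 3, 2), (2, 2, 2)]

def table_from_cases(rows, shape):
    """Count each observation into its cell of the contingency table."""
    table = np.zeros(shape, dtype=int)
    for row in rows:
        idx = tuple(c - 1 for c in row)  # 1-based category -> 0-based index
        table[idx] += 1
    return table

def table_from_freqs(rows, shape):
    """Add each combination's count to its cell of the contingency table."""
    table = np.zeros(shape, dtype=int)
    for *cats, count in rows:
        idx = tuple(c - 1 for c in cats)
        table[idx] += count
    return table

from_cases = table_from_cases(case_rows, dimension)
from_freqs = table_from_freqs(freq_rows, dimension)
print(from_cases)
print(np.array_equal(from_cases, from_freqs))  # True
```

Both layouts describe the same table, which is why the two `load_data` calls in the next cell print identical arrays.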

In [8]:
var_list_4d = ["x1", "x2", "x3", "pain"]
category_map_4d = {
    "pain": {
        "worse": 1,
        "same": 2,
        "slight.improvement": 3,
        "moderate.improvement": 4,
        "marked.improvement": 5,
        "complete.relief": 6
    },
}
data_dimension = (2, 3, 2, 6)

# Load the same data from a case-form file
rda_contingency_table = DataProcessor.load_data(
    "./data/caseform.pain.txt",
    data_form="case_form",
    dimension=data_dimension,
    var_list=var_list_4d,
    category_map=category_map_4d,
    named=True,
    delimiter="\t"
)
print("Read contingency table from case form data!")
print(rda_contingency_table)

# Load the same data from a frequency-form file
rda_contingency_table_from_freq = DataProcessor.load_data(
    "./data/freqform.pain.txt",
    data_form="frequency_form",
    dimension=data_dimension,
    var_list=var_list_4d,
    category_map=category_map_4d,
    named=True,
    delimiter="\t"
)
print("Read contingency table from frequency form data!")
print(rda_contingency_table_from_freq)

Read contingency table from case form data!
[[[[0 1 0 0 2 4]
   [0 0 0 1 3 0]]

  [[0 2 3 0 6 4]
   [0 1 0 2 0 1]]

  [[0 0 0 0 2 2]
   [0 0 1 1 3 0]]]


 [[[0 0 3 0 1 2]
   [0 1 0 0 3 0]]

  [[0 3 4 5 6 2]
   [1 4 4 3 0 1]]

  [[2 2 1 5 2 0]
   [2 0 2 3 0 0]]]]
Read contingency table from frequency form data!
[[[[0 1 0 0 2 4]
   [0 0 0 1 3 0]]

  [[0 2 3 0 6 4]
   [0 1 0 2 0 1]]

  [[0 0 0 0 2 2]
   [0 0 1 1 3 0]]]


 [[[0 0 3 0 1 2]
   [0 1 0 0 3 0]]

  [[0 3 4 5 6 2]
   [1 4 4 3 0 1]]

  [[2 2 1 5 2 0]
   [2 0 2 3 0 0]]]]


### Calculate All Subsets CCRAM for Predicting Pain Outcome

With 4 dimensions, we have 3 potential predictors ($X_1$, $X_2$, $X_3$) for predicting the Pain outcome ($X_4$). This gives us $\binom{3}{1} + \binom{3}{2} + \binom{3}{3} = 7$ possible predictor subsets.
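The count of 7 can be confirmed by enumerating the subsets directly with the standard library. This is only an illustration of the combinatorics, not a description of how `all_subsets_ccram` enumerates them internally.

```python
from itertools import combinations

predictors = (1, 2, 3)  # X1, X2, X3; variable 4 (Pain) is the response

# All non-empty subsets, grouped by size k = 1, 2, 3.
subsets = [
    subset
    for k in range(1, len(predictors) + 1)
    for subset in combinations(predictors, k)
]

print(len(subsets))  # 7
print(subsets)       # [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
```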

In [9]:
# Define variable names for better readability in output
variable_names_4d = {
    1: "PrevAttack",
    2: "PainChange", 
    3: "Lordosis",
    4: "Pain"
}

# Calculate all subsets CCRAM for predicting Pain (variable 4)
result_4d = all_subsets_ccram(
    rda_contingency_table,
    response=4,  # Predict Pain outcome
    scaled=False,
    variable_names=variable_names_4d
)

print("=== All Subsets CCRAM Results (4D) ===")
result_4d.results_df


=== All Subsets CCRAM Results (4D) ===


Unnamed: 0,k,predictors,response,ccram,predictor_names
0,1,"(1,)",4,0.140553,"(PrevAttack,)"
1,1,"(2,)",4,0.0618,"(PainChange,)"
2,1,"(3,)",4,0.044115,"(Lordosis,)"
3,2,"(1, 2)",4,0.198164,"(PrevAttack, PainChange)"
4,2,"(1, 3)",4,0.176897,"(PrevAttack, Lordosis)"
5,2,"(2, 3)",4,0.117172,"(PainChange, Lordosis)"
6,3,"(1, 2, 3)",4,0.25756,"(PrevAttack, PainChange, Lordosis)"


In [10]:
print("=== Summary Statistics by Number of Predictors ===")
result_4d.summary()

=== Summary Statistics by Number of Predictors ===


Unnamed: 0,k,max_ccram,mean_ccram,min_ccram,n_subsets
0,1,0.140553,0.082156,0.044115,3
1,2,0.198164,0.164077,0.117172,3
2,3,0.25756,0.25756,0.25756,1


### Find the Best Predictor Subset for Pain Outcome

The `best_subset_ccram` function identifies which combination of predictors yields the highest CCRAM value.
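In the all-subsets output above, adding a predictor never lowers CCRAM, so an unconstrained search over these values ends at the full predictor set. The sketch below checks this on the printed values only; it is a sanity check in plain Python, not the library's search procedure.

```python
# CCRAM values copied from the all-subsets output for the pain data.
ccram = {
    (1,): 0.140553,
    (2,): 0.0618,
    (3,): 0.044115,
    (1, 2): 0.198164,
    (1, 3): 0.176897,
    (2, 3): 0.117172,
    (1, 2, 3): 0.25756,
}

# Pairs where a strict superset has a lower CCRAM than one of its subsets.
violations = [
    (small, big)
    for small in ccram
    for big in ccram
    if set(small) < set(big) and ccram[big] < ccram[small]
]
print(violations)  # []

best = max(ccram, key=ccram.get)
print(best)        # (1, 2, 3)
```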


In [11]:
# Find the best subset for predicting Pain
best_4d = best_subset_ccram(
    rda_contingency_table,
    response=4,
    scaled=False,
    variable_names=variable_names_4d
)

print("=== Best Subset Result (4D) ===")
best_4d

=== Best Subset Result (4D) ===


BestSubsetCCRAMResult(
  Optimal Predictors: (X1, X2, X3)
  Response: X4
  CCRAM: 0.257560
  Number of Predictors (k): 3
  Rank within k=3 subsets: 1/1
)

In [12]:
print("=== Best Subset Summary DataFrame ===")
best_4d.summary_df()

=== Best Subset Summary DataFrame ===


Unnamed: 0,metric,value
0,Optimal Predictors,"(X1, X2, X3)"
1,Response,X4
2,CCRAM,0.257560
3,Number of Predictors (k),3
4,Rank within k,1
5,Total subsets with k predictors,1


### Get Top Subsets and Filter by k

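The `SubsetCCRAMResult` object provides helper methods to explore the results: `get_top_subsets(n=...)` returns the `n` highest-ranked subsets across all values of `k`, and `get_subsets_by_k(k=...)` returns every subset with exactly `k` predictors.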


In [13]:
# Get top 3 subsets across all k values
print("=== Top 3 Subsets Overall ===")
result_4d.get_top_subsets(n=3)


=== Top 3 Subsets Overall ===


Unnamed: 0,k,predictors,response,ccram,predictor_names
0,3,"(1, 2, 3)",4,0.25756,"(PrevAttack, PainChange, Lordosis)"
1,2,"(1, 2)",4,0.198164,"(PrevAttack, PainChange)"
2,2,"(1, 3)",4,0.176897,"(PrevAttack, Lordosis)"


In [14]:
# Get all subsets with exactly k=2 predictors
print("=== All Subsets with k=2 Predictors ===")
result_4d.get_subsets_by_k(k=2)

=== All Subsets with k=2 Predictors ===


Unnamed: 0,k,predictors,response,ccram,predictor_names
0,2,"(1, 2)",4,0.198164,"(PrevAttack, PainChange)"
1,2,"(1, 3)",4,0.176897,"(PrevAttack, Lordosis)"
2,2,"(2, 3)",4,0.117172,"(PainChange, Lordosis)"


### Find Best Subset with Fixed k

You can also find the best subset constrained to a specific number of predictors using the `k` parameter.
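Conceptually, fixing `k` restricts the comparison to subsets of that size. The plain-Python sketch below illustrates this on the values printed in the all-subsets output; `best_for_k` is a hypothetical helper, not a `ccrvam` function.

```python
# CCRAM values copied from the all-subsets output for the pain data.
ccram = {
    (1,): 0.140553,
    (2,): 0.0618,
    (3,): 0.044115,
    (1, 2): 0.198164,
    (1, 3): 0.176897,
    (2, 3): 0.117172,
    (1, 2, 3): 0.25756,
}

def best_for_k(values, k):
    """Return (best subset, its CCRAM, number of subsets of size k)."""
    candidates = {s: v for s, v in values.items() if len(s) == k}
    best = max(candidates, key=candidates.get)
    return best, candidates[best], len(candidates)

sketch_k1 = best_for_k(ccram, 1)
sketch_k2 = best_for_k(ccram, 2)
print(sketch_k1)  # ((1,), 0.140553, 3)
print(sketch_k2)  # ((1, 2), 0.198164, 3)
```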


In [15]:
# Find best subset with exactly k=1 predictor
best_k1 = best_subset_ccram(
    rda_contingency_table,
    response=4,
    scaled=False,
    k=1,  # Constrain to single predictor
    variable_names=variable_names_4d
)

print("=== Best Single Predictor (k=1) ===")
best_k1


=== Best Single Predictor (k=1) ===


BestSubsetCCRAMResult(
  Optimal Predictors: (X1)
  Response: X4
  CCRAM: 0.140553
  Number of Predictors (k): 1
  Rank within k=1 subsets: 1/3
)

In [16]:
# Find best subset with exactly k=2 predictors
best_k2 = best_subset_ccram(
    rda_contingency_table,
    response=4,
    scaled=False,
    k=2,  # Constrain to two predictors
    variable_names=variable_names_4d
)

print("=== Best Two-Predictor Subset (k=2) ===")
best_k2

=== Best Two-Predictor Subset (k=2) ===


BestSubsetCCRAMResult(
  Optimal Predictors: (X1, X2)
  Response: X4
  CCRAM: 0.198164
  Number of Predictors (k): 2
  Rank within k=2 subsets: 1/3
)

### Compare CCRAM vs SCCRAM for 4D Case

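Finally, we compare the unscaled and scaled versions of CCRAM to see how normalization affects the results. In the outputs below, every SCCRAM value is slightly larger than the corresponding CCRAM value, and the ranking of the subsets is unchanged.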


In [17]:

# Calculate SCCRAM for all subsets
result_4d_scaled = all_subsets_ccram(
    contingency_table=rda_contingency_table,
    response=4,
    scaled=True,  # Use SCCRAM
    variable_names=variable_names_4d
)

print("=== All Subsets SCCRAM Results (4D) ===")
result_4d_scaled.results_df


=== All Subsets SCCRAM Results (4D) ===


Unnamed: 0,k,predictors,response,sccram,predictor_names
0,1,"(1,)",4,0.146637,"(PrevAttack,)"
1,1,"(2,)",4,0.064476,"(PainChange,)"
2,1,"(3,)",4,0.046025,"(Lordosis,)"
3,2,"(1, 2)",4,0.206742,"(PrevAttack, PainChange)"
4,2,"(1, 3)",4,0.184554,"(PrevAttack, Lordosis)"
5,2,"(2, 3)",4,0.122244,"(PainChange, Lordosis)"
6,3,"(1, 2, 3)",4,0.26871,"(PrevAttack, PainChange, Lordosis)"
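Comparing the two result tables, the ratio SCCRAM / CCRAM is the same for every subset (about 1.0433). The sketch below checks this for two rows and compares it with a candidate scaling factor. It assumes that SCCRAM divides CCRAM by 12 times the variance of the checkerboard score of the response, which equals $1 - \sum_j p_j^3$ for response marginal probabilities $p_j$; this is an assumption that happens to match the printed numbers, not a statement taken from the library.

```python
import numpy as np

# Contingency table copied from the load_data output, shape (2, 3, 2, 6).
table = np.array([
    [[[0, 1, 0, 0, 2, 4], [0, 0, 0, 1, 3, 0]],
     [[0, 2, 3, 0, 6, 4], [0, 1, 0, 2, 0, 1]],
     [[0, 0, 0, 0, 2, 2], [0, 0, 1, 1, 3, 0]]],
    [[[0, 0, 3, 0, 1, 2], [0, 1, 0, 0, 3, 0]],
     [[0, 3, 4, 5, 6, 2], [1, 4, 4, 3, 0, 1]],
     [[2, 2, 1, 5, 2, 0], [2, 0, 2, 3, 0, 0]]],
])

# Ratio SCCRAM / CCRAM for two of the printed rows.
ratio_full = 0.26871 / 0.25756   # predictors (1, 2, 3)
ratio_x1 = 0.146637 / 0.140553   # predictor (1,)

# Assumed scaling factor, computed from the marginal distribution of Pain.
pain_counts = table.sum(axis=(0, 1, 2))
p = pain_counts / pain_counts.sum()
scale = 1 - np.sum(p ** 3)

print(pain_counts)  # [ 5 14 18 20 28 16]
print(round(ratio_full, 4), round(ratio_x1, 4), round(1 / scale, 4))
```

Because the factor depends only on the response's marginal distribution, it is identical across predictor subsets, which is consistent with the unchanged ranking.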


In [18]:
# Overall best subset using SCCRAM
best_4d_scaled = best_subset_ccram(
    rda_contingency_table,
    response=4,
    scaled=True,
    variable_names=variable_names_4d
)

print("=== Best Subset using SCCRAM ===")
best_4d_scaled

=== Best Subset using SCCRAM ===


BestSubsetCCRAMResult(
  Optimal Predictors: (X1, X2, X3)
  Response: X4
  SCCRAM: 0.268710
  Number of Predictors (k): 3
  Rank within k=3 subsets: 1/1
)