## **Pairwise HPO Correlation Analysis Workflow**

This notebook demonstrates how to load phenopacket data, generate HPO matrices and disease target vectors, and run correlation analyses with the `ppkt2synergy` package. The workflow runs from data loading through statistical testing to visualization.

---

### **Install Dependencies**

To get started, ensure you have the required libraries installed. Run the following command in your terminal or command prompt to install `ppkt2synergy` and `gpsea`:

```bash
pip install ppkt2synergy
```

### **Import Libraries**
Once the dependencies are installed, import the necessary libraries to begin your analysis:

In [1]:
from ppkt2synergy import CohortDataLoader, HPOStatisticsAnalyzer, PhenopacketMatrixProcessor

### **Analysis of Correlation for Terms from a Single Cohort**

This section demonstrates how to perform correlation analysis for HPO terms, disease status, sex, and variant effects in a single cohort.

---

##### **Loading Data and Preparing HPO Matrices**

In [2]:
cohort_name = "FBN1"

phenopackets = CohortDataLoader.from_ppkt_store(cohort_name=cohort_name)
hpo_matrix, _ = PhenopacketMatrixProcessor.prepare_hpo_data(
    phenopackets=phenopackets, 
    threshold=0, 
    mode=None, 
    use_label=True,
    nan_strategy=None)

##### **Analyzing Data and Plotting the Heatmap**

In [3]:
correlation_type = "spearman"  # Can also be "kendall" or "phi"
analyzer = HPOStatisticsAnalyzer(
    hpo_data=hpo_matrix,
    min_individuals_for_correlation_test=30)

analyzer.compute_correlation_matrix(
    stats_name=correlation_type,
    n_jobs=32,
    output_file='hpo_correlation_matrix_fbn1.xlsx')

analyzer.plot_correlation_heatmap_with_significance( 
    correlation_type,
    lower_bound=-0.55,
    upper_bound=0.55,
    title_name=f"Cohort {cohort_name}",
    output_file='heatmap_fbn1.html')
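
For intuition about what a pairwise coefficient measures on two binary HPO columns, the sketch below computes the phi coefficient directly from the 2x2 contingency counts. It is a standalone illustration using only the standard library, not the implementation inside `ppkt2synergy`; the `phi_coefficient` helper and the toy columns are hypothetical.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 values."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when either column is constant
    return (n11 * n00 - n10 * n01) / denom


# Two toy HPO columns for eight individuals (1 = observed, 0 = excluded).
term_a = [1, 1, 1, 0, 0, 0, 1, 0]
term_b = [1, 1, 0, 0, 0, 1, 1, 0]
print(phi_coefficient(term_a, term_b))
```

For two strictly binary columns with no missing values, the Pearson and Spearman coefficients coincide with phi, because the ranks of a binary variable are a linear function of its values.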

### **Analysis of Correlation for Terms Across Multiple Cohorts**

This section demonstrates how to perform correlation analysis for HPO terms, disease status, sex, and variant effects across multiple cohorts.

---

##### **Loading Data and Preparing HPO Matrices**

In [4]:
multi_cohort_names = ["TGFBR1","TGFBR2","SMAD3","TGFB2","TGFB3","SMAD2"]

# Load phenopackets from multiple cohorts
multi_phenopackets = CohortDataLoader.from_ppkt_store(cohort_name=multi_cohort_names)

# Prepare HPO and target matrices for multiple cohorts
multi_hpo_matrix, _ = PhenopacketMatrixProcessor.prepare_hpo_data(
    multi_phenopackets,
    threshold=0,
    mode=None,
    use_label=True,
    nan_strategy=None
)

##### **Analyzing Data and Plotting the Heatmap**

In [5]:
correlation_type = "spearman"  # Can be "phi"

# Initialize analyzer for multiple cohorts
multi_analyzer = HPOStatisticsAnalyzer(
    multi_hpo_matrix, 
    min_individuals_for_correlation_test=30
)

# Compute correlation
multi_analyzer.compute_correlation_matrix(correlation_type, n_jobs=32)

# Plot heatmap
multi_analyzer.plot_correlation_heatmap_with_significance(
    correlation_type,
    lower_bound=-0.55,
    upper_bound=0.55,
    title_name='Loeys-Dietz syndrome')
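
One plausible reading of `min_individuals_for_correlation_test` is that a pair of terms is tested only when enough individuals have a recorded value for both terms. The sketch below illustrates that idea in plain Python. The interpretation, the `paired_count` and `is_testable` helpers, and the toy columns are all assumptions; the package may apply its threshold differently.

```python
def paired_count(col_a, col_b):
    """Count individuals with a recorded (non-None) value in both columns."""
    return sum(1 for a, b in zip(col_a, col_b) if a is not None and b is not None)


def is_testable(col_a, col_b, min_individuals):
    """True when the pair has at least `min_individuals` complete observations."""
    return paired_count(col_a, col_b) >= min_individuals


# Toy columns: None marks an individual for whom the term was not recorded.
term_a = [1, 0, None, 1, 1, None, 0, 1]
term_b = [1, None, 0, 0, 1, 1, 0, None]

print(paired_count(term_a, term_b))
print(is_testable(term_a, term_b, min_individuals=4))
```

Under this reading, pooling several cohorts raises the number of complete pairs, so more term pairs clear a fixed threshold such as 30.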

### **Comprehensive Correlation Analysis Across All Cohorts in Phenopacket Store**
This section demonstrates how to perform correlation analysis for HPO terms, disease status, sex, and variant effects across all cohorts stored in the Phenopacket Store. The steps are: load the phenopackets, prepare the HPO matrices and disease target vectors, and compute pairwise correlations across all available cohorts. Pooling every cohort shows how HPO terms relate to one another and to disease-related factors beyond a single dataset.

---

##### **Extracting Phenopackets and Creating a Target Matrix for Multiple Cohorts**
The following splits the Phenopacket Store into phenopackets that belong to the target cohort(s), here `FBN1`, and phenopackets from all other cohorts. This split is the basis for a binary target matrix that marks membership in the target cohorts.

In [6]:
target_cohort_names = 'FBN1'
target_ppkts, non_target_ppkts = CohortDataLoader.partition_phenopackets_by_cohorts(target_cohort_names)
all_ppkts = target_ppkts + non_target_ppkts
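
The idea of a binary target derived from such a partition can be sketched in plain Python: individuals from the target group get 1, everyone else gets 0. The IDs and the `binary_target` helper below are hypothetical and stand in for real phenopackets; this is not how the package builds its target matrix internally.

```python
# Hypothetical individual IDs standing in for the two partitions.
target_ids = ["FBN1_001", "FBN1_002"]
non_target_ids = ["TGFBR1_001", "SMAD3_001", "SMAD3_002"]


def binary_target(target_ids, non_target_ids):
    """Map each individual to 1 (target cohort) or 0 (any other cohort)."""
    labels = {pid: 1 for pid in target_ids}
    labels.update({pid: 0 for pid in non_target_ids})
    return labels


target = binary_target(target_ids, non_target_ids)
print(target)
```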

##### **Loading Data and Preparing HPO Matrices**

In [7]:
all_hpo_matrix, all_target_matrix = PhenopacketMatrixProcessor.prepare_hpo_data(
    phenopackets=all_ppkts, 
    threshold=0, 
    mode=None, 
    use_label=True,
    nan_strategy=None
)

target_hpo_matrix, target_target_matrix = PhenopacketMatrixProcessor.prepare_hpo_data(
    phenopackets=target_ppkts,  
    threshold=0, 
    mode=None, 
    use_label=True,
    nan_strategy=None
)

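# Both results are indexed as pairs here; element 0 is the HPO table (it exposes .columns).
# Keep only the HPO terms that also occur in the target cohort.
common_columns = all_hpo_matrix[0].columns.intersection(target_hpo_matrix[0].columns)
all_hpo_matrix_filtered = all_hpo_matrix[0][common_columns]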
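
The column filter above can be pictured with plain lists: the full store contains many terms that never occur in the target cohort, and the intersection drops them. The `restrict_columns` helper and the term lists are hypothetical and only mirror the idea; the notebook itself relies on the pandas `columns.intersection` call shown above.

```python
def restrict_columns(all_columns, target_columns):
    """Keep the columns of the full matrix that also occur in the target matrix."""
    target_set = set(target_columns)
    return [c for c in all_columns if c in target_set]


# Toy term lists: the full store has terms unrelated to the target cohort.
all_terms = ["Scoliosis", "Seizure", "Ectopia lentis", "Hepatomegaly", "Aortic root aneurysm"]
target_terms = ["Ectopia lentis", "Aortic root aneurysm", "Scoliosis"]

shared = restrict_columns(all_terms, target_terms)
print(shared)
```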



##### **Analyzing Data and Plotting the Heatmap**

In [8]:
correlation_type = "spearman"
# Initialize the HPO statistics analyzer with the filtered HPO matrix and target matrix
hpo_analyzer = HPOStatisticsAnalyzer(
    (all_hpo_matrix_filtered, target_hpo_matrix[1]),
    min_individuals_for_correlation_test=30)

# Compute the correlation and p-value matrices using Spearman correlation
hpo_analyzer.compute_correlation_matrix(correlation_type, n_jobs=32)

# Plot the correlation heatmap with significance markers
hpo_analyzer.plot_correlation_heatmap_with_significance(
    correlation_type,
    lower_bound=-0.55,
    upper_bound=0.55,
    title_name="FBN1 Cohort HPO Correlation (From Full Phenopacket Store)" 
)