<span style="color: green">Author: Ashkan Nikfarjam</span>

Now that we have our data, we will run an AHP (Analytic Hierarchy Process) analysis, which acts as our feature-selection step.

AHPANL.py contains the functions that compute the AHP metrics for us.

* <span style="color: green">Two-Sample t-Test:</span>

Purpose: Identifies statistically significant differences in gene expression between two groups (e.g., cancerous vs. healthy cells).

Method: Compares the means of two independent samples using the t-statistic.

Output: A t-score and p-value. A small p-value indicates significant differences in expression.
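
As a rough illustration of this comparison, the sketch below runs a two-sample t-test with `scipy.stats.ttest_ind` on made-up expression values. The arrays and the choice of `equal_var=False` (Welch's variant) are assumptions for illustration only; `compute_t_test` in AHPANL.py may be implemented differently.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical expression values for one gene (not taken from the dataset)
cancer_values = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])
healthy_values = np.array([4.0, 4.2, 3.9, 4.1, 3.8])

# Welch's t-test: does not assume equal variances or equal sample sizes
t_stat, p_value = ttest_ind(cancer_values, healthy_values, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

A positive t-statistic here means the first group has the higher mean; the sign flips if the arguments are swapped.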

* <span style="color: green">Entropy Test:</span>

Purpose: Measures the disorder in gene expression levels.

Method: Computes entropy using histogram-based probability distributions.

Output: One entropy value per group. Higher entropy indicates more variable expression, and such genes can be more informative for classification.
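
A minimal sketch of histogram-based entropy follows. The function name `histogram_entropy`, the bin count of 10, and the natural-log base are assumptions; `compute_entropy` in AHPANL.py may use different settings.

```python
import numpy as np
from scipy.stats import entropy

def histogram_entropy(values, bins=10):
    """Shannon entropy (natural log) of a histogram of `values`."""
    counts, _ = np.histogram(values, bins=bins)
    # scipy normalizes the counts to probabilities before computing entropy
    return float(entropy(counts))

flat_gene = np.full(50, 3.0)                 # identical expression in every sample
variable_gene = np.linspace(0.0, 1.0, 100)   # expression spread evenly over a range

print(histogram_entropy(flat_gene))
print(histogram_entropy(variable_gene))
```

The constant gene falls into a single bin and gets zero entropy, while the evenly spread gene fills all ten bins equally and reaches the maximum for ten bins, ln(10).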

* <span style="color: green">Wilcoxon Rank-Sum Test:</span>

Purpose: A non-parametric test used to rank genes based on their median expression differences.

Method: Compares the ranks of two independent samples instead of their means.

Output: A Wilcoxon statistic and a p-value. A low p-value suggests significant differences in gene ranks.
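
For two independent samples, SciPy provides the rank-sum test as `scipy.stats.ranksums`, which accepts groups of different sizes. (`scipy.stats.wilcoxon` is the paired signed-rank test and requires equal-length samples.) The sketch below uses invented values and is not the notebook's own computation.

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical expression values; the groups deliberately differ in size
cancer_values = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])   # 6 samples
healthy_values = np.array([4.0, 4.2, 3.9, 4.1, 3.8])       # 5 samples

# Rank-sum test on independent samples; no pairing is needed
stat, p_value = ranksums(cancer_values, healthy_values)

print(f"statistic = {stat:.3f}, p = {p_value:.3g}")
```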

* <span style="color: green">Signal-to-Noise Ratio (SNR):</span>

Purpose: Compares the difference in mean expression levels relative to the standard deviation.

Method: SNR is calculated as the difference between the means of two groups divided by the sum of their standard deviations.

Output: A higher SNR suggests that the gene has a strong discriminatory power between groups.
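
The formula above can be sketched as follows. The name `snr_sketch`, the use of the population standard deviation (NumPy's default), and the guard against a zero denominator are assumptions, not necessarily what `compute_snr` does.

```python
import numpy as np

def snr_sketch(group_a, group_b):
    """(mean_a - mean_b) / (std_a + std_b); NaN when both groups are constant."""
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    mean_diff = group_a.mean() - group_b.mean()
    std_sum = group_a.std() + group_b.std()
    if std_sum == 0:
        # Both groups have no spread, so the ratio is undefined
        return float("nan")
    return float(mean_diff / std_sum)

print(snr_sketch([2, 4, 6], [1, 2, 3]))
print(snr_sketch([1, 1, 1], [1, 1, 1]))
```

Returning NaN explicitly for constant groups avoids a divide-by-zero warning and makes the undefined cases easy to find afterwards.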

* <span style="color: green">AHP Weighted Ranking:</span>

Purpose: Integrates statistical measures into a single weighted ranking system to prioritize significant genes.

Method: Normalizes scores across all statistical tests and applies predefined weights.

Output: A final ranking score indicating the importance of each gene in classification.
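
One way to read "normalizes scores across all statistical tests and applies predefined weights" is sketched below: each metric is min-max normalized across genes, then combined with a weighted sum. The gene names, score values, and weights are all invented for illustration; in a full AHP the weights would come from a pairwise comparison matrix, and `compute_ahp_weighted_ranking` may work differently.

```python
import pandas as pd

# Hypothetical scores for three genes (values invented for illustration)
scores = pd.DataFrame(
    {"T-Score": [14.9, 5.5, 0.1], "Entropy": [1.6, 2.1, 0.2], "SNR": [0.44, 0.23, 0.01]},
    index=["gene_a", "gene_b", "gene_c"],
)

# Assumed weights; they sum to 1
weights = pd.Series({"T-Score": 0.5, "Entropy": 0.2, "SNR": 0.3})

# Min-max normalize each metric across genes, so every column spans [0, 1]
normalized = (scores - scores.min()) / (scores.max() - scores.min())

# Weighted sum per gene, then order from most to least important
ahp_score = normalized.mul(weights, axis=1).sum(axis=1)
ranking = ahp_score.sort_values(ascending=False)
print(ranking)
```

Normalizing per column matters here: it puts the metrics on a common scale while keeping the differences between genes.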

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, wilcoxon, entropy
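# Import only the AHP helper functions used in this notebook
from AHPANL import compute_t_test, compute_entropy, compute_snr, compute_ahp_weighted_ranking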

# Load the dataset
mutated_df = pd.read_csv("./data/mutatedDataSet.CSV", index_col=0)  # Cancer data
benign_df = pd.read_csv("./data/normalDataset.CSV", index_col=0)  # Healthy data

# Get all genes (intersection ensures only genes present in both datasets are analyzed)
common_genes = mutated_df.index.intersection(benign_df.index)

# Prepare AHP Analysis matrix
ahp_results = []

# Iterate through genes
for gene in common_genes:
    cancer_values = mutated_df.loc[gene].dropna().values  # Remove NaNs if any
    benign_values = benign_df.loc[gene].dropna().values  # Remove NaNs if any

    # Ensure we have at least some values to compare
    if len(cancer_values) == 0 or len(benign_values) == 0:
        continue  # Skip genes with missing data in either set

    # Compute AHP metrics
    t_score = compute_t_test(cancer_values, benign_values)
    entropy1, entropy2 = compute_entropy(cancer_values, benign_values)

    # scipy.stats.wilcoxon is the paired signed-rank test, so it requires
    # same-length samples; record NaN whenever it cannot be applied
    try:
        if len(cancer_values) == len(benign_values):
            wilcoxon_stat, _ = wilcoxon(cancer_values, benign_values)
        else:
            wilcoxon_stat = np.nan  # Mark as NaN if lengths are different
    except ValueError:
        wilcoxon_stat = np.nan  # Assign NaN if Wilcoxon fails

    snr_value = compute_snr(cancer_values, benign_values)

    # Append results
    ahp_results.append([gene, t_score, entropy1, entropy2, wilcoxon_stat, snr_value])

# Convert to DataFrame
ahp_df = pd.DataFrame(ahp_results, columns=["Gene", "T-Score", "Entropy1", "Entropy2", "Wilcoxon", "SNR"])

ahp_df.head()




|   | Gene | T-Score | Entropy1 | Entropy2 | Wilcoxon | SNR |
|---|---|---|---|---|---|---|
| 0 | 1/2-SBSRNA4 | -0.689862 | 0.99934 | 2.066525 | NaN | -0.028173 |
| 1 | A1BG | 14.918537 | 0.142693 | 1.609023 | NaN | 0.436874 |
| 2 | A1BG-AS1 | 5.487732 | 1.085945 | 2.083701 | NaN | 0.234907 |
| 3 | A1CF | -0.126623 | 0.052357 | 0.189837 | NaN | -0.005238 |
| 4 | A2LD1 | 2.457139 | 1.500725 | 1.822521 | NaN | 0.081503 |


In [2]:
ahp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23368 entries, 0 to 23367
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Gene      23368 non-null  object 
 1   T-Score   22828 non-null  float64
 2   Entropy1  23368 non-null  float64
 3   Entropy2  23368 non-null  float64
 4   Wilcoxon  0 non-null      float64
 5   SNR       22828 non-null  float64
dtypes: float64(5), object(1)
memory usage: 1.1+ MB


In [3]:
# The Wilcoxon column has no non-null values, so drop it before ranking
ahp_modified = ahp_df.drop(columns=['Wilcoxon'])

In [4]:
ahp_modified.head()

|   | Gene | T-Score | Entropy1 | Entropy2 | SNR |
|---|---|---|---|---|---|
| 0 | 1/2-SBSRNA4 | -0.689862 | 0.99934 | 2.066525 | -0.028173 |
| 1 | A1BG | 14.918537 | 0.142693 | 1.609023 | 0.436874 |
| 2 | A1BG-AS1 | 5.487732 | 1.085945 | 2.083701 | 0.234907 |
| 3 | A1CF | -0.126623 | 0.052357 | 0.189837 | -0.005238 |
| 4 | A2LD1 | 2.457139 | 1.500725 | 1.822521 | 0.081503 |


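It seems something is wrong with the Wilcoxon step: every value is NaN, which could cause errors during ranking, so the column was dropped. The most likely cause is that `scipy.stats.wilcoxon` is the paired signed-rank test and needs samples of equal length, while the cancer and healthy groups have different sample counts, so the loop records NaN for every gene.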

In [5]:
# Now that the scores are calculated, compute an AHP score for each gene so the genes can be ranked
AHPscore = []  # Use a list instead of a set to maintain order

for _, row in ahp_modified.iterrows():  # `row` is a Pandas Series
    score = compute_ahp_weighted_ranking(row['T-Score'], row['Entropy1'], row['Entropy2'], row['SNR'])
    AHPscore.append(score)

ahp_modified['AHP Score'] = AHPscore  # Assign the computed scores to the DataFrame
ahp_modified.head()

|   | Gene | T-Score | Entropy1 | Entropy2 | SNR | AHP Score |
|---|---|---|---|---|---|---|
| 0 | 1/2-SBSRNA4 | -0.689862 | 0.99934 | 2.066525 | -0.028173 | 0.25 |
| 1 | A1BG | 14.918537 | 0.142693 | 1.609023 | 0.436874 | 0.25 |
| 2 | A1BG-AS1 | 5.487732 | 1.085945 | 2.083701 | 0.234907 | 0.25 |
| 3 | A1CF | -0.126623 | 0.052357 | 0.189837 | -0.005238 | 0.25 |
| 4 | A2LD1 | 2.457139 | 1.500725 | 1.822521 | 0.081503 | 0.25 |


In [6]:
ahp_modified['AHP Score'].max()

0.2500000000000142

In [7]:
ahp_modified['AHP Score'].min()

0.24999999999977263
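
The maximum and minimum differ from 0.25 only by floating-point noise, so every gene receives essentially the same score and the column cannot separate genes. One possible explanation, offered as a guess since the body of `compute_ahp_weighted_ranking` is not shown here, is that the four values of each row are normalized to sum to 1 and then combined with equal weights of 0.25, which always yields 0.25. The hypothetical function below reproduces that effect with two rows from the output above.

```python
import numpy as np

def rowwise_equal_weight_score(row_values):
    """Hypothetical scoring: normalize one row to sum to 1, then weight equally."""
    values = np.asarray(row_values, dtype=float)
    normalized = values / values.sum()                  # per-row normalization
    weights = np.full(values.size, 1.0 / values.size)   # equal weights (0.25 each)
    return float(np.dot(weights, normalized))

# T-Score, Entropy1, Entropy2, SNR for two genes from the output above
a1bg = [14.918537, 0.142693, 1.609023, 0.436874]
a1cf = [-0.126623, 0.052357, 0.189837, -0.005238]

print(rowwise_equal_weight_score(a1bg))
print(rowwise_equal_weight_score(a1cf))
```

If this is what happens, normalizing each metric across all genes (column-wise) before weighting would let the scores differ between genes.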

In [8]:
ahp_modified.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23368 entries, 0 to 23367
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Gene       23368 non-null  object 
 1   T-Score    22828 non-null  float64
 2   Entropy1   23368 non-null  float64
 3   Entropy2   23368 non-null  float64
 4   SNR        22828 non-null  float64
 5   AHP Score  22828 non-null  float64
dtypes: float64(5), object(1)
memory usage: 1.1+ MB


In [9]:
ahp_modified["AHP Score"] = ahp_modified["AHP Score"].fillna(0)

In [10]:
ahp_modified.shape

(23368, 6)
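
The score column is computed, but the genes are never actually ordered by it. A minimal sketch of that last step follows, using a small stand-in DataFrame with invented values rather than `ahp_modified` itself; the names `toy` and `top_genes` are illustrative.

```python
import pandas as pd

# Stand-in for ahp_modified with invented scores
toy = pd.DataFrame(
    {"Gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
     "AHP Score": [0.31, 0.87, 0.0, 0.55]}
)

# Rank 1 = highest score; tied genes share the smallest rank of their group
toy["Rank"] = toy["AHP Score"].rank(ascending=False, method="min").astype(int)

# Keep the two highest-scoring genes
top_genes = toy.sort_values("AHP Score", ascending=False).head(2)
print(top_genes)
```

Genes whose missing score was filled with 0 sort to the bottom as long as the real scores are positive, which is what `gene_c` illustrates here.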

In [11]:
import plotly.graph_objects as go

fig = go.Figure(data=
    go.Parcoords(
        line = dict(color = ahp_modified['AHP Score'],
                    colorscale = 'viridis'),
        dimensions = list([
            dict(range = [ahp_modified['T-Score'].min(), ahp_modified['T-Score'].max()],
                 label = 'T-Score', values = ahp_modified['T-Score']),
            dict(range = [ahp_modified['Entropy1'].min(), ahp_modified['Entropy1'].max()],
                 label = 'Entropy1', values = ahp_modified['Entropy1']),
            dict(range = [ahp_modified['Entropy2'].min(), ahp_modified['Entropy2'].max()],
                 label = 'Entropy2', values = ahp_modified['Entropy2']),
            dict(range = [ahp_modified['SNR'].min(), ahp_modified['SNR'].max()],
                 label = 'SNR', values = ahp_modified['SNR']),
            dict(range = [ahp_modified['AHP Score'].min(), ahp_modified['AHP Score'].max()],
                 label = 'AHP Score', values = ahp_modified['AHP Score'])
        ])
    )
)

fig.update_layout(
    plot_bgcolor = 'white',
    paper_bgcolor = 'white'
)

fig.show()


In [12]:
import plotly.express as px
ahp_modified2 = ahp_modified.copy()
# Scale AHP Score by 1000, presumably so it stays visible on the shared color scale
ahp_modified2['AHP Score'] = ahp_modified['AHP Score'] * 1000
# Create the heatmap
fig = px.imshow(ahp_modified2.set_index('Gene').T, 
                color_continuous_scale='RdBu_r',
                aspect="auto",
                title="Gene Analysis Heatmap")

# Update layout for better readability
fig.update_layout(
    xaxis_title="Gene",
    yaxis_title="Metric",
    coloraxis_colorbar_title="Value"
)

# Show the plot
fig.show()
