# Project 2 - Notebook 3: Alternative Hypothesis Testing

Goals of this notebook:

- Retest whether the numerical columns are distributed differently in the Leukemia Positive and Negative groups, using the Mann-Whitney U test.

- Compare the Mann-Whitney U test results with the Independent T-test results from the previous notebook.

- Determine whether the two tests lead to different conclusions about statistical significance.

- Document the limitations and trade-offs of using the Independent T-test and the Mann-Whitney U test on this dataset.
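
To make the trade-off between the two tests concrete, here is a minimal sketch on toy data. The arrays `group_a` and `group_b` are hypothetical and are not taken from the leukemia dataset. The T-test compares means, so a single extreme value can inflate the variance and hide a shift. The Mann-Whitney U test works on ranks, so the same value counts as only one low rank.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical, NOT from the leukemia dataset).
group_a = np.arange(1, 11, dtype=float)  # 1, 2, ..., 10
group_b = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19, -1000], dtype=float)

# Independent T-test: compares means; the outlier inflates group_b's variance.
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# Mann-Whitney U: rank-based; the outlier is just the lowest-ranked value.
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"T-test:         statistic={t_stat:.3f}, p={p_ttest:.4f}")
print(f"Mann-Whitney U: statistic={u_stat:.1f}, p={p_mwu:.4f}")
```

On this toy input the T-test does not reject at the 0.05 level while the Mann-Whitney U test does. This illustrates only that the two tests can disagree; it says nothing about the leukemia data itself.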

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

## Load Data

In [None]:
df = pd.read_csv('../data/luekemia_prepared_data.csv')
df.info()

## Hypothesis Test for Numeric Columns vs Leukemia_Status (Mann-Whitney U)

In [None]:
numerical_col = [
    "WBC_Count",
    "Bone_Marrow_Blasts",
    "Age",
    "RBC_Count",
    "BMI" 
]

for col in numerical_col:
    positive_col = df[df['Leukemia_Status'] == 'Positive'][col]
    negative_col = df[df['Leukemia_Status'] == 'Negative'][col]
    
    # Mann-Whitney U test: non-parametric, so it does not assume normality
    u_stat, p_value = stats.mannwhitneyu(
        positive_col, negative_col, alternative="two-sided"
    )
    test_name = "Mann-Whitney U"

    print("======================================")
    print(f"{test_name} test result for column: {col}")
    print("--------------------------------------")
    print(f"Test statistic: {u_stat}")
    print(f"P-value: {p_value}")
    print("--------------------------------------")

    # Map the p-value to a significance label (thresholds are checked in order)
    if p_value < 0.001:
        interpre_p_value = "P-value < 0.001 (Very Significant)"
    elif p_value < 0.01:
        interpre_p_value = "P-value < 0.01 (Very Significant)"
    elif p_value < 0.05:
        interpre_p_value = "P-value < 0.05 (Significant)"
    elif p_value < 0.10:
        interpre_p_value = "P-value < 0.10 (Marginally Significant - Careful Interpretation is Required)"
    else:
        interpre_p_value = "P-value >= 0.10 (Not significant)"

    print(f"Interpretation P-value: {interpre_p_value}")
    print("----------------------------------------")

    # Final decision at the 0.05 level, based on the Mann-Whitney U p-value only
    if p_value < 0.05:
        print(f"There is a statistically significant difference in the distribution of {col} between the Leukemia Positive and Negative groups.")
    else:
        print(f"There is no statistically significant difference in the distribution of {col} between the Leukemia Positive and Negative groups.")
    print(f"{test_name} - {interpre_p_value}")
    print("====================================== \n")

Normality test result with Man Whitney U for column: WBC_Count
--------------------------------------
Test statistic: 1301945623.0
P-value: 0.8436879858591144
--------------------------------------
Interpretation P-value: P-value >= 0.05 (Not significant)
----------------------------------------
There is a statistically significant difference in the distribution of the:  WBC_Count. Between Leukemia Positive and Negative group
T-test Independent - P-value >= 0.05 (Not significant)

Normality test result with Man Whitney U for column: Bone_Marrow_Blasts
--------------------------------------
Test statistic: 1312571595.0
P-value: 0.08756761144846777
--------------------------------------
Interpretation P-value: P-value < 0.10 (Marginal Significant - Careful Interpretation is Required)
----------------------------------------
There is a statistically significant difference in the distribution of the:  Bone_Marrow_Blasts. Between Leukemia Positive and Negative group
T-test Independent - P-v

### Mann-Whitney U & T-test Results (Numeric Column vs Leukemia_Status):

- `WBC_Count`:
   - Mann-Whitney U: There is no significant difference in the white blood cell count distribution between the leukemia positive and negative groups (p = 0.844, p >= 0.05).
   - T-test: There is no significant difference in mean white blood cell count between the leukemia positive and negative groups (p >= 0.05).
   - Combined Insights: Neither the Mann-Whitney U test (non-parametric) nor the T-test (parametric) finds a significant difference in the distribution or mean of WBC_Count between the leukemia positive and negative groups.

- `Bone_Marrow_Blasts`:
   - Mann-Whitney U: There is an indication of a difference in the distribution of bone marrow blast percentage between the leukemia positive and negative groups, but it is only marginally significant (p ≈ 0.088, so p < 0.10 but not below 0.05). Careful interpretation is required.
   - T-test: There is an indication of a difference in mean bone marrow blast percentage between the leukemia positive and negative groups, but it is only marginally significant (p < 0.10, p = 0.087). Careful interpretation is required.
   - Combined Insights: Both the Mann-Whitney U test and the T-test indicate a marginal difference in the distribution or mean of Bone_Marrow_Blasts between the leukemia positive and negative groups. Interpret this with caution: the result is below 0.10 but does not reach the more stringent 0.05 significance threshold.

- `Age`:
   - Mann-Whitney U: There is no significant difference in age distribution between the leukemia positive and negative groups (p >= 0.05).
   - T-test: There is no significant difference in mean age between the leukemia positive and negative groups (p >= 0.05).
   - Combined Insights: Neither the Mann-Whitney U test nor the T-test finds a significant difference in the distribution or mean of Age between the leukemia positive and negative groups.

- `RBC_Count`:
   - Mann-Whitney U: There is no significant difference in red blood cell count distribution between the leukemia positive and negative groups (p >= 0.05).
   - T-test: There is no significant difference in mean red blood cell count between the leukemia positive and negative groups (p >= 0.05).
   - Combined Insights: Neither the Mann-Whitney U test nor the T-test finds a significant difference in the distribution or mean of RBC_Count between the leukemia positive and negative groups.

- `BMI`:
   - Mann-Whitney U: There is no significant difference in BMI distribution between the leukemia positive and negative groups (p >= 0.05).
   - T-test: There is no significant difference in mean BMI between the leukemia positive and negative groups (p >= 0.05).
   - Combined Insights: Neither the Mann-Whitney U test nor the T-test finds a significant difference in the distribution or mean of BMI between the leukemia positive and negative groups.
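
The per-column comparison above can also be produced programmatically. The following is a minimal sketch, not the notebook's actual code. The helper name `compare_tests`, its parameters, and the `demo` frame are hypothetical, and the demo data is synthetic rather than the leukemia dataset. It assumes the group column is `Leukemia_Status` with the labels `Positive` and `Negative`, as in the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_tests(df, columns, group_col="Leukemia_Status",
                  pos_label="Positive", neg_label="Negative", alpha=0.05):
    """Run Mann-Whitney U and the independent T-test per column and tabulate both."""
    rows = []
    for col in columns:
        pos = df.loc[df[group_col] == pos_label, col].dropna()
        neg = df.loc[df[group_col] == neg_label, col].dropna()
        _, p_mwu = stats.mannwhitneyu(pos, neg, alternative="two-sided")
        _, p_t = stats.ttest_ind(pos, neg)
        rows.append({
            "column": col,
            "mwu_p": p_mwu,
            "ttest_p": p_t,
            # True when both tests land on the same side of alpha
            "same_conclusion": (p_mwu < alpha) == (p_t < alpha),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in data (NOT the leukemia dataset), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Leukemia_Status": ["Positive"] * 200 + ["Negative"] * 200,
    "Age": rng.normal(50, 10, 400),
    "Shifted": np.concatenate([rng.normal(5, 1, 200), rng.normal(0, 1, 200)]),
})
result = compare_tests(demo, ["Age", "Shifted"])
print(result)
```

Applied to the real data, the call would be `compare_tests(df, numerical_col)`. The `same_conclusion` column then answers directly whether the two tests agree at the chosen `alpha`.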

### General conclusion: 

Based on both hypothesis tests (Mann-Whitney U test and T-test), there is in general no statistically significant difference between the leukemia positive and negative groups for the numerical columns WBC_Count, Age, RBC_Count, and BMI (p >= 0.05).

Bone_Marrow_Blasts is the only column that shows an indication of a difference between the two groups, and it is only marginal (p < 0.10). It needs careful interpretation because it does not reach the 0.05 significance threshold.
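
The thresholds used in this notebook (0.001, 0.01, 0.05, and 0.10) can be kept in one place with a small helper. This is a sketch only: the function name `interpret_p_value` and the label wording are assumptions, and the two p-values are the Mann-Whitney U values printed in the output above.

```python
def interpret_p_value(p_value):
    """Map a p-value to a significance label, checking thresholds in order."""
    if p_value < 0.001:
        return "p < 0.001 (very significant)"
    if p_value < 0.01:
        return "p < 0.01 (very significant)"
    if p_value < 0.05:
        return "p < 0.05 (significant)"
    if p_value < 0.10:
        return "p < 0.10 (marginal, not significant at 0.05)"
    return "p >= 0.10 (not significant)"


# Mann-Whitney U p-values printed above for the first two columns.
reported = [
    ("WBC_Count", 0.8436879858591144),
    ("Bone_Marrow_Blasts", 0.08756761144846777),
]
for name, p in reported:
    print(f"{name}: {interpret_p_value(p)}")
```

Labelling the final branch `p >= 0.10` keeps every label consistent with the threshold that actually produced it.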