# Project 2 - Notebook 2: Hypothesis Testing

Notebook Objective:

- Perform statistical hypothesis tests to analyze the relationship between potential risk factors (categorical and numerical fields) and leukemia status (`Leukemia_Status`).

- Use the Chi-Square test for each categorical column vs `Leukemia_Status`.

- Use the independent t-test or the Mann-Whitney U test for each numeric column vs `Leukemia_Status`, depending on whether the numeric data is normally distributed.

- Calculate the p-value to determine whether each relationship is statistically significant.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
sns.set_style('darkgrid')
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)  

## Load Data

In [3]:
df = pd.read_csv('../data/luekemia_prepared_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143194 entries, 0 to 143193
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Leukemia_Status       143194 non-null  object 
 1   Smoking_Status        143194 non-null  object 
 2   Family_History        143194 non-null  object 
 3   Genetic_Mutation      143194 non-null  object 
 4   Ethnicity             143194 non-null  object 
 5   Socioeconomic_Status  143194 non-null  object 
 6   WBC_Count             143194 non-null  int64  
 7   Bone_Marrow_Blasts    143194 non-null  int64  
 8   Age                   143194 non-null  int64  
 9   RBC_Count             143194 non-null  float64
 10  BMI                   143194 non-null  float64
dtypes: float64(2), int64(3), object(6)
memory usage: 12.0+ MB


## Hypothesis Test for Categorical Columns vs Leukemia_Status (Chi-Square Test)

In [4]:
categorical_col = [
    "Smoking_Status",
    "Family_History",
    "Genetic_Mutation",
    "Ethnicity",
    "Socioeconomic_Status"
]
for col in categorical_col:
    # cross-tabulate the categorical column against Leukemia_Status
    contingency_col = pd.crosstab(df[col], df['Leukemia_Status'])
    
    # chi-square test of independence on the contingency table
    chi2_stat, p_value, dof, expected_frequencies = stats.chi2_contingency(contingency_col)
    
    # show the chi-square test results
    print(f"Result Chi-square for column: {col}")
    print(f"Chi-square statistic: {chi2_stat}")
    print(f"Degree of Freedom: {dof}")
    print(f"Expected Frequencies table:\n {expected_frequencies}")
    print("----------------------------------------")
    
    # interpret the p-value at the 0.05 significance level
    if p_value < 0.05:
        print(f"There is a significant correlation between {col} and Leukemia Status (p < 0.05)")
    else:
        print(f"There is no significant correlation between {col} and Leukemia Status (p >= 0.05)")
        

Result Chi-square for column: Smoking_Status
Chi-square statistic: 0.024217347430319884
Degree of Freedom: 1
Expected Frequencies table:
 [[72901.79108762 12807.20891238]
 [48895.20891238  8589.79108762]]
----------------------------------------
There is no significant correlation between Smoking_Status and Leukemia Status (p >= 0.05)
Result Chi-square for column: Family_History
Chi-square statistic: 0.0533665573787544
Degree of Freedom: 1
Expected Frequencies table:
 [[85166.20819308 14961.79180692]
 [36630.79180692  6435.20819308]]
----------------------------------------
There is no significant correlation between Family_History and Leukemia Status (p >= 0.05)
Result Chi-square for column: Genetic_Mutation
Chi-square statistic: 0.03549676325173791
Degree of Freedom: 1
Expected Frequencies table:
 [[97540.34926044 17135.65073956]
 [24256.65073956  4261.34926044]]
----------------------------------------
There is no significant correlation between Genetic_Mutation and Leukemia Status 

### Insights from the Chi-Square Test Results (Categorical Columns vs Leukemia_Status):

- `Smoking_Status`: There is no significant association between smoking status and leukemia status (p >= 0.05).

- `Family_History`: There is no significant association between family history of leukemia and leukemia status (p >= 0.05).

- `Genetic_Mutation`: There is no significant association between identified genetic mutations and leukemia status (p >= 0.05).

- `Ethnicity`: There is no significant association between ethnicity and leukemia status (p >= 0.05).

- `Socioeconomic_Status`: There is no significant association between socioeconomic status and leukemia status (p >= 0.05).


### General conclusion of the Chi-Square Test:

Based on the **Chi-Square Test**, none of the categorical columns (`Smoking_Status`, `Family_History`, `Genetic_Mutation`, `Ethnicity`, `Socioeconomic_Status`) shows a statistically significant association with leukemia status **(p >= 0.05)**.
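To make the chi-square procedure used above more concrete, here is a minimal sketch on a small, made-up 2x2 table (the counts are hypothetical and are not taken from the leukemia dataset). It computes the expected frequencies by hand as `row_total * column_total / grand_total` and checks them against `stats.chi2_contingency`. It passes `correction=False` so that the statistic matches the plain Pearson formula; the notebook cell relies on the SciPy defaults, under which Yates' continuity correction is applied to 2x2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (illustrative counts only).
# Rows: risk factor absent / present. Columns: Negative / Positive.
observed = np.array([[60, 40],
                     [30, 70]])

# Expected counts under independence: row_total * column_total / grand_total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic computed directly from its definition.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Same test via SciPy; correction=False disables Yates' continuity
# correction so the result matches the manual computation.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4g}")
print(f"Degrees of freedom: {dof}")
print(f"Expected frequencies:\n{expected}")
```

Printing the p-value alongside the statistic, as this sketch does, makes it easier to see how close each column is to the 0.05 threshold.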

## Hypothesis Test for Numeric Columns vs Leukemia_Status (Kolmogorov-Smirnov Test, T-test, Mann-Whitney U)

In [5]:
numerical_col = [
    "WBC_Count",
    "Bone_Marrow_Blasts",
    "Age",
    "RBC_Count",
    "BMI" 
]

for col in numerical_col:
    positive_col = df[df['Leukemia_Status'] == 'Positive'][col]
    negative_col = df[df['Leukemia_Status'] == 'Negative'][col]
    
    # Kolmogorov-Smirnov normality test for each group
    # (kstest with 'norm' compares the raw values against a standard normal N(0, 1))
    ks_stat_pos, ks_p_value_pos = stats.kstest(positive_col, 'norm')
    ks_stat_neg, ks_p_value_neg = stats.kstest(negative_col, 'norm')
    
    print("----------------------------------------")
    print(f"Normality test result with Kolmogorov-Smirnov for column: {col}")
    print(f"Positive group - Kolmogorov-Smirnov statistics: {ks_stat_pos}")
    print(f"Positive group - P-value: {ks_p_value_pos}")
    print(f"Negative group - Kolmogorov-Smirnov statistics: {ks_stat_neg}")
    print(f"Negative group - P-value: {ks_p_value_neg}")
    
    print("---------------------------------------- \n")
    # choose the statistical test based on the normality test results
    # (this condition compares the KS statistics, not the KS p-values, against 0.05)
    if ks_stat_pos > 0.05 and ks_stat_neg > 0.05:
        # independent t-test, intended for normally distributed data
        test_stat, p_value = stats.ttest_ind(positive_col, negative_col)
        test_name = "T-test Independent"
    else:
        # Mann-Whitney U test, intended for non-normally distributed data
        test_stat, p_value = stats.mannwhitneyu(positive_col, negative_col)
        test_name = "Mann-Whitney U"
    
    # print result
    print(f"Result: {test_name} \n for column: {col}")
    print(f"Test statistic: {test_stat}")
    print(f"P-value: {p_value}")
    print("---------------------------------------- \n")
    
    print("======================================")
    # interpretation P-value
    if p_value < 0.05:
        print(f"There is a significant correlation between {col} and Leukemia Status (p < 0.05)")
    else:
        print(f"There is no significant correlation between {col} and Leukemia Status (p >= 0.05)")    
    print("====================================== \n")

----------------------------------------
Normality test result with Kolmogorov-Smirnov for column: WBC_Count
Positive group - Kolmogorov-Smirnov statistics: 0.999906528952657
Positive group - P-value: 0.0
Negative group - Kolmogorov-Smirnov statistics: 0.9998275819601468
Negative group - P-value: 0.0
---------------------------------------- 

Result: T-test Independent 
 for column: WBC_Count
Test statistic: -0.247326946285331
P-value: 0.8046555609204009
---------------------------------------- 

There is no significant correlation between WBC_Count and Leukemia Status (p >= 0.05)

----------------------------------------
Normality test result with Kolmogorov-Smirnov for column: Bone_Marrow_Blasts
Positive group - Kolmogorov-Smirnov statistics: 0.9685991602475679
Positive group - P-value: 0.0
Negative group - Kolmogorov-Smirnov statistics: 0.9690106198793201
Negative group - P-value: 0.0
---------------------------------------- 

Result: T-test Independent 
 for column: Bone_Marrow_Bla

### Insights from the Independent T-test Results (Numeric Columns vs Leukemia_Status):

- `WBC_Count`: There was no significant difference in mean white blood cell count between the leukemia positive and negative groups (p > 0.05).

- `Bone_Marrow_Blasts`: There was no significant difference in mean bone marrow blast percentage between the leukemia positive and negative groups (p > 0.05).

- `Age`: There was no significant difference in mean age between the leukemia positive and negative groups (p > 0.05).

- `RBC_Count`: There was no significant difference in mean red blood cell count between the leukemia positive and negative groups (p > 0.05).

- `BMI`: There was no significant difference in mean BMI between the leukemia positive and negative groups (p > 0.05).

### General conclusion of the Independent T-test:

Based on the independent t-test, none of the numeric columns (`WBC_Count`, `Bone_Marrow_Blasts`, `Age`, `RBC_Count`, `BMI`) showed a significant difference in means between the leukemia positive and negative groups **(p > 0.05)**.
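The recorded Kolmogorov-Smirnov statistics above are close to 1, which is consistent with raw, unstandardized values being compared against a standard normal distribution. As a minimal sketch of an alternative, the hypothetical helper `compare_groups` below standardizes each group before the KS test and chooses between the t-test and the Mann-Whitney U test from the KS p-values. The data is synthetic and the helper name, `alpha` parameter, and sample sizes are assumptions made for illustration. Because the mean and standard deviation are estimated from the sample, the KS p-value here is only approximate, and with very large samples a normality test will tend to reject even small departures from normality.

```python
import numpy as np
from scipy import stats


def compare_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-sample test from KS normality p-values."""

    def looks_normal(x):
        # Standardize so the sample is compared against N(0, 1) on a fair scale.
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std(ddof=1)
        return stats.kstest(z, "norm").pvalue > alpha

    if looks_normal(a) and looks_normal(b):
        result = stats.ttest_ind(a, b)
        name = "Independent t-test"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "Mann-Whitney U"
    return name, result.statistic, result.pvalue


# Synthetic data for illustration only (not the leukemia dataset).
rng = np.random.default_rng(0)
normal_a = rng.normal(50, 10, 500)
normal_b = rng.normal(50, 10, 500)
skewed_a = rng.exponential(5.0, 500)
skewed_b = rng.exponential(5.0, 500)

# Without standardization, values centered near 50 sit far in the right
# tail of N(0, 1), so the KS statistic is close to 1.
raw_stat = stats.kstest(normal_a, "norm").statistic

name_normal, stat_normal, p_normal = compare_groups(normal_a, normal_b)
name_skewed, stat_skewed, p_skewed = compare_groups(skewed_a, skewed_b)

print(f"KS statistic on raw values: {raw_stat:.4f}")
print(f"Normal groups -> {name_normal}, p = {p_normal:.4f}")
print(f"Skewed groups -> {name_skewed}, p = {p_skewed:.4f}")
```

Returning the test name together with the statistic and p-value keeps the printed label tied to the test that actually ran.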