# Healthcare Hypothesis Testing
Author: Jade Aidoghie  
Date: 7/3/2023

# Overview
A pharmaceutical company has conducted a randomized controlled drug trial and wants to ensure that its drug outcomes are reproducible. The organization wants to know whether adverse reactions occur at a different rate under the drug than under the placebo. In this project I analyze the trial data to answer this question. First, I check whether the proportion of adverse effects differs significantly between the Drug and Placebo groups. Next, I determine whether the number of adverse effects is independent of the treatment and control groups. Finally, I investigate whether there is a significant difference in ages between the Drug and Placebo groups.  
  

> The dataset `drug_safety.csv` is sourced from [Hbiostat](https://hbiostat.org/data/) by the Vanderbilt University Department of Biostatistics. The dataset has been modified to include indicators for the presence and absence of adverse effects (adverse_effects) and the count of adverse effects per individual (num_effects). The ratio of drug observations to placebo observations is 2 to 1.

# The Data

| Column | Description |
|--------|-------------|
|`sex` | The gender of the individual |
|`age` | The age of the individual |
|`week` | The week of the drug testing |
|`trx` | The treatment (Drug) and control (Placebo) groups | 
|`wbc` | The count of white blood cells |
|`rbc` | The count of red blood cells |
|`adverse_effects` | The presence of at least a single adverse effect |
|`num_effects` | The number of adverse effects experienced by a single individual |

# Exploratory Data Analysis (EDA)

In [64]:
# Import packages
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import pingouin
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import plotly.express as px

In [65]:
# Loading in the dataset
drug_safety = pd.read_csv("drug_safety.csv")

drug_safety.head()

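|   | age | sex | trx | week | wbc | rbc | adverse_effects | num_effects |
|---|-----|-----|-----|------|-----|-----|-----------------|-------------|
| 0 | 62 | male | Drug | 0 | 7.3 | 5.1 | No | 0 |
| 1 | 62 | male | Drug | 1 | NaN | NaN | No | 0 |
| 2 | 62 | male | Drug | 12 | 5.6 | 5.0 | No | 0 |
| 3 | 62 | male | Drug | 16 | NaN | NaN | No | 0 |
| 4 | 62 | male | Drug | 2 | 6.6 | 5.1 | No | 0 |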


In [66]:
drug_safety.shape # Rows, Columns

(16103, 8)

In [67]:
drug_safety.info() # Full table details

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16103 entries, 0 to 16102
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              16103 non-null  int64  
 1   sex              16103 non-null  object 
 2   trx              16103 non-null  object 
 3   week             16103 non-null  int64  
 4   wbc              9128 non-null   float64
 5   rbc              9127 non-null   float64
 6   adverse_effects  16103 non-null  object 
 7   num_effects      16103 non-null  int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 1006.6+ KB


In [68]:
drug_safety.isnull().sum() # Null values for each column

age                   0
sex                   0
trx                   0
week                  0
wbc                6975
rbc                6976
adverse_effects       0
num_effects           0
dtype: int64

In [69]:
drug_safety.nunique() # Unique values

age                 44
sex                  2
trx                  2
week                 8
wbc                154
rbc                143
adverse_effects      2
num_effects          4
dtype: int64

# Statistical Analysis 

### Q1. Checking if the proportion of adverse effects differs significantly between the Drug and Placebo groups

In [70]:
# Grouping data by trx (treatment and control groups) and counting the occurrences of adverse effects within each group
adv_eff_by_trx = drug_safety.groupby("trx").adverse_effects.value_counts()
adv_eff_by_trx

trx      adverse_effects
Drug     No                 9703
         Yes                1024
Placebo  No                 4864
         Yes                 512
Name: count, dtype: int64

In [71]:
# Aggregating the total number of participants in each group (Drug and Placebo)
adv_eff_by_trx_sums = adv_eff_by_trx.groupby("trx").sum()
adv_eff_by_trx_sums 

trx
Drug       10727
Placebo     5376
Name: count, dtype: int64

In [72]:
# Extracting counts of participants who have experienced adverse effects in both drug and placebo groups
adveff = [adv_eff_by_trx["Drug"]["Yes"], adv_eff_by_trx["Placebo"]["Yes"]]
adveff

[1024, 512]

In [73]:
# Creating an array of total participants in the drug and placebo group
n = [adv_eff_by_trx_sums["Drug"], adv_eff_by_trx_sums["Placebo"]]
n

[10727, 5376]

In [74]:
# Performing a two-sided z-test on the two proportions
two_sample_results = proportions_ztest(adveff, n)
two_sample_results

(0.0452182684494942, 0.9639333330262475)

In [75]:
# Storing the p-value 
two_sample_p_value = two_sample_results[1]
two_sample_p_value

0.9639333330262475

**Q1 Outcome**  
  
To determine whether the proportion of adverse effects differs significantly between the Drug and Placebo groups, I performed a two-sided z-test comparing the proportion of adverse effects in each group. Based on the z-test results, we fail to reject the null hypothesis that the proportion of adverse effects is the same in both groups. The high p-value (0.9639) means that a difference at least as large as the one observed would be very likely under random chance alone, so the data provide no evidence of a treatment effect on the adverse-effect rate. The observed proportions are nearly identical (1024/10727 ≈ 9.5% for Drug and 512/5376 ≈ 9.5% for Placebo), so the rate of adverse effects appears similar regardless of whether participants were in the Drug group or the Placebo group.
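
The z-statistic and p-value above can be reproduced by hand from the counts printed in the earlier cells. The following is a minimal sketch, assuming the pooled-variance form of the two-proportion z-test; the helper name `two_proportion_ztest` is hypothetical and is not part of statsmodels.

```python
import math

def two_proportion_ztest(successes, totals):
    """Two-sided z-test for equal proportions, using the pooled proportion."""
    x1, x2 = successes
    n1, n2 = totals
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis p1 == p2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts taken from the adveff and n lists printed above
z_stat, p_val = two_proportion_ztest([1024, 512], [10727, 5376])
print(z_stat, p_val)
```

If the pooled-variance assumption holds, the printed values should agree with `two_sample_results` to several decimal places.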

### Q2. Determining if the number of adverse effects is independent of the treatment and control groups

In [76]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Performing a chi-square test to determine if there is an association between the number of adverse effects and the treatment/control groups.
    num_effects_groups = pingouin.chi2_independence(data=drug_safety, x="num_effects", y="trx")

In [77]:
# Extracting the p-value from the chi-square test results
num_effects_p_value = num_effects_groups[2]["pval"][0]
num_effects_p_value

0.6150123339426765

**Q2 Outcome**  
  
I performed a chi-square test to investigate whether the number of adverse effects is independent of the treatment group (Drug or Placebo). The resulting p-value is approximately 0.62, so we fail to reject the null hypothesis that the number of adverse effects is independent of whether an individual is in the Drug or Placebo group. The data therefore provide no evidence that the number of adverse effects experienced by individuals differs between those who received the Drug and those who received the Placebo.
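
To illustrate how a Pearson chi-square statistic is built from a contingency table, here is a minimal sketch. The `num_effects` by `trx` table is not printed in this notebook, so the sketch uses the 2x2 `adverse_effects` counts from Q1 as a stand-in. The helper name `pearson_chi2` is hypothetical, and no continuity correction is applied.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Rows: Drug, Placebo; columns: No, Yes (counts printed in Q1)
observed = [[9703, 1024], [4864, 512]]
chi2_stat, dof = pearson_chi2(observed)

# Valid for dof == 1 only: the chi-square survival function equals erfc(sqrt(x / 2))
p_val = math.erfc(math.sqrt(chi2_stat / 2))
print(chi2_stat, dof, p_val)
```

For a 2x2 table without correction, the statistic equals the square of the pooled two-proportion z-statistic, so this stand-in example gives the same p-value as Q1. The actual Q2 test runs on a table with more columns and therefore more degrees of freedom.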

### Q3. Investigating if there is a significant difference in ages between the Drug and Placebo groups

In [78]:
color_map = {
    "Drug": "#1f77b4",  # Blue
    "Placebo": "#ff7f0e"  # Orange
}

# Creating an interactive histogram 
fig = px.histogram(
    drug_safety,
    x="age",
    color="trx",
    barmode='overlay',
    title="Age Distribution by Treatment Group",
    labels={"age": "Age", "trx": "Treatment Group"},
    opacity=0.7,
    nbins=30,
    color_discrete_map=color_map
)

# Updating layout for better visualization
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Count",
    legend_title="Treatment Group",
    template="plotly_white"
)

# Display the interactive plot
fig.show()

In [79]:
# Suppress the specific warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Checking the normality of the age distribution within each group to decide on the appropriate statistical test.
    normality = pingouin.normality(
        data=drug_safety,
        dv='age',
        group='trx',
        method='shapiro',
        alpha=0.05) 

normality

| trx | W | pval | normal |
|-----|---|------|--------|
| Drug | 0.976785 | 2.189152e-38 | False |
| Placebo | 0.975595 | 2.224950e-29 | False |


In [80]:
# Extracting the ages of participants in the Drug group
age_trx = drug_safety.loc[drug_safety["trx"] == "Drug", "age"]

In [81]:
# Extracting the ages of participants in the Placebo group
age_placebo = drug_safety.loc[drug_safety["trx"] == "Placebo", "age"]

In [82]:
# The data distribution is not normal so I'm performing a two-sided Mann-Whitney U test to compare the ages between the two groups
age_group_effects = pingouin.mwu(age_trx, age_placebo)
age_group_effects

|     | U-val | alternative | p-val | RBC | CLES |
|-----|-------|-------------|-------|-----|------|
| MWU | 29149339.5 | two-sided | 0.256963 | -0.01093 | 0.505465 |


In [83]:
# Extracting the p-value from the results
age_group_effects_p_value = age_group_effects["p-val"]
age_group_effects_p_value

MWU    0.256963
Name: p-val, dtype: float64

**Q3 Outcome**  
  
Based on my analysis, there is no significant age difference between the Drug and Placebo groups. A histogram showed similar age distributions in both groups, with most participants aged 60-70. The Shapiro-Wilk test rejected normality for the age distribution in both groups (p-values < 0.05), so I used a non-parametric two-sided Mann-Whitney U test. This test resulted in a p-value of approximately 0.257, indicating no significant difference in the ages of participants between the Drug and Placebo groups. Age therefore appears balanced between the groups, as expected under randomization, and is unlikely to confound comparisons of outcomes between them.
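
The effect sizes in the Mann-Whitney output can be related to the U statistic with a little arithmetic. The sketch below is illustrative only: `mwu_pairwise` is a hypothetical helper that computes U by direct pair counting on tiny made-up samples, and the two formulas for CLES and RBC are assumptions that happen to reproduce the values printed above. pingouin's exact conventions may differ between versions.

```python
def mwu_pairwise(x, y):
    """U statistic for x: 1 for each pair with x > y, 0.5 for each tie."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Tiny hypothetical samples, only to show how U is counted
u_small = mwu_pairwise([1, 2, 3], [2, 3, 4])

# Values reported in this notebook
u_val = 29149339.5
n_drug, n_placebo = 10727, 5376
n_pairs = n_drug * n_placebo

# Assumed relations: U divided by the number of (Drug, Placebo) pairs,
# and a rank-biserial correlation derived from the same ratio
cles = u_val / n_pairs
rbc = 1 - 2 * u_val / n_pairs
print(u_small, cles, rbc)
```

A ratio of U to the number of pairs close to 0.5 means that a randomly chosen Drug participant is about as likely to be older as younger than a randomly chosen Placebo participant, which matches the non-significant p-value.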

# Conclusion

This project analyzed the outcomes of a drug trial to determine if the drug's effects are reproducible and if it causes adverse reactions. The analysis focused on three main questions: the difference in adverse effects between the Drug and Placebo groups, the independence of the number of adverse effects from the treatment groups, and the age comparison between the two groups.

First, a z-test showed no significant difference in the proportion of adverse effects between the Drug and Placebo groups (p-value: 0.9639). The data therefore provide no evidence that the drug increases the risk of adverse effects compared to the placebo.

Second, a chi-square test found no evidence of an association between the number of adverse effects and whether participants received the Drug or the Placebo (p-value: 0.62). In other words, the treatment type does not appear to affect the number of adverse reactions.

Last, a Mann-Whitney U test found no significant difference in ages between the Drug and Placebo groups (p-value: 0.257). Age is therefore unlikely to be a factor influencing the study's results.

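In conclusion, the rate of adverse effects under the drug is similar to that under the placebo, the number of adverse effects shows no detectable dependence on the treatment, and age differences between the groups are not significant. These results are consistent with the drug having a safety profile comparable to the placebo, and they give a baseline against which the reproducibility of the trial's outcomes can be checked.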