# Biostat Midterm Project
11/6/2024        
J. Cristiano

A study was conducted to test the impact of a new drug in treating hyperglycemia in patients with diabetes mellitus. The primary outcome was the total number of deaths.
The purpose of this Mid-term Applied Project is to examine demographic and clinical characteristics, as well as the primary outcome (death), between the two groups (Triinhibitor vs. placebo).
The data consist of more than 1,000 patients who met the eligibility criteria for the randomized clinical trial (Triinhibitor vs. placebo).

### Question 1 
Provide the structure of the data set, including length, data type, names, and the components of the data set. Please take a look at the data dictionary (end of the mid-term applied project) to determine the data type. If numbers were used for levels of a variable, you should convert the numbers into factors, representing the actual meaning. For example, for variable "White", 1 is for non-white, and 0 for White.

#### Dataset Structure

##### Length
The dataset comprises 1,079 rows (the overall n reported in Table 1), each representing an individual patient record. Each row contains the demographic and health-related attributes listed below.

##### Column Names and Data Types
- **PID**: Integer, unique identifier for each patient.
- **Age**: Integer, representing the age of the patient in years.
- **Gender**: Categorical, F (Female) or M (Male).
- **Ethnicity**: Categorical, values include "HISPANIC OR LATINO" and "NOT HISPANIC OR LA".
- **White**: Binary, 0 for White, 1 for Non-white.
- **LVEF**: Integer, Left ventricular ejection fraction.
- **A1c**: Float, hemoglobin A1c (%), a marker of average blood glucose over the preceding months.
- **SBP**: Integer, systolic blood pressure in mm Hg.
- **DBP**: Integer, diastolic blood pressure in mm Hg.
- **BMI**: Float, body mass index.
- **Ischemic Status**: Categorical, either "Ischemic" or "Non-Ischemic".
- **Reason of End the Study**: Categorical, reasons include "STUDY TERMINATED BY SPONSOR" and "DEATH".
- **Months of Follow-up**: Float, duration in months.
- **Group**: Categorical, treatment group as either "Placebo" or "Triinhibitor".
- **GFR**: Float, glomerular filtration rate.

##### Components
Each row represents one patient, tracking demographic information (e.g., Age, Gender, Ethnicity), health metrics (e.g., LVEF, A1c, BMI), and study-related details (e.g., Ischemic Status, Reason of End the Study). This structure allows for analysis based on these health and demographic factors over the follow-up period.
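The inspection and recoding described above can be sketched with pandas. This is a minimal illustration on a small made-up frame: the four rows below are hypothetical and are not study data. With the real file, the same calls would be applied to the frame returned by `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the study data.
toy = pd.DataFrame(
    {
        "PID": [1, 2, 3, 4],
        "Age": [71, 64, 58, 80],
        "Gender": ["F", "M", "M", "F"],
        "White": [0, 1, 0, 0],
        "Group": ["Placebo", "Triinhibitor", "Placebo", "Triinhibitor"],
    }
)

# Replace numeric codes with their meaning, then store as a categorical (factor).
toy["White"] = toy["White"].map({0: "White", 1: "Non-White"}).astype("category")
for col in ["Gender", "Group"]:
    toy[col] = toy[col].astype("category")

n_rows, n_cols = toy.shape          # length (rows) and number of variables
print(f"{n_rows} rows x {n_cols} columns")
print(toy.dtypes)                   # data type of each column
print(toy["White"].cat.categories)  # levels of the recoded factor
```

Storing coded variables as `category` keeps the labels attached to the data, so later tables show "White" and "Non-White" rather than 0 and 1.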




### Question 2
Provide your statistical analysis plan to summarize characteristics of the data set.  You should describe your methods in detail for the following "Table 1" 

#### Descriptive Statistics 
- Categorical variables will be reported as counts and percentages for each treatment group.
- Continuous variables will be assessed for skew, and summary statistics will be reported within each group: mean and standard deviation for approximately normal distributions, and median for variables that show a skew.
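One way to make the skew assessment concrete is sketched below using only the standard library. The cutoff of |skewness| > 1 is a common rule of thumb assumed here for illustration, not a threshold stated in the plan, and the helper names `sample_skewness` and `summarize` are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def summarize(values, skew_cutoff=1.0):
    """Return 'mean (SD)' for roughly symmetric data, the median otherwise."""
    if abs(sample_skewness(values)) <= skew_cutoff:
        return f"{statistics.fmean(values):.1f} ({statistics.stdev(values):.1f})"
    return f"{statistics.median(values):.1f}"

symmetric = [1, 2, 3, 4, 5]       # hypothetical values
right_skewed = [1, 1, 1, 2, 10]   # hypothetical values
print(summarize(symmetric))       # mean (SD) branch
print(summarize(right_skewed))    # median branch
```

A histogram or Q-Q plot would normally accompany a numeric check like this; the function only shows how the reporting rule could be applied consistently across variables.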

#### Group Comparisons
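If normality assumptions hold, two-sample t-tests will be used to compare means of continuous variables between the treatment groups; otherwise, the Mann-Whitney U test will be used as a non-parametric alternative. Categorical variables will be compared between groups with the chi-square test, or Fisher's exact test when expected cell counts are small.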
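The choice between the chi-square test and Fisher's exact test depends on the expected cell counts under independence. The sketch below computes those counts for a contingency table and applies a cutoff of 5; the function names and both example tables are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def choose_categorical_test(table, min_expected=5):
    """Pick chi-square unless any expected count falls below the cutoff."""
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest >= min_expected else "Fisher's exact"

# Hypothetical 2 x 2 tables (rows = treatment group, columns = level).
large = [[40, 60], [45, 55]]
sparse = [[2, 18], [3, 17]]
print(choose_categorical_test(large))   # smallest expected count is 42.5
print(choose_categorical_test(sparse))  # smallest expected count is 2.5
```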


### Question 3
Provide your "Table 1": Demographic and Clinical Characteristics, as well as outcomes of the patients according to the treatment groups (Triinhibitor vs. placebo). In the table, you should present

  a) suitable descriptive statistics in the table for variables in total, Triinhibitor, and placebo groups, respectively. 

  b)  results (p-values) derived from inferential statistics for comparison of patients for each variable between Triinhibitor  and placebo groups. 

  You should specify your statistical test methods for obtaining p-value for each variable. 

```python
# Load the dataset and required packages
import pandas as pd
from tableone import TableOne
from pprint import pprint
from scipy.stats import ttest_ind

# Assuming the dataset is in the same directory as the script
data = pd.read_csv('BINF667Intro-1.csv')
# Recode the numeric race indicator into labeled levels (0 = White, 1 = Non-White)
data['White'] = data['White'].map({0: 'White', 1: 'Non-White'})

# Create the table, grouped by treatment arm, with p-values
table = TableOne(
    data,
    columns=["Age", "Gender", "Ethnicity", "White", "LVEF", "A1c", "SBP", "DBP", "BMI", "Ischemic Status", "Months of Follow-up", "GFR", "Reason of End the Study"],
    categorical=["Gender", "Ethnicity", "White", "Ischemic Status", "Reason of End the Study"],
    groupby="Group",
    pval=True
)

# Print the table. `tablefmt` is passed through to the `tabulate` package;
# available formats include "plain", "simple", "github", "grid", "fancy_grid",
# "pipe", "orgtbl", "jira", "presto", "pretty", "psql", "rst", "mediawiki",
# "moinmoin", "youtrack", "html", "unsafehtml", "latex", "latex_raw",
# "latex_booktabs", and "latex_longtable".
print(table.tabulate(tablefmt="fancy_grid"))
```

```text
╒════════════════════════════════╤═════════════════════════════╤═══════════╤══════════════╤══════════════╤════════════════╤═══════════╕
│                                │                             │ Missing   │ Overall      │ Placebo      │ Triinhibitor   │ P-Value   │
╞════════════════════════════════╪═════════════════════════════╪═══════════╪══════════════╪══════════════╪════════════════╪═══════════╡
│ n                              │                             │           │ 1079         │ 542          │ 537            │           │
├────────────────────────────────┼─────────────────────────────┼───────────┼──────────────┼──────────────┼────────────────┼───────────┤
│ Age, mean (SD)                 │                             │ 0         │ 69.0 (8.9)   │ 69.3 (8.6)   │ 68.6 (9.3)     │ 0.193     │
├────────────────────────────────┼─────────────────────────────┼───────────┼──────────────┼──────────────┼────────────────┼───────────┤
│ Gender, n (%)                  │ F
```

#### Tests Used

- **Categorical Variables**: 
  - For categorical variables such as *Gender*, *Ethnicity*, *White*, and *Ischemic Status*, the **Chi-square test** is used to compare proportions between groups. If any expected cell count is less than 5, **Fisher’s Exact test** is used instead.

- **Continuous Variables**: 
  - For continuous variables such as *Age*, *LVEF*, *A1c*, *SBP*, *DBP*, *BMI*, *Months of Follow-up*, and *GFR*:
    - If the variable follows a **normal distribution**, a **two-sample t-test** is used to compare the means between the treatment groups (Triinhibitor vs. Placebo).
    - If the variable does not follow a normal distribution, a **Mann-Whitney U test** is used as a non-parametric alternative.
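As a worked check of the categorical comparison, the sketch below runs a chi-square test with Yates' continuity correction on the gender split. The counts are an assumption: they are back-calculated from the percentages reported in the Results (34.3% of 542 and 33.0% of 537, rounded to whole patients), not read from the data, and the use of the continuity correction is also assumed. The helper `chi2_2x2_yates` is hypothetical and uses only the standard library.

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Yates-corrected chi-square statistic and p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Assumed counts: 34.3% of 542 and 33.0% of 537, rounded to whole patients.
placebo_f, placebo_n = 186, 542
tri_f, tri_n = 177, 537

chi2, p_value = chi2_2x2_yates(
    placebo_f, placebo_n - placebo_f, tri_f, tri_n - tri_f
)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

With these assumed counts the p-value comes out at about 0.684, in line with the value reported for gender.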


### Question 4 
Provide your "Result" paragraph(s), summarizing the results based on the "Table 1".


**Results**

Baseline characteristics of the study participants are presented in Table 1. A total of 1,079 patients were enrolled, with 542 in the placebo group and 537 in the triinhibitor group. The mean age of the overall cohort was 69.0 years (SD 8.9), with no significant difference between the placebo group (69.3 years, SD 8.6) and the triinhibitor group (68.6 years, SD 9.3; *p* = 0.193).
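The age comparison can be sanity-checked from the summary statistics alone. The sketch below assumes an unequal-variance (Welch) two-sample statistic and, because each arm has more than 500 patients, approximates the t distribution with the standard normal. The function `welch_from_summary` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch statistic and a large-sample two-sided p-value from summary statistics."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = (mean1 - mean2) / standard_error
    # With more than 500 patients per arm, the t distribution is close to the
    # standard normal, so the normal tail is used here as an approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# Rounded age summaries from Table 1: placebo vs. Triinhibitor.
t_stat, p_value = welch_from_summary(69.3, 8.6, 542, 68.6, 9.3, 537)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Because the means and standard deviations are rounded to one decimal place, the result should land near, but not exactly on, the reported *p* = 0.193.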

The gender distribution was similar between groups, with females comprising 33.6% of the overall population (placebo: 34.3%, triinhibitor: 33.0%; *p* = 0.684). Ethnicity and race were balanced across treatment arms, with 26.2% Hispanic or Latino participants and 94.7% identifying as White in the overall sample (*p* = 0.357 and *p* = 1.000, respectively).

Clinical parameters, including left ventricular ejection fraction (LVEF), hemoglobin A1c (A1c), systolic blood pressure (SBP), diastolic blood pressure (DBP), body mass index (BMI), ischemic status, months of follow-up, glomerular filtration rate (GFR), and reasons for study termination did not differ significantly between the placebo and triinhibitor groups (all *p* > 0.05). Specifically, the mean LVEF was 37.5% (SD 13.2) overall, with values of 37.2% (SD 13.0) in the placebo group and 37.9% (SD 13.4) in the triinhibitor group (*p* = 0.354).

The mean A1c was consistent across groups at 7.5% (SD 1.6; *p* = 0.886). Mean SBP and DBP were also comparable between the placebo and triinhibitor groups (SBP: 123.5 mmHg vs. 124.3 mmHg, *p* = 0.399; DBP: 73.1 mmHg vs. 72.6 mmHg, *p* = 0.521). The mean BMI was slightly higher in the placebo group (31.5 kg/m², SD 6.2) compared to the triinhibitor group (30.8 kg/m², SD 5.9), but this difference was not statistically significant (*p* = 0.068).

Ischemic status was similar between groups, with 59.8% of patients classified as ischemic overall (placebo: 60.3%, triinhibitor: 59.2%; *p* = 0.756). The mean duration of follow-up was 9.6 months (SD 5.2), with no significant difference between groups (*p* = 0.307). GFR values were comparable between the placebo and triinhibitor groups (54.2 mL/min/1.73 m² vs. 52.4 mL/min/1.73 m², *p* = 0.101).

The primary reason for ending the study was termination by the sponsor in 88.5% of cases, with death accounting for 11.5% of study endpoints. There was no significant difference in the reasons for study termination between the two groups (*p* = 0.320).