# DAY 4: Statistical Tests + Time EDA + Feature Engineering


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, f_oneway, chi2_contingency

In [2]:
df=pd.read_csv("day3_output.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27132 entries, 0 to 27131
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              27132 non-null  int64 
 1   patient_nbr               27132 non-null  int64 
 2   race                      27132 non-null  object
 3   gender                    27132 non-null  object
 4   admission_type_id         27132 non-null  int64 
 5   discharge_disposition_id  27132 non-null  int64 
 6   admission_source_id       27132 non-null  int64 
 7   days_in_hospital          27132 non-null  int64 
 8   payer_code                27132 non-null  object
 9   doctors_dep               27132 non-null  object
 10  num_lab_procedures        27132 non-null  int64 
 11  num_procedures            27132 non-null  int64 
 12  Total_medicines           27132 non-null  int64 
 13  number_outpatient         27132 non-null  int64 
 14  number_emergency      

## 1. HYPOTHESIS TESTING

### T-Test:

In [6]:
from scipy.stats import ttest_ind

group1 = df[df['readmitted'] == 'NO']['days_in_hospital']
group2 = df[df['readmitted'] == '<30']['days_in_hospital']

t_stat, p_value = ttest_ind(group1, group2)

print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: -7.362048497709187
P-value: 1.8874693386497457e-13


### Conclusion:

An independent two-sample t-test was performed to compare the mean hospital stay of non-readmitted patients (`NO`) with that of patients readmitted within 30 days (`<30`).

T-statistic: -7.36

P-value: < 0.001

Since the p-value is far below 0.05, we reject the null hypothesis of equal means. There is a statistically significant difference in mean hospital stay between the two groups. The negative t-statistic means that the `NO` group, which was passed first, has the shorter mean stay.

👉 Patients readmitted within 30 days tend to have longer hospital stays.
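With roughly 27,000 encounters, a very small p-value can go along with a modest practical difference, so it can help to report an effect size next to the test. The sketch below runs on synthetic stand-in data, not on the real `df`. It pairs Welch's t-test (`equal_var=False`, which does not assume equal variances) with Cohen's d. The helper `cohens_d` and the arrays `stay_no` and `stay_lt30` are illustrative names that do not appear in the original notebook.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (assumption: NOT the real data).
rng = np.random.default_rng(0)
stay_no = rng.poisson(lam=4.0, size=500) + 1    # hypothetical 'NO' stays
stay_lt30 = rng.poisson(lam=5.0, size=200) + 1  # hypothetical '<30' stays

# Welch's t-test: does not assume the two groups share a variance.
t_stat, p_value = ttest_ind(stay_no, stay_lt30, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled std."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


d = cohens_d(stay_no, stay_lt30)
print("Welch t:", t_stat, "p:", p_value, "Cohen's d:", d)
```

On the real data, the two arrays would be the `group1` and `group2` series from the cell above. A common rule of thumb reads |d| near 0.2 as small, 0.5 as medium, and 0.8 as large.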

### ANOVA:



In [7]:
from scipy.stats import f_oneway

group1 = df[df['readmitted'] == 'NO']['days_in_hospital']
group2 = df[df['readmitted'] == '>30']['days_in_hospital']
group3 = df[df['readmitted'] == '<30']['days_in_hospital']

f_stat, p_value = f_oneway(group1, group2, group3)

print("F-statistic:", f_stat)
print("P-value:", p_value)

F-statistic: 48.40829506656537
P-value: 1.0326883514694842e-21


### Conclusion:

A one-way ANOVA was conducted to compare hospital stay across the three readmission categories (NO, >30, <30).

F-statistic: 48.41

P-value: < 0.001

The result is statistically significant, so the mean hospital stay differs for at least one readmission category. ANOVA alone does not show which categories differ from each other.

👉 Length of hospital stay is significantly associated with readmission status.
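A significant F-statistic says the group means are not all equal, but not how much of the variation in stay the grouping explains. One common companion measure is eta squared (between-group sum of squares divided by total sum of squares). The sketch below computes it on synthetic stand-in groups, not the real data. The helper `eta_squared` and the `groups` list are illustrative names.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic stand-ins for the NO, >30 and <30 groups (assumption: NOT real data).
rng = np.random.default_rng(1)
groups = [
    rng.poisson(4.0, 400) + 1,
    rng.poisson(4.5, 300) + 1,
    rng.poisson(5.0, 150) + 1,
]


def eta_squared(groups):
    """Share of total variance explained by group membership."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total


f_stat, p_value = f_oneway(*groups)
eta2 = eta_squared(groups)
print("F:", f_stat, "p:", p_value, "eta squared:", eta2)
```

On the real data, `groups` would be `[group1, group2, group3]` from the ANOVA cell. With a large sample, a small eta squared alongside a tiny p-value would indicate a real but weak association.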

### Chi-Square Test:


In [8]:
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['gender'], df['readmitted'])

chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-square:", chi2)
print("P-value:", p)

Chi-square: 3.0828882340668753
P-value: 0.21407173346229869


### Conclusion:

A chi-square test of independence was conducted to examine the association between gender and hospital readmission. The result was not statistically significant (χ² = 3.08, p = 0.214). Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence that gender and readmission are related.
This suggests that male and female patients have similar readmission patterns.
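Two checks often accompany a chi-square test: that the expected cell counts are large enough for the approximation to hold (a common rule of thumb is at least 5 per cell), and an effect size such as Cramér's V. The sketch below uses made-up counts, not the real crosstab. The helper `cramers_v` and the `table` values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical gender x readmitted counts (assumption: NOT the real crosstab).
table = pd.DataFrame(
    {"<30": [120, 110], ">30": [400, 380], "NO": [700, 690]},
    index=["Female", "Male"],
)


def cramers_v(table):
    """Return (Cramér's V, p-value, expected counts) for a contingency table."""
    observed = np.asarray(table, dtype=float)
    # correction=False: use the plain chi-square statistic, without the
    # Yates continuity correction that SciPy applies to 2x2 tables by default.
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    n = observed.sum()
    k = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * k)), p, expected


v, p, expected = cramers_v(table)
print("Cramér's V:", v, "p:", p)
print("Smallest expected count:", expected.min())
```

On the real data, `table` would be the `contingency_table` built in the cell above. V ranges from 0 (no association) to 1 (perfect association).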

## 2. Feature Engineering

###### Help decision-making

###### Support readmission risk analysis




### 1- High Medication Flag:

##### Purpose: Patients taking many medicines may be more severe cases.

##### 1 = High medication intensity (more than 20 medicines), 0 = Normal

In [16]:
df['high_medication_flag'] = np.where(df['Total_medicines'] > 20, 1, 0)

### 2- Long Stay Flag:

##### Purpose: Long hospital stay may indicate complications.

##### 1 = Stayed more than 7 days

In [20]:
df['long_stay_flag'] = np.where(df['days_in_hospital'] > 7, 1, 0)

### 3- Elderly Patient Flag:

##### Purpose: Older patients often have higher readmission risk.

##### 1 = age 65+

In [22]:
df['elderly_flag'] = np.where(df['age'] >= 65, 1, 0)

### 4- High Diagnosis Count Flag:

##### Purpose: Patients with many diagnoses may have complex conditions.

##### 1 = number of diagnoses more than 5

In [26]:
df['high_diagnosis_flag'] = np.where(df['number_diagnoses'] > 5, 1, 0)

### 5- Intensive Lab Testing Flag:

##### Purpose: More lab tests may indicate a severe case.

##### 1 = number of lab procedures more than 60

In [27]:
df['high_lab_flag'] = np.where(df['num_lab_procedures'] > 60, 1, 0)

In [28]:
df

Unnamed: 0,encounter_id,patient_nbr,race,gender,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,payer_code,doctors_dep,...,acarbose,insulin,readmitted,age,age_group,high_medication_flag,long_stay_flag,elderly_flag,high_diagnosis_flag,high_lab_flag
0,72091308,20123568,Caucasian,Female,1,22,7,7,MC,Orthopedics-Reconstructive,...,No,Steady,NO,75,Senior,0,0,1,1,0
1,72848634,20377854,Caucasian,Female,2,1,1,3,MC,Nephrology,...,No,Steady,NO,65,Senior,0,0,1,1,0
2,73062156,20408121,Caucasian,Female,1,1,7,4,MC,Emergency/Trauma,...,No,No,NO,95,Elder,0,0,1,1,0
3,73731852,20542797,Caucasian,Male,1,2,7,10,MC,InternalMedicine,...,No,Steady,NO,75,Senior,0,1,1,1,1
4,81355914,7239654,Caucasian,Female,1,3,6,12,UN,InternalMedicine,...,No,Steady,NO,75,Senior,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27127,443739044,106595208,Caucasian,Male,2,6,7,6,MC,Emergency/Trauma,...,No,Up,NO,75,Senior,0,0,1,1,0
27128,443793668,47293812,Caucasian,Male,1,13,7,3,MC,Emergency/Trauma,...,No,Down,NO,85,Elder,1,0,1,1,0
27129,443804570,33230016,Caucasian,Female,1,22,7,8,MC,InternalMedicine,...,No,Steady,>30,75,Senior,0,1,1,1,0
27130,443816024,106392411,Caucasian,Female,3,6,1,3,MC,Orthopedics,...,No,Steady,NO,75,Senior,1,0,1,1,0


## Feature Engineering

##### To enhance predictive insights and risk segmentation, several new features were engineered based on clinical severity and healthcare utilization patterns. These features capture treatment intensity, hospitalization duration, age-based risk, and diagnostic complexity.

#### The engineered features include:

##### High medication usage indicator

##### Long hospital stay indicator

##### Elderly patient flag

##### High diagnosis count indicator

##### Intensive lab testing flag

##### These features help identify high-risk patient groups and support better readmission risk analysis.
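The five flags share one pattern: compare a column against a threshold and store 1 or 0. As a minimal sketch, the thresholds can be kept in a single table so they are easy to review and adjust. The dictionary `FLAG_RULES`, the function `add_risk_flags`, and the two-row `toy` frame are illustrative names and data, not part of the original notebook. The column names and cut-offs are the ones used in the cells above.

```python
import pandas as pd

# flag name -> (source column, threshold, inclusive?)
# Thresholds mirror the notebook cells above.
FLAG_RULES = {
    "high_medication_flag": ("Total_medicines", 20, False),   # > 20
    "long_stay_flag": ("days_in_hospital", 7, False),         # > 7
    "elderly_flag": ("age", 65, True),                        # >= 65
    "high_diagnosis_flag": ("number_diagnoses", 5, False),    # > 5
    "high_lab_flag": ("num_lab_procedures", 60, False),       # > 60
}


def add_risk_flags(frame, rules=FLAG_RULES):
    """Return a copy of `frame` with one 0/1 column per rule."""
    out = frame.copy()
    for flag, (column, threshold, inclusive) in rules.items():
        if inclusive:
            mask = out[column] >= threshold
        else:
            mask = out[column] > threshold
        out[flag] = mask.astype(int)
    return out


# Toy rows sitting exactly on and just above each threshold.
toy = pd.DataFrame({
    "Total_medicines": [20, 21],
    "days_in_hospital": [7, 8],
    "age": [55, 65],
    "number_diagnoses": [5, 6],
    "num_lab_procedures": [60, 61],
})
flagged = add_risk_flags(toy)
print(flagged[list(FLAG_RULES)])
```

Working on a copy leaves the input frame untouched, which makes the step safe to rerun in a notebook.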

In [29]:
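import os

# Save the Day 4 result to its own file so the Day 3 input is not overwritten.
# The file name "day4_output.csv" is an assumed name; adjust as needed.
os.makedirs("data/processed", exist_ok=True)
df.to_csv("data/processed/day4_output.csv", index=False)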

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27132 entries, 0 to 27131
Data columns (total 36 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              27132 non-null  int64 
 1   patient_nbr               27132 non-null  int64 
 2   race                      27132 non-null  object
 3   gender                    27132 non-null  object
 4   admission_type_id         27132 non-null  int64 
 5   discharge_disposition_id  27132 non-null  int64 
 6   admission_source_id       27132 non-null  int64 
 7   days_in_hospital          27132 non-null  int64 
 8   payer_code                27132 non-null  object
 9   doctors_dep               27132 non-null  object
 10  num_lab_procedures        27132 non-null  int64 
 11  num_procedures            27132 non-null  int64 
 12  Total_medicines           27132 non-null  int64 
 13  number_outpatient         27132 non-null  int64 
 14  number_emergency      

## Key Insights & Recommended Actions


### Insight 1: Hospital Stay Duration is Significantly Associated with Readmission


#### Evidence:

T-test: p < 0.001

ANOVA: p < 0.001

Boxplot shows longer median stay for readmitted patients

#### Interpretation:

Readmitted patients have longer hospital stays on average, so a longer stay is a marker of higher readmission risk.

#### Recommended Action:

Implement enhanced discharge planning for patients staying more than 7 days.

Schedule early follow-up appointments within 7 days of discharge.

Provide post-discharge monitoring for long-stay patients.

### Insight 2: Gender is Not a Significant Predictor


#### Evidence:

Chi-square test: p = 0.214 (> 0.05)

#### Interpretation:

There is no significant association between gender and hospital readmission.

#### Recommended Action:

Do not prioritize interventions based on gender alone.

Focus on clinical factors rather than demographic gender differences.


### Insight 3: High Medication Usage May Indicate Severe Cases


#### Evidence:

Engineered feature: high_medication_flag

Scatter and boxplot analysis

#### Interpretation:

Patients receiving a high number of medications may represent complex or severe cases.

#### Recommended Action:

Flag patients with >20 medications for additional discharge counseling.

Conduct medication reconciliation before discharge.

Provide medication adherence education.


### Insight 4: Long Hospital Stay Flag Identifies High-Risk Patients


#### Evidence:

Engineered feature: long_stay_flag

Statistical difference in stay duration

#### Interpretation:

Long hospital stays (>7 days) are likely associated with a higher readmission probability, given the significant difference in stay duration across readmission groups.
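One way to check this interpretation directly is to compare the share of each readmission outcome between flagged and unflagged patients. The sketch below uses a small made-up frame, not the real data. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Made-up rows for illustration only (assumption: NOT the real data).
toy = pd.DataFrame({
    "long_stay_flag": [0, 0, 0, 0, 1, 1, 1, 1],
    "readmitted": ["NO", "NO", "NO", "<30", "NO", ">30", "<30", "<30"],
})

# Row-normalized crosstab: each row shows the outcome shares for one flag value.
rates = pd.crosstab(toy["long_stay_flag"], toy["readmitted"], normalize="index")
print(rates)
```

If the `<30` and `>30` shares in the flagged row are clearly higher than in the unflagged row, the flag separates risk groups. A chi-square test on the unnormalized crosstab would show whether the difference is statistically significant.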

#### Recommended Action:

Develop special discharge protocols for long-stay patients.

Implement case management support.

Arrange post-discharge home visits or telehealth follow-ups.


### Insight 5: High Diagnosis Count Reflects Medical Complexity


#### Evidence:

Engineered feature: high_diagnosis_flag

#### Interpretation:

Patients with multiple diagnoses are medically complex and more vulnerable.

#### Recommended Action:

Introduce multidisciplinary care planning.

Ensure coordinated follow-up with specialists.

Prioritize complex patients for monitoring programs.

