<a href="https://colab.research.google.com/github/HarikrishnaYashoda/Apollo-Hypothesis-testing/blob/main/Apollo_Hypothesis_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

•	Which variables are significant in predicting the reason for hospitalization for different regions;

•	How well some variables like viral load, smoking, and severity level describe the hospitalization charges


**Data Exploration**

In [1]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
df = pd.read_csv("apollo_data.csv", index_col = 0)
df.head()

Unnamed: 0,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,19,female,yes,southwest,9.3,0,42212
1,18,male,no,southeast,11.26,1,4314
2,28,male,no,southeast,11.0,3,11124
3,33,male,no,northwest,7.57,0,54961
4,32,male,no,northwest,9.63,0,9667


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      1338 non-null   int64  
 1   sex                      1338 non-null   object 
 2   smoker                   1338 non-null   object 
 3   region                   1338 non-null   object 
 4   viral load               1338 non-null   float64
 5   severity level           1338 non-null   int64  
 6   hospitalization charges  1338 non-null   int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 83.6+ KB


**Data Dictionary**

The file apollo_data.csv contains anonymized data of COVID-19 hospital patients and includes the following variables:

age: Integer — Age of the primary beneficiary (only includes ages up to 64, as older individuals are generally covered by the government).

sex: Categorical — Gender of the policy holder (male or female).

smoker: Categorical — Indicates whether the insured regularly smokes tobacco (yes or no).

region: Categorical — Beneficiary’s residence in Delhi, categorized into four geographic regions: northeast, southeast, southwest, and northwest.

viral load: Float — The amount of virus present in an infected person’s blood.

severity level: Integer — Numeric score indicating the severity of the patient’s condition.

hospitalization charges: Integer — Medical costs billed to health insurance for the patient's hospital stay.

Wonderful. What about the statistical information of the numerical columns like age, viral load, severity level, or hospitalization charges? Let’s take a look.

In [4]:
df.describe()

Unnamed: 0,age,viral load,severity level,hospitalization charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,10.221233,1.094918,33176.058296
std,14.04996,2.032796,1.205493,30275.029296
min,18.0,5.32,0.0,2805.0
25%,27.0,8.7625,0.0,11851.0
50%,39.0,10.13,1.0,23455.0
75%,51.0,11.5675,2.0,41599.5
max,64.0,17.71,5.0,159426.0


**Data Cleaning & Preprocessing**

Before performing any statistical analysis or modeling, it’s essential to ensure the dataset is clean and structured appropriately. In this section, we will:

Remove unnecessary columns,

Convert categorical features into usable formats,

Encode those features for compatibility with statistical models,

Prepare the dataset for correlation analysis and regression.

This step ensures that downstream results are both accurate and interpretable.

In [6]:
# Convert relevant columns to 'category' data type
categorical_cols = ["sex", "smoker", "region"]
for col in categorical_cols:
    df[col] = df[col].astype("category")

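# Apply one-hot encoding for categorical features (drop_first avoids the dummy-variable trap)
# Note: .astype(int) casts every column, so the float "viral load" is truncated
# (9.3 becomes 9); pd.get_dummies(..., dtype=int) would leave it as a float.
df_encoded = pd.get_dummies(df, drop_first=True).astype(int)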

# Show the cleaned and preprocessed DataFrame
df_encoded.head()

Unnamed: 0,age,viral load,severity level,hospitalization charges,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,9,0,42212,0,1,0,0,1
1,18,11,1,4314,1,0,0,1,0
2,28,11,3,11124,1,0,0,1,0
3,33,7,0,54961,1,0,1,0,0
4,32,9,0,9667,1,0,1,0,0
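In the encoded output above, `viral load` shows whole numbers (9, 11, 11, 7, 9) where the raw data had 9.3, 11.26, 11.0, 7.57 and 9.63. A minimal sketch of one way to keep the decimals, assuming the only goal of the integer cast is 0/1 dummy columns: pass `dtype=int` to `pd.get_dummies`, which applies to the new dummy columns only. The `toy` frame below is illustrative and not read from `apollo_data.csv`.

```python
import pandas as pd

# Toy frame reusing the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    "viral load": [9.3, 11.26, 11.0],
    "smoker": pd.Categorical(["yes", "no", "no"]),
    "region": pd.Categorical(["southwest", "southeast", "southeast"]),
})

# dtype=int is applied to the generated dummy columns only,
# so existing numeric columns keep their original dtype
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.dtypes)
print(encoded["viral load"].tolist())
```

Fitting the regression on such a frame would give slightly different coefficients from the ones reported later in this notebook, which were computed on the truncated values.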


**Question 1: Which variables are significant in predicting the reason for hospitalization for different regions?**

Apollo wants to understand whether factors like age, sex, smoking status, viral load, or severity level differ significantly across regions. While the dataset does not contain an explicit “reason” field for hospitalization, regional variation in these variables could indicate different patterns in hospitalization motives or needs.

**🔍 Approach**

We’ll approach this by:

Performing ANOVA and Chi-square tests to identify if distributions of key variables differ significantly by region.
Testing continuous variables like age, viral load, and severity level using ANOVA.
Testing categorical variables like sex and smoker using Chi-Square Test of Independence.

**Step 1: ANOVA – Do continuous variables vary across regions?**

In [7]:
# For each continuous variable, perform one-way ANOVA across regions
anova_age = stats.f_oneway(
    df[df["region"] == "northeast"]["age"],
    df[df["region"] == "southeast"]["age"],
    df[df["region"] == "southwest"]["age"],
    df[df["region"] == "northwest"]["age"]
)

anova_viral = stats.f_oneway(
    df[df["region"] == "northeast"]["viral load"],
    df[df["region"] == "southeast"]["viral load"],
    df[df["region"] == "southwest"]["viral load"],
    df[df["region"] == "northwest"]["viral load"]
)

anova_severity = stats.f_oneway(
    df[df["region"] == "northeast"]["severity level"],
    df[df["region"] == "southeast"]["severity level"],
    df[df["region"] == "southwest"]["severity level"],
    df[df["region"] == "northwest"]["severity level"]
)

# Print results
anova_age, anova_viral, anova_severity

(F_onewayResult(statistic=np.float64(0.07978158162436333), pvalue=np.float64(0.970989069987742)),
 F_onewayResult(statistic=np.float64(39.46870879747587), pvalue=np.float64(1.9508165724449588e-24)),
 F_onewayResult(statistic=np.float64(0.7174932934640621), pvalue=np.float64(0.5415542568832501)))

**ANOVA Results Summary**

We tested whether continuous variables (age, viral load, severity level) vary significantly across different regions using one-way ANOVA.

Age:
F(3, 1334) = 0.08, p = 0.97 ❌
→ No significant difference in average age across regions.

Viral Load:
F(3, 1334) = 39.47, p < 0.001 ✅
→ Highly significant difference in viral load between regions. This suggests that the severity of viral exposure varies geographically.

Severity Level:
F(3, 1334) = 0.72, p = 0.54 ❌
→ No statistically significant difference in severity level across regions.

📌 Insight: Among the continuous predictors, only viral load shows meaningful variation across regions, which may reflect differing infection rates or testing/reporting practices by location.
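The three ANOVA calls in Step 1 repeat the same pattern, so they can be folded into a loop. The sketch below is one possible way to do it, using `groupby` to split each variable by region. The helper name `anova_by_group` and the randomly generated `toy` data are assumptions for illustration, so the printed numbers will not match the results above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the notebook's df (not the Apollo data)
rng = np.random.default_rng(0)
regions = ["northeast", "northwest", "southeast", "southwest"]
toy = pd.DataFrame({
    "region": np.repeat(regions, 50),
    "age": rng.integers(18, 65, size=200),
    "viral load": rng.normal(10, 2, size=200),
})

def anova_by_group(frame, group_col, value_cols):
    """Run a one-way ANOVA of each value column across the levels of group_col."""
    rows = []
    for col in value_cols:
        # One sample per group level, passed to f_oneway as separate arguments
        samples = [g[col].to_numpy() for _, g in frame.groupby(group_col)]
        f_stat, p_value = stats.f_oneway(*samples)
        rows.append({"variable": col, "F": f_stat, "p": p_value})
    return pd.DataFrame(rows).set_index("variable")

result = anova_by_group(toy, "region", ["age", "viral load"])
print(result)
```

On the real data the call would be `anova_by_group(df, "region", ["age", "viral load", "severity level"])`, which should reproduce the three F statistics reported above.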

**Step 2: Chi-Square Test – Are sex and smoker status independent of region?**

In [8]:
# Cross-tabulation and Chi-Square for 'sex' vs. 'region'
contingency_sex = pd.crosstab(df["region"], df["sex"])
chi2_sex = stats.chi2_contingency(contingency_sex)

# Cross-tabulation and Chi-Square for 'smoker' vs. 'region'
contingency_smoker = pd.crosstab(df["region"], df["smoker"])
chi2_smoker = stats.chi2_contingency(contingency_smoker)

# Show test statistics and p-values
chi2_sex[0:2], chi2_smoker[0:2]

((np.float64(0.43513679354327284), np.float64(0.9328921288772233)),
 (np.float64(7.343477761407071), np.float64(0.06171954839170541)))

Chi-Square Test Results
We assessed whether the distribution of categorical variables (sex, smoker) is independent of the region:

Sex vs Region
χ² = 0.43, p = 0.93 ❌
→ No evidence of an association between sex and region; the sex distribution is similar across regions.

Smoker vs Region
χ² = 7.34, p = 0.062 ❌ (borderline)
→ While not statistically significant at p < 0.05, there is a weak regional trend in smoking behavior (marginal significance).

📌 Insight: Neither sex nor smoking status varies significantly by region, although smoking status comes close to the significance threshold. This may warrant deeper exploration in future studies.
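`stats.chi2_contingency` returns more than the statistic and p-value: it also gives the degrees of freedom and the table of expected counts, which is useful for checking the common rule of thumb that expected counts should be at least 5. The sketch below uses hypothetical counts (not taken from `apollo_data.csv`) to show how the four return values can be unpacked.

```python
import pandas as pd
from scipy import stats

# Hypothetical region-by-smoker counts, for illustration only
table = pd.DataFrame(
    {"no": [260, 265, 275, 265], "yes": [65, 60, 90, 58]},
    index=["northeast", "northwest", "southeast", "southwest"],
)

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Degrees of freedom are (rows - 1) * (columns - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
print("smallest expected count:", round(float(expected.min()), 1))
```

For a 4 x 2 table such as region by sex or region by smoker, the test has 3 degrees of freedom.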

From our statistical analysis:

✅ Viral Load is the only variable that shows a significant difference across regions.
❌ Age, severity level, sex, and smoking status do not vary significantly by region.
This suggests that while reasons for hospitalization may be influenced by local viral exposure levels, other demographic and behavioral factors are evenly distributed across regions.

**Question 2: How well do variables like viral load, smoking, and severity level describe the hospitalization charges?**

Apollo is interested in understanding whether factors like viral load, smoking, and severity level can reliably predict hospitalization charges. To answer this, we use linear regression, which helps quantify how much each variable contributes to cost differences.

We'll also account for other potential confounding variables, such as:

Age

Sex

Region

to ensure a robust and interpretable model.


In [9]:
# Linear regression model to predict hospitalization charges

# Define target variable (y) and features (X)
y = df_encoded["hospitalization charges"]
X = df_encoded.drop(columns=["hospitalization charges"])

# Add intercept to the model
X = sm.add_constant(X)

# Fit the OLS regression model
model = sm.OLS(y, X).fit()

# Show summary of the model
model.summary()

Dep. Variable:,hospitalization charges,R-squared:,0.751
Model:,OLS,Adj. R-squared:,0.75
Method:,Least Squares,F-statistic:,501.1
Date:,"Wed, 11 Jun 2025",Prob (F-statistic):,0.0
Time:,11:44:27,Log-Likelihood:,-14773.0
No. Observations:,1338,AIC:,29560.0
Df Residuals:,1329,BIC:,29610.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.825e+04,2362.479,-11.958,0.000,-3.29e+04,-2.36e+04
age,642.1805,29.739,21.594,0.000,583.840,700.521
viral load,2508.9272,211.029,11.889,0.000,2094.941,2922.914
severity level,1189.1801,344.430,3.453,0.001,513.495,1864.865
sex_male,-341.4719,832.219,-0.410,0.682,-1974.077,1291.134
smoker_yes,5.963e+04,1032.658,57.748,0.000,5.76e+04,6.17e+04
region_northwest,-919.9419,1190.421,-0.773,0.440,-3255.251,1415.367
region_southeast,-2528.7675,1195.059,-2.116,0.035,-4873.175,-184.360
region_southwest,-2457.3958,1194.949,-2.056,0.040,-4801.588,-113.203

Omnibus:,299.543,Durbin-Watson:,2.092
Prob(Omnibus):,0.0,Jarque-Bera (JB):,716.487
Skew:,1.208,Prob(JB):,2.61e-156
Kurtosis:,5.648,Cond. No.,250.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Linear Regression Test Results
We used a multiple linear regression model to quantify how well various predictors explain hospitalization charges. The model includes:

Biological factors: viral load, severity level

Behavioral: smoker

Demographic: age, sex

Geographic: region (dummy encoded)

🔧 Model Fit & Significance

R-squared = 0.751 → Model explains ~75.1% of variance in charges ✅

F-statistic = 501.1, p < 0.001 ✅ → Model is statistically significant

n = 1338 observations
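The fit statistics are tied together: for an OLS model with an intercept, the overall F-statistic can be recovered from R-squared, the number of predictors and the number of observations as F = (R² / k) / ((1 − R²) / (n − k − 1)). The short check below plugs in the values from the summary table; because R-squared is only printed to three decimals, the result matches the reported F only approximately.

```python
# Values as printed in the OLS summary
r_squared = 0.751     # R-squared (rounded to three decimals)
n_obs = 1338          # No. Observations
k_predictors = 8      # Df Model

df_model = k_predictors
df_resid = n_obs - k_predictors - 1   # matches "Df Residuals" in the summary

# F = explained variance per model df / unexplained variance per residual df
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
print(f"F({df_model}, {df_resid}) = {f_stat:.1f}")
```

The result is about 501, in line with the F-statistic in the summary.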

✅ Statistically Significant Predictors (p < 0.05)

| Variable | Coef | p-value | Interpretation |
| --- | --- | --- | --- |
| Age | +642 | < 0.001 | Older patients incur higher charges. |
| Viral Load | +2509 | < 0.001 | Higher viral load significantly increases cost. |
| Severity Level | +1189 | 0.001 | More severe cases lead to higher charges. |
| Smoker (yes) | +59,630 | < 0.001 | Smoking is strongly associated with much higher costs. |
| Region - Southeast | -2529 | 0.035 | Lower cost than reference (Northeast). |
| Region - Southwest | -2457 | 0.040 | Lower cost than reference (Northeast). |
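To make the coefficients concrete, the sketch below computes the model's predicted charge for a hypothetical patient by summing the intercept and each coefficient times its feature value. The coefficients are copied from the OLS summary (the intercept and smoker coefficient only to the four significant digits printed there), and the patient profile and the helper `predict_charges` are assumptions for illustration.

```python
# Coefficients from the OLS summary above
coef = {
    "const": -28250.0,            # printed as -2.825e+04
    "age": 642.1805,
    "viral load": 2508.9272,
    "severity level": 1189.1801,
    "sex_male": -341.4719,
    "smoker_yes": 59630.0,        # printed as 5.963e+04
    "region_northwest": -919.9419,
    "region_southeast": -2528.7675,
    "region_southwest": -2457.3958,
}

def predict_charges(patient):
    """Linear predictor: intercept plus the sum of coefficient * feature value."""
    return coef["const"] + sum(coef[name] * value for name, value in patient.items())

# Hypothetical patient: 40-year-old male, viral load 10, severity level 1,
# living in the northeast (reference region, so all region dummies are 0)
base = {"age": 40, "viral load": 10, "severity level": 1, "sex_male": 1, "smoker_yes": 0}
smoker = {**base, "smoker_yes": 1}

print(round(predict_charges(base)))
print(round(predict_charges(smoker)))
print(round(predict_charges(smoker) - predict_charges(base)))
```

The difference between the two predictions equals the smoker coefficient, which is what "holding the other variables constant" means in a linear model.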

Insights and Recommendations

Based on the statistical analyses and modeling conducted, we outline the following key insights and strategic recommendations for Apollo Hospitals:



Key Insights

Viral Load Varies by Region

Viral load was the only continuous variable showing statistically significant differences across regions.

This may reflect varying levels of infection exposure or reporting between geographical areas.

Smoking Has the Largest Impact on Cost

Smoking is the most influential variable in predicting hospitalization charges.

Smokers incur, on average, nearly 60,000 units more in charges than non-smokers.

Biological Severity Drives Cost

Both viral load and severity level significantly increase hospitalization charges.

This aligns with clinical expectations: sicker patients cost more to treat.

Demographics Have Limited Cost Impact

Age slightly increases cost (roughly +640 per year).

Sex does not significantly affect hospitalization charges.

Regional Differences in Cost

Patients from southeast and southwest regions tend to have lower costs than those in the northeast.

This may be due to hospital infrastructure, local pricing, or clinical practice variation.