In [1]:
import pandas as pd
import numpy as np

Data Exploration

In [3]:
df = pd.read_csv('apollo_data.csv')
df.head()

,Unnamed: 0,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,0,19,female,yes,southwest,9.3,0,42212
1,1,18,male,no,southeast,11.26,1,4314
2,2,28,male,no,southeast,11.0,3,11124
3,3,33,male,no,northwest,7.57,0,54961
4,4,32,male,no,northwest,9.63,0,9667


In [4]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,19,female,yes,southwest,9.3,0,42212
1,18,male,no,southeast,11.26,1,4314
2,28,male,no,southeast,11.0,3,11124
3,33,male,no,northwest,7.57,0,54961
4,32,male,no,northwest,9.63,0,9667
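The `Unnamed: 0` column looks like a row index that was written into the CSV when the file was saved. As an alternative to dropping it after loading, the sketch below reads that column directly as the index. It uses a two-row in-memory stand-in for apollo_data.csv, so the exact file layout is an assumption.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for apollo_data.csv; the empty leading header
# mimics an index column that was written out by DataFrame.to_csv().
csv_text = (
    ",age,sex,smoker,region,viral load,severity level,hospitalization charges\n"
    "0,19,female,yes,southwest,9.3,0,42212\n"
    "1,18,male,no,southeast,11.26,1,4314\n"
)

# index_col=0 uses the first column as the row index,
# so no 'Unnamed: 0' column is created in the first place.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

With the real file this would correspond to `pd.read_csv('apollo_data.csv', index_col=0)`, provided the first column really is the saved index.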


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      1338 non-null   int64  
 1   sex                      1338 non-null   object 
 2   smoker                   1338 non-null   object 
 3   region                   1338 non-null   object 
 4   viral load               1338 non-null   float64
 5   severity level           1338 non-null   int64  
 6   hospitalization charges  1338 non-null   int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 73.3+ KB


In [6]:
df.describe()

,age,viral load,severity level,hospitalization charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,10.221233,1.094918,33176.058296
std,14.04996,2.032796,1.205493,30275.029296
min,18.0,5.32,0.0,2805.0
25%,27.0,8.7625,0.0,11851.0
50%,39.0,10.13,1.0,23455.0
75%,51.0,11.5675,2.0,41599.5
max,64.0,17.71,5.0,159426.0


In [7]:
df.isna().sum()

,0
age,0
sex,0
smoker,0
region,0
viral load,0
severity level,0
hospitalization charges,0


Business Requirements

In [8]:
df['sex'].unique()

array(['female', 'male'], dtype=object)

In [9]:
df['smoker'].unique()

array(['yes', 'no'], dtype=object)

In [10]:
df['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [11]:
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'])
df_encoded.head()

,age,viral load,severity level,hospitalization charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,9.3,0,42212,True,False,False,True,False,False,False,True
1,18,11.26,1,4314,False,True,True,False,False,False,True,False
2,28,11.0,3,11124,False,True,True,False,False,False,True,False
3,33,7.57,0,54961,False,True,True,False,False,True,False,False
4,32,9.63,0,9667,False,True,True,False,False,True,False,False


In [12]:
df_encoded.columns

Index(['age', 'viral load', 'severity level', 'hospitalization charges',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'region_northeast',
       'region_northwest', 'region_southeast', 'region_southwest'],
      dtype='object')

In [13]:
import statsmodels.api as sm

boolean_cols = ['sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'region_northeast',
       'region_northwest', 'region_southeast', 'region_southwest']

# Convert the True/False dummy columns to 0/1 integers in one step
df_encoded[boolean_cols] = df_encoded[boolean_cols].astype(int)

df_encoded.head()

,age,viral load,severity level,hospitalization charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,9.3,0,42212,1,0,0,1,0,0,0,1
1,18,11.26,1,4314,0,1,1,0,0,0,1,0
2,28,11.0,3,11124,0,1,1,0,0,0,1,0
3,33,7.57,0,54961,0,1,1,0,0,1,0,0
4,32,9.63,0,9667,0,1,1,0,0,1,0,0
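The encoding and the integer conversion above can also be done in a single call. The sketch below, on a small made-up frame, uses the `dtype` and `drop_first` parameters of `pd.get_dummies`. Using `drop_first=True` is an assumption about what is wanted for a regression with a constant; it is not what the notebook itself does.

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns as the notebook
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# dtype=int yields 0/1 columns directly; drop_first=True drops the first
# level of each category so the dummies are not collinear with a constant.
toy_encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], dtype=int, drop_first=True
)
print(toy_encoded.columns.tolist())
```

With one level dropped per category, each remaining dummy coefficient reads as a difference from the dropped (baseline) level.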


OLS (Ordinary Least Squares) Regression Model

In [14]:

X = sm.add_constant(df_encoded[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'region_northeast',
       'region_northwest', 'region_southeast', 'region_southwest']])
y = df_encoded['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.751
Model:                                 OLS   Adj. R-squared:                  0.749
Method:                      Least Squares   F-statistic:                     500.9
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):               0.00
Time:                             04:17:12   Log-Likelihood:                -14774.
No. Observations:                     1338   AIC:                         2.957e+04
Df Residuals:                         1329   BIC:                         2.961e+04
Df Model:                                8                                         
Covariance Type:                 nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------

1. Model Summary

R-squared: 0.751 means that 75.1% of the variance in hospitalization charges is explained by the model.
F-statistic: 500.9 with a p-value reported as 0.00, so the predictors are jointly significant.
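To make the R-squared figure concrete, here is a minimal sketch of its definition, computed with NumPy on synthetic data. The data and variable names are made up for illustration and are not the hospital dataset. R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one predictor plus noise (illustrative only)
x = rng.normal(size=200)
y_demo = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)

# Design matrix with an intercept column, fitted by least squares
X_demo = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

residuals = y_demo - X_demo @ beta
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total variation
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```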

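2. Coefficients

A coefficient is the estimated change in hospitalization charges for a one-unit change in that variable, holding the others fixed. Its absolute size depends on the variable's units and is not a measure of statistical significance, which is judged from the p-value instead.
Smoking status has by far the largest coefficients. smoker_no has a coef of -30180 and smoker_yes a coef of 29440. Because the model includes a constant and both smoker dummies, the two are best read together: a smoker's predicted charges are about 29440 - (-30180) = 59620 dollars higher than those of an otherwise identical non-smoker.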
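One caveat when reading these coefficients: the design matrix contains a constant together with both smoker_no and smoker_yes (and likewise both sex dummies and all four region dummies), so its columns are linearly dependent. The summary is consistent with this, reporting Df Model: 8 for 11 predictors. The sketch below uses synthetic data with hypothetical names. It shows that in such a design the individual dummy coefficients are not unique, while the difference between them is.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
smoker_yes = rng.integers(0, 2, size=n)
smoker_no = 1 - smoker_yes
# Synthetic charges: smokers cost 50 more on average (illustrative numbers)
charges = 10.0 + 50.0 * smoker_yes + rng.normal(size=n)

# Constant + both dummies: const == smoker_no + smoker_yes, so rank is 2, not 3
X_full = np.column_stack([np.ones(n), smoker_no, smoker_yes])
rank = np.linalg.matrix_rank(X_full)

# Minimum-norm least-squares solution (what a pseudoinverse-based fit returns)
b_min_norm = np.linalg.pinv(X_full) @ charges

# Another solution with identical fitted values: shift along the
# null-space direction (1, -1, -1)
b_shifted = b_min_norm + 5.0 * np.array([1.0, -1.0, -1.0])

gap_min_norm = b_min_norm[2] - b_min_norm[1]
gap_shifted = b_shifted[2] - b_shifted[1]
print(rank, round(gap_min_norm, 2), round(gap_shifted, 2))
```

Both coefficient vectors give the same predictions, but only the smoker versus non-smoker gap is the same in each, which is why the gap is the quantity worth interpreting.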

3. Statistical Significance

Both sex variables have p-values above the 0.05 significance level (sex_female 0.758, sex_male 0.436), so sex is not a statistically significant predictor of the target variable.
The region variables are also not significant: every region dummy has a p-value above 0.05.
Both smoker variables have p-values that round to 0, well below 0.05, so smoking status is the most significant variable in predicting hospitalization charges.
viral load and severity level are also statistically significant predictors of the target variable; both have p-values below 0.05.

Divide by Region

Since the first business requirement asks which variables are significant in predicting the reason for hospitalization in different regions, we split df_encoded into four regional subsets and fit the regression model to each. The result above already shows that region is not a significant variable for predicting hospitalization charges, but we check each region separately to be sure.

In [15]:
# Keep only the northeast rows (the dummy columns are 0/1 integers by now)
df_northeast = df_encoded[df_encoded['region_northeast'] == 1]
df_northeast

,age,viral load,severity level,hospitalization charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
8,37,9.94,2,16016,0,1,1,0,1,0,0,0
10,25,8.74,0,6803,0,1,1,0,1,0,0,0
16,52,10.26,1,26993,1,0,1,0,1,0,0,0
17,23,7.95,0,5988,0,1,1,0,1,0,0,0
20,60,12.00,0,33072,1,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1321,62,8.90,0,70253,0,1,0,1,1,0,0,0
1325,61,11.18,0,32858,0,1,1,0,1,0,0,0
1326,42,10.96,0,17625,1,0,1,0,1,0,0,0
1328,23,8.08,2,55989,1,0,1,0,1,0,0,0


In [16]:
X = sm.add_constant(df_northeast[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']])
y = df_northeast['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.709
Model:                                 OLS   Adj. R-squared:                  0.705
Method:                      Least Squares   F-statistic:                     155.2
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):           4.44e-83
Time:                             04:18:57   Log-Likelihood:                -3578.4
No. Observations:                      324   AIC:                             7169.
Df Residuals:                          318   BIC:                             7192.
Df Model:                                5                                         
Covariance Type:                 nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

In [17]:
df_northwest = df_encoded[df_encoded['region_northwest'] == 1]

In [18]:
X = sm.add_constant(df_northwest[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']])
y = df_northwest['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.704
Model:                                 OLS   Adj. R-squared:                  0.700
Method:                      Least Squares   F-statistic:                     152.0
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):           3.45e-82
Time:                             04:19:35   Log-Likelihood:                -3586.9
No. Observations:                      325   AIC:                             7186.
Df Residuals:                          319   BIC:                             7208.
Df Model:                                5                                         
Covariance Type:                 nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

In [19]:
df_southeast = df_encoded[df_encoded['region_southeast'] == 1]

In [20]:
X = sm.add_constant(df_southeast[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']])
y = df_southeast['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.799
Model:                                 OLS   Adj. R-squared:                  0.796
Method:                      Least Squares   F-statistic:                     284.8
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):          2.17e-122
Time:                             04:19:57   Log-Likelihood:                -4031.7
No. Observations:                      364   AIC:                             8075.
Df Residuals:                          358   BIC:                             8099.
Df Model:                                5                                         
Covariance Type:                 nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

In [21]:
df_southwest = df_encoded[df_encoded['region_southwest'] == 1]

In [22]:
X = sm.add_constant(df_southwest[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']])
y = df_southwest['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.786
Model:                                 OLS   Adj. R-squared:                  0.783
Method:                      Least Squares   F-statistic:                     234.3
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):          1.75e-104
Time:                             04:20:16   Log-Likelihood:                -3548.3
No. Observations:                      325   AIC:                             7109.
Df Residuals:                          319   BIC:                             7131.
Df Model:                                5                                         
Covariance Type:                 nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

Conclusion

In [23]:
X = sm.add_constant(df_encoded[['age', 'viral load', 'severity level',
       'sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'region_northeast',
       'region_northwest', 'region_southeast', 'region_southwest']])
y = df_encoded['hospitalization charges']

model = sm.OLS(y, X).fit()
print(model.summary())

                               OLS Regression Results                              
Dep. Variable:     hospitalization charges   R-squared:                       0.751
Model:                                 OLS   Adj. R-squared:                  0.749
Method:                      Least Squares   F-statistic:                     500.9
Date:                     Thu, 05 Dec 2024   Prob (F-statistic):               0.00
Time:                             04:20:45   Log-Likelihood:                -14774.
No. Observations:                     1338   AIC:                         2.957e+04
Df Residuals:                         1329   BIC:                         2.961e+04
Df Model:                                8                                         
Covariance Type:                 nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------

We can also answer the second business requirement: how well do variables such as viral load, smoking, and severity level describe the hospitalization charges?

viral load, smoker and severity level all have p-values below the 0.05 significance level, meaning they are statistically significant predictors of the target variable.

smoker variable

Smoking status has the largest estimated effect on hospitalization charges. smoker_no has a coef of -30180 and smoker_yes a coef of 29440, so a smoker's predicted charges are about 59620 dollars higher than those of an otherwise identical non-smoker.
The [0.025 and 0.975] columns give the 95% confidence interval for each coef. For smoker_no, the interval corresponds to a reduction in the hospitalization charge of between 28800 and 31600 dollars.
For smoker_yes, the interval corresponds to an increase in the hospitalization charge of between 27900 and 31000 dollars.

viral load:

viral load has the next-largest effect on hospitalization charges after smoking status. It has a coef of 2544, meaning each one-unit increase in viral load is associated with an increase of about 2544 dollars in charges, holding the other variables fixed.
The 95% confidence interval for this increase is 2123 to 2965 dollars.

severity level

severity level comes next after viral load. With a coef of 1188, each one-unit increase in severity level is associated with an increase of about 1188 dollars in charges, holding the other variables fixed.
The 95% confidence interval for this increase is 513 to 1864 dollars.
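
As a quick consistency check on the figures above: an OLS 95% confidence interval is symmetric around the coefficient, so the midpoint of each interval should reproduce the coef, and half the interval width is the margin of error. The sketch below uses only numbers quoted in the text; the dictionary layout is just for illustration.

```python
# (coef, lower bound, upper bound) as quoted in the text above
reported = {
    "viral load": (2544, 2123, 2965),
    "severity level": (1188, 513, 1864),
}

checks = {}
for name, (coef, lower, upper) in reported.items():
    midpoint = (lower + upper) / 2   # should be close to the coefficient
    margin = (upper - lower) / 2     # half-width of the 95% interval
    checks[name] = (midpoint, margin)
    print(f"{name}: coef={coef}, midpoint={midpoint}, margin=+/-{margin}")
```

The midpoints agree with the reported coefficients up to rounding, which suggests the quoted intervals and coefficients come from the same fit.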